EEGFuseNet: Hybrid Unsupervised Deep Feature Characterization and Fusion for High-Dimensional EEG with An Application to Emotion Recognition

Abstract

How to effectively and efficiently extract valid and reliable features from high-dimensional electroencephalography (EEG), and particularly how to fuse spatial and temporal dynamic brain information into a better feature representation, is a critical issue in brain data analysis. Most current EEG studies rely on handcrafted features with supervised modeling, which is limited to a great extent by experience and human feedback. In this paper, we propose a practical hybrid unsupervised deep CNN-RNN-GAN based EEG feature characterization and fusion model, termed EEGFuseNet. EEGFuseNet is trained in an unsupervised manner, and deep EEG features covering spatial and temporal dynamics are automatically characterized. Compared with handcrafted features, the deep EEG features can be considered more generic and independent of any specific EEG task. The performance of the deep, low-dimensional features extracted by EEGFuseNet is carefully evaluated in an unsupervised emotion recognition application based on a well-known public emotion database. The results demonstrate that the proposed EEGFuseNet is a robust and reliable model that is easy to train and manage and performs efficiently in the representation and fusion of dynamic EEG features. In particular, EEGFuseNet is established as an optimal unsupervised fusion model with promising subject-based leave-one-out results in the recognition of four emotion dimensions (valence, arousal, dominance and liking), which demonstrates the possibility of realizing EEG-based cross-subject emotion recognition in a purely unsupervised manner.


Zhen Liang a,b,c, Rushuang Zhou a,b, Li Zhang a,b, Linling Li a,b, Gan Huang a,b, Zhiguo Zhang a,b,d,e,* and Shin Ishii c,f

a School of Biomedical Engineering, Health Science Center, Shenzhen University, Shenzhen, Guangdong 518060, China
b Guangdong Provincial Key Laboratory of Biomedical Measurements and Ultrasound Imaging, Shenzhen, Guangdong 518060, China
c Graduate School of Informatics, Kyoto University, Kyoto 606-8501, Japan
d Marshall Laboratory of Biomedical Engineering, Shenzhen, Guangdong 518060, China
e Peng Cheng Laboratory, Shenzhen, Guangdong 518055, China
f ATR Neural Information Analysis Laboratories, Kyoto 619-0288, Japan

Keywords: Electroencephalography; information fusion; hybrid deep encoder-decoder network; CNN-RNN-GAN; unsupervised; emotion recognition.

Introduction

Electroencephalography (EEG) is a vital measurement of brain activity that reflects neuronal dynamics originating from the central nervous system and responds rapidly to different brain states [Alarcao and Fonseca, 2017; Cao, 2020]. Recently, EEG-based emotion recognition has become an increasingly important topic for understanding, regulating, and managing human emotion [Hu et al., 2020]. From a neurophysiological perspective, EEG offers a more direct, sensitive, comprehensive, reliable, and objective modality for representing real and dynamic human emotional states than other spontaneous physiological signals such as facial expressions, speech and body gestures [Zhang et al., 2020]. Features extracted from EEG at specific brain locations have been shown in various studies to be connected to human emotion. Generally, EEG features are extracted from the time domain [Hosseini and Naghibi-Sistani, 2011; Liu et al., 2011; Jie et al., 2014; Oktavia et al., 2019], the frequency domain [Li and Lu, 2009; Jenke et al., 2014; Nawaz et al., 2018; Zheng et al., 2018], the wavelet domain [Puthankattil and Joseph, 2014; Wang et al., 2014; Shahnaz et al., 2016; Li et al., 2018], or a fusion of multiple features [Lan et al., 2016; Zhang et al., 2016; García-Martínez et al., 2019; He et al., 2019; Liang et al., 2019]. To improve EEG feature representation, commonly used feature selection methods, such as maximum relevance minimum redundancy (mRMR) [Atkinson and Campos, 2016; Liu et al., 2016] and transfer recursive feature elimination (T-RFE) [Yin et al., 2017b], have been adopted to select the most relevant features or to map the features to a new, efficient feature space of lower dimensionality. A classifier is then built to model the relationship between the extracted features and the emotion labels, yielding a system that decodes emotion from EEG signals.
Because EEG amplitudes lie in the microvolt range, the collected EEG data are easily contaminated by noise (e.g., physiological artifacts including ocular activity, muscle activity, respiration and cardiac activity, and non-physiological artifacts including body movement, AC electrical and electromagnetic interference). Despite the large number of studies on EEG-based emotion decoding, how to effectively and efficiently extract valid and useful EEG features from the collected data remains a challenging, unsolved question. For example, how to efficiently fuse the EEG signals collected at different brain locations and at different time points remains unclear. In general, there are three types of information fusion in the processing of EEG signals. (1) Fusion of spatial information: for the given time point(s), fuse the relationship dynamics (e.g. correlation) of the EEG signals at different brain regions. Commonly used methods at this fusion level include connectivity [Sakkalis, 2011; Haufe et al., 2013], microstate [Khanna et al., 2015; Milz et al., 2016], and other topographic analyses [Castelnovo et al., 2016; Ma et al., 2017], which interpret the communication behavior of cortical regions by assessing the interaction functions between cortical areas and measuring the direction and strength of the interactions. (2) Fusion of temporal information: for the given brain region(s), truncate continuous time-series EEG signals into short data segments and fuse the relationship dynamics (e.g. changing trend) of the EEG signals at different time points.
Well-known techniques for non-stationary time-series signal analysis include the short-time Fourier transform (STFT) [Sivasankari and Thanushkodi, 2014; Ramos-Aguilar et al., 2020] and wavelet analysis [Tzimourta et al., 2017; Islam and Ahmad, 2019], which compute the time-frequency dynamics of the data by performing successive calculations and measuring the data interaction over time. (3) Fusion of both spatial and temporal information: not only assess the interaction behavior of cortical regions but also estimate dynamic cortical involvement in a serial reaction time. Fusion of type (3) remains a substantial challenge for current EEG feature extraction methods in terms of validity and reliability, and it needs to be tackled urgently. Although it is possible to estimate spatial and temporal features separately and then concatenate them, such an approach does not effectively extract and fuse useful but latent information from a joint temporal-spatial domain. Deep learning with neural networks provides a good solution for characterizing and fusing deep semantic features from the input data and has achieved tremendous success in various computer vision and image understanding applications [Tang et al., 2017; Ma et al., 2019; Zhao et al., 2019]. For example, Tang et al. [2017] employed a deep convolutional neural network (CNN) to fuse multi-stage features into a more semantic and structured representation and demonstrated the benefit of these features in solving the scene recognition problem. The CNN, as one of the most popular deep neural networks (DNNs), has a strong representation learning ability. Ma et al. [2019] realized a fusion model that combines audio and visual information in videos by incorporating a CNN and a deep belief network (DBN), and demonstrated that the model performs excellently in cross-modal feature fusion, denoising and redundancy removal.
To further improve spatial feature representation, various DNNs have been developed to leverage the relationships between spatial appearance clues for better scene understanding [Finn et al., 2015; He et al., 2018; Uddin et al., 2019]. He et al. [2018] proposed a deep spatial feature reconstruction (DSR) network on the basis of a fully convolutional network (FCN) and fused the multi-scale spatial features in a feasible way. Its performance has been well demonstrated on an incomplete person image re-identification problem. To improve the processing of sequential multimedia data rather than static images, CNNs have been combined with recurrent neural networks (RNNs) and long short-term memory (LSTM) [Ullah et al., 2017; Nguyen et al., 2018; Li et al., 2019]. The sequential information is learnt by analyzing how the CNN spatial features change over time, and a fusion of temporal and spatial information is thereby realized. Take video analysis as an example. The spatial features of each individual frame are first characterized by a CNN, and the sequential dependencies among the frame features are then measured by an LSTM [Ullah et al., 2017]. Recently, researchers have extended DNNs to characterize valid deep EEG features and solve EEG-based decoding problems [Jirayucharoensak et al., 2014; Zheng and Lu, 2015; Schirrmeister et al., 2017; Xu et al., 2019; Cimtay and Ekmekcioglu, 2020; Raghu et al., 2020]. For example, Jirayucharoensak et al. [2014] introduced a deep learning network with a stack of three autoencoders and two softmax classifiers to perform EEG-based emotion classification and improved the three-level emotion classification (arousal: 46.03%; valence: 49.52%) compared to support vector machine (SVM) and naïve Bayes classifiers.
Zheng and Lu [2015] constructed deep belief networks (DBNs) to investigate the critical frequency bands and channels in EEG signals and selected the optimal ones by considering the weight distribution learned by the trained DBNs. Song et al. [2018] presented a novel dynamic graph convolutional neural network (DGCNN) to solve a multichannel EEG emotion recognition problem, where the discriminant EEG features as well as their intrinsic relationships were learnt. This model is a non-linear DNN and is powerful for processing EEG signals, which are highly non-linear in nature. Cimtay and Ekmekcioglu [2020] adopted a pretrained state-of-the-art CNN, InceptionResnetV2, to extract useful and hidden features from the raw EEG signals. These studies have well demonstrated the effectiveness of deep feature extraction and representation. However, all the above-mentioned studies were based on supervised learning, which requires a large number of training samples with emotion labels, especially for deep networks. With a small training set, a deep network would fail to generalize well because of overfitting. The performance of a deep network is significantly affected not only by the network design but also by the sample size. Unlike multimedia sources, which can be easily obtained from social media platforms like YouTube, it is unrealistic in real-world application scenarios to collect a huge number of EEG signals from different participants and manually annotate each sample with emotional labels. There is also a risk of introducing “label noise” during the sample annotation process [Luo et al., 2017]. Unsupervised learning would provide a more natural approach to decoding EEG signals and is more aligned with the human learning mechanism, which acquires useful information from the available samples without any associated teacher [Barlow, 1989].
How to appropriately characterize EEG signals is one of the most important parts of an unsupervised EEG decoding model, as the nature of unsupervised learning is to explore, in terms of the characterized features, the hidden complex relationships among EEG data underlying different mental states such as various emotion classes. An improper feature representation would lead to a wrong estimation of the relationship structure among the samples. Recently, Liang et al. [2019] introduced a novel hypergraph-based unsupervised EEG decoding model for human emotion recognition, where traditional and shallow EEG features, such as statistical features, Hjorth features, frequency band powers, energy and entropy properties, were adopted. Unfortunately, traditional and shallow features mostly rely on heuristics, prior knowledge and experience, which limits the modelling performance. Traditional features may also fail to efficiently elicit the complicated and non-linear patterns in the raw EEG data. There is now a need for a valid and reliable deep feature extraction method for time-series high-dimensional EEG signals that fuses spatial and temporal cortical dynamics, but how to meet this need in an unsupervised manner remains unclear. Recent progress in unsupervised encoder-decoder networks has brought great success in feature characterization and representation for images [Tao et al., 2015], videos [Kiran et al., 2018], and audio [Deng et al., 2014]. Encoder-decoder structures perform excellently in highly non-linear feature extraction. Wen and Zhang [2018] proposed a deep autoencoder based deep convolution network (AE-CDNN) to learn low-dimensional features from high-dimensional EEG data in an unsupervised manner and adopted several commonly used supervised classifiers to demonstrate that an improvement in detection accuracy could be achieved.
Similarly, considering the non-stationary and chaotic behavior of high-dimensional EEG signals, Shoeibi et al. [2021] developed a convolutional autoencoder (CNN-AE) for EEG feature learning and showed accurate and reliable performance in a computer-aided diagnosis system. Instead of directly using raw EEG signals, Tabar and Halici [2016] converted high-dimensional EEG data to two-dimensional images by STFT and fed them into a stacked autoencoder network to solve a classification problem. In this study, considering the nature of non-stationary time-series EEG signals (sequential high-dimensional data), we propose a hybrid deep unsupervised fusion network (termed EEGFuseNet below) to characterize efficient sequential features and then fuse them into an informative feature representation vector. Specifically, based on the features extracted by a CNN from raw EEG signals, an RNN is adopted to enhance the feature representation by exploring the potential feature relationships at temporal adjacencies. A generative adversarial network (GAN) is further incorporated to improve the training of the CNN-RNN network through dynamic updates in an unsupervised manner, which is potentially beneficial to high-quality feature generation. The extracted deep features incorporate both spatial and temporal dynamic characteristics and can represent the spatial relationships among the channels and the dependencies of the signals collected at adjacent time points. EEGFuseNet can be considered a generalized and efficient hybrid deep unsupervised feature characterization and fusion approach for high-dimensional EEG signals, where no task-specific knowledge or task labels are required. The requirement on the training data size is also quite flexible.
Current studies have mostly focused on classification in a single, specific brain-computer interface (BCI) task, using task-specific prior knowledge in the design of the network architecture and training together with the task labels. Because of the difficulty of collecting a large number of EEG signals under similar experimental designs, it is still difficult to evaluate how well the previously proposed deep neural network architectures for EEG analysis generalize to other tasks with varied training data sizes and labels. The proposed EEGFuseNet would be more applicable to various EEG-based classification and interpretation applications. Moreover, an EEGFuseNet-based purely unsupervised framework is proposed to solve classification problems with a hypergraph construction and partitioning algorithm. We evaluate the classification performance in an emotion recognition application on a well-known public database and compare it with other state-of-the-art methods. The results show that the proposed EEGFuseNet performs efficiently in EEG feature characterization and fusion. The generalizability of the proposed unsupervised framework is established, and individual differences are well handled in the cross-individual task. In addition, the efficacy of the proposed method is demonstrated on various training data sizes and in a comparison with classic EEG features. Despite the existing deep unsupervised network based EEG feature extraction and fusion methods, to our knowledge no study has conducted a solid and thorough exploration of hybrid deep configurations for converting high-dimensional EEG signals to a low-dimensional, valid and reliable feature representation in a fused and unsupervised manner. The proposed EEGFuseNet, together with hypergraph construction and partitioning, would be beneficial to brain decoding applications and offers a purely unsupervised framework for EEG feature extraction and fusion for other use cases.
The remainder of this paper is structured as follows. Section 2 describes the proposed hybrid deep unsupervised encoder-decoder network, EEGFuseNet. Section 3 presents the feature extraction performance in an application of EEG-based emotion recognition, together with a full comparison with the existing literature and a full discussion. Section 4 summarizes our findings and concludes the paper.

Methodology

In this section, we introduce the proposed hybrid deep model, EEGFuseNet, with the corresponding design and configuration and explain how to efficiently characterize non-stationary high-dimensional EEG signals in an unsupervised manner.

Preprocessing is first conducted on the collected raw EEG signals to remove noise such as physiological artifacts (e.g. ocular activity and muscle activity) and non-physiological artifacts (e.g. AC electrical and electromagnetic interference). A full explanation of the preprocessing steps is provided in Appendix A of the Supplementary Materials. After preprocessing, the EEG data of each trial are further partitioned into a number of fixed-length segments. A segment of EEG data is denoted as $X$, representing the signals collected from the electrode channels ($C$) over a period of time points ($T$). Next, the segment data $X \in \mathbb{R}^{C \times T}$ is treated as the input to the proposed hybrid EEGFuseNet, and the corresponding deep features are characterized and fused based on unsupervised learning.
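The partitioning of a trial into fixed-length segments can be sketched in a few lines. This is only an illustrative sketch: the function name `partition_trial`, the choice of non-overlapping segments, and the dropping of trailing samples that do not fill a whole segment are assumptions made here, not details given in the text.

```python
from typing import List

Segment = List[List[float]]  # C rows (channels), each holding T samples


def partition_trial(trial: List[List[float]], segment_len: int) -> List[Segment]:
    """Split a C x N trial into non-overlapping C x T segments (T = segment_len).

    Assumption of this sketch: trailing samples that do not fill a whole
    segment are dropped.
    """
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    n_samples = len(trial[0])
    if any(len(channel) != n_samples for channel in trial):
        raise ValueError("all channels must have the same number of samples")
    n_segments = n_samples // segment_len
    return [
        [channel[i * segment_len:(i + 1) * segment_len] for channel in trial]
        for i in range(n_segments)
    ]


# Toy trial: 2 channels with 10 samples each, cut into segments of length 4.
toy_trial = [[float(t) for t in range(10)], [float(-t) for t in range(10)]]
segments = partition_trial(toy_trial, segment_len=4)
```

Each element of `segments` plays the role of one input $X \in \mathbb{R}^{C \times T}$; in the toy example the last two samples of every channel are discarded.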

In the following, we first explain the basic ideas of a deep encoder-decoder architecture. We then introduce the proposed EEGFuseNet step by step. An autoencoder is a general framework for deep networks with an encoder-decoder architecture. It mainly includes three parts: the encoder, the hidden vector and the decoder. In this framework, the encoder is responsible for converting the input into a one-dimensional vector (the hidden vector), and the decoder transforms the hidden vector into the output [Masci et al., 2011]. The autoencoder structure is jointly trained by maximizing the similarity between the given input ($X$) and the generated output ($Y$). A widely used basic loss function is defined as

$$\mathcal{L}(X, Y) = \lVert X - Y \rVert. \tag{1}$$

Existing studies show that the learnt hidden vector can be considered a strong latent representation of the input, which is useful for accurate and efficient data description [Rifai et al., 2011; Gehring et al., 2013]. Most noteworthy, the autoencoder architecture is a self-learning paradigm, which does not require any labeling information during training and is significantly easier to train than other recent feature extraction architectures [Jiao et al., 2018; Chen et al., 2019]. Thus, it is well suited to the problem of small EEG datasets with missing labels. An illustration of a deep autoencoder framework for feature extraction with an unsupervised classification application is presented in Fig. 1, where the learnt hidden vector (code) is treated as the features and input to the classification model for categorizing the data into two groups. As this framework is a purely unsupervised method that requires no labeled data, it is easy to use and extend to various applications. To explore an efficient unsupervised feature characterization and fusion method, we investigate a hybrid deep encoder-decoder network with specific architecture designs, termed EEGFuseNet.
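The reconstruction objective of Eq. (1) can be made concrete with a small numerical sketch. Two points are assumptions of this example rather than statements from the text: the norm is taken to be the Frobenius norm, and the helper names `frobenius_loss` and `mse_loss` are hypothetical. The mean squared error is included because it is the form the paper later uses for training.

```python
import math
from typing import List

Matrix = List[List[float]]  # C x T matrix stored as a list of rows


def frobenius_loss(x: Matrix, y: Matrix) -> float:
    """||X - Y|| read as the Frobenius norm (assumption: the norm is not named in Eq. (1))."""
    return math.sqrt(
        sum((a - b) ** 2 for row_x, row_y in zip(x, y) for a, b in zip(row_x, row_y))
    )


def mse_loss(x: Matrix, y: Matrix) -> float:
    """Mean squared error over all C*T entries."""
    n_entries = sum(len(row) for row in x)
    squared = sum((a - b) ** 2 for row_x, row_y in zip(x, y) for a, b in zip(row_x, row_y))
    return squared / n_entries


# Toy 2 x 2 input and an imperfect reconstruction that differs in one entry.
x = [[1.0, 2.0], [3.0, 4.0]]
y = [[1.0, 2.0], [3.0, 2.0]]
norm_error = frobenius_loss(x, y)   # sqrt(4) = 2.0
mean_error = mse_loss(x, y)         # 4 / 4 = 1.0
```

Both quantities are zero only when the reconstruction matches the input exactly, which is why no labels are needed to train the encoder and decoder.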
The proposed EEGFuseNet mitigates the limitations of the existing state-of-the-art feature extraction and fusion methods and provides a number of practical benefits, for example, easy modification and simple training, for EEG signals collected under different environment variables in various applications. Next, the proposed hybrid deep encoder-decoder network architecture is illustrated in detail. More precisely, we will introduce (1) how to construct the basic architecture of the proposed EEGFuseNet from the classical CNN, (2) how to incorporate GAN into the CNN-based EEGFuseNet to generate high-quality features, (3) how to incorporate RNN into the CNN-based EEGFuseNet to better fuse both temporal and spatial information, and (4) how to combine CNN, RNN, and GAN together to develop the final architecture of EEGFuseNet.

Figure 1. An illustration of an unsupervised learning based deep encoder-decoder framework for feature characterization and fusion in a classification application.

It is well known that the CNN architecture is powerful for feature extraction without any task-specific design or training [Masci et al., 2011] and is capable of effectively extracting spatial information and estimating the network weights. CNN-based deep networks have been widely and successfully used in various applications [Szegedy et al., 2015; Liu et al., 2017; Ufer and Ommer, 2017; Amin-Naji et al., 2019]. A typical deep encoder-decoder network works together with CNN, where the encoder consists of convolution layers for extracting useful information and the decoder consists of deconvolutional layers for upscaling the encoder feature maps [Yasrab et al., 2018; Ye and Sung, 2019].

For time-series EEG signals, both spatial and temporal information are important, as they represent the relationships of the brain activities at different brain locations and the changing dynamics of brain patterns over time. Inspired by the EEGNet structure [Lawhern et al., 2018], a CNN-based deep encoder-decoder network is developed as shown in Fig. 2. A sequence of two-dimensional convolutional layers is implemented to generate feature maps covering EEG spatial information at different frequency bands, where the filter length is half of the sampling rate of the input data. The exponential linear unit (ELU) activation function is added to the convolutional and deconvolutional layers to improve model fitting, without extra computational cost or overfitting risk. Notably, as the input EEG signals consist of channels and time points ($X \in \mathbb{R}^{C \times T}$), two-dimensional convolution functions are adopted here instead of one-dimensional ones. In the architecture, the encoder performs convolution and down-sampling, while the decoder performs deconvolution and up-sampling to reconstruct the input EEG signals. The main possible benefits of the multiple convolution layers are: (1) compact, comprehensive and complete EEG pattern characterization from different dimensions; (2) exploration of the relationships within and between the extracted feature maps and feature fusion in an optimal way; (3) fewer parameters to fit owing to the subsampling layers. Thus, the convolution layers are designed to provide an efficient way to learn spatial-temporal dynamics from time-series EEG signals collected at different brain locations and to integrate the sample points into a compact and deep feature representation vector (the EEG code).

Figure 2. An illustration of the CNN based encoder-decoder network.

[Figure 2 labels: Input and Output (Channels × Time Points); Encoder, EEG Code, Decoder; Feature Map; layer legend: Conv, BN, ELU, Pooling, Dropout, Full Connection (FC), UpPooling, De-Conv.]

Table 1. The network configurations of the CNN based encoder-decoder network.

Encoder

| Name | Parameters | Input Size | Output Size |
|---|---|---|---|
| Convolution (Conv) | channel=16, kernel size (1,193), padding (0,96) | 128×1×32×384 | 128×16×32×384 |
| Batch Norm (BN) | channel=16 | 128×16×32×384 | 128×32×32×384 |
| Convolution (Conv) | channel=32, kernel size (32,1), padding (0,0) | 128×32×32×384 | 128×32×1×384 |
| Batch Norm (BN) | channel=32 | 128×32×1×384 | 128×32×1×384 |
| ELU | Activation Function | 128×32×1×384 | 128×32×1×384 |
| Pooling | pooling (1,4) | 128×32×1×384 | 128×32×1×96 |
| Dropout | dropout rate=0.25 | 128×32×1×96 | 128×32×1×96 |
| Depthwise separable convolution (Conv) | channel=32, kernel size (1,49), padding (0,24) | 128×32×1×96 | 128×32×1×96 |
| Pointwise Convolution (Conv) | channel=16, kernel size (1,1), padding (0,0) | 128×32×1×96 | 128×16×1×96 |
| Batch Norm (BN) | channel=16 | 128×16×1×96 | 128×16×1×96 |
| ELU | Activation Function | 128×16×1×96 | 128×16×1×96 |
| Pooling | pooling (1,8) | 128×16×1×96 | 128×16×1×12 |
| Data Reshape | | 128×16×1×12 | 128×192 |
| Full Connection (FC) | input=192, output=64 | 128×192 | 128×64 |

Decoder

| Name | Parameters | Input Size | Output Size |
|---|---|---|---|
| Full Connection (FC) | input=64, output=192 | 128×64 | 128×192 |
| Data Reshape | | 128×192 | 128×16×1×12 |
| UnPooling | pooling (1,8) | 128×16×1×12 | 128×16×1×96 |
| Pointwise Deconvolution (De-Conv) | channel=32, kernel size (1,1), padding (0,0) | 128×16×1×96 | 128×32×1×96 |
| Depthwise separable deconvolution (De-Conv) | channel=32, kernel size (1,49), padding (0,24) | 128×32×1×96 | 128×32×1×96 |
| Batch Norm (BN) | channel=32 | 128×32×1×96 | 128×32×1×96 |
| Dropout | dropout rate=0.25 | 128×32×1×96 | 128×32×1×96 |
| ELU | Activation Function | 128×32×1×96 | 128×32×1×96 |
| UnPooling | unpooling (1,4) | 128×32×1×96 | 128×32×1×384 |
| Deconvolution (De-Conv) | channel=16, kernel size (32,1), padding (0,0) | 128×32×1×384 | 128×32×32×384 |
| Batch Norm (BN) | channel=16 | 128×32×32×384 | 128×16×32×384 |
| Deconvolution (De-Conv) | channel=1, kernel size (1,193), padding (0,96) | 128×16×32×384 | 128×1×32×384 |
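The spatial sizes along the encoder path in Table 1 can be checked with the standard convolution output-size formula. The sketch below makes several assumptions that the table does not state: all convolutions use stride 1, pooling windows do not overlap, and the leading dimension of 128 is a batch size and is therefore omitted. The function names and the layer labels in the trace are hypothetical, and the channel counts are simply copied from the `channel=` parameters.

```python
from typing import List, Tuple


def conv_out(size: int, kernel: int, padding: int = 0, stride: int = 1) -> int:
    """Output length of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size: int, window: int) -> int:
    """Output length of non-overlapping pooling along one axis."""
    return size // window


def trace_encoder(n_channels: int = 32, n_times: int = 384) -> List[Tuple[str, int, int, int]]:
    """Return (layer label, feature maps, height, width) after each size-changing layer."""
    h, w = n_channels, n_times
    trace = [("input", 1, h, w)]
    w = conv_out(w, kernel=193, padding=96)   # kernel (1,193), padding (0,96)
    trace.append(("conv (1,193)", 16, h, w))
    h = conv_out(h, kernel=32, padding=0)     # kernel (32,1) spans all 32 electrodes
    trace.append(("conv (32,1)", 32, h, w))
    w = pool_out(w, 4)
    trace.append(("pooling (1,4)", 32, h, w))
    w = conv_out(w, kernel=49, padding=24)    # kernel (1,49), padding (0,24)
    trace.append(("separable conv (1,49)", 32, h, w))
    trace.append(("pointwise conv (1,1)", 16, h, w))  # a 1x1 kernel keeps h and w
    w = pool_out(w, 8)
    trace.append(("pooling (1,8)", 16, h, w))
    return trace


trace = trace_encoder()
_, maps, height, width = trace[-1]
flattened = maps * height * width  # length of the vector fed to the FC layer
```

Under these assumptions the paddings of 96 and 24 keep the time axis at 384 and 96 samples respectively, the two pooling layers reduce it to 12, and the flattened vector has 16 × 1 × 12 = 192 entries, matching the input size of the fully connected layer that produces the 64-dimensional code.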

This network offers a baseline for unsupervised deep feature characterization and fusion. The specific architecture details are presented in Table 1. The encoder network consists of 4 convolution layers. The weights are initialized randomly for training. In an encoder-decoder architecture, each encoder layer has a corresponding decoder layer; thus, there are also 4 deconvolution layers in the decoder. The final decoder output reconstructs the input EEG signals with minimized difference. In the model training process, we use the mean squared error (MSE) as the objective function to measure the difference between the input EEG signals $X \in \mathbb{R}^{C \times T}$ and the EEG signals $Y \in \mathbb{R}^{C \times T}$ reconstructed from the EEG code estimated by the network, given as

$$\mathrm{loss} = \lVert X - Y \rVert. \tag{2}$$

A perfect model would have a loss of 0. A traditional encoder-decoder network is easy to train, but the generated features tend to be of low quality [Akbari and Liang, 2018]. Many studies have shown that GANs are capable of generating high-quality features from sequential data [Makhzani et al., 2015; Chen and Konukoglu, 2018; Sahu et al., 2018]. To further learn the complex structures of the non-stationary time-series EEG data, a hybrid encoder-decoder architecture incorporating CNN and GAN is developed. Figure 3 illustrates a general encoder-decoder pipeline with GAN. The network includes a generator (the encoder-decoder network) and a discriminator, where the generator is responsible for reconstructing EEG signals from the extracted EEG code and the discriminator distinguishes whether the input EEG signal is a fake one generated by the generator or a real one collected from the human brain.

Figure 3. A general encoder-decoder pipeline with GAN.

[Figure 3 labels: Input (Channels × Time Points); Encoder; Decoder; Fake and True samples (Channels × Time Points); discriminator decision: Real? Fake?]

Figure 4. An illustration of the discriminator in the hybrid CNN-GAN based encoder-decoder network.

Table 2. The network configurations of the discriminator in the hybrid CNN-GAN based encoder-decoder network.

| Name | Parameters | Input Size | Output Size |
|---|---|---|---|
| Convolution (Conv) | channel=8, kernel size (1,193), padding (0,96) | 128×1×32×384 | 128×8×32×384 |
| Batch Norm (BN) | channel=8 | 128×8×32×384 | 128×16×32×384 |
| Convolution (Conv) | channel=16, kernel size (32,1), padding (0,0) | 128×16×32×384 | 128×16×1×384 |
| Batch Norm (BN) | channel=16 | 128×16×1×384 | 128×16×1×384 |
| ELU | Activation Function | 128×16×1×384 | 128×16×1×384 |
| Pooling | pooling (1,4) | 128×16×1×384 | 128×16×1×96 |
| Dropout | dropout rate=0.25 | 128×16×1×96 | 128×16×1×96 |
| Depthwise separable convolution (Conv) | channel=32, kernel size (1,49), padding (0,24) | 128×16×1×96 | 128×16×1×96 |
| Pointwise Convolution (Conv) | channel=16, kernel size (1,1), padding (0,0) | 128×16×1×96 | 128×8×1×96 |
| Batch Norm (BN) | channel=16 | 128×8×1×96 | 128×8×1×96 |
| ELU | Activation Function | 128×8×1×96 | 128×8×1×96 |
| Pooling | pooling (1,8) | 128×8×1×96 | 128×8×1×12 |
| Data Reshape | | 128×8×1×12 | 128×96 |
| Full Connection (FC) | input=96, output=1 | 128×96 | 128×1 |
| Sigmoid | Activation Function | | |

On the basis of the CNN-based encoder-decoder network presented in Section 2.2.1, we further develop a hybrid CNN-GAN based deep encoder-decoder network. In this hybrid network, the generator is the CNN-based encoder-decoder network shown in Fig. 2. The discriminator is designed as shown in Fig. 4, where the output layer is a fully-connected layer. A sigmoid layer is then used to represent the output as a 1-D feature vector. More specific configurations of the designed discriminator are reported in Table 2. In the training process, the generator $G$ characterizes the latent feature representation (code) of the sequential EEG signals $X$, and the discriminator $D(X, G(o)) \in [0,1]$ measures the probability that the input (a real training sample $X$ or a synthesized fake sample $G(o)$ produced by the generator) is real or fake. Here, $o$ is the EEG code estimated by $G$. The goal of training is to build a good $D$ that is capable of discriminating the real samples from the generated fake samples and, at the same time, to develop a good $G$ that can produce fake samples as similar as possible to the real ones (a two-player minimax game). The objective function is given as

$$\mathcal{L}_{GAN}(G, D) = \mathbb{E}_{X}[\log D(X)] + \mathbb{E}_{X}[\log(1 - D(G(X)))], \tag{3}$$

where the first part, $\log D(X)$, is the discriminator output for the real sample $X$ and the second part, $D(G(X))$, is the discriminator output for the fake sample generated from the estimated latent code $o$. Together with the objective function of $G$,

$$\mathcal{L}(G) = \lVert X - G(X) \rVert, \tag{4}$$

the overall objective function of the hybrid CNN-GAN based EEGFuseNet is given as

$$\mathcal{L} = \arg\min_{G} \max_{D} \left( \mathcal{L}_{GAN}(G, D) + \lambda \mathcal{L}(G) \right). \tag{5}$$

In the training process, the input of the discriminator is a pair of $x_s$ and $G(x_s)$, where $x_s \in X$ ($s = 1, \ldots, n$) is a real sample of EEG signals and $G(x_s)$ is the fake EEG data generated from the input $x_s$. Here, $n$ refers to the total number of samples.

Although CNNs are powerful feature extraction models and are capable of achieving excellent performance on various learning tasks, they fail to consider sequence-to-sequence relations.
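Returning briefly to the CNN-GAN objective, Eqs. (3) to (5) can be evaluated numerically once the discriminator outputs are known. The sketch below is illustrative only: the function names `gan_value` and `combined_objective`, the clipping constant `eps`, and the values chosen for the reconstruction loss and for $\lambda$ are assumptions, since the text does not specify them. Expectations are approximated by sample means.

```python
import math
from typing import Sequence


def gan_value(d_real: Sequence[float], d_fake: Sequence[float], eps: float = 1e-12) -> float:
    """Sample estimate of Eq. (3): mean log D(x) + mean log(1 - D(G(x))).

    d_real holds D(x_s) for real samples, d_fake holds D(G(x_s)) for fakes.
    eps guards against log(0) and is an assumption of this sketch.
    """
    real_term = sum(math.log(max(p, eps)) for p in d_real) / len(d_real)
    fake_term = sum(math.log(max(1.0 - p, eps)) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def combined_objective(
    d_real: Sequence[float], d_fake: Sequence[float], recon_loss: float, lam: float
) -> float:
    """Value inside Eq. (5): L_GAN(G, D) + lambda * L(G)."""
    return gan_value(d_real, d_fake) + lam * recon_loss


# A discriminator that cannot tell real from fake outputs 0.5 everywhere.
undecided = gan_value([0.5, 0.5], [0.5, 0.5])
# A confident discriminator scores real samples high and fake samples low.
confident = gan_value([0.9, 0.9], [0.1, 0.1])
# Hypothetical reconstruction loss of 2.0 and lambda of 0.5.
total = combined_objective([0.9, 0.9], [0.1, 0.1], recon_loss=2.0, lam=0.5)
```

The discriminator is updated to increase this value, while the generator is updated to decrease it; the reconstruction term weighted by $\lambda$ keeps the generated signals close to the inputs.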
In other words, the EEG features extracted by the previous two architectures cannot measure the temporal dependencies between the patterns extracted at adjacent time points. However, EEG signals by nature should possess a hierarchical structure with complex dependencies between the features extracted at different time points, so the feature extracted at each time point should not be treated as an independent and isolated point. In existing work, encoder-decoder models based on RNNs [Sutskever et al., 2014], LSTM [Sutskever et al., 2014] and gated recurrent units (GRUs) [Chung et al., 2014] have demonstrated impressive results on sequential data representations [Serban et al., 2016; Vosoughi et al., 2016; Li et al., 2017]. To enhance the feature representation of time-series EEG signals, we extend the CNN-based network to a hybrid architecture that exploits the advantages of both recurrent and convolutional networks. The hybrid CNN-RNN based deep encoder-decoder network draws on the following intuition: a CNN extracts a shallow sequence of features from the time-series EEG signals, and an RNN then encodes the extracted features into a vector representation that takes into account the relationships between the features at different time points. The features encoded by the RNN can therefore be regarded as a deeper feature representation that embeds the meaning of the whole input EEG signals. This architecture fuses the feature representations extracted at different depth levels, at different brain locations, and at different time points, which benefits the representation of spatial and temporal dynamics in non-stationary time-series EEG signals. The developed hybrid CNN-RNN based encoder-decoder network is shown in Fig.
5, where the encoder consists of convolutional layers that extract features from the EEG signals at every time point and recurrent layers that encode these per-time-point features into a single feature representation of the whole input EEG signals. The decoder consists of recurrent layers that predict the features at each time point from the encoder output and deconvolutional layers that reconstruct the original EEG signals from the features at each time point. The convolution and deconvolution layers in the encoder and decoder are the same as those of the CNN-based network (Fig. 2). Here, the rows and columns of the generated feature map refer to the features from channels and time points. In the encoder pipeline, $\{p_1, p_2, \dots, p_t\}$ are the CNN features extracted from all the channels at each single time point; in the decoder pipeline, $\{q_1, q_2, \dots, q_t\}$ are the reconstructed CNN features of all the channels at each single time point, which are further used to reconstruct the raw EEG signals. In recurrent layers, the basic building modules for learning dependencies between neighboring time points are commonly LSTM units. Because LSTM training is sophisticated, the GRU was proposed [Cho et al., 2014]; like the LSTM, it modulates the flow of information inside gating units, but without a separate memory cell. The GRU has shown performance comparable to the LSTM on machine learning tasks, with fewer parameters required [Chung et al., 2014]. To achieve higher computational efficiency, the RNN in the hybrid CNN-RNN based network is implemented as a bidirectional GRU, defined as

$$z_t = \sigma\left(W^{(z)} x_t + U^{(z)} h_{t-1} + b^{(z)}\right), \tag{6}$$

$$r_t = \sigma\left(W^{(r)} x_t + U^{(r)} h_{t-1} + b^{(r)}\right), \tag{7}$$

$$\hat{h}_t = \phi\left(W^{(h)} x_t + U^{(h)} (r_t \odot h_{t-1}) + b^{(h)}\right), \tag{8}$$

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \hat{h}_t, \tag{9}$$

where $x_t$ is the input and $h_t$ is the output. $W^{(z)}$, $W^{(r)}$, $W^{(h)}$, $U^{(z)}$, $U^{(r)}$, and $U^{(h)}$ are weight matrices and $b^{(z)}$, $b^{(r)}$, $b^{(h)}$ are biases, all learnt in the training process. $z_t$, $r_t$ and $\hat{h}_t$ are the update gate vector, the reset gate vector and the candidate hidden state vector, respectively. $\sigma$ and $\phi$ are the sigmoid and hyperbolic tangent functions, and $\odot$ is element-wise multiplication. In the implementation, the forward and backward recurrent layers iterate over the sequence of per-time-point feature vectors and compute the corresponding forward and backward sequences of hidden state vectors. Denoting the hidden state vectors in the forward and backward sequences as $h_t^{f}$ and $h_t^{b}$, the output of the GRU at time point $t$ is defined as

$$a_t = h_t^{f} \oplus h_t^{b}, \tag{10}$$

where $\oplus$ indicates vector concatenation. Finally, the generated deep feature representation vector (EEG code) is denoted as

$$o = (a_1, \dots, a_t, \dots, a_T). \tag{11}$$

Figure 5. The hybrid CNN-RNN based encoder-decoder network.

For the purpose of EEG feature characterization and fusion, the input EEG signals are first characterized by the convolutional layers as a sequence of feature vectors, one per time point $t$ (considered as spatial dynamic characterization); the recurrent layers then learn sequential features that synthesize the past and future dynamic information of the time-series EEG signals (considered as temporal dynamic characterization). Thus, the extracted $o$ is treated as the deep EEG features that represent the entire input EEG signals across time points, covering


not only the EEG characteristics themselves but also their sequential information, and which could be widely used in various EEG-related applications such as brain decoding. To improve the implementation efficiency, we update the input of the GRU from batch to batch in the training process, where one batch includes continuous EEG signals at a certain time gap. More details about the hybrid CNN-RNN based encoder-decoder network are presented in Table 3.

Table 3. The network configurations of the hybrid CNN-RNN based encoder-decoder network.

Name | Parameters | Input Size | Output Size

Encoder
Convolution (Conv) | channel=16, kernel size (1,193), padding (0,96) | 128×1×32×384 | 128×16×32×384
Batch Norm (BN) | channel=16 | 128×16×32×384 | 128×16×32×384
Convolution (Conv) | channel=32, kernel size (32,1), padding (0,0) | 128×16×32×384 | 128×32×1×384
Batch Norm (BN) | channel=32 | 128×32×1×384 | 128×32×1×384
ELU | Activation Function | 128×32×1×384 | 128×32×1×384
Pooling | pooling (1,4) | 128×32×1×384 | 128×32×1×96
Dropout | dropout rate=0.25 | 128×32×1×96 | 128×32×1×96
Depthwise separable convolution (Conv) | channel=32, kernel size (1,49), padding (0,24) | 128×32×1×96 | 128×32×1×96
Pointwise Convolution (Conv) | channel=16, kernel size (1,1), padding (0,0) | 128×32×1×96 | 128×16×1×96
Batch Norm (BN) | channel=16 | 128×16×1×96 | 128×16×1×96
ELU | Activation Function | 128×16×1×96 | 128×16×1×96
Pooling | pooling (1,8) | 128×16×1×96 | 128×16×1×12
Data Reshape | | 128×16×1×12 | 128×12×16
Full Connection (FC) | input=16, output=16 | 128×12×16 | 128×12×16
ELU | Activation Function | 128×12×16 | 128×12×16
Bidirectional GRU | input=16, hidden size=16, layer=1 | 128×12×16 | 128×12×32
Full Connection (FC) | input=32, output=16 | 128×12×32 | 128×12×16

Decoder
Full Connection (FC) | input=16, output=32 | 128×12×16 | 128×12×32
Bidirectional GRU | input=32, hidden size=16, layer=1 | 128×12×32 | 128×12×32
Full Connection (FC) | input=32, output=16 | 128×12×32 | 128×12×16
ELU | Activation Function | 128×12×16 | 128×12×16
Data Reshape | | 128×12×16 | 128×16×1×12
UnPooling | unpooling (1,8) | 128×16×1×12 | 128×16×1×96
Pointwise Deconvolution (De-Conv) | channel=32, kernel size (1,1), padding (0,0) | 128×16×1×96 | 128×32×1×96
Depthwise separable Deconvolution (De-Conv) | channel=32, kernel size (1,49), padding (0,24) | 128×32×1×96 | 128×32×1×96
Batch Norm (BN) | channel=32 | 128×32×1×96 | 128×32×1×96
Dropout | dropout rate=0.25 | 128×32×1×96 | 128×32×1×96
ELU | Activation Function | 128×32×1×96 | 128×32×1×96
UnPooling | unpooling (1,4) | 128×32×1×96 | 128×32×1×384
Deconvolution (De-Conv) | channel=16, kernel size (32,1), padding (0,0) | 128×32×1×384 | 128×16×32×384
Batch Norm (BN) | channel=16 | 128×16×32×384 | 128×16×32×384
Deconvolution (De-Conv) | channel=1, kernel size (1,193), padding (0,96) | 128×16×32×384 | 128×1×32×384
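To make the recurrence in Eqs. (6)-(10) concrete, the following is a minimal pure-Python sketch of a bidirectional GRU, not the PyTorch implementation used in this work. The zero initial hidden state, the dictionary layout of the parameters and the helper `constant_params` are assumptions introduced only for illustration; the sizes (12 time points, 16 input features, hidden size 16) follow the Bidirectional GRU row of the encoder in Table 3.

```python
import math


def sigmoid(v):
    """Logistic function, the sigma of Eqs. (6) and (7)."""
    return 1.0 / (1.0 + math.exp(-v))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def gru_step(x_t, h_prev, p):
    """One GRU update; p is a dict holding the W*, U* matrices and b* biases."""
    z = [sigmoid(a + b + c) for a, b, c in
         zip(matvec(p["Wz"], x_t), matvec(p["Uz"], h_prev), p["bz"])]      # Eq. (6)
    r = [sigmoid(a + b + c) for a, b, c in
         zip(matvec(p["Wr"], x_t), matvec(p["Ur"], h_prev), p["br"])]      # Eq. (7)
    gated = [ri * hi for ri, hi in zip(r, h_prev)]                         # r_t * h_{t-1}, element-wise
    h_hat = [math.tanh(a + b + c) for a, b, c in
             zip(matvec(p["Wh"], x_t), matvec(p["Uh"], gated), p["bh"])]   # Eq. (8)
    return [(1.0 - zi) * hp + zi * hh
            for zi, hp, hh in zip(z, h_prev, h_hat)]                       # Eq. (9)


def run_gru(xs, p, hidden_size):
    """Run a GRU over a sequence, starting from a zero hidden state (assumption)."""
    h = [0.0] * hidden_size
    states = []
    for x_t in xs:
        h = gru_step(x_t, h, p)
        states.append(h)
    return states


def bidirectional_gru(xs, p_fwd, p_bwd, hidden_size):
    """Concatenate forward and backward hidden states per time point, Eq. (10)."""
    forward = run_gru(xs, p_fwd, hidden_size)
    backward = run_gru(xs[::-1], p_bwd, hidden_size)[::-1]
    return [f + b for f, b in zip(forward, backward)]


def constant_params(input_size, hidden_size, weight=0.0, bh=0.0):
    """Hypothetical helper: fill every weight with one value, for illustration only."""
    W = [[weight] * input_size for _ in range(hidden_size)]
    U = [[weight] * hidden_size for _ in range(hidden_size)]
    zeros = [0.0] * hidden_size
    return {"Wz": W, "Uz": U, "bz": zeros, "Wr": W, "Ur": U, "br": zeros,
            "Wh": W, "Uh": U, "bh": [bh] * hidden_size}


# Sizes taken from Table 3: 12 time points, 16 input features, hidden size 16.
xs = [[0.0] * 16 for _ in range(12)]
params = constant_params(16, 16)
code = bidirectional_gru(xs, params, params, 16)
print(len(code), len(code[0]))  # 12 time points, each with 16 + 16 = 32 values
```

The concatenation in the last function is why the Bidirectional GRU rows of Table 3 map a feature size of 16 to 32: each time point carries one forward and one backward hidden state of size 16.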

Based on the network architectures developed in Sections 2.2.1-2.2.3, we finally propose a hybrid network that incorporates CNN-RNN and GAN networks to realize efficient and effective deep EEG feature characterization and fusion in an unsupervised manner. This hybrid deep CNN-RNN-GAN based encoder-decoder network is termed EEGFuseNet; its generator is the hybrid CNN-RNN based network presented in Fig. 5, and its discriminator is the same as the one in the hybrid CNN-GAN based network presented in Fig. 4. The architecture design of EEGFuseNet is shown in Fig. 6. The configuration details of the generator and discriminator are reported in Table 3 and Table 2, respectively.

Figure 6. The architecture design of the proposed EEGFuseNet.

Experimental Results and Discussion

In this section, we present the emotion recognition results based on the extracted deep features by the proposed EEGFuseNet in an unsupervised manner. To fully evaluate the validity and reliability of the extracted deep features, we quantify the unsupervised emotion recognition


performance in four different emotion dimensions (valence, arousal, dominance, and liking) and compare it with the results reported in the literature. To fully evaluate the feature representation performance of the proposed EEGFuseNet, we also compare the deep features extracted by the abovementioned hybrid architectures on a well-known public emotion database. The comparison with the literature is presented in Section 3.5.

The database for emotion analysis using physiological signals (DEAP), constructed by Koelstra et al. [2012], is the most widely used public database for studying emotions with EEG signals. DEAP provides a benchmark for comparing EEG-based emotion decoding methods, and recent methods have been proposed and verified using this database. DEAP is a multimodal dataset for studying human affective states, in which a total of 32 subjects participated. Forty music videos, each with a duration of 60 s, were selected through a stimuli selection system and used to evoke specific and strong emotions. More details of the experimental design and data collection are presented in Appendix B of Supplementary Materials.

To solve the purely unsupervised, cross-subject EEG-based emotion decoding problem, we introduce hypergraph theory in this paper. A hypergraph is recognized as an effective approach to describe complex hidden data structures in an unsupervised manner. Unlike a simple graph, which only considers the pairwise relation between any two vertices and may therefore lose information [Ducournau et al., 2009], a hypergraph can connect several vertices (more than two) that share similar properties, presenting more general types of relations and revealing more complex hidden structures than single connections [Liang et al., 2019]. By measuring the relationships among the vertices in terms of the deep EEG features extracted by EEGFuseNet, one hyperedge is formed to connect several vertices (trials) that share similar EEG features across different evoked emotions and subjects. Here, the number of vertices connected in one hyperedge is determined by the given hyperedge size, denoted as $\kappa$. Furthermore, a spectral hypergraph partitioning method is introduced to compute the hypergraph Laplacian $\Delta$, explore the optimal eigenspace for the given feature size (denoted as $\ell$), and then divide the constructed hypergraph into a specific number of classes. Here, the partitioned classes indicate different emotion statuses. More information about hypergraph construction and partitioning is presented in Appendix C of Supplementary Materials. To quantitatively evaluate the decoding performance, the emotion recognition accuracy $P_{acc}$ is measured as

$$P_{acc} = \frac{n_{TN} + n_{TP}}{n_{TN} + n_{FN} + n_{TP} + n_{FP}} \times 100\%, \tag{12}$$

where $n_{TN}$ and $n_{TP}$ are the numbers of correctly predicted samples, and $n_{FN}$ and $n_{FP}$ are the numbers of misclassified samples.

We use the DEAP database to benchmark the performance of the proposed EEGFuseNet and compare it with other state-of-the-art methods.
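As a worked instance of Eq. (12), the short sketch below computes the recognition accuracy from confusion counts. The function names and the example counts are hypothetical and chosen only for illustration; the total of 2400 corresponds to the segments of one subject (40 trials × 60 segments), not to any reported result.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary labelling."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, tn, fp, fn


def recognition_accuracy(n_tp, n_tn, n_fp, n_fn):
    """Eq. (12): correctly predicted samples over all samples, in percent."""
    total = n_tp + n_tn + n_fp + n_fn
    if total == 0:
        raise ValueError("accuracy is undefined for an empty sample set")
    return (n_tp + n_tn) / total * 100.0


# Made-up counts for one held-out subject (700 + 650 + 550 + 500 = 2400 segments).
accuracy = recognition_accuracy(n_tp=700, n_tn=650, n_fp=550, n_fn=500)
print(accuracy)  # 1350 / 2400 * 100 = 56.25
```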
Unlike the media data used in deep learning studies, the DEAP database is relatively small, consisting of only 1280 trials (32 subjects × 40 trials). The challenge is to train the network well enough to extract sufficient EEG features while avoiding over-fitting. To increase the sample size, each trial is further segmented into segments with a length of 1 s. As the trial length is 60 s, each trial yields 60 segments, and the total sample size increases to 76800 (32 subjects × 40 trials × 60 segments). In the training process, the weight parameters in the convolution layers are initialized from a uniform distribution based on Glorot initialization [Glorot and Bengio, 2010]. We run 100 training epochs and perform validation stopping: the model weights that produce the lowest validation set loss are saved as the final parameters for deep feature characterization and fusion. The Adam optimizer is used with a momentum of 0.9. The mini-batch stochastic gradient descent (SGD) method is used with a fixed learning rate of 0.001 for the autoencoder or generator and of 0.0002 for the discriminator. Here, the mini-batch size is equal to 128. All the models are trained on an NVIDIA GeForce RTX 2080 GPU with CUDA 10.0 using the PyTorch API.
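The sample-size arithmetic of the segmentation step can be checked with a small sketch. It assumes a trial is stored as a list of channels, each a list of samples, and that segments do not overlap; the function name `segment_trial` and the toy sampling rate of 4 Hz are illustrative choices, not part of the original implementation.

```python
def segment_trial(trial, fs, seg_seconds=1):
    """Split one trial (channels x samples) into non-overlapping segments."""
    seg_len = int(fs * seg_seconds)
    n_segments = len(trial[0]) // seg_len
    return [[channel[i * seg_len:(i + 1) * seg_len] for channel in trial]
            for i in range(n_segments)]


# Toy trial: 2 channels, 60 s at a toy rate of 4 Hz (240 samples per channel).
toy_fs = 4
toy_trial = [[c * 1000 + i for i in range(60 * toy_fs)] for c in range(2)]
segments = segment_trial(toy_trial, toy_fs)

n_subjects, n_trials = 32, 40
total_samples = n_subjects * n_trials * len(segments)
print(len(segments), total_samples)  # 60 segments per trial, 76800 samples in total
```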

As the computational complexity of the hypergraph construction process is 𝑂(𝑛 ), it is very time-consuming to measure the adjacency relationship through all the available samples (32 subjects × 40 trials × 60 segments = 76800) [...] × 40 trials × 60 segments = [...] $\eta$%) are randomly selected from the training data candidates (31 subjects × 40 trials × 60 segments = 74400).

To evaluate the validity and reliability of the proposed method, we compare the emotion recognition performance of the proposed EEGFuseNet with the existing methods. To clarify the subject-based LOOCV process, the evaluation pipeline is described below.
(1) Sample increment: we increase the sample size by dividing each trial into a number of segments with a fixed length of 1 s.
(2) Baseline normalization: we normalize the emotion-evoking EEG data by subtracting the last 1 s of baseline data, defined as $X_{normalize} = X_{emotion} \ominus X_{baseline}$. Here, $X_{emotion}$ and $X_{baseline}$ are 1 s raw EEG signals collected under the emotion-evoking and baseline conditions, and $\ominus$ is an element-wise subtraction operation.
(3) Data downsampling: the original sampling rate is 512 Hz, so the size of one data segment is 32 channels × 512 time points. To increase the computational efficiency, the size of $X_{normalize}$ is further reduced to 32 channels × 384 time points through a linear interpolation based downsampling method (the "resample" function in MATLAB).
(4) Feature extraction: we extract deep EEG features for every data segment by using the proposed EEGFuseNet.
(5) Test data formation: we use all the segment data from one subject as the test data.
(6) Training data selection: based on the learning strategy mentioned in Section 3.4, we randomly select a certain number of samples ($\eta$%) from the remaining subjects' segment data in the database (formed by 31 subjects).
(7) Hypergraph construction: we build a hypergraph based on the test data and the training data selected in step (6).
(8) Hypergraph partitioning: we divide the constructed hypergraph into two emotion classes based on the hidden relationships among the data in terms of the extracted deep features. The hypergraph Laplacian given in Appendix C of Supplementary Materials is optimized in the partitioning process.
(9) Emotion assignment: for each emotion dimension, we assign the emotion class to the test data according to the label distribution of the training data in the same class (majority voting rule).
(10) Accuracy calculation: we calculate the emotion recognition accuracy of the test data by using Eq. (12).
(11) Repetition of steps (5) to (10) until each subject has been treated as test data once: we repeatedly treat each subject's data as test data and the remaining subjects' data as the training data candidates to be selected by the learning strategy, and measure the corresponding testing accuracy in emotion recognition.
(12) Final evaluation result: we calculate the average of all the obtained testing accuracies across all the subjects and present it as the final emotion recognition result.

The results reported below are evaluated with this subject-based LOOCV approach. It is well known that, even on the same database, different performance validation procedures lead to large differences in the results. Generally speaking, the validation methods affect the obtained performance as follows:
- supervised vs. unsupervised: supervised methods would have better results than unsupervised methods, as label information is used for model training in the supervised methods;
- subject-dependent vs. subject-independent: subject-dependent evaluation methods would have better performance than subject-independent evaluation methods, as individual differences are not considered in the subject-dependent evaluation methods;
- k-fold CV vs. video-based LOOCV: k-fold CV methods would have better performance than video-based LOOCV methods, as the training and test data may come from the same video stimuli in the k-fold CV methods;
- video-based LOOCV vs. subject-based LOOCV: video-based LOOCV methods would have better performance than subject-based LOOCV methods, as the training and test data may come from the same subject in the video-based LOOCV methods.

Table 4. Emotion recognition performance comparison with the existing studies.

Methods | Valence | Arousal | Dominance | Liking

Supervised, Subject-Dependent, K-fold CV
Liu and Sourina [2012] | 50.80 | 76.51 | - | -
Li et al. [2015] | 58.40 | 64.20 | 65.80 | 66.90
Chen et al. [2015] | 67.89 | 69.09 | - | -

Supervised, Subject-Dependent, Video-based LOOCV
Koelstra et al. [2012] | 57.60 | 62.00 | - | 55.40
Bahari and Janghorbani [2013] | 58.05 | 64.56 | - | 67.42
Naser and Saha [2013] | 64.30 | 66.20 | 68.90 | 70.20
Zhuang et al. [2017] | 69.10 | 71.99 | - | -

Supervised, Subject-Independent, K-fold CV
Torres-Valencia et al. [2014] | 58.75 | 55.00 | - | -
Atkinson and Campos [2016] | 73.14 | 73.06 | - | -
Liu et al. [2016] | 69.90 | 71.20 | - | -

Supervised, Subject-Independent, Subject-based LOOCV
Shahnaz et al. [2016] | 64.71 | 66.51 | 66.88 | 70.52
Song et al. [2018] | 59.29 | 61.10 | - | -
Chen et al. [2019] | 67.90 | 66.50 | - | -
Zhong et al. [2020] | 66.23 | 68.50 | - | -
Du et al. [2020] | 69.06 | 72.97 | - | -

Unsupervised, Subject-Dependent, Video-based LOOCV
Liang et al. [2019] | 56.25 | 62.34 | 64.22 | 66.09

Unsupervised, Subject-Independent, Subject-based LOOCV
EEGFuseNet | 56.27 | 58.78 | 61.69 | 66.30

Table 4 reports the performance comparisons with the existing literature, where the corresponding validation approaches are clearly described. Considering computational efficiency and performance stability, in the implementation of EEGFuseNet the feature size $\ell$ and hyperedge size $\kappa$ in hypergraph construction and partitioning are set to 64 and 5, while the sampling rate $\eta$ used by the learning strategy in model learning is 10. From the results listed in Table 4, the benefit of the network depth can be observed. The results show that EEGFuseNet achieves a promising emotion recognition performance, where the recognition accuracies on valence, arousal, dominance, and liking are 56.27, 58.78, 61.79 and 66.30, respectively. As our work is based on unsupervised learning, it is reasonable that its accuracy is lower than that of the supervised learning methods presented in Table 4. Compared with the unsupervised method proposed in [Liang et al., 2019], our recognition accuracies on valence and liking are slightly better and the accuracies on arousal and dominance are slightly lower. However, Liang et al. [2019] only evaluated their model in a subject-dependent manner without considering individual differences, where the emotion recognition was trained and tested for each individual separately. The results reveal the validity of recognizing emotional states in four dimensions via a subject-independent unsupervised learning method using the deep features characterized by the proposed EEGFuseNet.

We compare the EEGFuseNet performance with that of the unsupervised deep EEG features characterized and fused by the other network configurations, namely the CNN based, CNN-GAN based, and CNN-RNN based encoder-decoder networks. The corresponding unsupervised emotion recognition performance in valence, arousal, dominance, and liking is reported in Table 5.
The results show that, compared with the CNN-based network, the CNN-GAN based, CNN-RNN based and CNN-RNN-GAN based (EEGFuseNet) networks are more capable of characterizing and fusing emotion-related deep EEG features with high quality and achieve better cross-subject emotion recognition performance.

Table 5. Emotion recognition performance comparison with different network configurations.

Methods | Valence | Arousal | Dominance | Liking
CNN based | 55.15 | 56.68 | 60.04 | 62.44
CNN-GAN based | 55.90 | 56.95 | 61.10 | 64.81
CNN-RNN based | 55.78 | 57.47 | 59.54 | 64.76
CNN-RNN-GAN based (EEGFuseNet)

The traditional EEG features that are commonly used in the literature are also evaluated here to compare handcrafted features with deep features. The traditional EEG features include:
(1) Time domain features: characterize the statistical patterns, Hjorth features and shape information of time-series EEG data;
(2) Power spectral features: characterize the spectral powers at different frequency bands;
(3) Differential entropy features: characterize the differential entropy at different frequency bands.

More details about the traditional EEG features can be found in Appendix D of Supplementary Materials. To make the results comparable, the extracted traditional features are reduced to the same feature dimensionality as the proposed EEGFuseNet in model learning (solving the hypergraph Laplacian $\Delta$ to the optimal eigenspace with the same feature size), where the hyperedge size $\kappa$ and the sampling rate $\eta$ are set to 5 and 10, respectively. As shown in Table 6, with the traditional features the best emotion recognition accuracy on valence is 54.60 when the power spectral features are used, the best accuracies on arousal and liking are 56.34 and 62.51 when the differential entropy features are adopted, and the best accuracy on dominance is 58.89 when the time domain features are utilized. For an unsupervised study, these results indicate that the EEG features characterized and fused by hybrid deep networks (reported in Table 5), which fuse the spatial and temporal dynamic characteristics of EEG signals in a more intelligent way, outperform the traditional handcrafted feature representations.

Table 6. Emotion recognition performance with traditional EEG features.

Traditional EEG Features | Valence | Arousal | Dominance | Liking
Time domain features | 54.50
Power spectral features | 54.60
Differential entropy features | 53.54
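For readers unfamiliar with the handcrafted features named above, the sketch below computes two of them on a single-channel segment: the Hjorth parameters (a time domain feature) and the differential entropy. It uses the textbook definitions, with the differential entropy taken under a Gaussian assumption as $\frac{1}{2}\ln(2\pi e \sigma^2)$; the exact definitions used in Appendix D are not reproduced here and may differ, and the band-pass filtering that would normally precede a per-band differential entropy is omitted.

```python
import math


def variance(x):
    """Population variance of a list of samples."""
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x) / len(x)


def differential_entropy(x):
    """Differential entropy under a Gaussian assumption: 0.5 * ln(2*pi*e*var)."""
    return 0.5 * math.log(2.0 * math.pi * math.e * variance(x))


def hjorth_parameters(x):
    """Hjorth activity, mobility and complexity of a 1-D signal."""
    dx = [b - a for a, b in zip(x, x[1:])]       # first difference
    ddx = [b - a for a, b in zip(dx, dx[1:])]    # second difference
    activity = variance(x)
    mobility = math.sqrt(variance(dx) / activity)
    complexity = math.sqrt(variance(ddx) / variance(dx)) / mobility
    return activity, mobility, complexity


# Toy single-channel segment (illustrative values, not EEG data).
signal = [1.0, -1.0, 1.0, -1.0]
de = differential_entropy(signal)
activity, mobility, complexity = hjorth_parameters(signal)
print(de, activity, mobility, complexity)
```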

The above results demonstrate the efficiency and effectiveness of the proposed EEGFuseNet for EEG feature characterization and fusion. The deep EEG features characterized by EEGFuseNet outperform both the features extracted by the other network configurations and the handcrafted EEG features. This indicates that EEGFuseNet is capable of characterizing and fusing deep features that carry cortical dynamic significance corresponding to the changes between emotion states.

In addition, we evaluate the robustness of the proposed decoding model with hypergraph theory by comparing it with the decoding performance of other unsupervised methods, namely a simple graph based method and two baseline methods. Here, the simple graph based method refers to using pairwise graph construction and partitioning for emotion decoding. Of the two baseline methods, one is based on principal component analysis (PCA) and k-means clustering, and the other uses the k-nearest neighbors algorithm (k-NN). Notably, to make the results comparable, the EEG features used are also extracted by the proposed EEGFuseNet with the same parameter settings. As shown in Table 7, the emotion recognition performance on valence, arousal, dominance and liking decreases to 56.01, 55.92, 58.82, and 63.16 when the simple graph based method is used. This demonstrates that, compared with the pairwise relationship measurement in a simple graph, the proposed purely unsupervised framework with hypergraph construction and partitioning is beneficial for describing complex hidden relationships and is more suitable for EEG data modeling. For the two baseline methods, the recognition results of PCA with k-means are all around 50% for valence, arousal, dominance, and liking, which is close to the chance level of a binary classification task; the recognition results of the k-NN based method are slightly better than those of PCA with k-means, but still much lower than those of the proposed method. These results verify that a simple unsupervised method is not suitable for complex and difficult decoding applications using high-dimensional EEG signals.

Table 7. Emotion recognition performance comparison with classic unsupervised methods.

Methods | Valence | Arousal | Dominance | Liking
Simple graph based method | 56.01 | 55.92 | 58.82 | 63.16
Baseline method (PCA+k-means) | 50.40 | 50.60 | 50.32 | 51.33
Baseline method (k-NN) | 50.46 | 52.16 | 53.74 | 59.83
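The structural difference between a simple graph and a hypergraph discussed above can be illustrated with a toy construction. The rule used here, one hyperedge per vertex joining it to its $\kappa - 1$ nearest neighbours under the Euclidean distance, is an assumption made for illustration; the actual construction and the Laplacian are given in Appendix C and are not reproduced. Under this rule a simple graph is the special case $\kappa = 2$.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_hyperedges(features, kappa):
    """Assumed rule: one hyperedge per vertex, joining it to its kappa-1 nearest neighbours."""
    edges = []
    for i, f in enumerate(features):
        others = [j for j in range(len(features)) if j != i]
        others.sort(key=lambda j: (euclidean(f, features[j]), j))  # ties broken by index
        edges.append(sorted([i] + others[:kappa - 1]))
    return edges


def incidence_matrix(n_vertices, edges):
    """H[v][e] = 1 if vertex v belongs to hyperedge e, else 0."""
    return [[1 if v in edge else 0 for edge in edges] for v in range(n_vertices)]


# Six toy 1-D "features" forming two well-separated groups (illustrative values).
features = [[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]]
pairwise = knn_hyperedges(features, kappa=2)   # simple graph: edges of size 2
hyper = knn_hyperedges(features, kappa=3)      # hypergraph: hyperedges of size 3
H = incidence_matrix(len(features), hyper)
vertex_degrees = [sum(row) for row in H]
print(pairwise)
print(hyper)
print(vertex_degrees)
```

With $\kappa = 3$ every hyperedge in this toy example covers one whole group, whereas the pairwise edges only link a vertex to a single neighbour and leave the group structure to be inferred from several separate edges.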

We test the parameters of the proposed EEGFuseNet that would affect the validity and reliability of the emotion recognition performance. For example, the input segment data size (EEG channels × time points) of the network is evaluated. We adjust the input data size from 32 × 128 to 32 × [...] × 384 perform the best, which covers almost all of the important information in the collected data and improves the computational efficiency as well. If the feature size is further reduced, the loss of information would lead to a significant decrease in the recognition performance. According to the network design, the kernel sizes in the first convolution layer and the depthwise separable convolution layer are adaptively adjusted according to the input data size. For the original data size (32 × [...] $\kappa$) and sampling rate ($\eta$) in model learning are also examined. We adjust the $\kappa$ value from 5 to 35 with a step of 5 and present the corresponding emotion recognition performance in Fig. 8. It shows that the performance is relatively stable and not sensitive to changes in the $\kappa$ value.

Figure 8. A comparison of emotion recognition performance with various hyperedge size values.

To evaluate the effect of the sampling rate on the model learning performance, the sampling rate ($\eta$) value is set to 1, 2, 3, 4, 5, 10 and 15, respectively. The corresponding training data sizes are 744, 1488, 2232, 2976, 3720, 7440 and 11160 samples (the total number of training data candidates is 74400 samples). The emotion recognition results with different sampling rates are shown in Fig. 9. They reveal that an increase of the sampling rate generally leads to a higher emotion recognition accuracy. The case of $\eta = 2$ achieving better performance than $\eta = 3$ could arise because the randomly selected training data for $\eta = 2$ happen to share similar patterns with the test data and involve fewer individual differences. On the other hand, an increase of the sampling rate leads to a significant growth in computation time, due to the computational complexity of the hypergraph method. There is thus a tradeoff between recognition performance and computation time.

Figure 9. A comparison of emotion recognition performance with different sampling rates.

Conclusion

The aim of this paper is to present a theoretical and practical method for valid and reliable feature characterization and fusion of high-dimensional EEG signals in an unsupervised manner. This paper offers a comprehensive comparison of the proposed EEGFuseNet with different specific designs and configurations. The efficiency and effectiveness of the extracted features are demonstrated in an emotion recognition application. The results reveal that the proposed hybrid CNN-RNN-GAN based EEGFuseNet systematically outperforms the other three networks (CNN based, hybrid CNN-GAN, and hybrid CNN-RNN), which supports our original hypothesis in the network design. Notably, the proposed characterization, fusion and classification framework is a self-learning paradigm that requires no label information in the training process. This work could serve as a foundational framework for high-dimensional EEG studies and for assessing the validity of other feature characterization and fusion methods beyond non-stationary time-series EEG signals.

Abbreviations

The following abbreviations are used in this paper.
BCI: Brain-Computer Interface
EEG: Electroencephalography
DNN: Deep Neural Network
CNN: Convolution Neural Network
DBN: Deep Belief Networks
GAN: Generative Adversarial Network
RNN: Recurrent Neural Network
GRU: Gated Recurrent Unit
LSTM: Long Short-Term Memory
MSE: Mean Squared Error
CV: Cross Validation
LOOCV: Leave-One-Out Cross Validation

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was supported by National Natural Science Foundation of China (No.61906122).

References

Akbari, M. and Liang, J., 2018. Semi-recurrent CNN-based VAE-GAN for sequential data generation. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2321-2325.
Alarcao, S.M. and Fonseca, M.J., 2017. Emotions recognition using EEG signals: A survey. IEEE Transactions on Affective Computing, 10(3), pp. 374-393.
Amin-Naji, M., Aghagolzadeh, A. and Ezoji, M., 2019. Ensemble of CNN for multi-focus image fusion. Information Fusion, 51, pp. 201-214.
Atkinson, J. and Campos, D., 2016. Improving BCI-based emotion recognition by combining EEG feature selection and kernel classifiers. Expert Systems with Applications, 47, pp. 35-41.
Bahari, F. and Janghorbani, A., 2013. EEG-based emotion recognition using recurrence plot analysis and k nearest neighbor classifier. , pp. 228-233.
Barlow, H.B., 1989. Unsupervised learning. Neural Computation, 1(3), pp. 295-311.
Castelnovo, A., Riedner, B.A., Smith, R.F., Tononi, G., Boly, M. and Benca, R.M., 2016. Scalp and source power topography in sleepwalking and sleep terrors: a high-density EEG study. Sleep, 39(10), pp. 1815-1825.
Chen, J., Hu, B., Moore, P., Zhang, X. and Ma, X., 2015. Electroencephalogram-based emotion assessment system using ontology and data mining techniques. Applied Soft Computing, 30, pp. 663-674.
Chen, H., Song, Y. and Li, X., 2019. A deep learning framework for identifying children with ADHD using an EEG-based brain network. Neurocomputing, 356, pp. 83-96.
Chen, J.X., Jiang, D.M. and Zhang, Y.N., 2019. A hierarchical bidirectional GRU model with attention for EEG-based emotion classification. IEEE Access, pp. 118530-118540.
Chen, J.X., Zhang, P.W., Mao, Z.J., Huang, Y.F., Jiang, D.M. and Zhang, Y.N., 2019. Accurate EEG-based emotion recognition on combined features using deep convolutional neural networks. IEEE Access, 7, pp. 44317-44328.
Chen, X. and Konukoglu, E., 2018. Unsupervised detection of lesions in brain MRI using constrained adversarial auto-encoders. arXiv preprint arXiv:1806.04972, pp. 1-9.
Cho, K., Van Merriënboer, B., Bahdanau, D. and Bengio, Y., 2014. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, pp. 1-9.
Chung, J., Gulcehre, C., Cho, K. and Bengio, Y., 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, pp. 1-9.
Cimtay, Y. and Ekmekcioglu, E., 2020. Investigating the use of pretrained convolutional neural network on cross-subject and cross-dataset EEG emotion recognition. Sensors, 20(7), p. 2034.
Cao, Z., 2020. A review of artificial intelligence for EEG-based brain-computer interfaces and applications. Brain Science Advances, 6(3), pp. 162-170.
de Tommaso, M., Trotta, G., Vecchio, E., Ricci, K., Van de Steen, F., Montemurno, A., Lorenzo, M., Marinazzo, D., Bellotti, R. and Stramaglia, S., 2015. Functional connectivity of EEG signals under laser stimulation in migraine. Frontiers in Human Neuroscience, 9, p. 640.
Deng, J., Zhang, Z., Eyben, F. and Schuller, B., 2014. Autoencoder-based unsupervised domain adaptation for speech emotion recognition. IEEE Signal Processing Letters, 21(9), pp. 1068-1072.
Du, X., Ma, C., Zhang, G., Li, J., Lai, Y.K., Zhao, G., Deng, X., Liu, Y.J. and Wang, H., 2020. An efficient LSTM network for emotion recognition from multichannel EEG signals. IEEE Transactions on Affective Computing, DOI: 10.1109/TAFFC.2020.3013711, pp. 1-12.
Ducournau, A., Rital, S., Bretto, A. and Laget, B., 2009, November. A multilevel spectral hypergraph partitioning approach for color image segmentation. IEEE International Conference on Signal and Image Processing Applications, pp. 419-424.
Finn, C., Tan, X.Y., Duan, Y., Darrell, T., Levine, S. and Abbeel, P., 2015. Learning visual feature spaces for robotic manipulation with deep spatial autoencoders. arXiv preprint arXiv:1509.06113, 25, pp. 1-8.
García-Martínez, B., Martinez-Rodrigo, A., Alcaraz, R. and Fernández-Caballero, A., 2019. A review on nonlinear methods using electroencephalographic recordings for emotion recognition. IEEE Transactions on Affective Computing, DOI: 10.1109/TAFFC.2018.2890636, pp. 1-20.
Gehring, J., Miao, Y., Metze, F. and Waibel, A., 2013, May. Extracting deep bottleneck features using stacked auto-encoders. IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3377-3381.
Glorot, X. and Bengio, Y., 2010, March. Understanding the difficulty of training deep feedforward neural networks. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249-256.
Haufe, S., Nikulin, V.V., Müller, K.R. and Nolte, G., 2013. A critical assessment of connectivity measures for EEG data: a simulation study.

Neuroimage, pp. 120-133.
He, L., Liang, J., Li, H. and Sun, Z., 2018. Deep spatial feature reconstruction for partial person re-identification: Alignment-free approach. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7073-7082.
He, H., Zhao, W. and Fujimoto, K.I., 2019, March. Identification of EEG-based Music Emotion Using Hybrid COA features and t-SNE. Proceedings of the 2019 9th International Conference on Biomedical Engineering and Technology, pp. 95-102.
Hosseini, S.A. and Naghibi-Sistani, M.B., 2011. Emotion recognition method using entropy analysis of EEG signals. International Journal of Image, Graphics and Signal Processing, 3(5), p. 30.
Hu, W., Huang, G., Li, L., Zhang, L., Zhang, Z. and Liang, Z., 2020. Video-triggered EEG-emotion public databases and current methods: a survey. Brain Science Advances, 6(3), pp. 255-287.
Islam, M.R. and Ahmad, M., 2019, February. Wavelet analysis based classification of emotion from EEG signal. International Conference on Electrical, Computer and Communication Engineering (ECCE), pp. 1-6.
Jenke, R., Peer, A. and Buss, M., 2014. Feature extraction and selection for emotion recognition from EEG. IEEE Transactions on Affective Computing, 5(3), pp. 327-339.
Jiao, Z., Gao, X., Wang, Y., Li, J. and Xu, H., 2018. Deep convolutional neural networks for mental load classification based on EEG data. Pattern Recognition, 76, pp. 582-595.
Jie, X., Cao, R. and Li, L., 2014. Emotion recognition based on the sample entropy of EEG. Bio-medical Materials and Engineering, 24(1), pp. 1185-1192.
Jirayucharoensak, S., Pan-Ngum, S. and Israsena, P., 2014. EEG-based emotion recognition using deep learning network with principal component based covariate shift adaptation. The Scientific World Journal, Article ID 627892, pp. 1-10.
Khanna, A., Pascual-Leone, A., Michel, C.M. and Farzan, F., 2015. Microstates in resting-state EEG: current status and future directions. Neuroscience & Biobehavioral Reviews, 49, pp. 105-113.
Kiran, B.R., Thomas, D.M. and Parakkal, R., 2018. An overview of deep learning based methods for unsupervised and semi-supervised anomaly detection in videos. Journal of Imaging, (2), p. 36.
Koelstra, S., Muhl, C., Soleymani, M., Lee, J.S., Yazdani, A., Ebrahimi, T., Pun, T., Nijholt, A. and Patras, I., 2012. DEAP: A database for emotion analysis using physiological signals. IEEE Transactions on Affective Computing, 3(1), pp. 18-31.
Lan, Z., Sourina, O., Wang, L. and Liu, Y., 2016. Real-time EEG-based emotion monitoring using stable features. The Visual Computer, 32(3), pp. 347-358.
Lawhern, V.J., Solon, A.J., Waytowich, N.R., Gordon, S.M., Hung, C.P. and Lance, B.J., 2018. EEGNet: a compact convolutional neural network for EEG-based brain–computer interfaces. Journal of Neural Engineering, 15(5), p. 056013.
Li, L., Qu, X., Zhang, J., Wang, Y. and Ran, B., 2019. Traffic speed prediction for intelligent transportation system based on a deep feature fusion model. Journal of Intelligent Transportation Systems, 23(6), pp. 605-616.
Li, M. and Lu, B.L., 2009, September. Emotion classification based on gamma-band EEG. Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 1223-1226.
Li, M., Xu, H., Liu, X. and Lu, S., 2018. Emotion recognition from multichannel EEG signals using K-nearest neighbor classification. Technology and Health Care, 26(S1), pp. 509-519.
Li, X., Zhang, P., Song, D., Yu, G., Hou, Y. and Hu, B., 2015. EEG based emotion identification using unsupervised deep feature learning. SIGIR2015 Workshop on Neuro-Physiological Methods in IR Research, ID: 44132, pp. 1-2.
Li, P., Lam, W., Bing, L. and Wang, Z., 2017. Deep recurrent generative decoder for abstractive text summarization. arXiv preprint arXiv:1708.00625, pp. 1-10.
Liang, Z., Oba, S. and Ishii, S., 2019. An unsupervised EEG decoding system for human emotion recognition. Neural Networks, 116, pp. 257-268.
Liu, Y., Sourina, O. and Nguyen, M.K., 2011. Real-time EEG-based emotion recognition and its applications.

Transactions on Computational Science XII, pp. 256-277.
Liu, B., Yu, X., Zhang, P., Yu, A., Fu, Q. and Wei, X., 2017. Supervised deep feature extraction for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 56(4), pp. 1909-1921.
Liu, J., Meng, H., Nandi, A. and Li, M., 2016, August. Emotion detection from EEG recordings. pp. 1722-1727.
Liu, Y. and Sourina, O., 2012, September. EEG-based valence level recognition for real-time applications. International Conference on Cyberworlds, pp. 53-60.
Long, J., Shelhamer, E. and Darrell, T., 2015. Fully convolutional networks for semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431-3440.
Luo, Z., Peng, B., Huang, D.A., Alahi, A. and Fei-Fei, L., 2017. Unsupervised learning of long-term motion dynamics for videos. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2203-2212.
Ma, X., Huang, X., Shen, Y., Qin, Z., Ge, Y., Chen, Y. and Ning, X., 2017. EEG based topography analysis in string recognition task. Physica A: Statistical Mechanics and its Applications, 469, pp. 531-539.
Ma, Y., Hao, Y., Chen, M., Chen, J., Lu, P. and Košir, A., 2019. Audio-visual emotion fusion (AVEF): A deep efficient weighted approach. Information Fusion, 46, pp. 184-192.
Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I. and Frey, B., 2015. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, pp. 1-16.
Masci, J., Meier, U., Cireşan, D. and Schmidhuber, J., 2011, June. Stacked convolutional auto-encoders for hierarchical feature extraction. International Conference on Artificial Neural Networks, pp. 52-59.
Milz, P., Faber, P.L., Lehmann, D., Koenig, T., Kochi, K. and Pascual-Marqui, R.D., 2016. The functional significance of EEG microstates-Associations with modalities of thinking. Neuroimage, 125, pp. 643-656.
Naser, D.S. and Saha, G., 2013, March. Recognition of emotions induced by music videos using DT-CWPT. Indian Conference on Medical Informatics and Telemedicine (ICMIT), pp. 53-57.
Nawaz, R., Nisar, H. and Voon, Y.V., 2018. The effect of music on human brain; frequency domain and time series analysis using electroencephalogram. IEEE Access, 6, pp. 45191-45205.
Nguyen, D., Nguyen, K., Sridharan, S., Dean, D. and Fookes, C., 2018. Deep spatio-temporal feature fusion with compact bilinear pooling for multimodal emotion recognition. Computer Vision and Image Understanding, 174, pp. 33-42.
Oktavia, N.Y., Wibawa, A.D., Pane, E.S. and Purnomo, M.H., 2019, September. Human Emotion Classification Based on EEG Signals Using Naïve Bayes Method. International Seminar on Application for Technology of Information and Communication (iSemantic), pp. 319-324.
Puthankattil, S.D. and Joseph, P.K., 2014. Analysis of EEG signals using wavelet entropy and approximate entropy: A case study on depression patients. International Journal of Bioengineering and Life Sciences, 8(7), pp. 430-434.
Raghu, S., Sriraam, N., Temel, Y., Rao, S.V. and Kubben, P.L., 2020. EEG based multi-class seizure type classification using convolutional neural network and transfer learning. Neural Networks, 124, pp. 202-212.
Ramos-Aguilar, R., Olvera-López, J.A., Olmos-Pineda, I. and Sánchez-Urrieta, S., 2020. Feature extraction from EEG spectrograms for epileptic seizure detection. Pattern Recognition Letters, 133, pp. 202-209.
Rifai, S., Vincent, P., Muller, X., Glorot, X. and Bengio, Y., 2011. Contractive auto-encoders: Explicit invariance during feature extraction. Proceedings of the 28th International Conference on Machine Learning, pp. 1-8.
Sahu, S., Gupta, R., Sivaraman, G., AbdAlmageed, W. and Espy-Wilson, C., 2018. Adversarial auto-encoders for speech based emotion recognition. arXiv preprint arXiv:1806.02146, pp. 1-5.
Sakkalis, V., 2011. Review of advanced techniques for the estimation of brain connectivity measured with EEG/MEG. Computers in Biology and Medicine, (12), pp. 1110-1117.
Schirrmeister, R.T., Springenberg, J.T., Fiederer, L.D.J., Glasstetter, M., Eggensperger, K., Tangermann, M., Hutter, F., Burgard, W. and Ball, T., 2017. Deep learning with convolutional neural networks for EEG decoding and visualization. Human Brain Mapping, 38(11), pp. 5391-5420.
Serban, I.V., Sordoni, A., Lowe, R., Charlin, L., Pineau, J., Courville, A. and Bengio, Y., 2016. A hierarchical latent variable encoder-decoder model for generating dialogues. arXiv preprint arXiv:1605.06069, pp. 1-15.
Shahnaz, C. and Hasan, S.S., 2016, November. Emotion recognition based on wavelet analysis of Empirical Mode Decomposed EEG signals responsive to music videos. pp. 424-427.
Shoeibi, A., Ghassemi, N., Alizadehsani, R., Rouhani, M., Hosseini-Nejad, H., Khosravi, A., Panahiazar, M. and Nahavandi, S., 2021. A comprehensive comparison of handcrafted features and convolutional autoencoders for epileptic seizures detection in EEG signals.

Expert Systems with Applications, p. 113788.
Sivasankari, K. and Thanushkodi, K., 2014. An improved EEG signal classification using neural network with the consequence of ICA and STFT. Journal of Electrical Engineering and Technology, (3), pp. 1060-1071.
Soleymani, M., Asghari-Esfeden, S., Fu, Y. and Pantic, M., 2015. Analysis of EEG signals and facial expressions for continuous emotion detection. IEEE Transactions on Affective Computing, 7(1), pp. 17-28.
Song, T., Zheng, W., Song, P. and Cui, Z., 2018. EEG emotion recognition using dynamical graph convolutional neural networks. IEEE Transactions on Affective Computing, 11(3), pp. 532-541.
Sutskever, I., Vinyals, O. and Le, Q.V., 2014. Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems, pp. 3104-3112.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V. and Rabinovich, A., 2015. Going deeper with convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-9.
Tao, C., Pan, H., Li, Y. and Zou, Z., 2015. Unsupervised spectral-spatial feature learning with stacked sparse autoencoder for hyperspectral imagery classification. IEEE Geoscience and Remote Sensing Letters, 12(12), pp. 2438-2442.
Tabar, Y.R. and Halici, U., 2016. A novel deep learning approach for classification of EEG motor imagery signals. Journal of Neural Engineering, 14(1), p. 016003.
Tang, P., Wang, H. and Kwong, S., 2017. G-MS2F: GoogLeNet based multi-stage feature fusion of deep CNN for scene recognition. Neurocomputing, 225, pp. 188-197.
Tang, Z., Sun, S., Zhang, S., Chen, Y., Li, C. and Chen, S., 2016. A brain-machine interface based on ERD/ERS for an upper-limb exoskeleton control. Sensors, 16(12), p. 2050.
Torres-Valencia, C.A., Garcia-Arias, H.F., Lopez, M.A.A. and Orozco-Gutiérrez, A.A., 2014. Comparative analysis of physiological signals and electroencephalogram (EEG) for multimodal emotion recognition using generative models. pp. 1-5.
Tzimourta, K.D., Tzallas, A.T., Giannakeas, N., Astrakas, L.G., Tsalikakis, D.G. and Tsipouras, M.G., 2017. Epileptic seizures classification based on long-term EEG signal wavelet analysis. International Conference on Biomedical and Health Informatics, pp. 165-169.
Uddin, M.A. and Lee, Y.K., 2019. Feature fusion of deep spatial features and handcrafted spatiotemporal features for human action recognition. Sensors, 19(7), p. 1599.
Ufer, N. and Ommer, B., 2017. Deep semantic feature matching. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6914-6923.
Ullah, A., Ahmad, J., Muhammad, K., Sajjad, M. and Baik, S.W., 2017. Action recognition in video sequences using deep bi-directional LSTM with CNN features. IEEE Access, 6, pp. 1155-1166.
Vosoughi, S., Vijayaraghavan, P. and Roy, D., 2016. Tweet2vec: Learning tweet embeddings using character-level cnn-lstm encoder-decoder. Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1041-1044.
Wang, X.W., Nie, D. and Lu, B.L., 2014. Emotional state classification from EEG data using machine learning approach. Neurocomputing, 129, pp. 94-106.
Wen, T. and Zhang, Z., 2018. Deep convolution neural network and autoencoders-based unsupervised feature learning of EEG signals. IEEE Access, pp. 25399-25410.
Xu, G., Shen, X., Chen, S., Zong, Y., Zhang, C., Yue, H., Liu, M., Chen, F. and Che, W., 2019. A deep transfer convolutional neural network framework for EEG signal classification. IEEE Access, 7, pp. 112767-112776.
Yasrab, R., 2018. ECRU: An encoder-decoder based convolution neural network (CNN) for road-scene understanding. Journal of Imaging, (10), p. 116.
Ye, J.C. and Sung, W.K., 2019. Understanding geometry of encoder-decoder CNNs. arXiv preprint arXiv:1901.07647, pp. 1-15.
Yin, Z., Wang, Y., Liu, L., Zhang, W. and Zhang, J., 2017. Cross-subject EEG feature selection for emotion recognition using transfer recursive feature elimination. Frontiers in Neurorobotics, 11(19), pp. 1-16.
Zhang, Y., Ji, X. and Zhang, S., 2016. An approach to EEG-based emotion recognition using combined feature extraction method. Neuroscience Letters, 633, pp. 152-157.
Zhang, J., Yin, Z., Chen, P. and Nichele, S., 2020. Emotion recognition using multi-modal data and machine learning techniques: A tutorial and review. Information Fusion, 59, pp. 103-126.
Zhao, Z.Q., Zheng, P., Xu, S.T. and Wu, X., 2019. Object detection with deep learning: A review. IEEE Transactions on Neural Networks and Learning Systems, 30(11), pp. 3212-3232.
Zheng, W.L. and Lu, B.L., 2015. Investigating critical frequency bands and channels for EEG-based emotion recognition with deep neural networks. IEEE Transactions on Autonomous Mental Development, 7(3), pp. 162-175.
Zheng, W.L., Liu, W., Lu, Y., Lu, B.L. and Cichocki, A., 2018. Emotionmeter: A multimodal framework for recognizing human emotions. IEEE Transactions on Cybernetics, 49(3), pp. 1110-1122.
Zhong, X., Yin, Z. and Zhang, J., 2020, July. Cross-Subject emotion recognition from EEG using Convolutional Neural Networks. pp. 7516-7521.
Zhuang, N., Zeng, Y., Tong, L., Zhang, C., Zhang, H. and Yan, B., 2017. Emotion recognition from EEG signals using multidimensional information in EMD domain. BioMed Research International, Article ID 8317357, pp. 1-9.

Supplementary Materials

Appendix A: EEG Preprocessing

The preprocessing steps are detailed below.
1) Loading: loading a bdf file into MATLAB.
2) Common average referencing: re-referencing the data by subtracting the average of all recorded electrodes from each electrode.
3) Finite impulse response (FIR) high-pass filter at 1 Hz: removing DC components and low-frequency drifts.
4) FIR low-pass filter at 45 Hz: removing the remaining high-frequency artifacts and noise.
5) Reordering: reordering the EEG electrode locations of all the data into the Geneva order [Koelstra et al., 2012].

According to the event marks, the EEG data of each subject are then segmented into 40 trials. Each trial includes 5 s of baseline and 60 s of video playing. Further preprocessing is conducted for each trial as follows.
6) Centering: aligning each channel to a zero mean.
7) Whitening transformation: converting the data into a matrix that has an identity covariance matrix, so that any correlations in the data are removed.
8) ICA computing: running independent component analysis (ICA) in EEGLAB [Delorme and Makeig, 2004].
9) Electrooculography (EOG) removal: removing the independent components with a lower fractal dimension (FD), based on the theory presented in [Gomez-Herrero et al., 2006]. The EEG signals are then reconstructed from the remaining ICA components.
10) Re-centering: realigning the cleaned EEG data to a zero mean in each channel.
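
The purely linear-algebraic steps above (common average referencing, centering and whitening) can be illustrated with a minimal NumPy sketch. It is not the MATLAB/EEGLAB pipeline used in the paper: the function names are hypothetical, the data are synthetic, and PCA whitening is assumed because the whitening variant is not specified. Filtering, ICA and EOG removal are left out.

```python
import numpy as np


def common_average_reference(eeg):
    """Step 2: subtract the mean over channels from every channel.

    eeg has shape (n_channels, n_samples).
    """
    return eeg - eeg.mean(axis=0, keepdims=True)


def center(eeg):
    """Steps 6 and 10: give every channel a zero mean over time."""
    return eeg - eeg.mean(axis=1, keepdims=True)


def pca_whiten(eeg, rel_tol=1e-10):
    """Step 7 (assumed PCA variant): rotate onto principal axes, rescale to unit variance.

    Axes with a numerically zero eigenvalue are dropped. Common average
    referencing makes the channels sum to zero, so one axis is always lost.
    """
    cov = np.cov(eeg)                        # (n_channels, n_channels)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    keep = eigvals > rel_tol * eigvals.max()
    transform = eigvecs[:, keep].T / np.sqrt(eigvals[keep])[:, None]
    return transform @ eeg


# Synthetic stand-in for one trial: 8 channels, 2000 samples, non-zero offset.
rng = np.random.default_rng(0)
scales = np.linspace(1.0, 3.0, 8)[:, None]
trial = scales * rng.standard_normal((8, 2000)) + 5.0

referenced = common_average_reference(trial)
whitened = pca_whiten(center(referenced))
print(whitened.shape)
```

Dropping the rank-deficient axis is a design choice of this sketch; an ICA implementation may handle the reduced rank differently.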

Appendix B: DEAP Database

In the DEAP database, 32 subjects were involved in total. For each subject, 40 trials with different videos were conducted. Each trial was composed of 5 s of fixation and 60 s of video playing. After each trial, the subjects were asked to give subjective feedback on their emotions while watching the video. Four emotion dimensions were considered for affective evaluation: arousal, valence, dominance, and liking. Using the Self-Assessment Manikin (SAM) system [Morris, 1995], subjective feedback in the range from 1 to 9 was given for each emotion dimension, where 9 indicated an extremely strong emotion and 1 indicated an extremely weak emotion. During the experiments, the videos were played in a randomized sequence and were presented on a 17-inch screen at a resolution of 800 × 600. The EEG signals were recorded at a sampling rate ($f_s$) of 512 Hz from 32 active AgCl electrode sites placed according to the international 10-20 system [Jasper, 1958].

Appendix C: Hypergraph Theory

A hypergraph is composed of a number of vertices and hyperedges. Unlike a conventional (simple) graph, in which one edge connects only two vertices, one hyperedge can connect three or more vertices. A hypergraph can be represented as $G = (V, E, w)$, where $V = \{v_1, v_2, \ldots, v_{|V|}\}$ is the set of vertices and $E = \{e_1, e_2, \ldots, e_{|E|}\}$ is the set of hyperedges. $w$ is a weight function of hyperedges, where $w(e_k)$ is the hyperedge weight of $e_k$. The hyperedge weight matrix, $\mathbf{W}$, is a diagonal matrix comprising the weights of all the hyperedges in the hypergraph $G$. For each $e_k \in E$, we have $e_k = \{v_1^{e_k}, v_2^{e_k}, \ldots, v_{|e_k|}^{e_k}\}$, indicating the vertices that belong to the hyperedge $e_k$. An incidence matrix $\mathbf{H}$ of size $|V| \times |E|$ defines the relationship between the hyperedges $E$ and the vertices $V$. Each element of $\mathbf{H}$ denotes whether a vertex $v_k$ is connected by a hyperedge $e_t$:

$$h(v_k, e_t) = \begin{cases} 1, & \text{if } v_k \in e_t \\ 0, & \text{if } v_k \notin e_t \end{cases} \tag{1}$$

The degree of a vertex $v_k \in V$ is defined as the sum of the weights $w(e)$ of all the hyperedges that $v_k$ belongs to, which is given by

$$d(v_k) = \sum_{\{e \in E \mid v_k \in e\}} h(v_k, e)\, w(e). \tag{2}$$

The vertex degree matrix, $\mathbf{D}_v$, is a diagonal matrix holding the degrees of all the vertices in the hypergraph $G$. The degree of a hyperedge $e_k \in E$ is defined as the number of vertices that are connected by the hyperedge $e_k$, denoted as

$$d(e_k) = \sum_{v \in V} h(v, e_k). \tag{3}$$

$\mathbf{D}_e$ is the hyperedge degree matrix, whose diagonal holds the degrees of all the hyperedges in the hypergraph $G$.

In this study, we treat each vertex as a centroid and form one hyperedge from the centroid vertex and its $\kappa - 1$ nearest vertices. In the constructed hypergraph, the number of hyperedges is therefore equal to the number of vertices, and every hyperedge has size $\kappa$. The constructed hypergraph can describe the complex relationships among the characteristics extracted from different EEG trials under various emotion states across subjects. One trial is treated as one vertex, and a hyperedge is formed by connecting vertices that share similar patterns. Emotion recognition can then be realized through spectral hypergraph partitioning [Zhou et al., 2007] by cutting the hypergraph into two clusters. The hypergraph cut is defined as

$$Hcut(Y, \bar{Y}) = \sum_{e \in \partial Y} w(e)\, \frac{|e \cap Y|\,|e \cap \bar{Y}|}{d(e)}, \tag{4}$$

where $Y$ and $\bar{Y}$ are the two partitions of $V$ ($\bar{Y}$ is the complement of $Y$), $\partial Y = \{e \in E \mid e \cap Y \neq \emptyset \text{ and } e \cap \bar{Y} \neq \emptyset\}$ is the hyperedge boundary of $Y$, and $d(e) = \sum_{v \in V} h(v, e)$ is the degree of the hyperedge $e$. $Hcut(Y, \bar{Y})$ is thus a weighted sum over the hyperedges on the boundary. To avoid unbalanced partitioning, a further normalization is conducted as

$$NHcut(Y, \bar{Y}) = Hcut(Y, \bar{Y}) \left( \frac{1}{vol(Y)} + \frac{1}{vol(\bar{Y})} \right), \tag{5}$$

where $vol(Y)$ and $vol(\bar{Y})$ are the respective volumes of $Y$ and $\bar{Y}$, defined as $vol(Y) = \sum_{v \in Y} d(v)$ and $vol(\bar{Y}) = \sum_{v \in \bar{Y}} d(v)$. As the connections among the vertices in the same cluster (sharing similar patterns) should be stronger while the connections among the vertices in two different clusters (sharing different patterns) should be weaker, an optimal hypergraph partitioning is the one that minimizes $NHcut(Y, \bar{Y})$ given in Eq. (5). This is an NP-complete problem. To solve it, the problem is relaxed to a real-valued optimization whose cost function is written as

$$\Omega(f) = f^{T} \Delta f, \tag{6}$$

where $\Delta$ is the hypergraph Laplacian, defined as $\Delta = \mathbf{I} - \Theta$. Here, $\Delta$ is a positive semi-definite matrix, $\mathbf{I}$ is an identity matrix, and

$$\Theta = \mathbf{D}_v^{-1/2}\, \mathbf{H} \mathbf{W} \mathbf{D}_e^{-1} \mathbf{H}^{T}\, \mathbf{D}_v^{-1/2}.$$

Minimizing the cost function then amounts to finding the eigenvectors of $\Delta$ with the smallest non-zero eigenvalues. In other words, the optimal hypergraph partitioning is obtained by finding the first several eigenvectors (according to the given feature size) with the smallest non-zero eigenvalues of $\Delta$ and forming an eigenspace for the subsequent vertex classification. In practice, we take the first $\ell$ (the given feature size) eigenvectors to form an eigenspace $\Delta'$ and adopt K-means to cluster $\Delta'$ into two classes.

Appendix D: Traditional EEG Features
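
As a companion to the hypergraph partitioning of Appendix C, the following minimal NumPy sketch builds the κ-nearest-neighbour incidence matrix, forms the Laplacian Δ = I − Θ, and clusters the eigenvectors with the smallest non-zero eigenvalues into two classes. Several choices are assumptions of this sketch: unit hyperedge weights, Euclidean distance, the zero-eigenvalue tolerance, the small deterministic 2-means routine, and all function names.

```python
import numpy as np


def knn_incidence(features, kappa):
    """Incidence matrix H (|V| x |E|): hyperedge e holds vertex e and its kappa-1 nearest vertices."""
    n = features.shape[0]
    sq = np.sum(features ** 2, axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * features @ features.T
    np.fill_diagonal(dist, -np.inf)            # the centroid always ranks first
    H = np.zeros((n, n))
    for e in range(n):
        members = np.argsort(dist[e])[:kappa]
        H[members, e] = 1.0                    # Eq. (1)
    return H


def hypergraph_laplacian(H, w=None):
    """Delta = I - D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2} (unit weights assumed by default)."""
    w = np.ones(H.shape[1]) if w is None else np.asarray(w, dtype=float)
    d_v = H @ w                                # vertex degrees, Eq. (2)
    d_e = H.sum(axis=0)                        # hyperedge degrees, Eq. (3)
    scaled = H / np.sqrt(d_v)[:, None]         # D_v^{-1/2} H
    theta = (scaled * (w / d_e)) @ scaled.T
    return np.eye(H.shape[0]) - theta


def spectral_embedding(laplacian, ell, tol=1e-8):
    """Eigenvectors belonging to the ell smallest eigenvalues above tol."""
    eigvals, eigvecs = np.linalg.eigh(laplacian)   # ascending eigenvalues
    nonzero = np.flatnonzero(eigvals > tol)
    return eigvecs[:, nonzero[:ell]]


def two_means(X, n_iter=50):
    """Tiny 2-cluster K-means with a deterministic initialisation."""
    centers = X[[np.argmin(X[:, 0]), np.argmax(X[:, 0])]].copy()
    labels = np.zeros(X.shape[0], dtype=int)
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for k in range(2):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels


# Synthetic stand-in: 30 trials (vertices), each described by 4 features.
rng = np.random.default_rng(0)
trial_features = rng.standard_normal((30, 4))
H = knn_incidence(trial_features, kappa=5)
laplacian = hypergraph_laplacian(H)
embedding = spectral_embedding(laplacian, ell=2)
labels = two_means(embedding)
```

In a real setting, `trial_features` would be replaced by the per-trial features, and a library K-means would normally replace `two_means`.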

Widely used EEG features include patterns extracted in the time domain and spectral power in the frequency domain. The traditional EEG features used in this paper for comparison are described in detail below.

Time-domain feature extraction

We extract EEG features in the time domain and identify the general characteristics of the time-series EEG data, including (1) statistical features (7 features): power, mean, standard deviation, 1st difference, normalized 1st difference, 2nd difference and normalized 2nd difference; (2) Hjorth features (3 features): activity, mobility and complexity; and (3) fractal dimension (1 feature): based on Sevcik's method [Sevcik, 1998], the FD value is measured to characterize the shape information of the EEG time-series data. For each segment, the extracted time-domain features constitute a total of 11 features × 32 channels.
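
A minimal NumPy sketch of how the 11 time-domain features per channel could be computed is given here for illustration. The exact definitions used in the paper are not stated, so the formulas below are assumptions: the difference features follow the common mean-absolute-difference convention, the Hjorth parameters use discrete differences, and the fractal dimension uses the usual statement of Sevcik's estimate, 1 + ln(L) / ln(2(N − 1)), on the curve normalized to the unit square. All function names are hypothetical.

```python
import numpy as np


def hjorth(x):
    """Hjorth activity, mobility and complexity from discrete differences."""
    dx = np.diff(x)
    ddx = np.diff(dx)
    activity = np.var(x)
    mobility = np.sqrt(np.var(dx) / activity)
    complexity = np.sqrt(np.var(ddx) / np.var(dx)) / mobility
    return activity, mobility, complexity


def sevcik_fd(x):
    """Sevcik-style fractal dimension of a waveform (assumed formula, see lead-in)."""
    n = x.size
    y = (x - x.min()) / (x.max() - x.min())    # amplitude mapped to [0, 1]
    t = np.linspace(0.0, 1.0, n)               # time mapped to [0, 1]
    length = np.sum(np.hypot(np.diff(t), np.diff(y)))
    return 1.0 + np.log(length) / np.log(2.0 * (n - 1))


def time_features(x):
    """11 features for one channel: 7 statistical, 3 Hjorth, 1 fractal dimension."""
    sd = np.std(x)
    d1 = np.mean(np.abs(x[1:] - x[:-1]))       # 1st difference
    d2 = np.mean(np.abs(x[2:] - x[:-2]))       # 2nd difference
    stats = [np.mean(x ** 2), np.mean(x), sd, d1, d1 / sd, d2, d2 / sd]
    return np.array(stats + list(hjorth(x)) + [sevcik_fd(x)])


def segment_features(segment):
    """Feature matrix of shape (n_channels, 11) for one (n_channels, n_samples) segment."""
    return np.stack([time_features(channel) for channel in segment])


# Synthetic stand-in for one 32-channel segment.
rng = np.random.default_rng(0)
features = segment_features(rng.standard_normal((32, 512)))
print(features.shape)
```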

Frequency-domain feature extraction

To account for the variability in EEG rhythms across subjects, the individual alpha frequency (IAF) is computed to dynamically define the edges of the frequency bands. Here, IAF is computed from the baseline data and does not involve any emotion factor. We then extract the average EEG power spectrum in each subject-wise and trial-wise frequency band (defined by the calculated IAF values) from each EEG channel, based on the spectral power distribution computed using a Hamming window with 50% overlap. The extracted band powers are named as follows: theta ($\theta$), alpha ($\alpha$), beta1 ($\beta$), gamma ($\gamma$). For each segment, the extracted power spectral features constitute a total of 4 features × 32 channels.
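
The band-power extraction can be sketched with NumPy as follows, using a Welch-style average of Hamming-windowed periodograms with 50% overlap. The IAF definition (PSD peak between 8 and 13 Hz) and the band edges relative to IAF in `iaf_bands` are hypothetical placeholders, since the paper does not state them; the function names and the even `nperseg` are likewise assumptions of this sketch.

```python
import numpy as np


def welch_psd(x, fs, nperseg):
    """One-sided PSD from Hamming-windowed segments with 50% overlap (nperseg even)."""
    window = np.hamming(nperseg)
    step = nperseg // 2
    scale = fs * np.sum(window ** 2)
    spectra = []
    for start in range(0, x.size - nperseg + 1, step):
        seg = x[start:start + nperseg]
        seg = seg - seg.mean()
        spectra.append(np.abs(np.fft.rfft(window * seg)) ** 2 / scale)
    psd = np.mean(spectra, axis=0)
    psd[1:-1] *= 2.0                           # fold negative frequencies
    return np.fft.rfftfreq(nperseg, d=1.0 / fs), psd


def individual_alpha_frequency(baseline, fs, nperseg, search=(8.0, 13.0)):
    """Assumed IAF definition: frequency of the PSD peak inside the search range."""
    freqs, psd = welch_psd(baseline, fs, nperseg)
    mask = (freqs >= search[0]) & (freqs <= search[1])
    return freqs[mask][np.argmax(psd[mask])]


def iaf_bands(iaf):
    """HYPOTHETICAL band edges anchored to IAF; the paper's edges are not given."""
    return {
        "theta": (iaf - 6.0, iaf - 2.0),
        "alpha": (iaf - 2.0, iaf + 2.0),
        "beta": (iaf + 2.0, 30.0),
        "gamma": (30.0, 45.0),
    }


def band_powers(x, fs, bands, nperseg):
    """Average PSD value inside each band for one channel."""
    freqs, psd = welch_psd(x, fs, nperseg)
    return {name: psd[(freqs >= lo) & (freqs < hi)].mean()
            for name, (lo, hi) in bands.items()}


# Synthetic channel: a 10 Hz rhythm plus weak noise, 8 s at 128 Hz.
rng = np.random.default_rng(0)
fs = 128
t = np.arange(8 * fs) / fs
channel = np.sin(2 * np.pi * 10.0 * t) + 0.1 * rng.standard_normal(t.size)

iaf = individual_alpha_frequency(channel, fs, nperseg=256)
powers = band_powers(channel, fs, iaf_bands(iaf), nperseg=256)
```

For a 32-channel segment, `band_powers` would be applied per channel, with IAF estimated from the baseline period rather than from the same data as in this toy example.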

Differential-entropy features

Based on the calculated subject-wise and trial-wise frequency bands defined by the IAF values, differential entropy is calculated from each EEG channel in each frequency band {$\theta$, $\alpha$, $\beta$, $\gamma$} to measure the complexity of the EEG signals [Shi et al., 2013]. For each segment, the extracted differential entropy features constitute a total of 4 features × 32 channels.
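
A minimal sketch of the differential-entropy computation is given below. It assumes that the band-limited signal is approximately Gaussian, in which case the differential entropy reduces to 0.5 ln(2πeσ²). The brick-wall FFT band-pass, the fixed band edges (placeholders for the IAF-defined edges) and the function names are assumptions of this sketch.

```python
import numpy as np


def fft_bandpass(x, fs, lo, hi):
    """Brick-wall band-pass: zero every FFT bin outside [lo, hi)."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)
    spectrum[(freqs < lo) | (freqs >= hi)] = 0.0
    return np.fft.irfft(spectrum, n=x.size)


def gaussian_de(x):
    """Differential entropy of a Gaussian with the sample variance of x."""
    return 0.5 * np.log(2.0 * np.pi * np.e * np.var(x))


def de_features(segment, fs, bands):
    """Matrix of shape (n_channels, n_bands) for one (n_channels, n_samples) segment."""
    return np.array([[gaussian_de(fft_bandpass(channel, fs, lo, hi))
                      for lo, hi in bands.values()]
                     for channel in segment])


# Placeholder band edges in Hz; the paper derives them from IAF instead.
bands = {"theta": (4.0, 8.0), "alpha": (8.0, 13.0),
         "beta": (13.0, 30.0), "gamma": (30.0, 45.0)}

rng = np.random.default_rng(0)
de = de_features(rng.standard_normal((32, 512)), fs=128, bands=bands)
print(de.shape)
```

Under the Gaussian assumption, doubling the signal amplitude raises the differential entropy by exactly ln 2, which gives a quick sanity check for an implementation.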

References

Delorme, A. and Makeig, S., 2004. EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis. Journal of Neuroscience Methods, 134(1), pp. 9-21.
Jasper, H.H., 1958. The ten-twenty electrode system of the International Federation. Electroencephalography and Clinical Neurophysiology, 10, pp. 370-375.
Koelstra, S., Muhl, C., Soleymani, M., Lee, J.S., Yazdani, A., Ebrahimi, T., Pun, T., Nijholt, A. and Patras, I., 2012. DEAP: A database for emotion analysis using physiological signals. IEEE Transactions on Affective Computing, 3(1), pp. 18-31.
Morris, J.D., 1995. Observations: SAM: the Self-Assessment Manikin; an efficient cross-cultural measurement of emotional response. Journal of Advertising Research, 35(6), pp. 63-68.
Sevcik, C., 2010. A procedure to estimate the fractal dimension of waveforms.