A Novel Transferability Attention Neural Network Model for EEG Emotion Recognition
Yang Li, Boxun Fu, Fu Li∗, Guangming Shi, Wenming Zheng

Yang Li, Boxun Fu, Fu Li and Guangming Shi are with the Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, the School of Artificial Intelligence, Xidian University, Xi'an, 710071, China. (∗Corresponding author: Fu Li. E-mail: [email protected].) Wenming Zheng is with the Key Laboratory of Child Development and Learning Science (Ministry of Education), School of Biological Sciences and Medical Engineering, Southeast University, Nanjing, Jiangsu, 210096, China.

Abstract—Existing methods for electroencephalograph (EEG) emotion recognition usually train models on all the EEG samples indistinguishably. However, some of the source (training) samples may exert a negative influence because they are significantly dissimilar to the target (test) samples. It is therefore necessary to give more attention to the EEG samples with strong transferability rather than forcefully training a classification model on all the samples. Furthermore, from the perspective of neuroscience, not all brain regions of an EEG sample contain emotional information that can be transferred to the test data effectively; some brain region data may even have a strong negative effect on learning the emotional classification model. Considering these two issues, in this paper we propose a transferable attention neural network (TANN) for EEG emotion recognition, which learns the emotional discriminative information by adaptively highlighting the transferable brain region data and samples through local and global attention mechanisms. This is implemented by measuring the outputs of multiple brain-region-level discriminators and one single sample-level discriminator. We conduct extensive experiments on three public EEG emotional datasets. The results validate that the proposed model achieves state-of-the-art performance.
Index Terms—EEG emotion recognition, transferable attention, brain region
I. INTRODUCTION
Emotion plays an important role in human daily life. It influences our rational decision-making, perception and cognition, and is essential in interpersonal communication [1]. Thus, it is necessary to enable machines to understand human emotions in the field of human-computer interaction (HCI). To this end, the technology of emotion recognition provides a possible way for computers to capture human emotions, which is the first step towards improving and humanizing the interaction between humans and machines.

Generally, emotion recognition measures emotional states by analyzing data of bodily reactions under emotional conditions [2]. These reactions, including speech, facial expression and gesture, can adequately express our emotions under most circumstances. Nevertheless, these methods are subjective and cannot guarantee the authenticity of emotion [3]. Apart from these external signals, internal physiological variables tend to be much closer to the real emotions. The human brain, as the source of all these reactions,
can reflect mental activities, including emotional states. According to studies in neurophysiology and psychology, EEG records brain neural activities and can be used to decode effective information about human emotional states [4], [5]. Consequently, EEG emotion recognition has received substantial attention from the human-computer interaction and pattern recognition research communities in recent years [6], [7], [8].

Most EEG emotion recognition methods focus on two major tasks, i.e., EEG feature extraction and classification. The first task aims at seeking discriminative emotion-related information from the raw EEG signals. EEG emotional signals usually arise from many neural processes and hence present highly heterogeneous and nonstationary behavior [2]. Hence, how to extract the specific emotion information that contributes to emotion recognition is a very important task. In [9], Jenke et al. summarized and evaluated the existing EEG features extracted from the time domain, frequency domain and time-frequency domain on their self-recorded EEG emotional dataset. The target of classification is modeling the correlation between the EEG emotional features and the class labels, which leads to the interpretation of raw EEG emotional signals. Classification performance provides insight into how well a trained model can estimate the emotional state. Many advanced classification algorithms have been proposed over the years. For example, Zheng et al. [10] proposed a group sparse canonical correlation analysis method for simultaneous EEG channel selection and emotion recognition. Li et al. [8] fused information propagation patterns and activation differences in the brain to improve emotion recognition. In [11], Alarcao and Fonseca summarized, reviewed and compared these works comprehensively.

Recently, many domain adaptation methods have been proposed for EEG emotion recognition, especially in the subject-independent task, where the source and target data come from different subjects. These methods have significantly advanced the EEG emotion recognition task. For example, Zheng and Lu [12] evaluated four different domain adaptation approaches, including transfer component analysis (TCA) [13], kernel principal component analysis (KPCA) [14], transductive support vector machine (T-SVM) [15] and transductive parameter transfer (TPT) [16], on the SEED dataset, and found that the accuracy can be improved by 20% compared with a generic classifier. Lan et al. [17] made a comparative study of several state-of-the-art domain adaptation techniques on two EEG emotional datasets, and the experimental results show that using domain adaptation can improve the accuracy significantly, by 7.25% and 13.40%, compared with the baseline accuracy where no domain adaptation technique is used. Among all the domain adaptation methods, the most well-established one is the domain adversarial neural network (DANN) [18], which constructs a two-player mini-max game by using a domain discriminator that works adversarially with the feature extractor to generate domain-invariant data representations. Li et al.
adopted this setting and proposed a bi-hemisphere domain adversarial neural network (BiDANN) for EEG emotion recognition, achieving state-of-the-art performance [19].

Nevertheless, we argue that two issues need to be better addressed in EEG emotion recognition tasks. The first is how to identify the positive EEG samples that carry more emotion-related information. EEG emotional signals usually arise from many neural processes and are vulnerable to the negative effect of irrelevant knowledge, which causes some training EEG samples to be significantly dissimilar to the test ones. Exploring how to highlight the positive EEG emotional samples and weaken the effect of negative samples will contribute to emotion recognition. The second issue is how to weight the variability of different brain regions for EEG emotion recognition. Studies in neuroscience have shown that different brain regions contribute differently to emotion expression [20]. In an EEG emotional sample, it is obvious that not all brain regions contain emotional knowledge that can be transferred to the test samples. A strategy that distinguishes the transferable from the nontransferable brain regions is therefore helpful for improving EEG emotion recognition.

To this end, in this paper we propose a transferable attention neural network (TANN) to deal with the above transferability learning problem for EEG emotion recognition. The transferability of the data can be measured from the outputs of domain discriminators. Specifically, for the domain adversarial neural network [18], the output of the domain discriminator is the probability that the input data belongs to the source or target domain. When the probability approaches 0, the input data belongs to the source domain, while a probability approaching 1 indicates that it belongs to the target domain. Therefore, TANN takes advantage of the domain discriminator to measure the transferability from the training data to the test data. Concretely, the framework of TANN includes the following three major modules:

• Feature extractor.
The goal of the feature extractor is to extract high-level discriminative deep features from raw EEG data for classification. EEG data are recorded from several electrodes placed at predefined coordinates on the scalp, referring to the locations of different brain regions. In the feature learning procedure, we should retain this intrinsic structural information, which is helpful for classification. To achieve this, TANN employs two directional recurrent neural networks (RNNs) that traverse all the electrodes along the horizontal and vertical directions, which constructs a complete relationship among the EEG electrodes and generates discriminative deep features for all of them.

• Attention module.
The attention module aims to weight the input training data according to its level of transferability. For EEG emotional data, there is a large distribution gap between training and test data, so some training EEG data are significantly dissimilar to the test data. Moreover, from the perspective of neuroscience, not all brain regions of an EEG sample contain emotional information that can be transferred to the test data effectively. Therefore, TANN employs multiple brain-region-level discriminators and one sample-level discriminator to assess the transferability of each EEG sample and the brain region data inside it, and then strengthens or weakens the contributions of these brain regions and samples for emotion classification.

• Classifier.
Like most supervised learning methods, we introduce a classifier to predict the emotion class label based on the deep features obtained by the feature extractor. It guides the feature extraction process towards generating more discriminative EEG features for emotion classification.

To the best of our knowledge, this is the first work to exploit the global and local transferability of EEG signals for emotion recognition. The experimental results verify that the proposed TANN method achieves state-of-the-art performance on three public datasets.
II. PRELIMINARY
In this section, we briefly overview the preliminary of transferable attention and then address how we can apply it to EEG emotion recognition.

Most attention-based methods focus on how to highlight or weaken different parts of an image according to their contribution to classification, but neglect the evaluation of each training sample [21]. It is known that not all training samples are similar to the test samples. Feeding the model with all the training samples forcefully introduces a negative influence into the learning process. Transferable attention (TA) is designed to deal with this problem [22]. When a training sample is easier to transfer to the test data, it is rewarded with more attention due to its high similarity with the test data; this is called transferable attention. Inspired by adversarial learning methods, this attention can be realized from the outputs of the discriminator, which reflect the similarity between training and test data.

Since not all training EEG data are useful in the process of learning a model for EEG emotion recognition, exploring the transferability of EEG data is meaningful and can further improve EEG emotion recognition.
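To make the entropy criterion concrete, the following minimal NumPy sketch (our own illustration, not code from the paper; the base-2 logarithm is our choice, which scales the score to [0, 1]) turns a discriminator's source-probability output into a transferability score:

```python
import numpy as np

def transferability(p_source: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Binary entropy of a domain discriminator's output.

    p_source is the discriminator's probability that the input comes
    from the source domain; 1 - p_source is the target probability.
    Entropy peaks when p_source = 0.5, i.e. when the discriminator is
    maximally confused, which is exactly the transferable case.
    """
    p = np.clip(p_source, eps, 1.0 - eps)  # guard against log(0)
    return -p * np.log2(p) - (1.0 - p) * np.log2(1.0 - p)

# Discriminator outputs for three training samples: confidently source,
# fully confused, confidently target. The confused sample scores highest.
print(transferability(np.array([0.05, 0.5, 0.95])))  # ≈ [0.29, 1.0, 0.29]
```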
III. THE PROPOSED MODEL FOR EEG EMOTION RECOGNITION
To specify the proposed method clearly, we illustrate the framework of the proposed TANN model in Fig. 1. TANN aims to distinguish which training samples are easy or hard to transfer to the test samples. By penalizing these training samples, it can further improve EEG emotion recognition. Besides, considering that not all brain regions have equal transferability, TANN not only measures the similarity across EEG samples but also focuses on the brain regions with
high transferability. To achieve this goal, we apply local attention to the brain region data inside each EEG emotion sample and global attention to the whole sample. These attention weights are obtained from the outputs of multiple local domain discriminators and one global domain discriminator. Concretely, TANN consists of three major modules, i.e., the feature extractor, the attention layers, and the classifier. In the following, we illustrate these parts in detail.

Fig. 1: The framework of TANN. TANN consists of two major modules, i.e., local and global attentions, which make the model focus on the brain regions and samples with higher transferability.
A. Feature extractor
The process of feature extraction is depicted in Fig. 2, and its goal is to represent the EEG emotional data in a more discriminative feature space so as to improve EEG classification performance. The EEG deep features are extracted by two directional RNN modules that traverse the spatial regions in two predefined scanning orders, determined with respect to the horizontal and vertical directions. These two directional RNNs are complementary and construct a complete relationship of electrode locations, which avoids losing the intrinsic structural information of the EEG data. By doing this, we obtain a high-level feature for each EEG electrode, which facilitates constructing the brain regions' features.

Fig. 2: The process of feature extraction. We first extract the deep feature for each electrode, and then rearrange them to form the data representation of brain regions.

Concretely, for an EEG sample $X = [\mathbf{x}_1, \cdots, \mathbf{x}_n] \in \mathbb{R}^{d \times n}$, where $d$ and $n$ are the dimension and the number of EEG electrodes, the above process can be formulated as

$$\mathbf{s}_i^h = \sigma\Big(U^h \mathbf{x}_i^h + \sum_{j=1}^{n} e_{ij}^h V^h \mathbf{h}_j^h + \mathbf{b}^h\Big) \in \mathbb{R}^{d_f}, \quad e_{ij}^h = \begin{cases} 1, & \text{if } \mathbf{x}_j^h \in \mathcal{N}(\mathbf{x}_i^h), \\ 0, & \text{otherwise}, \end{cases} \tag{1}$$

$$\mathbf{s}_i^v = \sigma\Big(U^v \mathbf{x}_i^v + \sum_{j=1}^{n} e_{ij}^v V^v \mathbf{h}_j^v + \mathbf{b}^v\Big) \in \mathbb{R}^{d_f}, \quad e_{ij}^v = \begin{cases} 1, & \text{if } \mathbf{x}_j^v \in \mathcal{N}(\mathbf{x}_i^v), \\ 0, & \text{otherwise}, \end{cases} \tag{2}$$

where $\mathbf{s}_i^{\cdot}$ is the hidden unit of the RNN module as well as the data representation of electrode $\mathbf{x}_i$, and $d_f$ is its dimension; $\{U^{\cdot} \in \mathbb{R}^{d_f \times d}, V^{\cdot} \in \mathbb{R}^{d_f \times d_f}, \mathbf{b}^{\cdot} \in \mathbb{R}^{d_f \times 1}\}$ are the learnable transformation matrices of the RNN module; $\sigma(\cdot)$ denotes a nonlinear operation such as the Sigmoid function; and $\mathcal{N}(\mathbf{x}_i^{\cdot})$ denotes the set of predecessors of node $\mathbf{x}_i^{\cdot}$.

Because TANN uses horizontal and vertical directional RNNs to represent each EEG electrode, we obtain data representations that contain not only the information of the electrode itself but also its neighborhood relationships. Specifically, $S^h = \{\mathbf{s}_i^h\}$ contains the information from the left and right electrodes, and $S^v = \{\mathbf{s}_i^v\}$ includes the information from the upper and lower electrodes. To integrate this spatial information into an overall representation, we arrange the order of the columns of $S^h$ and $S^v$, and use two transformation matrices $P$ and $Q$ to obtain the deep features $H = \{\mathbf{h}_k\}$ for all the electrodes, in which

$$\mathbf{h}_i = P\mathbf{s}_i^h + Q\mathbf{s}_i^v + \mathbf{b} \in \mathbb{R}^{d_f'}, \quad i \in \{1, \cdots, n\}. \tag{3}$$

Here $\mathbf{h}_i$ is the deep representation of electrode $\mathbf{x}_i$ that keeps the locational structural relation, and $d_f'$ is its dimension.
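As a rough illustration of Eqs. (1)-(3), the sketch below substitutes the paper's predecessor-based spatial RNNs with two ordinary PyTorch `nn.RNN` modules run over a horizontal and a vertical ordering of the electrodes; the orderings, sizes, and module choices are our assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class SpatialRNNExtractor(nn.Module):
    """Hedged sketch of the two-directional RNN feature extractor:
    one RNN scans the electrodes in a horizontal order, the other in a
    vertical order, and a linear projection fuses the two streams as in
    Eq. (3). The scan orders here are placeholder permutations; the real
    ones follow the electrode layout on the scalp."""

    def __init__(self, d_in, d_f, d_out, horiz_order, vert_order):
        super().__init__()
        self.horiz_order = torch.as_tensor(list(horiz_order))
        self.vert_order = torch.as_tensor(list(vert_order))
        self.rnn_h = nn.RNN(d_in, d_f, batch_first=True)   # stands in for Eq. (1)
        self.rnn_v = nn.RNN(d_in, d_f, batch_first=True)   # stands in for Eq. (2)
        self.proj_h = nn.Linear(d_f, d_out, bias=False)    # P in Eq. (3)
        self.proj_v = nn.Linear(d_f, d_out, bias=True)     # Q and b in Eq. (3)

    def forward(self, x):
        # x: (batch, n_elec, d_in), one feature vector per electrode.
        s_h, _ = self.rnn_h(x[:, self.horiz_order, :])
        s_v, _ = self.rnn_v(x[:, self.vert_order, :])
        # Undo the permutations so both streams are electrode-aligned.
        s_h = s_h[:, torch.argsort(self.horiz_order), :]
        s_v = s_v[:, torch.argsort(self.vert_order), :]
        return self.proj_h(s_h) + self.proj_v(s_v)         # h_i of Eq. (3)

# Toy usage: 62 electrodes with 5-band features, batch of 4.
extractor = SpatialRNNExtractor(d_in=5, d_f=32, d_out=32,
                                horiz_order=range(62), vert_order=range(62))
print(extractor(torch.randn(4, 62, 5)).shape)  # torch.Size([4, 62, 32])
```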
B. Attention layers

For EEG emotion samples, there is a large distribution gap between training and test data; some training samples are very dissimilar to the test ones. Therefore, to avoid training a model on all the source samples indiscriminately, TANN measures the transferability of all the training samples and then strengthens or weakens them in the learning process. Besides, for emotion recognition, not all brain regions of an EEG sample contain emotional information that can be transferred to the test data effectively; some brain regions are more transferable than others. For this reason, TANN employs not only a global attention layer to weight the sample-level transferability but also a local attention layer, as a complement, to focus on the brain-region-level transferability. Specifically, the transferability is quantified by the entropy of the outputs of a domain discriminator. The domain discriminator generates the probability of confusion between source (training) and target (test) data. When the probability approaches 0.5, the input confuses the domain discriminator well, which matches our need to highlight data with positive transferability. In the following, we demonstrate how to achieve the local and global attentions by transferability learning.
1) Local transferable attention on brain-region-level:
After obtaining the data representation $\mathbf{h}_i$ of each electrode of $X$, TANN employs local attention to highlight the brain regions with high transferability. Here we first group the electrodes into several clusters according to the associated brain region locations, which can be formulated as

$$\text{brain region } 1: H^1 = [\mathbf{h}_1^1, \mathbf{h}_2^1, \cdots, \mathbf{h}_{n_1}^1], \quad \cdots, \quad \text{brain region } N: H^N = [\mathbf{h}_1^N, \mathbf{h}_2^N, \cdots, \mathbf{h}_{n_N}^N], \tag{4}$$

where $N$ is the number of brain regions, $n_c$ denotes the number of electrodes in the $c$-th brain region, and $n_1 + \cdots + n_N = n$. In this case, the reordered deep feature can be expressed as

$$\hat{H} = [H^1, \cdots, H^N]. \tag{5}$$

Based on the above process, we can obtain the deep features of all the brain regions from the source and target EEG samples, denoted as $\hat{H}_S = [H_S^1, \cdots, H_S^N]$ and $\hat{H}_T = [H_T^1, \cdots, H_T^N]$. They are then fed to $N$ local discriminators to calculate the transferability. Concretely, let $d^{N_i} = \{d_s^{N_i}, d_t^{N_i}\}$ denote the output probability of the discriminator for brain region $N_i$, where $d_s^{N_i}$ and $d_t^{N_i}$ are the probabilities that the input belongs to the source and target data, respectively. Then we can quantify the transferability of this brain region through the entropy function of information theory [22], which is defined as

$$H(d^{N_i}) = -d_s^{N_i} \cdot \log(d_s^{N_i}) - d_t^{N_i} \cdot \log(d_t^{N_i}). \tag{6}$$

The higher the transferability of a brain region, the larger its attention value.

However, for an EEG signal, the emotion information is the component that is most difficult to transfer. For this reason, we reverse the attention values of the brain regions to make the model pay attention to the brain regions that are difficult to transfer. Thus the attention value for brain region $N_i$ is defined as

$$w^{N_i} = 1 - H(d^{N_i}). \tag{7}$$

Besides, to mitigate the negative effect of wrong attention, we adopt the residual attention mechanism to make the model more robust. Thus, after the local attention layer, the data representation of EEG sample $X$ can be formulated as

$$\hat{H}' = [(1 + w^1) H^1, \cdots, (1 + w^N) H^N] \in \mathbb{R}^{d_f' \times n}. \tag{8}$$

The loss function of the local discriminators for all the brain regions can be formulated as

$$L_{ld} = \frac{1}{N} \sum_{N_i=1}^{N} L_{ld}^{N_i}(X_S, X_T \,|\, \theta_{ld}^{N_i}), \tag{9}$$

where

$$L_{ld}^{N_i} = -\sum_{t=1}^{M_1} \log p(0 \,|\, X_{S_t}^{N_i}) - \sum_{t'=1}^{M_2} \log p(1 \,|\, X_{T_{t'}}^{N_i}) \tag{10}$$

denotes the loss of the local discriminator for brain region $N_i$; $p(0 \,|\, X_{S_t}^{N_i})$ and $p(1 \,|\, X_{T_{t'}}^{N_i})$ are the probabilities that the input data belongs to the source and target domains, respectively; $\theta_{ld}^{N_i}$ is the parameter of the local attention network; $X_{S_t}^{N_i}$ and $X_{T_{t'}}^{N_i}$ represent the $N_i$-th brain region data of the $t$-th source and $t'$-th target sample, respectively; and $M_1$ and $M_2$ are the numbers of source and target samples.
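The following hedged sketch implements Eqs. (4)-(8) for a toy configuration (two regions of three electrodes each; the paper uses N = 16 regions of unequal size). Each region discriminator is a single linear layer here, and the natural logarithm is used in Eq. (6); both are our simplifications.

```python
import torch
import torch.nn as nn

def local_attention(h, region_indices, local_discs, eps=1e-12):
    """h: (batch, n_elec, d_f) electrode features from the extractor.
    region_indices: one index tensor per brain region (Eq. 4).
    local_discs: one discriminator per region, mapping the flattened
    region feature to a source-probability logit.
    Returns the residually re-weighted features of Eq. (8)."""
    out = []
    for idx, disc in zip(region_indices, local_discs):
        region = h[:, idx, :]                                         # H^i
        p = torch.sigmoid(disc(region.flatten(1)))                    # d_s^{N_i}
        ent = -(p * (p + eps).log()
                + (1 - p) * (1 - p + eps).log())                      # Eq. (6)
        w = 1.0 - ent                                                 # reversed attention, Eq. (7)
        out.append((1.0 + w).unsqueeze(-1) * region)                  # residual weighting, Eq. (8)
    return torch.cat(out, dim=1)

# Toy usage: 6 electrodes grouped into 2 regions, d_f = 32.
regions = [torch.arange(0, 3), torch.arange(3, 6)]
discs = nn.ModuleList([nn.Linear(3 * 32, 1) for _ in regions])
print(local_attention(torch.randn(4, 6, 32), regions, discs).shape)
# torch.Size([4, 6, 32])
```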
2) Global transferable attention on sample-level:
Although the above local attention over all the brain regions enables fine-grained transfer learning between the source and target domain data, it is possible that the local domain discriminators find only a few brain regions to transfer. Meanwhile, due to the distribution difference, there are some negative samples in the source data that are very dissimilar to the target data. It weakens the efficiency if we force the model to train on these negative samples equally with the other positive samples. Hence, after weighting the transferability of the brain regions with local attention, we adopt global transferable attention at the sample level to transfer knowledge from the source to the target domain.

Concretely, after the local attention module, the input feature can be expressed as

$$\tilde{H} = \hat{H}' S \in \mathbb{R}^{d_f' \times n'}, \tag{11}$$

where $S$ is a learnable transformation matrix. It is then sent to a global discriminator

$$L_{gd}(X_S, X_T \,|\, \theta_{gd}) = -\sum_{t=1}^{M_1} \log p(0 \,|\, X_{S_t}) - \sum_{t'=1}^{M_2} \log p(1 \,|\, X_{T_{t'}}) \tag{12}$$

to highlight the EEG samples with higher transferability, where $\theta_{gd}$ is the parameter of the global attention network. Concretely, let $d = \{d_s, d_t\}$ denote the output probability of the global discriminator, where $d_s$ and $d_t$ are the probabilities that the input belongs to the source and target data, respectively. The global attention value $w$ can be calculated as

$$w = 1 + H(d), \tag{13}$$

$$H(d) = -d_s \cdot \log(d_s) - d_t \cdot \log(d_t). \tag{14}$$

Here we also adopt the residual mechanism to guard against wrong attention. In this case, the higher the transferability, the larger the attention value $w$.

Inspired by Long et al. [23], the entropy minimization principle can refine classifier adaptation and increase the confidence of the classifier prediction. Thus, we utilize the global domain discriminator to generate the global attention values, which act on the label entropy to enhance the certainty of the source samples that are more similar to the target samples. Then $w$ is embedded into the label entropy loss to realize the global attention. Hence the loss function of the label entropy, which we call the attentive entropy loss, can be written as

$$L_e(X_S, X_T \,|\, \theta_e) = \sum_{k=1}^{M_1 + M_2} \sum_{c=1}^{C} -w_k \cdot p(c \,|\, X_k) \cdot \log p(c \,|\, X_k), \tag{15}$$

where $X_k$ is the $k$-th sample in $\{X_S, X_T\}$; $w_k$ is the global attention value for EEG sample $X_k$; and $C$ is the number of emotion classes.
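A minimal sketch of Eqs. (13)-(15) follows, under our assumptions that the global attention weight is detached from the loss gradient and the natural logarithm is used; the paper does not spell out either choice.

```python
import torch
import torch.nn.functional as F

def attentive_entropy_loss(logits, p_source, eps=1e-12):
    """logits: (batch, C) classifier outputs for source and target samples.
    p_source: (batch,) global discriminator probability of 'source'."""
    p = p_source.clamp(eps, 1 - eps)
    h_d = -(p * p.log() + (1 - p) * (1 - p).log())        # Eq. (14)
    w = 1.0 + h_d                                          # Eq. (13), residual form
    probs = F.softmax(logits, dim=1)
    label_entropy = -(probs * (probs + eps).log()).sum(1)  # inner sum of Eq. (15)
    return (w.detach() * label_entropy).sum()              # Eq. (15)

# Samples that confuse the discriminator (p_source near 0.5) get a weight
# near 1 + ln 2 and dominate the entropy minimization; confidently
# classified source or target samples get a weight near 1.
```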
C. Classifier

To enhance the discriminative ability of the model, we add a classifier to the TANN model. Concretely, based on the final feature $\tilde{H}$ in Eq. (11), we first arrange the matrix $\tilde{H}$ into a vector $\tilde{\mathbf{h}}$, and then use a simple linear transform to predict the class label, which can be formulated as

$$O = G\tilde{\mathbf{h}} + \mathbf{b}_c = [o_1, \cdots, o_C], \tag{16}$$

where $G$ and $\mathbf{b}_c$ are the transformation matrix and bias. Finally, the output vector $O$ is fed into the softmax layer for emotion classification, which can be written as

$$p(c \,|\, X_t) = \exp(o_c) \Big/ \sum_{i=1}^{C} \exp(o_i), \tag{17}$$

where $p(c \,|\, X_t)$ denotes the predicted probability that the input sample $X_t$ belongs to the $c$-th class. As a result, the label $\tilde{l}$ of sample $X_t$ is predicted as

$$\tilde{l} = \arg\max_c \; p(c \,|\, X_t). \tag{18}$$

Hence, the loss function of the classifier can be expressed as

$$L_c(X_S \,|\, \theta_c) = \sum_{t=1}^{M_1} \sum_{c=1}^{C} -\tau(l, c) \cdot \log p(c \,|\, X_t), \quad \tau(l, c) = \begin{cases} 1, & \text{if } l = c, \\ 0, & \text{otherwise}, \end{cases} \tag{19}$$

where $\theta_c$ denotes the parameter of the classifier.
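For completeness, Eqs. (16)-(19) amount to a standard linear softmax classifier; the sketch below uses toy sizes of our choosing and PyTorch's `cross_entropy`, which averages over the batch where Eq. (19) sums.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

C, d_flat = 3, 16 * 32              # toy sizes: 3 emotions, flattened feature dim
classifier = nn.Linear(d_flat, C)   # G and b_c in Eq. (16)

h_tilde = torch.randn(8, d_flat)    # flattened attended features, 8 source samples
labels = torch.randint(0, C, (8,))

logits = classifier(h_tilde)              # O in Eq. (16)
probs = F.softmax(logits, dim=1)          # Eq. (17)
pred = probs.argmax(dim=1)                # Eq. (18)
loss_c = F.cross_entropy(logits, labels)  # Eq. (19), mean instead of sum
```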
D. The optimization

In summary, the overall loss function includes four parts, i.e., the local and global discriminator losses, the classifier loss and the attentive entropy loss. Concretely, the loss function of the proposed TANN method can be formulated as

$$L(X_S, X_T \,|\, \theta_c, \theta_e, \theta_{ld}, \theta_{gd}) = L_c(X_S \,|\, \theta_c) + \alpha L_e(X_S, X_T \,|\, \theta_e) - \beta \Big( \frac{1}{N} \sum_{N_i=1}^{N} L_{ld}^{N_i}(X_S, X_T \,|\, \theta_{ld}^{N_i}) + L_{gd}(X_S, X_T \,|\, \theta_{gd}) \Big), \tag{20}$$

where $\alpha$ and $\beta$ are hyper-parameters, and $L_{ld}^{N_i}$ and $L_{gd}$ represent the losses of the local and global attention discriminators. We then iteratively optimize the classifier, the attentive entropy, and the local and global attention discriminators. Concretely, the parameters can be found through the following minimization and maximization problems:

$$(\hat{\theta}_f, \hat{\theta}_c) = \arg\min_{\theta_f, \theta_c} L_c(X_S \,|\, \theta_f, \theta_c, \hat{\theta}_e, \hat{\theta}_{ld}, \hat{\theta}_{gd}), \tag{21}$$

$$\hat{\theta}_e = \arg\min_{\theta_e} L_e(X_S, X_T \,|\, \hat{\theta}_f, \hat{\theta}_c, \theta_e, \hat{\theta}_{ld}, \hat{\theta}_{gd}), \tag{22}$$

$$\hat{\theta}_{ld}^{N_i} = \arg\max_{\theta_{ld}^{N_i}} L_{ld}^{N_i}(X_S, X_T \,|\, \hat{\theta}_f, \hat{\theta}_c, \hat{\theta}_e, \theta_{ld}^{N_i}, \hat{\theta}_{gd}), \tag{23}$$

$$\hat{\theta}_{gd} = \arg\max_{\theta_{gd}} L_{gd}(X_S, X_T \,|\, \hat{\theta}_f, \hat{\theta}_c, \hat{\theta}_e, \hat{\theta}_{ld}, \theta_{gd}). \tag{24}$$

The above maximization problems, i.e., Eqs. (23) and (24), can be converted into minimization problems by inserting a gradient reversal layer (GRL) [18] before each discriminator, which acts as an identity transform during forward propagation but reverses the sign of the gradient during back propagation. Then we can solve the parameter optimization easily with the stochastic gradient descent (SGD) algorithm. Specifically, the parameters are updated by the rules below:

$$\theta_c \leftarrow \theta_c - \frac{\partial L_c}{\partial \theta_c}, \quad \theta_e \leftarrow \theta_e - \alpha \cdot \frac{\partial L_e}{\partial \theta_e}, \tag{25}$$

$$\theta_{ld}^{N_i} \leftarrow \theta_{ld}^{N_i} - \beta \cdot \frac{\partial L_{ld}^{N_i}}{\partial \theta_{ld}^{N_i}}, \quad \theta_{gd} \leftarrow \theta_{gd} - \beta \cdot \frac{\partial L_{gd}}{\partial \theta_{gd}}, \tag{26}$$

$$\theta_f \leftarrow \theta_f - \Big( \frac{\partial L_c}{\partial \theta_f} + \alpha \cdot \frac{\partial L_e}{\partial \theta_f} - \beta \cdot \frac{\partial L_{ld}^{N_i}}{\partial \theta_f} - \beta \cdot \frac{\partial L_{gd}}{\partial \theta_f} \Big). \tag{27}$$
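A gradient reversal layer in the style of DANN [18] can be written as a small PyTorch autograd function; inserting it before each discriminator turns the max steps of Eqs. (23)-(24) into ordinary minimization, so one SGD step on the combined loss realizes the updates of Eqs. (25)-(27). The way the losses are combined in the trailing comment follows Eq. (20) but is otherwise our sketch.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; sign-flipped, scaled gradient in
    the backward pass, as in Ganin et al. [18]."""

    @staticmethod
    def forward(ctx, x, coeff):
        ctx.coeff = coeff
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.coeff * grad_output, None  # no gradient for coeff

def grl(x, coeff=1.0):
    return GradReverse.apply(x, coeff)

# One hedged training step: with grl() applied to every discriminator
# input, minimizing the combined loss below descends on the classifier
# and attentive entropy terms while ascending on the discriminator terms
# with respect to the feature extractor, as Eq. (20) requires.
# loss = loss_c + alpha * loss_e + beta * (local_losses.mean() + loss_gd)
# loss.backward(); optimizer.step()
```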
IV. EXPERIMENTS

A. Datasets and settings
To evaluate the proposed TANN method adequately, we conduct experiments on three public EEG emotion datasets, namely:
(1) SEED [7] is a standard benchmark for EEG emotion recognition. It contains EEG emotional signals of three types of emotions, i.e., happy, neutral and sad, from 15 subjects.

(2) SEED-IV [24] includes four types of emotions from 15 subjects. Compared with SEED, it contains an extra emotion, fear.

(3) MPED [25] includes seven refined emotion types, i.e., joy, funny, neutral, sad, fear, disgust and anger, from 30 subjects.¹

¹Note that both SEED-IV and MPED are multi-modal datasets. MPED consists of 30 subjects' EEG data, among which 23 subjects contain multi-modal data. In this experiment, we only use the EEG modality.

On these datasets, we design two kinds of EEG emotion recognition experiments, the subject-dependent and the subject-independent ones. Table I summarizes the numbers of training and test samples, and the experimental protocols used in the experiments. The concrete protocols are described as follows:

• The subject-dependent experiment - In this experiment, the training and test data come from the same subject but different trials. We adopt the same protocols as [7], [24] and [26]. Namely, for SEED, we use the first nine trials of EEG data per session of each subject as source (training) domain data and the remaining six trials per session as target (test) domain data; for SEED-IV, we use the first sixteen trials per session of each subject as training data, and the last eight trials, which cover all the emotions (each emotion with two trials), as test data; for MPED, we use twenty-one trials of EEG data as training data and the remaining seven trials, consisting of the seven emotions, as test data for each subject. The mean accuracy (ACC) and standard deviation (STD) over all the subjects in the dataset are used as the evaluation criteria.

• The subject-independent experiment - In this experiment, the training and test data come from different subjects, which is a harder task than the subject-dependent one but more conducive to practical applications. We adopt the leave-one-subject-out (LOSO) cross-validation strategy [12] to evaluate the proposed TANN model. The LOSO strategy uses the EEG signals of one subject as test data and the remaining subjects' EEG signals as training data; this procedure is repeated so that the EEG signals of each subject are used as test data once. Again, the mean accuracy (ACC) and standard deviation (STD) are used as the evaluation criteria.

Besides, we feed our model with the released handcrafted features, namely the differential entropy (DE) features for SEED and SEED-IV and the Short-Time Fourier Transform (STFT) features for MPED, so the size $d \times n$ of an input sample $X_t$ is given by the feature dimension $d$ and the $n = 62$ electrodes of each dataset. Moreover, in the experiments we respectively set the dimensions $d_f$ and $d_f'$ of the feature extractor to 32, the number of brain regions $N$ to 16,² and the dimension $n'$ of the input to the global attention layer to 6; the hyper-parameters $\alpha$ and $\beta$ are both set to 0.1 throughout the experiments. We implemented TANN using TensorFlow on one Nvidia 1080Ti GPU. The learning rate, momentum and weight decay rate are set to 0.003, 0.9 and 0.95, respectively. The network is trained using SGD with a batch size of 200.

²Concretely, the brain regions are the Pre-Frontal (AF3, FP1, FPZ, FP2, AF4), Frontal (F3, F1, FZ, F2, F4), Left Frontal (F7, F5), Right Frontal (F8, F6), Left Temporal (FT7, FC5, T7, C5, TP7, CP5), Right Temporal (FT8, FC6, T8, C6, TP8, CP6), Frontal Central (FC3, FC1, FCZ, FC2, FC4), Central (C3, C1, CZ, C2, C4), Central Parietal (CP3, CP1, CPZ, CP2, CP4), Left Parietal (P7, P5), Right Parietal (P8, P6), Parietal (P3, P1, PZ, P2, P4), Left Parietal Occipital (PO7, PO5, CB1), Right Parietal Occipital (PO8, PO6, CB2), Parietal Occipital (PO3, POZ, PO4), and Occipital (O1, OZ, O2) lobes.
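The LOSO protocol and the ACC/STD criteria described above can be summarized in a few lines; this is our own generic sketch, with `train_and_test` standing in for fitting TANN on the source subjects and scoring it on the held-out one.

```python
import numpy as np

def loso(features_per_subject, labels_per_subject, train_and_test):
    """Leave-one-subject-out evaluation: hold out each subject once,
    train on the rest, and report mean accuracy and standard deviation
    (the ACC/STD criteria used throughout the experiments)."""
    accs = []
    n_subjects = len(features_per_subject)  # 15 for SEED/SEED-IV, 30 for MPED
    for test_id in range(n_subjects):
        train_ids = [s for s in range(n_subjects) if s != test_id]
        x_train = np.concatenate([features_per_subject[s] for s in train_ids])
        y_train = np.concatenate([labels_per_subject[s] for s in train_ids])
        accs.append(train_and_test(x_train, y_train,
                                   features_per_subject[test_id],
                                   labels_per_subject[test_id]))
    return float(np.mean(accs)), float(np.std(accs))
```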
TABLE I: The numbers of training and test samples, and the experimental protocols used in the experiment.

(a) The subject-dependent experiment

Dataset              Training   Test   Protocol
SEED                 2010       1384   [Zheng and Lu] [7]
SEED-IV  Session 1   561        290    [Zheng et al.] [24]
         Session 2   550        282
         Session 3   576        246
MPED                 2520       840    [Song et al.] [27]

(b) The subject-independent experiment

Dataset              Training   Test   Protocol
SEED                 47516      3394   [Zheng et al.] [12]
SEED-IV  Session 1   11914      851    LOSO*
         Session 2   11648      832
         Session 3   11508      822
MPED                 97440      3360   LOSO*

*LOSO denotes the leave-one-subject-out strategy.

B. Experiment results

To validate the classification superiority of TANN, we also conduct the same experiments with various existing methods. Recall that the distribution gap in the subject-independent task is much larger than that in the subject-dependent one. In this case, domain adaptation methods shall be properly employed in order to achieve promising performance. Therefore, in the experiment on the subject-independent task, we include many domain adaptation methods in the comparison. By doing so, we can effectively validate the state-of-the-art performance of our method. The compared methods are listed as follows:

• Two baseline methods: linear support vector machine (SVM) [28] and random forest (RF) [29];

• Three subspace learning methods: canonical correlation analysis (CCA) [30], group sparse canonical correlation analysis (GSCCA) [31], and graph regularized sparse linear regression (GRSLR) [32];

• Six transfer subspace learning methods: Kullback-Leibler importance estimation procedure (KLIEP) [33], unconstrained least-squares importance fitting (ULSIF) [34], selective transfer machine (STM) [35], transfer component analysis (TCA) [13], subspace alignment (SA) [36], and geodesic flow kernel (GFK) [37];

• Seven recent deep learning methods: deep belief network (DBN) [7], graph convolutional neural network (GCNN) [38], dynamical graph convolutional neural network (DGCNN) [25], domain adversarial neural network (DANN) [18], bi-hemisphere domain adversarial neural network (BiDANN) [39], EmotionMeter [24], and attention-long short-term memory (A-LSTM) [26].

All the methods are representative ones in previous studies of emotion recognition. We directly quote (or reproduce) their results from the literature to ensure a convincing comparison with the proposed method.

The results are summarized in Tables II and III. Note that the subspace-based methods, such as TCA, SA and GFK, have difficulty handling a large amount of EEG data due to computer memory limitations and computational issues. Therefore, to compare with them, we randomly select 3000 EEG feature samples from the training data set to train these methods. Besides, the compared methods that adopt domain adaptation techniques train the model with labeled training data and unlabeled test data, as TANN does. From Tables II and III, we have three observations:

(1) The proposed TANN model outperforms all the compared methods on all three datasets. Especially on the SEED-IV dataset, the mean improvement is about 3.4% and 2.5% over the state-of-the-art methods A-LSTM and BiDANN. This verifies that the learned transferable data representations are useful for EEG emotion recognition.

(2) The proposed TANN is superior to the recent domain adaptation methods. TANN improves over the BiDANN method, which also adopts a domain adversarial learning strategy, by 1.0%, 3.7% and 2.1% in the subject-dependent task in Table II, and by 1.2%, 2.4% and 2.5% in the subject-independent task in Table III. This reveals that the local and global attention structures are helpful for learning discriminative information for emotion recognition.

(3) Even with the same classification models, the performance on the subject-independent tasks is considerably lower than on the subject-dependent ones. The gaps on the three datasets are about 13%, 5% and 12%, respectively. This reveals that individual differences have a negative influence on EEG emotion recognition and should be mitigated in the subject-independent task.

TABLE II: The classification performance (ACC/STD, %) for subject-dependent EEG emotion recognition on SEED, SEED-IV and MPED datasets.
Method              SEED          SEED-IV       MPED
SVM [28]            83.99/09.72   56.61/20.05∗  ∗
RF [29]             78.46/11.77   50.97/16.22∗  ∗
CCA [30]            77.63/13.21   54.47/18.48∗  ∗
GSCCA [31]          82.96/09.95   69.08/16.66∗  ∗
DBN [7]             86.08/08.34   66.77/07.38∗  ∗
GRSLR [32]          87.39/08.64   69.32/19.57∗  ∗
GCNN [38]           87.40/09.20   68.34/15.42∗  ∗
DGCNN [25]          90.40/08.49   69.88/16.29∗  ∗
DANN [18]           91.36/08.30   63.07/12.66∗  ∗
BiDANN [39]         92.38/07.04   70.29/12.63∗  ∗
EmotionMeter [24]   −                           −
A-LSTM [27]         88.61/10.16∗  ∗             ∗
TANN

∗ indicates that the experiment results are based on our own implementation. − indicates that the results are not reported on that dataset.

C. Discussion

1) The confusion of different emotions based on the TANN model:
To better understand the confusion of TANN in recognizing different emotions, we depict the confusion matrices of the subject-dependent and subject-independent EEG emotion recognition experiments in Fig. 3 and Fig. 4, respectively, from which we have the following observations:

(1) In Fig. 3, for SEED, the classification accuracies for the three emotions are about 90%, and the happy and neutral emotions are easier to recognize than the sad emotion. For SEED-IV, which consists of four emotions, the negative emotions, i.e., sad and fear, are confused by the classifier with higher probability. For MPED, the confusion is more complex because it has more emotions than the other two datasets. The funny emotion is clearly the easiest to recognize, with an accuracy 16% higher than that of the second-placed neutral emotion. Apart from this, we find that funny and joy are easily confused with each other, maybe because both of them are positive emotions.

(2) From the results of the subject-independent EEG emotion recognition experiment in Fig. 4, we observe that, for SEED, which has three types of emotions, the happy emotion is much easier to recognize than neutral and sad; for SEED-IV, the neutral and sad emotions are much easier to recognize; for MPED, which poses a hard seven-class problem, the accuracies of the funny, neutral and anger emotions surpass those of the other emotions, which reveals that we should focus on the joy, sad, fear and disgust emotion data in the task of classifying seven emotions.

TABLE III: The classification performance (ACC/STD, %) for subject-independent EEG emotion recognition on SEED, SEED-IV and MPED datasets.

Method        SEED          SEED-IV       MPED
KLIEP [33]    45.71/17.76   31.46/09.20∗  ∗
ULSIF [34]    51.18/13.57   32.99/11.05∗  ∗
STM [35]      51.23/14.82   39.39/12.40∗  ∗
SVM [28]      56.73/16.29   37.99/12.52∗  ∗
TCA [13]      63.64/14.88   56.56/13.77∗  ∗
SA [36]       69.00/10.89   64.44/09.46∗  ∗
GFK [37]      71.31/14.09   64.38/11.41∗  ∗
A-LSTM [27]   72.18/10.85∗  ∗             ∗
DANN [18]     75.08/11.18   47.59/10.01∗  ∗
DGCNN [25]    79.95/09.02   52.82/09.23∗  ∗
DAN [40]      83.81/08.56   58.87/08.13   −
BiDANN [39]   83.28/09.60   65.59/10.39∗  ∗
TANN

∗ indicates that the experiment results are based on our own implementation. − indicates that the results are not reported on that dataset.

Fig. 3: The confusion matrices (panels (a) SEED, (b) SEED-IV, (c) MPED) based on the subject-dependent experimental results on the three datasets.

Fig. 4: The confusion matrices (panels (a) SEED, (b) SEED-IV, (c) MPED) based on the subject-independent experimental results on the three datasets.
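The confusion matrices in Figs. 3 and 4 are the usual row-normalized counts; a small NumPy sketch (ours, not the authors' plotting code):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Entry (i, j) is the fraction of class-i samples predicted as
    class j, so the diagonal holds the per-emotion accuracies."""
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm / cm.sum(axis=1, keepdims=True).clip(min=1)

# Toy 3-class example in SEED's label order (happy, neutral, sad).
print(confusion_matrix([0, 0, 1, 2, 2], [0, 1, 1, 2, 0], 3))
```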
2) The transferability of different brain regions:
To investigate the transferability of different brain regions for EEG emotion recognition, we visualize all the brain regions by mapping the local attention values $w$ of Eq. (7) onto the corresponding electrodes. The results are shown in Fig. 5, from which we have two observations:

(1) The left and right temporal lobes make the most important contributions to emotion recognition on all three datasets, which coincides with previous EEG emotion studies [6], [7]. This also reveals that, while the proposed model can adaptively attend to different brain regions, it still effectively captures the most important ones.

(2) The activation areas differ slightly across datasets. For example, the temporal-lobe activation is broader for SEED-IV than for SEED. And for MPED, which consists of more types of emotions, the occipital lobe, as well as the temporal lobe, contributes more to emotion expression.

Fig. 5: The transferability of different EEG brain regions (panels (a) SEED, (b) SEED-IV, (c) MPED).
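The maps in Fig. 5 broadcast each region's attention value $w^{N_i}$ from Eq. (7) to its member electrodes before topographic plotting; a hedged fragment with the grouping abbreviated to two of the sixteen lobes and made-up attention values:

```python
# Region grouping abbreviated from the footnote in Section IV-A;
# the attention values below are illustrative, not measured.
regions = {
    "Left Temporal": ["FT7", "FC5", "T7", "C5", "TP7", "CP5"],
    "Right Temporal": ["FT8", "FC6", "T8", "C6", "TP8", "CP6"],
}
region_attention = {"Left Temporal": 0.9, "Right Temporal": 0.8}

electrode_attention = {elec: region_attention[name]
                       for name, elecs in regions.items() for elec in elecs}
print(electrode_attention["T7"])  # 0.9
```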
3) Ablation study:
To assess the importance of each module of TANN for EEG emotion recognition, we conduct an ablation study by removing the local and global attention layers both jointly and separately. These reduced models are depicted in Fig. 6 and include:

• TANN-R1, which removes both the local and global attention modules;

• TANN-R2, which neglects the global transferability of EEG samples;

• TANN-R3, which employs the same structure as the TANN model except for the local attention layer.
Fig. 6: The frameworks of the reduced models of TANN: (a) TANN-R1, (b) TANN-R2, (c) TANN-R3.

The experimental results are shown in Table IV, from which we have three observations:

(1) The structure of the feature extractor in TANN is effective. From the results of TANN-R1, we can see that it alone achieves comparable performance on the three datasets. This verifies that the deep data representation obtained by the two directional recurrent neural networks is discriminative for emotion recognition.

(2) Either the local or the global transferable attention module can enhance emotion recognition. In contrast to TANN-R1, TANN-R2 and TANN-R3 improve the accuracy on the three datasets by 1.8% and 1.5% on average, respectively.

(3) By assembling the feature extractor and the local and global attention modules, TANN achieves the best performance. TANN has a further improvement of 3% compared with TANN-R2 and TANN-R3.

The above results verify the effectiveness of the three important modules in TANN.

TABLE IV: The comparison of EEG emotion recognition results (ACC/STD, %) among four methods: (1) TANN-R1, (2) TANN-R2, (3) TANN-R3, (4) TANN.
Method     SEED          SEED-IV       MPED
TANN-R1    87.06/09.45   68.28/14.28   37.92/07.80
TANN-R2    89.73/07.53
TANN-R3
V. CONCLUSION
In this paper, we propose a transferable attention neural network (TANN) for the EEG emotion recognition problem, motivated by the finding that not all training samples contribute equally to emotion recognition, and that the same holds for the different brain regions within a sample. TANN learns the positive and negative information at the sample level and the brain-region level, which improves EEG emotion recognition. The proposed framework is easy to implement, and extensive experiments on three public EEG emotion datasets demonstrate that the proposed TANN method achieves state-of-the-art performance. Besides, based on TANN, we also investigate the transferability of different brain regions in EEG emotion recognition and find that the temporal and occipital lobes contribute more to emotion expression. In future work, we will investigate more operations for learning transferability information to explore the potential efficacy of transferable attention for EEG emotion recognition.
REFERENCES

[1] R. W. Picard, Affective Computing. MIT Press, 2000.
[2] B. García-Martínez, A. Martinez-Rodrigo, R. Alcaraz, and A. Fernández-Caballero, "A review on nonlinear methods using electroencephalographic recordings for emotion recognition," IEEE Transactions on Affective Computing, 2019.
[3] J. Chen, P. Zhang, Z. Mao, Y. Huang, D. Jiang, and Y. Zhang, "Accurate EEG-based emotion recognition on combined features using deep convolutional neural networks," IEEE Access, vol. 7, pp. 44317–44328, 2019.
[4] D. Sammler, M. Grigutsch, T. Fritz, and S. Koelsch, "Music and emotion: electrophysiological correlates of the processing of pleasant and unpleasant music," Psychophysiology, vol. 44, no. 2, pp. 293–304, 2007.
[5] D. Mathersul, L. M. Williams, P. J. Hopkinson, and A. H. Kemp, "Investigating models of affect: relationships among EEG alpha asymmetry, depression, and anxiety," Emotion, vol. 8, no. 4, pp. 560–572, 2008.
[6] Y.-P. Lin, C.-H. Wang, T.-P. Jung, T.-L. Wu, S.-K. Jeng, J.-R. Duann, and J.-H. Chen, "EEG-based emotion recognition in music listening," IEEE Transactions on Biomedical Engineering, vol. 57, no. 7, pp. 1798–1806, 2010.
[7] W.-L. Zheng and B.-L. Lu, "Investigating critical frequency bands and channels for EEG-based emotion recognition with deep neural networks," IEEE Transactions on Autonomous Mental Development, vol. 7, no. 3, pp. 162–175, 2015.
[8] P. Li, H. Liu, Y. Si, C. Li, F. Li, X. Zhu, X. Huang, Y. Zeng, D. Yao, Y. Zhang et al., "EEG based emotion recognition by combining functional connectivity network and local activations," IEEE Transactions on Biomedical Engineering, 2019.
[9] R. Jenke, A. Peer, and M. Buss, "Feature extraction and selection for emotion recognition from EEG," IEEE Transactions on Affective Computing, vol. 5, no. 3, pp. 327–339, 2014.
[10] W. Zheng, "Multichannel EEG-based emotion recognition via group sparse canonical correlation analysis," IEEE Transactions on Cognitive and Developmental Systems, vol. 9, no. 3, pp. 281–290, 2017.
[11] S. M. Alarcao and M. J. Fonseca, "Emotions recognition using EEG signals: a survey," IEEE Transactions on Affective Computing, 2017.
[12] W.-L. Zheng and B.-L. Lu, "Personalizing EEG-based affective models with transfer learning," in International Joint Conference on Artificial Intelligence (IJCAI). AAAI Press, 2016, pp. 2732–2738.
[13] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang, "Domain adaptation via transfer component analysis," IEEE Transactions on Neural Networks, vol. 22, no. 2, pp. 199–210, 2011.
[14] B. Schölkopf, A. Smola, and K.-R. Müller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Computation, vol. 10, no. 5, pp. 1299–1319, 1998.
[15] R. Collobert, F. Sinz, J. Weston, and L. Bottou, "Large scale transductive SVMs," Journal of Machine Learning Research, vol. 7, no. Aug, pp. 1687–1712, 2006.
[16] E. Sangineto, G. Zen, E. Ricci, and N. Sebe, "We are not all equal: Personalizing models for facial expression analysis with transductive parameter transfer," in Proceedings of the 22nd ACM International Conference on Multimedia (MM). ACM, 2014, pp. 357–366.
[17] Z. Lan, O. Sourina, L. Wang, R. Scherer, and G. R. Müller-Putz, "Domain adaptation techniques for EEG-based emotion recognition: a comparative study on two public datasets," IEEE Transactions on Cognitive and Developmental Systems, vol. 11, no. 1, pp. 85–94, 2018.
[18] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, "Domain-adversarial training of neural networks," Journal of Machine Learning Research, vol. 17, no. 59, pp. 1–35, 2016.
[19] Y. Li, W. Zheng, Y. Zong, Z. Cui, T. Zhang, and X. Zhou, "A bi-hemisphere domain adversarial neural network model for EEG emotion recognition," IEEE Transactions on Affective Computing, 2018.
[20] P. A. Kragel and K. S. LaBar, "Decoding the nature of emotion in the brain," Trends in Cognitive Sciences, vol. 20, no. 6, pp. 444–455, 2016.
[21] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[22] X. Wang, L. Li, W. Ye, M. Long, and J. Wang, "Transferable attention for domain adaptation," in AAAI Conference on Artificial Intelligence, 2019.
[23] M. Long, H. Zhu, J. Wang, and M. I. Jordan, "Unsupervised domain adaptation with residual transfer networks," in Advances in Neural Information Processing Systems, 2016, pp. 136–144.
[24] W.-L. Zheng, W. Liu, Y. Lu, B.-L. Lu, and A. Cichocki, "EmotionMeter: A multimodal framework for recognizing human emotions," IEEE Transactions on Cybernetics, vol. 49, pp. 1110–1122, 2019.
[25] T. Song, W. Zheng, P. Song, and Z. Cui, "EEG emotion recognition using dynamical graph convolutional neural networks," IEEE Transactions on Affective Computing, 2018.
[26] T. Song, W. Zheng, C. Lu, Y. Zong, X. Zhang, and Z. Cui, "MPED: A multi-modal physiological emotion database for discrete emotion recognition," IEEE Access, vol. 7, pp. 12177–12191, 2019.
[27] ——, "MPED: A multi-modal physiological emotion database for discrete emotion recognition," IEEE Access, vol. 7, pp. 12177–12191, 2019.
[28] J. A. Suykens and J. Vandewalle, "Least squares support vector machine classifiers," Neural Processing Letters, vol. 9, no. 3, pp. 293–300, 1999.
[29] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[30] B. Thompson, "Canonical correlation analysis," Encyclopedia of Statistics in Behavioral Science, 2005.
[31] W. Zheng, "Multichannel EEG-based emotion recognition via group sparse canonical correlation analysis," IEEE Transactions on Cognitive and Developmental Systems, vol. 9, no. 3, pp. 281–290, 2017.
[32] Y. Li, W. Zheng, Z. Cui, Y. Zong, and S. Ge, "EEG emotion recognition based on graph regularized sparse linear regression," Neural Processing Letters, pp. 1–17, 2018.
[33] M. Sugiyama, S. Nakajima, H. Kashima, P. V. Buenau, and M. Kawanabe, "Direct importance estimation with model selection and its application to covariate shift adaptation," in Advances in Neural Information Processing Systems (NIPS), 2008, pp. 1433–1440.
[34] T. Kanamori, S. Hido, and M. Sugiyama, "A least-squares approach to direct importance estimation," The Journal of Machine Learning Research, vol. 10, pp. 1391–1445, 2009.
[35] W.-S. Chu, F. De la Torre, and J. F. Cohn, "Selective transfer machine for personalized facial expression analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 3, pp. 529–545, 2017.
[36] B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars, "Unsupervised visual domain adaptation using subspace alignment," in IEEE International Conference on Computer Vision (ICCV). IEEE, 2013, pp. 2960–2967.
[37] B. Gong, Y. Shi, F. Sha, and K. Grauman, "Geodesic flow kernel for unsupervised domain adaptation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2012, pp. 2066–2073.
[38] M. Defferrard, X. Bresson, and P. Vandergheynst, "Convolutional neural networks on graphs with fast localized spectral filtering," in Advances in Neural Information Processing Systems (NIPS), 2016, pp. 3844–3852.
[39] Y. Li, W. Zheng, Z. Cui, T. Zhang, and Y. Zong, "A novel neural network model based on cerebral hemispheric asymmetry for EEG emotion recognition," in International Joint Conference on Artificial Intelligence (IJCAI), 2018, pp. 1561–1567.
[40] H. Li, Y.-M. Jin, W.-L. Zheng, and B.-L. Lu, "Cross-subject emotion recognition using deep adaptation networks," in