EEG-Based Emotion Recognition Using Regularized Graph Neural Networks
Peixiang Zhong, Di Wang, Senior Member, IEEE, and Chunyan Miao, Senior Member, IEEE
Abstract—Electroencephalography (EEG) measures the neuronal activities in different brain regions via electrodes. Many existing studies on EEG-based emotion recognition do not fully exploit the topology of EEG channels. In this paper, we propose a regularized graph neural network (RGNN) for EEG-based emotion recognition. RGNN considers the biological topology among different brain regions to capture both local and global relations among different EEG channels. Specifically, we model the inter-channel relations in EEG signals via an adjacency matrix in a graph neural network, where the connection and sparseness of the adjacency matrix are inspired by neuroscience theories of human brain organization. In addition, we propose two regularizers, namely node-wise domain adversarial training (NodeDAT) and emotion-aware distribution learning (EmotionDL), to better handle cross-subject EEG variations and noisy labels, respectively. Extensive experiments on two public datasets, SEED and SEED-IV, demonstrate the superior performance of our model over state-of-the-art models in most experimental settings. Moreover, ablation studies show that the proposed adjacency matrix and two regularizers contribute consistent and significant gains to the performance of our RGNN model. Finally, investigations on the neuronal activities reveal important brain regions and inter-channel relations for EEG-based emotion recognition.
Index Terms—Affective Computing, EEG, Graph Neural Network, SEED
• P. Zhong, D. Wang and C. Miao are with the Joint NTU-UBC Research Centre of Excellence in Active Living for the Elderly (LILY), Nanyang Technological University, Singapore. P. Zhong and C. Miao are also with the Alibaba-NTU Singapore Joint Research Institute and the School of Computer Science and Engineering, Nanyang Technological University, Singapore. E-mail: [email protected], {wangdi, ascymiao}@ntu.edu.sg

1 INTRODUCTION

Emotion recognition focuses on the recognition of human emotions based on a variety of modalities, such as audio-visual expressions, body language, physiological signals, etc. Compared to other modalities, physiological signals, such as electroencephalography (EEG), electrocardiogram (ECG), and electromyography (EMG), have the advantage of being difficult to hide or disguise. In recent years, due to the rapid development of noninvasive, easy-to-use, and inexpensive EEG recording devices, EEG-based emotion recognition has received an increasing amount of attention in both research [1] and applications [2].

Emotion models can be broadly categorized into discrete models and dimensional models. The former categorizes emotions into discrete entities, e.g., anger, disgust, fear, happiness, sadness, and surprise in Ekman's theory [3]. The latter describes emotions using their underlying dimensions, e.g., valence, arousal, and dominance [4], which measure emotions from unpleasant to pleasant, passive to active, and submissive to dominant, respectively.

EEG signals measure voltage fluctuations from the cortex in the brain and have been shown to reveal important information about human emotional states [5]. For example, greater relative left frontal EEG activity has been observed when experiencing positive emotions [5]. The voltage fluctuations in different brain regions are measured by electrodes attached to the scalp. Each electrode collects EEG signals in one channel. The collected EEG signals are often analyzed in specific frequency bands, namely delta (1-4 Hz), theta (4-7 Hz), alpha (8-13 Hz), beta (13-30 Hz), and gamma (>30 Hz).
Many existing EEG-based emotion recognition methods are primarily based on the supervised machine learning approach, wherein features are often extracted from preprocessed EEG signals in each channel over a time window. Then, a classifier is trained on the extracted features to recognize emotions. Wang et al. [6] compared power spectral density (PSD) features, wavelet features, and nonlinear dynamical features with a Support Vector Machine (SVM) classifier. Zheng and Lu [7] investigated critical frequency bands and channels using PSD, differential entropy (DE) [8], and PSD asymmetry features, and obtained robust accuracy using deep belief networks (DBN). However, most existing EEG-based emotion recognition approaches do not address the following three challenges: 1) the topological structure of EEG channels is not effectively exploited to learn more discriminative EEG representations; 2) EEG signals vary significantly across different subjects, which hinders the generalizability of the trained classifiers in subject-independent classification settings; and 3) participants may not always generate the intended emotions when watching emotion-eliciting stimuli. Consequently, the emotion labels in the collected EEG data may be noisy and inconsistent with the actual elicited emotions.

There have been several attempts to address the first challenge. Zhang et al. [9] and Zhang et al. [10] incorporated spatial relations in EEG signals using convolutional neural networks (CNN) and recurrent neural networks (RNN), respectively. However, their approaches require a 2D representation of EEG channels on the scalp, which may cause information loss during flattening because channels are actually arranged in the 3D space. In addition, their approach of using CNNs and RNNs to capture inter-channel relations has difficulty in learning long-range dependencies [11]. Graph neural networks (GNN) have been applied in [12] to capture inter-channel relations using an adjacency matrix. However, similar to CNNs and RNNs, the GNN approach [12] only considers relations between the nearest channels and thus may lose valuable information between distant channels, such as the PSD asymmetry between channels on the left and right hemispheres in the frontal region, which has been shown to be informative in valence prediction [5].

In recent years, several studies [13], [14] attempted to tackle the second challenge by investigating the transferability of EEG-based emotion recognition models across subjects. Lan et al. [15] compared several domain adaptation techniques such as maximum independence domain adaptation (MIDA), transfer component analysis (TCA), and subspace alignment (SA). They found that the subject-independent classification accuracy can be improved by around 10%. Li et al. [16] applied domain adversarial training to lower the influence of individual subjects on EEG data and obtained improved performance as well. However, their adversarial training does not exploit any graph structure of the EEG signals and only leads to a small performance improvement in our experiment (see Section 7.1).

To the best of our knowledge, no attempt has been made to address the third challenge, i.e., noisy emotion labels, in EEG-based emotion recognition.

In this paper, we propose a regularized graph neural network (RGNN) aiming to address all three aforementioned challenges.
Graph analysis for the human brain has been studied extensively in the neuroscience literature [17], [18]. However, making an accurate connectome is still an open question and subject to different scales [18]. Inspired by [12], [19], we consider each EEG channel as a node in our graph. Our RGNN model extends the simple graph convolution network (SGC) [20] and leverages the topological structure of EEG channels. Specifically, we propose a sparse adjacency matrix to capture both local and global inter-channel relations based on the biological topology of the human brain [19]. Local inter-channel relations connect nearby groups of neurons and may reveal anatomical connectivity at macroscale [18], [21]. Global inter-channel relations connect distant groups of neurons between the left and right hemispheres and may reveal emotion-related functional connectivity [5], [16].

In addition, we propose a node-wise domain adversarial training (NodeDAT) method to regularize RGNN for better generalization in subject-independent classification scenarios. Different from the domain adversarial training in [16], [22], our NodeDAT method provides a finer-grained regularization by minimizing the domain discrepancies between features in the source and target domains for each channel/node. Moreover, we propose an emotion-aware distribution learning (EmotionDL) method to address the problem of noisy labels in the datasets. Prior studies have shown that noisy labels can adversely impact classification accuracy [23]. Instead of learning the traditional single-label classification, our EmotionDL method learns a distribution of labels of the training data and thus acts as a regularizer to improve the robustness of our model against noisy labels. Finally, we conduct extensive experiments to validate the effectiveness of our RGNN model and investigate emotion-related informative neuronal activities.

In summary, the main contributions of this paper are as follows:

1) We propose a regularized graph neural network (RGNN) model to recognize emotions based on EEG signals. Our biologically inspired model captures both local and global inter-channel relations.
2) We propose two regularizers, node-wise domain adversarial training (NodeDAT) and emotion-aware distribution learning (EmotionDL), to improve the robustness of our model against cross-subject variations and noisy labels, respectively.
3) We conduct extensive experiments in both subject-dependent and subject-independent classification settings on two public EEG datasets, namely SEED [7] and SEED-IV [24]. Experimental results demonstrate the effectiveness of our proposed model and regularizers. In addition, our RGNN model achieves superior performance over the state-of-the-art models in most experimental settings.
4) We investigate the emotional neuronal activities and the results reveal that pre-frontal, parietal and occipital regions may be the most informative regions for emotion recognition. In addition, global inter-channel relations between the left and right hemispheres are important, and local inter-channel relations between (FP1, AF3), (F6, F8) and (FP2, AF4) may also provide useful information.

2 RELATED WORK
In this section, we review related work in the fields of EEG-based emotion recognition, graph neural networks, unsupervised domain adaptation, and learning with noisy labels.
EEG feature extractors and classifiers are the two fundamental components in the machine learning approach of EEG-based emotion recognition. EEG features can be broadly divided into single-channel features and multi-channel ones [25]. The majority of existing features are computed on a single channel, e.g., statistical features [26], PSD [27], differential entropy (DE) [8], and wavelet features [28]. A small number of features are computed on multiple channels to capture the inter-channel relations, e.g., the asymmetry of PSD between two hemispheres [7] and functional connectivity [29], [30], where common indices such as correlation, coherence, and phase synchronization were used to estimate brain functional connectivity between channels. Another line of research in multi-channel features is to use common spatial filters [31] and spatial-temporal filters [32], [33] to extract class-discriminative EEG features. In contrast, our model is designed to operate on single-channel features and learn to effectively combine them using a graph neural network.

EEG classifiers can be broadly divided into topology-invariant classifiers and topology-aware ones. The majority of existing classifiers are topology-invariant, such as SVM, k-Nearest Neighbors (KNN), DBN [34] and RNN [35], which do not take the topological structure of EEG features into account when learning the EEG representations. In contrast, topology-aware classifiers such as CNN [9], [36], [37] and GNN [12] consider the inter-channel topological relations and learn EEG representations for each channel by aggregating features from nearby channels using convolutional operations, either in the Euclidean space or in the non-Euclidean space. However, as discussed in Section 1, existing CNNs and GNNs have difficulty in learning the dependencies between distant channels, which may reveal important emotion-related information. Recently, Zhang et al. [10] and Li et al. [38] proposed to use RNNs to learn spatial topological relations between channels by scanning electrodes in both vertical and horizontal directions. However, their approaches do not fully exploit the topological structure of EEG channels. For example, two topologically close channels may be far away from each other in their scanning sequence. In contrast, our model is able to learn relations between distant channels using global connections.
Graph neural network (GNN) is a type of neural network dealing with data in the graph domain, e.g., molecular structures, social networks, and knowledge graphs [39]. One early work on GNN [40] aimed to learn a converged static state embedding for each node in the graph using a transition function applied to its neighborhood. Later, inspired by the convolutional operation of CNN in the Euclidean domain, Bruna et al. [41] combined spectral graph theory [42] with neural networks and defined convolutional operations in the graph domain using the spectral filters computed from the normalized graph Laplacian. Following this line of research, Defferrard et al. [43] proposed fast localized convolutions by using a recursive formulation of the K-th order Chebyshev polynomials to approximate the filters. The resulting representation for each node is an aggregation of its K-th order neighborhood. Kipf and Welling [44] further limited K to 1 and proposed the standard graph convolutional network (GCN) with a faster localized graph convolutional operation. The convolutional layers in GCN can be stacked K times to effectively convolve the K-th order neighborhood of a node. Recently, Wu et al. [20] simplified GCN by removing the nonlinearities between convolutional layers and proposed the simple graph convolution network (SGC), which effectively behaves like a linear feature transformation followed by a logistic regression. Apart from the convolution operation used in GCNs, there are other types of operations used in GNNs, such as attention [45]. However, they often train significantly slower than SGC [20]. In this paper, we extend SGC to model EEG signals because it performs orders of magnitude faster than other networks with comparable classification accuracy.

Unsupervised domain adaptation aims to mitigate the domain shift in knowledge transfer from a supervised source domain to an unsupervised target domain. The most common approaches are instance re-weighting and domain-invariant feature learning. Instance re-weighting methods [46] aim to infer the resampling weight directly by feature distribution matching across source and target domains in a non-parametric manner. Domain-invariant feature learning methods align features from both source and target domains to a common feature space. The alignment can be achieved by minimizing divergence [47], maximizing reconstruction [48], or adversarial training [22]. Our proposed NodeDAT regularizer extends the domain adversarial training [22] to graph neural networks and achieves finer-grained regularization by minimizing the discrepancies between features in source and target domains for each node individually.
Commonly adopted approaches to learning with noisy labels are based on the noise transition matrix and robust loss functions. The noise transition matrix specifies the probabilities of transition from each ground truth label to each noisy label and is often applied to modify the cross-entropy loss. The matrix can be pre-computed as a prior [49] or estimated from noisy data [50]. A few studies tackle noisy labels by using noise-tolerant robust loss functions, such as unhinged loss [51] and ramp loss [52]. Our proposed EmotionDL regularizer is inspired by [53], which applies distribution learning to classify ambiguous images.
3 PRELIMINARIES
In this section, we introduce the preliminaries of the simple graph convolution network (SGC) [20] and spectral graph convolution, which are the basis of our RGNN model.
Given a graph G = (V, E), where V denotes a set of nodes and E denotes a set of edges between nodes in V, data on V can be represented by a feature matrix X ∈ R^{n×d}, where n = |V| and d denotes the input feature dimension. The edge set E can be represented by a weighted adjacency matrix A ∈ R^{n×n} with self-loops, i.e., A_{ii} = 1, i = 1, 2, ..., n. In general, GNNs learn a feature transformation function for X and produce output Z ∈ R^{n×d'}, where d' denotes the output feature dimension.

Between adjacent layers in GNNs, the feature transformation can be written as

H^{l+1} = f(H^l, A),   (1)

where l = 0, 1, ..., L−1, L denotes the number of layers, H^0 = X, H^L = Z, and f denotes the function we want to learn. A simple definition of f would be

H^{l+1} = σ(A H^l W^l),   (2)

where σ denotes a non-linear function and W^l denotes a weight matrix at layer l. For each node x, function f simply computes the weighted sum of all the node features in its neighborhood, including x itself, followed by a non-linear transformation. However, one major limitation of the f in (2) is that repeatedly applying f along multiple layers may lead to H^l with overly large values due to summation. Kipf and Welling [44] alleviated this limitation by proposing the graph convolution network (GCN) as follows:

H^{l+1} = σ(D^{-1/2} A D^{-1/2} H^l W^l),   (3)

where D denotes the diagonal degree matrix of A, i.e., D_{ii} = Σ_j A_{ij}. The normalized adjacency matrix D^{-1/2} A D^{-1/2} prevents H from growing overly large. If we ignore σ and W^l temporarily and expand (3), the hidden state H^{l+1}_i for node x_i, i = 1, 2, ..., n, can be computed via

H^{l+1}_i ← (A_{ii} / (D_{ii} + 1)) H^l_i + Σ_{j=1}^{n} (A_{ij} / √((D_{ii} + 1)(D_{jj} + 1))) H^l_j.   (4)

Note that each neighboring H^l_j is now normalized by the degrees of both x_i and x_j. Successively applying L layers aggregates node features within a neighborhood of size L.

To further accelerate training while keeping comparable performance, Wu et al. [20] proposed SGC by removing the non-linear function σ in (3) and reparameterizing all linear transformations W^l across all layers into one linear transformation W as follows:

Z = H^L = S H^{L-1} W^{L-1} = ... = S^L X W,   (5)

where S = D^{-1/2} A D^{-1/2} and W = W^0 W^1 ... W^{L-1}. Essentially, SGC computes a topology-aware linear transformation X̂ = S^L X, followed by a final linear transformation Z = X̂ W.

We analyze GCN from the perspective of spectral graph theory [42]. Graph Fourier analysis relies on the graph Laplacian L = D − A or the normalized graph Laplacian L̂ = I − D^{-1/2} A D^{-1/2}. Since L̂ is a symmetric positive semidefinite matrix, it can be decomposed as L̂ = U Λ U^T, where U is the orthonormal eigenvector matrix of L̂ and Λ = diag(λ_1, ..., λ_n) is the diagonal matrix of corresponding eigenvalues. Given graph data X, the graph Fourier transform of X is X̂ = U^T X, and the inverse Fourier transform of X̂ is X = U X̂. Hence, the graph convolution between X and a filter G is computed as follows:

X ∗ G = U((U^T G) ⊙ (U^T X)) = U Ĝ U^T X,   (6)

where ⊙ denotes element-wise multiplication and Ĝ = diag(ĝ_1, ..., ĝ_n) denotes a diagonal matrix with n spectral filter coefficients.

To reduce the learning complexity from O(n) to that of a conventional CNN, i.e., O(K), (6) can be approximated using K-th order polynomials as follows:

U Ĝ U^T X ≈ U (Σ_{i=0}^{K} θ_i Λ^i) U^T X = Σ_{i=0}^{K} θ_i L̂^i X,   (7)

where θ_i denotes learnable parameters.
To further reduce the computational cost, Defferrard et al. [43] proposed to use Chebyshev polynomials to approximate the filtering operation as follows:

U Ĝ U^T X ≈ Σ_{i=0}^{K} θ_i T_i(L̂') X,   (8)

where L̂' = (2/λ_max) L̂ − I denotes the scaled normalized Laplacian with its eigenvalues lying within [−1, 1], λ_max denotes the maximum eigenvalue of L̂, and T_i(x) denotes the Chebyshev polynomials recursively defined as T_i(x) = 2x T_{i−1}(x) − T_{i−2}(x) with T_0(x) = 1 and T_1(x) = x.

The GCN proposed in [44] made a few approximations to simplify the filtering operation in (8): 1) use K = 1; 2) set λ_max = 2; and 3) set θ_0 = −θ_1. The resulting GCN arrives at (3). Essentially, the graph convolutional operations defined in (3) and (5) behave like a low-pass filter, smoothing the features of each node on the graph using the node features in its neighborhood.
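As a concrete illustration of (5), the following NumPy sketch (our own minimal example, not the released RGNN code) builds the normalized adjacency S = D^{-1/2} A D^{-1/2} and applies the L-step propagation followed by a single linear map; the toy graph, features, and weights are arbitrary stand-ins.

```python
import numpy as np

def normalized_adjacency(a: np.ndarray) -> np.ndarray:
    # S = D^{-1/2} A D^{-1/2}, with D_ii = sum_j A_ij and A containing self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a.sum(axis=1)))
    return d_inv_sqrt @ a @ d_inv_sqrt

def sgc_forward(x: np.ndarray, a: np.ndarray, w: np.ndarray, num_layers: int = 2) -> np.ndarray:
    # SGC per (5): L-step feature propagation S^L X, then one linear transformation W
    s = normalized_adjacency(a)
    for _ in range(num_layers):
        x = s @ x
    return x @ w

# toy usage: a 4-node chain graph with self-loops, 3-d input features, 2-d output
rng = np.random.default_rng(0)
a = np.eye(4)
for i, j in [(0, 1), (1, 2), (2, 3)]:
    a[i, j] = a[j, i] = 1.0
z = sgc_forward(rng.normal(size=(4, 3)), a, rng.normal(size=(3, 2)))
print(z.shape)  # (4, 2)
```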
4 REGULARIZED GRAPH NEURAL NETWORK

In this section, we present our regularized graph neural network (RGNN), specifically, the biologically inspired adjacency matrix, the dynamics of RGNN, and two regularizers, i.e., node-wise domain adversarial training (NodeDAT) and emotion-aware distribution learning (EmotionDL).
The adjacency matrix A ∈ R^{n×n} in RGNN represents the topological structure of EEG channels and is essential to graph representation learning, where n denotes the number of channels in EEG signals. Each entry A_{ij} is learnable and indicates the weight of the connection between channels i and j. Note that A contains self-loops. To reduce overfitting, we model A as a symmetric matrix by using only n(n+1)/2 parameters instead of n². Salvador et al. [55] observed that the strength of connection between brain regions decays as an inverse square function of physical distance. Hence, we initialize the local inter-channel relations in our adjacency matrix as follows:

A_{ij} = min(1, δ / d_{ij}^2),   (9)

where d_{ij}, i, j = 1, 2, ..., n, denotes the physical distance between channels i and j, which is computed from their 3D coordinates obtained from the data sheet of the recording device, and δ > 0 denotes a calibration constant. Achard and Bullmore [56] observed that sparse fMRI networks, comprising around 20% of all possible connections, typically maximize the efficiency of the network topology. Therefore, we choose δ = 5 such that around 20% of the entries in A are non-negligible, where we empirically regard entries whose values exceed a small threshold as non-negligible connections.

Bullmore and Sporns [19] suggested that the brain organization is shaped by an economic trade-off between minimizing wiring costs and network running costs. Minimizing wiring costs encourages local inter-channel connections as modelled in (9). However, minimizing network running costs encourages certain global inter-channel connections for high efficiency of information transfer across the network as a whole. To this end, we add several global connections to our adjacency matrix to improve the network efficiency. The global connections depend on the specific electrode placement adopted in experiments. Fig. 2 depicts the global connections in both SEED [7] and SEED-IV [24]. The selection of global channels is supported by prior studies showing that the asymmetry in neuronal activities between the left and right hemispheres is informative in valence and arousal predictions [5]. To leverage the differential asymmetry information, we initialize the global inter-channel relations in A to [−1, 0] as follows:

A_{ij} = A_{ij} − 1,   (10)

where (i, j) denotes the indices of the global channel pairs: (FP1, FP2), (AF3, AF4), (F5, F6), (FC5, FC6), (C5, C6), (CP5, CP6), (P5, P6), (PO5, PO6) and (O1, O2). Note that we select these indices because 1) they are connected to a large number of nodes in their immediate neighborhood, which maximizes the effects of EEG asymmetry; and 2) they empirically perform slightly better than alternative sets of indices (see Section 7.1). Our adjacency matrix A obtained in (9) and (10) aims to represent the brain network which combines both local anatomical connectivity and emotion-related global functional connectivity.

Fig. 1: The overall architecture of our RGNN model. FC denotes fully-connected layer. CE denotes cross-entropy loss. KL denotes Kullback-Leibler divergence [54]. GRL denotes gradient reversal layer [22].

Fig. 2: The 62-channel EEG placement used to collect data in SEED and SEED-IV. Gray symmetric channels are connected globally via red dashed lines.

Our RGNN model extends the SGC model [20]. The architecture of RGNN is illustrated in Fig. 1. We are given EEG features X ∈ R^{N×n×d} and labels Y ∈ Z^N, where N denotes the number of training samples, Y_i ∈ {0, 1, ..., C−1} denotes the class index, and C denotes the number of classes. Our model aims to minimize the following cross-entropy loss:

Φ = − Σ_{i=1}^{N} log(p(Y_i | X_i, θ)) + α ||A||_1,   (11)

where θ denotes the model parameters we want to optimize, and α denotes the strength of the L1 sparse regularization on our adjacency matrix A. By passing each feature matrix X_i into our RGNN, the output probability of class Y_i can be computed as follows:

Z_i = S^L X_i W,
p(Y_i | X_i, θ) = softmax_{Y_i}(pool(σ(Z_i)) W_O),   (12)

where S ∈ R^{n×n}, W ∈ R^{d×d'} and L follow the definitions in (5), σ(x) = max(0, x) denotes a non-linear transformation, W_O ∈ R^{d'×C} denotes the output weight matrix, and pool(·) denotes the sum pooling across all nodes on the graph. We choose sum pooling because it demonstrated more expressive power than mean pooling and max pooling [57]. Note that we use the absolute values of A to compute the degree matrix D (see (3)) because A has negative entries, e.g., global connections.
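The initialization in (9) and (10) can be sketched as follows. This is our own illustration: the 3D coordinates and the global pair indices below are hypothetical placeholders (in practice they come from the device data sheet and the pairs listed above), and the absolute-value degree computation mirrors the note on negative entries.

```python
import numpy as np

def init_adjacency(coords: np.ndarray, global_pairs, delta: float = 5.0) -> np.ndarray:
    # Local relations per (9): A_ij = min(1, delta / d_ij^2), with self-loops A_ii = 1
    n = coords.shape[0]
    a = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            d2 = np.sum((coords[i] - coords[j]) ** 2)
            a[i, j] = a[j, i] = min(1.0, delta / d2)  # symmetric by construction
    # Global relations per (10): shift the selected symmetric pairs into [-1, 0]
    for i, j in global_pairs:
        a[i, j] -= 1.0
        a[j, i] -= 1.0
    return a

def degree_matrix(a: np.ndarray) -> np.ndarray:
    # Degrees computed from |A| because the global entries are negative
    return np.diag(np.abs(a).sum(axis=1))

# hypothetical usage: 8 electrodes with random 3D positions, 2 global pairs
coords = np.random.default_rng(1).normal(size=(8, 3))
A = init_adjacency(coords, global_pairs=[(0, 1), (2, 3)])
D = degree_matrix(A)
```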
EEG signals vary significantly across different subjects, which hinders the generalizability of trained classifiers in subject-independent classification settings. To improve the robustness of our model across subjects, we extend the domain adversarial training [22] by proposing a node-wise domain adversarial training (NodeDAT) method to reduce the discrepancies between the source and target domains, i.e., the training and testing sets, respectively. Specifically, a domain classifier is proposed to classify each node representation into either the source domain or the target domain. During optimization, our model aims to confuse the domain classifier by learning domain-invariant representations. Compared to [22], which only regularizes the pooled representation in the last layer, our NodeDAT method provides finer-grained regularization because it explicitly regularizes each node representation before pooling.

Specifically, given labelled source/training data X^S ∈ R^{N×n×d} (in this subsection, we denote X by X^S for better clarity) and unlabelled target/testing data X^T ∈ R^{N×n×d}, where in practice X^T can be either oversampled or downsampled to have the same number of samples as X^S [22], the domain classifier aims to minimize the sum of the following two binary cross-entropy losses:

Φ_D = − Σ_{i=1}^{N} Σ_{j=1}^{n} (log(p_j(0 | X^S_i, θ_D)) + log(p_j(1 | X^T_i, θ_D))),   (13)

where θ_D denotes the parameters of the domain classifier, and 0 and 1 denote the source and target domains, respectively. Intuitively, the domain classifier is learned to classify source data as 0 and target data as 1. The domain probabilities p_j(·) for the j-th node of the i-th example are computed as

p_j(0 | X^S_i, θ_D) = softmax_0(σ(Z^S_{ij}) W_D),
p_j(1 | X^T_i, θ_D) = softmax_1(σ(Z^T_{ij}) W_D),   (14)

where Z^{S,T}_{ij} denotes the j-th node representation in Z^{S,T}_i, and W_D ∈ R^{d'×2} denotes the matrix parameter in the domain classifier, i.e., θ_D.

In order to confuse the domain classifier and learn domain-invariant node representations Z^{S,T}_{ij}, we implement a gradient reversal layer (GRL) [22] that acts like an identity layer in the forward propagation and reverses the gradients of the domain classifier during backpropagation. Consequently, the parameters in the feature extractor essentially perform gradient ascent with respect to the gradients from the domain classifier. The reversed gradients are further scaled by a GRL scaling factor β, which gradually increases from 0 to 1 as the training progresses. The gradually increasing β allows our domain classifier to be less sensitive to noisy inputs at the early stages of the training process. Specifically, as suggested in [22], we let β = 2/(1 + e^{−10p}) − 1, where p ∈ [0, 1] denotes the progression of training.
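A minimal PyTorch sketch of the GRL and the node-wise domain loss in (13)-(14) follows; this is our own illustration rather than the released implementation, and `domain_fc` is a hypothetical linear layer mapping each d'-dimensional node representation to two domain logits.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer: identity in the forward pass,
    multiplies incoming gradients by -beta in the backward pass."""
    @staticmethod
    def forward(ctx, x, beta):
        ctx.beta = beta
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.beta * grad_output, None

def grl_beta(p: float) -> float:
    # Scaling schedule from [22]: beta = 2 / (1 + exp(-10 p)) - 1, with p in [0, 1]
    return 2.0 / (1.0 + math.exp(-10.0 * p)) - 1.0

def node_domain_loss(z_src, z_tgt, domain_fc, beta):
    """Sum of per-node binary cross-entropy losses per (13)-(14).
    z_src, z_tgt: (batch, n_nodes, d') node representations after the ReLU."""
    logits_src = domain_fc(GradReverse.apply(z_src, beta)).reshape(-1, 2)
    logits_tgt = domain_fc(GradReverse.apply(z_tgt, beta)).reshape(-1, 2)
    loss_src = F.cross_entropy(logits_src, torch.zeros(len(logits_src), dtype=torch.long),
                               reduction="sum")  # source nodes labelled 0
    loss_tgt = F.cross_entropy(logits_tgt, torch.ones(len(logits_tgt), dtype=torch.long),
                               reduction="sum")  # target nodes labelled 1
    return loss_src + loss_tgt

# usage sketch: domain_fc = nn.Linear(d_out, 2)
#               phi_d = node_domain_loss(z_s, z_t, domain_fc, grl_beta(p))
```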
Participants may not always generate the intended emotions when watching emotion-eliciting stimuli, which may have a negative impact on model performance [23]. To this end, we propose an emotion-aware distribution learning (EmotionDL) method to learn a distribution of classes instead of one single class for each training sample. Specifically, we convert each training label Y_i ∈ {0, 1, ..., C−1} into a prior probability distribution over all classes Ŷ_i ∈ R^C, where Ŷ_{ic} denotes the probability of class c in Ŷ_i. The conversion is dataset-dependent. SEED has three classes: negative, neutral, and positive, with corresponding class indices 0, 1, and 2, respectively. We convert Y as follows:

Ŷ_i = (1−ε, ε, 0),        if Y_i = 0,
Ŷ_i = (ε/2, 1−ε, ε/2),    if Y_i = 1,
Ŷ_i = (0, ε, 1−ε),        if Y_i = 2,   (15)

where ε ∈ [0, 1] denotes a hyper-parameter controlling the noise level in the training labels. This conversion mechanism is based on our assumption that participants are unlikely to generate opposite emotions when watching emotion-eliciting stimuli. Therefore, for each class, the converted class distribution centers on the original class and has zero probabilities at its opposite classes.

SEED-IV has four classes: neutral, sad, fear, and happy, with corresponding class indices 0, 1, 2, and 3, respectively. We convert Y as follows:

Ŷ_i = (1−ε, ε/3, ε/3, ε/3),   if Y_i = 0,
Ŷ_i = (ε/2, 1−ε, ε/2, 0),     if Y_i = 1,
Ŷ_i = (ε/3, ε/3, 1−ε, ε/3),   if Y_i = 2,
Ŷ_i = (ε/2, 0, ε/2, 1−ε),     if Y_i = 3.   (16)

This conversion is based on the distances between the four emotion classes on the valence-arousal plane. Specifically, in the self-reported ratings [24] for SEED-IV, neutral, sad, fear, and happy movie ratings cluster in the zero valence zero arousal, low valence low arousal, low valence high arousal, and high valence high arousal regions, respectively. We assume that participants are likely to generate emotions that have similar ratings in either the valence or arousal dimension, e.g., both angry and happy have high arousal, but unlikely to generate emotions that are far away in both dimensions, e.g., sad and happy differ in both valence and arousal.

After obtaining the converted class distributions Ŷ, our model can be optimized by minimizing the following Kullback-Leibler (KL) divergence [54] instead of (11):

Φ' = Σ_{i=1}^{N} KL(p(Y | X_i, θ), Ŷ_i) + α ||A||_1,   (17)

where p(Y | X_i, θ) denotes the output probability distribution computed via (12). Note that EmotionDL incorporates more prior knowledge than label smoothing, which simply adds uniform noise to the other classes.

Combining both NodeDAT and EmotionDL, the overall loss function Φ'' of RGNN is computed as follows:

Φ'' = Φ' + Φ_D.   (18)

The detailed algorithm for training RGNN is presented in Algorithm 1.

Algorithm 1 The Training Algorithm of RGNN
Input: Training samples X and Ŷ, unlabelled testing samples X^T, learning rate η, number of epochs T, batch size B, other regularization hyper-parameters;
Output: The learned model parameters in RGNN;
Randomly initialize model parameters in RGNN using Xavier initialization [58];
Initialize adjacency matrix A based on (9) and (10);
for i = 1 : T do
    repeat
        Draw one batch of training samples X_B and Ŷ_B from X and Ŷ, respectively;
        Draw one batch of testing samples X^T_B from X^T;
        Compute degree matrix D based on (3);
        Compute normalized adjacency matrix S based on (5);
        Compute output representation Z based on (12);
        Use X_B and Ŷ_B to compute KL loss Φ' based on (17);
        Use X_B and X^T_B to compute domain loss Φ_D based on (13);
        Compute GRL scaling factor β;
        Update W_D ← W_D − η ∂Φ_D/∂W_D;
        Update W_O ← W_O − η ∂Φ'/∂W_O;
        Update W ← W − η (∂Φ'/∂W − β ∂Φ_D/∂W);
        Update A ← A − η (∂Φ'/∂A − β ∂Φ_D/∂A);
    until all samples in X have been drawn;
end for
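To illustrate (15) and (17), the sketch below hard-codes the SEED target distributions and computes the KL loss. It is our own reading: we assume the KL is taken from the converted targets toward the model's predictions, i.e., the direction that remains well-defined when some target entries are zero, and the L1 term on A is omitted.

```python
import torch
import torch.nn.functional as F

def seed_targets(y: torch.Tensor, eps: float) -> torch.Tensor:
    # SEED conversion per (15): 0=negative, 1=neutral, 2=positive;
    # each row places zero mass on the opposite class and sums to 1
    table = torch.tensor([[1 - eps, eps,     0.0    ],
                          [eps / 2, 1 - eps, eps / 2],
                          [0.0,     eps,     1 - eps]])
    return table[y]

def emotion_dl_loss(logits: torch.Tensor, y: torch.Tensor, eps: float) -> torch.Tensor:
    # KL divergence between converted targets y_hat and predictions p, cf. (17)
    y_hat = seed_targets(y, eps)              # (batch, C) target distributions
    log_p = F.log_softmax(logits, dim=-1)     # model log-probabilities
    # sum_c y_hat * (log y_hat - log p); zero-mass classes contribute nothing
    kl = y_hat * (y_hat.clamp_min(1e-12).log() - log_p)
    return kl.sum()

# usage sketch: loss = emotion_dl_loss(model_logits, labels, eps=0.2) + alpha * A.abs().sum()
```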
5 EXPERIMENTAL SETTINGS

In this section, we present the datasets, classification settings, and model settings in our experiments.
We conduct experiments on two public datasets, namely SEED and SEED-IV. The SEED dataset [7] comprises EEG data of 15 subjects (7 males) recorded in 62 channels using the ESI NeuroScan System.¹ The data were collected when participants watched emotion-eliciting movies covering three types of emotions, namely negative, neutral, and positive. Each movie lasts around 4 minutes. Three sessions of data were collected, and each session comprises 15 trials/movies for each subject. To make a fair comparison with existing studies, we directly use the pre-computed differential entropy (DE) features smoothed by linear dynamic systems (LDS) [7] in SEED. DE extends the idea of Shannon entropy and measures the complexity of a continuous random variable. In SEED, DE features are pre-computed over five frequency bands (delta, theta, alpha, beta, and gamma) for each second of EEG signals (without overlapping) in each channel.

The SEED-IV dataset [24] comprises EEG data of 15 subjects (7 males) recorded in 62 channels.² The recording device is the same as the one used in SEED. The data were collected when participants watched emotion-eliciting movies covering four types of emotions, namely neutral, sad, fear, and happy. Each movie lasts around 2 minutes. Three sessions of data were collected, and each session comprises 24 trials/movies for each subject. Similar to SEED, we adopt the pre-computed DE features from SEED-IV.

1. https://compumedicsneuroscan.com/
2. SEED-IV also contains eye movement data, which we do not use in our experiments.
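For reference, a DE feature of the kind used here can be sketched as follows: under the Gaussian assumption common in this literature [8], the differential entropy of a band-passed segment reduces to 0.5 ln(2πeσ²). The 200 Hz sampling rate and the band edges below (notably the 50 Hz gamma cutoff) are our assumptions for illustration, not specifications taken from the datasets.

```python
import numpy as np
from scipy.signal import butter, filtfilt

BANDS = {"delta": (1, 4), "theta": (4, 7), "alpha": (8, 13),
         "beta": (13, 30), "gamma": (30, 50)}  # Hz; upper gamma edge assumed

def de_features(eeg: np.ndarray, fs: float = 200.0) -> np.ndarray:
    """Differential entropy per channel and band for one window.
    eeg: (n_channels, n_samples); returns (n_channels, n_bands)."""
    feats = np.empty((eeg.shape[0], len(BANDS)))
    for k, (lo, hi) in enumerate(BANDS.values()):
        b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        filtered = filtfilt(b, a, eeg, axis=1)
        # DE of a Gaussian variable: 0.5 * ln(2 * pi * e * variance)
        feats[:, k] = 0.5 * np.log(2 * np.pi * np.e * filtered.var(axis=1))
    return feats

# usage: one second of 62-channel EEG at the assumed 200 Hz
feats = de_features(np.random.randn(62, 200))
```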
We closely follow prior studies to conduct both subject-dependent and subject-independent classifications on both SEED and SEED-IV to evaluate our model.

For SEED, we follow the experimental settings in [7], [12], [16] to evaluate our RGNN model using subject-dependent classification. Specifically, for each subject, we train our model using the first 9 trials as the training set and the remaining 6 trials as the testing set. We evaluate the model performance by using the accuracy averaged across all subjects over two sessions of EEG data [7]. Similarly, for subject-dependent classification on SEED-IV, we follow the experimental settings in [24], [38] to use the first 16 trials for training and the remaining 8 trials containing all emotions (two trials per emotion class) for testing. We evaluate our model using data from all three sessions [24].
For SEED, we follow the experimental settings in [12], [13], [16] to evaluate our RGNN model using subject-independent classification. Specifically, we adopt leave-one-subject-out cross-validation, i.e., during each fold, we train our model on 14 subjects and test on the remaining subject. We evaluate the model performance using the accuracy averaged across all test subjects over one session of EEG data [13]. Similarly, for SEED-IV, we follow the experimental settings in [38] to evaluate our RGNN model using subject-independent classification. We evaluate our model using data from all three sessions [38].
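The leave-one-subject-out protocol can be sketched in a few lines; `subject_ids` below is a hypothetical per-sample array of subject labels, not something provided by the datasets in this exact form.

```python
import numpy as np

def loso_splits(subject_ids: np.ndarray):
    # Leave-one-subject-out: each subject in turn forms the test set
    for s in np.unique(subject_ids):
        test_idx = np.flatnonzero(subject_ids == s)
        train_idx = np.flatnonzero(subject_ids != s)
        yield s, train_idx, test_idx

# usage sketch with 15 subjects: train on 14, test on the held-out one,
# then average the per-subject accuracies
# for s, tr, te in loso_splits(subject_ids):
#     model.fit(X[tr], y[tr]); accs.append(model.score(X[te], y[te]))
```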
For the hyper-parameters of RGNN in all experiments, we empirically set the number of graph convolutional layers L = 2, apply dropout [64] at the output fully-connected layer, and train with mini-batches. We use Adam [65] to optimize the model parameters via gradient descent. We only tune the output feature dimension d', the label noise level ε, the learning rate η, the L1 regularization factor α, and the L2 regularization for each experiment. Note that we only adopt NodeDAT in subject-independent classification experiments. Our model is publicly available.³ We compare our model with several baselines, whose results are all cited from published results [10], [12], [16], [38].

6 PERFORMANCE EVALUATIONS
In this section, we present model evaluation results and investigate the critical frequency bands and confusion matrices of our RGNN model.
Table 1 presents the subject-dependent classification accuracy of our RGNN model and all baselines on both SEED and SEED-IV. The performance on SEED in the individual delta, theta, alpha, beta, and gamma bands is reported as well. It is encouraging to see that our model achieves better performance than all baselines, including the state-of-the-art BiHDM, on both datasets when features from all frequency bands are used. In particular, our model performs better than DGCNN, another GNN-based model that leverages the topological structure of EEG channels. Besides the proposed two regularizers (see Table 3), the main performance improvement can be attributed to two factors: 1) our adjacency matrix incorporates the emotion-discriminative global inter-channel asymmetry relation between the left and right hemispheres; and 2) our model has less concern of overfitting by extending SGC, which is much simpler than the ChebNet [43] used in DGCNN.
3. https://github.com/zhongpeixiang/RGNN
TABLE 1: Subject-dependent classification accuracy (mean/std) on SEED and SEED-IV. The delta through "all bands" columns report SEED; the last column reports SEED-IV. A "–" in place of a mean marks a value unavailable here; a bare "-" marks a result not reported.

| Model | delta band | theta band | alpha band | beta band | gamma band | all bands (SEED) | all bands (SEED-IV) |
|---|---|---|---|---|---|---|---|
| SVM | 60.50/14.14 | 60.95/10.20 | 66.64/14.41 | 80.76/11.56 | 79.56/11.38 | 83.99/09.92 | 56.61/20.05 |
| GSCCA [59] | 63.92/11.16 | 64.64/10.33 | 70.10/14.76 | 76.93/11.00 | 77.98/10.72 | 82.96/09.95 | 69.08/16.66 |
| DBN [7] | 64.32/12.45 | 60.77/10.42 | 64.01/15.97 | 78.92/12.48 | 79.19/14.58 | 86.08/08.34 | 66.77/07.38 |
| STRNN [10] | –/12.27 | –/09.15 | –/12.99 | 83.41/10.16 | 69.61/15.65 | 89.50/07.63 | - |
| DGCNN [12] | 74.25/11.42 | 71.52/05.99 | 74.43/12.16 | 83.65/10.17 | 85.73/10.64 | 90.40/08.49 | 69.88/16.29 |
| BiDANN [16] | 76.97/10.95 | 75.56/07.88 | 81.03/11.74 | –/09.59 | 88.64/09.46 | 92.38/07.04 | 70.29/12.63 |
| EmotionMeter [24] | - | - | - | - | - | - | 70.58/17.01 |
| BiHDM [38] (SOTA) | - | - | - | - | - | 93.12/06.06 | 74.35/14.09 |
| RGNN (our model) | 76.17/07.91 | 72.26/07.25 | 75.33/08.85 | 84.25/12.54 | –/08.90 | –/05.95 | –/10.54 |
TABLE 2: Subject-independent classification accuracy (mean/std) on SEED and SEED-IV. The delta through "all bands" columns report SEED; the last column reports SEED-IV. A "–" in place of a mean marks a value unavailable here; a bare "-" marks a result not reported.

| Model | delta band | theta band | alpha band | beta band | gamma band | all bands (SEED) | all bands (SEED-IV) |
|---|---|---|---|---|---|---|---|
| SVM | 43.06/08.27 | 40.07/06.50 | 43.97/10.89 | 48.63/10.29 | 51.59/11.83 | 56.73/16.29 | 37.99/12.52 |
| TCA [60] | 44.10/08.22 | 41.26/09.21 | 42.93/14.33 | 43.93/10.06 | 48.43/09.73 | 63.64/14.88 | 56.56/13.77 |
| SA [61] | 53.23/07.47 | 50.60/08.31 | 55.06/10.60 | 56.72/10.78 | 64.47/14.96 | 69.00/10.89 | 64.44/09.46 |
| T-SVM [62] | - | - | - | - | - | 72.53/14.00 | - |
| DGCNN [12] | 49.79/10.94 | 46.36/12.06 | 48.29/12.28 | 56.15/14.01 | 54.87/17.53 | 79.95/09.02 | 52.82/09.23 |
| DAN [63] | - | - | - | - | - | 83.81/08.56 | 58.87/08.13 |
| BiDANN-S [16] | 63.01/07.49 | –/07.52 | –/09.50 | 73.59/09.12 | 73.72/08.67 | 84.14/06.87 | 65.59/10.39 |
| BiHDM [38] (SOTA) | - | - | - | - | - | –/07.53 | 69.03/08.66 |
| RGNN (our model) | –/06.87 | 60.69/05.79 | 60.84/07.57 | –/08.94 | –/08.10 | 85.30/06.72 | –/08.02 |
Similar to Table 1, Table 2 presents the subject-independentclassification results. When using features from all fre-quency bands, our model performs marginally worse thanBiHDM on SEED but much better than BiHDM on SEED-IV(nearly 5% improvement). In addition, our model achievesthe lowest standard deviation in accuracy compared to allbaselines on both datasets, showing the robustness of ourmodel against cross-subject variations.Comparing the results shown in Tables 1 and 2, wefind that the accuracy obtained in subject-independent set-tings is consistently worse than the accuracy obtained insubject-dependent settings by around 5% to 30% for everymodel. This finding is unsurprising because the variabilityof EEG signals across subjects makes subject-independentclassification more challenging. However, an interestingobservation is that the performance gap between thesetwo settings is gradually decreasing from around 27% onSEED and 19% on SEED-IV using SVM to around 9% onSEED and 6% on SEED-IV using our model. One possiblereason for the diminishing performance gap is that recentdeep learning models in subject-independent classificationsettings are becoming better at leveraging a large amountof data and learning subject-invariant EEG representations.This observation seems to indicate that transfer learningmay be a necessary tool for emotion recognition in cross-subject settings.
We further compare the performance of our model and all baselines on SEED using features from different frequency bands, as reported in Tables 1 and 2. In subject-dependent experiments, STRNN achieves the highest accuracy in the delta, theta, and alpha bands, BiDANN performs best in the beta band, and our model performs best in the gamma band. In subject-independent experiments, BiDANN-S achieves the highest accuracy in the theta and alpha bands, and our model performs best in the delta, beta, and gamma bands.

We investigate the critical frequency bands for emotion recognition. For both subject-dependent and subject-independent settings on SEED, we compare the performance of each model across different frequency bands. In general, most models including ours achieve better performance on the beta and gamma bands than on the delta, theta, and alpha bands, with the exception of STRNN, which performs the worst on the gamma band. This observation is consistent with the literature [7], [66]. One subtle difference between our model and other models is that our model performs consistently better in the gamma band than in the beta band, whereas other models perform comparably in both bands, indicating that the gamma band may be the most discriminative band for our model.
We present the confusion matrices of our model in Fig. 3. For SEED, our model can recognize positive and neutral emotions better than negative emotion in both classification settings. Comparing subject-independent classification (see Fig. 3(b)) to subject-dependent classification (see Fig. 3(a)), the performance of our model becomes relatively much worse at detecting negative emotion, indicating that participants are likely to generate distinct EEG patterns when experiencing negative emotion.

For SEED-IV, our model performs significantly better on sad emotion than all other emotions in both classification settings. Comparing subject-independent classification (see Fig. 3(d)) to subject-dependent classification (see Fig. 3(c)), the performance of our model becomes relatively much worse at detecting sad emotion, which is similar to SEED. We note that fear is the only emotion for which our model performs better in subject-independent classification than in subject-dependent classification. This finding indicates that participants watching horror movies may generate similar EEG patterns.

Fig. 3: Confusion matrices of RGNN. (a) Subject-dependent classification on SEED. (b) Subject-independent classification on SEED. (c) Subject-dependent classification on SEED-IV. (d) Subject-independent classification on SEED-IV.

TABLE 3: Ablation study for subject-independent classification accuracy (mean/std) on SEED and SEED-IV. Symbol "−" indicates the following component is removed. A "–" in place of a mean marks a value unavailable here.

| Model | SEED | SEED-IV |
|---|---|---|
| RGNN | 85.30/06.72 | –/08.02 |
| correlation-based adjacency matrix | 84.41/06.94 | 72.73/08.36 |
| coherence-based adjacency matrix | 84.02/07.05 | 72.26/08.48 |
| random adjacency matrix | 83.57/07.34 | 71.78/08.64 |
| − symmetric adjacency matrix | 83.69/07.92 | 72.02/08.66 |
| − global connection | 82.42/08.24 | 71.13/08.78 |
| global connection alternative 1 | 84.52/06.87 | 73.29/08.18 |
| global connection alternative 2 | 84.23/07.04 | 73.08/08.35 |
| − NodeDAT | 81.92/09.35 | 71.65/09.43 |
| DAT | 83.51/08.11 | 72.40/08.54 |
| − EmotionDL | 82.27/08.81 | 70.76/09.22 |
7 DISCUSSION
In this section, we conduct an ablation study and a sensitivity analysis for our RGNN model. We also analyze important brain regions and inter-channel relations for emotion recognition.
We conduct an ablation study to investigate the contribution of each key component in our model. Table 3 reports the subject-independent classification results on both datasets. We compared different initialization methods for the adjacency matrix and found that our distance-based method (see (9)) obtains slightly better performance than functional connectivity-based methods, i.e., correlation and coherence computed from the training dataset. A uniformly randomly initialized adjacency matrix in [0, 1] performs worst, indicating that properly initializing the adjacency matrix is beneficial to model performance. Our symmetric adjacency matrix design also proves to be useful in reducing overfitting and improving accuracy.

Removing the global connections causes a noticeable performance drop on both datasets, demonstrating the importance of global connections in modelling the EEG differential asymmetry. Moreover, we compared the performance of alternative sets of global connections. Alternative 1 has global indices that are nearer to the central region, i.e., (FP1, FP2), (AF3, AF4), (F3, F4), (FC3, FC4), (C3, C4), (CP3, CP4), (P3, P4), (PO5, PO6) and (O1, O2). Alternative 2 has global indices that are further from the central region, i.e., (FP1, FP2), (AF3, AF4), (F7, F8), (FT7, FT8), (T7, T8), (TP7, TP8), (P7, P8), (PO7, PO8) and (O1, O2). Both alternatives perform slightly worse than our model but much better than no global connection, indicating that they are able to model EEG asymmetry to a certain extent.

Our NodeDAT regularizer has a noticeable positive impact on the performance of our model, suggesting that domain adaptation is helpful in cross-subject classification. To further investigate the impact of our node-level domain classifier, we experimented with replacing NodeDAT with a generic domain classifier, DAT [22]. The clear performance gap between DAT and our RGNN model indicates that NodeDAT can better regularize the model by learning subject-invariant representations at the node level rather than the graph level. In addition, if NodeDAT is removed, the performance of our model has a greater variance, validating the importance of our NodeDAT regularizer in improving the robustness of RGNN against cross-subject variations.

Our EmotionDL regularizer improves the performance of our model by around 3% in accuracy on both datasets. This performance gain validates our assumption that participants are not always generating the intended emotions when watching emotion-eliciting stimuli. In addition, our EmotionDL regularizer can be easily adopted by other deep learning based emotion recognition models.

We analyze the performance of our model across varying L1 sparsity coefficient α (see (11)) and noise coefficient ε in EmotionDL (see (15) and (16)), as illustrated in Fig. 4. For subject-dependent classification, increasing α from 0 to 0.1 generally increases the model performance. However, for subject-independent classification, increasing α beyond a certain threshold, i.e., 0.01 in Fig. 4(a), decreases the model performance. One possible explanation for the difference in model behaviors is that there is much less training data in subject-dependent classification, which thus requires a stronger regularization to reduce overfitting, whereas for subject-independent classification, where the amount of training data is less of a concern, adding stronger regularization may introduce bias and hinder the learning efficacy.

As illustrated in Fig. 4(b), our model behaves consistently across different experimental settings with varying noise coefficient ε. Specifically, by increasing ε, the performance of our model first increases and then decreases. In particular, our model usually performs best when ε is set to 0.2, demonstrating the existence of label noise and the necessity of addressing it on both datasets. Introducing excessive noise in EmotionDL causes a performance drop, which is expected because excessive noise weakens the true learning signals.

Fig. 4: Classification accuracy of RGNN with varying hyper-parameters. (a) L1 sparsity coefficient α in (11). (b) Noise coefficient ε in (15) and (16).

We identify important brain regions for emotion recognition. Fig. 5 shows the heatmaps of the diagonal elements in our learned adjacency matrix A in subject-dependent classification on SEED-IV for each frequency band. The values are scaled to the [0, 1] interval for better visualization. Conceptually, as shown in (4), the diagonal values in A represent the contribution of each channel in computing the final EEG representation. It is clear from Fig. 5 that there is strong activation in the pre-frontal, parietal, and occipital regions for all frequency bands, indicating that these regions may be strongly related to emotion processing in the brain. Our finding is consistent with existing studies, which observed that asymmetrical frontal and parietal EEG activity may reflect changes in both valence and arousal [5], [27]. The synchronization between frontal and occipital regions has also been reported to be related to positive emotions [67]. In addition, there is strong activation in the temporal regions for the beta and gamma bands, which is consistent with [7]. The symmetry pattern on the activation maps of channels also indicates that the asymmetry in EEG activity between the left and right hemispheres is critical for emotion recognition.

Fig. 5: Activation maps learned from subject-dependent classification on SEED-IV. (a) Delta band. (b) Theta band. (c) Alpha band. (d) Beta band. (e) Gamma band.

We identify important inter-channel relations for emotion recognition. Fig. 6 shows the top 10 connections between channels having the largest edge weights in our adjacency matrix A. Note that all global connections remain among the strongest connections after A is learned, demonstrating again that global inter-channel relations are essential for emotion recognition. It is clear from Fig. 6(b) that the connection between the channel pair (FP1, AF3) is the strongest, followed by (F6, F8), (FP2, AF4) and (PO8, CB2), indicating that local inter-channel relations in the frontal region may be important for emotion recognition.

Fig. 6: Top 10 connections between channels in the adjacency matrix A, excluding global connections in (10) for better clarity. (a) Initialized A according to (9). (b) Learned and averaged A across five frequency bands in subject-dependent classification on both SEED and SEED-IV.

8 CONCLUSION
In this paper, we propose a regularized graph neural network for EEG-based emotion recognition. Our model is inspired by neuroscience theories on human brain organization and captures both local and global inter-channel relations in EEG signals. In addition, we propose two regularizers, namely NodeDAT and EmotionDL, to improve the robustness of our model against cross-subject EEG variations and noisy labels, respectively. Extensive experiments on two public datasets demonstrate the superior performance of our model over several competitive baselines and the state-of-the-art BiHDM in most experimental settings. Our model analysis shows that our proposed biologically inspired adjacency matrix and two regularizers contribute consistent and significant gains to the performance of our model. Investigations on the brain regions reveal that the pre-frontal, parietal, and occipital regions may be the most informative regions for emotion recognition. In addition, global inter-channel relations between the left and right hemispheres are important, and local inter-channel relations between (FP1, AF3), (F6, F8) and (FP2, AF4) may also provide useful information.

In the future, we plan to explore: 1) training a more discriminative domain classifier, e.g., by using more advanced classifiers or applying more sophisticated techniques to handle imbalanced samples between training and test sets, to help our model learn more domain-invariant EEG representations; and 2) applying our model to EEG signals that have a smaller number of channels. A simpler version of our model and more advanced regularizations may be necessary to avoid over-smoothing on these small graphs. In addition, data processing techniques that can improve the spatial resolution of EEG signals, e.g., spatial filtering, may be worth exploring.

ACKNOWLEDGMENTS
This research is supported by Alibaba Group throughAlibaba Innovative Research Program, Alibaba-NTU Sin-gapore Joint Research Institute (Alibaba-NTU-AIR2019B1),Singapore Ministry of Health under its National In-novation Challenge on Active and Confident Ageing(MOH/NIC/COG04/2017; MOH/NIC/HAIG03/2017), theNational Research Foundation, Singapore under its NRFInvestigatorship Programme (NRF-NRFI05-2019-0002) andunder its AI Singapore Programme (AISG Award No: AISG-GC-2019-003). Any opinions, findings and conclusions orrecommendations expressed in this material are those of theauthors and do not reflect the views of National ResearchFoundation, Singapore. R EFERENCES [1] S. M. Alarcao and M. J. Fonseca, “Emotions recognition using EEGsignals: a survey,”
[2] U. R. Acharya, V. K. Sudarshan, H. Adeli, J. Santhosh, J. E. Koh, and A. Adeli, “Computer-aided diagnosis of depression using EEG signals,” European Neurology, vol. 73, no. 5-6, pp. 329–336, 2015.
[3] P. Ekman and D. Keltner, “Universal facial expressions of emotion,” in U. Segerstrale and P. Molnar, Eds., Nonverbal Communication: Where Nature Meets Culture, pp. 27–46, 1997.
[4] A. Mehrabian, “Pleasure-arousal-dominance: A general framework for describing and measuring individual differences in temperament,” Current Psychology, vol. 14, no. 4, pp. 261–292, 1996.
[5] L. A. Schmidt and L. J. Trainor, “Frontal brain electrical activity (EEG) distinguishes valence and intensity of musical emotions,” Cognition & Emotion, vol. 15, no. 4, pp. 487–500, 2001.
[6] X.-W. Wang, D. Nie, and B.-L. Lu, “Emotional state classification from EEG data using machine learning approach,” Neurocomputing, vol. 129, pp. 94–106, 2014.
[7] W.-L. Zheng and B.-L. Lu, “Investigating critical frequency bands and channels for EEG-based emotion recognition with deep neural networks,” IEEE Transactions on Autonomous Mental Development, vol. 7, no. 3, pp. 162–175, 2015.
[8] L.-C. Shi, Y.-Y. Jiao, and B.-L. Lu, “Differential entropy feature for EEG-based vigilance estimation,” in the 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE, 2013, pp. 6627–6630.
[9] D. Zhang, L. Yao, X. Zhang, S. Wang, W. Chen, R. Boots, and B. Benatallah, “Cascade and parallel convolutional recurrent neural networks on EEG-based intention recognition for brain computer interface,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018, pp. 1703–1710.
[10] T. Zhang, W. Zheng, Z. Cui, Y. Zong, and Y. Li, “Spatial-temporal recurrent neural network for emotion recognition,” IEEE Transactions on Cybernetics, no. 99, pp. 1–9, 2018.
[11] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent neural networks,” in International Conference on Machine Learning, 2013, pp. 1310–1318.
[12] T. Song, W. Zheng, P. Song, and Z. Cui, “EEG emotion recognition using dynamical graph convolutional neural networks,” IEEE Transactions on Affective Computing, 2018, in press.
[13] W.-L. Zheng and B.-L. Lu, “Personalizing EEG-based affective models with transfer learning,” in the Twenty-Fifth International Joint Conference on Artificial Intelligence. AAAI Press, 2016, pp. 2732–2738.
[14] X. Chai, Q. Wang, Y. Zhao, Y. Li, D. Liu, X. Liu, and O. Bai, “A fast, efficient domain adaptation technique for cross-domain electroencephalography (EEG)-based emotion recognition,” Sensors, vol. 17, no. 5, p. 1014, 2017.
[15] Z. Lan, O. Sourina, L. Wang, R. Scherer, and G. R. Müller-Putz, “Domain adaptation techniques for EEG-based emotion recognition: a comparative study on two public datasets,” IEEE Transactions on Cognitive and Developmental Systems, vol. 11, no. 1, pp. 85–94, 2018.
[16] Y. Li, W. Zheng, Y. Zong, Z. Cui, T. Zhang, and X. Zhou, “A bi-hemisphere domain adversarial neural network model for EEG emotion recognition,” IEEE Transactions on Affective Computing, 2018, in press.
[17] E. Bullmore and O. Sporns, “Complex brain networks: graph theoretical analysis of structural and functional systems,” Nature Reviews Neuroscience, vol. 10, no. 3, p. 186, 2009.
[18] A. Fornito, A. Zalesky, and M. Breakspear, “Graph analysis of the human connectome: promise, progress, and pitfalls,” Neuroimage, vol. 80, pp. 426–444, 2013.
[19] E. Bullmore and O. Sporns, “The economy of brain network organization,” Nature Reviews Neuroscience, vol. 13, no. 5, p. 336, 2012.
[20] F. Wu, A. Souza, T. Zhang, C. Fifty, T. Yu, and K. Weinberger, “Simplifying graph convolutional networks,” in International Conference on Machine Learning. PMLR, 2019, pp. 6861–6871.
[21] R. C. Craddock, S. Jbabdi, C.-G. Yan, J. T. Vogelstein, F. X. Castellanos, A. Di Martino, C. Kelly, K. Heberlein, S. Colcombe, and M. P. Milham, “Imaging human connectomes at the macroscale,” Nature Methods, vol. 10, no. 6, p. 524, 2013.
[22] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, “Domain-adversarial training of neural networks,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 2096–2030, 2016.
[23] X. Zhu and X. Wu, “Class noise vs. attribute noise: A quantitative study,” Artificial Intelligence Review, vol. 22, no. 3, pp. 177–210, 2004.
[24] W.-L. Zheng, W. Liu, Y. Lu, B.-L. Lu, and A. Cichocki, “EmotionMeter: A multimodal framework for recognizing human emotions,” IEEE Transactions on Cybernetics, no. 99, pp. 1–13, 2018.
[25] R. Jenke, A. Peer, and M. Buss, “Feature extraction and selection for emotion recognition from EEG,” IEEE Transactions on Affective Computing, vol. 5, no. 3, pp. 327–339, 2014.
[26] C. Tang, D. Wang, A.-H. Tan, and C. Miao, “EEG-based emotion recognition via fast and robust feature smoothing,” in International Conference on Brain Informatics. Springer, 2017, pp. 83–92.
[27] Y.-P. Lin, C.-H. Wang, T.-P. Jung, T.-L. Wu, S.-K. Jeng, J.-R. Duann, and J.-H. Chen, “EEG-based emotion recognition in music listening,” IEEE Transactions on Biomedical Engineering, vol. 57, no. 7, pp. 1798–1806, 2010.
[28] M. Akin, “Comparison of wavelet transform and FFT methods in the analysis of EEG signals,” Journal of Medical Systems, vol. 26, no. 3, pp. 241–247, 2002.
[29] X. Wu, W.-L. Zheng, and B.-L. Lu, “Identifying functional brain connectivity patterns for EEG-based emotion recognition,” in the 9th International IEEE/EMBS Conference on Neural Engineering. IEEE, 2019, pp. 235–238.
[30] P. Li, H. Liu, Y. Si, C. Li, F. Li, X. Zhu, X. Huang, Y. Zeng, D. Yao, and Y. Zhang, “EEG based emotion recognition by combining functional connectivity network and local activations,” IEEE Transactions on Biomedical Engineering, vol. 66, no. 10, pp. 2869–2881, 2019.
[31] W. Wu, Z. Chen, X. Gao, Y. Li, E. N. Brown, and S. Gao, “Probabilistic common spatial patterns for multichannel EEG analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 3, pp. 639–653, 2014.
[32] W. Wu, S. Nagarajan, and Z. Chen, “Bayesian machine learning: EEG/MEG signal processing measurements,” IEEE Signal Processing Magazine, vol. 33, no. 1, pp. 14–36, 2015.
[33] F. Qi, Y. Li, and W. Wu, “RSTFC: A novel algorithm for spatio-temporal filtering and classification of single-trial EEG,” IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 12, pp. 3070–3082, 2015.
[34] W.-L. Zheng, J.-Y. Zhu, Y. Peng, and B.-L. Lu, “EEG-based emotion classification using deep belief networks,” in IEEE International Conference on Multimedia and Expo. IEEE, 2014, pp. 1–6.
[35] B. H. Kim and S. Jo, “Deep physiological affect network for the recognition of human emotions,” IEEE Transactions on Affective Computing, 2018, in press.
[36] X. Li, D. Song, P. Zhang, G. Yu, Y. Hou, and B. Hu, “Emotion recognition from multi-channel EEG data through convolutional recurrent neural network,” in IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, 2016, pp. 352–359.
[37] J. Li, Z. Zhang, and H. He, “Hierarchical convolutional neural networks for EEG-based emotion recognition,” Cognitive Computation, vol. 10, no. 2, pp. 368–380, 2018.
[38] Y. Li, W. Zheng, L. Wang, Y. Zong, L. Qi, Z. Cui, T. Zhang, and T. Song, “A novel bi-hemispheric discrepancy model for EEG emotion recognition,” arXiv preprint arXiv:1906.01704, 2019.
[39] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu, “A comprehensive survey on graph neural networks,” arXiv preprint arXiv:1901.00596, 2019.
[40] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The graph neural network model,” IEEE Transactions on Neural Networks, vol. 20, no. 1, pp. 61–80, 2008.
[41] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, “Spectral networks and locally connected networks on graphs,” arXiv preprint arXiv:1312.6203, 2013.
[42] F. R. Chung and F. C. Graham, Spectral Graph Theory. American Mathematical Soc., 1997, no. 92.
[43] M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” in Advances in Neural Information Processing Systems, 2016, pp. 3844–3852.
[44] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in International Conference on Learning Representations, 2017.
[45] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” in International Conference on Learning Representations, 2018.
[46] J. Huang, A. Gretton, K. Borgwardt, B. Schölkopf, and A. J. Smola, “Correcting sample selection bias by unlabeled data,” in Advances in Neural Information Processing Systems, 2007, pp. 601–608.
[47] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola, “A kernel two-sample test,” The Journal of Machine Learning Research, vol. 13, pp. 723–773, 2012.
[48] M. Ghifary, W. B. Kleijn, M. Zhang, D. Balduzzi, and W. Li, “Deep reconstruction-classification networks for unsupervised domain adaptation,” in European Conference on Computer Vision. Springer, 2016, pp. 597–613.
[49] G. Patrini, A. Rozza, A. Krishna Menon, R. Nock, and L. Qu, “Making deep neural networks robust to label noise: A loss correction approach,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1944–1952.
[50] S. Sukhbaatar, J. Bruna, M. Paluri, L. Bourdev, and R. Fergus, “Training convolutional networks with noisy labels,” arXiv preprint arXiv:1406.2080, 2014.
[51] B. Van Rooyen, A. Menon, and R. C. Williamson, “Learning with symmetric label noise: The importance of being unhinged,” in Advances in Neural Information Processing Systems, 2015, pp. 10–18.
[52] J. P. Brooks, “Support vector machines with the ramp loss and the hard margin loss,” Operations Research, vol. 59, no. 2, pp. 467–479, 2011.
[53] B.-B. Gao, C. Xing, C.-W. Xie, J. Wu, and X. Geng, “Deep label distribution learning with label ambiguity,” IEEE Transactions on Image Processing, vol. 26, no. 6, pp. 2825–2838, 2017.
[54] S. Kullback and R. A. Leibler, “On information and sufficiency,” The Annals of Mathematical Statistics, vol. 22, no. 1, pp. 79–86, 1951.
[55] R. Salvador, J. Suckling, M. R. Coleman, J. D. Pickard, D. Menon, and E. Bullmore, “Neurophysiological architecture of functional magnetic resonance images of human brain,” Cerebral Cortex, vol. 15, no. 9, pp. 1332–1342, 2005.
[56] S. Achard and E. Bullmore, “Efficiency and cost of economical brain functional networks,” PLoS Computational Biology, vol. 3, no. 2, p. e17, 2007.
[57] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, “How powerful are graph neural networks?” in International Conference on Learning Representations, 2019.
[58] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256.
[59] W. Zheng, “Multichannel EEG-based emotion recognition via group sparse canonical correlation analysis,” IEEE Transactions on Cognitive and Developmental Systems, vol. 9, no. 3, pp. 281–290, 2016.
[60] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang, “Domain adaptation via transfer component analysis,” IEEE Transactions on Neural Networks, vol. 22, no. 2, pp. 199–210, 2010.
[61] B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars, “Unsupervised visual domain adaptation using subspace alignment,” in IEEE International Conference on Computer Vision, 2013, pp. 2960–2967.
[62] R. Collobert, F. Sinz, J. Weston, and L. Bottou, “Large scale transductive SVMs,” The Journal of Machine Learning Research, vol. 7, pp. 1687–1712, 2006.
[63] H. Li, Y.-M. Jin, W.-L. Zheng, and B.-L. Lu, “Cross-subject emotion recognition using deep adaptation networks,” in International Conference on Neural Information Processing. Springer, 2018, pp. 403–413.
[64] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[65] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[66] W. J. Ray and H. W. Cole, “EEG alpha activity reflects attentional demands, and beta activity reflects emotional and cognitive processes,” Science, vol. 228, no. 4700, pp. 750–752, 1985.
[67] T. Costa, E. Rognoni, and D. Galati, “EEG phase synchronization during emotional response to positive and negative film stimuli,” Neuroscience Letters, vol. 406, no. 3, pp. 159–164, 2006.
Peixiang Zhong received the B.Eng. degree in Electrical and Electronic Engineering from Nanyang Technological University, Singapore, in 2016. He is currently a PhD candidate in the School of Computer Science and Engineering, Nanyang Technological University, Singapore. His research interests include affective computing, natural language processing, and machine learning.
Di Wang received the B.Eng. degree in Computer Engineering and the Ph.D. degree in Computer Science from Nanyang Technological University, Singapore, in 2003 and 2014, respectively. He is currently a Senior Research Fellow and the Research Manager in the Joint NTU-UBC Research Centre of Excellence in Active Living for the Elderly (LILY), Nanyang Technological University, Singapore. His research interests include computational intelligence, decision support systems, computational neuroscience, autonomous agents, affective computing, and ubiquitous computing.