Decoding EEG Brain Activity for Multi-Modal Natural Language Processing
Nora Hollenstein, Cedric Renggli, Benjamin Glaus, Maria Barrett, Marius Troendle, Nicolas Langer, Ce Zhang
Nora Hollenstein*, Cedric Renggli, Benjamin Glaus, Maria Barrett, Marius Troendle, Nicolas Langer and Ce Zhang
Department of Computer Science, ETH Zurich, Switzerland
Department of Computer Science, IT University of Copenhagen, Denmark
Department of Psychology, University of Zurich, Switzerland
* Corresponding author: Nora Hollenstein; [email protected]
Abstract
Until recently, human behavioral data from reading has mainly been of interest to researchers to understand human cognition. However, these human language processing signals can also be beneficial in machine learning-based natural language processing tasks. Using EEG brain activity for this purpose is largely unexplored as of yet. In this paper, we present the first large-scale study that systematically analyzes the potential of EEG brain activity data for improving natural language processing tasks, with a special focus on which features of the signal are most beneficial. We present a multi-modal machine learning architecture that learns jointly from textual input as well as from EEG features. We find that filtering the EEG signals into frequency bands is more beneficial than using the broadband signal. Moreover, for a range of word embedding types, EEG data improves binary and ternary sentiment classification and outperforms multiple baselines. For more complex tasks such as relation detection, further research is needed. Finally, EEG data proves to be particularly promising when limited training data is available.
Keywords:
EEG, frequency bands, brain activity, physiological data, natural language processing, machine learning, multimodal learning, neural networks
Recordings of brain activity play an important role in furthering our understanding of how human language works (Ling, Lee, Armstrong, & Nestor, 2019; Murphy, Wehbe, & Fyshe, 2018). The appeal and added value of using brain activity signals in linguistic research are intelligible (Stemmer & Connolly, 2012). Computational language processing models still struggle with basic linguistic phenomena that humans handle effortlessly (Ettinger, 2020). Combining insights from neuroscience and artificial intelligence will take us closer to human-level language understanding (McClelland, Hill, Rudolph, Baldridge, & Schütze, 2020). Moreover, numerous datasets of cognitive processing signals in naturalistic experiment paradigms with real-world language understanding tasks are becoming available (Alday, 2019; Kandylaki & Bornkessel-Schlesewsky, 2019).
Linzen (2020) advocates for the grounding of natural language processing (NLP) models in multi-modal settings to compare the generalization abilities of the models to human language learning. Developing models that learn from such multi-modal inputs efficiently is crucial to advance the generalization capabilities of state-of-the-art NLP models. Leveraging EEG and other physiological and behavioral signals seems especially appealing to model multi-modal human-like learning processes. Therefore, we investigate if and how we can take advantage of electrical brain activity signals to provide a human inductive bias for these NLP models. Our objective is to narrow the gap between human and machine language understanding.
Two popular NLP tasks are sentiment analysis and relation detection. The goal of both tasks is to automatically extract information from text. Sentiment analysis is the task of identifying and categorizing subjective information in text. For example, the sentence “This movie is great fun.” contains a positive sentiment, while the sentence “This movie is terribly boring.” contains a negative sentiment.
Relation detection is the task of identifying semantic relationships between entities in the text. In the sentence “Albert Einstein was born in Ulm.”, the relation Birth Place holds between the entities “Albert Einstein” and “Ulm”. NLP researchers have made great progress in building computational models for these tasks (Barnes, Klinger, & im Walde, 2017; Rotsztejn, Hollenstein, & Zhang, 2018). However, these machine learning (ML) models still lack core language understanding skills that humans perform effortlessly (Barnes, Velldal, & Øvrelid, 2020; Poria, Hazarika, Majumder, & Mihalcea, 2020). Barnes, Øvrelid, and Velldal (2019) find that sentiment models struggle with different linguistic elements such as negations or sentences containing mixed sentiment towards several target aspects.
Leveraging Physiological Data for Natural Language Processing
In recent years, natural language processing researchers have increasingly leveraged human language processing signals from physiological and neuroimaging recordings for both augmenting and evaluating machine learning-based NLP models (Artemova, Bakarov, Artemov, Burnaev, & Sharaev, 2020). The approaches taken in those studies can be categorized as encoding or decoding cognitive processing signals. Encoding and decoding are complementary operations: encoding uses stimuli to predict brain activity, while decoding uses the brain activity to predict information about the stimuli (Naselaris, Kay, Nishimoto, & Gallant, 2011). In the present study, we focus on the decoding process for predicting information about the text input from human brain activity.
Until now, mostly eye tracking and functional magnetic resonance imaging (fMRI) signals have been leveraged for this purpose (e.g., Fyshe, Talukdar, Murphy, and Mitchell (2014)). On the one hand, fMRI recordings provide insights into brain activity with a high spatial resolution, which furthers the research of localization of language-related cognitive processes. For instance, Schwartz, Toneva, and Wehbe (2019) fine-tuned a contextualized language model with brain activity data, which yields better predictions of brain activity and does not harm the model’s performance on downstream NLP tasks. On the other hand, eye tracking makes it possible to objectively and accurately record visual behavior with high temporal resolution at low cost. Eye tracking is widely used in psycholinguistic studies and it is common to extract well-established theory-driven features (Barrett, Bingel, Keller, & Søgaard, 2016; Hollenstein, Barrett, & Beinborn, 2020; Mathias, Kanojia, Mishra, & Bhattacharyya, 2020). These established metrics are derived from a large body of psycholinguistic research.
Reading times of words in a sentence depend on the amount of information the words convey.
This correlation can be observed in eye tracking data, but also in electroencephalography (EEG) data (Frank, Otten, Galli, & Vigliocco, 2015). Thus, eye tracking and EEG are complementary measures of cognitive load. Compared to eye tracking, EEG may be more cumbersome to record and requires more expertise. However, while eye movements indirectly reflect the cognitive load of text processing, EEG contains more direct and comprehensive information about language processing in the human brain. For instance, word predictability and semantic similarity show distinct patterns of brain activity during language comprehension (Ettinger, 2020; Frank & Willems, 2017). The word representations used by neural networks and brain activity observed via the process of subjects reading a story can be aligned (Wehbe, Vaswani, Knight, & Mitchell, 2014). Moreover, EEG effects that reflect syntactic processes can also be found in computational models of grammar (Hale, Dyer, Kuncoro, & Brennan, 2018).
EEG is a non-invasive method to measure electrical brain surface activity. The synchronized activity of neurons in the brain produces electrical currents. The resulting voltage fluctuations can be recorded with external electrodes on the scalp. The excellent temporal resolution of EEG allows one to track a brain response in milliseconds and therefore makes it uniquely suited to research concerning language processing (Beres, 2017). Because the complexity and the low signal-to-noise ratio of EEG data make it very challenging to isolate specific cognitive processes, more and more researchers are relying on machine learning techniques to decode the EEG signals (Affolter, Egressy, Pascual, & Wattenhofer, 2020; Pfeiffer, Hollenstein, Zhang, & Langer, 2020; P. Sun, Anumanchipalli, & Chang, 2020). To isolate certain cognitive functions, EEG signals can be split into frequency bands. For instance, effects related to semantic violations can be found within the gamma frequency range ( ∼ −
100 Hz; Penolazzi, Angrilli, and Job (2009)).
The amount of cognitive processes and noise included in brain activity signals makes feature engineering much harder on fMRI and EEG signals than on eye tracking. Machine learning studies leveraging fMRI and EEG data also rely on standard preprocessing steps such as motion correction and spatial smoothing and then use data-driven approaches to reduce the number of features, e.g., principal component analysis (Beinborn, Abnar, & Choenni, 2019). Moreover, fMRI data is most often used over full sentences or longer text spans, since the extraction of word-level signals is more complex than for EEG due to the lower temporal resolution and hemodynamic delay. Compared to fMRI and other neuroimaging techniques, EEG can be recorded with a very high temporal resolution. This allows for more fine-grained language understanding experiments on the word level, which is crucial for applications in NLP. Thus, in this work, we analyze the potential and benefits of decoding EEG signals for NLP.
EEG in combination with eye tracking has become an important tool for studying the temporal dynamics of naturalistic reading (Dimigen, Sommer, Hohlfeld, Jacobs, & Kliegl, 2011; Hollenstein et al., 2018; Sato & Mizuhara, 2018). In this context, fixation-related potentials (FRPs), which are the evoked electrical responses time-locked to the onset of fixations, have been studied and have received broad interest from naturalistic imaging researchers for free viewing studies. In naturalistic reading paradigms, FRPs allow the study of the neural dynamics of how novel information from currently fixated text affects the ongoing language comprehension process.
As of yet, the related work relying on EEG signals for NLP is very limited. Murphy and Poesio (2010) showed that semantic categories can be detected in simultaneous EEG recordings.
Muttenthaler, Hollenstein, and Barrett (2020) used EEG signals to train an attention mechanism, similar to Barrett, Bingel, Hollenstein, Rei, and Søgaard (2018), who used eye tracking signals to induce machine attention with human attention. However, EEG has not yet been leveraged for higher-level semantic tasks such as sentiment analysis or relation detection. Deep learning techniques have been applied to decode EEG signals (Craik, He, & Contreras-Vidal, 2019), especially for brain-computer interface technologies, e.g., Nurse et al. (2016). However, this avenue has not yet been explored when leveraging EEG signals to enhance NLP models. Through decoding EEG signals, we aim to explore the specific mental tasks occurring during language understanding, more specifically, during English sentence comprehension.
Contributions
More than a practical application of improving real-world NLP tasks, our main goal is to explore to what extent there is additional linguistic processing information in the EEG signal to complement the text input. In the present study, we investigate for the first time the potential of leveraging EEG signals for augmenting NLP models. With the purpose of making language decoding studies from brain activity more interpretable, we follow the recommendations of Gauthier and Ivanova (2018): (1) We commit to a specific mechanism and task, and (2) subdivide the input feature space including theoretically founded preprocessing steps. We investigate the impact of enhancing a neural network architecture for two common NLP tasks with a range of EEG features. We propose a multi-modal network capable of processing textual features and brain activity features simultaneously. We employ two different well-established types of neural network architectures for decoding the EEG signals throughout the entire study. To analyze the impact of different EEG features, we perform experiments on sentiment analysis as a binary or ternary sentence classification task, and relation detection as a multi-class and multi-label classification task. We investigate the effect of augmenting NLP models with neurophysiological data in an extensive study while accounting for various dimensions:
1. We present a comparison of a purely data-driven approach of feature extraction for machine learning, using full broadband EEG signals, to a more theoretically motivated approach, splitting the word-level EEG features into frequency bands.
2. We develop two EEG decoding components for our multi-modal ML architecture: a recurrent and a convolutional component.
3. We contrast the effects of these EEG features on multiple word representation types commonly used in NLP.
4. We compare the improvements of EEG features as a function of various training data sizes.
5. We analyze the impact of the EEG features on varying classification complexity: from binary classification to multi-class and multi-label tasks.
This comprehensive study is completed by comparing the impact of the decoded EEG signals not only to a text-only baseline, but also to baselines augmented with eye tracking data as well as random noise. In the next section, we describe the materials used in this study and the multi-modal machine learning architecture. Thereafter, we present the results of the NLP tasks and discuss the dimensions defined above.
With the purpose of augmenting natural language processing tasks with brain activity signals, we leverage the Zurich Cognitive Language Processing Corpus (ZuCo; Hollenstein et al. (2018); Hollenstein, Troendle, Zhang, and Langer (2020)). ZuCo is an openly available dataset of simultaneous EEG and eye tracking data from subjects reading naturally occurring English sentences. This corpus consists of two datasets, ZuCo 1.0 and ZuCo 2.0, which contain the same type of recordings. We select the normal reading paradigms from both datasets, in which participants were instructed to read English sentences at their own pace with no specific task beyond reading comprehension. The participants read one sentence at a time, using a control pad to move to the next sentence. This setup facilitated the naturalistic reading paradigm.
Descriptive statistics about the datasets used in this work are presented in Table 1.
A detailed description of the entire ZuCo dataset, including individual reading speed, lexical performance, average word length, average number of words per sentence, skipping proportion on word level, and effect of word length on skipping proportion, can be found in Hollenstein et al. (2018). In the following section, we will describe the methods relevant to the subset of the ZuCo data used in the present study.
For ZuCo 1.0, data were recorded from 12 healthy adults (between 22 and 54 years old; all right-handed; 5 female subjects). For ZuCo 2.0, data were recorded from 18 healthy adults (between 23 and 52 years old; 2 left-handed; 10 female subjects). All participants are native speakers of English, originating from Australia, Canada, UK, USA or South Africa. In addition, all subjects completed the standardized LexTALE test to assess their vocabulary and language proficiency (Lexical Test for Advanced Learners of English; Lemhöfer and Broersma (2012)). All participants gave written consent for their participation and the re-use of the data prior to the start of the experiments. The study was approved by the Ethics Commission of the University of Zurich.
During the normal reading tasks included in the ZuCo corpus, the participants were instructed to read the sentences naturally, without any special instructions or any specific task other than comprehension. The control condition for this task consisted of single-choice reading comprehension questions about the content of the previous sentence. 12% of randomly selected sentences were followed by a control question on a separate screen.
The reading materials recorded for the ZuCo corpus contain sentences from movie reviews from the Stanford Sentiment Treebank (Socher et al., 2013) and Wikipedia articles from a dataset provided by Culotta, McCallum, and Betz (2006). Table 2 presents a few examples of the sentences read during the experiments.
In this section, we present the EEG data extracted from the ZuCo corpus for this work. We describe the acquisition and preprocessing procedures as well as the feature extraction.

                   | ZuCo 1.0 Task SR | ZuCo 1.0 Task NR | ZuCo 2.0 Task NR
Participants       | 12               | 12               | 18
Sentences          | 400              | 300              | 349
Words              | 7,079            | 6,386            | 6,828
Unique word types  | 3,080            | 2,657            | 2,412
Sentiment Analysis | ✓                | -                | -
Relation Detection | -                | ✓                | ✓

Table 1: Details about the ZuCo tasks used in this chapter. In Task SR participants read sentences from movie reviews, and in Task NR sentences from Wikipedia articles.
EEG Acquisition and Preprocessing
High-density EEG data were recorded using a 128-channel EEG Geodesic Hydrocel system (Electrical Geodesics, Eugene, Oregon) with a sampling rate of 500 Hz. The recording reference was at Cz (vertex of the head), and the impedances were kept below 40 kΩ. All analyses were performed using MATLAB 2018b (The MathWorks, Inc., Natick, Massachusetts, United States). EEG data was automatically preprocessed using the current version (2.4.3) of the MATLAB toolbox Automagic (Pedroni, Bahreini, & Langer, 2019).
Our preprocessing pipeline consisted of the following steps. First, 13 of the 128 electrodes in the outermost circumference (chin and neck) were excluded from further processing as they capture little brain activity and mainly record muscular activity. Additionally, 10 EOG electrodes were separated from the data and were not used for further analysis, yielding a total number of 105 EEG electrodes. Subsequently, bad channels were detected by the algorithms implemented in the EEGLAB plugin clean_rawdata. A channel was defined as a bad electrode when recorded data from that electrode was correlated at less than 0.85 to an estimate based on other channels. Furthermore, a channel was defined as bad if it had more line noise relative to its signal than all other channels (4 standard deviations). Finally, if a channel had a flat-line longer than 5 seconds, it was considered bad. These bad channels were automatically removed and later interpolated using a spherical spline interpolation (EEGLAB function eeg_interp.m). The interpolation was performed as a final step before the automatic quality assessment of the EEG files (see below). Next, data was filtered using a high-pass filter (-6 dB cut-off: 0.5 Hz) and a 60 Hz notch filter was applied to remove line noise artifacts. Subsequently, independent component analysis (ICA) was performed. Components reflecting artifactual activity were classified by the pre-trained classifier MARA (Winkler, Haufe, & Tangermann, 2011).
Each component being classified with a probability rating > . µ V. After this, the pipeline automatically assessed the quality of the resulting EEG files based on four criteria: A data file was marked as bad-quality EEG and not included in the analysis if (1) the proportion of high-amplitude data points in the signals ( > µ V) was larger than 0.20; (2) more than 20% of time points showed a variance larger than 15 microvolts across channels; (3) 30% of the channels showed high variance ( > µ V); (4) the ratio of bad channels was higher than 0.3.
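The four file-level quality criteria can be read as a simple rejection rule. The following is a minimal sketch, not the Automagic implementation: the function name and its arguments are hypothetical, the four ratios are assumed to have been computed upstream with the amplitude and variance thresholds of the pipeline, and treating criterion (3) as "at least 30%" is an assumption.

```python
def is_bad_quality(high_amp_ratio, high_time_var_ratio,
                   high_chan_var_ratio, bad_chan_ratio):
    """Return True if an EEG file fails any of the four quality criteria.

    All arguments are proportions in [0, 1] computed upstream
    (hypothetical interface, not part of Automagic).
    """
    return (
        high_amp_ratio > 0.20          # (1) too many high-amplitude data points
        or high_time_var_ratio > 0.20  # (2) too many time points with high variance across channels
        or high_chan_var_ratio >= 0.30 # (3) too many high-variance channels (assumed inclusive)
        or bad_chan_ratio > 0.3        # (4) too many bad channels
    )
```

A file is excluded as soon as a single criterion is violated, so the checks are combined with a logical OR.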
EEG Features
The fact that ZuCo provides simultaneous EEG and eye tracking data highly facilitates the extraction of word-level brain activity signals. Dimigen et al. (2011) demonstrated that EEG indices of semantic processing can be obtained in natural reading and compared to eye movement behavior. The eye tracking data provides millisecond-accurate fixation times for each word. Therefore, we were able to obtain the brain activity during all fixations of a word by computing fixation-related potentials aligned to the onsets of the fixation on a given word.
In this work, we select a range of EEG features with a varying degree of theory-driven and data-driven feature extraction. We define the broadband EEG signal, i.e., the full EEG signal from 0.5-50 Hz, as the averaged brain activity over all fixations of a word, i.e., its total reading time. We compare the full EEG features, a data-driven feature extraction approach, to frequency band features, a more theoretically motivated approach. Different neurocognitive aspects of language processing during reading are associated with brain oscillations at various frequencies. These frequency ranges are known to be associated with certain cognitive functions. We split the EEG signal into four frequency bands to limit the bandwidth of the EEG signals to be analyzed. The frequency bands are fixed ranges of wave frequencies and amplitudes over a time scale: theta (4-8 Hz), alpha (8.5-13 Hz), beta (13.5-30 Hz), and gamma (30.5-49.5 Hz). We elaborate on cognitive and linguistic functions of each of these frequency bands in Section 4.1.

Task | Example Sentence | Label(s)
Binary/ternary sentiment analysis | “The film often achieves a mesmerizing poetry.” | Positive
Binary/ternary sentiment analysis | “Flaccid drama and exasperatingly slow journey.” | Negative
Ternary sentiment analysis | “A portrait of an artist.” | Neutral
Relation detection | “He attended Wake Forest University.” | Education
Relation detection | “She attended Beverly Hills High School, but left to become an actress.” | Education, Job Title

Table 2: Example sentences for all three NLP tasks used in this study.

Footnote: http://sccn.ucsd.edu/wiki/Plugin_list_process

In spite of the high inter-subject variability in EEG data, previous research on machine learning applications (Foster, Dharmaretnam, Xu, Fyshe, & Tzanetakis, 2018; Hollenstein et al., 2019) has shown that averaging over the EEG features of all subjects yields results almost as good as the single best-performing subjects. Hence, we also average the EEG features over all subjects to obtain more robust features. Finally, for each word in each sentence, the EEG features consist of a vector of 105 dimensions (one value for each EEG channel). For training the ML models, we split all available sentences into sets of 80% for training and 20% for testing to ensure that the test data is unseen during training.
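To make the last two steps concrete, the following is a minimal sketch of averaging word-level EEG vectors over subjects and of an 80/20 sentence split. It is an illustration under stated assumptions rather than the code used in the study: the function names, the list-based data layout, the shuffling and the fixed seed are all hypothetical.

```python
import random

N_CHANNELS = 105  # one value per EEG channel, as described in the text


def average_over_subjects(subject_vectors):
    """Average the EEG vectors of one word across subjects, channel by channel.

    subject_vectors: list of equal-length lists, one per subject who
    fixated the word (assumed layout).
    """
    n_subjects = len(subject_vectors)
    return [sum(values) / n_subjects for values in zip(*subject_vectors)]


def train_test_split(sentences, test_fraction=0.2, seed=0):
    """Shuffle sentences and hold out a fraction for testing (80/20 by default)."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility (assumption)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

Splitting at the sentence level keeps all words of a test sentence out of the training set, which is what guarantees that the test data is unseen during training.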
In the following, we describe the eye tracking data recorded for the Zurich Cognitive Language Processing Corpus. In this study, we focus on decoding EEG data, but we use eye movement data to compute an additional baseline. As mentioned previously, augmenting ML models with eye tracking yields consistent improvements across a range of NLP tasks, including sentiment analysis and relation extraction (Hollenstein et al., 2019; Long, Lu, Xiang, Li, & Huang, 2017; Mishra, Kanojia, Nagar, Dey, & Bhattacharyya, 2017). Since the ZuCo datasets provide simultaneous EEG and eye tracking recordings, we leverage the available eye tracking data to augment all NLP tasks with eye tracking features as an additional multi-modal baseline based on cognitive processing features.
Eye Tracking Acquisition and Preprocessing
Eye movements were recorded with an infrared video-based eye tracker (EyeLink 1000 Plus, SR Research) at a sampling rate of 500 Hz. The EyeLink 1000 tracker processes eye position data, identifying saccades, fixations and blinks. Fixations were defined as time periods without saccades during which the visual gaze is fixed on a single location. The data therefore consists of (x,y) gaze location entries for individual fixations mapped to word boundaries. A fixation lasts around 200–250 ms (with large variations). Fixations shorter than 100 ms were excluded, since these are unlikely to reflect language processing (Sereno & Rayner, 2003). Fixation duration depends on various linguistic effects, such as word frequency, word familiarity and syntactic category (Clifton, Staub, & Rayner, 2007).
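The duration-based exclusion rule is easy to express in code. The following is a minimal sketch, assuming each fixation is a dictionary with a hypothetical `duration_ms` field; the actual ZuCo data structures differ.

```python
MIN_FIXATION_MS = 100  # fixations shorter than this are excluded


def filter_fixations(fixations):
    """Keep only fixations lasting at least MIN_FIXATION_MS milliseconds.

    fixations: list of dicts with a 'duration_ms' key (assumed layout).
    """
    return [f for f in fixations if f["duration_ms"] >= MIN_FIXATION_MS]
```

For example, a sequence with durations of 80, 100 and 230 ms would keep only the last two fixations.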
Eye Tracking Features
The following features were extracted from the raw data: (1) gaze duration (GD), the sum of all fixations on the current word in the first-pass reading before the eye moves out of the word; (2) total reading time (TRT), the sum of all fixation durations on the current word, including regressions; (3) first fixation duration (FFD), the duration of the first fixation on the prevailing word; (4) go-past time (GPT), the sum of all fixations prior to progressing to the right of the current word, including regressions to previous words that originated from the current word; (5) number of fixations (nFix), the total number of fixations on the current word. We use these five features provided in the ZuCo data, which cover the extent of the human reading process.
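The five definitions above can be illustrated on a toy fixation sequence. The sketch below is not the ZuCo extraction code: the function name and the `(word_index, duration_ms)` input format are assumptions, and first-pass reading is simplified to the first uninterrupted run of fixations on the word.

```python
def reading_features(fixations, word):
    """Compute GD, TRT, FFD, GPT and nFix for one word.

    fixations: list of (word_index, duration_ms) tuples in temporal
    order (assumed format). Returns None if the word was never fixated.
    """
    on_word = [d for w, d in fixations if w == word]
    if not on_word:
        return None
    # Position of the first fixation that lands on the word.
    first = next(i for i, (w, _) in enumerate(fixations) if w == word)

    # Gaze duration: first uninterrupted run of fixations on the word.
    gd = 0
    for w, d in fixations[first:]:
        if w != word:
            break
        gd += d

    # Go-past time: everything from the first fixation on the word until
    # the eyes move to a word further right, regressions included.
    gpt = 0
    for w, d in fixations[first:]:
        if w > word:
            break
        gpt += d

    return {
        "FFD": fixations[first][1],
        "GD": gd,
        "TRT": sum(on_word),
        "GPT": gpt,
        "nFix": len(on_word),
    }
```

With the sequence `[(0, 200), (1, 180), (1, 120), (0, 150), (1, 100), (2, 220)]`, word 1 receives two first-pass fixations, a regression to word 0, and one re-fixation, so its gaze duration is shorter than its total reading time, and its go-past time additionally contains the regression.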
Figure 1: (left) Label distribution of the 11 relation types in the relation detection dataset (Nationality, Job Title, Death, Wife, Political Affiliation, Birth Place, Employer, Visited, Founder, Awarded, Education; x-axis: number of sentences). (right) Number of relation types per sentence in the relation detection dataset (x-axis: number of sentences; y-axis: number of relations).
To increase the robustness of the signal, analogously to the EEG features, the eye tracking features are averaged over all subjects (Barrett & Hollenstein, 2020). This results in a feature vector of five dimensions for each word in a sentence. Training and test data were split in the same fashion as the EEG data.
In this section, we describe the natural language processing tasks we use to evaluate the multi-modal ML models. As usual in supervised machine learning, the goal is to learn a mapping from given input features to an output space to predict the labels as accurately as possible. The tasks we consider in our work do not differ much in the input definition as they consist of three sequence classification tasks for information extraction from text. The goal of a sequence classification task is to assign the correct label(s) to a given sentence. The input for all tasks consists of tokenized sentences, which we augment with additional features, i.e., EEG or eye tracking. The labels to predict vary across the three chosen tasks resulting in varying task difficulty. Table 2 provides examples for all three tasks.
The objective of sentiment analysis is to interpret subjective information in text. More specifically, we define sentiment analysis as a sentence-level classification task. We run our experiments on both binary (positive/negative) and ternary (+ neutral) sentiment classification. For this task, we leverage only the sentences recorded in the first task of ZuCo 1.0, since they are part of the Stanford Sentiment Treebank (Socher et al., 2013), and thus directly provide annotated sentiment labels for training the ML models. For the first task, binary sentiment analysis, we use the 263 positive and negative sentences. For the second task, ternary sentiment analysis, we additionally use the neutral sentences, resulting in a total of 400 sentences.
Relation classification is the task of identifying the semantic relation holding between two entities in text. The ZuCo corpus also contains Wikipedia sentences with relation types such as Job Title, Nationality and Political Affiliation. The sentences in ZuCo 1.0 and ZuCo 2.0, from the normal reading experiment paradigms, include 11 relation types (Figure 1). In order to further increase the task complexity, we treat this task differently than the sentiment analysis tasks. Since any sentence can include zero, one or more of the relevant semantic relations (see example in Table 2), we treat relation detection as a multi-class and multi-label sequence classification task. Concretely, every sample can be assigned to any possible combination out of the 11 classes including none of them. Removing duplicates between ZuCo 1.0 and ZuCo 2.0 resulted in 594 sentences used for training the models. Figure 1 illustrates the label and relation distribution of the sentences used to train the relation detection task.

Figure 2: The multi-modal machine learning architecture for the EEG-augmented models. Word embeddings of dimension d are the input for the textual component (yellow); EEG features of dimension e for the cognitive component (blue). The text component consists of recurrent layers followed by two dense layers with dropout. We test multiple architectures for the EEG component (see Figure 3). Finally, the hidden states of both components are concatenated and followed by a final dense layer with softmax activation for classification (green).

We present a multi-modal neural architecture to augment the NLP sequence classification tasks with any other type of data.
Although combining different modalities or types of information for improving performance seems an intuitively appealing task, it is often challenging to combine the varying levels of noise and conflicts between modalities in practice.
Previous works using physiological data for improving NLP tasks mostly implement early fusion multi-modal methods, i.e., directly concatenating the textual and cognitive embeddings before inputting them into the network. For example, Hollenstein and Zhang (2019), Barrett, González-Garduño, Frermann, and Søgaard (2018) and Mishra et al. (2017) concatenate textual input features with eye-tracking features to improve NLP tasks such as entity recognition, part-of-speech tagging and sentiment analysis, respectively. Concatenating the input features at the beginning in only one joint decoder component aims at learning a joint decoder across all modalities at risk of implicitly learning different weights for each modality. However, recent multi-modal machine learning work has shown the benefits of late fusion mechanisms (Ramachandram & Taylor, 2017). Do, Nguyen, Tsiligianni, Cornelis, and Deligiannis (2017) argue in favor of concatenating the hidden layers instead of concatenating the features at input time. Such multi-modal models have been successfully applied in other areas, mostly combining inputs across different domains, for instance, learning speech reconstruction from silent videos (Ephrat, Halperin, & Peleg, 2017), or for text classification using images (Kiela, Grave, Joulin, & Mikolov, 2018). Tsai et al. (2019) train a multi-modal sentiment analysis model from natural language, facial gestures, and acoustic behaviors.
Hence, we adopted this strategy in our work. We present multi-modal models for various NLP tasks, combining the learned representations of all input types (i.e., text and EEG features) in a late fusion mechanism before conducting the final classification.
Purposefully, this enables the model to learn independent decoders for each modality before fusing the hidden representations together. In the present study, we investigate the proposed multi-modal machine learning architecture, which learns simultaneously from text and from cognitive data such as eye tracking and EEG signals.
In the following, we first describe the uni-modal and multi-modal baseline models we use to evaluate the results. Thereafter, we present the multi-modal NLP models that jointly learn from text and brain activity data.
Uni-Modal Text Baselines
For each of the tasks presented above, we train uni-modal models on textual features only. To represent the words numerically, we use word embeddings. Word embeddings are vector representations of words, computed so that words with similar meaning have a similar representation. To analyze the interplay between various types of word embeddings and EEG data, we use the following three embedding types typically used in practice: (1) randomly initialized embeddings trained at run time on the sentences provided, (2) GloVe pre-trained embeddings based on word co-occurrence statistics (Pennington, Socher, & Manning, 2014), and (3) BERT pre-trained contextual embeddings (Devlin, Chang, Lee, & Toutanova, 2019).
The randomly initialized word representations define word embeddings as n-by-d matrices, where n is the vocabulary size, i.e., the number of unique words in our dataset, and d is the embedding dimension. Each value in that matrix is randomly initialized and will then be trained together with the neural network parameters. We set d = 32. This type of embeddings does not benefit from pre-training on large text collections and hence is known to perform worse than GloVe or BERT embeddings. We include them in our study to better isolate the impact of the EEG features and to limit the learning of the model on the text it is trained on.
Non-contextual word embeddings such as GloVe encode each word in a fixed vocabulary as a vector. The purpose of these vectors is to encode semantic information about a word, such that similar words result in similar embedding vectors. We use the GloVe embeddings of d = 300 dimensions that are trained on 6 billion words. The contextualized BERT embeddings were pre-trained on multiple layers of transformer models with self-attention (Vaswani et al., 2017). Given a sentence, BERT encodes each word into a feature vector of dimension d = 768, which incorporates information from the word's context in the sentence.

The uni-modal text baseline model consists of a first layer taking the embeddings as an input, followed by a bidirectional Long Short-Term Memory network (LSTM; Hochreiter & Schmidhuber, 1997), then two fully-connected dense layers with dropout between them, and finally a prediction layer using softmax activation. This corresponds to a single component of the multi-modal architecture, i.e., the top component in Figure 2. Following best practices (e.g., C. Sun, Qiu, Xu, & Huang, 2019), we set the weights of BERT to be trainable, similarly to the randomly initialized embeddings. This process of adjusting the initialized weights of a pre-trained feature extractor during training, in our case BERT, is commonly known as fine-tuning in the literature (Howard & Ruder, 2018). In contrast, the parameters of the GloVe embeddings are fixed to the pre-trained weights and thus do not change during training.
To analyze the effectiveness of our multi-modal architecture with EEG signals properly, we compare it not only to uni-modal text baselines, but also to multi-modal baselines. These use the same architecture described in the next section for the EEG models, but replace the features of the second modality with the following alternatives:

(1) We implement a gaze-augmented baseline, where the five eye tracking features described in Section 2.1.4 are combined with the word embeddings by adding them to the multi-modal model in the same manner as the EEG features, as vectors of dimension 5. The purpose of this baseline is to allow a comparison of multi-modal models learning from two different types of physiological features. Since the benefits of eye tracking data in ML models are well established (Barrett & Hollenstein, 2020; Mathias et al., 2020), this is a strong baseline.

(2) We further implement a random noise-augmented baseline, where we add uniformly sampled vectors of random numbers as the second input data type to the multi-modal model. These random vectors are of the same dimension as the EEG vectors (i.e., d = 105). It is well known that the addition of noise to the input data of a neural network during training can lead to improvements in generalization performance as a form of regularization (Bishop, 1995). Thus, this baseline is relevant because we want to analyze whether the improvements from the EEG signals on the NLP tasks are due to the linguistic information extracted from them and not merely due to additional noise.
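The random noise-augmented baseline can be sketched as follows: every word receives a uniformly sampled vector with the same dimension as the EEG features (105). This is a minimal illustration; the function name `noise_features`, the seed, and the sampling interval [0, 1) are assumptions of this sketch, since the text only states that the vectors are uniformly sampled.

```python
import random

EEG_DIM = 105  # dimension of the EEG feature vectors stated in the text

def noise_features(num_words, dim=EEG_DIM, seed=42):
    """Return one uniformly sampled noise vector per word.

    The interval [0, 1) is an assumption; only the dimension (105) and the
    uniform sampling are given in the description of the baseline.
    """
    rng = random.Random(seed)
    return [[rng.random() for _ in range(dim)] for _ in range(num_words)]

sentence = ["this", "movie", "is", "great", "fun"]
noise = noise_features(len(sentence))
```

Because the noise vectors have the same shape as the EEG input, they can be fed to the second model component without any architectural change, which is what makes the comparison with the EEG-augmented models meaningful.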
Figure 3: EEG decoding components: (left) The recurrent model component is analogous to the text component and consists of recurrent layers followed by two dense layers with dropout. (right) The convolutional inception component consists of an ensemble of convolution filters of varying lengths which are concatenated and flattened before the subsequent dense layers.
To fully understand the impact of the EEG data on the NLP models, we build a model that is able to deal with multiple inputs and mixed data. We present a multi-modal model with late decision-level fusion to learn joint representations of textual and cognitive input features. We test both a recurrent and a convolutional neural architecture for decoding the EEG signals. Figure 2 depicts the main structure of our model and we describe the individual components below.

All input sentences are padded to the maximum sentence length to provide fixed-length text inputs to the model. Word embeddings of dimension d are the input for the textual component, where d ∈ {32, 300, 768} for randomly initialized embeddings, GloVe embeddings and BERT embeddings, respectively. EEG features of dimension e are the input for the cognitive component, where e = 105. As described, the text component consists of bidirectional LSTM layers followed by two dense layers with dropout. Text and EEG features are given as independent inputs to their own respective component of the network. The hidden representations of these are then concatenated before being fed to a final dense classification layer. We also experimented with different merging mechanisms to join the text and EEG layers of our two-tower model (concatenation, addition, subtraction, maximum). Concatenation overall achieved the best results, so we report only these. Although the goal of each network is to learn feature transformations for its own modality, the relevant extracted information should be complementary. This is achieved, as commonly done in deep learning, by alternately running inference and back-propagation of the data through the entire network, enabling information to flow from the component responsible for one input modality to the other via the fully connected output layers.
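The four merging mechanisms mentioned above can be illustrated on plain lists standing in for the hidden representations of the text and EEG components. This is only a sketch: the function `merge` and the example values are hypothetical, and a real model would apply these operations to batched tensors inside the network.

```python
def merge(text_hidden, eeg_hidden, mode="concatenate"):
    """Fuse two hidden representations (plain lists of floats)."""
    if mode == "concatenate":
        # Concatenation keeps both representations side by side.
        return text_hidden + eeg_hidden
    if len(text_hidden) != len(eeg_hidden):
        raise ValueError("element-wise merging needs equal dimensions")
    pairs = list(zip(text_hidden, eeg_hidden))
    if mode == "add":
        return [t + e for t, e in pairs]
    if mode == "subtract":
        return [t - e for t, e in pairs]
    if mode == "maximum":
        return [max(t, e) for t, e in pairs]
    raise ValueError(f"unknown merge mode: {mode}")

# Hypothetical hidden vectors of the two components
text_h = [0.5, -1.0, 2.0]
eeg_h = [0.25, 0.5, -1.5]

fused = merge(text_h, eeg_h)              # length 6
summed = merge(text_h, eeg_h, "add")      # length 3
```

Note that only concatenation preserves every dimension of both components; the element-wise variants require equal hidden sizes and compress the two representations into one vector of that size.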
To learn a non-linear transformation function for each component, we employ rectified linear units (ReLU) as activation functions after each hidden layer.

For the EEG component, we test a recurrent and a convolutional architecture since both have proven useful in learning features from time series data for language processing (e.g., Fawaz et al., 2020; Yin, Kann, Yu, & Schütze, 2017; Lipton, Berkowitz, & Elkan, 2015). For the recurrent architecture (Figure 3, left), the model component is analogous to the text component: it consists of bidirectional LSTM layers followed by two dense layers with dropout and ReLU activation functions. For the convolutional architecture (Figure 3, right), we build a model component based on the Inception module first introduced by Szegedy et al. (2015). An inception module is an ensemble of convolutions that applies multiple filters of varying lengths simultaneously to an input time series. This allows the network to automatically extract relevant features from both long and short time series. As suggested by Schirrmeister et al. (2017), we use exponential linear unit activations (ELUs; Clevert, Unterthiner, & Hochreiter, 2015) in the convolutional EEG decoding model component.

For binary and ternary sentiment analysis, the final dense layer has a softmax activation in order to use the maximal output for the classification. For the multi-label classification case of relation detection, we replace the softmax function in the last dense layer of the model with a sigmoid activation to produce independent scores for each class. If the score for any class surpasses a certain threshold, the sentence is labeled as containing that relation type (as opposed to simply taking the maximum score as the label of the sentence). The threshold is tuned as an additional hyper-parameter.

Parameter | Range
LSTM layer dimension | 64, 128, 256, 512
Number of LSTM layers | 1, 2, 3, 4
CNN filters | 14, 16, 18
CNN kernel sizes | [1,4,7]
CNN pool sizes | 3, 5, 7
Dense layer dimension | 8, 16, 32, 64, 128, 256, 512
Dropout | 0.1, 0.3, 0.5
Batch size | 20, 40, 60
Learning rate | 10 − , 10 − , 10 − , 10 − , 10 −
Random seeds | 13, 22, 42, 66, 78
Threshold | 0.3, 0.5, 0.7

Table 3: Tested value ranges included in the hyper-parameter search for our multi-modal machine learning architecture. Threshold only applies to relation detection.

GloVe embeddings: https://nlp.stanford.edu/projects/glove/
BERT model: https://huggingface.co/bert-base-uncased

This multi-modal model with separate components learned for each input data type has several advantages: It allows for separate pre-processing of each type of data, e.g., it can deal with differing tokenization strategies, which is useful in our case since it is challenging to map linguistic tokenization to the word boundaries presented to participants during the recordings of eye tracking and brain activity. Moreover, this approach is scalable to any number of input types. The generalizability of our model enables the integration of multiple data representations, e.g., learning from brain activity, eye movements, and other cognitive modalities simultaneously.
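The difference between the two output layers described for sentiment analysis and relation detection can be sketched as follows: softmax followed by taking the maximal output yields exactly one label, whereas independent sigmoid scores compared against a threshold can yield several labels or none. The function names and the example scores are hypothetical; the threshold values are those listed in the hyper-parameter search.

```python
import math

def softmax(logits):
    """Normalize raw scores into a probability distribution."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    """Squash one raw score into (0, 1), independently of the other classes."""
    return 1.0 / (1.0 + math.exp(-x))

def predict_single_label(logits):
    """Sentiment-style decision: index of the class with the maximal output."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)

def predict_multi_label(logits, threshold=0.5):
    """Relation-style decision: every class whose score surpasses the threshold."""
    return [i for i, x in enumerate(logits) if sigmoid(x) > threshold]

scores = [2.0, -1.0, 0.5]  # hypothetical raw outputs of the last dense layer
single = predict_single_label(scores)
multi_default = predict_multi_label(scores, threshold=0.5)
multi_strict = predict_multi_label(scores, threshold=0.7)
```

With these scores the single-label rule selects class 0 only, while the multi-label rule additionally accepts class 2 at a threshold of 0.5 and drops it again at 0.7, which illustrates why the threshold is worth tuning.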
To assess the impact of the EEG signals under fair modelling conditions, the hyper-parameters are tuned individually for all baseline models as well as for all eye tracking and EEG augmented models. The ranges of the hyper-parameters are presented in Table 3. All results are reported as means over five independent runs with different random seeds. In each run, 5-fold cross-validation is performed on an 80% training and 20% test split. The best parameters are selected according to the model's accuracy on the validation set (10% of the training set) across all 5 folds. We implement early stopping with a patience of 80 epochs and a minimum difference in validation accuracy of 10 − . The validation set is used for both parameter tuning and early stopping.

In this study, we assess the potential of EEG brain activity data to enhance NLP tasks in a multi-modal architecture. We present the results of all augmented models compared to the baseline results. As described above, we select the best hyper-parameters based on the best validation accuracy achieved.

The performance of our models is evaluated based on the comparison between the predicted labels (i.e., positive, neutral or negative sentiment for a sentence; or the relation type(s) in a sentence) and the true labels of the test set, resulting in the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) across the classified samples. The terms positive and negative refer to the classifier's prediction, and the terms true and false refer to whether that prediction corresponds to the ground truth label. The following decoding performance metrics were computed:

Precision is the fraction of relevant instances among the retrieved instances, and is defined as
$$\text{Precision} = \frac{TP}{TP + FP} \qquad (1)$$

Model | Randomly initialized: P, R, F1 (std) | GloVe: P, R, F1 (std) | BERT: P, R, F1 (std)
Baseline | 0.572 0.573 0.552 (0.07) | 0.751 0.738 0.728 (0.08) | 0.900 0.899 0.893 (0.04)
+ noise | 0.599 0.574 0.541 (0.08) | 0.721 0.715 0.709 (0.09) | 0.921 0.918 0.915 (0.05)
+ ET ** (0.06) 0.913 0.907 0.904 (0.05)
+ EEG full | 0.562 0.560 0.550 (0.07) | 0.752 0.747 0.744 (0.04) | 0.909 0.908 0.903 (0.05)
+ EEG θ α ** (0.06) 0.775 0.767 0.760* (0.06) 0.915 0.915 0.913* (0.04)
+ EEG β γ * (0.03)
+ θ + α + β + γ

Table 4: Binary sentiment analysis results of the multi-modal model using the recurrent EEG decoding component. We report precision (P), recall (R), F1-score and the standard deviation (std) between five runs. The best results per column are marked in bold, all EEG results better than the text baseline and the baseline augmented with random noise are marked with grey background. Significance is indicated on the F1-score with asterisks: * = p < 0.05, ** = p < 0.003 (Bonferroni correction).
Model | Randomly initialized: P, R, F1 (std) | GloVe: P, R, F1 (std) | BERT: P, R, F1 (std)
Baseline | 0.412 0.406 0.365 (0.08) | 0.516 0.510 0.501 (0.04) | 0.722 0.714 0.710 (0.05)
+ noise | 0.373 0.399 0.344 (0.10) | 0.531 0.519 0.504 (0.04) | 0.711 0.706 0.700 (0.06)
+ ET ** (0.06) 0.539
+ EEG θ α β (0.07)
+ EEG γ (0.05) 0.709 0.705 0.697 (0.06)
+ θ + α + β + γ

Table 5: Ternary sentiment analysis results of the multi-modal model using the recurrent EEG decoding component. We report precision (P), recall (R), F1-score and the standard deviation (std) between five runs. The best results per column are marked in bold, all EEG results better than the text baseline and the baseline augmented with random noise are marked with grey background. Significance is indicated on the F1-score with asterisks: * = p < 0.05, ** = p < 0.003 (Bonferroni correction).
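The evaluation metrics used in this section (precision, recall and F1-score, macro-averaged over the labels) can be computed as in the following sketch. It is an illustration under stated assumptions rather than our evaluation code: the function name `macro_prf` is hypothetical, a metric with an empty denominator is set to 0, and the macro F1 is taken as the mean of the per-label F1-scores, which is one common convention.

```python
def macro_prf(y_true, y_pred):
    """Macro-averaged precision, recall and F1 over all observed labels."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1_scores = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Assumption: a metric with an empty denominator counts as 0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1_scores) / n

# Hypothetical gold labels and predictions for four sentences
gold = ["pos", "pos", "neg", "neg"]
pred = ["pos", "neg", "neg", "neg"]
macro_p, macro_r, macro_f1 = macro_prf(gold, pred)
```

Because every label contributes equally to the average regardless of how many samples it has, a classifier that ignores a rare class is penalized more strongly than under a sample-weighted average.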
Recall is the fraction of the relevant instances that are successfully retrieved:

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (2)$$

The F1-score is the harmonic mean combining precision and recall:

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (3)$$

For analyzing the results, we report macro-averaged precision (P), recall (R), and F1-score, i.e., the metrics are calculated for each label and then averaged, to counteract the label imbalance in the datasets.

The results for the multi-modal architecture using the recurrent EEG decoding component are presented in Table 4 for binary sentiment analysis, Table 5 for ternary sentiment analysis, and Table 6 for relation detection. The first three rows in each table represent the uni-modal text baseline and the multi-modal noise and eye-tracking baselines. These are followed by the multi-modal models augmented with the full broadband EEG signals and each of the four frequency bands. Finally, in the last row, we also present the results of a multi-modal model with five components, where text and each frequency band are learned separately and concatenated at the end. In both sentiment tasks, the eye tracking and EEG data yield a modest but consistent improvement over the text baseline. However, in the case of relation detection, the addition of either eye tracking or brain activity data seems to be harmful. Generally, the results show a decreasing maximal performance per task with increasing task complexity measured in terms of the number of classes (see Section 4.5 for a detailed analysis).

Model | Randomly initialized: P, R, F1 (std) | GloVe: P, R, F1 (std) | BERT: P, R, F1 (std)
Baseline | (0.04) | (0.04) | (0.03)
+ noise | 0.462 0.335 0.382 (0.05) | 0.577 0.497 0.532 (0.03) | 0.675 0.585 0.625 (0.03)
+ ET | 0.468 0.324 0.373 (0.06) | 0.547 0.476 0.506 (0.04) | 0.661 0.631 0.644 (0.03)
+ EEG full | 0.426 0.335 0.370 (0.06) | 0.519 0.449 0.480 (0.05) | 0.677 0.627 0.650 (0.03)
+ EEG θ α β γ
+ θ + α + β + γ

Table 6: Relation detection results of the multi-modal model using the recurrent EEG decoding component. We report precision (P), recall (R), F1-score and the standard deviation (std) between five runs. The best results per column are marked in bold, all EEG results better than the text baseline and the baseline augmented with random noise are marked with grey background.

Model | Randomly initialized: P, R, F1 (std) | GloVe: P, R, F1 (std) | BERT: P, R, F1 (std)
Baseline | 0.572 0.573 0.552 (0.07) | 0.751 0.738 0.728 (0.08) | 0.900 0.899 0.893 (0.04)
+ noise | 0.558 0.584 0.528 (0.11) | 0.780 0.767 0.762 (0.06) | 0.906 0.903 0.901 (0.05)
+ ET | 0.617 0.623 (0.07) | ** (0.05) | 0.896 0.887 0.881 (0.05)
+ EEG full | 0.601 0.594 0.584 (0.06) | 0.771 0.765 0.756 (0.07) | 0.923 0.923 0.921 (0.04)
+ EEG θ α β γ ** (0.04)
+ θ + α + β + γ

Table 7: Binary sentiment analysis results of the multi-modal model using the convolutional EEG decoding component. We report precision (P), recall (R), F1-score and the standard deviation (std) between five runs. The best results per column are marked in bold, all EEG results better than the text baseline and the baseline augmented with random noise are marked with grey background. Significance is indicated on the F1-score with asterisks: * = p < 0.05, ** = p < 0.003 (Bonferroni correction).

Furthermore, the results for the multi-modal architecture using the convolutional
EEG decoding component are presented in Table 7 for binary sentiment analysis, Table 8 for ternary sentiment analysis, and Table 9 for relation detection. This model architecture yields higher results, whereas the trend across tasks is similar to the model using the recurrent EEG decoding component, i.e., considerable improvements for both sentiment analysis tasks, but none for relation detection.

To assess the results, we perform statistical significance testing with respect to the text baseline in a bootstrap test as described in Dror, Baumer, Shlomov, and Reichart (2018) over the F1-scores of the five runs of all tasks. We compare the results of the multi-modal models using text and EEG data to the uni-modal text baseline. In addition, we apply the Bonferroni correction to counteract the problem of multiple comparisons. We choose this conservative correction because of the dependencies between the datasets used (Dror, Baumer, Bogomolov, & Reichart, 2017). Under the Bonferroni correction, the global null hypothesis is rejected if p < α/N, where N is the number of hypotheses (Bonferroni, 1936). In our setting, α = 0.05 and N = 18, accounting for the combination of the 3 embedding types and 6 EEG feature sets, namely broadband EEG; θ, α, β and γ frequency bands; and all four frequency bands jointly. For instance, in Table 7 the improvements in 12 configurations out of 18 are also statistically significant under the Bonferroni correction (i.e., p < 0.003, since α/N = 0.05/18 ≈ 0.0028).

Model | Randomly initialized: P, R, F1 (std) | GloVe: P, R, F1 (std) | BERT: P, R, F1 (std)
Baseline | 0.412 0.406 0.365 (0.08) | 0.516 0.510 0.501 (0.04) | 0.722 0.714 0.710 (0.05)
+ noise | 0.359 0.388 0.334 (0.09) | 0.529 0.517 0.505 (0.05) | 0.715 0.683 0.684 (0.05)
+ ET | 0.398 0.404 0.378 (0.07) | * (0.04) | 0.721 0.687 0.670 (0.05)
+ EEG full | 0.417 0.397 0.366 (0.06) | 0.506 0.503 0.495 (0.06) | 0.738 0.724 (0.04)
+ EEG θ α β γ
+ θ + α + β + γ * (0.06) 0.542 0.531 0.514 (0.05)

Table 8: Ternary sentiment analysis results of the multi-modal model using the convolutional EEG decoding component. We report precision (P), recall (R), F1-score and the standard deviation (std) between five runs. The best results per column are marked in bold, all EEG results better than the text baseline and the baseline augmented with random noise are marked with grey background. Significance is indicated on the F1-score with asterisks: * = p < 0.05, ** = p < 0.003 (Bonferroni correction).

Model | Randomly initialized: P, R, F1 (std) | GloVe: P, R, F1 (std) | BERT: P, R, F1 (std)
Baseline | (0.04) | (0.04) | (0.03)
+ noise | 0.424 0.299 0.342 (0.06) | 0.447 0.413 0.428 (0.07) | 0.532 0.493 0.511 (0.07)
+ ET | 0.415 0.307 0.345 (0.08) | 0.548 0.408 0.464 (0.06) | 0.624 0.540 0.577 (0.05)
+ EEG full | 0.458 0.343 0.386 (0.06) | 0.486 0.420 0.448 (0.05) | 0.586 0.545 0.564 (0.05)
+ EEG θ α β γ
+ θ + α + β + γ

Table 9: Relation detection results of the multi-modal model using the convolutional EEG decoding component. We report precision (P), recall (R), F1-score and the standard deviation (std) between five runs. The best results per column are marked in bold, all EEG results better than the text baseline and the baseline augmented with random noise are marked with grey background.

The results show substantial improvements on both sentiment analysis tasks; however, no improvements are achieved on relation detection. EEG performs better than, or at least comparably to, eye tracking in many scenarios. This study shows the potential of decoding EEG for NLP and provides a good basis for future studies. Despite the limited amount of data, these results suggest that augmenting NLP systems with EEG features is a generalizable approach.

In the following sections, we discuss these results from different angles. We contrast the performance of different EEG features, we compare the EEG results to the text baseline and multi-modal baselines (as described in Section 2.3.2), and we analyze the effect of different word embedding types. Additionally, we explore the impact of varying training set sizes in a data ablation study. Finally, we investigate the possible reasons for the decrease in performance for the relation detection task, which we associate with the task complexity. We run all analyses with both the recurrent and the convolutional EEG component.

4.1 EEG Feature Analysis

We start by investigating the impact of the various EEG features included in our multi-modal models. Different neurocognitive aspects of language processing during reading are associated with brain oscillations at various frequencies.
We first give a short overview of the cognitive functions related to EEG frequency bands that are found in the literature before discussing the insights of our results.

Theta activity reflects cognitive control and working memory (Williams, Kappen, Hassall, Wright, & Krigolson, 2019), and increases when processing semantic anomalies (Prystauka & Lewis, 2019). Moreover, Bastiaansen, Van Berkum, and Hagoort (2002) showed a frequency-specific increase in theta power as a sentence unfolds, possibly related to the formation of an episodic memory trace, or to incremental verbal working memory load.

Alpha activity has been related to attentiveness (Klimesch, 2012). Both theta and alpha ranges are sensitive to the lexical–semantic processes involved in language translation (Grabner, Brunner, Leeb, Neuper, & Pfurtscheller, 2007).
Beta activity has been implicated in higher-order linguistic functions such as the discrimination of word categories and the retrieval of action semantics as well as semantic memory, and syntactic processes, which support meaning construction during sentence processing. There is evidence that beta frequencies are important for linking past and present inputs and for detecting the novelty of stimuli, which are essential processes for language perception as well as production (Weiss & Mueller, 2012). Beta frequencies also affect decisions regarding relevance (Eugster et al., 2014).

Emotional processing of pictures enhances gamma band power (Müller, Keil, Gruber, & Elbert, 1999). Gamma-band activity has been used to detect emotions (Li & Lu, 2009), and increases during syntactic and semantic structure building (Prystauka & Lewis, 2019). In the gamma frequency band, a power increase was observed during the processing of correct sentences, but this effect was absent following semantic violations (Hald, Bastiaansen, & Hagoort, 2006). Frequency band features have often been used in deep learning methods for decoding EEG in other domains, such as mental workload and sleep stage classification (Craik et al., 2019).

The results show that our multi-modal models yield better results with filtered EEG frequency bands than with the broadband EEG signal on all tasks and embedding types, as well as with both EEG decoding components. Although the alpha, beta and gamma features show promising results on some embedding types and tasks (e.g., BERT embeddings and gamma features for binary sentiment analysis reported in Table 4), the results show no clear sign of any frequency band outperforming the others (neither across tasks for a fixed embedding type, nor for a fixed task across all embedding types).
For the sentiment analysis tasks, where both EEG components achieve significant improvements, gamma features most often achieve the highest results. Moreover, the combination of all four EEG frequency bands (i.e., in a multi-modal model of 5 components, including text embeddings) performs much better than the full broadband EEG. Hence, further exploring the effects of specific frequency bands on language understanding tasks might prove useful.

Data-driven methods can help us to tease more information from the recordings by allowing us to test broader theories and task-specific language representations (Murphy et al., 2018), but our results also clearly show that restricting the EEG signal to a given frequency band is beneficial. More research is required in this area to specifically isolate the linguistic processing from the filtered EEG signals.
The multi-modal EEG models often outperform the text baselines (at least for the sentiment analysis tasks). We now analyze how the EEG models compare to the two augmented baselines described in Section 2.3.2 (i.e., models augmented with eye tracking and with random noise). We find that EEG always performs better than or equal to the multi-modal text + eye tracking models. This shows how promising EEG is as a data source for multi-modal cognitive NLP. Although eye tracking requires less recording effort, these results corroborate that EEG data contain more information about the cognitive processes occurring in the brain during language understanding.

As expected, the baselines augmented with random noise perform worse than the pure text baselines in all cases except for binary sentiment analysis with BERT embeddings. This model seems to deal exceptionally well with added noise. In the case of relation detection, the added noise harms the models similarly to adding EEG. It becomes clear for this task that adding the full EEG features is worse than adding random noise, but some of the frequency band features clearly outperform the augmented noise baseline.

Figure 4: Data ablation for all three word embedding types for the binary sentiment analysis task using the recurrent EEG decoding component. The shaded areas represent the standard deviations.

Figure 5: Data ablation for all three word embedding types for the binary sentiment analysis task using the convolutional EEG decoding component. The shaded areas represent the standard deviations.
Our baseline results show that contextual embeddings outperform the non-contextual methods across all tasks. Arora, May, Zhang, and Ré (2020) also compared random, GloVe and BERT embeddings and found that with smaller training sets, the difference in performance between these three embedding types is larger. This is in accordance with our results, which show that the type of embedding has a large impact on the baseline performance on all three tasks. The improvements of EEG in the binary sentiment analysis task with BERT embeddings are especially noteworthy.

Augmenting our baseline with EEG data on the binary sentiment analysis task results in a gain of approximately +3% F1-score across all the different embeddings with the recurrent EEG component. The gain is slightly lower, at +1%, for all the embeddings in the ternary sentiment classification task. While there is no gain for relation detection, the differences are also constant across embeddings. This shows that the improvements gained by adding EEG signals are much more dependent on the task than on the embedding type. Looking ahead, this finding might be useful when new embeddings improve the baseline performance even further while upholding the stable gain from the EEG signals.

One of the challenges of NLP is to learn as much as possible from limited resources. Unlike most machine learning models, one of the most striking aspects of human learning is the ability to learn new words or concepts from limited numbers of examples (Lake, Salakhutdinov, & Tenenbaum, 2015). Using cognitive language processing data may allow us to take a step towards meta-learning, the process of discovering the cognitive processes that are used to tackle a task in the human brain (Griffiths et al., 2019), and in turn to improve the generalization abilities of NLP models. Humans can learn from very few examples, while machines, particularly deep learning models, typically need many examples.
Perhaps this advantage in humans is due to their multi-modal learning mechanisms (Linzen, 2020).

Therefore, we analyze the impact of adding EEG features to our NLP models with less training data. We performed data ablation experiments for all three tasks. The most conclusive results were achieved on binary sentiment analysis. Randomly initialized embeddings unsurprisingly suffer a lot when the training data is reduced. The results are shown in Figures 4 and 5, for both EEG decoding components. We present the results for the best-performing frequency bands only. With GloVe and BERT embeddings, the largest gain from EEG data is obtained with only 50% of the training data, which is as little as 105 training sentences. These experiments emphasize the potential of EEG signals for NLP, especially when dealing with very small amounts of training data and using popular word embedding types.

From the previously described results, one hypothesis on why augmenting the baseline with either EEG or eye tracking data lowers the performance in the relation detection task lies in the complexity of the task. More concretely, we measure the complexity by counting the number of classes the model needs to learn. We see a decreasing performance boost with increasing complexity over the three evaluated tasks. Therefore, we test this hypothesis by simplifying the relation detection task, reducing the number of classes from 11 to 2. We create binary relation detection tasks for the two most frequent relation types, Job Title and Visited (see Figure 1). For example, we classify all the samples containing the relation Job Title (184 samples) against all samples with no relation (219 samples).

We train these additional models with BERT embeddings. The results for full EEG features and the best frequency band from the previous results are shown in Table 10. It is evident that with the simplification of the relation detection task into binary classification tasks, EEG signals boost the performance and achieve considerable improvements over the text baseline. The gains are similar to those for binary sentiment analysis for both EEG decoding components. This confirms our hypothesis that the EEG features tested yield good results on simple tasks, but more research is needed to achieve improvements on more complex tasks.

Table 10

Job Title | Precision, Recall, F1-score (std) | Precision, Recall, F1-score (std)
BERT | 0.870 0.871 0.868 (0.03) | 0.870 0.871 0.868 (0.03)
BERT + EEG full | (0.03) | 0.882 0.878 0.877 (0.04)
BERT + EEG β | (0.03) | (0.04)

Visited | Precision, Recall, F1-score (std) | Precision, Recall, F1-score (std)
BERT | 0.864 0.848 0.848 (0.05) | 0.864 0.848 0.848 (0.05)
BERT + EEG full (0.05)
BERT + EEG β (0.06)
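The construction of the simplified binary tasks can be sketched as follows: sentences containing the target relation become positive samples, sentences without any relation become negative samples, and sentences that only contain other relations are left out. The function name `make_binary_task`, the data layout, and the placeholder sentences are hypothetical and serve only to illustrate the filtering step.

```python
def make_binary_task(samples, relation):
    """Build a binary dataset for one relation type.

    samples: list of (sentence, set_of_relation_labels) pairs.
    Returns (sentence, 1) for sentences containing `relation` and
    (sentence, 0) for sentences without any relation; sentences that
    only contain other relations are dropped.
    """
    binary = []
    for sentence, relations in samples:
        if relation in relations:
            binary.append((sentence, 1))
        elif not relations:
            binary.append((sentence, 0))
    return binary

# Hypothetical multi-label samples
samples = [
    ("s1", {"Job Title"}),
    ("s2", set()),
    ("s3", {"Visited"}),
    ("s4", {"Job Title", "Visited"}),
]
job_title_task = make_binary_task(samples, "Job Title")
visited_task = make_binary_task(samples, "Visited")
```

Applied to the full dataset, this filtering would produce the 184 positive and 219 negative samples reported above for Job Title.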
We presented a large-scale study about leveraging electrical brain activity signals during reading comprehension for augmenting machine learning models of semantic language understanding tasks, namely, sentiment analysis and relation detection. We analyzed the effects of different EEG features and compared the multi-modal models to multiple baselines. Moreover, we compared the improvements gained from the EEG signals on three different types of word embeddings. We tested not only the effect of varying training set sizes, but also tasks of various difficulty levels (in terms of number of classes).

We achieve consistent improvements with EEG across all three embedding types, but the magnitude of the improvement decreases for more difficult tasks. While the improvements for the binary and ternary sentiment analysis tasks are significant with both EEG decoding components, for relation detection, a multi-class and multi-label sequence classification task, it was not possible to achieve any improvements unless the task complexity is substantially reduced. We find that in the tasks where the multi-modal architecture does achieve considerable improvements, the convolutional EEG decoding component yields even higher results than the recurrent component. However, for the complex relation detection task, neither component achieves an improvement.

To sum up, we capitalize on the advantages of electroencephalography data to examine if and which EEG features can serve to augment language understanding models. While our results show that there is linguistic information in the EEG signal complementing the text features, more research is needed to isolate language-specific features. More generally, this work paves the way for more in-depth EEG-based NLP studies.

References
Affolter, N., Egressy, B., Pascual, D., & Wattenhofer, R. (2020). Brain2word: Decoding brain activity for language generation. arXiv preprint arXiv:2009.04765.

Alday, P. M. (2019). M/EEG analysis of naturalistic stories: A review from speech to language processing. Language, Cognition and Neuroscience, (4), 457–473.

Arora, S., May, A., Zhang, J., & Ré, C. (2020). Contextual embeddings: When are they worth it? arXiv preprint arXiv:2005.09117.

Artemova, E., Bakarov, A., Artemov, A., Burnaev, E., & Sharaev, M. (2020). Data-driven models and computational tools for neurolinguistics: a language technology perspective. Journal of Cognitive Science, (1), 15–52.

Barnes, J., Klinger, R., & im Walde, S. S. (2017). Assessing state-of-the-art sentiment models on state-of-the-art sentiment datasets. In Proceedings of the 8th workshop on computational approaches to subjectivity, sentiment and social media analysis (pp. 2–12).

Barnes, J., Øvrelid, L., & Velldal, E. (2019, August). Sentiment analysis is not solved! assessing and probing sentiment classification. In Proceedings of the 2019 acl workshop blackboxnlp: Analyzing and interpreting neural networks for nlp (pp. 12–23). Florence, Italy: Association for Computational Linguistics. DOI: 10.18653/v1/W19-4802

Barnes, J., Velldal, E., & Øvrelid, L. (2020). Improving sentiment analysis with multi-task learning of negation. Natural Language Engineering, 1–21. DOI: 10.1017/S1351324920000510

Barrett, M., Bingel, J., Hollenstein, N., Rei, M., & Søgaard, A. (2018). Sequence classification with human attention. In Proceedings of the 22nd conference on computational natural language learning (pp. 302–312).

Barrett, M., Bingel, J., Keller, F., & Søgaard, A. (2016). Weakly supervised part-of-speech tagging using eye-tracking data. In Proceedings of the 54th annual meeting of the association for computational linguistics (Vol. 2, pp. 579–584).

Barrett, M., González-Garduño, A. V., Frermann, L., & Søgaard, A. (2018). Unsupervised induction of linguistic categories with records of reading, speaking, and writing. In Proceedings of the 2018 conference of the north american chapter of the association for computational linguistics: Human language technologies, volume 1 (long papers) (pp. 2028–2038).

Barrett, M., & Hollenstein, N. (2020). Sequence labelling and sequence classification with gaze: Novel uses of eye-tracking data for natural language processing. Language and Linguistics Compass, (11), 1–16.

Bastiaansen, M. C., Van Berkum, J. J., & Hagoort, P. (2002). Event-related theta power increases in the human EEG during online sentence processing. Neuroscience Letters, (1), 13–16.

Beinborn, L., Abnar, S., & Choenni, R. (2019). Robust evaluation of language-brain encoding experiments. International Journal of Computational Linguistics and Applications.

Beres, A. M. (2017). Time is of the essence: A review of electroencephalography (eeg) and event-related brain potentials (erps) in language research. Applied psychophysiology and biofeedback, (4), 247–255.

Bishop, C. M. (1995). Training with noise is equivalent to tikhonov regularization. Neural computation, (1), 108–116.

Bonferroni, C. (1936). Teoria statistica delle classi e calcolo delle probabilita. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze, 3–62.

Clevert, D.-A., Unterthiner, T., & Hochreiter, S. (2015). Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289.

Clifton, C., Staub, A., & Rayner, K. (2007). Eye movements in reading words and sentences. In Eye movements (pp. 341–371). Elsevier.

Craik, A., He, Y., & Contreras-Vidal, J. L. (2019). Deep learning for electroencephalogram (EEG) classification tasks: a review.
Journal of Neural Engineering , (3), 031001.Culotta, A., McCallum, A., & Betz, J. (2006). Integrating probabilistic extraction models and data miningto discover relations and patterns in text. In Proceedings of the human language technology conferenceof the north american chapter of the association of computational linguistics (pp. 296–303).Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectionaltransformers for language understanding. In
Proceedings of the 2019 conference of the north americanchapter of the association for computational linguistics: Human language technologies, volume 1(long and short papers) (pp. 4171–4186).Dimigen, O., Sommer, W., Hohlfeld, A., Jacobs, A. M., & Kliegl, R. (2011). Coregistration of eyemovements and EEG in natural reading: analyses and review.
Journal of Experimental Psychology:General , (4), 552.Do, T. H., Nguyen, D. M., Tsiligianni, E., Cornelis, B., & Deligiannis, N. (2017). Multiview deep learningfor predicting Twitter users’ location. arXiv preprint arXiv:1712.08091 .Dror, R., Baumer, G., Bogomolov, M., & Reichart, R. (2017). Replicability analysis for naturallanguage processing: Testing significance with multiple datasets. Transactions of the Associationfor Computational Linguistics , , 471–486.Dror, R., Baumer, G., Shlomov, S., & Reichart, R. (2018). The hitchhiker’s guide to testing statisticalsignificance in natural language processing. In Proceedings of the 56th annual meeting of theassociation for computational linguistics (volume 1: Long papers) (pp. 1383–1392).Ephrat, A., Halperin, T., & Peleg, S. (2017). Improved speech reconstruction from silent video. In
Proceedings of the ieee international conference on computer vision workshops (pp. 455–462).Ettinger, A. (2020). What BERT is not: Lessons from a new suite of psycholinguistic diagnostics forlanguage models.
Transactions of the Association for Computational Linguistics , , 34-48.Eugster, M. J., Ruotsalo, T., Spap´e, M. M., Kosunen, I., Barral, O., Ravaja, N., . . . Kaski, S. (2014).Predicting term-relevance from brain signals. In Proceedings of the 37th international acm sigirconference on research & development in information retrieval (pp. 425–434).Fawaz, H. I., Lucas, B., Forestier, G., Pelletier, C., Schmidt, D. F., Weber, J., . . . Petitjean, F. (2020).Inceptiontime: Finding alexnet for time series classification.
Data Mining and Knowledge Discovery , (6), 1936–1962.Foster, C., Dharmaretnam, D., Xu, H., Fyshe, A., & Tzanetakis, G. (2018). Decoding music in the humanbrain using eeg data. In (pp. 1–6).Frank, S. L., Otten, L. J., Galli, G., & Vigliocco, G. (2015). The ERP response to the amount ofinformation conveyed by words in sentences. Brain and Language , , 1–11.Frank, S. L., & Willems, R. M. (2017). Word predictability and semantic similarity show distinct patternsof brain activity during language comprehension. Language, Cognition and Neuroscience , (9),1192–1203.Fyshe, A., Talukdar, P. P., Murphy, B., & Mitchell, T. M. (2014). Interpretable semantic vectors from19 joint model of brain-and text-based meaning. In Proceedings of the 52nd annual meeting of theassociation for computational linguistics (volume 1: Long papers) (p. 489-499).Gauthier, J., & Ivanova, A. (2018). Does the brain represent words? An evaluation of brain decodingstudies of language understanding. arXiv preprint arXiv:1806.00591 .Grabner, R. H., Brunner, C., Leeb, R., Neuper, C., & Pfurtscheller, G. (2007). Event-related eeg thetaand alpha band oscillatory responses during language translation.
Brain Research Bulletin , (1),57–65.Griffiths, T. L., Callaway, F., Chang, M. B., Grant, E., Krueger, P. M., & Lieder, F. (2019). Doingmore with less: meta-reasoning and meta-learning in humans and machines. Current Opinion inBehavioral Sciences , , 24–30.Hald, L. A., Bastiaansen, M. C., & Hagoort, P. (2006). Eeg theta and gamma responses to semanticviolations in online sentence processing. Brain and Language , (1), 90–105.Hale, J., Dyer, C., Kuncoro, A., & Brennan, J. R. (2018). Finding syntax in human encephalographywith beam search. In Proceedings of the 56th annual meeting of the association for computationallinguistics (volume 1: Long papers) (pp. 2727–2736).Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory.
Neural Computation , (8), 1735–1780.Hollenstein, N., Barrett, M., & Beinborn, L. (2020). Towards best practices for leveraging humanlanguage processing signals for natural language processing. In Proceedings of the second workshopon linguistic and neurocognitive resources (pp. 15–27).Hollenstein, N., Barrett, M., Troendle, M., Bigiolli, F., Langer, N., & Zhang, C. (2019). Advancing NLPwith cognitive language processing signals. arXiv preprint arXiv:1904.02682 .Hollenstein, N., Rotsztejn, J., Troendle, M., Pedroni, A., Zhang, C., & Langer, N. (2018). ZuCo, asimultaneous EEG and eye-tracking resource for natural sentence reading.
Scientific Data .Hollenstein, N., Troendle, M., Zhang, C., & Langer, N. (2020). ZuCo 2.0: A dataset of physiologicalrecordings during natural reading and annotation. In
Proceedings of the 12th language resourcesand evaluation conference (pp. 138–146).Hollenstein, N., & Zhang, C. (2019). Entity recognition at first sight: Improving NER with eye movementinformation. In
Proceedings of the 2018 conference of the north american chapter of the associationfor computational linguistics: Human language technologies, volume 1 (long papers).
Howard, J., & Ruder, S. (2018). Universal language model fine-tuning for text classification. In
Proceedingsof the 56th annual meeting of the association for computational linguistics (volume 1: Long papers) (pp. 328–339).Kandylaki, K. D., & Bornkessel-Schlesewsky, I. (2019). From story comprehension to the neurobiology oflanguage.
Language, Cognition and Neuroscience , (4), 405-410.Kiela, D., Grave, E., Joulin, A., & Mikolov, T. (2018). Efficient large-scale multi-modal classification. In Thirty-second aaai conference on artificial intelligence.
Klimesch, W. (2012). Alpha-band oscillations, attention, and controlled access to stored information.
Trends in Cognitive Sciences , (12), 606–617.Lake, B. M., Salakhutdinov, R., & Tenenbaum, J. B. (2015). Human-level concept learning throughprobabilistic program induction. Science , (6266), 1332–1338.Lemh¨ofer, K., & Broersma, M. (2012). Introducing LexTALE: A quick and valid lexical test for advancedlearners of english. Behavior Research Methods , (2), 325–343.Li, M., & Lu, B.-L. (2009). Emotion classification based on gamma-band EEG. In Engineering in medicineand biology society, 2009. embc 2009. annual international conference of the ieee (pp. 1223–1226).Ling, S., Lee, A. C., Armstrong, B. C., & Nestor, A. (2019). How are visual words represented? Insightsfrom EEG-based visual word decoding, feature derivation and image reconstruction.
Human BrainMapping , (17), 5056–5068. 20inzen, T. (2020). How can we accelerate progress towards human-like linguistic generalization? arXivpreprint arXiv:2005.00955 .Lipton, Z. C., Berkowitz, J., & Elkan, C. (2015). A critical review of recurrent neural networks forsequence learning. arXiv preprint arXiv:1506.00019 .Long, Y., Lu, Q., Xiang, R., Li, M., & Huang, C.-R. (2017). A cognition based attention model forsentiment analysis. In Proceedings of the 2017 conference on empirical methods in natural languageprocessing (pp. 462–471).Mathias, S., Kanojia, D., Mishra, A., & Bhattacharyya, P. (2020). A survey on using gaze behaviour fornatural language processing.
Proceedings of IJCAI .McClelland, J. L., Hill, F., Rudolph, M., Baldridge, J., & Sch¨utze, H. (2020). Placing language in anintegrated understanding system: Next steps toward human-level performance in neural languagemodels.
Proceedings of the National Academy of Sciences .Mishra, A., Kanojia, D., Nagar, S., Dey, K., & Bhattacharyya, P. (2017). Leveraging cognitive featuresfor sentiment analysis.
Proceedings of The 20th Conference on Computational Natural LanguageLearning , 156–166.M¨uller, M. M., Keil, A., Gruber, T., & Elbert, T. (1999). Processing of affective pictures modulatesright-hemispheric gamma band eeg activity.
Clinical Neurophysiology , (11), 1913–1920.Murphy, B., & Poesio, M. (2010). Detecting semantic category in simultaneous EEG/MEG recordings. In Proceedings of the naacl hlt 2010 first workshop on computational neurolinguistics (pp. 36–44).Murphy, B., Wehbe, L., & Fyshe, A. (2018). Decoding language from the brain.
Language, Cognition,and Computational Models , 53.Muttenthaler, L., Hollenstein, N., & Barrett, M. (2020). Human brain activity for machine attention. arXiv preprint arXiv:2006.05113 .Naselaris, T., Kay, K. N., Nishimoto, S., & Gallant, J. L. (2011). Encoding and decoding in fMRI.
NeuroImage , (2), 400–410.Nurse, E., Mashford, B. S., Yepes, A. J., Kiral-Kornek, I., Harrer, S., & Freestone, D. R. (2016).Decoding EEG and LFP signals using deep learning: Heading TrueNorth. In Proceedings of the acminternational conference on computing frontiers (pp. 259–266).Pedroni, A., Bahreini, A., & Langer, N. (2019). Automagic: Standardized preprocessing of big EEG data.
NeuroImage .Pennington, J., Socher, R., & Manning, C. D. (2014). Glove: Global vectors for word representation.In
Proceedings of the 2014 conference on empirical methods in natural language processing (pp.1532–1543).Penolazzi, B., Angrilli, A., & Job, R. (2009). Gamma EEG activity induced by semantic violation duringsentence reading.
Neuroscience Letters , (1), 74–78.Pfeiffer, C., Hollenstein, N., Zhang, C., & Langer, N. (2020). Neural dynamics of sentiment processingduring naturalistic sentence reading. NeuroImage , 116934.Poria, S., Hazarika, D., Majumder, N., & Mihalcea, R. (2020). Beneath the tip of the iceberg: Currentchallenges and new directions in sentiment analysis research.
IEEE Transactions on AffectiveComputing , 1-1. DOI: 10.1109/TAFFC.2020.3038167Prystauka, Y., & Lewis, A. G. (2019). The power of neural oscillations to inform sentence comprehension:A linguistic perspective.
Language and Linguistics Compass , (9), e12347.Ramachandram, D., & Taylor, G. W. (2017). Deep multimodal learning: A survey on recent advancesand trends. IEEE Signal Processing Magazine , (6), 96–108.Rotsztejn, J., Hollenstein, N., & Zhang, C. (2018). ETH-DS3Lab at SemEval-2018 Task 7: Effectivelycombining recurrent and convolutional neural networks for relation classification and extraction. In Proceedings of the 12th international workshop on semantic evaluation (p. 689-696).21ato, N., & Mizuhara, H. (2018). Successful encoding during natural reading is associated with fixation-related potentials and large-scale network deactivation.
Eneuro , (5).Schirrmeister, R. T., Springenberg, J. T., Fiederer, L. D. J., Glasstetter, M., Eggensperger, K., Tangermann,M., . . . Ball, T. (2017). Deep learning with convolutional neural networks for eeg decoding andvisualization. Human brain mapping , (11), 5391–5420.Schwartz, D., Toneva, M., & Wehbe, L. (2019). Inducing brain-relevant bias in natural language processingmodels. In Advances in neural information processing systems (pp. 14100–14110).Sereno, S. C., & Rayner, K. (2003). Measuring word recognition in reading: Eye movements andevent-related potentials.
Trends in Cognitive Sciences , (11), 489–493.Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A., & Potts, C. (2013). Recursivedeep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013conference on empirical methods in natural language processing (pp. 1631–1642).Stemmer, B., & Connolly, J. F. (2012). The EEG/ERP technologies in linguistic research.
Methodologicaland Analytic Frontiers in Lexical Research , , 337.Sun, C., Qiu, X., Xu, Y., & Huang, X. (2019). How to fine-tune BERT for text classification? In Chinanational conference on chinese computational linguistics (pp. 194–206).Sun, P., Anumanchipalli, G. K., & Chang, E. F. (2020). Brain2char: A deep architecture for decodingtext from brain recordings.
Journal of Neural Engineering .Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., . . . Rabinovich, A. (2015). Goingdeeper with convolutions. In
Proceedings of the ieee conference on computer vision and patternrecognition (pp. 1–9).Tsai, Y.-H. H., Bai, S., Liang, P. P., Kolter, J. Z., Morency, L.-P., & Salakhutdinov, R. (2019). Multimodaltransformer for unaligned multimodal language sequences. In
Proceedings of the 57th annual meetingof the association for computational linguistics (pp. 6558–6569).Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., . . . Polosukhin, I. (2017).Attention is all you need. In
Advances in neural information processing systems (pp. 5998–6008).Wehbe, L., Vaswani, A., Knight, K., & Mitchell, T. (2014). Aligning context-based statistical models oflanguage with brain activity during reading. In
Proceedings of the 2014 conference on empiricalmethods in natural language processing (emnlp) (pp. 233–243).Weiss, S., & Mueller, H. M. (2012). “Too many betas do not spoil the broth”: the role of beta brainoscillations in language processing.
Frontiers in psychology , , 201.Williams, C. C., Kappen, M., Hassall, C. D., Wright, B., & Krigolson, O. E. (2019). Thinking theta andalpha: Mechanisms of intuitive and analytical reasoning. NeuroImage , , 574–580.Winkler, I., Haufe, S., & Tangermann, M. (2011). Automatic classification of artifactual ICA-componentsfor artifact removal in EEG signals. Behavioral and Brain Functions , (1), 30.Yin, W., Kann, K., Yu, M., & Sch¨utze, H. (2017). Comparative study of CNN and RNN for naturallanguage processing. arXiv preprint arXiv:1702.01923arXiv preprint arXiv:1702.01923