Extracting the Locus of Attention at a Cocktail Party from Single-Trial EEG using a Joint CNN-LSTM Model
Ivine Kuruvila, Jan Muncke, Eghart Fischer, Ulrich Hoppe

This work was carried out at the ENT clinic, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany. Ivine Kuruvila, Jan Muncke and Ulrich Hoppe are with the Department of Audiology, ENT-Clinic, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany ([email protected]). Eghart Fischer is with WS Audiology, Erlangen, Germany.

Abstract—The human brain performs remarkably well in segregating a particular speaker from interfering speakers in a multi-speaker scenario. It has recently been shown that this segregation capability can be quantitatively evaluated by modelling the relationship between the speech signals present in an auditory scene and the cortical signals of the listener measured using electroencephalography (EEG). This has opened up avenues to integrate neuro-feedback into hearing aids, whereby the device can infer the user's attention and enhance the attended speaker. Commonly used algorithms to infer the auditory attention are based on linear systems theory, where speech cues such as envelopes are mapped onto the EEG signals. Here, we present a joint convolutional neural network (CNN) - long short-term memory (LSTM) model to infer the auditory attention. Our joint CNN-LSTM model takes the EEG signals and the spectrograms of the multiple speakers as inputs and classifies the attention to one of the speakers. We evaluated the reliability of our neural network using three different datasets comprising 61 subjects, where each subject undertook a dual-speaker experiment. The three datasets analysed corresponded to speech stimuli presented in three different languages, namely German, Danish and Dutch. Using the proposed joint CNN-LSTM model, we obtained a median decoding accuracy of 77.2% at a trial duration of three seconds. Furthermore, we evaluated the amount of sparsity that our model can tolerate by means of magnitude pruning and found that the model can tolerate up to 50% sparsity without substantial loss of decoding accuracy.
I. INTRODUCTION
Holding a conversation in the presence of multiple noise sources and interfering speakers is a task that people with normal hearing do exceptionally well. The inherent ability to focus the auditory attention on a particular speech signal in a complex mixture has helped us to overcome what is known as the cocktail party effect [1]. However, an automatic machine-based solution to the cocktail party problem is yet to be discovered despite intense research for more than half a century. Such a solution is highly desirable for a plethora of applications such as human-machine interfaces (e.g. Amazon Alexa), automatic captioning of audio/video recordings (e.g. YouTube, Netflix), advanced hearing aids, etc.

In the domain of hearing aids, people with hearing loss suffer from deteriorated speech intelligibility when listening to a particular speaker in a multi-speaker scenario. Hearing aids currently available in the market often do not provide sufficient amenity in such scenarios due to their inability to distinguish between the attended speaker and the ignored speakers.
Hence, additional information about the locus of attention is highly desirable. In the visual domain, selective attention is explained in terms of visual object formation, where an observer focuses on a certain object in a complex visual scene [2]. This was extended to the auditory domain, where it was suggested that phenomena such as the cocktail party effect could be better understood using auditory object formation [3]. I.e., the brain forms objects based on the multiple speakers present in an auditory scene and selects the objects belonging to a particular speaker during attentive listening (top-down or late selection). However, the flexible locus of attention theory was concurrently proposed, in which late selection is hypothesized to occur at low cognitive load and early selection at high cognitive load [4]. This has inspired investigation into whether cortical signals can provide additional information that could help to discriminate between the attended speaker and interfering speakers. In a dual-speaker experiment, it was observed that the cortical signals measured using implanted electrodes track the salient features of the attended speaker more strongly than those of the ignored speaker [5]. Similar results were obtained using magnetoencephalography and electroencephalography (EEG) [6] [7]. In recent years, EEG analyses have become the commonly used methodology in attention research, which has lately become known as auditory attention decoding (AAD).

Both low level acoustic features (speech envelope or speech spectrogram) and high level features (phonemes or phonetics) have been used to investigate the speech tracking in cortical signals [8] [9] [10] [11]. State-of-the-art AAD algorithms are based on the theory of linear systems, where acoustic features are linearly mapped onto the EEG signals. This mapping can be either in the forward direction [9] [12] [13] or in the backward direction [7] [14] [15]. These algorithms have been successful in providing insights into the underlying neuroscientific processes by which the brain suppresses the ignored speaker in a multi-speaker scenario. Using the speech envelope as the input acoustic feature, linear algorithms can generate system response functions that characterize the auditory pathway in the forward direction. These system response functions are referred to as temporal response functions (TRF) [9]. Analysis of the shape of TRFs has revealed that the human brain encodes the attended speaker and the ignored speaker differently. I.e., TRFs corresponding to the attended speaker have salient peaks around 100 ms and 200 ms which are not present in TRFs corresponding to the ignored speaker [16] [17]. Similar attention modulation effects were observed when using the speech spectrogram and higher level features such as phonetics as the acoustic input [10]. Likewise, using backward models, the input stimulus can be reconstructed from EEG signals (stimulus reconstruction method) and a listener's attention could be inferred by comparing the reconstructed stimulus to the input stimuli [7]. These findings give the possibility of integrating AAD algorithms into hearing aids, which in combination with robust speech separation algorithms could greatly improve the amenity provided to the users.

Name        | Number of Subjects | Duration per Subject (minutes) | Total Duration (hours) | Experiment Type | Language
FAU Dataset | 27                 | 30                             | 13.5                   | male + male     | German
DTU Dataset | 18                 | 50                             | 15                     | male + female   | Danish
KUL Dataset | 16                 | 24                             | 6.4                    | male + male     | Dutch

TABLE I: Details of the EEG datasets analysed.

It has been well established that the human auditory system is inherently non-linear [18], and AAD analysis based on linear systems theory addresses the issue of non-linearity only to a certain extent in the preprocessing stage, for example during speech envelope extraction. Another limitation of linear methods is the longer time delay required to classify attention [19] [20], although there have been attempts to overcome this limitation [17] [21]. In the last few years, deep neural networks have become popular, especially in the fields of computer vision and speech processing. Since neural networks have the ability to model non-linearity, they have been used to estimate the dynamic state of the brain from EEG signals [22]. Similarly, in the AAD paradigm, convolutional neural network (CNN) based models were proposed in which the stimulus reconstruction algorithm was implemented using a CNN model to infer attention [23] [24]. A direct classification approach, which bypasses the regression task of stimulus reconstruction and instead classifies whether the attention is to speaker 1 or speaker 2, was proposed in [24] [25]. In a non-competing speaker experiment, classifying attention as successful vs unsuccessful or match vs mismatch was further addressed in [26] [27].

All aforementioned neural network models either did not use speech features or made use of only the speech envelope as the input feature. As neural networks are data driven models, additional data/information about the speech stimuli may improve the performance of the network. In neural network based speech separation algorithms, the spectrogram is used as the input feature to separate multiple speakers from a speech mixture [28]. Inspired by the joint audio-visual speech separation model [29], we present a novel neural network framework that makes use of the speech spectrograms of multiple speakers and the EEG signals as inputs to classify the auditory attention.

The rest of the paper is organized as follows. In section II, details of the datasets that were used to train and validate the neural network are provided. In section III, the neural network architecture is explained in detail. The results are presented in section IV and section V provides a discussion of the results.
II. MATERIALS AND METHODS
A. Examined EEG datasets
We evaluated the performance of our neural network model using three different EEG datasets. The first dataset was collected at our lab and will be referred to as FAU Dataset [17]. The second and third datasets are publicly available and will be referred to as DTU Dataset [30] and KUL Dataset [31].
1) FAU Dataset:
This dataset comprised EEG collected from 27 subjects who were all native German speakers. A cocktail party effect was simulated by presenting two speech stimuli simultaneously using loudspeakers, and the subject was asked to attend selectively to one of the two stimuli. Speech stimuli were taken from the slowly spoken news section of a German news website and were read by two male speakers. The experiment consisted of six different presentations, with each presentation being approximately five minutes long, making it a total of 30 minutes. EEG was collected using 21 AgCl electrodes placed over the scalp according to the 10-20 EEG format. The reference electrode was placed at the right mastoid, the ground electrode was placed at the left earlobe, and the EEG signals were sampled at 2500 Hz. More details of the experiment can be found in [17].
2) DTU Dataset:
This is a publicly available dataset that was part of the work presented in [19]. The dataset consisted of 18 subjects who selectively attended to one of two simultaneous speakers. Speech stimuli were excerpts taken from Danish audiobooks narrated by a male and a female speaker. The experiment consisted of 60 segments, with each segment being 50 seconds long, making it a total of 50 minutes. EEG was recorded using 64 electrodes and sampled at 512 Hz. The reference electrode was chosen as either the left or the right mastoid after visual inspection. Further details can be found in [19] [30].
3) KUL Dataset:
The final dataset that was analysed is another publicly available dataset in which 16 subjects undertook a selective attention experiment. Speech stimuli consisted of four Dutch stories narrated by male speakers. Each story was 12 minutes long and was further divided into two 6-minute presentations. EEG was recorded using 64 electrodes and sampled at 8192 Hz. The reference electrode was chosen as either the TP7 or the TP8 electrode after visually inspecting the quality of the EEG signal measured at these locations. The experiment consisted of three different conditions, namely HRTF, dichotic and repeated stimuli. In this work, we analysed only the dichotic condition, which was 24 minutes long. Additional details about the experiment and the dataset can be found in [31] [32].

Details of the datasets are summarized in Table I. A total of 34.9 hours of EEG data were examined in this work. However, the speech stimuli used were identical across subjects within each dataset, and they totaled 104 minutes of dual-speaker data. For each subject, the training and test data were split 75% - 25%, and we ensured that no part of the EEG or the speech used in the test data was part of the training data. The test data were further divided equally into two halves, and one half was used as a validation set during the training process.
B. Data Analysis
As the EEG signals analysed were collected at different sampling frequencies, they were all low pass filtered at a cut-off frequency of 32 Hz and downsampled to a 64 Hz sampling rate. Additionally, only the signals measured at 10 electrode locations were considered for analysis: F7, F3, F4, F8, T7, C3, Cz, C4, T8, Pz. We analysed four different trial durations in this study, namely two seconds, three seconds, four seconds and five seconds. For 2-second trials, an overlap of one second was applied. Thus, there were 118922 trials in total for analysis. In order to keep the total number of trials constant, two seconds of overlap was used for 3-second trials, three seconds of overlap for 4-second trials and four seconds of overlap for 5-second trials. EEG signals in each trial were further high pass filtered with a cut-off frequency of 1 Hz, and the filtered signals were normalized to have zero mean and unit variance at each electrode location.
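To make the preprocessing chain concrete, the following is a minimal sketch of the EEG pipeline described above (low pass at 32 Hz, downsampling to 64 Hz, segmentation into overlapping trials, per-trial 1 Hz high pass and per-electrode normalization). The filter types and orders (Butterworth) and the function names are our own assumptions; the paper does not specify them.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, resample_poly

def preprocess_eeg(raw_eeg, fs_in, fs_out=64, trial_len_s=3, overlap_s=2):
    """Sketch of the EEG preprocessing described in the text.

    raw_eeg: array of shape (num_samples, num_electrodes) containing the
             10 selected channels (F7, F3, F4, F8, T7, C3, Cz, C4, T8, Pz).
    """
    # Low pass at 32 Hz (filter order is an assumption), then resample to 64 Hz.
    sos_lp = butter(4, 32, btype="low", fs=fs_in, output="sos")
    eeg = sosfiltfilt(sos_lp, raw_eeg, axis=0)
    eeg = resample_poly(eeg, fs_out, fs_in, axis=0)

    # Segment into trials, e.g. 3 s trials with 2 s overlap (1 s hop).
    trial_len = trial_len_s * fs_out
    hop = (trial_len_s - overlap_s) * fs_out
    sos_hp = butter(2, 1, btype="high", fs=fs_out, output="sos")
    trials = []
    for start in range(0, eeg.shape[0] - trial_len + 1, hop):
        trial = eeg[start:start + trial_len]
        # Per-trial high pass at 1 Hz, then z-score each electrode.
        trial = sosfiltfilt(sos_hp, trial, axis=0)
        trial = (trial - trial.mean(axis=0)) / trial.std(axis=0)
        trials.append(trial.astype(np.float32))
    return np.stack(trials)   # shape (num_trials, 192, 10) for 3 s trials
```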
Trial duration (sec) | EEG data (time x num. electrodes) | Speech data (time x freq.)
2 | 128x10 | 101x257
3 | 192x10 | 151x257
4 | 256x10 | 201x257
5 | 320x10 | 251x257

TABLE II: Trial duration vs dimension of the input.

Speech stimuli were initially low pass filtered with a cut-off frequency of 8 kHz and downsampled to a sampling rate of 16 kHz. Subsequently, they were segmented into trials with a duration of two, three, four and five seconds at an overlap of one, two, three and four seconds, respectively. The speech spectrogram for each trial was obtained by taking the absolute value of the short-time Fourier transform (STFT) coefficients. The STFT was computed using a Hann window of 32 ms duration with a 12 ms overlap. Most of the analyses in our work were performed using 3-second trials, and other trial durations were used only for comparison purposes. A summary of the dimensions of the EEG signals and speech spectrograms after preprocessing for different trial durations is provided in Table II.
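For illustration, the sketch below computes the magnitude spectrogram of one speech trial with the stated STFT parameters (16 kHz audio, 32 ms Hann window, 12 ms overlap). The use of torch.stft and the centring convention are assumptions, chosen so that the output dimensions match Table II.

```python
import torch

def speech_spectrogram(trial_audio_16k: torch.Tensor) -> torch.Tensor:
    """Magnitude spectrogram of one speech trial sampled at 16 kHz.

    For a 3 s trial (48000 samples) this yields 151 frames x 257 bins,
    matching Table II.
    """
    win = 512             # 32 ms at 16 kHz
    overlap = 192         # 12 ms at 16 kHz
    hop = win - overlap   # 20 ms hop
    stft = torch.stft(trial_audio_16k,
                      n_fft=win,
                      hop_length=hop,
                      win_length=win,
                      window=torch.hann_window(win),
                      center=True,
                      return_complex=True)
    return stft.abs().T   # (time, freq) = (151, 257) for a 3 s trial
```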
III. NETWORK ARCHITECTURE
A top level view of the proposed neural network architecture is shown in Fig. 1. It consists of three subnetworks, namely EEG CNN, Audio CNN and AE Concat.
A. EEG CNN
The EEG subnetwork comprised four different convolutional layers, as shown in Table III. The kernel size of the first layer was chosen as 24, which corresponds to a latency of 375 ms in the time domain. A longer kernel was chosen because previous studies have shown that the TRFs corresponding to the attended and unattended speakers differ around 100 ms and 200 ms [16] [17]. Therefore, a latency of 375 ms could help us to extract features that are modulated by the attention to different speakers in a multi-speaker environment. All other layers were initialized with kernels of shorter duration, as shown in Table III. All convolutions were performed using a stride of 1x1, and after the convolutions, max pooling was used to reduce the dimensionality. To prevent overfitting on the training data and improve generalization, dropout [33] and batch normalization (BN) [34] were applied. Subsequently, the output was passed through a non-linear activation function, chosen as the rectified linear unit (ReLU). The dimension of the input to EEG CNN varied according to the length of the trial (Table II), but the dimension of the output was fixed at 48x32. The max pooling parameters were slightly modified for different trial durations to obtain the fixed output dimension. The first dimension (48) corresponds to the temporal axis and the second dimension (32) to the number of convolution kernels. The dimension of the output that mapped the EEG signals measured at different electrodes was reduced to one by the successive application of max pooling along the electrode axis.
        | Number of Kernels | Kernel Size | Dilation | Padding | Maxpool
Layer 1 | 32                | 24x1        | 1,1      | 12,0    | 2,1
Layer 2 | 32                | 7x1         | 2,1      | 6,0     | 1,2
Layer 3 | 32                | 7x5         | 1,1      | 3,2     | 2,5
Layer 4 | 32                | 7x1         | 1,1      | 3,0     | 1,1
TABLE III: CNN parameters of the EEG subnetwork
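A minimal PyTorch sketch of the EEG subnetwork following the parameters of Table III is given below for a 3-second trial (input 192x10, output 48x32). The exact ordering of convolution, batch normalization, ReLU, max pooling and dropout inside each layer is our assumption, as is the dropout placement.

```python
import torch
import torch.nn as nn

class EEG_CNN(nn.Module):
    """Sketch of the EEG subnetwork (Table III); shapes assume 3 s trials."""
    def __init__(self, dropout=0.25):
        super().__init__()
        def block(c_in, c_out, kernel, dilation, padding, pool):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel, stride=1,
                          dilation=dilation, padding=padding),
                nn.BatchNorm2d(c_out), nn.ReLU(),
                nn.MaxPool2d(pool), nn.Dropout(dropout))
        self.layers = nn.Sequential(
            block(1, 32, (24, 1), (1, 1), (12, 0), (2, 1)),   # Layer 1
            block(32, 32, (7, 1), (2, 1), (6, 0), (1, 2)),    # Layer 2
            block(32, 32, (7, 5), (1, 1), (3, 2), (2, 5)),    # Layer 3
            block(32, 32, (7, 1), (1, 1), (3, 0), (1, 1)))    # Layer 4

    def forward(self, eeg):                 # eeg: (batch, 1, 192, 10)
        out = self.layers(eeg)              # (batch, 32, 48, 1)
        return out.squeeze(-1).transpose(1, 2)   # (batch, 48, 32) EEG embedding
```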
        | Number of Kernels | Kernel Size | Dilation | Padding | Maxpool
Layer 1 | 32                | 1x7         | 1,1      | 0,3     | 1,1
Layer 2 | 32                | 7x1         | 1,1      | 0,0     | 1,4
Layer 3 | 32                | 3x5         | 8,8      | 0,16    | 1,2
Layer 4 | 32                | 3x3         | 16,16    | 0,16    | 1,1
Layer 5 | 1                 | 1x1         | 1,1      | 0,0     | 2,2
TABLE IV: CNN parameters of the Audio subnetwork

Fig. 1: The architecture of the proposed joint CNN-LSTM model. Input to the audio stream is the spectrogram of the speech signals, and input to the EEG stream is the downsampled version of the EEG signals. The number of Audio CNNs depends on the number of speakers present in the auditory scene (here two). From the outputs of Audio CNN and EEG CNN, speech and EEG embeddings are created, which are concatenated together and passed to a BLSTM layer followed by FC layers.
B. Audio CNN
The audio subnetwork that processed the speech spectrogram consisted of five convolution layers, whose parameters are shown in Table IV. All standard procedures such as max pooling, batch normalization, dropout and ReLU activation were applied to the convolution outputs. Similar to the EEG CNN, the dimension of the input to the Audio CNN varied according to the trial duration (Table II), but the dimension of the output feature map was always fixed at 48x16. As the datasets considered in this study were taken from dual-speaker experiments, the Audio CNN was run twice, resulting in two sets of outputs.
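A corresponding sketch of the audio subnetwork following Table IV is shown below; for a 3-second trial the input spectrogram is 151x257 and the output feature map is 48x16. As before, the layer-internal ordering and the dropout placement are assumptions.

```python
import torch
import torch.nn as nn

class Audio_CNN(nn.Module):
    """Sketch of the audio subnetwork (Table IV); shapes assume 3 s trials."""
    def __init__(self, dropout=0.4):
        super().__init__()
        def block(c_in, c_out, kernel, dilation, padding, pool):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel, stride=1,
                          dilation=dilation, padding=padding),
                nn.BatchNorm2d(c_out), nn.ReLU(),
                nn.MaxPool2d(pool), nn.Dropout(dropout))
        self.layers = nn.Sequential(
            block(1, 32, (1, 7), (1, 1), (0, 3), (1, 1)),      # Layer 1
            block(32, 32, (7, 1), (1, 1), (0, 0), (1, 4)),     # Layer 2
            block(32, 32, (3, 5), (8, 8), (0, 16), (1, 2)),    # Layer 3
            block(32, 32, (3, 3), (16, 16), (0, 16), (1, 1)),  # Layer 4
            block(32, 1, (1, 1), (1, 1), (0, 0), (2, 2)))      # Layer 5

    def forward(self, spec):                # spec: (batch, 1, 151, 257)
        return self.layers(spec).squeeze(1)   # (batch, 48, 16) speech embedding
```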
C. AE Concat
The feature maps obtained from EEG CNN and Audio CNN were concatenated along the temporal axis, and the dimension of the feature map after concatenation was 48x64. In this way, we ensured that half of the feature map was contributed by the EEG data and half by the speech data. This also provides the flexibility to extend to more than two speakers, such as in the experiment performed in [35]. The concatenated feature map was passed through a bidirectional long short-term memory (BLSTM) layer [36] [37], which was followed by four fully connected (FC) layers. For the first three FC layers, ReLU activation was used, and for the last FC layer, sigmoid activation was applied, which allows us to classify the attention to speaker 1 or speaker 2. If more than two speakers are present, the sigmoid activation should be replaced by a softmax activation.

The total number of EEG samples and audio samples (trials) available was 118922; 75% of the available samples (89192) were used to train the network, and the rest (29730) were equally split into validation and test data. The network was trained for 80 epochs using a mini-batch size of 32 samples and a learning rate of 5 x 10^-4. The dropout probability was set to 0.25 for the EEG CNN and AE Concat subnetworks but was increased to 0.4 for the Audio CNN subnetwork. A larger dropout probability was used for the Audio CNN because the speech stimuli were identical across subjects for a particular dataset. Hence, when trained on data from multiple subjects, the speech data remain identical and the network may memorize the speech spectrograms of the training data. The network was optimized using the Adam optimizer [38], and the loss function used was binary cross entropy. As neural network training can result in random variations from epoch to epoch, the test accuracy was calculated as the median accuracy of the last five epochs [39]. The network was trained using an Nvidia GeForce RTX-2060 (6 GB) graphics card and took approximately 36 hours to complete the training. The neural network model was developed in PyTorch and the Python code is available at https://github.com/ivine-GIT/joint_CNN_LSTM_AAD.
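To illustrate how the pieces fit together, the sketch below combines the EEG and speech embeddings from the previous sketches, passes them through a BLSTM and the FC stack, and outputs the probability that speaker 1 is attended. The BLSTM hidden size, the FC layer widths and the use of the final time step for classification are placeholders, as these values are not given in the text.

```python
import torch
import torch.nn as nn

class JointCNNLSTM(nn.Module):
    """Sketch of the joint CNN-LSTM model for a dual-speaker scene."""
    def __init__(self, lstm_hidden=32, fc_width=64):   # sizes are assumptions
        super().__init__()
        self.eeg_cnn = EEG_CNN()       # from the earlier sketch, (batch, 48, 32)
        self.audio_cnn = Audio_CNN()   # from the earlier sketch, (batch, 48, 16)
        self.blstm = nn.LSTM(input_size=32 + 2 * 16, hidden_size=lstm_hidden,
                             batch_first=True, bidirectional=True)
        self.fc = nn.Sequential(       # three ReLU FC layers + sigmoid output
            nn.Linear(2 * lstm_hidden, fc_width), nn.ReLU(),
            nn.Linear(fc_width, fc_width), nn.ReLU(),
            nn.Linear(fc_width, fc_width), nn.ReLU(),
            nn.Linear(fc_width, 1), nn.Sigmoid())

    def forward(self, eeg, spec1, spec2):
        # Combine the EEG embedding with one speech embedding per speaker at
        # each of the 48 time steps, giving a 48 x 64 feature map.
        emb = torch.cat([self.eeg_cnn(eeg),
                         self.audio_cnn(spec1),
                         self.audio_cnn(spec2)], dim=-1)   # (batch, 48, 64)
        out, _ = self.blstm(emb)
        return self.fc(out[:, -1])     # probability that speaker 1 is attended
```

A binary cross entropy loss on this output, optimized with Adam, corresponds to the training setup described above.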
D. Sparse Neural Network: Magnitude Pruning

Although neural networks achieve state-of-the-art performance for a wide range of applications in fields such as computer vision and natural language processing, they have a large memory footprint and require extremely high computation power. Over the years, neural networks have extended their scope of applications by scaling up the network size. In 1998, the CNN model (LeNet) that was successful in recognizing handwritten digits consisted of under a million parameters [40], whereas AlexNet, which won the ImageNet challenge in 2012, consisted of 60 million parameters [41]. Neural networks were further scaled up to the order of 10 billion parameters, and efficient methods to train these extremely large networks were presented in [42]. While these large models are very powerful, running them on embedded devices poses huge challenges due to the large memory and computation requirements. Sparse neural networks have recently been proposed to overcome these challenges and enable running such models on embedded devices [43]. In sparse networks, the majority of the model parameters are zeros, and zero-valued multiplications can be ignored, thereby reducing the computational requirement. Similarly, only non-zero weights need to be stored on the device; for all zero-valued weights, only their position needs to be known, reducing the memory footprint. Empirical evidence has shown that neural networks tolerate high levels of sparsity [43] [44] [45].

Fig. 2: Boxplot depicting the decoding accuracies obtained using two different training scenarios. In the first scenario (Ind set train), individual dataset accuracies were obtained using training samples only from that particular dataset; e.g., to calculate the test accuracy of FAU Dataset, training samples were taken only from FAU Dataset. In the second scenario (Full set train), individual dataset accuracies were obtained using training samples from all three datasets combined. As a result, there are more training samples in the second scenario than in the first.

Sparse neural networks are found using a procedure known as network pruning, which consists of three steps. First, a large over-parameterized network is trained in order to obtain a high test accuracy, as over-parameterization has stronger representation power [46]. Second, from the trained over-parameterized network, only the important weights according to a certain criterion are retained, and all other weights are assumed to be redundant and reinitialized to zero. Finally, the pruned network is fine-tuned by training it further using only the retained weights so as to improve the performance. Searching for the redundant weights can be based on simple criteria such as magnitude pruning [43] or on complex algorithms such as variational dropout [47] or L0 regularization [48]. However, it was shown that introducing sparsity using magnitude pruning can achieve comparable or better performance than complex techniques such as variational dropout or L0 regularization [49]. Hence, we present results based only on magnitude pruning in this work.

Sparsification of a neural network has also been investigated as a neural network architecture search rather than merely as an optimization procedure. The lottery ticket hypothesis presented in [50] posits that, inside the structure of an over-parameterized network, there exist subnetworks (winning tickets) that, when trained in isolation, reach accuracies comparable to the original network. The prerequisite to achieve comparable accuracy is to initialize the sparse network using the original random weight initialization that was used to obtain the sparse architecture. However, it was shown that with a careful choice of the learning rate, the stringent requirement on the original weight initialization can be forgone and the sparse network can be trained from scratch from any random initialization [51].
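The one-shot magnitude pruning described above can be sketched with plain tensor masks: weights below a magnitude threshold are zeroed once and kept at zero while the pruned model is fine-tuned. The use of a single global threshold, rather than per-layer thresholds, is an assumption.

```python
import torch

def one_shot_magnitude_prune(model, sparsity=0.5):
    """Zero out the smallest-magnitude weights so that `sparsity` of them are zero.

    Returns a dict of binary masks; reapply them after every optimizer step
    during fine-tuning so that pruned weights stay at zero.
    """
    # Global threshold over all weight tensors (bias terms left untouched).
    all_weights = torch.cat([p.detach().abs().flatten()
                             for n, p in model.named_parameters() if "weight" in n])
    threshold = torch.quantile(all_weights, sparsity)

    masks = {}
    with torch.no_grad():
        for name, param in model.named_parameters():
            if "weight" in name:
                mask = (param.abs() > threshold).float()
                param.mul_(mask)       # one-shot: prune once in the first epoch
                masks[name] = mask
    return masks

def apply_masks(model, masks):
    """Call after each optimizer step during fine-tuning."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in masks:
                param.mul_(masks[name])
```

During fine-tuning, apply_masks would be called after every optimizer step so that the pruned weights remain zero and the sparsity level is preserved.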
IV. RESULTS
A. Attention Decoding Accuracy
To evaluate how the proposed neural network model performs on different datasets, we trained our model under two different scenarios using a trial duration of 3 seconds. In the first scenario (Ind set train), attention decoding accuracies were calculated per individual dataset. I.e., to obtain the test accuracy of subjects belonging to FAU Dataset, the model was trained using training samples only from FAU Dataset, leaving out DTU Dataset and KUL Dataset. Similarly, to obtain the test accuracy for DTU Dataset, the model was trained using training samples only from DTU Dataset, and the same procedure was repeated for KUL Dataset. The median decoding accuracy was 72.6% for FAU Dataset, 48.1% for DTU Dataset and 69.1% for KUL Dataset (Fig. 2). In the second scenario (Full set train), accuracies were calculated by combining training samples from all three datasets. The median decoding accuracies obtained in this scenario were 84.5%, 52.9% and 77.9% for FAU Dataset, DTU Dataset and KUL Dataset, respectively. The results from the second scenario showed a clear, statistically significant improvement over the first scenario.

Fig. 3: Comparison of the decoding accuracies calculated for different trial durations per dataset.
B. Decoding Accuracy vs Trial Duration
To analyse the effect of trial duration on the attention decoding accuracy, the model was trained using trials of length 2, 3, 4 and 5 seconds. For every trial, only one second of new data was added and the remaining data were populated by overlapping with the previous trial using a sliding window. I.e., for 2-second trials, one second of overlap was used, for 3-second trials, two seconds of overlap was used, and so on. In this way, the total number of training samples remained constant for the different trial durations considered in our analysis. The mean decoding accuracy across all subjects and datasets for the 2-second trial duration was 70.9%. Increasing the trial duration to three seconds improved the accuracy significantly, whereas increasing it further to four and five seconds did not improve the accuracy substantially (Fig. 3).

Fig. 4: Boxplots showing the decoding accuracies obtained by ablating different blocks such as the FC layers or the BLSTM layer. To obtain the test accuracies after ablation, the ablated network was trained from scratch in the case of FC rem and BLSTM rem. However, in the case of Audio rem and EEG rem, accuracies were calculated by zeroing out the corresponding input features before passing them to a fully trained network.
C. Ablation Analysis
In order to gain further insights into the architecture of the neural network and understand the contribution of different parts of the model, we performed an ablation analysis. I.e., we modified the neural network architecture by removing a specific block, such as the BLSTM layer or the FC layers, one at a time, and retrained the modified network. Similarly, to understand the importance of the audio input features, decoding accuracies were calculated by zeroing out the EEG input, and to understand the importance of the EEG input features, decoding accuracies were calculated by zeroing out the audio input. The median decoding accuracy obtained by zeroing out the EEG input was 48.6%, whereas zeroing out the audio input resulted in an accuracy of 53.6% (Fig. 4). When the network was retrained after removing only the BLSTM layer, the median decoding accuracy was 68.3%, and after removing only the FC layers, the median decoding accuracy was 74.7%. For comparison, the median decoding accuracy calculated using the full network was 77.2%.
D. Sparse Neural Network using Magnitude Pruning
To investigate the degree of sparsity that our neural network can tolerate, we pruned our network at 40%, 50%, 60%, 70% and 80% sparsity. In order to fine-tune the pruned neural network, there are two options: 1) sequential or 2) one-shot. In sequential fine-tuning, weights of the trained original model are reinitialized to zero in small steps per epoch until the required sparsity is attained. In one-shot fine-tuning, weights of the trained original model are reinitialized to zero in one shot in the first epoch, and the sparse model is further trained to improve performance. We observed that sequential fine-tuning is less efficient than one-shot fine-tuning in terms of training time budget. Therefore, all results presented here are based on the one-shot method. We achieved a median decoding accuracy of 76.9% at a sparsity of 40%, which is statistically identical to the original model at 77.2%. With further pruning, the median decoding accuracy decreased to 75.7%, which was significantly lower than that of the original model (Fig. 5).

Fig. 5: Plots comparing the trade-off between decoding accuracies and sparsity levels.

V. DISCUSSION
People with hearing loss suffer from deteriorated speech intelligibility in noisy acoustic environments such as multi-speaker scenarios. Increasing the audibility by means of hearing aids has not been shown to provide sufficient improvement in speech intelligibility, as hearing aids are unable to estimate a priori to which speaker the user intends to listen. Hence, hearing aids amplify both the wanted signal (attended speaker) and the interfering signals (ignored speakers). Recently, it has been shown that cortical signals measured using EEG can be used to infer the auditory attention by discriminating between the attended speaker and the ignored speaker in a dual-speaker scenario [7]. Linear system analysis has been the commonly used methodology to analyse the EEG signals measured from a listener performing selective attention in a multi-speaker scenario. However, in recent years, non-linear analyses based on neural networks have become prominent, thanks to the availability of customized hardware accelerators and associated software libraries.

In this work, we developed a joint CNN-LSTM model to infer the auditory attention of a listener in a multi-speaker environment. CNNs take the EEG signals and the spectrograms of the multiple speakers as inputs and extract features through successive convolutions. These convolutions generate intermediate embeddings of the inputs, which are then given to a BLSTM layer. As LSTMs fall under the category of recurrent neural networks, they can model the temporal relationship between the EEG embedding and the multiple spectrogram embeddings. Finally, the output of the BLSTM is processed through FC layers to infer the auditory attention. The effectiveness of the proposed neural network was evaluated with the help of three different EEG datasets collected from subjects who undertook a dual-speaker experiment.
A. Attention Decoding Accuracy
We analysed the performance of our neural network in two different training scenarios. In the first scenario, individual dataset accuracy was obtained by training the network using samples taken only from that particular dataset. In the second scenario, individual dataset accuracy was obtained by training using samples combined from all three datasets. The accuracies obtained in the second scenario were higher than in the first scenario by 10.8% on average, which is in agreement with the postulate of neural network learning that the larger the amount of training data, the better the generalization. The decoding accuracies obtained for subjects belonging to the DTU Dataset were markedly lower than for the other two datasets, similar to the observation made in [52]. While the exact reason for the lower performance is unclear, a major difference of the DTU Dataset compared to the other two datasets is that the former consisted of attention to male and female speakers whereas the latter consisted of attention to only male speakers. Therefore, training with additional EEG data that involve attention to female speakers could provide more insight into the lower performance.
B. Decoding Accuracy vs Trial Duration
One of the major challenges that AAD algorithms based on linear systems theory face is the deteriorated decoding performance when the trial duration is reduced. To this end, we calculated the accuracies using our neural network for trial durations of 2, 3, 4 and 5 seconds. We observed a clear performance improvement when the trial duration was increased from two to three seconds, whereas for all other trial durations the accuracies did not improve substantially (Fig. 3). However, increasing the trial duration results in a larger latency needed to infer the auditory attention, which can adversely affect applications that require real-time operation. Hence, a three-second trial duration may be an optimal operating point, as it is known from a previous study that the human brain tracks sentence phrases, and phrases are normally not longer than three seconds [53]. Similarly, our analysis made use of 10 electrodes distributed over the scalp, but future work should investigate the effect of reducing the number of electrodes so that algorithms based on neural networks can be integrated into devices such as hearing aids. We anticipate that the current network will require modifications with additional hyperparameter tuning in order to accommodate the reduction in the number of electrodes, as the fewer the electrodes, the lower the amount of training data available.
C. Ablation Analysis
Performing an ablation analysis gives the possibility to evaluate the contribution of different inputs and modules in a neural network. For our model, when only the speech features were given as input, the median decoding accuracy was 48.6%, whereas only EEG features as input resulted in an accuracy of 53.6% (Fig. 4). I.e., our neural network model learned more from the EEG features than from the speech features. This is not surprising because, in all the datasets that we analysed, the speech stimuli were repeated across subjects while the measured EEG was unique. This also suggests that, in future, care must be taken to design experiments in such a way as to incorporate diverse speech stimuli. Further analysis ablating the BLSTM layer and the FC layers revealed that the BLSTM layer was far more important than the FC layers. This is probably due to the ability of the LSTM layer to model the temporal delay between the speech cues and the EEG. However, we anticipate that when training datasets become larger and more diverse, FC layers will grow in importance due to the improved representation and optimization power of dense networks [46].
D. Sparse Neural Networks
Investigation into the amount of sparsity that our neural network can tolerate revealed a tolerance of up to 50% sparsity without substantial loss of accuracy (Fig. 5). However, standard benchmarking on sparsity has found that deep networks such as ResNet-50 can tolerate up to 90% sparsity [49]. One of the potential reasons for the lower level of sparsity tolerated by our model is its shallow nature. I.e., our model comprises less than half a million learnable parameters, while deep networks such as ResNet-50 comprise over 25 million learnable parameters. It is also interesting to note that the accuracy obtained by removing the FC layers in our ablation analysis was 74.6%, compared to the full network accuracy of 77.2%, and the ablated network consisted of 105605 parameters, which is approximately only a quarter of the total number of parameters (416741) of the original network. This shows that, by careful design choices, we can reduce the network size considerably compared to an automatic sparse network search using magnitude pruning.
VI. CONCLUSION
Integrating EEG to track the cortical signals is one of the latest proposals to enhance the quality of service provided by hearing aids to their users. EEG is envisaged to provide neuro-feedback about the user's intention, thereby enabling the hearing aid to infer and enhance the attended speech signals. In the present study, we proposed a joint CNN-LSTM network to classify the attended speaker and subsequently infer the auditory attention of a listener. The proposed neural network uses speech spectrograms and EEG signals as inputs to infer the auditory attention. Results obtained by training the network using three different EEG datasets, collected from multiple subjects who undertook a dual-speaker experiment, showed that our network generalizes well to different scenarios. Investigation into the importance of different constituents of our network architecture revealed that adding an LSTM layer improved the performance of the model considerably. Evaluating sparsity on the proposed joint CNN-LSTM network demonstrated that the network can tolerate up to 50% sparsity without considerable deterioration in performance. These results could pave the way to integrating algorithms based on neural networks into hearing aids, which have constrained memory and computational power.
ACKNOWLEDGMENT
This work was supported by a grant from
Johannes und Frieda Marohn-Stiftung, Erlangen. We convey our gratitude to all participants who took part in the study and would like to thank the student Laura Rupprecht, who helped us with data acquisition.
REFERENCES
[1] E. C. Cherry, "Some experiments on the recognition of speech, with one and with two ears," The Journal of the Acoustical Society of America, vol. 25, no. 5, pp. 975–979, 1953.
[2] J. Feldman, "What is a visual object?" Trends in Cognitive Sciences, vol. 7, no. 6, pp. 252–256, 2003.
[3] B. G. Shinn-Cunningham, "Object-based auditory and visual attention," Trends in Cognitive Sciences, vol. 12, no. 5, pp. 182–186, 2008.
[4] E. K. Vogel, G. F. Woodman, and S. J. Luck, "Pushing around the Locus of Selection: Evidence for the Flexible-selection Hypothesis," Journal of Cognitive Neuroscience, vol. 17, no. 12, pp. 1907–1922, 2005.
[5] N. Mesgarani and E. F. Chang, "Selective cortical representation of attended speaker in multi-talker speech perception," Nature, vol. 485, no. 7397, pp. 233–236, 2012.
[6] N. Ding and J. Z. Simon, "Neural coding of continuous speech in auditory cortex during monaural and dichotic listening," Journal of Neurophysiology, vol. 107, no. 1, pp. 78–89, 2012.
[7] J. A. O'Sullivan, A. J. Power, N. Mesgarani, S. Rajaram, J. J. Foxe, B. G. Shinn-Cunningham, M. Slaney, S. A. Shamma, and E. C. Lalor, "Attentional Selection in a Cocktail Party Environment can be Decoded from Single-Trial EEG," Cerebral Cortex, vol. 25, no. 7, pp. 1697–1706, 2014.
[8] S. J. Aiken and T. W. Picton, "Human Cortical Responses to the Speech Envelope," Ear and Hearing, vol. 29, no. 2, pp. 139–157, 2008.
[9] E. C. Lalor and J. J. Foxe, "Neural responses to uninterrupted natural speech can be extracted with precise temporal resolution," European Journal of Neuroscience, vol. 31, no. 1, pp. 189–193, 2010.
[10] G. M. Di Liberto, J. A. O'Sullivan, and E. C. Lalor, "Low-Frequency Cortical Entrainment to Speech Reflects Phoneme-Level Processing," Current Biology, vol. 25, no. 19, pp. 2457–2465, 2015.
[11] M. P. Broderick, A. J. Anderson, and E. C. Lalor, "Semantic Context Enhances the Early Auditory Encoding of Natural Speech," Journal of Neuroscience, vol. 39, no. 38, pp. 7564–7575, 2019.
[12] L. Fiedler, M. Woestmann, C. Graversen, A. Brandmeyer, T. Lunner, and J. Obleser, "Single-channel in-ear-EEG detects the focus of auditory attention to concurrent tone streams and mixed speech," Journal of Neural Engineering, vol. 14, no. 3, p. 036020, 2017.
[13] I. Kuruvila, E. Fischer, and U. Hoppe, "An LMMSE-based Estimation of Temporal Response Function in Auditory Attention Decoding," in . IEEE, 2020, pp. 2837–2840.
[14] W. Biesmans, N. Das, T. Francart, and A. Bertrand, "Auditory-Inspired Speech Envelope Extraction Methods for Improved EEG-Based Auditory Attention Detection in a Cocktail Party Scenario," IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 25, no. 5, pp. 402–412, 2017.
[15] B. Mirkovic, S. Debener, M. Jaeger, and M. De Vos, "Decoding the attended speech stream with multi-channel EEG: implications for online, daily-life applications," Journal of Neural Engineering, vol. 12, no. 4, p. 046007, 2015.
[16] L. Fiedler, M. Wöstmann, S. K. Herbst, and J. Obleser, "Late cortical tracking of ignored speech facilitates neural selectivity in acoustically challenging conditions," NeuroImage, vol. 186, pp. 33–42, 2019.
[17] I. Kuruvila, K. C. Demir, E. Fischer, and U. Hoppe, "Inference of the Selective Auditory Attention using Sequential LMMSE Estimation," arXiv preprint arXiv:2102.01746, 2021.
[18] E. Zwicker and H. Fastl, Psychoacoustics: Facts and Models. Springer Science & Business Media, 2013, vol. 22.
[19] S. A. Fuglsang, T. Dau, and J. Hjortkjær, "Noise-robust cortical tracking of attended speech in real-world acoustic scenes," NeuroImage, vol. 156, pp. 435–444, 2017.
[20] S. Geirnaert, T. Francart, and A. Bertrand, "An Interpretable Performance Metric for Auditory Attention Decoding Algorithms in a Context of Neuro-Steered Gain Control," IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 28, no. 1, pp. 307–317, 2019.
[21] S. Miran, S. Akram, A. Sheikhattar, J. Z. Simon, T. Zhang, and B. Babadi, "Real-Time Tracking of Selective Auditory Attention From M/EEG: A Bayesian Filtering Approach," Frontiers in Neuroscience, vol. 12, p. 262, 2018.
[22] A. Craik, Y. He, and J. L. Contreras-Vidal, "Deep learning for electroencephalogram (EEG) classification tasks: a review," Journal of Neural Engineering, vol. 16, no. 3, p. 031001, 2019.
[23] T. de Taillez, B. Kollmeier, and B. T. Meyer, "Machine learning for decoding listeners' attention from electroencephalography evoked by continuous speech," European Journal of Neuroscience, vol. 51, no. 5, pp. 1234–1241, 2020.
[24] G. Ciccarelli, M. Nolan, J. Perricone, P. T. Calamia, S. Haro, J. O'Sullivan, N. Mesgarani, T. F. Quatieri, and C. J. Smalt, "Comparison of two-talker attention decoding from EEG with nonlinear neural networks and linear methods," Scientific Reports, vol. 9, no. 1, pp. 1–10, 2019.
[25] L. Deckers, N. Das, A. H. Ansari, A. Bertrand, and T. Francart, "EEG-based detection of the attended speaker and the locus of auditory attention with convolutional neural networks," bioRxiv.
[26] M. J. Monesi, B. Accou, J. Montoya-Martinez, T. Francart, and H. Van Hamme, "An LSTM Based Architecture to Relate Speech Stimulus to EEG," in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 941–945.
[27] Y. Tian and L. Ma, "Auditory attention tracking states in a cocktail party environment can be decoded by deep convolutional neural networks," Journal of Neural Engineering, 2020.
[28] D. Wang and J. Chen, "Supervised Speech Separation Based on Deep Learning: An Overview," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 10, pp. 1702–1726, 2018.
[29] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein, "Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation," arXiv preprint arXiv:1804.03619, 2018.
[30] S. A. Fuglsang, D. D. Wong, and J. Hjortkjær, "EEG and audio dataset for auditory attention decoding," 2018. [Online]. Available: https://doi.org/10.5281/zenodo.1199011
[31] N. Das, T. Francart, and A. Bertrand, "Auditory Attention Detection Dataset KULeuven," 2019. [Online]. Available: https://doi.org/10.5281/zenodo.3377911
[32] N. Das, W. Biesmans, A. Bertrand, and T. Francart, "The effect of head-related filtering and ear-specific decoding bias on auditory attention detection," Journal of Neural Engineering, vol. 13, no. 5, p. 056014, 2016.
[33] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[34] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[35] P. J. Schäfer, F. I. Corona-Strauss, R. Hannemann, S. A. Hillyard, and D. J. Strauss, "Testing the Limits of the Stimulus Reconstruction Approach: Auditory Attention Decoding in a Four-Speaker Free Field Environment," Trends in Hearing, vol. 22, p. 2331216518816600, 2018.
[36] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[37] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
[38] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[39] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, "Accurate, large minibatch SGD: Training ImageNet in 1 hour," arXiv preprint arXiv:1706.02677, 2017.
[40] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-Based Learning Applied to Document Recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[41] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," Communications of the ACM, vol. 60, no. 6, pp. 84–90, 2017.
[42] A. Coates, B. Huval, T. Wang, D. Wu, B. Catanzaro, and A. Ng, "Deep learning with COTS HPC systems," in International Conference on Machine Learning. PMLR, 2013, pp. 1337–1345.
[43] S. Han, J. Pool, J. Tran, and W. Dally, "Learning both Weights and Connections for Efficient Neural Network," Advances in Neural Information Processing Systems, vol. 28, pp. 1135–1143, 2015.
[44] S. Narang, E. Elsen, G. Diamos, and S. Sengupta, "Exploring Sparsity in Recurrent Neural Networks," arXiv preprint arXiv:1704.05119, 2017.
[45] M. Zhu and S. Gupta, "To prune, or not to prune: exploring the efficacy of pruning for model compression," arXiv preprint arXiv:1710.01878, 2017.
[46] J.-H. Luo, J. Wu, and W. Lin, "ThiNet: A filter level pruning method for deep neural network compression," in