Inference of the Selective Auditory Attention using Sequential LMMSE Estimation
Ivine Kuruvila, Kubilay Can Demir, Eghart Fischer, Ulrich Hoppe
Abstract—Attentive listening in a multispeaker environment such as a cocktail party requires suppression of the interfering speakers and the surrounding noise. People with normal hearing perform remarkably well in such situations. Analysis of cortical signals using electroencephalography (EEG) has revealed that the EEG signals track the envelope of the attended speech more strongly than that of the interfering speech. This has enabled the development of algorithms that can decode the selective attention of a listener in controlled experimental settings. However, these algorithms often require long trial durations and computationally expensive calibration to obtain a reliable inference of attention. In this paper, we present a novel framework to decode the attention of a listener within trial durations of the order of two seconds. It comprises three modules: 1) dynamic estimation of the temporal response functions (TRF) in every trial using a sequential linear minimum mean squared error (LMMSE) estimator, 2) extraction of the N1 − P2 peak of the estimated TRF, which serves as a marker related to the attentional state, and 3) a probabilistic measure of the attentional state obtained using a support vector machine followed by a logistic regression. The efficacy of the proposed decoding framework was evaluated using EEG data collected from 27 subjects. The total number of electrodes required to infer the attention was four: one for the signal estimation, one for the noise estimation, and the other two being the reference and the ground electrodes. Our results make further progress towards the realization of neuro-steered hearing aids.
I. INTRODUCTION
The cocktail party effect, or the ability of humans to focus the attention on a certain speaker among multiple speakers, is fundamental to interpersonal communication. In the last decades, the scientific community has actively investigated how the human brain is able to maintain uninterrupted attention in the presence of noise and interference from different speakers. Although the process by which the brain segregates multiple speakers still remains largely unknown, progress is continuously being made by analyzing the cortical signals measured using electrocorticography (ECoG), magnetoencephalography (MEG) and electroencephalography (EEG). In recent years, EEG analyses have come to the forefront of auditory attention research, thanks to their non-invasive nature, minimal effort and high temporal resolution. Several methods have been proposed that model the relationship between multiple speakers present in an auditory scene and the elicited EEG signal [1] [2] [3] [4].
This work was carried out at the ENT clinic, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany and received a grant from WS Audiology. Ivine Kuruvila, Kubilay Can Demir and Ulrich Hoppe are with the Department of Audiology, ENT-Clinic, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany ([email protected]). Eghart Fischer is with WS Audiology, Erlangen, Germany.
The underlying assumption in these methods is that the cortical signals track the acoustic envelope of the attended speaker more strongly than that of the unattended speaker [2] [5]. While the path between the outer ear and the EEG/MEG sensor is non-linear, linear system analysis is often used to analyse the processing of the speech envelope through the auditory path. The system response function that models this pathway is termed the temporal response function (TRF) [6] [7]. Since this mapping is from speech to EEG/MEG sensors in the forward direction, TRFs fall under the category of forward models. TRFs are not limited to speech envelopes but can be used to map a linear relationship between speech spectrograms and the cortical signals, or between phonemes and the cortical signals [7] [8]. In the EEG modality, TRFs have high temporal resolution with peaks around 100 ms and 200 ms that modulate the attentional effect [3]. On the other hand, due to the assumption of linearity, EEG signals can be mapped onto the speech envelope in the backward direction (backward model). The current state-of-the-art system identification method for both forward and backward models is based on least squares (LS) [9], and the identified system can be used to deduce the attention of a listener [2] [10] [11] [12] [13] [14]. However, the system response function calculated using the backward model cannot be physiologically interpreted due to its implicit noise suppression [15]. Therefore, the TRF based on the forward model using EEG signals is the focus of this paper.

Although the aforementioned studies have proven successful in estimating the selective attention reliably, they suffer from two major limitations. First, the temporal resolution required to obtain a reliable auditory attention decoding (AAD) accuracy is of the order of tens of seconds, whereas a listener can switch the attention at much shorter time scales. Second, attention decoding accuracies deteriorate as the number of EEG electrodes used for analysis reduces. A recent study based on the state space model has addressed the first limitation, whereby the temporal resolution was reduced to less than one second [16]. Similarly, the effect of reducing the number of EEG electrodes on the attention decoding accuracy was investigated in [10] [12] [17].

Here, we present a novel framework to infer the selective auditory attention of a listener. The proposed framework consists of three modules. The first module relates to the dynamic estimation of the TRF corresponding to the attended speaker (attended TRF) and the unattended speaker (unattended TRF). It is based on a sequential linear minimum mean squared error (LMMSE) estimator, which is an improvement of the algorithm proposed in [18]. LMMSE algorithms are based on explicitly calculating the covariance of the signal component and subsequently applying Bayesian estimation theory. In the second module, a marker related to the attention of the listener is extracted. A correlation based marker is a commonly used attention marker, where the Pearson correlation coefficients between the original signals and the reconstructed signals are compared [13] [16] [19]. However, it has also been suggested that the amplitude peaks of the attended TRF and the unattended TRF differ around 100 ms (N1_TRF) and 200 ms (P2_TRF) latencies and that these peaks correlate with selective attention [3] [7] [20]. As the sequential LMMSE estimator is capable of generating TRFs with high fidelity from short duration trials, we define the attention marker as the magnitude of the difference between N1_TRF and P2_TRF, known as the N1_TRF − P2_TRF peak. In the third module, we train a support vector machine (SVM) using the attention markers corresponding to the two speakers as features in order to generate a decision boundary and classify attention. Finally, a logistic regression is applied to the SVM's output to obtain a probabilistic confidence of the attentional inference.

The rest of the paper is organized as follows. In section II, the experimental details, information about the speech stimuli and the EEG data collection procedure are provided. In section III, the proposed attention decoding framework is developed and explained in detail. Classification results are presented in section IV and the paper is concluded with a discussion on the algorithm and the results in section V.

II. MATERIALS AND METHODS
A. Participants
Twenty-seven subjects, all native German speakers, took part in the study. Of these, eighteen subjects (mean age = 58 years, range = 24-73 years, nine females, nine males) were hearing impaired and the remaining nine subjects (mean age = 32 years, range = 20-52 years, five females, four males) had normal hearing. Individuals with pure tone audiometric thresholds better than 25 dB HL at octave frequencies from 125 Hz to 8 kHz were considered as normal hearing, whereas individuals with pure tone audiometric thresholds worse than 25 dB HL were considered as hearing impaired. All participants gave their written consent before the start of the experiments and the study was approved by the ethics committee of the University of Erlangen-Nuremberg.
B. Experimental Design
To emulate a cocktail party scenario, two streams of news spoken by two different male speakers were presented to the subject simultaneously. They were presented through two loudspeakers which were connected to a computer through a Fireface UC soundcard (RME, Haimhausen, Germany). The loudspeakers were placed one meter away from the subject at an azimuth of +45° and -45° on the left and on the right, respectively. The total duration of the speech stimuli was 30 minutes and consisted of six segments, each segment being approximately five minutes long. Subjects were asked to focus their attention either on the left speaker or on the right speaker in the first four segments and on the other speaker in the remaining two segments. The initial attention to the left or to the right was chosen randomly across subjects. After every five-minute segment, subjects were given a multiple choice questionnaire related to the contents of the attended stream to motivate them to comply with the task.

Before the start of the EEG measurements, the loudness of the loudspeakers was calibrated to the most comfortable level for each individual subject. This was determined by increasing the loudness of a particular loudspeaker until the subject indicated that the contents were clearly understandable. Subsequently, the loudness of the other loudspeaker was adjusted to the same level. Hearing-impaired subjects were requested to remove their hearing aids before the experiment so as to avoid any non-uniform speech processing introduced by the different hearing aid manufacturers. Although the participants in our experiment consisted of both hearing impaired and normal hearing subjects, it is not within the scope of this study to present a comparative analysis of the two groups. Comprehensive analyses of the effect of hearing loss on cortical tracking during selective auditory attention are provided in [21] [22] [23].

C. Speech Stimulus and EEG Recordings
The speech stimuli used in our selective auditory attention experiment were taken from the slowly spoken news section of the German news website Deutsche Welle. To reduce the subjects' prior knowledge, the news contents were taken from the 2015-16 archive of the website. Each news item was around 60 seconds long and was presented only once to the subject. The sampling rate of the speech signal was 44.1 kHz. The EEG signal was recorded with a Brainamp MR amplifier (Brainproducts, Munich, Germany) using 21 AgCl electrodes placed according to the 10-20 system. The sampling rate of the EEG signal was 2.5 kHz; the reference electrode was placed at the right mastoid and the ground electrode was placed on the left earlobe. As far as possible, the impedance of the electrodes was maintained under 10 kΩ. The speech stimuli and the EEG signals were synchronized using a trigger signal that was sent through the soundcard to the EEG amplifier.

D. Data Analysis
The speech envelope was obtained by taking the absolute value of the Hilbert transform of the speech signal. As many studies on attention decoding have shown that low frequency cortical signals under 10 Hz track the low frequency speech envelope, both the EEG and the speech envelopes were downsampled to 64 Hz [1] [2]. Afterwards, they were bandpass filtered between 1-9 Hz. In the time domain, the trial duration considered for analysis was 2 s throughout this paper unless otherwise stated. A two-second trial duration was chosen because the human brain can process of the order of two seconds of independent auditory information in working memory [24]. EEG signals measured at electrodes whose impedance was larger than 10 kΩ were discarded from further investigation. All analyses were performed using MATLAB, version R2017b (The Mathworks Inc., Natick, Massachusetts).
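As a concrete illustration of this preprocessing chain, a minimal Python/SciPy sketch is given below (the published analyses were performed in MATLAB); the Butterworth filter order and the use of zero-phase filtering are our assumptions, since only the 1-9 Hz passband is specified above.

```python
import numpy as np
from scipy.signal import hilbert, resample_poly, butter, sosfiltfilt

def preprocess(speech, fs_speech, eeg, fs_eeg, fs_out=64, band=(1.0, 9.0)):
    """Envelope extraction and band limiting as described in section II-D."""
    envelope = np.abs(hilbert(speech))                   # Hilbert envelope
    env_64 = resample_poly(envelope, fs_out, fs_speech)  # downsample to 64 Hz
    eeg_64 = resample_poly(eeg, fs_out, fs_eeg)
    # 1-9 Hz band-pass (4th-order Butterworth, zero-phase; assumed settings)
    sos = butter(4, band, btype="bandpass", fs=fs_out, output="sos")
    return sosfiltfilt(sos, env_64), sosfiltfilt(sos, eeg_64)

# Example: a two-second trial corresponds to 128 samples at 64 Hz.
# env, resp = preprocess(speech, 44100, eeg, 2500)
# trials = [(env[i:i + 128], resp[i:i + 128]) for i in range(0, len(resp) - 128, 128)]
```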
III. ATTENTION DECODING FRAMEWORK

Our attention decoding framework consists of three modules. In the first module (section III-A), the response of the auditory system that generated the cortical signals when attending to a certain speaker is estimated. In the second module (section III-B), a marker corresponding to the dynamic attentional state of the listener is extracted. In the third and final module (section III-C), the attention markers are fed as input to an SVM classifier to infer the current attentional state. A logistic regression is then trained on the classifier output to generate soft decisions corresponding to the probability of attention to a certain speaker.
A. TRF Estimation

1) LMMSE: The TRF is the linear system response function that characterizes the auditory pathway inside the brain when a person listens to a certain speaker. It has now been established that the cortical signals track the envelope of speech during attentive listening [5]. Let θ, a p × 1 vector, denote the TRF and let s, an N × 1 vector (N > p), denote the input speech signal envelope. Let r denote the cortical signal measured at an EEG electrode when a system with θ as TRF is stimulated by the input s. Then r can be expressed as

r = s ∗ θ + w = Sθ + w,   (1)

where S is a known matrix of order N × p which is the causal time-lagged version of the input speech envelope, w is an N × 1 noise vector and ∗ denotes the linear convolution operator. The most commonly used algorithm to estimate the system response function in (1) is based on LS estimation, where it is assumed that θ is a deterministic variable [25]. As it may not always be guaranteed to be deterministic, we assume θ to be a random variable with a prior probability density function (PDF) N(μ_θ, C_θθ). The noise vector w is also assumed to be a random variable with a prior PDF N(0, C_w) and independent of θ. This form is known as the Bayesian linear model and the estimator that minimizes the mean squared error is given as [26] [27]

θ̂ = E(θ | r).   (2)

The posterior mean in (2) can be expressed as [18]

θ̂ = E(θ) + C_θr C_rr^{−1} (r − E(r)),   (3)

where E(·) corresponds to the sample mean, C_rr corresponds to the autocovariance of the observation signal and C_θr corresponds to the cross-covariance between the observation and the system response. Equation (3) can be further expanded as [18]

θ̂ = μ_θ + C_θθ S^T (S C_θθ S^T + C_w)^{−1} (r − S μ_θ),   (4)

which gives the LMMSE estimate of θ. When data from multiple electrodes need to be analyzed, (4) must be solved separately for the signals at each electrode.
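A minimal numerical sketch of (4) is given below, assuming that the prior mean μ_θ and the covariances C_θθ and C_w are available; the helper for building the lagged stimulus matrix and all variable names are illustrative and not taken from the paper.

```python
import numpy as np

def lagged_matrix(envelope, n_lags):
    """Causal time-lagged stimulus matrix S (N x p) built from the envelope."""
    N = len(envelope)
    S = np.zeros((N, n_lags))
    for lag in range(n_lags):
        S[lag:, lag] = envelope[:N - lag]
    return S

def lmmse_trf(envelope, r, n_lags, mu_theta, C_theta, C_w):
    """Batch LMMSE estimate of the TRF, eq. (4)."""
    S = lagged_matrix(envelope, n_lags)
    gain = C_theta @ S.T @ np.linalg.inv(S @ C_theta @ S.T + C_w)
    return mu_theta + gain @ (r - S @ mu_theta)
```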
2) Sequential LMMSE:
LMMSE estimation or LS estimation methods assume that the θ̂ estimated in the previous trial does not have any influence on the θ̂ estimated in the current trial. But as auditory attention is a continuous process, the attention during the previous instance may have an influence on the attention at the current instance. In such scenarios, θ̂ at the n-th trial can be estimated sequentially as [27]

Estimator update:
θ̂[n] = θ̂[n−1] + K[n] (r − S[n] θ̂[n−1]),   (5)

where

K[n] = M[n−1] S^T[n] (C_w + S[n] M[n−1] S^T[n])^{−1}.   (6)

Minimum MSE matrix update:
M[n] = (I − K[n] S[n]) M[n−1].   (7)

The gain factor K[n] is a p × N matrix and M[n] is a p × p matrix that corresponds to the covariance of the estimate θ̂ at the n-th trial. A procedure to calculate the initialization parameters θ̂[−1] and M[−1] will be discussed in section III-D.

a) Spatial Noise Covariance Estimation: Since EEG amplifiers are differential amplifiers, they amplify the difference between the signals measured at the data electrodes and the reference electrode. As a result, the differential component reduces as data electrodes are placed closer to the reference electrode. This can be better understood by looking at auditory evoked potentials (AEPs) at different scalp locations [28]. AEPs are a convenient tool to analyze SNR characteristics, as we know the true noiseless signal beforehand due to ensemble averaging. This is not the case in speech tracking, as we are working on single trials and ensemble averaging is not feasible. The peak amplitude of AEPs around 100 ms is termed the N1 peak and it is a measure of the quality of the generated AEP [29]. Given that the reference electrode is placed at the mastoid, the N1 peak is largest at the vertex locations and reduces as we move towards the temporal locations in both directions. Fig. 1 depicts the SNR distribution of the AEPs obtained at different scalp locations. SNRs are usually between -9 dB and -17 dB, and electrodes closer to the reference electrode have low SNR compared to the vertex electrodes [18] [30]. Consequently, we can use the signals at the electrodes closest to the reference electrode to calculate the covariance matrix of the noise that is required to solve (4) or (6). This is under the assumption that the signal to noise characteristics remain similar between the AEP and AAD paradigms, as the stimulus in both scenarios is an acoustic signal.

Fig. 1. Heatplot depicting the SNR (in dB) of the AEPs obtained at different scalp locations when the reference electrode was placed at the right mastoid.

If the EEG signal (w) from only a single electrode is available for noise estimation, then the noise covariance matrix can be calculated as

C_w = σ²(w) I,   (8)

where σ²(·) represents the variance and I is an identity matrix of order N × N. In (8), we assume that the noise is sampled from a set of uncorrelated random variables that are identically distributed. If signals from L > 1 electrodes are available for noise estimation, C_w can be calculated as

C_w = (1 / (L − 1)) Σ_{k=1}^{L} (w_k − E(w_k)) (w_k − E(w_k))^T.   (9)

To ensure the invertibility of the matrix in (4) and (6), the rank of C_w must be at least N.

b) Noise Covariance Estimation in a Multispeaker Scenario: When multiple speakers are present in an auditory scene, the cortical signals measured at an EEG electrode can be influenced by every speaker present. In a dual-speaker scenario, the observed EEG signal r can be written as

r = θ_a ∗ s_a + θ_u ∗ s_u + w,   (10)

where θ_a represents the attended TRF, θ_u represents the unattended TRF, s_a represents the attended speaker envelope and s_u represents the unattended speaker envelope.
θ_a and θ_u can be estimated separately using (5)-(7). However, in doing so, we treat the other speaker as noise in the EEG signal. That means, during the estimation of θ_a, the cortical signal generated by the unattended speaker is considered as interference and is added to the noise, and vice versa. Subsequently, for the estimation of θ_a, (10) is rewritten as

r = θ_a ∗ s_a + w_a,  where  w_a = w + θ_u ∗ s_u,   (11)

and for the estimation of θ_u, (10) is rewritten as

r = θ_u ∗ s_u + w_u,  where  w_u = w + θ_a ∗ s_a.   (12)

Finally, we make an approximation such that

σ²(w_a) ≈ σ²(w_u) = σ²(w).   (13)

The rationale for this approximation is as follows. The component of the EEG that is contributed by the acoustic stimulus is significantly small compared to the background noise. Hence, the error introduced in the noise variance calculation by treating the interference from the other speaker as noise should also be significantly small. This is illustrated for the case of an AEP in Fig. 2.

Fig. 2. Plot showing the true signal (AEP) and the background noise (SNR = -10.8 dB) for a single epoch measured at the Cz electrode from a representative subject. The AEP was obtained by averaging over 100 epochs.
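The trial-by-trial update (5)-(7), together with the single-electrode noise covariance of (8), can be sketched as follows; with two-second trials at 64 Hz, N = 128 samples and, for 24 lags, p = 24. The function and variable names are illustrative only.

```python
import numpy as np

def noise_covariance(w_noise):
    """Eq. (8): white-noise approximation from one electrode close to the
    reference (e.g., the left mastoid when referencing the right mastoid)."""
    return np.var(w_noise) * np.eye(len(w_noise))

def seq_lmmse_update(theta_prev, M_prev, S, r, C_w):
    """One sequential LMMSE trial, eqs. (5)-(7).
    theta_prev: p-vector, M_prev: p x p, S: N x p lagged envelope, r: N-vector EEG."""
    K = M_prev @ S.T @ np.linalg.inv(C_w + S @ M_prev @ S.T)  # gain (6), p x N
    theta = theta_prev + K @ (r - S @ theta_prev)             # estimator update (5)
    M = (np.eye(len(theta_prev)) - K @ S) @ M_prev            # MMSE matrix update (7)
    return theta, M

# In a dual-speaker trial the same update is run twice, once with the lagged
# attended envelope and once with the lagged unattended envelope, using the
# same C_w (approximation (13)).
```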
B. Extraction of Attention Markers
The attention marker is a measure of the reliability of the estimated TRFs in decoding the attentional states. The correlation based attention marker is a commonly used measure, which is calculated as the Pearson correlation coefficient between the original signal and the reconstructed signal. To be precise, once the TRF is estimated, it is used to reconstruct the EEG signal as a linear convolution between the speech envelope and the estimated TRF. As there should be at least two speakers present for selective attention, two separate EEG signals can be reconstructed using the speech envelopes and the estimated TRFs. Then the speaker corresponding to the reconstructed EEG having a larger Pearson correlation coefficient with the original EEG is flagged as the attended speaker. Alternatively, a correlation based attention marker can be calculated using the backward model, where the speech envelope is reconstructed from the EEG signals [2].

It has been shown that the correlation based attention marker fluctuates strongly for trial durations of the order of seconds, resulting in an unreliable estimate of attention [16] [31]. Hence, we decided to use the magnitude of the N1_TRF − P2_TRF peak of the estimated TRFs as the attention marker. N1_TRF is the negative peak that occurs around 100 ms latency and P2_TRF is the positive peak that occurs around 200 ms latency, and they are known to modulate the attentional effect [3] [7] [20]. In our analysis, the 75 ms - 135 ms latency range was chosen as the time frame to search for N1_TRF and the 175 ms - 250 ms latency range was chosen as the time frame to search for P2_TRF. If there was no negative peak present in the 75 ms - 135 ms time frame, N1_TRF was set to zero. Similarly, if there was no positive peak present in the 175 ms - 250 ms time frame, P2_TRF was set to zero.
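A sketch of the marker extraction is shown below; the TRF is assumed to be sampled at 64 Hz starting at 0 ms lag, and taking the window minimum/maximum as the peak (with a fallback to zero when no peak of the expected sign is present) is a simplification of the peak search described above.

```python
import numpy as np

def n1_p2_marker(trf, fs=64, n1_win=(0.075, 0.135), p2_win=(0.175, 0.250)):
    """Magnitude of the N1_TRF - P2_TRF peak used as the attention marker."""
    lags = np.arange(len(trf)) / fs
    n1_seg = trf[(lags >= n1_win[0]) & (lags <= n1_win[1])]
    p2_seg = trf[(lags >= p2_win[0]) & (lags <= p2_win[1])]
    n1 = min(n1_seg.min(), 0.0)   # negative peak around 100 ms, else 0
    p2 = max(p2_seg.max(), 0.0)   # positive peak around 200 ms, else 0
    return abs(n1 - p2)
```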
C. Classification using SVM
For a linearly separable binary classification problem, SVMs aim to find a unique separating hyperplane that maximizes the margin between the two classes. Searching for the best separating hyperplane describes an optimization problem that can be solved efficiently using existing convex optimization algorithms. Furthermore, it can be extended to linearly non-separable classes by adding slack variables to the optimization problem as an additional constraint. Let the separating hyperplane that describes the decision boundary be defined as a first order affine function f(x) such that

f(x) = a^T x + b.   (14)

It can be solved as [32]

a = Σ_{i=1}^{m} α_i y^(i) x^(i),   (15)

where α_i is a Lagrangian multiplier and x^(i) and y^(i) are the feature vector and the class label of the training samples, respectively. Substituting (15) into (14), the decision boundary can be rewritten as

f(x) = Σ_{i=1}^{m} α_i y^(i) ⟨x^(i), x⟩ + b.   (16)

The Lagrangian multiplier α_i is non-zero only for those x^(i) and y^(i) that act as support vectors [33]. The inner product in (16) is commonly referred to as the kernel function, and kernel functions help SVMs map the input vector to higher dimensions. There are many choices for the kernel function, such as the linear, polynomial or Gaussian kernel. In our work, we used a linear kernel function to train the SVM. Fitting a sigmoid function on the SVM output enables us to obtain posterior probabilities [34]. In this work, SVM training was performed using the fitcsvm API and logistic regression training was performed using the fitposterior API provided in the MATLAB Statistics and Machine Learning Toolbox.
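The classification step can be sketched with scikit-learn as follows, mirroring the fitcsvm/fitposterior sequence used in MATLAB; X holds one row per trial with the two speakers' markers as features and y the attended-speaker label (names and shapes are illustrative, not from the paper).

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

def train_attention_classifier(X, y):
    """Linear-kernel SVM followed by a sigmoid (logistic) fit on its scores."""
    svm = SVC(kernel="linear").fit(X, y)
    scores = svm.decision_function(X).reshape(-1, 1)
    sigmoid = LogisticRegression().fit(scores, y)   # Platt-style calibration
    return svm, sigmoid

def attention_probability(svm, sigmoid, X_new):
    """Probabilistic confidence that speaker 1 is attended (class label 1)."""
    scores = svm.decision_function(X_new).reshape(-1, 1)
    return sigmoid.predict_proba(scores)[:, 1]
```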
D. Estimating Model Parameters
As described in section II-B, our dual-speaker EEG experiment consisted of six segments (A-F), each being five minutes long. During the first four segments, subjects focused their attention on the same speaker and during the remaining two segments, subjects switched their attention to the other speaker. In order to calculate θ̂[−1] and M[−1], which are needed to solve (5)-(7), we split the EEG data into three blocks as shown in Fig. 3. The first block consisted of segments A and B and it was used to calculate the initialization parameters θ̂[−1] and M[−1]. The second block consisted of segments C and E and it was used to train the SVM classifier. Finally, the third block consisted of segments D and F and it was used to validate the performance of the trained SVM classifier. As the second and the third blocks comprised attention to both speakers, we were able to test the ability of the algorithm to track an attention switch.

Fig. 3. Data processing sequence. EEG was collected in six separate segments. In the first four segments, subjects maintained their attention to the same speaker. For the fifth and the sixth segments, they were asked to switch their attention to the other speaker. Segments A and B were used to calculate the initialization parameters, segments C and E were used for the SVM training and segments D and F were used to test the trained SVM classifier.
In [18], the initialization parameters θ̂[−1] and M[−1] were chosen as zero mean and unit variance, respectively. However, we decided to use the first block to calculate θ̂[−1] and M[−1]. We started with the zero mean and unit variance assumption and solved (5)-(7) sequentially for every two-second trial. Once all the trials were completed, the θ̂ and M estimated in the last trial of the first block were used to initialize θ̂[−1] and M[−1] for the second block. It must be noted that, as there were two speakers present, we obtained two sets of θ̂[−1] and M[−1] in the last trial of the first block. Hence, we took their average as the initial value. Alternatively, if EEG data corresponding to single-speaker attention is available, it could be used to perform this initialization. In the second block, as part of the SVM training, the TRFs corresponding to the two speakers were estimated and their N1_TRF − P2_TRF peaks were extracted. The N1_TRF − P2_TRF peak served as the attention marker and it was labeled with the class labels corresponding to the attended and the unattended speaker. Finally, these labeled attention markers were used as feature vectors to compute the SVM decision boundary. In the third block, the TRFs corresponding to the new data were estimated and their N1_TRF − P2_TRF peaks were passed as input to the trained SVM model to classify attention.

To compare the performance of our AAD framework with existing algorithms, attention decoding accuracies were calculated using the LS method via stimulus reconstruction [2] and the state space model [16]. In the case of the LS method, the first and the second blocks were used to estimate the decoder and find the optimal ridge regularization parameter. The third block was used to infer the attention using the estimated decoder. During the decoder estimation, the autocorrelation matrices were averaged across trials to obtain a single final decoder [11] instead of averaging the decoders estimated across multiple trials [2]. We performed two variants of the LS method using stimulus reconstruction: 1) LS 2sec and 2) LS 60sec overlap. For LS 2sec, the trial duration considered to estimate the decoder was two seconds, whereas for LS 60sec overlap, the trial duration considered was 60 seconds but with a sliding window of two seconds, i.e., in every trial, new data of only two seconds were added.

In the case of the state space model, the first and second blocks were used to estimate the decoder and extract the attention markers for every two-second trial. Once the attention markers were extracted, they were used to fit a log-normal distribution and the mean and the variance of the log-normal distribution were used to train the state space model. Finally, the third block was used to infer the attention using the trained state space model. The algorithm proposes two different attention markers, namely the Pearson correlation coefficients of the reconstructed stimuli and the L1 norm of the estimated decoders. In our comparative analysis, the L1 norm of the estimated decoder was used as the attention marker, as it has been shown to have superior performance over the Pearson correlation coefficient in the original paper. All other hyperparameters were initialized to their default values. More details of the algorithm can be found in [16].
IV. RESULTS
A. TRF Estimation: Sequential LMMSE vs LS
In Fig. 4, the attended TRF (left) and the unattended TRF (right) estimated using the sequential LMMSE algorithm for a particular subject are shown. The TRFs were estimated in every trial and for a filter lag of 24 samples, which corresponded to 375 ms at a 64 Hz sampling rate. For the same subject, the TRFs estimated using the LS algorithm are shown in Fig. 5. It can be observed that for the LS algorithm, the TRFs estimated across trials are noisy with high variance. For all subjects considered, the means of the standard deviations of the TRFs estimated using the sequential LMMSE and LS algorithms are displayed in Fig. 6. The mean value was calculated by averaging the standard deviations across the individual time lags per subject. It is evident that the sequential LMMSE algorithm resulted in a lower standard deviation compared to the LS algorithm (p_att, p_unatt < .001 based on a paired Wilcoxon signed-rank test).

Fig. 4. Sequential LMMSE estimate. Upper row: attended TRF (left) and unattended TRF (right) estimated at the Cz electrode from a representative subject. Each coloured line indicates the TRF obtained from a separate trial of two seconds duration. Lower row: mean ± std dev across trials.

Fig. 5. LS estimate. Upper row: attended TRF (left) and unattended TRF (right) estimated at the Cz electrode from the same subject as in Fig. 4. Each coloured line indicates the TRF obtained from a separate trial of two seconds duration. Lower row: mean ± std dev across trials.

Fig. 6. Comparison of the mean standard deviation of the attended and the unattended TRFs estimated using the sequential LMMSE and LS algorithms. Each dot on the boxplot represents an individual subject.
B. Attended vs Unattended TRF
The TRFs obtained from all subjects, with the attended TRF (A), the unattended TRF (B) and the difference in magnitude between them (C), are summarized in Fig. 7. The upper panel displays the TRFs of the individual subjects and the lower panel displays their mean ± standard deviation. From Fig. 7A and Fig. 7B, it is evident that there is a negative peak at around 100 ms (N1_TRF) and a positive peak at around 200 ms (P2_TRF). However, the magnitudes of N1_TRF and P2_TRF are higher for the attended TRF compared to the unattended TRF. The difference between the attended TRF and the unattended TRF (Fig. 7C) is negative for N1_TRF (p < .01) and positive for P2_TRF (p < . ), i.e., the unattended TRF has a smaller amplitude than the attended TRF (p < . ).

Fig. 7. Plots depicting the attended TRF (A), the unattended TRF (B) and the TRF difference (C) estimated at the Cz electrode for every subject. The upper row corresponds to the mean TRF of each individual subject and the lower row corresponds to the mean ± std dev across subjects. Each individual TRF on the upper row was obtained as an average of the TRFs across different trials.

Fig. 8. Decoding of the selective auditory attention of two representative subjects. For subject 1 (left panel), the initial attention was to speaker 1 and for subject 2 (right panel), the initial attention was to speaker 2. A) Dynamic evolution of the attention markers (N1_TRF − P2_TRF at the Cz electrode) corresponding to the two speakers. A new attention marker was extracted for every trial of two seconds. B) Probability of attention to speaker 1 and speaker 2. The dotted line indicates the true attention to speaker 1. The attention switch was not a continuous process; instead it was obtained by concatenating two EEG segments with opposite attention (refer to section III-D). The attention probability was calculated by passing the attention markers to an SVM followed by a logistic regression. The attention probability is dependent on the width of the SVM decision boundary that is determined by the spread of the support vectors. C) The left scatter plot depicts the linear decision boundary separating the attention markers. The abscissa (spkr1 marker) corresponds to the N1_TRF − P2_TRF peak extracted from the TRF of speaker 1 and the ordinate (spkr2 marker) corresponds to the N1_TRF − P2_TRF peak extracted from the TRF of speaker 2. Each shaded circular dot is a 2D vector with its first component being the spkr1 marker and its second component being the spkr2 marker. On average, when attending to speaker 1, the first component should be larger than the second component and vice versa. The unshaded black circles on the left scatter plot depict the support vectors that define the decision boundary. The right scatter plot shows the attention markers obtained during the testing procedure, with black circles indicating misclassifications.

C. Decoding the Selective Attention
The decoding of the selective attention for two representative subjects is shown in Fig. 8. For subject 1, the initial attention was to speaker 1 and the final attention was to speaker 2, whereas for subject 2, it was vice versa. For subject 1, the decoding accuracy was 89.5% and for subject 2, the decoding accuracy was 62.8%. The median decoding accuracy across all subjects was 79.8%, which was above the statistical significance level. The statistical significance level for a trial duration of two seconds was found to be 54.7% based on the binomial distribution at 95% confidence. Fig. 8A displays the progression of the attention markers (N1_TRF − P2_TRF) corresponding to the two speakers present. Fig. 8B shows the probability of attention to speaker 1, which was obtained by passing the attention markers (Fig. 8A) into the trained SVM classifier (Fig. 8C). The probabilistic output gives a confidence in our attention inference and it is proportional to how far the attention markers are from the decision boundary. Regarding the multiple choice questionnaire, on average, subjects answered 84.7 ± (r = − ).
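For reference, the chance-level threshold can be reproduced from the binomial distribution; the assumption that the two test segments (about ten minutes) yield roughly 300 two-second trials is ours and is not stated explicitly above.

```python
from scipy.stats import binom

n_trials = 300                      # assumed number of two-second test trials
threshold = binom.ppf(0.95, n_trials, 0.5) / n_trials
print(f"significance level ~ {threshold:.1%}")   # ~54.7 % for 300 trials
```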
D. AAD Accuracy: Comparison with Existing Algorithms

The median decoding accuracy achieved at the Cz electrode for the case of LS 2sec was 53.96%, which was lower than the statistical significance level for a trial duration of two seconds (Fig. 10). For LS 60sec overlap, the median decoding accuracy at the Cz electrode improved to 60.3%. Here the trial duration considered was 60 seconds with a sliding window of two seconds. For the state space model, the median decoding accuracy using the same settings as LS 2sec was found to be 71.7%. In addition, all the aforementioned algorithms were tested using a combination of the Cz + left mastoid electrodes, which was the same electrode combination used to estimate the TRFs with sequential LMMSE. However, the accuracies were found to be slightly lower (not shown here) than the accuracies obtained using only the Cz electrode.

Fig. 9. Decoding accuracy vs time taken to detect the attention switch for every subject. Each * represents the result from an individual subject.

Fig. 10. Comparison of the decoding accuracies obtained at the Cz electrode using the LS, state space and sequential LMMSE algorithms. The trial duration used was two seconds except for LS 60sec overlap.
E. Spatial Distribution of TRFs
The grand average TRFs estimated at different electrode locations referenced to the mastoid electrode are shown in Fig. 11. They were obtained by averaging the TRFs across the individual subjects. As observed before, the amplitudes of N1_TRF and P2_TRF corresponding to the attended TRF are larger than those of the unattended TRF. The magnitude of the N1_TRF − P2_TRF peak decreases as we move from the vertex regions to the temporal regions, reminiscent of the SNR distribution of the AEPs (Fig. 1). The decoding accuracies obtained at different scalp locations are shown in Fig. 12. The highest decoding accuracy was obtained at the Cz electrode (79.8 ± ).

F. Accuracy Improvement with SVM
An SVM was used in our framework to classify the selective attention based on the N1_TRF − P2_TRF peak of the estimated TRFs. To quantify the improvement due to the introduction of the SVM, decoding accuracies were also calculated by thresholding the N1_TRF − P2_TRF peak, i.e., the speech envelope that generated the TRF with the larger N1_TRF − P2_TRF peak was flagged as the attended speech. Fig. 12 compares the decoding accuracies obtained using the SVM and thresholding. At every electrode location evaluated, the accuracy was higher using the SVM classifier than using thresholding. The mean of the accuracies obtained at different scalp locations using the SVM was 74.5% and that using thresholding was 64.36%.
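The thresholding baseline amounts to picking, in every trial, the speaker whose TRF produced the larger N1_TRF − P2_TRF peak; a minimal sketch with illustrative array names is:

```python
import numpy as np

# marker_spk1, marker_spk2: per-trial N1-P2 markers for speakers 1 and 2
def decode_by_thresholding(marker_spk1, marker_spk2):
    """Attended-speaker label per trial without any trained classifier."""
    return np.where(marker_spk1 >= marker_spk2, 1, 2)
```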
V. DISCUSSION
EEG analyses have traditionally been used to investigate the sensory responses of the human brain. In the auditory domain, AEPs, which are responses evoked when stimulated by an acoustic signal, are widely employed [29]. Due to the presence of substantial background noise, ensemble averaging must be performed to obtain AEPs that are visually interpretable. In other words, single trial analysis of EEG was a challenge until recently, when it was demonstrated that the auditory attention in a dual-speaker scenario could be inferred without ensemble averaging using LS estimation [2]. One of the limitations of the LS estimation was that the trial duration required to infer attention was of the order of minutes, but a listener could switch the attention multiple times within time scales of this order. Several improvements were proposed by which the trial duration needed to infer attention was considerably reduced [16] [35] [36].

Fig. 11. Spatial distribution of the attended TRF (blue) and the unattended TRF (red) estimated at different scalp locations. The TRF displayed at each location is the grand average obtained by calculating the mean across subjects. The reference electrode was placed at the right mastoid.

Fig. 12. Comparison between the attention decoding accuracies obtained using the SVM classifier and thresholding at selected scalp locations.

Fig. 13. Mean correlation coefficient between the EEG signals (five minutes) measured at different electrode locations. The mean was obtained by averaging the correlation coefficients across the twenty-seven subjects.

In this study, we have proposed a framework to decode the selective auditory attention of a listener. The efficacy of the algorithm was verified using EEG data collected from 27 subjects who undertook a selective auditory attention experiment. The proposed framework consists of three modules. In the first module, a novel TRF estimation algorithm is presented which is based on the LMMSE estimator. MMSE estimators are known to perform better than LS estimators at low SNR and their performance converges as the SNR improves. The improvement stems from the fact that the MMSE estimators make use of the second order statistics (covariance) of the noise present in the signal. To calculate the noise covariance matrix, we can use signals measured at locations closest or contralateral to the reference electrode. Another potential way of estimating the noise at a particular electrode is to measure the signal at that electrode before the stimulus is presented [4]. This is appropriate for experiments with non-continuous speech, such as a speech-in-noise test, where the stimuli are presented with periodic pauses. When the entire signals are not available a priori, the sequential LMMSE estimator can be used to estimate the TRF. In sequential estimation, we assume that the TRF in the previous trial has an influence on the TRF in the current trial. The contribution of the previous estimate to the current estimate is controlled by the gain factor.

In the second module, a marker related to the attention of the listener is extracted. Amplitude peaks of the TRF around 100 ms (N1_TRF) and 200 ms (P2_TRF) are known to modulate selective attention [3] [7] [20]. Hence, we used the N1_TRF − P2_TRF peak as the attention marker in our framework. Once the markers were extracted, they were passed to an SVM classifier followed by a logistic regression to obtain a probabilistic measure of the attention. Instead of using an SVM, the attention markers could be directly passed to the logistic regression, but SVMs have the inherent ability to correct for errors if the feature vector falls inside the decision boundary. As an alternative to SVMs, random forest or neural network based classifiers could be used, provided there is sufficient data available to train them.
Two commonly used linear system identification methods instatistical estimation theory are LS and LMMSE estimators.In LS estimator, the squared error is minimized withouthaving any assumptions on the stochastic property whereasin LMMSE estimator, the mean squared error is minimizedassuming a Gaussian PDF and taking the expectation overthe assumed PDF. Algorithms based on LMMSE exploitthe knowledge of noise covariance and in this work, wereformulated the covariance matrix of received EEG signal as acombination of the covariance matrix of the speech envelopeand the covariance matrix of the noise signal. I.e., we usedsome prior knowledge about the covariance matrix and if thisprior knowledge can be estimated with a sufficient accuracy,LMMSE algorithm outperforms LS algorithm [27]. This per-formance improvement can be observed when comparing Fig.4 and Fig. 5. The standard deviation (and variance) of the TRFsestimated using the LS estimator was found to be significantlylarger than that of using the sequential LMMSE estimator(Fig. 6). However, the performance of the LS estimator shouldimprove with increasing trial duration as it is known that theSNR of the LS estimate is directly proportional to the lengthof the trial [37]. As a result, it is only when the length of theobservation vector (EEG) significantly exceeds the length ofthe estimation vector (TRF), the LS estimate is reliable.
B. Attended vs Unattended TRF
Analysis of the filter coefficients of the attended and unat-tended TRFs suggest a top down processing of the auditory information. To be precise, a larger N1
TRF and P2
TRF for theTRF estimated using the attended speech envelope suggeststhat the brain might be suppressing the contents of the unat-tended speech from being encoded in the working memory.These results suggest that the LMMSE based TRF estimationcould replicate the results obtained in previous studies [3][7]. Responses beyond 100 ms are attributed to phonemeprocessing and suppression of the unattended TRF at theselatencies means that the high level features corresponding tothe unattended speaker are not encoded [8] [38]. For initiallatencies below 50 ms, a large positive amplitude was observedfor both attended and unattended TRFs. This is intriguingbecause previous studies have not reported large amplitudesat these latencies. A potential reason is the short trial durationthat was used to estimate the TRF. I.e., as the stimuli inour experiment were continuous news segments that lastedaround five minutes and as we were analyzing short trials oftwo seconds, subjects had an expectation about the incomingacoustic stimuli. These expectations may be getting reflectedas early responses that are largely identical for both theattended and the unattended speaker. We anticipate that theseearly responses should diminish for non-continuous stimulussuch as the one used in speech-in-noise tests.In our study, analyses were limited to dual-speaker scenariobut in real life, there could be more interfering speakerspresent. It has already been shown that the selective attentioncould be inferred in a four-speaker environment using the LSmethod [39]. TRF estimation using sequential LMMSE can beextended to multispeaker scenarios such as the four-speakerexperiment to evaluate the behaviour of TRFs correspondingto each of the interfering speakers.
C. Attention Decoding Accuracy
The most important improvement introduced by the sequen-tial LMMSE algorithm is its ability to generate interpretableTRFs from short duration trials. This has enabled us to moveaway from the conventional correlation based attention markerto the N1
TRF − P2 TRF peak based attention marker. Correlationbased attention marker falls under the category of regressiontasks because we have to reconstruct the EEG signals ini-tially to calculate the Pearson correlation coefficient. However,inferring attention directly based on the N1
TRF − P2 TRF peakfalls under the category of classification tasks. It is knownthat regression tasks are more challenging than classificationtasks and we expect an improvement in performance usingN1
TRF − P2 TRF peak instead of Pearson correlation coefficient.The decoding accuracy obtained at Cz electrode for a trialduration of two seconds using LS estimator was found to be53.96%. An analysis of overlapping trials with a duration of 60seconds but using a sliding window of two seconds improvedthe accuracy to 60.3%. On the other hand, the decodingaccuracy obtained using our AAD framework was 79.8% (Fig.10). The lower decoding accuracy using LS estimator can beattributed to its dependency on the number of electrodes andthe trial duration. I.e., as the trial duration and the numberof electrodes reduce, decoding accuracy deteriorates [10][12][17]. An attention decoding framework based on the statepace algorithm reported comparable accuracies to that of ourframework. However, adaptation of the same algorithm on ourdataset yielded an accuracy of 72.7% which is lower than theaccuracy reported in [16]. The lower accuracy may be due tothe choice of hyperparameters in our adaptation because weoptimized only the LASSO regularization parameter and theparameters of the log-normal distribution. All other hyperpa-rameters were initialized with the default values suggested in[16]. The trial duration was chosen as two seconds becauseprevious studies have shown that the human brain can holdupto two seconds of independent auditory information inthe working memory [24]. In future work, the trial durationcould be chosen as the listener’s memory span and it can beestimated with the help of digit span tests.Despite our decoding framework being designed to detectthe attention at a time resolution of two seconds, the medianattention switch duration was found to be 52.4 seconds whichwas larger than our expectation. There are two possible ex-planations for the delayed detection of attention switch. First,our experiment was designed in such a way that there wasno dynamic attention switch present. Instead, the attentionswitch was synthesized by concatenating two EEG segmentswith opposite attention. Hence, it is possible that the subjectrequired certain amount of time to focus the attention on aparticular speaker at the start of the experiment. Second, dueto the sequential nature of estimating the current TRF basedon previous estimates, the algorithm may have introduced adelay by itself which is reflected in the longer switch duration.To disentangle the aforementioned possibilities, the algorithmmust be further evaluated in scenarios where the subject hasthe flexibility to switch attention dynamically [40].
D. Spatial Distribution of TRF
Comparison of the TRFs obtained at different scalp locations shows that the frontal and central locations generate identical TRFs. This is not trivial considering the fact that the EEG signals measured at these locations are highly correlated (Fig. 13). For our choice of the mastoid reference location, the magnitudes of the N1_TRF − P2_TRF peaks are largest at the central and the frontal regions and smallest at the temporal regions. On the contrary, if the reference electrode had been chosen as the Cz electrode, the polarity of the TRF would be inverted and the temporal regions would have the largest N1_TRF − P2_TRF peak. Despite the smaller peaks, a clear difference between the attended and the unattended TRFs could be observed at all locations. The decoding accuracies obtained at the temporal electrodes were not significantly lower than the decoding accuracies obtained at the central electrodes. This could pave the way to potentially integrate these algorithms into a neuro-steered hearing aid. Furthermore, these results suggest that we may not require high-density EEG measurements with multiple electrodes in paradigms such as selective auditory attention.

One of the assumptions that we have made throughout this paper is the availability of the clean speech envelopes to estimate the TRFs. In practice, only noisy mixtures are available and the speech sources must be separated before the envelope can be extracted. This is an active research field and algorithms are already available based on classical signal processing such as beamforming or based on deep neural networks [41]. Another limitation of the sequential LMMSE algorithm in its current form is that it can be used to estimate the auditory system response only in the forward direction (speech to EEG). This is because the noise covariance matrix that we have made use of is available only in the forward direction.

VI. CONCLUSION
In this paper, we have proposed a method to estimate the TRF from EEG signals using a sequential LMMSE estimator. Unlike the LS estimator, the sequential LMMSE estimator is capable of generating reliable and interpretable TRFs from short duration trials. Analysis of the properties of the TRFs in the dual-speaker scenario has revealed a locus of attention around 100 ms (N1_TRF) and 200 ms (P2_TRF), i.e., the TRFs corresponding to the attended speaker have a larger N1_TRF − P2_TRF peak than the TRFs corresponding to the unattended speaker. Using sequential LMMSE as the major building block, we developed a novel AAD framework to detect the auditory attention of a listener at a time resolution of two seconds. In the proposed AAD framework, only two electrodes (data measurement and noise measurement) in addition to the reference and the ground electrodes are sufficient to achieve a reliable decoding accuracy. Comparison of the results obtained at different scalp electrodes revealed that the AAD accuracies are in a similar range across the electrodes.

Although the AAD framework was designed to detect the attention from EEG trials of the order of seconds, it was observed that the time taken to detect an attention switch was longer than a single trial duration. Hence, further investigation is required to understand the fundamental reason behind this and to reduce the attention switch duration.

ACKNOWLEDGMENT
We convey our gratitude to all participants who took part in the study and would like to thank the student Laura Rupprecht who helped us with data acquisition.
REFERENCES

[1] A. J. Power, J. J. Foxe, E.-J. Forde, R. B. Reilly, and E. C. Lalor, "At what time is the cocktail party? A late locus of selective attention to natural speech," European Journal of Neuroscience, vol. 35, no. 9, pp. 1497-1503, 2012.
[2] J. A. O'Sullivan, A. J. Power, N. Mesgarani, S. Rajaram, J. J. Foxe, B. G. Shinn-Cunningham, M. Slaney, S. A. Shamma, and E. C. Lalor, "Attentional Selection in a Cocktail Party Environment can be Decoded from Single-Trial EEG," Cerebral Cortex, vol. 25, no. 7, pp. 1697-1706, 2014.
[3] L. Fiedler, M. Wöstmann, S. K. Herbst, and J. Obleser, "Late cortical tracking of ignored speech facilitates neural selectivity in acoustically challenging conditions," NeuroImage, vol. 186, pp. 33-42, 2019.
[4] N. Das, J. Vanthornhout, T. Francart, and A. Bertrand, "Stimulus-aware spatial filtering for single-trial neural response and temporal response function estimation in high-density EEG with applications in auditory research," NeuroImage, vol. 204, p. 116211, 2020.
[5] S. J. Aiken and T. W. Picton, "Human Cortical Responses to the Speech Envelope," Ear and Hearing, vol. 29, no. 2, pp. 139-157, 2008.
[6] E. C. Lalor and J. J. Foxe, "Neural responses to uninterrupted natural speech can be extracted with precise temporal resolution," European Journal of Neuroscience, vol. 31, no. 1, pp. 189-193, 2010.
[7] N. Ding and J. Z. Simon, "Neural coding of continuous speech in auditory cortex during monaural and dichotic listening," Journal of Neurophysiology, vol. 107, no. 1, pp. 78-89, 2012.
[8] G. M. Di Liberto, J. A. O'Sullivan, and E. C. Lalor, "Low-Frequency Cortical Entrainment to Speech Reflects Phoneme-Level Processing," Current Biology, vol. 25, no. 19, pp. 2457-2465, 2015.
[9] N. Mesgarani, S. V. David, J. B. Fritz, and S. A. Shamma, "Influence of Context and Behavior on Stimulus Reconstruction From Neural Activity in Primary Auditory Cortex," Journal of Neurophysiology, vol. 102, no. 6, pp. 3329-3339, 2009.
[10] B. Mirkovic, S. Debener, M. Jaeger, and M. De Vos, "Decoding the attended speech stream with multi-channel EEG: Implications for online, daily-life applications," Journal of Neural Engineering, vol. 12, no. 4, p. 046007, 2015.
[11] W. Biesmans, N. Das, T. Francart, and A. Bertrand, "Auditory-Inspired Speech Envelope Extraction Methods for Improved EEG-Based Auditory Attention Detection in a Cocktail Party Scenario," IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 25, no. 5, pp. 402-412, 2017.
[12] S. A. Fuglsang, T. Dau, and J. Hjortkjær, "Noise-robust cortical tracking of attended speech in real-world acoustic scenes," NeuroImage, vol. 156, pp. 435-444, 2017.
[13] L. Fiedler, M. Woestmann, C. Graversen, A. Brandmeyer, T. Lunner, and J. Obleser, "Single-channel in-ear-EEG detects the focus of auditory attention to concurrent tone streams and mixed speech," Journal of Neural Engineering, vol. 14, no. 3, p. 036020, 2017.
[14] D. D. Wong, S. A. Fuglsang, J. Hjortkjær, E. Ceolini, M. Slaney, and A. de Cheveigné, "A Comparison of Regularization Methods in Forward and Backward Models for Auditory Attention Decoding," Frontiers in Neuroscience, vol. 12, p. 531, 2018.
[15] S. Haufe, F. Meinecke, K. Görgen, S. Dähne, J.-D. Haynes, B. Blankertz, and F. Bießmann, "On the interpretation of weight vectors of linear models in multivariate neuroimaging," NeuroImage, vol. 87, pp. 96-110, 2014.
[16] S. Miran, S. Akram, A. Sheikhattar, J. Z. Simon, T. Zhang, and B. Babadi, "Real-Time Tracking of Selective Auditory Attention From M/EEG: A Bayesian Filtering Approach," Frontiers in Neuroscience, vol. 12, p. 262, 2018.
[17] A. Mundanad Narayanan and A. Bertrand, "Analysis of Miniaturization Effects and Channel Selection Strategies for EEG Sensor Networks With Application to Auditory Attention Detection," IEEE Transactions on Biomedical Engineering, vol. 67, no. 1, pp. 234-244, 2019.
[18] I. Kuruvila, E. Fischer, and U. Hoppe, "An LMMSE-based Estimation of Temporal Response Function in Auditory Attention Decoding," in . IEEE, 2020, pp. 2837-2840.
[19] E. Alickovic, T. Lunner, F. Gustafsson, and L. Ljung, "A Tutorial on Auditory Attention Identification Methods," Frontiers in Neuroscience, vol. 13, 2019.
[20] S. Akram, J. Z. Simon, and B. Babadi, "Dynamic Estimation of the Auditory Temporal Response Function from MEG in Competing-Speaker Environments," IEEE Transactions on Biomedical Engineering, vol. 64, no. 8, pp. 1896-1905, 2016.
[21] A. Presacco, J. Z. Simon, and S. Anderson, "Speech-in-noise representation in the aging midbrain and cortex: Effects of hearing loss," PloS One, vol. 14, no. 3, 2019.
[22] L. Decruy, J. Vanthornhout, and T. Francart, "Hearing impairment is associated with enhanced neural tracking of the speech envelope," Hearing Research, p. 107961, 2020.
[23] S. A. Fuglsang, J. Märcher-Rørsted, T. Dau, and J. Hjortkjær, "Effects of Sensorineural Hearing Loss on Cortical Synchronization to Competing Speech during Selective Attention," Journal of Neuroscience, vol. 40, no. 12, pp. 2562-2572, 2020.
[24] A. D. Baddeley, N. Thomson, and M. Buchanan, "Word length and the structure of short-term memory," Journal of Verbal Learning and Verbal Behavior, vol. 14, no. 6, pp. 575-589, 1975.
[25] M. J. Crosse, G. M. Di Liberto, A. Bednar, and E. C. Lalor, "The Multivariate Temporal Response Function (mTRF) Toolbox: A MATLAB Toolbox for Relating Neural Signals to Continuous Stimuli," Frontiers in Human Neuroscience, vol. 10, p. 604, 2016.
[26] J. M. Mendel, Lessons in Estimation Theory for Signal Processing, Communications and Control. Pearson Education, 1995.
[27] S. M. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory. Prentice Hall, 1997.
[28] J. R. Wolpaw and C. C. Wood, "Scalp distribution of human auditory evoked potentials. I. Evaluation of reference electrode sites," Electroencephalography and Clinical Neurophysiology, vol. 54, no. 1, pp. 15-24, 1982.
[29] R. F. Burkard, J. J. Eggermont, and M. Don, Auditory Evoked Potentials: Basic Principles and Clinical Application. Lippincott Williams & Wilkins, 2007.
[30] U. Hoppe, S. Weiss, R. W. Stewart, and U. Eysholdt, "An Automatic Sequential Recognition Method for Cortical Auditory Evoked Potentials," IEEE Transactions on Biomedical Engineering, vol. 48, no. 2, pp. 154-164, 2001.
[31] S. Geirnaert, T. Francart, and A. Bertrand, "An Interpretable Performance Metric for Auditory Attention Decoding Algorithms in a Context of Neuro-Steered Gain Control," IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 28, no. 1, pp. 307-317, 2019.
[32] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273-297, 1995.
[33] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning. Springer New York Inc., 2001.
[34] J. Platt et al., "Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods," Advances in Large Margin Classifiers, vol. 10, no. 3, pp. 61-74, 1999.
[35] T. de Taillez, B. Kollmeier, and B. T. Meyer, "Machine learning for decoding listeners' attention from electroencephalography evoked by continuous speech," European Journal of Neuroscience, vol. 51, no. 5, pp. 1234-1241, 2020.
[36] O. Etard, M. Kegler, C. Braiman, A. E. Forte, and T. Reichenbach, "Decoding of selective attention to continuous speech from the human auditory brainstem response," NeuroImage, vol. 200, pp. 1-11, 2019.
[37] L. L. Scharf, Statistical Signal Processing. Addison-Wesley, 1991.
[38] C. Brodbeck, A. Presacco, and J. Z. Simon, "Neural source dynamics of brain responses to continuous stimuli: Speech processing from acoustics to comprehension," NeuroImage, vol. 172, pp. 162-174, 2018.
[39] P. J. Schäfer, F. I. Corona-Strauss, R. Hannemann, S. A. Hillyard, and D. J. Strauss, "Testing the Limits of the Stimulus Reconstruction Approach: Auditory Attention Decoding in a Four-Speaker Free Field Environment," Trends in Hearing, vol. 22, p. 2331216518816600, 2018.
[40] A. Presacco, S. Miran, B. Babadi, and J. Z. Simon, "Real-Time Tracking of Magnetoencephalographic Neuromarkers during a Dynamic Attention-Switching Task," in . IEEE, 2019, pp. 4148-4151.
[41] D. Wang and J. Chen, "Supervised Speech Separation Based on Deep Learning: An Overview,"