Inference of the Selective Auditory Attention using Sequential LMMSE Estimation
Ivine Kuruvila, Kubilay Can Demir, Eghart Fischer, Ulrich Hoppe
Abstract—Attentive listening in a multispeaker environment such as a cocktail party requires suppression of the interfering speakers and the surrounding noise. People with normal hearing perform remarkably well in such situations. Analysis of cortical signals using electroencephalography (EEG) has revealed that the EEG signals track the envelope of the attended speech more strongly than that of the interfering speech. This has enabled the development of algorithms that can decode the selective attention of a listener in controlled experimental settings. However, these algorithms often require long trial durations and computationally expensive calibration to obtain a reliable inference of attention. In this paper, we present a novel framework to decode the attention of a listener within trial durations of the order of two seconds. It comprises three modules: 1) dynamic estimation of the temporal response functions (TRF) in every trial using a sequential linear minimum mean squared error (LMMSE) estimator, 2) extraction of the N1 − P2 peak of the estimated TRF, which serves as a marker related to the attentional state, and 3) a probabilistic measure of the attentional state obtained using a support vector machine followed by a logistic regression. The efficacy of the proposed decoding framework was evaluated using EEG data collected from 27 subjects. The total number of electrodes required to infer the attention was four: one for the signal estimation, one for the noise estimation, and the other two being the reference and the ground electrodes. Our results make further progress towards the realization of neuro-steered hearing aids.
I. INTRODUCTION
The cocktail party effect, or the ability of humans to focus the attention on a certain speaker among multiple speakers, is fundamental to interpersonal communication. In the last decades, the scientific community has actively investigated how the human brain is able to maintain uninterrupted attention in the presence of noise and interference from different speakers. Although the process by which the brain segregates multiple speakers still remains largely unknown, progress is continuously being made by analyzing the cortical signals measured using electrocorticography (ECoG), magnetoencephalography (MEG) and electroencephalography (EEG). In recent years, EEG analyses have come to the forefront of auditory attention research, thanks to their non-invasive nature, minimal effort and high temporal resolution. Several methods have been proposed that model the relationship between multiple speakers present in an auditory scene and the elicited EEG signal [1] [2] [3] [4].
This work was carried out at the ENT clinic, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany and received a grant from WS Audiology. Ivine Kuruvila, Kubilay Can Demir and Ulrich Hoppe are with the Department of Audiology, ENT-Clinic, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany ([email protected]). Eghart Fischer is with WS Audiology, Erlangen, Germany.
The underlying assumption in these methods is that the cortical signals track the acoustic envelope of the attended speaker more strongly than that of the unattended speaker [2] [5]. While the path between the outer ear and the EEG/MEG sensor is non-linear, linear system analysis is often used to analyse the processing of the speech envelope through the auditory path. The system response function that models this pathway is termed the temporal response function (TRF) [6] [7]. Since this mapping is from speech to EEG/MEG sensors in the forward direction, TRFs fall under the category of forward models. TRFs are not limited to speech envelopes but can be used to map a linear relationship between speech spectrograms and the cortical signals, or between phonemes and the cortical signals [7] [8]. In the EEG modality, TRFs have high temporal resolution with peaks around 100 ms and 200 ms that modulate the attentional effect [3]. On the other hand, due to the assumption of linearity, EEG signals can be mapped onto the speech envelope in the backward direction (backward model). The current state-of-the-art system identification method for both forward and backward models is based on least squares (LS) [9], and the identified system can be used to deduce the attention of a listener [2] [10] [11] [12] [13] [14]. However, the system response function calculated using the backward model cannot be physiologically interpreted due to its implicit noise suppression [15]. Therefore, the TRF based on the forward model using EEG signals is the focus of this paper.

Although the aforementioned studies have proven successful in estimating the selective attention reliably, they suffer from two major limitations. First, the temporal resolution required to obtain a reliable auditory attention decoding (AAD) accuracy is of the order of tens of seconds, whereas a listener can switch the attention at much shorter time scales. Second, attention decoding accuracies deteriorate as the number of EEG electrodes used for analysis reduces. A recent study based on the state space model has addressed the first limitation, whereby the temporal resolution was reduced to less than one second [16]. Similarly, the effect of reducing the number of EEG electrodes on the attention decoding accuracy was investigated in [10] [12] [17].

Here, we present a novel framework to infer the selective auditory attention of a listener. The proposed framework consists of three modules. The first module relates to the dynamic estimation of the TRF corresponding to the attended speaker (attended TRF) and the unattended speaker (unattended TRF). It is based on a sequential linear minimum mean squared error (LMMSE) estimator, which is an improvement of the algorithm proposed in [18]. LMMSE algorithms are based on explicitly calculating the covariance of the signal component and subsequently applying Bayesian estimation theory. In the second module, a marker related to the attention of the listener is extracted. A correlation based marker is a commonly used attention marker, where the Pearson correlation coefficients between the original signals and the reconstructed signals are compared [13] [16] [19]. However, it has also been suggested that the amplitude peaks of the attended TRF and the unattended TRF differ around 100 ms (N1_TRF) and 200 ms (P2_TRF) latencies and that these peaks correlate with selective attention [3] [7] [20]. As the sequential LMMSE estimator is capable of generating TRFs with high fidelity from short duration trials, we define the attention marker as the magnitude of the difference between N1_TRF and P2_TRF, known as the N1_TRF − P2_TRF peak. In the third module, we train a support vector machine (SVM) using the attention markers corresponding to the two speakers as features in order to generate a decision boundary and classify attention. Finally, a logistic regression is applied to the SVM's output to obtain a probabilistic confidence of the attentional inference.

The rest of the paper is organized as follows. In section II, the experimental details, information about the speech stimuli and the EEG data collection procedure are provided. In section III, the proposed attention decoding framework is developed and explained in detail. Classification results are presented in section IV and the paper is concluded with a discussion on the algorithm and the results in section V.

II. MATERIALS AND METHODS
A. Participants
Twenty-seven subjects, all native German speakers, took part in the study. Of these, eighteen subjects (mean age = 58 years, range = 24-73 years, nine females, nine males) were hearing impaired and the remaining nine subjects (mean age = 32 years, range = 20-52 years, five females, four males) had normal hearing. Individuals with pure tone audiometric thresholds better than 25 dB HL at octave frequencies from 125 Hz to 8 kHz were considered as normal hearing, whereas individuals with pure tone audiometric thresholds worse than 25 dB HL were considered as hearing impaired. All participants gave their written consent before the start of the experiments and the study was approved by the ethics committee of the University of Erlangen-Nuremberg.
B. Experimental Design
To emulate a cocktail party scenario, two streams of news spoken by two different male speakers were presented to the subject simultaneously. They were presented through two loudspeakers which were connected to a computer through a Fireface UC soundcard (RME, Haimhausen, Germany). The loudspeakers were placed one meter away from the subject at an azimuth of +45° and -45° on the left and on the right, respectively. The total duration of the speech stimuli was 30 minutes and consisted of six segments, each segment being approximately five minutes long. Subjects were asked to focus their attention either on the left speaker or on the right speaker in the first four segments and on the other speaker in the remaining two segments. The initial attention to the left or to the right was chosen randomly across subjects. After every five-minute segment, subjects were given a multiple choice questionnaire related to the contents of the attended stream to motivate them to comply with the task.

Before the start of the EEG measurements, the loudness of the loudspeakers was calibrated to the most comfortable level for each individual subject. This was determined by increasing the loudness of a particular loudspeaker until the subject indicated that the contents were clearly understandable. Subsequently, the loudness of the other loudspeaker was adjusted to the same level. Hearing-impaired subjects were requested to remove their hearing aids before the experiment so as to avoid any non-uniform speech processing introduced by the different hearing aid manufacturers. Although the participants in our experiment consisted of both hearing impaired and normal hearing subjects, it is not within the scope of this study to present a comparative analysis of the two groups. Comprehensive analyses of the effect of hearing loss on cortical tracking during selective auditory attention are provided in [21] [22] [23].

C. Speech Stimulus and EEG Recordings
The speech stimuli used in our selective auditory attention experiment were taken from the slowly spoken news section of the German news website Deutsche Welle. To reduce the subjects' prior knowledge, the news contents were taken from the 2015-16 archive of the website. Each news item was around 60 seconds long and was presented only once to the subject. The sampling rate of the speech signal was 44.1 kHz. The EEG signal was recorded with a Brainamp MR amplifier (Brainproducts, Munich, Germany) using 21 AgCl electrodes placed according to the 10-20 system. The sampling rate of the EEG signal was 2.5 kHz; the reference electrode was placed at the right mastoid and the ground electrode was placed on the left earlobe. As far as possible, the impedance of the electrodes was maintained under 10 kΩ. The speech stimuli and the EEG signals were synchronized using a trigger signal that was sent through the soundcard to the EEG amplifier.

D. Data Analysis
The speech envelope was obtained by taking the absolute value of the Hilbert transform of the speech signal. As many studies on attention decoding have shown that low frequency cortical signals under 10 Hz track the low frequency speech envelope, both the EEG and the speech envelopes were downsampled to 64 Hz [1] [2]. Afterwards, they were bandpass filtered between 1-9 Hz. In the time domain, the trial duration considered for analysis was 2 s throughout this paper unless otherwise stated. A two-second trial duration was chosen because the human brain can process of the order of two seconds of independent auditory information in working memory [24]. EEG signals measured at electrodes whose impedance was larger than 10 kΩ were discarded from further investigation. All analyses were performed using MATLAB, version R2017b (The Mathworks Inc., Natick, Massachusetts).
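As a concrete illustration of this preprocessing chain, a minimal Python/SciPy sketch is given below (the published analyses were performed in MATLAB); the Butterworth filter order and the use of zero-phase filtering are our assumptions, since only the 1-9 Hz passband is specified above.

```python
import numpy as np
from scipy.signal import hilbert, resample_poly, butter, sosfiltfilt

def preprocess(speech, fs_speech, eeg, fs_eeg, fs_out=64, band=(1.0, 9.0)):
    """Envelope extraction and band limiting as described in section II-D."""
    envelope = np.abs(hilbert(speech))                   # Hilbert envelope
    env_64 = resample_poly(envelope, fs_out, fs_speech)  # downsample to 64 Hz
    eeg_64 = resample_poly(eeg, fs_out, fs_eeg)
    # 1-9 Hz band-pass (4th-order Butterworth, zero-phase; assumed settings)
    sos = butter(4, band, btype="bandpass", fs=fs_out, output="sos")
    return sosfiltfilt(sos, env_64), sosfiltfilt(sos, eeg_64)

# Example: a two-second trial corresponds to 128 samples at 64 Hz.
# env, resp = preprocess(speech, 44100, eeg, 2500)
# trials = [(env[i:i + 128], resp[i:i + 128]) for i in range(0, len(resp) - 128, 128)]
```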
III. ATTENTION DECODING FRAMEWORK

Our attention decoding framework consists of three modules. In the first module (section III-A), the response of the auditory system that generated the cortical signals when attending to a certain speaker is estimated. In the second module (section III-B), a marker corresponding to the dynamic attentional state of the listener is extracted. In the third and final module (section III-C), the attention markers are fed as input to an SVM classifier to infer the current attentional state. A logistic regression is then trained on the classifier output to generate soft decisions corresponding to the probability of attention to a certain speaker.
A. TRF Estimation

1) LMMSE: The TRF is the linear system response function that characterizes the auditory pathway inside the brain when a person listens to a certain speaker. It has now been established that the cortical signals track the envelope of speech during attentive listening [5]. Let θ, a p × 1 vector, denote the TRF and let s, an N × 1 vector (N > p), denote the input speech signal envelope. Let r denote the cortical signal measured at an EEG electrode when a system with θ as TRF is stimulated by the input s. Then r can be expressed as

r = s ∗ θ + w = Sθ + w,   (1)

where S is a known matrix of order N × p which is the causal time-lagged version of the input speech envelope, w is an N × 1 noise vector and ∗ denotes the linear convolution operator. The most commonly used algorithm to estimate the system response function in (1) is based on LS estimation, where it is assumed that θ is a deterministic variable [25]. As it may not always be guaranteed to be deterministic, we assume θ to be a random variable with a prior probability density function (PDF) N(μ_θ, C_θθ). The noise vector w is also assumed to be a random variable with a prior PDF N(0, C_w) and independent of θ. This form is known as the Bayesian linear model and the estimator that minimizes the mean squared error is given as [26] [27]

θ̂ = E(θ | r).   (2)

The posterior mean in (2) can be expressed as [18]

θ̂ = E(θ) + C_θr C_rr^{−1} (r − E(r)),   (3)

where E(·) corresponds to the sample mean, C_rr corresponds to the autocovariance of the observation signal and C_θr corresponds to the cross-covariance between the observation and the system response. Equation (3) can be further expanded as [18]

θ̂ = μ_θ + C_θθ S^T (S C_θθ S^T + C_w)^{−1} (r − S μ_θ),   (4)

which gives the LMMSE estimate of θ. When data from multiple electrodes need to be analyzed, (4) must be solved separately for the signals at each electrode.
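A minimal numerical sketch of (4) is given below, assuming that the prior mean μ_θ and the covariances C_θθ and C_w are available; the helper for building the lagged stimulus matrix and all variable names are illustrative and not taken from the paper.

```python
import numpy as np

def lagged_matrix(envelope, n_lags):
    """Causal time-lagged stimulus matrix S (N x p) built from the envelope."""
    N = len(envelope)
    S = np.zeros((N, n_lags))
    for lag in range(n_lags):
        S[lag:, lag] = envelope[:N - lag]
    return S

def lmmse_trf(envelope, r, n_lags, mu_theta, C_theta, C_w):
    """Batch LMMSE estimate of the TRF, eq. (4)."""
    S = lagged_matrix(envelope, n_lags)
    gain = C_theta @ S.T @ np.linalg.inv(S @ C_theta @ S.T + C_w)
    return mu_theta + gain @ (r - S @ mu_theta)
```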
2) Sequential LMMSE:
LMMSE estimation or LS estimation methods assume that the θ̂ estimated in the previous trial does not have any influence on the θ̂ estimated in the current trial. But as auditory attention is a continuous process, the attention during the previous instance may have an influence on the attention at the current instance. In such scenarios, θ̂ at the n-th trial can be estimated sequentially as [27]

Estimator update:
θ̂[n] = θ̂[n−1] + K[n] (r − S[n] θ̂[n−1]),   (5)

where

K[n] = M[n−1] S^T[n] (C_w + S[n] M[n−1] S^T[n])^{−1}.   (6)

Minimum MSE matrix update:
M[n] = (I − K[n] S[n]) M[n−1].   (7)

The gain factor K[n] is a p × N matrix and M[n] is a p × p matrix that corresponds to the covariance of the estimate θ̂ at the n-th trial. A procedure to calculate the initialization parameters θ̂[−1] and M[−1] will be discussed in section III-D.

a) Spatial Noise Covariance Estimation: Since EEG amplifiers are differential amplifiers, they amplify the difference between the signals measured at the data electrodes and the reference electrode. As a result, the differential component reduces as data electrodes are placed closer to the reference electrode. This can be better understood by looking at auditory evoked potentials (AEPs) at different scalp locations [28]. AEPs are a convenient tool to analyze SNR characteristics, as we know the true noiseless signal beforehand due to ensemble averaging. This is not the case in speech tracking, as we are working on single trials and ensemble averaging is not feasible. The peak amplitude of AEPs around 100 ms is termed the N1 peak and it is a measure of the quality of the generated AEP [29]. Given that the reference electrode is placed at the mastoid, the N1 peak is largest at the vertex locations and reduces as we move towards the temporal locations in both directions. Fig. 1 depicts the SNR distribution of the AEPs obtained at different scalp locations. SNRs are usually between -9 dB and -17 dB, and electrodes closer to the reference electrode have low SNR compared to the vertex electrodes [18] [30]. Consequently, we can use the signals at the electrodes closest to the reference electrode to calculate the covariance matrix of the noise that is required to solve (4) or (6). This is under the assumption that the signal to noise characteristics remain similar between the AEP and AAD paradigms, as the stimulus in both scenarios is an acoustic signal.

Fig. 1. Heatplot depicting the SNR (in dB) of the AEPs obtained at different scalp locations when the reference electrode was placed at the right mastoid.

If the EEG signal (w) from only a single electrode is available for noise estimation, then the noise covariance matrix can be calculated as

C_w = σ²(w) I,   (8)

where σ²(·) represents the variance and I is an identity matrix of order N × N. In (8), we assume that the noise is sampled from a set of uncorrelated random variables that are identically distributed. If signals from L > 1 electrodes are available for noise estimation, C_w can be calculated as

C_w = (1 / (L − 1)) Σ_{k=1}^{L} (w_k − E(w_k)) (w_k − E(w_k))^T.   (9)

To ensure the invertibility of the matrix in (4) and (6), the rank of C_w must be at least N.

b) Noise Covariance Estimation in a Multispeaker Scenario: When multiple speakers are present in an auditory scene, the cortical signals measured at an EEG electrode can be influenced by every speaker present. In a dual-speaker scenario, the observed EEG signal r can be written as

r = θ_a ∗ s_a + θ_u ∗ s_u + w,   (10)

where θ_a represents the attended TRF, θ_u represents the unattended TRF, s_a represents the attended speaker envelope and s_u represents the unattended speaker envelope.
θ_a and θ_u can be estimated separately using (5)-(7). However, in doing so, we treat the other speaker as noise in the EEG signal. That means, during the estimation of θ_a, the cortical signal generated by the unattended speaker is considered as interference and is added to the noise, and vice versa. Subsequently, for the estimation of θ_a, (10) is rewritten as

r = θ_a ∗ s_a + w_a,  where  w_a = w + θ_u ∗ s_u,   (11)

and for the estimation of θ_u, (10) is rewritten as

r = θ_u ∗ s_u + w_u,  where  w_u = w + θ_a ∗ s_a.   (12)

Finally, we make an approximation such that

σ²(w_a) ≈ σ²(w_u) = σ²(w).   (13)

The rationale for this approximation is as follows. The component of the EEG that is contributed by the acoustic stimulus is significantly small compared to the background noise. Hence, the error introduced in the noise variance calculation by treating the interference from the other speaker as noise should also be significantly small. This is illustrated for the case of an AEP in Fig. 2.

Fig. 2. Plot showing the true signal (AEP) and the background noise (SNR = -10.8 dB) for a single epoch measured at the Cz electrode from a representative subject. The AEP was obtained by averaging over 100 epochs.
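The trial-by-trial update (5)-(7), together with the single-electrode noise covariance of (8), can be sketched as follows; with two-second trials at 64 Hz, N = 128 samples and, for 24 lags, p = 24. The function and variable names are illustrative only.

```python
import numpy as np

def noise_covariance(w_noise):
    """Eq. (8): white-noise approximation from one electrode close to the
    reference (e.g., the left mastoid when referencing the right mastoid)."""
    return np.var(w_noise) * np.eye(len(w_noise))

def seq_lmmse_update(theta_prev, M_prev, S, r, C_w):
    """One sequential LMMSE trial, eqs. (5)-(7).
    theta_prev: p-vector, M_prev: p x p, S: N x p lagged envelope, r: N-vector EEG."""
    K = M_prev @ S.T @ np.linalg.inv(C_w + S @ M_prev @ S.T)  # gain (6), p x N
    theta = theta_prev + K @ (r - S @ theta_prev)             # estimator update (5)
    M = (np.eye(len(theta_prev)) - K @ S) @ M_prev            # MMSE matrix update (7)
    return theta, M

# In a dual-speaker trial the same update is run twice, once with the lagged
# attended envelope and once with the lagged unattended envelope, using the
# same C_w (approximation (13)).
```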
B. Extraction of Attention Markers
The attention marker is a measure of the reliability of the estimated TRFs in decoding the attentional states. The correlation based attention marker is a commonly used measure, which is calculated as the Pearson correlation coefficient between the original signal and the reconstructed signal. To be precise, once the TRF is estimated, it is used to reconstruct the EEG signal as a linear convolution between the speech envelope and the estimated TRF. As there should be at least two speakers present for selective attention, two separate EEG signals can be reconstructed using the speech envelopes and the estimated TRFs. Then the speaker corresponding to the reconstructed EEG having a larger Pearson correlation coefficient with the original EEG is flagged as the attended speaker. Alternatively, a correlation based attention marker can be calculated using the backward model, where the speech envelope is reconstructed from the EEG signals [2].

It has been shown that the correlation based attention marker fluctuates strongly for trial durations of the order of seconds, resulting in an unreliable estimate of attention [16] [31]. Hence, we decided to use the magnitude of the N1_TRF − P2_TRF peak of the estimated TRFs as the attention marker. N1_TRF is the negative peak that occurs around 100 ms latency and P2_TRF is the positive peak that occurs around 200 ms latency, and they are known to modulate the attentional effect [3] [7] [20]. In our analysis, the 75 ms - 135 ms latency range was chosen as the time frame to search for N1_TRF and the 175 ms - 250 ms latency range was chosen as the time frame to search for P2_TRF. If there was no negative peak present in the 75 ms - 135 ms time frame, N1_TRF was set to zero. Similarly, if there was no positive peak present in the 175 ms - 250 ms time frame, P2_TRF was set to zero.
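A sketch of the marker extraction is shown below; the TRF is assumed to be sampled at 64 Hz starting at 0 ms lag, and taking the window minimum/maximum as the peak (with a fallback to zero when no peak of the expected sign is present) is a simplification of the peak search described above.

```python
import numpy as np

def n1_p2_marker(trf, fs=64, n1_win=(0.075, 0.135), p2_win=(0.175, 0.250)):
    """Magnitude of the N1_TRF - P2_TRF peak used as the attention marker."""
    lags = np.arange(len(trf)) / fs
    n1_seg = trf[(lags >= n1_win[0]) & (lags <= n1_win[1])]
    p2_seg = trf[(lags >= p2_win[0]) & (lags <= p2_win[1])]
    n1 = min(n1_seg.min(), 0.0)   # negative peak around 100 ms, else 0
    p2 = max(p2_seg.max(), 0.0)   # positive peak around 200 ms, else 0
    return abs(n1 - p2)
```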
C. Classification using SVM
For a linearly separable binary classification problem, SVMs aim to find a unique separating hyperplane that maximizes the margin between the two classes. Searching for the best separating hyperplane describes an optimization problem that can be solved efficiently using existing convex optimization algorithms. Furthermore, it can be extended to linearly non-separable classes by adding slack variables to the optimization problem as an additional constraint. Let the separating hyperplane that describes the decision boundary be defined as a first order affine function f(x) such that

f(x) = a^T x + b.   (14)

It can be solved as [32]

a = Σ_{i=1}^{m} α_i y^(i) x^(i),   (15)

where α_i is a Lagrangian multiplier and x^(i) and y^(i) are the feature vector and the class label of the training samples, respectively. Substituting (15) into (14), the decision boundary can be rewritten as

f(x) = Σ_{i=1}^{m} α_i y^(i) ⟨x^(i), x⟩ + b.   (16)

The Lagrangian multiplier α_i is non-zero only for those x^(i) and y^(i) that act as support vectors [33]. The inner product in (16) is commonly referred to as the kernel function, and kernel functions help SVMs map the input vector to higher dimensions. There are many choices for the kernel function, such as the linear, polynomial or Gaussian kernel. In our work, we used a linear kernel function to train the SVM. Fitting a sigmoid function on the SVM output enables us to obtain posterior probabilities [34]. In this work, SVM training was performed using the fitcsvm API and logistic regression training was performed using the fitposterior API provided in the MATLAB Statistics and Machine Learning Toolbox.
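The classification step can be sketched with scikit-learn as follows, mirroring the fitcsvm/fitposterior sequence used in MATLAB; X holds one row per trial with the two speakers' markers as features and y the attended-speaker label (names and shapes are illustrative, not from the paper).

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

def train_attention_classifier(X, y):
    """Linear-kernel SVM followed by a sigmoid (logistic) fit on its scores."""
    svm = SVC(kernel="linear").fit(X, y)
    scores = svm.decision_function(X).reshape(-1, 1)
    sigmoid = LogisticRegression().fit(scores, y)   # Platt-style calibration
    return svm, sigmoid

def attention_probability(svm, sigmoid, X_new):
    """Probabilistic confidence that speaker 1 is attended (class label 1)."""
    scores = svm.decision_function(X_new).reshape(-1, 1)
    return sigmoid.predict_proba(scores)[:, 1]
```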
D. Estimating Model Parameters
As described in section II-B, our dual-speaker EEG experiment consisted of six segments (A-F), each being five minutes long. During the first four segments, subjects focused their attention on the same speaker and during the remaining two segments, subjects switched their attention to the other speaker. In order to calculate θ̂[−1] and M[−1], which are needed to solve (5)-(7), we split the EEG data into three blocks as shown in Fig. 3. The first block consisted of segments A and B and it was used to calculate the initialization parameters θ̂[−1] and M[−1]. The second block consisted of segments C and E and it was used to train the SVM classifier. Finally, the third block consisted of segments D and F and it was used to validate the performance of the trained SVM classifier. As the second and the third blocks comprised attention to both speakers, we were able to test the ability of the algorithm to track an attention switch.

Fig. 3. Data processing sequence. EEG was collected in six separate segments. In the first four segments, subjects maintained their attention to the same speaker. For the fifth and the sixth segments, they were asked to switch their attention to the other speaker. Segments A and B were used to calculate the initialization parameters, segments C and E were used for the SVM training and segments D and F were used to test the trained SVM classifier.
In [18], the initialization parameters θ̂[−1] and M[−1] were chosen as zero mean and unit variance, respectively. However, we decided to use the first block to calculate θ̂[−1] and M[−1]. We started with the zero mean and unit variance assumption and solved (5)-(7) sequentially for every two-second trial. Once all the trials were completed, the θ̂ and M estimated in the last trial of the first block were used to initialize θ̂[−1] and M[−1] for the second block. It must be noted that, as there were two speakers present, we obtained two sets of θ̂[−1] and M[−1] in the last trial of the first block. Hence, we took their average as the initial value. Alternatively, if EEG data corresponding to single-speaker attention is available, it could be used to perform this initialization. In the second block, as part of the SVM training, the TRFs corresponding to the two speakers were estimated and their N1_TRF − P2_TRF peaks were extracted. The N1_TRF − P2_TRF peak served as the attention marker and it was labeled with the class labels corresponding to the attended and the unattended speaker. Finally, these labeled attention markers were used as feature vectors to compute the SVM decision boundary. In the third block, the TRFs corresponding to the new data were estimated and their N1_TRF − P2_TRF peaks were passed as input to the trained SVM model to classify attention.

To compare the performance of our AAD framework with existing algorithms, attention decoding accuracies were calculated using the LS method via stimulus reconstruction [2] and the state space model [16]. In the case of the LS method, the first and the second blocks were used to estimate the decoder and find the optimal ridge regularization parameter. The third block was used to infer the attention using the estimated decoder. During the decoder estimation, the autocorrelation matrices were averaged across trials to obtain a single final decoder [11] instead of averaging the decoders estimated across multiple trials [2]. We performed two variants of the LS method using stimulus reconstruction: 1) LS 2sec and 2) LS 60sec overlap. For LS 2sec, the trial duration considered to estimate the decoder was two seconds, whereas for LS 60sec overlap, the trial duration considered was 60 seconds but with a sliding window of two seconds, i.e., in every trial, new data of only two seconds were added.

In the case of the state space model, the first and second blocks were used to estimate the decoder and extract the attention markers for every two-second trial. Once the attention markers were extracted, they were used to fit a log-normal distribution and the mean and the variance of the log-normal distribution were used to train the state space model. Finally, the third block was used to infer the attention using the trained state space model. The algorithm proposes two different attention markers, namely the Pearson correlation coefficients of the reconstructed stimuli and the L1 norm of the estimated decoders. In our comparative analysis, the L1 norm of the estimated decoder was used as the attention marker, as it has been shown to have superior performance over the Pearson correlation coefficient in the original paper. All other hyperparameters were initialized to their default values. More details of the algorithm can be found in [16].
IV. RESULTS
A. TRF Estimation: Sequential LMMSE vs LS
In Fig. 4, the attended TRF (left) and the unattended TRF (right) estimated using the sequential LMMSE algorithm for a particular subject are shown. The TRFs were estimated in every trial and for a filter lag of 24 samples, which corresponded to 375 ms at a 64 Hz sampling rate. For the same subject, the TRFs estimated using the LS algorithm are shown in Fig. 5. It can be observed that for the LS algorithm, the TRFs estimated across trials are noisy with high variance. For all subjects considered, the means of the standard deviations of the TRFs estimated using the sequential LMMSE and LS algorithms are displayed in Fig. 6. The mean value was calculated by averaging the standard deviations across the individual time lags per subject. It is evident that the sequential LMMSE algorithm resulted in a lower standard deviation compared to the LS algorithm (p_att, p_unatt < .001 based on a paired Wilcoxon signed-rank test).

Fig. 4. Sequential LMMSE estimate. Upper row: attended TRF (left) and unattended TRF (right) estimated at the Cz electrode from a representative subject. Each coloured line indicates the TRF obtained from a separate trial of two seconds duration. Lower row: mean ± std dev across trials.

Fig. 5. LS estimate. Upper row: attended TRF (left) and unattended TRF (right) estimated at the Cz electrode from the same subject as in Fig. 4. Each coloured line indicates the TRF obtained from a separate trial of two seconds duration. Lower row: mean ± std dev across trials.

Fig. 6. Comparison of the mean standard deviation of the attended and the unattended TRFs estimated using the sequential LMMSE and LS algorithms. Each dot on the boxplot represents an individual subject.
B. Attended vs Unattended TRF
The TRFs obtained from all subjects, with the attended TRF (A), the unattended TRF (B) and the difference in magnitude between them (C), are summarized in Fig. 7. The upper panel displays the TRFs of the individual subjects and the lower panel displays their mean ± standard deviation. From Fig. 7A and Fig. 7B, it is evident that there is a negative peak at around 100 ms (N1_TRF) and a positive peak at around 200 ms (P2_TRF). However, the magnitudes of N1_TRF and P2_TRF are higher for the attended TRF compared to the unattended TRF. The difference between the attended TRF and the unattended TRF (Fig. 7C) is negative for N1_TRF (p < .01) and positive for P2_TRF (p < . ), i.e., the unattended TRF has a smaller amplitude than the attended TRF (p < . ).

Fig. 7. Plots depicting the attended TRF (A), the unattended TRF (B) and the TRF difference (C) estimated at the Cz electrode for every subject. The upper row corresponds to the mean TRF of each individual subject and the lower row corresponds to the mean ± std dev across subjects. Each individual TRF on the upper row was obtained as an average of the TRFs across different trials.

Fig. 8. Decoding of the selective auditory attention of two representative subjects. For subject 1 (left panel), the initial attention was to speaker 1 and for subject 2 (right panel), the initial attention was to speaker 2. A) Dynamic evolution of the attention markers (N1_TRF − P2_TRF at the Cz electrode) corresponding to the two speakers. A new attention marker was extracted for every trial of two seconds. B) Probability of attention to speaker 1 and speaker 2. The dotted line indicates the true attention to speaker 1. The attention switch was not a continuous process; instead it was obtained by concatenating two EEG segments with opposite attention (refer to section III-D). The attention probability was calculated by passing the attention markers to an SVM followed by a logistic regression. The attention probability is dependent on the width of the SVM decision boundary that is determined by the spread of the support vectors. C) The left scatter plot depicts the linear decision boundary separating the attention markers. The abscissa (spkr1 marker) corresponds to the N1_TRF − P2_TRF peak extracted from the TRF of speaker 1 and the ordinate (spkr2 marker) corresponds to the N1_TRF − P2_TRF peak extracted from the TRF of speaker 2. Each shaded circular dot is a 2D vector with its first component being the spkr1 marker and its second component being the spkr2 marker. On average, when attending to speaker 1, the first component should be larger than the second component and vice versa. The unshaded black circles on the left scatter plot depict the support vectors that define the decision boundary. The right scatter plot shows the attention markers obtained during the testing procedure, with black circles indicating misclassifications.

C. Decoding the Selective Attention
The decoding of the selective attention for two representative subjects is shown in Fig. 8. For subject 1, the initial attention was to speaker 1 and the final attention was to speaker 2, whereas for subject 2, it was vice versa. For subject 1, the decoding accuracy was 89.5% and for subject 2, the decoding accuracy was 62.8%. The median decoding accuracy across all subjects was 79.8%, which was above the statistical significance level. The statistical significance level for a trial duration of two seconds was found to be 54.7% based on the binomial distribution at 95% confidence. Fig. 8A displays the progression of the attention markers (N1_TRF − P2_TRF) corresponding to the two speakers present. Fig. 8B shows the probability of attention to speaker 1, which was obtained by passing the attention markers (Fig. 8A) into the trained SVM classifier (Fig. 8C). The probabilistic output gives a confidence in our attention inference and it is proportional to how far the attention markers are from the decision boundary. Regarding the multiple choice questionnaire, on average, subjects answered 84.7 ± (r = − ).
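For reference, the chance-level threshold can be reproduced from the binomial distribution; the assumption that the two test segments (about ten minutes) yield roughly 300 two-second trials is ours and is not stated explicitly above.

```python
from scipy.stats import binom

n_trials = 300                      # assumed number of two-second test trials
threshold = binom.ppf(0.95, n_trials, 0.5) / n_trials
print(f"significance level ~ {threshold:.1%}")   # ~54.7 % for 300 trials
```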
D. AAD Accuracy: Comparison with Existing Algorithms

The median decoding accuracy achieved at the Cz electrode for the case of LS 2sec was 53.96%, which was lower than the statistical significance level for a trial duration of two seconds (Fig. 10). For LS 60sec overlap, the median decoding accuracy at the Cz electrode improved to 60.3%. Here the trial duration considered was 60 seconds with a sliding window of two seconds. For the state space model, the median decoding accuracy using the same settings as LS 2sec was found to be 71.7%. In addition, all the aforementioned algorithms were tested using a combination of the Cz + left mastoid electrodes, which was the same electrode combination used to estimate the TRFs with sequential LMMSE. However, the accuracies were found to be slightly lower (not shown here) than the accuracies obtained using only the Cz electrode.

Fig. 9. Decoding accuracy vs time taken to detect the attention switch for every subject. Each * represents the result from an individual subject.

Fig. 10. Comparison of the decoding accuracies obtained at the Cz electrode using the LS, state space and sequential LMMSE algorithms. The trial duration used was two seconds except for LS 60sec overlap.
E. Spatial Distribution of TRFs
The grand average TRFs estimated at different electrode locations referenced to the mastoid electrode are shown in Fig. 11. They were obtained by averaging the TRFs across the individual subjects. As observed before, the amplitudes of N1_TRF and P2_TRF corresponding to the attended TRF are larger than those of the unattended TRF. The magnitude of the N1_TRF − P2_TRF peak decreases as we move from the vertex regions to the temporal regions, reminiscent of the SNR distribution of the AEPs (Fig. 1). The decoding accuracies obtained at different scalp locations are shown in Fig. 12. The highest decoding accuracy was obtained at the Cz electrode (79.8 ± ).

F. Accuracy Improvement with SVM
An SVM was used in our framework to classify the selective attention based on the N1_TRF − P2_TRF peak of the estimated TRFs. To quantify the improvement due to the introduction of the SVM, decoding accuracies were also calculated by thresholding the N1_TRF − P2_TRF peak, i.e., the speech envelope that generated the TRF with the larger N1_TRF − P2_TRF peak was flagged as the attended speech. Fig. 12 compares the decoding accuracies obtained using the SVM and thresholding. At every electrode location evaluated, the accuracy was higher using the SVM classifier than using thresholding. The mean of the accuracies obtained at different scalp locations using the SVM was 74.5% and that using thresholding was 64.36%.
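The thresholding baseline amounts to picking, in every trial, the speaker whose TRF produced the larger N1_TRF − P2_TRF peak; a minimal sketch with illustrative array names is:

```python
import numpy as np

# marker_spk1, marker_spk2: per-trial N1-P2 markers for speakers 1 and 2
def decode_by_thresholding(marker_spk1, marker_spk2):
    """Attended-speaker label per trial without any trained classifier."""
    return np.where(marker_spk1 >= marker_spk2, 1, 2)
```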
V. DISCUSSION
EEG analyses have traditionally been used to investigate the sensory responses of the human brain. In the auditory domain, AEPs, which are responses evoked when stimulated by an acoustic signal, are widely employed [29]. Due to the presence of substantial background noise, ensemble averaging must be performed to obtain AEPs that are visually interpretable. In other words, single trial analysis of EEG was a challenge until recently, when it was demonstrated that the auditory attention in a dual-speaker scenario could be inferred without ensemble averaging using LS estimation [2]. One of the limitations of the LS estimation was that the trial duration required to infer attention was of the order of minutes, but a listener could switch the attention multiple times within time scales of this order. Several improvements were proposed by which the trial duration needed to infer attention was considerably reduced [16] [35] [36].

Fig. 11. Spatial distribution of the attended TRF (blue) and the unattended TRF (red) estimated at different scalp locations. The TRF displayed at each location is the grand average obtained by calculating the mean across subjects. The reference electrode was placed at the right mastoid.

Fig. 12. Comparison between the attention decoding accuracies obtained using the SVM classifier and thresholding at selected scalp locations.

Fig. 13. Mean correlation coefficient between the EEG signals (five minutes) measured at different electrode locations. The mean was obtained by averaging the correlation coefficients across the twenty-seven subjects.

In this study, we have proposed a framework to decode the selective auditory attention of a listener. The efficacy of the algorithm was verified using EEG data collected from 27 subjects who undertook a selective auditory attention experiment. The proposed framework consists of three modules. In the first module, a novel TRF estimation algorithm is presented which is based on the LMMSE estimator. MMSE estimators are known to perform better than LS estimators at low SNR and their performance converges as the SNR improves. The improvement stems from the fact that the MMSE estimators make use of the second order statistics (covariance) of the noise present in the signal. To calculate the noise covariance matrix, we can use signals measured at locations closest or contralateral to the reference electrode. Another potential way of estimating the noise at a particular electrode is to measure the signal at that electrode before the stimulus is presented [4]. This is appropriate for experiments with non-continuous speech, such as a speech-in-noise test, where the stimuli are presented with periodic pauses. When the entire signals are not available a priori, the sequential LMMSE estimator can be used to estimate the TRF. In sequential estimation, we assume that the TRF in the previous trial has an influence on the TRF in the current trial. The contribution of the previous estimate to the current estimate is controlled by the gain factor.

In the second module, a marker related to the attention of the listener is extracted. Amplitude peaks of the TRF around 100 ms (N1_TRF) and 200 ms (P2_TRF) are known to modulate selective attention [3] [7] [20]. Hence, we used the N1_TRF − P2_TRF peak as the attention marker in our framework. Once the markers were extracted, they were passed to an SVM classifier followed by a logistic regression to obtain a probabilistic measure of the attention. Instead of using an SVM, the attention markers could be directly passed to the logistic regression, but SVMs have the inherent ability to correct for errors if the feature vector falls inside the decision boundary. As an alternative to SVMs, random forest or neural network based classifiers could be used, provided there is sufficient data available to train them.
Two commonly used linear system identification methods instatistical estimation theory are LS and LMMSE estimators.In LS estimator, the squared error is minimized withouthaving any assumptions on the stochastic property whereasin LMMSE estimator, the mean squared error is minimizedassuming a Gaussian PDF and taking the expectation overthe assumed PDF. Algorithms based on LMMSE exploitthe knowledge of noise covariance and in this work, wereformulated the covariance matrix of received EEG signal as acombination of the covariance matrix of the speech envelopeand the covariance matrix of the noise signal. I.e., we usedsome prior knowledge about the covariance matrix and if thisprior knowledge can be estimated with a sufficient accuracy,LMMSE algorithm outperforms LS algorithm [27]. This per-formance improvement can be observed when comparing Fig.4 and Fig. 5. The standard deviation (and variance) of the TRFsestimated using the LS estimator was found to be significantlylarger than that of using the sequential LMMSE estimator(Fig. 6). However, the performance of the LS estimator shouldimprove with increasing trial duration as it is known that theSNR of the LS estimate is directly proportional to the lengthof the trial [37]. As a result, it is only when the length of theobservation vector (EEG) significantly exceeds the length ofthe estimation vector (TRF), the LS estimate is reliable.
B. Attended vs Unattended TRF
Analysis of the filter coefficients of the attended and unat-tended TRFs suggest a top down processing of the auditory information. To be precise, a larger N1
TRF and P2
TRF for theTRF estimated using the attended speech envelope suggeststhat the brain might be suppressing the contents of the unat-tended speech from being encoded in the working memory.These results suggest that the LMMSE based TRF estimationcould replicate the results obtained in previous studies [3][7]. Responses beyond 100 ms are attributed to phonemeprocessing and suppression of the unattended TRF at theselatencies means that the high level features corresponding tothe unattended speaker are not encoded [8] [38]. For initiallatencies below 50 ms, a large positive amplitude was observedfor both attended and unattended TRFs. This is intriguingbecause previous studies have not reported large amplitudesat these latencies. A potential reason is the short trial durationthat was used to estimate the TRF. I.e., as the stimuli inour experiment were continuous news segments that lastedaround five minutes and as we were analyzing short trials oftwo seconds, subjects had an expectation about the incomingacoustic stimuli. These expectations may be getting reflectedas early responses that are largely identical for both theattended and the unattended speaker. We anticipate that theseearly responses should diminish for non-continuous stimulussuch as the one used in speech-in-noise tests.In our study, analyses were limited to dual-speaker scenariobut in real life, there could be more interfering speakerspresent. It has already been shown that the selective attentioncould be inferred in a four-speaker environment using the LSmethod [39]. TRF estimation using sequential LMMSE can beextended to multispeaker scenarios such as the four-speakerexperiment to evaluate the behaviour of TRFs correspondingto each of the interfering speakers.
C. Attention Decoding Accuracy
The most important improvement introduced by the sequen-tial LMMSE algorithm is its ability to generate interpretableTRFs from short duration trials. This has enabled us to moveaway from the conventional correlation based attention markerto the N1
TRF − P2 TRF peak based attention marker. Correlationbased attention marker falls under the category of regressiontasks because we have to reconstruct the EEG signals ini-tially to calculate the Pearson correlation coefficient. However,inferring attention directly based on the N1
TRF − P2 TRF peakfalls under the category of classification tasks. It is knownthat regression tasks are more challenging than classificationtasks and we expect an improvement in performance usingN1
TRF − P2 TRF peak instead of Pearson correlation coefficient.The decoding accuracy obtained at Cz electrode for a trialduration of two seconds using LS estimator was found to be53.96%. An analysis of overlapping trials with a duration of 60seconds but using a sliding window of two seconds improvedthe accuracy to 60.3%. On the other hand, the decodingaccuracy obtained using our AAD framework was 79.8% (Fig.10). The lower decoding accuracy using LS estimator can beattributed to its dependency on the number of electrodes andthe trial duration. I.e., as the trial duration and the numberof electrodes reduce, decoding accuracy deteriorates [10][12][17]. An attention decoding framework based on the statepace algorithm reported comparable accuracies to that of ourframework. However, adaptation of the same algorithm on ourdataset yielded an accuracy of 72.7% which is lower than theaccuracy reported in [16]. The lower accuracy may be due tothe choice of hyperparameters in our adaptation because weoptimized only the LASSO regularization parameter and theparameters of the log-normal distribution. All other hyperpa-rameters were initialized with the default values suggested in[16]. The trial duration was chosen as two seconds becauseprevious studies have shown that the human brain can holdupto two seconds of independent auditory information inthe working memory [24]. In future work, the trial durationcould be chosen as the listener’s memory span and it can beestimated with the help of digit span tests.Despite our decoding framework being designed to detectthe attention at a time resolution of two seconds, the medianattention switch duration was found to be 52.4 seconds whichwas larger than our expectation. There are two possible ex-planations for the delayed detection of attention switch. First,our experiment was designed in such a way that there wasno dynamic attention switch present. Instead, the attentionswitch was synthesized by concatenating two EEG segmentswith opposite attention. Hence, it is possible that the subjectrequired certain amount of time to focus the attention on aparticular speaker at the start of the experiment. Second, dueto the sequential nature of estimating the current TRF basedon previous estimates, the algorithm may have introduced adelay by itself which is reflected in the longer switch duration.To disentangle the aforementioned possibilities, the algorithmmust be further evaluated in scenarios where the subject hasthe flexibility to switch attention dynamically [40].
D. Spatial Distribution of TRF
Comparison of the TRFs obtained at different scalp locations shows that the frontal and central locations generate identical TRFs. This is not trivial considering the fact that the EEG signals measured at these locations are highly correlated (Fig. 13). For our choice of the mastoid reference location, the magnitudes of the N1_TRF − P2_TRF peaks are largest at the central and the frontal regions and smallest at the temporal regions. On the contrary, if the reference electrode had been chosen as the Cz electrode, the polarity of the TRF would be inverted and the temporal regions would have the largest N1_TRF − P2_TRF peak. Despite the smaller peaks, a clear difference between the attended and the unattended TRFs could be observed at all locations. The decoding accuracies obtained at the temporal electrodes were not significantly lower than the decoding accuracies obtained at the central electrodes. This could pave the way to potentially integrate these algorithms into a neuro-steered hearing aid. Furthermore, these results suggest that we may not require high-density EEG measurements with multiple electrodes in paradigms such as selective auditory attention.

One of the assumptions that we have made throughout this paper is the availability of the clean speech envelopes to estimate the TRFs. In practice, only noisy mixtures are available and the speech sources must be separated before the envelope can be extracted. This is an active research field and algorithms are already available based on classical signal processing such as beamforming or based on deep neural networks [41]. Another limitation of the sequential LMMSE algorithm in its current form is that it can be used to estimate the auditory system response only in the forward direction (speech to EEG). This is because the noise covariance matrix that we have made use of is available only in the forward direction.

VI. CONCLUSION
In this paper, we have proposed a method to estimate the TRF from EEG signals using a sequential LMMSE estimator. Unlike the LS estimator, the sequential LMMSE estimator is capable of generating reliable and interpretable TRFs from short duration trials. Analysis of the properties of the TRFs in the dual-speaker scenario has revealed a locus of attention around 100 ms (N1_TRF) and 200 ms (P2_TRF), i.e., the TRFs corresponding to the attended speaker have a larger N1_TRF − P2_TRF peak than the TRFs corresponding to the unattended speaker. Using sequential LMMSE as the major building block, we developed a novel AAD framework to detect the auditory attention of a listener at a time resolution of two seconds. In the proposed AAD framework, only two electrodes (data measurement and noise measurement) in addition to the reference and the ground electrodes are sufficient to achieve a reliable decoding accuracy. Comparison of the results obtained at different scalp electrodes revealed that the AAD accuracies are in a similar range across the electrodes.

Although the AAD framework was designed to detect the attention from EEG trials of the order of seconds, it was observed that the time taken to detect an attention switch was longer than a single trial duration. Hence, further investigation is required to understand the fundamental reason behind this and to reduce the attention switch duration.

ACKNOWLEDGMENT
We convey our gratitude to all participants who took part in the study and would like to thank the student Laura Rupprecht who helped us with data acquisition.
REFERENCES

[1] A. J. Power, J. J. Foxe, E.-J. Forde, R. B. Reilly, and E. C. Lalor, "At what time is the cocktail party? A late locus of selective attention to natural speech," European Journal of Neuroscience, vol. 35, no. 9, pp. 1497-1503, 2012.
[2] J. A. O'Sullivan, A. J. Power, N. Mesgarani, S. Rajaram, J. J. Foxe, B. G. Shinn-Cunningham, M. Slaney, S. A. Shamma, and E. C. Lalor, "Attentional Selection in a Cocktail Party Environment can be Decoded from Single-Trial EEG," Cerebral Cortex, vol. 25, no. 7, pp. 1697-1706, 2014.
[3] L. Fiedler, M. Wöstmann, S. K. Herbst, and J. Obleser, "Late cortical tracking of ignored speech facilitates neural selectivity in acoustically challenging conditions," NeuroImage, vol. 186, pp. 33-42, 2019.
[4] N. Das, J. Vanthornhout, T. Francart, and A. Bertrand, "Stimulus-aware spatial filtering for single-trial neural response and temporal response function estimation in high-density EEG with applications in auditory research," NeuroImage, vol. 204, p. 116211, 2020.
[5] S. J. Aiken and T. W. Picton, "Human Cortical Responses to the Speech Envelope," Ear and Hearing, vol. 29, no. 2, pp. 139-157, 2008.
[6] E. C. Lalor and J. J. Foxe, "Neural responses to uninterrupted natural speech can be extracted with precise temporal resolution," European Journal of Neuroscience, vol. 31, no. 1, pp. 189-193, 2010.
[7] N. Ding and J. Z. Simon, "Neural coding of continuous speech in auditory cortex during monaural and dichotic listening," Journal of Neurophysiology, vol. 107, no. 1, pp. 78-89, 2012.
[8] G. M. Di Liberto, J. A. O'Sullivan, and E. C. Lalor, "Low-Frequency Cortical Entrainment to Speech Reflects Phoneme-Level Processing," Current Biology, vol. 25, no. 19, pp. 2457-2465, 2015.
[9] N. Mesgarani, S. V. David, J. B. Fritz, and S. A. Shamma, "Influence of Context and Behavior on Stimulus Reconstruction From Neural Activity in Primary Auditory Cortex," Journal of Neurophysiology, vol. 102, no. 6, pp. 3329-3339, 2009.
[10] B. Mirkovic, S. Debener, M. Jaeger, and M. De Vos, "Decoding the attended speech stream with multi-channel EEG: Implications for online, daily-life applications," Journal of Neural Engineering, vol. 12, no. 4, p. 046007, 2015.
[11] W. Biesmans, N. Das, T. Francart, and A. Bertrand, "Auditory-Inspired Speech Envelope Extraction Methods for Improved EEG-Based Auditory Attention Detection in a Cocktail Party Scenario," IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 25, no. 5, pp. 402-412, 2017.
[12] S. A. Fuglsang, T. Dau, and J. Hjortkjær, "Noise-robust cortical tracking of attended speech in real-world acoustic scenes," NeuroImage, vol. 156, pp. 435-444, 2017.
[13] L. Fiedler, M. Woestmann, C. Graversen, A. Brandmeyer, T. Lunner, and J. Obleser, "Single-channel in-ear-EEG detects the focus of auditory attention to concurrent tone streams and mixed speech," Journal of Neural Engineering, vol. 14, no. 3, p. 036020, 2017.
[14] D. D. Wong, S. A. Fuglsang, J. Hjortkjær, E. Ceolini, M. Slaney, and A. de Cheveigné, "A Comparison of Regularization Methods in Forward and Backward Models for Auditory Attention Decoding," Frontiers in Neuroscience, vol. 12, p. 531, 2018.
[15] S. Haufe, F. Meinecke, K. Görgen, S. Dähne, J.-D. Haynes, B. Blankertz, and F. Bießmann, "On the interpretation of weight vectors of linear models in multivariate neuroimaging," NeuroImage, vol. 87, pp. 96-110, 2014.
[16] S. Miran, S. Akram, A. Sheikhattar, J. Z. Simon, T. Zhang, and B. Babadi, "Real-Time Tracking of Selective Auditory Attention From M/EEG: A Bayesian Filtering Approach," Frontiers in Neuroscience, vol. 12, p. 262, 2018.
[17] A. Mundanad Narayanan and A. Bertrand, "Analysis of Miniaturization Effects and Channel Selection Strategies for EEG Sensor Networks With Application to Auditory Attention Detection," IEEE Transactions on Biomedical Engineering, vol. 67, no. 1, pp. 234-244, 2019.
[18] I. Kuruvila, E. Fischer, and U. Hoppe, "An LMMSE-based Estimation of Temporal Response Function in Auditory Attention Decoding," in . IEEE, 2020, pp. 2837-2840.
[19] E. Alickovic, T. Lunner, F. Gustafsson, and L. Ljung, "A Tutorial on Auditory Attention Identification Methods," Frontiers in Neuroscience, vol. 13, 2019.
[20] S. Akram, J. Z. Simon, and B. Babadi, "Dynamic Estimation of the Auditory Temporal Response Function from MEG in Competing-Speaker Environments," IEEE Transactions on Biomedical Engineering, vol. 64, no. 8, pp. 1896-1905, 2016.
[21] A. Presacco, J. Z. Simon, and S. Anderson, "Speech-in-noise representation in the aging midbrain and cortex: Effects of hearing loss," PloS One, vol. 14, no. 3, 2019.
[22] L. Decruy, J. Vanthornhout, and T. Francart, "Hearing impairment is associated with enhanced neural tracking of the speech envelope," Hearing Research, p. 107961, 2020.
[23] S. A. Fuglsang, J. Märcher-Rørsted, T. Dau, and J. Hjortkjær, "Effects of Sensorineural Hearing Loss on Cortical Synchronization to Competing Speech during Selective Attention," Journal of Neuroscience, vol. 40, no. 12, pp. 2562-2572, 2020.
[24] A. D. Baddeley, N. Thomson, and M. Buchanan, "Word length and the structure of short-term memory," Journal of Verbal Learning and Verbal Behavior, vol. 14, no. 6, pp. 575-589, 1975.
[25] M. J. Crosse, G. M. Di Liberto, A. Bednar, and E. C. Lalor, "The Multivariate Temporal Response Function (mTRF) Toolbox: A MATLAB Toolbox for Relating Neural Signals to Continuous Stimuli," Frontiers in Human Neuroscience, vol. 10, p. 604, 2016.
[26] J. M. Mendel, Lessons in Estimation Theory for Signal Processing, Communications and Control. Pearson Education, 1995.
[27] S. M. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory. Prentice Hall, 1997.
[28] J. R. Wolpaw and C. C. Wood, "Scalp distribution of human auditory evoked potentials. I. Evaluation of reference electrode sites," Electroencephalography and Clinical Neurophysiology, vol. 54, no. 1, pp. 15-24, 1982.
[29] R. F. Burkard, J. J. Eggermont, and M. Don, Auditory Evoked Potentials: Basic Principles and Clinical Application. Lippincott Williams & Wilkins, 2007.
[30] U. Hoppe, S. Weiss, R. W. Stewart, and U. Eysholdt, "An Automatic Sequential Recognition Method for Cortical Auditory Evoked Potentials," IEEE Transactions on Biomedical Engineering, vol. 48, no. 2, pp. 154-164, 2001.
[31] S. Geirnaert, T. Francart, and A. Bertrand, "An Interpretable Performance Metric for Auditory Attention Decoding Algorithms in a Context of Neuro-Steered Gain Control," IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 28, no. 1, pp. 307-317, 2019.
[32] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273-297, 1995.
[33] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning. Springer New York Inc., 2001.
[34] J. Platt et al., "Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods," Advances in Large Margin Classifiers, vol. 10, no. 3, pp. 61-74, 1999.
[35] T. de Taillez, B. Kollmeier, and B. T. Meyer, "Machine learning for decoding listeners' attention from electroencephalography evoked by continuous speech," European Journal of Neuroscience, vol. 51, no. 5, pp. 1234-1241, 2020.
[36] O. Etard, M. Kegler, C. Braiman, A. E. Forte, and T. Reichenbach, "Decoding of selective attention to continuous speech from the human auditory brainstem response," NeuroImage, vol. 200, pp. 1-11, 2019.
[37] L. L. Scharf, Statistical Signal Processing. Addison-Wesley, 1991.
[38] C. Brodbeck, A. Presacco, and J. Z. Simon, "Neural source dynamics of brain responses to continuous stimuli: Speech processing from acoustics to comprehension," NeuroImage, vol. 172, pp. 162-174, 2018.
[39] P. J. Schäfer, F. I. Corona-Strauss, R. Hannemann, S. A. Hillyard, and D. J. Strauss, "Testing the Limits of the Stimulus Reconstruction Approach: Auditory Attention Decoding in a Four-Speaker Free Field Environment," Trends in Hearing, vol. 22, p. 2331216518816600, 2018.
[40] A. Presacco, S. Miran, B. Babadi, and J. Z. Simon, "Real-Time Tracking of Magnetoencephalographic Neuromarkers during a Dynamic Attention-Switching Task," in . IEEE, 2019, pp. 4148-4151.
[41] D. Wang and J. Chen, "Supervised Speech Separation Based on Deep Learning: An Overview,"