Far-Field Automatic Speech Recognition
Reinhold Haeb-Umbach, Jahn Heymann, Lukas Drude, Shinji Watanabe, Marc Delcroix, Tomohiro Nakatani
Abstract—The machine recognition of speech spoken at a distance from the microphones, known as far-field automatic speech recognition (ASR), has received a significant increase of attention in science and industry, which caused or was caused by an equally significant improvement in recognition accuracy. Meanwhile it has entered the consumer market, with digital home assistants with a spoken language interface being its most prominent application. Speech recorded at a distance is affected by various acoustic distortions and, consequently, quite different processing pipelines have emerged compared to ASR for close-talk speech. A signal enhancement front-end for dereverberation, source separation and acoustic beamforming is employed to clean up the speech, and the back-end ASR engine is robustified by multi-condition training and adaptation. We will also describe the so-called end-to-end approach to ASR, which is a new promising architecture that has recently been extended to the far-field scenario. This tutorial article gives an account of the algorithms used to enable accurate speech recognition from a distance, and it will be seen that, although deep learning has a significant share in the technological breakthroughs, a clever combination with traditional signal processing can lead to surprisingly effective solutions.
Index Terms—Automatic speech recognition, speech enhancement, dereverberation, acoustic beamforming, end-to-end speech recognition
I. INTRODUCTION

Far-field, also called distant, ASR is concerned with the machine recognition of speech spoken at a distance from the microphone. Such recording conditions are common for applications like voice control of digital home assistants, the automatic transcription of meetings, human-to-robot communication, and several others. In recent years far-field ASR has witnessed a great increase of attention in the speech research community. This popularity can be attributed to several factors. There are, first, the large gains in recognition performance enabled by Deep Learning (DL), which made the more challenging task of accurate far-field ASR come within reach. A second reason is the commercial success of speech-enabled digital home assistants, which has become possible through progress in various fields, including signal processing, ASR and natural language processing (NLP). Finally, scientific challenges related to far-field noise and reverberation robust ASR, such as the REVERB challenge [1], the series of CHiME challenges [2]–[5],
J. Heymann and L. Drude were in part supported by a Google Faculty Research Award.

and the ASpIRE challenge [6] exposed the task to a wide research audience and met with a lot of publicity. Conversely, those challenges have also helped to get a clearer picture as to which techniques and algorithms are helpful for far-field ASR.

The reason why far-field ASR is more challenging than ASR of speech recorded by a close-talking microphone is the degraded signal quality. First, the speech signal is attenuated when propagating from the speaker to the microphones, resulting in low signal power and often also low Signal-to-Noise Ratio (SNR). Second, in an enclosure, such as a living room or a meeting room, the source signal is repeatedly reflected by walls and objects in the room, resulting in multi-path propagation, which causes a temporal smearing of the source signal called reverberation, much like multi-path propagation does in wireless communications. Third, it is likely that the microphone will capture other interfering sounds, in addition to the desired speech signal, such as the television or HVAC equipment. These sources of acoustic interference can be diverse, hard to predict, and often nonstationary in nature and thus difficult to compensate for. All these factors have a detrimental impact on ASR recognition performance.

Given these signal degradations, it is not surprising that quite different processing pipelines have emerged compared to ASR for close-talk speech. There is, foremost, the use of a microphone array instead of a single microphone for sound capture. This allows for multi-channel speech enhancement, which has proven very successful in noisy reverberant environments. Second, the speech recognition engine is trained with data which represents the typical signal degradations the recognizer is exposed to in a far-field scenario. This robustifies the acoustic model (AM), which is the component of the recognizer which translates the speech signal into linguistic units. The following examples demonstrate the power of enhancement and acoustic modeling:

• The REVERB challenge data consists of recordings of the text prompts of the Wall Street Journal (WSJ) data set, respoken and rerecorded in a far-field scenario with a distance of 2-3 m between speaker and microphone array [1]. The challenge baseline ASR system, defined in 2014, which operates on a single-channel microphone signal, achieved a Word Error Rate (WER) of 49%. Using a strong AM based on DL, the WER could be reduced to 22.2% [7]–[9], while the addition of a multi-microphone front-end and strong dereverberation brought the error rate down to 6.14% [10].

• The data set of the CHiME-3 challenge consists of recordings of the WSJ sentences in four different noise environments (bus, street, cafe, pedestrian area) [11]. The data was recorded using a tablet computer with six microphones mounted around the frame of the device. The baseline system reached a WER of 33%, while a robust back-end speech recognizer achieved 11.4% [12]. Finally, the multi-microphone front-end processing brought the error rate down to 2.7% [12].

• CHiME-5/6 consists of recordings of casual conversations among friends during a dinner party. The spontaneous speech, reverberation, and the large portion of time where more than one speaker is speaking simultaneously result in a WER of barely below 80% achieved by the baseline system.
Using a strong back-end, approximately 60% WER is achieved [13], while the addition of multi-microphone source separation and dereverberation results in a WER of 43.2% [13]. Improvements in both front-end and back-end resulted in 30.5% WER in the follow-up CHiME-6 campaign [14].

The progress in ASR brought about by DL is well documented in the literature [15]–[17]. In this contribution we therefore concentrate on those aspects of acoustic modeling that are typical of far-field ASR. But those aspects, although improving the error rate a lot, proved to be insufficient to cope with high reverberation, low SNR and concurrent speech, as is typical of far-field ASR. This is because common ASR feature representations are agnostic to phase (a.k.a. spatial) information and are vulnerable to reverberation, i.e., the temporal dispersion of the signal over multiple analysis frames, and because it is difficult for a single AM to decide which speech source to decode, if multiple are present. Therefore, front-end processing for cleaning up the signals has been developed, including techniques for acoustic beamforming [18,19], dereverberation [20,21], and source separation/extraction [22]. All of those have been shown to significantly improve speech recognition performance, as can be seen in the examples above.

In the last years, neural networks (NNs) have challenged the traditional signal processing based solutions for speech enhancement [23]–[25], and achieved excellent performance on a number of tasks. However, those advances come at a price. The networks are notorious for their computational and memory demands, often require large sets of parallel data (clean and distorted versions of the same utterance) for training, which have to be matched to the test scenario, and are "black box" systems, lacking interpretability by a human.
In multi-channel scenarios, it is furthermore not obvious how to handle phase information. As a consequence, researchers tried to combine the best of both worlds, i.e., to blend classic multi-channel signal processing with deep learning.

Fig. 1. Typical far-field ASR system, here exemplarily with M = 3 sensors, I = 2 sources and additive noise.

The purpose of this tutorial article is to describe the specific challenges of far-field ASR and how they are approached. We will discuss the general components of an ASR system only as much as is necessary to understand the modifications introduced in the far-field scenario. The organization of the paper is oriented along the processing pipeline of a typical far-field ASR system, as shown in Fig. 1. First, the signal is captured by an array of M microphones. The signal model, which describes the typical distortions encountered, is given in Section II. Although recently good single-channel dereverberation [21] and source separation techniques have been developed [24,26,27], the use of an array of microphones instead of a single one has the clear advantage that spatial information can be exploited, which often leads to a much more effective suppression of noise and competing audio sources, as well as to better dereverberation performance. Dereverberation, acoustic beamforming and source separation/extraction techniques will be discussed in Section III.

Once the signal is cleaned up, it is forwarded to the ASR back-end, whose task it is to transcribe the audio in a machine-readable form. In far-field ASR it is particularly important to make the acoustic model robust against remaining signal degradations. We will explain in Section IV how this can be achieved by so-called multi-style training and by adaptation techniques. Section V discusses end-to-end approaches to ASR. In this rather new approach, the recognizer consists of a monolithic neural network, which directly models the posterior distribution of linguistic units given the audio. This paradigm has recently been extended to the far-field scenario, as we explain in that section.

We conclude this tutorial article with a summary and discuss remaining challenges in Section VI. We further provide pointers to software and databases in Section VII for those who want to gain some hands-on experience.

This article primarily focuses on speech recognition accuracy, a.k.a. word error rate (WER), in far-field conditions as a criterion for success. Factors such as algorithmic latency or computational efficiency are only touched on in passing, although they are certainly of pivotal importance for the success of a technology in the market.

II. SIGNAL MODEL AND PERFORMANCE METRICS
A. Signal model
In a typical far-field scenario the signal of interest is degraded due to room reverberation, competing speakers, and ambient noise. Assuming an array of M microphones, the signal at the m-th microphone can be written as follows:

y_m[\ell] = \sum_{i=1}^{I} \bigl(a_m^{(i)} * s^{(i)}\bigr)[\ell] + n_m[\ell], (1)

where * is the convolution operation, a_m^{(i)}[\ell] is the acoustic impulse response (AIR) from the origin of the i-th speech signal s^{(i)}[\ell] to the m-th microphone, and n_m[\ell] is the additive noise. Depending on the application, we might only be interested in one of the I signals, say s^{(i)}[\ell], while the remaining ones are considered unwanted competing audio signals.

In the following we assume that the AIR is time invariant, although it is well known that movements of the speaker or changes in the environment, and even room temperature changes, cause a change of the AIR. Nevertheless, time invariance is a common assumption in ASR applications, justified by the fact that an utterance, for which the AIR is assumed to be constant, is only a few seconds long.

However, the nonstationarity of the speech and noise signals has to be taken into account. When moving to a frequency domain representation we therefore have to use the Short-Time Fourier Transformation (STFT), i.e., apply the DFT to windowed segments of the signal. Typical window (also called frame) lengths are up to 128 ms, and frame advances are up to 32 ms.

When expressing the signal model of Eq. (1) in the STFT domain, it is important to note that, in a common setup, the AIR is much longer than the length of the analysis window. In a typical living room environment it takes a few hundred milliseconds up to a second for the AIR to decay to -60 dB of its initial value, which is considerably longer than the aforementioned window length. Then the convolution in Eq. (1) no longer corresponds to a multiplication in the STFT domain, but instead to a convolution over the frame index. To a good approximation [28,29], Eq. (1) can be expressed in the STFT domain as

y_{m,t,f} = \sum_{i=1}^{I} \sum_{\tau=0}^{L-1} a_{m,\tau,f}^{(i)} s_{t-\tau,f}^{(i)} + n_{m,t,f}, (2)

where a_{m,t,f}^{(i)} is a time-frequency representation of the AIR, called acoustic transfer function (ATF); s_{t,f}^{(i)} and n_{m,t,f} are the STFTs of the i-th source speech signal and of the noise at microphone index m, frame index t, and frequency bin index f. Furthermore, L denotes the length of the ATF in number of frames. Note that we used, in an abuse of notation, the same symbols for the time domain and frequency domain representations. This should not lead to confusion, because time domain signals have an argument, as in y[\ell], while frequency domain variables have an index, as in y_{t,f}. The model of Eq. (2) strongly contrasts with the model for a close-talking situation, where y_{t,f} = s_{t,f} + n_{t,f}, or where even the noise term can be neglected.

When trying to extract s_{t,f}^{(i)} from y_{m,t,f}, it comes to our help that multi-channel input is available, i.e., m \in [1, ..., M]. Defining the vector of microphone signals y_{t,f} = [y_{1,t,f}, ..., y_{M,t,f}]^T, we can write

y_{t,f} = \sum_{i=1}^{I} \sum_{\tau=0}^{L-1} a_{\tau,f}^{(i)} s_{t-\tau,f}^{(i)} + n_{t,f}, (3)

where a_{t,f}^{(i)} and n_{t,f} are defined similarly to y_{t,f}.

Fig. 2 displays a typical AIR: it consists of three parts, the direct signal, early reflections, and late reverberation caused by multiple reflections off walls and objects in the room. The early reflections are actually beneficial both for human listeners and for ASR: the intelligibility of the signal including early reflections is even better than that of the "dry" line-of-sight signal. After the mixing time, which is on the order of
50 ms, the diffuse reverberation tail begins. This late reverberation degrades human intelligibility and also leads to a significant loss in recognition accuracy of a speech recognizer. Thus, we split the ATF into an early and a late part:

y_{t,f} = \sum_{i=1}^{I} d_{t,f}^{(i)} + \sum_{i=1}^{I} r_{t,f}^{(i)} + n_{t,f}, (4)

where the early-arriving speech signals are given by

d_{t,f}^{(i)} = \sum_{\tau=0}^{\Delta-1} a_{\tau,f}^{(i)} s_{t-\tau,f}^{(i)} \approx h_f^{(i)} s_{t,f}^{(i)}, (5)

and the late-arriving speech signals are given by

r_{t,f}^{(i)} = \sum_{\tau=\Delta}^{L-1} a_{\tau,f}^{(i)} s_{t-\tau,f}^{(i)}. (6)

Here, \Delta is the temporal extent of the direct signal and early reflections, which is typically set to correspond to the mixing time; for example, with a frame advance of 16 ms, \Delta = 3 frames roughly corresponds to the mixing time of 50 ms mentioned above. In Eq. (5), the desired signal is approximated by the product of a time-invariant (non-convolutive) ATF vector h_f^{(i)} with the clean speech s_{t,f}^{(i)}, disregarding the spread of the desired signal over multiple analysis frames. Other works have tried to overcome this approximation by employing a convolutive transfer function model for the desired signal [30,31].
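To make the notation concrete, the following NumPy sketch (our illustration, with toy dimensions and random ATFs as assumptions, not values from the article) instantiates the convolutive STFT-domain model of Eq. (3) and splits the observation into early and late components as in Eqs. (4)-(6).

```python
import numpy as np

rng = np.random.default_rng(0)
I, M, T, F = 2, 3, 100, 257   # sources, mics, frames, frequency bins (toy sizes)
L, Delta = 10, 3              # ATF length and early/late split point, in frames

# Random source STFTs s[i, t, f] and ATFs a[i, tau, f, m] (purely illustrative)
s = rng.normal(size=(I, T, F)) + 1j * rng.normal(size=(I, T, F))
a = (rng.normal(size=(I, L, F, M)) + 1j * rng.normal(size=(I, L, F, M))) / L
n = 0.01 * (rng.normal(size=(T, F, M)) + 1j * rng.normal(size=(T, F, M)))

def convolve_over_frames(a_i, s_i, tau_range):
    """Sum over tau of a[tau, f, m] * s[t - tau, f], i.e., the frame-wise convolution of Eq. (3)."""
    out = np.zeros((T, F, a_i.shape[-1]), dtype=complex)
    for tau in tau_range:
        out[tau:] += a_i[tau] * s_i[: T - tau, :, None]
    return out

d = sum(convolve_over_frames(a[i], s[i], range(0, Delta)) for i in range(I))   # early part, Eq. (5)
r = sum(convolve_over_frames(a[i], s[i], range(Delta, L)) for i in range(I))   # late part, Eq. (6)
y = d + r + n                                                                  # observation, Eq. (4)
print(y.shape)  # (T, F, M)
```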
Fig. 2. An acoustic impulse response consists of the direct sound, early reflections and late reverberation.

Considering Eq. (4), the tasks of the enhancement stage can be defined as follows:

• Dereverberation (also known as deconvolution) aims at removing the late reverberation component from the observed mixture signal.

• The goal of source separation is to disentangle the mixture into its I speech components (in some approaches, the noise is treated like an additional, (I+1)-st component), while

• Beamforming aims at extracting a target speech signal, which can be any of the I sources, by projecting the input signal onto the one-dimensional subspace spanned by the target signal, thereby diminishing signal components in other subspaces.

We will discuss each of the above tasks in detail in Section III.

B. Performance metrics
Clearly, the ultimate performance measure depends on the application. For a transcription task it is the word error rate, while it is the success rate (high precision and recall) for an information retrieval task. However, when developing the speech enhancement front-end it is very helpful to be able to assess the quality of the enhancement with an instrumental measure which is independent of the ASR or a downstream NLP component. This not only gives smaller turnaround times in system development, but also gives more insight into how to improve front-end performance.

Clearly, speech quality and intelligibility are most informatively assessed by human listening experiments. But because these are too expensive and time consuming, there is a whole body of literature devoted to how to measure speech quality or intelligibility by an "instrumental" measure. Measures which have been originally developed to evaluate speech communication systems and which have found widespread use in speech enhancement are Perceptual Evaluation of Speech Quality (PESQ) [32] for speech quality and Short-Time Objective Intelligibility (STOI) [33] for speech intelligibility. Note that both measures are "intrusive", which means that a clean reference signal is required. Please do also note that those measures are only moderately correlated with ASR performance, as has been empirically observed, e.g., in [12]. They are still useful in system development, but for a definite assessment of the benefits of an enhancement system for ASR, recognition experiments are indispensable.

For the evaluation of source separation performance the most common measure is the Signal-to-Distortion Ratio (SDR) [34]. It measures the ratio of the power of the signal of interest to the power of the difference between the signal of interest and its prediction (obtained by the source separation algorithm). Today, values of more than 10 dB are not uncommon.
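As a concrete illustration, the following small function (a sketch, not code from the article) computes a basic SDR in dB; note that the BSS Eval toolkit behind [34] uses a more permissive definition that additionally allows a short time-invariant filter on the reference.

```python
import numpy as np

def sdr_db(reference: np.ndarray, estimate: np.ndarray) -> float:
    """Signal-to-Distortion Ratio in dB: power of the reference divided by
    the power of the estimation error (simplified variant of [34])."""
    distortion = reference - estimate
    return 10.0 * np.log10(np.sum(reference**2) / np.sum(distortion**2))

# Toy usage with a noisy copy of a random stand-in "speech" signal.
rng = np.random.default_rng(1)
s = rng.normal(size=16000)
print(f"SDR: {sdr_db(s, s + 0.1 * rng.normal(size=16000)):.1f} dB")  # roughly 20 dB
```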
III. MULTI-CHANNEL SPEECH ENHANCEMENT
We now discuss enhancement techniques to address the aforementioned signal degradations. While both linear and non-linear filtering approaches have been developed for speech enhancement, linear filtering has empirically been shown to be advantageous for estimating the desired signal d_{t,f}^{(i)} in Eq. (3) from the observation y_{t,f} in terms of WER reduction of far-field speech recognition [1,2,13]. This linear filtering leverages information the AM typically does not have access to, while not introducing time-dependent artifacts such as musical tones. On the other hand, the non-linear filtering approach has been shown to be useful for estimating statistics of signals, such as time-frequency dependent variances and masks of signals [23,35], which are effectively used for estimating a linear filter.

A very general form of a (causal time-invariant) linear filter can be represented by a convolutional beamformer [30,31,36,37]. It is defined as

\hat{d}_{1,t,f}^{(i)} = \sum_{\tau=0}^{L_w - 1} \bigl(w_{\tau,f}^{(i)}\bigr)^H y_{t-\tau,f}, (7)

where \hat{d}_{1,t,f}^{(i)} is an estimate of d_{t,f}^{(i)} at the 1st microphone (without loss of generality, we here declare the first microphone as the reference microphone), w_{\tau,f}^{(i)} = [w_{1,\tau,f}^{(i)}, ..., w_{M,\tau,f}^{(i)}]^T \in \mathbb{C}^{M \times 1} is a coefficient vector of the convolutional beamformer to be optimized for the estimation of \hat{d}_{1,t,f}^{(i)}, L_w is the length of the convolutional beamformer, and (\cdot)^H denotes transposition and complex conjugation. While many techniques have been developed for optimizing a convolutional beamformer [30,31,36], an approach decomposing it into a multi-channel linear prediction (MCLP) filter and a beamformer is widely used as a front-end for far-field ASR. With \Delta = 1 and by applying the distributive property, Eq. (7) can be rewritten as

\hat{d}_{1,t,f}^{(i)} = \underbrace{\bigl(w_{0,f}^{(i)}\bigr)^H}_{\text{Beamformer}} \underbrace{\Bigl(y_{t,f} - \sum_{\tau=\Delta}^{L_w - 1} \bigl(C_{\tau,f}^{(i)}\bigr)^H y_{t-\tau,f}\Bigr)}_{\text{MCLP filter}}, (8)

where C_{\tau,f}^{(i)} \in \mathbb{C}^{M \times M} is an MCLP coefficient matrix satisfying C_{\tau,f}^{(i)} w_{0,f}^{(i)} = -w_{\tau,f}^{(i)}. Equation (8) highlights that a convolutional beamformer that estimates \hat{d}_{1,t,f}^{(i)} can be decomposed into two consecutive linear filters: an MCLP filter [38] corresponding to the terms in the parentheses, and a (non-convolutional) beamformer w_{0,f}^{(i)} [39,40]. As will be discussed later, the MCLP filter can perform reduction of late reverberation, namely dereverberation. The beamformer, on the other hand, can perform reduction of noise, i.e., denoising, and extraction of a desired source from other competing sources, i.e., source separation.

The factorization in Eq. (8) allows us to use a cascade configuration for speech enhancement, i.e., dereverberation followed by denoising and source separation. This is advantageous because we can decompose the complicated enhancement problem into sub-problems that are easier to handle. Furthermore, it has been shown that, under certain moderate conditions, even when we separately optimize dereverberation and beamforming, the estimate obtained by the cascade configuration is equivalent to (or can even be better than) that obtained by direct optimization of the convolutional beamformer in Eq. (7) [41].
Although both dereverberation and beamforming are well-known concepts from antenna arrays [39,42], acoustic signal processing in a non-stationary acoustic environment requires additional efforts, such as estimation of time-varying statistics of temporally-correlated desired sources and noise, and "broadband" processing in the time-frequency domain [43,44]. For this purpose, many techniques have been developed:

• For dereverberation, estimation and subtraction of the spectrum of the late reverberation has been employed, e.g., [45]. Also, MCLP filtering with delayed prediction and a time-varying Gaussian source assumption has been developed and shown effective for both single and multiple desired source scenarios [46,47].

• For denoising, techniques for effectively estimating the time-varying statistics of the desired signal and the noise have been developed, based on estimation of a time-frequency dependent mask [18,48].

• For source separation, sophisticated techniques for estimating masks of multiple competing sources have been developed. Modern techniques are even able to handle multiple sources in single-channel input [25,26,49].

While these techniques are well established in classical signal processing areas [20,50]–[52], recently purely deep learning based solutions have challenged them, e.g. [23,35]. The advantage of the deep learning-based solutions is their powerful capability of modeling source magnitude spectral patterns over wide time-frequency ranges, which were very difficult to handle by classical signal processing approaches. The deep learning approaches, however, are also notorious for being resource hungry and hard to interpret. Their training for speech enhancement tasks requires parallel data, i.e., a database which contains each speech utterance in two versions, distorted and clean, one serving as input to the network, and the other as training target. Reasonably, this can only be obtained by artificially adding the distortions to a clean speech utterance, leaving an unavoidable mismatch between artificially degraded speech in training and real recordings in noisy reverberant environments during test. Classical signal processing solutions are typically much more resource efficient and do not have this parallel data training problem. We will show for each of the three enhancement tasks how "neural network-supported signal processing" or "signal processing-supported neural networks" can combine the advantages of both worlds, achieving high enhancement performance, being resource efficient and rendering parallel data unnecessary [53]–[59].

A typical processing pipeline for dereverberation, separation, and extraction is illustrated in Fig. 3.
A. Dereverberation
The goal of dereverberation is to reduce the late reverberation r_{t,f}^{(i)} from the observation y_{t,f} in Eq. (4) while keeping the desired signal d_{t,f}^{(i)} unchanged. Based on the decomposition in Eq. (8), we here highlight a technique based on MCLP filtering, referred to as Weighted Prediction Error (WPE) dereverberation [21,46]. In the following, we first explain WPE dereverberation in the noiseless single source case, i.e., assuming y_{t,f} = d_{t,f}^{(i)} + r_{t,f}^{(i)}, and then explain its applicability to the noisy multiple source case at the very end of this section.

The core idea of WPE dereverberation is to predict the late reverberation of the desired signal from past observations. This late reverberation is then subtracted from the observed signal to obtain an estimate of the desired signal. Just as Eq. (8) indicates which past observations are used for prediction, Fig. 4 visualizes the past observations, the prediction delay and which frame of late reverberation is predicted. A unique characteristic of WPE is the introduction of the prediction delay \Delta, which corresponds to the duration of the direct signal and early reflections in Eq. (5). It avoids the desired signal being predicted from the immediately past observations, because this would destroy the short-time correlation
typical of a speech signal. Thanks to this, WPE can predict only the late reverberation and keep the desired signal unchanged.

Fig. 3. Overview of the enhancement system consisting of a neural network supported dereverberation module and a neural network supported or spatial clustering model supported beamforming module. The MCLP coefficient matrix C_{\tau,f} as well as the time-varying variance \lambda_{t,f} are speaker-independent, as argued in the last paragraph of Sec. III-A. The BF filtering block may contain additional postfiltering to compensate for potential artifacts the beamformer may have produced.

Fig. 4. WPE estimates a filter to predict the late reverberation in the current observation from the past observations (skipping \Delta - 1 frames). The late reverberation is then subtracted from the current observation.

To deal with the time-varying characteristics of speech in the MCLP framework, WPE estimates the coefficient matrix C_{\tau,f}^{(i)} based on maximum likelihood estimation. It is assumed that the desired signal d_{t,f}^{(i)} follows a zero-mean circularly-symmetric complex Gaussian distribution with the unknown channel-independent time-varying variance \lambda_{t,f}^{(i)} of the early-arriving speech signal:

p\bigl(d_{t,f}^{(i)}\bigr) = \mathcal{N}_{\mathbb{C}}\bigl(d_{t,f}^{(i)}; 0, \lambda_{t,f}^{(i)} I_M\bigr), (9)

where d_{t,f}^{(i)} is obtained from MCLP filtering in Eq. (8) and I_M is an M x M-dimensional identity matrix. With this model, the objective to minimize becomes

L(\psi_f) = -\sum_t \log p\bigl(d_{t,f}^{(i)}; \psi_f\bigr) = \sum_t \frac{\bigl\| y_{t,f} - \sum_{\tau=\Delta}^{L_w - 1} (C_{\tau,f}^{(i)})^H y_{t-\tau,f} \bigr\|^2}{\lambda_{t,f}} + \sum_t M \log \lambda_{t,f} + \text{const.}, (10)

where \psi_f is a set of parameters to be estimated at frequency f, composed of C_{\tau,f} and \lambda_{t,f} for all \tau and t, and \|\cdot\| denotes the Euclidean norm. Variations of the objective have also been proposed for better dereverberation performance by introducing sparse source priors [60,61].

The minimization of the above objective leads to an iterative algorithm which alternates between estimating the time-varying variance \lambda_{t,f}^{(i)} and the coefficient matrix C_{\tau,f}^{(i)}. The steps can be summarized as follows:

Step 1) \lambda_{t,f}^{(i)} = \frac{1}{(2\delta + 1) M} \sum_{\tau = t - \delta}^{t + \delta} \sum_m \bigl| d_{m,\tau,f}^{(i)} \bigr|^2, (11)

Step 2) R_f^{(i)} = \sum_t \frac{\bar{y}_{t-\Delta,f} \bar{y}_{t-\Delta,f}^H}{\lambda_{t,f}^{(i)}} \in \mathbb{C}^{MK \times MK}, (12)

P_f^{(i)} = \sum_t \frac{\bar{y}_{t-\Delta,f} y_{t,f}^H}{\lambda_{t,f}^{(i)}} \in \mathbb{C}^{MK \times M}, (13)

\bar{C}_f^{(i)} = \bigl(R_f^{(i)}\bigr)^{-1} P_f^{(i)} \in \mathbb{C}^{MK \times M}, (14)

where \bar{y}_{t-\Delta,f} \in \mathbb{C}^{MK \times 1} is the stacked observation vector as depicted by the box on the left hand side of Fig. 4, and \delta defines a temporal context.

In the variance estimation step in Eq. (11), \lambda_{t,f}^{(i)} is updated dependent on the previous estimate of C_{\tau,f}^{(i)}, i.e., it is estimated as the variance of the signal dereverberated with C_{\tau,f}^{(i)} according to the MCLP filter in Eq. (8). Often, smoothing by averaging over neighboring frames with a left context of \delta and a right context of \delta is introduced to reduce the variance of this variance estimate.

In the filter matrix estimation step in Eqs. (12)-(14), fixing \lambda_{t,f}^{(i)} at its value estimated in the previous step makes Eq. (10) a simple quadratic form, and thus we can reach a global minimum by a closed-form update. Here, R_f^{(i)} can be interpreted as an auto-correlation matrix of normalized stacked observation vectors. Further, K = L_w - \Delta is the number of filter taps. Finally, Eq.
(14) computes the stacked filter matrix

\bar{C}_f^{(i)} = \Bigl[\bigl(C_{\Delta,f}^{(i)}\bigr)^T, \ldots, \bigl(C_{L_w - 1,f}^{(i)}\bigr)^T\Bigr]^T (15)

using the Wiener-Hopf equation.

This iterative algorithm may be started by initializing the time-varying variance \lambda_{t,f}^{(i)} with that of the observation. Although this is a rather crude approximation, the algorithm typically converges within three iterations.

The use of a neural network further allows estimating the time-varying variance \lambda_{t,f}^{(i)} within a single step, avoiding the iterative estimation, and eases the transition towards online processing [56,58]. In [56] a neural network is trained with a Mean Squared Error (MSE) loss to predict (the logarithm of) the time-varying variance \lambda_{t,f}^{(i)} and applied to offline and block-online processing, while [58] extends this to frame-online processing.

In order to handle noisy multi-source cases, we slightly revise the goal of WPE dereverberation to estimate a single set of coefficient matrices C_{\tau,f} that can reduce the late reverberation r_{t,f}^{(i)} for all i at the same time, rather than estimating a different set of matrices C_{\tau,f}^{(i)} separately for dereverberation of each source i. The existence of such a set of coefficient matrices is guaranteed by the multiple-input/output inverse theorem (MINT) [62] when M \geq I, n_{t,f} = 0, and the acoustic transfer functions share no common zeros. The coefficient matrices can be estimated based on the objective of WPE in Eq. (10), by setting \lambda_{t,f} to represent the variance of the mixture of all d_{t,f}^{(i)}. Although n_{t,f} = 0 is usually not satisfied in the far-field setting, due to the inherent robustness of the MCLP filtering, WPE works well even with such additive noise.

While we discussed WPE here in some detail, because it has found widespread use in the ASR community, it is by no means the only approach to dereverberation. Instead of estimating the direct signal and early reflections, one can estimate the power spectral density of the late reverberation and subtract it from the observed signal, thereby achieving a dereverberating effect [45,63]. Also, neural networks trained to estimate the nonreverberant signal from the observed reverberant one are very successful [10].
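The alternation between Eq. (11) and Eqs. (12)-(14) maps directly onto a few lines of linear algebra. The following NumPy sketch (our simplified illustration, not the authors' implementation; mature open-source packages such as NARA-WPE exist) performs offline WPE for a single frequency bin and omits the temporal smoothing of the variance estimate.

```python
import numpy as np

def wpe_single_bin(Y, taps=10, delay=3, iterations=3, eps=1e-10):
    """Minimal offline WPE for one frequency bin, following Eqs. (8) and (11)-(14).
    Y: observations of shape (M, T); returns the dereverberated signal of shape (M, T)."""
    M, T = Y.shape
    D = Y.copy()                                              # initial desired-signal estimate
    for _ in range(iterations):
        # Step 1: time-varying variance of the current estimate (Eq. (11), no +/- delta smoothing)
        lam = np.maximum(np.mean(np.abs(D) ** 2, axis=0), eps)        # shape (T,)

        # Stacked, delayed observation vectors ybar_{t-Delta} of size M*taps (cf. Fig. 4)
        Ybar = np.zeros((M * taps, T), dtype=complex)
        for k in range(taps):
            shift = delay + k
            Ybar[k * M:(k + 1) * M, shift:] = Y[:, :T - shift]

        # Step 2: variance-normalized correlation matrices and closed-form filter, Eqs. (12)-(14)
        R = (Ybar / lam) @ Ybar.conj().T                              # (M*taps, M*taps)
        P = (Ybar / lam) @ Y.conj().T                                 # (M*taps, M)
        C = np.linalg.solve(R, P)                                     # stacked filter, Eq. (14)

        # MCLP filtering: subtract the predicted late reverberation, Eq. (8)
        D = Y - C.conj().T @ Ybar
    return D

# Toy usage with random stand-in data (M=4 mics, T=200 frames at one frequency).
rng = np.random.default_rng(0)
Y = rng.normal(size=(4, 200)) + 1j * rng.normal(size=(4, 200))
print(wpe_single_bin(Y).shape)  # (4, 200)
```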
B. Beamforming

Beamforming aims at reducing additive noise and residual reverberation from the observation. As in the decomposition in Eq. (8), a spatial filter w_{0,f}^{(i)} (commonly referred to as beamformer) is used to obtain an estimate of the desired signal from the output of the WPE dereverberation. Consequently, we here define new variables which describe the signal components after WPE processing. Let us define the input of the beamformer as

\tilde{y}_{t,f} = d_{t,f}^{(1)} + \cdots + d_{t,f}^{(I)} + \tilde{n}_{t,f} = d_{t,f}^{(i)} + \tilde{n}_{t,f}^{(i)}, (16)

where \tilde{n}_{t,f} contains all residual reverberation and noise, and where \tilde{n}_{t,f}^{(i)} collectively represents all the interference signal components from the viewpoint of speaker i: these are the remaining reverberation, the source signals other than the desired signal, ambient noise, and possible other deviations from d_{t,f}^{(i)}. In other words, Eq. (16) shows the decomposition from the perspective of speaker i and not for all speakers. Then, the beamforming step is meant to remove all interferences \tilde{n}_{t,f}^{(i)} from \tilde{y}_{t,f} while keeping d_{t,f}^{(i)} unchanged.

Most statistical beamforming approaches rely on estimated second order statistics, namely the spatial covariance matrices of the desired signal \Phi_{dd,f}^{(i)} and of the interference \Phi_{\tilde{n}\tilde{n},f}^{(i)}. A beamforming algorithm is derived by defining an optimization criterion. A widely used approach is Minimum Variance Distortionless Response (MVDR) beamforming, which minimizes the expected variance of the resultant interference subject to a distortionless constraint involving the ATF vector h_f^{(i)} in Eq. (5). It is defined as

w_{0,f}^{(i)} = \operatorname*{argmin}_{w} \; w^H \Phi_{\tilde{n}\tilde{n},f}^{(i)} w \quad \text{s.t.} \quad w^H h_f^{(i)} = h_{1,f}^{(i)}, (17)

where \Phi_{\tilde{n}\tilde{n},f}^{(i)} is the spatial covariance matrix of all interferences, assumed to be time-invariant, and h_{1,f}^{(i)} is the 1st microphone element of h_f^{(i)}. Thanks to the distortionless constraint, the beamformer keeps the desired signal unchanged, while reducing the additive distortions. The optimization problem in Eq. (17) results in

w_{0,f}^{(i)} = \frac{\bigl(\Phi_{\tilde{n}\tilde{n},f}^{(i)}\bigr)^{-1} \tilde{h}_f^{(i)}}{\bigl(\tilde{h}_f^{(i)}\bigr)^H \bigl(\Phi_{\tilde{n}\tilde{n},f}^{(i)}\bigr)^{-1} \tilde{h}_f^{(i)}}, (18)

where \tilde{h}_f^{(i)} is a relative transfer function (RTF) [64,65], defined as the ATF vector normalized by its 1st microphone component, i.e., \tilde{h}_f^{(i)} = h_f^{(i)} / h_{1,f}^{(i)}. The RTF is a widely used representation to avoid the scale ambiguity of ATF vector estimation.

Techniques for estimating the RTF vector \tilde{h}_f^{(i)} have been developed, which in general require an estimate of a spatial covariance matrix \Phi_{dd,f}^{(i)} of the desired signal d_{t,f}^{(i)} [18,40]. Alternative objectives can also be used for beamforming, such as likelihood maximization with a time-varying Gaussian source assumption, similar to WPE, resulting in the weighted Minimum Power Distortionless Response (wMPDR) beamformer [41], and maximization of the expected output SNR, resulting in the maximum SNR beamformer (also called Generalized Eigenvalue Decomposition (GEV) beamformer) [66].

One way to estimate these covariance matrices is to select time frames in which just one signal component is active, e.g., the beginning of a recording where only noise is active. This approach is appropriate under the assumption that the corresponding signals are stationary. However, a better and more fine-grained approach is to use a time-frequency mask, \gamma_{t,f}^{(i)}, to decide for each time-frequency (TF) bin how well it corresponds to the target speaker or the interference.
This leads to a covariance matrix calculation with time- and frequency-dependent masks \gamma_{t,f}^{(i)}:

\hat{\Phi}_{dd,f}^{(i)} = \sum_t \gamma_{t,f}^{(i)} \tilde{y}_{t,f} \tilde{y}_{t,f}^H \Big/ \sum_t \gamma_{t,f}^{(i)}, (19)

\hat{\Phi}_{\tilde{n}\tilde{n},f}^{(i)} = \sum_t \sum_{i' \neq i} \gamma_{t,f}^{(i')} \tilde{y}_{t,f} \tilde{y}_{t,f}^H \Big/ \sum_t \sum_{i' \neq i} \gamma_{t,f}^{(i')}. (20)

Conceptually, assuming that the selected TF bins indeed only contain the desired signal, \hat{\Phi}_{dd,f}^{(i)} \in \mathbb{C}^{M \times M} approximates the covariance matrix of d_{t,f}^{(i)}, and, on a similar assumption, \hat{\Phi}_{\tilde{n}\tilde{n},f}^{(i)} \in \mathbb{C}^{M \times M} approximates the covariance matrix of all interferences \tilde{n}_{t,f}^{(i)}.

Depending on the acoustic environment, the a priori knowledge for the given utterance, the number of speakers in the recording, and the available training data, different ways to estimate the masks for each speaker are possible. The two predominant approaches for mask estimation, unsupervised spatial clustering and neural network-based mask estimation, are explained in the following section.
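Putting Eqs. (18)-(20) together, a mask-based MVDR beamformer for one frequency bin can be sketched as follows. Estimating the RTF from the principal eigenvector of the estimated desired-signal covariance matrix is one common choice among several; this choice, as well as the toy data, are assumptions of this illustration.

```python
import numpy as np

def mvdr_from_masks(Ytilde, gamma_target, gamma_interf, eps=1e-10):
    """Mask-based MVDR for one frequency bin (Eqs. (18)-(20)).
    Ytilde: (M, T) observations; gamma_*: (T,) masks in [0, 1]."""
    # Mask-weighted spatial covariance matrices, Eqs. (19)-(20)
    phi_dd = (gamma_target * Ytilde) @ Ytilde.conj().T / max(gamma_target.sum(), eps)
    phi_nn = (gamma_interf * Ytilde) @ Ytilde.conj().T / max(gamma_interf.sum(), eps)

    # RTF estimate: dominant eigenvector of phi_dd, scaled to microphone 1 (one common choice)
    _, eigvecs = np.linalg.eigh(phi_dd)
    h_tilde = eigvecs[:, -1] / eigvecs[0, -1]

    # MVDR solution, Eq. (18)
    num = np.linalg.solve(phi_nn, h_tilde)
    w = num / (h_tilde.conj() @ num)
    return w.conj() @ Ytilde      # beamformer output, shape (T,)

# Toy usage: random stand-in data and masks, just to show the shapes involved.
rng = np.random.default_rng(2)
M, T = 4, 300
Yt = rng.normal(size=(M, T)) + 1j * rng.normal(size=(M, T))
g = rng.uniform(size=T)
print(mvdr_from_masks(Yt, g, 1.0 - g).shape)  # (T,)
```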
C. Mask estimation for denoising, single source extraction, and source separation

The goal of a mask estimator is to estimate a presence probability mask for each speaker and for the noise. This section first describes unsupervised spatial clustering approaches for single- and multi-speaker scenarios and then continues with neural network-based approaches, again for single- and multi-speaker scenarios.
1) Unsupervised spatial clustering:
Unsupervised spatial clustering is a technique used to assign each TF bin to a particular class based solely on spatial cues, i.e., phase and level differences between microphone channels that provide information about the direction of sound with respect to the microphone array. A class then models the different speakers, characterized by different locations, or noise, with more diffuse characteristics. Assuming that the speakers speak from different locations, it is possible to separate the microphone signals into speech signals of the different speakers by clustering the spatial cues [67].

To do so, one typically formulates a statistical model which consists of a class-dependent distribution for each source i and an additional noise class, which is here indexed by i = I + 1:

p(\tilde{y}_{t,f}) = \sum_{i=1}^{I+1} p(\tilde{y}_{t,f} \mid \theta^{(i)}) \, p(z_{t,f} = i), (21)

where z_{t,f} is the hidden class affiliation variable, and \theta^{(i)} summarizes the class-dependent parameters. Typical class-dependent distributions are the complex Watson distribution [68], the complex Bingham distribution [69], or the complex angular central Gaussian distribution [70]. The parameters and the masks are then obtained through an Expectation-Maximization (EM) algorithm in which the E-step and the M-step alternate. In the M-step, the class-dependent parameters are updated. In the E-step, the masks \gamma_{t,f}^{(i)} = p(z_{t,f} = i \mid \tilde{y}_{t,f}), which here correspond to posterior probabilities, are obtained using Bayes' rule:

\gamma_{t,f}^{(i)} = \frac{p(z_{t,f} = i) \, p(\tilde{y}_{t,f} \mid \theta^{(i)})}{\sum_{i'=1}^{I+1} p(z_{t,f} = i') \, p(\tilde{y}_{t,f} \mid \theta^{(i')})}. (22)

In a single-speaker scenario, where one just wishes to distinguish between target speaker and noise, one can use spatial clustering with I = 1. To name an example, the winning system of the CHiME 3 robust speech recognition challenge successfully employed such an unsupervised clustering approach with I = 1 on single-speaker recordings [71]. In the case of a multi-source scenario, I has to be set to the number of speakers in the mixture, which either has to be known a priori or estimated separately. The consecutive steps, e.g., beamforming as in Fig. 3, are then repeated for each speaker.
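To illustrate the EM iteration of Eqs. (21)-(22), the following sketch clusters the observations of a single frequency bin with a mixture of complex angular central Gaussian distributions [70]. The unit-norm normalization of the observation vectors, the fixed number of iterations, and the toy data are assumptions of this example, not prescriptions from the article.

```python
import numpy as np

def cacgmm_em_single_bin(Ytilde, num_classes, iterations=20, eps=1e-10):
    """EM for a mixture of complex angular central Gaussians on one frequency bin.
    Ytilde: (M, T) observations; returns posterior masks gamma of shape (num_classes, T)."""
    M, T = Ytilde.shape
    Z = Ytilde / np.maximum(np.linalg.norm(Ytilde, axis=0, keepdims=True), eps)  # unit-norm directional statistics
    pi = np.full(num_classes, 1.0 / num_classes)                                 # class priors p(z = i)
    B = np.stack([np.eye(M, dtype=complex) for _ in range(num_classes)])         # class shape matrices

    def quad_form(Bi):                    # z^H B^{-1} z for all frames
        return np.einsum('mt,mn,nt->t', Z.conj(), np.linalg.inv(Bi), Z).real

    for _ in range(iterations):
        # E-step: class log-likelihoods up to a shared constant, then Bayes' rule (Eq. (22))
        log_lik = np.stack([-M * np.log(np.maximum(quad_form(B[i]), eps))
                            - np.log(np.linalg.det(B[i]).real) for i in range(num_classes)])
        log_post = np.log(pi)[:, None] + log_lik
        log_post -= log_post.max(axis=0, keepdims=True)
        gamma = np.exp(log_post)
        gamma /= gamma.sum(axis=0, keepdims=True)

        # M-step: update priors and class-dependent shape matrices
        pi = gamma.mean(axis=1)
        for i in range(num_classes):
            weights = gamma[i] / np.maximum(quad_form(B[i]), eps)
            B[i] = M * (weights * Z) @ Z.conj().T / np.maximum(gamma[i].sum(), eps)
    return gamma

# Toy usage: two "sources" simulated by two random steering vectors plus a little noise.
rng = np.random.default_rng(3)
M, T = 4, 500
a1, a2 = rng.normal(size=(2, M)) + 1j * rng.normal(size=(2, M))
Y = np.concatenate([a1[:, None] * rng.normal(size=(1, T // 2)),
                    a2[:, None] * rng.normal(size=(1, T // 2))], axis=1)
Y += 0.05 * (rng.normal(size=(M, T)) + 1j * rng.normal(size=(M, T)))
masks = cacgmm_em_single_bin(Y, num_classes=2)
print(masks.shape)  # (2, 500)
```

Since such a model is fitted independently for each frequency bin, it also exhibits the frequency permutation problem discussed in Sec. III-C3.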
2) Neural network-based mask estimation:
In contrast, mask estimation networks are trained with a supervision signal. To discuss neural network-based approaches, we first introduce a neural mask estimator as used in neural network-based beamforming. We then introduce SpeakerBeam as a speaker-informed mask estimator. Lastly, we introduce neural network-based blind source separation approaches.

For neural network-based mask estimation, a supervision signal such as an ideal binary mask (IBM) [72] is first extracted on each training mixture. To do so, one needs access to the speech images and the noise image, i.e., each individual speech component and the noise component at the microphones, separately:

\mathrm{IBM}_{t,f}^{(i)} = \begin{cases} 1, & \text{for } \|d_{t,f}^{(i)}\| > \|d_{t,f}^{(i')}\| \;\; \forall i' \neq i \\ 0, & \text{otherwise}, \end{cases} (23)

where i corresponds to the source index. This definition can be extended to an additional noise class by treating the oracle noise signal as d_{t,f}^{(I+1)} := \tilde{n}_{t,f}. Fig. 5 illustrates the underlying signal components and the corresponding IBMs with an additional noise class. Further definitions of oracle masks suitable for supervision can be found in, e.g., [73,74].

Fig. 5. Visualization of the spectrograms of the underlying images d_{t,f}^{(1)}, d_{t,f}^{(2)}, and \tilde{n}_{t,f} on the left and the ideal binary masks IBM_{t,f}^{(1)}, IBM_{t,f}^{(2)}, and IBM_{t,f}^{(3)} on the right. Bright colors indicate higher values.

Then, depending on the particular use-case, a neural network can be trained with such a supervision signal. The different use-cases are illustrated in Fig. 6.

Fig. 6. Processing flow for different use-cases of a mask estimator: (a) separate speech from noise (Sec. III-C2a), (b) extract a single speaker from a mixture (Sec. III-C2b), (c) separate multiple speakers from a mixture (Sec. III-C2c). The corresponding interference mask to calculate the interference covariance matrix in Eq. (20) is not shown for brevity.

a) Separate speech from noise: One can now train a neural network with, e.g., log-amplitude spectrogram features as input, to predict a speech mask and a noise mask by providing noisy speech training data from various speakers and the corresponding IBMs or clean speech signals. Since speech and noise have different spectro-temporal characteristics, a neural network can distinguish between these signals very well. An exemplary training criterion is the binary cross entropy between the estimated mask \gamma_{t,f}^{(i)} and the corresponding oracle IBM_{t,f}^{(i)}; a sketch of such a mask estimator is given below. This mask estimation procedure with a subsequent beamforming step led to dramatic WER reductions, e.g., in the CHiME 3/4 challenges [74,75]. Often, these masks are estimated independently for each channel and then pooled over all channels such that a single mask can be used, e.g., in Eq. (20).
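The following PyTorch sketch shows what such a mask estimator might look like: a small BLSTM predicting a speech and a noise mask, trained with the binary cross entropy against IBM targets. The architecture, layer sizes, and the random stand-in data are illustrative assumptions, not the exact configurations of [74,75].

```python
import torch
import torch.nn as nn

class MaskEstimator(nn.Module):
    """BLSTM mask estimator: log-magnitude spectrogram in, speech and noise masks out."""
    def __init__(self, num_bins=513, hidden=512):
        super().__init__()
        self.blstm = nn.LSTM(num_bins, hidden, batch_first=True, bidirectional=True)
        self.speech_head = nn.Linear(2 * hidden, num_bins)
        self.noise_head = nn.Linear(2 * hidden, num_bins)

    def forward(self, log_mag):                      # log_mag: (batch, frames, bins)
        h, _ = self.blstm(log_mag)
        return torch.sigmoid(self.speech_head(h)), torch.sigmoid(self.noise_head(h))

# One illustrative training step with random tensors standing in for features and IBMs.
model = MaskEstimator()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
bce = nn.BCELoss()

log_mag = torch.randn(8, 100, 513)                  # batch of 8 utterances, 100 frames each
ibm_speech = torch.randint(0, 2, (8, 100, 513)).float()
ibm_noise = 1.0 - ibm_speech

speech_mask, noise_mask = model(log_mag)
loss = bce(speech_mask, ibm_speech) + bce(noise_mask, ibm_noise)
loss.backward()
optimizer.step()
```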
b) Extract a single speaker from a mixture: In many practical applications one is interested in one target speaker in a mixture, e.g., the speaker who is actually interacting with the digital home assistant. When dealing with speech mixtures, simply training a neural network to extract the target speaker is not possible, because both the target speech and the interference signal have similar spectro-temporal characteristics. However, if additional information about the target speaker is available, a neural mask estimator can be informed about speaker-dependent characteristics. These characteristics may stem from a separate adaptation utterance or from the wake-up keyword. In the SpeakerBeam framework [76,77], a sequence-summarizing neural network [78] which captures the speaker-dependent characteristics is jointly trained with a mask estimation network which uses these characteristics as additional features to estimate a target speaker mask and an interference mask. VoiceFilter implements this approach with a Convolutional Neural Network (CNN) architecture [79].

c) Separate multiple speakers and noise:
While, in a single-speaker scenario, a mask estimator only needs to distinguish between speech and non-speech time-frequency bins (compare Sec. III-C2a), source separation approaches have to solve the following problem: given the observation, the algorithm should yield a mask for each speaker as well as an additional noise mask. For quite some time it has been complicated to do this with neural networks due to the permutation problem: while the order in which the speakers appear at the different output channels of the system is unpredictable, a loss function which assumes a particular order can result in misleading gradients. While the spatial clustering model in Eq. (21) is naturally permutation invariant (switching speaker indices does not change the likelihood), permutation-invariant losses for neural networks appeared only recently.

Kolbaek et al. formulated a way to turn any loss function, e.g., Cross Entropy (CE), into a permutation invariant loss function [26]: the original loss is calculated for every possible permutation. Then, only the minimal loss is used for back-propagation, e.g.:

J = \min_{\Pi} \sum_{i=1}^{I+1} \mathrm{CE}_{t,f}\bigl(\gamma_{t,f}^{(\Pi(i))}, \mathrm{IBM}_{t,f}^{(i)}\bigr), (24)

where \Pi is a permutation of (1, ..., I+1). A neural network with I+1 mask outputs can now be trained with such a Permutation Invariant Training (PIT) loss; a sketch is given below. The estimated masks \gamma_{t,f}^{(i)} can then be used, e.g., for beamforming. In its original formulation, the network architecture of a PIT system depends on the maximum number of speakers expected in a mixture. The system can be trained in such a way that some output channels are empty when there are fewer speakers.

Fundamentally differently, Deep Clustering, while pioneering this area, used a neural network to calculate embedding vectors for each time-frequency bin [24]. The loss, as any typical embedding loss, is designed in such a way that the embedding vectors belonging to the same class move closer together, while the embedding vectors of different classes move further apart. Naturally, such a formulation is permutation invariant in itself. The embedding vectors can then be used for clustering, yielding masks in a similar way as explained in the clustering approach before. Interestingly, at least the embedding network is then independent of the number of speakers in a mixture [24].
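A PIT loss in the spirit of Eq. (24) can be sketched in a few lines. For brevity, this illustration selects a single permutation per mini-batch, whereas utterance-level PIT selects the best permutation per utterance.

```python
import itertools
import torch

def pit_bce_loss(estimated_masks, target_masks):
    """Permutation Invariant Training loss: evaluate the loss for every
    output/target assignment and keep the minimum (cf. Eq. (24)).
    estimated_masks, target_masks: (batch, num_outputs, frames, bins)."""
    bce = torch.nn.BCELoss()
    num_outputs = estimated_masks.shape[1]
    losses = []
    for perm in itertools.permutations(range(num_outputs)):
        losses.append(sum(bce(estimated_masks[:, perm[i]], target_masks[:, i])
                          for i in range(num_outputs)))
    return torch.stack(losses).min()

# Toy usage: two speakers plus a noise class (three outputs), random stand-in data.
est = torch.rand(4, 3, 100, 257)
tgt = torch.randint(0, 2, (4, 3, 100, 257)).float()
print(pit_bce_loss(est, tgt))
```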
3) Comparison of spatial and spectral approaches and integrations thereof:
The main advantage of spatial clustering models over neural network-based mask estimation is the interpretability of the underlying stochastic dependencies. Closely related, this interpretability allows incorporating a priori knowledge by modifying the parameter updates; e.g., [80] uses externally provided time annotations for the CHiME 5 database. Due to the spatial features, this approach exploits spatial selectivity and, as long as the spatial properties of each source are distinct enough, is able to produce meaningful separation results. Since no training phase is involved, the unsupervised clustering approach naturally generalizes well to unseen conditions.

One drawback of the spatial clustering approaches is that they are most suited for offline processing. Although quite a few online or block-online clustering approaches have been proposed, these have not yet found much application in far-field ASR challenges. Moving sources, if no online algorithm is used, can only be handled to some extent: small head movements can still be captured in the class-dependent parameters. Larger movements, however, invalidate the underlying model assumptions. Further, since clustering is often performed independently across frequency bins, a frequency permutation problem arises [81]: from one frequency bin to another, the spatial clustering solution may have resulted in switched speaker indices. This frequency permutation problem is independent of the global permutation problem mentioned in the discussion of PIT.

In contrast to the spatial clustering approaches, neural network-based approaches rely on spectral cues and process all frequency bins jointly. Therefore, a frequency permutation problem does not occur. Quite remarkably, the neural network-based separation models learn relations from training databases and tend to perform better with an ever increasing amount of training data.

However, alongside this comes their biggest limitation: depending on the variability of the training data, the models have limited generalizability to unseen conditions; e.g., Yu et al. demonstrated that the performance already degrades significantly when switching from English to Danish [82]. The training corpus needs to contain the mixed speech as well as access to the clean sources to be able to compute gradients. A notable exception are unsupervised approaches to train a neural network-based source separator [83]–[85]. Further, most neural network-based approaches are single-channel. Even when multi-channel features are employed [86], in which way those contribute to better separation performance is far from understood.

These approaches are by no means mutually exclusive. Judging by the aforementioned advantages and disadvantages, both methods are highly complementary; e.g., [87] proposed to combine neural network-based mask estimation with spatial clustering for speech enhancement, while [55,88] proposed an integration of Deep Clustering and spatial clustering for multi-talker scenarios.
D. Front-end overview
The entire front-end system is now composed of dereverberation, mask estimation, and beamforming. An established configuration is depicted in Fig. 3. The optimal processing order, as demonstrated in [7] for conventional beamforming and in [89] for neural network supported beamforming, turns out to be applying WPE on the multi-channel signal first and then applying the beamforming step on the dereverberated signal.

Spatial clustering based source separation approaches profit in particular from a preceding WPE dereverberation (see the experimental results in [80]), since the sparseness assumption, which implies that different speakers populate different TF bins, is much better fulfilled for less reverberant speech. Further experiments also report improved separation performance with neural network-based separation methods [90]. However, it is worth acknowledging that a publication which clearly tracks down the gains of better source separation due to better dereverberation is still missing.

In Fig. 3 the variance estimation network and the mask estimation network conceptually perform a similar task (at least in the single-speaker scenario). Thus, it might be worth investigating whether both models can be fused into a single model with two different outputs. Further, for practical reasons, the mask estimation network often operates on the observation signal y_{t,f} to avoid the need to train on dereverberation results.

From a machine learning perspective, it is worth highlighting that the building blocks in Fig. 3 are very differently motivated: the filtering blocks can be seen as structural priors motivated by an a priori understanding of field experts. The filter coefficient estimation blocks are derived analytically from separate optimization criteria, and the variance estimation neural network as well as the mask estimation neural network are trained independently with gradient descent on a separate training database. More recently, it has been demonstrated that the neural networks can also be trained with gradients from a downstream task [57,59] (compare Sec. V).

IV. ASR BACK-END
To achieve high ASR performance in a far-field scenario, we need to not only employ a powerful speech enhancement front-end but also carefully design the ASR back-end. The ASR back-end used for far-field ASR has essentially the same structure as a general back-end used for recognition of clean speech. Those interested can find an overview of legacy ASR systems in [91], while [15]–[17] describe general ASR in the era of deep learning. However, several elements need careful consideration when dealing with far-field ASR. In this section, we will first briefly review a general ASR back-end and then emphasize the key components and design choices that are most relevant for far-field ASR.

Fig. 7. Schematic diagram of a general ASR back-end.
A. Overview of a general ASR back-end
The goal of the ASR back-end is to find the most likely word sequence, \hat{v}, given a sequence of observed speech features O. Here, for generality, the speech features can be derived either from clean speech, microphone observations, or enhanced speech, as described in Section III. The task of ASR is formulated with Bayes decision theory as

\hat{v} = \operatorname*{argmax}_{v} \; p(v \mid O), (25)

where O = (o_1, ..., o_t, ..., o_T) is a sequence of speech features, o_t \in \mathbb{R}^D is a feature vector for frame t, v = (v_1, ..., v_j, ..., v_J) is a J-length word sequence, v_j \in \mathcal{V} is a word at position j, and \mathcal{V} is the set of possible words, called the vocabulary. Since it is complex to deal with p(v | O) directly, the problem is usually rewritten using the Bayes theorem as

\hat{v} = \operatorname*{argmax}_{v} \; p(O \mid v) \, p(v), (26)

where the likelihood function p(O | v) is called the acoustic model (AM) and the prior distribution p(v) is the language model (LM) [92]. Note that some recent end-to-end ASR systems described in Section V aim at directly modeling p(v | O).

Fig. 7 depicts a general ASR back-end with its main components, i.e., the feature extraction module, the AM and the language model, which are briefly described below.
1) Feature extraction:
The first component of an ASR back-end is a feature extraction module that converts the time domain signal \hat{d}[\ell] into speech features o_t more suitable for ASR. There has been a lot of research on designing robust features for ASR. However, the simple log-Mel filterbank (LMF) coefficients are widely used both for general and far-field ASR. LMF coefficients are obtained by computing the power spectrum of the time-domain signal using an STFT, then applying a Mel filterbank to emphasize the low-frequency components of the spectrum. Finally, the dynamic range is compressed using the logarithm operation as

o_{t,\nu} = \mathrm{Feat}\bigl([\hat{d}_{t,1}, \ldots, \hat{d}_{t,f}, \ldots, \hat{d}_{t,F}]^T\bigr) = \mathrm{MVN}\Bigl(\log\bigl(\sum_f b_{\nu,f} |\hat{d}_{t,f}|^2\bigr)\Bigr), (27)

where Feat denotes the feature extraction process. Further, \hat{d}_{t,f} is the STFT coefficient of the enhanced speech, F is the number of frequency bins, b_{\nu,f} represents the Mel filterbank coefficient associated with the \nu-th channel, and MVN(\cdot) represents the mean and variance normalization (MVN) operation. Note that, in general, the parameters of the STFT (window type, length and overlap) used for speech enhancement and recognition differ. Therefore, the speech enhancement front-end usually converts the signals back to the time domain before doing feature extraction for ASR. The features are often normalized with MVN to have zero mean and unit variance, using statistics computed either for each utterance or over the whole training data set.
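A minimal sketch of Eq. (27) is given below; it assumes librosa for the STFT and the Mel filterbank, and the parameter values (16 kHz sampling rate, 512-point FFT, 10 ms hop, 40 Mel channels) are common choices rather than values prescribed by the article.

```python
import numpy as np
import librosa  # assumed available; only used for the STFT and the Mel filterbank

def log_mel_features(x, sr=16000, n_fft=512, hop=160, n_mels=40):
    """Log-Mel filterbank features with per-utterance mean/variance normalization, cf. Eq. (27)."""
    spectrum = np.abs(librosa.stft(x, n_fft=n_fft, hop_length=hop)) ** 2   # power spectrum, (bins, frames)
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)        # b_{nu,f}, (n_mels, bins)
    log_mel = np.log(mel_fb @ spectrum + 1e-10)                            # (n_mels, frames)
    mvn = (log_mel - log_mel.mean(axis=1, keepdims=True)) / (log_mel.std(axis=1, keepdims=True) + 1e-10)
    return mvn.T                                                           # (frames, n_mels), i.e., o_t

# Toy usage on one second of random stand-in "audio".
o = log_mel_features(np.random.randn(16000).astype(np.float32))
print(o.shape)  # about (101, 40)
```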
2) Acoustic model:
The AM employs phonemes as basic units of speech sounds. In this section, we focus our discussion on Hidden Markov Model (HMM) based AMs, where each phoneme is associated with an HMM that models the dynamic evolution of speech within that phoneme [92,93]. (Note that other types of AMs, such as Connectionist Temporal Classification (CTC)-based AMs, are also becoming widely used [94,95].) An HMM representing the whole word sequence is constructed from several phoneme HMMs using a pronunciation dictionary to map each word to a phoneme sequence. HMM based AMs make the conditional independence assumption, according to which an observed feature vector only depends on the current state and is independent of neighboring HMM states. This gives the following expression for the likelihood:

p(O \mid v) = a_{\sigma_0, \sigma_1} \, p(o_1 \mid \sigma_1) \prod_{t=2}^{T} p(o_t \mid \sigma_t) \, a_{\sigma_{t-1}, \sigma_t}, (28)

where \sigma_t is an HMM state at time t, a_{\sigma_t, \sigma_{t+1}} is the transition probability between states \sigma_t and \sigma_{t+1}, a_{\sigma_0, \sigma_1} is the initial state probability, and p(o_t | \sigma_t) is the emission probability.

In legacy systems, the emission probability was modeled with a Gaussian Mixture Model (GMM). More recent systems use a Deep Neural Network (DNN) instead and are called DNN-HMM hybrid systems. Let g(o_t; \theta) be the \Sigma-dimensional softmax output vector of a DNN AM with parameters \theta, where \Sigma is the total number of HMM states, and g_\sigma(o_t; \theta) is the output associated with HMM state \sigma. g_\sigma(o_t; \theta) can be interpreted as a posterior probability p(\sigma | o_t), which can be converted into a pseudo likelihood using Bayes rule as [15]

p(o_t \mid \sigma) \propto \frac{p(\sigma \mid o_t)}{p(\sigma)} = \frac{g_\sigma(o_t; \theta)}{p(\sigma)}, (29)

where the prior probability p(\sigma) is derived from the statistics of the training data set.

There has been much research on designing appropriate network architectures for g_\sigma(o_t; \theta). The choice of a specific architecture foremost depends on the latency constraints during inference time and the amount of available training data. It is a fast-evolving research field, with new results claiming state-of-the-art performance due to often only slight modifications of the architecture being published almost on a weekly basis. Equally important is the choice of training hyperparameters and schemes. Both need extensive tuning for a fair comparison among architectures, but this is often not possible due to a limited compute budget. In general, solid baseline architectures are time delay neural networks (TDNNs) [96] or, more generally, convolutional neural networks (e.g. [97]), possibly followed by some (bi-directional) Long Short-Term Memory (LSTM) layers [98]. Variants of this architecture have been employed successfully in the latest CHiME challenges. Recently, architectures with self-attention [99], often referred to as transformers, have shown competitive results on several benchmark tasks [100,101].
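The conversion of Eq. (29) is a one-liner in practice. The following sketch (with made-up posteriors and priors) shows how DNN state posteriors would be turned into scaled log-likelihoods for HMM decoding.

```python
import numpy as np

def pseudo_log_likelihoods(posteriors, state_priors, eps=1e-10):
    """Convert DNN state posteriors p(sigma | o_t) into scaled likelihoods via Eq. (29):
    log p(o_t | sigma) = log p(sigma | o_t) - log p(sigma) + const.
    posteriors: (T, num_states) softmax outputs; state_priors: (num_states,) training-set frequencies."""
    return np.log(posteriors + eps) - np.log(state_priors + eps)

# Toy usage: random "posteriors" for 5 frames and 4 HMM states, made-up priors.
rng = np.random.default_rng(4)
post = rng.dirichlet(np.ones(4), size=5)            # rows sum to one, like a softmax output
priors = np.array([0.4, 0.3, 0.2, 0.1])             # relative state frequencies in the training data
print(pseudo_log_likelihoods(post, priors).shape)   # (5, 4)
```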
3) Language model:
The language model (LM) provides the prior probability of a word sequence. There exist N-gram LMs and neural LMs such as the Recurrent Neural Network (RNN) LM [102]. The LM is trained on a large text corpus and, unlike the other components of the ASR back-end, it is not affected by acoustic conditions such as noise or reverberation. It can thus be very effective to improve the performance of far-field ASR when the language is well constrained, such as for read speech tasks [71,103]. However, for conversational situations, it is more difficult to model the speech content and thus the LM appears less effective [104].
4) Training procedure:
Building an ASR back-end requires training the AM with speech training data and the associated transcriptions. The goal of the training is to find the DNN parameters θ which optimize a training criterion,

\hat{θ} = \arg\max_θ \sum_u C( g(O_u; θ), v_u ),   (30)

where C(·) is an objective function, and O_u and v_u are the sequence of feature vectors and words associated with the u-th utterance of the training set, respectively. By abuse of notation, g(O_u; θ) refers to the sequence of output vectors of the DNN AM with O_u at its input. The model parameters θ are learned by backpropagation.

Various criteria can be used for training the AM. The most basic criterion is the CE, which is given as [16]

C_{CE} = \sum_u \sum_t \sum_{σ=1}^{Σ} p(σ) \log g_σ(o_t; θ) = \sum_u \sum_t \log g_{\tilde{σ}_{u,t}}(o_t; θ),   (31)

where (\tilde{σ}_{u,τ})_{τ=1}^{T} is the HMM-state label sequence associated with the reference word sequence v_u. Because we use hard HMM-state labels, p(σ) = δ_{σ,\tilde{σ}_{u,t}}, where δ_{i,j} is the Kronecker delta. Thus, the CE takes the expression of the log-likelihood in Eq. (31) [16]. Besides, the sign of C_{CE} is opposite to the CE loss [16] because we defined the training as a maximization problem in Eq. (30). The HMM-state label sequence can be obtained from the transcription using forced alignment (see Section IV-B2). CE is a frame-level criterion that does not consider the whole context of the sequence in the loss computation, and thus differs from what is performed by the ASR decoding in Eq. (26).

Alternatively, sequence-level criteria have been proposed to better match the ASR decoding scheme, such as maximum mutual information (MMI) or segmental Minimum Bayes-Risk (sMBR) [16,105]. For example, MMI aims at directly maximizing the posterior probability,

C_{MMI} = \sum_u \log p(v_u | O_u; θ) = \sum_u \log \frac{p(O_u | v_u; θ) p(v_u)}{\sum_{v'} p(O_u | v'; θ) p(v')}.   (32)

The numerator represents the likelihood of the observed speech given the correct word sequence. It can be obtained from forced alignment as for CE. The denominator represents the total likelihood of the observed speech features obtained over all possible word sequences (i.e. all word sequences that could be obtained by recognizing the training utterance using the acoustic and language models). MMI is a sequence discriminative criterion that offers the possibility to make correct word sequences more likely by maximizing the numerator, while making all other word sequences less likely by minimizing the denominator. MMI and other sequence discriminative criteria have been shown to improve performance over CE [105]. However, the summation in the denominator makes MMI computationally complex. Recently, an efficient way to implement MMI, called lattice-free MMI, has been proposed [106]. It has become the standard for ASR and is also widely used for far-field ASR [104,107].
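For illustration, the frame-level CE criterion of Eq. (31) reduces to a standard cross-entropy loss over HMM-state targets. A minimal PyTorch sketch, where acoustic_model is a hypothetical DNN returning pre-softmax scores per frame:

import torch
import torch.nn.functional as F

def frame_ce_loss(acoustic_model, features, state_labels):
    """Frame-level cross-entropy training criterion (negative of Eq. (31)).

    features:     (T, feat_dim) tensor of log-Mel feature vectors o_t.
    state_labels: (T,) long tensor of HMM-state indices from forced alignment.
    """
    logits = acoustic_model(features)              # (T, num_states), pre-softmax
    return F.cross_entropy(logits, state_labels)   # averages -log g_{sigma_t}(o_t)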
B. Practical considerations for far-field ASR

1) Multi-condition training data:

To train the ASR back-end, we need training speech data and the corresponding word transcriptions. Training the ASR back-end on clean speech would expose it to too little variation of the acoustic conditions, which may severely affect its performance when exposed to far-field conditions. Indeed, the speech enhancement front-end cannot completely remove acoustic distortions caused by the environment. Therefore, to make the ASR back-end robust, it is usually trained with multi-condition data that cover many acoustic conditions, including various types and levels of noise, reverberation, etc.

It is very costly to collect and transcribe a large amount of speech data in various real environments. Consequently, it is common to resort to simulation to create far-field speech data. If we have access to a clean speech training corpus, creating far-field speech signals can easily be done by convolving clean speech signals with acoustic impulse responses and adding noise, as shown in the signal model of Eq. (1). The procedure to create multi-condition data is thus as follows:
1) Prepare a set of clean speech training data S^{Train}, noise samples N, and AIRs A.
2) For each clean training speech signal s^{Train} ∈ S^{Train}, create noisy and reverberant speech as

y^{Train}_m[ℓ] = (a_m * s^{Train})[ℓ] + n_m[ℓ], where (a_1, ..., a_m, ..., a_M) ∼ A, (n_1, ..., n_m, ..., n_M) ∼ N.   (33)

It is thus possible to create any amount of distant speech data by varying the AIRs and the type and level of noise (a minimal simulation sketch is given at the end of this subsection). The AIRs can be obtained from databases of AIRs measured in real environments [108]–[110], or artificially generated using the image method, which is a simple model of sound propagation in an enclosure [111,112]. With the image method, it is simple to generate far-field speech data in various rooms with different reverberation times and microphone/speaker positions. To add background noise, we can use several noise recording datasets [113], and increase the acoustic variations by changing the SNR.

The above data augmentation techniques affect only the acoustic environment. It is also possible to modify the speech signal itself by, e.g., modifying the speed of the audio signal [114].

Although simulated data can be used to create various acoustic conditions, some aspects cannot be well simulated, such as, e.g., head movements or the Lombard effect, i.e., the phenomenon that speech is articulated differently when uttered in heavy noise. It is thus usually beneficial to augment the training data with some amount of real recordings. Moreover, if multi-microphone recordings are available, using each microphone recording as a separate training sample can also help increase the acoustic variation [71].

Besides these data augmentation techniques that rely on physical models of speech or the room acoustics, there have been a number of approaches proposed recently to artificially augment training data without relying on physical models, e.g., by generating adversarial training examples [115]–[118]. Moreover, the recently proposed Spectral Augmentation (SpecAug) technique [119] has also been employed to increase the robustness of acoustic models for far-field ASR tasks [14,120]. It can also be combined with physically motivated augmentation, yielding significant improvements even for large scale data sets [119].

The usefulness of multi-condition training data covering various acoustic conditions has been demonstrated in various tasks and challenges [7,71,104], and in the development of commercial products [95]. Note, however, that using simulation to create such data can only increase the acoustic context seen during training but not the actual speech content (spoken words), which can be a limitation if the clean speech training corpus used as a basis for simulation is too small.

In theory, the impact of noise and reverberation on ASR could be largely mitigated by training acoustic models with a very large amount of training data that covers the acoustic variety seen during application. In such a case, the speech enhancement front-end could eventually become unnecessary. However, in many scenarios, the acoustic conditions can be so diverse that this would require a prohibitively large amount of transcribed training data. This is especially true if multiple microphones are available. There are a few studies that investigate the impact of data augmentation on far-field ASR with and without any front-end, but currently it remains unclear how much data would be sufficient to address a general far-field scenario [7,121]–[124].
Most studies suggest that an ASR back-end trained with data augmentation techniques alone cannot solve the far-field ASR problem, even when using a large amount of training data. For example, for the CHiME-5 challenge, a system trained with 4500 hours of training data [107] was outperformed by systems using 10 times less data [13,104]. Moreover, even when using a large amount of data to train the ASR back-end, higher performance is usually achieved when it is combined with a SE front-end, although for some systems the impact of the front-end may become small [95,121].
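The simulation step of Eq. (33) referred to above can be sketched as follows; the function name and the SNR-based noise scaling are illustrative assumptions rather than a prescribed implementation:

import numpy as np
from scipy.signal import fftconvolve

def simulate_far_field(clean, airs, noise, target_snr_db):
    """Create a multi-channel far-field training signal, cf. Eq. (33).

    clean: (num_samples,) clean speech signal s^Train.
    airs:  (num_mics, air_len) acoustic impulse responses a_m.
    noise: (num_mics, num_samples) noise recording n_m.
    """
    reverberant = np.stack([fftconvolve(clean, a)[: noise.shape[1]] for a in airs])
    # Scale the noise to reach the desired SNR (computed on the first channel).
    speech_power = np.mean(reverberant[0] ** 2)
    noise_power = np.mean(noise[0] ** 2) + 1e-10
    gain = np.sqrt(speech_power / (noise_power * 10 ** (target_snr_db / 10)))
    return reverberant + gain * noise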
2) HMM-state alignments:
As mentioned in the description of the training procedure, training the AM requires the HMM-state labels (\tilde{σ}_{u,τ})_{τ=1}^{T}. Such labels can be obtained by Viterbi forced alignment, which performs Viterbi decoding on the HMM constructed from the reference word sequence to obtain, for each observed speech feature in the utterance, the most likely HMM state, thus performing a time alignment of the input speech and the HMM states [93].

Viterbi forced alignment can provide accurate alignments when using clean speech. However, when the observed speech is corrupted by noise, reverberation or other persons' voices, there may be alignment errors. For example, when the observed speech also contains speech of an interfering speaker, that speaker's speech may be mapped to HMM states of the utterance of the target speaker, which distorts the alignments [125]. Reverberation and noise also make it harder to correctly identify phoneme boundaries.

These problems can be mitigated if clean speech is available to compute the alignments, leading to more accurate HMM-state labels. For example, when using simulated far-field data, we can use the clean speech signals used to generate the training data to perform the alignment. With real recordings, it is sometimes possible to use a headset or lapel microphone synchronized with the distant microphone to obtain a cleaner version of the target speaker's speech that can provide more reliable HMM-state labels. The training procedure is thus as follows:
1) For each training utterance,
   a) construct the utterance HMM from the word labels and the pronunciation dictionary,
   b) compute the HMM-state alignments (\tilde{σ}^{clean}_{u,τ})_{τ=1}^{T} from the clean speech and the utterance HMM using Viterbi decoding.
2) Train the AM using, e.g., the cross entropy criterion as defined in Eq. (31),

C_{CE} = \sum_u \sum_t \log g_{\tilde{σ}^{clean}_{u,t}}(o^{noisy}_t; θ),   (34)

where o^{noisy}_t is the noisy speech training sample and \tilde{σ}^{clean}_{u,t} is computed from the clean training utterances.

Simply using clean speech for computing the alignments instead of the microphone signals can improve ASR performance by up to 10% when using CE for training [125,126]. Besides, using heuristics to filter out training utterances that could not be properly aligned can also be important [125]. Lattice-free MMI is less sensitive than CE to alignment errors. Moreover, the state alignment issue may not occur with other types of AM, such as CTC-based AMs, because they do not require HMM-state labels for their training.
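Forced alignment itself is a constrained Viterbi search over the left-to-right utterance HMM. The following NumPy sketch illustrates the idea for a single utterance; it omits transition scores and assumes the frame log-likelihoods for the concatenated reference states are already computed:

import numpy as np

def forced_alignment(log_likes):
    """Viterbi forced alignment for a left-to-right utterance HMM.

    log_likes: (T, S) frame log-likelihoods log p(o_t | sigma), where the S
               states are the concatenated phoneme HMM states of the utterance
               in their reference order (self-loop and forward transitions only).
    Returns the most likely state index for every frame (assumes T >= S).
    """
    T, S = log_likes.shape
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0, 0] = log_likes[0, 0]               # must start in the first state
    for t in range(1, T):
        for s in range(S):
            prev = score[t - 1, s]              # self-loop
            back[t, s] = s
            if s > 0 and score[t - 1, s - 1] > prev:
                prev = score[t - 1, s - 1]      # forward transition
                back[t, s] = s - 1
            score[t, s] = prev + log_likes[t, s]
    # Backtrack from the final state, which must be reached at the last frame.
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return np.array(path[::-1])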
3) Adaptation of the ASR back-end to the speech enhancement front-end:
The speech enhancement front-end does not fully remove the acoustic interference and may introduce artifacts, which causes a mismatch between the input speech signal and the AM that is trained using multi-condition training data. Several approaches can be used to mitigate such a mismatch. For example, we can process the far-field training data with the enhancement front-end and add this processed speech data to the unprocessed multi-condition training dataset, so that the AM is exposed to some enhanced speech during training. Note that in general using only enhanced speech for training the AM may reduce the acoustic variation observed during training and generate a weaker AM [127,128].

Alternatively, we can use the enhanced speech to adapt an already trained AM. For example, we can obtain an AM matched to the test conditions by retraining its parameters with adaptation data that is similar to the test conditions as

θ^{adapt} = \arg\max_θ \sum_u C( g(O^{adapt}_u; θ), \hat{v}_u ),   (35)

where O^{adapt}_u is the sequence of feature vectors of the u-th adaptation utterance, and \hat{v}_u is the word sequence associated with the adaptation utterance. We can use the training data processed with the speech enhancement front-end as adaptation data, in which case \hat{v}_u simply corresponds to the transcriptions. Alternatively, if the adaptation data has no transcriptions (as is the case in unsupervised adaptation), \hat{v}_u can be obtained by a first ASR decoding pass.

There may be much less adaptation data than training data, which makes the process prone to overfitting. In practice, overfitting can be mitigated by regularization techniques, early stopping, or only updating some parameters of the AM, such as the input layer [129,130]. Adaptation has been shown to consistently improve the performance of top systems in recent challenges by up to 10% relative word error rate reduction [71,131].
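A minimal PyTorch sketch of the adaptation in Eq. (35), updating only the input layer of an already trained AM; the attribute name input_layer and the hyperparameters are hypothetical:

import torch

def adapt_input_layer(acoustic_model, adapt_loader, num_epochs=3, lr=1e-4):
    """Adapt a trained AM on (possibly enhanced) adaptation data, cf. Eq. (35).
    Only the input layer is updated to limit overfitting.

    acoustic_model: trained DNN-HMM AM with an attribute `input_layer`
                    (hypothetical name), returning pre-softmax scores per frame.
    adapt_loader:   iterable of (features, state_labels) mini-batches of frames,
                    with labels from the transcriptions or a first decoding pass.
    """
    for p in acoustic_model.parameters():
        p.requires_grad = False
    for p in acoustic_model.input_layer.parameters():   # update input layer only
        p.requires_grad = True
    optimizer = torch.optim.Adam(acoustic_model.input_layer.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(num_epochs):
        for features, labels in adapt_loader:
            optimizer.zero_grad()
            loss = loss_fn(acoustic_model(features), labels)
            loss.backward()
            optimizer.step()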
4) Joint-training:
The above adaptation technique only adjusts the AM of the ASR back-end to the speech enhancement front-end. However, the speech enhancement front-end is usually optimized for a criterion that is not directly related to ASR. Recent works have explored a tighter integration of the speech enhancement front-end and ASR back-end, enabling optimization of the front-end for the ASR criterion [59,132,133]. This is relatively easy to realize because both the front-end and the back-end use neural networks, and therefore it is possible to combine them into a single neural network with learnable and fixed computational nodes. Both systems can then be jointly optimized with backpropagation as

\hat{θ} = \arg\max_θ \sum_u C( g( Feat( Enh(Y_u; θ^{enh}) ); θ^{am} ), v_u ),   (36)

where Enh(·) represents the processing of the enhancement front-end, Y_u represents the multi-channel STFT coefficients for a training utterance, and θ = {θ^{am}, θ^{enh}} are the model parameters of the AM and front-end, respectively.

Fig. 8 shows an example of a joint-training scheme that combines a beamforming based front-end with the AM of the ASR back-end [59,132,133]. The mask estimation DNN of the front-end and the DNN of the AM are the learnable components of the system. They are interconnected with fixed computational blocks that consist of the beamformer computation (see Sec. III-B) and the feature extraction (see Sec. IV-A1). The gradient can flow from the AM to the speech enhancement front-end, which enables optimization of the front-end for ASR.

Fig. 8. Schematic diagram of the joint training of the speech enhancement front-end and ASR back-end (learnable blocks: mask estimation DNN and acoustic model DNN; fixed blocks: beamforming and feature extraction; the gradient of the ASR loss flows back to the mask estimation DNN).

We have discussed the joint-training scheme with a beamforming front-end, but joint training has also been used for dereverberation [57] and source separation/extraction [77]. Significant ASR gains have been reported on several tasks with joint training schemes. However, joint training can sometimes lead to a performance drop because it may weaken the AM [133].

One advantage of joint training is that the whole system can be optimized using only far-field speech and the associated word transcriptions. Therefore, it alleviates the need for parallel clean and far-field speech data to train the speech enhancement front-end, which may be an advantage when training or adapting systems with real recordings.
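A compact sketch of one training step for the pipeline of Fig. 8 / Eq. (36) is given below. All modules (mask_net, acoustic_model) and the fixed, differentiable helpers (beamform, log_mel) are hypothetical placeholders standing in for the blocks described in Sections III-B and IV-A1:

import torch

def joint_training_step(mask_net, acoustic_model, stft_batch, state_labels,
                        optimizer, beamform, log_mel):
    """One joint-training step for the pipeline of Fig. 8 / Eq. (36).

    mask_net, acoustic_model: learnable DNNs (hypothetical modules).
    beamform, log_mel: fixed, differentiable computations implementing a
                       mask-based beamformer and the features of Eq. (27).
    stft_batch:   (batch, mics, frames, freqs) complex STFT tensor Y_u.
    state_labels: (batch, frames) long tensor of HMM-state targets.
    """
    optimizer.zero_grad()
    masks = mask_net(stft_batch)                  # speech/noise masks per TF bin
    enhanced = beamform(stft_batch, masks)        # (batch, frames, freqs), complex
    features = log_mel(enhanced.abs() ** 2)       # differentiable feature extraction
    logits = acoustic_model(features)             # (batch, frames, num_states)
    loss = torch.nn.functional.cross_entropy(
        logits.flatten(0, 1), state_labels.flatten())
    loss.backward()                               # gradients flow into mask_net
    optimizer.step()
    return loss.item()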
V. TOWARD FAR-FIELD END-TO-END ASR

This section describes the recent efforts towards end-to-end solutions, which allow optimizing all components of the front-end speech enhancement and back-end speech recognizer jointly. This optimization is performed with respect to our final objective, the Bayes decision rule, as introduced in Eq. (25).
A. End-to-End ASR
End-to-end ASR approaches directly model the output distribution p(v | O) over the character, subword, or word sequence v = (v_1, ..., v_J), given the speech feature sequence O = (o_1, ..., o_T). This is quite different from conventional approaches to ASR [92], composed of the acoustic model p(O | v) and language model p(v), as discussed in Section IV-A. End-to-end models subsume all of these components in a single neural network, which greatly simplifies the model building process and also enables joint training of the whole system. End-to-end neural speech processing has become a popular alternative to conventional ASR, and several approaches have been proposed, including CTC [94], attention-based encoder-decoder models [134,135], and their variants [136,137].

For example, attention-based methods start from Bayes decision theory, similar to Section IV, but do not use any conditional independence assumption, and simply factorize the posterior probability p(v | O) based on the probabilistic chain rule and the attention mechanism, as follows:

p(v | O) = \prod_j p(v_j | v_{1:j-1}, O) = \prod_j p(v_j | v_{1:j-1}, c_j; θ^{dec}),   (37)

where v_{1:j-1} = (v_1, ..., v_{j-1}) is the subsequence of v representing the word history before word v_j. c_j is called a context vector, obtained at each token position j and extracted from the input speech feature sequence O based on the attention mechanism, which we will explain below. p(v_j | v_{1:j-1}, c_j; θ^{dec}) is computed with a neural network called a decoder network, with its set of model parameters θ^{dec}, which can generate a token v_j given the history v_{1:j-1} and a context vector c_j. The decoder network is often represented as an LSTM model with hidden state vector z_j for each token position j.

To obtain the context vector c_j in Eq. (37), we first focus on an input feature conversion based on an encoder network. The encoder network takes the original speech feature sequence O as input and converts it to a high-level hidden vector sequence O^{enc} = (o^{enc}_1, ..., o^{enc}_{T'}), as follows:

O^{enc} = Enc(O; θ^{enc}),   (38)

where θ^{enc} is the set of model parameters of the encoder network. (In general, the length of the encoder output sequence T' is shorter than the length of the original sequence T, i.e., T' < T, due to subsampling.) We often use bi-directional LSTM (BLSTM) or self-attention models as an encoder network.

Given O^{enc}, an attention mechanism produces a context vector c_j for each token v_j as follows [134]:

c_j = Att(O^{enc}, z_{j-1}; θ^{att}),   (39)

where z_{j-1} is the hidden state vector of the decoder network. Att(·) is an attention network with a set of model parameters θ^{att}, which first computes the attention weight ζ_{jt} ∈ [0, 1] given the encoder output vector o^{enc}_t and the decoder hidden vector z_{j-1} obtained in the previous output time step [134], as follows:

ζ_{jt} = f^{att}(o^{enc}_t, z_{j-1}),   (40)

where f^{att}(·) is a function to produce the attention weight, which can be a dot product or a neural network-based operation with trainable parameters.
ζ_{jt} satisfies a sum-to-one condition across the input frames, i.e., \sum_{t=1}^{T'} ζ_{jt} = 1. Given the attention weights ζ_{jt} in Eq. (40), the context vector c_j is obtained as a weighted sum of the encoder output sequence O^{enc}, i.e.,

c_j = \sum_{t=1}^{T'} ζ_{jt} o^{enc}_t.   (41)

Note that Eq. (41) can perform a conversion between two representations with different time scales (input time t and output token position j) through the soft alignment realized by the weighted summation. For example, Fig. 9 depicts the attention mechanism based on Eq. (41). The bold lines correspond to the higher attention weights, and the attention mechanism obtains a soft alignment between these input and output vectors. This is different from the alignment process in conventional ASR, which is based on HMMs, as discussed in Section IV-A2.

Fig. 9. The attention mechanism to compute the alignment between input encoder vector o^{enc}_t at frame t and output context vector c_j at token j. ζ_{jt} denotes the attention weight; bold lines correspond to higher attention weights, and the attention mechanism obtains a soft alignment between input and output vectors.

The forward computation of attention-based end-to-end ASR proceeds as follows:
1) Encoder processing: O^{enc} = Enc(O; θ^{enc})
2) For each token position j,
   a) compute c_j = Att(O^{enc}, z_{j-1}; θ^{att}),
   b) obtain p(v_j | v_{1:j-1}, c_j; θ^{dec}).

Figure 10 shows the entire encoder-decoder neural network with an attention mechanism. Note that the history subsequence v_{1:j-1} is obtained from the reference transcription during training and from the prediction results during decoding. All of these steps are differentiable, and we can estimate the model parameters θ = {θ^{enc}, θ^{att}, θ^{dec}} by maximizing the following log-likelihood, similar to Eq. (30),

\hat{θ} = \arg\max_θ \sum_u \log p(v_u | O_u; θ).   (42)

Thus, the attention-based encoder-decoder network represents the entire ASR process with a single neural network, and can be trained in an end-to-end manner, unlike the conventional HMM-based ASR system.

Fig. 10. Attention-based encoder-decoder network: a BLSTM encoder maps o_1, ..., o_T to o^{enc}_1, ..., o^{enc}_{T'}, and an attention-equipped LSTM decoder generates the tokens v_j. The attention is controlled by the decoder LSTM state.

Alternatively, a transformer architecture, which was originally proposed in neural machine translation [99] to replace RNNs with self-attention networks, has been used as a variant of attention-based methods for ASR [100].
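At its core, the attention mechanism of Eqs. (40)-(41) is a softmax-normalized similarity followed by a weighted sum. A minimal PyTorch sketch with a dot-product f^{att}, assuming for simplicity that the encoder outputs and the decoder state share the same dimension:

import torch

def attention_context(encoder_outputs, decoder_state):
    """Dot-product attention, cf. Eqs. (40)-(41).

    encoder_outputs: (T_enc, enc_dim) hidden vectors o^enc_t.
    decoder_state:   (enc_dim,) previous decoder hidden state z_{j-1}.
    Returns the attention weights zeta_{jt} and the context vector c_j.
    """
    scores = encoder_outputs @ decoder_state            # f_att as a dot product
    weights = torch.softmax(scores, dim=0)              # sum-to-one over frames
    context = weights @ encoder_outputs                 # weighted sum, Eq. (41)
    return weights, context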
B. Multi-Channel End-to-End ASR
The straightforward extension of this methodology to far-field speech recognition is to combine all speech enhancement modules and ASR in a single neural network to enable joint optimization [138,139]. This method can be regarded as an extension of the joint-training methods [59,132,133] of multi-channel speech enhancement and acoustic modeling discussed in Section IV-B4. Following Eq. (37), multi-channel end-to-end ASR directly models the posterior distribution p(v | Y), given the sequence of multi-channel STFT signals Y = ([y^T_{1,1}, ..., y^T_{1,F}], ..., [y^T_{t,1}, ..., y^T_{t,F}], ...):

p(v | Y) = \prod_j p(v_j | v_{1:j-1}, Y) = \prod_j p(v_j | v_{1:j-1}, \hat{O}),   (43)

where

\hat{O} = Feat(Enh(Y; θ^{enh})).   (44)

Enh(·) corresponds to the multi-channel enhancement with a set of parameters θ^{enh}, and Feat(·) denotes the standard speech feature extraction to produce an enhanced speech feature sequence \hat{O}. Both were introduced in the joint training of speech enhancement and recognition in Eq. (36). As an instance of the multi-channel enhancement function, [138] uses BLSTM mask-based beamforming [18,75], as described in Section III. This model is trained with an end-to-end ASR objective (cross entropy given the reference transcriptions v_u for utterance u) as follows:

\hat{θ} = \arg\max_θ \sum_u \log p(v_u | Y_u; θ),   (45)

where the model parameters θ consist of the parameters of the enhancement, encoder, attention and decoder networks,

θ = {θ^{enh}, θ^{enc}, θ^{att}, θ^{dec}}.   (46)

Compared with the standard end-to-end ASR training in Eq. (42), the multi-channel extension can jointly estimate both the ASR model parameters and the enhancement parameters θ^{enh} in an end-to-end manner. Note that this model can be trained without requiring any parallel data (pairs of clean and noisy speech data), as described in Section III, or any other intermediate HMM state/phoneme alignments, in contrast to the standard acoustic model training described in Eq. (34). End-to-end joint training thus allows training the enhancement parameters with real far-field data, for which clean reference signals are usually not available. The only requirement is the availability of the transcription of the far-field data, which is always required for ASR training based on supervised learning.

There are several variants and extensions of multi-channel end-to-end ASR, including
• attention-based channel/array selection [140,141],
• incorporation of a dereverberation component [142],
• extension to multi-speaker ASR [143],
• extension to target speech extraction [144,145].

Although end-to-end approaches are promising, they do not yet reach the performance of current state-of-the-art far-field ASR systems. The main reason is that these solutions tend to require larger amounts of training data, which, in the case of multi-channel far-field recordings, may not always be available. However, there has been a lot of progress in end-to-end ASR, including extensive investigations of training methods and architectures [146,147], robust training based on data augmentation [119], and new architectures based on the transformer model [100,148].
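A sketch of the corresponding loss computation for one batch is given below; enh_net, encoder, decoder and log_mel are hypothetical placeholder modules, the decoder is assumed to return log-probabilities under teacher forcing, and the token sequences are assumed to start with a start-of-sentence symbol:

import torch

def multichannel_e2e_loss(enh_net, encoder, decoder, stft_batch, token_ids, log_mel):
    """Compute -log p(v | Y) for multi-channel end-to-end ASR, cf. Eqs. (43)-(45).

    enh_net:  neural front-end Enh(.) (hypothetical module) mapping the
              multi-channel STFT Y to a single-channel STFT estimate.
    encoder, decoder: attention-based encoder-decoder of Section V-A.
    stft_batch: (batch, mics, frames, freqs) complex STFT tensor.
    token_ids:  (batch, out_len) reference token sequence v_u incl. start symbol.
    """
    enhanced = enh_net(stft_batch)                    # Enh(Y; theta_enh)
    features = log_mel(enhanced.abs() ** 2)           # Feat(.), Eq. (44)
    encoded = encoder(features)                       # O^enc, Eq. (38)
    log_probs = decoder(encoded, token_ids[:, :-1])   # teacher forcing on v_{1:j-1}
    loss = torch.nn.functional.nll_loss(
        log_probs.flatten(0, 1), token_ids[:, 1:].flatten())
    return loss                                       # maximizing Eq. (45) = minimizing loss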
VI. SUMMARY AND REMAINING CHALLENGES
A. Summary
This paper emphasizes that multi-channel speech enhancement is an essential component for far-field ASR, and provides a comprehensive description of state-of-the-art enhancement techniques in Section III. The combination of powerful signal processing with deep learning significantly boosted the performance compared to earlier signal processing-only solutions. This trend of solving a problem with signal processing supported by a neural network is not often seen in other applications of deep learning. Consider, for example, computer vision, where an entire signal processing pipeline has been replaced with a very deep network. The main reason for this unique approach in speech enhancement is that well-established physical models exist, which can be viewed as regularizers when devising a deep learning solution. We can thus minimize the size of the neural networks and make multi-channel speech enhancement work robustly with a relatively small amount of training data.

The main focus of the description of the back-end ASR system in Section IV is on how to make use of deep learning techniques in ASR acoustic models for the far-field ASR scenario. This includes techniques like data augmentation, refinement of supervisions, and adaptation. Note that, unlike speech enhancement, ASR is not based on a solid physical model describing human speech perception and recognition, while at the same time single-channel data in the order of thousands of hours has become available even in an academic research setting. This is why pure deep learning based solutions excel at ASR. Overall, the fusion of neural network-supported signal processing in the front-end and the massive use of deep learning in the back-end has made far-field ASR so reliable that it entered the consumer market with products like digital home assistants.

This paper also introduced the new research paradigm of jointly modeling front-end speech enhancement and back-end ASR acoustic models in Section IV-B4. Section V further extended this joint training scheme towards the emergent end-to-end ASR framework. The underlying idea of both approaches is to strictly follow the established far-field ASR pipeline described above, but to represent it with a single neural network so that we can perform backpropagation to train both speech enhancement and recognition jointly. Currently, joint training and end-to-end approaches have not yet become as mainstream as the pipeline approach, due to their complex network architecture and the lack of a sufficient amount of multi-channel far-field training data. However, we believe that these approaches have a lot of potential to provide further breakthroughs in far-field ASR, and we put emphasis on describing them as our most important on-going and future research directions.
B. Remaining challenges
The following subsections list remaining challenges in far-field ASR. For some of them, including voice activity detection and speaker diarization, there exist well-established solutions in a clean speech environment, while they remain challenging in far-field conditions.
• Voice Activity Detection (VAD) (also called Speech Activity Detection (SAD)) is an essential technique to segment continuous audio signals in on-line streaming ASR, or long audio recordings in off-line ASR, into utterances of manageable length (up to, say, a dozen seconds). Traditionally, energy-based VAD or likelihood based solutions [149] have been used. However, these methods face significant degradation in low SNR conditions. Learning based methods, especially RNN-based ones, combined with data augmentation techniques as described in Section IV-B1, have become popular [150,151], because they can detect speech activity regions by non-linear feature mapping even in the presence of low SNR. There are also several challenge activities, including OpenSAD. Further note that VAD-related challenge activities are also included in the speaker diarization challenges, see the next item.
• Speaker diarization: Speaker diarization can be regarded as an extension of VAD to multi-speaker recordings, which provides speaker identities or speaker cluster assignments for each utterance from unsegmented audio signals, i.e., it provides information about “who speaks when” [152]. Recently, speaker diarization has received increased attention because the focus of the ASR research community is shifting more and more towards recognition of multi-speaker recordings such as conversations or meetings. The interest in diarization is boosted by several challenge activities, including DIHARD (https://coml.lscp.ens.fr/dihard/2018/index.html) and CHiME-6 (https://chimechallenge.github.io/chime6/). There are two main technologies, depending on whether single-channel or multi-channel data is available. When we have multi-channel audio signals, source speaker locations can be estimated based on beamforming, and this can in turn be exploited to provide diarization information [152,153]. In the single-channel case, people use speaker embeddings, such as the i-vector [154] or x-vector [155], to map a speech utterance into a fixed dimensional vector, and then perform clustering on the obtained embedding vectors (e.g., agglomerative hierarchical clustering (AHC) [156,157]). VAD is used as an initial module in the speaker diarization pipeline to segment the recordings into manageable utterances. However, most single-channel techniques cannot explicitly handle regions of speech where more than one speaker is active. But such overlap regions are common in real conversations [4]. A combination of speech separation, speaker counting, and diarization based on neural networks [158], and permutation-free neural diarization based on multiple label classification [159], are promising directions to tackle regions of overlapped speech.
• On-line processing: Another challenge of far-field speech processing is on-line, low-latency processing, which is mandatory when used in a spoken language interface. It also has some benefits in dynamic environments, when, e.g., moving sources have to be tracked, see the next bullet in this list. Speech enhancement techniques often require estimating signal statistics across frames, such as the spatial covariance matrix Φ for beamforming used in Eq. (20) and the MCLP coefficient matrix C for dereverberation, Eqs. (12) and (13).
If low latency is required, this statistics computation must be performed in an on-line manner, often based on recursive update equations, e.g., by a linear interpolation between previously estimated statistics and the current observations (a minimal sketch of such a recursive update is given after this list). Online processing for mask-based beamforming is discussed in [160,161]. [95] gives an overview of the development of the Google Home device and describes several online techniques [47], especially for dereverberation. [56,58] realize online WPE dereverberation with the help of DNN-based time-varying variance estimation.
• Dynamic environments: moving sensors and sources: Acoustic environments change over time due to nonstationary noise, moving sources or moving sensors. For example, the participants recorded in the CHiME-5 data set are moving from room to room [4], and front-end processing has to track such moving sources accordingly. In addition, with wearable microphones and in moving robot scenarios [162], we should also take moving microphones into consideration. In these situations, on-line processing as discussed above is necessary to deal with the adaptive estimation of enhancement filters (beamforming, dereverberation). Recently, there has been a challenge activity, the LOCATA challenge [163], on locating and tracking moving sources. Although this challenge mainly focuses on acoustic source localization and not on speech enhancement and recognition, its designs of dynamic environments and the defined evaluation metrics for source tracking are a good reference for tackling far-field speech recognition in dynamic environments.
• More natural conversations and spontaneous speech: Our conversations are often spontaneous, and speech characteristics are quite variable and complex. For example, in the dinner party scenarios of CHiME-5 [4] and the Santa Barbara corpus [164], we often observe very different speaking durations, volumes, and speaking styles during the conversation. Such variable speech characteristics make the statistical properties of the source signals complex and render the estimation of speech statistics harder. In addition, the spoken content is grammatically less regular due to filler words, mispronunciation, stammering, etc., which makes ASR quite challenging from both acoustic and language model perspectives. Finally, such conversations are challenging from the data collection and annotation perspectives, because the preparation of precise transcriptions is difficult.
• Improving signal extraction with semantic and syntactic context information: A human's ability to track a conversation in acoustically adverse conditions (e.g., in a cocktail party) can in part be attributed to the use of context information about the discussion topic, our “world knowledge” and syntactic constraints we are aware of. Only few works exist towards utilizing high-level guidance for the low-level signal extraction tasks. In [165], speech separation is improved by feeding back deep features extracted from an end-to-end ASR system to cover the long-term dependence of phonetic aspects, while sound separation is improved in [166] by utilizing sound classification results. Exploring ways to support front-end processing with back-end knowledge appears to be a promising way to improve overall system performance.
• Distributed microphone setup: In many application scenarios, including smart homes [4,167], wearable computing, and human-to-robot communication [162], distributed microphones can be of advantage, compared to a single spatially concentrated microphone array.
However, the challenge with distributed microphones is that their spatial locations are often a priori unknown and may change over time. Furthermore, the microphone characteristics can be different, e.g., if both mobile phones and desktop microphones are part of the network. Finally, and most importantly, the sampling rates of the microphones are in general not synchronized. These properties often break important assumptions made in conventional front-end processing, and thus standard beamforming and dereverberation techniques cannot be straightforwardly applied. However, there exist several studies that tackle the distributed microphone setup, including [168]–[172], by solving the synchronization problem to make beamforming work in this setup. There are also many works on distributed beamforming, e.g., [173]–[175], to avoid collecting all signals at a central processing node. Active microphone (subset) selection instead of fusing the signals of multiple microphones is another simple yet effective approach [176,177]. Also, late fusion techniques (acoustic model fusion [103] or hypothesis fusion [107] in ASR) instead of signal-level fusion can be a viable alternative, thanks to the relative insensitivity of acoustic models to synchronization errors.
• Multimodality: A final challenge in far-field speech recognition is the use of multimodal information, including video, accelerometer, biosignals and so on. Such information can be complementary to the audio signals and robust against acoustic noise, and thus the fusion can bring benefits. In particular, audio-visual speech recognition gains a lot of attention, as the video channel can provide the speaker location information for steering an acoustic beamformer. Furthermore, visual features can complement the audio features for noise robust speech recognition [178]. However, visual or other multimodal data have their own distortions (e.g., brightness and frame-out issues of the image), and synchronization across different modalities is yet another challenge.
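As referred to in the on-line processing item above, a frame-recursive estimate of a spatial covariance matrix can be obtained by linear interpolation between the previous estimate and the current (masked) observation. A minimal NumPy sketch; the forgetting factor and the simple mask weighting are illustrative choices:

import numpy as np

def update_spatial_covariance(phi_prev, y_frame, mask, alpha=0.95):
    """Recursive (frame-online) update of a spatial covariance matrix, as used
    for low-latency mask-based beamforming.

    phi_prev: (F, M, M) covariance estimate from the previous frame.
    y_frame:  (F, M) multi-channel STFT coefficients of the current frame.
    mask:     (F,) speech (or noise) presence mask for the current frame.
    alpha:    forgetting factor of the linear interpolation.
    """
    outer = np.einsum('fm,fn->fmn', y_frame, y_frame.conj())
    return alpha * phi_prev + (1 - alpha) * mask[:, None, None] * outer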
VII. TO PROBE FURTHER
Open-source implementations are available for most of the described techniques and provide a good starting point for a more hands-on experience.

A Python implementation of the WPE algorithm described in Sec. III-A, based on NumPy and TensorFlow, is provided by NaraWPE (https://github.com/fgnt/nara_wpe) [179]. The Matlab implementation originally used in [21,46] is available as pcode. For beamforming as described in Sec. III-B, two different Python implementations are provided: NN-GEV (https://github.com/fgnt/nn-gev) focuses on neural network-based mask estimation and subsequent beamforming, while PB-BSS (https://github.com/fgnt/pb_bss) focuses on spatial clustering-based Speech Presence Probability (SPP) estimation. Other useful toolkits implementing dereverberation and beamforming techniques include the BTK toolkit (https://distantspeechrecognition.sourceforge.io/) and Pyroomacoustics (https://github.com/LCAV/pyroomacoustics) [180]. The latter also allows simulating acoustic scenarios to generate data.

An overview of selected implementations is given in Table II, while databases are listed in Table I. To visualize the effect of far-field speech and the improvements achieved for several acoustic scenarios, Fig. 11 depicts the ASR performance transition of the CHiME and REVERB challenges, from the challenge baseline at the release of each challenge, over the challenge best system, to the challenge follow-up studies. By referring to Fig. 11 and the corresponding acoustic scenarios in Table I, one can monitor the improvement on various far-field ASR problems.

Fig. 11. The WER transitions of far-field ASR systems based on the REVERB and CHiME-3/4/5/6 challenge results, from their baseline systems, over the challenge best systems, to the follow-up studies.
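As a hands-on illustration, dereverberating a multi-channel recording with NaraWPE typically only takes a few lines. The snippet below is a heavily hedged sketch: the array layout and the argument names follow the repository's example notebooks at the time of writing and should be checked against the current API:

import numpy as np
from nara_wpe.wpe import wpe   # pip install nara_wpe

# Y: complex STFT of the microphone signals, arranged as
# (frequency bins, channels, frames) -- the layout used in the repository's
# example notebooks (an assumption; check the current documentation).
Y = np.load('multichannel_stft.npy')          # hypothetical input file
# 10 filter taps and a prediction delay of 3 frames are common example values.
Z = wpe(Y, taps=10, delay=3, iterations=3)    # dereverberated STFT, same shape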
Note that many of these ASR results can be reproduced using publicly available toolkits. For a head start on ASR tasks, the Kaldi toolkit [181] provides several recipes for the listed databases which include some of the tools discussed above. The CHiME-6 recipe (https://github.com/kaldi-asr/kaldi/tree/master/egs/chime6), for example, uses NaraWPE and PB-GSS (https://github.com/fgnt/pb_chime5), while the CHiME-3/4 recipe (https://github.com/kaldi-asr/kaldi/tree/master/egs/chime4) includes BeamformIt and NN-GEV. ESPnet [182] also provides multi-channel end-to-end ASR for the REVERB (https://github.com/espnet/espnet/tree/master/egs/reverb/asr1_multich) and CHiME-4 (https://github.com/espnet/espnet/tree/master/egs/chime4/asr1_multich) data with the help of DNN-WPE, implemented based on [56] (https://github.com/nttcslab-sp/dnn_wpe).

REFERENCES

[1] K. Kinoshita, M. Delcroix, S. Gannot, E. Habets, R. Haeb-Umbach, W. Kellermann, V. Leutnant, R. Maas, T. Nakatani, B. Raj, A. Sehr, and T. Yoshioka, “A summary of the REVERB challenge: state-of-the-art and remaining challenges in reverberant speech processing research,”
EURASIP Journal on Advances in Signal Processing, 2016.
[2] J. Barker, R. Marxer, E. Vincent, and S. Watanabe, “The third ‘CHiME’ speech separation and recognition challenge: Analysis and outcomes,”
Computer Speech and Language, vol. 46, pp. 605–626, Nov. 2017.
[3] E. Vincent, S. Watanabe, A. A. Nugraha, J. Barker, and R. Marxer, “An analysis of environment, microphone and data simulation mismatches in robust speech recognition,” Computer Speech and Language, 2016.
[4] J. Barker, S. Watanabe, E. Vincent, and J. Trmal, “The fifth ‘CHiME’ speech separation and recognition challenge: Dataset, task and baselines,” in
Proc. of Annual Conference of the Inter-national Speech Communication Association (Interspeech) , 2018.[5] S. Watanabe, M. Mandel, J. Barker, E. Vincent, A. Arora,X. Chang, S. Khudanpur, V. Manohar, D. Povey, D. Raj, D. Sny-der, A. S. Subramanian, J. Trmal, B. B. Yair, C. Boeddeker, Z. Ni,Y. Fujita, S. Horiguchi, N. Kanda, T. Yoshioka, and N. Ryant,“CHiME-6 challenge: Tackling multispeaker speech recognitionfor unsegmented recordings,”
CoRR, 2020.
[6] M. Harper, “The automatic speech recognition in reverberant environments (ASpIRE) challenge,” in
Proc. of IEEE Workshopon Automatic Speech Recognition and Understanding (ASRU) ,Dec 2015, pp. 547–554.[7] M. Delcroix, T. Yoshioka, A. Ogawa, Y. Kubo, M. Fujimoto,N. Ito, K. Kinoshita, M. Espi, S. Araki, T. Hori et al. , “Strate-gies for distant speech recognition in reverberant environments,”
EURASIP Journal on Advances in Signal Processing , vol. 2015,no. 1, p. 60, 2015.[8] X. Feng, K. Kumatani, and J. McDonough, “The CMU-MITREVERB challenge 2014 system: description and results,” in
REVERB Challenge Workshop , 2014.[9] F. Weninger, S. Watanabe, J. L. Roux, J. R. Hershey, Y. Ta-chioka, and G. Rigoll, “The MERL/MELCO/TUM system for theREVERB challenge using deep recurrent neural network featureenhancement,” in
REVERB Challenge Workshop , 2014.[10] Z.-Q. Wang and D. Wang, “Deep learning based target cancel-lation for speech dereverberation,”
IEEE Transactions on Audio,Speech, and Language Processing , vol. 28, pp. 941–950, 2020.[11] J.Barker, R. Marxer, E. Vincent, and S. Watanabe, “Multi-microphone speech recognition in everyday environments,”
Com-puter Speech and Language , vol. 46, pp. 386–387, June 2017.[12] S.-J. Chen, A. S. Subramanian, H. Xu, and S. Watanabe, “Build-ing state-of-the-art distant speech recognition using the CHiME-4challenge with a setup of speech enhancement baseline,” in
Proc.of Annual Conference of the International Speech Communica-tion Association (Interspeech) , 2018, pp. 1571–1575.[13] C. Zoril˘a, C. Boeddeker, R. Doddipatla, and R. Haeb-Umbach,“An investigation into the effectiveness of enhancement in ASRtraining and test for CHiME-5 dinner party transcription,” in
Proc. of IEEE Workshop on Automatic Speech Recognition andUnderstanding (ASRU) , 2019.[14] J. Du, Y.-H. Tu, L. Sun, L. Chai, X. Tang, M.-K. He, F. Ma,J. Pan, J.-Q. Gao, D. Liu, C.-H. Lee, and J.-D. Chen, “The USTC-NELSLIP systems for CHiME-6 challenge,” in , 2020.[15] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. r. Mohamed,N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath,and B. Kingsbury, “Deep neural networks for acoustic modelingin speech recognition: The shared views of four research groups,”
IEEE Signal Processing Magazine , vol. 29, no. 6, pp. 82–97, Nov2012.[16] D. Yu and L. Deng,
Automatic Speech Recognition – A DeepLearning Approach . Springer, 2015.[17] J. Li, L. Deng, R. Haeb-Umbach, and Y. Gong,
Robust AutomaticSpeech Recognition . Elsevier, Oct 2015.[18] M. Souden, J. Benesty, and S. Affes, “On optimal frequency-domain multichannel linear filtering for noise reduction,”
IEEETransactions on Audio, Speech, and Language Processing ,vol. 18, no. 2, pp. 260–276, 2010.[19] D. Van Compernolle, W. Ma, F. Xie, and M. Van Diest, “Speechrecognition in noisy environments with the aid of microphonearrays,”
Speech Communication , vol. 9, no. 5-6, pp. 433–442,1990.[20] P. Naylor and N. Gaubitch, Eds.,
Speech Dereverberation. Springer, 2010.
[21] T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, and B.-H. Juang, “Speech dereverberation based on variance-normalized delayed linear prediction,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 7, pp. 1717–1731, 2010.

TABLE I
RECENT NOISE ROBUST SPEECH RECOGNITION TASKS

Task | Vocabulary size | Amount of training data | Realism | Type of distortions | Number of mics | Mic-speaker distance | Ground truth
ASpIRE [6] | 100k | N/A | Real | Reverberation | 8/1 | N/A | N/A
AMI [183] | 11k | ∼75 h | Real | Multi-speaker conversations, reverberation and noise | 8 | N/A | Headset
CHiME-3/4 [2,3] | 5k | 8738 utt. (∼18 h) | Simu+Real | Nonstationary noise in four public environments | 6/2/1 | – | Clean/close talk
CHiME-5/6 [4] | 100k | ∼80k utt. (∼40 h) | Real | Nonstationary noise, multi-speaker conversations, reverberation | 32 | – | Binaural headset
REVERB [1] | 5k | 7861 utt. (∼15 h) | Simu+Real | Reverberation in different living rooms | 8/2/1 | – | Clean/headset

TABLE II
TOOLKITS

Name | Affiliation | Function | Interface | Back-end | License | Ref. | URL
NaraWPE | UPB | WPE dereverberation | Python | NumPy, TensorFlow | MIT | [179] | https://github.com/fgnt/nara_wpe
WPE | NTT | WPE dereverberation | Matlab | Matlab | Custom | [21,46] | –
DNN-WPE | NTT | DNN-based WPE dereverberation | Python | NumPy, PyTorch | Custom | [56] | https://github.com/nttcslab-sp/dnn_wpe
NN-GEV | UPB | Neural mask-based beamforming | Python | NumPy, Chainer | Custom | [74] | https://github.com/fgnt/nn-gev
PB-BSS | UPB | Spatial clustering, beamforming | Python | NumPy | MIT | [55] | https://github.com/fgnt/pb_bss
BTK | CMU | Beamforming, dereverberation | Python, C++ | C++ | MIT | N/A | https://distantspeechrecognition.sourceforge.io/
PyRoomAcoustics | EPFL | Beamforming, RIR generation | Python | NumPy/C | MIT | [180] | https://github.com/LCAV/pyroomacoustics
BeamformIt | ICSI | Delay-and-sum beamforming | CLI, C++ | C/C++ | N/A | [184] | https://github.com/xanguera/BeamformIt
HARK | HRI | Source localization, separation | CLI, Python | C++ | Custom | – | hark.jp

[22] S. Araki, M. Okada, T. Higuchi, A. Ogawa, and T. Nakatani, “Spatial correlation model based observation vector clustering and MVDR beamforming for meeting recognition,” in
Proc. ofIEEE International Conference on Acoustics, Speech and SignalProcessing (ICASSP) . IEEE, 2016, pp. 385–389.[23] B. Wu, K. Li, M. Yang, C.-H. Lee, B. Wu, K. Li, M. Yang, C.-H.Lee, B. Wu, M. Yang, C.-H. Lee, and K. Li, “A reverberation-time-aware approach to speech dereverberation based on deepneural networks,”
IEEE Transactions on Audio, Speech, andLanguage Processing , vol. 25, no. 1, pp. 102–111, Jan. 2017.[24] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, “Deepclustering: discriminative embeddings for segmentation and sep-aration,” in
Proc. of IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP) . IEEE, 2016, pp. 31–35.[25] Z. Chen, Y. Luo, and N. Mesgarani, “Deep attractor network forsingle-microphone speaker separation,” in
Proc. of IEEE Inter-national Conference on Acoustics, Speech and Signal Processing(ICASSP) , March 2017, pp. 246–250.[26] M. Kolbæk, D. Yu, Z.-H. Tan, and J. Jensen, “Multitalker speechseparation with utterance-level permutation invariant training ofdeep recurrent neural networks,”
IEEE/ACM Transactions onAudio, Speech, and Language Processing , vol. 25, no. 10, pp.1901–1913, 2017.[27] Y. Luo and N. Mesgarani, “Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation,”
IEEE/ACMTransactions on Audio, Speech, and Language Processing ,vol. PP, pp. 1–1, May 2019.[28] Y. Avargel and I. Cohen, “On multiplicative transfer functionapproximation in the short-time fourier transform domain,”
IEEE Signal Processing Letters , vol. 14, pp. 337–340, 2007.[29] A. Gilloire and M. Vetterli, “Adaptive filtering in sub-bandswith critical sampling: analysis, experiments, and applicationto acoustic echo cancellation,”
IEEE Transactions on SignalProcessing , vol. 40, no. 8, pp. 1862–1875, 1992.[30] R. Talmon, I. Cohen, and S. Gannot, “Convolutive transferfunction generalized sidelobe canceler,”
IEEE Transactions onAudio, Speech, and Language Processing , vol. 17, no. 7, pp.1420–1434, 2009.[31] X. Li, L. Girin, S. Gannot, and R. Horaud, “Multichannel speechseparation and enhancement using the convolutive transfer func-tion,”
IEEE/ACM Transactions on Audio, Speech, and LanguageProcessing
Proc. of IEEE International Conference onAcoustics, Speech and Signal Processing (ICASSP) , March 2010,pp. 4214–4217.[34] E. Vincent, R. Gribonval, and C. F´evotte, “Performance mea-surement in blind audio source separation,”
IEEE Transactionson Audio, Speech, and Language Processing , vol. 14, no. 4, pp.1462–1469, 2006.[35] D. Wang and J. Chen, “Supervised speech separation based ondeep learning: An overview,”
IEEE/ACM Transactions on Audio,Speech, and Language Processing , vol. 26, no. 10, pp. 1702–1726, 2018.[36] H. Buchner, R. Aichner, and W. Kellermann, “TRINICON: aversatile framework for multichannel blind signal processing,” in Proc. of IEEE International Conference on Acoustics, Speechand Signal Processing (ICASSP) , vol. III, 2004, pp. 889–892.[37] T. Nakatani and K. Kinoshita, “Maximum likelihood convolu-tional beamformer for simultaneous denoising and dereverber-ation,” in , 2019.[38] D. T. M. Slock, “Blind fractionally-spaced equalization, perfectreconstruction filter-banks and multichannel linear prediction,” in
Proc. of IEEE International Conference on Acoustics, Speech andSignal Processing (ICASSP) , vol. 4, 1994, pp. 585–588.[39] B. D. Van Veen and K. M. Buckley, “Beamforming: A versatileapproach to spatial filtering,”
IEEE ASSP Magazine , vol. 5, no. 2,pp. 4–24, April 1988.[40] S. Gannot, E. Vincent, S. Markovich-Golan, and A. Ozerov,“A consolidated perspective on multimicrophone speech en-hancement and source separation,”
IEEE Transactions on Audio,Speech, and Language Processing , vol. 25, no. 4, 2017.[41] C. Boeddeker, T. Nakatani, K. Kinoshita, and R. Haeb-Umbach,“Jointly optimal dereverberation and beamforming,” Submittedto ICASSP, 2020. [Online]. Available: http://arxiv.org/abs/1910.13707[42] K. Abed-Meraim and P. Loubaton, “Prediction error method forsecond-order blind identification,”
IEEE Transactions on SignalProcessing , vol. 45, no. 3, pp. 694–705, 1997.[43] E. Vincent, T. Virtanen, and S. Gannot,
Audio source separationand speech enhancement . John Wiley & Sons, 2018.[44] S. Makino, Ed.,
Audio Source Separation . Springer, 2018.[45] F. Xiong, B. T. Meyer, N. Moritz, R. Rehr, J. Anem¨uller, T. Gerk-mann, S. Doclo, and S. Goetze, “Front-end technologies forrobust asr in reverberant environments—spectral enhancement-based dereverberation and auditory modulation filterbank fea-tures,”
EURASIP Journal on Advances in Signal Processing , vol.2015, no. 1, p. 70, 2015.[46] T. Yoshioka and T. Nakatani, “Generalization of multi-channellinear prediction methods for blind MIMO impulse responseshortening,”
IEEE Transactions on Audio, Speech, and LanguageProcessing , 2012.[47] J. Caroselli, I. Shafran, A. Narayanan, and R. Rose, “Adaptivemultichannel dereverberation for automatic speech recognition,”in
Proc. of Annual Conference of the International SpeechCommunication Association (Interspeech) , 2017.[48] T. Higuchi, N. Ito, T. Yoshioka, and T. Nakatani, “Robust MVDRbeamforming using time-frequency masks for online/offline asr innoise,” in
Proc. of IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP) , March 2016, pp. 5210–5214.[49] Y. Isik, J. Le Roux, Z. Chen, S. Watanabe, and J. Hershey,“Single-channel multi-speaker separation using deep clustering,”in
Proc. of Annual Conference of the International SpeechCommunication Association (Interspeech) , 2016.[50] L. Griffiths and C. Jim, “An alternative approach to linearly con-strained adaptive beamforming,”
IEEE Transactions on Antennasand Propagation , vol. 30, no. 1, pp. 27–34, January 1982.[51] S. Makino, T. Lee, and H. Sawada,
Blind speech separation .Springer, 2007, vol. 615.[52] G. Naik and W. Wang, Eds.,
Blind Source Separation . Springer,2014.[53] J. Heymann, L. Drude, and R. Haeb-Umbach, “Neural networkbased spectral mask estimation for acoustic beamforming,” in
Proc. of IEEE International Conference on Acoustics, Speechand Signal Processing (ICASSP) , 2016.[54] X. Xiao, S. Watanabe, H. Erdogan, L. Lu, J. Hershey, M. L.Seltzer, G. Chen, Y. Zhang, M. Mandel, and D. Yu, “Deepbeamforming networks for multi-channel speech recognition,” in
Proc. of IEEE International Conference on Acoustics, Speech andSignal Processing (ICASSP) , March 2016, pp. 5745–5749.[55] L. Drude and R. Haeb-Umbach, “Tight integration of spatialand spectral features for BSS with deep clustering embeddings,”in
Proc. of Annual Conference of the International SpeechCommunication Association (Interspeech) , 2017.[56] K. Kinoshita, M. Delcroix, H. Kwon, T. Mori, and T. Nakatani,“Neural network-based spectrum estimation for online WPE dere- verberation,” in
Proc. of Annual Conference of the InternationalSpeech Communication Association (Interspeech) , 2017, pp. 384–388.[57] J. Heymann, L. Drude, R. Haeb-Umbach, K. Kinoshita, andT. Nakatani, “Joint optimization of neural network-based WPEdereverberation and acoustic model for robust online ASR,” in
Proc. of IEEE International Conference on Acoustics, Speech andSignal Processing (ICASSP) . IEEE, 2019, pp. 6655–6659.[58] ——, “Frame-online DNN-WPE dereverberation,” in
Proc.IWAENC , Tokyo, Japan, September 2018.[59] J. Heymann, L. Drude, C. Boeddeker, P. Hanebrink, and R. Haeb-Umbach, “BEAMNET: End-to-end training of a beamformer-supported multi-channel ASR system,” in