An End-to-end Architecture of Online Multi-channel Speech Separation
Jian Wu*, Zhuo Chen, Jinyu Li, Takuya Yoshioka, Zhili Tan, Ed Lin, Yi Luo, Lei Xie†

Audio, Speech and Language Processing Group (ASLP), School of Computer Science, Northwestern Polytechnical University, Xi'an, China
Microsoft, STCA, Beijing, China
Microsoft, One Microsoft Way, Redmond, WA, USA
{jianwu,lxie}@nwpu-aslp.org, {zhuc,jinyli,tayoshio}@microsoft.com

* Work done during internship at Microsoft STCA Beijing.
† Lei Xie is the corresponding author.

Abstract
Multi-speaker speech recognition has been one of the key challenges in conversation transcription, as it breaks the single active speaker assumption employed by most state-of-the-art speech recognition systems. Speech separation is considered a remedy to this problem. Previously, we introduced a system, called unmixing, fixed-beamformer and extraction (UFE), that was shown to be effective in addressing the speech overlap problem in conversation transcription. With UFE, an input mixed signal is processed by fixed beamformers, followed by neural network post-filtering. Although promising results were obtained, the system contains multiple individually developed modules, potentially leading to sub-optimal performance. In this work, we introduce an end-to-end modeling version of UFE. To enable gradient propagation all the way through the system, an attentional selection module is proposed, where an attentional weight is learnt for each beamformer and spatial feature sampled over space. Experimental results show that the proposed system achieves performance comparable with the original separate processing-based pipeline in an offline evaluation, while producing remarkable improvements in an online evaluation.

Index Terms: multi-channel speech separation, robust speech recognition, speaker extraction, source localization, fixed beamformer
1. Introduction
Deep learning approaches have brought about remarkable progress in speaker-independent speech separation in the past few years [1, 2, 3, 4]. The separated signal quality has been steadily improved on benchmark datasets such as WSJ0-2mix [2]. However, multi-talker speech recognition still remains a challenging problem.

Speech separation is a common practice for handling speech overlaps. Existing efforts in overlapped speech recognition can be roughly categorized into two families: building a robust separation system as a front-end processor for automatic speech recognition (ASR) [5, 6, 7, 8, 9, 10, 11] or developing multi-talker aware ASR models [12, 13, 14, 15, 16, 17, 18]. Although better performance can be expected from end-to-end training that includes ASR, the independent front-end processing approach is often preferable in real world applications such as meeting transcription [19] for two reasons. Firstly, in conversation transcription systems, the front-end module benefits multiple acoustic processing components, including speech recognition, diarization, and speaker verification. Secondly, commercial ASR models are usually trained with a tremendous amount of data and are highly engineered, making it extremely costly to change the training scheme.

The recent work of [19] applied speech separation to a real-world conversation transcription task, where a multi-channel separation network, namely the speech unmixing network, trained with permutation invariant training (PIT) [1], continuously separates the input audio stream into two channels, ensuring that each output channel contains at most one active speaker. A mask-based adaptive Minimum Variance Distortionless Response (MVDR) beamformer was used for generating enhanced signals. In [20], a fixed beamformer based separation solution was introduced, namely the unmixing, fixed-beamformer and extraction (UFE) system. The mask-based adaptive beamformer of the speech unmixing system is replaced by a process that selects two fixed beamformers from a pre-defined set of beamformers using a sound source localization (SSL) based beam selection algorithm. This is followed by the speech extraction model introduced in [8] to filter the residual interference in the selected beams. The UFE system has performance comparable with the MVDR-based approach, with reduced processing latency.

One limitation of the UFE system lies in its modularized optimization, where each component is individually trained with an indirect objective function. For example, the signal reconstruction objective function used for speech unmixing does not necessarily benefit the accuracy of UFE's beam selection module. As a subsequent work of [20], in this paper we propose a novel end-to-end structure of the UFE (E2E-UFE) model, which utilizes a similar system architecture to UFE, with improved performance thanks to end-to-end optimization. To enable joint training, several updates are implemented on the speech unmixing and extraction networks. We also introduce an attentional module to allow the gradients to propagate through the beam selection module, which was non-differentiable in the original UFE. The performance of E2E-UFE is evaluated in both block online and offline setups. Our experiments conducted on simulated and semi-real two-speaker mixtures show that E2E-UFE yields results comparable with the original UFE system in the offline evaluation.
Significant WER reduction is observed in the block online evaluation.
2. Overview of UFE System
The outline of the UFE pipeline is depicted in Figure 1, which consists of four major components: the fixed beamformer, mask-based sound source localization (SSL), the speech unmixing network and the location based speech extraction network. In UFE, the $M$-channel short-time Fourier transform (STFT) of the input speech mixture $\mathbf{Y}_{0,\cdots,M-1} = \{Y_0, \cdots, Y_{M-1}\}$ is first processed by the speech unmixing module, where a time-frequency mask (TF mask) is estimated for each participating speaker.
Figure 1: Overview of the UFE system. The grey block is a neural network trained independently.

In this work, we set the maximum number of simultaneously talking speakers to two, so two masks $M_1, M_2 \in \mathbb{R}^{T \times F}$ are generated by the unmixing network. The speech unmixing module is trained with the permutation invariant training (PIT) criterion using the scale-invariant signal-to-noise ratio (Si-SNR) [21] objective function:

$$\mathcal{L} = -\max_{\phi \in \mathcal{P}} \sum_{(i,j) \in \phi} \text{Si-SNR}(s_i, x_j), \qquad (1)$$

where $\mathcal{P}$ refers to all possible permutations, $x_j$ is the clean reference of speaker $j$, and $s_i$ refers to the separated signal of speaker $i$, which is obtained via the inverse short-time Fourier transform (iSTFT):

$$s_i = \text{iSTFT}(M_i \odot Y). \qquad (2)$$

Then the sound source localization module is applied to estimate the spatial angle for each separated source with weighted maximum likelihood estimation [20]. The direction of the $i$-th speaker is estimated by finding a discrete angle $\theta$, sampled from 0° to 360°, that maximizes the following function:

$$D_{\theta,i} = -\sum_{t,f} M_{i,tf} \log\left(\frac{1 - |\mathbf{y}_{t,f}^{\mathsf{H}} \mathbf{h}_{\theta,f}|^2}{\epsilon}\right), \qquad (3)$$

where $\mathbf{h}_{\theta,f}$ is the normalized steering vector on each frequency band $f$ for source direction $\theta$, $\epsilon$ refers to a small flooring value, and $t$ denotes the frame index in the STFT.

With the estimated direction, one beamformer is then selected for each source from a set of pre-defined beamformers, defined as $\mathbf{w}_{n,f} \in \mathbb{C}^{M \times 1}$, where $n$ indexes the beam and each beam has a center angle that is sampled uniformly across the space. The beamformed signal on each time-frequency bin is obtained by Eqn. (4):

$$b_{i,t,f} = \mathbf{w}_{i,f}^{\mathsf{H}} \mathbf{y}_{t,f}, \qquad (4)$$

where $\mathbf{y}_{t,f} = [Y_{0,tf}, \cdots, Y_{M-1,tf}]^{\mathsf{T}}$.

Finally, the location based speech extraction [8] is applied on each selected beam and estimates the TF mask from the beam spectrogram, the inter-microphone phase difference (IPD) and the angle feature [8, 22]. The angle feature on frequency band $f$ is computed as

$$a_{\theta,f} = \frac{1}{P} \sum_{(i,j) \in \psi} \cos(o_{ij,f} - \Delta_{\theta,ij,f}), \qquad (5)$$

where $\psi$ contains $P$ microphone pairs and $o_{ij,f} = \angle y_{i,f} - \angle y_{j,f}$ represents the observed IPD between channels $i$ and $j$. $\Delta_{\theta,ij,f}$ is the ground truth phase difference given the direction of arrival $\theta$ and the array geometry. The final output signal is obtained by applying the TF mask to the corresponding selected beam, followed by the iSTFT.

As the fixed beamformer does not need to estimate filter coefficients from the input data, it has the potential to achieve low-latency processing and more robust performance in challenging acoustic environments, while the speech extraction network compensates for the limited spatial discrimination of the fixed beamformer.
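To make the training objective in Eqns. (1)-(2) concrete, the snippet below is a minimal NumPy sketch of the Si-SNR metric and the two-speaker permutation-invariant loss. It assumes the masked signals have already been converted back to the time domain, and the function names are illustrative rather than taken from the original implementation.

```python
import itertools
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR (dB) between a time-domain estimate and its reference."""
    est = est - est.mean()
    ref = ref - ref.mean()
    # Project the estimate onto the reference to remove the scale ambiguity.
    target = np.dot(est, ref) / (np.dot(ref, ref) + eps) * ref
    noise = est - target
    return 10 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))

def pit_si_snr_loss(estimates, references):
    """Negative Si-SNR under the best output-reference permutation (Eqn. 1)."""
    best = -np.inf
    for perm in itertools.permutations(range(len(references))):
        score = sum(si_snr(estimates[i], references[j]) for i, j in enumerate(perm))
        best = max(best, score)
    return -best
```

For two speakers only two permutations are scored, so the exhaustive search above stays cheap; the same permutation-free form reappears later in Eqn. (12).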
3. End-to-end UFE
The proposed end-to-end UFE system is depicted in Figure 2. The overall system workflow is similar to the original UFE, while the E2E framework largely simplifies the whole process. The proposed system takes the multi-channel recording as input and directly outputs two separated speech signals. A single objective function on top of the network is used to optimize all parameters. In the original UFE, three components are non-differentiable: the SSL module, the beam selection module and the angle feature extraction. To enable joint training, we introduce updates to each of these components.
As the permutation ambiguity is handled in the final objective function of the E2E framework, the unmixing module of UFE reduces to a stack of pre-separation layers. As in the original UFE, the network takes the IPD and the spectrogram of the first channel recording as input features. The pre-separation layers consist of a stack of recurrent layers, followed by H linear projection layers. Here we use H = 2, as we consider at most 2 speakers in this paper. After processing by the pre-separation layers, an intermediate representation $E \in \mathbb{R}^{H \times T \times K}$ is formed, where K denotes the embedding dimension. We refer to E as the "pre-separation mask" in the following.

To avoid the hard angle selection in the sound source localization module, i.e., Eqn. (3), an attention module is applied in the E2E-UFE system. It consists of a pool of beamformed signals and angle features, followed by an attentional selection that estimates the location based bias for the final extraction layers.
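As an illustration of the pre-separation layers described above, the following PyTorch-style sketch stacks recurrent layers with H linear heads producing E. The plain unidirectional LSTM stands in for the contextual LSTM of [29], and the layer sizes, the sigmoid activation and the class name are our assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class PreSeparationLayers(nn.Module):
    """Recurrent stack followed by H linear projections (one per speaker).

    A plain LSTM is used as a stand-in for the contextual LSTM layers of [29];
    the hidden size and embedding dimension K are illustrative defaults.
    """
    def __init__(self, feat_dim, hidden=512, num_layers=3, num_spk=2, embed_dim=257):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, num_layers=num_layers,
                           batch_first=True, dropout=0.2)
        self.heads = nn.ModuleList([nn.Linear(hidden, embed_dim) for _ in range(num_spk)])

    def forward(self, feats):
        # feats: (batch, T, feat_dim) -> pre-separation masks E: (batch, H, T, K)
        out, _ = self.rnn(feats)
        return torch.stack([torch.sigmoid(head(out)) for head in self.heads], dim=1)
```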
The spatial feature pool is formed by stacking the spatial features pointing to different directions. Two pools are formed, one for fixed beamforming and the other for the angle feature. For the beam pool $B \in \mathbb{C}^{N_b \times T \times F}$, we calculate the spectrograms of the signals obtained through all pre-defined fixed beamformers. In this work, we use $N_b = 18$ beamformers to scan the horizontal space, i.e., 20 degrees are covered by each beamformer. The angle feature pool $A \in \mathbb{R}^{N_a \times T \times F}$ is formed similarly, with $N_a = 36$ directions.

Note that in the original UFE, only the beam and angle feature corresponding to the selected angle are calculated, while E2E-UFE calculates the beamformed signals and angle features for all directions beforehand, resulting in an increased computation burden. However, this also opens the possibility of jointly optimizing the beamformer and the angle feature representation, as suggested in [23], as they are now part of the network. In this work, we freeze the beamformer filter coefficients and the angle feature representation. The complex operations in beamforming are implemented using the multiplication of two real matrices [24].
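A minimal NumPy sketch of the two pools is given below. It assumes the fixed beamformer weights W and the per-direction expected phase differences delta are precomputed offline, and the function names are ours; the complex beamforming product is expanded into real and imaginary parts, in the spirit of [24].

```python
import numpy as np

def beam_pool(Y, W):
    """Apply all fixed beamformers: b_{n,t,f} = w_{n,f}^H y_{t,f} (Eqn. 4).

    Y: (M, T, F) complex multi-channel STFT; W: (N_b, M, F) complex filter weights.
    """
    Wr, Wi = W.real, W.imag
    Yr, Yi = Y.real, Y.imag
    # w^H y = (w_r - j w_i)^T (y_r + j y_i), written with real matrices only.
    real = np.einsum('nmf,mtf->ntf', Wr, Yr) + np.einsum('nmf,mtf->ntf', Wi, Yi)
    imag = np.einsum('nmf,mtf->ntf', Wr, Yi) - np.einsum('nmf,mtf->ntf', Wi, Yr)
    return real + 1j * imag                              # (N_b, T, F)

def angle_feature_pool(Y, delta, pairs):
    """Angle features for all candidate directions (Eqn. 5).

    delta: (N_a, P, F) expected phase differences per direction and mic pair;
    pairs: list of P microphone index pairs (i, j).
    """
    obs_ipd = np.stack([np.angle(Y[i]) - np.angle(Y[j]) for i, j in pairs])  # (P, T, F)
    # Average cos(observed IPD - expected IPD) over the microphone pairs.
    return np.cos(obs_ipd[None] - delta[:, :, None, :]).mean(axis=1)         # (N_a, T, F)
```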
Figure 2: Overview of the E2E-UFE system (left) and the scheme of the attentional beam selection (right). AF and FB are abbreviations of the angle feature and the fixed beamformer, respectively. The extraction layers accept both the weighted beam and the weighted angle feature.

With the pre-separation mask E, the beam pool B and the angle feature set A as input, an attention selection module is implemented to form the location based acoustic bias for each source. The intuition behind the attentional selection is straightforward: one attention weight is estimated for each beam and angle feature, based on their learnt similarity with the pre-separation representations, followed by a weighted sum to form the final beam and angle feature that are sent to the final extraction layers. In more detail, the attention module operates in four steps. We use the beam attention as an example for illustration, while the selection of the angle feature operates in the same manner. The corresponding scheme is depicted in Figure 2.

Firstly, the pre-separation mask and the beam pool are projected using two linear layers:

$$V^P = E W_p, \qquad (6)$$
$$V^B = |B| W_b, \qquad (7)$$

where $W_p \in \mathbb{R}^{K \times D}$ and $W_b \in \mathbb{R}^{F \times D}$ are the projection layer weights that convert the pre-separation mask and the beam pool into the same dimension D, resulting in the updated embedding matrices $V^P \in \mathbb{R}^{H \times T \times D}$ and $V^B \in \mathbb{R}^{N_b \times T \times D}$.

Then a pair-wise similarity matrix is defined between each frame in $V^P$ and $V^B$ using the dot product distance, scaled by $(\sqrt{D})^{-1}$. The similarity matrix is averaged along the time axis (the averaging span determines the time resolution of the beam selection) and passed through the softmax function to generate the final weight. In Eqn. (8), $s_{h,b,t}$ is the similarity score between the h-th pre-separation mask and the b-th beam at time t, $\hat{s}_{h,b}$ refers to the time-averaged score, and $w_{h,b}$ is the final attention weight for each beam. Finally, the weighted average operation is performed to obtain the combined beam $\hat{B}_h$ for the h-th speaker, as shown in Eqn. (11):

$$s_{h,b,t} = (\sqrt{D})^{-1} \left(V^P_{h,t}\right)^{\mathsf{T}} V^B_{b,t}, \qquad (8)$$
$$\hat{s}_{h,b} = T^{-1} \sum_t s_{h,b,t}, \qquad (9)$$
$$w_{h,b} = \text{softmax}_b(\hat{s}_{h,b}), \qquad (10)$$
$$\hat{B}_h = \sum_b w_{h,b} B_b. \qquad (11)$$

The combined angle feature $\hat{A}_h$ can be calculated with the same mechanism. The proposed attention module connects the spatial features, the pre-separation and the later extraction steps, ensuring that the gradient can be passed through in an end-to-end optimization scheme. Note that the averaging step in Eqn. (9) can be adjusted according to different application scenarios. For offline processing, averaging over the entire utterance usually leads to a more robust estimation, assuming the position of the speaker does not change, while averaging only over past information is more desirable for online processing. The same mechanism can be applied with other information as well, e.g., a speaker inventory [25] or visual clues.

The combined beam and angle feature estimated via the attentional selection module are processed by the extraction layers. The extraction layers have essentially the same structure as in the original UFE, except that the PIT training criterion is required, as the permutation ambiguity is not disentangled by the unmixing module in the E2E framework. We use the clean source from the ground truth beam selection as the training target, so both the beam selection and the waveform reconstruction are optimized with one objective function. Denoting $r_i$ as the training target for speaker i, the objective function is given in a permutation-free form:

$$\mathcal{L} = -\max_{\phi \in \mathcal{P}} \sum_{(i,j) \in \phi} \text{Si-SNR}(s_i, r_j), \qquad (12)$$

where $s_i$ is the network's estimation of speaker i.
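The following NumPy sketch mirrors the attentional beam selection of Eqns. (6)-(11) for the offline case, where the similarity scores are averaged over the whole utterance. The tensor shapes and random inputs at the bottom are purely illustrative; in the actual system, E, B and the projection weights come from the trained network.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attentional_selection(E, B, W_p, W_b):
    """Soft beam selection following Eqns. (6)-(11).

    E: (H, T, K) pre-separation masks; B: (N_b, T, F) complex beam pool;
    W_p: (K, D), W_b: (F, D) projection weights.
    Returns the combined beams (H, T, F) and the attention weights (H, N_b).
    """
    D = W_p.shape[1]
    V_p = E @ W_p                                   # (H, T, D),  Eqn. (6)
    V_b = np.abs(B) @ W_b                           # (N_b, T, D), Eqn. (7)
    # Scaled dot-product similarity per frame (Eqn. 8), averaged over time (Eqn. 9).
    s = np.einsum('htd,btd->hbt', V_p, V_b) / np.sqrt(D)
    w = softmax(s.mean(axis=-1), axis=-1)           # (H, N_b),  Eqn. (10)
    B_hat = np.einsum('hb,btf->htf', w, B)          # weighted sum, Eqn. (11)
    return B_hat, w

# Illustrative shapes only.
H, T, K, F, N_b, D = 2, 100, 257, 257, 18, 128
rng = np.random.default_rng(0)
E = rng.random((H, T, K))
B = rng.standard_normal((N_b, T, F)) + 1j * rng.standard_normal((N_b, T, F))
B_hat, w = attentional_selection(E, B, rng.standard_normal((K, D)), rng.standard_normal((F, D)))
```

For online processing, the mean over the time axis in Eqn. (9) would simply be restricted to past frames, as discussed above.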
4. Experiments
The proposed system was trained with multi-channel, artificially mixed speech. A total of 1000 hours of training speech data was generated. Source clean speech signals were taken from publicly available datasets, including LibriSpeech [26] and Common Voice (https://voice.mozilla.org/en), as well as Microsoft internal recordings. Seven-channel signals were simulated by convolving clean speech signals with artificial room impulse responses (RIRs) generated with the image method [27]. We used the same microphone array geometry as the one used in [20]. The T60 reverberation times were uniformly sampled from [0.1, 0.5] s, with room sizes of [2, 20] m in length and width and [2, 5] m in height. The speaker and microphone locations were randomly determined in the simulated rooms. Simulated isotropic noise [28] was added to each mixed utterance at an SNR sampled from [10, 20] dB. We made sure each speech mixture contained one or two speakers, with a mixing SNR between [-5, 5] dB and an average overlap ratio of 50%. All the data had a sampling rate of 16 kHz.

Two test sets were created for model evaluation. The first test set was created using the same generation pipeline as the one for the training data, denoted as the simu test set, which amounts to 3000 utterances. The speakers were sampled from the test-clean set of LibriSpeech; there were no shared speakers between the training and test sets. The second test set was generated by directly mixing our internal, real recorded multi-channel single speaker signals. 2000 mixed utterances were created with the same mixing strategy as in the training set, except that no scaling was applied to the source signals. We refer to this set as the semi-real test set. For each set, we created two overlapping conditions, whose overlap ratios ranged from 20-50% and 50-100%. We denote these two conditions as OV35 and OV75, respectively.

Table 1: WER (%) performance in the offline evaluation.

Method       simu OV35   simu OV75   semi-real OV35   semi-real OV75
Mixed Beam   67.40       52.40       70.92            57.63
Clean Beam   10.67       10.56       20.34            19.71
UFE          16.44       18.55       35.60            37.54
E2E-UFE      –           –           33.89            35.92
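To make the mixture simulation described above concrete, here is a hedged NumPy/SciPy sketch of how a single reverberant two-speaker mixture could be assembled. RIR generation with the image method and the overlap-ratio control are omitted, the RIRs and isotropic noise are assumed to be precomputed, and all function names are illustrative.

```python
import numpy as np
from scipy.signal import fftconvolve

def scale_to_snr(signal, noise, snr_db):
    """Scale `noise` so that the signal-to-noise ratio reaches `snr_db`."""
    p_sig = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    return noise * np.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10)))

def simulate_mixture(src1, src2, rir1, rir2, noise, rng):
    """Reverberate two dry sources with their multi-channel RIRs and mix them.

    src1, src2: (n_samples,) dry signals; rir1, rir2: (M, rir_len) image-method RIRs;
    noise: (M, n_samples) isotropic noise. SNR ranges follow Section 4.
    """
    rev1 = np.stack([fftconvolve(src1, h)[: len(src1)] for h in rir1])  # (M, n)
    rev2 = np.stack([fftconvolve(src2, h)[: len(src2)] for h in rir2])
    rev2 = scale_to_snr(rev1, rev2, rng.uniform(-5, 5))    # mixing SNR in [-5, 5] dB
    mixture = rev1 + rev2
    mixture += scale_to_snr(mixture, noise, rng.uniform(10, 20))  # noise SNR in [10, 20] dB
    return mixture
```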
The original UFE system served as the baseline for the proposed E2E architecture. We observed that, when trained with the PIT criterion, the extraction model of UFE yielded significantly better results; therefore, we used the PIT-trained extraction in our UFE baseline. For reference, we included the results obtained with the fixed beamforming system applied directly to the speech mixtures (Mixed Beam) and those obtained by applying the same beamformers to the clean utterances (Clean Beam), where the beams were selected based on oracle direction of arrival information.
In the proposed E2E-UFE framework, both the extraction and unmixing layers consisted of three contextual LSTM layers [29], each with 512 nodes and a dropout rate of 0.2. For better convergence, the unmixing and extraction networks were pre-trained individually before joint optimization. The same model architecture for unmixing and extraction was used for the UFE baseline. The log magnitude spectrum with an FFT size of 512 and a hop of 256 samples was used as the spectral feature for all networks. For the unmixing network, cosIPDs between three microphone pairs were extracted.

We used the Adam optimizer and trained both networks for a maximum of 80 epochs with weight decay. The early stopping strategy was used to avoid over-fitting. The initial learning rate was halved if no validation improvement was observed for two consecutive epochs. For the joint training in E2E-UFE, a smaller learning rate was used for fine-tuning.

All systems were evaluated in offline and block online setups. In the offline evaluation, the system was allowed to use the information from an entire utterance. That is, SSL and the attentional selection, i.e., Eqn. (3) and Eqn. (8), respectively, were performed using averages over the whole utterance. In the block online processing, a double buffering [20] scheme was applied, where each system estimated the output block-wise through time. Each evaluation block contained a two second window, with an additional two or four seconds of history information. The hop between two evaluation blocks was two seconds, resulting in an average latency of one second.

The word error rate (WER) was used as the performance metric. The ASR pipeline we used for decoding included a tri-gram language model and an acoustic model consisting of six layers of 512-element layer trajectory LSTM [30]. The acoustic model was trained with maximum mutual information (MMI) [31] on 30k hours of noise-corrupted data.

Table 2: WER (%) performance in the online evaluation.

Method (history)   simu OV35   simu OV75   semi-real OV35   semi-real OV75
UFE (2s)           24.10       31.40       44.05            45.13
UFE (4s)           23.66       28.85       43.49            44.06
E2E-UFE (2s)       17.50       19.43       38.64            39.98
E2E-UFE (4s)       –           –           –                –
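As a simplified illustration of the block online setup, the sketch below processes the waveform in two second blocks with a fixed history context and keeps only the newest block of each output. It assumes the speaker ordering stays consistent across blocks, whereas the actual double buffering scheme of [20] handles the block-to-block stitching explicitly; `separate_fn` is a placeholder for the UFE or E2E-UFE front end.

```python
import numpy as np

def block_online_separation(wave, separate_fn, fs=16000, block_s=2.0, history_s=2.0):
    """Run a separation front end block-wise with a limited history context.

    wave: (M, n_samples) multi-channel input; separate_fn maps an (M, chunk_len)
    array to (num_spk, chunk_len) separated signals.
    """
    block = int(block_s * fs)
    history = int(history_s * fs)
    outputs = []
    for start in range(0, wave.shape[-1], block):
        ctx_start = max(0, start - history)
        chunk = wave[..., ctx_start:start + block]
        separated = separate_fn(chunk)                 # sees only current block + history
        outputs.append(separated[..., start - ctx_start:])  # keep the newest block only
    return np.concatenate(outputs, axis=-1)
```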
The offline evaluation results are shown in Table 1. Simple fixed beamforming (Mixed Beam) yielded a high WER even though it used the oracle DoA. The result of the clean beam sets the upper bound on the UFE performance. The proposed E2E-UFE system achieved performance comparable with the original UFE on the simulated data set, while demonstrating a clear performance advantage on the semi-real set, showing the efficacy of the end-to-end training scheme. Overall, E2E-UFE achieved 4.8% and 4.3% relative WER reductions over the UFE system on OV35 and OV75 of the semi-real set, respectively, reaching 33.89% and 35.92% WER.

Table 2 shows the block online evaluation results. E2E-UFE is robust to the different look-back configurations (a 2 s or 4 s history context), achieving only slightly worse results than in the offline evaluation on both datasets. On the simu set, E2E-UFE showed no significant degradation compared with the offline performance and achieved lower WERs than the original UFE. On the semi-real set, it brought about a 12.47% average relative WER reduction compared with the UFE system using a 2 s history context, while on the simu set, the relative reduction increased to 29.71%. By contrast, the original UFE showed a much larger performance degradation in the online evaluation, degrading from 16.44/18.55% to 24.10/31.40% on the simu set and from 35.60/37.54% to 44.05/43.15% on the semi-real set. One hypothesis for the robustness of E2E-UFE is that, during training, the E2E-UFE model is already optimized under imperfect beam selections, while for the original UFE, only the correct beams were selected as input. Another potential reason could be that the sparsification trick in [20] was not applied in either UFE or E2E-UFE, which might result in more energy leakage for the UFE system, while the E2E-UFE system does not suffer from this problem as all modules are jointly optimized.
5. Conclusion
In this paper, we proposed an end-to-end structure for multi-channel speech separation, named E2E-UFE, for robust ASR. It replaces the SSL module in the previously proposed UFE system with a small attention network and enables joint optimization of the unmixing and extraction networks. The experiments were conducted on two 2-speaker datasets (simulated and semi-real mixtures), and the performance was evaluated in both offline and online settings. The experimental results showed that E2E-UFE provided performance comparable with the UFE system in the offline situations and yielded an average relative WER reduction of 12.47% in block online processing.

6. References
[1] M. Kolbæk, D. Yu, Z.-H. Tan, and J. Jensen, "Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks," IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 25, no. 10, pp. 1901–1913, 2017.
[2] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, "Deep clustering: Discriminative embeddings for segmentation and separation," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 31–35.
[3] Z.-Q. Wang, K. Tan, and D. Wang, "Deep learning based phase reconstruction for speaker separation: A trigonometric perspective," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 71–75.
[4] Y. Luo and N. Mesgarani, "Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256–1266, 2019.
[5] T. Higuchi, N. Ito, S. Araki, T. Yoshioka, M. Delcroix, and T. Nakatani, "Online MVDR beamformer based on complex Gaussian mixture model with spatial prior for noise robust ASR," IEEE Transactions on Audio, Speech, and Language Processing, vol. 25, no. 4, pp. 780–793, 2017.
[6] K. Zmolikova, M. Delcroix, K. Kinoshita, T. Higuchi, A. Ogawa, and T. Nakatani, "Speaker-aware neural network based beamformer for speaker extraction in speech mixtures," in Interspeech, 2017, pp. 2655–2659.
[7] L. Drude and R. Haeb-Umbach, "Tight integration of spatial and spectral features for BSS with deep clustering embeddings," in Interspeech, 2017, pp. 2650–2654.
[8] Z. Chen, X. Xiao, T. Yoshioka, H. Erdogan, J. Li, and Y. Gong, "Multi-channel overlapped speech recognition with location guided speech extraction network," in IEEE Spoken Language Technology Workshop (SLT), 2018, pp. 558–565.
[9] C. Boeddeker, J. Heitkaemper, J. Schmalenstroeer, L. Drude, J. Heymann, and R. Haeb-Umbach, "Front-end processing for the CHiME-5 dinner party scenario," in CHiME5 Workshop, Hyderabad, India, 2018.
[10] J. Wu, Y. Xu, S.-X. Zhang, L.-W. Chen, M. Yu, L. Xie, and D. Yu, "Improved speaker-dependent separation for CHiME-5 challenge," arXiv preprint arXiv:1904.03792, 2019.
[11] T. Yoshioka, H. Erdogan, Z. Chen, X. Xiao, and F. Alleva, "Recognizing overlapped speech in meetings: A multichannel separation approach using neural networks," arXiv preprint arXiv:1810.03655, 2018.
[12] D. Yu, X. Chang, and Y. Qian, "Recognizing multi-talker speech with permutation invariant training," arXiv preprint arXiv:1704.01985, 2017.
[13] Z. Chen, J. Droppo, J. Li, and W. Xiong, "Progressive joint modeling in unsupervised single-channel overlapped speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 26, no. 1, pp. 184–196, 2018.
[14] M. W. Lam, J. Wang, X. Liu, H. Meng, D. Su, and D. Yu, "Extract, adapt and recognize: An end-to-end neural network for corrupted monaural speech recognition," in Interspeech, 2019, pp. 2778–2782.
[15] N. Kanda, S. Horiguchi, R. Takashima, Y. Fujita, K. Nagamatsu, and S. Watanabe, "Auxiliary interference speaker loss for target-speaker speech recognition," arXiv preprint arXiv:1906.10876, 2019.
[16] X. Chang, Y. Qian, K. Yu, and S. Watanabe, "End-to-end monaural multi-speaker ASR system without pretraining," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 6256–6260.
[17] M. Delcroix, S. Watanabe, T. Ochiai, K. Kinoshita, S. Karita, A. Ogawa, and T. Nakatani, "End-to-end SpeakerBeam for single channel target speech recognition," in Interspeech, 2019, pp. 451–455.
[18] N. Kanda, Y. Gaur, X. Wang, Z. Meng, and T. Yoshioka, "Serialized output training for end-to-end overlapped speech recognition," arXiv preprint arXiv:2003.12687, 2020.
[19] T. Yoshioka, I. Abramovski et al., "Advances in online audio-visual meeting transcription," in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019.
[20] T. Yoshioka, Z. Chen, C. Liu, X. Xiao, H. Erdogan, and D. Dimitriadis, "Low-latency speaker-independent continuous speech separation," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 6980–6984.
[21] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, "SDR – half-baked or well done?" in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 626–630.
[22] Z.-Q. Wang and D. Wang, "On spatial features for supervised speech separation and its application to beamforming and robust ASR," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5709–5713.
[23] W. Minhua, K. Kumatani, S. Sundaram, N. Ström, and B. Hoffmeister, "Frequency domain multi-channel acoustic modeling for distant speech recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 6640–6644.
[24] C. Trabelsi, O. Bilaniuk, Y. Zhang, D. Serdyuk, S. Subramanian, J. F. Santos, S. Mehri, N. Rostamzadeh, Y. Bengio, and C. J. Pal, "Deep complex networks," arXiv preprint arXiv:1705.09792, 2017.
[25] P. Wang, Z. Chen, X. Xiao, Z. Meng, T. Yoshioka, T. Zhou, L. Lu, and J. Li, "Speech separation using speaker inventory," in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019, pp. 230–236.
[26] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "LibriSpeech: An ASR corpus based on public domain audio books," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210.
[27] J. B. Allen and D. A. Berkley, "Image method for efficiently simulating small-room acoustics," The Journal of the Acoustical Society of America, vol. 65, no. 4, pp. 943–950, 1979.
[28] E. A. Habets and S. Gannot, "Generating sensor signals in isotropic noise fields," The Journal of the Acoustical Society of America, vol. 122, no. 6, pp. 3464–3470, 2007.
[29] J. Li, L. Lu, C. Liu, and Y. Gong, "Improving layer trajectory LSTM with future context frames," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 6550–6554.
[30] J. Li, C. Liu, and Y. Gong, "Layer trajectory LSTM," arXiv preprint arXiv:1808.09522, 2018.
[31] K. Veselý, A. Ghoshal, L. Burget, and D. Povey, "Sequence-discriminative training of deep neural networks," in Interspeech, 2013.