Hypothesis Stitcher for End-to-End Speaker-attributed ASR on Long-form Multi-talker Recordings
Xuankai Chang†∗, Naoyuki Kanda∗, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Takuya Yoshioka

Center for Language and Speech Processing, Johns Hopkins University, USA
Microsoft Corp., USA

† Work performed during internship at Microsoft. ∗ Equal contribution.
ABSTRACT
An end-to-end (E2E) speaker-attributed automatic speech recognition (SA-ASR) model was proposed recently to jointly perform speaker counting, speech recognition and speaker identification. The model achieved a low speaker-attributed word error rate (SA-WER) for monaural overlapped speech comprising an unknown number of speakers. However, the E2E modeling approach is susceptible to the mismatch between the training and testing conditions. It has yet to be investigated whether the E2E SA-ASR model works well for recordings that are much longer than samples seen during training. In this work, we first apply a known decoding technique that was developed to perform single-speaker ASR for long-form audio to our E2E SA-ASR task. Then, we propose a novel method using a sequence-to-sequence model, called hypothesis stitcher. The model takes multiple hypotheses obtained from short audio segments that are extracted from the original long-form input, and it then outputs a fused single hypothesis. We propose several architectural variations of the hypothesis stitcher model and compare them with the conventional decoding methods. Experiments using the LibriSpeech and LibriCSS corpora show that the proposed method significantly improves SA-WER especially for long-form multi-talker recordings.
Index Terms — Hypothesis stitcher, speech recognition, speaker identification, rich transcription
1. INTRODUCTION
Speaker-attributed automatic speech recognition (SA-ASR) for overlapped speech has long been studied to realize meeting transcription [1–3]. It requires counting the number of speakers, transcribing utterances, and diarizing or identifying the speaker of each utterance from multi-talker recordings that may contain overlapping utterances. Despite the significant progress that has been made especially for multi-microphone settings (e.g., [4–8]), SA-ASR remains very challenging when we can only access monaural audio. One typical approach is a modular approach combining individual modules, such as speech separation, speaker counting, ASR and speaker diarization/identification. However, since different modules are designed based on different criteria, the simple combination does not necessarily result in an optimal solution for the SA-ASR task.

An end-to-end (E2E) SA-ASR model was recently proposed in [9] for monaural multi-talker speech as a joint model of speaker counting, speech recognition, and speaker identification. Unlike prior studies in joint SA-ASR modeling [10–12], it unifies the key components of SA-ASR, i.e., speaker counting, speech recognition, and speaker identification, and thus can handle speech consisting of any number of speakers. The model greatly improved the speaker-attributed word error rate (SA-WER) over the modular system. [13] also showed that the E2E SA-ASR model could be used even without prior speaker knowledge by applying speaker counting and speaker clustering to the model's internal embeddings.

While promising results have been reported, it is still unclear whether the E2E SA-ASR model works well for audio recordings that are much longer than those seen during training. Our preliminary analysis using in-house real conversation data revealed that a considerable number of long segments existed even after applying voice activity detection to the original long-form recordings. Note that, for a conventional single-speaker E2E ASR, it is known that long-form audio input causes significant accuracy degradation due to the mismatch of the training and testing conditions [14–16]. To mitigate this, [15] proposed an "overlapping inference" algorithm in which the long-form audio was segmented by sliding windows with a 50% shift and the hypotheses from each window were fused to generate one long hypothesis with an edit-distance-based heuristic. Although the overlapping inference could be applied to multi-talker recordings, the effectiveness of the method could be negatively impacted by speaker recognition errors. In addition, the decoding cost in the overlapping inference is two times higher due to the overlap of the adjacent window positions.

With this as a background, in this paper, we propose a novel method using a sequence-to-sequence model, called hypothesis stitcher, which takes multiple hypotheses obtained from a sliding window and outputs a fused single hypothesis. We propose several architectural variants of the hypothesis stitcher model and compare them with the conventional decoding methods. Note that, while we propose and evaluate the hypothesis stitcher in the context of SA-ASR, the idea is directly applicable to a standard ASR task. In our evaluation using the LibriSpeech [17] and LibriCSS [18] corpora, we show that the proposed method significantly improves SA-WER over the previous methods. We also show that the hypothesis stitcher can work with a sliding window with less than 50% overlap and thereby reduce the decoding cost.
2. REVIEW OF RELEVANT TECHNIQUES
In this section, we review the techniques relevant to the proposed method. Due to the page limitation, we only give an overview of each technique. Refer to the original papers for further details.
2.1. End-to-End SA-ASR

The E2E SA-ASR was proposed as a joint model of speaker counting, speech recognition and speaker identification for monaural overlapped audio [9]. The inputs to the model are acoustic features $X = \{x_1, \dots, x_T\}$ and speaker profiles $D = \{d_1, \dots, d_K\}$, where $T$ is the length of the acoustic feature sequence and $K$ is the number of speaker profiles. The model outputs the multi-speaker transcription $Y = \{y_1, \dots, y_N\}$ and the corresponding speaker labels $S = \{s_1, \dots, s_N\}$ for each output token, where $N$ is the output length. Following the idea of serialized output training (SOT) [19], the multi-speaker transcription $Y$ is represented by concatenating individual speakers' transcriptions interleaved by a special symbol $\langle sc \rangle$ which represents the speaker change.

The E2E SA-ASR consists of two interdependent blocks: an ASR block and a speaker identification block. The ASR block is represented as follows:

$$E^{\mathrm{enc}} = \mathrm{AsrEncoder}(X), \quad (1)$$
$$c_n, \alpha_n = \mathrm{Attention}(u_{n-1}, \alpha_{n-1}, E^{\mathrm{enc}}), \quad (2)$$
$$u_n = \mathrm{DecoderRNN}(y_{n-1}, c_{n-1}, u_{n-1}), \quad (3)$$
$$o_n = \mathrm{DecoderOut}(c_n, u_n, \bar{d}_n). \quad (4)$$

Firstly, AsrEncoder maps $X$ into a sequence of hidden representations $E^{\mathrm{enc}} = \{h^{\mathrm{enc}}_1, \dots, h^{\mathrm{enc}}_T\}$ (Eq. (1)). Then, an attention module computes attention weights $\alpha_n = \{\alpha_{n,1}, \dots, \alpha_{n,T}\}$ and a context vector $c_n$ as an attention-weighted average of $E^{\mathrm{enc}}$ (Eq. (2)). The DecoderRNN then computes the hidden state vector $u_n$ (Eq. (3)). Finally, DecoderOut computes the distribution of the output token $o_n$ based on the context vector $c_n$, the decoder state vector $u_n$, and the weighted average of the speaker profiles $\bar{d}_n$ (Eq. (4)). Note that $\bar{d}_n$ is computed in the speaker identification block, which is explained in the next paragraph. The posterior probability of token $i$ (i.e., the $i$-th token in the dictionary) at the $n$-th decoder step is represented as

$$P(y_n = i \mid y_{1:n-1}, s_{1:n}, X, D) \sim o_{n,i}, \quad (5)$$

where $o_{n,i}$ represents the $i$-th element of $o_n$.

Meanwhile, the speaker identification block works as follows:

$$E^{\mathrm{spk}} = \mathrm{SpeakerEncoder}(X), \quad (6)$$
$$p_n = \sum_{t=1}^{T} \alpha_{n,t} h^{\mathrm{spk}}_t, \quad (7)$$
$$q_n = \mathrm{SpeakerQueryRNN}(y_{n-1}, p_n, q_{n-1}), \quad (8)$$
$$\beta_n = \mathrm{InventoryAttention}(q_n, D), \quad (9)$$
$$\bar{d}_n = \sum_{k=1}^{K} \beta_{n,k} d_k. \quad (10)$$

Firstly, SpeakerEncoder converts the input $X$ to a sequence of hidden representations $E^{\mathrm{spk}} = \{h^{\mathrm{spk}}_1, \dots, h^{\mathrm{spk}}_T\}$, i.e., speaker embeddings (Eq. (6)). Then, by using the attention weights $\alpha_n$, a speaker context vector $p_n$ is computed for each output token (Eq. (7)). The SpeakerQueryRNN module then generates a speaker query $q_n$ by taking $p_n$ as an input (Eq. (8)). Given the speaker query $q_n$, an InventoryAttention module estimates the attention weights $\beta_n = \{\beta_{n,1}, \dots, \beta_{n,K}\}$ for each speaker profile $d_k$ in $D$ (Eq. (9)). Finally, $\bar{d}_n$ is obtained by calculating the weighted sum of the speaker profiles using $\beta_n$ (Eq. (10)), which is input to the ASR block. In this formulation, the attention weight $\beta_{n,k}$ can be seen as the posterior probability of person $k$ speaking the $n$-th token given all the previous tokens and speakers as well as $X$ and $D$, i.e.,

$$P(s_n = k \mid y_{1:n-1}, s_{1:n-1}, X, D) \sim \beta_{n,k}. \quad (11)$$

With these model components, all the E2E SA-ASR parameters are trained by maximizing $\log P(Y, S \mid X, D)$, which is defined as

$$\log P(Y, S \mid X, D) = \log \prod_{n=1}^{N} \left\{ P(y_n \mid y_{1:n-1}, s_{1:n}, X, D) \cdot P(s_n \mid y_{1:n-1}, s_{1:n-1}, X, D)^{\gamma} \right\},$$

where $\gamma$ is a scaling parameter. Decoding with the E2E SA-ASR model is conducted by an extended beam search. Refer to [9] for further details.
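To make the interaction between the two blocks concrete, below is a minimal NumPy sketch of a single decoder step in the spirit of Eqs. (1)-(10). It is not the model of [9]: the encoders, recurrences, and output layer are replaced by fixed random projections and simple non-linearities, so it only illustrates the data flow (shared frame attention, inventory attention, and the weighted speaker profile fed back into the token posterior).

```python
# Minimal sketch of one SA-ASR-style decoder step (illustrative, not the model of [9]).
import numpy as np

rng = np.random.default_rng(0)
T, K, H, V = 50, 4, 16, 100          # frames, speaker profiles, hidden dim, vocab size

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

X = rng.normal(size=(T, 8))          # acoustic features x_1..x_T
D = rng.normal(size=(K, H))          # speaker profiles d_1..d_K

# Eq. (1) and Eq. (6): encoders (here: fixed random projections)
W_enc, W_spk = rng.normal(size=(8, H)), rng.normal(size=(8, H))
E_enc = X @ W_enc                     # h^enc_1..T
E_spk = X @ W_spk                     # h^spk_1..T

u_prev = rng.normal(size=H)           # previous decoder state u_{n-1}
q_prev = rng.normal(size=H)           # previous speaker query q_{n-1}

# Eq. (2): attention over encoder frames -> weights alpha_n and context c_n
alpha_n = softmax(E_enc @ u_prev / np.sqrt(H))
c_n = alpha_n @ E_enc

# Eq. (7): speaker context p_n reuses the same frame weights alpha_n
p_n = alpha_n @ E_spk

# Eq. (8): speaker query (here a simple recurrent-style mix of p_n and q_{n-1})
q_n = np.tanh(p_n + q_prev)

# Eq. (9): attention over the speaker inventory -> beta_n (speaker posterior, Eq. (11))
beta_n = softmax(D @ q_n / np.sqrt(H))

# Eq. (10): weighted speaker profile fed back to the ASR block
d_bar_n = beta_n @ D

# Eqs. (3)-(5): decoder state and token posterior from c_n, u_n, d_bar_n
u_n = np.tanh(c_n + u_prev)
W_out = rng.normal(size=(3 * H, V))
o_n = softmax(np.concatenate([c_n, u_n, d_bar_n]) @ W_out)

print("predicted token:", o_n.argmax(), " predicted speaker:", beta_n.argmax())
```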
2.2. Overlapping Inference

Overlapping inference was proposed for conventional single-speaker ASR systems to deal with long-form speech [15]. With the overlapping inference, the input audio is first broken into fixed-length segments. There is an overlap between every two consecutive segments so that the information loss around the segment boundaries can be recovered from the overlapping counterpart. Given the word hypothesis $\hat{Y}^m = \{y^m_1, \dots, y^m_{N_m}\}$, where $y^j_i$ represents the $i$-th recognized word in the $j$-th segment and $N_m$ is the hypothesis length of segment $m$, we can generate two hypothesis sequences by concatenating all the odd segments or all the even segments as follows:

$$\hat{Y}_o = \{\dots, y^{m-1}_1, \dots, y^{m-1}_{N_{m-1}}, y^{m+1}_1, \dots, y^{m+1}_{N_{m+1}}, \dots\},$$
$$\hat{Y}_e = \{\dots, y^{m}_1, \dots, y^{m}_{N_{m}}, y^{m+2}_1, \dots, y^{m+2}_{N_{m+2}}, \dots\}.$$

An edit-distance-based algorithm is then applied to align the odd and even sequences, $\hat{Y}_o$ and $\hat{Y}_e$. To avoid aligning words from non-overlapped segments (e.g., segments $m$ and $m+2$), the distance between two words from non-overlapped segments is set to infinity. As a result, a sequence of word pairs $\langle o_1, e_1 \rangle, \langle o_2, e_2 \rangle, \dots, \langle o_L, e_L \rangle$ is generated:

$$\langle o_i, e_i \rangle =
\begin{cases}
\langle y^o_{p_o,q_o}, y^e_{p_e,q_e} \rangle & \text{if } y^o_{p_o,q_o} \text{ is aligned with } y^e_{p_e,q_e}, \\
\langle \varnothing, y^e_{p_e,q_e} \rangle & \text{if } y^e_{p_e,q_e} \text{ has no alignment}, \\
\langle y^o_{p_o,q_o}, \varnothing \rangle & \text{if } y^o_{p_o,q_o} \text{ has no alignment},
\end{cases}$$

where $i$ is the pair index, $L$ is the total number of matched pairs, $y^o_{p_o,q_o}$ is the $p_o$-th word from the $q_o$-th segment in $\hat{Y}_o$, $y^e_{p_e,q_e}$ is the $p_e$-th word from the $q_e$-th segment in $\hat{Y}_e$, and $\varnothing$ denotes no prediction.

The final hypothesis is formed by selecting words from the alignment according to their confidence:

$$\hat{Y}^*_i = \begin{cases} o_i & \text{if } f(o_i) \geq f(e_i), \\ e_i & \text{otherwise}. \end{cases}$$

Here, $f(\cdot)$ denotes the function that computes the confidence value of a token. In [15], a simple heuristic of assigning a lower confidence value to words near the edges of each segment (and a higher confidence value to words near the center of each segment) was proposed and shown to be effective. With this heuristic, the confidence score for word $n$ from segment $m$, $y^m_n$, is defined as $f(y^m_n) = -|n / C_m - 1/2|$, where $C_m$ denotes the number of words in window $m$ (a small code sketch at the end of this section illustrates this selection rule).

2.3. Overlapping Inference for SA-ASR

As a baseline system, we consider applying the overlapping inference to the SA-ASR task. Because the E2E SA-ASR model generates multiple speakers' transcriptions from each segment, we simply apply the overlapping inference for each speaker independently. Namely, we first group the hypotheses from each segment based on the speaker identities estimated by the E2E SA-ASR model. We then apply the overlapping inference to each speaker's hypotheses.

There are two possible problems in this procedure. Firstly, speaker misrecognition made by the model could cause confusion in the alignment process. Secondly, even when two hypotheses are observed from overlapping segments for the same speaker, there are cases where we should not merge the two hypotheses. Such a case happens when speakers A and B speak in the order A-B-A and the audio is broken into two overlapped segments such that the overlapped region contains only speaker B's speech. In this case, we observe hypotheses of "A-B" for the first segment and hypotheses of "B-A" for the second segment. We should not merge the two hypotheses for speaker A because they are not overlapped. However, avoiding these problems is not trivial.
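As referenced above, here is a small Python sketch of the confidence-based word selection used by overlapping inference, assuming the edit-distance alignment has already produced the word pairs. The function and variable names are ours, not from [15].

```python
def confidence(word_index, words_in_window):
    """f(y^m_n) = -|n / C_m - 1/2|: highest for words near the window center."""
    return -abs(word_index / words_in_window - 0.5)

def merge_aligned_pairs(pairs):
    """pairs: list of (odd, even), where each side is a tuple
    (word, index_in_window, words_in_window) or None if unaligned."""
    merged = []
    for odd, even in pairs:
        if odd is None:                       # <empty, e_i>
            merged.append(even[0])
        elif even is None:                    # <o_i, empty>
            merged.append(odd[0])
        else:                                 # keep the copy decoded nearer a window center
            f_o = confidence(odd[1], odd[2])
            f_e = confidence(even[1], even[2])
            merged.append(odd[0] if f_o >= f_e else even[0])
    return merged

# Toy example: words at the edge of the odd window lose to the copies
# decoded near the center of the even window.
pairs = [
    (("the", 9, 10), ("the", 4, 10)),
    (("cat", 10, 10), ("cat", 5, 10)),
    (None, ("sat", 6, 10)),
]
print(merge_aligned_pairs(pairs))             # ['the', 'cat', 'sat']
```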
3. HYPOTHESIS STITCHER

3.1. Overview
To handle long-form multi-talker recordings more effectively, we propose a new method using a sequence-to-sequence model, called hypothesis stitcher. The hypothesis stitcher consolidates multiple hypotheses obtained from short audio segments into a single coherent hypothesis. In our proposed approach, the hypothesis stitcher takes the multiple hypotheses of the $k$-th speaker, $\hat{Y}^k = \{\hat{Y}^{1,k}, \dots, \hat{Y}^{M,k}\}$, where $\hat{Y}^{m,k}$ represents the hypothesis of speaker $k$ for audio segment $m$. The hypothesis stitcher then outputs a fused single hypothesis $H^k$ given the input $\hat{Y}^k$. There are several possible architectures for the hypothesis stitcher, which we explain in the following subsections.

The entire procedure of applying the hypothesis stitcher is as follows. Firstly, as with the overlapping inference, a long multi-talker recording is broken up into $M$ fixed-length segments with overlaps. Then, the SA-ASR model is applied to each segment $m$ to estimate the hypotheses $\{\hat{Y}^{m,1}, \dots, \hat{Y}^{m,K}\}$, where $K$ is the number of profiles in the speaker inventory $D$. After performing the E2E SA-ASR for all segments, the hypotheses are grouped by speaker to form $\hat{Y}^k$. Here, if speaker $k$ is not detected in segment $m$, $\hat{Y}^{m,k}$ is set to empty. Finally, the hypothesis stitcher is run for each speaker to estimate a fused hypothesis $H^k$ (a code sketch of this grouping and serialization is shown at the end of this section).

In this procedure, the input $\hat{Y}^k$ could suffer from the same problems that we discussed for the overlapping inference in Section 2.3. However, we expect the sequence-to-sequence model to work more robustly if it is trained appropriately. In addition, as an important difference from the overlapping inference, which requires adjacent segments to overlap by 50%, variants of the hypothesis stitcher can work on segments with less than 50% overlap. This is important because the smaller the overlap, the lower the decoding cost. (We note a recent study that used time-alignment information to reduce the overlap of segments [20]; however, it requires precise time-alignment information and is not necessarily applicable to all systems.) In the next subsections, we explain the variants of the hypothesis stitcher that we examined in our experiments.

3.2. Alignment-based Stitcher

We propose an alignment-based stitcher as an extension of the overlapping inference. In this method, the input audio is segmented by a sliding window with 50% overlap as with the overlapping inference. Then, the hypotheses from the odd-numbered segments and the even-numbered segments are each joined to yield two sequences:

$$\hat{Y}^{o,k}_{wc} = \{\hat{Y}^{1,k}, \langle \mathrm{WC} \rangle, \hat{Y}^{3,k}, \langle \mathrm{WC} \rangle, \dots, \hat{Y}^{m-1,k}, \langle \mathrm{WC} \rangle, \dots\},$$
$$\hat{Y}^{e,k}_{wc} = \{\hat{Y}^{2,k}, \langle \mathrm{WC} \rangle, \hat{Y}^{4,k}, \langle \mathrm{WC} \rangle, \dots, \hat{Y}^{m,k}, \langle \mathrm{WC} \rangle, \dots\}.$$

Here, we introduce a special window change symbol, $\langle \mathrm{WC} \rangle$, to indicate the boundary of each segment in the concatenated hypotheses. Next, by using the same algorithm as the overlapping inference, we align $\hat{Y}^{o,k}_{wc}$ and $\hat{Y}^{e,k}_{wc}$ to generate a sequence of word pairs $\langle o_1, e_1 \rangle, \langle o_2, e_2 \rangle, \dots, \langle o_L, e_L \rangle$, where $o_l$ and $e_l$ can be $\langle \mathrm{WC} \rangle$. Finally, this word pair sequence is input to the hypothesis stitcher model to estimate the fused single hypothesis $H^k$. In this paper, we represent the hypothesis stitcher by a transformer-based attention encoder-decoder [21], and we simply concatenate the embeddings of $o_l$ and $e_l$ for each position $l$ to form the input to the encoder.

In the overlapping inference, the word-position-based confidence function $f(\cdot)$ is used to select the word from the aligned word pairs.
On the other hand, in the alignment-based stitcher, we expect better word selection to be performed by the sequence-to-sequence model, provided the model is trained on appropriate data.

3.3. Serialized Stitcher

We also propose another architecture, called a serialized stitcher, which we find to be effective while being much simpler than the alignment-based stitcher. In the serialized stitcher, we simply join all hypotheses of the $k$-th speaker from all short segments as

$$\hat{Y}^{k}_{wc} = \{\hat{Y}^{1,k}, \langle \mathrm{WC} \rangle, \hat{Y}^{2,k}, \langle \mathrm{WC} \rangle, \hat{Y}^{3,k}, \langle \mathrm{WC} \rangle, \dots, \hat{Y}^{M,k}\}.$$

Then, $\hat{Y}^{k}_{wc}$ is fed into the hypothesis stitcher to estimate the fused single hypothesis. We again use a transformer-based attention encoder-decoder to represent the hypothesis stitcher.

We also examine a variant of the serialized stitcher in which we insert two different symbols, $\langle \mathrm{WCO} \rangle$ and $\langle \mathrm{WCE} \rangle$, to explicitly indicate the odd-numbered and even-numbered segments:

$$\hat{Y}^{k}_{wco/e} = \{\hat{Y}^{1,k}, \langle \mathrm{WCO} \rangle, \hat{Y}^{2,k}, \langle \mathrm{WCE} \rangle, \hat{Y}^{3,k}, \langle \mathrm{WCO} \rangle, \dots, \hat{Y}^{M,k}\}.$$

Note that the serialized stitcher no longer requires the short segments to be 50% overlapped because there is no alignment procedure. In Section 4, we examine how the overlap ratio between adjacent segments affects the accuracy of the serialized stitcher.
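To make the construction of the stitcher input concrete, the following is a minimal Python sketch of how per-segment SA-ASR hypotheses could be grouped by speaker and serialized with the window-change symbols. The helper names, the dictionary-based segment format, and the symbol spellings are our own illustrative choices, not the paper's implementation.

```python
# Sketch of grouping per-segment hypotheses by speaker and building the
# serialized stitcher input (Section 3.1 procedure + Section 3.3 variants).
WC, WCO, WCE = "<WC>", "<WCO>", "<WCE>"

def group_by_speaker(segment_hyps, speaker_ids):
    """segment_hyps: one dict per window, in time order, {speaker_id: hypothesis}."""
    grouped = {k: [] for k in speaker_ids}
    for hyps in segment_hyps:
        for k in speaker_ids:
            grouped[k].append(hyps.get(k, ""))   # empty if speaker k not detected
    return grouped

def serialize(per_segment_hyps, alternate_symbols=False):
    """Join one speaker's segment hypotheses with <WC> (or alternating <WCO>/<WCE>)."""
    pieces = []
    for m, hyp in enumerate(per_segment_hyps):
        pieces.append(hyp)
        if m < len(per_segment_hyps) - 1:
            if alternate_symbols:
                pieces.append(WCO if m % 2 == 0 else WCE)
            else:
                pieces.append(WC)
    return " ".join(p for p in pieces if p != "").strip()

segment_hyps = [{"spk1": "hello there", "spk2": "how are"},
                {"spk2": "are you doing"},
                {"spk1": "i am fine"}]
grouped = group_by_speaker(segment_hyps, ["spk1", "spk2"])
print(serialize(grouped["spk1"]))
# hello there <WC> <WC> i am fine
print(serialize(grouped["spk2"], alternate_symbols=True))
# how are <WCO> are you doing <WCE>
```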
4. EXPERIMENTS
We first conduct a basic evaluation of the hypothesis stitcher by using simulated mixtures of LibriSpeech utterances [17] in Section 4.1. The best model is then evaluated on LibriCSS [18], a corpus of noisy multi-talker audio recorded in a real meeting room, in Section 4.2.
4.1. Evaluation on Simulated Data

Our experiments used the E2E SA-ASR model of [13], which was trained using LibriSpeech. The training data were generated by randomly mixing multiple utterances in "train_960". Up to 5 utterances were mixed for each sample, and the average duration was 29.4 sec. Refer to [13] for further details of the model and training configurations.

On top of the E2E SA-ASR model, we trained hypothesis stitcher models by using a 16-second-long sliding window. We first simulated monaural multi-talker data from LibriSpeech (see Table 1). Each long-form training sample was a mixture of multiple utterances randomly selected from "train_960" and consisted of up to 12 utterances spoken by 6 or fewer speakers. When the utterances were mixed, each utterance was shifted by a random delay to simulate partially overlapped conversational speech with a controlled average overlap ratio. All training samples were then broken into 16-second overlapped segments in a similar way to the settings of the overlapping inference paper [15], and they were decoded by the E2E SA-ASR model with the relevant speaker profiles. The generated hypotheses and the original reference labels were used as the input and output, respectively, to train the hypothesis stitcher.

Table 1. Summary of the clean simulated data set.

Set      Train       Dev (short)  Dev (long)  Test (short)  Test (long)
Source   train_960   dev_clean    dev_clean   test_clean    test_clean
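As a small, hedged illustration of the segmentation step, the sketch below computes window boundaries for a given overlap ratio; the function name and return format are ours, and only the 16-second window length is taken from the setup above.

```python
# Sketch of cutting a long recording into fixed-length overlapping windows
# before segment-wise SA-ASR decoding.
def sliding_windows(duration_s, window_s=16.0, overlap=0.5):
    """Return (start, end) times covering duration_s with the given overlap ratio."""
    shift = window_s * (1.0 - overlap)
    windows, start = [], 0.0
    while True:
        end = min(start + window_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:
            break
        start += shift
    return windows

# 60 s of audio: 50% overlap needs 7 decoding passes, 0% overlap only 4,
# which is the decoding-cost trade-off discussed later in this section.
print(len(sliding_windows(60.0, overlap=0.5)))   # 7
print(len(sliding_windows(60.0, overlap=0.0)))   # 4
```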
Table 2. SA-WER (%) on the clean simulated evaluation set.

Decoding method                 Segment overlap   Dev short   Dev long   Test short   Test long
No segmentation                 -                 14.2        17.2       13.1         21.1
Block-wise inference            0%                11.8        12.4       12.5         15.2
Overlapping inference           50%               12.0        12.8       12.7         15.2
Stitcher (alignment-based)      50%               11.4        12.0       12.3         14.5
Stitcher (serialized, WC)       50%               10.6        11.9       10.8         13.7
Stitcher (serialized, WCO/E)    50%               10.5        11.5       10.6         13.4

For all variants of the hypothesis stitcher, we used a transformer-based attention encoder-decoder with a 6-layer encoder (1,024-dim, 8 heads) and a 6-layer decoder (1,024-dim, 8 heads), following [21]. The model was trained with the Adam optimizer with a learning rate of 0.0005 until no improvement was observed for 3 consecutive epochs on the development set. We applied label smoothing and dropout with parameters of 0.1 and 0.3, respectively.

For evaluation, we generated two types of development and test sets: a short set with roughly 30-second-long samples and a long set with roughly 60-second-long samples (Table 1). As the evaluation metric, we used the SA-WER, which is calculated by comparing the ASR hypothesis with the reference transcription of each speaker.
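For clarity, the following is a simplified Python sketch of such per-speaker scoring; it pools word errors over all speakers and omits details of the official scoring (e.g., how hypotheses are assigned to reference speakers), so it should be read as an illustration rather than the exact metric implementation.

```python
# Simplified SA-WER sketch: per-speaker word errors pooled over speakers.
def word_errors(ref, hyp):
    """Word-level Levenshtein distance (substitutions + insertions + deletions)."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[-1][-1], len(r)

def sa_wer(refs_by_spk, hyps_by_spk):
    errs = words = 0
    for spk, ref in refs_by_spk.items():
        e, n = word_errors(ref, hyps_by_spk.get(spk, ""))
        errs, words = errs + e, words + n
    return errs / words

refs = {"spk1": "hello there", "spk2": "how are you doing"}
hyps = {"spk1": "hello there", "spk2": "how you doing"}
print(f"SA-WER = {sa_wer(refs, hyps):.1%}")   # one deletion out of six words -> 16.7%
```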
First, we evaluated the hypothesis stitcher using a sliding window with 50% overlap. Table 2 shows the results. As the baseline, we evaluated the E2E SA-ASR without segmentation, denoted as "No segmentation" in the table. We also evaluated a "Block-wise inference" method, in which the E2E SA-ASR was performed by using a non-overlapping 16-second sliding window. For this method, the estimated hypotheses from each window were naively concatenated. As shown in the table, we observed a significant improvement by this simple method, especially on the long evaluation set. This result suggests that the E2E SA-ASR model used for the evaluation did not generalize well to very long audio. We then evaluated the overlapping inference using 16-second segments with an 8-second segment shift. However, the overlapping inference showed slightly worse results than the simple block-wise inference. Note that, in a preliminary experiment, we confirmed that the overlapping inference achieved better accuracy than the block-wise inference when the recording contained only one speaker. Therefore, the degradation could have been caused by the two types of issues discussed in Section 2.3. Finally, the proposed hypothesis stitcher models were evaluated. We found that all variants of the hypothesis stitcher significantly outperformed all the baseline methods. We also found that the serialized stitcher with the two types of auxiliary symbols, ⟨WCO⟩ and ⟨WCE⟩, worked the best.
Table 3. SA-WER (%) with various overlap ratios of segments.
Decoding method                 Segment overlap   Dev short   Dev long   Test short   Test long
Stitcher (serialized, WCO/E)    0%                11.6        12.4       12.3         15.1
Stitcher (serialized, WCO/E)    25%               11.0        11.9       11.5         14.3
Stitcher (serialized, WCO/E)    50%               10.5        11.5       10.6         13.4
Table 4. SA-WER (%) on LibriCSS.
Method               Speech overlap ratio (%)                   Total
                     0S     0L     10     20     30     40
No segmentation

Given that the serialized stitcher yielded the best performance, we further conducted an evaluation using a sliding window with a smaller overlap ratio (i.e., with a larger stride of the window). The results are shown in Table 3. While the best performance was still obtained with the half-overlapping segments, the serialized stitcher worked well even with non-overlapping segments. In practice, the segment overlap ratio could be chosen by considering the balance between the computational cost and the accuracy.

4.2. Evaluation on LibriCSS

Finally, we evaluated the proposed method on the LibriCSS dataset [18]. The dataset consists of 10 hours of recordings of concatenated LibriSpeech utterances that were played back by multiple loudspeakers in a meeting room and captured by a seven-channel microphone array. We used only the first-channel data (i.e., monaural audio) for our experiments. Before applying the decoding procedure, we split the recordings at every silence region by an oracle voice activity detector (VAD) that uses the reference time information. Note that the audio set after VAD still contains a considerable amount of long-form multi-talker audio. Each recording consists of utterances of 8 speakers. When we applied the E2E SA-ASR, we fed the speaker profiles corresponding to the 8 speakers of each recording.

We used the serialized stitcher trained in the previous experiments on the segments obtained by a 50%-overlapping sliding window. The evaluation results are shown in Table 4. Following the convention in [18], we report the results for speech overlap ratios ranging from 0% to 40%. As shown in the table, we observed a large improvement when the overlap speech ratio was large (30–40%). This is because these test sets contained much more long-form audio due to the frequent utterance overlaps.
5. CONCLUSION
In this paper, we proposed the hypothesis stitcher to improve the accuracy of the E2E SA-ASR model on long-form audio. We proposed several variants of the model architecture for the hypothesis stitcher. The experimental results showed that one of the models, called the serialized stitcher, worked the best. We also showed that the hypothesis stitcher yielded good performance even with sliding windows with less than 50% overlap, which is desirable for reducing the computational cost of decoding.
6. REFERENCES

[1] J. G. Fiscus, J. Ajot, and J. S. Garofolo, "The rich transcription 2007 meeting recognition evaluation," in Multimodal Technologies for Perception of Humans. Springer, 2007, pp. 373–389.
[2] A. Janin, D. Baron, J. Edwards, D. Ellis, D. Gelbart, N. Morgan, B. Peskin, T. Pfau, E. Shriberg, A. Stolcke et al., "The ICSI meeting corpus," in Proc. ICASSP, vol. 1, 2003, pp. I–I.
[3] J. Carletta, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos, W. Kraaij, M. Kronenthal et al., "The AMI meeting corpus: A pre-announcement," in International Workshop on Machine Learning for Multimodal Interaction. Springer, 2005, pp. 28–39.
[4] N. Kanda, R. Ikeshita, S. Horiguchi, Y. Fujita, K. Nagamatsu, X. Wang, V. Manohar, N. E. Y. Soplin, M. Maciejewski, S.-J. Chen et al., "The Hitachi/JHU CHiME-5 system: Advances in speech recognition for everyday home environments using multiple microphone arrays," in Proc. CHiME-5, 2018, pp. 6–10.
[5] T. Yoshioka, I. Abramovski, C. Aksoylar, Z. Chen, M. David, D. Dimitriadis, Y. Gong, I. Gurvich, X. Huang, Y. Huang et al., "Advances in online audio-visual meeting transcription," in Proc. ASRU. IEEE, 2019, pp. 276–283.
[6] N. Kanda, C. Boeddeker, J. Heitkaemper, Y. Fujita, S. Horiguchi, K. Nagamatsu, and R. Haeb-Umbach, "Guided source separation meets a strong ASR backend: Hitachi/Paderborn University joint investigation for dinner party ASR," in Proc. Interspeech, 2019, pp. 1248–1252.
[7] S. Watanabe, M. Mandel, J. Barker, E. Vincent, A. Arora, X. Chang, S. Khudanpur, V. Manohar, D. Povey, D. Raj et al., "CHiME-6 challenge: Tackling multispeaker speech recognition for unsegmented recordings," in Proc. CHiME 2020, 2020.
[8] I. Medennikov, M. Korenevsky, T. Prisyach, Y. Khokhlov, M. Korenevskaya, I. Sorokin, T. Timofeeva, A. Mitrofanov, A. Andrusenko, I. Podluzhny et al., "The STC system for the CHiME-6 challenge," in CHiME 2020 Workshop on Speech Processing in Everyday Environments, 2020.
[9] N. Kanda, Y. Gaur, X. Wang, Z. Meng, Z. Chen, T. Zhou, and T. Yoshioka, "Joint speaker counting, speech recognition, and speaker identification for overlapped speech of any number of speakers," in Proc. Interspeech, 2020.
[10] L. El Shafey, H. Soltau, and I. Shafran, "Joint speech recognition and speaker diarization via sequence transduction," in Proc. Interspeech, 2019, pp. 396–400.
[11] H. H. Mao, S. Li, J. McAuley, and G. Cottrell, "Speech recognition and multi-speaker diarization of long conversations," arXiv preprint arXiv:2005.08072, 2020.
[12] N. Kanda, S. Horiguchi, Y. Fujita, Y. Xue, K. Nagamatsu, and S. Watanabe, "Simultaneous speech recognition and speaker diarization for monaural dialogue recordings with target-speaker acoustic models," in Proc. ASRU, 2019.
[13] N. Kanda, X. Chang, Y. Gaur, X. Wang, Z. Meng, Z. Chen, and T. Yoshioka, "Investigation of end-to-end speaker-attributed ASR for continuous multi-talker recordings," arXiv preprint arXiv:2008.04546, 2020.
[14] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, "Attention-based models for speech recognition," in Proc. NIPS, 2015, pp. 577–585.
[15] C.-C. Chiu, W. Han, Y. Zhang, R. Pang, S. Kishchenko, P. Nguyen, A. Narayanan, H. Liao, S. Zhang, A. Kannan et al., "A comparison of end-to-end models for long-form speech recognition," in Proc. ASRU. IEEE, 2019, pp. 889–896.
[16] A. Narayanan, R. Prabhavalkar, C.-C. Chiu, D. Rybach, T. N. Sainath, and T. Strohman, "Recognizing long-form speech using streaming end-to-end models," in Proc. ASRU, 2019, pp. 920–927.
[17] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in Proc. ICASSP. IEEE, 2015, pp. 5206–5210.
[18] Z. Chen, T. Yoshioka, L. Lu, T. Zhou, Z. Meng, Y. Luo, J. Wu, X. Xiao, and J. Li, "Continuous speech separation: Dataset and analysis," in Proc. ICASSP. IEEE, 2020, pp. 7284–7288.
[19] N. Kanda, Y. Gaur, X. Wang, Z. Meng, and T. Yoshioka, "Serialized output training for end-to-end overlapped speech recognition," in Proc. Interspeech, 2020.
[20] C.-C. Chiu, A. Narayanan, W. Han, R. Prabhavalkar, Y. Zhang, N. Jaitly, R. Pang, T. N. Sainath, P. Nguyen, L. Cao et al., "RNN-T models fail to generalize to out-of-domain audio: Causes and solutions," arXiv preprint arXiv:2005.03271, 2020.
[21] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Proc. NIPS, 2017, pp. 5998–6008.