Investigation of End-To-End Speaker-Attributed ASR for Continuous Multi-Talker Recordings
Naoyuki Kanda, Xuankai Chang*, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka
Microsoft Corp., USA; Johns Hopkins University, USA
*Work performed during an internship at Microsoft.
ABSTRACT
Recently, an end-to-end (E2E) speaker-attributed automatic speech recognition (SA-ASR) model was proposed as a joint model of speaker counting, speech recognition and speaker identification for monaural overlapped speech. It showed promising results for simulated speech mixtures consisting of various numbers of speakers. However, the model required prior knowledge of speaker profiles to perform speaker identification, which significantly limited the application of the model. In this paper, we extend the prior work by addressing the case where no speaker profile is available. Specifically, we perform speaker counting and clustering by using the internal speaker representations of the E2E SA-ASR model to diarize the utterances of the speakers whose profiles are missing from the speaker inventory. We also propose a simple modification to the reference labels of the E2E SA-ASR training which helps handle continuous multi-talker recordings well. We conduct a comprehensive investigation of the original E2E SA-ASR and the proposed method on the monaural LibriCSS dataset. Compared to the original E2E SA-ASR with relevant speaker profiles, the proposed method achieves a close performance without any prior speaker knowledge. We also show that the source-target attention in the E2E SA-ASR model provides information about the start and end times of the hypotheses.
Index Terms— Rich transcription, speech recognition, speaker identification, speaker diarization, serialized output training
1. INTRODUCTION
Speaker-attributed automatic speech recognition (SA-ASR), which recognizes "who spoke what", is essential to meeting transcription. SA-ASR requires counting the number of speakers, transcribing the utterances, and identifying or diarizing the speaker of each utterance from conversational recordings in which some utterances usually overlap. It has a long research history, from the projects in the early 2000's [1, 2, 3] to recent international efforts such as the CHiME [4, 5] and DIHARD [6, 7] challenges. While significant progress has been made especially in multi-microphone settings (e.g., [8, 9, 10, 11]), SA-ASR for monaural audio remains challenging due to the difficulty in handling overlapped speech for both ASR and speaker diarization/identification.

One dominant approach to SA-ASR is to apply speech separation (e.g., [12, 13, 14]) before ASR and speaker diarization/identification. However, a speech separation module is often designed and trained with a signal-level criterion and is therefore suboptimal for the downstream modules. To overcome this problem, joint modeling of multiple modules has been investigated from a variety of viewpoints. For example, a number of studies have investigated joint modeling of speech separation and ASR (e.g., [15, 16, 17, 18, 19, 20]). Several methods were also proposed for integrating speaker identification and speech separation [21, 22, 23]. A few studies attempted to improve speaker diarization by leveraging ASR results [24, 25].

However, only a limited number of research works investigated the joint modeling of all the modules necessary for SA-ASR. [26] proposed to generate transcriptions for different speakers interleaved by speaker role tags to recognize doctor-patient conversations based on a recurrent neural network transducer (RNN-T). Although promising results were shown, the method cannot deal with speech overlaps due to the monotonicity constraint of RNN-T. Furthermore, their method is difficult to extend to an arbitrary number of speakers because the target speaker roles need to be uniquely defined. In [27], the authors applied a technique similar to [26] by interleaving multiple utterances with speaker identity tags instead of speaker role tags. To handle speakers who were unseen in the training data, the authors used speaker identity tags from the training data even for the unseen test speakers, or they simply applied a separate speaker diarization module. However, their method showed severe degradation of ASR and speaker diarization accuracy when the oracle utterance boundaries were not used. [28] proposed a joint decoding framework for overlapped speech recognition and speaker diarization, where speaker embedding estimation and target-speaker ASR were performed alternately. While their formulation is applicable to any number of speakers, the method was actually implemented and evaluated in a way that could be used only for the two-speaker case, as target-speaker ASR was performed with an auxiliary output branch representing a single interference speaker [20].

Recently, an end-to-end (E2E) SA-ASR model was proposed as a joint model of speaker counting, speech recognition, and speaker identification for monaural (possibly) overlapped speech [29].
It was trained to maximize the joint probability of multi-talker speech recognition and speaker identification, and achieved a significantly lower speaker-attributed word error rate (SA-WER) than a system that separately performs overlapped speech recognition and speaker identification. However, the model only works with a speaker inventory that includes the profiles (i.e., embeddings) of all speakers involved in the input speech. This requirement strongly limits its application to real scenarios.

In this paper, we extend the previous E2E SA-ASR work to address the case where no speaker profile is available. Specifically, we propose to cluster the internal speaker representations of the E2E SA-ASR model to diarize the utterances of the speakers whose speaker profiles are not included in the speaker inventory. Combined with a silence-region detector, this also allows a very long-form signal spanning an entire meeting to be handled. We also propose a simple modification to the reference label construction for the E2E SA-ASR training to handle continuous multi-talker recordings more effectively. Comprehensive experimental results using the monaural LibriCSS dataset [30], consisting of eight-speaker sessions, show the effectiveness of the proposed method.
2. REVIEW: E2E SA-ASR

2.1. Overview
In this section, we review the E2E SA-ASR method proposed in [29]. The goal of this method is to estimate a multi-speaker transcription $Y = \{y_1, \dots, y_N\}$ and the speaker identity of each token $S = \{s_1, \dots, s_N\}$ given an acoustic input $X = \{x_1, \dots, x_T\}$ and a speaker inventory $D = \{d_1, \dots, d_K\}$. Here, $N$ is the number of output tokens, $T$ is the number of input frames, and $K$ is the number of speaker profiles (e.g., d-vectors [31]) in the inventory $D$. Following the idea of serialized output training (SOT) [32], the multi-speaker transcription $Y$ is represented by concatenating the individual speakers' transcriptions interleaved by a special symbol $\langle sc\rangle$ representing a speaker change.

In the E2E SA-ASR modeling, it is assumed that the profiles of all the speakers involved in the input speech are included in $D$. Note that, as long as this assumption holds, the speaker inventory may also include irrelevant speakers' profiles.
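As a concrete illustration of the SOT-style serialization described above, the following minimal Python sketch (our own illustration, not the authors' implementation) joins per-speaker token sequences in first-in first-out order of their start times with the $\langle sc\rangle$ symbol and terminates the sequence with $\langle eos\rangle$; the per-token speaker labels form the corresponding sequence $S$. Attaching the preceding speaker's label to the $\langle sc\rangle$/$\langle eos\rangle$ positions is an assumption of this sketch.

```python
# Minimal illustration (not the authors' code) of SOT-style serialization:
# speakers' token sequences are concatenated in FIFO order of their start
# times, joined by the speaker-change symbol <sc> and ended with <eos>.
SC, EOS = "<sc>", "<eos>"

def serialize_sot(utterances):
    """utterances: list of (start_time, speaker_id, list_of_tokens)."""
    ordered = sorted(utterances, key=lambda u: u[0])  # sort by start time (FIFO)
    tokens, speakers = [], []
    for i, (_, spk, toks) in enumerate(ordered):
        sep = SC if i < len(ordered) - 1 else EOS     # <sc> between speakers, <eos> at the end
        tokens += toks + [sep]
        speakers += [spk] * (len(toks) + 1)           # token-level speaker labels S
    return tokens, speakers

toks, spks = serialize_sot([(0.0, "spk_A", ["hello", "world"]),
                            (1.2, "spk_B", ["good", "morning"])])
print(toks)  # ['hello', 'world', '<sc>', 'good', 'morning', '<eos>']
print(spks)  # ['spk_A', 'spk_A', 'spk_A', 'spk_B', 'spk_B', 'spk_B']
```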
Figure 1 shows the architecture of the E2E SA-ASR model. It consists of ASR-related blocks (shown in green in the figure) and speaker identification-related blocks (shown in yellow). The computation consists of the following five steps.

Fig. 1. E2E SA-ASR model.

First, given the acoustic input $X$, an ASR encoder converts $X$ into a sequence $H^{enc}$ of embeddings for ASR, i.e.,

$H^{enc} = \{h^{enc}_1, \dots, h^{enc}_T\} = \mathrm{AsrEncoder}(X).$   (1)

At the same time, a speaker encoder converts $X$ into a sequence $H^{spk}$ of embeddings representing the speaker characteristics of the input $X$:

$H^{spk} = \{h^{spk}_1, \dots, h^{spk}_T\} = \mathrm{SpeakerEncoder}(X).$   (2)

Second, at each decoder step $n$, an attention module generates attention weights $\alpha_n = \{\alpha_{n,1}, \dots, \alpha_{n,T}\}$ as

$\alpha_n = \mathrm{Attention}(u_n, \alpha_{n-1}, H^{enc}),$   (3)
$u_n = \mathrm{DecoderRNN}(y_{n-1}, c_{n-1}, u_{n-1}),$   (4)

where $u_n$ is the decoder state vector at the $n$-th step and $c_{n-1}$ is the context vector at the previous step.

Third, the context vector $c_n$ for the current decoder step $n$ is generated as a weighted sum of the encoder embeddings:

$c_n = \sum_{t=1}^{T} \alpha_{n,t} h^{enc}_t.$   (5)

Fourth, at every decoder step $n$, the attention weight $\alpha_n$ is also applied to $H^{spk}$ to extract an attention-weighted average $p_n$ of the speaker embeddings:

$p_n = \sum_{t=1}^{T} \alpha_{n,t} h^{spk}_t.$   (6)

Note that $p_n$ could be contaminated by interfering speech because some time frames include two or more speakers. The speaker query RNN in Fig. 1 then generates a speaker query $q_n$ given the speaker embedding $p_n$, the previous output $y_{n-1}$, and the previous speaker query $q_{n-1}$, i.e.,

$q_n = \mathrm{SpeakerQueryRNN}(p_n, y_{n-1}, q_{n-1}).$   (7)

With the speaker query $q_n$, an attention module for the speaker inventory (shown as InventoryAttention in the diagram) estimates an attention weight $\beta_{n,k}$ for each profile in $D$:

$b_{n,k} = \frac{q_n \cdot d_k}{|q_n|\,|d_k|},$   (8)
$\beta_{n,k} = \frac{\exp(b_{n,k})}{\sum_{j=1}^{K} \exp(b_{n,j})}.$   (9)

The attention weight $\beta_{n,k}$ can be seen as the posterior probability of person $k$ speaking the $n$-th token given all the previous tokens and speakers as well as $X$ and $D$, i.e.,

$\Pr(s_n = k \mid y_{1:n-1}, s_{1:n-1}, X, D) \sim \beta_{n,k}.$   (10)

An attention-weighted speaker profile $\bar{d}_n$ is also calculated from the attention weights $\beta_{n,k}$ and the input profiles $d_k$ as

$\bar{d}_n = \sum_{k=1}^{K} \beta_{n,k} d_k.$   (11)

Finally (Step 5: ASR using context and speaker vectors), the output distribution for $y_n$ is estimated given the context vector $c_n$, the decoder state vector $u_n$, and the weighted speaker vector $\bar{d}_n$ as follows:

$\Pr(y_n \mid y_{1:n-1}, s_{1:n}, X, D) \sim \mathrm{DecoderOut}(c_n, u_n, \bar{d}_n) = \mathrm{Softmax}(W^{out} \cdot \mathrm{LSTM}(c_n + u_n + W^{d}\bar{d}_n)).$   (12)

Here, it is assumed that $c_n$ and $u_n$ have the same dimensionality, and $W^{d}$ is a matrix that changes the dimension of $\bar{d}_n$ to that of $c_n$. $W^{out}$ is the affine transformation matrix of the final layer. Typically, DecoderOut consists of a single affine transform with a softmax output layer. However, in this work, we insert one LSTM just before the affine transform, as it improves the efficacy of the SOT model as shown in [32].
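To make Eqs. (5)–(11) concrete, the NumPy sketch below performs one decoder step of the speaker branch under simplifying assumptions: the speaker query $q_n$ is passed in rather than produced by the SpeakerQueryRNN of Eq. (7), and all tensors are random placeholders. It illustrates the equations; it is not the released model code.

```python
# One decoder step of the speaker branch following Eqs. (5)-(11): the ASR
# attention weights alpha_n are reused to pool frame-level speaker embeddings,
# a speaker query attends over the inventory with cosine similarity, and a
# weighted profile d_bar_n is formed. Illustrative sketch only.
import numpy as np

def speaker_branch_step(alpha_n, H_enc, H_spk, q_n, D):
    """alpha_n: (T,) attention weights; H_enc: (T, E); H_spk: (T, S);
    q_n: (S,) speaker query for this step; D: (K, S) speaker inventory."""
    c_n = alpha_n @ H_enc                                      # Eq. (5): context vector
    p_n = alpha_n @ H_spk                                      # Eq. (6): pooled speaker embedding
    # Eq. (8): cosine similarity between the query and each profile
    b_n = (D @ q_n) / (np.linalg.norm(D, axis=1) * np.linalg.norm(q_n) + 1e-8)
    beta_n = np.exp(b_n - b_n.max()); beta_n /= beta_n.sum()   # Eq. (9): softmax over profiles
    d_bar_n = beta_n @ D                                       # Eq. (11): weighted profile
    return c_n, p_n, beta_n, d_bar_n

T, E, S, K = 50, 8, 6, 4
rng = np.random.default_rng(0)
alpha = rng.random(T); alpha /= alpha.sum()
c, p, beta, d_bar = speaker_branch_step(alpha, rng.normal(size=(T, E)),
                                        rng.normal(size=(T, S)),
                                        rng.normal(size=S), rng.normal(size=(K, S)))
print(beta.sum())  # ~1: beta_n is a distribution over the K profiles (Eq. (10))
```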
All network parameters are optimized by maximizing the speaker-attributed maximum mutual information (SA-MMI) criterion:

$F^{\mathrm{SA\text{-}MMI}} = \log \Pr(Y, S \mid X, D)$   (13)
$= \log \prod_{n=1}^{N} \left\{ \Pr(y_n \mid y_{1:n-1}, s_{1:n}, X, D) \cdot \Pr(s_n \mid y_{1:n-1}, s_{1:n-1}, X, D)^{\gamma} \right\}.$   (14)

Here, $\gamma$ is a scaling parameter for the speaker estimation probability and is set to 0.1 per [29].

An extended beam search algorithm is used for decoding with the E2E SA-ASR. With the conventional beam search, each hypothesis contains estimated tokens accompanied by the posterior probability of the hypothesis. In addition to these, a hypothesis for the E2E SA-ASR method contains the speaker estimation $\beta_{n,k}$. Each hypothesis expands until $\langle eos\rangle$ is detected, and the estimated tokens in each hypothesis are grouped by $\langle sc\rangle$ to form multiple utterances. For each utterance, the speaker with the highest $\beta_{n,k}$ value at the position of the $\langle sc\rangle$ or $\langle eos\rangle$ token is selected as the predicted speaker of that utterance. We observed a slight performance improvement by using the speaker estimation at the end of an utterance (i.e., the $\langle sc\rangle$ or $\langle eos\rangle$ position) instead of the original scheme proposed in [29], which uses the average $\beta_{n,k}$ values calculated over all tokens of the utterance. Finally, when the same speaker is predicted for multiple utterances, those utterances are concatenated to form a single utterance.
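The training criterion of Eq. (14) can be illustrated with a small sketch (ours, not the training code): the per-token log-probabilities of the reference tokens and of the reference speakers (the latter read off from $\beta_{n,k}$) are summed over decoder steps, with the speaker term scaled by $\gamma$.

```python
# Illustrative computation of the SA-MMI criterion in Eq. (14).
import numpy as np

def sa_mmi(token_logprobs, speaker_logprobs, gamma=0.1):
    """token_logprobs[n]   = log Pr(y_n | ...), one value per reference token;
    speaker_logprobs[n] = log Pr(s_n | ...), taken from beta_{n, s_n}."""
    return float(np.sum(token_logprobs) + gamma * np.sum(speaker_logprobs))

# toy example with 3 decoder steps
print(sa_mmi(np.log([0.7, 0.5, 0.9]), np.log([0.6, 0.6, 0.8])))
```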
3. EXTENSIONS OF E2E SA-ASR
This section describes our proposed extensions of the E2E SA-ASR for recognizing continuous multi-talker recordings without prior speaker knowledge.
3.1. Speaker counting and clustering without prior speaker knowledge

The E2E SA-ASR requires the speaker inventory to include the profiles of all speakers involved in the input speech. However, it is often difficult to prepare such a speaker inventory for various reasons, including the participation of guest speakers who were not originally invited to a meeting and privacy concerns about voice enrollment. To cope with the case where no prior speaker knowledge is available, we combine the E2E SA-ASR with speaker clustering. Here, we assume we have a well-trained E2E SA-ASR model. Our proposed procedure to recognize long audio recordings is then as follows.

1. First, we apply a silence-region detector to divide a long input audio recording into multiple shorter segments at every silence region. Each segment may include multiple utterances of different speakers with overlaps.
2. Then, we apply the E2E SA-ASR to each segment with a set of example speaker profiles of speakers who do not appear in the input audio.
3. Finally, we cluster the speaker query vectors $q_n$ of the recognized hypotheses (i.e., the query vectors obtained at the last token of each utterance) to count and diarize the speakers. Specifically, we first determine the number of clusters based on the normalized maximum eigengap (NME) [33], and then perform spectral clustering with a normalized graph Laplacian matrix [34]. (In [33], spectral clustering was applied to a binarized and unnormalized graph Laplacian matrix after speaker counting. However, we applied conventional spectral clustering with a normalized graph Laplacian, as this yielded slightly better results in our preliminary experiments.) A simplified code sketch of this clustering step is given at the end of this section.

One may have multiple questions about this procedure. For example, how many example (irrelevant) speaker profiles are necessary in step 2? How does the silence-region detector in step 1 affect the final result? How about using the weighted profile $\bar{d}_n$ for speaker counting and clustering in step 3 instead of the speaker query $q_n$? We will experimentally examine these questions in Section 4.

3.2. Utterance-based FIFO training

We also introduce a simple yet effective modification of the reference transcription construction for the E2E SA-ASR training. In the previous work [29], the authors trained the E2E SA-ASR model with overlapped speech of up to three utterances. However, in real conversation, there are many cases where the same speaker utters multiple times in one continuous audio segment, as illustrated in the upper part of Fig. 2. In this example, three people are speaking in one audio segment, and $r^i_j$ represents the $j$-th reference token of speaker $i$. The term $N_{i,u}$ represents the end position of the $u$-th utterance of speaker $i$.

The previous work [29] employed the first-in first-out (FIFO) training scheme [32], where the reference labels of different speakers are sorted by their start times and concatenated with the $\langle sc\rangle$ token. Since the $\langle sc\rangle$ token represents the speaker change, the transcriptions of individual speakers are sorted by the times they start speaking. We call this original version speaker-based FIFO training; an example is shown in Fig. 2.

Alternatively, we may sort the reference labels according to the start time of each utterance and join the utterances with the $\langle sc\rangle$ token. Note that this scheme implicitly assumes that we can define what the end of an utterance is in continuous speech. We call this modified version utterance-based FIFO training, as illustrated in Fig. 2. In the next section, we experimentally investigate which FIFO training scheme results in better performance.

Fig. 2. Speaker-based and utterance-based FIFO training.

Table 1. cpWERs (%) of the E2E SA-ASR with a speaker inventory of 8 relevant speakers. LSTM-LM was not used in this experiment. Audio recordings were segmented at non-speech points based on oracle boundary information.

FIFO order    cpWER (%) for different overlap ratio
              0S    0L    10    20    30    40    Avg.
Speaker       7.1   6.8   21.4  24.3  42.7  44.6  26.7
Utterance     6.9   7.0   11.2  15.0  28.4  30.3  17.8
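To make step 3 of the procedure in Section 3.1 concrete, here is a simplified sketch of speaker counting and clustering on the per-utterance speaker query vectors. It uses a plain eigengap heuristic on a normalized graph Laplacian as a stand-in for NME [33] and relies on scikit-learn's spectral clustering; it is our approximation, not the implementation used in the experiments.

```python
# Simplified speaker counting (eigengap heuristic) and spectral clustering of
# per-utterance speaker query vectors q_n. Not a full NME re-implementation.
import numpy as np
from sklearn.cluster import SpectralClustering

def count_and_cluster(queries, max_speakers=16):
    """queries: (U, S) array, one speaker query vector per recognized utterance."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    A = np.clip(q @ q.T, 0.0, None)                 # cosine affinity, negatives pruned (simplification)
    d = A.sum(axis=1)
    L = np.eye(len(A)) - (A / np.sqrt(d)[:, None]) / np.sqrt(d)[None, :]  # normalized Laplacian
    eigvals = np.sort(np.linalg.eigvalsh(L))
    gaps = np.diff(eigvals[: max_speakers + 1])     # gaps between successive eigenvalues
    n_spk = int(np.argmax(gaps)) + 1                # eigengap heuristic for the cluster count
    labels = SpectralClustering(n_clusters=n_spk,
                                affinity="precomputed").fit_predict(A)
    return n_spk, labels
```

In the proposed method, the query vector taken at the last token of each recognized utterance (the $\langle sc\rangle$ or $\langle eos\rangle$ position) would correspond to one row of `queries`, and the returned labels give the diarization result for speakers without enrolled profiles.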
4. EXPERIMENTS

4.1. Evaluation settings
We evaluated the effectiveness of the proposed method by using the LibriCSS dataset [30], which comprises conversation-like recordings created based on the LibriSpeech corpus [35]. The dataset consists of 10 hours of recordings of concatenated LibriSpeech utterances that were played back by multiple loudspeakers in a meeting room and captured by a seven-channel microphone array. While the recordings have seven channels, we used only the first channel (i.e., monaural audio) for all our experiments.

The LibriCSS dataset consists of 10 sessions, each being one hour long and comprising eight speakers. Per [30], each session is decomposed into six 10-minute-long "mini-sessions" with different overlap ratios ranging from 0% to 40%. The recordings of the first session (Session 0) were used to tune the decoding parameters, and those of the remaining 9 sessions (Sessions 1–9) were used for the evaluation. Note that there are two types of mini-sessions for the 0% overlap case: one has only 0.1–0.5 sec of silence between adjacent utterances (called "0S"), and the other has 2.9–3.0 sec of silence between adjacent utterances (called "0L").
For the E2E SA-ASR training, we used multi-speaker signals that were generated by room simulation from the 960 hours of LibriSpeech training data ("train_960") [35, 36]. We generated 500,000 training samples, each of which was a mixture of multiple utterances randomly selected from train_960. When the utterances were mixed, each utterance was shifted by a random delay to simulate partially overlapped conversational recordings. Each training sample was generated under the following conditions.

• The number of speakers was randomly chosen from 1 to 5.
• The number of utterances was randomly chosen from 1 to 5.
• The start times of different utterances were apart by 0.5 sec or longer.
• Every utterance in each mixed audio sample had at least one speaker-overlapped region with other utterances.
• Utterances of the same speaker did not overlap.

Before mixing the source utterances, a room impulse response generated by the image method was applied to each utterance [37]. In addition, random noise was generated by following [38] and added at a random SNR from 10 to 40 dB after mixing the utterances. Finally, the volume of the mixed audio was changed by a random scale between 0.125 and 2.0.

In addition to the multi-speaker signals, speaker profiles were generated for each training sample as follows. For a training sample consisting of S speakers, the number of profiles was randomly selected from S to 8. Among those profiles, S profiles were for the speakers involved in the overlapped speech. The utterances used for creating the profiles of these speakers were different from those constituting the input overlapped speech. The rest of the profiles were randomly extracted from different speakers in train_960. Each profile was extracted by using 10 utterances.

The main evaluation metric used in this paper is the concatenated minimum-permutation word error rate (cpWER) [5]. The cpWER is computed as follows: (i) concatenate all reference transcriptions for each speaker; (ii) concatenate all hypothesis transcriptions for each detected speaker; (iii) compute the WER between the reference and hypothesis and repeat this for all possible speaker permutations; and (iv) pick the lowest WER among them. The cpWER is affected by both the speech recognition and speaker diarization results.

Besides the cpWER, we evaluated the mean speaker counting error, which is the absolute difference between the estimated number of speakers and the actual number of speakers (= 8 in LibriCSS), averaged over all mini-sessions. We also analyzed the source-target attention of our system in terms of the diarization error rate (DER). It should be noted that the mean speaker counting error and the DER are not the performance metrics we primarily care about; they were evaluated only for analysis purposes. The hyper-parameters of our systems were tuned on the development set to improve only the cpWER.
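Since the cpWER drives all the comparisons below, here is a compact reference sketch of the metric as just described (our own implementation for illustration, not the official scoring tool): per-speaker concatenated references and hypotheses are aligned under every speaker permutation, and the permutation with the fewest word errors defines the score.

```python
# Illustrative cpWER: minimum over speaker permutations of the total word-level
# edit distance, divided by the total number of reference words.
from itertools import permutations

def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance via dynamic programming (single row)."""
    d = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp_words, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def cp_wer(refs_by_speaker, hyps_by_speaker):
    """Each dict maps a speaker label to the concatenation of that speaker's transcriptions."""
    refs = [r.split() for r in refs_by_speaker.values()]
    hyps = [h.split() for h in hyps_by_speaker.values()]
    n = max(len(refs), len(hyps))
    refs += [[]] * (n - len(refs))          # pad so every permutation is well defined
    hyps += [[]] * (n - len(hyps))
    total_ref_words = sum(len(r) for r in refs)
    best = min(sum(edit_distance(r, hyps[p]) for r, p in zip(refs, perm))
               for perm in permutations(range(n)))
    return best / max(total_ref_words, 1)

print(cp_wer({"A": "hello world", "B": "good morning"},
             {"1": "good morning", "2": "hello word"}))  # 0.25 (1 error / 4 ref words)
```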
In our experiments, an 80-dim log mel filterbank extracted every 10 msec was used as the input feature. Three frames of features were stacked, and the model was applied on top of the stacked features. For the speaker profile, we used a 128-dim d-vector [31], whose extractor was separately trained on the VoxCeleb corpus [39, 40]. The d-vector extractor consisted of 17 convolution layers followed by an average pooling layer, which was a modified version of the one presented in [41].

The AsrEncoder consisted of 5 layers of 1024-dim bidirectional long short-term memory (BLSTM), interleaved with layer normalization [42]. The DecoderRNN consisted of 2 layers of 1024-dim unidirectional LSTM, and the DecoderOut consisted of 1 layer of 1024-dim unidirectional LSTM. We used a conventional location-aware content-based attention [43] with a single attention head. The SpeakerEncoder had the same architecture as the d-vector extractor except for not having the final average pooling layer. Our SpeakerQueryRNN consisted of 1 layer of 512-dim unidirectional LSTM. We used 16k subwords based on a unigram language model [44] as the recognition unit.

When we trained the E2E SA-ASR model, we initialized the parameters of the AsrEncoder, Attention, DecoderRNN, and DecoderOut with the parameter values of a three-speaker SOT-ASR model trained on simulated LibriSpeech utterance mixtures. We followed the setting described in [32] for pre-training the SOT model, and used the parameter values obtained after 640k training iterations. We also initialized the SpeakerEncoder parameters by using those of the d-vector extractor. After the initialization, we updated the entire network based on $F^{\mathrm{SA\text{-}MMI}}$ with $\gamma = 0.1$ by using an Adam optimizer with a learning rate of 0.00002. We used 8 GPUs, each of which worked on a minibatch of 6k frames. We report the results of the dev_clean-based best models found after 120k training iterations.

In addition to the E2E SA-ASR model described above, we trained an external language model (LM) that consisted of 4 layers of 2,048-dim LSTM. As training data, we generated a text corpus by (1) shuffling the official training text corpus for LibriSpeech and the transcription of train_960, and (2) concatenating every run of a randomly chosen number of consecutive utterances interleaved by the $\langle sc\rangle$ token. We used shallow fusion (i.e., a simple weighted sum) to combine the E2E SA-ASR and LM scores, with an LM weight calibrated on the development set.

We first evaluated the proposed method with an oracle silence-region detector. Namely, we divided each recording at every silence position obtained from the oracle utterance boundary information. Note that each segmented audio still consisted of multiple overlapped utterances of different speakers. The minimum and maximum numbers of utterances per segment were found to be 1 and 24, respectively. The following results use this oracle silence detection; the performance with an automatic silence-region detector is reported afterwards.

Table 2. cpWERs (%) and speaker counting errors with different speaker profile settings. The audio recordings were segmented at non-speech points based on oracle boundary information. LSTM-LM was used in this experiment.

System                                        Speaker counting   cpWER (%) for different overlap ratio   Avg. counting error
E2E SA-ASR
E2E SA-ASR + Speaker Clustering (proposed)    oracle             5.6   6.8   9.3   14.2  26.4  30.3
                                              NME (max=8)        6.6   9.0   13.3  14.2  26.4  30.7
                                              NME (max=12)       6.6   13.7  14.9  15.9  28.1  30.7
                                              NME (max=16)       11.0  13.7  14.9  15.9  28.1  30.7

Table 3. Average cpWER (%) with different numbers of irrelevant profiles (i.e., example profiles) for the proposed method using oracle speaker numbers. Oracle boundary-based segmentation was used.

Table 4. cpWERs (%) with different internal speaker embeddings for speaker clustering. Oracle boundary-based segmentation was used.

Speaker embedding for clustering    Overlap ratio in %: 0S  0L  10  20  30  40  Avg.
Weighted profile $\bar{d}_n$
Speaker query $q_n$
As a baseline, we evaluated the E2E SA-ASR with a speaker inventory consisting only of the eight relevant speakers. Each speaker's profile was extracted by using 5 utterances that were not included in the recording used for the evaluation. We first compared the speaker-based and utterance-based FIFO training schemes described in Section 3.2. The result is shown in Table 1. We can see that the utterance-based FIFO training significantly outperformed the speaker-based FIFO training. Therefore, we always used the E2E SA-ASR model based on the utterance-based FIFO training in the remaining experiments.

Next, we evaluated the accuracy of the E2E SA-ASR when the speaker inventory included irrelevant speaker profiles. In this experiment, irrelevant speakers were randomly chosen from train_960 of LibriSpeech, and one randomly selected utterance was used to extract the speaker profile of each irrelevant speaker. The result is shown in the first five rows of Table 2. When no irrelevant profiles were included in the speaker inventory, the E2E SA-ASR achieved the best cpWER of 16.9%. The cpWER gradually deteriorated as irrelevant profiles were added, but the system still achieved a cpWER of 22.1% even with 100 irrelevant profiles.

Finally, to analyze the impact of the speaker profiles, we also evaluated the E2E SA-ASR with no relevant speaker profiles. The results of this experiment are shown in the 6th to 8th rows of Table 2, where we provided 10, 20, or 100 irrelevant profiles as input while not using any profiles for the relevant speakers. Speaker diarization was conducted purely based on the speaker identification result for each utterance. As expected, we observed a very high cpWER of 72.1–81.3%. Note that the mean speaker counting error for the 10-irrelevant-profile case was relatively small (1.19) simply because the given number of speakers (10) and the correct number (8) were close.
We then evaluated the proposed procedure that combines the E2E SA-ASR and speaker clustering. The results are shown in the last rows of Table 2. In this experiment, we used 100 irrelevant speaker profiles as the set of example profiles. When we applied the speaker clustering with the oracle number of speakers, the proposed method achieved a cpWER of 16.7%, which was even better than the best number obtained by the E2E SA-ASR with the relevant speaker inventory. This is because spectral clustering can access the speaker embeddings of all utterances, while the speaker identification inside the E2E SA-ASR is performed by accessing only the information of a single segment. When we estimated the number of speakers by using NME with a maximum possible number of speakers of {8, 12, 16}, the cpWER was slightly degraded to 17.9–20.0%. Nonetheless, it was still as good as the E2E SA-ASR with the relevant speaker profiles.

Table 3 shows the effect of the number of irrelevant (example) profiles on the proposed method with oracle speaker numbers. It becomes difficult to form an appropriate weighted profile $\bar{d}_n$ when we have too few profiles, which ends up degrading the overall accuracy. Note that the computational cost of the inventory attention (Eqs. (8)–(11)) was negligible even with 100 profiles. Thus, we used 100 irrelevant speaker profiles in the following experiments unless otherwise stated.

We also compared clustering using the weighted profile $\bar{d}_n$ and clustering using the speaker query $q_n$. The results are shown in Table 4. In this experiment, we applied the E2E SA-ASR with 100 irrelevant speaker profiles and then applied speaker clustering given the oracle number of speakers. As seen in the table, the use of the speaker query $q_n$ resulted in significantly better speaker clustering performance.

We finally evaluated the proposed method with an automatic silence-region detector. In this experiment, we applied the WebRTC Voice Activity Detector (https://github.com/wiseman/py-webrtcvad) to each recording and segmented the audio whenever silence regions were detected.

The results with the automatic silence-region detector are shown in Table 5. The original E2E SA-ASR with the relevant speaker inventory achieved 18.6% to 26.0% cpWER depending on the number of additional irrelevant profiles. On the other hand, the proposed combination of the E2E SA-ASR and speaker clustering achieved a cpWER of 19.2% with oracle speaker counting and 21.8% with NME-based speaker counting, respectively.

Compared with the case using the oracle silence-region information, the cpWER was degraded by 3.1%. In particular, we noticed that the "0S" setting showed a severe cpWER degradation even though the overlap ratio was 0%. With "0S", there is only a very short silence (0.1–0.5 sec) between adjacent utterances of different speakers. As a result, segments in "0S" often consisted of consecutive speech from multiple speakers. We observed that the E2E SA-ASR sometimes mis-recognized the speaker change point for such speech, which resulted in the degradation of cpWER. Note that speaker change detection for non-overlapped speech can be more difficult than for overlapped speech, because speech overlaps can serve as a clue for speaker change in addition to the difference in voice characteristics.

Table 5. cpWERs (%) with an automatic silence-region detector used to segment the audio recordings.

System                                        Speaker counting   cpWER (%) for different overlap ratio   Avg. counting error
E2E SA-ASR
E2E SA-ASR + Speaker Clustering (proposed)    oracle             15.8  10.3  13.4  17.1  24.4  28.6
                                              NME (max=16)       24.4  12.2  15.0  17.1  28.6  28.6

We also analyzed the source-target attention $\alpha_n$ of the E2E SA-ASR. We estimated the start and end times of each utterance based on $\alpha_n$ as follows and calculated the DER accordingly.

1. For each utterance hypothesis, the attention ($\alpha_n$)-weighted average of the frame indices was calculated for each token other than $\langle sc\rangle$ or $\langle eos\rangle$.
2. The minimum frame index $f_{min}$ and the maximum frame index $f_{max}$ were calculated.
3. The start time $T_s$ was defined as $T_s = \max(0, f_{min} \cdot T_f - T_m)$, and the end time $T_e$ was defined as $T_e = f_{max} \cdot T_f + T_m$.

Here, $T_f$ is the frame shift in seconds, which was 0.03 sec according to our model settings. The term $T_m$ is a heuristic margin tuned on the development set, and it was set to 0.5 sec in our experiment.

The DER results are shown in Table 6. In this evaluation, we calculated the DER without a collar margin, and the overlapping regions were included in the DER calculation.

Table 6. Analysis of the source-target attention of two systems in Table 5 based on DERs (%).

System       DER (%) for different overlap ratio
             0S     0L     10     20     30     40     Avg.
System 1     15.72  11.15  12.64  14.50  18.04  17.33 †
System 7     19.71  12.97  13.81  14.49  19.98  17.97 ‡
† Miss = 4.72%, false alarm = 7.04%, speaker error = 3.47%
‡ Miss = 4.75%, false alarm = 7.00%, speaker error = 5.00%
As shown in the table, the E2E SA-ASR systems showed 15.23–16.75% DER on average. In the high-overlap test sets (with overlap ratios of 20%–40%), the DERs were significantly better than the overlap ratios of the input audio, which indicates that the source-target attention scanned the encoder embeddings back and forth to recognize overlapped utterances one by one, as originally designed for SOT [32]. On the other hand, the DER was as high as 11.15% even for the non-overlapped speech (0L). This could be because our model is optimized to achieve good SA-ASR accuracy, unlike other diarization methods, such as end-to-end neural diarization [45, 46] or target-speaker voice activity detection [11, 47], that are optimized for DER. That being said, the result shows that the source-target attention in the E2E SA-ASR model provides information about the start and end times of the hypotheses and thus can be used for applications requiring both the time boundaries and the recognition results.
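The boundary-estimation rule described above can be written compactly as follows; this is our reading of steps 1–3 in NumPy (with $T_f$ = 0.03 sec and $T_m$ = 0.5 sec as in the experiments), not the authors' code.

```python
# Estimate utterance start/end times from source-target attention weights:
# each token's centre frame is the attention-weighted mean of frame indices,
# and the extreme centres are converted to seconds with a margin T_m.
import numpy as np

def utterance_boundaries(alpha, frame_shift=0.03, margin=0.5):
    """alpha: (N_tokens, T) attention weights for one utterance hypothesis,
    rows for <sc>/<eos> already removed; returns (start_sec, end_sec)."""
    frames = np.arange(alpha.shape[1])
    centers = (alpha * frames).sum(axis=1) / alpha.sum(axis=1)   # step 1
    f_min, f_max = centers.min(), centers.max()                   # step 2
    t_start = max(0.0, f_min * frame_shift - margin)              # step 3
    t_end = f_max * frame_shift + margin
    return t_start, t_end
```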
5. CONCLUSION
In this paper, we proposed to apply speaker counting and clustering to the speaker query of an E2E SA-ASR model to diarize utterances of speakers whose profiles are not included in the speaker inventory. We also proposed a simple yet effective modification to the reference label construction for E2E SA-ASR training, which helps cope with continuous multi-talker recordings. In the evaluation, compared with the original E2E SA-ASR with a speaker inventory consisting only of relevant speaker profiles, the proposed method achieved a close cpWER even without any prior speaker knowledge.

6. REFERENCES

[1] Jonathan G Fiscus, Jerome Ajot, and John S Garofolo, "The rich transcription 2007 meeting recognition evaluation," in Multimodal Technologies for Perception of Humans, pp. 373–389. Springer, 2007.
[2] Adam Janin, Don Baron, Jane Edwards, Dan Ellis, David Gelbart, Nelson Morgan, Barbara Peskin, Thilo Pfau, Elizabeth Shriberg, Andreas Stolcke, et al., "The ICSI meeting corpus," in Proc. ICASSP, 2003, vol. 1, pp. I–I.
[3] Jean Carletta, Simone Ashby, Sebastien Bourban, Mike Flynn, Mael Guillemot, Thomas Hain, Jaroslav Kadlec, Vasilis Karaiskos, Wessel Kraaij, Melissa Kronenthal, et al., "The AMI meeting corpus: A pre-announcement," in International Workshop on Machine Learning for Multimodal Interaction. Springer, 2005, pp. 28–39.
[4] Jon Barker, Shinji Watanabe, Emmanuel Vincent, and Jan Trmal, "The fifth 'CHiME' speech separation and recognition challenge: Dataset, task and baselines," Proc. Interspeech, pp. 1561–1565, 2018.
[5] Shinji Watanabe, Michael Mandel, Jon Barker, Emmanuel Vincent, Ashish Arora, Xuankai Chang, Sanjeev Khudanpur, Vimal Manohar, Daniel Povey, Desh Raj, et al., "CHiME-6 challenge: Tackling multispeaker speech recognition for unsegmented recordings," in Proc. CHiME 2020, 2020.
[6] Neville Ryant, Kenneth Church, Christopher Cieri, Alejandrina Cristia, Jun Du, Sriram Ganapathy, and Mark Liberman, "First DIHARD challenge evaluation plan," 2018.
[7] Neville Ryant, Kenneth Church, Christopher Cieri, Alejandrina Cristia, Jun Du, Sriram Ganapathy, and Mark Liberman, "The second DIHARD diarization challenge: Dataset, task, and baselines," Proc. Interspeech, pp. 978–982, 2019.
[8] Naoyuki Kanda, Rintaro Ikeshita, Shota Horiguchi, Yusuke Fujita, Kenji Nagamatsu, Xiaofei Wang, Vimal Manohar, Nelson Enrique Yalta Soplin, Matthew Maciejewski, Szu-Jui Chen, et al., "The Hitachi/JHU CHiME-5 system: Advances in speech recognition for everyday home environments using multiple microphone arrays," in Proc. CHiME-5, 2018, pp. 6–10.
[9] Takuya Yoshioka, Igor Abramovski, Cem Aksoylar, Zhuo Chen, Moshe David, Dimitrios Dimitriadis, Yifan Gong, Ilya Gurvich, Xuedong Huang, Yan Huang, et al., "Advances in online audio-visual meeting transcription," in Proc. ASRU, 2019, pp. 276–283.
[10] Naoyuki Kanda, Christoph Boeddeker, Jens Heitkaemper, Yusuke Fujita, Shota Horiguchi, Kenji Nagamatsu, and Reinhold Haeb-Umbach, "Guided source separation meets a strong ASR backend: Hitachi/Paderborn University joint investigation for dinner party ASR," in Proc. Interspeech, 2019, pp. 1248–1252.
[11] Ivan Medennikov, Maxim Korenevsky, Tatiana Prisyach, Yuri Khokhlov, Mariya Korenevskaya, Ivan Sorokin, Tatiana Timofeeva, Anton Mitrofanov, Andrei Andrusenko, Ivan Podluzhny, et al., "The STC system for the CHiME-6 challenge," in CHiME 2020 Workshop on Speech Processing in Everyday Environments, 2020.
[12] John R Hershey, Zhuo Chen, Jonathan Le Roux, and Shinji Watanabe, "Deep clustering: Discriminative embeddings for segmentation and separation," in Proc. ICASSP, 2016, pp. 31–35.
[13] Zhuo Chen, Yi Luo, and Nima Mesgarani, "Deep attractor network for single-microphone speaker separation," in Proc. ICASSP, 2017, pp. 246–250.
[14] Dong Yu, Morten Kolbæk, Zheng-Hua Tan, and Jesper Jensen, "Permutation invariant training of deep models for speaker-independent multi-talker speech separation," in Proc. ICASSP, 2017, pp. 241–245.
[15] Dong Yu, Xuankai Chang, and Yanmin Qian, "Recognizing multi-talker speech with permutation invariant training," Proc. Interspeech, pp. 2456–2460, 2017.
[16] Hiroshi Seki, Takaaki Hori, Shinji Watanabe, Jonathan Le Roux, and John R Hershey, "A purely end-to-end system for multi-speaker speech recognition," in Proc. ACL, 2018, pp. 2620–2630.
[17] Xuankai Chang, Yanmin Qian, Kai Yu, and Shinji Watanabe, "End-to-end monaural multi-speaker ASR system without pretraining," in Proc. ICASSP, 2019, pp. 6256–6260.
[18] Xuankai Chang, Wangyou Zhang, Yanmin Qian, Jonathan Le Roux, and Shinji Watanabe, "MIMO-SPEECH: End-to-end multi-channel multi-speaker speech recognition," in Proc. ASRU, 2019, pp. 237–244.
[19] Naoyuki Kanda, Yusuke Fujita, Shota Horiguchi, Rintaro Ikeshita, Kenji Nagamatsu, and Shinji Watanabe, "Acoustic modeling for distant multi-talker speech recognition with single- and multi-channel branches," in Proc. ICASSP, 2019, pp. 6630–6634.
[20] Naoyuki Kanda, Shota Horiguchi, Ryoichi Takashima, Yusuke Fujita, Kenji Nagamatsu, and Shinji Watanabe, "Auxiliary interference speaker loss for target-speaker speech recognition," in Proc. Interspeech, 2019, pp. 236–240.
[21] Peidong Wang, Zhuo Chen, Xiong Xiao, Zhong Meng, Takuya Yoshioka, Tianyan Zhou, Liang Lu, and Jinyu Li, "Speech separation using speaker inventory," in Proc. ASRU, 2019, pp. 230–236.
[22] Thilo von Neumann, Keisuke Kinoshita, Marc Delcroix, Shoko Araki, Tomohiro Nakatani, and Reinhold Haeb-Umbach, "All-neural online source separation, counting, and diarization for meeting analysis," in Proc. ICASSP, 2019, pp. 91–95.
[23] Keisuke Kinoshita, Marc Delcroix, Shoko Araki, and Tomohiro Nakatani, "Tackling real noisy reverberant meetings with all-neural source separation, counting, and diarization system," arXiv preprint arXiv:2003.03987, 2020.
[24] Tae Jin Park and Panayiotis Georgiou, "Multimodal speaker segmentation and diarization using lexical and acoustic cues via sequence to sequence neural networks," in Proc. Interspeech, 2018, pp. 1373–1377.
[25] Tae Jin Park, Kyu J Han, Jing Huang, Xiaodong He, Bowen Zhou, Panayiotis Georgiou, and Shrikanth Narayanan, "Speaker diarization with lexical information," in Proc. Interspeech, 2019, pp. 391–395.
[26] Laurent El Shafey, Hagen Soltau, and Izhak Shafran, "Joint speech recognition and speaker diarization via sequence transduction," in Proc. Interspeech, 2019, pp. 396–400.
[27] Huanru Henry Mao, Shuyang Li, Julian McAuley, and Garrison Cottrell, "Speech recognition and multi-speaker diarization of long conversations," arXiv preprint arXiv:2005.08072, 2020.
[28] Naoyuki Kanda, Shota Horiguchi, Yusuke Fujita, Yawen Xue, Kenji Nagamatsu, and Shinji Watanabe, "Simultaneous speech recognition and speaker diarization for monaural dialogue recordings with target-speaker acoustic models," in Proc. ASRU, 2019.
[29] Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Tianyan Zhou, and Takuya Yoshioka, "Joint speaker counting, speech recognition, and speaker identification for overlapped speech of any number of speakers," arXiv preprint arXiv:2006.10930, 2020.
[30] Zhuo Chen, Takuya Yoshioka, Liang Lu, Tianyan Zhou, Zhong Meng, Yi Luo, Jian Wu, and Jinyu Li, "Continuous speech separation: dataset and analysis," in Proc. ICASSP, 2020 (to appear).
[31] Ehsan Variani, Xin Lei, Erik McDermott, Ignacio Lopez Moreno, and Javier Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," in Proc. ICASSP, 2014, pp. 4052–4056.
[32] Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng, and Takuya Yoshioka, "Serialized output training for end-to-end overlapped speech recognition," arXiv preprint arXiv:2003.12687, 2020.
[33] Tae Jin Park, Kyu J Han, Manoj Kumar, and Shrikanth Narayanan, "Auto-tuning spectral clustering for speaker diarization using normalized maximum eigengap," IEEE Signal Processing Letters, vol. 27, pp. 381–385, 2019.
[34] Ulrike Von Luxburg, "A tutorial on spectral clustering," Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007.
[35] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, "Librispeech: an ASR corpus based on public domain audio books," in Proc. ICASSP, 2015, pp. 5206–5210.
[36] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al., "The Kaldi speech recognition toolkit," in Proc. ASRU, 2011.
[37] Tom Ko, Vijayaditya Peddinti, Daniel Povey, Michael L Seltzer, and Sanjeev Khudanpur, "A study on data augmentation of reverberant speech for robust speech recognition," in Proc. ICASSP, 2017, pp. 5220–5224.
[38] Emanuël AP Habets and Sharon Gannot, "Generating sensor signals in isotropic noise fields," The Journal of the Acoustical Society of America, vol. 122, no. 6, pp. 3464–3470, 2007.
[39] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman, "VoxCeleb: A large-scale speaker identification dataset," in Proc. Interspeech, 2017, pp. 2616–2620.
[40] Joon Son Chung, Arsha Nagrani, and Andrew Zisserman, "VoxCeleb2: Deep speaker recognition," in Proc. Interspeech, 2018, pp. 1086–1090.
[41] Tianyan Zhou, Yong Zhao, Jinyu Li, Yifan Gong, and Jian Wu, "CNN with phonetic attention for text-independent speaker verification," in Proc. ASRU, 2019, pp. 718–725.
[42] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton, "Layer normalization," arXiv preprint arXiv:1607.06450, 2016.
[43] Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio, "Attention-based models for speech recognition," in Proc. NIPS, 2015, pp. 577–585.
[44] Taku Kudo, "Subword regularization: Improving neural network translation models with multiple subword candidates," arXiv preprint arXiv:1804.10959, 2018.
[45] Yusuke Fujita, Naoyuki Kanda, Shota Horiguchi, Kenji Nagamatsu, and Shinji Watanabe, "End-to-end neural speaker diarization with permutation-free objectives," Proc. Interspeech, pp. 4300–4304, 2019.
[46] Yusuke Fujita, Naoyuki Kanda, Shota Horiguchi, Yawen Xue, Kenji Nagamatsu, and Shinji Watanabe, "End-to-end neural speaker diarization with self-attention," in Proc. ASRU, 2019, pp. 296–303.
[47] Ivan Medennikov, Maxim Korenevsky, Tatiana Prisyach, Yuri Khokhlov, Mariya Korenevskaya, Ivan Sorokin, Tatiana Timofeeva, Anton Mitrofanov, Andrei Andrusenko, Ivan Podluzhny, et al., "Target-speaker voice activity detection: a novel approach for multi-speaker diarization in a dinner party scenario," arXiv preprint arXiv:2005.07272, 2020.