[PDF] Online End-to-End Neural Diarization with Speaker-Tracing Buffer

Abstract

This paper proposes a novel online speaker diarization algorithm based on a fully supervised self-attention mechanism (SA-EEND). Online diarization inherently presents a speaker's permutation problem due to the possibility to assign speaker regions incorrectly across the recording. To circumvent this inconsistency, we proposed a speaker-tracing buffer mechanism that selects several input frames representing the speaker permutation information from previous chunks and stores them in a buffer. These buffered frames are stacked with the input frames in the current chunk and fed into a self-attention network. Our method ensures consistent diarization outputs across the buffer and the current chunk by checking the correlation between their corresponding outputs. Additionally, we trained SA-EEND with variable chunk-sizes to mitigate the mismatch between training and inference introduced by the speaker-tracing buffer mechanism. Experimental results, including online SA-EEND and variable chunk-size, achieved DERs of 12.54% for CALLHOME and 20.77% for CSJ with 1.4s actual latency.

Full PDF

OOnline End-to-End Neural Diarization with Speaker-Tracing Buffer

Yawen Xue , Shota Horiguchi , Yusuke Fujita , Shinji Watanabe , Kenji Nagamatsu Hitachi, Ltd. Research & Development Group Center for Language and Speech Processing, Johns Hopkins University yawen.xue.wn,[email protected], [email protected]

Abstract

End-to-end speaker diarization using a fully supervised self-attention mechanism (SA-EEND) has achieved signiﬁcant im-provement from the state-of-art clustering-based methods, espe-cially for the overlapping case. However, applications of orig-inal SA-EEND are limited since it has been developed basedon ofﬂine self-attention algorithms. In this paper, we propose anovel speaker-tracing mechanism to extend SA-EEND to onlinespeaker diarization for practical use. First, this paper demon-strates oracle experiments to show that a straightforward on-line extension, in which SA-EEND is performed independentlyfor each chunked recording, results in degrading the diariza-tion error rate (DER) due to the speaker permutation inconsis-tency across the chunk. To circumvent this inconsistency is-sue, our proposed method, called speaker-tracing buffer, main-tains the speaker permutation information determined in pre-vious chunks within the self-attention mechanism for correctspeaker-tracing. Our experimental results show that the pro-posed online SA-EEND with speaker-tracing buffer achievedthe DERs of .

84 % for CALLHOME and .

64 % for Corpusof Spontaneous Japanese with latency. These results are sig-niﬁcantly better than the conventional online clustering methodbased on x-vector with . latency, which achieved the DERsof .

90 % and .

45 % , respectively.

1. Introduction

When building an audio-based human-computer interaction(HCI) system, it is important to provide the speaker turn infor-mation as well as speech transcription information. Speaker di-arization can locate speaker turn information and use it to iden-tify the speaker in a corresponding segment, which is deﬁned as“who spoke when” [1–3]. Speaker diarization has been widelyapplied to meetings, call-center telephone conversations, andthe home environment (CHiME-5) [4–7].Online speaker diarization outputs the diarization result assoon as the audio segment arrives, which means no future in-formation is available when analyzing the current segment. Incontrast, in an ofﬂine mode, the whole recording is processedso that all segments can be compared and clustered at the sametime [8]. Currently, few speaker diarization systems can be ap-plied in practical scenarios because most of them work wellonly under speciﬁc conditions such as long latency, no over-lapping, or no noise level. [9,10]. An online speaker diarizationsystem with low latency is still an open technical problem.State-of-art speaker diarization systems mostly concen-trate on integrating several components: voice activity de-tection, speaker change detection, feature representation, andclustering [11, 12]. Current research focuses primarily on thespeaker model or speaker embeddings, such as Gaussian mix-ture models (GMM) [8, 13], i-vector [14–16], d-vector [17, 18],and x-vector [19, 20], and on a better clustering method such as agglomerative hierarchical clustering or spectral clustering[19,21–23]. The issue with these methods is that they cannot di-rectly minimize the diarization error because they are based onan unsupervised algorithm. Zhang, et al [12, 24] proposed a su-pervised online speaker diarization approach while the methodstill assumes only one speaker in one segment (no overlap).To solve these issues, Fujita, et al. [25–27] proposed an end-to-end speaker diarization system that directly minimizes the di-arization error by training a neural network using PermutationInvariant Training (PIT) with multi-speaker recordings. Theirexperimental results show that the self-attention based end-to-end speaker diarization (SA-EEND) system [26, 27] outper-formed the state-of-art i-vector and x-vector clustering and longshort-term memory (LSTM) [25] based end-to-end method. Al-though SA-EEND has achieved signiﬁcant improvement, it isonly working in the ofﬂine condition due to the self-attentionmechanism, which outputs speaker labels only after the wholerecording is provided.This paper ﬁrst investigates a straightforward online exten-sion of SA-EEND by performing diarization independently foreach chunked recording. However, this straightforward onlineextension degrades the diarization error rate (DER) due to thespeaker permutation inconsistency across the chunk, especiallyfor short-duration chunks. Therefore, we propose a methodcalled speaker-tracing buffer, which can track speaker informa-tion consistently across the chunk by extending a self-attentionmechanism to maintain the speaker permutation information de-termined in previous chunks. More speciﬁcally, we select aﬁxed number of input frames in the previous chunk that havedominant speaker permutation information based on the diariza-tion output probability. These additional input frames are fedinto the self attention layer to take over the speaker permuta-tion information determined in the previous chunk. Our ex-perimental results show that choosing the buffer frames usingthe absolute probability difference of the output speaker labelsyields the best results compared with other methods. The codeof SA-EEND with speaker tracing buffer will be available at https://github.com/hitachi-speech/EEND .

2. Analysis of online SA-EEND

In SA-EEND [26], the speaker diarization task is formulatedas a probabilistic multi-label classiﬁcation problem. Given the T length acoustic feature X = ( x t ∈ R D | t = 1 , · · · , T ) ,with a D -dimensional observation feature vector at time index t , SA-EEND predicts the corresponding speaker label sequence ˆ Y = (ˆ y t | t = 1 , · · · , T ) . Here, speaker label ˆ y t = [ˆ y t,s ∈{ , } | s = 1 , · · · , S ] represents a joint activity for multiplespeakers ( S ) at time t . For example, ˆ y t,s = ˆ y t,s (cid:48) = 1( s (cid:54) = s (cid:48) ) means both s and s (cid:48) spoke at time t . Thus, determining ˆ Y isthe key for determining the speaker diarization information as a r X i v : . [ ee ss . A S ] J un ollows: ˆ Y = SA( X ) ∈ (0 , S × T , (1)where SA( · ) is a multi-head self-attention based neural net-work.Note that the vanilla self-attention layers has to wait theprocessing of all speech features in an entire recording to com-pute the output speaker labels. Thus, this method causes verylarge latency determined by the length of the recording, andcannot be adequate for online/real-time speech interface. This paper ﬁrst investigates the use of SA-EEND as shown inEq. (1) for chunked recordings with chunk size ∆ , as follows: ˆ Y t i +1: t i +∆ (cid:124) (cid:123)(cid:122) (cid:125) (cid:44) ˆ Y i = SA( X t i +1: t i +∆ (cid:124) (cid:123)(cid:122) (cid:125) (cid:44) X i ) ∈ (0 , S × ∆ . (2) i denotes a chunk index, and t i =1 (cid:44) . X i and ˆ Y i denotes sub-sequences of X and ˆ Y at chunk i , respectively. The latency canbe suppressed by chunk size ∆ instead of the entire recordinglength T . We ﬁrst investigate the inﬂuence of chunk size ∆ interms of the diarization performance. The SA-EEND system was trained using simulated training/testsets for two speakers and following the procedure in [27] . Heretwo encoder blocks with 256 attention units containing fourheads without residual connections were trained. The input fea-tures are 23-dimensional log-Mel-ﬁlterbanks concatenated withprevious seven frames and subsequent seven frames with a 25-ms frame length and 10-ms frame shift. And then subsamplingwith a factor of ten. In a word, a (23 × -dimensional inputfeature is inputted into the neural network every

100 ms whichmeans the duration length of one chunk is

100 ms .Two datasets are used for this analysis. The ﬁrst one,CALLHOME [6], consists of actual two-speaker telephone con-versations. Following the steps in [27], we split CALLHOMEinto the two parts: 155 recordings for adaptation and 148recordings for evaluation. The average overlap ratio with thetest set is . . The average duration of recordings in CALL-HOME is . . The second dataset is the Corpus of Sponta-neous Japanese (CSJ) dataset [28] which consists of interviews,natural conversations, etc. We used 54 recordings in this eval-uation and their average overlap ratio is . . There are twospeakers in each recording and the average duration is . . In this section, we analyze the relationship between chunk size ∆ in Eq. (2) and DER. The recordings to be evaluated were ﬁrstdivided according to chunk size and then fed into a SA-EENDsystem one by one to obtain the diarization result of each chunk.These chunk-wise diarization results were then combined as theﬁnal diarization result of the whole recording. We call it asrecording-wise DER calculated on the entire recording. Whencomputing the DER, a .

25 s collar tolerance was used at thestart and the end of each segment. We also evaluated overlap-ping speech and non-speech regions.Note that this chunk-wise SA-EEND method does not guar-antee that the speaker labels obtained across the chunk are thesame due to the speaker permutation ambiguity underlying in https://github.com/hitachi-speech/EEND D E R ( % ) Recording-wiseChunk-wise (a)

CALLHOME D E R ( % ) Recording-wiseChunk-wise (b)

CSJ

Figure 1:

Recording-wise and oracle chunk-wise DER (%). the general speaker diarization problem. Thus, the recording-wise DER would be degraded due to this across-chunk speakerinconsistency. To measure this degradation, we also computedthe oracle DER in each chunk separately (chunk-wise DER),which does not include the across-chunk speaker inconsistencyerror.The analytical results are shown in Figure 1 for the CALL-HOME and CSJ datasets. In these ﬁgures, the x-axis representschunk size ∆ during inference. Here, one chunk unit corre-sponds to . , which means the latency of the system is when the chunk size is 10 (i.e., . ×

10 = ). The y-axisrepresents the ﬁnal DER of the whole dataset. As shown in Fig-ure 1, the recording-wise DER decreased as the chunk size in-creased for both datasets. When the chunk size was larger than800, the recording-wise DER tended to converge for CALL-HOME. On the other hand, the oracle chunk-wise DER wasmuch smaller and more stable than the recording-wise DEReven when the chunk size was small, for both datasets. Thisindicates that the main degradation of online chunk-wise SA-EEND comes from the across-chunk speaker permutation in-consistency. Based on these ﬁndings, the next section exploreshow to solve this across-chunk speaker permutation issue.

3. Speaker-tracing buffer

In this section, we propose a method called speaker-tracingbuffer, that utilizes previous information as a clue to solve theacross-chunk permutation issue.

Let X buf ∈ R D × L and Y buf ∈ (0 , S × L be L -length acousticfeature buffer and the corresponding SA-EEND outputs, respec-tively, which contain the speaker-tracing information. At theinitial stage, X buf and Y buf are empty. Our online diarization isperformed by referring and updating this speaker-tracing buffer,as shown in Algorithm 1. The input of the SA-EEND system isthe concatenation of acoustic feature subsequence X i ∈ R D × ∆ at current chunk i and the acoustic features in buffer X buf , i.e., [ X buf ; X i ] ∈ R D × ( L +∆) . The corresponding output of SA-EEND is [ ˆ Y buf ; ˆ Y i ] ∈ (0 , S × ( L +∆) . If Y buf is not empty, thecorrelation coefﬁcient CC ( · , · ) between Y buf and current bufferoutput ˆ Y buf φ at output speaker permutation φ is calculated as CC ( Y buf , ˆ Y buf φ ) = (cid:80) Ss =1 (cid:80) Ll =1 ( y buf s,l − y buf )(ˆ y buf φ s ,l − ˆ y buf φ ) (cid:113)(cid:80) Ss =1 (cid:80) Ll =1 ( y buf s,l − y buf ) (cid:113)(cid:80) Ss =1 (cid:80) Ll =1 (ˆ y buf φ s ,l − ˆ y buf φ ) , lgorithm 1: Online diarization using SA-EEND.

Input: { X i } i // Chunked acoustic subsequnces S // L // Buffer size SA( · ) // SA-EEND system Output: ˆ Y // Diarization results X buf ← ∅ , Y buf ← ∅ // Initialize buffer for i = 1 , . . . do [ ˆ Y buf ; ˆ Y i ] ← SA([ X buf ; X i ]) // Input toSA-EEND if Y buf (cid:54) = ∅ then ψ ← arg max φ ∈ perm( S ) CC ( Y buf , ˆ Y buf φ ) ˆ Y i ← ˆ Y i,ψ ˆ Y ← [ ˆ Y ; ˆ Y i ] Update X buf and Y buf according to selection ruleswhere y buf = (cid:80) Ss =1 (cid:80) Ll =1 y buf s,l SL , ˆ y buf φ = (cid:80) Ss =1 (cid:80) Ll =1 ˆ y buf φ s ,l SL . (3)Permutation ψ with the largest correlation coefﬁcient ischosen as follows: ψ = arg max φ ∈ perm( S ) CC ( Y buf , ˆ Y buf φ ) , (4)where perm( S ) generates all permutations according to thenumber of speakers S . The corresponding buffer output ˆ Y buf i,ψ is chosen as the ﬁnal output ˆ Y i of chunk i , which can maintainthe consistent speaker permeation across the chunk. The ob-tained output ˆ Y i is stacked with the previously estimated outputto form the whole recording’s output ˆ Y in the end. An exampleof applying the speaker-tracing buffer to SA-EEND in the ﬁrsttwo chunks is shown in Figure 2, where ∆ is equal to 10, thebuffer size L is 5, and the speaker number S is 2.Speaker-tracing buffer ( X buf ; Y buf ) for the next chunk i +1 is selected from [ Y buf ; ˆ Y i ] and [ X buf ; X i ] in the current chunk.We consider three selection strategies, as explained in the nextsection. If chunk size ∆ is not larger than the pre-deﬁned buffer size L , we can simply store all the features in the buffer until thenumber of stored features reaches the buffer size. Once thenumber of accumulated features becomes larger than buffer size L , we have to select and store informative features that con-tain the speaker permutation information from [ X buf ; X i ] and [ Y buf ; ˆ Y i ] . In this section, three selection rules for updating thebuffer are listed. Here we assume that the number of speaker S is 2. • Uniform sampling (US). L acoustic features from [ X buf ; X i ] and the corresponding diarization resultsfrom [ Y buf ; ˆ Y i ] are randomly extracted based on the uni-form distribution.• Deterministic selection (DS) using the absolute differ-ence of probabilities of speakers, as δ m = | y ,m − y ,m | , (5)where y ,m , y ,m are the probabilities of the ﬁrst andsecond speakers at time index m . The maximum value of Table 1: DER (%) on ofﬂine situation and three buffer selectionstrategies with variable chunk sizes ∆ for ofﬂine system andvariable buffer sizes L , but ﬁxed chunk size ∆ = 10 for online. Ofﬂine US DS WS ∆ /L CH CSJ CH CSJ CH CSJ CH CSJ

10 36 .

93 47 .

51 45 . .

70 42 .

23 49 . .

31 37 .

86 22 .

38 31 .

06 23 . . .

13 33 .

38 16 .

28 26 .

38 17 .

11 27 .

150 25 .

61 31 .

26 14 .

33 25 .

81 15 .

68 26 .

200 25 .

31 29 .

51 13 .

95 24 .

99 14 .

47 25 .

500 15 .

67 24 .

35 13 .

05 24 . . . δ m ( = 1 ) is realized in either case of y ,m = 1 , y ,m =0 or y ,m = 0 , y ,m = 1 . This means that we try to ﬁnddominant active-speaker frames. Top L samples with thehighest δ m are selected from [ X buf ; X i ] and [ Y buf ; ˆ Y i ] .• Weighted sampling (WS): This is a combination of theuniform sampling and deterministic selection. We ran-domly select L features but the probability of selecting m -th feature is proportional to δ m in Eq. (5).

4. Experimental results

We analyzed the effect of the selection strategy using the samechunk size ∆

10 and several buffer sizes L from 10 to 600.In Table 1, the number in the left column (10-500) representthe chunk size ∆ for ofﬂine system but the buffer size L foronline situation. As shown in Table 1, applying the speaker-tracing buffer improved the performance of online SA-EENDperformance regardless of which selection strategy was used.As for the strategies, WS performed best for both datasets atmost cases when L was large (larger than 100). Therefore, Weconsidered WS as the selection strategy for future analysis. Next, we analyzed the effect of buffer and chunk size. TheDER results for the CALLHOME and CSJ when applying theweighted sampling (WS) selection strategy are shown in Fig-ure 3. Chunk sizes ∆ were 10 and 20 and had the latency timeof and respectively. Regarding the chunk size in Fig-ure 3, all DERs from the large chunk size ∆ = 20 are betterthan those from the small chunk size ∆ = 10 even if the buffersize is the same. As for the buffer size, when the chunk size wasthe same, DER decreased as buffer size increases. These resultsare in line with our assumption that a large input size would leadto a better result. RTF was calculated as the ratio of the summation of the exe-cution time of every chunk to the recording duration which canmeasure the speech decoding speed and express the time per-formance of the proposed system. To avoid an unequal sizeof buffers in the ﬁrst several chunks, we ﬁrst ﬁlled the bufferwith dummy values and then calculated the RTF. Our experi-ment was conducted on an Intel ® Xeon ® CPU E52697A v2 @2.60GHz using one thread. RTFs are equal to 0.40, 1.07 whenthe buffer size L are 500 and 1000 which indicates that the pro-posed method is acceptable for online applications when buffersize is smaller than 1000 (

100 s ). SA-EEND

Output for 1st chunk

BufferSelection

SA-EEND

Output for 2nd chunk

Selection

Buffer

Solve permutation using and

Figure 2:

Applying speaker-tracing buffer for SA-EEND. %XIIHUVL]H ' ( 5 &KXQNVL]H&KXQNVL]H (a) CALLHOME %XIIHUVL]H ' ( 5 &KXQNVL]H&KXQNVL]H (b) CSJ

Figure 3:

Relationship among DER, chunk size and buffer size.

For a comparison with other methods, we evaluated our pro-posed methods using two real datasets (CALLHOME (CH) andCSJ), and three simulated datasets which are shown in Table 2.The simulated datasets were created by using two speakers’ seg-ments. The background noise and room impulse response werefrom MUSAN corpus [29] and Simulated Room Response cor-pus [30] following the procedure in [26]. Three kinds of simu-lated datasets were created with overlap ratios equal to . , . and . , respectively.For the ofﬂine i-vector and x-vector method, we applied theKaldi CALLHOME diarization v1 and v2 recipe [19, 31, 32].They are ofﬂine diarization methods that applies probabilisticlinear discriminant analysis [33] along with an agglomerative-cluster method with TDNN-based speech activity detection [34]and oracle number of speakers. Ofﬂine SA-EEND is referred toas a chunk size of the entire recording, The system in [27] whichachieved the best performance is applied here, not only for theofﬂine SA-EEND but also for the online SA-EEND ( ∆ = 10 )and all proposed methods.For the online x-vector, the speech segments are ﬁrstly di-vided into subsequent . chunk. And then judging whetherthe entire chunk is speech or silence using the energy-basedVAD for real datasets and oracle VAD for simulated datasets.The following part will be skipped if the entire chunk is consid-ered as silence whose the If the percentage of voiced frames ofthe entire chunk are fewer than

20 % , it will be considered assilence and the fol If it is voiced chunk, we will extract x-vectorand assign it to the ﬁrst cluster until a dissimilar x-vector ar-rives according to the probabilistic linear discriminant analysis(PLDA) score. Here we applied a threshold 0 as the dissimilarcriterion. After two clusters exist, when a new x-vector comes,calculating the PLDA score between the new segment and the Table 2:

DERs (%) on simulated mixtures and real dataset.

Simulated RealSystem . . . CH CSJOfﬂine i-vector 33.73 30.93 25.96 12.10 27.99Ofﬂine x-vector 28.77 24.46 17.78 11.53 22.96Ofﬂine SA-EEND

Online x-vector ( ∆ = 15 ) 36.94 34.94 33.19 26.90 25.45Online SA-EEND ( ∆ = 10 ) 33.18 37.31 41.41 36.93 47.57Proposed ( ∆ = 10 , L = 500 ) 7.91

Proposed ( ∆ = 5 , L = 600 ) Table 3:

DERs (%) on real datasets with calibration period(30s).

Within 30s After 30s AllSystem CH CSJ CH CSJ CH CSJOfﬂine SA-EEND 9.38 23.43 9.53 20.23 9.54 20.48Proposed ( ∆ = 10 , L = 500 ) 15.91 25.37 10.05 21.30 12.84 21.64 two clusters and ﬁnally, assigning x-vector to the nearest clus-ter. For online SA-EEND, the chunk size is set to 10 withoutapplying a speaker-tracing buffer. The proposed method ap-plied the weighted sampling based speaker-tracing buffer to theSA-EEND.As shown in Table 2, comparing with x-vector based onlinesystem, the proposed method obtained the best results for theonline situations. The proposed method performed even bet-ter on CSJ than the ofﬂine clustering i-vector and the x-vectorbased methods. The online method increased the DER by 3.27and 1.16 point comparing with the ofﬂine SA-EEND system fortwo real dataset when buffer size is 500 and the latency time is . In order to explore the increased DER, the broken downDER with a calibration period of

30 s is calculated and shownin Table 3. Comparing with ofﬂine SA-EEND, the proposedmethod only increases the DERs of 0.52 and 1.07 point for twodatasets with a

30 s calibration period. Therefore, we can con-clude that our proposed method can improve latency of the SA-EEND system from the entire duration of recording to only or even . with comparable diarization performance.

5. Conclusion

In this paper, we proposed a speaker-tracing buffer to mem-ory previous speaker permutation information which enable thepre-trained ofﬂine SA-EEND system directly work online. Thelatency time can reduce to

500 ms with comparable diariza-tion performance. Future work will be concentrated to ﬂexiblespeaker as current method is limited to two-speaker case. . References [1] S. E. Tranter and D. A. Reynolds, “An overview of auto-matic speaker diarization systems,”

IEEE Transactions on audio,speech, and language processing , vol. 14, no. 5, pp. 1557–1565,2006.[2] G. Sell, D. Snyder, A. McCree, D. Garcia-Romero, J. Villalba,M. Maciejewski, V. Manohar, N. Dehak, D. Povey, S. Watanabe et al. , “Diarization is hard: Some experiences and lessons learnedfor the JHU team in the inaugural DIHARD challenge.” in

IN-TERSPEECH , 2018, pp. 2808–2812.[3] N. Ryant, K. Church, C. Cieri, A. Cristia, J. Du, S. Ganapathy, andM. Liberman, “The second dihard diarization challenge: Dataset,task, and baselines,” in

INTERSPEECH , 2019, pp. 978–982.[4] J. Barker, S. Watanabe, E. Vincent, and J. Trmal, “The ﬁfth“chime” speech separation and recognition challenge: Dataset,task and baselines,” in

INTERSPEECH , 2018.[5] N. Kanda, R. Ikeshita, S. Horiguchi, Y. Fujita, K. Nagamatsu,X. Wang, V. Manohar, N. E. Y. Soplin, M. Maciejewski, S.-J.Chen et al. , “The Hitachi/JHU CHiME-5 system: Advances inspeech recognition for everyday home environments using mul-tiple microphone arrays,” in

The 5th International Workshop onSpeech Processing in Everyday Environments (CHiME 2018), In-terspeech , 2018.[6] “2000 NIST Speaker Recognition Evaluation,” https://catalog.ldc.upenn.edu/LDC2001S97.[7] A. Martin and M. Przybocki, “The NIST 1999 speaker recognitionevaluationan overview,”

Digital signal processing , vol. 10, no. 1-3, pp. 1–18, 2000.[8] J. Geiger, F. Wallhoff, and G. Rigoll, “GMM-UBM based open-set online speaker diarization,” in

INTERSPEECH , 2010.[9] T. von Neumann, K. Kinoshita, M. Delcroix, S. Araki,T. Nakatani, and R. Haeb-Umbach, “All-neural online sourceseparation, counting, and diarization for meeting analysis,” in

ICASSP , 2019, pp. 91–95.[10] M. Maciejewski, D. Snyder, V. Manohar, N. Dehak, and S. Khu-danpur, “Characterizing performance of speaker diarization sys-tems on far-ﬁeld speech using standard methods,” in

ICASSP ,2018, pp. 5244–5248.[11] S. H. Shum, N. Dehak, R. Dehak, and J. R. Glass, “Unsuper-vised methods for speaker diarization: An integrated and iterativeapproach,”

IEEE Transactions on Audio, Speech, and LanguageProcessing , vol. 21, no. 10, pp. 2015–2028, 2013.[12] A. Zhang, Q. Wang, Z. Zhu, J. Paisley, and C. Wang, “Fully su-pervised speaker diarization,” in

ICASSP , 2019, pp. 6301–6305.[13] K. Markov and S. Nakamura, “Improved novelty detection for on-line gmm based speaker diarization,” in

INTERSPEECH , 2008.[14] S. Madikeri, I. Himawan, P. Motlicek, and M. Ferras, “Integrat-ing online i-vector extractor with information bottleneck basedspeaker diarization system,” in

INTERSPEECH , 2015, pp. 3105–3109.[15] D. Garcia-Romero, D. Snyder, G. Sell, D. Povey, and A. McCree,“Speaker diarization using deep neural network embeddings,” in

ICASSP , 2017, pp. 4930–4934.[16] W. Zhu and J. Pelecanos, “Online speaker diarization usingadapted i-vector transforms,” in

ICASSP , 2016, pp. 5045–5049.[17] Q. Wang, C. Downey, L. Wan, P. A. Mansﬁeld, and I. L. Moreno,“Speaker diarization with LSTM,” in

ICASSP , 2018, pp. 5239–5243.[18] L. Wan, Q. Wang, A. Papir, and I. L. Moreno, “Generalized end-to-end loss for speaker veriﬁcation,” in

ICASSP , 2018, pp. 4879–4883.[19] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudan-pur, “X-vectors: Robust DNN embeddings for speaker recogni-tion,” in

ICASSP , 2018, pp. 5329–5333. [20] G. Sell and D. Garcia-Romero, “Speaker diarization with plda i-vector scoring and unsupervised calibration,” in

SLT , 2014, pp.413–417.[21] H. Ning, M. Liu, H. Tang, and T. S. Huang, “A spectral clusteringapproach to speaker diarization,” in

Ninth International Confer-ence on Spoken Language Processing , 2006.[22] D. Dimitriadis and P. Fousek, “Developing on-line speaker di-arization system.” in

INTERSPEECH , 2017, pp. 2739–2743.[23] J. Patino, R. Yin, H. Delgado, H. Bredin, A. Komaty, G. Wis-niewski, C. Barras, N. W. Evans, and S. Marcel, “Low-latency speaker spotting with online diarization and detection.”in

Odyssey , 2018, pp. 140–146.[24] E. Fini and A. Brutti, “Supervised online diarization with samplemean loss for multi-domain data,” in

ICASSP , 2020, pp. 7134–7138.[25] Y. Fujita, N. Kanda, S. Horiguchi, K. Nagamatsu, and S. Watan-abe, “End-to-end neural speaker diarization with permutation-freeobjectives,” in

INTERSPEECH , 2019, pp. 4300–4304.[26] Y. Fujita, N. Kanda, S. Horiguchi, Y. Xue, K. Nagamatsu, andS. Watanabe, “End-to-end neural speaker diarization with self-attention,” in

ASRU , 2019.[27] Y. Fujita, S. Watanabe, S. Horiguchi, Y. Xue, and K. Naga-matsu, “End-to-end neural diarization: Reformulating speakerdiarization as simple multi-label classiﬁcation,” arXiv preprintarXiv:2003.02966 , 2020.[28] K. Maekawa, “Corpus of Spontaneous Japanese: Its design andevaluation,” in

ISCA & IEEE Workshop on Spontaneous SpeechProcessing and Recognition , 2003.[29] D. Snyder, G. Chen, and D. Povey, “MUSAN: A music, speech,and noise corpus,” arXiv preprint arXiv:1510.08484 , 2015.[30] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur,“A study on data augmentation of reverberant speech for robustspeech recognition,” in

ICASSP , 2017, pp. 5220–5224.[31] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur,“Deep neural network embeddings for text-independent speakerveriﬁcation.” in

INTERSPEECH , 2017, pp. 999–1003.[32] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek,N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al. ,“The Kaldi speech recognition toolkit,” in

ASRU , 2011.[33] S. Ioffe, “Probabilistic linear discriminant analysis,” in

ECCV ,2006, pp. 531–542.[34] V. Peddinti, G. Chen, V. Manohar, T. Ko, D. Povey, and S. Khu-danpur, “JHU ASPIRE system: Robust LVCSR with TDNNs,iVector adaptation and RNN-LMS,” in