Online End-to-End Neural Diarization with Speaker-Tracing Buffer
Yawen Xue, Shota Horiguchi, Yusuke Fujita, Shinji Watanabe, Kenji Nagamatsu
Hitachi, Ltd. Research & Development Group
Center for Language and Speech Processing, Johns Hopkins University
yawen.xue.wn,[email protected], [email protected]
Abstract
End-to-end speaker diarization using a fully supervised self-attention mechanism (SA-EEND) has achieved significant improvement over state-of-the-art clustering-based methods, especially in the overlapping case. However, applications of the original SA-EEND are limited because it was developed on the basis of an offline self-attention algorithm. In this paper, we propose a novel speaker-tracing mechanism that extends SA-EEND to online speaker diarization for practical use. First, this paper demonstrates oracle experiments showing that a straightforward online extension, in which SA-EEND is performed independently on each chunked recording, degrades the diarization error rate (DER) because of the speaker permutation inconsistency across chunks. To circumvent this inconsistency issue, our proposed method, called the speaker-tracing buffer, maintains the speaker permutation information determined in previous chunks within the self-attention mechanism for correct speaker tracing. Our experimental results show that the proposed online SA-EEND with a speaker-tracing buffer achieved DERs of
12.84% for CALLHOME and 21.64% for the Corpus of Spontaneous Japanese with 1 s latency. These results are significantly better than those of the conventional online clustering method based on x-vectors with 1.5 s latency, which achieved DERs of 26.90% and 25.45%, respectively.
1. Introduction
When building an audio-based human-computer interaction (HCI) system, it is important to provide speaker turn information as well as speech transcription. Speaker diarization locates speaker turns and uses them to identify the speaker in each segment; the task is commonly defined as determining "who spoke when" [1–3]. Speaker diarization has been widely applied to meetings, call-center telephone conversations, and home environments (CHiME-5) [4–7].

Online speaker diarization outputs the diarization result as soon as an audio segment arrives, which means no future information is available when analyzing the current segment. In contrast, in offline mode, the whole recording is processed so that all segments can be compared and clustered at the same time [8]. Currently, few speaker diarization systems can be applied in practical scenarios because most of them work well only under specific conditions, such as long latency, no overlapping speech, or low noise levels [9, 10]. An online speaker diarization system with low latency is still an open technical problem.

State-of-the-art speaker diarization systems mostly concentrate on integrating several components: voice activity detection, speaker change detection, feature representation, and clustering [11, 12]. Current research focuses primarily on speaker models or speaker embeddings, such as Gaussian mixture models (GMMs) [8, 13], i-vectors [14–16], d-vectors [17, 18], and x-vectors [19, 20], and on better clustering methods such as agglomerative hierarchical clustering or spectral clustering [19, 21–23]. The issue with these methods is that they cannot directly minimize the diarization error because they are based on unsupervised algorithms. Zhang et al. [12, 24] proposed a supervised online speaker diarization approach, but the method still assumes only one speaker per segment (no overlap).

To solve these issues, Fujita et al. [25–27] proposed an end-to-end speaker diarization system that directly minimizes the diarization error by training a neural network with permutation invariant training (PIT) on multi-speaker recordings. Their experimental results show that the self-attention-based end-to-end speaker diarization (SA-EEND) system [26, 27] outperformed the state-of-the-art i-vector and x-vector clustering methods and a long short-term memory (LSTM) based end-to-end method [25]. Although SA-EEND has achieved significant improvement, it works only in the offline condition because its self-attention mechanism outputs speaker labels only after the whole recording is provided.

This paper first investigates a straightforward online extension of SA-EEND that performs diarization independently on each chunked recording. This straightforward extension degrades the diarization error rate (DER) because the speaker permutation is inconsistent across chunks, especially for short chunks. We therefore propose a method called the speaker-tracing buffer, which tracks speaker information consistently across chunks by extending the self-attention mechanism to maintain the speaker permutation information determined in previous chunks. More specifically, we select a fixed number of input frames from the previous chunk that carry dominant speaker permutation information, based on the diarization output probabilities. These additional input frames are fed into the self-attention layers so that the speaker permutation determined in the previous chunk is carried over. Our experimental results show that choosing the buffer frames using the absolute difference of the output speaker probabilities yields the best results compared with the other methods. The code of SA-EEND with the speaker-tracing buffer will be available at https://github.com/hitachi-speech/EEND.
2. Analysis of online SA-EEND
In SA-EEND [26], the speaker diarization task is formulated as a probabilistic multi-label classification problem. Given a T-length acoustic feature sequence X = (x_t ∈ R^D | t = 1, ..., T), where x_t is a D-dimensional observation feature vector at time index t, SA-EEND predicts the corresponding speaker label sequence Ŷ = (ŷ_t | t = 1, ..., T). Here, the speaker label ŷ_t = [ŷ_{t,s} ∈ {0, 1} | s = 1, ..., S] represents the joint activity of S speakers at time t. For example, ŷ_{t,s} = ŷ_{t,s′} = 1 (s ≠ s′) means that both speakers s and s′ spoke at time t. Thus, determining Ŷ is the key to obtaining the speaker diarization information, as follows:

    \hat{Y} = \mathrm{SA}(X) \in (0, 1)^{S \times T},    (1)

where SA(·) is a multi-head self-attention based neural network.

Note that the vanilla self-attention layers must wait for all speech features of an entire recording before computing the output speaker labels. This incurs a very large latency, determined by the length of the recording, and is not adequate for an online/real-time speech interface.

This paper therefore first investigates the use of SA-EEND as in Eq. (1) on chunked recordings with chunk size ∆, as follows:

    \underbrace{\hat{Y}_{t_i+1:t_i+\Delta}}_{\triangleq \hat{Y}_i} = \mathrm{SA}(\underbrace{X_{t_i+1:t_i+\Delta}}_{\triangleq X_i}) \in (0, 1)^{S \times \Delta},    (2)

where i denotes the chunk index and t_1 ≜ 0. X_i and Ŷ_i denote the subsequences of X and Ŷ at chunk i, respectively. The latency is then bounded by the chunk size ∆ instead of the entire recording length T. We first investigate the influence of the chunk size ∆ on the diarization performance.

The SA-EEND system was trained using simulated two-speaker training/test sets, following the procedure in [27]. Two encoder blocks with 256 attention units containing four heads, without residual connections, were trained. The input features are 23-dimensional log-Mel filterbanks with a 25-ms frame length and 10-ms frame shift, concatenated with the previous seven frames and the subsequent seven frames, and then subsampled by a factor of ten. In short, a (23 × 15)-dimensional input feature is fed into the neural network every 100 ms; that is, one chunk unit has a duration of 100 ms.

Two datasets are used for this analysis. The first one, CALLHOME [6], consists of actual two-speaker telephone conversations. Following the steps in [27], we split CALLHOME into two parts: 155 recordings for adaptation and 148 recordings for evaluation. The average overlap ratio of the test set is … and the average duration of its recordings is …. The second dataset is the Corpus of Spontaneous Japanese (CSJ) [28], which consists of interviews, natural conversations, etc. We used 54 recordings in this evaluation; their average overlap ratio is …. There are two speakers in each recording, and the average duration is ….

In this section, we analyze the relationship between the chunk size ∆ in Eq. (2) and the DER. The recordings to be evaluated were first divided according to the chunk size and then fed into the SA-EEND system one by one to obtain the diarization result of each chunk. These chunk-wise diarization results were then combined into the final diarization result of the whole recording; we call the DER calculated on the entire recording the recording-wise DER.
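As a concrete picture of this chunk-wise processing, the following minimal Python sketch runs a model independently on fixed-size chunks, as in Eq. (2). The function name sa_eend and its (frames, features) → (frames, speakers) signature are illustrative assumptions, not the released implementation.

    import numpy as np

    def diarize_in_chunks(features, sa_eend, chunk_size):
        """Run SA-EEND independently on each fixed-size chunk (Eq. (2)).

        features: (T, D) array of acoustic features.
        sa_eend:  trained model mapping (T', D) -> (T', S) posteriors (assumed).
        """
        outputs = []
        for start in range(0, len(features), chunk_size):
            x_i = features[start:start + chunk_size]   # X_i
            outputs.append(sa_eend(x_i))               # Y_i = SA(X_i)
        # Concatenation of independent chunk outputs; the speaker columns are
        # NOT guaranteed to be consistent across chunks (permutation ambiguity).
        return np.concatenate(outputs, axis=0)

The latency of this loop is determined by chunk_size alone, which is exactly the quantity varied in the analysis below.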
When computing the DER, a 0.25 s collar tolerance was used at the start and the end of each segment. We also evaluated overlapping speech and non-speech regions.

Note that this chunk-wise SA-EEND method does not guarantee that the speaker labels obtained across chunks are the same, owing to the speaker permutation ambiguity underlying the general speaker diarization problem. Thus, the recording-wise DER would be degraded by this across-chunk speaker inconsistency. To measure this degradation, we also computed the oracle DER in each chunk separately (the chunk-wise DER), which does not include the across-chunk speaker inconsistency error.

Figure 1: Recording-wise and oracle chunk-wise DER (%) as a function of chunk size ∆ for (a) CALLHOME and (b) CSJ.

The analytical results are shown in Figure 1 for the CALLHOME and CSJ datasets. In these figures, the x-axis represents the chunk size ∆ during inference. Here, one chunk unit corresponds to 0.1 s, which means the latency of the system is 1 s when the chunk size is 10 (i.e., 0.1 s × 10 = 1 s). The y-axis represents the final DER over the whole dataset. As shown in Figure 1, the recording-wise DER decreased as the chunk size increased for both datasets. When the chunk size was larger than 800, the recording-wise DER tended to converge for CALLHOME. On the other hand, the oracle chunk-wise DER was much smaller and more stable than the recording-wise DER, even when the chunk size was small, for both datasets. This indicates that the main degradation of online chunk-wise SA-EEND comes from the across-chunk speaker permutation inconsistency. Based on these findings, the next section explores how to solve this across-chunk speaker permutation issue.
3. Speaker-tracing buffer
In this section, we propose a method called the speaker-tracing buffer, which utilizes information from previous chunks as a clue to solve the across-chunk permutation issue.
Let X^buf ∈ R^{D×L} and Y^buf ∈ (0, 1)^{S×L} be an L-length acoustic feature buffer and the corresponding SA-EEND outputs, respectively, which contain the speaker-tracing information. At the initial stage, X^buf and Y^buf are empty. Our online diarization is performed by referring to and updating this speaker-tracing buffer, as shown in Algorithm 1. The input to the SA-EEND system is the concatenation of the acoustic feature subsequence X_i ∈ R^{D×∆} of the current chunk i and the acoustic features in the buffer X^buf, i.e., [X^buf; X_i] ∈ R^{D×(L+∆)}. The corresponding output of SA-EEND is [Ŷ^buf; Ŷ_i] ∈ (0, 1)^{S×(L+∆)}.

Algorithm 1: Online diarization using SA-EEND.
    Input:  {X_i}_i   // chunked acoustic subsequences
            S         // number of speakers
            L         // buffer size
            SA(·)     // SA-EEND system
    Output: Ŷ         // diarization results
    X^buf ← ∅, Y^buf ← ∅                      // initialize buffer
    for i = 1, 2, ... do
        [Ŷ^buf; Ŷ_i] ← SA([X^buf; X_i])       // input to SA-EEND
        if Y^buf ≠ ∅ then
            ψ ← arg max_{φ ∈ perm(S)} CC(Y^buf, Ŷ^buf_φ)
            Ŷ_i ← Ŷ_{i,ψ}
        Ŷ ← [Ŷ; Ŷ_i]
        Update X^buf and Y^buf according to the selection rules

If Y^buf is not empty, the correlation coefficient CC(·, ·) between Y^buf and the current buffer output Ŷ^buf_φ under output speaker permutation φ is calculated as

    \mathrm{CC}(Y^{\text{buf}}, \hat{Y}^{\text{buf}}_{\phi}) = \frac{\sum_{s=1}^{S}\sum_{l=1}^{L}\left(y^{\text{buf}}_{s,l}-\bar{y}^{\text{buf}}\right)\left(\hat{y}^{\text{buf}_\phi}_{s,l}-\bar{\hat{y}}^{\text{buf}_\phi}\right)}{\sqrt{\sum_{s=1}^{S}\sum_{l=1}^{L}\left(y^{\text{buf}}_{s,l}-\bar{y}^{\text{buf}}\right)^{2}}\,\sqrt{\sum_{s=1}^{S}\sum_{l=1}^{L}\left(\hat{y}^{\text{buf}_\phi}_{s,l}-\bar{\hat{y}}^{\text{buf}_\phi}\right)^{2}}},    (3)

where \bar{y}^{\text{buf}} = \frac{1}{SL}\sum_{s=1}^{S}\sum_{l=1}^{L} y^{\text{buf}}_{s,l} and \bar{\hat{y}}^{\text{buf}_\phi} = \frac{1}{SL}\sum_{s=1}^{S}\sum_{l=1}^{L} \hat{y}^{\text{buf}_\phi}_{s,l}.

The permutation ψ with the largest correlation coefficient is chosen as follows:

    \psi = \mathop{\mathrm{arg\,max}}_{\phi \in \mathrm{perm}(S)} \mathrm{CC}(Y^{\text{buf}}, \hat{Y}^{\text{buf}}_{\phi}),    (4)

where perm(S) generates all permutations of the S speakers. The current chunk's output under permutation ψ, Ŷ_{i,ψ}, is chosen as the final output Ŷ_i of chunk i, which maintains a consistent speaker permutation across chunks. The obtained output Ŷ_i is stacked with the previously estimated outputs to form the whole recording's output Ŷ. An example of applying the speaker-tracing buffer to SA-EEND on the first two chunks is shown in Figure 2, where ∆ is 10, the buffer size L is 5, and the number of speakers S is 2.

The speaker-tracing buffer (X^buf; Y^buf) for the next chunk i + 1 is selected from [X^buf; X_i] and [Y^buf; Ŷ_i] of the current chunk. If the chunk size ∆ is not larger than the pre-defined buffer size L, we can simply store all the features in the buffer until the number of stored features reaches the buffer size. Once the number of accumulated features exceeds the buffer size L, we have to select and store informative features that contain the speaker permutation information from [X^buf; X_i] and [Y^buf; Ŷ_i]. Three selection rules for updating the buffer are listed below; here we assume that the number of speakers S is 2 (a Python sketch of the full procedure is given after Table 1).

• Uniform sampling (US): L acoustic features from [X^buf; X_i] and the corresponding diarization results from [Y^buf; Ŷ_i] are randomly extracted according to the uniform distribution.

• Deterministic selection (DS) using the absolute difference of the speaker probabilities,

    \delta_m = \left|y_{1,m} - y_{2,m}\right|,    (5)

where y_{1,m} and y_{2,m} are the probabilities of the first and second speakers at time index m.
The maximum value of δ_m (= 1) is attained when either y_{1,m} = 1 and y_{2,m} = 0 or y_{1,m} = 0 and y_{2,m} = 1; that is, we try to find dominant active-speaker frames. The top L samples with the highest δ_m are selected from [X^buf; X_i] and [Y^buf; Ŷ_i].

• Weighted sampling (WS): a combination of uniform sampling and deterministic selection. We randomly select L features, where the probability of selecting the m-th feature is proportional to δ_m in Eq. (5).

Table 1: DER (%) of the offline system with variable chunk size ∆, and of the three buffer-selection strategies with variable buffer size L and fixed chunk size ∆ = 10 ("–" denotes a value that is not available).

    ∆/L     Offline           US                DS                WS
            CH      CSJ       CH      CSJ       CH      CSJ       CH      CSJ
    10      36.93   47.51     –       –         42.23   –         –       37.86
    ⋮
    150     25.61   31.26     14.33   25.81     15.68   –         –       –
    200     25.31   29.51     13.95   24.99     14.47   –         –       –
    500     15.67   24.35     13.05   –         –       –         –       –
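To make the above concrete, here is a minimal Python sketch of Algorithm 1 with the WS update rule for S = 2. It is an illustration under stated assumptions, not the released implementation: sa_eend stands in for the trained model mapping (frames, features) to (frames, speakers) posteriors, and the small constants guard against division by zero.

    import numpy as np
    from itertools import permutations

    def correlation(y_ref, y_hyp):
        """Pearson correlation between two flattened label matrices (Eq. (3))."""
        a = y_ref.ravel() - y_ref.mean()
        b = y_hyp.ravel() - y_hyp.mean()
        return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    def online_diarization(chunks, sa_eend, buffer_size, seed=0):
        """Online SA-EEND with a speaker-tracing buffer (Algorithm 1, WS rule)."""
        rng = np.random.default_rng(seed)
        x_buf = y_buf = None                       # empty buffer at the start
        results = []
        for x_i in chunks:                         # x_i: (chunk, D) features
            if y_buf is None:
                y_i = sa_eend(x_i)
            else:
                y_all = sa_eend(np.concatenate([x_buf, x_i]))   # SA([X_buf; X_i])
                y_hat_buf, y_i = y_all[:len(x_buf)], y_all[len(x_buf):]
                # Eq. (4): permutation most correlated with the stored buffer labels.
                best = max(permutations(range(y_i.shape[1])),
                           key=lambda p: correlation(y_buf, y_hat_buf[:, list(p)]))
                y_i = y_i[:, list(best)]
            results.append(y_i)
            # WS buffer update: keep `buffer_size` frames, sampled with probability
            # proportional to delta_m = |y_{1,m} - y_{2,m}| (Eq. (5)).
            x_all = x_i if x_buf is None else np.concatenate([x_buf, x_i])
            y_all = y_i if y_buf is None else np.concatenate([y_buf, y_i])
            if len(x_all) > buffer_size:
                delta = np.abs(y_all[:, 0] - y_all[:, 1]) + 1e-8
                keep = rng.choice(len(x_all), size=buffer_size,
                                  replace=False, p=delta / delta.sum())
                keep.sort()
                x_all, y_all = x_all[keep], y_all[keep]
            x_buf, y_buf = x_all, y_all            # aligned outputs enter the buffer
        return np.concatenate(results)

DS would replace the random draw with the L frames having the largest δ_m (e.g., np.argsort(delta)[-buffer_size:]), and US would use uniform probabilities. Note that the permutation-aligned outputs are what enter the buffer, so the stored labels always share one speaker order.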
4. Experimental results
We analyzed the effect of the selection strategy using the same chunk size ∆ = 10 and several buffer sizes L from 10 to 500. In Table 1, the numbers in the left column (10–500) represent the chunk size ∆ for the offline system and the buffer size L for the online case. As shown in Table 1, applying the speaker-tracing buffer improved the performance of online SA-EEND regardless of which selection strategy was used. Among the strategies, WS performed best on both datasets in most cases when L was large (larger than 100). We therefore adopted WS as the selection strategy for the following analyses.

Next, we analyzed the effect of the buffer and chunk sizes. The DER results for CALLHOME and CSJ with the weighted sampling (WS) strategy are shown in Figure 3. Chunk sizes ∆ of 10 and 20 were evaluated, corresponding to latencies of 1 s and 2 s, respectively. Regarding the chunk size in Figure 3, all DERs with the larger chunk size ∆ = 20 are better than those with the smaller chunk size ∆ = 10, even for the same buffer size. As for the buffer size, with the chunk size fixed, the DER decreased as the buffer size increased. These results are in line with our assumption that a larger input size leads to a better result.

The real-time factor (RTF) was calculated as the ratio of the summed execution time over all chunks to the recording duration; it measures the decoding speed and thus the time performance of the proposed system. To avoid unequal buffer sizes in the first several chunks, we first filled the buffer with dummy values and then calculated the RTF. The experiment was conducted on an Intel® Xeon® CPU E5-2697A v2 @ 2.60 GHz using one thread. The RTF was 0.40 for buffer size L = 500 and 1.07 for L = 1000, which indicates that the proposed method is acceptable for online applications when the buffer size is smaller than 1000 (100 s).
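As a sanity check of this definition, the following sketch measures the RTF of a chunked processing loop; process_chunk and the per-chunk audio duration are assumptions for illustration (e.g., 1.0 s for ∆ = 10 chunk units of 0.1 s each).

    import time

    def measure_rtf(chunks, process_chunk, chunk_duration=1.0):
        """RTF = total per-chunk execution time / total audio duration."""
        elapsed = 0.0
        for chunk in chunks:
            start = time.perf_counter()
            process_chunk(chunk)              # one SA-EEND forward pass
            elapsed += time.perf_counter() - start
        return elapsed / (len(chunks) * chunk_duration)  # < 1.0: faster than real time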
Figure 2: Applying the speaker-tracing buffer to SA-EEND on the first two chunks. The SA-EEND output for the 1st chunk fills the buffer through the selection rule; the buffer is then prepended to the 2nd chunk's input, the permutation is solved using the stored labels and the new buffer output, and the buffer is updated again.

Figure 3: DER (%) as a function of buffer size L for CALLHOME and CSJ with the WS selection strategy.