Speaker Diarization with Lexical Information

Tae Jin Park, Kyu J. Han, Jing Huang, Xiaodong He, Bowen Zhou, Panayiotis Georgiou, and Shrikanth Narayanan

University of Southern California; JD AI Research

[email protected]
Abstract
This work presents a novel approach for speaker diarization to leverage lexical information provided by automatic speech recognition. We propose a speaker diarization system that can incorporate word-level speaker turn probabilities with speaker embeddings into a speaker clustering process to improve the overall diarization accuracy. To integrate lexical and acoustic information in a comprehensive way during clustering, we introduce an adjacency matrix integration for spectral clustering. Since the words and word boundary information for word-level speaker turn probability estimation are provided by a speech recognition system, our proposed method works without any manual transcription. We show that the proposed method improves diarization performance on various evaluation datasets compared to a baseline diarization system using only the acoustic information in speaker embeddings.
Index Terms: speaker diarization, automatic speech recognition, lexical information, adjacency matrix integration, spectral clustering
1. Introduction
Speaker diarization is the process of partitioning a given multi-speaker audio signal in terms of "who spoke when". It generally consists of two sub-processes: speaker segmentation, which cuts the given audio into speech segments that are homogeneous in terms of speaker characteristics, and speaker clustering, which groups all the segments from one speaker into the same cluster and assigns them the same speaker label. Speaker diarization plays a critical role in speech applications like automatic speech recognition (ASR) or behavioral analytics [1, 2, 3, 4].

Speaker diarization has long been considered a pre-processing step in the context of ASR. This is mostly because, in research setups where oracle results for speech activity detection or segmentation are given, grouping speech portions from the same speakers in a multi-speaker audio signal can benefit ASR systems: it enables speaker-specific feature transformation, e.g., fMLLR [5] or total variability factor analysis for i-vectors [6]. However, such oracle results would not be available for speaker diarization in practice. Also, performing speaker diarization before ASR on production systems in the wild, without proper post-processing, would degrade recognition accuracy significantly, since it is likely to place speaker change points in the middle of words rather than between words, resulting in word cuts or deletions. These practical issues were also pointed out in [7]. In addition, we recently showed in [8] that lexical cues in words or utterances can improve diarization accuracy when combined with acoustic features. All of this suggests that it would be more sensible and practical for speaker diarization to be treated as a post-processing step that takes advantage of ASR outputs within the ASR pipeline. In this regard, in this paper we assume that ASR outputs are available in text form for speaker diarization, and we propose a system to incorporate such lexical information into the diarization process.

Figure 1: The data flow of the proposed system.

There have been a handful of works employing ASR outputs to enhance speaker diarization systems, but they are mostly limited to speaker segmentation. ASR outputs are used in [9] for determining potential speaker change points. In [10], the lexical information provided by an ASR system is utilized to train a character-level language model and improve speaker segmentation performance. In our previous work [8], we exploited lexical information, from either reference transcripts or ASR outputs, along with acoustic information to enhance speaker segmentation in estimating speaker turns, and showed an overall improvement in speaker diarization accuracy.

In this paper we extend the exploitation of lexical information provided by an ASR system to the speaker clustering process in speaker diarization. The challenge of employing lexical information for speaker clustering is multifaceted and requires practical design choices. In our proposal, we use word-level speaker turn probabilities as the lexical representation and combine them with acoustic speaker embedding vectors when performing spectral clustering [11]. To integrate lexical and acoustic representations in the spectral clustering framework, we create adjacency matrices representing lexical and acoustic affinities between speech segments, respectively, and combine them with a per-element max operation. We show that the proposed speaker diarization system improves on a baseline's performance on two evaluation datasets.

The rest of the paper is organized as follows. In Section 2, we explain the data flow of our proposed speaker diarization system. In Sections 3 and 4, we detail how we process acoustic and lexical information, respectively. In Section 5, we describe the integration of the two sets of information in the framework of spectral clustering. Experimental results are discussed in Section 6, and we conclude the work with some remarks in Section 7.
2. Proposed speaker diarization system
The overall data flow of our proposed speaker diarization system is depicted in Fig. 1. In the proposed system, there are two streams of information: lexical and acoustic. On the lexical side, we receive automated transcripts with the corresponding time stamps for word boundaries from an available ASR system. This text information is passed to the speaker turn probability estimator to compute word-level speaker turn probabilities. On the acoustic side, we perform a common diarization task: MFCCs are extracted from the input speech signal after speech activity detection (SAD), and the SAD outputs are then uniformly segmented. These uniform segments are relayed to the speaker embedding extractor, which provides speaker embedding vectors. We use the publicly available Kaldi ASpIRE SAD model [12] for SAD in our proposed diarization pipeline.

After processing the two streams of information, we create two adjacency matrices, which hold the lexical and acoustic affinities between speech segments, respectively, and combine them with a per-element max operation to generate a combined affinity matrix that contains lexical and acoustic information in a common space. With the integrated adjacency matrix, we finally obtain speaker labels using a spectral clustering algorithm. Each module in Fig. 1 is explained in more detail in the following sections.
3. Acoustic information stream: Speaker embedding extractor
We employ the x-vector model [13], which has shown state-of-the-art performance on speaker verification and diarization tasks, as our speaker embedding generator. To perform the general diarization task with acoustic information in the proposed system pipeline, we use a 0.5-second window, a 0.25-second shift, and a 0.5-second minimum window size for 23-dimensional MFCCs. Improving the speaker embedding itself is out of the scope of this paper.
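For concreteness, the following is a minimal Python sketch of the uniform segmentation step, assuming the window parameters quoted above and representing SAD outputs as (start, end) pairs in seconds; the function name and interface are our own assumptions, not the authors' implementation.

```python
# A sketch of uniform segmentation over SAD output regions, assuming a
# 0.5 s window, 0.25 s shift, and 0.5 s minimum window (quoted above).
def uniform_segmentation(speech_regions, win=0.50, shift=0.25, min_win=0.50):
    """speech_regions: list of (start, end) SAD outputs in seconds.
    Returns a list of (start, end) uniform segments."""
    segments = []
    for region_start, region_end in speech_regions:
        t = region_start
        # Slide a fixed window; drop trailing remainders shorter than min_win.
        while region_end - t >= min_win:
            segments.append((t, min(t + win, region_end)))
            t += shift
    return segments

# Example: one 2-second speech region yields overlapping 0.5 s segments.
print(uniform_segmentation([(0.0, 2.0)]))
```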
4. Lexical information stream: Speaker turn probability estimator
While acoustic speaker characteristics can be used for speaker turn detection tasks [14], our proposal of word-level speaker turn probability estimation follows from the reasoning that lexical data can also provide an ample amount of information for similar tasks. Words in a given context (i.e., utterance) are likely to have different probabilities of a speaker turn occurring at the time they are spoken. For example, the words in the phrase "how are you" are very likely to be spoken by a single speaker rather than by multiple speakers. This means that each word in the phrase "how are you" would likely have a lower speaker turn probability than the word right after the phrase. In addition to lexical information, we also fuse a speaker embedding vector for each word to increase the accuracy of the turn probability estimation.

To estimate speaker turn probabilities, we train bidirectional three-layer gated recurrent units (GRUs) [15] with 2,048 hidden units on the Fisher [16] and Switchboard [17] corpora using force-aligned texts. The actual inputs to the proposed speaker turn probability estimator are the decoder outputs of the ASR. The words and the corresponding word boundaries are used to generate word embedding and speaker embedding vectors, respectively, as follows:
• Speaker embedding vector (S): With the given start and end time stamps of a word from ASR, we retrieve the speaker embedding vector using the speaker embedding extractor described in Section 3 (pretrained models: http://kaldi-asr.org/models/m4, http://kaldi-asr.org/models/m6). The x-vector speaker embedding is 128-dimensional.

Figure 2: An illustration of the proposed speaker turn probability estimator.

• Word embedding vector (W): We map the same word input to a 40K-dimensional one-hot vector, which is fully connected to the word embedding layer shown in Fig. 2. The dimension of the embedding layer is set to 256.

These two vectors are appended to make a 384-dimensional vector for every word and fed to the GRU layer. The output layer has one node and, during inference, outputs the speaker turn probability. The parameters of the speaker turn probability estimator are trained with the cross entropy loss. The ASR system used in this paper for decoding is the publicly available Kaldi ASpIRE recipe [12] (http://kaldi-asr.org/models/m1).
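The following PyTorch sketch shows one way to realize the estimator in Fig. 2 under the dimensions quoted above (40K vocabulary, 256-dim word embeddings, 128-dim x-vectors, a 3-layer bidirectional GRU with 2,048 hidden units per direction, and a one-node output). We realize the single-node output with a sigmoid trained with binary cross entropy; this is our interpretation of the description, not the authors' code.

```python
import torch
import torch.nn as nn

class TurnProbEstimator(nn.Module):
    def __init__(self, vocab_size=40000, word_dim=256, spk_dim=128,
                 hidden=2048, layers=3):
        super().__init__()
        # Word embedding layer: one-hot word index -> 256-dim vector.
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        # 3-layer bidirectional GRU over the 384-dim (word + speaker) inputs.
        self.gru = nn.GRU(word_dim + spk_dim, hidden, num_layers=layers,
                          bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, 1)  # one output node per word

    def forward(self, word_ids, spk_vecs):
        # word_ids: (B, T) word indices; spk_vecs: (B, T, 128) x-vectors.
        x = torch.cat([self.word_emb(word_ids), spk_vecs], dim=-1)  # (B, T, 384)
        h, _ = self.gru(x)
        # Per-word speaker turn probability in [0, 1].
        return torch.sigmoid(self.out(h)).squeeze(-1)

# Training sketch: binary cross entropy against 0/1 turn labels.
model = TurnProbEstimator()
ids = torch.randint(0, 40000, (2, 10))
vecs = torch.randn(2, 10, 128)
labels = torch.randint(0, 2, (2, 10)).float()
loss = nn.functional.binary_cross_entropy(model(ids, vecs), labels)
```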
5. Adjacency matrix integration
The biggest challenge in integrating speaker turn probabilities (from lexical information) and speaker embedding vectors (from acoustic information) in the spectral clustering framework is the heterogeneity of the information sources for these representations. To tackle this challenge, we first create two independent adjacency matrices that contain the lexical and acoustic affinities between speech segments, respectively, and then combine them with a per-element max operation to handle the information from the two different sources in the common space used for spectral clustering. For each adjacency matrix, we employ undirected graphs to represent the corresponding affinities between the segments.

• Adjacency matrix using speaker embeddings
1) Initially, compute the cosine similarity p_{i,j} between the speaker embedding vectors for segments s_i and s_j to form the adjacency matrix P, where 1 ≤ i, j ≤ M and M is the total number of segments in a given audio signal.

2) For every i-th row of P, update p_{i,j} as follows:

$$p_{i,j} = \begin{cases} 1 & \text{if } p_{i,j} \geq W(r) \\ 0 & \text{otherwise} \end{cases} \quad (1)$$

where W(r) is the cosine similarity value at the r-th percentile in each row and r is optimized on the dev set. This operation converts P to a discrete-valued affinity matrix through nearest neighbor connections.

3) Note that P is now asymmetric and can be seen as the adjacency matrix of a directed graph where, in our case, each node represents a speech segment. As spectral clustering in theory finds the minimum cuts on an undirected graph [11], we choose an undirected version of P, P_ud, as the adjacency matrix for speaker embeddings by averaging P and its transpose:

$$P_{ud} = \frac{1}{2}(P + P^{T}) \quad (2)$$

The pictorial representation of P_ud is given in the left side of Fig. 5. A minimal sketch of these steps follows below.
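The following NumPy sketch implements steps 1) to 3), i.e., Eqs. (1) and (2). The default percentile r = 95 is an arbitrary placeholder for the dev-set-tuned value, and the function name is our own; this is a sketch, not the authors' implementation.

```python
import numpy as np

def acoustic_adjacency(embeddings, r=95.0):
    """embeddings: (M, D) matrix of speaker embedding vectors.
    r: row-wise percentile threshold, tuned on the dev set."""
    # Cosine similarity matrix P: normalize rows, then take inner products.
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    P = X @ X.T
    # Eq. (1): binarize each row against its own r-th percentile W(r),
    # keeping only the nearest-neighbor connections.
    thresh = np.percentile(P, r, axis=1, keepdims=True)
    P = (P >= thresh).astype(float)
    # Eq. (2): symmetrize the now-asymmetric P into an undirected graph.
    return 0.5 * (P + P.T)

# Example with random embeddings for 8 segments.
P_ud = acoustic_adjacency(np.random.randn(8, 128))
```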
• Adjacency matrix using speaker turn probabilities

Figure 3: An example of the word sequence processing for the adjacency matrix calculation using the speaker turn probabilities.

The following steps 1) to 4) correspond to the numbered illustrations in Fig. 3, where c = 0.3 and ν = 3 are given as example parameters.

1) For a given threshold c, pick all the turn words that have speaker turn probabilities greater than c in the word sequence provided by ASR. The colored boxes in Fig. 3-1) indicate the turn words. The threshold c is determined by the eigengap heuristic discussed in Section 5.2.

2) Break the word sequence at every point where a turn word starts, as in Fig. 3-2). The given word sequence is thus broken into multiple utterances.

3) Pick all the utterances that have more than one word, because a duration spanning a single word may be too short to carry any speaker-specific information. In Fig. 3-3), the words "well" and "great" are thus not considered for further processing. Additionally, we always treat seven back-channel words ("yes", "oh", "okay", "yeah", "uh-huh", "mhm", "[laughter]") as independent utterances regardless of their turn probabilities.

4) To mitigate the effect of any missed detection by the speaker turn probability estimator, we over-segment the utterances by limiting the maximum utterance length to ν words. In Fig. 3-4), the resulting utterances are depicted with different colors. The maximum utterance length ν is optimized on the dev set in the range of 2 to 9.

Figure 4: An example of the speech segment selection process using the utterance boundary information.

5) Find all the speech segments that fall into the boundary of each utterance. Fig. 4 shows how the speech segments within the boundary of the example utterance "how are you" are selected. If a segment only partly falls into the utterance boundary, it is considered to fall into the boundary when its overlap (l in Fig. 4) is greater than 50% of the segment length.
6) Let s_m be the first segment and s_n be the last segment falling into the utterance boundary (as illustrated in Fig. 4). With the elements q_{i,j} of an adjacency matrix Q_c (for the threshold c) initialized to zero, we apply the following operation for all the utterances:

$$q_{i,j} = \begin{cases} 1 & \text{if } m \leq i, j \leq n \\ q_{i,j} & \text{otherwise} \end{cases} \quad (3)$$

The right side of Fig. 5 shows an example of Q_c for the utterance "how are you" in Fig. 4, as sketched in the code below.
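Here is a minimal NumPy sketch of Eq. (3), assuming the utterance boundaries have already been mapped to segment index pairs (m, n) as in steps 5) and 6); the input representation is our own assumption.

```python
import numpy as np

def lexical_adjacency(num_segments, utterance_segment_spans):
    """utterance_segment_spans: list of (m, n) pairs giving the first and
    last segment index falling into each utterance's boundary."""
    Q = np.zeros((num_segments, num_segments))
    for m, n in utterance_segment_spans:
        # Eq. (3): a block of ones for all segment pairs within one utterance.
        Q[m:n + 1, m:n + 1] = 1.0
    return Q

# Example: segments 4-6 fall inside the utterance "how are you".
Q_c = lexical_adjacency(8, [(4, 6)])
```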
Figure 5: Examples of the two adjacency matrices.

• Combining adjacency matrices

We combine the two adjacency matrices:

$$A_c = \max(P_{ud}, Q_c) = \max\left(\frac{1}{2}(P + P^{T}), Q_c\right) \quad (4)$$

where max is a per-element max operation.

In spectral clustering, the Laplacian matrix is employed to obtain the spectrum of an adjacency matrix. In this work, we employ the unnormalized graph Laplacian matrix L_c [11]:

$$L_c = D_c - A_c \quad (5)$$

where $D_c = \mathrm{diag}\{d_1, d_2, \ldots, d_M\}$, $d_i = \sum_{k=1}^{M} a_{ik}$, and $a_{ik}$ is the element in the i-th row and k-th column of the adjacency matrix A_c. We calculate the eigenvalues of L_c and set up an eigengap vector e_c:

$$e_c = [\lambda_2 - \lambda_1, \lambda_3 - \lambda_2, \cdots, \lambda_M - \lambda_{M-1}] \quad (6)$$

where λ_1 is the smallest eigenvalue and λ_M is the largest. The resulting adjacency matrix A_c is passed to the spectral clustering algorithm, for which we use the implementation in [18]. The number of clusters (in our case, the number of speakers) is estimated as the arg max of the eigengap vector e_c:

$$\hat{n}_s = \arg\max_n (e_c) \quad (7)$$

where $\hat{n}_s$ refers to the estimated number of speakers.
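The combination and eigengap-based speaker counting of Eqs. (4)-(7) can be sketched as follows, using scikit-learn's spectral clustering [18]. Note that scikit-learn internally works with a normalized Laplacian, so this is an approximation of the described procedure rather than the authors' exact code.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_speakers(P_ud, Q_c):
    A_c = np.maximum(P_ud, Q_c)                 # Eq. (4): per-element max
    D_c = np.diag(A_c.sum(axis=1))              # degree matrix D_c
    L_c = D_c - A_c                             # Eq. (5): unnormalized Laplacian
    eigvals = np.sort(np.linalg.eigvalsh(L_c))  # eigenvalues, ascending
    e_c = np.diff(eigvals)                      # Eq. (6): eigengap vector
    n_speakers = int(np.argmax(e_c)) + 1        # Eq. (7): largest eigengap
    labels = SpectralClustering(n_clusters=n_speakers,
                                affinity='precomputed').fit_predict(A_c)
    return labels
```

A usage example would pass the outputs of the two sketches above, e.g., `cluster_speakers(acoustic_adjacency(embeddings), lexical_adjacency(len(embeddings), spans))`.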
Table 1: DER (%) on the RT03-CTS dataset. The number of speakers is unknown to all systems.

  Dataset split (quantity)          Dev (14)        Eval (58)
                                    DER    SER      DER    SER
  Wang et al. [19] (system SAD)      -      -       12.3   3.76
  M1  Baseline                       -      -        -      -
  M2  Proposed (W)                  3.97   1.00     5.19   1.93
  M3  Proposed (W+S)                3.79   0.82     5.11   1.85
6. Experimental Results and Discussion
Evaluation datasets for our proposed system are limited to English speech corpora, since the proposed system relies on an English ASR system at the moment. We report the diarization performance on the following corpora:
• RT03-CTS (LDC2007S10): We use the 14-vs-58 dev and eval split provided by the authors of [19]. All the parameters appearing in this paper are optimized on the RT03 dev set.
• CH American English Speech (CHAES) (LDC97S42) [20]: Note that the CHAES corpus is different from the commonly used multilingual dataset "NIST SRE 2000 CALLHOME (LDC2001S97)", which is a superset of the CHAES corpus. Within the CHAES corpus, we use two different subsets. (1) CH-Eval: the evaluation set from CHAES. (2) CH-109: 109 conversations from the CHAES corpus that have exactly 2 speakers. The CH-109 subset is widely used, e.g., in [21], when evaluating diarization systems focusing only on the 2-speaker case.
We evaluate the proposed system (M3) against the baseline system configuration (M1) on the two evaluation datasets (CH-109 and CH-Eval) as well as the RT03-CTS dataset. To evaluate the systems in terms of diarization error rate (DER) and speaker error rate (SER), we use the md-eval software presented in [22]. The gap between DER and SER originates from the false alarms and missed detections caused by SAD. The systems compared in Tables 1 and 2 are configured as follows:

• M1: This baseline configuration only exploits P_ud as A_c for spectral clustering (i.e., A_c = P_ud). This is the general speaker diarization system utilizing acoustic information only, in the form of speaker embeddings. Its results contrast how much lexical information contributes to the speaker clustering process to enhance the overall speaker diarization accuracy in M2 and M3.

• M2: This configuration of the proposed system excludes the speaker embedding part of the speaker turn probability estimator in Fig. 2, to isolate the contribution of lexical information in the speaker turn probability estimation process.

• M3: This is the full-blown configuration, as explained throughout this paper.

The performance of our proposed system is compared to previously published results [19, 21] on the same datasets. It should be noted, however, that the results in [19] and our proposed system are based on system SAD, which is bound to give higher DER than systems based on oracle SAD. On the other hand, the system in [21] uses oracle SAD, which makes DER equal to SER.

Table 2: DER (%) on the CHAES dataset. The number of speakers is unknown for CH-Eval and known for CH-109.

  Dataset (quantity)                CH-Eval (20)     CH-109 (109)
                                    DER     SER      DER     SER
  Wang et al. [19] (system SAD)     12.54   5.97     12.48   6.03
  Zajíc et al. [21] (oracle SAD)     -       -        -      7.84
  M1  Baseline                       -       -        -       -
  M2  Proposed (W)                   7.04    2.97     5.96    1.67
  M3  Proposed (W+S)                 6.97    2.90     6.03    1.73
• Table 1 (RT03-CTS): The M3 system improves the performance over M2, but the relative improvement is minimal compared to the improvement of M2 over M1. This shows that most of the performance gain of the proposed speaker diarization system comes from employing lexical information in the speaker clustering process.
• Table 2 (CH-Eval, CH-109): This table compares our proposed speaker diarization system with the recently published results [19, 21] on the CHAES datasets. For a fair comparison, we applied the eigengap-based speaker number estimation of Eq. (6) only to the CH-Eval dataset, while fixing the number of speakers to 2 for the CH-109 dataset (since CH-109 is the subset of CHAES conversations with only 2 speakers). The table shows that our proposed system (M3) outperforms the previously published results of [19, 21] on both CH-Eval and CH-109. It is worth mentioning that the proposed system did not gain a noticeable improvement on the CH-Eval dataset compared to the baseline configuration (M1). On the CH-109 dataset, on the other hand, M3 provides a noticeable jump in SER over M1. Given our observation that in the CH-109 evaluation most of the performance improvement from M1 to M3 came from the 10 worst sessions, on which the baseline system performed poorly, we presume that the proposed system improves the clustering results on such challenging data.
The experimental results show that the baseline system outperforms the previously published results due to the performance of ASpIRE SAD [12] and x-vectors [13]. Nevertheless, our proposed system still improves this competitive baseline by 36% for RT03-Eval and 19% for CH-109 in terms of SER.
7. Conclusions
In this paper, we proposed a speaker diarization system that exploits lexical information from ASR in the speaker clustering process to improve the overall DER. The experimental results showed that the proposed system provides meaningful improvements on both the CHAES and RT03-CTS datasets, outperforming a baseline system that is already competitive with previously published state-of-the-art results. This supports our claim that lexical information can improve diarization results through the incorporation of turn probabilities and word boundaries. Further studies should target optimal approaches to integrating the adjacency matrices by employing improved search techniques, which could improve not only the clustering performance but also the processing speed.
8. Acknowledgements
This research was supported in part by NSF, NIH, and DOD.

9. References

[1] D. Liu, D. Kiecza, A. Srivastava, and F. Kubala, "Online speaker adaptation and tracking for real-time speech recognition," in Ninth European Conference on Speech Communication and Technology, 2005.
[2] T. Hain, L. Burget, J. Dines, P. N. Garner, F. Grézl, A. El Hannani, M. Huijbregts, M. Karafiat, M. Lincoln, and V. Wan, "Transcribing meetings with the AMIDA systems," IEEE Trans. Audio Speech Lang. Process., vol. 20, no. 2, pp. 486–498, 2012.
[3] P. Georgiou, M. Black, and S. Narayanan, "Behavioral signal processing for understanding (distressed) dyadic interactions: Some recent developments," in Joint ACM Workshop on Human Gesture and Behavior Understanding, 2011, pp. 7–12.
[4] S. Narayanan and P. Georgiou, "Behavioral signal processing: Deriving human behavioral informatics from speech and language," Proceedings of the IEEE, vol. PP, no. 99, pp. 1–31, 2013.
[5] M. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition," Computer Speech and Language, vol. 12, pp. 75–98, 1997.
[6] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
[7] D. Dimitriadis and P. Fousek, "Developing on-line speaker diarization system," in Interspeech, 2017, pp. 2739–2743.
[8] T. J. Park and P. Georgiou, "Multimodal speaker segmentation and diarization using lexical and acoustic cues via sequence to sequence neural networks," in Interspeech, 2018.
[9] P. Cerva, J. Silovsky, J. Zdansky, J. Nouza, and L. Seps, "Speaker-adaptive speech recognition using speaker diarization for improved transcription of large spoken archives," Speech Communication, vol. 55, no. 10, pp. 1033–1046, 2013.
[10] M. À. India Massana, J. A. Rodríguez Fonollosa, and F. J. Hernando Pericás, "LSTM neural network-based speaker segmentation using acoustic and language modelling," in Interspeech 2017, Stockholm, ISCA, 2017, pp. 2834–2838.
[11] U. von Luxburg, "A tutorial on spectral clustering," Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007.
[12] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., "The Kaldi speech recognition toolkit," IEEE Signal Processing Society, Tech. Rep., 2011.
[13] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in ICASSP, IEEE, 2018.
[14] H. Bredin, "TristouNet: Triplet loss for speaker turn embedding," in ICASSP, IEEE, 2017, pp. 5430–5434.
[15] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," arXiv preprint arXiv:1412.3555, 2014.
[16] C. Cieri, D. Miller, and K. Walker, "Fisher English training speech parts 1 and 2," Philadelphia: Linguistic Data Consortium, 2004.
[17] J. J. Godfrey and E. Holliman, "Switchboard-1 release 2," Linguistic Data Consortium, Philadelphia, vol. 926, p. 927, 1997.
[18] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[19] Q. Wang, C. Downey, L. Wan, P. A. Mansfield, and I. L. Moreno, "Speaker diarization with LSTM," in ICASSP, IEEE, 2018, pp. 5239–5243.
[20] Linguistic Data Consortium et al., "CALLHOME American English speech," 1997.
[21] Z. Zajíc, M. Hrúz, and L. Müller, "Speaker diarization using convolutional neural network for statistics accumulation refinement," in Proc. Interspeech 2017, 2017, pp. 3562–3566. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2017-51
[22] J. G. Fiscus, J. Ajot, M. Michel, and J. S. Garofolo, "The Rich Transcription 2006 spring meeting recognition evaluation," in Machine Learning for Multimodal Interaction, Springer, 2006.