A Review of Speaker Diarization: Recent Advances with Deep Learning
Tae Jin Park a,∗, Naoyuki Kanda b,∗, Dimitrios Dimitriadis b,∗, Kyu J. Han c,∗, Shinji Watanabe d,∗, Shrikanth Narayanan a

a University of Southern California, Los Angeles, USA
b Microsoft, Redmond, USA
c ASAPP, Mountain View, USA
d Johns Hopkins University, Baltimore, USA

∗ Authors contributed equally
Abstract
Speaker diarization is a task to label audio or video recordings with classes corresponding to speaker identity, or in short, a task to identify "who spoke when". In the early years, speaker diarization algorithms were developed for speech recognition on multi-speaker audio recordings to enable speaker-adaptive processing, but the task also gained its own value over time as a stand-alone application that provides speaker-specific meta information for downstream tasks such as audio retrieval. More recently, with the rise of deep learning technology, which has been a driving force behind revolutionary changes in research and practice across speech application domains in the past decade, rapid advances have been made in speaker diarization. In this paper, we review not only the historical development of speaker diarization technology but also the recent advances in neural speaker diarization approaches. We also discuss how speaker diarization systems have been integrated with speech recognition applications, and how the recent surge of deep learning is leading the way toward jointly modeling these two components so that they complement each other. Considering these exciting technical trends, we believe this survey is a valuable contribution to the community, consolidating the recent developments with neural methods and thus facilitating further progress toward more efficient speaker diarization.

Keywords: speaker diarization, automatic speech recognition, deep learning
1. Introduction

"Diarize" is a word that means making a note or keeping an event in a diary. Speaker diarization, like keeping a record of events in such a diary, addresses the "who spoke when" question [1, 2, 3] by logging speaker-specific salient events on multi-participant (or multi-speaker) audio data. Throughout the diarization process, the audio data is divided and clustered into groups of speech segments with the same speaker identity/label. As a result, salient events, such as non-speech/speech transitions, speaker turn changes, and speaker classification or speaker role identification, are labeled automatically. In general, this process does not require any prior knowledge of the speakers, such as their real identities or the number of participating speakers in the audio data. Thanks to its innate ability to separate audio streams by these speaker-specific events, speaker diarization can be effectively employed for indexing or analyzing various types of audio data, e.g., audio/video broadcasts from media stations, conversations in conferences, personal videos from online social media or hand-held devices, court proceedings, business meetings, and earnings reports in the financial sector, just to name a few.

Traditionally, speaker diarization systems consist of multiple independent sub-modules, as shown in Fig. 1. In order to mitigate artifacts in acoustic environments, various front-end processing techniques, for example speech enhancement, dereverberation, speech separation, or target-speaker extraction, are utilized. Voice or speech activity detection is then applied to separate speech from non-speech events. Raw speech signals in the selected speech portions are transformed into acoustic features or embedding vectors. In the clustering stage, the speech portions represented by the embedding vectors are grouped and labeled by speaker class, and in the post-processing stage, the clustering results are further refined. Each of these sub-modules is, in general, optimized individually.
During the early years of diarization technology (in the 1990s), the research focus was on unsupervised speech segmentation and clustering of acoustic events, including not only speaker-specific ones but also those related to environmental or background changes [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]. In this period, some of the fundamental approaches to speaker change detection and clustering, such as those leveraging the Generalized Likelihood Ratio (GLR) and the Bayesian Information Criterion (BIC), were developed and quickly became the gold standard. Most of these works benefited Automatic Speech Recognition (ASR) on broadcast news recordings by enabling speaker-adaptive training of acoustic models [10, 15, 16, 17, 18]. All these efforts collectively laid out a path to consolidate activities across research groups around the world, leading to several research consortia and challenges in the early 2000s.

Fig. 1: Traditional speaker diarization system, consisting of front-end processing (Section 2.1), speech activity detection (Section 2.2), segmentation (Section 2.3), speaker embedding extraction (Section 2.4), clustering (Section 2.5), and post-processing (Section 2.6), mapping the audio input to the diarization output (RTTM).
Among these were the Augmented Multi-party Interaction (AMI) Consortium [19], supported by the European Commission, and the Rich Transcription Evaluation [20], hosted by the National Institute of Standards and Technology (NIST). These organizations, spanning from a few years to a decade, fostered further advances in speaker diarization technology across different data domains, from broadcast news [21, 22, 23, 24, 25, 26, 27, 28, 29] and Conversational Telephone Speech (CTS) [24, 30, 31, 32, 33, 34] to meeting conversations [35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45]. The new approaches resulting from these advances include, but are not limited to, beamforming [42], Information Bottleneck Clustering (IBC) [44], Variational Bayesian (VB) approaches [33, 45], and Joint Factor Analysis (JFA) [46, 34].

Since the advent of deep learning in the 2010s, there has been a considerable amount of research taking advantage of the powerful modeling capability of neural networks for speaker diarization. One representative example is extracting speaker embeddings using neural networks, such as d-vectors [47, 48, 49] or x-vectors [50], which most often are embedding vector representations based on the bottleneck-layer output of a Deep Neural Network (DNN) trained for speaker recognition. The shift from i-vectors [51, 52, 53, 54] to these neural embeddings contributed to enhanced performance, easier training with more data [55], and robustness against speaker variability and acoustic conditions. More recently, End-to-End Neural Diarization (EEND), where the individual sub-modules of traditional speaker diarization systems (cf. Fig. 1) are replaced by one neural network, has gained attention with promising results [56, 57]. This research direction, although not fully matured yet, could open unprecedented opportunities to address challenges in the field of speaker diarization, such as joint optimization with other speech applications and handling overlapping speech, provided that large-scale data is available for training such powerful network-based models.

To date, there are two well-rounded overview papers in the area of speaker diarization, surveying the development of the technology with different focuses. In [2], various speaker diarization systems and their subtasks in the context of broadcast news and CTS data are reviewed up to the mid 2000s; the historical progress of speaker diarization technology in the 1990s and early 2000s is hence covered. In contrast, the focus of [3] is on speaker diarization for meeting speech and its respective challenges. That paper thus weighs more toward technologies that mitigate problems arising in meeting environments, where there are usually more participants than in broadcast news or CTS data and multi-modal data is frequently available. Since these two papers, thanks especially to leap-frog advances in deep learning approaches addressing technical challenges across multiple machine learning domains, speaker diarization systems have gone through many notable changes. We believe that this survey is a valuable contribution to the community, consolidating the recent developments with neural methods and thus facilitating further progress toward more efficient diarization.
To categorize the existing, highly diverse speaker diarization technologies, covering both the modularized speaker diarization systems from before the deep learning era and those based on neural networks in recent years, a proper grouping is helpful. The main categorization we adopt in this paper is based on two criteria, resulting in four categories, as shown in Table 1. The first criterion is whether the model is trained with a speaker diarization-oriented objective function. Any trainable approach that optimizes a model in a multi-speaker setting and learns relations between speakers is categorized into the "Diarization Objective" class. The second criterion is whether multiple modules are jointly optimized toward some objective function. If a single sub-module is replaced by a trainable one, the method is categorized into the "Single-module Optimization" class. On the other hand, for example, joint modeling of segmentation and clustering [55], joint modeling of speech separation and speaker diarization [76], and fully end-to-end neural diarization [56, 57] are categorized into the "Joint Optimization" class.

Note that the intention of this categorization is to help readers quickly overview the broad development in the field; it is not our intention to rank the categories as superior or inferior. Also, while we are aware of many techniques that fall into the "Non-Diarization Objective" and "Joint Optimization" category (e.g., joint front-end and ASR [67, 68, 69, 70, 71, 72], joint speaker identification and speech separation [73, 74], etc.), we exclude them from this paper to keep the focus on the review of speaker diarization techniques.
Table 1: Taxonomy of speaker diarization approaches.

                        | Non-Diarization Objective            | Diarization Objective
------------------------+--------------------------------------+---------------------------------------
Single-module           | Section 2:                           | Section 3.1:
Optimization            | front-end [58, 59, 60], speaker      | IDEC [64], affinity matrix
                        | embedding [61, 62, 50], speech       | refinement [65], TS-VAD [66], etc.
                        | activity detection [63], etc.        |
------------------------+--------------------------------------+---------------------------------------
Joint                   | Out of scope:                        | Section 3.2:
Optimization            | joint front-end & ASR                | UIS-RNN [55], RPN [75], online
                        | [67, 68, 69, 70, 71, 72], joint      | RSAN [76], EEND [56, 57], etc.
                        | speaker identification & speech      | Section 4:
                        | separation [73, 74], etc.            | joint ASR & speaker diarization
                        |                                      | [77, 78, 79, 80], etc.

The rest of the paper is organized as follows.

• In Section 2, we overview techniques belonging to the "Non-Diarization Objective" and "Single-module Optimization" class in the proposed taxonomy, mostly those used in the traditional, modular speaker diarization systems. While there is some overlap with the counterpart sections of the two aforementioned survey papers [2, 3] in terms of reviewing notable past developments, this section also adds the latest schemes for the corresponding components of the speaker diarization systems.

• In Section 3, we discuss advances that mostly leverage DNNs trained with a diarization objective, where single sub-modules are independently optimized (Section 3.1) or multiple modules are jointly optimized (Section 3.2) toward fully end-to-end speaker diarization.

• In Section 4, we present a perspective on how speaker diarization has been investigated in the context of ASR, reviewing the historical interactions between these two domains to glimpse the past, present, and future of speaker diarization applications.

• Section 5 provides information on speaker diarization challenges and corpora that facilitate research activities and anchor technology advances. We also discuss evaluation metrics such as the Diarization Error Rate (DER), Jaccard Error Rate (JER), and Word-level DER (WDER).

• We share a few examples of how speaker diarization systems are employed in both research and industry practice in Section 6, and conclude this work in Section 7 with a summary and future challenges in speaker diarization.
2. Modular Speaker Diarization Systems
This section provides an overview of algorithms for speaker diarization belonging to the "Single-module Optimization, Non-Diarization Objective" class, i.e., mostly the modular speaker diarization systems shown in Fig. 1. Each subsection corresponds to one module of the traditional speaker diarization pipeline. In addition to an introductory explanation of each module, each subsection also summarizes the latest schemes for that module.
2.1. Front-End Processing

This section describes front-end techniques used for speech enhancement, dereverberation, speech separation, and speech extraction as part of the speaker diarization pipeline. Let s_{i,f,t} ∈ C be the STFT representation of source speaker i on frequency bin f at frame t. The observed noisy signal x_{t,f} can be represented by a mixture of the source signals, a room impulse response h_{i,f,t} ∈ C, and additive noise n_{t,f} ∈ C:

x_{t,f} = \sum_{i=1}^{K} \sum_{\tau} h_{i,f,\tau} \, s_{i,f,t-\tau} + n_{t,f},   (1)

where K denotes the number of speakers present in the audio signal.

The front-end techniques described in this section estimate the original source signals given the observation X = ({x_{t,f}}_f)_t for the downstream diarization task:

\hat{x}_{i,t} = \mathrm{FrontEnd}(X), \quad i = 1, \ldots, K,   (2)

where \hat{x}_{i,t} ∈ C^D is the i-th speaker's estimated STFT spectrum with D frequency bins at frame t.

Although there are numerous speech enhancement, dereverberation, and separation algorithms, e.g., [81, 82, 83], we mainly cover the recent techniques used in the DIHARD challenge series [84, 85, 86], the LibriCSS meeting recognition task [87, 88], and the CHiME-6 challenge track 2 [89, 90, 91].

Speech enhancement techniques focus mainly on suppressing the noise component of the noisy speech. Single-channel speech enhancement has shown significant improvements in denoising performance [92, 93, 94] thanks to deep learning, compared with classical signal-processing-based speech enhancement [95]. For example, LSTM-based speech enhancement [96, 94] is used as a front-end technique in the DIHARD III baseline [85], i.e.,

\hat{x}_t = \mathrm{LSTM}(X),   (3)

where we consider only the single-source case (i.e., K = 1). This is a regression-based approach that minimizes the objective function

L_{\mathrm{MSE}} = \| s_t - \hat{x}_t \|^2.   (4)

The log power spectrum or an ideal ratio mask is often used as the target domain of the output. The speech enhancement used in [95] applies this objective function at each layer in a progressive manner.

The effectiveness of speech enhancement techniques can be boosted by multi-channel processing, including minimum variance distortionless response (MVDR) beamforming [81]. [88] showed a significant improvement of the DER from 18.3% to 13.9% on the LibriCSS meeting task using mask-based MVDR beamforming [97, 98].
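To make the single-channel case of Eqs. (3)-(4) concrete, below is a minimal PyTorch sketch of an LSTM mask estimator trained with the MSE objective; the layer sizes and the sigmoid ratio mask are illustrative assumptions rather than the exact DIHARD III baseline recipe.

```python
# A minimal sketch of LSTM-based mask estimation for speech enhancement
# (cf. Eqs. (3)-(4)); sizes and the ratio-mask target are assumptions.
import torch
import torch.nn as nn

class LstmEnhancer(nn.Module):
    def __init__(self, n_freq=257, hidden=512):
        super().__init__()
        self.lstm = nn.LSTM(n_freq, hidden, num_layers=2, batch_first=True)
        self.mask = nn.Linear(hidden, n_freq)

    def forward(self, noisy_mag):          # (batch, frames, n_freq)
        h, _ = self.lstm(noisy_mag)
        m = torch.sigmoid(self.mask(h))    # ratio mask in [0, 1]
        return m * noisy_mag               # enhanced magnitude x_hat

model = LstmEnhancer()
noisy = torch.rand(4, 100, 257)            # |STFT| of noisy speech
clean = torch.rand(4, 100, 257)            # |STFT| of clean target s_t
loss = nn.functional.mse_loss(model(noisy), clean)   # L_MSE of Eq. (4)
loss.backward()
```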
Compared with other front-end techniques, the major dereverberation techniques used in various tasks are based on statistical signal processing methods. One of the most widely used techniques is Weighted Prediction Error (WPE)-based dereverberation [99, 100, 101]. The basic idea of WPE, for the case of a single source (K = 1) without noise, is to decompose the original signal model of Eq. (1) into an early reflection x^{early}_{t,f} and a late reverberation x^{late}_{t,f} as follows:

x_{t,f} = \sum_{\tau} h_{f,\tau} s_{f,t-\tau} = x^{early}_{t,f} + x^{late}_{t,f}.   (5)

WPE estimates filter coefficients \hat{h}^{wpe}_{f,\tau} ∈ C that maintain the early reflection while suppressing the late reverberation, based on maximum likelihood estimation:

\hat{x}^{early}_{t,f} = x_{t,f} - \sum_{\tau=\Delta}^{L} \hat{h}^{wpe}_{f,\tau} x_{f,t-\tau},   (6)

where Δ is the number of frames separating the early reflection from the late reverberation, and L is the filter size.

WPE is widely used as one of the gold-standard front-end processing methods; e.g., it is part of both the baselines and the top-performing systems of the DIHARD and CHiME challenges [84, 85, 86, 89, 90]. Although the performance improvement from WPE-based dereverberation is not dramatic, it provides solid gains across almost all tasks. Moreover, because WPE is based on linear filtering and does not introduce signal distortions, it can be safely combined with downstream front-end and back-end processing steps. Similarly to the speech enhancement techniques, WPE-based dereverberation shows additional performance improvements when applied to multi-channel signals.
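The linear filtering of Eq. (6) can be sketched compactly. The single-channel, variance-normalized delayed linear prediction loop below is one common WPE formulation written under simplifying assumptions; production systems typically rely on a tested implementation such as the open-source nara_wpe package.

```python
# A compact single-channel WPE iteration (cf. Eqs. (5)-(6)); an
# illustrative sketch, not a reference implementation.
import numpy as np

def wpe_1ch(X, taps=10, delay=3, iters=3, eps=1e-8):
    """X: complex STFT of one channel, shape (F, T). Returns dereverberated STFT."""
    F_bins, T = X.shape
    Z = X.copy()
    for _ in range(iters):
        lam = np.maximum(np.abs(Z) ** 2, eps)          # per-frame signal variance
        for f in range(F_bins):
            # Delayed observation matrix Y: Y[k, t] = X[f, t - delay - k]
            Y = np.zeros((taps, T), dtype=complex)
            for k in range(taps):
                d = delay + k
                Y[k, d:] = X[f, : T - d]
            Yw = Y / lam[f]                            # variance normalization
            R = Yw @ Y.conj().T                        # weighted correlation
            p = Yw @ X[f].conj()                       # weighted cross-correlation
            g = np.linalg.solve(R + eps * np.eye(taps), p)
            Z[f] = X[f] - g.conj() @ Y                 # Eq. (6): remove late reverb
    return Z

# Toy usage on a random "STFT"
rng = np.random.default_rng(0)
X = rng.normal(size=(65, 200)) + 1j * rng.normal(size=(65, 200))
Z = wpe_1ch(X)
```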
Speech separation is a promising family of techniques when the overlapping speech regions are significant. As in other research areas, DL-based speech separation has become popular, e.g., Deep Clustering [58], Permutation Invariant Training (PIT) [59], and Conv-TasNet [60]. The effectiveness of multi-channel speech separation based on beamforming has been widely confirmed as well [102, 103]. For example, in the CHiME-6 challenge [89], Guided Source Separation (GSS) [103]-based multi-channel speech extraction techniques were used to achieve the top result. On the other hand, single-channel speech separation techniques often show no significant gains in realistic multi-speaker scenarios like the LibriCSS [87] or CHiME-6 [89] tasks, where speech signals are continuous and contain both overlapping and overlap-free regions. Single-channel separation systems often produce redundant non-speech or even duplicated speech signals for the non-overlapping regions, and this "leakage" of audio causes many false alarms in speech activity detection. A leakage filtering method tackling this problem was proposed in [104], where a significant improvement in speaker diarization performance was shown after including this processing step in the top-ranked system of the VoxCeleb Speaker Recognition Challenge 2020 [105].

2.2. Speech Activity Detection

SAD distinguishes speech segments from non-speech segments such as background noise. A SAD system mostly comprises two parts. The first is a feature extraction front-end, where acoustic features such as Mel-Frequency Cepstral Coefficients (MFCCs) are extracted. The other is a classifier, where a model predicts whether each input frame is speech or not. These models may include Gaussian Mixture Models (GMMs) [106], Hidden Markov Models (HMMs) [107], or DNNs [63].

The performance of SAD largely affects the overall performance of a speaker diarization system, because it can create a significant number of false-positive salient events or miss speech segments [108]. A common practice in speaker diarization tasks is to report DER under an "oracle SAD" setup, which indicates that the system output uses speech activity detection output identical to the ground truth. On the other hand, the system output with an actual speech activity detector is referred to as "system SAD" output.

2.3. Segmentation

Speech segmentation breaks the input audio stream into multiple segments so that each segment can be assigned a speaker label. Before any re-segmentation phase, the unit of the speaker diarization output is determined by this segmentation process. There are two ways of performing speech segmentation for speaker diarization tasks: speaker change point detection or uniform segmentation. Segmentation by detecting speaker change points was the gold standard of earlier speaker diarization systems, where speaker change points are detected by comparing two hypotheses: hypothesis H_0 assumes both the left and right samples are from the same speaker, and hypothesis H_1 assumes the two samples are from different speakers. Many algorithms for this hypothesis testing, such as Kullback-Leibler 2 (KL2) [10], the Generalized Likelihood Ratio (GLR) [109], and BIC [110, 111], were proposed, with the BIC method being the most widely used. The BIC approach can be applied to the segmentation process as follows. Assume that X = {x_1, ..., x_N} is the sequence of speech features extracted from the given audio stream, and that each x_i is drawn from an independent multivariate Gaussian process:

x_i \sim \mathcal{N}(\mu_i, \Sigma_i),   (7)

where \mu_i and \Sigma_i are the mean and covariance matrix of the i-th feature window. The two hypotheses H_0 and H_1 can be denoted as follows:

H_0 : x_1, \ldots, x_N \sim \mathcal{N}(\mu, \Sigma)   (8)
H_1 : x_1, \ldots, x_i \sim \mathcal{N}(\mu_1, \Sigma_1)   (9)
      x_{i+1}, \ldots, x_N \sim \mathcal{N}(\mu_2, \Sigma_2)   (10)

Thus, hypothesis H_0 models the two sample windows with one Gaussian, while hypothesis H_1 models them with two Gaussians. Using Eq. (8), the maximum likelihood ratio statistic can be expressed as

R(i) = \frac{N}{2} \log|\Sigma| - \frac{N_1}{2} \log|\Sigma_1| - \frac{N_2}{2} \log|\Sigma_2|,   (11)

where N_1 = i, N_2 = N - i, the sample covariance \Sigma is computed from {x_1, ..., x_N}, \Sigma_1 from {x_1, ..., x_i}, and \Sigma_2 from {x_{i+1}, ..., x_N}. Finally, the BIC value between the two models is expressed as

\mathrm{BIC}(i) = R(i) - \lambda P,   (12)

where P is the penalty term [110] defined as

P = \frac{1}{2} \left( d + \frac{d(d+1)}{2} \right) \log N,   (13)

and d is the dimension of the features. The penalty weight \lambda is generally set to \lambda = 1.
A change point is detected when the following condition becomes true:

\max_i \mathrm{BIC}(i) > 0.   (14)

As described above, speaker change points can be detected by hypothesis testing based on BIC values or other measures such as KL2 [10] or GLR [109]. However, if speech segmentation is done by a speaker change point detection method, the length of each segment is not consistent. Therefore, after the advent of the i-vector [51] and DNN-based embeddings [61], segmentation based on speaker change point detection was mostly replaced by uniform segmentation [112, 113, 49], since varying segment lengths introduced additional variability into the speaker representations and deteriorated their fidelity. In uniform segmentation schemes, the given audio stream is segmented with a fixed window length and overlap length; thus, the unit length of the speaker diarization result remains fixed.

However, uniformly segmenting the input signal for diarization poses a potential problem: uniform segmentation introduces a trade-off related to the segment length. Segments need to be sufficiently short to safely assume that they do not contain multiple speakers, but at the same time long enough to capture sufficient acoustic information to extract a meaningful speaker representation x_j.
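As an illustration of Eqs. (7)-(14), the following sketch scans a feature sequence for the single best BIC change point; the margin and covariance regularization are simplifying assumptions.

```python
# A minimal sketch of BIC-based change point detection (cf. Eqs. (7)-(14)).
import numpy as np

def logdet_cov(x):
    """Log-determinant of the (regularized) sample covariance of x: (n, d)."""
    cov = np.cov(x, rowvar=False) + 1e-6 * np.eye(x.shape[1])
    return np.linalg.slogdet(cov)[1]

def bic_change_point(X, lam=1.0, margin=25):
    """Best change point in X: (N, d), or None if no BIC value exceeds 0."""
    N, d = X.shape
    P = 0.5 * (d + d * (d + 1) / 2) * np.log(N)       # penalty term, Eq. (13)
    full = N * logdet_cov(X)
    best_i, best_bic = None, 0.0
    # Keep well more samples than feature dimensions on each side.
    for i in range(margin, N - margin):
        R = 0.5 * (full - i * logdet_cov(X[:i])
                   - (N - i) * logdet_cov(X[i:]))     # Eq. (11)
        bic = R - lam * P                             # Eq. (12)
        if bic > best_bic:                            # Eq. (14): require BIC > 0
            best_i, best_bic = i, bic
    return best_i

# Toy example: two Gaussian regimes with a change at frame 100
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 13)), rng.normal(3, 2, (100, 13))])
print(bic_change_point(X))   # expected near 100
```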
2.4. Speaker Representations and Similarity Measurement

In this section, we explain a few popular methods for measuring the similarity of speech segments. These methods are paired with the clustering algorithms explained in the next section. We first introduce GMM-based hypothesis testing approaches, which are usually employed together with segmentation based on speaker change point detection. We then introduce well-known speaker representations for speaker diarization systems, usually employed with the uniform segmentation method, in Section 2.4.2 and Section 2.4.3.

2.4.1. GMM-Based Hypothesis Testing

Early speaker diarization systems were based on a GMM built on acoustic features such as MFCCs. Along with the GMM-based method, AHC was employed for clustering, resulting in speaker-homogeneous clusters. While there are many hypothesis testing methods for the speech segment clustering process, such as greedy BIC [110], GLR [114], and KL [115], the greedy BIC method was the most popular approach. Like speaker change point detection, greedy BIC employs the BIC value, but here the BIC value measures the similarity between two nodes during the AHC process. For given nodes to be clustered, S = {s_1, ..., s_k}, the greedy BIC method models each node s_i as a multivariate Gaussian distribution \mathcal{N}(\mu_i, \Sigma_i), where \mu_i and \Sigma_i are the mean and covariance matrix of the merged samples in node s_i. The BIC value for merging nodes s_1 and s_2 is calculated as

\mathrm{BIC} = \frac{n}{2} \log|\Sigma| - \frac{n_1}{2} \log|\Sigma_1| - \frac{n_2}{2} \log|\Sigma_2| - \lambda P,   (15)

where \lambda and P are identical to Eq. (12) and n is the sample size of the merged node (n = n_1 + n_2). During the clustering process, we merge the nodes if Eq. (15) is negative. GMM-based hypothesis testing with bottom-up hierarchical clustering was popularly used until i-vector and DNN-based speaker representations came to dominate the speaker diarization research scene.

2.4.2. Joint Factor Analysis and i-vector

Before the advent of speaker representations such as the i-vector [51] or x-vector [50], the Universal Background Model (UBM) [116] framework showed success on speaker recognition tasks by employing a large mixture of Gaussians covering a fairly large amount of speech data. The idea of modeling and testing the similarity of voice characteristics with GMM-UBM [116] was greatly improved by JFA [117, 118]. GMM-UBM-based hypothesis testing had the problem that Maximum a Posteriori (MAP) adaptation is affected not only by speaker-specific characteristics but also by nuisance factors such as channel and background noise; hence the supervector generated by the GMM-UBM method was not ideal. JFA tackles this problem and decomposes a supervector into speaker-independent, speaker-dependent, channel-dependent, and residual components. The ideal speaker supervector can thus be decomposed as in Eq. (16), where m denotes the speaker-independent component, V the speaker-dependent component matrix, U the channel-dependent component matrix, and D the speaker-dependent residual component matrix. Along with these component matrices, the vector y holds the speaker factors, x the channel factors, and z the speaker-specific residual factors; all of these vectors have a prior distribution of \mathcal{N}(0, I):

M(s) = m + Vy + Ux + Dz.   (16)

The JFA approach was further simplified by employing a so-called Total Variability matrix T, modeling both the channel and the speaker variability, and a vector w referred to as the "i-vector" [51]. The supervector M is modeled as

M = m + Tw,   (17)

where m is the session- and channel-independent component of the mean supervector. Similarly to JFA, w is assumed to follow a standard normal distribution and is calculated by MAP estimation [119]. The notion of speaker representation was popularized by i-vectors, where the speaker representation vector contains numerical features that characterize the vocal tract of each speaker.
I-vector speaker representations have been employed not only in speaker recognition studies but also in numerous speaker diarization studies [112, 120, 121], showing superior performance over GMM-based hypothesis testing methods.

2.4.3. Neural Network-Based Speaker Representations

Speaker representations for speaker diarization have also been heavily affected by the rise of neural networks and deep learning. The idea of representation learning was first introduced for face recognition tasks [122, 123]. The fundamental idea of neural network-based representations is to use a deep neural network to map the input source (an image or an audio clip) to a dense vector by sampling the activations of a layer in the neural network. Neural network-based representations require neither eigenvalue decomposition nor a factor analysis model involving a hand-crafted design of the intrinsic factors. Also, there is no Gaussianity assumption or requirement on the input data. Thus, the representation learning process has become more straightforward, and inference speed has also improved compared to the traditional factor-analysis-based methods.

Among the many neural network-based speaker representations, the d-vector [61] remains one of the most prominent speaker representation extraction frameworks. The d-vector model takes stacked filterbank features, including context frames, as input and trains multiple fully connected layers with the cross-entropy loss. The d-vector embeddings are obtained from the last fully connected layer, as in Fig. 2. The d-vector scheme appears in numerous speaker diarization papers, e.g., [49, 55].

Fig. 2: Diagram of the d-vector model.

DNN-based speaker representations were further improved by the x-vector [62, 50]. The x-vector showed superior performance, winning the NIST speaker recognition challenge [124] and the first DIHARD challenge [84]. Fig. 3 shows the structure of the x-vector framework. The time-delay architecture and the statistics pooling layer differentiate the x-vector architecture from the d-vector, with the statistics pooling layer mitigating the effect of the input length. This is especially advantageous for speaker diarization, since diarization systems are bound to process segments that are shorter than the regular window length.

Fig. 3: Diagram of the x-vector embedding extractor.
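The following is a minimal sketch of an x-vector-style architecture with TDNN (dilated 1-d convolution) layers and a statistics pooling layer; the layer sizes are illustrative assumptions and do not reproduce the exact configuration of [50].

```python
# A minimal sketch of an x-vector-style network: TDNN layers followed by
# statistics pooling; sizes are illustrative assumptions.
import torch
import torch.nn as nn

class XVectorSketch(nn.Module):
    def __init__(self, n_feat=30, n_speakers=1000, emb_dim=512):
        super().__init__()
        self.tdnn = nn.Sequential(
            nn.Conv1d(n_feat, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),
        )
        self.segment = nn.Linear(2 * 1500, emb_dim)   # takes [mean; std]
        self.classifier = nn.Linear(emb_dim, n_speakers)

    def forward(self, x):                  # x: (batch, n_feat, frames)
        h = self.tdnn(x)                   # frame-level activations
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)  # stats pooling
        emb = self.segment(stats)          # the embedding layer ("x-vector")
        return emb, self.classifier(emb)   # trained with cross-entropy over speakers

emb, logits = XVectorSketch()(torch.rand(4, 30, 200))
```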
For speaker diarization tasks, Probabilistic Linear Discriminant Analysis (PLDA) has frequently been used along with x-vectors or i-vectors to measure the affinity between two speech segments. PLDA models a given speaker representation \phi_{ij} of the i-th speaker and j-th session as

\phi_{ij} = \mu + F h_i + G w_{ij} + \epsilon_{ij},   (18)

where \mu is the mean vector, F is the speaker variability matrix, G is the channel variability matrix, and \epsilon_{ij} is the residual component. The terms h_i and w_{ij} are latent variables for F and G, respectively. During the training of PLDA, \mu, \Sigma, F, and G are estimated with the expectation-maximization (EM) algorithm, where \Sigma is the covariance matrix of the residual. Based on the estimated variability matrices and the latent variables h_i and w_{ij}, two hypotheses are tested: hypothesis H_1 that the two samples are from the same speaker, and hypothesis H_0 that the two samples are from different speakers. Hypothesis H_1 can be written as

\begin{bmatrix} \phi_1 \\ \phi_2 \end{bmatrix} = \begin{bmatrix} \mu \\ \mu \end{bmatrix} + \begin{bmatrix} F & G & 0 \\ F & 0 & G \end{bmatrix} \begin{bmatrix} h \\ w_1 \\ w_2 \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \end{bmatrix}.   (19)
On the other hand, hypothesis H_0 can be modeled as

\begin{bmatrix} \phi_1 \\ \phi_2 \end{bmatrix} = \begin{bmatrix} \mu \\ \mu \end{bmatrix} + \begin{bmatrix} F & G & 0 & 0 \\ 0 & 0 & F & G \end{bmatrix} \begin{bmatrix} h_1 \\ w_1 \\ h_2 \\ w_2 \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \end{bmatrix}.   (20)

The PLDA model projects the given speaker representations onto the subspace F, where same-speaker representations co-vary the most, while de-emphasizing the subspace G pertaining to channel variability. Using the above hypotheses, we can calculate a log-likelihood ratio:

s(\phi_1, \phi_2) = \log p(\phi_1, \phi_2 \mid H_1) - \log p(\phi_1, \phi_2 \mid H_0).   (21)

Ideally, the stopping criterion (clustering threshold) on this score should be 0, but in practice it varies around zero and needs to be tuned on a development set. The stopping criterion largely affects the estimated number of speakers, because the clustering process stops when the distance between the closest clusters reaches the threshold, and the number of speakers is determined by the number of clusters remaining at the step where clustering is stopped.
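Under the generative model of Eqs. (18)-(20), a pair of representations is jointly Gaussian under each hypothesis, so the score of Eq. (21) can be computed in closed form. The sketch below assumes F, G, and the residual covariance are already trained (random toy values stand in for them here).

```python
# A minimal sketch of the PLDA log-likelihood ratio of Eq. (21).
import numpy as np
from scipy.stats import multivariate_normal

def plda_llr(phi1, phi2, mu, F, G, Sigma_eps):
    d = len(mu)
    ac = F @ F.T                                  # across-observation coupling
    tot = ac + G @ G.T + Sigma_eps                # total covariance of one phi
    stacked = np.concatenate([phi1, phi2])
    mean = np.concatenate([mu, mu])
    # H1 (same speaker): the shared h couples the two observations
    C1 = np.block([[tot, ac], [ac, tot]])
    # H0 (different speakers): the two observations are independent
    C0 = np.block([[tot, np.zeros((d, d))], [np.zeros((d, d)), tot]])
    return (multivariate_normal.logpdf(stacked, mean, C1)
            - multivariate_normal.logpdf(stacked, mean, C0))

rng = np.random.default_rng(0)
d = 8
F, G = rng.normal(size=(d, 4)), rng.normal(size=(d, 4))
score = plda_llr(rng.normal(size=d), rng.normal(size=d),
                 np.zeros(d), F, G, np.eye(d))
```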
2.5. Clustering

After generating the speaker representations for each segment, a clustering algorithm is applied to form clusters of segments. We introduce the clustering methods most commonly used for the speaker diarization task.

2.5.1. Mean-Shift Clustering

Mean-shift [125] is a clustering algorithm that iteratively assigns data points to clusters by finding the modes of a non-parametric distribution. The mean-shift algorithm follows these steps:

1. Start with each data point assigned to a cluster of its own.
2. Compute the mean of each group.
3. Shift the search window to the new mean.
4. Repeat the process until convergence.

The mean-shift clustering algorithm has been applied to the speaker diarization task with the KL distance [126], with i-vectors and the cosine distance [112, 127], and with i-vectors and PLDA [128]. The advantage of the mean-shift algorithm is that, unlike k-means, it does not require the number of clusters in advance. This is a significant advantage in speaker diarization, where the number of speakers is unknown in most applications.
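A minimal usage sketch with scikit-learn's MeanShift is shown below; the 2-d toy points stand in for real segment embeddings, and the bandwidth value is an illustrative assumption.

```python
# Mean-shift grouping of segment embeddings; the number of clusters
# (speakers) is inferred rather than given.
import numpy as np
from sklearn.cluster import MeanShift

rng = np.random.default_rng(0)
# Two synthetic "speakers": clouds of segment embeddings around two centers
emb = np.vstack([rng.normal(0.0, 0.3, (20, 2)),
                 rng.normal(3.0, 0.3, (25, 2))])
labels = MeanShift(bandwidth=1.0).fit_predict(emb)
print(len(set(labels)))   # estimated number of speakers
```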
2.5.2. Agglomerative Hierarchical Clustering

AHC is a clustering method that has been constantly employed in many speaker diarization systems with a number of different distance metrics, such as BIC [110, 129], KL [115], and PLDA [84, 90, 130]. AHC is an iterative process of merging the existing clusters until the clustering process meets a stopping criterion. The AHC process starts by calculating the similarity between the N initial singleton clusters. At each step, the pair of clusters with the highest similarity is merged. The iterative merging process of AHC produces a dendrogram, as depicted in Fig. 4.

Fig. 4: Agglomerative hierarchical clustering.

One of the most important aspects of AHC is the stopping criterion. For the speaker diarization task, the AHC process can be stopped using either a similarity threshold or a target number of clusters. Ideally, if PLDA is employed as the distance metric, the AHC process should be stopped at s(\phi_1, \phi_2) = 0.

2.5.3. Spectral Clustering

Spectral clustering is another popular clustering approach for speaker diarization. While there are many variations, spectral clustering generally involves the following steps (see Fig. 5):

Fig. 5: General steps of spectral clustering.

i. Affinity matrix calculation: There are many ways to generate an affinity matrix A, depending on how the raw affinity value d is processed. It can be passed through a kernel such as exp(-d^2/\sigma^2), where \sigma is a scaling parameter, or masked by zeroing the values below a threshold so that only the prominent values are kept.

ii. Laplacian matrix calculation [131]: The graph Laplacian can be computed in two forms, normalized and unnormalized. The degree matrix D contains the diagonal elements d_i = \sum_{j=1}^{n} a_{ij}, where a_{ij} is the element of the i-th row and j-th column of the affinity matrix A.
(a) Normalized graph Laplacian: L = D^{-1/2} A D^{-1/2}.   (22)
(b) Unnormalized graph Laplacian: L = D - A.   (23)

iii. Eigendecomposition: The graph Laplacian matrix L is decomposed into the eigenvector matrix X and the diagonal matrix \Lambda of eigenvalues, i.e., L = X \Lambda X^T.

iv. Re-normalization (optional): The rows of X are normalized so that y_{ij} = x_{ij} / (\sum_j x_{ij}^2)^{1/2}, where x_{ij} and y_{ij} are the elements of the i-th row and j-th column of the matrices X and Y, respectively.

v. Speaker counting: The number of speakers is estimated by finding the maximum eigengap.

vi. k-means clustering: The k smallest eigenvalues \lambda_1, ..., \lambda_k and the corresponding eigenvectors v_1, ..., v_k are used to form the matrix U of spectral embeddings; the row vectors u_1, ..., u_n of U are then clustered by the k-means algorithm.

Among the many variations of the spectral clustering algorithm, the Ng-Jordan-Weiss (NJW) algorithm [132] is often employed for the speaker diarization task. The NJW algorithm employs the kernel exp(-d^2/\sigma^2), where d is a raw distance, for calculating the affinity matrix, which is then used to compute a normalized graph Laplacian. In addition, the NJW algorithm involves re-normalization before the k-means clustering step. The speaker diarization system in [133] employed the NJW algorithm, choosing \sigma using a predefined scalar value \beta and variance values from the data points, whereas the speaker diarization system in [134] did not use the \beta value. On the other hand, the speaker diarization system in [52] applied a series of refinement operations to the raw affinity values, including a diffusion process Y = XX^T and row-wise max normalization (Y_{ij} = X_{ij} / \max_k X_{ik}). In the spectral clustering approach of [135], similarity values calculated by a neural network model were used without any kernel, and the unnormalized graph Laplacian was employed to perform spectral clustering. More recently, an auto-tuning spectral clustering method was proposed for the speaker diarization task [136], which does not require parameter tuning on a separate development set. The work in [136] employs a binarized affinity matrix with row-wise count p, where the binarization parameter p is selected by choosing the minimum value of r(p) = p / g_p, with g_p the maximum eigengap from the unnormalized graph Laplacian matrix. Thus, r(p) represents how clear the clusters are for a given value of p, and p can be selected automatically to perform spectral clustering without tuning on held-out data.
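The steps above can be condensed into a short sketch using the unnormalized Laplacian and the maximum-eigengap speaker counting of step v; the kernel parameter and the cap on the number of speakers are illustrative assumptions.

```python
# A minimal sketch of eigengap-based spectral clustering for diarization
# (steps i-vi): kernelized affinity, unnormalized Laplacian, speaker
# counting by maximum eigengap, then k-means on the spectral embeddings.
import numpy as np
from scipy.spatial.distance import squareform, pdist
from sklearn.cluster import KMeans

def spectral_diarization(emb, sigma=1.0, max_speakers=8):
    """emb: (n_segments, dim) speaker embeddings -> cluster labels."""
    A = np.exp(-squareform(pdist(emb)) ** 2 / sigma ** 2)  # step i: affinity
    np.fill_diagonal(A, 0.0)
    L = np.diag(A.sum(axis=1)) - A                         # step ii: L = D - A
    vals, vecs = np.linalg.eigh(L)                         # step iii (ascending)
    gaps = np.diff(vals[:max_speakers + 1])
    k = int(np.argmax(gaps)) + 1                           # step v: max eigengap
    U = vecs[:, :k]                                        # step vi: embeddings
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)

rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(c, 0.2, (15, 16)) for c in (0.0, 2.0, 4.0)])
print(spectral_diarization(emb))   # should recover 3 speakers
```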
2.6. Post-Processing

Resegmentation is a process to refine the speaker boundaries roughly estimated by the clustering procedure. In [137], a Viterbi resegmentation method based on the Baum-Welch algorithm was introduced: estimation of a Gaussian mixture model for each speaker and Viterbi resegmentation using the estimated speaker GMMs are applied alternately.

Later, a method representing the diarization process with a Variational Bayesian Hidden Markov Model (VB-HMM) was proposed and shown to be superior to Viterbi resegmentation [138, 139, 140]. In the VB-HMM framework, the speech features X = (x_t | t = 1, ..., T) are assumed to be generated from an HMM in which each state corresponds to one of K possible speakers. Given M HMM states, an M-dimensional variable Z = (z_t | t = 1, ..., T) is introduced, where the k-th element of z_t is 1 if the k-th speaker is speaking at time index t, and 0 otherwise. At the same time, the distribution of x_t is modeled via a hidden variable Y = {y_k | k = 1, ..., K}, where y_k is a low-dimensional vector for the k-th speaker. With this notation, the joint probability of X, Y, and Z is decomposed as

P(X, Z, Y) = P(X | Z, Y) P(Z) P(Y),   (24)

where P(X | Z, Y) is the emission probability, modeled by a GMM whose mean vectors are represented by Y; P(Z) is the transition probability of the HMM; and P(Y) is the prior distribution of Y. Because Z represents the trajectory of the speakers, the diarization problem can be expressed as the inference of the Z that maximizes the posterior distribution P(Z | X) = \int P(Z, Y | X) dY. Since this problem is intractable to solve directly, the Variational Bayes method is used to estimate the model parameters that approximate P(Z, Y | X) [139, 141]. The VB-HMM framework was originally designed as a standalone diarization framework; however, it requires parameter initialization to start the VB estimation, and the parameters are usually initialized from the result of speaker clustering. In that context, VB-HMM can be seen as a resegmentation method, and it is widely used as the final step of speaker diarization (e.g., [142, 113]).

As another direction of post-processing, there has been a series of studies on fusing multiple diarization results to improve diarization accuracy.
While it is widely known that system combination generally yields better results for various systems (e.g., speech recognition [143] or speaker recognition [144]), combining multiple diarization hypotheses poses several unique problems. Firstly, speaker labeling is not standardized across different diarization systems. Secondly, the estimated number of speakers may differ between systems. Finally, the estimated time boundaries may also differ across systems. System combination methods for speaker diarization need to handle these problems during the fusion of multiple hypotheses.

In [145], a method to select the best diarization result among multiple diarization systems was proposed. In this method, AHC is applied to the set of diarization results, where the distance between two diarization results is measured by a symmetric DER. AHC is executed until the number of groups becomes two, and the diarization result in the biggest group that has the smallest distance to all other results is selected as the final diarization result. In [146], two diarization systems are combined by finding a matching between the two sets of speaker clusters and then performing resegmentation based on the matching result.

More recently, the DOVER (diarization output voting error reduction) method [147] was proposed to combine multiple diarization results with a voting scheme. In the DOVER method, the speaker labels of the different diarization systems are aligned one by one so as to minimize the DER between the hypotheses (processes 2 and 3 in Fig. 6). After all hypotheses are aligned, each system votes with its speaker label for each segmented region (each system may have a different voting weight), and the speaker label that gains the highest voting weight is selected for each segmented region (process 4 in Fig. 6). In case multiple speaker labels receive the same voting weight, a heuristic is used to break the tie (such as selecting the result from the first system).

Fig. 6: Example of the DOVER scheme.

The DOVER method has the implicit assumption that there is no overlapping speech, i.e., at most one speaker is assigned to each time index. To combine diarization hypotheses with overlapping speakers, two methods were recently proposed. In [104], the authors proposed a modified DOVER method, where the speaker labels of the different diarization results are first aligned with a root hypothesis, and the speech activity of each speaker is estimated based on the weighted voting score for each speaker in each small segment. Raj et al. [148] proposed a method called DOVER-Lap, in which the speakers of multiple hypotheses are aligned by a weighted k-partite graph matching, and the number of speakers K for each small segment is estimated based on the weighted average over the systems, so as to select the top-K voted speaker labels. Both the modified DOVER and DOVER-Lap showed DER improvements for speaker diarization results with speaker overlaps.
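The voting stage of DOVER can be illustrated with a toy example. The sketch below assumes the speaker labels of all hypotheses have already been aligned to a common label set and that each frame has exactly one label per system; the alignment step itself is omitted.

```python
# A toy illustration of DOVER's frame-level voting stage.
import numpy as np

def vote(hypotheses, weights):
    """hypotheses: (n_systems, n_frames) aligned integer speaker labels."""
    n_sys, n_frames = hypotheses.shape
    n_spk = hypotheses.max() + 1
    out = np.zeros(n_frames, dtype=int)
    for t in range(n_frames):
        score = np.zeros(n_spk)
        for s in range(n_sys):
            score[hypotheses[s, t]] += weights[s]   # each system casts its vote
        out[t] = int(np.argmax(score))  # argmax breaks ties toward lower label
    return out

hyps = np.array([[0, 0, 1, 1, 1],
                 [0, 0, 0, 1, 1],
                 [0, 1, 1, 1, 0]])
print(vote(hyps, weights=[1.0, 1.0, 0.5]))   # -> [0 0 1 1 1]
```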
3. Recent Advances in Speaker Diarization using Deep Learning
This section introduces various recent efforts toward deep-learning-based speaker diarization techniques. Firstly, methods that incorporate deep learning into a single component of speaker diarization, such as clustering or post-processing, are introduced in Section 3.1. Then, methods that unify several components of speaker diarization into a single neural network are introduced in Section 3.2.

3.1. Single-Module Optimization

Several methods have been proposed to enhance speaker clustering with deep learning. A deep-learning-based clustering algorithm, called Improved Deep Embedded Clustering (IDEC), was proposed in [149]. The goal is to transform the input features, herein speaker embeddings, to become more separable, given the number of clusters/speakers. The key idea is that each embedding has a probability of "belonging" to each of the available speaker clusters [150, 64]:

q_{ij} = \frac{(1 + \|z_i - \mu_j\|^2 / \alpha)^{-\frac{\alpha+1}{2}}}{\sum_l (1 + \|z_i - \mu_l\|^2 / \alpha)^{-\frac{\alpha+1}{2}}}, \quad p_{ij} = \frac{q_{ij}^2 / f_j}{\sum_l q_{il}^2 / f_l},   (25)

where z_i are the bottleneck features, \mu_j is the centroid of the j-th cluster, and f_j = \sum_i q_{ij} is the soft cluster frequency. The clusters are iteratively refined toward the target distribution p [150], with the bottleneck features estimated by an autoencoder.

The initial DEC approach presented some problems. As such, improved versions of the algorithm have been proposed, in which the possibility of trivial (empty) clusters is addressed under the assumption that the distribution of speaker turns is uniform across all speakers, i.e., that all speakers contribute equally to the session. This assumption is not realistic in real meeting environments, but it constrains the solution space enough to avoid the empty clusters without affecting overall performance. An additional loss term penalizes the distance of the bottleneck features from the centroids \mu_j, bringing the behavior of the algorithm closer to k-means [149].

Based on these improvements, the loss function of the revisited DEC algorithm consists of several components: the clustering error L_c, the autoencoder reconstruction loss L_r, the uniform "speaker air-time" distribution constraint L_u, and the distance of the bottleneck features from the centroids L_{MSE} [149]:

L = \alpha L_c + \beta L_r + \gamma L_u + \delta L_{MSE},   (26)

allowing the different loss terms to be weighted differently; the weights \alpha, \beta, \gamma, and \delta can be fine-tuned on held-out data.

In [65], a different approach was proposed that purifies the similarity matrix for spectral clustering with a graph neural network (GNN) (Fig. 7). Given a sequence of speaker embeddings {e_1, ..., e_N}, where N is the length of the sequence, the first layer of the GNN takes the input {x_i^{(0)} = e_i | i = 1, ..., N}. The GNN then computes the output of the p-th layer {x_i^{(p)} | i = 1, ..., N} as

x_i^{(p)} = \sigma\!\left( W^{(p)} \sum_j \tilde{L}_{i,j} \, x_j^{(p-1)} \right),   (27)

where \tilde{L} represents a normalized affinity matrix with added self-connections, W^{(p)} is a trainable weight matrix for the p-th layer, and \sigma(\cdot) is a nonlinear function. The GNN was optimized by minimizing the distance between the reference affinity matrix and the estimated affinity matrix, where the distance was calculated as a combination of a histogram loss [151] and the nuclear norm.

Fig. 7: Speaker diarization with a graph neural network.
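As a concrete illustration of the propagation rule in Eq. (27), the sketch below applies one GNN layer to a toy affinity matrix; the symmetric normalization and the ReLU nonlinearity are common choices assumed here, not details specified by [65].

```python
# A minimal sketch of the graph-propagation layer of Eq. (27): each layer
# mixes the embeddings of neighboring segments through the normalized
# affinity matrix with self-connections; sizes are illustrative assumptions.
import numpy as np

def normalized_affinity(A):
    """Symmetrically normalize A + I, as in common GNN formulations."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gnn_layer(L_tilde, X, W):
    """Eq. (27) in matrix form: sigma(L X W), with sigma = ReLU."""
    return np.maximum(L_tilde @ X @ W, 0.0)

rng = np.random.default_rng(0)
N, d_in, d_out = 12, 32, 16          # 12 segment embeddings
A = np.abs(rng.normal(size=(N, N)))
A = (A + A.T) / 2                    # symmetric raw affinity
X = rng.normal(size=(N, d_in))
H = gnn_layer(normalized_affinity(A), X, rng.normal(size=(d_in, d_out)))
```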
There are also several different approaches to generating the affinity matrix. In [152], a self-attention-based network was introduced to directly generate a similarity matrix from a sequence of speaker embeddings. In [153], several affinity matrices with different temporal resolutions were fused into a single affinity matrix by a neural network.

Data-driven techniques perform remarkably well on a wide variety of tasks [154]. However, traditional DL architectures may fail when the problem involves relational information between observations [155]. Recently, Relational Recurrent Neural Networks (RRNNs) were introduced [155, 156, 157] to solve this "relational information learning" task. Speaker diarization can be seen as a member of this class of tasks, since the final decision depends on the distance relations between speech segments and speaker profiles or centroids.

The challenges of audio segmentation are detailed in Section 2.3. Further, speaker embeddings are usually extracted from a network trained to distinguish speakers among thousands of candidates [50]. However, a different level of granularity in the speaker space is required, since only a small number of participants is typically involved in an interactive meeting scenario. In addition, the distance metric used is often heuristic and/or dependent on certain assumptions that do not necessarily hold, e.g., the Gaussianity assumed by PLDA [158]. Finally, the audio chunks are treated independently, and any temporal information about the past and future is simply ignored. Most of these issues can be addressed with the RRNNs of [159], where a data-driven, memory-based approach bridges the performance gap between the heuristic and the trainable distance estimation approaches. RRNNs have shown great success on several problems requiring relational reasoning [156, 155, 159], specifically using the Relational Memory Core (RMC) [155].

In this context, a novel approach to learning the distance between such centroids (or speaker profiles) and the embeddings was proposed in [159] (Fig. 8). The diarization process can be seen as a classification task on already segmented audio (Section 2.3), where the audio signal is first segmented either uniformly [160] or based on estimated speaker change points [161]. As these segments are assumed to be speaker-homogeneous, a speaker embedding x_j is extracted for each segment and compared against all available speaker profiles or speaker centroids. By minimizing a particular distance metric, the most suitable speaker label is assigned to the segment. The final decision relies on a distance estimate, either the cosine [51] or the PLDA [158] distance, or the RRNN-based distance proposed in [159]; the latter memory-network-based method has shown consistent improvements in performance.

Fig. 8: Continuous speaker identification system based on the RMC. The speech signal is segmented uniformly and each segment x_t is compared against all the available speaker profiles according to a distance metric d(·,·). The speaker label s_{t,j} minimizing this metric is assigned to each x_t.

There have been a few recent studies that train a neural network applied on top of the result of a clustering-based speaker diarization; these methods can be categorized as extensions of post-processing.

Medennikov et al. proposed Target-Speaker Voice Activity Detection (TS-VAD) to achieve accurate speaker diarization even in noisy conditions with many speaker overlaps [91, 66]. As shown in Fig. 9, TS-VAD takes as input acoustic features (MFCCs) as well as the i-vectors of all target speakers.

Fig. 9: Target-speaker voice activity detection (TS-VAD).
The model has an output layer whose i-th element becomes 1 at time frame t if the i-th speaker is speaking at that frame, and 0 otherwise. To convert the raw output into a sequence of segments, further post-processing is used, based either on heuristics (median filtering, binarization with a threshold, etc.) or on HMM-based decoding with states representing silence, non-overlapping speech of each speaker, and overlapping speech from all possible pairs of speakers. Prior to inference, TS-VAD requires the i-vectors of all target speakers. The i-vectors are initialized from a conventional clustering-based speaker diarization result; after initialization, inference by TS-VAD and refinement of the i-vectors based on the TS-VAD result can be repeated until convergence. TS-VAD showed significantly better DER than the conventional clustering-based approach [91, 88]. On the other hand, it has the constraint that the maximum number of speakers the model can handle is limited by the number of elements in the output layer.

As a different approach, Horiguchi et al. proposed applying the EEND model (detailed in Section 3.2.4) to refine the result of a clustering-based speaker diarization [162]. A clustering-based speaker diarization method can handle a large number of speakers but is not good at handling overlapped speech; EEND has the opposite characteristics. To use the two methods complementarily, they first apply a conventional clustering method, and the two-speaker EEND model is then iteratively applied to each pair of detected speakers to refine the time boundaries of overlapped regions.

3.2. Joint Optimization

3.2.1. Joint Segmentation and Clustering

A model called the Unbounded Interleaved-State Recurrent Neural Network (UIS-RNN) was proposed that replaces the segmentation and clustering procedures with a single trainable model [55]. Given the input sequence of embeddings X = (x_t ∈ R^d | t = 1, ..., T), UIS-RNN generates the diarization result Y = (y_t ∈ N | t = 1, ..., T) as a sequence of speaker indices, one per time frame. The joint probability of X and Y can be decomposed by the chain rule as

P(X, Y) = P(x_1, y_1) \prod_{t=2}^{T} P(x_t, y_t \mid x_{1:t-1}, y_{1:t-1}).   (28)

To model the distribution of speaker changes, UIS-RNN introduces a latent variable Z = (z_t ∈ {0, 1} | t = 2, ..., T), where z_t is 1 if the speaker indices at times t − 1 and t differ, and 0 otherwise. The joint probability including Z is then decomposed as

P(X, Y, Z) = P(x_1, y_1) \prod_{t=2}^{T} P(x_t, y_t, z_t \mid x_{1:t-1}, y_{1:t-1}, z_{1:t-1}).   (29)

Finally, the term P(x_t, y_t, z_t | x_{1:t-1}, y_{1:t-1}, z_{1:t-1}) is further decomposed into three components:

P(x_t, y_t, z_t \mid x_{1:t-1}, y_{1:t-1}, z_{1:t-1}) = P(x_t \mid x_{1:t-1}, y_t) \, P(y_t \mid z_t, y_{1:t-1}) \, P(z_t \mid z_{1:t-1}).   (30)

Here, P(x_t | x_{1:t-1}, y_t) represents the sequence generation probability, modeled by a gated recurrent unit (GRU)-based recurrent neural network; P(y_t | z_t, y_{1:t-1}) represents the speaker assignment probability, modeled by a distance-dependent Chinese restaurant process [163], which can model a distribution over an unbounded number of speakers; and P(z_t | z_{1:t-1}) represents the speaker change probability, modeled by a Bernoulli distribution.
Since all components are represented by trainable models, UIS-RNN can be trained in a supervised way by finding the parameters that maximize log P(X, Y, Z) over the training data. Inference is conducted by finding the Y that maximizes log P(X, Y) given X, using beam search in an online fashion. While UIS-RNN works online, it showed better DER than that of an offline system based on spectral clustering.
Fig. 10: (a) RPN for speaker diarization (feature map, speech activity detection, speaker embedding extraction, region refinement); (b) diarization procedure based on RPN (candidate regions, clustering, overlap removal).
3.2.2. Joint Segmentation, Embedding Extraction, and Re-Segmentation

A speaker diarization method based on Region Proposal Networks (RPN) was proposed to jointly perform the segmentation, speaker embedding extraction, and re-segmentation procedures with a single neural network [75]. The RPN was originally proposed to detect multiple objects in a 2-d image [164]; a 1-d variant of the RPN along the time axis is used for speaker diarization. The RPN works on Short-Time Fourier Transform (STFT) features: a neural network converts the STFT features into a feature map (Fig. 10 (a)). Then, for each candidate time region of speech activity, called an anchor, the neural network jointly performs three tasks: (i) estimating whether the anchor includes speech activity, (ii) extracting a speaker embedding corresponding to the anchor, and (iii) estimating the difference in duration and center position between the anchor and the reference speech activity. The first, second, and third tasks correspond to segmentation, speaker embedding extraction, and re-segmentation, respectively.

The inference procedure of the RPN is depicted in Fig. 10 (b). The RPN is first applied to every anchor in the test audio, and the regions with speech activity probability higher than a predetermined threshold are listed as candidate time regions. The estimated regions are then clustered by a conventional clustering method (e.g., k-means) based on the speaker embeddings corresponding to each region. Finally, a procedure called non-maximum suppression is applied to remove highly overlapping segments.

RPN-based speaker diarization has the advantage that it can handle overlapped speech with, in principle, any number of speakers. It is also much simpler than the conventional speaker diarization pipeline. On multiple datasets, the RPN-based speaker diarization system achieved significantly better DER than the conventional clustering-based speaker diarization system [75, 88].
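The non-maximum suppression step can be sketched for the 1-d (time interval) case as follows; the intersection-over-union threshold is an illustrative assumption.

```python
# A minimal sketch of 1-d non-maximum suppression over candidate speech
# regions, as used in the last step of RPN-based diarization.
import numpy as np

def nms_1d(segments, scores, iou_thresh=0.5):
    """segments: (n, 2) [start, end]; scores: (n,). Returns kept indices."""
    order = np.argsort(scores)[::-1]           # highest-scoring first
    kept = []
    for i in order:
        s_i, e_i = segments[i]
        ok = True
        for j in kept:                         # compare against kept segments
            s_j, e_j = segments[j]
            inter = max(0.0, min(e_i, e_j) - max(s_i, s_j))
            union = (e_i - s_i) + (e_j - s_j) - inter
            if inter / union > iou_thresh:     # too much overlap -> suppress
                ok = False
                break
        if ok:
            kept.append(i)
    return kept

segs = np.array([[0.0, 2.0], [0.2, 2.1], [3.0, 4.0]])
print(nms_1d(segs, np.array([0.9, 0.8, 0.7])))   # -> [0, 2]
```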
3.2.3. Joint Speech Separation and Diarization

There is also recent research on jointly performing speech separation and speaker diarization. Kounades-Bastian et al. [165, 166] proposed incorporating a speech activity model into speech separation based on a spatial covariance model with non-negative matrix factorization.
They derived an EM algorithm to estimate the separated speech and the speech activity of each speaker from multi-channel overlapped speech. While their method jointly performs speaker diarization and speech separation, it is based on statistical modeling, and estimation is conducted solely from the observation, i.e., without any model training.

Neumann et al. [76, 167] later proposed a trainable model, called the online Recurrent Selective Attention Network (online RSAN), for joint speech separation, speaker counting, and speaker diarization based on a single neural network (Fig. 11). The neural network takes as input a spectrogram X ∈ R^{T×F}, a speaker embedding e ∈ R^d, and a residual mask R ∈ R^{T×F}, where T and F are the maximum time index and the maximum frequency bin of the spectrogram, respectively. It outputs a speech mask M ∈ R^{T×F} and an updated speaker embedding for the speaker corresponding to e. The neural network is first applied with an R whose elements are all 1 and an e whose elements are all 0. After the first inference of M, R is updated as R ← max(R − M, 0). This procedure is repeated until the sum of R falls below a threshold. A separated speech signal can be obtained as M ⊙ X, where ⊙ is the element-wise multiplication. The speaker embedding is used to keep track of the speaker across adjacent blocks. Thanks to this iterative approach, the neural network can cope with a variable number of speakers while jointly performing speech separation and speaker diarization.

Fig. 11: Joint speech separation, speaker counting, and speaker diarization model.

3.2.4. Fully End-to-End Neural Diarization

Recently, a framework called End-to-End Neural Diarization (EEND) was proposed [56, 57], which performs the entire speaker diarization procedure with a single neural network. The architecture of EEND is shown in Fig. 12. The input to the EEND model is a T-length sequence of acoustic features (e.g., log mel filterbanks), X = (x_t ∈ R^F | t = 1, ..., T). A neural network then outputs the corresponding speaker label sequence Y = (y_t | t = 1, ..., T), where y_t = [y_{t,k} ∈ {0, 1} | k = 1, ..., K]. Here, y_{t,k} = 1 if speaker k is speaking at time frame t, and K is the maximum number of speakers that the neural network can output. Importantly, y_{t,k} and y_{t,k'} can both be 1 for different speakers k and k', which represents that the two speakers k and k' are speaking simultaneously (i.e., overlapping speech). The neural network is trained to maximize log P(Y | X) ∼ \sum_t \sum_k log P(y_{t,k} | X) over the training data, assuming conditional independence of the outputs y_{t,k}. Because there can be multiple candidates for the reference label Y obtained by permuting the speaker indices k, the loss function is calculated for all possible reference labels, and the reference label with the minimum loss is used for the error back-propagation; this is inspired by the permutation-free objective used in speech separation [59]. EEND was initially proposed with a bidirectional long short-term memory (BLSTM) network [56] and was soon extended to a self-attention-based network [57], showing state-of-the-art DER on the CALLHOME dataset (LDC2001S97) and the Corpus of Spontaneous Japanese [168].

Fig. 12: Two-speaker end-to-end neural diarization (EEND) model.

There are multiple advantages to EEND. Firstly, it can handle overlapping speech in a sound way.
There are multiple advantages of EEND. Firstly, it can handle overlapping speech in a principled way. Secondly, the network is directly optimized toward maximizing diarization accuracy, so high accuracy can be expected. Thirdly, it can be retrained on real data (i.e., not synthetic data) simply by feeding a reference diarization label, while this is often not straightforward for prior works. On the other hand, several limitations of EEND are also known. Firstly, the model architecture constrains the maximum number of speakers that the model can cope with. Secondly, EEND consists of BLSTM or self-attention neural networks, which makes online processing difficult. Thirdly, it was empirically suggested that EEND tends to overfit to the distribution of the training data [56].

Fig. 13: EEND with encoder-decoder-based attractor (EDA).

To cope with an unbounded number of speakers, several extensions of EEND have been investigated. Horiguchi et al. [169] proposed an extension of EEND with an encoder-decoder-based attractor (EDA) (Fig. 13). This method applies an LSTM-based encoder-decoder to the output of EEND to generate multiple attractors. Attractors are generated until the attractor existence probability becomes less than a threshold. Then, each attractor is multiplied with the embeddings generated by EEND to calculate the speech activity of each speaker (a schematic sketch of this inference loop is shown below). On the other hand, Fujita et al. [170] proposed another approach that outputs the speech activities one after another by using a conditional speaker chain rule. In this method, a neural network is trained to produce the posterior probability P(y_k | y_1, ..., y_{k−1}, X), where y_k = (y_{t,k} ∈ {0, 1} | t = 1, ..., T) is the speech activity of the k-th speaker. The joint speech activity probability of all speakers can then be estimated with the following speaker-wise conditional chain rule:

P(y_1, ..., y_K | X) = ∏_{k=1}^{K} P(y_k | y_1, ..., y_{k−1}, X).    (31)

During inference, the neural network is repeatedly applied until the speech activity y_k of the last estimated speaker approaches zero. Kinoshita et al. [171] proposed a different approach that combines EEND and speaker clustering. In their method, a neural network is trained to generate speaker embeddings as well as the speech activity probabilities. Speaker clustering, constrained by the speech activities estimated by EEND, is applied to align the estimated speakers across different processing blocks.

There have also been a few recent attempts to extend EEND for online processing. Xue et al. [172] proposed a method with a speaker tracing buffer to better align the speaker labels of adjacent processing blocks. Han et al. [173] proposed a block-online version of EDA-EEND [169] that carries over the hidden state of the LSTM encoder to generate attractors block by block.
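The EDA inference procedure can be summarized by the following sketch, where `decoder_step` and `existence_prob` are hypothetical stand-ins for the LSTM decoder step and the linear-plus-sigmoid existence estimator of [169]; this is an illustration of the idea, not the reference implementation.

```python
# Sketch of EDA-style attractor decoding: attractors are generated one by one
# until their existence probability falls below a threshold, and the speech
# activity of each speaker is the sigmoid of the embedding-attractor product.
import numpy as np

def eda_decode(embeddings, decoder_step, existence_prob, state,
               threshold=0.5, max_speakers=10):
    """embeddings: (T, D) frame embeddings produced by the EEND encoder."""
    attractors = []
    for _ in range(max_speakers):
        attractor, state = decoder_step(state)    # one LSTM decoder step
        if existence_prob(attractor) < threshold:
            break                                 # no further speakers remain
        attractors.append(attractor)
    if not attractors:
        return np.zeros((embeddings.shape[0], 0))
    A = np.stack(attractors, axis=1)              # (D, K)
    return 1.0 / (1.0 + np.exp(-embeddings @ A))  # (T, K) speech activities
```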
4. Speaker Diarization in the context of ASR
From a conventional perspective, speaker diarization is considered a pre-processing step for ASR. In the traditional system structure for speaker diarization, as depicted in Fig. 1, speech inputs are processed sequentially across the diarization components without considering the ASR objective, which corresponds to minimizing the word error rate (WER). One issue is that the tight boundaries of the speech segments produced by speaker diarization have a high chance of causing unexpected word truncation or deletion errors in ASR decoding. In this section, we discuss how speaker diarization systems have been developed in the context of ASR, not only resulting in better WER by preventing speaker diarization from hurting ASR performance, but also benefiting from ASR outputs to enhance diarization performance. More recently, a few pioneering proposals have been made for joint modeling of speaker diarization and ASR, which we introduce in this section as well.
The lexical information from ASR output has been employed for speaker diarization systems in a few different ways. The earliest approach, in the RT03 evaluation [1], used word boundary information for segmentation purposes. In [1], a general ASR system for broadcast news data was built whose basic components are segmentation, speaker clustering, speaker adaptation, and system combination after ASR decoding from two sub-systems with different adaptation methods. To understand the impact of word boundary information, the authors used ASR outputs to replace the segmentation part and compared the diarization performance of each system. In addition, ASR results were also used for refining SAD in IBM's submission [174] to the RT07 evaluation. The system in [174] incorporates word alignments from a speaker-independent ASR module and refines the SAD result to reduce false alarms, so that the speaker diarization system can achieve better clustering quality. The segmentation system in [175] also takes advantage of word alignments from ASR. The authors in [175] focused on the word-breakage problem, in which words from the ASR output are truncated by the segmentation results because the segmentation result and the decoded word sequence are not aligned. The word-breakage (WB) ratio was therefore proposed to measure the rate of change-points detected inside intervals corresponding to words. The DER and WB were reported together to measure the influence of the word truncation problem. While the aforementioned early works on speaker diarization systems leveraging ASR output focused on word alignment information to refine the SAD or segmentation results, the speaker diarization system in [176] created a dictionary of phrases that commonly appear in broadcast news. The phrases in this dictionary provide the identity of who is speaking, who will speak, and who spoke in the broadcast news scenario. For example, "This is [name]" indicates who was the speaker of the broadcast news section. Although the early speaker diarization studies did not fully leverage lexical information to drastically improve DER, the idea of integrating information from the ASR output has been employed by many studies to refine or improve the speaker diarization output.
Fig. 14: Integration of lexical information and acoustic information.

More recent speaker diarization systems that take advantage of the ASR transcript have employed DNN models to capture the linguistic patterns in the given ASR output and enhance the speaker diarization result. The authors in [177] proposed a way of using linguistic information for a speaker diarization task in which participants have distinct roles that are known to the speaker diarization system. Fig. 14 shows the diagram of the speaker diarization system presented in [177]. In this system, a neural text-based speaker change detector and a text-based role recognizer are employed. By employing both linguistic and acoustic information, DER was significantly improved compared to the acoustic-only system.

Lexical information from the ASR output was also utilized for speaker segmentation [178] by employing a sequence-to-sequence model that outputs speaker turn tokens, as sketched below. Based on the estimated speaker turns, the input utterance is segmented accordingly. The experimental results in [178] show that using both acoustic and lexical information yields an extra advantage owing to the word boundaries obtained from the ASR output.
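As a toy illustration of such a training target, the snippet below inserts a turn token at every speaker change in a word sequence; the token name "<turn>" is a placeholder and not the notation of [178].

```python
# Build a sequence-to-sequence training target in which a speaker-turn token
# marks every speaker change in the word stream.
def insert_turn_tokens(words, speakers, turn_token="<turn>"):
    """words: list of words; speakers: parallel list of speaker labels."""
    target = []
    for i, word in enumerate(words):
        if i > 0 and speakers[i] != speakers[i - 1]:
            target.append(turn_token)  # speaker changed between word i-1 and i
        target.append(word)
    return target

# insert_turn_tokens(["hi", "how", "are", "you", "fine"],
#                    ["A", "A", "A", "A", "B"])
# -> ["hi", "how", "are", "you", "<turn>", "fine"]
```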
Fig. 15: Integration of lexical information and acoustic information.

[179] presented follow-up research within the above thread. Unlike the system in [178], lexical information from the ASR module was integrated into the speech segment clustering process by employing an integrated adjacency matrix. The adjacency matrix is obtained by a max operation between an acoustic affinity matrix, created from affinities among audio segments, and a lexical affinity matrix, created by segmenting the word sequence into word chunks that are likely to be spoken by the same speaker. Fig. 15 shows a diagram that explains how lexical information is integrated into an affinity matrix together with acoustic information. The integrated adjacency matrix leads to improved speaker diarization performance on the CALLHOME American English dataset.
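The fusion step itself reduces to an element-wise max over two segment-by-segment affinity matrices, as in the toy sketch below (matrix names are illustrative); the fused matrix can then be passed to a clustering algorithm such as spectral clustering.

```python
# Toy sketch of the affinity fusion in [179]: the integrated adjacency matrix
# is the element-wise maximum of an acoustic and a lexical affinity matrix
# defined over the same N speech segments.
import numpy as np

def fuse_affinities(acoustic_affinity: np.ndarray,
                    lexical_affinity: np.ndarray) -> np.ndarray:
    """Both inputs: (N, N) affinity matrices with entries in [0, 1]."""
    assert acoustic_affinity.shape == lexical_affinity.shape
    return np.maximum(acoustic_affinity, lexical_affinity)
```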
Motivated by the recent success of deep learning and end-to-end modeling, several models have been proposed to jointly perform ASR and speaker diarization. As discussed above, ASR results contain a strong cue to improve speaker diarization. Conversely, speaker diarization results can be used to improve ASR accuracy, for example, by adapting the ASR model towards each estimated speaker. Joint modeling can leverage such inter-dependency to improve both ASR and speaker diarization. For evaluation, a word error rate (WER) metric that is affected by both ASR errors and speaker attribution errors, such as speaker-attributed WER [180] or cpWER [89], is often used. ASR-specific metrics (e.g., speaker-agnostic WER) or diarization-specific metrics (e.g., DER) are also used complementarily.

Fig. 16: Joint ASR and diarization by inserting a speaker tag in the transcription.

A first line of approaches introduces a speaker tag in the transcription of end-to-end ASR models (Fig. 16). Shafey et al. [77] proposed to insert a speaker role tag (e.g., <doctor> and <patient>) in the output of a recurrent neural network-transducer (RNN-T)-based ASR system. Similarly, Mao et al. [78] proposed to insert a speaker identity tag in the output of an attention-based encoder-decoder ASR system. These methods have been shown to perform both ASR and speaker diarization in their experiments. On the other hand, the speaker roles or speaker identity tags need to be determined and fixed during training, so it is difficult to cope with an arbitrary number of speakers with this approach.

Fig. 17: Joint decoding framework for ASR and speaker diarization.

A second approach is a MAP-based joint decoding framework. Kanda et al. [79] formulated the joint decoding of ASR and speaker diarization as follows (see also Fig. 17). Assume that a sequence of observations is represented by X = {X_1, ..., X_U}, where U is the number of segments (e.g., generated by applying VAD to a long audio) and X_u is the acoustic feature sequence of the u-th segment. Further assume that the word hypotheses with time boundary information are represented by W = {W_1, ..., W_U}, where W_u is the speech recognition hypothesis corresponding to segment u. Here, W_u = (W_{1,u}, ..., W_{K,u}) contains all speakers' hypotheses in segment u, where K is the number of speakers and W_{k,u} represents the speech recognition hypothesis of the k-th speaker in segment u. Finally, a tuple of speaker embeddings E = (e_1, ..., e_K), where e_k ∈ R^d is the d-dimensional speaker embedding of the k-th speaker, is also assumed. With these notations, the joint decoding framework of multi-speaker ASR and diarization can be formulated as the problem of finding the most likely Ŵ as

Ŵ = argmax_W P(W | X)    (32)
  = argmax_W { Σ_E P(W, E | X) }    (33)
  ≈ argmax_W { max_E P(W, E | X) },    (34)

where the Viterbi approximation is used to obtain the final equation. This maximization problem is further decomposed into two iterative problems:

Ŵ^(i) = argmax_W P(W | Ê^(i−1), X),    (35)
Ê^(i) = argmax_E P(E | Ŵ^(i), X),    (36)

where i is the iteration index of the procedure. In [79], Eq. (35) is modeled by target-speaker ASR [181, 182, 183, 71] and Eq. (36) is modeled by overlap-aware speaker embedding estimation. This method showed a speaker-attributed WER similar to that of target-speaker ASR with oracle speaker embeddings. On the other hand, it requires iterative application of the target-speaker ASR and the speaker embedding extraction, which makes it challenging to apply the method in online mode.
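Schematically, the iteration of Eqs. (35) and (36) can be written as the following loop, where `target_speaker_asr` and `estimate_embeddings` are hypothetical stand-ins for the two models used in [79]:

```python
# Sketch of the iterative MAP decoding of Eqs. (35)-(36): alternate between
# decoding word hypotheses given speaker embeddings and re-estimating the
# speaker embeddings given the current hypotheses.
def joint_decode(X, E_init, target_speaker_asr, estimate_embeddings,
                 num_iterations=3):
    """X: per-segment acoustic features; E_init: initial speaker embeddings."""
    E = E_init
    W = None
    for _ in range(num_iterations):
        W = target_speaker_asr(X, E)    # Eq. (35): hypotheses given embeddings
        E = estimate_embeddings(X, W)   # Eq. (36): embeddings given hypotheses
    return W, E
```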
Fig. 18: End-to-end speaker-attributed ASR.

As a third line of approaches, the End-to-End (E2E) Speaker-Attributed ASR (SA-ASR) model was recently proposed to jointly perform speaker counting, multi-speaker ASR, and speaker identification [184, 185]. Different from the first two approaches, the E2E SA-ASR model takes the additional input of speaker profiles and identifies the index of the speaker profile based on an attention mechanism (Fig. 18). Thanks to the attention mechanism for speaker identification and the multi-talker ASR capability based on serialized output training [186], there is no limitation on the maximum number of speakers that the model can cope with. In case relevant speaker profiles are supplied at inference time, the E2E SA-ASR model can automatically transcribe the utterances while identifying the speaker of each utterance based on the supplied profiles. On the other hand, in case the relevant speaker profiles are not available prior to inference, the E2E SA-ASR model can still be applied with example profiles, and speaker clustering on the internal speaker embeddings of the E2E SA-ASR model ("speaker query" in Fig. 18) is used to diarize the speakers [80].
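When no relevant profiles are available, the diarization step of [80] thus amounts to clustering the per-utterance speaker query vectors. A minimal sketch, with agglomerative clustering chosen here purely for illustration:

```python
# Sketch of profile-free diarization by clustering the model-internal speaker
# query vectors, one vector per recognized utterance.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def diarize_speaker_queries(queries: np.ndarray,
                            distance_threshold: float = 1.0) -> np.ndarray:
    """queries: (N, D) speaker query vectors. Returns one speaker label per
    utterance; the number of clusters is determined by the threshold."""
    clusterer = AgglomerativeClustering(n_clusters=None,
                                        distance_threshold=distance_threshold)
    return clusterer.fit_predict(queries)
```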
5. Evaluation of Speaker Diarization
This section describes the evaluation scheme for speaker diarization. The datasets that are widely used for the evaluation of speaker diarization are first introduced in Section 5.1. Then, the evaluation metrics for speaker diarization are introduced in Section 5.2. Finally, international efforts to evaluate diarization systems are introduced in Section 5.3. A summary of the datasets is shown in Table 2.

NIST SRE 2000 (Disk-8), often referred to as the CALLHOME dataset, has been the most widely used dataset for speaker diarization in recent papers. The CALLHOME dataset contains 500 sessions of multilingual telephonic speech. Each session has 2 to 7 speakers, while there are two dominant speakers in each conversation.
The AMI database [187] includes 100 hours of meeting recordings from multiple sites in 171 meeting sessions. The AMI database provides audio recorded with lapel microphones, which are separately recorded and amplified for each speaker. Another audio source is recorded with omnidirectional microphone arrays mounted on the table during the meeting. The AMI database is a suitable dataset for evaluating a speaker diarization system integrated with an ASR module, since AMI provides forced alignment data containing word- and phoneme-level timings along with the transcript and speaker labels. Each meeting session contains 3 to 5 speakers.
The ICSI meeting corpus [188] contains 75 meeting sessions of 4 meeting types. The ICSI meeting corpus provides word-level timings along with the transcript and speaker labels. The audio is recorded with close-talking individual microphones and six tabletop microphones to provide speaker-specific channels as well as multi-channel recordings. Each meeting has 3 to 10 participants.
The DIHARD challenge datasets were created for DIHARD challenges 1, 2, and 3 [189, 85, 190], focusing on very challenging domains. The DIHARD challenge development and evaluation sets include clinical interviews, web videos, and speech in the wild (e.g., recordings in restaurants). The DIHARD challenge datasets also include relatively less challenging data, such as conversational telephone speech (CTS) and audio books, to diversify the domains in the development and evaluation sets. Contrary to other speaker diarization datasets, domains such as restaurant conversations and web videos have a significantly lower signal-to-noise ratio (SNR), which makes the DER substantially higher. The first DIHARD challenge, DIHARD 1, started with track 1 for diarization beginning from oracle SAD and track 2 for diarization from scratch using system SAD. Unlike DIHARD 1, DIHARD 2 included a multichannel speaker diarization task in track 3 (oracle SAD) and track 4 (system SAD), adding recordings drawn from the CHiME-5 corpus [191]. In the latest DIHARD challenge, DIHARD 3, a CTS dataset was added to the DIHARD 3 dev and eval sets, and DIHARD 3 removed tracks 3 and 4 while keeping only track 1 (oracle SAD) and track 2 (system SAD).

The CHiME-5 corpus [191] includes 50 hours of multi-party real conversations in everyday home environments. It contains speaker labels, segmentation, and the corresponding transcriptions, all of which are manually annotated. The audio is recorded by multiple 4-channel microphone arrays located in the kitchen and dining/living rooms of a house, and also by binaural microphones worn by the participants.

Table 2: Diarization Evaluation Datasets.

The VoxConverse dataset [192] contains 74 hours of human conversation extracted from YouTube videos. The dataset is divided into a development set (20.3 hours, 216 recordings) and a test set (53.5 hours, 310 recordings). The number of speakers in each recording varies widely, from 1 speaker to 21 speakers. The audio includes various types of noise, such as background music, laughter, etc. It also contains a noticeable portion of overlapping speech, from 0% to 30.1% depending on the recording. While the dataset contains visual information as well as audio, as of January 2021, only the audio of the development set is released under a Creative Commons Attribution 4.0 International License for research purposes. The audio of the evaluation set was used in track 4 of the VoxCeleb Speaker Recognition Challenge 2020 (Section 5.3) as a blind test set.
The LibriCSS corpus [87] is 10 hours of multi-channel recordings designed for research on speech separation, speech recognition, and speaker diarization. It was made by playing back audio from the LibriSpeech corpus [193] in a real meeting room and recording it with a 7-channel microphone array. It consists of 10 sessions, each of which is further decomposed into six 10-minute mini-sessions. Each mini-session was made from the audio of 8 speakers and designed to have a different overlap ratio, from 0% to 40%. To facilitate research, a baseline system for speech separation and ASR [87] and a baseline system that integrates speech separation, speaker diarization, and ASR [88] have been developed and released.

The accuracy of a speaker diarization system is measured by the Diarization Error Rate (DER) [194], which is the sum of three different error types: false alarm (FA) of speech, missed detection of speech, and confusion between speaker labels:

DER = (FA + Missed + Speaker-Confusion) / (Total Duration of Time).    (37)
To establish a one-to-one mapping between the hypothesis outputs and the reference transcript, the Hungarian algorithm [195] is employed. In the Rich Transcription 2006 evaluation [194], a 0.25-second "no score" collar is set around every boundary of a reference segment to mitigate the effect of inconsistent annotation and human errors in the reference transcript, and this evaluation scheme has been the most widely used in speaker diarization studies.
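As a toy illustration of Eq. (37), the function below computes a frame-level DER, assuming the optimal reference-hypothesis speaker mapping has already been applied and ignoring the scoring collar for brevity; all names are illustrative.

```python
# Frame-level DER: per frame, accumulate false-alarm, missed, and confused
# speaker time against the total reference speaker time, following Eq. (37).
def der(reference, hypothesis):
    """reference, hypothesis: lists of per-frame sets of speaker labels."""
    fa = miss = confusion = total = 0
    for ref, hyp in zip(reference, hypothesis):
        total += len(ref)                    # total reference speaker time
        fa += max(len(hyp) - len(ref), 0)    # hypothesized speech, no reference
        miss += max(len(ref) - len(hyp), 0)  # reference speech not detected
        confusion += min(len(ref), len(hyp)) - len(ref & hyp)
    return (fa + miss + confusion) / total
```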
Jaccard Error Rate (JER) was first introduced in the DIHARD II evaluation. The goal of JER is to evaluate each speaker with equal weight. Unlike DER, JER does not use the speaker confusion error to obtain the error value:

JER = (1 / N_ref) Σ_{i=1}^{N_ref} (FA_i + MISS_i) / TOTAL_i.    (38)

In Eq. (38), TOTAL_i is the union of the i-th speaker's speaking time in the reference transcript and the i-th speaker's speaking time in the hypothesis. The sum of FA_i and MISS_i divided by TOTAL_i is then averaged over the N_ref speakers in the reference script. Since JER uses a union operation between the reference and the hypothesis, JER never exceeds 100%, while DER can sometimes reach well over 100%. DER and JER are highly correlated, but if a subset of speakers is dominant in a given audio recording, JER tends to be higher than in the ordinary case.
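Correspondingly, a toy version of Eq. (38), assuming the per-speaker false alarm, miss, and union durations have already been computed under the optimal speaker mapping:

```python
# JER: average the per-reference-speaker error rates with equal weight,
# following Eq. (38).
def jer(per_speaker_stats):
    """per_speaker_stats: list of (fa, miss, total) durations, one tuple per
    reference speaker, where `total` is the union of the reference and
    hypothesis speaking times of that speaker."""
    errors = [(fa + miss) / total for fa, miss, total in per_speaker_stats]
    return sum(errors) / len(errors)
```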
While DER is based on the duration of each speaker's speaking time, the Word-level DER (WDER) is designed to measure the error caused on the lexical (output transcription) side. The motivation for WDER is the discrepancy between DER and the accuracy of the final transcript output, since DER relies on the duration of speaking time, which is not always aligned with word boundaries. The concept of word breakage was proposed in Silovsky et al. [175], where WB shares a similar idea with WDER. Unlike WDER, WB measures the number of speaker change points that occur inside a word boundary. The work of Park and Georgiou [196] suggested the term WDER, evaluating the diarization output against the ground-truth transcription. More recently, a joint ASR and speaker diarization system was evaluated with WDER in Shafey et al. [77]. Although the way of calculating WDER differs across studies, the underlying idea is that the diarization error is calculated by counting the correctly or incorrectly labeled words.

The Rich Transcription (RT) evaluation [20] is the pioneering evaluation series that initiated deeper investigation of speaker diarization in relation to ASR. The main purpose of this effort was to create ASR technologies that would produce transcriptions with descriptive metadata, such as who said when, which is where speaker diarization comes into play. Thus, the main tasks in the evaluation were naturally ASR and speaker diarization. The domains of interest were broadcast news, CTS, and meeting recordings with multiple participants. Throughout the period from 2002 to 2009, the RT evaluation series promoted and gauged advances in speaker diarization as well as ASR technology.

The DIHARD challenge [189, 85] is the most recent evaluation series focusing on challenging diarization tasks. The DIHARD challenge data contains many different challenging and diverse domains, including recordings from restaurants, meetings, interview videos, and courtrooms. The DIHARD evaluation focuses on the performance gap of state-of-the-art diarization systems between challenging domains (e.g., recordings from outdoors) and relatively clean speech (e.g., telephonic speech). The DIHARD challenge employs a stricter evaluation scheme in which the scoring rule has no "no score" collar and overlapped regions are also evaluated. In addition, the DIHARD challenge also employed JER.

The CHiME-6 challenge [89] track 2 revisits the previous CHiME-5 challenge [191] and further considers the problem of distant multi-microphone conversational speech diarization and recognition in everyday home environments. Although the final ranking is determined by WER, the challenge participants in this track also need to submit diarization results. The evaluation metrics for diarization follow the DIHARD challenge, i.e., no "no score" collar is used, and overlapped regions are also evaluated when computing DER and JER.

The VoxCeleb Speaker Recognition Challenge (VoxSRC) is a recent evaluation series for speaker recognition systems [197, 105]. The goal of VoxSRC is to probe how well current technology can cope with speech "in the wild". The evaluation data is obtained from YouTube videos of various domains, such as celebrity interviews, news shows, talk shows, and debates. The audio includes various types of background noise and laughter, as well as a noticeable portion of overlapping speech, all of which make the task very challenging. This evaluation series initially started with a pure speaker verification task [197], and a diarization task was added as track 4 in the latest evaluation, the VoxCeleb Speaker Recognition Challenge 2020 (VoxSRC-20) [105]. The VoxConverse dataset [192] was used for evaluation, with DER as the primary metric to determine the ranking of submitted systems. JER was also measured as a secondary metric.
6. Applications
The goal of meeting transcription is to automatically generate speaker-attributed transcripts of real-life meetings based on their audio and, optionally, video recordings. Accurate meeting transcriptions are one of the processing steps in a pipeline for several tasks like summarization, topic extraction, etc. Similarly, the same transcription system can be used in other domains such as healthcare [198]. Although this task was introduced by NIST in the Rich Transcription evaluation series back in 2003 [180, 188, 199], the initial systems had very poor performance, and consequently commercialization of the technology was not possible. However, recent advances in the areas of speech recognition [200, 201], far-field speech processing [202, 203, 204], and speaker ID and diarization [205, 206, 113] have greatly improved speaker-attributed transcription accuracy, enabling such commercialization. Bi-modal processing combining cameras with microphone arrays has further improved the overall performance [207, 208]. As such, these latest trends motivated us to include an end-to-end audio-visual meeting transcription system overview in this paper.

Reflecting the variety of application scenarios, customer needs, and business scope, different constraints may be imposed on meeting transcription systems. For example, it is most often required to provide the resulting transcriptions with low latency, making diarization and recognition even more challenging. On the other hand, the architecture of the transcription system can substantially improve the overall performance, e.g., by employing microphone arrays of known geometry as the input device. Also, in the case where the expected meeting attendees are known beforehand, the transcription system can further improve speaker attribution, all while providing the exact name of the speaker instead of a randomly generated discrete speaker label.

Two different scenarios in this space are presented: first, a fixed-geometry microphone array combined with a fish-eye camera system, and second, an ad-hoc geometry microphone array system without a camera. In both scenarios, a "non-binding" list of participants and their corresponding speaker profiles is considered known. In more detail, the transcription system has access to the invitees' names and profiles; however, the actual attendees may not accurately match those invited. As such, there is an option to include "unannounced" participants. Also, some of the invitees may not have profiles. In both scenarios, there is a constraint of low-latency transcription, where initial results need to be shown with low latency. The finalized results can be updated later in an offline fashion.

Some of the technical challenges to overcome are [209]:

1. Although ASR on overlapping speech is one of the main challenges in meeting transcription, limited progress has been made over the years. Numerous multi-channel speech separation methods have been proposed based on Independent Component Analysis (ICA) or spatial clustering [210, 211, 212, 213, 214, 215], but applying them to a meeting setup has had limited success. In addition, neural network-based separation methods like Permutation Invariant Training (PIT) [59] or deep clustering (DC) [58] cannot adequately address reverberation and background noise [216].
2. Flexible framework: It is desirable that the transcription system can process all the available information, such as the multi-channel audio and the visual cues. The system needs to process a dynamically changing number of audio channels without loss of performance. As such, the architecture needs to be modular enough to encompass the different settings.

3. Speaker-attributed ASR of natural meetings requires online/streaming ASR, audio pre-processing such as dereverberation, and accurate diarization and speaker identification. These multiple processing steps are usually optimized separately, and thus the overall pipeline is most frequently inefficient.

4. Using multiple, non-synchronized audio streams, e.g., audio captured with mobile devices, adds complexity to the meeting setup and processing. In return, we gain potentially better spatial coverage, since the devices are usually distributed around the room and near the speakers. As part of the application scenario, the meeting participants bring their personal devices, which can be re-purposed to improve the overall meeting transcription quality. On the other hand, while there are several pioneering studies [217], it is unclear what the best strategies are for consolidating multiple asynchronous audio streams and to what extent they work for natural meetings in online and offline setups.

Based on these considerations, an architecture for a meeting transcription system with asynchronous distant microphones has been proposed in [161]. In this work, various fusion strategies have been investigated: from early fusion beamforming the audio signals, to mid-fusion combining senones per channel, to late fusion combining the diarization and ASR results [147]. The resulting system performance was benchmarked on real-world meeting recordings against fixed-geometry systems. As mentioned above, the requirement of speaker-attributed transcription with low latency was adhered to as well. In addition to the end-to-end system analysis, the paper [161] proposed the idea of "leave-one-out beamforming" in the asynchronous multi-microphone setup, enriching the "diversity" of the resulting signals, as proposed in [218]. Finally, it describes how an online, incremental version of ROVER can process both the ASR and diarization outputs, enhancing the overall speaker-attributed ASR performance.

Speech and spoken language are central to conversational interactions and carry crucial information about a speaker's intent, emotions, identity, age, and other individual and interpersonal trait and state variables, including health state, and computational advances are increasingly allowing access to such rich information [219, 220]. For example, knowing how much, and how, a child speaks in an interaction reveals critical information about their developmental state and offers clues to clinicians in diagnosing disorders such as Autism [221]. Such analyses are made possible by capturing and processing audio recordings of the interactions, often involving two or more people. An important foundational step is identifying and associating the speech portions belonging to the specific individuals involved in the conversation. The technologies that provide this capability are speech activity detection (SAD) and speaker diarization.
Speech portions segmented with the speaker-specific information provided by speaker diarization, by themselves and without any explicit lexical transcription, can offer important information to domain experts, who can take advantage of speaker diarization results for quantitative turn-taking analysis.

A domain most relevant to such analyses of spoken conversational interactions is behavioral signal processing (BSP) [222, 219], which refers to the technology and algorithms for modeling and understanding human communicative, affective, and social behavior. For example, these may include analyzing how positive or negative a person is, how empathetic an individual is toward another, and what the behavior patterns reveal about the relationship status and health condition of an individual [220]. BSP involves addressing all the complexities of spontaneous conversational interactions, with additional challenges involved in handling and understanding the emotional, social, and interpersonal behavioral dynamics revealed through the vocal verbal and nonverbal cues of the interaction participants. Therefore, knowledge of speaker-specific vocal information plays a significant role in BSP, requiring highly accurate speaker diarization performance. For example, a speaker diarization module is employed as a pre-processing module for analyzing psychotherapy mechanisms and quality [223], and for suicide risk assessment [224].

Another popular application of speaker diarization for conversation interaction analysis is medical doctor-patient interactions. In the system described in [225], the nature of a patient's memory problem is detected from conversations between neurologists and patients. Speech and language features extracted from ASR transcripts, combined with speaker diarization results, are used to predict the type of disorder. An automated assistant system for medical domain transcription is proposed in [226], which includes a speaker diarization module, an ASR module, and a natural language generation (NLG) module. The automated assistant module accepts an audio clip and outputs grammatically correct sentences that describe the topic of the conversation, the subject, and the subject's symptoms.

Content-based audio indexing is a well-known application domain for speaker diarization. It can provide meta-information, such as the content or data type of given audio data, to make information retrieval efficient, since search queries by machines rely on such metadata. The more diverse the available information, the more efficiently audio content can be retrieved from a database.

One useful piece of information for audio indexing is the ASR transcript, which conveys the content of the speech portions in the audio data. Speaker diarization can augment those transcripts in terms of "who spoke when", which was the main purpose of the Rich Transcription evaluation series [20], as discussed in Sections 4.1 and 5.3. The aggregated spoken utterances of each speaker produced by a speaker diarization system also enable per-speaker summaries or keyword lists, which can be used as additional query values to retrieve relevant content from the database. In [227], we can get a glimpse of how speaker diarization outputs can be linked to information search in consumer-facing applications.

Thanks to the advances in ASR technology, the applications of ASR have evolved from simple voice command recognition systems to conversational AI systems.
Conversational AI systems, as opposed to voice command recognition systems, have features that voice command recognition systems lack. The fundamental idea of conversational AI is building a machine that humans can talk to and interact with. In this sense, focusing on a speaker of interest in a multi-party setting is one of the most important features of conversational AI, and speaker diarization thus becomes an essential capability for it. For example, a conversational AI system in a car can pay attention to the specific speaker who is requesting information from the navigation system by applying speaker diarization along with ASR.

Smart speakers and voice assistants are the most popular products where speaker diarization plays a significant role in conversational AI. Since response time and online processing are crucial factors in real-life settings, the demand for end-to-end speaker diarization systems integrated into the ASR pipeline is growing. The performance of incremental (online) ASR and speaker diarization of commercial ASR services is evaluated and compared in [228]. It is expected that the real-time and low-latency aspects of speaker diarization will be further emphasized in future speaker diarization systems, since the performance of online diarization and online ASR still has much room for improvement.
7. Challenges and the Future of Speaker Diarization
This paper has provided a comprehensive overview of speaker diarization techniques, highlighting the recent development of deep learning-based diarization approaches. In the early days, a speaker diarization system was developed as a pipeline of sub-modules, including front-end processing, speech activity detection, segmentation, speaker embedding extraction, clustering, and post-processing, leading to a standalone system without much connection to other components in a given speech application. With the rise of deep learning technology, more and more advancements have been made for speaker diarization, from methods that replace a single module with a deep-learning-based one, to fully end-to-end neural diarization. Furthermore, as speech recognition technology becomes more accessible, a trend toward tightly integrating speaker diarization and ASR systems has emerged, such as benefiting from the ASR output to improve speaker diarization accuracy. Of late, joint modeling of speaker diarization and speech recognition has been investigated in an attempt to enhance the overall performance. Thanks to these achievements, speaker diarization systems have already been deployed in many applications, including meeting transcription, conversational interaction analysis, audio indexing, and conversational AI systems.

As we have seen, tremendous progress has been made in speaker diarization systems. Nevertheless, there is still much room for improvement. As a final remark, we conclude this paper by listing the remaining challenges for speaker diarization for future research and development.
Online processing of speaker diarization.
Most speaker diarization methods assume that an entire recording can be observed before executing speaker diarization. However, many applications, such as meeting transcription systems or smart agents, require low latency for assigning the speaker. While there have been several attempts to build online speaker diarization systems, both clustering-based (e.g., [205]) and neural network-based (e.g., [55, 172, 173]), it remains a challenging problem.
Domain mismatch.
A model trained on data from a specific domain often works poorly on data from another domain. For example, it is experimentally known that the EEND model tends to overfit to the distribution of the speaker overlaps in the training data [56]. Such a domain mismatch issue is universal for any training-based method. Given the growing interest in trainable speaker diarization systems, it will become more important to assess the ability to handle a variety of inputs. The international evaluation efforts for speaker diarization, such as the DIHARD challenge [189, 85, 190] or VoxSRC [197, 105], will also be of great importance in that direction.

Speaker overlap.
Overlap of multi-talker speech is an inevitable part of conversation. For example, an average of 12% to 15% speaker overlap was observed in meeting recordings [229, 102], and it can be higher in daily conversations [230, 191, 89]. Nevertheless, many conventional speaker diarization systems, especially clustering-based ones, treated only the non-overlapped regions of recordings, sometimes even in the evaluation metric. While the topic has been studied for many years (e.g., early works [231, 232]), there is growing interest in handling speaker overlaps towards better speaker diarization, including the application of speech separation [104], post-processing [233, 162], and joint modeling of speech separation and speaker diarization [76, 184].
Integration with ASR.
Many, though not all, applications require ASR results along with speaker diarization results. In the line of modular combinations of speaker diarization and ASR, some systems put the speaker diarization system before ASR [91], while others put the diarization system after ASR [209]. Both types of systems have shown strong performance for specific tasks, and it is still an open problem what kind of system architecture is best for the combined speaker diarization and ASR task [88]. Furthermore, there is another line of research that jointly performs speaker diarization and ASR [77, 78, 79, 184], as introduced in Section 4. The joint modeling approach could leverage the inter-dependency between speaker diarization and ASR to better perform both tasks. However, it has not yet been fully investigated whether such joint frameworks perform better than well-tuned modular systems. Overall, the integration of speaker diarization and ASR is one of the hottest topics still being pursued.
Audio-visual modeling.
Visual information contains a strong clue to identify speakers. For example, the video captured by a fisheye camera was used to improve the speaker diarization accuracy in a meeting transcription task [209]. Visual information was also used to significantly improve the speaker diarization accuracy on YouTube videos [192]. While these studies showed the effectiveness of visual information, audio-visual speaker diarization has so far been investigated much less than audio-only speaker diarization, and there is much room for improvement.

References

[1] S. E. Tranter, K. Yu, D. A. Reynolds, G. Evermann, D. Y. Kim, P. C. Woodland, An investigation into the interactions between speaker diarisation systems and automatic speech transcription, CUED/F-INFENG/TR-464 (2003).
[2] S. E. Tranter, D. A. Reynolds, An overview of automatic speaker diarization systems, IEEE Transactions on Audio, Speech, and Language Processing 14 (2006) 1557–1565.
[3] X. Anguera, S. Bozonnet, N. Evans, C. Fredouille, G. Friedland, O. Vinyals, Speaker diarization: A review of recent research, IEEE Transactions on Audio, Speech, and Language Processing 20 (2012) 356–370.
[4] H. Gish, M.-H. Siu, R. Rohlicek, Segregation of speakers for speech recognition and speaker identification, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 1991, pp. 873–876.
[5] M.-H. Siu, Y. George, H. Gish, An unsupervised, sequential learning algorithm for segmentation of speech waveforms with multiple speakers, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 1992, pp. 189–192.
[6] J. R. Rohlicek, D. Ayuso, M. Bates, R. Bobrow, A. Boulanger, H. Gish, P. Jeanrenaud, M. Meteer, M. Siu, Gisting conversational speech, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 1992, pp. 113–116.
[7] M. Sugiyama, J. Murakami, H. Watanabe, Speech segmentation and clustering based on speaker features, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 1993, pp. 395–398.
[8] U. Jain, M. A. Siegler, S.-J. Doh, E. Gouvea, J. Huerta, P. J. Moreno, B. Raj, R. M. Stern, Recognition of continuous broadcast news with multiple unknown speakers and environments, in: Proceedings of ARPA Spoken Language Technology Workshop, 1996, pp. 61–66.
[9] M. Padmanabhan, L. R. Bahl, D. Nahamoo, M. A. Picheny, Speaker clustering and transformation for speaker adaptation in large-vocabulary speech recognition systems, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 1996, pp. 701–704.
[10] M. A. Siegler, U. Jain, B. Raj, R. M. Stern, Automatic segmentation, classification and clustering of broadcast news audio, in: Proceedings of DARPA Speech Recognition Workshop, 1997, pp. 97–99.
[11] H. Jin, F. Kubala, R. Schwartz, Automatic speaker clustering, in: Proceedings of Speech Recognition Workshop, 1997.
[12] H. S. Beigi, S. H. Maes, Speaker, channel and environment change detection, in: Proceedings of World Congress of Automation, 1998.
[13] S. S. Chen, P. S. Gopalakrishnan, Speaker, environment and channel change detection and clustering via the Bayesian Information Criterion, in: Tech. Rep., IBM T. J. Watson Research Center, 1998, pp. 127–132.
[14] A. Solomonoff, A. Mielke, M. Schmidt, H. Gish, Clustering speakers by their voices, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 1998, pp. 757–760.
[15] J.-L. Gauvain, G. Adda, L. Lamel, M. Adda-Decker, Transcription of broadcast news: The LIMSI Nov 96 Hub4 system, in: Proceedings of ARPA Speech Recognition Workshop, 1997, pp. 56–63.
[16] J.-L. Gauvain, L. Lamel, G. Adda, The LIMSI 1997 Hub-4E transcription system, in: Proceedings of DARPA News Transcription and Understanding Workshop, 1998, pp. 75–79.
[17] J.-L. Gauvain, L. Lamel, G. Adda, Partitioning and transcription of broadcast news data, in: Proceedings of the International Conference on Spoken Language Processing, 1998, pp. 1335–1338.
[18] D. Liu, F. Kubala, Fast speaker change detection for broadcast news transcription and indexing, in: Proceedings of the International Conference on Spoken Language Processing, 1999, pp. 1031–1034.
[19] AMI Consortium.
[20] NIST, Rich Transcription Evaluation.
[21] J. Ajmera, C. Wooters, A robust speaker clustering algorithm, in: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding, 2003, pp. 411–416.
[22] S. E. Tranter, D. A. Reynolds, Speaker diarisation for broadcast news, in: Proceedings of Odyssey Speaker and Language Recognition Workshop, 2004, pp. 337–344.
[23] C. Wooters, J. Fung, B. Peskin, X. Anguera, Toward robust speaker segmentation: The ICSI-SRI Fall 2004 diarization system, in: Proceedings of Fall 2004 Rich Transcription Workshop, 2004, pp. 402–414.
[24] D. A. Reynolds, P. Torres-Carrasquillo, The MIT Lincoln Laboratory RT-04F diarization systems: Applications to broadcast audio and telephone conversations, in: Proceedings of Fall 2004 Rich Transcription Workshop, 2004.
[25] D. A. Reynolds, P. Torres-Carrasquillo, Approaches and applications of audio diarization, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2005, pp. 953–956.
[26] X. Zhu, C. Barras, S. Meignier, J.-L. Gauvain, Combining speaker identification and BIC for speaker diarization, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2005, pp. 2441–2444.
[27] C. Barras, X. Zhu, S. Meignier, J.-L. Gauvain, Multistage speaker diarization of broadcast news, IEEE Transactions on Audio, Speech, and Language Processing 14 (2006) 1505–1512.
[28] N. Mirghafori, C. Wooters, Nuts and flakes: A study of data characteristics in speaker diarization, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2006, pp. 1017–1020.
[29] S. Meignier, D. Moraru, C. Fredouille, J.-F. Bonastre, L. Besacier, Step-by-step and integrated approaches in broadcast news speaker diarization, Computer Speech & Language 20 (2006) 303–330.
[30] A. E. Rosenberg, A. Gorin, Z. Liu, P. Parthasarathy, Unsupervised speaker segmentation of telephone conversations, in: Proceedings of the International Conference on Spoken Language Processing, 2002, pp. 565–568.
[31] D. Liu, F. Kubala, A cross-channel modeling approach for automatic segmentation of conversational telephone speech, in: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding, 2003, pp. 333–338.
[32] S. E. Tranter, K. Yu, G. Evermann, P. C. Woodland, Generating and evaluating segmentations for automatic speech recognition of conversational telephone speech, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2004, pp. 753–756.
[33] D. A. Reynolds, P. Kenny, F. Castaldo, A study of new approaches to speaker diarization, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2009, pp. 1047–1050.
[34] P. Kenny, D. Reynolds, F. Castaldo, Diarization of telephone conversations using factor analysis, IEEE Journal of Selected Topics in Signal Processing 4 (2010) 1059–1070.
[35] T. Pfau, D. Ellis, A. Stolcke, Multispeaker speech activity detection for the ICSI meeting recorder, in: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding, 2001, pp. 107–110.
[36] J. Ajmera, G. Lathoud, L. McCowan, Clustering and segmenting speakers and their locations in meetings, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2004, pp. 605–608.
[37] Q. Jin, K. Laskowski, T. Schultz, A. Waibel, Speaker segmentation and clustering in meetings, in: Proceedings of the International Conference on Spoken Language Processing, 2004, pp. 597–600.
[38] X. Anguera, C. Wooters, B. Peskin, M. Aguilo, Robust speaker segmentation for meetings: The ICSI-SRI Spring 2005 diarization system, in: Proceedings of Machine Learning for Multimodal Interaction Workshop, 2005, pp. 402–414.
[39] X. Anguera, C. Wooters, J. Hernando, Purity algorithms for speaker diarization of meetings data, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, volume I, 2006, pp. 1025–1028.
[40] D. Istrate, C. Fredouille, S. Meignier, L. Besacier, J.-F. Bonastre, NIST RT05S evaluation: Pre-processing techniques and speaker diarization on multiple microphone meetings, in: Proceedings of Machine Learning for Multimodal Interaction Workshop, 2006.
[41] D. A. V. Leeuwen, M. Konecny, Progress in the AMIDA speaker diarization system for meeting data, in: Proceedings of International Evaluation Workshops CLEAR 2007 and RT 2007, 2007, pp. 475–483.
[42] X. Anguera, C. Wooters, J. Hernando, Acoustic beamforming for speaker diarization of meetings, IEEE Transactions on Audio, Speech, and Language Processing 15 (2007) 2011–2023.
[43] X. Zhu, C. Barras, L. Lamel, J.-L. Gauvain, Multi-stage speaker diarization for conference and lecture meetings, in: Proceedings of International Evaluation Workshops CLEAR 2007 and RT 2007, 2007, pp. 533–542.
[44] D. Vijayasenan, F. Valente, H. Bourlard, An information theoretic approach to speaker diarization of meeting data, IEEE Transactions on Audio, Speech, and Language Processing 17 (2009) 1382–1393.
[45] F. Valente, P. Motlicek, D. Vijayasenan, Variational Bayesian speaker diarization of meeting recordings, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2010, pp. 4954–4957.
[46] P. Kenny, G. Boulianne, P. Ouellet, P. Dumouchel, Joint factor analysis versus eigenchannels in speaker recognition, IEEE Transactions on Audio, Speech, and Language Processing 15 (2007) 1435–1447.
[47] E. Variani, X. Lei, E. McDermott, I. L. Moreno, J. Gonzalez-Dominguez, Deep neural networks for small footprint text-dependent speaker verification, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2014, pp. 4052–4056.
[48] G. Heigold, I. Moreno, S. Bengio, N. Shazeer, End-to-end text-dependent speaker verification, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2016, pp. 5115–5119.
[49] Q. Wang, C. Downey, L. Wan, P. A. Mansfield, I. L. Moreno, Speaker diarization with LSTM, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2018, pp. 5239–5243.
[50] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, S. Khudanpur, X-vectors: Robust DNN embeddings for speaker recognition, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2018, pp. 5329–5333.
[51] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, P. Ouellet, Front-end factor analysis for speaker verification, IEEE Transactions on Audio, Speech, and Language Processing 19 (2011).
[52] S. Shum, N. Dehak, J. Glass, On the use of spectral and iterative methods for speaker diarization, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2012, pp. 482–485.
[53] G. Dupuy, M. Rouvier, S. Meignier, Y. Esteve, i-Vectors and ILP clustering adapted to cross-show speaker diarization, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2012, pp. 2174–2177.
[54] S. H. Shum, N. Dehak, R. Dehak, J. R. Glass, Unsupervised methods for speaker diarization: An integrated and iterative approach, IEEE Transactions on Audio, Speech, and Language Processing 21 (2013).
[55] A. Zhang, Q. Wang, Z. Zhu, J. Paisley, C. Wang, Fully supervised speaker diarization, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2019, pp. 6301–6305.
[56] Y. Fujita, N. Kanda, S. Horiguchi, K. Nagamatsu, S. Watanabe, End-to-end neural speaker diarization with permutation-free objectives, Proceedings of the Annual Conference of the International Speech Communication Association (2019) 4300–4304.
[57] Y. Fujita, N. Kanda, S. Horiguchi, Y. Xue, K. Nagamatsu, S. Watanabe, End-to-end neural speaker diarization with self-attention, in: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding, IEEE, 2019, pp. 296–303.
[58] J. R. Hershey, Z. Chen, J. Le Roux, S. Watanabe, Deep clustering: Discriminative embeddings for segmentation and separation, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2016, pp. 31–35.
[59] M. Kolbæk, D. Yu, Z.-H. Tan, J. Jensen, Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks, IEEE/ACM Transactions on Audio, Speech, and Language Processing 25 (2017) 1901–1913.
[60] Y. Luo, N. Mesgarani, Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation, IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (2019) 1256–1266.
[61] E. Variani, X. Lei, E. McDermott, I. L. Moreno, J. Gonzalez-Dominguez, Deep neural networks for small footprint text-dependent speaker verification, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2014, pp. 4052–4056.
[62] D. Snyder, D. Garcia-Romero, D. Povey, S. Khudanpur, Deep neural network embeddings for text-independent speaker verification, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2017, pp. 999–1003.
[63] T. Drugman, Y. Stylianou, Y. Kida, M. Akamine, Voice activity detection: Merging source and filter-based information, IEEE Signal Processing Letters 23 (2015) 252–256.
[64] X. Guo, L. Gao, X. Liu, J. Yin, Improved deep embedded clustering with local structure preservation, in: Proceedings of International Joint Conference on Artificial Intelligence, 2017, pp. 1753–1759.
[65] J. Wang, X. Xiao, J. Wu, R. Ramamurthy, F. Rudzicz, M. Brudno, Speaker diarization with session-level speaker embedding refinement using graph neural networks, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2020, pp. 7109–7113.
[66] I. Medennikov, M. Korenevsky, T. Prisyach, Y. Khokhlov, M. Korenevskaya, I. Sorokin, T. Timofeeva, A. Mitrofanov, A. Andrusenko, I. Podluzhny, A. Laptev, A. Romanenko, Target-speaker voice activity detection: A novel approach for multi-speaker diarization in a dinner party scenario, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2020, pp. 274–278.
[67] D. Yu, X. Chang, Y. Qian, Recognizing multi-talker speech with permutation invariant training, Proceedings of the Annual Conference of the International Speech Communication Association (2017) 2456–2460.
[68] H. Seki, T. Hori, S. Watanabe, J. Le Roux, J. R. Hershey, A purely end-to-end system for multi-speaker speech recognition, 2018, pp. 2620–2630.
[69] X. Chang, Y. Qian, K. Yu, S. Watanabe, End-to-end monaural multi-speaker ASR system without pretraining, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2019, pp. 6256–6260.
[70] N. Kanda, Y. Fujita, S. Horiguchi, R. Ikeshita, K. Nagamatsu, S. Watanabe, Acoustic modeling for distant multi-talker speech recognition with single- and multi-channel branches, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2019, pp. 6630–6634.
[71] N. Kanda, S. Horiguchi, R. Takashima, Y. Fujita, K. Nagamatsu, S. Watanabe, Auxiliary interference speaker loss for target-speaker speech recognition, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2019, pp. 236–240.
[72] X. Wang, N. Kanda, Y. Gaur, Z. Chen, Z. Meng, T. Yoshioka, Exploring end-to-end multi-channel ASR with bias information for meeting transcription, in: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding, 2021.
[73] P. Wang, Z. Chen, X. Xiao, Z. Meng, T. Yoshioka, T. Zhou, L. Lu, J. Li, Speech separation using speaker inventory, in: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding, 2019, pp. 230–236.
[74] C. Han, Y. Luo, C. Li, T. Zhou, K. Kinoshita, S. Watanabe, M. Delcroix, H. Erdogan, J. R. Hershey, N. Mesgarani, et al., Continuous speech separation using speaker inventory for long multi-talker recording, arXiv preprint arXiv:2012.09727 (2020).
[75] Z. Huang, S. Watanabe, Y. Fujita, P. García, Y. Shao, D. Povey, S. Khudanpur, Speaker diarization with region proposal network, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2020, pp. 6514–6518.
[76] T. von Neumann, K. Kinoshita, M. Delcroix, S. Araki, T. Nakatani, R. Haeb-Umbach, All-neural online source separation, counting, and diarization for meeting analysis, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2019, pp. 91–95.
[77] L. E. Shafey, H. Soltau, I. Shafran, Joint speech recognition and speaker diarization via sequence transduction, in: Proceedings of the Annual Conference of the International Speech Communication Association, ISCA, 2019, pp. 396–400.
[78] H. H. Mao, S. Li, J. McAuley, G. Cottrell, Speech recognition and multi-speaker diarization of long conversations, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2020, pp. 691–695.
[79] N. Kanda, S. Horiguchi, Y. Fujita, Y. Xue, K. Nagamatsu, S. Watanabe, Simultaneous speech recognition and speaker diarization for monaural dialogue recordings with target-speaker acoustic models, in: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding, 2019, pp. 31–38.
[80] N. Kanda, X. Chang, Y. Gaur, X. Wang, Z. Meng, Z. Chen, T. Yoshioka, Investigation of end-to-end speaker-attributed ASR for continuous multi-talker recordings, in: Proceedings of IEEE Spoken Language Technology Workshop, 2021.
[81] R. Haeb-Umbach, S. Watanabe, T. Nakatani, M. Bacchiani, B. Hoffmeister, M. L. Seltzer, H. Zen, M. Souden, Speech processing for digital home assistants: Combining signal processing with deep-learning techniques, IEEE Signal Processing Magazine 36 (2019) 111–124.
[82] E. Vincent, T. Virtanen, S. Gannot, Audio source separation and speech enhancement, John Wiley & Sons, 2018.
[83] D. Wang, J. Chen, Supervised speech separation based on deep learning: An overview, IEEE/ACM Transactions on Audio, Speech, and Language Processing 26 (2018) 1702–1726.
[84] G. Sell, D. Snyder, A. McCree, D. Garcia-Romero, J. Villalba, M. Maciejewski, V. Manohar, N. Dehak, D. Povey, S. Watanabe, et al., Diarization is hard: Some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2018, pp. 2808–2812.
[85] N. Ryant, K. Church, C. Cieri, A. Cristia, J. Du, S. Ganapathy, M. Liberman, The second DIHARD diarization challenge: Dataset, task, and baselines, Proceedings of the Annual Conference of the International Speech Communication Association (2019) 978–982.
[86] M. Diez, F. Landini, L. Burget, J. Rohdin, A. Silnova, K. Zmolíková, O. Novotný, K. Veselý, O. Glembek, O. Plchot, et al., BUT system for DIHARD speech diarization challenge 2018, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2018, pp. 2798–2802.
[87] Z. Chen, T. Yoshioka, L. Lu, T. Zhou, Z. Meng, Y. Luo, J. Wu, X. Xiao, J. Li, Continuous speech separation: Dataset and analysis, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2020, pp. 7284–7288.
[88] D. Raj, P. Denisov, Z. Chen, H. Erdogan, Z. Huang, M. He, S. Watanabe, J. Du, T. Yoshioka, Y. Luo, N. Kanda, J. Li, S. Wisdom, J. R. Hershey, Integration of speech separation, diarization, and recognition for multi-speaker meetings: System description, comparison, and analysis, in: Proceedings of IEEE Spoken Language Technology Workshop, 2021.
[89] S. Watanabe, M. Mandel, J. Barker, E. Vincent, A. Arora, X. Chang, S. Khudanpur, V. Manohar, D. Povey, D. Raj, et al., CHiME-6 challenge: Tackling multispeaker speech recognition for unsegmented recordings, in: 6th International Workshop on Speech Processing in Everyday Environments (CHiME 2020), 2020.
[90] A. Arora, D. Raj, A. S. Subramanian, K. Li, B. Ben-Yair, M. Maciejewski, P. Żelasko, P. Garcia, S. Watanabe, S. Khudanpur, The JHU multi-microphone multi-speaker ASR system for the CHiME-6 challenge, arXiv preprint arXiv:2006.07898 (2020).
[91] I. Medennikov, M. Korenevsky, T. Prisyach, Y. Khokhlov, M. Korenevskaya, I. Sorokin, T. Timofeeva, A. Mitrofanov, A. Andrusenko, I. Podluzhny, et al., The STC system for the CHiME-6 challenge, in: CHiME 2020 Workshop on Speech Processing in Everyday Environments, 2020.
[92] X. Lu, Y. Tsao, S. Matsuda, C. Hori, Speech enhancement based on deep denoising autoencoder, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2013, pp. 436–440.
[93] Y. Xu, J. Du, L.-R. Dai, C.-H. Lee, A regression approach to speech enhancement based on deep neural networks, IEEE/ACM Transactions on Audio, Speech, and Language Processing 23 (2014) 7–19.
[94] H. Erdogan, J. R. Hershey, S. Watanabe, J. Le Roux, Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2015, pp. 708–712.
[95] P. C. Loizou, Speech enhancement: Theory and practice, CRC Press, 2013.
[96] T. Gao, J. Du, L.-R. Dai, C.-H. Lee, Densely connected progressive learning for LSTM-based speech enhancement, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2018, pp. 5054–5058.
[97] J. Heymann, L. Drude, R. Haeb-Umbach, Neural network based spectral mask estimation for acoustic beamforming, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2016, pp. 196–200.
[98] H. Erdogan, J. R. Hershey, S. Watanabe, M. I. Mandel, J. Le Roux, Improved MVDR beamforming using single-channel mask prediction networks, Proceedings of the Annual Conference of the International Speech Communication Association (2016) 1981–1985.
[99] T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, B.-H. Juang, Speech dereverberation based on variance-normalized delayed linear prediction, IEEE Transactions on Audio, Speech, and Language Processing 18 (2010) 1717–1731.
[100] T. Yoshioka, T. Nakatani, Generalization of multi-channel linear prediction methods for blind MIMO impulse response shortening, IEEE Transactions on Audio, Speech, and Language Processing 20 (2012) 2707–2720.
[101] L. Drude, J. Heymann, C. Boeddeker, R. Haeb-Umbach, NARA-WPE: A Python package for weighted prediction error dereverberation in NumPy and TensorFlow for online and offline processing, in: Speech Communication; 13th ITG-Symposium, VDE, 2018, pp. 1–5.
[102] T. Yoshioka, H. Erdogan, Z. Chen, X. Xiao, F. Alleva, Recognizing overlapped speech in meetings: A multichannel separation approach using neural networks, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2018, pp. 3038–3042.
[103] C. Boeddecker, J. Heitkaemper, J. Schmalenstroeer, L. Drude, J. Heymann, R. Haeb-Umbach, Front-end processing for the CHiME-5 dinner party scenario, in: Proceedings of CHiME 2018 Workshop on Speech Processing in Everyday Environments, 2018, pp. 35–40.
[104] X. Xiao, N. Kanda, Z. Chen, T. Zhou, T. Yoshioka, Y. Zhao, G. Liu, J. Wu, J. Li, Y. Gong, Microsoft speaker diarization system for the VoxCeleb speaker recognition challenge 2020, arXiv preprint arXiv:2010.11458 (2020).
[105] A. Nagrani, J. S. Chung, J. Huh, A. Brown, E. Coto, W. Xie, M. McLaren, D. A. Reynolds, A. Zisserman, VoxSRC 2020: The second VoxCeleb speaker recognition challenge, arXiv preprint arXiv:2012.06867 (2020).
[106] T. Ng, B. Zhang, L. Nguyen, S. Matsoukas, X. Zhou, N. Mesgarani, K. Veselý, P. Matějka, Developing a speech activity detection system for the DARPA RATS program, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2012, pp. 1969–1972.
[107] R. Sarikaya, J. H. Hansen, Robust detection of speech activity in the presence of noise, in: Proceedings of the International Conference on Spoken Language Processing, volume 4, 1998, pp. 1455–1458.
[108] D. Haws, D. Dimitriadis, G. Saon, S. Thomas, M. Picheny, On the importance of event detection for ASR, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2016.
[109] S. Meignier, D. Moraru, C. Fredouille, J.-F. Bonastre, L. Besacier, Step-by-step and integrated approaches in broadcast news speaker diarization, Computer Speech and Language 20 (2006) 303–330.
[110] S. Chen, P. Gopalakrishnan, et al., Speaker, environment and channel change detection and clustering via the Bayesian information criterion, in: Proceedings of DARPA Broadcast News Transcription and Understanding Workshop, volume 8, Virginia, USA, 1998, pp. 127–132.
[111] P. Delacourt, C. J. Wellekens, DISTBIC: A speaker-based segmentation for audio data indexing, Speech Communication 32 (2000) 111–126.
[112] M. Senoussaoui, P. Kenny, T. Stafylakis, P. Dumouchel, A study of the cosine distance-based mean shift for telephone speech diarization, IEEE/ACM Transactions on Audio, Speech, and Language Processing 22 (2013) 217–227.
[113] G. Sell, D. Snyder, A. McCree, D. Garcia-Romero, J. Villalba, M. Maciejewski, V. Manohar, N. Dehak, D. Povey, S. Watanabe, S. Khudanpur, Diarization is hard: Some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2018, pp. 2808–2812.
[114] W.-H. Tsai, S.-S. Cheng, H.-M. Wang, Speaker clustering of speech utterances using a voice characteristic reference space, in: Proceedings of the International Conference on Spoken Language Processing, 2004.
[115] J. E. Rougui, M. Rziza, D. Aboutajdine, M. Gelgon, J. Martinez, Fast incremental clustering of Gaussian mixture speaker models for scaling up retrieval in on-line broadcast, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, volume 5, IEEE, 2006, pp. V–V.
[116] D. A. Reynolds, T. F. Quatieri, R. B. Dunn, Speaker verification using adapted Gaussian mixture models, Digital Signal Processing 10 (2000) 19–41.
[117] P. Kenny, G. Boulianne, P. Ouellet, P. Dumouchel, Speaker and session variability in GMM-based speaker verification, IEEE Transactions on Audio, Speech, and Language Processing 15 (2007) 1448–1460.
[118] P. Kenny, P. Ouellet, N. Dehak, V. Gupta, P. Dumouchel, A study of interspeaker variability in speaker verification, IEEE Transactions on Audio, Speech, and Language Processing 16 (2008) 980–988.
[119] P. Kenny, G. Boulianne, P. Dumouchel, Eigenvoice modeling with sparse training data, IEEE Transactions on Speech and Audio Processing 13 (2005) 345–354.
[120] G. Sell, D. Garcia-Romero, Speaker diarization with PLDA i-vector scoring and unsupervised calibration, in: Proceedings of IEEE Spoken Language Technology Workshop, IEEE, 2014, pp. 413–417.
[121] W. Zhu, J. Pelecanos, Online speaker diarization using adapted i-vector transforms, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2016, pp. 5045–5049.
[122] Y. Sun, X. Wang, X. Tang, Deep learning face representation from predicting 10,000 classes, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1891–1898.
[123] Y. Taigman, M. Yang, M. Ranzato, L. Wolf, DeepFace: Closing the gap to human-level performance in face verification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1701–1708.
[124] J. Villalba, N. Chen, D. Snyder, D. Garcia-Romero, A. McCree, G. Sell, J. Borgstrom, F. Richardson, S. Shon, F. Grondin, et al., State-of-the-art speaker recognition for telephone and video speech: The JHU-MIT submission for NIST SRE18, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2019, pp. 1488–1492.
[125] D. Comaniciu, P. Meer, Mean shift: A robust approach toward feature space analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (2002) 603–619.
[126] T. Stafylakis, V. Katsouros, G. Carayannis, Speaker clustering via the mean shift algorithm, Recall 2 (2010) 7.
[127] M. Senoussaoui, P. Kenny, P. Dumouchel, T. Stafylakis, Efficient iterative mean shift based cosine dissimilarity for multi-recording speaker clustering, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2013, pp. 7712–7715.
[128] I. Salmun, I. Shapiro, I. Opher, I. Lapidot, PLDA-based mean shift speakers' short segments clustering, Computer Speech and Language 45 (2017) 411–436.
[129] K. J. Han, S. S. Narayanan, A robust stopping criterion for agglomerative hierarchical clustering in a speaker diarization system, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2007.
[130] S. Novoselov, A. Gusev, A. Ivanov, T. Pekhovsky, A. Shulipa, A. Avdeeva, A. Gorlanov, A. Kozlov, Speaker diarization with deep speaker embeddings for DIHARD challenge II, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2019, pp. 1003–1007.
[131] U. Von Luxburg, A tutorial on spectral clustering, Statistics and Computing 17 (2007) 395–416.
[132] A. Ng, M. Jordan, Y. Weiss, On spectral clustering: Analysis and an algorithm, Advances in Neural Information Processing Systems 14 (2001) 849–856.
[133] H. Ning, M. Liu, H. Tang, T. S. Huang, A spectral clustering approach to speaker diarization, in: Proceedings of the International Conference on Spoken Language Processing, 2006, pp. 2178–2181.
[134] J. Luque, J. Hernando, On the use of agglomerative and spectral clustering in speaker diarization of meetings, in: Proceedings of Odyssey: The Speaker and Language Recognition Workshop, 2012, pp. 130–137.
[135] Q. Lin, R. Yin, M. Li, H. Bredin, C. Barras, LSTM based similarity measurement with spectral clustering for speaker diarization, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2019, pp. 366–370.
[136] T. J. Park, K. J. Han, M. Kumar, S. Narayanan, Auto-tuning spectral clustering for speaker diarization using normalized maximum eigengap, IEEE Signal Processing Letters 27 (2019) 381–385.
[137] P. Kenny, D. Reynolds, F. Castaldo, Diarization of telephone conversations using factor analysis, IEEE Journal of Selected Topics in Signal Processing 4 (2010) 1059–1070.
[138] M. Diez, L. Burget, P. Matejka, Speaker diarization based on Bayesian HMM with eigenvoice priors, in: Proceedings of Odyssey: The Speaker and Language Recognition Workshop, 2018, pp. 147–154.
[139] M. Diez, L. Burget, F. Landini, J. Černocký, Analysis of speaker diarization based on Bayesian HMM with eigenvoice priors, IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 (2019) 355–368.
[140] M. Diez, L. Burget, S. Wang, J. Rohdin, J. Černocký, Bayesian HMM based x-vector clustering for speaker diarization, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2019, pp. 346–350.
[141] F. Landini, J. Profant, M. Diez, L. Burget, Bayesian HMM clustering of x-vector sequences (VBx) in speaker diarization: Theory, implementation and analysis on standard tasks, arXiv preprint arXiv:2012.14952 (2020).
[142] G. Sell, D. Garcia-Romero, Diarization resegmentation in the factor analysis subspace, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2015, pp. 4794–4798.
[143] J. G. Fiscus, A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER), in: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding, IEEE, 1997, pp. 347–354.
[144] N. Brummer, L. Burget, J. Cernocky, O. Glembek, F. Grezl, M. Karafiat, D. A. van Leeuwen, P. Matejka, P. Schwarz, A. Strasheim, Fusion of heterogeneous speaker recognition systems in the STBU submission for the NIST speaker recognition evaluation 2006, IEEE Transactions on Audio, Speech, and Language Processing 15 (2007) 2072–2084.
[145] M. Huijbregts, D. van Leeuwen, F. Jong, The majority wins: A method for combining speaker diarization systems, in: Proceedings of the Annual Conference of the International Speech Communication Association, ISCA, 2009, pp. 924–927.
[146] S. Bozonnet, N. Evans, X. Anguera, O. Vinyals, G. Friedland, C. Fredouille, System output combination for improved speaker diarization, in: Proceedings of the Annual Conference of the International Speech Communication Association, ISCA, 2010, pp. 2642–2645.
[147] A. Stolcke, T. Yoshioka, DOVER: A method for combining diarization outputs, in: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding, IEEE, 2019, pp. 757–763.
[148] D. Raj, L. P. Garcia-Perera, Z. Huang, S. Watanabe, D. Povey, A. Stolcke, S. Khudanpur, DOVER-Lap: A method for combining overlap-aware diarization outputs, in: Proceedings of IEEE Spoken Language Technology Workshop, 2021.
[149] D. Dimitriadis, Enhancements for audio-only diarization systems, arXiv preprint arXiv:1909.00082 (2019).
[150] J. Xie, R. Girshick, A. Farhadi, Unsupervised deep embedding for clustering analysis, in: Proceedings of International Conference on Machine Learning, 2016, pp. 478–487.
[151] E. Ustinova, V. Lempitsky, Learning deep embeddings with histogram loss, Proceedings of Advances in Neural Information Processing Systems 29 (2016) 4170–4178.
[152] Q. Lin, Y. Hou, M. Li, Self-attentive similarity measurement strategies in speaker diarization, Proceedings of the Annual Conference of the International Speech Communication Association (2020) 284–288.
[153] T. J. Park, M. Kumar, S. Narayanan, Multi-scale speaker diarization with neural affinity score fusion, arXiv preprint arXiv:2011.10527 (2020).
[154] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (2015) 436–444.
[155] A. Santoro, R. Faulkner, D. Raposo, J. Rae, M. Chrzanowski, T. Weber, D. Wierstra, O. Vinyals, R. Pascanu, T. Lillicrap, Relational recurrent neural networks, in: Proceedings of Advances in Neural Information Processing Systems, 2018, pp. 7299–7310.
[156] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, T. Lillicrap, Meta-learning with memory-augmented neural networks, in: Proceedings of International Conference on Machine Learning, 2016, pp. 1842–1850.
[157] S. Sukhbaatar, J. Weston, R. Fergus, et al., End-to-end memory networks, in: Proceedings of Advances in Neural Information Processing Systems, 2015, pp. 2440–2448.
[158] D. Garcia-Romero, C. Y. Espy-Wilson, Analysis of i-vector length normalization in speaker recognition systems, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2011, pp. 249–252.
[159] N. Flemotomos, D. Dimitriadis, A memory augmented architecture for continuous speaker identification in meetings, arXiv preprint arXiv:2001.05118 (2020).
[160] Z. Zajíc, M. Kunešová, V. Radová, Investigation of segmentation in i-vector based speaker diarization of telephone speech, in: International Conference on Speech and Computer, 2016, pp. 411–418.
[161] T. Yoshioka, D. Dimitriadis, A. Stolcke, W. Hinthorn, Z. Chen, M. Zeng, X. Huang, Meeting transcription using asynchronous distant microphones, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2019, pp. 2968–2972.
[162] S. Horiguchi, P. Garcia, Y. Fujita, S. Watanabe, K. Nagamatsu, End-to-end speaker diarization as post-processing, arXiv preprint arXiv:2012.10055 (2020).
[163] D. M. Blei, P. I. Frazier, Distance dependent Chinese restaurant processes, Journal of Machine Learning Research 12 (2011).
[164] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (2016) 1137–1149.
[165] D. Kounades-Bastian, L. Girin, X. Alameda-Pineda, S. Gannot, R. Horaud, An EM algorithm for joint source separation and diarisation of multichannel convolutive speech mixtures, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2017, pp. 16–20.
[166] D. Kounades-Bastian, L. Girin, X. Alameda-Pineda, R. Horaud, S. Gannot, Exploiting the intermittency of speech for joint separation and diarization, in: Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, IEEE, 2017, pp. 41–45.
[167] K. Kinoshita, M. Delcroix, S. Araki, T. Nakatani, Tackling real noisy reverberant meetings with all-neural source separation, counting, and diarization system, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2020, pp. 381–385.
[168] K. Maekawa, Corpus of spontaneous Japanese: Its design and evaluation, in: ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition, 2003, pp. 7–12.
[169] S. Horiguchi, Y. Fujita, S. Watanabe, Y. Xue, K. Nagamatsu, End-to-end speaker diarization for an unknown number of speakers with encoder-decoder based attractors, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2020, pp. 269–273.
[170] Y. Fujita, S. Watanabe, S. Horiguchi, Y. Xue, J. Shi, K. Nagamatsu, Neural speaker diarization with speaker-wise chain rule, arXiv preprint arXiv:2006.01796 (2020).
[171] K. Kinoshita, M. Delcroix, N. Tawara, Integrating end-to-end neural and clustering-based diarization: Getting the best of both worlds, arXiv preprint arXiv:2010.13366 (2020).
[172] Y. Xue, S. Horiguchi, Y. Fujita, S. Watanabe, K. Nagamatsu, Online end-to-end neural diarization with speaker-tracing buffer, arXiv preprint arXiv:2006.02616 (2020).
[173] E. Han, C. Lee, A. Stolcke, BW-EDA-EEND: Streaming end-to-end neural speaker diarization for a variable number of speakers, arXiv preprint arXiv:2011.02678 (2020).
[174] J. Huang, E. Marcheret, K. Visweswariah, G. Potamianos, The IBM RT07 evaluation systems for speaker diarization on lecture meetings, in: Multimodal Technologies for Perception of Humans, Springer, 2007, pp. 497–508.
[175] J. Silovsky, J. Zdansky, J. Nouza, P. Cerva, J. Prazak, Incorporation of the ASR output in speaker segmentation and clustering within the task of speaker diarization of broadcast streams, in: International Workshop on Multimedia Signal Processing, IEEE, 2012, pp. 118–123.
[176] L. Canseco-Rodriguez, L. Lamel, J.-L. Gauvain, Speaker diarization from speech transcripts, in: Proceedings of the International Conference on Spoken Language Processing, volume 4, 2004, pp. 3–7.
[177] N. Flemotomos, P. Georgiou, S. Narayanan, Linguistically aided speaker diarization using speaker role information, arXiv preprint (2019).
[178] T. J. Park, P. Georgiou, Multimodal speaker segmentation and diarization using lexical and acoustic cues via sequence to sequence neural networks, Proceedings of the Annual Conference of the International Speech Communication Association (2018) 1373–1377.
[179] T. J. Park, K. J. Han, J. Huang, X. He, B. Zhou, P. Georgiou, S. Narayanan, Speaker diarization with lexical information, Proceedings of the Annual Conference of the International Speech Communication Association (2019) 391–395.
[180] J. Fiscus, J. Ajot, J. Garofolo, The Rich Transcription 2007 meeting recognition evaluation, in: Multimodal Technologies for Perception of Humans, Springer, 2007, pp. 373–389.
[181] K. Zmolikova, M. Delcroix, K. Kinoshita, T. Higuchi, A. Ogawa, T. Nakatani, Speaker-aware neural network based beamformer for speaker extraction in speech mixtures, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2017, pp. 2655–2659.
[182] M. Delcroix, K. Zmolikova, K. Kinoshita, A. Ogawa, T. Nakatani, Single channel target speaker extraction and recognition with SpeakerBeam, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2018, pp. 5554–5558.
[183] M. Delcroix, S. Watanabe, T. Ochiai, K. Kinoshita, S. Karita, A. Ogawa, T. Nakatani, End-to-end SpeakerBeam for single channel target speech recognition, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2019, pp. 451–455.
[184] N. Kanda, Y. Gaur, X. Wang, Z. Meng, Z. Chen, T. Zhou, T. Yoshioka, Joint speaker counting, speech recognition, and speaker identification for overlapped speech of any number of speakers, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2020, pp. 36–40.
[185] N. Kanda, Z. Meng, L. Lu, Y. Gaur, X. Wang, Z. Chen, T. Yoshioka, Minimum Bayes risk training for end-to-end speaker-attributed ASR, arXiv preprint arXiv:2011.02921 (2020).
[186] N. Kanda, Y. Gaur, X. Wang, Z. Meng, T. Yoshioka, Serialized output training for end-to-end overlapped speech recognition, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2020, pp. 2797–2801.
[187] J. Carletta, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos, W. Kraaij, M. Kronenthal, et al., The AMI meeting corpus: A pre-announcement, in: International Workshop on Machine Learning for Multimodal Interaction, Springer, 2005, pp. 28–39.
[188] A. Janin, D. Baron, J. Edwards, D. Ellis, D. Gelbart, N. Morgan, B. Peskin, T. Pfau, E. Shriberg, A. Stolcke, C. Wooters, The ICSI meeting corpus, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2003, pp. I–364–I–367.
[189] N. Ryant, K. Church, C. Cieri, A. Cristia, J. Du, S. Ganapathy, M. Liberman, The first DIHARD speech diarization challenge, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2018.
[190] N. Ryant, K. Church, C. Cieri, J. Du, S. Ganapathy, M. Liberman, Third DIHARD challenge evaluation plan, arXiv preprint arXiv:2006.05815 (2020).
[191] J. Barker, S. Watanabe, E. Vincent, J. Trmal, The fifth 'CHiME' speech separation and recognition challenge: Dataset, task and baselines, Proceedings of the Annual Conference of the International Speech Communication Association (2018) 1561–1565.
[192] J. S. Chung, J. Huh, A. Nagrani, T. Afouras, A. Zisserman, Spot the conversation: Speaker diarisation in the wild, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2020, pp. 299–303.
[193] V. Panayotov, G. Chen, D. Povey, S. Khudanpur, LibriSpeech: An ASR corpus based on public domain audio books, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2015, pp. 5206–5210.
[194] J. G. Fiscus, J. Ajot, M. Michel, J. S. Garofolo, The Rich Transcription 2006 spring meeting recognition evaluation, in: Proceedings of International Workshop on Machine Learning and Multimodal Interaction, 2006, pp. 309–322.
[195] P. E. Black, Hungarian algorithm, 2019. URL: https://xlinux.nist.gov/dads/HTML/HungarianAlgorithm.html.
[196] T. J. Park, P. Georgiou, Multimodal speaker segmentation and diarization using lexical and acoustic cues via sequence to sequence neural networks, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2018, pp. 1373–1377. URL: http://dx.doi.org/10.21437/Interspeech.2018-1364. doi:10.21437/Interspeech.2018-1364.
[197] J. S. Chung, A. Nagrani, E. Coto, W. Xie, M. McLaren, D. A. Reynolds, A. Zisserman, VoxSRC 2019: The first VoxCeleb speaker recognition challenge, arXiv preprint arXiv:1912.02522 (2019).
[198] C. Chiu, A. Tripathi, K. Chou, C. Co, N. Jaitly, D. Jaunzeikare, A. Kannan, P. Nguyen, H. Sak, A. Sankar, J. Tansuwan, N. Wan, Y. Wu, X. Zhang, Speech recognition for medical conversations, arXiv preprint arXiv:1711.07274 (2017). URL: http://arxiv.org/abs/1711.07274.
[199] J. Carletta, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos, W. Kraaij, M. Kronenthal, G. Lathoud, M. Lincoln, A. Lisowska, I. McCowan, W. Post, D. Reidsma, P. Wellner, The AMI meeting corpus: A pre-announcement, in: Proceedings of International Workshop on Machine Learning for Multimodal Interaction, 2006, pp. 28–39.
[200] W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, G. Zweig, Achieving human parity in conversational speech recognition, arXiv preprint arXiv:1610.05256 (2016). URL: http://arxiv.org/abs/1610.05256.
[201] G. Saon, G. Kurata, T. Sercu, K. Audhkhasi, S. Thomas, D. Dimitriadis, X. Cui, B. Ramabhadran, M. Picheny, L. Lim, B. Roomi, P. Hall, English conversational telephone speech recognition by humans and machines, arXiv preprint arXiv:1703.02136 (2017). URL: http://arxiv.org/abs/1703.02136.
[202] T. Yoshioka, N. Ito, M. Delcroix, A. Ogawa, K. Kinoshita, M. Fujimoto, C. Yu, W. Fabian, M. Espi, T. Higuchi, S. Araki, T. Nakatani, The NTT CHiME-3 system: Advances in speech enhancement and recognition for mobile multi-microphone devices, in: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding, 2015, pp. 436–443.
[203] J. Du, Y. Tu, L. Sun, F. Ma, H. Wang, J. Pan, C. Liu, J. Chen, C. Lee, The USTC-iFlytek system for CHiME-4 challenge, in: Proceedings of CHiME-4 Workshop, 2016, pp. 36–38.
[204] B. Li, T. N. Sainath, A. Narayanan, J. Caroselli, M. Bacchiani, A. Misra, I. Shafran, H. Sak, G. Pundak, K. Chin, K. C. Sim, R. J. Weiss, K. W. Wilson, E. Variani, C. Kim, O. Siohan, M. Weintraub, E. McDermott, R. Rose, M. Shannon, Acoustic modeling for Google Home, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2017, pp. 399–403.
[205] D. Dimitriadis, P. Fousek, Developing on-line speaker diarization system, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2017, pp. 2739–2743.
[206] A. Zhang, Q. Wang, Z. Zhu, J. Paisley, C. Wang, Fully supervised speaker diarization, arXiv preprint arXiv:1810.04719 (2018).
[207] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[208] K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask R-CNN, arXiv preprint arXiv:1703.06870 (2017). URL: http://arxiv.org/abs/1703.06870.
[209] T. Yoshioka, I. Abramovski, C. Aksoylar, Z. Chen, M. David, D. Dimitriadis, Y. Gong, I. Gurvich, X. Huang, Y. Huang, A. Hurvitz, L. Jiang, S. Koubi, E. Krupka, I. Leichter, C. Liu, P. Parthasarathy, A. Vinnikov, L. Wu, X. Xiao, W. Xiong, H. Wang, Z. Wang, J. Zhang, Y. Zhao, T. Zhou, Advances in online audio-visual meeting transcription, in: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding, 2019, pp. 276–283.
[210] H. Buchner, R. Aichner, W. Kellermann, A generalization of blind source separation algorithms for convolutive mixtures based on second-order statistics, IEEE Transactions on Speech and Audio Processing 13 (2005) 120–134.
[211] H. Sawada, S. Araki, S. Makino, Measuring dependence of bin-wise separated signals for permutation alignment in frequency-domain BSS, in: Proceedings of International Symposium on Circuits and Systems, 2007, pp. 3247–3250.
[212] F. Nesta, P. Svaizer, M. Omologo, Convolutive BSS of short mixtures by ICA recursively regularized across frequencies, IEEE Transactions on Audio, Speech, and Language Processing 19 (2011) 624–639.
[213] H. Sawada, S. Araki, S. Makino, Underdetermined convolutive blind source separation via frequency bin-wise clustering and permutation alignment, IEEE Transactions on Audio, Speech, and Language Processing 19 (2011) 516–527.
[214] N. Ito, S. Araki, T. Yoshioka, T. Nakatani, Relaxed disjointness based clustering for joint blind source separation and dereverberation, in: Proceedings of International Workshop on Acoustic Echo and Noise Control, 2014, pp. 268–272.
[215] L. Drude, R. Haeb-Umbach, Tight integration of spatial and spectral features for BSS with deep clustering embeddings, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2017, pp. 2650–2654.
[216] M. Maciejewski, G. Sell, L. P. Garcia-Perera, S. Watanabe, S. Khudanpur, Building corpora for single-channel speech separation across multiple domains, arXiv preprint arXiv:1811.02641 (2018). URL: http://arxiv.org/abs/1811.02641.
[217] S. Araki, N. Ono, K. Kinoshita, M. Delcroix, Meeting recognition with asynchronous distributed microphone array using block-wise refinement of mask-based MVDR beamformer, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2018, pp. 5694–5698.
[218] A. Stolcke, Making the most from multiple microphones in meeting recordings, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2011, pp. 4992–4995.
[219] S. Narayanan, P. G. Georgiou, Behavioral signal processing: Deriving human behavioral informatics from speech and language, Proceedings of the IEEE 101 (2013) 1203–1233.
[220] D. Bone, C.-C. Lee, T. Chaspari, J. Gibson, S. Narayanan, Signal processing and machine learning for mental health research and clinical applications, IEEE Signal Processing Magazine 34 (2017) 189–196.
[221] M. Kumar, S. H. Kim, C. Lord, S. Narayanan, Speaker diarization for naturalistic child-adult conversational interactions using contextual information, Journal of the Acoustical Society of America 147 (2020) EL196–EL200.
[222] P. G. Georgiou, M. P. Black, S. S. Narayanan, Behavioral signal processing for understanding (distressed) dyadic interactions: Some recent developments, in: Proceedings of the Joint ACM Workshop on Human Gesture and Behavior Understanding, 2011, pp. 7–12.
[223] B. Xiao, C. Huang, Z. E. Imel, D. C. Atkins, P. Georgiou, S. S. Narayanan, A technology prototype system for rating therapist empathy from audio recordings in addiction counseling, PeerJ Computer Science 2 (2016) e59.
[224] S. N. Chakravarthula, M. Nasir, S.-Y. Tseng, H. Li, T. J. Park, B. Baucom, C. J. Bryan, S. Narayanan, P. Georgiou, Automatic prediction of suicidal risk in military couples using multimodal interaction cues from couples conversations, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2020, pp. 6539–6543.
[225] B. Mirheidari, D. Blackburn, K. Harkness, T. Walker, A. Venneri, M. Reuber, H. Christensen, Toward the automation of diagnostic conversation analysis in patients with memory complaints, Journal of Alzheimer's Disease 58 (2017) 373–387.
[226] G. P. Finley, E. Edwards, A. Robinson, N. Sadoughi, J. Fone, M. Miller, D. Suendermann-Oeft, M. Brenndoerfer, N. Axtmann, An automated assistant for medical scribes, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2018, pp. 3212–3213.
[227] A. Guo, A. Faria, J. Riedhammer, Remeeting – Deep insights to conversations, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2016, pp. 1964–1965.
[228] A. Addlesee, Y. Yu, A. Eshghi, A comprehensive evaluation of incremental speech recognition and diarization for conversational AI, in: Proceedings of the International Conference on Computational Linguistics, 2020, pp. 3492–3503.
[229] O. Cetin, E. Shriberg, Speaker overlaps and ASR errors in meetings: Effects before, during, and after the overlap, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, volume 1, IEEE, 2006, pp. 357–360.
[230] N. Kanda, C. Boeddeker, J. Heitkaemper, Y. Fujita, S. Horiguchi, K. Nagamatsu, R. Haeb-Umbach, Guided source separation meets a strong ASR backend: Hitachi/Paderborn University joint investigation for dinner party ASR, Proceedings of the Annual Conference of the International Speech Communication Association (2019) 1248–1252.
[231] S. Otterson, M. Ostendorf, Efficient use of overlap information in speaker diarization, in: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding, IEEE, 2007, pp. 683–686.
[232] K. Boakye, B. Trueba-Hornero, O. Vinyals, G. Friedland, Overlapped speech detection for improved speaker diarization in multiparty meetings, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2008, pp. 4353–4356.
[233] L. Bullock, H. Bredin, L. P. Garcia-Perera, Overlap-aware diarization: Resegmentation using neural end-to-end overlapped speech detection, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2020, pp. 7114–7118.