Content-Aware Speaker Embeddings for Speaker Diarisation
G. Sun, D. Liu, C. Zhang, P. C. Woodland
Cambridge University Engineering Dept., Trumpington St., Cambridge, CB2 1PZ U.K.
{gs534, dl567, cz277, pcw}@eng.cam.ac.uk

ABSTRACT
Recent speaker diarisation systems often convert variable-length speech segments into fixed-length vector representations for speaker clustering, which are known as speaker embeddings. In this paper, the content-aware speaker embeddings (CASE) approach is proposed, which extends the input of the speaker classifier to include not only acoustic features but also their corresponding speech content, via phone, character, and word embeddings. Compared to alternative methods that leverage similar information, such as multitask or adversarial training, CASE factorises automatic speech recognition (ASR) from speaker recognition to focus on modelling speaker characteristics and correlations with the corresponding content units to derive more expressive representations. CASE is evaluated for speaker re-clustering with a realistic speaker diarisation setup using the AMI meeting transcription dataset, where the content information is obtained by performing ASR based on an automatic segmentation. Experimental results showed that CASE achieved a 17.8% relative speaker error rate reduction over conventional methods.
Index Terms — content-aware speaker embedding, diarisation, d-vector, speech recognition, distributed representation
1. INTRODUCTION
Speaker diarisation, the task of finding "who spoke when" in a multi-speaker audio stream, is a critical component in automatic speech recognition (ASR) systems for transcribing conversations, meetings, or broadcast shows. A typical diarisation system can often be divided into two stages: segmenting the audio into speaker-homogeneous intervals and clustering those intervals into groups that should each correspond to a single speaker. Nowadays, the speaker clustering stage often converts variable-length speech segments into fixed-length vectors, referred to as speaker embeddings, which characterise the speaker identity in a multi-dimensional space in which the clustering can be performed. The variations present in a spoken utterance include differences in speaker identity, the content and style of the utterance, and microphone, channel and noise characteristics. Traditional Gaussian mixture model-based speaker-dependent ASR systems either normalise the differences between speakers or jointly model them with the phonetic units [9, 13, 38]. For both speaker recognition and diarisation, the i-vector approach models all variations together using joint factor analysis [14]. With deep learning, neural network (NN) models are trained to discriminate between the training set speakers for each frame [16, 17] or segment [3, 4, 15, 19–21]. Output vectors from the penultimate layer of such NN models are extracted and referred to as speaker embeddings. For simplicity, all kinds of NN-based speaker embeddings [4, 15–17] are referred to as d-vectors without distinction in this paper.
G. Sun is funded by the Cambridge Trust.
Although d-vectors are very effective in encoding speaker-dependent acoustic features (e.g., vocal tract length), they include less information about content-related features (e.g., common terms of use) than i-vectors [8], because an NN classifier can effectively reduce the variations that are less relevant to its targets in its first few layers. One solution to this problem is to train the NN model to jointly classify both speaker and phonetic units [18] in a multi-task or adversarial learning framework [32–34], which forces the d-vector to encode phone-related information. However, it is not straightforward to extend this method to models that are trained to extract d-vectors over hundreds of frames in a segment using a statistical pooling [15, 21] or a self-attentive structure [2–4, 35, 36], as the classification of speakers and of phonetic units often requires very different windows of input features.

In this paper, a content-aware speaker embedding (CASE) generation scheme is proposed, which encodes content-related features into speaker embeddings by simply extending the input of the embedding extraction network to include additional information from the speech content (at the phone, character or word level). Depending on the task, the speech content can be obtained either from the reference or using an ASR system. Phone, character, and word embeddings are used to encapsulate the content information, and are appended to the corresponding original acoustic features according to the content-to-frame alignments. Compared to other methods, CASE requires no change to the training method or the model structure except for the input layer, and is generally applicable to any type of NN used for embedding extraction. Meanwhile, CASE can lead to more expressive speaker embeddings since it factorises speech recognition from the speaker embedding extraction model by explicitly conditioning on the speech content. The proposed CASE generation scheme was evaluated on the AMI meeting corpus. Experimental results on speaker clustering for diarisation showed that the diarisation error rate (DER) was improved by a clear margin when CASE-based d-vectors were used. Moreover, CASE can be applied to end-to-end speaker diarisation [25–27] and also to text-dependent and text-independent speaker identification and verification.

The remainder of this paper is organised as follows. Section 2 reviews the diarisation pipeline used in this paper. Section 3 presents the detailed use of CASE in d-vector extraction. The experimental setup and results are given in Secs. 4 and 5, followed by the conclusions in Sec. 6.
2. SPEAKER DIARISATION PIPELINE
Our speaker diarisation pipeline includes a neural voice activity detection (VAD) module, a neural change point detection (CPD) module, a speaker embedding extraction model and a clustering algorithm [37]. The VAD distinguishes between speech and non-speech, and selects the segments of the audio stream that correspond to speech.
[Fig. 1: Audio Stream → Neural VAD → Neural CPD → Embedding Extraction → Clustering → Diarisation Output, with the speech content (e.g. "welcome everybody") entering via dashed connections.]
Fig. 1. Our speaker diarisation pipeline. The dashed lines indicate where the content information can be added for the CASE scheme.

The CPD stage splits the speech segments into speaker-homogeneous intervals. Then, the embedding extraction module generates d-vectors for those intervals. Finally, a spectral clustering algorithm is used to group similar intervals, and one speaker label is assigned to all the intervals in the same group. The pipeline is illustrated in Fig. 1.
Both the VAD and CPD models are built as NN-based frame-level binary classifiers. The VAD model is a DNN with seven fully-connected layers and ReLU activation functions. The key strength of this DNN structure is the use of a large input window covering 55 consecutive frames (27 on each side), which provides sufficient information for high-performance speech and non-speech classification [1].

The CPD model adopts a ReLU recurrent neural network (RNN) to encode the past and future inputs (covering 50 frames on each side) into two vectors respectively, which are then fused into one vector using the Hadamard product, followed by a softmax fully-connected layer that classifies the frame as a speaker change point or not. Treating the RNN output vectors as speaker representations for the past and future audio segments, the Hadamard product and the output layer can be seen as deciding whether the speaker identity changes by comparing the speaker representations before and after the current time. To encourage the RNN to encode better speaker representations, frame-level d-vectors from a time-delay neural network (TDNN) model, trained by classifying the training set speakers for each frame, are used as the input to the RNN. The CPD model, including the TDNN, the RNN, and the output layer, is then jointly trained to perform speaker change/non-change classification.
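A minimal sketch of the CPD head described above is given below, assuming frame-level d-vectors have already been produced by the TDNN; the use of torch.nn.RNN with ReLU units and the exact layer sizes are illustrative choices, not the authors' exact configuration.

import torch
import torch.nn as nn

class ChangePointDetector(nn.Module):
    def __init__(self, dvec_dim=128, hidden_dim=128):
        super().__init__()
        # Two ReLU RNNs: one encodes the 50 frames before t, one the 50 frames after t.
        self.past_rnn = nn.RNN(dvec_dim, hidden_dim, nonlinearity='relu', batch_first=True)
        self.future_rnn = nn.RNN(dvec_dim, hidden_dim, nonlinearity='relu', batch_first=True)
        self.classifier = nn.Linear(hidden_dim, 2)  # softmax layer over {no change, change}

    def forward(self, past_dvecs, future_dvecs):
        # past_dvecs:   (batch, 50, dvec_dim) frame-level d-vectors before time t
        # future_dvecs: (batch, 50, dvec_dim) frame-level d-vectors after time t,
        #               reversed in time so the final RNN state summarises the audio just after t
        _, h_past = self.past_rnn(past_dvecs)
        _, h_future = self.future_rnn(future_dvecs)
        # The Hadamard product compares the two speaker representations around time t.
        fused = h_past.squeeze(0) * h_future.squeeze(0)
        return self.classifier(fused)  # logits; trained with cross-entropy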
Speaker embeddings in this paper are the penultimate layer outputs of a TDNN-based model trained to perform classification among the training set speakers. In both training and test, each variable-length speech interval is first split into multiple fixed-length windows with a certain amount of overlap, and a speaker embedding is then extracted for each window. Window-level speaker embeddings, or window-level d-vectors, are extracted by aggregating the frame-level d-vectors in a specific window using a multi-head self-attentive layer [2, 3]. In order to encode more diverse information by increasing the dissimilarities between different attention heads, the modified penalty term described in [4] is also adopted.

A spectral clustering algorithm [5] based on the cosine similarity, with post-processing, is used to cluster the speaker embeddings. Spectral clustering is first performed on the window-level d-vectors, where the number of clusters is determined by the maximum eigenvalue drop-off [6]. Then, the variable-length speech intervals from the CPD are considered speaker-homogeneous, and each interval is assigned to the cluster whose centroid has the smallest cosine distance to the average of the window-level d-vectors of that interval.
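A minimal sketch of this clustering stage follows, assuming the window-level d-vectors are stacked in a matrix. The affinity refinement steps of [5] are omitted; only the cosine affinity, the eigenvalue drop-off estimate of the number of clusters, and k-means over the spectral embeddings are shown.

import numpy as np
from sklearn.cluster import KMeans

def spectral_cluster(embeddings, max_speakers=10):
    # Cosine-similarity affinity between window-level d-vectors (num_windows x dim).
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    affinity = np.maximum(normed @ normed.T, 0.0)   # clip negatives to keep a valid graph
    # Unnormalised graph Laplacian L = D - A; eigh returns eigenvalues in ascending order.
    laplacian = np.diag(affinity.sum(axis=1)) - affinity
    eigvals, eigvecs = np.linalg.eigh(laplacian)
    # Number of clusters from the largest eigenvalue drop-off (eigengap heuristic [6]).
    gaps = np.diff(eigvals[:max_speakers + 1])
    num_clusters = int(np.argmax(gaps)) + 1
    # k-means over the first num_clusters eigenvectors yields the window labels.
    return KMeans(n_clusters=num_clusters, n_init=10).fit_predict(eigvecs[:, :num_clusters])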
3. CONTENT-AWARE REPRESENTATIONS

3.1. Extended inputs using content
The CASE generation scheme improves the speaker embeddings by extending the input of any NN model for speaker embedding extraction to incorporate the content information corresponding to each frame. Specifically, for a model taking inputs from multiple frames,

P(s \mid x_1, \ldots, x_T) = \sum_{W'} P(s \mid x_1, \ldots, x_T, W')\, P(W' \mid x_1, \ldots, x_T),   (1)

where s is a speaker, W' is a string corresponding to the content of the T frames, and x_t represents the acoustic features at frame t. The sum over all possible hypotheses often requires a lattice. To avoid this intractable computation, P(W' \mid x_1, \ldots, x_T) is approximated by a Kronecker delta function: letting W be the reference or 1-best hypothesis string of the content, P(W' \mid x_1, \ldots, x_T) equals one if and only if W' = W, and zero otherwise. Hence Eqn. (1) can be rewritten as

P(s \mid x_1, \ldots, x_T, W) = P(s \mid x_1, \ldots, x_T, w_1, \ldots, w_T),   (2)

where w_1, \ldots, w_T are vector representations of the time alignment of W. Eqn. (2) gives the simple form of CASE used throughout this paper, which extends the model input at time t to be the concatenation of x_t and w_t. Content information at the phone, character and word levels, with corresponding vector representations, is explored in this paper. Phones and characters are represented using a 1-of-k (one-hot) encoding, which sets the element at the index corresponding to the unit to 1 and all other elements to 0. The word embeddings are derived from GloVe [7], which represents words with compact vectors of continuous values.

Compared to traditional methods, the CASE scheme generates more expressive embeddings by using higher-level features. Although segment-level models with joint training on each frame can implicitly learn some of those high-level features, such learning capability depends on the length of the segments, whose maximum is restricted by the power of the temporal pooling method as well as computation and storage limits (e.g., the GPU memory size). The CASE scheme, however, is able to incorporate high-level features beyond the scope of a segment into the speaker embeddings, as the content sequence obtained using ASR often requires searching through a complete utterance.

Furthermore, CASE also reduces the difficulty of learning both low-level and high-level features with the same model, which can help generate more accurate speaker embeddings. It has been shown that an NN acoustic model tends to eliminate speaker-related information (such as accent or channel) in its input layers, as such information is detrimental to classifying speaker-independent phonetic or linguistic units [9]. Vice versa, a model trained for speaker classification does not preserve much information related to the content [8]. This phenomenon causes a conflict when encoding both low-level and high-level (including linguistic and phonetic) features into a speaker embedding using the same set of parameters. CASE, however, addresses this issue by decoupling the linguistic and phonetic information from the speaker embedding extraction model.
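To make the input construction of Eqn. (2) concrete, the following minimal sketch builds CASE inputs from frame-level phone alignments; the dimensions follow Sec. 4, while the function and variable names are illustrative rather than taken from the authors' implementation.

import numpy as np

NUM_PHONES = 48  # phone inventory size, matching the 48-d phone embeddings of Sec. 4

def case_inputs(fbk_frames, phone_alignment):
    # fbk_frames:      (T, 40) log-Mel filter bank features x_1, ..., x_T
    # phone_alignment: length-T sequence of phone indices, one per frame, obtained by
    #                  forced alignment of the reference (training) or 1-best hypothesis (test)
    one_hot = np.eye(NUM_PHONES)[np.asarray(phone_alignment)]    # (T, 48) one-hot vectors w_t
    return np.concatenate([fbk_frames, one_hot], axis=1)         # (T, 88) extended inputs [x_t; w_t]

Character-level inputs follow the same pattern with 27-d one-hot vectors, and word-level inputs replace the one-hot vectors with projected GloVe embeddings.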
The CASE generation scheme is applied to the extraction of window-level d-vectors for speaker clustering. When using CASE, the content-to-frame alignments used as w_t need to be obtained:

• At training time, reference transcripts are first aligned with a pre-trained ASR system to derive the content (word, phone or character) vectors used for CASE.
• At test time, if a manual segmentation is available, 1-best hypotheses from an ASR system can be obtained and aligned using the manual segments. The aligned 1-best hypotheses are then used as the content information in CASE.
• If manual segments are not available, an automatic segmentation produced by an existing speaker diarisation system is used to obtain the 1-best hypotheses using ASR, which are then aligned to frames to produce CASE-based d-vectors (a schematic of this two-pass procedure is sketched below).

There are also situations where reference transcriptions are available for both training and test. For example, CASE can be used to improve the clustering of a speech corpus without speaker labels, such as in [10]. Reference transcriptions can also be available in text-dependent speaker identification and verification tasks.
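The third scenario above corresponds to the realistic setup evaluated in Sec. 5. The sketch below shows its control flow; every callable is a placeholder, passed in as an argument, for a pipeline component from Sec. 2 or an external ASR system.

def case_diarise(audio, vad, cpd, extract_dvectors, extract_case_dvectors,
                 asr_decode, align_to_frames, cluster):
    # Pass 1: baseline diarisation produces the automatic segmentation.
    segments = cpd(vad(audio))                         # speaker-homogeneous intervals
    cluster(extract_dvectors(audio, segments))         # baseline clustering (first-pass labels)
    # Decode the automatic segments and align the 1-best hypotheses to frames.
    hypotheses = asr_decode(audio, segments)
    alignment = align_to_frames(hypotheses, segments)  # content-to-frame alignment -> w_t
    # Pass 2: extract CASE d-vectors conditioned on the content and re-cluster.
    return cluster(extract_case_dvectors(audio, segments, alignment))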
4. EXPERIMENTAL SETUP

4.1. Data Preparation
All of the data preparation and model training was done using an extended version of HTK version 3.5.1 and PyHTK [11, 12]. All systems were trained on the augmented multi-party interaction (AMI) meeting corpus. The training set contains 135 recorded meetings with 155 speakers, of which 10% of the data for each speaker was used for held-out validation during training. The development (Dev) and evaluation (Eval) sets from the speech recognition partition were used to evaluate the performance of the proposed systems.
The input acoustic features were 40-dimensional (-d) log-Mel filter banks (FBK) with a 25 ms frame size and a 10 ms frame increment, and were extracted from the multiple distant microphone (MDM) audio data pre-processed by BeamformIt [31]. A ReLU TDNN model with three 256-d hidden layers covering a context of [-7, +7] and a penultimate layer projecting to 128-d was used to extract frame-level d-vectors. For CPD, an RNN with 128-d hidden states was used (increasing the layer width in our experimental setup did not lead to improvements in DER). To extract window-level d-vectors, a 2-second sliding window was applied with a 1-second overlap between adjacent windows [4]. Phone and character embeddings were 48-d and 27-d respectively, and 300-d word embeddings were projected to 100-d before concatenation. The phone and character embedding layers and the word embedding projection layer were jointly trained with the speaker embedding model. Moreover, instead of the standard softmax output activation with the cross-entropy loss, the angular softmax training loss [22–24] was adopted, with the m factor set so that the derived embeddings are trained to discriminate speakers based on the cosine distance.

Model performance was evaluated in terms of the diarisation error rate (DER), which is the sum of the speaker error rate (SER), missed speech (MS) and false alarm (FA) rates. As many manual segments in AMI were found to have very long non-speech parts, which result in many unnecessary overlapping regions and also affect the clustering performance, a modified version of the manual segmentation was created by comparing each original segment with frame-level speech and non-speech alignments generated by forced alignment using a pre-trained speech recognition system [11]. The original reference was also modified accordingly, to form a modified reference that matches the modified segmentation. Details of the original and modified segments and the corresponding references, together with the speech recognition partition of the data used in this paper, are available to download from https://github.com/BriansIDP/AMI_diar_references.git.

The CASE-based d-vector systems were compared to a baseline system in which d-vectors were extracted without using the content information. For completeness, the SERs of two other methods that use content information, namely multi-task training [32] and adversarial training [33], are also reported. In the multi-task training method, d-vector systems were trained to perform speaker and phone classification simultaneously by adding a phone classification layer. In adversarial training, a gradient reversal operation is added to the d-vector extraction layer on top of the multi-task training setup.

The ASR system used in this paper was trained using the Kaldi [28] speech recognition toolkit, based on the same HTK 40-d FBK features extracted only from the beamformed AMI MDM data. The speaker-independent acoustic model has six convolutional neural network layers followed by fifteen factorised TDNN blocks with residual connections, and the final layer has 2,312 output units, each representing a context-dependent phone. Lattice-free MMI training [29] with SpecAugment data augmentation [30] was used to obtain the acoustic model. A 4-gram language model estimated using the transcriptions from both the AMI and Fisher corpora was used for all decoding throughout the paper.
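As an illustration of the window-level extraction described above, the sketch below slides a 2-second window with a 1-second overlap over the FBK frames of one interval; frame_dvec_model and attentive_pool are placeholder callables standing in for the TDNN and the multi-head self-attentive layer of Sec. 2.

import numpy as np

WIN, HOP = 200, 100  # 2 s window and 1 s hop (i.e. 1 s overlap) at a 10 ms frame increment

def window_dvectors(features, frame_dvec_model, attentive_pool):
    # features: (T, 40) FBK frames of one speaker-homogeneous interval
    dvecs = []
    for start in range(0, max(len(features) - WIN, 0) + 1, HOP):
        window = features[start:start + WIN]
        frame_dvecs = frame_dvec_model(window)      # (WIN, 128) frame-level d-vectors
        dvecs.append(attentive_pool(frame_dvecs))   # one 128-d window-level d-vector
    return np.stack(dvecs)                          # trailing frames shorter than WIN are dropped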
5. EXPERIMENTAL RESULTS

5.1. Results with manual segmentation
The first set of diarisation experiments was performed using manual segments to directly show the effect of CASE on speaker clustering. The MS and FA rates in this case are zero, and the SER for each system is shown in Table 1. The alignments obtained using the manual segmentation and the reference transcription are denoted as reference, while those obtained using the manual segmentation and the recognition results from the ASR system are denoted as hypothesis.

Table 1. %SERs with the alignment, ASR decoding, and speaker clustering all based on the manual segmentation. "w", "p" and "c" represent the use of word, phone and character information respectively.

Systems                   Reference (%)      Hypothesis (%)
                          Dev     Eval       Dev     Eval
CASE dvec. (w)            13.8    16.2       13.9    16.3
CASE dvec. (w + p + c)

In general, using CASE-based d-vectors consistently improves the SER on the Dev set, while the results are more varied on the Eval set, as the threshold value for the spectral clustering pre-processing stage is determined based on the Dev set performance. Using the ASR hypothesis in the CASE scheme slightly degrades the SERs. Although the CASE-based system with word, phone and character information achieves the lowest SER on Dev, its SER on Eval is worse than the baseline result, due to over-fitting caused by the use of the word embeddings. The system with only phone and character information achieved consistently better SERs than the baseline on both Dev and Eval, with relative SER reductions of 23% and 11% on Dev and Eval respectively.
5.2. Results with automatic segmentation

In this section, all speaker clustering was performed based on the automatic segments produced by the full diarisation pipeline. Since the use of CPD does not cause any change to the MS and FA, all SERs reported in this section can be converted to DERs by adding the MS and FA rates shown in Table 2.

Table 2. VAD MS and FA on Dev and Eval.

        Dev             Eval
MS      FA      MS      FA
1.2%    4.0%    1.3%    3.6%

Speaker clustering results with the automatic segmentation are listed in Table 3, where the hypotheses were decoded with a real ASR system based on the manual segmentation. With CASE-based d-vectors, similar improvements to those with the manual segmentation were observed, and the performance degradation caused by using hypotheses remains within a reasonable range. Our best CASE-based diarisation system achieved 10% relative SER reductions on both Dev and Eval.

Table 3. %SERs with the alignment and ASR decoding based on the manual segmentation, and the speaker clustering based on the automatic segmentation. %DERs can be calculated by adding %FA and %MS from Table 2 to the %SERs.

Systems                   Reference (%)      Hypothesis (%)
                          Dev     Eval       Dev     Eval
Baseline dvec.            13.0    14.6       13.0    14.6
Multi-task dvec.          14.3    13.6       14.3    13.6
Adversarial dvec.         14.0    16.6       14.0    16.6
CASE dvec. (p)            12.6    14.3       12.7    14.1
CASE dvec. (c)            10.8    14.1       12.1    14.3
CASE dvec. (p + c)        10.7
CASE dvec. (w)            12.5    14.8       12.3    14.4
CASE dvec. (w + p + c)
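To make the conversion in the Table 3 caption concrete, the Dev-set DERs follow by simple addition of the Table 2 rates (a worked example using numbers already reported above, not an additional result):

ms, fa = 1.2, 4.0                    # Dev MS and FA from Table 2 (%)
print(13.0 + ms + fa)                # baseline Dev DER: 18.2%
print(10.8 + ms + fa)                # CASE dvec. (c) Dev DER (reference alignment): 16.0%
print(100 * (13.0 - 10.8) / 13.0)    # relative SER reduction of CASE (c): ~16.9%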
Finally, the CASE scheme was applied to the situation where neither the manual segmentation nor the reference transcripts are available, which matches a real diarisation setting. The hypotheses were obtained by decoding with the ASR system based on the automatic segmentation, and were then aligned and used for the CASE-based d-vector extraction. Specifically, at test time, the audio stream of a meeting is first passed through VAD, CPD, and the baseline d-vector extraction and clustering, to obtain the automatic segmentation. ASR is then used to decode these automatic segments to derive the hypotheses, which are aligned and used for CASE-based d-vector extraction and re-clustering. The word error rates (WERs) with both manual and automatic segmentation are listed in Table 4.

Table 4. %WERs on Dev and Eval generated by our ASR system using the MDM data, based on the manual and automatic segmentation.

Dev %WER                 Eval %WER
manual    automatic      manual    automatic
35.7      39.7           38.7      42.9

Since our current diarisation pipeline is not able to detect overlapping speech, the overlapping regions are decoded in the same way as the non-overlapping regions, which generates more deletion errors and degrades the overall WERs. The CASE-based d-vectors were then extracted based on these hypotheses, followed by re-clustering; the SERs are shown in Table 5.

Table 5. %SERs with the alignment, ASR decoding, and speaker clustering all based on the automatic segmentation. %DERs can be calculated by adding %FA and %MS from Table 2 to the %SERs.

Systems                   Dev %SER    Eval %SER
Baseline dvec.            13.0        14.6
Multi-task dvec.          14.3        13.6
Adversarial dvec.         14.0        16.6
CASE dvec. (p)            13.2        14.3
CASE dvec. (c)            11.8        15.1
CASE dvec. (p + c)        11.0
CASE dvec. (w)            13.1        14.7
CASE dvec. (w + p + c)

From Table 5, the CASE-based systems in general outperform the baseline and the other methods tested. The best performing system is the CASE-based d-vector system with both phone and character embeddings, which achieves 15.4% and 18.6% relative SER reductions compared to the baseline on Dev and Eval respectively.
6. CONCLUSIONS
This paper proposes the CASE scheme, which incorporates speech content information into speaker embedding extraction for speaker diarisation. The CASE scheme is applied at the speaker embedding extraction stage by appending to each input acoustic feature its corresponding phone, character or word representation. Experiments on the AMI corpus used both an oracle setup, with manual segmentation and reference transcriptions, and a realistic setup, with automatic segmentation and hypothesis transcriptions. Our best performing CASE-based system, using both phone and character embeddings, consistently outperforms all baseline methods under all conditions.

REFERENCES

[1] L. Wang, C. Zhang, P.C. Woodland, M.J.F. Gales, P. Karanasou, P. Lanchantin, X. Liu & Y. Qian, "Improved DNN-based segmentation for multi-genre broadcast audio", Proc. ICASSP, Shanghai, 2016.
[2] Z. Lin, M. Feng, C.N. dos Santos, M. Yu, B. Xiang, B. Zhou & Y. Bengio, "A structured self-attentive sentence embedding", Proc. ICLR, Toulon, 2017.
[3] Y. Zhu, T. Ko, D. Snyder, B. Mak & D. Povey, "Self-attentive speaker embeddings for text-independent speaker verification", Proc. ICASSP, Calgary, 2018.
[4] G. Sun, C. Zhang & P.C. Woodland, "Speaker diarisation using 2D self-attentive combination of embeddings", Proc. ICASSP, Brighton, 2019.
[5] Q. Wang, C. Downey, L. Wan, P. Mansfield & I.L. Moreno, "Speaker diarization with LSTM", Proc. ICASSP, 2018.
[6] U. von Luxburg, "A tutorial on spectral clustering", Statistics and Computing, 17, pp. 395–416, 2007.
[7] J. Pennington, R. Socher & C.D. Manning, "GloVe: Global vectors for word representation", Proc. EMNLP, Doha, 2014.
[8] S. Wang, Y. Qian & K. Yu, "What does the speaker embedding encode?", Proc. Interspeech, Stockholm, 2017.
[9] M. Ferras, C.C. Leung, C. Barras & J. Gauvain, "Constrained MLLR for speaker recognition", Proc. ICASSP, 2007.
[10] P. Bell, M.J.F. Gales, T. Hain, J. Kilgour, P. Lanchantin, X. Liu, A. McParland, S. Renals, O. Saz, M. Wester & P.C. Woodland, "The MGB challenge: Evaluating multi-genre broadcast media recognition", Proc. ASRU, Scottsdale, 2015.
[11] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, A. Ragni, V. Valtchev, P. Woodland & C. Zhang, "The HTK Book (HTK 3.5)", Cambridge University Engineering Department, 2015.
[12] C. Zhang, F. Kreyssig, Q. Li & P.C. Woodland, "PyHTK: Python library and ASR pipelines for HTK", Proc. ICASSP, Brighton, 2019.
[13] P. Kenny, G. Boulianne & P. Dumouchel, "Eigenvoice modeling with sparse training data", IEEE Transactions on Speech and Audio Processing, 13(3), pp. 345–354, 2005.
[14] S.H. Shum, N. Dehak, E. Chuangsuwanich, D. Reynolds & J. Glass, "Exploiting intra-conversation variability for speaker diarization", Proc. Interspeech, 2011.
[15] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey & S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition", Proc. ICASSP, Calgary, 2018.
[16] E. Variani, X. Lei, E. McDermott, I.L. Moreno & J. Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification", Proc. ICASSP, Florence, 2014.
[17] Y.Z. Işık, H. Erdoğan & R. Sarıkaya, "S-vector: A discriminative representation derived from i-vector for speaker verification", Proc. EUSIPCO, Nice, 2015.
[18] Y. Liu, Y. Qian, N. Chen, T. Fu, Y. Zhang & K. Yu, "Deep feature for text-dependent speaker verification", Speech Communication, 73, pp. 1–13, 2015.
[19] F.A.R.R. Chowdhury, Q. Wang, I.L. Moreno & L. Wan, "Attention-based models for text-dependent speaker verification", Proc. ICASSP, Calgary, 2018.
[20] G. Heigold, I.L. Moreno, S. Bengio & N. Shazeer, "End-to-end text-dependent speaker verification", Proc. ICASSP, 2016.
[21] D. Garcia-Romero, D. Snyder, G. Sell, D. Povey & A. McCree, "Speaker diarization using deep neural network embeddings", Proc. ICASSP, New Orleans, 2017.
[22] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj & L. Song, "Deep hypersphere embedding for face recognition", Proc. CVPR, 2017.
[23] Z. Huang, S. Wang & K. Yu, "Angular softmax for short-duration text-independent speaker verification", Proc. Interspeech, Hyderabad, 2018.
[24] Y. Fathullah, C. Zhang & P.C. Woodland, "Improved large-margin softmax loss for speaker diarisation", Proc. ICASSP, Barcelona, 2020.
[25] A. Zhang, Q. Wang, Z. Zhu, J. Paisley & C. Wang, "Fully supervised speaker diarization", Proc. ICASSP, Brighton, 2019.
[26] Y. Fujita, N. Kanda, S. Horiguchi, K. Nagamatsu & S. Watanabe, "End-to-end neural speaker diarization with permutation-free objectives", arXiv:1909.05952, 2019.
[27] Q. Li, F.L. Kreyssig, C. Zhang & P.C. Woodland, "Discriminative neural clustering for speaker diarisation", arXiv:1910.09703, 2019.
[28] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlíček, Y. Qian, P. Schwarz, J. Silovský, G. Stemmer & K. Veselý, "The Kaldi speech recognition toolkit", Proc. ASRU, Hawaii, 2011.
[29] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang & S. Khudanpur, "Purely sequence-trained neural networks for ASR based on lattice-free MMI", Proc. Interspeech, San Francisco, 2016.
[30] D.S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E.D. Cubuk & Q.V. Le, "SpecAugment: A simple data augmentation method for automatic speech recognition", Proc. Interspeech, Graz, 2019.
[31] X. Anguera, C. Wooters & J. Hernando, "Acoustic beamforming for speaker diarization of meetings", IEEE Transactions on Audio, Speech and Language Processing, 15(7), pp. 2011–2022, 2007.
[32] Y. Liu, L. He, J. Liu & M.T. Johnson, "Introducing phonetic information to speaker embedding for speaker verification", EURASIP Journal on Audio, Speech, and Music Processing, 19, 2019.
[33] Z. Meng, Y. Zhao, J. Li & Y. Gong, "Adversarial speaker verification", Proc. ICASSP, Brighton, 2019.
[34] Z. Chen, S. Wang, Y. Qian & K. Yu, "Channel invariant speaker embedding learning with joint multi-task and adversarial training", Proc. ICASSP, Barcelona, 2020.
[35] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville & Y. Bengio, "Generative adversarial networks", Proc. NIPS, Montreal, 2014.
[36] M. Pal, M. Kumar, R. Peri, T.J. Park, S.H. Kim, C. Lord, S. Bishop & S. Narayanan, "Meta-learning with latent space clustering in generative adversarial network for speaker diarization", arXiv:2007.09635, 2020.
[37] G. Sun, C. Zhang & P.C. Woodland, "Combination of deep speaker embeddings for diarisation", arXiv:2010.12025, 2020.
[38] Y. Lei, N. Scheffer, L. Ferrer & M. McLaren, "A novel scheme for speaker recognition using a phonetically-aware deep neural network", Proc. ICASSP, Florence, 2014.