Speaker attribution with voice profiles by graph-based semi-supervised learning
Jixuan Wang, Xiong Xiao, Jian Wu, Ranjani Ramamurthy, Frank Rudzicz, Michael Brudno
University of Toronto, Canada; Vector Institute, Canada; Microsoft, USA
{jixuan, frank, brudno}@cs.toronto.edu, {xioxiao, jianwu, ranjanir}@microsoft.com

Abstract
Speaker attribution is required in many real-world applications, such as meeting transcription, where speaker identity is assigned to each utterance according to speaker voice profiles. In this paper, we propose to solve the speaker attribution problem by using graph-based semi-supervised learning methods. A graph of speech segments is built for each session, on which segments from voice profiles are represented by labeled nodes while segments from test utterances are unlabeled nodes. The weight of edges between nodes is evaluated by the similarities between the pretrained speaker embeddings of speech segments. Speaker attribution then becomes a semi-supervised learning problem on graphs, to which two graph-based methods are applied: label propagation (LP) and graph neural networks (GNNs). The proposed approaches are able to utilize the structural information of the graph to improve speaker attribution performance. Experimental results on real meeting data show that the graph-based approaches reduce speaker attribution error by up to 68% compared to a baseline speaker identification approach that processes each utterance independently.
Index Terms: speaker attribution, speaker identification, graph neural networks, label propagation.
1. Introduction
Speaker diarization is the problem of "who spoke when", i.e., grouping the segments of a long audio recording into speaker-homogeneous clusters. The conventional speaker diarization task assumes no prior knowledge of speakers' identities, so it is basically a clustering problem without speaker identification. However, there are scenarios, such as meeting transcription, where voice profiles of speakers are available and the identification of speakers is required. This task is called 'speaker attribution' in this paper. A straightforward approach is to build a multi-class classifier from the speaker profiles, and then classify the test segments one by one. A drawback of this approach is that it treats test segments independently, without considering the context when making predictions. For example, test segments that are similar to each other should be assigned to the same speaker.

Instead of predicting the speaker label for each speech segment independently, we propose to use graph-based semi-supervised learning methods that exploit the structural information among speech segments within a session. Each speech segment, whether from profile audio or test audio, is represented as a node on a per-session graph.

* Work done by the first author during an internship at Microsoft.

Figure 1: Overview of the proposed method: (a) extract d-vectors of audio segments with a pre-trained speaker embedding model; (b, c) build a graph of audio segments based on pair-wise similarities of the corresponding d-vectors, using both profile and test audio segments; (d) predict labels for test audio segments by graph-based semi-supervised learning methods.

The feature of each node is a fixed-dimensional speaker embedding, e.g., a d-vector [1, 2, 3, 4, 5, 6, 7], extracted from the corresponding speech segment. Segments from the profile audio are treated as labeled nodes while those from the test audio are unlabeled nodes. The speaker attribution task can then be solved as a graph-based semi-supervised learning problem, which can utilize the structural information of the graph to improve the accuracy of classifying the test nodes. The intuition is that if two nodes are similar to each other and share common neighbors on the graph, they are likely to have the same speaker label. Recently, graph-based methods have also been successfully applied to the conventional speaker diarization problem [8].

An overview of the proposed method is shown in Figure 1. First, we apply a pre-trained speaker embedding model to extract d-vectors for speech segments, which are obtained by uniformly segmenting the audio of each session after applying voice activity detection (VAD). Then a graph is built with speech segments from both profile audio and test audio, as shown in Figure 1(c), on which each node is a speech segment whose average d-vector is used as the feature vector of the node. The weight of each edge represents the similarity between the corresponding segment pair. The profile segments are labeled nodes on the graph, while test segments are unlabeled.
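As an illustration of this graph-building step, the following NumPy sketch turns per-segment d-vectors into a pruned affinity matrix. The function name is hypothetical, the linear rescaling of cosine similarity to [0, 1] and the 0.6 pruning threshold follow the description in Sections 3 and 4, and this is not the authors' implementation.

```python
import numpy as np

def build_affinity(dvectors: np.ndarray, threshold: float = 0.6) -> np.ndarray:
    """Build a thresholded affinity matrix from per-segment d-vectors.

    Cosine similarities are mapped linearly from [-1, 1] to [0, 1];
    edges at or below `threshold` are pruned, and self-loops are removed.
    """
    x = dvectors / np.linalg.norm(dvectors, axis=1, keepdims=True)
    sim = (x @ x.T + 1.0) / 2.0          # linear rescale to [0, 1]
    A = np.where(sim > threshold, sim, 0.0)
    np.fill_diagonal(A, 0.0)             # no self-loops: A_ii = 0
    return A
```

The resulting matrix is symmetric and nonnegative, which is what both label propagation and the GCN normalization below expect.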
To classify the unlabeled nodes, we apply two graph-based semi-supervised learning methods: a graph Laplacian regularization-based approach (label propagation) and a graph embedding-based approach using GNNs. Experiments show that both methods significantly outperform classification-based methods on real multi-party meetings and present great potential for real-world applications. Our contributions can be summarized as:

• we propose the first solution to speaker attribution with speaker profiles through graph-based semi-supervised learning methods;
• we study two graph-based methods – label propagation and GNN-based – and their application in a speaker attribution pipeline; and
• we evaluate the proposed methods on real meeting data. Results show that the graph-based methods significantly outperform the baseline method and present great potential for real-world applications.
2. Related work
While the conventional speaker diarization problem is usually solved by clustering or end-to-end approaches [9, 10, 11, 12, 13, 14], speaker diarization with profiles can be handled as a speaker identification or classification task. Speech embedding models that map raw speech features into a low-dimensional space are widely used for speaker identification or classification. Traditional methods apply probabilistic linear discriminant analysis (PLDA) on top of i-vectors to classify speakers [15]. Recently, neural network-based speaker embedding models have become popular, in which the embedding models are usually trained via classification loss [1, 2], triplet loss [3, 4], generalized end-to-end loss [5], or prototypical network loss [6]. The similarities between the resulting speaker embeddings can be measured by simple metrics, such as cosine or Euclidean distances, and a speech segment is typically classified as the speaker whose profile embedding is closest to it in the embedding space.
Graph-based semi-supervised classification methods perform classification on graphs where a small fraction of the nodes are labeled. The label information is propagated from the labeled nodes to the unlabeled nodes along the structure of the graph. This can be formulated as adding a graph-based regularization term to the supervised loss [16, 17, 18, 19]. Instead of using explicit graph-based regularization, other methods utilize graph neural networks (GNNs) to encode the graph structure and learn new representations for both labeled and unlabeled nodes [20, 21, 22, 23]. The supervised loss is defined on the labeled nodes, but the gradient can be distributed to the unlabeled nodes to learn new representations for all nodes, upon which a classifier can be trained for inference.
3. Speaker classification on graphs
In this section, we discuss graph-based methods for speaker attribution with speaker profiles. The approach is illustrated in Figure 1.
We build a graph for the audio segments of each meeting. Each node represents an audio segment, which could be word-level or utterance-level, or extracted by a sliding window with a fixed window shift. We use the average d-vector of each segment as the node features. The weight of an edge between two nodes is the cosine similarity of their node features, normalized linearly to [0, 1].

There are several methods to construct graphs from pairwise similarities [24]: (1) simply connect all nodes and weight all edges by the similarities between their nodes; (2) only connect two nodes if at least one of them is among the k-nearest neighbors of the other; (3) only keep the edges whose weight is larger than a threshold. In this paper, we apply method (3) and treat the threshold as a hyperparameter.

A meeting session, including the related speaker profiles, can be represented as a graph G(V, E, A), where V is the set of nodes (speech segments), E is the set of edges, and A ∈ R^{N×N}_{≥0} is the affinity matrix with A_ij > 0 if edge e_ij = (v_i, v_j) ∈ E and A_ij = 0 otherwise. N is the total number of nodes. X = [x_1, ..., x_N] ∈ R^{N×D} is the node feature matrix, where x_i is the average d-vector of the i-th node and D is the dimension of the embedding space. Without loss of generality, we assume the first M nodes are from speaker profiles and hence labeled, with 0 < M < N. For a meeting session, let
F ∈ R^{N×C} be a label score matrix, where C is the number of speaker classes. Given F, y_i = argmax_{1≤j≤C} F_ij is the predicted speaker ID for node i. The label propagation algorithm is summarized as follows:

1. Construct the affinity matrix A with A_ij = (cos(x_i, x_j) + 1)/2 and A_ii = 0, where cos(a, b) denotes the cosine similarity of a and b.
2. Initialize F as F^(0)_ij = 1 if j = l_i and i ≤ M, and F^(0)_ij = 0 otherwise, where l_i is the label of node i.
3. Iteratively update F using F^(t+1) = αSF^(t) + (1 − α)F^(0), where S = D^{−1/2} A D^{−1/2}, D is the degree matrix of A, α is a hyperparameter in (0, 1), and t is the iteration index.
4. Stop when F has converged to F*. The speaker ID of node i is obtained as y_i = argmax_{1≤j≤C} F*_ij.

The initial matrix F^(0) represents our knowledge about the labels of the nodes. Through iterative updating of F, the label information is propagated to the whole graph. In each iteration, the soft label assignment of a node (F^(t)_i) is updated as a weighted sum of the soft label assignments of itself and its neighbors, plus the initial assignment. In this way, each neighborhood in the graph will tend to have the same speaker ID. A limitation of label propagation is that it does not directly minimize the expected speaker classification error through learning on the graph. It also lacks free parameters that would allow learning from the labeled nodes. More details on label propagation can be found in [17].

In this work, we modified label propagation by freezing the rows of F for labeled nodes, as we do not want to change the label assignment of the profile segments. We found that this improves the robustness of label propagation on the speaker attribution task. We also use a fixed number of iterations instead of waiting until the algorithm converges, as the converged F does not always give the best performance.

Figure 2: Graph convolution. The feature of the bold node in the next layer is the output of a function of the features of itself and its neighbors in the current layer. The thickness of the edges denotes the similarity between pairs of nodes.
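The four steps above, together with the two modifications (frozen labeled rows and a fixed iteration count), can be sketched in NumPy as follows. This is an illustrative implementation with hypothetical names, not the authors' code.

```python
import numpy as np

def label_propagation(A, labels, n_labeled, n_classes, alpha=0.5, n_iters=20):
    """Propagate speaker labels over the graph (Zhou et al. style),
    freezing the rows of F for the labeled (profile) nodes each step.

    A: (N, N) nonnegative affinity matrix with zero diagonal.
    labels: integer labels; only the first n_labeled entries are used.
    Returns predicted class indices for all N nodes.
    """
    N = A.shape[0]
    d = A.sum(axis=1)
    d[d == 0] = 1.0                       # guard isolated nodes against 0-division
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    S = D_inv_sqrt @ A @ D_inv_sqrt       # S = D^{-1/2} A D^{-1/2}
    F0 = np.zeros((N, n_classes))
    F0[np.arange(n_labeled), labels[:n_labeled]] = 1.0
    F = F0.copy()
    for _ in range(n_iters):              # fixed iteration count, not convergence
        F = alpha * (S @ F) + (1 - alpha) * F0
        F[:n_labeled] = F0[:n_labeled]    # freeze profile-node assignments
    return F.argmax(axis=1)
```

Freezing the labeled rows means profile segments can never be relabeled, which matches the robustness modification described above.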
To moderate the limitations of label propagation, we apply several variants of GNNs to the speaker attribution task. Specifically, we apply variants under the framework of message passing neural networks (MPNNs) [23] and achieve better performance. Under the MPNN framework, as shown in Figure 2, the convolutional operator is expressed as a message passing scheme:

x'_i = γ_Θ(x_i, □_{j∈N(i)} φ_Θ(x_i, x_j, e_ij)),   (1)

where x_i is the feature of node i in the current layer with dimension D, x'_i is the node feature in the next layer with dimension D', e_ij is the edge feature from node i to node j, γ(·) and φ(·) are the update function and message function, respectively (parameterized by Θ), and □ denotes the aggregation function, e.g., sum, mean, max, etc. N(i) is the set of neighbors of node i.

Specifically, one variant we apply is the graph convolutional network (GCN) described in [22], in which the aggregation corresponds to taking a certain weighted average of neighboring nodes:

x'_i = σ(W Σ_j L_ij x_j),   (2)

where L = D̂^{−1/2} Â D̂^{−1/2}, Â = A + I_N, and D̂ denotes the degree matrix of Â. W ∈ R^{D'×D} is a layer-specific trainable weight matrix, and σ(·) is a nonlinear function which, in our case, is the exponential linear unit (ELU) activation function [25].

We build and train a model with several GCN layers for each meeting session separately. The output dimension of the last layer is equal to the number of speaker classes C in the meeting session. Instead of the ELU activation function, we apply the softmax activation function row-wise on the output embedding matrix X_out ∈ R^{N×C} from the last GCN layer, resulting in a predicted probability matrix Z ∈ R^{N×C}.
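The propagation rule of Eq. (2) can be written out densely in NumPy as below (self-loops added, ELU activation). In the paper the layers are implemented with PyTorch Geometric, so this sketch, with its hypothetical function names, is only for illustration of the math.

```python
import numpy as np

def elu(z):
    # Exponential linear unit; clamp the exponent to avoid overflow warnings
    return np.where(z > 0, z, np.exp(np.minimum(z, 0)) - 1)

def gcn_layer(X, A, W):
    """One dense graph convolution, x'_i = sigma(W sum_j L_ij x_j), with
    L = D_hat^{-1/2} A_hat D_hat^{-1/2} and A_hat = A + I (self-loops).

    X: (N, D) node features; A: (N, N) affinity; W: (D, D_out) weights.
    """
    A_hat = A + np.eye(A.shape[0])        # add self-loops
    d = A_hat.sum(axis=1)
    L = A_hat / np.sqrt(np.outer(d, d))   # symmetric degree normalization
    return elu(L @ X @ W)
```

Stacking two such layers and replacing the final ELU with a row-wise softmax reproduces the architecture described above.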
The model is trained with the cross-entropy loss over labeled segments in a meeting session:

L = − Σ_{i=1}^{M} Σ_{j=1}^{C} F_ij ln Z_ij.   (3)

Instead of using explicit graph-based regularization as in label propagation, GNNs can directly encode the graph structure into the model. Training signals can be distributed from the supervised loss L along the graph structure, and thus the model can learn representations of both labeled and unlabeled speech segments. This gives GNN models more powerful classification ability than label propagation. However, for GNN models, we need to set aside a subset of the labeled segments in each meeting session for validation purposes.
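Since the loss runs only over the M labeled profile nodes, it is simply a mask over the session's probability matrix. A minimal sketch (hypothetical function name):

```python
import numpy as np

def labeled_cross_entropy(Z, F, M):
    """Cross-entropy of Eq. (3), summed only over the first M (labeled) nodes.

    Z: (N, C) predicted probabilities; F: (N, C) one-hot targets.
    """
    eps = 1e-12                      # numerical floor to avoid log(0)
    return float(-np.sum(F[:M] * np.log(Z[:M] + eps)))
```

The unlabeled rows of Z still receive gradient indirectly, because each labeled node's prediction depends on its neighbors through the graph convolutions.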
4. Experiments
The graph-based methods are evaluated on two sets of in-house real meeting data: a dev set containing 9 meetings and a test set containing 19 meetings. The number of speakers in a meeting ranges from 2 to 20, with a mean of 8. The average duration of the meeting sessions is about 41 minutes. The profile audio was recorded by laptop microphones or headphones, and the meeting audio was recorded by microphone arrays in meeting rooms. The profile audio for a speaker is about 20-40s long, and each profile audio file consists of a few sentences. No signal processing is applied to the profile audio, but the meeting audio is processed by beamforming to enhance the speech signal. After VAD, the profile audio is uniformly split into 1.2s speech segments, while the meeting audio is split into 0.8s speech segments. For each segment, d-vectors are extracted and averaged to represent the speaker characteristics of the segment.
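The uniform-splitting step after VAD can be sketched as below. The function name is hypothetical, and the policy of dropping the final partial segment is an assumption; the paper does not specify how segment remainders are handled.

```python
def uniform_segments(vad_regions, seg_len):
    """Uniformly split VAD speech regions into fixed-length segments.

    vad_regions: list of (start, end) times in seconds from the VAD.
    seg_len: segment length in seconds (e.g., 0.8 for meeting audio,
             1.2 for profile audio). Final partial segments are dropped
    (an assumed policy, not stated in the paper).
    """
    segments = []
    for start, end in vad_regions:
        t = start
        while t + seg_len <= end:
            segments.append((t, t + seg_len))
            t += seg_len
    return segments
```

Each resulting segment would then be embedded frame by frame, with its d-vectors averaged into one node feature.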
The d-vector extraction model is trained on the VoxCeleb2 dataset [26] with data augmentation [7]. In our baseline system, each enrolled speaker is represented by a single profile d-vector averaged from the d-vectors extracted from the speaker's profile audio. A segment is assigned to the speaker whose profile d-vector has the highest cosine similarity with the segment's d-vector.

We use the dev set to tune the hyperparameters for the graph-based methods, including the threshold for graph edge pruning, the parameter α for LP, the architecture and learning rate of the GNN model, etc. Specifically, to build graphs we only connect two nodes if their cosine similarity is larger than 0.6. The GNN model includes two GCN [22] layers, between which we apply the ELU activation function [25] and a dropout layer [27]. Each hidden layer contains 64 nodes. The output dimension is equal to the number of speakers in a session, and a softmax layer is applied to output a probability distribution. The GCN layers are implemented with the PyTorch Geometric library [28].

For the GNN-based method, a network is trained for each meeting. We split the labeled nodes into two equal sets and train two models. In the first model, we use the first half of the labeled nodes as the training set and the second half as the cross-validation set. In the second model, we switch the two sets. In this way, we can effectively use all the labeled nodes for training. For each model, training stops if there is no improvement in the cross-validation loss. During inference, the hidden activation vectors of the two models before the softmax layer are summed. Note that the cross-validation set here is used for model selection within a single meeting session, which is different from the dev set used for hyperparameter tuning.

For each meeting session, the same number of profile d-vectors is provided for each speaker. To evaluate the performance of the methods with different amounts of profile data, the number of profile d-vectors per speaker is varied.

Table 1: Segment error rate on the validation set.
Table 2: Segment error rate on the test set. The same notations are used as in Table 1.
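The baseline identification rule described in Section 4 (average each speaker's profile d-vectors into one embedding, then assign each segment to the nearest profile by cosine similarity) can be sketched as follows; function and variable names are hypothetical.

```python
import numpy as np

def baseline_speaker_id(profile_dvecs, segment_dvecs):
    """Baseline: one averaged profile d-vector per speaker, then
    nearest-profile assignment by cosine similarity.

    profile_dvecs: dict mapping speaker_id -> (n_i, D) array of d-vectors.
    segment_dvecs: (N, D) array of per-segment average d-vectors.
    Returns a list of N predicted speaker IDs.
    """
    speakers = sorted(profile_dvecs)
    P = np.stack([profile_dvecs[s].mean(axis=0) for s in speakers])
    P = P / np.linalg.norm(P, axis=1, keepdims=True)       # unit-norm profiles
    X = segment_dvecs / np.linalg.norm(segment_dvecs, axis=1, keepdims=True)
    return [speakers[i] for i in (X @ P.T).argmax(axis=1)]  # cosine = dot product
```

Because each segment is classified independently, this baseline ignores the segment-to-segment similarities that the graph-based methods exploit, which is the gap the paper measures.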
The results on the dev and test sets are shown in Tables 1 and 2, respectively. The graph-based methods significantly outperform the baseline method, resulting in a much lower error rate and lower standard deviation. The relative error reduction against the baseline method is up to 68.2% and 48.7% on the dev and test sets, respectively. The GNN-based method demonstrates better performance than LP under all settings in terms of both accuracy and stability.

Although the baseline model is simple, it achieves reasonable results. Applying more complex classifiers, e.g., support vector machines (SVMs) or multilayer perceptron (MLP) models, does not perform better than the baseline due to the small amount of training data (profile data) and the acoustic mismatch between profile audio and meeting audio. The graph-based semi-supervised learning methods are able to learn from both labeled and unlabeled data, significantly alleviating the data sparsity and mismatch problems.

As expected, the mean error rate and standard deviation drop as more profile d-vectors are provided. The GNN-based method outperforms label propagation significantly, especially with fewer profile d-vectors. However, the performance of the baseline method improves on the test set while the performance of the graph-based methods drops. This might be due to the discrepancy in acoustic characteristics between the dev and test sets, to which the baseline approach is less vulnerable since it is parameter-free.
5. Conclusion
In this work, we applied graph-based semi-supervised learning methods to the speaker attribution task with speaker voice profiles. We build a graph of speech segments for each meeting with both the profile audio and meeting audio. We applied two methods to this task – label propagation and a GNN-based method. Experiments on real multi-party meeting data showed that the graph-based methods significantly outperformed the classifier-based methods due to their use of meeting-wide structural information represented as graphs. Moving forward, we will extend the graph-based methods to online speaker identification and to scenarios where some or all of the speakers do not have voice profiles.
6. Acknowledgement
The authors thank Tianyan Zhou and Yong Zhao for providing the d-vector extraction model, and Zhuo Chen for helpful discussions. Rudzicz is a CIFAR Chair in AI.
7. References
[1] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," in ICASSP. IEEE, 2014, pp. 4052–4056.
[2] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in ICASSP. IEEE, 2018, pp. 5329–5333.
[3] C. Li, X. Ma, B. Jiang, X. Li, X. Zhang, X. Liu, Y. Cao, A. Kannan, and Z. Zhu, "Deep Speaker: an end-to-end neural speaker embedding system," arXiv preprint arXiv:1705.02304, 2017.
[4] H. Bredin, "TristouNet: triplet loss for speaker turn embedding," in ICASSP. IEEE, 2017, pp. 5430–5434.
[5] L. Wan, Q. Wang, A. Papir, and I. L. Moreno, "Generalized end-to-end loss for speaker verification," in ICASSP. IEEE, 2018, pp. 4879–4883.
[6] J. Wang, K.-C. Wang, M. T. Law, F. Rudzicz, and M. Brudno, "Centroid-based deep metric learning for speaker recognition," in ICASSP. IEEE, 2019, pp. 3652–3656.
[7] T. Zhou, Y. Zhao, J. Li, Y. Gong, and J. Wu, "CNN with phonetic attention for text-independent speaker verification," in Proc. IEEE Workshop on Automatic Speech Recognition and Understanding, 2019.
[8] J. Wang, X. Xiao, J. Wu, R. Ramamurthy, F. Rudzicz, and M. Brudno, "Speaker diarization with session-level speaker embedding refinement using graph neural networks," in ICASSP. IEEE, 2020, pp. 7109–7113.
[9] S. Meignier and T. Merlin, "LIUM SpkDiarization: an open source toolkit for diarization," 2010.
[10] S. H. Shum, N. Dehak, R. Dehak, and J. R. Glass, "Unsupervised methods for speaker diarization: An integrated and iterative approach," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 10, pp. 2015–2028, 2013.
[11] G. Sell and D. Garcia-Romero, "Speaker diarization with PLDA i-vector scoring and unsupervised calibration," in SLT. IEEE, 2014, pp. 413–417.
[12] D. Garcia-Romero, D. Snyder, G. Sell, D. Povey, and A. McCree, "Speaker diarization using deep neural network embeddings," in ICASSP. IEEE, 2017, pp. 4930–4934.
[13] Q. Wang, C. Downey, L. Wan, P. A. Mansfield, and I. L. Moreno, "Speaker diarization with LSTM," in ICASSP. IEEE, 2018, pp. 5239–5243.
[14] A. Zhang, Q. Wang, Z. Zhu, J. Paisley, and C. Wang, "Fully supervised speaker diarization," in ICASSP. IEEE, 2019, pp. 6301–6305.
[15] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2010.
[16] X. Zhu, Z. Ghahramani, and J. D. Lafferty, "Semi-supervised learning using Gaussian fields and harmonic functions," in Proceedings of the 20th International Conference on Machine Learning (ICML-03), 2003, pp. 912–919.
[17] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf, "Learning with local and global consistency," in Advances in Neural Information Processing Systems, 2004, pp. 321–328.
[18] M. Belkin, P. Niyogi, and V. Sindhwani, "Manifold regularization: A geometric framework for learning from labeled and unlabeled examples," Journal of Machine Learning Research, vol. 7, pp. 2399–2434, 2006.
[19] J. Weston, F. Ratle, H. Mobahi, and R. Collobert, "Deep learning via semi-supervised embedding," in Neural Networks: Tricks of the Trade. Springer, 2012, pp. 639–655.
[20] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu, "A comprehensive survey on graph neural networks," arXiv preprint arXiv:1901.00596, 2019.
[21] H. Cai, V. W. Zheng, and K. C.-C. Chang, "A comprehensive survey of graph embedding: Problems, techniques, and applications," IEEE Transactions on Knowledge and Data Engineering, vol. 30, no. 9, pp. 1616–1637, 2018.
[22] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," arXiv preprint arXiv:1609.02907, 2016.
[23] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, "Neural message passing for quantum chemistry," in Proceedings of the 34th International Conference on Machine Learning, 2017, pp. 1263–1272.
[24] U. von Luxburg, "A tutorial on spectral clustering," Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007.
[25] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, "Fast and accurate deep network learning by exponential linear units (ELUs)," arXiv preprint arXiv:1511.07289, 2015.
[26] J. S. Chung, A. Nagrani, and A. Zisserman, "VoxCeleb2: Deep speaker recognition," arXiv preprint arXiv:1806.05622, 2018.
[27] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[28] M. Fey and J. E. Lenssen, "Fast graph representation learning with PyTorch Geometric," arXiv preprint arXiv:1903.02428, 2019.