MIRNet: Learning multiple identities representations in overlapped speech
MMIRNet: Learning multiple identities representations in overlapped speech
Hyewon Han, Soo-Whan Chung and Hong-Goo Kang
Department of Electrical & Electronic Engineering, Yonsei University, Seoul, South Korea [hwhan, jsh6293]@dsp.yonsei.ac.kr, [email protected]
Abstract
Many approaches can derive information about a singlespeaker’s identity from the speech by learning to recognize con-sistent characteristics of acoustic parameters. However, it ischallenging to determine identity information when there aremultiple concurrent speakers in a given signal. In this paper,we propose a novel deep speaker representation strategy thatcan reliably extract multiple speaker identities from an over-lapped speech. We design a network that can extract a high-level embedding that contains information about each speaker’sidentity from a given mixture. Unlike conventional approachesthat need reference acoustic features for training, our proposedalgorithm only requires the speaker identity labels of the over-lapped speech segments. We demonstrate the effectiveness andusefulness of our algorithm in a speaker verification task anda speech separation system conditioned on the target speakerembeddings obtained through the proposed method.
Index Terms : speaker separation, speaker representation,multi-talker background.
1. Introduction
Speech is one of the most widely used media for recognizingpeoples’ identities due to its distinctive characteristics. Typi-cally, an individual’s vocal identity is represented by acousticcharacteristics such as pitch, formant, and speaking styles thatare consistently observed in speech signals [1, 2, 3]. In earlystudies, speaker representations were obtained using statisticalmodels such as Gaussian mixture models [4] with Joint FactorAnalysis or Support Vector Machines [5, 6, 7]. By removingrapidly varying acoustic features caused by linguistic changes,these methods obtained normalized speaker-discriminative fea-tures with clustering or classification methods [8].Thanks to advances in deep learning and the availabil-ity of large-scale datasets [9, 10], deep neural network-basedspeaker modeling strategies have recently shown great successin speaker recognition. For example, in [11, 12], neural networkmodels were trained to perform classification given the refer-ence speaker labels, from which representations for speakerinformation were obtained from the output of the last hiddenlayer. These methods showed better performance than previousstatistical approaches such as i-vectors [7].Speaker identity information obtained from such methodscan be applied to a variety of speech interface related applica-tions. For example, the accuracy of automatic speech recog-nition (ASR) systems can be improved by using speaker iden-tity information to reduce the bias caused by speaker-dependentcharacteristics [13, 14]. The performance of tasks such asspeech enhancement and separation can also be improved whenthis information is available, which can be used to specify orrepresent the desired voice components in input signals dis-torted by noise, reverberation or interfering speech [15, 16, 17].However, the aforementioned approaches are often incon-venient for real applications because target speakers must first enroll their information before post-tasks can be performed.Therefore, they cannot be used with any unenrolled speakers.It is possible to directly extract speaker representations from amixed input signal, but these representations are prone to error;thus, the overall performance of the post-task can degrade as aresult of incorrect guidance on speaker identity.In this paper, we propose a novel deep learning-basedspeaker representation method that extracts multiple identitiesin the case of simultaneous speech. Our proposed method in-cludes two contributions: a speaker model and its training strat-egy. First, we construct a speaker model with three modules: speech analysis , spectral attention , and speaker embedding .The speech analysis module transforms the input spectrum intoa latent domain and the spectral attention module estimates eachspeaker’s spectral information in the latent space using a tem-poral attention layer, after which the speaker embedding mod-ule finds a representation for each speaker’s identity. Usingthis structure, the network is able to directly estimate speakeridentities from mixed speech. Second, we propose an effectivetraining strategy for this network that only uses speaker identityinformation. It utilizes a decision method for speaker permu-tations to alleviate the ambiguity of speaker assignment. Ourapproach is novel in that it models speaker identities from themixed speech with a single network without requiring any cleanspectral information. Using this method, we can acquire eachspeaker’s identity information even in cases in which clean ut-terances are not enrolled.To verify the effectiveness of our proposed approach, weperform several experiments that utilize the extracted identityinformation. We first measure speaker verification performancein an overlapped speech scenario. By measuring the similaritybetween obtained speaker identities from pairs of mixed speech,we show that our network effectively models speaker identityinformation. We also show that our extracted speaker represen-tations from mixed speech contain distinctive speaker informa-tion by applying them to a speaker-conditioned speech separa-tion task.The rest of the paper is organized as follows. Section 2 de-scribes prior research that is related to our work. The proposednetwork structure and training strategy for learning speakeridentity are shown in Section 3. Experimental results and appli-cations for our proposed method are shown in Section 4, withfinal conclusions in Section 5.
2. Related Work
Speaker identity representation is the task of deriving repre-sentative embeddings for this information by mapping acousticfeatures into a latent space. Various deep learning-based ap-proaches have been proposed to improve the performance ofspeaker representations. D-vectors [11] are a high level speakerrepresentation embedding extracted from the last hidden layer a r X i v : . [ ee ss . A S ] A ug a) Overall structure of MIRNet (b) Temporal attention networkFigure 1: Demonstration of the proposed speaker representation method. of a deep neural network architecture that is designed to performa speaker identification task. To capture the sequential informa-tion of speech signals, x-vectors [12] exploit a time-delay neuralnetwork (TDNN) architecture that consists of frame-wise out-puts and statistical pooling at the sentence level. Also, variousmethods are proposed to model effective speaker representationby modifying the aggregation method [18, 19].Various training criteria have been investigated to furtherimprove the uniformity of output speaker representations. Themost well-known loss function is cross-entropy loss, which fa-cilitates the categorization of speaker identity. Recently, metric-based learning criteria such as triplet loss [20, 21] and prototyp-ical loss [22, 23] have been used to enlarge the similarity be-tween same speaker pairs but minimize the similarity betweendifferent speaker pairs. In this work, we apply cross-entropyloss for a learning network to categorize the speaker identities.
Speech separation is the task of estimating individual signalsfor each speaker from a mixed speech signal. One of the mostdifficult issues in this work is how to avoid the speaker per-mutation problem, i.e. how to correctly assign the speaker IDof the separated signal in each processing frame. To alleviatethis problem, the permutation invariant training (PIT) criterionwas proposed [24, 25], which simply considers all of the lossterms by calculating every possible permutation of candidatepairs. From those candidates, the network determines optimalspeaker pairs that minimize the error. Inspired by this approach,our training criterion also assigns a loss function for all possiblepairs and finds the optimal assignment. However, we computethe PIT loss on estimated speaker identity information such asembeddings or the distribution of speaker classes, not the sepa-rated spectral bins as the conventional method did.
3. Multiple identities representationnetwork (MIRNet)
Although speaker information is mingled in an overlappedspeech signal, it is possible to extract a target speaker’s identityif an appropriate target-related condition vector is given [26].Conventionally, a reference speech signal is used as a condi-tion vector, which necessitates a cumbersome pre-enrollmentprocess. In this work, we separate speaker identity informationin an overlapped signal without providing any reference sig-nal. Figure 1 illustrates a block diagram of the proposed mul-tiple identities representation network (MIRNet). Although theMIRNet can be generalized to an arbitrary number of speakerinputs, we fix the number of speakers to two for simplicityin this paper. The proposed network consists of three stages:speech analysis, spectral attention and speaker embedding.
In the speech analysis stage, an input speech signal is trans-formed into a latent domain representation, V . V = E S ( S A + B ) , (1)where E S is a speech encoder network and V is an embeddingwith D -channels. The input to the encoder, S A + B , denotesa linear magnitude spectrum on a logarithm scale. We do notuse the mel-spectrum that is popularly used in speaker embed-ding tasks because of its over-smoothed spectral characteristics,which makes it difficult to distinguish one speaker from another.We construct the architecture of the speech encoder based on 1-D convolution layers. The detailed parameter settings of thespeech encoder are described in Table 1. The speech encoderemits spectral embeddings with D channels, where we assumethat each D channel contains the information for one speaker. In the spectral attention stage, we extract two different sets ofembeddings from the spectral embedding output of the speechencoder using an attention mechanism. The rationale for the useof an attention mechanism is as follows. Some frames that arefully-overlapped with two speakers do not represent speakers’identities well, but other frames clearly represent each speaker’sidentity when they are spoken by only a single speaker. In addi-tion, some frames contain only silence. We compute frame-wiseattention weights from the spectral embeddings to estimate theamount of importance that each frame implies.Figure 1 illustrates a detailed block diagram of the self-attention mechanism [27]. We first obtain two embedding vec-tors, one from the spectral embedding V and the other from V , re-structured by a channel exchange. Since the attentionlayer shares parameters for attention vectors, half of the chan-nels of V are flipped onto the channel axis to create a differentinput embedding V , which is used to extract another attentionvector W . V = Concat ( V ,D +1:2 D , V , D ) . (2)The attention vectors are related to the framewise power of eachspeaker, which refers to the speaker information in each frame.They are each multiplied with the spectral embedding V andused as inputs to the following fully-connected layer with a non-linear activation function. From the spectral attentive stage, thenetwork outputs Z and Z where they contain discriminativeinformation of each speaker.Figure 2 illustrates an example log magnitude spectrogramof mixed speech from two speakers and those for each speakerindividually, as well as power contours and attention weightsestimated by the self-attention mechanism.igure 2: Visualization of analysis on temporal attention foreach speaker in a mixture.
Top : Spectrogram of input speechsignal S A + B . Mid : Spectrogram of speech S A . Bottom : Spec-trogram of speech S B . Yellow line shows attention weight val-ues, and the blue line is the power contour of speech signal. The speaker embedding network extracts speaker vectors fromthe outputs of the spectral attention stage. The network archi-tecture is similar to the one used for speaker verification, wherewe use ResNet-18 [28] as a backbone network while changingthe pooling strategy to be temporal average pooling (TAP).
For training, we randomly mix signals from two different speak-ers. The networks are jointly trained using cross-entropy losswith a classifier for the speaker embeddings I and I . The per-mutation invariant training method is used to solve the permu-tation problem frequently occurring in speech separation tasks.The entire training criterion is as follows: L = min ( L , L ) , (3) L = L CE ( y , y A ) + L CE ( y , y B ) , (4) L = L CE ( y , y A ) + L CE ( y , y B ) , (5)where L CE is cross-entropy loss.Suppose that we know the labels of speaker y A , y B whenwe generate the mixture S A + B , and the classifier estimatesspeaker identities ( y , y ). Then, the losses between the la-bels y A , y B and y , y are computed for every pair to find theoptimal speaker pairs.
4. Experiments
In this section, we describe two experiments on a speaker ver-ification task and a speaker-conditioned speech separation taskto prove the effectiveness of our method.
The speaker representation network was trained with the Lib-riSpeech corpus [29], which contains 2,238 speakers. We ran-domly selected two speech samples from all the different pairs Table 1:
Details of model parameter settings for speech encoderand attention layer
Speech encoderLayer Non-linearity Channels Kernel conv1 LeakyReLU( α = 0 . ) 512 5conv2 512 3conv3 512 3conv4 512 1conv5 1,500 1conv6 514 1 Attention layerLayer Non-linearity Channels fc1 Tanh 64fc2 Sigmoid 1of speakers to make mixed input signals in every epoch. Inthe training stage, 3 second segments chosen at random offsetswere used as inputs to the network. In each epoch, the modelwas trained using 93 hours of mixed signals for training and15 hours of mixed signals for validation. The input log-scaledspectrum was calculated every 10 ms with an analysis framelength of 32 ms. The FFT size was set to 512; thus, the inputdimension was 257. To measure the speaker embedding per-formance of the network in speaker verification, 200 evaluationpairs were generated for each acceptance and rejection scenariousing the dev-clean , dev-others , test-clean and test-others sub-sets which contain a total of 146 speakers.The detailed network architecture is summarized in Table 1.We use a 257-dimensional log-spectrum as the input, and theoutput channel dimension of the last layer in the speech encoderis 514, two times that of the input spectrum dimension. For theattention layer, the sigmoid activation function is used to limitthe range of the attention weights from 0 to 1. The obtainedattention vector is mapped into a 257-dimensional vector usinga fully-connected layer. To prove whether the proposed model represents discrimina-tive characteristics of speaker identities well, we measure theequal error rate (EER) for the speaker verification task usingthe extracted embeddings from the input mixed speech. Whileconventional speaker verification methods use positive and neg-ative embedding pairs for evaluation, our method needs a newprotocol setup since the output embeddings are permuted andnot assigned to speaker labels.Our evaluation method considers two scenarios to measureEER: acceptance and rejection scenarios. We prepare three mix-tures S A + B , S A (cid:48) + C and S C (cid:48) + D , and each speech segment pro-vides speaker identities ( I A , I B ), ( I A (cid:48) , I C ) and ( I C (cid:48) , I D ), re-spectively. While ( I A , I B ) is set as the anchor embedding pair,we use the positive and negative samples from ( I A (cid:48) , I C ) and( I C (cid:48) , I D ). To compute EER, d p and d n are the distances for theacceptance and the rejection, with equations as shown below: d p = min i,j ( d ( I i | A + B , I j | A (cid:48) + C )) (6) d n = min i,j ( d ( I i | A + B , I j | C (cid:48) + D )) (7)here i, j ∈ { , } . The distances are computed from the clos-est identity pairs between the anchor and positive and betweenthe anchor and negative samples using Euclidean distance, de-noted as d ( · ) . I i | A + B denotes the identity from the i -th outputfrom mixture S A + B . We obtained an EER of 5.00 % and 6.00 % with the aforemen-tioned evaluation method for the seen and unseen speaker cases,respectively. From this result, we can see that our trained net-work successfully represents speaker identities from a mixedinput. This result also demonstrates the efficiency of the deci-sion method for avoiding the permutation problem.Figure 3 depicts a t-SNE visualization of speaker embed-dings extracted by the proposed network architecture. Thespeech samples are mixed with combinations generated from 20different speakers, and speaker assignment is decided with ourproposed permutation decision method. We can see that embed-dings from the same speaker are clustered closely together. Wealso checked the variances of the channels containing speakeridentity information to confirm that they effectively model theinformation without biased channel values. Therefore, we alsovisualized embeddings extracted from each channel Speaker-conditioned speech separation is the task of verify-ing whether speaker information embeddings are informativeenough for speaker separation in overlapped speech signals.In [16], VoiceFilter was proposed to improve speech separa-tion performance using speaker identities extracted from pre-enrolled reference speech. It dramatically improved the sepa-ration performance, but it was not applicable to speech signalsspoken by speakers that had never been seen before.Here, we extract multiple identities from overlapped speechfirst and separate the target speakers’ voices from the signal. Weuse VoiceFilter as our baseline for comparison, which utilizes anindependent speaker model. We also compare with a separationsystem without speaker conditioning using uPIT [25].
Dataset.
The training dataset for our speaker model is iden-tical to the one described in Section 4.1. The speaker modelfor VoiceFilter is pre-trained using the VoxCeleb2 dataset [10].We use the WSJ0-2mix dataset [30] to train and evaluatespeech separation models for the baseline and our method.Speech signals in WSJ0-2mix are sliced every 10ms with 32msframe length, from which they are transformed into the time-frequency domain using the Fourier transform.
Model setup.
We use a model pre-trained on VoxCeleb2 whoseperformance is 2.23% EER [31] as the baseline speaker model,which consists of 34 convolution layers with Inception. Modelparameters for the speech separation model are the same asthose used in [16]. For the model without speaker conditioning,we use a bidirectional LSTM (BLSTM) structure identical toVoiceFilter while doubling the dimension of the fully-connectedlayer and output for separate speech.
Evaluation metric.
For objective evaluation, we measuresignal-to-distortion ratio improvement (SDRi), which is typi-cally used to represent speech separation performance. We alsocalculate the perceptual evaluation of speech quality improve- Table 2:
Performance on the speech separation task
Speaker encoder SDRi (dB) PESQi
Chung et al. [31] 6.147 0.480
Proposed t-SNE visualization of extracted speaker embeddingfrom 20 random speakers. (a) represents speaker embeddingslabeled with our permutation decision method. (b) representsthe speaker embeddings from the each channel. ment (PESQi), which quantifies the perceptual scoring for sep-arated speech signals.
Table 2 shows results comparing our method with the base-line. Our separation method results in slightly lower scoresthan the baseline model. Nevertheless, it is noteworthy that ourmethod is able to achieve reliable performance on this task evenwith unseen speakers. It should be noted that the baseline usesspeaker identities extracted from enrolled reference speech; thisnot only gives it a significant advantage in the speech separa-tion task, but also means that it cannot be used at all withoutpre-enrolling speakers. In addition, since MIRNet was trainedusing a much smaller dataset compared to the baseline model,there should still be margins for it to improve.
5. Conclusion
In this work, we proposed a novel method to estimate la-tent embeddings from overlapped speech that reliably representspeaker identity information. The proposed network consistsof speech analysis, spectral attention, and speaker embeddingstages which extract information on multiple speaker identities.To make the network learn this information while finding an op-timal assignment, we proposed a speaker identity decision pro-cedure based on permutation invariant training. Experimentalresults showed that the proposed network can derive individualspeakers’ identity information from the mixtures without usingacoustic information extracted from reference speech signals.In addition, the resulting embeddings showed reliable perfor-mance on a speaker-conditioned speech separation task, whereit has an advantage in that it can be applied even to speakers forwhich clean reference speech is unavailable.
Acknowledgements.
This research was sponsored by NaverCorporation. . References [1] B. S. Atal, “Automatic speaker recognition based on pitch con-tours,”
The Journal of the Acoustical Society of America , vol. 52,no. 6B, pp. 1687–1697, 1972.[2] J. J. Wolf, “Efficient acoustic parameters for speaker recognition,”
The Journal of the Acoustical Society of America , vol. 51, no. 6B,pp. 2044–2056, 1972.[3] C.-S. Liu, W.-J. Wang, M.-T. Lin, and H.-C. Wang, “Study ofline spectrum pair frequencies for speaker recognition,” in
Inter-national Conference on Acoustics, Speech, and Signal Processing .IEEE, 1990, pp. 277–280.[4] D. A. Reynolds and R. C. Rose, “Robust text-independent speakeridentification using gaussian mixture speaker models,”
IEEEtransactions on speech and audio processing , vol. 3, no. 1, pp.72–83, 1995.[5] V. Wan and W. M. Campbell, “Support vector machines forspeaker verification and identification,” in
Neural Networks forSignal Processing X. Proceedings of the 2000 IEEE Signal Pro-cessing Society Workshop (Cat. No. 00TH8501) , vol. 2. IEEE,2000, pp. 775–784.[6] W. M. Campbell, D. E. Sturim, and D. A. Reynolds, “Support vec-tor machines using gmm supervectors for speaker verification,”
IEEE signal processing letters , vol. 13, no. 5, pp. 308–311, 2006.[7] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet,“Front-end factor analysis for speaker verification,”
IEEE Trans-actions on Audio, Speech, and Language Processing , vol. 19,no. 4, pp. 788–798, 2010.[8] F. K. Soong, A. E. Rosenberg, B.-H. Juang, and L. R. Rabiner,“Report: A vector quantization approach to speaker recognition,”
AT&T technical journal , vol. 66, no. 2, pp. 14–26, 1987.[9] M. McLaren, L. Ferrer, D. Castan, and A. Lawson, “The speakersin the wild (sitw) speaker recognition database.” in
Interspeech ,2016, pp. 818–822.[10] J. S. Chung, A. Nagrani, and A. Zisserman, “Voxceleb2: Deepspeaker recognition,” arXiv preprint arXiv:1806.05622 , 2018.[11] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, “Deep neural networks for small footprint text-dependent speaker verification,” in .IEEE, 2014, pp. 4052–4056.[12] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudan-pur, “X-vectors: Robust dnn embeddings for speaker recognition,”in . IEEE, 2018, pp. 5329–5333.[13] T. Tan, Y. Qian, D. Yu, S. Kundu, L. Lu, K. C. Sim, X. Xiao,and Y. Zhang, “Speaker-aware training of lstm-rnns for acousticmodelling,” in . IEEE, 2016, pp. 5280–5284.[14] G. Pironkov, S. Dupont, and T. Dutoit, “Speaker-aware longshort-term memory multi-task learning for speech recognition,” in .IEEE, 2016, pp. 1911–1915.[15] M. Delcroix, K. Zmolikova, K. Kinoshita, A. Ogawa, andT. Nakatani, “Single channel target speaker extraction and recog-nition with speaker beam,” in .IEEE, 2018, pp. 5554–5558.[16] Q. Wang, H. Muckenhirn, K. Wilson, P. Sridhar, Zelin, J. R. Her-shey, R. A. Saurous, R. J. Weiss, Y. Jia, and I. L. Moreno, “Voice-Filter: Targeted Voice Separation by Speaker-Conditioned Spec-trogram Masking,” in
Proc. Interspeech , 2019, pp. 2728–2732.[17] C. Xu, W. Rao, E. S. Chng, and H. Li, “Time-domain speakerextraction network,” in . IEEE, 2019, pp. 327–334. [18] Y. Tang, G. Ding, J. Huang, X. He, and B. Zhou, “Deepspeaker embedding learning with multi-level pooling for text-independent speaker verification,” in
ICASSP 2019-2019 IEEEInternational Conference on Acoustics, Speech and Signal Pro-cessing (ICASSP) . IEEE, 2019, pp. 6116–6120.[19] Y. Jung, S. M. Kye, Y. Choi, M. Jung, and H. Kim, “Improv-ing multi-scale aggregation using feature pyramid module for ro-bust speaker verification of variable-duration utterances,” arXivpreprint arXiv:2004.03194 , 2020.[20] C. Zhang and K. Koishida, “End-to-end text-independent speakerverification with triplet loss on short utterances.” in
Interspeech ,2017, pp. 1487–1491.[21] C. Li, X. Ma, B. Jiang, X. Li, X. Zhang, X. Liu, Y. Cao, A. Kan-nan, and Z. Zhu, “Deep speaker: an end-to-end neural speakerembedding system,” arXiv preprint arXiv:1705.02304 , 2017.[22] J. Wang, K.-C. Wang, M. T. Law, F. Rudzicz, and M. Brudno,“Centroid-based deep metric learning for speaker recognition,” in
ICASSP 2019-2019 IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP) . IEEE, 2019, pp. 3652–3656.[23] S. M. Kye, Y. Jung, H. B. Lee, S. J. Hwang, and H. Kim, “Meta-learning for short utterance speaker recognition with imbalancelength pairs,” arXiv preprint arXiv:2004.02863 , 2020.[24] D. Yu, M. Kolbæk, Z.-H. Tan, and J. Jensen, “Permutation invari-ant training of deep models for speaker-independent multi-talkerspeech separation,” in . IEEE, 2017,pp. 241–245.[25] M. Kolbæk, D. Yu, Z.-H. Tan, and J. Jensen, “Multitalker speechseparation with utterance-level permutation invariant training ofdeep recurrent neural networks,”
IEEE/ACM Transactions on Au-dio, Speech, and Language Processing , vol. 25, no. 10, pp. 1901–1913, 2017.[26] Y. Shi and T. Hain, “Supervised speaker embedding de-mixingin two-speaker environment,” arXiv preprint arXiv:2001.06397 ,2020.[27] Z. Lin, M. Feng, C. N. d. Santos, M. Yu, B. Xiang, B. Zhou,and Y. Bengio, “A structured self-attentive sentence embedding,” arXiv preprint arXiv:1703.03130 , 2017.[28] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4, inception-resnet and the impact of residual connections onlearning,” in
Thirty-first AAAI conference on artificial intelli-gence , 2017.[29] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Lib-rispeech: an asr corpus based on public domain audio books,”in . IEEE, 2015, pp. 5206–5210.[30] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, “Deepclustering: Discriminative embeddings for segmentation and sep-aration,” in . IEEE, 2016, pp. 31–35.[31] J. S. Chung, J. Huh, S. Mun, M. Lee, H. S. Heo, S. Choe, C. Ham,S. Jung, B.-J. Lee, and I. Han, “In defence of metric learning forspeaker recognition,” in