Disentangled speaker and nuisance attribute embedding for robust speaker verification
Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.
Digital Object Identifier 10.1109/ACCESS.2019.DOI
WOO HYUN KANG, (Student Member, IEEE), SUNG HWAN MUN, (Student Member, IEEE), MIN HYUN HAN, (Student Member, IEEE), and NAM SOO KIM, (Senior Member, IEEE)
Department of Electrical and Computer Engineering and INMC, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Korea (e-mail: [email protected])
Corresponding author: Nam Soo Kim (e-mail: [email protected]). This research was supported by Projects for Research and Development of Police science and Technology under Center for Research and Development of Police science and Technology and Korean National Police Agency funded by the Ministry of Science, ICT and Future Planning (PA-J000001-2017-101).
ABSTRACT
Over the recent years, various deep learning-based embedding methods have been proposed and have shown impressive performance in speaker verification. However, as in most of the classical embedding techniques, the deep learning-based methods are known to suffer from severe performance degradation when dealing with speech samples with different conditions (e.g., recording devices, emotional states). In this paper, we propose a novel fully supervised training method for extracting a speaker embedding vector disentangled from the variability caused by the nuisance attributes. The proposed framework was compared with the conventional deep learning-based embedding methods using the RSR2015 and VoxCeleb1 datasets. Experimental results show that the proposed approach can extract speaker embeddings robust to channel and emotional variability.
INDEX TERMS speech embedding, speaker verification, domain disentanglement, deep learning.
I. INTRODUCTION

Speaker verification is the task of verifying the claimed speaker identity based on the given speech samples, and has become a key technology for personal authentication in many commercial applications, forensics and law enforcement [1]. Commonly, utterance-level fixed-dimensional vectors (i.e., embedding vectors) are extracted from the enrollment and test speech samples and then fed into a scoring algorithm (e.g., cosine distance, probabilistic linear discriminant analysis) to measure their similarity or the likelihood of being spoken by the same speaker. Over the past years, the i-vector framework has been one of the most dominant approaches for speech embedding [2], [3]. The widespread popularity of the i-vector framework in the speaker verification community can be attributed to its ability to summarize the distributive pattern of the speech with a relatively small amount of training data in an unsupervised manner.

In recent years, various methods utilizing deep learning architectures for extracting embedding vectors have been proposed, and have shown better performance than the i-vector framework when a large amount of training data is available [4]. In [5], a deep neural network (DNN) for frame-level speaker identification was trained and the averaged activation from the last hidden layer, namely the d-vector, was taken as the embedding vector for text-dependent speaker verification. In [4], [6], a speaker identification model consisting of a frame-level network and a segment-level network was trained, and the hidden layer activation of the segment-level network (i.e., the x-vector) was extracted as the embedding vector. In [7], long short-term memory (LSTM) layers were adopted to capture the contextual information within the d-vector, and the embedding network was trained to directly optimize the verification score (e.g., cosine similarity) in an end-to-end fashion. The end-to-end d-vector framework was further
enhanced in [8] by applying a different weight (i.e., attention) to each frame-level activation while obtaining the d-vector, which enables the embedding network to attend more to the frames with a relatively higher amount of speaker-dependent information. In [9], a generalized end-to-end loss function, which optimizes the embedding vector to move towards the centroid of the true speaker while departing from the centroid of the most confusing speaker, was introduced to train the end-to-end d-vector system more efficiently. In [10] and [11], a variational autoencoder (VAE)-based architecture was trained in an unsupervised manner to extract an embedding vector for short-duration speaker verification. Despite their success in well-matched conditions, the deep learning-based embedding methods are vulnerable to the performance degradation caused by mismatched conditions (e.g., channel, noise) [12].

In real life applications, numerous factors can contribute to the mismatches in speaker verification [1]. Especially in forensic situations, channel mismatch often occurs since police officers usually acquire voice recordings using various recording devices (e.g., hidden microphones, mobile phones) [13]. Such variation in recording devices is known to cause variability in the speech distribution, which leads to low speaker identification or verification performance.

Recently, many attempts have been made to extract an embedding vector robust to mismatched conditions. Conventionally, various studies focused on adapting the back-end scoring model (e.g., PLDA) [14] or training the embedding network with an augmented dataset containing various nuisance variability [15]. These methods are proven to be effective when the dataset for the target condition (e.g., a noisy evaluation domain) is scarce, but since they do not intervene during the embedding extraction, their performance may be bottlenecked by the speaker discriminative capability of the embedding network. Unlike the aforementioned domain adaptation techniques, there have been several methods which aim to directly disentangle the undesired variability while extracting the speaker embeddings. In [12], [16], inspired by the usage of the gradient reversal strategy in image classification [17], [18] and robust speech recognition [19], [20], the embedding networks were trained to minimize the speaker classification error while maximizing the error of the subtask (e.g., noise or channel type classification) with the use of a gradient reversal layer. Although the gradient reversal strategy has shown meaningful improvement in performance, domain adversarial training using a gradient reversal layer is known to be very unstable and sensitive to hyper-parameter settings [21]. In [22], the embedding network was trained to maximize the error of a subtask (i.e., noise type classification) by using an adversarial training strategy similar to the generative adversarial network (GAN) [23]. The speaker embedding network and the noise classification network are trained competitively; the noise classification network is trained to discriminate the noise type correctly, and at the same time the embedding network is trained to discriminate the speaker while having high uncertainty on the noise type. When training the speaker embedding network, bit-inverted one-hot labels (i.e.,
anti-labels) were used for noise classification, which forces the embedding network to distribute its output equally over the wrong noise labels. Though the anti-label strategy has proven its strength in noise-robust speaker embedding [22], adversarial training is known to be extremely unstable and difficult [24].

In this paper, we propose a novel approach to disentangle the nuisance attribute information from the speaker embedding vector without the use of gradient reversal or adversarial training. The proposed method employs an embedding network similar to the conventional methods (e.g., d-vector and x-vector). However, unlike the conventional embedding networks, which produce a single embedding vector per utterance, the proposed embedding network simultaneously extracts speaker- and nuisance attribute-dependent (e.g., recording device-, emotion-dependent) embedding vectors, hence we call the proposed technique joint factor embedding (JFE). In the JFE technique, the embedding network is trained in a fully supervised manner simultaneously with the speaker and nuisance attribute (e.g., channel, emotion) discriminator networks, where each discriminator is trained to take the embedding vector as input and identify its respective target. Analogous to the conventional speaker embedding systems, the proposed embedding network is trained to produce a speaker embedding vector with high speaker discriminability. On the other hand, to disentangle the non-speaker information from the speaker embedding vector, we propose two different ways to increase the nuisance attribute uncertainty inherent in the speaker embedding vector. One way is to train the embedding network to extract a speaker embedding vector that maximizes the entropy in nuisance attribute identification, and the other is to decrease the relevancy between the speaker and nuisance embedding vectors by minimizing the mean absolute Pearson's correlation (MAPC) [25].

In order to evaluate the performance of the proposed system in a realistic scenario, we conducted a set of experiments using two datasets:

• RSR2015 Part 3 dataset: a random digits strings speaker verification corpus consisting of speech samples recorded from 6 different hand-held devices [26], [27].
• VoxCeleb1 dataset: a text-independent speaker verification corpus consisting of speech samples with 8 different emotional states [28].

The experimental results show that the proposed method outperforms the conventional disentanglement methods (i.e., gradient reversal, anti-label) in terms of equal error rate (EER). Moreover, the proposed system performed better than the conventional x-vector on short duration speech samples, which are likely to lack significant phonetic information.

The contributions of this paper are as follows:

• We propose a new method to train a speaker embedding network robust to nuisance attributes, which can be done easily without the use of adversarial training or gradient reversal learning.
• We compared the proposed speaker embedding technique with conventional methods for multi-device and emotional speaker verification.
• We evaluated the proposed speaker embedding technique on speech utterances with various durations.

The rest of this paper is organized as follows: We first briefly describe the conventional embedding network architecture and the disentanglement methods based on gradient reversal and anti-labels in Section II. In Section III, the newly proposed JFE scheme is presented. The experiments and results are shown in Section IV. Finally, Section V concludes the paper.

FIGURE 1: (a) LSTM-based d-vector system trained with the softmax loss. (b) LSTM-based d-vector system trained with the end-to-end loss.
II. DEEP LEARNING-BASED SPEAKER EMBEDDING
A. DEEP EMBEDDING NETWORK
Two of the most widely used speaker embedding techniques are the LSTM-based d-vector [9] and the TDNN (time-delay DNN)-based x-vector system [4]. In both frameworks, given a speech utterance X with T frames, a sequence of frame-level acoustic features {x_1, ..., x_T} extracted from X is fed into the frame-level network. In the d-vector system, one of the most widely used techniques for text-dependent speaker recognition, the frame-level network is composed of LSTM layers, which help capture the temporal correlation. On the other hand, the frame-level network of the x-vector system consists of TDNN layers, and is often used for text-independent speaker recognition. Once the frame-level outputs {h_1, ..., h_T} are obtained, they are aggregated to obtain an utterance-level representation. One way of aggregating the frame-level outputs is to compute the weighted average as

\omega = \sum_{t=1}^{T} \alpha_t h_t    (1)

where \alpha_t \in [0, 1] is a normalized weight, which is computed by

\alpha_t = \frac{\exp(e_t)}{\sum_{t'=1}^{T} \exp(e_{t'})}.    (2)

In (2), the frame-level score (i.e., attention) e_t is computed as follows:

e_t = v_t^{\top} \tanh(W_t h_t + b_t)    (3)

where v_t, W_t, and b_t are trainable parameters and the superscript \top indicates the transpose operation. By using a different weight for each frame, speech frames with relatively higher speaker-relevancy can contribute more to the embedding vector.

The embedding network is trained by either minimizing the speaker identification loss [5] or directly optimizing the verification performance (i.e., end-to-end speaker verification) [9]. In the first case (i.e., an embedding network trained for identification), as shown in Fig. 1a, a feed-forward neural network for classifying the speakers in the training set is trained jointly with the embedding network. The speaker classification network takes the utterance-level representation \omega as input and has an N-dimensional softmax output \tilde{y}(\omega), where N corresponds to the number of training speakers. Given the one-hot speaker label y, the embedding and classification networks are trained to minimize the following cross-entropy loss function:

L_{spkr} = -\sum_{n=1}^{N} y_n \log \tilde{y}_n(\omega)    (4)

where y_n and \tilde{y}_n(\omega) are the n-th components of y and \tilde{y}(\omega), respectively.

For training the end-to-end speaker verification system (i.e., an embedding network trained for verification), a mini-batch of J × K utterances is fed into the embedding network, where the mini-batch is composed of J speakers, and each speaker has K utterances. As depicted in Fig. 1b, the scaled cosine similarity between each embedding vector and the centroid of the embedding vectors from each speaker is computed by

S_{jk,i} = a \cdot \cos(\omega_{jk}, c_i) + d    (5)

where a and d are trainable parameters, and \cos(\omega_{jk}, c_i) is the cosine similarity between the utterance-level representation \omega_{jk} extracted from the k-th utterance of the j-th speaker and the centroid c_i of the i-th speaker's utterance-level representations (1 ≤ j, i ≤ J and 1 ≤ k ≤ K).
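To make the attentive pooling of (1)–(3) above concrete, here is a minimal NumPy sketch; the array shapes, the toy dimensions, and the assumption that v, W, and b are shared across frames are ours, not the paper's.

```python
import numpy as np

def attentive_pooling(h, v, W, b):
    """Aggregate frame-level outputs h (T x D) into one utterance-level
    vector via the attention of Eqs. (1)-(3)."""
    # Frame-level scores e_t = v^T tanh(W h_t + b), Eq. (3)
    e = np.tanh(h @ W.T + b) @ v                 # (T,)
    # Normalized weights alpha_t via softmax, Eq. (2)
    alpha = np.exp(e - e.max())
    alpha = alpha / alpha.sum()
    # Weighted average, Eq. (1)
    return alpha @ h                             # (D,)

# Toy usage: 100 frames of 256-dimensional frame-level outputs.
rng = np.random.default_rng(0)
h = rng.standard_normal((100, 256))
v, W, b = rng.standard_normal(64), rng.standard_normal((64, 256)), np.zeros(64)
omega = attentive_pooling(h, v, W, b)            # 256-dim representation
```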
For each utterance-level representation \omega_{jk} in the mini-batch, the embedding network is trained to maximize the following end-to-end loss function:

L_{e2e} = S_{jk,j} - \log \sum_{i=1, i \neq j}^{J} \exp(S_{jk,i}).    (6)

The end-to-end system is known to outperform the softmax method when a large amount of data is used for training [6], [7].

Once the embedding network is trained, the utterance-level representation \omega [9], or the hidden layer activation of the speaker classification network [4], can be used as the speaker embedding vector.
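The scoring and loss of (5)–(6) can be sketched as follows, assuming a mini-batch tensor of shape (J, K, D); the function name and the initial values of a and d are illustrative only.

```python
import numpy as np

def end_to_end_loss(emb, a=10.0, d=-5.0):
    """emb: (J, K, D) utterance-level representations, J speakers with
    K utterances each. Returns the loss of Eq. (6) summed over the
    mini-batch (to be maximized)."""
    J, K, _ = emb.shape
    centroids = emb.mean(axis=1)                          # c_i, (J, D)
    # Cosine similarities cos(omega_jk, c_i), then Eq. (5) scaling.
    e = emb / np.linalg.norm(emb, axis=-1, keepdims=True)
    c = centroids / np.linalg.norm(centroids, axis=-1, keepdims=True)
    S = a * np.einsum('jkd,id->jki', e, c) + d            # (J, K, J)
    loss = 0.0
    for j in range(J):
        for k in range(K):
            negatives = np.delete(S[j, k], j)             # S_{jk,i}, i != j
            loss += S[j, k, j] - np.log(np.exp(negatives).sum())
    return loss

# Toy usage: J=4 speakers, K=5 utterances, 256-dim representations.
rng = np.random.default_rng(0)
print(end_to_end_loss(rng.standard_normal((4, 5, 256))))
```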
B. CONVENTIONAL DISENTANGLEMENT METHODS

FIGURE 2: (a) Standard multi-task learning (MTL) architecture. (b) Domain adversarial training via gradient reversal layer (GRL).

Recently, disentangling various non-speaker factors (e.g., channel type, noise type, noise level) from the embedding vector has become an important issue in speaker verification [12], [16], [22]. Most of the techniques developed to address this issue are based on multi-task learning (MTL) approaches [29], where the embedding network is trained to optimize two tasks: a main task (i.e., speaker classification) and a subtask (e.g., channel classification), as shown in Fig. 2a. The objective of the MTL-based disentanglement techniques is to achieve the best performance in the main task while degrading the performance in the subtask.
1) Gradient reversal strategy
One way to achieve this is the gradient reversal strategy, which has shown meaningful performance in channel-robust [16] and noise-robust [12] speaker verification. As shown in Fig. 2b, the gradient reversal strategy adds a gradient reversal layer (GRL) [17] between the subtask network and the embedding network. Let \theta_{emb}, \theta_{main}, and \theta_{sub} denote the parameters of the embedding, main task, and subtask networks, respectively. The GRL performs an identity transformation on the input during forward propagation and reverses the gradient by multiplying it with a negative scalar -\lambda during backpropagation. When jointly training the networks, the parameters are updated as

\theta_{emb} \leftarrow \theta_{emb} - l \cdot \left( \frac{\partial L_{main}}{\partial \theta_{emb}} - \lambda \frac{\partial L_{sub}}{\partial \theta_{emb}} \right),    (7)

\theta_{main} \leftarrow \theta_{main} - l \cdot \frac{\partial L_{main}}{\partial \theta_{main}},    (8)

\theta_{sub} \leftarrow \theta_{sub} - l \cdot \frac{\partial L_{sub}}{\partial \theta_{sub}}    (9)

where l, L_{main}, and L_{sub} are the learning rate and the loss functions for the main task and subtask, respectively. For extracting a channel-robust embedding for speaker verification, L_{main} would be the speaker cross-entropy L_{spkr} defined in (4), and L_{sub} would be the channel cross-entropy, which can be computed as follows:

L_{chan} = -\sum_{m=1}^{M} r_m \log \tilde{r}_m(\omega)    (10)

where M is the number of different channels (e.g., recording devices) in the training set, and r_m and \tilde{r}_m(\omega) are the m-th components of the one-hot channel label r and the channel classifier's softmax output \tilde{r}(\omega), respectively.
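The GRL itself amounts to a few lines in a framework that supports custom gradients. Below is a TensorFlow 2 sketch (the paper's systems were implemented in Tensorflow [33], but this particular helper is our illustration, not the authors' code).

```python
import tensorflow as tf

def make_gradient_reversal(lam):
    """Identity in the forward pass; multiplies the incoming gradient
    by -lam in the backward pass, realizing the update of Eq. (7)."""
    @tf.custom_gradient
    def grl(x):
        def grad(dy):
            return -lam * dy
        return tf.identity(x), grad
    return grl

# Usage: insert between the embedding and the subtask classifier, e.g.
# channel_logits = subtask_network(make_gradient_reversal(1.0)(embedding))
```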
2) Anti-loss strategy
Another way to achieve disentanglement is to train the embedding network and the subtask network in a competitive manner via adversarial training [22]. The subtask network is trained to classify the channel identity correctly given the embedding vector, as in (10). On the other hand, the main task and embedding networks are trained to discriminate the speaker by minimizing (4) while not performing well on the subtask. In order to ensure high uncertainty on the subtask, [22] introduces anti-labels when computing the cross-entropy for the subtask. The anti-label is obtained by flipping each bit in the one-hot label vector. This indicates that for channel disentanglement, the anti-loss can be computed as follows:

L_{anti-dev} = -\sum_{m=1}^{M} (1 - r_m) \log \tilde{r}_m(\omega).    (11)

By minimizing L_{anti-dev} and L_{spkr} simultaneously, the embedding network is trained to produce a speaker discriminative embedding vector which is robust to channel variability.
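A sketch of the anti-label loss in (11); the function and variable names are ours:

```python
import numpy as np

def anti_label_loss(r_onehot, r_pred, eps=1e-8):
    """Eq. (11): cross-entropy against the bit-inverted one-hot label,
    pushing the channel posterior away from the true channel."""
    anti = 1.0 - r_onehot                 # flip every bit of the label
    return -np.sum(anti * np.log(r_pred + eps))

# Example: true device is class 2 of M=6. A posterior peaked on the
# true device yields a higher anti-loss than a uniform posterior.
r = np.eye(6)[2]
print(anti_label_loss(r, np.full(6, 1 / 6)))   # uniform posterior
print(anti_label_loss(r, 0.9 * r + 0.02))      # confident, correct posterior
```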
III. JOINT FACTOR EMBEDDING

A. JOINT FACTOR EMBEDDING NETWORK ARCHITECTURE
Analogous to the conventional disentanglement techniques [12], [16], [22], the proposed method is based on the MTL framework. However, as depicted in Fig. 3, unlike the standard MTL embedding system, the embedding network of the proposed framework extracts two different embedding vectors simultaneously: the speaker embedding \omega_{spkr} and the nuisance embedding \omega_{nuis}.
FIGURE 3: The architecture of the proposed joint factor embedding system.

The speaker embedding vector \omega_{spkr} is trained to be dependent solely on the speaker variability, while the nuisance embedding vector \omega_{nuis} is trained to be dependent on the nuisance (e.g., channel, emotion) variability only. When obtaining \omega_{spkr} and \omega_{nuis}, different weights are used for aggregating the frame-level outputs as

\omega_{spkr} = \sum_{t=1}^{T} \alpha_{spkr,t} h_t,    (12)

\omega_{nuis} = \sum_{t=1}^{T} \alpha_{nuis,t} h_t    (13)

where \alpha_{spkr,t} and \alpha_{nuis,t} are the speaker and nuisance attention weights, respectively, which are obtained as in (2). The reason why we use separate attention weights for obtaining \omega_{spkr} and \omega_{nuis} is that frames with high speaker-dependent information are not always guaranteed to have high nuisance attribute-dependent information. For instance, speaker-dependent information will be high on speech frames, while channel-dependent information will be rather consistent across all frames, since even non-speech frames are affected by the recording channel. Once the embedding vectors are extracted, both \omega_{spkr} and \omega_{nuis} are fed into the speaker and nuisance classification networks.
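Since (12)–(13) reuse the pooling of (1)–(3) with separate parameters per factor, a sketch only needs two attention heads over the same frame-level outputs; the dimensions and names below are illustrative.

```python
import numpy as np

def pool(h, v, W, b):
    # Attentive pooling of Eqs. (1)-(3); one parameter set per head.
    e = np.tanh(h @ W.T + b) @ v
    a = np.exp(e - e.max())
    a /= a.sum()
    return a @ h

rng = np.random.default_rng(0)
h = rng.standard_normal((100, 256))        # shared frame-level outputs
heads = {name: (rng.standard_normal(64),
                rng.standard_normal((64, 256)),
                np.zeros(64)) for name in ('spkr', 'nuis')}
omega_spkr = pool(h, *heads['spkr'])       # Eq. (12)
omega_nuis = pool(h, *heads['nuis'])       # Eq. (13)
```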
B. TRAINING FOR JOINT FACTOR EMBEDDING

TABLE 1: Main tasks and subtasks for the embedding vectors of the joint factor embedding scheme.

            Main task                 Subtask
ω_spkr      Speaker classification    Nuisance classification
ω_nuis      Nuisance classification   Speaker classification
1) Discriminative training
As described in Table 1, the embedding vectors \omega_{spkr} and \omega_{nuis} are trained with different main task and subtask specifications. In order to maximize the discriminability on their main tasks, the following cross-entropy loss functions are minimized:

L_{s-s,CE} = -\sum_{n=1}^{N} y_n \log \tilde{y}_n(\omega_{spkr}),    (14)

L_{c-c,CE} = -\sum_{m=1}^{M} r_m \log \tilde{r}_m(\omega_{nuis}).    (15)

By minimizing (14) and (15) simultaneously, the embedding network is trained to produce \omega_{spkr} with high speaker-dependent information and \omega_{nuis} with high nuisance attribute-dependent information. Moreover, the attention weights \alpha_{spkr,t} and \alpha_{nuis,t} will be trained to focus on the frames with more meaningful information for their main tasks.
2) Disentanglement training
In this paper, we propose two types of loss functions to perform disentanglement in the subtasks of the embedding vectors \omega_{spkr} and \omega_{nuis}. One way for disentanglement is to directly maximize the entropy (or uncertainty) on their subtasks while training. For \omega_{spkr} and \omega_{nuis}, the entropies [30] on their subtasks can be computed as

L_{s-c,E} = -\sum_{n=1}^{N} \tilde{y}_n(\omega_{nuis}) \log \tilde{y}_n(\omega_{nuis}),    (16)

L_{c-s,E} = -\sum_{m=1}^{M} \tilde{r}_m(\omega_{spkr}) \log \tilde{r}_m(\omega_{spkr}).    (17)
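A sketch of the entropy terms (16)–(17); maximizing them drives the subtask posterior toward the uniform distribution, whose entropy is log N (or log M). The toy posteriors below are our illustration.

```python
import numpy as np

def entropy(p, eps=1e-8):
    """Shannon entropy of a softmax posterior, as in Eqs. (16)-(17)."""
    return -np.sum(p * np.log(p + eps))

# L_{s-c,E}: speaker posterior computed from the nuisance embedding.
# L_{c-s,E}: nuisance posterior computed from the speaker embedding.
uniform = np.full(6, 1 / 6)                      # e.g., M = 6 devices
peaked = np.eye(6)[0] * 0.95 + uniform * 0.05    # confident posterior
print(entropy(uniform), entropy(peaked))         # ~log 6 = 1.79 vs. ~0.24
```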
By maximizing (16) and (17), the uncertainty of the outputs on the subtasks is maximized, leading the conditional distribution of the subtask classes to approach the uniform distribution.

Another way to perform disentanglement is to regularize the embedding vectors \omega_{spkr} and \omega_{nuis} so as to have low correlation, instead of directly maximizing the uncertainty on their subtasks. This can be achieved by maximizing the negative MAPC [25], which can be computed across the mini-batch by

L_{nMAPC} = -\frac{1}{F} \sum_{f=1}^{F} \frac{| \mathrm{cov}(\omega_{spkr,f}, \omega_{nuis,f}) |}{\mathrm{std}(\omega_{spkr,f}) \, \mathrm{std}(\omega_{nuis,f})}    (18)

where cov is the covariance, std is the standard deviation, F is the dimensionality of the embedding vectors, and \omega_{spkr,f} and \omega_{nuis,f} are the f-th elements of \omega_{spkr} and \omega_{nuis}, respectively. Since zero correlation indicates that the two variables are not related, minimizing the MAPC between \omega_{spkr} and \omega_{nuis} reduces the relevancy between the two embedding vectors.

The proposed JFE system is trained by simultaneously minimizing the discriminative losses (i.e., cross-entropies) depicted in (14) and (15), while maximizing the disentanglement losses in (16), (17), and (18). In short, the embedding network is trained to minimize the following loss function:

L_{JFE} = L_{s-s,CE} + L_{c-c,CE} - L_{s-c,E} - L_{c-s,E} - L_{nMAPC}.    (19)

By optimizing the JFE network, the speaker embedding vector \omega_{spkr} is trained to be speaker discriminative while having high uncertainty on the nuisance attribute, and the nuisance embedding vector \omega_{nuis} aims to be nuisance attribute discriminative while having high uncertainty on the speaker.
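A sketch of the MAPC computation in (18) over a mini-batch, with the combined objective (19) indicated in a comment; the matrix shapes and names are assumptions.

```python
import numpy as np

def mapc(spkr, nuis, eps=1e-8):
    """Mean absolute Pearson's correlation between the paired dimensions
    of two embedding batches of shape (B, F); this is Eq. (18) without
    the leading minus sign (L_nMAPC = -mapc(...))."""
    s = spkr - spkr.mean(axis=0)
    n = nuis - nuis.mean(axis=0)
    cov = (s * n).mean(axis=0)                    # per-dimension covariance
    corr = cov / (spkr.std(axis=0) * nuis.std(axis=0) + eps)
    return np.abs(corr).mean()

# Eq. (19), with the cross-entropy and entropy terms computed elsewhere:
# L_jfe = L_ss_ce + L_cc_ce - L_sc_ent - L_cs_ent - (-mapc(w_spkr, w_nuis))
# i.e., minimizing L_jfe also minimizes the MAPC between the embeddings.

rng = np.random.default_rng(0)
print(mapc(rng.standard_normal((32, 256)), rng.standard_normal((32, 256))))
```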
IV. EXPERIMENTS

A. CHANNEL DISENTANGLEMENT EXPERIMENTS
1) Database
In order to evaluate the performance of the proposed technique for a real-life application of speaker verification where multiple recording devices are involved for enrollment and testing, a set of experiments was conducted on the RSR2015 dataset [26], [27], which is a speaker verification dataset recorded using 6 different hand-held devices (i.e., 1 Samsung Nexus, 2 Samsung Galaxy S, 1 HTC Desire, 1 Samsung Tab, 1 HTC Legend). For training the embedding networks, we used the background and development subsets of the RSR2015 Part 3 dataset, consisting of utterances (recorded from all six devices) spoken by 194 speakers (100 male and 94 female speakers).

The evaluation was performed according to the RSR2015 Part 3 (random digits string) protocol [31], in which 106 speakers (57 male and 49 female speakers) are involved. From the RSR2015 Part 3 evaluation dataset, the 10-digit strings of sessions 1, 4, and 7 were used for enrollment, and the 5-digit strings of sessions 2, 3, 5, 6, 8, and 9 were used for testing.
2) Experimental Setup
To investigate the effects of the proposed JFE strategy on different embedding architectures, two types of frameworks were used for embedding extraction: d-vector and x-vector. For the d-vector-based systems, a single 512-dimensional unidirectional LSTM layer with a projection layer [32] (projected to 256 dimensions) was used. By aggregating the LSTM outputs via the weighted average described in (1), 256-dimensional embedding vectors were obtained. Each classification network (i.e., the speaker and channel identifiers) consisted of a single 256-dimensional rectified linear unit (ReLU) hidden layer and a softmax output layer, where the output size corresponds to the number of speakers or devices within the training set (e.g., a 194-dimensional softmax output for the speaker classifier and a 6-dimensional softmax output for the channel classifier). The acoustic features used in the d-vector-based systems were 19-dimensional Mel-frequency cepstral coefficients (MFCCs) and the log-energy, extracted every 10 ms using a 20 ms Hamming window. Together with the delta and delta-delta of the 19-dimensional MFCCs and the log-energy, the frame-level feature used in our experiments was a 60-dimensional vector.

For the x-vector-based systems, 5 TDNN layers were used as the frame-level network, as in the Kaldi x-vector recipe [4]. The frame-level outputs of the last TDNN layer were aggregated via attention pooling (1) and followed by a ReLU layer, resulting in a 512-dimensional embedding vector. The classification networks in the x-vector-based systems consisted of a single 512-dimensional ReLU hidden layer and a softmax output layer. The acoustic features used in the x-vector-based systems were 30-dimensional MFCCs extracted every 10 ms, using a 20 ms Hamming window.

The implementation of the embedding systems was done via Tensorflow [33], and the networks were trained using the ADAM optimization technique [34]. All the experimented networks were trained with a learning rate of 0.001 and a batch size of 32 for 12,000 iterations. Cosine similarity was used for computing the verification scores in the experiments. In our experiments, the EER was evaluated as the performance measure. The EER indicates the error at the operating point where the false alarm rate (FAR) and the false reject rate (FRR) are the same.
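For reference, a small sketch of how the EER can be computed from target and non-target trial scores by sweeping the decision threshold; the function name and the toy score distributions are ours.

```python
import numpy as np

def compute_eer(target_scores, nontarget_scores):
    """EER: operating point where the false alarm rate equals the false
    reject rate, found by sweeping the threshold over all scores."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    far = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    frr = np.array([(target_scores < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return 100.0 * (far[i] + frr[i]) / 2.0        # percent

rng = np.random.default_rng(0)
tgt = rng.normal(0.6, 0.2, 1000)      # toy target-trial cosine scores
non = rng.normal(0.0, 0.2, 10000)     # toy non-target-trial cosine scores
print(f"EER = {compute_eer(tgt, non):.2f}%")
```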
TABLE 2: EER (%) comparison between the speaker embedding vectors extracted from the joint factor embedding networks trained with various disentanglement losses.

Loss                     EER [%]
Only discriminative
Entropy
nMAPC
Entropy + nMAPC
FIGURE 4: DET curves (FRR [%] vs. FAR [%]) of the JFE systems trained with the various disentanglement losses (Only discriminative, Entropy, nMAPC, Entropy + nMAPC).
3) Comparison between different disentanglement loss terms
In this experiment, we compare the performance of the speaker embeddings obtained from the d-vector-based JFE system trained with the different disentanglement loss terms discussed in Section III. The experimented methods are as follows:

• Only discriminative: speaker embedding vector extracted from the JFE network trained only with the discriminative loss functions in (14) and (15) (which is essentially multi-task learning for the embedding network to encode speaker and nuisance discriminative information),
• Entropy: speaker embedding vector extracted from the JFE network trained with the discriminative loss functions in (14), (15) and the entropy-based disentanglement losses in (16) and (17),
• nMAPC: speaker embedding vector extracted from the JFE network trained with the discriminative loss functions in (14), (15) and the negative MAPC-based disentanglement loss in (18),
• Entropy + nMAPC: speaker embedding vector extracted from the JFE network trained with the discriminative loss functions in (14), (15) and both the entropy-based and the negative MAPC-based disentanglement losses in (16), (17) and (18).

Table 2 gives the EER results obtained by using these embeddings. As shown in the results, the embeddings extracted from the JFE networks trained with either Entropy or nMAPC for disentanglement greatly improved the performance compared to Only discriminative, which is essentially a standard MTL embedding technique. This implies that both nMAPC and Entropy are capable of training the embedding network to produce speaker embedding vectors disentangled from non-speaker factors. In particular, nMAPC showed a relative improvement of 17.99% compared to Only discriminative. The best verification performance was achieved by using both disentanglement loss terms (i.e., Entropy + nMAPC), yielding a relative improvement of 25.27% in terms of EER. From this, we could assume that nMAPC and Entropy are useful for disentangling the channel variability from the speaker embedding. The DET curves are depicted in Figure 4.
FIGURE 5: The joint factor embedding training loss values (L_{s-s,CE}, L_{c-c,CE}, L_{c-s,E}, L_{s-c,E}) over the training iterations.
FIGURE 6: t-SNE plots of the speaker and channel embedding vectors extracted from 10 speakers and 3 devices. The x and y axes indicate the 1st and 2nd dimensions of the 2D t-SNE projection, respectively. (a) and (c) are the t-SNE plots of the speaker embedding vectors, and (b) and (d) are the t-SNE plots of the channel embedding vectors. Different colors in (a) and (b) indicate different speakers, and different colors in (c) and (d) indicate different devices.
4) Training Analysis
In order to check whether the training scheme of the proposed JFE system achieves our objective (i.e., maximizing the speaker discriminability and channel uncertainty in ω_spkr), we analyzed the training losses described in (14)–(17) of the d-vector-based JFE system.
FIGURE 7: Attention weights of the d-vector (JFE) for utterances speaking the sentence "only lawyers love millionaires". (a) Attention weights for the speaker embedding vector. (b) Attention weights for the channel embedding vector.

As shown in Fig. 5, due to the large difference between the number of unique speakers and devices (i.e., 194 speakers and 6 devices), the initial values of L_{s-s,CE} and L_{s-c,E} were higher than those of L_{c-c,CE} and L_{c-s,E}. The cross-entropy losses (i.e., L_{s-s,CE} and L_{c-c,CE}) decreased quickly toward 0 as the training iterations increased. On the other hand, the entropy losses (i.e., L_{s-c,E} and L_{c-s,E}) stayed near their initial values throughout the training. This indicates that, as expected, the proposed training scheme increases the discriminability of the speaker and channel embeddings on their main tasks while keeping their uncertainty on the subtasks high.

In Fig. 6, the t-SNE plots [35] of the speaker and channel embedding vectors of 10 speakers and 3 devices are shown. As can be seen in Figs. 6a and 6c, the speaker embedding vectors ω_spkr were well separated between different speakers but were highly overlapped when it comes to different devices. Meanwhile, as shown in Figs. 6b and 6d, the channel embedding vectors ω_chan were separately distributed in terms of the device, while they were inseparable in terms of speakers. This confirms that the embedding vectors extracted from the proposed JFE system are discriminative on their main tasks, but are invariant with respect to their subtasks.

Moreover, in Fig. 7, the attention weights for an utterance speaking the sentence "only lawyers love millionaires" (i.e., the 1st sentence of the RSR2015 Part 1 dataset) are shown. It is interesting to see that the difference between the speaker attention weights α_spkr across the frames was quite dramatic, which indicates that α_spkr is likely to attend to certain frames. On the other hand, the channel attention weights α_chan were relatively consistent across all frames. These results strongly support our assumption that frames with high speaker-dependent information are concentrated in specific frames, while channel-dependent information is similar across the speech segment.
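The t-SNE projections of Fig. 6 can be reproduced along the following lines with scikit-learn; the embedding matrix and label vector below are random placeholders standing in for the extracted ω_spkr or ω_nuis vectors and their speaker/device labels.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# embs: (N, D) speaker or channel embedding vectors;
# labels: (N,) integer speaker or device labels (placeholders here).
rng = np.random.default_rng(0)
embs = rng.standard_normal((300, 256))
labels = rng.integers(0, 10, 300)

proj = TSNE(n_components=2, perplexity=30).fit_transform(embs)
plt.scatter(proj[:, 0], proj[:, 1], c=labels, cmap='tab10', s=8)
plt.xlabel('t-SNE dim 1')
plt.ylabel('t-SNE dim 2')
plt.show()
```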
5) Comparison between the joint factor embedding scheme and conventional disentanglement methods
In this experiment, we compared the embedding vectors obtained from the proposed joint factor embedding scheme with those obtained from the conventional disentanglement techniques discussed in Section II. The experimented training strategies are as follows:

• Softmax: embedding extracted from an embedding network trained with the softmax objective in (4),
• Gradient reversal: embedding extracted from an embedding network trained with the gradient reversal strategy as described in (7), where λ was set to 0 at the beginning and linearly increased every iteration, reaching 1 at the end of the training, as in [19],
• Anti-loss: embedding extracted from an embedding network trained with the anti-loss as described in (11), using the same adversarial training strategy described in [22],
• JFE (proposed): speaker embedding extracted from the proposed JFE system trained with the discriminative loss functions in (14) and (15) and both the entropy-based disentanglement losses in (16) and (17) and the negative MAPC-based disentanglement loss in (18).

Table 3 shows the performance of the d-vector- and x-vector-based systems trained with the methods described above. Generally, the Anti-loss disentanglement strategy showed performance enhancement, achieving a relative improvement of 35.39% in terms of EER in the d-vector-based experiment. On the other hand, the Gradient reversal method showed only slightly improved or even worse performance compared to Softmax. Meanwhile, the speaker embedding extracted from the proposed JFE scheme yielded the best performance in all architectures (i.e., d-vector and x-vector), achieving a relative improvement of 18.39% in EER compared to that of d-vector (Softmax). This indicates that the proposed JFE system is capable of disentangling the complicated corruptions (i.e., corruption via channel) introduced by different recording devices.

TABLE 3: EER (%) comparison between the speaker embedding vectors extracted from the proposed joint factor embedding and the other embedding techniques.
            Objective            EER [%]
d-vector    Softmax
            Gradient reversal
            Anti-loss
            JFE (proposed)
x-vector    Softmax
            Gradient reversal
            Anti-loss
            JFE (proposed)

In addition, Table 4 shows the performance comparison between the state-of-the-art embedding techniques for random digit strings speaker verification (i.e., DNN i-vectors and Uncertainty normalized HMM/i-vector) [36] and the x-vector-based embedding network trained with the proposed JFE scheme. As shown in the results, the Uncertainty normalized HMM/i-vector performs better than the x-vector (Softmax) by a large margin. This is mainly attributed to the fact that the Uncertainty normalized HMM/i-vector is trained to model the within-digit variability and scored with prior knowledge of the set of digits being uttered within the test set. Therefore it is not surprising that the x-vector (Softmax) performs worse than the HMM/i-vector system, since it is trained and evaluated with no information on the context. However, despite the innate disadvantage of the x-vector framework in random digit strings speaker verification, the proposed x-vector (JFE) outperformed the Uncertainty normalized HMM/i-vector with a relative improvement of 46.05% in terms of male trial EER.
6) Device disentanglement in domain-mismatch scenario
In this experiment, we compared the performance of the conventional x-vector and the proposed JFE system in a cross-domain text-independent speaker verification scenario. More specifically, both embedding systems were trained using the entire RSR2015 dataset and evaluated on the VoxCeleb1 evaluation subset, which is a dataset collected from Youtube videos recorded in a wide variety of channel and environmental conditions (e.g., videos shot on hand-held devices, interviews from red carpets).

As depicted in Table 5, the embeddings extracted from the systems trained with RSR2015 showed severe performance degradation. Such degradation was likely caused by the vast variety of channel and environmental conditions within VoxCeleb1, which are known to cause high within-speaker variability of the extracted speaker embedding vectors. Although the RSR2015 dataset is recorded from multiple different devices, the number of recording devices is limited (i.e., 6 devices) and the speech samples are relatively noise-free since they were recorded in an office environment [26], [27], [31]. Therefore training the embedding system using only the RSR2015 dataset may be insufficient to tackle the challenging condition of the VoxCeleb1 evaluation set.

TABLE 4: Gender-dependent EER (%) comparison between the speaker embedding vectors extracted from the x-vector-based embedding systems and the state-of-the-art i-vector-based systems.
Methods                                       EER [%]
                                              Male    Female
x-vector (Softmax)
DNN i-vectors [36]                            1.70    2.69
Uncertainty normalized HMM/i-vector [36]      1.52    1.77
x-vector (GRL)
x-vector (Anti-loss)
x-vector (JFE)

Hence the x-vector system trained only for speaker discrimination using RSR2015 showed a relative decrement of 94.83% in terms of EER compared to the network trained with the VoxCeleb1 training set. On the other hand, the degradation of the JFE system, trained to disentangle the device factor from the speaker embedding, was 71.55%, which outperformed the x-vector trained with the same dataset with a relative improvement of 11.95%. This indicates that even in a domain-mismatch scenario, the proposed JFE is able to alleviate the performance degradation caused by recording device variability.
B. EMOTION DISENTANGLEMENT
Emotion variability can cause severe performance degradation in speaker recognition [37], but emotion disentanglement has not been investigated as much as other nuisance attributes, such as noise or channel distortion. This may be due to the challenging nature of emotion disentanglement: unlike noise or channel, emotional variability is caused by the speaker's vocal tract, which also creates speaker variability. In this subsection, we apply the proposed JFE framework to disentangling the variability induced by the speaker's emotional state.
1) Dataset
In order to evaluate the performance of the proposed technique for emotion disentanglement, a set of experiments was conducted based on the VoxCeleb1 dataset [28] and the emotion labels provided by the EmoVoxCeleb teacher system [38] (∼vgg/research/cross-modal-emotions/). For training the embedding networks, we used the development subset of the VoxCeleb1 dataset, consisting of 148,642 utterances collected from 1,211 speakers. According to the emotion labels in EmoVoxCeleb, a total of 8 emotions are observed in the VoxCeleb1 dataset (i.e., neutral, happy, surprise, sad, angry, disgust, fear, contempt).

The evaluation was performed according to the original VoxCeleb1 trial list, which consists of 4,874 utterances spoken by 40 speakers. The duration of the trial utterances was between 3.97 seconds and 69.05 seconds.

TABLE 5: EER (%) comparison between the speaker embedding vectors extracted from the proposed joint factor embedding and the conventional x-vector framework evaluated on the VoxCeleb1 evaluation set.
Objective             Training data    EER [%]
x-vector (softmax)    VoxCeleb1        11.6
                      RSR2015          22.6
x-vector (JFE)        RSR2015          19.9
2) Experimental Setup
The acoustic features used in the experiments were 30-dimensional MFCCs extracted every 10 ms, using a 20 ms Hamming window. The embedding networks were trained with segments consisting of 250 frames, using the ADAM optimization technique.

For the baseline x-vector framework and the joint factor embedding system, 5 TDNN layers were used as the frame-level network according to the Kaldi x-vector recipe [4]. The TDNN outputs are aggregated as described in (1) and fed into the utterance-level classification networks (i.e., the speaker and emotion identifiers). Each utterance-level classification network consisted of two 512-dimensional LeakyReLU hidden layers and a softmax output layer, where the output size corresponds to the number of speakers or emotions within the training set. All the experimented networks were trained with a learning rate of 0.001 and a batch size of 256 for 74,321 iterations. Cosine similarity was used for computing the verification scores in the experiments.
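A sketch of the cosine scoring used for the verification trials, under the assumption that a speaker model is the average of the (possibly multiple) enrollment embeddings; the names are ours.

```python
import numpy as np

def score_trial(enroll_embs, test_emb):
    """Cosine score of a verification trial: average the enrollment
    embeddings into a speaker model, then compare with the test one."""
    model = np.mean(enroll_embs, axis=0)
    model = model / np.linalg.norm(model)
    test = test_emb / np.linalg.norm(test_emb)
    return float(model @ test)

rng = np.random.default_rng(0)
print(score_trial(rng.standard_normal((3, 512)), rng.standard_normal(512)))
```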
3) Comparison between the joint factor embedding scheme and conventional embedding techniques
In this experiment, we compare the embedding vectors obtained from the proposed joint factor embedding scheme and the conventional x-vector framework, along with techniques reported in recent studies, including VGG-M, ResNet-34 and end-to-end verification systems [39], [40]. The experimented methods are as follows:

• i-vector [39]: the i-vector performance reported in [39],
• VGG [39]: the performance of the embedding extracted from VGG-M, a CNN architecture known to perform well on image and speaker classification, reported in [39],
• Generalized end-to-end [40]: the performance of the ResNet-34-based end-to-end speaker verification system trained with the generalized end-to-end loss (6), reported in [39],
• All-speaker hard negative mining end-to-end [40]: the performance of the ResNet-34-based end-to-end speaker verification system trained with the all-speaker hard negative mining loss, a modified version of the softmax loss for robust verification, reported in [39],
• x-vector (softmax) [39]: the x-vector performance reported in [39],
• x-vector (our implementation): the performance of our implementation of x-vector (softmax),
• CNN-embedding [39]: the performance of the embedding extracted from a CNN-based architecture, reported in [39],
• x-vector (JFE): the performance of the speaker embedding extracted from the proposed JFE system trained to disentangle the emotional factor using the loss functions (14)–(18).

TABLE 6: EER (%) comparison between the speaker embedding vectors extracted from the proposed joint factor embedding and the conventional methods. In the Data augmentation column, X indicates an embedding network trained with no augmentation and O indicates a network trained with the augmented training set.

Methods                                            Scoring             Data augmentation   EER [%]
i-vector [39]                                      PLDA                X                   8.8
VGG [39]                                           Cosine similarity   X                   7.8
Generalized end-to-end [40]                        Cosine similarity   X                   10.7
All-speaker hard negative mining end-to-end [40]   Cosine similarity   X                   5.6
x-vector (softmax) [39]                            Cosine similarity   X                   11.3
                                                   PLDA                X                   7.1
                                                   PLDA                O                   6.0
x-vector (our implementation)                      PLDA                O                   4.9
CNN-embedding [39]                                 Cosine similarity   X                   7.3
                                                   PLDA                X                   5.9
                                                   PLDA                O                   5.3
x-vector (JFE)                                     Cosine similarity   X
                                                   PLDA                X
                                                   PLDA                O

As shown in Table 6, the proposed JFE outperformed the conventional methods with both cosine similarity and PLDA backends. Especially when using PLDA as the backend, the JFE achieved a relative improvement of 8.16% compared to x-vector (our implementation) in terms of EER. Moreover, training the JFE with the augmented training data described in [39] (i.e., noise and reverberation augmentation) further improved the performance. The results demonstrate that although the proposed JFE is composed of a simple x-vector-like network, it can provide embeddings with higher speaker discriminative information than systems with more complicated architectures.

In addition, we evaluated the conventional x-vector framework and the proposed joint factor embedding scheme on short duration speech samples. Each evaluation was done using randomly truncated trial utterances, and the average EERs computed over three evaluations for each duration group are depicted in Fig. 8. As shown in the results, the performance of both the joint factor embedding framework and the conventional x-vector degraded as the duration decreased. This may be due to the lack of phonetically informative frames, since a critical amount of speaker-relevant information is contained in the phonetic characteristics [41]. However, the emotion-disentangled speaker embedding obtained by the proposed JFE outperformed the conventional x-vector even with short duration speech segments.
V. CONCLUSION
In this paper, a novel approach for extracting an embedding vector robust to the variability caused by nuisance attributes for speaker verification is proposed. In order to disentangle the nuisance variability from the speaker embedding vector, we introduce a JFE scheme where two types of embedding vectors are extracted, each dependent solely on the speaker or nuisance attribute, respectively. The proposed JFE network is trained simultaneously with the speaker and nuisance attribute classification networks, where the speaker and nuisance embedding vectors are optimized to have good discriminability on their main task while having high uncertainty on their subtask.

To evaluate the performance of the embedding vector extracted from the proposed system in a realistic scenario, we conducted a set of speaker verification experiments using the RSR2015 dataset, which is composed of utterances recorded using multiple different hand-held devices, and the VoxCeleb1 dataset, which is composed of various emotional speech utterances. From the results, it is shown that the proposed JFE scheme is capable of obtaining speaker embedding vectors with high speaker discriminability while showing robustness to channel and emotional variability. Moreover, we observed that the proposed embedding vector performs better than the conventional embedding technique on short duration speech segments.

Although the proposed technique showed great improvement over the conventional methods, since the proposed JFE is trained in a fully supervised manner, it requires labels for not only the speakers but also the nuisance attributes. Thus, in our future study, we will expand the JFE technique to disentangle the non-speaker variability without the supervision of nuisance attribute labels. Moreover, we will improve the disentanglement performance by using more sophisticated methods for reducing the mutual information between the speaker and nuisance embedding vectors, rather than using a simple MAPC regularization.
FIGURE 8: EER (%) performance of the proposed joint factor embedding scheme (JFE + cosine similarity) and the conventional x-vector (x-vector + cosine similarity) on utterances of different durations [sec.].
ACKNOWLEDGMENT
This work was supported by the BK21 Plus program of the Creative Research Engineer Development for IT, Seoul National University in 2020. This research was supported and funded by the Korean National Police Agency. [Project Name: Real-time speaker recognition via voiceprint analysis / Project Number: PA-J000001-2017-101]
REFERENCES
[1] J. Hansen and T. Hasan, "Speaker recognition by machines and humans," IEEE Signal Process. Mag., vol. 32, no. 6, pp. 74–99, Oct. 2015.
[2] N. Dehak, et al., "Front-end factor analysis for speaker verification," IEEE Trans. Audio, Speech, and Lang. Process., vol. 19, no. 4, pp. 788–798, May 2011.
[3] P. Kenny, "A small footprint i-vector extractor," in Proc. Odyssey, 2012.
[4] D. Snyder, et al., "X-vectors: robust DNN embeddings for speaker recognition," in Proc. IEEE Int. Conf. Acoust., Speech, and Signal Process., 2018, pp. 5329–5333.
[5] E. Variani, et al., "Deep neural networks for small footprint text-dependent speaker verification," in Proc. IEEE Int. Conf. Acoust., Speech, and Signal Process., 2014, pp. 4080–4084.
[6] D. Snyder, et al., "Deep neural network embeddings for text-independent speaker verification," in Proc. INTERSPEECH, 2017, pp. 999–1003.
[7] G. Heigold, et al., "End-to-end text-dependent speaker verification," in Proc. IEEE Int. Conf. Acoust., Speech, and Signal Process., 2016, pp. 5115–5119.
[8] L. Wan, et al., "Attention-based models for text-dependent speaker verification," in Proc. IEEE Int. Conf. Acoust., Speech, and Signal Process., 2018, pp. 5359–5363.
[9] L. Wan, et al., "Generalized end-to-end loss for speaker verification," in Proc. IEEE Int. Conf. Acoust., Speech, and Signal Process., 2018, pp. 4879–4883.
[10] W. H. Kang and N. S. Kim, "Unsupervised learning of total variability embedding for speaker verification with random digit strings," Applied Sciences, vol. 9, no. 8, Apr. 2019.
[11] W. H. Kang and N. S. Kim, "Adversarially learned total variability embedding for speaker recognition with random digit strings," Sensors, vol. 19, no. 21, Oct. 2019.
[12] Z. Meng, et al., "Adversarial speaker verification," in Proc. IEEE Int. Conf. Acoust., Speech, and Signal Process., 2019, pp. 6216–6220.
[13] D. Ramos, et al., "Addressing database mismatch in forensic speaker recognition with Ahumada III: a public real-casework database in Spanish," in Proc. INTERSPEECH, 2008, pp. 1493–1496.
[14] X. Wang, et al., "VAE-based domain adaptation for speaker verification," in Proc. APSIPA, 2019.
[15] P. S. Nidadavolu, et al., "Low-resource domain adaptation for speaker recognition using cycle-GANs," in Proc. ASRU, 2019.
[16] X. Fang, et al., "Channel adversarial training for cross-channel text-independent speaker recognition," in Proc. IEEE Int. Conf. Acoust., Speech, and Signal Process., 2019, pp. 6221–6225.
[17] Y. Ganin, et al., "Domain-adversarial training of neural networks," JMLR, vol. 17, no. 59, pp. 1–35, 2016.
[18] Y. Ganin and V. Lempitsky, "Unsupervised domain adaptation by backpropagation," in Proc. ICML, 2015.
[19] Y. Shinohara, "Adversarial multi-task learning of deep neural networks for robust speech recognition," in Proc. INTERSPEECH, 2016, pp. 2369–2372.
[20] A. Tripathi, et al., "Adversarial learning of raw speech features for domain invariant speech recognition," in Proc. IEEE Int. Conf. Acoust., Speech, and Signal Process., 2018, pp. 5959–5963.
[21] K. Wei and C. Hsu, "Deep neural network approaches to speaker and language recognition," in Proc. British Machine Vision Conference, Sep. 2015.
[22] J. Zhou, et al., "Training multi-task adversarial network for extracting noise-robust speaker embedding," in Proc. IEEE Int. Conf. Acoust., Speech, and Signal Process., 2019, pp. 6196–6200.
[23] I. Goodfellow, et al., "Generative adversarial nets," in Proc. NIPS, 2014.
[24] M. Arjovsky and L. Bottou, "Towards principled methods for training generative adversarial networks," in Proc. ICLR, 2017.
[25] O. Morgen, "Representation learning for natural language," Ph.D. dissertation, Department of Computer Science and Engineering, University of Gothenburg, Gothenburg, Sweden, 2018.
[26] A. Larcher, et al., "The RSR2015: database for text-dependent speaker verification using multiple pass-phrases," in Proc. INTERSPEECH, 2012, pp. 2–5.
[27] A. Larcher, et al., "Text-dependent speaker verification: classifiers, databases and RSR2015," Speech Communication, vol. 60, pp. 56–77, May 2014.
[28] A. Nagrani, et al., "VoxCeleb: a large-scale speaker identification dataset," arXiv:1706.08612, 2017.
[29] R. Caruana, "Multitask learning," Machine Learning, vol. 28, pp. 41–75, 1997.
[30] J. T. Springenberg, "Unsupervised and semi-supervised learning with categorical generative adversarial networks," in Proc. ICLR, 2016.
[31] A. Larcher, et al., "Text-dependent speaker verification: classifiers, databases and RSR2015," Speech Communication, vol. 60, pp. 56–77, May 2014.
[32] H. Sak, et al., "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," in Proc. INTERSPEECH, 2014, pp. 338–342.
[33] M. Abadi, et al., TensorFlow: large-scale machine learning on heterogeneous systems, Software available at tensorflow.org.
[34] D. P. Kingma and J. L. Ba, "Adam: a method for stochastic optimization," in Proc. ICLR, 2015.
[35] L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.
[36] N. Maghsoodi, et al., "Speaker recognition with random digit strings using uncertainty normalized HMM-based i-vectors," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 11, 2019.
[37] L. Chen and Y. Yang, "Emotional speaker recognition based on model space migration through translated learning," Biometric Recognition, pp. 394–401, 2013.
[38] S. Albanie, et al., "Emotion recognition in speech using cross-modal transfer in the wild," ACM Multimedia, 2018.
[39] S. Shon, et al., "Frame-level speaker embeddings for text-independent speaker recognition and analysis of end-to-end model," in Proc. SLT, 2018.
[40] H. Heo, et al., "End-to-end losses based on speaker basis vectors and all-speaker hard negative mining for speaker verification," in Proc. INTERSPEECH, 2019.
[41] T. Hasan, R. Saeidi, J. H. L. Hansen, and D. A. van Leeuwen, "Duration mismatch compensation for i-vector based speaker recognition systems," in Proc. IEEE Int. Conf. Acoust., Speech, and Signal Process., 2013, pp. 7663–7667.
WOO HYUN KANG was born in Seoul, Korea, in 1990. He received the B.S. degree in electronics engineering from Kookmin University, Seoul, Korea, in 2014. He is currently pursuing the Ph.D. degree in electrical engineering and computer science at Seoul National University (SNU), Seoul, Korea. His research interests include speaker recognition, machine learning, and signal processing.
SUNG HWAN MUN was born in Incheon, Korea, in 1993. He received the B.S. degree in electronics engineering from Inha University, Incheon, Korea, in 2019. He is currently pursuing the Ph.D. degree in electrical engineering and computer science at Seoul National University (SNU), Seoul, Korea. His research interests include speaker recognition, machine learning, and signal processing.
MIN HYUN HAN was born in Seoul, Korea, in 1992. He received the B.S. degree in electrical & electronic engineering from Yonsei University, Seoul, Korea, in 2018. He is currently pursuing the Ph.D. degree in electrical engineering and computer science at Seoul National University (SNU), Seoul, Korea. His research interests include speaker recognition, machine learning, and signal processing.
NAM SOO KIM received the B.S. degree in electronics engineering from Seoul National University (SNU), Seoul, Korea, in 1988 and the M.S. and Ph.D. degrees in electrical engineering from the Korea Advanced Institute of Science and Technology in 1990 and 1994, respectively. From 1994 to 1998, he was with the Samsung Advanced Institute of Technology as a Senior Member of Technical Staff. Since 1998, he has been with the School of Electrical Engineering, SNU, where he is currently a Professor. His research areas include speech signal processing, speech recognition, speech/audio coding, speech synthesis, adaptive signal processing, machine learning, and mobile communication.