Open-set Short Utterance Forensic Speaker Verification using Teacher-Student Network with Explicit Inductive Bias
Mufan Sang, Wei Xia, John H.L. Hansen
Center for Robust Speech Systems, University of Texas at Dallas, TX 75080
{mufan.sang, wei.xia, john.hansen}@utdallas.edu

Abstract
In forensic applications, it is very common that only small naturalistic datasets consisting of short utterances in complex or unknown acoustic environments are available. In this study, we propose a pipeline solution to improve speaker verification on a small actual forensic field dataset. By leveraging large-scale out-of-domain datasets, a knowledge distillation based objective function is proposed for teacher-student learning, which is applied to short utterance forensic speaker verification. The objective function collectively considers speaker classification loss, Kullback-Leibler divergence, and similarity of embeddings. In order to make the trained deep speaker embedding network robust on a small target dataset, we introduce a novel strategy to fine-tune the pre-trained student model towards the forensic target domain by utilizing the model both as the fine-tuning start point and as a reference in regularization. The proposed approaches are evaluated on our newly collected naturalistic forensic corpus.

Index Terms: text-independent speaker verification, short utterance, teacher-student learning, transfer learning
1. Introduction
Speaker verification (SV) is the task of verifying the identity claimed by a speaker from his or her voice, accepting or rejecting the claim. In recent years, speaker verification has seen significant improvement with the fast development of deep learning and the great success of various advanced neural networks applied to deep speaker embedding systems [1, 2, 3, 4, 5, 6]. Generally, speaker embedding systems consist of a front-end frame-level feature extractor, an utterance-level encoding layer, one or more fully-connected layers, and an output classifier. Like the i-vector system [7], speaker embedding networks can encode variable-length utterances into a fixed-length speaker representation. Although studies [2, 4] have shown that deep speaker embedding systems outperform the traditional i-vector solution, especially on short utterances, short utterance speaker verification remains challenging because of the insufficient phonetic information short segments provide [8].

Recently, deep neural network (DNN) and convolutional neural network (CNN) based speaker embedding systems have been applied to this problem and have obtained effective performance improvements [9, 10, 11, 12]. In [10], a raw-waveform CNN-LSTM architecture was proposed to extract phonetic-level features, which can help compensate for missing phonetic information. In [9, 11], new aggregation methods such as the NetVLAD layer and Time-Distributed Voting (TDV) were designed to improve the efficiency of the aggregation process. In order to maintain strong speaker discrimination, most studies use large amounts of in-domain data to train the speaker embedding networks for short utterances and evaluate them in the same domain. However, often only a very small dataset is available for a specific target domain, and directly training the model from scratch on it will greatly degrade performance. As one application of speaker verification, forensic problems are complex and challenging because there are practical constraints on the availability of speaker data and the length of utterances in both test and enrollment [13, 14]. Also, inconsistencies usually exist in forensic-relevant datasets, such as unknown recording locations, noise, reverberation, very limited duration of useful human speech, etc. [15, 16]. These variations can lead to significant performance degradation for speaker verification systems.

In this work, we collect a new challenging naturalistic dataset for forensic analysis of audio from various cities across the USA where detectives were investigating actual homicides. We focus on the problem of short utterance speaker verification in forensic domains with only a small naturalistic target dataset available. To alleviate this problem, we first propose a knowledge distillation [17] based objective function that collectively considers the Kullback-Leibler divergence (KLD) of posteriors, the similarity of embeddings, and the speaker classification loss to improve the performance of short utterance speaker verification with teacher-student (T-S) learning. Second, to avoid losing the speaker-discriminative knowledge initially learned from the large number of speakers in out-of-domain datasets, we introduce a new fine-tuning strategy that encodes an explicit inductive bias towards the pre-trained model by using the pre-trained network as the fine-tuning start point as well as a reference penalized in the regularization.
In this way, the fine-tuned model takes advantage of the strong speaker-discriminative power on short utterances and learns new speaker characteristics from the small target dataset.

We compare our fine-tuning strategy against the standard L2 regularization (known as weight decay), which often results in a suboptimal solution for the target domain with only an implicit inductive bias towards the pre-trained model. In this paper, we show that the proposed knowledge distillation loss function transfers knowledge from teacher to student more effectively, and that the introduced fine-tuning strategy outperforms L2 regularization by adequately preserving the features learned from large-scale source datasets.

In Sec. 2, we describe the proposed pipeline, including the teacher-student learning framework used for short utterance speaker verification and the proposed fine-tuning strategy. Data description, experimental settings, and analysis results are reported in Sec. 3 and 4. Finally, we conclude this paper in Sec. 5.
Figure 1: (a) Flow diagram of our teacher-student learning framework. The KL-divergence loss between posteriors, cosine distance between embeddings, and speaker classification loss are shown in the figure. (b) Proposed fine-tuning strategy to transfer the pre-trained student model to the small target dataset.
2. Methodology
In this section, we introduce our proposed approaches for short utterance speaker verification on the small forensic target dataset, where target-domain utterances are much shorter and carry additional variation factors compared with the source domain. First, the speaker embedding model should learn strong speaker-discriminative power on short utterances from large out-of-domain datasets. Next, the model is refined to perform well on short utterances in the target domain with only a much smaller in-domain dataset available. We utilize T-S learning with the proposed objective function to improve system performance on short utterances. Then, we use our fine-tuning strategy to adapt the trained student model from the large source dataset to the small target dataset with an explicit inductive bias. The architecture of the entire pipeline is illustrated in Figure 1.
2.1. Teacher-student learning

In this study, the teacher model is trained independently, and the student model learns to mimic the probability distribution of the teacher's output while also accounting for the speaker classification loss. Our teacher-student learning framework is described in the remainder of this section.
For both teacher and student models, the softmax cross-entropy loss is used as the speaker classification loss. We also explore the angular softmax (A-Softmax) loss [18], which has stronger discriminative ability obtained by maximizing the angular margin between speaker embeddings, as sketched below.
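To make the margin idea concrete, the following is a minimal PyTorch sketch of an A-Softmax-style layer in the spirit of [18]. It is our own illustration, not the authors' implementation: the class name is ours, and the piecewise psi(theta) function of SphereFace is simplified to a plain cos(m*theta).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AngularSoftmax(nn.Module):
    """Simplified A-Softmax (SphereFace-style) loss: class weights are
    L2-normalized so logits are ||x|| * cos(theta); for the target class
    the angle is multiplied by the margin m before taking the cosine.
    The piecewise psi(theta) of [18] is omitted for brevity."""
    def __init__(self, emb_dim: int, n_speakers: int, m: int = 2):
        super().__init__()
        self.m = m
        self.weight = nn.Parameter(torch.randn(n_speakers, emb_dim))

    def forward(self, emb: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        w = F.normalize(self.weight, dim=1)               # unit-norm class weights
        cos_theta = F.linear(F.normalize(emb, dim=1), w)  # (batch, n_speakers)
        cos_theta = cos_theta.clamp(-1.0 + 1e-7, 1.0 - 1e-7)
        theta = torch.acos(cos_theta)
        cos_m_theta = torch.cos(self.m * theta)           # sharper target logit
        one_hot = F.one_hot(labels, w.size(0)).bool()
        logits = torch.where(one_hot, cos_m_theta, cos_theta)
        logits = logits * emb.norm(dim=1, keepdim=True)   # restore magnitude
        return F.cross_entropy(logits, labels)
```

Compared with the plain softmax, the target-class logit shrinks from cos(theta) to cos(m*theta), so embeddings must cluster within a tighter angular region around their class weight to be classified correctly.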
In conventional T-S learning, the student learns to mimic the teacher by minimizing the Kullback-Leibler divergence between the teacher and student output distributions given parallel data [19]. When adopting T-S learning for the short utterance problem, the KLD objective can be written as

L_{KLD} = -\sum_{i=1}^{I} \sum_{n=1}^{N} P(y_n \mid x_{i,T}) \log P(y_n \mid x_{i,S}),   (1)

where y_n refers to speaker n, and x_{i,T} and x_{i,S} refer to the i-th input sample of the teacher and student models. In this work, the teacher model is fed with long utterance samples, and the student is fed with a shorter crop of the same utterances. P(y_n | x_{i,T}) and P(y_n | x_{i,S}) are the posteriors of the i-th sample predicted by the teacher and student models. In this way, the student model is forced to generate posteriors similar to those of the teacher model.

As the representation of speakers, the embedding is the key to a speaker verification system. The problem would be effectively addressed if a model could produce similar, or even identical, embeddings for short utterances vs. long utterances. Following this idea, directly constraining the similarity between embeddings of short and long utterances should be more efficient [10, 20]. This can be achieved by minimizing the distance between embeddings of short and long utterances. We use the cosine distance (COS) as the distance metric, with the corresponding loss function formulated as

L_{EMD} = -\sum_{i=1}^{I} \frac{\epsilon_{iT} \cdot \epsilon_{iS}}{\|\epsilon_{iT}\| \, \|\epsilon_{iS}\|},   (2)

where \epsilon_{iT} and \epsilon_{iS} represent the embeddings predicted by the teacher and student models for the i-th sample.

For optimization, we use a multi-task objective function that combines the losses introduced above, where L_class denotes the speaker classification loss. Three combinations of objective functions are considered:

(1) L_class + L_KLD: on top of the speaker classification loss, adding soft labels enables the student model to push its discriminative power toward that of the teacher model trained on long utterances.

(2) L_class + L_EMD: replacing the KLD loss with the embedding-based loss enables the student model to generate embeddings more similar to those of the teacher.

(3) L_class + L_KLD + L_EMD: collectively combining all three terms encourages both the speaker-discriminative power and the similarity of embeddings between short and long utterances (see the sketch after this list).

We compare the efficacy of the trained student networks with baseline models trained without T-S learning. We also compare the performance of student models optimized by the different objective functions above.
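The combined objective can be written in a few lines of PyTorch. This is a minimal sketch under assumed interfaces: `student` and `teacher` are hypothetical modules returning an (embedding, logits) pair, and the teacher's weights stay frozen as described above.

```python
import torch
import torch.nn.functional as F

def ts_distillation_loss(student, teacher, long_utt, short_utt, labels,
                         use_kld=True, use_emd=True):
    """Joint T-S objective of Sec. 2.1: speaker classification loss on the
    student, plus KLD between teacher/student posteriors (Eq. 1) and negative
    cosine similarity between their embeddings (Eq. 2)."""
    with torch.no_grad():                       # teacher weights stay fixed
        emb_t, logits_t = teacher(long_utt)     # full-length utterance
    emb_s, logits_s = student(short_utt)        # short crop of same utterance

    loss = F.cross_entropy(logits_s, labels)    # L_class (softmax variant)
    if use_kld:                                 # L_KLD, Eq. (1)
        kld = -(F.softmax(logits_t, dim=1)
                * F.log_softmax(logits_s, dim=1)).sum(dim=1).mean()
        loss = loss + kld
    if use_emd:                                 # L_EMD, Eq. (2)
        loss = loss - F.cosine_similarity(emb_t, emb_s, dim=1).mean()
    return loss
```

Setting `use_emd=False` recovers combination (1), and `use_kld=False` recovers combination (2).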
2.2. Fine-tuning with explicit inductive bias

Considering the limited size, naturalistic contexts, and the inter-speaker and intra-speaker variations contained in the short utterance forensic corpus [21], directly training the deep speaker embedding model from scratch will not ensure discriminative power over the speakers and will also lead to severe overfitting. Fine-tuning with the commonly used L2 regularization helps alleviate overfitting to some extent, but it only provides an implicit inductive bias towards the pre-trained model. Therefore, we introduce a novel fine-tuning strategy that produces an explicit inductive bias towards the pre-trained model by setting that model as the start point of fine-tuning as well as the reference for regularization. With the restriction of the reference, the capacity of the fine-tuned model will not be adapted blindly. Accordingly, we investigate the efficiency of different regularizers during fine-tuning and add the suffix -SP (start point) to the corresponding regularizers. Assume W \in R^n represents the parameter vector containing all adapted parameters of the fine-tuned model. The investigated regularizers are described below.

L2-norm. This is the most common penalty term used for regularization, also called weight decay. The penalty term is

\Theta(W) = \alpha \|W\|_2^2,   (3)

where \alpha is the weight of the penalty.

L2-SP. Let W^0 denote the parameter vector of the pre-trained student model, which is trained on the out-of-domain datasets. This pre-trained model is set as the start point and the reference. Thus, we penalize the L2 distance between the adapted parameter vector W and W^0 to obtain

\Theta(W) = \alpha \|W - W^0\|_2^2.   (4)

Because the network architecture usually changes when adapting the model from the source datasets to a new target dataset, the penalty can be separated into two parts: one penalizes the part of the architecture W_s shared between the pre-trained and fine-tuned networks, and the other penalizes the modified architecture W_m. As shown in Figure 1-(b), they are represented as L_s and L_m. We obtain

\hat{\Theta}(W) = \alpha \|W_s - W_s^0\|_2^2 + \beta \|W_m\|_2^2.   (5)

L1-SP. Changing the L2-norm of the start-point term to the L1-norm, we have

\hat{\Theta}(W) = \alpha \|W_s - W_s^0\|_1 + \beta \|W_m\|_2^2.   (6)

The L1 penalty encourages some parameters to be exactly equal to the corresponding parts of the pre-trained model. Consequently, L1-SP can be considered a trade-off between L2-SP and the scheme of freezing some parts of the pre-trained network.

In our experiments, we compare the investigated regularizers based on their performance on the small target dataset. We also explore the fine-tuning performance with different selections of network layers.
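Both SP penalties only require a frozen snapshot of the pre-trained parameters. A minimal PyTorch sketch of Eqs. (5)-(6) follows; the function name and the {name: tensor} snapshot convention are ours, and the defaults for alpha and beta anticipate the values reported in Sec. 3 (0.1 and 0.01).

```python
import torch

def sp_penalty(model, ref_params, alpha=0.1, beta=0.01, norm="l2"):
    """L2-SP / L1-SP penalties of Eqs. (5)-(6): parameters shared with the
    pre-trained student model (the start point) are pulled toward their
    reference values W_s^0, while newly added parameters (e.g. the replaced
    last layer) only receive plain weight decay. `ref_params` is a snapshot
    {name: tensor} of the retained pre-trained weights, taken before
    fine-tuning, with the replaced layer's entries removed."""
    shared, novel = 0.0, 0.0
    for name, p in model.named_parameters():
        if name in ref_params:                  # unchanged architecture, W_s
            diff = p - ref_params[name]
            shared = shared + (diff.abs().sum() if norm == "l1"
                               else diff.pow(2).sum())
        else:                                   # modified architecture, W_m
            novel = novel + p.pow(2).sum()
    return alpha * shared + beta * novel

# Usage sketch:
# ref = {n: p.detach().clone() for n, p in pretrained.named_parameters()}
# total_loss = classification_loss + sp_penalty(model, ref, norm="l1")
# Per-layer learning rates (cf. Sec. 3) can be set via optimizer groups, e.g.
# torch.optim.Adam([{"params": new_layer.parameters(), "lr": 1e-3},
#                   {"params": shared_layers.parameters(), "lr": 1e-5}])
```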
3. Experiments
3.1. Datasets

In this study, we collect a new challenging naturalistic forensic-relevant dataset, introduced in Sec. 1, which serves as the small target-domain corpus for fine-tuning and evaluation. For all experiments in Sec. 2.1, we use the VoxCeleb1 and VoxCeleb2 datasets, which were collected in the wild [22, 23]. The former contains 352 hours of audio from 1,251 speakers, and the latter consists of 2,442 hours of audio from 5,994 speakers. We train the teacher and student speaker embedding networks on the entire VoxCeleb2 dataset. A data augmentation method similar to that of [5] is adopted in the experiments.
3.2. Experimental setup

For both teacher and student models, 30-dimensional log-Mel filter-bank features are extracted with a frame length of 25 ms at a 10 ms shift. Mean normalization is applied over a sliding window of up to 3 seconds, and Kaldi energy-based VAD is used to remove silence frames. We use ResNet34 [24] as the encoder and a learnable dictionary encoding (LDE) layer [25] with 64 components to aggregate frame-level features into utterance-level representations. The student model is initialized to be identical to the teacher model, and the weights of the teacher model are fixed during student model training. With a mini-batch size of 64, we use the Adam optimizer [26], and the learning rate is decayed using the Noam method in [27]. For the teacher model, we apply settings similar to [25] for the network architecture and input samples. For the student, randomly cropped segments with a fixed length of 200 frames are used, as sketched below.
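As an illustration of the input pipeline, the following sketch builds one teacher-student training pair with torchaudio's Kaldi-compatible filter-bank extraction. The function name is ours, and the VAD and sliding-window mean normalization steps are omitted for brevity.

```python
import torch
import torchaudio

def make_teacher_student_pair(wav_path: str, crop_frames: int = 200):
    """One training pair as in Sec. 3.2: 30-dim log-Mel filter-banks
    (25 ms window, 10 ms shift); the teacher sees the full utterance,
    the student a random fixed-length crop of the same features."""
    waveform, sr = torchaudio.load(wav_path)
    feats = torchaudio.compliance.kaldi.fbank(
        waveform, sample_frequency=sr, num_mel_bins=30,
        frame_length=25.0, frame_shift=10.0)        # (num_frames, 30)
    start = torch.randint(0, max(1, feats.size(0) - crop_frames), (1,)).item()
    short = feats[start:start + crop_frames]        # student input
    return feats, short                             # (teacher, student)
```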
Fine-tuning. With the last layer replaced, the most common approach is to fine-tune only the last two fully-connected (FC) layers. We also explore two other layer selections: (1) the last two FC layers plus the LDE layer and the last residual block (Res4) of ResNet34; (2) all layers of the embedding network. For all fine-tuning experiments, we use the Adam optimizer with a learning rate decreased by a factor of 10 every 15 epochs. Two learning rates are used for different layers: 1e-3 for the replaced last layer and 1e-5 for the rest of the fine-tuned layers. For the best results achieved, the hyper-parameters \alpha and \beta for L2-SP and L1-SP are set to 0.1 and 0.01, and the weight decay is set to 0.001. The back-end comprises LDA with dimension reduction to 200, plus centering, whitening, length normalization, and PLDA.

4. Results and Analysis

We first investigate the performance of T-S learning on short utterances with different training objective functions. Table 1 presents the results of different models evaluated on our challenging forensic corpus.

Table 1: Performance of baseline systems, teacher and student models on the forensic evaluation set.

System            Distillation               EER (%)
(a) L_class = Softmax
ResNet34-LDE-L    -                          14.97
ResNet34-LDE-S    -                          13.35
ResNet34-LDE      L_class + L_KLD            12.31
ResNet34-LDE      L_class + L_EMD            12.03
ResNet34-LDE      L_class + L_KLD + L_EMD    11.65
(b) L_class = A-Softmax
ResNet34-LDE-L    -                          13.83
ResNet34-LDE-S    -                          12.79
ResNet34-LDE      L_class + L_KLD            11.85
ResNet34-LDE      L_class + L_EMD            11.34
ResNet34-LDE      L_class + L_KLD + L_EMD    11.12
With the same embedding network, the student model still boosts performance thanks to the proposed objective function for T-S learning. As shown in Table 1, different student objective functions contribute different levels of improvement, and all student models outperform their teacher and baseline models. In Table 1-(a), when using the softmax, the three student models reduce the EER of the baseline system to 12.31%, 12.03%, and 11.65%, respectively, while using the A-Softmax loss further boosts performance to 11.85%, 11.34%, and 11.12%. Based on these results, we find that the models optimized by L_class + L_EMD consistently outperform those using L_class + L_KLD, which indicates that speaker embedding quality is more relevant to system performance. Compared with the baselines, the models optimized by L_class + L_KLD + L_EMD consistently achieve the best performance for both softmax and A-Softmax, with the highest relative EER reductions of 12.7% and 13.1%. This makes sense, since the student model learns additional speaker embedding knowledge from long utterances beyond the speaker-discriminative power, which is more effective in compensating for short utterances.

The two student models optimized with A-Softmax by the proposed objective functions L_class + L_KLD + L_EMD and L_class + L_EMD produce the best two performances on 2-second segments of the VoxCeleb1 evaluation set, with EERs of 6.38% and 6.51%, respectively. They also produce the best two performances on the target dataset in Table 1. Thus, we pick them as the start points for the follow-up fine-tuning step.
Table 2: Performance comparison of different regularization strategies and layer selections on the best two student models (EER, %). The first column lists the regularizers; the remaining columns list the layer selections.

Regularizer        Last 2 FC   Last 2 FC + LDE + Res4   All layers
(a) A-Softmax & L_class + L_KLD + L_EMD
w/o regularizer    11.03       10.88                    11.36
L2-norm
L2-SP
L1-SP
(b) A-Softmax & L_class + L_EMD
w/o regularizer    11.28       11.20                    11.52
L2-norm
L2-SP
L1-SP

Table 2 shows the fine-tuning results on the forensic target dataset. In Table 2-(a), L2-norm, L2-SP, and L1-SP achieve their best performances with EERs of 10.52%, 10.14%, and 9.85%, respectively. In Table 2-(b), they achieve their best performances with EERs of 11.03%, 10.25%, and 10.09%, respectively. We observe that, with the restriction of the reference, L2-SP and L1-SP consistently outperform the L2-norm for all three layer selections. It is therefore beneficial to fine-tune the pre-trained model with an explicit inductive bias, especially when the target dataset is small. For the L2-norm, overfitting to the small amount of data becomes more severe when lower layers are adapted, and its performance begins to degrade when moving from adapting part of the network to adapting all layers. Compared to the best results achieved by L2-SP and L2-norm in Table 2, L1-SP outperforms both, reducing EER by a relative 2.9% and 6.4%, respectively. Considering the small size, short durations, and the other variations mentioned in Sec. 3.1 for the forensic corpus, L2-SP and L1-SP are clearly more efficient than weight decay, especially when lower layers are adapted, and L1-SP provides stricter regularization than L2-SP.
5. Conclusion
In this study, we focused on approaches for short utterance speaker verification on a small, challenging, naturalistic forensic dataset. Our results indicate that speaker-discriminative power and embedding similarity are two key factors for short utterance speaker verification. The proposed objective function for teacher-student learning was shown to transfer critical knowledge of these two factors from long utterances to short utterances more effectively. Using our novel fine-tuning strategy, speaker embedding networks were adapted to a new small target dataset while preserving the speaker-discriminative power learned from a large number of speakers. Experimental results show that our approaches can significantly improve the performance of speaker verification systems on small domain-specific short utterance datasets.

6. References

[1] D. Snyder, P. Ghahremani, D. Povey, D. Garcia-Romero, Y. Carmiel, and S. Khudanpur, "Deep neural network-based speaker embeddings for end-to-end speaker verification," in IEEE Spoken Language Technology Workshop (SLT). IEEE, 2016, pp. 165-170.
[2] C. Zhang and K. Koishida, "End-to-end text-independent speaker verification with triplet loss on short utterances," in Interspeech, 2017, pp. 1487-1491.
[3] C. Li, X. Ma, B. Jiang, X. Li, X. Zhang, X. Liu, Y. Cao, A. Kannan, and Z. Zhu, "Deep speaker: an end-to-end neural speaker embedding system," arXiv preprint arXiv:1705.02304, 2017.
[4] L. Wan, Q. Wang, A. Papir, and I. L. Moreno, "Generalized end-to-end loss for speaker verification," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4879-4883.
[5] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5329-5333.
[6] W. Xia, J. Huang, and J. H. Hansen, "Cross-lingual text-independent speaker verification using unsupervised adversarial discriminative domain adaptation," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5816-5820.
[7] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788-798, 2010.
[8] A. Kanagasundaram, R. Vogt, D. B. Dean, S. Sridharan, and M. W. Mason, "I-vector based speaker recognition on short utterances," in Proceedings of the 12th Annual Conference of the International Speech Communication Association. ISCA, 2011, pp. 2341-2344.
[9] W. Xie, A. Nagrani, J. S. Chung, and A. Zisserman, "Utterance-level aggregation for speaker recognition in the wild," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5791-5795.
[10] J.-w. Jung, H.-S. Heo, H.-j. Shim, and H.-J. Yu, "Short utterance compensation in speaker verification via cosine-based teacher-student learning of speaker embeddings," in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2019, pp. 335-341.
[11] A. Hajavi and A. Etemad, "A deep neural network for short-segment speaker recognition," arXiv preprint arXiv:1907.10420, 2019.
[12] A. Gusev, V. Volokhov, T. Andzhukaev, S. Novoselov, G. Lavrentyeva, M. Volkova, A. Gazizullina, A. Shulipa, A. Gorlanov, A. Avdeeva et al., "Deep speaker embeddings for far-field speaker recognition on short utterances," arXiv preprint arXiv:2002.06033, 2020.
[13] M. I. Mandasari, M. McLaren, and D. A. van Leeuwen, "Evaluation of i-vector speaker recognition systems for forensic application," 2011.
[14] A. Poddar, M. Sahidullah, and G. Saha, "Speaker verification with short utterances: a review of challenges, trends and opportunities," IET Biometrics, vol. 7, no. 2, pp. 91-101, 2017.
[15] A. K. H. Al-Ali, D. Dean, B. Senadji, V. Chandran, and G. R. Naik, "Enhanced forensic speaker verification using a combination of DWT and MFCC feature warping in the presence of noise and reverberation conditions," IEEE Access, vol. 5, pp. 15400-15413, 2017.
[16] T. J. Machado, J. Vieira Filho, and M. A. de Oliveira, "Forensic speaker verification using ordinary least squares," Sensors, vol. 19, no. 20, p. 4385, 2019.
[17] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," arXiv preprint arXiv:1503.02531, 2015.
[18] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, "SphereFace: Deep hypersphere embedding for face recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 212-220.
[19] L. Lu, M. Guo, and S. Renals, "Knowledge distillation for small-footprint highway networks," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 4820-4824.
[20] S. Wang, Y. Yang, T. Wang, Y. Qian, and K. Yu, "Knowledge distillation for small foot-print deep speaker embedding," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6021-6025.
[21] F. Bimbot, J.-F. Bonastre, C. Fredouille, G. Gravier, I. Magrin-Chagnolleau, S. Meignier, T. Merlin, J. Ortega-García, D. Petrovska-Delacrétaz, and D. A. Reynolds, "A tutorial on text-independent speaker verification," EURASIP Journal on Advances in Signal Processing, vol. 2004, no. 4, p. 101962, 2004.
[22] A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: a large-scale speaker identification dataset," arXiv preprint arXiv:1706.08612, 2017.
[23] J. S. Chung, A. Nagrani, and A. Zisserman, "VoxCeleb2: Deep speaker recognition," arXiv preprint arXiv:1806.05622, 2018.
[24] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.
[25] W. Cai, J. Chen, and M. Li, "Exploring the encoding layer and loss function in end-to-end speaker and language recognition system," arXiv preprint arXiv:1804.05160, 2018.
[26] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[27] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998-6008.