Speaker-Utterance Dual Attention for Speaker and Utterance Verification
Tianchi Liu, Rohan Kumar Das, Maulik Madhavi, Shengmei Shen, Haizhou Li
Pensees Pte Ltd, Singapore
Department of Electrical and Computer Engineering, National University of Singapore, Singapore
{liutianchi, jane.shen}@pensees.ai, {rohankd, maulik.madhavi, haizhou.li}@nus.edu.sg

Abstract
In this paper, we study a novel technique that exploits the interaction between speaker traits and linguistic content to improve both speaker verification and utterance verification performance. We implement the idea of speaker-utterance dual attention (SUDA) in a unified neural network. The dual attention refers to an attention mechanism for the two tasks of speaker and utterance verification. The proposed SUDA features an attention mask mechanism to learn the interaction between the speaker and utterance information streams. This helps to focus only on the information required for the respective task by masking the irrelevant counterparts. The studies conducted on the RSR2015 corpus confirm that the proposed SUDA outperforms the framework without the attention mask as well as several competitive systems for both speaker and utterance verification.
Index Terms: text-dependent speaker verification, utterance verification, attention, masking, RSR2015
1. Introduction
Speaker verification (SV) aims to verify the claimed identity of a person using given speech [1]. Its implementation is broadly categorized into text-dependent and text-independent based on the spoken content used for enrollment and testing [1]. The former deals with fixed short phrases, while the latter does not put any constraint on the speech content. A text-independent system generally requires more training and test data [2, 3] than a text-dependent one [4, 5] to maintain the same level of accuracy. Therefore, text-dependent systems are preferable in many real-world applications where the user's cooperation is possible.

The research on text-dependent SV has evolved considerably, from traditional dynamic time warping based template matching to deep learning. For benchmarking of technology progress, standard speech corpora such as RSR2015 and RedDots have been designed [6, 7]. It is found that modeling techniques such as the hierarchical multi-layer acoustic model (HiLAM) [6], the unsupervised hidden Markov model (HMM)-universal background model (UBM) [8], i-vector/HMM [9] and j-vector [10] benefit from temporal information in speech. Further, deep learning techniques have greatly improved the ability of speaker characterization [11–16].

In human SV, we would like test samples to be as parallel to the unknown sample as possible, so as to reduce the variability due to linguistic content to a minimum. Text-dependent SV allows us to do just that. Various studies consider text-dependent SV as performing two tasks, SV as the main task and utterance verification as a subtask, where the two tasks are optimized separately or jointly in order to improve the SV objective. For example, phonetic posteriorgrams derived using Gaussian mixture model (GMM) and deep neural network (DNN) frameworks are utilized to capture lexical information for text-dependent SV [17, 18]. One study shows that DNN based speaker embeddings benefit from lexical content information [19]; others suggest that lexical information can be used in different ways to compensate the SV scores for performance gains [20–22].

Prior studies have underscored the importance of content modeling in text-dependent SV. While utterance verification has been well studied as part of speech recognition [23, 24], it has not been given sufficient attention in the context of SV. Some consider text-dependent SV as a combination of two independent systems, namely SV and utterance verification [25, 26]. In our previous work [27], text-dependent SV was formulated as a unified speaker-utterance verification (SUV) system in a multi-task learning implementation. This is inspired by the human cognitive process, where we interpret and decode speaker traits and linguistic content in a corroborative manner [28, 29]. For example, by paying special attention to particular sounds while knowing the linguistic content, we verify the voice of a speaker; on the other hand, if we are familiar with a speaker, we tend to recognize his/her voice in a better way.

In the unified SUV system, we used a shared long short-term memory (LSTM) network and two independent LSTM output layers, one for speaker identity and another for utterance identity [27]. While the previously proposed unified SUV framework is effective, the interaction between the two output layers was not explored. We believe that both speaker and utterance verification can benefit from each other by exploring the temporal interaction between them.
This is also motivated by successful explorations in text-dependent SV that suggest the benefit of compensating lexical information [20–22]. Further, various attention models [30–32] suggest the possibility of focusing on task-specific compensation or masking. In this work, we propose a speaker-utterance dual attention, SUDA in short, for performing both speaker and utterance verification. The attention mechanism for compensating irrelevant information in both tasks is derived using a masking operation. The mask for attention is estimated frame-by-frame, and the attention mechanism establishes the temporal association between the speaker trait stream and the utterance content stream of the LSTM output. In addition, we note that, as the attention mechanism is applied to both branches (speaker and utterance) in the framework, it is referred to as dual attention. The studies in this work are conducted on the RSR2015 corpus [6]. The contribution of this work lies in the use of speaker-utterance dual attention in a single framework for performing both speaker and utterance verification.

The rest of the paper is organized as follows. Section 2 describes the proposed speaker-utterance dual attention mechanism for speaker and utterance verification. The experiments are detailed in Section 3, followed by the results and analysis in Section 4. The paper is concluded in Section 5.
Figure 1: Block diagram of the proposed SUDA for speaker and utterance verification. It consists of a shared LSTM and two LSTM output layers, which interact with each other through the attention mask network. NF and NF' denote the number of frames before and after convolution.
2. Speaker-Utterance Dual Attention
This section describes the proposed SUDA for speaker and utterance verification. As shown in Figure 1, the features extracted from the raw audio data are fed into an LSTM based recurrent network to characterize the temporal dynamics. This system is an extension of our earlier unified SUV framework [27].

As presented in our earlier study [27], the first shared layer is common to the speaker and utterance verification branches. The hidden representation from the first shared layer is then passed to two LSTM networks that focus on extracting valid information for each of the two sub-tasks, namely, speaker and utterance verification. We then discuss the improvements introduced in this work using dual attention by masking. A minimal sketch of this front-end is given below.
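The following PyTorch sketch illustrates the shared front-end described above: one LSTM layer common to both tasks, followed by one branch LSTM per task. The class and parameter names (SharedLSTMFrontEnd, feat_dim, hidden) and the hidden sizes are illustrative assumptions, not the exact configuration of the paper.

```python
import torch
import torch.nn as nn

class SharedLSTMFrontEnd(nn.Module):
    """Shared LSTM followed by one branch LSTM per task (speaker / utterance).

    Hidden size and single-layer LSTMs are assumptions for illustration;
    the exact configuration follows the unified SUV setup [27].
    """

    def __init__(self, feat_dim=60, hidden=256):
        super().__init__()
        self.shared = nn.LSTM(feat_dim, hidden, batch_first=True)    # common layer
        self.spk_branch = nn.LSTM(hidden, hidden, batch_first=True)  # speaker branch
        self.utt_branch = nn.LSTM(hidden, hidden, batch_first=True)  # utterance branch

    def forward(self, x):
        # x: (batch, frames, feat_dim) MFCC features
        h, _ = self.shared(x)
        h_spk, _ = self.spk_branch(h)   # speaker-oriented hidden stream
        h_utt, _ = self.utt_branch(h)   # utterance-oriented hidden stream
        return h_spk, h_utt
```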
Masking has been found to be effective in separating multiple sound sources from an audio mixture [33]. A more recent approach regards the speech separation problem as supervised learning that aims to discriminate different patterns, such as speech, speakers and background noise, which are learned from training data [33]. The masking operation can be viewed as an attention mechanism, where the features from one branch are masked to attend to specific features from another branch. Attention based models have been applied successfully to various tasks. In computer vision research, there are three main strategies for attention: spatial attention, channel attention and mixed attention [34–36]. Spatial attention is used to emphasize the area of interest, while channel attention is mainly used for feature recalibration in convolutional neural networks, such as the 'Squeeze-and-Excitation' block proposed in [35]. Further, the authors of [30] proposed an end-to-end framework with an attention model to combine frame-level features, which acts as an alignment operation for SV studies. The attention model, designed as a part of the speaker embedding network, is used to calculate the weighted mean of the frame-level feature maps to derive speaker embeddings with more discriminative speaker characteristics [31, 32].
In the proposed SUDA, convolution layers are used to extract feature maps from the hidden representation obtained after the LSTM, as shown in Figure 1. The objective is to map the hidden representation to a higher dimensional space so that the mutual information can be compensated by attention masks. The dynamic masks learn the information from the feature maps obtained from the convolutional layers. We use a sigmoid non-linearity to limit the activation of the mask. The dynamic masking is performed by learning the parameters from the feature maps and applying the sigmoid function [37]. The masking is formulated as:

mask_s = 1 − Sigmoid(fm_u)    (1)
mask_u = 1 − Sigmoid(fm_s)    (2)

where fm_u and fm_s represent the feature maps from the utterance verification and SV branches, respectively, while mask_u and mask_s indicate the dynamic masks for the corresponding branches. These masks are then multiplied with the corresponding stream of feature maps to suppress irrelevant information in that stream. The parameters of the masks are not fixed during the training or inference phase; they are derived according to the input audio data.

The operations performed so far produce frame-wise representations. In order to pool the information across the utterance, global average pooling (GAP) is performed. Next, fully connected (FC) layers are used to perform both the speaker and utterance verification tasks.

In this work, the attention mask is applied to every feature map through the sigmoid operation rather than the conventional softmax operation. Further, the weights of our attention masks are obtained from the branch of the other task. We note that the weights of the attention mask are thus tuned to utterance- and speaker-specific information. A minimal sketch of this cross-branch masking is given below.
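The following is a minimal PyTorch sketch of the cross-branch masking in Eqs. (1) and (2). The channel count, kernel size and the single convolution per branch are simplifying assumptions; Figure 1 shows additional convolution and batch normalization layers that are omitted here for brevity.

```python
import torch
import torch.nn as nn

class DualAttentionMask(nn.Module):
    """Cross-branch masking of Eqs. (1)-(2): each branch is attenuated by a
    mask computed from the *other* branch's feature maps."""

    def __init__(self, in_dim=256, n_channels=512, kernel_size=5):
        super().__init__()
        # One 1D convolution per branch maps the LSTM stream to feature maps.
        self.conv_spk = nn.Sequential(
            nn.Conv1d(in_dim, n_channels, kernel_size, stride=1, padding=0),
            nn.PReLU(), nn.BatchNorm1d(n_channels))
        self.conv_utt = nn.Sequential(
            nn.Conv1d(in_dim, n_channels, kernel_size, stride=1, padding=0),
            nn.PReLU(), nn.BatchNorm1d(n_channels))

    def forward(self, h_spk, h_utt):
        # h_spk, h_utt: (batch, frames, in_dim) branch LSTM outputs
        fm_s = self.conv_spk(h_spk.transpose(1, 2))   # speaker feature maps
        fm_u = self.conv_utt(h_utt.transpose(1, 2))   # utterance feature maps
        mask_s = 1.0 - torch.sigmoid(fm_u)            # Eq. (1)
        mask_u = 1.0 - torch.sigmoid(fm_s)            # Eq. (2)
        att_s = fm_s * mask_s                         # suppress content-related info
        att_u = fm_u * mask_u                         # suppress speaker-related info
        # Global average pooling over frames yields utterance-level vectors.
        return att_s.mean(dim=-1), att_u.mean(dim=-1)
```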
3. Experiments
We now discuss the database and experimental setup for the studies in the following subsections.

Table 1: Performance in EER (%) for the proposed SUDA in comparison to existing systems on the RSR2015 Part I corpus.
System              Male                                       Female
                    Development           Evaluation           Development           Evaluation
                    TW    IC    IW        TW    IC    IW       TW    IC    IW        TW    IC    IW
i-vector [6]        2.870 5.950 0.740     1.950 4.030 0.320    3.050 7.870 0.940     1.910 6.610 0.750
HiLAM [6]           1.660 3.690 0.490     0.820 2.470 0.190    1.770 3.240 0.450     0.610 2.960 0.140
Joint-spk-utt [27]  5.565 1.981 1.792     5.125 2.079 0.888    5.179 1.699 0.831     3.110 1.453 0.499
Unified SUV [27]    0.470 1.590 0.101     0.293 1.757 0.039    1.176 4.323 0.178     0.375 2.009 0.068
Utt-comp [22]       -     1.460 -         -     0.960 -        -     1.640 -         -     0.730 -
Utt-comp-uf [22]    -     1.460 -         -     0.960 -        -     1.460 -         -     -
mod-SUV
Proposed: SUDA      0.202 0.728 0.022     0.068 0.722 0.010    0.297 1.449 0.024     0.125
3.1. Database

The RSR2015 corpus is used for the studies in this work [6]. It contains data from 300 speakers, 143 female and 157 male. The corpus is divided into three parts based on the nature of the fixed phrases. Part I includes 30 fixed phrase utterances of 3-4 seconds duration, whereas Part II has 30 fixed short commands of 1-2 seconds duration. Part III contains random five- or ten-digit sequences. There are 9 sessions for each phrase from all the speakers. Out of those, the first, fourth and seventh sessions are used for speaker enrollment and the remaining sessions for testing, as per the RSR2015 evaluation protocol [6]. The RSR2015 corpus has three subsets, namely the background, development and evaluation sets, to evaluate system performance [6].

The test trials are grouped under four categories based on the test speaker and phrase labels, which are
Target Correct (TC), Impostor Correct (IC), Target Wrong (TW) and Impostor Wrong (IW). These further constitute three test conditions, where each condition considers TC as the target trials and one of the remaining three categories as the non-target trials. The performance is reported in terms of equal error rate (EER). In this work, we consider Part I and Part II of RSR2015 as they are suitable for both speaker and utterance verification.
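For reference, the following is a generic sketch of how EER can be computed from target and non-target trial scores; it is not the exact scoring tool used in our experiments.

```python
import numpy as np

def compute_eer(target_scores, nontarget_scores):
    """Equal error rate: the operating point where the false rejection rate
    (FRR) and false acceptance rate (FAR) are (approximately) equal."""
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(nontarget_scores))])
    labels = labels[np.argsort(scores)]
    # Rejecting all trials up to rank i: FRR rises and FAR falls monotonically.
    frr = np.cumsum(labels) / labels.sum()
    far = 1.0 - np.cumsum(1.0 - labels) / (1.0 - labels).sum()
    idx = np.argmin(np.abs(frr - far))
    return 0.5 * (frr[idx] + far[idx])
```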
3.2. Experimental Setup

The speech utterances are processed with a 20 ms frame size and a 10 ms shift to extract 60-dimensional (20 base + 20 ∆ + 20 ∆∆) mel frequency cepstral coefficient (MFCC) features using the KALDI toolkit (http://kaldi-asr.org/). The extracted features are normalized by cepstral mean and variance normalization using utterance-level mean and variance statistics.

In addition, we apply a feature-level triplet loss to further reduce the intra-class distance and increase the inter-class distance. It is applied on the 512-dimensional feature vector at the step after 1D global average pooling, as shown in Figure 1. We calculate the triplet loss in both branches, i.e., SV and utterance verification. The negative and positive samples in the training batch are searched for each iteration. The batch size of each iteration is set to 128 for all the experiments and these samples are chosen randomly. The loss of the proposed framework is calculated as:

L_total = L_Tspk + L_Tutt + L_spk + L_utt    (3)

where L_Tspk and L_Tutt are triplet losses, while L_spk and L_utt are negative log likelihood losses. The subscripts spk and utt represent the SV and utterance verification branches of SUDA, respectively.

The learning rate, optimizer, LSTM hidden layer and scoring follow the same configurations as in our previous work on the unified SUV [27]. We adopt the PyTorch toolkit (https://pytorch.org/) for the implementation. We empirically fixed the random seed to 2020 in our studies. In contrast to the previous unified SUV, we add a 1D convolution layer to the network, where the kernel size is 5, while the padding and stride are 0 and 1, respectively. The activation function between convolutional layers is the parametric rectified linear unit (PReLU) [38]. Further, in order to observe the impact of the attention mask, we also conduct experiments without the masking based attention block. We refer to the system with this setup as mod-SUV for comparison with the previous unified SUV [27] in this work.
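A minimal PyTorch sketch of the combined loss in Eq. (3) is shown below. The triplet margin and the way anchor/positive/negative indices are mined within a batch are assumptions for illustration.

```python
import torch.nn as nn

# Assumed margin and pre-mined (anchor, positive, negative) index tensors;
# the actual mining strategy in this work selects samples within each batch.
triplet = nn.TripletMarginLoss(margin=1.0)
nll = nn.NLLLoss()

def suda_loss(emb_spk, emb_utt, logp_spk, logp_utt,
              spk_triplets, utt_triplets, spk_labels, utt_labels):
    # emb_*:  (batch, 512) embeddings after 1D global average pooling
    # logp_*: (batch, n_classes) log-softmax outputs of the two branches
    a, p, n = spk_triplets
    loss_t_spk = triplet(emb_spk[a], emb_spk[p], emb_spk[n])  # L_Tspk
    a, p, n = utt_triplets
    loss_t_utt = triplet(emb_utt[a], emb_utt[p], emb_utt[n])  # L_Tutt
    loss_spk = nll(logp_spk, spk_labels)                      # L_spk
    loss_utt = nll(logp_utt, utt_labels)                      # L_utt
    return loss_t_spk + loss_t_utt + loss_spk + loss_utt      # Eq. (3)
```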
4. Results and Discussions
We consider HiLAM and i-vector as two common baseline reference systems for the studies [6]. Further, as this work advocates a compensation network, we consider the joint speaker-utterance (joint-spk-utt) [41] and utterance compensation (utt-comp) [22] frameworks for comparison. We note that the joint speaker-utterance framework models speaker and utterance information jointly [41], whereas the utterance compensation framework compensates the utterance information after jointly modeling speaker and utterance characteristics [22]. Further, the utterance compensation framework has another variant with utterance factors (utt-comp-uf). We note that all these works used for comparison target only SV. The unified SUV proposed in our previous work for performing both speaker and utterance verification is also used as a reference system [27].

Table 1 shows the performance comparison of our proposed SUDA to the systems discussed above on Part I of the RSR2015 corpus. The performance of the compared systems is quoted from previously published results. We find that the HiLAM system performs better than the i-vector system due to the use of temporal knowledge. Further, the joint speaker-utterance model outperforms the HiLAM system for the IC test trial condition, while it performs poorly in the TW and IW trial conditions related to utterance verification. Compared to these two systems, our previous unified SUV takes advantage of LSTM to capture temporal dynamics and greatly reduces the error rate in the TW and IW trial conditions. For the utterance compensation system, we observe that compensating utterance information leads to a good performance in the IC trial condition, which further improves with the addition of the utterance factor [22].
Performance in EER (%) for the proposed SUDA in comparison to existing systems on the RSR2015 Part II corpus.
System              Male                                        Female
                    Development            Evaluation           Development            Evaluation
                    TW     IC     IW       TW    IC     IW      TW     IC     IW       TW    IC     IW
i-vector [6]        5.410  13.750 2.500    4.390 11.260 1.810   6.940  12.730 2.860    5.160 15.270 3.050
HiLAM [6]           6.140  10.580 3.030    4.420 8.380  1.710   4.620  6.660  1.290    3.710 7.950  1.450
Joint-spk-utt       10.804 4.096  2.715    9.929 4.190  2.286   10.220 3.482  2.179    7.797 3.382  1.816
Utt-comp [22]       -      4.160  -        -     3.610  -       -      4.030  -        -     2.850  -
Utt-comp-uf [22]    -      4.160  -        -     3.610  -       -      3.860  -        -     2.790  -
mod-SUV             1.394  3.757  0.279    1.015 3.591  0.176   1.833  4.862
Proposed: SUDA      1.382  2.698  0.245    0.878 2.400  0.127   1.360  3.359
Table 3: Performance in EER (%) of different systems on the evaluation set of RSR2015 Part I. Here, j-vector: j-vector with cosine similarity; Joint Bayesian: j-vector system with a joint Bayesian model; J2: joint training of the j-vector extractor and joint Bayesian model, with a Siamese network for the j-vector extractor and the joint Bayesian model as a back-end; J3: joint training of the j-vector extractor and joint Bayesian model, using the Siamese network output for verification; RACNN-LSTM: raw audio convolutional neural network with LSTM; i-vector + s-vector: the system concatenating the i-vector and s-vector directly; i-s-vector: the system concatenating the last-step hidden output of the s-vector and the corresponding i-vector (the s-vector extracted either from an LSTM or a Bidirectional LSTM (BLSTM)).
System               TW    IC    IW
j-vector [12]        3.14  7.86  0.95
Joint Bayesian [15]  0.03  3.61  0.02
J2 [15]
Proposed: SUDA

However, the utterance compensation framework did not explore compensating speaker information for utterance verification, unlike this paper; hence, the results of the TW and IW trials are not investigated in [22].

In this work, as mentioned in Section 3.2, mod-SUV is a modified framework of the unified SUV. We can observe from Table 1 that mod-SUV significantly improves the SV performance over the existing unified SUV framework in all three test trial conditions. Further, our proposed SUDA, which focuses on the required speaker and utterance information by imposing dual attention, outperforms most of the other systems for all three test trial conditions, except the IC test trial condition of the female evaluation set.

Table 2 reports the performance of various systems on Part II of the RSR2015 database. The performance trend of the various systems remains similar to that observed for Part I. The proposed SUDA again outperforms all other systems, showing the effectiveness of attention based dual compensation in various trial conditions.

We now compare our proposed SUDA framework with other deep learning systems. We combine the male and female data of RSR2015 Part I to match the evaluation protocol followed in [12, 15, 39, 40] for comparison with other research studies.
Table 4: Performance in EER (%) for SV and utterance verification (UV) on Part I of the RSR2015 evaluation set.

System            Male            Female
                  SV     UV       SV     UV
Unified SUV [27]  1.796  0.021    1.918
mod-SUV           1.132  0.010    1.362

We observe from Table 3 that mod-SUV has a performance comparable to the other systems in the IC trial condition. Further, with the use of attention masks, the proposed SUDA achieves a significant improvement in the IC trial condition, and at the same time improves performance for the IW and TW trial conditions.

Finally, as discussed in our earlier work [27], we can adjust and tune the scores from the speaker and utterance verification branches to obtain a security trade-off during scoring. The studies related to this, reported in Table 4, show that the proposed SUDA outperforms the previous unified SUV and the current mod-SUV, which highlights the gain provided by the dual attention mechanism. A minimal sketch of such a score trade-off is given below.
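As an illustration of this trade-off, a simple weighted fusion of the two branch scores could look as follows; the fusion rule and the weight are assumptions for illustration, not the exact scheme of [27].

```python
def fused_score(spk_score, utt_score, alpha=0.5):
    # Larger alpha puts more weight on the spoken-content check (stricter on
    # wrong-phrase attempts); smaller alpha emphasizes speaker identity.
    return (1.0 - alpha) * spk_score + alpha * utt_score
```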
5. Conclusions
This work proposes a novel speaker-utterance dual attention (SUDA) mechanism for speaker and utterance verification. We used LSTM based models with two branches in a unified framework, where attention masks capture the temporal interaction between the speaker trait stream and the utterance content stream, which helps to suppress the irrelevant information for both tasks. The studies conducted on the RSR2015 corpus show that, in comparison to existing approaches, the proposed SUDA works effectively for both speaker and utterance verification simultaneously. The framework also allows a user to tune it according to the security needs of the intended application. Future work will focus on extending attention masking to prompted digit based SV.
6. Acknowledgements
This research is supported by Programmatic Grant No. A1687b0033 from the Singapore Government's Research, Innovation and Enterprise 2020 plan (Advanced Manufacturing and Engineering domain), and Human-Robot Interaction Phase 1 (Grant No. 192 25 00054) by the National Research Foundation, Prime Minister's Office, Singapore under the National Robotics Programme.

7. References

[1] J. P. Campbell, "Speaker recognition: a tutorial," Proceedings of the IEEE, vol. 85, no. 9, pp. 1437–1462, Sep. 1997.
[2] R. K. Das, Speaker verification using sufficient train and limited test data, Ph.D. thesis, September 2017.
[3] R. K. Das, S. Jelil, and S. R. M. Prasanna, "Significance of constraining text in limited data text-independent speaker verification," in SPCOM, Bangalore, India, 2016, pp. 1–5.
[4] A. Poddar, M. Sahidullah, and G. Saha, "Speaker verification with short utterances: a review of challenges, trends and opportunities," IET Biometrics, vol. 7, no. 2, pp. 91–101, 2018.
[5] R. K. Das and S. R. M. Prasanna, "Speaker verification from short utterance perspective: A review," IETE Technical Review, vol. 35, no. 6, pp. 599–617, 2018.
[6] A. Larcher, K. A. Lee, B. Ma, and H. Li, "Text-dependent speaker verification: Classifiers, databases and RSR2015," Speech Communication, vol. 60, pp. 56–77, 2014.
[7] K. A. Lee, A. Larcher, W. Guangsen, K. Patrick, N. Brummer, D. van Leeuwen, H. Aronowitz, M. Kockmann, C. Vaquero, B. Ma, H. Li, T. Stafylakis, J. Alam, A. Swart, and J. Perez, "The RedDots data collection for speaker recognition," in Interspeech, Dresden, Germany, 2015, pp. 2996–3000.
[8] A. K. Sarkar and Z.-H. Tan, "Text dependent speaker verification using un-supervised HMM-UBM and temporal GMM-UBM," in Interspeech, San Francisco, USA, 2016, pp. 425–429.
[9] H. Zeinali, H. Sameti, L. Burget, J. Černocký, N. Maghsoodi, and P. Matějka, "i-vector/HMM based text-dependent speaker verification system for RedDots challenge," in Interspeech, San Francisco, USA, 2016, pp. 440–444.
[10] N. Chen, Y. Qian, and K. Yu, "Multi-task learning for text-dependent speaker verification," in Interspeech, Dresden, Germany, 2015, pp. 185–189.
[11] Y. Liu, Y. Qian, N. Chen, T. Fu, Y. Zhang, and K. Yu, "Deep feature for text-dependent speaker verification," Speech Communication, vol. 73, pp. 1–13, 2015.
[12] Z. Shi, L. Liu, M. Wang, and R. Liu, "Multi-view (joint) probability linear discrimination analysis for j-vector based text dependent speaker verification," in ASRU, Okinawa, Japan, 2017, pp. 614–620.
[13] H. Heo, J. Jung, I. Yang, S. Yoon, and H. Yu, "Joint training of expanded end-to-end DNN for text-dependent speaker verification," in Interspeech, Stockholm, Sweden, 2017, pp. 1532–1536.
[14] S. Dey, S. Madikeri, and P. Motlicek, "End-to-end text-dependent speaker verification using novel distance measures," in Interspeech, Hyderabad, India, 2018, pp. 3598–3602.
[15] Z. Shi, L. Liu, H. Lin, and R. Liu, "Joint learning of j-vector extractor and joint Bayesian model for text dependent speaker verification," in Interspeech, Hyderabad, India, 2018, pp. 1076–1080.
[16] Z. Shi, M. Wang, L. Liu, H. Lin, and R. Liu, "A double joint Bayesian approach for j-vector based text-dependent speaker verification," in Odyssey, Les Sables d'Olonne, France, 2018, pp. 365–371.
[17] S. Jelil, R. K. Das, R. Sinha, and S. R. M. Prasanna, "Speaker verification using Gaussian posteriorgrams on fixed phrase short utterances," in Interspeech, Dresden, Germany, 2015, pp. 1042–1046.
[18] S. Dey, S. Madikeri, M. Ferras, and P. Motlicek, "Deep neural network based posteriors for text-dependent speaker verification," in ICASSP, Shanghai, China, 2016, pp. 5050–5054.
[19] S. Dey, T. Koshinaka, P. Motlicek, and S. Madikeri, "DNN based speaker embedding using content information for text-dependent speaker verification," in ICASSP, Calgary, Alberta, Canada, 2018, pp. 5344–5348.
[20] N. Scheffer and Y. Lei, "Content matching for short duration speaker recognition," in Interspeech, Singapore, 2014, pp. 1317–1321.
[21] S. Dey, S. Madikeri, P. Motlicek, and M. Ferras, "Content normalization for text-dependent speaker verification," in Interspeech, Stockholm, Sweden, 2017, pp. 1482–1486.
[22] R. K. Das, M. Madhavi, and H. Li, "Compensating utterance information in fixed phrase speaker verification," in APSIPA ASC, Hawaii, USA, 2018, pp. 1708–1712.
[23] M. G. Rahim, C.-H. Lee, and B.-H. Juang, "A study on robust utterance verification for connected digits recognition," The Journal of the Acoustical Society of America, vol. 101, no. 5, pp. 2892–2902, 1997.
[24] E. Lleida and R. C. Rose, "Utterance verification in continuous speech recognition: decoding and training procedures," IEEE Trans. on Acoust., Speech & Audio Process., vol. 8, no. 2, pp. 126–139, March 2000.
[25] T. Kinnunen, M. Sahidullah, I. Kukanov, H. Delgado, M. Todisco, A. K. Sarkar, N. B. Thomsen, V. Hautamäki, N. Evans, and Z.-H. Tan, "Utterance verification for text-dependent speaker recognition: A comparative assessment using the RedDots corpus," in Interspeech, San Francisco, USA, 2016, pp. 430–434.
[26] H. Zeinali, L. Burget, H. Sameti, and H. Cernocky, "Spoken pass-phrase verification in the i-vector space," in Odyssey, Les Sables d'Olonne, France, 2018, pp. 372–377.
[27] T. Liu, M. Madhavi, R. K. Das, and H. Li, "A unified framework for speaker and utterance verification," in Interspeech, Graz, Austria, 2019, pp. 4320–4324.
[28] Z. Tang, L. Li, D. Wang, and R. Vipperla, "Collaborative joint training with multitask recurrent model for speech and speaker recognition," IEEE/ACM Trans. Audio, Speech & Language Processing, vol. 25, no. 3, pp. 493–504, 2017.
[29] R. Kumar, V. Yeruva, and S. Ganapathy, "On convolutional LSTM modeling for joint wake-word detection and text dependent speaker verification," in Interspeech, Hyderabad, India, 2018, pp. 1121–1125.
[30] S.-X. Zhang, Z. Chen, Y. Zhao, J. Li, and Y. Gong, "End-to-end attention based text-dependent speaker verification," in SLT Workshop, San Juan, Puerto Rico, 2016, pp. 171–178.
[31] F. R. Rahman Chowdhury, Q. Wang, I. L. Moreno, and L. Wan, "Attention-based models for text-dependent speaker verification," in ICASSP, Calgary, Alberta, Canada, 2018, pp. 5359–5363.
[32] Y. Zhu, T. Ko, D. Snyder, B. Mak, and D. Povey, "Self-attentive speaker embeddings for text-independent speaker verification," in Interspeech, Hyderabad, India, 2018, pp. 3573–3577.
[33] D. Wang and J. Chen, "Supervised speech separation based on deep learning: An overview," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 26, no. 10, pp. 1702–1726, 2018.
[34] M. Jaderberg, K. Simonyan, A. Zisserman et al., "Spatial transformer networks," in NIPS, Montreal, Canada, 2015, pp. 2017–2025.
[35] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in CVPR, Salt Lake City, Utah, USA, 2018, pp. 7132–7141.
[36] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang, "Residual attention network for image classification," in CVPR, Honolulu, Hawaii, USA, 2017, pp. 3156–3164.
[37] C. Shan, J. Zhang, Y. Wang, and L. Xie, "Attention-based end-to-end speech recognition on voice search," in ICASSP, Calgary, Alberta, Canada, 2018, pp. 4764–4768.
[38] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.
[39] J.-W. Jung, H.-S. Heo, I.-H. Yang, H.-J. Shim, and H.-J. Yu, "A complete end-to-end speaker verification system using deep neural networks: From raw signals to verification result," in ICASSP, Calgary, Alberta, Canada, 2018, pp. 5349–5353.
[40] S. Wang, Y. Qian, and K. Yu, "What does the speaker embedding encode?" in Interspeech, Stockholm, Sweden, 2017, pp. 1497–1501.
[41] G. Wang, K. A. Lee, T. H. Nguyen, H. Sun, and B. Ma, "Joint speaker and lexical modeling for short-term characterization of speaker," in