Speaker-Utterance Dual Attention for Speaker and Utterance Verification
Tianchi Liu, Rohan Kumar Das, Maulik Madhavi, Shengmei Shen, Haizhou Li
Pensees Pte Ltd, Singapore
Department of Electrical and Computer Engineering, National University of Singapore, Singapore
{liutianchi, jane.shen}@pensees.ai, {rohankd, maulik.madhavi, haizhou.li}@nus.edu.sg

Abstract
In this paper, we study a novel technique that exploits the interaction between speaker traits and linguistic content to improve both speaker verification and utterance verification performance. We implement the idea of speaker-utterance dual attention (SUDA) in a unified neural network. The dual attention refers to an attention mechanism for the two tasks of speaker and utterance verification. The proposed SUDA features an attention mask mechanism to learn the interaction between the speaker and utterance information streams. This helps to focus only on the information required for the respective task by masking the irrelevant counterparts. The studies conducted on the RSR2015 corpus confirm that the proposed SUDA outperforms the framework without the attention mask as well as several competitive systems for both speaker and utterance verification.
Index Terms: text-dependent speaker verification, utterance verification, attention, masking, RSR2015
1. Introduction
Speaker verification (SV) aims to verify the claimed identity of a person using given speech [1]. Its implementation is broadly categorized into text-dependent and text-independent based on the spoken content used for enrollment and testing [1]. The former deals with fixed short phrases, while the latter does not put any constraint on the speech content. A text-independent system generally requires more training and test data [2, 3] than a text-dependent one [4, 5] to maintain the same level of accuracy. Therefore, text-dependent systems are preferable in many real-world applications where the user's cooperation is possible.

The research on text-dependent SV has evolved considerably, from traditional dynamic time warping based template matching to deep learning. For benchmarking of technology progress, standard speech corpora such as RSR2015 and RedDots have been designed [6, 7]. It is found that modeling techniques such as the hierarchical multi-layer acoustic model (HiLAM) [6], the unsupervised hidden Markov model (HMM)-universal background model (UBM) [8], i-vector/HMM [9] and j-vector [10] benefit from temporal information in speech. Further, deep learning techniques have greatly improved the ability of speaker characterization [11–16].

In human SV, we would like test samples to be as parallel to the unknown sample as possible, so as to reduce the variability due to linguistic content to a minimum. Text-dependent SV allows us to do just that. Various studies consider text-dependent SV as performing two tasks, SV as the main task and utterance verification as a subtask, where the two tasks are optimized separately or jointly in order to improve the SV objective. For example, phonetic posteriorgrams derived using Gaussian mixture model (GMM) and deep neural network (DNN) frameworks are utilized to capture lexical information for text-dependent SV [17, 18]. One study shows that DNN based speaker embeddings benefit from lexical content information [19]; others suggest that lexical information can be used in different ways to compensate the SV scores for performance gains [20–22].

Prior studies have underscored the importance of content modeling in text-dependent SV. While utterance verification has been well studied as part of speech recognition [23, 24], it has not been given sufficient attention in the context of SV. Some consider text-dependent SV as a combination of two independent systems, namely SV and utterance verification [25, 26]. In our previous work [27], text-dependent SV was formulated as a unified speaker-utterance verification (SUV) system in a multi-task learning implementation. This is inspired by the human cognitive process, where we interpret and decode speaker traits and linguistic content in a corroborative manner [28, 29]. For example, by paying special attention to particular sounds while knowing the linguistic content, we verify the voice of a speaker; on the other hand, if we are familiar with a speaker, we tend to recognize his/her voice in a better way.

In the unified SUV system, we used a shared long short-term memory (LSTM) network and two independent LSTM output layers, one for speaker identity and another for utterance identity [27]. While the previously proposed unified SUV framework is effective, the interaction between the two output layers was not explored. We believe that both speaker and utterance verification can benefit from each other by exploring the temporal interaction between them.
This is also motivated by successful explorations in text-dependent SV that suggest the benefit of compensating lexical information [20–22]. Further, various attention models [30–32] suggest the possibility of focusing on task-specific compensation or masking. In this work, we propose a speaker-utterance dual attention, SUDA in short, for performing both speaker and utterance verification. The attention mechanism for compensating irrelevant information in both tasks is derived using a masking operation. The mask for attention is estimated frame-by-frame, and the attention mechanism establishes the temporal association between the speaker trait stream and the utterance content stream of the LSTM output. In addition, we note that, as the attention mechanism is applied to both branches (speaker and utterance) in the framework, it is referred to as dual attention. The studies in this work are conducted on the RSR2015 corpus [6]. The contribution of this work lies in the use of speaker-utterance dual attention in a single framework for performing both speaker and utterance verification.

The rest of the paper is organized as follows. Section 2 describes the proposed speaker-utterance dual attention mechanism for speaker and utterance verification. The experiments are detailed in Section 3, followed by the results and analysis in Section 4. The paper is concluded in Section 5.
Figure 1: Block diagram of the proposed SUDA for speaker and utterance verification. It consists of a shared LSTM and two LSTM output layers, which interact with each other through the attention mask network. NF and NF' denote the number of frames before and after convolution.
2. Speaker-Utterance Dual Attention
This section describes the proposed SUDA for speaker and utterance verification. As shown in Figure 1, the features extracted from the raw audio data are fed into an LSTM based recurrent network to characterize the temporal dynamics. This system is an extension of our earlier unified SUV framework [27].

As presented in our earlier study [27], the first shared layer is common to the speaker and utterance verification branches. The hidden representation from the first shared layer is then passed to two LSTM networks that focus on extracting valid information for each of the two sub-tasks, namely, speaker and utterance verification. We then discuss the improvements introduced in this work using dual attention by masking. A minimal sketch of this front-end is given below.
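The following PyTorch sketch illustrates the shared front-end described above: one LSTM layer common to both tasks, followed by one branch LSTM per task. The class and parameter names (SharedLSTMFrontEnd, feat_dim, hidden) and the hidden sizes are illustrative assumptions, not the exact configuration of the paper.

```python
import torch
import torch.nn as nn

class SharedLSTMFrontEnd(nn.Module):
    """Shared LSTM followed by one branch LSTM per task (speaker / utterance).

    Hidden size and single-layer LSTMs are assumptions for illustration;
    the exact configuration follows the unified SUV setup [27].
    """

    def __init__(self, feat_dim=60, hidden=256):
        super().__init__()
        self.shared = nn.LSTM(feat_dim, hidden, batch_first=True)    # common layer
        self.spk_branch = nn.LSTM(hidden, hidden, batch_first=True)  # speaker branch
        self.utt_branch = nn.LSTM(hidden, hidden, batch_first=True)  # utterance branch

    def forward(self, x):
        # x: (batch, frames, feat_dim) MFCC features
        h, _ = self.shared(x)
        h_spk, _ = self.spk_branch(h)   # speaker-oriented hidden stream
        h_utt, _ = self.utt_branch(h)   # utterance-oriented hidden stream
        return h_spk, h_utt
```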
Masking has been found to be effective in separating multiple sound sources from an audio mixture [33]. A more recent approach regards the speech separation problem as supervised learning that aims to discriminate different patterns, such as speech, speakers and background noise, which are learned from training data [33]. The masking operation can be viewed as an attention mechanism, where the features from one branch are masked to attend to specific features from another branch. Attention based models have been applied successfully to various tasks. In computer vision research, there are three main strategies for attention: spatial attention, channel attention and mixed attention [34–36]. Spatial attention is used to emphasize the area of interest, while channel attention is mainly used for feature recalibration in convolutional neural networks, such as the 'Squeeze-and-Excitation' block proposed in [35]. Further, the authors of [30] proposed an end-to-end framework with an attention model to combine frame-level features, which acts as an alignment operation for SV studies. The attention model, designed as a part of the speaker embedding network, is used to calculate the weighted mean of the frame-level feature maps to derive speaker embeddings with more discriminative speaker characteristics [31, 32].
In the proposed SUDA, convolution layers are used to extract feature maps from the hidden representation obtained after the LSTM, as shown in Figure 1. The objective is to map the hidden representation to a higher dimensional space so that the mutual information can be compensated by attention masks. The dynamic masks learn the information from the feature maps obtained from the convolutional layers. We use a sigmoid non-linearity to limit the activation of the mask. The dynamic masking is performed by learning the parameters from the feature maps and applying the sigmoid function [37]. The masking is formulated as:

mask_s = 1 − Sigmoid(fm_u)    (1)
mask_u = 1 − Sigmoid(fm_s)    (2)

where fm_u and fm_s represent the feature maps from the utterance verification and SV branches, respectively, while mask_u and mask_s indicate the dynamic masks for the corresponding branches. These masks are then multiplied with the corresponding stream of feature maps to suppress irrelevant information in that stream. The parameters of the masks are not fixed during the training or inference phase; they are derived according to the input audio data.

The operations performed so far produce frame-wise representations. In order to pool the information across the utterance, global average pooling (GAP) is performed. Next, fully connected (FC) layers are used to perform both the speaker and utterance verification tasks.

In this work, the attention mask is applied to every feature map through the sigmoid operation rather than the conventional softmax operation. Further, the weights of our attention masks are obtained from the branch of the other task. We note that the weights of the attention mask are thus tuned to utterance- and speaker-specific information. A minimal sketch of this cross-branch masking is given below.
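The following is a minimal PyTorch sketch of the cross-branch masking in Eqs. (1) and (2). The channel count, kernel size and the single convolution per branch are simplifying assumptions; Figure 1 shows additional convolution and batch normalization layers that are omitted here for brevity.

```python
import torch
import torch.nn as nn

class DualAttentionMask(nn.Module):
    """Cross-branch masking of Eqs. (1)-(2): each branch is attenuated by a
    mask computed from the *other* branch's feature maps."""

    def __init__(self, in_dim=256, n_channels=512, kernel_size=5):
        super().__init__()
        # One 1D convolution per branch maps the LSTM stream to feature maps.
        self.conv_spk = nn.Sequential(
            nn.Conv1d(in_dim, n_channels, kernel_size, stride=1, padding=0),
            nn.PReLU(), nn.BatchNorm1d(n_channels))
        self.conv_utt = nn.Sequential(
            nn.Conv1d(in_dim, n_channels, kernel_size, stride=1, padding=0),
            nn.PReLU(), nn.BatchNorm1d(n_channels))

    def forward(self, h_spk, h_utt):
        # h_spk, h_utt: (batch, frames, in_dim) branch LSTM outputs
        fm_s = self.conv_spk(h_spk.transpose(1, 2))   # speaker feature maps
        fm_u = self.conv_utt(h_utt.transpose(1, 2))   # utterance feature maps
        mask_s = 1.0 - torch.sigmoid(fm_u)            # Eq. (1)
        mask_u = 1.0 - torch.sigmoid(fm_s)            # Eq. (2)
        att_s = fm_s * mask_s                         # suppress content-related info
        att_u = fm_u * mask_u                         # suppress speaker-related info
        # Global average pooling over frames yields utterance-level vectors.
        return att_s.mean(dim=-1), att_u.mean(dim=-1)
```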
3. Experiments
We now discuss the database and experimental setup for the studies in the following subsections.

Table 1: Performance in EER (%) for the proposed SUDA in comparison to existing systems on the RSR2015 Part I corpus.
System              Male                                       Female
                    Development           Evaluation           Development           Evaluation
                    TW    IC    IW        TW    IC    IW       TW    IC    IW        TW    IC    IW
i-vector [6]        2.870 5.950 0.740     1.950 4.030 0.320    3.050 7.870 0.940     1.910 6.610 0.750
HiLAM [6]           1.660 3.690 0.490     0.820 2.470 0.190    1.770 3.240 0.450     0.610 2.960 0.140
Joint-spk-utt [27]  5.565 1.981 1.792     5.125 2.079 0.888    5.179 1.699 0.831     3.110 1.453 0.499
Unified SUV [27]    0.470 1.590 0.101     0.293 1.757 0.039    1.176 4.323 0.178     0.375 2.009 0.068
Utt-comp [22]       -     1.460 -         -     0.960 -        -     1.640 -         -     0.730 -
Utt-comp-uf [22]    -     1.460 -         -     0.960 -        -     1.460 -         -     -
mod-SUV
Proposed: SUDA      0.202 0.728 0.022     0.068 0.722 0.010    0.297 1.449 0.024     0.125
3.1. Database

The RSR2015 corpus is used for the studies in this work [6]. It contains data from 300 speakers, 143 female and 157 male. The corpus is divided into three parts based on the nature of the fixed phrases. Part I includes 30 fixed phrase utterances of 3-4 seconds duration, whereas Part II has 30 fixed short commands of 1-2 seconds duration. Part III contains random five- or ten-digit sequences. There are 9 sessions for each phrase from all the speakers. Out of those, the first, fourth and seventh sessions are used for speaker enrollment and the remaining sessions for testing, as per the RSR2015 evaluation protocol [6]. The RSR2015 corpus has three subsets, namely the background, development and evaluation sets, to evaluate system performance [6].

The test trials are grouped under four categories based on the test speaker and phrase labels, which are
Target Correct (TC), Impostor Correct (IC), Target Wrong (TW) and Impostor Wrong (IW). These further constitute three test conditions, where each condition considers TC as the target trials and one of the remaining three categories as the non-target trials. The performance is reported in terms of equal error rate (EER). In this work, we consider Part I and Part II of RSR2015 as they are suitable for both speaker and utterance verification.
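For reference, the following is a generic sketch of how EER can be computed from target and non-target trial scores; it is not the exact scoring tool used in our experiments.

```python
import numpy as np

def compute_eer(target_scores, nontarget_scores):
    """Equal error rate: the operating point where the false rejection rate
    (FRR) and false acceptance rate (FAR) are (approximately) equal."""
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(nontarget_scores))])
    labels = labels[np.argsort(scores)]
    # Rejecting all trials up to rank i: FRR rises and FAR falls monotonically.
    frr = np.cumsum(labels) / labels.sum()
    far = 1.0 - np.cumsum(1.0 - labels) / (1.0 - labels).sum()
    idx = np.argmin(np.abs(frr - far))
    return 0.5 * (frr[idx] + far[idx])
```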
3.2. Experimental Setup

The speech utterances are processed with a 20 ms frame size and a 10 ms shift to extract 60-dimensional (20 base + 20 ∆ + 20 ∆∆) mel frequency cepstral coefficient (MFCC) features using the KALDI toolkit (http://kaldi-asr.org/). The extracted features are normalized by cepstral mean and variance normalization using utterance-level mean and variance statistics.

In addition, we apply a feature-level triplet loss to further reduce the intra-class distance and increase the inter-class distance. It is applied on the 512-dimensional feature vector at the step after 1D global average pooling, as shown in Figure 1. We calculate the triplet loss in both branches, i.e., SV and utterance verification. The negative and positive samples in the training batch are searched for each iteration. The batch size of each iteration is set to 128 for all the experiments and these samples are chosen randomly. The loss of the proposed framework is calculated as:

L_total = L_Tspk + L_Tutt + L_spk + L_utt    (3)

where L_Tspk and L_Tutt are triplet losses, while L_spk and L_utt are negative log likelihood losses. The subscripts spk and utt represent the SV and utterance verification branches of SUDA, respectively.

The learning rate, optimizer, LSTM hidden layer and scoring follow the same configurations as in our previous work on the unified SUV [27]. We adopt the PyTorch toolkit (https://pytorch.org/) for the implementation. We empirically fixed the random seed to 2020 in our studies. In contrast to the previous unified SUV, we add a 1D convolution layer to the network, where the kernel size is 5, while the padding and stride are 0 and 1, respectively. The activation function between convolutional layers is the parametric rectified linear unit (PReLU) [38]. Further, in order to observe the impact of the attention mask, we also conduct experiments without the masking based attention block. We refer to the system with this setup as mod-SUV for comparison with the previous unified SUV [27] in this work.
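A minimal PyTorch sketch of the combined loss in Eq. (3) is shown below. The triplet margin and the way anchor/positive/negative indices are mined within a batch are assumptions for illustration.

```python
import torch.nn as nn

# Assumed margin and pre-mined (anchor, positive, negative) index tensors;
# the actual mining strategy in this work selects samples within each batch.
triplet = nn.TripletMarginLoss(margin=1.0)
nll = nn.NLLLoss()

def suda_loss(emb_spk, emb_utt, logp_spk, logp_utt,
              spk_triplets, utt_triplets, spk_labels, utt_labels):
    # emb_*:  (batch, 512) embeddings after 1D global average pooling
    # logp_*: (batch, n_classes) log-softmax outputs of the two branches
    a, p, n = spk_triplets
    loss_t_spk = triplet(emb_spk[a], emb_spk[p], emb_spk[n])  # L_Tspk
    a, p, n = utt_triplets
    loss_t_utt = triplet(emb_utt[a], emb_utt[p], emb_utt[n])  # L_Tutt
    loss_spk = nll(logp_spk, spk_labels)                      # L_spk
    loss_utt = nll(logp_utt, utt_labels)                      # L_utt
    return loss_t_spk + loss_t_utt + loss_spk + loss_utt      # Eq. (3)
```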
4. Results and Discussions
We consider HiLAM and i-vector as two common baseline reference systems for the studies [6]. Further, as this work advocates a compensation network, we consider the joint speaker-utterance (joint-spk-utt) [41] and utterance compensation (utt-comp) [22] frameworks for comparison. We note that the joint speaker-utterance framework models speaker and utterance information jointly [41], whereas the utterance compensation framework compensates the utterance information after jointly modeling speaker and utterance characteristics [22]. Further, the utterance compensation framework has another variant with utterance factors (utt-comp-uf). We note that all these works used for comparison target only SV. The unified SUV proposed in our previous work for performing both speaker and utterance verification is also used as a reference system [27].

Table 1 shows the performance comparison of our proposed SUDA to the systems discussed above on Part I of the RSR2015 corpus. The performance of the compared systems is quoted from previously published results. We find that the HiLAM system performs better than the i-vector system due to the use of temporal knowledge. Further, the joint speaker-utterance model outperforms the HiLAM system for the IC test trial condition, while it performs poorly in the TW and IW trial conditions related to utterance verification. Compared to these two systems, our previous unified SUV takes advantage of LSTM to capture temporal dynamics and greatly reduces the error rate in the TW and IW trial conditions. For the utterance compensation system, we observe that compensating utterance information leads to a good performance in the IC trial condition, which further improves with the addition of the utterance factor [22].
Performance in EER (%) for the proposed SUDA in comparison to existing systems on the RSR2015 Part II corpus.
System              Male                                        Female
                    Development            Evaluation           Development            Evaluation
                    TW     IC     IW       TW    IC     IW      TW     IC     IW       TW    IC     IW
i-vector [6]        5.410  13.750 2.500    4.390 11.260 1.810   6.940  12.730 2.860    5.160 15.270 3.050
HiLAM [6]           6.140  10.580 3.030    4.420 8.380  1.710   4.620  6.660  1.290    3.710 7.950  1.450
Joint-spk-utt       10.804 4.096  2.715    9.929 4.190  2.286   10.220 3.482  2.179    7.797 3.382  1.816
Utt-comp [22]       -      4.160  -        -     3.610  -       -      4.030  -        -     2.850  -
Utt-comp-uf [22]    -      4.160  -        -     3.610  -       -      3.860  -        -     2.790  -
mod-SUV             1.394  3.757  0.279    1.015 3.591  0.176   1.833  4.862
Proposed: SUDA      1.382  2.698  0.245    0.878 2.400  0.127   1.360  3.359
Table 3: Performance in EER (%) of different systems on the evaluation set of RSR2015 Part I. Here, j-vector: j-vector with cosine similarity; Joint Bayesian: j-vector system with a joint Bayesian model; J2: joint training of the j-vector extractor and joint Bayesian model, with a Siamese network for the j-vector extractor and the joint Bayesian model as a back-end; J3: joint training of the j-vector extractor and joint Bayesian model, using the Siamese network output for verification; RACNN-LSTM: raw audio convolutional neural network with LSTM; i-vector + s-vector: the system concatenating the i-vector and s-vector directly; i-s-vector: the system concatenating the last-step hidden output of the s-vector and the corresponding i-vector (the s-vector extracted either from an LSTM or a Bidirectional LSTM (BLSTM)).
System               TW    IC    IW
j-vector [12]        3.14  7.86  0.95
Joint Bayesian [15]  0.03  3.61  0.02
J2 [15]
Proposed: SUDA

However, the utterance compensation framework did not explore compensating speaker information for utterance verification, unlike this paper; hence, the results of the TW and IW trials are not investigated in [22].

In this work, as mentioned in Section 3.2, mod-SUV is a modified framework of the unified SUV. We can observe from Table 1 that mod-SUV significantly improves the SV performance over the existing unified SUV framework in all three test trial conditions. Further, our proposed SUDA, which focuses on the required speaker and utterance information by imposing dual attention, outperforms most of the other systems for all three test trial conditions, except the IC test trial condition of the female evaluation set.

Table 2 reports the performance of various systems on Part II of the RSR2015 database. The performance trend of the various systems remains similar to that observed for Part I. The proposed SUDA again outperforms all other systems, showing the effectiveness of attention based dual compensation in various trial conditions.

We now compare our proposed SUDA framework with other deep learning systems. We combine the male and female data of RSR2015 Part I to match the evaluation protocol followed in [12, 15, 39, 40] for comparison with other research studies.
Table 4: Performance in EER (%) for SV and utterance verification (UV) on Part I of the RSR2015 evaluation set.

System            Male            Female
                  SV     UV       SV     UV
Unified SUV [27]  1.796  0.021    1.918
mod-SUV           1.132  0.010    1.362

We observe from Table 3 that mod-SUV has a performance comparable to the other systems in the IC trial condition. Further, with the use of attention masks, the proposed SUDA achieves a significant improvement in the IC trial condition, and at the same time improves performance for the IW and TW trial conditions.

Finally, as discussed in our earlier work [27], we can adjust and tune the scores from the speaker and utterance verification branches to obtain a security trade-off during scoring. The studies related to this, reported in Table 4, show that the proposed SUDA outperforms the previous unified SUV and the current mod-SUV, which highlights the gain provided by the dual attention mechanism. A minimal sketch of such a score trade-off is given below.
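As an illustration of this trade-off, a simple weighted fusion of the two branch scores could look as follows; the fusion rule and the weight are assumptions for illustration, not the exact scheme of [27].

```python
def fused_score(spk_score, utt_score, alpha=0.5):
    # Larger alpha puts more weight on the spoken-content check (stricter on
    # wrong-phrase attempts); smaller alpha emphasizes speaker identity.
    return (1.0 - alpha) * spk_score + alpha * utt_score
```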
5. Conclusions
This work proposes a novel speaker-utterance dual attention (SUDA) mechanism for speaker and utterance verification. We used LSTM based models with two branches in a unified framework, where attention masks capture the temporal interaction between the speaker trait stream and the utterance content stream, which helps to suppress the irrelevant information for both tasks. The studies conducted on the RSR2015 corpus show that, in comparison to existing approaches, the proposed SUDA works effectively for both speaker and utterance verification simultaneously. The framework also allows a user to tune it according to the security needs of the intended application. Future work will focus on extending attention masking to prompted digit based SV.
6. Acknowledgements
This research is supported by Programmatic Grant No. A1687b0033 from the Singapore Government's Research, Innovation and Enterprise 2020 plan (Advanced Manufacturing and Engineering domain), and Human-Robot Interaction Phase 1 (Grant No. 192 25 00054) by the National Research Foundation, Prime Minister's Office, Singapore under the National Robotics Programme.

7. References

[1] J. P. Campbell, "Speaker recognition: a tutorial," Proceedings of the IEEE, vol. 85, no. 9, pp. 1437–1462, Sep. 1997.
[2] R. K. Das, Speaker verification using sufficient train and limited test data, Ph.D. thesis, September 2017.
[3] R. K. Das, S. Jelil, and S. R. M. Prasanna, "Significance of constraining text in limited data text-independent speaker verification," in SPCOM, Bangalore, India, 2016, pp. 1–5.
[4] A. Poddar, M. Sahidullah, and G. Saha, "Speaker verification with short utterances: a review of challenges, trends and opportunities," IET Biometrics, vol. 7, no. 2, pp. 91–101, 2018.
[5] R. K. Das and S. R. M. Prasanna, "Speaker verification from short utterance perspective: A review," IETE Technical Review, vol. 35, no. 6, pp. 599–617, 2018.
[6] A. Larcher, K. A. Lee, B. Ma, and H. Li, "Text-dependent speaker verification: Classifiers, databases and RSR2015," Speech Communication, vol. 60, pp. 56–77, 2014.
[7] K. A. Lee, A. Larcher, W. Guangsen, K. Patrick, N. Brummer, D. van Leeuwen, H. Aronowitz, M. Kockmann, C. Vaquero, B. Ma, H. Li, T. Stafylakis, J. Alam, A. Swart, and J. Perez, "The RedDots data collection for speaker recognition," in Interspeech, Dresden, Germany, 2015, pp. 2996–3000.
[8] A. K. Sarkar and Z.-H. Tan, "Text dependent speaker verification using un-supervised HMM-UBM and temporal GMM-UBM," in Interspeech, San Francisco, USA, 2016, pp. 425–429.
[9] H. Zeinali, H. Sameti, L. Burget, J. Černocký, N. Maghsoodi, and P. Matějka, "i-vector/HMM based text-dependent speaker verification system for RedDots challenge," in Interspeech, San Francisco, USA, 2016, pp. 440–444.
[10] N. Chen, Y. Qian, and K. Yu, "Multi-task learning for text-dependent speaker verification," in Interspeech, Dresden, Germany, 2015, pp. 185–189.
[11] Y. Liu, Y. Qian, N. Chen, T. Fu, Y. Zhang, and K. Yu, "Deep feature for text-dependent speaker verification," Speech Communication, vol. 73, pp. 1–13, 2015.
[12] Z. Shi, L. Liu, M. Wang, and R. Liu, "Multi-view (joint) probability linear discrimination analysis for j-vector based text dependent speaker verification," in ASRU, Okinawa, Japan, 2017, pp. 614–620.
[13] H. Heo, J. Jung, I. Yang, S. Yoon, and H. Yu, "Joint training of expanded end-to-end DNN for text-dependent speaker verification," in Interspeech, Stockholm, Sweden, 2017, pp. 1532–1536.
[14] S. Dey, S. Madikeri, and P. Motlicek, "End-to-end text-dependent speaker verification using novel distance measures," in Interspeech, Hyderabad, India, 2018, pp. 3598–3602.
[15] Z. Shi, L. Liu, H. Lin, and R. Liu, "Joint learning of j-vector extractor and joint Bayesian model for text dependent speaker verification," in Interspeech, Hyderabad, India, 2018, pp. 1076–1080.
[16] Z. Shi, M. Wang, L. Liu, H. Lin, and R. Liu, "A double joint Bayesian approach for j-vector based text-dependent speaker verification," in Odyssey, Les Sables d'Olonne, France, 2018, pp. 365–371.
[17] S. Jelil, R. K. Das, R. Sinha, and S. R. M. Prasanna, "Speaker verification using Gaussian posteriorgrams on fixed phrase short utterances," in Interspeech, Dresden, Germany, 2015, pp. 1042–1046.
[18] S. Dey, S. Madikeri, M. Ferras, and P. Motlicek, "Deep neural network based posteriors for text-dependent speaker verification," in ICASSP, Shanghai, China, 2016, pp. 5050–5054.
[19] S. Dey, T. Koshinaka, P. Motlicek, and S. Madikeri, "DNN based speaker embedding using content information for text-dependent speaker verification," in ICASSP, Calgary, Alberta, Canada, 2018, pp. 5344–5348.
[20] N. Scheffer and Y. Lei, "Content matching for short duration speaker recognition," in Interspeech, Singapore, 2014, pp. 1317–1321.
[21] S. Dey, S. Madikeri, P. Motlicek, and M. Ferras, "Content normalization for text-dependent speaker verification," in Interspeech, Stockholm, Sweden, 2017, pp. 1482–1486.
[22] R. K. Das, M. Madhavi, and H. Li, "Compensating utterance information in fixed phrase speaker verification," in APSIPA ASC, Hawaii, USA, 2018, pp. 1708–1712.
[23] M. G. Rahim, C.-H. Lee, and B.-H. Juang, "A study on robust utterance verification for connected digits recognition," The Journal of the Acoustical Society of America, vol. 101, no. 5, pp. 2892–2902, 1997.
[24] E. Lleida and R. C. Rose, "Utterance verification in continuous speech recognition: decoding and training procedures," IEEE Trans. on Acoust., Speech & Audio Process., vol. 8, no. 2, pp. 126–139, March 2000.
[25] T. Kinnunen, M. Sahidullah, I. Kukanov, H. Delgado, M. Todisco, A. K. Sarkar, N. B. Thomsen, V. Hautamäki, N. Evans, and Z.-H. Tan, "Utterance verification for text-dependent speaker recognition: A comparative assessment using the RedDots corpus," in Interspeech, San Francisco, USA, 2016, pp. 430–434.
[26] H. Zeinali, L. Burget, H. Sameti, and H. Cernocky, "Spoken pass-phrase verification in the i-vector space," in Odyssey, Les Sables d'Olonne, France, 2018, pp. 372–377.
[27] T. Liu, M. Madhavi, R. K. Das, and H. Li, "A unified framework for speaker and utterance verification," in Interspeech, Graz, Austria, 2019, pp. 4320–4324.
[28] Z. Tang, L. Li, D. Wang, and R. Vipperla, "Collaborative joint training with multitask recurrent model for speech and speaker recognition," IEEE/ACM Trans. Audio, Speech & Language Processing, vol. 25, no. 3, pp. 493–504, 2017.
[29] R. Kumar, V. Yeruva, and S. Ganapathy, "On convolutional LSTM modeling for joint wake-word detection and text dependent speaker verification," in Interspeech, Hyderabad, India, 2018, pp. 1121–1125.
[30] S.-X. Zhang, Z. Chen, Y. Zhao, J. Li, and Y. Gong, "End-to-end attention based text-dependent speaker verification," in SLT Workshop, San Juan, Puerto Rico, 2016, pp. 171–178.
[31] F. R. Rahman Chowdhury, Q. Wang, I. L. Moreno, and L. Wan, "Attention-based models for text-dependent speaker verification," in ICASSP, Calgary, Alberta, Canada, 2018, pp. 5359–5363.
[32] Y. Zhu, T. Ko, D. Snyder, B. Mak, and D. Povey, "Self-attentive speaker embeddings for text-independent speaker verification," in Interspeech, Hyderabad, India, 2018, pp. 3573–3577.
[33] D. Wang and J. Chen, "Supervised speech separation based on deep learning: An overview," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 26, no. 10, pp. 1702–1726, 2018.
[34] M. Jaderberg, K. Simonyan, A. Zisserman et al., "Spatial transformer networks," in NIPS, Montreal, Canada, 2015, pp. 2017–2025.
[35] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in CVPR, Salt Lake City, Utah, USA, 2018, pp. 7132–7141.
[36] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang, "Residual attention network for image classification," in CVPR, Honolulu, Hawaii, USA, 2017, pp. 3156–3164.
[37] C. Shan, J. Zhang, Y. Wang, and L. Xie, "Attention-based end-to-end speech recognition on voice search," in ICASSP, Calgary, Alberta, Canada, 2018, pp. 4764–4768.
[38] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.
[39] J.-W. Jung, H.-S. Heo, I.-H. Yang, H.-J. Shim, and H.-J. Yu, "A complete end-to-end speaker verification system using deep neural networks: From raw signals to verification result," in ICASSP, Calgary, Alberta, Canada, 2018, pp. 5349–5353.
[40] S. Wang, Y. Qian, and K. Yu, "What does the speaker embedding encode?" in Interspeech, Stockholm, Sweden, 2017, pp. 1497–1501.
[41] G. Wang, K. A. Lee, T. H. Nguyen, H. Sun, and B. Ma, "Joint speaker and lexical modeling for short-term characterization of speaker," in