Exploring the Use of an Unsupervised Autoregressive Model as a Shared Encoder for Text-Dependent Speaker Verification
Vijay Ravi, Ruchao Fan, Amber Afshan, Huanhua Lu, Abeer Alwan
University of California Los Angeles, USA
{vijaysumaravi, fanruchao, amberafshan, huanhua, alwan}@ucla.edu
Abstract
In this paper, we propose a novel way of addressing text-dependent automatic speaker verification (TD-ASV) by using a shared encoder with task-specific decoders. An autoregressive predictive coding (APC) encoder is pre-trained in an unsupervised manner using both out-of-domain (LibriSpeech, VoxCeleb) and in-domain (DeepMine) unlabeled datasets to learn generic, high-level feature representations that encapsulate speaker and phonetic content. Two task-specific decoders were trained using labeled datasets to classify speakers (SID) and phrases (PID). Speaker embeddings extracted from the SID decoder were scored using a PLDA. The SID and PID systems were fused at the score level. There is a 51.9% relative improvement in minDCF for our system compared to the fully supervised x-vector baseline on the cross-lingual DeepMine dataset. However, the i-vector/HMM method outperformed the proposed APC encoder-decoder system. A fusion of the x-vector/PLDA baseline and the SID/PLDA scores prior to PID fusion further improved performance by 15%, indicating complementarity of the proposed approach to the x-vector system. We show that the proposed approach can leverage large, unlabeled, data-rich domains and learn speech patterns independent of downstream tasks. Such a system can provide competitive performance in domain-mismatched scenarios where test data come from data-scarce domains.
Index Terms: speaker verification, unsupervised learning, feature representation, shared encoder, domain adaptation.
1. Introduction
Text-dependent automatic speaker verification (TD-ASV) systems classify pairs of speech utterances as same or different based on the speaker's identity and the lexical content of the phrases spoken. This is analogous to two-factor authentication, in that the phrase identification (PID) and the speaker identification (SID) both have to match for the user to gain access. The applications of TD-ASV include, but are not limited to, biometric verification in healthcare [1], banking, forensics [2], and privacy protection in personalized voice assistants [3].

While the same-or-different speaker decision accuracy is of utmost importance, it is also beneficial if the TD-ASV system is resilient to domain mismatch between the training and testing data. This would enable the deployment of TD-ASV systems, originally developed for data-rich domains, to data-scarce domains, thereby extending TD-ASV to unconventional domains like children's speech or zero-resource languages. To facilitate research in this direction, the Short-Duration Speaker Verification Challenge (SDSVC) 2020 [4] provides a standardized evaluation platform for researchers to test and benchmark their ASV systems using a common evaluation dataset. In this study, we address the problem of TD-ASV in a novel way by training an encoder in an unsupervised fashion to learn shared feature representations of both speaker and phrase identity. (This study was supported in part by the NSF.)

Figure 1: Encoder-decoder model architecture proposed in this paper. The encoder is an APC model trained in an unsupervised way to learn a generic, high-level feature representation independent of downstream tasks. The decoders (PID and SID) are trained in a supervised manner.
Previously, the i-vector/PLDA (probabilistic linear discriminant analysis) method [5, 6] and some of its extensions [7, 8] showed promising results on the TD-ASV task. Zeinali et al. introduced the HMM-based i-vector approach [9, 10], and used a set of phone-specific HMMs to collect the statistics for i-vector extraction. In [11], Variani et al. replaced the conventional i-vectors by using deep neural networks (DNNs) to learn speaker-discriminative features (d-vectors). A phonetically-aware TD-ASV system was developed to extract i-vectors using a) output posteriors [12] and b) bottleneck features [13] as frame alignments, which were generated from a DNN trained for automatic speech recognition (ASR). To tackle the short-utterance problem, convolutional neural networks [14] and DNNs [15] were used to map the i-vectors extracted from short utterances to the corresponding long-utterance i-vectors. Although these systems were effective, they relied on handcrafted dictionaries to generate alignments for every phrase and on large, labeled, in-domain datasets. On the contrary, the proposed method needs no dictionaries or alignments and can take advantage of abundantly available out-of-domain data.

More recently, the end-to-end (E2E) approach to training TD-ASV systems has gained significant momentum. Heigold et al. proposed an E2E system combining the training, the evaluation, and the verification process into a single compact network and jointly optimized all parameters using a verification-based loss [16]. In [17], Zhang et al. suggested an attention-based E2E network for jointly learning speaker- and phonetic-discriminative features. In contrast to the previous E2E systems that were trained on a tuple-based loss function, Wan et al. proposed the generalized E2E loss function [18]. These E2E systems were, however, computationally expensive and optimized to perform well only for a specific phrase (e.g., the wake-word phrase).

Inspired by the recent success of unsupervised pre-training [19, 20] and representation learning [21, 22, 23], we propose to use a shared encoder with two task-specific decoders for TD-ASV. The model architecture is shown in Figure 1. Specifically, an autoregressive predictive coding (APC) encoder [21] is trained in an unsupervised way to learn a generic feature representation. The encoded representation encapsulates both speaker- and phonetic-discriminative features. We then use features extracted from the encoder as input to task-specific decoders to predict phrase identity and to extract speaker embeddings. Since the APC encoder is trained using unlabeled data (in-domain and out-of-domain), it is capable of capturing high-level speech representations independent of the data domain or downstream tasks. The proposed shared-encoder architecture obviates the need for two separate encoders for each individual task and for large amounts of labeled in-domain data for training. Results on the domain-mismatched evaluation data demonstrate that the proposed shared-encoder model can also be effective for domain adaptation in TD-ASV.

Prior work in feature learning includes [24, 21, 25]. Liu et al. suggested the use of DNNs for feature extraction [24]. Their method, however, required labeled data for training the feature extractor, in contrast to the unsupervised method employed in this study. Chung et al. proposed the unsupervised APC encoder in [21] but used the extracted feature representations with an i-vector/PLDA SID system, as opposed to the task-specific decoders suggested in this paper. While these methods reported results on domain-matched datasets, the proposed model was evaluated on DeepMine data, which consists of Persian and English phrases spoken by non-native English speakers. All evaluations are in accordance with Task 1 of the Short-Duration Speaker Verification Challenge (SDSVC) [4].

The remainder of the paper is organized as follows: in Section 2, the encoder-decoder structure is presented; the datasets used and the proposed model architecture are outlined in Section 3; results are presented and discussed in Section 4; and conclusions and future directions are provided in Section 5.
2. Encoder-Decoder TD-ASV
Predictive coding has played an important role in speech processing, especially in speech coding using linear predictive coding (LPC) [26]. LPC predicts future audio samples, whereas the recently proposed autoregressive predictive coding [21] predicts the features of a future frame. The idea is to utilize the input sequence itself as labels and predict a frame n steps ahead of the current frame, thereby achieving unsupervised speech representation learning. The model architecture is shown in Figure 1.

2.1. Shared APC encoder

Suppose the input speech sequence is $X = (x_1, x_2, \ldots, x_T)$, the time shift of the prediction is fixed at $n$, and the ground truth of the prediction for each frame is $(x_{1+n}, x_{2+n}, \ldots, x_{T+n})$. In order to prevent the model from learning a trivial solution, we apply a uni-directional neural network structure, as opposed to bi-directional networks, by letting the model be aware of the context only from history. By stacking multiple long short-term memory (LSTM) layers and adding residual connections, we obtain a deep LSTM network. Prior to that, a two-layer feed-forward network serves as a pre-net that transforms the speech features into a hidden latent space. Together with the LSTMs, we denote this combined network as DLSTM. The output of the DLSTM is then fed into a linear layer and transformed back to the input space, which means that its dimension is the same as that of the input features. Mathematically, the model architecture can be described as follows:

$$Y = W_f \, \mathrm{DLSTM}(X, W_{lstm}) + b_f \quad (1)$$

where $W_{lstm}$ represents all the parameters in the DLSTM; $W_f$ and $b_f$ denote the weight matrix and bias vector in the last layer, respectively; and $Y = (y_1, y_2, \ldots, y_T)$ is the output. Considering the L1 loss as a metric distance for prediction, all the above parameters are obtained by optimizing the following loss function:

$$L = \sum_{t=1}^{T-n} |x_{t+n} - y_t| \quad (2)$$

2.2. Phrase-ID (PID) decoder

The PID decoder was designed to distinguish between different phrases. In order to obtain better generalization and faster convergence, we allowed the PID decoder to learn frame-level phonetic representations through a phoneme classification task using connectionist temporal classification (CTC) [27]. The frame-level representations were then averaged using a statistical pooling layer to form a single feature vector for sentence-level phrase classification. Specifically, the speech representation obtained in Section 2.1 was first fed into a stacked bi-directional LSTM network (BLSTM) to get the frame-level representations. Then, the frame-level representations were used as the inputs for two subsequent networks. In the first network, they were transformed into the phoneme space to capture phonetic information. In the second network, a pooling layer and two feed-forward layers were used to transcribe the frame-level representations to the phrase-ID space, followed by a softmax layer. The overall PID decoder was optimized by jointly minimizing the following loss:

$$L_{total} = L_{CTC} + \lambda L_{CE} \quad (3)$$

where $L_{CTC}$ is the CTC loss for phoneme classification and $L_{CE}$ is the loss arising from the phrase classification. We use $\lambda$ as a regularizing hyperparameter to control the contribution of the CE loss to the total loss.

2.3. Speaker-ID (SID) decoder

The speaker-ID decoder consists of another BLSTM network followed by a statistical pooling layer to extract speaker embeddings. Speech representations obtained from the APC encoder in Section 2.1 are used here as input. The size of the final transformation layer depends on the number of speakers in the dataset. The SID decoder is optimized by minimizing the cross-entropy loss arising from the classification of speakers.
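To make the APC objective concrete, the following is a minimal PyTorch sketch of Eq. (1) and Eq. (2): a two-layer pre-net, a stack of unidirectional LSTMs with residual connections, and a linear projection back to the input feature space, trained with the time-shifted L1 loss. The class and function names, the per-layer residual placement, and the choice n = 3 are illustrative assumptions, not the authors' exact implementation.

```python
# A minimal sketch of the APC encoder and its shifted-prediction L1 loss.
import torch
import torch.nn as nn

class APCEncoder(nn.Module):
    def __init__(self, feat_dim=40, hidden=512, num_layers=4):
        super().__init__()
        # Pre-net: two feed-forward layers mapping features to a latent space.
        self.prenet = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(0.1),
        )
        # Stack of unidirectional LSTMs; residuals are added in forward().
        self.lstms = nn.ModuleList(
            [nn.LSTM(hidden, hidden, batch_first=True) for _ in range(num_layers)]
        )
        # Final linear layer projects back to the input feature dimension (Eq. 1).
        self.postnet = nn.Linear(hidden, feat_dim)

    def forward(self, x):                    # x: (batch, T, feat_dim)
        h = self.prenet(x)
        for lstm in self.lstms:
            out, _ = lstm(h)
            h = h + out                      # residual connection
        return self.postnet(h)               # y: (batch, T, feat_dim)

def apc_loss(model, x, n=3):
    """L1 loss between frame t's prediction and the frame n steps ahead (Eq. 2).

    n is a placeholder value; the paper does not state the shift used.
    """
    y = model(x)
    return (x[:, n:, :] - y[:, :-n, :]).abs().sum()
```

Because the encoder is strictly causal (uni-directional LSTMs only), the prediction target n frames ahead cannot be copied from the input, which is what forces the model to learn a non-trivial representation.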
3. Experimental Details

The specifications of the datasets used in this paper are provided in Table 1. Utterances from LibriSpeech, VoxCeleb1 and VoxCeleb2 [28], and DeepMine Part 1 [29, 30] were used for three different tasks: 1) unsupervised pre-training of the shared encoder, 2) phrase-ID training, and 3) speaker-ID training. In this section, we provide details of the subsets of data used for each task.

Table 1: Details of the datasets used.
Subset             Database     # Utterances   # Speakers   Duration (hrs)
train-librispeech  LibriSpeech  140k           5466         478.5
dev-librispeech    LibriSpeech  2.7k           97           5.3
train-voxceleb     VoxCeleb     1.2M           7350         2637.8
dev-voxceleb       VoxCeleb     73k            7350         151.2
train-deepmine     DeepMine     101k           963          91.5
dev-deepmine       DeepMine     37k            NA           31.6
test-deepmine      DeepMine     69k            NA           61.2
The in-domain training data (train-deepmine) contains speech utterances from 963 speakers, some of whom have only Persian phrases. The enrollment (dev-deepmine) and test utterances (test-deepmine) are drawn from a fixed set of ten phrases consisting of five Persian and five English phrases, respectively. More details of the phrases can be found in [29].
The unsupervised pre-training of the shared encoder used the out-of-domain train-librispeech subset, 500k utterances from VoxCeleb, and the in-domain train-deepmine subset. Since the APC encoder can be trained on unvoiced frames as well, no speech activity detection (SAD) is applied. A uniform sampling rate of 16 kHz is used across datasets. To prevent overfitting, a combined development set consisting of dev-librispeech, dev-voxceleb, and dev-deepmine was used for hyperparameter selection.

For training the phrase-ID decoder, 100 hours of LibriSpeech and all utterances of train-deepmine were used. The dev-librispeech and dev-deepmine datasets were used for hyperparameter selection.

The SID decoder was trained using 1.2M utterances (7350 speakers) from the VoxCeleb dataset. Similar to the data processing of the x-vector system in [31], the utterances were cut into 3-second segments and augmented with noise from the MUSAN database [32], resulting in a total of 3.2M utterances (~2.7k hours).

The Kaldi framework [33] was used for all front-end preprocessing and feature extraction for each of the three tasks. The features are 40-dimensional filterbanks with a frame length of 25 ms and a frame shift of 10 ms. Cepstral mean and variance normalization is applied to the features. The energy-based SAD (from Kaldi), used in the speaker embedding extraction, filters out non-speech frames.
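The paper's front end is implemented in Kaldi; as a rough stand-in, the same configuration (40-dimensional filterbanks, 25 ms frames, 10 ms shift) can be reproduced with torchaudio's Kaldi-compatible fbank. The per-utterance scope of the CMVN shown below is an assumption.

```python
# A hedged sketch of the front end described above (not the authors' Kaldi recipe).
import torch
import torchaudio

def extract_fbank(wav_path):
    waveform, sr = torchaudio.load(wav_path)        # expects 16 kHz audio
    feats = torchaudio.compliance.kaldi.fbank(
        waveform,
        num_mel_bins=40,                            # 40-dim filterbanks
        frame_length=25.0,                          # 25 ms frames
        frame_shift=10.0,                           # 10 ms shift
        sample_frequency=sr,
    )                                               # -> (num_frames, 40)
    # Per-utterance cepstral mean and variance normalization (assumed scope).
    feats = (feats - feats.mean(dim=0)) / (feats.std(dim=0) + 1e-8)
    return feats
```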
The APC encoder DLSTM is composed of 4 layers of unidirectional LSTMs, with each layer consisting of 512 hidden units. The input to the shared encoder is 40-dimensional filterbank features. The shared encoder is trained in an autoregressive manner by minimizing the L1 loss function described in Section 2.

The pre-net feature embedding network of the encoder DLSTM is made up of 2 fully-connected layers with ReLU activations. The encoder model is initialized using Xavier uniform initialization, and a dropout of 0.1 is applied to the ReLU activations.

During evaluation, the shared encoder is used as a feature extractor to extract learned representations for each utterance. These feature representations are the hidden RNN states of the APC model and form a 4-dimensional tensor of shape (number-layers, batch-size, sequence-length, RNN-hidden-size). In our experiments, the 512-dimensional hidden states of all 4 RNN layers of the APC model were used. Features extracted from the APC model are then fed into the task-specific decoders for learning the corresponding speaker and phrase identities.
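A sketch of this extraction step, reusing the hypothetical APCEncoder from the earlier sketch; whether the stored state is the raw LSTM output or the residual sum is an assumption made here.

```python
# Collect per-layer hidden states into a tensor of shape
# (num_layers, batch, seq_len, hidden), as described in the text.
import torch

def extract_apc_features(model, x):
    """x: (batch, T, 40) filterbanks -> (4, batch, T, 512) feature tensor."""
    model.eval()
    with torch.no_grad():
        h = model.prenet(x)
        states = []
        for lstm in model.lstms:
            out, _ = lstm(h)
            h = h + out                 # residual, as in the encoder sketch
            states.append(h)            # assumption: store the residual sum
    return torch.stack(states, dim=0)
```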
Two standalone decoders are trained to classify speech utterances based on speakers and phrase IDs. Each decoder is trained and evaluated separately.

The phrase-ID (PID) decoder is composed of 3 layers of bidirectional LSTMs made up of 512 hidden units. The output of these BLSTM layers is then fed into two different sub-networks to predict phonemes and classify phrases. The mapping from the Persian to the English phoneme set is adopted as suggested in the data corpus, leading to 39 phonemes in total. Therefore, the phoneme prediction sub-network is a linear layer with a 40-dimensional (39 phonemes + 1 blank) output. The phrase classification sub-network consists of a pooling layer followed by a fully-connected layer (400 hidden units) and a prediction layer with 11 outputs (10 phrases + 1 no match). Since we utilize out-of-domain data that do not have phrase-ID labels, we add an extra category for all utterances whose contents do not match the given 10 phrases of the evaluation data. We observe that the PID decoder converges well when λ (defined in Section 2.2) is heuristically set to 0.2.

The speaker-ID decoder is made up of 3 layers of bidirectional LSTMs, each consisting of 512 hidden units. This is followed by statistical pooling, a fully-connected (dense) layer, and a prediction layer. The dimension of the prediction layer is 7350, based on the number of speakers in the training set. During evaluation, the bottleneck features (outputs from the dense layer of the SID decoder) are extracted and used as speaker embeddings. The dimension of the fully-connected dense layer is set to 600, similar to the x-vector system.

The shared encoder was trained for 5 epochs with a learning rate of e−. The weights and biases of the shared-encoder network were frozen after the training to ensure that the task-specific optimization of the decoders did not modify the shared encoder. Both the phrase-ID and the speaker-ID decoder networks were trained in parallel to minimize their corresponding loss functions. Decoders were trained for 5 epochs with a learning rate of e−, and the learning rate was annealed by a factor of 0.5 after 3 epochs.

During evaluation, the log-likelihood of the phrase ID of a test utterance and the corresponding enrollment utterance being the same is computed as the PID score. Speaker embeddings are extracted from the dense layer of the SID decoder. A PLDA classifier is used to compare the extracted speaker embeddings and predict target/impostor speaker decisions. Speaker embeddings extracted from the speaker-ID decoder were centered and projected using LDA. The LDA dimension was tuned on the VoxCeleb training set to 200. After dimensionality reduction, the representations were length-normalized and modeled by the PLDA, and the PLDA model was then adapted using the DeepMine training data. The log-likelihood scores of the PLDA model (SID scores) and the PID model were fused to generate the final system prediction.
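A hedged sketch of the joint PID objective in Eq. (3), using PyTorch's built-in CTC and cross-entropy losses with λ = 0.2 as stated above. The tensor layouts and the choice of blank index follow nn.CTCLoss conventions and are assumptions about the implementation.

```python
# Joint PID loss: L_total = L_CTC + lambda * L_CE (Eq. 3).
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=39)              # 39 phonemes + 1 blank at index 39
ce = nn.CrossEntropyLoss()

def pid_loss(phone_logits, phone_targets, input_lens, target_lens,
             phrase_logits, phrase_labels, lam=0.2):
    # phone_logits: (T, batch, 40) raw scores from the phoneme sub-network;
    # phrase_logits: (batch, 11) scores from the phrase sub-network.
    l_ctc = ctc(phone_logits.log_softmax(dim=-1), phone_targets,
                input_lens, target_lens)
    l_ce = ce(phrase_logits, phrase_labels)
    return l_ctc + lam * l_ce
```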
4. Results and Discussion
Table 2 provides results obtained from the text-dependent speaker verification task of SDSVC on the evaluation data. System performance is compared using the normalized minimum detection cost function (minDCF) [34].

Two baselines were provided in the challenge evaluation plan for this task: the x-vector system and the i-vector/HMM system. The state-of-the-art x-vector method, based on the TDNN architecture of [31], was trained using the VoxCeleb1 and VoxCeleb2 databases. Evaluation trials, as per the provided baseline, were scored using the PLDA without any score normalization. The i-vector/HMM method, which also takes into consideration phrase information, was selected as the second baseline. Among the published results, the i-vector/HMM method is the best-performing system on DeepMine data.

Table 2: Results for the text-dependent task of the SDSV challenge in terms of minDCF and EER. * indicates baseline and + indicates score-level fusion using linear regression.

Speaker ID System   Phrase ID System   minDCF   EER (%)
x-vector*           None               0.5611   10.13
i-vector*           HMM                0.1472   3.47
x-vector            PID                0.2170   4.80
SID                 PID                0.2697   6.28
SID + x-vector      PID                0.1830   4.18
The proposed system achieves a minDCF of 0.2697 and an EER of 6.28%. This represents a relative improvement of 51.9% in terms of minDCF (0.5611 for the x-vector baseline versus 0.2697 for the proposed method) and 38% in terms of EER (10.13% to 6.28%). In order to have a fair comparison between the x-vector system and the shared-encoder system, we fused the scores of the x-vectors and PID. We observed that, in this case, the performance of the fused x-vectors was better than that of the shared-encoder system. The minDCF improved relatively by 19.5% (from 0.2697 to 0.2170) and the EER by 23.5% (from 6.28% to 4.8%). Thus, the x-vector system, on its own, is better at capturing speaker-discriminative features than the SID network of the proposed framework. Nevertheless, on the overall task of TD-ASV, the proposed system performs better than the x-vector baseline. This improvement in performance can be attributed to the unsupervised pre-training of the shared encoder using unlabeled in-domain data and the use of phonetic information by the proposed system. As a result, our system is better suited for the text-dependent, cross-lingual task of this challenge in comparison to the x-vector baseline.

To further analyze the performance of the proposed system, fusion of the x-vector/PLDA scores and the SID/PLDA scores was performed using linear regression before fusing with the PID scores. Equal coefficients of 0.5 were chosen for this linear regression, which resulted in a 15% gain in minDCF (0.2170 to 0.1830) and a 12% relative gain in EER (4.8% to 4.18%). These results seem to suggest that the SID system offers complementary information to the x-vector system. It is possible that the proposed unsupervised method learns useful speaker-discriminative information that was previously discarded when learning representations in a supervised fashion. Combining supervised and unsupervised feature representations can therefore be advantageous in developing robust TD-ASV systems.

The performance of the i-vector/HMM method, on the other hand, exceeded that of the proposed method by 45% (minDCF of 0.1472 vs. 0.2697). This system used hidden Markov model (HMM) states to model time sequences and extract i-vectors for each phrase. The i-vector/HMM approach outperforms the proposed method mainly because of its capability to reject target-wrong trials: if two different phrases were spoken by the same speaker, the HMM Viterbi decoding produced invalid statistics for such trials and consequently they were rejected easily [10]. In contrast, since the PID and the SID systems were fused by a simple score-level fusion, our system may have predicted higher log-likelihoods for such trials. A comprehensive analysis of the results could not be performed because the ground truth labels for the evaluation data were not available.
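For illustration, the two-stage score fusion described above amounts to a few lines of NumPy. The equal 0.5 weights on the speaker scores follow the text, while the unit weight on the PID scores in the final step is an assumption (the paper fits that fusion by linear regression and does not report its coefficients).

```python
# Illustrative two-stage score-level fusion over per-trial score arrays.
import numpy as np

def fuse_trial_scores(sid, xvec, pid, w=0.5, pid_w=1.0):
    """Equal-weight fusion of SID and x-vector PLDA scores, then PID fusion.

    sid, xvec, pid: per-trial score arrays of equal length.
    pid_w is a hypothetical weight standing in for the fitted coefficient.
    """
    speaker_scores = w * np.asarray(sid) + (1.0 - w) * np.asarray(xvec)
    return speaker_scores + pid_w * np.asarray(pid)
```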
5. Conclusion
In this paper, a novel model architecture comprised of a shared encoder with task-specific decoders was proposed for TD-ASV. An autoregressive predictive coding encoder was trained in an unsupervised fashion to learn generic features independent of the downstream task. Task-specific decoders were then optimized for phrase and speaker classification. An improvement of 52% was achieved in terms of minDCF compared to the x-vector baseline. The i-vector/HMM method was the best-performing system.

The proposed method has the advantage of learning high-level speech patterns from large, unlabeled, data-rich domains. The encoded speech representations successfully captured speaker- and phonetic-discriminative features. Results obtained on the evaluation dataset demonstrated the domain-adaptation ability of the proposed system. Further, strong evidence of the complementarity of the proposed system was found when the x-vector scores were fused with the scores of the encoder-SID decoder.

A natural progression of this work is to compare the effectiveness of the APC encoder against other unsupervised methods such as the contrastive prediction approach. Further research could also be conducted to determine the applicability of the shared encoder to other data-scarce domains, for example, accented speech, zero-resource languages, and children's speech. Additionally, both the PID and SID systems could be jointly trained as a multi-task problem to make the system more robust.

6. References
[1] F. Sigona, "Voice biometrics technologies and applications for healthcare: an overview," JDREAM. Journal of interDisciplinary REsearch Applied to Medicine, vol. 2, no. 1, pp. 5–16, 2018.
[2] N. Singh, R. Khan, and R. Shree, "Applications of speaker recognition," Procedia Engineering, vol. 38, pp. 3122–3126, 2012.
[3] Y.-T. Chang and M. J. Dupuis, "My voiceprint is my authenticator: A two-layer authentication approach using voiceprint for voice assistants," IEEE, 2019, pp. 1318–1325.
[4] H. Zeinali, K. A. Lee, J. Alam, and L. Burget, "Short-duration speaker verification (SDSV) challenge 2020: the challenge evaluation plan," arXiv preprint arXiv:1912.06311, 2020.
[5] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2010.
[6] P. Kenny, T. Stafylakis, P. Ouellet, M. J. Alam, and P. Dumouchel, "PLDA for speaker verification with utterances of arbitrary duration," in ICASSP. IEEE, 2013, pp. 7649–7653.
[7] A. Larcher, K. A. Lee, B. Ma, and H. Li, "Phonetically-constrained PLDA modeling for text-dependent speaker verification with multiple short utterances," in ICASSP. IEEE, 2013, pp. 7673–7677.
[8] T. Stafylakis, P. Kenny, P. Ouellet, J. Perez, M. Kockmann, and P. Dumouchel, "Text-dependent speaker recognition using PLDA with uncertainty propagation," matrix, vol. 500, no. 1, 2013.
[9] H. Zeinali, H. Sameti, L. Burget, J. Černocký, N. Maghsoodi, and P. Matejka, "i-vector/HMM based text-dependent speaker verification system for RedDots challenge," in Interspeech, 2016, pp. 440–444.
[10] H. Zeinali, H. Sameti, and L. Burget, "HMM-based phrase-independent i-vector extractor for text-dependent speaker verification," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 7, pp. 1421–1435, 2017.
[11] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," in ICASSP. IEEE, 2014, pp. 4052–4056.
[12] Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren, "A novel scheme for speaker recognition using a phonetically-aware deep neural network," in ICASSP. IEEE, 2014, pp. 1695–1699.
[13] H. Zeinali, L. Burget, H. Sameti, O. Glembek, and O. Plchot, "Deep neural networks and hidden Markov models in i-vector-based text-dependent speaker verification," in Odyssey, 2016, pp. 24–30.
[14] J. Guo, U. A. Nookala, and A. Alwan, "CNN-based joint mapping of short and long utterance i-vectors for speaker verification using short utterances," in INTERSPEECH, 2017, pp. 3712–3716.
[15] J. Guo, N. Xu, K. Qian, Y. Shi, K. Xu, Y. Wu, and A. Alwan, "Deep neural network based i-vector mapping for speaker verification using short utterances," Speech Communication, vol. 105, pp. 92–102, 2018.
[16] G. Heigold, I. Moreno, S. Bengio, and N. Shazeer, "End-to-end text-dependent speaker verification," in ICASSP. IEEE, 2016, pp. 5115–5119.
[17] S.-X. Zhang, Z. Chen, Y. Zhao, J. Li, and Y. Gong, "End-to-end attention based text-dependent speaker verification," in SLT. IEEE, 2016, pp. 171–178.
[18] L. Wan, Q. Wang, A. Papir, and I. L. Moreno, "Generalized end-to-end loss for speaker verification," in ICASSP. IEEE, 2018, pp. 4879–4883.
[19] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[20] S. Schneider, A. Baevski, R. Collobert, and M. Auli, "wav2vec: Unsupervised pre-training for speech recognition," arXiv preprint arXiv:1904.05862, 2019.
[21] Y.-A. Chung, W.-N. Hsu, H. Tang, and J. Glass, "An unsupervised autoregressive model for speech representation learning," arXiv preprint arXiv:1904.03240, 2019.
[22] Y.-A. Chung and J. Glass, "Generative pre-training for speech with autoregressive predictive coding," in ICASSP, 2020.
[23] A. v. d. Oord, Y. Li, and O. Vinyals, "Representation learning with contrastive predictive coding," arXiv preprint arXiv:1807.03748, 2018.
[24] Y. Liu, Y. Qian, N. Chen, T. Fu, Y. Zhang, and K. Yu, "Deep feature for text-dependent speaker verification," Speech Communication, vol. 73, pp. 1–13, 2015.
[25] A. K. Sarkar, Z.-H. Tan, H. Tang, S. Shon, and J. Glass, "Time-contrastive learning based deep bottleneck features for text-dependent speaker verification," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1267–1279, 2019.
[26] D. O'Shaughnessy, "Linear predictive coding," IEEE Potentials, vol. 7, no. 1, pp. 29–32, 1988.
[27] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," in Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 369–376.
[28] A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: a large-scale speaker identification dataset," in INTERSPEECH, 2017.
[29] H. Zeinali, H. Sameti, and T. Stafylakis, "DeepMine speech processing database: Text-dependent and independent speaker verification and speech recognition in Persian and English," in Odyssey, 2018, pp. 386–392.
[30] H. Zeinali, L. Burget, J. Černocký et al., "A multi purpose and large scale speech corpus in Persian and English for speaker and speech recognition: the DeepMine database," arXiv preprint arXiv:1912.03627, 2019.
[31] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in ICASSP. IEEE, 2018, pp. 5329–5333.
[32] D. Snyder, G. Chen, and D. Povey, "MUSAN: A music, speech, and noise corpus," arXiv preprint arXiv:1510.08484, 2015.
[33] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, 2011.
[34] A. F. Martin and C. S. Greenberg, "The NIST 2010 speaker recognition evaluation," in Interspeech, 2010.