Investigation of Speaker-adaptation methods in Transformer based ASR
Vishwas M. Shetty, Metilda Sagaya Mary N J, S. Umesh
Department of Electrical Engineering, Indian Institute of Technology Madras, Chennai, India
Abstract
End-to-end models are fast replacing conventional hybrid models in automatic speech recognition. A transformer is a sequence-to-sequence framework based solely on attention that was initially applied to the machine translation task. This end-to-end framework has been shown to give promising results when used for automatic speech recognition as well. In this paper, we explore different ways of incorporating speaker information while training a transformer-based model to improve its performance. We present speaker information in the form of speaker embeddings for each of the speakers. Two broad categories of speaker embeddings are used: (i) fixed embeddings, and (ii) learned embeddings. We experiment with speaker embeddings learned along with the model training, as well as one-hot vectors and x-vectors. Using these different speaker embeddings, we obtain an average relative improvement of 1% to 3% in the token error rate. We report results on the NPTEL lecture database. NPTEL is an open-source e-learning portal providing content from top Indian universities.
Index Terms: Transformer, Speech Recognition, Speaker Adaptation, x-vector
1. Introduction
The recognition performance of an Automatic Speech Recognition (ASR) system is affected by speaker variations. Speaker adaptation in conventional DNN-HMM based systems was explored in [1, 2, 3, 4, 5, 6]. i-vectors appended to the input features have been shown to improve model performance. The use of i-vectors for speaker adaptation was explored in [3, 4, 5]. Along with i-vectors, the use of x-vectors, deep CNN embeddings, for speaker adaptive training of DNNs was explored in [2]. In [6], d-vectors were used for speaker adaptation.

Speaker adaptation in an End-to-End (E2E) framework was explored in [7, 8, 9, 10]. In [10], speaker i-vectors extracted from the training data were stored in a memory block and accessed through a learned attention mechanism during testing. The resulting vector was called the memory vector and was appended to the acoustic features while training the E2E model. This was an unsupervised speaker adaptation approach, in the sense that it did not require i-vector computation during testing. In [7], two approaches for adaptation, Kullback-Leibler divergence (KLD) adaptation and Multi-task learning (MTL) adaptation, of an E2E Connectionist Temporal Classification (CTC) based model were proposed. Speaker adaptation in a multi-channel E2E framework was proposed in [9].

In [11], a method for speaker adaptation that makes use of a personalized speech synthesizer and a neural language generator was proposed. The primary objective in [11] was to address the problem of data sparsity in the case of rapid speaker adaptation. Their approach was similar to back-translation in a machine translation problem. The speech synthesizer was adapted to the target speaker using about a minute of speech data and was then used to synthesize relevant text generated by the neural language generator. The synthesized speech, along with the original speech data for adaptation, is used for acoustic model adaptation.

In the transformer-based E2E framework, [12] proposed a speaker-aware speech-transformer. Here, speaker embeddings were obtained by attention over i-vectors. At each time step, a weighted combination of i-vectors was calculated to generate a speaker embedding.

In this paper, we propose to study the effect of providing speaker information on ASR performance for systems built on a transformer framework. We explore different ways of providing speaker information while training the model as well as during testing. Speaker information is provided in the form of embeddings for each of the speakers. We have experimented with different types of embeddings: one-hot, x-vectors, and learned embeddings, and with different ways of incorporating them: add and concatenate.
2. Dataset
National Programme on Technology Enhanced Learning (NPTEL) is an open-source e-learning portal managed and coordinated by the Indian Institutes of Technology (IITs) and the Indian Institute of Science (IISc). NPTEL provides free access to lecture videos and study materials for courses taught at top Indian universities. It covers a plethora of subjects spanning engineering, basic sciences, humanities, and social sciences.

Figure 1: Speaker-wise distribution of the train set

The lecture videos and their corresponding subtitles have been made available in the public domain. In this paper, we have worked with lectures from the Computer Science and Electrical Engineering domains. The lecture video files are available in .mp4 format. Audio was extracted from these files in .wav format by sampling at 16 kHz. The transcriptions for the audio files were obtained from their corresponding SRT files, which had time stamps as well as the corresponding captions. The speaker, i.e., the instructor, for each course is different. Data from ten different courses were used in our experiments. All lectures are in Indian English. The data set details are given in Table 1. The train, dev, and eval sets have data from the same set of speakers. The speaker-wise data distribution of the train set is shown in Figure 1. The eval and dev sets have a similar speaker distribution. Further, the train and dev sets have 10 speakers each, whereas the eval set has 9 speakers.

Table 1: NPTEL data set statistics

              train    dev     eval
Dur (hours)   196      5.35    5.20
3. Experimental Details
The sequence-to-sequence encoder-decoder model is trained based on the transformer [13] framework. The encoder has 12 layers, whereas the decoder has 6 layers. Each layer in both the encoder and the decoder has a self-attention module followed by 2048 feed-forward units with ReLU non-linearity. The self-attention module had an attention dimension of 256 and four attention heads. The batch size was set to 32 with accum-grad 2 (gradient accumulation over 2 batches), which is equivalent to having a batch size of 64. The "noam" optimizer (section 5.3 of [13]) with 25000 warmup steps and an initial transformer learning rate of 10 was used. All the models were trained for 40 epochs. The final model was obtained by averaging the best five models. The experiments were run on V100 and GTX1080 GPU cards. The Kaldi [14] and ESPnet [15] toolkits were used to train our models. The model training was based on the hybrid CTC/Attention [16] architecture. Connectionist Temporal Classification (CTC) alignments, being monotonic, guide the training of the attention model. The objective function for this hybrid training procedure is:

\mathcal{L}_{\mathrm{MOL}} = \lambda \log p_{\mathrm{ctc}}(C \mid X) + (1 - \lambda) \log p^{*}_{\mathrm{att}}(C \mid X)

Here X is the input acoustic feature sequence, and C is its corresponding target sequence. In this experiment, we use Byte Pair Encoding (BPE) [17] subword units. λ, the multi-task learning coefficient, is a tunable parameter that decides the contribution from the CTC and attention components. In our experiments, we have set λ to 0.3 both while training and decoding in most of the cases, except for x-vectors, where we found λ = 0.1 was optimal while decoding.
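To make the hybrid objective concrete, the following is a minimal PyTorch sketch of how the CTC and attention branches could be combined; the function name and tensor shapes are illustrative and not taken from the ESPnet implementation used here.

```python
import torch
import torch.nn.functional as F

def hybrid_ctc_attention_loss(ctc_log_probs, ctc_targets, input_lengths,
                              target_lengths, att_log_probs, att_targets,
                              mtl_lambda=0.3):
    """Weighted combination of CTC and attention (cross-entropy) losses.

    ctc_log_probs:  (T, B, V) log-probabilities from the encoder CTC head
    att_log_probs:  (B, L, V) log-probabilities from the decoder
    mtl_lambda:     multi-task learning coefficient (0.3 during training here)
    """
    # CTC branch: -log p_ctc(C|X), averaged over the batch.
    ctc_loss = F.ctc_loss(ctc_log_probs, ctc_targets,
                          input_lengths, target_lengths)

    # Attention branch: -log p_att(C|X) via token-level cross-entropy.
    att_loss = F.nll_loss(att_log_probs.transpose(1, 2), att_targets,
                          ignore_index=-1)  # -1 marks padding positions

    # L_MOL = lambda * (-log p_ctc) + (1 - lambda) * (-log p_att)
    return mtl_lambda * ctc_loss + (1.0 - mtl_lambda) * att_loss
```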
4. Different Speaker Embedding Implementations
The Baseline model in our experiment was the model trained without speaker information provided in any form. This speaker-independent model was trained with 83-dimensional features, i.e., 80-dimensional filter-bank features along with 3 pitch-related features. These were obtained using 25 ms window frames with a frameshift of 10 ms. The features were then passed through two conv2d layers, each with stride 2. Hence, an input of "T" frames to the conv2d layers is reduced to "T/4" frames at their output. The target set consisted of 159 BPE units. For speaker adaptation, we provide speaker embeddings both during training and testing, i.e., speaker-adaptive training. In the next few sections, we describe the different ways in which we provide the speaker embeddings.
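As an illustration of the front end described above, the sketch below shows two stride-2 conv2d layers reducing "T" input frames to roughly "T/4". It is a simplified PyTorch sketch; the output channel count of 256 and the flatten step follow common transformer-ASR front ends and are assumptions, not details reported in this paper.

```python
import torch
import torch.nn as nn

class Conv2dSubsampling(nn.Module):
    """Two stride-2 conv2d layers: T input frames become roughly T/4 frames."""

    def __init__(self, feat_dim=83, odim=256):
        super().__init__()
        # Treat the (T, 83) feature matrix as a 1-channel "image".
        self.conv = nn.Sequential(
            nn.Conv2d(1, odim, kernel_size=3, stride=2),
            nn.ReLU(),
            nn.Conv2d(odim, odim, kernel_size=3, stride=2),
            nn.ReLU(),
        )

    def forward(self, feats):
        # feats: (B, T, feat_dim) -> (B, 1, T, feat_dim)
        x = self.conv(feats.unsqueeze(1))
        b, c, t, f = x.size()
        # Flatten channel and frequency axes to get (B, ~T/4, c * f).
        return x.transpose(1, 2).contiguous().view(b, t, c * f)

# Example: 200 input frames are reduced to roughly 200/4 frames.
out = Conv2dSubsampling()(torch.randn(2, 200, 83))
print(out.shape)  # torch.Size([2, 49, 5120])
```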
4.1. One-hot speaker embeddings

Since in our experiments we were dealing with ten different speakers, ten-dimensional one-hot speaker embeddings were used, i.e., [1 0 0 0 0 0 0 0 0 0] for Spk1, [0 1 0 0 0 0 0 0 0 0] for Spk2, and so on. Given an utterance, all the feature frames belong to one speaker. Hence, each of the frames from a given utterance is concatenated with the same one-hot speaker embedding. These ten-dimensional one-hot vectors were concatenated to the input acoustic feature frames, as shown in Figure 2. The input acoustic feature frames are 83-dimensional, and hence after appending the one-hot vectors, the input becomes 93-dimensional. This model is referred to as One-hot in this paper.

Figure 2: Appending one-hot vectors as speaker embeddings to the acoustic feature frames
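A minimal sketch of this frame-level concatenation is given below; the function name append_one_hot is purely illustrative.

```python
import numpy as np

def append_one_hot(feats, speaker_index, num_speakers=10):
    """Concatenate the same one-hot speaker vector to every acoustic frame.

    feats:          (T, 83) filter-bank + pitch features for one utterance
    speaker_index:  integer id of the utterance's speaker (0..num_speakers-1)
    returns:        (T, 83 + num_speakers) speaker-augmented features
    """
    one_hot = np.zeros(num_speakers, dtype=feats.dtype)
    one_hot[speaker_index] = 1.0
    # Broadcast the one-hot vector across all T frames before concatenation.
    tiled = np.tile(one_hot, (feats.shape[0], 1))
    return np.concatenate([feats, tiled], axis=1)

# Example: an utterance from Spk3 (index 2) with 200 frames.
frames = np.random.randn(200, 83).astype(np.float32)
augmented = append_one_hot(frames, speaker_index=2)
print(augmented.shape)  # (200, 93)
```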
4.2. x-vector speaker embeddings

A 512-dimension x-vector embedding is obtained for each speaker using a pretrained model built in Kaldi (http://kaldi-asr.org/models/8/0008_sitw_v2_1a.tar.gz). The pretrained model was built using VoxCeleb data. We have used x-vectors extracted per speaker in our experiments, i.e., since we have 10 speakers, we extracted 10 x-vectors. We have explored the following ways of using these speaker-based x-vector embeddings.

4.2.1. x-vector out add

The 512-dimension x-vector for a given speaker is first down-projected to 256 dimensions and then added to all the output vectors from the final encoder layer, as shown in Figure 3. These speaker-embedding-added encoder vectors are then used to calculate attention with each decoder layer output. This model is referred to as x-vector out add in this paper. Note that the projection is learned during model training.

Figure 3: Adding the down-projected x-vector to the output of the final encoder layer
4.2.2. x-vector out cat

The 512-dimension x-vector for a given speaker is concatenated to all the 256-dimension output vectors from the final encoder layer. These 256 + 512 = 768-dimension output vectors are then down-projected to 256 dimensions and used to calculate attention with each decoder layer output, as shown in Figure 4. In this case also, the projection is learned during training. This model is referred to as x-vector out cat in this paper.

Figure 4: Concatenating the x-vector to the output of the final encoder layer
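The two encoder-output variants can be sketched as follows; this is a minimal PyTorch illustration, and the class name and mode flag are our own, not part of the authors' implementation.

```python
import torch
import torch.nn as nn

class XvectorEncoderOutput(nn.Module):
    """Inject a per-speaker x-vector at the final encoder output.

    mode="add": project 512 -> 256 and add to every encoder output frame.
    mode="cat": concatenate the raw 512-dim x-vector to every 256-dim frame
                and project the 768-dim result back down to 256.
    """

    def __init__(self, xvec_dim=512, enc_dim=256, mode="add"):
        super().__init__()
        self.mode = mode
        # In both cases the projection is learned during model training.
        if mode == "add":
            self.proj = nn.Linear(xvec_dim, enc_dim)
        else:
            self.proj = nn.Linear(enc_dim + xvec_dim, enc_dim)

    def forward(self, enc_out, xvector):
        # enc_out: (B, T', 256); xvector: (B, 512), one per utterance/speaker
        if self.mode == "add":
            return enc_out + self.proj(xvector).unsqueeze(1)
        expanded = xvector.unsqueeze(1).expand(-1, enc_out.size(1), -1)
        return self.proj(torch.cat([enc_out, expanded], dim=-1))
```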
4.2.3. x-vector in append

Here the 512-dimension x-vector is first down-projected to 83 dimensions, i.e., to the dimension of the input acoustic feature vectors. The 83-dimension x-vector is then used as the first and the last frame for that utterance. Hence, "T" acoustic feature frames in a given utterance, after the x-vector append, result in "T+2" frames. These are then passed through the conv2d layers, as shown in Figure 5. This model is referred to as x-vector in append in this paper. As usual, the projection is learned during training.
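A minimal PyTorch sketch of this first/last-frame append, assuming one utterance is processed at a time, is shown below; the class name is illustrative.

```python
import torch
import torch.nn as nn

class XvectorInAppend(nn.Module):
    """Down-project the x-vector to 83 dims and use it as the first and last
    frames of the utterance, turning T frames into T + 2 frames."""

    def __init__(self, xvec_dim=512, feat_dim=83):
        super().__init__()
        self.proj = nn.Linear(xvec_dim, feat_dim)  # learned during training

    def forward(self, feats, xvector):
        # feats: (T, 83) frames of one utterance; xvector: (512,)
        spk_frame = self.proj(xvector).unsqueeze(0)          # (1, 83)
        return torch.cat([spk_frame, feats, spk_frame], 0)   # (T + 2, 83)
```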
4.2.4. x-vector in cat

Here the 512-dimension x-vector is first down-projected to 83 dimensions and then concatenated to the 83-dimensional input acoustic frames. This approach is similar to the One-hot approach shown in Figure 2, with the one-hot vectors replaced by down-projected x-vectors. The input to the conv2d layers has the dimension 83 + 83 = 166. This model is referred to as x-vector in cat in this paper.

Figure 5: Appending the down-projected x-vector as the first and last frames along with the input acoustic feature frames
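For completeness, a corresponding sketch of this per-frame concatenation variant is given below, again with an illustrative class name.

```python
import torch
import torch.nn as nn

class XvectorInCat(nn.Module):
    """Down-project the x-vector to 83 dims and concatenate it to every
    83-dim input frame, giving 166-dim features for the conv2d front end."""

    def __init__(self, xvec_dim=512, feat_dim=83):
        super().__init__()
        self.proj = nn.Linear(xvec_dim, feat_dim)  # learned during training

    def forward(self, feats, xvector):
        # feats: (T, 83); xvector: (512,)
        spk = self.proj(xvector).expand(feats.size(0), -1)   # (T, 83)
        return torch.cat([feats, spk], dim=-1)               # (T, 166)
```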
4.3. Learned speaker embeddings

So far, in all our experiments, the speaker embeddings for each of the speakers were decided a priori, i.e., before training. We then explored methods to learn the speaker embedding, instead of providing a fixed vector as an embedding. In the NPTEL data set used in our experiments, a speech segment/utterance belongs to one speaker. We make use of this information to learn speaker embeddings for each speaker. The speaker embeddings are learned similarly to the way character embeddings are learned at the transformer decoder. Like the character embedding matrix at the decoder, we learn a speaker embedding matrix at the encoder. This speaker embedding matrix is learned along with the model training.

Figure 6: Adding learned speaker embedding vectors to the acoustic feature frames

We have experimented with two different ways of incorporating these learned speaker embeddings, i.e., adding and concatenating them to the acoustic feature frames. The dimension of the learned embedding is decided based on how it is to be incorporated. In the case of adding, the embedding dimension was matched to the dimension of the input acoustic features, whereas in the case of concatenating, their dimension was set to ten. In this paper, we refer to the speaker embedding "add" and "concatenate" models as Spk embed add and Spk embed concat, respectively. Figure 6 shows the Spk embed add approach. Spk embed concat is similar to the One-hot approach shown in Figure 2, with the one-hot vector being replaced by the learned speaker embeddings generated by the speaker embedding matrix.
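A minimal PyTorch sketch of such a jointly learned speaker embedding matrix is given below; the class name and the mode flag covering both the add and concatenate variants are illustrative.

```python
import torch
import torch.nn as nn

class LearnedSpeakerEmbedding(nn.Module):
    """Speaker embedding matrix learned jointly with the ASR model, analogous
    to the character embedding matrix at the transformer decoder."""

    def __init__(self, num_speakers=10, feat_dim=83, mode="add"):
        super().__init__()
        self.mode = mode
        # "add": embedding dimension matches the 83-dim acoustic features.
        # "cat": a 10-dim embedding is concatenated, as in the one-hot case.
        emb_dim = feat_dim if mode == "add" else 10
        self.table = nn.Embedding(num_speakers, emb_dim)

    def forward(self, feats, speaker_id):
        # feats: (T, 83); speaker_id: scalar LongTensor with the speaker index
        emb = self.table(speaker_id).expand(feats.size(0), -1)
        if self.mode == "add":
            return feats + emb                      # Spk embed add
        return torch.cat([feats, emb], dim=-1)      # Spk embed concat
```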
Table 2: Speaker-wise %TER comparison for the dev set

Table 3: Speaker-wise %TER comparison for the eval set (Spk6 is not present in the eval set)
5. Analysis
Since we are using BPEs as the sub-word units, we report results in terms of Token Error Rates (TER). Given in Table 2 and Table 3 are the %TERs for the dev and eval sets, respectively. Here λ is the MTL coefficient. Note that during testing, we assume that we know the speaker's identity; hence, during testing, we provide the corresponding speaker embedding of the speaker. There is a small, consistent improvement across most speakers for learned embeddings. Some of the observations from the tables are:
• One-hot gets some improvement on the dev set but is worse for most speakers in the eval set.
• x-vector in cat performs slightly better than the Baseline on the dev set but is almost the same on the eval set.
• Both ways of providing the x-vector at the encoder output, i.e., add and concatenate, did not provide any significant improvement. On observing the train and validation accuracies, we found that though the attention accuracies of the x-vector out models were better than those of the Baseline model, their CTC loss was inferior to the Baseline CTC loss. As a check, we reduced λ, the MTL coefficient, while decoding from 0.3 to 0.1, thus reducing the CTC contribution. These results have been referred to as x-vector out add + 0.1 λ in Tables 2 and 3. With λ set to 0.1, x-vector embeddings at the encoder output improve slightly over the baseline for both the dev and eval sets.
• Learned speaker embeddings, both when added and when concatenated, give an improvement over the Baseline model.
• Providing x-vectors at the input of the encoder by appending them to the feature vectors and decoding with λ = 0.1, i.e., x-vector in append + 0.1 λ, gave the best performance among all the speaker embeddings discussed in this paper. This corresponds to the implementation in Figure 5, with the corresponding description in section 4.2.3.
6. Conclusion and Future work
In this paper, we have explored different ways of adapting a Transformer-based E2E speech recognizer. The performance of different types of speaker embeddings, one-hot, x-vector, and learned embeddings, has been reported. We compare different ways of incorporating these embeddings while training: adding, concatenating, at the encoder input, and at the encoder output. From the results, it was observed that among the different embeddings discussed in this paper, providing x-vectors at the input of the encoder gives the best recognition accuracy. We get 1% and 3% relative improvement over the Baseline model on the eval and dev sets, respectively. Here, we have worked with a small set (ten) of speakers and closed (seen speaker) test conditions. In the future, we would like to explore working with a larger number of speakers and also extend to an open (unseen speaker) test condition.

7. References

[1] H. Seki, K. Yamamoto, T. Akiba, and S. Nakagawa, "Rapid speaker adaptation of neural network based filterbank layer for automatic speech recognition," 2018, pp. 574–580.
[2] J. Rownicka, P. Bell, and S. Renals, "Embeddings for DNN speaker adaptive training," 2019, pp. 479–486.
[3] G. Saon, H. Soltau, D. Nahamoo, and M. Picheny, "Speaker adaptation of neural network acoustic models using i-vectors," 2013, pp. 55–59.
[4] V. Gupta, P. Kenny, P. Ouellet, and T. Stafylakis, "I-vector-based speaker adaptation of deep neural networks for French broadcast audio transcription," May 2014.
[5] A. Senior and I. Lopez-Moreno, "Improving DNN speaker independence with i-vector inputs," in Proc. ICASSP, 2014.
[6] Y. Zhao, J. Li, S. Zhang, L. Chen, and Y. Gong, "Domain and speaker adaptation for Cortana speech recognition," 2018, pp. 5984–5988.
[7] K. Li, J. Li, Y. Zhao, K. Kumar, and Y. Gong, "Speaker adaptation for end-to-end CTC models," 2018, pp. 542–549.
[8] S. Mirsamadi and J. H. Hansen, "On multi-domain training and adaptation of end-to-end RNN acoustic models for distant speech recognition," in Proc. Interspeech 2017, 2017, pp. 404–408. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2017-398
[9] T. Ochiai, S. Watanabe, S. Katagiri, T. Hori, and J. R. Hershey, "Speaker adaptation for multichannel end-to-end speech recognition," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018.
[10] L. Sarı, N. Moritz, T. Hori, and J. Le Roux, "Unsupervised speaker adaptation using attention-based speaker memory for end-to-end ASR," in ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 7384–7388.
[11] Y. Huang, L. He, W. Wei, W. Gale, J. Li, and Y. Gong, "Using personalized speech synthesis and neural language generator for rapid speaker adaptation," in ICASSP, April 2020.
[12] Z. Fan, J. Li, S. Zhou, and B. Xu, "Speaker-aware speech-transformer," 2019, pp. 222–229.
[13] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[14] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, 2011.
[15] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. Enrique Yalta Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, and T. Ochiai, "ESPnet: End-to-end speech processing toolkit," in Interspeech, 2018, pp. 2207–2211. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2018-1456
[16] S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi, "Hybrid CTC/attention architecture for end-to-end speech recognition," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240–1253, 2017.
[17] P. Gage, "A new algorithm for data compression," C Users Journal, 1994.