AISpeech-SJTU Accent Identification System for the Accented English Speech Recognition Challenge
† Houjun Huang, † Xu Xiang, Yexin Yang, Rao Ma, Yanmin Qian

AISpeech Ltd, Suzhou, China
MoE Key Lab of Artificial Intelligence, AI Institute, SpeechLab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
{houjun.huang, xu.xiang}@aispeech.com, {yangyexin, rm1031, yanminqian}@sjtu.edu.cn

ABSTRACT
This paper describes the AISpeech-SJTU system for the accent identification track of the Interspeech-2020 Accented English Speech Recognition Challenge. In this challenge track, only 160 hours of accented English data collected from 8 countries and the auxiliary Librispeech dataset are provided for training. To build an accurate and robust accent identification system, we explore the whole system pipeline in detail. First, we introduce the ASR based phone posteriorgram (PPG) feature to accent identification and verify its efficacy. Then, a novel TTS based approach is carefully designed to augment the very limited accent training data for the first time. Finally, we propose test time augmentation and embedding fusion schemes to further improve system performance. Our final system is ranked first in the challenge and outperforms all the other participants by a large margin. The submitted system achieves 83.63% average accuracy on the challenge evaluation data, ahead of the others by more than 10% in absolute terms.
Index Terms — Accent identification, phone posteriorgram, PPG, TTS based data augmentation, test time augmentation
1. INTRODUCTION
An accent is a manner of pronunciation peculiar to a particular individual, location, or nation. It may be influenced by the speaker's locality, educational attainment or first language. Due to the pervasiveness of accents, accent identification is widely utilized in robust speech recognition, speaker recognition, language identification and forensic applications.

Earlier studies in accent identification mainly focused on combining linguistic theory with statistical analysis. Piat et al. used a statistical approach based on prosodic parameters and found that duration and energy are promising parameters for correct identification [1]. Berkling et al. leveraged the structure of English syllables to improve accent identification [2]. Chen et al. proposed a Gaussian mixture model (GMM) based method for Mandarin accent identification [3]. Recently, deep neural network based approaches have emerged in this field. Weninger et al. made use of bidirectional Long Short-Term Memory (bLSTM) networks to model longer-term acoustic context [4]. Jiao et al. proposed a system utilizing long and short term features in parallel using DNNs and RNNs [5].

†: These authors contributed equally to this work. Yanmin Qian is the corresponding author. This work was supported by the National Key R&D Program of China (No. 2018YFB1004602) and the China NSFC project (No. 62071288).
In this work, we describe our accent identification system for the Interspeech-2020 Accented English Speech Recognition Challenge (AESRC) [6] in detail. Since the accent training data is rather limited, it is critical to effectively make use of the auxiliary Librispeech dataset. In contrast to previous studies, we make three distinct contributions. First, we introduce ASR based PPGs as discriminative features for accent identification. Second, we propose a novel TTS based approach to synthesize accented data, which provides richer speaker, channel and text variability for training the accent classifier. Third, we develop test time augmentation and a hierarchical multi-embedding joint model to improve system performance.

This paper is arranged as follows. Section 2 gives an in-depth description of the framework of our system. Section 3 presents our experiments with different settings. Section 4 concludes the paper.
2. SYSTEM DESCRIPTION
In this section, we depict our system for the accent identification challenge, which is shown in Figure 1.

First, we leverage both the accent training data and the Librispeech data to train an ASR model with conventional data augmentation. The PPG features are extracted from the ASR model and employed to train the accent identification (AID) model. Then, we propose a novel TTS based data augmentation to augment the accent data for training an accurate and robust accent classifier. Finally, we introduce a test time augmentation scheme to improve system performance on the test data. Moreover, we further boost the system performance with a hierarchical multi-embedding joint model.
Fig. 1. Our system diagram for the challenge. Firstly, an ASR model is trained on the pooled data. Then, the PPG features derived from the ASR model are prepared to train an AID model. Finally, the AID model makes predictions on the test data using PPG features.

2.1. ASR based PPG feature extraction

While MFCC and FBANK features are popular in speech related tasks, we do not build our AID system on them directly, for the following reasons. First, an AID system built directly on MFCC or FBANK features cannot take advantage of datasets without accent labels. Second, as these features are low level and not task-oriented, they may contain nuisance attributes, such as speaker or text specific information, which makes learning accent related representations harder, especially when the amount of training data is limited.

To address these issues, we adopt the phone posteriorgram (PPG) feature, which has been successfully applied to cross-lingual voice conversion [7] and cross-accent voice conversion [8], to train the accent classifier. A PPG is a time-versus-class matrix that represents the posterior probabilities of phonetic classes for each time frame. In this work, we first train a speaker independent (SI) automatic speech recognition (ASR) model with both the accent training data and the Librispeech data, and then extract the PPG features with the SI-ASR model. With this process, the resulting PPG features have the speaker independent property, which helps improve the robustness of the system.

To train a robust ASR model for PPG feature extraction, we employ two kinds of conventional data augmentation that have been widely used in automatic speech recognition tasks. The first one augments the original data with additive noise and reverberation. For additive noise, the music, noise and speech parts of the MUSAN dataset [9] are used. For reverberation, the room impulse responses (RIRs) and the simulated RIRs described in Kaldi's [10] VoxCeleb recipe are used. The second one is based on warping the signal in the time domain: we randomly change the tempo of the audio signal while keeping the pitch and spectral envelope of the signal unchanged. The tempo effect of the SoX tool is used to achieve such speech rate perturbation.
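To make the feature extraction concrete, the following is a minimal sketch of frame-level PPG extraction, assuming a trained SI-ASR acoustic model is available as a PyTorch module that maps FBANK frames to per-frame logits over phonetic classes. The module interface and tensor shapes are illustrative assumptions, not the exact implementation; the phonetic class inventory of the ASR model determines the PPG dimension.

import torch
import torch.nn.functional as F

def extract_ppg(asr_model: torch.nn.Module, fbank: torch.Tensor) -> torch.Tensor:
    """Convert one utterance's FBANK features into a phone posteriorgram.

    fbank: (num_frames, feat_dim) acoustic features.
    Returns a (num_frames, num_phonetic_classes) matrix of posterior
    probabilities, i.e. the time-versus-class PPG of Section 2.1.
    """
    asr_model.eval()
    with torch.no_grad():
        # Hypothetical forward pass: the SI-ASR acoustic model emits
        # per-frame logits over its phonetic output classes.
        logits = asr_model(fbank.unsqueeze(0))       # (1, T, num_classes)
        ppg = F.softmax(logits, dim=-1).squeeze(0)   # (T, num_classes)
    return ppg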
2.2. TTS based data augmentation

Since only 160 hours of accent data are provided, the data are too limited to train an accurate and robust accent identification model. Therefore, we develop a novel TTS based data augmentation approach that is specially designed for synthesizing accented data.

Using generated high-quality artificial speech as augmented data has been successfully exploited in ASR systems [11]. Recent advances in speech synthesis (text-to-speech, TTS) allow unsupervised modeling of prosody and speaker variations, which makes it possible to synthesize the same texts with diverse speaking styles. Moreover, the development of TTS models has made synthesized speech nearly indistinguishable from human speech. In this work, we choose FastSpeech [12] as our synthesizer and LPCNet [13] as the vocoder.

FastSpeech is a transformer-based model which generates the entire sequence in a non-autoregressive manner. Our implementation of FastSpeech is based on the ESPnet toolkit [14], with the default settings as in [12]. Instead of applying the knowledge distillation process, we train the FastSpeech model from scratch using only the extracted features. The synthesizer converts input text to a spectrogram. The size of the input vocabulary is 41, including English phonemes, a pause break token, and a sentence boundary token. Additionally, we augment the decoder with a five-layer post-net [15], which slightly enhances model performance. LPCNet is a variant of WaveRNN that combines linear prediction with recurrent neural networks, greatly improving synthesis efficiency [13]. Our implementation is based on [13].

First, we train a TDNN x-vector speaker model [16] with the pooled data consisting of the accent training data and the auxiliary Librispeech data. The x-vectors and the phoneme representations extracted from the pooled data are then used to train the FastSpeech synthesizer. In order to better capture the characteristics of each accent, we create 8 accent-specific synthesizers by finetuning the FastSpeech model on each 20-hour accent subset respectively. Finally, on the clean subset of the overall training data, we train two LPCNet vocoders for male and female speakers respectively.

For each speaker in the accent training data, 30 utterances are grouped to calculate speaker specific statistics. We use these statistics with randomly selected reference texts to synthesize data with the previously trained accent-specific synthesizers. With this process, the generated speech preserves the speaker's speaking style while adopting new accents.
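As a rough illustration of how the accent-specific synthesis above could be orchestrated, the sketch below strings the pieces together. The objects accent_synthesizer, vocoder and extract_xvector are placeholder interfaces standing in for the ESPnet FastSpeech model, the LPCNet vocoder and the TDNN x-vector extractor; using the mean x-vector as the speaker statistic is an assumption, since the paper does not specify the exact statistic.

import random
import numpy as np

def synthesize_accented_data(speaker_utts, reference_texts,
                             accent_synthesizer, vocoder,
                             extract_xvector, num_outputs=10):
    """Synthesize speech that keeps one speaker's style but a target accent.

    speaker_utts:       list of waveforms for one speaker (30 in the paper).
    reference_texts:    pool of texts to draw from.
    accent_synthesizer: accent-specific FastSpeech model (placeholder object).
    vocoder:            LPCNet vocoder matching the speaker's gender.
    extract_xvector:    callable mapping a waveform to an x-vector.
    """
    # Speaker-specific statistic: here simply the mean x-vector over the
    # speaker's utterances (an assumed choice of statistic).
    xvectors = np.stack([extract_xvector(w) for w in speaker_utts])
    speaker_embedding = xvectors.mean(axis=0)

    synthesized = []
    for text in random.sample(reference_texts, num_outputs):
        # FastSpeech: text + speaker embedding -> mel spectrogram
        mel = accent_synthesizer.synthesize(text, speaker_embedding)
        # LPCNet: mel spectrogram -> waveform
        wav = vocoder.vocode(mel)
        synthesized.append((text, wav))
    return synthesized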
2.3. Hierarchical multi-embedding joint model

In our final submission, we use a hierarchical multi-embedding joint model to predict the accent label, which is shown in Figure 2. The structure of the TDNN sub-model is the same as the one in [16], but with 2x the size, as we find this gives higher accuracy on the development data. The RES2SETDNN sub-model is developed in a way similar to [17], by introducing the Res2Net [18] type convolution and the squeeze-and-excitation [19] module into the original TDNN structure. However, we do not include the residual connection in our model, as it brings no improvement in performance.

We train the joint model in a progressive way as follows. First, we pretrain each accent identification model independently using an additive angular margin softmax loss [20, 21]. Then, for each model the embedding extraction part is fixed. Finally, we train a linear regression classifier based on the concatenated embeddings extracted from each sub-model. A minimal code sketch of this fusion step is given at the end of this section.

Fig. 2. The hierarchical multi-embedding joint model includes the TDNN and RES2SETDNN sub-models and accepts two sets of features: the 40-dimensional PPG features, and the 40-dimensional PPG features with their first and second order differences. FC denotes the fully connected linear layer.

2.4. Test time augmentation

Test time augmentation is a common trick in image classification tasks to improve test accuracy [22, 23]. Instead of predicting the label of the test image itself, the model takes multiple augmented versions of the test image as input, and the prediction results are then aggregated to give the final result. In our work, similar test time augmentation for speech data is adopted: we use the tempo effect of the SoX tool to generate perturbed versions of each test utterance and aggregate the predictions over them.

Table 1. Identification accuracy (%) of two model architectures on the development set: TDNN and RES2SETDNN. For each architecture, two kinds of feature (FBANK and PPG) and the corresponding data augmentation strategies are listed. Here "CONV Aug" means conventional data augmentation, "TTS Aug" means TTS based data augmentation, "TTA" means test-time data augmentation, and "Delta" means we use the original PPG feature and its first and second differences for training. Columns: ID, Model, Configuration, and Accuracy (%) for US, UK, CHN, IND, JPN, KR, PT, RU and Avg. [The numeric body of Table 1 is not preserved in this version.]
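The following is a minimal PyTorch sketch of the embedding fusion step of Section 2.3, assuming the TDNN and RES2SETDNN sub-models have already been pretrained with the additive angular margin softmax loss. The sub-model interface (an extract_embedding method) and the embedding dimensions are placeholders rather than the actual implementation.

import torch
import torch.nn as nn

class JointAccentClassifier(nn.Module):
    """Concatenate frozen embeddings from several pretrained sub-models
    and classify the accent with a single fully connected layer."""

    def __init__(self, sub_models, embed_dims, num_accents=8):
        super().__init__()
        self.sub_models = nn.ModuleList(sub_models)
        # Freeze the embedding extraction parts of the pretrained models.
        for m in self.sub_models:
            for p in m.parameters():
                p.requires_grad = False
        self.fc = nn.Linear(sum(embed_dims), num_accents)

    def forward(self, features_per_model):
        # features_per_model[i] is the input expected by sub-model i, e.g.
        # 40-dim PPGs for one model and PPGs plus deltas for another.
        with torch.no_grad():
            embeddings = [m.extract_embedding(x)
                          for m, x in zip(self.sub_models, features_per_model)]
        return self.fc(torch.cat(embeddings, dim=-1))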
3. EXPERIMENTS
A detailed comparison of our systems is presented in this section. Kaldi is used for FBANK feature extraction, and PyTorch [24] is utilized to train the neural network models and extract the PPG features. All experiment results on the development set are shown in Table 1, where the 8 English accents are denoted by their corresponding country codes: US (the USA), UK (the United Kingdom), CHN (China), IND (India), JPN (Japan), KR (Korea), PT (Portugal) and RU (Russia).

Our baseline systems are trained on the original 160-hour accent training data, with 40-dimensional FBANK features. In Table 1, the baseline systems for TDNN and RES2SETDNN are denoted by ID 1 and 9 respectively.
We then apply the conventional data augmentation to the accent training data. First, the speech rate of each utterance in the training set is randomly changed to 0.8x, 0.9x, 1.1x or 1.2x, which increases the amount of data to 320 hours, as illustrated in the sketch below. Then, by augmentation with additive noise and reverberation, we extend the training data to 1600 hours. The TDNN and RES2SETDNN models trained on the extended training data are shown in Table 1 as ID 2 and ID 10 respectively. Compared with the baseline systems, the conventional data augmentation gives absolute improvements of 8.50% and 11.41% in average accuracy respectively, and the improvement is consistent across all 8 accents.

In addition to the conventional data augmentation, we further apply the TTS based data augmentation approach described in Section 2.2 on the 1600-hour augmented data to generate 4800 hours of synthesized data. To check the correlation between the synthesized speech and its corresponding accent, we test the synthesized data using systems ID 2 and 10 in Table 1, which are trained on the conventionally augmented data. From Table 2, we can see that the accent identification accuracy on the synthesized data is comparable to that on the development data.
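For concreteness, a minimal sketch of the speed perturbation step using the tempo effect of SoX is shown below; the file naming and directory layout are illustrative rather than our actual data preparation scripts, and the sox command-line tool must be installed.

import subprocess
from pathlib import Path

# Tempo factors used for speech rate perturbation in our experiments.
TEMPO_FACTORS = [0.8, 0.9, 1.1, 1.2]

def perturb_speed(wav_in: str, out_dir: str):
    """Create speed-perturbed copies of one utterance with SoX.

    The SoX 'tempo' effect changes the speaking rate while leaving the
    pitch and spectral envelope unchanged, as described in Section 2.1.
    """
    out_paths = []
    for factor in TEMPO_FACTORS:
        out_path = Path(out_dir) / f"{Path(wav_in).stem}_tempo{factor}.wav"
        subprocess.run(["sox", wav_in, str(out_path), "tempo", str(factor)],
                       check=True)
        out_paths.append(out_path)
    return out_paths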
Table 2. Accuracy (%) on synthesized data

Accent   System ID 2   System ID 10
US          59.64          60.50
UK          90.73          86.04
CHN         70.93          64.08
IND         67.13          70.70
JPN         63.42          53.73
KR          67.48          66.91
PT          86.44          81.77
RU          46.27          45.04
Avg.        69.08          66.16

We train systems ID 3 and 11 with the combined 6400-hour data. As shown in Table 1, the average accuracy is largely improved again, and the improvement across the 8 accents is consistent. This verifies the effectiveness of our proposed TTS based data augmentation.

Table 3. Final submission results on the accent identification evaluation set for the top 4 teams in the rank list
Columns: Ranking, Team, and Accuracy (%) for US, UK, CHN, IND, JPN, KR, PT, RU and Avg. [The numeric body of Table 3 is not preserved in this version.]
In this section, we compare systems trained with PPG features to those trained with FBANK features. The SI-ASR model for PPG feature extraction is prepared with the same settings as in [25]. Following the same training data augmentation schemes as for the FBANK based systems, we train the TDNN and RES2SETDNN models on the original data and on the two sets of augmented data respectively. The systems trained on PPG features are denoted by ID 4, 5, 6 and 12, 13, 14 in Table 1.

According to the average accuracies reported in Table 1, system ID 4 (or 12) beats system ID 1 (or 9) by a large margin, even though they are trained on the original data only. This suggests that, owing to their speaker independent property, PPG features make the discrimination of accents much easier. In addition, comparing systems ID 5 and 6 (or 13 and 14) to ID 4 (or 12), we find that system performance improves further when data augmentation is applied.

In this work, the SI-ASR system used as the extractor for PPG features is trained on all available data (accented + Librispeech). When only the accented data are used to train the SI-ASR system, the PPG features achieve about 3% improvement in average accuracy over the FBANK features, which is smaller than when using all data.
Table 1 shows that, with the test-time data augmentation described in Section 2.4, the performance gets a slight lift from system ID 6 to 7 (or ID 14 to 15).
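A minimal sketch of the test time augmentation at prediction time follows, assuming tempo-perturbed copies of the test utterance have already been generated (for example with the SoX sketch above). Averaging the per-accent probabilities is an assumed aggregation rule; the paper only states that the predictions on the augmented versions are aggregated.

import numpy as np

def predict_with_tta(classifier, wav_paths, extract_features):
    """Aggregate predictions over several versions of one test utterance.

    wav_paths:        the original utterance plus its tempo-perturbed copies.
    classifier:       callable mapping features -> per-accent probability vector.
    extract_features: callable mapping a wav path -> model input features.
    Returns the index of the accent with the highest averaged probability.
    """
    probs = np.stack([classifier(extract_features(p)) for p in wav_paths])
    return int(np.argmax(probs.mean(axis=0)))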
We also train systems with 120-dimensional delta PPG features (the original feature and its first and second order differences) on the combined 6400 hours of training data. We find that the performance drops slightly on the TDNN model (systems ID 7 and 8), but rises a little on the RES2SETDNN model (systems ID 15 and 16).
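For reference, the sketch below shows one way the 120-dimensional delta PPG features could be assembled from the 40-dimensional PPGs. The window-based regression formula follows the common HTK/Kaldi convention, which is an assumption since the paper does not spell out its delta computation.

import numpy as np

def compute_deltas(feats: np.ndarray, window: int = 2) -> np.ndarray:
    """Standard regression-based delta features.

    feats: (num_frames, dim) array.
    Returns an array of the same shape containing the first-order deltas.
    """
    T = feats.shape[0]
    padded = np.pad(feats, ((window, window), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, window + 1))
    delta = np.zeros_like(feats)
    for n in range(1, window + 1):
        delta += n * (padded[window + n:window + n + T]
                      - padded[window - n:window - n + T])
    return delta / denom

def ppg_with_deltas(ppg: np.ndarray) -> np.ndarray:
    """Stack PPGs with their first and second order differences:
    a 40-dim PPG becomes a 120-dim feature, as used by the 'Delta' systems."""
    d1 = compute_deltas(ppg)
    d2 = compute_deltas(d1)
    return np.concatenate([ppg, d1, d2], axis=1)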
In our final system, following the design in Section 2.3, we first initialize the embedding extraction part with the four systems (ID 7, 8, 15 and 16) in Table 1 and fix their parameters. Then we train the fully connected layer for accent classification. The final prediction is given by the 8-way hierarchical multi-embedding joint classification model on the test-time augmented wave files. As shown in Table 1, our final system (ID 18) significantly outperforms the challenge baseline system (ID 17), reaching a remarkable 15.03% accuracy improvement on the development set.

On the challenge evaluation data, Table 3 shows the rankings of the leading submissions for this challenge. Our system ranks first in the challenge with an average accuracy of 83.63% on the evaluation set, ahead of the second team by 11.23%. Moreover, when comparing the system performance shown in Table 1, our final system achieves a much smaller performance gap between the development data and the evaluation data than the challenge baseline.

To better understand our system performance on each accent, we visualize the accent embeddings of the test data using t-SNE [26]. As shown in Figure 3, the cluster of the IND utterances is compact and far from the other clusters, while the cluster of the US utterances has many overlaps with the other clusters. This partially explains the differences in identification accuracy.

Fig. 3. Accent embeddings of the 8 accents in the test set. 200 accent embeddings per accent are chosen from the test set.
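A short sketch of how such a visualization can be produced with scikit-learn's t-SNE is given below; the sampling of 200 embeddings per accent matches Figure 3, while the plotting details and output file name are illustrative.

import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_accent_embeddings(embeddings: np.ndarray, labels: np.ndarray,
                           accents, per_accent: int = 200):
    """Project accent embeddings to 2-D with t-SNE and scatter-plot them.

    embeddings: (N, embed_dim) accent embeddings from the joint model.
    labels:     (N,) integer accent indices into `accents`.
    """
    # Sample up to `per_accent` embeddings per accent, as in Figure 3.
    keep = np.concatenate([np.where(labels == i)[0][:per_accent]
                           for i in range(len(accents))])
    points = TSNE(n_components=2).fit_transform(embeddings[keep])
    for i, name in enumerate(accents):
        mask = labels[keep] == i
        plt.scatter(points[mask, 0], points[mask, 1], s=4, label=name)
    plt.legend()
    plt.savefig("accent_tsne.png", dpi=200)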
4. CONCLUSIONS
In this paper, we describe our submitted system for the Interspeech-2020 Accented English Speech Recognition Challenge (AESRC). Several novel approaches are developed to improve the robustness of our accent identification system. To the best of our knowledge, this is the first time the phone posteriorgram feature has been introduced to accent classification, and it brings an improvement of more than 15% compared to the regular FBANK feature. To train a robust system from such limited data, we adopt TTS based data augmentation to synthesize additional accented training data, improving the system performance by 2.15% ∼ .

5. REFERENCES

[1] Marina Piat, Dominique Fohr, and Irina Illina, "Foreign accent identification based on prosodic parameters," in Ninth Annual Conference of the International Speech Communication Association, 2008.
[2] Kay Berkling, Marc A. Zissman, Julie Vonwiller, and Christopher Cleirigh, "Improving accent identification through knowledge of English syllable structure," in Fifth International Conference on Spoken Language Processing, 1998.
[3] Tao Chen, Chao Huang, Eric Chang, and Jingchan Wang, "Automatic accent identification using Gaussian mixture models," in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU'01). IEEE, 2001, pp. 343–346.
[4] Felix Weninger, Yang Sun, Junho Park, Daniel Willett, and Puming Zhan, "Deep learning based Mandarin accent identification for accent robust ASR," in INTERSPEECH, 2019, pp. 510–514.
[5] Yishan Jiao, Ming Tu, Visar Berisha, and Julie M. Liss, "Accent identification by combining deep neural networks and recurrent neural networks trained on long and short term features," in INTERSPEECH, 2016, pp. 2388–2392.
[6] Xian Shi, Fan Yu, Yizhou Lu, Qiangze Feng, Daliang Wang, Yanmin Qian, and Lei Xie, "The accented English speech recognition challenge 2020: open datasets, tracks, baselines, results and methods," 2020.
[7] Lifa Sun, Hao Wang, Shiyin Kang, Kun Li, and Helen M. Meng, "Personalized, cross-lingual TTS using phonetic posteriorgrams," in INTERSPEECH, 2016, pp. 322–326.
[8] Guanlong Zhao, Shaojin Ding, and Ricardo Gutierrez-Osuna, "Foreign accent conversion by synthesizing speech from phonetic posteriorgrams," in INTERSPEECH, 2019, pp. 2843–2847.
[9] David Snyder, Guoguo Chen, and Daniel Povey, "MUSAN: A music, speech, and noise corpus," arXiv preprint arXiv:1510.08484, 2015.
[10] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al., "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, 2011.
[11] Guangzhi Sun, Yu Zhang, Ron J. Weiss, Yuan Cao, Heiga Zen, Andrew Rosenberg, Bhuvana Ramabhadran, and Yonghui Wu, "Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and autoregressive prosody prior," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6699–6703.
[12] Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu, "FastSpeech: Fast, robust and controllable text to speech," in Advances in Neural Information Processing Systems, 2019, pp. 3171–3180.
[13] Jean-Marc Valin and Jan Skoglund, "LPCNet: Improving neural speech synthesis through linear prediction," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5891–5895.
[14] Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, et al., "ESPnet: End-to-end speech processing toolkit," arXiv preprint arXiv:1804.00015, 2018.
[15] Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4779–4783.
[16] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5329–5333.
[17] Brecht Desplanques, Jenthe Thienpondt, and Kris Demuynck, "ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification," arXiv preprint arXiv:2005.07143, 2020.
[18] Shanghua Gao, Ming-Ming Cheng, Kai Zhao, Xin-Yu Zhang, Ming-Hsuan Yang, and Philip H. S. Torr, "Res2Net: A new multi-scale backbone architecture," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[19] Jie Hu, Li Shen, and Gang Sun, "Squeeze-and-excitation networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141.
[20] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou, "ArcFace: Additive angular margin loss for deep face recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4690–4699.
[21] Xu Xiang, Shuai Wang, Houjun Huang, Yanmin Qian, and Kai Yu, "Margin matters: Towards more discriminative deep neural network embeddings for speaker recognition," IEEE, 2019, pp. 1652–1656.
[22] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[23] Karen Simonyan and Andrew Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[24] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al., "PyTorch: An imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems, 2019, pp. 8026–8037.
[25] Tian Tan, Yizhou Lu, Rao Ma, Sen Zhu, Jiaqi Guo, and Yanmin Qian, "AISpeech-SJTU ASR system for the accented English speech recognition challenge," IEEE, 2020.
[26] Laurens van der Maaten and Geoffrey E. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.