AISpeech-SJTU Accent Identification System for the Accented English Speech Recognition Challenge
† Houjun Huang, † Xu Xiang, Yexin Yang, Rao Ma, Yanmin Qian

AISpeech Ltd, Suzhou, China
MoE Key Lab of Artificial Intelligence, AI Institute, SpeechLab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
{houjun.huang, xu.xiang}@aispeech.com, {yangyexin, rm1031, yanminqian}@sjtu.edu.cn

ABSTRACT
This paper describes the AISpeech-SJTU system for the accent identification track of the Interspeech-2020 Accented English Speech Recognition Challenge. In this challenge track, only 160 hours of accented English data collected from 8 countries and the auxiliary Librispeech dataset are provided for training. To build an accurate and robust accent identification system, we explore the whole system pipeline in detail. First, we introduce the ASR based phone posteriorgram (PPG) feature to accent identification and verify its efficacy. Then, a novel TTS based approach is carefully designed to augment the very limited accent training data for the first time. Finally, we propose test time augmentation and embedding fusion schemes to further improve system performance. Our final system is ranked first in the challenge and outperforms all the other participants by a large margin. The submitted system achieves 83.63% average accuracy on the challenge evaluation data, ahead of the others by more than 10% in absolute terms.
Index Terms — Accent identification, phone posteriorgram, PPG, TTS based data augmentation, test time augmentation
1. INTRODUCTION
An accent is a manner of pronunciation peculiar to a particular individual, location, or nation. It may be influenced by the speaker's locality, educational attainment or first language. Due to the pervasiveness of accents, accent identification is widely utilized in robust speech recognition, speaker recognition, language identification and forensic applications.

Earlier studies in accent identification mainly focused on combining linguistic theory with statistical analysis. Piat et al. used a statistical approach based on prosodic parameters and found that duration and energy are promising parameters for correct identification [1]. Berkling et al. leveraged the structure of English syllables to improve accent identification [2]. Chen et al. proposed a Gaussian mixture model (GMM) based method for Mandarin accent identification [3]. Recently, deep neural network based approaches have emerged in this field. Weninger et al. made use of bidirectional Long Short-Term Memory (bLSTM) networks to model longer-term acoustic context [4]. Jiao et al. proposed a system utilizing long and short term features in parallel using DNNs and RNNs [5].

†: These authors contributed equally to this work. Yanmin Qian is the corresponding author. This work was supported by the National Key R&D Program of China (No. 2018YFB1004602) and the China NSFC project (No. 62071288).
In this work, we describe our accent identification system for the Interspeech-2020 Accented English Speech Recognition Challenge (AESRC) [6] in detail. Since the accent training data is rather limited, it is critical to effectively make use of the auxiliary Librispeech dataset. In contrast to previous studies, we make three distinct contributions. First, we introduce ASR based PPGs as discriminative features for accent identification. Second, we propose a novel TTS based approach to synthesize accented data, which provides richer speaker, channel and text variability for training the accent classifier. Third, we develop test time augmentation and a hierarchical multi-embedding joint model to improve system performance.

This paper is arranged as follows. Section 2 gives an in-depth description of the framework of our system. Section 3 presents our experiments with different settings. Section 4 concludes the paper.
2. SYSTEM DESCRIPTION
In this section, we depict our system for the accent identification challenge, which is shown in Figure 1.

First, we leverage both the accent training data and the Librispeech data to train an ASR model with conventional data augmentation. The PPG features are extracted from the ASR model and employed to train the accent identification (AID) model. Then, we propose a novel TTS based data augmentation to augment the accent data for training an accurate and robust accent classifier. Finally, we introduce a test time augmentation scheme to improve system performance on the test data. Moreover, we further boost the system performance with a hierarchical multi-embedding joint model.
Fig. 1. Our system diagram for the challenge. Firstly, an ASR model is trained on the pooled data. Then, the PPG features derived from the ASR model are prepared to train an AID model. Finally, the AID model makes predictions on the test data using PPG features.

2.1. ASR based PPG feature extraction

While MFCC and FBANK features are popular in speech related tasks, we do not build our AID system on them directly, for the following reasons. First, an AID system built directly on MFCC or FBANK features cannot take advantage of datasets without accent labels. Second, as these features are low level and not task-oriented, they may contain nuisance attributes, such as speaker or text specific information, which makes learning accent related representations harder, especially when the amount of training data is limited.

To address these issues, we adopt the phone posteriorgram (PPG) feature, which has been successfully applied to cross-lingual voice conversion [7] and cross-accent voice conversion [8], to train the accent classifier. A PPG is a time-versus-class matrix that represents the posterior probabilities of phonetic classes for each time frame. In this work, we first train a speaker independent (SI) automatic speech recognition (ASR) model with both the accent training data and the Librispeech data, and then extract the PPG features with the SI-ASR model. With this process, the resulting PPG features have the speaker independent property, which helps improve the robustness of the system.

To train a robust ASR model for PPG feature extraction, we employ two kinds of conventional data augmentation that have been widely used in automatic speech recognition tasks. The first one augments the original data with additive noise and reverberation. For additive noise, the music, noise and speech parts of the MUSAN dataset [9] are used. For reverberation, the room impulse responses (RIRs) and the simulated RIRs described in Kaldi's [10] VoxCeleb recipe are used. The second one is based on warping the signal in the time domain: we randomly change the tempo of the audio signal while keeping the pitch and spectral envelope of the signal unchanged. The tempo effect of the SoX tool is used to achieve such speech rate perturbation.
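To make the feature extraction concrete, the following is a minimal sketch of frame-level PPG extraction, assuming a trained SI-ASR acoustic model is available as a PyTorch module that maps FBANK frames to per-frame logits over phonetic classes. The module interface and tensor shapes are illustrative assumptions, not the exact implementation; the phonetic class inventory of the ASR model determines the PPG dimension.

import torch
import torch.nn.functional as F

def extract_ppg(asr_model: torch.nn.Module, fbank: torch.Tensor) -> torch.Tensor:
    """Convert one utterance's FBANK features into a phone posteriorgram.

    fbank: (num_frames, feat_dim) acoustic features.
    Returns a (num_frames, num_phonetic_classes) matrix of posterior
    probabilities, i.e. the time-versus-class PPG of Section 2.1.
    """
    asr_model.eval()
    with torch.no_grad():
        # Hypothetical forward pass: the SI-ASR acoustic model emits
        # per-frame logits over its phonetic output classes.
        logits = asr_model(fbank.unsqueeze(0))       # (1, T, num_classes)
        ppg = F.softmax(logits, dim=-1).squeeze(0)   # (T, num_classes)
    return ppg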
2.2. TTS based data augmentation

Since only 160 hours of accent data are provided, the data are too limited to train an accurate and robust accent identification model. Therefore, we develop a novel TTS based data augmentation approach that is specially designed for synthesizing accented data.

Using generated high-quality artificial speech as augmented data has been successfully exploited in ASR systems [11]. Recent advances in speech synthesis (text-to-speech, TTS) allow unsupervised modeling of prosody and speaker variations, which makes it possible to synthesize the same texts with diverse speaking styles. Moreover, the development of TTS models has made synthesized speech nearly indistinguishable from human speech. In this work, we choose FastSpeech [12] as our synthesizer and LPCNet [13] as the vocoder.

FastSpeech is a transformer-based model which generates the entire sequence in a non-autoregressive manner. Our implementation of FastSpeech is based on the ESPnet toolkit [14], with the default settings as in [12]. Instead of applying the knowledge distillation process, we train the FastSpeech model from scratch using only the extracted features. The synthesizer converts input text to a spectrogram. The size of the input vocabulary is 41, including English phonemes, a pause break token, and a sentence boundary token. Additionally, we augment the decoder with a five-layer post-net [15], which slightly enhances model performance. LPCNet is a variant of WaveRNN that combines linear prediction with recurrent neural networks, greatly improving synthesis efficiency [13]. Our implementation is based on [13].

First, we train a TDNN x-vector speaker model [16] with the pooled data consisting of the accent training data and the auxiliary Librispeech data. The x-vectors and the phoneme representations extracted from the pooled data are then used to train the FastSpeech synthesizer. In order to better capture the characteristics of each accent, we create 8 accent-specific synthesizers by finetuning the FastSpeech model on each 20-hour accent subset respectively. Finally, on the clean subset of the overall training data, we train two LPCNet vocoders for male and female speakers respectively.

For each speaker in the accent training data, 30 utterances are grouped to calculate speaker specific statistics. We use these statistics with randomly selected reference texts to synthesize data with the previously trained accent-specific synthesizers. With this process, the generated speech preserves the speaker's speaking style while adopting new accents.
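As a rough illustration of how the accent-specific synthesis above could be orchestrated, the sketch below strings the pieces together. The objects accent_synthesizer, vocoder and extract_xvector are placeholder interfaces standing in for the ESPnet FastSpeech model, the LPCNet vocoder and the TDNN x-vector extractor; using the mean x-vector as the speaker statistic is an assumption, since the paper does not specify the exact statistic.

import random
import numpy as np

def synthesize_accented_data(speaker_utts, reference_texts,
                             accent_synthesizer, vocoder,
                             extract_xvector, num_outputs=10):
    """Synthesize speech that keeps one speaker's style but a target accent.

    speaker_utts:       list of waveforms for one speaker (30 in the paper).
    reference_texts:    pool of texts to draw from.
    accent_synthesizer: accent-specific FastSpeech model (placeholder object).
    vocoder:            LPCNet vocoder matching the speaker's gender.
    extract_xvector:    callable mapping a waveform to an x-vector.
    """
    # Speaker-specific statistic: here simply the mean x-vector over the
    # speaker's utterances (an assumed choice of statistic).
    xvectors = np.stack([extract_xvector(w) for w in speaker_utts])
    speaker_embedding = xvectors.mean(axis=0)

    synthesized = []
    for text in random.sample(reference_texts, num_outputs):
        # FastSpeech: text + speaker embedding -> mel spectrogram
        mel = accent_synthesizer.synthesize(text, speaker_embedding)
        # LPCNet: mel spectrogram -> waveform
        wav = vocoder.vocode(mel)
        synthesized.append((text, wav))
    return synthesized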
2.3. Hierarchical multi-embedding joint model

In our final submission, we use a hierarchical multi-embedding joint model to predict the accent label, which is shown in Figure 2. The structure of the TDNN sub-model is the same as the one in [16], but with 2x the size, as we find this gives higher accuracy on the development data. The RES2SETDNN sub-model is developed in a way similar to [17], by introducing the Res2Net [18] type convolution and the squeeze-and-excitation [19] module into the original TDNN structure. However, we do not include the residual connection in our model, as it brings no improvement in performance.

We train the joint model in a progressive way as follows. First, we pretrain each accent identification model independently using an additive angular margin softmax loss [20, 21]. Then, for each model the embedding extraction part is fixed. Finally, we train a linear regression classifier based on the concatenated embeddings extracted from each sub-model. A minimal code sketch of this fusion step is given at the end of this section.

Fig. 2. The hierarchical multi-embedding joint model includes the TDNN and RES2SETDNN sub-models and accepts two sets of features: the 40-dimensional PPG features, and the 40-dimensional PPG features with their first and second order differences. FC denotes the fully connected linear layer.

2.4. Test time augmentation

Test time augmentation is a common trick in image classification tasks to improve test accuracy [22, 23]. Instead of predicting the label of the test image itself, the model takes multiple augmented versions of the test image as input, and the prediction results are then aggregated to give the final result. In our work, similar test time augmentation for speech data is adopted: we use the tempo effect of the SoX tool to generate perturbed versions of each test utterance and aggregate the predictions over them.

Table 1. Identification accuracy (%) of two model architectures on the development set: TDNN and RES2SETDNN. For each architecture, two kinds of feature (FBANK and PPG) and the corresponding data augmentation strategies are listed. Here "CONV Aug" means conventional data augmentation, "TTS Aug" means TTS based data augmentation, "TTA" means test-time data augmentation, and "Delta" means we use the original PPG feature and its first and second differences for training. Columns: ID, Model, Configuration, and Accuracy (%) for US, UK, CHN, IND, JPN, KR, PT, RU and Avg. [The numeric body of Table 1 is not preserved in this version.]
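The following is a minimal PyTorch sketch of the embedding fusion step of Section 2.3, assuming the TDNN and RES2SETDNN sub-models have already been pretrained with the additive angular margin softmax loss. The sub-model interface (an extract_embedding method) and the embedding dimensions are placeholders rather than the actual implementation.

import torch
import torch.nn as nn

class JointAccentClassifier(nn.Module):
    """Concatenate frozen embeddings from several pretrained sub-models
    and classify the accent with a single fully connected layer."""

    def __init__(self, sub_models, embed_dims, num_accents=8):
        super().__init__()
        self.sub_models = nn.ModuleList(sub_models)
        # Freeze the embedding extraction parts of the pretrained models.
        for m in self.sub_models:
            for p in m.parameters():
                p.requires_grad = False
        self.fc = nn.Linear(sum(embed_dims), num_accents)

    def forward(self, features_per_model):
        # features_per_model[i] is the input expected by sub-model i, e.g.
        # 40-dim PPGs for one model and PPGs plus deltas for another.
        with torch.no_grad():
            embeddings = [m.extract_embedding(x)
                          for m, x in zip(self.sub_models, features_per_model)]
        return self.fc(torch.cat(embeddings, dim=-1))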
3. EXPERIMENTS
A detailed comparison of our systems is presented in this section. Kaldi is used for FBANK feature extraction, and PyTorch [24] is utilized to train the neural network models and extract the PPG features. All experiment results on the development set are shown in Table 1, where the 8 English accents are denoted by their corresponding country codes: US (the USA), UK (the United Kingdom), CHN (China), IND (India), JPN (Japan), KR (Korea), PT (Portugal) and RU (Russia).

Our baseline systems are trained on the original 160-hour accent training data, with 40-dimensional FBANK features. In Table 1, the baseline systems for TDNN and RES2SETDNN are denoted by ID 1 and 9 respectively.
We then apply the conventional data augmentation to the accent training data. First, the speech rate of each utterance in the training set is randomly changed to 0.8x, 0.9x, 1.1x or 1.2x, which increases the amount of data to 320 hours, as illustrated in the sketch below. Then, by augmentation with additive noise and reverberation, we extend the training data to 1600 hours. The TDNN and RES2SETDNN models trained on the extended training data are shown in Table 1 as ID 2 and ID 10 respectively. Compared with the baseline systems, the conventional data augmentation gives absolute improvements of 8.50% and 11.41% in average accuracy respectively, and the improvement is consistent across all 8 accents.

In addition to the conventional data augmentation, we further apply the TTS based data augmentation approach described in Section 2.2 on the 1600-hour augmented data to generate 4800 hours of synthesized data. To check the correlation between the synthesized speech and its corresponding accent, we test the synthesized data using systems ID 2 and 10 in Table 1, which are trained on the conventionally augmented data. From Table 2, we can see that the accent identification accuracy on the synthesized data is comparable to that on the development data.
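For concreteness, a minimal sketch of the speed perturbation step using the tempo effect of SoX is shown below; the file naming and directory layout are illustrative rather than our actual data preparation scripts, and the sox command-line tool must be installed.

import subprocess
from pathlib import Path

# Tempo factors used for speech rate perturbation in our experiments.
TEMPO_FACTORS = [0.8, 0.9, 1.1, 1.2]

def perturb_speed(wav_in: str, out_dir: str):
    """Create speed-perturbed copies of one utterance with SoX.

    The SoX 'tempo' effect changes the speaking rate while leaving the
    pitch and spectral envelope unchanged, as described in Section 2.1.
    """
    out_paths = []
    for factor in TEMPO_FACTORS:
        out_path = Path(out_dir) / f"{Path(wav_in).stem}_tempo{factor}.wav"
        subprocess.run(["sox", wav_in, str(out_path), "tempo", str(factor)],
                       check=True)
        out_paths.append(out_path)
    return out_paths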
Table 2. Accuracy (%) on synthesized data

Accent   System ID 2   System ID 10
US          59.64          60.50
UK          90.73          86.04
CHN         70.93          64.08
IND         67.13          70.70
JPN         63.42          53.73
KR          67.48          66.91
PT          86.44          81.77
RU          46.27          45.04
Avg.        69.08          66.16

We train systems ID 3 and 11 with the combined 6400-hour data. As shown in Table 1, the average accuracy is largely improved again, and the improvement across the 8 accents is consistent. This verifies the effectiveness of our proposed TTS based data augmentation.

Table 3. Final submission results on the accent identification evaluation set for the top 4 teams in the rank list
Columns: Ranking, Team, and Accuracy (%) for US, UK, CHN, IND, JPN, KR, PT, RU and Avg. [The numeric body of Table 3 is not preserved in this version.]
In this section, we compare systems trained with PPG features to those trained with FBANK features. The SI-ASR model for PPG feature extraction is prepared with the same settings as in [25]. Following the same training data augmentation schemes as for the FBANK based systems, we train the TDNN and RES2SETDNN models on the original data and on the two sets of augmented data respectively. The systems trained on PPG features are denoted by ID 4, 5, 6 and 12, 13, 14 in Table 1.

According to the average accuracies reported in Table 1, system ID 4 (or 12) beats system ID 1 (or 9) by a large margin, even though they are trained on the original data only. This suggests that, owing to their speaker independent property, PPG features make the discrimination of accents much easier. In addition, comparing systems ID 5 and 6 (or 13 and 14) to ID 4 (or 12), we find that system performance improves further when data augmentation is applied.

In this work, the SI-ASR system used as the extractor for PPG features is trained on all available data (accented + Librispeech). When only the accented data are used to train the SI-ASR system, the PPG features achieve about 3% improvement in average accuracy over the FBANK features, which is smaller than when using all data.
Table 1 shows that, with the test-time data augmentation described in Section 2.4, the performance gets a slight lift from system ID 6 to 7 (or ID 14 to 15).
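A minimal sketch of the test time augmentation at prediction time follows, assuming tempo-perturbed copies of the test utterance have already been generated (for example with the SoX sketch above). Averaging the per-accent probabilities is an assumed aggregation rule; the paper only states that the predictions on the augmented versions are aggregated.

import numpy as np

def predict_with_tta(classifier, wav_paths, extract_features):
    """Aggregate predictions over several versions of one test utterance.

    wav_paths:        the original utterance plus its tempo-perturbed copies.
    classifier:       callable mapping features -> per-accent probability vector.
    extract_features: callable mapping a wav path -> model input features.
    Returns the index of the accent with the highest averaged probability.
    """
    probs = np.stack([classifier(extract_features(p)) for p in wav_paths])
    return int(np.argmax(probs.mean(axis=0)))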
We also train systems with 120-dimensional delta PPG features (the original feature and its first and second order differences) on the combined 6400 hours of training data. We find that the performance drops slightly on the TDNN model (systems ID 7 and 8), but rises a little on the RES2SETDNN model (systems ID 15 and 16).
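For reference, the sketch below shows one way the 120-dimensional delta PPG features could be assembled from the 40-dimensional PPGs. The window-based regression formula follows the common HTK/Kaldi convention, which is an assumption since the paper does not spell out its delta computation.

import numpy as np

def compute_deltas(feats: np.ndarray, window: int = 2) -> np.ndarray:
    """Standard regression-based delta features.

    feats: (num_frames, dim) array.
    Returns an array of the same shape containing the first-order deltas.
    """
    T = feats.shape[0]
    padded = np.pad(feats, ((window, window), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, window + 1))
    delta = np.zeros_like(feats)
    for n in range(1, window + 1):
        delta += n * (padded[window + n:window + n + T]
                      - padded[window - n:window - n + T])
    return delta / denom

def ppg_with_deltas(ppg: np.ndarray) -> np.ndarray:
    """Stack PPGs with their first and second order differences:
    a 40-dim PPG becomes a 120-dim feature, as used by the 'Delta' systems."""
    d1 = compute_deltas(ppg)
    d2 = compute_deltas(d1)
    return np.concatenate([ppg, d1, d2], axis=1)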
In our final system, following the design in Section 2.3, we first initialize the embedding extraction part with the four systems (ID 7, 8, 15 and 16) in Table 1 and fix their parameters. Then we train the fully connected layer for accent classification. The final prediction is given by the 8-way hierarchical multi-embedding joint classification model on the test-time augmented wave files. As shown in Table 1, our final system (ID 18) significantly outperforms the challenge baseline system (ID 17), reaching a remarkable 15.03% accuracy improvement on the development set.

On the challenge evaluation data, Table 3 shows the rankings of the leading submissions for this challenge. Our system ranks first in the challenge with an average accuracy of 83.63% on the evaluation set, ahead of the second team by 11.23%. Moreover, when comparing the system performance shown in Table 1, our final system achieves a much smaller performance gap between the development data and the evaluation data than the challenge baseline.

To better understand our system performance on each accent, we visualize the accent embeddings of the test data using t-SNE [26]. As shown in Figure 3, the cluster of the IND utterances is compact and far from the other clusters, while the cluster of the US utterances has many overlaps with the other clusters. This partially explains the differences in identification accuracy.

Fig. 3. Accent embeddings of the 8 accents in the test set. 200 accent embeddings per accent are chosen from the test set.
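A short sketch of how such a visualization can be produced with scikit-learn's t-SNE is given below; the sampling of 200 embeddings per accent matches Figure 3, while the plotting details and output file name are illustrative.

import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_accent_embeddings(embeddings: np.ndarray, labels: np.ndarray,
                           accents, per_accent: int = 200):
    """Project accent embeddings to 2-D with t-SNE and scatter-plot them.

    embeddings: (N, embed_dim) accent embeddings from the joint model.
    labels:     (N,) integer accent indices into `accents`.
    """
    # Sample up to `per_accent` embeddings per accent, as in Figure 3.
    keep = np.concatenate([np.where(labels == i)[0][:per_accent]
                           for i in range(len(accents))])
    points = TSNE(n_components=2).fit_transform(embeddings[keep])
    for i, name in enumerate(accents):
        mask = labels[keep] == i
        plt.scatter(points[mask, 0], points[mask, 1], s=4, label=name)
    plt.legend()
    plt.savefig("accent_tsne.png", dpi=200)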
4. CONCLUSIONS
In this paper, we describe our submitted system for the Interspeech-2020 Accented English Speech Recognition Challenge (AESRC). Several novel approaches are developed to improve the robustness of our accent identification system. To the best of our knowledge, this is the first time the phone posteriorgram feature has been introduced to accent classification, and it brings an improvement of more than 15% compared to the regular FBANK feature. To train a robust system from such limited data, we adopt TTS based data augmentation to synthesize additional accented training data, improving the system performance by 2.15% ∼ .

5. REFERENCES

[1] Marina Piat, Dominique Fohr, and Irina Illina, "Foreign accent identification based on prosodic parameters," in Ninth Annual Conference of the International Speech Communication Association, 2008.
[2] Kay Berkling, Marc A. Zissman, Julie Vonwiller, and Christopher Cleirigh, "Improving accent identification through knowledge of English syllable structure," in Fifth International Conference on Spoken Language Processing, 1998.
[3] Tao Chen, Chao Huang, Eric Chang, and Jingchan Wang, "Automatic accent identification using Gaussian mixture models," in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU'01). IEEE, 2001, pp. 343–346.
[4] Felix Weninger, Yang Sun, Junho Park, Daniel Willett, and Puming Zhan, "Deep learning based Mandarin accent identification for accent robust ASR," in INTERSPEECH, 2019, pp. 510–514.
[5] Yishan Jiao, Ming Tu, Visar Berisha, and Julie M. Liss, "Accent identification by combining deep neural networks and recurrent neural networks trained on long and short term features," in INTERSPEECH, 2016, pp. 2388–2392.
[6] Xian Shi, Fan Yu, Yizhou Lu, Qiangze Feng, Daliang Wang, Yanmin Qian, and Lei Xie, "The accented English speech recognition challenge 2020: open datasets, tracks, baselines, results and methods," 2020.
[7] Lifa Sun, Hao Wang, Shiyin Kang, Kun Li, and Helen M. Meng, "Personalized, cross-lingual TTS using phonetic posteriorgrams," in INTERSPEECH, 2016, pp. 322–326.
[8] Guanlong Zhao, Shaojin Ding, and Ricardo Gutierrez-Osuna, "Foreign accent conversion by synthesizing speech from phonetic posteriorgrams," in INTERSPEECH, 2019, pp. 2843–2847.
[9] David Snyder, Guoguo Chen, and Daniel Povey, "MUSAN: A music, speech, and noise corpus," arXiv preprint arXiv:1510.08484, 2015.
[10] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al., "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, 2011.
[11] Guangzhi Sun, Yu Zhang, Ron J. Weiss, Yuan Cao, Heiga Zen, Andrew Rosenberg, Bhuvana Ramabhadran, and Yonghui Wu, "Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and autoregressive prosody prior," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6699–6703.
[12] Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu, "FastSpeech: Fast, robust and controllable text to speech," in Advances in Neural Information Processing Systems, 2019, pp. 3171–3180.
[13] Jean-Marc Valin and Jan Skoglund, "LPCNet: Improving neural speech synthesis through linear prediction," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5891–5895.
[14] Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, et al., "ESPnet: End-to-end speech processing toolkit," arXiv preprint arXiv:1804.00015, 2018.
[15] Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4779–4783.
[16] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5329–5333.
[17] Brecht Desplanques, Jenthe Thienpondt, and Kris Demuynck, "ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification," arXiv preprint arXiv:2005.07143, 2020.
[18] Shanghua Gao, Ming-Ming Cheng, Kai Zhao, Xin-Yu Zhang, Ming-Hsuan Yang, and Philip H. S. Torr, "Res2Net: A new multi-scale backbone architecture," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[19] Jie Hu, Li Shen, and Gang Sun, "Squeeze-and-excitation networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141.
[20] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou, "ArcFace: Additive angular margin loss for deep face recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4690–4699.
[21] Xu Xiang, Shuai Wang, Houjun Huang, Yanmin Qian, and Kai Yu, "Margin matters: Towards more discriminative deep neural network embeddings for speaker recognition," IEEE, 2019, pp. 1652–1656.
[22] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[23] Karen Simonyan and Andrew Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[24] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al., "PyTorch: An imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems, 2019, pp. 8026–8037.
[25] Tian Tan, Yizhou Lu, Rao Ma, Sen Zhu, Jiaqi Guo, and Yanmin Qian, "AISpeech-SJTU ASR system for the accented English speech recognition challenge," IEEE, 2020.
[26] Laurens van der Maaten and Geoffrey E. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.