The Accented English Speech Recognition Challenge 2020: Open Datasets, Tracks, Baselines, Results and Methods
Xian Shi∗, Fan Yu∗, Yizhou Lu, Yuhao Liang, Qiangze Feng, Daliang Wang, Yanmin Qian, Lei Xie†

Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi'an, China
SpeechLab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, China
Datatang (Beijing) Technology Co., Ltd., Beijing, China

∗ The first two authors contributed equally to this work. † Lei Xie is the corresponding author.
ABSTRACT
The variety of accents has posed a big challenge to speech recognition. The Accented English Speech Recognition Challenge (AESRC2020) is designed to provide a common testbed and promote accent-related research. Two tracks are set in the challenge: English accent recognition (track 1) and accented English speech recognition (track 2). A set of 160 hours of accented English speech collected from 8 countries is released with labels as the training set. Another 20 hours of speech without labels is later released as the test set, including two unseen accents from another two countries, used to test model generalization ability in track 2. We also provide baseline systems for the participants. This paper first reviews the released dataset, track setups and baselines, and then summarizes the challenge results and the major techniques used in the submissions.
Index Terms — Accented speech recognition, accent recognition, acoustic modeling, end-to-end ASR
1. INTRODUCTION
Accent is one of the major variable factors in human speech and poses a great challenge to the robustness of automatic speech recognition (ASR) systems. English is one of the most widely spoken languages in the world, so varieties of English accents inevitably arise in different areas. The differences between accents are mainly reflected in three aspects of pronunciation: stress, tone and duration, which brings difficulties to ASR models. There has been much interest in accent recognition to distinguish different English accents [1, 2, 3, 4], and it is also valuable to improve the generalization capability of ASR models across varieties of English accents.

The Interspeech2020 Accented English Speech Recognition Challenge (AESRC) is specifically designed to provide a common testbed and a sizable dataset for both English accent recognition (set as track 1) and accented English speech recognition (set as track 2). A 180-hour speech dataset is opened to participants, which contains 10 types of English accents: Chinese, American, British, Korean, Japanese, Russian, Indian, Portuguese, Spanish and Canadian. The two tracks run on this dataset to compare the submissions fairly.

The rest of this paper is organized as follows. Section 2 summarizes related work on accent recognition, robust speech recognition for accented speech and the datasets currently available for related research. Section 3 describes the dataset released by the challenge. In Section 4, the baseline experiments are introduced. Section 5 summarizes the outcome of the challenge, specifically discussing the major techniques used in the submitted systems. Section 6 concludes the challenge with important take-home messages.
2. RELATED WORK
Accent recognition is similar to language identification [5, 6, 7] and speaker identification [8, 9, 10, 11]: they all map variable-length speech sequences to utterance-level posteriors to obtain an accent, speaker or language ID. In order to distinguish different accents in English, Teixeira et al. [12] proposed to use context-dependent HMM units to optimize parallel networks, and Deshpande et al. [13] introduced formant frequency features into GMM models. Ahmed et al. [14] presented VFNet (Variable Filter Net), a convolutional neural network (CNN) based architecture which applies filters of variable size along the frequency band to capture a hierarchy of features, aiming at improving the accuracy of accent recognition in dialogues. Winata et al. [15] proposed an accent-agnostic approach that extends the model-agnostic meta-learning (MAML) algorithm for fast adaptation to unseen accents. Transfer learning and multi-task learning were also found useful for spoken accent recognition tasks [16, 17].

As for speech recognition of accented speech, adaptation methods and adversarial training techniques have proved effective. Assuming that labelled data for a specific accent is limited, adaptation methods first train a base model on standard speech data, which is usually available in large volume, and then adapt the model to the specific accent with the respective data [18, 19, 20, 21]. Domain adversarial training (DAT) was used by Sun et al. to obtain accent-independent hidden representations in order to achieve a high-performance ASR system for accented Chinese [22]. A generative adversarial network (GAN) based pre-training framework named AIPNet was proposed by Chen et al. [23]: they pre-trained an attention-based encoder-decoder model to disentangle accent-invariant and accent-specific characteristics from acoustic features by adversarial training. Accent-dependent acoustic modeling approaches inject accent-related information into the network architecture via accent embeddings, accent-specific bottleneck features or i-vectors [24, 25]. In a closed set of known accents, accent-dependent models usually outperform accent-independent universal models, while the latter usually yield a better average model when accent labels are unavailable.

English accent recognition and accented English speech recognition are also hindered by data insufficiency. Existing open-source accented English datasets are limited in data volume and accent variety [26, 27]. This motivates us to provide a sizable dataset and a common testbed to advance research in the related areas.
[Table 1. Results of baseline systems on the separated CV set. Track 1 (English accent recognition): accuracy (%), total and per accent (RU, KR, US, PT, JPN, UK, CHN, IND), for the self-attention classification network with Transformer-3L, Transformer-6L, Transformer-12L and ASR-init Transformer-12L encoders. Track 2 (accented English speech recognition): WER (%), average and per accent, for the chain model and the Transformer under the training strategies Accent160, Accent160 (Libri960 base) and Accent160+Libri160, decoded with the 3-gram LM (chain model) or with +0.3 RNNLM and +0.3 RNNLM + CTC fusion (Transformer).]
3. OPEN DATASET
An accented English speech dataset was released to participants in the challenge. It was collected from native speakers in the UK and US as well as English speakers in China, Japan, Russia, India, Portugal, Korea, Spain and Canada. We suppose that the data collected in each country belongs to one type of English accent, so in total we have 10 'accents'. The speakers, aged between 20 and 60, were asked to read sentences covering common conversation and human-computer speech interaction commands. All the speech recordings were collected in relatively quiet indoor environments with Android phones or iPhones. The training set for both challenge tracks (introduced in Section 4), named Accent160, contains 160 hours of speech covering 8 accents (20 hours per accent); the Spanish and Canadian accents are not included in the training set. The test set for track 1 includes 16 hours of data (2 hours per accent), and the test set for track 2 has 20 hours of data, including an extra 4 hours from the Spanish and Canadian accents (as unseen accent data). Speech recordings were provided in Microsoft WAV format (16 kHz, 16 bit, mono) with manual transcriptions.

The training speech data and the corresponding transcriptions were first released to participants together with metadata giving detailed information about the speakers and the recording environment, including speaker gender, age, region, recording device and others. In order to enable a fair comparison with the provided baseline experiments, we also released a speaker list for participants to split the CV set from the training set. The test set was released to the participants later, with audio recordings only.
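For concreteness, below is a minimal sketch of carving the CV set out of Accent160 using the released speaker list. The file names and metadata columns (cv_speakers.txt, metadata.csv, speaker_id, wav_path) are hypothetical stand-ins; the exact formats of the released artifacts are not specified here.

```python
import csv

def split_train_cv(metadata_path: str, cv_speaker_path: str):
    """Split utterances into train/CV lists by speaker membership."""
    with open(cv_speaker_path) as f:
        cv_speakers = {line.strip() for line in f if line.strip()}

    train, cv = [], []
    with open(metadata_path, newline="") as f:
        for row in csv.DictReader(f):  # one row per utterance
            target = cv if row["speaker_id"] in cv_speakers else train
            target.append(row["wav_path"])
    return train, cv
```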
4. TRACKS AND BASELINES

4.1. Track 1: English Accent Recognition
Track 1 aims to study the English accent recognition problem. The rules are as follows. 1) Data used for accent classification training is limited to the 160 hours of accented English data and the 960 hours of Librispeech [28] data; data augmentation based on the above data is permitted. 2) Multi-system fusion techniques, including recognizer output voting error reduction (ROVER) [29], are prohibited; there is no other restriction on the model and training techniques. 3) The final ranking is based on the recognition accuracy on the whole test set; the accuracy for each accent is for reference only.

For the baseline experiments, a self-attention (SA) based classification network is implemented using ESPnet (https://github.com/espnet/espnet). A mean + std pooling layer is applied after the encoder to pool its output over the time dimension. Transformer-3L, Transformer-6L and Transformer-12L differ in the number of encoder layers. All of them are trained with the simple CE loss for 40 epochs, and SpecAugment is applied to the input features. As shown in Table 1, with the released data, SA encoders with 6 and 12 layers overfit. From the results on the CV set, we found that the accuracy on some accents varies a lot among different speakers; as there are only a few speakers in the CV set, the absolute values above are not statistically significant. However, it is worth noting that when we use an SA encoder trained on an ASR downstream task to initialize the encoder of the accent classification network, the accent recognition accuracy is significantly improved: the total accuracy of ASR-init Transformer-12L reaches 76.1% on the CV set. The code and configuration of the baseline can be found in our github (https://github.com/R1ckShi/AESRC2020).
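As a minimal sketch of the mean + std pooling described above, assuming a (batch, time, dim) encoder output; this is an illustration, not ESPnet's exact implementation:

```python
import torch

def mean_std_pooling(encoder_out: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Pool a (batch, T, dim) encoder output to a fixed (batch, 2*dim) vector
    by concatenating the per-dimension mean and standard deviation over T."""
    mean = encoder_out.mean(dim=1)
    std = (encoder_out.var(dim=1, unbiased=False) + eps).sqrt()
    return torch.cat([mean, std], dim=-1)  # fed to the classification head
```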
4.2. Track 2: Accented English Speech Recognition

Track 2 studies the robustness of ASR systems on accented English speech, with the word error rate on the whole test set as the evaluation metric. The test set includes accents beyond the training data in order to evaluate the generalization of the models. Again, data usage is limited to the released data and the Librispeech data. All kinds of system combination methods, including ROVER, are strictly prohibited. Language model training may only use the transcripts of the permitted speech training data, and data augmentation may only be applied to the permitted speech data.

We prepared ASR baseline systems for track 2 with both the Kaldi and ESPnet toolkits. Several training strategies and decoding-related parameters are compared, with results shown in Table 1. In all experiments, we use 71-dimensional mel-filterbank features as the input of the acoustic model, with a frame length of 25 ms and a 10 ms shift. In our baseline chain-model system, the acoustic model consists of a single convolutional layer with a kernel of 3 to down-sample the input speech features, 19 hidden layers of a 256-dimensional TDNN, and a 1280-dimensional fully connected layer with ReLU activation. A 3-gram language model trained on the transcripts of the 160 hours of speech is used when compiling the decoding graph.
[Table 2. Results and major techniques used in the top 8 submissions in track 1. For each team and network (S2: TDNN; E2: Transformer; Z2: Jasper + Transformer; F: Transformer; D2: ResNet + CTC; C: TDNN-F; V: TDNN-LSTM-Attention; H: Transformer; baseline: self-attention encoder), the table marks the data augmentation techniques used (noise augmentation, speed perturbation, volume augmentation, reverberation simulation, cutting & splicing, TTS, pitch shift, SpecAugment) and whether an ASR model is involved, and reports the total accuracy (%).]

As for the Transformer baseline, the ESPnet joint CTC/attention transformer with a 12-layer encoder and a 6-layer decoder is used; the dimensions of the attention (4 heads) and feed-forward layers are set to 256 and 2048 respectively. The whole network is trained for 50 epochs with warmup [30] for the first 25,000 iterations. We mainly try three training strategies: 1) using only Accent160; 2) using Accent160 together with another 160 hours of data selected from Librispeech (Libri160); 3) using the 960 hours of Librispeech data (Libri960) to train a base model and then fine-tuning it on Accent160. Furthermore, we optimize decoding with an RNN language model and the CTC posterior probability. The RNNLM is a 2-layer LSTM trained with ESPnet on the transcriptions of Accent160, and both the RNNLM and CTC scores are fused (weight 0.3 for each) with the beam search scores. From the baseline results, we find that the end-to-end models outperform the hybrid chain models given the limited training data, and that fine-tuning the Librispeech base model with accented English data achieves the best performance among the three training strategies. The whole recipe and the results of the baselines can be found in our github.
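A hedged sketch of how the 0.3-weighted RNNLM and CTC scores might be combined with the attention score for each partial hypothesis during beam search; the interpolation form follows the common joint CTC/attention scheme and is an assumption, not the baseline's exact code:

```python
CTC_WEIGHT = 0.3  # weight for the CTC prefix score, as in the baseline
LM_WEIGHT = 0.3   # weight for the RNNLM score, as in the baseline

def hypothesis_score(att_logp: float, ctc_logp: float, lm_logp: float) -> float:
    """Joint CTC/attention scoring of one partial hypothesis, with
    shallow LM fusion; all inputs are log-probabilities."""
    return (1.0 - CTC_WEIGHT) * att_logp + CTC_WEIGHT * ctc_logp + LM_WEIGHT * lm_logp
```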
5. CHALLENGE RESULTS AND ANALYSIS

5.1. Track 1: English Accent Recognition
In the end, 25 teams submitted results to track 1; the accuracy on the test set and the major techniques used by the top 8 teams are summarized in Table 2. The winner is team S2, which uses a TDNN-based classification network with phonetic posteriorgram (PPG) features as input and text-to-speech (TTS) to expand the training data [31]. The major techniques are summarized below.
Since the size of the released training data is relatively small, most teams put substantial work into data augmentation; noise augmentation and speed perturbation, for example, are generally used. Speed perturbation enhances the robustness of the model modestly, while volume augmentation and reverberation simulation help a little. Simulated room impulse responses (RIR) are convolved with the original speech to generate data with reverberation. Moreover, several teams use random cutting and splicing to expand the data: two pieces of audio with the same accent are randomly selected from the training set, each piece is cut into two splices, and the splices from the two pieces are recombined as new samples (see the sketch below). SpecAugment is also reported as very useful by many teams, and pitch shift is reported as an effective augmentation method by team F. In addition to the above tricks, it is worth noting that the winner, team S2, used the provided data to train a TTS system and synthesized a large number of training audio clips with the corresponding accents, which improved the accent recognition accuracy by an absolute 10%.
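A minimal numpy sketch of this cutting-and-splicing scheme; the cut-point range is an illustrative assumption, since the teams did not report exact values:

```python
import random
import numpy as np

def cut_and_splice(wav_a: np.ndarray, wav_b: np.ndarray):
    """Cut two same-accent waveforms at random points and cross-combine
    the halves into two new training samples for accent recognition."""
    cut_a = random.randint(int(0.3 * len(wav_a)), int(0.7 * len(wav_a)))
    cut_b = random.randint(int(0.3 * len(wav_b)), int(0.7 * len(wav_b)))
    new_1 = np.concatenate([wav_a[:cut_a], wav_b[cut_b:]])
    new_2 = np.concatenate([wav_b[:cut_b], wav_a[cut_a:]])
    return new_1, new_2  # both keep the accent label of the source pair
```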
As revealed in the baseline experiments, accent classifiers easily overfit to speakers, since the biggest acoustic difference lies in speaker characteristics rather than accent characteristics; this leads to large differences in accuracy between speakers with the same accent. It is therefore beneficial to use speaker-invariant input features, or to initialize the network with an encoder pre-trained on a speaker-invariant downstream task. Team S2 used PPG features generated by a Kaldi ASR system as model input. Team Z2 adopted a multi-task strategy with both accent recognition and phoneme classification. The second-place team E2 used the accent labels together with the transcripts to train a Transformer ASR model with accent classification ability at the same time; as reported, putting the accent tag at the beginning of the text outperforms tagging at the end (illustrated below). Besides the mainstream neural networks, team H used a combination of a neural network and a support vector machine (SVM): an embedding layer was applied before the fully-connected layer, and an SVM was used to classify the embedding vectors.
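A small sketch of the tagging scheme described for team E2; the token format (<US>, <IND>, ...) is an assumption:

```python
def tag_transcript(accent: str, transcript: str, tag_at_start: bool = True) -> str:
    """Insert the accent label as an extra token into the target text, so a
    single Transformer learns ASR and accent classification jointly."""
    tag = f"<{accent}>"
    return f"{tag} {transcript}" if tag_at_start else f"{transcript} {tag}"

# tag_transcript("US", "turn on the light") -> "<US> turn on the light"
# As reported by E2, start-tagging outperforms end-tagging.
```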
5.2. Track 2: Accented English Speech Recognition

Forty-two teams submitted results to track 2. Team Q2 obtained the lowest average WER of 4.06%. This was achieved by a CTC model with LAS rescoring, where the CTC model was initialized with a Wav2vec encoder trained in an unsupervised manner using the Fairseq toolkit [32]. This superior performance indicates that unsupervised training is promising for improving performance when labeled data is limited. The results and major techniques used in the top 10 systems are summarized in Table 3.
[Table 3. Results and major techniques used in the top 10 submissions in track 2. For each team and network (Q2: Wav2vec enc. + CTC + LAS; S2: CTC + LAS; E2: Transformer + CTC; A2: Conformer + CTC; T2: Transformer + CTC; F: Conformer + CTC; U2: Lite Transformer + CTC; M3: BLSTM-CNN-TDNNF hybrid; D: Conformer + CTC; M2: CNN-Multi-Stream-TDNNF-Attn; baseline: Transformer + CTC), the table marks the data augmentation (noise & reverb augmentation, speed perturbation, volume augmentation, SpecAugment), LM & decoding (2-pass decoding, NN LM, multi-level modeling) and AM & training (accent multi-task, unsupervised training, transfer learning) techniques used, and reports the average WER (%).]

Similar to track 1, various data augmentation tricks were widely adopted in the systems submitted to track 2, and Table 3 shows the tricks used in the top-performing systems. According to the system descriptions provided by the teams, relative WER reductions of 5% to 10% can be achieved by methods such as volume augmentation and speed perturbation. Noise and reverberation augmentation was tried by several teams but brought no obvious gain, probably because the acoustic and channel conditions of the test set and the training set are similar.
A variety of models were used in track 2, mainly including Transformer-based encoder-decoder models [33, 34, 35], CTC models with LAS [36] rescoring, and traditional hybrid models. Attention-based end-to-end models are able to take sentence-scale acoustic information into consideration, so they have a natural advantage over the traditional hybrid models. Unsupervised training has been drawing increasing attention [37, 38]. Team Q2 followed the work of wav2vec2.0 [37] and pre-trained a self-attention encoder with both a contrastive loss and a diversity loss, in order to obtain an encoder with waveform reconstruction capability and contextualized representations. The score of this letter-level model was combined with a word-piece level Transformer LM. The second-place team S2 adopted an encoder pre-trained with a frame-level CE loss [39] (labels generated by a Kaldi triphone system), resulting in a 5% WER reduction. Conformer and Lite Transformer were used by several teams, which implies room for improvement over the primitive Transformer, especially in enhancing local information modeling. Explicit accent-related optimizations were rarely used in the submitted systems, but team T2 trained the encoder with an accent recognition multi-task objective and obtained a 3% relative WER reduction.
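For reference, the wav2vec 2.0 pre-training objective that team Q2 followed combines a contrastive term and a codebook diversity term; in the notation of [37]:

```latex
\mathcal{L} = \mathcal{L}_m + \alpha \mathcal{L}_d, \qquad
\mathcal{L}_m = -\log
  \frac{\exp\!\big(\mathrm{sim}(\mathbf{c}_t, \mathbf{q}_t)/\kappa\big)}
       {\sum_{\tilde{\mathbf{q}} \in \mathbf{Q}_t} \exp\!\big(\mathrm{sim}(\mathbf{c}_t, \tilde{\mathbf{q}})/\kappa\big)}, \qquad
\mathcal{L}_d = \frac{1}{GV} \sum_{g=1}^{G} \sum_{v=1}^{V} \bar{p}_{g,v} \log \bar{p}_{g,v}
```

where c_t is the context-network output at time t, q_t the true quantized latent among the candidate set Q_t (the target plus distractors), sim(·,·) cosine similarity, κ a temperature, and p̄_{g,v} the batch-averaged usage of entry v in codebook g of G codebooks with V entries each.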
As for language modeling, NN language model rescoring clearly works well, bringing improvements ranging from 7% to 15% on the CV set according to the system descriptions. Two-pass decoding was used by the top 2 teams: it is effective to rescore the lattice generated by WFST decoding (11.7% WER reduction for team S2) or to fuse the primitive posterior probabilities with LAS during beam search (26.4% WER reduction for Q2). Team Q2 specifically compared statistical and NN language models: a well-trained Transformer LM can achieve a slightly lower perplexity, but the 4-gram model outperforms the Transformer LM in WER (3.96% vs. 4.01%). Team Q2 also tried combining two language models with different granularities of modeling units: a word-piece level Transformer LM applied in the decoding fusion led to a 2.4% relative WER reduction on the CV set (3.73% to 3.64%).
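As an illustration of combining LMs of different granularity in second-pass n-best rescoring, in the spirit of team Q2's word-level 4-gram plus word-piece Transformer LM; the weights and scorer interfaces below are hypothetical, not the team's reported configuration:

```python
def rescore_nbest(nbest, am_score, ngram_logp, wp_transformer_logp,
                  w_ngram=0.5, w_wp=0.3):
    """Pick the hypothesis maximizing a weighted sum of the first-pass
    acoustic/decoder score, a word-level n-gram LM score, and a
    word-piece Transformer LM score (all log-domain)."""
    def total(hyp):
        return (am_score(hyp)
                + w_ngram * ngram_logp(hyp)
                + w_wp * wp_transformer_logp(hyp))
    return max(nbest, key=total)
```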
6. SUMMARY
In this challenge, participants used the released 160 hours of training data to build accent recognition (track 1) and accented English speech recognition (track 2) systems. According to the results of track 1, it is necessary to address the overfitting problem, which means peeling speaker-related information off the encoder output; an encoder pre-trained on an ASR downstream task therefore works well, and phonetic posteriorgram (PPG) features as network input are also effective for accent recognition. In track 2, we saw a variety of networks, including end-to-end models and traditional hybrid systems. CNN-based unsupervised training with contrastive and diversity losses can enhance the waveform reconstruction and contextualized representation abilities of the encoder, leading to superior recognition performance. Several works build on the Transformer family, such as Conformer and Lite Transformer: the original Transformer has disadvantages in local information modeling, and strengthening such local information apparently leads to improved performance. As for decoding, CTC with LAS two-pass decoding performs well, combining the time sequence modeling capability of the CTC loss with the sentence-scale modeling ability of self-attention structures. A few accent-related modeling techniques were used in track 2, but most participants chose to train an average model for all accents. With limited training data, data augmentation tricks are essential for both tracks.

7. REFERENCES

[1] Hamid Behravan, Ville Hautamäki, Sabato Marco Siniscalchi, Tomi Kinnunen, and Chin-Hui Lee, "I-vector modeling of speech attributes for automatic foreign accent recognition,"
IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 1, pp. 29–41, 2015.

[2] Yusnita Ma, M. P. Paulraj, Sazali Yaacob, A. B. Shahriman, and Sathees Kumar Nataraj, "Speaker accent recognition through statistical descriptors of mel-bands spectral energy and neural network model," 2012.

[3] Maryam Najafian and Martin Russell, "Automatic accent identification as an analytical tool for accent robust automatic speech recognition," Speech Communication, vol. 122, pp. 44–55, 2020.

[4] Fadi Biadsy, Automatic dialect and accent recognition and its application to speech recognition, Ph.D. thesis, Columbia University, 2011.

[5] Pradeep Rangan, Sundeep Teki, and Hemant Misra, "Exploiting spectral augmentation for code-switched spoken language identification," 2020.

[6] Nur Endah Safitri, Amalia Zahra, and Mirna Adriani, "Spoken language identification with phonotactics methods on Minangkabau, Sundanese, and Javanese languages," Procedia Computer Science, vol. 81, pp. 182–187, 2016.

[7] Chithra Madhu, Anu George, and Leena Mary, "Automatic language identification for seven Indian languages using higher level features," in IEEE International Conference on Signal Processing, 2017.

[8] Koji Okabe, Takafumi Koshinaka, and Koichi Shinoda, "Attentive statistics pooling for deep speaker embedding," arXiv preprint arXiv:1803.10963, 2018.

[9] Suwon Shon, Hao Tang, and James Glass, "Frame-level speaker embeddings for text-independent speaker recognition and analysis of end-to-end model," in SLT 2018. IEEE, 2018, pp. 1007–1013.

[10] Weidi Xie, Arsha Nagrani, Joon Son Chung, and Andrew Zisserman, "Utterance-level aggregation for speaker recognition in the wild," in ICASSP 2019. IEEE, 2019, pp. 5791–5795.

[11] Arsha Nagrani, Joon Son Chung, Weidi Xie, and Andrew Zisserman, "VoxCeleb: Large-scale speaker verification in the wild," Computer Speech & Language, vol. 60, pp. 101027, 2020.

[12] Carlos Teixeira, Isabel Trancoso, and António Serralheiro, "Accent identification," in Proceedings of the Fourth International Conference on Spoken Language Processing, ICSLP '96. IEEE, 1996, vol. 3, pp. 1784–1787.

[13] Shamalee Deshpande, Sharat Chikkerur, and Venu Govindaraju, "Accent classification in speech," in Fourth IEEE Workshop on Automatic Identification Advanced Technologies (AutoID'05). IEEE, 2005, pp. 139–143.

[14] Asad Ahmed, Pratham Tangri, Anirban Panda, Dhruv Ramani, and Samarjit Karmakar, "VFNet: A convolutional architecture for accent classification," IEEE, 2019, pp. 1–4.

[15] Genta Indra Winata, Samuel Cahyawijaya, Zihan Liu, Zhaojiang Lin, Andrea Madotto, Peng Xu, and Pascale Fung, "Learning fast adaptation on cross-accented speech recognition," arXiv preprint arXiv:2003.01901, 2020.

[16] Zhong Meng, Hu Hu, Jinyu Li, Changliang Liu, Yan Huang, Yifan Gong, and Chin-Hui Lee, "L-vector: Neural label embedding for domain adaptation," in ICASSP 2020. IEEE, 2020, pp. 7389–7393.

[17] Thibault Viglino, Petr Motlicek, and Milos Cernak, "End-to-end accented speech recognition," in INTERSPEECH, 2019, pp. 2140–2144.

[18] Bo Li and Khe Chai Sim, "Comparison of discriminative input and output transformations for speaker adaptation in the hybrid NN/HMM systems," in Eleventh Annual Conference of the International Speech Communication Association, 2010.

[19] Han Zhu, Li Wang, Pengyuan Zhang, and Yonghong Yan, "Multi-accent adaptation based on gate mechanism," arXiv preprint arXiv:2011.02774, 2020.

[20] Hank Liao, "Speaker adaptation of context dependent deep neural networks," in ICASSP 2013. IEEE, 2013, pp. 7947–7951.

[21] Tian Tan, Yanmin Qian, Maofan Yin, Yimeng Zhuang, and Kai Yu, "Cluster adaptive training for deep neural network," in ICASSP 2015. IEEE, 2015, pp. 4325–4329.

[22] Sining Sun, Ching-Feng Yeh, Mei-Yuh Hwang, Mari Ostendorf, and Lei Xie, "Domain adversarial training for accented speech recognition," CoRR, vol. abs/1806.02786, 2018.

[23] Yi Chen Chen, Zhaojun Yang, Ching Feng Yeh, Mahaveer Jain, and Michael L. Seltzer, "AIPNet: Generative adversarial pre-training of accent-invariant networks for end-to-end speech recognition," in ICASSP 2020, 2020.

[24] Sanghyun Yoo, Inchul Song, and Yoshua Bengio, "A highly adaptive acoustic model for accurate multi-dialect speech recognition," in ICASSP 2019, 2019.

[25] M. Chen, Zhanlei Yang, Jizhong Liang, Yanpeng Li, and Wenju Liu, "Improving deep neural networks based multi-accent Mandarin speech recognition using i-vectors and accent-specific top layer," in INTERSPEECH, 2015.

[26] Isin Demirsahin, Oddur Kjartansson, Alexander Gutkin, and Clara Rivera, "Open-source multi-speaker corpora of the English accents in the British Isles," in Proceedings of the 12th Language Resources and Evaluation Conference (LREC), Marseille, France, May 2020, pp. 6532–6541, European Language Resources Association (ELRA).

[27] Magic Data Technology Co., Ltd., "OpenSLR-68: MAGICDATA Mandarin Chinese read speech corpus."

[28] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in ICASSP 2015. IEEE, 2015, pp. 5206–5210.

[29] Jonathan G. Fiscus, "A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER)," in IEEE Workshop on Automatic Speech Recognition and Understanding. IEEE, 1997, pp. 347–354.

[30] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.

[31] Houjun Huang, Xu Xiang, Yexin Yang, Rao Ma, and Yanmin Qian, "AISpeech-SJTU accent identification system for the Accented English Speech Recognition Challenge," in Proc. ICASSP 2021. IEEE, 2021 (to appear).

[32] Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli, "fairseq: A fast, extensible toolkit for sequence modeling," in Proceedings of NAACL-HLT 2019: Demonstrations, 2019.

[33] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.

[34] Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang, "Conformer: Convolution-augmented transformer for speech recognition," 2020.

[35] Zhanghao Wu, Zhijian Liu, Ji Lin, Yujun Lin, and Song Han, "Lite transformer with long-short range attention," 2020.

[36] William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in ICASSP 2016. IEEE, 2016, pp. 4960–4964.

[37] Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," 2020.

[38] Xingchen Song, Guangsen Wang, Zhiyong Wu, Yiheng Huang, Dan Su, Dong Yu, and Helen Meng, "Speech-XLNet: Unsupervised acoustic model pretraining for self-attention networks," 2020.

[39] Tian Tan, Yizhou Lu, Rao Ma, Sen Zhu, Jiaqi Guo, and Yanmin Qian, "AISpeech-SJTU ASR system for the Accented English Speech Recognition Challenge," in Proc. ICASSP 2021. IEEE, 2021 (to appear).