Bi-APC: Bidirectional Autoregressive Predictive Coding for Unsupervised Pre-training and Its Application to Children's ASR
Ruchao Fan, Amber Afshan and Abeer Alwan
Department of Electrical and Computer Engineering, University of California Los Angeles, USA
ABSTRACT
We present a bidirectional unsupervised model pre-training (UPT) method and apply it to children's automatic speech recognition (ASR). An obstacle to improving child ASR is the scarcity of child speech databases. A common approach to alleviate this problem is model pre-training using data from adult speech. Pre-training can be done using supervised (SPT) or unsupervised methods, depending on the availability of annotations. Typically, SPT performs better. In this paper, we focus on UPT to address situations in which pre-training data are unlabeled. Autoregressive predictive coding (APC), a UPT method, predicts frames from only one direction, limiting its use to uni-directional pre-training. Conventional bidirectional UPT methods, however, predict only a small portion of frames. To extend the benefits of APC to bi-directional pre-training, Bi-APC is proposed. We then use adaptation techniques to transfer knowledge learned from adult speech (using the Librispeech corpus) to child speech (OGI Kids corpus). LSTM-based hybrid systems are investigated. For the uni-LSTM structure, APC obtains WER improvements similar to those of SPT over the baseline. When applied to BLSTM, however, APC is not as competitive as SPT, but our proposed Bi-APC achieves improvements comparable to SPT.
Index Terms — Child ASR, Unsupervised pre-training, Autoregressive predictive coding, Hybrid BLSTM models
1. INTRODUCTION
One of the challenges faced in developing automated and individualized educational and assessment tools for children is the performance lag of child ASR compared to adult ASR [1]. Challenges arise, in part, from difficulties in acoustic and language modeling of child speech. Due to children's different growth patterns and motor control issues, children's speech has a higher degree of intra-speaker and inter-speaker acoustic variability [2]. Additionally, children's speech is characterized by significant mispronunciations and disfluencies [3]. Another challenge is the lack of publicly-available child speech databases. Interestingly, with enough training data, the performance of child ASR using CLDNN-based hybrid models was shown to be comparable to adult systems [4]. To alleviate the data scarcity problem, a data-efficient TDNN-F network for child ASR was proposed in [5].

Model pre-training with a data-sufficient task is another successful approach to address the data scarcity issue. When combined with fine-tuning, model pre-training can transfer the knowledge learned from one task to another [6]. Supervised pre-training (SPT) has been effectively applied to cross-lingual [7] and child ASR [8–10]. However, obtaining transcriptions is not always feasible. Recently, unsupervised representation learning was proposed for situations when transcriptions are not available. This approach can be used for (i) feature extraction and (ii) model initialization, the latter referred to as unsupervised pre-training (UPT). Common unsupervised techniques used as feature extractors include autoregressive predictive coding (APC) [11, 12] and contrastive predictive coding (CPC) [13]. APC predicts a future frame from previous ones to learn speech representations, while CPC considers samples randomly selected from the waveform, referred to as "negative samples". Most UPT methods apply BERT-style pre-training mechanisms, which reconstruct masked frames (frames set to zero at the input) from unmasked frames using bidirectional information [14–18]. However, UPT methods have not been used for child ASR.

Unlike APC, most UPT methods mask only a portion of the frames for prediction, limiting the pre-training model from learning a more comprehensive representation. APC, which is mostly used for feature extraction, is constrained to learning from only one direction, limiting its use in bi-directional sequential models. Bi-directional models provide better performance for ASR systems than their uni-directional counterparts [19]. To fully exploit the potential of APC for bidirectional models, we propose a novel bidirectional APC to use as a UPT method, and we refer to this technique as Bi-APC.

We evaluate supervised and unsupervised pre-training methods and investigate their ability to transfer knowledge learned from adult speech to child speech in the context of LSTM-based acoustic models. We also evaluate our proposed Bi-APC technique against conventional bidirectional pre-training methods such as MPC. The remainder of the paper is organized as follows. Section 2 presents the proposed Bi-APC technique along with SPT and APC. Section 3 describes the experimental setup, followed by results and a discussion in Section 4. Section 5 concludes the paper.

(This work was supported in part by a National Science Foundation (NSF) Grant.)
2. MODEL PRE-TRAINING METHODS
Model pre-training learns common knowledge from a data-sufficient task and then transfers the knowledge learned to a low-resource task. In this paper, we aim to transfer the knowledge learned from adult speech to child speech. We use the pre-training methods described in this section for adult model training. Long short-term memory (LSTM) based networks are chosen as acoustic models, which are then used to form a hybrid HMM-LSTM ASR system. Based on the training mechanism, we can summarize the pre-training methods into two categories: supervised and unsupervised.
Recently, supervised pre-training has been successfully used in child ASR [8] and is frequently referred to as transfer learning. Specifically, suppose the output of the LSTM is Y = {y_1, y_2, ..., y_T} and the corresponding frame-level label sequence obtained from forced alignment is Ŷ = {ŷ_1, ŷ_2, ..., ŷ_T}. Supervised training then aims to optimize the cross-entropy loss function:

\mathcal{L}_{CE} = -\sum_{t=1}^{T} \sum_{c=1}^{C} \hat{y}_{ct} \log(y_{ct})    (1)

where C is the number of output categories (HMM states). The parameters of the LSTM are then used as the initialization for child acoustic model training, except for the last feed-forward layer, because the adult and child models have different state spaces.

Different from supervised pre-training, unsupervised pre-training does not require speech labels. Most unsupervised pre-training methods use either prediction or masking and reconstruction, where the supervision is the speech signal itself. In the following, we first review APC for uni-LSTM pre-training and then show how we extend APC to bidirectional LSTM (BLSTM) pre-training.
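Before turning to APC, the following is a minimal PyTorch-style sketch of the supervised objective in Eq. (1) and of the parameter transfer described above. The class and variable names (HybridLSTM, output, etc.) are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class HybridLSTM(nn.Module):
    """Hypothetical hybrid acoustic model: LSTM stack + senone classifier."""
    def __init__(self, feat_dim=80, hidden=800, layers=4, num_states=5776):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=layers, batch_first=True)
        self.output = nn.Linear(hidden, num_states)   # last feed-forward layer

    def forward(self, x):                  # x: (batch, frames, feat_dim)
        h, _ = self.lstm(x)
        return self.output(h)              # (batch, frames, num_states)

# Eq. (1): frame-level cross-entropy against forced-alignment labels.
def ce_loss(logits, labels):               # logits: (B, T, C), labels: (B, T)
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), labels.reshape(-1), reduction="sum")

# Supervised pre-training on adult data (5776 states), then initialization of the
# child model (1360 states) with everything except the output layer, whose
# state space differs between the two models.
adult = HybridLSTM(num_states=5776)
# ... train `adult` on Librispeech features and alignments with ce_loss ...
child = HybridLSTM(num_states=1360)
transferable = {k: v for k, v in adult.state_dict().items()
                if not k.startswith("output.")}
child.load_state_dict(transferable, strict=False)
```

During fine-tuning, the transferred layers and the newly initialized output layer are all updated (see Section 4).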
APC utilizes the shifted input sequence as supervision and tries to predict the frame n steps ahead of the current frame using information from previous frames. As it is a regression-based prediction task, we consider the L1 distance. Suppose the input feature sequence is X = {x_1, x_2, ..., x_T}; the pre-training model is then trained with the following loss function:

\mathcal{L}_{APC} = \sum_{t=1}^{T-n} |x_{t+n} - y_t|    (2)

where n is a fixed value used as a hyper-parameter. A key difference from [11] in the usage of APC in this paper is that we utilize the pre-trained model for parameter initialization instead of feature extraction. The reason is that we do not expect APC trained with adult data and used as a feature extractor to result in improvements for child ASR, due to the large acoustic mismatch between adult and child speech. Nevertheless, the mechanism of APC can be used for LSTM pre-training from only one direction, and thus does not fully exploit information from both directions.

The mechanism of APC is well suited to uni-directional structures such as the uni-LSTM. However, BLSTMs usually provide better performance than uni-LSTMs as they learn from both directions. Therefore, we propose a bidirectional APC (Bi-APC), which extends APC to exploit its potential for BLSTM pre-training. The idea of Bi-APC is to add a reversed version of the APC prediction, in which we predict the frame n steps behind the current frame given all future frames.

Fig. 1. Illustration of Bi-APC pre-training for BLSTM. Red parts and blue parts are the forward-related and reverse-related parameters and computations, respectively. fh and rh indicate the hidden states of the forward and reversed calculations, respectively, at each layer.

Figure 1 shows how Bi-APC is used for BLSTM pre-training. To prevent an equivalent mapping in the network, the outputs of the BLSTM should not contain information about the corresponding supervisions. We therefore split the BLSTM into forward-related and reverse-related parts, shown in red and blue in Fig. 1, respectively, including the parameters (arrows) and outputs (rectangles) at each layer. When computing the outputs Y^{fwd} = {y_1^{fwd}, y_2^{fwd}, ..., y_T^{fwd}} in the forward direction, the values of the blue rectangles are set to zero to exclude information extracted from frames to the right, and the reverse-related parameters are not updated. The same strategy is used when computing the outputs Y^{rev} = {y_1^{rev}, y_2^{rev}, ..., y_T^{rev}} in the reversed direction. The parameters denoted by black arrows are not trained during pre-training since they would allow an illegal information exchange between the two directions. The green arrows are shared parameters that are not used in fine-tuning. The BLSTM is then pre-trained by optimizing the APC loss from both directions:

\mathcal{L}_{Bi\text{-}APC} = 0.5 \cdot \sum_{t=1}^{T-n} |x_{t+n} - y_t^{fwd}| + 0.5 \cdot \sum_{t=n+1}^{T} |x_{t-n} - y_t^{rev}|    (3)

where the task ratios are set to 0.5 as both directions have the same importance. Note that we can also train an APC with a uni-LSTM and only initialize the parameters of the red parts in Figure 1; we still denote this pre-training as APC in the experimental results.
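As a concrete reading of Eqs. (2) and (3), here is a short sketch of the two losses in PyTorch. The tensor shapes, the separate y_fwd/y_rev outputs, and the function names are assumptions for illustration; the per-layer zeroing and freezing of cross-direction parameters described above is only summarized in the comments.

```python
import torch

def apc_loss(x, y, n=2):
    """Eq. (2): y_t predicts x_{t+n}; x and y have shape (batch, T, feat_dim)."""
    return (x[:, n:, :] - y[:, :-n, :]).abs().sum()

def bi_apc_loss(x, y_fwd, y_rev, n=2):
    """Eq. (3): the forward stream predicts x_{t+n}, the reverse stream x_{t-n}.
    In Bi-APC, y_fwd is computed with the reverse-direction hidden states zeroed
    (and the reverse-related parameters frozen), and vice versa for y_rev, so
    that neither output can see its own supervision."""
    fwd = (x[:, n:, :] - y_fwd[:, :-n, :]).abs().sum()   # t = 1 .. T-n
    rev = (x[:, :-n, :] - y_rev[:, n:, :]).abs().sum()   # t = n+1 .. T
    return 0.5 * fwd + 0.5 * rev

# Toy usage with random tensors (batch of 4, 100 frames, 80-dim filterbanks).
x = torch.randn(4, 100, 80)
y_fwd, y_rev = torch.randn_like(x), torch.randn_like(x)
loss = bi_apc_loss(x, y_fwd, y_rev, n=2)
```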
3. EXPERIMENTAL SETUP
Experiments were conducted using Kaldi [20] and Pykaldi2 [21]. Pykaldi2 was used to train the neural networks for the hybrid systems, and Kaldi was used for WFST-based decoding.
For the pre-training task, Librispeech [22] was used because it is the largest publicly-available adult speech corpus and is mainly read speech. The test set of the Librispeech corpus is split into "clean" and "other" subsets based on the quality of the recorded utterances, where "other" refers to noisy data; both are used to evaluate the adult ASR system.

For the fine-tuning experiments, the scripted part of the OGI Kids' Speech Corpus [23] was used. It contains speech from approximately 100 speakers per grade saying single words, sentences, and digit strings. The utterances were randomly split into training and test sets without speaker overlap, where utterances from 30% of the speakers were chosen as the testing data, denoted as ogi-test. As a result, nearly 50 hours of child data were used to train the child ASR system.
The initial experiments used GMM model training. The Librispeech recipe in Kaldi was used for pre-training and the JHU OGI recipe [5] was applied for fine-tuning. The GMM models were then used to obtain the frame-level alignments for DNN-based acoustic model training. The numbers of HMM states were 5776 and 1360 for the adult and child models, respectively.

Uni-LSTM and BLSTM models were chosen as acoustic models to compare the pre-training methods. 80-dimensional Mel-filterbank features (which are common for UPT) were extracted from each 25ms window with a 10ms frame shift as the input. No frame stacking or skipping was applied. Hence, the output dimension for the unsupervised pre-training task is 80.
Table 1. WERs of baseline systems, including uni-LSTM and BLSTM models trained with Librispeech and OGI data, respectively.
WERs (%)       Libri-adult              Children
               test-clean   test-other  ogi-test
Adult Model - Librispeech
  uni-LSTM     5.71         15.15       65.90
  BLSTM        4.90         12.59       59.12
Child Model - OGI Corpus
  TDNN-F [5]   -            -           10.71
  uni-LSTM     95.77        97.28       12.58
  BLSTM        86.82        92.15       9.16
The uni-LSTM model consists of 4 uni-LSTM layers with 800 hidden units, while the BLSTM model has 4 BLSTM layers with 512 hidden units in each direction. Batch normalization and dropout layers with a 0.2 dropout rate were applied after each LSTM layer. The outputs of the LSTMs were then mapped into either the state space for classification or the feature space for prediction by a single feed-forward layer.

All models were trained with a multi-step schedule, where the learning rate was held constant for the first 2 epochs and then exponentially decayed to a ratio λ of the initial learning rate over the remaining epochs. For the pre-training tasks, 8 epochs were used. For the fine-tuning tasks, we trained the models for 15 epochs, with the learning rate decaying from 2e-4 to 2e-6. The last three model checkpoints were averaged as the final model for evaluation. For both APC and Bi-APC training, the time shift n was heuristically set to 2. Sequence discriminative training was not applied in our experiments since our goal is to compare different pre-training methods.

All experiments use the same lexicon and language models from the original Librispeech corpus. Specifically, the 14M tri-gram (tgsmall) language model was used for first-pass decoding, and the 725M tri-gram (tglarge) language model was used for rescoring. We report the results after rescoring.
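As a rough sketch of the acoustic-model configuration and learning-rate schedule described earlier in this section, the block below builds a BLSTM stack with per-layer batch normalization and dropout and computes the held-then-exponentially-decayed learning rate. The module names and the exact placement of normalization are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class BLSTMAcousticModel(nn.Module):
    """4 BLSTM layers, 512 units per direction, BatchNorm + dropout after each layer."""
    def __init__(self, feat_dim=80, hidden=512, layers=4, out_dim=1360, p_drop=0.2):
        super().__init__()
        self.lstms, self.norms = nn.ModuleList(), nn.ModuleList()
        in_dim = feat_dim
        for _ in range(layers):
            self.lstms.append(nn.LSTM(in_dim, hidden, bidirectional=True, batch_first=True))
            self.norms.append(nn.BatchNorm1d(2 * hidden))
            in_dim = 2 * hidden
        self.dropout = nn.Dropout(p_drop)
        # out_dim = number of HMM states for fine-tuning, or 80 for APC/Bi-APC pre-training
        self.output = nn.Linear(2 * hidden, out_dim)

    def forward(self, x):                                # x: (batch, frames, 80)
        for lstm, norm in zip(self.lstms, self.norms):
            x, _ = lstm(x)
            x = norm(x.transpose(1, 2)).transpose(1, 2)  # BatchNorm over the feature axis
            x = self.dropout(x)
        return self.output(x)

def lr_at_epoch(epoch, total_epochs=15, lr0=2e-4, lr_end=2e-6, hold=2):
    """Hold lr0 for the first `hold` epochs, then decay exponentially to lr_end."""
    if epoch < hold:
        return lr0
    frac = (epoch - hold) / max(1, total_epochs - 1 - hold)
    return lr0 * (lr_end / lr0) ** frac
```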
4. RESULTS AND DISCUSSION

4.1. Baseline
We first show the results of the baseline models in Table 1. Here we compared two models: (a) an adult model trained using Librispeech and (b) a child model trained using the OGI speech corpus. We evaluated these models on test-clean and test-other from Librispeech and also on ogi-test, and we compared uni-LSTM and BLSTM acoustic model architectures for both setups. For the adult model, we obtained performances similar to previously published results [22]. The adult models were also used to test on ogi-test, which has an acoustic domain mismatch, resulting in high WERs for the LSTM models. For the child models, the performance on Librispeech degrades drastically with both the uni-LSTM and BLSTM models. To compare with existing results in the literature, we evaluated the TDNN-F acoustic model trained with the OGI corpus [5]. We see that the uni-LSTM performed worse than the TDNN-F but the BLSTM outperformed it, thus motivating us to explore model pre-training for the BLSTM system.

Table 2. Performance comparison of supervised pre-training (SPT) and unsupervised pre-training (UPT) in terms of WER (%) for both the uni-LSTM and BLSTM acoustic model architectures. The results are for ogi-test. We also provide the word error rate reduction (WERR) compared to the baseline.

WERs (%)         uni-LSTM   WERR    BLSTM   WERR
Baseline         12.58      -       9.16    -
SPT              11.85      5.8%    8.46    7.6%
UPT: MPC [14]    -          -       9.02    1.5%
UPT: APC         11.76      6.5%    8.85    3.4%
UPT: Bi-APC      -          -       8.57    6.5%
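As a worked reading of the WERR column (the relative WER reduction with respect to the corresponding baseline), for the uni-LSTM with SPT:

WERR = (12.58 − 11.85) / 12.58 ≈ 5.8%.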
This paper aims at exploring the performance of supervised pre-training (SPT) and unsupervised pre-training (UPT) for children's ASR. As mentioned in Section 3.1, we used Librispeech for pre-training and OGI for fine-tuning the models. Table 2 presents the results of fine-tuning both the uni-LSTM and BLSTM architectures, evaluated on ogi-test. Note that, different from [8], all layers were updated during fine-tuning since this was the best setting in our experiments.

Table 2 shows that SPT improved the performance of the uni-LSTM model to 11.85%, better than the baseline without pre-training. Interestingly, unsupervised pre-training using APC also provides an improvement (11.76%) similar to that of SPT with the uni-LSTM model.

As mentioned earlier, the BLSTM performs better than the uni-LSTM. SPT resulted in the best performance (a WER of 8.46%) among the pre-training methods applied to the BLSTM. Note that, to perform UPT, we first used APC to pre-train only the forward-path parameters of the BLSTM, resulting in a WER of 8.85%. We then compared it to a widely used bidirectional pre-training method, masked predictive coding (MPC) [14], and show that MPC (9.02%) performed worse than APC (8.85%). We assume the reason is that MPC has fewer frames to predict (only 15% of the frames were randomly masked), although MPC can learn from both directions. The proposed Bi-APC achieved a WER of 8.57%, which is comparable to SPT. This can be valuable when there is a large amount of data without transcriptions. Since the pre-training task uses a 960-hour dataset, UPT could possibly benefit from more unlabeled data. Recent works have shown that self-attention layers are better for acoustic modeling than the BLSTM [24, 25]. It will be interesting to see how Bi-APC could be extended to other model topologies, which is an important issue for future research.

Table 3. BLSTM-based ASR performance breakdown based on age groups: kindergarten to grade 2 (K0-G2), grades 3-6 (G3-G6), and grades 7-10 (G7-G10).

WERs (%)    K0-G2    G3-G6    G7-G10
Baseline    18.87    7.24     5.51
+SPT        17.43    6.66     5.11
+APC        18.07    7.03     5.40
+Bi-APC     17.23    6.91     5.26
To obtain insight into the influence of a speaker's age on the performance of the pre-training methods, in Table 3 we present results based on age groups in the OGI dataset. Similar to [26], three age groups were selected: kindergarten to grade 2, grades 3-6, and grades 7-10. We present the results using the BLSTM model. For younger children (kindergarten to grade 2), Bi-APC provided slightly better results than SPT. In contrast, we did not observe any such improvement for the older age groups. This trend could mean that UPT may be capturing a representation crucial to the performance of very young children's speech, which is more variable and difficult to recognize than that of older children [27]. Further research is required to explore how to use the approach more effectively for children's ASR.
5. CONCLUSIONS
In this paper, we proposed a bidirectional pre-training (Bi-APC) method. We also compared supervised and unsupervised model pre-training methods for child ASR. We showed that standard APC can be applied to uni-LSTM pre-training, achieving about a 6.5% relative WER improvement over the uni-LSTM baseline without pre-training. However, APC lost its superiority when applied to the BLSTM structure and had a performance gap with SPT. Our proposed Bi-APC addressed these issues and resulted in performance comparable to SPT. We further analyzed the performance on child speech for different age groups. The results showed the potential of unsupervised pre-training for younger children's speech, and we achieved the best-reported ASR result (a WER of 8.46%, with SPT) for the OGI Kids corpus. The proposed Bi-APC achieved a WER of 8.57%, performing better than other UPT methods such as APC and MPC.

6. REFERENCES

[1] James Kennedy, Séverin Lemaignan, Caroline Montassier, Pauline Lavalade, Bahar Irfan, Fotios Papadopoulos, Emmanuel Senft, and Tony Belpaeme, "Child speech recognition in human-robot interaction: evaluations and recommendations," in ACM/IEEE International Conference on Human-Robot Interaction, 2017, pp. 82–90.

[2] Sungbok Lee, Alexandros Potamianos, and Shrikanth Narayanan, "Acoustics of children's speech: Developmental changes of temporal and spectral parameters," JASA, vol. 105, no. 3, pp. 1455–1468, 1999.

[3] Prashanth Gurunath Shivakumar, Alexandros Potamianos, Sungbok Lee, and Shrikanth Narayanan, "Improving speech recognition for children using acoustic adaptation and pronunciation modeling," in WOCCI, 2014, pp. 15–19.

[4] Hank Liao, Golan Pundak, Olivier Siohan, Melissa K. Carroll, Noah Coccaro, Qi-Ming Jiang, Tara N. Sainath, Andrew Senior, Françoise Beaufays, and Michiel Bacchiani, "Large vocabulary automatic speech recognition for children," in Interspeech, 2015.

[5] Fei Wu, Leibny Paola García-Perera, Daniel Povey, and Sanjeev Khudanpur, "Advances in automatic speech recognition for child speech using factored time delay neural network," in Interspeech, 2019, pp. 1–5.

[6] Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon, "Unified language model pre-training for natural language understanding and generation," in Advances in Neural Information Processing Systems, 2019, pp. 13063–13075.

[7] Jui-Ting Huang, Jinyu Li, Dong Yu, Li Deng, and Yifan Gong, "Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers," in ICASSP, 2013, pp. 7304–7308.

[8] Prashanth Gurunath Shivakumar and Panayiotis Georgiou, "Transfer learning from adult to children for speech recognition: Evaluation, analysis and recommendations," Computer Speech & Language, vol. 63, pp. 101077, 2020.

[9] Robert Gale, Liu Chen, Jill Dolata, Jan Van Santen, and Meysam Asgari, "Improving ASR systems for children with autism and language impairment using domain-focused DNN transfer techniques," in Interspeech, 2019, pp. 11–15.

[10] Rong Tong, Lei Wang, and Bin Ma, "Transfer learning for children's speech recognition," in IALP, 2017, pp. 36–39.

[11] Yu-An Chung, Wei-Ning Hsu, Hao Tang, and James Glass, "An unsupervised autoregressive model for speech representation learning," in Interspeech, 2019, pp. 146–150.

[12] Vijay Ravi, Ruchao Fan, Amber Afshan, Huanhua Lu, and Abeer Alwan, "Exploring the use of an unsupervised autoregressive model as a shared encoder for text-dependent speaker verification," arXiv preprint arXiv:2008.03615, 2020.

[13] Aaron van den Oord, Yazhe Li, and Oriol Vinyals, "Representation learning with contrastive predictive coding," arXiv preprint arXiv:1807.03748, 2018.

[14] Dongwei Jiang, Xiaoning Lei, Wubo Li, Ne Luo, Yuxuan Hu, Wei Zou, and Xiangang Li, "Improving transformer-based speech recognition using unsupervised pre-training," arXiv preprint arXiv:1910.09932, 2019.

[15] Andy T. Liu, Shu-wen Yang, Po-Han Chi, Po-chun Hsu, and Hung-yi Lee, "Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders," in ICASSP, 2020, pp. 6419–6423.

[16] Xingchen Song, Guangsen Wang, Yiheng Huang, Zhiyong Wu, Dan Su, and Helen Meng, "Speech-XLNet: Unsupervised acoustic model pretraining for self-attention networks," in Interspeech, 2020, pp. 3765–3769.

[17] Alexei Baevski, Michael Auli, and Abdelrahman Mohamed, "Effectiveness of self-supervised pre-training for speech recognition," arXiv preprint arXiv:1911.03912, 2019.

[18] Weiran Wang, Qingming Tang, and Karen Livescu, "Unsupervised pre-training of bidirectional speech encoders via masked reconstruction," in ICASSP, 2020, pp. 6889–6893.

[19] Albert Zeyer, Patrick Doetsch, Paul Voigtlaender, Ralf Schlüter, and Hermann Ney, "A comprehensive study of deep bidirectional LSTM RNNs for acoustic modeling in speech recognition," in ICASSP, 2017, pp. 2462–2466.

[20] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al., "The Kaldi speech recognition toolkit," in ASRU, 2011.

[21] Liang Lu, Xiong Xiao, Zhuo Chen, and Yifan Gong, "PyKaldi2: Yet another speech toolkit based on Kaldi and PyTorch," arXiv preprint arXiv:1907.05955, 2019.

[22] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in ICASSP, 2015, pp. 5206–5210.

[23] Khaldoun Shobaki, John-Paul Hosom, and Ronald A. Cole, "The OGI Kids' speech corpus and recognizers," in ICSLP, 2000.

[24] Yongqiang Wang, Abdelrahman Mohamed, Duc Le, et al., "Transformer-based acoustic modeling for hybrid speech recognition," in ICASSP, 2020, pp. 6874–6878.

[25] Liang Lu, Changliang Liu, Jinyu Li, and Yifan Gong, "Exploring transformers for large-scale speech recognition," in Interspeech, 2020, pp. 5041–5045.

[26] Saeid Safavi, Maryam Najafian, Abualsoud Hanani, Martin J. Russell, Peter Jancovic, and Michael J. Carey, "Speaker recognition for children's speech," arXiv preprint arXiv:1609.07498, 2016.

[27] Gary Yeung and Abeer Alwan, "On the difficulties of automatic speech recognition for kindergarten-aged children," in Interspeech, 2018.