Senone-aware Adversarial Multi-task Training for Unsupervised Child to Adult Speech Adaptation
Richeng Duan, Nancy F. Chen
Institute for Infocomm Research, A*STAR, Singapore
ABSTRACT
Acoustic modeling for child speech is challenging due to the high acoustic variability caused by physiological differences in the vocal tract. The dearth of publicly available datasets makes the task more challenging. In this work, we propose a feature adaptation approach by exploiting adversarial multi-task training to minimize acoustic mismatch at the senone (tied triphone states) level between adult and child speech and leverage large amounts of transcribed adult speech. We validate the proposed method on three tasks: child speech recognition, child pronunciation assessment, and child fluency score prediction. Empirical results indicate that our proposed approach consistently outperforms competitive baselines, achieving 7.7% relative error reduction on speech recognition and up to 25.2% relative gains on the evaluation tasks.

Index Terms — child speech recognition, automatic speech evaluation, unsupervised feature adaptation
1. INTRODUCTION
Despite research efforts for more than two decades, it is still challenging to develop speech technology for children (e.g., recognizing child speech in voice-activated applications and automatic child speech evaluation) [1, 2, 3, 4, 5, 6, 7, 8]. Acoustic modeling, a key component in speech technology, is challenging when applied to child speech. The challenges stem from physiological differences in the size of the articulatory apparatus, pronunciation variations, and proficiency levels between children and adults [9, 10, 11, 12]. While large-scale linguistic resources and powerful computational models enable the development of superior acoustic models, such privileged scenarios are often unavailable when processing child speech [13]. To overcome the challenges of acoustic modeling for child speech and the scarcity of annotated linguistic resources, our previous work [14] proposed learning a feature adaptation model to transform child speech to the adult feature space without using transcribed child speech. Similar to recent work on learning domain-invariant feature representations [15, 16, 17, 18], [14] conducted adversarial training using binary domain labels (child and adult). However, such training strategies implicitly assume that the acoustic transformations needed for each phoneme are the same, while in practice the acoustic differences between adults and children vary across phonemes: the resonant frequencies of vocal tract models of voiceless fricatives are more similar across adults and children, while those of high vowels are more distinct [19], making it difficult for global transformations to achieve optimal results. Inspired by such linguistic insights, we propose a senone-aware adversarial training (SAT) strategy to adapt child speech to adult speech. The traditional binary child/adult domain labels are further elaborated and refined with multi-dimensional senone posterior labels to take advantage of phonemic differences across adults and children. Empirical validation on three tasks shows that our approach compares favorably with established baselines.
2. RELATED WORK

2.1. Acoustic modeling of children's speech
Training an acoustic model directly with thousands of hours of child speech is the most straightforward way to reach good performance [2]. However, this scenario is often unavailable. To improve child speech recognition and automatic spoken language assessment, past approaches include statistical machine learning techniques such as vocal tract length normalization (VTLN) [20] and maximum likelihood linear transform (MLLT) [12] to conduct speaker-independent feature and model adaptation to train better acoustic models. Feature-space maximum likelihood linear regression (FMLLR), also known as constrained maximum likelihood linear regression (CMLLR) [21], is a commonly adopted supervised adaptation method, where the feature transforms can be estimated at the speaker or utterance level using labeled data. With limited amounts of transcribed child speech, [13] improved acoustic models by freezing the lower layers of a pre-trained DNN while updating only the output layer. However, these methods usually require at least some quantity of human-transcribed child speech for supervised training or fine-tuning, while in reality such annotated resources are publicly unavailable to most academic researchers. In addition, leveraging transcribed data from new domains to sequentially retrain pre-trained DNN models usually suffers from catastrophic forgetting [22, 23]. Such challenges motivate us to investigate unsupervised adaptation approaches.

2.2. Adversarial learning

Inspired by generative adversarial networks (GAN), adversarial learning has been explored for learning domain-invariant models in areas such as image classification [15], robust speech recognition [16], speaker adaptation [17], and spoken keyword spotting [18]. Rather than training a domain-invariant acoustic model from scratch, [14] applied adversarial learning to explicitly estimate a global set of transformations that adapt child speech features for use in a high-performance adult acoustic model. Although it achieved favorable performance, the implicit assumption that all phonemes/senones should be transformed in the same fashion could impose practical constraints on achieving robust performance. In this work, we strategically exploit phonemic information to help anchor more targeted transformations at the senone level.
3. UNSUPERVISED FEATURE ADAPTATION VIA ADVERSARIAL MULTI-TASK LEARNING

3.1. Architecture overview
Fig. 1 shows that the acoustic model architecture for processing child speech consists of three sub-networks: an adult acoustic model, a child-to-adult adaptation model, and a domain discriminator. Prior to model training, the parameters of the adult acoustic model are copied from a pre-trained adult model, which has been trained on large amounts of transcribed adult speech. These copied parameters are fixed during model training. We attach the output layer of the feature adaptation model to the input layer of the adult acoustic model. The discriminator is connected to the output layer of the front-end adaptation network and is used to map the transformed features to domain labels. During inference, the domain discriminator component is removed; we only use the adult acoustic model and the child-to-adult adaptation model to process child speech.

Fig. 1. Child-to-adult acoustic feature adaptation framework.
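To make the wiring concrete, the following is a minimal PyTorch sketch of this three-component setup, assuming the 1,320-dimensional spliced input and the 6 x 2,048 back-end reported later in Section 4.2; the adapter width, the senone inventory size, and all identifiers are illustrative assumptions, not the authors' implementation. The domain discriminator is omitted here and sketched together with the loss functions below.

```python
import torch
import torch.nn as nn

class AdultAcousticModel(nn.Module):
    """Back-end senone classifier; weights are copied from a pre-trained
    adult model and kept frozen throughout adaptation training."""
    def __init__(self, input_dim=1320, hidden_dim=2048, num_layers=6,
                 num_senones=3000):  # senone count is an assumed placeholder
        super().__init__()
        layers, dim = [], input_dim
        for _ in range(num_layers):
            layers += [nn.Linear(dim, hidden_dim), nn.ReLU()]
            dim = hidden_dim
        self.body = nn.Sequential(*layers)
        self.out = nn.Linear(dim, num_senones)  # senone posteriors (logits)

    def forward(self, x):
        return self.out(self.body(x))

class ChildToAdultAdapter(nn.Module):
    """Front-end feature transform; its output feeds the adult model's
    input layer. The hidden width (1024) is an illustrative choice."""
    def __init__(self, dim=1320, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, dim))

    def forward(self, x):
        return self.net(x)

adult_am = AdultAcousticModel()
for p in adult_am.parameters():
    p.requires_grad = False          # copied parameters stay fixed

adapter = ChildToAdultAdapter()
child_feats = torch.randn(8, 1320)   # toy batch of spliced child features
senone_logits = adult_am(adapter(child_feats))  # inference path: no discriminator
```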
The idea of domain adversarial learning is to minimize the domain discriminator's prediction accuracy while training the generator to confuse the discriminator. In our previous work [14], during model training, the front-end feature adaptation network maximizes the loss of the child-adult binary domain classifier so that the transformed features are mapped closer to the adult speech feature space. To ensure the generated features can still support senone classification, we simultaneously minimized the acoustic modeling error via multi-task training, adding a senone classification cross-entropy loss.

However, such binary domain adversaries suffer from a limitation: the feature adaptor aligns the two domains of child and adult from a global perspective. A global transformation is sub-optimal because the acoustic variability between child and adult speech differs across phonemes. For example, there are smaller acoustic differences between child and adult speech for voiceless fricatives such as /f/ than for the post-alveolar approximant /r/: the cavity formed from the front of the lower lip to the upper-teeth constriction when producing /f/ is similar in adults and children, leading to similar resonant frequencies, while the resonant frequencies of /r/ differ between adults and children due to the size of the vibrating vocal folds, tongue position, and oral cavity size [19]. Therefore, applying the same transformation to /f/ and /r/ is not ideal, as /f/ requires minimal or no transformation while /r/ requires significant adaptation. Coarticulation effects could further complicate the acoustic variability across adults and children. It is therefore strategic to consider phonetic information such as senones (clustered triphones) when learning the adaptation transformations. Inspired by such linguistic insights, we exploit multi-task adversarial training to learn more robust transformations specific to each senone. Fig. 2 uses an example of classifying two senones to illustrate the intuition behind our current work. In Fig. 2(a), after domain adaptation of child speech the two domains are closer to each other, but the separation across senones is not demarcated, so many senone samples are falsely classified. This limitation motivates us to incorporate senone knowledge into adversarial training to enable senone-associated feature space transforms, which we elaborate in Section 3.2.
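The paper does not spell out how this min-max game is optimized in practice; one standard realization, following the domain-adversarial training of [15], is a gradient reversal layer (GRL) placed between the adaptation network and the discriminator. The sketch below shows that standard device under those assumptions, not necessarily the authors' exact mechanism.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer: identity in the forward pass, negated
    (and optionally scaled) gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Flipping the sign means that minimizing the discriminator loss
        # downstream simultaneously *maximizes* it w.r.t. the adapter,
        # implementing the adversarial objective in one backward pass.
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Usage: features pass through the GRL before the domain discriminator,
# e.g. domain_logits = discriminator(grad_reverse(adapter(feats)))
```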
3.2. Senone-aware adversarial training

As shown in Fig. 2(b), the binary child/adult domain information is now re-represented at the senone level. Therefore, the proposed representation not only embodies the broad domain information (adult and child speaker), but also encodes fine-grained senone information, enabling the proposed adaptation model to learn feature transforms specific to senones and resulting in more distinct separation among senone classes.

Fig. 2. (a) Global transformation using binary adversarial training (BAT): only domain labels of child or adult are considered, as the model is blind to senone information (e.g., senones s1 and s2 for adult speech are both labeled as adult domain). (b) Proposed senone-aware adversarial training (SAT): in addition to domain labels of child and adult, senone posteriors are also considered.

Assuming that there are $N$ training samples in total and $n$ of them belong to adult speech, the multi-task objective function is computed as:

$$E(\Theta_{adpt}, \Theta_{dom}) = \frac{1}{n} \sum_{i=1}^{n} L^{i}_{senone}(\Theta_{adpt}) - \frac{1}{N} \sum_{i=1}^{N} L^{i}_{dom}(\Theta_{adpt}, \Theta_{dom}) \quad (1)$$

where $L_{senone}$ is the cross-entropy loss between the adult senone labels and the network output posteriors from the adult acoustic model, and $L_{dom}$ is the domain classification loss on adult and child speech samples.

The domain classification loss $L_{dom}$ in adversarial training using binary domain labels is characterized as:

$$L^{i}_{dom}(\Theta_{adpt}, \Theta_{dom}) = -(1 - I_{dom}) \log P(dom = a \mid f_i) - I_{dom} \log P(dom = c \mid f_i) \quad (2)$$

where $I_{dom}$ is the domain indicator function, taking the value 1 for child samples $f_i$ and 0 for adult samples. $P(dom = a \mid f_i)$ and $P(dom = c \mid f_i)$ are the probability outputs from the domain discriminator.

To incorporate senone information, we redefine the domain loss $L_{dom}$ of Equation (2) as follows:

$$L^{i}_{dom} = -(1 - I_{dom}) \sum_{k=1}^{K} \alpha_{ik} \log P(dom = a, sen = k \mid f_i) - I_{dom} \sum_{k=1}^{K} \alpha_{ik} \log P(dom = c, sen = k \mid f_i) \quad (3)$$

where $K$ is the total number of senones and $\alpha_{ik}$ is the $k$-th entry of the senone posteriors extracted from the adult acoustic model for speech sample $i$. The child/adult domain information is thus represented by a $K$-dimensional vector.

During model training, the parameters of the front-end feature adapter ($\Theta_{adpt}$) and domain classifier ($\Theta_{dom}$) are optimized such that:

$$\hat{\Theta}_{adpt} = \operatorname*{argmin}_{\Theta_{adpt}} E(\Theta_{adpt}, \hat{\Theta}_{dom}) \quad (4)$$

$$\hat{\Theta}_{dom} = \operatorname*{argmax}_{\Theta_{dom}} E(\hat{\Theta}_{adpt}, \Theta_{dom}) \quad (5)$$

Via joint optimization, the generated child speech features are not only closer to the adult speech features but also contribute to sharper senone classification performance.
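As a concrete reading of Equations (1) and (3), the sketch below assumes the discriminator emits a joint posterior over (domain, senone) pairs, i.e. 2K outputs normalized together; the tensor layout, variable names, and the domain index convention (0 = adult, 1 = child) are illustrative assumptions, while the loss algebra follows the equations above.

```python
import torch
import torch.nn.functional as F

def senone_aware_domain_loss(disc_logits, alpha, is_child):
    """Eq. (3), averaged over the batch.
    disc_logits: (B, 2, K) joint (domain, senone) scores from the discriminator
    alpha:       (B, K) senone posteriors alpha_{ik} from the adult model
    is_child:    (B,) domain indicator I_dom (1 = child, 0 = adult)
    """
    B, _, K = disc_logits.shape
    # Normalize jointly over all 2*K (domain, senone) classes.
    log_joint = F.log_softmax(disc_logits.view(B, -1), dim=-1).view(B, 2, K)
    # Select log P(dom = true domain, sen = k | f_i) for every senone k.
    dom_idx = is_child.long().view(B, 1, 1).expand(B, 1, K)
    log_p = log_joint.gather(1, dom_idx).squeeze(1)        # (B, K)
    return -(alpha * log_p).sum(dim=-1).mean()

def total_objective(adult_senone_logits, adult_senone_labels, dom_loss):
    """Eq. (1): the adapter minimizes this while the discriminator maximizes
    it (Eqs. 4-5); in practice the GRL sketch above realizes both roles in a
    single backward pass."""
    l_senone = F.cross_entropy(adult_senone_logits, adult_senone_labels)
    return l_senone - dom_loss

# Toy usage with assumed sizes (B=4 samples, K=8 senones):
B, K = 4, 8
disc_logits = torch.randn(B, 2, K)
alpha = torch.softmax(torch.randn(B, K), dim=-1)   # senone posteriors
is_child = torch.tensor([0, 1, 1, 0])
l_dom = senone_aware_domain_loss(disc_logits, alpha, is_child)
```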
4. EXPERIMENTAL SETUP

4.1. Datasets
We adopt the LibriSpeech [24] "train-clean-100" subset (251 speakers) to train the adult acoustic model. The SingaKids-English corpus [14] consists of 46 hours of speech from 193 speakers ranging in age from 6 to 12. The age and gender ranges of the development and test sets are equally distributed to minimize biased performance at test time. Assessment uses 5 proficiency levels, scored by an English school teacher certified by the Ministry of Education in Singapore. We merge the six grades in primary school into three levels: G12, G34, and G56.
4.2. Model configuration

The training data for the adult acoustic model was parameterized into 40-dimensional log Mel-scale filter-bank features, along with their first and second temporal differentials. The input to the network is 11 contiguous frames, yielding a 1,320-dimensional feature vector. The back-end adult acoustic model has 6 hidden layers with 2,048 nodes per layer. Both batch normalization and dropout were used to prevent over-fitting.

Considering that a language learner's pronunciation and fluency performance are often correlated, we designed a multi-task neural network architecture to perform pronunciation and fluency assessment together. The input to this network is a 30-dimensional feature vector consisting of a set of features adopted from previous work [3, 25]. The model consists of 3 hidden layers with 128 nodes per layer, and 2 softmax output layers with 5 nodes in each branch. All hyper-parameters were optimized on the LibriSpeech "dev-clean" set and the SingaKids-English development set.
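A minimal sketch of this assessment network as specified above (30-dimensional input, a shared trunk of 3 x 128 ReLU layers, and two 5-way softmax branches); the equal weighting of the two task losses is an assumption, as the paper does not state it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AssessmentNet(nn.Module):
    """Multi-task scorer: shared trunk, one branch per assessment task."""
    def __init__(self, in_dim=30, hidden=128, num_levels=5):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.pron_head = nn.Linear(hidden, num_levels)  # pronunciation levels
        self.flu_head = nn.Linear(hidden, num_levels)   # fluency levels

    def forward(self, x):
        h = self.trunk(x)
        return self.pron_head(h), self.flu_head(h)

model = AssessmentNet()
x = torch.randn(4, 30)                 # toy batch of utterance-level features
pron_logits, flu_logits = model(x)
pron_t = torch.randint(0, 5, (4,))     # toy human pronunciation labels
flu_t = torch.randint(0, 5, (4,))      # toy human fluency labels
loss = (F.cross_entropy(pron_logits, pron_t)
        + F.cross_entropy(flu_logits, flu_t))  # equal weights assumed
```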
5. EXPERIMENTAL RESULTS

5.1. Speech recognition
To validate our DNN baseline trained on adult speech, we employ the standard pruned version of the WSJ-5k trigram language model to conduct speech recognition on the LibriSpeech "test-clean" set. Our baseline model achieves a word error rate of 9.49%, slightly better than the 9.66% of Kaldi's DNN model (https://github.com/kaldi-asr/kaldi/blob/master/egs/librispeech/s5), suggesting this baseline is competitive. Other baselines include a lightly supervised FMLLR and domain adaptation using binary adversarial training (BAT) [14]. The FMLLR transformation matrix was estimated with phonetic labels generated from the aforementioned DNN baseline model. (Note that we focus on investigating unsupervised approaches; dedicated methods for post-processing those generated labels for FMLLR are out of the scope of this paper.) To minimize the influence of the language model, we adopt a free phone decoding graph instead.

Table 1. Phone error rate (%) on the SingaKids-English test set.

          Baseline           Baseline + adaptation
          DNN      FMLLR     BAT [14]   SAT
G12       85.11    83.73     75.89      69.56
G34       73.79    73.24     67.26      61.84
G56       69.08    65.78     62.31      58.02
Overall   74.43    72.55     67.19      62.02

As shown in Table 1, the proposed senone-aware approach (SAT) achieves the lowest phone error rate (PER) in all conditions. It reduces the overall PER by 12.41% absolute over the DNN baseline, and by 7.7% relative over the BAT baseline. Though a PER around 60% is still relatively high, such results are consistent with prior work [13, 26] and illustrate the challenges of processing child speech.

5.2. Pronunciation and fluency assessment

We use two widely adopted metrics [8, 14]: prediction accuracy and mean squared error (MSE) between the predicted scores and the human-rated scores.

Table 2. Pronunciation prediction performance comparison.

                        Baseline   BAT [14]   SAT
Accuracy (%)  G12       42.1       47.3       50.0
              G34       37.3       38.9       47.7
              G56       47.3       52.3       52.7
              Overall   43.3       47.5       49.8
MSE           G12       1.30       1.10       1.12
              G34       1.14       1.14       1.11
              G56       1.44       1.32       1.19
              Overall   1.32       1.25       1.17
Table 3. Fluency prediction performance comparison.

                        Baseline   BAT [14]   SAT
Accuracy (%)  G12       31.6       42.1       47.3
              G34       40.3       38.8       44.7
              G56       50.9       53.6       55.3
              Overall   44.2       47.0       49.7
MSE           G12       2.25       1.99       1.50
              G34       1.96       1.80       1.46
              G56       2.22       1.96       1.40
              Overall   2.13       1.90       1.42

From Tables 2 and 3, we observe that the proposed SAT approach improves prediction accuracy on both pronunciation and fluency evaluation, implying that SAT generates features that are more easily separable for the scoring classifier. The prediction accuracy is around 50%, more than twice chance level. The overall MSE is reduced from 1.90 to 1.42, a 25.2% relative improvement when predicting fluency scores. Note that fluency scoring is done at the utterance level, so the available training data for the assessment classifier is much smaller than that for acoustic modeling, limiting prediction performance. Approaches to reduce the need for such labor-intensive human scoring are a topic of ongoing research.
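As a quick sanity check, the headline relative gains quoted here follow directly from the Overall rows of Tables 1 and 3:

```python
# Worked arithmetic for the relative improvements cited in the text.
bat_per, sat_per = 67.19, 62.02        # Table 1, Overall PER (%)
print((bat_per - sat_per) / bat_per)   # -> 0.0769..., the 7.7% relative gain

bat_mse, sat_mse = 1.90, 1.42          # Table 3, Overall fluency MSE
print((bat_mse - sat_mse) / bat_mse)   # -> 0.2526..., the 25.2% relative gain
```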
6. CONCLUSION
To tackle the challenges arising from the limited linguistic resources suitable for modeling child speech, we proposed an unsupervised adversarial multi-task training approach that transforms child speech features to the adult feature space by anchoring the adaptation at the senone level to exploit senone-specific acoustic variations across adults and children. We showed that our approach outperforms competitive SOTA baselines by 7.7%-25.2% relative across three speech tasks.

7. REFERENCES

[1] S. Ghai and R. Sinha, "Exploring the role of spectral smoothing in context of children's speech recognition," in Tenth Annual Conference of the International Speech Communication Association, 2009.
[2] H. Liao, G. Pundak, O. Siohan, M. K. Carroll, N. Coccaro, Q.-M. Jiang, T. N. Sainath, and A. Senior, "Large vocabulary automatic speech recognition for children," in INTERSPEECH, 2015, pp. 1611–1615.
[3] C. Cucchiarini, H. Strik, and L. Boves, "Automatic evaluation of Dutch pronunciation by using speech recognition technology," in ASRU. IEEE, 1997, pp. 622–629.
[4] H. Franco, V. Abrash, K. Precoda, H. Bratt, R. Rao, J. Butzberger, R. Rossier, and F. Cesari, "The SRI EduSpeak(TM) system: Recognition and pronunciation scoring for language learning," Proceedings of InSTILL, pp. 123–128, 2000.
[5] J. Cheng and J. Shen, "Towards accurate recognition for children's oral reading fluency," 2010, pp. 103–108.
[6] A. Metallinou and J. Cheng, "Using deep neural networks to improve proficiency assessment for children English language learners," in INTERSPEECH, 2014.
[7] Y. Qian, X. Wang, and K. Evanini, "Self-adaptive DNN for improving spoken language proficiency assessment," in INTERSPEECH, 2016, pp. 3122–3126.
[8] K. Kyriakopoulos, M. Gales, and K. Knill, "Automatic characterisation of the pronunciation of non-native English speakers using phone distance features," 2018.
[9] A. Potamianos, S. Narayanan, and S. Lee, "Automatic speech recognition for children," in EUROSPEECH, 1997.
[10] M. Gerosa, D. Giuliani, and F. Brugnara, "Acoustic variability and automatic recognition of children's speech," Speech Communication, vol. 49, pp. 847–860, 2007.
[11] M. Russell and S. D'Arcy, "Challenges for computer recognition of children's speech," in Workshop on Speech and Language Technology in Education, 2007.
[12] P. G. Shivakumar, A. Potamianos, S. Lee, and S. Narayanan, "Improving speech recognition for children using acoustic adaptation and pronunciation modeling," in WOCCI, 2014, pp. 15–19.
[13] M. Matassoni, R. Gretter, D. Falavigna, and D. Giuliani, "Non-native children speech recognition through transfer learning," in ICASSP. IEEE, 2018, pp. 6229–6233.
[14] R. Duan and N. F. Chen, "Unsupervised feature adaptation using adversarial multi-task training for automatic evaluation of children's speech," in INTERSPEECH, 2020, pp. 3037–3041.
[15] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, "Domain-adversarial training of neural networks," JMLR, vol. 17, no. 1, pp. 2096–2030, 2016.
[16] Y. Shinohara, "Adversarial multi-task learning of deep neural networks for robust speech recognition," in INTERSPEECH, 2016, pp. 2369–2372.
[17] Z. Meng, J. Li, Z. Chen, Y. Zhao, V. Mazalov, Y. Gong, and B.-H. Juang, "Speaker-invariant training via adversarial learning," in ICASSP, 2018, pp. 5969–5973.
[18] J. Hou, P. Guo, S. Sun, F. K. Soong, W. Hu, and L. Xie, "Domain adversarial training for improving keyword spotting performance of ESL speech," in ICASSP. IEEE, 2019, pp. 8122–8126.
[19] K. N. Stevens, Acoustic Phonetics, vol. 30, MIT Press, 2000.
[20] R. Serizel and D. Giuliani, "Vocal tract length normalisation approaches to DNN-based children's and adults' speech recognition," in IEEE SLT, 2014, pp. 135–140.
[21] M. J. Gales and P. C. Woodland, "Mean and variance adaptation within the MLLR framework," Computer Speech & Language, vol. 10, no. 4, pp. 249–264, 1996.
[22] R. M. French, "Catastrophic forgetting in connectionist networks," Trends in Cognitive Sciences, vol. 3, no. 4, pp. 128–135, 1999.
[23] S.-W. Lee, J.-H. Kim, J. Jun, J.-W. Ha, and B.-T. Zhang, "Overcoming catastrophic forgetting by incremental moment matching," in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., pp. 4652–4662, 2017.
[24] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in ICASSP. IEEE, 2015, pp. 5206–5210.
[25] W. Hu, Y. Qian, F. K. Soong, and Y. Wang, "Improved mispronunciation detection with deep neural network trained acoustic models and transfer learning based logistic regression classifiers," Speech Communication, vol. 67, pp. 154–166, 2015.
[26] R. Serizel and D. Giuliani, "Deep-neural network approaches for speech recognition with heterogeneous groups of speakers including children," Natural Language Engineering, vol. 23, no. 3, pp. 325–350, 2017.