Fundamental Frequency Feature Normalization and Data Augmentation for Child Speech Recognition
Gary Yeung, Ruchao Fan, and Abeer Alwan
Dept. of Electrical and Computer Engineering, University of California, Los Angeles, USA

This work was supported in part by a National Science Foundation (NSF) Grant.
ABSTRACT
Automatic speech recognition (ASR) systems for young children are needed due to the importance of age-appropriate educational technology. Because of the lack of publicly available young child speech data, feature extraction strategies such as feature normalization and data augmentation must be considered to successfully train child ASR systems. This study proposes a novel technique for child ASR using both feature normalization and data augmentation methods based on the relationship between formants and fundamental frequency (f_o). Both the f_o feature normalization and data augmentation techniques are implemented as a frequency shift in the Mel domain. These techniques are evaluated on a child read speech ASR task. Child ASR systems are trained by adapting a BLSTM-based acoustic model trained on adult speech. Using both f_o normalization and data augmentation results in a relative word error rate (WER) improvement of 19.3% over the baseline when tested on the OGI Kids' Speech Corpus, and the resulting child ASR system achieves the best WER currently reported on this corpus.

Index Terms — child speech recognition, fundamental frequency, feature normalization, data augmentation
1. INTRODUCTION
The development of effective child automatic speech recognition (ASR) systems has become important in recent years. For example, the advancement of child ASR can facilitate the development of teaching and assessment tools for children in educational settings [1, 2, 3] using interactive systems such as social robots [4, 5, 6]. This is especially relevant for kindergarten-aged children who are just learning to read, write, or type and rely on speech to interact with technology. Yet, ASR systems for young children still perform quite poorly when compared to adult ASR [6, 7].

One of the major hurdles facing the development of effective child ASR systems is the lack of publicly available young child speech databases. This is especially a concern in an era where deep learning, which requires many hours of training data, is rapidly becoming the primary method of developing ASR systems, using data-intensive acoustic models such as bidirectional long short-term memory (BLSTM) networks. Hence, many child ASR systems complement young child speech with older child speech or even adult speech for training data. However, there is a large acoustic mismatch between child and adult speech, further complicated by the fact that children's speech acoustics change quite dramatically as they grow [8, 9]. These changes include formants and fundamental frequency (f_o) [8, 9], two defining features of the speech signal, especially for vowels. Furthermore, as the age difference between training and testing speakers grows, ASR performance degrades rapidly [7].

One common strategy to account for the acoustic mismatch between speakers is frequency normalization. This approach attempts to warp the frequency spectra of an utterance given a normalization factor for the utterance and a target acoustic space. For instance, vocal tract length normalization (VTLN) warps the frequency spectra using a maximum likelihood approach and can be implemented in several ways with varying degrees of success [10, 11, 12]. An alternative approach is to use acoustically relevant speech parameters as normalization factors, such as subglottal resonances (SGRs) [13], the third spectral peak or formant frequency [14], and f_o [15].

Another strategy is to augment the training data by creating additional speech-like features for training. There are a number of ways to implement this augmentation, such as manipulating the frequency scaling or adding noise [16, 17]. While data augmentation has not been as readily explored for child speech compared to adult speech, some techniques that have been evaluated include adding noise and reverberation [18] and using out-of-domain adult data [19].

In our previous study, we proposed an f_o-based normalization technique for child ASR [15]. In that study, a number of ASR systems were trained using speech from children of various ages and tested using kindergarten-aged children. While that study demonstrated that f_o normalization was effective for child ASR, the experiments performed assumed that only older child speech was available to train a young child ASR system. In a more practical situation, we may expect that some amount of in-domain child speech would be available as training data. Furthermore, we expect to be able to generate additional training data through effective data augmentation techniques.

In this study, we propose a new data augmentation method for training child ASR systems based on the f_o normalization method we proposed in [15]. This method is capable of generating speech-like features that adhere to the physical properties of speech defined by the relationship between vowel formants and f_o. We show that this data augmentation method is capable of improving child ASR systems adapted from BLSTM acoustic models trained on adult speech. Additionally, we demonstrate that this data augmentation method can be used alongside f_o normalization for further improvement.

The remainder of the paper is organized as follows. Section 2 reviews the f_o normalization technique proposed in [15] and formulates the data augmentation technique. Section 3 describes the databases and experimental setup. Section 4 presents the experimental results. Section 5 concludes the paper with a summary and considerations for future work.
2. NORMALIZATION TECHNIQUE

2.1. Relationship Between f_o and Formants

Several past studies have revealed meaningful connections between f_o and the first several formant frequencies (F_1, F_2, F_3, ...) of vowels. In the Bark scale, the tonotopic distances between adjacent formants (F_{x+1} − F_x for x ∈ {1, 2, 3, ...}), along with the tonotopic distance between the first formant and f_o (F_1 − f_o), have proven to be effective at modeling human vowel perception and representing the vowel space [20, 21]. Furthermore, these tonotopic distances can be equivalently represented by the difference between any formant and f_o (F_x − f_o for x ∈ {1, 2, 3, ...}). This suggests that the vowel space can be modeled as a linear relationship (in the Bark scale) between formants and f_o. Additionally, studies have found that f_o and formant locations depend on one another in both vowel production [22] and vowel perception [23]. A more detailed examination of these relationships can be found in [15].

2.2. f_o Feature Normalization

The frequency normalization technique we proposed for kindergarten ASR in [15] was designed to be applied to any spectral-based feature such as Mel-frequency cepstral coefficients (MFCCs). The linear relationship between f_o and vowel formants is exploited to formulate a normalization technique using median f_o as the only normalization factor. As such, the technique attempts to map a speaker's acoustic space, governed by the speaker's f_o, to some target space, defined by a predetermined target f_o. Notably, the Mel scale was used in [15] instead of the Bark scale as the two scales are highly correlated, but either scale can be used in practice. For consistency, this study will also use the Mel scale.

To perform the f_o-based normalization, a default f_o, denoted as f_{o,def}, must first be chosen. Before feature computation, f_o extraction is performed across the entire utterance using a reliable f_o detection algorithm. For this study, we will use the multi-band summary correlogram (MBSC) pitch detection algorithm [24]. The median f_o across all voiced frames of the utterance, denoted as f_{o,utt}, is then chosen as the normalization factor for the utterance. The discrete Fourier transform (DFT) of the feature extraction procedure is warped as follows:

f_{norm} = f_{orig} − (f_{o,utt} − f_{o,def})    (1)

where all frequencies are in the Mel scale, f_{norm} is a normalized frequency corresponding to some DFT index, and f_{orig} is the frequency from the original speech spectrum mapped to f_{norm}.

An example of this f_o normalization is shown in Figure 1. The Mel filter bank outputs of an 18-year-old male and a 7-year-old male saying the vowel /i/ are displayed both with and without f_o normalization. When both utterances are normalized to f_{o,def} = 100 Hz, the Mel filter bank outputs become more similar.

Fig. 1: Mel filter bank outputs of an 18-year-old male (solid) and a 7-year-old male (dashed) saying the vowel /i/, computed with 15 filters and a frequency range of 20 Hz to 6 kHz. The filter outputs are computed both without (left) and with (right) f_o normalization. When normalization is applied, the default f_o is chosen to be f_{o,def} = 100 Hz. The 18-year-old male had f_{o,utt} = 106 Hz, and the 7-year-old male had f_{o,utt} = 270 Hz. The Mel filter bank outputs computed with f_o normalization are much better aligned than the filter outputs computed without f_o normalization.
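Because Eq. (1) is a uniform shift in the Mel domain, the normalization can be folded into MFCC extraction by shifting the Mel filter bank edges rather than resampling the spectrum. The following is a minimal sketch of this idea, assuming standard triangular Mel filters; the function names and example parameter values are ours, not from the paper.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def shifted_mel_filterbank(n_filters, n_fft, sr, f_lo, f_hi, mel_shift):
    # Filter edges are laid out uniformly on the Mel axis over the target
    # frequency range, then moved up by the Eq. (1) offset
    # mel(f_o,utt) - mel(f_o,def) before mapping back to DFT bins.
    # Applying this bank to the ORIGINAL power spectrum is equivalent to
    # applying an unshifted bank to the warped (normalized) spectrum.
    mel_pts = np.linspace(hz_to_mel(f_lo), hz_to_mel(f_hi), n_filters + 2)
    hz_pts = mel_to_hz(mel_pts + mel_shift)
    bins = np.clip(np.floor((n_fft + 1) * hz_pts / sr), 0, n_fft // 2).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[i, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    return fbank

# Example: normalize the 7-year-old of Fig. 1 (f_o,utt = 270 Hz) to the
# adult-male target f_o,def = 100 Hz.
delta = hz_to_mel(270.0) - hz_to_mel(100.0)   # roughly 217 Mels
fbank = shifted_mel_filterbank(n_filters=23, n_fft=512, sr=16000,
                               f_lo=20.0, f_hi=6200.0, mel_shift=delta)
```

Shifting the filter bank rather than the spectrum keeps the extraction cost unchanged; only the Mel-to-Hz mapping of the filter edges moves.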
2.3. f_o Perturbation

While the f_o normalization procedure attempts to reduce variability between speakers by using Eq. (1), fixing f_{o,def} to a default value, and adjusting f_{o,utt}, an alternative procedure can use Eq. (1) to create variability rather than reduce it. This can be accomplished by extracting features multiple times from the same speech utterance while adjusting f_{o,def}. By perturbing the feature extraction, we can augment the training data by generating acoustic features that are consistent with the structure of speech defined by the tonotopic distances. This can be especially useful for training deep neural networks that require a large amount of speech data. We will refer to this technique as f_o perturbation. When combined with the f_o normalization procedure, we can simultaneously remove larger inter-speaker variabilities while perturbing the features for additional training data.
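Continuing the sketch above, perturbation is the same filter-bank shift driven by a sweep over f_{o,def} instead of a single target. The offsets below are illustrative placeholders: the paper's exact shift values do not survive in this copy, and only the structure of three symmetric pairs plus the original (seven variants per utterance) is taken from Section 3.

```python
# Reuses hz_to_mel() and shifted_mel_filterbank() from the previous sketch.
# f_o perturbation: extract features several times per utterance, moving
# f_o,def so the Mel-domain shift of Eq. (1) lands at base + offset.
ILLUSTRATIVE_OFFSETS = [-60, -40, -20, 0, 20, 40, 60]  # placeholder Mel values

def perturbed_filterbanks(f_o_utt, f_o_def=100.0, normalize=True):
    # With normalization, the base shift maps the speaker to f_o,def;
    # without it, only the perturbation offsets move the spectrum.
    base = hz_to_mel(f_o_utt) - hz_to_mel(f_o_def) if normalize else 0.0
    return [shifted_mel_filterbank(23, 512, 16000, 20.0, 6200.0, base + off)
            for off in ILLUSTRATIVE_OFFSETS]   # 7 feature variants per utterance
```

The perturbed copies can be generated once, offline, and pooled with the original features before adaptation.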
3. DATABASE AND EXPERIMENTAL SETUP

3.1. Database
Two child speech databases were used in this study. The first one was the OGI Kids' Speech Corpus [25]. This corpus contains approximately 100 speakers per American educational grade level, from kindergarten to 10th grade. Utterances were recorded with a sampling rate of 16 kHz (8 kHz bandwidth). Both scripted and spontaneous styles of speech were recorded from each speaker. In this study, we used the sentence utterances from the scripted speech recordings, which consisted of a total of 10,072 sentence utterances from children in grades kindergarten through 5th grade. Approximately 70% of the utterances in each grade were used as training data for a total of 7,051 training utterances. The remaining utterances were used for testing.

The second child speech database was the CMU Kids Corpus [26]. This corpus contains 76 speakers between 1st and 3rd grade with two additional speakers from kindergarten or 6th grade for a total of 78 speakers. Utterances were recorded with a sampling rate of 16 kHz. A total of 5,180 read sentence utterances were recorded across these speakers. Exactly 70% of the utterances in this corpus were used as training data for a total of 3,626 training utterances. The remaining utterances were used for testing.

For adult data, the LibriSpeech ASR Corpus was used [27]. For our experiments, we used all the training data from the corpus, which contains 960 hours of adults reading audio books in clean and noisy conditions, for adult model training.

3.2. Experimental Setup

The baseline features used for the ASR experiments were 13-dimensional MFCCs with a window size of 25 ms and a shift of 10 ms. These MFCCs were extracted using a 512-point DFT, 23 Mel filters, and a bandwidth from 20 Hz to 8 kHz.

The second set of features was similar to the baseline features except the DFT was normalized by the procedure presented in Section 2.2. The default f_o was chosen to be f_{o,def} = 100 Hz, representing an adult male f_o, and f_{o,utt} was chosen to be the median f_o across the utterance, estimated using MBSC pitch detection. As the value of f_{o,utt} reached as high as 300 Hz, the bandwidth for these features was limited to 6.2 kHz to compensate for the upward frequency shift. Notably, this limits the maximum frequency shift of the f_o normalization procedure to approximately 250 Mels. Both baseline and f_o normalized features were extracted from all utterances.

When applying f_o perturbation, f_{o,def} was adjusted such that the DFT was shifted by three symmetric pairs of ± offsets in Mels. Along with the original features, this multiplies the amount of training data by 7. Data augmentation was applied to the CMU Kids and OGI Kids' training datasets.

Implementation of the feature extraction is relatively simple when using Eq. (1) to compute the DFT frequency shift. We let f_{o,def} range over a set of seven values (in Hz), where every value in the set is used when performing data augmentation and only 100 Hz is used when performing a standard feature extraction. Similarly, we let f_{o,utt} = 100 Hz when performing a non-normalized feature extraction and choose f_{o,utt} as the median f_o across the utterance when normalizing.
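A sketch of how the per-utterance normalization factor and the Eq. (1) shift might be computed is shown below. The paper uses the MBSC pitch detector [24]; librosa's pYIN tracker is substituted here only because it is readily available, and the helper names and the 60-400 Hz search range are our assumptions.

```python
import numpy as np
import librosa  # pYIN stands in here for the paper's MBSC pitch detector

def utterance_f0_median(y, sr=16000):
    # f_o,utt: median f_o over the voiced frames of the utterance.
    f0, voiced_flag, _ = librosa.pyin(y, fmin=60.0, fmax=400.0, sr=sr)
    return float(np.median(f0[voiced_flag]))

def eq1_mel_shift(y, sr=16000, f_o_def=100.0, normalize=True):
    # Mel-domain DFT shift of Eq. (1) for one (utterance, f_o,def) pair.
    # For non-normalized extraction the paper fixes f_o,utt = 100 Hz, so
    # only the f_o,def sweep (data augmentation) moves the spectrum.
    f_o_utt = utterance_f0_median(y, sr) if normalize else 100.0
    return hz_to_mel(f_o_utt) - hz_to_mel(f_o_def)  # from the earlier sketch
```

Note the shift direction: a child utterance with f_{o,utt} above 100 Hz yields a positive shift, moving spectral content down toward the adult target space.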
For both the baseline and f_o normalized features, an adult ASR system was first trained using the LibriSpeech training set. The acoustic model was a 3-layer BLSTM network with 512 cells in each direction followed by a feed-forward layer that mapped the output of the BLSTMs to senone probabilities. The input to the BLSTM was a splice of 7 frames (the current frame plus 3 frames forward and backward in time), forming a 91-dimensional input for the acoustic model. The output was approximately 5,700 senone probabilities. The acoustic model was based on PyKaldi2 [28], while decoding used the Kaldi Speech Recognition Toolkit [29].
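For concreteness, a minimal PyTorch sketch of an acoustic model with this shape is given below. It mirrors the description above (3 BLSTM layers, 512 cells per direction, 91-dimensional spliced-MFCC input, senone posteriors) but is our own reconstruction, not the PyKaldi2 training code; the 5,704 senone count is an arbitrary stand-in for the approximate figure quoted.

```python
import torch
import torch.nn as nn

class BLSTMAcousticModel(nn.Module):
    """3-layer BLSTM (512 cells per direction) over 91-dim spliced MFCCs
    (7 frames x 13 coefficients), with a feed-forward output layer that
    produces senone log-posteriors."""
    def __init__(self, input_dim=91, hidden=512, num_senones=5704):
        super().__init__()
        self.blstm = nn.LSTM(input_dim, hidden, num_layers=3,
                             bidirectional=True, batch_first=True)
        self.output = nn.Linear(2 * hidden, num_senones)  # both directions

    def forward(self, x):            # x: (batch, frames, 91)
        h, _ = self.blstm(x)
        return torch.log_softmax(self.output(h), dim=-1)

# Shape check on a dummy 100-frame utterance: output is (1, 100, 5704).
model = BLSTMAcousticModel()
posteriors = model(torch.randn(1, 100, 91))
```

Adaptation then amounts to continuing training of this network on the (possibly normalized and perturbed) child features.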
Table 1: Word error rates (WERs) of the child ASR experiment using a BLSTM-based acoustic model adapted from adult speech. The left two columns indicate whether f_o normalization ("Norm?") and data augmentation using f_o perturbation ("Aug?") were used. WERs for both CMU Kids and OGI Kids' are reported in the latter columns.

Norm?   Aug?   CMU Kids   OGI Kids'
No      No     16.88      6.84
Yes     No     16.93      6.50
No      Yes    16.63      5.85
Yes     Yes               5.52

The adult ASR systems were adapted to child speech using either the CMU Kids or OGI Kids' training data. The adult ASR system trained with f_o normalization was adapted using child speech features that were extracted with f_o normalization. Similarly, the adult ASR system trained without normalization was adapted using features that were extracted without normalization. Additionally, data augmentation using f_o perturbation was applied on the child training datasets.

A 4-gram language model (LM) trained on Project Gutenberg books was chosen for decoding. This LM is one of the default language models included in Kaldi's LibriSpeech recipe [29]. Adapted ASR systems were evaluated using the corresponding CMU Kids or OGI Kids' testing datasets. The ASR system trained with f_o normalization was tested using features extracted with f_o normalization.
4. RESULTS AND DISCUSSION
The results of the child ASR experiments described in Section 3 are shown in Table 1. The top row displays the word error rate (WER) of the baseline system (i.e., no normalization or augmentation). The second row displays the WER of the system using f_o normalization. The third row displays the system adapted with f_o perturbed child speech features. Finally, the last row uses both f_o normalization and f_o perturbed adaptation data. Additionally, we performed the experiment once more using the acoustic model trained only on LibriSpeech. This system achieved a WER of 37.49% for CMU Kids and 59.70% for OGI Kids', significantly worse than any WER in Table 1, which demonstrates the importance of child speech adaptation data.

When using adaptation and only applying f_o normalization, the performance of the OGI Kids' system saw a small improvement over the baseline, which has no normalization or data augmentation, but not enough to be significant. Applying f_o perturbation to the training data, a more substantial improvement was achieved, from 6.84% to 5.85%. However, when using the CMU Kids system, no major improvements were observed.

Using both f_o normalization and f_o perturbation resulted in the best performing ASR system for both testing sets. The OGI Kids' testing set saw a relative WER improvement of 19.3%, reducing the WER to 5.52%, and this result is statistically significant. However, the CMU Kids testing set only saw a relative WER improvement of 2.4%.

While the OGI Kids' testing set saw a larger improvement than the CMU Kids testing set, this result may be expected. We note that the CMU Kids testing set had a narrower age range (approximately 6-9 years old, excluding the two outlier children) compared to the OGI Kids' testing set (approximately 5-11 years old). The f_o normalization method has been shown to produce larger improvements when the range of ages used in training and testing data is wider [15]. A similar phenomenon may be occurring for f_o perturbation. That is, since there is less variability in the CMU Kids testing set, adding additional variability to the training set through f_o perturbation was unnecessary to train the BLSTM. Meanwhile, with the larger variability of the OGI Kids' dataset, both f_o normalization and f_o perturbation proved helpful. Furthermore, these techniques may extend to ASR systems using the full OGI Kids' scripted speech dataset, which has been reported to have a WER of 10.8% [18]. Preliminary experiments suggest that f_o perturbation also performs better than VTLN-based data augmentation.
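As a check on the headline number, the relative WER improvement follows directly from the absolute WERs in Table 1; for OGI Kids':

\[
\frac{\mathrm{WER}_{\text{base}} - \mathrm{WER}_{\text{norm+aug}}}{\mathrm{WER}_{\text{base}}}
  = \frac{6.84 - 5.52}{6.84} \approx 0.193 \quad \text{(19.3\% relative)}
\]

The CMU Kids figure of 2.4% follows the same formula from its 16.88% baseline.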
5. CONCLUSION
This study proposes a new data augmentation method for training child ASR systems based on the f_o normalization method proposed in [15]. Both the normalization and data augmentation methods adhere to the physical properties of speech defined by the relationship between vowel formants and f_o. The two methods can be formulated as a simple shift of DFT bin frequencies in the Mel domain and are implemented by manipulating f_{o,def} and f_{o,utt} in Eq. (1). Child ASR systems were trained using these methods by adapting from a BLSTM acoustic model trained on adult speech. When using both f_o normalization and f_o perturbation, a 19.3% relative improvement was observed on the OGI Kids' Speech Corpus. However, a less substantial improvement was observed on the CMU Kids Corpus. This suggests that both f_o normalization and f_o perturbation are more effective when the age range of the speakers is large.

There are a number of possible directions for future work. We plan to evaluate these methods on ASR systems for both children and adults for a situation with increased age and f_o variability between speakers. We also plan to apply these systems to other child ASR applications such as conversational speech, educational applications, and clinical applications.

6. REFERENCES

[1] J. Tepperman, J. Silva, A. Kazemzadeh, H. You, S. Lee, A. Alwan, and S. Narayanan, "Pronunciation Verification of Children's Speech for Automatic Literacy Assessment," in Proc. INTERSPEECH, 2006, pp. 845-848.
[2] H. T. Bunnell, D. M. Yarrington, and J. B. Polikoff, "STAR: Articulation Training for Young Children," in Proc. ICSLP, 2000, pp. 85-88.
[3] G. Yeung, A. Afshan, K. E. Ozgun, K. Kaewtip, S. M. Lulich, and A. Alwan, "Predicting Clinical Evaluations of Children's Speech with Limited Data Using Exemplar Word Template References," in Proc. SLaTE, 2017, pp. 161-166.
[4] S. Spaulding, H. Chen, S. Ali, M. Kulinski, and C. Breazeal, "A Social Robot System for Modeling Children's Word Pronunciation," in Proc. AAMAS, 2018, pp. 1658-1666.
[5] G. Yeung, A. L. Bailey, A. Afshan, M. Q. Pérez, A. Martin, S. Spaulding, H. W. Park, A. Alwan, and C. Breazeal, "Towards the Development of Personalized Learning Companion Robots for Early Speech and Language Assessment," in Proc. AERA, 2019.
[6] J. Kennedy, S. Lemaignan, C. Montassier, P. Lavalade, B. Irfan, F. Papadopoulos, E. Senft, and T. Belpaeme, "Child Speech Recognition in Human-Robot Interaction: Evaluations and Recommendations," in Proc. ACM/IEEE HRI, 2017, pp. 82-90.
[7] G. Yeung and A. Alwan, "On the Difficulties of Automatic Speech Recognition for Kindergarten-Aged Children," in Proc. INTERSPEECH, 2018, pp. 1661-1665.
[8] S. Lee, A. Potamianos, and S. Narayanan, "Acoustics of Children's Speech: Developmental Changes of Temporal and Spectral Parameters," J. Acoust. Soc. Am., vol. 105, no. 3, pp. 1455-1468, 1999.
[9] H. K. Vorperian and R. D. Kent, "Vowel Acoustic Space Development in Children: A Synthesis of Acoustic and Anatomic Data," J. Speech Lang. Hear. Res., vol. 50, no. 6, pp. 1510-1545, 2007.
[10] G. Stemmer, C. Hacker, S. Steidl, and E. Nöth, "Acoustic Normalization of Children's Speech," in Proc. EUROSPEECH, 2003, pp. 1313-1316.
[11] X. Cui and A. Alwan, "MLLR-Like Speaker Adaptation Based on Linearization of VTLN with MFCC Features," in Proc. INTERSPEECH, 2005, pp. 273-276.
[12] R. Serizel and D. Giuliani, "Vocal Tract Length Normalization Approaches to DNN-Based Children's and Adults' Speech Recognition," in Proc. SLT, 2014, pp. 135-140.
[13] J. Guo, R. Paturi, G. Yeung, S. M. Lulich, H. Arsikere, and A. Alwan, "Age-Dependent Height Estimation and Speaker Normalization for Children's Speech Using the First Three Subglottal Resonances," in Proc. INTERSPEECH, 2015, pp. 1665-1669.
[14] X. Cui and A. Alwan, "Adaptation of Children's Speech with Limited Data Based on Formant-Like Peak Alignment," Comput. Speech Lang., vol. 20, no. 4, pp. 400-419, 2006.
[15] G. Yeung and A. Alwan, "A Frequency Normalization Technique for Kindergarten Speech Recognition Inspired by the Role of f_o in Vowel Perception," in Proc. INTERSPEECH, 2019, pp. 6-10.
[16] X. Cui, V. Goel, and B. Kingsbury, "Data Augmentation for Deep Neural Network Acoustic Modeling," in Proc. ICASSP, 2014, pp. 5582-5586.
[17] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, "SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition," in Proc. INTERSPEECH, 2019, pp. 2613-2617.
[18] F. Wu, L. P. Garcia, D. Povey, and S. Khudanpur, "Advances in Automatic Speech Recognition for Child Speech Using Factored Time Delay Neural Network," in Proc. INTERSPEECH, 2019, pp. 1-5.
[19] J. Fainberg, P. Bell, M. Lincoln, and S. Renals, "Improving Children's Speech Recognition Through Out-of-Domain Data Augmentation," in Proc. INTERSPEECH, 2016, pp. 1598-1602.
[20] A. K. Syrdal and H. S. Gopal, "A Perceptual Model of Vowel Recognition Based on the Auditory Representation of American English Vowels," J. Acoust. Soc. Am., vol. 79, no. 4, pp. 1086-1100, 1986.
[21] H. Traunmüller, "Perceptual Dimension of Openness in Vowels," J. Acoust. Soc. Am., vol. 69, no. 5, pp. 1465-1475, 1981.
[22] S. Barreda and T. M. Nearey, "The Perception of Formant-Frequency Range is Affected by Veridical and Judged Fundamental Frequency," in Proc. Meetings on Acoustics, 2013, vol. 19, p. 060197.
[23] S. Barreda and T. M. Nearey, "The Direct and Indirect Roles of Fundamental Frequency in Vowel Perception," J. Acoust. Soc. Am., vol. 131, no. 1, pp. 466-477, 2012.
[24] L. N. Tan and A. Alwan, "Multi-Band Summary Correlogram-Based Pitch Detection for Noisy Speech," Speech Commun., vol. 55, no. 7-8, pp. 841-856, 2013.
[25] K. Shobaki, J.-P. Hosom, and R. A. Cole, "The OGI Kids' Speech Corpus and Recognizers," in Proc. ICSLP, 2000, pp. 258-261.
[26] M. Eskenazi, J. Mostow, and D. Graff, "The CMU Kids Speech Corpus LDC97S63," Linguistic Data Consortium, 1997.
[27] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "LibriSpeech: An ASR Corpus Based on Public Domain Audio Books," in Proc. IEEE ICASSP, 2015, pp. 5206-5210.
[28] L. Lu, X. Xiao, Z. Chen, and Y. Gong, "PyKaldi2: Yet Another Speech Toolkit Based on Kaldi and PyTorch," arXiv:1907.05955, 2019.
[29] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlíček, Y. Qian, P. Schwarz, J. Silovský, G. Stemmer, and K. Veselý, "The Kaldi Speech Recognition Toolkit," in Proc. IEEE ASRU, 2011.