End-to-End Neural Systems for Automatic Children Speech Recognition: An Empirical Study
Prashanth Gurunath Shivakumar, Shrikanth Narayanan
University of Southern California, Los Angeles, California 90089, USA
ABSTRACT
A key desideratum for inclusive and accessible speech recognition technology is ensuring its robust performance on children's speech. Notably, this includes the rapidly advancing neural network based end-to-end speech recognition systems. Children speech recognition is more challenging due to the larger intra- and inter-speaker variability in terms of acoustic and linguistic characteristics compared to adult speech. Furthermore, the lack of adequate and appropriate children speech resources adds to the challenge of designing robust end-to-end neural architectures. This study provides a critical assessment of automatic children speech recognition through an empirical study of contemporary state-of-the-art end-to-end speech recognition systems. Insights are provided on the aspects of training data requirements, adaptation on children data, and the effects of children's age, utterance lengths, different architectures and loss functions for end-to-end systems, and the role of language models on the speech recognition performance.
Index Terms — Children Speech Recognition, End-to-End Speech Recognition, Residual Network, Time Depth Separable Convolutional Network, Transformer
1. INTRODUCTION
Creating speech and spoken language technologies (SLT) that are inclusive and broadly accessible requires ensuring that they offer robust performance on children speech. Beyond supporting an important potential segment of users of applications involving conversational interfaces [1], such as for entertainment, interactive gaming, education and learning, such technologies can enable novel child-centric possibilities in support of diagnosis and treatment for a variety of developmental disorders and health conditions [2, 3]. However, the inclusion of the children population in SLT research and development has been lagging behind within the exciting realm of rapid development and deployment of these technologies mainly for the adult population, underscoring an unmet need.

Automatic speech recognition (ASR) is a core SLT technology, and has witnessed accelerated advances since the advent of deep learning. Early attempts at incorporating deep learning into ASR replaced Gaussian mixture models (GMM) with DNNs [4]. The objective of the DNN is to produce a distribution over senones given the input acoustic feature frames. Such a system requires the alignments obtained from the GMM-based hidden Markov models (HMM) for training purposes. [5] introduced connectionist temporal classification (CTC) for sequence data labeling with recurrent neural networks, which eliminates the need for pre-computed alignments and subsequent processing by computing the probability distribution over all the possible label sequences given the input signal sequence. As an alternative to CTC, sequence-to-sequence learning was introduced to compute the mapping between variable length sequences [6]; and attention-based sequence-to-sequence models proved feasible for end-to-end trainable speech recognition systems [7]. The attention mechanism is able to implicitly calculate the alignments between the sequence of input speech feature frames and the output text sequence, provided with large amounts of training data.

Several different end-to-end DNN architectures for ASR have been proposed with CTC and sequence-to-sequence learning frameworks. RNN based architectures [7, 8, 9] and fully convolutional architectures [10, 11] are popular, while some studies have successfully adopted combinations of RNN and convolutional neural networks for end-to-end speech recognition [12]. Residual networks [13, 14, 15] and highway connections [16] have increased the feasibility of training large, deep neural networks. A few works have explored joint CTC and sequence-to-sequence learning architectures [17, 18]. Self-attention and multi-headed attention based neural networks have yielded state-of-the-art ASR performance [19, 20].

The success of deep neural networks (DNN) is mostly attributed to their ability to utilize vast amounts of data to estimate highly non-linear functions, which in turn has resulted in improved acoustic and language modeling. The lack of suitable available child speech data has limited the modeling capabilities of DNN models for children speech recognition. Additionally, the multifaceted signal variability found in children speech poses a number of modeling challenges. Acoustically, increased speech signal variability in children is mainly attributed to the developmental changes of the vocal apparatus [21, 22]. The variability manifests as shifting of formant frequencies and of spectral and temporal characteristics, both within subject and across subjects and age groups [21, 22, 23, 24].
Moreover, children speech is characterized by increased pronunciation variability, mispronunciations, disfluencies and non-verbal vocalizations [25, 23]. Children's speech is known to include repetitions and revisions [26]. Also, the use of language, linguistic and grammatical constructs varies with children. The pronunciation and linguistic variability in children can be attributed to the developing linguistic knowledge and behavior of children.

To address the increased speech signal variability in children, several robust speech signal features, filters and models have been studied and introduced. Several front-end feature transformations, frequency warping, speaker normalization and filtering techniques have been found to be useful [23, 27, 28] for mitigating feature space acoustic variability in children. Vocal Tract Length Normalization (VTLN) [29, 30] techniques have been found particularly beneficial for reducing acoustic variability in robust recognition of children speech. Adaptation techniques such as maximum likelihood linear regression (MLLR) transforms [29, 31, 32] also aid in adapting to the varying acoustic speech patterns found in children. Speaker adaptive training based on constrained MLLR [33] was also found to provide notable improvements by reducing the heightened inter-speaker variability in children [29, 31, 32]. To handle the pronunciation variability, adopting customized dictionaries [34] for children and pronunciation modeling techniques [29] have been successful. Finally, to capture the linguistic variability and language use of children, language models trained on children's speech have been effective, giving improved WER [29, 32, 35].

However, most of the effective modeling adaptations for children speech, like VTLN, MLLR and speaker adaptive training, are restricted to GMM-HMM and DNN-HMM systems. This raises questions about the feasibility of newer end-to-end based models for children speech recognition. Although a few works have explored end-to-end speech recognition for children (in Mandarin), the benefits and their application to handling various aspects of children speech variability have remained unclear [36, 37, 38]. Moreover, these works use a fairly limited amount of children speech data (less than 60 hours).

In this paper, we conduct a methodological study into children speech recognition, particularly investigating the most recent developments in end-to-end speech recognition with established state-of-the-art systems. It aims to contribute by answering the following questions:

1. Do the benefits established with end-to-end speech recognition systems for adult speech translate to children speech?
2. How do the end-to-end systems compare to optimized existing DNN-HMM based children speech recognition systems?
3. Will an end-to-end system's ability to exploit large amounts of speech data compensate for the anomalies found in children speech?
4. Which neural network based end-to-end architectures are most effective for children speech recognition?
5. How do the end-to-end systems perform for children of different age categories?
6. What are the merits/demerits of the end-to-end systems compared to DNN-HMM based systems?

The rest of the paper is organized as follows: Section 2 presents the different DNN architectures, including the DNN-HMM systems and the more recent state-of-the-art end-to-end systems investigated in this study. Section 3 describes the language models (LM) and their architectures employed in our work.
The different decoding techniques investigated as a part of this study are presented in Section 4. The children speech databases used in this study are listed in Section 5, and the experimental setup is described in Section 6. Section 7 presents the results of children speech recognition on adult acoustic models (AM), and the results on children acoustic models are presented in Section 8. Further insights on the recognition performance, covering children's age, amount of data, length of utterance and error analysis, are provided in Section 9. Finally, the study is concluded in Section 10.
2. ACOUSTIC MODELING
In this section, we describe the architectures employed for acoustic modeling in this work. We select three recently proposed end-to-end architectures that have demonstrated state-of-the-art results in speech recognition on popular benchmarking datasets. Additionally, for reference to previous works employing DNN-HMM systems, we also consider a competitive DNN-HMM based speech recognition system. For each end-to-end architecture, we consider two sets of models: one trained fully supervised on LIBRISPEECH and the other trained semi-supervised on LIBRIVOX [20].
2.1. Factorized Time Delay Neural Network (TDNN-F)

Factorized Time Delay Neural Networks are one-dimensional convolutional neural networks (CNN) with special semi-orthogonal constraints [39]. The constraints mimic singular value decomposition in factorizing the weight matrices into products of two smaller factors obtained by dropping small singular values. This preserves the descriptive power of the transformations while significantly reducing the number of parameters. The TDNN-F can be conceptually viewed as introducing an additional bottleneck layer into a traditional convolutional (TDNN) layer. TDNN-F was first introduced for speech recognition, giving performance comparable to that of a TDNN-LSTM system with almost half the parameters [39]. More recently, TDNN-F models have proven their efficacy for children speech recognition [40].

The architecture used in our study comprises 16 TDNN-F blocks with skip-connections. Each block consists of a TDNN-F layer followed by a rectified linear unit (ReLU) non-linearity, a batch normalization layer and a dropout layer. Each TDNN-F layer has a 1536-dimensional TDNN layer and a 160-dimensional bottleneck layer. The lattice-free maximum mutual information (LF-MMI) criterion [41] is adopted for training the TDNN-F acoustic model, along with L2-regularization. We do not use VTLN since its efficacy in conjunction with TDNN-F was not clear from [40].
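For concreteness, the following is a minimal PyTorch-style sketch of one such block, using the dimensions quoted above (1536-dimensional layer, 160-dimensional bottleneck). The semi-orthogonal constraint on the first factor, which Kaldi enforces periodically during training, is omitted here; this is an illustration, not the Kaldi implementation.

```python
import torch
import torch.nn as nn

class TDNNFBlock(nn.Module):
    """One TDNN-F block: a TDNN layer factorized through a linear
    bottleneck, followed by ReLU, batch norm and dropout, with a
    skip connection around the block."""
    def __init__(self, dim=1536, bottleneck=160, context=1, dropout=0.1):
        super().__init__()
        # Factor 1: projection down to the bottleneck (the factor that
        # Kaldi keeps semi-orthogonal; that constraint is omitted here).
        self.down = nn.Conv1d(dim, bottleneck, kernel_size=2 * context + 1,
                              padding=context, bias=False)
        # Factor 2: projection back up to the full dimension.
        self.up = nn.Conv1d(bottleneck, dim, kernel_size=1)
        self.relu = nn.ReLU()
        self.norm = nn.BatchNorm1d(dim)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):            # x: (batch, dim, time)
        y = self.up(self.down(x))
        y = self.drop(self.norm(self.relu(y)))
        return x + y                 # skip connection

x = torch.randn(8, 1536, 100)        # a batch of 100-frame feature maps
print(TDNNFBlock()(x).shape)         # torch.Size([8, 1536, 100])
```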
2.2. Residual Networks (ResNet)

The ResNet was first proposed for the task of image recognition [42]. Increasing the depth of a DNN allows for modeling more complex functions; however, optimization and convergence of DNNs get harder as the depth of the network increases, which limits the number of layers in the network. This is partly attributed to gradients getting too small (vanishing gradients) or too large (exploding gradients). ResNets model residual functions using skip-connections (shortcut connections skipping a block of layers) rather than the original unreferenced mapping. It has been found that optimizing the referenced residual functions is easier and alleviates the vanishing/exploding gradient problem, thereby allowing deeper networks to estimate complex functions efficiently [42]. ResNets have been adopted successfully for speech recognition [43, 44, 14]. Both LSTM [13, 15] and convolution [13, 43, 44, 14] blocks have been proposed with skip connections for ASR.

In this work, we employ the architecture proposed in [20]. The input signal is processed using a SpecAugment layer [45] and mapped to an embedding space of dimension 1024 using a 1-D convolution layer with stride 2. The ResNet encoder comprises 12 blocks of three 1-D convolution layers with a kernel size of 3. Each convolution layer is followed by a ReLU non-linearity, dropout and LayerNorm [46]. The dropout and number of hidden units increase with the depth of the network, and additional convolution layers are inserted between ResNet blocks to increase the hidden dimension. Three max pooling layers with stride 2 are inserted after blocks 3, 7 and 10. The encoder architectures are identical for the CTC and sequence-to-sequence losses, except that the encoder for the sequence-to-sequence model has lower dropout in the deeper layers and the last bottleneck layer is removed. The decoder for the sequence-to-sequence model applies two rounds of key-value attention (see Equation 1), as in [47, 48], through 3 (LIBRISPEECH AM) or 2 (LIBRIVOX AM) layers of RNN-GRU of dimension 512 each, followed by a dropout layer.

2.3. Time-Depth Separable Convolutions (TDS)

[47] introduced time-depth separable convolutions for speech recognition with a sequence-to-sequence end-to-end architecture. The TDS architecture has been shown to generalize better than typical deep convolutional architectures with fewer parameters. Significant improvements were achieved on the LIBRISPEECH dataset with TDS layers compared to models based on RNNs and convolutional networks [47].

The core concept of the TDS block is to separate time aggregation from channel mixing and thus increase the receptive field. The TDS block comprises a 2-D convolutional layer followed by a ReLU non-linearity and residual connection, followed by LayerNorm [46]. Finally, the output is reshaped and passed through two fully-connected layers with a ReLU non-linearity in between, followed by layer normalization. Moreover, a sub-sampling factor of 8 is applied using 3 sub-sampling layers with a stride of 2 each. The sub-sampling layers are followed by a ReLU and layer normalization.

The architecture used in this study is similar to [20]. The input signal is processed using a SpecAugment layer [45] and mapped to an embedding space using a 2D-convolution layer with a stride of size 2 × 1. For the LIBRISPEECH AM, 3 groups of TDS blocks are employed, containing 5, 6 and 10 TDS blocks with 10, 14, and 18 channels respectively. For the LIBRIVOX AM, 4 groups of TDS blocks are employed, containing 2, 2, 5 and 6 TDS blocks with 16, 16, 32, and 48 channels respectively. The number of channels in the feature maps spanning the two internal fully-connected layers is increased by a factor of 3 (LIBRISPEECH AM) or 2 (LIBRIVOX AM) via sub-sampling 2D-convolutional layers. All the 2D-convolutional layers are followed by a ReLU non-linearity and LayerNorm [46]. The kernel size of both the TDS blocks and the 2D-convolutions is set to 21 × 1.

2.4. Transformer

Transformer networks were first introduced for the task of machine translation [48], significantly advancing the state-of-the-art. Since then, transformers have dominated the fields of natural language processing [49], speech recognition [19], and spoken language technologies [50], as well as the computer vision and image processing domains [51]. The transformer is a neural sequence transducer with an encoder-decoder architecture based solely on attention mechanisms. It employs 6 stacked multi-headed self-attention layers, each followed by fully connected layers, for both the encoder and decoder. Self-attention is described in terms of mapping a query and a set of key-value pairs to an output, and is defined as:
\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^{T}}{\sqrt{d_k}} \right) V \tag{1} \]

It is essentially a softmax-weighted sum of the values V, where the weights are the dot-products of the two matrices Q (query) and K (keys), each corresponding to a collection of sequences of input vectors, scaled by the dimension d_k of the key vectors. The term multi-head refers to projecting the key, value and query vectors into multiple subspaces, running multiple self-attentions in parallel on each to derive multiple outputs, and concatenating the outputs. Multi-head attention is given by:

\[ \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_n)\, W^{M}, \qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V}) \tag{2} \]

where W_i^Q, W_i^K, W_i^V are the projections corresponding to head_i for the query, key and value respectively.

In this work, we adopt the architecture specified in [20] for training the acoustic model. A front-end of 3 (LIBRISPEECH AM) or 6 (LIBRIVOX AM) layers of 1-D convolutions, each with kernel size 3 and an overall stride of 8 frames (80 ms), is used as feature extraction for the subsequent transformer blocks. The (input, output) sizes are (80, D_c) for the first layer, (D_c/2, D_c) for the intermediate layers and (D_c/2, D_tr) for the last layer, with D_c = 1024 and D_tr = 1024 for self-attention and 4096 for the feed-forward network (LIBRISPEECH AM), or D_c = 2048 and D_tr = 768 for self-attention and 3072 for the feed-forward network (LIBRIVOX AM). Each convolution layer is strided by 2 (LIBRISPEECH AM), or by 2 every alternate layer (LIBRIVOX AM). Each convolution is followed by a gated linear unit (GLU), dropout and LayerNorm. Each transformer block then uses 4-head attention with a skip (residual) connection followed by layer normalization, a feed-forward layer and one hidden layer with ReLU non-linearity. Additionally, a skip (residual) connection is used across the entire transformer block. Dropout is used on the self-attention, and layer-drop [52] is employed to drop entire feed-forward layers. The encoder consists of 24 (LIBRISPEECH AM) or 36 (LIBRIVOX AM) stacked transformer blocks. Identical encoder architectures are used for both the CTC and sequence-to-sequence models. For the sequence-to-sequence models, the decoder is made up of 6 stacked transformer blocks with 4 attention heads and an encoding dimension of 256.
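A minimal PyTorch sketch of Equations (1)-(2) may make the tensor shapes concrete; the dimensions and the explicit weight arguments here are illustrative, not the configuration of [20].

```python
import torch
import torch.nn.functional as F

def attention(q, k, v):
    # Eq. (1): softmax(Q K^T / sqrt(d_k)) V
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    return F.softmax(scores, dim=-1) @ v

def multi_head(q, k, v, w_q, w_k, w_v, w_m, n_heads):
    # Eq. (2): project Q, K, V into n_heads subspaces, attend in
    # parallel, concatenate the heads, apply the output projection W^M.
    B, T, d = q.shape
    def heads(x, w):                       # (B, T, d) -> (B, h, T, d/h)
        return (x @ w).view(B, T, n_heads, -1).transpose(1, 2)
    out = attention(heads(q, w_q), heads(k, w_k), heads(v, w_v))
    return out.transpose(1, 2).reshape(B, T, d) @ w_m

B, T, d, h = 2, 50, 512, 4
x = torch.randn(B, T, d)                   # self-attention: Q = K = V = x
w = [torch.randn(d, d) for _ in range(4)]
print(multi_head(x, x, x, *w, n_heads=h).shape)   # torch.Size([2, 50, 512])
```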
3. LANGUAGE MODELING
In this section, the language models used in beam-search decoding are described. We experiment with 4 types of language models: (i) a word-based 4-gram LM, (ii) a word-piece 6-gram LM, (iii) a word-based gated convolutional neural network (GCNN) LM, and (iv) a word-piece based GCNN LM [53]. Since we employ a lexicon during decoding for the CTC based models, the word-based 4-gram LM and GCNN LM are restricted to CTC models. In the case of sequence-to-sequence models, we employ lexicon-free decoding [54] along with the word-piece based 6-gram LM and the GCNN word-piece LM.

Gated convolutional neural networks were first proposed for the task of language modeling [53]. GCNN is among the first non-recurrent, highly parallelizable, finite-context approaches with stacked convolutions to give a competitive, low-latency alternative to strong recurrent language models. The gating mechanism alleviates the vanishing gradient problem, enabling deeper networks for language modeling. The gating operation in GCNN is formulated as:

\[ h(X) = (X * W + b) \otimes \sigma(X * V + c) \tag{3} \]

where h is the hidden layer, X ∈ R^{N×m} is the input to layer h, W and V ∈ R^{k×m×n} are the weights of the 1-D convolution layers, b and c ∈ R^n are the biases, σ is the sigmoid function, ⊗ denotes the element-wise product, and m, n, k and N are the input feature map size, output feature map size, patch size and length of the input sequence respectively.

The architecture of the GCNN LM is borrowed from [53] (the GCNN-14B architecture). The GCNN-14B bottleneck architecture comprises an embedding layer which maps the input words to a fixed dimension of 1024. The embedding layer is followed by 14 residual blocks. The first residual block contains one gated 1-D convolution layer with [kernel size, output size] of [5, 512]. Residual blocks 2 to 4 comprise 3 gated 1-D convolution layers each, with [1, 128], [5, 128], [1, 512]; residual blocks 5 to 7 comprise 3 gated 1-D convolution layers each, with [1, 512], [5, 512], [1, 1024]; residual blocks 8 to 13 comprise 3 gated 1-D convolution layers each, with [1, 1024], [5, 1024], [1, 2048]; and the final residual block contains gated 1-D convolution layers with [1, 1024], [5, 1024], [1, 4096]. The softmax layer outputs the probability distribution over all the words/word-pieces in the vocabulary.
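As an illustration, a minimal PyTorch sketch of the gating in Equation (3) (a single gated convolution, not the full GCNN-14B stack) could look as follows; the channel sizes are illustrative.

```python
import torch
import torch.nn as nn

class GatedConv1d(nn.Module):
    """Gating of Eq. (3): h(X) = (X*W + b) ⊗ sigmoid(X*V + c),
    where * denotes 1-D convolution."""
    def __init__(self, in_ch, out_ch, kernel_size):
        super().__init__()
        self.linear = nn.Conv1d(in_ch, out_ch, kernel_size)  # X*W + b
        self.gate = nn.Conv1d(in_ch, out_ch, kernel_size)    # X*V + c

    def forward(self, x):          # x: (batch, in_ch, seq_len)
        return self.linear(x) * torch.sigmoid(self.gate(x))

x = torch.randn(4, 1024, 30)       # embedded 30-token sequences
print(GatedConv1d(1024, 512, 5)(x).shape)   # torch.Size([4, 512, 26])
```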
4. ASR DECODING
Decoding is the process of scoring hypotheses with the acoustic model and the language model to derive the final output. In this study, we assess two specific types of decoding: (i) a beam-search decoder, and (ii) a greedy decoder.
The output of the neural network can be viewed as a C × T matrix (lattice) with probabilities over each class c ∈ {1 . . . C} for each time step t ∈ {1 . . . T}. Each path through the lattice represents a possible ASR hypothesis, which can be scored by a LM to further influence the score of the acoustic model. A typical beam-search decoder outputs the hypothesis that maximizes:

\[ \log P_{AM}(y \mid x) + \alpha \log P_{LM}(y) + \beta |y| \tag{4} \]

where y is the output hypothesis, x is the input acoustic feature sequence, α is the LM weight and β is the word insertion penalty. Additionally, for sequence-to-sequence models an end-of-sentence (EOS) penalty is adopted to control the output hypothesis lengths. The LM weight, word insertion penalty and EOS penalty are all tuned using grid search in our experiments. In order to keep the memory and computation complexity tractable, only the top few states are retained at each time step; the number of states retained defines the beam size.

In our experiments, we consider two types of beam-search decoding: (i) lexicon-based, and (ii) lexicon-free. With lexicon-based decoding, a dictionary mapping is used to convert the output of the acoustic model to words, and thus the beam-search space is restricted to words in the dictionary. With lexicon-free decoding, the beam-search space is not restricted to words and operates on word-pieces, and is thus capable of outputting words with arbitrary spellings. Lexicon-based decoding requires a word-level LM, while lexicon-free decoding requires a LM whose input tokens are word-pieces. As suggested in [20], we adopt lexicon-based decoding for models trained with the CTC loss and lexicon-free decoding for sequence-to-sequence models.

With greedy decoding, there is no language model involved, and the most probable output of the acoustic model is taken as final. End-to-end acoustic models are capable of learning a language model inherently given enough training data. Studies such as [20] have shown that given large amounts of data, greedy decoding without a language model performs just as well as beam-search decoding with a large language model.
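To make the scoring of Equation (4) concrete, here is a toy beam-search sketch over a T × C matrix of frame-level log-probabilities. `lm_logprob` is a hypothetical callable, and the blank handling and prefix merging of a real CTC decoder are omitted; this is an illustration of the scoring, not the wav2letter++ decoder.

```python
def beam_search(frame_logprobs, lm_logprob, alpha, beta, beam_size):
    """Toy beam search: each hypothesis is ranked by
    log P_AM(y|x) + alpha * log P_LM(y) + beta * |y|  (Eq. 4)."""
    def score(seq, am):
        return am + alpha * lm_logprob(seq) + beta * len(seq)

    beams = [((), 0.0)]                       # (label sequence, log P_AM)
    for frame in frame_logprobs:              # frame: log-probs over C classes
        expanded = [(seq + (c,), am + lp)
                    for seq, am in beams
                    for c, lp in enumerate(frame)]
        expanded.sort(key=lambda h: score(*h), reverse=True)
        beams = expanded[:beam_size]          # keep only the top states
    return beams[0][0]

# Usage with a uniform dummy LM over a 3-class, 2-frame lattice:
import math
lattice = [[math.log(p) for p in row] for row in
           [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]]
print(beam_search(lattice, lambda seq: 0.0, alpha=0.5, beta=0.0, beam_size=2))
# -> (0, 1)
```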
5. DATABASES
The choice of the children's speech corpora in our study is mainly based on (i) the amount of children speech available, and (ii) a good distribution of age demographics among the children for analysis purposes. We make use of two popular children's speech corpora:
5.1. MyST Corpus

The MyST Corpus [55, 56] is one of the largest publicly available collections of English children's speech. It consists of 499 hours (244,069 utterances) of conversational speech between children and a virtual tutor. The corpus covers 1,372 students from third, fourth and fifth grades having conversations spanning 9 areas of science. This makes the corpus larger than all other available children's English speech corpora combined. However, only 42% of the corpus is annotated for ASR purposes, i.e., 103,429 utterances (233 hours). The transcribed subset of the corpus was further cleaned, and 98,318 utterances (223.23 hours) from 737 speakers were retained. The database is randomly split into three parts for training, development and a held-out test set without speaker overlap. The details of the split are presented in Table 1.

Corpus      Train                Development         Test
MyST        88,318 utterances    5,000 utterances    5,000 utterances
            197.72 hours         12.23 hours         13.28 hours
            678 speakers         25 speakers         34 speakers
OGI Kids    1,099 utterances, 30.5 hours, 1,099 speakers (evaluation only)

Table 1: Statistics of the children speech corpora
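A minimal sketch of such a speaker-disjoint random split follows; the helper and the split fractions are hypothetical illustrations, not the exact MyST partitioning script.

```python
import random
from collections import defaultdict

def speaker_disjoint_split(utterances, dev_frac=0.05, test_frac=0.05, seed=0):
    """Randomly split (speaker_id, utterance) pairs into train/dev/test
    so that no speaker appears in more than one partition."""
    by_speaker = defaultdict(list)
    for spk, utt in utterances:
        by_speaker[spk].append(utt)
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)
    n_dev = int(len(speakers) * dev_frac)
    n_test = int(len(speakers) * test_frac)
    parts = {"dev": set(speakers[:n_dev]),
             "test": set(speakers[n_dev:n_dev + n_test])}
    split = {"train": [], "dev": [], "test": []}
    for spk, utts in by_speaker.items():
        name = next((p for p, s in parts.items() if spk in s), "train")
        split[name].extend(utts)
    return split
```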
5.2. OGI Kids Corpus

To cover a broad range of age demographics among children, for investigating age related effects, we additionally make use of the OGI Kids speech corpus [57]. The OGI Kids corpus consists of 1,100 children ranging from kindergarten to 10th grade. In this study, we only select the spontaneous speech subset of the data with annotated transcripts, since spontaneous children speech is believed to be more complex in both its acoustic and linguistic constructs compared to prompted speech [58]. In the spontaneous speech portion, the experimenter asks the child a series of questions to elicit a spontaneous response. In our study, we use this corpus for evaluation purposes only. The statistics are presented in Table 1 and the age distributions are presented in Figure 1.

Fig. 1: Speaker Age Distribution for the OGI Kids Corpus (speakers per grade, KG through 10; KG refers to Kindergarten)
6. EXPERIMENTAL SETUP

6.1. Hybrid TDNN-F HMM Acoustic Model
The Kaldi ASR toolkit [59] was used for training the TDNN-F based hybrid DNN-HMM acoustic model. For the baseline adult models, we use the pre-trained models trained on LIBRISPEECH [60] made available by the Kaldi developers (http://kaldi-asr.org/models/m13). 13-dimensional Mel-frequency cepstral coefficient (MFCC) features were extracted with a window size of 25 ms and a window shift of 10 ms, with delta and delta-delta coefficients, for training the GMM-HMM system. A GMM-HMM system with linear discriminant analysis (LDA), maximum likelihood linear transform (MLLT) and feature-space maximum likelihood linear regression (fMLLR) based speaker adaptive training (SAT) is used to obtain the alignments needed to train the TDNN-F acoustic model. 40-dimensional MFCC features with a left and right context of 1 frame, along with 100-dimensional i-vector features, were used to train the TDNN-F acoustic model. The i-vectors were trained in-domain on LIBRISPEECH using 40-dimensional MFCC features with a left and right context of 3 and a subsequent PCA dimension reduction. The TDNN-F acoustic model is trained to predict among 6024 Gaussian mixtures.

For adaptation on children data, we perform transfer learning due to its performance advantages on children's speech data [32]. We initialize the acoustic model with the pre-trained adult model trained on LIBRISPEECH. The last layer is removed, and new, randomly initialized TDNN-F and output linear layers are added to the model. The transferred layers are updated with a smaller learning rate (0.25 times) while the newly added layers are trained with a higher learning rate on the MyST training corpus. The MyST corpus is force-aligned using the pre-trained model to obtain the alignments. The i-vectors for the children data are extracted from the LIBRISPEECH i-vector model. The TDNN-F model is optimized for the LF-MMI criterion using stochastic gradient descent with a 0.001 learning rate for 4 epochs. Convergence is verified using the development subset of the MyST corpus.

6.2. End-to-End Acoustic Models

All the end-to-end ASR experiments are carried out with the wav2letter++ toolkit [61]. For evaluations of the baseline (un-adapted) adult models we use the two versions of the models presented in [20]: (i) the model trained on LIBRISPEECH [60], and (ii) the semi-supervised model trained on LIBRIVOX (https://librivox.org). The model trained on LIBRISPEECH is fully supervised. The supervised LIBRISPEECH model is further used to decode the entire LIBRIVOX database to generate labels for the unlabeled LIBRIVOX dataset; for this purpose, a Transformer network trained with the CTC loss, with beam-search decoding using a 4-gram language model, is employed. The semi-supervised model is trained by combining LIBRISPEECH with true labels and the labels generated for the LIBRIVOX corpora. Since the semi-supervised LIBRIVOX model has orders of magnitude more data than LIBRISPEECH, two sets of architectures are used, differing in the number of parameters. In this paper, we utilize the pre-trained acoustic models open-sourced for adult ASR (https://github.com/facebookresearch/wav2letter/tree/v0.2/recipes/models/sota/2019). More details regarding the experimental setup and hyper-parametrization of the models can be found in [20].

For adaptation with children's speech data, instead of training the models from scratch, we initialize the acoustic model with the adult models trained on LIBRISPEECH and LIBRIVOX and fine-tune the entire model, as suggested in [32].
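The discriminative learning rates used for the TDNN-F adaptation above (0.25x the base rate for transferred layers, the full rate for the newly added layers) can be expressed in a few lines. This is a hypothetical PyTorch rendering of the idea with stand-in modules, not the actual Kaldi recipe.

```python
import torch
import torch.nn as nn

# Stand-ins for the pre-trained stack and the re-initialized layers.
transferred = nn.Sequential(nn.Linear(40, 1536), nn.ReLU())   # from adult AM
new_layers = nn.Sequential(nn.Linear(1536, 6024))             # randomly initialized

base_lr = 0.001
optimizer = torch.optim.SGD([
    {"params": transferred.parameters(), "lr": 0.25 * base_lr},  # transferred layers
    {"params": new_layers.parameters(), "lr": base_lr},          # newly added layers
])
```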
The end-to-end AMs are trained using 80-dimensional (channel) log-mel filterbank features extracted using a Hamming window with a window shift of 10 ms, and a window size of 25 ms for the Transformers or 30 ms for the TDS and ResNet models. All the acoustic models output a probability distribution over 10k word pieces [62] generated using the SentencePiece toolkit (https://github.com/google/sentencepiece).

For the ResNet and TDS based models, the batch size is set to 4 and the dropout is in the range [0.05, 0.2], increasing with depth. The momentum is set to 0.5 for ResNet-CTC, 0.1 for ResNet-S2S, 0.1 for TDS-CTC and 0.0 for the TDS-S2S model. In the case of the Transformer models, linear learning rate warm-up is applied for 30k updates, the dropout and layer-drop are set to 0.2 for all Transformer blocks, the momentum is set to 0.95 and a batch size of 5 is adopted. For training the sequence-to-sequence models, 99% teacher forcing, 1% word-piece sampling and 5% label smoothing are employed. For the sequence-to-sequence Transformer models, the dropout and layer-drop in the decoder are set to 0.1. The learning rate is set to 0.01 with a step-wise learning rate schedule decreasing by a factor of 2 every 150 updates. Stochastic gradient descent (SGD) is used for updating the ResNet and TDS models, and Adagrad is used for the Transformers. The models are fine-tuned for 10 epochs and convergence is verified.

During beam-search decoding, we use a beam size of 500 for CTC models and 50 for sequence-to-sequence models; the LM weights are tuned in [0.1, 1.3], the word insertion penalty in the range [0.1, 1.3] and the EOS penalty in the range [-10.0, -4.0] on the development dataset.

6.3. Language Models

The base language models are trained on the LIBRISPEECH LM corpus (https://openslr.org/11/) containing data from 14,500 public domain books. The 4-gram word LM and 6-gram word-piece LM are trained using the KenLM toolkit [63]. The 4-gram LM does not employ any pruning; however, the word-piece based 6-gram models involve pruning 5-grams appearing once and 6-grams appearing twice or fewer. The GCNN LMs are trained using the fairseq toolkit [64] (https://github.com/pytorch/fairseq). More details regarding the setup can be found in [54] and [20].

To include children's data in the language modeling, we make use of the text from the MyST training subset of the corpus. In the case of the n-gram based models, independent LMs are trained, i.e., one word-based 4-gram and one word-piece based 6-gram model with a setup similar to that described earlier. The children LM is then interpolated with the LIBRISPEECH LM, with interpolation weights tuned on the development set of the MyST corpus text data. In the case of the GCNN LM, the neural network is initialized with the weights from the corresponding LIBRISPEECH LMs and then fine-tuned on the MyST train subset. The GCNN is optimized with Nesterov accelerated gradient descent for 20 epochs with a learning rate of 0.0001 and a momentum of 0.99. Gradient clipping and weight normalization are employed for stabilization [53].
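The n-gram interpolation amounts to a per-token linear mixture of the two LMs, with the mixture weight chosen to minimize development-set perplexity. A minimal sketch follows, assuming hypothetical inputs (log10 scores, as KenLM produces); it is not the actual tuning script.

```python
import math

def interp_logprob(lp_libri, lp_myst, lam):
    """log10 of lam * P_libri(w|h) + (1 - lam) * P_myst(w|h)."""
    return math.log10(lam * 10 ** lp_libri + (1 - lam) * 10 ** lp_myst)

def tune_lambda(dev_scores, grid=tuple(i / 10 for i in range(1, 10))):
    """Pick the interpolation weight minimizing dev-set perplexity.
    `dev_scores` is a list of (lp_libri, lp_myst) pairs, one per token."""
    def perplexity(lam):
        avg = sum(interp_logprob(a, b, lam) for a, b in dev_scores) / len(dev_scores)
        return 10 ** (-avg)
    return min(grid, key=perplexity)

# Toy usage with made-up per-token log10 probabilities:
dev = [(-2.1, -1.3), (-3.0, -2.2), (-1.8, -2.5)]
print(tune_lambda(dev))
```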
7. RESULTS: ADULT ACOUSTIC MODELS
In this section, we present the results comparing the DNN-HMM and the state-of-the-art end-to-end acoustic models trained on adult speech when applied to children speech recognition.

Table 2 lists the DNN-HMM system and the various end-to-end ASR systems, all trained on exactly the same data (960 hours of LIBRISPEECH) and incorporating identical language models. Testing on the test-clean subset of LIBRISPEECH adult speech, the TDNN-F based DNN-HMM system achieves a WER of 5.94%. Comparatively, the best performing end-to-end ASR, based on the Transformer architecture with sequence-to-sequence training and incorporating a gated-CNN (GCNN) word-piece language model, achieves a WER of 2.4%, i.e., a relative improvement of 59.6%. In terms of LER, the relative improvement is similar, i.e., 59.46%.
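For clarity, the relative improvements quoted throughout the paper are the error-rate reduction normalized by the baseline error rate; for the WERs just quoted:

\[ \frac{\mathrm{WER}_{\text{DNN-HMM}} - \mathrm{WER}_{\text{E2E}}}{\mathrm{WER}_{\text{DNN-HMM}}} = \frac{5.94 - 2.4}{5.94} \approx 59.6\% \]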
Columns 6-9 of Table 2 list the results of children speech recognition on the MyST Kids corpus and the OGI Kids corpus. First, we observe that both the LER and WER increase for children speech, and the results for the OGI Kids corpus are relatively worse compared to the MyST Kids corpus. This is expected, since the MyST corpus contains speech data for children in grades 3-5, whereas the OGI Kids corpus contains children ranging from kindergarten to 10th grade (see Figure 1). We believe the inclusion of data from younger children, i.e., kindergarten to 3rd grade, in the OGI Kids corpus is the main factor for its lower performance compared to the MyST corpus. Assessing the improvements of the end-to-end ASR over the DNN-HMM system, while a relative 45.24% reduction in WER (37.51% reduction in LER) is observed in the case of the MyST corpus, only an 8.38% reduction in WER (7.27% reduction in LER) is observed with the OGI Kids corpus.

In comparison to the corresponding adult acoustic models, for the MyST corpus with the TDNN-F HMM system the WER is over 7 times worse for children, and for the best performing end-to-end ASR the WER is nearly 10 times worse. For the OGI Kids corpus, with the TDNN-F DNN-HMM system the WER is over 8 times worse for children, and for the end-to-end ASR the WER is nearly 19.5 times worse in comparison to adult speech recognition. Although the end-to-end systems give improvements in absolute WER compared to the DNN-HMM based systems, they undergo a higher degree of degradation and are relatively less generalizable towards children speech. Overall, the state-of-the-art end-to-end systems setting high benchmarks on adult speech are far from achieving the same level of performance for children speech.
ASR Model              LM          LIB test-clean     MyST test        OGI Kids
                                   LER     WER        LER     WER      LER     WER
KALDI TDNN-F DNN-HMM   4-gram      2.22    5.94       26.98   47.90    36.04   53.55
Greedy Decoding
ResNet + CTC           -           1.57    4.25       21.24   36.82
Beamsearch Decoding
ResNet + S2S           GCNN-wp     1.85    3.79       64.98   86.09    83.13   94.36
TDS + CTC              GCNN        1.63    3.40       25.73   36.28    39.59   52.15
TDS + S2S              GCNN-wp     1.17    2.93       38.33   53.77    87.58   90.26
Transformer + CTC      GCNN        1.12    2.58       17.43

Table 2: Results on models trained on LIBRISPEECH

In this section, we assess the results of exploiting large amounts of adult speech data for training end-to-end acoustic models. Table 3 presents the results with acoustic models trained on the combination of LIBRISPEECH (960 hours) and LIBRIVOX (53,800 hours) (semi-supervised). Compared to the results in Table 2, the performance on adult speech (the test-clean subset of LIBRISPEECH) improves by a relative 22.22% in LER and 9.58% in WER. Evaluating on children speech, on the MyST corpus the relative improvement with the additional 53,800 hours of training data is 6.82% in LER and 3.89% in WER, and on the OGI Kids corpus the relative improvement is 25.13% in LER and 24.22% in WER. With our experiments, we find that exploiting large amounts of speech data for the acoustic model, even adult speech, yields improvements for children speech recognition. A detailed analysis of these improvements is provided in section 9.
For adult speech recognition, the best results in both LER and WER are observed with beam-search decoding (see Table 2). The relative improvement obtained with beam-search decoding over greedy decoding is 16.96% in WER and 11.76% in LER. Beam-search decoding is able to exploit additional knowledge from language models, especially the GCNN based LM, to provide considerable improvements over greedy decoding. However, with a significant increase in the training data (see Table 3), greedy decoding outperforms beam-search decoding on adult speech in terms of LER (a 3.8% reduction), and the gains of beam-search decoding in terms of WER shrink to 4.82%. Overall, with large amounts of training data greedy decoding benefits, and approaches the performance of beam-search decoding by learning an implicit language model [20].

For children speech recognition, greedy decoding results in better LER than beam-search decoding, i.e., by 3.27% for the MyST corpus and 0.42% for OGI Kids (see Table 2). However, better WERs are obtained with beam-search decoding: a relative improvement of 10.32% for MyST and 2.6% for OGI Kids. Greedy decoding benefits more from additional speech data (see Table 3), with improvements on the order of a 30.86% reduction in WER on the MyST corpus and a 24.56% reduction on the OGI Kids corpus. With large amounts of data, the performance of greedy decoding approaches that of beam-search decoding even for children speech, similar to the observations made for adult speech recognition [20].
For evaluations of the adult speech models trained on LIBRISPEECH (Table 2), we observe that the Transformer based architecture gives the best results (an 18.09% WER reduction over the TDS networks), followed by the Time-Depth Separable networks and then the Residual Networks. We find that the Transformer based architecture consistently gives better results in terms of both LER and WER for adult speech with both greedy and beam-search decoding. This trend also translates to models trained on the 54,760 hours of LIBRISPEECH combined with LIBRIVOX, i.e., the Transformer networks give a relative 4.41% WER reduction over the TDS networks.

ASR Model              LM              LIB test-clean     MyST test        OGI Kids
                                       LER     WER        LER     WER      LER     WER
Greedy Decoding
ResNet + CTC           -               0.93    2.74       16.81   28.26    25.75   38.00
ResNet + S2S           -               1.11    2.70       28.11   41.07    68.33   79.77
TDS + CTC              -               0.98    2.85       17.71   29.25    26.11   38.24
TDS + S2S              -               0.85    2.40       21.06   32.29    73.48   76.49
Transformer + CTC      -               0.87    2.59
Beamsearch Decoding
TDS + S2S              6-gram-wp       0.86    2.40       21.05   31.93    71.94   74.62
Transformer + CTC      4-gram          1.04    2.52       17.57   25.21    54.70   58.66
Transformer + S2S      6-gram-wp       0.79    2.25       25.91   39.25    67.85   81.56
ResNet + CTC           GCNN            1.09    2.45       18.33   26.00    31.59   37.78
ResNet + S2S           GCNN-wp         1.10    2.65       27.43   37.77    86.36   90.64
TDS + CTC              GCNN            1.16    2.54       19.28   27.01    30.77   37.42
TDS + S2S              GCNN-wp         0.86    2.27       23.93   36.22    70.48   77.32
Transformer + CTC      GCNN            1.03    2.41       16.79

Table 3: Results on models trained on LIBRISPEECH + LIBRIVOX (58k hours). (M) refers to an LM interpolated with the MyST LM.

In the experiments on children speech, the Transformer based architecture again proves favorable when evaluating on MyST children speech; its improvement over the ResNet architecture is 19.24% (Table 2). The addition of training data (54.8k hours) leads to an improvement of 19.95% with the ResNets and 25.55% with the TDS networks. However, the improvements are minimal with the Transformers (7.51%), and the performance advantage of the Transformers over the ResNets decreases to 6.69%.

The evaluations on the OGI Kids corpus show that the ResNets and TDS networks outperform the Transformer networks. The performance of the Transformer networks drops significantly relative to the best results obtained with the ResNets and TDS networks for models trained on 960 hours (a 2.69% increase in WER). The addition of training data (54.8k hours) leads to improvements with the ResNets (19.95%) and TDS networks (25.55%), but, interestingly, the performance of the Transformer networks drops by 7.88% WER. Overall, the WER with the Transformers is 46.13% worse relative to the best performance obtained with the TDS networks. We believe the increased variability due to the diverse age range in the OGI corpus (inter-age acoustic variability in children) impacts the Transformer networks negatively. This indicates that the Transformer networks are less generalizable for children speech. Further analysis of this aspect is provided in section 9.
Observations made on the test-clean subset of LIBRISPEECH indicate that sequence-to-sequence training gives the best performance for adult speech. However, the observations are reversed for children speech recognition, on both the MyST and OGI Kids corpora. The performance of the sequence-to-sequence models is always much worse compared to their CTC counterparts. Moreover, the performance of the sequence-to-sequence models almost breaks down on the OGI Kids corpus. We believe this is because the heightened variability found in children in terms of speaking rate, varying phoneme durations and acoustic characteristics poses problems for alignment in the case of sequence-to-sequence models, which implicitly estimate attention-based alignments. On the other hand, the CTC models with explicit alignments are more robust to children speech. Another important factor is that the utterance lengths in the OGI Kids corpus are much longer than those in the MyST corpus, and sequence-to-sequence networks have been shown to have problems with processing long time sequences [7]. Another notable observation in our experiments is that in most cases the ResNet-CTC models perform better than the TDS-CTC models, while the TDS-S2S models perform better than the ResNet-S2S models.

ASR Model              LM                MyST test        OGI Kids
                                         LER     WER      LER     WER
KALDI TDNN-F DNN-HMM   4-gram            11.67   19.51    30.40   44.74
Greedy Decoding
ResNet + CTC           -                 12.08   19.53    22.44   35.82
ResNet + S2S           -                 20.44   27.53    65.66   76.49
TDS + CTC              -                 11.70   20.04
Beamsearch Decoding
TDS + S2S              6-gram-wp (M)     13.24   18.77    60.15   65.09
Transformer + CTC      4-gram (M)        10.19   16.74    37.18   48.90
Transformer + S2S      6-gram-wp (M)     11.78   16.81    63.26   71.11

Table 4: Results on models fine-tuned on the MyST corpus. (M) refers to an LM interpolated with the MyST LM.
The GCNN based LM provides modest gains over the n-gram models for adult speech recognition. The gains are more prevalent for models trained on LIBRISPEECH data (a relative improvement of 11.76% WER) than for acoustic models trained on the additional 53,800 hours of LIBRIVOX data (a relative improvement of 3.56% WER). Decoding the MyST children corpus, we find the GCNN LM provides improvements of up to 4.76% WER with the LIBRISPEECH acoustic models, which reduces to 3.77% WER with the added LIBRIVOX data. On the OGI Kids corpus, we find the GCNN LM to be effective only with the LIBRISPEECH acoustic models; it fails to provide improvements with the acoustic model trained on the additional LIBRIVOX dataset.

Table 3 also presents the results of the adult acoustic models in conjunction with a children LM. The children LM is the LIBRISPEECH LM interpolated with an LM trained on the training subset of the MyST corpus. The results show a definite improvement when decoding the MyST test corpus for all the model architectures. Considering the best results, the children LM provides an improvement of 3.87% relative to the adult LM. However, we find no improvements when testing on the OGI Kids corpus. In the context of the perplexity analysis presented in section 9.5, the reduction in WER is minimal even though large improvements were observed in perplexity values on the MyST corpus. This finding can be attributed to two factors: (i) the end-to-end architectures have the ability to implicitly learn language provided enough speech data, and (ii) the acoustic variability of children dominates in our setup.
8. RESULTS: CHILDREN ACOUSTIC MODEL
In this section, results are presented for the models fine-tuned on the MyST dataset. All the results also incorporate interpolated language models, i.e., an interpolation of the language models for LIBRISPEECH and the training subset of the MyST dataset, with interpolation weights tuned on the MyST development dataset. The results are listed in Table 4. Comparing the results with respect to the adult acoustic models (Table 3), we observe a significant performance boost for evaluations made on the in-domain MyST test corpus. The LER of the best performing model improves by a relative 41.63% and the WER by 34.01%. Moreover, we also find significant improvements in the out-of-domain evaluations made on OGI Kids: an improvement of 11.31% in LER and 9.52% in WER. We note that the improvements on the OGI Kids corpus are much smaller than the improvements on the in-domain MyST test set. This observation can be explained by the fact that the in-domain testing has matched age demographics, whereas the out-of-domain OGI Kids corpus has a wider, more diverse age demographic. Another important observation is that even with adaptation on child speech and in-domain evaluations, the performance of children ASR remains much worse (11.1 times worse LER and 6.4 times worse WER) than adult speech recognition with end-to-end ASR systems.
After adaptation to children speech, the DNN-HMM model improves by a relative 56.75% LER and 59.27% WER on the in-domain MyST test set. Compared to the end-to-end systems (relative improvements of 41.63% LER and 34.01% WER), the DNN-HMM system is able to adapt to a greater degree, although in terms of absolute error rates the end-to-end ASR systems outperform the DNN-HMM systems by a relative 21.42% in LER and 17.94% in WER. On the OGI Kids corpus, the end-to-end ASR systems outperform the DNN-HMM system by a relative 27.01% in LER and 24.81% in WER.
Interestingly, the best performance on the in-domain MyST test set is obtained with greedy decoding. This suggests that the inherent language model estimated by the end-to-end systems trained on more than 58,000 hours of adult speech contains sufficient information for processing the children speech in our experiments. This means that the improvements obtained on the in-domain dataset after adaptation are all attributable to the acoustics. This finding also hints that the dominating factor of mismatch between adults and children may be acoustic. Overall, the improvement with greedy decoding is a relative 4.36% in WER (10.01% in LER) over beam-search decoding.

However, for the out-of-domain evaluation on OGI Kids, the best result is obtained with beam-search decoding. This could suggest that with heightened acoustic (domain) mismatch, the language model's role becomes more prominent. The improvement obtained with beam-search decoding is 2.66% in WER relative to greedy decoding; however, greedy decoding gives a better LER (a relative reduction of 6.61%).
The Transformer networks give significantly better error rates in the in-domain evaluations (MyST test corpus) than the ResNets and TDS networks. However, for the out-of-domain evaluations on the OGI Kids corpus, both the ResNets and the TDS networks outperform the Transformer networks. The Transformer networks undergo notable degradation when tested on OGI Kids, hinting at a generalization issue. These observations agree with those made with the adult acoustic models in section 7.5.
As observed with the adult acoustic models (section 7.6), the sequence-to-sequence models are always outperformed by CTC loss training, both in the in-domain evaluation on the MyST corpus and on the OGI Kids corpus. The performance difference between CTC training and sequence-to-sequence training increases on the out-of-domain OGI Kids corpus, while the difference remains much smaller with in-domain testing on the MyST corpus. We believe the matched age demographics of the in-domain MyST testing help the sequence-to-sequence models. Overall, we find sequence-to-sequence training to be less generalizable for children speech recognition.
In Table 4, we note that most of the best results are obtained with greedy decoding, i.e., without a LM. This is in contrast to the improvements that were noted with an LM for the adult AMs in Table 2. Regardless of the large improvements in perplexity on the MyST corpus with the inclusion of the children LM (see section 9.5), we find no improvements with beam-search decoding. This suggests that the end-to-end models are capable of modeling language given enough training data. It also indicates that acoustic mismatch is the dominating factor for children speech, and that addressing it is responsible for most of the gains in children speech recognition. This observation is in agreement with the study in [32], where transfer learning of the layers close to the acoustic features accounted for the maximum improvements, suggesting that acoustic variability is the dominating factor for degradation in children speech recognition.
9. ERROR ANALYSIS
In this section, we conduct various analyses to get further insights into the errors made by the aforementioned ASR systems.
We conduct a breakdown of the error rates in terms of substitutions, deletions and insertions to assess the strengths and weaknesses of the DNN-HMM and end-to-end systems, as well as of the different architectures and loss functions. Table 5 shows the breakdown of the WER of the various acoustic models trained on adult speech. The models are chosen to cover different aspects: greedy versus beam-search decoding, CTC versus sequence-to-sequence training, and DNN-HMM versus end-to-end systems. From Table 5, we find that substitutions and insertions are more suppressed with the end-to-end systems compared to the DNN-HMM system, while the deletions are inflated. We find this trend to be consistent across both the MyST corpus and the OGI Kids corpus, and over the various configurations including greedy and beam-search decoding, CTC and sequence-to-sequence training, and the various language models. We observe that in the breakdown for the sequence-to-sequence models there is a big spike in deletions. All the above observations hold even at the character level, with the corresponding error rate analysis presented in Table 6.

The error rate analysis for the acoustic models fine-tuned on the MyST Kids corpus is presented in Table 7 and Table 8. After adaptation with children speech, we observe that the proportion of deletions of the DNN-HMM system increases and the insertions decrease, becoming more comparable with those of the end-to-end systems (see Table 7). The deletions of the end-to-end systems continue to exceed those of the DNN-HMM systems, whereas the substitutions remain relatively low. The above observations are consistent across both the MyST corpus and the OGI Kids corpus, and also with the character level error analysis in Table 8.

ASR Model                       % Total Error  % Correct  % Substitution  % Deletions  % Insertions
MyST Test
TDNN-F DNN-HMM                  47.90          63.6       68.3 (32.7)     7.7 (3.7)    24.2 (11.6)
Transformer + CTC (Greedy)      25.46          78.4       56.6 (14.4)     28.0 (7.1)   15.3 (3.9)
Transformer + S2S (Greedy)      29.01          77.5       51.0 (14.8)     26.9 (7.8)   22.4 (6.5)
Transformer + CTC + 4-gram      25.21          77.3       46.0 (11.6)     44.0 (11.1)  9.9 (2.5)
TDS + S2S + 6-gram-wp           31.93          74.3       51.7 (16.5)     28.8 (9.2)   18.7 (6.3)
Transformer + CTC + GCNN        24.26          78.5       47.4 (11.5)     41.2 (10.0)  11.5 (2.8)
OGI Kids
TDNN-F DNN-HMM                  53.55          52.5       76.0 (40.7)     12.9 (6.9)   11.2 (6.0)
ResNet + CTC (Greedy)           38.00          63.5       57.9 (22.0)     38.2 (14.5)  3.9 (1.5)
ResNet + CTC + 4-gram           37.32          65.3       59.8 (22.3)     33.2 (12.4)  7.0 (2.6)
TDS + S2S + 6-gram-wp           74.62          26.9       21.0 (15.7)     76.9 (57.4)  2.1 (1.6)

Table 5: Word level error analysis of adult ASR models on children speech. Percent Correct refers to the fraction of the words in the reference that are present in the ASR hypothesis. For the substitutions, deletions and insertions, the numbers indicate the proportion relative to the total error, and the numbers inside the parentheses are the absolute values.

ASR Model                       % Total Error  % Correct  % Substitution  % Deletions  % Insertions
MyST Test
TDNN-F DNN-HMM                  27.00          84.0       36.7 (9.9)      22.6 (6.1)   40.7 (11.0)
Transformer + CTC (Greedy)      15.71          87.8       21.6 (3.4)      56.0 (8.8)   22.3 (3.5)
Transformer + S2S (Greedy)      18.78          86.5       21.3 (4.0)      50.6 (9.5)   27.7 (5.2)
Transformer + CTC + 4-gram      17.60          84.7       14.8 (2.6)      71.6 (12.6)  13.1 (2.3)
TDS + S2S + 6-gram-wp           21.05          84.4       21.9 (4.6)      52.3 (11.0)  26.1 (5.5)
Transformer + CTC + GCNN        16.79          85.7       16.1 (2.7)      69.7 (11.7)  14.9 (2.5)
OGI Kids
TDNN-F DNN-HMM                  36.04          76.4       41.3 (14.9)     24.1 (8.7)   34.4 (12.4)
ResNet + CTC (Greedy)           25.75          76.8       26.4 (6.8)      64.1 (16.5)  10.1 (2.6)
ResNet + CTC + 4-gram           25.02          77.6       27.2 (6.8)      62.4 (15.6)  10.8 (2.7)
TDS + S2S + 6-gram-wp           71.94          29.5       7.9 (5.7)       90.2 (64.9)  2.1 (1.5)

Table 6: Character level error analysis of adult ASR models on children speech. Percent Correct and the substitution/deletion/insertion columns are defined as in Table 5.

ASR Model                       % Total Error  % Correct  % Substitution  % Deletions  % Insertions
MyST Test
TDNN-F DNN-HMM                  19.51          83.9       52.3 (10.2)     30.2 (5.9)   17.4 (3.4)
Transformer + CTC (Greedy)      16.01          86.5       53.1 (8.5)      31.2 (5.0)   15.6 (2.5)
Transformer + S2S (Greedy)      16.69          86.5       38.3 (6.4)      41.9 (7.0)   19.2 (3.2)
Transformer + CTC + 4-gram      16.74          86.5       48.4 (8.1)      32.3 (5.4)   19.1 (3.2)
Transformer + S2S + 6-gram-wp   16.81          86.7       38.7 (6.5)      40.5 (6.8)   20.8 (3.5)
OGI Kids
TDNN-F DNN-HMM                  44.74          57.2       53.6 (24.0)     42.0 (18.8)  4.5 (2.0)
TDS + CTC (Greedy)              34.56          67.1       59.0 (20.4)     36.2 (12.5)  4.6 (1.6)
TDS + S2S (Greedy)              69.36          32.1       20.8 (14.4)     77.1 (53.5)  2.2 (1.5)
TDS + CTC + 4-gram              33.64          67.6       50.2 (16.9)     46.1 (15.5)  3.9 (1.3)
TDS + S2S + 4-gram              64.75          37.7       26.9 (17.4)     69.3 (44.9)  3.9 (2.5)

Table 7: Word level error analysis of adapted ASR models on children speech. Percent Correct and the substitution/deletion/insertion columns are defined as in Table 5.

ASR Model                       % Total Error  % Correct  % Substitution  % Deletions  % Insertions
MyST Test
TDNN-F DNN-HMM                  13.10          89.9       27.5 (3.6)      49.6 (6.5)   22.9 (3.0)
Transformer + CTC (Greedy)      9.17           93.5       15.3 (1.4)      55.6 (5.1)   29.4 (2.7)
Transformer + S2S (Greedy)      11.67          91.5       13.7 (1.6)      59.1 (6.9)   27.4 (3.2)
Transformer + CTC + 4-gram      10.19          92.3       14.7 (1.5)      61.8 (6.3)   24.5 (2.5)
Transformer + S2S + 6-gram-wp   11.78          91.6       13.6 (1.6)      56.9 (6.7)   28.9 (3.4)
OGI Kids
TDNN-F DNN-HMM                  30.40          73.1       27.0 (8.2)      61.5 (18.7)  11.5 (3.5)
TDS + CTC (Greedy)              22.19          80.7       25.2 (5.6)      61.7 (13.7)  11.1 (2.9)
TDS + S2S (Greedy)              64.75          36.9       8.3 (5.4)       89.1 (57.7)  2.6 (1.7)
TDS + CTC + 4-gram              23.76          78.2       19.4 (4.6)      72.4 (17.2)  8.4 (2.0)
TDS + S2S + 4-gram              59.27          43.8       11.8 (7.0)      83.0 (49.2)  5.2 (3.1)

Table 8: Character level error analysis of adapted ASR models on children speech. Percent Correct and the substitution/deletion/insertion columns are defined as in Table 5.
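The substitution/deletion/insertion counts reported in Tables 5-8 come from a Levenshtein alignment between the reference and the hypothesis; a minimal sketch of that computation:

```python
def error_breakdown(ref, hyp):
    """Align reference and hypothesis token lists by edit distance and
    return (substitutions, deletions, insertions); WER = (S+D+I)/len(ref)."""
    R, H = len(ref), len(hyp)
    # dp[i][j] = (total_errors, S, D, I) aligning ref[:i] with hyp[:j]
    dp = [[None] * (H + 1) for _ in range(R + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, R + 1):
        dp[i][0] = (i, 0, i, 0)          # deleting all reference words
    for j in range(1, H + 1):
        dp[0][j] = (j, 0, 0, j)          # inserting all hypothesis words
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            if ref[i - 1] == hyp[j - 1]:
                match = dp[i - 1][j - 1]                  # no error
            else:
                c, s, d, k = dp[i - 1][j - 1]
                match = (c + 1, s + 1, d, k)              # substitution
            c, s, d, k = dp[i - 1][j]
            delete = (c + 1, s, d + 1, k)                 # deletion
            c, s, d, k = dp[i][j - 1]
            insert = (c + 1, s, d, k + 1)                 # insertion
            dp[i][j] = min(match, delete, insert)
    return dp[R][H][1:]

print(error_breakdown("the cat sat".split(), "a cat sat down".split()))
# -> (1, 0, 1): one substitution ("the" -> "a") and one insertion ("down")
```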
In this section, we assess the error rates with respect to children's age. All the age related evaluations are performed on the OGI Kids corpus, since it has a diverse age distribution among children.
Figure 2 plots the WER obtained with the adult acoustic model trained on the combination of LIBRISPEECH and LIBRIVOX, corresponding to Table 3 for the OGI Kids corpus, across school grades. First, we observe that the WER is worst for kindergarten children and gets progressively better with increasing age. The decrease in WER is steep until 4th grade and then relatively flattens out. The above trends are consistent over all the model configurations, including DNN-HMM, greedy and beam-search decoding, and CTC and sequence-to-sequence networks. In sum, the age associated challenges of children speech recognition are prevalent even in the end-to-end systems, and their trends are similar to previous works involving GMM-HMM systems [29] and DNN-HMM systems [32].

Comparing the DNN-HMM based model with the best-performing ResNet based end-to-end system, nearly constant improvements are obtained with the end-to-end system over all age categories. The difference between the error rates of the TDNN-F HMM and the end-to-end systems is minimal for the eldest children (10th grade). We do not observe any striking differences between the different architectures and loss functions of the end-to-end systems. Moreover, the plots of letter error rate in Figure 3 also agree with the earlier observations.

Fig. 2: Age (grade) versus WER on the OGI Kids corpus for adult AMs trained on LIBRISPEECH + LIBRIVOX

Fig. 3: Age (grade) versus LER on the OGI Kids corpus for adult AMs trained on LIBRISPEECH + LIBRIVOX
Figure 4 plots the WER obtained on acoustic models adaptedon MyST corpus. Note, the acoustic models were adaptedwith data corresponding to children studying in grades 3 to5. Similar to observation with adult acoustic models, we findthe WER is worst for kindergarten children. For end-to-endarchitectures, the WER decreases steeply until grade 4 and flattens out just as in the case of the adult acoustic model. In-terestingly, we find that the trends observed with end-to-endarchitectures are nearly identical as was observed in the un-adapted baseline adult models despite training on children ofgrades 3 - 5. With the end-to-end architecture there is nearconstant improvements in absolute WER throughout all theages in spite of adapting on data from only a subset of agecategories (grades 3 -5). However, with the TDNN-F DNN-HMM model we see that the WER decreases and reaches min-imum for grade 4 and we observe an increase in WER for chil-dren of grade 7 and above. This suggests that the DNN-HMMmodels are more sensitive to the children’s age i.e., adaptationdata age range.Comparing the DNN-HMM acoustic models with theend-to-end architectures, we note the WER with TDNN-FHMM for kindergarten children improves by 17.31% relativeto adult acoustic models and the WER with the best perform-ing end-to-end acoustic model (TDS + CTC + 4-gram LM)for kindergarten children improves only by a relative 4.08%.The differences in WER between the DNN-HMM models rade W E R % Age vs. WER (OGI Kids Corpus)
Fig. 2: Age versus WER for Adult AM trained on LIBRISPEECH + LIBRIVOX
Fig. 3: Age versus LER for Adult AM trained on LIBRISPEECH + LIBRIVOX

The differences in WER between the DNN-HMM models and the end-to-end systems interestingly increase with children's age. No major differences were observed between different configurations of the end-to-end system (CTC vs. sequence-to-sequence, and greedy vs. beam-search decoding). Figure 5 plots the LER obtained with acoustic models adapted on the MyST corpus. A few notable differences can be spotted relative to the trends observed earlier with WER. First, we observe that greedy decoding yields better LER than its beam-search counterpart, and these improvements hold throughout all ages. Second, we note that with beam-search decoding the LER is relatively worse for younger children, i.e., greedy decoding is better in terms of LER for children from kindergarten to grade 3. These two observations suggest that the language model hampers the LER in adapted models while providing no improvements in terms of WER. Finally, we observe that the difference in LER between the DNN-HMM and the best-performing end-to-end model (TDS + CTC) with beam-search decoding is minimal.
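Since the greedy and beam-search conditions recur throughout these comparisons, a minimal sketch of greedy (best-path) CTC decoding is given below: take the argmax label per frame, collapse repeated labels, and drop blanks. No language model or beam is involved, so the output is driven purely by the acoustic model. The toy posterior matrix, label set, and blank index are illustrative assumptions.

```python
# Minimal sketch of greedy (best-path) CTC decoding.
import numpy as np

def ctc_greedy_decode(log_probs: np.ndarray, labels: list, blank: int = 0) -> str:
    """log_probs: (T, V) per-frame log-posteriors over V labels (blank at index 0)."""
    best = log_probs.argmax(axis=1)           # best label per frame
    out, prev = [], blank
    for idx in best:
        if idx != blank and idx != prev:      # collapse repeats, skip blanks
            out.append(labels[idx])
        prev = idx
    return "".join(out)

# Toy example: 6 frames over the labels {blank, 'a', 'b'}.
labels = ["<b>", "a", "b"]
log_probs = np.log(np.array([
    [0.1, 0.8, 0.1],   # a
    [0.1, 0.8, 0.1],   # a (repeat, collapsed)
    [0.8, 0.1, 0.1],   # blank
    [0.1, 0.1, 0.8],   # b
    [0.1, 0.1, 0.8],   # b (repeat, collapsed)
    [0.8, 0.1, 0.1],   # blank
]))
print(ctc_greedy_decode(log_probs, labels))   # -> "ab"
```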
Fig. 4: Age versus WER for AM fine-tuned on MyST
Fig. 5: Age versus LER for AM fine-tuned on MyST
In this section, we analyze the effect of training data on the performance over different age categories. We consider acoustic models trained on: (i) LIBRISPEECH, (ii) LIBRISPEECH + LIBRIVOX, and (iii) LIBRISPEECH + LIBRIVOX adapted on the MyST corpus. Figure 6 and Figure 7 illustrate the WER and LER over different age categories, respectively. First, we observe that the addition of 52,700 hours of LIBRIVOX data lowers the WER across all age categories by a considerable margin. An important observation here is that, with the addition of large amounts of adult speech training data, relatively larger improvements are observed for younger children (kindergarten to grade 3) compared to older children. Further adaptation with children's speech data mainly helps speech recognition for younger children and does not seem to provide significant improvements for older children. Noting that these results are on the out-of-domain OGI Kids corpus, we find relative improvements in WER of 20.15% with a 54-fold increase in adult training data, whereas a relative improvement of 4.24% is obtained for kindergarten children with just 0.37% of children's speech data.
Fig. 6: Age versus WER for AM trained on different amounts of data
Fig. 7: Age versus LER for AM trained on different amounts of data
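For clarity, the relative improvements quoted above (e.g., 20.15% and 4.24%) follow the standard definition:

```latex
\text{Relative improvement (\%)} \;=\;
\frac{\mathrm{WER}_{\text{baseline}} - \mathrm{WER}_{\text{new}}}
     {\mathrm{WER}_{\text{baseline}}} \times 100
```

For instance, with purely illustrative values (not measurements from this study), a baseline WER of 50.0% reduced to 39.9% corresponds to a relative improvement of (50.0 - 39.9)/50.0 x 100, i.e., approximately 20.2%.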
Here we analyze the effect of utterance length on the performance of the various acoustic models. The utterance length distribution of the MyST test subset is shown in Figure 10. Figure 8 illustrates the WER on the MyST test corpus with the adult acoustic model for utterances of varying lengths. We observe that the performance of the TDNN-F based DNN-HMM system improves as the utterance length increases. However, with the end-to-end systems we see degradation for longer utterance lengths. The performance of the CTC based systems improves initially with increasing utterance length and then degrades for utterances of over 100 words. The sequence-to-sequence acoustic models undergo a more drastic degradation for utterance lengths over 60 words. While the GCNN LM provides slight improvements for utterance lengths under 80 words, it provides no advantage for longer utterances. Figure 9 plots the WER on the MyST test corpus with the in-domain adapted acoustic model over varying utterance lengths. Similar to the observations made with the adult acoustic model, the DNN-HMM system's performance improves with length and does not undergo any degradation for longer utterances.
Fig. 8: Utterance Length versus WER for Adult AM trained on LIBRISPEECH + LIBRIVOX
Fig. 9: Utterance Length versus WER for AM fine-tuned on MyST

We note that, compared to Figure 8, the improvements with increasing utterance length are relatively small. Next, with the end-to-end architecture trained with CTC, we observe slight degradation for utterances of over 100 words in length.
Fig. 10: Utterance-Word Length Distribution of the MyST Corpus (bins: 0-20, 20-40, 40-60, 60-80, 80-100, and 100-291 words)
Table 9: Language Model Perplexities on the MyST test set and OGI Kids corpus, for LMs trained on LIBRISPEECH, MyST, and LIBRISPEECH + MyST
Comparatively, the degradation is of lower magnitude than that observed with the adult acoustic model (see Figure 8). The degradation is drastic for the sequence-to-sequence architectures for utterance lengths greater than 80 words; compared to the adult acoustic model (see Figure 8), the onset of this degradation occurs at longer lengths, but the degradation itself is more acute. Interestingly, with the adapted acoustic models the sequence-to-sequence architecture performs better than its CTC-trained counterpart for utterance lengths under 80 words. Another important observation is that the performance of the DNN-HMM system equals that of the best-performing end-to-end systems for longer utterances (greater than 100 words).
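The length-bucketed results in Figures 8 and 9 amount to grouping utterances by reference word count into the bins of Figure 10 and computing WER within each bin. The sketch below is illustrative rather than the evaluation code used here; `word_errors(ref, hyp)` is an assumed helper returning the total edit errors for one utterance (for example, via the alignment sketched earlier).

```python
# Minimal sketch: per-length-bucket WER, using the bin edges of Figure 10.
from collections import defaultdict

BINS = [(0, 20), (20, 40), (40, 60), (60, 80), (80, 100), (100, 291)]

def bucket_of(n_words):
    """Map a reference length in words to its Figure 10 bin label."""
    for lo, hi in BINS:
        if lo <= n_words < hi or (hi == BINS[-1][1] and n_words == hi):
            return f"{lo}-{hi}"
    return None

def wer_by_length(pairs, word_errors):
    """pairs: iterable of (ref_words, hyp_words); word_errors: alignment fn."""
    errs, refs = defaultdict(int), defaultdict(int)
    for ref, hyp in pairs:
        b = bucket_of(len(ref))
        if b is None:
            continue
        errs[b] += word_errors(ref, hyp)   # S + D + I for this utterance
        refs[b] += len(ref)                # reference word count
    return {b: 100.0 * errs[b] / refs[b] for b in errs if refs[b]}
```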
Table 9 presents the perplexity of the various language models on the MyST and OGI Kids corpora. The perplexities provide more context for analyzing the results in Table 3 and Table 4. Comparing the different LMs, we find the word-piece 6-gram models provide a reduction in perplexity of more than 50% over the traditional 4-gram word-based LM. The gated convolutional neural network LMs yield perplexities 70% lower than typical 4-gram word LMs. The word-piece based GCNN LM provides further improvements and gives the lowest perplexities. Between the two children's speech corpora, we observe that OGI Kids has higher perplexities than the MyST corpus, and this is also reflected in the WERs of the previous assessments. We also find that the original language models trained on public-domain books show a higher degree of mismatch with the children's corpora [65]. The addition of the children's LM noticeably reduces the perplexity for both word-based and word-piece based n-gram models on the MyST corpus. However, the inclusion of children's data does not help the perplexities on the OGI Kids corpus, i.e., adding the children's LM results in higher perplexities. In the case of the GCNN LM, although fine-tuning the LM decreases perplexity on the development set, the gains do not translate to the test set, resulting in a slight increase in perplexity values. Hence, we omit the speech recognition results with the child-adapted GCNN LM from Table 3 and Table 4.
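As a concrete illustration of how such perplexities can be measured, the sketch below queries a KenLM [63] n-gram model for word-level perplexity and, for a word-piece model, tokenizes with SentencePiece [62] before scoring. The model file names are assumptions, and note that word-level and word-piece-level perplexities are normalized over different token counts, so cross-comparisons require consistent normalization.

```python
# Minimal sketch (assumed model paths): word vs. word-piece n-gram perplexity.
# pip install kenlm sentencepiece
import kenlm
import sentencepiece as spm

word_lm = kenlm.Model("4gram_word.arpa")        # assumed 4-gram word LM
wp_lm = kenlm.Model("6gram_wordpiece.arpa")     # assumed 6-gram word-piece LM
sp = spm.SentencePieceProcessor()
sp.Load("wordpiece.model")                      # assumed SentencePiece model

def wp_perplexity(model, sp, sentences):
    """Perplexity of a word-piece LM over piece-tokenized text (KenLM scores are log10)."""
    log10_sum, n_tokens = 0.0, 0
    for sent in sentences:
        pieces = sp.EncodeAsPieces(sent)
        log10_sum += model.score(" ".join(pieces), bos=True, eos=True)
        n_tokens += len(pieces) + 1             # +1 for the </s> token
    return 10.0 ** (-log10_sum / n_tokens)

test = ["the decomposers break down the dead plants"]  # illustrative sentence
print("word 4-gram ppl:", word_lm.perplexity(test[0]))
print("word-piece 6-gram ppl:", wp_perplexity(wp_lm, sp, test))
```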
Table 10 and Table 11 list the top-50 confusion pairs among the best-performing DNN-HMM, end-to-end models trained on LIBRISPEECH, end-to-end models trained on LIBRIVOX, and end-to-end models adapted on the children's speech corpus, evaluated on the MyST and OGI corpora respectively. Beginning with the TDNN-F model, most of the errors are because of (i) deletion of certain consonants, which results in partial word recognition, such as biosphere being recognized as sphere and meat being recognized as me, (ii) substitution/confusion between vowels, which results in errors between acoustically similar words such as like & lake and um & arm, and (iii) confusion involving fillers such as ah, uh, um, uhm, etc. A second observation is that most of the errors are among stop words. With the end-to-end system trained on LIBRISPEECH, we see suppression of all three kinds of errors; more specifically, we notice that deletion of consonants is minimal and confusion involving fillers is less prevalent. With acoustic models trained on the additional LIBRIVOX speech data, the confusion pairs are similar to those observed with the LIBRISPEECH model, but errors are suppressed across the board. After adapting the acoustic model on children's data, we observe less prevalence of word confusions among stop words. Overall, we find the end-to-end models are more efficient at handling filler words and the confusions resulting from deletion of consonants or breaking of words, such as plant & plan and about & bow.
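A confusion-pair table of this kind can be harvested by aligning each reference/hypothesis pair and counting the substituted word pairs. The sketch below uses difflib's opcodes as a stand-in for a full edit-distance alignment (the exact alignment tool behind Tables 10 and 11 is not specified here), counting only one-for-one replacements.

```python
# Minimal sketch: count substitution confusion pairs across a test set.
from collections import Counter
from difflib import SequenceMatcher

def confusion_pairs(utterances):
    """utterances: iterable of (ref_words, hyp_words) lists of words."""
    counts = Counter()
    for ref, hyp in utterances:
        sm = SequenceMatcher(a=ref, b=hyp, autojunk=False)
        for op, i1, i2, j1, j2 in sm.get_opcodes():
            # Equal-length 'replace' spans are treated as 1:1 substitutions.
            if op == "replace" and (i2 - i1) == (j2 - j1):
                for r, h in zip(ref[i1:i2], hyp[j1:j2]):
                    counts[(r, h)] += 1
    return counts

# Illustrative utterance pair, not corpus data.
utts = [("um the biosphere".split(), "arm the sphere".split())]
for (r, h), c in confusion_pairs(utts).most_common(50):
    print(f"{c:4d}  {r} -> {h}")
```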
10. CONCLUSIONS
In this work, we presented a detailed empirical study of children's speech recognition with state-of-the-art end-to-end architectures. The findings of the study suggest that end-to-end speech recognition systems trained on adult speech have shortcomings in recognizing children's speech. In terms of WER, children's speech recognition on the MyST corpus is approximately 10 times worse, and on the OGI Kids corpus approximately 19 times worse, compared to adult speech recognition.
TDNN-F HMM | Transformer + CTC + 4-gram (LIBRISPEECH) | Transformer + CTC + GCNN (LIBRIVOX) | Transformer + CTC + GCNN (MyST)
Freq Confusion | Freq Confusion | Freq Confusion | Freq Confusion
801 and → in | 234 and → in | 145 and → in | 126 because → cause
244 and → an | 107 um → on | 102 a → the | 100 <unk> → decomposers
189 um → arm | 103 a → the | 94 <unk> → decompose | 86 <unk> → decomposer
148 the → a | 88 the → a | 84 the → a | 80 <unk> → biosphere
103 uh → ah | 85 uh → a | 83 <unk> → atmosphere | 80 a → the
101 it's → its | 82 it's → its | 76 <unk> → decomposes | 66 in → and
86 it → a | 68 <unk> → systems | 75 <unk> → systems | 63 its → it's
86 it's → it | 62 <unk> → decomposes | 73 it's → its | 63 the → a
78 the → though | 61 <unk> → decomposing | 57 um → a | 61 and → in
67 because → cause | 61 um → and | 56 uh → a | 51 <unk> → herbivore
56 like → lake | 49 in → and | 52 um → on | 43 <unk> → subsystems
55 um → on | 49 um → ah | 50 <unk> → synthesis | 42 <unk> → omnivore
54 a → the | 44 <unk> → sphere | 49 um → and | 40 <unk> → the
53 biosphere → sphere | 44 <unk> → synthesis | 48 in → and | 38 <unk> → photosynthesis
49 it → eh | 44 um → a | 41 <unk> → the | 36 it's → its
49 meat → me | 43 <unk> → atmosphere | 41 it's → is | 35 <unk> → o
49 the → this | 40 yeast → east | 40 esophagus → oesophagus | 35 uh → u
46 eat → e | 39 two → too | 35 predators → creditors | 31 <unk> → omnivores
43 into → to | 37 that → the | 34 nutrients → nuts | 26 <unk> → geosphere
40 and → a | 36 uh → ah | 34 they're → are | 26 it → it's
40 the → thee | 34 it's → is | 33 um → ah | 25 <unk> → c
39 eats → eat | 34 that's → that | 32 cause → because | 25 <unk> → hydrosphere
37 plants → plant | 33 <unk> → system | 32 that → the | 25 that → the
36 it → i | 33 <unk> → the | 32 two → too | 22 <unk> → learned
36 that → the | 31 predators → creditors | 31 <unk> → sphere | 21 bloodstream → stream
34 photosynthesis → synthesis | 30 esophagus → oesophagus | 30 <unk> → system | 20 <unk> → ecosystem
34 subsystems → systems | 30 the → that | 30 it's → it | 20 it's → it
33 the → that | 30 they're → are | 29 palmate → palm | 20 uh → a
32 plant → plan | 29 cause → because | 28 <unk> → o | 19 are → they're
32 um → ah | 28 <unk> → decompose | 26 there's → is | 19 it's → is
31 the → their | 28 eats → eat | 24 eats → eat | 18 cord → chord
31 the → they | 28 it's → it | 23 they're → their | 18 the → they
31 they're → there | 28 meat → me | 20 chlorophyll → chloroform | 17 <unk> → ecosystems
30 they're → their | 27 <unk> → decomposed | 20 it → i | 17 is → there's
30 um → hum | 27 it → that | 20 nutrients → nut | 15 <unk> → and
29 is → as | 24 systems → system | 20 their → the | 15 notice → noticed
29 plants → plans | 24 there's → is | 19 it → and | 15 plant → plants
29 that's → that | 24 they're → their | 19 meat → me | 15 the → that
28 about → bow | 24 yeah → yes | 19 yeast → east | 15 this → the
28 are → or | 23 um → i'm | 18 <unk> → a | 14 they're → they
28 cells → selves | 22 it → i | 18 because → cause | 13 <unk> → bronchi
28 it's → is | 20 <unk> → war | 18 into → to | 13 that → it
28 to → too | 20 is → as | 18 it → that | 13 their → the
27 and → end | 20 they're → they | 18 yeah → yes | 12 <unk> → is
27 decomposers → composers | 19 nutrients → nuts | 17 <unk> → and | 12 <unk> → subsystem
27 it → at | 19 um → i | 17 <unk> → bronco | 12 am → i'm
27 maybe → be | 18 is → there's | 17 <unk> → decomposing | 12 eats → eat
27 um → om | 17 <unk> → a | 17 plants → plant | 12 or → are
26 the → de | 17 are → our | 17 this → the | 12 snake → rattlesnake
26 they're → are | 17 it → and | 16 that's → that | 12 to → into

Table 10: Top 50 Confusion Pairs on the MyST Corpus for the top-performing DNN-HMM system and end-to-end ASR systems
TDNN-F HMM | ResNet + CTC + GCNN (LIBRISPEECH) | ResNet + CTC + 4-gram (LIBRIVOX) | TDS + CTC + 4-gram (MyST)
Freq Confusion | Freq Confusion | Freq Confusion | Freq Confusion
933 and → in | 358 and → an | 434 and → in | 2281 <unk> → um
819 and → an | 311 and → in | 396 <unk> → m | 230 v → b
312 uhm → arm | 273 r → are | 290 q → k | 213 n → and
197 uhm → own | 265 q → you | 285 q → u | 173 r → are
186 uhm → um | 235 <unk> → and | 251 <unk> → and | 170 a → the
169 uhm → hum | 211 s → as | 234 <unk> → a | 153 in → and
145 uh → ah | 186 <unk> → on | 210 a → the | 145 q → you
143 the → a | 183 <unk> → um | 202 v → b | 145 z → c
136 uhm → oh | 181 v → the | 192 uh → a | 139 v → the
131 uhm → am | 177 a → the | 154 gonna → to | 138 because → cause
129 w → u | 168 n → an | 146 <unk> → um | 136 m → and
128 uhm → on | 164 <unk> → oh | 137 <unk> → o | 136 u → you
124 a → the | 162 uh → a | 136 <unk> → on | 130 mom → ma
117 uhm → m | 144 <unk> → i | 129 <unk> → ah | 103 and → in
104 a → e | 139 i → a | 126 <unk> → oh | 95 <unk> → and
100 to → a | 138 <unk> → a | 119 i → a | 88 to → the
100 to → the | 119 m → an | 115 z → c | 79 gonna → to
94 i → a | 99 the → a | 114 v → e | 77 is → there's
91 you → ye | 99 y → why | 101 and → n | 76 uh → um
89 uh → er | 92 u → you | 97 the → a | 75 p → b
77 n → an | 81 she → he | 92 <unk> → of | 74 z → the
75 and → anne | 77 them → em | 92 mom → ma | 73 the → a
74 it's → its | 72 gonna → to | 86 <unk> → i | 71 he → you
73 u → you | 72 z → why | 84 uh → ah | 67 v → you
73 z → c | 71 this → the | 81 yeah → yes | 60 this → the
72 and → an' | 70 <unk> → an | 74 in → and | 58 she → he
72 and → on | 69 to → the | 72 i → j | 56 i → it
72 uhm → of | 69 v → b | 71 to → the | 54 sister → system
70 and → m | 65 to → a | 69 z → e | 53 is → he's
69 a → eh | 64 q → u | 68 <unk> → i'm | 53 is → she's
68 uhm → ah | 62 u → to | 68 yeah → ye | 53 my → like
67 v → b | 60 yeah → yes | 65 them → em | 52 uh → a
65 a → of | 55 <unk> → i'm | 64 to → a | 49 <unk> → the
65 and → than | 55 in → and | 61 g → t | 46 dad → that
65 uhm → an | 52 and → i | 61 is → there's | 45 g → d
63 and → a | 52 is → there's | 58 this → the | 44 i → a
63 and → eh | 52 v → you | 57 f → n | 43 i → um
62 and → the | 51 and → why | 57 she → he | 43 uh → the
58 and → then | 51 z → see | 55 <unk> → the | 42 q → b
57 like → liked | 50 and → the | 54 i → and | 41 he → it
56 is → as | 50 is → he's | 52 and → m | 41 it's → is
56 it's → is | 50 s → are | 51 <unk> → in | 39 am → i'm
56 t → to | 49 <unk> → am | 51 okay → k | 39 gonna → gon
55 and → end | 49 <unk> → em | 51 y → w | 39 u → d
54 them → em | 49 r → you | 49 and → the | 38 and → um
54 this → the | 49 t → are | 49 because → cause | 37 <unk> → that
52 favorite → favourite | 49 uh → ah | 48 and → a | 37 and → then
52 two → to | 48 i → j | 48 z → w | 37 d → the
50 yeah → yes | 48 q → e | 47 that's → that | 37 gonna → on
48 i → j | 48 s → you | 47 them → him | 37 lake → like

Table 11: Top 50 Confusion Pairs on the OGI Corpus for the top-performing DNN-HMM system and end-to-end ASR systems

With the end-to-end systems, the gap in performance between adults and children is wider in comparison with the DNN-HMM hybrid systems, although in terms of absolute WER the end-to-end systems are a significant improvement over the latter. The benefits established with end-to-end ASR for adult speech do not translate completely to children's speech. End-to-end architectures trained on large amounts of adult speech data can certainly help performance on children's speech. The addition of large amounts of adult speech is found to help most when the acoustic mismatch between children and adults is large. Although adaptation of the acoustic model on children's speech helps, the recognition performance remains more than 6 times worse compared to adult ASR. DNN-HMM hybrid models benefit to a larger extent from children's speech adaptation compared to end-to-end ASR, but the latter performs better in absolute WER.
Transformer network architectures are the best-performing models when the train-test mismatch is low; however, they do not generalize well when the train-test mismatch is high, including children's age disparities. CTC loss based models are robust for children's speech recognition, whereas the sequence-to-sequence models can break down completely under high-mismatch conditions. Our experiments indicate better performance with greedy decoding without a language model for children's ASR, suggesting that acoustic mismatch dominates the performance drop.

Insights into the errors reveal that the end-to-end systems have lower substitutions and insertions and higher deletions on children's speech compared to the hybrid DNN-HMM. ASR for younger children still remains a challenge with end-to-end systems, while performance increases with age, similar to the trends observed for GMM-HMM and DNN-HMM systems in prior literature. On adaptation with children's speech, the end-to-end systems provide near-constant improvements over all age categories, irrespective of the age demographics of the adaptation data. However, the DNN-HMM hybrid systems are more sensitive to age, giving skewed performance benefits for matched train-test age categories. Training end-to-end systems with large amounts of adult speech data benefits recognition for all age categories, with younger children benefiting to a greater degree. End-to-end systems suffer when decoding longer utterances; specifically, sequence-to-sequence models undergo drastic degradation compared to CTC models, whereas the DNN-HMM hybrid systems do not degrade.

Overall, the state-of-the-art end-to-end systems setting high benchmarks on adult speech are still far from achieving the same levels of performance on children's speech. This emphasizes the need to include children's speech in developing benchmark tasks for ASR. The results also point to fundamental challenges that still need to be addressed in children's speech recognition with end-to-end architectures.
11. REFERENCES

[1] Shrikanth S. Narayanan and Alexandros Potamianos, "Creating conversational interfaces for children," IEEE Transactions on Speech and Audio Processing, vol. 10, no. 2, pp. 65–78, Feb. 2002.

[2] Daniel Bone, Chi-Chun Lee, Theodora Chaspari, James Gibson, and Shrikanth Narayanan, "Signal processing and machine learning for mental health research and clinical applications," IEEE Signal Processing Magazine, vol. 34, no. 5, pp. 189–196, Sep. 2017.

[3] Daniel Bone, Theodora Chaspari, and Shrikanth Narayanan, "Behavioral signal processing and autism: Learning from multimodal behavioral signals," Autism Imaging and Devices, pp. 335–360, 2017.

[4] George E. Dahl, Dong Yu, Li Deng, and Alex Acero, "Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30–42, 2011.

[5] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 369–376.

[6] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.

[7] Jan K. Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio, "Attention-based models for speech recognition," in Advances in Neural Information Processing Systems, 2015, pp. 577–585.

[8] William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in ICASSP. IEEE, 2016, pp. 4960–4964.

[9] Chung-Cheng Chiu, Tara N. Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J. Weiss, Kanishka Rao, Ekaterina Gonina, et al., "State-of-the-art speech recognition with sequence-to-sequence models," in ICASSP. IEEE, 2018, pp. 4774–4778.

[10] Ying Zhang, Mohammad Pezeshki, Philémon Brakel, Saizheng Zhang, César Laurent, Yoshua Bengio, and Aaron Courville, "Towards end-to-end speech recognition with deep convolutional neural networks," arXiv preprint arXiv:1701.02720, 2017.

[11] Neil Zeghidour, Qiantong Xu, Vitaliy Liptchinsky, Nicolas Usunier, Gabriel Synnaeve, and Ronan Collobert, "Fully convolutional speech recognition," arXiv preprint arXiv:1812.06864, 2018.

[12] Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al., "Deep Speech 2: End-to-end speech recognition in English and Mandarin," in Proceedings of the 33rd International Conference on Machine Learning, Maria Florina Balcan and Kilian Q. Weinberger, Eds., New York, New York, USA, 2016, vol. 48 of Proceedings of Machine Learning Research, pp. 173–182, PMLR.

[13] Yu Zhang, William Chan, and Navdeep Jaitly, "Very deep convolutional networks for end-to-end speech recognition," in ICASSP. IEEE, 2017, pp. 4845–4849.

[14] Yisen Wang, Xuejiao Deng, Songbai Pu, and Zhiheng Huang, "Residual convolutional CTC networks for automatic speech recognition," arXiv preprint arXiv:1702.07793, 2017.

[15] Jaeyoung Kim, Mostafa El-Khamy, and Jungwon Lee, "Residual LSTM: Design of a deep recurrent architecture for distant speech recognition," arXiv preprint arXiv:1701.03360, 2017.

[16] Golan Pundak and Tara N. Sainath, "Highway-LSTM and recurrent highway networks for speech recognition," in Proc. Interspeech 2017, pp. 1303–1307, 2017.

[17] Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R. Hershey, and Tomoki Hayashi, "Hybrid CTC/attention architecture for end-to-end speech recognition," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240–1253, 2017.

[18] Suyoun Kim, Takaaki Hori, and Shinji Watanabe, "Joint CTC-attention based end-to-end speech recognition using multi-task learning," in ICASSP. IEEE, 2017, pp. 4835–4839.

[19] Linhao Dong, Shuang Xu, and Bo Xu, "Speech-Transformer: A no-recurrence sequence-to-sequence model for speech recognition," in ICASSP. IEEE, 2018, pp. 5884–5888.

[20] Gabriel Synnaeve, Qiantong Xu, Jacob Kahn, Edouard Grave, Tatiana Likhomanenko, Vineel Pratap, Anuroop Sriram, Vitaliy Liptchinsky, and Ronan Collobert, "End-to-end ASR: From supervised to semi-supervised learning with modern architectures," arXiv preprint arXiv:1911.08460, 2019.

[21] Sungbok Lee, Alexandros Potamianos, and Shrikanth Narayanan, "Acoustics of children's speech: Developmental changes of temporal and spectral parameters," The Journal of the Acoustical Society of America, vol. 105, no. 3, pp. 1455–1468, 1999.

[22] Sungbok Lee, Alexandros Potamianos, and Shrikanth S. Narayanan, "Developmental acoustic study of American English diphthongs," The Journal of the Acoustical Society of America, vol. 136, no. 4, pp. 1880–1894, Oct. 2014.

[23] Alexandros Potamianos and Shrikanth Narayanan, "Robust recognition of children's speech," IEEE Transactions on Speech and Audio Processing, vol. 11, no. 6, pp. 603–616, 2003.

[24] Matteo Gerosa, Diego Giuliani, and Fabio Brugnara, "Acoustic variability and automatic recognition of children's speech," Speech Communication, vol. 49, no. 10-11, pp. 847–860, 2007.

[25] Alexandros Potamianos, Shrikanth Narayanan, and Sungbok Lee, "Automatic speech recognition for children," in Fifth European Conference on Speech Communication and Technology, 1997, pp. 2371–2374.

[26] Tanya M. Gallagher, "Revision behaviors in the speech of normal children developing language," Journal of Speech and Hearing Research, vol. 20, no. 2, pp. 303–318, 1977.

[27] Shweta Ghai and Rohit Sinha, "Exploring the role of spectral smoothing in context of children's speech recognition," in Tenth Annual Conference of the International Speech Communication Association, 2009, pp. 1607–1610.

[28] Rohit Sinha and Syed Shahnawazuddin, "Assessment of pitch-adaptive front-end signal processing for children's speech recognition," Computer Speech & Language, vol. 48, pp. 103–121, 2018.

[29] Prashanth Gurunath Shivakumar, Alexandros Potamianos, Sungbok Lee, and Shrikanth S. Narayanan, "Improving speech recognition for children using acoustic adaptation and pronunciation modeling," in WOCCI, 2014, pp. 15–19.

[30] Diego Giuliani and Matteo Gerosa, "Investigating recognition of children's speech," in ICASSP. IEEE, 2003, vol. 2, pp. II-137.

[31] Diego Giuliani, Matteo Gerosa, and Fabio Brugnara, "Improved automatic speech recognition through speaker normalization," Computer Speech & Language, vol. 20, no. 1, pp. 107–123, 2006.

[32] Prashanth Gurunath Shivakumar and Panayiotis Georgiou, "Transfer learning from adult to children for speech recognition: Evaluation, analysis and recommendations," Computer Speech & Language, vol. 63, pp. 101077, 2020.

[33] Mark J. F. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition," Computer Speech & Language, vol. 12, no. 2, pp. 75–98, 1998.

[34] Qun Li and Martin J. Russell, "An analysis of the causes of increased error rates in children's speech recognition," in Seventh International Conference on Spoken Language Processing, 2002, pp. 2337–2340.

[35] Subrata Das, Don Nix, and Michael Picheny, "Improvements in children's speech recognition performance," in Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP'98. IEEE, 1998, vol. 1, pp. 433–436.

[36] Si-Ioi Ng, Wei Liu, Zhiyuan Peng, Siyuan Feng, Hing-Pang Huang, Odette Scharenborg, and Tan Lee, "The CUHK-TUDELFT system for the SLT 2021 children speech recognition challenge," arXiv preprint arXiv:2011.06239, 2020.

[37] Guoguo Chen, Xingyu Na, Yongqing Wang, Zhiyong Yan, Junbo Zhang, Sifan Ma, and Yujun Wang, "Data augmentation for children's speech recognition – the "Ethiopian" system for the SLT 2021 children speech recognition challenge," arXiv preprint arXiv:2011.04547, 2020.

[38] Fan Yu, Zhuoyuan Yao, Xiong Wang, Keyu An, Lei Xie, Zhijian Ou, Bo Liu, Xiulin Li, and Guanqiong Miao, "The SLT 2021 children speech recognition challenge: Open datasets, rules and baselines," arXiv preprint arXiv:2011.06724, 2020.

[39] Daniel Povey, Gaofeng Cheng, Yiming Wang, Ke Li, Hainan Xu, Mahsa Yarmohammadi, and Sanjeev Khudanpur, "Semi-orthogonal low-rank matrix factorization for deep neural networks," in Interspeech, 2018, pp. 3743–3747.

[40] Fei Wu, Leibny Paola García-Perera, Daniel Povey, and Sanjeev Khudanpur, "Advances in automatic speech recognition for child speech using factored time delay neural network," in Interspeech, 2019, pp. 1–5.

[41] Daniel Povey, Vijayaditya Peddinti, Daniel Galvez, Pegah Ghahremani, Vimal Manohar, Xingyu Na, Yiming Wang, and Sanjeev Khudanpur, "Purely sequence-trained neural networks for ASR based on lattice-free MMI," in Interspeech, 2016, pp. 2751–2755.

[42] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," arXiv preprint arXiv:1512.03385, 2015.

[43] Wayne Xiong, Jasha Droppo, Xuedong Huang, Frank Seide, Mike Seltzer, Andreas Stolcke, Dong Yu, and Geoffrey Zweig, "Achieving human parity in conversational speech recognition," arXiv preprint arXiv:1610.05256, 2016.

[44] George Saon, Gakuto Kurata, Tom Sercu, Kartik Audhkhasi, Samuel Thomas, Dimitrios Dimitriadis, Xiaodong Cui, Bhuvana Ramabhadran, Michael Picheny, Lynn-Li Lim, et al., "English conversational telephone speech recognition by humans and machines," arXiv preprint arXiv:1703.02136, 2017.

[45] Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le, "SpecAugment: A simple data augmentation method for automatic speech recognition," arXiv preprint arXiv:1904.08779, 2019.

[46] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton, "Layer normalization," arXiv preprint arXiv:1607.06450, 2016.

[47] Awni Hannun, Ann Lee, Qiantong Xu, and Ronan Collobert, "Sequence-to-sequence speech recognition with time-depth separable convolutions," arXiv preprint arXiv:1904.02619, 2019.

[48] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.

[49] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.

[50] Shigeki Karita, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki, Nelson Enrique Yalta Soplin, Ryuichi Yamamoto, Xiaofei Wang, et al., "A comparative study on Transformer vs RNN in speech applications," in ASRU. IEEE, 2019, pp. 449–456.

[51] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Łukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran, "Image Transformer," arXiv preprint arXiv:1802.05751, 2018.

[52] Angela Fan, Edouard Grave, and Armand Joulin, "Reducing Transformer depth on demand with structured dropout," arXiv preprint arXiv:1909.11556, 2019.

[53] Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier, "Language modeling with gated convolutional networks," in International Conference on Machine Learning. PMLR, 2017, pp. 933–941.

[54] Tatiana Likhomanenko, Gabriel Synnaeve, and Ronan Collobert, "Who needs words? Lexicon-free speech recognition," arXiv preprint arXiv:1904.04479, 2019.

[55] Wayne Ward, Ronald Cole, Daniel Bolaños, Cindy Buchenroth-Martin, Edward Svirsky, Sarel Van Vuuren, Timothy Weston, Jing Zheng, and Lee Becker, "My Science Tutor: A conversational multimedia virtual tutor for elementary school science," ACM Transactions on Speech and Language Processing (TSLP), vol. 7, no. 4, pp. 1–29, 2011.

[56] Wayne Ward, Ron Cole, and Sameer Pradhan, "My Science Tutor and the MyST corpus," 2019.

[57] Khaldoun Shobaki, John-Paul Hosom, and Ronald A. Cole, "The OGI Kids' speech corpus and recognizers," in Sixth International Conference on Spoken Language Processing, 2000, vol. 4, pp. 258–261.

[58] Matteo Gerosa, Diego Giuliani, and Shrikanth Narayanan, "Acoustic analysis and automatic recognition of spontaneous children's speech," in Ninth International Conference on Spoken Language Processing, 2006.

[59] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al., "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, 2011.

[60] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in ICASSP. IEEE, 2015, pp. 5206–5210.

[61] Vineel Pratap, Awni Hannun, Qiantong Xu, Jeff Cai, Jacob Kahn, Gabriel Synnaeve, Vitaliy Liptchinsky, and Ronan Collobert, "wav2letter++: The fastest open-source speech recognition system," arXiv preprint arXiv:1812.07625, 2018.

[62] Taku Kudo and John Richardson, "SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing," arXiv preprint arXiv:1808.06226, 2018.

[63] Kenneth Heafield, "KenLM: Faster and smaller language model queries," in Proceedings of the Sixth Workshop on Statistical Machine Translation, 2011, pp. 187–197.

[64] Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli, "fairseq: A fast, extensible toolkit for sequence modeling," arXiv preprint arXiv:1904.01038, 2019.