LSTM Acoustic Models Learn to Align and Pronounce with Graphemes
Arindrima Datta¹, Guanlong Zhao²*, Bhuvana Ramabhadran¹, Eugene Weinstein¹
¹Google, Inc., New York, NY, U.S.A.
²Texas A&M University, College Station, TX, U.S.A.
{arindrimadatta, bhuv, weinstein}@google.com, [email protected]
*Work done at Google NYC.

Abstract
Automated speech recognition coverage of the world's languages continues to expand. However, standard phoneme-based systems require handcrafted lexicons that are difficult and expensive to obtain. To address this problem, we propose a training methodology for a grapheme-based speech recognizer that can be trained in a purely data-driven fashion. Built with LSTM networks and trained with the cross-entropy loss, the grapheme-output acoustic models we study are also extremely practical for real-world applications, as they can be decoded with conventional ASR stack components such as language models and FST decoders, and produce good-quality audio-to-grapheme alignments that are useful in many speech applications. We show that the grapheme models are competitive in WER with their phoneme-output counterparts when trained on large datasets, with the advantage that grapheme models do not require explicit linguistic knowledge as an input. We further compare the alignments generated by the phoneme and grapheme models to demonstrate the quality of the pronunciations learnt by them, using four Indian languages that vary linguistically in spoken and written forms.
Index Terms: acoustic modeling, grapheme, alignment
1. Introduction
Automated speech recognition (ASR) performance has seen rapid improvements in recent years. Conventional speech recognizers are comprised of four main components: acoustic, language, and pronunciation (lexicon) models, and search decoders. The acoustic and language models are statistical models that have been improved over the last few decades with newer algorithms and increased training data. In contrast, the lexicon is usually manually generated using a dictionary of human-transcribed word pronunciations. Human-curated dictionaries suffer from various challenges, such as the prohibitive cost involved with acquiring pronunciations for the different dialects and accents of a language, as well as for new languages.

To address these challenges, there have been several statistical approaches aimed at automated unit and lexicon discovery from speech audio [1, 2] and grapheme-to-phoneme (g2p) conversion [3], more recently with the use of Long Short-Term Memory (LSTM) networks [4]. However, these models still need to be trained on manually curated pronunciation dictionaries. Grapheme models have also been used to address the challenges of acquiring lexica. For instance, in the IARPA-sponsored BABEL program, which focused on the rapid development of ASR and keyword-spotting systems for low-resource languages, graphemic systems were widely used [5-9], and it was shown that graphemic ASR systems can yield similar performance to phonemic systems. In this paper, we aim to build graphemic speech recognizers with large-scale speech data for high-resource languages, but without the use of lexica.

LSTM cells can model varying contexts ("memory") and have been successful in a number of sequence prediction tasks [10-12]. In ASR, grapheme LSTMs have been used with the Connectionist Temporal Classification (CTC) loss [13, 14] in [15-17], and with the lattice-free MMI objective in [18], to obtain competitive performance in speech recognition. Encoder-decoder based grapheme models [10, 12] performing end-to-end (E2E) speech recognition have also been proven to be competitive with conventional models.

As these models become increasingly popular, an understanding of the alignment they produce between speech frames and linguistic elements such as phonemes, graphemes, or words is important not only for ASR but also for tasks such as keyword spotting [19, 20], grapheme-to-phoneme conversion [21], captioning [22, 23] and speech synthesis [24]. While graphemic models have been studied in the traditional HMM-GMM framework [25, 26], the literature does not offer a thorough understanding of the quality of the graphemic alignments produced by these models. Recently, [27] explored the use of word alignments produced by direct acoustics-to-word models trained with a cross-entropy objective. However, word-level modeling of speech is problematic due to out-of-vocabulary (OOV) issues [22, 28], which we can avoid by using graphemes instead.

Motivated by the success of graphemic LSTM neural networks, in this paper we study their ability to implicitly learn alignments and pronunciations while being trained to recognize speech.
Using four Indian languages consisting of 2K-18K hours of speech data, with varying phonetic orthography and accents, we demonstrate the ability of LSTMs to align grapheme sequences with spoken audio. Our cross-entropy (CE) training of these models allows alignment quality to be explicitly optimized as part of the training objective, resulting in high-quality acoustic-to-grapheme alignments. Furthermore, we analyze and compare the quality of alignments between the grapheme and phoneme models, and attempt to understand the pronunciation-modeling capability of these graphemic LSTMs. To the best of our knowledge, this is one of the first pieces of work on large-scale data that specifically attempts to interpret the pronunciations and alignments learnt by graphemic neural networks.

Our proposed training methodology also allows flat starting, and is purely data-driven. Therefore, it offers the benefit of creating speech recognizers for new languages, a use-case where grapheme models have shown the biggest potential. Our grapheme acoustic models (AMs), trained with large-scale speech data, achieve WERs in speech recognition similar to those of phoneme models while avoiding the use of manually curated lexicons. Additionally, due to their ability to be combined with conventional speech-modeling elements such as language models and finite-state transducer (FST) decoders, they offer a straightforward launch path, and are thus advantageous over grapheme-output E2E models.
2. Methods
In this section, we describe the graphemic lexicon and the GMM-based graphemic alignments used to flat start the graphemic LSTM acoustic models.

2.1. Graphemic Lexicon
A graphemic lexicon is a mapping between graphemes and words, with a special "<space>" grapheme acting as the word boundary. The set of graphemes for a language is generated by enumerating the characters empirically observed in the training data, after removing the very rare graphemes (< 10 occurrences) and those without any clear acoustic realization (such as emojis). Utterances containing the excluded graphemes are dropped from the training data. The number of graphemes used in each language is presented in the first column of Table 1.
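As an illustration, a sketch of this inventory construction is given below; the 10-occurrence threshold is from the text, while the transcript format and the acoustic-realization filter are our own hypothetical choices:

    from collections import Counter
    import unicodedata

    MIN_COUNT = 10  # rare-grapheme threshold from the text above

    def has_acoustic_realization(ch):
        # Hypothetical filter: keep letters and combining marks, drop
        # symbols (e.g., emoji) with no clear acoustic realization.
        return unicodedata.category(ch)[0] in ("L", "M")

    def build_grapheme_set(transcripts):
        """Enumerate the graphemes empirically observed in training transcripts."""
        counts = Counter(ch for text in transcripts for ch in text if not ch.isspace())
        return {
            ch for ch, n in counts.items()
            if n >= MIN_COUNT and has_acoustic_realization(ch)
        }

    def to_graphemic_pronunciation(word):
        # A word's "pronunciation" is simply its character sequence;
        # "<space>" marks the word boundary in the lexicon.
        return list(word) + ["<space>"]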
2.2. LSTM Acoustic Model

LSTMs can model long-range temporal dependencies, which makes them particularly suited for ASR. In real-time streaming ASR applications such as voice search, we prefer uni-directional over bi-directional LSTMs [15], because the former are able to perform recognition in an online fashion. The LSTM AM takes the input feature sequence x = (x_1, ..., x_T) and estimates the output posteriors over a pre-defined label set, l = (l_1, ..., l_T). The LSTM parameters can thus be estimated by minimizing the cross-entropy (CE) loss [29] on all frames of each input utterance x with its corresponding frame-level alignment l:

$$\mathcal{L}_{\mathrm{CE}} = -\sum_{(\mathbf{x},\mathbf{l})} \sum_{l,t} \delta(l, l_t)\, \log y_t^l, \qquad (1)$$

where δ(·,·) is the Kronecker delta, and y_t^l is the network output activation for label l at time t. Cross-entropy trained LSTM acoustic models (AMs) are particularly appealing due to the simplicity and power of this loss function [30], a learning objective that lends itself well to efficient implementation for diverse model architectures and hardware [31-33].

2.3. Flat Start with GMM Alignments

Conventional phoneme-based systems use an existing speech recognizer to generate forced alignments that give the boundaries of the phone segments, thus assigning phoneme labels to the frames. Similarly, for CE-based grapheme AMs, we need frame-level graphemic alignments to provide the training labels. The flat-starting approach used in conventional phoneme-based models (e.g., [34]) can be applied to graphemes as well. Flat starting is particularly useful for training speech recognizers for new languages, which is often the use-case for a grapheme-based model.

Since GMMs are easy to train and have been studied extensively in the literature, we use them to generate initial alignments for the grapheme systems, using a training methodology similar to [35]. The speech signal is segmented evenly according to the target graphemic sequence, and the GMMs are trained to predict 3-state HMM context-independent graphemes using the Expectation-Maximization (EM) algorithm. The input features are perceptual linear predictive (PLP) features with deltas and delta-deltas [35], and the GMMs are trained with 14 mixtures per grapheme on less than 6% of the data for each language.
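To make the even-segmentation flat start concrete, here is a minimal sketch under our own simplifying assumptions (frame count in, one HMM-state label per frame out); the function name and data layout are hypothetical, not the production implementation:

    def even_flat_start_alignment(num_frames, grapheme_seq, states_per_grapheme=3):
        """Evenly split an utterance's frames across its grapheme HMM states.

        Returns one HMM-state label per frame, e.g. ('k', 1) for the middle
        state of grapheme 'k'. This uniform alignment seeds the first EM
        iteration of GMM training; later iterations re-align the data.
        """
        states = [(g, s) for g in grapheme_seq for s in range(states_per_grapheme)]
        alignment = []
        for t in range(num_frames):
            # Map frame t proportionally onto the state sequence.
            idx = t * len(states) // num_frames
            alignment.append(states[idx])
        return alignment

    # Example: 12 frames of audio for the word "cat" -> one state per frame.
    print(even_flat_start_alignment(12, ["c", "a", "t", "<space>"]))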
Table 1: Number of graphemes and phonemes per language.

Table 2: Training and test data size.

Table 3: WER: Phoneme (P) and Grapheme (G) models after CE and sMBR training.

Language | CE-P | CE-G | sMBR-P | sMBR-G
Bengali  | 36.7 | 41.7 | 29.2   | 31.0 (+1.8)
Tamil    | 41.5 | 43.2 | 30.3   | 31.9 (+1.7)
Hindi    | 33.1 | 38.1 | 25.6   | 28.1 (+2.5)
English  | 23.8 | 24.8 | 15.2   | 18.6 (+3.4)

Table 4: WER: Acoustic models trained with GMM vs. CE-LSTM alignments, with Phoneme (P) and Grapheme (G) targets.

Language | GMM-P | GMM-G       | CE-P | CE-G
Bengali  | 36.7  | 41.7 (+5.0) | 32.9 | 35.7 (+2.8)
Tamil    | 41.5  | 43.2 (+1.7) | 36.8 | 37.1 (+0.3)
Hindi    | 33.1  | 38.1 (+5.0) | 30.7 | 34.2 (+3.5)
English  | 23.8  | 24.8 (+1.0) | 22.6 | 23.1 (+0.5)
3. Experimental Setup and Results
For each of the four languages, Bengali, Tamil, Hindi and Indian English, we used the same amount of data for training the grapheme and phoneme models. Our training data consisted of anonymized, human-transcribed utterances representative of Google's traffic. The amount of transcribed data per language is tabulated in Table 2. To achieve noise robustness, the training data is augmented with varying degrees of noise and reverberation such that the overall SNR is between 0 dB and 30 dB, and the average SNR is 12 dB [36]. The noise sources are from YouTube and daily-life noisy environmental recordings.

The phoneme models were trained with the same recipe (Section 2) but using phoneme targets instead. All experiments used 80-dimensional log-mel features, computed with a 25 ms window and shifted every 10 ms. These features were stacked with 7 frames to the left and down-sampled to a 30 ms frame rate, following [37]. We used 5-layer x 768 uni-directional LSTMs for cross-entropy training, which was followed by state-level minimum Bayes risk (sMBR) training [38], using the same model architecture. Standard FST-based beam-search decoders with 5-gram language models from the target languages were used for decoding.
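As an illustration of the feature pipeline described above, the following sketch stacks pre-computed log-mel frames and down-samples them; the array-based interface is our own simplification of the real front end:

    import numpy as np

    def stack_and_downsample(log_mel, left_context=7, decimation=3):
        """Stack each 10 ms frame with 7 left-context frames (8 x 80 = 640 dims),
        then keep every 3rd stacked frame for an effective 30 ms frame rate."""
        T, D = log_mel.shape  # T frames of 80-dim log-mel features
        # Pad the start by repeating the first frame so early frames have context.
        padded = np.concatenate([np.tile(log_mel[:1], (left_context, 1)), log_mel])
        stacked = np.stack(
            [padded[t : t + left_context + 1].reshape(-1) for t in range(T)]
        )
        return stacked[::decimation]

    feats = stack_and_downsample(np.random.randn(100, 80))
    print(feats.shape)  # (34, 640)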
Figure 1: Grapheme alignment example for the utterance "home theatres". From top to bottom: speech waveform, word alignment by the phoneme model, phoneme alignment, word alignment by the grapheme model, and grapheme alignment.
Figure 2: Alignment confusion matrix for Indian English. Only a subset of phonemes/graphemes are kept for better visualization.
Since Indian languages frequently contain Latin characters in the transcripts, all our models were evaluated after normalization for transliteration errors (transliterated WER [39]). In other words, if the model correctly decoded a word in the Latin alphabet but the reference was in the native script (or vice versa), this was not considered an error.

Table 3 presents the WER of the phoneme and grapheme models after CE and sMBR training, and we see a marked improvement in performance from sMBR training for both model types. P and G refer to the Phoneme and Grapheme models in both tables.
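To illustrate the idea of transliteration-normalized scoring (a simplified stand-in for the approach of [39], not its exact procedure), the sketch below maps hypothesis and reference words through a hypothetical transliteration table before computing the standard edit-distance WER:

    # Toy equivalence table: native-script word -> Latin equivalent.
    TRANSLIT = {"সং": "song"}

    def normalize(word):
        return TRANSLIT.get(word, word)

    def wer(ref, hyp):
        """Word error rate via edit distance over transliteration-normalized words."""
        r = [normalize(w) for w in ref]
        h = [normalize(w) for w in hyp]
        d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
        for i in range(len(r) + 1):
            d[i][0] = i
        for j in range(len(h) + 1):
            d[0][j] = j
        for i in range(1, len(r) + 1):
            for j in range(1, len(h) + 1):
                sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
        return d[len(r)][len(h)] / max(len(r), 1)

Under this scheme, decoding "সং" against the reference "song" counts as a match rather than a substitution.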
4. Alignments: Phoneme vs Grapheme
In order to study the alignments generated by the graphemic models, we compared them with phonemic alignments. An example of this comparison is presented in Fig. 1. More generally, we generated the graphemic and phonemic alignments on the same set of utterances, which are presented as confusion matrices for English in Fig. 2 and for Bengali in Fig. 3. The matrices are normalized by grapheme distribution (i.e., by columns). We can see that for Bengali, a language with strong phoneme-grapheme correspondence, the graphemes match their corresponding phonemes. For example, the Bengali symbol "জ" is pronounced as "j" (as in "jug"), and is matched with the Bengali phoneme /dʒ/. For English, a language with irregular orthography, we observe that the grapheme alignments for consonants still mostly match, although the correspondence is weaker for vowels, as can be seen in the confusion matrix in Fig. 2.

Figure 3: Alignment confusion matrix for Bengali. Only a subset of phonemes/graphemes are kept for better visualization.
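The column normalization used for these confusion matrices can be sketched as follows, assuming a phoneme-by-grapheme count matrix (rows indexing phonemes, columns indexing graphemes), which is our reading of the layout in Figs. 2-3:

    import numpy as np

    def normalize_by_grapheme(confusion):
        """Normalize a phoneme-by-grapheme count matrix so that each grapheme
        column sums to 1, i.e., each column becomes the distribution over
        phonemes that the grapheme aligns to."""
        counts = np.asarray(confusion, dtype=float)
        col_sums = counts.sum(axis=0, keepdims=True)
        return counts / np.maximum(col_sums, 1e-12)  # guard empty columns

    m = normalize_by_grapheme([[8, 1], [2, 9]])
    print(m.round(2))  # each column sums to 1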
5. Pronunciation in Grapheme Models
The languages studied in this paper come from three different regions of India and vary characteristically in pronunciation and orthography. Additionally, Indian English has its own distinctive accent, which is influenced by the native language of the speaker. Naturally, for a grapheme recognizer that excludes the use of a lexicon, we would like to understand how well the pronunciations are learnt. We carry out this exploration using the alignments generated by the grapheme models.
We investigate the grapheme-to-phoneme correspondence for two dialects (British and Indian English) using alignments from models trained on Indian and British English respectively, in Fig. 4.
Figure 4: Some examples of differences in the alignment confusion matrices between British (GB) and Indian (IN) English. Note that in our phoneme set, English (IN) does not have the phoneme "w."
Table 5: Agreement score vs. the grapheme-phoneme (G-P) WER gap.

Language | Agreement Score | (G-P) WER
Bengali  | 46.7%           | 1.8
Tamil    | 35.9%           | 1.7
Hindi    | 22.6%           | 2.5
English  | 9.4%            | 3.4

Between the two dialects, we see clear differences in the grapheme-phoneme mapping; for instance, the grapheme "w" gets mapped to the phoneme "w" for British English, and to the phoneme "v" for Indian English. Likewise, the grapheme "t" is also mapped to different phonemes for the two dialects. As these sounds are characteristic of Indian English, we can confirm that accented pronunciations are learnt reliably by the grapheme recognizer simply from the training data.

Another important characteristic of Indian speech data is the use of words from multiple languages in the same utterance (code-switching). To better understand the pronunciation capability of grapheme models in this setup, we performed a few variants of the experiments reported above. When trained using graphemes and data from two languages (English and Bengali), we observed that the phonemes (e.g., /dʒ/) were mapped to the correct graphemes of both languages (Fig. 3), demonstrating the ability of the recognizer to learn a sound that may be represented by graphemes of more than one language.

We also considered how the model behaves when constrained to the grapheme targets of a single language while being trained and tested with data from two languages. In an experiment where the training and test data contained a mixture of Latin and Bangla script, and the model was provided with only the Bangla graphemes, we observed that the WER (after transliteration normalization) stayed the same (41.7, row 1, column 2 in Table 3), and the model learned to output phonetically equivalent English words in Bangla script (e.g., "সং" instead of "song"). These observations led us to believe that the grapheme models were inherently learning the script-agnostic phonetic pronunciations of the audio, in addition to the graphemic representations (spelling).
6. Error Analysis: Grapheme Models
Qualitative analysis of the mistakes made by the grapheme models led us to believe that graphemes that did not have a strong correspondence with any specific phoneme were confused more often. These mistakes surfaced as confusions in similar-sounding words (homophones) or in groups of words. For instance, the spellings of the "u" sound in "stood" and "student" were frequently interchanged. Similarly, the Bengali grapheme "◌ং", pronounced as "ng", was replaced with "ন", pronounced as "n" (and vice versa). To quantify this effect, we computed a metric, the "agreement score," which we define as the fraction of graphemes aligning in at least half of their occurrences with a single phoneme:

$$\text{agreement score} = \frac{\sum_{g \in G} \delta_g}{|G|}, \qquad (2)$$

where

$$\delta_g = \begin{cases} 1, & \text{if } \exists\, p \in P,\ f(g, p) \geq 0.5, \\ 0, & \text{otherwise,} \end{cases} \qquad (3)$$

for the phoneme set P and grapheme set G, where f(g, p) is the fraction of occurrences of grapheme g that align with phoneme p.

Calculating the agreement score on 100K utterances for all four languages (Table 5), we observed that the higher the agreement score, the lower the WER difference between the phoneme and grapheme models. This suggests that grapheme models might perform better on languages that have a more regular phonemic orthography (e.g., Spanish, Finnish, Polish, etc.). Furthermore, we hypothesize that by improving the grapheme-to-phoneme correspondence in the training data, we can improve the performance of a grapheme model.
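A minimal sketch of the agreement-score computation in Eqs. (2)-(3), assuming alignments are available as (grapheme, phoneme) co-alignment pairs, which is our own simplified input format:

    from collections import Counter, defaultdict

    def agreement_score(aligned_pairs):
        """Fraction of graphemes that align with a single phoneme in at
        least half of their occurrences (Eqs. 2-3).

        aligned_pairs: iterable of (grapheme, phoneme) co-alignments.
        """
        per_grapheme = defaultdict(Counter)
        for g, p in aligned_pairs:
            per_grapheme[g][p] += 1
        agree = sum(
            1 for counts in per_grapheme.values()
            if counts.most_common(1)[0][1] / sum(counts.values()) >= 0.5
        )
        return agree / len(per_grapheme)

    pairs = [("w", "v"), ("w", "v"), ("w", "w"), ("t", "t"), ("t", "ʈ")]
    # "w" agrees (2/3 >= 0.5) and "t" agrees (1/2 >= 0.5), so the score is 1.0.
    print(agreement_score(pairs))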
7. Summary
Building a speech recognition system for a new language is always a challenge, especially in acquiring a pronunciation lexicon informed by linguistic knowledge. In this paper, we propose a simple methodology for training a grapheme-based recognizer from scratch, with efficient training made possible by hardware acceleration. While graphemic LSTM-based recognizers have been previously explored, our proposed technique offers the additional advantage of generating good-quality audio-to-grapheme alignments that are valuable for many speech applications such as captioning, keyword spotting and speech synthesis.

Furthermore, using four Indian languages with large-scale speech data, Bengali, Hindi, Tamil and Indian English, we show that our graphemic models can:

• Model the phonetic pronunciations independent of script and orthography.
• Capture pronunciations from different accents.
• Align a particular phonetic realization with multiple graphemes, as seen in code-switched languages.
• Achieve performance comparable to lexicon-based models with the same number of parameters.
• Be combined with language models and FST decoders for a straightforward launch path for applied real-world usage.

We believe that our investigation will shed new light on how graphemes can be used for alternate lexicon-free speech recognition systems.
8. Acknowledgments
We would like to thank Tara Sainath, Ehsan Variani, Seungji Lee, Mikaela Grace, and Pedro Moreno for their advice and guidance throughout this project.
9. References

[1] B. Ramabhadran, L. R. Bahl, M. Padmanabhan et al., "Acoustics-only based automatic phonetic baseform generation," in Proc. ICASSP, vol. 1. IEEE, 1998, pp. 309–312.
[2] L. Lu, A. Ghoshal, and S. Renals, "Acoustic data-driven pronunciation lexicon for large vocabulary speech recognition," in Proc. ASRU, 2013, pp. 374–379.
[3] S. F. Chen, "Conditional and joint models for grapheme-to-phoneme conversion," in Proc. Eurospeech, 2003.
[4] K. Rao, F. Peng, H. Sak, and F. Beaufays, "Grapheme-to-phoneme conversion using long short-term memory recurrent neural networks," in Proc. ICASSP. IEEE, 2015, pp. 4225–4229.
[5] V.-B. Le, L. Lamel, A. Messaoudi, W. Hartmann, J.-L. Gauvain, C. Woehrling, J. Despres, and A. Roy, "Developing STT and KWS systems using limited language resources," in Proc. Interspeech, 2014.
[6] H. Wang, A. Ragni, M. J. Gales, K. M. Knill, P. C. Woodland, and C. Zhang, "Joint decoding of tandem and hybrid systems for improved keyword spotting on low resource languages," in Proc. Interspeech, 2015.
[7] P. Golik, Z. Tüske, R. Schlüter, and H. Ney, "Multilingual features based keyword search for very low-resource languages," in Proc. Interspeech, 2015.
[8] J. Trmal, M. Wiesner, V. Peddinti, X. Zhang, P. Ghahremani, Y. Wang, V. Manohar, H. Xu, D. Povey, and S. Khudanpur, "The Kaldi OpenKWS system: Improving low resource keyword search," in Proc. Interspeech, 2017, pp. 3597–3601.
[9] M. J. Gales, K. M. Knill, and A. Ragni, "Unicode-based graphemic systems for limited resource languages," in Proc. ICASSP. IEEE, 2015, pp. 5186–5190.
[10] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in Proc. ICASSP. IEEE, 2016, pp. 4960–4964.
[11] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.
[12] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, "Attention-based models for speech recognition," in Advances in Neural Information Processing Systems, 2015, pp. 577–585.
[13] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in Proc. ICML. ACM, 2006, pp. 369–376.
[14] A. Graves and N. Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," in Proc. ICML, 2014, pp. 1764–1772.
[15] K. Rao and H. Sak, "Multi-accent speech recognition with hierarchical grapheme based models," in Proc. ICASSP. IEEE, 2017, pp. 4815–4819.
[16] J. Cui et al., "Knowledge distillation across ensembles of multilingual models for low-resource languages," in Proc. ICASSP. IEEE, 2017, pp. 4825–4829.
[17] A. Rosenberg, K. Audhkhasi, A. Sethy, B. Ramabhadran, and M. Picheny, "End-to-end speech recognition and keyword search on low-resource languages," in Proc. ICASSP. IEEE, 2017, pp. 5280–5284.
[18] Y. Wang, X. Chen, M. Gales, A. Ragni, and J. Wong, "Phonetic and graphemic systems for multi-genre broadcast transcription," arXiv preprint arXiv:1802.00254, 2018.
[19] M. Sun, A. Raju, G. Tucker, S. Panchapagesan, G. Fu, A. Mandal, S. Matsoukas, N. Strom, and S. Vitaladevuni, "Max-pooling loss training of long short-term memory networks for small-footprint keyword spotting," in Proc. SLT. IEEE, 2016, pp. 474–480.
[20] K. Audhkhasi, A. Rosenberg, A. Sethy, B. Ramabhadran, and B. Kingsbury, "End-to-end ASR-free keyword search from speech," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1351–1359, 2017.
[21] P. Jyothi and M. Hasegawa-Johnson, "Low-resource grapheme-to-phoneme conversion using recurrent neural networks," in Proc. ICASSP. IEEE, 2017, pp. 5030–5034.
[22] H. Soltau, H. Liao, and H. Sak, "Neural speech recognizer: Acoustic-to-word LSTM model for large vocabulary speech recognition," arXiv preprint arXiv:1610.09975, 2016.
[23] M. Federico and M. Furini, "An automatic caption alignment mechanism for off-the-shelf speech recognition technologies," Multimedia Tools and Applications, vol. 72, no. 1, pp. 21–40, 2014.
[24] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio et al., "Tacotron: Towards end-to-end speech synthesis," arXiv preprint arXiv:1703.10135, 2017.
[25] M. Killer, S. Stüker, and T. Schultz, "Grapheme based speech recognition," in Proc. Eurospeech, 2003.
[26] Y.-H. Sung, T. Hughes, F. Beaufays, and B. Strope, "Revisiting graphemes with increasing amounts of data," in Proc. ICASSP. IEEE, 2009, pp. 4449–4452.
[27] C. Zhang, C. Yu, C. Weng, J. Cui, and D. Yu, "An exploration of directly using word as acoustic modeling unit for speech recognition," in Proc. SLT. IEEE, 2018, pp. 64–69.
[28] K. Audhkhasi, B. Ramabhadran, G. Saon, M. Picheny, and D. Nahamoo, "Direct acoustics-to-word models for English conversational speech recognition," arXiv preprint arXiv:1703.07754, 2017.
[29] H. Sak, A. Senior, K. Rao, O. Irsoy, A. Graves, F. Beaufays, and J. Schalkwyk, "Learning acoustic frame labeling for speech recognition with recurrent neural networks," in Proc. ICASSP. IEEE, 2015, pp. 4280–4284.
[30] R. Y. Rubinstein and D. P. Kroese, The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation and Machine Learning. Springer Science & Business Media, 2013.
[31] A. Graves, N. Jaitly, and A.-r. Mohamed, "Hybrid speech recognition with deep bidirectional LSTM," in Proc. ASRU. IEEE, 2013, pp. 273–278.
[32] S. Wiesler, J. Li, and J. Xue, "Investigations on Hessian-free optimization for cross-entropy training of deep neural networks," in Proc. Interspeech, 2013, pp. 3317–3321.
[33] K. Veselý, L. Burget, and F. Grézl, "Parallel training of neural networks for speech recognition," in International Conference on Text, Speech and Dialogue. Springer, 2010, pp. 439–446.
[34] A. Senior, G. Heigold, M. Bacchiani, and H. Liao, "GMM-free DNN acoustic model training," in Proc. ICASSP. IEEE, 2014.
[35] M. Elfeky, P. Haghani, S. Lee, E. Weinstein, and P. Moreno, "Employing context-independent GMMs to flat start context-dependent CTC acoustic models," in Proc. ICNLSSP, 2017, pp. 16–20.
[36] C. Kim, A. Misra, K. Chin, T. Hughes, A. Narayanan, T. N. Sainath, and M. Bacchiani, "Generation of large-scale simulated utterances in virtual rooms to train deep-neural networks for far-field speech recognition in Google Home," in Proc. Interspeech, 2017.
[37] H. Sak, A. Senior, K. Rao, and F. Beaufays, "Fast and accurate recurrent neural network acoustic models for speech recognition," arXiv preprint arXiv:1507.06947, 2015.
[38] B. Kingsbury, "Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling," in Proc. ICASSP. IEEE, 2009, pp. 3761–3764.
[39] J. Emond, B. Ramabhadran, B. Roark, P. Moreno, and M. Ma, "Transliteration based approaches to improve code-switched speech recognition performance," in Proc. SLT. IEEE, 2018.