Extracting Domain Invariant Features by Unsupervised Learning for Robust Automatic Speech Recognition
Wei-Ning Hsu, James Glass
MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA 02139, USA
{wnhsu,glass}@mit.edu

ABSTRACT
The performance of automatic speech recognition (ASR) systems can be significantly compromised by previously unseen conditions, which is typically due to a mismatch between training and testing distributions. In this paper, we address robustness by studying domain invariant features, such that domain information becomes transparent to ASR systems, resolving the mismatch problem. Specifically, we investigate a recent model, called the Factorized Hierarchical Variational Autoencoder (FHVAE). FHVAEs learn to factorize sequence-level and segment-level attributes into different latent variables without supervision. We argue that the set of latent variables that contain segment-level information is our desired domain invariant feature for ASR. Experiments are conducted on Aurora-4 and CHiME-4, which demonstrate 41% and 27% absolute word error rate reductions respectively on mismatched domains.
Index Terms— robust speech recognition, factorized hierarchical variational autoencoder, domain invariant representations
1. INTRODUCTION
Recently, neural network-based acoustic models [1, 2, 3] have greatly improved the performance of automatic speech recognition (ASR) systems. Unfortunately, it is well known (e.g., [4]) that ASR performance can degrade significantly when testing in a domain that is mismatched from training. A major reason is that speech data have complex distributions and contain information about not only linguistic content, but also speaker identity, background noise, room characteristics, etc. Among these sources of variability, only a subset are relevant to ASR, while the rest can be considered a nuisance and therefore hurt performance if the distributions of these attributes are mismatched between training and testing.

To alleviate this issue, some robust ASR research focuses on mapping out-of-domain data to in-domain data using enhancement-based methods [5, 6, 7], which generally require parallel data from both domains. Another popular strategy is to train an ASR system with as large and as diverse a dataset as possible [8, 9]; however, this strategy is not feasible when labeled data are not available for all domains. Alternatively, robustness can also be achieved by training on features that are domain invariant [10, 11, 12, 13, 14]. In this case, we would not have domain mismatch issues, because domain information is transparent to the ASR system.

In this paper, we consider the same highly adverse scenario as in [4], where both clean and noisy speech are available, but transcripts are only available for clean speech. We study the use of a recently proposed model, called the Factorized Hierarchical Variational Autoencoder (FHVAE) [15], for learning domain invariant ASR features without supervision. FHVAE models learn to factorize sequence-level attributes and segment-level attributes into different latent variables. By training an ASR system on the latent variables that encode segment-level attributes, and testing the ASR in mismatched domains, we demonstrate that these latent variables contain linguistic information and are more domain invariant. Comprehensive experiments study the effect of different FHVAE architectures, training strategies, and the use of derived domain features on the robustness of ASR systems. Our proposed method is evaluated on the Aurora-4 [16] and CHiME-4 [17] datasets, which contain artificially corrupted noisy speech and real noisy speech respectively. The proposed FHVAE-based feature reduces the absolute word error rate (WER) by 27% to 41% compared to filter bank features, and by 14% to 16% compared to variational autoencoder-based features. We have released the code of the FHVAEs described in this paper at https://github.com/wnhsu/FactorizedHierarchicalVAE.

The rest of the paper is organized as follows. In Section 2, we introduce the FHVAE model and a method to extract domain invariant features. Section 3 describes the experimental setup, while Section 4 presents results and discussion. We conclude our work in Section 5.
2. LEARNING DOMAIN INVARIANT FEATURES

2.1. Modeling a Generative Process of Speech Segments
As mentioned above, generation of speech data often involves many independent factors, which are however unseen in the unsupervised setting. It is therefore natural to describe such a generative process using a latent variable model, where a latent variable $z$ is first sampled from a prior distribution, and a speech segment $x$ is then sampled from a distribution conditioned on $z$. In [18], a convolutional variational autoencoder (VAE) is proposed to model such a process; by assuming the prior to be a diagonal Gaussian, it is shown that the VAE automatically learns to model independent attributes regarding generation, such as the speaker identity and the linguistic content, using orthogonal latent subspaces. This result provides a mechanism for potentially learning domain invariant features for ASR by discovering latent variables that do not contain domain information.

The generation of sequential data often involves multiple independent factors operating at different scales. For instance, the speaker identity affects the fundamental frequency (F0) at the utterance level, while the phonetic content affects spectral characteristics at the segment level. As a result, sequence-level attributes, such as F0 and volume, tend to have a smaller amount of variation within an utterance than between utterances, while the other attributes, such as spectral contours, tend to have similar amounts of variation within and between utterances.

Based on this observation, FHVAEs [15] formulate the generative process of sequential data with a factorized hierarchical graphical model that imposes sequence-dependent priors and sequence-independent priors on different sets of latent variables. Specifically, given a dataset $D = \{X^{(i)}\}_{i=1}^{M}$ consisting of $M$ i.i.d. sequences, where $X^{(i)} = \{x^{(i,n)}\}_{n=1}^{N^{(i)}}$ is a sequence of $N^{(i)}$ segments (sub-sequences), a sequence $X$ of $N$ segments is assumed to be generated from a random process that involves latent variables $Z_1 = \{z_1^{(n)}\}_{n=1}^{N}$, $Z_2 = \{z_2^{(n)}\}_{n=1}^{N}$, and $\mu_2$ as follows: (1) an s-vector $\mu_2$ is drawn from a prior distribution $p_\theta(\mu_2) = N(\mu_2 | 0, \sigma_{\mu_2}^2 I)$; (2) $N$ i.i.d. latent segment variables $\{z_1^{(n)}\}_{n=1}^{N}$ and latent sequence variables $\{z_2^{(n)}\}_{n=1}^{N}$ are drawn from a sequence-independent prior $p_\theta(z_1) = N(z_1 | 0, \sigma_{z_1}^2 I)$ and a sequence-dependent prior $p_\theta(z_2 | \mu_2) = N(z_2 | \mu_2, \sigma_{z_2}^2 I)$ respectively; (3) $N$ i.i.d. speech segments $\{x^{(n)}\}_{n=1}^{N}$ are drawn from a conditional distribution $p_\theta(x | z_1, z_2) = N(x | f_{\mu_x}(z_1, z_2), \mathrm{diag}(f_{\sigma^2_x}(z_1, z_2)))$, whose mean and diagonal variance are parameterized by neural networks. The joint probability for a sequence is formulated in Eq. 1:

$$p_\theta(\mu_2) \prod_{n=1}^{N} p_\theta(x^{(n)} | z_1^{(n)}, z_2^{(n)}) \, p_\theta(z_1^{(n)}) \, p_\theta(z_2^{(n)} | \mu_2). \qquad (1)$$

Based on this formulation, $\mu_2$ can be regarded as a summarization of the sequence-level attributes of a sequence, and $z_2$ is encouraged to encode the sequence-level attributes of a segment that are similar within an utterance. Consequently, $z_1$ encodes the residual segment-level attributes of a segment, such that $z_1$ and $z_2$ together provide sufficient information for generating a segment.
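For illustration, the three-step generative process above can be written as a short ancestral-sampling routine. The following NumPy sketch is not the released implementation: the decoder networks $f_{\mu_x}$ and $f_{\sigma^2_x}$ are replaced by toy linear maps, and the dimensionalities and prior variances are placeholder values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the decoder networks; in the paper these are Seq2Seq
# (LSTM) decoders, here they are fixed random linear maps for illustration.
DIM_Z1, DIM_Z2, DIM_X = 32, 32, 20 * 80            # a segment = 20 frames of 80-dim FBank
W = rng.normal(scale=0.1, size=(DIM_Z1 + DIM_Z2, DIM_X))
f_mu_x = lambda z1, z2: np.concatenate([z1, z2]) @ W
f_var_x = lambda z1, z2: 0.1 * np.ones(DIM_X)      # fixed diagonal variance

def sample_sequence(n_segments, var_mu2=1.0, var_z1=1.0, var_z2=0.5):
    """Ancestral sampling of one sequence under the FHVAE generative model."""
    mu2 = rng.normal(0.0, np.sqrt(var_mu2), size=DIM_Z2)          # (1) s-vector, once per sequence
    segments = []
    for _ in range(n_segments):
        z1 = rng.normal(0.0, np.sqrt(var_z1), size=DIM_Z1)        # (2) latent segment variable
        z2 = rng.normal(mu2, np.sqrt(var_z2))                     # (2) latent sequence variable
        x = rng.normal(f_mu_x(z1, z2), np.sqrt(f_var_x(z1, z2)))  # (3) speech segment
        segments.append(x)
    return mu2, segments

mu2, segments = sample_sequence(n_segments=10)
```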
Since exact posterior inference is intractable, FHVAEs introduce an inference model $q_\phi(Z_1^{(i)}, Z_2^{(i)}, \mu_2^{(i)} | X^{(i)})$, formulated in Eq. 2, that approximates the true posterior $p_\theta(Z_1^{(i)}, Z_2^{(i)}, \mu_2^{(i)} | X^{(i)})$:

$$q_\phi(\mu_2^{(i)}) \prod_{n=1}^{N^{(i)}} q_\phi(z_1^{(i,n)} | x^{(i,n)}, z_2^{(i,n)}) \, q_\phi(z_2^{(i,n)} | x^{(i,n)}), \qquad (2)$$

from which we observe that inference of $z_1^{(i,n)}$ and $z_2^{(i,n)}$ only depends on the corresponding segment $x^{(i,n)}$. In particular, the posteriors $q_\phi(z_1 | x, z_2) = N(z_1 | g_{\mu_{z_1}}(x, z_2), \mathrm{diag}(g_{\sigma^2_{z_1}}(x, z_2)))$ and $q_\phi(z_2 | x) = N(z_2 | g_{\mu_{z_2}}(x), \mathrm{diag}(g_{\sigma^2_{z_2}}(x)))$ are approximated with diagonal Gaussian distributions whose mean and diagonal variance are also parameterized by neural networks. On the other hand, $q_\phi(\mu_2^{(i)})$ is modeled as an isotropic Gaussian, $N(\mu_2^{(i)} | g_{\mu_{\mu_2}}(i), \sigma_{\tilde{\mu}_2}^2 I)$, where $g_{\mu_{\mu_2}}(i)$ is a trainable lookup table of the posterior mean of $\mu_2$ for each training sequence. Estimation of $\mu_2$ for testing sequences can be found in [15].

As pointed out in [4], nuisance attributes with respect to ASR, such as speaker identity, room geometry, and background noise, are generally consistent within an utterance. If we treat each utterance as a sequence, these attributes then become sequence-level attributes, which would be encoded by $z_2$ and $\mu_2$. As a result, $z_1$ encodes the residual linguistic information and is invariant to these nuisance attributes, which is our desired domain invariant ASR feature.

As in other generative models, FHVAEs aim to maximize the marginal likelihood of the observed dataset; due to the intractability of the exact posterior, FHVAEs optimize the segment variational lower bound, $\mathcal{L}(\theta, \phi; x^{(i,n)})$, which is formulated as follows:

$$\begin{aligned}
&\mathbb{E}_{q_\phi(z_1^{(i,n)}, z_2^{(i,n)} | x^{(i,n)})}\big[\log p_\theta(x^{(i,n)} | z_1^{(i,n)}, z_2^{(i,n)})\big] \\
&- \mathbb{E}_{q_\phi(z_2^{(i,n)} | x^{(i,n)})}\big[D_{KL}(q_\phi(z_1^{(i,n)} | x^{(i,n)}, z_2^{(i,n)}) \,\|\, p_\theta(z_1^{(i,n)}))\big] \\
&- D_{KL}(q_\phi(z_2^{(i,n)} | x^{(i,n)}) \,\|\, p_\theta(z_2^{(i,n)} | g_{\mu_{\mu_2}}(i))) + \frac{1}{N^{(i)}} \log p_\theta(g_{\mu_{\mu_2}}(i)).
\end{aligned}$$

Notice that if $\mu_2$ were the same for all utterances, an FHVAE would degenerate to a vanilla VAE. To prevent $\mu_2$ from collapsing, we can add an additional discriminative objective, $\log p(i | z_2^{(i,n)})$, that encourages the discriminability of $z_2$ regarding which utterance the segment is drawn from. Specifically, we define it as $\log p_\theta(z_2^{(i,n)} | g_{\mu_{\mu_2}}(i)) - \log \sum_{j=1}^{M} p_\theta(z_2^{(i,n)} | g_{\mu_{\mu_2}}(j))$. By combining the two objectives with a weighting parameter $\alpha$, we obtain the discriminative segment variational lower bound:

$$\mathcal{L}^{dis}(\theta, \phi; x^{(i,n)}) = \mathcal{L}(\theta, \phi; x^{(i,n)}) + \alpha \log p(i | z_2^{(i,n)}). \qquad (3)$$
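Since all terms in the bound are Gaussian log-densities or KL divergences between diagonal Gaussians, Eq. 3 can be computed in closed form once the encoder outputs and the $\mu_2$ lookup table are available. Below is a minimal NumPy sketch of the per-segment objective under that assumption; the helper names (diag_gauss_kl, mu_table, and so on) are illustrative rather than taken from the released code, and $z_2$ is approximated by its posterior mean in the discriminative term.

```python
import numpy as np

def diag_gauss_kl(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, diag var_q) || N(mu_p, diag var_p) ), summed over dimensions."""
    return 0.5 * np.sum(np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def diag_gauss_logpdf(x, mu, var):
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def discriminative_lower_bound(log_px, q_z1, q_z2, mu_table, seq_idx,
                               var_z1=1.0, var_z2=0.5, var_mu2=1.0,
                               n_segments=20, alpha=10.0):
    """Single-sample estimate of L_dis(theta, phi; x) for one segment.

    log_px    : log p(x | z1, z2) evaluated at the sampled latents (reconstruction term)
    q_z1/q_z2 : (mean, variance) of the inferred posteriors for this segment
    mu_table  : posterior means of mu2, one row per training sequence (the lookup table)
    seq_idx   : index i of the sequence this segment was drawn from
    """
    mu_i = mu_table[seq_idx]
    kl_z1 = diag_gauss_kl(*q_z1, np.zeros_like(q_z1[0]), var_z1)   # KL to N(0, var_z1 I)
    kl_z2 = diag_gauss_kl(*q_z2, mu_i, var_z2)                     # KL to N(mu_i, var_z2 I)
    log_p_mu = diag_gauss_logpdf(mu_i, np.zeros_like(mu_i), var_mu2)
    lb = log_px - kl_z1 - kl_z2 + log_p_mu / n_segments

    # discriminative term log p(i | z2): softmax over sequence-specific priors,
    # with z2 approximated by its posterior mean
    z2_mean = q_z2[0]
    log_like = np.array([diag_gauss_logpdf(z2_mean, mu_j, var_z2) for mu_j in mu_table])
    log_p_i = log_like[seq_idx] - np.logaddexp.reduce(log_like)
    return lb + alpha * log_p_i
```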
3. EXPERIMENT SETUP
To evaluate the effectiveness of the proposed method for extracting domain invariant features, we consider domain mismatched ASR scenarios. Specifically, we train an ASR system using a clean set, and test the system on both a clean and a noisy set. The idea is that one would observe a smaller performance discrepancy between domains if the feature representation is more domain invariant. We next introduce the datasets, as well as the model architectures and training configurations for the experiments.
We use Aurora-4 [16] as the primary dataset for our experiments. Aurora-4 is a broadband corpus designed for noisy speech recognition tasks, based on the Wall Street Journal (WSJ0) corpus [19]. Two microphone types (clean and channel) are included, and six noise types are artificially added to both microphone types, which results in four conditions: clean (A), noisy (B), channel (C), and channel+noisy (D). We use the multi-condition development set for training the VAE and FHVAE models, because the development set contains both noise labels and speaker labels for each utterance, which are used in
Exp. Index 5, while the training set only contains speaker labels. The ASR system is trained on the clean train_si84_clean set and evaluated on the multi-condition test_eval92 set.

To verify our proposed method on a non-artificial dataset, we repeat our experiments on the CHiME-4 [17] dataset, which contains real distant-talking recordings in noisy environments. We use the original 7,138 clean utterances and the 1,600 single-channel real noisy utterances in the training partition to train the VAE and FHVAE models. The ASR system is trained on the original clean training set and evaluated on the CHiME-4 development set.
The VAE is trained with stochastic gradient descent using a mini-batch size of 128, without gradient clipping, to minimize the negative variational lower bound plus an L2 regularization term. The Adam [20] optimizer is used, and training is terminated if the lower bound on the development set does not improve for 50 epochs. The FHVAE is trained with the same configuration and optimization method, except that the loss function is replaced with the negative discriminative segment variational lower bound.

Table 1. Aurora-4 test_eval92 set word error rate (%) of acoustic models trained on different features. Exp. Index 1, FBank: Avg. 65.64 (A 3.21, B 61.61, C 51.78, D 82.39).

Seq2Seq-VAE [4] and Seq2Seq-FHVAE [15] architectures with LSTM units are used for all experiments. We let the latent space of the VAEs contain 64 dimensions. Since the FHVAE models have two latent spaces, we let each of them be 32 dimensional. Other hyperparameters are explored in our experiments. Inputs to the VAE/FHVAE, x, are chunks of 20 consecutive speech frames randomly drawn from utterances, where each frame is represented as 80-dimensional filter bank (FBank) energies. To extract features from the VAE and FHVAE for ASR training, for each utterance we compute and concatenate the posterior mean and variance of chunks shifted by one frame, which generates a sequence of new features that is 19 frames shorter than the original sequence (a sketch is given at the end of this section). We pad with the first frame and the last frame at each end to match the original length.

Kaldi [21] is used for feature extraction, decoding, forced alignment, and training of an initial HMM-GMM model on the original clean utterances. The recipe provided by the CHiME-4 challenge (run_gmm.sh) and the Kaldi Aurora-4 recipe are adapted by only changing the training data being used. The Computational Network Toolkit (CNTK) [22] is used for neural network-based acoustic model training. For all experiments, the same LSTM acoustic model [23] with the architecture proposed in [24] is applied, which has 1,024 memory cells and a 512-node projection layer for each LSTM layer, and 3 LSTM layers in total.

Following the training setup in [25], LSTM acoustic models are trained with a cross-entropy criterion, using truncated backpropagation-through-time (BPTT) [26] for optimization. Each BPTT segment contains 20 frames, and each mini-batch contains 80 utterances, since we find empirically that 80 utterances give similar performance to 40 utterances. A momentum of 0.9 is used starting from the second epoch [3]. Ten percent of the training data is held out as a validation set to control the learning rate. The learning rate is halved when no gain is observed after an epoch. The same language model is used for decoding in all experiments.
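To make the feature extraction step concrete, the sketch below assumes a trained $z_1$ encoder is available as a callable that returns the posterior mean and variance for a 20-frame chunk; the function name, the encoder interface, and the even left/right padding split are illustrative assumptions rather than details of the released code.

```python
import numpy as np

def extract_fhvae_features(fbank, encoder_z1, chunk_len=20):
    """Extract frame-level domain invariant features from one utterance.

    fbank      : (T, 80) array of FBank frames
    encoder_z1 : stand-in for the trained z1 encoder; maps a (chunk_len, 80)
                 chunk to (posterior mean, posterior variance), each 32-dim
    returns    : (T, 64) feature matrix aligned with the original frames
    """
    T = len(fbank)
    feats = []
    for t in range(T - chunk_len + 1):               # windows shifted by one frame
        mean, var = encoder_z1(fbank[t:t + chunk_len])
        feats.append(np.concatenate([mean, var]))    # concatenate posterior mean and variance
    feats = np.stack(feats)                          # (T - chunk_len + 1, 64)

    # the feature sequence is chunk_len - 1 frames shorter than the utterance;
    # replicate the boundary features to restore the original length
    # (an even left/right split is assumed here)
    n_pad = chunk_len - 1
    left, right = n_pad // 2, n_pad - n_pad // 2
    return np.concatenate([np.repeat(feats[:1], left, axis=0),
                           feats,
                           np.repeat(feats[-1:], right, axis=0)])
```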
4. EXPERIMENTAL RESULTS AND DISCUSSION
In this section, we report the experimental results on both datasets and provide insights on the outcomes. Tables 1 and 2 summarize the results on Aurora-4 and CHiME-4 respectively. For both tables, different experiments are separated by double horizontal lines and indexed by the
Exp. Index in the first column. The second column,
Feature, refers to the frame representations used for training ASR models. The third to the sixth columns describe the model configuration and the discriminative training weight for the VAE or FHVAE models. We separate the encoder and decoder parameters by "/" in the third and fourth columns. Averaged and by-condition word error rates (WER) are shown in the remaining columns.
We start by establishing Aurora-4 baselines with acoustic models trained on different types of feature representations, including (1) FBank, (2) the latent variable, $z$, extracted from the VAE, and (3) the latent segment variable, $z_1$, extracted from the FHVAE. Because each FHVAE model has two encoders, to have a fair comparison between VAE and FHVAE models, we also consider a VAE model with 512 hidden units at each encoder layer. The results are shown in Table 1, Exp. Index 1. As mentioned, condition A is the matched domain, while conditions B, C, and D are all mismatched domains. FBank degrades significantly in the mismatched conditions, producing between 49% and 79% absolute WER increase. On the other hand, both VAE and FHVAE models improve the performance in the mismatched domains by a large margin, with only a slight degradation in the matched domain. In particular, the features learned by the FHVAE consistently outperform the VAE features in all mismatched conditions, by 14% absolute WER reduction.

Table 2. CHiME-4 development set word error rate (%) of acoustic models trained on different features. Exp. Index 1, FBank: Clean 19.37, Noisy 87.69 (BUS 95.56, CAF 92.05, PED 78.77, STR 84.37).

We believe that this experiment verifies that FHVAEs can successfully retain domain invariant linguistic features in $z_1$, while encoding domain-related information into $z_2$. In contrast, as the results suggest, VAEs encode all the information into a single set of latent variables, $z$, which still contains domain-related information that can hurt ASR performance on the mismatched domains.

We next explore the optimal FHVAE architectures for extracting domain invariant features. In particular, we study the effect of the number of hidden units at each layer and the number of layers. Results of each variant are listed in Table 1,
Exp. Index 2 and
Exp. Index 3 respectively. Regarding the averaged WER, the model with 256 hidden units at each layer and three layers in total achieves the lowest WER (24.30%). Interestingly, if we break down the WER by condition, it can be observed that increasing the FHVAE model capacity (i.e., increasing the number of layers or hidden units) helps reduce the WER in the noisy condition (B), but deteriorates the channel-mismatched condition (C) above 256 hidden units and 2 layers.
Speaker verification experiments in [15] suggest that discriminative training facilitates factorizing segment-level attributes and sequence-level attributes into two sets of latent variables. Here we study the effect of discriminative training on learning robust ASR features, and show the results in Table 1,
Exp. Index 4. When α = 0, the model is not trained with the discriminative objective. While increasing the discriminative weight from 0 to 10, we observe consistent improvement in all four conditions due to better factorization of segment and sequence information; however, when further increasing the weight to 20, the performance starts to degrade. This is because the discriminative objective can adversely affect the modeling capacity by constraining the expressibility of the latent sequence variables.

A core idea of the FHVAE is to learn sequence-specific priors to model the generation of sequence-level attributes, which have a smaller amount of variation within a sequence. Suppose we treat each utterance as one sequence; then both speaker and noise information belong to sequence-level attributes, because they are consistent within an utterance. Alternatively, we consider two FHVAE models that learn speaker-specific priors and noise-specific priors respectively. This can be easily achieved by concatenating sequences with the same speaker label or noise label, and treating each concatenation as one sequence for FHVAE training (a sketch of this regrouping follows the discussion below). We report the results in Table 1,
Exp. Index 5. It may at first seem surprising that utilizing supervised information in this fashion does not improve performance. We believe that concatenating utterances actually discards some useful information with respect to learning domain invariant features. FHVAEs use latent segment variables to encode attributes that are not consistent within a sequence. By concatenating utterances from the same speaker, noise information is no longer consistent within sequences, and would thus be encoded into the latent segment variables; similarly, the latent segment variables would not be speaker invariant in the other case.
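The speaker-specific and noise-specific prior variants in Exp. Index 5 differ from the utterance-level model only in how utterances are grouped into training sequences. A minimal sketch of this regrouping, with an assumed in-memory data layout, is shown below.

```python
import numpy as np
from collections import defaultdict

def regroup_by_label(utterances, labels):
    """Concatenate utterances that share a label into one training 'sequence'.

    utterances : list of (T_i, 80) FBank matrices, one per utterance
    labels     : per-utterance speaker (or noise) labels
    returns    : one concatenated sequence per distinct label
    """
    groups = defaultdict(list)
    for utt, lab in zip(utterances, labels):
        groups[lab].append(utt)
    # each group now plays the role of one sequence X^(i), so the FHVAE learns
    # one s-vector mu2 per speaker (or per noise type) instead of one per utterance
    return {lab: np.concatenate(utts, axis=0) for lab, utts in groups.items()}
```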
Lastly, we study the use of s-vectors, $\mu_2$, derived from the FHVAE model, which can be seen as a summarization of the sequence-level attributes of an utterance. We apply the same procedure as i-vector based speaker adaptation [27]: for each utterance, we first estimate its s-vector, and then concatenate the s-vector with the feature representation of each frame to generate the new feature sequence. Results are shown in Table 1, Exp. Index 6, from which we observe a significant degradation of WER that is similar to those of the VAE models. This is reasonable because $z_1$ and $\mu_2$ in combination actually contain similar information as the latent variable $z$ in VAE models, and the degradation is due to the mismatch between the distributions of $\mu_2$ in the training and testing sets.

We next repeat the baseline and the layer experiments on the CHiME-4 dataset, in order to verify the effectiveness of the FHVAE and the optimality of the FHVAE architecture on a non-artificial dataset. The results are shown in Table 2. From
Exp. Index 1, we see that the same trend applies to the CHiME-4 dataset, where the latent segment variables from the FHVAE outperform those from the VAE, and both latent variable representations outperform FBank features. For the FHVAE architectures, a 7% absolute WER decrease is achieved by increasing the number of encoder/decoder layers from 1 to 3, which is also consistent with the trends we saw on Aurora-4.
5. CONCLUSION AND FUTURE WORK
In this paper, we conduct comprehensive experiments studying the use of FHVAE models as domain invariant ASR feature extractors. Our feature demonstrates superior robustness in mismatched domains compared to FBank and VAE-based features, achieving 41% and 27% absolute WER reductions on Aurora-4 and CHiME-4 respectively. In the future, we plan to study FHVAE-based augmentation methods similar to [4].

6. REFERENCES

[1] Tara N. Sainath, Brian Kingsbury, George Saon, Hagen Soltau, Abdel-rahman Mohamed, George Dahl, and Bhuvana Ramabhadran, "Deep convolutional neural networks for large-scale speech tasks," Neural Networks, vol. 64, pp. 39–48, 2015.
[2] Haşim Sak, Félix de Chaumont Quitry, Tara Sainath, Kanishka Rao, et al., "Acoustic modelling with CD-CTC-SMBR LSTM RNNs," in Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on. IEEE, 2015, pp. 604–609.
[3] Wei-Ning Hsu, Yu Zhang, and James Glass, "A prioritized grid long short-term memory RNN for speech recognition," in Spoken Language Technology Workshop (SLT), 2016 IEEE. IEEE, 2016, pp. 467–473.
[4] Wei-Ning Hsu, Yu Zhang, and James Glass, "Unsupervised domain adaptation for robust speech recognition via variational autoencoder-based data augmentation," in Automatic Speech Recognition and Understanding (ASRU), 2017 IEEE Workshop on. IEEE, 2017.
[5] Arun Narayanan and DeLiang Wang, "Ideal ratio mask estimation using deep neural networks for robust speech recognition," in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 7092–7096.
[6] Yusuf Isik, Jonathan Le Roux, Zhuo Chen, Shinji Watanabe, and John R. Hershey, "Single-channel multi-speaker separation using deep clustering," arXiv preprint arXiv:1607.02173, 2016.
[7] Xue Feng, Yaodong Zhang, and James Glass, "Speech feature denoising and dereverberation via deep autoencoders for noisy reverberant speech recognition," in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 1759–1763.
[8] Jinyu Li, Dong Yu, Jui-Ting Huang, and Yifan Gong, "Improving wideband speech recognition using mixed-bandwidth training data in CD-DNN-HMM," in Spoken Language Technology Workshop (SLT), 2012 IEEE. IEEE, 2012, pp. 131–136.
[9] Michael L. Seltzer, Dong Yu, and Yongqiang Wang, "An investigation of deep neural networks for noise robust speech recognition," in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 7398–7402.
[10] Brian E. D. Kingsbury, Nelson Morgan, and Steven Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Communication, vol. 25, no. 1, pp. 117–132, 1998.
[11] Richard M. Stern and Nelson Morgan, "Features based on auditory physiology and perception," Techniques for Noise Robustness in Automatic Speech Recognition, pp. 193–227, 2012.
[12] Oriol Vinyals and Suman V. Ravuri, "Comparing multilayer perceptron to deep belief network tandem features for robust ASR," in Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on. IEEE, 2011, pp. 4596–4599.
[13] Tara N. Sainath, Brian Kingsbury, and Bhuvana Ramabhadran, "Auto-encoder bottleneck features using deep belief networks," in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on. IEEE, 2012, pp. 4153–4156.
[14] Sining Sun, Binbin Zhang, Lei Xie, and Yanning Zhang, "An unsupervised deep domain adaptation approach for robust speech recognition," Neurocomputing, 2017.
[15] Wei-Ning Hsu, Yu Zhang, and James Glass, "Unsupervised learning of disentangled and interpretable representations from sequential data," in Advances in Neural Information Processing Systems, 2017.
[16] David Pearce, Aurora working group: DSR front end LVCSR evaluation AU/384/02, Ph.D. thesis, Mississippi State University, 2002.
[17] Emmanuel Vincent, Shinji Watanabe, Aditya Arie Nugraha, Jon Barker, and Ricard Marxer, "An analysis of environment, microphone and data simulation mismatches in robust speech recognition," Computer Speech & Language, 2016.
[18] Wei-Ning Hsu, Yu Zhang, and James Glass, "Learning latent representations for speech generation and transformation," in Interspeech, 2017, pp. 1273–1277.
[19] John Garofalo, David Graff, Doug Paul, and David Pallett, "CSR-I (WSJ0) complete," Linguistic Data Consortium, Philadelphia, 2007.
[20] Diederik Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[21] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al., "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, 2011, number EPFL-CONF-192584.
[22] Dong Yu, Adam Eversole, Mike Seltzer, Kaisheng Yao, Zhiheng Huang, Brian Guenter, Oleksii Kuchaiev, Yu Zhang, Frank Seide, Huaming Wang, et al., "An introduction to computational networks and the computational network toolkit," Tech. Rep., Microsoft Research, 2014.
[23] Haşim Sak, Andrew W. Senior, and Françoise Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," in Interspeech, 2014, pp. 338–342.
[24] Yu Zhang, Guoguo Chen, Dong Yu, Kaisheng Yao, Sanjeev Khudanpur, and James Glass, "Highway long short-term memory RNNs for distant speech recognition," in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 5755–5759.
[25] Wei-Ning Hsu, Yu Zhang, Ann Lee, and James R. Glass, "Exploiting depth and highway connections in convolutional recurrent deep neural networks for speech recognition," in Interspeech, 2016, pp. 395–399.
[26] Ronald J. Williams and Jing Peng, "An efficient gradient-based algorithm for on-line training of recurrent network trajectories," Neural Computation, vol. 2, no. 4, pp. 490–501, 1990.
[27] George Saon, Hagen Soltau, David Nahamoo, and Michael Picheny, "Speaker adaptation of neural network acoustic models using i-vectors," in Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on. IEEE, 2013.