Enforcing Encoder-Decoder Modularity in Sequence-to-Sequence Models
Siddharth Dalmia,∗ Abdelrahman Mohamed, Mike Lewis, Florian Metze, Luke Zettlemoyer
Carnegie Mellon University    Facebook AI Research
∗ Work done at Facebook AI Research.
Abstract
Inspired by modular software design principles of independence, interchangeability, and clarity of interface, we introduce a method for enforcing encoder-decoder modularity in seq2seq models without sacrificing the overall model quality or its full differentiability. We discretize the encoder output units into a predefined interpretable vocabulary space using the Connectionist Temporal Classification (CTC) loss. Our modular systems achieve near-SOTA performance on the 300h Switchboard benchmark, with WER of . and . on the SWB and CH subsets, using seq2seq models with encoder and decoder modules which are independent and interchangeable.

Introduction

Modularity is a universal requirement for large-scale software and system design, where "a module is a unit whose structural elements are powerfully connected among themselves and relatively weakly connected to elements in other units" (Baldwin and Clark, 1999). In addition to independence, good software architecture emphasizes interchangeability of modules, a clear understanding of the function of each module, and an unambiguous interface for how each module interacts with the larger system. In this paper, we demonstrate that widely adopted seq2seq models lack modularity, and introduce new ways of training these models with independent and interchangeable encoder and decoder modules that do not sacrifice overall system performance.

Fully differentiable seq2seq models (Chan et al., 2016; Bahdanau et al., 2016, 2015; Vaswani et al., 2017) play a critical role in a wide range of NLP and speech tasks, but fail to satisfy even very basic measures of modularity between the encoder and decoder components. The decoder cross-attention averages over the continuous output representations of the encoder, and the parameters of both modules are jointly optimized through back-propagation. This causes tight coupling and prevents a clear understanding of the function of each part. As we will show empirically, current seq2seq models lack modular interchangeability, i.e., retraining a single model with different random seeds causes the encoder and decoder modules to learn very different functions, so much so that interchanging them radically degrades overall model performance. Such tight coupling makes it difficult to measure the contributions of the individual modules or to transfer components across different domains and tasks.

In this paper, we introduce a new method that guarantees encoder-decoder modularity while also ensuring the model is fully differentiable. We constrain the encoder outputs to a predefined discrete vocabulary space using the connectionist temporal classification (CTC) loss (Graves et al., 2006), which is jointly optimized with the decoder output token-level cross-entropy loss. This novel use of the CTC loss discretizes the encoder output units while respecting their sequential nature. By grounding the discrete encoder output in a real-world vocabulary space, we are able to measure and analyze the encoder performance. We present two proposals for extending the decoder cross-attention to ingest probability distributions, either using probability scores of different hypotheses or using their rank within a fixed beam.
Combining these techniques enables us to train seq2seq models that pass the three measures of modularity: clarity of interface, independence, and interchangeability. The proposed approach combines the best of the end-to-end and the classic sequence transduction approaches by splitting models into grounded encoder modules performing translation or acoustic modeling, depending on the task, followed by language generation decoders, while preserving full differentiability of the overall system. We present extensive experiments on the standard Switchboard speech recognition task. Our best model, while having modular encoder and decoder components, achieves a competitive WER of . and . on the standard 300h Switchboard and CallHome benchmarks respectively.

Conditioned on previously generated output tokens and the full input sequence, encoder-decoder models (Sutskever et al., 2014) factorize the joint target sequence probability into a product over individual time steps. They are trained by minimizing the token-level cross-entropy (CE) loss between the true and the decoder-predicted distributions. Input sequence information is encoded into the decoder output through an attention mechanism (Bahdanau et al., 2015), which is conditioned on current decoder states and runs over the encoder output representations.
Rather than producing a soft alignment between the input and target sequences, the Connectionist Temporal Classification (CTC) loss (Graves et al., 2006) maximizes the log conditional likelihood by integrating over all possible monotonic alignments between both sequences:

$$Y = \mathrm{Softmax}\left(\mathrm{Encoder}(X)\, W_o\right)$$

$$\mathcal{F}_{CTC}(L, Y) = -\log \sum_{z \in \mathcal{Z}(L,T)} \prod_{t=1}^{T} Y_{t, z_t}$$

where $W_o \in \mathbb{R}^{d \times |V_e|}$ projects the encoder representations into the output vocabulary space, $L$ is the output label sequence, $\mathcal{Z}(L, T)$ is the space of all possible monotonic alignments of $L$ into $T$ time steps, and the probability of an alignment $z$ is the product of locally-normalized per-time-step output probabilities $Y \in \mathbb{R}^{T \times |V_e|}$. The forward-backward algorithm is used for efficient computation of the marginalization sum. (We omit the extra CTC blank symbol from the equation above for clarity of presentation; Graves et al. (2006) provide the full technical treatment.) Only one inference step is required to generate the full target sequence in a non-autoregressive fashion through the encoder-only model.
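For concreteness, the following is a minimal PyTorch sketch of an encoder-only CTC setup in the spirit of the equations above. The encoder, feature dimension, sequence lengths, and vocabulary size are placeholders, and torch.nn.CTCLoss supplies the blank symbol and the forward-backward marginalization that the equations above omit.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder sizes: batch B, input frames T, encoder dimension d, vocabulary |V_e| (index 0 = CTC blank).
B, T, d, vocab = 4, 200, 512, 101

encoder = nn.GRU(input_size=80, hidden_size=d, batch_first=True)  # stand-in for any encoder
W_o = nn.Linear(d, vocab)                                         # projection into the vocabulary space
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

x = torch.randn(B, T, 80)                     # input feature sequence X
labels = torch.randint(1, vocab, (B, 30))     # target label sequences L
label_lens = torch.full((B,), 30)
input_lens = torch.full((B,), T)

enc_out, _ = encoder(x)                                  # (B, T, d)
log_probs = F.log_softmax(W_o(enc_out), dim=-1)          # Y from the equation above, in the log domain

# CTCLoss expects (T, B, |V_e|) and marginalizes over all monotonic alignments z in Z(L, T).
loss = ctc_loss(log_probs.transpose(0, 1), labels, input_lens, label_lens)
loss.backward()
```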
In (Kim et al., 2017; Karita et al., 2019), the encoder-decoder cross-entropy loss is augmented with an auxiliary CTC loss, through an extra linear projection of the encoder output representation into the output target space, to guide learning in early optimization phases when gradients are not flowing smoothly from the decoder output to the encoder parameters due to misaligned cross-attention. The decoder cross-attention still acts over the encoder output representation, maintaining the tight coupling between the encoder and decoder modules.
Establishing an interpretable interface between the encoder and decoder components is the first step towards relaxing their tight coupling in seq2seq models. To achieve this goal, we force the encoder to output distributions over a pre-defined discrete vocabulary rather than communicating continuous vector representations to the decoder. This creates an information bottleneck (Tishby et al., 1999) in the model, where the decoder can communicate with the encoder's learned representations only through probability distributions over this discrete vocabulary. In addition to being interpretable, grounding the encoder outputs offers an opportunity to measure their quality independently of the decoder, provided the encoder vocabulary can be mapped to the ground-truth decoder targets.

We choose an encoder output vocabulary of sub-word units derived from the target label sequences, which may deviate from the decoder output vocabulary. To force the encoder to output probabilities in the needed vocabulary space, we use the Connectionist Temporal Classification (CTC) loss. This is a novel usage of the CTC loss, not as the main loss driving the model learning process, but as a supervised function that discretizes the encoder output space into a pre-defined discrete vocabulary. Even if the input-output relationship does not adhere to the monotonicity assumption of the CTC loss, the encoder component, as a module in the system, is not expected to solve the full problem; the decoder module should correct any mismatch in alignment assumptions through its auto-regressive generation process.

The decoder design needs to change to cope with cross-attention over probability distributions rather than continuous hidden representations. We introduce the AttPrep component inside the decoder module to prepare the decoder-internal representation needed for attention over the input sequence. The AttPrep step enables us to contain the cross-attention operation inside the decoder module:

$$Y^E_{1:T} = \mathrm{Encoder}(X_{1:T})$$

$$g_{1:T} = \mathrm{AttPrep}(Y^E_{1:T})$$

$$y^D_t \sim \mathrm{Decoder}(g_{1:T}, Y^D_{1:t-1})$$

$$\mathcal{F}_{OBJ} = \mathcal{F}_{CE}(Y^D, L) + \mathcal{F}_{CTC}(Y^E, L)$$

The encoder module has a softmax normalization layer at the end so that $Y^E \in \mathbb{R}^{T \times |V_e|}$ has each row $Y^E_{i,:}$ as a probability distribution over $V_e$. To discretize the encoder output, the CTC loss is jointly optimized with the decoder cross-entropy loss. Having distributions over a discrete vocabulary $V_e$ at the input of the decoder opens the space for many interesting ideas on how to harness the temporal correlations between encoder output units and common confusion patterns. We present two variants of the AttPrep component: the weighted embedding and the beam convolution.
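The joint objective can be wired up as sketched below; the encoder, att_prep, and decoder call signatures, the padding handling, and any interpolation weight between the two losses are illustrative assumptions rather than details taken from the paper.

```python
import torch.nn.functional as F

def modular_training_step(encoder, att_prep, decoder, x, input_lens,
                          enc_targets, enc_target_lens, dec_targets, ctc_weight=1.0):
    """One training step with F_OBJ = F_CE(Y^D, L) + F_CTC(Y^E, L) (schematic sketch).

    Assumed interfaces: encoder(x) -> (B, T, |V_e|) logits; att_prep(probs) -> (B, T, d);
    decoder(g, prev_tokens) -> (B, U, |V_d|) logits. Padding/masking is omitted for brevity.
    """
    y_enc = F.log_softmax(encoder(x), dim=-1)             # encoder output distributions Y^E (log domain)

    # CTC grounds the encoder output in the discrete vocabulary V_e.
    ctc = F.ctc_loss(y_enc.transpose(0, 1), enc_targets, input_lens, enc_target_lens,
                     blank=0, zero_infinity=True)

    # The decoder attends only over distributions prepared by AttPrep, never over hidden states.
    g = att_prep(y_enc.exp())
    dec_logits = decoder(g, dec_targets[:, :-1])           # teacher forcing: predict the next token
    ce = F.cross_entropy(dec_logits.reshape(-1, dec_logits.size(-1)),
                         dec_targets[:, 1:].reshape(-1))

    return ce + ctc_weight * ctc                           # weighting between the losses is a free choice here
```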
Weighted Embedding AttPrep
Given the encoder output distribution $Y^E$, the weighted embedding AttPrep (WEmb) computes an expected embedding per encoder step, combines it with sinusoidal positional encodings (PE) (Vaswani et al., 2017), and then applies a multi-head self-attention operation (MHA) to aggregate information over all time steps:

$$h = Y^E W_{emb}$$

$$g = \mathrm{MHA}\left(h + \mathrm{PE}(h)\right)$$

where $W_{emb} \in \mathbb{R}^{|V_e| \times d}$ and $d$ is the decoder input dimension. The first operation, which computes the expected embedding, is in fact a 1-D time convolution with a receptive field of 1. It can be extended to larger receptive fields, offering the opportunity to learn local confusion patterns from the encoder output:

$$h = \mathrm{Conv1D}(Y^E)$$

One variant that we experimented with relaxes the softmax operation of the encoder output, which harshly suppresses most of the encoder output units, by applying the 1-D convolution operation above over log probabilities (WLogEmb) to allow more information flow between encoder and decoder.
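A possible PyTorch realization of the weighted-embedding AttPrep is sketched below; the default dimensions follow the architecture description later in the paper, but the positional-encoding helper, the log-probability (WLogEmb) switch, and other details are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class WEmbAttPrep(nn.Module):
    """Weighted-embedding AttPrep (sketch): expected embedding per step, then self-attention."""

    def __init__(self, vocab, d_model=1024, n_heads=16, receptive_field=1, use_log_probs=False):
        super().__init__()
        # A receptive field of 1 is exactly h = Y^E W_emb; larger fields can learn local confusion patterns.
        self.conv = nn.Conv1d(vocab, d_model, kernel_size=receptive_field,
                              padding=receptive_field // 2)
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.use_log_probs = use_log_probs  # WLogEmb variant: convolve log probabilities instead

    @staticmethod
    def positional_encoding(h):
        # Sinusoidal positional encodings (Vaswani et al., 2017).
        T, d = h.size(1), h.size(2)
        pos = torch.arange(T, dtype=torch.float32, device=h.device).unsqueeze(1)
        div = torch.exp(torch.arange(0, d, 2, dtype=torch.float32, device=h.device)
                        * (-math.log(10000.0) / d))
        pe = torch.zeros(T, d, device=h.device)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return pe.unsqueeze(0)

    def forward(self, y_enc):                                   # y_enc: (B, T, |V_e|) distributions
        if self.use_log_probs:
            y_enc = (y_enc + 1e-9).log()
        h = self.conv(y_enc.transpose(1, 2)).transpose(1, 2)    # (B, T, d_model)
        h = h + self.positional_encoding(h)
        g, _ = self.mha(h, h, h)                                # multi-head self-attention over time
        return g
```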
Beam Convolution AttPrep
Rather than using the encoder output probability values, the beam convolution AttPrep (BeamConv) uses the rank of the top-k hypotheses per time step. It forces a fixed bandwidth on the communication channel between the encoder and decoder, relaxing the dependence on the shape of the encoder output probability distribution. Since the top-k list does not preserve the unit ordering of the encoder output vector, each vocabulary unit is represented by a $p$-dimensional embedding vector. Similar to the weighted embedding AttPrep, a 1-D convolution operation is applied over time steps to aggregate local information, followed by a multi-head self-attention operation:

$$r = \mathrm{Embedding}\left(\text{top-}k(Y^E)\right)$$

$$h = \mathrm{Conv1D}(r)$$

$$g = \mathrm{MHA}\left(h + \mathrm{PE}(h)\right)$$

where $r \in \mathbb{R}^{T \times k \times p}$ with beam size $k$ and unit embedding dimension $p$.
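A corresponding sketch of the beam-convolution AttPrep; the beam size k, embedding dimension p, and model dimensions are placeholders, and the positional-encoding helper mirrors the one in the WEmb sketch above.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_pe(T, d, device):
    # Sinusoidal positional encodings (Vaswani et al., 2017).
    pos = torch.arange(T, dtype=torch.float32, device=device).unsqueeze(1)
    div = torch.exp(torch.arange(0, d, 2, dtype=torch.float32, device=device)
                    * (-math.log(10000.0) / d))
    pe = torch.zeros(T, d, device=device)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

class BeamConvAttPrep(nn.Module):
    """Beam-convolution AttPrep (sketch): embed the identities of the top-k units per step."""

    def __init__(self, vocab, k=20, p=64, d_model=1024, n_heads=16, receptive_field=1):
        super().__init__()
        self.k = k
        # Each vocabulary unit gets its own p-dimensional embedding, since the top-k list
        # does not preserve the unit ordering of the encoder output vector.
        self.embed = nn.Embedding(vocab, p)
        self.conv = nn.Conv1d(k * p, d_model, kernel_size=receptive_field,
                              padding=receptive_field // 2)
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, y_enc):                                   # y_enc: (B, T, |V_e|) distributions
        # Only the ranked identities of the top-k hypotheses are used, not their scores,
        # which fixes the bandwidth of the encoder-decoder interface.
        topk_ids = y_enc.topk(self.k, dim=-1).indices           # (B, T, k), ordered by rank
        r = self.embed(topk_ids).flatten(2)                     # (B, T, k * p)
        h = self.conv(r.transpose(1, 2)).transpose(1, 2)        # (B, T, d_model)
        h = h + sinusoidal_pe(h.size(1), h.size(2), h.device)
        g, _ = self.mha(h, h, h)
        return g
```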
For our speech recognition experiments, we follow the standard Switchboard setup, with the LDC97S62 300h training set, and the Switchboard (SWB) and CallHome (CH) subsets of the HUB5 Eval2000 set (LDC2002S09, LDC2000T43) for testing. Following the data preparation setup of ESPnet (Watanabe et al., 2018), we use mean- and variance-normalized 83-dimensional log-mel filterbank and pitch features from 16kHz upsampled audio. As model targets, we experiment with 100 and 2000 sub-word units (Kudo and Richardson, 2018).

We use FairSeq (Ott et al., 2019) for all our experiments. We use the Adam optimizer (Kingma and Ba, 2014) with eps = 1e− and an average batch size of utterances. We warm up the learning rate from e− to a peak lr = 1e− in k steps, keep it fixed for k steps, then linearly decrease it to e− in k steps. We follow the strong Switchboard data augmentation policy from (Park et al., 2019), but without time-warping. For inference, we do not use an external LM or joint decoding over the encoder and decoder outputs (Watanabe et al., 2018).

Our sequence-to-sequence model uses 16 transformer blocks (Vaswani et al., 2017) for the encoder and 6 for the decoder, with a convolutional context architecture (Mohamed et al., 2019) where input speech features are processed using two 2-D convolution blocks with 3x3 kernels, 64 and 128 feature maps respectively, 2x2 max-pooling, and ReLU non-linearity. Both encoder and decoder transformer blocks have 1024 dimensions, 16 heads, and a 4096-dimensional feed-forward network. Sinusoidal positional embeddings are added to the output of the encoder 2-D convolutional context layers.
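The warm-up / hold / decay schedule described above can be written as a simple function of the update number; since the exact rates and step counts are not reproduced here, the defaults below are placeholders only.

```python
def tri_stage_lr(step, init_lr=1e-7, peak_lr=1e-3, final_lr=1e-7,
                 warmup_steps=10_000, hold_steps=20_000, decay_steps=20_000):
    """Tri-stage schedule: linear warm-up, constant hold, linear decay (placeholder values)."""
    if step < warmup_steps:                        # warm up from init_lr to peak_lr
        return init_lr + (peak_lr - init_lr) * step / warmup_steps
    step -= warmup_steps
    if step < hold_steps:                          # keep the peak learning rate fixed
        return peak_lr
    step -= hold_steps
    frac = min(step / decay_steps, 1.0)            # linearly decrease to final_lr
    return peak_lr + (final_lr - peak_lr) * frac
```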
Table 1: Baseline Seq2Seq ASR Models

BPE Units (CTC / CE)    Beam Size    Loss Criterion (CTC / CE)    Eval2000 (SWB / CH)
Our baseline implementations: 100 / 2000 BPE units, trained with CTC-only, CE-only, and joint CTC + CE losses
(Povey et al., 2016)    -            LF-MMI                       8.8 / 18.1

Table 1 shows the word error rates (WER) of our ASR baseline implementations employing the three approaches to seq2seq modeling, along with the current SOTA systems in the literature. In line with Irie et al. (2019), the auto-regressive encoder-decoder models benefit from larger modeling units, as opposed to the CTC-optimized one, which works best with shorter linguistic units. Encoder-decoder models trained by joint optimization of the CTC and cross-entropy losses benefit from a hybrid setup with two different vocabulary sets. (The (Povey et al., 2016) results are from Kaldi's recent best recipe on GitHub.)

The problem of tight coupling of the encoder and decoder components in the seq2seq model is highlighted in Table 2. The decoder cross-attention over the encoder hidden representation makes it not only conditioned on the encoder outputs but also dependent on the encoder architectural decisions and internal hidden representations. The whole ASR system falls apart under the interchangeability test, i.e., switching an encoder with another similar one that differs only in its initial random seed, which underscores our point about the lack of modularity in encoder-decoder models. The same level of coupling is also observed in models that utilize the auxiliary CTC loss to accelerate the early learning stages (Kim et al., 2017; Karita et al., 2019).
Table 2: Effect of switching encoders on WER
Random Seed    Loss Criterion (CTC / CE)    Eval2000 (SWB / CH)
Rows compare Seed 1 and Seed 2 encoders trained with CE-only (✗ ✓) and joint CTC + CE (✓ ✓) losses.
Tables 3 and 4 show that the proposed modular seq2seq models are competitive with SOTA performance levels, and that the models are highly modular: performance does not degrade when exchanging encoders and decoders trained from different initial seeds or choices of architectures. The information bottleneck at the encoder outputs is critical for this result, as shown by the WLogEmb architecture. Relaxing the bandwidth constraint on the encoder-decoder connection by utilizing the log distribution lets the decoder rely on specific patterns of errors at the tail of the encoder output distribution for its final decisions. This improves the overall performance but breaks modularity when a different encoder is used, as shown in Table 3.

Table 3: Evidence of modularity using our proposed decoupling techniques in attention-based ASR systems
Model 1                          Model 2                          Enc2 + Dec1      Enc1 + Dec2
Architecture  RF  K    SWB  CH   Architecture  RF  K    SWB  CH   SWB    CH        SWB    CH
BeamConv      1   20   8.4  17.6 BeamConv      1   20   8.6  17.6  8.6   17.8       8.5   17.3
BeamConv      3   10   8.7  17.3 BeamConv      3   10   9.2  17.2  9.7   17.3       8.9   17.5
WEmb          3   -    8.7  17.9 WEmb          3   -    8.8  18.3  9.0   18.3       8.6   18.0
WLogEmb       3   -    8.0  16.4 WLogEmb       3   -    8.0  16.4  61.4  60.1       55.2  62.7
BeamConv      3   20   8.8  17.3 BeamConv      1   10   8.3  17.6  8.4   17.4       9.3   17.4
WEmb          3   -    8.7  17.9 BeamConv      1   10   8.3  17.6  8.4   17.2       8.7   18.4
WEmb          3   -    8.7  17.9 WLogEmb       1   -    7.8  17.4  8.7   17.9       96.3  265.1
BeamConv      1   20   8.4  17.5 WLogEmb       3   -    8.0  16.4  8.3   17.3       85.0  112.5
Table 4: WER of the encoder and decoder outputs for the proposed modular models
Architecture  RF  Top-k    Encoder          Decoder
                           SWB    CH        SWB    CH
WEmb          1   -        12.2   23.1      8.7    18.0
WEmb          3   -        11.9   22.6      8.7    17.9
WEmb          5   -        12.0   23.0      8.4    17.9
WLogEmb       1   -        11.1   21.9      7.8    17.2
WLogEmb       3   -        11.1   21.7      8.0    16.4
WLogEmb       5   -        10.9   22.2      7.7    16.3
BeamConv      1   10       11.0   21.7      8.3    17.6
BeamConv      1   20       11.2   21.9      8.4    17.6
BeamConv      1   50       11.2   21.7      8.4    17.2
BeamConv      3   10       11.2   21.9      8.7    17.3
BeamConv      3   20       10.8   22.0      8.8    17.3
BeamConv      3   50       11.5   21.7      9.9    17.0
Building modular seq2seq models brings many advantages, including an interpretable modular interface, functional independence of each module, and interchangeability. In addition to opening up the space for designing modules that implement specific functions, grounding each module's output in an interpretable discrete vocabulary allows for debugging and measuring the quality of each modular component in the system. In our speech recognition experiments, the encoder acts like an acoustic model by mapping input acoustic evidence into low-level linguistic units, while the decoder, acting like a language model, aggregates distributions of such units to generate the most likely full sentence. For example, Table 4 shows how the overall performance improves going from the encoder to the decoder module in the last two columns.

Another benefit of modular independence, which we enjoy in software design but not in building fully-differentiable seq2seq models, is the ability to carefully build one critical module to higher levels of performance and then switch it into the full system without the need for any model fine-tuning. The new module may reflect a new architecture design, e.g. moving from LSTMs to Transformers, or simply more training data that becomes available for that module. Such "modular upgrade" capability is demonstrated in Table 5, where the upgraded encoder performance is reflected in the overall system WER. In our case, we simply used an encoder model that was trained independently using the CTC loss for a larger number of updates.
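Because the interface is a distribution over the shared vocabulary V_e, a modular upgrade reduces to overwriting the encoder weights in a saved system; the checkpoint layout and file names below are hypothetical.

```python
import torch

# Hypothetical checkpoints: a trained modular system and an independently trained CTC encoder
# over the same vocabulary V_e; both are assumed to store parameters under a "model" dict
# with encoder parameters prefixed by "encoder.".
full_ckpt = torch.load("modular_seq2seq.pt", map_location="cpu")
enc_ckpt = torch.load("ctc_encoder_more_updates.pt", map_location="cpu")

# Overwrite only the encoder parameters; AttPrep and decoder weights are untouched and
# no joint fine-tuning follows the swap.
for name, tensor in enc_ckpt["model"].items():
    if name.startswith("encoder."):
        full_ckpt["model"][name] = tensor

torch.save(full_ckpt, "modular_seq2seq_upgraded.pt")
```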
Table 5: WER of the decoupled model without and with modular upgrade (decoder columns show without → with)
Architecture        RF  Top-k   Original Encoder     Decoder (without → with)
                                SWB    CH            SWB           CH
Upgraded Encoder    -   -       11.2   21.1          -             -
BeamConv            1   10      11.0   21.7          8.3 → 8.7     17.6 → 16.6
BeamConv            3   50      11.5   21.7          9.9 → 9.3     17.0 → 16.4
WEmb                1   -       12.2   23.1          8.7 → 8.9     18.0 → 17.8
WEmb                5   -       12.0   23.0          8.4 → 8.8     17.9 → 17.4
Table 6 presents experiments for a slightly different scenario where one module is trained from scratch conditioned on the output of its parent module with frozen parameters, dubbed
PostEdit in our experiments. The beam convolution attention preparation architecture, which uses only the rank of encoder hypotheses rather than probability values, shows much more resilience and ability to fix frozen parent module errors compared to the weighted embedding architecture. There is still a slight degradation of the final decoder performance when it is trained conditionally on the encoder output without joint fine-tuning. The reason is the lack of a data augmentation effect when training the decoder module, a side effect of modular components, because the encoder is trained to be invariant to augmentation when producing its final probability distribution. This can be treated by designing data augmentation techniques suitable for the input of each module, which we leave to future work.
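A sketch of such conditional training, assuming the model exposes encoder, att_prep, and decoder submodules and that the data loader yields (features, token) pairs; the optimizer settings are illustrative.

```python
import torch
import torch.nn.functional as F

def train_postedit_decoder(model, train_loader, lr=1e-4, epochs=1):
    """Conditional (PostEdit-style) training sketch: freeze the parent encoder and update
    only the AttPrep and decoder parameters against the encoder's output distributions."""
    for p in model.encoder.parameters():
        p.requires_grad_(False)
    model.encoder.eval()                            # no dropout inside the frozen parent module

    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(trainable, lr=lr)

    for _ in range(epochs):
        for x, dec_targets in train_loader:
            with torch.no_grad():
                y_enc = model.encoder(x).softmax(dim=-1)        # frozen parent output distributions
            logits = model.decoder(model.att_prep(y_enc), dec_targets[:, :-1])
            loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                   dec_targets[:, 1:].reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```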
Table 6: PostEdit conditional training of the Decoder module
Architecture  RF  Top-k    Encoder          PostEdit Decoder
                           SWB    CH        SWB    CH
BeamConv      1   10       11.2   21.7      9.5    18.6
BeamConv      3   50       11.2   21.7      10.6   17.8
WEmb          1   -        11.2   21.7      17.3   27.4
WEmb          5   -        11.2   21.7      12.2   17.9
Modularity provides us with the ability to create an ensemble from an exponential number of models: for example, by training 3 different modular seq2seq systems, we end up with an ensemble of 9 encoder-decoder combinations (3 encoders x 3 decoders). In Table 7 we show that a modular ensemble of 4 provides a further improvement over the WER of an ensemble of the original 2 models.
Table 7: Modular Ensembling further improves WER
Architecture               RF    Top-k    Decoder
                                          SWB   CH
BeamConv                   1     10       8.3   17.6
WEmb                       3     -        8.7   18.0
Ensemble of the 2 above    1/3   10/-     7.7   16.1
+ 2 using modular swap     1/3   10/-     7.7   15.8
Related Work

This work applies the component modularity notion from the design and analysis of complex systems (Baldwin and Clark, 1999) to fully-differentiable seq2seq models, which have achieved impressive levels of performance across many tasks (Chan et al., 2016; Bahdanau et al., 2016, 2015; Vaswani et al., 2017). The Connectionist Temporal Classification (CTC) loss (Graves et al., 2006) has been applied as a sequence-level loss for training encoder-only speech recognition models (Graves and Jaitly, 2014; Hannun et al., 2014), and as a joint loss in attention-based systems for encouraging monotonic alignment between input and output sequences (Kim et al., 2017). In our work, the CTC loss serves the purpose of introducing an information bottleneck (Tishby et al., 1999) by discretizing the encoder output into an interpretable vocabulary space.

By enforcing modularity between the encoder and decoder components in seq2seq models, the decoder module can be viewed as a post-edit module operating on the recognition output of the encoder. The decoder can also be viewed as an instance of a differentiable beam-search decoder (Collobert et al., 2019).

There is a long history of research in learning disentangled, distributed hidden representations (Rumelhart et al., 1986; Hinton et al., 1986) and in unsupervised discovery of abstract factors of variation within the training data (Bengio, 2013; Mathieu et al., 2016; Chen et al., 2016; Higgins et al., 2017). This line of research is complementary to our work, which enforces modularity only at the link connecting two big components in a seq2seq system. In this work, a component is defined as a deep and complex network with multiple layers of representations, which serves a specific function within the bigger system and outputs distributions over interpretable vocabulary units.

Another related line of research centers on inducing a modular structure over the space of learned concepts through hierarchically gating information flow or via high-level concept blueprints (Andreas et al., 2016; Devin et al., 2016; Purushwalkam et al., 2019) to enable zero- and few-shot transfer learning (Andreas et al., 2017; Socher et al., 2013), as well as multi-lingual and cross-lingual learning (Adams et al., 2019; Dalmia et al., 2018; Swietojanski et al., 2012).

Hybrid HMM-DNN speech recognition systems (Gales and Young; Hinton et al., 2012) are modular by design but lack end-to-end learning capability. We aim at bringing the same modular properties without losing quality or full differentiability.
Conclusion

Motivated by the modular software and system design literature, we presented a method for inducing modularity in attention-based seq2seq models by discretizing the encoder output into real-world vocabulary units. The Connectionist Temporal Classification (CTC) loss is applied to the encoder outputs to ground them in the predefined vocabulary while respecting their sequential nature. The learned models adhere to the three properties of modular systems (independence, interchangeability, and clarity of interface) while achieving a competitive WER of . and . on the SWB and CH subsets of the standard 300h Switchboard task, respectively. Our future work focuses on extending this method to other sequence-to-sequence machine translation and language processing tasks, as well as exploring the benefits of modular transfer in multi-task and multi-modal settings.

Acknowledgments

The authors would like to thank Paul Michel, Dmytro Okhonko, and Matthew Wiesner for their helpful discussions and comments.
References
Oliver Adams, Matthew Wiesner, Shinji Watanabe, and David Yarowsky. 2019. Massively Multilingual Adversarial Speech Recognition. In Proc. NAACL-HLT.

Jacob Andreas, Dan Klein, and Sergey Levine. 2017. Modular Multitask Reinforcement Learning with Policy Sketches. In Proc. ICML.

Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016. Neural Module Networks. In Proc. CVPR.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In Proc. ICLR.

Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, and Yoshua Bengio. 2016. End-to-End Attention-based Large Vocabulary Speech Recognition. In Proc. ICASSP.

Carliss Y. Baldwin and Kim B. Clark. 1999. Design Rules: The Power of Modularity, Volume 1. MIT Press.

Yoshua Bengio. 2013. Deep learning of representations: Looking forward. CoRR.

William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. 2016. Listen, Attend and Spell: A neural network for large vocabulary conversational speech recognition. In Proc. ICASSP.

Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. 2016. InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets. In Proc. NeurIPS.

Ronan Collobert, Awni Hannun, and Gabriel Synnaeve. 2019. A fully differentiable beam search decoder. In Proc. ICML.

Siddharth Dalmia, Ramon Sanabria, Florian Metze, and Alan W Black. 2018. Sequence-Based Multi-Lingual Low Resource Speech Recognition. In Proc. ICASSP.

Coline Devin, Abhishek Gupta, Trevor Darrell, Pieter Abbeel, and Sergey Levine. 2016. Learning modular neural network policies for multi-task and multi-robot transfer. CoRR.

Mark Gales and Steve Young. The application of hidden Markov models in speech recognition. Found. Trends Signal Process., 1(3).

Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. In Proc. ICML.

Alex Graves and Navdeep Jaitly. 2014. Towards end-to-end speech recognition with recurrent neural networks. In Proc. ICML.

Awni Y. Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, and Andrew Y. Ng. 2014. Deep Speech: Scaling up end-to-end speech recognition. ArXiv.

Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. 2017. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. In Proc. ICLR.

G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury. 2012. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6).

Geoffrey E. Hinton et al. 1986. Learning distributed representations of concepts. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society.

Kazuki Irie, Rohit Prabhavalkar, Anjuli Kannan, Antoine Bruguier, David Rybach, and Patrick Nguyen. 2019. On the Choice of Modeling Unit for Sequence-to-Sequence Speech Recognition. In Proc. Interspeech.

Shigeki Karita, Nanxin Chen, Tomoki Hayashi, et al. 2019. A Comparative Study on Transformer vs RNN in Speech Applications. In Proc. ASRU.

Suyoun Kim, Takaaki Hori, and Shinji Watanabe. 2017. Joint CTC-Attention based End-to-End Speech Recognition using Multi-task Learning. In Proc. ICASSP.

Diederik P. Kingma and Jimmy Lei Ba. 2014. Adam: A Method for Stochastic Optimization. In Proc. ICLR.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proc. EMNLP: System Demonstrations.

Michael F. Mathieu, Junbo Jake Zhao, Junbo Zhao, Aditya Ramesh, Pablo Sprechmann, and Yann LeCun. 2016. Disentangling factors of variation in deep representation using adversarial training. In Proc. NeurIPS.

Abdelrahman Mohamed, Dmytro Okhonko, and Luke Zettlemoyer. 2019. Transformers with convolutional context for ASR. arXiv preprint arXiv:1904.11660.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A Fast, Extensible Toolkit for Sequence Modeling. In Proc. NAACL-HLT: Demonstrations.

Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le. 2019. SpecAugment: A simple data augmentation method for automatic speech recognition. In Proc. Interspeech.

Daniel Povey, Vijayaditya Peddinti, Daniel Galvez, Pegah Ghahremani, Vimal Manohar, Xingyu Na, Yiming Wang, and Sanjeev Khudanpur. 2016. Purely sequence-trained neural networks for ASR based on lattice-free MMI.

Senthil Purushwalkam, Maximilian Nickel, Abhinav Gupta, and Marc'Aurelio Ranzato. 2019. Task-driven modular networks for zero-shot compositional learning. CoRR.

David E. Rumelhart, James L. McClelland, and the PDP Research Group, editors. 1986. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations.

Richard Socher, Milind Ganjoo, Christopher D. Manning, and Andrew Ng. 2013. Zero-Shot Learning Through Cross-Modal Transfer. In Proc. NeurIPS.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proc. NeurIPS.

Pawel Swietojanski, Arnab Ghoshal, and Steve Renals. 2012. Unsupervised cross-lingual knowledge transfer in DNN-based LVCSR. In Proc. SLT.

Naftali Tishby, Fernando C. Pereira, and William Bialek. 1999. The information bottleneck method. In Proc. of the 37th Annual Allerton Conference on Communication, Control and Computing.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proc. NeurIPS.

Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, et al. 2018. ESPnet: End-to-End Speech Processing Toolkit.