ESPnet-ST: All-in-One Speech Translation Toolkit
Hirofumi Inaguma, Shun Kiyono, Kevin Duh, Shigeki Karita, Nelson Enrique Yalta Soplin, Tomoki Hayashi, Shinji Watanabe
Kyoto University, RIKEN AIP, Johns Hopkins University, NTT Communication Science Laboratories, Waseda University, Nagoya University, Human Dataware Lab. Co., Ltd.
[email protected]
Abstract
We present ESPnet-ST, which is designed for the quick development of speech-to-speech translation systems in a single framework.
ESPnet-ST is a new project inside the end-to-end speech processing toolkit ESPnet, which integrates or newly implements automatic speech recognition, machine translation, and text-to-speech functions for speech translation. We provide all-in-one recipes including data pre-processing, feature extraction, training, and decoding pipelines for a wide range of benchmark datasets. Our reproducible results can match or even outperform the current state-of-the-art performances; these pre-trained models are downloadable. The toolkit is publicly available at https://github.com/espnet/espnet.

1 Introduction

Speech translation (ST), the task of converting speech signals in one language into text in another language, is a key technique for breaking the language barrier in human communication. Traditional ST systems involve cascading automatic speech recognition (ASR), text normalization (e.g., punctuation insertion, case restoration), and machine translation (MT) modules; we call this Cascade-ST (Ney, 1999; Casacuberta et al., 2008; Kumar et al., 2014). Recently, sequence-to-sequence (S2S) models have become the method of choice for implementing both the ASR and MT modules (cf. (Chan et al., 2016; Bahdanau et al., 2015)). This convergence of models has opened up the possibility of designing end-to-end speech translation (E2E-ST) systems, in which a single S2S model directly maps speech in a source language to its translation in the target language (Bérard et al., 2016; Weiss et al., 2017).

E2E-ST has several advantages over the cascaded approach: (1) a single E2E-ST model can reduce latency at inference time, which is useful for time-critical use cases like simultaneous interpretation. (2) A single model enables back-propagation training in an end-to-end fashion, which mitigates the risk of error propagation by cascaded modules. (3) In certain use cases such as endangered language documentation (Bird et al., 2014), source speech and target text translation (without the intermediate source text transcript) might be easier to obtain, necessitating the adoption of E2E-ST models (Anastasopoulos and Chiang, 2018). Nevertheless, the verdict is still out on the comparison of translation quality between E2E-ST and Cascade-ST. Some empirical results favor E2E (Weiss et al., 2017) while others favor Cascade (Niehues et al., 2019); the conclusion also depends on the nuances of the training data condition (Sperber et al., 2019).

We believe the time is ripe to develop a unified toolkit that facilitates research in both E2E and cascaded approaches. We present
ESPnet-ST, a toolkit that implements many of the recent models for E2E-ST, as well as the ASR and MT modules for Cascade-ST. Our goal is to provide a toolkit where researchers can easily incorporate and test new ideas under different approaches. Recent research suggests that pre-training, multi-task learning, and transfer learning are important techniques for achieving improved results for E2E-ST (Bérard et al., 2018; Anastasopoulos and Chiang, 2018; Bansal et al., 2019; Inaguma et al., 2019). Thus, a unified toolkit that enables researchers to seamlessly mix-and-match different ASR and MT models in training both E2E-ST and Cascade-ST systems would facilitate research in the field.

There exist many excellent toolkits that support both ASR and MT tasks (see Table 1). However, it is not always straightforward to use them for E2E-ST and Cascade-ST, due to incompatible training/inference pipelines in different modules or lack of detailed preprocessing/training scripts.

| Toolkit | Supported tasks (ASR/LM/E2E-ST/Cascade-ST/MT/TTS) | Examples w/ corpus pre-processing (ASR/LM/E2E-ST/Cascade-ST/MT/TTS) | Pre-trained model |
|---|---|---|---|
| ESPnet-ST (ours) | X/X/X/X/X/X | X/X/X/X/X/X | X |
| Lingvo (Shen et al., 2019) | X/X/X♣/X♣/X/X♣ | X/X/–/–/X/– | – |
| OpenSeq2seq (Kuchaiev et al., 2018) | X/X/–/–/X/X | X/X/–/–/X/– | X |
| RETURNN (Zeyer et al., 2018) | X/X/X/–/X/– | –/–/–/–/–/– | X |
| SLT.KIT (Zenkel et al., 2018) | X/–/X/X/X/– | X/–/X/X/X/– | X |
| Fairseq (Ott et al., 2019) | X/X/–/–/X/– | X/X/–/–/X/– | X |
| Tensor2Tensor (Vaswani et al., 2018) | X/X/–/–/X/– | –/–/–/–/X/– | X♦ |
| OpenNMT-{py,tf} (Klein et al., 2017) | X/X/–/–/X/– | –/–/–/–/–/– | X |
| Kaldi (Povey et al., 2011) | X/X/–/–/–/– | X/X/–/–/–/– | X |
| Wav2letter++ (Pratap et al., 2019) | X/X/–/–/–/– | X/X/–/–/–/– | X |

Table 1: Framework comparison on supported tasks as of January 2020. ♣ Not publicly available. ♦ Available only in Google Cloud storage.

ESPnet-ST is especially designed to target the ST task. ESPnet was originally developed for the ASR task (Watanabe et al., 2018), and recently extended to the text-to-speech (TTS) task (Hayashi et al., 2020). Here, we extend ESPnet to ST tasks, providing code for building translation systems and recipes (i.e., scripts that encapsulate the entire training/inference procedure for reproducibility purposes) for a wide range of ST benchmarks. This is a non-trivial extension: with a unified codebase for ASR/MT/ST and a wide range of recipes, we believe ESPnet-ST is an all-in-one toolkit that should make it easier for both ASR and MT researchers to get started in ST research.

The contributions of
ESPnet-ST are as follows:

• To the best of our knowledge, this is the first toolkit to include ASR, MT, TTS, and ST recipes and models in the same codebase. Since our codebase is based on a unified framework with common stage-by-stage processing (Povey et al., 2011), it is very easy to customize training data and models.

• We provide recipes for ST corpora such as Fisher-CallHome (Post et al., 2013), Libri-trans (Kocabiyikoglu et al., 2018), How2 (Sanabria et al., 2018), and Must-C (Di Gangi et al., 2019a). We also support ST-TED (Jan et al., 2018) and low-resourced Mboshi-French (Godard et al., 2018) recipes. Each recipe contains a single script (run.sh), which covers all experimental processes, such as corpus preparation, data augmentation, and transfer learning.

• We provide the open-sourced toolkit and pre-trained models whose hyper-parameters are intensively tuned. Moreover, we provide an interactive demo of speech-to-speech translation hosted on Google Colab (https://colab.research.google.com/github/espnet/notebook/blob/master/st_demo.ipynb).

2 Design

2.1 Installation

All required tools are automatically downloaded and built under tools (see Figure 1) by a make command. The tools include (1) neural network libraries such as PyTorch (Paszke et al., 2019), (2) ASR-related toolkits such as Kaldi (Povey et al., 2011), and (3) MT-related toolkits such as Moses (Koehn et al., 2007) and sentencepiece (Kudo, 2018).
ESPnet-ST is implemented with the PyTorch backend.
2.2 Recipes for reproducible experiments

We provide various recipes for all tasks in order to quickly and easily reproduce strong baseline systems with a single script. The directory structure is depicted in Figure 1. egs contains corpus directories, in which the corresponding task directories (e.g., st1) are included. To run experiments, we simply execute run.sh under the desired task directory. Configuration YAML files for feature extraction, data augmentation, model training, decoding, etc. are included in conf. Model directories including checkpoints are saved under exp. More details are described in Section 2.4.
[Figure 1: Directory structure of ESPnet-ST]

[Figure 2: All-in-one process pipelines in ESPnet-ST]

2.3 Supported tasks

We support language modeling (LM) and neural text-to-speech (TTS) in addition to ASR, ST, and MT tasks. To the best of our knowledge, none of the existing frameworks supports all of these tasks in a single toolkit. A comparison with other frameworks is summarized in Table 1. Conceptually, it is possible to combine ASR and MT modules for Cascade-ST, but few frameworks provide such examples. Moreover, though some toolkits indeed support speech-to-text tasks, it is not trivial to switch between ASR and E2E-ST tasks, since E2E-ST requires auxiliary tasks (ASR/MT objectives) to achieve reasonable performance.
2.4 Stage-by-stage processing

ESPnet-ST is based on stage-by-stage processing, including corpus-dependent pre-processing, feature extraction, training, and decoding stages. We follow Kaldi-style data preparation, which makes it easy to augment speech data by leveraging other data resources prepared in egs.

Once run.sh is executed, the following processes are started.
Stage 0: Corpus-dependent pre-processing is conducted using scripts under local, and the resulting text data is automatically saved under data. Both transcriptions and the corresponding translations are generated with three different treatments of casing and punctuation marks (hereafter, punct.) after text normalization and tokenization with tokenizer.perl in Moses: (a) tc: truecased text with punct., (b) lc: lowercased text with punct., and (c) lc.rm: lowercased text without punct. except for apostrophes. lc.rm is designed for the ASR task since a conventional ASR system does not generate punctuation marks. However, it is possible to train ASR models so as to generate truecased text using tc (we found that this degrades the ASR performance). A minimal sketch of the three treatments is shown after this stage list.

Stage 1: Speech feature extraction based on Kaldi and our own implementations is performed.
Stage 2: Dataset JSON files in a format ingestible by ESPnet's PyTorch back-end (containing token/utterance/speaker/language IDs, input and output sequence lengths, transcriptions, and translations) are dumped under dump.

Stage 3: (ASR recipe only) The LM is trained.
Stage 4: Model training (RNN/Transformer) is performed.
Stage 5: Model averaging, beam search decoding, and score calculation are conducted.
Stage 6: (Cascade-ST recipe only) The system is evaluated by feeding ASR outputs to the MT model.
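To make the three Stage 0 text treatments concrete, the following is a minimal sketch of how tc / lc / lc.rm outputs could be derived from an already-tokenized sentence. The function is illustrative only; the actual recipes rely on Moses tokenizer.perl and shell scripts rather than this Python code.

```python
import re

def text_treatments(tokenized: str) -> dict:
    """Produce the tc / lc / lc.rm variants described in Stage 0.

    `tokenized` is assumed to be truecased output of Moses tokenizer.perl.
    Hypothetical re-implementation for illustration, not the recipe script.
    """
    tc = tokenized                       # (a) truecased text with punct.
    lc = tokenized.lower()               # (b) lowercased text with punct.
    # (c) lowercased text without punct., keeping apostrophes
    lc_rm = re.sub(r"[^\w\s']", "", lc)
    lc_rm = re.sub(r"\s+", " ", lc_rm).strip()
    return {"tc": tc, "lc": lc, "lc.rm": lc_rm}

print(text_treatments("That 's right , Mr. Smith !"))
# {'tc': "That 's right , Mr. Smith !",
#  'lc': "that 's right , mr. smith !",
#  'lc.rm': "that 's right mr smith"}
```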
2.5 Multi-task learning and transfer learning

In the ST literature, it is acknowledged that the optimization of E2E-ST is more difficult than individually training ASR and MT models. Multi-task learning (MTL) and transfer learning from ASR and MT tasks are promising approaches to this problem (Weiss et al., 2017; Bérard et al., 2018; Sperber et al., 2019; Bansal et al., 2019). Thus, in
Stage 4 of the E2E-ST recipe, we allow options to add auxiliary ASR and MT objectives. We also support options to initialize the parameters of the ST encoder with a pre-trained ASR encoder in asr1, and to initialize the parameters of the ST decoder with a pre-trained MT decoder in mt1, as sketched below.
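Because parameter names are shared across the ASR, MT, and ST model definitions (see Section 3), this initialization amounts to copying matching tensors. Below is a hedged sketch of that idea in plain PyTorch; load_state_dict usage is standard, but the checkpoint format and the encoder./decoder. name prefixes are assumptions, not ESPnet-ST's actual internals.

```python
import torch

def transfer_parameters(st_model, asr_ckpt_path: str, mt_ckpt_path: str):
    """Initialize an E2E-ST model from pre-trained ASR/MT checkpoints.

    Assumes each checkpoint stores a state_dict, the ST encoder shares
    parameter names with the ASR encoder ("encoder.*"), and the ST decoder
    shares names with the MT decoder ("decoder.*").
    """
    asr_state = torch.load(asr_ckpt_path, map_location="cpu")
    mt_state = torch.load(mt_ckpt_path, map_location="cpu")
    st_state = st_model.state_dict()

    # Speech encoder comes from the pre-trained ASR model.
    for name, param in asr_state.items():
        if name.startswith("encoder.") and name in st_state \
                and st_state[name].shape == param.shape:
            st_state[name] = param
    # Translation decoder comes from the pre-trained MT model.
    for name, param in mt_state.items():
        if name.startswith("decoder.") and name in st_state \
                and st_state[name].shape == param.shape:
            st_state[name] = param

    st_model.load_state_dict(st_state)
    return st_model
```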
2.6 Speech data augmentation

We implement techniques that have been shown to give improved robustness in the ASR component.
Speed perturbation
We augmented speech data by changing the speed with factors of 0.9, 1.0, and 1.1, which results in 3-fold data augmentation. We found this to be important for stabilizing E2E-ST training.
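As an illustration, this kind of speed perturbation can be reproduced with sox-style effects; the snippet below uses torchaudio's sox effect bindings as one possible stand-in for the Kaldi-based offline pipeline the recipes actually use.

```python
import torch
import torchaudio

def speed_perturb(waveform: torch.Tensor, sample_rate: int,
                  factors=(0.9, 1.0, 1.1)):
    """Return 3-fold speed-perturbed copies of one utterance.

    Mirrors Kaldi-style augmentation (the sox 'speed' effect, which changes
    both tempo and pitch, followed by resampling back to the original rate).
    Sketch only; the recipes perform this offline during data preparation.
    """
    augmented = []
    for f in factors:
        out, _ = torchaudio.sox_effects.apply_effects_tensor(
            waveform, sample_rate,
            effects=[["speed", str(f)], ["rate", str(sample_rate)]],
        )
        augmented.append(out)
    return augmented
```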
SpecAugment
Time and frequency masking blocks are randomly applied to log-mel filterbank features. This was originally proposed to improve ASR performance and has been shown to be effective for E2E-ST as well (Bahar et al., 2019b).
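The masking itself is simple to express. Below is a hedged PyTorch sketch of time and frequency masking (without time warping); the mask counts and widths are illustrative placeholders, not the tuned values in the recipe configs.

```python
import torch

def spec_augment(feats: torch.Tensor, num_freq_masks: int = 2,
                 freq_width: int = 30, num_time_masks: int = 2,
                 time_width: int = 40) -> torch.Tensor:
    """Apply SpecAugment-style masking to log-mel features.

    feats: (time, n_mels) tensor; masked regions are zeroed in place.
    Mask sizes here are hypothetical, not the recipe defaults.
    """
    t, f = feats.shape
    for _ in range(num_freq_masks):
        w = int(torch.randint(0, freq_width + 1, (1,)))
        f0 = int(torch.randint(0, max(1, f - w), (1,)))
        feats[:, f0:f0 + w] = 0.0        # frequency masking block
    for _ in range(num_time_masks):
        w = int(torch.randint(0, time_width + 1, (1,)))
        t0 = int(torch.randint(0, max(1, t - w), (1,)))
        feats[t0:t0 + w, :] = 0.0        # time masking block
    return feats
```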
2.7 Multilingual training

Multilingual training, where datasets from different language pairs are combined to train a single model, is a potential way to improve the performance of E2E-ST models (Inaguma et al., 2019; Di Gangi et al., 2019c). Multilingual E2E-ST/MT models are supported in several recipes.
2.8 Additional features

Experiment manager
We customize the data loader, trainer, and evaluator by overriding Chainer (Tokui et al., 2019) modules. The common processes are shared among all tasks.
Large-scale training/decoding
We support job schedulers (e.g., SLURM, Grid Engine), multiple GPUs, and half/mixed-precision training/decoding with apex (Micikevicius et al., 2018), available at https://github.com/NVIDIA/apex. Our beam search implementation vectorizes hypotheses for faster decoding (Seki et al., 2019).
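Vectorized beam search scores all active hypotheses in one batched forward pass instead of looping over them. A hedged sketch of that idea follows, with a hypothetical stateless decoder interface rather than ESPnet-ST's actual scorer API:

```python
import torch

def vectorized_beam_step(decoder, hyps: torch.Tensor,
                         scores: torch.Tensor, beam_size: int):
    """Expand all hypotheses in a single batched call.

    decoder: assumed to map a batch of prefixes (beam, prefix_len) to
             next-token log-probs (beam, vocab); decoder state is omitted.
    hyps:    current hypothesis prefixes (token ids).
    scores:  accumulated log-probabilities, shape (beam,).
    """
    logp = decoder(hyps)                              # one forward pass for all beams
    cand = scores.unsqueeze(1) + logp                 # (beam, vocab) candidate scores
    top_scores, flat_idx = cand.view(-1).topk(beam_size)
    beam_idx = torch.div(flat_idx, logp.size(1), rounding_mode="floor")
    token_idx = (flat_idx % logp.size(1)).unsqueeze(1)
    new_hyps = torch.cat([hyps[beam_idx], token_idx], dim=1)
    return new_hyps, top_scores
```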
Performance monitoring
Attention weights and all kinds of training/validation scores and losses for ASR, MT, and ST tasks can be collectively monitored through TensorBoard.
Ensemble decoding
Averaging posterior probabilities from multiple models during beam search decoding is supported.
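Conceptually, each beam step scores the current prefix with every model and averages the posteriors before pruning. The sketch below shows that scoring step in isolation, with a made-up score(ys, enc_out) method returning next-token log-probabilities; this is not ESPnet-ST's actual decoder interface.

```python
import torch

def ensemble_step(models, ys, enc_outs):
    """One scoring step of ensemble beam search decoding.

    models:   list of decoders, each with a hypothetical
              score(ys, enc_out) -> (vocab,) log-prob method.
    ys:       current hypothesis prefix (token ids).
    enc_outs: per-model encoder outputs for the same utterance.
    Averages posterior probabilities (not log-probs) across models.
    """
    probs = torch.stack([
        m.score(ys, enc).exp() for m, enc in zip(models, enc_outs)
    ])                                    # (n_models, vocab)
    return probs.mean(dim=0).log()        # averaged posterior, back in log domain
```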
3 Example models

To give a flavor of the models that are supported with ESPnet-ST, we describe in detail the construction of an example E2E-ST model, which is used later in the Experiments section. Note that there are many customizable options not mentioned here.

Automatic speech recognition (ASR)
We build ASR components with the Transformer-based hybrid CTC/attention framework (Watanabe et al., 2017), which has been shown to be more effective than RNN-based models on various speech corpora (Karita et al., 2019). Decoding with the external LSTM-based LM trained in Stage 3 is also conducted (Kannan et al., 2017). The Transformer uses 12 self-attention blocks stacked on two VGG blocks in the speech encoder and 6 self-attention blocks in the transcription decoder; see (Karita et al., 2019) for implementation details.
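The hybrid CTC/attention objective interpolates a CTC loss on the encoder with a cross-entropy loss on the attention decoder (Watanabe et al., 2017). A hedged sketch of that combination, with a hypothetical model interface and an illustrative interpolation weight (padding handling is omitted for brevity):

```python
import torch
import torch.nn.functional as F

def hybrid_ctc_attention_loss(ctc_logits, att_logits, ys, y_lens, enc_lens,
                              blank: int = 0, alpha: float = 0.3):
    """L = alpha * L_ctc + (1 - alpha) * L_att.

    ctc_logits: (time, batch, vocab) encoder-side logits for CTC.
    att_logits: (batch, out_len, vocab) attention decoder logits.
    ys:         (batch, out_len) target token ids.
    alpha is tunable; 0.3 here is illustrative, not the recipe value.
    """
    ctc_loss = F.ctc_loss(ctc_logits.log_softmax(-1), ys,
                          enc_lens, y_lens, blank=blank)
    att_loss = F.cross_entropy(att_logits.transpose(1, 2), ys)
    return alpha * ctc_loss + (1 - alpha) * att_loss
```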
Machine translation (MT)
The MT model consists of a source text encoder and a translation decoder, implemented as a Transformer with 6 self-attention blocks. For simplicity, we train the MT model by feeding lowercased source sentences without punctuation marks (lc.rm) (Peitz et al., 2011). There are options to explore characters and different subword units in the MT component.
End-to-end speech translation (E2E-ST)
Our E2E-ST model is composed of the speech encoder and the translation decoder. Since the definition of parameter names is exactly the same as in the ASR and MT components, it is quite easy to copy parameters from the pre-trained models for transfer learning. After the ASR and MT models are trained as described above, their parameters are extracted and used to initialize the E2E-ST model. The model is then trained on ST data, with the option of incorporating multi-task objectives as well.
Text-to-speech (TTS)
We also support end-to-end text-to-speech (E2E-TTS), which can be applied after ST outputs a translation. The E2E-TTS model consists of a feature generation network converting an input text to acoustic features (e.g., log-mel filterbank coefficients) and a vocoder network converting the features to a waveform. Tacotron 2 (Shen et al., 2018), Transformer-TTS (Li et al., 2019), FastSpeech (Ren et al., 2019), and their variants such as a multi-speaker model are supported as the feature generation network. WaveNet (van den Oord et al., 2016) and Parallel WaveGAN (Yamamoto et al., 2020) are available as the vocoder network. See Hayashi et al. (2020) for more details.
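The ST-to-TTS hand-off is a simple two-stage pipeline: the feature generation network maps the translated text to acoustic features, and the vocoder maps those features to a waveform. A hedged sketch with hypothetical interfaces (the real entry points live in the ESPnet TTS recipes; see Hayashi et al. (2020)):

```python
import torch

def synthesize(translation: str, tokenize, text2mel, vocoder) -> torch.Tensor:
    """Speech-to-speech output stage: ST hypothesis -> waveform.

    tokenize: text -> token-id tensor (hypothetical helper).
    text2mel: feature generation network (Tacotron 2 / FastSpeech style),
              assumed to map token ids to log-mel features (frames, n_mels).
    vocoder:  neural vocoder (WaveNet / Parallel WaveGAN style), assumed
              to map features to a waveform tensor (samples,).
    All interfaces here are assumptions for illustration.
    """
    with torch.no_grad():
        mel = text2mel(tokenize(translation))
        wav = vocoder(mel)
    return wav
```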
4 Experiments

In this section, we demonstrate how models from our ESPnet-ST recipes perform on benchmark speech translation corpora: Fisher-CallHome Spanish Es→En, Libri-trans En→Fr, How2 En→Pt, and Must-C En→8 languages. We also compare MT systems trained with ESPnet-ST and with Fairseq on the IWSLT 2016 En↔De corpus.

We tokenized all sentences with the tokenizer.perl script in the Moses toolkit (Koehn et al., 2007). We used joint source and target vocabularies based on byte pair encoding (BPE) (Sennrich et al., 2016) units. ASR vocabularies were created with English sentences only, with lc.rm. We report 4-gram BLEU (Papineni et al., 2002) scores with the multi-bleu.perl script in Moses. For speech features, we extracted 80-channel log-mel filterbank coefficients with 3-dimensional pitch features using Kaldi, resulting in 83-dimensional features per frame. Detailed training and decoding configurations are available in conf/train.yaml and conf/decode.yaml, respectively.
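As a rough illustration of this 83-dimensional front-end, torchaudio's Kaldi-compatible fbank can produce the 80 log-mel channels; the 3 pitch dimensions come from Kaldi's pitch extractor in the actual recipes, so they are stubbed out here.

```python
import torch
import torchaudio

def extract_features(wav_path: str) -> torch.Tensor:
    """80-dim log-mel fbank plus 3 pitch dims per frame (83 total).

    Sketch only: the recipes compute both fbank and pitch with Kaldi;
    the pitch block is a zero stub here just to show the 83-dim layout.
    """
    waveform, sr = torchaudio.load(wav_path)
    fbank = torchaudio.compliance.kaldi.fbank(
        waveform, num_mel_bins=80, sample_frequency=sr)   # (frames, 80)
    pitch_stub = torch.zeros(fbank.size(0), 3)            # placeholder for Kaldi pitch
    return torch.cat([fbank, pitch_stub], dim=1)          # (frames, 83)
```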
| Model | Fisher dev | Fisher dev2 | Fisher test | CallHome devtest | CallHome evltest |
|---|---|---|---|---|---|
| E2E | | | | | |
| Char RNN + ASR-MTL (Weiss et al., 2017) | 48.30 | 49.10 | 48.70 | 16.80 | 17.40 |
| ESPnet-ST (Transformer) | | | | | |
| ASR-MTL (multi-task w/ ASR) | 46.64 | 47.64 | 46.45 | 16.80 | 16.80 |
| + MT-MTL (multi-task w/ MT) | 47.17 | 48.20 | 46.99 | 17.51 | 17.64 |
| ASR encoder init. (①) | 46.25 | 47.11 | 46.21 | 17.35 | 16.94 |
| + MT decoder init. (②) | 46.25 | 47.60 | 46.72 | 17.62 | 17.50 |
| + SpecAugment (③) | 48.94 | 49.32 | 48.39 | 18.83 | 18.67 |
| + Ensemble 3 models (①+②+③) | – | – | – | – | – |
| Cascade | | | | | |
| Char RNN ASR → Char RNN MT (Weiss et al., 2017) | – | – | – | – | – |
| Char RNN ASR → Char RNN MT (Inaguma et al., 2019) ♣ | – | – | – | – | – |
| ESPnet-ST: Transformer ASR ♦ → Transformer MT | 41.96 | 43.46 | 42.16 | 19.56 | 19.82 |

Table 2: BLEU of ST systems on the Fisher-CallHome Spanish corpus (Es→En). ♣ Implemented w/ ESPnet. ♦ w/ SpecAugment.

| Model | En→Fr |
|---|---|
| E2E | |
| Transformer + ASR/MT-trans + KD (Liu et al., 2019) | – |
| Transformer + PT △ (Bahar et al., 2019a) | – |
| + adaptor △ (Bahar et al., 2019a) | – |
| + SpecAugment ♣ (Bahar et al., 2019b) | – |
| Transformer + PT △ (Wang et al., 2020) | – |
| ESPnet-ST (Transformer) | |
| ASR-MTL | 15.30 |
| + MT-MTL | 15.47 |
| ASR encoder init. (①) | 15.53 |
| + MT decoder init. (②) | 16.22 |
| + SpecAugment (③) | 16.70 |
| + Ensemble 3 models (①+②+③) | 17.40 |
| Cascade | |
| Transformer ASR → Transformer MT | – |
| Transformer ASR ♦ → Transformer MT | 16.96 |

Table 3: BLEU of ST systems on the Libri-trans corpus (En→Fr). ♣ Implemented w/ ESPnet. △ Pre-training. ♦ w/ SpecAugment.

| Model | En→Pt (dev5) |
|---|---|
| E2E | |
| RNN (Sanabria et al., 2018) | 36.0 |
| ESPnet-ST (Transformer) | |
| Transformer | 40.59 |
| + ASR-MTL | 44.90 |
| + MT-MTL | 45.10 |
| Transformer + ASR encoder init. (①) | 45.03 |
| + MT decoder init. (②) | 45.63 |
| + SpecAugment (③) | 45.68 |
| + Ensemble 3 models (①+②+③) | – |
| Cascade | |
| ESPnet-ST: Transformer ASR → Transformer MT | 44.90 |

Table 4: BLEU of ST systems on the How2 corpus (En→Pt).

4.1 Fisher-CallHome Spanish (Es→En)
Fisher-CallHome Spanish corpus contains 170 hours of Spanish conversational telephone speech, the corresponding transcriptions, as well as the English translations (Post et al., 2013). All punctuation marks except for apostrophes were removed (Post et al., 2013; Kumar et al., 2014; Weiss et al., 2017). We report case-insensitive BLEU on Fisher-{dev, dev2, test} (with four references) and CallHome-{devtest, evltest} (with a single reference). We used a 1k vocabulary for all tasks.

Results are shown in Table 2. It is worth noting that we did not use any additional data resources. Both MTL and transfer learning improved the performance of the vanilla Transformer. Our best system with SpecAugment matches the current state-of-the-art performance (Weiss et al., 2017). Moreover, the total training/inference time is much shorter since our E2E-ST models are based on the BPE 1k unit rather than characters: Weiss et al. (2017) trained their model for more than 2.5 weeks with 16 GPUs, while
ESPnet-ST requires just 1-2 days with a single GPU. The fast inference of ESPnet-ST can be confirmed on our interactive demo page (RTF 0.7755).

| Model | De | Pt | Fr | Es | Ro | Ru | Nl | It |
|---|---|---|---|---|---|---|---|---|
| E2E | | | | | | | | |
| Transformer + ASR encoder init. ♣ (Di Gangi et al., 2019b) | – | – | – | – | – | – | – | – |
| ESPnet-ST (Transformer) | | | | | | | | |
| ASR encoder/MT decoder init. | 22.33 | 27.26 | 31.54 | 27.84 | 20.91 | 15.32 | 26.86 | 22.81 |
| + SpecAugment | – | – | – | – | – | – | – | – |
| Cascade | | | | | | | | |
| Transformer ASR → Transformer MT ♣ | – | – | – | – | – | – | – | – |
| ESPnet-ST: Transformer ASR → Transformer MT | – | – | – | – | – | – | – | – |

Table 5: BLEU of ST systems on the Must-C corpus. ♣ Implemented w/ Fairseq (Di Gangi et al., 2019b).

| Framework | En→De test2012 | En→De test2013 | En→De test2014 | De→En test2012 | De→En test2013 | De→En test2014 |
|---|---|---|---|---|---|---|
| Fairseq | 27.73 | 29.45 | 25.14 | 32.25 | 34.23 | 29.49 |
| ESPnet-ST | 26.92 | 28.88 | 24.70 | 32.19 | 33.46 | 29.22 |

Table 6: BLEU of MT systems on the IWSLT 2016 corpus.

4.2 Libri-trans (En→Fr)
Libri-trans corpus contains 236 hours of English read speech, the corresponding transcriptions, and the French translations (Kocabiyikoglu et al., 2018). We used the clean 100 hours of speech data and augmented the translation references with Google Translate for the training set (Bérard et al., 2018; Liu et al., 2019; Bahar et al., 2019a,b). We report case-insensitive BLEU on the test set. We used a 1k vocabulary for all tasks.

Results are shown in Table 3. Note that all models used the same data resources and are competitive with previous work.

4.3 How2 (En→Pt)
How2 corpus contains English speech extracted from YouTube videos, the corresponding transcriptions, as well as the Portuguese translations (Sanabria et al., 2018). We used the official 300-hour subset for training. Since the speech features in the How2 corpus are pre-processed in advance with Kaldi as 40-channel log-mel filterbank coefficients with 3-dimensional pitch features, we used them without speed perturbation. We used 5k and 8k vocabularies for the ASR and E2E-ST/MT models, respectively. We report case-sensitive BLEU on the dev5 set.

Results are shown in Table 4. Our systems significantly outperform the previous RNN-based model (Sanabria et al., 2018). We believe that our systems can be regarded as reliable baselines for future research.

4.4 Must-C (En→8 languages)

Must-C corpus contains English speech extracted from TED talks, the corresponding transcriptions, and the target translations in 8 language directions (De, Pt, Fr, Es, Ro, Ru, Nl, and It) (Di Gangi et al., 2019a). We conducted experiments in all 8 directions. We used 5k and 8k vocabularies for the ASR and E2E-ST/MT models, respectively. We report case-sensitive BLEU on the tst-COMMON set.

Results are shown in Table 5. Our systems outperform the previous work (Di Gangi et al., 2019b), implemented with a customized Fairseq (https://github.com/mattiadg/FBK-Fairseq-ST), by a large margin.

4.5 MT experiment (En↔De)

The IWSLT evaluation campaign dataset (Cettolo et al., 2012) is the origin of the dataset for our MT experiments. We used the En-De language pair: the IWSLT 2016 training set for training, test2012 as the development set, and the test2013 and test2014 sets as our test sets. We compare the performance of the Transformer model in ESPnet-ST with that of Fairseq in Table 6.
ESPnet-ST achieves performance almost comparable to Fairseq. We assume that the remaining gap is due to minor differences in the implementations of the two frameworks. Also, we carefully tuned the hyper-parameters for the MT task on the small ST corpora, as confirmed by the reasonable performance of our Cascade-ST systems. It is acknowledged that the Transformer model is extremely sensitive to hyper-parameters such as the learning rate and the number of warmup steps (Popel and Bojar, 2018). Thus, it is possible that the suitable sets of hyper-parameters differ across frameworks.

5 Conclusion
We presented
ESPnet-ST for the fast development of end-to-end and cascaded ST systems. We provide various all-in-one example scripts containing corpus-dependent pre-processing, feature extraction, training, and inference. In the future, we will support more corpora and implement novel techniques to bridge the gap between the end-to-end and cascaded approaches.
Acknowledgment
We thank Jun Suzuki for providing helpful feedback on the paper.
References
Antonios Anastasopoulos and David Chiang. 2018. Tied multitask learning for neural speech translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2018), pages 82-91.

Parnia Bahar, Tobias Bieschke, and Hermann Ney. 2019a. A comparative study on end-to-end speech to text translation. In Proceedings of 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2019), pages 792-799.

Parnia Bahar, Albert Zeyer, Ralf Schlüter, and Hermann Ney. 2019b. On using SpecAugment for end-to-end speech translation. In Proceedings of 16th International Workshop on Spoken Language Translation 2019 (IWSLT 2019).

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015).

Sameer Bansal, Herman Kamper, Karen Livescu, Adam Lopez, and Sharon Goldwater. 2019. Pre-training on high-resource speech recognition improves low-resource speech-to-text translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), pages 58-68.

Alexandre Bérard, Laurent Besacier, Ali Can Kocabiyikoglu, and Olivier Pietquin. 2018. End-to-end automatic speech translation of audiobooks. In Proceedings of 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018), pages 6224-6228.

Alexandre Bérard, Olivier Pietquin, Christophe Servan, and Laurent Besacier. 2016. Listen and translate: A proof of concept for end-to-end speech-to-text translation. In Proceedings of NIPS 2016 End-to-end Learning for Speech and Audio Processing Workshop.

Steven Bird, Lauren Gawne, Katie Gelbart, and Isaac McAlister. 2014. Collecting bilingual audio in remote indigenous communities. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics (COLING 2014), pages 1015-1024.

F. Casacuberta, M. Federico, H. Ney, and E. Vidal. 2008. Recent efforts in spoken language translation. IEEE Signal Processing Magazine, 25(3):80-88.

Mauro Cettolo, Christian Girardi, and Marcello Federico. 2012. WIT3: Web inventory of transcribed and translated talks. In Conference of European Association for Machine Translation, pages 261-268.

William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. 2016. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In Proceedings of 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2016), pages 4960-4964.

Mattia A. Di Gangi, Roldano Cattoni, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2019a. MuST-C: a Multilingual Speech Translation Corpus. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), pages 2012-2017.

Mattia A. Di Gangi, Matteo Negri, and Marco Turchi. 2019b. Adapting Transformer to end-to-end spoken language translation. In Proceedings of 20th Annual Conference of the International Speech Communication Association (INTERSPEECH 2019), pages 1133-1137.

Mattia Antonino Di Gangi, Matteo Negri, and Marco Turchi. 2019c. One-to-many multilingual end-to-end speech translation. In Proceedings of 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2019), pages 585-592.

Pierre Godard, Gilles Adda, Martine Adda-Decker, Juan Benjumea, Laurent Besacier, Jamison Cooper-Leavitt, Guy-Noël Kouarata, Lori Lamel, Hélène Maynard, Markus Mueller, Annie Rialland, Sebastian Stueker, François Yvon, and Marcely Zanon-Boito. 2018. A very low resource language speech corpus for computational language documentation experiments. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Tomoki Hayashi, Ryuichi Yamamoto, Katsuki Inoue, Takenori Yoshimura, Shinji Watanabe, Tomoki Toda, Kazuya Takeda, Yu Zhang, and Xu Tan. 2020. ESPnet-TTS: Unified, reproducible, and integratable open source end-to-end text-to-speech toolkit. In Proceedings of 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2020).

Hirofumi Inaguma, Kevin Duh, Tatsuya Kawahara, and Shinji Watanabe. 2019. Multilingual end-to-end speech translation. In Proceedings of 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2019), pages 570-577.

Niehues Jan, Roldano Cattoni, Stüker Sebastian, Mauro Cettolo, Marco Turchi, and Marcello Federico. 2018. The IWSLT 2018 evaluation campaign. In Proceedings of 15th International Workshop on Spoken Language Translation 2018 (IWSLT 2018), pages 2-6.

Anjuli Kannan, Yonghui Wu, Patrick Nguyen, Tara N. Sainath, Zhifeng Chen, and Rohit Prabhavalkar. 2017. An analysis of incorporating an external language model into a sequence-to-sequence model. In Proceedings of 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2017), pages 5824-5828.

Shigeki Karita, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki, Nelson Enrique Yalta Soplin, Ryuichi Yamamoto, Xiaofei Wang, et al. 2019. A comparative study on Transformer vs RNN in speech applications. In Proceedings of 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2019), pages 449-456.

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations, pages 67-72.

Ali Can Kocabiyikoglu, Laurent Besacier, and Olivier Kraif. 2018. Augmenting Librispeech with French translations: A multimodal corpus for direct speech translation evaluation. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pages 177-180.

Oleksii Kuchaiev, Boris Ginsburg, Igor Gitman, Vitaly Lavrukhin, Carl Case, and Paulius Micikevicius. 2018. OpenSeq2Seq: Extensible toolkit for distributed and mixed precision training of sequence-to-sequence models. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), pages 41-46.

Taku Kudo. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018), pages 66-75.

Gaurav Kumar, Matt Post, Daniel Povey, and Sanjeev Khudanpur. 2014. Some insights from translating conversational telephone speech. In Proceedings of 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2014), pages 3231-3235.

Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, and Ming Liu. 2019. Neural speech synthesis with Transformer network. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6706-6713.

Yuchen Liu, Hao Xiong, Zhongjun He, Jiajun Zhang, Hua Wu, Haifeng Wang, and Chengqing Zong. 2019. End-to-end speech translation with knowledge distillation. In Proceedings of 20th Annual Conference of the International Speech Communication Association (INTERSPEECH 2019), pages 1128-1132.

Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. 2018. Mixed precision training. In Proceedings of the 6th International Conference on Learning Representations (ICLR 2018).

Hermann Ney. 1999. Speech translation: Coupling of recognition and translation. In Proceedings of 1999 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 1999), pages 517-520.

J. Niehues, R. Cattoni, S. Stüker, M. Negri, M. Turchi, E. Salesky, R. Sanabria, L. Barrault, L. Specia, and M. Federico. 2019. The IWSLT 2019 evaluation campaign. In Proceedings of 16th International Workshop on Spoken Language Translation 2019 (IWSLT 2019).

Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. 2016. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. Fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48-53.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002), pages 311-318.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of Advances in Neural Information Processing Systems 32 (NeurIPS 2019), pages 8024-8035.

Stephan Peitz, Markus Freitag, Arne Mauser, and Hermann Ney. 2011. Modeling punctuation prediction as machine translation. In Proceedings of 8th International Workshop on Spoken Language Translation 2011 (IWSLT 2011), pages 238-245.

Martin Popel and Ondřej Bojar. 2018. Training tips for the Transformer model. The Prague Bulletin of Mathematical Linguistics, 110(1):43-70.

Matt Post, Gaurav Kumar, Adam Lopez, Damianos Karakos, Chris Callison-Burch, and Sanjeev Khudanpur. 2013. Improved speech-to-text translation with the Fisher and Callhome Spanish-English speech translation corpus. In Proceedings of 10th International Workshop on Spoken Language Translation 2013 (IWSLT 2013).

Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al. 2011. The Kaldi speech recognition toolkit. In Proceedings of 2011 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2011).

Vineel Pratap, Awni Hannun, Qiantong Xu, Jeff Cai, Jacob Kahn, Gabriel Synnaeve, Vitaliy Liptchinsky, and Ronan Collobert. 2019. Wav2Letter++: A fast open-source speech recognition system. In Proceedings of 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2019), pages 6460-6464.

Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2019. FastSpeech: Fast, robust and controllable text to speech. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), pages 3165-3174.

Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loïc Barrault, Lucia Specia, and Florian Metze. 2018. How2: A large-scale dataset for multimodal language understanding. In Proceedings of the Workshop on Visually Grounded Interaction and Language (ViGIL).

Hiroshi Seki, Takaaki Hori, Shinji Watanabe, Niko Moritz, and Jonathan Le Roux. 2019. Vectorized beam search for CTC-attention-based speech recognition. In Proceedings of 20th Annual Conference of the International Speech Communication Association (INTERSPEECH 2019), pages 3825-3829.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), pages 1715-1725.

Jonathan Shen, Patrick Nguyen, Yonghui Wu, Zhifeng Chen, Mia X. Chen, Ye Jia, Anjuli Kannan, Tara Sainath, Yuan Cao, Chung-Cheng Chiu, et al. 2019. Lingvo: A modular and scalable framework for sequence-to-sequence modeling. arXiv preprint arXiv:1902.08295.

Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, R. J. Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, and Yonghui Wu. 2018. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In Proceedings of 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018), pages 4779-4783.

Matthias Sperber, Graham Neubig, Jan Niehues, and Alex Waibel. 2019. Attention-passing models for robust and data-efficient end-to-end speech translation. Transactions of the Association for Computational Linguistics, 7:313-325.

Seiya Tokui, Ryosuke Okuta, Takuya Akiba, Yusuke Niitani, Toru Ogawa, Shunta Saito, Shuji Suzuki, Kota Uenishi, Brian Vogel, and Hiroyuki Yamazaki Vincent. 2019. Chainer: A deep learning framework for accelerating the research cycle. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD 2019), pages 2002-2011.

Ashish Vaswani, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan Gomez, Stephan Gouws, Llion Jones, Łukasz Kaiser, Nal Kalchbrenner, Niki Parmar, Ryan Sepassi, Noam Shazeer, and Jakob Uszkoreit. 2018. Tensor2Tensor for neural machine translation. In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Papers), pages 193-199, Boston, MA. Association for Machine Translation in the Americas.

Chengyi Wang, Yu Wu, Shujie Liu, Zhenglu Yang, and Ming Zhou. 2020. Bridging the gap between pre-training and fine-tuning for end-to-end speech translation. In Proceedings of the AAAI Conference on Artificial Intelligence 2020 (AAAI 2020).

Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, et al. 2018. ESPnet: End-to-end speech processing toolkit. In Proceedings of 19th Annual Conference of the International Speech Communication Association (INTERSPEECH 2018), pages 2207-2211.

Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R. Hershey, and Tomoki Hayashi. 2017. Hybrid CTC/attention architecture for end-to-end speech recognition. IEEE Journal of Selected Topics in Signal Processing, 11(8):1240-1253.

Ron J. Weiss, Jan Chorowski, Navdeep Jaitly, Yonghui Wu, and Zhifeng Chen. 2017. Sequence-to-sequence models can directly translate foreign speech. In Proceedings of 18th Annual Conference of the International Speech Communication Association (INTERSPEECH 2017), pages 2625-2629.

Ryuichi Yamamoto, Eunwoo Song, and Jae-Min Kim. 2020. Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. In Proceedings of 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2020).

Thomas Zenkel, Matthias Sperber, Jan Niehues, Markus Müller, Ngoc-Quan Pham, Sebastian Stüker, and Alex Waibel. 2018. Open source toolkit for speech to text translation. Prague Bull. Math. Linguistics, 111:125-135.

Albert Zeyer, Tamer Alkhouli, and Hermann Ney. 2018. RETURNN as a generic flexible neural toolkit with application to translation and speech recognition. In Proceedings of ACL 2018, System Demonstrations.