End-to-End Speech-Translation with Knowledge Distillation: FBK@IWSLT2020
Marco Gaido, Mattia Antonino Di Gangi, Matteo Negri, Marco Turchi
Fondazione Bruno Kessler, Trento, Italy
University of Trento, Italy
{mgaido|digangi|negri|turchi}@fbk.eu

Abstract
This paper describes FBK's participation in the IWSLT 2020 offline speech translation (ST) task. The task evaluates systems' ability to translate English TED talks audio into German texts. The test talks are provided in two versions: one contains the data already segmented with automatic tools and the other is the raw data without any segmentation. Participants can decide whether to work on custom segmentation or not. We used the provided segmentation. Our system is an end-to-end model based on an adaptation of the Transformer for speech data. Its training process is the main focus of this paper and it is based on: i) transfer learning (ASR pretraining and knowledge distillation), ii) data augmentation (SpecAugment, time stretch and synthetic data), iii) combining synthetic and real data marked as different domains, and iv) multi-task learning using the CTC loss. Finally, after the training with word-level knowledge distillation is complete, our ST models are fine-tuned using label smoothed cross entropy. Our best model scored 29 BLEU on the MuST-C En-De test set, which is an excellent result compared to recent papers, and 23.7 BLEU on the same data segmented with VAD, showing the need for researching solutions addressing this specific data condition.

The offline speech translation task consists in generating the text translation of speech audio recordings into a different language. In particular, the IWSLT2020 task (Ansari et al., 2020) evaluates German translation of English recordings extracted from TED talks. The test dataset is provided to participants both segmented in a sentence-like format using a Voice Activity Detector (VAD) and in the original unsegmented form. Although custom segmentation of the data can provide drastic improvements in the final scores, in our work we have not addressed it, participating only with the provided segmentation.

Two main approaches are possible to face the speech translation task. The classic one is the cascade solution, which includes automatic speech recognition (ASR) and machine translation (MT) components. The other option is an end-to-end (E2E) solution, which performs ST with a single sequence-to-sequence model. Both of them are allowed for the IWSLT2020 task, but our submission is based on an E2E model.

E2E ST models gained popularity in the last few years. Their rise is due to the lack of error propagation and the reduced latency in generating the output compared to the traditional cascaded approach. Despite these appealing properties, they have so far failed to reach the same results obtained by cascade systems, as shown also by last year's IWSLT campaign (Niehues et al., 2019). One reason for this is the limited amount of parallel corpora compared to those used to separately train ASR and MT components. Moreover, training an E2E ST system is more difficult because the task is more complex, since it deals with understanding the content of the input audio and translating it into a different language directly, without resorting to intermediate representations.

The above-mentioned observations have led researchers to focus on transferring knowledge from MT and ASR systems to improve ST models. A traditional approach consists in pretraining components: the ST encoder is initialized with the ASR encoder and the ST decoder with the MT decoder. The encoder pretraining has indeed proved to be effective (Bansal et al., 2019), while the decoder pretraining has not proven as effective, unless adaptation layers are added (Bahar et al., 2019a).
A more promising way to transfer knowledge from an MT model is to use the MT model as a teacher to distill knowledge for the ST training (Liu et al., 2019). This is the approach we explore in this paper.

Despite its demonstrated effectiveness, ASR pretraining has been replaced in some works by multi-task learning (Weiss et al., 2017). In this case, the model is jointly trained with two (or more) loss functions and usually consists of three components: i) a shared encoder, ii) a decoder which generates the transcription, and iii) a decoder which generates the translation. We adopt the slightly different approach introduced by Bahar et al. (2019a), which does not introduce an additional decoder but relies on the CTC loss to predict the transcriptions (Kim et al., 2017). As this multi-task learning has been proposed for speech recognition and has proved useful in that scenario, we also include the CTC loss in the ASR pretraining.

Another topic that received considerable attention is data augmentation. Many techniques have been proposed: in this work we focus on SpecAugment (Park et al., 2019), time stretch and sub-sequence sampling (Nguyen et al., 2020). Moreover, we used synthetic data generated by automatically translating the ASR datasets with our MT model. This process can also be considered as a sequence-level knowledge distillation technique, named Sequence KD (Kim and Rush, 2016).

In this paper, we explore different ways to combine synthetic and real data. We also check whether the benefits of the techniques mentioned above are orthogonal and whether joining them leads to better results. Our experiments show that:

• knowledge distillation, ASR pretraining, multi-task learning and data augmentation are complementary, i.e. they cooperate to produce a better model;

• combining synthetic and real data, marking them with different tags (Caswell et al., 2019), leads to a model which generalizes better;

• fine-tuning a model trained with word-level knowledge distillation using the more classical label smoothed cross entropy (Szegedy et al., 2016) significantly improves the results;

• there is a huge performance gap between data segmented in sentences and data segmented with VAD. Indeed, on the same test set, the score on VAD-segmented data is lower by 5.5 BLEU.

To summarize, our submission is characterized by tagged synthetic data, multi-task learning with CTC loss on the transcriptions, data augmentation and word-level knowledge distillation.

This section describes the data used to build our models. They include: i) MT corpora (English-German sentence pairs), for the model used in knowledge distillation; ii)
ASR corpora (audio and English transcriptions), for generating a pretrained encoder for the ST task; iii)
ST corpora (audios with corresponding English transcription and German translation), for the training of our ST models. For each task, we used all the relevant datasets allowed by the evaluation campaign (http://iwslt.org/doku.php?id=offline_speech_translation).
MT. All datasets allowed in WMT 2019 (Barrault et al., 2019) were used for the MT training, with the addition of OpenSubtitles2018 (Lison and Tiedemann, 2016). These datasets contain spurious sentence pairs: some target sentences are in a language different from German (often in English), are unrelated to the corresponding English source, or contain unexpected characters (such as ideograms). As a consequence, an initial training on them caused the model to produce some English sentences, instead of German, in the output. Hence, we cleaned our MT training data using ModernMT (Bertoldi et al., 2017), running the CleaningPipelineMain class of MMT, in order to remove sentences whose language is not the correct one. We further filtered out sentences containing ideograms with a custom script. Overall, we removed roughly 25% of the data and the final dataset used in the training contains nearly 49 million sentence pairs.
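The ideogram filter can be as simple as a Unicode-range check on the target side. The following is a minimal sketch, not the original script: the helper names and the exact character ranges are our own illustrative choices.

```python
import re

# CJK ideograph and kana blocks (an illustrative, non-exhaustive choice of ranges).
IDEOGRAM_RE = re.compile(r"[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff]")

def keep_pair(src: str, tgt: str) -> bool:
    """Return False for sentence pairs whose target contains ideograms."""
    return IDEOGRAM_RE.search(tgt) is None

def filter_corpus(src_path: str, tgt_path: str, out_src: str, out_tgt: str) -> None:
    """Filter a parallel corpus stored as two line-aligned text files."""
    with open(src_path, encoding="utf-8") as fs, open(tgt_path, encoding="utf-8") as ft, \
         open(out_src, "w", encoding="utf-8") as os_, open(out_tgt, "w", encoding="utf-8") as ot:
        for src, tgt in zip(fs, ft):
            if keep_pair(src, tgt):
                os_.write(src)
                ot.write(tgt)
```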
ASR. For this task, we used both pure ASR and ST available corpora. They include TED-LIUM 3 (Hernandez et al., 2018), Librispeech (Panayotov et al., 2015), Mozilla Common Voice (https://voice.mozilla.org/), How2 (Sanabria et al., 2018), the En-De section of MuST-C (Di Gangi et al., 2019a), the Speech-Translation TED corpus provided by the task organizers and the En-De section of Europarl-ST (Iranzo-Sánchez et al., 2020). All data was lowercased and punctuation was removed.
ST. In addition to the allowed ST corpora (MuST-C, Europarl-ST and the Speech-Translation TED corpus), we generated synthetic data using Sequence KD (see Section 3.2) for all the ASR datasets missing the German reference. Moreover, we generated synthetic data for the En-Fr section of MuST-C. Overall, the combination of real and generated data resulted in an ST training set of 1.5 million samples.

All texts were preprocessed by tokenizing them, de-escaping special characters and normalizing punctuation with the scripts in the Moses toolkit (Koehn et al., 2007). The words in both languages were segmented using BPE with 8,000 merge rules learned jointly on the two languages of the MT training data (Sennrich et al., 2016). The audio was converted into 40 log Mel-filter banks with speaker normalization using XNMT (Neubig et al., 2018). We discarded samples with more than 2,000 filter-bank frames in order to prevent memory issues.

The models we trained are based on Transformer (Vaswani et al., 2017). The MT model is a plain Transformer with 6 layers for both the encoder and the decoder, 16 attention heads, 1,024 features for the attention layers and 4,096 hidden units in the feed-forward layers.
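As a reference for the audio preprocessing described above (40-dimensional log Mel filterbanks plus a maximum-length filter), here is a minimal sketch. We use torchaudio purely for illustration, whereas the actual pipeline relied on XNMT, and the speaker normalization step is only approximated by a per-utterance normalization.

```python
import torchaudio

def extract_features(wav_path: str, max_frames: int = 2000):
    """Compute 40-dim log Mel filterbanks; drop utterances longer than max_frames."""
    waveform, sample_rate = torchaudio.load(wav_path)
    fbank = torchaudio.compliance.kaldi.fbank(
        waveform, num_mel_bins=40, sample_frequency=sample_rate
    )  # shape: (num_frames, 40)
    if fbank.size(0) > max_frames:
        return None  # discarded to prevent memory issues
    # The paper applies speaker-level normalization; a per-utterance
    # mean/variance normalization is shown here as a stand-in.
    fbank = (fbank - fbank.mean(dim=0)) / (fbank.std(dim=0) + 1e-5)
    return fbank
```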
2D Self-Attention   Encoder   Decoder   BLEU
        2              6         6      16.50
        0              8         6

Table 1: Results on Librispeech with Word KD varying the number of layers.
The ASR and ST models are a revisited version of the S-Transformer introduced by Di Gangi et al. (2019c). In preliminary experiments on Librispeech (see Table 1), we observed that replacing 2D self-attention layers with additional Transformer encoder layers was beneficial to the final score. Moreover, we noticed that adding more layers in the encoder improves the results, while removing a few layers from the decoder does not harm performance. Hence, the models used in this work process the input with two 2D CNNs, whose output is projected into the higher-dimensional space used by the Transformer encoder layers. The projected output is summed with positional embeddings before being fed to the Transformer encoder layers, which use a logarithmic distance penalty.

Both our ASR and ST models have 8 attention heads, 512 features for the attention layers and 2,048 hidden units in the FFN layers. The ASR model has 8 encoder layers and 6 decoder layers, while the ST model has 11 encoder layers and 4 decoder layers. The ST encoder is initialized with the ASR encoder (except for the additional 3 layers, which are initialized with random values). The decision of having a different number of encoder layers in the two encoders is motivated by the idea of introducing adaptation layers, which Bahar et al. (2019a) reported to be essential when initializing the decoder with that of a pretrained MT model.
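A minimal sketch of this partial encoder initialization, assuming fairseq-style checkpoints (a "model" state dict with parameters named "encoder.layers.N.…"); the helper name and the layer-index parsing are our own illustration:

```python
import torch

def init_st_encoder_from_asr(st_model, asr_checkpoint_path: str, shared_layers: int = 8):
    """Copy the CNN front-end and the first `shared_layers` encoder layers from a
    pretrained ASR checkpoint; the remaining ST encoder layers keep their random init."""
    asr_state = torch.load(asr_checkpoint_path, map_location="cpu")["model"]
    st_state = st_model.state_dict()
    for name, weight in asr_state.items():
        if not name.startswith("encoder."):
            continue  # decoder parameters are not transferred
        if ".layers." in name:
            layer_id = int(name.split(".layers.")[1].split(".")[0])
            if layer_id >= shared_layers:
                continue  # layers beyond the shared ones stay randomly initialized
        if name in st_state and st_state[name].shape == weight.shape:
            st_state[name] = weight
    st_model.load_state_dict(st_state)
```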
One of the main problems for end-to-end ST is the scarcity of parallel corpora. In order to mitigate this issue, we explored the following data augmentation strategies in our participation.
SpecAugment. SpecAugment is a data augmentation technique originally introduced for ASR, whose effectiveness has also been demonstrated for ST (Bahar et al., 2019b). It operates on the input filterbanks and consists in masking consecutive portions of the input both in the frequency and in the time dimension. On every input, at each iteration, SpecAugment is applied with probability p. In case of application, it generates frequency_masking_num masks on the frequency axis and time_masking_num masks on the time axis. Each mask has a starting index, which is sampled from a uniform distribution, and a number of consecutive items to mask, which is a random number between 0 and, respectively, frequency_masking_pars and time_masking_pars. Masked items are set to 0. In our work, we always applied SpecAugment to both the ASR pretraining and the ST training. The configuration we used is: p = 0.5, frequency_masking_pars = 13, time_masking_pars = 20, frequency_masking_num = 2 and time_masking_num = 2.
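A minimal sketch of this masking scheme on a (frames, mel_bins) filterbank matrix, using the configuration reported above; the function and parameter names are ours:

```python
import numpy as np

def spec_augment(fbank: np.ndarray, p: float = 0.5,
                 freq_masks: int = 2, freq_pars: int = 13,
                 time_masks: int = 2, time_pars: int = 20) -> np.ndarray:
    """Randomly zero out frequency bands and time spans of a (frames, mel_bins) matrix."""
    if np.random.rand() > p:
        return fbank  # applied only with probability p
    fbank = fbank.copy()
    n_frames, n_bins = fbank.shape
    for _ in range(freq_masks):
        width = np.random.randint(0, freq_pars + 1)   # mask width in [0, freq_pars]
        start = np.random.randint(0, max(1, n_bins - width))
        fbank[:, start:start + width] = 0.0
    for _ in range(time_masks):
        width = np.random.randint(0, time_pars + 1)   # mask width in [0, time_pars]
        start = np.random.randint(0, max(1, n_frames - width))
        fbank[start:start + width, :] = 0.0
    return fbank
```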
Time stretch. Time stretch (Nguyen et al., 2020) is another technique which operates directly on the filterbanks, aiming at generating the same effect as speed perturbation. It divides the input sequence into windows of w features and re-samples each of them by a random factor s drawn from a uniform distribution between 0.8 and 1.25 (in our implementation, the lower bound is set to 1.0 in case of an input sequence with length lower than 10). In this work, we perturb an input sample with time stretch with probability 0.3.
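A sketch of this window-wise re-sampling; the window size, the nearest-neighbour frame selection and the interpretation of the factor s as a change of window length are illustrative assumptions, not necessarily those of the original implementation:

```python
import numpy as np

def time_stretch(fbank: np.ndarray, w: int = 10, prob: float = 0.3) -> np.ndarray:
    """Re-sample each window of w frames by a random factor drawn in [low, 1.25)."""
    if np.random.rand() > prob:
        return fbank
    low = 1.0 if fbank.shape[0] < 10 else 0.8  # short inputs are never shortened
    stretched = []
    for start in range(0, fbank.shape[0], w):
        window = fbank[start:start + w]
        s = np.random.uniform(low, 1.25)
        # Assumption: the new window length is s times the original one.
        new_len = max(1, int(round(window.shape[0] * s)))
        idx = np.linspace(0, window.shape[0] - 1, new_len).round().astype(int)
        stretched.append(window[idx])
    return np.concatenate(stretched, axis=0)
```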
Sub-sequence sampling. As mentioned in the introduction, there is a huge gap in the model's performance when translating data split in well-formed sentences and data split with VAD. In order to reduce this difference, we tried to train the model on sentences which are not always well-formed by using sub-sequence sampling (Nguyen et al., 2020). Sub-sequence sampling requires the alignments between the speech and the target text at word level. As this information cannot be obtained for the translations, we created the sub-sequences with the alignments between the audio and the transcription, and then we translated the obtained transcription with our MT model to get the target German translation. For every input sentence, we generated three segments: i) one starting at the beginning of the sentence and ending at a random word in the second half of the sentence, ii) one starting at a random word in the first half of the sentence and ending at the end of the sentence, and iii) one starting at a random word in the first quarter of the sentence and ending at a random word in the last quarter of the sentence.

In our experiments, this technique has not provided significant improvements (the gain was less than 0.1 BLEU on the VAD-segmented test set). Hence, it was not included in our final models.
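The three segments can be drawn as follows, given word-level time alignments; representing an alignment as a list of (start_time, end_time) pairs, one per word, is our own simplification:

```python
import random
from typing import List, Tuple

def sample_subsequences(word_times: List[Tuple[float, float]]):
    """Return three segments as (audio_start, audio_end, (start_word, end_word))."""
    n = len(word_times)
    assert n >= 4, "sentence too short to sample sub-sequences"
    half, quarter = n // 2, n // 4
    segments = [
        (0, random.randint(half, n - 1)),              # start -> random word in 2nd half
        (random.randint(0, half - 1), n - 1),          # random word in 1st half -> end
        (random.randint(0, quarter - 1),               # 1st quarter -> last quarter
         random.randint(n - quarter, n - 1)),
    ]
    return [(word_times[s][0], word_times[e][1], (s, e)) for s, e in segments]
```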
Synthetic data. Finally, we generated synthetic translations for the data in the ASR datasets to create parallel audio-translation pairs to be included in the ST trainings. The missing target sentences were produced by translating the transcript of each audio sample with our MT model, as in (Jia et al., 2019). If the transcription of a dataset was provided with punctuation and correct casing, this was fed to the MT model; otherwise, we had to use the lowercase transcription without punctuation.
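Generating the synthetic references amounts to translating the available transcripts; a sketch with a hypothetical `translate` callable standing in for the MT model:

```python
from typing import Callable, Optional

def synthetic_target(translate: Callable[[str], str],
                     cased_transcript: Optional[str],
                     lower_transcript: str) -> str:
    """Generate a German reference for an ASR sample via the MT model.
    The true-cased, punctuated transcript is preferred when the dataset provides it;
    otherwise the lowercase, unpunctuated one is used."""
    source = cased_transcript if cased_transcript is not None else lower_transcript
    return translate(source)
```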
Top K   BLEU
   4    16.43
   8    16.50
  64    16.37
1024    16.34

Table 2: Results on Librispeech with different K values, where K is the number of tokens considered for Word KD.
While the ASR and MT models are optimized on label smoothed cross entropy with smoothing factor 0.1, our ST models are trained with word-level knowledge distillation (Word KD). In Word KD, the model being trained is named student and the goal is to teach it to produce the same output distribution as another, pretrained, model, named teacher. This is obtained by computing the KL divergence (Kullback and Leibler, 1951) between the distribution produced by the student and the distribution produced by the teacher. The rationale of knowledge distillation resides in providing additional information to the student, as the output probabilities produced by the teacher reflect its hidden knowledge (the so-called dark knowledge), and in the fact that the soft labels produced by the teacher are an easier target for the student to match than the hard labels used by cross entropy.

In this work, we follow (Liu et al., 2019), so the teacher model is our MT model and the student is the ST model. Compared to (Liu et al., 2019), we make the training more efficient by extracting only the top 8 tokens from the teacher distribution. In this way, we can precompute and store the MT output instead of computing it at each training iteration, since its size is reduced by three orders of magnitude. Moreover, this approach does not negatively affect the final score, as shown by (Tan et al., 2019) and confirmed for ST by our experiments in Table 2.

Moreover, once the training with Word KD is terminated, we perform a fine-tuning of the ST model using the label smoothed cross entropy. Fine-tuning on a different target is an approach whose effectiveness has been shown by (Kim and Rush, 2016). Nevertheless, they applied a fine-tuning on knowledge distillation after a pretraining with the cross entropy loss, while here we do the opposite. Preliminary experiments on Librispeech showed that there is no difference in the order of the trainings (16.79 vs 16.81 BLEU, compared to 16.5 BLEU before the fine-tuning). In the fine-tuning, we train both on real and synthetic data, but we do not use the other data augmentation techniques.
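A minimal sketch of the word-level KD loss restricted to the top-8 teacher tokens; the tensor layout and the renormalization of the truncated teacher distribution are our assumptions:

```python
import torch
import torch.nn.functional as F

def word_kd_loss(student_logits: torch.Tensor,
                 teacher_topk_ids: torch.Tensor,
                 teacher_topk_probs: torch.Tensor) -> torch.Tensor:
    """KL divergence between the student distribution and the precomputed top-K
    teacher distribution.
    student_logits:      (batch, seq_len, vocab)
    teacher_topk_ids:    (batch, seq_len, K) token indices stored offline
    teacher_topk_probs:  (batch, seq_len, K) corresponding teacher probabilities
    """
    log_probs = F.log_softmax(student_logits, dim=-1)
    # Student log-probabilities at the teacher's top-K tokens.
    student_topk = log_probs.gather(-1, teacher_topk_ids)
    # Renormalize the truncated teacher distribution so it sums to one.
    teacher = teacher_topk_probs / teacher_topk_probs.sum(dim=-1, keepdim=True)
    # KL(teacher || student) restricted to the top-K support, dropping the
    # constant teacher-entropy term (i.e. a soft cross entropy).
    return -(teacher * student_topk).sum(dim=-1).mean()
```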
A key aspect is the training scheme used to combine the real and synthetic datasets. In this paper, we explore two alternatives:

• Sequence KD + Finetune: this is the training scheme suggested in (He et al., 2020). The model is first trained with Sequence KD and Word KD on the synthetic datasets and then it is fine-tuned on the datasets with ground-truth targets using Word KD.

• Multi-domain: similarly to our last year's submission (Di Gangi et al., 2019b), the training is executed on all data at once, but we introduce three tokens representing the three types of data, namely: i) those whose ground-truth translations are provided, ii) those generated from true-cased transcriptions with punctuation, and iii) those generated from lowercase transcriptions without punctuation. We explore the two most promising approaches according to (Di Gangi et al., 2019d) to integrate the token with the data, i.e. summing the token embedding to all the encoder input elements or summing it to all the decoder input embeddings.

We found that adding the CTC loss (Graves et al., 2006) to the training objective gives better results both in ASR and ST, although it slows down the training by nearly a factor of 2. During the ASR training, we added the CTC loss on the output of the last layer of the encoder. During the ST training, instead, the CTC loss was computed using the output of the last layer pretrained with the ASR encoder, i.e. the 8th layer. In this way, the ST encoder has three additional layers which can transform the representation into features which are more convenient for the ST task, as Bahar et al. (2019a) did by introducing an adaptation layer.
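A sketch of the combined objective, with the CTC term computed on an intermediate encoder layer (the 8th, i.e. the last one initialized from the ASR encoder); the module structure and the loss weighting are illustrative assumptions, as the paper does not report them:

```python
import torch
import torch.nn.functional as F

def st_multitask_loss(encoder_states: list,             # hidden states, one per encoder layer
                      ctc_projection: torch.nn.Linear,  # hidden size -> source vocab size
                      transcript: torch.Tensor,         # (batch, src_len) token ids
                      transcript_lens: torch.Tensor,    # (batch,) transcript lengths
                      encoder_out_lens: torch.Tensor,   # (batch,) encoder output lengths
                      decoder_loss: torch.Tensor,       # KD or label smoothed CE loss
                      ctc_weight: float = 0.5,
                      blank_id: int = 0) -> torch.Tensor:
    """Add a CTC term on the 8th encoder layer to the main decoder loss."""
    hidden = encoder_states[7]  # 8th layer, shape (time, batch, dim)
    log_probs = F.log_softmax(ctc_projection(hidden), dim=-1)
    ctc = F.ctc_loss(log_probs, transcript, encoder_out_lens, transcript_lens,
                     blank=blank_id, zero_infinity=True)
    return decoder_loss + ctc_weight * ctc
```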
For our experiments, we used the described training sets and we picked the best model according to the perplexity on the MuST-C En-De validation set. We evaluated our models on three benchmarks: i) the MuST-C En-De test set segmented at sentence level, ii) the same test set segmented with a VAD (Meignier and Merlin, 2010), and iii) the IWSLT 2015 test set (Cettolo et al., 2015).

We trained with Adam (Kingma and Ba, 2015) (betas (0.9, 0.98)). Unless stated otherwise, the learning rate was set to increase linearly from 3e-4 to 5e-4 in the first 5,000 steps and then decay with an inverse square root policy. For fine-tuning, the learning rate was kept fixed at 1e-4. A 0.1 dropout was applied.

Each GPU processed mini-batches containing up to 12K tokens or 8 samples and updates were performed every 8 mini-batches. As we had 8 GPUs, the actual batch size was about 512. In the case of multi-domain training, a batch for each domain was processed before an update: since we have three domains, the overall batch size was about 1,536. Moreover, the datasets in the different domains had different sizes, so the smaller ones were oversampled to match the size of the largest.

As the truncation of the output values of the teacher model to the top 8 leads to a more peaked distribution, we checked whether contrasting this bias is beneficial or not. Hence, we tuned the value of the temperature at generation time in the interval 0.8-1.5. The temperature T is a parameter which is used to divide the logits before the softmax and determines whether to output a softer (if T > 1) or a sharper (if T < 1) distribution (Hinton et al., 2015). By default T is 1, returning an unmodified distribution. The generation of the results reported in this paper was performed using T = 1.3 for the models trained with Word KD. This usually provided a 0.1-0.5 BLEU increase on our benchmarks compared to T = 1, confirming our hypothesis that a compensation of the bias towards a sharper distribution is useful. Instead, T was set to 1 during the generation with models trained with label smoothed cross entropy, as in this case a higher (or lower) temperature caused performance losses of up to 1 BLEU point.

All experiments were executed on a single machine with 8 Tesla K80 GPUs with 11GB RAM. Our implementation is built on top of fairseq (Ott et al., 2019), an open source tool based on PyTorch (Paszke et al., 2019).
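The learning-rate schedule described above can be sketched as follows; the exact way the warmup peak feeds into the inverse-square-root decay is our reading of the policy, not a detail reported in the paper:

```python
def learning_rate(step: int, warmup_steps: int = 5000,
                  init_lr: float = 3e-4, peak_lr: float = 5e-4) -> float:
    """Linear warmup from init_lr to peak_lr, then inverse square root decay."""
    if step < warmup_steps:
        return init_lr + (peak_lr - init_lr) * step / warmup_steps
    return peak_lr * (warmup_steps / step) ** 0.5
```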
The MT model used as teacher for Sequence KD and Word KD scored 32.09 BLEU on the MuST-C En-De test set. We also trained a smaller MT model to initialize the ST decoder with it. Moreover, we trained two ASR models, one without the multitask CTC loss and one with it. They scored respectively 14.67 and 10.21 WER. All the ST systems having the CTC loss were initialized with the latter, while the others were initialized with the former.

Table 3 shows our ST models' results computed on the MuST-C En-De and IWSLT2015 test sets.

Model                      MuST-C sentence   MuST-C VAD   IWSLT2015
Seq KD+FT (w/o TS)              25.80           20.94        17.18
 + FT w/o KD                    27.55           19.64        16.93
Multi ENC (w/o TS)              25.79           21.37        19.07
 + FT w/o KD                    27.24           20.87        19.08
Multi ENC+DEC PT                25.30           20.80        16.76
 + FT w/o KD                    27.40           21.90        18.55
Multi ENC+CTC
 + FT w/o KD (1)                27.98           22.51        20.58
Multi ENC+CTC (5e-3)            25.44           20.41        16.36
 + FT w/o KD
Multi DEC+CTC (5e-3)            26.10           19.94        17.92
 + FT w/o KD                    28.22           22.61        18.31
Ensemble (1) and (2)

Table 3: Case sensitive BLEU scores for our E2E ST models. Notes: Seq KD: Sequence KD; FT: finetuning on ground-truth datasets; TS: time stretch; Multi ENC: multi-domain model with sum of the language token to the encoder input; Multi DEC: multi-domain model with sum of the language token to the decoder input; DEC PT: pretraining of the decoder with that of an MT model; CTC: multitask training with CTC loss on the 8th encoder layer in addition to the target loss; FT w/o KD: finetuning on all data with label smoothed cross entropy; 5e-3: indicates the learning rate used; AVG 5: average 5 checkpoints around the best.
First, we compare the two training schemes examined. As shown in Table 3, Sequence KD + Finetune [Seq KD+FT] has the same performance as Multi-domain with the language token summed to the input [Multi ENC] (or is even slightly better) on the MuST-C test set, but it is significantly worse on the two test sets segmented with VAD. This can be explained by the higher generalization capability of the Multi-domain model. Indeed, Sequence KD + Finetune seems to overfit more to the training data; thus, on data coming from a different distribution, as VAD-segmented data are, its performance drops significantly. For this reason, all the following experiments use the Multi-domain training scheme.
The pretraining of the decoder with that of an MT model does not bring consistent and significant improvements across the test sets [Multi ENC+DEC PT]. Before the fine-tuning with label smoothed cross entropy, indeed, the model performs worse on all test sets. The fine-tuning, though, helps improving performance on all test sets, which was not the case with the previous training. This can be related to the introduction of time stretch, which reduces the overfitting to the training data. Therefore, we decided to discard the MT pretraining and keep time stretch.

The multitask training with the CTC loss, instead, improves the results consistently. The model trained with it [Multi ENC+CTC] outperforms all the others on all test sets by up to 1.5 BLEU points. During the fine-tuning of these models, we do not perform multitask training with the CTC loss, so the fine-tuning is exactly the same as for the previous models.

Interestingly, when increasing the learning rate [Multi ENC+CTC (5e-3)], the performance before the fine-tuning is worse, but the fine-tuning of this model brings an impressive improvement over all test sets. The reason for this behavior is probably related to a better initial exploration of the solution space thanks to the higher learning rate, which, on the other side, prevents getting very close to the local optimum found. In this scenario, the fine-tuning with a lower learning rate helps getting closer to the local optimum, in addition to its usual benefits.
Finally, we tried adding the language token to the embeddings provided to the decoder, instead of to the input data [Multi DEC+CTC (5e-3)]. This was motivated by the idea that propagating this information up to the decoder may be more difficult due to the CTC loss, which does not depend on that information and may thus hide it from higher layers. The experiments disproved this hypothesis, as after the fine-tuning the results are lower on all benchmarks.
We averaged our best model over 5 checkpoints, centered on the best according to the validation loss. We also created an ensemble of the resulting model and the best among the others. Both operations were not useful on the two variants of the MuST-C test set, but improved the score on the IWSLT2015 test set. We argue this means that they are more robust and generalize better.

Our primary submission has been obtained with the ensemble of two models, scoring 20.75 BLEU on the 2020 test set and 19.52 BLEU on the 2019 test set. Our contrastive submission has been generated with the 5-checkpoint average of our best model, scoring 20.25 BLEU on the 2020 test set and 18.92 BLEU on the 2019 test set.
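Checkpoint averaging simply takes the element-wise mean of the saved parameter tensors; a minimal sketch follows (fairseq ships an equivalent averaging script, so this is only illustrative):

```python
import torch

def average_checkpoints(paths, out_path):
    """Element-wise average of the model parameters of several checkpoints."""
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")["model"]
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].float()
    avg = {k: v / len(paths) for k, v in avg.items()}
    torch.save({"model": avg}, out_path)
```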
We described FBK's participation in the IWSLT2020 offline speech translation evaluation campaign (Ansari et al., 2020). Our work focused on the integration of transfer learning, data augmentation, multi-task training and the training scheme used to combine real and synthetic data. Based on the results of our experiments, our submission is characterized by a multi-domain training scheme, with an additional CTC loss on the transcriptions and word-level knowledge distillation, followed by a fine-tuning on label smoothed cross entropy.

Overall, the paper demonstrates that the combination of the above-mentioned techniques can improve the performance of end-to-end ST models so that they can be competitive with cascaded solutions. Moreover, it shows that i) tagged synthetic data leads to more robust models than a pretraining on synthetic data followed by a fine-tuning on datasets with ground-truth targets, and ii) fine-tuning on label smoothed cross entropy after a training with knowledge distillation brings significant improvements. The huge gap (5.5 BLEU) between data segmented in sentences and data segmented with VAD highlights the need for custom solutions for the latter. In light of these considerations, our future research will focus on techniques to improve the results when the audio segmentation is challenging for ST models.

Acknowledgments
This work is part of the "End-to-end Spoken Language Translation in Rich Data Conditions" project (https://ict.fbk.eu/units-hlt-mt-e2eslt/), which is financially supported by an Amazon AWS ML Grant.

References
Ebrahim Ansari, Amittai Axelrod, Nguyen Bach, Ondřej Bojar, Roldano Cattoni, Fahim Dalvi, Nadir Durrani, Marcello Federico, Christian Federmann, Jiatao Gu, Fei Huang, Kevin Knight, Xutai Ma, Ajay Nagesh, Matteo Negri, Jan Niehues, Juan Pino, Elizabeth Salesky, Xing Shi, Sebastian Stüker, Marco Turchi, and Changhan Wang. 2020. Findings of the IWSLT 2020 Evaluation Campaign. In Proceedings of the 17th International Conference on Spoken Language Translation (IWSLT 2020), Seattle, USA.

Parnia Bahar, Tobias Bieschke, and Hermann Ney. 2019a. A Comparative Study on End-to-end Speech to Text Translation. In Proceedings of the International Workshop on Automatic Speech Recognition and Understanding (ASRU), pages 792-799, Sentosa, Singapore.

Parnia Bahar, Albert Zeyer, Ralf Schlüter, and Hermann Ney. 2019b. On Using SpecAugment for End-to-End Speech Translation. In Proceedings of the 16th International Workshop on Spoken Language Translation (IWSLT), Hong Kong.

Sameer Bansal, Herman Kamper, Karen Livescu, Adam Lopez, and Sharon Goldwater. 2019. Pre-training on High-resource Speech Recognition Improves Low-resource Speech-to-text Translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 58-68, Minneapolis, Minnesota. Association for Computational Linguistics.

Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 Conference on Machine Translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1-61, Florence, Italy. Association for Computational Linguistics.

Nicola Bertoldi, Roldano Cattoni, Mauro Cettolo, Amin Farajian, Marcello Federico, Davide Caroselli, Luca Mastrostefano, Andrea Rossi, Marco Trombetti, Ulrich Germann, and David Madl. 2017. MMT: New Open Source MT for the Translation Industry. In Proceedings of the 20th Annual Conference of the European Association for Machine Translation (EAMT), pages 86-91, Prague, Czech Republic.

Isaac Caswell, Ciprian Chelba, and David Grangier. 2019. Tagged Back-Translation. In Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers), pages 53-63, Florence, Italy. Association for Computational Linguistics.

Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, Roldano Cattoni, and Marcello Federico. 2015. The IWSLT 2015 Evaluation Campaign. In Proceedings of the 12th International Workshop on Spoken Language Translation (IWSLT), Da Nang, Vietnam.

Mattia Di Gangi, Roldano Cattoni, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2019a. MuST-C: a Multilingual Speech Translation Corpus. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 2012-2017, Minneapolis, Minnesota.

Mattia Antonino Di Gangi, Matteo Negri, Viet Nhat Nguyen, Amirhossein Tebbifakhr, and Marco Turchi. 2019b. Data Augmentation for End-to-End Speech Translation: FBK@IWSLT '19. In Proceedings of the 16th International Workshop on Spoken Language Translation (IWSLT), Hong Kong.

Mattia Antonino Di Gangi, Matteo Negri, and Marco Turchi. 2019c. Adapting Transformer to End-to-End Spoken Language Translation. In Proceedings of Interspeech 2019, pages 1133-1137, Graz, Austria.

Mattia Antonino Di Gangi, Matteo Negri, and Marco Turchi. 2019d. One-To-Many Multilingual End-to-end Speech Translation. In Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 585-592, Sentosa, Singapore.

Alex Graves, Santiago Fernández, Faustino J. Gomez, and Jürgen Schmidhuber. 2006. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. In Proceedings of the 23rd International Conference on Machine Learning (ICML), pages 369-376, Pittsburgh, Pennsylvania.

Junxian He, Jiatao Gu, Jiajun Shen, and Marc'Aurelio Ranzato. 2020. Revisiting Self-Training for Neural Sequence Generation. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual Conference.

François Hernandez, Vincent Nguyen, Sahar Ghannay, Natalia Tomashenko, and Yannick Estève. 2018. TED-LIUM 3: Twice as Much Data and Corpus Repartition for Experiments on Speaker Adaptation. In Proceedings of the Speech and Computer - 20th International Conference (SPECOM), pages 198-208, Leipzig, Germany. Springer International Publishing.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the Knowledge in a Neural Network. In Proceedings of the NIPS Deep Learning and Representation Learning Workshop, Montréal, Canada.

Javier Iranzo-Sánchez, Joan Albert Silvestre-Cerdà, Javier Jorge, Nahuel Roselló, Adrià Giménez, Albert Sanchis, Jorge Civera, and Alfons Juan. 2020. Europarl-ST: A Multilingual Corpus For Speech Translation Of Parliamentary Debates. In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8229-8233, Barcelona, Spain.

Ye Jia, Melvin Johnson, Wolfgang Macherey, Ron J. Weiss, Yuan Cao, Chung-Cheng Chiu, Naveen Ari, Stella Laurenzo, and Yonghui Wu. 2019. Leveraging Weakly Supervised Data to Improve End-to-End Speech-to-Text Translation. In Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7180-7184, Brighton, UK.

Suyoun Kim, Takaaki Hori, and Shinji Watanabe. 2017. Joint CTC-Attention based End-to-End Speech Recognition using Multi-task Learning. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4835-4839, New Orleans, Louisiana.

Yoon Kim and Alexander M. Rush. 2016. Sequence-Level Knowledge Distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1317-1327, Austin, Texas.

Diederik Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, California.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pages 177-180, Prague, Czech Republic. Association for Computational Linguistics.

Solomon Kullback and Richard Arthur Leibler. 1951. On information and sufficiency. Ann. Math. Statist., 22(1):79-86.

Pierre Lison and Jörg Tiedemann. 2016. OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the Tenth International Language Resources and Evaluation Conference (LREC), pages 923-929, Portoroz, Slovenia.

Yuchen Liu, Hao Xiong, Jiajun Zhang, Zhongjun He, Hua Wu, Haifeng Wang, and Chengqing Zong. 2019. End-to-End Speech Translation with Knowledge Distillation. In Proceedings of Interspeech 2019, pages 1128-1132, Graz, Austria.

Sylvain Meignier and Teva Merlin. 2010. LIUM SpkDiarization: An Open Source Toolkit For Diarization. In Proceedings of the CMU SPUD Workshop, Dallas, Texas.

Graham Neubig, Matthias Sperber, Xinyi Wang, Matthieu Felix, Austin Matthews, Sarguna Padmanabhan, Ye Qi, Devendra Sachan, Philip Arthur, Pierre Godard, John Hewitt, Rachid Riad, and Liming Wang. 2018. XNMT: The eXtensible Neural Machine Translation Toolkit. In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Papers), pages 185-192, Boston, MA. Association for Machine Translation in the Americas.

Thai-Son Nguyen, Sebastian Stueker, Jan Niehues, and Alex Waibel. 2020. Improving Sequence-to-sequence Speech Recognition Training with On-the-fly Data Augmentation. In Proceedings of the 2020 International Conference on Acoustics, Speech, and Signal Processing (IEEE-ICASSP-2020), Barcelona, Spain.

Jan Niehues, Roldano Cattoni, Sebastian Stüker, Matteo Negri, Marco Turchi, Thanh-Le Ha, Elizabeth Salesky, Ramon Sanabria, Loïc Barrault, Lucia Specia, and Marcello Federico. 2019. The IWSLT 2019 Evaluation Campaign. In Proceedings of the 16th International Workshop on Spoken Language Translation (IWSLT), Hong Kong.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A Fast, Extensible Toolkit for Sequence Modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48-53, Minneapolis, Minnesota. Association for Computational Linguistics.

Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. Librispeech: An ASR corpus based on public domain audio books. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206-5210, South Brisbane, Queensland, Australia.

Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le. 2019. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. In Proceedings of Interspeech 2019, pages 2613-2617, Graz, Austria.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Proceedings of Advances in Neural Information Processing Systems 32 (NIPS), pages 8024-8035. Curran Associates, Inc.

Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loïc Barrault, Lucia Specia, and Florian Metze. 2018. How2: A Large-scale Dataset For Multimodal Language Understanding. In Proceedings of Visually Grounded Interaction and Language (ViGIL), Montréal, Canada. Neural Information Processing Society (NeurIPS).

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715-1725, Berlin, Germany. Association for Computational Linguistics.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818-2826, Las Vegas, Nevada, United States.

Xu Tan, Yi Ren, Di He, Tao Qin, and Tie-Yan Liu. 2019. Multilingual Neural Machine Translation with Knowledge Distillation. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, Louisiana, United States.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All You Need. In Proceedings of Advances in Neural Information Processing Systems 30 (NIPS), pages 5998-6008, Long Beach, California.

Ron J. Weiss, Jan Chorowski, Navdeep Jaitly, Yonghui Wu, and Zhifeng Chen. 2017. Sequence-to-Sequence Models Can Directly Translate Foreign Speech. In Proceedings of Interspeech 2017, Stockholm, Sweden.