Fused Acoustic and Text Encoding for Multimodal Bilingual Pretraining and Speech Translation
Renjie Zheng*¹  Junkun Chen*²  Mingbo Ma  Liang Huang
Abstract
Recently, text and speech representation learning has successfully improved many language-related tasks. However, all existing methods only learn from one input modality, while a unified acoustic and text representation is desired by many speech-related tasks such as speech translation. We propose a Fused Acoustic and Text Masked Language Model (FAT-MLM) which jointly learns a unified representation for both acoustic and text input. Within this crossmodal representation learning framework, we further present an end-to-end model for Fused Acoustic and Text Speech Translation (FAT-ST). Experiments on three translation directions show that our proposed speech translation models fine-tuned from FAT-MLM substantially improve translation quality (by more than 5 BLEU).
1. Introduction
In recent years, task-agnostic text representation learning (Peters et al., 2018; Devlin et al., 2019; Sun et al., 2019) has attracted much attention in the NLP community due to its strong performance on many downstream tasks. More recently, unsupervised speech representation learning (Baevski et al., 2020; Chen et al., 2020; Liu et al., 2020a) has also successfully improved many speech-related tasks, such as speech recognition and speech translation.

However, all these existing methods can only handle one modality, either text or speech, while a joint acoustic and text representation is desired for many end-to-end spoken language processing tasks, such as spoken question answering (Chuang et al., 2019) and end-to-end speech-to-text translation (Liu et al., 2020b). For example, end-to-end speech translation (ST) is desirable due to its advantages over the pipeline paradigm, such as low latency, alleviation of error propagation, and fewer parameters (Weiss et al., 2017; Bérard et al., 2018; Jia et al., 2019; Sperber et al., 2017). However, its translation quality is limited by the scarcity of large-scale parallel speech translation data, while there exists sufficient data for speech recognition and text machine translation (Fig. 1).

* Equal contribution. ¹Baidu Research, Sunnyvale, CA, USA. ²Oregon State University, Corvallis, OR, USA. Part of J.C.'s contributions were made during his internship at Baidu Research.
Figure 1. The quality of end-to-end speech translation models has been limited by the scarcity of speech translation datasets. However, there is an abundance of speech, text, speech recognition, and machine translation data that can be leveraged.

It would be helpful if source speech and bilingual text could be encoded into a unified representation via abundant speech recognition and text machine translation data. Liu et al. (2020b) show that jointly training a multi-modal ST encoder can largely improve translation quality. However, their representation learning method is constrained to the sequence-to-sequence framework, and there is no experiment showing whether it can benefit from extra speech recognition and machine translation data.

Inspired by recent cross-lingual language model pretraining work (Lample & Conneau, 2019), which shows the potential to unify the representations of different languages in one encoder, we propose a Fused Acoustic and Text Masked Language Model (FAT-MLM). This model jointly learns a unified representation for both acoustic and text input. In this way, we extend the masked language model's input from acoustic or text data alone to multimodal corpora containing both acoustic and text data, such as speech recognition and speech translation, for the first time (Fig. 1).

We further extend this fused acoustic and text encoder to a sequence-to-sequence framework and present an end-to-end Speech Translation model (FAT-ST). This enables the model to be trained from both speech and text machine translation data in one single encoder-decoder model.
Figure 2. Previous work for speech or text monomodal representation learning: (a) Masked Language Model (MLM) for text representation learning; (b) Translation Language Model (TLM) for crosslingual text; (c) Masked Acoustic Model (MAM) for speech.
Meanwhile, this model can also learn from speech recognition data using an extra FAT-MLM loss. This resolves the limitation of existing single-encoder, single-decoder speech translation models, which can only learn from scarce parallel speech translation data and neglect the much larger-scale speech recognition and text machine translation data (Fig. 1).

We make the following contributions:

• We propose the Fused Acoustic and Text Masked Language Model (FAT-MLM), which can learn a unified acoustic and text representation.

• Based on FAT-MLM, we propose the Fused Acoustic and Text Speech Translation model (FAT-ST), which can perform speech recognition and machine translation in a single encoder-decoder framework.

• Spontaneous speech translation experiments on three language pairs show that, by finetuning from FAT-MLM, FAT-ST improves over the end-to-end speech translation baseline by more than 4 BLEU on average and achieves state-of-the-art accuracy. This is the first time an end-to-end speech translation model achieves performance similar to the strong cascaded system in these three translation directions of this dataset, while still maintaining a smaller model size and faster decoding.

• We show that FAT-MLM trained with additional speech recognition, machine translation, and monolingual text data can improve FAT-ST by more than 1 BLEU. FAT-ST can be further improved by using additional speech recognition and machine translation data.
2. Previous Work
Radford et al. (2018), Howard & Ruder (2018), and Devlin et al. (2019) investigate language modeling for pretraining Transformer encoders. Unlike Radford et al. (2018), who use unidirectional language models for pretraining, Devlin et al. (2019) propose BERT, which enables deep bidirectional representation pretraining with a masked language modeling (MLM) objective inspired by the Cloze task (Taylor, 1953): some of the tokens in the input are randomly masked, and the objective is to recover each masked word based only on its context. Their approaches lead to drastic improvements on several natural language understanding tasks, including text classification (Wang et al., 2018) and question answering (Rajpurkar et al., 2016).
Lample & Conneau (2019) extend MLM to cross-lingual pretraining by proposing two methods: an unsupervised one that only relies on monolingual data, and a supervised one that leverages parallel data with a new cross-lingual language model objective called the Translation Language Model (TLM). As shown in Fig. 2(b), TLM encodes both the source and target sentences of a parallel pair after masking several tokens with [MASK], and then learns to recover the masked tokens. Experiments show that TLM achieves state-of-the-art results on cross-lingual classification and on unsupervised and supervised machine translation.
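To make the TLM input construction concrete, the following is a minimal sketch under our own simplified assumptions (toy token ids and a single [MASK] id), not the authors' implementation: a source and a target sentence are concatenated, tokens are masked at random, language ids distinguish the two sentences, and positional indices restart for the target sentence, as in Lample & Conneau (2019).

```python
import random

MASK_ID = 0  # hypothetical id of the [MASK] token

def make_tlm_example(src_ids, tgt_ids, mask_prob=0.15, seed=None):
    """Build a TLM training example from a parallel sentence pair."""
    rng = random.Random(seed)
    tokens = src_ids + tgt_ids
    lang_ids = [0] * len(src_ids) + [1] * len(tgt_ids)                  # source vs. target language
    positions = list(range(len(src_ids))) + list(range(len(tgt_ids)))   # positions reset per sentence
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_ID)   # replace with [MASK]
            targets.append(tok)      # the model must recover this token
        else:
            masked.append(tok)
            targets.append(-100)     # conventional "ignore" label
    return masked, lang_ids, positions, targets

# Toy usage with made-up ids standing in for "Good morning" / "Guten Tag".
print(make_tlm_example([11, 12], [21, 22], mask_prob=0.5, seed=1))
```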
Recently, Chen et al. (2020) propose to learn a speech encoder in a self-supervised fashion on the speech side, which can utilize speech data without transcription.
Figure 3. Fused Acoustic and Text Masked Language Model (FAT-MLM): (a) Speech Reconstruction Module; (b) Monolingual FAT-MLM; (c) Acoustic Embedding Module; (d) Translation FAT-MLM.
This technique, termed Masked Acoustic Modeling (MAM), can also perform pretraining on any acoustic signals (including non-speech ones) without annotation. Fig. 2(c) shows the architecture of MAM. Similar to MLM, MAM replaces spans of the speech spectrogram with mask tokens [MASK]. After a 2D convolution layer and a Transformer encoder, MAM learns to recover the masked spectrogram via a 2D deconvolution layer during training. Chen et al. (2020) show that MAM can improve end-to-end speech translation either as an additional loss or as a pretraining model. In parallel to MAM, Baevski et al. (2020) propose the wav2vec 2.0 pretraining model, which masks the speech input in the latent space and pretrains the model via a contrastive task defined over a quantization of the latent representations.
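As a rough illustration of this objective, the sketch below is our own simplified PyTorch example with assumed layer sizes, not the configuration of Chen et al. (2020); in particular, linear layers stand in for the 2D convolution and deconvolution described above. Spans of the spectrogram are overwritten with a learned mask vector, the corrupted input is encoded, and a mean-squared error is computed against the original frames.

```python
import torch
import torch.nn as nn

class TinyMAM(nn.Module):
    """Simplified Masked Acoustic Model: mask spectrogram spans and
    reconstruct them from a Transformer encoding (illustrative sizes)."""
    def __init__(self, feat_dim=80, d_model=256):
        super().__init__()
        self.mask_vec = nn.Parameter(torch.zeros(feat_dim))       # learned mask frame
        self.embed = nn.Linear(feat_dim, d_model)                  # stands in for the 2D conv
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.reconstruct = nn.Linear(d_model, feat_dim)             # stands in for the 2D deconv

    def forward(self, spec, span_len=10, mask_prob=0.3):
        # spec: (batch, frames, feat_dim)
        corrupted = spec.clone()
        masked = torch.zeros(spec.shape[:2], dtype=torch.bool)
        for b in range(spec.size(0)):
            for t in range(0, spec.size(1), span_len):
                if torch.rand(1).item() < mask_prob:
                    corrupted[b, t:t + span_len] = self.mask_vec    # overwrite a span
                    masked[b, t:t + span_len] = True
        if not masked.any():                                        # guarantee at least one span
            masked[:, :span_len] = True
        hidden = self.encoder(self.embed(corrupted))
        recon = self.reconstruct(hidden)
        # MSE over the masked frames only, as in masked-prediction objectives.
        return ((recon - spec) ** 2)[masked].mean()

torch.manual_seed(0)
loss = TinyMAM()(torch.randn(2, 50, 80))
loss.backward()
print(float(loss))
```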
3. Fused Acoustic and Text Masked Language Model (FAT-MLM)
Although existing pretraining models show strong representation learning ability and significantly improve many downstream tasks, they can only learn a representation for either text or speech. However, a unified speech and text multimodal representation is useful for many end-to-end spoken language processing tasks. To address this problem, we propose the Fused Acoustic and Text Masked Language Model (FAT-MLM), a multimodal pretraining model which encodes acoustic and text input into a unified representation. The idea is similar to Lample & Conneau (2019), who propose to learn a unified representation of different languages. They first propose a method relying on a shared sub-word vocabulary to align the representations of different languages. However, this is inapplicable in our case because of the modality difference. Thus we propose a method similar to their second approach, TLM, which uses parallel speech recognition data. In the following sections, we first introduce the monolingual FAT-MLM and then show how to extend it to the translation scenario.
The monolingual FAT-MLM takes speech and transcription tuples as input, denoted as $D_{s,x} = \{(s, x)\}$, where $s = (s_1, ..., s_{|s|})$ is a sequence of acoustic features $s_i \in \mathbb{R}^{d_s}$, which can be the spectrogram or mel-spectrogram of the speech audio, with each $s_i$ representing a frame-level speech feature, and $x = (x_1, ..., x_{|x|})$ is the sequence of the corresponding transcription.

As shown in Fig. 3(c), we first randomly mask several spans of $s$ with a random masking function over the input $s$:

$\hat{s} \sim \mathrm{Mask}_{\mathrm{span}}(s, \lambda)$   (1)

where $\mathrm{Mask}_{\mathrm{span}}(\cdot)$ replaces several random spans of $s$, with probability $\lambda$ (30% in our work), by a randomly initialized vector $\epsilon_s \in \mathbb{R}^{d_s}$. Then we encode $\hat{s}$ with convolutions and a Transformer encoder to obtain the acoustic embeddings $e_{\hat{s}}$.

Similarly, we randomly mask tokens in $x$ with a random masking function over the input $x$:

$\hat{x} \sim \mathrm{Mask}_{\mathrm{token}}(x, \lambda)$   (2)

where $\mathrm{Mask}_{\mathrm{token}}(\cdot)$ replaces several tokens of $x$, with probability $\lambda$, by a randomly initialized vector $\epsilon_{\mathrm{token}} \in \mathbb{R}^{d_x}$. Then we concatenate the acoustic embeddings and the source text embeddings $[e_{\hat{s}}; \hat{x}]$, and obtain the latent representation $f([e_{\hat{s}}; \hat{x}])$ using another Transformer encoder, denoted $f$. As in Lample & Conneau (2019), we reset the positional embeddings for the different types of input.

The training objective of monolingual FAT-MLM includes a speech reconstruction loss $\ell_s(D_{s,x})$ and a text reconstruction loss $\ell_x(D_{s,x})$. For the speech input $s$, we have the following training objective to reconstruct the original speech signal from the surrounding context information:

$\ell_s(D_{s,x}) = \sum_{(s,x) \in D_{s,x}} \| s - g(f([e_{\hat{s}}; \hat{x}])) \|_2^2$   (3)

where $g$ is a reconstruction function (we use 2D deconvolution in this work) which tries to recover the original signal from the encoded representation $f([e_{\hat{s}}; \hat{x}])$. We use the mean squared error to measure the difference between $s$ and the reconstructed spectrogram. For the transcription input $x$, following Devlin et al. (2019), we use a cross-entropy loss, denoted

$\ell_x(D_{s,x}) = -\sum_{(s,x) \in D_{s,x}} \log p(x \mid [e_{\hat{s}}; \hat{x}])$   (4)

to reconstruct the masked tokens. The final loss of monolingual FAT-MLM is:

$\ell_{\text{FAT-MLM}}(D_{s,x}) = \ell_s(D_{s,x}) + \ell_x(D_{s,x})$   (5)

To support multimodal crosslingual tasks such as speech translation, we propose Translation FAT-MLM, which extends monolingual FAT-MLM by taking as additional input the target-language translation of the source-language transcription. Formally, Translation FAT-MLM takes $D_{s,x,y} = \{(s, x, y)\}$ as input, where $y = (y_1, ..., y_{|y|})$ denotes the sequence of the target-language translation. This kind of triplet input is very common in speech translation corpora.

As shown in Fig. 3(d), we incorporate a source language embedding $e_{\text{src}}$ and a target language embedding $e_{\text{tgt}}$ to indicate the language of each segment. Similar to monolingual FAT-MLM, Translation FAT-MLM randomly masks the translation input, $\hat{y} \sim \mathrm{Mask}_{\mathrm{token}}(y, \lambda)$, and concatenates it with the other two embeddings:

$h_{s,x,y} = [e_{\hat{s}} + e_{\text{src}}; \hat{x} + e_{\text{src}}; \hat{y} + e_{\text{tgt}}]$

Note that, as in previous work using masked language model objectives, the reconstruction losses only take the masked input into consideration.
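A minimal sketch of how this concatenation can be assembled follows (our own PyTorch illustration with assumed dimensions; the learned language embeddings and the per-segment reset of positional embeddings follow the description above):

```python
import torch
import torch.nn as nn

d = 256                                   # shared model dimension (assumed)
lang_emb = nn.Embedding(2, d)             # index 0: source language, 1: target language
pos_emb = nn.Embedding(512, d)            # positional embeddings, reset per segment

def build_h(e_s_hat, e_x_hat, e_y_hat):
    """Concatenate masked speech, transcription, and translation embeddings,
    adding language embeddings and per-segment positional embeddings."""
    segments = [(e_s_hat, 0), (e_x_hat, 0), (e_y_hat, 1)]    # speech and source text use e_src
    parts = []
    for emb, lang in segments:
        pos = pos_emb(torch.arange(emb.size(0)))              # positions restart at 0
        lng = lang_emb(torch.tensor(lang)).expand_as(emb)     # e_src or e_tgt
        parts.append(emb + lng + pos)
    return torch.cat(parts, dim=0)                             # (|s_hat| + |x_hat| + |y_hat|, d)

h = build_h(torch.randn(40, d), torch.randn(6, d), torch.randn(7, d))
print(h.shape)                                                 # torch.Size([53, 256])
```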
Figure 4. One speech self-attention head's output at the first Transformer layer in the acoustic embedding module and its corresponding spectrogram, from a Translation FAT-MLM model trained on the MuST-C En→De dataset.
Then we reconstruct the masked input from the concatenated embeddings $h_{s,x,y}$ via a Transformer encoder. The reconstruction losses for the different masked inputs are:

$\ell_s(D_{s,x,y}) = \sum_{(s,x,y) \in D_{s,x,y}} \| s - g(f(h_{s,x,y})) \|_2^2$

$\ell_x(D_{s,x,y}) = -\sum_{(s,x,y) \in D_{s,x,y}} \log p(x \mid h_{s,x,y})$

$\ell_y(D_{s,x,y}) = -\sum_{(s,x,y) \in D_{s,x,y}} \log p(y \mid h_{s,x,y})$

We sum these losses for the final loss of Translation FAT-MLM:

$\ell_{\text{FAT-MLM}}(D_{s,x,y}) = \ell_s(D_{s,x,y}) + \ell_x(D_{s,x,y}) + \ell_y(D_{s,x,y})$

To fully utilize the corpora of different tasks, FAT-MLM can take any combination of speech, transcription, and translation from the triplets, denoted $D_{\{s,x,y\}}$, where $\{s, x, y\}$ ranges over the power set of the $(s, x, y)$ triplet. Specifically, these combinations include speech-only data $\{s\}$; monolingual text data $\{x\}$ or $\{y\}$; speech and transcription tuples $\{(s, x)\}$ for speech recognition; transcription and translation tuples $\{(x, y)\}$ for machine translation; speech and translation tuples $\{(s, y)\}$ for direct speech translation; and speech, transcription, and translation triplets $\{(s, x, y)\}$. For any combination of inputs, FAT-MLM encodes the full concatenation of their embeddings and recovers the masked portions. The loss function is:

$\ell_{\text{FAT-MLM}}(D_{\{s,x,y\}}) = \ell_s(D_{s\star}) + \ell_x(D_{x\star}) + \ell_y(D_{y\star})$   (6)

where $D_{s\star}$, $D_{x\star}$, and $D_{y\star}$ denote any input that includes speech, source-language text, and target-language text, respectively. Note that within this framework we can write MLM as $\ell_x(D_x)$, TLM as $\ell_{x,y}(D_{x,y})$, and MAM as $\ell_s(D_s)$.

To demonstrate FAT-MLM's ability to unify the representations of different modalities and languages, we show the self-attention layers of a Translation FAT-MLM in Fig. 4 and Fig. 5.
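Eq. (6) amounts to a simple dispatch: whichever of $(s, x, y)$ an example provides is masked, embedded, concatenated, and contributes its own reconstruction term. The toy sketch below (our own illustration; `encode`, `speech_loss`, and `text_loss` are stand-in placeholders, not the actual model components) makes that control flow explicit:

```python
# Toy stand-ins so the sketch runs; a real model would replace these with the
# shared Transformer encoder and the reconstruction losses defined above.
def encode(s=None, x=None, y=None):
    return [z for z in (s, x, y) if z is not None]   # "concatenate" available inputs

def speech_loss(h, s):
    return 1.0    # placeholder for the spectrogram reconstruction term (l_s)

def text_loss(h, t):
    return 1.0    # placeholder for the masked-token cross-entropy terms (l_x, l_y)

def fat_mlm_loss(batch):
    """Sum the reconstruction losses over whichever of (s, x, y) is present (Eq. 6)."""
    s, x, y = batch.get("s"), batch.get("x"), batch.get("y")
    h = encode(s=s, x=x, y=y)
    loss = 0.0
    if s is not None:
        loss += speech_loss(h, s)   # l_s: any input that includes speech
    if x is not None:
        loss += text_loss(h, x)     # l_x: any input that includes source text
    if y is not None:
        loss += text_loss(h, y)     # l_y: any input that includes target text
    return loss

# ASR pairs, MT pairs, and full ST triplets all reuse the same objective.
print(fat_mlm_loss({"s": "spectrogram", "x": "transcript"}))        # speech recognition data
print(fat_mlm_loss({"x": "source text", "y": "target text"}))       # machine translation data
print(fat_mlm_loss({"s": "spectrogram", "x": "src", "y": "tgt"}))   # speech translation triplet
```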
Figure 5. Two self-attention heads' outputs at the first layer of the acoustic and text shared Transformer from a Translation FAT-MLM model trained on the MuST-C En→De dataset, annotated with the corresponding spectrogram, transcription (red), and translation (blue). (a) This self-attention head shows bilingual alignment between “and” and “Und”, “you” and “Sie”, and “what” and “?” in the transcription and translation, respectively. (b) The left-side spectrogram shows the gold speech-transcription alignment. This self-attention head shows monotonic crossmodal attention (red box). Meanwhile, the speech-to-translation attention (blue box) clearly shows the alignment between “you” and “Sie” and between “know” and “wissen” in the speech and translation, respectively. Note that in this speech, the pronunciation of “and” is very weak.
The clear monotonic attention in Fig. 4 shows that our proposed method can learn a good representation for speech (Chen et al., 2020). Fig. 5(a) shows that FAT-MLM can learn a good crosslingual alignment between the two languages, such as “and” to “Und” and “you” to “Sie”. Fig. 5(b) shows that FAT-MLM is able to learn a clear monotonic speech-to-text crossmodal attention, like many speech recognition models.
Figure 6. Fused Acoustic and Text Speech Translation (FAT-ST).
4. Fused Acoustic and Text Speech Translation (FAT-ST)
In this section, we present how to adapt FAT-MLM to speech translation and enable speech translation models to learn from speech recognition and text machine translation data.
Regardless of the particular design of different sequence-to-sequence models, the text machine translation encoder always takes an input sequence $x = (x_1, ..., x_n)$, where each $x_i \in \mathbb{R}^{d_x}$ is a word embedding of $d_x$ dimensions, and produces a new sequence of hidden states $h = f(x) = (h_1, ..., h_n)$. On the other hand, the decoder predicts the next output word $y_t$ given the source sequence (actually its representation $h$) and the previously generated words, denoted $y_{<t}$.

Similar to Lample & Conneau (2019), we can further improve FAT-ST by finetuning it from FAT-MLM. Since the FAT-ST decoder predicts text only, we initialize it from the acoustic and text shared Transformer encoder. Although the Transformer decoder is unidirectional, which differs from the bidirectional FAT-MLM, it can still benefit from FAT-MLM in our experiments; this is also observed by Lample & Conneau (2019) and Devlin et al. (2019).

5. Experiments

We conducted speech translation experiments in 3 directions, English to German (En→De), English to Spanish (En→Es), and English to Dutch (En→Nl), to show the translation quality of the baselines and our proposed methods. We use 5 corpora with different modalities and languages: speech translation data $D_{s,x,y}$ from MuST-C (Di Gangi et al., 2019), speech recognition data $D_{s,x}$ from Librispeech (Panayotov et al., 2015), machine translation and monolingual text data $D_{x,y}$, $D_x$, $D_y$ from Europarl V7 (Koehn, 2005), speech-only data $D_s$ from Libri-light (medium version) (Kahn et al., 2020), and monolingual text data from Wiki Text (only for Nl). Statistics of these datasets are shown in Table 1. We evaluate our models on the MuST-C dev and test sets. Note that MuST-C is collected from spontaneous speech (TED talks), which is very different from the audiobook speech datasets used in our experiments. Spontaneous speech is much harder for speech translation than audiobook datasets such as Libri-trans (Kocabiyikoglu et al., 2018). This is one of the reasons why the translation accuracy of end-to-end speech translation lags cascaded systems by a larger margin on MuST-C than on other speech translation corpora.

Table 1. Statistics of all datasets used in our experiments. Note that we use Europarl for En, De, Es monolingual text and Wiki Text for Nl because there is no monolingual Nl portion in Europarl.

(a) Bilingual datasets
Type         Name       Task    En→De            En→Es            En→Nl
                                Hours   #Sents   Hours   #Sents   Hours   #Sents
D_{s,x,y}    MuST-C     ST      408     226K     504     262K     442     245K
D_{x,y}      Europarl   MT      -       1.9M     -       2.0M     -       2.0M

(b) Monolingual datasets
Type         Name                    Task     Hours    En      De      Es      Nl
D_{s,x}      Librispeech             ASR      960      281K    -       -       -
D_s          Libri-light             Speech   3,748    579K    -       -       -
D_x / D_y    Europarl / Wiki Text    Text     -        2.3M    2.1M    2.0M    2.3M

All raw audio files are processed by Kaldi (Povey et al., 2011) to extract 80-dimensional log-Mel filterbanks stacked with 3-dimensional pitch features, using a window size of 25 ms and a step size of 10 ms. We train SentencePiece (Kudo & Richardson, 2018) models with a joint vocabulary size of 8K for each dataset. We remove samples that have more than 3000 frames for GPU efficiency. Our basic Transformer-based E2E-ST framework has similar settings to ESPnet-ST (Inaguma et al., 2020). We first downsample the speech input with 2 layers of 2D convolution of kernel size 3 and stride 2. Then a standard 12-layer Transformer with a hidden size of 2048 bridges the source and target sides. We only use 4 attention heads on each side of the Transformer, and each of them has a dimensionality of 256. We also report results for a FAT-ST big model with a hidden size of 4096 for the feedforward layers of all Transformer layers.
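For concreteness, the speech-downsampling front end described above might look roughly as follows (a minimal PyTorch sketch using the stated kernel size 3, stride 2, 83-dimensional input features, and a 256-dimensional model size; the intermediate channel count and the final linear projection are our own assumptions):

```python
import torch
import torch.nn as nn

class SpeechFrontEnd(nn.Module):
    """Two 2D convolutions (kernel 3, stride 2) that downsample the spectrogram
    in time and frequency before the Transformer encoder."""
    def __init__(self, feat_dim=83, channels=64, d_model=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        # Flatten (channels x reduced feature dim) into the model dimension.
        self.proj = nn.Linear(channels * ((feat_dim + 3) // 4), d_model)

    def forward(self, spec):                       # spec: (batch, frames, feat_dim)
        h = self.conv(spec.unsqueeze(1))            # (batch, C, frames/4, feat_dim/4)
        h = h.permute(0, 2, 1, 3).flatten(2)        # (batch, frames/4, C * feat_dim/4)
        return self.proj(h)                         # (batch, frames/4, d_model)

out = SpeechFrontEnd()(torch.randn(2, 100, 83))     # 80 fbank + 3 pitch features
print(out.shape)                                     # torch.Size([2, 25, 256])
```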
For the speech reconstruction module, we simply linearly project the outputs of the Transformer encoder to another latent space, then upsample the latent representation with a 2-layer deconvolution to match the size of the original input signal. For the random masking ratio λ, we choose 30% across all experiments, including pretraining. During inference, we do not perform any masking over the speech input. We average the last 5 checkpoints for testing. For decoding, we use beam search with a beam size of 5 and a length penalty of 0.6 for German, 0.0 for Spanish, and 0.3 for Dutch.

We compare the translation accuracy of FAT-ST against the following baselines in Table 2 and Table 3:

• ST: the vanilla speech translation system, which does not use transcriptions.

• ST + ASR MTL: an ST model with an additional ASR decoder, trained with ASR multi-task learning using the transcriptions.

• ST + ASR & MT MTL: an ST model with an additional ASR decoder and an MT encoder, trained with ASR and MT multi-task learning.

• ST + MAM: ST trained with the additional MAM loss (Chen et al., 2020), which can be formalized as $\ell_s(D_s)$ (see Fig. 2(c)).

• ST + MAM + ASR MTL: ST trained with the MAM loss and ASR multi-task learning.

• Liu et al. (2020b): an end-to-end ST system with a multimodal encoder.

• Le et al. (2020): the state-of-the-art end-to-end ST model with an extra ASR decoder.

• Cascade: a cascaded model which first transcribes the speech and then passes the result to a machine translation system.

• ST + ASR & MT pretraining: the encoder of ST is initialized from a pretrained ASR encoder and the decoder from a pretrained MT decoder.

• Pino et al. (2020): leverages additional speech data by generating pseudo-translations with a cascaded or an end-to-end speech translation model.

Table 2. BLEU comparisons on the MuST-C test set between our proposed methods and other baselines over 3 translation directions, using MuST-C (D_{s,x,y}) only (including pretraining methods). § are reported in Inaguma et al. (2020).

Pretrain Method   Models               En→De   En→Es   En→Nl   Avg.            Model Size
No Pretraining    ST                   19.64   23.68   23.01   22.11           31.25M
                  ST + ASR             21.70   26.83   25.44   24.66 (+2.55)   44.82M
                  ST + ASR & MT        21.58   26.37   26.17   24.71 (+2.60)   56.81M
                  ST + MAM             20.78   25.34   24.46   23.53 (+1.42)   33.15M
                  ST + MAM + ASR       22.41   26.89   26.49   25.26 (+3.15)   46.72M
                  Liu et al. (2020b)   22.55   -       -       -               -
                  Le et al. (2020)     23.63   28.12   27.55   26.43 (+4.32)   51.20M
                  Cascade §                                    (+4.65)         58.25M

Table 3. BLEU comparisons on the MuST-C test set for our proposed methods using additional data. D_{s,x}: Librispeech; D_{x,y}: Europarl MT; D_s: Libri-light; D_x, D_y: monolingual data from Europarl or Wiki Text. § are reported in Inaguma et al. (2020). Pino et al. (2020) use extra D'_{s,y}, which includes Librispeech (D_{s,x}) and the 35,217-hour version of the Libri-light speech data (compared with the 3,748 hours of our D_s) paired with pseudo-translations generated by their ASR and MT models; their model size is 435.0M.

Pretrain Data                                      Pretrain Method   Train Data                       Models               En→De   En→Es   En→Nl   Avg.
∅                                                  -                 D_{s,x,y}                        ST                   19.64   23.68   23.01   22.11
                                                                                                      Cascade §
D_{s,x,y} ∪ D_{s,x} ∪ D_{x,y}                      ASR & MT          D_{s,x,y}                        ST                   22.20   27.16   26.15   25.17 (+3.06)
                                                                                                      ST + ASR & MT        22.73   27.99   27.12   25.95 (+3.84)
                                                   FAT-MLM           D_{s,x,y}                        FAT-ST (base)        23.98   28.95   28.08   27.00 (+4.89)
                                                                                                      FAT-ST (big)         24.34   29.41   28.86   27.54 (+5.43)
D_{s,x,y} ∪ D_{s,x} ∪ D_{x,y} ∪ D_s ∪ D_x ∪ D_y    FAT-MLM           D_{s,x,y}                        FAT-ST (base)        24.02   29.25   28.28   27.18 (+5.07)
                                                                                                      FAT-ST (big)         24.58   30.10   29.36   28.01 (+5.90)
D_{s,x,y} ∪ D_{s,x} ∪ D_{x,y} ∪ D_s ∪ D_x ∪ D_y    FAT-MLM           D_{s,x,y} ∪ D_{s,x} ∪ D_{x,y}    FAT-ST (base)        23.91   29.01   28.18   27.03 (+4.92)
                                                                                                      FAT-ST (big)
∅                                                  -                 D_{s,x,y} + D'_{s,y}             Pino et al. (2020)   25.2    -       -       -
Table 4. Model sizes of different models.

5.3.1. Model Size of Pretraining Models

Table 4 shows the number of parameters of the different pretraining models. We can see that our FAT-MLM base model is slightly larger than the MAM pretraining model, and the FAT-MLM big model is much larger than the base model.

5.3.2. Training with D_{s,x,y}

In Table 2, with no pretraining, we can see that our proposed FAT-ST base model achieves the best results except for Le et al. (2020) and the cascaded model; however, our base model has far fewer parameters than both of them. Models with ASR or MT MTL and Liu et al. (2020b) all use the transcription data in the MuST-C dataset but show worse performance, so our model uses transcription data more efficiently. Similar to other open-source ST implementation results on MuST-C, our implementation of ST + ASR & MT MTL is worse than ST + ASR.

We also compare the performance of models finetuned from different pretraining models. When pretrained on MuST-C, FAT-ST (base) improves by 0.85 BLEU when finetuned from FAT-MLM, while its performance drops when finetuned from MAM. Meanwhile, our proposed methods achieve much better performance than the ASR & MT pretraining baselines. We also note that our FAT-ST base model for the first time achieves performance similar to the Cascade baselines in these three translation directions of MuST-C, while compared with the cascaded model, our base model is much smaller in size and faster in inference (see Fig. 7).

5.3.3. Pretraining with Additional Data

Table 3 shows that FAT-MLM can further improve FAT-ST by simply adding speech recognition data D_{s,x} (Librispeech), text machine translation data D_{x,y} (Europarl), and even speech-only data D_s (Libri-light) and monolingual text data D_x ∪ D_y. This shows the good representation learning ability of our proposed FAT-MLM models. We can see that, with larger data, the performance of our big model increases much faster than that of the base model; this is because the number of parameters of the base model is too limited to learn from such big data.

Table 5. Comparison on the auxiliary MT task between MT baselines and our proposed methods. § are reported in Inaguma et al. (2020).

Train Data                      Pretrain Data                                      Models          En→De   En→Es   En→Nl
D_{s,x,y}                       No pretraining                                     MT §
                                D_{s,x,y}                                          FAT-ST (base)   27.24   31.98   31.27
                                                                                   FAT-ST (big)    26.92   32.29   31.48
                                D_{s,x,y} ∪ D_{s,x} ∪ D_{x,y}                      FAT-ST (base)   27.43   32.38   32.44
                                                                                   FAT-ST (big)    27.60   32.95   32.37
                                D_{s,x,y} ∪ D_{s,x} ∪ D_{x,y} ∪ D_s ∪ D_x ∪ D_y    FAT-ST (base)   27.63   32.75   32.52
                                                                                   FAT-ST (big)    28.13   33.39   32.72
D_{s,x,y} ∪ D_{s,x} ∪ D_{x,y}   D_{s,x,y} ∪ D_{s,x} ∪ D_{x,y} ∪ D_s ∪ D_x ∪ D_y    FAT-ST (base)   27.89   32.96   32.43
                                                                                   FAT-ST (big)    28.80   34.28   34.22

5.3.4. Finetuning with Additional Data

The last part of Table 3 shows that FAT-ST can be further improved by learning from extra speech recognition and machine translation data.
This is promising because speech translation data is very limited compared with the much more abundant speech recognition and machine translation data. Different from Pino et al. (2020), who propose to leverage additional speech data by generating pseudo-translations, our method does not use any pseudo-labels. Our best model outperforms their result on En→De while using a much smaller model and much less speech data.

Table 6. Ablation study. Here, "hierarchical Transformer" means the model only shares the 6 layers of the Transformer encoder for acoustic feature input and text feature input.

Model                          En→De
FAT-ST with FAT-MLM (base)     23.68
 - FAT-MLM decoder init.       23.20
 - FAT-MLM encoder init.       22.70
 - CTC loss                    22.30
 - Hierarchical Transformer    22.07
 - FAT-MLM loss                20.64
 - MT loss                     19.64

5.3.5. Performance of the Auxiliary MT Task

Table 5 shows the translation quality of the auxiliary MT task of FAT-ST. Although our models trained only on MuST-C are worse than the MT baseline, by using a FAT-MLM trained with more data, our proposed methods can easily outperform the MT baseline. Note that these models' parameters are tuned to optimize the speech translation task, and MT is just an auxiliary task.

5.3.6. Ablation Study

Table 6 shows an ablation study of our proposed method. We can see that all the components contribute to the final performance.

Figure 7. Decoding speed comparison between the Cascade model (including its ASR) and FAT-ST.

5.3.7. Decoding Speed

Fig. 7 shows the decoding speed comparison between the Cascade model and our proposed FAT-ST. Our proposed FAT-ST model decodes substantially faster than the Cascade system, which needs to wait for the speech recognition module to finish before starting to translate. The decoding time of FAT-ST (big) is almost the same as that of FAT-ST (base) because we only increase the feedforward networks in the Transformers.

6. Conclusion

In this paper, we propose the Fused Acoustic and Text Masked Language Model (FAT-MLM), which learns a unified representation for text and speech from any data that combines speech and text. We further extend this framework to a sequence-to-sequence speech translation model, which enables learning from speech recognition and text-based machine translation data for the first time. Our results show significant improvements on three translation directions of the MuST-C dataset and outperform the cascaded baseline.

References

Baevski, A., Zhou, H., Mohamed, A., and Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. In NeurIPS, 2020.

Bérard, A., Besacier, L., Kocabiyikoglu, A. C., and Pietquin, O. End-to-end automatic speech translation of audiobooks. In ICASSP, pp. 6224–6228. IEEE, 2018.

Chen, J., Ma, M., Zheng, R., and Huang, L. MAM: Masked acoustic modeling for end-to-end speech-to-text translation. arXiv preprint arXiv:2010.11445, 2020.

Chuang, Y.-S., Liu, C.-L., Lee, H.-y., and Lee, L.-s. SpeechBERT: An audio-and-text jointly learned language model for end-to-end spoken question answering. arXiv preprint arXiv:1910.11559, 2019.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.

Di Gangi, M. A., Cattoni, R., Bentivogli, L., Negri, M., and Turchi, M. MuST-C: a Multilingual Speech Translation Corpus. In NAACL, 2019.
Graves, A., Fernández, S., Gomez, F., and Schmidhuber, J. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pp. 369–376, 2006.

Howard, J. and Ruder, S. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146, 2018.

Inaguma, H., Kiyono, S., Duh, K., Karita, S., Soplin, N. E. Y., Hayashi, T., and Watanabe, S. ESPnet-ST: All-in-one speech translation toolkit. arXiv preprint arXiv:2004.10234, 2020.

Jia, Y., Johnson, M., Macherey, W., Weiss, R. J., Cao, Y., Chiu, C.-C., Ari, N., Laurenzo, S., and Wu, Y. Leveraging weakly supervised data to improve end-to-end speech-to-text translation. In ICASSP 2019, pp. 7180–7184. IEEE, 2019.

Kahn, J., Rivière, M., Zheng, W., Kharitonov, E., Xu, Q., Mazaré, P. E., Karadayi, J., Liptchinsky, V., Collobert, R., Fuegen, C., Likhomanenko, T., Synnaeve, G., Joulin, A., Mohamed, A., and Dupoux, E. Libri-light: A benchmark for ASR with limited or no supervision. In ICASSP 2020, pp. 7669–7673, 2020. https://github.com/facebookresearch/libri-light

Kocabiyikoglu, A. C., Besacier, L., and Kraif, O. Augmenting Librispeech with French translations: A multimodal corpus for direct speech translation evaluation. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018.

Koehn, P. Europarl: A parallel corpus for statistical machine translation. In MT Summit, volume 5, pp. 79–86, 2005.

Kudo, T. and Richardson, J. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2018.

Lample, G. and Conneau, A. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291, 2019.

Le, H., Pino, J., Wang, C., Gu, J., Schwab, D., and Besacier, L. Dual-decoder Transformer for joint automatic speech recognition and multilingual speech translation. In Proceedings of the 28th International Conference on Computational Linguistics, pp. 3520–3533, 2020.

Liu, A. T., Li, S.-W., and Lee, H.-y. TERA: Self-supervised learning of Transformer encoder representation for speech. arXiv preprint arXiv:2007.06028, 2020a.

Liu, Y., Zhu, J., Zhang, J., and Zong, C. Bridging the modality gap for speech-to-text translation. arXiv preprint arXiv:2010.14920, 2020b.

Panayotov, V., Chen, G., Povey, D., and Khudanpur, S. Librispeech: An ASR corpus based on public domain audio books. In ICASSP, pp. 5206–5210. IEEE, 2015.

Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018.

Pino, J., Xu, Q., Ma, X., Dousti, M. J., and Tang, Y. Self-training for end-to-end speech translation. In Proc. Interspeech 2020, pp. 1476–1480, 2020.

Povey, D., Ghoshal, A., Boulianne, G., Goel, N., Hannemann, M., Qian, Y., Schwarz, P., and Stemmer, G. The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, 2011.

Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. Improving language understanding by generative pre-training. 2018.
Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.

Sperber, M., Neubig, G., Niehues, J., and Waibel, A. Neural lattice-to-sequence models for uncertain inputs. arXiv preprint arXiv:1704.00559, 2017.

Sun, Y., Wang, S., Li, Y., Feng, S., Chen, X., Zhang, H., Tian, X., Zhu, D., Tian, H., and Wu, H. ERNIE: Enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223, 2019.

Taylor, W. L. "Cloze procedure": A new tool for measuring readability. Journalism Quarterly, 30(4):415–433, 1953.

Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.

Weiss, R. J., Chorowski, J., Jaitly, N., Wu, Y., and Chen, Z. Sequence-to-sequence models can directly translate foreign speech. In Interspeech, 2017.