CTC-based Compression for Direct Speech Translation
Marco Gaido†,∗, Mauro Cettolo†, Matteo Negri†, Marco Turchi†
†Fondazione Bruno Kessler   ∗University of Trento
{mgaido,cettolo,negri,turchi}@fbk.eu

Abstract
Previous studies demonstrated that a dynamic phone-informed compression of the input audio is beneficial for speech translation (ST). However, they required a dedicated model for phone recognition and did not test this solution for direct ST, in which a single model translates the input audio into the target language without intermediate representations. In this work, we propose the first method able to perform a dynamic compression of the input in direct ST models. In particular, we exploit the Connectionist Temporal Classification (CTC) to compress the input sequence according to its phonetic characteristics. Our experiments demonstrate that our solution brings a 1.3-1.5 BLEU improvement over a strong baseline on two language pairs (English-Italian and English-German), contextually reducing the memory footprint by more than 10%.
Speech translation (ST) is the process that converts utterances in one language into text in another language. Traditional approaches to ST consist of separate modules, each dedicated to an easier sub-task, which are eventually integrated in a so-called cascade architecture (Stentiford and Steer, 1988; Waibel et al., 1991). Usually, its main components are an automatic speech recognition (ASR) model - which generates the transcripts from the audio - and a machine translation (MT) model - which translates the transcripts into the target language. A newer approach is direct
ST, in which a single model performs the whole task without intermediate representations (Bérard et al., 2016; Weiss et al., 2017). The main advantages of direct ST systems are: i) the access to information not present in the text (e.g. prosody, vocal characteristics of the speaker) during the translation phase, ii) a reduced latency, iii) a simpler and easier to manage architecture (only one model has to be maintained), which iv) avoids error propagation across components.

In both paradigms (cascade and direct), the audio is commonly represented as a sequence of vectors obtained with a Mel filter bank. These vectors are collected with a high frequency, typically one every 10 ms. The resulting sequences are much longer than the corresponding textual ones (usually by a factor of ~).

Phone-informed compression methods proposed so far rely on a dedicated model for phone recognition and on intermediate representations. So, they: i) are affected by error propagation (Salesky and Black 2020 show in fact that lower quality in phone recognition significantly degrades final ST performance), ii) have higher latency and iii) a more complex architecture. A direct model with phone-based multi-task training was introduced by Jia et al. (2019) for speech-to-speech translation, but they neither compared with a training using transcripts nor investigated dynamic compression.

In this paper, we explore the usage of phones and dynamic content-based input compression for direct ST (and ASR). Our goal is an input reduction that, by limiting the amount of redundant/useless information, yields better performance and lower memory consumption at the same time. To this aim, we propose to exploit the Connectionist Temporal Classification (CTC) (Graves et al., 2006) to add phone prediction in a multi-task training and compress the sequence accordingly. To disentangle the contribution of the introduction of phone recognition and of the compression based on it, we compare against similar trainings leveraging transcripts instead of phones. Our results show that phone-based multi-task training with sequence compression improves over a strong baseline by up to 1.5 BLEU points on two language pairs (English-German and English-Italian), with a memory footprint reduction of at least 10%.

The CTC algorithm is usually employed for training a model to predict an output sequence of variable length that is shorter than the input one. This is the case of speech/phone recognition, as the input is a long sequence of audio samples, while the output is the sequence of uttered symbols (e.g. phones, sub-words), which is significantly shorter. In particular, for each time step, the CTC produces a probability distribution over the possible target labels augmented with a dedicated
blank symbol, meaning that no label is emitted at that time step. In ST models, the CTC is often added as an auxiliary loss computed against the transcripts on the encoder output to favour model convergence.

Figure 1: Encoder architecture with CTC loss.

Generally, the CTC loss can be added to the output of any encoder layer, as in Figure 1, where the hyper-parameter N_CTC indicates the number of the layer at which the CTC is computed. Formally, the final loss function is:

λ = CTC(E_{N_CTC}) + CE(D_{N_D})    (1)

where E_x is the output of the x-th encoder layer, D_{N_D} is the decoder output, CTC is the CTC function, and CE is the label smoothed cross entropy. If N_CTC is equal to the number of encoder layers (N_E), the CTC input is the encoder output. We consider this solution as our baseline and we also test it with phones as target.

As shown in Figure 1, we use a Transformer as model, whose encoder layers are preceded by two 2D convolutional layers that reduce the input size by a factor of 4. Therefore, the CTC produces a prediction every 4 input time frames. The sequence length reduction is necessary both because it makes training possible (otherwise out-of-memory errors would occur) and to have a fair comparison with modern state-of-the-art models. A logarithmic distance penalty (Di Gangi et al., 2019) is added to all the Transformer encoder layers.

Figure 2: Encoder architecture with CTC compression.

Our proposed architecture is represented in Figure 2. The difference with the baseline is the introduction of an additional block (Collapse same predictions) that exploits the CTC predictions to compress the input elements (vectors). Hence, in this case the CTC does not only help model convergence, but it also defines variable-length segments representing the same content. So, dense audio portions can be given more importance, while redundant/uninformative vectors can be compressed. This allows the following encoder layers and the decoder to attend to useful information without being "distracted" by noisy elements. The architecture is a direct ST solution, as there is a single model whose parameters are optimized together without intermediate representations. At inference time, the only input is the audio and the model produces the translation into the target language (contextually generating the transcripts/phones with the CTC).

We compare three techniques to compress the consecutive vectors with the same CTC prediction (a code sketch of all three follows the list):
• Average. The vectors to be collapsed together are averaged. As there is only a linear layer between the CTC inputs and its predictions, the vectors in each group are likely to be similar, so the compression should not remove much information.

• Weighted. The vectors are averaged, but the weight of each vector depends on the confidence (i.e. the predicted probability) of the CTC prediction. This solution is meant to give less importance to vectors whose phone/transcript is not certain.

• Softmax. In this case, the weight of each vector is obtained by computing the softmax of the CTC predicted probabilities. The idea is to propagate information (nearly) only through a single input vector (the most confident one) for each group.
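To make the compression concrete, the sketch below shows one possible PyTorch implementation of the Collapse same predictions block with the three policies above. It is an illustrative reimplementation based only on the description in this section, not the authors' code: the per-sentence tensor layout, the function name and the grouping via run boundaries are our assumptions.

```python
import torch
import torch.nn.functional as F


def ctc_compress(states: torch.Tensor, ctc_logits: torch.Tensor, mode: str = "avg") -> torch.Tensor:
    """Collapse consecutive frames that share the same CTC prediction.

    states:     (T, C) encoder states of one sentence (batching omitted for clarity).
    ctc_logits: (T, V) CTC logits over the target vocabulary (phones/characters + blank).
    mode:       "avg", "weighted" or "softmax", as described above.
    Returns:    (T', C) compressed states, with T' <= T.
    """
    probs = F.softmax(ctc_logits, dim=-1)      # per-frame CTC distributions
    best_probs, preds = probs.max(dim=-1)      # confidence and label of the best prediction

    # Mark the frames where a new run of identical predictions starts.
    is_boundary = torch.cat(
        [torch.ones(1, dtype=torch.bool, device=preds.device), preds[1:] != preds[:-1]]
    )
    group_ids = is_boundary.long().cumsum(dim=0) - 1   # run index of every frame

    collapsed = []
    for g in range(int(group_ids.max()) + 1):
        idx = (group_ids == g).nonzero(as_tuple=True)[0]
        group = states[idx]                    # (len_g, C) vectors with the same prediction
        if mode == "avg":
            vec = group.mean(dim=0)            # plain average
        elif mode == "weighted":
            w = best_probs[idx]                # CTC confidence of each frame
            vec = (group * w.unsqueeze(-1)).sum(dim=0) / w.sum()
        elif mode == "softmax":
            # Softmax over the confidences: (nearly) only the most confident frame survives.
            w = F.softmax(best_probs[idx], dim=0)
            vec = (group * w.unsqueeze(-1)).sum(dim=0)
        else:
            raise ValueError(f"unknown compression mode: {mode}")
        collapsed.append(vec)
    return torch.stack(collapsed)
```

A module of this kind would sit right after the encoder layer on which the CTC is computed, so that all subsequent layers operate on the shortened sequence. Note that here consecutive blank predictions are collapsed like any other label, which follows from the description above but is an assumption about the original implementation.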
We experiment with MuST-C (Cattoni et al., 2021), a multilingual ST corpus built from TED talks. We focus on the English-Italian (465 hours) and English-German (408 hours) sections. For each set (train, validation, test), it contains the audio files, the transcripts, the translations and a YAML file with the start time and duration of the segments.

In addition, we extract the phones using Gentle (https://lowerquality.com/gentle/). Besides aligning the transcripts with the audio, Gentle returns the start and end time for each recognized word, together with the corresponding phones. For the words not recognized in the audio, Gentle does not provide the phones, so we look up their phonetic transcription in the VoxForge dictionary. For each sample in the corpus, we rely on the YAML file and the alignments generated by Gentle to get all the words (and phones) belonging to it. The phones have a suffix indicating the position in a word (at the end, at the beginning, in the middle or standalone). We also generated a version without the suffix (we refer to it as PH W/O POS in the rest of the paper). The resulting dictionaries contain respectively 144 and 48 symbols.
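As an illustration of how the PH W/O POS variant can be derived, the snippet below strips position markers from a phone sequence. The exact suffix format is an assumption on our part: Gentle is Kaldi-based, and Kaldi position-dependent phones typically carry _B/_I/_E/_S markers; the function names are ours.

```python
# Hypothetical example: derive PH W/O POS targets from position-tagged phones.
# The "_B/_I/_E/_S" (begin/inside/end/singleton) suffix convention is an assumption
# about the output of the Kaldi-based Gentle aligner, not a detail stated in the paper.
POSITION_SUFFIXES = ("_B", "_I", "_E", "_S")


def strip_position(phone: str) -> str:
    """Remove the word-position suffix from a single phone, if present."""
    for suffix in POSITION_SUFFIXES:
        if phone.endswith(suffix):
            return phone[: -len(suffix)]
    return phone


def to_ph_wo_pos(phone_sequence: str) -> str:
    """Map a space-separated phone string to its position-free version.

    e.g. "hh_B eh_I l_I ow_E" -> "hh eh l ow"
    """
    return " ".join(strip_position(p) for p in phone_sequence.split())
```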
Our Transformer layers have 8 attention heads, 512 features for the attention and 2,048 hidden units in the FFN. We set a 0.2 dropout and include SpecAugment (Park et al., 2019) in our trainings. We optimize label smoothed cross entropy (Szegedy et al., 2016) with 0.1 smoothing factor using Adam (Kingma and Ba, 2015) (betas (0.9, 0.98)). The learning rate increases linearly from 3e-4 to 5e-3 for 4,000 updates, then decays with the inverse square root (a sketch of this schedule is given below). As we train on 8 GPUs with mini-batches of 8 sentences and we update the model every 8 steps, the resulting batch size is 512. The audio is pre-processed performing speaker normalization and extracting 40-channel Mel filter-bank features per frame. The text is tokenized into sub-words with 1,000 BPE merge rules (Sennrich et al., 2016).

As having more encoder layers than decoder layers has been shown to be beneficial (Potapczyk and Przybysz, 2020; Gaido et al., 2020), we use 8 Transformer encoder layers and 6 decoder layers for ASR, and 11 encoder and 4 decoder layers for ST, unless stated otherwise. We train until the model does not improve on the validation set for 5 epochs and we average the last 5 checkpoints. Trainings were performed on K80 GPUs and lasted ~48 hours (~50 minutes per epoch). Our implementation is based on Fairseq (Ott et al., 2019).

We evaluate performance with WER for ASR and with BLEU (Papineni et al., 2002) and SacreBLEU (Post, 2018) for ST.
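For reference, this is a minimal sketch of the learning rate schedule described above (linear warm-up from 3e-4 to 5e-3 over 4,000 updates, then inverse square root decay), together with the effective batch size computation. The function is ours; Fairseq provides an equivalent built-in scheduler.

```python
def learning_rate(step: int,
                  init_lr: float = 3e-4,
                  peak_lr: float = 5e-3,
                  warmup_updates: int = 4000) -> float:
    """Linear warm-up to peak_lr, then inverse square root decay."""
    if step < warmup_updates:
        return init_lr + (peak_lr - init_lr) * step / warmup_updates
    return peak_lr * (warmup_updates / step) ** 0.5


# Effective batch size: 8 GPUs x 8 sentences per mini-batch x 8 accumulated updates = 512.
effective_batch_size = 8 * 8 * 8
assert effective_batch_size == 512
```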
Table 1: Results on ASR using the CTC loss with transcripts and phones as target. AVG indicates that the sequence is compressed by averaging the vectors. [Only the column headers (WER (↓), RAM (MB)) and the first row (Baseline - 8L EN: 16.0 WER, 6929 MB (1.00)) are recoverable from the extracted table; the remaining rows are missing.]
We first tested whether ASR benefits from the usage of phones and sequence compression. Table 1 shows that having phones instead of English transcripts (Baseline - 8L EN) as target of the CTC loss (8L PH) without compression is beneficial. When compressing the sequence, there is little difference according to the target used (8L PH AVG, 8L PH W/O POS. AVG, 8L EN AVG). However, the compression causes a 0.3-0.5 WER performance degradation and a 12-5% saving of RAM. Moving the compression to previous layers (…, 2L PH AVG) further decreases the output quality and the RAM usage. We can conclude that compressing the input sequence harms ASR performance, but might be useful if RAM usage is critical and should be traded off with performance.
In early experiments, we pre-trained the first 8 layers of the ST encoder with those of the ASR model, adding three adapter layers (Bahar et al., 2019). We realized that ASR pre-training was not useful (probably because ASR and ST data are the same), so we report results without pre-training. Our code is available at https://github.com/mgaido91/FBK-fairseq-ST/tree/eacl2021. To be comparable with previous works, the SacreBLEU version signature is: BLEU+c.mixed+.

As we want to ensure that our results are not biased by a poor baseline, we compare with (Di Gangi et al., 2020), which uses the same framework and similar settings. As shown in Table 2, our strong baseline (8L EN) outperforms (Di Gangi et al., 2020) by 2 BLEU on en-it and 1.3 BLEU on en-de. We acknowledge that better results have been published in a contemporaneous paper by Inaguma et al. (2020). Besides the contemporaneity issue, our results are not comparable with theirs, as they use: i) a different architecture built on ESPnet-ST (a newer framework that, alone, outperforms Fairseq), ii) higher dimensional input features (83 vs 40 dimensions), iii) data augmentation, and iv) pre-training techniques.

As in ASR, replacing the transcripts with phones as target for the CTC loss (8L PH) further improves respectively by 0.5 and 1.2 BLEU. We first explore the introduction of the compression at different layers. Adding it to the 8th layer (8L PH AVG) enhances the translation quality by 0.6 (en-it) and 0.2 (en-de) BLEU, with the improvement on en-it being statistically significant over the version without CTC compression. Moving it to previous layers (4L PH AVG, 2L PH AVG) causes performance drops, suggesting that many layers are needed to extract useful phonetic information.

Then, we compare the different compression policies: AVG outperforms (or matches) WEIGHTED and SOFTMAX on both languages. Indeed, the small weight these two methods assign to some vectors likely causes an information loss and prevents a proper gradient propagation for the corresponding input elements. Finally, we experiment with different CTC targets, but both the phones without the position suffix (8L PH W/O POS. AVG) and the transcripts (8L EN AVG) lead to lower scores.

The different results between ASR and ST can be explained by the nature of the two tasks: extracting content knowledge is critical for ST but not for ASR, in which a compression can hide details that are not relevant to extrapolate meaning, but needed to generate precise transcripts. The RAM savings are higher than in ASR, as there are 3 more layers. On the 8th layer, they range from 11% to 23% for en-it and from 16% to 22% for en-de. By moving the compression to previous layers, we can trade performance for RAM requirements, saving up to 50% of the memory.

We also tested whether we can use the saved RAM to add more layers and improve the translation quality. We added 3 encoder and 2 decoder layers: this (8L PH AVG (14+6L)) results in small gains (0.2 on en-it and 0.1 on en-de), but the additional memory required is also small (the RAM usage is still 10-16% lower than the baseline). The improvements are statistically significant with respect to the models without compression (8L PH) on both language pairs. When training on more data, the benefit of having deeper networks might be higher, though, and this solution allows increasing the number of layers without a prohibitive memory footprint. We leave this investigation for future works, as experiments on larger training corpora are out of the scope of this paper.

Table 2: Results using the CTC loss with transcripts and phones as target. AVG, WEIGHTED and SOFTMAX indicate the compression method. If none is specified, no compression is performed. The symbol "*" indicates improvements that are statistically significant with respect to the baseline. "†" indicates statistically significant gains with respect to 8L PH. Statistical significance is computed according to (Koehn, 2004) with α = 0. Scores in italic indicate the best models among those with equal number of layers. [Only the column headers (BLEU (↑), SacreBLEU (↑), RAM (MB) for en-it and en-de) and the first rows are recoverable from the extracted table: (Di Gangi et al., 2020): 20.1 / - / - and 19.1 / - / -; Baseline - 8L EN: 22.1 / 21.8 / 9624 (1.00) and 20.4 / 20.5 / 9166 (1.00); 8L PH: 22.6 BLEU on en-it. The remaining rows are garbled.]
As researchers' focus is shifting from cascade to direct solutions due to the advantages of the latter, we proposed a technique of dynamic sequence-length reduction for direct ST. We showed that averaging the vectors corresponding to the same phone prediction according to the CTC improves the translation quality and reduces the memory footprint, allowing for training deeper models. Our best model outperforms a strong baseline, which uses transcripts in a multi-task training, by 1.3 (en-it) and 1.5 (en-de) BLEU, reducing memory usage by 10-16%.
Acknowledgments
This work is part of the "End-to-end Spoken Language Translation in Rich Data Conditions" project (https://ict.fbk.eu/units-hlt-mt-e2eslt/), which is financially supported by an Amazon AWS ML Grant. The authors also wish to thank Mattia Antonino Di Gangi for the insightful discussions on this work.

References
Parnia Bahar, Tobias Bieschke, and Hermann Ney. 2019. A Comparative Study on End-to-end Speech to Text Translation. In Proceedings of the International Workshop on Automatic Speech Recognition and Understanding (ASRU), pages 792-799, Sentosa, Singapore.

Yoshua Bengio, Paolo Frasconi, and Patrice Y. Simard. 1993. The problem of learning long-term dependencies in recurrent networks. In IEEE International Conference on Neural Networks, pages 1183-1188, vol. 3.

Alexandre Bérard, Laurent Besacier, Ali Can Kocabiyikoglu, and Olivier Pietquin. 2018. End-to-End Automatic Speech Translation of Audiobooks. In Proceedings of ICASSP 2018 - IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, Alberta, Canada.

Alexandre Bérard, Olivier Pietquin, Christophe Servan, and Laurent Besacier. 2016. Listen and Translate: A Proof of Concept for End-to-End Speech-to-Text Translation. In NIPS Workshop on end-to-end learning for speech and audio processing, Barcelona, Spain.

Roldano Cattoni, Mattia A. Di Gangi, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2021. MuST-C: A multilingual corpus for end-to-end speech translation. Computer Speech & Language, 66:101155.

Mattia A. Di Gangi, Marco Gaido, Matteo Negri, and Marco Turchi. 2020. On Target Segmentation for Direct Speech Translation. In Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (AMTA 2020), pages 137-150, Virtual. Association for Machine Translation in the Americas.

Mattia A. Di Gangi, Matteo Negri, Roldano Cattoni, Roberto Dessi, and Marco Turchi. 2019. Enhancing Transformer for End-to-end Speech-to-Text Translation. In Proceedings of Machine Translation Summit XVII Volume 1: Research Track, pages 21-31, Dublin, Ireland. European Association for Machine Translation.

Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179-211.

Marco Gaido, Mattia A. Di Gangi, Matteo Negri, and Marco Turchi. 2020. End-to-End Speech-Translation with Knowledge Distillation: FBK@IWSLT2020. In Proceedings of the 17th International Conference on Spoken Language Translation, pages 80-88, Online. Association for Computational Linguistics.

Alex Graves, Santiago Fernández, Faustino J. Gomez, and Jürgen Schmidhuber. 2006. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. In Proceedings of the 23rd International Conference on Machine Learning (ICML), pages 369-376, Pittsburgh, Pennsylvania.

William N. Havard, Jean-Pierre Chevrot, and Laurent Besacier. 2020. Catplayinginthesnow: Impact of Prior Segmentation on a Model of Visually Grounded Speech.

Hirofumi Inaguma, Shun Kiyono, Kevin Duh, Shigeki Karita, Nelson Yalta, Tomoki Hayashi, and Shinji Watanabe. 2020. ESPnet-ST: All-in-One Speech Translation Toolkit. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 302-311, Online. Association for Computational Linguistics.

Ye Jia, Ron J. Weiss, Fadi Biadsy, Wolfgang Macherey, Melvin Johnson, Zhifeng Chen, and Yonghui Wu. 2019. Direct Speech-to-Speech Translation with a Sequence-to-Sequence Model. In Proceedings of Interspeech 2019, pages 1123-1127, Graz, Austria.

Suyoun Kim, Takaaki Hori, and Shinji Watanabe. 2017. Joint CTC-attention based end-to-end speech recognition using multi-task learning. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4835-4839, New Orleans, Louisiana.

Diederik Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, California.

Philipp Koehn. 2004. Statistical Significance Tests for Machine Translation Evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 388-395, Barcelona, Spain. Association for Computational Linguistics.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A Fast, Extensible Toolkit for Sequence Modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48-53, Minneapolis, Minnesota. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311-318, Philadelphia, Pennsylvania. Association for Computational Linguistics.

Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le. 2019. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. In Proceedings of Interspeech 2019, pages 2613-2617, Graz, Austria.

Matt Post. 2018. A Call for Clarity in Reporting BLEU Scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186-191, Brussels, Belgium. Association for Computational Linguistics.

Tomasz Potapczyk and Pawel Przybysz. 2020. SRPOL's system for the IWSLT 2020 end-to-end speech translation task. In Proceedings of the 17th International Conference on Spoken Language Translation, pages 89-94, Online. Association for Computational Linguistics.

Haşim Sak, Andrew Senior, Kanishka Rao, and Françoise Beaufays. 2015. Fast and Accurate Recurrent Neural Network Acoustic Models for Speech Recognition. In Proceedings of Interspeech 2015, Dresden, Germany.

Elizabeth Salesky and Alan W. Black. 2020. Phone Features Improve Speech Translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2388-2397, Online. Association for Computational Linguistics.

Elizabeth Salesky, Matthias Sperber, and Alan W. Black. 2019. Exploring Phoneme-Level Speech Representations for End-to-End Speech Translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1835-1841, Florence, Italy. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715-1725, Berlin, Germany. Association for Computational Linguistics.

Frederick W. M. Stentiford and Martin G. Steer. 1988. Machine Translation of Speech. British Telecom Technology Journal, 6(2):116-122.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818-2826, Las Vegas, Nevada, United States.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All You Need. In Proceedings of Advances in Neural Information Processing Systems 30 (NIPS), pages 5998-6008, Long Beach, California.

Alex Waibel, Ajay N. Jain, Arthur E. McNair, Hiroaki Saito, Alexander G. Hauptmann, and Joe Tebelskis. 1991. JANUS: A Speech-to-Speech Translation System Using Connectionist and Symbolic Processing Strategies. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, ICASSP 1991, pages 793-796, Toronto, Canada.

Ron J. Weiss, Jan Chorowski, Navdeep Jaitly, Yonghui Wu, and Zhifeng Chen. 2017. Sequence-to-Sequence Models Can Directly Translate Foreign Speech. In Proceedings of Interspeech 2017, Stockholm, Sweden.