Adaptive Feature Selection for End-to-End Speech Translation
Biao Zhang, Ivan Titov, Barry Haddow, Rico Sennrich
School of Informatics, University of Edinburgh
ILLC, University of Amsterdam
Department of Computational Linguistics, University of Zurich
[email protected], {ititov,bhaddow}@inf.ed.ac.uk, [email protected]
Abstract

Information in speech signals is not evenly distributed, making it an additional challenge for end-to-end (E2E) speech translation (ST) to learn to focus on informative features. In this paper, we propose adaptive feature selection (AFS) for encoder-decoder based E2E ST. We first pre-train an ASR encoder and apply AFS to dynamically estimate the importance of each encoded speech feature to ASR. An ST encoder, stacked on top of the ASR encoder, then receives the filtered features from the (frozen) ASR encoder. We take L0Drop (Zhang et al., 2020) as the backbone for AFS, and adapt it to sparsify speech features with respect to both temporal and feature dimensions. Results on LibriSpeech En-Fr and MuST-C benchmarks show that AFS facilitates the learning of ST by pruning out ∼84% of temporal features, yielding an average translation gain of ∼1.3-1.6 BLEU and a decoding speedup of ∼1.4×. In particular, AFS reduces the performance gap compared to the cascade baseline, and outperforms it on LibriSpeech En-Fr with a BLEU score of 18.56 (without data augmentation).

Introduction

End-to-end (E2E) speech translation (ST), a paradigm that directly maps audio to a foreign text, has been gaining popularity recently (Duong et al., 2016; Bérard et al., 2016; Bansal et al., 2018; Di Gangi et al., 2019; Wang et al., 2019). Based on the attentional encoder-decoder framework (Bahdanau et al., 2015), it optimizes model parameters under direct translation supervision. This end-to-end paradigm avoids the problem of error propagation that is inherent in cascade models, where an automatic speech recognition (ASR) model and a machine translation (MT) model are chained together. We release our source code at https://github.com/bzhangGo/zero.

Figure 1: Example illustrating our motivation. We plot the amplitude and frequency spectrum of an audio segment, paired with its time-aligned words ("play is not just child's games") and phonemes. Information inside an audio stream is not uniformly distributed. We propose to dynamically capture speech features corresponding to informative signals (red rectangles) to improve ST.

Nonetheless, previous work still reports that E2E ST delivers inferior performance compared to cascade methods (Niehues et al., 2019). We study one reason for the difficulty of training E2E ST models, namely the uneven spread of information in the speech signal, as visualized in Figure 1, and the consequent difficulty of extracting informative features. Features corresponding to uninformative signals, such as pauses or noise, increase the input length and bring unmanageable noise into ST. This increases the difficulty of learning (Zhang et al., 2019b; Na et al., 2019) and reduces translation performance.

In this paper, we propose adaptive feature selection (AFS) for ST to explicitly eliminate uninformative features. Figure 2 shows the overall architecture. We employ a pretrained ASR encoder to induce contextual speech features, followed by an ST encoder bridging the gap between the speech and translation modalities. AFS is inserted in-between them to select a subset of features for ST encoding (see the red rectangles in Figure 1). To ensure that the selected features are well-aligned to transcriptions, we pretrain AFS on ASR. AFS estimates the informativeness of each feature through a parameterized gate, and encourages the dropping of features (pushing the gate to 0) that contribute little to ASR.
Figure 2: Overview of our E2E ST model. AFS is inserted between the ST encoder (blue) and a pretrained ASR encoder (gray) to filter speech features for translation. We pretrain AFS jointly with ASR and freeze it during ST training.

An underlying assumption is that features irrelevant for ASR are also unimportant for ST. We base AFS on L0Drop (Zhang et al., 2020), a sparsity-inducing method for encoder-decoder models, and extend it to sparsify speech features. The acoustic input of speech signals involves two dimensions, temporal and feature, where the latter describes the spectrum extracted from time frames. Accordingly, we adapt L0Drop to sparsify encoder states along the temporal and feature dimensions using different gating networks. In contrast to Zhang et al. (2020), who focus on efficiency and report a trade-off between sparsity and quality for MT and summarization, we find that sparsity also improves translation quality for ST.

We conduct extensive experiments with Transformer (Vaswani et al., 2017) on the LibriSpeech En-Fr and MuST-C speech translation tasks, covering 8 different language pairs. Results show that AFS retains only about 16% of the temporal speech features, revealing heavy redundancy in speech encodings and yielding a decoding speedup of ∼1.4×. AFS eases model convergence, and improves translation quality by ∼1.3-1.6 BLEU.

Background: L0Drop

L0Drop provides a selective mechanism for encoder-decoder models which encourages removing uninformative encoder outputs via a sparsity-inducing objective (Zhang et al., 2020). Given a source sequence X = {x_1, x_2, ..., x_n}, L0Drop assigns each encoded source state x_i ∈ R^d a scalar gate g_i ∈ [0, 1] as follows:

    L0Drop(x_i) = g_i x_i,    (1)
    with g_i ∼ HardConcrete(α_i, β, ε),    (2)

where α_i, β, ε are hyperparameters of the hard concrete distribution (HardConcrete) (Louizos et al., 2018).

Note that the hyperparameter α_i is crucial to HardConcrete as it directly governs its shape. We associate α_i with x_i through a gating network:

    log α_i = x_i^T · w,    (3)

where w ∈ R^d is a trainable parameter. Thus, L0Drop can schedule HardConcrete via α_i to put more probability mass at either 0 (i.e. g_i → 0) or 1 (i.e. g_i → 1). Intuitively, L0Drop controls the openness of the gate g_i via α_i so as to determine whether to remove (g_i = 0) or retain (g_i = 1) the state x_i.

L0Drop enforces sparsity by pushing the probability mass of HardConcrete towards 0, according to the following penalty term:

    L_0(X) = Σ_{i=1}^{n} (1 − p(g_i = 0 | α_i, β, ε)).    (4)

By sampling g_i with reparameterization (Kingma and Welling, 2013), L0Drop is fully differentiable and optimized with an upper bound on the objective: L_MLE + λ L_0(X), where λ is a hyperparameter affecting the degree of sparsity (a larger λ enforces more gates near 0) and L_MLE denotes the maximum likelihood loss. An estimate of the expected value of g_i is used during inference. Zhang et al. (2020) applied L0Drop to prune encoder outputs for MT and summarization tasks; we adapt it to E2E ST. Sparse stochastic gates and L0 relaxations were also used by Bastings et al. (2019) to construct interpretable classifiers, i.e. models that can reveal which tokens they rely on when making a prediction.
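To make the gating mechanism concrete, the following PyTorch sketch implements the hard concrete gate and the L0 penalty of Eqs. 2 and 4. It is our illustration rather than the authors' released code; the stretch constants follow Louizos et al. (2018), with the paper's ε corresponding to the lower stretch bound.

```python
import math
import torch

# Hard concrete gate (Louizos et al., 2018), as used by L0Drop.
# BETA is the temperature; (GAMMA, ZETA) stretch the (0, 1) interval so
# that exactly 0 and exactly 1 receive non-zero probability mass.
BETA, GAMMA, ZETA = 2.0 / 3.0, -0.1, 1.1

def hard_concrete_sample(log_alpha: torch.Tensor) -> torch.Tensor:
    """Reparameterized sample g ~ HardConcrete(alpha, beta, eps), Eq. 2."""
    u = torch.rand_like(log_alpha).clamp(1e-6, 1.0 - 1e-6)
    s = torch.sigmoid((u.log() - (1.0 - u).log() + log_alpha) / BETA)
    return (s * (ZETA - GAMMA) + GAMMA).clamp(0.0, 1.0)

def hard_concrete_expected(log_alpha: torch.Tensor) -> torch.Tensor:
    """Deterministic expected gate value, used at inference time."""
    return (torch.sigmoid(log_alpha) * (ZETA - GAMMA) + GAMMA).clamp(0.0, 1.0)

def l0_penalty(log_alpha: torch.Tensor) -> torch.Tensor:
    """Eq. 4: sum over gates of 1 - p(g_i = 0), i.e. the probability
    that each gate is non-zero."""
    return torch.sigmoid(log_alpha - BETA * math.log(-GAMMA / ZETA)).sum()
```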
E2E ST with AFS

One difficulty with applying encoder-decoder models to E2E ST is deciding how to encode speech signals. In contrast to text, where word boundaries can be easily identified, the spectral features of speech are continuous, varying remarkably across different speakers for the same transcript. In addition, redundant information, like pauses in-between neighbouring words, can be of arbitrary duration at any position, as shown in Figure 1, while contributing little to translation. This increases the burden on, and occupies the capacity of, the ST encoder, leading to inferior performance (Duong et al., 2016; Bérard et al., 2016). Rather than developing complex encoder architectures, we resort to feature selection to explicitly clear out those uninformative speech features.

Figure 2 gives an overview of our model. We use a pretrained and frozen ASR encoder to extract contextual speech features, and collect the informative ones via AFS before transmission to the ST encoder. AFS drops pauses, noise and other uninformative features, and retains features that are relevant for ASR. We speculate that these retained features are also the most relevant for ST, and that the sparser representation simplifies the learning problem for ST, for example the learning of attention strength between encoder states and target-language (sub)words. Given a training tuple (audio, source transcription, translation), denoted as (X, Y, Z) respectively, we outline the overall framework below, including three steps:
1. Train the ASR model with the following objective and model architecture until convergence:

       L_ASR = η L_MLE(Y|X) + γ L_CTC(Y|X),    (5)
       M_ASR = D_ASR(Y, E_ASR(X)).    (6)

2. Finetune the ASR model with AFS for m steps:

       L_AFS = L_MLE(Y|X) + λ L_0(X),    (7)
       M_AFS = D_ASR(Y, F(E_ASR(X))).    (8)

3. Train the ST model with the pretrained and frozen ASR and AFS submodules until convergence:

       L_ST = L_MLE(Z|X),    (9)
       M_ST = D_ST(Z, E_ST(F ∘ E_ASR(X))).    (10)

We handle both ASR and ST as sequence-to-sequence problems with encoder-decoder models. We use E_*(·) and D_*(·, ·) to denote the corresponding encoder and decoder, respectively. Note that our model only requires pair-wise training corpora: (X, Y) for ASR, and (X, Z) for ST. F(·) denotes the AFS approach, and F ∘ E_ASR means freezing the ASR encoder and the AFS module during ST training. Note that our framework puts no constraint on the architecture of the encoder and decoder in any task, although we adopt the multi-head dot-product attention network (Vaswani et al., 2017) for our experiments. The three steps are sketched in code below.
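The sketch below mirrors Eqs. 5-10 as a training loop skeleton. All model objects, the `train` routine and the per-batch loss attributes are hypothetical placeholders standing in for whatever implementation is used.

```python
# Sketch of the three training stages; `train`, the models, and the
# per-batch `b.mle` / `b.ctc` losses are hypothetical placeholders.
def run_pipeline(train, asr_model, afs, st_model, asr_data, st_data,
                 gamma=0.3, lam=0.5, m=5_000):
    # Step 1 (Eq. 5): ASR pretraining with joint MLE + CTC, eta = 1 - gamma.
    train(asr_model, asr_data,
          loss=lambda b: (1 - gamma) * b.mle + gamma * b.ctc)

    # Step 2 (Eq. 7): place AFS between ASR encoder and decoder, drop the
    # CTC term, and finetune for m extra steps with the sparsity penalty.
    asr_model.bridge = afs
    train(asr_model, asr_data, steps=m,
          loss=lambda b: b.mle + lam * afs.penalty)

    # Step 3 (Eq. 9): freeze ASR encoder + AFS, then train ST on the
    # selected features with plain MLE.
    for p in asr_model.encoder.parameters():
        p.requires_grad = False
    for p in afs.parameters():
        p.requires_grad = False
    train(st_model, st_data, loss=lambda b: b.mle)
```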
ASR Pretraining
The ASR model M_ASR (Eq. 6) directly maps an audio input to its transcription. To improve speech encoding, we apply a logarithmic penalty on attention to enforce short-range dependency (Di Gangi et al., 2019), and use trainable positional embeddings with a maximum length of 2048. Apart from L_MLE, we augment the training objective with the connectionist temporal classification loss L_CTC (Graves et al., 2006, CTC), as in Eq. 5, with η = 1 − γ. The CTC loss is applied to the encoder outputs, guiding them to align with their corresponding transcription (sub)words and improving the encoder's robustness (Karita et al., 2019). Following previous work (Karita et al., 2019; Wang et al., 2020), we set γ to 0.3.
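The joint objective can be written as the following PyTorch sketch. Tensor names, shapes and the helper arguments are our assumptions about a typical setup, not the released code.

```python
import torch
import torch.nn.functional as F

# Sketch of the joint ASR objective in Eq. 5. Assumed shapes:
# enc_out (T, B, d) encoder states; ctc_proj a linear layer to the
# vocabulary; dec_logits the decoder's next-token predictions.
GAMMA = 0.3                                    # CTC weight; eta = 1 - GAMMA
ctc = torch.nn.CTCLoss(blank=0, zero_infinity=True)

def asr_loss(dec_logits, dec_targets, enc_out, ctc_proj,
             enc_lens, tgt_lens, ctc_targets):
    # maximum likelihood term over decoder outputs, with label smoothing
    l_mle = F.cross_entropy(dec_logits.reshape(-1, dec_logits.size(-1)),
                            dec_targets.reshape(-1), label_smoothing=0.1)
    # CTC term applied directly to the encoder outputs
    log_probs = F.log_softmax(ctc_proj(enc_out), dim=-1)   # (T, B, V)
    l_ctc = ctc(log_probs, ctc_targets, enc_lens, tgt_lens)
    return (1.0 - GAMMA) * l_mle + GAMMA * l_ctc
```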
AFS Finetuning

This stage aims at using AFS to dynamically pick out the subset of ASR encoder outputs that is most relevant for ASR performance (see the red rectangles in Figure 1). We follow Zhang et al. (2020) and place AFS in-between the ASR encoder and decoder during finetuning (see F(·) in M_AFS, Eq. 8). We exclude the CTC loss from the training objective (Eq. 7) to relax the alignment constraint and increase the flexibility of feature adaptation. We use L0Drop for AFS in two ways.
AFS_t: The direct application of L0Drop to the ASR encoder results in AFS_t, which sparsifies encodings along the temporal dimension {x_i}_{i=1}^n:

    F_t(x_i) = AFS_t(x_i) = g_i^t x_i,
    with log α_i^t = x_i^T · w^t,
    g_i^t ∼ HardConcrete(α_i^t, β, ε),    (11)

where α_i^t is a positive scalar produced by a simple linear gating layer, and w^t ∈ R^d is a trainable parameter of dimension d. (Other candidate gating models, like a linear mapping upon mean-pooled encoder outputs, delivered worse performance in our preliminary experiments.) g^t is the temporal gate. The sparsity penalty of AFS_t follows Eq. 4:

    L_0^t(X) = Σ_{i=1}^{n} (1 − p(g_i^t = 0 | α_i^t, β, ε)).    (12)
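A minimal module for this temporal gate might look as follows, reusing the hard-concrete helpers sketched earlier; it is our illustration under those assumptions.

```python
import torch
import torch.nn as nn

class AFSTemporal(nn.Module):
    """Sketch of AFS_t (Eqs. 11-12): one scalar hard concrete gate per
    encoder state, with log alpha_i^t = x_i^T w_t. Reuses the
    hard_concrete_* helpers and l0_penalty defined above."""
    def __init__(self, d_model: int):
        super().__init__()
        self.w_t = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        log_alpha_t = x @ self.w_t                        # (batch, time)
        g_t = (hard_concrete_sample(log_alpha_t) if self.training
               else hard_concrete_expected(log_alpha_t))
        self.penalty = l0_penalty(log_alpha_t)            # Eq. 12 term
        self.g_t = g_t                                    # kept for selection
        return g_t.unsqueeze(-1) * x                      # gated states
```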
AFS_{t,f}: In contrast to text processing, speech processing often extracts the spectrum from overlapping time frames to form the acoustic input, analogous to word embeddings. As each encoded speech feature contains temporal information, it is reasonable to extend AFS_t to AFS_{t,f}, adding sparsification along the feature dimension {x_{i,j}}_{j=1}^d:

    F_{t,f}(x_i) = AFS_{t,f}(x_i) = g_i^t x_i ⊙ g^f,
    with log α^f = w^f,
    g_j^f ∼ HardConcrete(α_j^f, β, ε),    (13)

where α^f ∈ R^d estimates the weight of each feature, governed by an input-independent gating model with trainable parameter w^f ∈ R^d. g^f is the feature gate; note that α^f is shared across all time steps, and ⊙ denotes element-wise multiplication. AFS_{t,f} reuses the g_i^t-relevant submodules in Eq. 11, and extends the sparsity penalty L_0^t in Eq. 12 as follows:

    L_0^{t,f}(X) = L_0^t(X) + Σ_{j=1}^{d} (1 − p(g_j^f = 0 | α_j^f, β, ε)).    (14)

We perform the finetuning by replacing (F, L_0) in Eqs. 7-8 with either AFS_t (F_t, L_0^t) or AFS_{t,f} (F_{t,f}, L_0^{t,f}) for an extra m steps. We compare these two variants in our experiments.
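Extending the temporal-gate sketch above, the feature gate adds one input-independent hard concrete gate per dimension, broadcast over time; again this is our illustration, not the released implementation.

```python
class AFSTemporalFeature(AFSTemporal):
    """Sketch of AFS_{t,f} (Eqs. 13-14): adds an input-independent
    feature gate g^f with log alpha^f = w_f, shared over all time steps."""
    def __init__(self, d_model: int):
        super().__init__(d_model)
        self.w_f = nn.Parameter(torch.zeros(d_model))     # log alpha^f

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = super().forward(x)                          # applies g^t
        g_f = (hard_concrete_sample(self.w_f) if self.training
               else hard_concrete_expected(self.w_f))
        self.penalty = self.penalty + l0_penalty(self.w_f)  # Eq. 14
        return out * g_f                                  # broadcast over time
```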
E2E ST Training

We treat the pretrained ASR and AFS models as a speech feature extractor, and freeze them during ST training. We gather the speech features emitted by the ASR encoder that correspond to g_i^t > 0, and pass them to the ST encoder in the same way as word embeddings. We employ sinusoidal positional encoding to distinguish features at different positions. Except for the input to the ST encoder, our E2E ST follows the standard encoder-decoder translation model (M_ST in Eq. 10) and is optimized with L_MLE alone, as in Eq. 9. Intuitively, AFS bridges the gap between ASR output and MT input by selecting transcript-aligned speech features.
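The hand-off to the ST encoder then amounts to compacting the open-gated states; a minimal sketch for a single sequence (batching would need padding and masking, which we omit):

```python
import torch

def select_for_st(x: torch.Tensor, g_t: torch.Tensor) -> torch.Tensor:
    """Keep only encoder states whose temporal gate is open (g_i^t > 0)
    and repack them densely; sinusoidal positional encodings are then
    added downstream before the ST encoder."""
    keep = g_t > 0.0          # (time,) boolean mask over encoder states
    return x[keep]            # (n_kept, d_model) compacted features
```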
Datasets and Preprocessing
We experiment with two benchmarks: the Augmented LibriSpeech dataset (LibriSpeech En-Fr) (Kocabiyikoglu et al., 2018) and the multilingual MuST-C dataset (Di Gangi et al., 2019). LibriSpeech En-Fr is collected by aligning e-books in French with English utterances of LibriSpeech, further augmented with French translations offered by Google Translate. We use the 100 hours of clean training data, including 47K utterances, to train ASR models, and double this size for ST models after concatenation with the Google translations. We report results on the test set (2048 utterances) using models selected on the dev set (1071 utterances). MuST-C is built from English TED talks, covering 8 translation directions: English to German (De), Spanish (Es), French (Fr), Italian (It), Dutch (Nl), Portuguese (Pt), Romanian (Ro) and Russian (Ru). We train ASR and ST models on the given training sets, containing ∼452 hours of speech on average.

Model Settings and Baselines
We adopt the Transformer architecture (Vaswani et al., 2017) for all tasks, including M_ASR (Eq. 6), M_AFS (Eq. 8) and M_ST (Eq. 10). The encoder and decoder consist of 6 identical layers, each including a self-attention sublayer, a cross-attention sublayer (decoder only) and a feedforward sublayer. We employ the base setting for experiments: hidden size d = 512, 8 attention heads and feedforward size 2048. We schedule the learning rate via Adam (β1 = 0.9, β2 = 0.98) (Kingma and Ba, 2015), paired with 4K warmup steps. We apply dropout to attention weights and residual connections with rates of 0.1 and 0.2 respectively, and also add label smoothing of 0.1 to handle overfitting. We train all models for a maximum of 30K steps with a minibatch size of around 25K target subwords, and average the last 5 checkpoints for evaluation. We use beam search for decoding, and set the beam size and length penalty to 4 and 0.6, respectively. We set ε = −0.1 and β = 2/3 for AFS, following Louizos et al. (2018), and finetune AFS for an additional m = 5K steps. We evaluate translation quality with tokenized case-sensitive BLEU (Papineni et al., 2002), and report WER for ASR performance without punctuation.
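For reference, the hyperparameters above can be collected into a single configuration; this is a hypothetical summary dict of the values quoted in the text, not the authors' configuration file.

```python
# Hypothetical summary of the training configuration described above.
CONFIG = dict(
    layers=6, hidden_size=512, heads=8, ffn_size=2048,
    adam_betas=(0.9, 0.98), warmup_steps=4_000,
    dropout_attn=0.1, dropout_residual=0.2, label_smoothing=0.1,
    train_steps=30_000, batch_target_subwords=25_000,
    beam_size=4, length_penalty=0.6,
    afs_eps=-0.1, afs_beta=2 / 3, afs_finetune_steps=5_000,
)
```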
We compare our models with four baselines:

ST: A vanilla Transformer-based E2E ST model with 6 encoder and 6 decoder layers. The logarithmic attention penalty (Di Gangi et al., 2019) is used to improve the encoder.
ST + ASR-PT: We perform ASR pretraining (ASR-PT) for E2E ST. This is the same model as ours (Figure 2) but without AFS finetuning.
Cascade: We first transcribe the speech input using an ASR model, and then pass the results on to an MT model. We also use the logarithmic attention penalty (Di Gangi et al., 2019) for the ASR encoder.
ST + Fixed Rate: Instead of dynamically selecting features, we replace AFS with subsampling at a fixed rate, extracting the speech encodings after every k positions.

In addition, we offer another baseline, ST + CNN, for comparison on MuST-C En-De: we replace the fixed-rate subsampling with a one-layer 1D depthwise-separable convolution, where the output dimension is set to 512, the kernel size over the temporal dimension is set to 5, and the stride is set to 6. In this way, the ASR encoder features are compressed to around 1/6 of their original number, a ratio similar to the fixed-rate subsampling. Both baselines are sketched below.
We perform a thorough study on MuST-C En-De. With AFS, the first question concerns its feasibility. We start by analyzing the degree of sparsity in the speech features (i.e. the sparsity rate) yielded by AFS, focusing on the temporal sparsity rate |{g_i^t = 0}|/n and the feature sparsity rate |{g_j^f = 0}|/d. To obtain different rates, we vary the hyperparameter λ in Eq. 7 in the range [0.1, 1.0] with a step size of 0.1.

Results in Figure 3 show that large amounts of the encoded speech features can easily be pruned out, revealing heavy inner-speech redundancy. Both AFS_t and AFS_{t,f} drop ∼60% of the temporal features at λ = 0.1, and this rate grows further for larger λ (Figure 3b), remarkably surpassing the sparsity rate reported by Zhang et al. (2020) for text summarization. In contrast to the rich temporal sparsification, we obtain a feature sparsity rate of 0 regardless of λ's value, although increasing λ decreases g^f (Figure 3a). This suggests that selecting neurons along the feature dimension is harder. Rather than filtering neurons, the feature gate g^f acts more like a weighting mechanism on them. In the rest of the paper, sparsity rate refers to the temporal sparsity rate.

Figure 3: Feature gate value (a) and temporal sparsity rate (b) as a function of λ on the MuST-C En-De dev set. Larger λ decreases the gate value of g^f without dropping any neurons, i.e. the feature sparsity rate stays at 0%. By contrast, speech features are highly redundant along the temporal dimension, easily inducing high sparsity rates.
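Both rates can be read directly off the learned gates; a minimal sketch (ours), using the inference-time expected gate values:

```python
import torch

def sparsity_rates(g_t: torch.Tensor, g_f: torch.Tensor):
    """Fraction of closed temporal gates |{g_i^t = 0}| / n and closed
    feature gates |{g_j^f = 0}| / d, from the expected gate values."""
    temporal = (g_t == 0).float().mean().item()
    feature = (g_f == 0).float().mean().item()
    return temporal, feature
```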
Figure 4: ASR (WER↓) and ST (BLEU↑) performance as a function of the temporal sparsity rate on the MuST-C En-De dev set. Pruning out ∼85% of the temporal speech features largely improves translation quality and retains ∼95% of the ASR accuracy.

We continue by exploring the impact of varied sparsity rates on ASR and ST performance; Figure 4 shows their correlation. We observe that AFS slightly degrades ASR accuracy (Figure 4a), but still retains ∼95% of the accuracy on average; AFS_{t,f} often performs better than AFS_t at a similar sparsity rate. The fact that only ∼16% of the speech features successfully support 95% of the ASR accuracy proves the informativeness of these selected features. These findings echo Zhang et al. (2020), who observe a trade-off between sparsity and quality.
However, when AFS is applied to ST, we find consistent improvements in translation quality, shown in Figure 4b. Translation quality on the development set peaks at 22.17 BLEU, achieved by AFS_{t,f} with a sparsity rate of 85.5%. We set λ = 0.5 (corresponding to a sparsity rate of ∼85%) for subsequent experiments, where both AFS_t and AFS_{t,f} reach their optimum.

Model                      BLEU↑            Speedup↑
MT                         29.69            -
Cascade                    22.52            1.06×
ST                         17.44            0.87×
ST + ASR-PT                20.67            1.00×
ST + CNN                   20.64            1.31×
ST + Fixed Rate (k = 6)    21.14 (83.3%)    1.42×
ST + Fixed Rate (k = 7)    20.87 (85.7%)    1.43×
ST + AFS_t                 21.57            ∼1.4×
ST + AFS_t,f               22.38            ∼1.4×

Table 1: BLEU↑ and speedup↑ on the MuST-C En-De test set, with λ = 0.5. We evaluate the speedup on a GeForce GTX 1080 Ti with a decoding batch size of 16, and report average results over 3 runs. Numbers in parentheses are the sparsity rate.

We summarize the test results in Table 1, where we set k = 6 or k = 7 for ST + Fixed Rate, giving a sparsity rate of around 85% as suggested by the analysis above. Our vanilla ST model yields a BLEU score of 17.44; pretraining on ASR further enhances performance to 20.67, significantly outperforming the result of Di Gangi et al. (2019) by 3.37 BLEU. This also suggests the importance of speech encoder pretraining (Di Gangi et al., 2019; Stoian et al., 2020; Wang et al., 2020). We treat ST with ASR-PT as our real baseline. We observe improved translation quality with fixed-rate subsampling, +0.47 BLEU at k = 6. Subsampling offers a chance to bypass noisy speech signals, and reducing the number of source states makes learning the translation alignment easier, but deciding on the optimal sampling rate is hard: results in Figure 5 reveal that fixed-rate subsampling deteriorates ST performance at suboptimal rates. Replacing fixed-rate subsampling with our one-layer CNN also fails to improve over the baseline, although the CNN offers more flexibility in feature manipulation. By contrast, the proposed AFS is data-driven, shifting the decision burden to the data and the model themselves. As a result, AFS_t and AFS_{t,f} surpass ASR-PT by 0.90 BLEU and 1.71 BLEU, respectively, substantially narrowing the performance gap to the cascade baseline (-0.14 BLEU).

Figure 5: Impact of k in fixed-rate subsampling on ST performance on the MuST-C En-De test set (sparsity rate: (k−1)/k). This subsampling underperforms AFS, and degrades ST performance at suboptimal rates.

We also observe improved decoding speed: AFS runs ∼1.4× faster than ASR-PT. Compared to fixed-rate subsampling, AFS is slightly slower, which we ascribe to the overhead introduced by the gating module. Surprisingly, Table 1 shows that the vanilla ST runs slower than ASR-PT (0.87×) while the cascade model is slightly faster (1.06×). By digging into the beam search algorithm, we discover that ASR pretraining shortens beam decoding, requiring fewer decoding steps on average than the vanilla ST. The speedup brought by cascading is due to the smaller English vocabulary size compared to the German vocabulary when processing audio inputs.

Figure 6: ST training curves (MuST-C En-De dev set). ASR pretraining significantly accelerates model convergence, and feature selection further stabilizes and improves training. λ = 0.5, k = 6.
Apart from the benefits in translation quality, we go deeper to study other potential impacts of (adaptive) feature selection. We begin by inspecting the training curves. Figure 6 shows that ASR pretraining improves model convergence, and feature selection makes training more stable. Compared to the other models, the curve of ST with AFS is much smoother, suggesting a better regularization effect.

We then investigate the effect of training data size, with results shown in Figure 7. Overall, we do not observe higher data efficiency from feature selection in low-resource settings. Instead, our results suggest that feature selection delivers larger performance improvements when more training data is available. With respect to data efficiency, ASR pretraining seems to be more important (Figure 7, left) (Bansal et al., 2019; Stoian et al., 2020).
Figure 7: BLEU as a function of training data size on MuST-C En-De. We split the original training data into five non-overlapping subsets, and train different models on accumulated subsets. Results are reported on the test set. Note that we perform ASR pretraining on the original dataset. λ = 0.5, k = 6.
Compared to AFS, fixed-rate subsampling suffers more from small-scale training: it yields worse performance than ASR-PT at the smaller training data sizes, highlighting the better generalization of AFS.

In addition to model performance, we also look into the ST model itself, focusing on the cross-attention weights. Figure 8 visualizes the distribution of attention values; ST models with feature selection noticeably shift the distribution towards larger weights. This suggests that each ST encoder output exerts a greater influence on the translation. By removing redundant and noisy speech features, feature selection eases the learning of the ST encoder, and also enhances its connection strength with the ST decoder. This helps bridge the modality gap between speech and text translation. Although fixed-rate subsampling also delivers a distribution shift similar to AFS, its inferior ST performance corroborates the better quality of adaptively selected features.

Figure 8: Histogram of the cross-attention weights received per ST encoder output on the MuST-C En-De test set. For each instance, we collect attention weights averaged over different heads and decoder layers, following Zhang et al. (2020). A larger weight indicates a stronger impact of the encoder output on the translation. Feature selection biases the distribution towards larger weights. λ = 0.5, k = 6.
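The statistic behind Figure 8 can be sketched as follows; this is our reading of the averaging procedure, with assumed tensor layout.

```python
import torch

def mean_weight_per_encoder_state(cross_attn: torch.Tensor) -> torch.Tensor:
    """Given cross-attention weights of assumed shape
    (layers, heads, tgt_len, src_len), average over decoder layers,
    heads and target positions to get the mean attention weight each
    ST encoder output receives."""
    return cross_attn.mean(dim=(0, 1, 2))        # (src_len,)
```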
AFS vs. Fixed Rate
We compare these two approaches by analyzing the number of retained features with respect to word duration and temporal position. Results in Figure 9a show that the underlying pattern behind the two methods is similar: words with a longer duration correspond to more retained speech features. However, when it comes to temporal position, Figure 9b illustrates their difference: fixed-rate subsampling is context-independent, periodically picking up features, while AFS decides on feature selection based on contextual information. The curve of AFS is smoother, indicating that the features kept by AFS are more uniformly distributed across positions, ensuring their informativeness.

Figure 9: The number of selected features vs. word duration (a) and position (b) on the MuST-C En-De test set. For word duration, we align the audio and its transcription with the Montreal Forced Aligner (McAuliffe et al., 2017), and collect each word's duration and its corresponding number of retained features. For position, we uniformly split each input into 50 pieces, and count the average number of retained features in each piece. λ = 0.5, k = 6.
Figure 10: Illustration of the feature gate g^f with λ = 0.5.

AFS_t vs. AFS_{t,f}

Their only difference lies in the feature gate g^f, which we visualize in Figure 10. Although this gate induces no sparsification, it offers AFS_{t,f} the capability of adjusting the weight of each neuron; in other words, AFS_{t,f} has more freedom in manipulating speech features.

Table 2 and Table 3 list the results on MuST-C and LibriSpeech En-Fr, respectively. Over all tasks, AFS_t/AFS_{t,f} substantially outperform ASR-PT, by 1.34/1.60 average BLEU, pruning out 84.5% of the temporal speech features on average and yielding an average decoding speedup of 1.45×. Our model narrows the gap to the cascade model to -0.8 average BLEU, and AFS surpasses Cascade on LibriSpeech En-Fr without using knowledge distillation (Liu et al., 2019a).

Model                    De      Es      Fr      It      Nl      Pt      Ro      Ru
Di Gangi et al. (2019)   17.30   20.80   26.90   16.80   18.80   20.10   16.50   10.50
Transformer + ASR-PT∗    -       -       -       -       -       -       -       -
ST + AFS_t               -       -       -       -       -       -       -       -
ST + AFS_t,f             -       -       -       -       -       -       -       -

Table 2: Performance (BLEU↑) over 8 languages on the MuST-C dataset. ∗: results reported by the ESPnet toolkit (Watanabe et al., 2018), where the hyperparameters of beam search are tuned for each dataset.

Model                    En-Fr
Bérard et al. (2018)     13.40
Watanabe et al. (2018)   16.68
Liu et al. (2019a)       17.02
Wang et al. (2019)       17.05
Wang et al. (2020)       17.66
ST                       14.32
ST + ASR-PT              17.05
Cascade                  18.27
ST + AFS_t               -
ST + AFS_t,f             18.56

Table 3: Performance (BLEU↑) on LibriSpeech En-Fr.

Speech Translation
Pioneering studies on ST used a cascade of separately trained ASR and MT systems (Ney, 1999). Despite its simplicity, this approach inevitably suffers from the mistakes made by ASR models, and is error-prone. Research in this direction often focuses on strategies capable of mitigating the mismatch between ASR output and MT input, such as representing ASR outputs with lattices (Saleem et al., 2004; Mathias and Byrne, 2006; Zhang et al., 2019a; Beck et al., 2019), injecting synthetic ASR errors for robust MT (Tsvetkov et al., 2014; Cheng et al., 2018), and differentiable cascade modeling (Kano et al., 2017; Anastasopoulos and Chiang, 2018; Sperber et al., 2019).

In contrast to cascading, another option is to perform direct speech-to-text translation. Duong et al. (2016) and Bérard et al. (2016) employ the attentional encoder-decoder model (Bahdanau et al., 2015) for E2E ST without accessing any intermediate transcriptions. E2E ST opens the way to bridging the modality gap directly, but it is data-hungry, sample-inefficient and often underperforms cascade models, especially in low-resource settings (Bansal et al., 2018). This has led researchers to explore solutions ranging from efficient neural architecture design (Karita et al., 2019; Di Gangi et al., 2019; Sung et al., 2019) to the incorporation of extra training signals, including multi-task learning (Weiss et al., 2017; Liu et al., 2019b), submodule pretraining (Bansal et al., 2019; Stoian et al., 2020; Wang et al., 2020), knowledge distillation (Liu et al., 2019a), meta-learning (Indurthi et al., 2019) and data augmentation (Kocabiyikoglu et al., 2018; Jia et al., 2019; Pino et al., 2019). Our work focuses on E2E ST, and we investigate feature selection, which has rarely been studied before.
Speech Feature Selection
Encoding speech signals is challenging as the acoustic input is lengthy, noisy and redundant. To ease model learning, previous work often selected features via downsampling techniques, such as convolutional modeling (Di Gangi et al., 2019) and fixed-rate subsampling (Lu et al., 2015). Recently, Zhang et al. (2019b) and Na et al. (2019) proposed dynamic subsampling for ASR, which learns to skip uninformative features during recurrent encoding. Unfortunately, their methods are deeply embedded into recurrent networks, and hard to adapt to other architectures like the Transformer (Vaswani et al., 2017). Salesky et al. (2019) have explored phoneme-level representations for E2E ST, which reduce the number of speech features temporally by ∼80% and obtain significant performance improvements, but this requires non-trivial phoneme recognition and alignment. Instead, we resort to sparsification techniques, which have recently achieved great success in NLP tasks (Correia et al., 2019; Child et al., 2019; Zhang et al., 2020). In particular, we employ L0Drop (Zhang et al., 2020) for AFS to dynamically retain informative speech features; it is fully differentiable and independent of the concrete encoder/decoder architecture. We extend L0Drop by handling both temporal and feature dimensions with different gating networks, and apply it to E2E ST.
Conclusion

In this paper, we propose adaptive feature selection for E2E ST to handle redundant and noisy speech signals. We insert AFS between the ST encoder and a pretrained, frozen ASR encoder to filter out uninformative features that contribute little to ASR. We base AFS on L0Drop (Zhang et al., 2020), and extend it to model both the temporal and feature dimensions. Results show that AFS improves translation quality and accelerates decoding by ∼1.4×, with an average temporal sparsity rate of ∼85%.

Acknowledgments
We would like to thank Shucong Zhang for his great support in building our ASR baselines. IT acknowledges support from the European Research Council (ERC Starting Grant 678254) and the Dutch National Science Foundation (NWO VIDI 639.022.518). This work has received funding from the European Union's Horizon 2020 Research and Innovation Programme under Grant Agreement No 825460 (ELITR). Rico Sennrich acknowledges support from the Swiss National Science Foundation (MUTAMUR; no. 176727).
References
Antonios Anastasopoulos and David Chiang. 2018. Tied multitask learning for neural speech translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 82-91, New Orleans, Louisiana. Association for Computational Linguistics.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations.

Sameer Bansal, Herman Kamper, Karen Livescu, Adam Lopez, and Sharon Goldwater. 2018. Low-resource speech-to-text translation. In Proc. Interspeech 2018, pages 1298-1302.

Sameer Bansal, Herman Kamper, Karen Livescu, Adam Lopez, and Sharon Goldwater. 2019. Pre-training on high-resource speech recognition improves low-resource speech-to-text translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 58-68, Minneapolis, Minnesota. Association for Computational Linguistics.

Jasmijn Bastings, Wilker Aziz, and Ivan Titov. 2019. Interpretable neural predictions with differentiable binary variables. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2963-2977, Florence, Italy. Association for Computational Linguistics.

Daniel Beck, Trevor Cohn, and Gholamreza Haffari. 2019. Neural speech translation using lattice transformations and graph networks. In Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13), pages 26-31, Hong Kong. Association for Computational Linguistics.

Alexandre Bérard, Laurent Besacier, Ali Can Kocabiyikoglu, and Olivier Pietquin. 2018. End-to-end automatic speech translation of audiobooks. In ICASSP 2018, pages 6224-6228. IEEE.

Alexandre Bérard, Olivier Pietquin, Christophe Servan, and Laurent Besacier. 2016. Listen and translate: A proof of concept for end-to-end speech-to-text translation. In NIPS Workshop on End-to-end Learning for Speech and Audio Processing, Barcelona, Spain.

Yong Cheng, Zhaopeng Tu, Fandong Meng, Junjie Zhai, and Yang Liu. 2018. Towards robust neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1756-1766, Melbourne, Australia. Association for Computational Linguistics.

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509.

Gonçalo M. Correia, Vlad Niculae, and André F. T. Martins. 2019. Adaptively sparse transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2174-2184, Hong Kong, China. Association for Computational Linguistics.

Mattia A. Di Gangi, Roldano Cattoni, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2019. MuST-C: a multilingual speech translation corpus. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2012-2017, Minneapolis, Minnesota. Association for Computational Linguistics.

Mattia A. Di Gangi, Matteo Negri, and Marco Turchi. 2019. Adapting Transformer to end-to-end spoken language translation. In Proc. Interspeech 2019, pages 1133-1137.

Long Duong, Antonios Anastasopoulos, David Chiang, Steven Bird, and Trevor Cohn. 2016. An attentional model for speech translation without transcription. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 949-959, San Diego, California. Association for Computational Linguistics.

Alex Graves, Santiago Fernández, and Faustino Gomez. 2006. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the International Conference on Machine Learning, ICML 2006, pages 369-376.

Sathish Indurthi, Houjeung Han, Nikhil Kumar Lakumarapu, Beomseok Lee, Insoo Chung, Sangha Kim, and Chanwoo Kim. 2019. Data efficient direct speech-to-text translation with modality agnostic meta-learning. arXiv preprint arXiv:1911.04283.

Ye Jia, Melvin Johnson, Wolfgang Macherey, Ron J. Weiss, Yuan Cao, Chung-Cheng Chiu, Naveen Ari, Stella Laurenzo, and Yonghui Wu. 2019. Leveraging weakly supervised data to improve end-to-end speech-to-text translation. In ICASSP 2019, pages 7180-7184. IEEE.

Takatomo Kano, Sakriani Sakti, and Satoshi Nakamura. 2017. Structured-based curriculum learning for end-to-end English-Japanese speech translation. In Proc. Interspeech 2017, pages 2630-2634.

Shigeki Karita, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki, Nelson Enrique Yalta Soplin, Ryuichi Yamamoto, Xiaofei Wang, Shinji Watanabe, Takenori Yoshimura, and Wangyou Zhang. 2019. A comparative study on Transformer vs RNN in speech applications. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 449-456.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations.

Diederik P. Kingma and Max Welling. 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

Ali Can Kocabiyikoglu, Laurent Besacier, and Olivier Kraif. 2018. Augmenting LibriSpeech with French translations: A multimodal corpus for direct speech translation evaluation. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pages 177-180, Prague, Czech Republic. Association for Computational Linguistics.

Yuchen Liu, Hao Xiong, Jiajun Zhang, Zhongjun He, Hua Wu, Haifeng Wang, and Chengqing Zong. 2019a. End-to-end speech translation with knowledge distillation. In Proc. Interspeech 2019, pages 1128-1132.

Yuchen Liu, Jiajun Zhang, Hao Xiong, Long Zhou, Zhongjun He, Hua Wu, Haifeng Wang, and Chengqing Zong. 2019b. Synchronous speech recognition and speech-to-text translation with interactive decoding. arXiv preprint arXiv:1912.07240.

Christos Louizos, Max Welling, and Diederik P. Kingma. 2018. Learning sparse neural networks through L0 regularization. In International Conference on Learning Representations.

Liang Lu, Xingxing Zhang, Kyunghyun Cho, and Steve Renals. 2015. A study of the recurrent neural network encoder-decoder for large vocabulary speech recognition. In Sixteenth Annual Conference of the International Speech Communication Association.

Lambert Mathias and William Byrne. 2006. Statistical phrase-based speech translation. In ICASSP 2006, volume 1. IEEE.

Michael McAuliffe, Michaela Socolof, Sarah Mihuc, Michael Wagner, and Morgan Sonderegger. 2017. Montreal Forced Aligner: Trainable text-speech alignment using Kaldi. In Proc. Interspeech 2017, pages 498-502.

Rui Na, Junfeng Hou, Wu Guo, Yan Song, and Lirong Dai. 2019. Learning adaptive downsampling encoding for online end-to-end speech recognition, pages 850-854.

Hermann Ney. 1999. Speech translation: Coupling of recognition and translation. In ICASSP 1999, volume 1, pages 517-520. IEEE.

Jan Niehues, Roldano Cattoni, Sebastian Stüker, Matteo Negri, Marco Turchi, Elizabeth Salesky, Ramon Sanabria, Loïc Barrault, Lucia Specia, and Marcello Federico. 2019. The IWSLT 2019 evaluation campaign. In Proceedings of the 16th International Workshop on Spoken Language Translation (IWSLT 2019).

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311-318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Juan Pino, Liezl Puzon, Jiatao Gu, Xutai Ma, Arya D. McCarthy, and Deepak Gopinath. 2019. Harnessing indirect training data for end-to-end automatic speech translation: Tricks of the trade. In Proceedings of the 16th International Workshop on Spoken Language Translation (IWSLT).

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186-191, Belgium, Brussels. Association for Computational Linguistics.

Shirin Saleem, Szu-Chen (Stan) Jou, Stephan Vogel, and Tanja Schultz. 2004. Using word lattice information for a tighter coupling in speech translation systems. In International Conference of Spoken Language Processing.

Elizabeth Salesky, Matthias Sperber, and Alan W. Black. 2019. Exploring phoneme-level speech representations for end-to-end speech translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1835-1841, Florence, Italy. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715-1725, Berlin, Germany. Association for Computational Linguistics.

Matthias Sperber, Graham Neubig, Jan Niehues, and Alex Waibel. 2019. Attention-passing models for robust and data-efficient end-to-end speech translation. Transactions of the Association for Computational Linguistics, 7:313-325.

Mihaela C. Stoian, Sameer Bansal, and Sharon Goldwater. 2020. Analyzing ASR pretraining for low-resource speech-to-text translation. In ICASSP 2020, pages 7909-7913. IEEE.

Tzu-Wei Sung, Jun-You Liu, Hung-yi Lee, and Lin-shan Lee. 2019. Towards end-to-end speech-to-text translation with two-pass decoding. In ICASSP 2019, pages 7175-7179. IEEE.

Yulia Tsvetkov, Florian Metze, and Chris Dyer. 2014. Augmenting translation models with simulated acoustic confusions for improved spoken language translation. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 616-625, Gothenburg, Sweden. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998-6008. Curran Associates, Inc.

Chengyi Wang, Yu Wu, Shujie Liu, Zhenglu Yang, and Ming Zhou. 2019. Bridging the gap between pre-training and fine-tuning for end-to-end speech translation. arXiv preprint arXiv:1909.07575.

Chengyi Wang, Yu Wu, Shujie Liu, Ming Zhou, and Zhenglu Yang. 2020. Curriculum pre-training for end-to-end speech translation. arXiv preprint arXiv:2004.10093.

Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, and Tsubasa Ochiai. 2018. ESPnet: End-to-end speech processing toolkit. In Proc. Interspeech 2018, pages 2207-2211.

Ron J. Weiss, Jan Chorowski, Navdeep Jaitly, Yonghui Wu, and Zhifeng Chen. 2017. Sequence-to-sequence models can directly translate foreign speech. In Proc. Interspeech 2017, pages 2625-2629.

Biao Zhang, Ivan Titov, and Rico Sennrich. 2020. On sparsifying encoder outputs in sequence-to-sequence models. arXiv preprint arXiv:2004.11854.

Pei Zhang, Niyu Ge, Boxing Chen, and Kai Fan. 2019a. Lattice transformer for speech translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6475-6484, Florence, Italy. Association for Computational Linguistics.

Shucong Zhang, Erfan Loweimi, Yumo Xu, Peter Bell, and Steve Renals. 2019b. Trainable dynamic subsampling for end-to-end speech recognition. In Proc. Interspeech 2019.