Gradient-guided Loss Masking for Neural Machine Translation
Xinyi Wang, Ankur Bapna, Melvin Johnson, Orhan Firat
Language Technologies Institute, Carnegie Mellon University; Google Research
[email protected], {ankurbpn, melvinp, orhanf}@google.com

Abstract
To mitigate the negative effect of low quality training data on the performance of neural machine translation models, most existing strategies focus on filtering out harmful data before training starts. In this paper, we explore strategies that dynamically optimize data usage during the training process using the model's gradients on a small set of clean data. At each training step, our algorithm calculates the gradient alignment between the training data and the clean data to mask out data with negative alignment. Our method has a natural intuition: good training data should update the model parameters in a similar direction as the clean data. Experiments on three WMT language pairs show that our method brings significant improvement over strong baselines, and the improvements are generalizable across test data from different domains.
The quality of training data has a large effect on the performance of neural machine translation (NMT) models (Khayrallah and Koehn, 2018). Parallel sentences crawled from the web and aligned from multilingual documents (Esplà et al., 2019) provide a large supply of training data to boost the performance of NMT, but automatically extracted training examples often contain noise that can hurt model performance. Designing a good strategy to utilize training data of varying quality is essential to improving the performance of NMT models.

Prior methods mainly focus on filtering data before training starts (Moore and Lewis, 2010; van der Wees et al., 2017; Wang et al., 2018). Junczys-Dowmunt (2018) proposed selecting clean training data by using a heuristic threshold on the perplexity difference between the forward and backward translation models. Wang et al. (2018) propose a curriculum-based data selection strategy using a small set of clean trusted data.

In this paper, we examine the problem of optimizing training data usage from a different angle. Instead of filtering data before training, we utilize the gradient information of the data during the training process to guide data usage. The intuition behind our method is natural: given a very small set of clean or trusted data (which is generally easy to obtain in practice), the preferred training data should have a similar gradient direction to the gradient of the clean data, so that updating the model parameters on this training data would improve the model performance on the clean data. Based on this intuition, we propose gradient-guided loss masking (GLMask), which calculates the gradient alignment between each training example and the clean data, and simply masks out the loss of the training examples that have negative gradient alignment. GLMask is inspired by methods that optimize training data usage using meta-learning. Ren et al. (2018) calculate a weighting of noisy training data that minimizes the model loss on clean data for image classification tasks. Wang et al. (2020b,a) optimize the training data distribution of multilingual data such that the model loss on the development set is minimized. Our method, on the other hand, uses a simple gradient alignment signal to determine the parallel sentence pairs or target words to mask out during NMT training.

We test GLMask on the standard WMT English-German, English-Chinese and English-French translation tasks, using WMT test sets from prior years as the small clean data. GLMask brings significant improvements over strong baselines on all three language pairs. We further evaluate the trained models on the IWSLT test sets, which are sampled from a different domain, and demonstrate that GLMask also delivers improvements on these out-of-domain sets. This shows that the improvements from GLMask generalize to data from different domains, beyond the domain of the clean data.
To explain our method, we first provide a mathematical framework for the problem it aims to address.
To train an NMT model with parameters θ that translates from a source language S to a target language T, we want to find the optimal model parameters θ* that minimize the loss function ℓ(x, y; θ) over the true distribution of the parallel data from S-T, denoted by P(S, T). However, in practice, one usually only has access to a limited number of parallel training sentences sampled from the training distribution P_train(x, y). The standard training approach finds the θ* that minimizes the loss function ℓ(·; θ) over this training distribution:

    J_train(θ) = E_{x,y ∼ P_train(x,y)} [ℓ(x, y; θ)]   (1)

Problems arise when the training distribution P_train(x, y), from which we draw training sentences, has discrepancies with P(S, T), the true distribution of the parallel data from S-T. For example, some samples from the training data might be noisy and might be detrimental to final model performance on the clean data (Khayrallah and Koehn, 2018).

To remedy this training data discrepancy, one strategy is to collect a small set of high quality data drawn from a distribution P_clean(x, y), which is closer to the true data distribution P(S, T), to guide training on the large noisy training data. With the help of this clean data, we want to train the model using data sampled from P_train(x, y), while the loss over the clean data is minimized:

    J_clean(θ) = E_{x,y ∼ P_clean(x,y)} [ℓ(x, y; θ)]   (2)

We propose gradient-guided loss masking (GLMask), which uses gradient information to mask out the training examples that can be harmful for minimizing the model's loss on the clean data. GLMask modifies the training objective J_train(θ) such that optimizing the model parameters over this objective also optimizes the objective over the small clean data, J_clean(θ). The intuition behind our approach is that a training example x, y ∼ P_train(x, y) is more likely to minimize the loss over the clean data if the gradient of this training example is in the same direction as the gradient of the clean data. Formally, we use the dot product between these two gradients as their alignment:

    g(x, y) = ∇_θ ℓ(x, y; θ)^⊤ · ∇_θ J_clean(θ)   (3)

A negative g(x, y) indicates that the training example has a conflicting gradient with the clean data, which might be detrimental for the model's performance on the clean data. To optimize the model objective on the clean data, we can mask out the examples that have negative g(x, y) with the clean data. That is, we modify the training objective in Eq. 1 to

    J_train(θ) = E_{x,y ∼ P_train(x,y)} [m(x, y) · ℓ(x, y; θ)]   (4)

where m(x, y) = 1[g(x, y) > 0]. The masking term assigns a weight of 0 to examples that have negative gradient alignment with the clean data.
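To make Eqs. (3) and (4) concrete, the sketch below computes the per-sentence mask m(x, y) in TensorFlow by naively looping over the training examples. The names model, loss_fn, and the batch objects are placeholders (a Keras-style model with trainable_variables and a loss function returning per-example losses are assumed); the efficient variant that only needs two extra backward passes is given in Appendix A.1.

    import tensorflow as tf

    def glmask_sentence_mask(model, loss_fn, train_examples, clean_batch):
        # Gradient of the mean loss on the clean batch, i.e. the gradient of J_clean in Eq. (2).
        with tf.GradientTape() as tape:
            clean_loss = tf.reduce_mean(loss_fn(model, clean_batch))
        clean_grad = tape.gradient(clean_loss, model.trainable_variables)

        masks = []
        for example in train_examples:  # one (x, y) pair at a time, for clarity
            with tf.GradientTape() as tape:
                example_loss = tf.reduce_mean(loss_fn(model, [example]))  # batch of one
            example_grad = tape.gradient(example_loss, model.trainable_variables)
            # g(x, y) of Eq. (3): dot product between this example's gradient and the clean gradient.
            alignment = tf.add_n([
                tf.reduce_sum(g * c)
                for g, c in zip(example_grad, clean_grad)
                if g is not None and c is not None
            ])
            # m(x, y) = 1[g(x, y) > 0] of Eq. (4).
            masks.append(tf.cast(alignment > 0.0, tf.float32))
        return tf.stack(masks)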
Implementation  The pseudo code for GLMask is in Alg. 1. Note that calculating the loss mask in Alg. 1 requires the gradient alignment between each training example and the gradient of the clean data, which is potentially very expensive. We use a technique that allows us to efficiently compute this value with only two additional backward passes, which is supported by modern deep learning libraries such as TensorFlow (Abadi et al., 2016); we use the technique introduced at https://j-towns.github.io/2017/06/12/A-new-trick.html, and the corresponding TensorFlow function is provided in § A.1. In our preliminary experiments, using GLMask at the later stage of training works as well as or even better than using it from the beginning. Therefore, we use GLMask for the last portion of training steps, which further decreases the overall training overhead of our method.

We use parallel data from the WMT evaluation campaign. To verify the effectiveness of our approach, we test it on three language pairs with varying amounts of resources: English to German (en-de), English to Chinese (en-zh), and English to French (en-fr). The en-de experiments are conducted using the WMT'14 training data with about 4 million parallel sentences; we use newstest2013 as the validation set and newstest2014 as the test set. The en-zh experiments use the WMT'17 training data with around 22 million parallel sentences; we use newsdev2017 as the validation set and newstest2017 as the test set. The en-fr experiments use the WMT'14 training data with about 40 million parallel sentences; we use newstest2013 as the validation set and newstest2014 as the test set. We use sacreBLEU (Post, 2018) to evaluate all our models.

Algorithm 1: Training with GLMask
Input: Training corpus D_train; clean data D_clean
Output: The converged model θ*
while not converged do
    ⊲ Sample a batch of training data
    (x_1, y_1) ... (x_B, y_B) ∼ D_train
    ⊲ Sample a batch of clean data
    (x'_1, y'_1) ... (x'_B, y'_B) ∼ D_clean
    ⊲ Calculate data mask
    g' ← ∇_{θ_t} (1/B) Σ_{i=1}^{B} ℓ(x'_i, y'_i; θ_t)
    m(x_i, y_i) ← 1[g'^⊤ ∇_{θ_t} ℓ(x_i, y_i; θ_t) > 0] for i in 1...B
    ⊲ Calculate masked objective
    g_train ← ∇_{θ_t} (1/B) Σ_{i=1}^{B} m(x_i, y_i) ℓ(x_i, y_i; θ_t)
    θ_{t+1} ← Update(θ_t, g_train)
end
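As a rough sketch of the loop in Alg. 1, assuming the glmask_sentence_mask helper sketched in the method section, a loss_fn that returns one loss per sentence pair, and any standard Keras optimizer (all names here are illustrative, not part of the paper's released code):

    import tensorflow as tf

    def train_with_glmask(model, loss_fn, optimizer, train_batches, clean_batches):
        for train_batch, clean_batch in zip(train_batches, clean_batches):
            # "Calculate data mask": m(x_i, y_i) for every pair in the training batch.
            mask = glmask_sentence_mask(model, loss_fn, train_batch, clean_batch)
            with tf.GradientTape() as tape:
                # Per-sentence losses, shape [batch_size].
                per_example_loss = loss_fn(model, train_batch)
                # "Calculate masked objective": average of m(x, y) * loss(x, y), as in Eq. (4).
                masked_loss = tf.reduce_mean(mask * per_example_loss)
            grads = tape.gradient(masked_loss, model.trainable_variables)
            # "Update": apply the masked gradient g_train.
            optimizer.apply_gradients(zip(grads, model.trainable_variables))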
Model and preprocessing
We use the standard Transformer base model (Vaswani et al., 2017), with 6 layers and 8 attention heads. The dropout rate is set to 0.1 and we use label smoothing of 0.1. For all datasets, we process the data using sentencepiece (Kudo and Richardson, 2018) with a vocabulary size of 40k. Parallel sentences longer than 200 word pieces are filtered out during training.
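For illustration, the snippet below trains a 40k-vocabulary sentencepiece model and filters sentence pairs by length; the file names are placeholders, and applying the 200-piece limit to each side separately is our assumption rather than a detail stated above.

    import sentencepiece as spm

    # Train a joint subword model on the concatenated parallel text (file name is illustrative).
    spm.SentencePieceTrainer.train(
        input="train.src-tgt.txt", model_prefix="spm_40k", vocab_size=40000)
    sp = spm.SentencePieceProcessor(model_file="spm_40k.model")

    def keep_pair(src_sentence, tgt_sentence, max_pieces=200):
        # Drop sentence pairs longer than 200 word pieces (assumed: checked on each side).
        return (len(sp.encode(src_sentence, out_type=str)) <= max_pieces and
                len(sp.encode(tgt_sentence, out_type=str)) <= max_pieces)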
Construction of the clean data
GLMask requires a small set of clean data to guide model training. Here we simply use previous years' WMT evaluation sets as our clean data. For en-de and en-fr, we concatenate the news test sets from 2010 to 2012 as the clean data, which contains approximately 8k sentences. For en-zh, we re-use our validation set, newsdev2017, as the clean data, which has about 2k sentences. In practice, it is generally reasonable to obtain a small amount of high quality annotated data.

                  en-de   en-zh   en-fr
    Vanilla       27.29   32.99   39.30
    Finetune      27.26   33.99   40.33
    GLMask-sent   27.49   34.19   40.34
    GLMask-word

Table 1: Results on the WMT test set.

                  en-de   en-zh   en-fr
    Vanilla       28.79   25.64   42.24
    Finetune      28.54   25.17   42.02
    GLMask-sent   28.74   25.60
    GLMask-word

Table 2: Results on the IWSLT test set.
We compare with two baselines: 1) vanilla: a standard Transformer model trained on all parallel data; 2) finetune: we finetune the vanilla model on the small clean data set used for calculating the loss mask in our method. For GLMask, we use two variations of loss masking: 1) sentence level (GLMask-sent): we mask out the loss of parallel sentence pairs in a batch; 2) word level (GLMask-word): we mask out the loss of each individual subword on the target side.
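The two granularities can be sketched as follows, assuming per-token losses and per-token gradient alignments of shape [batch_size, n_steps] (for example, as produced by the appendix code); the function and tensor names are illustrative.

    import tensorflow as tf

    def apply_glmask(token_loss, token_alignment, level="word"):
        """Zero out losses whose gradient alignment with the clean data is negative."""
        if level == "word":
            # GLMask-word: mask each target subword position independently.
            mask = tf.cast(token_alignment > 0.0, token_loss.dtype)
        else:
            # GLMask-sent: a sentence's alignment is the sum of its token alignments,
            # so the whole sentence pair is masked when that sum is negative.
            sentence_alignment = tf.reduce_sum(token_alignment, axis=-1, keepdims=True)
            mask = tf.cast(sentence_alignment > 0.0, token_loss.dtype)
        return token_loss * mask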
We evaluate the baselines and our methods on the WMT test sets, and report the results in Tab. 1. First, we notice that finetuning is a strong baseline, especially when the training data is relatively diverse. It does not lead to improvement for en-de, which has the least amount of training data, but it improves over the vanilla model by 1 BLEU for the higher-resourced en-zh and en-fr language pairs. GLMask improves over the strong finetune baseline for all three language pairs. Specifically, using GLMask at the word level consistently outperforms masking at the sentence level.
Out-of-domain generalization results
GLMask utilizes a small clean dataset to guide the training of NMT models. One potential criticism of this design choice is that the model might overfit to this small data. To examine how well a model trained using GLMask generalizes to new domains, we construct an additional test set using the IWSLT (Cettolo et al., 2012) data from TED talks. Since the small clean data we use is drawn from the news domain, the model performance on the IWSLT test sets is a good indicator of how well GLMask generalizes to out-of-domain data. We aggregate the test sets from IWSLT (https://wit3.fbk.eu/).

In this section, we analyze the training sentences and words that tend to be masked out by GLMask. We use a converged vanilla model to calculate the gradient alignments of a subset of the training data, with about 250k sentence pairs.

Figure 1: Average percentage of words that are not masked out, for all examples and for copied examples. Examples where source and target sentences are identical are masked out more.

Figure 2: Percentage of alphabetical words among the words with the lowest and highest masking rates. Words that are masked out more tend to contain fewer semantically rich alphabetical words.

             Src                           Trg
    en-de    Neither have you, I hope.     Sie doch hoffentlich auch nicht ?
    en-zh    What about you                你认为呢?

Table 3: Example training data. Red words are masked out.
First, we examine the sentence pairs using the percentage of words that are not masked out for each sentence. In Fig. 1, we plot the average percentage of unmasked words in all the sampled training examples (All), and in examples that have identical source and target sentences (Copied). En-de has a higher percentage of unmasked words, probably because it has a relatively cleaner training set with the least amount of data. For all three language pairs, especially the noisier en-zh and en-fr, copied examples have far fewer unmasked words than average. This indicates that GLMask is able to filter out copied sentences, which are known to be one of the most harmful categories of noise for NMT (Khayrallah and Koehn, 2018).

Next, we analyze the type of target words that tend to be masked out by GLMask. We sort the target words by the percentage of times they are masked, and compare the 1k words with the highest masking rate to the 1k words with the lowest masking rate. Alphabetical words, or words with only alphabetical characters, usually have richer semantic meaning than words that contain numbers or symbols. In Fig. 2, we compare the percentage of alphabetical words in the two groups of words with the highest and lowest masking rates. For all three languages, the words with a higher masking rate include fewer alphabetical words, which indicates that words with clearer and richer meanings are more favored by GLMask.

We show some training examples and the target words that are masked out by GLMask in Tab. 3. Our method masks out the punctuation words in the target that do not align with the source sentences.
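A minimal sketch of the word-level statistics used in this analysis; the input format (a stream of (target token, kept) pairs collected from GLMask runs) and the str.isalpha test for alphabetical words are our assumptions.

    from collections import defaultdict

    def masking_rates(token_mask_pairs):
        # token_mask_pairs: iterable of (token, kept), where kept is 1 if the token was not masked.
        masked, total = defaultdict(int), defaultdict(int)
        for token, kept in token_mask_pairs:
            total[token] += 1
            masked[token] += 1 - kept
        return {tok: masked[tok] / total[tok] for tok in total}

    def alphabetical_fraction(words):
        # "Alphabetical" words contain only alphabetic characters (no digits or symbols).
        return sum(w.isalpha() for w in words) / max(len(words), 1)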
In this paper, we evaluate a strategy to dynamically mask out unhelpful data during NMT training. We propose GLMask, a simple method that uses the gradient alignment between the training data and a small clean dataset to improve data usage. Experiments show that our method not only brings significant improvements on three WMT datasets, but also improves out-of-domain performance.
References
Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A system for large-scale machine learning. In USENIX OSDI.

M. Cettolo, C. Girardi, and M. Federico. 2012. WIT3: Web inventory of transcribed and translated talks. In EAMT.

Miquel Esplà, Mikel Forcada, Gema Ramírez-Sánchez, and Hieu Hoang. 2019. ParaCrawl: Web-scale parallel corpora for the languages of the EU. In Machine Translation Summit.

Marcin Junczys-Dowmunt. 2018. Dual conditional cross-entropy filtering of noisy parallel corpora. In WMT.

Huda Khayrallah and Philipp Koehn. 2018. On the impact of various types of noise on neural machine translation. In WMT.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In EMNLP.

Robert C. Moore and William Lewis. 2010. Intelligent selection of language model training data. In ACL.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In WMT.

Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. 2018. Learning to reweight examples for robust deep learning. In ICML.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS.

Wei Wang, Taro Watanabe, Macduff Hughes, Tetsuji Nakagawa, and Ciprian Chelba. 2018. Denoising neural machine translation training with trusted data and online data selection. In WMT.

Xinyi Wang, Hieu Pham, Paul Michel, Antonis Anastasopoulos, Jaime Carbonell, and Graham Neubig. 2020a. Optimizing data usage via differentiable rewards. In ICML.

Xinyi Wang, Yulia Tsvetkov, and Graham Neubig. 2020b. Balancing training for multilingual neural machine translation. In ACL.

Marlies van der Wees, Arianna Bisazza, and Christof Monz. 2017. Dynamic data selection for neural machine translation. In EMNLP.

Appendix
A.1 Source Code for Training with GLMask in TensorFlow
Get masked training gradient in TensorFlow

    import tensorflow as tf

    def get_train_gradient(train_loss, valid_loss, model):
        """Computes the GLMask-masked training gradient.

        Args:
          train_loss: Tensor [batch_size, n_steps]. The loss for each target
            position of each training example in a batch.
          valid_loss: Tensor [1]. The aggregated loss for a batch of clean data.
          model: the list of trainable parameters of the NMT model.

        Returns:
          train_grad: list of Tensors, one per parameter. The training gradient
            after masking.
        """
        z = tf.ones_like(train_loss)
        # Backward pass over the training loss, keeping a handle on z so that
        # this gradient can be differentiated again.
        train_grad = tf.gradients(train_loss, model, grad_ys=z)
        # Backward pass over the clean (valid) loss.
        valid_grad = tf.gradients(valid_loss, model)
        # Double-backward trick: differentiating <train_grad, valid_grad> w.r.t. z
        # recovers the per-position dot product between each training position's
        # gradient and the clean gradient, without materializing per-example gradients.
        dot_prod = tf.gradients(train_grad, z, grad_ys=valid_grad)[0]
        # Mask out positions whose gradient alignment with the clean data is negative.
        gradient_mask = tf.cast(tf.greater(dot_prod, tf.zeros_like(dot_prod)),
                                train_loss.dtype)
        # Gradient of the masked objective sum(mask * train_loss).
        train_grad = tf.gradients(train_loss, model, grad_ys=gradient_mask)
        return train_grad
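A hypothetical graph-mode usage of this function (the optimizer choice and learning rate are illustrative, not from the paper); model_params is the list of trainable variables passed in as model:

    masked_grads = get_train_gradient(train_loss, valid_loss, model_params)
    optimizer = tf.compat.v1.train.AdamOptimizer(learning_rate=1e-4)
    train_op = optimizer.apply_gradients(zip(masked_grads, model_params))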