On NMT Search Errors and Model Errors: Cat Got Your Tongue?
Felix Stahlberg ∗ and Bill Byrne
University of Cambridge
Department of Engineering
Trumpington St, Cambridge CB2 1PZ, UK
{fs439,wjb31}@cam.ac.uk

Abstract
We report on search errors and model errors in neural machine translation (NMT). We present an exact inference procedure for neural sequence models based on a combination of beam search and depth-first search. We use our exact search to find the global best model scores under a Transformer base model for the entire WMT15 English-German test set. Surprisingly, beam search fails to find these global best model scores in most cases, even with a very large beam size of 100. For more than 50% of the sentences, the model in fact assigns its global best score to the empty translation, revealing a massive failure of neural models in properly accounting for adequacy. We show by constraining search with a minimum translation length that at the root of the problem of empty translations lies an inherent bias towards shorter translations. We conclude that vanilla NMT in its current form requires just the right amount of beam search errors, which, from a modelling perspective, is a highly unsatisfactory conclusion indeed, as the model often prefers an empty translation.
Neural machine translation (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2015, NMT) assigns the probability P(y|x) of a translation y = y_1^J ∈ T^J of length J over the target language vocabulary T for a source sentence x ∈ S^I of length I over the source language vocabulary S via a left-to-right factorization using the chain rule:

log P(y|x) = Σ_{j=1}^{J} log P(y_j | y_1^{j-1}, x).   (1)

The task of finding the most likely translation ŷ ∈ T* for a given source sentence x is known as the decoding or inference problem:

ŷ = argmax_{y ∈ T*} P(y|x).   (2)

The NMT search space is vast as it grows exponentially with the sequence length. For example, for a common vocabulary size of |T| = 32,000, there are already more possible translations with 20 words or less than atoms in the observable universe (32,000^20 ≫ 10^80). Thus, complete enumeration of the search space is impossible. The size of the NMT search space is perhaps the main reason why – besides some preliminary studies (Niehues et al., 2017; Stahlberg et al., 2018b; Ott et al., 2018) – analyzing search errors in NMT has received only limited attention. To the best of our knowledge, none of the previous studies were able to quantify the number of search errors in unconstrained NMT due to the lack of an exact inference scheme that – although too slow for practical MT – guarantees to find the global best model score for analysis purposes.

In this work we propose such an exact decoding algorithm for NMT that exploits the monotonicity of NMT scores: Since the conditional log-probabilities in Eq. 1 are always negative, partial hypotheses can be safely discarded once their score drops below the log-probability of any complete hypothesis. Using our exact inference scheme we show that beam search does not find the global best model score for more than half of the sentences.

∗ Now at Google.
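The left-to-right factorization in Eq. 1 can be made concrete with a short sketch. The toy step distribution below is an illustrative stand-in for the NMT decoder's softmax output; the function names are ours, not from the paper:

```python
import math

# Toy next-token distribution, standing in for one NMT decoder step
# P(. | y_1^{j-1}, x). Fixed here purely for illustration.
def toy_step_logprobs(prefix):
    return {"a": math.log(0.5), "b": math.log(0.3), "</s>": math.log(0.2)}

def sequence_logprob(tokens):
    # Eq. 1: log P(y | x) is the sum of per-token conditional
    # log-probabilities, accumulated left to right over the prefix.
    total = 0.0
    for j in range(len(tokens)):
        total += toy_step_logprobs(tokens[:j])[tokens[j]]
    return total

score = sequence_logprob(["a", "b", "</s>"])  # equals log(0.5 * 0.3 * 0.2)
```

Because every summand is a log-probability and thus non-positive, the running total can only decrease as the hypothesis grows, which is the monotonicity property the exact search below relies on.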
However, these search errors, paradoxically, often prevent the decoder from suffering from a frequent but very serious model error in NMT, namely that the empty hypothesis often gets the global best model score. Our findings suggest a reassessment of the amount of model and search errors in NMT, and we hope that they will spark new efforts in improving NMT modeling capabilities, especially in terms of adequacy.

Algorithm 1 BeamSearch(x, n ∈ N+)
Input: x: Source sentence, n: Beam size
1:  H_cur ← {(ε, 0.0)}   {Initialize with empty translation prefix and zero score}
2:  repeat
3:    H_next ← ∅
4:    for all (y, p) ∈ H_cur do
5:      if y_|y| = </s> then
6:        H_next ← H_next ∪ {(y, p)}   {Hypotheses ending with </s> are not expanded}
7:      else
8:        H_next ← H_next ∪ ⋃_{w∈T} (y · w, p + log P(w|x, y))   {Add all possible continuations}
9:      end if
10:   end for
11:   H_cur ← {(y, p) ∈ H_next : |{(y′, p′) ∈ H_next : p′ > p}| < n}   {Select n-best}
12:   (ỹ, p̃) ← argmax_{(y,p) ∈ H_cur} p
13: until ỹ_|ỹ| = </s>
14: return ỹ

Algorithm 2
DFS(x, y, p ∈ R, γ ∈ R)
Input: x: Source sentence
y: Translation prefix (default: ε)
p: log P(y|x) (default: 0.0)
γ: Lower bound
1:  if y_|y| = </s> then
2:    return (y, p)   {Trigger γ update}
3:  end if
4:  ỹ ← ⊥   {Initialize ỹ with dummy value}
5:  for all w ∈ T do
6:    p′ ← p + log P(w|x, y)
7:    if p′ ≥ γ then
8:      (y′, γ′) ← DFS(x, y · w, p′, γ)
9:      if γ′ > γ then
10:       (ỹ, γ) ← (y′, γ′)
11:     end if
12:   end if
13: end for
14: return (ỹ, γ)

Decoding in NMT (Eq. 2) is usually tackled with beam search, which is a time-synchronous approximate search algorithm that builds up hypotheses from left to right. A formal algorithm description is given in Alg. 1. Beam search maintains a set of active hypotheses H_cur. In each iteration, all hypotheses in H_cur that do not end with the end-of-sentence symbol </s> are expanded and collected in H_next. The best n items in H_next constitute the set of active hypotheses H_cur in the next iteration (line 11 in Alg. 1), where n is the beam size. The algorithm terminates when the best hypothesis in H_cur ends with the end-of-sentence symbol </s>. Hypotheses are called complete if they end with </s> and partial if they do not.

Beam search is the ubiquitous decoding algorithm for NMT, but it is prone to search errors as the number of active hypotheses is limited by n. In particular, beam search never compares partial hypotheses of different lengths with each other. As we will see in later sections, this is one of the main sources of search errors. However, in many cases, the model score found by beam search is a reasonable approximation to the global best model score. Let γ be the model score found by beam search (p̃ in line 12, Alg. 1), which is a lower bound on the global best model score: γ ≤ log P(ŷ|x). Furthermore, since the conditionals log P(y_j | y_1^{j-1}, x) in Eq.
1 are log-probabilities and thus non-positive, expanding a partial hypothesis is guaranteed to result in a lower model score, i.e.:

∀j ∈ [2, J]: log P(y_1^{j-1} | x) > log P(y_1^j | x).   (3)

Consequently, when we are interested in the global best hypothesis ŷ, we only need to consider partial hypotheses with scores greater than γ. In our exact decoding scheme we traverse the NMT search space in a depth-first order, but cut off branches along which the accumulated model score falls below γ. During depth-first search (DFS), we update γ when we find a better complete hypothesis. Alg. 2 specifies the DFS algorithm formally. An important detail is that elements in T are ordered such that the loop in line 5 considers the </s> token first. This often updates γ early on and leads to better pruning in subsequent recursive calls.

Equality in Eq. 3 is impossible since probabilities are modeled by the neural model via a softmax function which never predicts a probability of exactly 0.

Exact inference under length constraints
Our admissible pruning criterion based on γ relies on the fact that the model score of a (partial) hypothesis is always lower than the score of any of its translation prefixes. While this monotonicity condition is true for vanilla NMT (Eq. 3), it does not hold for methods like length normalization (Jean et al., 2015; Boulanger-Lewandowski et al., 2013; Wu et al., 2016) or word rewards (He et al., 2016): Length normalization gives an advantage to longer hypotheses by dividing the score by the sentence length, while a word reward directly violates monotonicity as it rewards each word with a positive value. In Sec. 4 we show how our exact search can be extended to handle arbitrary length models (Murray and Chiang, 2018; Huang et al., 2017; Yang et al., 2018) by introducing length dependent lower bounds γ_k and report initial findings on exact search under length normalization. However, despite being of practical use, methods like length normalization and word penalties are rather heuristic as they do not have any justification from a probabilistic perspective. They also do not generalize well as (without re-tuning) they often work only for a specific beam size. It would be much more desirable to fix the length bias in the NMT model itself.

We conduct all our experiments in this section on the entire English-German WMT news-test2015 test set (2,169 sentences) with a Transformer base (Vaswani et al., 2017) model trained with Tensor2Tensor (Vaswani et al., 2018) on parallel WMT18 data excluding ParaCrawl. Our pre-processing is as described by Stahlberg et al. (2018a) and includes joint subword segmentation using byte pair encoding (Sennrich et al., 2016) with 32K merges. We report cased BLEU scores.

Note that the order in which the for-loop in line 5 of Alg. 2 iterates over T may be important for efficiency but does not affect the correctness of the algorithm.
Comparable with http://matrix.statmt.org/

An open-source implementation of our exact inference scheme is available in the
Search    BLEU   Ratio   Search errors   Empty translations
Greedy    29.3   1.02    73.6%            0.0%
Beam-10   30.3   1.00    57.7%            0.0%
Exact      2.1   0.06     0.0%           51.8%

Table 1: NMT with exact inference. In the absence of search errors, NMT often prefers the empty translation, causing a dramatic drop in length ratio and BLEU.

Figure 1: BLEU over the percentage of search errors. Large beam sizes yield fewer search errors but the BLEU score suffers from a length ratio below 1.
(Figure 2 axes: beam size vs. log-likelihood and percentage of search errors.)
Figure 2: Even large beam sizes produce a large number of search errors.
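The beam search decoder analyzed in these experiments (Alg. 1) can be sketched as follows. The toy next-token distribution is an illustrative stand-in for an NMT model; all names here are ours:

```python
import math

def beam_search(step_logprobs, n, max_len=50, eos="</s>"):
    # Alg. 1 sketch: time-synchronous beam search. `step_logprobs(prefix)`
    # returns a dict mapping token -> conditional log-probability.
    hyps = [((), 0.0)]  # active hypotheses H_cur: (prefix, cumulative log-prob)
    for _ in range(max_len):
        nxt = []
        for y, p in hyps:
            if y and y[-1] == eos:
                nxt.append((y, p))  # complete hypotheses are not expanded
            else:
                for w, lp in step_logprobs(y).items():
                    nxt.append((y + (w,), p + lp))  # all continuations
        hyps = sorted(nxt, key=lambda h: h[1], reverse=True)[:n]  # select n-best
        if hyps[0][0][-1] == eos:  # stop when the best hypothesis is complete
            return hyps[0]
    return hyps[0]

# Toy model: at every step, "a" has probability 0.6 and </s> has 0.4.
toy = lambda prefix: {"a": math.log(0.6), "</s>": math.log(0.4)}
best, score = beam_search(toy, n=2)
```

With these made-up probabilities even beam search returns the single-token </s> hypothesis, since any continuation multiplies in further probabilities below one. This echoes, in miniature, the preference for short translations discussed in the paper.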
SGNMT decoder (Stahlberg et al., 2017, 2018b).

http://ucam-smt.github.io/sgnmt/html/, simpledfs decoding strategy.
A sentence is classified as a search error if the decoder does not find the global best model score.

Our main result is shown in Tab. 1. Greedy and beam search both achieve reasonable BLEU scores but rely on a high number of search errors to not be affected by a serious NMT model error: For 51.8% of the sentences, NMT assigns the global best model score to the empty translation, i.e. a single </s> token. Fig. 1 visualizes the relationship between BLEU and the number of search errors. Large beam sizes reduce the number of search errors, but the BLEU score drops because translations are too short. Even a large beam size of 100 produces 53.62% search errors. Fig. 2 shows that beam search effectively reduces search errors with respect to greedy decoding to some degree, but is ineffective in reducing search errors even further. For example, Beam-10 yields 15.9% fewer search errors (absolute) than greedy decoding (57.68% vs. 73.58%), but Beam-100 improves search only slightly (53.62% search errors) despite being 10 times slower than Beam-10.

The problem of empty translations is also visible in the histogram over length ratios (Fig. 3). Beam search – although still slightly too short – roughly follows the reference distribution, but exact search has an isolated peak in the [0.0, 0.1] cluster from the empty translations.

Tab. 2 demonstrates that the problems of search errors and empty translations are not specific to the Transformer base model and also occur with other architectures. Even a highly optimized Transformer Big model from our WMT18 shared task submission (Stahlberg et al., 2018a) has 25.8% empty translations.

Fig. 4 shows that long source sentences are more affected by both beam search errors and the problem of empty translations. The global best translation is empty for almost all sentences longer than 40 tokens (green curve). Even without sentences where the model prefers the empty translation, a large amount of search errors remain (blue curve).

Figure 3: Histogram over target/source length ratios.

Table 2: ∗: The recurrent LSTM, the convolutional SliceNet (Kaiser et al., 2017), and the Transformer-Big systems are strong baselines from a WMT'18 shared task submission (Stahlberg et al., 2018a).

Figure 4: Number of search errors under Beam-10 and empty global bests over the source sentence length.

Figure 5: Histogram over length ratios with minimum translation length constraint of 0.25 times the source sentence length. Experiment conducted on 73.0% of the test set.
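The exact inference scheme used to obtain these global best scores (Alg. 2) can be sketched in Python. The toy step function stands in for the NMT decoder and is purely illustrative; `bound` plays the role of γ:

```python
import math

def exact_dfs(step_logprobs, y=(), p=0.0, bound=-math.inf,
              eos="</s>", max_len=50):
    # Alg. 2 sketch: depth-first search with admissible pruning. `bound`
    # (gamma) only increases when a better complete hypothesis is found,
    # so discarding prefixes that fall below it is safe by Eq. 3.
    if y and y[-1] == eos:
        return y, p  # complete hypothesis: triggers a gamma update upstream
    if len(y) >= max_len:
        return None, bound  # recursion cap for this sketch
    best_y = None
    # Consider </s> first so gamma tightens early (cf. the ordering of T).
    for w, lp in sorted(step_logprobs(y).items(), key=lambda kv: kv[0] != eos):
        p2 = p + lp
        if p2 < bound:
            continue  # prune: scores only decrease along this branch
        y2, b2 = exact_dfs(step_logprobs, y + (w,), p2, bound, eos, max_len)
        if b2 > bound:
            best_y, bound = y2, b2
    return best_y, bound

toy = lambda prefix: {"a": math.log(0.6), "</s>": math.log(0.4)}
best, score = exact_dfs(toy)  # global best under the toy model
```

Under this toy model the provably global best hypothesis is the empty translation (probability 0.4, versus 0.6 · 0.4 for a one-word translation), which mirrors the empty-translation model error reported above, though of course real NMT distributions are far less uniform.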
To find out more about the length deficiency we constrained exact search to certain translation lengths. Constraining search that way increases the run time as the γ-bounds are lower. Therefore, all results in this section are conducted on only a subset of the test set to keep the runtime under control. We first constrained search to translations longer than 0.25 times the source sentence length and thus excluded the empty translation from the search space. Although this mitigates the problem slightly (Fig. 5), it still results in a peak in the length-ratio cluster just above the 0.25 threshold. This suggests that the problem of empty translations is the consequence of an inherent model bias towards shorter hypotheses and cannot be fixed with a length constraint.

We stopped decoding if the decoder took longer than a day for a single sentence on a single CPU. Exact search without length constraints is much faster and does not need maximum execution time limits.

Search                       BLEU   Ratio
Beam-10                      37.0   1.00
Exact for Beam-10 length     37.0   1.00
Exact for reference length   37.9   1.01

Table 3: Exact search under length constraints. Experiment conducted on 48.3% of the test set.
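A minimum-length constraint of the kind used in these experiments can be imposed by masking the end-of-sentence token during search. A minimal sketch; the wrapper name and toy step function are ours, not from the paper's SGNMT implementation:

```python
import math

def with_min_length(step_logprobs, min_len, eos="</s>"):
    # Wrap a decoder step function so that </s> is removed from the candidate
    # continuations while the prefix is shorter than `min_len`. This excludes
    # all translations shorter than min_len (e.g. the empty one) from search.
    def constrained(prefix):
        dist = dict(step_logprobs(prefix))
        if len(prefix) < min_len:
            dist.pop(eos, None)  # hypothesis may not end yet
        return dist
    return constrained

toy = lambda prefix: {"a": math.log(0.6), "</s>": math.log(0.4)}
constrained = with_min_length(toy, min_len=2)
```

Because masking only removes candidates, the remaining scores are unchanged and the γ-pruning argument still applies within the constrained space.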
Search     W/o length norm.    With length norm.
           BLEU    Ratio       BLEU    Ratio
Beam-10    37.0    1.00        36.3    1.03
Beam-30    36.7    0.98        36.3    1.04
Exact      27.2    0.74        36.4    1.03

Table 4: Length normalization fixes translation lengths, but prevents exact search from matching the BLEU score of Beam-10. Experiment conducted on 48.3% of the test set.
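Why length normalization requires the generalized bounds can be seen numerically: raw NMT log-probabilities decrease monotonically with hypothesis length (Eq. 3), which makes the γ-pruning admissible, whereas length-normalized scores can rise as a hypothesis grows. A small illustration with made-up per-token probabilities:

```python
import math

# Per-token conditional log-probabilities of a growing hypothesis.
# The numbers are made up; in NMT they come from the decoder softmax.
step_logprobs = [math.log(0.6), math.log(0.9), math.log(0.95)]

raw = 0.0
raw_scores, norm_scores = [], []
for j, lp in enumerate(step_logprobs, start=1):
    raw += lp
    raw_scores.append(raw)        # cumulative log P(y_1^j | x): always decreasing
    norm_scores.append(raw / j)   # length-normalized score: may increase
```

The raw scores decrease at every step, so a prefix that falls below γ can never recover; the normalized scores in this example increase with length, so the same pruning would be inadmissible under length normalization.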
We then constrained exact search to either the length of the best Beam-10 hypothesis or the reference length. Tab. 3 shows that exact search constrained to the Beam-10 hypothesis length does not improve over beam search, suggesting that any search errors between the beam search score and the global best score for that length are insignificant enough so as not to affect the BLEU score. The oracle experiment in which we constrained exact search to the correct reference length (last row in Tab. 3) improved the BLEU score by 0.9 points.

A popular method to counter the length bias in NMT is length normalization (Jean et al., 2015; Boulanger-Lewandowski et al., 2013), which simply divides the sentence score by the sentence length. We can find the global best translations under length normalization by generalizing our exact inference scheme to length dependent lower bounds γ_k. The generalized scheme finds the best model scores for each translation length k in a certain range (e.g. zero to 1.2 times the source sentence length). The initial lower bounds are derived from the Beam-10 hypothesis y_beam as follows:

γ_k = (k + 1) · log P(y_beam | x) / (|y_beam| + 1).   (4)

Exact search under length normalization does not suffer from the length deficiency anymore (last row in Tab. 4), but it is not able to match our best BLEU score under Beam-10 search. This suggests that while length normalization biases search towards translations of roughly the correct length, it does not fix the fundamental modelling problem.

Available in our SGNMT decoder (Stahlberg et al., 2017, 2018b) as simplelendfs strategy.
We add 1 to the lengths to avoid division by zero errors.
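The initialization of the length-dependent bounds in Eq. 4 can be sketched as follows; the function and argument names are ours:

```python
import math

def initial_length_bounds(beam_logprob, beam_len, max_len):
    # Length-dependent lower bounds gamma_k (Eq. 4), derived from the
    # length-normalized score of the beam hypothesis: a translation of
    # length k can only beat the beam hypothesis under length normalization
    # if its raw log-prob exceeds gamma_k. The +1 avoids division by zero.
    norm = beam_logprob / (beam_len + 1)
    return {k: (k + 1) * norm for k in range(max_len + 1)}

# Illustrative values: a beam hypothesis of 7 tokens with log-prob -8.0.
bounds = initial_length_bounds(beam_logprob=-8.0, beam_len=7, max_len=10)
```

Note that at k = |y_beam| the bound equals the beam hypothesis' own raw score, and the bounds loosen (become more negative) for longer candidate lengths, reflecting that longer translations need lower per-token normalized scores to compete.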
Other researchers have also noted that large beam sizes yield shorter translations (Koehn and Knowles, 2017). Sountsov and Sarawagi (2016) argue that this model error is due to the locally normalized maximum likelihood training objective in NMT that underestimates the margin between the correct translation and shorter ones if trained with regularization and finite data. A similar argument was made by Murray and Chiang (2018), who pointed out the difficulty for a locally normalized model to estimate the "budget" for all remaining (longer) translations. Kumar and Sarawagi (2019) demonstrated that NMT models are often poorly calibrated, and that this can cause the length deficiency. Ott et al. (2018) argued that uncertainty caused by noisy training data may play a role. Chen et al. (2018) showed that the consistent best string problem for RNNs is decidable. We provide an alternative DFS algorithm that relies on the monotonic nature of model scores rather than consistency, and that often converges in practice. To the best of our knowledge, this is the first work that reports the exact number of search errors in NMT, as prior work often relied on approximations, e.g. via n-best lists (Niehues et al., 2017) or constraints (Stahlberg et al., 2018b).

We have presented an exact inference scheme for NMT. Exact search may not be practical, but it allowed us to discover deficiencies in widely used NMT models. We linked deteriorating BLEU scores of large beams with the reduction of search errors and showed that the model often prefers the empty translation – evidence of NMT's failure to properly model adequacy. Our investigations into length-constrained exact search suggested that simple heuristics like length normalization are unlikely to remedy the problem satisfactorily.
Acknowledgments
This work was supported by the U.K. Engineering and Physical Sciences Research Council (EPSRC) grant EP/L027623/1 and has been performed using resources provided by the Cambridge Tier-2 system operated by the University of Cambridge Research Computing Service funded by EPSRC Tier-2 capital grant EP/P020259/1.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.

Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. 2013. Audio chord recognition with recurrent neural networks. In ISMIR, pages 335–340. Citeseer.

Yining Chen, Sorcha Gilroy, Andreas Maletti, Jonathan May, and Kevin Knight. 2018. Recurrent neural networks as weighted language recognizers. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2261–2271, New Orleans, Louisiana. Association for Computational Linguistics.

Wei He, Zhongjun He, Hua Wu, and Haifeng Wang. 2016. Improved neural machine translation with SMT features. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pages 151–157. AAAI Press.

Liang Huang, Kai Zhao, and Mingbo Ma. 2017. When to finish? Optimal beam search for neural text generation (modulo beam size). In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2134–2139, Copenhagen, Denmark. Association for Computational Linguistics.

Sébastien Jean, Orhan Firat, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. Montreal neural machine translation systems for WMT'15. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 134–140, Lisbon, Portugal. Association for Computational Linguistics.

Lukasz Kaiser, Aidan N. Gomez, and Francois Chollet. 2017. Depthwise separable convolutions for neural machine translation. arXiv preprint arXiv:1706.03059.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1700–1709. Association for Computational Linguistics.

Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, pages 28–39, Vancouver. Association for Computational Linguistics.

Aviral Kumar and Sunita Sarawagi. 2019. Calibration of encoder decoder models for neural machine translation. arXiv preprint arXiv:1903.00802.

Kenton Murray and David Chiang. 2018. Correcting length bias in neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 212–223, Brussels, Belgium. Association for Computational Linguistics.

Jan Niehues, Eunah Cho, Thanh-Le Ha, and Alex Waibel. 2017. Analyzing neural MT search and model performance. In Proceedings of the First Workshop on Neural Machine Translation, pages 11–17.

Myle Ott, Michael Auli, David Grangier, et al. 2018. Analyzing uncertainty in neural machine translation. In International Conference on Machine Learning, pages 3953–3962.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Pavel Sountsov and Sunita Sarawagi. 2016. Length bias in encoder decoder models and a case for global conditioning. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1516–1525, Austin, Texas. Association for Computational Linguistics.

Felix Stahlberg, Adrià de Gispert, and Bill Byrne. 2018a. The University of Cambridge's machine translation systems for WMT18. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 504–512, Brussels, Belgium. Association for Computational Linguistics.

Felix Stahlberg, Eva Hasler, Danielle Saunders, and Bill Byrne. 2017. SGNMT – A flexible NMT decoding platform for quick prototyping of new models and search strategies. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 25–30. Association for Computational Linguistics.

Felix Stahlberg, Danielle Saunders, Gonzalo Iglesias, and Bill Byrne. 2018b. Why not be versatile? Applications of the SGNMT decoder for machine translation. In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Papers), pages 208–216. Association for Machine Translation in the Americas.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Ashish Vaswani, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan N. Gomez, Stephan Gouws, Llion Jones, Łukasz Kaiser, Nal Kalchbrenner, Niki Parmar, Ryan Sepassi, Noam Shazeer, and Jakob Uszkoreit. 2018. Tensor2tensor for neural machine translation. CoRR, abs/1803.07416.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Yilin Yang, Liang Huang, and Mingbo Ma. 2018. Breaking the beam search curse: A study of (re-)scoring methods and stopping criteria for neural machine translation. In