Computational Linguistics | 2019 | Pages 1-4 | DOI 10.1162/COLI_r_00352

Quality Estimation for Machine Translation

 

Abstract


Many natural language processing tasks aim at generating human-readable text (or human-audible speech) in response to some input: Machine Translation (MT) generates a target translation of a source input; document summarization generates a shortened version of its input document(s); text generation converts a formal representation into an utterance or a document, etc. For such tasks, the automatic evaluation of a system's performance is often carried out by comparison to a reference output deemed representative of human-level performance. Evaluation of MT is a typical illustration of this approach and relies on metrics such as BLEU (Papineni et al. 2002), Translation Edit Rate (TER) (Snover et al. 2006), or METEOR (Banerjee and Lavie 2005), which implement various string comparison routines between the system output and the corresponding reference(s). This strategy has the merit of making evaluation fully automatic and reproducible.

Preparing human reference translations is, however, a costly process that requires highly trained experts; it is also prone to much variability and subjectivity, which implies that failing to match the reference does not necessarily entail a system error. Reference-based evaluations are also considered too crude for many language pairs and tend to evaluate only the system's ability to reproduce one specific human annotation. Organizers of MT shared tasks have therefore abandoned reference-based metrics for comparing systems and resort to human judgments (Callison-Burch et al. 2008).

The book by Specia, Scarton, and Paetzold surveys an alternative approach to the automatic evaluation of Machine Translation: Quality Estimation (QE). In essence, QE aims to move away from human references and to evaluate a generated text based only on automatically computed features. QE was initially proposed in the context of Automatic Speech Recognition (ASR), an area where much of the foundational work has been performed (Jiang 2005). As explained in the introductory chapter, QE for MT also has many applications and has emerged in the last decade as a very active subfield of MT, with its own evaluation campaigns and metrics.

In a nutshell, QE predicts the quality score or quality label of some target fragment produced in response to some source text. Assuming that texts annotated with their quality level are available, QE is usually cast as a supervised machine learning task. QE for MT needs to take two dimensions into account simultaneously: (a) is the proposed output appropriate for the input data? (b) is the generated text grammatically correct? Dimension (a) is usually associated with the concept of adequacy with respect to the input signal, while dimension (b) is associated with the correctness or fluency of the output target fragment. Correctness can be defined at various levels of granularity, depending on the size of the output chunk: smaller chunks
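To make the supervised casting of QE concrete, the following is a minimal, purely illustrative sketch, not taken from the book: it assumes a training set of (source, MT output) pairs annotated with sentence-level quality scores (e.g., HTER-like values), and it pairs a handful of shallow surface features with an off-the-shelf Ridge regressor; both the feature set and the choice of regressor are arbitrary illustrative assumptions.

    # Purely illustrative sketch of sentence-level QE as supervised regression.
    # Assumptions (not from the book under review): quality labels are HTER-like
    # scores; features are shallow surface statistics; the regressor is an
    # off-the-shelf Ridge model from scikit-learn.
    import numpy as np
    from sklearn.linear_model import Ridge

    def qe_features(source, mt_output):
        """Shallow, language-independent features of a (source, MT output) pair."""
        src, tgt = source.split(), mt_output.split()
        return [
            len(src),                               # source length (tokens)
            len(tgt),                               # target length (tokens)
            len(tgt) / max(len(src), 1),            # target/source length ratio
            sum(map(len, tgt)) / max(len(tgt), 1),  # average target token length
        ]

    def train_qe_model(pairs, scores):
        """Fit a regressor on (source, MT output) pairs annotated with quality scores."""
        X = np.array([qe_features(s, t) for s, t in pairs])
        return Ridge(alpha=1.0).fit(X, np.array(scores))

    def estimate_quality(model, source, mt_output):
        """Predict a quality score for an unseen (source, MT output) pair."""
        return float(model.predict(np.array([qe_features(source, mt_output)]))[0])

The same casting carries over to other granularities (word-, phrase-, or document-level QE) by changing the unit over which labels are predicted and adapting the feature extractor accordingly.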
