Adrià de Gispert
University of Cambridge
Publication
Featured research published by Adrià de Gispert.
Computational Linguistics | 2006
José B. Mariño; Rafael E. Banchs; Josep Maria Crego; Adrià de Gispert; Patrik Lambert; José A. R. Fonollosa; Marta Ruiz Costa-Jussà
This article describes in detail an n-gram approach to statistical machine translation. This approach consists of a log-linear combination of a translation model based on n-grams of bilingual units, which are referred to as tuples, along with four specific feature functions. Translation performance, which is at the state of the art, is demonstrated with Spanish-to-English and English-to-Spanish translations of the European Parliament Plenary Sessions (EPPS).
Language Resources and Evaluation | 2005
Patrik Lambert; Adrià de Gispert; Rafael E. Banchs; José B. Mariño
The purpose of this paper is to provide guidelines for building a word alignment evaluation scheme. The notion of word alignment quality depends on the application: here we review standard scoring metrics for full text alignment and give explanations on how to use them better. We discuss strategies to build a reference corpus, and show that the ratio between ambiguous and unambiguous links in the reference has a great impact on scores measured with these metrics. In particular, automatically computed alignments with higher precision or higher recall can be favoured depending on the value of this ratio. Finally, we suggest a strategy to build a reference corpus particularly adapted to applications where recall plays a significant role, like in machine translation. The manually aligned corpus we built for the Spanish-English European Parliament corpus is also described. This corpus is freely available.
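The scoring metrics mentioned in this abstract can be illustrated with a minimal sketch. It assumes the widely used formulation with sure (unambiguous) and possible (ambiguous) reference links: precision is measured against the possible links, recall against the sure links, and alignment error rate (AER) combines both. The function name `alignment_scores` and the representation of links as `(source_index, target_index)` pairs are assumptions made for illustration, not the paper's own evaluation scheme.

```python
def alignment_scores(hypothesis, sure, possible):
    """Return (precision, recall, AER) for a set of alignment links.

    Each link is a (source_index, target_index) pair. `sure` holds the
    unambiguous reference links, `possible` the ambiguous ones.
    """
    a = set(hypothesis)
    s = set(sure)
    p = set(possible) | s  # every sure link also counts as possible
    if not a or not s:
        raise ValueError("hypothesis and sure link sets must be non-empty")

    a_and_s = len(a & s)
    a_and_p = len(a & p)

    precision = a_and_p / len(a)   # hypothesis links that are at least possible
    recall = a_and_s / len(s)      # sure links recovered by the hypothesis
    aer = 1.0 - (a_and_s + a_and_p) / (len(a) + len(s))
    return precision, recall, aer


# Toy example with made-up links.
sure_links = {(0, 0), (1, 1)}
possible_links = {(2, 1)}
hypothesis_links = {(0, 0), (2, 1), (2, 2)}
precision, recall, aer = alignment_scores(hypothesis_links, sure_links, possible_links)
```

Because precision counts possible links as correct while recall only counts sure links, a reference with many possible links tends to reward high-precision alignments, which is consistent with the abstract's remark about the ratio between ambiguous and unambiguous links.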
Computational Linguistics | 2010
Adrià de Gispert; Gonzalo Iglesias; Graeme W. Blackwood; Eduardo Rodríguez Banga; William Byrne
In this article we describe HiFST, a lattice-based decoder for hierarchical phrase-based translation and alignment. The decoder is implemented with standard Weighted Finite-State Transducer (WFST) operations as an alternative to the well-known cube pruning procedure. We find that the use of WFSTs rather than k-best lists requires less pruning in translation search, resulting in fewer search errors, better parameter optimization, and improved translation performance. The direct generation of translation lattices in the target language can improve subsequent rescoring procedures, yielding further gains when applying long-span language models and Minimum Bayes Risk decoding. We also provide insights as to how to control the size of the search space defined by hierarchical rules. We show that shallow-n grammars, low-level rule catenation, and other search constraints can help to match the power of the translation system to specific language pairs.
Meeting of the Association for Computational Linguistics | 2009
Gonzalo Iglesias; Adrià de Gispert; Eduardo Rodríguez Banga; William Byrne
We describe refinements to hierarchical translation search procedures intended to reduce both search errors and memory usage through modifications to hypothesis expansion in cube pruning and reductions in the size of the rule sets used in translation. Rules are put into syntactic classes based on the number of non-terminals and the pattern, and various filtering strategies are then applied to assess the impact on translation speed and quality. Results are reported on the 2008 NIST Arabic-to-English evaluation task.
Workshop on Statistical Machine Translation | 2006
Maja Popović; Adrià de Gispert; Deepa Gupta; Patrik Lambert; Hermann Ney; José B. Mariño; Marcello Federico; Rafael E. Banchs
Evaluation of machine translation output is an important but difficult task. Over the last years, a variety of automatic evaluation measures have been studied; some of them, such as Word Error Rate (WER), Position-independent Word Error Rate (PER), and the BLEU and NIST scores, have become widely used tools for comparing different systems as well as for evaluating improvements within one system. However, these measures do not give any details about the nature of translation errors. Therefore, some analysis of the generated output is needed in order to identify the main problems and to focus the research efforts. On the other hand, human evaluation is a time-consuming and expensive task. In this paper, we investigate methods for using morpho-syntactic information for automatic evaluation: the standard error measures WER and PER are calculated on distinct word classes and forms in order to get a better idea about the nature of translation errors and possibilities for improvements.
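As a rough illustration of the two standard error measures named in this abstract, the sketch below computes WER as word-level edit distance divided by reference length, and PER from bag-of-words overlap so that word order is ignored. The PER formula used here is one common formulation and is an assumption; the paper may normalise differently, and its restriction to particular word classes is not reproduced. The function names `wer` and `per` are illustrative.

```python
from collections import Counter


def wer(reference, hypothesis):
    """Word Error Rate: Levenshtein distance over words / reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    # prev[j] holds the edit distance between the reference prefix seen so
    # far and the first j hypothesis words.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, 1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(reference)


def per(reference, hypothesis):
    """Position-independent Error Rate based on multiset word overlap."""
    if not reference:
        raise ValueError("reference must be non-empty")
    matches = sum((Counter(reference) & Counter(hypothesis)).values())
    return (max(len(reference), len(hypothesis)) - matches) / len(reference)


# A reordering error is penalised by WER but not by PER.
ref = ["a", "b", "c", "d"]
hyp = ["a", "c", "b", "d"]
scores = (wer(ref, hyp), per(ref, hyp))
```

Comparing the two scores on the same sentence gives a first hint about error type: a large gap between WER and PER points to word-order problems rather than wrong lexical choices.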
North American Chapter of the Association for Computational Linguistics | 2009
Adrià de Gispert; Sami Virpioja; Mikko Kurimo; William Byrne
We describe a simple strategy to achieve translation performance improvements by combining output from identical statistical machine translation systems trained on alternative morphological decompositions of the source language. Combination is done by means of Minimum Bayes Risk decoding over a shared N-best list. When translating into English from two highly inflected languages such as Arabic and Finnish we obtain significant improvements over simply selecting the best morphological decomposition.
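A minimal sketch of Minimum Bayes Risk selection over an N-best list may help make the combination step concrete. It assumes that the shared list is simply the merged hypotheses of all systems, that posteriors are obtained by normalising exponentiated log scores, and that a unigram F1 overlap stands in for the sentence-level BLEU gain typically used in practice. The names `overlap_gain` and `mbr_decode` are hypothetical, and none of these choices is confirmed by the abstract.

```python
import math
from collections import Counter


def overlap_gain(candidate, evidence):
    """Unigram F1 between two token lists (a stand-in for sentence BLEU)."""
    matches = sum((Counter(candidate) & Counter(evidence)).values())
    if matches == 0:
        return 0.0
    precision = matches / len(candidate)
    recall = matches / len(evidence)
    return 2 * precision * recall / (precision + recall)


def mbr_decode(nbest):
    """Pick the hypothesis with the highest expected gain.

    `nbest` is a list of (tokens, log_score) pairs, for example the merged
    N-best lists of several systems.
    """
    if not nbest:
        raise ValueError("nbest must be non-empty")
    # Turn log scores into a posterior distribution over the list.
    max_score = max(score for _, score in nbest)
    weights = [math.exp(score - max_score) for _, score in nbest]
    total = sum(weights)
    posteriors = [w / total for w in weights]

    best, best_gain = None, -1.0
    for candidate, _ in nbest:
        expected = sum(p * overlap_gain(candidate, evidence)
                       for (evidence, _), p in zip(nbest, posteriors))
        if expected > best_gain:
            best, best_gain = candidate, expected
    return best


# Toy list: the top-scoring hypothesis is an outlier, so MBR prefers a
# hypothesis that agrees with more of the probability mass.
toy_nbest = [
    (["a", "b"], math.log(0.40)),
    (["c", "d", "e"], math.log(0.35)),
    (["c", "d", "f"], math.log(0.25)),
]
choice = mbr_decode(toy_nbest)
```

The design point is that MBR rewards consensus: a hypothesis supported by similar outputs from several systems can win even when no single system ranks it first.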
North American Chapter of the Association for Computational Linguistics | 2009
Gonzalo Iglesias; Adrià de Gispert; Eduardo Rodríguez Banga; William Byrne
This paper describes a lattice-based decoder for hierarchical phrase-based translation. The decoder is implemented with standard WFST operations as an alternative to the well-known cube pruning procedure. We find that the use of WFSTs rather than k-best lists requires less pruning in translation search, resulting in fewer search errors, direct generation of translation lattices in the target language, better parameter optimization, and improved translation performance when rescoring with long-span language models and MBR decoding. We report translation experiments for the Arabic-to-English and Chinese-to-English NIST translation tasks and contrast the WFST-based hierarchical decoder with hierarchical translation under cube pruning.
North American Chapter of the Association for Computational Linguistics | 2009
Jamie Brunning; Adrià de Gispert; William Byrne
We introduce alignment models for Machine Translation that take into account the context of a source word when determining its translation. Since the use of these contexts alone causes data sparsity problems, we develop a decision tree algorithm for clustering the contexts based on optimisation of the EM auxiliary function. We show that our context-dependent models lead to an improvement in alignment quality, and an increase in translation quality when the alignments are used in Arabic-English and Chinese-English translation.
North American Chapter of the Association for Computational Linguistics | 2015
Adrià de Gispert; Gonzalo Iglesias; Bill Byrne
We propose the use of neural networks to model source-side preordering for faster and better statistical machine translation. The neural network trains a logistic regression model to predict whether two sibling nodes of the source-side parse tree should be swapped in order to obtain a more monotonic parallel corpus, based on samples extracted from the word-aligned parallel corpus. For multiple language pairs and domains, we show that this yields the best reordering performance against other state-of-the-art techniques, resulting in improved translation quality and very fast decoding.
International Conference on Natural Language Processing | 2006
Adrià de Gispert; Deepa Gupta; Maja Popović; Patrik Lambert; José B. Mariño; Marcello Federico; Hermann Ney; Rafael E. Banchs
This paper presents a wide range of statistical word alignment experiments incorporating morphosyntactic information. By means of parallel corpus transformations based on POS tagging, lemmatization or stemming, we explore which linguistic information helps improve alignment error rates. For this, evaluation against a human word alignment reference is performed, aiming at an improved machine translation training scheme which eventually leads to improved SMT performance. Experiments are carried out on a Spanish–English European Parliament Proceedings parallel corpus, both in a large and a small data track. As expected, improvements due to introducing morphosyntactic information are bigger in the case of data scarcity, but significant improvement is also achieved in a large data task, meaning that certain linguistic knowledge is relevant even in situations of large data availability.