Josep Maria Crego
Polytechnic University of Catalonia
Publication
Featured research published by Josep Maria Crego.
Computational Linguistics | 2006
José B. Mariño; Rafael E. Banchs; Josep Maria Crego; Adrià de Gispert; Patrik Lambert; José A. R. Fonollosa; Marta Ruiz Costa-Jussà
This article describes in detail an n-gram approach to statistical machine translation. This approach consists of a log-linear combination of a translation model based on n-grams of bilingual units, which are referred to as tuples, along with four specific feature functions. Translation performance, which happens to be in the state of the art, is demonstrated with Spanish-to-English and English-to-Spanish translations of the European Parliament Plenary Sessions (EPPS).
Machine Translation | 2006
Josep Maria Crego; José B. Mariño
In this paper we describe an elegant and efficient approach to coupling reordering and decoding in statistical machine translation, where the n-gram translation model is also employed as distortion model. The reordering search problem is tackled through a set of linguistically motivated rewrite rules, which are used to extend a monotonic search graph with reordering hypotheses. The extended graph is traversed in the global search when a fully informed decision can be taken. Further experiments show that the n-gram translation model can be successfully used as reordering model when estimated with reordered source words. Experiments are reported on the Europarl task (Spanish–English and English–Spanish). Results are presented regarding translation accuracy and computational efficiency, showing significant improvements in translation quality with respect to monotonic search for both translation directions at a very low computational cost.
The Prague Bulletin of Mathematical Linguistics | 2011
Josep Maria Crego; François Yvon; José B. Mariño
Ncode: an Open Source Bilingual N-gram SMT Toolkit
This paper describes Ncode, an open source statistical machine translation (SMT) toolkit for translation models estimated as n-gram language models of bilingual units (tuples). This toolkit includes tools for extracting tuples, estimating models and performing translation. It can be easily coupled to several other open source toolkits to yield a complete SMT pipeline. In this article, we review the main features of the toolkit and explain how to build a translation engine with Ncode. We also report a short comparison with the widely known Moses system. Results show that Ncode outperforms Moses in terms of memory requirements and translation speed. Ncode also achieves slightly higher accuracy results.
workshop on statistical machine translation | 2008
Josep Maria Crego; Nizar Habash
We describe two methods to improve SMT accuracy using shallow syntax information. First, we use chunks to refine the set of word alignments typically used as a starting point in SMT systems. Second, we extend an N-gram-based SMT system with chunk tags to better account for long-distance reorderings. Experiments are reported on an Arabic-English task showing significant improvements. A human error analysis indicates that long-distance reorderings are captured effectively.
spoken language technology workshop | 2006
Josep Maria Crego; José B. Mariño
This paper addresses the problem of reordering in statistical machine translation (SMT). We describe an elegant and efficient approach to couple reordering (word order monotonization) and decoding, which does not need any additional model. We use linguistically motivated reordering rules to extend a monotonic search graph (with reordering hypotheses). The extended graph is traversed in decoding when a fully-informed decision can be taken (no preprocessing decision about reordering is taken). We also show how the N-gram translation model can be successfully used as reordering model when estimated with reordered source words (to harmonize the source and target word order). Experiments are reported on the Europarl task (Spanish-to-English and English-to-Spanish). Results are presented regarding translation accuracy and computational efficiency, showing significant improvements in translation quality for both translation directions at a very low computational cost.
north american chapter of the association for computational linguistics | 2007
Patrik Lambert; Rafael E. Banchs; Josep Maria Crego
In present Statistical Machine Translation (SMT) systems, alignment is trained in a stage preceding the translation model. Consequently, alignment model parameters are not tuned as a function of the translation task, but only indirectly. In this paper, we propose a novel framework for discriminative training of alignment models with automated translation metrics as maximization criterion. In this approach, alignments are optimized for the translation task. In addition, no link labels at the word level are needed. This framework is evaluated in terms of automatic translation evaluation metrics, and an improvement of translation quality is observed.
Machine Translation | 2010
Josep Maria Crego; François Yvon
In this work, we present an extension of n-gram-based translation models based on factored language models (FLMs). Translation units employed in the n-gram-based approach to statistical machine translation (SMT) are based on mappings of sequences of raw words, while translation model probabilities are estimated through standard language modeling of such bilingual units. Therefore, similar to other translation model approaches (phrase-based or hierarchical), the sparseness problem of the units being modeled leads to unreliable probability estimates, even under conditions where large bilingual corpora are available. In order to tackle this problem, we extend the n-gram-based approach to SMT by tightly integrating more general word representations, such as lemmas and morphological classes, and we use the flexible framework of FLMs to apply a number of different back-off techniques. In this work, we show that FLMs can also be successfully applied to translation modeling, yielding more robust probability estimates that integrate larger bilingual contexts during the translation process.
meeting of the association for computational linguistics | 2007
Josep Maria Crego; José B. Mariño
In this paper we present several extensions of MARIE, a freely available N-gram-based statistical machine translation (SMT) decoder. The extensions mainly consist of the ability to accept and generate word graphs and the introduction of two new N-gram models in the log-linear combination of feature functions the decoder implements. Additionally, the decoder is enhanced with a caching strategy that reduces the number of N-gram calls, improving the overall search efficiency. Experiments are carried out over the European Parliament Spanish-English translation task.
workshop on statistical machine translation | 2007
Marta R. Costa-jussà; Josep Maria Crego; Patrik Lambert; Maxim Khalilov; José A. R. Fonollosa; José B. Mariño; Rafael E. Banchs
This paper describes the 2007 Ngram-based statistical machine translation system developed at the TALP Research Center of the UPC (Universitat Politècnica de Catalunya) in Barcelona. Emphasis is put on improvements and extensions of the previous year's system, which are highlighted and empirically compared. Mainly, these include a novel word ordering strategy based on: (1) statistically monotonizing the training source corpus and (2) a novel reordering approach based on weighted reordering graphs. In addition, this system introduces a target language model based on statistical classes, a feature for out-of-domain units and an improved optimization procedure. The paper provides details of this system's participation in the ACL 2007 Second Workshop on Statistical Machine Translation. Results on three pairs of languages are reported, namely from Spanish, French and German into English (and the other way round) for both the in-domain and out-of-domain tasks.
north american chapter of the association for computational linguistics | 2007
Marta Ruiz Costa-Jussà; Josep Maria Crego; David Vilar; José A. R. Fonollosa; José B. Mariño; Hermann Ney
In the framework of the Tc-Star project, we analyze and propose a combination of two Statistical Machine Translation systems: a phrase-based and an N-gram-based one. The exhaustive analysis includes a comparison of the translation models in terms of efficiency (number of translation units used in the search and computational time) and an examination of the errors in each system's output. Additionally, we combine both systems, showing accuracy improvements.