Patrik Lambert
Polytechnic University of Catalonia
Publications
Featured research published by Patrik Lambert.
Computational Linguistics | 2006
José B. Mariño; Rafael E. Banchs; Josep Maria Crego; Adrià de Gispert; Patrik Lambert; José A. R. Fonollosa; Marta Ruiz Costa-Jussà
This article describes in detail an n-gram approach to statistical machine translation. This approach consists of a log-linear combination of a translation model based on n-grams of bilingual units, referred to as tuples, together with four specific feature functions. State-of-the-art translation performance is demonstrated with Spanish-to-English and English-to-Spanish translations of the European Parliament Plenary Sessions (EPPS).
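As a rough illustration of the log-linear combination described above, the sketch below scores one translation hypothesis as a weighted sum of feature values. The feature names, values and weights are hypothetical stand-ins, not the authors' actual model components.

```python
def log_linear_score(feature_scores, weights):
    """Combine log-domain feature scores h_m with weights lambda_m:
    score = sum_m lambda_m * h_m."""
    return sum(w * h for w, h in zip(weights, feature_scores))

# Hypothetical feature values for one hypothesis: tuple n-gram translation
# model, target language model, word bonus, lexical feature (illustrative).
features = [-12.4, -35.1, 8.0, -6.3]
weights = [1.0, 0.5, 0.2, 0.3]   # lambdas, normally tuned on a development set
print(log_linear_score(features, weights))
```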
Language Resources and Evaluation | 2005
Patrik Lambert; Adrià de Gispert; Rafael E. Banchs; José B. Mariño
The purpose of this paper is to provide guidelines for building a word alignment evaluation scheme. The notion of word alignment quality depends on the application: here we review standard scoring metrics for full-text alignment and explain how best to use them. We discuss strategies to build a reference corpus, and show that the ratio between ambiguous and unambiguous links in the reference has a great impact on scores measured with these metrics. In particular, automatically computed alignments with higher precision or higher recall can be favoured depending on the value of this ratio. Finally, we suggest a strategy to build a reference corpus particularly adapted to applications where recall plays a significant role, as in machine translation. The manually aligned corpus we built for the Spanish–English European Parliament corpus is also described. This corpus is freely available.
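For concreteness, here is a small sketch of the standard alignment scoring metrics discussed in such evaluations (precision, recall and alignment error rate over sure links S and possible links P, in the usual formulation with S a subset of P). The link sets are purely illustrative.

```python
def alignment_scores(A, S, P):
    """A: computed alignment links, S: sure reference links,
    P: possible reference links (S is a subset of P).
    Returns (precision, recall, AER)."""
    A, S, P = set(A), set(S), set(P)
    precision = len(A & P) / len(A)
    recall = len(A & S) / len(S)
    aer = 1 - (len(A & S) + len(A & P)) / (len(A) + len(S))
    return precision, recall, aer

# Illustrative links as (source_position, target_position) pairs.
A = {(0, 0), (1, 1), (2, 2)}
S = {(0, 0), (2, 1)}
P = S | {(1, 1), (1, 2)}
print(alignment_scores(A, S, P))
```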
Workshop on Statistical Machine Translation | 2006
Maja Popović; Adrià de Gispert; Deepa Gupta; Patrik Lambert; Hermann Ney; José B. Mariño; Marcello Federico; Rafael E. Banchs
Evaluation of machine translation output is an important but difficult task. Over recent years, a variety of automatic evaluation measures have been studied; some of them, such as Word Error Rate (WER), Position-Independent Word Error Rate (PER) and the BLEU and NIST scores, have become widely used tools for comparing different systems as well as for evaluating improvements within one system. However, these measures do not give any details about the nature of translation errors. Therefore, some analysis of the generated output is needed in order to identify the main problems and to focus the research efforts. On the other hand, human evaluation is a time-consuming and expensive task. In this paper, we investigate methods for using morpho-syntactic information for automatic evaluation: the standard error measures WER and PER are calculated on distinct word classes and forms in order to get a better idea of the nature of translation errors and of the possibilities for improvement.
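To make the two base measures concrete, the sketch below computes sentence-level WER (word-level edit distance) and one common formulation of PER (position-independent, bag-of-words errors). The paper applies these measures over distinct word classes rather than raw sentences; this is just a minimal per-sentence illustration.

```python
from collections import Counter

def wer(ref, hyp):
    """Word error rate: Levenshtein distance over words / reference length."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / len(ref)

def per(ref, hyp):
    """Position-independent error rate (one common formulation):
    errors = max(|ref|, |hyp|) minus the bag-of-words overlap."""
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)

ref = "the cat sat on the mat".split()
hyp = "on the mat the cat sat".split()
print(wer(ref, hyp), per(ref, hyp))  # word order errors raise WER but not PER
```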
North American Chapter of the Association for Computational Linguistics | 2007
Patrik Lambert; Rafael E. Banchs; Josep Maria Crego
In present Statistical Machine Translation (SMT) systems, alignment is trained in an earlier stage than the translation model. Consequently, alignment model parameters are not tuned as a function of the translation task, but only indirectly. In this paper, we propose a novel framework for discriminative training of alignment models with automated translation metrics as the maximization criterion. In this approach, alignments are optimized for the translation task. In addition, no link labels at the word level are needed. This framework is evaluated in terms of automatic translation evaluation metrics, and an improvement of translation quality is observed.
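A highly simplified sketch of the kind of optimization loop such a framework implies: alignment model weights are perturbed, the downstream translation output is regenerated, and a configuration is kept only if an automatic translation metric improves. The retrain_and_translate() and bleu() callbacks are assumed placeholders, not the authors' actual pipeline or search procedure.

```python
import random

def tune_alignment_weights(init_weights, dev_source, dev_refs,
                           retrain_and_translate, bleu, iterations=20):
    """Hill-climbing over alignment model weights with a translation metric
    as the maximization criterion. retrain_and_translate(weights, src) and
    bleu(hyps, refs) are hypothetical callbacks supplied by the caller."""
    best_w = list(init_weights)
    best_score = bleu(retrain_and_translate(best_w, dev_source), dev_refs)
    for _ in range(iterations):
        cand = [w + random.uniform(-0.1, 0.1) for w in best_w]  # small perturbation
        score = bleu(retrain_and_translate(cand, dev_source), dev_refs)
        if score > best_score:  # keep only configurations that improve the metric
            best_w, best_score = cand, score
    return best_w, best_score
```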
International Conference on Natural Language Processing | 2006
Adrià de Gispert; Deepa Gupta; Maja Popović; Patrik Lambert; José B. Mariño; Marcello Federico; Hermann Ney; Rafael E. Banchs
This paper presents a wide range of statistical word alignment experiments incorporating morphosyntactic information. By means of parallel corpus transformations based on POS tagging, lemmatization or stemming, we explore which linguistic information helps improve alignment error rates. For this, evaluation against a human word alignment reference is performed, aiming at an improved machine translation training scheme which eventually leads to improved SMT performance. Experiments are carried out on a Spanish–English European Parliament Proceedings parallel corpus, in both a large and a small data track. As expected, improvements due to introducing morphosyntactic information are larger in the case of data scarcity, but a significant improvement is also achieved in the large data task, meaning that certain linguistic knowledge remains relevant even when large amounts of data are available.
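One way to picture the corpus transformations involved: each token is replaced by a reduced form (lemma, stem or class) before alignment, and because the replacement is one-to-one per token, links learned on the transformed text map directly back to the original word positions. The sketch below uses a crude suffix-stripping function as a stand-in for a real lemmatizer or stemmer.

```python
def transform_corpus(sentences, reduce_token):
    """Replace each token by a reduced form (lemma/stem/class) one-to-one,
    so alignment links learned on the transformed corpus carry over to the
    original word positions."""
    return [[reduce_token(tok) for tok in sent] for sent in sentences]

def reduce_token(tok):
    """Hypothetical reduction: naive suffix stripping, only for illustration."""
    for suffix in ("ing", "ed", "s"):
        if tok.endswith(suffix) and len(tok) > len(suffix) + 2:
            return tok[: -len(suffix)]
    return tok

print(transform_corpus([["houses", "were", "built"]], reduce_token))
```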
Workshop on Statistical Machine Translation | 2007
Marta R. Costa-jussà; Josep Maria Crego; Patrik Lambert; Maxim Khalilov; José A. R. Fonollosa; José B. Mariño; Rafael E. Banchs
This paper describes the 2007 N-gram-based statistical machine translation system developed at the TALP Research Center of the UPC (Universitat Politècnica de Catalunya) in Barcelona. Emphasis is put on improvements and extensions of the previous year's system, which are highlighted and empirically compared. These mainly include a novel word ordering strategy based on: (1) statistically monotonizing the training source corpus and (2) a reordering approach based on weighted reordering graphs. In addition, the system introduces a target language model based on statistical classes, a feature for out-of-domain units and an improved optimization procedure. The paper provides details of this system's participation in the ACL 2007 Second Workshop on Statistical Machine Translation. Results on three language pairs are reported, namely from Spanish, French and German into English (and the other way round), for both the in-domain and out-of-domain tasks.
Spoken Language Technology Workshop | 2006
Patrik Lambert; Jesús Giménez; Marta Ruiz Costa-Jussà; Enrique Amigó; Rafael E. Banchs; Lluís Màrquez; José A. R. Fonollosa
We present a novel approach to parameter adjustment in empirical machine translation systems. Instead of relying on a single evaluation metric, or on an ad-hoc linear combination of metrics, our method works over metric combinations with maximum descriptive power, aiming to maximise the Human Likeness of the automatic translations. We apply it to the problem of optimising the decoding-stage parameters of a state-of-the-art statistical machine translation system. By means of a rigorous manual evaluation, we show how our methodology provides more reliable and robust system configurations than a tuning strategy based on the BLEU metric alone.
International Joint Conference on Natural Language Processing | 2015
Patrik Lambert
Most cross-lingual sentiment classification (CLSC) research so far has been performed at the sentence or document level. Aspect-level CLSC, which is more appropriate for many applications, presents the additional difficulty that the opinionated units considered are subsentential and have to be mapped across languages. In this paper, we extend the possible cross-lingual sentiment analysis settings to aspect-level-specific use cases. We propose a method, based on constrained SMT, to transfer opinionated units across languages while preserving their boundaries. We show that cross-language sentiment classifiers built with this method achieve results comparable to monolingual ones, and we compare different cross-lingual settings.
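A rough sketch of the boundary-preservation idea: opinionated units are wrapped in markers before translation, and their translated spans are recovered from the markers in the output. The marker scheme, example spans and the implied translation step are illustrative assumptions, not the paper's exact constrained-SMT mechanism.

```python
import re

def mark_units(tokens, spans):
    """Wrap each opinionated unit (start, end) in boundary markers so the
    unit can be located again after translation."""
    out = []
    for i, tok in enumerate(tokens):
        for k, (s, e) in enumerate(spans):
            if i == s:
                out.append(f"<ou{k}>")
        out.append(tok)
        for k, (s, e) in enumerate(spans):
            if i == e:
                out.append(f"</ou{k}>")
    return out

def recover_units(translated_tokens):
    """Extract the translated opinionated units from the marked output."""
    text = " ".join(translated_tokens)
    return dict(re.findall(r"<ou(\d+)>\s*(.*?)\s*</ou\1>", text))

marked = mark_units("the battery life is great but the screen is dim".split(),
                    [(3, 4), (8, 9)])
# ... the marked text would be passed through the (constrained) MT system here ...
print(recover_units(marked))
```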
Workshop on Statistical Machine Translation | 2008
Maxim Khalilov; Adolfo Hernández H.; Marta Ruiz Costa-Jussà; Josep Maria Crego; Carlos A. Henríquez Q.; Patrik Lambert; José A. R. Fonollosa; José B. Mariño; Rafael E. Banchs
This paper reports on the participation of the TALP Research Center of the UPC (Universitat Politècnica de Catalunya) in the ACL WMT 2008 evaluation campaign. This year's system is the evolution of the one we employed for the 2007 campaign. The main updates and extensions involve linguistically motivated word reordering based on the reordering patterns technique. In addition, the system introduces a target language model based on linguistic classes (part of speech), morphology reduction for an inflectional language (Spanish) and an improved optimization procedure. Results obtained over the development and test sets on Spanish-to-English (and the other way round) translations, for both the traditional Europarl task and a challenging news stories task, are analyzed and discussed.
Natural Language Engineering | 2016
Maite Melero; Marta Ruiz Costa-Jussà; Patrik Lambert; Martí Quixal
We present research aiming to build tools for the normalization of User-Generated Content (UGC). We argue that processing this type of text requires revisiting the initial steps of Natural Language Processing, since UGC (micro-blog, blog, and, generally, Web 2.0 user-generated texts) presents a number of nonstandard communicative and linguistic characteristics – often closer to oral and colloquial language than to edited text. We present a corpus of UGC text in Spanish from three different sources: Twitter, consumer reviews, and blogs, and describe its main characteristics. We motivate the need for UGC text normalization by analyzing the problems found when processing this type of text through a conventional language processing pipeline, particularly in the tasks of lemmatization and morphosyntactic tagging. Our aim with this paper is to harness the power of already existing spell and grammar correction engines and endow them with automatic normalization capabilities in order to pave the way for the application of standard Natural Language Processing tools to typical UGC text. In particular, we propose a strategy for automatically normalizing UGC by adding a module on top of a pre-existing spell-checker that selects the most plausible correction from an unranked list of candidates provided by the spell-checker. To build this selector module we train four language models, each one containing a different type of linguistic information in a trade-off with its generalization capabilities. Our experiments show that the models trained on truecase and lowercase word forms are more discriminative than the others at selecting the best candidate. We have also experimented with a parametrized combination of the models, both by optimizing directly on the selection task and by linearly interpolating the models. The resulting parametrized combinations obtain results close to those of the best performing model but do not improve on them, as measured on the test set. The precision of the selector module at ranking the expected correction proposal first on the test corpora reaches 82.5% for Twitter text (baseline 57%) and 88% for non-Twitter text (baseline 64%).
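A minimal sketch of the selector idea: each candidate correction is placed in its sentence context and scored with a language model, and the highest-scoring candidate is chosen. An n-gram model with a score() method in the style of the KenLM Python binding is assumed, and the model file, context and candidate list below are purely illustrative; the actual system combines several models trained on different token forms.

```python
import kenlm  # assumed: KenLM-style n-gram language model binding

def select_correction(left_context, candidates, right_context, lm):
    """Pick the spell-checker candidate whose insertion into the sentence
    context receives the highest language-model score."""
    def score(cand):
        sentence = " ".join(left_context + [cand] + right_context)
        return lm.score(sentence, bos=True, eos=True)
    return max(candidates, key=score)

# Hypothetical usage with a pretrained model and an unranked candidate list
# returned by the spell-checker (paths and tokens are illustrative).
lm = kenlm.Model("eswiki.lowercase.arpa")
best = select_correction("no me".split(), ["gusta", "gasta", "guata"],
                         "nada este producto".split(), lm)
print(best)
```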