Nicola Bertoldi
fondazione bruno kessler
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Nicola Bertoldi.
international acm sigir conference on research and development in information retrieval | 2002
Marcello Federico; Nicola Bertoldi
This paper presents a novel statistical model for cross-language information retrieval. Given a written query in the source language, documents in the target language are ranked by integrating probabilities computed by two statistical models: a query-translation model, which generates most probable term-by-term translations of the query, and a query-document model, which evaluates the likelihood of each document and translation. Integration of the two scores is performed over the set of N most probable translations of the query. Experimental results with values N=1, 5, 10 are presented on the Italian-English bilingual track data used in the CLEF 2000 and 2001 evaluation campaigns.
international conference on acoustics, speech, and signal processing | 2007
Nicola Bertoldi; Richard Zens; Marcello Federico
This paper describes advances in the use of confusion networks as interface between automatic speech recognition and machine translation. In particular, it presents an implementation of a confusion network decoder which significantly improves both in efficiency and performance previous work along this direction. The confusion network decoder results as an extension of a state-of-the-art phrase-based text translation system. Experimental results in terms of decoding speed and translation accuracy are reported on a real-data task, namely the translation of plenary speeches at the European Parliament from Spanish to English.
IEEE Transactions on Audio, Speech, and Language Processing | 2008
Gregor Leusch; Rafael E. Banchs; Nicola Bertoldi; Daniel Déchelotte; Marcello Federico; Muntsin Kolss; Young-Suk Lee; José B. Mariño; Matthias Paulik; Salim Roukos; Holger Schwenk; Hermann Ney
This paper describes an approach for computing a consensus translation from the outputs of multiple machine translation (MT) systems. The consensus translation is computed by weighted majority voting on a confusion network, similarly to the well-established ROVER approach of Fiscus for combining speech recognition hypotheses. To create the confusion network, pairwise word alignments of the original MT hypotheses are learned using an enhanced statistical alignment algorithm that explicitly models word reordering. The context of a whole corpus of automatic translations rather than a single sentence is taken into account in order to achieve high alignment quality. The confusion network is rescored with a special language model, and the consensus translation is extracted as the best path. The proposed system combination approach was evaluated in the framework of the TC-STAR speech translation project. Up to six state-of-the-art statistical phrase-based translation systems from different project partners were combined in the experiments. Significant improvements in translation quality from Spanish to English and from English to Spanish in comparison with the best of the individual MT systems were achieved under official evaluation conditions.
ieee automatic speech recognition and understanding workshop | 2005
Nicola Bertoldi; Marcello Federico
A novel approach to spoken language translation is proposed, which more tightly integrates automatic speech recognition (ASR) and statistical machine translation (SMT). SMT is directly applied on an approximation of the word graph produced by the ASR system, namely a confusion network. The decoding algorithm extends a conventional phrase-based decoder in that it can process at once a large number of source sentence hypotheses contained in the confusion network. Experimental results are presented on a Spanish-English large vocabulary task, namely the translation of the European Parliament plenary sessions. With respect to a conventional SMT decoder processing N-best lists, a slight improvement in the BLEU score is reported as well as a significantly lower decoding time
Computer Speech & Language | 2004
Marcello Federico; Nicola Bertoldi
This paper investigates the problem of updating over time the statistical language model (LM) of an Italian broadcast news transcription system. Statistical adaptation methods are proposed which try to cope with the complex dynamics of news by exploiting newswire texts daily available on the Internet. In particular, contemporary news reports are used to extend the lexicon of the LM, to minimize the out-of-vocabulary (OOV) word rate, and to adapt the n-gram probabilities. Experiments performed on 19 news shows, spanning a period of one month, showed relative reductions of 58% in OOV word rate, 16% in perplexity, and 4% in word error rate (WER).
workshop on statistical machine translation | 2006
Marcello Federico; Nicola Bertoldi
State of the art in statistical machine translation is currently represented by phrase-based models, which typically incorporate a large number of probabilities of phrase-pairs and word n-grams. In this work, we investigate data compression methods for efficiently encoding n-gram and phrase-pair probabilities, that are usually encoded in 32-bit floating point numbers. We measured the impact of compression on translation quality through a phrase-based decoder trained on two distinct tasks: the translation of European Parliament speeches from Spanish to English, and the translation of news agencies from Chinese to English. We show that with a very simple quantization scheme all probabilities can be encoded in just 4 bits with a relative loss in BLEU score on the two tasks by 1.0% and 1.6%, respectively.
international conference on acoustics, speech, and signal processing | 2001
Nicola Bertoldi; E. Brugnara; Mauro Cettolo; Marcello Federico; Diego Giuliani
Reports on experiments of porting the ITC-irst Italian broadcast news recognition system to two spontaneous dialogue domains. The trade-off between performance and the required amount of task specific data was investigated. Porting was experimented by applying supervised adaptation methods to acoustic and language models. By using two hours of manually transcribed speech, word error rates of 26.0% and 28.4% were achieved by the adapted systems. Two reference systems, developed on a larger training corpus, achieved word error rates of 22.6% and 21.2%, respectively.
IEEE Transactions on Audio, Speech, and Language Processing | 2008
Nicola Bertoldi; Richard Zens; Marcello Federico; Wade Shen
This paper describes advances in the use of confusion networks as interface between automatic speech recognition and machine translation. In particular, it presents a decoding algorithm for confusion networks which results as an extension of a state-of-the-art phrase-based text translation decoder. The confusion network decoder significantly improves both in efficiency and performance over previous work along this direction, and outperforms the background text translation system. Experimental results in terms of translation accuracy and decoding efficiency are reported for the task of translating plenary speeches of the European Parliament from Spanish to english and from english to Spanish.
ACM Transactions on Speech and Language Processing | 2005
Marcello Federico; Nicola Bertoldi
This article addresses the development of statistical models for phrase-based machine translation (MT) which extend a popular word-alignment model proposed by IBM in the early 90s. A novel decoding algorithm is directly derived from the optimization criterion which defines the statistical MT approach. Efficiency in decoding is achieved by applying dynamic programming, pruning strategies, and word reordering constraints. It is known that translation performance can be boosted by exploiting phrase (or multiword) translation pairs automatically extracted from a parallel corpus. New phrase-based models are obtained by introducing extra multiwords in the target language vocabulary and by estimating the corresponding parameters from either: (i) a word-based model, (ii) phrase-based statistics computed on the parallel corpus, or (iii) the interpolation of the two previous estimates. Word-based and phrase-based MT models are evaluated on a traveling domain task in two translation directions: Chinese-English (12k-word vocabulary) and Italian-English (16k-word vocabulary). Phrase-based models show Bleu score improvements over the word-based model by 19% and 13% relative, respectively.
cross language evaluation forum | 2004
Marcello Federico; Nicola Bertoldi; Gina-Anne Levow; Gareth J. F. Jones
This paper summarizes the Cross-Language Spoken Document Retrieval (CL-SDR) track held at CLEF 2004. The CL-SDR task at CLEF 2004 was again based on the TREC-8 and TREC-9 SDR tasks. This year the CL-SDR task was extended to explore the unknown story boundaries condition introduced at TREC. The paper reports results from the participants showing that as expected cross-language results are reduced relative to a monolingual baseline, although the amount to which they are degraded varies for different topic languages.