
Publication


Featured research published by Ondřej Bojar.


Meeting of the Association for Computational Linguistics | 2007

Moses: Open Source Toolkit for Statistical Machine Translation

Philipp Koehn; Hieu Hoang; Alexandra Birch; Chris Callison-Burch; Marcello Federico; Nicola Bertoldi; Brooke Cowan; Wade Shen; Christine Moran; Richard Zens; Chris Dyer; Ondřej Bojar; Alexandra Constantin; Evan Herbst

We describe an open-source toolkit for statistical machine translation whose novel contributions are (a) support for linguistically motivated factors, (b) confusion network decoding, and (c) efficient data formats for translation models and language models. In addition to the SMT decoder, the toolkit also includes a wide variety of tools for training, tuning and applying the system to many translation tasks.


Workshop on Statistical Machine Translation | 2015

Results of the WMT15 Metrics Shared Task

Miloš Stanojević; Amir Kamran; Philipp Koehn; Ondřej Bojar

This paper presents the results of the WMT15 Metrics Shared Task. We asked participants of this task to score the outputs of the MT systems involved in the WMT15 Shared Translation Task. We collected scores of 46 metrics from 11 research groups. In addition, we computed scores of 7 standard metrics (BLEU, SentBLEU, NIST, WER, PER, TER and CDER) as baselines. The collected scores were evaluated in terms of system-level correlation (how well each metric’s scores correlate with the official WMT15 manual ranking of systems) and in terms of segment-level correlation (how often a metric agrees with humans in comparing two translations of a particular sentence).


Workshop on Statistical Machine Translation | 2007

English-to-Czech Factored Machine Translation

Ondřej Bojar

This paper describes experiments with English-to-Czech phrase-based machine translation. Additional annotation of input and output tokens (multiple factors) is used to explicitly model morphology. We vary the translation scenario (the setup of multiple factors) and the amount of information in the morphological tags. Experimental results demonstrate significant improvement of translation quality in terms of BLEU.


Workshop on Statistical Machine Translation | 2008

Phrase-Based and Deep Syntactic English-to-Czech Statistical Machine Translation

Ondřej Bojar; Jan Hajič

This paper describes our two contributions to the WMT08 shared task: a factored phrase-based model using Moses and a probabilistic tree-transfer model at a deep syntactic layer.


The Prague Bulletin of Mathematical Linguistics | 2011

Addicter: What is wrong with my translations?

Daniel Zeman; Mark Fishel; Jan Berka; Ondřej Bojar

We introduce Addicter, a tool for Automatic Detection and DIsplay of Common Translation ERrors. The tool makes it possible to automatically identify and label translation errors and to browse the test and training corpus and word alignments; the use of additional linguistic tools is also supported. The error classification is inspired by that of Vilar et al. (2006), although some of their higher-level categories are beyond the reach of the current version of our system. In addition to the tool itself, we present a comparison of the proposed method to manually classified translation errors and a thorough evaluation of the generated alignments.


The Prague Bulletin of Mathematical Linguistics | 2009

Evaluation of Machine Translation Metrics for Czech as the Target Language

Kamil Kos; Ondřej Bojar

In the present work we study semi-automatic evaluation techniques for machine translation (MT) systems. These techniques are based on a comparison of the MT systems' output to human translations of the same text. Various metrics have been proposed in recent years, ranging from metrics using only a unigram comparison to metrics that try to take advantage of additional syntactic or semantic information. The main goal of this article is to compare these metrics with respect to their correlation with human judgments for Czech as the target language and to propose the best ones for evaluating MT systems translating into Czech.


Workshop on Statistical Machine Translation | 2009

English-Czech MT in 2008

Ondřej Bojar; David Mareček; Václav Novák; Martin Popel; Jan Ptáček; Jan Rouš; Zdeněk Žabokrtský

We describe two systems for English-to-Czech machine translation that took part in the WMT09 translation task. One of the systems is a tuned phrase-based system and the other one is based on a linguistically motivated analysis-transfer-synthesis approach.


Text, Speech and Dialogue | 2011

Automatic translation error analysis

Mark Fishel; Ondřej Bojar; Daniel Zeman; Jan Berka

We propose a method of automatic identification of various error types in machine translation output. The approach is mostly based on monolingual word alignment of the hypothesis and the reference translation. In addition to common lexical errors, misplaced words are also detected. A comparison to manually classified MT errors is presented. Our error classification is inspired by that of Vilar et al. (2006), although distinguishing some of their categories is beyond the reach of the current version of our system.


The Prague Bulletin of Mathematical Linguistics | 2009

CzEng 0.9: Large Parallel Treebank with Rich Annotation

Ondřej Bojar; Zdeněk Žabokrtský

We describe our ongoing efforts in collecting a Czech-English parallel corpus CzEng. The paper provides full details on the current version 0.9 and focuses on its new features: (1) data from new sources were added, most importantly a few hundred electronically available books, technical documentation and also some parallel web pages, (2) the full corpus has been automatically annotated up to the tectogrammatical layer (surface and deep syntactic analysis), (3) sentence segmentation has been refined, and (4) several heuristic filters to improve corpus quality were implemented. In total, we provide a sentence-aligned automatic parallel treebank of about 8.0 million sentences, 93 million English and 82 million Czech words. CzEng 0.9 is freely available for non-commercial research purposes.


Text, Speech and Dialogue | 2013

Scratching the Surface of Possible Translations

Ondřej Bojar; Matouš Macháček; Aleš Tamchyna; Daniel Zeman

One of the key obstacles in automatic evaluation of machine translation systems is the reliance on a few (typically just one) human-made reference translations to which the system output is compared. We propose a method of capturing millions of possible translations and implement a tool for translators to specify them using a compact representation. We evaluate this new type of reference set by edit distance and correlation to human judgements of translation quality.

Collaboration


Dive into Ondřej Bojar's collaborations.

Top Co-Authors

Aleš Tamchyna

Charles University in Prague

Daniel Zeman

Charles University in Prague

Martin Popel

Charles University in Prague

Tom Kocmi

Charles University in Prague

Zdeněk Žabokrtský

Charles University in Prague

Lucia Specia

University of Sheffield

Jan Berka

Charles University in Prague
