Ondřej Dušek | Researchain

Publication

Featured research published by Ondřej Dušek.

Language Resources and Evaluation | 2014

HamleDT: Harmonized multi-language dependency treebank

Daniel Zeman; Ondřej Dušek; David Mareček; Martin Popel; Loganathan Ramasamy; Jan Štěpánek; Zdeněk Žabokrtský; Jan Hajič

We present HamleDT, a HArmonized Multi-LanguagE Dependency Treebank. HamleDT is a compilation of existing dependency treebanks (or dependency conversions of other treebanks), transformed so that they all conform to the same annotation style. In the present article, we provide a thorough investigation and discussion of a number of phenomena that are comparable across languages, though their annotation in treebanks often differs. We claim that transformation procedures can be designed to automatically identify most such phenomena and convert them to a unified annotation style. This unification is beneficial both to comparative corpus linguistics and to machine learning of syntactic parsing.

Artificial Intelligence in Medicine | 2014

Adaptation of machine translation for multilingual information retrieval in the medical domain

Pavel Pecina; Ondřej Dušek; Lorraine Goeuriot; Jan Hajič; Jaroslava Hlaváčová; Gareth J. F. Jones; Liadh Kelly; Johannes Leveling; David Mareček; Michal Novák; Martin Popel; Rudolf Rosa; Aleš Tamchyna; Zdeňka Urešová

OBJECTIVE: We investigate machine translation (MT) of user search queries in the context of cross-lingual information retrieval (IR) in the medical domain. The main focus is on techniques to adapt MT to increase translation quality; however, we also explore MT adaptation to improve effectiveness of cross-lingual IR.
METHODS AND DATA: Our MT system is Moses, a state-of-the-art phrase-based statistical machine translation system. The IR system is based on the BM25 retrieval model implemented in the Lucene search engine. The MT techniques employed in this work include in-domain training and tuning, intelligent training data selection, optimization of phrase table configuration, compound splitting, and exploiting synonyms as translation variants. The IR methods include morphological normalization and using multiple translation variants for query expansion. The experiments are performed and thoroughly evaluated on three language pairs: Czech-English, German-English, and French-English. MT quality is evaluated on data sets created within the Khresmoi project and IR effectiveness is tested on the CLEF eHealth 2013 data sets.
RESULTS: The search query translation results achieved in our experiments are outstanding - our systems outperform not only our strong baselines, but also Google Translate and Microsoft Bing Translator in direct comparison carried out on all the language pairs. The baseline BLEU scores increased from 26.59 to 41.45 for Czech-English, from 23.03 to 40.82 for German-English, and from 32.67 to 40.82 for French-English. This is a 55% improvement on average. In terms of the IR performance on this particular test collection, a significant improvement over the baseline is achieved only for French-English. For Czech-English and German-English, the increased MT quality does not lead to better IR results.
CONCLUSIONS: Most of the MT techniques employed in our experiments improve MT of medical search queries. Especially the intelligent training data selection proves to be very successful for domain adaptation of MT. Certain improvements are also obtained from German compound splitting on the source language side. Translation quality, however, does not appear to correlate with the IR performance - better translation does not necessarily yield better retrieval. We discuss in detail the contribution of the individual techniques and state-of-the-art features and provide future research directions.

Meeting of the Association for Computational Linguistics | 2016

Sequence-to-Sequence Generation for Spoken Dialogue via Deep Syntax Trees and Strings

Ondřej Dušek; Filip Jurčíček

We present a natural language generator based on the sequence-to-sequence approach that can be trained to produce natural language strings as well as deep syntax dependency trees from input dialogue acts, and we use it to directly compare two-step generation with separate sentence planning and surface realization stages to a joint, one-step approach. We were able to train both setups successfully using very little training data. The joint setup offers better performance, surpassing the state of the art with regard to n-gram-based scores while providing more relevant outputs.

International Joint Conference on Natural Language Processing | 2015

Training a Natural Language Generator From Unaligned Data

Ondřej Dušek; Filip Jurčíček

We present a novel syntax-based natural language generation system that is trainable from unaligned pairs of input meaning representations and output sentences. It is divided into sentence planning, which incrementally builds deep-syntactic dependency trees, and surface realization. The sentence planner is based on A* search with a perceptron ranker that uses novel differing subtree updates and a simple future promise estimation; surface realization uses a rule-based pipeline from the Treex NLP toolkit. Our first results show that training from unaligned data is feasible and that the outputs of our generator are mostly fluent and relevant.

Workshop on Statistical Machine Translation | 2014

Machine Translation of Medical Texts in the Khresmoi Project

Ondřej Dušek; Jan Hajič; Jaroslava Hlaváčová; Michal Novák; Pavel Pecina; Rudolf Rosa; Aleš Tamchyna; Zdeňka Urešová; Daniel Zeman

This paper presents the participation of the Charles University team in the WMT 2014 Medical Translation Task. Our systems are developed within the Khresmoi project, a large integrated project aiming to deliver a multi-lingual multi-modal search and access system for biomedical information and documents. Being involved in the organization of the Medical Translation Task, our primary goal is to set up a baseline for both its subtasks (summary translation and query translation) and for all translation directions. Our systems are based on the phrase-based Moses system and standard methods for domain adaptation. The constrained/unconstrained systems differ in the training data only.

Workshop on Statistical Machine Translation | 2015

New Language Pairs in TectoMT

Ondřej Dušek; Luís Gomes; Michal Novák; Martin Popel; Rudolf Rosa

The TectoMT tree-to-tree machine translation system has been updated this year to support easier retraining for more translation directions. We use multilingual standards for morphology and syntax annotation and language-independent base rules. We include a simple, non-parametric way of combining TectoMT’s transfer model outputs. We submitted translations by the English-to-Czech and Czech-to-English TectoMT pipelines to the WMT shared task. While the former offers a stable performance, the latter is completely new and will require more tuning and debugging.

Linguistic Annotation Workshop | 2015

Bilingual English-Czech Valency Lexicon Linked to a Parallel Corpus

Zdeňka Urešová; Ondřej Dušek; Eva Fučíková; Jan Hajič; Jana Šindlerová

This paper presents a resource and the associated annotation process used in a project of interlinking Czech and English verbal translational equivalents based on a parallel, richly annotated dependency treebank that also contains valency and semantic roles, namely the Prague Czech-English Dependency Treebank. One of the main aims of this project is to create a high-quality and relatively large empirical base which could be used both for linguistic comparative research and for natural language processing applications, such as machine translation or cross-language sense disambiguation. This paper describes the resulting lexicon, CzEngVallex, and the process of building it, as well as some interesting observations and statistics already obtained.

Workshop on Events: Definition, Detection, Coreference, and Representation | 2014

Verbal Valency Frame Detection and Selection in Czech and English

Ondřej Dušek; Jan Hajič; Zdeňka Urešová

We present a supervised learning method for verbal valency frame detection and selection, i.e., a specific kind of word sense disambiguation for verbs based on subcategorization information, which amounts to detecting mentions of events in text. We use the rich dependency annotation present in the Prague Dependency Treebanks for Czech and English, taking advantage of several analysis tools (taggers, parsers) developed on these datasets previously. The frame selection is based on manually created lexicons accompanying these treebanks, namely on PDT-Vallex for Czech and EngVallex for English. The results show that verbal predicate detection is easier for Czech, but in the subsequent frame selection task, better results have been achieved for English.

Text, Speech, and Dialogue | 2014

Alex: A Statistical Dialogue Systems Framework

Filip Jurčíček; Ondřej Dušek; Ondřej Plátek; Lukáš Žilka

This paper describes the Alex Dialogue Systems Framework (ADSF). The ADSF currently includes mature components for public telephone network connectivity, voice activity detection, automatic speech recognition, statistical spoken language understanding, and probabilistic belief tracking. The ADSF is used in a real-world deployment within the Public Transport Information (PTI) domain. In PTI, users can interact with a dialogue system on the phone to find intra- and inter-city public transport connections and ask for the weather forecast in a desired city. Based on user responses, the vast majority of system users are satisfied with the system performance.

Text, Speech, and Dialogue | 2016

CzEng 1.6: Enlarged Czech-English Parallel Corpus with Processing Tools Dockered

Ondřej Bojar; Ondřej Dušek; Tom Kocmi; Jindřich Libovický; Michal Novák; Martin Popel; Roman Sudarikov; Dušan Variš

We present a new release of the Czech-English parallel corpus CzEng. CzEng 1.6 consists of about 0.5 billion words (“gigaword”) in each language. The corpus is equipped with automatic annotation at a deep syntactic level of representation and alternatively in Universal Dependencies. Additionally, we release the complete annotation pipeline as a virtual machine in the Docker virtualization toolkit.
