Rudolf Rosa
Charles University in Prague
Publication
Featured research published by Rudolf Rosa.
Artificial Intelligence in Medicine | 2014
Pavel Pecina; Ondřej Dušek; Lorraine Goeuriot; Jan Hajič; Jaroslava Hlaváčová; Gareth J. F. Jones; Liadh Kelly; Johannes Leveling; David Mareček; Michal Novák; Martin Popel; Rudolf Rosa; Aleš Tamchyna; Zdeňka Urešová
OBJECTIVE: We investigate machine translation (MT) of user search queries in the context of cross-lingual information retrieval (IR) in the medical domain. The main focus is on techniques to adapt MT to increase translation quality; however, we also explore MT adaptation to improve effectiveness of cross-lingual IR. METHODS AND DATA: Our MT system is Moses, a state-of-the-art phrase-based statistical machine translation system. The IR system is based on the BM25 retrieval model implemented in the Lucene search engine. The MT techniques employed in this work include in-domain training and tuning, intelligent training data selection, optimization of phrase table configuration, compound splitting, and exploiting synonyms as translation variants. The IR methods include morphological normalization and using multiple translation variants for query expansion. The experiments are performed and thoroughly evaluated on three language pairs: Czech-English, German-English, and French-English. MT quality is evaluated on data sets created within the Khresmoi project and IR effectiveness is tested on the CLEF eHealth 2013 data sets. RESULTS: The search query translation results achieved in our experiments are outstanding: our systems outperform not only our strong baselines, but also Google Translate and Microsoft Bing Translator in direct comparison carried out on all the language pairs. The baseline BLEU scores increased from 26.59 to 41.45 for Czech-English, from 23.03 to 40.82 for German-English, and from 32.67 to 40.82 for French-English. This is a 55% improvement on average. In terms of the IR performance on this particular test collection, a significant improvement over the baseline is achieved only for French-English. For Czech-English and German-English, the increased MT quality does not lead to better IR results. CONCLUSIONS: Most of the MT techniques employed in our experiments improve MT of medical search queries. Especially the intelligent training data selection proves to be very successful for domain adaptation of MT. Certain improvements are also obtained from German compound splitting on the source language side. Translation quality, however, does not appear to correlate with the IR performance; better translation does not necessarily yield better retrieval. We discuss in detail the contribution of the individual techniques and state-of-the-art features and provide future research directions.
International Joint Conference on Natural Language Processing | 2015
Rudolf Rosa; Zdeněk Žabokrtský
We present KLcpos3, a language similarity measure based on Kullback-Leibler divergence of coarse part-of-speech tag trigram distributions in tagged corpora. It has been designed for multilingual delexicalized parsing, both for source treebank selection in single-source parser transfer, and for source treebank weighting in multi-source transfer. In the selection task, KLcpos3 identifies the best source treebank in 8 out of 18 cases. In the weighting task, it brings +4.5% UAS absolute, compared to unweighted parse tree combination.
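As a rough illustration of the idea in this abstract, the sketch below computes a Kullback-Leibler divergence between coarse POS tag trigram distributions of a target and a source corpus. The function names (`trigram_counts`, `kl_cpos3`), the toy tag sequences, and the add-one smoothing of the source distribution are assumptions made here for the example; the abstract does not specify how unseen trigrams are handled, so this should not be read as the authors' exact formulation.

```python
import math
from collections import Counter


def trigram_counts(sentences):
    """Count coarse POS tag trigrams over a list of tag sequences."""
    counts = Counter()
    for tags in sentences:
        counts.update(zip(tags, tags[1:], tags[2:]))
    return counts


def kl_cpos3(target, source):
    """KL divergence of target trigram distribution from the source one.

    Assumption: the source distribution is add-one smoothed over the union
    of observed trigram types so that the logarithm is always defined.
    """
    tgt = trigram_counts(target)
    src = trigram_counts(source)
    if not tgt:
        raise ValueError("target corpus contains no tag trigrams")
    vocab = set(tgt) | set(src)
    tgt_total = sum(tgt.values())
    src_total = sum(src.values()) + len(vocab)
    divergence = 0.0
    for trigram, count in tgt.items():
        p = count / tgt_total                # target relative frequency
        q = (src[trigram] + 1) / src_total   # smoothed source frequency
        divergence += p * math.log(p / q)
    return divergence


# Toy data: each sentence is a list of coarse POS tags (hypothetical).
target = [["DET", "NOUN", "VERB", "DET", "NOUN"]]
far_source = [["ADJ", "ADJ", "NOUN", "ADP"]]

# Lower values mean the source is closer to the target, so for source
# selection one would pick the treebank with the smallest divergence.
closest = min([target, far_source], key=lambda s: kl_cpos3(target, s))
```

In this toy setup the corpus identical to the target yields the smallest divergence, which is the behaviour a source treebank selector would rely on.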
Workshop on Statistical Machine Translation | 2014
Ondřej Dušek; Jan Hajič; Jaroslava Hlaváčová; Michal Novák; Pavel Pecina; Rudolf Rosa; Aleš Tamchyna; Zdeňka Urešová; Daniel Zeman
This paper presents the participation of the Charles University team in the WMT 2014 Medical Translation Task. Our systems are developed within the Khresmoi project, a large integrated project aiming to deliver a multi-lingual multi-modal search and access system for biomedical information and documents. Being involved in the organization of the Medical Translation Task, our primary goal is to set up a baseline for both its subtasks (summary translation and query translation) and for all translation directions. Our systems are based on the phrase-based Moses system and standard methods for domain adaptation. The constrained/unconstrained systems differ in the training data only.
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers | 2016
Rudolf Rosa; Roman Sudarikov; Michal Novák; Martin Popel; Ondřej Bojar
We describe our submission to the IT-domain translation task of WMT 2016. We perform domain adaptation with dictionary data on already trained MT systems with no further retraining. We apply our approach to two conceptually different systems developed within the QTLeap project: TectoMT and Moses, as well as Chimera, their combination. In all settings, our method improves the translation quality. Moreover, the basic variant of our approach is applicable to any MT system, including a black-box one.
International Workshop/Conference on Parsing Technologies | 2015
Rudolf Rosa; Zdeněk Žabokrtský
We introduce interpolation of trained MSTParser models as a resource combination method for multi-source delexicalized parser transfer. We present both an unweighted method and a variant in which each source model is weighted by the similarity of the source language to the target language. Evaluation on the HamleDT treebank collection shows that the weighted model interpolation performs comparably to the weighted parse tree combination method, while being computationally much less demanding.
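The interpolation idea can be pictured with a small sketch in which each trained model is reduced to a mapping from feature names to weights. Everything concrete here is an assumption for illustration: the dictionary representation, the normalization step, the feature names, and the `interpolate` function are hypothetical and do not reflect MSTParser's actual model format or the authors' exact procedure.

```python
def normalize(model):
    """Scale feature weights so their absolute values sum to 1.

    Assumption: models trained on different treebanks may have weights on
    different scales, so they are normalized before being combined.
    """
    total = sum(abs(weight) for weight in model.values())
    if total == 0:
        return dict(model)
    return {feature: weight / total for feature, weight in model.items()}


def interpolate(models, similarities=None):
    """Combine source models into one by a (weighted) sum of feature weights.

    With similarities=None every source model contributes equally
    (the unweighted variant); otherwise each model is scaled by the
    similarity of its source language to the target language.
    """
    if similarities is None:
        similarities = [1.0] * len(models)
    if len(similarities) != len(models):
        raise ValueError("one similarity value is needed per model")
    combined = {}
    for model, similarity in zip(models, similarities):
        for feature, weight in normalize(model).items():
            combined[feature] = combined.get(feature, 0.0) + similarity * weight
    return combined


# Hypothetical delexicalized feature weights from two source models.
model_a = {"NOUN->ADJ": 2.0, "VERB->NOUN": 2.0}
model_b = {"NOUN->ADJ": 1.0, "VERB->ADV": 3.0}

unweighted = interpolate([model_a, model_b])
weighted = interpolate([model_a, model_b], similarities=[2.0, 1.0])
```

A single combined model only has to parse the target text once, which is one way to see why this kind of combination can be cheaper than running every source parser and merging their output trees.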
The Prague Bulletin of Mathematical Linguistics | 2013
Aleš Tamchyna; Ondřej Dušek; Rudolf Rosa; Pavel Pecina
We present a web service which handles and distributes JSON-encoded HTTP requests for machine translation (MT) among multiple machines running an MT system, including text pre- and post-processing. It is currently used to provide MT between several languages for cross-lingual information retrieval in the EU FP7 Khresmoi project. The software consists of an application server and remote workers which handle text processing and communicate translation requests to MT systems. The communication between the application server and the workers is based on the XML-RPC protocol. We present the overall design of the software and test results which document speed and scalability of our solution. Our software is licensed under the Apache 2.0 licence and is available for download from the Lindat-Clarin repository and GitHub.
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial) | 2017
Rudolf Rosa; Daniel Zeman; David Mareček
We once had a corp, or should we say, it once had us They showed us its tags, isn’t it great, unified tags They asked us to parse and they told us to use everything So we looked around and we noticed there was near nothing We took other langs, bitext aligned: words one-to-one We played for two weeks, and then they said, here is the test The parser kept training till morning,
Text, Speech and Dialogue | 2012
Rudolf Rosa; David Mareček
We present a MIRA-based labeller designed to assign dependency relation labels to edges in a dependency parse tree, tuned for the Czech language. The labeller was created to be used as a second stage to unlabelled dependency parsers but can also improve output from labelled dependency parsers. We evaluate two existing techniques which can be used for labelling and experiment with combining them. We describe the feature set used. Our final setup significantly outperforms the best results from the CoNLL 2009 shared task.
Workshop on Statistical Machine Translation | 2012
Rudolf Rosa; David Mareček; Ondřej Dušek
Archive | 2015
Joakim Nivre; Željko Agić; Maria Jesus Aranzabe; Masayuki Asahara; Aitziber Atutxa; Miguel Ballesteros; John Bauer; Kepa Bengoetxea; Riyaz Ahmad Bhat; Cristina Bosco; Sam Bowman; Giuseppe G. A. Celano; Miriam Connor; Marie-Catherine de Marneffe; Arantza Diaz de Ilarraza; Kaja Dobrovoljc; Timothy Dozat; Tomaž Erjavec; Richárd Farkas; Jennifer Foster; Daniel Galbraith; Filip Ginter; Iakes Goenaga; Koldo Gojenola; Yoav Goldberg; Berta Gonzales; Bruno Guillaume; Jan Hajic; Dag Haug; Radu Ion