Daniel Zeman
Charles University in Prague
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Daniel Zeman.
international conference on computational linguistics | 2000
Anoop Sarkar; Daniel Zeman
We present some novel machine learning techniques for the identification of subcategorization information for verbs in Czech. We compare three different statistical techniques applied to this problem. We show how the learning algorithm can be used to discover previously unknown subcategorization frames from the Czech Prague Dependency Treebank. The algorithm can then be used to label dependents of a verb in the Czech treebank as either arguments or adjuncts. Using our techniques, we are able to achieve 88% precision on unseen parsed text.
international workshop/conference on parsing technologies | 2005
Daniel Zeman; Zdenėk Żabokrtsk'y
This paper explores the possibilities of improving parsing results by combining outputs of several parsers. To some extent, we are porting the ideas of Henderson and Brill (1999) to the world of dependency structures. We differ from them in exploring context features more deeply. All our experiments were conducted on Czech but the method is language-independent. We were able to significantly improve over the best parsing result for the given setting, known so far. Moreover, our experiments show that even parsers far below the state of the art can contribute to the total improvement.
north american chapter of the association for computational linguistics | 2014
Stephan Oepen; Marco Kuhlmann; Yusuke Miyao; Daniel Zeman; Silvie Cinková; Dan Flickinger; Jan Hajic; Zdenka Uresová
Task 18 at SemEval 2015 defines Broad-Coverage Semantic Dependency Parsing (SDP) as the problem of recovering sentence-internal predicate–argument relationships for all content words, i.e. the sema ...
language resources and evaluation | 2014
Daniel Zeman; Ondřej Dušek; David Mareček; Martin Popel; Loganathan Ramasamy; Jan Ŝtĕpánek; Zdenĕk Žabokrtský; Jan Hajic
AbstractWe present HamleDT—a HArmonized Multi-LanguagE Dependency Treebank. HamleDT is a compilation of existing dependency treebanks (or dependency conversions of other treebanks), transformed so that they all conform to the same annotation style. In the present article, we provide a thorough investigation and discussion of a number of phenomena that are comparable across languages, though their annotation in treebanks often differs. We claim that transformation procedures can be designed to automatically identify most such phenomena and convert them to a unified annotation style. This unification is beneficial both to comparative corpus linguistics and to machine learning of syntactic parsing.
systems and frameworks for computational morphology | 2011
Septina Dian Larasati; Vladislav Kuboň; Daniel Zeman
This paper describes a robust finite state morphology tool for Indonesian (MorphInd), which handles both morphological analysis and lemmatization for a given surface word form so that it is suitable for further language processing. MorphInd has wider coverage on handling Indonesian derivational and inflectional morphology compared to an existing Indonesian morphological analyzer [1], along with a more detailed tagset. MorphInd outputs the analysis in the form of segmented morphemes along with the morphological tags. The implementation was done using finite state technology by adopting the two-level morphology approach implemented in Foma. It achieved 84.6% of coverage on a preliminary stage Indonesian corpus where it mostly fails to capture the proper nouns and foreign words as expected initially.
The Prague Bulletin of Mathematical Linguistics | 2011
Daniel Zeman; Mark Fishel; Jan Berka; Ondřej Bojar
Addicter: What Is Wrong with My Translations? We introduce Addicter, a tool for Automatic Detection and DIsplay of Common Translation ERrors. The tool allows to automatically identify and label translation errors and browse the test and training corpus and word alignments; usage of additional linguistic tools is also supported. The error classification is inspired by that of Vilar et al. (2006), although some of their higher-level categories are beyond the reach of the current version of our system. In addition to the tool itself we present a comparison of the proposed method to manually classified translation errors and a thorough evaluation of the generated alignments.
international conference on computational linguistics | 2002
Daniel Zeman
Today there is a relatively large body of work on automatic acquisition of lexico-syntactical preferences (subcategorization) from corpora. Various techniques have been developed that not only produce machine-readable subcategorization dictionaries but also they are capable of weighing the various subcategorization frames probabilistically. Clearly there should be a potential to use such weighted lexical information to improve statistical parsing, though published experiments proving (or disproving) such hypothesis are comparatively rare. One experiment is described in (Carroll et al., 1998) --- they use subcategorization probabilities for ranking trees generated by unification-based phrasal grammar. The present paper, on the other hand, involves a statistical dependency parser. Although dependency and constituency parsing are of quite a different nature, we show that a subcategorization model is of much use here as well.
cross language evaluation forum | 2008
Daniel Zeman
This paper describes a rather simplistic method of unsupervised morphological analysis of words in an unknown language. All what is needed is a raw text corpus in the given language. The algorithm looks at words, identifies repeatedly occurring stems and suffixes, and constructs probable morphological paradigms. The paper also describes how this method has been applied to solve the Morpho Challenge 2007 task, and gives the Morpho Challenge results. Although quite simple, this approach outperformed, to our surprise, several others in most morpheme segmentation subcompetitions. We believe that there is enough room for improvements that can put the results even higher. Errors are discussed in the paper; together with suggested adjustments in future research.
cross language evaluation forum | 2008
Daniel Zeman
We describe a simple method of unsupervised morpheme segmentation of words in an unknown language. All that is needed is a raw text corpus (or a list of words) in the given language. The algorithm identifies word parts occurring in many words and interprets them as morpheme candidates (prefixes, stems and suffixes). New treatment of prefixes is the main innovation in comparison to [1]. After filtering out spurious hypotheses, the list of morphemes is applied to segment input words. Official Morpho Challenge 2008 evaluation is given together with some additional experiments. Processing of prefixes improved the F-score by 5 to 11 points for German, Finnish and Turkish, while it failed to improve English and Arabic. We also analyze and discuss errors with respect to the evaluation method.
workshop on statistical machine translation | 2014
Ondřej Dušek; Jan Hajiċ; Jaroslava Hlaváċová; Michal Novák; Pavel Pecina; Rudolf Rosa; Aleš Tamchyna; Zdeňka Urešová; Daniel Zeman
This paper presents the participation of the Charles University team in the WMT 2014 Medical Translation Task. Our systems are developed within the Khresmoi project, a large integrated project aiming to deliver a multi-lingual multi-modal search and access system for biomedical information and documents. Being involved in the organization of the Medical Translation Task, our primary goal is to set up a baseline for both its subtasks (summary translation and query translation) and for all translation directions. Our systems are based on the phrasebased Moses system and standard methods for domain adaptation. The constrained/unconstrained systems differ in the training data only.