Jirka Hana
Charles University in Prague
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Jirka Hana.
Archive | 2010
Anna Feldman; Jirka Hana
While supervised corpus-based methods are highly accurate for different NLP tasks, including morphological tagging, they are difficult to port to other languages because they require resources that are expensive to create. As a result, many languages have no realistic prospect for morpho-syntactic annotation in the foreseeable future. The method presented in this book aims to overcome this problem by significantly limiting the necessary data and instead extrapolating the relevant information from another, related language. The approach has been tested on Catalan, Portuguese, and Russian. Although these languages are only relatively resource-poor, the same method can be in principle applied to any inflected language, as long as there is an annotated corpus of a related language available. Time needed for adjusting the system to a new language constitutes a fraction of the time needed for systems with extensive, manually created resources: days instead of years. This book touches upon a number of topics: typology, morphology, corpus linguistics, contrastive linguistics, linguistic annotation, computational linguistics and Natural Language Processing (NLP). Researchers and students who are interested in these scientific areas as well as in cross-lingual studies and applications will greatly benefit from this work. Scholars and practitioners in computer science and linguistics are the prospective readers of this book.Morpho-syntactic tagging is the process of assigning part o f speech (POS), case, number, gender, and other morphological information to each word in a corpus. Morpho-syntactic tagging is an important step in natural language processing . Corpora that have been morphologically tagged are very useful both for linguistic res earch, e.g. finding instances or frequencies of particular constructions in large corpora, and for further computational processing, such as syntactic parsing, speech recognition, st emming, and word-sense disambiguation, among others. Despite the importance of morphol ogical tagging, there are many languages that lack annotated resources. This is almost ine vi able because these resources are costly to create. But, as described in this thesis, it is p os ible to avoid this expense. This thesis describes a method for transferring annotation from a morphologically annotated corpus of a source language to a corpus of a related target language. Unlike unsupervised approaches that do not require annotated data at ll and, as a consequence, lack precision, the approach proposed in this dissertation relies on linguistic knowledge, but avoids large-scale grammar engineering. The approach need s neither a parallel corpus nor a bilingual lexicon, and requires much less linguistic labo r than the standard technology. This dissertation describes experiments with Russian, Cze ch, Polish, Spanish, Portuguese, and Catalan. However, the general method proposed can be applied to any fusional language.
CrossLangInduction '06 Proceedings of the International Workshop on Cross-Language Knowledge Induction | 2006
Jirka Hana; Anna Feldman; Chris Brew; Luiz Amaral
We describe a knowledge and resource light system for an automatic morphological analysis and tagging of Brazilian Portuguese. We avoid the use of labor intensive resources; particularly, large annotated corpora and lexicons. Instead, we use (i) an annotated corpus of Peninsular Spanish, a language related to Portuguese, (ii) an unannotated corpus of Portuguese, (iii) a description of Portuguese morphology on the level of a basic grammar book. We extend the similar work that we have done (Hana et al., 2004; Feldman et al., 2006) by proposing an alternative algorithm for cognate transfer that effectively projects the Spanish emission probabilities into Portuguese. Our experiments use minimal new human effort and show 21% error reduction over even emissions on a fine-grained tagset.
language resources and evaluation | 2014
Alexandr Rosen; Jirka Hana; Barbora Štindlová; Anna Feldman
The paper describes a corpus of texts produced by non-native speakers of Czech. We discuss its annotation scheme, consisting of three interlinked tiers, designed to handle a wide range of error types present in the input. Each tier corrects different types of errors; links between the tiers allow capturing errors in word order and complex discontinuous expressions. Errors are not only corrected, but also classified. The annotation scheme is tested on a data set including approx. 175,000 words with fair inter-annotator agreement results. We also explore the possibility of applying automated linguistic annotation tools (taggers, spell checkers and grammar checkers) to the learner text to support or even substitute manual annotation.
language resources and evaluation | 2012
Jirka Hana; Alexandr Rosen; Barbora Štindlová; Petr J"ager
The need for data about the acquisition of Czech by non-native learners prompted the compilation of the first learner corpus of Czech. After introducing its basic design and parameters, including a multi-tier manual annotation scheme and error taxonomy, we focus on the more technical aspects: the transcription of hand-written source texts, process of annotation, and options for exploiting the result, together with tools used for these tasks and decisions behind the choices. To support or even substitute manual annotation we assign some error tags automatically and use automatic annotation tools (tagger, spell checker).
text speech and dialogue | 2012
Tomáš Jelínek; Barbora Štindlová; Alexandr Rosen; Jirka Hana
We present an approach to building a learner corpus of Czech, manually corrected and annotated with error tags using a complex grammar-based taxonomy of errors in spelling, morphology, morphosyntax, lexicon and style. This grammar-based annotation is supplemented by a formal classification of errors based on surface alternations. To supply additional information about non-standard or ill-formed expressions, we aim at a synergy of manual and automatic annotation, deriving information from the original input and from the manual annotation.
international conference on computational linguistics | 2006
Anna Feldman; Jirka Hana; Chris Brew
Annotated corpora are valuable resources for NLP which are often costly to create. We introduce a method for transferring annotation from a morphologically annotated corpus of a source language to a target language. Our approach assumes only that an unannotated text corpus exists for the target language and a simple textbook which describes the basic morphological properties of that language is available. Our paper describes experiments with Polish, Czech, and Russian. However, the method is not tied in any way to these languages. In all the experiments we use the TnT tagger ([3]), a second-order Markov model. Our approach assumes that the information acquired about one language can be used for processing a related language. We have found out that even breathtakingly naive things (such as approximating the Russian transitions by Czech and/or Polish and approximating the Russian emissions by (manually/automatically derived) Czech cognates) can lead to a significant improvement of the tagger’s performance.
The Prague Bulletin of Mathematical Linguistics | 2008
Barbora Hladká; Jan Hajic; Jirka Hana; Jaroslava Hlaváčová; Jiří Mírovský; Jan Raab
The Czech Academic Corpus 2.0 Guide The Czech Academic Corpus version 2.0 is a morphologically and syntactically annotated corpus of 650,000 words. The Czech Academic Corpus (CAC) was created by a team from the Institute of the Czech Language of the Academy of Sciences of the Czech Republic from 1971 to 1985. When the CAC project began there were only two computerized annotated corpora available since the 1960s - the Brown Corpus of American English and the LOB Corpus of British English. Both corpora became well known to corpus linguists, whereas the CAC remained hidden mainly because of the 1980s political regime in the Czech Republic. The idea of transferring the internal format and annotation scheme of the CAC into the Prague Dependency Treebank (PDT) concept emerged during the work on the PDTs second version. The main goal was to make the CAC and the PDT fully compatible and thus enable the integration of the CAC into the PDT. The currently released second version of the CAC presents the complete conversion of the internal format and morphological and syntactical annotation schemes. The Czech Academic Corpus v. 2.0 is being published by the Linguistic Data Consortium.
linguistic annotation workshop | 2014
Jirka Hana; Barbora Hladká; Ivana Lukšová
The purpose of our work is to explore the possibility of using sentence diagrams produced by schoolchildren as training data for automatic syntactic analysis. We have implemented a sentence diagram editor that schoolchildren can use to practice morphology and syntax. We collect their diagrams, combine them into a single diagram for each sentence and transform them into a form suitable for training a particular syntactic parser. In this study, the object language is Czech, where sentence diagrams are part of elementary school curriculum, and the target format is the annotation scheme of the Prague Dependency Treebank. We mainly focus on the evaluation of individual diagrams and on their combination into a merged better version.
Archive | 2010
Anna Feldman; Jirka Hana
While supervised corpus-based methods are highly accurate for different NLP tasks, including morphological tagging, they are difficult to port to other languages because they require resources that are expensive to create. As a result, many languages have no realistic prospect for morpho-syntactic annotation in the foreseeable future. The method presented in this book aims to overcome this problem by significantly limiting the necessary data and instead extrapolating the relevant information from another, related language. The approach has been tested on Catalan, Portuguese, and Russian. Although these languages are only relatively resource-poor, the same method can be in principle applied to any inflected language, as long as there is an annotated corpus of a related language available. Time needed for adjusting the system to a new language constitutes a fraction of the time needed for systems with extensive, manually created resources: days instead of years. This book touches upon a number of topics: typology, morphology, corpus linguistics, contrastive linguistics, linguistic annotation, computational linguistics and Natural Language Processing (NLP). Researchers and students who are interested in these scientific areas as well as in cross-lingual studies and applications will greatly benefit from this work. Scholars and practitioners in computer science and linguistics are the prospective readers of this book.
Archive | 2010
Anna Feldman; Jirka Hana
While supervised corpus-based methods are highly accurate for different NLP tasks, including morphological tagging, they are difficult to port to other languages because they require resources that are expensive to create. As a result, many languages have no realistic prospect for morpho-syntactic annotation in the foreseeable future. The method presented in this book aims to overcome this problem by significantly limiting the necessary data and instead extrapolating the relevant information from another, related language. The approach has been tested on Catalan, Portuguese, and Russian. Although these languages are only relatively resource-poor, the same method can be in principle applied to any inflected language, as long as there is an annotated corpus of a related language available. Time needed for adjusting the system to a new language constitutes a fraction of the time needed for systems with extensive, manually created resources: days instead of years. This book touches upon a number of topics: typology, morphology, corpus linguistics, contrastive linguistics, linguistic annotation, computational linguistics and Natural Language Processing (NLP). Researchers and students who are interested in these scientific areas as well as in cross-lingual studies and applications will greatly benefit from this work. Scholars and practitioners in computer science and linguistics are the prospective readers of this book.