Pavel Rychlý
Masaryk University
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Pavel Rychlý.
conference on current trends in theory and practice of informatics | 1997
Karel Pala; Pavel Rychlý; Pavel Smrz
This paper deals with Czech disambiguated corpus DESAM. It is a tagged corpus which has been manually disambiguated and can be used in various applications. We discuss the structure of the corpus, tools used for its managing, linguistic applications, and also possible use of machine learning techniques relying on the disambiguated data. Possible ways of developing the procedures for complete automatic disambiguation are considered.
conference of the european chapter of the association for computational linguistics | 2014
Miloš Jakub'iċek; Adam Kilgarriff; VojtÄ›ch Kovář; Pavel Rychlý; Vít Suchomel
Term candidates for a domain, in a language, can be found by • taking a corpus for the domain, and a refer- ence corpus for the language • identifying the grammatical shape of a term in the language • tokenising, lemmatising and POS-tagging both corpora • identifying (and counting) the items in each corpus which match the grammatical shape • for each item in the domain corpus, compar- ing its frequency with its frequency in the refence corpus. Then, the items with the highest frequency in the domain corpus in comparison to the reference cor- pus will be the top term candidates. None of the steps above are unusual or innova- tive for NLP (see, e. g., (Aker et al., 2013), (Go- jun et al., 2012)). However it is far from trivial to implement them all, for numerous languages, in an environment that makes it easy for non- programmers to find the terms in a domain. This is what we have done in the Sketch Engine (Kilgarriff et al., 2004), and will demonstrate. In this abstract we describe how we addressed each of the stages above.
text speech and dialogue | 2001
Pavel Smrz; Pavel Rychlý
The paper deals with the linguistic problem of fully automatic grouping of semantically related words. We discuss the measures of semantic relatedness of basic word forms and describe the treatment of collocations. Next we present the procedure of hierarchical clustering of a very large number of semantically related words and give examples of the resulting partitioning of data in the form of dendrogram. Finally we show a form of the output presentation that facilitates the inspection of the resulting word clusters.
language resources and evaluation | 2010
Karel Pala; Pavel Rychlý; Pavel Šmerk
Law texts including constitution, acts, public notices and court judgements form a huge database of texts. As many texts from small domains, the used sublanguage is partially restricted and also different from general language (Czech). As a starting collection of data, the legal database Lexis containing approx. 50,000 Czech law documents has been chosen. Our attention is concentrated mostly on noun groups, which are the main candidates for law terms. We were able to recognize 3992 such different noun groups in the selected text samples. The paper also presents results of the morphological analysis, lemmatization, tagging, disambiguation, and the basic syntactic analysis of Czech law texts as these tasks are crucial for any further sophisticated natural language processing. The verbs in legal texts have been explored preliminarily as well. In this respect, we are trying to explore how the linguistic analysis can help in identification of the semantic nature of law terms.
text speech and dialogue | 2003
Karel Pala; Pavel Rychlý; Pavel Smrž
This paper presents a description of a Czech text corpus (Chyby) containing various kinds of errors such as spelling, typographical, grammatical, style, lexical. We explain how Chyby has been built, how the errors in it have been discovered, marked and annotated. The classification of the errors is presented and the statistics concerning the types of errors is given. The tools for annotating the errors are also described. To the best of our knowledge, this is first text corpus of this sort prepared for Czech.
text speech and dialogue | 1999
Jaroslava Hlaváčová; Pavel Rychlý
This paper proposes new measures for dealing with word dispersion in a language corpus - reduced frequency and rarity. Their calculation is described and some results from the Czech National Corpus (CNC) presented. Some previous approaches are briefly mentioned.
text speech and dialogue | 2016
Pavel Rychlý; Vít Suchomel
Amharic is one of under-resourced languages. The paper presents two text corpora. The first one is a substantially cleaned version of existing morphologically annotated WIC Corpus (210,000 words). The second one is the largest Amharic text corpus (17 million words). It was created from Web pages automatically crawled in 2013, 2015 and 2016. It is part-of-speech annotated by a tagger trained and evaluated on the WIC Corpus.
conference on intelligent text processing and computational linguistics | 2016
Roger Evans; Alexander Gelbukh; Gregory Grefenstette; Patrick Hanks; Miloš Jakubíček; Diana McCarthy; Martha Palmer; Ted Pedersen; Michael Rundell; Pavel Rychlý; Serge Sharoff; David Tugwell
The 2016 CICLing conference was dedicated to the memory of Adam Kilgarriff who died the year before. Adam leaves behind a tremendous scientific legacy and those working in computational linguistics, other fields of linguistics and lexicography are indebted to him. This paper is a summary review of some of Adam’s main scientific contributions. It is not and cannot be exhaustive. It is written by only a small selection of his large network of collaborators. Nevertheless we hope this will provide a useful summary for readers wanting to know more about the origins of work, events and software that are so widely relied upon by scientists today, and undoubtedly will continue to be so in the foreseeable future.
text speech and dialogue | 2010
Silvie Cinková; Martin Holub; Pavel Rychlý; Lenka Smejkalová; Jana ýindlerová
Corpus Pattern Analysis (CPA) [1], coined and implemented by Hanks as the Pattern Dictionary of English Verbs (PDEV) [2], appears to be the only deliberate and consistent implementation of Sinclairs concept of Lexical Item [3]. In his theoretical inquiries [4] Hanks hypothesizes that the pattern repository produced by CPA can also support the word sense disambiguation task. Although more than 670 verb entries have already been compiled in PDEV, no systematic evaluation of this ambitious project has been reported yet. Assuming that the Sinclairian concept of the Lexical Item is correct, we started to closely examine PDEV with its possible NLP application in mind. Our experiments presented in this paper have been performed on a pilot sample of English verbs to provide a first reliable view on whether humans can agree in assigning PDEV patterns to verbs in a corpus. As a conclusion we suggest procedures for future development of PDEV.
Proceedings of the Corpus Linguistics Conference 2009 (CL2009),, 2009, pág. 177 | 2009
Adam Kilgarriff; Pavel Rychlý; Pavel Smrž; David Tugwell