Heiki-Jaan Kaalep
University of Tartu
Publication
Featured research published by Heiki-Jaan Kaalep.
Meeting of the Association for Computational Linguistics | 1998
Ludmila Dimitrova; Tomaz Erjavec; Nancy Ide; Heiki-Jaan Kaalep; Vladimír Petkevič; Dan Tufis
The EU Copernicus project Multext-East has created a multi-lingual corpus of text and speech data, covering the six languages of the project: Bulgarian, Czech, Estonian, Hungarian, Romanian, and Slovene. In addition, wordform lexicons for each of the languages were developed. The corpus includes a parallel component consisting of Orwell's Nineteen Eighty-Four, with versions in all six languages tagged for part-of-speech and aligned to English (also tagged for POS). We describe the encoding format and data architecture designed especially for this corpus, which is generally usable for encoding linguistic corpora. We also describe the methodology for the development of a harmonized set of morphosyntactic descriptions (MSDs), which builds upon the scheme for western European languages developed within the EAGLES project. We discuss the special concerns for handling the six project languages, which cover three distinct language families.
Computers and The Humanities | 1997
Heiki-Jaan Kaalep
The paper describes a morphological analyser for Estonian and how using a text corpus influenced the process of creating it and the resulting program itself. The influence is not limited to the lexicon, but is also noticeable in the resulting algorithm and implementation. When work on the analyser began, there was no computational treatment of Estonian derivatives and compounds. After some cycles of development and testing on the corpus, we came up with an acceptable algorithm for their treatment. Both the morphological analyser and the speller based on it have been successfully marketed.
Language Resources and Evaluation | 2010
Kadri Muischnek; Heiki-Jaan Kaalep
This article focuses on the variability of one of the subtypes of multi-word expressions, namely those consisting of a verb and a particle or a verb and its complement(s). We build on evidence from Estonian, an agglutinative language with free word order, analysing the behaviour of verbal multi-word expressions (opaque and transparent idioms, support verb constructions and particle verbs). Using these data, we analyse such phenomena as the order of the components of a multi-word expression, lexical substitution and morphosyntactic flexibility.
The Prague Bulletin of Mathematical Linguistics | 2010
Mark Fishel; Heiki-Jaan Kaalep
CorporAl: a Method and Tool for Handling Overlapping Parallel Corpora
This work introduces a method and tool for handling overlapping parallel corpora, i.e. corpora that are based on the same source material. The method is insensitive to minor changes in the text, different segmentation levels of the corpora and material omitted from either corpus. The aim is to detect matching sentence pairs and either produce combinations of the overlapping corpora or compare them and assess their quality relative to each other. The introduced tool enables the user to define the desired behavior when combining corpus pairs, resulting in pure comparison, maximum-size or maximum-quality versions of the combinations. We test the tool on two cases of overlapping parallel corpora and five language pairs. We also evaluate the impact of using the method on two translation systems: a phrase-based and a parsing-based one.
Archive | 2002
Heiki-Jaan Kaalep
International Journal of Lexicography | 2008
Heiki-Jaan Kaalep; Jaan Mikk
Proceedings of the 2010 conference on Human Language Technologies -- The Baltic Perspective: Proceedings of the Fourth International Conference Baltic HLT 2010 | 2010
Heiki-Jaan Kaalep; Kadri Muischnek; Kristel Uiboaed; Kaarel Veskis
NODALIDA | 2007
Mark Fishel; Heiki-Jaan Kaalep; Kadri Muischnek
Language Resources and Evaluation | 2002
Heiki-Jaan Kaalep; Kadri Muischnek
Language Resources and Evaluation | 2016
Siim Orasmaa; Timo Petmanson; Alexander Tkachenko; Sven Laur; Heiki-Jaan Kaalep