Publication


Featured research published by Heiki-Jaan Kaalep.


Meeting of the Association for Computational Linguistics | 1998

Multext-East: Parallel and Comparable Corpora and Lexicons for Six Central and Eastern European Languages

Ludmila Dimitrova; Tomaz Erjavec; Nancy Ide; Heiki-Jaan Kaalep; Vladimír Petkevič; Dan Tufis

The EU Copernicus project Multext-East has created a multi-lingual corpus of text and speech data, covering the six languages of the project: Bulgarian, Czech, Estonian, Hungarian, Romanian, and Slovene. In addition, wordform lexicons for each of the languages were developed. The corpus includes a parallel component consisting of Orwell's Nineteen Eighty-Four, with versions in all six languages tagged for part-of-speech and aligned to English (also tagged for POS). We describe the encoding format and data architecture designed especially for this corpus, which is generally usable for encoding linguistic corpora. We also describe the methodology for the development of a harmonized set of morphosyntactic descriptions (MSDs), which builds upon the scheme for western European languages developed within the EAGLES project. We discuss the special concerns for handling the six project languages, which cover three distinct language families.


Computers and The Humanities | 1997

An Estonian Morphological Analyser and the Impact of a Corpus on Its Development

Heiki-Jaan Kaalep

The paper describes a morphological analyser for Estonian and how using a text corpus influenced the process of creating it and the resulting program itself. The influence is not limited to the lexicon, but is also noticeable in the resulting algorithm and implementation. When work on the analyser began, there was no computational treatment of Estonian derivatives and compounds. After some cycles of development and testing on the corpus, we came up with an acceptable algorithm for their treatment. Both the morphological analyser and the speller based on it have been successfully marketed.


Language Resources and Evaluation | 2010

The variability of multi-word verbal expressions in Estonian

Kadri Muischnek; Heiki-Jaan Kaalep

This article focuses on the variability of one of the subtypes of multi-word expressions, namely those consisting of a verb and a particle or a verb and its complement(s). We build on evidence from Estonian, an agglutinative language with free word order, analysing the behaviour of verbal multi-word expressions (opaque and transparent idioms, support verb constructions and particle verbs). Using this data we analyse such phenomena as the order of the components of a multi-word expression, lexical substitution and morphosyntactic flexibility.


The Prague Bulletin of Mathematical Linguistics | 2010

CorporAl: a Method and Tool for Handling Overlapping Parallel Corpora

Mark Fishel; Heiki-Jaan Kaalep

This work introduces a method and tool for handling overlapping parallel corpora, i.e. corpora that are based on the same source material. The method is insensitive to minor changes in the text, to different segmentation levels of the corpora, and to material omitted from either corpus. The aim is to detect matching sentence pairs and either produce combinations of the overlapping corpora or compare them and assess their quality relative to each other. The introduced tool enables the user to define the desired behavior when combining corpus pairs, resulting in pure comparison, maximum-size or maximum-quality versions of the combinations. We test the tool on two cases of overlapping parallel corpora and five language pairs. We also evaluate the impact of using the method on two translation systems: a phrase-based and a parsing-based one.
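
The idea of detecting matching sentence pairs across two overlapping corpora and then building a maximum-size combination can be illustrated with a small sketch. This is not the CorporAl algorithm, which the abstract does not specify; it is a simple stand-in that uses character-level similarity from Python's standard `difflib` module. The function names, the 0.9 threshold and the greedy one-to-one matching are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def normalize(sentence):
    # Lowercase and collapse whitespace so trivial edits do not block a match.
    return " ".join(sentence.lower().split())


def similarity(a, b):
    # Character-level similarity in [0, 1]; an assumed stand-in measure.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_pairs(corpus_a, corpus_b, threshold=0.9):
    """Greedily pair entries whose source sentences are near-identical.

    Each corpus is a list of (source, target) tuples. Returns (i, j) index
    pairs; every entry of corpus_b is used at most once.
    """
    matches = []
    used_b = set()
    for i, (src_a, _) in enumerate(corpus_a):
        best_j, best_score = None, threshold
        for j, (src_b, _) in enumerate(corpus_b):
            if j in used_b:
                continue
            score = similarity(src_a, src_b)
            if score >= best_score:
                best_j, best_score = j, score
        if best_j is not None:
            used_b.add(best_j)
            matches.append((i, best_j))
    return matches


def combine_max_size(corpus_a, corpus_b, matches):
    """Keep all of corpus_a plus the entries of corpus_b that matched nothing."""
    matched_b = {j for _, j in matches}
    extra = [pair for j, pair in enumerate(corpus_b) if j not in matched_b]
    return list(corpus_a) + extra


# Hypothetical toy data: two small Estonian-English corpora with one overlap.
corpus_a = [("Tere hommikust!", "Good morning!"), ("Kuidas läheb?", "How are you?")]
corpus_b = [("Tere  hommikust !", "Good morning!"), ("Head aega.", "Goodbye.")]
matches = match_pairs(corpus_a, corpus_b)
combined = combine_max_size(corpus_a, corpus_b, matches)
```

The pairwise comparison is quadratic in corpus size, so a sketch like this only suits small inputs. It also ignores differing segmentation levels, which the abstract says the actual method handles.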


Archive | 2002

Eesti kirjakeele sagedussõnastik

Heiki-Jaan Kaalep


International Journal of Lexicography | 2008

Creating Specialised Dictionaries for Foreign Language Learners: A Case Study

Heiki-Jaan Kaalep; Jaan Mikk


Proceedings of the 2010 conference on Human Language Technologies -- The Baltic Perspective: Proceedings of the Fourth International Conference Baltic HLT 2010 | 2010

The Estonian Reference Corpus: Its Composition and Morphology-aware User Interface

Heiki-Jaan Kaalep; Kadri Muischnek; Kristel Uiboaed; Kaarel Veskis


NODALIDA | 2007

Estonian-English Statistical Machine Translation: the First Results.

Mark Fishel; Heiki-Jaan Kaalep; Kadri Muischnek


Language Resources and Evaluation | 2002

Using the Text Corpus to Create a Comprehensive List of Phrasal Verbs.

Heiki-Jaan Kaalep; Kadri Muischnek


Language Resources and Evaluation | 2016

EstNLTK - NLP Toolkit for Estonian.

Siim Orasmaa; Timo Petmanson; Alexander Tkachenko; Sven Laur; Heiki-Jaan Kaalep

Collaboration



Top Co-Authors


Vladimír Petkevič

Charles University in Prague


Kiril Simov

Bulgarian Academy of Sciences


Ludmila Dimitrova

Bulgarian Academy of Sciences


Csaba Oravecz

Hungarian Academy of Sciences
