Publication


Featured research published by Mathias Creutz.


ACM Transactions on Speech and Language Processing | 2007

Unsupervised models for morpheme segmentation and morphology learning

Mathias Creutz; Krista Lagus

We present a model family called Morfessor for the unsupervised induction of a simple morphology from raw text data. The model is formulated in a probabilistic maximum a posteriori framework. Morfessor can handle highly inflecting and compounding languages where words can consist of lengthy sequences of morphemes. A lexicon of word segments, called morphs, is induced from the data. The lexicon stores information about both the usage and form of the morphs. Several instances of the model are evaluated quantitatively in a morpheme segmentation task on different sized sets of Finnish as well as English data. Morfessor is shown to perform very well compared to a widely known benchmark algorithm, in particular on Finnish data.


Meeting of the Association for Computational Linguistics | 2002

Unsupervised Discovery of Morphemes

Mathias Creutz; Krista Lagus

We present two methods for unsupervised segmentation of words into morpheme-like units. The model utilized is especially suited for languages with a rich morphology, such as Finnish. The first method is based on the Minimum Description Length (MDL) principle and works online. In the second method, Maximum Likelihood (ML) optimization is used. The quality of the segmentations is measured using an evaluation method that compares the segmentations produced to an existing morphological analysis. Experiments on both Finnish and English corpora show that the presented methods perform well compared to a current state-of-the-art system.


Computer Speech & Language | 2006

Unlimited Vocabulary Speech Recognition with Morph Language Models Applied to Finnish

Teemu Hirsimäki; Mathias Creutz; Vesa Siivola; Mikko Kurimo; Sami Virpioja; Janne Pylkkönen

In the speech recognition of highly inflecting or compounding languages, the traditional word-based language modeling is problematic. As the number of distinct word forms can grow very large, it becomes difficult to train language models that are both effective and cover the words of the language well. In the literature, several methods have been proposed for basing the language modeling on sub-word units instead of whole words. However, to our knowledge, considerable improvements in speech recognition performance have not been reported. In this article, we present a language-independent algorithm for discovering word fragments in an unsupervised manner from text. The algorithm uses the Minimum Description Length principle to find an inventory of word fragments that is compact but models the training text effectively. Language modeling and speech recognition experiments show that n-gram models built over these fragments perform better than n-gram models based on words. In two Finnish recognition tasks, relative error rate reductions between 12% and 31% are obtained. In addition, our experiments suggest that word fragments obtained using grammatical rules do not outperform the fragments discovered from text. We also present our recognition system and discuss how utilizing fragments instead of words affects the decoding process.
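The trade-off named in this abstract, a fragment inventory that is compact yet still models the text well, can be illustrated with a toy two-part code length. The sketch below is not the authors' algorithm: the function name `description_length`, the uniform per-character lexicon code, the default `alphabet_size`, and the four-word Finnish example are all assumptions made for illustration.

```python
import math
from collections import Counter


def description_length(segmented_corpus, alphabet_size=32):
    """Toy two-part code length, in bits, of a segmented corpus.

    segmented_corpus: list of words, each given as a list of fragments.
    alphabet_size: assumed number of distinct characters (hypothetical).
    """
    counts = Counter(frag for word in segmented_corpus for frag in word)
    total = sum(counts.values())

    # Lexicon cost: spell each distinct fragment once, character by
    # character plus an end marker, using a uniform code over the alphabet.
    bits_per_char = math.log2(alphabet_size + 1)
    lexicon_bits = sum((len(frag) + 1) * bits_per_char for frag in counts)

    # Corpus cost: each fragment token costs -log2 of its relative frequency.
    corpus_bits = -sum(c * math.log2(c / total) for c in counts.values())

    return lexicon_bits + corpus_bits


# Four Finnish word forms, stored whole versus split into stem + ending.
whole = [["talossa"], ["talolla"], ["autossa"], ["autolla"]]
split = [["talo", "ssa"], ["talo", "lla"], ["auto", "ssa"], ["auto", "lla"]]

print(description_length(whole))
print(description_length(split))
```

In this toy setting the split version is cheaper overall: the corpus part grows because there are more tokens to encode, but the lexicon shrinks by more because the shared stems and endings are stored only once.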


ACM Transactions on Speech and Language Processing | 2007

Morph-based speech recognition and modeling of out-of-vocabulary words across languages

Mathias Creutz; Teemu Hirsimäki; Mikko Kurimo; Antti Puurula; Janne Pylkkönen; Vesa Siivola; Matti Varjokallio; Ebru Arisoy; Murat Saraclar; Andreas Stolcke

We explore the use of morph-based language models in large-vocabulary continuous-speech recognition systems across four so-called morphologically rich languages: Finnish, Estonian, Turkish, and Egyptian Colloquial Arabic. The morphs are subword units discovered in an unsupervised, data-driven way using the Morfessor algorithm. By estimating n-gram language models over sequences of morphs instead of words, the quality of the language model is improved through better vocabulary coverage and reduced data sparsity. Standard word models suffer from high out-of-vocabulary (OOV) rates, whereas the morph models can recognize previously unseen word forms by concatenating morphs. It is shown that the morph models do perform fairly well on OOVs without compromising the recognition accuracy on in-vocabulary words. The Arabic experiment constitutes the only exception since here the standard word model outperforms the morph model. Differences in the datasets and the amount of data are discussed as a plausible explanation.
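The coverage argument in this abstract, that unseen word forms can still be produced by concatenating known morphs, can be sketched with a small dynamic program. This is an illustration under stated assumptions only: the helper names `can_compose` and `oov_rate`, the toy vocabularies, and the test words are hypothetical and are not taken from the paper or from Morfessor.

```python
def can_compose(word, morphs):
    """Return True if `word` is a concatenation of one or more known morphs."""
    # reachable[i] is True when word[:i] can be built from morphs.
    reachable = [True] + [False] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            if reachable[start] and word[start:end] in morphs:
                reachable[end] = True
                break
    return bool(word) and reachable[len(word)]


def oov_rate(words, is_covered):
    """Fraction of `words` that the given coverage test rejects."""
    missed = sum(1 for w in words if not is_covered(w))
    return missed / len(words)


# Hypothetical toy vocabularies: full word forms versus morphs.
word_vocab = {"talo", "talossa", "auto"}
morph_vocab = {"talo", "auto", "ssa", "lla"}
test_words = ["talo", "talossa", "autossa", "autolla"]

word_oov = oov_rate(test_words, lambda w: w in word_vocab)
morph_oov = oov_rate(test_words, lambda w: can_compose(w, morph_vocab))
print(word_oov, morph_oov)
```

A real recognizer also has to assign probabilities to the morph sequences; this sketch only checks whether a word form is reachable at all.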


Meeting of the Association for Computational Linguistics | 2003

Unsupervised Segmentation of Words Using Prior Distributions of Morph Length and Frequency

Mathias Creutz

We present a language-independent and unsupervised algorithm for the segmentation of words into morphs. The algorithm is based on a new generative probabilistic model, which makes use of relevant prior information on the length and frequency distributions of morphs in a language. Our algorithm is shown to outperform two competing algorithms, when evaluated on data from a language with agglutinative morphology (Finnish), and to perform well also on English data.


Meeting of the Association for Computational Linguistics | 2004

Induction of a simple morphology for highly-inflecting languages

Mathias Creutz; Krista Lagus

This paper presents an algorithm for the unsupervised learning of a simple morphology of a natural language from raw text. A generative probabilistic model is applied to segment word forms into morphs. The morphs are assumed to be generated by one of three categories, namely prefix, suffix, or stem, and we make use of some observed asymmetries between these categories. The model learns a word structure, where words are allowed to consist of lengthy sequences of alternating stems and affixes, which makes the model suitable for highly-inflecting languages. The ability of the algorithm to find real morpheme boundaries is evaluated against a gold standard for both Finnish and English. In comparison with a state-of-the-art algorithm the new algorithm performs best on the Finnish data, and on roughly equal level on the English data.


Cross-Language Evaluation Forum | 2008

Morpho Challenge Evaluation Using a Linguistic Gold Standard

Mikko Kurimo; Mathias Creutz; Matti Varjokallio

In Morpho Challenge 2007, the objective was to design statistical machine learning algorithms that discover which morphemes (smallest individually meaningful units of language) words consist of. Ideally, these are basic vocabulary units suitable for different tasks, such as text understanding, machine translation, information retrieval, and statistical language modeling. Because in unsupervised morpheme analysis the morphemes can have arbitrary names, the analyses are here evaluated by a comparison to a linguistic gold standard by matching the morpheme-sharing word pairs. The data sets were provided for four languages: Finnish, German, English, and Turkish and the participants were encouraged to apply their algorithm to all of them. The results show significant variance between the methods and languages, but the best methods seem to be useful in all tested languages and match quite well with the linguistic analysis.


Meeting of the Association for Computational Linguistics | 2009

Web Augmentation of Language Models for Continuous Speech Recognition of SMS Text Messages

Mathias Creutz; Sami Virpioja; Anna Kovaleva

In this paper, we present an efficient query selection algorithm for the retrieval of web text data to augment a statistical language model (LM). The number of retrieved relevant documents is optimized with respect to the number of queries submitted. The querying scheme is applied in the domain of SMS text messages. Continuous speech recognition experiments are conducted on three languages: English, Spanish, and French. The web data is utilized for augmenting in-domain LMs in general and for adapting the LMs to a user-specific vocabulary. Word error rate reductions of up to 6.6% (in LM augmentation) and 26.0% (in LM adaptation) are obtained in setups, where the size of the web mixture LM is limited to the size of the baseline in-domain LM.


Finite-State Methods and Natural Language Processing | 2005

Unsupervised Morphology Induction Using Morfessor

Mathias Creutz; Krista Lagus; Sami Virpioja

We present Morfessor, an unsupervised algorithm and software that induces a simple morphology of a natural language from a large corpus. Morfessor simultaneously builds a morph lexicon and represents the corpus with the induced lexicon using a probabilistic maximum a posteriori model.


Proceedings of the Ninth Neural Computation and Psychology Workshop | 2005

Latent Linguistic Codes for Morphemes Using Independent Component Analysis

Krista Lagus; Mathias Creutz; Sami Virpioja

We study properties of morphemes by analyzing their use in a large Finnish text corpus using Independent Component Analysis (ICA). As a result, we obtain emergent linguistic representations for the morphemes. On a coarse level, main syntactic categories are observed. On a more detailed level, the components depict potential thematic roles of the morphemes. An interesting question is whether these discovered lower-dimensional representations could be directly utilized in language processing applications.

Collaboration

Top Co-Authors


Teemu Hirsimäki

Helsinki University of Technology

Janne Pylkkönen

Helsinki University of Technology
