Publication


Featured research published by Matti Varjokallio.


ACM Transactions on Speech and Language Processing | 2007

Morph-based speech recognition and modeling of out-of-vocabulary words across languages

Mathias Creutz; Teemu Hirsimäki; Mikko Kurimo; Antti Puurula; Janne Pylkkönen; Vesa Siivola; Matti Varjokallio; Ebru Arisoy; Murat Saraclar; Andreas Stolcke

We explore the use of morph-based language models in large-vocabulary continuous-speech recognition systems across four so-called morphologically rich languages: Finnish, Estonian, Turkish, and Egyptian Colloquial Arabic. The morphs are subword units discovered in an unsupervised, data-driven way using the Morfessor algorithm. By estimating n-gram language models over sequences of morphs instead of words, the quality of the language model is improved through better vocabulary coverage and reduced data sparsity. Standard word models suffer from high out-of-vocabulary (OOV) rates, whereas the morph models can recognize previously unseen word forms by concatenating morphs. It is shown that the morph models do perform fairly well on OOVs without compromising the recognition accuracy on in-vocabulary words. The Arabic experiment constitutes the only exception since here the standard word model outperforms the morph model. Differences in the datasets and the amount of data are discussed as a plausible explanation.


Cross-Language Evaluation Forum | 2008

Morpho Challenge Evaluation Using a Linguistic Gold Standard

Mikko Kurimo; Mathias Creutz; Matti Varjokallio

In Morpho Challenge 2007, the objective was to design statistical machine learning algorithms that discover which morphemes (smallest individually meaningful units of language) words consist of. Ideally, these are basic vocabulary units suitable for different tasks, such as text understanding, machine translation, information retrieval, and statistical language modeling. Because in unsupervised morpheme analysis the morphemes can have arbitrary names, the analyses are here evaluated by a comparison to a linguistic gold standard by matching the morpheme-sharing word pairs. The data sets were provided for four languages: Finnish, German, English, and Turkish and the participants were encouraged to apply their algorithm to all of them. The results show significant variance between the methods and languages, but the best methods seem to be useful in all tested languages and match quite well with the linguistic analysis.
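The pair-matching idea in the abstract above can be illustrated with a minimal sketch. It assumes a simplified, exhaustive version of the comparison: every word pair that shares a morpheme label in one analysis is checked against the other analysis. The function names and the toy analyses are hypothetical, and the official evaluation may sample and weight pairs differently.

```python
from itertools import combinations

def sharing_pairs(analyses):
    """Return the set of word pairs that share at least one morpheme label."""
    pairs = set()
    for w1, w2 in combinations(sorted(analyses), 2):
        if set(analyses[w1]) & set(analyses[w2]):
            pairs.add((w1, w2))
    return pairs

def pair_precision_recall(proposed, gold):
    """Compare morpheme-sharing word pairs; label names never need to match."""
    proposed_pairs = sharing_pairs(proposed)
    gold_pairs = sharing_pairs(gold)
    hits = len(proposed_pairs & gold_pairs)
    precision = hits / len(proposed_pairs) if proposed_pairs else 0.0
    recall = hits / len(gold_pairs) if gold_pairs else 0.0
    return precision, recall

# Toy data (hypothetical): the proposed labels m1..m4 are arbitrary names.
gold = {"cat": ["cat"], "cats": ["cat", "+PL"],
        "dog": ["dog"], "dogs": ["dog", "+PL"]}
proposed = {"cat": ["m1"], "cats": ["m1", "m2"],
            "dog": ["m4"], "dogs": ["m3", "m2"]}

precision, recall = pair_precision_recall(proposed, gold)
```

Because only the sharing relation between words is compared, the arbitrary morpheme names produced by an unsupervised method do not affect the score.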


Cross-Language Evaluation Forum | 2008

Overview of Morpho Challenge 2008

Mikko Kurimo; Ville T. Turunen; Matti Varjokallio

This paper gives an overview of the Morpho Challenge 2008 competition and its results. The goal of the challenge was to evaluate unsupervised algorithms that provide morpheme analyses for words in different languages. For morphologically complex languages, such as Finnish, Turkish and Arabic, morpheme analysis is particularly important for lexical modeling of words in speech recognition, information retrieval and machine translation. The evaluation in Morpho Challenge competitions consisted of both a linguistic and an application-oriented performance analysis. In addition to the Finnish, Turkish, German and English evaluations performed in Morpho Challenge 2007, the competition this year had an additional evaluation for Arabic. The results of the linguistic evaluation in 2008 show that although the level of precision and recall varies substantially between the tasks in different languages, the best methods seem to deal quite well with all languages involved. The results of the information retrieval evaluation indicate that morpheme analysis has a significant effect in all the tested languages (Finnish, English and German). The best unsupervised and language-independent morpheme analysis methods can also rival the best language-dependent word normalization methods. The Morpho Challenge was part of the EU Network of Excellence PASCAL Challenge Program and was organized in collaboration with CLEF.


Language Resources and Evaluation | 2017

Modeling under-resourced languages for speech recognition

Mikko Kurimo; Seppo Enarvi; Ottokar Tilk; Matti Varjokallio; André Mansikkaniemi; Tanel Alumäe

One particular problem in large vocabulary continuous speech recognition for low-resourced languages is finding relevant training data for the statistical language models. A large amount of data is required, because the models should estimate the probability of all possible word sequences. For Finnish, Estonian and the other Finno-Ugric languages, a special problem with the data is the huge number of different word forms that are common in normal speech. The same problem also exists in other language technology applications, such as machine translation and information retrieval, and to some extent in other morphologically rich languages. In this paper we present methods and evaluations in four recent language modeling topics: selecting conversational data from the Internet, adapting models for foreign words, multi-domain and adapted neural network language modeling, and decoding with subword units. Our evaluations show that the same methods work in more than one language and that they scale down to smaller data resources.


IEEE Automatic Speech Recognition and Understanding Workshop | 2013

Learning a subword vocabulary based on unigram likelihood

Matti Varjokallio; Mikko Kurimo; Sami Virpioja

Using words as vocabulary units for tasks like speech recognition is infeasible for many morphologically rich languages, including Finnish. Thus, subword units are commonly used for language modeling. This work presents a novel algorithm for creating a subword vocabulary, based on the unigram likelihood of a text corpus. The method is evaluated with an entropy measure and on a Finnish LVCSR task. The unigram entropy of the text corpus is shown to be a good indicator of the quality of higher-order n-gram models, also resulting in high speech recognition accuracy.
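To make the notion of unigram likelihood over subwords concrete, here is a minimal sketch of segmenting a word with a Viterbi search under a fixed unigram subword model. This is not the vocabulary-learning algorithm of the paper, whose details the abstract does not give; it only shows how a likelihood could be scored once a candidate vocabulary exists. The function name, subword inventory and probabilities are hypothetical.

```python
import math

def viterbi_segment(word, logprob):
    """Find the most likely split of `word` into subwords from `logprob`.

    Returns (segments, log-likelihood), or (None, -inf) if no split exists.
    """
    n = len(word)
    # best[i] = (best score for word[:i], start index of the last piece)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in logprob and best[start][0] > -math.inf:
                score = best[start][0] + logprob[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None, -math.inf
    # Follow the back-pointers from the end of the word to recover the pieces.
    segments = []
    end = n
    while end > 0:
        start = best[end][1]
        segments.append(word[start:end])
        end = start
    return segments[::-1], best[n][0]

# Hypothetical unigram probabilities for a tiny subword vocabulary.
logprob = {s: math.log(p) for s, p in
           {"talo": 0.2, "ssa": 0.1, "t": 0.05, "alo": 0.01, "talossa": 0.001}.items()}

segments, score = viterbi_segment("talossa", logprob)
```

Summing such scores over all word tokens of a corpus gives the corpus unigram log-likelihood that a vocabulary-selection procedure could then try to maximize.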


Spoken Language Technology Workshop | 2014

A word-level token-passing decoder for subword n-gram LVCSR

Matti Varjokallio; Mikko Kurimo

The decoder is a key component of any modern speech recognizer. Morphologically rich languages pose special challenges for decoder design, as a very large recognition vocabulary is required to avoid high out-of-vocabulary (OOV) rates. To alleviate these issues, the n-gram models are often trained over subwords instead of words. A subword n-gram model is able to assign probabilities to unseen word forms. We review token-passing decoding and suggest a novel way of creating the decoding graph for subword n-grams at the word level. This approach has the advantage of better control over the recognition vocabulary, including the removal of nonsense words and the possibility of including important OOV words in the graph. The different decoders are evaluated in a Finnish large vocabulary continuous speech recognition (LVCSR) task.


International Conference on Statistical Language and Speech Processing | 2016

Class n-Gram Models for Very Large Vocabulary Speech Recognition of Finnish and Estonian

Matti Varjokallio; Mikko Kurimo; Sami Virpioja

We study class n-gram models for very large vocabulary speech recognition of Finnish and Estonian. The models are trained with vocabulary sizes of several million words using automatically derived classes. To evaluate the models on a Finnish and an Estonian broadcast news speech recognition task, we modify Aalto University’s LVCSR decoder to operate with the class n-grams and very large vocabularies. Linear interpolation of a standard n-gram model and a class n-gram model provides relative perplexity improvements of 21.3 % for Finnish and 12.8 % for Estonian over the n-gram model. The relative improvements in word error rates are 5.5 % for Finnish and 7.4 % for Estonian. We also compare our word-based models to a state-of-the-art unlimited vocabulary recognizer utilizing subword n-gram models, and show that the very large vocabulary word-based models can perform equally well or better.
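The interpolation and the relative improvement measure mentioned in the abstract above can be sketched as follows. The sketch assumes the standard class n-gram factorization, P(w | h) = P(c(w) | class history) · P(w | c(w)), which may differ in detail from the models used in the paper. All probabilities and the interpolation weight are made-up toy values, not results from the paper.

```python
import math

def class_ngram_prob(p_class_given_history, p_word_given_class):
    """Standard class n-gram factorization: P(c | class history) * P(w | c)."""
    return p_class_given_history * p_word_given_class

def interpolate(p_ngram, p_class, lam):
    """Linear interpolation of two model probabilities with weight lam in [0, 1]."""
    return lam * p_ngram + (1.0 - lam) * p_class

def perplexity(probs):
    """Perplexity of a sequence given the per-word model probabilities."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

def relative_improvement(baseline, improved):
    """Relative reduction in percent (applies to perplexity or word error rate)."""
    return 100.0 * (baseline - improved) / baseline

# Toy per-word probabilities for a three-word test sequence (hypothetical).
ngram_probs = [0.10, 0.02, 0.05]
class_probs = [class_ngram_prob(0.3, 0.2),
               class_ngram_prob(0.5, 0.1),
               class_ngram_prob(0.4, 0.25)]
mixed_probs = [interpolate(a, b, 0.5) for a, b in zip(ngram_probs, class_probs)]

gain = relative_improvement(perplexity(ngram_probs), perplexity(mixed_probs))
```

In practice the interpolation weight would be tuned on held-out data rather than fixed at 0.5.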


Archive | 2006

Unsupervised segmentation of words into morphemes - Challenge 2005, An Introduction and Evaluation Report

Mikko Kurimo; Mathias Creutz; Matti Varjokallio; Ebru Arisoy; Murat Saraclar


North American Chapter of the Association for Computational Linguistics | 2007

Analysis of Morph-Based Speech Recognition and the Modeling of Out-of-Vocabulary Words Across Languages

Mathias Creutz; Teemu Hirsimäki; Mikko Kurimo; Antti Puurula; Janne Pylkkönen; Vesa Siivola; Matti Varjokallio; Ebru Arısoy; Murat Saraclar; Andreas Stolcke


Conference of the International Speech Communication Association | 2006

Unsupervised segmentation of words into morphemes - Morpho Challenge 2005: Application to Automatic Speech Recognition

Mikko Kurimo; Mathias Creutz; Matti Varjokallio; Ebru Arisoy; Murat Saraclar

Collaboration


Top Co-Authors

Mathias Creutz
Helsinki University of Technology

Janne Pylkkönen
Helsinki University of Technology

Antti Puurula
Helsinki University of Technology

Teemu Hirsimäki
Helsinki University of Technology

Vesa Siivola
Helsinki University of Technology

Ville T. Turunen
Helsinki University of Technology