Janne Pylkkönen
Helsinki University of Technology
Publication
Featured research published by Janne Pylkkönen.
Computer Speech & Language | 2006
Teemu Hirsimäki; Mathias Creutz; Vesa Siivola; Mikko Kurimo; Sami Virpioja; Janne Pylkkönen
In the speech recognition of highly inflecting or compounding languages, the traditional word-based language modeling is problematic. As the number of distinct word forms can grow very large, it becomes difficult to train language models that are both effective and cover the words of the language well. In the literature, several methods have been proposed for basing the language modeling on sub-word units instead of whole words. However, to our knowledge, considerable improvements in speech recognition performance have not been reported. In this article, we present a language-independent algorithm for discovering word fragments in an unsupervised manner from text. The algorithm uses the Minimum Description Length principle to find an inventory of word fragments that is compact but models the training text effectively. Language modeling and speech recognition experiments show that n-gram models built over these fragments perform better than n-gram models based on words. In two Finnish recognition tasks, relative error rate reductions between 12% and 31% are obtained. In addition, our experiments suggest that word fragments obtained using grammatical rules do not outperform the fragments discovered from text. We also present our recognition system and discuss how utilizing fragments instead of words affects the decoding process.
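The Minimum Description Length idea in the abstract can be illustrated with a toy two-part cost: bits to store the fragment lexicon plus bits to encode the corpus under a unigram fragment model. This is only a sketch of the principle, not the paper's actual algorithm; the corpus and segmentations below are invented for illustration.

```python
import math
from collections import Counter

def description_length(corpus_fragments, lexicon):
    """Two-part MDL cost: bits to encode the fragment lexicon
    plus bits to encode the corpus as a sequence of fragments."""
    # Lexicon cost: roughly 8 bits per stored character.
    lexicon_bits = sum(len(f) for f in lexicon) * 8
    # Corpus cost: negative log-likelihood under a unigram fragment model.
    counts = Counter(corpus_fragments)
    total = sum(counts.values())
    corpus_bits = -sum(c * math.log2(c / total) for c in counts.values())
    return lexicon_bits + corpus_bits

# Toy Finnish-like corpus segmented two ways (hypothetical data).
whole_words = ["talossa", "talossa", "autossa", "autossa", "talo", "auto"]
fragments = ["talo", "ssa", "talo", "ssa", "auto", "ssa", "auto", "ssa",
             "talo", "auto"]

dl_words = description_length(whole_words, set(whole_words))
dl_frags = description_length(fragments, set(fragments))
print(dl_words, dl_frags)  # the fragment lexicon yields the smaller total cost
```

A compact fragment inventory pays a smaller lexicon cost while still encoding the corpus efficiently, which is the trade-off the MDL search optimizes.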
ACM Transactions on Speech and Language Processing | 2007
Mathias Creutz; Teemu Hirsimäki; Mikko Kurimo; Antti Puurula; Janne Pylkkönen; Vesa Siivola; Matti Varjokallio; Ebru Arisoy; Murat Saraclar; Andreas Stolcke
We explore the use of morph-based language models in large-vocabulary continuous-speech recognition systems across four so-called morphologically rich languages: Finnish, Estonian, Turkish, and Egyptian Colloquial Arabic. The morphs are subword units discovered in an unsupervised, data-driven way using the Morfessor algorithm. By estimating n-gram language models over sequences of morphs instead of words, the quality of the language model is improved through better vocabulary coverage and reduced data sparsity. Standard word models suffer from high out-of-vocabulary (OOV) rates, whereas the morph models can recognize previously unseen word forms by concatenating morphs. It is shown that the morph models do perform fairly well on OOVs without compromising the recognition accuracy on in-vocabulary words. The Arabic experiment constitutes the only exception since here the standard word model outperforms the morph model. Differences in the datasets and the amount of data are discussed as a plausible explanation.
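The abstract notes that morph models can recognize previously unseen word forms by concatenating morphs. A minimal sketch of that reconstruction step, assuming a word-boundary symbol in the decoded morph stream (the symbol name and the example words are assumptions, not taken from the paper):

```python
def morphs_to_words(morph_sequence, boundary="<w>"):
    """Join decoded morph tokens back into words; a word-boundary
    symbol in the morph stream marks where concatenation stops."""
    words, current = [], []
    for token in morph_sequence:
        if token == boundary:
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(token)
    if current:
        words.append("".join(current))
    return words

# A decoded morph stream covering an unseen Finnish word form (toy example):
decoded = ["<w>", "epä", "järjestelmä", "llisyys", "<w>", "auto", "ssa", "<w>"]
print(morphs_to_words(decoded))  # ['epäjärjestelmällisyys', 'autossa']
```

Because the lexicon stores morphs rather than full word forms, any word composed of known morphs is reachable even if it never occurred in the training data.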
IEEE Transactions on Audio, Speech, and Language Processing | 2009
Teemu Hirsimäki; Janne Pylkkönen; Mikko Kurimo
Speech recognition systems trained for morphologically rich languages face the problem of vocabulary growth caused by prefixes, suffixes, inflections, and compound words. Solutions proposed in the literature include increasing the size of the vocabulary and segmenting words into morphs. However, in many cases, the methods have only been tested with low-order n-gram models or compared to word-based models that do not have very large vocabularies. In this paper, we study the importance of using high-order variable-length n-gram models when the language models are trained over morphs instead of whole words. Language models trained on a very large vocabulary are compared with models based on different morph segmentations. Speech recognition experiments are carried out on two highly inflecting and agglutinative languages, Finnish and Estonian. The results suggest that high-order models can be essential in morph-based speech recognition, even when lattices are generated for two-pass recognition. The analysis of recognition errors reveals that the high-order morph language models especially improve the recognition of previously unseen words.
Language and Technology Conference | 2006
Mikko Kurimo; Antti Puurula; Ebru Arisoy; Vesa Siivola; Teemu Hirsimäki; Janne Pylkkönen; Tanel Alumäe; Murat Saraclar
It is practically impossible to build a word-based lexicon for speech recognition in agglutinative languages that would cover all the relevant words. The problem is that words are generally built by concatenating several prefixes and suffixes to the word roots. Together with compounding and inflections, this leads to millions of different but still frequent word forms. Due to inflections, ambiguity, and other phenomena, it is also not trivial to automatically split words into meaningful parts. Rule-based morphological analyzers can perform this splitting, but because of their handcrafted rules they also suffer from an out-of-vocabulary problem. In this paper we apply a recently proposed, fully automatic, and largely language- and vocabulary-independent method to build sub-word lexica for three different agglutinative languages. We also demonstrate language portability by building a successful large vocabulary speech recognizer for each language and show superior recognition performance compared to the corresponding word-based reference systems.
IEEE Transactions on Audio, Speech, and Language Processing | 2012
Janne Pylkkönen; Mikko Kurimo
Discriminative training is an essential part of building a state-of-the-art speech recognition system. The Extended Baum–Welch (EBW) algorithm is the most popular method to carry out this demanding large-scale optimization task. This paper presents a novel analysis of the EBW algorithm which shows that EBW performs a specific kind of constrained optimization. The constraints show an interesting connection between the improvement of the discriminative criterion and the Kullback–Leibler divergence (KLD). Based on the analysis, a novel method for controlling the EBW algorithm is proposed. The presented analysis uses decomposed formulae for Gaussian mixture KLDs which correspond to the ones used in the Constrained Line Search (CLS) optimization algorithm. The CLS algorithm for discriminative training is therefore also briefly presented and its connections to EBW studied. Large vocabulary speech recognition experiments are used to evaluate the proposed control of EBW, which is shown to outperform the common heuristics in model robustness. Comparison of EBW to CLS also shows differences in robustness, in favor of EBW. The constraints for Gaussian parameter optimization as well as the special mixture weight estimation method used with EBW are shown to be the key factors for good performance.
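The KLD constraints discussed above build on divergences between Gaussians. For diagonal-covariance Gaussians, the closed-form KL divergence is standard and easy to compute; the sketch below shows that building block only (the paper's decomposed mixture formulae and EBW control are not reproduced here), with invented example parameters.

```python
import math

def kld_diag_gaussian(mu_p, var_p, mu_q, var_q):
    """Kullback-Leibler divergence KL(p || q) between two
    diagonal-covariance Gaussians, summed over dimensions."""
    kld = 0.0
    for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q):
        kld += 0.5 * (math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0)
    return kld

# KL(p || p) is zero; the divergence grows as q's mean moves away from p's.
p_mu, p_var = [0.0, 1.0], [1.0, 0.5]
print(kld_diag_gaussian(p_mu, p_var, p_mu, p_var))        # 0.0
print(kld_diag_gaussian(p_mu, p_var, [2.0, 1.0], p_var))  # 2.0
```

Bounding such divergences between old and updated model parameters is one way to constrain how far a single EBW-style update can move the model.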
Conference of the International Speech Communication Association | 2016
Thomas Drugman; Janne Pylkkönen; Reinhard Kneser
The goal of this paper is to simulate the benefits of jointly applying active learning (AL) and semi-supervised training (SST) in a new speech recognition application. Our data selection approach relies on confidence filtering, and its impact on both the acoustic and language models (AM and LM) is studied. While AL is known to be beneficial to AM training, we show that it also brings substantial improvements to the LM when combined with SST. Sophisticated confidence models, on the other hand, did not yield any data selection gain. Our results indicate that, while SST is crucial at the beginning of the labeling process, its gains degrade rapidly once AL is in place. The final simulation shows that AL allows a transcription cost reduction of about 70% over random selection. Alternatively, for a fixed transcription budget, the proposed approach improves the word error rate by about 12.5% relative.
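Confidence filtering for combined AL and SST can be sketched as routing each utterance by its recognizer confidence. The thresholds, the middle-band discard, and the field names below are assumptions for illustration, not the paper's exact selection scheme.

```python
def select_for_training(utterances, low=0.4, high=0.9):
    """Confidence filtering (hypothetical thresholds): route
    low-confidence utterances to manual transcription (active
    learning) and high-confidence ones to semi-supervised training
    with their ASR hypotheses as labels; drop the middle band."""
    to_transcribe, semi_supervised = [], []
    for utt_id, hypothesis, confidence in utterances:
        if confidence < low:
            to_transcribe.append(utt_id)                  # AL: worth human effort
        elif confidence > high:
            semi_supervised.append((utt_id, hypothesis))  # SST: trust the ASR output
    return to_transcribe, semi_supervised

batch = [("u1", "hello world", 0.95), ("u2", "hollow word", 0.30),
         ("u3", "held word", 0.60)]
print(select_for_training(batch))
# (['u2'], [('u1', 'hello world')])
```

The intuition matches the abstract: human transcription effort is spent where the recognizer is least certain, while confident hypotheses are recycled as training labels.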
International Conference on Acoustics, Speech, and Signal Processing | 2007
Ulpu Remes; Janne Pylkkönen; Mikko Kurimo
Speaker adaptation is commonly used to compensate for speaker variation in large vocabulary continuous speech recognition. In a multi-speaker environment where speakers change frequently, speaker segregation is needed to divide the input audio stream into speaker turns. Speaker turns define the current speaker at each time, and speaker adaptation can thus be done based on speaker turns. The novelty of this paper is that the speaker-specific transformations are estimated incrementally and in tandem with speaker segregation. Therefore, we need a transformation that can be reliably estimated from one speaker turn alone. We propose constrained maximum likelihood linear regression (CMLLR) for this. In tests with Finnish TV news audio, speaker adaptation reduced the average letter error rate by 25% relative to the baseline.
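At adaptation time, CMLLR amounts to applying a per-speaker affine transform x' = Ax + b to the feature vectors. The sketch below shows only that application step with an invented 2-D transform; estimating A and b from a speaker turn (the hard part) is not shown.

```python
def apply_cmllr(features, A, b):
    """Apply a per-speaker affine feature transform x' = A x + b,
    the form used by constrained MLLR (toy 2-D example)."""
    transformed = []
    for x in features:
        y = [sum(A[i][j] * x[j] for j in range(len(x))) + b[i]
             for i in range(len(A))]
        transformed.append(y)
    return transformed

A = [[1.1, 0.0], [0.0, 0.9]]   # hypothetical transform, as might be
b = [0.5, -0.2]                # estimated from a single speaker turn
print(apply_cmllr([[1.0, 2.0]], A, b))  # [[1.6, 1.6]]
```

Because the transform acts on the features rather than on the millions of acoustic model parameters, relatively little data, here a single speaker turn, suffices to estimate it reliably.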
Conference of the International Speech Communication Association | 2016
Janne Pylkkönen; Thomas Drugman; Max Bisani
Producing large enough quantities of high-quality transcriptions for accurate and reliable evaluation of an automatic speech recognition (ASR) system can be costly. It is therefore desirable to minimize the manual transcription work for producing metrics with an agreed precision. In this paper we demonstrate how to improve ASR evaluation precision using stratified sampling. We show that by altering the sampling, the deviations observed in the error metrics can be reduced by up to 30% compared to random sampling, or alternatively, the same precision can be obtained on about 30% smaller datasets. We compare different variants for conducting stratified sampling, including a novel sample allocation scheme tailored for word error rate. Experimental evidence is provided to assess the effect of different sampling schemes on evaluation precision.
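A classical ingredient of stratified sampling is Neyman allocation, which spends a fixed sample budget on strata in proportion to their size times their variability. The paper proposes a WER-tailored allocation scheme; the sketch below shows only the standard Neyman rule, with invented strata.

```python
def neyman_allocation(strata, budget):
    """Neyman allocation: distribute a fixed transcription budget
    over strata proportionally to N_h * sigma_h, which minimizes
    the variance of the stratified estimate."""
    weights = [n * sigma for n, sigma in strata]
    total = sum(weights)
    return [round(budget * w / total) for w in weights]

# Hypothetical strata: (number of utterances, estimated error-rate std-dev).
strata = [(5000, 0.05), (3000, 0.15), (2000, 0.30)]
print(neyman_allocation(strata, 1000))  # [192, 346, 462]
```

Note how the smallest but most variable stratum receives the largest share of the budget; uniform or proportional-to-size sampling would waste transcriptions on the large, low-variance stratum.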
Conference of the International Speech Communication Association | 2004
Janne Pylkkönen; Mikko Kurimo
Conference of the International Speech Communication Association | 2005
Janne Pylkkönen