
Publications


Featured research published by Seppo Enarvi.


Language Resources and Evaluation | 2017

Modeling under-resourced languages for speech recognition

Mikko Kurimo; Seppo Enarvi; Ottokar Tilk; Matti Varjokallio; André Mansikkaniemi; Tanel Alumäe

One particular problem in large-vocabulary continuous speech recognition for low-resourced languages is finding relevant training data for the statistical language models. A large amount of data is required, because the models should estimate the probability of all possible word sequences. For Finnish, Estonian, and the other Finno-Ugric languages, a special problem with the data is the huge number of different word forms that are common in normal speech. The same problem also exists in other language technology applications such as machine translation and information retrieval, and to some extent in other morphologically rich languages. In this paper we present methods and evaluations in four recent language modeling topics: selecting conversational data from the Internet, adapting models for foreign words, multi-domain and adapted neural network language modeling, and decoding with subword units. Our evaluations show that the same methods work in more than one language and that they scale down to smaller data resources.
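The subword decoding mentioned in the abstract relies on splitting word forms into smaller units so that a limited unit inventory covers the many inflected forms of a morphologically rich language. The sketch below illustrates the general idea with a greedy longest-match segmenter; the unit inventory and example words are toy values chosen for illustration, not the paper's actual models.

```python
# Hedged sketch: segmenting words into subword units, a toy stand-in for the
# kind of subword vocabularies used in the paper. The unit set is illustrative.

def segment(word, units):
    """Greedy longest-match segmentation of a word into subword units.
    Falls back to single characters so every word can be covered."""
    parts = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in units:
                parts.append(word[i:j])
                i = j
                break
        else:
            parts.append(word[i])  # character fallback for unknown material
            i += 1
    return parts

# Toy Finnish-like example: "taloissa" ("in the houses") = talo + i + ssa
units = {"talo", "i", "ssa", "auto", "lla"}
print(segment("taloissa", units))  # ['talo', 'i', 'ssa']
```

In practice such inventories are learned statistically (e.g. with morph segmentation tools) rather than listed by hand, and the language model is then trained over unit sequences instead of word sequences.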


Conference of the International Speech Communication Association | 2016

TheanoLM: An extensible toolkit for neural network language modeling

Seppo Enarvi; Mikko Kurimo

We present a new tool for training neural network language models (NNLMs), scoring sentences, and generating text. The tool is written using the Python library Theano, which allows researchers to easily extend it and tune any aspect of the training process. Despite this flexibility, Theano is able to generate extremely fast native code that can utilize a GPU or multiple CPU cores to parallelize the heavy numerical computations. The tool has been evaluated in difficult Finnish and English conversational speech recognition tasks, and significant improvements were obtained over our best back-off n-gram models. The results we obtained in the Finnish task were compared to those from the existing RNNLM and RWTHLM toolkits and found to be as good or better, while training times were an order of magnitude shorter.
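One of the toolkit's core operations, scoring sentences, amounts to summing the log-probability the network assigns to each token given its context. The minimal sketch below shows that operation in isolation; the uniform toy distribution stands in for a trained network and is purely an assumption for illustration, not TheanoLM's API.

```python
import math

# Hedged sketch of sentence scoring as an NNLM toolkit performs it: the total
# log-probability of a token sequence, including the end-of-sentence token.
# The toy model below is an illustrative assumption, not a trained network.

def score_sentence(words, logprob):
    """Sum log P(w | previous word) over the sentence, ending with </s>.
    A real NNLM conditions on the full history via its hidden state."""
    total = 0.0
    prev = "<s>"
    for w in words + ["</s>"]:
        total += logprob(prev, w)
        prev = w
    return total

# Toy "model": uniform over a 4-word vocabulary.
V = 4
toy = lambda prev, w: math.log(1.0 / V)
s = score_sentence(["hello", "world"], toy)  # 3 tokens scored
```

In speech recognition rescoring, such sentence scores are computed for each hypothesis in an n-best list and interpolated with the acoustic and baseline language model scores.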


2013 7th Conference on Speech Technology and Human-Computer Dialogue (SpeD) | 2013

A novel discriminative method for pruning pronunciation dictionary entries

Seppo Enarvi; Mikko Kurimo

In this paper we describe a novel discriminative method for pruning a pronunciation dictionary. The algorithm removes from the dictionary those entries that negatively affect the speech recognition word error rate. The implementation is simple and requires no tunable parameters. We have carried out preliminary speech recognition experiments, pruning multiword pronunciations created by a phonetician. In the task at hand, we achieved only minimal improvements in recognition results. We are optimistic that the algorithm will prove useful in pruning larger dictionaries containing automatically generated pronunciations.
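The core idea, removing entries whose presence hurts the word error rate, can be sketched as a greedy loop over dictionary entries. This is only an illustration of the principle under stated assumptions: `evaluate_wer` here is a toy stand-in for running the recognizer on development data, and the dictionary contents are invented.

```python
# Hedged sketch of discriminative dictionary pruning: drop each entry whose
# removal strictly lowers the error rate on held-out data. evaluate_wer is a
# hypothetical stand-in for a full recognition run; entries are illustrative.

def prune(dictionary, evaluate_wer):
    """Greedily remove entries whose removal lowers the word error rate."""
    entries = dict(dictionary)
    baseline = evaluate_wer(entries)
    for key in list(entries):
        trial = {k: v for k, v in entries.items() if k != key}
        wer = evaluate_wer(trial)
        if wer < baseline:        # this entry was hurting recognition
            entries = trial
            baseline = wer
    return entries

# Toy evaluator: pretend one reduced pronunciation of "kind of" adds errors.
def toy_wer(d):
    return 10.0 + (1.0 if ("kind_of", "k ai n @ v") in d else 0.0)

d = {("kind_of", "k ai n d @ v"): 1, ("kind_of", "k ai n @ v"): 1}
pruned = prune(d, toy_wer)
print(len(pruned))  # 1
```

As the abstract notes, no tunable parameters are needed: the decision for each entry is made directly from its measured effect on the error rate.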


IEEE Transactions on Audio, Speech, and Language Processing | 2017

Automatic Speech Recognition With Very Large Conversational Finnish and Estonian Vocabularies

Seppo Enarvi; Peter Smit; Sami Virpioja; Mikko Kurimo

Today, the vocabulary size for language models in large-vocabulary speech recognition is typically several hundred thousand words. While this is already sufficient in some applications, out-of-vocabulary words still limit usability in others. In agglutinative languages, the vocabulary for conversational speech should include millions of word forms to cover the spelling variations due to colloquial pronunciations, in addition to word compounding and inflection. Very large vocabularies are also needed, for example, when the recognition of rare proper names is important. Previously, very large vocabularies have been efficiently modeled in conventional n-gram language models either by splitting words into subword units or by clustering words into classes. While vocabulary size is not as critical anymore in modern speech recognition systems, training time and memory consumption become an issue when state-of-the-art neural network language models are used. In this paper, we investigate techniques that address the vocabulary size issue by reducing the effective vocabulary size and by processing large vocabularies more efficiently. The experimental results in conversational Finnish and Estonian speech recognition indicate that properly defined word classes improve recognition accuracy. Subword n-gram models are not better on evaluation data than word n-gram models constructed from a vocabulary that includes all the words in the training corpus. However, when recurrent neural network (RNN) language models are used, their ability to utilize long contexts gives a larger gain to subword-based modeling. Our best results are from RNN language models based on statistical morphs. We show that the suitable size for a subword vocabulary depends on the language. Using time-delay neural network acoustic models, we achieved a new state of the art in Finnish and Estonian conversational speech recognition: 27.1% word error rate in the Finnish task and 21.9% in the Estonian task.
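The word-class approach mentioned in the abstract reduces the effective vocabulary by factoring each word probability into a class-transition term and a within-class term: P(w | h) ≈ P(c(w) | c(h)) · P(w | c(w)). The sketch below shows that decomposition for a bigram context; the class map and probabilities are toy values for illustration, not the paper's trained models.

```python
import math

# Hedged illustration of the class-based decomposition behind word-class
# language models: P(w | prev) = P(class(w) | class(prev)) * P(w | class(w)).
# All numbers below are invented toy values.

word2class = {"talo": "NOUN", "auto": "NOUN", "menee": "VERB"}
p_class_bigram = {("VERB", "NOUN"): 0.5}                    # P(next class | prev class)
p_word_in_class = {"talo": 0.6, "auto": 0.4, "menee": 1.0}  # P(word | its class)

def class_bigram_logprob(prev, word):
    """Log-probability of `word` following `prev` under the class model."""
    pc = p_class_bigram[(word2class[prev], word2class[word])]
    pw = p_word_in_class[word]
    return math.log(pc) + math.log(pw)

lp = class_bigram_logprob("menee", "talo")  # log(0.5 * 0.6)
```

The benefit for neural models is that the expensive output softmax ranges over classes (and words within one class) rather than the full multi-million-word vocabulary.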


2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | 2017

Aalto system for the 2017 Arabic multi-genre broadcast challenge

Peter Smit; Siva Reddy Gangireddy; Seppo Enarvi; Sami Virpioja; Mikko Kurimo

We describe the speech recognition systems we created for MGB-3, the 3rd Multi-Genre Broadcast challenge, which this year consisted of building a system for transcribing Egyptian Dialect Arabic speech using a large audio corpus of primarily Modern Standard Arabic speech and only a small amount (5 hours) of Egyptian adaptation data. Our system, a combination of different acoustic models, language models, and lexical units, achieved a Multi-Reference Word Error Rate of 29.25%, the lowest in the competition. On the old MGB-2 task, which was run again to indicate progress, we also achieved the lowest error rate: 13.2%. The result is a combination of state-of-the-art speech recognition methods such as simple dialect adaptation for a Time-Delay Neural Network (TDNN) acoustic model (−27% errors compared to the baseline), Recurrent Neural Network Language Model (RNNLM) rescoring (an additional −5%), and system combination with Minimum Bayes Risk (MBR) decoding (yet another −10%). We also explored the use of morph and character language models, which proved particularly beneficial in providing a rich pool of systems for the MBR decoding.


Archive | 2013

Studies on Training Text Selection for Conversational Finnish Language Modeling

Seppo Enarvi; Mikko Kurimo


Archive | 2018

Modeling Conversational Finnish for Automatic Speech Recognition

Seppo Enarvi


Conference of the International Speech Communication Association | 2017

SIAK — A Game for Foreign Language Pronunciation Learning

Reima Karhila; Sari Ylinen; Seppo Enarvi; Kalle J. Palomäki; Aleksander Nikulin; Olli Rantula; Vertti Viitanen; Krupakar Dhinakaran; Anna-Riikka Smolander; Heini Kallio; Katja Junttila; Maria Uther; Perttu Hämäläinen; Mikko Kurimo


2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | 2017

Character-based units for unlimited vocabulary continuous speech recognition

Peter Smit; Siva Reddy Gangireddy; Seppo Enarvi; Sami Virpioja; Mikko Kurimo


Archive | 2013

10th International Workshop on Spoken Language Translation (IWSLT 2013), Heidelberg, 5–6 Dec 2013

Seppo Enarvi; Mikko Kurimo
