Peter Smit
Aalto University
Publications
Featured research published by Peter Smit.
Conference of the European Chapter of the Association for Computational Linguistics | 2014
Peter Smit; Sami Virpioja; Stig-Arne Grönroos; Mikko Kurimo
Morfessor is a family of probabilistic machine learning methods for finding morphological segmentations from raw text data. Recent developments include semi-supervised methods for utilizing annotated data. Morfessor 2.0 is a rewrite of the original, widely used Morfessor 1.0 software, with well-documented command-line tools and a library interface. It includes new features such as semi-supervised learning, online training, and integrated evaluation code.
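As a pointer to the library interface mentioned above, here is a minimal sketch of training and applying a Morfessor 2.0 baseline model from Python, assuming the morfessor package is installed; the corpus file name and the example word are illustrative.

```python
import morfessor

# Read a raw text training corpus (the file name is illustrative).
io = morfessor.MorfessorIO()
train_data = list(io.read_corpus_file("corpus.txt"))

# Train an unsupervised baseline model in batch mode.
model = morfessor.BaselineModel()
model.load_data(train_data)
model.train_batch()

# Segment an unseen word with the Viterbi algorithm.
segments, cost = model.viterbi_segment("uncharacteristically")
print(segments)
```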
International Workshop on Spoken Dialogue Systems | 2017
Graham Wilcock; Niklas Laxström; Juho Leinonen; Peter Smit; Mikko Kurimo; Kristiina Jokinen
We describe our work towards developing SamiTalk, a robot application for the North Sami language. With SamiTalk, users will hold spoken dialogues with a humanoid robot that speaks and recognizes North Sami. The robot will access information from the Sami Wikipedia, talk about requested topics using the Wikipedia texts, and make smooth topic shifts to related topics using the Wikipedia links. SamiTalk will be based on the existing WikiTalk system for Wikipedia-based spoken dialogues, with newly developed speech components for North Sami.
International Conference on Acoustics, Speech, and Signal Processing | 2011
Peter Smit; Mikko Kurimo
A common problem in speech recognition for foreign-accented speech is that there is not enough training data for an accent-specific or a speaker-specific recognizer. Speaker adaptation can be used to improve the accuracy of a speaker-independent recognizer, but a lot of adaptation data is needed for speakers with a strong foreign accent. In this paper we propose a simple and effective technique of stacked transformations, where the baseline models trained for native speakers are first adapted with a transformation estimated from accent-specific data and then with another transformation estimated from speaker-specific data. Because the accent-specific data can be collected offline, the first transformation can be detailed and comprehensive, while the second one stays lightweight and fast to estimate. Experimental results are provided for speaker adaptation in English spoken by Finnish speakers. The evaluation results confirm that the stacked transformations are very helpful for fast speaker adaptation.
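The idea behind stacked transformations can be sketched as two successive affine (CMLLR-style) transforms applied to the acoustic features, as in the toy example below; the matrices here are randomly generated placeholders, not transforms estimated by the paper's procedure.

```python
import numpy as np

def apply_affine(features, A, b):
    # Apply one affine (CMLLR-style) transform x -> A x + b to every frame.
    return features @ A.T + b

dim = 39                                   # e.g. MFCC feature dimension
rng = np.random.default_rng(0)

# Placeholder transforms: the accent-level one would be estimated offline from
# pooled accent-specific data, the speaker-level one from a small amount of
# speaker-specific data.
accent_A = np.eye(dim) + 0.01 * rng.standard_normal((dim, dim))
accent_b = 0.1 * rng.standard_normal(dim)
speaker_A = np.eye(dim)
speaker_b = 0.05 * rng.standard_normal(dim)

frames = rng.standard_normal((100, dim))   # one utterance of feature frames

# Stacked transformations: accent-level transform first, speaker-level on top.
adapted = apply_affine(apply_affine(frames, accent_A, accent_b), speaker_A, speaker_b)
print(adapted.shape)
```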
International Conference on Acoustics, Speech, and Signal Processing | 2012
Reima Karhila; Doddipatla Rama Sanand; Mikko Kurimo; Peter Smit
This paper describes experiments in creating personalised children's voices for HMM-based synthesis by adapting either an adult or a child average voice. The adult average voice is trained from a large adult speech database, whereas the child average voice is trained using a small database of children's speech. Here we present the idea of using stacked transformations for creating synthetic child voices, where the child average voice is first created from the adult average voice through speaker adaptation using all the pooled speech data from multiple children, and child-specific speaker adaptation is then added on top of it. Vocal tract length normalization (VTLN) is applied to speech synthesis to see whether it helps the speaker adaptation when only a small amount of adaptation data is available. The listening test results show that the stacked transformations significantly improve speaker adaptation for small amounts of data, but the additional benefit provided by VTLN is not yet clear.
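Since VTLN appears here only by name, the sketch below shows a generic piecewise-linear frequency warp of the kind used in VTLN, assuming a single warp factor per speaker; the boundary-frequency convention (0.8 of Nyquist) is an illustrative choice, and real toolkits differ in the details.

```python
def vtln_warp(freq, alpha, f_nyquist, boundary=0.8):
    # Generic piecewise-linear VTLN warp: scale frequencies by alpha below a
    # boundary frequency, then interpolate linearly so Nyquist maps to Nyquist.
    # The boundary fraction 0.8 is illustrative; conventions vary between toolkits.
    f0 = boundary * f_nyquist
    if freq <= f0:
        return alpha * freq
    # Linear segment from (f0, alpha * f0) up to (f_nyquist, f_nyquist).
    slope = (f_nyquist - alpha * f0) / (f_nyquist - f0)
    return alpha * f0 + slope * (freq - f0)

# Example: warp 3 kHz with a warp factor of 0.9 in an 8 kHz band.
print(vtln_warp(3000.0, 0.9, 8000.0))    # 2700.0
```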
IEEE Transactions on Audio, Speech, and Language Processing | 2017
Seppo Enarvi; Peter Smit; Sami Virpioja; Mikko Kurimo
Today, the vocabulary size for language models in large-vocabulary speech recognition is typically several hundred thousand words. While this is already sufficient in some applications, out-of-vocabulary words still limit usability in others. In agglutinative languages the vocabulary for conversational speech should include millions of word forms to cover the spelling variations due to colloquial pronunciations, in addition to word compounding and inflections. Very large vocabularies are also needed, for example, when the recognition of rare proper names is important. Previously, very large vocabularies have been efficiently modeled in conventional n-gram language models either by splitting words into subword units or by clustering words into classes. While vocabulary size is not as critical anymore in modern speech recognition systems, training time and memory consumption become an issue when state-of-the-art neural network language models are used. In this paper, we investigate techniques that address the vocabulary size issue by reducing the effective vocabulary size and by processing large vocabularies more efficiently. The experimental results in conversational Finnish and Estonian speech recognition indicate that properly defined word classes improve recognition accuracy. Subword n-gram models are not better on evaluation data than word n-gram models constructed from a vocabulary that includes all the words in the training corpus. However, when recurrent neural network (RNN) language models are used, their ability to utilize long contexts gives a larger gain to subword-based modeling. Our best results are from RNN language models that are based on statistical morphs. We show that the suitable size for a subword vocabulary depends on the language. Using time-delay neural network acoustic models, we were able to achieve a new state of the art in Finnish and Estonian conversational speech recognition: 27.1% word error rate in the Finnish task and 21.9% in the Estonian task.
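As a toy illustration of why subword units keep the vocabulary manageable in agglutinative languages, the snippet below compares word-type and subword-type counts for a few Finnish word forms; the segmentations are hand-written for illustration, not output of the models used in the paper.

```python
# Toy comparison of word-level vs. subword-level vocabulary size.
# The segmentations are hand-written for illustration only.
segmentations = {
    "talo":      ["talo"],
    "talossa":   ["talo", "ssa"],
    "talossani": ["talo", "ssa", "ni"],
    "taloissa":  ["talo", "i", "ssa"],
    "auto":      ["auto"],
    "autossa":   ["auto", "ssa"],
}

word_vocab = set(segmentations)
subword_vocab = {unit for units in segmentations.values() for unit in units}

print(len(word_vocab), "word types")        # 6
print(len(subword_vocab), "subword types")  # 5
```

Each new inflected form adds a word type but typically reuses existing subword types, so the subword vocabulary grows far more slowly than the word vocabulary.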
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | 2017
Peter Smit; Siva Reddy Gangireddy; Seppo Enarvi; Sami Virpioja; Mikko Kurimo
We describe the speech recognition systems we created for MGB-3, the 3rd Multi Genre Broadcast challenge, which this year consisted of building a system for transcribing Egyptian Dialect Arabic speech, using a large audio corpus of primarily Modern Standard Arabic speech and only a small amount (5 hours) of Egyptian adaptation data. Our system, a combination of different acoustic models, language models and lexical units, achieved a Multi-Reference Word Error Rate of 29.25%, the lowest in the competition. Also on the old MGB-2 task, which was run again to indicate progress, we achieved the lowest error rate: 13.2%. The result combines state-of-the-art speech recognition methods such as simple dialect adaptation for a Time-Delay Neural Network (TDNN) acoustic model (−27% errors compared to the baseline), Recurrent Neural Network Language Model (RNNLM) rescoring (an additional −5%), and system combination with Minimum Bayes Risk (MBR) decoding (yet another −10%). We also explored the use of morph and character language models, which was particularly beneficial in providing a rich pool of systems for the MBR decoding.
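To make the compounding of the quoted relative gains concrete, the short sketch below shows how successive relative word-error-rate reductions combine multiplicatively; the starting WER is hypothetical, and the relative gains are the approximate figures quoted above.

```python
# Relative WER reductions compound multiplicatively, not additively.
wer = 45.0                                  # hypothetical starting WER, in %
for stage, rel_gain in [("TDNN dialect adaptation", 0.27),
                        ("RNNLM rescoring", 0.05),
                        ("MBR system combination", 0.10)]:
    wer *= (1.0 - rel_gain)
    print(f"after {stage}: {wer:.1f}% WER")
```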
Archive | 2013
Sami Virpioja; Peter Smit; Stig-Arne Grönroos; Mikko Kurimo
International Conference on Computational Linguistics | 2014
Stig-Arne Grönroos; Sami Virpioja; Peter Smit; Mikko Kurimo
Conference of the International Speech Communication Association | 2017
Peter Smit; Sami Virpioja; Mikko Kurimo
International Workshop on Computational Linguistics for the Uralic Languages | 2016
Peter Smit; Juho Leinonen; Kristiina Jokinen; Mikko Kurimo