Péter Mihajlik
Budapest University of Technology and Economics
Publications
Featured research published by Péter Mihajlik.
International Journal of Speech Technology | 2000
Máté Szarvas; Tibor Fegyó; Péter Mihajlik; Péter Tatai
This article describes the problems encountered during the design and implementation of automatic speech recognition systems for the Hungarian language, proposes practical solutions for treating them, and evaluates their practicality using publicly available databases. The article introduces a rule-based system for modeling phonological rules within words as well as across word boundaries, and the notion of stochastic morphological analysis for treating the vocabulary size problem. Finally, the implementation of the proposed methods in the FlexiVoice speech engine is described, and the results of the experimental evaluation on isolated and connected digit recognition, on 2000-word recognition of Hungarian city names, and on inflected word recognition tasks are summarized.
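The kind of rule-based phonological modeling across word boundaries mentioned above can be sketched with a toy rule set. The rule table below is an illustrative subset of Hungarian regressive voicing assimilation, not the FlexiVoice engine's actual rules, and the letters stand in for phonemes:

```python
# Toy voiced/voiceless obstruent pairs (illustrative subset; real
# Hungarian phonology has more pairs and exceptions, e.g. /v/).
VOICED_TO_VOICELESS = {"b": "p", "d": "t", "g": "k", "z": "s", "v": "f"}
VOICELESS_TO_VOICED = {v: k for k, v in VOICED_TO_VOICELESS.items()}
OBSTRUENTS = set(VOICED_TO_VOICELESS) | set(VOICELESS_TO_VOICED)

def apply_voicing(phones):
    """Apply regressive voicing assimilation, scanning right to left
    so that assimilation can propagate through obstruent clusters."""
    out = list(phones)
    for i in range(len(out) - 2, -1, -1):
        nxt = out[i + 1]
        if out[i] in OBSTRUENTS and nxt in OBSTRUENTS:
            if nxt in VOICED_TO_VOICELESS and out[i] in VOICELESS_TO_VOICED:
                out[i] = VOICELESS_TO_VOICED[out[i]]    # voice before a voiced obstruent
            elif nxt in VOICELESS_TO_VOICED and out[i] in VOICED_TO_VOICELESS:
                out[i] = VOICED_TO_VOICELESS[out[i]]    # devoice before a voiceless one

    return out

# "kút" + "-ban": /t/ surfaces as /d/ before the voiced /b/.
print("".join(apply_voicing(list("kutban"))))  # -> "kudban"
```

Applying such rules to the concatenated lexicon entries yields the surface pronunciation variants the recognizer must model.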
IEEE Transactions on Audio, Speech, and Language Processing | 2010
Péter Mihajlik; Zoltán Tüske; Balázs Tarján; Bottyán Németh; Tibor Fegyó
Various morphological and acoustic modeling techniques are evaluated on an under-resourced, spontaneous Hungarian large-vocabulary continuous speech recognition (LVCSR) task. Among morphologically rich languages, Hungarian is known for its agglutinative, inflective nature, which increases the data sparseness caused by a relatively small training database. Although Hungarian spelling is considered largely phonological, a large part of the corpus is covered by words pronounced in multiple, phonemically different ways. Data-driven vocabulary decomposition methods and methods supported by language-specific knowledge are investigated in combination with phoneme- and grapheme-based acoustic modeling techniques on the given task. Word baseline and advanced morph-based baseline results are significantly outperformed by both statistical and grammatical vocabulary decomposition methods. Although the discussed morph-based techniques recognize a significant number of out-of-vocabulary words, the improvements are due not to this but to the reduction of insertion errors. Applying grapheme-based acoustic models instead of phoneme-based ones causes no severe deterioration in recognition performance. Moreover, a fully data-driven acoustic modeling technique combined with a statistical morphological modeling approach provides the best performance on the most difficult test set. The overall best speech recognition performance is obtained with a novel word-to-morph decomposition technique that combines grammatical and unsupervised statistical segmentation algorithms. The improvement achieved by the proposed technique is stable across acoustic modeling approaches and larger when speaker adaptation is applied.
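Statistical word-to-morph decomposition of the kind evaluated above can be sketched as a Viterbi segmentation over a unigram morph inventory. The inventory and probabilities below are illustrative toys, not the models trained in the paper:

```python
import math

# Toy morph inventory with unigram probabilities (illustrative values).
MORPHS = {"ház": 0.05, "ak": 0.10, "ban": 0.08, "a": 0.12, "k": 0.06}

def best_split(word):
    """Viterbi segmentation minimizing the total -log probability."""
    n = len(word)
    cost = [0.0] + [math.inf] * n   # cost[i]: best cost of word[:i]
    back = [0] * (n + 1)            # back[i]: start of last morph
    for i in range(1, n + 1):
        for j in range(i):
            piece = word[j:i]
            if piece in MORPHS and cost[j] != math.inf:
                c = cost[j] - math.log(MORPHS[piece])
                if c < cost[i]:
                    cost[i], back[i] = c, j
    if cost[n] == math.inf:
        return [word]               # back off to the whole word form
    morphs, i = [], n
    while i > 0:
        morphs.append(word[back[i]:i])
        i = back[i]
    return morphs[::-1]

# "házakban" ("in houses") decomposes into stem + plural + case suffix.
print(best_split("házakban"))  # -> ['ház', 'ak', 'ban']
```

Replacing full word forms with such morphs in the lexicon and language model is what counters the data sparseness of agglutinative Hungarian.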
spoken language technology workshop | 2012
Éva Székely; Tamás Gábor Csapó; Bálint Tóth; Péter Mihajlik; Julie Carson-Berndsen
Freely available audiobooks are a rich resource of expressive speech recordings that can be used for speech synthesis. Natural-sounding, expressive synthetic voices have previously been built from audiobooks containing large amounts of highly expressive speech recorded from a professionally trained speaker. The majority of freely available audiobooks, however, are read by amateur speakers, are shorter, and contain less expressive (less emphatic, less emotional, etc.) speech in terms of both quality and quantity. Synthesizing expressive speech from a typical online audiobook therefore poses many challenges. In this work we address these challenges with a method consisting of minimally supervised techniques to align the text with the recorded speech, select groups of expressive speech segments, and build expressive voices for hidden Markov model based synthesis using speaker adaptation. Subjective listening tests have shown that the expressive synthetic speech generated with this method is often able to produce utterances suited to an emotional message. We used a restricted amount of speech data in our experiment in order to show that the method is generally applicable to most typical audiobooks widely available online.
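One possible selection step of the kind described above can be sketched by ranking utterances with a simple expressivity proxy. Here pitch variability stands in for the selection criterion; this is an assumed, illustrative feature, not the paper's actual feature set:

```python
import statistics

def select_expressive(utterances, threshold=25.0):
    """Keep utterances whose voiced-frame F0 standard deviation (Hz)
    exceeds a threshold; 0 marks unvoiced frames in the F0 track."""
    keep = []
    for name, f0_track in utterances:
        voiced = [f for f in f0_track if f > 0]
        if len(voiced) > 1 and statistics.stdev(voiced) > threshold:
            keep.append(name)
    return keep

utts = [("flat",   [100, 102, 101, 99, 0, 100]),
        ("lively", [90, 150, 0, 200, 120, 260])]
print(select_expressive(utts))  # -> ['lively']
```

Grouping utterances this way lets the speaker-adaptation stage be fed only the more expressive portion of an otherwise mixed audiobook.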
2011 6th Conference on Speech Technology and Human-Computer Dialogue (SpeD) | 2011
Gellért Sárosi; Mihály Mozsáry; Péter Mihajlik; Tibor Fegyó
A crucial part of a speech recognizer is acoustic feature extraction, especially when the application is intended for use in noisy environments. In this paper we investigate several novel front-end techniques and compare them to multiple baselines. Recognition tests were performed on studio-quality wide-band Hungarian recordings as well as on narrow-band telephone speech including real-life noises collected in six languages: English, German, French, Italian, Spanish, and Hungarian. The baseline feature types, used with several settings, were Mel Frequency Cepstral Coefficients (MFCC) and Perceptual Linear Prediction (PLP) features as implemented in HTK, SPHINX, or by ourselves. The novel methods include Perceptual Minimum Variance Distortionless Response (PMVDR) and multiple variations of Power-Normalized Cepstral Coefficients (PNCC). Adaptive techniques are also applied to reduce convolutive distortions. We found significant differences between the MFCC implementations, and the usefulness of the PNCC variations varied considerably with bandwidth and noise conditions.
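A standard way to reduce convolutive (channel) distortions of the kind mentioned above is cepstral mean normalization: a constant channel filter becomes a constant additive offset in the cepstral domain, so subtracting the per-coefficient mean removes it. This is a minimal sketch of that idea; the adaptive schemes actually tested in the paper may differ:

```python
import numpy as np

def cmn(features):
    """Cepstral mean normalization.

    features: (num_frames, num_ceps) array of cepstral coefficients.
    Subtracts the per-coefficient mean over the utterance, cancelling
    any constant channel offset.
    """
    return features - features.mean(axis=0, keepdims=True)

# Simulated features with a constant channel offset of +5.
frames = np.random.randn(200, 13) + 5.0
norm = cmn(frames)
print(norm.mean(axis=0))  # each coefficient mean is ~0 after CMN
```

In practice CMN is often applied per utterance or with a sliding window so it can track slowly varying channels.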
text speech and dialogue | 2007
Péter Mihajlik; Tibor Fegyó; Bottyán Németh; Zoltán Tüske; Viktor Trón
The paper describes automatic speech recognition experiments and results on the spontaneous Hungarian MALACH speech corpus. A novel morph-based lexical modeling approach is compared to the traditional word-based one and to another, previously best-performing morph-based approach in terms of word and letter error rates. The applied language and acoustic modeling techniques are also detailed. Using unsupervised speaker adaptation along with morph-based lexical models, absolute word error rate reductions of 8.1%-14.4% have been achieved on a two-speaker, two-hour test set compared to the speaker-independent baseline results.
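The word error rate reductions above are measured with the standard Levenshtein-alignment WER metric, which can be sketched as follows (this is the generic definition, not the authors' own scoring tool):

```python
def wer(ref, hyp):
    """Word error rate: (substitutions + insertions + deletions) / |ref|,
    computed via word-level edit distance."""
    r, h = ref.split(), hyp.split()
    # d[i][j]: edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                     # all deletions
    for j in range(len(h) + 1):
        d[0][j] = j                     # all insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

# One substitution and one deletion against a 4-word reference.
print(wer("a b c d", "a x c"))  # -> 0.5
```

Letter error rate is the same computation over characters instead of words, which is why it is often reported alongside WER for agglutinative languages.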
international conference on computers helping people with special needs | 2008
Géza Németh; Gábor Olaszy; Mátyás Bartalis; Géza Kiss; Csaba Zainkó; Péter Mihajlik; Csaba Haraszti
Aged and visually impaired persons have more difficulty obtaining information about medications than others. Although in Hungary Braille labels containing the name of the medicament have recently been placed on drug boxes, accessing detailed information about a drug, i.e. the content of the written Patient Information Leaflet (PIL), remains complicated. The Medicine Line (MLN) service may help solve this problem. This automatic telephone information system was developed and put into operation in Hungary in December 2006. The system speaks and understands Hungarian, so aged and visually impaired users can obtain information about a drug by voice. Adaptation to other languages is also possible. To our knowledge, no comparable system is available elsewhere in the European Union.
international conference on speech and computer | 2015
Ádám Varga; Balázs Tarján; Zoltán Tobler; György Szaszák; Tibor Fegyó; Csaba Bordás; Péter Mihajlik
In this paper, the application of LVCSR (Large Vocabulary Continuous Speech Recognition) technology is investigated for real-time, resource-limited broadcast closed captioning. The work focuses on transcribing live broadcast conversational speech to make such programs accessible to deaf viewers. Due to computational limitations, the real-time factor (RTF) and memory requirements are kept low during decoding with various models tailored for Hungarian broadcast speech recognition. Two decoders are compared on the direct transcription task of broadcast conversation recordings, and setups employing re-speakers are also tested. Moreover, the models are evaluated on a broadcast news transcription task as well, and different language models (LMs) are tested in order to demonstrate the performance of our systems in settings where low memory consumption is a less crucial factor.
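The real-time constraint above is captured by the real-time factor, the ratio of processing time to audio duration; decoding must keep it below 1.0 for live captioning (the numbers below are illustrative, not the paper's measurements):

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent decoding / duration of the audio decoded.
    RTF < 1.0 means the system keeps up with the live signal."""
    return processing_seconds / audio_seconds

# Illustrative example: 45 s of decoding for 60 s of audio.
rtf = real_time_factor(processing_seconds=45.0, audio_seconds=60.0)
print(f"RTF = {rtf:.2f}, real-time capable: {rtf < 1.0}")  # RTF = 0.75
```

Memory and RTF typically trade off against accuracy, which is why the paper evaluates model variants under both tight and relaxed resource budgets.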
2013 7th Conference on Speech Technology and Human - Computer Dialogue (SpeD) | 2013
Balázs Tarján; Gellért Sárosi; Tibor Fegyó; Péter Mihajlik
This paper summarizes our recent efforts to automatically transcribe call center conversations in real time. The data sparseness issue arising from the small amount of transcribed training data is addressed: first, the potential of including additional non-conventional training texts is investigated, and then morphological language models are introduced to handle data insufficiency. The baseline system is also extended with explicit models for non-verbal speech events such as hesitation or consent. In addition, all of the above techniques are efficiently combined in the final system. The benefit of each approach is evaluated on real-life call center recordings. Results show that by utilizing morphological language models, a significant error rate reduction can be achieved over the word baseline system, and this reduction is preserved across experimental setups. The results can be further improved if non-verbal events are also modeled.
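Explicit non-verbal event modeling of the kind mentioned above can be sketched by mapping hesitation and consent sounds to dedicated tokens, so the language model can predict them like ordinary words. The surface forms and token names below are illustrative assumptions:

```python
# Toy mapping from non-verbal surface forms to event tokens
# (illustrative; real systems derive these from annotated data).
NONVERBAL = {"öö": "[hes]", "ööö": "[hes]", "mhm": "[consent]", "aha": "[consent]"}

def tokenize(utterance):
    """Replace known non-verbal forms with their event tokens."""
    return [NONVERBAL.get(w, w) for w in utterance.split()]

print(tokenize("öö igen mhm értem"))
# -> ['[hes]', 'igen', '[consent]', 'értem']
```

With such tokens in the vocabulary, hesitations stop being forced onto acoustically similar real words, which is one source of the error reduction reported above.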
2013 7th Conference on Speech Technology and Human - Computer Dialogue (SpeD) | 2013
Péter Mihajlik; András Balog
In a variety of speech recognition tasks, a large amount of approximate transcription is available for the audio material but is not directly applicable to acoustic model training. Whereas roughly time-synchronous closed captions and proper audiobook texts are already used in lightly supervised techniques, the utilization of more imperfect and at the same time completely unaligned transcriptions is not self-evident. In this paper we describe our experiments aimed at the automated transcription of Hungarian parliamentary speeches. Essentially, a lightly supervised cross-domain acoustic model adaptation/retraining is performed. A low-resource broadcast news model is used to bootstrap the process. Relying on automatic recognition of the parliamentary training speech and on data selection based on dynamic text alignment, a new, task-specific acoustic model is built. For the adaptation to the parliamentary domain, only edited official transcriptions and unaligned speech data are used, without any additional human annotation effort. The adapted acoustic model is applied to unseen target speech in real-time recognition. The word accuracy difference between the automatic transcription and the official, manually produced one is only 5% (as compared to the exact reference text).
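The alignment-based data selection described above can be sketched as follows: align the recognizer's output with the official transcript and keep only regions where they agree. The agreement criterion and the example words are illustrative assumptions, not the paper's exact procedure:

```python
from difflib import SequenceMatcher

def select_matching(asr_words, official_words, min_run=3):
    """Keep runs of at least min_run consecutive words on which the
    ASR output and the official transcript agree; only these segments
    are trusted for acoustic model retraining."""
    sm = SequenceMatcher(a=asr_words, b=official_words, autojunk=False)
    selected = []
    for block in sm.get_matching_blocks():
        if block.size >= min_run:
            selected.append(asr_words[block.a:block.a + block.size])
    return selected

asr = "tisztelt ház a javaslat xxx elfogadását kérem".split()
ref = "tisztelt ház a javaslat gyors elfogadását kérem".split()
print(select_matching(asr, ref))
# -> [['tisztelt', 'ház', 'a', 'javaslat']]
```

Only the agreeing stretches (with their audio timestamps from decoding) feed the retraining, so recognition errors and editorial rewordings are filtered out automatically.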
COST'10 Proceedings of the 2010 international conference on Analysis of Verbal and Nonverbal Communication and Enactment | 2010
Gellért Sárosi; Tamás Mozsolics; Balázs Tarján; András Balog; Péter Mihajlik; Tibor Fegyó
This paper introduces our work and results on a multi-language continuous speech recognition task. The aim was to design a system that introduces a tolerable amount of recognition errors for point-of-interest words in voice navigational queries, even in the presence of real-life traffic noise. An additional challenge was that no task-specific training databases were available for language or acoustic modeling. Instead, general-purpose acoustic databases were obtained and (probabilistic) context-free grammars were constructed for the acoustic and language models, respectively. A public pronunciation lexicon was used for English, whereas rule- and exception-dictionary-based pronunciation modeling was applied for French, German, Italian, Spanish, and Hungarian. For the last four languages, the classical phoneme-based pronunciation modeling approach was also compared to a grapheme-based technique. Noise robustness was addressed by applying various feature extraction methods. The results show that achieving high word recognition accuracy is feasible if cooperative speakers can be assumed.
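The phoneme- versus grapheme-based contrast above can be sketched with two toy lexicon builders: the rule-based one maps letter sequences to phonemes (here only a tiny subset of Hungarian digraph rules, as an illustration), while the grapheme-based one simply treats each letter as its own acoustic unit:

```python
# Illustrative toy subset of Hungarian digraph-to-phoneme rules
# (not the rule/exception dictionaries used in the paper).
DIGRAPHS = {"sz": "s", "zs": "Z", "cs": "tS", "gy": "J"}

def phoneme_pron(word):
    """Rule-based pronunciation: map digraphs, pass other letters through."""
    phones, i = [], 0
    while i < len(word):
        if word[i:i + 2] in DIGRAPHS:
            phones.append(DIGRAPHS[word[i:i + 2]])
            i += 2
        else:
            phones.append(word[i])
            i += 1
    return phones

def grapheme_pron(word):
    """Grapheme-based pronunciation: one unit per letter, no rules at all."""
    return list(word)

print(phoneme_pron("szakács"))   # -> ['s', 'a', 'k', 'á', 'tS']
print(grapheme_pron("szakács"))  # -> ['s', 'z', 'a', 'k', 'á', 'c', 's']
```

The grapheme-based route shifts the burden of learning letter-to-sound behavior onto the acoustic models, which is attractive when no reliable pronunciation rules or lexica exist for a language.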