Petr Mizera

Czech Technical University in Prague

Publication

Featured research published by Petr Mizera.

text speech and dialogue | 2014

Impact of irregular pronunciation on phonetic segmentation of Nijmegen corpus of Casual Czech

Petr Mizera; Petr Pollák; Alice Kolman; Mirjam Ernestus

This paper describes a pilot study of phonetic segmentation applied to the Nijmegen Corpus of Casual Czech (NCCCz). This corpus contains informal speech of a strongly spontaneous nature, which influences the character of the produced speech at various levels. This work is part of wider research related to the analysis of pronunciation reduction in such informal speech. We present an analysis of the accuracy of phonetic segmentation when canonical or reduced pronunciation is used. The achieved segmentation accuracy provides information about the general accuracy of the acoustic modelling that is to be applied in spontaneous speech recognition. As a byproduct of the presented spontaneous speech segmentation, this paper also describes the created lexicon with canonical pronunciations of words in NCCCz, a tool supporting pronunciation checks of lexicon items, and a mini-database of selected utterances from NCCCz manually labelled at the phonetic level and suitable for evaluation purposes.

international conference on speech and computer | 2013

Noise and Channel Normalized Cepstral Features for Far-speech Recognition

Michal Borsky; Petr Mizera; Petr Pollák

The paper analyses suitable features for distorted speech recognition. The aim is to explore the application of a command ASR system when the speech is recorded with far-distance microphones with possibly strong additive and convolutional noise. The paper analyses the contribution of basic spectral subtraction coupled with cepstral mean normalization to minimizing the influence of the distortion present in such a far-talk channel. The results are compared with a reference close-talk speech recognition system and show an improvement in WER for channels with low or medium SNR. Using the combination of these basic techniques, a WERR of 55.6% was obtained for the medium-distance channel and a WERR of 22.5% for the far-distance channel.
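
As a rough illustration of one technique named in the abstract above, the following sketch shows cepstral mean normalization (CMN) in plain Python. It is not the authors' implementation: the function names are hypothetical, the normalization is done per utterance, and the WERR formula is the common relative-reduction definition, which the abstract itself does not spell out.

```python
def cepstral_mean_normalization(frames):
    """Subtract the per-coefficient mean over all frames of one utterance.

    `frames` is a list of equal-length lists of cepstral coefficients.
    A constant (convolutional) channel offset in the cepstral domain is
    the same in every frame, so removing the mean cancels it.
    """
    if not frames:
        return []
    n_frames = len(frames)
    n_coeffs = len(frames[0])
    means = [sum(f[k] for f in frames) / n_frames for k in range(n_coeffs)]
    return [[f[k] - means[k] for k in range(n_coeffs)] for f in frames]


def relative_wer_reduction(wer_baseline, wer_improved):
    """Relative WER reduction in percent (assumed definition of WERR)."""
    return 100.0 * (wer_baseline - wer_improved) / wer_baseline


# Toy data (made up for illustration): two frames, two coefficients.
clean = [[1.0, 2.0], [3.0, 6.0]]
# The same frames with a constant channel offset added to every frame.
offset = [0.5, -1.0]
shifted = [[c + o for c, o in zip(frame, offset)] for frame in clean]

normalized_clean = cepstral_mean_normalization(clean)
normalized_shifted = cepstral_mean_normalization(shifted)
```

In this toy case both normalized sequences are identical, which is the property that makes CMN useful against a stationary channel; additive noise is not handled by CMN alone, which is why the abstract pairs it with spectral subtraction.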

text speech and dialogue | 2015

Phonetic Segmentation Using KALDI and Reduced Pronunciation Detection in Causal Czech Speech

Zdenek Patc; Petr Mizera; Petr Pollák

The paper describes the implementation of phonetic segmentation using tools from the KALDI toolkit. Its use is motivated by the active development and support of current ASR techniques available in KALDI. The presented work is related to research on pronunciation variability in casual Czech speech. For this purpose we use automatic phonetic segmentation to analyze particular phone boundaries, deletions, etc. We also present a tool for pronunciation detection. Both tools can be used for processing large databases as well as for interactive work within the Praat environment. An illustrative analysis of the segmentation accuracy and the design of a new environment for phonetic segmentation in Praat are also presented.

Speech Communication | 2017

Dithering techniques in automatic recognition of speech corrupted by MP3 compression

Michal Borsky; Petr Mizera; Petr Pollák; Jan Nouza

A large portion of the audio files distributed over the Internet or stored in personal and corporate media archives are in a compressed form. There exist several compression techniques and algorithms, but it is MPEG Layer-3 (known as MP3) that has achieved truly wide popularity in general audio coding, and in speech, too. However, the algorithm is lossy in nature and introduces distortion into the spectral and temporal characteristics of a signal. In this paper we study its impact on automatic speech recognition (ASR). We show that with decreasing MP3 bitrates the major source of ASR performance degradation is deep spectral valleys (i.e. bins with almost zero energy) caused by the masking effect of the MP3 algorithm. We demonstrate that these unnatural gaps in the spectrum can be effectively compensated by adding a certain amount of noise to the distorted signal. We provide theoretical background for this approach, where we show that the added noise affects mainly the spectral valleys. They are filled by the noise while the spectral bins with speech remain almost unchanged. This helps to restore a more natural shape of the log spectrum and cepstrum, and consequently has a positive impact on ASR performance. In our previous work, we proposed two types of the signal dithering (noise addition) technique, one applied globally, the other in a more selective way. In this paper, we offer a more detailed insight into their performance. We provide results from many experiments where we test them in various scenarios, using a large vocabulary continuous speech recognition (LVCSR) system, acoustic models based on Gaussian mixture models (GMM) as well as on deep neural networks (DNN), and multiple speech databases in three languages (Czech, English and German). Our results prove that both proposed techniques, and the selective dithering method in particular, yield consistent compensation of the negative impact of MP3-compressed speech on ASR performance.
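
The core idea of the abstract above, that a small amount of added noise fills near-zero-energy regions while leaving strong components almost unchanged, can be sketched as follows. This is only a minimal illustration of global dithering under stated assumptions: white Gaussian noise, a fixed hypothetical `noise_std`, and a time-domain silent segment standing in for the spectral valleys discussed in the paper. The selective variant and the actual noise levels used by the authors are not reproduced here.

```python
import math
import random


def add_dither(signal, noise_std, seed=0):
    """Add low-level white Gaussian noise to every sample (global dithering)."""
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, noise_std) for x in signal]


def mean_energy(signal):
    """Average energy per sample."""
    return sum(x * x for x in signal) / len(signal)


# A segment with exactly zero energy (stand-in for an empty spectral bin)
# and a full-scale sine (stand-in for a bin that carries speech energy).
silent = [0.0] * 1000
tone = [math.sin(2.0 * math.pi * 0.01 * n) for n in range(1000)]

silent_dithered = add_dither(silent, noise_std=0.001)
tone_dithered = add_dither(tone, noise_std=0.001)

# Relative change in energy of the strong component after dithering.
tone_change = abs(mean_energy(tone_dithered) - mean_energy(tone)) / mean_energy(tone)
```

After dithering, the silent segment has small but non-zero energy, so its logarithm is finite, while the energy of the tone changes by well under one percent.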

text speech and dialogue | 2015

Improved Estimation of Articulatory Features Based on Acoustic Features with Temporal Context

Petr Mizera; Petr Pollák

The paper deals with neural network-based estimation of articulatory features for Czech, which are intended to be applied within automatic phonetic segmentation or automatic speech recognition. In our current approach we use multi-layer perceptron networks to extract the articulatory features by non-linear mapping from standard acoustic features extracted from the speech signal. The suitability of various acoustic features and the optimum length of temporal context at the input of the network were analysed. The temporal context is represented by a context window created from stacked feature vectors. The optimum length of the temporal context was analysed and identified for context windows in the range from 9 to 21 frames. We obtained 90.5% frame-level accuracy on average across all the articulatory feature classes for mel-log filter-bank features. The highest classification rate of 95.3% was achieved for the voicing class.
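
The context window described in the abstract above can be illustrated with a short sketch of feature-vector stacking. The function name, the symmetric window, and the edge handling (repeating the first and last frame) are assumptions made for illustration; the abstract does not state how utterance boundaries were treated.

```python
def stack_context(frames, left, right):
    """Build one stacked vector per frame from its temporal neighbours.

    For frame t, the vectors of frames t-left .. t+right are concatenated.
    Indices outside the utterance are clamped, i.e. the first or last
    frame is repeated (an assumed padding strategy).
    """
    n = len(frames)
    stacked = []
    for t in range(n):
        window = []
        for offset in range(-left, right + 1):
            idx = min(max(t + offset, 0), n - 1)
            window.extend(frames[idx])
        stacked.append(window)
    return stacked


# Toy utterance: three frames with one feature each.
toy = [[0.0], [1.0], [2.0]]
toy_stacked = stack_context(toy, left=1, right=1)

# A 9-frame window (4 left + current + 4 right) over 3-dimensional features
# yields 27-dimensional network inputs.
utterance = [[float(t), 0.0, 1.0] for t in range(20)]
inputs_9 = stack_context(utterance, left=4, right=4)
```

The number of stacked vectors equals the number of input frames, so each frame still has exactly one target label when the stacked vectors are fed to a classifier.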

Neural Network World | 2014

ROBUST NEURAL NETWORK-BASED ESTIMATION OF ARTICULATORY FEATURES FOR CZECH

Petr Mizera; Petr Pollák

The article describes neural network-based articulatory feature (AF) estimation for Czech speech. First, the relationship between AFs and the Czech phone inventory is defined, and then the estimation is performed using MLP neural networks. Several speech representations at the input of the MLP classifiers are proposed in order to obtain robust AF estimation. The experiments have shown that ANN-based AF estimation works very reliably, especially in a low-noise environment. Moreover, when the number of neurons in the hidden layer is increased and temporal-context DCT-TRAP features are used at the input of the MLP network, the AF classification also works accurately for signals collected in environments with high background noise.

international conference on speech and computer | 2018

Automatic Phonetic Segmentation and Pronunciation Detection with Various Approaches of Acoustic Modeling

Petr Mizera; Petr Pollák

The paper describes HMM-based phonetic segmentation realized with the KALDI toolkit, focusing on the accuracy of various acoustic modeling approaches: GMM-HMM vs. DNN-HMM, monophone vs. triphone, and speaker-independent vs. speaker-dependent. The analysis was performed using the TIMIT database and confirmed the contribution of advanced acoustic modeling to the choice of a proper pronunciation variant. For this purpose, a lexicon covering the pronunciation variability among TIMIT speakers was created on the basis of the phonetic transcriptions available in the TIMIT corpus. When the proper sequence of phones is recognized by the DNN-HMM system, more precise boundary placement can then be obtained using basic monophone acoustic models.

international conference on speech and computer | 2017

Improving of LVCSR for Causal Czech Using Publicly Available Language Resources

Petr Mizera; Petr Pollák

The paper presents the design of a Czech casual speech recognition system, which is part of wider research focused on understanding very informal speaking styles. The study was carried out using the NCCCz corpus, and the contributions of optimized acoustic and language models as well as pronunciation lexicon optimization were analyzed. Special attention was paid to the impact of publicly available corpora suitable for language model (LM) creation. Our final DNN-HMM system achieved a WER of 30–60% in the casual speech recognition task, depending on the LM used. Recognition results for other speaking styles are also presented for comparison. The system was built using the KALDI toolkit, and the created recipes are available to the research community.

text speech and dialogue | 2016

KALDI Recipes for the Czech Speech Recognition Under Various Conditions

Petr Mizera; Jiří Fiala; Aleš Brich; Petr Pollák

The paper presents the implementation of a Czech ASR system under various conditions using the KALDI speech recognition toolkit in two standard state-of-the-art architectures (GMM-HMM and DNN-HMM). We present recipes for building LVCSR systems using the SpeechDat, SPEECON, CZKCC, and NCCCz corpora, together with a new update of the feature extraction tool CtuCopy, which now supports the KALDI format. All presented recipes, as well as the CtuCopy tool, are publicly available under the Apache license v2.0. Finally, an extension of the KALDI toolkit that supports running the described LVCSR recipes on MetaCentrum computing facilities (the Czech National Grid Infrastructure operated by CESNET) is described. In the experimental part, the baseline performance of both GMM-HMM and DNN-HMM LVCSR systems applied to the given Czech corpora is presented. These results also demonstrate the behaviour of the designed LVCSR systems under various acoustic conditions as well as various speaking styles.

Eurasip Journal on Audio, Speech, and Music Processing | 2015