Petr Pollák
Czech Technical University in Prague
Publication
Featured research published by Petr Pollák.
Language Resources and Evaluation | 2014
Mirjam Ernestus; Lucie Kočková-Amortová; Petr Pollák
This article describes the preparation, recording and orthographic transcription of a new speech corpus, the Nijmegen Corpus of Casual French (NCCFr). The corpus contains a total of over 36 h of recordings of 46 French speakers engaged in conversations with friends. Casual speech was elicited during three different parts, which together provided around 90 min of speech from every pair of speakers. While Parts 1 and 2 did not require participants to perform any specific task, in Part 3 participants negotiated a common answer to general questions about society. Comparisons with the ESTER corpus of journalistic speech show that the two corpora contain speech of considerably different registers. A number of indicators of casualness, including swear words, casual words, verlan, disfluencies and word repetitions, are more frequent in the NCCFr than in the ESTER corpus, while the use of double negation, an indicator of formal speech, is less frequent. In general, these estimates of casualness are constant through the three parts of the recording sessions and across speakers. Based on these facts, we conclude that our corpus is a rich resource of highly casual speech, and that it can be effectively exploited by researchers in language science and technology.
International Conference Radioelektronika | 2007
Josef Rajnoha; Petr Pollák
Speech recognisers use a parametric representation of the signal to capture the features of speech that matter most for the recognition task. Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP) coefficients are among the most commonly used parametrizations. There is no rule for deciding which one is better; the choice depends mainly on the particular conditions. This paper presents tests that combine different parts of each parametrization process to get the best results in given conditions. A robust hidden Markov model (HMM) based Czech digit recogniser in a slightly noisy environment is used for this purpose. The experiments show that using Bark-frequency scaling, equal-loudness pre-emphasis and the intensity-loudness power law in the original MFCC method can improve robustness to white noise under particular conditions. The results also reveal that the LP-based methods tend to generate insertion errors in the given environment.
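As a rough illustration of the perceptual steps named in this abstract, the sketch below compares the mel warping commonly used for MFCC with the Bark warping commonly used for PLP, and applies a cube-root intensity-loudness compression. The formulas are the widely used textbook forms, not necessarily the exact variants used in the paper, and all function names are hypothetical.

```python
import math

def hz_to_mel(f_hz):
    # Common mel-scale formula used in many MFCC front ends.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def hz_to_bark(f_hz):
    # Bark warping in the form commonly used for PLP:
    # 6 * ln(x + sqrt(x^2 + 1)) with x = f / 600, i.e. 6 * asinh(f / 600).
    return 6.0 * math.asinh(f_hz / 600.0)

def intensity_to_loudness(power):
    # Cube-root compression approximating the intensity-loudness power law.
    return power ** (1.0 / 3.0)

if __name__ == "__main__":
    # Print both warpings side by side for a few frequencies.
    for f in (100.0, 1000.0, 4000.0):
        print(f, round(hz_to_mel(f), 1), round(hz_to_bark(f), 2))
```

Both warpings compress high frequencies, but with different curvature, which is one reason swapping one step between the two front ends can change recognition results.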
Text, Speech and Dialogue | 2014
Petr Mizera; Petr Pollák; Alice Kolman; Mirjam Ernestus
This paper describes a pilot study of phonetic segmentation applied to the Nijmegen Corpus of Casual Czech (NCCCz). The corpus contains informal, strongly spontaneous speech, which influences the character of the produced speech at various levels. This work is part of wider research on pronunciation reduction in such informal speech. We analyse the accuracy of phonetic segmentation when canonical or reduced pronunciation is used. The achieved segmentation accuracy indicates the general accuracy of the acoustic modelling that is to be applied in spontaneous speech recognition. As a byproduct of the presented spontaneous speech segmentation, the paper also describes the created lexicon with canonical pronunciations of the words in NCCCz, a tool supporting pronunciation checks of lexicon items, and a mini-database of selected utterances from NCCCz manually labelled at the phonetic level, suitable for evaluation purposes.
International Conference on Speech and Computer | 2013
Michal Borsky; Petr Mizera; Petr Pollák
The paper analyses suitable features for distorted speech recognition. The aim is to explore the application of a command ASR system when the speech is recorded with distant microphones and may contain strong additive and convolutional noise. The paper analyses the possible contribution of basic spectral subtraction coupled with cepstral mean normalization to minimizing the influence of the distortion present in such a far-talk channel. The results are compared with a reference close-talk speech recognition system and show an improvement in word error rate (WER) for channels with low or medium SNR. Using the combination of these basic techniques, a WERR of 55.6% was obtained for the medium-distance channel and a WERR of 22.5% for the far-distance channel.
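The two techniques named in this abstract can be sketched in a few lines. The code below is a minimal illustration, not the authors' implementation: the spectral floor value, the function names, and the reading of WERR as relative WER reduction are all assumptions, and the numbers in the usage lines are made up.

```python
def spectral_subtraction(power_spectrum, noise_estimate, floor=0.01):
    # Subtract the estimated noise power in each bin; keep a small fraction
    # of the original power as a floor so no bin becomes negative.
    return [max(p - n, floor * p) for p, n in zip(power_spectrum, noise_estimate)]

def cepstral_mean_normalization(frames):
    # Subtract the per-coefficient mean over all frames of an utterance,
    # which removes a stationary convolutional (channel) component.
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[k] for f in frames) / n for k in range(dim)]
    return [[f[k] - means[k] for k in range(dim)] for f in frames]

def werr(wer_baseline, wer_system):
    # Relative WER reduction in percent (assumed meaning of WERR).
    return 100.0 * (wer_baseline - wer_system) / wer_baseline

if __name__ == "__main__":
    # Hypothetical values, chosen only to exercise the functions.
    print(spectral_subtraction([10.0, 1.0], [2.0, 2.0]))
    print(cepstral_mean_normalization([[1.0, 2.0], [3.0, 6.0]]))
    print(werr(40.0, 20.0))
```

Spectral subtraction targets the additive noise, while cepstral mean normalization targets the channel, which is why the two are naturally paired for far-talk recordings.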
Cross-Modal Analysis of Speech, Gestures, Gaze and Facial Expressions | 2009
Josef Rajnoha; Petr Pollák
As speech recognition is applied in real working systems, spontaneous speech recognition is gaining importance. Developing such applications evidently requires a spontaneous speech database, both for general design and for training and testing. This paper describes the collection of Czech spontaneous speech data recorded during technical lectures. It is intended as material for the analysis of particular phenomena that appear in spontaneous speech, and also as additional material for training spontaneous speech recognizers. Compared with using only the available read speech databases, the main contribution of the collected database for training purposes should be the presence of spontaneous speech phenomena such as a higher rate of non-speech events, changes in pronunciation, and sentence irregularities. Speech signals are captured in two channels of slightly different quality, and about 14 hours of speech from 15 speakers have been collected and annotated so far. The first analyses of spontaneous-speech effects in the collected data have been performed, and a comparison with read speech databases is presented.
Text, Speech and Dialogue | 2015
Zdenek Patc; Petr Mizera; Petr Pollák
The paper describes the implementation of phonetic segmentation using tools from the KALDI toolkit. Its use is motivated by the active development and support of current ASR techniques available in KALDI. The presented work is related to research on pronunciation variability in casual Czech speech. For this purpose we use automatic phonetic segmentation to analyse particular phone boundaries, deletions, etc. We also present a tool for pronunciation detection. Both tools can be used for processing large databases as well as for interactive work within the Praat environment. An illustrative analysis of the segmentation accuracy and the design of a new environment for phonetic segmentation in Praat are also presented.
Speech Communication | 2017
Michal Borsky; Petr Mizera; Petr Pollák; Jan Nouza
A large portion of the audio files distributed over the Internet or stored in personal and corporate media archives are in a compressed form. Several compression techniques and algorithms exist, but it is MPEG Layer-3 (known as MP3) that has achieved really wide popularity in general audio coding, and in speech, too. However, the algorithm is lossy in nature and introduces distortion into the spectral and temporal characteristics of a signal. In this paper we study its impact on automatic speech recognition (ASR). We show that with decreasing MP3 bitrates the major source of ASR performance degradation is deep spectral valleys (i.e. bins with almost zero energy) caused by the masking effect of the MP3 algorithm. We demonstrate that these unnatural gaps in the spectrum can be effectively compensated by adding a certain amount of noise to the distorted signal. We provide theoretical background for this approach, showing that the added noise affects mainly the spectral valleys: they are filled by the noise while the spectral bins with speech remain almost unchanged. This helps to restore a more natural shape of the log spectrum and cepstrum, and consequently has a positive impact on ASR performance. In our previous work, we proposed two types of the signal dithering (noise addition) technique, one applied globally, the other in a more selective way. In this paper, we offer a more detailed insight into their performance. We provide results from many experiments where we test them in various scenarios, using a large vocabulary continuous speech recognition (LVCSR) system, acoustic models based on Gaussian mixture models (GMM) as well as on deep neural networks (DNN), and multiple speech databases in three languages (Czech, English and German). Our results prove that both proposed techniques, and the selective dithering method in particular, yield consistent compensation of the negative impact of MP3-compressed speech on ASR performance.
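The claim that added noise fills the spectral valleys while leaving speech bins almost unchanged can be checked with simple arithmetic. The sketch below assumes the dither is uncorrelated with the signal, so its expected power simply adds to each bin; the bin values and noise level are hypothetical, and the helper names are not taken from the paper.

```python
import math

def log_power_db(power):
    # Power of one spectral bin expressed in decibels.
    return 10.0 * math.log10(power)

def dither_bins(power_bins, noise_power):
    # Expected bin power after adding uncorrelated noise of the given power.
    return [p + noise_power for p in power_bins]

if __name__ == "__main__":
    bins = [1.0, 1e-10]          # a speech bin and a near-empty "valley" bin
    dithered = dither_bins(bins, 1e-3)
    for before, after in zip(bins, dithered):
        # Show how many dB each bin moves after dithering.
        print(round(log_power_db(after) - log_power_db(before), 3))
```

On the log scale used for cepstral features, the strong bin barely moves while the valley rises by tens of decibels, which is the effect the abstract describes.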
Text, Speech and Dialogue | 2015
Petr Mizera; Petr Pollák
The paper deals with neural network-based estimation of articulatory features for Czech, intended for use in automatic phonetic segmentation or automatic speech recognition. In our current approach we use multi-layer perceptron networks to extract the articulatory features by a non-linear mapping from standard acoustic features extracted from the speech signal. The suitability of various acoustic features and the optimum length of the temporal context at the network input were analysed. The temporal context is represented by a context window created from stacked feature vectors. The optimum length of the temporal context was analysed and identified for context windows ranging from 9 to 21 frames. We obtained 90.5% frame-level accuracy on average across all articulatory feature classes for mel-log filter-bank features. The highest classification rate of 95.3% was achieved for the voicing class.
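A context window of stacked feature vectors can be built as in the sketch below. It is only an illustration: the function name is hypothetical, and repeating the first and last frame at the utterance edges is an assumption, since the abstract does not say how edges are handled.

```python
def stack_context(frames, window):
    # Build one long input vector per frame by concatenating the frame with
    # its neighbours; `window` is the total number of frames (odd, e.g. 9).
    if window % 2 == 0:
        raise ValueError("window must be odd")
    half = window // 2
    last = len(frames) - 1
    stacked = []
    for t in range(len(frames)):
        vec = []
        for offset in range(-half, half + 1):
            # Clamp the index so edge frames are repeated (assumed padding).
            idx = min(max(t + offset, 0), last)
            vec.extend(frames[idx])
        stacked.append(vec)
    return stacked

if __name__ == "__main__":
    toy = [[1], [2], [3]]        # three one-dimensional frames
    print(stack_context(toy, 3))
```

With a window of 9 frames and feature vectors of dimension d, each network input has 9 * d values, so the window length directly sets the size of the MLP input layer.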
Neural Network World | 2014
Petr Mizera; Petr Pollák
The article describes neural network-based articulatory feature (AF) estimation for Czech speech. First, the relationship between AFs and the Czech phone inventory is defined, and then the estimation based on MLP neural networks is performed. Several speech representations at the input of the MLP classifiers are proposed in order to obtain a robust AF estimation. The experiments have proved that ANN-based AF estimation works very reliably, especially in a low-noise environment. Moreover, when the number of neurons in the hidden layer is increased and temporal-context DCT-TRAP features are used at the input of the MLP network, the AF classification also works accurately for signals collected in environments with high background noise.
International Conference on e-Business | 2011
Petr Pollák; Michal Borsky
This paper presents a study of speech recognition accuracy for both small- and large-vocabulary tasks with respect to different levels of MP3 compression of the processed data. The motivation behind the work was to evaluate the use of an ASR system for off-line automatic transcription of recordings collected from standard present-day MP3 devices under different levels of background noise and channel distortion. Although MP3 may not be an optimal compression algorithm, the experiments have proved that it does not distort the speech signal significantly for higher compression rates. The experiments also showed that the accuracy of speech recognition (both small- and large-vocabulary) decreased very slowly for bit rates of 24 kbps and higher. However, a slightly different setup of speech feature computation is necessary for MP3 speech data; in particular, PLP features give significantly better results than MFCC.