Jan Cernocký
Brno University of Technology
Publications
Featured research published by Jan Cernocký.
Text, Speech and Dialogue | 2004
Petr Schwarz; Pavel Matějka; Jan Cernocký
We investigate techniques for acoustic modeling in automatic recognition of context-independent phoneme strings from the TIMIT database. The baseline phoneme recognizer is based on TempoRAl Patterns (TRAP). This recognizer is simplified to shorten processing times and reduce computational requirements. More states per phoneme and bi-gram language models are incorporated into the system and evaluated. The question of an insufficient amount of training data is discussed and the system is improved accordingly. All modifications lead to a faster system with about 23.6% relative improvement over the baseline in phoneme error rate.
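As a rough illustration of the TempoRAl Patterns (TRAP) idea, the sketch below turns critical-band energy trajectories into long-temporal-context feature vectors. The roughly one-second context, Hamming window, and DCT compression are typical TRAP ingredients, but the sizes and the DCT truncation used here are illustrative assumptions, not the paper's configuration.

import numpy as np
from scipy.fft import dct

def trap_features(band_energies, context=50, n_dct=16):
    # band_energies: (n_frames, n_bands) critical-band log energies.
    # For each frame and band, take a ~1 s trajectory (2*context+1 frames
    # at a 10 ms shift), window it, and keep the first n_dct DCT terms.
    n_frames, n_bands = band_energies.shape
    win = np.hamming(2 * context + 1)
    padded = np.pad(band_energies, ((context, context), (0, 0)), mode="edge")
    feats = np.zeros((n_frames, n_bands * n_dct))
    for t in range(n_frames):
        for b in range(n_bands):
            traj = padded[t:t + 2 * context + 1, b] * win
            feats[t, b * n_dct:(b + 1) * n_dct] = dct(traj, norm="ortho")[:n_dct]
    return feats

# Example: 300 frames (3 s) of 15 critical bands -> (300, 240) vectors, which
# would feed the band classifiers and merger network of a full TRAP system.
x = np.random.randn(300, 15)
print(trap_features(x).shape)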
Text, Speech and Dialogue | 2005
Igor Szöke; Petr Schwarz; Pavel Matějka; Lukas Burget; Martin Karafiát; Jan Cernocký
This paper describes several approaches to acoustic keyword spotting (KWS), based on Gaussian mixture model/hidden Markov model (GMM/HMM) architectures and on phoneme posterior probabilities from FeatureNet. Context-independent and context-dependent phoneme models are used in the GMM/HMM system. The systems were trained and evaluated on informal continuous speech. We used KWS recognition networks of different complexity and different types of phoneme models. We study the impact of these parameters on accuracy and computational complexity, and conclude that phoneme posteriors outperform the conventional GMM/HMM system.
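To make the posterior-based KWS idea concrete, here is a minimal sketch, under simplifying assumptions, of scoring a keyword against a window of frame-level phoneme posteriors: a dynamic program aligns the keyword's phone sequence monotonically over the frames and returns a length-normalized log-posterior score. A real spotter would also use a background/filler model and slide over the whole recording; both are omitted here.

import numpy as np

def keyword_score(log_post, phone_ids):
    # log_post: (T, P) frame log phone posteriors; phone_ids: keyword phones.
    # D[t, k] = best score ending at frame t while in keyword phone k,
    # with each phone occupying at least one frame.
    T, _ = log_post.shape
    K = len(phone_ids)
    D = np.full((T, K), -np.inf)
    D[0, 0] = log_post[0, phone_ids[0]]
    for t in range(1, T):
        for k in range(K):
            stay = D[t - 1, k]
            advance = D[t - 1, k - 1] if k > 0 else -np.inf
            D[t, k] = max(stay, advance) + log_post[t, phone_ids[k]]
    return D[T - 1, K - 1] / T  # length-normalized score

log_post = np.log(np.random.dirichlet(np.ones(40), size=120))
print(keyword_score(log_post, [3, 17, 5]))  # e.g. phones of a short keyword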
Text, Speech and Dialogue | 2006
Lukas Burget; Jan Cernocký; Michal Fapso; Martin Karafiát; Pavel Matějka; Petr Schwarz; Pavel Smrž; Igor Szöke
This paper presents two approaches to spoken document retrieval: search in LVCSR recognition lattices and in phoneme lattices. For the former, an efficient method of indexing and searching multi-word queries is discussed. In phonetic search, the indexation of tri-phoneme sequences is investigated. Results, in terms of response time to single- and multi-word queries, are evaluated on the ICSI meeting database.
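The toy sketch below shows one common way such lattice indexing can work; it is an assumption-level illustration, not the paper's actual implementation. Each lattice hypothesis is stored in an inverted index as an (utterance, start, end, score) tuple, and a two-word query is answered by joining hits whose time spans are adjacent.

from collections import defaultdict

# word -> list of (utterance, start_time, end_time, log_score) hypotheses
index = defaultdict(list)

def add_hypothesis(word, utt, start, end, score):
    index[word].append((utt, start, end, score))

def search_bigram(w1, w2, max_gap=0.3):
    # Join hits for w1 and w2 that lie in the same utterance with w2
    # starting within max_gap seconds after w1 ends.
    hits = []
    for utt1, s1, e1, sc1 in index.get(w1, []):
        for utt2, s2, e2, sc2 in index.get(w2, []):
            if utt1 == utt2 and 0 <= s2 - e1 <= max_gap:
                hits.append((utt1, s1, e2, sc1 + sc2))
    return sorted(hits, key=lambda h: -h[3])

add_hypothesis("spoken", "meet01", 12.10, 12.55, -1.2)
add_hypothesis("document", "meet01", 12.58, 13.20, -0.8)
print(search_bigram("spoken", "document"))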
ACM Transactions on Information Systems | 2012
Javier Tejedor; Michal Fapso; Igor Szöke; Jan Cernocký; Frantisek Grezl
This article investigates query-by-example (QbE) spoken term detection (STD), in which the query is not entered as text but selected in speech data or spoken. Two feature extractors based on neural networks (NN) are introduced: the first producing phone-state posteriors and the second making use of a compressive NN layer. They are combined with three different QbE detectors: while the Gaussian mixture model/hidden Markov model (GMM/HMM) and dynamic time warping (DTW) detectors both work on continuous feature vectors, the third, based on weighted finite-state transducers (WFST), processes phone lattices. QbE STD is compared to two standard STD systems with text queries: acoustic keyword spotting and WFST-based search of phone strings in phone lattices. The results are reported on four languages (Czech, English, Hungarian, and Levantine Arabic) using standard metrics: equal error rate (EER) and two versions of the popular figure-of-merit (FOM). Language-dependent and language-independent cases are investigated; the latter is particularly interesting for scenarios lacking the standard resources needed to train speech recognition systems. While the DTW and GMM/HMM approaches produce the best results in the language-dependent setup, depending on the target language, the GMM/HMM approach performs best in the language-independent setup. As far as WFSTs are concerned, they are promising as they allow for indexing and fast search.
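Of the three detectors, DTW is the simplest to sketch. The subsequence-DTW toy below matches a query posteriorgram against an utterance posteriorgram using a -log inner-product local distance, a common choice in QbE work; the free start/end and the distance function are illustrative assumptions, not the paper's exact recipe.

import numpy as np

def dtw_detection_score(query, utt):
    # Subsequence DTW: the query may start and end anywhere in the
    # utterance; lower score = better match.
    def dist(a, b):
        return -np.log(np.maximum(a @ b, 1e-10))
    Q, U = len(query), len(utt)
    D = np.full((Q + 1, U + 1), np.inf)
    D[0, :] = 0.0  # free start anywhere in the utterance
    for i in range(1, Q + 1):
        for j in range(1, U + 1):
            D[i, j] = dist(query[i - 1], utt[j - 1]) + min(
                D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[Q].min() / Q  # free end, length-normalized

q = np.random.dirichlet(np.ones(40), size=30)   # 30-frame query posteriorgram
u = np.random.dirichlet(np.ones(40), size=500)  # 5 s utterance posteriorgram
print(dtw_detection_score(q, u))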
International Conference on Machine Learning | 2007
Igor Szöke; Michal Fapso; Martin Karafiát; Lukas Burget; Frantisek Grezl; Petr Schwarz; Ondřej Glembek; Pavel Matějka; Jiří Kopecký; Jan Cernocký
The paper presents the Brno University of Technology (BUT) system for indexing and search of speech, combining LVCSR and phonetic approaches. It gives a complete description of the individual building blocks of the system, from signal processing through the recognizers, indexing, and search, to the normalization of detection scores. It also describes the data used in the first edition of the NIST Spoken Term Detection (STD) evaluation. Results are presented on three US-English conditions (meetings, broadcast news, and conversational telephone speech) in terms of detection error trade-off (DET) curves and the term-weighted value (TWV) metric defined by NIST.
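For readers unfamiliar with TWV, the sketch below computes it as 1 minus the term-averaged miss and weighted false-alarm probabilities. The value beta = 999.9 and the convention of one non-target trial per second of speech follow the 2006 NIST STD cost model as commonly described; treat the details as assumptions and consult the NIST evaluation plan for the normative definition.

def twv(per_term, speech_seconds, beta=999.9):
    # per_term: dict term -> (n_hit, n_false_alarm, n_true_occurrences).
    terms = [t for t, (_, _, n_true) in per_term.items() if n_true > 0]
    p_miss = sum(1 - h / n for h, _, n in
                 (per_term[t] for t in terms)) / len(terms)
    p_fa = sum(fa / (speech_seconds - n) for _, fa, n in
               (per_term[t] for t in terms)) / len(terms)
    return 1.0 - (p_miss + beta * p_fa)

counts = {"boston": (8, 2, 10), "meeting": (15, 1, 20)}
print(round(twv(counts, speech_seconds=36000), 3))  # 10 hours of speech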
NATO Advanced Study Institute on Computational Models of Speech Pattern Processing | 1999
Gérard Chollet; Jan Cernocký; Andrei Constantinescu; Sabine Deligne; Frédéric Bimbot
The models used in current automatic speech recognition (or synthesis) systems generally rely on a representation based on phonetic symbols. The phonetic transcription of a word can be seen as an intermediate representation between the acoustic and the linguistic levels, but the a priori choice of phonemes (or phone-like units) can be questioned as probably non-optimal. Moreover, the phonetic representation has the drawback of being strongly language-dependent, which partly prevents the reuse of acoustic resources across languages. In this article, we expose and develop the concept of ALISP (Automatic Language Independent Speech Processing): a general methodology that infers the intermediate representation between the acoustic and the linguistic levels from speech and linguistic data rather than from a priori knowledge, with as little supervision as possible. We expose the benefits that can be expected from developing the ALISP approach, together with the key issues to be solved. We also present preliminary experiments that can be viewed as first steps towards the ALISP goal.
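As a deliberately naive illustration of data-driven unit inference, in the spirit of ALISP but far simpler than its actual pipeline (which involves temporal decomposition, vector quantization, and HMM refinement), the sketch below clusters spectral frames without any phonetic labels and merges consecutive identical labels into pseudo-units. All sizes and the stand-in data are assumptions.

import numpy as np
from sklearn.cluster import KMeans

def discover_units(features, n_units=64):
    # Label every frame with a cluster id, then run-length merge
    # consecutive identical labels into (unit_id, duration) pseudo-units.
    labels = KMeans(n_clusters=n_units, n_init=10).fit_predict(features)
    units = [(labels[0], 1)]
    for lab in labels[1:]:
        if lab == units[-1][0]:
            units[-1] = (lab, units[-1][1] + 1)
        else:
            units.append((lab, 1))
    return units

frames = np.random.randn(1000, 13)  # stand-in for MFCC frames
print(discover_units(frames)[:5])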
Procedia Computer Science | 2016
Lucas Ondel; Lukas Burget; Jan Cernocký
Recently, several nonparametric Bayesian models have been proposed to automatically discover acoustic units in unlabeled data. Most of them are trained using various versions of the Gibbs Sampling (GS) method. In this work, we consider Variational Bayes (VB) as an alternative inference process. Even though VB yields an approximate solution of the posterior distribution, it can be easily parallelized, which makes it more suitable for large databases. Results show that, although VB inference is an order of magnitude faster, it also outperforms GS in terms of accuracy.
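A toy illustration of the VB idea, not the paper's model: sklearn's BayesianGaussianMixture runs variational inference for a Dirichlet-process mixture and prunes unused components, loosely analogous to letting the data decide how many units it supports. The 2-D toy data and the component cap are assumptions for the demo.

import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Three well-separated clusters standing in for "acoustic units".
X = np.vstack([rng.normal(m, 0.3, size=(200, 2)) for m in (-2, 0, 2)])

dpgmm = BayesianGaussianMixture(
    n_components=10,  # upper bound on the number of units
    weight_concentration_prior_type="dirichlet_process",
    max_iter=500).fit(X)
print("effective components:", np.sum(dpgmm.weights_ > 0.01))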
Digital Signal Processing | 2000
Dijana Petrovska-Delacrétaz; Jan Cernocký; Jean Hennebert; Gérard Chollet
Speech is composed of different sounds (acoustic segments), and speakers differ in their pronunciation of these sounds. The segmental approaches described in this paper are meant to exploit these differences for speaker verification purposes. In such approaches, the speech is divided into different classes, and speaker modeling is done for each class. The speech segmentation applied is based on automatic, language-independent speech processing tools that provide a segmentation of the speech requiring neither phonetic nor orthographic transcriptions of the speech data. Two speaker modeling approaches, based on multilayer perceptrons (MLPs) and on Gaussian mixture models (GMMs), are studied. The MLP-based segmental systems have performance comparable to that of the global MLP-based systems, and in mismatched train-test conditions slightly better results are obtained with the segmental MLP system. The segmental GMM systems gave poorer results than the equivalent global GMM systems.
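A minimal sketch of the segmental GMM scoring idea, assuming frames have already been labelled with data-driven classes: one target and one background GMM per class, with the verification score accumulated class by class as a frame-averaged log-likelihood ratio. The class labelling, model sizes, and toy data are illustrative assumptions.

import numpy as np
from sklearn.mixture import GaussianMixture

def segmental_llr(frames, classes, target_gmms, background_gmms):
    # Frame-averaged log-likelihood ratio, accumulated per segment class.
    score, n = 0.0, 0
    for c in np.unique(classes):
        x = frames[classes == c]
        if c in target_gmms and len(x):
            score += (target_gmms[c].score(x) - background_gmms[c].score(x)) * len(x)
            n += len(x)
    return score / max(n, 1)

# Toy setup: 2 classes, 13-dim frames; the "target speaker" is shifted by 0.5.
rng = np.random.default_rng(1)
bg_frames, classes = rng.normal(size=(400, 13)), rng.integers(0, 2, 400)
tgt = {c: GaussianMixture(4).fit(bg_frames[classes == c] + 0.5) for c in (0, 1)}
bkg = {c: GaussianMixture(4).fit(bg_frames[classes == c]) for c in (0, 1)}

test = rng.normal(size=(100, 13)) + 0.5  # frames from the "target" speaker
print(segmental_llr(test, rng.integers(0, 2, 100), tgt, bkg))  # > 0 expected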
Text, Speech and Dialogue | 2002
Geneviève Baudoin; François Capman; Jan Cernocký; Fadi El Chami; Maurice Charbit; Gérard Chollet; Dijana Petrovska-Delacrétaz
ALISP (Automatic Language Independent Speech Processing) units are an alternative to phoneme-derived units in speech processing. This article describes advances in very low bit rate coding using ALISP units. Results of speaker-independent experiments are reported, and speaker clustering using vector quantization is proposed. Improvements to speech re-synthesis using the Harmonic Noise Model and dynamic selection of units are discussed.
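To see why unit-based coding reaches very low bit rates, consider the toy calculation below: transmitting one codebook index per unit run instead of one spectral vector per frame. The codebook size, frame rate, and random stand-in data are illustrative assumptions; a real ALISP coder also transmits prosody and re-synthesizes the signal.

import numpy as np
from sklearn.cluster import KMeans

frames = np.random.randn(500, 13)  # 5 s of spectral frames at a 10 ms shift
codebook = KMeans(n_clusters=64, n_init=10).fit(frames)
indices = codebook.predict(frames)

runs = 1 + np.count_nonzero(np.diff(indices))  # one index per unit run
bits = runs * np.log2(64)                      # 6 bits per index
print(f"{runs} units -> {bits / 5:.0f} bit/s")  # vs. tens of kbit/s for raw frames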
Conference of the International Speech Communication Association | 2016
Hossein Zeinali; Hossein Sameti; Lukas Burget; Jan Cernocký; Nooshin Maghsoodi; Pavel Matejka
Recently, a new data collection was initiated within the RedDots project in order to evaluate text-dependent and text-prompted speaker recognition technology on data from a wider speaker population and with more realistic noise, channel, and phonetic variability. This paper analyses our systems built for the RedDots challenge, the effort to collect this new evaluation data set and compare the initial results obtained at different sites. We use our recently introduced HMM-based i-vector approach, where, instead of the traditional GMM, a set of phone-specific HMMs is used to collect the sufficient statistics for i-vector extraction. Our systems are trained in a completely phrase-independent way on data from the RSR2015 and LibriSpeech databases. We compare systems making use of standard cepstral features and their combination with neural network based bottleneck features. The best results are obtained with a score-level fusion of such systems.
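Score-level fusion is commonly done with linear logistic regression trained on a development set; the sketch below shows that recipe on synthetic scores. The two-system setup mirrors the cepstral and bottleneck systems mentioned above, but the fusion method and all numbers here are assumptions, not details from the paper.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 1000
labels = rng.integers(0, 2, n)                # 1 = target trial, 0 = impostor
s_cep = labels * 1.0 + rng.normal(0, 1.0, n)  # scores from a "cepstral" system
s_bn = labels * 1.2 + rng.normal(0, 1.1, n)   # scores from a "bottleneck" system

# Learn fusion weights on development trials, then produce fused scores.
fusion = LogisticRegression().fit(np.c_[s_cep, s_bn], labels)
fused = fusion.decision_function(np.c_[s_cep, s_bn])
print("fusion weights:", fusion.coef_)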