
Publication


Featured research published by Karel Palecek.


International Conference on Telecommunications | 2013

Audio-visual speech recognition in noisy audio environments

Karel Palecek; Josef Chaloupka

It is well known that the visual part of speech can improve the recognition rate, mainly in noisy conditions. The main goal of this work is to find a set of visual features that can be used in our audio-visual speech recognition systems. Discrete Cosine Transform (DCT) and Active Appearance Model (AAM) based visual features are extracted from the visual speech signal, enhanced by a simplified variant of Hierarchical Linear Discriminant Analysis (HiLDA), and normalized across speakers. The visual features are then combined with standard MFCC audio features by the middle fusion method. The audio-visual recognition results are compared with experiments in which additive noise in the audio signal is reduced by the log-spectra minimum mean square error and multiband spectral subtraction methods.
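
As a rough illustration of DCT-based visual features and feature-level (middle) fusion of the kind described above, the sketch below extracts low-order 2-D DCT coefficients from a hypothetical pre-cropped mouth region and concatenates them with an MFCC frame. The ROI size, number of retained coefficients, and function names are assumptions for illustration, not details from the paper.

    # Illustrative sketch only (not the paper's implementation): keep the
    # low-frequency 2-D DCT coefficients of a grayscale mouth ROI and
    # concatenate them with an MFCC frame (feature-level "middle" fusion).
    import numpy as np
    from scipy.fftpack import dct

    def dct_visual_features(mouth_roi, keep=8):
        """mouth_roi: 2-D float array (e.g. a 32x48 grayscale crop)."""
        # Separable 2-D DCT: type-II DCT along rows, then along columns.
        coeffs = dct(dct(mouth_roi, axis=0, norm='ortho'), axis=1, norm='ortho')
        # Keep the top-left (low-frequency) block and flatten it.
        return coeffs[:keep, :keep].ravel()

    def fuse_middle(mfcc_frame, visual_frame):
        """Concatenate synchronized audio and visual feature vectors."""
        return np.concatenate([mfcc_frame, visual_frame])

    # Random data stands in for a real ROI and MFCC frame.
    roi = np.random.rand(32, 48)
    mfcc = np.random.rand(13)
    observation = fuse_middle(mfcc, dct_visual_features(roi))  # 13 + 64 dims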


European Signal Processing Conference | 2016

Lipreading using spatiotemporal histogram of oriented gradients

Karel Palecek

We propose a visual speech parametrization based on the histogram of oriented gradients (HOG) for the task of lipreading from frontal face videos. Inspired by the success of spatiotemporal local binary patterns, the features are designed to capture the dynamic information contained in the input video sequence by combining HOG descriptors extracted from the three orthogonal planes spanned by the x, y, and t axes. We integrate our features into a system based on hidden Markov models (HMM) and show that, with a robust and properly tuned parametrization, this traditional scheme can outperform recent sophisticated embedding approaches to lipreading. We perform experiments on three different datasets, two of which are publicly available. In order to conduct an unbiased feature comparison, the model learning process, including hyperparameter tuning, is automated as much as possible; to this end, we rely heavily on cross-validation.
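
A minimal sketch of the three-orthogonal-planes idea behind such a descriptor, assuming central XY, XT, and YT slices of a mouth-ROI video cube and the off-the-shelf HOG from scikit-image; the slice selection and HOG parameters are illustrative assumptions, not the paper's exact configuration.

    # Illustrative sketch: HOG on the three orthogonal planes (XY, XT, YT)
    # of a video cube, concatenated into one spatiotemporal descriptor.
    import numpy as np
    from skimage.feature import hog

    def hog_top(video_cube, orientations=9, ppc=(8, 8)):
        """video_cube: array of shape (T, H, W), grayscale mouth ROI over time."""
        t, h, w = video_cube.shape
        planes = [
            video_cube[t // 2, :, :],   # XY plane: central frame
            video_cube[:, h // 2, :],   # XT plane: central row over time
            video_cube[:, :, w // 2],   # YT plane: central column over time
        ]
        feats = [hog(p, orientations=orientations, pixels_per_cell=ppc,
                     cells_per_block=(2, 2), feature_vector=True)
                 for p in planes]
        return np.concatenate(feats)

    cube = np.random.rand(25, 32, 48)       # 25 frames of a 32x48 ROI
    descriptor = hog_top(cube)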


International Conference on Speech and Computer | 2014

Extraction of Features for Lip-reading Using Autoencoders

Karel Palecek

We study the incorporation of facial depth data in the task of isolated-word visual speech recognition. We propose novel features based on unsupervised training of a single-layer autoencoder. The features are extracted from both the video and depth channels obtained by the Microsoft Kinect device. We perform all experiments on our database of 54 speakers, each uttering 50 words. We compare our autoencoder features to traditional methods such as DCT or PCA. The features are further processed by a simplified variant of hierarchical linear discriminant analysis in order to capture the speech dynamics. Classification is performed using a multi-stream hidden Markov model for various combinations of the audio, video, and depth channels. We also evaluate the visual features in joint audio-video isolated-word recognition in noisy environments.
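
The following is a minimal sketch of a single-hidden-layer autoencoder of the kind described above, written in PyTorch on random stand-in data; the layer sizes, activation, and training loop are assumptions for illustration, not the trained model from the paper.

    # Minimal sketch of a single-hidden-layer autoencoder for learning
    # visual/depth features (an illustration of the idea only).
    import torch
    import torch.nn as nn

    class SingleLayerAutoencoder(nn.Module):
        def __init__(self, input_dim, code_dim=64):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(input_dim, code_dim), nn.Sigmoid())
            self.decoder = nn.Linear(code_dim, input_dim)

        def forward(self, x):
            code = self.encoder(x)
            return self.decoder(code), code

    # Unsupervised training on flattened mouth-ROI patches (random data here).
    model = SingleLayerAutoencoder(input_dim=32 * 48)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    patches = torch.rand(1024, 32 * 48)
    for _ in range(10):                      # a few epochs for illustration
        recon, _ = model(patches)
        loss = loss_fn(recon, patches)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # After training, the encoder output serves as the per-frame feature.
    features = model.encoder(patches).detach()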


Multimedia Signal Processing | 2012

Browsing, indexing and automatic transcription of lectures for distance learning

Petr Cerva; Jan Silovsky; Jindrich Zdansky; Ondrej Smola; Karel Blavka; Karel Palecek; Jan Nouza; Jiri Malek

This paper presents a complex system developed to improve the quality of distance learning by allowing people to browse the content of various (academic) lectures. The system consists of several main modules. The first, an automatic speech recognition (ASR) module, is designed to cope with the inflective Czech language and provides time-aligned transcriptions of the input audio-visual recordings of lectures. These transcriptions are generated offline in two recognition passes using speaker adaptation methods and language models mixed from various text sources, including transcriptions of broadcast programs, spontaneous telephone talks, web discussions, theses, etc. The lecture recordings and their transcriptions are then indexed and stored in a database. The next module, a client-server web lecture browser, allows users to browse or play the indexed content and search within it.
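
As a hypothetical illustration of the indexing step, the sketch below builds a simple inverted index over time-aligned transcriptions so that a browser could jump from a query word to its positions in the recordings; the data layout and function names are invented for this example and do not reflect the system's actual backend.

    # Hypothetical illustration: map each transcribed word to the lectures and
    # time spans where it occurs, so search results can seek into the audio.
    from collections import defaultdict

    def build_index(transcripts):
        """transcripts: {lecture_id: [(word, start_sec, end_sec), ...]}"""
        index = defaultdict(list)
        for lecture_id, words in transcripts.items():
            for word, start, end in words:
                index[word.lower()].append((lecture_id, start, end))
        return index

    index = build_index({
        "lecture_01": [("signal", 12.4, 12.9), ("processing", 12.9, 13.6)],
    })
    print(index["signal"])   # -> [('lecture_01', 12.4, 12.9)]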


International Conference on Speech and Computer | 2017

Utilizing Lipreading in Large Vocabulary Continuous Speech Recognition

Karel Palecek

The vast majority of current research in the area of audio-visual speech recognition via lipreading from frontal face videos focuses on simple cases such as isolated phrase recognition or structured speech, where the vocabulary is limited to several tens of units. In this paper, we diverge from these traditional applications and investigate the effect of incorporating the visual information in the task of continuous speech recognition with vocabulary sizes ranging from several hundred to half a million words. To this end, we evaluate various visual speech parametrizations, both existing and novel, that are designed to capture different kinds of information in the video signal. The experiments are conducted on a moderate-sized dataset of 54 speakers, each uttering 100 sentences in Czech. We show that even for large vocabularies the visual signal contains enough information to improve word accuracy by up to 15% relative to acoustic-only recognition.
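
For clarity on the reported metric, the snippet below shows how a relative improvement in word accuracy is computed; the numbers are illustrative only and are not results from the paper.

    # Worked example of relative improvement over the acoustic-only system
    # (illustrative numbers, not results from the paper).
    def relative_improvement(acc_audio, acc_audiovisual):
        return (acc_audiovisual - acc_audio) / acc_audio * 100.0

    # E.g. going from 60% to 69% word accuracy is a 15% relative improvement.
    print(relative_improvement(60.0, 69.0))   # -> 15.0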


Journal on Multimodal User Interfaces | 2018

Experimenting with lipreading for large vocabulary continuous speech recognition

Karel Palecek

The vast majority of current research in the area of audio-visual speech recognition via lipreading from frontal face videos focuses on simple cases such as isolated phrase recognition or structured speech, where the vocabulary is limited to several tens of units. In this paper, we diverge from these traditional applications and investigate the effect of incorporating visual and also depth information in the task of continuous speech recognition with vocabulary sizes ranging from several hundred to half a million words. To this end, we evaluate various visual speech parametrizations, both existing and novel, that are designed to capture different kinds of information in the video and depth signals. The experiments are conducted on a moderate-sized dataset of 54 speakers, each uttering 100 sentences in Czech. Both the video and depth data were captured by the Microsoft Kinect device. We show that even for large vocabularies the visual signal contains enough information to improve word accuracy by up to 22% relative to acoustic-only recognition. Somewhat surprisingly, a relative improvement of up to 16% has also been reached using the interpolated depth data.
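
One plausible way to obtain interpolated depth data is to fill the invalid (zero) pixels of a Kinect depth map from its valid neighbours, as sketched below; the paper does not specify this exact procedure, so the method and parameters here are assumptions.

    # Hedged sketch: fill missing Kinect depth values (zeros) by
    # nearest-neighbour interpolation from the valid measurements.
    import numpy as np
    from scipy.interpolate import griddata

    def interpolate_depth(depth):
        """depth: 2-D array in millimetres, 0 marks missing measurements."""
        h, w = depth.shape
        yy, xx = np.mgrid[0:h, 0:w]
        valid = depth > 0
        return griddata(
            points=np.stack([yy[valid], xx[valid]], axis=1),
            values=depth[valid],
            xi=(yy, xx),
            method='nearest',
        )

    depth_map = np.random.randint(1, 4000, size=(48, 64)).astype(float)
    depth_map[10:20, 10:20] = 0              # simulate a hole
    dense = interpolate_depth(depth_map)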


Text, Speech and Dialogue | 2017

Spatiotemporal Convolutional Features for Lipreading

Karel Palecek

We propose a visual parametrization method for the task of lipreading and audio-visual speech recognition from frontal face videos. The presented features utilize learned spatiotemporal convolutions in a deep neural network trained to predict phonemes at the frame level. The network is trained on a manually transcribed, moderate-sized dataset of Czech television broadcasts, but we show that the resulting features generalize well to other languages. On the publicly available OuluVS dataset, a word accuracy of 91% was achieved using the vanilla convolutional features and 97.2% after fine-tuning, a substantial improvement over the state of the art on this popular benchmark. Contrary to most work on lipreading, we also demonstrate the usefulness of the proposed parametrization in the task of continuous audio-visual speech recognition.
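
A rough sketch of the general idea, assuming a small 3-D convolutional network that predicts a phoneme label for every frame and whose pre-classifier activations serve as per-frame visual features; the architecture, layer sizes, and phoneme inventory are assumptions, not the network from the paper.

    # Illustrative sketch: spatiotemporal (3-D) convolutions over a mouth-ROI
    # clip, with frame-level phoneme logits and per-frame features.
    import torch
    import torch.nn as nn

    class Conv3DPhonemeNet(nn.Module):
        def __init__(self, num_phonemes=40, feat_dim=128):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Conv3d(1, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
                nn.ReLU(),
                nn.MaxPool3d(kernel_size=(1, 2, 2)),        # pool only spatially
                nn.Conv3d(32, 64, kernel_size=(3, 3, 3), padding=1),
                nn.ReLU(),
                nn.AdaptiveAvgPool3d((None, 1, 1)),          # keep the time axis
            )
            self.project = nn.Linear(64, feat_dim)
            self.classifier = nn.Linear(feat_dim, num_phonemes)

        def forward(self, clip):                 # clip: (batch, 1, T, H, W)
            x = self.backbone(clip)              # (batch, 64, T, 1, 1)
            x = x.squeeze(-1).squeeze(-1).transpose(1, 2)    # (batch, T, 64)
            feats = torch.relu(self.project(x))  # per-frame visual features
            return self.classifier(feats), feats # frame-level phoneme logits

    logits, features = Conv3DPhonemeNet()(torch.rand(2, 1, 25, 32, 48))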


European Signal Processing Conference | 2017

Audio/video supervised independent vector analysis through multimodal pilot dependent components

Francesco Nesta; Saeed Mosayyebpour; Zbynek Koldovsky; Karel Palecek

Independent Vector Analysis (IVA) is a powerful tool for estimating, in the frequency domain, the broadband acoustic transfer functions between multiple sources and the microphones. In this work, we consider an extended IVA model that adopts the concept of pilot dependent signals. Without imposing any constraint on the de-mixing system, pilot signals depending on the target source are injected into the model, enforcing the permutation of the outputs to be consistent over time. A neural network trained on acoustic data and a lip-motion detector are jointly used to produce a multimodal pilot signal dependent on the target source. Experimental results show that this structure allows the enhancement of a predefined target source in very difficult and ambiguous scenarios.
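
A highly simplified illustration of how a multimodal pilot signal might be formed from an acoustic target-speech probability and a lip-motion activity measure; the IVA updates themselves are omitted, and the combination rule and names are assumptions rather than the method's actual formulation.

    # Very simplified sketch: combine a per-frame acoustic target-speech
    # probability with a lip-motion activity cue into one pilot signal that
    # stays tied to the target speaker over time (assumed combination rule).
    import numpy as np

    def lip_motion_activity(mouth_rois):
        """mouth_rois: (T, H, W) crops; mean absolute frame difference."""
        diffs = np.abs(np.diff(mouth_rois.astype(float), axis=0)).mean(axis=(1, 2))
        activity = np.concatenate([[0.0], diffs])
        return activity / (activity.max() + 1e-8)

    def multimodal_pilot(acoustic_target_prob, mouth_rois):
        """Element-wise combination of the two per-frame cues."""
        return acoustic_target_prob * lip_motion_activity(mouth_rois)

    pilot = multimodal_pilot(np.random.rand(100), np.random.rand(100, 32, 48))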


International Conference on Telecommunications | 2015

Comparison of depth-based features for lipreading

Karel Palecek

We examine the effect of depth information captured by the Microsoft Kinect on the task of visual speech recognition. We propose depth-based active appearance model (AAM) features and show improved results over the discrete cosine transform (DCT). The visual and depth features are evaluated on a database of 54 speakers, each uttering 50 isolated words. In order to exploit the speech dynamics, the features are enhanced by a simplified one-stage variant of hierarchical linear discriminant analysis (Hi-LDA). In the experiments, we consider feature fusion via a combined video-depth active appearance model as a form of early integration and compare it to a traditional multi-stream hidden Markov model as a form of decision fusion. We also perform experiments on audio-visual recognition in noisy environments and show that incorporating depth information improves results over both traditional audio-video fusion and the use of speech enhancement algorithms.
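
A brief sketch of the simplified one-stage Hi-LDA idea as used above: each feature frame is stacked with its neighbouring frames and the stacked vector is projected by class-supervised LDA; the context width, feature dimensions, and labels below are illustrative assumptions.

    # Sketch: stack each frame with its +/- k neighbours, then project the
    # stacked vector with LDA to capture speech dynamics (assumed parameters).
    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def stack_context(frames, k=3):
        """frames: (T, D); returns (T, (2k+1)*D) with edge frames replicated."""
        padded = np.pad(frames, ((k, k), (0, 0)), mode='edge')
        return np.hstack([padded[i:i + len(frames)] for i in range(2 * k + 1)])

    frames = np.random.rand(500, 40)                 # per-frame visual features
    labels = np.random.randint(0, 10, size=500)      # e.g. state/phoneme labels
    lda = LinearDiscriminantAnalysis(n_components=9)
    dynamic_feats = lda.fit_transform(stack_context(frames), labels)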


International Conference on Telecommunications | 2016

Depth-based features in audio-visual speech recognition

Karel Palecek; Josef Chaloupka

We study the impact of depth-based visual features in systems for visual and audio-visual speech recognition. Instead of being reconstructed from multiple views, the depth maps are obtained by the Kinect sensor, which is better suited for real-world applications. We extract several types of visual features from the video and depth channels and evaluate their performance both individually and in cross-channel combination. In order to show the information complementarity between the video-based and the depth-based features, we examine the relative importance of each channel when they are combined via weighted multi-stream hidden Markov models. We also introduce novel parametrizations based on the discrete cosine transform and the histogram of oriented gradients. The contribution of all presented visual speech features is demonstrated in the task of audio-visual speech recognition under noisy conditions.
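
As a small illustration of weighted multi-stream combination, the sketch below mixes per-frame, per-state log-likelihoods from the audio, video, and depth streams with fixed stream weights; the weights and shapes are hypothetical and only demonstrate the weighting idea, not the actual HMM decoder.

    # Illustrative sketch: weighted combination of per-frame, per-state
    # log-likelihoods from three streams (weights here are hypothetical).
    import numpy as np

    def combine_streams(loglik_audio, loglik_video, loglik_depth,
                        w_audio=1.0, w_video=0.6, w_depth=0.4):
        """Each input: (T, num_states) log-likelihood matrix."""
        return (w_audio * loglik_audio
                + w_video * loglik_video
                + w_depth * loglik_depth)

    T, S = 200, 30
    combined = combine_streams(np.random.randn(T, S),
                               np.random.randn(T, S),
                               np.random.randn(T, S))
    best_states = combined.argmax(axis=1)    # frame-wise best state (illustration)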

Collaboration


Dive into Karel Palecek's collaborations.

Top Co-Authors

Jan Nouza (Technical University of Liberec)
Petr Cerva (Technical University of Liberec)
Jan Silovsky (Technical University of Liberec)
Josef Chaloupka (Technical University of Liberec)
Jindrich Zdansky (Technical University of Liberec)
Jiri Malek (Technical University of Liberec)
Karel Blavka (Technical University of Liberec)
Ondrej Smola (Technical University of Liberec)
Zbynek Koldovsky (Technical University of Liberec)