Jan Vaněk
University of West Bohemia
Publication
Featured research published by Jan Vaněk.
EURASIP Journal on Audio, Speech, and Music Processing | 2011
Josef Psutka; Jan Švec; Jan Vaněk; Aleš Pražák; Luboš Šmídl; Pavel Ircing
The main objective of the work presented in this paper was to develop a complete system that would accomplish the original visions of the MALACH project. Those goals were to employ automatic speech recognition and information retrieval techniques to provide improved access to the large video archive containing recorded testimonies of Holocaust survivors. The system has so far been developed for the Czech part of the archive only. It takes advantage of a state-of-the-art speech recognition system tailored to the challenging properties of the recordings in the archive (elderly speakers, spontaneous speech and emotionally loaded content) and its close coupling with the actual search engine. The design of the algorithm, which adopts the spoken term detection approach, is focused on the speed of retrieval. The resulting system is able to search through the 1,000 h of video constituting the Czech portion of the archive and find query word occurrences in a matter of seconds. The phonetic search, implemented alongside the search based on lexicon words, makes it possible to find even words outside the ASR system lexicon, such as names, geographic locations or Jewish slang.
text speech and dialogue | 2010
Jan Vaněk; Josef Psutka
Gender-dependent (male/female) acoustic models are more acoustically homogeneous and therefore give better recognition performance than a single gender-independent (GI) model. This paper deals with the problem of how to use these gender-based acoustic models in a real-time LVCSR (Large Vocabulary Continuous Speech Recognition) system that has been used for more than a year by Czech TV for automatic subtitling of Parliament meetings broadcast on the channel CT24. Frequent changes of speakers and the direct connection of the LVCSR system to the TV audio stream require the models to be switched or fused automatically and as soon as possible. The paper presents various techniques that use the output probabilities to quickly select the better model or a combination of the models. The best proposed method achieved over 11% relative WER reduction in comparison with the GI model.
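The abstract does not say how the output probabilities are turned into a model choice. As a minimal illustration only, and not the authors' method, one simple assumption is to sum each model's per-frame log-likelihoods over a short recent window and pick the model with the highest total. The function name `select_model`, the model names, and the scores below are all hypothetical.

```python
from typing import Dict, Sequence


def select_model(frame_loglikes: Dict[str, Sequence[float]], window: int) -> str:
    """Pick the model with the highest summed log-likelihood over the
    most recent `window` frames (an illustrative rule, not the paper's)."""
    if window <= 0:
        raise ValueError("window must be positive")
    if not frame_loglikes:
        raise ValueError("at least one model is required")
    # Sum only the last `window` frames so the decision follows speaker changes.
    totals = {
        name: sum(list(scores)[-window:])
        for name, scores in frame_loglikes.items()
    }
    return max(totals, key=totals.get)


# Hypothetical per-frame log-likelihoods: the speaker changes after frame 3.
scores = {
    "male": [-3.0, -3.1, -3.0, -6.5, -6.8],
    "female": [-6.0, -6.2, -6.1, -3.2, -3.0],
}
short_window_choice = select_model(scores, window=2)
long_window_choice = select_model(scores, window=5)
print(short_window_choice, long_window_choice)
```

With these made-up scores a short window reacts to the speaker change while a long window still favors the earlier speaker, which is the trade-off between switching speed and stability that the abstract alludes to.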
text speech and dialogue | 2014
Pavel Campr; Marie Kunešová; Jan Vaněk; Jan Cech; Josef Psutka
Our goal is to create speaker models in the audio domain and face models in the video domain from a set of videos in an unsupervised manner. Such models can be used later for speaker identification in the audio domain (answering the question "Who was speaking and when") and/or for face recognition ("Who was seen and when") for given videos that contain speaking persons. The proposed system is based on an audio-video diarization system that tries to resolve the disadvantages of the individual modalities. Experiments on broadcasts of Czech parliament meetings show that the proposed combination of individual audio and video diarization systems yields an improvement of the diarization error rate (DER).
text, speech and dialogue | 2010
Josef Psutka; Jan Švec; Jan Vaněk; Aleš Pražák; Luboš Šmídl
In this paper we describe a system for fast phonetic/lexical search in the large archives of Czech Holocaust testimonies. The developed system is the first step towards fulfilling the MALACH project visions [1, 2], at least with regard to easier and faster access to the Czech part of the archives. More than one thousand hours of spontaneous, accented and highly emotional speech of Czech Holocaust survivors, stored at the USC Shoah Foundation Institute as video interviews, were automatically transcribed and phonetically/lexically indexed. Special attention was paid to the processing of colloquial words, which appear very frequently in spontaneous Czech speech. The final access to the archives is very fast and makes it possible to detect segments of interviews containing spoken words, clusters of words occurring within pre-defined time intervals, and also words that were not included in the working vocabulary (OOV words).
text speech and dialogue | 2005
Aleš Padrta; Jan Vaněk
In this paper, improvements to the speaker verification system used at the Department of Cybernetics at the University of West Bohemia are introduced. The paper summarizes our current knowledge in the domains of acoustic modeling, model creation, and score normalization based on universal background models. The constituent components of the state-of-the-art verification system were modified or replaced on the basis of this knowledge. A set of experiments was performed to evaluate and compare the performance of the improved verification system and the baseline verification system based on the HTK toolkit. The results show that the improved verification system outperforms the baseline system on both of the reviewed criteria: the equal error rate and the time consumption.
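The equal error rate mentioned above is the operating point at which the false-acceptance rate equals the false-rejection rate. The sketch below is a generic way to estimate it from two lists of verification scores; it is not taken from the paper, and the function name `equal_error_rate` and the example scores are hypothetical. It assumes that a higher score means "more likely the claimed speaker".

```python
from typing import Sequence, Tuple


def equal_error_rate(
    target: Sequence[float], impostor: Sequence[float]
) -> Tuple[float, float]:
    """Return (eer, threshold) at the candidate threshold where the
    false-acceptance and false-rejection rates are closest."""
    if not target or not impostor:
        raise ValueError("both score lists must be non-empty")
    best = None
    # Every observed score is tried as a decision threshold.
    for thr in sorted(set(target) | set(impostor)):
        frr = sum(s < thr for s in target) / len(target)       # true speakers rejected
        far = sum(s >= thr for s in impostor) / len(impostor)  # impostors accepted
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, thr)
    return best[1], best[2]


# Hypothetical scores for genuine (target) and impostor trials.
target_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.6, 0.3, 0.2, 0.1]
eer, threshold = equal_error_rate(target_scores, impostor_scores)
print(eer, threshold)
```

On real data the two rates rarely cross exactly at an observed score, which is why the sketch reports the average of the two rates at the closest threshold.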
International Conference on Statistical Language and Speech Processing | 2017
Jan Vaněk; Jan Zelinka; Daniel Soutner; Josef Psutka
Neural Networks (NNs) are prone to overfitting, especially Deep Neural Networks in cases where the training data are not abundant. There are several techniques that help to prevent overfitting, e.g., L1/L2 regularization, unsupervised pre-training, early stopping, dropout, bootstrapping or cross-validation model aggregation. In this paper, we propose a regularization post-layer that may be combined with the prior techniques and that brings additional robustness to the NN. We trained the regularization post-layer in the cross-validation (CV) aggregation scenario: we used the CV held-out folds to train an additional neural network post-layer that boosts the network robustness. We have tested various post-layer topologies and compared the results with other regularization techniques. As a benchmark task, we have selected TIMIT phone recognition, which is a well-known and still popular task where the training data are limited and the regularization techniques used play a key role. However, the regularization post-layer is a general method, and it may be employed in any classification task.
international conference on speech and computer | 2014
Zbyněk Zajíc; Jan Zelinka; Jan Vaněk; Luděk Müller
The aim of this work is to propose a refinement of the shift-MLLR (shift Maximum Likelihood Linear Regression) adaptation of an acoustic model in the case of a limited amount of adaptation data, which can lead to ill-conditioned transformation matrices. We try to suppress the influence of badly estimated transformation parameters by utilizing an Artificial Neural Network (ANN), specifically a Convolutional Neural Network (CNN) with a bottleneck layer at the end. The badly estimated shift-MLLR transformation is propagated through an ANN (suitably trained beforehand), and the output of the net is used as the new refined transformation. To train the ANN, the well-conditioned and the badly conditioned shift-MLLR transformations are used as the outputs and inputs of the ANN, respectively.
text speech and dialogue | 2011
Josef Psutka; Jan Vaněk
This paper describes our effort to build speaker-clustered acoustic models as a part of the real-time LVCSR system that has been used for more than a year by Czech TV for automatic subtitling of parliament meetings broadcast on the channel CT24. Speaker-clustered acoustic models are more acoustically homogeneous and therefore give better recognition performance than a single gender-independent model or even gender-dependent models. Frequent changes of speakers and a direct connection of the LVCSR system to the audio channel require automatic switching/fusion of models as quickly as possible. An important part of the solution is the real-time likelihood evaluation of all clustered acoustic models, taking advantage of a fast GPU (Graphics Processing Unit). The proposed method achieved a relative WER reduction of over 2.34% compared with the baseline gender-independent model, with more than 2M Gaussian mixtures evaluated in real time.
International Conference on Statistical Language and Speech Processing | 2018
Jan Vaněk; Josef Michalek; Jan Zelinka; Josef Psutka
Recently, recurrent neural networks (RNNs) have become state-of-the-art in acoustic modeling for automatic speech recognition. The long short-term memory (LSTM) units are the most popular ones. However, alternative units like the gated recurrent unit (GRU) and its modifications have outperformed LSTM in some publications. In this paper, we compared five neural network (NN) architectures with various adaptation and feature normalization techniques. We have evaluated feature-space maximum likelihood linear regression, five variants of i-vector adaptation and two variants of cepstral mean normalization. Most adaptation and normalization techniques were developed for feed-forward NNs and, according to the results in this paper, not all of them also work with RNNs. For the experiments, we have chosen the well-known and available TIMIT phone recognition task. Phone recognition is much more sensitive to the quality of the acoustic model (AM) than a large-vocabulary task with a complex language model. We have also published open-source scripts to make it easy to replicate the results and to help continue the development.
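The abstract does not specify which two variants of cepstral mean normalization were compared. As a minimal sketch of the basic idea, assuming the common per-utterance form, the mean of each feature dimension is computed over all frames of one utterance and subtracted from every frame. The function name `cepstral_mean_normalize` and the toy feature values are hypothetical.

```python
from typing import List


def cepstral_mean_normalize(frames: List[List[float]]) -> List[List[float]]:
    """Subtract the per-dimension mean, computed over all frames of one
    utterance, from every frame (per-utterance CMN, assumed variant)."""
    if not frames:
        return []
    dim = len(frames[0])
    if any(len(f) != dim for f in frames):
        raise ValueError("all frames must have the same dimension")
    # Mean of each cepstral coefficient across the utterance.
    means = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    return [[f[d] - means[d] for d in range(dim)] for f in frames]


# Toy utterance: two frames with two cepstral coefficients each.
utterance = [[1.0, 10.0], [3.0, 14.0]]
normalized = cepstral_mean_normalize(utterance)
print(normalized)
```

After normalization each dimension has zero mean over the utterance, which removes constant channel offsets from the features.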
international conference on bioinformatics and biomedical engineering | 2017
Pavla Urbanová; Jan Vaněk; Pavel Souček; Dalibor Štys; Petr Císař; M. Železný
Bioimaging, image segmentation, thresholding, and multivariate processing are helpful tools in the analysis of series of images from many time-lapse experiments. Different methods from mathematics, algorithm design and artificial intelligence can be modified, parametrized or adapted for a single-purpose case with a completely different biological background (namely microorganisms, tissue cultures, aquaculture). However, most of these tasks rely on an initial image segmentation before the feature extraction and comparison tasks are evaluated. In this article, we compare several classical approaches in bioinformatical and biophysical cases with the neural network approach. The concept of the neural network was adopted from biological neural networks. The networks need to be trained; however, after the learning phase, they should be able to find one solution for various objects. The methods are compared by means of the segmentation error relative to a human supervisor.