Jan Vaněk | Researchain

Publication

Featured research published by Jan Vaněk.

EURASIP Journal on Audio, Speech, and Music Processing | 2011

System for fast lexical and phonetic spoken term detection in a Czech cultural heritage archive

Josef Psutka; Jan Švec; Jan Vaněk; Aleš Pražák; Luboš Šmídl; Pavel Ircing

The main objective of the work presented in this paper was to develop a complete system that would accomplish the original visions of the MALACH project. Those goals were to employ automatic speech recognition (ASR) and information retrieval techniques to provide improved access to the large video archive containing recorded testimonies of Holocaust survivors. The system has so far been developed for the Czech part of the archive only. It takes advantage of a state-of-the-art speech recognition system tailored to the challenging properties of the recordings in the archive (elderly speakers, spontaneous speech and emotionally loaded content) and of its close coupling with the actual search engine. The design of the algorithm, which adopts the spoken term detection approach, is focused on the speed of retrieval. The resulting system is able to search through the 1,000 h of video constituting the Czech portion of the archive and find query word occurrences in a matter of seconds. The phonetic search, implemented alongside the search based on lexicon words, makes it possible to find even words outside the ASR system lexicon, such as names, geographic locations or Jewish slang.

Text, Speech and Dialogue | 2010

Gender-dependent acoustic models fusion developed for automatic subtitling of parliament meetings broadcasted by the Czech TV

Jan Vaněk; Josef Psutka

Gender-dependent (male/female) acoustic models are more acoustically homogeneous and therefore give better recognition performance than a single gender-independent (GI) model. This paper addresses the problem of how to use these gender-based acoustic models in a real-time LVCSR (Large Vocabulary Continuous Speech Recognition) system that has been used for more than a year by the Czech TV for automatic subtitling of Parliament meetings broadcast on the channel CT24. Frequent changes of speakers and the direct connection of the LVCSR system to the TV audio stream require switching/fusion of models automatically and as soon as possible. The paper presents various techniques based on using the output probabilities for quick selection of the better model or a combination of the models. The best proposed method achieved over 11% relative WER reduction in comparison with the GI model.
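As a reading aid for the figure quoted in this abstract, relative WER reduction is the drop in word error rate divided by the baseline WER. The sketch below is illustrative only: the function name and the example WER values are assumptions, not numbers taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction as a fraction of the baseline WER.

    Both arguments are word error rates in the same unit (for example, percent).
    The name and signature are hypothetical, chosen for illustration.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical values, not from the paper: a baseline WER of 20.0%
# and an improved WER of 17.8%.
example = relative_wer_reduction(20.0, 17.8)
```

With these hypothetical values the absolute reduction is 2.2 points, while the relative reduction is about 0.11, i.e. roughly 11% of the baseline error.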

Text, Speech and Dialogue | 2014

Audio-Video Speaker Diarization for Unsupervised Speaker and Face Model Creation

Pavel Campr; Marie Kunešová; Jan Vaněk; Jan Cech; Josef Psutka

Our goal is to create speaker models in the audio domain and face models in the video domain from a set of videos in an unsupervised manner. Such models can be used later for speaker identification in the audio domain (answering the question "Who was speaking and when") and/or for face recognition ("Who was seen and when") for given videos that contain speaking persons. The proposed system is based on an audio-video diarization system that tries to resolve the disadvantages of the individual modalities. Experiments on broadcasts of Czech parliament meetings show that the proposed combination of individual audio and video diarization systems yields an improvement of the diarization error rate (DER).

Text, Speech and Dialogue | 2010

Fast phonetic/lexical searching in the archives of the Czech holocaust testimonies: advancing towards the MALACH project visions

Josef Psutka; Jan Švec; Jan Vaněk; Aleš Pražák; Luboš Šmídl

In this paper we describe a system for fast phonetic/lexical searching in the large archives of the Czech Holocaust testimonies. The developed system is the first step towards fulfilling the MALACH project visions [1, 2], at least with regard to easier and faster access to the Czech part of the archives. More than one thousand hours of spontaneous, accented and highly emotional speech of Czech Holocaust survivors, stored at the USC Shoah Foundation Institute as video interviews, were automatically transcribed and phonetically/lexically indexed. Special attention was paid to the processing of colloquial words, which appear very frequently in spontaneous Czech speech. The final access to the archives is very fast, allowing detection of interview segments containing pronounced words, clusters of words present in pre-defined time intervals, and also words that were not included in the working vocabulary (OOV words).

Text, Speech and Dialogue | 2005

Introduction of improved UWB speaker verification system

Aleš Padrta; Jan Vaněk

This paper introduces improvements to the speaker verification system used at the Department of Cybernetics at the University of West Bohemia. The paper summarizes our current knowledge in the acoustic modeling domain, in the domain of model creation and in the domain of score normalization based on universal background models. The constituent components of the state-of-the-art verification system were modified or replaced on the basis of this current knowledge. A set of experiments was performed to evaluate and compare the performance of the improved verification system and the baseline verification system based on the HTK toolkit. The results show that the improved verification system outperforms the baseline system in both of the reviewed criteria: the equal error rate and the time consumption.
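For readers unfamiliar with the first criterion, the equal error rate (EER) is the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of one common way to estimate it from verification scores, assuming higher scores mean "more likely the claimed speaker". The function name and the score lists are hypothetical and do not come from the paper.

```python
from typing import Sequence


def equal_error_rate(target_scores: Sequence[float],
                     impostor_scores: Sequence[float]) -> float:
    """Estimate the EER by sweeping a threshold over all observed scores.

    A trial is accepted when its score is >= the threshold.
    Returns the mean of FAR and FRR at the threshold where they are closest.
    """
    if not target_scores or not impostor_scores:
        raise ValueError("both score lists must be non-empty")

    best_gap = None
    best_eer = None
    for threshold in sorted(set(target_scores) | set(impostor_scores)):
        # False acceptance rate: impostor trials that pass the threshold.
        far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        # False rejection rate: target trials that fail the threshold.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


# Hypothetical scores for illustration only.
targets = [0.9, 0.8, 0.7, 0.4]
impostors = [0.6, 0.3, 0.2, 0.1]
eer = equal_error_rate(targets, impostors)
```

With these made-up scores, one target trial falls below the best impostor score, so the estimate is 0.25. Production toolkits usually interpolate between thresholds; this sketch simply picks the closest observed point.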

International Conference on Statistical Language and Speech Processing | 2017

A Regularization Post Layer: An Additional Way How to Make Deep Neural Networks Robust

Jan Vaněk; Jan Zelinka; Daniel Soutner; Josef Psutka

Neural networks (NNs) are prone to overfitting, especially deep neural networks in cases where the training data are not abundant. There are several techniques that help prevent overfitting, e.g., L1/L2 regularization, unsupervised pre-training, early stopping, dropout, bootstrapping or aggregation of cross-validation models. In this paper, we propose a regularization post-layer that may be combined with these prior techniques and that brings additional robustness to the NN. We trained the regularization post-layer in the cross-validation (CV) aggregation scenario: we used the CV held-out folds to train an additional neural network post-layer that boosts the network's robustness. We tested various post-layer topologies and compared the results with other regularization techniques. As a benchmark task, we selected TIMIT phone recognition, a well-known and still popular task where the training data are limited and the regularization techniques used play a key role. However, the regularization post-layer is a general method, and it may be employed in any classification task.

International Conference on Speech and Computer | 2014

Convolutional Neural Network for Refinement of Speaker Adaptation Transformation

Zbyněk Zajíc; Jan Zelinka; Jan Vaněk; Luděk Müller

The aim of this work is to propose a refinement of the shift-MLLR (shift Maximum Likelihood Linear Regression) adaptation of an acoustic model for the case of a limited amount of adaptation data, which can lead to ill-conditioned transformation matrices. We try to suppress the influence of badly estimated transformation parameters using an Artificial Neural Network (ANN), specifically a Convolutional Neural Network (CNN) with a bottleneck layer at the end. The badly estimated shift-MLLR transformation is propagated through an ANN (suitably trained beforehand), and the output of the net is used as the new refined transformation. To train the ANN, the well-conditioned and badly conditioned shift-MLLR transformations are used as the outputs and inputs of the ANN, respectively.

Text, Speech and Dialogue | 2011

Speaker-clustered acoustic models evaluated on GPU for on-line subtitling of parliament meetings

Josef Psutka; Jan Vaněk

This paper describes the effort to build speaker-clustered acoustic models as a part of the real-time LVCSR system that has been used for more than a year by the Czech TV for automatic subtitling of parliament meetings broadcast on the channel CT24. Speaker-clustered acoustic models are more acoustically homogeneous and therefore give better recognition performance than a single gender-independent model or even gender-dependent models. Frequent changes of speakers and a direct connection of the LVCSR system to the audio channel require automatic switching/fusion of models as quickly as possible. An important part of the solution is real-time likelihood evaluation of all clustered acoustic models, taking advantage of a fast GPU (Graphics Processing Unit). The proposed method achieved a relative WER reduction of over 2.34% compared with the baseline gender-independent model, with more than 2M Gaussian mixtures evaluated in real time.

International Conference on Statistical Language and Speech Processing | 2018

A Comparison of Adaptation Techniques and Recurrent Neural Network Architectures

Jan Vaněk; Josef Michalek; Jan Zelinka; Josef Psutka

Recently, recurrent neural networks (RNNs) have become state-of-the-art in acoustic modeling for automatic speech recognition. Long short-term memory (LSTM) units are the most popular ones. However, alternative units such as the gated recurrent unit (GRU) and its modifications have outperformed LSTM in some publications. In this paper, we compared five neural network (NN) architectures with various adaptation and feature normalization techniques. We evaluated feature-space maximum likelihood linear regression, five variants of i-vector adaptation and two variants of cepstral mean normalization. Most adaptation and normalization techniques were developed for feed-forward NNs and, according to the results in this paper, not all of them also work with RNNs. For the experiments, we chose the well-known and available TIMIT phone recognition task. Phone recognition is much more sensitive to the quality of the acoustic model (AM) than a large-vocabulary task with a complex language model. We also published open-source scripts to make it easy to replicate the results and to help continue the development.
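As background for the normalization step named in this abstract, cepstral mean normalization in its simplest per-utterance form subtracts the mean of each cepstral coefficient, computed over all frames of the utterance. The sketch below shows only that basic form; the abstract does not say which two variants the paper evaluated, and the function name and sample values are assumptions made for illustration.

```python
from typing import List


def cepstral_mean_normalize(frames: List[List[float]]) -> List[List[float]]:
    """Per-utterance cepstral mean normalization.

    `frames` holds one utterance as a list of feature vectors (one per frame),
    all of the same length. Returns new vectors with the per-coefficient mean
    over the utterance subtracted.
    """
    if not frames:
        return []
    n_frames = len(frames)
    n_coeffs = len(frames[0])
    # Mean of each coefficient across all frames of the utterance.
    means = [sum(frame[i] for frame in frames) / n_frames for i in range(n_coeffs)]
    # Subtract the mean so every coefficient track has zero mean.
    return [[frame[i] - means[i] for i in range(n_coeffs)] for frame in frames]


# Hypothetical two-frame, two-coefficient utterance.
normalized = cepstral_mean_normalize([[1.0, 2.0], [3.0, 6.0]])
```

After normalization each coefficient averages to zero over the utterance, which removes constant channel offsets in the cepstral domain.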

International Conference on Bioinformatics and Biomedical Engineering | 2017

Bioimaging - Autothresholding and Segmentation via Neural Networks

Pavla Urbanová; Jan Vaněk; Pavel Souček; Dalibor Štys; Petr Císař; M. Železný

Bioimaging, image segmentation, thresholding, and multivariate processing are helpful tools in the analysis of image series from many time-lapse experiments. Different methods from mathematics, algorithm design and artificial intelligence can be modified, parametrized or adapted for a single-purpose case with a completely different biological background (namely microorganisms, tissue cultures, aquaculture). However, most of the tasks are based on initial image segmentation, performed before the feature extraction and comparison tasks are evaluated. In this article, we compare several classical approaches in bioinformatic and biophysical cases with the neural network approach. The concept of the neural network was adopted from biological neural networks. The networks need to be trained; however, after the learning phase, they should be able to find one solution for various objects. The methods are compared by their segmentation error relative to a human supervisor.
