Network


Latest external collaboration at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Josef Psutka is active.

Publication


Featured research published by Josef Psutka.


EURASIP Journal on Audio, Speech, and Music Processing | 2011

System for fast lexical and phonetic spoken term detection in a Czech cultural heritage archive

Josef Psutka; Jan Švec; Jan Vaněk; Aleš Pražák; Luboš Šmídl; Pavel Ircing

The main objective of the work presented in this paper was to develop a complete system that would accomplish the original visions of the MALACH project. Those goals were to employ automatic speech recognition and information retrieval techniques to provide improved access to the large video archive containing recorded testimonies of Holocaust survivors. The system has so far been developed for the Czech part of the archive only. It takes advantage of a state-of-the-art speech recognition system tailored to the challenging properties of the recordings in the archive (elderly speakers, spontaneous speech, and emotionally loaded content) and of its close coupling with the actual search engine. The design of the algorithm, which adopts the spoken term detection approach, is focused on the speed of retrieval. The resulting system is able to search through the 1,000 hours of video constituting the Czech portion of the archive and find occurrences of a query word in a matter of seconds. The phonetic search implemented alongside the search based on lexicon words makes it possible to find even words outside the ASR system lexicon, such as names, geographic locations, or Jewish slang.
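To make the two-stage lookup described above concrete, the following minimal Python sketch illustrates the general idea of combining a word-level inverted index with a phonetic fallback for out-of-vocabulary queries. All data structures, names, and the `naive_g2p` placeholder are hypothetical illustrations, not the actual MALACH system code.

```python
# Minimal sketch of a two-stage (lexical + phonetic) spoken term detection lookup.
# Hypothetical data structures; not the actual Czech archive implementation.
from collections import defaultdict

# Word-level inverted index: word -> list of (segment_id, start_time_seconds)
lexical_index = defaultdict(list)
# Phonetic transcripts: segment_id -> phoneme string produced by the recognizer
phonetic_index = {}

def index_segment(segment_id, words_with_times, phoneme_string):
    """Add one recognized segment to both indexes."""
    for word, start_time in words_with_times:
        lexical_index[word].append((segment_id, start_time))
    phonetic_index[segment_id] = phoneme_string

def naive_g2p(word):
    """Placeholder grapheme-to-phoneme conversion (one phoneme per letter);
    a real system would use a pronunciation lexicon or a trained G2P model."""
    return " ".join(word)

def search(query):
    """Return hits from the word index; fall back to a phonetic substring search
    when the query is not in the recognizer's lexicon (e.g. names or slang)."""
    if query in lexical_index:
        return [("lexical", hit) for hit in lexical_index[query]]
    phones = naive_g2p(query)
    return [("phonetic", seg) for seg, transcript in phonetic_index.items()
            if phones in transcript]

# Tiny usage example
index_segment("intv042_part3", [("praha", 12.4), ("rodina", 15.1)], "p r a h a r o d i n a")
print(search("praha"))    # lexical hit
print(search("rodin"))    # OOV-style query resolved by the phonetic substring match
```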


IEEE Transactions on Audio, Speech, and Language Processing | 2012

Optimized Acoustic Likelihoods Computation for NVIDIA and ATI/AMD Graphics Processors

Jan Vanek; Jan Trmal; Josef Psutka

In this paper, we describe an optimized version of a Gaussian-mixture-based acoustic model likelihood evaluation algorithm for graphics processing units (GPUs). The evaluation of these likelihoods is one of the most computationally intensive parts of automatic speech recognizers, but it can be parallelized and offloaded to GPU devices. Our approach offers a significant speed-up over recently published approaches because it utilizes the GPU architecture in a more effective manner. All recent implementations have been intended only for NVIDIA graphics processors, programmed in either the CUDA or the OpenCL GPU programming framework. We present results for both CUDA and OpenCL. Further, we have developed an OpenCL implementation optimized for ATI/AMD GPUs. The results suggest that even very large acoustic models can be used in real-time speech recognition engines on desktop computers or laptops equipped with a low-end GPU. In addition, the completely asynchronous GPU management provides additional CPU resources for the decoder part of the LVCSR system. The optimized implementation enables us to apply fusion techniques together with the evaluation of many (10 or even more) speaker-specific acoustic models. We apply this technique to a real-time parliamentary speech recognition system where the speaker changes frequently.
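As an illustration of the computation being offloaded to the GPU, here is a small NumPy reference sketch of diagonal-covariance GMM log-likelihood evaluation for a batch of feature frames. It reproduces only the arithmetic under assumed shapes; it is not the paper's CUDA/OpenCL kernels.

```python
# CPU reference for the computation offloaded to the GPU in the paper:
# log-likelihoods of a batch of feature frames under a diagonal-covariance GMM.
import numpy as np

def gmm_loglik_diag(frames, weights, means, variances):
    """frames: (T, D), weights: (M,), means/variances: (M, D).
    Returns the per-frame total log-likelihoods, shape (T,)."""
    T, D = frames.shape
    # Precomputable per-Gaussian constant: log w_m - 0.5 * (D log 2pi + sum log var)
    const = np.log(weights) - 0.5 * (D * np.log(2 * np.pi)
                                     + np.sum(np.log(variances), axis=1))   # (M,)
    # Squared Mahalanobis distance with diagonal covariance, all frames vs all Gaussians
    diff = frames[:, None, :] - means[None, :, :]                # (T, M, D)
    dist = np.sum(diff * diff / variances[None, :, :], axis=2)   # (T, M)
    log_comp = const[None, :] - 0.5 * dist                       # (T, M)
    # Log-sum-exp over the mixture components
    m = log_comp.max(axis=1, keepdims=True)
    return (m + np.log(np.sum(np.exp(log_comp - m), axis=1, keepdims=True))).ravel()

# Example: 100 random 39-dimensional frames under a 64-component model
rng = np.random.default_rng(0)
T, M, D = 100, 64, 39
print(gmm_loglik_diag(rng.normal(size=(T, D)), np.full(M, 1.0 / M),
                      rng.normal(size=(M, D)), np.ones((M, D))).shape)
```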


IEEE Transactions on Audio, Speech, and Language Processing | 2009

Using Morphological Information for Robust Language Modeling in Czech ASR System

Pavel Ircing; Josef Psutka

Automatic speech recognition, or more precisely language modeling, of the Czech language has to face challenges that are not present in the language modeling of English. These include mainly rapid vocabulary growth and the closely connected unreliable estimates of language model parameters. These phenomena are caused mostly by the highly inflectional nature of the Czech language. On the other hand, the rich morphology, together with well-developed automatic systems for morphological tagging, can be exploited to reinforce the language model probability estimates. This paper shows that using rich morphological tags within the concept of a class-based n-gram language model with a many-to-many word-to-class mapping, and combining this model with a standard word-based n-gram, can improve recognition accuracy over the word-based baseline on the task of automatic transcription of unconstrained spontaneous Czech interviews.
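The combination described above can be illustrated with a toy sketch of linear interpolation between a word-based bigram and a class-based bigram built over morphological tags. The probabilities and tags below are made up, and the paper's many-to-many word-to-class mapping is simplified here to a single tag per word.

```python
# Illustrative sketch of combining a word-based bigram with a class-based bigram
# built over morphological tags, via linear interpolation. Toy probabilities only.

def class_bigram_prob(w_prev, w, tag_of, p_tag_bigram, p_word_given_tag):
    """P_class(w | w_prev) = P(tag(w) | tag(w_prev)) * P(w | tag(w))."""
    return p_tag_bigram[(tag_of[w_prev], tag_of[w])] * p_word_given_tag[(w, tag_of[w])]

def interpolated_prob(w_prev, w, lam, p_word_bigram, **class_model):
    """Linear interpolation of the word-based and class-based bigram estimates."""
    return (lam * p_word_bigram.get((w_prev, w), 0.0)
            + (1.0 - lam) * class_bigram_prob(w_prev, w, **class_model))

# Hypothetical toy example (Czech words with made-up morphological tags)
tag_of = {"velké": "ADJ-NEU", "město": "N-NEU"}
p_tag_bigram = {("ADJ-NEU", "N-NEU"): 0.20}
p_word_given_tag = {("město", "N-NEU"): 0.05}
p_word_bigram = {}   # the word pair was unseen in the training data
print(interpolated_prob("velké", "město", lam=0.6,
                        p_word_bigram=p_word_bigram, tag_of=tag_of,
                        p_tag_bigram=p_tag_bigram, p_word_given_tag=p_word_given_tag))
```

Even with an unseen word pair, the class-based component still assigns the bigram a nonzero probability, which is the effect the paper exploits to reinforce the estimates.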


Text, Speech and Dialogue | 2002

Automatic Transcription of Czech Language Oral History in the MALACH Project: Resources and Initial Experiments

Josef Psutka; Pavel Ircing; Vlasta Radová; William Byrne; Jan Hajic; Samuel Gustman; Bhuvana Ramabhadran

In this paper we describe the initial stages of the ASR component of the MALACH (Multilingual Access to Large Spoken Archives) project. This project will attempt to provide improved access to the large multilingual spoken archives collected by the Survivors of the Shoah Visual History Foundation (VHF) by advancing the state of the art in automated speech recognition. In order to train the ASR system, it is necessary to manually transcribe a large amount of speech data, identify the appropriate vocabulary, and obtain relevant text for language modeling. We give a detailed description of the speech annotation process, show the specific properties of the spontaneous speech contained in the archives, and present baseline speech recognition results.


International Symposium on Signal Processing and Information Technology | 2012

Full covariance Gaussian mixture models evaluation on GPU

Jan Vanek; Jan Trmal; Josef Psutka

Gaussian mixture models (GMMs) are often used in various data processing and classification tasks to model a continuous probability density in a multi-dimensional space. In cases where the dimension of the feature space is relatively high (e.g., in automatic speech recognition (ASR)), a GMM with a higher number of Gaussians with diagonal covariances (DC) is used instead of full covariances (FC) for two reasons. The first reason is the difficulty of estimating robust FC matrices from a limited training data set. The second reason is the much higher computational cost of the GMM evaluation. The first issue has been addressed in many recent publications. In contrast, this paper describes an efficient implementation of FC-GMM evaluation on a Graphics Processing Unit (GPU), which addresses the second issue. The performance was tested on acoustic models for ASR, and it is shown that even a low-end laptop GPU is capable of evaluating a large acoustic model in a fraction of real speech time. Three variants of the algorithm were implemented and compared on various GPUs: NVIDIA CUDA, NVIDIA OpenCL, and ATI/AMD OpenCL.
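To make the extra cost of full covariances concrete, the sketch below evaluates the log-density of feature frames under a single full-covariance Gaussian using a precomputed Cholesky factor. It is a NumPy reference for the arithmetic only, not the GPU kernels compared in the paper.

```python
# Reference for full-covariance Gaussian evaluation: the per-frame work grows from
# O(D) (diagonal) to O(D^2) once the Cholesky factor has been precomputed.
import numpy as np

def precompute(cov):
    """Cholesky factor and log-determinant of one covariance matrix (D, D)."""
    L = np.linalg.cholesky(cov)
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return L, log_det

def fc_gaussian_loglik(frames, mean, L, log_det):
    """Log-density of frames (T, D) under N(mean, L @ L.T)."""
    D = frames.shape[1]
    # Solve L y = (x - mean)^T; then y^T y is the squared Mahalanobis distance
    y = np.linalg.solve(L, (frames - mean).T)          # (D, T)
    maha = np.sum(y * y, axis=0)                       # (T,)
    return -0.5 * (D * np.log(2 * np.pi) + log_det + maha)

# Example in a 39-dimensional feature space
rng = np.random.default_rng(1)
D = 39
A = rng.normal(size=(D, D))
cov = A @ A.T + D * np.eye(D)      # a well-conditioned SPD matrix for the demo
L, log_det = precompute(cov)
print(fc_gaussian_loglik(rng.normal(size=(10, D)), np.zeros(D), L, log_det).shape)
```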


Text, Speech and Dialogue | 2010

Fast phonetic/lexical searching in the archives of the Czech holocaust testimonies: advancing towards the MALACH project visions

Josef Psutka; Jan Švec; Jan Vaněk; Aleš Pražák; Luboš Šmídl

In this paper we describe a system for fast phonetic/lexical searching in the large archives of Czech Holocaust testimonies. The developed system is the first step towards fulfilling the MALACH project visions [1, 2], at least with respect to easier and faster access to the Czech part of the archives. More than one thousand hours of spontaneous, accented and highly emotional speech of Czech Holocaust survivors, stored at the USC Shoah Foundation Institute as video interviews, were automatically transcribed and phonetically/lexically indexed. Special attention was paid to the processing of colloquial words, which appear very frequently in spontaneous Czech speech. The final access to the archives is very fast and allows the detection of segments of interviews containing pronounced words, clusters of words occurring within pre-defined time intervals, and also words that were not included in the working vocabulary (OOV words).
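One of the capabilities listed above, finding clusters of query words within a pre-defined time interval, can be illustrated with a small sketch over hypothetical (word, time) hit lists. This is an illustration only, not the actual index built over the testimony archive.

```python
# Illustrative sketch: find places where several query words all occur within a
# pre-defined time window. Hypothetical hit lists, not the archive's real index.

def word_clusters(hits_per_word, window_seconds):
    """hits_per_word: dict word -> list of occurrence times (seconds) within one
    interview. Returns start times of windows that contain all query words."""
    all_hits = sorted((t, w) for w, times in hits_per_word.items() for t in times)
    starts = []
    for i, (t0, _) in enumerate(all_hits):
        seen = set()
        for t, w in all_hits[i:]:
            if t - t0 > window_seconds:
                break
            seen.add(w)
        if seen == set(hits_per_word):
            starts.append(t0)
    return starts

# Example: both words occur close together around 120 s, but not later on
hits = {"terezín": [118.2, 300.0], "transport": [121.5, 410.7]}
print(word_clusters(hits, window_seconds=10.0))   # -> [118.2]
```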


International Conference on Speech and Computer | 2013

Covariance Matrix Enhancement Approach to Train Robust Gaussian Mixture Models of Speech Data

Jan Vaněk; Lukáš Machlica; Josef Psutka

An estimation of the parameters of a multivariate Gaussian Mixture Model is usually based on a criterion (e.g., Maximum Likelihood) that is focused mostly on the training data. Therefore, testing data, which were not seen during the training procedure, may cause problems. Moreover, numerical instabilities can occur, e.g., for low-occupancy Gaussians, especially when working with full-covariance matrices in high-dimensional spaces. Another question concerns the number of Gaussians to be trained for a specific data set. The approach proposed in this paper can handle all of these issues. It is based on the assumption that the training and testing data were generated from the same source distribution. The key part of the approach is to use a criterion based on the source distribution rather than on the training data itself. It is shown how to modify the estimation procedure in order to fit the source distribution better despite the fact that it is unknown, and subsequently a new estimation algorithm for diagonal- as well as full-covariance matrices is derived and tested.
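The paper's specific enhancement criterion is not reproduced here. As a generic illustration of keeping low-occupancy full-covariance estimates usable, the sketch below applies a standard shrinkage of the sample covariance toward its diagonal, which is a common regularization technique rather than the authors' method.

```python
# Generic illustration only: shrink a sample covariance matrix toward its diagonal,
# a standard way to keep low-occupancy full-covariance Gaussians well conditioned.
# This is NOT the enhancement criterion proposed in the paper.
import numpy as np

def shrink_covariance(sample_cov, occupancy, tau=100.0):
    """Blend the sample covariance with its diagonal; the blend weight moves
    toward the (more robust) diagonal when the Gaussian's occupancy is low."""
    alpha = occupancy / (occupancy + tau)            # data-driven weight in (0, 1)
    diag = np.diag(np.diag(sample_cov))
    return alpha * sample_cov + (1.0 - alpha) * diag

# Example: a 39x39 covariance estimated from only 50 frames stays usable
rng = np.random.default_rng(2)
D, N = 39, 50
X = rng.normal(size=(N, D))
raw = np.cov(X, rowvar=False)                        # poorly conditioned for N close to D
reg = shrink_covariance(raw, occupancy=N)
print(np.linalg.cond(raw), np.linalg.cond(reg))      # the regularized matrix is better conditioned
```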


Text, Speech and Dialogue | 2012

Captioning of Live TV Programs through Speech Recognition and Re-speaking

Aleš Pražák; Zdeněk Loose; Jan Trmal; Josef Psutka

In this paper we introduce our complete solution for the captioning of live TV programs used by Czech Television, the public service broadcaster in the Czech Republic. Live captioning using speech recognition and re-speaking is on the increase and is widely used, for example, by the BBC; however, many specific issues have to be solved each time a new captioning system is put into operation. Our concept of re-speaking assumes a complex integration of the re-speaker's skills, not only verbatim repetition with fully automatic processing. This paper describes the recognition system design with advanced re-speaker interaction, the distributed captioning system architecture, and the often neglected re-speaker training. An evaluation of our skilled re-speakers is presented as well.
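As a generic illustration of the final step of any captioning chain, and not of the Czech Television pipeline itself, the sketch below formats timed recognized segments as caption cues in the common SubRip (SRT) format. The segment texts and timings are invented for the example.

```python
# Generic illustration only: turn timed recognized segments into caption cues,
# written here in the widely used SubRip (SRT) format for clarity.

def to_srt_timestamp(seconds):
    """Convert seconds to the HH:MM:SS,mmm timestamp format used by SRT."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments):
    """segments: list of (start_seconds, end_seconds, text) from the recognizer."""
    cues = []
    for i, (start, end, text) in enumerate(segments, start=1):
        cues.append(f"{i}\n{to_srt_timestamp(start)} --> {to_srt_timestamp(end)}\n{text}\n")
    return "\n".join(cues)

# Example with two hypothetical re-spoken segments
print(segments_to_srt([(0.0, 2.4, "Dobrý večer, vítejte u zpráv."),
                       (2.6, 5.1, "Začínáme hlavní zprávou dne.")]))
```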


Text, Speech and Dialogue | 2003

Building LVCSR System for Transcription of Spontaneously Pronounced Russian Testimonies in the MALACH Project: Initial Steps and First Results

Josef Psutka; Ilja Iljuchin; Pavel Ircing; Václav Trejbal; William Byrne; Jan Hajic; Samuel Gustman

The MALACH project [1] uses the world's largest digital archives of video oral histories, collected by the Survivors of the Shoah Visual History Foundation (VHF), and attempts to provide access to these archives by advancing the state of the art in Automated Speech Recognition (ASR) and Information Retrieval (IR). This paper discusses the initial steps and the first results in building a large vocabulary continuous speech recognition (LVCSR) system for the transcription of Russian witnesses. Russian, as the third language processed in the MALACH project (after English [2] and Czech [3]), brought new problems, especially in the phonetic area. Although most of the Russian testimonies were provided by native Russian survivors, we encountered many different accents in their speech, caused by the territories where the survivors live.


Computer Analysis of Images and Patterns | 2015

Sample Size for Maximum Likelihood Estimates of Gaussian Model

Josef Psutka

Significant properties of the maximum likelihood (ML) estimate are consistency, normality, and efficiency. However, it has been proven that these properties hold only as the sample size approaches infinity. Many researchers warn that the behavior of the ML estimator with a small sample size is largely unknown. Yet in real tasks we usually do not have enough data to fully satisfy the conditions of an optimal ML estimate. The question discussed in this article is how much data we need in order to estimate a Gaussian model that provides sufficiently accurate likelihood estimates. This issue is addressed with respect to the dimension of the space, and the possibility of ill-conditioned data is taken into account.
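The question raised above can also be explored empirically. The following small Monte Carlo sketch, which is not the paper's analysis, measures how the held-out log-likelihood of an ML-estimated Gaussian approaches that of the true model as the training sample size grows; a feature dimension of 39, typical for ASR front-ends, is assumed here.

```python
# Monte Carlo sketch (not the paper's analysis): gap between the held-out
# log-likelihood under the true Gaussian and under its ML estimate, as a
# function of the training sample size, for a fixed dimension.
import numpy as np

def gauss_logpdf(x, mean, cov):
    """Log-density of the rows of x under N(mean, cov), NumPy only."""
    d = x.shape[1]
    diff = x - mean
    _, log_det = np.linalg.slogdet(cov)
    maha = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    return -0.5 * (d * np.log(2 * np.pi) + log_det + maha)

def heldout_gap(dim, n_train, n_test=2000, seed=0):
    rng = np.random.default_rng(seed)
    train = rng.normal(size=(n_train, dim))      # data drawn from the true model N(0, I)
    test = rng.normal(size=(n_test, dim))
    mu_hat, cov_hat = train.mean(axis=0), np.cov(train, rowvar=False)
    ll_true = gauss_logpdf(test, np.zeros(dim), np.eye(dim)).mean()
    ll_est = gauss_logpdf(test, mu_hat, cov_hat).mean()
    return ll_true - ll_est                      # shrinks toward 0 as n_train grows

for n in (60, 200, 1000, 10000):
    print(n, round(heldout_gap(dim=39, n_train=n), 3))
```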

Collaboration


Dive into Josef Psutka's collaborations.

Top Co-Authors

Pavel Ircing (University of West Bohemia)
Aleš Pražák (University of West Bohemia)
Jan Trmal (University of West Bohemia)
Vlasta Radová (University of West Bohemia)
Jan Vaněk (University of West Bohemia)
Samuel Gustman (Charles University in Prague)
William Byrne (University of West Bohemia)