
Publications


Featured research published by Jan Trmal.


International Conference on Acoustics, Speech, and Signal Processing | 2014

Improving deep neural network acoustic models using generalized maxout networks

Xiaohui Zhang; Jan Trmal; Daniel Povey; Sanjeev Khudanpur

Recently, maxout networks have brought significant improvements to various speech recognition and computer vision tasks. In this paper we introduce two new types of generalized maxout units, which we call p-norm and soft-maxout. We investigate their performance in Large Vocabulary Continuous Speech Recognition (LVCSR) tasks in various languages with 10 hours and 60 hours of data, and find that the p-norm generalization of maxout consistently performs well. Because, in our training setup, we sometimes see instability when training unbounded-output nonlinearities such as these, we also present a method to control that instability. This is the “normalization layer”, a nonlinearity that scales down all dimensions of its input in order to stop the average squared output from exceeding one. The performance of our proposed nonlinearities is compared with maxout, rectified linear units (ReLU), tanh units, and also with a discriminatively trained SGMM/HMM system, and our p-norm units with p equal to 2 are found to perform best.
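A minimal NumPy sketch of the two components described in this abstract, the p-norm unit and the normalization layer. The group size and the scale-only-when-the-bound-is-exceeded rule are my reading of the text, not the exact Kaldi components.

```python
import numpy as np

def pnorm(x, group_size=10, p=2.0):
    """Each output is the p-norm over a group of `group_size` inputs.
    x: (batch, dim) pre-nonlinearity activations, with dim % group_size == 0."""
    batch, dim = x.shape
    groups = x.reshape(batch, dim // group_size, group_size)
    return (np.abs(groups) ** p).sum(axis=2) ** (1.0 / p)

def normalization_layer(y, eps=1e-20):
    """Scale each row down so its average squared element does not exceed one,
    bounding the otherwise unbounded p-norm / maxout outputs."""
    mean_sq = (y ** 2).mean(axis=1, keepdims=True)
    scale = np.where(mean_sq > 1.0, 1.0 / np.sqrt(mean_sq + eps), 1.0)
    return y * scale
```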


International Conference on Acoustics, Speech, and Signal Processing | 2014

A pitch extraction algorithm tuned for automatic speech recognition

Pegah Ghahremani; Bagher BabaAli; Daniel Povey; Korbinian Riedhammer; Jan Trmal; Sanjeev Khudanpur

In this paper we present an algorithm that produces pitch and probability-of-voicing estimates for use as features in automatic speech recognition systems. These features give large performance improvements on tonal languages for ASR systems, and even substantial improvements for non-tonal languages. Our method, which we call the Kaldi pitch tracker (because we are adding it to the Kaldi ASR toolkit), is a highly modified version of the getf0 (RAPT) algorithm. Unlike the original getf0, we do not make a hard decision as to whether any given frame is voiced or unvoiced; instead, we assign a pitch even to unvoiced frames while constraining the pitch trajectory to be continuous. Our algorithm also produces a quantity that can be used as a probability-of-voicing measure; it is based on the normalized autocorrelation measure that our pitch extractor uses. We present results on data from various languages in the BABEL project, and show a large improvement over systems without tonal features and systems where pitch and POV information was obtained from SAcC or getf0.
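A minimal sketch of the normalized autocorrelation (NCCF) scoring that the abstract says underlies both the pitch estimate and the probability-of-voicing feature. Frame length, sample rate, and the lag search range are illustrative assumptions; the actual Kaldi pitch tracker adds cost-based smoothing of the pitch trajectory across frames, which is not shown here.

```python
import numpy as np

def nccf(frame, lag, window=300):
    """Normalized cross-correlation of a frame with itself shifted by `lag`.
    The frame must be at least window + lag samples long."""
    a = frame[:window]
    b = frame[lag:lag + window]
    denom = np.sqrt(np.dot(a, a) * np.dot(b, b)) + 1e-10
    return np.dot(a, b) / denom

def frame_pitch_candidates(frame, fs=8000, f_min=50.0, f_max=400.0):
    """Score every candidate lag; the best lag maps to a pitch in Hz, and its
    score is the kind of quantity that can serve as a probability-of-voicing
    measure."""
    lags = range(int(fs / f_max), int(fs / f_min) + 1)
    scores = {lag: nccf(frame, lag) for lag in lags}
    best = max(scores, key=scores.get)
    return fs / best, scores[best]
```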


IEEE Automatic Speech Recognition and Understanding Workshop | 2013

Using proxies for OOV keywords in the keyword search task

Guoguo Chen; Oguz Yilmaz; Jan Trmal; Daniel Povey; Sanjeev Khudanpur

We propose a simple but effective weighted finite state transducer (WFST) based framework for handling out-of-vocabulary (OOV) keywords in a speech search task. State-of-the-art large vocabulary continuous speech recognition (LVCSR) and keyword search (KWS) systems are developed for conversational telephone speech in Tagalog. Word-based and phone-based indexes are created from word lattices, the latter using the LVCSR system's pronunciation lexicon. Pronunciations of OOV keywords are hypothesized via a standard grapheme-to-phoneme method. In-vocabulary proxies (word or phone sequences) are generated for each OOV keyword using WFST techniques that permit incorporation of a phone confusion matrix. Empirical results when searching for the Babel/NIST evaluation keywords in the Babel 10-hour development-test speech collection show that (i) searching for word proxies in the word index significantly outperforms searching for phonetic representations of OOV words in a phone index, and (ii) while phone confusion information yields minor improvement when searching a phone index, it yields up to 40% improvement in actual term weighted value when searching a word index with word proxies.
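A hedged reconstruction of the cascade this abstract describes (the symbol names are assumptions, not quoted from the paper): with K an acceptor for the OOV keyword, L2 the grapheme-to-phoneme-hypothesized lexicon mapping it to phone sequences, E the phone confusion transducer, and L1 the LVCSR pronunciation lexicon (so L1⁻¹ maps phone sequences back to in-vocabulary words), the in-vocabulary proxies are obtained roughly as

K' = Project( ShortestPath( K ∘ L2 ∘ E ∘ L1⁻¹ ) )

that is, confusable phone sequences for the OOV keyword are mapped back through the inverted LVCSR lexicon to word sequences that can be looked up directly in the word index.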


Text, Speech and Dialogue | 2010

Adaptation of a feedforward artificial neural network using a linear transform

Jan Trmal; Jan Zelinka; Luděk Müller

In this paper we present a novel method for adaptation of a multi-layer perceptron neural network (MLP ANN). Nowadays, adaptation of an ANN is usually done as incremental retraining of either a subset or the complete set of the ANN parameters. However, since the amount of adaptation data is sometimes quite small, such an approach has a fundamental drawback: during retraining, the network parameters can easily be overfitted to the new data. There are certainly techniques that can help overcome this problem (early stopping, cross-validation); however, applying them leads to a more complex and possibly more data-hungry training procedure. The proposed method approaches the problem from a different perspective. We use the fact that in many cases we have additional knowledge about the problem, and such knowledge can be used to limit the dimensionality of the adaptation problem. We applied the proposed method to speaker adaptation of a phoneme recognizer based on TRAPS (Temporal Patterns) parameters. We exploited the fact that the employed TRAPS parameters are constructed from the log-outputs of a mel filter bank; by reformulating the first-layer weight matrix adaptation problem as a mel filter bank output adaptation problem, we were able to significantly limit the number of free variables. Adaptation using the proposed method resulted in a substantial improvement of phoneme recognizer accuracy.
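A rough NumPy sketch of the general idea: keep the trained network fixed and learn only a small linear transform of its inputs on the adaptation data. The single-hidden-layer network, the shapes, and the plain gradient updates are illustrative assumptions; the paper's TRAPS-specific reformulation in the mel filter bank domain is not reproduced here.

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """Fixed, pre-trained network: one sigmoid hidden layer plus softmax output."""
    h = 1.0 / (1.0 + np.exp(-(x @ W1 + b1)))
    z = h @ W2 + b2
    z -= z.max(axis=1, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def adapt_input_transform(X, Y, params, n_in, lr=0.1, steps=200):
    """Learn A, c so the frozen network sees A @ x + c instead of x.

    X: (N, n_in) adaptation features, Y: (N, n_classes) one-hot targets.
    Only A (n_in x n_in) and c (n_in,) are trained: far fewer free parameters
    than retraining the first-layer weight matrix itself.
    """
    A, c = np.eye(n_in), np.zeros(n_in)
    W1, b1, W2, b2 = params
    for _ in range(steps):
        Xt = X @ A.T + c
        P = mlp_forward(Xt, W1, b1, W2, b2)
        # Cross-entropy gradient w.r.t. the transformed input, chained back onto A and c.
        dz = (P - Y) / len(X)
        h = 1.0 / (1.0 + np.exp(-(Xt @ W1 + b1)))
        dh = dz @ W2.T * h * (1 - h)
        dXt = dh @ W1.T
        A -= lr * dXt.T @ X
        c -= lr * dXt.sum(axis=0)
    return A, c
```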


International Conference on Acoustics, Speech, and Signal Processing | 2013

Quantifying the value of pronunciation lexicons for keyword search in low-resource languages

Guoguo Chen; Sanjeev Khudanpur; Daniel Povey; Jan Trmal; David Yarowsky; Oguz Yilmaz

This paper quantifies the value of pronunciation lexicons in large vocabulary continuous speech recognition (LVCSR) systems that support keyword search (KWS) in low-resource languages. State-of-the-art LVCSR and KWS systems are developed for conversational telephone speech in Tagalog, and the baseline lexicon is augmented via three different grapheme-to-phoneme models that yield increasing coverage of a large Tagalog word-list. It is demonstrated that while the increased lexical coverage - or reduced out-of-vocabulary (OOV) rate - leads to only modest (ca. 1%-4%) improvements in word error rate, the concomitant improvements in actual term weighted value are as much as 60%. It is also shown that incorporating the augmented lexicons into the LVCSR system before indexing speech is superior to using them post facto, e.g., for approximate phonetic matching of OOV keywords in pre-indexed lattices. These results underscore the disproportionate importance of automatic lexicon augmentation for KWS in morphologically rich languages, and advocate for using them early, in the LVCSR stage.


Spoken Language Technology Workshop | 2014

A keyword search system using open source software

Jan Trmal; Guoguo Chen; Daniel Povey; Sanjeev Khudanpur; Pegah Ghahremani; Xiaohui Zhang; Vimal Manohar; Chunxi Liu; Aren Jansen; Dietrich Klakow; David Yarowsky; Florian Metze

We provide an overview of a speech-to-text (STT) and keyword search (KWS) system architecture built primarily on top of the Kaldi toolkit and expand on a few highlights. The system was developed as part of the research efforts of the Radical team while participating in the IARPA Babel program. Our aim was to develop a general system pipeline which could be easily and rapidly deployed in any language, independently of the language script and of the phonological and linguistic features of the language.


IEEE Transactions on Audio, Speech, and Language Processing | 2012

Optimized Acoustic Likelihoods Computation for NVIDIA and ATI/AMD Graphics Processors

Jan Vanek; Jan Trmal; Josef Psutka

In this paper, we describe an optimized version of a Gaussian-mixture-based acoustic model likelihood evaluation algorithm for graphics processing units (GPUs). The evaluation of these likelihoods is one of the most computationally intensive parts of automatic speech recognizers, but it can be parallelized and offloaded to GPU devices. Our approach offers a significant speed-up over recently published approaches because it utilizes the GPU architecture in a more effective manner. All recent implementations have targeted only NVIDIA graphics processors, programmed in either the CUDA or OpenCL GPU programming framework. We present results for both CUDA and OpenCL. Further, we have developed an OpenCL implementation optimized for ATI/AMD GPUs. Results suggest that even very large acoustic models can be used in real-time speech recognition engines on computers or laptops equipped with a low-end GPU. In addition, the completely asynchronous GPU management provides additional CPU resources for the decoder part of the LVCSR system. The optimized implementation enables us to apply fusion techniques while evaluating many (10 or even more) speaker-specific acoustic models. We apply this technique to a real-time parliamentary speech recognition system where the speaker changes frequently.
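A hedged NumPy sketch of the workload being offloaded to the GPU: batched log-likelihoods of many frames against many diagonal-covariance Gaussians, expanded so that the bulk of the work becomes dense matrix products of the kind that map well onto CUDA/OpenCL. Names and shapes are illustrative, not the paper's kernels.

```python
import numpy as np

def diag_gmm_loglikes(X, means, inv_vars, log_weights):
    """X: (T, D) frames; means, inv_vars: (M, D); log_weights: (M,).
    Returns (T, M) per-Gaussian log-likelihoods; a recognizer would then
    log-sum-exp over the Gaussians belonging to each HMM state."""
    # Expand -0.5 * sum_d (x_d - mu_d)^2 / var_d so the frame-dependent parts
    # become two matrix multiplications.
    const = log_weights - 0.5 * (
        means.shape[1] * np.log(2 * np.pi)
        - np.log(inv_vars).sum(axis=1)
        + (means ** 2 * inv_vars).sum(axis=1)
    )
    lin = X @ (means * inv_vars).T        # (T, M) cross terms
    quad = -0.5 * (X ** 2) @ inv_vars.T   # (T, M) quadratic terms
    return const + lin + quad
```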


Methods of Information in Medicine | 2009

Voice-supported Electronic Health Record for Temporomandibular Joint Disorders

Radek Hippmann; Tatjana Dostalova; Jana Zvárová; Miroslav Nagy; Michaela Seydlova; Petr Hanzlícek; Pavel Kriz; Luboš Šmídl; Jan Trmal

OBJECTIVES To identify support for structured data entry in an electronic health record application for temporomandibular joint disorders. METHODS The methods of structuring information in dentistry are described and the interactive DentCross component is introduced. Structured, voice-supported data entry into the electronic health record is demonstrated on several real cases in the field of dentistry. The connection of this component to the MUDRLite electronic health record is described. RESULTS The use of DentVoice, an application consisting of the electronic health record MUDRLite and the voice-controlled interactive component DentCross, to collect the dental information required for temporomandibular joint disorders is shown. CONCLUSIONS The DentVoice application with the DentCross component demonstrated its practical ability to support temporomandibular joint disorder treatment.


International Symposium on Signal Processing and Information Technology | 2012

Full covariance Gaussian mixture models evaluation on GPU

Jan Vanek; Jan Trmal; Josef Psutka

Gaussian mixture models (GMMs) are often used in various data processing and classification tasks to model a continuous probability density in a multi-dimensional space. In cases where the dimension of the feature space is relatively high (e.g., in automatic speech recognition (ASR)), a GMM with a larger number of Gaussians with diagonal covariances (DC) is used instead of full covariances (FC), for two reasons. The first is the difficulty of estimating robust FC matrices from a limited training data set. The second is the much higher computational cost of GMM evaluation. The first reason has been addressed in many recent publications. In contrast, this paper addresses the second by describing an efficient implementation of FC-GMM evaluation on a graphics processing unit (GPU). The performance was tested on acoustic models for ASR, and it is shown that even a low-end laptop GPU is capable of evaluating a large acoustic model in a fraction of real time. Three variants of the algorithm were implemented and compared on various GPUs: NVIDIA CUDA, NVIDIA OpenCL, and ATI/AMD OpenCL.
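A hedged NumPy sketch of why the full-covariance case is heavier than the diagonal one: each Gaussian requires a full quadratic form, computed here via a Cholesky factor of its covariance. Shapes and names are illustrative, not the paper's GPU kernels.

```python
import numpy as np

def full_cov_gmm_loglikes(X, means, covs, log_weights):
    """X: (T, D); means: (M, D); covs: (M, D, D); log_weights: (M,).
    Returns (T, M) per-Gaussian log-likelihoods."""
    T, D = X.shape
    M = means.shape[0]
    out = np.empty((T, M))
    for m in range(M):
        L = np.linalg.cholesky(covs[m])   # covs[m] = L @ L.T
        diff = X - means[m]               # (T, D)
        # Solve L y = diff^T; the quadratic form is then ||y||^2 per frame.
        # (np.linalg.solve is a general solve; a tuned version would exploit
        # the triangular structure of L.)
        y = np.linalg.solve(L, diff.T)    # (D, T)
        quad = (y ** 2).sum(axis=0)       # (T,)
        logdet = 2.0 * np.log(np.diag(L)).sum()
        out[:, m] = log_weights[m] - 0.5 * (D * np.log(2 * np.pi) + logdet + quad)
    return out
```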


Text, Speech and Dialogue | 2012

Captioning of Live TV Programs through Speech Recognition and Re-speaking

Aleš Pražák; Zdeněk Loose; Jan Trmal; Josef Psutka

In this paper we introduce our complete solution for captioning of live TV programs, used by Czech Television, the public service broadcaster in the Czech Republic. Live captioning using speech recognition and re-speaking is on the increase and is widely used, for example at the BBC; however, many specific issues have to be solved each time a new captioning system is put into operation. Our concept of re-speaking assumes a comprehensive integration of the re-speaker’s skills, not just verbatim repetition with fully automatic processing. This paper describes the recognition system design with advanced re-speaker interaction, the distributed captioning system architecture, and the often neglected topic of re-speaker training. An evaluation of our skilled re-speakers is also presented.

Collaboration


Dive into Jan Trmal's collaborations.

Top Co-Authors

Jan Zelinka, University of West Bohemia
Josef Psutka, University of West Bohemia
Luděk Müller, University of West Bohemia
Daniel Povey, Johns Hopkins University
Chunxi Liu, Johns Hopkins University
Luboš Šmídl, University of West Bohemia
Vimal Manohar, Johns Hopkins University
Guoguo Chen, Johns Hopkins University