Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Pavel Golik is active.

Publication


Featured research published by Pavel Golik.


IEEE Automatic Speech Recognition and Understanding Workshop | 2015

Multilingual representations for low resource speech recognition and keyword search

Jia Cui; Brian Kingsbury; Bhuvana Ramabhadran; Abhinav Sethy; Kartik Audhkhasi; Xiaodong Cui; Ellen Kislal; Lidia Mangu; Markus Nussbaum-Thom; Michael Picheny; Zoltán Tüske; Pavel Golik; Ralf Schlüter; Hermann Ney; Mark J. F. Gales; Kate Knill; Anton Ragni; Haipeng Wang; Phil Woodland

This paper examines the impact of multilingual (ML) acoustic representations on Automatic Speech Recognition (ASR) and keyword search (KWS) for low-resource languages in the context of the OpenKWS15 evaluation of the IARPA Babel program. The task is to develop Swahili ASR and KWS systems within two weeks using as little as 3 hours of transcribed data. Multilingual acoustic representations proved to be crucial for building these systems under strict time constraints. The paper discusses several key insights on how these representations are derived and used. First, we present a data sampling strategy that can speed up the training of multilingual representations without appreciable loss in ASR performance. Second, we show that fusion of diverse multilingual representations developed at different LORELEI sites yields substantial ASR and KWS gains. Speaker adaptation and data augmentation of these representations improve both ASR and KWS performance (up to 8.7% relative). Third, incorporating un-transcribed data through semi-supervised learning improves WER and KWS performance. Finally, we show that these multilingual representations significantly improve ASR and KWS performance (relative 9% for WER and 5% for MTWV) even when forty hours of transcribed audio in the target language is available. Multilingual representations significantly contributed to the LORELEI KWS systems winning the OpenKWS15 evaluation.
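The multilingual bottleneck idea behind such representations can be pictured as a network whose hidden stack is shared across all training languages, with one output layer per language; only the shared part is reused as a feature extractor for a new target language. A minimal illustrative sketch (all layer sizes, language names, and helper functions are made up here, not the paper's actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Shared layers learned jointly across all training languages (sizes illustrative).
W_shared = rng.standard_normal((40, 512)) * 0.01   # acoustic features -> hidden
W_bn = rng.standard_normal((512, 42)) * 0.01       # hidden -> 42-dim bottleneck

# One softmax output layer per language; only the shared stack is language-independent.
W_out = {
    "swahili": rng.standard_normal((42, 1000)) * 0.01,
    "tagalog": rng.standard_normal((42, 1500)) * 0.01,
}

def bottleneck_features(x):
    """Language-independent representation used as input features for the target ASR."""
    return relu(x @ W_shared) @ W_bn

def posteriors(x, language):
    return softmax(bottleneck_features(x) @ W_out[language])

x = rng.standard_normal((5, 40))          # 5 frames of 40-dim acoustic features
feats = bottleneck_features(x)            # multilingual features for the target system
print(feats.shape)                        # (5, 42)
print(posteriors(x, "swahili").shape)     # (5, 1000)
```

Reusing only the shared stack is what makes the representation transferable: the language-specific output layers are discarded when a new low-resource language is addressed.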


International Conference on Acoustics, Speech, and Signal Processing | 2014

RASR/NN: The RWTH neural network toolkit for speech recognition

Simon Wiesler; Alexander Richard; Pavel Golik; Ralf Schlüter; Hermann Ney

This paper describes the new release of RASR - the open-source version of the well-proven speech recognition toolkit developed and used at RWTH Aachen University. The focus is put on the implementation of the NN module for training neural network acoustic models. We describe the code design, configuration, and features of the NN module. The key feature is high flexibility regarding the network topology, choice of activation functions, training criteria, and optimization algorithm, as well as built-in support for efficient GPU computing. Run-time performance and recognition accuracy are evaluated, as an example, with a deep neural network acoustic model in a hybrid NN/HMM system. The results show that RASR achieves state-of-the-art performance on a real-world large-vocabulary task, while offering a complete pipeline for building and applying large-scale speech recognition systems.
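In the hybrid NN/HMM setup the abstract mentions, the HMM decoder is typically fed scaled likelihoods obtained by dividing the network's state posteriors by the state priors. A minimal sketch of this standard trick (not RASR's actual code; the function name and toy numbers are illustrative):

```python
import numpy as np

def hybrid_scaled_likelihoods(log_posteriors, log_priors, scale=1.0):
    """Hybrid NN/HMM trick: convert NN state posteriors into scaled likelihoods.

    By Bayes' rule p(x|s) is proportional to p(s|x) / p(s), so the decoder can
    use log-posteriors minus log state priors as emission scores.
    """
    return scale * (log_posteriors - log_priors)

# Toy example: 3 HMM states; posteriors come from the softmax output of the
# network, priors from counting state occurrences in the training alignment.
log_post = np.log(np.array([0.7, 0.2, 0.1]))
log_prior = np.log(np.array([0.5, 0.3, 0.2]))
scores = hybrid_scaled_likelihoods(log_post, log_prior)
print(scores.round(3))
```

States whose posterior exceeds their prior get positive scores; frequent states are penalized accordingly, which keeps the decoder's dynamic range comparable to true likelihoods.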


International Conference on Acoustics, Speech, and Signal Processing | 2011

Non-stationary feature extraction for automatic speech recognition

Zoltán Tüske; Pavel Golik; Ralf Schlüter; Friedhelm R. Drepper

Current speech recognition systems mainly apply Short-Time Fourier Transform based features such as MFCCs. Dropping the short-time stationarity assumption for voiced speech, this paper introduces non-stationary signal analysis into the ASR framework. We present new acoustic features extracted by a pitch-adaptive Gammatone filter bank. Noise robustness was demonstrated on the AURORA 2 and 4 tasks, where the proposed features outperform standard MFCCs. Furthermore, successful combination experiments via ROVER indicate the differences between the new features and MFCCs.
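For orientation, a standard (non-pitch-adaptive) Gammatone filter bank is built by spacing 4th-order filters on the ERB-rate scale; the pitch-adaptive variant proposed in the paper is not reproduced here. An illustrative sketch using the conventional Glasberg and Moore ERB constants:

```python
import numpy as np

def erb(f):
    # Equivalent rectangular bandwidth (Glasberg & Moore) in Hz.
    return 24.7 * (4.37e-3 * f + 1.0)

def gammatone_center_freqs(f_lo, f_hi, n):
    # Space n channels uniformly on the ERB-rate scale between f_lo and f_hi.
    erb_rate = lambda f: 21.4 * np.log10(4.37e-3 * f + 1.0)
    inv = lambda e: (10.0 ** (e / 21.4) - 1.0) / 4.37e-3
    return inv(np.linspace(erb_rate(f_lo), erb_rate(f_hi), n))

def gammatone_ir(fc, fs=16000, order=4, dur=0.025):
    # Gammatone impulse response: t^(n-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t).
    t = np.arange(int(dur * fs)) / fs
    b = 1.019 * erb(fc)
    h = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
    return h / np.abs(h).sum()

fcs = gammatone_center_freqs(100.0, 7500.0, 40)
bank = np.stack([gammatone_ir(fc) for fc in fcs])   # 40 filters x 400 taps
print(bank.shape)
```

The paper's contribution is to let the analysis adapt to the pitch contour instead of using such a fixed bank; the fixed version above only shows the filter family involved.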


IEEE Automatic Speech Recognition and Understanding Workshop | 2015

Speaker adaptive joint training of Gaussian mixture models and bottleneck features

Zoltán Tüske; Pavel Golik; Ralf Schlüter; Hermann Ney

In the tandem approach, the output of a neural network (NN) serves as input features to a Gaussian mixture model (GMM), aiming to improve the emission probability estimates. As shown in our previous work, a GMM with pooled covariance matrix can be integrated into a neural network framework as a softmax layer with hidden variables, which allows for joint estimation of both neural network and Gaussian mixture parameters. Here, this approach is extended to include speaker adaptive training (SAT) by introducing a speaker-dependent neural network layer. Error backpropagation beyond this speaker-dependent layer simultaneously realizes the adaptive training of the Gaussian parameters and the optimization of the bottleneck (BN) tandem features of the underlying acoustic model. In this study, after initialization by constrained maximum likelihood linear regression (CMLLR), the speaker-dependent layer itself is kept constant during the joint training. Experiments show that deeper backpropagation through the speaker-dependent layer is necessary for improved recognition performance. The speaker-adaptively and jointly trained BN-GMM yields a 5% relative improvement over a very strong speaker-independent hybrid baseline on the Quaero English broadcast news and conversations task, and on the 300-hour Switchboard task.
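The equivalence of a pooled-covariance GMM to a softmax layer with hidden variables follows because, with a shared covariance, each component's log-density is linear in the observation plus a bias; the observation's quadratic term is common to all states and cancels in the posterior. A small numerical sketch of this view (dimensions, the uniform priors, and the diagonal pooled covariance are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def logsumexp(a, axis=-1):
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)

D, S, K = 40, 9, 4                        # feature dim, HMM states, components per state
mu = rng.standard_normal((S, K, D))       # component means
sigma2 = np.ones(D)                       # pooled (shared) diagonal covariance
log_w = np.log(np.full((S, K), 1.0 / K))  # mixture weights

# With a pooled covariance, each component's log-density is linear in x plus a
# bias, so every (state, component) pair behaves like one row of a softmax
# layer, with the component index k acting as the hidden variable.
Lam = mu / sigma2                                   # (S, K, D) "weights"
alpha = log_w - 0.5 * (mu ** 2 / sigma2).sum(-1)    # (S, K) "biases"

def state_posteriors(x):
    scores = np.einsum("skd,d->sk", Lam, x) + alpha  # linear scores per component
    state_scores = logsumexp(scores, axis=-1)        # marginalize hidden component k
    return np.exp(state_scores - logsumexp(state_scores, axis=-1))

p = state_posteriors(rng.standard_normal(D))
print(p.shape, p.sum())
```

Because the whole computation is differentiable, gradients can flow through these Gaussian "softmax" parameters into the bottleneck network below, which is what enables the joint training described in the abstract.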


International Conference on Acoustics, Speech, and Signal Processing | 2015

Investigations on sequence training of neural networks

Simon Wiesler; Pavel Golik; Ralf Schlüter; Hermann Ney

In this paper we present an investigation of sequence-discriminative training of deep neural networks for automatic speech recognition. We evaluate different sequence-discriminative training criteria (MMI and MPE) and optimization algorithms (including SGD and Rprop) using the RASR toolkit. Further, we compare the training of the whole network with that of the output layer only. Technical details necessary for robust training are studied, since there is no consensus yet on the ultimate training recipe. The investigation extends our previous work on training linear bottleneck networks from scratch, showing the consistently positive effect of sequence training.
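At its core, the MMI criterion maximizes the log-posterior of the reference hypothesis against all competing hypotheses in a lattice, and its gradient per hypothesis is the difference between the reference indicator and that hypothesis's posterior. A toy sketch with made-up lattice scores (the acoustic scale is only a typical order of magnitude, not a value from the paper):

```python
import numpy as np

def logsumexp(a):
    m = a.max()
    return m + np.log(np.exp(a - m).sum())

# Toy "lattice": acoustic and language-model log-scores for the reference and
# two competing hypotheses (numbers are invented for illustration).
log_acoustic = np.array([-100.0, -103.0, -105.0])   # index 0 is the reference
log_lm = np.array([-10.0, -8.0, -12.0])
kappa = 1.0 / 12.0                                  # acoustic scale (illustrative)

# MMI criterion: log posterior of the reference among all hypotheses.
scores = kappa * log_acoustic + log_lm
f_mmi = scores[0] - logsumexp(scores)

# Gradient w.r.t. each hypothesis score: reference indicator minus posterior.
# This is the quantity backpropagated into the network in sequence training.
post = np.exp(scores - logsumexp(scores))
grad = np.array([1.0, 0.0, 0.0]) - post
print(round(f_mmi, 4), grad.round(4))
```

The gradient sums to zero over the hypothesis set: raising the reference score is exactly balanced by pushing down the competitors, which is what distinguishes sequence training from frame-wise cross-entropy.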


International Conference on Acoustics, Speech, and Signal Processing | 2012

Mobile music modeling, analysis and recognition

Pavel Golik; Boulos Harb; Ananya Misra; Michael Riley; Alex Rudnick; Eugene Weinstein

We present an analysis of music modeling and recognition techniques in the context of mobile music matching, substantially improving on the techniques presented in [1]. We accomplish this by adapting the features specifically to this task, and by introducing new modeling techniques that enable using a corpus of noisy and channel-distorted data to improve mobile music recognition quality. We report the results of an extensive empirical investigation of the system's robustness under realistic channel effects and distortions. We show an improvement of recognition accuracy by explicit duration modeling of music phonemes and by integrating the expected noise environment into the training process. Finally, we propose the use of frame-to-phoneme alignment for high-level structure analysis of polyphonic music.


International Conference on Acoustics, Speech, and Signal Processing | 2015

Unsupervised adaptation of a denoising autoencoder by Bayesian Feature Enhancement for reverberant asr under mismatch conditions

Jahn Heymann; Pavel Golik; Ralf Schlüter

The parametric Bayesian Feature Enhancement (BFE) and a data-driven Denoising Autoencoder (DA) both bring performance gains in severe single-channel speech recognition conditions. The former can be adjusted to different conditions by an appropriate parameter setting, while the latter needs to be trained on conditions similar to those expected at decoding time, making it vulnerable to a mismatch between training and test conditions. We use a DNN backend and study reverberant ASR under three types of mismatch conditions: different room reverberation times, different speaker-to-microphone distances, and the difference between artificially reverberated data and recordings made in a reverberant environment. We show that for these mismatch conditions BFE can provide the targets for a DA. This unsupervised adaptation provides a performance gain over the direct use of BFE and even makes it possible to compensate for the mismatch between real and simulated reverberant data.
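The adaptation scheme can be pictured as ordinary autoencoder training in which the regression targets are BFE-enhanced features rather than clean speech. A self-contained toy sketch, with simulated Gaussian features standing in for real reverberant data and BFE output (the network size, learning rate, and noise levels are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated stand-ins: reverberant features and enhanced versions of the same
# frames (in the paper, BFE output supplies these regression targets).
n, d = 256, 20
clean_like = rng.standard_normal((n, d))
reverberant = clean_like + 0.8 * rng.standard_normal((n, d))
bfe_targets = clean_like + 0.1 * rng.standard_normal((n, d))

# One-hidden-layer denoising autoencoder trained to map reverberant features
# onto the enhanced targets, using plain gradient descent on squared error.
h = 64
W1 = rng.standard_normal((d, h)) * 0.1
W2 = rng.standard_normal((h, d)) * 0.1
lr = 0.05

for _ in range(3000):
    z = np.tanh(reverberant @ W1)
    err = z @ W2 - bfe_targets              # squared-error gradient
    gW2 = (z.T @ err) / n
    gW1 = (reverberant.T @ ((err @ W2.T) * (1 - z ** 2))) / n
    W2 -= lr * gW2
    W1 -= lr * gW1

mse_before = ((reverberant - bfe_targets) ** 2).mean()
mse_after = ((np.tanh(reverberant @ W1) @ W2 - bfe_targets) ** 2).mean()
print(mse_before, mse_after)
```

After training, the DA output is closer to the enhanced targets than the raw reverberant features are; no transcriptions are needed at any point, which is why the adaptation is unsupervised.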


International Conference on Acoustics, Speech, and Signal Processing | 2017

Investigations on byte-level convolutional neural networks for language modeling in low resource speech recognition

Kazuki Irie; Pavel Golik; Ralf Schlüter; Hermann Ney

In this paper, we present an investigation of technical details of the byte-level convolutional layer, which replaces the conventional linear word projection layer in the neural language model. In particular, we discuss and compare effective filter configurations, pooling types, and the use of bytes instead of characters. We carry out experiments on language packs released by the IARPA Babel project and measure the performance in terms of perplexity and word error rate. Introducing a convolutional layer consistently improves the results on all languages. Also, there is no degradation from using raw bytes instead of proper Unicode characters, even on syllabic alphabets like Amharic. In addition, we report improvements in word error rate from rescoring lattices and evaluate keyword search performance on several languages.
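The byte-level convolutional layer can be pictured as: embed the UTF-8 bytes of a word, convolve over byte positions with several filters, and max-pool over time to obtain a fixed-size word representation. A minimal sketch (filter count, widths, and dimensions are illustrative, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(3)

emb_dim, n_filters, width = 8, 16, 3
byte_emb = rng.standard_normal((256, emb_dim)) * 0.1         # one vector per byte value
filters = rng.standard_normal((n_filters, width, emb_dim)) * 0.1

def word_representation(word):
    # Look up the word's UTF-8 bytes: works for any script without a
    # language-specific character table.
    byts = list(word.encode("utf-8"))
    x = byte_emb[byts]                                       # (len, emb_dim)
    if len(byts) < width:                                    # pad very short words
        x = np.pad(x, ((0, width - len(byts)), (0, 0)))
    # 1-D convolution over byte positions, then max-over-time pooling.
    windows = np.stack([x[i:i + width] for i in range(len(x) - width + 1)])
    conv = np.einsum("twd,fwd->tf", windows, filters)
    return np.tanh(conv).max(axis=0)                         # (n_filters,)

v_ascii = word_representation("speech")
v_amharic = word_representation("ሰላም")    # multi-byte UTF-8, handled identically
print(v_ascii.shape, v_amharic.shape)
```

The 256-entry byte table is what makes this vocabulary-free: every word of every language maps through the same small embedding, which is attractive in low-resource settings.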


International Conference on Speech and Computer | 2017

The 2016 RWTH Keyword Search System for Low-Resource Languages

Pavel Golik; Zoltán Tüske; Kazuki Irie; Eugen Beck; Ralf Schlüter; Hermann Ney

In this paper we describe the RWTH Aachen keyword search (KWS) system developed in the course of the IARPA Babel program. We put the focus on acoustic modeling with neural networks and evaluate the full pipeline with respect to KWS performance. At the core of this study lie multilingual bottleneck features extracted from a deep neural network trained on all 28 languages available to the project participants. We show that in a low-resource scenario, the multilingual features are crucial for achieving state-of-the-art performance.


International Conference on Speech and Computer | 2016

Automatic Speech Recognition Based on Neural Networks

Ralf Schlüter; Patrick Doetsch; Pavel Golik; Markus Kitza; Tobias Menne; Kazuki Irie; Zoltán Tüske; Albert Zeyer

In automatic speech recognition, as in many areas of machine learning, stochastic modeling relies more and more on neural networks. In both acoustic and language modeling, neural networks today mark the state of the art for large vocabulary continuous speech recognition, providing huge improvements over former approaches that were solely based on Gaussian mixture hidden Markov models and count-based language models. We give an overview of current activities in neural network based modeling for automatic speech recognition. This includes discussions of network topologies and cell types, training and optimization, choice of input features, adaptation and normalization, multitask training, as well as neural network based language modeling. Despite the clear progress obtained with neural network modeling in speech recognition, much remains to be done to obtain a consistent and self-contained neural network based modeling approach that ties in with the former state of the art. We conclude with a discussion of open problems as well as potential future directions for neural network integration into automatic speech recognition systems.

Collaboration


Dive into Pavel Golik's collaborations.

Top Co-Authors

Hermann Ney (RWTH Aachen University)
Kazuki Irie (RWTH Aachen University)
Eugen Beck (RWTH Aachen University)