
Publications


Featured research published by Zoltán Tüske.


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2013

Investigation on cross- and multilingual MLP features under matched and mismatched acoustical conditions

Zoltán Tüske; Joel Pinto; Daniel Willett; Ralf Schlüter

In this paper, Multi Layer Perceptron (MLP) based multilingual bottleneck features are investigated for acoustic modeling in three languages: German, French, and US English. We use a modified training algorithm to handle the multilingual training scenario without having to explicitly map the phonemes to a common phoneme set. Furthermore, the cross-lingual portability of bottleneck features between the three languages is also investigated. Single-pass recognition experiments on a large-vocabulary SMS dictation task indicate that (1) multilingual bottleneck features yield significantly lower word error rates than standard MFCC features, (2) multilingual bottleneck features are superior to monolingual bottleneck features trained for the target language with limited training data, and (3) multilingual bottleneck features are beneficial for training acoustic models in a low-resource language where only mismatched training data is available, by exploiting the more matched training data from other languages.
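
A minimal sketch of the shared-bottleneck idea, assuming PyTorch: shared hidden layers end in a low-dimensional bottleneck, and each language keeps its own softmax over its own phoneme targets, so no common phoneme set is needed. The class name, layer sizes, and target counts are invented for illustration; the paper's exact training modification may differ.

```python
# Hedged sketch of a multilingual bottleneck MLP: shared hidden layers feed a
# low-dimensional bottleneck; each language has its own output layer over its
# own phoneme targets. All sizes are illustrative, not those of the paper.
import torch
import torch.nn as nn

class MultilingualBottleneckMLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, bn_dim, targets_per_lang):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.Sigmoid(),
            nn.Linear(hidden_dim, bn_dim),            # bottleneck layer
        )
        # one output layer per language, so phonemes never need a common set
        self.heads = nn.ModuleDict({
            lang: nn.Linear(bn_dim, n) for lang, n in targets_per_lang.items()
        })

    def forward(self, x, lang):
        bn = self.shared(x)          # bottleneck activations = tandem features
        return self.heads[lang](bn)  # language-specific logits for training

model = MultilingualBottleneckMLP(
    input_dim=440, hidden_dim=2000, bn_dim=60,
    targets_per_lang={"de": 45, "fr": 38, "en": 42},
)
logits = model(torch.randn(16, 440), lang="de")
# Each minibatch updates the shared stack plus only the head of its own
# language; at recognition time only the bottleneck activations are used.
```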


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2014

Multilingual MRASTA features for low-resource keyword search and speech recognition systems

Zoltán Tüske; David Nolden; Ralf Schlüter; Hermann Ney

This paper investigates the application of hierarchical MRASTA bottleneck (BN) features for under-resourced languages within the IARPA Babel project. Through multilingual training of Multilayer Perceptron (MLP) BN features on five languages (Cantonese, Pashto, Tagalog, Turkish, and Vietnamese), we obtain a single feature stream that is more beneficial to all languages than unilingual features. In the case of balanced corpus sizes, the multilingual BN features improve automatic speech recognition (ASR) performance by 3-5% and keyword search (KWS) by 3-10% relative for both limited (LLP) and full language packs (FLP). Borrowing orders of magnitude more data from non-target FLPs reduces the recognition error rate by 8-10% and improves spoken term detection by over 40% relative on the Vietnamese and Pashto LLPs. Aiming at fast development of acoustic models, cross-lingual transfer of multilingually “pretrained” BN features to a new language is also investigated. Without any MLP training on the new language, the ported BN features perform similarly to unilingual features on the FLP and significantly better on the LLP. Results also show that a simple fine-tuning step on the new language is enough to achieve KWS and ASR performance comparable to a system in which the target language is included in the time-consuming multilingual training.
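
The cross-lingual porting and fine-tuning described above might look roughly as follows; this is a hedged PyTorch sketch with invented sizes, not the paper's exact recipe.

```python
# Hedged sketch of porting multilingual BN features to a new language: keep
# the pretrained shared stack, attach a fresh softmax for the new language's
# targets, and optionally fine-tune. Sizes and weights are illustrative.
import torch
import torch.nn as nn

# Pretrained multilingual bottleneck stack (in practice, weights would be
# loaded from the multilingual training run; random here for illustration).
shared = nn.Sequential(
    nn.Linear(440, 2000), nn.Sigmoid(),
    nn.Linear(2000, 60),              # 60-dim bottleneck
)
new_head = nn.Linear(60, 50)          # fresh targets for the new language

# Porting without MLP training: extract BN features with the frozen stack and
# train only the new language's GMM/KWS system on top of them.
with torch.no_grad():
    bn_features = shared(torch.randn(8, 440))

# Optional fine-tuning: a short backprop pass on new-language data through
# head and stack, which the abstract reports is enough to approach a system
# that included the target language in the multilingual training.
params = list(shared.parameters()) + list(new_head.parameters())
optimizer = torch.optim.SGD(params, lr=0.01)
```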


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2013

Deep hierarchical bottleneck MRASTA features for LVCSR

Zoltán Tüske; Ralf Schlüter; Hermann Ney

Hierarchical Multi Layer Perceptron (MLP) based long-term feature extraction is optimized for a TANDEM connectionist large vocabulary continuous speech recognition (LVCSR) system within the QUAERO project. Training the bottleneck MLP on multi-resolution RASTA-filtered critical band energies yields more than 20% relative word error rate (WER) reduction over a standard MFCC system after the number of target labels is optimized. Furthermore, introducing a deeper structure into the hierarchical bottleneck processing increases the relative gain to 25%. The final system based on deep bottleneck TANDEM features clearly outperforms the hybrid approach, even when the long-term features are also presented to the deep MLP acoustic model. The results are also verified on evaluation data from 2012, where about 20% relative WER improvement over the classical cepstral system is measured even after speaker adaptive training.
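
As a rough illustration of hierarchical bottleneck processing, the sketch below cascades two bottleneck MLPs, the second consuming the first's bottleneck outputs stacked over a temporal context together with a second feature stream. The wiring, context size, and dimensions are assumptions for the example; MRASTA filtering itself is not reproduced.

```python
# Hedged sketch of a two-stage hierarchical bottleneck MLP (PyTorch).
import torch
import torch.nn as nn

def bn_mlp(inp, hid, bn):
    return nn.Sequential(nn.Linear(inp, hid), nn.Sigmoid(), nn.Linear(hid, bn))

mlp1 = bn_mlp(inp=216, hid=2000, bn=60)           # stage 1: fast modulations
mlp2 = bn_mlp(inp=60 * 9 + 216, hid=2000, bn=60)  # stage 2: BN context + slow stream

fast = torch.randn(100, 216)   # 100 frames of fast-stream features (stand-in)
slow = torch.randn(100, 216)   # matching slow-stream features (stand-in)
bn1 = mlp1(fast)
# Stack +-4 frames of stage-1 bottlenecks as temporal context for stage 2.
# torch.roll wraps around at the edges; a real system would pad instead.
ctx = torch.cat([torch.roll(bn1, s, dims=0) for s in range(-4, 5)], dim=1)
bn2 = mlp2(torch.cat([ctx, slow], dim=1))         # final tandem features
```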


IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | 2015

Multilingual representations for low resource speech recognition and keyword search

Jia Cui; Brian Kingsbury; Bhuvana Ramabhadran; Abhinav Sethy; Kartik Audhkhasi; Xiaodong Cui; Ellen Kislal; Lidia Mangu; Markus Nussbaum-Thom; Michael Picheny; Zoltán Tüske; Pavel Golik; Ralf Schlüter; Hermann Ney; Mark J. F. Gales; Kate Knill; Anton Ragni; Haipeng Wang; Phil Woodland

This paper examines the impact of multilingual (ML) acoustic representations on Automatic Speech Recognition (ASR) and keyword search (KWS) for low-resource languages in the context of the OpenKWS15 evaluation of the IARPA Babel program. The task is to develop Swahili ASR and KWS systems within two weeks using as little as 3 hours of transcribed data. Multilingual acoustic representations proved to be crucial for building these systems under strict time constraints. The paper discusses several key insights on how these representations are derived and used. First, we present a data sampling strategy that can speed up the training of multilingual representations without appreciable loss in ASR performance. Second, we show that fusion of diverse multilingual representations developed at different LORELEI sites yields substantial ASR and KWS gains. Speaker adaptation and data augmentation of these representations improve both ASR and KWS performance (up to 8.7% relative). Third, incorporating un-transcribed data through semi-supervised learning improves WER and KWS performance. Finally, we show that these multilingual representations significantly improve ASR and KWS performance (9% relative for WER and 5% for MTWV) even when forty hours of transcribed audio in the target language are available. Multilingual representations significantly contributed to the LORELEI KWS systems winning the OpenKWS15 evaluation.


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2011

Non-stationary feature extraction for automatic speech recognition

Zoltán Tüske; Pavel Golik; Ralf Schlüter; Friedhelm R. Drepper

Current speech recognition systems mainly apply short-time Fourier transform based features such as MFCCs. Dropping the short-time stationarity assumption for voiced speech, this paper introduces non-stationary signal analysis into the ASR framework. We present new acoustic features extracted by a pitch-adaptive Gammatone filter bank. Noise robustness is demonstrated on the AURORA 2 and 4 tasks, where the proposed features outperform standard MFCCs. Furthermore, successful combination experiments via ROVER indicate that the new features provide information complementary to MFCCs.
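
For orientation, a minimal NumPy sketch of a fourth-order Gammatone filter bank follows. The pitch-adaptive element is only hinted at by letting the bandwidth depend on an externally estimated F0; that particular coupling is an invented placeholder, not the paper's rule.

```python
# Hedged sketch of Gammatone filter bank log-energy features (NumPy).
import numpy as np

def gammatone_ir(fc, fs, n=4, duration=0.05, bw_scale=1.0):
    """Impulse response t^(n-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t)."""
    t = np.arange(int(duration * fs)) / fs
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)      # Glasberg & Moore ERB scale
    b = 1.019 * erb * bw_scale                   # bandwidth parameter
    return t ** (n - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)

def filterbank_energies(signal, fs, fcs, f0=None):
    # Hypothetical pitch coupling for illustration: widen filters at high F0.
    scale = 1.0 if f0 is None else max(1.0, f0 / 100.0)
    outs = [np.convolve(signal, gammatone_ir(fc, fs, bw_scale=scale), "same")
            for fc in fcs]
    return np.log([np.mean(o ** 2) + 1e-10 for o in outs])  # log energies

fs = 16000
x = np.random.randn(fs)              # stand-in for one second of speech
fcs = np.geomspace(100, 7000, 20)    # 20 center frequencies
feats = filterbank_energies(x, fs, fcs, f0=120.0)
```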


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2015

Integrating Gaussian mixtures into deep neural networks: Softmax layer with hidden variables

Zoltán Tüske; Muhammad Ali Tahir; Ralf Schlüter; Hermann Ney

In the hybrid approach, the neural network output directly serves as hidden Markov model (HMM) state posterior probability estimates. In contrast, in the tandem approach the neural network output is used as input features to improve classic Gaussian mixture model (GMM) based emission probability estimates. This paper shows that GMMs can be easily integrated into the deep neural network framework. By exploiting its equivalence with the log-linear mixture model (LMM), a GMM can be transformed into a large softmax layer followed by a summation pooling layer. Theoretical and experimental results indicate that jointly trained and optimally chosen GMM and bottleneck tandem features cannot perform worse than a hybrid model. Thus, the question “hybrid vs. tandem” simplifies to optimizing the output layer of a neural network. Speech recognition experiments are carried out on a broadcast news and conversations task using up to 12 feed-forward hidden layers with sigmoid and rectified linear unit activation functions. The evaluation of the LMM layer shows recognition gains over the classic softmax output.
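
The stated equivalence is easy to make concrete: with a pooled covariance, log c_sk + log N(x; mu_sk, Sigma) is linear in x up to a class-independent term, so every (state, component) pair becomes one row of a large softmax layer, and summing posteriors over components recovers the state posterior. A toy PyTorch sketch with an identity covariance and invented sizes:

```python
# GMM-as-softmax sketch: one big softmax over (state, component) pairs,
# followed by summation pooling over the mixture components.
import torch
import torch.nn.functional as F

S, K, D = 10, 4, 40        # states, components per state, feature dimension
x = torch.randn(3, D)      # a batch of (bottleneck) feature vectors

mu = torch.randn(S, K, D)                              # component means
log_c = F.log_softmax(torch.randn(S, K), dim=1)        # mixture weights
# With identity (pooled) covariance, the quadratic term -0.5*|x|^2 is the
# same for every class and cancels in the softmax, leaving a linear score:
# log c_sk + log N(x; mu_sk, I) = mu_sk . x + (log c_sk - 0.5*|mu_sk|^2) + const(x)
lam = mu.reshape(S * K, D)                             # softmax layer weights
alpha = (log_c - 0.5 * (mu ** 2).sum(-1)).reshape(S * K)  # softmax layer biases

logits = x @ lam.T + alpha                  # one large softmax over S*K classes
post_sk = F.softmax(logits, dim=-1)         # p(s, k | x)
post_s = post_sk.reshape(-1, S, K).sum(-1)  # summation pooling -> p(s | x)
```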


IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | 2015

Speaker adaptive joint training of Gaussian mixture models and bottleneck features

Zoltán Tüske; Pavel Golik; Ralf Schlüter; Hermann Ney

In the tandem approach, the output of a neural network (NN) serves as input features to a Gaussian mixture model (GMM), aiming to improve the emission probability estimates. As shown in our previous work, a GMM with pooled covariance matrix can be integrated into a neural network framework as a softmax layer with hidden variables, which allows for joint estimation of both neural network and Gaussian mixture parameters. Here, this approach is extended to speaker adaptive training (SAT) by introducing a speaker-dependent neural network layer. Error backpropagation beyond this speaker-dependent layer realizes the adaptive training of the Gaussian parameters and, simultaneously, the optimization of the bottleneck (BN) tandem features of the underlying acoustic model. In this study, after initialization by constrained maximum likelihood linear regression (CMLLR), the speaker-dependent layer itself is kept constant during the joint training. Experiments show that the deeper backpropagation through the speaker-dependent layer is necessary for improved recognition performance. The speaker adaptively and jointly trained BN-GMM yields a 5% relative improvement over a very strong speaker-independent hybrid baseline on the Quaero English broadcast news and conversations task and on the 300-hour Switchboard task.
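
A hedged sketch of the speaker-dependent layer, assuming PyTorch: a per-speaker affine transform of the bottleneck features, initialized from a CMLLR estimate and frozen, while gradients still pass through it to everything below. The CMLLR values here are random stand-ins.

```python
# Frozen speaker-dependent affine layer with backprop flowing through it.
import torch
import torch.nn as nn

bn_dim = 60
speaker_layer = nn.Linear(bn_dim, bn_dim)
# cmllr_A, cmllr_b would come from a conventional CMLLR estimation pass;
# random stand-ins here for illustration.
cmllr_A, cmllr_b = torch.randn(bn_dim, bn_dim), torch.randn(bn_dim)
with torch.no_grad():
    speaker_layer.weight.copy_(cmllr_A)
    speaker_layer.bias.copy_(cmllr_b)
for p in speaker_layer.parameters():
    p.requires_grad = False    # layer stays constant during joint training

bn_features = torch.randn(8, bn_dim, requires_grad=True)  # from the BN MLP
adapted = speaker_layer(bn_features)   # feeds the GMM softmax layer
adapted.sum().backward()               # toy backward pass: gradients reach
# bn_features (and the MLP that produced them), realizing SAT joint training.
```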


Conference of the International Speech Communication Association (Interspeech) | 2016

LSTM, GRU, Highway and a bit of Attention: An Empirical Overview for Language Modeling in Speech Recognition

Kazuki Irie; Zoltán Tüske; Tamer Alkhouli; Ralf Schlüter; Hermann Ney

Popularized by the long short-term memory (LSTM), multiplicative gates have become a standard means of designing artificial neural networks with intentionally organized information flow. Notable examples of such architectures include gated recurrent units (GRU) and highway networks. In this work, we first evaluate each of the classical gated architectures for language modeling in large vocabulary speech recognition: the highway network, the lateral network, the LSTM, and the GRU. Furthermore, the motivation underlying the highway network also applies to the LSTM and GRU. An extension specific to the LSTM has recently been proposed, with an additional highway connection between the memory cells of adjacent LSTM layers. In contrast, we investigate an approach that can be used with both LSTM and GRU: a highway network in which the LSTM or GRU is used as the transformation function. We find that highway connections enable both standalone feed-forward and recurrent neural language models to benefit more from deep structure, and provide a slight improvement in recognition accuracy after interpolation with count models. To complete the overview, we include our initial investigations on the use of the attention mechanism for learning word triggers.
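
The "LSTM as highway transformation function" can be sketched in a few lines; this is an illustrative PyTorch reading with invented dimensions, not the paper's exact parameterization.

```python
# Highway connection y = t * H(x) + (1 - t) * x, where the transformation H
# is an LSTM layer instead of a feed-forward one (a GRU would work the same).
import torch
import torch.nn as nn

class HighwayLSTM(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.h = nn.LSTM(dim, dim, batch_first=True)  # transformation H
        self.t = nn.Linear(dim, dim)                  # transform gate T

    def forward(self, x):
        h, _ = self.h(x)
        t = torch.sigmoid(self.t(x))
        return t * h + (1.0 - t) * x    # carry the input where the gate closes

layer = HighwayLSTM(dim=512)
y = layer(torch.randn(4, 20, 512))      # (batch, time, dim) is preserved
```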


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2016

Investigation on log-linear interpolation of multi-domain neural network language model

Zoltán Tüske; Kazuki Irie; Ralf Schlüter; Hermann Ney

Inspired by the success of multi-task training in acoustic modeling, this paper investigates a new architecture for a multi-domain neural network based language model (NNLM). The proposed model has several shared hidden layers and domain-specific output layers. As will be shown, the log-linear interpolation of the multi-domain outputs and the optimization of the interpolation weights fit naturally into the NNLM framework, and the resulting model can be expressed as a single NNLM. As an initial study of such an architecture, this paper focuses on deep feed-forward neural networks (DNNs). We also re-investigate the potential of long contexts of up to 30-grams and depths of up to 5 hidden layers in DNN-LMs. Our final feed-forward multi-domain NNLM is trained on 3.1B running words across 11 domains for an English broadcast news and conversations large vocabulary continuous speech recognition task. After log-linear interpolation and fine-tuning, we measure improvements in terms of perplexity and word error rate over models trained on 50M running words of in-domain news resources. The final multi-domain feed-forward LM outperforms our previous best LSTM-RNN LM trained on the 50M-word in-domain corpus, even after linear interpolation with large count models.
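
Why the interpolated model "can be expressed as a single NNLM" follows from the softmax form: p(w|h) proportional to the product over domains of p_d(w|h)^lambda_d collapses to one softmax over the lambda-weighted sum of the domain logits, since the per-domain normalizers are constant across words. A toy PyTorch sketch with invented sizes (normalizing the weights via a softmax is also an assumption of this sketch):

```python
# Log-linear interpolation of domain-specific output layers as one softmax.
import torch
import torch.nn as nn
import torch.nn.functional as F

V, H, D = 5000, 600, 11                # vocab size, hidden size, domain count
shared = nn.Sequential(nn.Linear(H, H), nn.Tanh())   # stands in for deep stack
heads = nn.ModuleList(nn.Linear(H, V) for _ in range(D))  # per-domain outputs
log_lam = nn.Parameter(torch.zeros(D))  # interpolation weights, trainable too

h = shared(torch.randn(2, H))           # hidden state for a 2-example batch
lam = F.softmax(log_lam, dim=0)         # keep weights positive, summing to 1
logits = sum(l * head(h) for l, head in zip(lam, heads))
log_p = F.log_softmax(logits, dim=-1)   # the single-NNLM equivalent model
```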


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2014

The RWTH English lecture recognition system

Simon Wiesler; Kazuki Irie; Zoltán Tüske; Ralf Schlüter; Hermann Ney

In this paper, we describe the RWTH speech recognition system for English lectures developed within the transLectures project. A difficulty in the development of an English lecture recognition system is the high proportion of non-native speakers. We address this problem by using very effective deep bottleneck features trained on multilingual data. The acoustic model is trained on large amounts of data from different domains and with different dialects. Large improvements are obtained from unsupervised acoustic adaptation. Another challenge is the frequent use of technical terms and the wide range of topics. In our recognition system, the slides attached to most lectures are used to improve lexical coverage and to adapt the language model.

Collaboration


An overview of Zoltán Tüske's most frequent co-authors.

Top Co-Authors

Hermann Ney (RWTH Aachen University)
Pavel Golik (RWTH Aachen University)
Kazuki Irie (RWTH Aachen University)