
Publications


Featured research published by Mirko Hannemann.


IEEE Automatic Speech Recognition and Understanding Workshop | 2013

Score normalization and system combination for improved keyword spotting

Damianos Karakos; Richard M. Schwartz; Stavros Tsakalidis; Le Zhang; Shivesh Ranjan; Tim Ng; Roger Hsiao; Guruprasad Saikumar; Ivan Bulyko; Long Nguyen; John Makhoul; Frantisek Grezl; Mirko Hannemann; Martin Karafiát; Igor Szöke; Karel Vesely; Lori Lamel; Viet-Bac Le

We present two techniques that are shown to yield improved Keyword Spotting (KWS) performance under the ATWV/MTWV performance measures: (i) score normalization, where the scores of different keywords become commensurate with each other and correspond more closely to the probability of being correct than raw posteriors do; and (ii) system combination, where the detections of multiple systems are merged and their scores are interpolated with weights optimized using MTWV as the maximization criterion. Both score normalization and system combination yield significant gains in ATWV/MTWV, sometimes on the order of 8-10 points (absolute), across five different languages. A variant of these methods achieved the highest performance in the official surprise-language evaluation of the IARPA-funded Babel project in April 2013.
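The two ideas can be illustrated with a minimal sketch. This is not the paper's exact normalization (several variants exist in the Babel KWS literature); it shows one simple choice, sum-to-one normalization within each keyword, followed by linear interpolation of two systems' detection scores. All function names and the toy data are hypothetical.

```python
# Sketch: keyword-specific sum-to-one score normalization, then weighted
# linear combination of the detections of two KWS systems.

def normalize_per_keyword(detections):
    """detections: list of (keyword, time, score) tuples.
    Rescale scores so they sum to 1.0 within each keyword, making the
    scores of different keywords commensurate with each other."""
    totals = {}
    for kw, _, score in detections:
        totals[kw] = totals.get(kw, 0.0) + score
    return [(kw, t, score / totals[kw]) for kw, t, score in detections]

def combine_systems(dets_a, dets_b, weight_a=0.5):
    """Merge detections of two systems keyed by (keyword, time); scores of
    co-occurring detections are linearly interpolated.  In practice the
    interpolation weight would be tuned to maximize MTWV."""
    merged = {}
    for kw, t, s in dets_a:
        merged[(kw, t)] = weight_a * s
    for kw, t, s in dets_b:
        merged[(kw, t)] = merged.get((kw, t), 0.0) + (1.0 - weight_a) * s
    return [(kw, t, s) for (kw, t), s in sorted(merged.items())]

dets = [("cat", 1.0, 0.2), ("cat", 5.0, 0.6), ("dog", 2.0, 0.1)]
norm = normalize_per_keyword(dets)
```

After normalization the single "dog" detection gets score 1.0 even though its raw posterior was low, which is the commensurability effect described above.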


International Conference on Acoustics, Speech, and Signal Processing | 2008

Combination of strongly and weakly constrained recognizers for reliable detection of OOVs

Lukas Burget; Petr Schwarz; Pavel Matejka; Mirko Hannemann; Ariya Rastrow; Christopher M. White; Sanjeev Khudanpur; Hynek Hermansky; Jan Cernocky

This paper addresses the detection of OOV segments in the output of a large vocabulary continuous speech recognition (LVCSR) system. First, standard confidence measures from frame-based word and phone posteriors are investigated. Substantial improvement is obtained when posteriors from two systems, a strongly constrained one (LVCSR) and a weakly constrained one (phone posterior estimator), are combined. We show that this approach is also suitable for the detection of general recognition errors. All results are presented on the WSJ task with a reduced recognition vocabulary.
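The intuition behind the combination can be sketched as follows (illustrative only, not the paper's exact features): per frame, an LVCSR posterior that is low while the phone recognizer's posterior is high suggests that the strongly constrained system is being forced through a wrong in-vocabulary word, i.e. a likely OOV region. The log-ratio scoring and the threshold below are assumptions for the sketch.

```python
# Sketch: flag likely OOV frames where the weakly constrained phone
# recognizer is confident but the strongly constrained LVCSR is not.
import math

def oov_scores(lvcsr_post, phone_post):
    """Both inputs: per-frame posteriors in (0, 1].
    Returns a per-frame OOV score: log p_phone - log p_lvcsr.
    Large positive values indicate likely OOV frames."""
    return [math.log(p) - math.log(q) for q, p in zip(lvcsr_post, phone_post)]

def flag_oov(scores, threshold=1.0):
    return [s > threshold for s in scores]

lvcsr = [0.90, 0.80, 0.10, 0.05, 0.85]  # LVCSR unsure on frames 2-3
phone = [0.90, 0.85, 0.80, 0.90, 0.90]  # phone recognizer confident throughout
flags = flag_oov(oov_scores(lvcsr, phone))
```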


International Conference on Acoustics, Speech, and Signal Processing | 2012

Generating exact lattices in the WFST framework

Daniel Povey; Mirko Hannemann; Gilles Boulianne; Lukas Burget; Arnab Ghoshal; Milos Janda; Martin Karafiát; Stefan Kombrink; Petr Motlicek; Yanmin Qian; Korbinian Riedhammer; Karel Vesely; Ngoc Thang Vu

We describe a lattice generation method that is exact, i.e. it satisfies all the natural properties we would want from a lattice of alternative transcriptions of an utterance. This method does not introduce substantial overhead above one-best decoding. Our method is most directly applicable when using WFST decoders where the WFST is “fully expanded”, i.e. where the arcs correspond to HMM transitions. It outputs lattices that include HMM-state-level alignments as well as word labels. The general idea is to create a state-level lattice during decoding, and to do a special form of determinization that retains only the best-scoring path for each word sequence. This special determinization algorithm is a solution to the following problem: Given a WFST A, compute a WFST B that, for each input-symbol-sequence of A, contains just the lowest-cost path through A.
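The semantics of that final determinization step can be shown with a toy example. The paper's algorithm operates directly on the WFST during decoding; here, purely for illustration, the automaton A is given as an enumerated set of paths, and we keep just the lowest-cost path per input-symbol sequence, which is exactly what survives into the output lattice B.

```python
# Toy illustration of the determinization semantics: one best-scoring
# path (with its state-level alignment) per word sequence.

def keep_best_per_input(paths):
    """paths: list of (input_symbols_tuple, alignment, cost).
    Returns a dict mapping each input sequence to its cheapest
    (alignment, cost) pair."""
    best = {}
    for symbols, alignment, cost in paths:
        if symbols not in best or cost < best[symbols][1]:
            best[symbols] = (alignment, cost)
    return best

paths = [
    (("the", "cat"), "ali-1", 4.0),
    (("the", "cat"), "ali-2", 3.5),   # cheaper path, same word sequence
    (("a", "cat"),   "ali-3", 5.0),
]
lattice = keep_best_per_input(paths)
```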


IEEE Automatic Speech Recognition and Understanding Workshop | 2013

Semi-supervised training of Deep Neural Networks

Karel Vesely; Mirko Hannemann; Lukas Burget

In this paper we search for an optimal strategy for semi-supervised Deep Neural Network (DNN) training. We assume that a small part of the data is transcribed, while the majority is untranscribed. We explore self-training strategies with data selection based on both utterance-level and frame-level confidences. Furthermore, we study the interactions between semi-supervised frame-discriminative training and sequence-discriminative sMBR training. We found it beneficial to reduce the disproportion between the amounts of transcribed and untranscribed data by including the transcribed data several times, as well as to perform frame selection based on per-frame confidences derived from confusion in a lattice. For the experiments, we used the Limited language pack condition of the Surprise language task (Vietnamese) from the IARPA Babel program. The absolute Word Error Rate (WER) improvement for frame cross-entropy training is 2.2%, which corresponds to a WER recovery of 36% compared to an identical system in which the DNN is built on the fully transcribed data.
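The data-assembly strategy described above can be sketched in a few lines. The helper name, replication factor, and confidence threshold are hypothetical; in the paper the per-frame confidences come from lattice confusion, while here they are simply given.

```python
# Sketch: replicate the transcribed data to reduce the disproportion,
# and keep untranscribed frames only above a per-frame confidence threshold.

def build_training_set(transcribed, untranscribed, n_copies=3, conf_thr=0.7):
    """transcribed: list of (features, label) pairs.
    untranscribed: list of (features, hypothesized_label, confidence)
    triples, e.g. obtained by decoding with a seed model.
    Returns a list of (features, label) training frames."""
    data = list(transcribed) * n_copies          # replicate supervised data
    for feats, label, conf in untranscribed:
        if conf >= conf_thr:                     # per-frame confidence filter
            data.append((feats, label))
    return data

sup = [([0.1], "a"), ([0.2], "b")]
unsup = [([0.3], "a", 0.9), ([0.4], "b", 0.4)]  # second frame is discarded
train = build_training_set(sup, unsup, n_copies=2)
```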


International Conference on Acoustics, Speech, and Signal Processing | 2014

BUT neural network features for spontaneous Vietnamese in BABEL

Martin Karafiát; Frantisek Grezl; Mirko Hannemann; Jan Cernocky

This paper presents our work on speech recognition of Vietnamese spontaneous telephone conversations. It focuses on feature extraction by Stacked Bottle-Neck neural networks: several improvements, such as semi-supervised training on untranscribed data, increasing the precision of state targets, and CMLLR adaptation, were investigated. We also tested speaker adaptive training of this architecture and found a significant gain. The results are reported on BABEL Vietnamese data.
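The "stacking" in a Stacked Bottle-Neck architecture refers to feeding the first network's bottleneck outputs, concatenated over a temporal context window, into a second network whose bottleneck yields the final features. A minimal sketch of that context stacking step, with hypothetical sizes and edge padding by repetition:

```python
# Sketch: stack first-stage bottleneck outputs over a +/-context window
# to form the input of the second-stage network.

def context_stack(frames, context=2):
    """frames: list of per-frame feature vectors (lists of floats).
    Returns, per frame, the concatenation of the vectors in a
    +/-context window; edges are padded by repeating the border frame."""
    out = []
    n = len(frames)
    for i in range(n):
        stacked = []
        for k in range(i - context, i + context + 1):
            stacked.extend(frames[min(max(k, 0), n - 1)])
        out.append(stacked)
    return out

bn1 = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]   # first-stage bottleneck outputs
stacked = context_stack(bn1, context=1)       # input to the second network
```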


Text, Speech and Dialogue | 2010

Recovery of rare words in lecture speech

Stefan Kombrink; Mirko Hannemann; Lukas Burget; Hynek Heřmanský

The vocabulary used in speech usually consists of two types of words: a limited set of common words, shared across multiple documents, and a virtually unlimited set of rare words, each of which might appear only a few times in particular documents. In most documents, however, these rare words are not seen at all. The first type of word is typically included in the language model of an automatic speech recognizer (ASR) and is thus widely referred to as in-vocabulary (IV). Words of the second type are missing from the language model and are thus called out-of-vocabulary (OOV). However, these words usually carry important information. We use a hybrid word/sub-word recognizer to detect OOV words occurring in English talks and describe them as sequences of sub-words. We detected about one third of all OOV words, and were able to recover the correct spelling for 26.2% of all detections by using a phoneme-to-grapheme (P2G) conversion trained on the recognition dictionary. By omitting detections corresponding to recovered IV words, we were able to increase the precision of the OOV detection substantially.
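A toy version of P2G conversion, far simpler than the models actually trained on the recognition dictionary, conveys the idea: learn how phonemes map to letters from a pronunciation dictionary, then spell out recognized phoneme sequences. The sketch assumes artificial 1:1 phoneme-grapheme alignments and picks the most frequent grapheme per phoneme.

```python
# Toy phoneme-to-grapheme (P2G) conversion: learn the most frequent
# grapheme for each phoneme from 1:1-aligned dictionary entries, then
# map recognized sub-word (phoneme) sequences back to spellings.
from collections import Counter, defaultdict

def train_p2g(aligned_dict):
    """aligned_dict: list of (phonemes, graphemes) pairs of equal length."""
    counts = defaultdict(Counter)
    for phones, graphs in aligned_dict:
        for p, g in zip(phones, graphs):
            counts[p][g] += 1
    return {p: c.most_common(1)[0][0] for p, c in counts.items()}

def p2g(model, phones):
    return "".join(model.get(p, "?") for p in phones)

dictionary = [
    (["k", "ae", "t"], ["c", "a", "t"]),
    (["k", "aa", "r"], ["c", "a", "r"]),
    (["b", "ae", "t"], ["b", "a", "t"]),
]
model = train_p2g(dictionary)
spelling = p2g(model, ["b", "aa", "r"])  # an OOV phoneme sequence
```

Recovering a spelling that happens to already be in the vocabulary is exactly the "recovered IV word" case that the paper filters out to raise precision.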


Spoken Language Technology Workshop | 2014

BUT ASR system for BABEL Surprise evaluation 2014

Martin Karafiát; Karel Vesely; Igor Szöke; Lukas Burget; Frantisek Grezl; Mirko Hannemann; Jan Cernocky

The paper describes the Brno University of Technology (BUT) ASR system for the 2014 BABEL Surprise language evaluation (Tamil). While largely based on our previous work, it brings two original contributions: (1) speaker-adapted bottle-neck neural network (BN) features were investigated as input to the DNN recognizer, and semi-supervised training was found effective; (2) adding noise to the training data outperformed a classical de-noising technique when dealing with noisy test data, and the performance of this approach was verified on a relatively clean training/test setup from a different language. All results are reported on BABEL 2014 Tamil data.


International Conference on Acoustics, Speech, and Signal Processing | 2013

Combining forward and backward search in decoding

Mirko Hannemann; Daniel Povey; Geoffrey Zweig

We introduce a speed-up for weighted finite state transducer (WFST) based decoders, based on the idea that one decoding pass using a wide beam can be replaced by two decoding passes with smaller beams, decoding forward and backward in time. We apply this in a decoder that works with a variable beam width, which is widened in areas where the two decoding passes disagree. Experimental results on the Wall Street Journal corpus (WSJ) using the Kaldi toolkit show a substantial speedup (a factor of 2 or 3) at the “more accurate” operating points. As part of this work we also introduce a new fast algorithm for weight pushing in WFSTs, and summarize an algorithm for the time reversal of backoff language models.
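The disagreement check that drives the variable beam can be sketched simply. This is an illustration only: it assumes the forward and backward one-best outputs have already been aligned into equal-length lists of (word, frame) pairs, and it returns the time spans where the two passes hypothesize different words, i.e. the regions where the beam would be widened.

```python
# Sketch: find the time regions where the forward and backward decoding
# passes disagree; only these regions need re-decoding with a wider beam.

def disagreement_regions(fwd, bwd):
    """fwd, bwd: equal-length lists of (word, frame) pairs from the two
    passes.  Returns (start_frame, end_frame) spans of disagreement."""
    regions, start = [], None
    for (w1, t), (w2, _) in zip(fwd, bwd):
        if w1 != w2 and start is None:
            start = t                      # disagreement begins
        elif w1 == w2 and start is not None:
            regions.append((start, t))     # passes agree again
            start = None
    if start is not None:                  # disagreement runs to the end
        regions.append((start, fwd[-1][1]))
    return regions

fwd = [("the", 0), ("cat", 10), ("sat", 20), ("down", 30)]
bwd = [("the", 0), ("cap", 10), ("sad", 20), ("down", 30)]
spans = disagreement_regions(fwd, bwd)
```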


Detection and Identification of Rare Audiovisual Cues | 2012

Out-of-Vocabulary Word Detection and Beyond

Stefan Kombrink; Mirko Hannemann; Lukas Burget

In this work, we summarize our experiences in the detection of unexpected words in automatic speech recognition (ASR). Two approaches based on a paradigm of incongruence detection between generic and specific recognition systems are introduced. Arguing that detection of incongruence is necessary but not sufficient once possible follow-up actions are taken into account, we motivate the preference for one approach over the other. Nevertheless, we show that a fusion outperforms both single systems. Finally, we propose possible actions to take after the detection of unexpected words, and conclude with general remarks about what we found to be important when dealing with unexpected words.


International Conference on Acoustics, Speech, and Signal Processing | 2017

Bayesian joint-sequence models for grapheme-to-phoneme conversion

Mirko Hannemann; Jan Trmal; Lucas Ondel; Santosh Kesiraju; Lukas Burget

We describe a fully Bayesian approach to grapheme-to-phoneme conversion based on the joint-sequence model (JSM). Usually, standard smoothed n-gram language models (LMs, e.g. Kneser-Ney) are used with JSMs to model graphone sequences (joint grapheme-phoneme pairs). We instead take a Bayesian approach using a hierarchical Pitman-Yor-process LM. This provides an elegant alternative to using smoothing techniques to avoid over-training: no held-out sets or complex parameter tuning are necessary, and several convergence problems encountered in the discounted Expectation-Maximization (as used in the smoothed JSMs) are avoided. Every step is modeled by weighted finite state transducers and implemented with standard operations from the OpenFST toolkit. We evaluate our model on a standard data set (CMUdict), where it gives results comparable to the previously reported smoothed JSMs in terms of phoneme error rate while requiring much less training/testing time. Most importantly, our model can be used in a Bayesian framework and for (partly) unsupervised training.
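To see why the Pitman-Yor process replaces explicit smoothing, consider its predictive probability for a single context (one "restaurant"; the paper uses a hierarchy of these, one per n-gram order). Given token counts c_w, table counts t_w, discount d and strength theta, probability mass is discounted from observed graphones and redistributed through a base distribution p0, much like interpolated Kneser-Ney but derived from the prior. The counts and base distribution below are hypothetical.

```python
# Sketch: two-parameter Pitman-Yor predictive probability for one context,
#   P(w) = (c_w - d*t_w)/(theta + c) + (theta + d*t)/(theta + c) * p0(w)
# where c and t are the total token and table counts in the context.

def pyp_predictive(counts, tables, d, theta, p0):
    c_total = sum(counts.values())
    t_total = sum(tables.values())
    denom = theta + c_total

    def prob(w):
        seated = max(counts.get(w, 0) - d * tables.get(w, 0), 0.0)
        backoff = (theta + d * t_total) * p0(w)   # mass routed to base dist
        return (seated + backoff) / denom
    return prob

# Hypothetical graphone counts in one context; uniform base over 4 graphones
prob = pyp_predictive({"c:k": 3, "a:ae": 1}, {"c:k": 1, "a:ae": 1},
                      d=0.5, theta=1.0, p0=lambda w: 0.25)
```

In the hierarchical model, p0 would itself be the predictive distribution of the shorter-context restaurant, giving the recursive backoff that smoothed JSMs approximate with Kneser-Ney discounting.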

Collaboration


Dive into Mirko Hannemann's collaborations.

Top Co-Authors

Lukas Burget, Brno University of Technology
Martin Karafiát, Brno University of Technology
Frantisek Grezl, Brno University of Technology
Karel Veselý, Brno University of Technology
Stefan Kombrink, Brno University of Technology
Igor Szöke, Brno University of Technology
Jan Cernocký, Brno University of Technology
Daniel Povey, Johns Hopkins University