Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Georg Heigold is active.

Publication


Featured research published by Georg Heigold.


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2013

Multilingual acoustic models using distributed deep neural networks

Georg Heigold; Vincent Vanhoucke; Andrew W. Senior; Patrick Nguyen; Marc'Aurelio Ranzato; Matthieu Devin; Jeffrey Dean

Today's speech recognition technology is mature enough to be useful for many practical applications. In this context, it is of paramount importance to train accurate acoustic models for many languages within given resource constraints such as data, processing power, and time. Multilingual training has the potential to solve the data issue and close the performance gap between resource-rich and resource-scarce languages. Neural networks lend themselves naturally to parameter sharing across languages, and distributed implementations have made it feasible to train large networks. In this paper, we present experimental results for cross- and multilingual network training of eleven Romance languages on 10k hours of data in total. The average relative gains over the monolingual baselines are 4%/2% (data-scarce/data-rich languages) for crosslingual and 7%/2% for multilingual training. However, the additional gain from jointly training the languages on all data comes at an increased training time of roughly four weeks, compared with two weeks (monolingual) and one week (crosslingual).
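As a rough illustration of the parameter sharing the abstract describes, the sketch below shares one hidden layer across languages and gives each language its own softmax output layer. All layer sizes, language codes, and weights here are hypothetical toys, not the paper's actual (much larger) networks:

```python
import numpy as np

rng = np.random.default_rng(0)

# One hidden layer with identical parameters for every language
# (hypothetical sizes: 40-dim input features, 64 hidden units).
W_shared = rng.standard_normal((40, 64)) * 0.1

# Language-specific softmax output layers on top of the shared stack
# (hypothetical: 500 output states per language).
W_out = {
    "fr": rng.standard_normal((64, 500)) * 0.1,
    "it": rng.standard_normal((64, 500)) * 0.1,
}

def forward(x, lang):
    """Shared hidden representation, then the language's own softmax."""
    h = np.maximum(0.0, x @ W_shared)            # shared ReLU layer
    logits = h @ W_out[lang]                     # language-specific layer
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

x = rng.standard_normal((3, 40))                 # 3 toy acoustic frames
p_fr = forward(x, "fr")
p_it = forward(x, "it")
```

Training updates to `W_shared` would then be driven by data from every language, which is the mechanism behind the crosslingual and multilingual gains reported above.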


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2014

Small-footprint keyword spotting using deep neural networks

Guoguo Chen; Carolina Parada; Georg Heigold

Our application requires a keyword spotting system with a small memory footprint, low computational cost, and high precision. To meet these requirements, we propose a simple approach based on deep neural networks. A deep neural network is trained to directly predict the keyword(s) or subword units of the keyword(s); a posterior handling method then produces a final confidence score. Keyword recognition results achieve 45% relative improvement with respect to a competitive Hidden Markov Model-based system, while performance in the presence of babble noise shows 39% relative improvement.
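The posterior handling step can be sketched along the following lines. The window size, unit count, and geometric-mean combination below are illustrative assumptions, not the paper's exact recipe: per-frame posteriors are smoothed over a short window, and the peak smoothed posterior of each keyword unit is combined into a single confidence score:

```python
import numpy as np

def smooth(posteriors, w=3):
    """Moving-average smoothing of per-frame label posteriors.

    posteriors: array of shape (frames, units)."""
    out = np.empty_like(posteriors)
    for j in range(len(posteriors)):
        lo = max(0, j - w + 1)
        out[j] = posteriors[lo:j + 1].mean(axis=0)
    return out

def confidence(posteriors, w=3):
    """Combine the max smoothed posterior of each keyword unit
    into one score via a geometric mean (an assumed combination rule)."""
    s = smooth(posteriors, w)
    peaks = s.max(axis=0)              # best frame for each unit
    return float(peaks.prod() ** (1.0 / len(peaks)))
```

A sliding decision then fires the keyword whenever `confidence` exceeds a tuned threshold, which is where the precision/footprint trade-off of the abstract is set.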


International Conference on Machine Learning (ICML) | 2008

Modified MMI/MPE: a direct evaluation of the margin in speech recognition

Georg Heigold; Thomas Deselaers; Ralf Schlüter; Hermann Ney

In this paper we show how common speech recognition training criteria, such as the Minimum Phone Error (MPE) criterion or the Maximum Mutual Information (MMI) criterion, can be extended to incorporate a margin term. Different margin-based training algorithms have been proposed to refine existing training algorithms for general machine learning problems. For speech recognition, however, some special problems have to be addressed, and the approaches proposed so far either lack practical applicability or require significant changes to the underlying model, e.g. to the optimization algorithm, the loss function, or the parameterization of the model. In our approach, the conventional training criteria are modified to incorporate a margin term. This allows us to do large-margin training in speech recognition using the same efficient algorithms for accumulation and optimization, and the same software, as for conventional discriminative training. We show that the proposed criteria are equivalent to support vector machines with suitable smooth loss functions that approximate the non-smooth hinge loss function or the hard error (e.g. the phone error). Experimental results are given for two different tasks: the rather simple digit string recognition task Sietill, which severely suffers from overfitting, and the large-vocabulary European Parliament Plenary Sessions English task, which is assumed to be dominated by the empirical risk, so that generalization is less of an issue.
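A toy, frame-level analogue of adding a margin to an MMI-style criterion looks like this. To be clear about assumptions: this is a margin-boosted softmax loss for a single observation, written in the spirit of the idea above, not the paper's lattice-based sequence implementation:

```python
import math

def mmi_margin_loss(scores, correct, rho=1.0):
    """Negated, margin-modified MMI criterion for one observation.

    The correct class must beat the competitors by a margin rho:
    competitor scores are boosted by rho before the softmax.
    With rho = 0 this reduces to the plain MMI / cross-entropy loss."""
    boosted = [s if k == correct else s + rho
               for k, s in enumerate(scores)]
    log_z = math.log(sum(math.exp(b) for b in boosted))
    return -(scores[correct] - log_z)
```

The practical appeal described in the abstract shows up even in this toy: the margin enters only through a modified score accumulation, so the surrounding optimization machinery is unchanged.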


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2014

Asynchronous stochastic optimization for sequence training of deep neural networks

Georg Heigold; Erik McDermott; Vincent Vanhoucke; Andrew W. Senior; Michiel Bacchiani

This paper explores asynchronous stochastic optimization for sequence training of deep neural networks. Sequence training requires more computation than frame-level training using pre-computed frame data. This leads to several complications for stochastic optimization, arising from significant asynchrony in model updates under massive parallelization, and limited data shuffling due to utterance-chunked processing. We analyze the impact of these two issues on the efficiency and performance of sequence training. In particular, we suggest a framework to formalize the reasoning about the asynchrony and present experimental results on both small and large scale Voice Search tasks to validate the effectiveness and efficiency of asynchronous stochastic optimization.
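The effect of asynchrony on stochastic optimization can be mimicked in a few lines: apply gradients computed from parameters that are several steps stale. This is a deliberately simplified, single-threaded simulation; the quadratic objective, step counts, and staleness value are made up for illustration and are not the paper's setup:

```python
import numpy as np

def sgd_with_staleness(grad_fn, w0, steps, lr, staleness):
    """Toy async SGD: each update applies a gradient computed from
    the parameters as they were `staleness` steps ago."""
    w = np.array(w0, dtype=float)
    history = [w.copy()]
    for t in range(steps):
        stale_w = history[max(0, t - staleness)]   # delayed read
        w = w - lr * grad_fn(stale_w)
        history.append(w.copy())
    return w

# Toy objective f(w) = 0.5 * ||w||^2, whose gradient is simply w.
grad = lambda w: w
w_sync  = sgd_with_staleness(grad, [4.0], steps=50, lr=0.1, staleness=0)
w_async = sgd_with_staleness(grad, [4.0], steps=50, lr=0.1, staleness=5)
```

Even on this convex toy, the stale run overshoots and oscillates before settling, which is the kind of behavior the paper's framework for reasoning about asynchrony is meant to formalize.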


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2013

Multiframe deep neural networks for acoustic modeling

Vincent Vanhoucke; Matthieu Devin; Georg Heigold

Deep neural networks have been shown to perform very well as acoustic models for automatic speech recognition. Compared to Gaussian mixtures however, they tend to be very expensive computationally, making them challenging to use in real-time applications. One key advantage of such neural networks is their ability to learn from very long observation windows going up to 400 ms. Given this very long temporal context, it is tempting to wonder whether one can run neural networks at a lower frame rate than the typical 10 ms, and whether there might be computational benefits to doing so. This paper describes a method of tying the neural network parameters over time which achieves comparable performance to the typical frame-synchronous model, while achieving up to a 4X reduction in the computational cost of the neural network activations.
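A minimal sketch of running the network at a lower frame rate follows. This is the simplest held-output variant, a simplification of the paper's actual parameter tying over time; the value of `k`, the toy frame list, and the stand-in `net` are all illustrative:

```python
def frame_synchronous(frames, net):
    """Baseline: one network evaluation per 10 ms frame."""
    return [net(f) for f in frames], len(frames)

def multiframe(frames, net, k=4):
    """Evaluate the network on every k-th frame only and hold its
    output for the intervening frames (a toy stand-in for the
    paper's jointly predicted batch of frames)."""
    outputs, evals = [], 0
    for i in range(0, len(frames), k):
        y = net(frames[i])
        evals += 1
        outputs.extend([y] * min(k, len(frames) - i))
    return outputs, evals
```

With `k=4` the number of network evaluations, and hence the activation cost, drops by roughly the 4X factor quoted in the abstract, while still emitting one output per frame.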


IEEE Signal Processing Magazine | 2012

Discriminative Training for Automatic Speech Recognition: Modeling, Criteria, Optimization, Implementation, and Performance

Georg Heigold; Hermann Ney; Ralf Schlüter; Simon Wiesler

Discriminative training techniques have been shown to consistently outperform the maximum likelihood (ML) paradigm for acoustic model training in automatic speech recognition (ASR). Consequently, today's discriminative training methods are fundamental components of state-of-the-art systems and are a major line of research in speech recognition. This article gives a comprehensive overview of discriminative training methods for acoustic model training in the context of ASR. The article covers all related aspects of discriminative training for speech recognition, i.e., specific training criteria and their relation, statistical modeling, different parameter optimization approaches, efficient implementation of discriminative training, and a performance overview.


Pattern Recognition | 2010

Object classification by fusing SVMs and Gaussian mixtures

Thomas Deselaers; Georg Heigold; Hermann Ney

We present a new technique that employs support vector machines (SVMs) and Gaussian mixture densities (GMDs) to create a generative/discriminative object classification technique using local image features. In the past, several approaches to fuse the advantages of generative and discriminative approaches were presented, often leading to improved robustness and recognition accuracy. Support vector machines are a well-known discriminative classification framework but, similar to other discriminative approaches, suffer from a lack of robustness with respect to noise and overfitting. Gaussian mixtures, on the contrary, are a widely used generative technique. We present a method to directly fuse both approaches, effectively allowing us to fully exploit the advantages of both. The fusion of SVMs and GMDs is done by representing SVMs in the framework of GMDs without changing the training and without changing the decision boundary. The new classifier is evaluated on the PASCAL VOC 2006 data. Additionally, we perform experiments on the USPS dataset and on four tasks from the UCI machine learning repository to obtain additional insights into the properties of the proposed approach. It is shown that for the relatively rare cases where SVMs have problems, the combined method outperforms both individual ones.


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2009

A flat direct model for speech recognition

Georg Heigold; Geoffrey Zweig; Xiao Li; Patrick Nguyen

We introduce a direct model for speech recognition that assumes an unstructured, i.e., flat text output. The flat model allows us to model arbitrary attributes and dependences of the output. This is different from the HMMs typically used for speech recognition. This conventional modeling approach is based on sequential data and makes rigid assumptions on the dependences. HMMs have proven to be convenient and appropriate for large vocabulary continuous speech recognition. Our task under consideration, however, is the Windows Live Search for Mobile (WLS4M) task [1]. This is a cellphone application that allows users to interact with web-based information portals. In particular, the set of valid outputs can be considered discrete and finite (although probably large, i.e., unseen events are an issue). Hence, a flat direct model lends itself to this task, making the addition of different knowledge sources and dependences straightforward and cheap. Using, e.g., HMM posterior, m-gram, and spotter features, we observed significant improvements over the conventional HMM system.
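Over a finite set of valid outputs, a flat direct model reduces to a log-linear classifier with arbitrary joint features of the input and the whole output string. A minimal sketch follows; the feature function, weights, and listing strings are invented for illustration and are not the paper's WLS4M features:

```python
import math

def flat_direct_score(x, y, weights, features):
    """Linear score w . f(x, y) over arbitrary joint features."""
    return sum(weights.get(name, 0.0) * v
               for name, v in features(x, y).items())

def flat_direct_posterior(x, outputs, weights, features):
    """Softmax over the finite list of valid outputs: the whole
    output string is one atomic label, with no HMM structure."""
    scores = {y: flat_direct_score(x, y, weights, features)
              for y in outputs}
    m = max(scores.values())
    z = sum(math.exp(s - m) for s in scores.values())
    return {y: math.exp(s - m) / z for y, s in scores.items()}

# Hypothetical feature for a toy listing-lookup task: does the first
# word of the candidate listing appear in the spoken query?
def demo_features(x, y):
    return {"overlap": 1.0 if y.split()[0] in x else 0.0}

listings = ["pizza hut", "starbucks coffee"]
post = flat_direct_posterior("call pizza hut", listings,
                             {"overlap": 2.0}, demo_features)
```

Adding a new knowledge source is just another entry in the feature dictionary, which is the "straightforward and cheap" property the abstract highlights.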


IEEE Transactions on Audio, Speech, and Language Processing | 2011

Equivalence of Generative and Log-Linear Models

Georg Heigold; Hermann Ney; Patrick Lehnen; Tobias Gass; Ralf Schlüter

Conventional speech recognition systems are based on hidden Markov models (HMMs) with Gaussian mixture models (Gaussian HMMs, GHMMs). Discriminative log-linear models are an alternative modeling approach and have been investigated recently in speech recognition. GHMMs are directed models with constraints, e.g., positivity of variances and normalization of conditional probabilities, while log-linear models do not use such constraints. This paper compares the posterior form of typical generative models related to speech recognition with their log-linear model counterparts. The key result will be the derivation of the equivalence of these two different approaches under weak assumptions. In particular, we study Gaussian mixture models, part-of-speech bigram tagging models, and eventually, the GHMMs. This result unifies two important but fundamentally different modeling paradigms in speech recognition on the functional level. Furthermore, this paper will present comparative experimental results for various speech tasks of different complexity, including a digit string and large-vocabulary continuous speech recognition tasks.
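The flavor of this equivalence can be checked numerically in the simplest case: two classes with equal-variance Gaussians, where the Bayes posterior of the generative model collapses to a logistic function that is linear in x (the quadratic terms cancel). This is only the scalar two-class instance, not the paper's general mixture/HMM result:

```python
import math

def gaussian(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def generative_posterior(x, mu1, mu2, var, prior1=0.5):
    """Bayes posterior of class 1 under two equal-variance Gaussians."""
    a = prior1 * gaussian(x, mu1, var)
    b = (1 - prior1) * gaussian(x, mu2, var)
    return a / (a + b)

def loglinear_posterior(x, mu1, mu2, var, prior1=0.5):
    """The same posterior written as a log-linear (logistic) model:
    lambda and lambda0 absorb the Gaussian parameters."""
    lam = (mu1 - mu2) / var
    lam0 = (mu2 ** 2 - mu1 ** 2) / (2 * var) \
           + math.log(prior1 / (1 - prior1))
    return 1.0 / (1.0 + math.exp(-(lam * x + lam0)))
```

Both functions return identical values for every x, which is the two-class, single-dimension instance of the functional equivalence derived in the paper.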


International Conference on Document Analysis and Recognition (ICDAR) | 2009

Confidence-Based Discriminative Training for Model Adaptation in Offline Arabic Handwriting Recognition

Philippe Dreuw; Georg Heigold; Hermann Ney

We present a novel confidence-based discriminative training approach for model adaptation in an HMM-based Arabic handwriting recognition system, designed to handle different handwriting styles and their variations. Most current approaches are maximum-likelihood trained HMM systems that try to adapt their models to different writing styles using writer-adaptive training, unsupervised clustering, or additional writer-specific data. Discriminative training based on the Maximum Mutual Information criterion is used to train writer-independent handwriting models. For model adaptation during decoding, an unsupervised confidence-based discriminative training on the word and frame level within a two-pass decoding process is proposed. Additionally, the training criterion is extended to incorporate a margin term. The proposed methods are evaluated on the IFN/ENIT Arabic handwriting database, where the proposed novel adaptation approach decreases the word error rate by 33% relative.

Collaboration


Dive into Georg Heigold's collaborations.

Top Co-Authors


Hermann Ney

RWTH Aachen University


Stefan Hahn

RWTH Aachen University
