Publications


Featured research published by Lidia Mangu.


Computer Speech & Language | 2000

Finding consensus in speech recognition: word error minimization and other applications of confusion networks

Lidia Mangu; Eric D. Brill; Andreas Stolcke

We describe a new framework for distilling information from word lattices to improve the accuracy of the speech recognition output and obtain a more perspicuous representation of a set of alternative hypotheses. In the standard MAP decoding approach the recognizer outputs the string of words corresponding to the path with the highest posterior probability given the acoustics and a language model. However, even given optimal models, the MAP decoder does not necessarily minimize the commonly used performance metric, word error rate (WER). We describe a method for explicitly minimizing WER by extracting word hypotheses with the highest posterior probabilities from word lattices. We change the standard problem formulation by replacing global search over a large set of sentence hypotheses with local search over a small set of word candidates. In addition to improving the accuracy of the recognizer, our method produces a new representation of a set of candidate hypotheses that specifies the sequence of word-level confusions in a compact lattice format. We study the properties of confusion networks and examine their use for other tasks, such as lattice compression, word spotting, confidence annotation, and reevaluation of recognition hypotheses using higher-level knowledge sources.
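As a minimal sketch of the consensus step the abstract describes, the code below picks the highest-posterior word from each confusion set; the data structure, values, and function name are illustrative assumptions, not taken from the paper.

```python
EPS = "-"  # epsilon / deletion hypothesis in a confusion set

# Illustrative confusion network: a sequence of confusion sets (bins),
# each mapping candidate words (or the deletion arc) to posterior probabilities.
confusion_network = [
    {"the": 0.60, "a": 0.35, EPS: 0.05},
    {"cat": 0.70, "hat": 0.25, EPS: 0.05},
    {"sat": 0.90, EPS: 0.10},
]

def consensus_hypothesis(network):
    """Pick the highest-posterior word from each confusion set; bins where
    the deletion (epsilon) hypothesis wins contribute no word."""
    words = []
    for bin_posteriors in network:
        best = max(bin_posteriors, key=bin_posteriors.get)
        if best != EPS:
            words.append(best)
    return words

print(consensus_hypothesis(confusion_network))  # ['the', 'cat', 'sat']
```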


international conference on acoustics, speech, and signal processing | 2005

fMPE: discriminatively trained features for speech recognition

Daniel Povey; Brian Kingsbury; Lidia Mangu; George Saon; Hagen Soltau; Geoffrey Zweig

MPE (minimum phone error) is a previously introduced technique for discriminative training of HMM parameters. fMPE applies the same objective function to the features, transforming the data with a kernel-like method and training millions of parameters, comparable to the size of the acoustic model. Despite the large number of parameters, fMPE is robust to over-training. The method is to train a matrix projecting from posteriors of Gaussians to a normal-size feature space, and then to add the projected features to standard features such as PLP. The matrix is trained from a zero start using a linear method. Sparsity of the posteriors ensures speed in both training and testing. The technique gives improvements similar to MPE (around 10% relative). MPE on top of fMPE results in error rates up to 6.5% relative better than MPE alone, or more if multiple layers of transform are trained.
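As a rough illustration of the feature transform described above (base features plus a trained matrix applied to a sparse vector of Gaussian posteriors), here is a minimal NumPy sketch. The dimensions, the sparsity handling, and all names are assumptions for illustration; the discriminative training of the matrix itself is not shown.

```python
import numpy as np

D = 40        # base feature dimension (e.g. PLP plus derivatives); illustrative
G = 100_000   # number of Gaussians whose posteriors are evaluated per frame; illustrative

rng = np.random.default_rng(0)
M = np.zeros((D, G))  # in the paper this matrix is trained discriminatively

def fmpe_transform(x_t, active, posteriors, M):
    """x_t: base feature vector of shape (D,).
    active: indices of Gaussians with non-negligible posterior at this frame.
    posteriors: their posterior values (same length as `active`).
    Returns x_t + M h_t, exploiting the sparsity of the posterior vector h_t."""
    offset = M[:, active] @ posteriors  # only the active columns contribute
    return x_t + offset

x_t = rng.standard_normal(D)
active = np.array([17, 523, 90_001])      # a handful of active Gaussians (toy values)
posteriors = np.array([0.6, 0.3, 0.1])
y_t = fmpe_transform(x_t, active, posteriors, M)
```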


IEEE Transactions on Audio, Speech, and Language Processing | 2006

Advances in speech transcription at IBM under the DARPA EARS program

Stanley F. Chen; Brian Kingsbury; Lidia Mangu; Daniel Povey; George Saon; Hagen Soltau; Geoffrey Zweig

This paper describes the technical and system-building advances made in IBM's speech recognition technology over the course of the Defense Advanced Research Projects Agency (DARPA) Effective Affordable Reusable Speech-to-Text (EARS) program. At a technical level, these advances include the development of a new form of feature-based minimum phone error training (fMPE), the use of large-scale discriminatively trained full-covariance Gaussian models, the use of septaphone acoustic context in static decoding graphs, and improvements in basic decoding algorithms. At a system-building level, the advances include a system architecture based on cross-adaptation and the incorporation of 2100 h of training data in every system component. We present results on English conversational telephony test data from the 2003 and 2004 NIST evaluations. The combination of technical advances and an order of magnitude more training data in 2004 reduced the error rate on the 2003 test set by approximately 21% relative (from 20.4% to 16.1%) over the most accurate system in the 2003 evaluation, and produced the most accurate results on the 2004 test sets in every speed category.


international conference on acoustics, speech, and signal processing | 2005

The IBM 2004 conversational telephony system for rich transcription

Hagen Soltau; Brian Kingsbury; Lidia Mangu; Daniel Povey; George Saon; Geoffrey Zweig

This paper describes the technical advances in IBM's conversational telephony submission to the DARPA-sponsored 2004 rich transcription evaluation (RT-04). These advances include a system architecture based on cross-adaptation; a new form of feature-based MPE training; the use of a full-scale discriminatively trained full-covariance Gaussian system; the use of septaphone cross-word acoustic context in static decoding graphs; and the incorporation of 2100 hours of training data in every system component. These advances reduced the error rate on the 2003 test set by approximately 21% relative over the best-performing system in last year's evaluation, and produced the best results on the RT-04 current and progress CTS data.


Computer Speech & Language | 2011

Minimum Bayes Risk decoding and system combination based on a recursion for edit distance

Haihua Xu; Daniel Povey; Lidia Mangu; Jie Zhu

In this paper we describe a method that can be used for Minimum Bayes Risk (MBR) decoding for speech recognition. Our algorithm can take as input either a single lattice, or multiple lattices for system combination. It has similar functionality to the widely used Consensus method, but has a clearer theoretical basis and appears to give better results both for MBR decoding and system combination. Many different approximations have been described to solve the MBR decoding problem, which is very difficult from an optimization point of view. Our proposed method solves the problem through a novel forward-backward recursion on the lattice, not requiring time markings. We prove that our algorithm iteratively improves a bound on the Bayes risk.
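The recursion the paper builds on is the standard Levenshtein edit distance between word sequences. The toy sketch below computes that distance and uses it to evaluate the Bayes risk of one hypothesis against a small explicit posterior over references, which the lattice-based algorithm approximates efficiently; the example data is illustrative, not from the paper.

```python
def edit_distance(hyp, ref):
    """Standard Levenshtein recursion between a hypothesis and a reference
    word sequence; the paper's MBR decoder generalizes this recursion to
    operate over lattices of weighted hypotheses."""
    n, m = len(hyp), len(ref)
    # dp[i][j] = edit distance between hyp[:i] and ref[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])
            dele = dp[i - 1][j] + 1
            ins = dp[i][j - 1] + 1
            dp[i][j] = min(sub, dele, ins)
    return dp[n][m]

# Expected Bayes risk of a hypothesis under a toy posterior over references.
posterior = {("the", "cat", "sat"): 0.7, ("a", "cat", "sat"): 0.3}
hyp = ("the", "cat", "sat")
risk = sum(p * edit_distance(hyp, ref) for ref, p in posterior.items())
print(risk)  # 0.3
```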


international conference on acoustics, speech, and signal processing | 2007

The IBM 2006 GALE Arabic ASR system

Hagen Soltau; George Saon; Brian Kingsbury; Jeff Hc Kuo; Lidia Mangu; Daniel Povey; Geoffrey Zweig

This paper describes the advances made in IBM's Arabic broadcast news transcription system, which was fielded in the 2006 GALE ASR and machine translation evaluation. These advances were instrumental in lowering the word error rate by 42% relative over the course of one year and include: training on additional LDC data, large-scale discriminative training on 1800 hours of unsupervised data, automatic vowelization using a flat-start approach, use of a large vocabulary with 617K words and 2 million pronunciations, and lastly a system architecture based on cross-adaptation between unvowelized and vowelized acoustic models.


international conference on acoustics, speech, and signal processing | 2010

The IBM 2008 GALE Arabic speech transcription system

Brian Kingsbury; Hagen Soltau; George Saon; Stephen M. Chu; Hong-Kwang Kuo; Lidia Mangu; Suman V. Ravuri; Nelson Morgan; Adam Janin

This paper describes the Arabic broadcast transcription system fielded by IBM in the GALE Phase 3.5 machine translation evaluation. Key advances compared to our Phase 2.5 system include improved discriminative training, the use of Subspace Gaussian Mixture Models (SGMM), neural network acoustic features, variable frame rate decoding, training data partitioning experiments, unpruned n-gram language models and neural network language models. These advances were instrumental in achieving a word error rate of 8.9% on the evaluation test set.


international conference on acoustics, speech, and signal processing | 2013

System combination and score normalization for spoken term detection

Jonathan Mamou; Jia Cui; Xiaodong Cui; Mark J. F. Gales; Brian Kingsbury; Kate Knill; Lidia Mangu; David Nolden; Michael Picheny; Bhuvana Ramabhadran; Ralf Schlüter; Abhinav Sethy; Philip C. Woodland

Spoken content in languages of emerging importance needs to be searchable to provide access to the underlying information. In this paper, we investigate the problem of extending data fusion methodologies from Information Retrieval for Spoken Term Detection on low-resource languages in the framework of the IARPA Babel program. We describe a number of alternative methods for improving keyword search performance. We apply these methods to Cantonese, a language that presents some new issues in terms of reduced resources and shorter query lengths. First, we show a score normalization methodology that improves keyword search performance by 20% on average. Second, we show that properly combining the outputs of diverse ASR systems performs 14% better than the best normalized ASR system.
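The abstract does not spell out the normalization formula. One widely used variant in spoken term detection is keyword-specific sum-to-one normalization, sketched below purely as an illustration; the hit format, values, and function name are assumptions, not the paper's exact method.

```python
from collections import defaultdict

# Illustrative detection hits: (keyword, audio_file, start, end, raw_score).
hits = [
    ("KW1", "file_a.wav", 12.3, 12.9, 4.0),
    ("KW1", "file_b.wav", 40.1, 40.6, 1.0),
    ("KW2", "file_a.wav",  7.8,  8.2, 0.5),
]

def normalize_per_keyword(hits):
    """Rescale detection scores so the scores of each keyword sum to one,
    making a single decision threshold comparable across keywords whose raw
    score ranges differ widely."""
    totals = defaultdict(float)
    for kw, *_rest, score in hits:
        totals[kw] += score
    return [(kw, f, s, e, score / totals[kw]) for kw, f, s, e, score in hits]

for hit in normalize_per_keyword(hits):
    print(hit)
```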


international conference on acoustics, speech, and signal processing | 2013

A high-performance Cantonese keyword search system

Brian Kingsbury; Jia Cui; Xiaodong Cui; Mark J. F. Gales; Kate Knill; Jonathan Mamou; Lidia Mangu; David Nolden; Michael Picheny; Bhuvana Ramabhadran; Ralf Schlüter; Abhinav Sethy; Philip C. Woodland

We present a system for keyword search on Cantonese conversational telephony audio, collected for the IARPA Babel program, that achieves good performance by combining postings lists produced by diverse speech recognition systems from three different research groups. We describe the keyword search task, the data on which the work was done, four different speech recognition systems, and our approach to system combination for keyword search. We show that the combination of four systems outperforms the best single system by 7%, achieving an actual term-weighted value of 0.517.
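Below is a minimal sketch of merging postings lists from several systems, in the spirit of the combination the abstract describes; the overlap test, the score-averaging rule, and the field names are assumptions for illustration, not the paper's exact recipe.

```python
# Each posting is a dict: {"keyword", "file", "start", "end", "score"}.

def overlaps(a, b):
    """Two detections in the same file are merged if their time spans overlap."""
    return a["file"] == b["file"] and a["start"] < b["end"] and b["start"] < a["end"]

def combine_postings(system_outputs):
    """system_outputs: one postings list per ASR system, scores already normalized.
    Overlapping detections of the same keyword are merged and their scores averaged."""
    merged = []
    for postings in system_outputs:
        for hit in postings:
            for m in merged:
                if m["keyword"] == hit["keyword"] and overlaps(m, hit):
                    m["score"] += hit["score"]
                    m["votes"] += 1
                    break
            else:
                merged.append({**hit, "votes": 1})
    for m in merged:
        m["score"] /= m["votes"]  # average across the systems that found the hit
    return merged

# Toy usage: two systems detect the same keyword at overlapping times.
system_a = [{"keyword": "KW1", "file": "f1.wav", "start": 3.0, "end": 3.5, "score": 0.9}]
system_b = [{"keyword": "KW1", "file": "f1.wav", "start": 3.1, "end": 3.6, "score": 0.7}]
print(combine_postings([system_a, system_b]))  # one merged hit, averaged score 0.8
```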


international conference on acoustics, speech, and signal processing | 2001

Error corrective mechanisms for speech recognition

Lidia Mangu; Mukund Padmanabhan

In the standard MAP approach to speech recognition, the goal is to find the word sequence with the highest posterior probability given the acoustic observation. A number of alternate approaches have been proposed for directly optimizing the word error rate, the most commonly used evaluation criterion. One of them, the consensus decoding approach, converts a word lattice into a confusion network which specifies the word-level confusions at different time intervals, and outputs the word with the highest posterior probability from each word confusion set. The paper presents a method for discriminating between the correct and alternate hypotheses in a confusion set using additional knowledge sources extracted from the confusion networks. We use transformation-based learning for inducing a set of rules to guide a better decision between the top two candidates with the highest posterior probabilities in each confusion set. The choice of this learning method is motivated by the perspicuous representation of the rules induced, which can provide insight into the cause of the errors of a speech recognizer. In experiments on the Switchboard corpus, we show significant improvements over the consensus decoding approach.
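As a toy illustration of rescoring the top two candidates in a confusion set with learned rules, here is a minimal sketch; the rule features, thresholds, and data are invented for illustration and are not the rules induced by transformation-based learning in the paper.

```python
# Each confusion set: candidates sorted by posterior, plus extra features
# that rules may consult (here only the posterior gap, purely illustrative).
confusion_sets = [
    {"candidates": [("too", 0.48), ("to", 0.46)], "posterior_gap": 0.02},
    {"candidates": [("cat", 0.80), ("hat", 0.15)], "posterior_gap": 0.65},
]

def swap_if_close_and_shorter(conf_set):
    """Toy rule: if the top two posteriors are nearly tied and the runner-up
    is a shorter word, prefer the runner-up."""
    (w1, _p1), (w2, _p2) = conf_set["candidates"][:2]
    return conf_set["posterior_gap"] < 0.05 and len(w2) < len(w1)

rules = [swap_if_close_and_shorter]

def decode(confusion_sets, rules):
    """Output the top candidate of each confusion set unless some rule fires,
    in which case output the runner-up."""
    output = []
    for cs in confusion_sets:
        top, runner_up = cs["candidates"][0][0], cs["candidates"][1][0]
        output.append(runner_up if any(rule(cs) for rule in rules) else top)
    return output

print(decode(confusion_sets, rules))  # ['to', 'cat']
```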
