Publication


Featured research published by Patrick Nguyen.


IEEE Automatic Speech Recognition and Understanding Workshop | 2009

A segmental CRF approach to large vocabulary continuous speech recognition

Geoffrey Zweig; Patrick Nguyen

This paper proposes a segmental conditional random field framework for large vocabulary continuous speech recognition. Fundamental to this approach is the use of acoustic detectors as the basic input, and the automatic construction of a versatile set of segment-level features. The detector streams operate at multiple time scales (frame, phone, multi-phone, syllable or word) and are combined at the word level in the CRF training and decoding processes. A key aspect of our approach is that features are defined at the word level, and are naturally geared to explain long span phenomena such as formant trajectories, duration, and syllable stress patterns. Generalization to unseen words is possible through the use of decomposable consistency features [1], [2], and our framework allows for the joint or separate discriminative training of the acoustic and language models. An initial evaluation of this framework with voice search data from the Bing Mobile (BM) application results in a 2% absolute improvement over an HMM baseline.
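
As a rough illustration of the decoding side of this idea, the sketch below runs semi-Markov Viterbi over a frame sequence, hypothesizing (segment, word) pairs and scoring each with word-level feature functions. The features, vocabulary, and weights are toy stand-ins, not the paper's detector streams.

```python
# A minimal sketch (not the paper's implementation) of semi-Markov
# Viterbi decoding for a segmental CRF: every path is a segmentation
# of the frames, each segment scored at the word level.
import numpy as np

def segment_features(frames, start, end, word):
    """Word-level features over the span [start, end). Illustrative:
    a crude duration match and stand-in acoustic-detector evidence."""
    dur = end - start
    return np.array([
        -abs(dur - len(word)),        # duration vs. word length
        frames[start:end].mean(),     # stand-in detector evidence
    ])

def scrf_decode(frames, vocab, weights, max_dur=8):
    """Best-scoring segmentation via semi-Markov dynamic programming."""
    T = len(frames)
    best = np.full(T + 1, -np.inf)
    best[0] = 0.0
    back = [None] * (T + 1)           # (start, word) backpointers
    for end in range(1, T + 1):
        for start in range(max(0, end - max_dur), end):
            for word in vocab:
                s = best[start] + weights @ segment_features(frames, start, end, word)
                if s > best[end]:
                    best[end], back[end] = s, (start, word)
    # Recover the segmentation from the backpointers.
    hyp, t = [], T
    while t > 0:
        start, word = back[t]
        hyp.append((start, t, word))
        t = start
    return best[T], hyp[::-1]

score, segmentation = scrf_decode(np.random.rand(20), ["the", "cat"], np.array([1.0, 2.0]))
```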


International Conference on Acoustics, Speech, and Signal Processing | 2008

Live search for mobile: Web services by voice on the cellphone

Alex Acero; Neal Bernstein; Robert L. Chambers; Yun-Cheng Ju; Xinggang Li; Julian J. Odell; Patrick Nguyen; Oliver Scholz; Geoffrey Zweig

Live search for mobile is a cellphone application that allows users to interact with Web-based information portals. Currently the implementation is focused on information related to local businesses: their phone numbers and addresses, directions, reviews, maps of the surrounding area, and traffic. This paper describes a recently developed speech-recognition interface that allows users to interact with the application by voice. The paper presents the overall architecture, the user interface, the design and implementation of the speech recognition grammars, and initial performance results indicating that for sentence-level utterance recognition we achieve 60 to 65% of human capability.


Empirical Methods in Natural Language Processing | 2008

Indirect-HMM-based Hypothesis Alignment for Combining Outputs from Machine Translation Systems

Xiaodong He; Mei Yang; Jianfeng Gao; Patrick Nguyen; Robert C. Moore

This paper presents a new hypothesis alignment method for combining outputs of multiple machine translation (MT) systems. An indirect hidden Markov model (IHMM) is proposed to address the synonym matching and word ordering issues in hypothesis alignment. Unlike traditional HMMs whose parameters are trained via maximum likelihood estimation (MLE), the parameters of the IHMM are estimated indirectly from a variety of sources including word semantic similarity, word surface similarity, and a distance-based distortion penalty. The IHMM-based method significantly outperforms the state-of-the-art TER-based alignment model in our experiments on NIST benchmark datasets. Our combined SMT system using the proposed method achieved the best Chinese-to-English translation result in the constrained training track of the 2008 NIST Open MT Evaluation.
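
The sketch below illustrates what "indirect" estimation might look like in practice: emission probabilities blended from word-similarity sources rather than trained by MLE, plus a distance-based distortion penalty for transitions. The similarity functions and constants (alpha, K) are illustrative assumptions, not the paper's exact parameterization.

```python
# A hedged sketch of indirectly derived IHMM parameters.
def surface_sim(e, f):
    """Longest-common-subsequence ratio as a surface-similarity stand-in."""
    m, n = len(e), len(f)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = dp[i][j] + 1 if e[i] == f[j] else max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n] / max(m, n)

def emission_prob(e, f, semantic_sim, alpha=0.3):
    # Indirect emission model: a convex mix of semantic and surface similarity.
    return alpha * semantic_sim(e, f) + (1 - alpha) * surface_sim(e, f)

def transition_prob(i, j, K=2.0):
    # Distance-based distortion: jumps far from the previous aligned
    # position are penalized geometrically (normalization omitted here).
    return K ** -abs(j - i - 1)

# Example with a toy semantic-similarity lookup.
sem = lambda e, f: 1.0 if (e, f) in {("car", "auto")} else 0.0
p = emission_prob("car", "auto", sem)
```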


International Conference on Acoustics, Speech, and Signal Processing | 2011

Speech recognition with segmental conditional random fields: A summary of the JHU CLSP 2010 Summer Workshop

Geoffrey Zweig; Patrick Nguyen; D. Van Compernolle; Kris Demuynck; L. Atlas; Pascal Clark; Gregory Sell; M. Wang; Fei Sha; Hynek Hermansky; Damianos Karakos; Aren Jansen; Samuel Thomas; S. Bowman; Justine T. Kao

This paper summarizes the 2010 CLSP Summer Workshop on speech recognition at Johns Hopkins University. The key theme of the workshop was to improve on state-of-the-art speech recognition systems by using Segmental Conditional Random Fields (SCRFs) to integrate multiple types of information. This approach uses a state-of-the-art baseline as a springboard from which to add a suite of novel features including ones derived from acoustic templates, deep neural net phoneme detections, duration models, modulation features, and whole word point-process models. The SCRF framework is able to appropriately weight these different information sources to produce significant gains on both the Broadcast News and Wall Street Journal tasks.


International Conference on Acoustics, Speech, and Signal Processing | 2009

A flat direct model for speech recognition

Georg Heigold; Geoffrey Zweig; Xiao Li; Patrick Nguyen

We introduce a direct model for speech recognition that assumes an unstructured, i.e., flat, text output. The flat model allows us to model arbitrary attributes and dependences of the output. This differs from the HMMs typically used for speech recognition, which are based on sequential data and make rigid assumptions about the dependences. HMMs have proven convenient and appropriate for large vocabulary continuous speech recognition. Our task under consideration, however, is the Windows Live Search for Mobile (WLS4M) task [1], a cellphone application that allows users to interact with web-based information portals. In particular, the set of valid outputs can be considered discrete and finite (although probably large, i.e., unseen events are an issue). Hence, a flat direct model lends itself to this task, making it straightforward and cheap to add different knowledge sources and dependences. Using HMM posterior, m-gram, and spotter features, for example, we observed significant improvements over the conventional HMM system.
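
A minimal sketch of the flat direct model's shape, under the assumption of a finite candidate list: a log-linear model scores the audio against each complete hypothesis, with no segmentation or Markov structure. The feature functions below are illustrative stand-ins, not the paper's exact features.

```python
# p(w | x) ∝ exp(lam · f(x, w)) over a finite set of whole outputs.
import numpy as np

def fdm_posteriors(audio_feats, candidates, feature_fns, lam):
    """Posterior over candidate outputs under a flat log-linear model."""
    scores = np.array([
        lam @ np.array([fn(audio_feats, w) for fn in feature_fns])
        for w in candidates
    ])
    scores -= scores.max()             # numerical stability
    p = np.exp(scores)
    return p / p.sum()

# Illustrative whole-hypothesis features (hypothetical names).
hmm_posterior = lambda x, w: x.get(w, 0.0)               # e.g. a baseline HMM score
length_match  = lambda x, w: -abs(len(w.split()) - x["n_words"])

cands = ["starbucks seattle", "star bikes seattle"]
x = {"starbucks seattle": 0.7, "star bikes seattle": 0.2, "n_words": 2}
print(fdm_posteriors(x, cands, [hmm_posterior, length_match], np.array([2.0, 0.5])))
```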


International Conference on Acoustics, Speech, and Signal Processing | 2008

An empirical study of automatic accent classification

Ghinwa F. Choueiter; Geoffrey Zweig; Patrick Nguyen

This paper extends language identification (LID) techniques to a large scale accent classification task: 23-way classification of foreign-accented English. We find that a purely acoustic approach based on a combination of heteroscedastic linear discriminant analysis (HLDA) and maximum mutual information (MMI) training is very effective. In contrast to LID tasks, methods based on parallel language models prove much less effective. We focus on the Oregon Graduate Institute Foreign-Accented English dataset, and obtain a detection rate of 32%, which to our knowledge is the best reported result for 23-way accent classification.
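
To make the pipeline shape concrete, here is a hedged sketch: discriminant-analysis projection of utterance-level acoustic vectors followed by a Gaussian classifier. sklearn's plain LDA stands in for HLDA, the MMI training step is omitted, and the data is synthetic rather than the OGI corpus.

```python
# Sketch of a purely acoustic 23-way accent classifier (not the paper's system).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2300, 39))      # e.g. mean MFCC vectors per utterance
y = rng.integers(0, 23, size=2300)   # 23 accent classes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LinearDiscriminantAnalysis()   # projects and classifies jointly
clf.fit(X_tr, y_tr)
detection_rate = (clf.predict(X_te) == y_te).mean()  # chance is ~1/23 ≈ 4.3%
print(f"23-way detection rate: {detection_rate:.1%}")
```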


International Conference on Acoustics, Speech, and Signal Processing | 2007

Finding Speaker Identities with a Conditional Maximum Entropy Model

Chengyuan Ma; Patrick Nguyen; Milind Mahajan

In this paper, we address the task of identifying the speakers by name in audio content. Identification of speakers by name helps to improve the readability of the transcript and also provides additional meta-data which can help in finding the audio content of interest. We present a conditional maximum entropy (maxent) framework for this problem which yields superior performance and lends itself well to incorporating different types of information. We take advantage of this property of maxent to explore new features for this task. We show that supplementing standard lexical triggers with information such as speaker gender and position of speaker name mentions affords us large gains in performance. At 95% precision, we increase the recall to 67% from the trigger baseline of 38%.
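
A minimal sketch of the maxent formulation: each (name, context) pair becomes a feature dict, and logistic regression (a conditional maxent model) scores whether the name is the true speaker. The feature types (trigger word, gender agreement, mention position) mirror the kinds the paper describes; the toy data and helper names are illustrative assumptions.

```python
# Sketch: conditional maxent speaker-name scoring via logistic regression.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def featurize(context):
    return {
        "trigger=" + context["trigger"]: 1.0,          # e.g. "thanks", "over_to"
        "gender_match": float(context["gender_match"]),
        "mention_pos=" + context["mention_pos"]: 1.0,  # before/after the turn
    }

train = [
    ({"trigger": "thanks",  "gender_match": True,  "mention_pos": "before"}, 1),
    ({"trigger": "thanks",  "gender_match": False, "mention_pos": "after"},  0),
    ({"trigger": "over_to", "gender_match": True,  "mention_pos": "before"}, 1),
    ({"trigger": "said",    "gender_match": True,  "mention_pos": "after"},  0),
]
vec = DictVectorizer()
X = vec.fit_transform([featurize(c) for c, _ in train])
y = [label for _, label in train]
model = LogisticRegression().fit(X, y)

# Raising the decision threshold trades recall for precision, as in the
# paper's operating point of 95% precision.
probs = model.predict_proba(X)[:, 1]
high_precision_hits = probs >= 0.8
```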


International Conference on Acoustics, Speech, and Signal Processing | 2011

Integrating meta-information into exemplar-based speech recognition with segmental conditional random fields

Kris Demuynck; Dino Seppi; Dirk Van Compernolle; Patrick Nguyen; Geoffrey Zweig

Exemplar-based recognition systems are characterized by the fact that, instead of abstracting large amounts of data into compact models, they store the observed data enriched with some annotations and infer on the fly from the data by finding those exemplars that best resemble the input speech. One advantage of exemplar-based systems is that, in addition to deriving what the current phone or word is, one can easily derive a wealth of meta-information concerning the chunk of audio under investigation. In this work we harvest meta-information from the set of best-matching exemplars that is thought to be relevant for recognition, such as word boundary predictions and speaker entropy. Integrating this meta-information into the recognition framework using segmental conditional random fields reduced the WER of the exemplar-based system on the WSJ Nov92 20k task from 8.2% to 7.6%. Adding the HMM score and multiple HMM phone detectors as features further reduced the error rate to 6.6%.
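
One of the meta-features named above, speaker entropy, can be sketched directly: find the best-matching exemplars for an input chunk and measure the entropy of their speaker labels (low entropy suggests the match is dominated by one training speaker). The distance metric and data here are toy stand-ins for a real exemplar database.

```python
# Sketch: speaker entropy over the k nearest exemplars.
import numpy as np
from collections import Counter

def speaker_entropy(query, exemplars, speakers, k=10):
    """Entropy (in bits) of speaker labels among the k nearest exemplars."""
    dists = np.linalg.norm(exemplars - query, axis=1)
    top = np.argsort(dists)[:k]
    counts = Counter(speakers[i] for i in top)
    p = np.array(list(counts.values()), dtype=float) / k
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(1)
db = rng.normal(size=(500, 13))        # exemplar acoustic vectors
spk = rng.integers(0, 20, size=500)    # speaker label per exemplar
print(speaker_entropy(db[0] + 0.01, db, spk, k=10))
```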


IEEE Journal of Selected Topics in Signal Processing | 2010

Speech Recognition With Flat Direct Models

Patrick Nguyen; Georg Heigold; Geoffrey Zweig

This paper describes a novel direct modeling approach for speech recognition. We propose a log-linear modeling framework based on using numerous features which each measure some form of consistency between the underlying speech and an entire sequence of hypothesized words. Since the model relates the entire audio signal to a complete hypothesis without necessarily positing any inherent structure, we term this a flat direct model (FDM). In contrast to a conventional hidden Markov model approach, no Markov assumptions are used, and the model is not necessarily sequential. We demonstrate the use of features based on both template-matching distances, and the acoustic detection of multi-phone units which are selected so as to have maximal mutual information with respect to word labels. Further, we solve the key problem of how to define features which can generalize to unseen word sequences. In the proposed model, template-based features improve sentence error rate by 3% absolute over the baseline, while multi-phone-based features improve by 2% absolute.
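
The multi-phone selection criterion can be sketched as follows: for each candidate unit, compare the entropy of the word distribution before and after observing whether the unit fires, and keep the units with the highest mutual information. The detector outputs and unit names below are illustrative, not the paper's inventory.

```python
# Sketch: ranking multi-phone units by mutual information with word labels.
import math
from collections import Counter

def entropy(counter):
    n = sum(counter.values())
    return -sum(c / n * math.log2(c / n) for c in counter.values())

def unit_mi(examples, unit):
    """I(unit fires; word) over (word, set-of-units) training examples."""
    words = Counter(w for w, _ in examples)
    h_w = entropy(words)
    # Conditional entropy H(word | unit indicator).
    h_cond, n = 0.0, len(examples)
    for present in (True, False):
        sub = Counter(w for w, units in examples if (unit in units) == present)
        if sub:
            h_cond += sum(sub.values()) / n * entropy(sub)
    return h_w - h_cond

examples = [("starbucks", {"st_ar", "b_ah_k_s"}),
            ("star bikes", {"st_ar", "b_ay_k_s"}),
            ("starbucks", {"st_ar", "b_ah_k_s"})]
ranked = sorted({"st_ar", "b_ah_k_s", "b_ay_k_s"}, key=lambda u: -unit_mi(examples, u))
```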


International Conference on Acoustics, Speech, and Signal Processing | 2009

Leveraging multiple query logs to improve language models for spoken query recognition

Xiao Li; Patrick Nguyen; Geoffrey Zweig; Dan Bohus

A voice search system requires a speech interface that can correctly recognize spoken queries uttered by users. The recognition performance relies strongly on a robust language model. In this work, we present the use of multiple data sources, with a focus on query logs, in improving ASR language models for a voice search application. Our contributions are threefold: (1) the use of text queries from web search and mobile search in language modeling; (2) the use of web click data to predict query forms from business listing forms; and (3) the use of voice query logs in creating a positive feedback loop. Experiments show that by leveraging these resources, we can achieve recognition performance comparable to, or even better than, that of a previously deployed system where a large amount of spoken query transcripts was used in language modeling.
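
One way the multiple sources could be combined, sketched under simple assumptions: build per-source n-gram counts (web queries, mobile queries, voice-query transcripts) and linearly interpolate them into a single model. The smoothing and interpolation weights here are simplistic placeholders; in practice the weights would be tuned on held-out data.

```python
# Sketch: interpolating bigram models built from multiple query logs.
from collections import Counter, defaultdict

def bigram_model(queries):
    counts, ctx = defaultdict(Counter), Counter()
    for q in queries:
        toks = ["<s>"] + q.split() + ["</s>"]
        for a, b in zip(toks, toks[1:]):
            counts[a][b] += 1
            ctx[a] += 1
    return lambda a, b: counts[a][b] / ctx[a] if ctx[a] else 0.0

sources = {
    "web":    bigram_model(["coffee near me", "pizza delivery"]),
    "mobile": bigram_model(["coffee shop hours", "coffee near me"]),
}
weights = {"web": 0.4, "mobile": 0.6}   # placeholder interpolation weights

def interp_prob(a, b):
    return sum(weights[s] * sources[s](a, b) for s in sources)

print(interp_prob("coffee", "near"))
```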
