Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Geoffrey Zweig is active.

Publication


Featured research published by Geoffrey Zweig.


Computer Vision and Pattern Recognition | 2015

From captions to visual concepts and back

Hao Fang; Saurabh Gupta; Forrest N. Iandola; Rupesh Kumar Srivastava; Li Deng; Piotr Dollár; Jianfeng Gao; Xiaodong He; Margaret Mitchell; John Platt; C. Lawrence Zitnick; Geoffrey Zweig

This paper presents a novel approach for automatically generating image descriptions: visual detectors, language models, and multimodal similarity models learnt directly from a dataset of image captions. We use multiple instance learning to train visual detectors for words that commonly occur in captions, including many different parts of speech such as nouns, verbs, and adjectives. The word detector outputs serve as conditional inputs to a maximum-entropy language model. The language model learns from a set of over 400,000 image descriptions to capture the statistics of word usage. We capture global semantics by re-ranking caption candidates using sentence-level features and a deep multimodal similarity model. Our system is state-of-the-art on the official Microsoft COCO benchmark, producing a BLEU-4 score of 29.1%. When human judges compare the system captions to ones written by other people on our held-out test set, the system captions have equal or better quality 34% of the time.
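
As a rough illustration of the final re-ranking stage described above, the sketch below combines two hypothetical sentence-level features, a language-model log-probability and a multimodal image-caption similarity score, with hand-picked weights; the paper's actual features, weights, and training procedure are not reproduced here.

```python
# Hypothetical sentence-level features for three candidate captions:
# (caption, language-model log-probability, multimodal similarity).
candidates = [
    ("a dog runs on the beach", -12.3, 0.81),
    ("a beach with a dog",      -10.9, 0.64),
    ("dog beach running sand",  -15.2, 0.77),
]

# Feature weights; in a real system these would be tuned on held-out data.
weights = {"lm": 1.0, "sim": 5.0}

def rerank(cands):
    """Return the candidate with the highest weighted feature combination."""
    scored = [(weights["lm"] * lm + weights["sim"] * sim, text)
              for text, lm, sim in cands]
    return max(scored)

print(rerank(candidates))  # -> (-7.7..., 'a beach with a dog')
```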


International Conference on Acoustics, Speech, and Signal Processing | 2005

fMPE: discriminatively trained features for speech recognition

Daniel Povey; Brian Kingsbury; Lidia Mangu; George Saon; Hagen Soltau; Geoffrey Zweig

MPE (minimum phone error) is a previously introduced technique for discriminative training of HMM parameters. fMPE applies the same objective function to the features, transforming the data with a kernel-like method and training millions of parameters, comparable to the size of the acoustic model. Despite the large number of parameters, fMPE is robust to over-training. The method is to train a matrix projecting from posteriors of Gaussians to a normal size feature space, and then to add the projected features to normal features such as PLP. The matrix is trained from a zero start using a linear method. Sparsity of posteriors ensures speed in both training and test time. The technique gives similar improvements to MPE (around 10% relative). MPE on top of fMPE results in error rates up to 6.5% relative better than MPE alone, or more if multiple layers of transform are trained.
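
The core fMPE transform, adding a learned projection of sparse Gaussian posteriors to the ordinary feature vector, can be sketched in a few lines of numpy. The sizes and the projection matrix below are illustrative stand-ins, not trained values.

```python
import numpy as np

rng = np.random.default_rng(0)
n_gauss, n_feat = 1000, 40  # illustrative sizes only

# Stand-in for the fMPE-trained projection from Gaussian posteriors
# to feature-space offsets (trained from a zero start in the paper).
M = rng.standard_normal((n_feat, n_gauss)) * 0.01

def fmpe_features(plp_frame, gauss_posteriors):
    """Add a learned offset, projected from the (sparse) Gaussian
    posteriors, to the ordinary PLP feature vector."""
    return plp_frame + M @ gauss_posteriors

plp = rng.standard_normal(n_feat)
post = np.zeros(n_gauss)
post[rng.choice(n_gauss, size=5)] = 0.2  # only a few Gaussians fire per frame
print(fmpe_features(plp, post).shape)    # (40,)
```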


International Conference on Acoustics, Speech, and Signal Processing | 2013

Recent advances in deep learning for speech research at Microsoft

Li Deng; Jinyu Li; Jui-Ting Huang; Kaisheng Yao; Dong Yu; Frank Seide; Michael L. Seltzer; Geoffrey Zweig; Xiaodong He; Jason D. Williams; Yifan Gong; Alex Acero

Deep learning is becoming a mainstream technology for speech recognition at industrial scale. In this paper, we provide an overview of the work by Microsoft speech researchers since 2009 in this area, focusing on more recent advances which shed light on the basic capabilities and limitations of the current deep learning technology. We organize this overview along the feature-domain and model-domain dimensions according to the conventional approach to analyzing speech systems. Selected experimental results, including speech recognition and related applications such as spoken dialogue and language modeling, are presented to demonstrate and analyze the strengths and weaknesses of the techniques described in the paper. Potential improvements of these techniques and future research directions are discussed.


Spoken Language Technology Workshop | 2012

Context dependent recurrent neural network language model

Tomas Mikolov; Geoffrey Zweig

Recurrent neural network language models (RNNLMs) have recently demonstrated state-of-the-art performance across a variety of tasks. In this paper, we improve their performance by providing a contextual real-valued input vector in association with each word. This vector is used to convey contextual information about the sentence being modeled. By performing Latent Dirichlet Allocation using a block of preceding text, we achieve a topic-conditioned RNNLM. This approach has the key advantage of avoiding the data fragmentation associated with building multiple topic models on different data subsets. We report perplexity results on the Penn Treebank data, where we achieve a new state-of-the-art. We further apply the model to the Wall Street Journal speech recognition task, where we observe improvements in word-error-rate.
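
A minimal numpy sketch of the forward step may help: an Elman-style RNNLM receives, alongside the previous word and hidden state, a real-valued context vector (here a stand-in for an LDA topic posterior) feeding both the hidden and output layers. The dimensions and weights are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
V, H, K = 5000, 200, 40  # vocab, hidden, topic dimensions (illustrative)

# Randomly initialized parameters standing in for trained weights.
U  = rng.standard_normal((H, V)) * 0.01  # previous word -> hidden
W  = rng.standard_normal((H, H)) * 0.01  # recurrent connection
F  = rng.standard_normal((H, K)) * 0.01  # context vector -> hidden
Vo = rng.standard_normal((V, H)) * 0.01  # hidden -> output
G  = rng.standard_normal((V, K)) * 0.01  # context vector -> output

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def step(word_id, h_prev, context):
    """One RNNLM step with an extra real-valued context input."""
    h = 1.0 / (1.0 + np.exp(-(U[:, word_id] + W @ h_prev + F @ context)))
    return h, softmax(Vo @ h + G @ context)

topic = rng.dirichlet(np.ones(K))        # stand-in for an LDA posterior
h, p_next = step(42, np.zeros(H), topic)
print(p_next.shape)                      # distribution over the next word
```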


International Conference on Acoustics, Speech, and Signal Processing | 2002

The graphical models toolkit: An open source software system for speech and time-series processing

Jeff A. Bilmes; Geoffrey Zweig

This paper describes the Graphical Models Toolkit (GMTK), an open source, publicly available toolkit for developing graphical-model based speech recognition and general time-series systems. Graphical models are a flexible, concise, and expressive probabilistic modeling framework with which one may rapidly specify a vast collection of statistical models. This paper begins with a brief description of the representational and computational aspects of the framework. Following that is a detailed description of GMTK's features, including a language for specifying structures and probability distributions, logarithmic-space exact training and decoding procedures, the concept of switching parents, and a generalized EM training method which allows arbitrary sub-Gaussian parameter tying. Taken together, these features endow GMTK with a degree of expressiveness and functionality that significantly complements other publicly available packages. GMTK was recently used in the 2001 Johns Hopkins Summer Workshop, and experimental results are described in detail both herein and in a companion paper.
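
GMTK itself is driven by its own structure-specification language, so the following is only a toy numpy illustration of the kind of exact inference such a toolkit performs: the forward pass of a three-state HMM, the simplest dynamic graphical model used in speech.

```python
import numpy as np

# Transition matrix and initial distribution of a left-to-right HMM.
A = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.8, 0.2],
              [0.0, 0.0, 1.0]])
pi = np.array([1.0, 0.0, 0.0])

def forward(obs_lik):
    """Exact inference: obs_lik[t, s] = p(obs_t | state s);
    returns the total likelihood of the observation sequence."""
    alpha = pi * obs_lik[0]
    for t in range(1, len(obs_lik)):
        alpha = (alpha @ A) * obs_lik[t]
    return alpha.sum()

obs_lik = np.array([[0.9, 0.1, 0.1],
                    [0.2, 0.7, 0.1],
                    [0.1, 0.2, 0.9]])
print(forward(obs_lik))
```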


IEEE Transactions on Audio, Speech, and Language Processing | 2006

Advances in speech transcription at IBM under the DARPA EARS program

Stanley F. Chen; Brian Kingsbury; Lidia Mangu; Daniel Povey; George Saon; Hagen Soltau; Geoffrey Zweig

This paper describes the technical and system building advances made in IBM's speech recognition technology over the course of the Defense Advanced Research Projects Agency (DARPA) Effective Affordable Reusable Speech-to-Text (EARS) program. At a technical level, these advances include the development of a new form of feature-based minimum phone error training (fMPE), the use of large-scale discriminatively trained full-covariance Gaussian models, the use of septaphone acoustic context in static decoding graphs, and improvements in basic decoding algorithms. At a system building level, the advances include a system architecture based on cross-adaptation and the incorporation of 2100 h of training data in every system component. We present results on English conversational telephony test data from the 2003 and 2004 NIST evaluations. The combination of technical advances and an order of magnitude more training data in 2004 reduced the error rate on the 2003 test set by approximately 21% relative (from 20.4% to 16.1%) over the most accurate system in the 2003 evaluation, and produced the most accurate results on the 2004 test sets in every speed category.


IEEE Automatic Speech Recognition and Understanding Workshop | 2009

A segmental CRF approach to large vocabulary continuous speech recognition

Geoffrey Zweig; Patrick Nguyen

This paper proposes a segmental conditional random field framework for large vocabulary continuous speech recognition. Fundamental to this approach is the use of acoustic detectors as the basic input, and the automatic construction of a versatile set of segment-level features. The detector streams operate at multiple time scales (frame, phone, multi-phone, syllable or word) and are combined at the word level in the CRF training and decoding processes. A key aspect of our approach is that features are defined at the word level, and are naturally geared to explain long span phenomena such as formant trajectories, duration, and syllable stress patterns. Generalization to unseen words is possible through the use of decomposable consistency features [1], [2], and our framework allows for the joint or separate discriminative training of the acoustic and language models. An initial evaluation of this framework with voice search data from the Bing Mobile (BM) application results in a 2% absolute improvement over an HMM baseline.
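
To make the segment-level search concrete, here is a hypothetical dynamic-programming sketch in the spirit of segmental decoding: candidate (start, end, word) segments are scored as wholes rather than frame by frame. The word list and feature function are invented for illustration and bear no relation to the paper's detector-based features.

```python
# Invented word list and segment feature function, for illustration only.
WORDS = ["hello", "world"]

def segment_score(start, end, word):
    """Score one (start, end, word) segment as a whole; a real system
    would combine many detector-based, word-level features here."""
    length = end - start
    return -abs(length - 3) + (0.5 if word == "hello" and start == 0 else 0.0)

def decode(n_frames, max_seg_len=5):
    """Dynamic program over segmentations: best[t] holds the best
    (score, segments) covering frames 0..t."""
    best = {0: (0.0, [])}
    for end in range(1, n_frames + 1):
        for start in range(max(0, end - max_seg_len), end):
            if start not in best:
                continue
            for word in WORDS:
                score = best[start][0] + segment_score(start, end, word)
                if end not in best or score > best[end][0]:
                    best[end] = (score, best[start][1] + [(start, end, word)])
    return best[n_frames]

print(decode(6))  # best segmentation of a 6-frame utterance
```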


International Conference on Acoustics, Speech, and Signal Processing | 2005

The IBM 2004 conversational telephony system for rich transcription

Hagen Soltau; Brian Kingsbury; Lidia Mangu; Daniel Povey; George Saon; Geoffrey Zweig

This paper describes the technical advances in IBM's conversational telephony submission to the DARPA-sponsored 2004 rich transcription evaluation (RT-04). These advances include a system architecture based on cross-adaptation; a new form of feature-based MPE training; the use of a full-scale discriminatively trained full-covariance Gaussian system; the use of septaphone cross-word acoustic context in static decoding graphs; and the incorporation of 2100 hours of training data in every system component. These advances reduced the error rate on the 2003 test set by approximately 21% relative over the best-performing system in last year's evaluation, and produced the best results on the RT-04 current and progress CTS data.


International Joint Conference on Natural Language Processing | 2015

Language Models for Image Captioning: The Quirks and What Works

Jacob Devlin; Hao Cheng; Hao Fang; Saurabh Gupta; Li Deng; Xiaodong He; Geoffrey Zweig; Margaret Mitchell

Two recent approaches have achieved state-of-the-art results in image captioning. The first uses a pipelined process where a set of candidate words is generated by a convolutional neural network (CNN) trained on images, and then a maximum entropy (ME) language model is used to arrange these words into a coherent sentence. The second uses the penultimate activation layer of the CNN as input to a recurrent neural network (RNN) that then generates the caption sequence. In this paper, we compare the merits of these different language modeling approaches for the first time by using the same state-of-the-art CNN as input. We examine issues in the different approaches, including linguistic irregularities, caption repetition, and dataset overlap. By combining key aspects of the ME and RNN methods, we achieve a new record performance over previously published results on the benchmark COCO dataset. However, the gains we see in BLEU do not translate to human judgments.
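
The second pipeline is easy to sketch: the CNN's penultimate activations enter the RNN only through its initial hidden state, after which words are generated greedily. All dimensions and weights below are random placeholders, so the output is meaningless; the point is the data flow.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, V = 4096, 256, 1000  # CNN feature, hidden, vocab sizes (illustrative)

Wi = rng.standard_normal((H, D)) * 0.01  # CNN features -> initial hidden state
Wh = rng.standard_normal((H, H)) * 0.01  # recurrent weights
We = rng.standard_normal((H, V)) * 0.01  # word embeddings
Wo = rng.standard_normal((V, H)) * 0.01  # hidden state -> vocabulary logits

def generate(cnn_feat, max_len=10, bos=0, eos=1):
    """Greedy decoding; the image enters only through the initial state."""
    h, word, caption = np.tanh(Wi @ cnn_feat), bos, []
    for _ in range(max_len):
        h = np.tanh(Wh @ h + We[:, word])
        word = int(np.argmax(Wo @ h))
        if word == eos:
            break
        caption.append(word)
    return caption

print(generate(rng.standard_normal(D)))  # word ids; meaningless here
```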


Spoken Language Technology Workshop | 2014

Spoken language understanding using long short-term memory neural networks

Kaisheng Yao; Baolin Peng; Yu Zhang; Dong Yu; Geoffrey Zweig; Yangyang Shi

Neural network based approaches have recently produced record-setting performances in natural language understanding tasks such as word labeling. In the word labeling task, a tagger is used to assign a label to each word in an input sequence. Specifically, simple recurrent neural networks (RNNs) and convolutional neural networks (CNNs) have been shown to significantly outperform the previous state of the art, conditional random fields (CRFs). This paper investigates using long short-term memory (LSTM) neural networks, which contain input, output, and forget gates and are more advanced than simple RNNs, for the word labeling task. To explicitly model output-label dependence, we propose a regression model on top of the LSTM's un-normalized scores. We also propose to apply deep LSTMs to the task. We investigate the relative importance of each gate in the LSTM by setting the other gates to a constant and learning only particular gates. Experiments on the ATIS dataset validate the effectiveness of the proposed models.
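
A compact numpy sketch of an LSTM tagger shows the pieces the abstract names: the three gates inside the cell and a per-word projection from the hidden state to un-normalized label scores (the proposed regression layer and deep stacking are omitted). Sizes and weights are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X, H, L = 50, 64, 10  # embedding, hidden, label-set sizes (illustrative)

W  = rng.standard_normal((4 * H, X + H)) * 0.1  # all four gate blocks at once
Wy = rng.standard_normal((L, H)) * 0.1          # hidden -> label scores

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_cell(x, h, c):
    """One step with input (i), forget (f), and output (o) gates."""
    i, f, o, g = np.split(W @ np.concatenate([x, h]), 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    return sigmoid(o) * np.tanh(c), c

def tag(sequence):
    """Assign one label per word: argmax over un-normalized scores."""
    h, c, labels = np.zeros(H), np.zeros(H), []
    for x in sequence:
        h, c = lstm_cell(x, h, c)
        labels.append(int(np.argmax(Wy @ h)))
    return labels

print(tag(rng.standard_normal((5, X))))  # one label id per input word
```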

Collaboration


Dive into Geoffrey Zweig's collaborations.

Top Co-Authors


Daniel Povey

Johns Hopkins University
