Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Guoguo Chen is active.

Publication


Featured research published by Guoguo Chen.


international conference on acoustics, speech, and signal processing | 2015

Librispeech: An ASR corpus based on public domain audio books

Vassil Panayotov; Guoguo Chen; Daniel Povey; Sanjeev Khudanpur

This paper introduces a new corpus of read English speech, suitable for training and evaluating speech recognition systems. The LibriSpeech corpus is derived from audiobooks that are part of the LibriVox project, and contains 1000 hours of speech sampled at 16 kHz. We have made the corpus freely available for download, along with separately prepared language-model training data and pre-built language models. We show that acoustic models trained on LibriSpeech give lower error rates on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself. We are also releasing Kaldi scripts that make it easy to build these systems.
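
For readers who want to work with the corpus directly, the short Python sketch below pairs LibriSpeech audio files with their transcripts. It assumes only the standard directory layout of the released archives; the root path in the usage example is a placeholder.

```python
# Minimal sketch: walk a LibriSpeech subset and yield (audio, transcript) pairs.
# Assumes the standard layout after extraction:
#   LibriSpeech/<subset>/<speaker>/<chapter>/<speaker>-<chapter>.trans.txt
#   LibriSpeech/<subset>/<speaker>/<chapter>/<speaker>-<chapter>-<utt>.flac
from pathlib import Path

def librispeech_utterances(root, subset="train-clean-100"):
    """Yield (flac_path, transcript_text) pairs for one LibriSpeech subset."""
    for trans_file in sorted(Path(root, subset).rglob("*.trans.txt")):
        for line in trans_file.read_text().splitlines():
            utt_id, text = line.split(" ", 1)
            yield trans_file.with_name(utt_id + ".flac"), text

# Example (placeholder path): count utterances in the 100-hour clean subset.
# n_utts = sum(1 for _ in librispeech_utterances("/data/LibriSpeech"))
```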


international conference on acoustics, speech, and signal processing | 2014

Small-footprint keyword spotting using deep neural networks

Guoguo Chen; Carolina Parada; Georg Heigold

Our application requires a keyword spotting system with a small memory footprint, low computational cost, and high precision. To meet these requirements, we propose a simple approach based on deep neural networks. A deep neural network is trained to directly predict the keyword(s) or subword units of the keyword(s) followed by a posterior handling method producing a final confidence score. Keyword recognition results achieve 45% relative improvement with respect to a competitive Hidden Markov Model-based system, while performance in the presence of babble noise shows 39% relative improvement.
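
The posterior-handling step can be illustrated in a few lines of numpy: per-frame posteriors from the DNN are smoothed over a window, and a keyword confidence is formed from each keyword unit's best smoothed posterior. The window length and the geometric-mean combination below are illustrative assumptions rather than the paper's exact settings.

```python
# Illustrative posterior handling for DNN-based keyword spotting (assumed
# smoothing window and geometric-mean scoring; not the paper's exact recipe).
import numpy as np

def keyword_confidence(posteriors, keyword_units, smooth_win=30):
    """posteriors: (frames, units) DNN outputs; keyword_units: unit indices."""
    # Moving-average smoothing of each unit's posterior trajectory over time.
    kernel = np.ones(smooth_win) / smooth_win
    smoothed = np.apply_along_axis(
        lambda p: np.convolve(p, kernel, mode="same"), 0, posteriors)
    # Confidence: geometric mean of the best smoothed posterior per keyword unit.
    best = [smoothed[:, u].max() for u in keyword_units]
    return float(np.prod(best) ** (1.0 / len(best)))
```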


international conference on acoustics, speech, and signal processing | 2016

Highway long short-term memory RNNs for distant speech recognition

Yu Zhang; Guoguo Chen; Dong Yu; Kaisheng Yao; Sanjeev Khudanpur; James R. Glass

In this paper, we extend the deep long short-term memory (DLSTM) recurrent neural networks by introducing gated direct connections between memory cells in adjacent layers. These direct links, called highway connections, enable unimpeded information flow across different layers and thus alleviate the gradient vanishing problem when building deeper LSTMs. We further introduce the latency-controlled bidirectional LSTMs (BLSTMs) which can exploit the whole history while keeping the latency under control. Efficient algorithms are proposed to train these novel networks using both frame and sequence discriminative criteria. Experiments on the AMI distant speech recognition (DSR) task indicate that we can train deeper LSTMs and achieve better improvement from sequence training with highway LSTMs (HLSTMs). Our novel model obtains 43.9/47.7% WER on AMI (SDM) dev and eval sets, outperforming all previous works. It beats the strong DNN and DLSTM baselines with 15.7% and 5.3% relative improvement respectively.
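
The numpy fragment below sketches the highway idea for a single timestep: the cell state of layer l receives a gated copy of the lower layer's cell state in addition to the standard LSTM update. It is a simplified illustration (the paper's gate can also depend on the cell states themselves), not the authors' implementation.

```python
# Simplified highway LSTM step (single timestep, single layer), numpy only.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_lstm_step(x, h_prev, c_prev, c_lower, W, b):
    """W maps [x; h_prev] to 5 equally sized blocks: input, forget, output,
    candidate, and the extra highway (carry) gate; b is the matching bias."""
    z = W @ np.concatenate([x, h_prev]) + b
    i, f, o, g, d = np.split(z, 5)
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g) \
        + sigmoid(d) * c_lower      # highway term: gated cell state from layer l-1
    h = sigmoid(o) * np.tanh(c)
    return h, c
```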


ieee automatic speech recognition and understanding workshop | 2013

Using proxies for OOV keywords in the keyword search task

Guoguo Chen; Oguz Yilmaz; Jan Trmal; Daniel Povey; Sanjeev Khudanpur

We propose a simple but effective weighted finite state transducer (WFST) based framework for handling out-of-vocabulary (OOV) keywords in a speech search task. State-of-the-art large vocabulary continuous speech recognition (LVCSR) and keyword search (KWS) systems are developed for conversational telephone speech in Tagalog. Word-based and phone-based indexes are created from word lattices, the latter by using the LVCSR system's pronunciation lexicon. Pronunciations of OOV keywords are hypothesized via a standard grapheme-to-phoneme method. In-vocabulary proxies (word or phone sequences) are generated for each OOV keyword using WFST techniques that permit incorporation of a phone confusion matrix. Empirical results when searching for the Babel/NIST evaluation keywords in the Babel 10-hour development-test speech collection show that (i) searching for word proxies in the word index significantly outperforms searching for phonetic representations of OOV words in a phone index, and (ii) while phone confusion information yields minor improvement when searching a phone index, it yields up to 40% improvement in actual term weighted value when searching a word index with word proxies.
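
As a rough illustration of the proxy idea, the pure-Python sketch below substitutes a simple phone edit-distance search for the paper's WFST composition with a confusion transducer: an OOV keyword's G2P-hypothesized pronunciation is matched against an in-vocabulary pronunciation lexicon, and nearby words become its proxies. The lexicon and pronunciation are assumed inputs.

```python
# Illustrative stand-in for WFST-based proxy generation: find in-vocabulary
# words whose pronunciations are within a small phone edit distance of the
# G2P-hypothesized pronunciation of an OOV keyword.

def phone_edit_distance(a, b):
    """Levenshtein distance between two phone sequences (tuples of phone strings)."""
    d = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, pb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (pa != pb))
    return d[-1]

def proxy_words(oov_pron, lexicon, max_dist=1):
    """lexicon: dict mapping in-vocabulary words to phone tuples."""
    return [w for w, pron in lexicon.items()
            if phone_edit_distance(oov_pron, pron) <= max_dist]
```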


international conference on acoustics, speech, and signal processing | 2016

Deep beamforming networks for multi-channel speech recognition

Xiong Xiao; Shinji Watanabe; Hakan Erdogan; Liang Lu; John R. Hershey; Michael L. Seltzer; Guoguo Chen; Yu Zhang; Michael I. Mandel; Dong Yu

Despite the significant progress in speech recognition enabled by deep neural networks, poor performance persists in some scenarios. In this work, we focus on far-field speech recognition which remains challenging due to high levels of noise and reverberation in the captured speech signals. We propose to represent the stages of acoustic processing including beamforming, feature extraction, and acoustic modeling, as three components of a single unified computational network. The parameters of a frequency-domain beamformer are first estimated by a network based on features derived from the microphone channels. These filter coefficients are then applied to the array signals to form an enhanced signal. Conventional features are then extracted from this signal and passed to a second network that performs acoustic modeling for classification. The parameters of both the beamforming and acoustic modeling networks are trained jointly using back-propagation with a common cross-entropy objective function. In experiments on the AMI meeting corpus, we observed improvements by pre-training each sub-network with a network-specific objective function before joint training of both networks. The proposed method obtained a 3.2% absolute word error rate reduction compared to a conventional pipeline of independent processing stages.
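
The frequency-domain filter-and-sum operation at the heart of the first stage can be written in one numpy line; in the actual system the complex per-channel weights are predicted by the first sub-network, whereas in this sketch they are simply passed in.

```python
# Frequency-domain filter-and-sum beamforming (weights assumed given here;
# in the paper they are estimated by the beamforming sub-network).
import numpy as np

def filter_and_sum(stft_channels, weights):
    """stft_channels: (channels, frames, bins) complex STFTs of the array signals.
    weights: (channels, bins) complex filter per channel and frequency bin.
    Returns the enhanced single-channel STFT of shape (frames, bins)."""
    return np.einsum("cf,ctf->tf", weights, stft_channels)
```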


international conference on acoustics, speech, and signal processing | 2015

Query-by-example keyword spotting using long short-term memory networks

Guoguo Chen; Carolina Parada; Tara N. Sainath

We present a novel approach to query-by-example keyword spotting (KWS) using a long short-term memory (LSTM) recurrent neural network-based feature extractor. In our approach, we represent each keyword using a fixed-length feature vector obtained by running the keyword audio through a word-based LSTM acoustic model. We use the activations prior to the softmax layer of the LSTM as our keyword-vector. At runtime, we detect the keyword by extracting the same feature vector from a sliding window and computing a simple similarity score between this test vector and the keyword vector. With clean speech, we achieve 86% relative false rejection rate reduction at 0.5% false alarm rate when compared to a competitive phoneme posteriorgram with dynamic time warping KWS system, while the reduction in the presence of babble noise is 67%. Our system has a small memory footprint, low computational cost, and high precision, making it suitable for on-device applications.
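
The runtime matching step amounts to a cosine similarity between the enrolled keyword vector and vectors extracted from sliding windows of the test audio, as in the small numpy sketch below; the LSTM feature extractor itself is not shown, and the detection threshold is an arbitrary placeholder.

```python
# Illustrative query-by-example matching: cosine similarity between the keyword
# vector and per-window vectors (all vectors assumed to come from the same
# LSTM feature extractor, which is not shown here).
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-10))

def detect_keyword(keyword_vec, window_vecs, threshold=0.8):
    """window_vecs: (num_windows, dim) array; returns (best_score, index, fired)."""
    scores = [cosine(keyword_vec, w) for w in window_vecs]
    best = int(np.argmax(scores))
    return scores[best], best, scores[best] >= threshold
```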


international conference on acoustics, speech, and signal processing | 2013

Quantifying the value of pronunciation lexicons for keyword search in low resource languages

Guoguo Chen; Sanjeev Khudanpur; Daniel Povey; Jan Trmal; David Yarowsky; Oguz Yilmaz

This paper quantifies the value of pronunciation lexicons in large vocabulary continuous speech recognition (LVCSR) systems that support keyword search (KWS) in low resource languages. State-of-the-art LVCSR and KWS systems are developed for conversational telephone speech in Tagalog, and the baseline lexicon is augmented via three different grapheme-to-phoneme models that yield increasing coverage of a large Tagalog word-list. It is demonstrated that while the increased lexical coverage - or reduced out-of-vocabulary (OOV) rate - leads to only modest (ca 1%-4%) improvements in word error rate, the concomitant improvements in actual term weighted value are as much as 60%. It is also shown that incorporating the augmented lexicons into the LVCSR system before indexing speech is superior to using them post facto, e.g., for approximate phonetic matching of OOV keywords in pre-indexed lattices. These results underscore the disproportionate importance of automatic lexicon augmentation for KWS in morphologically rich languages, and advocate for using them early in the LVCSR stage.
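
The lexical-coverage measurement behind the reported OOV-rate reductions is simple enough to state in code; the sketch below computes the fraction of word tokens (for example, the evaluation keywords) that a given lexicon does not cover. The word lists and lexicons are assumed inputs.

```python
# Trivial OOV-rate computation: fraction of tokens missing from a lexicon.
def oov_rate(word_tokens, lexicon):
    """word_tokens: iterable of words; lexicon: set (or dict) of covered words."""
    tokens = list(word_tokens)
    missing = sum(1 for w in tokens if w not in lexicon)
    return missing / max(len(tokens), 1)

# Example (hypothetical inputs): compare baseline vs. G2P-augmented coverage.
# print(oov_rate(keywords, baseline_lexicon), oov_rate(keywords, augmented_lexicon))
```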


spoken language technology workshop | 2014

A keyword search system using open source software

Jan Trmal; Guoguo Chen; Daniel Povey; Sanjeev Khudanpur; Pegah Ghahremani; Xiaohui Zhang; Vimal Manohar; Chunxi Liu; Aren Jansen; Dietrich Klakow; David Yarowsky; Florian Metze

This paper provides an overview of a speech-to-text (STT) and keyword search (KWS) system architecture built primarily on top of the Kaldi toolkit and expands on a few highlights. The system was developed as part of the research efforts of the Radical team while participating in the IARPA Babel program. Our aim was to develop a general system pipeline which could be easily and rapidly deployed in any language, independently of the language script and of the phonological and linguistic features of the language.


ieee automatic speech recognition and understanding workshop | 2015

JHU ASpIRE system: Robust LVCSR with TDNNs, iVector adaptation and RNN-LMs

Vijayaditya Peddinti; Guoguo Chen; Vimal Manohar; Tom Ko; Daniel Povey; Sanjeev Khudanpur

Multi-style training, using data which emulates a variety of possible test scenarios, is a popular approach towards robust acoustic modeling. However, acoustic models capable of exploiting large amounts of training data in a comparatively short amount of training time are essential. In this paper we tackle the problem of reverberant speech recognition using 5500 hours of simulated reverberant data. We use a time-delay neural network (TDNN) architecture, which is capable of tackling long-term interactions between speech and corrupting sources in reverberant environments. By sub-sampling the outputs at TDNN layers across time steps, training time is substantially reduced. Combining this with distributed optimization we show that the TDNN can be trained in 3 days using up to 32 GPUs. Further, iVectors are used as an input to the neural network to perform instantaneous speaker and environment adaptation. Finally, recurrent neural network language models are applied to the lattices to further improve the performance. Our system is shown to provide state-of-the-art results in the IARPA ASpIRE challenge, with 26.5% WER on the dev test set.
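
The reverberant-data simulation step can be sketched in a few lines of numpy/scipy: clean speech is convolved with a room impulse response and optionally mixed with noise at a target SNR. The impulse responses and noise signals are assumed to be available; this is an illustration of the general technique, not the recipe used to build the 5500-hour set.

```python
# Illustrative reverberant data simulation: convolve clean speech with a room
# impulse response (RIR) and optionally add noise at a target SNR.
import numpy as np
from scipy.signal import fftconvolve

def reverberate(clean, rir, noise=None, snr_db=20.0):
    """clean, rir, noise: 1-D float waveforms at the same sample rate.
    Assumes `noise`, if given, is at least as long as `clean`."""
    reverb = fftconvolve(clean, rir)[: len(clean)]
    if noise is not None:
        noise = noise[: len(reverb)]
        # Scale noise so that 10*log10(P_speech / P_noise) == snr_db.
        scale = np.sqrt(np.sum(reverb ** 2)
                        / (np.sum(noise ** 2) * 10.0 ** (snr_db / 10.0)))
        reverb = reverb + scale * noise
    return reverb
```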


international conference on acoustics, speech, and signal processing | 2016

Acoustic data-driven pronunciation lexicon generation for logographic languages

Guoguo Chen; Daniel Povey; Sanjeev Khudanpur

Handcrafted pronunciation lexicons are widely used in modern speech recognition systems. Designing a pronunciation lexicon, however, requires a tremendous amount of expert knowledge and effort, which is not practical when applying speech recognition techniques to low resource languages. In this paper, we are interested in developing speech recognition systems for logographic languages with only a small expert pronunciation lexicon. An iterative framework is proposed to generate and refine the phonetic transcripts of the training data, which will then be aligned to their word-level transcripts for grapheme-to-phoneme (G2P) model training. The G2P model trained this way covers graphemes that appear in the training transcripts (most of which are usually unseen in a small expert lexicon for logographic languages), and is therefore able to generate pronunciations for all the words in the transcripts. The proposed lexicon generation procedure is evaluated on Cantonese speech recognition and keyword search tasks. Experiments show that starting from an expert lexicon of only 1K words, we are able to generate a lexicon that works reasonably well when compared with an expert-crafted lexicon of 5K words.

Collaboration


Dive into Guoguo Chen's collaborations.

Top Co-Authors

Daniel Povey
Johns Hopkins University

Yu Zhang
Massachusetts Institute of Technology

Jan Trmal
University of West Bohemia

Aren Jansen
Johns Hopkins University

Chunxi Liu
Johns Hopkins University