Publication


Featured research published by Yaodong Zhang.


IEEE Automatic Speech Recognition and Understanding Workshop | 2009

Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams

Yaodong Zhang; James R. Glass

In this paper, we present an unsupervised learning framework for detecting spoken keywords. Without any transcription information, a Gaussian mixture model is trained to label speech frames with a Gaussian posteriorgram. Given one or more spoken examples of a keyword, we use segmental dynamic time warping to compare the Gaussian posteriorgrams of the keyword samples and the test utterances. The keyword detection result is then obtained by ranking the distortion scores of all the test utterances. We use the TIMIT corpus as a development set to tune the system's parameters, and the MIT Lecture corpus for a more substantial evaluation. The results demonstrate the viability and effectiveness of our unsupervised learning framework on the keyword spotting task.
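The two core steps of this abstract, labeling frames with GMM posteriors and comparing posteriorgrams under an inner-product local distance via dynamic time warping, can be sketched roughly as follows. This is a minimal NumPy sketch under stated assumptions: function names are illustrative, and a plain DTW stands in for the paper's segmental variant.

```python
import numpy as np

def gaussian_posteriorgram(frames, means, variances, weights):
    """Label each frame with posterior probabilities over the components
    of a diagonal-covariance GMM.  frames: (T, D); means, variances: (M, D);
    weights: (M,).  Returns a (T, M) posteriorgram whose rows sum to 1."""
    diff = frames[:, None, :] - means[None, :, :]            # (T, M, D)
    log_det = np.sum(np.log(variances), axis=1)              # (M,)
    mahal = np.sum(diff ** 2 / variances[None], axis=2)      # (T, M)
    d = means.shape[1]
    log_lik = -0.5 * (mahal + log_det + d * np.log(2 * np.pi))
    log_post = np.log(weights)[None, :] + log_lik
    log_post -= log_post.max(axis=1, keepdims=True)          # stable softmax
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

def dtw_distance(P, Q):
    """DTW alignment cost between two posteriorgrams using the
    inner-product local distance d(p, q) = -log(p . q)."""
    T, U = len(P), len(Q)
    d = -np.log(np.clip(P @ Q.T, 1e-12, None))               # (T, U) local costs
    acc = np.full((T + 1, U + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, U + 1):
            acc[i, j] = d[i - 1, j - 1] + min(acc[i - 1, j],
                                              acc[i, j - 1],
                                              acc[i - 1, j - 1])
    return acc[T, U]
```

Ranking test utterances by `dtw_distance` against the keyword example's posteriorgram then yields the detection result described above.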


International Conference on Acoustics, Speech, and Signal Processing | 2014

Speech feature denoising and dereverberation via deep autoencoders for noisy reverberant speech recognition

Xue Feng; Yaodong Zhang; James R. Glass

Denoising autoencoders (DAs) have shown success in generating robust features for images, but there has been limited work applying DAs to speech. In this paper we present a deep denoising autoencoder (DDA) framework that can produce robust speech features for noisy reverberant speech recognition. The DDA is first pre-trained as restricted Boltzmann machines (RBMs) in an unsupervised fashion. It is then unrolled into a deep autoencoder and fine-tuned against the corresponding clean speech features to learn a nonlinear mapping from noisy to clean features. Acoustic models are retrained on the reconstructed features from the DDA, and speech recognition is performed. The proposed approach is evaluated on the CHiME-WSJ0 corpus and yields a 16-25% absolute improvement in recognition accuracy across various SNRs.
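The noisy-to-clean fine-tuning step can be illustrated with a single-hidden-layer denoising autoencoder trained by plain gradient descent. This is a sketch under stated assumptions: the paper stacks several RBM-pretrained layers, which this toy version omits, and all layer sizes and learning rates here are illustrative.

```python
import numpy as np

def train_denoising_autoencoder(noisy, clean, hidden=16, lr=0.05, epochs=1000, seed=0):
    """One-hidden-layer denoising autoencoder: learns a nonlinear map from
    noisy feature vectors to clean ones by minimizing mean squared error.
    Returns a function that denoises new feature matrices."""
    rng = np.random.default_rng(seed)
    d = noisy.shape[1]
    W1 = rng.normal(scale=0.1, size=(d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(scale=0.1, size=(hidden, d)); b2 = np.zeros(d)
    n = len(noisy)
    for _ in range(epochs):
        h = np.tanh(noisy @ W1 + b1)          # nonlinear encoder
        out = h @ W2 + b2                     # linear decoder
        err = out - clean                     # gradient of 0.5 * ||out - clean||^2
        gW2 = h.T @ err / n; gb2 = err.mean(axis=0)
        dh = (err @ W2.T) * (1 - h ** 2)      # backprop through tanh
        gW1 = noisy.T @ dh / n; gb1 = dh.mean(axis=0)
        W1 -= lr * gW1; b1 -= lr * gb1
        W2 -= lr * gW2; b2 -= lr * gb2
    return lambda x: np.tanh(x @ W1 + b1) @ W2 + b2
```

After training, running the returned function over noisy features should reduce the squared error against the clean reference, which is the property the retrained acoustic models rely on.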


International Conference on Acoustics, Speech, and Signal Processing | 2010

Towards multi-speaker unsupervised speech pattern discovery

Yaodong Zhang; James R. Glass

In this paper, we explore the use of a Gaussian posteriorgram based representation for unsupervised discovery of speech patterns. Compared with our previous work, the new approach significantly improves speaker independence. The framework consists of three main procedures: a Gaussian posteriorgram generation procedure, which learns an unsupervised Gaussian mixture model and labels each speech frame with a Gaussian posteriorgram representation; a segmental dynamic time warping procedure, which locates pairs of similar sequences of Gaussian posteriorgram vectors; and a graph clustering procedure, which groups similar sequences into clusters. We demonstrate the viability of using the posteriorgram approach to handle many talkers by finding clusters of words in the TIMIT corpus.
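The final grouping step can be sketched with connected components over the match graph, a simplified stand-in for the paper's graph clustering procedure. Segment identifiers and the pair format are illustrative assumptions.

```python
def cluster_matched_pairs(pairs):
    """Group matched segment pairs into clusters via union-find connected
    components.  `pairs` is an iterable of (segment_a, segment_b) matches
    found by segmental DTW; segments are any hashable ids."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:            # path halving
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in pairs:
        union(a, b)
    clusters = {}
    for node in parent:
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())
```

Transitively linked matches thus fall into one cluster, which is how repeated occurrences of the same word across many utterances end up grouped together.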


International Conference on Acoustics, Speech, and Signal Processing | 2012

Resource configurable spoken query detection using Deep Boltzmann Machines

Yaodong Zhang; Ruslan Salakhutdinov; Hung-An Chang; James R. Glass

In this paper we present a spoken query detection method based on posteriorgrams generated from Deep Boltzmann Machines (DBMs). The proposed method can be deployed in both semi-supervised and unsupervised training scenarios. The DBM-based posteriorgrams were evaluated on a series of keyword spotting tasks using the TIMIT speech corpus. In unsupervised training conditions, the DBM approach improved upon our previous best unsupervised keyword detection performance, obtained with Gaussian mixture model-based posteriorgrams, by over 10%. When limited amounts of labeled data were incorporated into training, the DBM approach required less than one third of the annotated data to achieve performance comparable to a system trained on all of the annotated data.


International Conference on Acoustics, Speech, and Signal Processing | 2012

Fast spoken query detection using lower-bound Dynamic Time Warping on Graphical Processing Units

Yaodong Zhang; Kiarash Adl; James R. Glass

In this paper we present a fast unsupervised spoken term detection system based on lower-bound Dynamic Time Warping (DTW) search on Graphical Processing Units (GPUs). The lower-bound estimate and the K nearest neighbor DTW search are carefully designed to fit the GPU parallel computing architecture. In a spoken term detection task on the TIMIT corpus, a 55x speed-up is achieved compared to our previous CPU implementation, without affecting detection performance. On large, artificially created corpora, measurements show that the total computation time of the entire spoken term detection system grows linearly with corpus size. On average, searching for a keyword on a single desktop computer with modern GPUs requires 2.4 seconds per corpus hour.


International Conference on Acoustics, Speech, and Signal Processing | 2011

An inner-product lower-bound estimate for dynamic time warping

Yaodong Zhang; James R. Glass

In this paper, we present a lower-bound estimate for dynamic time warping (DTW) on time series consisting of multi-dimensional posterior probability vectors known as posteriorgrams. We develop a lower-bound estimate based on the inner-product distance, which has been found to be an effective metric for computing similarities between posteriorgrams. In addition to deriving the lower-bound estimate, we show how it can be efficiently used in an admissible K nearest neighbor (KNN) search for spotting matching sequences. We quantify the computational savings achieved by performing a set of unsupervised spoken keyword spotting experiments using Gaussian mixture model posteriorgrams. In these experiments, the proposed lower-bound estimate eliminates 89% of the previously required DTW calculations without affecting overall keyword detection performance.


International Conference on Acoustics, Speech, and Signal Processing | 2009

Speech rhythm guided syllable nuclei detection

Yaodong Zhang; James R. Glass

In this paper, we present a novel speech-rhythm-guided syllable-nuclei location detection algorithm. As a departure from conventional methods, we introduce an instantaneous speech rhythm estimator to predict the regions where syllable nuclei can appear. Within each candidate region, a simple slope-based peak counting algorithm locates each syllable nucleus exactly. We verify the correctness of our method by investigating the distribution of syllable-nucleus intervals in the TIMIT dataset, and evaluate its performance by comparison with a state-of-the-art syllable-nuclei-based speech rate detection approach.
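The peak counting step can be sketched as a slope-sign test on an energy-like envelope within a candidate region. This is a simplified illustration under stated assumptions; the rhythm estimator that proposes the regions, and the exact envelope the paper uses, are omitted.

```python
def count_peaks(envelope, region):
    """Slope-based peak counting inside a candidate region (start, end):
    a frame is a syllable-nucleus candidate when the envelope's slope
    turns from positive to non-positive.  Returns the peak frame indices."""
    start, end = region
    peaks = []
    for t in range(max(start, 1), min(end, len(envelope) - 1)):
        rising = envelope[t] - envelope[t - 1] > 0
        falling = envelope[t + 1] - envelope[t] <= 0
        if rising and falling:
            peaks.append(t)
    return peaks
```

Each detected peak index is then taken as the location of one syllable nucleus within the region.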


International Conference on Acoustics, Speech, and Signal Processing | 2011

A novel decision function and the associated decision-feedback learning for speech translation

Yaodong Zhang; Li Deng; Xiaodong He; Alex Acero

In this paper we report our recent development of an end-to-end integrative design methodology for speech translation. Specifically, a novel decision function is proposed based on Bayesian analysis, and the associated discriminative learning technique is presented based on the decision-feedback principle. The decision function in our end-to-end design methodology integrates acoustic scores, language model scores, and translation scores to refine the translation hypotheses and to determine the best translation candidate. This Bayesian-guided decision function is then embedded into a training process that jointly learns the parameters of the speech recognition and machine translation sub-systems in the overall speech translation system. The resulting decision-feedback learning takes a functional form similar to minimum classification error training. Experimental results on the IWSLT DIALOG 2010 database show that the proposed system outperformed the baseline system by 2.3 BLEU points.
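The score-integration idea can be sketched as a weighted log-domain combination over translation hypotheses. The field names and hand-set weights below are illustrative assumptions; in the paper the weights are learned jointly by decision-feedback training rather than fixed by hand.

```python
def best_candidate(hypotheses, weights):
    """Pick the translation hypothesis maximizing a weighted combination of
    acoustic, language-model and translation scores (all in the log domain).
    `hypotheses` is a list of dicts carrying the three scores."""
    def score(h):
        return (weights["acoustic"] * h["acoustic"]
                + weights["lm"] * h["lm"]
                + weights["translation"] * h["translation"])
    return max(hypotheses, key=score)
```

Because the combined score is differentiable in the weights, it can be plugged into a minimum-classification-error-style objective, which is the shape the decision-feedback learning above takes.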


International Conference on Acoustics, Speech, and Signal Processing | 2013

Mispronunciation detection via dynamic time warping on deep belief network-based posteriorgrams

Ann Lee; Yaodong Zhang; James R. Glass

In this paper, we explore the use of deep belief network (DBN) posteriorgrams as input to our previously proposed comparison-based system for detecting word-level mispronunciation. The system works by aligning a nonnative utterance with at least one native utterance and extracting features that describe the degree of mis-alignment from the aligned path and the distance matrix. We report system performance under different DBN training scenarios: pre-training and fine-tuning with either native data only or both native and nonnative data. Experimental results show that substituting DBN posteriorgrams for MFCCs or Gaussian posteriorgrams obtained in a fully unsupervised manner improves system performance by at least 10.4% relative. Moreover, performance remains steady when only 30% of the annotations are used.
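The feature extraction idea, describing how badly two utterances align, can be sketched with two toy features computed from a DTW path and its distance matrix. These two features and their names are illustrative assumptions; the paper extracts a richer feature set, and the sketch assumes both sequences have at least two frames.

```python
def misalignment_features(path, dist):
    """Toy mis-alignment features from a DTW path and distance matrix:
    the mean distance along the aligned path, and the path's largest
    deviation from the diagonal (both in normalized coordinates).
    `path` is a list of (i, j) index pairs; `dist` is indexable as dist[i][j]."""
    T = max(i for i, _ in path) + 1
    U = max(j for _, j in path) + 1
    mean_cost = sum(dist[i][j] for i, j in path) / len(path)
    diag_dev = max(abs(i / (T - 1) - j / (U - 1)) for i, j in path)
    return {"mean_cost": mean_cost, "diag_dev": diag_dev}
```

A well-pronounced word aligned against a native example should yield a low mean cost and a near-diagonal path; a classifier over such features then flags mispronunciations.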


International Conference on Document Analysis and Recognition | 2007

Minimum Error Discriminative Training for Radical-Based Online Chinese Handwriting Recognition

Yaodong Zhang; Peng Liu; Frank K. Soong

Free-style Chinese handwriting recognition continues to pose a challenge to researchers due to the variety of writing styles. To recognize handwritten characters in an online mode, hidden Markov models (HMMs) have been widely adopted to model the pen trajectory of a character, achieving decent recognition performance. In this study, we start from a maximum-likelihood-trained HMM and focus on minimizing recognition errors at the radical (sub-character) level to optimize recognition performance. A novel minimum radical error discriminative training criterion is proposed; compared with discrimination at the character level, our new approach further reduces character errors by 15.6% relative (a 29.0% overall reduction from the maximum likelihood baseline model) on a Chinese database.

Collaboration


Yaodong Zhang's frequent collaborators.

Top Co-Authors

James R. Glass

Massachusetts Institute of Technology

Lei Cui

Harbin Institute of Technology

Shujie Liu

Chinese Academy of Sciences
