
Publication

Featured research published by Stephen M. Chu.


International Conference on Acoustics, Speech, and Signal Processing | 2010

The IBM 2008 GALE Arabic speech transcription system

Brian Kingsbury; Hagen Soltau; George Saon; Stephen M. Chu; Hong-Kwang Kuo; Lidia Mangu; Suman V. Ravuri; Nelson Morgan; Adam Janin

This paper describes the Arabic broadcast transcription system fielded by IBM in the GALE Phase 3.5 machine translation evaluation. Key advances compared to our Phase 2.5 system include improved discriminative training, the use of Subspace Gaussian Mixture Models (SGMM), neural network acoustic features, variable frame rate decoding, training data partitioning experiments, unpruned n-gram language models and neural network language models. These advances were instrumental in achieving a word error rate of 8.9% on the evaluation test set.


IEEE Transactions on Pattern Analysis and Machine Intelligence | 2016

Multi-Graph Matching via Affinity Optimization with Graduated Consistency Regularization

Junchi Yan; Minsu Cho; Hongyuan Zha; Xiaokang Yang; Stephen M. Chu

This paper addresses the problem of matching common node correspondences among multiple graphs referring to an identical or related structure. This multi-graph matching problem involves two correlated components: i) the local pairwise matching affinity across pairs of graphs; ii) the global matching consistency that measures the uniqueness of the pairwise matchings under different composition orders. Previous studies typically either enforce the matching consistency constraints at the beginning of an iterative optimization, which may propagate matching error both over iterations and across graph pairs, or separate affinity optimization and consistency enforcement into two steps. This paper is motivated by the observation that matching consistency can serve as a regularizer in the affinity objective function, especially when the function is biased due to noise or inappropriate modeling. We propose composition-based multi-graph matching methods that incorporate the two aspects by optimizing the affinity score while gradually infusing the consistency. We also propose two mechanisms to elicit the common inliers against outliers. Compelling results on synthetic and real images show the competency of our algorithms.
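The matching-consistency notion in this line of work can be made concrete with a small sketch: represent each pairwise matching as a permutation matrix and check whether compositions through an intermediate graph agree with the direct matching. The toy permutation matrices and the exact-agreement test below are illustrative assumptions, not the paper's graduated, soft consistency score.

```python
import numpy as np

def pairwise_consistency(X, i, j):
    """Fraction of intermediate graphs k for which the composed matching
    X[i][k] @ X[k][j] agrees with the direct matching X[i][j].
    X is an N x N table of n x n permutation matrices with X[i][i] = I."""
    N = len(X)
    agree = [np.array_equal(X[i][k] @ X[k][j], X[i][j]) for k in range(N)]
    return sum(agree) / N

# Toy example: three graphs of 3 nodes each, matched by a cyclic permutation.
I = np.eye(3, dtype=int)
P = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]])   # cyclic shift of the nodes
# Cycle-consistent set: the direct matching X[0][2] equals X[0][1] @ X[1][2].
X = [[I, P, P @ P],
     [P.T, I, P],
     [(P @ P).T, P.T, I]]
print(pairwise_consistency(X, 0, 2))   # 1.0 for a fully consistent set
```

An inconsistent set (e.g. replacing X[0][2] with the identity above) would score below 1, which is exactly the kind of signal the graduated regularizer exploits.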


IEEE Transactions on Pattern Analysis and Machine Intelligence | 2012

Partially Supervised Speaker Clustering

Hao Tang; Stephen M. Chu; Mark Hasegawa-Johnson; Thomas S. Huang

Content-based multimedia indexing, retrieval, and processing as well as multimedia databases demand the structuring of the media content (image, audio, video, text, etc.), one significant goal being to associate the identity of the content to the individual segments of the signals. In this paper, we specifically address the problem of speaker clustering, the task of assigning every speech utterance in an audio stream to its speaker. We offer a complete treatment to the idea of partially supervised speaker clustering, which refers to the use of our prior knowledge of speakers in general to assist the unsupervised speaker clustering process. By means of an independent training data set, we encode the prior knowledge at the various stages of the speaker clustering pipeline via 1) learning a speaker-discriminative acoustic feature transformation, 2) learning a universal speaker prior model, and 3) learning a discriminative speaker subspace, or equivalently, a speaker-discriminative distance metric. We study the directional scattering property of the Gaussian mixture model (GMM) mean supervector representation of utterances in the high-dimensional space, and advocate exploiting this property by using the cosine distance metric instead of the euclidean distance metric for speaker clustering in the GMM mean supervector space. We propose to perform discriminant analysis based on the cosine distance metric, which leads to a novel distance metric learning algorithm-linear spherical discriminant analysis (LSDA). We show that the proposed LSDA formulation can be systematically solved within the elegant graph embedding general dimensionality reduction framework. 
Our speaker clustering experiments on the GALE database clearly indicate that 1) our speaker clustering methods based on the GMM mean supervector representation and vector-based distance metrics outperform traditional speaker clustering methods based on the “bag of acoustic features” representation and statistical model-based distance metrics, 2) our advocated use of the cosine distance metric yields consistent increases in the speaker clustering performance as compared to the commonly used euclidean distance metric, 3) our partially supervised speaker clustering concept and strategies significantly improve the speaker clustering performance over the baselines, and 4) our proposed LSDA algorithm further leads to state-of-the-art speaker clustering performance.
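The advocated cosine distance on GMM mean supervectors can be illustrated with a minimal sketch (toy vectors, not actual supervectors): two vectors pointing in the same direction but with different norms are close under the cosine metric yet far apart under the Euclidean metric, which is why a scattering of supervectors along directions favors the cosine choice.

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance between two GMM mean supervectors."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Same direction, different magnitudes (e.g. utterances of different
# lengths): cosine distance ignores the magnitude difference.
u = np.array([1.0, 2.0, 3.0])
v = 10.0 * u
print(cosine_distance(u, v))     # ~0.0: identical direction
print(np.linalg.norm(u - v))     # large Euclidean distance
```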


IEEE Transactions on Image Processing | 2015

Consistency-Driven Alternating Optimization for Multigraph Matching: A Unified Approach

Junchi Yan; Jun Wang; Hongyuan Zha; Xiaokang Yang; Stephen M. Chu

The problem of graph matching (GM) is NP-complete in general, and many approximate pairwise matching techniques have been proposed. In real applications, one typically needs to find consistent matchings across a batch of graphs. Sequentially performing pairwise matching is prone to error propagation along the pairwise matching sequence, and the sequences generated in different pairwise matching orders can lead to contradictory solutions. Motivated by devising a robust and consistent multiple-GM model, we propose a unified alternating optimization framework for multi-GM. In addition, we define and use two metrics related to graphwise and pairwise consistency. The former is used to find an appropriate reference graph, which induces a set of basis variables and launches the iteration procedure. The latter defines the order in which the considered graphs are manipulated in the iterations. We show two embodiments under the proposed framework that can cope with the nonfactorized and factorized affinity matrix, respectively. Our multi-GM model has two major characteristics: 1) the affinity information across multiple graphs is explored in each iteration by fixing part of the matching variables via a consistency-driven mechanism, and 2) the framework is flexible enough to incorporate various existing pairwise GM solvers in an out-of-the-box fashion, and can also proceed from the output of other multi-GM methods. The experimental results on both synthetic data and real images empirically show that the proposed framework performs competitively with the state of the art.


International Conference on Acoustics, Speech, and Signal Processing | 2002

Audio-visual speech modeling using coupled hidden Markov models

Stephen M. Chu; Thomas S. Huang

In this work we consider the bimodal fusion problem in audio-visual speech recognition. A novel sensory fusion architecture based on the coupled hidden Markov models (CHMMs) is presented. CHMMs are directed graphical models of stochastic processes and are a special type of dynamic Bayesian networks. The proposed fusion architecture allows us to address the statistical modeling and the fusion of audio-visual speech in a unified framework. Furthermore, the architecture is capable of capturing the asynchronous and temporal inter-modal dependencies between the two information channels. We describe a model transformation strategy to facilitate inference and learning in CHMMs. Results from audio-visual speech recognition experiments confirmed the superior capability of the proposed fusion architecture.
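The coupled-HMM idea, where each chain's next state depends on the previous states of both chains, can be sketched with a forward pass on the equivalent joint-state model (in the spirit of the model-transformation strategy mentioned above). The chain sizes, random parameters, and discrete observations below are illustrative assumptions, not the paper's actual audio-visual model.

```python
import numpy as np

rng = np.random.default_rng(0)
nA, nB = 2, 2                             # states in the audio/visual chains

def rand_dist(shape):
    x = rng.random(shape)
    return x / x.sum(axis=-1, keepdims=True)

transA = rand_dist((nA, nB, nA))          # P(a_t | a_{t-1}, b_{t-1})
transB = rand_dist((nA, nB, nB))          # P(b_t | a_{t-1}, b_{t-1})
obsA = rand_dist((nA, 3))                 # P(audio symbol | a_t), 3 symbols
obsB = rand_dist((nB, 3))                 # P(visual symbol | b_t)
init = rand_dist((nA * nB,)).reshape(nA, nB)

def forward(ya, yb):
    """Exact forward pass on the joint-state HMM equivalent to the CHMM:
    the cross-chain transition tables capture inter-modal dependencies."""
    alpha = init * np.outer(obsA[:, ya[0]], obsB[:, yb[0]])
    for t in range(1, len(ya)):
        # sum over the previous joint state (a', b')
        alpha = np.einsum('pq,pqa,pqb->ab', alpha, transA, transB)
        alpha *= np.outer(obsA[:, ya[t]], obsB[:, yb[t]])
    return alpha.sum()                    # P(audio sequence, visual sequence)

ya, yb = [0, 1, 2, 1, 0], [0, 0, 1, 2, 1]
p = forward(ya, yb)
print(p)                                  # joint likelihood in (0, 1)
```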


European Conference on Computer Vision | 2014

Graduated Consistency-Regularized Optimization for Multi-graph Matching

Junchi Yan; Yin Li; Wei Liu; Hongyuan Zha; Xiaokang Yang; Stephen M. Chu

Graph matching has a wide spectrum of computer vision applications, such as finding feature point correspondences across images. The problem of graph matching is generally NP-hard, so most existing work pursues suboptimal solutions between two graphs. This paper investigates a more general problem of matching N attributed graphs to each other, i.e. labeling their common node correspondences such that a certain compatibility/affinity objective is optimized. This multi-graph matching problem involves two key ingredients affecting the overall accuracy: a) the pairwise affinity matching score between two local graphs, and b) the global matching consistency that measures the uniqueness and consistency of the pairwise matching results under different sequential matching orders. Previous work typically either enforces the matching consistency constraints at the beginning of iterative optimization, which may propagate matching error both over iterations and across different graph pairs, or separates score optimization and consistency synchronization into two steps. This paper is motivated by the observation that affinity score and consistency are mutually affected and should be tackled jointly to capture their correlation. As such, we propose a novel multi-graph matching algorithm that incorporates the two aspects by iteratively approximating the global-optimal affinity score while gradually infusing the consistency as a regularizer, which improves the performance of the initial solutions obtained by existing pairwise graph matching solvers. The proposed algorithm, with theoretically proven convergence, shows notable efficacy on both synthetic and public image datasets.


International Conference on Acoustics, Speech, and Signal Processing | 2008

Universal background model based speech recognition

Daniel Povey; Stephen M. Chu; Balakrishnan Varadarajan

The universal background model (UBM) is an effective framework widely used in speaker recognition. But so far it has received little attention from the speech recognition field. In this work, we make a first attempt to apply the UBM to acoustic modeling in ASR. We propose a tree-based parameter estimation technique for UBMs, and describe a set of smoothing and pruning methods to facilitate learning. The proposed UBM approach is benchmarked on a state-of-the-art large-vocabulary continuous speech recognition platform on a broadcast transcription task. Preliminary experiments reported in this paper already show very exciting results.
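The UBM framework the abstract builds on can be sketched as follows; this is the classic GMM-UBM relevance-MAP recipe from speaker recognition, not the paper's tree-based parameter estimation, and the data shapes and `tau` value are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
pooled = rng.normal(size=(500, 4))            # all training frames
target = rng.normal(loc=0.5, size=(50, 4))    # frames from one condition

# Step 1: train one global GMM (the UBM) on the pooled data.
ubm = GaussianMixture(n_components=8, random_state=0).fit(pooled)

def map_adapt_means(ubm, x, tau=10.0):
    """Relevance-MAP update of component means toward the target data."""
    r = ubm.predict_proba(x)                  # responsibilities, (n, K)
    n_k = r.sum(axis=0)                       # soft counts per component
    x_bar = (r.T @ x) / np.maximum(n_k, 1e-10)[:, None]
    w = (n_k / (n_k + tau))[:, None]          # data/prior interpolation
    return w * x_bar + (1 - w) * ubm.means_

# Step 2: adapt the UBM means to the target condition.
adapted = map_adapt_means(ubm, target)
print(adapted.shape)                          # (8, 4): one mean per component
```

Components with little target data stay near the UBM prior (small `w`), which is the smoothing behavior that makes the UBM attractive for sparse conditions.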


Computer Vision and Pattern Recognition | 2015

Discrete hyper-graph matching

Junchi Yan; Chao Zhang; Hongyuan Zha; Wei Liu; Xiaokang Yang; Stephen M. Chu

This paper focuses on the problem of hyper-graph matching, accounting for both unary and higher-order affinity terms. Our method follows the linear approximation framework, while the problem is iteratively solved in discrete space. It is empirically found to be more efficient than many extant continuous methods. Moreover, it avoids the unknown accuracy loss incurred by the heuristic rounding step of continuous approaches. Under weak assumptions, we prove that the iterative discrete gradient assignment will in general be trapped in a degenerating case: an m-circle solution path, where m is the order of the problem. A tailored adaptive relaxation mechanism is devised to detect the degenerating case and make the algorithm converge to a fixed point in discrete space. Evaluations on both synthetic and real-world data corroborate the efficiency of our method.


International Conference on Multimedia and Expo | 2009

Emotion recognition from speech via boosted Gaussian mixture models

Hao Tang; Stephen M. Chu; Mark Hasegawa-Johnson; Thomas S. Huang

Gaussian mixture models (GMMs) and the minimum error rate classifier (i.e. Bayesian optimal classifier) are popular and effective tools for speech emotion recognition. Typically, GMMs are used to model the class-conditional distributions of acoustic features and their parameters are estimated by the expectation maximization (EM) algorithm based on a training data set. Then, classification is performed to minimize the classification error w.r.t. the estimated class-conditional distributions. We call this method the EM-GMM algorithm. In this paper, we introduce a boosting algorithm for reliably and accurately estimating the class-conditional GMMs. The resulting algorithm is named the Boosted-GMM algorithm. Our speech emotion recognition experiments show that the emotion recognition rates are effectively and significantly “boosted” by the Boosted-GMM algorithm as compared to the EM-GMM algorithm. This is due to the fact that the boosting algorithm can lead to more accurate estimates of the class-conditional GMMs, namely the class-conditional distributions of acoustic features.
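The general idea of boosting class-conditional Bayes classifiers can be sketched minimally; here single weighted Gaussians stand in for the paper's full GMMs, and the synthetic 2-D features and AdaBoost-style reweighting are illustrative assumptions, not the Boosted-GMM algorithm itself.

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(1, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

def fit_round(X, y, w):
    """Weighted diagonal Gaussian per class -> Bayes decision rule."""
    params = {}
    for c in (0, 1):
        wc = w[y == c] / w[y == c].sum()
        mu = wc @ X[y == c]
        var = wc @ (X[y == c] - mu) ** 2 + 1e-6
        params[c] = (mu, var)
    def predict(Z):
        ll = [(-0.5 * ((Z - mu) ** 2 / var + np.log(var))).sum(axis=1)
              for mu, var in (params[0], params[1])]
        return (ll[1] > ll[0]).astype(int)
    return predict

w = np.full(len(y), 1 / len(y))
ensemble = []
for _ in range(5):                       # boosting rounds
    h = fit_round(X, y, w)
    err = w[h(X) != y].sum()
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
    w *= np.exp(alpha * (h(X) != y))     # upweight misclassified samples
    w /= w.sum()
    ensemble.append((alpha, h))

def boosted_predict(Z):
    votes = sum(a * (2 * h(Z) - 1) for a, h in ensemble)
    return (votes > 0).astype(int)

acc = (boosted_predict(X) == y).mean()
print(acc)                               # training accuracy of the ensemble
```

Each round refits the class-conditional densities with more weight on previously misclassified samples, which is the mechanism the abstract credits for the improved density estimates.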


International Conference on Acoustics, Speech, and Signal Processing | 2009

Fishervoice and semi-supervised speaker clustering

Stephen M. Chu; Hao Tang; Thomas S. Huang

Speaker subspace modeling has become increasingly important in speaker recognition, diarization, and clustering. Principal component analysis (PCA) is a popular linear subspace learning technique and the approach that represents an arbitrary utterance or speaker as a linear combination of a set of basis voices based on PCA is known as the eigenvoice approach. In this paper, a novel technique, namely the fishervoice approach, is proposed. The fishervoice approach is based on linear discriminant analysis, another successful linear subspace learning technique that provides an optimized low-dimensional representation of utterances or speakers with focus on the most discriminative basis voices. We apply the fishervoice approach to speaker clustering in a semi-supervised manner and show that the fishervoice approach significantly outperforms the eigenvoice approach in all our experiments on the GALE Mandarin dataset.
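The eigenvoice/fishervoice contrast reduces to PCA versus LDA on speaker representations, which can be sketched on toy data (synthetic "supervectors", not the GALE data): PCA keeps the directions of largest variance, even if they are nuisance directions, while LDA keeps the directions that best separate the speaker classes.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(3)
n_per, dim = 40, 10
# Three synthetic "speakers": class means differ in the first few dims,
# while a high-variance nuisance direction dominates the last dim.
X = np.vstack([rng.normal(0, 0.5, (n_per, dim)) + np.eye(dim)[i] * 2
               for i in range(3)])
X[:, -1] += rng.normal(0, 5, size=len(X))     # speaker-independent nuisance
y = np.repeat([0, 1, 2], n_per)

Z_pca = PCA(n_components=2).fit_transform(X)             # "eigenvoice"-style
Z_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)
print(Z_pca.shape, Z_lda.shape)   # both (120, 2); only the LDA axes are
                                  # chosen to be speaker-discriminative
```

With 3 classes, LDA yields at most 2 discriminative axes, which matches the low-dimensional basis-voice picture in the abstract.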

Collaboration

Top co-authors of Stephen M. Chu:

Junchi Yan (Shanghai Jiao Tong University)
Hongyuan Zha (Georgia Institute of Technology)
Xiaokang Yang (Shanghai Jiao Tong University)