Publication


Featured research published by Chris D. Bartels.


IEEE Signal Processing Magazine | 2005

Graphical model architectures for speech recognition

Jeff A. Bilmes; Chris D. Bartels

This article discusses the foundations of the use of graphical models for speech recognition as presented in J. R. Deller et al. (1993), X. D. Huang et al. (2001), F. Jelinek (1997), L. R. Rabiner and B.-H. Juang (1993), and S. Young et al. (1990), giving detailed accounts of some of the more successful cases. Our discussion employs dynamic Bayesian networks (DBNs) and a DBN extension using the Graphical Model Toolkit's (GMTK's) basic template, a dynamic graphical model representation that is more suitable for speech and language systems. While this article concentrates on speech recognition, it should be noted that many of the ideas presented here are also applicable to natural language processing and general time-series analysis.
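
As a rough illustration of the graphical-model view (this is not code from the article, and GMTK itself is a separate toolkit), the sketch below runs the forward recursion of a plain HMM, the simplest dynamic Bayesian network; all parameter values are made up for the example.

```python
import numpy as np

def hmm_forward(log_pi, log_A, log_B, observations):
    """Forward recursion for an HMM, the simplest dynamic Bayesian network.

    log_pi : (S,) initial state log-probabilities
    log_A  : (S, S) transition log-probabilities, log_A[i, j] = log p(q_t=j | q_{t-1}=i)
    log_B  : (S, V) emission log-probabilities over a discrete observation alphabet
    observations : sequence of observation symbol indices
    Returns the total log-likelihood log p(o_1..o_T).
    """
    alpha = log_pi + log_B[:, observations[0]]
    for o_t in observations[1:]:
        # alpha_t(j) = logsum_i [ alpha_{t-1}(i) + log A(i, j) ] + log B(j, o_t)
        alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0) + log_B[:, o_t]
    return np.logaddexp.reduce(alpha)

# Toy example: 2 hidden states, 3 observation symbols (illustrative values only).
log_pi = np.log([0.6, 0.4])
log_A = np.log([[0.7, 0.3], [0.4, 0.6]])
log_B = np.log([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(hmm_forward(log_pi, log_A, log_B, [0, 1, 2, 2]))
```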


international conference on acoustics, speech, and signal processing | 2004

DBN based multi-stream models for audio-visual speech recognition

John N. Gowdy; Amarnag Subramanya; Chris D. Bartels; Jeff A. Bilmes

In this paper, we propose a model based on dynamic Bayesian networks (DBNs) to integrate information from multiple audio and visual streams. We also compare the DBN based system (implemented using the Graphical Model Toolkit (GMTK)) with a classical HMM (implemented in the Hidden Markov Model Toolkit (HTK)) for both the single- and two-stream integration problems. We also propose a new model (mixed integration) to integrate information from three or more streams derived from different modalities and compare the new model's performance with that of a synchronous integration scheme. A new technique to estimate stream confidence measures for the integration of three or more streams is also developed and implemented. Results from our implementation using the Clemson University Audio Visual Experiments (CUAVE) database indicate an absolute improvement of about 4% in word accuracy in the -4 to 10 dB average case when using two audio streams and one video stream for the mixed integration models over the synchronous models.
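
For readers unfamiliar with multi-stream combination, one standard formulation scores each HMM/DBN state with a weighted product of per-stream likelihoods, with the weights acting as stream confidences. The sketch below shows only that combination rule with invented numbers; it is not the GMTK model or the confidence estimator from the paper.

```python
import numpy as np

def combined_log_likelihood(stream_log_likes, stream_weights):
    """Weighted-product (log-linear) combination of per-stream observation scores.

    stream_log_likes : list of arrays, each (S,) with log p_s(o_s | q) for every state q
    stream_weights   : per-stream confidence weights (assumed here to sum to 1)
    Returns an (S,) array of combined log-scores usable as the observation factor.
    """
    stream_log_likes = np.asarray(stream_log_likes)   # (num_streams, S)
    weights = np.asarray(stream_weights)[:, None]
    return np.sum(weights * stream_log_likes, axis=0)

# Toy example: two audio streams and one video stream over 3 states (illustrative numbers).
audio1 = np.log([0.5, 0.3, 0.2])
audio2 = np.log([0.4, 0.4, 0.2])
video  = np.log([0.2, 0.3, 0.5])
print(combined_log_likelihood([audio1, audio2, video], [0.4, 0.4, 0.2]))
```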


international conference on acoustics, speech, and signal processing | 2007

An Articulatory Feature-Based Tandem Approach and Factored Observation Modeling

Özgür Çetin; Arthur Kantor; Simon King; Chris D. Bartels; Mathew Magimai-Doss; Joe Frankel; Karen Livescu

The so-called tandem approach, where the posteriors of a multilayer perceptron (MLP) classifier are used as features in an automatic speech recognition (ASR) system, has proven to be a very effective method. Most tandem approaches to date have relied on MLPs trained for phone classification, and have appended the posterior features to standard features in a hidden Markov model (HMM) system. In this paper, we develop an alternative tandem approach based on MLPs trained for articulatory feature (AF) classification. We also develop a factored observation model for characterizing the posterior and standard features at the HMM outputs, allowing for separate hidden mixture and state-tying structures for each factor. In experiments on a subset of Switchboard, we show that the AF-based tandem approach is as effective as the phone-based approach, and that the factored observation model significantly outperforms simple feature concatenation while using fewer parameters.
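
A minimal sketch of the factored-observation idea, assuming diagonal-covariance Gaussian mixtures for each factor: the joint observation likelihood is the product of a per-factor likelihood for the standard features and one for the tandem (MLP posterior) features, so each factor can have its own mixture size (and, in a full system, its own state tying). The parameters below are placeholders, not values from the paper.

```python
import numpy as np

def diag_gmm_loglike(x, weights, means, variances):
    """Log-likelihood of x under a diagonal-covariance Gaussian mixture."""
    x = np.asarray(x)
    comp = (
        np.log(weights)
        - 0.5 * np.sum(np.log(2 * np.pi * variances) + (x - means) ** 2 / variances, axis=1)
    )
    return np.logaddexp.reduce(comp)

def factored_observation_loglike(std_feats, tandem_feats, std_gmm, tandem_gmm):
    """Factored observation model: p(std, tandem | q) = p(std | q) * p(tandem | q).

    Each factor has its own mixture, so mixture sizes (and, in a full system,
    state-tying) can differ between the standard and tandem feature streams.
    """
    return diag_gmm_loglike(std_feats, *std_gmm) + diag_gmm_loglike(tandem_feats, *tandem_gmm)

# Toy example with made-up parameters: a 2-component GMM over 3-dim standard features
# and a 1-component GMM over 2-dim tandem (MLP posterior) features.
std_gmm = (np.array([0.6, 0.4]), np.zeros((2, 3)), np.ones((2, 3)))
tandem_gmm = (np.array([1.0]), np.zeros((1, 2)), 0.5 * np.ones((1, 2)))
print(factored_observation_loglike([0.1, -0.2, 0.3], [0.5, -0.5], std_gmm, tandem_gmm))
```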


international conference on acoustics, speech, and signal processing | 2003

DBN based multi-stream models for speech

Yimin Zhang; Qian Diao; Shan Huang; Wei Hu; Chris D. Bartels; Jeff A. Bilmes

We propose dynamic Bayesian network (DBN) based synchronous and asynchronous multi-stream models for noise-robust automatic speech recognition. In these models, multiple noise-robust features are combined into a single DBN to obtain better performance than any single-feature system alone. Results on the Aurora 2.0 noisy speech task show significant improvements of our synchronous model over both single-stream models and a ROVER-based fusion method.


international conference on acoustics, speech, and signal processing | 2014

Submodular subset selection for large-scale speech training data

Kai Wei; Yuzong Liu; Katrin Kirchhoff; Chris D. Bartels; Jeff A. Bilmes

We address the problem of subselecting a large set of acoustic data to train automatic speech recognition (ASR) systems. To this end, we apply a novel data selection technique based on constrained submodular function maximization. Though NP-hard, the combinatorial optimization problem can be approximately solved by a simple and scalable greedy algorithm with constant-factor guarantees. We evaluate our approach by subselecting data from 1300 hours of conversational English telephone data to train two types of large-vocabulary speech recognizers, one with Gaussian mixture model (GMM) based acoustic models, and another based on deep neural networks (DNNs). We show that training data can be reduced significantly, and that our technique outperforms both random selection and a previously proposed selection method utilizing comparable resources. Notably, using the submodular selection method, the DNN system using only about 5% of the training data is able to achieve performance on par with the GMM system using 100% of the training data; with the baseline subset selection methods, however, the DNN system is unable to match this.
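
The greedy algorithm mentioned above repeatedly adds the item with the largest marginal gain per unit cost until the budget is spent. The sketch below uses a facility-location objective over a toy similarity matrix as a stand-in; the actual features, objective, and budget used in the paper may differ.

```python
import numpy as np

def greedy_facility_location(similarity, costs, budget):
    """Cost-sensitive greedy maximization of a facility-location submodular function.

    similarity : (N, N) non-negative similarity matrix between utterances
    costs      : (N,) cost of each utterance (e.g. its duration in seconds)
    budget     : total cost allowed for the selected subset
    Returns the indices of the selected utterances.
    """
    n = similarity.shape[0]
    selected, remaining = [], set(range(n))
    coverage = np.zeros(n)            # max similarity of each point to the selected set
    spent = 0.0
    while remaining:
        best, best_ratio = None, -np.inf
        for j in remaining:
            if spent + costs[j] > budget:
                continue
            gain = np.maximum(coverage, similarity[:, j]).sum() - coverage.sum()
            ratio = gain / costs[j]   # marginal gain per unit cost
            if ratio > best_ratio:
                best, best_ratio = j, ratio
        if best is None:
            break
        selected.append(best)
        remaining.remove(best)
        coverage = np.maximum(coverage, similarity[:, best])
        spent += costs[best]
    return selected

# Toy example: 5 utterances with random similarities and durations (illustrative only).
rng = np.random.default_rng(0)
sim = rng.random((5, 5)); sim = (sim + sim.T) / 2
print(greedy_facility_location(sim, costs=np.array([3.0, 5.0, 2.0, 4.0, 1.0]), budget=6.0))
```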


ieee automatic speech recognition and understanding workshop | 2007

Monolingual and crosslingual comparison of tandem features derived from articulatory and phone MLPs

Özgür Çetin; Mathew Magimai-Doss; Karen Livescu; Arthur Kantor; Simon King; Chris D. Bartels; Joe Frankel

The features derived from the posteriors of a multilayer perceptron (MLP), known as tandem features, have proven to be very effective for automatic speech recognition. Most tandem features to date have relied on MLPs trained for phone classification. We recently showed on a relatively small data set that MLPs trained for articulatory feature classification can be equally effective. In this paper, we provide a similar comparison using MLPs trained on a much larger data set: 2000 hours of English conversational telephone speech. We also explore how portable phone- and articulatory feature-based tandem features are in an entirely different language, Mandarin, without any retraining. We find that while the phone-based features perform slightly better than AF-based features in the matched-language condition, they perform significantly better in the cross-language condition. However, in the cross-language condition, neither approach is as effective as tandem features extracted from an MLP trained on a relatively small amount of in-domain data. Beyond feature concatenation, we also explore novel factored observation modeling schemes that allow for greater flexibility in combining the tandem and standard features.


ieee automatic speech recognition and understanding workshop | 2007

Use of syllable nuclei locations to improve ASR

Chris D. Bartels; Jeff A. Bilmes

This work presents the use of dynamic Bayesian networks (DBNs) to jointly estimate word position and word identity in an automatic speech recognition system. In particular, we have augmented a standard Hidden Markov Model (HMM) with counts and locations of syllable nuclei. Three experiments are presented here. The first uses oracle syllable counts, the second uses oracle syllable nuclei locations, and the third uses estimated (non-oracle) syllable nuclei locations. All results are presented on the 10- and 500-word tasks of the SVitchboard corpus. The oracle experiments give relative improvements ranging from 7.0% to 37.2%. When using estimated syllable nuclei, a relative improvement of 3.1% is obtained on the 10-word task.
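
The abstract does not describe how the non-oracle nuclei are found; as a purely illustrative stand-in, the sketch below picks peaks of a smoothed frame-energy contour, a common rough heuristic for syllable nuclei, with invented thresholds. It is not the detector used in the paper.

```python
import numpy as np

def estimate_syllable_nuclei(energy, frame_rate_hz=100, min_gap_s=0.15, threshold=0.3):
    """Very rough syllable-nucleus detector: peaks of a smoothed energy contour.

    This is a generic illustration (peak picking on per-frame energy), not the
    method used in the paper. `energy` is one value per frame; returns the
    frame indices of estimated nuclei, so len(result) is the estimated count.
    """
    # Smooth with a short moving average to suppress frame-level jitter,
    # then normalize to [0, 1] so the threshold is scale-free.
    kernel = np.ones(5) / 5.0
    smooth = np.convolve(energy, kernel, mode="same")
    smooth = (smooth - smooth.min()) / (smooth.max() - smooth.min() + 1e-12)

    min_gap = int(min_gap_s * frame_rate_hz)
    nuclei, last = [], -min_gap
    for t in range(1, len(smooth) - 1):
        is_peak = smooth[t] >= smooth[t - 1] and smooth[t] > smooth[t + 1]
        if is_peak and smooth[t] > threshold and t - last >= min_gap:
            nuclei.append(t)
            last = t
    return nuclei

# Toy contour with two energy humps (illustrative only).
t = np.arange(200)
energy = np.exp(-((t - 50) ** 2) / 200) + 0.8 * np.exp(-((t - 140) ** 2) / 300)
print(estimate_syllable_nuclei(energy))
```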


ieee automatic speech recognition and understanding workshop | 2007

Uncertainty in training large vocabulary speech recognizers

Amarnag Subramanya; Chris D. Bartels; Jeff A. Bilmes; Patrick Nguyen

We propose a technique for annotating data used to train a speech recognizer. The proposed scheme is based on labeling only a single frame for every word in the training set. We make use of the virtual evidence (VE) framework within a graphical model to take advantage of such data. We apply this approach to a large vocabulary speech recognition task, and show that our VE-based training scheme can improve over the performance of a system trained using sequence-labeled data by 2.8% and 2.1% on the dev01 and eval01 sets, respectively. Annotating data in the proposed scheme is not significantly slower than sequence labeling. We present timing results showing that training using the proposed approach is about 10 times faster than training using sequence-labeled data while using only about 75% of the memory.
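
Virtual evidence can be thought of as a soft observation attached to a hidden variable: an extra factor multiplied into the model at the labeled frame, leaving all other frames unconstrained. The sketch below folds such factors into a forward pass; the structure and numbers are illustrative assumptions, not the recognizer from the paper.

```python
import numpy as np

def forward_with_virtual_evidence(log_pi, log_A, log_obs, ve_factors):
    """Forward pass where soft (virtual) evidence multiplies the per-frame state score.

    log_obs    : (T, S) ordinary observation log-likelihoods log p(o_t | q_t)
    ve_factors : (T, S) log of virtual-evidence factors; zeros (log 1) for
                 unlabeled frames, and a peaked factor at the single labeled
                 frame of each word.
    """
    scores = log_obs + ve_factors
    alpha = log_pi + scores[0]
    for t in range(1, scores.shape[0]):
        alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0) + scores[t]
    return np.logaddexp.reduce(alpha)

# Toy example: 4 frames, 2 states; only frame 2 carries (soft) label information.
T, S = 4, 2
log_pi = np.log([0.5, 0.5])
log_A = np.log([[0.8, 0.2], [0.2, 0.8]])
log_obs = np.log(np.full((T, S), 0.5))
ve = np.zeros((T, S))
ve[2] = np.log([0.9, 0.1])   # soft evidence that state 0 is active at the labeled frame
print(forward_with_virtual_evidence(log_pi, log_A, log_obs, ve))
```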


Machine Learning | 2011

Creating non-minimal triangulations for use in inference in mixed stochastic/deterministic graphical models

Chris D. Bartels; Jeff A. Bilmes

We demonstrate that certain large-clique graph triangulations can be useful for reducing computational requirements when making queries on mixed stochastic/deterministic graphical models. This is counter to the conventional wisdom that triangulations that minimize clique size are always most desirable for use in computing queries on graphical models. Many of these large-clique triangulations are non-minimal and are thus unattainable via the popular elimination algorithm. We introduce ancestral pairs as the basis for novel triangulation heuristics and prove that no more than the addition of edges between ancestral pairs needs to be considered when searching for state space optimal triangulations in such graphs. Empirical results on random and real-world graphs are given. We also present an algorithm and correctness proof for determining if a triangulation can be obtained via elimination, and we show that the decision problem associated with finding optimal state space triangulations in this mixed setting is NP-complete.
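
For context, the standard elimination algorithm triangulates a graph by visiting nodes in some order and connecting each node's remaining neighbours; the sketch below implements that baseline procedure and a simple state-space cost, not the ancestral-pair heuristics introduced in the paper. Node names and cardinalities are made up.

```python
from itertools import combinations

def triangulate_by_elimination(adjacency, order):
    """Triangulate an undirected graph with the standard elimination algorithm.

    adjacency : dict mapping node -> set of neighbours
    order     : elimination order (a permutation of the nodes)
    Returns (fill_edges, cliques), where each clique is an eliminated node
    together with its not-yet-eliminated neighbours. Note this only produces
    triangulations reachable by elimination; the paper shows that useful
    non-minimal triangulations can lie outside this family.
    """
    adj = {v: set(nbrs) for v, nbrs in adjacency.items()}
    fill, cliques = [], []
    for v in order:
        nbrs = adj[v]
        cliques.append({v} | nbrs)
        for a, b in combinations(nbrs, 2):
            if b not in adj[a]:          # add fill-in edge between remaining neighbours
                adj[a].add(b); adj[b].add(a)
                fill.append((a, b))
        for u in nbrs:
            adj[u].discard(v)
        del adj[v]
    return fill, cliques

def state_space(cliques, cardinalities):
    """Sum over the elimination cliques of the product of variable cardinalities."""
    total = 0
    for clique in cliques:
        size = 1
        for v in clique:
            size *= cardinalities[v]
        total += size
    return total

# Toy 4-cycle a-b-c-d with mixed cardinalities (illustrative values).
graph = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}
fill, cliques = triangulate_by_elimination(graph, ["a", "b", "c", "d"])
print(fill, state_space(cliques, {"a": 2, "b": 3, "c": 2, "d": 4}))
```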


international conference on acoustics, speech, and signal processing | 2009

Modelling the prepausal lengthening effect for speech recognition: a dynamic Bayesian network approach

Ning Ma; Chris D. Bartels; Jeff A. Bilmes; Phil D. Green

Speech has the property that the unit preceding a pause tends to lengthen. This work presents the use of a dynamic Bayesian network to model the prepausal lengthening effect for robust speech recognition. Specifically, we introduce two distributions to model inter-state transitions in prepausal and non-prepausal words, respectively. The selection between the two transition distributions depends on a random variable whose value is influenced by whether a pause will appear between the current and the following word. Two experiments are presented here. The first considers pauses hypothesised during speech decoding. The second employs an extra component for speech/non-speech determination. By modelling the prepausal lengthening effect we achieve a 5.5% relative reduction in word error rate on the 500-word task of the SVitchboard corpus.
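
The modeling idea can be illustrated with a geometric duration model: with self-loop probability p, the expected state duration is 1/(1-p) frames, so gating the transition distribution on a pause indicator lengthens prepausal units. The probabilities in the sketch below are invented for illustration and are not the distributions learned in the paper.

```python
def expected_state_duration(self_loop_prob):
    """Expected number of frames spent in a state under a geometric duration model."""
    return 1.0 / (1.0 - self_loop_prob)

def advance_probability(prepausal, advance_normal=0.40, advance_prepausal=0.25):
    """Pick the state-advance probability according to a pause indicator variable.

    The two values are illustrative assumptions, not parameters from the paper;
    the point is only that a smaller advance probability (larger self-loop)
    lengthens the expected duration of prepausal speech units.
    """
    return advance_prepausal if prepausal else advance_normal

for prepausal in (False, True):
    p_adv = advance_probability(prepausal)
    print(prepausal, expected_state_duration(1.0 - p_adv))
```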

Collaboration


Dive into Chris D. Bartels's collaborations.

Top Co-Authors

Jeff A. Bilmes
University of Washington

Simon King
University of Edinburgh

Karen Livescu
Toyota Technological Institute at Chicago

Özgür Çetin
International Computer Science Institute

Joe Frankel
University of Edinburgh