
Publication


Featured research published by Barbara Peskin.


international conference on acoustics, speech, and signal processing | 2003

The ICSI Meeting Corpus

Adam Janin; Don Baron; Jane Edwards; Daniel P. W. Ellis; David Gelbart; Nelson Morgan; Barbara Peskin; Thilo Pfau; Elizabeth Shriberg; Andreas Stolcke; Chuck Wooters

We have collected a corpus of data from natural meetings that occurred at the International Computer Science Institute (ICSI) in Berkeley, California over the last three years. The corpus contains audio recorded simultaneously from head-worn and table-top microphones, word-level transcripts of meetings, and various metadata on participants, meetings, and hardware. Such a corpus supports work in automatic speech recognition, noise robustness, dialog modeling, prosody, rich transcription, information retrieval, and more. We present details on the contents of the corpus, as well as rationales for the decisions that led to its configuration. The corpus has been delivered to the Linguistic Data Consortium (LDC).
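A corpus with this shape might be represented along the following lines. This is a hypothetical sketch only; the field names are illustrative guesses, not the actual schema of the LDC release.

```python
from dataclasses import dataclass, field

# Hypothetical record for one meeting, mirroring the corpus description
# above: parallel close-talking and table-top audio, word-level
# time-aligned transcripts, and participant/hardware metadata.
# Field names are illustrative, not the LDC release's schema.

@dataclass
class Utterance:
    speaker_id: str
    start_sec: float
    end_sec: float
    words: list[str]

@dataclass
class MeetingRecord:
    meeting_id: str
    headworn_wavs: dict[str, str]   # speaker_id -> close-talking channel path
    tabletop_wavs: list[str]        # distant-microphone channel paths
    transcript: list[Utterance]     # word-level, time-aligned
    metadata: dict[str, str] = field(default_factory=dict)  # participants, hardware
```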


international conference on acoustics, speech, and signal processing | 2003

The SuperSID project: exploiting high-level information for high-accuracy speaker recognition

Douglas A. Reynolds; Walter D. Andrews; Joseph P. Campbell; Jiri Navratil; Barbara Peskin; André Gustavo Adami; Qin Jin; David Klusacek; Joy S. Abramson; Radu Mihaescu; John J. Godfrey; Douglas A. Jones; Bing Xiang

The area of automatic speaker recognition has been dominated by systems using only short-term, low-level acoustic information, such as cepstral features. While these systems have indeed produced very low error rates, they ignore other levels of information beyond low-level acoustics that convey speaker information. Recently published work has shown that such high-level information can be used successfully in automatic speaker recognition systems and has the potential to improve accuracy and add robustness. For the 2002 JHU CLSP summer workshop, the SuperSID project (http://www.clsp.jhu.edu/ws2002/groups/supersid/) was undertaken to exploit these high-level information sources and dramatically increase speaker recognition accuracy on a defined NIST evaluation corpus and task. The paper provides an overview of the structure, data, task, tools, and accomplishments of this project. Wide-ranging approaches using pronunciation models, prosodic dynamics, pitch and duration features, phone streams, and conversational interactions were explored and developed. We show how these novel features and classifiers indeed provide complementary information and can be fused together to drive down the equal error rate on the 2001 NIST extended data task to 0.2%, a 71% relative reduction in error over the previous state of the art.
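Two of the moving parts here, score-level fusion of complementary subsystems and the equal error rate (EER) used to measure them, are easy to sketch. A toy Python version follows; the scores and weights are invented, and the project's actual fusion was more sophisticated than a fixed weighted sum.

```python
import numpy as np

# Toy sketch: weighted score fusion plus equal error rate (EER).
# All trial scores and weights below are made up for illustration.

def fuse(system_scores: list[np.ndarray], weights: list[float]) -> np.ndarray:
    """Combine per-trial scores from several subsystems."""
    return sum(w * s for w, s in zip(weights, system_scores))

def equal_error_rate(target: np.ndarray, impostor: np.ndarray) -> float:
    """Sweep thresholds; return the point where false-accept and
    false-reject rates are (approximately) equal."""
    thresholds = np.sort(np.concatenate([target, impostor]))
    far = np.array([np.mean(impostor >= t) for t in thresholds])
    frr = np.array([np.mean(target < t) for t in thresholds])
    i = int(np.argmin(np.abs(far - frr)))
    return (far[i] + frr[i]) / 2.0

# Example with made-up trial scores from two subsystems:
tgt = fuse([np.array([2.1, 3.0, 1.8]), np.array([1.5, 2.2, 2.0])], [0.6, 0.4])
imp = fuse([np.array([0.3, 1.0, -0.2]), np.array([0.5, 0.1, 0.9])], [0.6, 0.4])
print(equal_error_rate(tgt, imp))  # 0.0 for these separable toy scores
```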


international conference on acoustics, speech, and signal processing | 1996

Speaker normalization on conversational telephone speech

Steven Wegmann; Don McAllaster; Jeremy Orloff; Barbara Peskin

This paper reports on a simplified system for determining vocal tract normalization. Such normalization has led to significant gains in recognition accuracy by reducing variability among speakers and allowing the pooling of training data and the construction of sharper models. But standard methods for determining the warp scale have been extremely cumbersome, generally requiring multiple recognition passes. We present a new system for warp scale selection which uses a simple generic voiced speech model to rapidly select appropriate frequency scales. The selection is sufficiently streamlined that it can be moved completely into the front-end processing. Using this system on a standard test of the Switchboard Corpus, we have achieved relative reductions in word error rates of 12% over unnormalized gender-independent models and 6% over our best unnormalized gender-dependent models.
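The core of the warp-selection idea is small enough to sketch: score each candidate warp's features under one generic voiced-speech model and keep the best-fitting warp. A rough Python version, assuming the warped cepstral features have already been computed once per candidate warp (the frequency-warping front end itself is not shown):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Sketch of likelihood-based warp selection. `features_by_warp` maps
# each candidate warp factor to the (frames x dims) cepstra computed
# with that frequency warp; the warping front end is assumed.

def train_voiced_model(voiced_frames: np.ndarray,
                       n_components: int = 8) -> GaussianMixture:
    """Fit the single generic voiced-speech model on pooled training frames."""
    return GaussianMixture(n_components=n_components, random_state=0).fit(voiced_frames)

def select_warp(features_by_warp: dict[float, np.ndarray],
                voiced_gmm: GaussianMixture) -> float:
    """Return the warp whose features score highest under the voiced model."""
    return max(features_by_warp,
               key=lambda w: voiced_gmm.score(features_by_warp[w]))
```

Because only one small model is evaluated per candidate warp, the search is cheap enough to live in the front end, which is the streamlining the abstract describes.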


international conference on acoustics, speech, and signal processing | 2005

Structural metadata research in the EARS program

Yang Liu; Elizabeth Shriberg; Andreas Stolcke; Barbara Peskin; Jeremy Ang; Dustin Hillard; Mari Ostendorf; Marcus Tomalin; Philip C. Woodland; Mary P. Harper

Both human and automatic processing of speech require recognition of more than just words. In this paper we provide a brief overview of research on structural metadata extraction in the DARPA EARS rich transcription program. Tasks include detection of sentence boundaries, filler words, and disfluencies. Modeling approaches combine lexical, prosodic, and syntactic information, using various modeling techniques for knowledge source integration. The performance of these methods is evaluated by task, by data source (broadcast news versus spontaneous telephone conversations), and by whether transcriptions come from humans or from an (errorful) automatic speech recognizer. A representative sample of results shows that combining multiple knowledge sources (words, prosody, syntactic information) is helpful, that prosody is more helpful for news speech than for conversational speech, that word errors significantly impact performance, and that discriminative models generally provide benefit over maximum likelihood models. Important remaining issues, both technical and programmatic, are also discussed.
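As a concrete, if toy, illustration of the knowledge-source combination evaluated here, one can classify each interword boundary with a discriminative model over a lexical score and a couple of prosodic measurements. All feature values below are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy sentence-boundary detector: one row per interword boundary,
# combining a lexical cue (hypothetical LM boundary posterior) with
# prosodic cues (pause length in seconds, pitch-reset indicator).
X = np.array([[0.92, 0.61, 1.0],
              [0.08, 0.02, 0.0],
              [0.75, 0.30, 1.0],
              [0.11, 0.05, 0.0],
              [0.40, 0.25, 0.0],
              [0.85, 0.50, 1.0]])
y = np.array([1, 0, 1, 0, 0, 1])  # 1 = sentence boundary

clf = LogisticRegression().fit(X, y)  # discriminative knowledge-source combination
print(clf.predict_proba([[0.80, 0.45, 1.0]])[:, 1])  # P(boundary) at a new slot
```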


international conference on machine learning | 2005

Further progress in meeting recognition: the ICSI-SRI spring 2005 speech-to-text evaluation system

Andreas Stolcke; Xavier Anguera; Kofi Boakye; Özgür Çetin; Frantisek Grezl; Adam Janin; Arindam Mandal; Barbara Peskin; Chuck Wooters; Jing Zheng

We describe the development of our speech recognition system for the National Institute of Standards and Technology (NIST) Spring 2005 Meeting Rich Transcription (RT-05S) evaluation, highlighting improvements made since last year [1]. The system is based on the SRI-ICSI-UW RT-04F conversational telephone speech (CTS) recognition system, with meeting-adapted models and various audio preprocessing steps. This year's system features better delay-sum processing of distant microphone channels and energy-based crosstalk suppression for close-talking microphones. Acoustic modeling is improved by virtue of various enhancements to the background (CTS) models, including added training data, decision-tree based state tying, and the inclusion of discriminatively trained phone posterior features estimated by multilayer perceptrons. In particular, we make use of adaptation of both acoustic models and MLP features to the meeting domain. For distant microphone recognition we obtained considerable gains by combining and cross-adapting narrow-band (telephone) acoustic models with broadband (broadcast news) models. Language models (LMs) were improved with the inclusion of new meeting and web data. In spite of a lack of training data, we created effective LMs for the CHIL lecture domain. Results are reported on RT-04S and RT-05S meeting data. Measured on RT-04S conference data, we achieved an overall improvement of 17% relative in both MDM and IHM conditions compared to last year's evaluation system. Results on lecture data are comparable to the best reported results for that task.
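The delay-sum processing of distant microphone channels mentioned above reduces, at its simplest, to estimating each channel's delay against a reference and averaging the aligned signals. A bare-bones sketch follows; real systems re-estimate delays over time and weight channels by quality, neither of which is shown.

```python
import numpy as np

# Bare-bones delay-and-sum over equal-length distant-mic channels.
# Assumes a single global delay per channel; production systems
# track delays per segment and weight channels.

def estimate_delay(ref: np.ndarray, ch: np.ndarray, max_lag: int) -> int:
    """Lag (in samples) at which `ch` best aligns with `ref`."""
    lags = list(range(-max_lag, max_lag + 1))
    corrs = [np.dot(ref[max_lag:-max_lag], np.roll(ch, -lag)[max_lag:-max_lag])
             for lag in lags]
    return lags[int(np.argmax(corrs))]

def delay_and_sum(channels: list[np.ndarray], max_lag: int = 800) -> np.ndarray:
    """Align every channel to the first one, then average."""
    ref = channels[0]
    aligned = [np.roll(ch, -estimate_delay(ref, ch, max_lag)) for ch in channels]
    return np.mean(aligned, axis=0)
```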


international conference on acoustics, speech, and signal processing | 2002

Using prosodic and lexical information for speaker identification

Frederick Weber; Linda Manganaro; Barbara Peskin; Elizabeth Shriberg

We investigate the incorporation of larger time-scale information, such as prosody, into standard speaker ID systems. Our study is based on the Extended Data Task of the NIST 2001 Speaker ID evaluation, which provides much more test and training data than has traditionally been available to similar speaker ID investigations. In addition, we have had access to a detailed prosodic feature database of Switchboard-I conversations, including data not previously applied to speaker ID. We describe two baseline acoustic systems, an approach using Gaussian Mixture Models, and an LVCSR-based speaker ID system. These results are compared to and combined with two larger time-scale systems: a system based on an “idiolect” language model, and a system making use of the contents of the prosody database. We find that, with sufficient test and training data, suprasegmental information can significantly enhance the performance of traditional speaker ID systems.
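The “idiolect” system mentioned above scores how much better a speaker-specific word model predicts the test transcript than a background model does. A much-reduced sketch with bigram counts and add-one smoothing; real systems use far more data and more careful smoothing.

```python
from collections import Counter
from math import log

# Reduced sketch of idiolect scoring: log-likelihood ratio of the
# test transcript's bigrams under a speaker model versus a background
# model, with add-one smoothing. `n_bigram_types` is the smoothing
# denominator (size of the bigram space); all values are toy-scale.

def bigram_counts(words: list[str]) -> Counter:
    return Counter(zip(words, words[1:]))

def idiolect_score(test_words: list[str], speaker: Counter,
                   background: Counter, n_bigram_types: int) -> float:
    s_total = sum(speaker.values()) + n_bigram_types
    b_total = sum(background.values()) + n_bigram_types
    score = 0.0
    for bg in zip(test_words, test_words[1:]):
        p_s = (speaker[bg] + 1) / s_total
        p_b = (background[bg] + 1) / b_total
        score += log(p_s / p_b)
    return score  # > 0 favors the claimed speaker
```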


international conference on acoustics, speech, and signal processing | 2003

Meetings about meetings: research at ICSI on speech in multiparty conversations

Nelson Morgan; Don Baron; Sonali Bhagat; Hannah Carvey; Rajdip Dhillon; Jane Edwards; David Gelbart; Adam Janin; Ashley Krupski; Barbara Peskin; Thilo Pfau; Elizabeth Shriberg; Andreas Stolcke; Chuck Wooters

In early 2001, we reported (at the Human Language Technology meeting) the early stages of an ICSI (International Computer Science Institute) project on processing speech from meetings (in collaboration with other sites, principally SRI, Columbia, and UW). We report our progress from the first few years of this effort, including: the collection and subsequent release of a 75-meeting corpus (over 70 meeting-hours and up to 16 channels for each meeting); the development of a prosodic database for a large subset of these meetings, and its subsequent use for punctuation and disfluency detection; the development of a dialog annotation scheme and its implementation for a large subset of the meetings; and the improvement of both near-mic and far-mic speech recognition results for meeting speech test sets.


international conference on acoustics, speech, and signal processing | 1993

Application of large vocabulary continuous speech recognition to topic and speaker identification using telephone speech

Larry Gillick; Janet M. Baker; John S. Bridle; Melvyn J. Hunt; Yoshiko Ito; S. Lowe; Jeremy Orloff; Barbara Peskin; R. Roth; F. Scattone

The authors describe a novel approach to the problems of topic and speaker identification that makes use of large-vocabulary continuous speech recognition. A theoretical framework for dealing with these problems in a symmetric way is provided. Some empirical results on topic and speaker identification that have been obtained on the extensive Switchboard corpus of telephone conversations are presented.


international conference on acoustics, speech, and signal processing | 1996

Improvements in switchboard recognition and topic identification

Barbara Peskin; Sean Connolly; L. Gillick; Steve Lowe; Don McAllaster; Venkatesh Nagesha

We revisit a topic identification test on the Switchboard Corpus first reported by Gillick et al. (see Proc. ICASSP-93, 1993 and ARPA Workshop on Human Language Technology, 1993). This approach to topic ID uses a large vocabulary continuous speech recognizer as a front-end to transcribe the speech and then scores the transcripts using a set of topic-specific language models. Our recognition of conversational telephone speech has improved dramatically in the three years since the original test, with word error rates dropping from the 90% range to the 40% range. Changing only the recognition engine but otherwise leaving our 1993 topic ID system in place, the resulting rate of message misclassification drops from 33/120 in 1993 down to 1/120 now, the same error rate that we obtain from the true transcriptions. This paper describes the topic classification test and the many improvements to the recognition engine that made such a dramatic reduction possible.
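The topic ID back-end described here is straightforward to sketch: score the recognizer's transcript under each topic's language model and pick the winner. Below, unigram models with add-one smoothing stand in for the real topic-specific LMs.

```python
from collections import Counter
from math import log

# Stand-in for the topic ID back-end: unigram topic LMs with add-one
# smoothing; the actual system used richer topic-specific models.

def train_topic_lm(docs: list[list[str]]) -> Counter:
    counts = Counter()
    for doc in docs:
        counts.update(doc)
    return counts

def classify(transcript: list[str],
             topic_lms: dict[str, Counter], vocab_size: int) -> str:
    def score(counts: Counter) -> float:
        total = sum(counts.values()) + vocab_size
        return sum(log((counts[w] + 1) / total) for w in transcript)
    return max(topic_lms, key=lambda t: score(topic_lms[t]))

# Example with two tiny "topics":
lms = {"pets": train_topic_lm([["my", "dog", "barks"], ["cats", "sleep"]]),
       "cars": train_topic_lm([["engine", "oil", "change"], ["my", "car"]])}
print(classify(["the", "dog", "and", "cats"], lms, vocab_size=50))  # -> "pets"
```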


international conference on spoken language processing | 1996

Speaker verification through large vocabulary continuous speech recognition

Michael Newman; Larry Gillick; Yoshiko Ito; Don McAllaster; Barbara Peskin

The authors present a study of a speaker verification system for telephone data based on large-vocabulary speech recognition. After describing the recognition engine, they give details of the verification algorithm and draw comparisons with other systems. The system has been tested on a test set taken from the Switchboard corpus of conversational telephone speech, and they present results showing how performance varies with length of test utterance, and whether or not the training data has been transcribed. The dominant factor in performance appears to be channel or handset mismatch between training and testing data.

Collaboration


Dive into Barbara Peskin's collaboration.

Top Co-Authors

Chuck Wooters, International Computer Science Institute
Adam Janin, University of California
David Gelbart, University of California
Mari Ostendorf, University of Washington
Xavier Anguera, International Computer Science Institute
Ivan Bulyko, University of Washington
Jane Edwards, University of California
Kofi Boakye, University of California