Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Richard M. Schwartz is active.

Publication


Featured research published by Richard M. Schwartz.


IEEE Automatic Speech Recognition and Understanding Workshop | 2013

Score normalization and system combination for improved keyword spotting

Damianos Karakos; Richard M. Schwartz; Stavros Tsakalidis; Le Zhang; Shivesh Ranjan; Tim Ng; Roger Hsiao; Guruprasad Saikumar; Ivan Bulyko; Long Nguyen; John Makhoul; Frantisek Grezl; Mirko Hannemann; Martin Karafiát; Igor Szöke; Karel Vesely; Lori Lamel; Viet-Bac Le

We present two techniques that yield improved Keyword Spotting (KWS) performance under the ATWV/MTWV performance measures: (i) score normalization, in which the scores of different keywords become commensurate with each other and correspond more closely to the probability of being correct than raw posteriors; and (ii) system combination, in which the detections of multiple systems are merged and their scores interpolated with weights optimized using MTWV as the maximization criterion. Both score normalization and system combination yield significant gains in ATWV/MTWV, sometimes on the order of 8-10 points (absolute), across five different languages. A variant of these methods achieved the highest performance in the official surprise-language evaluation for the IARPA-funded Babel project in April 2013.
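The two techniques in the abstract can be sketched in a few lines. The snippet below is an illustrative simplification: the function names, the `(keyword, time, score)` detection tuples, and the sum-to-one normalization are assumptions for exposition, not the paper's exact formulation.

```python
from collections import defaultdict

def normalize_scores(detections):
    """Keyword-specific normalization sketch: divide each detection score
    by the total score mass of its keyword, so scores of rare and
    frequent keywords become commensurate with each other."""
    totals = defaultdict(float)
    for kw, t, score in detections:
        totals[kw] += score
    return [(kw, t, score / totals[kw]) for kw, t, score in detections]

def combine_systems(system_detections, weights):
    """System combination sketch: merge detections from multiple systems
    and interpolate their scores with per-system weights (which the
    paper tunes using MTWV as the maximization criterion)."""
    merged = defaultdict(float)
    for dets, w in zip(system_detections, weights):
        for kw, t, score in dets:
            merged[(kw, t)] += w * score
    return [(kw, t, s) for (kw, t), s in merged.items()]
```

For example, two detections of the same keyword scoring 0.9 and 0.3 normalize to 0.75 and 0.25, placing them on the same scale as a single 0.4 detection of another keyword, which normalizes to 1.0.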


CSREA HCI | 1996

Multiple-Pass Search Strategies

Richard M. Schwartz; Long Nguyen; John Makhoul

Large vocabulary speech recognition is very expensive computationally. We explore multi-pass search strategies as a way to reduce computation substantially, without any increase in error rate. We consider two basic strategies: the N-best Paradigm, and the Forward-Backward search. Both of these strategies operate on the entire sentence in (at least) two passes. The N-best Paradigm computes alternative hypotheses for a sentence, which can later be rescored using more detailed and more expensive knowledge sources. We present and compare many algorithms for finding the N-best sentence hypotheses, and suggest which are the most efficient and accurate. The Forward-Backward Search performs a time-synchronous forward search that finds all of the words that are likely to end at each frame within an utterance. Then, a second more expensive search can be performed in the backward direction, restricting its attention to those words found in the forward pass.
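The N-best paradigm described above can be sketched as a two-pass rescoring loop: a cheap first pass shortlists hypotheses, and only the shortlist is rescored with a more expensive knowledge source. This is a minimal illustration; the hypothesis strings, score lists, and `expensive_model` callable are placeholders, not the Byblos implementation.

```python
def nbest_rescore(hypotheses, cheap_scores, expensive_model, n=10):
    """N-best paradigm sketch: keep the n best hypotheses under the
    cheap first-pass scores, then rescore only those with a more
    expensive knowledge source (a callable returning a score) and
    return the top hypothesis after rescoring."""
    shortlist = sorted(zip(hypotheses, cheap_scores),
                       key=lambda p: p[1], reverse=True)[:n]
    rescored = [(hyp, expensive_model(hyp)) for hyp, _ in shortlist]
    return max(rescored, key=lambda p: p[1])[0]
```

The computational saving comes from calling the expensive model only n times, regardless of how large the first-pass search space was.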


International Conference on Acoustics, Speech, and Signal Processing | 2013

Developing a speaker identification system for the DARPA RATS project

Oldrich Plchot; Spyros Matsoukas; Pavel Matejka; Najim Dehak; Jeff Z. Ma; Sandro Cumani; Ondrej Glembek; Hynek Hermansky; Sri Harish Reddy Mallidi; Nima Mesgarani; Richard M. Schwartz; Mehdi Soufifar; Zheng-Hua Tan; Samuel Thomas; Bing Zhang; Xinhui Zhou

This paper describes the speaker identification (SID) system developed by the Patrol team for the first phase of the DARPA RATS (Robust Automatic Transcription of Speech) program, which seeks to advance state-of-the-art detection capabilities on audio from highly degraded communication channels. We present results using multiple SID systems differing mainly in the algorithm used for voice activity detection (VAD) and feature extraction. We show that (a) unsupervised VAD performs as well as supervised methods in terms of downstream SID performance, (b) noise-robust feature extraction methods such as CFCCs outperform MFCC front-ends on noisy audio, and (c) fusion of multiple systems provides a 24% relative improvement in EER over the single best system when using a novel SVM-based fusion algorithm that uses side information such as gender, language, and channel ID.
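The EER (equal error rate) reported above is the operating point where the false-accept and false-reject rates cross as a decision threshold is swept. A minimal sketch of its computation, using toy scores rather than the RATS data:

```python
def equal_error_rate(target_scores, nontarget_scores):
    """EER sketch: sweep a threshold over all observed scores and return
    the error rate at the point where the false-reject rate on targets
    and the false-accept rate on nontargets are closest."""
    thresholds = sorted(set(target_scores + nontarget_scores))
    best = (1.0, None)  # (|FA - FR| gap, candidate EER)
    for th in thresholds:
        fr = sum(s < th for s in target_scores) / len(target_scores)
        fa = sum(s >= th for s in nontarget_scores) / len(nontarget_scores)
        if abs(fa - fr) < best[0]:
            best = (abs(fa - fr), (fa + fr) / 2)
    return best[1]
```

A 24% relative improvement in EER, as reported for the fused system, would take an EER of 0.25 on scores like these down to roughly 0.19.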


IEEE Automatic Speech Recognition and Understanding Workshop | 2013

Discriminative semi-supervised training for keyword search in low resource languages

Roger Hsiao; Tim Ng; Frantisek Grezl; Damianos Karakos; Stavros Tsakalidis; Long Nguyen; Richard M. Schwartz

In this paper, we investigate semi-supervised training for low-resource languages where the initial systems may have a high error rate (≥ 70.0% word error rate). To address the lack of data, we study semi-supervised techniques including data selection, data weighting, discriminative training, and multilayer perceptron learning to improve system performance. The entire suite of semi-supervised methods presented in this paper was evaluated under the IARPA Babel program for keyword spotting tasks. Our semi-supervised system had the best performance in the OpenKWS13 surprise-language evaluation for the limited condition. We describe our work on the Turkish and Vietnamese systems.
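The data selection and weighting step of semi-supervised training can be sketched as follows: automatic transcripts from the seed system are kept only when their confidence clears a threshold, and each kept utterance carries its confidence as a training weight. The function names, the 0.7 threshold, and the `confidence` callable are illustrative assumptions, not the paper's settings.

```python
def select_semi_supervised(utterances, confidence, threshold=0.7):
    """Data selection/weighting sketch: keep (utterance, hypothesis)
    pairs whose automatic transcript confidence is at or above the
    threshold, and attach the confidence as a training weight so less
    reliable transcripts contribute less."""
    selected = []
    for utt, hyp in utterances:
        c = confidence(hyp)
        if c >= threshold:
            selected.append((utt, hyp, c))  # weight = confidence
    return selected
```

With initial error rates at or above 70%, this filtering matters: most automatic transcripts are unreliable, so training on everything equally would reinforce the seed system's mistakes.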


International Conference on Acoustics, Speech, and Signal Processing | 2014

The 2013 BBN Vietnamese telephone speech keyword spotting system

Stavros Tsakalidis; Roger Hsiao; Damianos Karakos; Tim Ng; Shivesh Ranjan; Guruprasad Saikumar; Le Zhang; Long Nguyen; Richard M. Schwartz; John Makhoul

In this paper we describe the Vietnamese conversational telephone speech keyword spotting system developed under the IARPA Babel program for the 2013 evaluation conducted by NIST. The system contains several recently developed novel methods that significantly improve speech-to-text and keyword spotting performance, such as stacked bottleneck neural network features, white listing, score normalization, and improvements to semi-supervised training methods. These methods resulted in the highest performance in the official IARPA Babel surprise-language evaluation of 2013.


International Conference on Acoustics, Speech, and Signal Processing | 2014

Normalization of phonetic keyword search scores

Damianos Karakos; Ivan Bulyko; Richard M. Schwartz; Stavros Tsakalidis; Long Nguyen; John Makhoul

As shown in [1, 2], score normalization is of crucial importance for improving the Average Term-Weighted Value (ATWV) measure that is commonly used for evaluating keyword spotting systems. In this paper, we compare three different methods for score normalization within a keyword spotting system that employs phonetic search. We show that a new unsupervised linear fit method results in better-estimated posterior scores that, when fed into the keyword-specific normalization of [1], result in ATWV gains of 3% on average. Furthermore, when these scores are used as features within a supervised machine learning framework, they result in additional gains of 3.8% on average over the five languages used in the first year of the IARPA-funded Babel project.
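The ATWV measure driving these comparisons is the standard NIST keyword-search metric: per keyword, TWV = 1 - P_miss - β·P_fa, averaged over keywords, with β = 999.9 in the Babel evaluations. A minimal sketch of its computation; the keyword statistics below are toy values, not evaluation results.

```python
def atwv(stats, audio_seconds, beta=999.9):
    """ATWV sketch: stats maps keyword -> (n_true, n_correct, n_fa).
    P_miss = 1 - n_correct/n_true; P_fa is false alarms per second of
    audio not containing the keyword; TWV = 1 - P_miss - beta * P_fa,
    averaged over keywords."""
    twvs = []
    for n_true, n_correct, n_fa in stats.values():
        p_miss = 1.0 - n_correct / n_true
        p_fa = n_fa / (audio_seconds - n_true)
        twvs.append(1.0 - p_miss - beta * p_fa)
    return sum(twvs) / len(twvs)
```

The large β makes each false alarm costly, which is why the score normalization studied in this paper (placing scores of different keywords on a comparable scale before thresholding) translates directly into ATWV gains.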


The Second International Conference | 2002

Japanese broadcast news transcription demonstration

Long Nguyen; Xuefeng Guo; Richard M. Schwartz; John Makhoul; Toru Imai; Akio Kobayashi; Atsushi Matsui; Akio Ando

In this demonstration, we show real-time transcription of TV broadcast news in Japanese using a very large vocabulary speech recognition system developed at BBN Technologies. Both signal processing and speech recognition run on a commodity notebook computer. The transcription word error rate is about 1.5%, with an average word latency of less than 2 seconds. The high recognition accuracy in real time is achieved by a fast, low-latency, two-pass Byblos recognizer utilizing good acoustic and language models trained on the NHK Broadcast News Corpus.


The Second International Conference | 2002

MeetingLogger: rich transcription of courtroom speech

Rohit Prasad; Long Nguyen; Richard M. Schwartz; John Makhoul

In this paper we describe our ongoing effort to develop a speech recognition system for transcribing courtroom hearings. Court hearings are a rich source of naturally occurring speech data, much of which is in the public domain. The presence of multiple microphones, coupled with noise and reverberation, makes the problem simultaneously rich and challenging. We have exploited the availability of multiple channels to mitigate, to some extent, the severe noise problem prevalent in courtroom speech. By using a novel technique for channel change detection, domain-specific language modeling, and unsupervised channel adaptation, we have achieved a word error rate (WER) of 36% with an acoustic model trained on 150 hours of broadcast news data. We also report on preliminary acoustic modeling experiments with the legal transcripts provided with 120 hours of courtroom speech training data.


Archive | 1996

Language-independent and segmentation-free optical character recognition system and method

John Makhoul; Richard M. Schwartz


Archive | 1997

Practical Implementations of Speaker-Adaptive Training

Spyros Matsoukas; Richard M. Schwartz; Hubert Jin; Nguyen Thanh Long

Collaboration


Dive into Richard M. Schwartz's collaborations.

Top Co-Authors

Roger Hsiao

Carnegie Mellon University
