S. Roucos
University of Florida
Publications
Featured research published by S. Roucos.
international conference on acoustics, speech, and signal processing | 1985
Richard M. Schwartz; Yen-Lu Chow; Owen Kimball; S. Roucos; M. Krasner; J. Makhoul
This paper describes the results of our work in designing a system for phonetic recognition of unrestricted continuous speech. We describe several algorithms used to recognize phonemes using context-dependent Hidden Markov Models of the phonemes. We present results for several variations of the parameters of the algorithms. In addition, we propose a technique that makes it possible to integrate traditional acoustic-phonetic features into a hidden Markov process. The categorical decisions usually associated with heuristic acoustic-phonetic algorithms are replaced by automated training techniques and global search strategies. The combination of general spectral information and specific acoustic-phonetic features is shown to result in more accurate phonetic recognition than either representation by itself.
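The integration idea lends itself to a short illustration: if the general spectral observations and an acoustic-phonetic feature are treated as two observation streams of one HMM state, the state's emission score is the product of the two stream densities. The sketch below assumes a diagonal Gaussian for the spectral stream and a discrete pmf for a binary feature such as voicing; the conditional-independence assumption and all names are illustrative, not the paper's exact formulation.

```python
import numpy as np

# Minimal sketch: scoring one HMM state when a general spectral observation
# and a discrete acoustic-phonetic feature are combined as two streams. The
# stream independence, the diagonal Gaussian, and the "voicing" feature are
# illustrative assumptions, not the paper's exact formulation.

def log_gaussian(x, mean, var):
    """Diagonal-covariance Gaussian log density for the spectral stream."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def emission_log_prob(spectral_obs, feature_obs, state):
    """Joint log score of both streams for one state."""
    spectral_lp = log_gaussian(spectral_obs, state["mean"], state["var"])
    feature_lp = np.log(state["feature_pmf"][feature_obs])
    return spectral_lp + feature_lp  # streams assumed conditionally independent

# Toy usage: a 12-dim spectral vector plus a binary voicing feature.
rng = np.random.default_rng(0)
state = {"mean": np.zeros(12), "var": np.ones(12),
         "feature_pmf": np.array([0.3, 0.7])}
print(emission_log_prob(rng.normal(size=12), 1, state))
```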
international conference on acoustics, speech, and signal processing | 1987
Y.-L. Chow; M. O. Dunham; Owen Kimball; M. Krasner; G. F. Kubala; J. Makhoul; P. Price; S. Roucos; Richard M. Schwartz
In this paper, we describe BYBLOS, the BBN continuous speech recognition system. The system, designed for large-vocabulary applications, integrates acoustic, phonetic, lexical, and linguistic knowledge sources to achieve high recognition performance. The basic approach, as described in previous papers [1, 2], makes extensive use of robust context-dependent models of phonetic coarticulation using Hidden Markov Models (HMMs). We describe the components of the BYBLOS system, including the signal processing frontend, dictionary, phonetic model training system, word model generator, grammar, and decoder. In recognition experiments, we demonstrate consistently high word recognition performance on continuous speech across speakers, task domains, and grammars of varying complexity. In speaker-dependent mode, where 15 minutes of speech from each speaker is required for training, 98.5% word accuracy has been achieved in continuous speech for a 350-word task, using grammars with perplexity ranging from 30 to 60. With only 15 seconds of training speech, we demonstrate 97% word accuracy using a grammar.
international conference on acoustics, speech, and signal processing | 1984
Richard M. Schwartz; Yen-Lu Chow; S. Roucos; Michael A. Krasner; John Makhoul
This paper discusses the use of the Hidden Markov Model (HMM) in phonetic recognition. In particular, we present improvements that deal with the problems of modeling the effect of phonetic context and of robust pdf estimation. The effect of phonetic context is taken into account by conditioning the probability density functions (pdfs) of the acoustic parameters on the adjacent phonemes, only to the extent that there are sufficient tokens of the phoneme in that context. This partial conditioning is achieved by combining the conditioned and unconditioned pdf models with weights that depend on the confidence in each pdf estimate. This combination is shown to result in better performance than either model by itself. We also show that it is possible to obtain the computational advantages of using discrete probability densities without the usual requirement for large amounts of training data.
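A minimal sketch of the partial-conditioning step, assuming a count-based weight n / (n + k) as the confidence measure (the abstract does not specify the exact form; k is a tunable constant introduced here for illustration):

```python
# Sketch of partial conditioning: interpolate a context-dependent pdf value
# with a context-independent one, weighting by confidence in the
# context-dependent estimate. The count-based weight n / (n + k) is an
# assumed stand-in for the paper's confidence measure.

def interpolated_pdf(p_conditioned, p_unconditioned, n_tokens, k=50.0):
    """Blend pdf values by how many training tokens supported the context."""
    lam = n_tokens / (n_tokens + k)  # -> 1 with ample data, -> 0 when sparse
    return lam * p_conditioned + (1.0 - lam) * p_unconditioned

# With 5 tokens of a phoneme in this context, the estimate leans on the
# robust unconditioned pdf; with 500 tokens it trusts the context model.
print(interpolated_pdf(0.02, 0.08, n_tokens=5))    # close to 0.08
print(interpolated_pdf(0.02, 0.08, n_tokens=500))  # close to 0.02
```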
international conference on acoustics, speech, and signal processing | 1982
Richard M. Schwartz; S. Roucos; Michael G. Berouti
Most text-independent speaker identification methods to date depend on the use of some distance metric for classification. In this paper we develop the use of probability density function (pdf) estimation for text-independent speaker identification. We compare the performance of two parametric and one non-parametric pdf estimation methods to one distance classification method that uses the Mahalanobis distance. Under all conditions tested, the pdf estimation methods performed substantially better than the Mahalanobis distance method. The best method is a non-parametric pdf estimation method.
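The contrast between the two decision rules can be sketched as follows, assuming a single multivariate Gaussian per speaker as the parametric pdf estimate (the non-parametric estimator the paper reports as best is not shown; all names and signatures here are illustrative):

```python
import numpy as np

# Sketch contrasting a Mahalanobis-distance classifier with a parametric
# (Gaussian) pdf classifier for text-independent speaker identification.

def mahalanobis_score(x, mean, cov_inv):
    d = x - mean
    return float(d @ cov_inv @ d)  # smaller is better

def gaussian_log_likelihood(x, mean, cov):
    # Dropping the constant -0.5*d*log(2*pi), which is the same for all
    # speakers of equal feature dimension and so does not affect the argmax.
    d = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return float(-0.5 * (logdet + d @ np.linalg.inv(cov) @ d))  # larger is better

def identify(x, speakers, rule="pdf"):
    """speakers: dict name -> (mean, cov) estimated from that speaker's data."""
    if rule == "pdf":
        return max(speakers, key=lambda s: gaussian_log_likelihood(x, *speakers[s]))
    return min(speakers,
               key=lambda s: mahalanobis_score(x, speakers[s][0],
                                               np.linalg.inv(speakers[s][1])))

# Toy usage with three synthetic speaker models.
rng = np.random.default_rng(2)
speakers = {f"spk{i}": (rng.normal(size=4), np.eye(4)) for i in range(3)}
x = speakers["spk1"][0] + 0.1 * rng.normal(size=4)
print(identify(x, speakers), identify(x, speakers, rule="mahalanobis"))
```

Note that the Gaussian log-likelihood differs from the Mahalanobis rule by the log-determinant term, which retains per-speaker covariance-scale information that a pure distance metric discards.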
international conference on acoustics, speech, and signal processing | 1988
Francis Kubala; Yen-Lu Chow; A. Derr; M.-W. Feng; O. Kimball; J. Makhoul; P. Price; J. Rohlicek; S. Roucos; Richard M. Schwartz; J. Vandegrift
The system was trained in a speaker-dependent mode on 28 minutes of speech from each of 8 speakers, and was tested on independent test material for each speaker. The system was tested with three artificial grammars spanning a broad perplexity range. The average performance of the system measured in percent word error was: 1.4% for a pattern grammar of perplexity 9, 7.5% for a word-pair grammar of perplexity 62, and 32.4% for a null grammar of perplexity 1000.
international conference on acoustics, speech, and signal processing | 1985
Herbert Gish; Kenneth F. Karnofsky; M. Krasner; S. Roucos; Richard M. Schwartz; Jared J. Wolf
In this paper, we examine several methods for text-independent speaker identification of telephone speech with limited-duration data. The issue addressed is the assessment of channel characteristics, especially linear aspects, and methods for improving speaker identification performance when the speaker to be identified is on a different telephone channel from the one used for the training data. We show experimental evidence illustrating the cross-channel problem and also show that the direct approach of using simple channel-invariant features can discard much speaker-dependent information. The methods we have found to be most effective rely on the training process to incorporate channel variability.
international conference on acoustics, speech, and signal processing | 1987
S. Roucos; Alexander MacLeod Wilgus; William E. Russell
In previous papers, we have described the segment vocoder, which transmits intelligible speech at 300 b/s in speaker-independent mode, i.e., new users need not train the system. As expected for vector quantizers, the storage and computational requirements of the segment vocoder are significantly larger than those of the standard LPC-10 vocoder. In this paper, we describe methods for reducing computational and storage requirements of the segment vocoder and present an algorithm that is implementable in real-time on hardware containing several Digital Signal Processing chips. The DRT score of the simplified algorithm is 78%.
international conference on acoustics, speech, and signal processing | 1988
S. Roucos; M. Ostendorf; Herbert Gish; A. Derr
A probabilistic model called the stochastic segment model is introduced that describes the statistical dependence of all the frames of a speech segment. The model uses a time-warping transformation to map the sequence of observed frames to the appropriate frames of the segment model. The joint density of the observed frames is then given by the joint density of the selected model frames. The automatic training and recognition algorithms are discussed and a few preliminary recognition results are presented.
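A compact sketch of the scoring step: each observed frame is assigned to a model frame by a time warp, and the segment score is the density of the observations under the selected model frames. The linear warp and the frame-independent diagonal Gaussians below are simplifying assumptions; the paper's model is precisely meant to capture dependence across frames.

```python
import numpy as np

# Sketch of stochastic-segment-model scoring under simplifying assumptions:
# a linear time warp from observed frames to model frames, and independent
# diagonal-Gaussian model frames.

def segment_log_prob(frames, model_means, model_vars):
    """frames: (n, d) observed segment; the model has m Gaussian frames."""
    n, m = len(frames), len(model_means)
    idx = np.floor(np.arange(n) * m / n).astype(int)  # observed -> model frame
    mu, var = model_means[idx], model_vars[idx]
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (frames - mu) ** 2 / var)

# Toy usage: a 23-frame observation scored against a 10-frame segment model.
rng = np.random.default_rng(3)
model_means, model_vars = rng.normal(size=(10, 14)), np.ones((10, 14))
print(segment_log_prob(rng.normal(size=(23, 14)), model_means, model_vars))
```

Recognition would score a hypothesized segment against each phone's model and select the highest-scoring one.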
international conference on acoustics, speech, and signal processing | 1983
Richard M. Schwartz; S. Roucos
In this paper we discuss several algorithms that can be used to reduce the transmission rate for LPC vocoded speech to around 300 to 400 b/s, with only a modest degradation in speech quality relative to that of fixed-rate 2400 b/s LPC vocoders. We limit the discussion to vocoders that transmit information for single frames (as opposed to whole segments of speech). We start with vector quantization, which reduces the bit rate to around 800 b/s accompanied by a significant but tolerable loss in quality relative to a typical fixed-rate 2400 b/s vocoder. Then we reduce the frame rate using one of two techniques: Fixed-Rate Transmission with Variable Interpolation, or Optimal Variable-Frame-Rate Transmission. We also reduce the data rate necessary for the source parameters (pitch, voicing, gain) from 400 b/s to about 100 b/s by taking advantage of their statistical dependence on the spectrum and some perceptual factors. The final result at 300 b/s has a quality comparable to that of the fixed-rate 800 b/s vector quantization vocoder. At 400 b/s, the quality is, in many respects, better than that of the 800 b/s vocoder and comparable to the 2400 b/s LPC vocoder.
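The frame-rate reduction can be illustrated with a simple variable-frame-rate rule: transmit a frame only when its spectral distance from the last transmitted frame exceeds a threshold, and let the receiver interpolate across the gap. The Euclidean distance and fixed threshold below are placeholders for the paper's actual distortion measure and rate control:

```python
import numpy as np

# Sketch of a variable-frame-rate transmission rule: send a frame only when
# it differs enough from the last frame sent. Distance measure and threshold
# are illustrative placeholders, not the paper's specification.

def select_frames(frames, threshold=1.0):
    """Return indices of frames to transmit; frames is an (n, d) array."""
    kept = [0]  # always send the first frame
    for i in range(1, len(frames)):
        if np.linalg.norm(frames[i] - frames[kept[-1]]) > threshold:
            kept.append(i)
    return kept

# Toy usage on a slowly drifting synthetic parameter track.
rng = np.random.default_rng(1)
frames = np.cumsum(rng.normal(scale=0.3, size=(50, 10)), axis=0)
idx = select_frames(frames)
print(f"transmitted {len(idx)} of {len(frames)} frames")
```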
international conference on acoustics, speech, and signal processing | 1988
J.R. Rohlicek; Yen-Lu Chow; S. Roucos
Statistical language models have been successfully used to improve the performance of continuous speech recognition algorithms. Applying such techniques is difficult when only a small training corpus is available. The authors present an approach for dealing with the limited training data available from the DARPA resource management domain. An initial training corpus of sentences was abstracted by replacing sentence fragments or phrases with variables. This training corpus of phrase sequences was used to derive the parameters of a Markov model. The probability of a word sequence is then decomposed into the probability of a phrase sequence and the probabilities of the word sequences within each of the phrases. Initial results obtained on 150 utterances from six speakers in the DARPA database indicate that this language modeling technique has potential for improved recognition performance. Furthermore, this approach provides a framework for incorporating linguistic knowledge into statistical language models.
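The decomposition can be sketched for a single known phrase segmentation (the full model would sum over all possible segmentations): the sentence score is a Markov probability over phrase labels times the word probabilities within each phrase. The phrase labels and probabilities below are hypothetical toy values, not the paper's:

```python
import math

# Sketch of the phrase-based decomposition: P(words) = P(phrase sequence)
# times the product of P(words | phrase) within each phrase, here for one
# fixed segmentation.

def sentence_log_prob(segmentation, phrase_bigram, word_given_phrase):
    """segmentation: list of (phrase_label, [words]) pairs."""
    lp, prev = 0.0, "<s>"
    for label, words in segmentation:
        lp += math.log(phrase_bigram[(prev, label)])     # phrase-sequence model
        for w in words:
            lp += math.log(word_given_phrase[label][w])  # words within a phrase
        prev = label
    return lp

# Toy usage with hypothetical phrase classes in the resource-management style.
bigram = {("<s>", "SHOW-REQ"): 0.6, ("SHOW-REQ", "SHIP-SET"): 0.5}
lexicon = {"SHOW-REQ": {"show": 0.7, "me": 0.3},
           "SHIP-SET": {"the": 0.4, "ships": 0.6}}
seg = [("SHOW-REQ", ["show", "me"]), ("SHIP-SET", ["the", "ships"])]
print(sentence_log_prob(seg, bigram, lexicon))
```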