Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Aaron E. Rosenberg is active.

Publication


Featured research published by Aaron E. Rosenberg.


IEEE Transactions on Acoustics, Speech, and Signal Processing | 1980

Performance tradeoffs in dynamic time warping algorithms for isolated word recognition

Cory S. Myers; Lawrence R. Rabiner; Aaron E. Rosenberg

The technique of dynamic programming for the time registration of a reference and a test pattern has found widespread use in the area of isolated word recognition. Recently, a number of variations on the basic time warping algorithm have been proposed by Sakoe and Chiba, and Rabiner, Rosenberg, and Levinson. These algorithms all assume that the test input is the time pattern of a feature vector from an isolated word whose endpoints are known (at least approximately). The major differences in the methods are the global path constraints (i.e., the region of possible warping paths), the local continuity constraints on the path, and the distance weighting and normalization used to give the overall minimum distance. The purpose of this investigation is to study the effects of such variations on the performance of different dynamic time warping algorithms for a realistic speech database. The performance measures that were used include: speed of operation, memory requirements, and recognition accuracy. The results show that both axis orientation and relative length of the reference and the test patterns are important factors in recognition accuracy. Our results suggest a new approach to dynamic time warping for isolated words in which both the reference and test patterns are linearly warped to a fixed length, and then a simplified dynamic time warping algorithm is used to handle the nonlinear component of the time alignment. Results with this new algorithm show performance comparable to or better than that of all other dynamic time warping algorithms that were studied.
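The alignment recursion that all of the compared algorithms share can be sketched as follows. This is a minimal illustration using the simplest symmetric local constraints and a plain Euclidean frame distance; the variants studied in the paper differ precisely in these constraint, weighting, and normalization choices, and the function name is illustrative, not the paper's.

```python
import numpy as np

def dtw_distance(ref, test):
    """Dynamic time warping distance between two feature sequences.

    ref, test: arrays of shape (n_frames, n_features).
    Uses Euclidean frame distances and the simplest symmetric local
    constraint (diagonal, horizontal, and vertical steps).
    """
    n, m = len(ref), len(test)
    # Local frame-to-frame distance matrix.
    local = np.linalg.norm(ref[:, None, :] - test[None, :, :], axis=2)
    acc = np.full((n, m), np.inf)
    acc[0, 0] = local[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                acc[i - 1, j] if i > 0 else np.inf,
                acc[i, j - 1] if j > 0 else np.inf,
                acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
            )
            acc[i, j] = local[i, j] + best_prev
    # Normalize the accumulated distance by a path-length proxy.
    return acc[-1, -1] / (n + m)
```

Because the warp can absorb locally repeated frames, a sequence aligned against a time-stretched copy of itself still scores zero distance.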


International Conference on Acoustics, Speech, and Signal Processing | 1985

A vector quantization approach to speaker recognition

Frank K. Soong; Aaron E. Rosenberg; Lawrence R. Rabiner; Biing-Hwang Juang

In this study a vector quantization (VQ) codebook was used as an efficient means of characterizing the short-time spectral features of a speaker. A set of such codebooks was then used to recognize the identity of an unknown speaker from his/her unlabelled spoken utterances based on a minimum distance (distortion) classification rule. A series of speaker recognition experiments was performed using a 100-talker (50 male and 50 female) telephone recording database consisting of isolated digit utterances. For ten random but different isolated digits, over 98% speaker identification accuracy was achieved. The effects on performance of different system parameters, such as codebook size, the number of test digits, phonetic richness of the text, and differences in recording sessions, were also studied in detail.
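The minimum-distance (distortion) classification rule described above can be sketched as follows. The function names are illustrative; a real system would quantize LPC-derived spectral vectors rather than the toy two-dimensional features in the test below.

```python
import numpy as np

def average_distortion(frames, codebook):
    """Average distance from each feature frame to its nearest codeword."""
    d = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()

def identify_speaker(frames, codebooks):
    """Return the speaker whose codebook quantizes the utterance
    with minimum average distortion.

    codebooks: dict mapping speaker id -> codebook array.
    """
    scores = {spk: average_distortion(frames, cb)
              for spk, cb in codebooks.items()}
    return min(scores, key=scores.get)
```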


IEEE Transactions on Acoustics, Speech, and Signal Processing | 1981

An improved endpoint detector for isolated word recognition

L. Lamel; Lawrence R. Rabiner; Aaron E. Rosenberg; Jay Gordon Wilpon

Accurate location of the endpoints of an isolated word is important for reliable and robust word recognition. The endpoint detection problem is nontrivial for nonstationary backgrounds where artifacts (i.e., nonspeech events) may be introduced by the speaker, the recording environment, and the transmission system. Several techniques for the detection of the endpoints of isolated words recorded over a dialed-up telephone line were studied. The techniques were broadly classified as either explicit, implicit, or hybrid in concept. The explicit techniques for endpoint detection locate the endpoints prior to and independent of the recognition and decision stages of the system. For the implicit methods, the endpoints are determined solely by the recognition and decision stages of the system, i.e., there is no separate stage for endpoint detection. The hybrid techniques incorporate aspects from both the explicit and implicit methods. Investigations showed that the hybrid techniques consistently provided the best estimates for both of the word endpoints and, correspondingly, the highest recognition accuracy of the three classes studied. A hybrid endpoint detector is proposed which gives a rejection rate of less than 0.5 percent, while providing recognition accuracy close to that obtained from hand-edited endpoints.
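A minimal sketch of an explicit endpoint detector of the kind classified above, using only short-time energy compared against a background estimate. The frame sizes, the leading-silence assumption, and the threshold factor are illustrative assumptions, not the paper's parameters.

```python
import numpy as np

def explicit_endpoints(signal, frame_len=240, hop=80, energy_factor=4.0):
    """Locate word endpoints from short-time energy (an explicit method).

    Frames whose energy exceeds energy_factor times the background
    estimate (taken from the first few frames, assumed to be silence)
    are labeled speech; the first and last such frames define the
    endpoints, returned as sample indices. Returns None if no speech
    frame is found.
    """
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    energy = np.array([np.sum(f.astype(float) ** 2) for f in frames])
    background = energy[:5].mean()  # assumes the recording starts in silence
    speech = np.where(energy > energy_factor * background)[0]
    if len(speech) == 0:
        return None
    return int(speech[0] * hop), int(speech[-1] * hop + frame_len)
```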


IEEE Transactions on Acoustics, Speech, and Signal Processing | 1988

On the use of instantaneous and transitional spectral information in speaker recognition

Frank K. Soong; Aaron E. Rosenberg

The use of instantaneous and transitional spectral representations of spoken utterances for speaker recognition is investigated. Linear-predictive-coding (LPC)-derived cepstral coefficients are used to represent instantaneous spectral information, and best linear fits of each cepstral coefficient over a specified time window are used to represent transitional information. An evaluation has been carried out using a database of isolated digit utterances over dialed-up telephone lines by 10 talkers. Two vector quantization (VQ) codebooks, instantaneous and transitional, were constructed from each speaker's training utterances. The experimental results show that the instantaneous and transitional representations are relatively uncorrelated, thus providing complementary information for speaker recognition. A rectangular window of approximately 100 ms duration provides an effective estimate of the transitional spectral features for speaker recognition. Also, simple transmission channel variations are shown to affect both the instantaneous spectral representations and the corresponding recognition performance significantly, while the transitional representations and performance are relatively resistant.
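The "best linear fit over a specified time window" amounts to a least-squares slope computed per cepstral coefficient. A sketch, with an illustrative window half-width (a half-width of 4 frames at a 12.5 ms hop would roughly match the ~100 ms window found effective above):

```python
import numpy as np

def delta_features(cepstra, window=4):
    """Transitional features: least-squares slope of each cepstral
    coefficient over a window of 2*window + 1 frames.

    cepstra: array of shape (n_frames, n_coeffs). Edges are handled
    by repeating the boundary frames.
    """
    n = len(cepstra)
    lags = np.arange(-window, window + 1)
    denom = np.sum(lags ** 2)
    padded = np.pad(cepstra, ((window, window), (0, 0)), mode="edge")
    deltas = np.zeros_like(cepstra, dtype=float)
    for t in range(n):
        seg = padded[t:t + 2 * window + 1]
        # Closed-form least-squares slope: sum(l * c[t+l]) / sum(l^2).
        deltas[t] = lags @ seg / denom
    return deltas
```

On a linear ramp the interior slopes come out exactly equal to the ramp's slope, which is a quick sanity check for the regression formula.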


IEEE Transactions on Acoustics, Speech, and Signal Processing | 1979

Speaker-independent recognition of isolated words using clustering techniques

Lawrence R. Rabiner; Stephen E. Levinson; Aaron E. Rosenberg; Jay Gordon Wilpon

A speaker-independent isolated word recognition system is described which is based on the use of multiple templates for each word in the vocabulary. The word templates are obtained from a statistical clustering analysis of a large database consisting of 100 replications of each word (i.e., once by each of 100 talkers). The recognition system, which accepts telephone quality speech input, is based on an LPC analysis of the unknown word, dynamic time warping of each reference template to the unknown word (using the Itakura LPC distance measure), and the application of a K-nearest neighbor (KNN) decision rule. Results for several test sets of data are presented. They show error rates that are comparable to, or better than, those obtained with speaker-trained isolated word recognition systems.
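The KNN decision rule over multiple templates per word can be sketched as follows, assuming the per-template distances have already been computed (e.g., by dynamic time warping with the Itakura LPC distance). The function name and the default k are illustrative.

```python
import numpy as np

def knn_recognize(distances, template_labels, k=2):
    """K-nearest-neighbor decision rule over per-template distances.

    distances: distance from the unknown word to every reference
    template; template_labels: the vocabulary word each template
    represents. The word with the smallest average distance over its
    k closest templates wins.
    """
    distances = np.asarray(distances, dtype=float)
    labels = np.asarray(template_labels)
    best_word, best_score = None, np.inf
    for word in np.unique(labels):
        d = np.sort(distances[labels == word])
        score = d[:k].mean()
        if score < best_score:
            best_word, best_score = word, score
    return best_word
```

With k > 1 a single atypical template cannot decide the outcome on its own, which is the point of clustering multiple templates per word.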


International Conference on Acoustics, Speech, and Signal Processing | 1996

Speaker background models for connected digit password speaker verification

Aaron E. Rosenberg; Sarangarajan Parthasarathy

Likelihood ratio or cohort normalized scoring has been shown to be effective for improving the performance of speaker verification systems. An important problem in this connection is the establishment of principles for constructing speaker background or cohort models which provide the most effective normalized scores. Several kinds of speaker background models are studied. These include individual speaker models, models constructed from the pooled utterances of different numbers of speakers, models selected on the basis of similarity with customer models, models constructed from random selections of speakers, and models constructed from databases recorded under different conditions than the customer models. The results of experiments show that pooled models based on similarity to the reference speaker perform better than individual cohort models from the same similar set of speakers. Pooled background models from a small number of speakers based on similarity perform best overall, but not significantly better than a random selection of 40 or more gender-balanced speakers with training conditions matched to the reference speakers.
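Cohort-normalized scoring amounts to a log-likelihood ratio between the claimed speaker's model and the background set. A minimal sketch (the averaging of cohort likelihoods in the probability domain is one common choice, and the zero threshold is illustrative):

```python
import numpy as np

def cohort_normalized_score(target_loglik, cohort_logliks):
    """Cohort-normalized score: the target model's log-likelihood minus
    the log of the average likelihood over the cohort models."""
    cohort = np.asarray(cohort_logliks, dtype=float)
    # log-mean-exp of the cohort scores, computed stably.
    m = cohort.max()
    log_avg = m + np.log(np.mean(np.exp(cohort - m)))
    return target_loglik - log_avg

def verify(target_loglik, cohort_logliks, threshold=0.0):
    """Accept the identity claim when the normalized score clears
    the (illustrative) decision threshold."""
    return cohort_normalized_score(target_loglik, cohort_logliks) > threshold
```

Because the score is a ratio, utterance-level effects that raise or lower all model likelihoods together (channel, duration) largely cancel, which is why normalization helps.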


Journal of the Acoustical Society of America | 1975

Comparative performance study of several pitch detection algorithms

Michael J. Cheng; Lawrence R. Rabiner; Aaron E. Rosenberg; C. A. McGonegal

A comparative performance study of five pitch detection algorithms was conducted. A speech database, consisting of eight utterances spoken by three males, three females, and one child was constructed. Both telephone and wideband recordings were made of each of the utterances. For each of the utterances in the database a “standard” pitch contour was semiautomatically measured using a highly sophisticated interactive pitch detection program. The “standard” pitch contour was then compared with the pitch contour that was obtained from each of the five programmed pitch detectors. The algorithms used in this study were (1) a center clipping, infinite‐peak clipping, modified autocorrelation method; (2) the cepstral method; (3) the SIFT method; (4) the parallel processing time domain method; and (5) the data reduction method. A set of measurements was made on the pitch contours to quantify the various types of errors which occur in each of the above methods. Included among the error measurements were the avera...
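Method (1) above builds on autocorrelation. A stripped-down autocorrelation pitch estimator, without the center clipping and infinite-peak clipping refinements the paper's version adds, can be sketched as:

```python
import numpy as np

def autocorr_pitch(frame, fs, fmin=50.0, fmax=500.0):
    """Estimate the pitch of a voiced frame by the autocorrelation
    method: find the lag, within the plausible pitch-period range,
    that maximizes the frame's autocorrelation.

    frame: 1-D sample array; fs: sampling rate in Hz.
    """
    frame = frame - frame.mean()
    # Autocorrelation for non-negative lags.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + np.argmax(ac[lo:hi])
    return fs / lag
```

Restricting the search to lags between fs/fmax and fs/fmin is what keeps the estimator from locking onto the zero-lag peak or onto subharmonics far below the plausible pitch range.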


Human Factors in Computing Systems | 2002

SCANMail: a voicemail interface that makes speech browsable, readable and searchable

Steve Whittaker; Julia Hirschberg; Brian Amento; Litza A. Stark; Michiel Bacchiani; Philip L. Isenhour; Larry Stead; Gary Zamchick; Aaron E. Rosenberg

Increasing amounts of public, corporate, and private speech data are now available on-line. These are limited in their usefulness, however, by the lack of tools to permit their browsing and search. The goal of our research is to provide tools to overcome the inherent difficulties of speech access, by supporting visual scanning, search, and information extraction. We describe a novel principle for the design of UIs to speech data: What You See Is Almost What You Hear (WYSIAWYH). In WYSIAWYH, automatic speech recognition (ASR) generates a transcript of the speech data. The transcript is then used as a visual analogue to that underlying data. A graphical user interface allows users to visually scan, read, annotate and search these transcripts. Users can also use the transcript to access and play specific regions of the underlying message. We first summarize previous studies of voicemail usage that motivated the WYSIAWYH principle, and describe a voicemail UI, SCANMail, that embodies WYSIAWYH. We report on a laboratory experiment and a two-month field trial evaluation. SCANMail outperformed a state-of-the-art voicemail system on core voicemail tasks. This was attributable to SCANMail's support for visual scanning, search and information extraction. While the ASR transcripts contain errors, they nevertheless improve the efficiency of voicemail processing. Transcripts provide enough information either for users to extract key points or to navigate to important regions of the underlying speech, which they can then play directly.


Computer Speech & Language | 1987

Evaluation of a vector quantization talker recognition system in text independent and text dependent modes

Aaron E. Rosenberg; Frank K. Soong

A vector quantization based talker recognition system is described and evaluated. The system is based on constructing highly efficient short-term spectral representations of individual talkers using vector quantization codebook construction techniques. Although the approach is intrinsically text-independent, the system can be easily extended to text-dependent operation for improved performance and security by encoding specified training word utterances to form word prototypes. The system was evaluated using a 100-talker database of 20,000 digits spoken in isolation. In a talker verification mode, average equal-error rate performance of 2.2% for text-independent operation and 0.3% for text-dependent operation was obtained for 7-digit long test utterances. Because the evaluation database was restricted to the vocabulary of spoken digits, the text-independent operation of the system has not been formally tested beyond the confines of that vocabulary.
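Codebook construction for such a system is classically done with the LBG algorithm, which splits codewords progressively; plain k-means is a close stand-in and is sketched below. The function name, codebook size, and iteration count are illustrative assumptions.

```python
import numpy as np

def train_codebook(frames, size=64, iters=20, seed=0):
    """Build a VQ codebook by k-means over training feature frames
    (a stand-in for the LBG splitting procedure).

    frames: float array of shape (n_frames, n_features).
    """
    rng = np.random.default_rng(seed)
    # Initialize codewords from distinct training frames.
    codebook = frames[rng.choice(len(frames), size, replace=False)]
    for _ in range(iters):
        # Assign each frame to its nearest codeword.
        d = np.linalg.norm(frames[:, None] - codebook[None], axis=2)
        nearest = d.argmin(axis=1)
        # Move each codeword to the mean of its assigned frames.
        for k in range(size):
            members = frames[nearest == k]
            if len(members):
                codebook[k] = members.mean(axis=0)
    return codebook
```

A codebook trained this way is then scored against test utterances by average quantization distortion, as in the minimum-distortion classification rule described for the 1985 ICASSP paper above.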


International Conference on Acoustics, Speech, and Signal Processing | 1990

Sub-word unit talker verification using hidden Markov models

Aaron E. Rosenberg; Chin-Hui Lee; Frank K. Soong

A talker verification system based on characterizing talker utterances as sequences of subword units represented by hidden Markov models (HMMs) was implemented and tested. Two types of subword units were studied: phonelike units (PLUs) and acoustic segment units (ASUs). PLUs are based on phonetic transcriptions of spoken utterances and ASUs are extracted directly from the acoustic signal without use of any linguistic knowledge. The ASU representation has the advantage of not requiring transcriptions of training utterances. Verification performance was evaluated on a 100-talker database of 20,000 isolated digit utterances. The experiments show only small differences in performance between PLU- and ASU-based representations. Overall, the verification equal-error rate is approximately 7 to 8% for one-digit test utterances (approximately 0.5 s in duration) and 1% or less for seven-digit test utterances (approximately 3.5 s in duration). In addition, a technique for updating models, using data from current test utterances, was devised and implemented. Using this adaptation technique, the error rate falls to 6% for one-digit utterances and less than 0.5% for seven-digit utterances. The experiment confirms that excellent verification performance can be obtained using HMMs of subword units.

Collaboration


Dive into Aaron E. Rosenberg's collaborations.

Top Co-Authors

Shrikanth Narayanan

University of Southern California
