Publication


Featured research published by Fil Alleva.


International Conference on Acoustics, Speech, and Signal Processing | 1995

Microsoft Windows highly intelligent speech recognizer: Whisper

Xuedong Huang; Alex Acero; Fil Alleva; Mei-Yuh Hwang; Li Jiang; Milind Mahajan

Since January 1993, the authors have been working to refine and extend Sphinx-II technologies in order to develop practical speech recognition at Microsoft. The result of that work is Whisper (Windows Highly Intelligent Speech Recognizer). Whisper offers significantly improved recognition efficiency, usability, and accuracy compared with the Sphinx-II system. In addition, Whisper provides speech input capabilities for Microsoft Windows and can be scaled to meet different PC platform configurations. It provides features such as continuous speech recognition, speaker independence, on-line adaptation, noise robustness, and dynamic vocabularies and grammars. For typical Windows command-and-control applications (fewer than 1,000 words), Whisper provides a software-only solution on PCs equipped with a 486DX processor, 4 MB of memory, a standard sound card, and a desktop microphone.


International Conference on Acoustics, Speech, and Signal Processing | 1996

Improvements on the pronunciation prefix tree search organization

Fil Alleva; Xuedong Huang; Mei-Yuh Hwang

The need for ever more efficient search organizations persists as the size and complexity of the knowledge sources used in continuous speech recognition (CSR) tasks continue to increase. We address efficiency issues associated with a search organization based on pronunciation prefix trees (PPTs). In particular, we present (1) a mechanism that eliminates redundant computations in non-reentrant trees, (2) a comparison of two methods for distributing language model probabilities in PPTs, and (3) results for two look-ahead pruning strategies. Using the 1994 DARPA 20k NAB word bigram on the male segment of si dev5m 92 (the 5k speaker-independent development test set for the WSJ), the error rate was 12.2% with a real-time factor of 1.0 on a 120 MHz Pentium.
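
As a rough illustration of the data structure involved, here is a minimal Python sketch of a pronunciation prefix tree with a look-ahead score smeared up the tree, loosely in the spirit of the LM-distribution methods the paper compares. The lexicon, unigram probabilities, and all names are invented, and the paper's reentrancy handling and pruning strategies are not modeled.

```python
# Minimal sketch of a pronunciation prefix tree (PPT) with language-model
# look-ahead. Illustrative only: the toy lexicon and probabilities are
# invented, and the paper's reentrancy and pruning machinery is omitted.
import math

LEXICON = {            # word -> phone sequence (toy pronunciations)
    "cat":  ["K", "AE", "T"],
    "can":  ["K", "AE", "N"],
    "dog":  ["D", "AO", "G"],
}
UNIGRAM = {"cat": 0.5, "can": 0.3, "dog": 0.2}   # toy LM probabilities

class Node:
    def __init__(self):
        self.children = {}          # phone -> Node
        self.words = []             # words whose pronunciation ends here
        self.lookahead = -math.inf  # best log-prob of any word below

def build_ppt():
    root = Node()
    for word, phones in LEXICON.items():
        node = root
        for ph in phones:
            node = node.children.setdefault(ph, Node())
        node.words.append(word)
    smear(root)
    return root

def smear(node):
    """Push the best reachable word log-prob up the tree so the search
    can apply an LM look-ahead score before a leaf identifies the word."""
    best = max((math.log(UNIGRAM[w]) for w in node.words), default=-math.inf)
    for child in node.children.values():
        best = max(best, smear(child))
    node.lookahead = best
    return best

root = build_ppt()
# Entering the K -> AE subtree already carries log P(cat), the best word below:
print(root.children["K"].children["AE"].lookahead)  # log(0.5)
```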


International Conference on Acoustics, Speech, and Signal Processing | 1993

An improved search algorithm using incremental knowledge for continuous speech recognition

Fil Alleva; Xuedong Huang; Mei-Yuh Hwang

A search algorithm that incrementally makes effective use of detailed sources of knowledge is proposed. The algorithm incrementally applies all available acoustic and linguistic information in three search phases. Phase one is a left-to-right Viterbi beam search that produces word end times and scores using right-context between-word models with a bigram language model. Phase two, guided by the results of phase one, is a right-to-left Viterbi beam search that produces word begin times and scores based on left-context between-word models. Phase three is an A* search that combines the results of phases one and two with a long-distance language model. The objective is to maximize recognition accuracy with a minimal increase in computational cost. With this decomposed, incremental search algorithm, it is shown that early use of detailed acoustic models can significantly reduce the recognition error rate with a negligible increase in computational cost. It is demonstrated that the early use of detailed knowledge can improve the word error bound by at least 22% for large-vocabulary, speaker-independent, continuous speech recognition.
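
To make the phase-three combination concrete, the following toy sketch (not the paper's implementation) runs an A* search over an invented word lattice, using the exact best suffix score from a right-to-left pass as the heuristic for each left-to-right partial path; because the heuristic is exact, the first complete hypothesis popped is optimal.

```python
# Toy sketch of the phase-three idea: an A* search whose heuristic for a
# partial (left-to-right) path is the exact best suffix score produced by
# a right-to-left pass. Lattice, scores, and words here are invented.
import heapq

# Toy word lattice: state -> list of (next_state, word, log_score)
LATTICE = {
    0: [(1, "the", -1.0), (1, "a", -1.5)],
    1: [(2, "cat", -0.7), (2, "cap", -1.2)],
    2: [],          # final state
}
FINAL = 2

def backward_scores():
    """Right-to-left pass: best log-score from each state to FINAL.
    Stands in for the phase-two search; exact, hence admissible."""
    h = {FINAL: 0.0}
    for state in sorted(LATTICE, reverse=True):
        for nxt, _, s in LATTICE[state]:
            h[state] = max(h.get(state, float("-inf")), s + h[nxt])
    return h

def astar():
    h = backward_scores()
    # heap orders by -(g + h): forward score so far plus exact future score
    heap = [(-(0.0 + h[0]), 0.0, 0, [])]
    while heap:
        _, g, state, words = heapq.heappop(heap)
        if state == FINAL:
            return words, g          # first goal popped is optimal
        for nxt, word, s in LATTICE[state]:
            heapq.heappush(heap, (-(g + s + h[nxt]), g + s, nxt, words + [word]))

print(astar())   # (['the', 'cat'], -1.7)
```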


International Conference on Acoustics, Speech, and Signal Processing | 1994

Improving speech recognition performance via phone-dependent VQ codebooks and adaptive language models in SPHINX-II

Mei-Yuh Hwang; Roni Rosenfeld; E. Thayer; R. Mosur; L. Chase; Robert Weide; Xuedong Huang; Fil Alleva

This paper presents improvements in acoustic and language modeling for automatic speech recognition. Specifically, semi-continuous HMMs (SCHMMs) with phone-dependent VQ codebooks are presented and incorporated into the SPHINX-II speech recognition system. The phone-dependent VQ codebooks relax the density-tying constraint in SCHMMs in order to obtain more detailed models. A 6% error rate reduction was achieved on the speaker-independent 20,000-word Wall Street Journal (WSJ) task. Dynamic adaptation of the language model in the context of long documents is also explored. A maximum entropy framework is used to exploit long-distance trigrams and trigger effects. A 10-15% word error rate reduction is reported on the same WSJ task using the adaptive language modeling technique.
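
A hedged sketch of what a phone-dependent codebook buys: in a semi-continuous HMM, the output density mixes a shared set of Gaussians with state-specific weights, and making the codebook depend on the phone relaxes that tying. All dimensions, codebooks, and weights below are invented toy values.

```python
# Hedged sketch of a semi-continuous HMM output density with phone-dependent
# codebooks: each phone class has its own small set of shared Gaussians, and
# a state mixes them with state-specific weights. All numbers are invented;
# real SCHMMs use hundreds of codewords and full front-end features.
import numpy as np

def gaussian_logpdf(x, mean, var):
    """Diagonal-covariance Gaussian log-density."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

# One tiny codebook (VQ of the acoustic space) per phone, not one global one.
CODEBOOKS = {
    "AE": [(np.array([0.0, 0.0]), np.array([1.0, 1.0])),
           (np.array([2.0, 1.0]), np.array([0.5, 0.5]))],
    "K":  [(np.array([-1.0, 3.0]), np.array([1.0, 2.0])),
           (np.array([1.0, -2.0]), np.array([1.0, 1.0]))],
}

def schmm_output_logprob(x, phone, mixture_weights):
    """b_j(x) = sum_k c_jk * N(x; mu_k(phone), Sigma_k(phone))."""
    logs = [np.log(c) + gaussian_logpdf(x, m, v)
            for c, (m, v) in zip(mixture_weights, CODEBOOKS[phone])]
    return np.logaddexp.reduce(logs)

x = np.array([0.5, 0.2])
print(schmm_output_logprob(x, "AE", [0.6, 0.4]))
```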


International Conference on Acoustics, Speech, and Signal Processing | 1993

Unified stochastic engine (USE) for speech recognition

Xuedong Huang; Marie-Thérèse Belin; Fil Alleva; Mei-Yuh Hwang

A unified stochastic engine (USE) that jointly optimizes both acoustic and language models is presented. In the USE, not only can one iteratively adjust language probabilities to fit the given acoustic representations, but one can also adjust acoustic models (including the feature representation) guided by language constraints. From the language modeling point of view, the USE makes it possible to encode acoustically confusable words in the language probabilities. From the acoustic modeling point of view, the language-constrained approach makes it possible to focus on acoustic words for which the language model lacks sufficient discrimination capacity. The authors report preliminary experimental results on the Wall Street Journal 5,000-word speaker-independent continuous dictation task. The error rate is reduced from 7.3% to 6.9% with the proposed method.
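
The paper's estimation procedure is not reproduced here, but the core intuition, adjusting language probabilities so they separate acoustically confusable hypotheses, can be caricatured with a perceptron-style update over invented N-best scores:

```python
# A rough sketch of the USE idea (not the paper's algorithm): treat the
# combined recognition score as acoustic + adjustable language score, and
# nudge language parameters whenever a competitor outscores the reference.
# Scores and hypotheses below are invented N-best style examples.

lm_logprob = {"their cat": -2.0, "there cat": -1.8}   # adjustable LM terms

def combined(hyp, acoustic):
    return acoustic[hyp] + lm_logprob[hyp]

def use_update(reference, competitors, acoustic, lr=0.1):
    """Perceptron-style correction: raise the LM score of the reference and
    lower that of a winning competitor, so the LM learns to separate
    acoustically confusable hypotheses."""
    best = max(competitors, key=lambda h: combined(h, acoustic))
    if combined(best, acoustic) > combined(reference, acoustic):
        lm_logprob[reference] += lr
        lm_logprob[best] -= lr

acoustic = {"their cat": -5.0, "there cat": -4.9}     # nearly confusable
for _ in range(5):
    use_update("their cat", ["their cat", "there cat"], acoustic)
print(lm_logprob)   # reference LM score pulled up relative to the competitor
```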


Human Language Technology | 1990

Improved hidden Markov modeling for speaker-independent continuous speech recognition

Xuedong Huang; Fil Alleva; Satoru Hayamizu; Hsiao-Wuen Hon; Mei-Yuh Hwang; Kai-Fu Lee

The paper reports recent efforts to further improve the performance of the Sphinx system for speaker-independent continuous speech recognition. The recognition error rate is significantly reduced by incorporating additional dynamic features, semi-continuous hidden Markov models, and speaker clustering. For the June 1990 (RM2) evaluation test set, the error rates of our current system are 4.3% and 19.9% with the word-pair grammar and with no grammar, respectively.
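
Of the three improvements, the dynamic features are the easiest to illustrate. Below is a minimal sketch (window length and data invented) of the standard regression formula for first-order delta coefficients appended to static cepstral frames.

```python
# Minimal sketch of the "additional dynamic features" idea: append
# first-order delta (regression) coefficients to static cepstral frames.
# Window length and data are illustrative.
import numpy as np

def add_delta_features(frames, window=2):
    """frames: (T, D) static features. Returns (T, 2D) with deltas appended,
    using the standard regression formula over +/- `window` frames."""
    T, D = frames.shape
    padded = np.pad(frames, ((window, window), (0, 0)), mode="edge")
    denom = 2 * sum(k * k for k in range(1, window + 1))
    deltas = np.zeros_like(frames)
    for t in range(T):
        for k in range(1, window + 1):
            deltas[t] += k * (padded[t + window + k] - padded[t + window - k])
    return np.hstack([frames, deltas / denom])

cepstra = np.random.randn(100, 13)          # fake 13-dim cepstral frames
print(add_delta_features(cepstra).shape)    # (100, 26)
```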


Human Language Technology | 1992

Applying SPHINX-II to the DARPA Wall Street Journal CSR task

Fil Alleva; Hsiao-Wuen Hon; Xuedong Huang; Mei-Yuh Hwang; Ronald Rosenfeld; Robert Weide

This paper reports recent efforts to apply the speaker-independent SPHINX-II system to the DARPA Wall Street Journal continuous speech recognition task. In SPHINX-II, we incorporated additional dynamic and speaker-normalized features, replaced discrete models with sex-dependent semi-continuous hidden Markov models, augmented within-word triphones with between-word triphones, and extended generalized triphone models to shared-distribution models. The configuration of SPHINX-II used for this task includes sex-dependent, semi-continuous, shared-distribution hidden Markov models and left-context-dependent between-word triphones. In applying our technology to this task, we addressed issues that were not previously of concern owing to the (relatively) small size of the Resource Management task.
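
The between-word triphone idea can be illustrated in a few lines: the first and last phones of each word take their context from the neighboring word rather than from a word-independent context. The pronunciations below are toy values, and the shared-distribution tying applied on top of these models is not shown.

```python
# Illustrative sketch of between-word (cross-word) triphone expansion:
# the first and last phones of each word take their context from the
# neighboring word. Pronunciations are toy; SPHINX-II's shared-distribution
# (senone) tying on top of these models is not shown.
PRONUNCIATIONS = {"the": ["DH", "AH"], "cat": ["K", "AE", "T"]}

def triphones(words):
    phones = [ph for w in words for ph in PRONUNCIATIONS[w]]
    tris = []
    for i, ph in enumerate(phones):
        left = phones[i - 1] if i > 0 else "SIL"
        right = phones[i + 1] if i + 1 < len(phones) else "SIL"
        tris.append(f"{left}-{ph}+{right}")
    return tris

# 'AH' sees 'K' from the next word, 'K' sees 'AH' from the previous word:
print(triphones(["the", "cat"]))
# ['SIL-DH+AH', 'DH-AH+K', 'AH-K+AE', 'K-AE+T', 'AE-T+SIL']
```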


International Conference on Spoken Language Processing | 1996

Evaluation of a language model using a clustered model backoff

John Miller; Fil Alleva

This paper describes and evaluates a language model that uses word classes automatically generated by a word clustering algorithm. Class-based language models have been shown to be effective for rapid adaptation, training on small datasets, and reducing memory usage. In terms of model perplexity, prior work has shown diminished returns for class-based language models constructed from very large training sets. This paper describes a method of using a class model as a backoff for a bigram model that produces significant benefits even when trained on a large text corpus. Test results on the Whisper continuous speech recognition system show that, for a given word error rate, the clustered bigram model uses two-thirds fewer parameters than a standard bigram model with unigram backoff.
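
A minimal sketch of the clustered backoff scheme as the abstract describes it: use the word bigram when it was observed, otherwise back off through word classes via P(C(w2)|C(w1)) * P(w2|C(w2)). All counts, classes, and backoff weights are invented.

```python
# Sketch of a class-based backoff bigram in the spirit the abstract
# describes: use the word bigram when observed, otherwise back off through
# word classes. Counts, classes, and the backoff weight are toy values.
BIGRAM = {("new", "york"): 0.4}                 # seen word bigrams
CLASS_OF = {"new": "ADJ", "york": "PLACE", "jersey": "PLACE"}
CLASS_BIGRAM = {("ADJ", "PLACE"): 0.2}          # P(class2 | class1)
P_WORD_GIVEN_CLASS = {"york": 0.6, "jersey": 0.3}
BACKOFF = {"new": 0.5}                          # toy backoff mass for 'new'

def p_clustered_backoff(w2, w1):
    if (w1, w2) in BIGRAM:                      # direct word-bigram estimate
        return BIGRAM[(w1, w2)]
    # back off: alpha(w1) * P(C(w2)|C(w1)) * P(w2|C(w2))
    return (BACKOFF.get(w1, 1.0)
            * CLASS_BIGRAM.get((CLASS_OF[w1], CLASS_OF[w2]), 0.0)
            * P_WORD_GIVEN_CLASS[w2])

print(p_clustered_backoff("york", "new"))     # seen: 0.4
print(p_clustered_backoff("jersey", "new"))   # backed off: 0.5*0.2*0.3 = 0.03
```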


Human Language Technology | 1989

Automatic new word acquisition: spelling from acoustics

Fil Alleva; Kai-Fu Lee

The problem of extending the lexicon of words in an automatic speech recognition system is commonly referred to as the new word problem. When encountered in the context of an embedded speech recognition system, this problem can be divided into the following sub-problems. First, identify the presence of a new word. Second, acquire a phonetic transcription of the new word. Third, acquire the orthographic transcription (spelling) of the new word. In this paper we present the results of a preliminary study that employs a novel approach to acquiring the orthographic transcription: an n-gram language model of English spelling combined with a quad-letter labeling of acoustic models, which taken together can produce an acoustic-to-spelling transcription of any spoken input.
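
A toy rendering of the spelling-from-acoustics idea: combine per-position acoustic letter scores with a letter-bigram model of English spelling in a Viterbi search. The scores and bigram table are invented, and the paper's quad-letter acoustic labels are reduced here to plain per-letter scores.

```python
# Toy sketch of spelling from acoustics: combine per-position acoustic
# letter scores with a letter-bigram language model of English spelling
# via Viterbi decoding. All scores below are invented.
import math

# Fake per-frame acoustic log-scores over a tiny letter alphabet.
ACOUSTIC = [{"c": -0.2, "k": -0.3}, {"a": -0.1, "o": -1.0}, {"t": -0.2, "d": -0.9}]
LETTER_BIGRAM = {("<s>", "c"): -0.5, ("<s>", "k"): -1.0,
                 ("c", "a"): -0.3, ("k", "a"): -0.6, ("c", "o"): -1.2,
                 ("k", "o"): -1.1, ("a", "t"): -0.4, ("a", "d"): -0.8,
                 ("o", "t"): -0.9, ("o", "d"): -1.0}

def spell(frames):
    paths = {"<s>": (0.0, "")}       # last letter -> (best score, spelling)
    for scores in frames:
        nxt = {}
        for letter, ac in scores.items():
            best = max(
                (p + LETTER_BIGRAM.get((prev, letter), -math.inf) + ac, spelled)
                for prev, (p, spelled) in paths.items())
            nxt[letter] = (best[0], best[1] + letter)
        paths = nxt
    return max(paths.values())[1]

print(spell(ACOUSTIC))   # 'cat'
```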


IEEE Automatic Speech Recognition and Understanding Workshop | 1997

Search organization in the Whisper continuous speech recognition system

Fil Alleva

Since the earliest days of computing, automatic speech recognition technology has ridden the technology wave that has come to be known as Moore's law. This is vividly illustrated by the market introduction of several general-purpose continuous speech recognition products. Besides Moore's law, the two things that have made this possible are advances in acoustic modeling, especially adaptation technologies, and advances in decoding techniques that permit real-time performance on today's PCs. This paper discusses our approach to the decoding problem, including the role of heuristic pruning, the A* criterion and its relation to Viterbi search and stack search, as well as our approach to problems relating to prefix-tree search and the application of language models and complex acoustic structures such as cross-word triphones.
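
Of the techniques mentioned, heuristic beam pruning is the simplest to sketch: at each frame, keep only hypotheses scoring within a fixed beam of the current best. The hypotheses and scores are invented.

```python
# Minimal sketch of the heuristic beam pruning the paper discusses: at each
# frame, keep only hypotheses whose score is within a fixed beam of the
# current best. Hypotheses and scores are invented.
def beam_prune(hyps, beam=5.0):
    """hyps: dict hypothesis -> log score. Keeps near-best hypotheses only.
    Too tight a beam causes search errors; too wide wastes computation."""
    best = max(hyps.values())
    return {h: s for h, s in hyps.items() if s >= best - beam}

frame_hyps = {"the cat": -10.0, "the cap": -12.5, "a cat": -17.0}
print(beam_prune(frame_hyps))   # 'a cat' falls outside the beam and is pruned
```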

Collaboration


Dive into Fil Alleva's collaborations.

Top Co-Authors

Kai-Fu Lee (Carnegie Mellon University)
Richard A. Lerner (Carnegie Mellon University)
Robert Weide (Carnegie Mellon University)
Roberto Bisiani (Carnegie Mellon University)
Zongge Li (Carnegie Mellon University)