Victor Abrash
SRI International
Publications
Featured research published by Victor Abrash.
Language Testing | 2010
Horacio Franco; Harry Bratt; Romain Rossier; Venkata Ramana Rao Gadde; Elizabeth Shriberg; Victor Abrash; Kristin Precoda
SRI International’s EduSpeak® system is a software development toolkit that enables developers of interactive language education software to use state-of-the-art speech recognition and pronunciation scoring technology. Automatic pronunciation scoring allows the computer to provide feedback on the overall quality of pronunciation and to point to specific production problems. We review our approach to pronunciation scoring, where our aim is to estimate the grade that a human expert would assign to the pronunciation quality of a paragraph or a phrase. Using databases of nonnative speech and corresponding human ratings at the sentence level, we evaluate different machine scores that can be used as predictor variables to estimate pronunciation quality. For more specific feedback on pronunciation, the EduSpeak toolkit supports a phone-level mispronunciation detection functionality that automatically flags specific phone segments that have been mispronounced. Phone-level information makes it possible to provide the student with feedback about specific pronunciation mistakes. Two approaches to mispronunciation detection were evaluated in a phonetically transcribed database of 130,000 phones uttered in continuous speech sentences by 206 nonnative speakers. Results show that the classification error of the best system, for the phones that can be reliably transcribed, is only slightly higher than the average pairwise disagreement between the human transcribers.
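To make the scoring pipeline concrete, here is a minimal sketch of the sentence-level approach the abstract describes: per-phone machine scores (log-posteriors, durations, rate of speech) are collapsed into predictor variables and regressed against human ratings. The feature set, the linear model, and the synthetic data are illustrative assumptions, not the actual EduSpeak scorer.

```python
# Sketch: machine scores as predictor variables for human pronunciation grades.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

def machine_scores(phone_log_posteriors, phone_durations, rate_of_speech):
    """Collapse per-phone evidence into sentence-level predictor variables."""
    return [
        np.mean(phone_log_posteriors),  # average spectral-match score
        np.std(phone_log_posteriors),   # variability across phones
        np.mean(phone_durations),       # mean phone duration (s)
        rate_of_speech,                 # phones per second
    ]

# Synthetic stand-ins for a rated nonnative-speech database: each sentence
# has per-phone log-posteriors, per-phone durations, and a human grade.
X, y = [], []
for _ in range(200):
    n_phones = rng.integers(20, 60)
    log_post = rng.normal(-1.5, 0.5, n_phones)      # per-phone log-posteriors
    dur = rng.gamma(2.0, 0.05, n_phones)            # per-phone durations (s)
    X.append(machine_scores(log_post, dur, n_phones / dur.sum()))
    y.append(np.clip(4 + np.mean(log_post), 1, 5))  # toy human grade, 1-5

grader = LinearRegression().fit(np.array(X), np.array(y))
print("predicted grade:", grader.predict(np.array(X[:1]))[0])
```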
international conference on acoustics, speech, and signal processing | 1996
Victor Abrash; Ananth Sankar; Horacio Franco; Michael Cohen
Speech recognition performance degrades significantly when there is a mismatch between testing and training conditions. Linear transformation-based maximum-likelihood (ML) techniques have been proposed recently to tackle this problem. We extend this approach to use nonlinear transformations. These are implemented by multilayer perceptrons (MLPs) which transform the Gaussian means. We derive a generalized expectation-maximization (GEM) training algorithm to estimate the MLP weights. Some preliminary experimental results on nonnative speaker adaptation are presented.
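The following sketch illustrates the adaptation scheme the abstract outlines: a small MLP transforms the Gaussian means, and its weights are updated by gradient steps on the EM auxiliary function, which is the essence of a GEM update. The network size, the synthetic occupancies, and the diagonal-covariance assumption are illustrative choices; a real system would obtain the occupancies from a forward-backward pass over adaptation data.

```python
# Sketch: GEM training of an MLP that transforms Gaussian means.
import torch

D, n_states, T = 13, 8, 500
mlp = torch.nn.Sequential(torch.nn.Linear(D, 32), torch.nn.Tanh(),
                          torch.nn.Linear(32, D))
opt = torch.optim.SGD(mlp.parameters(), lr=1e-2)

means = torch.randn(n_states, D)    # original Gaussian means
inv_var = torch.ones(n_states, D)   # diagonal precisions (fixed here)
x = torch.randn(T, D)               # adaptation frames
# Stand-in state occupancies; a real E-step computes these by forward-backward.
gamma = torch.softmax(torch.randn(T, n_states), dim=1)

for _ in range(50):                 # generalized M-step: gradient updates
    adapted = mlp(means)            # (n_states, D) transformed means
    diff2 = (x[:, None, :] - adapted[None]) ** 2       # (T, n_states, D)
    # Negative EM auxiliary function (up to constants) for diagonal Gaussians.
    loss = (gamma[..., None] * diff2 * inv_var[None]).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
```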
the second international conference on human language technology research | 2002
Horacio Franco; Jing Zheng; John Butzberger; Federico Cesari; Michael W. Frandsen; Jim Arnold; Venkata Ramana Rao Gadde; Andreas Stolcke; Victor Abrash
We introduce SRI's new speech recognition engine, DynaSpeak™, which is characterized by its scalability and flexibility, high recognition accuracy, memory and speed efficiency, adaptation capability, efficient grammar optimization, support for natural language parsing functionality, and operation based on integer arithmetic. These features are designed to address the needs of the fast-developing and changing domain of embedded and mobile computing platforms.
international conference on acoustics, speech, and signal processing | 2006
Shrikanth Narayanan; Panayiotis G. Georgiou; Abhinav Sethy; Dagen Wang; Murtaza Bulut; Shiva Sundaram; Emil Ettelaie; Sankaranarayanan Ananthakrishnan; Horacio Franco; Kristin Precoda; Dimitra Vergyri; Jing Zheng; Wen Wang; Ramana Rao Gadde; Martin Graciarena; Victor Abrash; Michael W. Frandsen; Colleen Richey
Engineering automatic speech recognition (ASR) for speech-to-speech (S2S) translation systems, especially targeting languages and domains that do not have readily available spoken language resources, is immensely challenging for a number of reasons. In addition to contending with the conventional data-hungry speech acoustic and language modeling needs, these designs have to accommodate varying requirements imposed by the domain needs and characteristics, the target device and usage modality (such as phrase-based or spontaneous free-form interactions, with or without visual feedback), and the huge spoken language variability arising from the socio-linguistic and cultural differences of the users. This paper, using case studies of creating speech translation systems between English and languages such as Pashto and Farsi, describes some of the practical issues and the solutions that were developed for multilingual ASR development. These include novel acoustic and language modeling strategies such as language-adaptive recognition, active-learning-based language modeling, and class-based language models that can better exploit resource-poor language data; efficient search strategies, including N-best and confidence generation to aid multiple-hypothesis translation; use of dialog information and clever interface choices to facilitate ASR; and audio interface design that meets both usability and robustness requirements.
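Of the modeling strategies listed, the class-based language model is the easiest to illustrate. In the sketch below, a bigram probability factors into a class-transition term and a word-membership term, so counts pooled at the class level generalize to word pairs never seen in sparse training data. All classes and probability values here are toy assumptions, not the paper's models.

```python
# Sketch: class-based bigram LM, p(w_i | w_{i-1}) =
#   p(class(w_i) | class(w_{i-1})) * p(w_i | class(w_i)).
class_of = {"london": "CITY", "kabul": "CITY", "flies": "VERB", "to": "PREP"}
p_class_bigram = {("PREP", "CITY"): 0.4, ("VERB", "PREP"): 0.3}
p_word_given_class = {"london": 0.6, "kabul": 0.4, "flies": 1.0, "to": 1.0}

def class_bigram_prob(prev_word, word):
    """Class-based bigram probability with a small floor for unseen pairs."""
    cc = (class_of[prev_word], class_of[word])
    return p_class_bigram.get(cc, 1e-6) * p_word_given_class[word]

# "to kabul" gets probability mass even if that exact word pair was never
# observed, because PREP -> CITY and p(kabul | CITY) were seen in training.
print(class_bigram_prob("to", "kabul"))  # 0.4 * 0.4 = 0.16
```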
Speech Communication | 2015
Luciana Ferrer; Harry Bratt; Colleen Richey; Horacio Franco; Victor Abrash; Kristin Precoda
Highlights: A system for classification of lexical stress for language learners is proposed. It successfully combines spectral and prosodic characteristics using GMMs. Models are learned on native speech, which does not require manual labeling. A method for controlling the operating point of the system is proposed. We achieve a 20% error rate on Japanese children speaking English.

We present a system for detection of lexical stress in English words spoken by English learners. This system was designed to be part of the EduSpeak® computer-assisted language learning (CALL) software. The system uses both prosodic and spectral features to detect the level of stress (unstressed, primary, or secondary) for each syllable in a word. Features are computed on the vowels and include normalized energy, pitch, spectral tilt, and duration measurements, as well as log-posterior probabilities obtained from the frame-level mel-frequency cepstral coefficients (MFCCs). Gaussian mixture models (GMMs) are used to represent the distribution of these features for each stress class. The system is trained on utterances by L1-English children and tested on English speech from L1-English children and L1-Japanese children with variable levels of English proficiency. Since it is trained on data from L1-English speakers, the system can be used on English utterances spoken by speakers of any L1 without retraining. Furthermore, automatically determined stress patterns are used as the intended target; therefore, hand-labeling of training data is not required. This allows us to use a large amount of data for training the system. Our algorithm results in an error rate of approximately 11% on English utterances from L1-English speakers and 20% on English utterances from L1-Japanese speakers. We show that all features, both spectral and prosodic, are necessary to achieve optimal performance on the data from L1-English speakers; MFCC log-posterior probability features are the single best set of features, followed by duration, energy, pitch, and finally spectral tilt features. For English utterances from L1-Japanese speakers, energy, MFCC log-posterior probabilities, and duration are the most important features.
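A minimal sketch of the classifier structure described here: one GMM per stress class is trained on syllable-level prosodic features, and a syllable is assigned the class with the highest prior-weighted score. The synthetic features and component count are assumptions; the real system also appends MFCC log-posterior features and provides operating-point control.

```python
# Sketch: one GMM per stress class, decision by maximum posterior.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
classes = ["unstressed", "primary", "secondary"]
# Toy vowel features per syllable: [norm_energy, pitch, spectral_tilt, duration]
train = {c: rng.normal(loc=i, scale=1.0, size=(300, 4))
         for i, c in enumerate(classes)}

gmms = {c: GaussianMixture(n_components=4, random_state=0).fit(X)
        for c, X in train.items()}

def classify(syllable_feats, priors=None):
    """Pick the stress class with the highest prior-weighted GMM log-score."""
    priors = priors or {c: 1.0 / len(classes) for c in classes}
    scores = {c: gmms[c].score_samples(syllable_feats[None])[0]
              + np.log(priors[c]) for c in classes}
    return max(scores, key=scores.get)

print(classify(rng.normal(loc=1, scale=1.0, size=4)))  # likely "primary"
```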
international conference on acoustics, speech, and signal processing | 2014
Luciana Ferrer; Harry Bratt; Colleen Richey; Horacio Franco; Victor Abrash; Kristin Precoda
We present a system for detecting lexical stress in English words spoken by English learners. The system uses both spectral and segmental features to detect three levels of stress for each syllable in a word. The segmental features are computed on the vowels and include normalized energy, pitch, spectral tilt and duration measurements. The spectral features are computed at the frame level and are modeled by one Gaussian Mixture Model (GMM) for each stress class. These GMMs are used to obtain segmental posteriors, which are then appended to the segmental features to obtain a final set of GMMs. The segmental GMMs are used to obtain posteriors for each stress class. The system was tested on English speech from native English-speaking children and from Japanese-speaking children with variable levels of English proficiency. Our algorithm results in an error rate of approximately 13% on native data and 20% on Japanese non-native data.
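The distinguishing step in this version is appending spectral posteriors to the segmental features. The sketch below shows one plausible reading, with assumed shapes: frame-level log-likelihoods from the spectral GMMs are pooled over the vowel, normalized into segmental posteriors, and concatenated with the prosodic measurements before the final segmental GMMs.

```python
# Sketch: turning frame-level spectral scores into appended segmental features.
import numpy as np

def segmental_posteriors(frame_loglik):
    """frame_loglik: (n_frames, n_classes) log-likelihoods from spectral GMMs.
    Returns one posterior per stress class for the whole vowel segment."""
    seg_ll = frame_loglik.sum(axis=0)   # pool frames per class
    seg_ll -= seg_ll.max()              # numerical stability before exp
    post = np.exp(seg_ll)
    return post / post.sum()

prosodic = np.array([0.8, 190.0, -6.5, 0.12])  # energy, pitch, tilt, duration
frame_ll = np.random.default_rng(1).normal(size=(25, 3))
final_feats = np.concatenate([prosodic, segmental_posteriors(frame_ll)])
print(final_feats.shape)                # (7,) -> input to the final GMMs
```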
international conference on acoustics, speech, and signal processing | 1994
Victor Abrash; Michael Cohen; Horacio Franco; Isao Arima
We have developed a hybrid speech recognition system which uses a multilayer perceptron (MLP) to estimate the observation likelihoods associated with the states of an HMM. In this paper, we propose two schemes for incorporating distinctive speech features (sonorant, fricative, nasal, vocalic, and voiced) into the MLP component of our system. We show a small improvement in recognition performance on a 160-word speaker-independent continuous-speech Japanese conference room reservation database. Further experiments simulating an improved distinctive feature classifier indicate that this approach can potentially lead to substantial performance improvements.
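A minimal sketch of the hybrid MLP/HMM architecture: the MLP outputs state posteriors, which are divided by state priors to give scaled likelihoods for the HMM decoder. Appending distinctive-feature scores to the MFCC input is shown as one assumed way to incorporate such features; the paper proposes two specific schemes not detailed here.

```python
# Sketch: MLP state posteriors converted to scaled likelihoods for an HMM.
import torch

n_mfcc, n_distinctive, n_states = 13, 5, 40
mlp = torch.nn.Sequential(
    torch.nn.Linear(n_mfcc + n_distinctive, 128), torch.nn.Sigmoid(),
    torch.nn.Linear(128, n_states), torch.nn.LogSoftmax(dim=-1))

frame_mfcc = torch.randn(1, n_mfcc)
# e.g. [sonorant, fricative, nasal, vocalic, voiced] scores in [0, 1]
distinctive = torch.rand(1, n_distinctive)

log_posterior = mlp(torch.cat([frame_mfcc, distinctive], dim=-1))
log_prior = torch.log(torch.full((n_states,), 1.0 / n_states))  # state priors
# p(frame | state) is proportional to p(state | frame) / p(state).
scaled_log_likelihood = log_posterior - log_prior  # feeds the HMM decoder
```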
Archive | 2011
Andreas Stolcke; Jing Zheng; Wen Wang; Victor Abrash
Archive | 2000
Horacio Franco; Victor Abrash; Kristin Precoda; Harry Bratt; Ramana Rao; John Butzberger; Romain Rossier; Federico Cesari
conference of the international speech communication association | 1995
Victor Abrash; Horacio Franco; Ananth Sankar; Michael Cohen