Publication


Featured research published by Harry Bratt.


Language Testing | 2010

EduSpeak®: A Speech Recognition and Pronunciation Scoring Toolkit for Computer-Aided Language Learning Applications

Horacio Franco; Harry Bratt; Romain Rossier; Venkata Ramana Rao Gadde; Elizabeth Shriberg; Victor Abrash; Kristin Precoda

SRI International’s EduSpeak® system is a software development toolkit that enables developers of interactive language education software to use state-of-the-art speech recognition and pronunciation scoring technology. Automatic pronunciation scoring allows the computer to provide feedback on the overall quality of pronunciation and to point to specific production problems. We review our approach to pronunciation scoring, where our aim is to estimate the grade that a human expert would assign to the pronunciation quality of a paragraph or a phrase. Using databases of nonnative speech and corresponding human ratings at the sentence level, we evaluate different machine scores that can be used as predictor variables to estimate pronunciation quality. For more specific feedback on pronunciation, the EduSpeak toolkit supports a phone-level mispronunciation detection functionality that automatically flags specific phone segments that have been mispronounced. Phone-level information makes it possible to provide the student with feedback about specific pronunciation mistakes. Two approaches to mispronunciation detection were evaluated in a phonetically transcribed database of 130,000 phones uttered in continuous speech sentences by 206 nonnative speakers. Results show that the classification error of the best system, for the phones that can be reliably transcribed, is only slightly higher than the average pairwise disagreement between the human transcribers.
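
The sentence-level scoring idea lends itself to a small illustration: regress expert ratings onto machine scores used as predictor variables. A minimal sketch in Python, assuming invented per-sentence machine scores (average phone log-posterior, rate of speech, a duration score) and invented human grades; this is illustrative, not SRI's implementation:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Hypothetical machine scores per sentence: average phone log-posterior,
    # rate of speech, and a duration-likelihood score (illustrative values).
    machine_scores = np.array([
        [-2.1, 3.8, -1.5],
        [-1.2, 4.5, -0.7],
        [-3.0, 2.9, -2.2],
        [-0.8, 5.1, -0.4],
    ])
    human_ratings = np.array([2.5, 3.8, 1.9, 4.2])  # expert grades, e.g. on a 1-5 scale

    # Fit a predictor of human pronunciation-quality ratings from machine scores.
    scorer = LinearRegression().fit(machine_scores, human_ratings)
    new_sentence = np.array([[-1.5, 4.0, -1.0]])
    print("predicted grade:", scorer.predict(new_sentence)[0])

In practice such a predictor would be fit on the nonnative-speech databases described above and validated against held-out human ratings.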


International Conference on Acoustics, Speech, and Signal Processing | 2006

The Contribution of Cepstral and Stylistic Features to SRI's 2005 NIST Speaker Recognition Evaluation System

Luciana Ferrer; Elizabeth Shriberg; Sachin S. Kajarekar; Andreas Stolcke; M. Kemal Sönmez; Anand Venkataraman; Harry Bratt

Recent work in speaker recognition has demonstrated the advantage of modeling stylistic features in addition to traditional cepstral features, but to date there has been little study of the relative contributions of these different feature types to a state-of-the-art system. In this paper we provide such an analysis, based on SRI's submission to the NIST 2005 speaker recognition evaluation. The system consists of 7 subsystems (3 cepstral, 4 stylistic). By running independent N-way subsystem combinations for increasing values of N, we find that (1) a monotonic pattern in the choice of the best N systems allows for the inference of subsystem importance; (2) the ordering of subsystems alternates between cepstral and stylistic; (3) syllable-based prosodic features are the strongest stylistic features; and (4) overall subsystem ordering depends crucially on the amount of training data (1 versus 8 conversation sides). Improvements over the baseline cepstral system, when all systems are combined, range from 47% to 67%, with larger improvements for the 8-side condition. These results provide direct evidence of the complementary contributions of cepstral and stylistic features to speaker discrimination.
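
The N-way combination experiment can be emulated with an exhaustive subset search: for each N, fuse every size-N subset of subsystem scores (here by simple averaging of synthetic scores) and keep the subset with the lowest error. A minimal sketch using made-up scores and an approximate equal error rate as the metric; the actual evaluation used trained combiners and NIST detection costs:

    import itertools
    import numpy as np

    def eer(scores, labels):
        # Approximate equal error rate (labels: 1 = target trial, 0 = impostor trial).
        best = 1.0
        for t in np.unique(scores):
            far = np.mean(scores[labels == 0] >= t)  # false accept rate
            frr = np.mean(scores[labels == 1] < t)   # false reject rate
            best = min(best, max(far, frr))
        return best

    rng = np.random.default_rng(0)
    n_trials = 1000
    labels = rng.integers(0, 2, n_trials)
    # Seven synthetic subsystems (3 "cepstral", 4 "stylistic"): noisy copies of
    # the true label, with noise levels standing in for subsystem strength.
    systems = np.stack([labels + rng.normal(0, s, n_trials)
                        for s in (0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2)])

    for n in range(1, len(systems) + 1):
        best_eer, best_subset = min(
            (eer(systems[list(idx)].mean(axis=0), labels), idx)
            for idx in itertools.combinations(range(len(systems)), n))
        print(f"N={n}: best subset {best_subset}, EER={best_eer:.3f}")

Tracking which subsystems enter the best subset as N grows is exactly the monotonic-pattern analysis the paper uses to infer subsystem importance.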


International Conference on Acoustics, Speech, and Signal Processing | 2014

Adaptive and discriminative modeling for improved mispronunciation detection

Horacio Franco; Luciana Ferrer; Harry Bratt

In the context of computer-aided language learning, automatic detection of specific phone mispronunciations by nonnative speakers can be used to provide detailed feedback about specific pronunciation problems. In previous work we found that significant improvements could be achieved, compared to standard approaches that compute posteriors with respect to native models, by explicitly modeling both mispronunciations and correct pronunciations by nonnative speakers. In this work, we extend our approach with model adaptation and discriminative modeling techniques, inspired by methods that have been effective in the area of speaker identification. Two systems were developed: one based on Bayesian adaptation of Gaussian Mixture Models (GMMs) with likelihood-ratio-based detection, and another based on Support Vector Machine (SVM) classification of supervectors derived from adapted GMMs. Both systems, and their combination, were evaluated on a phonetically transcribed Spanish database of 130,000 phones uttered in continuous speech sentences by 206 nonnative speakers, showing significant improvements over our previous best system.
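
A minimal sketch of the likelihood-ratio detector, with synthetic features: one GMM is fit on correctly pronounced instances of a phone, one on mispronounced instances, and a test segment is flagged when the average log-likelihood ratio falls below a threshold. The Bayesian (MAP) adaptation and SVM supervector stages are omitted for brevity; the data and models here are illustrative, not the paper's:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(1)
    correct = rng.normal(0.0, 1.0, (500, 13))  # features of correct pronunciations
    mispron = rng.normal(1.5, 1.0, (500, 13))  # features of mispronunciations

    gmm_correct = GaussianMixture(n_components=4, random_state=0).fit(correct)
    gmm_mispron = GaussianMixture(n_components=4, random_state=0).fit(mispron)

    def is_mispronounced(frames, threshold=0.0):
        # Average frame log-likelihood ratio: correct model vs. mispronunciation model.
        llr = gmm_correct.score(frames) - gmm_mispron.score(frames)
        return llr < threshold

    test_segment = rng.normal(1.4, 1.0, (30, 13))
    print("flagged:", is_mispronounced(test_segment))

Sweeping the threshold trades off false alarms against misses, which is how such a detector is tuned for pedagogically useful feedback.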


Speech Communication | 2015

Classification of lexical stress using spectral and prosodic features for computer-assisted language learning systems

Luciana Ferrer; Harry Bratt; Colleen Richey; Horacio Franco; Victor Abrash; Kristin Precoda

Highlights: A system for classification of lexical stress for language learners is proposed. It successfully combines spectral and prosodic characteristics using GMMs. Models are learned on native speech, which does not require manual labeling. A method for controlling the operating point of the system is proposed. We achieve a 20% error rate on Japanese children speaking English.

We present a system for detection of lexical stress in English words spoken by English learners. This system was designed to be part of the EduSpeak® computer-assisted language learning (CALL) software. The system uses both prosodic and spectral features to detect the level of stress (unstressed, primary or secondary) for each syllable in a word. Features are computed on the vowels and include normalized energy, pitch, spectral tilt, and duration measurements, as well as log-posterior probabilities obtained from the frame-level mel-frequency cepstral coefficients (MFCCs). Gaussian mixture models (GMMs) are used to represent the distribution of these features for each stress class. The system is trained on utterances by L1-English children and tested on English speech from L1-English children and L1-Japanese children with variable levels of English proficiency. Since it is trained on data from L1-English speakers, the system can be used on English utterances spoken by speakers of any L1 without retraining. Furthermore, automatically determined stress patterns are used as the intended target; therefore, hand-labeling of training data is not required. This allows us to use a large amount of data for training the system. Our algorithm results in an error rate of approximately 11% on English utterances from L1-English speakers and 20% on English utterances from L1-Japanese speakers. We show that all features, both spectral and prosodic, are necessary for achievement of optimal performance on the data from L1-English speakers; MFCC log-posterior probability features are the single best set of features, followed by duration, energy, pitch and finally, spectral tilt features. For English utterances from L1-Japanese speakers, energy, MFCC log-posterior probabilities and duration are the most important features.
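
The classification rule and the operating-point control can be pictured as follows: one GMM per stress class scores the vowel features, and an adjustable log-prior per class shifts decisions toward or away from a class. A minimal sketch with synthetic five-dimensional vowel features; names and data are invented:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(2)
    classes = ["unstressed", "primary", "secondary"]
    # Synthetic vowel features per class (energy, pitch, tilt, duration, posterior).
    train = {c: rng.normal(i, 1.0, (400, 5)) for i, c in enumerate(classes)}
    models = {c: GaussianMixture(n_components=2, random_state=0).fit(x)
              for c, x in train.items()}

    def classify(vowel_feats, log_priors=None):
        # Pick the class with the highest log-likelihood plus an adjustable
        # log-prior; shifting the priors moves the system's operating point.
        log_priors = log_priors or {c: 0.0 for c in classes}
        scores = {c: models[c].score(vowel_feats) + log_priors[c] for c in classes}
        return max(scores, key=scores.get)

    syllable = rng.normal(1.0, 1.0, (1, 5))
    print(classify(syllable))
    print(classify(syllable, {"unstressed": 0.0, "primary": -2.0, "secondary": 0.0}))

Biasing the priors toward "correct" stress labels, for instance, lets a CALL application avoid discouraging learners with false corrections.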


Conference of the International Speech Communication Association | 2016

Privacy-Preserving Speech Analytics for Automatic Assessment of Student Collaboration

Nikoletta Bassiou; Andreas Tsiartas; Jennifer Smith; Harry Bratt; Colleen Richey; Elizabeth Shriberg; Cynthia D'Angelo; Nonye Alozie

This work investigates whether nonlexical information from speech can automatically predict the quality of small-group collaborations. Audio was collected from students as they collaborated in groups of three to solve math problems. Experts in education annotated 30-second time windows by hand for collaboration quality. Speech activity features (computed at the group level) and spectral, temporal and prosodic features (extracted at the speaker level) were explored. After the latter were transformed from the speaker level to the group level, features were fused. Results using support vector machines and random forests show that feature fusion yields the best classification performance. The corresponding unweighted average F1 measure on a 4-class prediction task ranges between 40% and 50%, significantly higher than chance (12%). Speech activity features alone are strong predictors of collaboration quality, achieving an F1 measure between 35% and 43%. Speaker-based acoustic features alone achieve lower classification performance, but offer value in fusion. These findings illustrate that the approach under study offers promise for future monitoring of group dynamics, and should be attractive for many collaboration activity settings in which privacy is desired.
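
A minimal sketch of the fusion setup under stated assumptions: synthetic group-level speech-activity features are concatenated with speaker-level acoustic features aggregated (here simply averaged) to the group level, classified with a random forest, and scored with unweighted average F1. Dimensions and data are invented:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(3)
    n_windows = 400
    activity = rng.normal(0, 1, (n_windows, 6))     # group-level speech activity
    speaker = rng.normal(0, 1, (n_windows, 3, 12))  # 3 speakers x acoustic features
    group_acoustic = speaker.mean(axis=1)           # aggregate speakers to group level
    fused = np.hstack([activity, group_acoustic])   # feature fusion
    quality = rng.integers(0, 4, n_windows)         # 4-class collaboration quality

    X_tr, X_te, y_tr, y_te = train_test_split(fused, quality, random_state=0)
    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
    print("unweighted avg F1:", f1_score(y_te, clf.predict(X_te), average="macro"))

Because only acoustic and activity statistics reach the classifier, no words ever need to be transcribed, which is the privacy-preserving property the title refers to.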


North American Chapter of the Association for Computational Linguistics | 2007

A Conversational In-Car Dialog System

Baoshi Yan; Fuliang Weng; Zhe Feng; Florin Ratiu; Madhuri Raya; Yao Meng; Sebastian Varges; Matthew Purver; Annie Lien; Tobias Scheideck; Badri Raghunathan; Feng Lin; Rohit Mishra; Brian Lathrop; Zhaoxia Zhang; Harry Bratt; Stanley Peters

In this demonstration we present a conversational dialog system for automobile drivers. The system provides a voice-based interface to playing music, finding restaurants, and navigating while driving. The design of the system, as well as the new technologies developed, will be presented. Our evaluation showed that the system is promising, achieving a high task completion rate and good user satisfaction.


Spoken Language Technology Workshop | 2016

Toward human-assisted lexical unit discovery without text resources

Chris Bartels; Wen Wang; Vikramjit Mitra; Colleen Richey; Andreas Kathol; Dimitra Vergyri; Harry Bratt; Chiachi Hung

This work addresses lexical unit discovery for languages without (usable) written resources. Previous work has addressed this problem using entirely unsupervised methodologies. Our approach, in contrast, investigates the use of linguistic and speaker knowledge, which are often available even if text resources are not. We create a framework that benefits from such resources without assuming orthographic representations or generating word-level transcriptions. We adapt a universal phone recognizer to the target language and use it to convert audio into a searchable phone string for lexical unit discovery via fuzzy sub-string matching. Linguistic knowledge is used to constrain phone recognition output and to constrain lexical unit discovery on the phone recognizer output.
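
The fuzzy sub-string matching step can be sketched as a local alignment of a query phone sequence against the recognizer's phone string, so that a lexical unit is found even when some phones are deleted, inserted, or substituted by the recognizer. A minimal Smith-Waterman-style illustration, not the paper's matcher:

    def fuzzy_substring_score(query, text):
        # Local alignment of a phone query against a phone string;
        # returns the best match score (higher = closer match).
        match, mismatch, gap = 2, -1, -1
        prev = [0] * (len(text) + 1)
        best = 0
        for q in query:
            curr = [0]
            for j, t in enumerate(text, 1):
                diag = prev[j - 1] + (match if q == t else mismatch)
                curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
                best = max(best, curr[j])
            prev = curr
        return best

    phones = "dh ax k ae t s ae t aa n dh ax m ae t".split()
    query = "k ae t".split()
    print(fuzzy_substring_score(query, phones))  # high score: query occurs in the string

Linguistic knowledge would enter such a matcher as constraints, for example by restricting which phone substitutions count as near-matches.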


Conference of the International Speech Communication Association | 2016

The SRI Speech-Based Collaborative Learning Corpus

Colleen Richey; Cynthia D'Angelo; Nonye Alozie; Harry Bratt; Elizabeth Shriberg

We introduce the SRI speech-based collaborative learning corpus, a novel collection designed for investigating and measuring how students collaborate in small groups. This is a multi-speaker corpus containing high-quality audio recordings of middle school students working in groups of three to solve mathematical problems. Each student was recorded via a head-mounted noise-cancelling microphone. Each group was also recorded via a stereo microphone placed nearby. A total of 80 sessions were collected with the participation of 134 students. The average duration of a session was 20 minutes. All students spoke English; for some students, English was a second language. Sessions have been annotated with time stamps to indicate which mathematical problem the students were solving and which student was speaking. Sessions have also been hand-annotated with common indicators of collaboration for each speaker (e.g., inviting others to contribute, planning) and the overall collaboration quality for each problem. The corpus will be useful to education researchers interested in collaborative learning and to speech researchers interested in children’s speech, speech analytics, and speech diarization. The corpus, both audio and annotation, will be made available to researchers.


International Conference on Acoustics, Speech, and Signal Processing | 2014

Lexical stress classification for language learning using spectral and segmental features

Luciana Ferrer; Harry Bratt; Colleen Richey; Horacio Franco; Victor Abrash; Kristin Precoda

We present a system for detecting lexical stress in English words spoken by English learners. The system uses both spectral and segmental features to detect three levels of stress for each syllable in a word. The segmental features are computed on the vowels and include normalized energy, pitch, spectral tilt and duration measurements. The spectral features are computed at the frame level and are modeled by one Gaussian Mixture Model (GMM) for each stress class. These GMMs are used to obtain segmental posteriors, which are then appended to the segmental features; the resulting feature vectors are modeled by a final set of segmental GMMs, which are used to obtain posteriors for each stress class. The system was tested on English speech from native English-speaking children and from Japanese-speaking children with variable levels of English proficiency. Our algorithm results in an error rate of approximately 13% on native data and 20% on Japanese non-native data.
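
The posterior-appending step can be sketched as follows: per-class GMMs score the MFCC frames of a vowel, the frame log-likelihoods are normalized into class posteriors, averaged over the segment, and appended to the segmental feature vector. A minimal sketch with invented frames and features:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(4)
    classes = ["unstressed", "primary", "secondary"]
    gmms = {c: GaussianMixture(n_components=2, random_state=0)
               .fit(rng.normal(i, 1.0, (300, 13))) for i, c in enumerate(classes)}

    def segmental_posteriors(frames):
        # Frame log-likelihoods per class -> per-frame posteriors -> segment average.
        ll = np.stack([gmms[c].score_samples(frames) for c in classes])  # (3, n_frames)
        post = np.exp(ll - ll.max(axis=0))
        post /= post.sum(axis=0)
        return post.mean(axis=1)  # one averaged posterior per stress class

    vowel_frames = rng.normal(1.0, 1.0, (25, 13))         # MFCC frames for one vowel
    segmental_feats = np.array([0.3, 120.0, -4.2, 0.11])  # energy, pitch, tilt, duration
    final_feats = np.concatenate([segmental_feats, segmental_posteriors(vowel_frames)])
    print(final_feats)

The combined vector is then modeled by the final segmental GMMs, one per stress class, as described above.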


Archive | 2000

The SRI March 2000 Hub-5 Conversational Speech Transcription System

Andreas Stolcke; Harry Bratt; John Butzberger; Horacio Franco; Venkata Ramana Rao Gadde; Colleen Richey; Elizabeth Shriberg; Fuliang Weng; Jing Zheng
