Publications

Featured research published by John-Paul Hosom.


Speech Communication | 2009

Speaker-independent phoneme alignment using transition-dependent states

John-Paul Hosom

Determining the location of phonemes is important to a number of speech applications, including training of automatic speech recognition systems, building text-to-speech systems, and research on human speech processing. Agreement of humans on the location of phonemes is, on average, 93.78% within 20 msec on a variety of corpora, and 93.49% within 20 msec on the TIMIT corpus. We describe a baseline forced-alignment system and a proposed system with several modifications to this baseline. Modifications include the addition of energy-based features to the standard cepstral feature set, the use of probabilities of a state transition given an observation, and the computation of probabilities of distinctive phonetic features instead of phoneme-level probabilities. Performance of the baseline system on the test partition of the TIMIT corpus is 91.48% within 20 msec, and performance of the proposed system on this corpus is 93.36% within 20 msec. The results of the proposed system represent a 22% relative reduction in error over the baseline system, and a 14% reduction in error over results from a non-HMM alignment system. This result of 93.36% agreement is the best reported result to date on the TIMIT corpus.
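The "% agreement within 20 msec" figures above compare automatically placed phoneme boundaries against hand labels with a fixed time tolerance. A minimal sketch of that metric, assuming the two boundary lists correspond one-to-one (the paper's exact evaluation protocol may differ):

```python
def agreement_within_tolerance(ref_boundaries, hyp_boundaries, tol=0.020):
    """Percentage of hypothesized phoneme boundaries falling within `tol`
    seconds (here 20 msec) of the corresponding reference boundary.
    Assumes both lists hold corresponding boundary times in seconds."""
    if len(ref_boundaries) != len(hyp_boundaries):
        raise ValueError("boundary lists must correspond one-to-one")
    hits = sum(abs(r - h) <= tol for r, h in zip(ref_boundaries, hyp_boundaries))
    return 100.0 * hits / len(ref_boundaries)

# Toy example: hand-labeled vs. forced-alignment boundary times (seconds)
ref = [0.10, 0.25, 0.40, 0.62]
hyp = [0.11, 0.26, 0.41, 0.65]
score = agreement_within_tolerance(ref, hyp)  # three of four boundaries within 20 msec
```

With this metric, the reported 91.48% vs. 93.36% difference corresponds directly to the fraction of boundaries moved inside the 20-msec window by the proposed modifications.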


international conference on acoustics, speech, and signal processing | 2003

Intelligibility of modifications to dysarthric speech

John-Paul Hosom; Alexander Kain; Taniya Mishra; H. van Santen; Melanie Fried-Oken; Janice Staehely

Dysarthria is a motor speech impairment affecting millions of people. Dysarthric speech can be far less intelligible than that of non-dysarthric speakers, causing significant communication difficulties. The goal of our work is to understand the effect that certain modifications have on the intelligibility of dysarthric speech. These modifications are designed to identify aspects of the speech signal or signal processing that may be especially relevant to the effectiveness of a system that transforms dysarthric speech to improve its intelligibility. A result of this study is that dysarthric speech can, in the best case, be modified only at the short-term spectral level to improve intelligibility from 68% to 87%. A baseline transformation system using standard technology, however, does not show improvement in intelligibility. Prosody also has a significant (p<0.05) effect on intelligibility.


Proceedings 1998 IEEE 4th Workshop Interactive Voice Technology for Telecommunications Applications. IVTTA '98 (Cat. No.98TH8376) | 1998

Connected digit recognition experiments with the OGI Toolkit's neural network and HMM-based recognizers

Piero Cosi; John-Paul Hosom; Johan Schalkwyk; Stephen Sutton; Ronald A. Cole

This paper describes a series of experiments that compare different approaches to training a speaker-independent continuous-speech digit recognizer using the CSLU Toolkit. Comparisons are made between the hidden Markov model (HMM) and neural network (NN) approaches. In addition, a description of the CSLU Toolkit research environment is given. The CSLU Toolkit is a research and development software environment that provides a powerful and flexible tool for creating and using spoken language systems for telephone and PC applications. In particular, the CSLU-HMM, the CSLU-NN, and the CSLU-FBNN development environments, with which our experiments were implemented, are described in detail and recognition results are compared. Our speech corpus is OGI 30K-Numbers, which is a collection of spontaneous ordinal and cardinal numbers, continuous digit strings and isolated digit strings. The utterances were recorded by having a large number of people recite their ZIP code, street address, or other numeric information over the telephone. This corpus represents a very noisy and difficult recognition task. Our best results (98% word recognition, 92% sentence recognition), obtained with the FBNN architecture, suggest the effectiveness of the CSLU Toolkit in building real-life speech recognition systems.


international conference on acoustics, speech, and signal processing | 2010

Effect of speaking style and speaking rate on formant contours

Akiko Amano-Kusumoto; John-Paul Hosom

This paper presents the results of formant analysis using a newly developed formant contour model. We model formant contours with a linear combination of formant target values and coarticulation functions for /wVl/ and /tVl/ words. While formant target values are estimated globally over different speaking styles, coarticulation coefficients are estimated for individual tokens. The results show that the estimated coarticulation coefficients are inherently different between clear (CLR) and conversational (CNV) speech, and that the movement of articulators when producing CLR speech is faster than when producing CNV speech. On the other hand, speaking rate is not a key determinant of articulator movement at vowel onsets. The direct measure of F2 slope is strongly correlated with the estimated coarticulation coefficients, which may allow fewer parameters to be estimated.
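The core of such a contour model is a formant trajectory expressed as a blend of neighboring phoneme targets, weighted by a coarticulation function. A minimal sketch under assumed parameterization (the target values, sigmoid form, and parameter names here are illustrative, not the paper's fitted values):

```python
import math

def sigmoid(t, slope, position):
    """Coarticulation weight: near 0 at the left target, near 1 at the right."""
    return 1.0 / (1.0 + math.exp(-slope * (t - position)))

def formant_contour(t, left_target, right_target, slope, position):
    """Formant value at time t as a linear combination of two phoneme
    targets weighted by a sigmoid coarticulation function (illustrative;
    the paper fits coarticulation coefficients per token)."""
    w = sigmoid(t, slope, position)
    return (1.0 - w) * left_target + w * right_target

# Sketch: F2 moving from a /w/-like target (~700 Hz) toward a vowel target
# (~1800 Hz) over 100 msec, sampled every 10 msec
contour = [formant_contour(t / 100.0, 700.0, 1800.0, slope=40.0, position=0.05)
           for t in range(11)]
```

In this formulation, a steeper slope corresponds to faster articulator movement, which matches the paper's finding that CLR speech shows faster transitions than CNV speech.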


international conference on acoustics, speech, and signal processing | 2009

The effect of formant trajectories and phoneme durations on vowel intelligibility

Akiko Amano-Kusumoto; John-Paul Hosom

We examined how much listeners can benefit from listening to “clear” (CLR) speech compared to “conversational” (CNV) speech, both spoken at different speaking rates. Vowel intelligibility of four front vowels (/i:/, /I/, /E/, and /ei/) in background noise was measured with four speaking styles (CNV/SLOW, CNV, CLR, and CLR/FAST). Results showed that, after energy and F0 contours were normalized, only the tense vowels differed significantly between the CNV and CLR speaking styles. We synthesized hybrid (HYB) speech whose formant features were equal to those of CLR speech, while all other features were taken from CNV speech. Primary conclusions from this study are (1) naturally spoken fast CLR speech was not as intelligible as CLR speech, (2) enhancing formant frequencies to resemble those of CLR speech was effective at improving vowel intelligibility, and (3) spectral tilt and formant bandwidths were not contributing factors to the CLR speech benefit.


international conference on acoustics, speech, and signal processing | 1997

A diphone-based digit recognition system using neural networks

John-Paul Hosom; Ronald A. Cole

In exploring new ways of looking at speech data, we have developed an alternative method of segmentation for training a neural-network-based digit-recognition system. Whereas previous methods segment the data into monophones, biphones, or triphones and train on each sub-phone unit in several broad-category contexts, our new method uses modified diphones to train on the regions of greatest spectral change as well as the regions of greatest stability. Although we account for regions of spectral stability, we do not require their presence in our word models. Empirical evidence for the advantage of this new method is seen by the 13% reduction in word-level error that was achieved on a test set of the OGI Numbers corpus. Comparison was made to a baseline system that used context-independent monophones and context-dependent biphones and triphones.


Journal of the Acoustical Society of America | 2011

Towards the recovery of targets from coarticulated speech for automatic speech recognition

John-Paul Hosom; Alexander Kain; Brian O. Bush

An HMM-based ASR system tested on phoneme recognition of TIMIT (accuracy 74.2%) shows substitution errors covering all distinctive-feature dimensions of vowels: front/back, tense/lax, and high/low. These vowel-to-vowel errors account for about 30% of all substitution errors. These types of errors may be addressed by recovering vowel targets (and, as a by-product, coarticulation functions) during ASR. The current work models observed trajectories using a linear combination of target vectors, one vector per phoneme. A sigmoid function (with parameters for slope and position) models the evolution of the trajectory. In accordance with the Locus theory, if duration is sufficiently short and the rate of change is sufficiently slow, the targets may not be reached. Current data indicate that in clearly articulated speech, the vowel target is often reached, while in conversational speech, the vowel target is often not reached. This difference between speaking styles may explain the difficulty that current ASR syst...


conference of the international speech communication association | 1998

Universal speech tools: the CSLU toolkit.

Stephen Sutton; Ronald A. Cole; Jacques de Villiers; Johan Schalkwyk; Pieter Vermeulen; Michael W. Macon; Yonghong Yan; Edward C. Kaiser; Brian Rundle; Khaldoun Shobaki; John-Paul Hosom; Alexander Kain; Johan Wouters; Dominic W. Massaro; Michael M. Cohen


Archive | 2000

Automatic time alignment of phonemes using acoustic-phonetic information

Ronald A. Cole; John-Paul Hosom


conference of the international speech communication association | 2000

The OGI kids' speech corpus and recognizers.

Khaldoun Shobaki; John-Paul Hosom; Ronald A. Cole

Collaboration


Dive into John-Paul Hosom's collaborations.

Top Co-Authors

Ronald A. Cole, University of Colorado Boulder

Piero Cosi, National Research Council