
Publication


Featured research published by James L. Hieronymus.


Journal of the Acoustical Society of America | 2013

Syllable language models for Mandarin speech recognition: Exploiting character language models

Xunying Liu; James L. Hieronymus; Mark J. F. Gales; Philip C. Woodland

Mandarin Chinese is based on characters which are syllabic in nature and morphological in meaning. All spoken languages have syllabotactic rules which govern the construction of syllables and their allowed sequences. These constraints are not as restrictive as those learned from word sequences, but they can provide additional useful linguistic information. Hence, it is possible to improve speech recognition performance by appropriately combining these two types of constraints. For the Chinese language considered in this paper, character level language models (LMs) can be used as a first level approximation to allowed syllable sequences. To test this idea, word and character level n-gram LMs were trained on 2.8 billion words (equivalent to 4.3 billion characters) of texts from a wide collection of text sources. Both hypothesis and model based combination techniques were investigated to combine word and character level LMs. Significant character error rate reductions of up to 7.3% relative were obtained on a state-of-the-art Mandarin Chinese broadcast audio recognition task using an adapted history dependent multi-level LM that performs a log-linear combination of character and word level LMs. This supports the hypothesis that character or syllable sequence models are useful for improving Mandarin speech recognition performance.
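The log-linear combination described above can be sketched as a weighted sum of log-probabilities. This is a minimal illustration, not the paper's implementation; the weight `lam` and the example scores are hypothetical.

```python
import math

def log_linear_combine(log_p_word, log_p_char, lam):
    """Log-linearly combine word- and character-level LM log-probabilities.

    `lam` weights the word-level score and (1 - lam) the character-level
    score. Working in the log domain, the weighted sum corresponds to an
    unnormalized product of powers of the component probabilities.
    """
    return lam * log_p_word + (1.0 - lam) * log_p_char

# Hypothetical scores for one recognition hypothesis:
score = log_linear_combine(math.log(0.02), math.log(0.05), lam=0.7)
```

In practice such scores would be computed per hypothesis and the weight tuned on held-out data.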


Journal of the Acoustical Society of America | 1990

Automatic sentential vowel stress and sentence final pitch slope labeling

James L. Hieronymus; Briony Williams

Vowel stress in English is marked by pitch rise‐falls, energy, and duration. After studying the stress marking strategies of 15 talkers of American English, an algorithm was devised that labels vowels in continuous speech with three levels of stress. The algorithm is based on a model for combinations of pitch rise‐falls, relative energy, and duration. The pitch of voiced regions is characterized as rising, falling, or steady. Sequences of three regions are examined to find the pitch rise‐fall patterns that signal stress. If the energy of the vowel is within 8 dB of the maximum in the utterance, it is considered energy stressed. Duration is corrected for pre‐pausal effects. If two out of three cues are present, then the vowel is labeled stressed. The algorithm was tested on sentences of American and British English and found to perform very well. Detailed results will be presented, and performance of the final pitch slope determination will be discussed relative to a perceptually labeled database.
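The two-out-of-three decision rule in the abstract can be sketched as follows. This is a schematic reading of the rule, not the published algorithm; the duration threshold is a hypothetical placeholder, and only the 8 dB energy criterion comes from the abstract.

```python
def label_stress(has_pitch_rise_fall, energy_db_below_max, duration_ms,
                 duration_threshold_ms=120):
    """Label a vowel as stressed if at least two of three cues are present.

    Cues: a pitch rise-fall pattern, energy within 8 dB of the utterance
    maximum, and (pre-pausally corrected) duration above a threshold.
    The 120 ms default is illustrative, not a published value.
    """
    cues = [
        has_pitch_rise_fall,                    # pitch rise-fall detected
        energy_db_below_max <= 8.0,             # within 8 dB of utterance max
        duration_ms >= duration_threshold_ms,   # duration cue
    ]
    return sum(cues) >= 2
```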


Journal of the Acoustical Society of America | 1984

Perceptually motivated spectral comparisons for automatic speech recognition

James L. Hieronymus; Roy W. Gengel

Early experiments on amplitude discrimination by Flanagan [J. Acoust. Soc. Am. 27, 1223–1225 (1955); J. Speech Hear. Disord. 22, 205–212 (1957)] suggest that second formant amplitude in synthetic vowels may vary ± 4 dB without a noticeable difference. Klatt (ICASSP 82, 1278–1281) conducted experiments changing the spectral tilt of vowels. This changed the “quality” but not the identity of the vowel. These results contrast sharply with spectral comparisons used in automatic speech recognizers, which are often based on Euclidean or Itakura distances. We used a dynamic time warp based speech recognizer to study several perceptually based spectral comparison measures. One of these measures matches the lowest frequency spectral peak in the template within ± 1.5 dB. For each subsequent peak within the template spectrum, each peak in the unknown spectrum which is within a perceptual frequency band is scored as a good match if it is within ± 4 dB of the template peak. Another measure tilts the spectrum so that F1 an...


Journal of the Acoustical Society of America | 1983

A word model based endpoint detector for isolated utterance recognition

James L. Hieronymus

Endpoint detection is a critical issue for several types of isolated utterance recognizers, because improper endpoints often result in recognition errors. Endpoint errors often stem from nonspeech artifacts, namely lip smacks, tongue and teeth clicks, and breath noise. Endpoint detectors based only on energy thresholds cannot correctly reject these artifacts, but adding a word model allows most of these artifacts to be properly rejected. The rules which implement the word model are (1) the word cannot begin or end with two released plosives, (2) word initial stop gaps are less than 120 ms and word final ones less than 200 ms, (3) a word must contain a vocalic nucleus and be at least 100 ms in length, (4) word final sounds containing only mid‐frequency energy are breath noise. The detection algorithm has been implemented on a Heuristics Speech Recognizer and tested using the Texas Instruments isolated word data base. The word model based system substantially reduced the error rate relative to an energy thr...
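The four word-model rules can be sketched as checks over a candidate segmentation. The segment labels and data layout here are illustrative assumptions; the abstract's rules are paraphrased, not reproduced from the original system.

```python
def plausible_word(segments):
    """Check a candidate endpoint segmentation against the word-model rules.

    `segments` is a hypothetical list of (label, duration_ms) pairs with
    labels such as 'plosive', 'stop_gap', 'vowel', 'breath'.
    """
    labels = [label for label, _ in segments]
    total_ms = sum(dur for _, dur in segments)

    # Rule 3: must contain a vocalic nucleus and be at least 100 ms long.
    if 'vowel' not in labels or total_ms < 100:
        return False
    # Rule 1: cannot begin or end with two released plosives.
    if labels[:2] == ['plosive', 'plosive'] or labels[-2:] == ['plosive', 'plosive']:
        return False
    # Rule 2: word-initial stop gaps under 120 ms, word-final under 200 ms.
    if labels[0] == 'stop_gap' and segments[0][1] > 120:
        return False
    if labels[-1] == 'stop_gap' and segments[-1][1] > 200:
        return False
    # Rule 4: word-final mid-frequency-only energy is breath noise.
    if labels[-1] == 'breath':
        return False
    return True
```

A real detector would apply these checks to hypothesized endpoints and reject segmentations that fail any rule.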


Journal of the Acoustical Society of America | 1991

Study of vowel coarticulation in British English

James L. Hieronymus

Coarticulation in continuous speech causes the vowel formant tracks to be altered by nearby phonemes. The present study concentrates on 660 phonetically hand‐labeled read sentences from one male talker of the RP accent of British English. The 14 monophthongal vowels of RP English, /ii, i, a, e, aa, uh, oo, o, u, u″, uu, @@, @, |/, were studied using formants, duration, and syllable stress as parameters. The formant frequency values near the right and left edges and the center of the hand‐labeled vowel region were studied using scatter plots and statistical analysis. No simple relationship between adjacent phoneme place of articulation and the vowel target change has been found because of the presence of “robust vowels” within each vowel category which always reach their target even in the presence of adjacent semivowels. These vowels are not merely stressed, or in content words; instead, they depend on other factors that are being studied. The long duration (prepausal lengthened) vowels were always rob...


Journal of the Acoustical Society of America | 1987

An integrated approach to automatic detection of retroflexed and high‐fronted approximants in continuous speech

Roy W. Gengel; James L. Hieronymus

Our goal is the development of speaker‐independent, semivowel detector classifiers for use with continuous speech [R. W. Gengel and J. L. Hieronymus, J. Acoust. Soc. Am. Suppl. 1 80, S19 (1986)]. Progress on detection of retroflexed approximants (/r/,/ɝ/,/ɚ/) in one branch of a classification tree, and high‐fronted approximants (/y/,/i/) and high‐front vowels (/ɪ/,/eɪ/,/ɛ/) in another, is reported. The analytical scheme relies heavily on energy ratios selected on the basis of “frames of reference.” For example, a basic frame of reference is the energy in the frequency band between approximately 2–3 kHz. According to formant theory, the energy level in this band should be relatively low for retroflexed approximants (F1‐F3 below 2 kHz) and relatively high for high‐fronted approximants (F2, F3 above 2 kHz) for male speakers. In contrast, the energy band between about 1.2–2 kHz should be relatively high for retroflexed approximants while the band between about 0.6–1.6 kHz should be low for high‐fronted approxi...
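The band-energy-ratio idea can be sketched as below. The band edges follow the abstract, but the decision margin, function names, and spectrum representation are hypothetical assumptions for illustration.

```python
def band_energy(spectrum_db, freqs_hz, lo, hi):
    """Mean dB energy of spectrum samples whose frequency lies in [lo, hi)."""
    vals = [e for f, e in zip(freqs_hz, spectrum_db) if lo <= f < hi]
    return sum(vals) / len(vals)

def classify_approximant(spectrum_db, freqs_hz, margin_db=3.0):
    """Crude retroflex vs. high-front decision from band-energy ratios,
    after the 'frames of reference' idea. The 3 dB margin is a
    hypothetical choice, not a published value."""
    ref = band_energy(spectrum_db, freqs_hz, 2000, 3000)    # reference band
    retro = band_energy(spectrum_db, freqs_hz, 1200, 2000)  # high if retroflex
    if retro - ref >= margin_db:
        return 'retroflex'
    if ref - retro >= margin_db:
        return 'high-front'
    return 'uncertain'
```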


Journal of the Acoustical Society of America | 1986

Toward automatic recognition of the semivowels /r, w, l, y/: A progress report

Roy W. Gengel; James L. Hieronymus

The goal is to develop a speaker‐independent, semivowel detector‐classifier for use with continuous speech. Preliminary analysis of semivowels within continuous speech indicates large within‐(and between) speaker variability in the respective semivowel formant frequencies, amplitudes, and durations of steady states and transitions. Deviations from textbook descriptions based on isolated words are significant. Coarticulation effects are also large. Nevertheless, through the use of energy, energy ratios, zero crossings, and signal envelopes, a logical analytical scheme was developed that classifies segments of continuous speech as: semivowel, high‐front vowel, vowel, nasal, fricative, stop, or plosive. The specific characteristics of this scheme will be discussed and the computer‐generated displays that were used will be presented. Results of using this scheme on a large corpus of phonetically labeled speech will be presented. [Work supported, in part, by DARPA.]


Journal of the Acoustical Society of America | 1986

Discrimination of synthetic‐vowel formant amplitude change in F2, F3, with two psychophysical methods

Roy W. Gengel; James L. Hieronymus

Estimates of ΔI are reported for amplitude changes in F2 of two‐formant synthetic vowels and for amplitude changes in F3 for three‐formant synthetic vowels. The vowels /ae, ɔ, i, and u/ were used. Four subjects were tested at an overall level of 70 dB SPL using both a same‐different and a 3AFC procedure. The results suggest that ΔI for a “comparison” stimulus is dependent, in part, on the amplitude of F2 (or F3) relative to the amplitude of F1 in the “standard” stimulus. Thus, for example, when the dB level of F3 is low relative to the dB level in F1 (as in /u/), ΔI is comparatively large. When the amplitude of F3 is high compared to F1 (as in /ae/), ΔI is comparatively small.


Journal of the Acoustical Society of America | 1986

A formant tracker based on pitch synchronous spectra and bark scaling

James L. Hieronymus; William J. Majurski

A formant tracker for continuous speech recognition which uses pitch synchronous spectra has been developed. The spectra are bark scaled to control diffuseness in the higher formants. Then the longest and strongest contiguous ridges are found beginning with regions of high intensity (vocalic nuclei). These are labeled as possible formant trajectories. In regions where the ridges are discontinuous, an attempt is made to join ridges of approximately the same amplitude and frequency. Then the formants are assigned, based on the average frequencies of the ridges in the vocalic regions. Statistically based heuristics are used when several candidate ridges occupy the region where formant frequencies overlap. Difficult areas include nasalized vowels and fronted back vowels. Results of evaluating the formant tracker on a large number of continuous utterances will be presented. [Work supported, in part, by DARPA.]


Journal of the Acoustical Society of America | 1983

A reference speech recognition algorithm: An approach to characterizing speech data bases

James L. Hieronymus; David S. Pallett

One way to characterize the relative recognition difficulty of speech data bases is to use a reference speech recognition algorithm. This approach was first suggested by R. K. Moore at the Royal Signals and Radar Establishment. A preliminary algorithm for this purpose has been published [G. F. Chollet and C. Gagnoulet, Proc. 1982 IEEE ICASSP, 2026–2029 (1982)]. Currently, an isolated utterance reference algorithm under development in our laboratory includes automatic endpoint detection and a training technique which averages several tokens of each utterance. Mel scale cepstral coefficients are used in conjunction with dynamic time alignment in both training and recognition. Recognition accuracy and performance statistics are presented along with distance measures for best and second‐best match. Histograms of the ratio of second‐best to first‐best scores are plotted as a measure of data base confusability. If the ratio histogram has considerable population near unity, there is significant danger of confusi...
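The ratio-histogram idea can be summarized with a single confusability statistic: the fraction of tokens whose second-best/first-best distance ratio lies near unity. This is a sketch under assumptions; the threshold value and function names are illustrative, not from the paper.

```python
def confusability_fraction(first_best, second_best, threshold=1.1):
    """Fraction of tokens whose second-best/first-best distance ratio
    falls below `threshold` (i.e. near unity). Assumes smaller distance
    means a better match, so ratios are >= 1; the 1.1 cutoff is a
    hypothetical choice for illustration.
    """
    ratios = [s / f for f, s in zip(first_best, second_best)]
    return sum(1 for r in ratios if r < threshold) / len(ratios)
```

A large fraction near unity would signal that many tokens have a close runner-up, i.e. a confusable data base.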

Collaboration



Top Co-Authors


Roy W. Gengel

Central Institute for the Deaf


Xunying Liu

University of Cambridge
