Publication


Featured research published by Ann K. Syrdal.


Journal of the Acoustical Society of America | 1999

The AT&T Next‐Gen TTS System

Mark C. Beutnagel; Alistair Conkie; Juergen Schroeter; Yannis Stylianou; Ann K. Syrdal

The new AT&T TTS system for general U.S. English text is based on best‐choice components picked from the AT&T Flextalk TTS, the Festival System from the University of Edinburgh, and ATR’s CHATR system. From Flextalk, it employs text normalization, letter‐to‐sound, and (optionally) baseline prosody generation. Festival provides general software‐engineering infrastructure (modularity) for easy experimentation and competitive evaluation of different algorithms or modules. Finally, CHATR’s unit selection was modified to guarantee the intelligibility of a good n‐phone (n=2 would be diphone) synthesizer while improving significantly on perceived naturalness relative to Flextalk. Each decision made during the research and development phase of this system was based on formal subjective evaluations. For example, the best voice found in a test that compared TTS systems built from several speakers gave a 0.3‐point head start (on a 5‐point rating scale) in quality over the mean of all speakers. Similarly, using our H...


international conference on acoustics, speech, and signal processing | 2001

Perceptual and objective detection of discontinuities in concatenative speech synthesis

Yannis Stylianou; Ann K. Syrdal

Concatenative speech synthesis systems attempt to minimize audible signal discontinuities between two successive concatenated units. An objective distance measure which is able to predict audible discontinuities is therefore very important, particularly in unit selection synthesis, for which units are selected from among a large inventory at run time. In this paper, we describe a perceptual test to measure the detection rate of concatenation discontinuity by humans, and then we evaluate 13 different objective distance measures based on their ability to predict the human results. Criteria used to classify these distances include the detection rate, the Bhattacharyya measure of separability of two distributions, and receiver operating characteristic (ROC) curves. Results show that the Kullback-Leibler distance on power spectra has the highest detection rate, followed by the Euclidean distance on Mel-frequency cepstral coefficients (MFCC).
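As a rough illustration of the two best-performing measures named above, the sketch below computes a Kullback-Leibler distance between two power spectra and a Euclidean distance between two feature vectors. The abstract does not give the exact definitions used in the paper, so the symmetric form of the KL distance, the normalization of each spectrum to unit sum, the `eps` floor, and the function names are all assumptions made for this example.

```python
import math


def symmetric_kl(p_spec, q_spec, eps=1e-12):
    """Symmetric KL distance between two power spectra (assumed form).

    Each spectrum is normalized to sum to 1 so that it can be treated as a
    discrete distribution. `eps` is a small floor that avoids log(0).
    """
    if len(p_spec) != len(q_spec):
        raise ValueError("spectra must have the same number of bins")
    p_total = sum(p_spec)
    q_total = sum(q_spec)
    if p_total <= 0 or q_total <= 0:
        raise ValueError("spectra must have positive total power")
    dist = 0.0
    for p_raw, q_raw in zip(p_spec, q_spec):
        p = p_raw / p_total + eps
        q = q_raw / q_total + eps
        # (p - q) * log(p / q) summed over bins equals KL(p||q) + KL(q||p).
        dist += (p - q) * math.log(p / q)
    return dist


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (e.g. MFCC frames)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

In a concatenation setting, the two inputs would typically be the last frame of the left unit and the first frame of the right unit; a larger distance would then suggest a more audible join.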


Speech Communication | 2001

Automatic ToBI prediction and alignment to speed manual labeling of prosody

Ann K. Syrdal; Julia Hirschberg; Julie McGory; Mary E. Beckman

Tagging of corpora for useful linguistic categories can be a time-consuming process, especially with linguistic categories for which annotation standards are relatively new, such as discourse segment boundaries or the intonational events marked in the Tones and Break Indices (ToBI) system for American English. A ToBI prosodic labeling of speech typically takes even experienced labelers from 100 to 200 times real time. An experiment was conducted to determine (1) whether manual correction of automatically assigned ToBI labels would speed labeling, and (2) whether default labels introduced any bias in label assignment. A large speech corpus of one female speaker reading several types of texts was automatically assigned default labels. Default accent placement and phrase boundary location were predicted from text using machine learning techniques. The most common ToBI labels were assigned to these locations for default tones and break type. Predicted pitch accents were automatically aligned to the mid-point of the word, while breaks and edge tones were aligned to the end of the phrase-final word. The corpus was then labeled by a group of five trained transcribers working over a period of nine months. Half of each set of recordings was labeled in the standard fashion without default labels, and the other half was presented with preassigned default labels for labelers to correct. Results indicate that labeling from defaults was generally faster than standard labeling, and that defaults had relatively little impact on label assignment.
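The default alignment rule described above (pitch accents at the word mid-point, breaks and edge tones at the end of the phrase-final word) can be sketched as follows. This is only an illustration: the `Word` structure and the specific default labels (`H*`, `L-L%`, break index `4`) are assumptions, since the abstract says only that the most common ToBI labels were used.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float        # word start time in seconds
    end: float          # word end time in seconds
    accented: bool      # predicted to carry a pitch accent
    phrase_final: bool  # predicted to end a phrase


# Assumed defaults; the abstract does not name the labels actually used.
DEFAULT_ACCENT = "H*"
DEFAULT_EDGE_TONE = "L-L%"
DEFAULT_BREAK = "4"


def default_labels(words):
    """Return (tones, breaks) as lists of (time, label) pairs."""
    tones, breaks = [], []
    for w in words:
        if w.accented:
            # Pitch accents are aligned to the mid-point of the word.
            tones.append(((w.start + w.end) / 2.0, DEFAULT_ACCENT))
        if w.phrase_final:
            # Edge tones and breaks are aligned to the end of the word.
            tones.append((w.end, DEFAULT_EDGE_TONE))
            breaks.append((w.end, DEFAULT_BREAK))
    return tones, breaks
```

A labeler would then correct these preassigned labels rather than starting from an empty tier.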


international conference on acoustics, speech, and signal processing | 1998

TD-PSOLA versus harmonic plus noise model in diphone based speech synthesis

Ann K. Syrdal; Yannis Stylianou; Laurie Garrison; Alistair Conkie; Juergen Schroeter

In an effort to select a speech representation for our next generation concatenative text-to-speech synthesizer, the use of two candidates is investigated: TD-PSOLA and the harmonic plus noise model, HNM. A formal listening test has been conducted and the two candidates have been rated regarding intelligibility, naturalness and pleasantness. The capacity for database compression and the computational load are also discussed. The results show that HNM consistently outperforms TD-PSOLA in all the above features except for computational load. HNM allows for high-quality speech synthesis without smoothing problems at the segmental boundaries and without buzziness or other oddities observed with TD-PSOLA.


Proceedings of 2002 IEEE Workshop on Speech Synthesis | 2002

A perspective on the next challenges for TTS research

Juergen Schroeter; Alistair Conkie; Ann K. Syrdal; Mark C. Beutnagel; Matthias Jilka; Volker Strom; Yeon Jun Kim; Hong-Goo Kang; David A. Kapilow

The quality of speech synthesis has come a long way since Homer Dudley's Voder in 1939. In fact, with the widespread use of unit-selection synthesizers, the naturalness of the synthesized speech is now high enough to pass the Turing test for short utterances, such as voice prompts. Therefore, it seems valid to ask: what are the next challenges for TTS research? This paper tries to identify unsolved issues, the solution of which would greatly enhance the state of the art in TTS.


international conference on spoken language processing | 1996

Acoustic variability in spontaneous conversational speech of American English talkers

Ann K. Syrdal

Speaker variability strongly impacts human perception and technology performance, yet large-scale, systematic studies of the acoustic characteristics involved are rarely undertaken. This study provides statistics on selected segmental and suprasegmental acoustic parameters from measures made on spontaneous conversational telephone speech from 160 speakers in the Switchboard Corpus. Since spontaneous conversational speech is more dynamically variable than read speech and more representative of actual human communication, it was preferred for our applied research purposes.


conference on computers and accessibility | 2011

On the intelligibility of fast synthesized speech for individuals with early-onset blindness

Amanda Stent; Ann K. Syrdal; Taniya Mishra

People with visual disabilities increasingly use text-to-speech synthesis as a primary output modality for interaction with computers. Surprisingly, there have been no systematic comparisons of the performance of different text-to-speech systems for this user population. In this paper we report the results of a pilot experiment on the intelligibility of fast synthesized speech for individuals with early-onset blindness. Using an open-response recall task, we collected data on four synthesis systems representing two major approaches to text-to-speech synthesis: formant-based synthesis and concatenative unit selection synthesis. We found a significant effect of speaking rate on intelligibility of synthesized speech, and a trend towards significance for synthesizer type. In post-hoc analyses, we found that participant-related factors, including age and familiarity with a synthesizer and voice, also affect intelligibility of fast synthesized speech.


International Journal of Speech Technology | 1998

An evaluation of the diagnostic rhyme test

Steven L. Greenspan; Raymond Walden Bennett; Ann K. Syrdal

The intelligibility of a speech output device is an important predictor of user acceptability. The Diagnostic Rhyme Test (DRT) is an ANSI standard for measuring speech intelligibility (ANSI S3.2-1989). In the DRT, respondents hear a word and choose its equivalent from two visually presented words. The two words differ only in their initial consonant (e.g., veal-feel), and the two consonants differ only in a single distinctive acoustic-phonetic feature (e.g., voicing). To define “distinctive feature”, the DRT uses a minimal distinctive feature system, loosely based on the work of Jakobson et al. (1963) and Miller and Nicely (1955). These studies carefully analyzed natural speech errors in various noise environments. Whether or not these studies can be freely applied to alternative forced-choice tests of coded or synthesized speech is an empirical issue. In the present study, the results of a Consonant Identification (CI) task were compared to a previously conducted DRT using the same coding algorithms. The CI data indicated that the low-bit-rate coded speech yielded significantly more multifeature confusions than the uncoded speech. Moreover, the multifeature confusions could not be easily predicted from the single-feature confusions. A fundamental assumption of the DRT is that speech errors are adequately diagnosed by testing single-feature confusions. The results of the present study contradict that assumption. In conclusion, we argue that the application of the DRT (and more generally, any closed-response choice procedure) to coded or synthesized speech is questionable.


international conference on acoustics, speech, and signal processing | 2011

Using F0 to constrain the unit selection Viterbi network

Alistair Conkie; Ann K. Syrdal

The goal of the work described here is to limit the computation needed in unit selection Viterbi search for text-to-speech synthesis. The broader goal is to improve speech quality through the practical use of significantly larger databases. We focus in this paper on trying to reduce the number of concatenation cost calculations. By making certain weak assumptions about ƒ0 distributions we estimate that only a fraction of possible concatenations are relevant. A method for selecting the relevant concatenations by imposing an ordering constraint on candidate units is proposed. The ordering is based on unit ƒ0 value(s). Strengths and weaknesses of this approach are discussed and data is presented about calculation complexity compared with naive Viterbi search. A listening test was conducted to investigate the effect on synthesis quality under various configurations of algorithm and database.
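The idea of skipping irrelevant concatenations can be sketched with a small Viterbi search in which the previous column of candidates is ordered by F0 and only units inside an F0 window are joined. This is one possible reading, not the authors' algorithm: the abstract does not specify the ordering constraint, so the `Unit` fields, the window parameter `max_f0_jump`, and the toy `join_cost` are hypothetical.

```python
import bisect
from dataclasses import dataclass


@dataclass
class Unit:
    name: str
    f0_start: float     # F0 (Hz) at the unit's left edge
    f0_end: float       # F0 (Hz) at the unit's right edge
    target_cost: float


def join_cost(prev, nxt):
    """Toy concatenation cost (assumed): absolute F0 mismatch at the join."""
    return abs(prev.f0_end - nxt.f0_start)


def constrained_viterbi(candidates, max_f0_jump):
    """Search a lattice of candidate units, one list of units per position.

    Only joins whose F0 mismatch is at most `max_f0_jump` are evaluated.
    Returns (best_cost, best_path_names, number_of_join_costs_computed).
    """
    prev_units = candidates[0]
    best = [(u.target_cost, [u.name]) for u in prev_units]
    n_joins = 0
    for column in candidates[1:]:
        # Order the previous column by end F0 so the window is a slice.
        order = sorted(range(len(prev_units)),
                       key=lambda i: prev_units[i].f0_end)
        keys = [prev_units[i].f0_end for i in order]
        new_best = []
        for u in column:
            lo = bisect.bisect_left(keys, u.f0_start - max_f0_jump)
            hi = bisect.bisect_right(keys, u.f0_start + max_f0_jump)
            choice = None
            for i in order[lo:hi]:
                if best[i] is None:  # predecessor was unreachable
                    continue
                n_joins += 1
                cost = best[i][0] + join_cost(prev_units[i], u) + u.target_cost
                if choice is None or cost < choice[0]:
                    choice = (cost, best[i][1] + [u.name])
            new_best.append(choice)
        best, prev_units = new_best, column
    finished = [b for b in best if b is not None]
    if not finished:
        raise ValueError("no path satisfies the F0 constraint")
    cost, path = min(finished, key=lambda b: b[0])
    return cost, path, n_joins
```

Passing `float("inf")` as `max_f0_jump` reduces this to a naive Viterbi search that evaluates every join, which gives a baseline for comparing the number of join-cost evaluations.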


Journal of the Acoustical Society of America | 1997

Voice selection for speech synthesis

Ann K. Syrdal; Alistair Conkie; Yannis Stylianou; Juergen Schroeter; Laurie Garrison; Dawn L. Dutton

A TTS voice quality experiment was conducted to select a speaker and to evaluate synthesis techniques. Small‐scale TTS diphone inventories using six professional female speakers who were pre‐selected in an audition were recorded. Two types of inventories were recorded for each speaker: a series of nonsense words and a series of English sentences. Using these 12 inventories, two synthesis methods were compared: PSOLA [Charpentier and Moulines, Eurospeech ’89] and Harmonic Plus Noise (HNM) [Stylianou et al., Eurospeech ’97]. Synthetic prosody closely modeled naturally spoken versions of the target utterances. Three fully synthetic (TTS) and two hybrid (i.e., partly recorded from the human speaker and partly synthesized) sentences formed the experimental stimuli for subjective testing. For references, two MNRU versions of the naturally spoken sentences were used: (a) Q10 (resembling low‐end commercial 16‐kbps encoded speech) and (b) Q35 (resembling high‐quality telephone speech). Forty‐one subjects rated int...
