Publication


Featured research published by Kim E. A. Silverman.


Journal of the Acoustical Society of America | 1987

The timing of prenuclear high accents in English

Kim E. A. Silverman; Janet B. Pierrehumbert

In English, the alignment of intonation peaks with their syllables exhibits a great deal of contextually governed variation. Understanding this variation is of theoretical interest, and modeling it correctly is important for good quality intonation synthesis. An experimental study of the alignment of prenuclear accent peaks with their associated syllables will be described. Two speakers produced repetitions of names of the form “Ma Lemm,” “Mom LeMann,” “Mamalie Lemonick,” and “Mama Lemonick,” with all combinations of the four first names and three surnames. Segmental durations and the F0 peak location in the first name were measured. Results show that although both speaking rate and prosodic context affect syllable duration, they exert different influences on peak alignment. Specifically, when a syllable is lengthened by a word boundary (e.g., Ma LeMann versus Mama Lemm) or stress clash (e.g., Ma Lemm), the peak falls disproportionately earlier in the vowel. This seems to be related to the syllable‐intern...


Journal of the Acoustical Society of America | 1985

Evidence for the independent function of intonation contour type, voice quality, and F0 range in signaling speaker affect

D. Robert Ladd; Kim E. A. Silverman; Frank Tolkmitt; Günther Bergmann; Klaus R. Scherer

In three related experiments, listeners judged the affect conveyed by short recorded utterances in which the voice quality, intonation contour type, and fundamental frequency range had been systematically and independently manipulated. (Contour and range were manipulated using digital resynthesis of naturally spoken utterances.) Analyses of variance of the results showed that range and contour, and less clearly range and voice quality, had independent effects on the way the utterances were judged. The results also strongly suggest that these differences are independent of effects due to interspeaker differences and to differences of verbal content. Finally, analysis of the results suggests that differences of F0 range, as is commonly assumed, have continuous rather than categorical effects on affective judgments.


Journal of the Acoustical Society of America | 1984

Vocal cues to speaker affect: Testing two models

Klaus R. Scherer; D. Robert Ladd; Kim E. A. Silverman

We identified certain assumptions implicit in two divergent approaches to studying vocal affect signaling. The ‘‘covariance’’ model assumes that nonverbal cues function independently of verbal content, and that relevant acoustic parameters covary with the strength of the affect conveyed. The ‘‘configuration’’ model assumes that both verbal and nonverbal cues exhibit categorical linguistic structure, and that different affective messages are conveyed by different configurations of category variables. We tested these assumptions in a series of two judgment experiments in which subjects rated recorded utterances, written transcripts, and three different acoustically masked versions of the utterances. Comparison of the different conditions showed that voice quality and F0 level can convey affective information independently of the verbal context. However, judgments of the unaltered recordings also showed that intonational categories (contour types) conveyed affective information only in interaction with grammatical features of the text. It appears necessary to distinguish between linguistic features of intonation and other (paralinguistic) nonverbal cues and to design research methods appropriate to the type of cues under study.


Journal of the Acoustical Society of America | 1998

Methods for controlling the generation of speech from text representing names and addresses

Kim E. A. Silverman

Improved automated synthesis of human audible speech from text is disclosed. Performance enhancement of the underlying text comprehensibility is obtained through prosodic treatment of the synthesized material, improved speaking rate treatment, and improved methods of spelling words or terms for the system user. Prosodic shaping of text sequences appropriate for the discourse in large groupings of text segments, with prosodic boundaries developed to indicate conceptual units within the text groupings, is implemented in a preferred embodiment.
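To make the idea of marking conceptual units of an address with prosodic boundaries more concrete, a minimal sketch follows. The boundary notation, the field split, and the example address are illustrative assumptions, not the patented method.

```python
# Minimal sketch (assumed notation, not the patented method): join the
# conceptual units of an address with major-boundary tags that a TTS front
# end could realize as pauses plus appropriate boundary tones.
def mark_address_prosody(name, street, city, state, zip_code):
    """Return address text with "||" marking a major prosodic boundary."""
    units = [
        name,
        street,
        f"{city}, {state}",
        " ".join(zip_code),  # speak the ZIP code digit by digit
    ]
    return " || ".join(units)

print(mark_address_prosody("Pat Smith", "123 Main Street", "Springfield", "Illinois", "62704"))
# -> Pat Smith || 123 Main Street || Springfield, Illinois || 6 2 7 0 4
```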


Journal of the Acoustical Society of America | 1998

Adaptive methods for controlling the annunciation rate of synthesized speech

Kim E. A. Silverman

Improved automated synthesis of human audible speech from text is disclosed. Performance enhancement of the underlying text comprehensibility is obtained through prosodic treatment of the synthesized material, improved speaking rate treatment, and improved methods of spelling words or terms for the system user. Prosodic shaping of text sequences appropriate for the discourse in large groupings of text segments, with prosodic boundaries developed to indicate conceptual units within the text groupings, is implemented in a preferred embodiment.


IEEE Transactions on Speech and Audio Processing | 2001

Statistical prosodic modeling: from corpus design to parameter estimation

Jerome R. Bellegarda; Kim E. A. Silverman; Kevin A. Lenzo; Victoria B. Anderson

The increasing availability of carefully designed and collected speech corpora opens up new possibilities for the statistical estimation of formal multivariate prosodic models. At Apple Computer, statistical prosodic modeling exploits the Victoria corpus, created to broadly support ongoing speech synthesis research and development. This corpus is composed of five constituent parts, each designed to cover a specific aspect of speech synthesis: polyphones, prosodic contexts, reiterant speech, function word sequences, and continuous speech. This paper focuses on the use of the Victoria corpus in the statistical estimation of duration and pitch models for Apple's next-generation text-to-speech system in Mac OS X. Duration modeling relies primarily on the subcorpus of prosodic contexts, which is instrumental in uncovering empirical evidence in favor of a piecewise linear transformation in the well-known sums-of-products approach. Pitch modeling relies primarily on the subcorpus of reiterant speech, which makes possible the optimization of superpositional pitch models with more accurate underlying smooth contours. Experimental results illustrate the improved prosodic representation resulting from these new duration and pitch models.
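For readers unfamiliar with the sums-of-products family referenced above, a minimal one-term sketch is given below: segmental durations are passed through a piecewise linear transform and the product of factor scales is then fit in the log domain. The factors, transform constants, and toy data are illustrative assumptions, not the model actually estimated from the Victoria corpus.

```python
# One-term sums-of-products duration model: T(d) = s_phone * s_stress^stress * s_final^final.
# Taking logs turns the product into a linear model solvable by least squares.
import numpy as np

def piecewise_linear(d, knee=0.10, slope_lo=1.0, slope_hi=0.5):
    """Illustrative piecewise linear transform of raw durations (seconds)."""
    d = np.asarray(d, dtype=float)
    return np.where(d <= knee, slope_lo * d, slope_lo * knee + slope_hi * (d - knee))

# Toy observations: (phone, stressed, phrase_final, duration_in_seconds).
data = [("AA", 1, 0, 0.14), ("AA", 0, 0, 0.09), ("AA", 1, 1, 0.21),
        ("T",  1, 0, 0.07), ("T",  0, 0, 0.05), ("T",  0, 1, 0.10)]

phones = sorted({p for p, *_ in data})
# Design matrix: one-hot phone identity plus binary stress and phrase-finality factors.
X = np.array([[1.0 * (p == q) for q in phones] + [s, f] for p, s, f, _ in data],
             dtype=float)
y = np.log(piecewise_linear([d for *_, d in data]))

# Log-domain least squares: log T(d) = log s_phone + stress*log s_stress + final*log s_final.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
scales = np.exp(coef)
print("phone scales:", dict(zip(phones, scales[:len(phones)])))
print("stress scale:", scales[-2], "phrase-final scale:", scales[-1])
```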


IEEE Transactions on Speech and Audio Processing | 2003

Natural language spoken interface control using data-driven semantic inference

Jerome R. Bellegarda; Kim E. A. Silverman

Spoken interaction tasks are typically approached using a formal grammar as the language model. While ensuring good system performance, this imposes a rigid framework on users by implicitly forcing them to conform to a pre-defined interaction structure. This paper introduces the concept of data-driven semantic inference, which in principle allows for any word constructs in command/query formulation. Each unconstrained word string is automatically mapped onto the intended action through a semantic classification against the set of supported actions. As a result, it is no longer necessary for users to memorize the exact syntax of every command. The underlying (latent semantic analysis) framework relies on co-occurrences between words and commands, as observed in a training corpus. A suitable extension can also handle commands that are ambiguous at the word level. The behavior of semantic inference is characterized using a desktop user interface control task involving 113 different actions. Under realistic usage conditions, this approach exhibits a 2 to 5% classification error rate. Various training scenarios of increasing scope are considered to assess the influence of coverage on performance. Sufficient semantic knowledge about the task domain is found to be captured at a level of coverage as low as 70%. This illustrates the good generalization properties of semantic inference.
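A toy sketch of the latent-semantic-analysis mapping from unconstrained phrasings to supported actions may help. The command set, training utterances, and rank below are invented for illustration and are not the 113-action task described in the paper.

```python
# Build a word-by-command co-occurrence matrix, take a low-rank SVD (LSA),
# fold a new utterance into that space, and pick the closest command.
import numpy as np

training = {
    "open mail":   ["open my mail", "show the mail", "check email"],
    "empty trash": ["empty the trash", "delete everything in the trash"],
    "volume up":   ["turn the volume up", "make it louder", "louder please"],
}

vocab = sorted({w for utts in training.values() for u in utts for w in u.split()})
commands = list(training)

# Word-by-command co-occurrence counts.
W = np.zeros((len(vocab), len(commands)))
for j, cmd in enumerate(commands):
    for u in training[cmd]:
        for w in u.split():
            W[vocab.index(w), j] += 1.0

# Rank-k LSA decomposition.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
k = 2
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T          # rows of Vk are command vectors

def infer(utterance):
    """Fold the utterance into the LSA space and return the closest command."""
    counts = np.zeros(len(vocab))
    for w in utterance.split():
        if w in vocab:
            counts[vocab.index(w)] += 1.0
    q = counts @ Uk / sk                            # pseudo-document vector
    sims = (Vk @ q) / (np.linalg.norm(Vk, axis=1) * np.linalg.norm(q) + 1e-12)
    return commands[int(np.argmax(sims))]

print(infer("could you make the sound louder"))     # expected: "volume up"
```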


IEEE Automatic Speech Recognition and Understanding Workshop | 2003

Automatic junk e-mail filtering based on latent content

Jerome R. Bellegarda; Devang K. Naik; Kim E. A. Silverman

The explosion in unsolicited mass electronic mail (junk e-mail) over the past decade has sparked interest in automatic filtering solutions. Traditional techniques tend to rely on header analysis, keyword/keyphrase matching and analogous rule-based predicates, and/or some probabilistic model of text generation. This paper aims instead at deciding whether or not the latent subject matter is consistent with the user's interests. The underlying framework is latent semantic analysis: each e-mail is automatically classified against two semantic anchors, one for legitimate and one for junk messages. Experiments show that this approach is competitive with the state-of-the-art in e-mail classification, and potentially advantageous in real-world applications with high junk-to-legitimate ratios. The resulting technology was successfully released in August 2002 as part of the e-mail client bundled with the Mac OS X 10.2 operating system.
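To make the two-anchor idea concrete, here is a deliberately simplified sketch in which plain bag-of-words centroids stand in for the paper's latent-semantic-analysis anchors; the tiny message corpus is invented for illustration.

```python
# Classify a message by comparing it to a "legitimate" anchor and a "junk"
# anchor (one centroid per class) and picking whichever it is closer to.
import numpy as np
from collections import Counter

legit = ["meeting moved to tuesday", "draft of the prosody paper attached"]
junk  = ["win a free prize now", "cheap loans act now limited offer"]

vocab = sorted({w for msg in legit + junk for w in msg.split()})

def vec(text):
    """Bag-of-words vector over the shared vocabulary."""
    counts = Counter(text.split())
    return np.array([counts[w] for w in vocab], dtype=float)

# Semantic anchors: the centroid of each class.
anchor_legit = np.mean([vec(m) for m in legit], axis=0)
anchor_junk  = np.mean([vec(m) for m in junk], axis=0)

def classify(message):
    v = vec(message)
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return "junk" if cos(v, anchor_junk) > cos(v, anchor_legit) else "legitimate"

print(classify("claim your free prize offer"))    # -> junk
print(classify("updated paper draft attached"))   # -> legitimate
```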


Journal of the Acoustical Society of America | 2013

Method and apparatus for speech synthesis using paralinguistic variation

Kim E. A. Silverman; Donald J. Lindsay

A method and apparatus for speech synthesis in a computer-user interface using random paralinguistic variation is described herein. According to one aspect of the present invention, a method for synthesizing speech comprises generating synthesized speech having certain prosodic features. The synthesized speech is further processed by applying a random paralinguistic variation to the acoustic sequence representing the synthesized speech without altering the linguistic prosodic features. According to one aspect of the present invention, the application of the paralinguistic variation is correlated with a previously applied paralinguistic variation to reflect a gradual change in the computer voice, while still maintaining a random quality.
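One way to picture the correlated random variation described in the abstract is a first-order autoregressive drift applied to a single paralinguistic parameter per utterance. The choice of an AR(1) process, the pitch-offset parameter, and the constants below are assumptions made for illustration, not the patented mechanism.

```python
# Each utterance receives a small random pitch offset (in semitones) that is
# correlated with the previous offset, so the voice drifts gradually instead
# of jumping, while the linguistic prosody itself is left untouched.
import random

class ParalinguisticDrift:
    def __init__(self, correlation=0.9, sigma=0.3):
        self.correlation = correlation   # how strongly each offset follows the last
        self.sigma = sigma               # size of the fresh random component (semitones)
        self.previous = 0.0

    def next_offset(self):
        noise = random.gauss(0.0, self.sigma)
        self.previous = self.correlation * self.previous + noise
        return self.previous

drift = ParalinguisticDrift()
for utterance in ["Hello.", "How can I help?", "One moment please."]:
    offset = drift.next_offset()
    # A real system would shift the synthesized F0 contour by `offset`
    # semitones here, leaving accents and phrasing unchanged.
    print(f"{utterance!r}: pitch offset {offset:+.2f} semitones")
```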


human language technology | 1992

Towards using prosody in speech recognition/understanding systems: differences between read and spontaneous speech

Kim E. A. Silverman; Eleonora Blaauw; Judith Spitz; John F. Pitrelli

A persistent problem for keyword-driven speech recognition systems is that users often embed the to-be-recognized words or phrases in longer utterances. The recognizer needs to locate the relevant sections of the speech signal and ignore extraneous words. Prosody might provide an extra source of information to help locate target words embedded in other speech. In this paper we examine some prosodic characteristics of 160 such utterances and compare matched read and spontaneous versions. Half of the utterances are from a corpus of spontaneous answers to requests for the name of a city, recorded from calls to Directory Assistance Operators. The other half are the same word strings read by volunteers attempting to model the real dialogue. Results show a consistent pattern across both sets of data: embedded city names almost always bear nuclear pitch accents and are in their own intonational phrases. However, the distributions of the tonal make-up of these prosodic features differ markedly between read and spontaneous speech, implying that if algorithms exploiting these prosodic regularities are trained on read speech, the resulting probabilities are likely to model real user speech incorrectly.
