Publication


Featured research published by Esther Klabbers.


Speech Communication | 2005

Synthesis of Prosody using Multi-level Unit Sequences

Jan P. H. van Santen; Alexander Kain; Esther Klabbers; Taniya Mishra

Generating meaningful and natural-sounding prosody is a central challenge in text-to-speech synthesis (TTS). In traditional synthesis, the challenge consists of how to generate natural target prosodic contours and how to impose these contours on recorded speech without causing audible distortions. In unit selection synthesis, the challenge is the sheer size of the speech corpus that is needed to cover all combinations of phone sequences and prosodic contexts that can occur in a given language. This paper describes new methods that are being explored, based on the principle of superpositional prosody transplant. Both methods are based on the following procedure. In a recorded, prosodically and phonemically labeled corpus, the log pitch contours are additively decomposed into component curves according to a prosodic hierarchy, typically phrase curves (corresponding to phrases), accent curves (corresponding to feet), and segmental perturbation (or residual) curves. During synthesis, the corpus is searched for multiple unit sequences: a unit sequence that covers the target phoneme string, and one or more unit sequences that cover the prosodic labels at a given phonological level (e.g., the foot or phrase) and are constrained by being matched to the phone match sequence in terms of the phonetic classes of the phonemes (or in terms of higher level entities, such as the number of feet and their sizes measured in syllables). The methods differ in terms of the level of detail of these constraints. A superpositional prosody transplant procedure generates a target pitch contour by extracting and recombining component curves from these sequences, and imposing this contour on the sequence that matches the phone string using standard speech modification methods. This process minimizes prosodic modification artifacts, optimizes the naturalness of the target pitch contour, yet avoids the combinatorial explosion of standard unit selection synthesis.
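The additive recombination described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the curve shapes (a linear phrase decline and a raised-cosine accent bump), the function names, and all numeric values are assumptions chosen only to show component curves being summed in the log-F0 domain.

```python
import math

def phrase_curve(n_frames, start_hz=200.0, end_hz=150.0):
    """Hypothetical phrase curve: linear decline in log F0 across the phrase."""
    a, b = math.log(start_hz), math.log(end_hz)
    if n_frames == 1:
        return [a]
    return [a + (b - a) * i / (n_frames - 1) for i in range(n_frames)]

def accent_curve(n_frames, onset, length, height):
    """Hypothetical accent curve: raised-cosine bump over one foot, zero elsewhere."""
    curve = [0.0] * n_frames
    for i in range(length):
        frame = onset + i
        if 0 <= frame < n_frames:
            curve[frame] = height * 0.5 * (1.0 - math.cos(2.0 * math.pi * i / length))
    return curve

def superpose(*components):
    """Sum component curves frame by frame in the log-F0 domain."""
    return [sum(frame) for frame in zip(*components)]

n = 10
phrase = phrase_curve(n)                                  # e.g. from a phrase-level unit sequence
accent = accent_curve(n, onset=2, length=4, height=0.2)   # e.g. from a foot-level unit sequence
residual = [0.0] * n                                      # segmental perturbation, zero in this toy case
target_log_f0 = superpose(phrase, accent, residual)
target_hz = [math.exp(v) for v in target_log_f0]          # back to Hz before imposing on the units
```

Working in the log domain means that an accent of a given height scales F0 by the same ratio wherever it lands on the phrase curve.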


Proceedings of the 2002 IEEE Workshop on Speech Synthesis | 2002

Prosodic factors for predicting local pitch shape

Esther Klabbers; J.P.H. van Santen; Johan Wouters

In this paper, we investigate the predictive power of different prosodic factorization schemes with respect to pitch movement. We use this to propose an extension of a standard diphone database with diphones that have been recorded in different prosodic contexts. The goal of this research is to reduce the amount of pitch modification required, thereby improving the segmental quality of the synthetic voice. We present a factorization scheme based on the foot structure of utterances and show that this efficient scheme results in a fairly small number of additional diphones that need to be recorded.


IEEE Transactions on Audio, Speech, and Language Processing | 2007

The Contribution of Various Sources of Spectral Mismatch to Audible Discontinuities in a Diphone Database

Esther Klabbers; J.P.H. van Santen; Alexander Kain

One of the major problems in concatenative synthesis is the occurrence of audible discontinuities between two successive concatenative units. Several studies have attempted to discover objective distance measures that predict the audibility of these discontinuities. In this paper, we investigate mid-vowel joins for three vowels with a range of post-vocalic consonant contexts typical for diphone databases. A first perceptual experiment uses a pairwise comparison procedure to find two subsets of unit combinations: those with and those without audible discontinuities. A second perceptual experiment uses these two subsets in a procedure where formant resynthesis is used to manipulate three sources of discontinuity separately: formant frequencies, formant bandwidths, and overall energy. Results show that mismatch in formant frequencies provides the largest contribution to audible discontinuity, followed by mismatch in overall energy.
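As a rough illustration of the three sources of mismatch named in this abstract, the sketch below measures each one separately at a single join. The frame layout (dictionary keys, units) and the example values are hypothetical and are not taken from the paper.

```python
def join_mismatch(left_frame, right_frame):
    """Absolute mismatch per source at a join (hypothetical frame layout)."""
    return {
        "formant_freq_hz": [abs(a - b) for a, b in
                            zip(left_frame["formants_hz"], right_frame["formants_hz"])],
        "bandwidth_hz": [abs(a - b) for a, b in
                         zip(left_frame["bandwidths_hz"], right_frame["bandwidths_hz"])],
        "energy_db": abs(left_frame["energy_db"] - right_frame["energy_db"]),
    }

# Made-up mid-vowel frames: the last frame of the left unit, the first of the right.
left = {"formants_hz": [700.0, 1200.0, 2500.0],
        "bandwidths_hz": [80.0, 100.0, 150.0],
        "energy_db": 65.0}
right = {"formants_hz": [650.0, 1350.0, 2500.0],
         "bandwidths_hz": [90.0, 100.0, 140.0],
         "energy_db": 62.5}
mismatch = join_mismatch(left, right)
```

Keeping the three sources in separate fields mirrors the experimental idea of manipulating them one at a time, rather than collapsing them into one distance.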


International Conference on Acoustics, Speech, and Signal Processing | 2014

A novel pitch decomposition method for the generalized linear alignment model

Mahsa Sadat Elyasi Langarani; Esther Klabbers; Jan P. H. van Santen

Superpositional models of intonation typically propose decomposing fundamental frequency (F0) contours into phrase curves and accent curves, aligned with phrases and left-headed feet, respectively. Extracting these component curves from F0 contours without making undue assumptions is challenging. We propose a novel method for decomposing pitch curves, based on the assumption that accent curves can be described by combining skewed normal distributions and sigmoid functions. In contrast to an earlier pitch decomposition algorithm (“PRISM”), this allows for simple joint optimization of phrase and accent curve parameters, using fewer parameters. The proposed method was evaluated on three speech corpora containing: (1) synthetically generated pitch curves, (2) all-sonorant utterances, and (3) utterances containing both sonorant and non-sonorant speech sounds. The root weighted mean squared error is small, and, on the corpus for which comparable data are available, is significantly smaller than for PRISM.
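The building blocks named in this abstract can be sketched as follows. The skew-normal density and the logistic sigmoid are standard formulas. The way they are combined in `accent_curve`, the parameter names, and the form of the weighted error are assumptions for illustration, since the abstract does not specify them.

```python
import math

def skew_normal_pdf(t, location, scale, shape):
    """Standard skew-normal density; shape=0 reduces to a normal density."""
    z = (t - location) / scale
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(shape * z / math.sqrt(2.0)))
    return 2.0 / scale * pdf * cdf

def sigmoid(t, midpoint, slope):
    """Logistic sigmoid rising from 0 to 1 around the midpoint."""
    return 1.0 / (1.0 + math.exp(-slope * (t - midpoint)))

def accent_curve(times, amplitude, location, scale, shape,
                 rise=0.0, midpoint=0.0, slope=1.0):
    """Assumed combination: a scaled skew-normal bump plus an optional sigmoid rise."""
    return [amplitude * skew_normal_pdf(t, location, scale, shape)
            + rise * sigmoid(t, midpoint, slope) for t in times]

def root_weighted_mse(observed, predicted, weights):
    """Root weighted mean squared error; the choice of weights is an assumption."""
    num = sum(w * (o - p) ** 2 for o, p, w in zip(observed, predicted, weights))
    return math.sqrt(num / sum(weights))

# Toy example: one accent over a 0.5 s foot, sampled every 10 ms.
times = [i * 0.01 for i in range(51)]
curve = accent_curve(times, amplitude=0.05, location=0.15, scale=0.08, shape=2.0)
```

A plausible use of the weights is to down-weight frames with unreliable pitch estimates, such as non-sonorant regions, though that reading is also an assumption.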


International Conference on Acoustics, Speech, and Signal Processing | 2011

F0 range and peak alignment across speakers and emotions

Eric Morley; Jan P. H. van Santen; Esther Klabbers; Alexander Kain

We present an analysis of F0 range and peak alignment in emotional speech from a heterogeneous group of speakers varying in age and gender. Both speaker and emotion had a strong effect on F0 range. Despite these large changes in the F0 trajectory, peak alignment was remarkably stable. Using the Linear Alignment Model (LAM) [1], we show that the effects on alignment of emotion and speaker differences, although statistically significant, are small. This stability leads to the conclusion that peak alignment, unlike F0 range, does not appear to carry much information about speaker identity or emotional state. The LAM is effective in that it explains 42% of the variance in peak location on average, and furthermore it predicts the time of F0 peaks with an average RMS error of 12 ms.
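The two evaluation figures quoted in this abstract, variance explained and RMS error, can be computed as in the sketch below. These are the usual textbook definitions; whether the paper computes them in exactly this way is an assumption, and the peak times shown are invented.

```python
import math

def rms_error(observed, predicted):
    """Root mean squared difference between observed and predicted values."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted))
                     / len(observed))

def variance_explained(observed, predicted):
    """1 - SS_res / SS_tot, the usual coefficient of determination."""
    mean = sum(observed) / len(observed)
    ss_tot = sum((o - mean) ** 2 for o in observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return 1.0 - ss_res / ss_tot

# Invented F0 peak times in seconds, measured from foot onset.
observed_peaks = [0.110, 0.145, 0.180, 0.130]
predicted_peaks = [0.120, 0.140, 0.170, 0.135]
error_ms = 1000.0 * rms_error(observed_peaks, predicted_peaks)
r2 = variance_explained(observed_peaks, predicted_peaks)
```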


Journal of the Acoustical Society of America | 2009

Recombinant speech synthesis: Natural text‐to‐speech synthesis with prosodic control.

Esther Klabbers; Taniya Mishra; Jan P. H. van Santen

Unit selection text‐to‐speech synthesis methods rely on large corpora to cover all phoneme sequences in as many prosodic contexts as possible. This coverage is rarely complete except in limited domains. This becomes particularly salient when using prosodic markup to generate specific prosodic patterns (e.g., emphatic stress). An architecture is proposed combining the naturalness of unit‐selection synthesis with the requirement of prosodic control. The speech corpus consists of multiple sub‐corpora, each optimized to cover a “linguistic subspace”; subspaces include phoneme sequences, left‐headed feet, sentence structures, and paralinguistic categories. The system relies on the superpositional model of intonation to decompose natural pitch contours into component contours, e.g., phrase curves (corresponding to phrases) and accent curves (corresponding to left‐headed feet); on analogous methods for timing; and on hybridization methods to implement paralinguistic features. During synthesis, phoneme sequences, curves, and parameters are generated from the sub‐corpora, optionally modified as per prosodic control tags, and “re‐combined.” The explicit representation in terms of component curves allows for complete prosodic control, while the naturalness of the prosodic patterns is guaranteed by extracting these curves from natural speech and smoothly modifying them, thereby preserving important natural detail. [Research supported by NSF grant 0205731, “Prosody generation in child‐oriented speech.”]


Journal of the Acoustical Society of America | 2006

Expressive speech synthesis using multilevel unit selection

Esther Klabbers; Jan P. H. van Santen

Generating natural sounding and meaningful prosody is a central challenge in text‐to‐speech synthesis, especially when generating expressive speech. Recently, we proposed a multilevel unit sequence synthesis approach, based on the general superpositional model of intonation, which describes a pitch contour as the sum of component curves that are each associated with different phonological levels, specifically the phoneme, foot, and phrase. During synthesis, segmental perturbation curves, accent curves, and phrase curves are extracted from the acoustic signal and are combined into target pitch curves; these target curves are then imposed on the acoustic unit sequences using standard pitch modification methods. This approach represents an attempt to combine the strengths of the two dominant approaches to speech synthesis: unit selection synthesis, which preserves all details of natural speech but struggles with coverage of the very large combinatorial space of phoneme sequences and prosodic contexts, and di...


SSW | 2004

Estimating phrase curves in the general superpositional intonation model.

Jan P. H. van Santen; Taniya Mishra; Esther Klabbers


SSW | 2004

Clustering of foot-based pitch contours in expressive speech.

Esther Klabbers; Jan P. H. van Santen


Conference of the International Speech Communication Association | 2003

Applications of computer generated expressive speech for communication disorders

Jan P. H. van Santen; Lois M. Black; Gilead Cohen; Alexander Kain; Esther Klabbers; Taniya Mishra; Jacques de Villiers; Xiaochuan Niu
