Julie Carson-Berndsen
University College Dublin
Publications
Featured research published by Julie Carson-Berndsen.
spoken language technology workshop | 2012
Éva Székely; Tamás Gábor Csapó; Bálint Tóth; Péter Mihajlik; Julie Carson-Berndsen
Freely available audiobooks are a rich resource of expressive speech recordings that can be used for the purposes of speech synthesis. Natural-sounding, expressive synthetic voices have previously been built from audiobooks that contained large amounts of highly expressive speech recorded from a professionally trained speaker. The majority of freely available audiobooks, however, are read by amateur speakers, are shorter, and contain less expressive (less emphatic, less emotional, etc.) speech in terms of both quality and quantity. Synthesizing expressive speech from a typical online audiobook therefore poses many challenges. In this work we address these challenges by applying a method consisting of minimally supervised techniques to align the text with the recorded speech, select groups of expressive speech segments, and build expressive voices for hidden Markov model-based synthesis using speaker adaptation. Subjective listening tests have shown that the expressive synthetic speech generated with this method is often able to produce utterances suited to an emotional message. We used a restricted amount of speech data in our experiment in order to show that the method is generally applicable to most typical audiobooks widely available online.
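To illustrate the segment-selection step described above, the sketch below groups utterances by simple prosodic statistics and ranks the groups by a crude expressiveness proxy. This is an editorial illustration under assumed features (mean F0, F0 range, RMS energy) and thresholds, not the authors' actual pipeline.

```python
# Minimal sketch: cluster utterance-level prosodic features and rank the
# clusters so the most "expressive" groups can be chosen as adaptation data.
# Feature choice and the expressiveness proxy are assumptions.
import numpy as np
from sklearn.cluster import KMeans

def select_expressive_groups(features, n_groups=4):
    """features: (n_utterances, 3) array of [f0_mean, f0_range, rms_energy]."""
    model = KMeans(n_clusters=n_groups, n_init=10, random_state=0)
    labels = model.fit_predict(features)
    # Rank clusters by F0 range + energy as a rough proxy for expressiveness.
    expressiveness = model.cluster_centers_[:, 1] + model.cluster_centers_[:, 2]
    ranked = np.argsort(expressiveness)[::-1]
    return [np.where(labels == c)[0] for c in ranked]  # utterance indices per group

# Example with synthetic prosodic features for 200 utterances.
rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 3))
groups = select_expressive_groups(feats)
print("most expressive group size:", len(groups[0]))
```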
international conference on acoustics, speech, and signal processing | 2012
Éva Székely; John Kane; Stefan Scherer; Christer Gobl; Julie Carson-Berndsen
Audiobooks are known to contain a variety of expressive speaking styles that occur as a result of the narrator mimicking a character in a story or expressing affect. Accurate modelling of this variety is essential for the purposes of speech synthesis from an audiobook. Voice quality differences are important features characterizing these different speaking styles; they are realized on a gradient and are often difficult to predict from the text. The present study uses a parameter characterizing breathy to tense voice qualities using features of the wavelet transform, and a measure for identifying creaky segments in an utterance. Based on these features, a combination of supervised and unsupervised classification is used to detect the regions in an audiobook where the speaker changes his regular voice quality to a particular voice style. The target voice style candidates are selected based on the agreement of the supervised classifier ensemble output and evaluated in a listening test.
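The candidate-selection idea based on ensemble agreement can be sketched very simply: a frame becomes a target voice-style candidate only when every classifier in the ensemble agrees on a non-modal label. The label scheme below is an assumption for illustration.

```python
# Minimal sketch of ensemble-agreement selection over frame-level predictions.
import numpy as np

def agreement_candidates(predictions, non_modal_label=1):
    """predictions: (n_classifiers, n_frames) array of 0 (modal) / 1 (non-modal)."""
    preds = np.asarray(predictions)
    unanimous = np.all(preds == non_modal_label, axis=0)
    return np.where(unanimous)[0]

preds = np.array([[0, 1, 1, 0, 1],
                  [0, 1, 1, 1, 1],
                  [0, 1, 0, 0, 1]])
print(agreement_candidates(preds))  # frames 1 and 4 are unanimous candidates
```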
Journal on Multimodal User Interfaces | 2014
Éva Székely; Ingmar Steiner; Zeeshan Ahmed; Julie Carson-Berndsen
One of the challenges of speech-to-speech translation is to accurately preserve the paralinguistic information in the speaker’s message. Information about the affect and emotional intent of a speaker is often carried in more than one modality. For this reason, the possibility of multimodal interaction with the system and the conversation partner may greatly increase the likelihood of a successful and gratifying communication process. In this work we explore the use of automatic facial expression analysis as an input annotation modality to transfer paralinguistic information at a symbolic level from input to output in speech-to-speech translation. To evaluate the feasibility of this approach, a prototype system, FEAST (facial expression-based affective speech translation), has been developed. FEAST classifies the emotional state of the user and uses it to render the translated output in an appropriate voice style, using expressive speech synthesis.
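A FEAST-like pipeline hinges on a symbolic mapping layer between the facial-expression classifier and the expressive synthesizer. The sketch below shows only that mapping step; the emotion labels and voice names are assumptions, not the system's actual configuration.

```python
# Minimal sketch: an emotion label from facial expression analysis selects the
# voice style used to render the translated text. Labels and voices are assumed.
VOICE_FOR_EMOTION = {
    "happy": "expressive_cheerful",
    "sad": "expressive_subdued",
    "angry": "expressive_tense",
    "neutral": "neutral",
}

def render_translation(translated_text, emotion_label):
    voice = VOICE_FOR_EMOTION.get(emotion_label, "neutral")
    # A real system would call an expressive TTS engine here.
    return {"text": translated_text, "voice": voice}

print(render_translation("How are you?", "happy"))
```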
international conference on acoustics, speech, and signal processing | 2010
Kalu U. Ogbureke; Julie Carson-Berndsen
Annotation of large multilingual corpora remains a challenge for the data-driven approach to speech research, especially for under-resourced languages. This paper presents cross-language automatic phonetic segmentation using Hidden Markov Models (HMMs). The underlying notion is segmentation based on articulation (manner and place) so as to provide extensive models that are applicable across languages. A test on the Appen Spanish speech corpus gives a phone recognition accuracy of 61.15% when bootstrapped with acoustic models trained on TIMIT, compared with a baseline result of 54.63% for flat-start initialization of the monophone models.
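The cross-language bootstrapping idea can be illustrated as a mapping from target-language phones to broad manner/place classes shared with source-language models. The phone symbols and classes below are illustrative assumptions, not the paper's actual inventories.

```python
# Minimal sketch: pick a source-language phone with the same (manner, place)
# class to initialise a target-language monophone model.
SOURCE_CLASSES = {  # English (TIMIT) phones
    "p": ("stop", "bilabial"),
    "t": ("stop", "alveolar"),
    "s": ("fricative", "alveolar"),
}
TARGET_CLASSES = {  # Spanish phones
    "b": ("stop", "bilabial"),
    "d": ("stop", "alveolar"),
    "s": ("fricative", "alveolar"),
}

def seed_phone(target_phone):
    """Return a source phone sharing manner and place with the target phone."""
    target_class = TARGET_CLASSES[target_phone]
    for phone, cls in SOURCE_CLASSES.items():
        if cls == target_class:
            return phone
    return None  # fall back to flat-start initialisation

print(seed_phone("b"))  # -> 'p'
```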
conference of the international speech communication association | 1994
Kai Hübener; Julie Carson-Berndsen
This paper presents a new approach to phoneme recognition using nonsequential sub-phoneme units. These units are called acoustic events and are phonologically meaningful as well as recognizable from speech signals. Acoustic events form a phonologically incomplete representation as compared to distinctive features. This problem may partly be overcome by incorporating phonological constraints. Currently, binary events describing manner and place of articulation, vowel quality and voicing are used to recognize all German phonemes. Phoneme recognition in this paradigm consists of two steps. After the acoustic events have been determined from the speech signal, a phonological parser is used to generate syllable and phoneme hypotheses from the event lattice. Results obtained on a speaker-dependent corpus are presented.
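The two-step idea of parsing an event lattice into phoneme hypotheses can be sketched as follows: each phoneme is defined by a set of required events, and a hypothesis is produced wherever all of its events overlap in time. The event names and phoneme definitions below are toy assumptions, not the paper's German inventory.

```python
# Minimal sketch: generate phoneme hypotheses from a lattice of timed events.
PHONEME_EVENTS = {
    "m": {"nasal", "labial", "voiced"},
    "a": {"vocalic", "open", "voiced"},
}

def phoneme_hypotheses(events):
    """events: list of (label, start, end). Returns (phoneme, start, end) hypotheses."""
    hyps = []
    observed = {lab for lab, _, _ in events}
    for phon, required in PHONEME_EVENTS.items():
        if required <= observed:
            spans = [(s, e) for lab, s, e in events if lab in required]
            start = max(s for s, _ in spans)
            end = min(e for _, e in spans)
            if start < end:  # all required events overlap in time
                hyps.append((phon, start, end))
    return hyps

evts = [("nasal", 0.00, 0.08), ("labial", 0.01, 0.07), ("voiced", 0.00, 0.20),
        ("vocalic", 0.08, 0.20), ("open", 0.09, 0.19)]
print(phoneme_hypotheses(evts))  # -> [('m', 0.01, 0.07), ('a', 0.09, 0.19)]
```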
non-linear speech processing | 2013
João P. Cabral; Julie Carson-Berndsen
The control over aspects of the glottal source signal is fundamental to correctly modifying relevant voice characteristics, such as breathiness. This voice quality is strongly related to the characteristics of the glottal source signal produced at the glottis, mainly the shape of the glottal pulse and the aspiration noise. This type of noise results from the turbulence of air passing through the glottis and can be represented by amplitude-modulated Gaussian noise, which depends on the glottal volume velocity and glottal area. However, the dependency between the glottal signal and the noise component is usually not taken into account when transforming breathiness. In this paper, we propose a method for modelling the aspiration noise which adapts the noise to the glottal pulse shape, taking this dependency into account while producing high-quality speech. The envelope of the amplitude-modulated noise is estimated pitch-synchronously from the speech signal and then parameterized using a non-linear polynomial fitting algorithm. Finally, an asymmetric triangular window is derived from the non-linear polynomial representation so that the energy envelope of the noise more closely follows that of the glottal source. In the experiments on voice transformation, both the proposed aspiration noise model and an acoustic glottal source model are used to transform a modal voice into a breathy voice. Results show that the aspiration noise model improves the voice quality transformation compared with an excitation using only the glottal model and with an excitation that combines the glottal source model and a spectral representation of the noise component.
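The core amplitude-modulation idea can be sketched numerically: Gaussian noise is shaped, pitch period by pitch period, by an asymmetric triangular envelope so that the noise energy follows the glottal cycle. The envelope skew and period length below are assumptions for illustration, not the estimated values from the paper.

```python
# Minimal sketch: aspiration noise as Gaussian noise modulated by an asymmetric
# triangular envelope applied once per pitch period.
import numpy as np

def asymmetric_triangle(n, peak_pos=0.3):
    """Asymmetric triangular window of n samples peaking at fraction peak_pos."""
    peak = int(n * peak_pos)
    rise = np.linspace(0.0, 1.0, peak, endpoint=False)
    fall = np.linspace(1.0, 0.0, n - peak)
    return np.concatenate([rise, fall])

def aspiration_noise(n_periods=5, period_samples=160, rng=None):
    rng = rng or np.random.default_rng(0)
    env = asymmetric_triangle(period_samples)
    noise = rng.standard_normal(n_periods * period_samples)
    return noise * np.tile(env, n_periods)

signal = aspiration_noise()
print(signal.shape)  # (800,) samples of pitch-synchronously modulated noise
```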
industrial and engineering applications of artificial intelligence and expert systems | 2006
Supphanat Kanokphara; Jan Macek; Julie Carson-Berndsen
Generally, speech recognition systems make use of acoustic features as a representation of speech for further processing. These acoustic features are usually based on human auditory perception or signal processing. More recently, Articulatory Feature (AF) based speech representations have been investigated by a number of speech technology researchers. Articulatory features are motivated by linguistic knowledge and hence may better represent speech characteristics. In this paper, we introduce two popular classification models, the Hidden Markov Model (HMM) and the Support Vector Machine (SVM), for automatic articulatory feature extraction. HMM-based systems are found to perform best when there is a good balance between the numbers of positive and negative examples in the data, while SVMs perform better in the unbalanced data condition.
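The SVM side of such a comparison amounts to one binary classifier per articulatory feature, with class weighting to cope with the unbalanced case the paper highlights. The sketch below uses synthetic frame features as a stand-in; the feature choice is an assumption.

```python
# Minimal sketch: a binary SVM detector for one articulatory feature
# (e.g. "nasal" vs "not nasal") on frame-level features, with class reweighting.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 13))            # assumed MFCC-like frame features
y = (rng.random(500) < 0.1).astype(int)   # ~10% positive frames: unbalanced data

clf = SVC(kernel="rbf", class_weight="balanced")  # reweight the rare class
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```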
meeting of the association for computational linguistics | 2004
Julie Carson-Berndsen; Robert Kelly; Moritz Neugebauer
Automata induction and typed feature theory are described in a unified framework for the automatic acquisition of feature-based phonotactic resources. The viability of this data-driven procedure is illustrated with examples taken from a corpus of syllable-labelled data.
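A common starting point for automata induction from labelled sequences is a prefix-tree acceptor, which later state-merging steps generalise. The sketch below builds such an acceptor from toy syllable data; the phone sequences are assumptions for illustration.

```python
# Minimal sketch: build a prefix-tree acceptor over phone sequences of syllables.
def build_prefix_tree(syllables):
    """syllables: list of phone-symbol lists. Returns (transitions, final states)."""
    transitions, finals, next_state = {}, set(), 1
    for syl in syllables:
        state = 0
        for phone in syl:
            if (state, phone) not in transitions:
                transitions[(state, phone)] = next_state
                next_state += 1
            state = transitions[(state, phone)]
        finals.add(state)
    return transitions, finals

trans, finals = build_prefix_tree([["s", "t", "r", "i"], ["s", "t", "a"], ["t", "a"]])
print(len(trans), "transitions,", len(finals), "final states")  # 7 transitions, 3 final states
```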
computational linguistics in the netherlands | 2000
Julie Carson-Berndsen; Gina Joue; Michael Walsh
The aim of this paper is to highlight areas in which a computational linguistic model of phonology can contribute to robustness in speech technology applications. We discuss a computational linguistic model which uses finite-state methodology and an event logic to demonstrate how declarative descriptions of phonological constraints can play a role in speech recognition. The model employs statistics derived from a cognitive phonological analysis of speech corpora. These statistics are used in ranking feature-based phonotactic constraints for the purposes of constraint relaxation and output extrapolation in syllable recognition. We present the model using a generic framework which we have developed specifically for constructing and evaluating phonotactic constraint descriptions. We demonstrate how new phonotactic constraint descriptions can be developed for the model and how the ranking of these constraints is used to cater for underspecified representations, thus making the model more robust.
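Constraint ranking and relaxation of this kind can be sketched as weighted checks that may be ignored below a relaxation threshold, so a syllable hypothesis can still be produced from an imperfect input. The constraints and weights below are illustrative assumptions, not the model's actual phonotactics.

```python
# Minimal sketch: ranked phonotactic constraints with relaxation of
# low-ranked violations when checking a candidate onset.
CONSTRAINTS = [
    ("onset starts with obstruent", 0.9, lambda onset: onset[0] in {"p", "t", "k", "s"}),
    ("no more than two consonants", 0.6, lambda onset: len(onset) <= 2),
]

def accept_onset(onset, relax_below=0.0):
    violated = [(name, w) for name, w, check in CONSTRAINTS if not check(onset)]
    # Relax (ignore) violations whose rank falls below the threshold.
    blocking = [(name, w) for name, w in violated if w >= relax_below]
    return len(blocking) == 0, violated

print(accept_onset(["s", "t", "r"]))                   # rejected: length constraint violated
print(accept_onset(["s", "t", "r"], relax_below=0.7))  # accepted after relaxing that constraint
```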
Archive | 2000
Lynne J. Cahill; Julie Carson-Berndsen; Gerald Gazdar
This is a tutorial paper on lexical knowledge representation techniques that are applicable in lexica that provide phonetic or phonological representations for words, rather than orthographic representations. Because the semantic and syntactic levels of lexical description are largely neutral with respect to phonology versus orthography, we do not consider them here. Instead, we concentrate on the morphological, morphophonological and phonological levels of description, levels of representation which are less often considered in NLP. Much of the field continues to operate under the assumption that natural languages correspond to sets of strings of orthographic characters.