Éva Székely
University College Dublin
Publications
Featured research published by Éva Székely.
Spoken Language Technology Workshop | 2012
Éva Székely; Tamás Gábor Csapó; Bálint Tóth; Péter Mihajlik; Julie Carson-Berndsen
Freely available audiobooks are a rich resource of expressive speech recordings that can be used for speech synthesis. Natural-sounding, expressive synthetic voices have previously been built from audiobooks that contained large amounts of highly expressive speech recorded from a professionally trained speaker. The majority of freely available audiobooks, however, are read by amateur speakers, are shorter, and contain less expressive (less emphatic, less emotional, etc.) speech in terms of both quality and quantity. Synthesizing expressive speech from a typical online audiobook therefore poses many challenges. In this work we address these challenges by applying a method consisting of minimally supervised techniques to align the text with the recorded speech, select groups of expressive speech segments, and build expressive voices for hidden Markov model-based synthesis using speaker adaptation. Subjective listening tests have shown that the expressive voices built with this method are often able to produce utterances suited to an emotional message. We used a restricted amount of speech data in our experiment in order to show that the method is generally applicable to most typical audiobooks widely available online.
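To make the segment-selection step concrete, the following is a minimal sketch (not the authors' code) of how utterance-level prosodic features from an aligned audiobook could be clustered into groups, so that the more expressive groups can serve as adaptation data for an HMM-based voice. The feature set, number of clusters, and example values are illustrative assumptions.

```python
# Minimal sketch, not the paper's implementation: group aligned audiobook
# utterances by clustering simple prosodic features, as a stand-in for the
# minimally supervised selection of expressive speech segments.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def select_expressive_groups(features: np.ndarray, utt_ids: list[str], n_groups: int = 3):
    """Cluster utterance-level prosodic features (e.g. mean F0, F0 range,
    mean energy) and return utterance ids grouped by cluster; the most
    expressive clusters can then be used as speaker-adaptation data."""
    scaled = StandardScaler().fit_transform(features)
    labels = KMeans(n_clusters=n_groups, n_init=10, random_state=0).fit_predict(scaled)
    groups: dict[int, list[str]] = {k: [] for k in range(n_groups)}
    for utt, label in zip(utt_ids, labels):
        groups[label].append(utt)
    return groups

# Usage with made-up feature rows [mean_f0_hz, f0_range_hz, mean_energy]:
feats = np.array([[180, 60, 0.6], [120, 15, 0.3], [210, 90, 0.8], [125, 20, 0.35]])
print(select_expressive_groups(feats, ["utt1", "utt2", "utt3", "utt4"]))
```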
International Conference on Acoustics, Speech, and Signal Processing | 2012
Éva Székely; John Kane; Stefan Scherer; Christer Gobl; Julie Carson-Berndsen
Audiobooks are known to contain a variety of expressive speaking styles that occur as a result of the narrator mimicking a character in a story or expressing affect. Accurate modeling of this variety is essential for speech synthesis from an audiobook. Voice quality differences are important features characterizing these different speaking styles, which are realized on a gradient and are often difficult to predict from the text. The present study uses a parameter characterizing breathy-to-tense voice qualities based on features of the wavelet transform, and a measure for identifying creaky segments in an utterance. Based on these features, a combination of supervised and unsupervised classification is used to detect the regions in an audiobook where the speaker changes his regular voice quality to a particular voice style. The target voice style candidates are selected based on the agreement of the supervised classifier ensemble output and evaluated in a listening test.
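As an illustration of the agreement-based candidate selection described above, here is a minimal sketch under the assumption that a breathy-to-tense parameter and a creak measure have already been extracted per segment; the classifier choices are placeholders, not the ones used in the study.

```python
# Minimal sketch, not the study's implementation: flag audiobook segments as
# voice-style candidates only when every classifier in a supervised ensemble
# agrees on a label that differs from the narrator's regular voice quality.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

def candidate_style_segments(X_train, y_train, X_book, regular_label=0):
    """Return indices of segments that the whole ensemble labels as a
    non-regular voice quality (the agreement criterion)."""
    ensemble = [SVC(),
                RandomForestClassifier(n_estimators=100, random_state=0),
                LogisticRegression(max_iter=1000)]
    preds = np.stack([clf.fit(X_train, y_train).predict(X_book) for clf in ensemble])
    agree = np.all(preds == preds[0], axis=0)   # all classifiers output the same label
    non_regular = preds[0] != regular_label     # ...and that label is not the regular voice
    return np.where(agree & non_regular)[0]
```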
Journal on Multimodal User Interfaces | 2014
Éva Székely; Ingmar Steiner; Zeeshan Ahmed; Julie Carson-Berndsen
One of the challenges of speech-to-speech translation is to accurately preserve the paralinguistic information in the speaker's message. Information about the affect and emotional intent of a speaker is often carried in more than one modality. For this reason, the possibility of multimodal interaction with the system and the conversation partner may greatly increase the likelihood of a successful and gratifying communication process. In this work we explore the use of automatic facial expression analysis as an input annotation modality to transfer paralinguistic information at a symbolic level from input to output in speech-to-speech translation. To evaluate the feasibility of this approach, a prototype system, FEAST (facial expression-based affective speech translation), has been developed. FEAST classifies the emotional state of the user and uses it to render the translated output in an appropriate voice style, using expressive speech synthesis.
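A minimal sketch of the symbolic hand-over that this description implies, assuming a fixed set of emotion labels and voice styles; the label names, style names, and the synthesize callable are hypothetical placeholders, not FEAST's actual interface.

```python
# Minimal sketch, assumed interface: map an emotion label obtained from facial
# expression analysis to an expressive voice style and render the translation.
from typing import Callable

# Hypothetical emotion-to-style mapping; not taken from the FEAST system.
EMOTION_TO_STYLE = {"happy": "cheerful", "sad": "subdued", "angry": "tense", "neutral": "neutral"}

def render_translation(translated_text: str, emotion: str,
                       synthesize: Callable[[str, str], bytes]) -> bytes:
    """Choose an expressive style for the detected emotion and synthesize the
    translated utterance with it, falling back to a neutral style."""
    style = EMOTION_TO_STYLE.get(emotion, "neutral")
    return synthesize(translated_text, style)
```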
Intelligent User Interfaces | 2013
Zeeshan Ahmed; Ingmar Steiner; Éva Székely; Julie Carson-Berndsen
In the emerging field of speech-to-speech translation, emphasis is currently placed on the linguistic content, while the significance of paralinguistic information conveyed by facial expression or tone of voice is typically neglected. We present a prototype system for multimodal speech-to-speech translation that is able to automatically recognize and translate spoken utterances from one language into another, with the output rendered by a speech synthesis system. The novelty of our system lies in the technique of generating the synthetic speech output in one of several expressive styles that is automatically determined using a camera to analyze the user's facial expression during speech.
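The overall architecture described here could be wired together along these lines; every component (recognizer, translator, expression classifier, synthesizer) is a hypothetical callable standing in for the actual modules, so this is only a sketch of how the facial-expression channel selects the output style.

```python
# Minimal sketch of the pipeline: ASR -> MT -> expressive TTS, with the output
# style chosen from the user's facial expression. All callables are placeholders.
from typing import Callable

def speech_to_speech(audio: bytes, video_frame: bytes,
                     recognize: Callable[[bytes], str],
                     translate: Callable[[str], str],
                     classify_expression: Callable[[bytes], str],
                     synthesize: Callable[[str, str], bytes]) -> bytes:
    """Recognize the source utterance, translate it, and synthesize the target
    utterance in a style determined by the facial expression classifier."""
    source_text = recognize(audio)
    target_text = translate(source_text)
    style = classify_expression(video_frame)   # e.g. "cheerful", "tense", "neutral"
    return synthesize(target_text, style)
```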
Proceedings of the 1st ACM SIGCHI International Workshop on Investigating Social Interactions with Artificial Agents | 2017
Catharine Oertel; Patrik Jonell; Kevin El Haddad; Éva Székely; Joakim Gustafson
In this paper we describe how audio-visual corpora recorded using crowd-sourcing techniques can be used for the audio-visual synthesis of attitudinal non-verbal feedback expressions for virtual agents. We discuss the limitations of this approach as well as where we see opportunities for this technology.
Conference of the International Speech Communication Association | 2011
Éva Székely; João P. Cabral; Peter Cahill; Julie Carson-Berndsen
Language Resources and Evaluation | 2012
Éva Székely; João P. Cabral; Mohamed Abou-Zleikha; Peter Cahill; Julie Carson-Berndsen
North American Chapter of the Association for Computational Linguistics | 2012
Éva Székely; Zeeshan Ahmed; João P. Cabral; Julie Carson-Berndsen
Language Resources and Evaluation | 2012
João P. Cabral; Mark Kane; Zeeshan Ahmed; Mohamed Abou-Zleikha; Éva Székely; Kalu U. Ogbureke; Peter Cahill; Julie Carson-Berndsen; Stephan Schlögl
Workshop on Multimodal Analyses enabling Artificial Agents in Human-Machine Interaction | 2012
Éva Székely; Zeeshan Ahmed; Ingmar Steiner; Julie Carson-Berndsen