

Publications


Featured research published by Tamás Gábor Csapó.


Procedia Computer Science | 2014

Speech-centric Multimodal Interaction for Easy-to-access Online Services – A Personal Life Assistant for the Elderly

António J. S. Teixeira; Annika Hämäläinen; Jairo Avelar; Nuno Almeida; Géza Németh; Tibor Fegyó; Csaba Zainkó; Tamás Gábor Csapó; Bálint Tóth; André Oliveira; Miguel Sales Dias

The PaeLife project is a European industry-academia collaboration whose goal is to provide the elderly with easy access to online services that make their lives easier and encourage their continued participation in society. To reach this goal, the project partners are developing a multimodal virtual personal life assistant (PLA) offering a wide range of services, from weather information to social networking. This paper presents the multimodal architecture of the PLA, the services it provides, and the work done on the speech input and output modalities, which play a key role in the application.


Spoken Language Technology Workshop | 2012

Synthesizing expressive speech from amateur audiobook recordings

Éva Székely; Tamás Gábor Csapó; Bálint Tóth; Péter Mihajlik; Julie Carson-Berndsen

Freely available audiobooks are a rich resource of expressive speech recordings that can be used for the purposes of speech synthesis. Natural sounding, expressive synthetic voices have previously been built from audiobooks that contained large amounts of highly expressive speech recorded from a professionally trained speaker. The majority of freely available audiobooks, however, are read by amateur speakers, are shorter and contain less expressive (less emphatic, less emotional, etc.) speech both in terms of quality and quantity. Synthesizing expressive speech from a typical online audiobook therefore poses many challenges. In this work we address these challenges by applying a method consisting of minimally supervised techniques to align the text with the recorded speech, select groups of expressive speech segments and build expressive voices for hidden Markov-model based synthesis using speaker adaptation. Subjective listening tests have shown that the expressive synthetic speech generated with this method is often able to produce utterances suited to an emotional message. We used a restricted amount of speech data in our experiment, in order to show that the method is generally applicable to most typical audiobooks widely available online.


IEEE Journal of Selected Topics in Signal Processing | 2014

Modeling Irregular Voice in Statistical Parametric Speech Synthesis With Residual Codebook Based Excitation

Tamás Gábor Csapó; Géza Németh

Statistical parametric text-to-speech synthesis is optimized for regular voices and may not create high-quality output for speakers who frequently produce irregular phonation. A number of excitation models have been proposed recently in the hidden Markov model speech synthesis framework, but few of them deal with this phenomenon. The baseline system of this study is our previous residual codebook based excitation model, which uses frames of pitch-synchronous residuals. To model the irregular voice typically occurring at phrase boundaries or sentence endings, two alternative extensions are proposed. The first, rule-based method applies pitch halving, amplitude scaling of residual periods with random factors, and spectral distortion. The second, data-driven approach uses a corpus of residuals extracted from irregularly phonated vowels, and unit selection is applied during synthesis. In perception tests of short speech segments, both methods improved on the baseline excitation in preference and in similarity to the original speaker. An acoustic experiment has shown that both methods can synthesize irregular voice that is close to original irregular phonation in terms of open quotient. The proposed methods may contribute to building natural, expressive and personalized speech synthesis systems.
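For illustration, the rule-based extension described above (pitch halving plus random amplitude scaling of residual periods) can be sketched roughly as follows. This is a toy reading of the method with an assumed scaling range, not the authors' implementation:

```python
import random

def irregularize(residual_periods, seed=0):
    """Toy sketch of a rule-based irregular-voice transformation:
    pitch halving (read here crudely as keeping every other residual
    period, doubling the effective period length) and random amplitude
    scaling of each remaining period."""
    rng = random.Random(seed)
    halved = residual_periods[::2]          # crude pitch halving
    scaled = []
    for period in halved:
        factor = rng.uniform(0.3, 1.0)      # assumed attenuation range
        scaled.append([s * factor for s in period])
    return scaled

# Example: four residual periods of four samples each
periods = [[1.0, -0.5, 0.2, 0.0]] * 4
out = irregularize(periods)
print(len(out))  # 2 periods remain after halving
```

The paper additionally applies spectral distortion, which is omitted here for brevity.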


SLSP 2015: Proceedings of the Third International Conference on Statistical Language and Speech Processing, Volume 9449 | 2015

Residual-Based Excitation with Continuous F0 Modeling in HMM-Based Speech Synthesis

Tamás Gábor Csapó; Géza Németh; Milos Cernak

In statistical parametric speech synthesis, creaky voice can cause disturbing artifacts. The reason is that standard pitch tracking algorithms tend to erroneously measure F0 in regions of creaky voice. This pattern is learned during the training of hidden Markov models (HMMs). In the synthesis phase, false voiced/unvoiced decisions caused by creaky voice result in audible quality degradation. In order to eliminate this phenomenon, we use a simple continuous F0 tracker which does not apply a strict voiced/unvoiced decision. In the proposed residual-based vocoder, Maximum Voiced Frequency is used for mixed voiced and unvoiced excitation. As all parameters of the vocoder are continuous, Multi-Space Distribution is not necessary when training the HMMs, which has been shown to be advantageous. Artifacts caused by creaky voice are eliminated with this speech synthesis system. A subjective listening test of English utterances has shown improvement over the traditional excitation.
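The continuous F0 idea in this abstract (no strict voiced/unvoiced decision) can be illustrated with a minimal sketch that linearly interpolates F0 through unvoiced frames. The actual tracker used in the paper is more sophisticated, so treat this as a conceptual example only:

```python
def continuous_f0(f0, unvoiced=0.0):
    """Sketch of a continuous F0 contour: replace unvoiced frames
    (marked with 0) by linear interpolation between the surrounding
    voiced frames, so downstream modeling never sees a hard
    voiced/unvoiced switch."""
    voiced = [i for i, v in enumerate(f0) if v != unvoiced]
    if not voiced:
        return list(f0)
    out = list(f0)
    # extrapolate flat at the edges
    for i in range(voiced[0]):
        out[i] = f0[voiced[0]]
    for i in range(voiced[-1] + 1, len(f0)):
        out[i] = f0[voiced[-1]]
    # linear interpolation between consecutive voiced frames
    for a, b in zip(voiced, voiced[1:]):
        for i in range(a + 1, b):
            t = (i - a) / (b - a)
            out[i] = f0[a] + t * (f0[b] - f0[a])
    return out

print(continuous_f0([0, 100, 0, 0, 130, 0]))
```

Because every frame then carries a real-valued F0, a single continuous stream can be modeled without Multi-Space Distributions.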


Intelligent Decision Technologies | 2014

Statistical parametric speech synthesis with a novel codebook-based excitation model

Tamás Gábor Csapó; Géza Németh

Speech synthesis is an important modality in Cognitive Infocommunications, the intersection of informatics and the cognitive sciences. Statistical parametric methods have recently gained importance in speech synthesis. The speech signal is decomposed into parameters and later restored from them; the decomposition is implemented by speech coders. We apply a novel codebook-based speech coding method to model the excitation of speech. In the analysis stage the speech signal is analyzed frame-by-frame and a codebook of pitch-synchronous excitations is built from the voiced parts. Timing, gain and harmonic-to-noise ratio parameters are extracted and fed into the machine learning stage of hidden Markov model based speech synthesis. During the synthesis stage the codebook is searched for a suitable element in each voiced frame, and these elements are concatenated to create the excitation signal, from which the final synthesized speech is created. Our initial experiments show that the model fits well in the statistical parametric speech synthesis framework and in most cases it can synthesize speech with better quality than the traditional pulse-noise excitation. (This paper is an extended version of [10].)
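The synthesis-stage codebook search described above can be sketched as a nearest-neighbor lookup over stored excitation parameters. The entry format, the parameter pair (gain, harmonic-to-noise ratio) and the squared-distance measure are illustrative assumptions, not the paper's exact procedure:

```python
def select_excitation(codebook, target_params):
    """Sketch of a synthesis-stage codebook search: for each voiced
    frame, pick the codebook entry whose stored parameters are closest
    to the predicted targets, then concatenate the chosen
    pitch-synchronous excitations into one signal."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    excitation = []
    for target in target_params:
        best = min(codebook, key=lambda entry: dist(entry["params"], target))
        excitation.extend(best["residual"])
    return excitation

# Illustrative two-entry codebook: params are (gain, HNR)
codebook = [
    {"params": (0.9, 12.0), "residual": [1.0, -1.0]},
    {"params": (0.2, 3.0),  "residual": [0.1, -0.1]},
]
sig = select_excitation(codebook, [(0.85, 11.0), (0.25, 4.0)])
print(sig)  # concatenation of the two best-matching residuals
```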


European Signal Processing Conference | 2016

Modeling unvoiced sounds in statistical parametric speech synthesis with a continuous vocoder

Tamás Gábor Csapó; Géza Németh; Milos Cernak; Philip N. Garner

In this paper, we introduce an improved excitation model for statistical parametric speech synthesis. Our earlier vocoder [1], which applies continuous F0 in combination with Maximum Voiced Frequency (MVF), is extended. The focus of this paper is on the modeling of unvoiced consonants, for which two alternative methods are proposed. The first method applies no postprocessing during MVF estimation to reduce the unwanted voiced component of unvoiced speech sounds. The second separates voiced and unvoiced excitation based on the phonetic labels of the text to be synthesized. In an objective experiment we found that the first method produces unvoiced sounds that are closer to natural speech in terms of Harmonics-to-Noise Ratio. A subjective listening test showed that both methods are more natural than our baseline system, and the second method is significantly preferred.


Journal of the Acoustical Society of America | 2017

Convolutional neural network-based automatic classification of midsagittal tongue gestural targets using B-mode ultrasound images

Kele Xu; Pierre Roussel; Tamás Gábor Csapó; Bruce Denby

Tongue gestural target classification is of great interest to researchers in the speech production field. Recently, deep convolutional neural networks (CNNs) have shown superiority to standard feature extraction techniques in a variety of domains. In this letter, both speaker-dependent and speaker-independent CNN-based tongue gestural target classification experiments are conducted to classify tongue gestures during natural speech production. The CNN-based method achieves state-of-the-art performance, even though no pre-training of the CNN (apart from a data augmentation preprocessing step) was carried out.


Journal of the Acoustical Society of America | 2016

A comparative study on the contour tracking algorithms in ultrasound tongue images with automatic re-initialization

Kele Xu; Tamás Gábor Csapó; Pierre Roussel; Bruce Denby

The feasibility of automatic re-initialization of contour tracking in ultrasound tongue sequences is explored using an image similarity-based method. To this end, the re-initialization method was incorporated into current state-of-the-art tongue tracking algorithms, and a quantitative comparison was made between the algorithms by computing the mean sum of distances error. The results demonstrate that with automatic re-initialization, the tracking error can be reduced from an average of 5-6 pixels to about 4 pixels. This result was obtained by using a large number of hand-labeled frames and similarity measurements to extract the contours, which leads to improved performance.
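The mean sum of distances (MSD) error mentioned above is a standard contour-comparison measure in ultrasound tongue tracking. A minimal sketch, assuming Euclidean point distances:

```python
def mean_sum_of_distances(contour_a, contour_b):
    """Mean sum of distances (MSD) between two tongue contours:
    for every point of both contours, take its distance to the
    nearest point on the other contour, then average over all
    points of both contours."""
    def nearest(p, contour):
        return min(((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
                   for q in contour)
    total = sum(nearest(p, contour_b) for p in contour_a)
    total += sum(nearest(q, contour_a) for q in contour_b)
    return total / (len(contour_a) + len(contour_b))

# Two parallel contours one pixel apart
a = [(0, 0), (1, 0), (2, 0)]
b = [(0, 1), (1, 1), (2, 1)]
print(mean_sum_of_distances(a, b))  # 1.0
```

Because the measure is symmetric in the two contours, it penalizes both missed and spurious contour segments.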


International Conference on Speech and Computer | 2016

Design of a Speech Corpus for Research on Cross-Lingual Prosody Transfer

Milan Sečujski; Branislav Gerazov; Tamás Gábor Csapó; Vlado Delić; Philip N. Garner; Aleksandar Gjoreski; David Guennec; Zoran A. Ivanovski; Aleksandar Melov; Géza Németh; Ana Stojkovic; György Szaszák

Since the prosody of a spoken utterance carries information about its discourse function, salience, and speaker attitude, prosody models and prosody generation modules have played a crucial part in text-to-speech (TTS) synthesis systems from the beginning, especially those set not only on sounding natural, but also on showing emotion or particular speaker intention. Prosody transfer within speech-to-speech translation is a recent research area with increasing importance, with one of its most important research topics being the detection and treatment of salient events, i.e. instances of prominence or focus which do not result from syntactic constraints, but are rather products of semantic or pragmatic level effects. This paper presents the design and the guidelines for the creation of a multilingual speech corpus containing prosodically rich sentences, ultimately aimed at training statistical prosody models for multilingual prosody transfer in the context of expressive speech synthesis.


European Signal Processing Conference | 2016

Continuous fundamental frequency prediction with deep neural networks

Bálint Tóth; Tamás Gábor Csapó

Deep learning has been proven to outperform other machine learning methods in numerous research fields. However, previous approaches, such as multi-space probability distribution hidden Markov models, still surpass deep learning methods in the prediction accuracy of speech fundamental frequency (F0), partly because of its discontinuous behavior. The current research focuses on the application of feedforward deep neural networks (DNNs) for modeling continuous F0 extracted by a recent vocoding technique. In order to achieve lower validation error, hyperparameter optimization with manual grid search was carried out. The results of objective and subjective evaluations show that using continuous F0 trajectories, DNNs can reach the modeling performance of previous state-of-the-art solutions. The complexity of the DNN architectures could also be reduced in the case of continuous F0 contours.

Collaboration


Tamás Gábor Csapó's top co-authors and their affiliations.

Top Co-Authors


Géza Németh

Budapest University of Technology and Economics


Alexandra Markó

Eötvös Loránd University


Tekla Etelka Gráczi

Hungarian Academy of Sciences


Bálint Tóth

Budapest University of Technology and Economics


Andrea Deme

Hungarian Academy of Sciences


Csaba Zainkó

Budapest University of Technology and Economics
