Branislav Gerazov
Idiap Research Institute
Publication
Featured research published by Branislav Gerazov.
International Conference on Acoustics, Speech, and Signal Processing | 2015
Pierre-Edouard Honnet; Branislav Gerazov; Philip N. Garner
Current statistical parametric text-to-speech (TTS) synthesis methods allow production of neutral speech with acceptable quality. However, the prosody is often judged unsatisfactory and too flat. In this paper, we address intonation modelling for TTS based on physiological aspects of prosody production. A set of gamma-distribution-shaped atoms is defined, and intonation decomposition is then performed using a matching pursuit algorithm. Preliminary experiments show that this model allows easy extraction of physiologically meaningful atoms that could be used to generate intonation in a TTS system.
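The decomposition described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the atom parametrisation (shape `k`, scale `theta`), the atom length, and the plain inner-product selection rule are all assumptions made for illustration, and the toy signal is synthetic rather than a real pitch contour.

```python
import math

def gamma_atom(k, theta, length):
    """Unit-norm gamma-shaped atom t**(k-1) * exp(-t/theta), sampled at t = 0..length-1.
    The parametrisation is an assumption, not taken from the paper."""
    raw = [(t ** (k - 1)) * math.exp(-t / theta) for t in range(length)]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

def matching_pursuit(signal, atoms, n_iter):
    """Greedy decomposition: at each step pick the shifted atom with the largest
    absolute inner product with the residual, then subtract its contribution."""
    residual = list(signal)
    picks = []  # each entry is (atom index, shift, amplitude)
    for _ in range(n_iter):
        best = None
        for a_idx, atom in enumerate(atoms):
            for shift in range(len(residual) - len(atom) + 1):
                corr = sum(residual[shift + i] * atom[i] for i in range(len(atom)))
                if best is None or abs(corr) > abs(best[2]):
                    best = (a_idx, shift, corr)
        a_idx, shift, amp = best
        for i, v in enumerate(atoms[a_idx]):
            residual[shift + i] -= amp * v
        picks.append(best)
    return picks, residual

# Toy example: a synthetic contour built from one positive and one negative atom.
atoms = [gamma_atom(2, theta, 40) for theta in (3.0, 6.0)]
signal = [0.0] * 120
for i, v in enumerate(atoms[0]):
    signal[10 + i] += 2.0 * v
for i, v in enumerate(atoms[1]):
    signal[60 + i] -= 1.5 * v

picks, residual = matching_pursuit(signal, atoms, n_iter=2)
for a_idx, shift, amp in picks:
    print(f"atom {a_idx} at sample {shift} with amplitude {amp:.3f}")
```

Because the two synthetic components do not overlap, two iterations should recover both atoms and leave a residual close to zero. On real contours, atoms overlap and the greedy choice is only approximate.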
IEEE Transactions on Audio, Speech, and Language Processing | 2015
Branislav Gerazov; Zoran A. Ivanovski
Noise robustness has become a crucial parameter in Automatic Speech Recognition (ASR) systems today with their increased use in noise-filled real-world environments. One way to address this issue is to develop features that are innately noise-robust. The Kernel Power flow Orientation Coefficients (KPOCs) are a novel feature set based on spectro-temporal analysis that uses a bank of 2D kernels to extract the dominant orientation of the power flow at each point in the auditory spectrogram of the speech signal. The collection of dominant power flow orientation angles forms a novel representation of the speech signal named the Power flow Orientation Spectrogram (POS), which is innately resistant to the spectral masking introduced by the presence of noise and reverberation. This approach not only grants KPOCs their noise robustness, but also keeps the number of output coefficients inherently small, thus eliminating the need for the feature dimensionality reduction otherwise necessary in the conventional spectro-temporal approach. The performance of KPOCs has been evaluated on three experimental frameworks, and the results show that they outperform a number of well-known noise-robust features for average and low SNRs. The relative improvement in Word Recognition Accuracy (WRA) over the classic Mel Frequency Cepstral Coefficients (MFCCs) for the Aurora 2 task ranges from 32% up to 190% for SNRs in the range from 10 down to -5 dB. The experimental results also show that, in clean training, the performance of KPOCs approaches that of the state-of-the-art noise-robust ASR frontends in all noise scenarios for small vocabulary ASR tasks.
International Symposium on Communications, Control and Signal Processing | 2012
Branislav Gerazov; Zoran A. Ivanovski
The paper presents the results of an analysis of extracted pitch contours of spoken Macedonian across 7 native speakers and 3 discourse contexts. The 125 analyzed intonation phrases (IPs) were taken from a speech corpus recorded for this purpose. Pitch contours were extracted for each of the phrases and then normalized to the speaker's mode, which allowed group analysis of the contours. In total, 8 groups were created based on speaker sex and 4 different discourse functions. Each group was then statistically analyzed, and the average normalized pitch as well as upper and lower bound vectors were calculated. The algorithms designed for this purpose are described in detail. The calculated vectors were used as the basis for building linear-segment models used for automatic intonation generation in a text-to-speech (TTS) synthesis system.
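The normalisation and group statistics mentioned in this abstract might look roughly like the sketch below. The abstract does not give the algorithms, so every detail here is an assumption: the mode is estimated from a histogram with a hypothetical 5 Hz bin width, normalisation is a simple division by the mode, contours are time-normalised by linear interpolation, and the bounds are taken as the per-point minimum and maximum. The F0 values are made up for illustration.

```python
from collections import Counter

def speaker_mode(f0_values, bin_hz=5.0):
    """Most frequent F0 value, estimated as the centre of the fullest histogram bin
    (bin width is a hypothetical choice)."""
    bins = Counter(int(f // bin_hz) for f in f0_values)
    fullest = max(bins, key=bins.get)
    return (fullest + 0.5) * bin_hz

def resample(contour, n_points):
    """Linear interpolation of a contour onto n_points equally spaced positions."""
    out = []
    for j in range(n_points):
        pos = j * (len(contour) - 1) / (n_points - 1)
        lo = int(pos)
        hi = min(lo + 1, len(contour) - 1)
        frac = pos - lo
        out.append(contour[lo] * (1 - frac) + contour[hi] * frac)
    return out

def group_statistics(contours, n_points=5):
    """Mean, lower-bound and upper-bound vectors over time-normalised contours."""
    aligned = [resample(c, n_points) for c in contours]
    columns = list(zip(*aligned))
    mean = [sum(col) / len(col) for col in columns]
    lower = [min(col) for col in columns]
    upper = [max(col) for col in columns]
    return mean, lower, upper

# Illustrative, made-up F0 values (Hz) for two phrases by the same speaker.
phrase_a = [118.0, 122.0, 131.0, 124.0, 112.0]
phrase_b = [121.0, 123.0, 140.0, 126.0, 119.0, 108.0]
mode = speaker_mode(phrase_a + phrase_b)
normalised = [[f / mode for f in phrase] for phrase in (phrase_a, phrase_b)]
mean, lower, upper = group_statistics(normalised)
print(mode, mean)
```

Dividing by the mode makes contours from speakers with different pitch ranges comparable; a log or semitone scale would be an equally plausible choice.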
Telecommunications Forum | 2015
Branislav Gerazov; Philip N. Garner
Prosody is a crucial aspect of the speech signal and its modelling is of great importance for various speech technologies. Intonation models based on physiology rely on an accurate model of muscle activation. Although most of them are based on the spring-damper-mass (SDM) muscle model, the more complex Hill type model offers a more accurate representation of muscle dynamics. In this paper we analyse and compare these two muscle models and discuss the benefits and disadvantages they bring. This research is a part of an on-going effort to develop an improved intonation model.
Speech Communication | 2017
Pierre-Edouard Honnet; Branislav Gerazov; Aleksandar Gjoreski; Philip N. Garner
We propose a physiologically based intonation model using perceptual relevance. Motivated by speech synthesis from a speech-to-speech translation (S2ST) point of view, we aim at a language independent way of modelling intonation. The model presented in this paper can be seen as a generalisation of the command response (CR) model, albeit with the same modelling power. It is an additive model which decomposes intonation contours into a sum of critically damped system impulse responses. To decompose the intonation contour, we use a weighted correlation based atom decomposition algorithm (WCAD) built around a matching pursuit framework. The algorithm allows for an arbitrary precision to be reached using an iterative procedure that adds more elementary atoms to the model. Experiments are presented demonstrating that this generalised CR (GCR) model is able to model intonation as would be expected. Experiments also show that the model produces a similar number of parameters or elements as the CR model. We conclude that the GCR model is appropriate as an engineering solution for modelling prosody, and hope that it is a contribution to a deeper scientific understanding of the neurobiological process of intonation.
International Conference on Speech and Computer | 2016
György Szaszák; Máté Ákos Tündik; Branislav Gerazov; Aleksandar Gjoreski
The Weighted Correlation based Atom Decomposition (WCAD) algorithm is a technique for intonation modelling that uses a matching pursuit framework to decompose the F0 contour into a set of basic components, called atoms. The atoms attempt to model the physiological activation of the laryngeal muscles responsible for changes in F0. Recently, WCAD has been upgraded to use the orthogonal matching pursuit (OMP) algorithm, which gives qualitative improvements in the modelling of intonation. A possible application of the OMP-based WCAD is the automatic detection of stress in speech, which we undertake for the Hungarian language. Correlation is demonstrated between stress and atomic peaks, as well as between stress and atomic valleys on the previous syllable. The stress detection technique based on WCAD is compared to a baseline system using HMM/GMM stress/phrase models. A 7% improvement in the F-measure over the baseline is observed when evaluating against a hand-made reference. Finally, we propose a hybrid approach which outperforms both individual systems (by 11% compared to the baseline).
International Conference on Speech and Computer | 2016
Milan Sečujski; Branislav Gerazov; Tamás Gábor Csapó; Vlado Delić; Philip N. Garner; Aleksandar Gjoreski; David Guennec; Zoran A. Ivanovski; Aleksandar Melov; Géza Németh; Ana Stojkovic; György Szaszák
Since the prosody of a spoken utterance carries information about its discourse function, salience, and speaker attitude, prosody models and prosody generation modules have played a crucial part in text-to-speech (TTS) synthesis systems from the beginning, especially those set not only on sounding natural, but also on showing emotion or particular speaker intention. Prosody transfer within speech-to-speech translation is a recent research area with increasing importance, with one of its most important research topics being the detection and treatment of salient events, i.e. instances of prominence or focus which do not result from syntactic constraints, but are rather products of semantic or pragmatic level effects. This paper presents the design and the guidelines for the creation of a multilingual speech corpus containing prosodically rich sentences, ultimately aimed at training statistical prosody models for multilingual prosody transfer in the context of expressive speech synthesis.
International Conference on Speech and Computer | 2016
Branislav Gerazov; Philip N. Garner
Prosody is a phenomenon that is crucial for numerous fields of speech research, which underscores the importance of having a robust prosody model. A class of intonation models based on the physiology of pitch production is especially attractive for its inherent multilingual support. These models rely on an accurate model of muscle activation. Traditionally they have used the 2nd order spring-damper-mass (SDM) muscle model. However, recent research has shown that the SDM model is not sufficient for adequate modelling of the muscle dynamics. The 3rd order Hill type model offers a more accurate representation of muscle dynamics, but it has been shown to be underdamped when using physiologically plausible muscle parameters. In this paper we propose an agonist-antagonist pitch production (A2P2) model that both validates and gives insight into the improved results of using higher-order critically damped system models in intonation modelling.
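For readers unfamiliar with the 2nd order SDM model mentioned above, the sketch below evaluates the textbook impulse response of a generic spring-damper-mass system m x'' + c x' + k x = F(t). It is a standard result from linear system theory, not the A2P2 model or the Hill type model from the paper, and the parameter values are arbitrary rather than physiological.

```python
import math

def sdm_impulse_response(t, mass, damping, stiffness):
    """Response at time t of m*x'' + c*x' + k*x = F(t) to a unit impulse at t = 0."""
    if t < 0:
        return 0.0
    wn = math.sqrt(stiffness / mass)                      # natural frequency
    zeta = damping / (2.0 * math.sqrt(stiffness * mass))  # damping ratio
    if abs(zeta - 1.0) < 1e-9:
        # Critically damped: a single rise and decay, no oscillation.
        return (t / mass) * math.exp(-wn * t)
    if zeta < 1.0:
        # Underdamped: a decaying oscillation that overshoots below zero.
        wd = wn * math.sqrt(1.0 - zeta ** 2)
        return math.exp(-zeta * wn * t) * math.sin(wd * t) / (mass * wd)
    # Overdamped: slower, non-oscillating decay.
    wo = wn * math.sqrt(zeta ** 2 - 1.0)
    return math.exp(-zeta * wn * t) * math.sinh(wo * t) / (mass * wo)

# Arbitrary illustrative parameters: mass 1, stiffness 4, so critical damping is c = 4.
times = [0.1 * i for i in range(40)]
critical = [sdm_impulse_response(t, 1.0, 4.0, 4.0) for t in times]
underdamped = [sdm_impulse_response(t, 1.0, 0.4, 4.0) for t in times]
print(min(critical) >= 0.0, min(underdamped) < 0.0)
```

The critically damped response has the form t·exp(-ωt), which is a gamma-shaped curve with shape parameter 2. An underdamped system oscillates around zero instead, which is one way to see why underdamping is a concern when the response is meant to represent a single muscle activation.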
International Conference on Speech and Computer | 2016
Tijana Delic; Branislav Gerazov; Branislav M. Popovic; Milan Sečujski
One of the most recently proposed techniques for modeling the prosody of an utterance is the decomposition of its pitch, duration and/or energy contour into physiologically motivated units called atoms, based on matching pursuit. Since this model is based on the physiology of the production of sentence intonation, it is essentially language independent. However, the intonation of an utterance in a particular language is obviously under the influence of factors of a predominantly linguistic nature. In this research, restricted to the case of American English with prosody annotated using standard ToBI conventions, we have shown that, under certain mild constraints, the positive and negative atoms identified in the pitch contour coincide very well with high and low pitch accents and phrase accents of ToBI. By giving a linguistic interpretation of the atom decomposition model, this research enables its practical use in domains such as speech synthesis or cross-lingual prosody transfer.
Speech Communication | 2018
György Szaszák; Máté Ákos Tündik; Branislav Gerazov
The detection of prosodic events, prosodic stress, and speech segmentation based on prosody have received much attention in the research community in the past decades. Prosody is relevant for both main areas of speech technology, text-to-speech synthesis and automatic speech recognition and understanding, and is exploited increasingly: besides providing redundancy, prosody is recognized to carry information unavailable from other sources and also contributes to the naturalness of the perceived speech. This paper addresses a recently proposed intonation analysis technique, called Weighted Correlation based Atom Decomposition (WCAD). The WCAD approach is inspired by the physiology of speech production and the Fujisaki model used in speech synthesis; however, it is employed in an analytic rather than a generative way: the intonation contour is decomposed into a set of elementary components, called atoms, by a pattern matching algorithm. The obtained atom decomposition is used for prosodic stress detection and automatic phonological phrasing. We compare the WCAD approach with a phonological approach, and also combine the two; the phonological approach relies on automatic segmentation into phonological phrases using a Gaussian Mixture Model (GMM) / Hidden Markov Model (HMM) and Viterbi alignment. Results show comparable performance of the physiologically inspired system to the phonologically conceived one in phonological phrasing for two fixed-stress languages of different language families: Hungarian and French. By this we also intend to experimentally confirm that the physiologically inspired WCAD model is able to predict or extract linguistically relevant markers linked to meaning. Finally, a hybrid model is proposed, combining the physiologically and the phonologically inspired approaches, and evaluated in phonological phrase and prosodic stress detection in both languages. The performance of the hybrid model is found to be superior to both individual systems.
The basic algorithmic steps targeting feature extraction and atom decomposition, as a whole, are applicable to a wide range of languages. However, linking these to linguistic levels and meaning is by nature language specific, i.e. determining which event refers to which linguistic cue or function cannot be defined without knowing the language.