Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Bajibabu Bollepalli is active.

Publication


Featured research published by Bajibabu Bollepalli.


international conference on acoustics, speech, and signal processing | 2016

High-pitched excitation generation for glottal vocoding in statistical parametric speech synthesis using a deep neural network

Lauri Juvela; Bajibabu Bollepalli; Manu Airaksinen; Paavo Alku

Achieving high quality and naturalness in the statistical parametric synthesis of female voices remains difficult despite recent advances in the study area. Vocoding is one key element of all statistical speech synthesizers that is known to affect synthesis quality and naturalness. The present study focuses on a special type of vocoding, glottal vocoders, which aim to parameterize speech based on modelling the real excitation of (voiced) speech, the glottal flow. More specifically, we compare three different glottal vocoders with the aim of improving the synthesis naturalness of female voices. Two of the vocoders are previously known, both utilizing an older glottal inverse filtering (GIF) method to estimate the glottal flow. The third one, denoted as Quasi Closed Phase - Deep Neural Net (QCP-DNN), takes advantage of a recently proposed GIF method that shows improved accuracy in estimating the glottal flow from high-pitched speech. Subjective listening tests conducted on a US English female voice show that the proposed QCP-DNN method gives a significant improvement in synthesis naturalness compared to the two previously developed glottal vocoders.
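Glottal vocoders rest on glottal inverse filtering (GIF): an estimate of the vocal tract filter is removed from the speech signal so that what remains approximates the glottal flow (or its derivative). The paper compares vocoders built on different GIF methods (e.g., QCP); the snippet below is only a minimal, hypothetical LPC-based sketch of the inverse filtering idea in Python, not the authors' algorithm, and the frame length, window, and LPC order are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc(frame, order=18):
    """Autocorrelation-method LPC inverse filter [1, -a1, ..., -ap]."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = solve_toeplitz(r[:order], r[1:order + 1])   # Yule-Walker normal equations
    return np.concatenate(([1.0], -a))

def glottal_residual(frame, order=18):
    """Very rough glottal-flow-derivative estimate for one voiced frame."""
    A = lpc(frame * np.hanning(len(frame)), order)  # all-pole vocal tract estimate
    return lfilter(A, [1.0], frame)                 # inverse filter out the vocal tract

# toy usage: in practice `frame` would be a ~25 ms voiced speech frame at 16 kHz
frame = np.random.randn(400)
residual = glottal_residual(frame)
```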


conference of the international speech communication association | 2016

GlottDNN - A full-band glottal vocoder for statistical parametric speech synthesis

Manu Airaksinen; Bajibabu Bollepalli; Lauri Juvela; Zhizheng Wu; Simon King; Paavo Alku

GlottHMM is a previously developed vocoder that has been successfully used in HMM-based synthesis by parameterizing speech into two parts (glottal flow, vocal tract) according to the functioning of the real human voice production mechanism. In this study, a new glottal vocoding method, GlottDNN, is proposed. The GlottDNN vocoder is built on the principles of its predecessor, GlottHMM, but introduces three main improvements: it (1) takes advantage of a new, more accurate glottal inverse filtering method, (2) uses a new method of deep neural network (DNN)-based glottal excitation generation, and (3) adopts a new band-wise approach to processing full-band speech. The proposed GlottDNN vocoder was evaluated as part of a full-band, state-of-the-art DNN-based text-to-speech (TTS) synthesis system, and compared against the release version of the original GlottHMM vocoder and the well-known STRAIGHT vocoder. The results of the subjective listening test indicate that GlottDNN improves TTS quality over the compared methods.
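The DNN-based excitation generation in a vocoder of this kind maps frame-level acoustic features to a glottal excitation waveform segment. As a purely illustrative sketch (not the GlottDNN implementation), the following PyTorch code trains a small feed-forward network to regress fixed-length, pitch-synchronous glottal pulses from acoustic features with a squared-error loss; the feature and pulse dimensions are assumptions.

```python
import torch
import torch.nn as nn

FEAT_DIM, PULSE_LEN = 48, 400   # assumed sizes: frame features -> two-period glottal pulse

# feed-forward excitation generator trained with plain MSE
excitation_net = nn.Sequential(
    nn.Linear(FEAT_DIM, 512), nn.Tanh(),
    nn.Linear(512, 512), nn.Tanh(),
    nn.Linear(512, PULSE_LEN),
)
optimizer = torch.optim.Adam(excitation_net.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

def train_step(features, target_pulses):
    """features: (batch, FEAT_DIM); target_pulses: (batch, PULSE_LEN) natural pulses."""
    optimizer.zero_grad()
    loss = loss_fn(excitation_net(features), target_pulses)
    loss.backward()
    optimizer.step()
    return loss.item()
```

At synthesis time the predicted pulses would still have to be pitch-modified, overlap-added into a continuous excitation signal, and filtered with the vocal tract envelope; none of that is shown here.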


international conference on signal processing | 2012

Analysis of breathy voice based on excitation characteristics of speech production

Sathya Adithya Thati; Bajibabu Bollepalli; Peri Bhaskararao; B. Yegnanarayana

The objective of this paper is to find the fundamental difference between breathy and modal voices based on differences in speech production as reflected in the signal. We propose signal processing methods for analyzing the phonation in breathy voice. These methods include the zero-frequency filtering technique, loudness measurement, computation of the periodic-to-aperiodic energy ratio, and extraction of formants and their amplitudes using a group-delay-based technique. Parameters derived using these methods capture the excitation source characteristics, which play a prominent role in determining voice quality. Classification of vowels into breathy or modal voice is achieved with an accuracy of 93.93% using the loudness measure.
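Zero-frequency filtering (ZFF), the first of the listed techniques, passes the speech signal through a cascade of two resonators located at 0 Hz (pure double integrators) and then removes the resulting polynomial trend with a local mean taken over roughly a pitch period; negative-to-positive zero crossings of the trend-removed signal mark the glottal closure instants, and the slope around them reflects the strength of excitation. The numpy sketch below follows the general ZFF recipe from the literature rather than this paper's exact settings, and the 16 kHz sampling rate and 10 ms trend window are assumptions.

```python
import numpy as np
from scipy.signal import lfilter

def zero_frequency_filter(x, fs=16000, win_ms=10.0):
    """Trend-removed zero-frequency-filtered signal; epochs ~ glottal closure instants."""
    d = np.diff(x, prepend=x[0])                       # remove DC / low-frequency bias
    y = lfilter([1.0], [1.0, -2.0, 1.0], d)            # first 0 Hz resonator
    y = lfilter([1.0], [1.0, -2.0, 1.0], y)            # second 0 Hz resonator
    w = int(fs * win_ms / 1000) | 1                    # odd-length trend-removal window
    kernel = np.ones(w) / w
    z = y - np.convolve(y, kernel, mode="same")        # remove polynomial trend
    z = z - np.convolve(z, kernel, mode="same")        # second pass, as in ZFF practice
    epochs = np.where((z[:-1] < 0) & (z[1:] >= 0))[0]  # negative-to-positive crossings
    return z, epochs
```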


9th IFIP WG 5.5 International Summer Workshop on Multimodal Interfaces, eNTERFACE 2013, Lisbon, Portugal, July 15 – August 9, 2013 | 2014

Tutoring Robots: Multiparty multimodal social dialogue with an embodied tutor

Samer Al Moubayed; Jonas Beskow; Bajibabu Bollepalli; Ahmed Hussen-Abdelaziz; Martin Johansson; Maria Koutsombogera; José Lopes; Jekaterina Novikova; Catharine Oertel; Gabriel Skantze; Kalin Stefanov; Gül Varol

This project explores a novel experimental setup towards building a spoken, multi-modally rich, and human-like multiparty tutoring agent. A setup is developed and a corpus is collected that targets t ...


conference of the international speech communication association | 2017

Generative adversarial network-based glottal waveform model for statistical parametric speech synthesis

Bajibabu Bollepalli; Lauri Juvela; Paavo Alku

Recent studies have shown that text-to-speech synthesis quality can be improved by using glottal vocoding. This refers to vocoders that parameterize speech into two parts, the glottal excitation and vocal tract, that occur in the human speech production apparatus. Current glottal vocoders generate the glottal excitation waveform by using deep neural networks (DNNs). However, the squared error-based training of the present glottal excitation models is limited to generating conditional average waveforms, which fails to capture the stochastic variation of the waveforms. As a result, shaped noise is added as post-processing. In this study, we propose a new method for predicting glottal waveforms by generative adversarial networks (GANs). GANs are generative models that aim to embed the data distribution in a latent space, enabling generation of new instances very similar to the original by randomly sampling the latent distribution. The glottal pulses generated by GANs show a stochastic component similar to natural glottal pulses. In our experiments, we compare synthetic speech generated using glottal waveforms produced by both DNNs and GANs. The results show that the newly proposed GANs achieve synthesis quality comparable to that of widely-used DNNs, without using an additive noise component.
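The adversarial setup replaces the squared-error criterion with a learned one: a discriminator is trained to tell natural glottal pulses from generated ones, while the generator, conditioned on acoustic features and a random noise vector, is trained to fool it, which allows it to reproduce the stochastic detail that a conditional-average MSE model smooths away. The PyTorch sketch below is a generic conditional GAN over fixed-length pulses; the layer sizes, conditioning scheme, and optimizer settings are assumptions for illustration, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

FEAT_DIM, NOISE_DIM, PULSE_LEN = 48, 16, 400   # assumed sizes

G = nn.Sequential(nn.Linear(FEAT_DIM + NOISE_DIM, 512), nn.ReLU(),
                  nn.Linear(512, 512), nn.ReLU(),
                  nn.Linear(512, PULSE_LEN), nn.Tanh())
D = nn.Sequential(nn.Linear(FEAT_DIM + PULSE_LEN, 512), nn.LeakyReLU(0.2),
                  nn.Linear(512, 1))            # logits, scored with BCEWithLogitsLoss

opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def gan_step(feats, real_pulses):
    """One conditional-GAN update; feats: (B, FEAT_DIM), real_pulses: (B, PULSE_LEN)."""
    z = torch.randn(feats.size(0), NOISE_DIM)
    fake = G(torch.cat([feats, z], dim=1))

    # discriminator: push natural pulses towards label 1, generated pulses towards 0
    opt_d.zero_grad()
    d_loss = bce(D(torch.cat([feats, real_pulses], 1)), torch.ones(feats.size(0), 1)) + \
             bce(D(torch.cat([feats, fake.detach()], 1)), torch.zeros(feats.size(0), 1))
    d_loss.backward()
    opt_d.step()

    # generator: try to make the discriminator label generated pulses as real
    opt_g.zero_grad()
    g_loss = bce(D(torch.cat([feats, fake], 1)), torch.ones(feats.size(0), 1))
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```

In practice GAN training for waveforms is usually stabilized further, for example by combining the adversarial loss with a squared-error term, which this sketch omits.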


human-robot interaction | 2014

Human-robot collaborative tutoring using multiparty multimodal spoken dialogue

Samer Al Moubayed; Jonas Beskow; Bajibabu Bollepalli; Joakim Gustafson; Ahmed Hussen-Abdelaziz; Martin Johansson; Maria Koutsombogera; José Lopes; Jekaterina Novikova; Catharine Oertel; Gabriel Skantze; Kalin Stefanov; Gül Varol

In this paper, we describe a project that explores a novel experimental setup towards building a spoken, multi-modally rich, and human-like multiparty tutoring robot. A human-robot interaction setup is designed, and a human-human dialogue corpus is collected. The corpus targets the development of a dialogue system platform to study verbal and nonverbal tutoring strategies in multiparty spoken interactions with robots that are capable of spoken dialogue. The dialogue task is centered on two participants involved in a dialogue aiming to solve a card-ordering game. Along with the participants sits a tutor (robot) that helps the participants perform the task and organizes and balances their interaction. Different multimodal signals, captured and automatically synchronized by different audio-visual capture technologies such as a microphone array, Kinects, and video cameras, were coupled with manual annotations. These are used to build a situated model of the interaction based on the participants' personalities, their state of attention, their conversational engagement and verbal dominance, and how these correlate with the verbal and visual feedback, turn-management, and conversation regulatory actions generated by the tutor. Driven by the analysis of the corpus, we also show the detailed design methodologies for an affective and multimodally rich dialogue system that allows the robot to incrementally measure the attention state and the dominance of each participant, allowing the robot head Furhat to maintain a well-coordinated, balanced, and engaging conversation that attempts to maximize the agreement and the contribution towards solving the task. This project sets the first steps to explore the potential of using multimodal dialogue systems to build interactive robots that can serve in educational, team-building, and collaborative task-solving applications.


non-linear speech processing | 2013

Non-linear Pitch Modification in Voice Conversion Using Artificial Neural Networks

Bajibabu Bollepalli; Jonas Beskow; Joakim Gustafson

The majority of current voice conversion methods do not focus on modelling the local variations of the pitch contour, but only on linear modification of the pitch values based on means and standard deviations. However, a significant amount of speaker-related information is also present in the pitch contour. In this paper we propose a non-linear pitch modification method for mapping the pitch contours of the source speaker onto those of the target speaker. This work is done within the framework of Artificial Neural Network (ANN)-based voice conversion. The pitch contours are represented with Discrete Cosine Transform (DCT) coefficients at the segmental level. The results, evaluated using subjective and objective measures, confirm that the proposed method performs better in mimicking the target speaker's speaking style than the linear modification method.
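Representing a pitch contour segment by its lowest DCT coefficients gives a compact, length-normalized description of its shape, so the ANN can map source-speaker coefficients to target-speaker coefficients instead of raw F0 values, and the modified contour is recovered by inverting the truncated DCT. The sketch below illustrates this with scipy and scikit-learn; the number of coefficients, the segment granularity, and the network size are assumptions rather than the paper's settings.

```python
import numpy as np
from scipy.fft import dct, idct
from sklearn.neural_network import MLPRegressor

N_COEF = 10   # assumed number of DCT coefficients per syllable-sized segment

def contour_to_dct(f0_segment, n_coef=N_COEF):
    """Fixed-length DCT description of one (log-)F0 contour segment."""
    return dct(f0_segment, norm="ortho")[:n_coef]

def dct_to_contour(coefs, seg_len):
    """Invert the truncated DCT back to a contour of the original segment length."""
    full = np.zeros(seg_len)
    full[:len(coefs)] = coefs
    return idct(full, norm="ortho")

# mapping network: source-speaker coefficients -> target-speaker coefficients
mapper = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000)
# X_src, Y_tgt: (n_segments, N_COEF) arrays from time-aligned parallel training data
# mapper.fit(X_src, Y_tgt)
# converted = dct_to_contour(mapper.predict(contour_to_dct(f0_seg)[None, :])[0], len(f0_seg))
```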


IEEE Signal Processing Letters | 2017

Glottal vocoding with frequency-warped time-weighted linear prediction

Manu Airaksinen; Bajibabu Bollepalli; Jouni Pohjalainen; Paavo Alku

Linear prediction (LP) is a prevalent source-filter separation method in speech processing. One of the drawbacks of conventional LP-based approaches is the biasing of estimated formants by harmonic peaks. Methods such as discrete all-pole modeling and weighted LP have been proposed to overcome this problem, but they all use a linear frequency scale. This study proposes a new LP technique, frequency-warped time-weighted linear prediction (WWLP), which provides spectral envelope estimates that are robust to harmonic peaks and operate on a warped frequency scale approximating the frequency sensitivity of the human auditory system. Experiments are performed within the context of vocoding in statistical parametric speech synthesis. Subjective listening test results show that WWLP-based spectral envelope modeling increases quality over previously developed methods.
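In time-weighted linear prediction, each sample's squared prediction error is multiplied by a weight, typically the short-time energy (STE) of the preceding samples, which emphasizes the closed-phase portion of the glottal cycle and reduces the bias that strong harmonic peaks impose on the estimated formants. The numpy sketch below shows only this time-weighting part, solving the weighted normal equations directly; the frequency-warping stage of WWLP (an all-pass-warped delay line) is omitted, and the STE window length and predictor order are assumptions.

```python
import numpy as np

def ste_weight(x, p, m=20):
    """Short-time-energy weight w(n) = energy of the m preceding samples."""
    w = np.zeros(len(x))
    for n in range(p, len(x)):
        w[n] = np.sum(x[max(0, n - m):n] ** 2) + 1e-9
    return w

def weighted_lp(x, p=18, m=20):
    """Time-weighted LP inverse filter [1, -a_1, ..., -a_p] via weighted normal equations."""
    w = ste_weight(x, p, m)
    n_idx = np.arange(p, len(x))
    # column k of X holds the past sample x(n - k - 1) for every prediction index n
    X = np.stack([x[n_idx - k] for k in range(1, p + 1)], axis=1)
    y = x[n_idx]
    Wn = w[n_idx]
    C = (X * Wn[:, None]).T @ X          # weighted covariance matrix
    c = (X * Wn[:, None]).T @ y          # weighted cross-correlation vector
    a = np.linalg.solve(C, c)
    return np.concatenate(([1.0], -a))

# usage: A = weighted_lp(speech_frame); the spectral envelope is 1 / |FFT(A)|
```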


international conference on acoustics, speech, and signal processing | 2014

A comparative evaluation of vocoding techniques for HMM-based laughter synthesis

Bajibabu Bollepalli; Jérôme Urbain; Tuomo Raitio; Joakim Gustafson; Hüseyin Çakmak

This paper presents an experimental comparison of various leading vocoders for the application of HMM-based laughter synthesis. Four vocoders, commonly used in HMM-based speech synthesis, are used in copy-synthesis and HMM-based synthesis of both male and female laughter. Subjective evaluations are conducted to assess the performance of the vocoders. The results show that all vocoders perform relatively well in copy-synthesis. In HMM-based laughter synthesis using original phonetic transcriptions, all synthesized laughter voices were significantly lower in quality than in copy-synthesis, indicating a challenging task with room for improvement. Interestingly, the two vocoders using rather simple and robust excitation modeling performed best, indicating that robustness in speech parameter extraction and a simple parameter representation in statistical modeling are key factors in successful laughter synthesis.


international conference on acoustics, speech, and signal processing | 2017

Lombard speech synthesis using long short-term memory recurrent neural networks

Bajibabu Bollepalli; Manu Airaksinen; Paavo Alku

In statistical parametric speech synthesis (SPSS), a few studies have investigated the Lombard effect, specifically using hidden Markov model (HMM)-based systems. Recently, artificial neural networks have demonstrated promising results in SPSS, specifically long short-term memory recurrent neural networks (LSTMs). The Lombard effect, however, has not been studied in LSTM-based speech synthesis systems. In this study, we propose three methods for Lombard speech adaptation in LSTM-based speech synthesis. In particular, we (1) augment the linguistic input features with Lombard-specific information, (2) scale the hidden activations using the learning hidden unit contributions (LHUC) method, and (3) fine-tune the LSTMs trained on normal speech with a small amount of Lombard speech data. To investigate the effectiveness of the proposed methods, we carry out experiments using small (10 utterances) and large (500 utterances) Lombard speech data sets. Experimental results confirm the adaptability of the LSTMs, and similarity tests show that the LSTMs achieve significantly better adaptation performance than the HMMs in both the small and large data conditions.
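Of the three adaptation methods, LHUC is the most compact: the pre-trained network weights stay fixed and a learnable per-unit scaler is inserted on the hidden activations, squashed through 2*sigmoid so each unit's contribution can be attenuated or amplified within (0, 2), and only these scalers are updated on the small Lombard data set. The PyTorch sketch below illustrates the idea; the layer sizes and the exact point where the scaling is applied are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class LHUCLayer(nn.Module):
    """Per-unit rescaling of hidden activations; the only part trained at adaptation time."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.r = nn.Parameter(torch.zeros(hidden_dim))   # 2*sigmoid(0) = 1: identity at start

    def forward(self, h):
        return h * (2.0 * torch.sigmoid(self.r))

class LombardLSTM(nn.Module):
    def __init__(self, in_dim=300, hidden_dim=256, out_dim=187):   # assumed dimensions
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden_dim, batch_first=True)
        self.lhuc = LHUCLayer(hidden_dim)
        self.out = nn.Linear(hidden_dim, out_dim)

    def forward(self, x):                 # x: (batch, time, in_dim) linguistic features
        h, _ = self.lstm(x)
        return self.out(self.lhuc(h))

model = LombardLSTM()
# adaptation: freeze everything except the LHUC scalers
for p in model.parameters():
    p.requires_grad = False
for p in model.lhuc.parameters():
    p.requires_grad = True
adapt_opt = torch.optim.Adam(model.lhuc.parameters(), lr=1e-3)
```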

Collaboration


Dive into Bajibabu Bollepalli's collaborations.

Top Co-Authors

Jonas Beskow

Royal Institute of Technology


Catharine Oertel

Royal Institute of Technology


Kalin Stefanov

Royal Institute of Technology


Martin Johansson

Royal Institute of Technology


Samer Al Moubayed

Royal Institute of Technology
