Srikanth Ronanki
University of Edinburgh
Publications
Featured research published by Srikanth Ronanki.
International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2016
Gustav Eje Henter; Srikanth Ronanki; Oliver Watts; Mirjam Wester; Zhizheng Wu; Simon King
Accurate modelling and prediction of speech-sound durations is an important component in generating more natural synthetic speech. Deep neural networks (DNNs) offer a powerful modelling paradigm, and large, found corpora of natural and expressive speech are easy to acquire for training them. Unfortunately, found datasets are seldom subject to the quality control that traditional synthesis methods expect. Common issues likely to affect duration modelling include transcription errors, reductions, filled pauses, and forced-alignment inaccuracies. To combat this, we propose to improve modelling and prediction of speech durations using methods from robust statistics, which are able to disregard ill-fitting points in the training material. We describe a robust fitting criterion based on the density power divergence (the β-divergence) and a robust generation heuristic using mixture density networks (MDNs). Perceptual tests indicate that subjects prefer synthetic speech generated using robust models of duration over the baselines.
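For context, the density power divergence of Basu et al. (1998), to which the abstract's β-divergence refers, has the standard form below; that the paper uses exactly this parameterisation is an assumption of this sketch.

```latex
% Density power divergence between true density g and model f_\theta
% (standard form from the robust-statistics literature; the paper's
% exact parameterisation is not shown in the abstract).
\[
d_\beta(g, f_\theta) = \int \Big\{ f_\theta(x)^{1+\beta}
  - \Big(1 + \tfrac{1}{\beta}\Big)\, g(x)\, f_\theta(x)^{\beta}
  + \tfrac{1}{\beta}\, g(x)^{1+\beta} \Big\}\, \mathrm{d}x,
\qquad \beta > 0 .
\]
% Dropping the term that does not involve f_\theta and replacing the
% expectation over g with a sample average gives a robust fitting
% criterion to be minimised:
\[
H_n(\theta) = \int f_\theta(x)^{1+\beta}\, \mathrm{d}x
  - \Big(1 + \tfrac{1}{\beta}\Big) \frac{1}{n}
    \sum_{i=1}^{n} f_\theta(x_i)^{\beta} .
\]
% As \beta \to 0 this recovers maximum likelihood, while \beta > 0
% down-weights points where the model assigns low density (outliers).
```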
Spoken Language Technology Workshop (SLT) | 2016
Srikanth Ronanki; Oliver Watts; Simon King; Gustav Eje Henter
This paper proposes a new approach to duration modelling for statistical parametric speech synthesis in which a recurrent statistical model is trained to output a phone transition probability at each timestep (acoustic frame). Unlike conventional approaches to duration modelling - which assume that duration distributions have a particular form (e.g., a Gaussian) and use the mean of that distribution for synthesis - our approach can in principle model any distribution supported on the non-negative integers. Generation from this model can be performed in many ways; here we consider output generation based on the median predicted duration. The median is more typical (more probable) than the conventional mean duration, is robust to training-data irregularities, and enables incremental generation. Furthermore, a frame-level approach to duration prediction is consistent with a longer-term goal of modelling durations and acoustic features together. Results indicate that the proposed method is competitive with baseline approaches in approximating the median duration of held-out natural speech.
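A minimal sketch of one way to generate from such a model (function and variable names are illustrative, not the paper's code): treating the network's per-frame transition probability as the hazard of the phone ending at that frame, accumulate the implied duration distribution and read off the median, the smallest duration at which the cumulative probability reaches 0.5.

```python
def median_duration(transition_probs):
    """Median phone duration (in frames) implied by per-frame
    transition probabilities p_t = P(phone ends at frame t | still going).

    The implied duration pmf is P(d = t) = p_t * prod_{s < t} (1 - p_s);
    the median is the smallest t whose CDF reaches 0.5.
    """
    survival = 1.0  # probability the phone has not yet ended
    cdf = 0.0
    for t, p in enumerate(transition_probs, start=1):
        cdf += survival * p       # mass of ending exactly at frame t
        survival *= 1.0 - p
        if cdf >= 0.5:
            return t
    return len(transition_probs)  # CDF never reached 0.5 within the horizon


# Example: a constant 10% transition probability per frame.
print(median_duration([0.1] * 50))  # -> 7
```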
Conference of the International Speech Communication Association (Interspeech) | 2016
Srikanth Ronanki; Gustav Eje Henter; Zhizheng Wu; Simon King
The absence of convincing intonation makes current parametric speech synthesis systems sound dull and lifeless, even when trained on expressive speech data. Typically, these systems use regression techniques to predict the fundamental frequency (F0) frame-by-frame. This approach leads to overly-smooth pitch contours and fails to construct an appropriate prosodic structure across the full utterance. In order to capture and reproduce larger-scale pitch patterns, this paper proposes a template-based approach for automatic F0 generation, where per-syllable pitch-contour templates (from a small, automatically learned set) are predicted by a recurrent neural network (RNN). The use of syllable templates mitigates the over-smoothing problem and is able to reproduce pitch patterns observed in the data. The use of an RNN, paired with connectionist temporal classification (CTC), enables the prediction of structure in the pitch contour spanning the entire utterance. This novel F0 prediction system is used alongside separate LSTMs for predicting phone durations and the other acoustic features, to construct a complete text-to-speech system. We report the results of objective and subjective tests on an expressive speech corpus of children’s audiobooks, and include comparisons to a conventional baseline that predicts F0 directly at the frame level.
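One plausible way to obtain a small, automatically learned template set is to cluster length-normalised per-syllable F0 contours. The sketch below (assumed names and preprocessing, using scikit-learn's KMeans) illustrates that idea; the paper's actual procedure may differ.

```python
import numpy as np
from sklearn.cluster import KMeans


def learn_templates(syllable_f0, n_templates=8, n_points=10):
    """Cluster per-syllable F0 contours into a small template set.

    syllable_f0: list of 1-D arrays, one interpolated F0 track per
    syllable. Each contour is resampled to a fixed length and mean-
    normalised before clustering, so templates capture shape, not level.
    """
    X = []
    for f0 in syllable_f0:
        # Resample to a fixed number of points per syllable.
        resampled = np.interp(np.linspace(0, len(f0) - 1, n_points),
                              np.arange(len(f0)), f0)
        X.append(resampled - resampled.mean())  # remove per-syllable level
    X = np.vstack(X)
    km = KMeans(n_clusters=n_templates, n_init=10, random_state=0).fit(X)
    return km.cluster_centers_  # (n_templates, n_points) template shapes
```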
arXiv: Computation and Language | 2016
Srikanth Ronanki; Siva Reddy Gangireddy; Bajibabu Bollepalli; Simon King
Text-to-Speech synthesis in Indian languages has seen a lot of progress over the past decade, partly due to the annual Blizzard Challenges. These systems assume the text to be written in Devanagari or Dravidian scripts, which are nearly phonemic orthographies. However, the most common form of computer interaction among Indians is transliterated text written in ASCII. Such text is generally noisy, with many spelling variations for the same word. In this paper we evaluate three approaches to synthesize speech from such noisy ASCII text: a naive Uni-Grapheme approach, a Multi-Grapheme approach, and a supervised Grapheme-to-Phoneme (G2P) approach. These methods first convert the ASCII text to a phonetic script and then train a Deep Neural Network to synthesize speech from it. We train and test our models on Blizzard Challenge datasets that were transliterated to ASCII using crowdsourcing. Our experiments on Hindi, Tamil and Telugu demonstrate that our models generate speech of competitive quality from ASCII text compared to speech synthesized from the native scripts. All the accompanying transliterated datasets are released for public access.
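To make the contrast between the first two approaches concrete, here is a toy sketch; the digraph inventory is invented for illustration, whereas the paper's units are derived from data. The uni-grapheme approach treats every ASCII character as a unit, while the multi-grapheme approach greedily groups common two-letter sequences before mapping to a phonetic script.

```python
# Assumed example digraphs; not the inventory used in the paper.
DIGRAPHS = {"aa", "ee", "th", "dh", "sh", "ch"}


def uni_graphemes(word):
    """Uni-Grapheme: every ASCII character is its own unit."""
    return list(word)


def multi_graphemes(word):
    """Multi-Grapheme: greedily merge known digraphs into single units."""
    units, i = [], 0
    while i < len(word):
        if word[i:i + 2] in DIGRAPHS:
            units.append(word[i:i + 2])
            i += 2
        else:
            units.append(word[i])
            i += 1
    return units


print(uni_graphemes("namaste"))   # ['n', 'a', 'm', 'a', 's', 't', 'e']
print(multi_graphemes("naamam"))  # ['n', 'aa', 'm', 'a', 'm']
```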
Journal of the Acoustical Society of America | 2016
Gustav Eje Henter; Srikanth Ronanki; Oliver Watts; Mirjam Wester; Zhizheng Wu; Simon King
Accurate modeling and prediction of speech-sound durations is important for generating more natural synthetic speech. Deep neural networks (DNNs) offer powerful models, and large, found corpora of natural speech are easily acquired for training them. Unfortunately, poor quality control (e.g., transcription errors) and phenomena such as reductions and filled pauses complicate duration modeling from found speech data. To mitigate issues caused by these idiosyncrasies, we propose to incorporate methods from robust statistics into speech synthesis. Robust methods can disregard ill-fitting training-data points—errors or other outliers—to describe the typical case better. For instance, parameter estimation can be made robust by replacing maximum likelihood with a robust estimation criterion based on the density power divergence (a.k.a. the β-divergence). Alternatively, a standard approximation for output generation with mixture density networks (MDNs) can be interpreted as a robust output generation heuristic....
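A minimal sketch of the MDN generation heuristic mentioned above, under the common interpretation (going back to Bishop's MDN report) that one outputs the mean of the most probable mixture component rather than the full mixture mean; names and data are illustrative, not from the paper.

```python
import numpy as np


def mdn_generate(weights, means):
    """weights: (K,) mixture weights; means: (K,) component means."""
    mixture_mean = np.dot(weights, means)    # pulled around by outlier components
    robust_mean = means[np.argmax(weights)]  # mean of most probable component
    return mixture_mean, robust_mean


w = np.array([0.7, 0.2, 0.1])
mu = np.array([10.0, 12.0, 45.0])  # third component absorbs outliers
mix_mean, robust_mean = mdn_generate(w, mu)
print(mix_mean, robust_mean)  # 13.9 10.0: the outlier component drags the mixture mean
```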
Speech Synthesis Workshop (SSW) | 2016
Srikanth Ronanki; Zhizheng Wu; Oliver Watts; Simon King
Conference of the International Speech Communication Association (Interspeech) | 2018
Zack Hodari; Oliver Watts; Srikanth Ronanki; Simon King
Conference of the International Speech Communication Association (Interspeech) | 2017
Srikanth Ronanki; Oliver Watts; Simon King
Archive | 2017
Srikanth Ronanki
Archive | 2016
Oliver Watts; Mirjam Wester; Simon King; Srikanth Ronanki; Zhizheng Wu; Gustav Eje Henter