
Publications


Featured research published by Gustav Eje Henter.


international conference on acoustics, speech, and signal processing | 2016

Robust TTS duration modelling using DNNs

Gustav Eje Henter; Srikanth Ronanki; Oliver Watts; Mirjam Wester; Zhizheng Wu; Simon King

Accurate modelling and prediction of speech-sound durations is an important component in generating more natural synthetic speech. Deep neural networks (DNNs) offer a powerful modelling paradigm, and large, found corpora of natural and expressive speech are easy to acquire for training them. Unfortunately, found datasets are seldom subject to the quality-control that traditional synthesis methods expect. Common issues likely to affect duration modelling include transcription errors, reductions, filled pauses, and forced-alignment inaccuracies. To combat this, we propose to improve modelling and prediction of speech durations using methods from robust statistics, which are able to disregard ill-fitting points in the training material. We describe a robust fitting criterion based on the density power divergence (the β-divergence) and a robust generation heuristic using mixture density networks (MDNs). Perceptual tests indicate that subjects prefer synthetic speech generated using robust models of duration over the baselines.
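
As a rough illustration of the robust fitting idea, the sketch below evaluates the density power divergence (β-divergence) criterion for a simple univariate Gaussian duration model. The Gaussian choice, parameter values, and function names are assumptions made for illustration, not details from the paper, which applies the criterion to DNN and MDN duration models.

import numpy as np
from scipy.stats import norm

def beta_divergence_loss(mu, sigma, durations, beta=0.1):
    """Density power divergence (beta-divergence) criterion for a univariate
    Gaussian duration model.  As beta -> 0 it approaches the negative
    log-likelihood; beta > 0 down-weights ill-fitting points such as
    misaligned or mistranscribed segments."""
    p = norm.pdf(durations, loc=mu, scale=sigma)
    # Empirical term: rewards probability mass placed on the observed durations.
    data_term = -(1.0 + 1.0 / beta) * np.mean(p ** beta)
    # Model term: integral of p_theta^(1 + beta), analytic for a Gaussian.
    integral_term = (2.0 * np.pi * sigma ** 2) ** (-beta / 2.0) / np.sqrt(1.0 + beta)
    return data_term + integral_term

# Toy usage: frame-count durations with a few gross outliers mixed in.
rng = np.random.default_rng(0)
durations = np.concatenate([rng.normal(20.0, 3.0, 500), [200.0, 250.0, 300.0]])
print(beta_divergence_loss(20.0, 3.0, durations, beta=0.1))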


spoken language technology workshop | 2016

Median-based generation of synthetic speech durations using a non-parametric approach

Srikanth Ronanki; Oliver Watts; Simon King; Gustav Eje Henter

This paper proposes a new approach to duration modelling for statistical parametric speech synthesis in which a recurrent statistical model is trained to output a phone transition probability at each timestep (acoustic frame). Unlike conventional approaches to duration modelling - which assume that duration distributions have a particular form (e.g., a Gaussian) and use the mean of that distribution for synthesis - our approach can in principle model any distribution supported on the non-negative integers. Generation from this model can be performed in many ways; here we consider output generation based on the median predicted duration. The median is more typical (more probable) than the conventional mean duration, is robust to training-data irregularities, and enables incremental generation. Furthermore, a frame-level approach to duration prediction is consistent with a longer-term goal of modelling durations and acoustic features together. Results indicate that the proposed method is competitive with baseline approaches in approximating the median duration of held-out natural speech.
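
A minimal sketch of the generation step, under assumed notation: given per-frame probabilities of leaving the current phone, the implied duration distribution can be accumulated frame by frame and the median read off as the first frame index whose cumulative probability reaches one half. The conversion below is illustrative; in the paper a recurrent model supplies the per-frame transition probabilities.

import numpy as np

def median_duration(transition_probs):
    """Return the median of the duration distribution implied by per-frame
    phone-transition probabilities p_t, where
    P(D = d) = p_d * prod_{t < d} (1 - p_t)."""
    survival = 1.0     # probability the phone has not yet ended
    cumulative = 0.0   # running P(D <= d)
    for d, p in enumerate(transition_probs, start=1):
        cumulative += survival * p
        if cumulative >= 0.5:
            return d
        survival *= 1.0 - p
    return len(transition_probs)  # fall back to the horizon if mass remains

# Toy usage: the transition becomes steadily more likely as the phone ages.
print(median_duration(np.linspace(0.02, 0.6, 40)))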


international conference on acoustics, speech, and signal processing | 2016

Testing the consistency assumption: Pronunciation variant forced alignment in read and spontaneous speech synthesis

Rasmus Dall; Sandrine Brognaux; Korin Richmond; Cassia Valentini-Botinhao; Gustav Eje Henter; Julia Hirschberg; Junichi Yamagishi; Simon King

Forced alignment for speech synthesis traditionally aligns a phoneme sequence predetermined by the front-end text processing system. This sequence is not altered during alignment, i.e., it is forced, despite possibly being faulty. The consistency assumption is the assumption that these mistakes do not degrade models, as long as the mistakes are consistent across training and synthesis. We present evidence that in the alignment of both standard read prompts and spontaneous speech this phoneme sequence is often wrong, and that this is likely to have a negative impact on acoustic models. A lattice-based forced alignment system allowing for pronunciation variation is implemented, resulting in improved phoneme identity accuracy for both types of speech. A perceptual evaluation of HMM-based voices showed that spontaneous models trained on this improved alignment also improved standard synthesis, despite breaking the consistency assumption.
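
The variant-selection idea can be caricatured as follows; the string-similarity scoring and the pick_pronunciation helper are purely illustrative stand-ins, since the paper's lattice-based system instead lets the aligner's acoustic likelihoods choose among variants.

from difflib import SequenceMatcher

def pick_pronunciation(word_variants, recognised_phones):
    """Keep the dictionary pronunciation variant that best matches an
    (assumed) bottom-up phone recognition of the audio, rather than forcing
    the front-end's single predetermined phone sequence."""
    def similarity(variant):
        return SequenceMatcher(None, variant, recognised_phones).ratio()
    return max(word_variants, key=similarity)

# Toy usage: full and reduced pronunciations of "interest".
variants = [["ih", "n", "t", "ax", "r", "ax", "s", "t"],
            ["ih", "n", "t", "r", "ax", "s", "t"]]
print(pick_pronunciation(variants, ["ih", "n", "t", "r", "ih", "s", "t"]))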


conference of the international speech communication association | 2016

A Template-Based Approach for Speech Synthesis Intonation Generation Using LSTMs.

Srikanth Ronanki; Gustav Eje Henter; Zhizheng Wu; Simon King

The absence of convincing intonation makes current parametric speech synthesis systems sound dull and lifeless, even when trained on expressive speech data. Typically, these systems use regression techniques to predict the fundamental frequency (F0) frame-by-frame. This approach leads to overly-smooth pitch contours and fails to construct an appropriate prosodic structure across the full utterance. In order to capture and reproduce larger-scale pitch patterns, this paper proposes a template-based approach for automatic F0 generation, where per-syllable pitch-contour templates (from a small, automatically learned set) are predicted by a recurrent neural network (RNN). The use of syllable templates mitigates the over-smoothing problem and is able to reproduce pitch patterns observed in the data. The use of an RNN, paired with connectionist temporal classification (CTC), enables the prediction of structure in the pitch contour spanning the entire utterance. This novel F0 prediction system is used alongside separate LSTMs for predicting phone durations and the other acoustic features, to construct a complete text-to-speech system. We report the results of objective and subjective tests on an expressive speech corpus of children’s audiobooks, and include comparisons to a conventional baseline that predicts F0 directly at the frame level.
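
To make the template mechanism concrete, here is a hedged sketch in which per-syllable pitch contours are length-normalised and clustered into a small template set, and an utterance contour is then rebuilt by stretching chosen templates to the syllable durations. The k-means clustering, function names, and template count are illustrative assumptions; in the paper the template sequence is predicted by an RNN trained with CTC.

import numpy as np
from sklearn.cluster import KMeans

def learn_templates(syllable_contours, n_templates=8, length=20):
    """Cluster length-normalised per-syllable pitch contours into a small
    set of templates (k-means is an illustrative choice)."""
    resampled = np.stack([
        np.interp(np.linspace(0, 1, length), np.linspace(0, 1, len(c)), c)
        for c in syllable_contours
    ])
    return KMeans(n_clusters=n_templates, n_init=10,
                  random_state=0).fit(resampled).cluster_centers_

def render_f0(template_ids, syllable_lengths, templates):
    """Concatenate the chosen templates, stretched to each syllable's
    duration in frames, into an utterance-level F0 contour."""
    pieces = [
        np.interp(np.linspace(0, 1, n_frames),
                  np.linspace(0, 1, len(templates[tid])), templates[tid])
        for tid, n_frames in zip(template_ids, syllable_lengths)
    ]
    return np.concatenate(pieces)

# Toy usage: cluster 100 random contours, then render a 3-syllable utterance.
rng = np.random.default_rng(0)
contours = [rng.standard_normal(rng.integers(10, 30)) for _ in range(100)]
templates = learn_templates(contours)
print(render_f0([2, 0, 5], [25, 18, 30], templates).shape)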


IEEE Transactions on Pattern Analysis and Machine Intelligence | 2016

Minimum Entropy Rate Simplification of Stochastic Processes

Gustav Eje Henter; W. Bastiaan Kleijn

We propose minimum entropy rate simplification (MERS), an information-theoretic, parameterization-independent framework for simplifying generative models of stochastic processes. Applications include improving model quality for sampling tasks by concentrating the probability mass on the most characteristic and accurately described behaviors while de-emphasizing the tails, and obtaining clean models from corrupted data (nonparametric denoising). This is the opposite of the smoothing step commonly applied to classification models. Drawing on rate-distortion theory, MERS seeks the minimum entropy-rate process under a constraint on the dissimilarity between the original and simplified processes. We particularly investigate the Kullback-Leibler divergence rate as a dissimilarity measure, where, compatible with our assumption that the starting model is disturbed or inaccurate, the simplification rather than the starting model is used for the reference distribution of the divergence. This leads to analytic solutions for stationary and ergodic Gaussian processes and Markov chains. The same formulas are also valid for maximum-entropy smoothing under the same divergence constraint. In experiments, MERS successfully simplifies and denoises models from audio, text, speech, and meteorology.
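
In symbols (with notation assumed here rather than copied from the paper), the MERS problem described above can be written as a constrained optimisation over candidate simplified processes q of the starting model p, with the simplified process serving as the reference distribution of the divergence rate:

\[
  q^\star \;=\; \arg\min_{q}\; \bar{H}(q)
  \qquad \text{subject to} \qquad
  \bar{D}_{\mathrm{KL}}\!\left(p \,\middle\|\, q\right) \;\le\; D_{\max},
\]

where \bar{H} denotes an entropy rate and \bar{D}_{\mathrm{KL}} a Kullback-Leibler divergence rate; replacing the arg min with an arg max gives the corresponding maximum-entropy smoothing problem under the same constraint.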


IEEE Transactions on Audio, Speech, and Language Processing | 2016

Bayesian analysis of phoneme confusion matrices

Arne Leijon; Gustav Eje Henter; Martin Dahlquist

This paper presents a parametric Bayesian approach to the statistical analysis of phoneme confusion matrices measured for groups of individual listeners in one or more test conditions. Two different bias problems in conventional estimation of mutual information are analyzed and explained theoretically. Evaluations with synthetic datasets indicate that the proposed Bayesian method can give satisfactory estimates of mutual information and response probabilities, even for phoneme confusion tests using a very small number of test items for each phoneme category. The proposed method can reveal overall differences in performance between two test conditions with better power than conventional Wilcoxon significance tests or conventional confidence intervals. The method can also identify sets of confusion-matrix cells that are credibly different between two test conditions, with better power than a similar approximate frequentist method.
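
A minimal sketch of one way such an analysis can be set up, assuming a conjugate Dirichlet-multinomial model for each stimulus phoneme's row of the confusion matrix; the Dirichlet choice, the uniform stimulus distribution, and the function name are illustrative assumptions, not necessarily the paper's parameterisation.

import numpy as np

def posterior_mutual_information(confusion_counts, prior=1.0, n_samples=2000, seed=0):
    """Draw response-probability rows from per-stimulus Dirichlet posteriors
    and return posterior samples of the stimulus-response mutual information
    in bits, assuming equally probable stimuli."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(confusion_counts, dtype=float)
    n_stim = counts.shape[0]
    mi_samples = np.empty(n_samples)
    for s in range(n_samples):
        rows = np.stack([rng.dirichlet(row + prior) for row in counts])
        p_joint = rows / n_stim                      # joint P(stimulus, response)
        p_resp = p_joint.sum(axis=0, keepdims=True)  # marginal P(response)
        mi_samples[s] = np.sum(p_joint * np.log2(rows / p_resp))
    return mi_samples

# Toy usage: 3 phoneme categories, 10 test items each; 95% credible interval.
samples = posterior_mutual_information([[8, 1, 1], [2, 7, 1], [0, 2, 8]])
print(np.percentile(samples, [2.5, 50.0, 97.5]))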


conference of the international speech communication association | 2016

A Hierarchical Predictor of Synthetic Speech Naturalness Using Neural Networks.

Takenori Yoshimura; Gustav Eje Henter; Oliver Watts; Mirjam Wester; Junichi Yamagishi; Keiichi Tokuda

A problem when developing and tuning speech synthesis systems is that there is no well-established method of automatically rating the quality of the synthetic speech. This research attempts to obtain a new automated measure which is trained on the results of large-scale subjective evaluations employing many human listeners, i.e., the Blizzard Challenge. To exploit the data, we experiment with linear regression, feed-forward and convolutional neural network models, and combinations of them to regress from synthetic speech to the perceptual scores obtained from listeners. The biggest improvements were seen when combining stimulus- and system-level predictions.
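
As a rough sketch of the regression setup (the architecture, feature dimensions, and data below are placeholders, not the paper's), per-utterance acoustic features can be regressed onto listener scores with a small feed-forward network, after which stimulus-level predictions are averaged to obtain a system-level score.

import numpy as np
from sklearn.neural_network import MLPRegressor

# Placeholder data: one feature vector per synthetic utterance (stimulus)
# and the mean naturalness score listeners assigned to it.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 40))
scores = rng.uniform(1.0, 5.0, size=200)

model = MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=2000,
                     random_state=0).fit(features, scores)

# Stimulus-level predictions for one system's utterances; averaging them
# gives a system-level prediction, and the two levels can be combined.
stimulus_preds = model.predict(features[:10])
print(stimulus_preds.mean())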


Archive | 2016

Listening test materials for "Evaluating comprehension of natural and synthetic conversational speech"

Oliver Watts; Gustav Eje Henter; Mirjam Wester

Current speech synthesis methods typically operate on isolated sentences and lack convincing prosody when generating longer segments of speech. Similarly, prevailing TTS evaluation paradigms, such as intelligibility (transcription word error rate) or MOS, only score sentences in isolation, even though overall comprehension arguably is more important for speech-based communication. In an effort to develop more ecologically relevant evaluation techniques that go beyond isolated sentences, we investigated comprehension of natural and synthetic speech dialogues. Specifically, we tested listener comprehension on long segments of spontaneous and engaging conversational speech (three 10-minute radio interviews of comedians). Interviews were reproduced either as natural speech, synthesised from carefully prepared transcripts, or synthesised using durations from forced-alignment against the natural speech, all in a balanced design. Comprehension was measured using multiple choice questions. A significant difference was measured between the comprehension/retention of natural speech (74% correct responses) and synthetic speech with forced-aligned durations (61% correct responses). However, no significant difference was observed between natural and regular synthetic speech (70% correct responses). Effective evaluation of comprehension remains elusive.


Journal of the Acoustical Society of America | 2016

Robust text-to-speech duration modelling with a deep neural network

Gustav Eje Henter; Srikanth Ronanki; Oliver Watts; Mirjam Wester; Zhizheng Wu; Simon King

Accurate modeling and prediction of speech-sound durations is important for generating more natural synthetic speech. Deep neural networks (DNNs) offer powerful models, and large, found corpora of natural speech are easily acquired for training them. Unfortunately, poor quality control (e.g., transcription errors) and phenomena such as reductions and filled pauses complicate duration modelling from found speech data. To mitigate issues caused by these idiosyncrasies, we propose to incorporate methods from robust statistics into speech synthesis. Robust methods can disregard ill-fitting training-data points—errors or other outliers—to describe the typical case better. For instance, parameter estimation can be made robust by replacing maximum likelihood with a robust estimation criterion based on the density power divergence (a.k.a. the β-divergence). Alternatively, a standard approximation for output generation with mixture density networks (MDNs) can be interpreted as a robust output generation heuristic....


international conference on acoustics, speech, and signal processing | 2016

From HMMs to DNNs: Where do the improvements come from?

Oliver Watts; Gustav Eje Henter; Thomas Merritt; Zhizheng Wu; Simon King

Collaboration


Dive into Gustav Eje Henter's collaboration.

Top Co-Authors

Simon King (University of Edinburgh)
Oliver Watts (University of Edinburgh)
Zhizheng Wu (University of Edinburgh)
W. Bastiaan Kleijn (Victoria University of Wellington)
Junichi Yamagishi (National Institute of Informatics)
Jaime Lorenzo-Trueba (Technical University of Madrid)