Publications


Featured research published by Katherine Mary Knill.


IEEE Transactions on Audio, Speech, and Language Processing | 2012

Statistical Parametric Speech Synthesis Based on Speaker and Language Factorization

Heiga Zen; Norbert Braunschweiler; Sabine Buchholz; Mark J. F. Gales; Katherine Mary Knill; Sacha Krstulovic; Javier Latorre

An increasingly common scenario in building speech synthesis and recognition systems is training on inhomogeneous data. This paper proposes a new framework for estimating hidden Markov models on data containing both multiple speakers and multiple languages. The proposed framework, speaker and language factorization, attempts to factorize speaker-/language-specific characteristics in the data and then model them using separate transforms. Language-specific factors in the data are represented by transforms based on cluster mean interpolation with cluster-dependent decision trees. Acoustic variations caused by speaker characteristics are handled by transforms based on constrained maximum-likelihood linear regression. Experimental results on statistical parametric speech synthesis show that the proposed framework enables data from multiple speakers in different languages to be used to: train a synthesis system; synthesize speech in a language using speaker characteristics estimated in a different language; and adapt to a new language.
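
The factorization the abstract describes can be sketched schematically (assumed notation, not necessarily the paper's exact equations): language-specific variation enters through interpolated cluster means, while speaker-specific variation enters through a constrained linear transform of the observations.

```latex
% Language factor: state mean as an interpolation of cluster means,
% with language-dependent weights \lambda^{(l)} (assumed notation).
\mu_m^{(l)} = \sum_{c=1}^{C} \lambda_c^{(l)} \, \mu_{m,c}

% Speaker factor: constrained MLLR transform of the observation o_t,
% shared across languages for speaker s.
\hat{o}_t^{(s)} = A^{(s)} o_t + b^{(s)}
```

Because the two transform sets act on different parts of the model, a speaker transform estimated in one language can, in principle, be combined with the cluster weights of another, which is what enables the cross-lingual speaker transfer the abstract reports.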


International Conference on Acoustics, Speech, and Signal Processing | 2011

Continuous F0 in the source-excitation generation for HMM-based TTS: Do we need voiced/unvoiced classification?

Javier Latorre; Mark J. F. Gales; Sabine Buchholz; Katherine Mary Knill; Masatsune Tamura; Yamato Ohtani; Masami Akamine

Most HMM-based TTS systems use a hard voiced/unvoiced classification to produce a discontinuous F0 signal, which is then used to generate the source excitation. When a mixed source excitation is used, this decision can be based on two different sources of information: the state-specific MSD prior of the F0 models, and/or the frame-specific features generated by the aperiodicity model. This paper examines the meaning of these variables in the synthesis process, their interaction, and how they affect the perceived quality of the generated speech. The results of several perceptual experiments show that when using mixed excitation, subjects consistently prefer samples with very few or no false unvoiced errors, whereas a reduction in the rate of false voiced errors does not produce any perceptual improvement. This suggests that rather than using any form of hard voiced/unvoiced classification, e.g., the MSD prior, it is better for synthesis to use a continuous F0 signal and rely on the frame-level soft voiced/unvoiced decision of the aperiodicity model.
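
As an illustration of the frame-level soft decision the authors argue for, the sketch below blends a pulse train driven by a continuous F0 track with noise, weighted by a per-frame periodicity value, instead of switching excitation sources on a hard V/UV flag. The function name, frame length, and the inputs f0 and w are hypothetical; this is not the paper's implementation.

```python
import numpy as np

def mixed_excitation(f0, w, frame_len=80, fs=16000):
    """Per-frame soft mix of pulse train and noise; no hard V/UV flag.

    f0 : continuous per-frame F0 track in Hz (never zeroed out)
    w  : per-frame periodicity weight in [0, 1] from an aperiodicity model
    """
    rng = np.random.default_rng(0)
    out, phase = [], 0.0
    for f, wt in zip(f0, w):
        frame = np.zeros(frame_len)
        for n in range(frame_len):
            phase += f / fs          # advance pitch phase continuously
            if phase >= 1.0:         # one pulse per pitch period
                phase -= 1.0
                frame[n] = 1.0
        noise = rng.standard_normal(frame_len) / np.sqrt(frame_len)
        out.append(wt * frame + (1.0 - wt) * noise)  # soft blend per frame
    return np.concatenate(out)
```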


International Conference on Acoustics, Speech, and Signal Processing | 2011

Rapid joint speaker and noise compensation for robust speech recognition

K. K. Chin; Haitian Xu; Mark J. F. Gales; Catherine Breslin; Katherine Mary Knill

For speech recognition, mismatches between training and testing conditions for speaker and noise are normally handled separately. The work presented in this paper jointly applies speaker adaptation and model-based noise compensation by embedding speaker adaptation in the noise mismatch function. The proposed method yields faster and better adaptation than compensating for the two factors separately, and is more consistent with the basic assumptions of speaker and noise adaptation. Experimental results show significant and consistent gains from the proposed method.
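
The abstract does not give the mismatch function, but a standard VTS-style form with the speaker transform embedded inside it (an assumed formulation, using common notation) would look like:

```latex
% Cepstral-domain mismatch function: clean speech x, additive noise n,
% convolutional channel h, DCT matrix C. The speaker transform
% (A_s, b_s) is applied to the clean speech before noise compensation,
% so both factors are handled jointly (assumed form).
x_s = A_s x + b_s
y   = x_s + h + C \log\!\left( 1 + e^{\,C^{-1} (n - x_s - h)} \right)
```

Embedding the speaker transform inside the mismatch function means the noise compensation sees speaker-adapted clean speech, which is the consistency property the abstract refers to.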


IEEE Journal of Selected Topics in Signal Processing | 2014

Integrated Expression Prediction and Speech Synthesis From Text

Langzhou Chen; Mark J. F. Gales; Norbert Braunschweiler; Masami Akamine; Katherine Mary Knill

Generating expressive, natural-sounding speech from text using a text-to-speech (TTS) system is a highly challenging problem. However, for tasks such as audiobooks, it is essential if such systems are to see widespread use. Generating expressive speech from text can be divided into two parts: predicting expressive information from text; and synthesizing the speech with a particular expression. Traditionally these components have been studied separately. This paper proposes an integrated approach, where the training data and representation of expressive synthesis are shared across the two components. This scheme has several advantages, including: robust handling of automatically generated expressive labels; support for a continuous representation of expressions; and joint training of the expression predictor and speech synthesizer. Synthesis experiments indicated that the proposed approach produced far more expressive speech than both a neutral TTS and one where the expression was randomly selected. The experimental results also show the advantage of a continuous expressive synthesis space over a discrete one.
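
A minimal sketch of the shared-representation pipeline the abstract describes: a predictor maps text features to a point in a continuous expression space, and the synthesizer conditions on that same point. All names, and the linear form of the predictor, are hypothetical.

```python
import numpy as np

class ExpressionPredictor:
    """Maps text features to a continuous expression vector, e.g.
    interpolation weights over expression clusters (assumed form)."""

    def __init__(self, W):
        self.W = W                       # learned projection (hypothetical)

    def predict(self, text_feats):
        return self.W @ text_feats       # a point in expression space,
                                         # not a discrete category

def synthesize(text_feats, predictor, acoustic_model):
    expr = predictor.predict(text_feats)     # shared expression space
    return acoustic_model(text_feats, expr)  # synthesis conditioned on it
```

Because the predictor outputs a point in the same continuous space the synthesizer is trained on, noisy automatic expression labels degrade gracefully instead of forcing a wrong hard category.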


International Conference on Acoustics, Speech, and Signal Processing | 2013

Integrated automatic expression prediction and speech synthesis from text

Langzhou Chen; Mark J. F. Gales; Norbert Braunschweiler; Masami Akamine; Katherine Mary Knill

Getting a text-to-speech (TTS) system to narrate lively, animated stories the way a human does is very difficult. To generate expressive speech, the system can be divided into two parts: predicting expressive information from text; and synthesizing the speech with a particular expression. Traditionally these blocks have been studied separately. This paper proposes an integrated approach, sharing the expressive synthesis space and training data across the two expressive components. This approach has several advantages, including a simplified expression labelling process, support for a continuous expressive synthesis space, and joint training of the expression predictor and speech synthesiser to maximise the likelihood of the TTS system given the training data. Synthesis experiments indicated that the proposed approach generated far more expressive speech than both a neutral TTS and one where the expression was randomly selected. The experimental results also showed the advantage of a continuous expressive synthesis space over a discrete one.
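
One way to write the joint maximum-likelihood objective the abstract mentions (an assumed formulation): with acoustic training data O, text W, expression predictor f_lambda, and synthesizer parameters theta, both components are optimized together:

```latex
% Joint training: the expression predictor f_\lambda(W) feeds the
% synthesizer, and both parameter sets are optimized against the
% likelihood of the acoustic training data (assumed form).
(\hat{\theta}, \hat{\lambda})
  = \arg\max_{\theta,\,\lambda} \; p\bigl( O \mid W, f_{\lambda}(W);\, \theta \bigr)
```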


International Conference on Acoustics, Speech, and Signal Processing | 2013

Training a supra-segmental parametric F0 model without interpolating F0

Javier Latorre; Mark J. F. Gales; Katherine Mary Knill; Masami Akamine

Combining multiple intonation models at different linguistic levels is an effective way to improve the naturalness of the predicted F0. In many of these approaches, the intonation models for suprasegmental levels are based on a parametrization of the log-F0 contours over the units of that level. However, many of these parametrizations are not stable when applied to discontinuous signals, so the F0 signal has to be interpolated, and the interpolated values introduce a distortion in the coefficients that degrades the quality of the model. This paper proposes two methods that eliminate the need for such interpolation, one based on regularization and the other on factor analysis. Subjective evaluations show that, for a discrete cosine transform (DCT) syllable-level model, both approaches result in a significant improvement with respect to a baseline using interpolated F0, with the regularization-based approach yielding the best results.
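
A sketch of the regularization idea: fit syllable-level DCT coefficients using only the frames where F0 is observed (voiced), with an L2 penalty keeping the fit well-posed when voiced frames are sparse, so no interpolated values ever enter the estimation. The function name, basis form, and the alpha default are assumptions, not the paper's exact formulation.

```python
import numpy as np

def fit_dct_voiced_only(logf0, voiced, n_coef=6, alpha=1e-2):
    """Fit syllable-level DCT coefficients from voiced frames only.

    logf0  : per-frame log-F0 over one syllable (unvoiced values ignored)
    voiced : boolean mask marking the frames where F0 is observed
    alpha  : L2 regularization weight keeping the fit well-posed
    """
    logf0 = np.asarray(logf0, float)
    voiced = np.asarray(voiced, bool)
    T = len(logf0)
    t = (np.arange(T) + 0.5) / T
    D = np.cos(np.pi * np.outer(t, np.arange(n_coef)))  # DCT-II-style basis
    Dv, fv = D[voiced], logf0[voiced]        # drop unvoiced rows: nothing
                                             # is ever interpolated
    A = Dv.T @ Dv + alpha * np.eye(n_coef)   # regularized normal equations
    return np.linalg.solve(A, Dv.T @ fv)
```

The factor-analysis variant the abstract also mentions would instead treat unvoiced frames as missing data in a probabilistic model; it is not sketched here.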


Archive | 2013

Text to speech method and system

Javier Latorre-Martinez; Vincent Wan; K. K. Chin; Mark J. F. Gales; Katherine Mary Knill; Masami Akamine; Byung Ha Chung


Archive | 2013

Text to speech system

Javier Latorre-Martinez; Vincent Wan; K. K. Chin; Mark J. F. Gales; Katherine Mary Knill; Masami Akamine


Conference of the International Speech Communication Association | 2006

Comparison of the ITU-T P.85 standard to other methods for the evaluation of text-to-speech systems

Dmitry Sityaev; Katherine Mary Knill; Tina Burrows


Archive | 2013

Speech processing system

Langzhou Chen; Mark J. F. Gales; Katherine Mary Knill; Masami Akamine
