Publication


Featured research published by Norbert Braunschweiler.


IEEE Transactions on Audio, Speech, and Language Processing | 2012

Statistical Parametric Speech Synthesis Based on Speaker and Language Factorization

Heiga Zen; Norbert Braunschweiler; Sabine Buchholz; Mark J. F. Gales; Katherine Mary Knill; Sacha Krstulovic; Javier Latorre

An increasingly common scenario in building speech synthesis and recognition systems is training on inhomogeneous data. This paper proposes a new framework for estimating hidden Markov models on data containing both multiple speakers and multiple languages. The proposed framework, speaker and language factorization, attempts to factorize speaker-/language-specific characteristics in the data and then model them using separate transforms. Language-specific factors in the data are represented by transforms based on cluster mean interpolation with cluster-dependent decision trees. Acoustic variations caused by speaker characteristics are handled by transforms based on constrained maximum-likelihood linear regression. Experimental results on statistical parametric speech synthesis show that the proposed framework enables data from multiple speakers in different languages to be used to: train a synthesis system; synthesize speech in a language using speaker characteristics estimated in a different language; and adapt to a new language.
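The factorization described above combines two kinds of transform: language-specific cluster mean interpolation and speaker-specific constrained MLLR. The following minimal sketch (Python with NumPy) illustrates how such transforms could combine for a single HMM state; all matrices, weights, and dimensions are invented for illustration and this is not the paper's implementation.

# Minimal sketch (not the paper's implementation) of how the two transform
# types in speaker-and-language factorization might combine for one HMM state.
# All names and values below are illustrative assumptions.
import numpy as np

def language_dependent_mean(cluster_means, interp_weights):
    """Language factor: interpolate cluster mean vectors with language weights."""
    # cluster_means: (n_clusters, dim), interp_weights: (n_clusters,)
    return interp_weights @ cluster_means

def apply_cmllr(observation, A, b):
    """Speaker factor: constrained MLLR as an affine transform of the observation."""
    return A @ observation + b

# Toy example: 3 clusters, 5-dimensional features.
rng = np.random.default_rng(0)
means = rng.normal(size=(3, 5))
lang_weights = np.array([0.6, 0.3, 0.1])   # estimated per language in practice
A, b = np.eye(5) * 0.9, np.full(5, 0.1)    # estimated per speaker in practice

state_mean = language_dependent_mean(means, lang_weights)
normalized_obs = apply_cmllr(rng.normal(size=5), A, b)
print(state_mean, normalized_obs)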


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2012

Unsupervised clustering of emotion and voice styles for expressive TTS

Florian Eyben; Sabine Buchholz; Norbert Braunschweiler

Current text-to-speech synthesis (TTS) systems are often perceived as lacking expressiveness, limiting the ability to fully convey information. This paper describes initial investigations into improving expressiveness for statistical speech synthesis systems. Rather than using hand-crafted definitions of expressive classes, an unsupervised clustering approach is described which is scalable to large quantities of training data. To incorporate this “expression cluster” information into an HMM-TTS system two approaches are described: cluster questions in the decision tree construction; and average expression speech synthesis (AESS) using cluster-based linear transform adaptation. The performance of the approaches was evaluated on audiobook data in which the reader exhibits a wide range of expressiveness. A subjective listening test showed that synthesising with AESS results in speech that better reflects the expressiveness of human speech than a baseline expression-independent system.
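As a rough illustration of the unsupervised clustering step, the sketch below groups utterance-level acoustic statistics into expression clusters with k-means. The choice of features, the use of k-means, and all values are assumptions made for this example rather than the paper's exact setup.

# Hedged sketch: clustering utterance-level acoustic statistics into
# "expression clusters" without hand-crafted labels.
import numpy as np
from sklearn.cluster import KMeans

# Pretend each row summarizes one audiobook utterance
# (e.g. mean F0, F0 range, energy, speaking rate).
utterance_features = np.random.rand(1000, 4)

kmeans = KMeans(n_clusters=8, random_state=0, n_init=10).fit(utterance_features)
expression_cluster_ids = kmeans.labels_   # one cluster id per utterance

# These ids could then feed the HMM-TTS system, e.g. as extra questions in
# decision-tree construction or to select a cluster-specific linear transform.
print(np.bincount(expression_cluster_ids))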


IEEE Journal of Selected Topics in Signal Processing | 2014

Building HMM-TTS Voices on Diverse Data

Vincent Wan; Javier Latorre; Kayoko Yanagisawa; Norbert Braunschweiler; Langzhou Chen; Mark J. F. Gales; Masami Akamine

The statistical models of hidden Markov model based text-to-speech (HMM-TTS) systems are typically built using homogeneous data. It is possible to acquire data from many different sources, but combining them leads to a non-homogeneous or diverse dataset. This paper describes the application of average voice models (AVMs) and a novel application of cluster adaptive training (CAT) with multiple context-dependent decision trees to create HMM-TTS voices using diverse data: speech data recorded in studios mixed with speech data obtained from the internet. Training AVM and CAT models on diverse data yields better quality speech than training on high-quality studio data alone. Tests show that CAT is able to create a voice for a target speaker with as little as 7 seconds of adaptation data; an AVM would need more data to reach the same level of similarity to the target speaker. Tests also show that CAT produces higher-quality voices than AVMs irrespective of the amount of adaptation data. Lastly, it is shown that it is beneficial to model the data using multiple context-clustering decision trees.
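The few-seconds adaptation result rests on the CAT idea that a state mean is a weighted combination of cluster means, so only a short weight vector has to be estimated per speaker. Below is a conceptual sketch with invented dimensions, where a simple least-squares fit stands in for the maximum-likelihood weight estimation used in a real system.

import numpy as np

def cat_state_mean(M, weights):
    # M: (dim, n_clusters) cluster means for one state,
    # weights: (n_clusters,) speaker-specific interpolation weights.
    return M @ weights

def estimate_weights(M, target_mean):
    # Simple least-squares fit to an (assumed) target mean, standing in for the
    # maximum-likelihood weight estimation used in a real CAT system.
    w, *_ = np.linalg.lstsq(M, target_mean, rcond=None)
    return w

rng = np.random.default_rng(1)
M = rng.normal(size=(40, 5))        # 40-dimensional features, 5 clusters
target = rng.normal(size=40)        # statistics gathered from a few seconds of speech
w = estimate_weights(M, target)
print(cat_state_mean(M, w).shape)   # (40,)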


SmartKom | 2006

Multimodal Speech Synthesis

Antje Schweitzer; Norbert Braunschweiler; Grzegorz Dogil; Tanja Klankert; Bernd Möbius; Gregor Möhler; Edmilson Morais; Bettina Säuberlich; Matthias Thomae

Speech output generation in the SmartKom system is realized by a corpus-based unit selection strategy that preserves many properties of the human voice. When the system’s avatar “Smartakus” is present on the screen, the synthetic speech signal is temporally synchronized with Smartakus’ visible speech gestures and prosodically adjusted to his pointing gestures to enhance multimodal communication. The unit selection voice was formally evaluated and found to be very well accepted and reasonably intelligible in SmartKom-specific scenarios.


IEEE Journal of Selected Topics in Signal Processing | 2014

Integrated Expression Prediction and Speech Synthesis From Text

Langzhou Chen; Mark J. F. Gales; Norbert Braunschweiler; Masami Akamine; Katherine Mary Knill

Generating expressive, naturally sounding speech from text using a text-to-speech (TTS) system is a highly challenging problem. However, for tasks such as audiobooks it is essential if their use is to become widespread. Generating expressive speech from text can be divided into two parts: predicting expressive information from text; and synthesizing the speech with a particular expression. Traditionally these components have been studied separately. This paper proposes an integrated approach, where the training data and representation of expressive synthesis are shared across the two components. There are several advantages to this scheme, including: robust handling of automatically generated expressive labels; support for a continuous representation of expressions; and joint training of the expression predictor and speech synthesizer. Synthesis experiments indicated that the proposed approach produced far more expressive speech than both a neutral TTS system and one where the expression was randomly selected. The experimental results also show the advantage of a continuous expressive synthesis space over a discrete space.
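A hedged sketch of the integrated idea: a predictor maps text features to a point in a continuous expression space, and because the synthesizer is parameterized by the same space, the predicted vector can be passed straight through. The features, the ridge-regression predictor, the stand-in synthesizer, and the dimensions below are assumptions for illustration only.

import numpy as np
from sklearn.linear_model import Ridge

n_sentences, n_text_features, expr_dim = 500, 300, 4

# Assume each training sentence has text features (e.g. bag-of-words counts) and
# an expression vector obtained automatically from the matching audio.
X_text = np.random.rand(n_sentences, n_text_features)
Y_expr = np.random.rand(n_sentences, expr_dim)

predictor = Ridge(alpha=1.0).fit(X_text, Y_expr)

def synthesize_expressive(text_features, synthesizer):
    # Predict a point in the continuous expression space from text features,
    # then hand it to the (hypothetical) synthesizer that shares this space.
    expr_vector = predictor.predict(text_features[None, :])[0]
    return synthesizer(expr_vector)

# Stand-in synthesizer for demonstration purposes only.
demo = synthesize_expressive(np.random.rand(n_text_features),
                             lambda e: f"waveform conditioned on {np.round(e, 2)}")
print(demo)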


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2013

Integrated automatic expression prediction and speech synthesis from text

Langzhou Chen; Mark J. F. Gales; Norbert Braunschweiler; Masami Akamine; Katherine Mary Knill

Getting a text-to-speech (TTS) synthesis system to speak lively, animated stories like a human is very difficult. To generate expressive speech, the system can be divided into two parts: predicting expressive information from text; and synthesizing the speech with a particular expression. Traditionally these blocks have been studied separately. This paper proposes an integrated approach, sharing the expressive synthesis space and training data across the two expressive components. There are several advantages to this approach, including a simplified expression labelling process, support for a continuous expressive synthesis space, and joint training of the expression predictor and speech synthesiser to maximise the likelihood of the TTS system given the training data. Synthesis experiments indicated that the proposed approach generated far more expressive speech than both a neutral TTS system and one where the expression was randomly selected. The experimental results also showed the advantage of a continuous expressive synthesis space over a discrete space.


IEEE Transactions on Audio, Speech, and Language Processing | 2015

Speaker and expression factorization for audiobook data: expressiveness and transplantation

Langzhou Chen; Norbert Braunschweiler; Mark J. F. Gales

Expressive synthesis from text is a challenging problem, for two reasons. First, text read aloud, as in audiobooks, is often highly expressive in order to convey the emotion and scenario in the text. Second, since expressive training speech is not always available for every speaker, methods are needed to share expressive information across speakers. This paper investigates the approach of using very expressive, highly diverse audiobook data from multiple speakers to build an expressive speech synthesis system. Both problems are addressed with a factorized framework in which speaker and emotion are modeled in separate sub-spaces of a cluster adaptive training (CAT) parametric speech synthesis system. The sub-spaces for the expressive state of a speaker and for the characteristics of the speaker are jointly trained on a set of audiobooks. The expressive speech synthesis system works in two distinct modes. In the first mode, the expressive information is given by audio data and adaptation is used to extract that information from the audio. In the second mode, the input to the synthesis system is plain text, and a full expressive synthesis system is examined in which the expressive state is predicted from the text. In both modes, the expressive information is shared and transplanted across different speakers. Experimental results show that in both modes the proposed method significantly improves the expressiveness of the synthetic speech for different speakers. Finally, the paper examines whether the expressive states can be predicted from text for multiple speakers using a single model, or whether the prediction process needs to be speaker specific.
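Conceptually, the factorized CAT model gives each state a mean that is the sum of a speaker sub-space term and an expression sub-space term, which is what makes transplantation possible: an expression weight vector estimated on one speaker's audio can be reused with another speaker's weights. A toy sketch with invented dimensions and values:

import numpy as np

def factorized_mean(M_speaker, w_speaker, M_expr, w_expr):
    # State mean = speaker sub-space contribution + expression sub-space contribution.
    return M_speaker @ w_speaker + M_expr @ w_expr

rng = np.random.default_rng(2)
dim, n_spk_clusters, n_expr_clusters = 40, 4, 3
M_speaker = rng.normal(size=(dim, n_spk_clusters))
M_expr = rng.normal(size=(dim, n_expr_clusters))

w_speaker_A = rng.normal(size=n_spk_clusters)    # characteristics of speaker A
w_speaker_B = rng.normal(size=n_spk_clusters)    # characteristics of speaker B
w_expressive = rng.normal(size=n_expr_clusters)  # expression estimated on speaker A's audio

# Transplantation: reuse the expression weights with a different speaker sub-space.
mean_A = factorized_mean(M_speaker, w_speaker_A, M_expr, w_expressive)
mean_B = factorized_mean(M_speaker, w_speaker_B, M_expr, w_expressive)
print(mean_A.shape, mean_B.shape)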


Conference of the International Speech Communication Association (Interspeech) | 2016

Automated Pause Insertion for Improved Intelligibility Under Reverberation

Petko N. Petkov; Norbert Braunschweiler; Yannis Stylianou

Speech intelligibility in reverberant environments is reduced because of overlap-masking. Signal modification prior to presentation in such listening environments, e.g., with a public announcement system, can be employed to alleviate this problem. Time-scale modifications are particularly effective in reducing the effect of overlap-masking. A method for introducing linguistically-motivated pauses is proposed in this paper. Given the transcription of a sentence, pause strengths are predicted at word boundaries. Pause duration is obtained by combining the pause strength and the time it takes late reverberation to decay to a level where a target signal-to-late-reverberation ratio criterion is satisfied. Considering a moderate reverberation condition and both binary and continuous pause strengths, a formal listening test was performed. The results show that the proposed methodology offers a significant intelligibility improvement over unmodified speech while continuous pause strengths offer an advantage over binary pause strengths.
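A simplified reading of the pause-duration computation: late reverberation decays by roughly 60 dB per T60 seconds, so the time needed for it to fall below a target signal-to-late-reverberation ratio follows directly, and the predicted pause strength then scales that time. The scaling used below is an assumption for illustration, not the paper's exact combination rule.

# Simplified sketch of the pause-duration idea described above.

def reverb_decay_time(target_slrr_db, t60_s):
    """Seconds for late reverberation to decay by target_slrr_db, given T60."""
    # Assumes the standard linear-in-dB decay of 60 dB per T60 seconds.
    return target_slrr_db * t60_s / 60.0

def pause_duration(pause_strength, target_slrr_db, t60_s):
    """Scale the required decay time by the (0..1) pause strength at a word boundary."""
    return pause_strength * reverb_decay_time(target_slrr_db, t60_s)

# Example: moderate reverberation (T60 = 0.8 s), 20 dB target SLRR.
for strength in (0.0, 0.5, 1.0):
    print(strength, round(pause_duration(strength, 20.0, 0.8), 3), "s")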


Conference of the International Speech Communication Association (Interspeech) | 2016

Pause Prediction from Text for Speech Synthesis with User-Definable Pause Insertion Likelihood Threshold

Norbert Braunschweiler; Ranniery Maia

Predicting the location of pauses from text is an important task for speech synthesizers. The accuracy of pause prediction can significantly influence both naturalness and intelligibility. Pauses which help listeners parse the synthesized speech into meaningful units tend to increase naturalness and intelligibility ratings, while pauses in unexpected or incorrect locations can reduce these ratings and cause confusion. This paper presents a multi-stage pause prediction approach consisting of prosodic chunk prediction, followed by a feature scoring algorithm and finally a pause sequence evaluation module. Preference tests showed that the new method outperformed a pauses-at-punctuation baseline while not yet matching human performance. In addition, the approach offers two further functionalities: (1) a user-specifiable pause insertion rate and (2) multiple output formats, namely binary pauses, multi-level pauses, or a score reflecting pause strength.
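The user-definable threshold and the multiple output formats can be pictured as post-processing of per-boundary pause-strength scores, as in the illustrative sketch below (scores, threshold, and level boundaries are invented).

# Minimal sketch of thresholding per-word-boundary pause-strength scores.

def pauses_from_scores(boundary_scores, threshold=0.5, output="binary"):
    if output == "binary":
        return [1 if s >= threshold else 0 for s in boundary_scores]
    if output == "multi-level":
        # e.g. 0 = no pause, 1 = short pause, 2 = long pause
        return [0 if s < threshold else (2 if s >= 0.8 else 1) for s in boundary_scores]
    return list(boundary_scores)  # raw pause-strength scores

scores = [0.1, 0.65, 0.3, 0.9]   # one score per word boundary, from the predictor
print(pauses_from_scores(scores, threshold=0.5, output="binary"))       # [0, 1, 0, 1]
print(pauses_from_scores(scores, threshold=0.5, output="multi-level"))  # [0, 1, 0, 2]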


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2014

Speaker dependent expression predictor from text: Expressiveness and transplantation

Langzhou Chen; Norbert Braunschweiler; Mark J. F. Gales

Automatically generating expressive speech from plain text is an important research topic in speech synthesis. Given the same text, different speakers may interpret it and read it in very different ways. This implies that expression prediction from text is a speaker-dependent task. Previous work presented an integrated method for expression prediction and speech synthesis which can be used to model the diverse expressions in human speech and to build speaker-dependent expression predictors from text. This work extends that integrated method into a framework for speaker and expression factorization. The expressions generated by the speaker-dependent expression predictors can be represented in a shared expression space, and in this space expressions can be transplanted between different speakers. The experimental results indicate that, based on the proposed method, the expressiveness of the synthetic speech can be improved for different speakers. Furthermore, this work shows how important speaker-specific information is for the performance of the expression predictor from text.

Collaboration


Dive into Norbert Braunschweiler's collaborations.

Top Co-Authors

Heiga Zen (Nagoya Institute of Technology)