Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Mirjam Wester is active.

Publication


Featured research published by Mirjam Wester.


Journal of the Acoustical Society of America | 2007

Speech production knowledge in automatic speech recognition

Simon King; Joe Frankel; Karen Livescu; Erik McDermott; Korin Richmond; Mirjam Wester

Although much is known about how speech is produced, and research into speech production has resulted in measured articulatory data, feature systems of different kinds, and numerous models, speech production knowledge is almost totally ignored in current mainstream approaches to automatic speech recognition. Representations of speech production allow simple explanations for many phenomena observed in speech that cannot easily be analyzed from either the acoustic signal or the phonetic transcription alone. This article surveys a growing body of work in which such representations are used to improve automatic speech recognition.


Computer Speech & Language | 2003

Pronunciation modeling for ASR - knowledge-based and data-derived methods

Mirjam Wester

This paper focuses on modeling pronunciation variation in two different ways: data-derived and knowledge-based. The knowledge-based approach consists of using phonological rules to generate variants. The data-derived approach consists of performing phone recognition, followed by smoothing using decision trees (D-trees) to alleviate some of the errors in the phone recognition. Using phonological rules led to a small improvement in word error rate (WER); a data-derived approach in which the phone recognition was smoothed using D-trees prior to lexicon generation led to larger improvements over the baseline. The lexicon was employed in two different recognition systems, a hybrid HMM/ANN system and an HMM-based system, to ascertain whether pronunciation variation was truly being modeled. This proved to be the case, as no significant differences were found between the results obtained with the two systems. A comparison between the knowledge-based and data-derived methods showed that 17% of the variants generated by the phonological rules were also found using phone recognition, and this increased to 46% when the phone recognition output was smoothed using D-trees.
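As an illustration of the knowledge-based side of this comparison, the sketch below applies a single schwa-deletion rule to canonical transcriptions to generate pronunciation variants for a lexicon. The rule, the phone symbols, and the tiny lexicon are illustrative assumptions only and do not reproduce the rule set or lexicon used in the paper.

    # Hypothetical sketch: knowledge-based pronunciation variant generation
    # via a single schwa-deletion rule. Phone symbols ('@' = schwa) and the
    # lexicon entry are made up for illustration.

    def schwa_deletion_variants(phones):
        """Return the canonical pronunciation plus variants with one schwa deleted."""
        variants = [tuple(phones)]
        for i, phone in enumerate(phones):
            if phone == "@":
                variants.append(tuple(phones[:i] + phones[i + 1:]))
        return variants

    lexicon = {"support": ["s", "@", "p", "O", "r", "t"]}
    for word, phones in lexicon.items():
        for variant in schwa_deletion_variants(phones):
            print(word, " ".join(variant))

In the data-derived counterpart, the variant set would instead come from phone-recognition output, with a decision tree smoothing away unlikely variants before the lexicon is built.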


Language and Speech | 2001

Obtaining Phonetic Transcriptions: A Comparison between Expert Listeners and a Continuous Speech Recognizer

Mirjam Wester; Judith M. Kessens; Catia Cucchiarini; Helmer Strik

In this article, we address the issue of using a continuous speech recognition tool to obtain phonetic or phonological representations of speech. Two experiments were carried out in which the performance of a continuous speech recognizer (CSR) was compared to the performance of expert listeners in a task of judging whether a number of prespecified phones had been realized in an utterance. In the first experiment, nine expert listeners and the CSR carried out exactly the same task: deciding whether a segment was present or not in 467 cases. In the second experiment, we expanded on the first experiment by focusing on two phonological processes: schwa-deletion and schwa-insertion. The results of these experiments show that differences in performance were found between the CSR and the listeners, but also between individual listeners. Although some of these differences appeared to be statistically significant, their magnitude is such that they may very well be acceptable depending on what the transcriptions are needed for. In other words, although the CSR is not infallible, it makes it possible to explore large data sets, which might outweigh the errors the CSR introduces. For these reasons, we conclude that the CSR can be used instead of a listener to carry out this type of task: deciding whether a phone is present or not.
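The comparison in this study comes down to agreement on binary present/absent decisions. The sketch below computes simple pairwise percentage agreement between a CSR and a listener on such decisions; the data and the agreement measure are illustrative assumptions, not the statistics reported in the paper.

    # Illustrative sketch: pairwise percentage agreement between a CSR and a
    # listener on binary present/absent decisions. The data are made up.

    def percent_agreement(decisions_a, decisions_b):
        """Fraction of cases on which two judges made the same present/absent decision."""
        assert len(decisions_a) == len(decisions_b)
        matches = sum(a == b for a, b in zip(decisions_a, decisions_b))
        return matches / len(decisions_a)

    csr      = [1, 1, 0, 0, 1, 0, 1, 1]   # 1 = phone judged present, 0 = absent
    listener = [1, 0, 0, 0, 1, 0, 1, 1]
    print(f"agreement: {percent_agreement(csr, listener):.2f}")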


international conference on acoustics, speech, and signal processing | 2016

Robust TTS duration modelling using DNNs

Gustav Eje Henter; Srikanth Ronanki; Oliver Watts; Mirjam Wester; Zhizheng Wu; Simon King

Accurate modelling and prediction of speech-sound durations is an important component in generating more natural synthetic speech. Deep neural networks (DNNs) offer a powerful modelling paradigm, and large, found corpora of natural and expressive speech are easy to acquire for training them. Unfortunately, found datasets are seldom subject to the quality control that traditional synthesis methods expect. Common issues likely to affect duration modelling include transcription errors, reductions, filled pauses, and forced-alignment inaccuracies. To combat this, we propose to improve modelling and prediction of speech durations using methods from robust statistics, which are able to disregard ill-fitting points in the training material. We describe a robust fitting criterion based on the density power divergence (the β-divergence) and a robust generation heuristic using mixture density networks (MDNs). Perceptual tests indicate that subjects prefer synthetic speech generated using robust models of duration over the baselines.
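For reference, the density power divergence of Basu et al. (1998), which the abstract calls the β-divergence, between a data density g and a model density f_θ is conventionally written as

    d_\beta(g, f_\theta) = \int \Big\{ f_\theta(x)^{1+\beta}
        - \big(1 + \tfrac{1}{\beta}\big)\, g(x)\, f_\theta(x)^{\beta}
        + \tfrac{1}{\beta}\, g(x)^{1+\beta} \Big\}\, dx, \qquad \beta > 0.

As β approaches 0 this tends to the Kullback-Leibler divergence, recovering maximum-likelihood fitting, while larger β down-weights observations to which the model assigns low density, which is what makes the fit robust to outliers such as alignment errors. This is the standard definition rather than the paper's exact criterion, whose weighting and application to duration models may differ.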


international conference on acoustics, speech, and signal processing | 2010

Unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis

Keiichiro Oura; Keiichi Tokuda; Junichi Yamagishi; Simon King; Mirjam Wester

In the EMIME project, we are developing a mobile device that performs personalized speech-to-speech translation such that a user's spoken input in one language is used to produce spoken output in another language, while continuing to sound like the user's voice. We integrate two techniques, unsupervised adaptation for HMM-based TTS using a word-based large-vocabulary continuous speech recognizer and cross-lingual speaker adaptation for HMM-based TTS, into a single architecture. Thus, an unsupervised cross-lingual speaker adaptation system can be developed. Listening tests show very promising results, demonstrating that adapted voices sound similar to the target speaker and that differences between supervised and unsupervised cross-lingual speaker adaptation are small.


IEEE Transactions on Audio, Speech, and Language Processing | 2016

Anti-spoofing for text-independent speaker verification: an initial database, comparison of countermeasures, and human performance

Zhizheng Wu; Phillip L. De Leon; Ali Khodabakhsh; Simon King; Zhen-Hua Ling; Daisuke Saito; Bryan Stewart; Tomoki Toda; Mirjam Wester; Junichi Yamagishi

In this paper, we present a systematic study of the vulnerability of automatic speaker verification to a diverse range of spoofing attacks. We start with a thorough analysis of the spoofing effects of five speech synthesis and eight voice conversion systems, and the vulnerability of three speaker verification systems under those attacks. We then introduce a number of countermeasures to prevent spoofing attacks from both known and unknown attackers. Known attackers are spoofing systems whose output was used to train the countermeasures, while an unknown attacker is a spoofing system whose output was not available to the countermeasures during training. Finally, we benchmark automatic systems against human performance on both speaker verification and spoofing detection tasks.


conference of the international speech communication association | 2016

The Voice Conversion Challenge 2016

Tomoki Toda; Ling-Hui Chen; Daisuke Saito; Fernando Villavicencio; Mirjam Wester; Zhizheng Wu; Junichi Yamagishi

This paper describes the Voice Conversion Challenge 2016, devised by the authors to better understand different voice conversion (VC) techniques by comparing their performance on a common dataset. The task of the challenge was speaker conversion, i.e., to transform the voice identity of a source speaker into that of a target speaker while preserving the linguistic content. Using a common dataset consisting of 162 utterances for training and 54 utterances for evaluation from each of 5 source and 5 target speakers, 17 groups working in VC around the world developed their own VC systems for every combination of the source and target speakers, i.e., 25 systems in total, and generated voice samples converted by the developed systems. These samples were evaluated in terms of target speaker similarity and naturalness by 200 listeners in a controlled environment. This paper summarizes the design of the challenge, its results, and future plans to share views about the unsolved problems and challenges faced by current VC techniques.


Speech Communication | 2012

Analysis of unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis using KLD-based transform mapping

Keiichiro Oura; Junichi Yamagishi; Mirjam Wester; Simon King; Keiichi Tokuda

In the EMIME project, we developed a mobile device that performs personalized speech-to-speech translation such that a user's spoken input in one language is used to produce spoken output in another language, while continuing to sound like the user's voice. We integrated two techniques into a single architecture: unsupervised adaptation for HMM-based TTS using word-based large-vocabulary continuous speech recognition, and cross-lingual speaker adaptation (CLSA) for HMM-based TTS. The CLSA is based on a state-level transform mapping learned using the minimum Kullback-Leibler divergence between pairs of HMM states in the input and output languages. Thus, an unsupervised cross-lingual speaker adaptation system was developed. End-to-end speech-to-speech translation systems for four languages (English, Finnish, Mandarin, and Japanese) were constructed within this framework. In this paper, the English-to-Japanese adaptation is evaluated. Listening tests demonstrate that adapted voices sound more similar to a target speaker than average voices and that differences between supervised and unsupervised cross-lingual speaker adaptation are small. Calculating the KLD state mapping on only the first 10 mel-cepstral coefficients leads to substantial savings in computational cost, without any detrimental effect on the quality of the synthetic speech.
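The state mapping described above pairs each HMM state in the input language with the output-language state that minimizes the Kullback-Leibler divergence between the two state distributions. Assuming, purely for the sake of a sketch, that each state is modelled by a single Gaussian, the divergence has the familiar closed form

    D_{\mathrm{KL}}\big(\mathcal{N}(\mu_1, \Sigma_1)\,\|\,\mathcal{N}(\mu_2, \Sigma_2)\big)
      = \tfrac{1}{2}\Big[\operatorname{tr}\big(\Sigma_2^{-1}\Sigma_1\big)
        + (\mu_2 - \mu_1)^{\top}\Sigma_2^{-1}(\mu_2 - \mu_1)
        - d + \ln\frac{\det\Sigma_2}{\det\Sigma_1}\Big],

where d is the feature dimensionality. Restricting the features to the first 10 mel-cepstral coefficients fixes d = 10, which is why the mapping becomes much cheaper to compute; states modelled by Gaussian mixtures, as in a typical HMM-based TTS system, would require an approximation to this divergence.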


Computer Speech & Language | 2013

Personalising speech-to-speech translation: Unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis

John Dines; Hui Liang; Lakshmi Saheer; Matthew Gibson; William Byrne; Keiichiro Oura; Keiichi Tokuda; Junichi Yamagishi; Simon King; Mirjam Wester; Teemu Hirsimäki; Reima Karhila; Mikko Kurimo

In this paper we present results of unsupervised cross-lingual speaker adaptation applied to text-to-speech synthesis. The application of our research is the personalisation of speech-to-speech translation, in which we employ an HMM-based statistical framework for both speech recognition and synthesis. This framework provides a logical mechanism to adapt synthesised speech output to the voice of the user by way of speech recognition. In this work we present results of several different unsupervised and cross-lingual adaptation approaches as well as an end-to-end speaker-adaptive speech-to-speech translation system. Our experiments show that we can successfully apply speaker adaptation in both unsupervised and cross-lingual scenarios, and our proposed algorithms seem to generalise well across several language pairs. We also discuss important future directions, including the need for better evaluation metrics.


conference of the international speech communication association | 2016

Analysis of the Voice Conversion Challenge 2016 Evaluation Results

Mirjam Wester; Zhizheng Wu; Junichi Yamagishi

The Voice Conversion Challenge (VCC) 2016, one of the special sessions at Interspeech 2016, deals with the well-known task of speaker identity conversion, referred to as voice conversion (VC). The objective of the VCC is to compare various VC techniques on identical training and evaluation speech data. The full description of VCC 2016, including the motivation, the database, the rules, the participants, and the main findings, is presented in [1]. In the current paper, we describe the listening test design in more detail and present the results of the listening test and the subsequent statistical analyses.

Collaboration


Dive into Mirjam Wester's collaborations.

Top Co-Authors

Judith M. Kessens, Radboud University Nijmegen
Helmer Strik, Radboud University Nijmegen
Junichi Yamagishi, National Institute of Informatics
Simon King, University of Edinburgh
Zhizheng Wu, University of Edinburgh
Keiichiro Oura, Nagoya Institute of Technology
Rasmus Dall, University of Edinburgh
Keiichi Tokuda, Nagoya Institute of Technology
Oliver Watts, University of Edinburgh