Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Robert A. J. Clark is active.

Publication


Featured research published by Robert A. J. Clark.


Speech Communication | 2007

Multisyn: Open-domain unit selection for the Festival speech synthesis system

Robert A. J. Clark; Korin Richmond; Simon King

We present the implementation and evaluation of an open-domain unit selection speech synthesis engine designed to be flexible enough to encourage further unit selection research and allow rapid voice development by users with minimal speech synthesis knowledge and experience. We address the issues of automatically processing speech data into a usable voice using automatic segmentation techniques and how the knowledge obtained at labelling time can be exploited at synthesis time. We describe target cost and join cost implementation for such a system and describe the outcome of building voices with a number of different sized datasets. We show that, in a competitive evaluation, voices built using this technology compare favourably to other systems.
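As a rough illustration of the target-cost/join-cost search mentioned above, the sketch below performs a Viterbi-style selection over candidate units. It is a minimal sketch only: the cost functions, weights and data layout are assumptions for illustration, not Multisyn's actual implementation.

# Minimal unit-selection search sketch (illustrative, not Multisyn's code).
import numpy as np

def select_units(candidates, target_cost, join_cost, w_target=1.0, w_join=1.0):
    """Viterbi search over candidate units for each target position.

    candidates: list of lists; candidates[t] holds candidate units for position t.
    target_cost(t, unit): mismatch between the target specification at t and unit.
    join_cost(prev_unit, unit): cost of concatenating prev_unit with unit.
    Returns the lowest-cost unit sequence.
    """
    n = len(candidates)
    # best[t][j] = (cumulative cost, back-pointer) for candidate j at position t
    best = [[(w_target * target_cost(0, u), -1) for u in candidates[0]]]
    for t in range(1, n):
        row = []
        for u in candidates[t]:
            tc = w_target * target_cost(t, u)
            cost, back = min(
                (best[t - 1][i][0] + w_join * join_cost(p, u) + tc, i)
                for i, p in enumerate(candidates[t - 1])
            )
            row.append((cost, back))
        best.append(row)
    # Trace back the optimal path.
    j = int(np.argmin([c for c, _ in best[-1]]))
    path = [j]
    for t in range(n - 1, 0, -1):
        j = best[t][j][1]
        path.append(j)
    path.reverse()
    return [candidates[t][j] for t, j in enumerate(path)]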


International Conference on Acoustics, Speech, and Signal Processing | 2015

A multi-level representation of f0 using the continuous wavelet transform and the Discrete Cosine Transform

Manuel Sam Ribeiro; Robert A. J. Clark

We propose a representation of f0 using the Continuous Wavelet Transform (CWT) and the Discrete Cosine Transform (DCT). The CWT decomposes the signal into various scales of selected frequencies, while the DCT compactly represents complex contours as a weighted sum of cosine functions. The proposed approach has the advantage of combining signal decomposition and higher-level representations, thus modelling low frequencies at higher levels and high frequencies at lower levels. Objective results indicate that this representation improves f0 prediction over traditional short-term approaches. Subjective results show improvements over the typical MSD-HMM and results comparable to the recently proposed CWT-HMM, while using fewer parameters. These results are discussed and future lines of research are proposed.
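The sketch below shows the general two-stage idea described in the abstract: decompose an interpolated f0 contour with the CWT, then compress each scale with a truncated DCT. It is a sketch under stated assumptions; the wavelet choice, scales and number of DCT coefficients are illustrative and not the paper's settings.

# Illustrative CWT + DCT representation of an f0 contour (assumed parameters).
import numpy as np
import pywt
from scipy.fft import dct, idct

def cwt_dct_representation(f0, scales=(4, 8, 16, 32, 64), wavelet="mexh", n_dct=10):
    """Return a (len(scales) x n_dct) matrix of DCT coefficients per CWT scale."""
    coefs, _ = pywt.cwt(np.asarray(f0, dtype=float), scales, wavelet)
    # Each row of `coefs` is one frequency band of the f0 contour;
    # keep only the first n_dct cosine coefficients of each band.
    return np.stack([dct(band, norm="ortho")[:n_dct] for band in coefs])

def reconstruct_band(coeff_row, length):
    """Invert the truncated DCT of one band (lossy, for inspection only)."""
    padded = np.zeros(length)
    padded[: len(coeff_row)] = coeff_row
    return idct(padded, norm="ortho")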


International Conference on Acoustics, Speech, and Signal Processing | 2016

Deep neural network-guided unit selection synthesis

Thomas Merritt; Robert A. J. Clark; Zhizheng Wu; Junichi Yamagishi; Simon King

Vocoding of speech is a standard part of statistical parametric speech synthesis systems. It imposes an upper bound on the naturalness that can possibly be achieved. Hybrid systems using parametric models to guide the selection of natural speech units can combine the benefits of robust statistical models with the high level of naturalness of waveform concatenation. Existing hybrid systems use Hidden Markov Models (HMMs) as the statistical model. This paper demonstrates that the superiority of Deep Neural Network (DNN) acoustic models over HMMs in conventional statistical parametric speech synthesis also carries over to hybrid synthesis. We compare various DNN and HMM hybrid configurations, guiding the selection of waveform units either in the vocoder parameter domain or in the domain of embeddings (bottleneck features).
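To make the guiding idea concrete, the sketch below scores candidate units by the distance between a DNN-predicted embedding for the target frame and the embedding stored with each unit. The function names, the use of Euclidean distance and the data layout are illustrative assumptions, not the configurations compared in the paper.

# Illustrative DNN-guided target cost in an embedding (bottleneck) domain.
import numpy as np

def dnn_target_cost(predicted_embedding, unit_embedding):
    """Euclidean distance between predicted and stored unit embeddings."""
    return float(np.linalg.norm(predicted_embedding - unit_embedding))

def rank_candidates(dnn_predict, linguistic_features, candidate_units):
    """Rank candidate units for one target position by DNN-guided target cost.

    dnn_predict: callable mapping linguistic features to an embedding vector.
    candidate_units: list of (unit_id, stored_embedding) pairs.
    """
    target = dnn_predict(linguistic_features)
    scored = [(dnn_target_cost(target, emb), uid) for uid, emb in candidate_units]
    return sorted(scored)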


Speech Communication | 2012

Synthesis and evaluation of conversational characteristics in HMM-based speech synthesis

J. Sebastian Andersson; Junichi Yamagishi; Robert A. J. Clark

Spontaneous conversational speech has many characteristics that are currently not modelled well by HMM-based speech synthesis. To build synthetic voices that can give the impression of someone taking part in a conversation, we need to use data that exhibits more of the speech phenomena associated with conversations than the carefully read-aloud sentences generally used. In this paper we show that synthetic voices built with HMM-based speech synthesis techniques from conversational speech data preserved segmental and prosodic characteristics of frequent conversational speech phenomena. An analysis of an evaluation investigating the perception of quality and speaking style of HMM-based voices confirms that speech with conversational characteristics is instrumental in enabling listeners to perceive the successful integration of conversational speech phenomena in synthetic speech. The achieved synthetic speech quality provides an encouraging start for the continued use of conversational speech in HMM-based speech synthesis.


International Conference on Acoustics, Speech, and Signal Processing | 2013

Lightly supervised GMM VAD to use audiobook for speech synthesiser

Yoshitaka Mamiya; Junichi Yamagishi; Oliver Watts; Robert A. J. Clark; Simon King; Adriana Stan

Audiobooks have attracted attention as promising data for training Text-to-Speech (TTS) systems. However, they usually do not have a correspondence between the audio and the text data, and they are usually divided only into chapter units. In practice, we have to establish a correspondence between audio and text data before we can use them for building TTS synthesisers. However, aligning audio and text data is time-consuming, involves manual labour, and requires people skilled in speech processing. Previously, we proposed using graphemes to automatically align speech and text data. This paper further integrates a lightly supervised voice activity detection (VAD) technique to detect sentence boundaries as a pre-processing step before the grapheme approach. This lightly supervised technique requires time stamps of speech and silence only for the first fifty sentences. Combining these, we can semi-automatically build TTS systems from audiobooks with minimal manual intervention. Through subjective evaluations we analyse how the grapheme-based aligner and/or the proposed VAD technique affect the quality of HMM-based speech synthesisers trained on audiobooks.
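For intuition, the sketch below shows the core of a GMM-based VAD of the general kind discussed above: fit a two-component Gaussian mixture to log frame energies and label frames from the higher-energy component as speech. The frame length, hop size and fully unsupervised fitting are assumptions for illustration; the paper's method is lightly supervised using time stamps from the first fifty sentences.

# Illustrative GMM-based voice activity detection on log frame energy.
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_vad(samples, frame_len=400, hop=160):
    """Return one speech/non-speech decision per frame (True = speech)."""
    samples = np.asarray(samples, dtype=float)
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len, hop)]
    log_energy = np.log(np.array([np.sum(f ** 2) for f in frames]) + 1e-10)
    gmm = GaussianMixture(n_components=2, random_state=0)
    labels = gmm.fit_predict(log_energy.reshape(-1, 1))
    speech_component = int(np.argmax(gmm.means_.ravel()))
    return labels == speech_component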


Computer Speech & Language | 2016

ALISA: An automatic lightly supervised speech segmentation and alignment tool

Adriana Stan; Yoshitaka Mamiya; Junichi Yamagishi; Peter Bell; Oliver Watts; Robert A. J. Clark; Simon King

This paper describes the ALISA tool, which implements a lightly supervised method for sentence-level alignment of speech with imperfect transcripts. Its intended use is to enable the creation of new speech corpora from a multitude of resources in a language-independent fashion, thus avoiding the need to record or transcribe speech data. The method is designed to require minimal user intervention and expert knowledge, and it is able to align data in languages which employ alphabetic scripts. It comprises a GMM-based voice activity detector and a highly constrained grapheme-based speech aligner. The method is evaluated objectively against a gold-standard segmentation and transcription, as well as subjectively through building and testing speech synthesis systems from the retrieved data. Results show that on average 70% of the original data is correctly aligned, with a word error rate of less than 0.5%. In one case, subjective listening tests show a small but statistically significant preference for voices built on the gold transcript; in the other tests, no statistically significant differences are found between systems built from the fully supervised training data and those built using the proposed method.
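The sketch below illustrates the sentence-filtering idea behind a lightly supervised aligner: keep only sentences whose automatic (grapheme-level) recognition output agrees closely enough with the imperfect transcript. The matching criterion and threshold are illustrative assumptions, not ALISA's actual decision rule.

# Illustrative filter for lightly supervised sentence-level alignment.
from difflib import SequenceMatcher

def keep_sentence(recognised, transcript, threshold=0.9):
    """Accept a sentence if recogniser output and transcript largely agree."""
    ratio = SequenceMatcher(None, recognised.lower(), transcript.lower()).ratio()
    return ratio >= threshold

def filter_corpus(aligned_triples, threshold=0.9):
    """aligned_triples: iterable of (audio_id, recognised_text, transcript_text)."""
    return [(audio_id, transcript)
            for audio_id, recognised, transcript in aligned_triples
            if keep_sentence(recognised, transcript, threshold)]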


Spoken Language Technology Workshop | 2008

Automatic labeling of contrastive word pairs from spontaneous spoken English

Leonardo Badino; Robert A. J. Clark

This paper addresses the problem of automatically labeling contrast in spontaneous speech, where contrast is meant as a relation that ties two words that explicitly contrast with each other. Detection of contrast is relevant to the analysis of discourse and information structure and, because of the prosodic correlates of contrast, could also play an important role in speech applications, such as text-to-speech synthesis, that need accurate, discourse-context-aware modelling of prosody. With this prospect, we investigate the feasibility of automatic contrast labeling by training and evaluating on the Switchboard corpus a novel contrast tagger, based on support vector machines (SVMs), that combines lexical features, syntactic dependencies and WordNet semantic relations.
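As a small illustration of the feature-based SVM approach described above, the sketch below builds a few lexical and WordNet features for a candidate word pair and trains a binary classifier. The specific features, the noun-only hypernym check and the training-data format are illustrative assumptions; the paper's tagger also uses syntactic dependencies and a richer feature set.

# Illustrative SVM contrast tagger over word-pair features (assumed feature set).
import numpy as np
from nltk.corpus import wordnet as wn
from sklearn.svm import SVC

def are_antonyms(w1, w2):
    """True if WordNet lists w2 as an antonym of any sense of w1."""
    for syn in wn.synsets(w1):
        for lemma in syn.lemmas():
            if any(a.name() == w2 for a in lemma.antonyms()):
                return True
    return False

def share_noun_hypernym(w1, w2):
    """True if some noun senses of w1 and w2 share a common hypernym."""
    s1 = wn.synsets(w1, pos=wn.NOUN)
    s2 = wn.synsets(w2, pos=wn.NOUN)
    return any(x.lowest_common_hypernyms(y) for x in s1 for y in s2)

def pair_features(w1, w2):
    return np.array([
        float(w1.lower() == w2.lower()),        # lexical identity
        float(are_antonyms(w1, w2)),            # WordNet antonymy
        float(share_noun_hypernym(w1, w2)),     # co-hyponymy cue
    ])

def train_tagger(pairs, labels):
    """pairs: list of (word1, word2); labels: 1 = contrastive, 0 = not."""
    X = np.stack([pair_features(w1, w2) for w1, w2 in pairs])
    return SVC(kernel="rbf").fit(X, labels)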


Computer Speech & Language | 2016

Unsupervised language identification based on Latent Dirichlet Allocation

Wei Zhang; Robert A. J. Clark; Yongyuan Wang; Wen Li

To automatically build, from scratch, the language processing component for a speech synthesis system in a new language, a purified text corpus is needed in which any words and phrases from other languages are clearly identified or excluded. When using found data, and where there is no inherent linguistic knowledge of the language or languages contained in the data, identifying the pure data is a difficult problem. We propose an unsupervised language identification approach based on Latent Dirichlet Allocation (LDA-LI), taking raw n-gram counts as features without any smoothing, pruning or interpolation. The Latent Dirichlet Allocation topic model is reformulated for the language identification task and Collapsed Gibbs Sampling is used to train an unsupervised language identification model. In order to find the number of languages present, we compare four measures as well as the Hierarchical Dirichlet Process on several configurations of the ECI/MCI benchmark. Experiments on the ECI/MCI data and a Wikipedia-based Swahili corpus show that this LDA method, without any annotation, achieves precision, recall and F-scores comparable to state-of-the-art supervised language identification techniques (e.g. langid.py and guess_language).
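The sketch below conveys the general idea of topic-model-based language identification over raw character n-gram counts. Note that it uses scikit-learn's variational LDA as a stand-in for the Collapsed Gibbs Sampling used in the paper, and the n-gram range and number of latent "languages" are illustrative assumptions.

# Illustrative unsupervised language clustering with LDA over raw n-gram counts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def cluster_by_language(lines, n_languages=2, ngram_range=(1, 3)):
    """Assign each text line to one of n_languages latent topics."""
    # Raw character n-gram counts; no smoothing, pruning or interpolation.
    vectorizer = CountVectorizer(analyzer="char", ngram_range=ngram_range)
    counts = vectorizer.fit_transform(lines)
    lda = LatentDirichletAllocation(n_components=n_languages, random_state=0)
    doc_topic = lda.fit_transform(counts)
    return doc_topic.argmax(axis=1)   # most probable "language" per line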


Conference of the International Speech Communication Association | 2016

The SIWIS database: a multilingual speech database with acted emphasis

Jean-Philippe Goldman; Pierre-Edouard Honnet; Robert A. J. Clark; Philip N. Garner; Maria Ivanova; Alexandros Lazaridis; Hui Liang; Tiago Macedo; Beat Pfister; Manuel Sam Ribeiro; Eric Wehrli; Junichi Yamagishi

We describe here a collection of speech data from bilingual and trilingual speakers of English, French, German and Italian. In the context of speech-to-speech translation (S2ST), this database is designed for several purposes and studies: training CLSA systems (cross-language speaker adaptation), conveying emphasis through S2ST systems, and evaluating TTS systems. More precisely, 36 speakers judged as accentless (22 bilingual and 14 trilingual speakers) were recorded for a set of 171 prompts in two or three languages, amounting to a total of 24 hours of speech. These sets of prompts include 100 sentences from news, 25 sentences from Europarl, the same 25 sentences with one acted emphasised word, 20 semantically unpredictable sentences, and finally a 240-word long text. All in all, this yielded 64 bilingual session pairs covering the six possible combinations of the four languages. The database is freely available for non-commercial use and scientific research purposes.


International Conference on Acoustics, Speech, and Signal Processing | 2016

Wavelet-based decomposition of F0 as a secondary task for DNN-based speech synthesis with multi-task learning

Manuel Sam Ribeiro; Oliver Watts; Junichi Yamagishi; Robert A. J. Clark

We investigate two wavelet-based decomposition strategies for the f0 signal and their usefulness as a secondary task for speech synthesis using multi-task deep neural networks (MTL-DNNs). The first decomposition strategy uses a static set of scales for all utterances in the training data. We propose a second strategy, in which the scale of the mother wavelet is dynamically adjusted to the rate of each utterance. This approach is able to capture f0 variations related to the syllable, word, clitic-group and phrase units, and it constrains the wavelet components to lie within the frequency range that previous experiments have shown to be more natural. Both strategies are evaluated as a secondary task in MTL-DNNs. Results indicate that, on an expressive dataset, there is a strong preference for the systems using multi-task learning when compared to the baseline system.
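The sketch below shows the multi-task arrangement in outline: shared hidden layers feed a primary head for the usual acoustic features and a secondary head for the wavelet-decomposed f0. Layer sizes, the loss weighting and the feature dimensions are illustrative assumptions, not the paper's configuration.

# Illustrative multi-task DNN with a secondary wavelet-f0 output (PyTorch).
import torch
import torch.nn as nn

class MultiTaskAcousticModel(nn.Module):
    def __init__(self, n_linguistic, n_acoustic, n_wavelet_f0, hidden=512):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(n_linguistic, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.primary_head = nn.Linear(hidden, n_acoustic)      # vocoder parameters
        self.secondary_head = nn.Linear(hidden, n_wavelet_f0)  # CWT f0 scales

    def forward(self, x):
        h = self.shared(x)
        return self.primary_head(h), self.secondary_head(h)

def mtl_loss(pred_primary, pred_secondary, y_primary, y_secondary, w_secondary=0.5):
    """Weighted sum of primary and secondary regression losses."""
    mse = nn.functional.mse_loss
    return mse(pred_primary, y_primary) + w_secondary * mse(pred_secondary, y_secondary)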

Collaboration


Dive into Robert A. J. Clark's collaborations.

Top Co-Authors

Junichi Yamagishi (National Institute of Informatics)
Simon King (University of Edinburgh)
Oliver Watts (University of Edinburgh)
Volker Strom (University of Edinburgh)
Leonardo Badino (Istituto Italiano di Tecnologia)
Adriana Stan (Technical University of Cluj-Napoca)