Publication


Featured research published by Sarah Taylor.


Symposium on Computer Animation | 2012

Dynamic units of visual speech

Sarah Taylor; Moshe Mahler; Barry-John Theobald; Iain A. Matthews

We present a new method for generating a dynamic, concatenative unit of visual speech that can produce realistic visual speech animation. We redefine visemes as temporal units that describe distinctive speech movements of the visual speech articulators. Traditionally, visemes have been surmised to be the set of static mouth shapes representing clusters of contrastive phonemes (e.g. /p, b, m/ and /f, v/). In this work, the motion of the visual speech articulators is used to generate discrete, dynamic visual speech gestures. These gestures are clustered, providing a finite set of movements that describe visual speech: the visemes. Dynamic visemes are applied to speech animation by simply concatenating viseme units. We compare dynamic visemes to static visemes using subjective evaluation and find that dynamic visemes produce more accurate and visually pleasing speech animation given phonetically annotated audio, reducing the amount of time an animator needs to spend manually refining the animation.
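
The clustering step described above can be sketched in a few lines. The sketch below is not the authors' implementation: it assumes each visual speech gesture is a variable-length trajectory of articulator parameters, resamples it to a fixed length, and clusters the results with k-means so that each cluster index serves as a dynamic viseme label. The resampling length and number of visemes are illustrative assumptions.

```python
# Illustrative sketch only (not the authors' implementation): cluster visual speech
# gestures into a finite inventory of dynamic visemes. Assumes each gesture is a
# (T, D) trajectory of articulator/mouth-shape parameters.
import numpy as np
from sklearn.cluster import KMeans

def resample_gesture(gesture, n_frames=20):
    """Linearly resample a (T, D) trajectory to a fixed n_frames x D representation."""
    t_old = np.linspace(0.0, 1.0, len(gesture))
    t_new = np.linspace(0.0, 1.0, n_frames)
    return np.stack([np.interp(t_new, t_old, gesture[:, d])
                     for d in range(gesture.shape[1])], axis=1)

def learn_dynamic_visemes(gestures, n_visemes=150):
    """Cluster resampled gestures; each cluster index acts as a dynamic viseme label."""
    X = np.stack([resample_gesture(g).ravel() for g in gestures])
    return KMeans(n_clusters=n_visemes, n_init=10, random_state=0).fit(X)

# km = learn_dynamic_visemes(list_of_gesture_arrays)
# km.predict(resample_gesture(new_gesture).ravel()[None, :])  -> dynamic viseme id
```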


Knowledge Discovery and Data Mining | 2015

A Decision Tree Framework for Spatiotemporal Sequence Prediction

Taehwan Kim; Yisong Yue; Sarah Taylor; Iain A. Matthews

We study the problem of learning to predict a spatiotemporal output sequence given an input sequence. In contrast to conventional sequence prediction problems such as part-of-speech tagging (where output sequences are selected using a relatively small set of discrete labels), our goal is to predict sequences that lie within a high-dimensional continuous output space. We present a decision tree framework for learning an accurate non-parametric spatiotemporal sequence predictor. Our approach enjoys several attractive properties, including ease of training, fast performance at test time, and the ability to robustly tolerate corrupted training data using a novel latent variable approach. We evaluate on several datasets, and demonstrate substantial improvements over existing decision tree based sequence learning frameworks such as SEARN and DAgger.
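
As a rough illustration of the kind of non-parametric sequence predictor discussed above (not the paper's framework, which adds a latent variable approach for robustness), the sketch below trains a single multi-output regression tree on sliding windows of synthetic input and output frames.

```python
# Rough illustration, not the paper's framework: a single multi-output regression
# tree mapping a sliding window of input frames to a window of continuous,
# high-dimensional output frames. Data here is synthetic.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def make_windows(inputs, outputs, win=5):
    """Build (input window, output window) training pairs centred on each frame."""
    X, Y = [], []
    for t in range(win, len(inputs) - win):
        X.append(inputs[t - win:t + win + 1].ravel())
        Y.append(outputs[t - win:t + win + 1].ravel())
    return np.array(X), np.array(Y)

rng = np.random.default_rng(0)
inputs = rng.normal(size=(1000, 10))     # e.g. per-frame input features
outputs = rng.normal(size=(1000, 30))    # e.g. per-frame continuous outputs
X, Y = make_windows(inputs, outputs)
tree = DecisionTreeRegressor(max_depth=12).fit(X, Y)  # sklearn trees handle multi-output Y
pred_window = tree.predict(X[:1])        # one predicted output window
```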


ACM Transactions on Graphics | 2017

A deep learning approach for generalized speech animation

Sarah Taylor; Taehwan Kim; Yisong Yue; Moshe Mahler; James Krahe; Anastasio Garcia Rodriguez; Jessica K. Hodgins; Iain A. Matthews

We introduce a simple and effective deep learning approach to automatically generate natural looking speech animation that synchronizes to input speech. Our approach uses a sliding window predictor that learns arbitrary nonlinear mappings from phoneme label input sequences to mouth movements in a way that accurately captures natural motion and visual coarticulation effects. Our deep learning approach enjoys several attractive properties: it runs in real-time, requires minimal parameter tuning, generalizes well to novel input speech sequences, is easily edited to create stylized and emotional speech, and is compatible with existing animation retargeting approaches. One important focus of our work is to develop an effective approach for speech animation that can be easily integrated into existing production pipelines. We provide a detailed description of our end-to-end approach, including machine learning design decisions. Generalized speech animation results are demonstrated over a wide range of animation clips on a variety of characters and voices, including singing and foreign language input. Our approach can also generate on-demand speech animation in real-time from user speech input.
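
A minimal sketch of a sliding window predictor in this spirit is shown below; it is not the published model. The phoneme inventory size, window widths, layer sizes and activation are illustrative assumptions, and the network simply maps a window of one-hot phoneme labels to a window of mouth-shape parameters.

```python
# Minimal sketch, not the published model: a feed-forward network that maps a
# sliding window of one-hot phoneme labels to a window of mouth-shape parameters.
# Inventory size, window widths and layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

N_PHONES, IN_WIN, OUT_WIN, N_PARAMS = 41, 11, 5, 30

model = nn.Sequential(
    nn.Linear(N_PHONES * IN_WIN, 2000), nn.Tanh(),
    nn.Linear(2000, 2000), nn.Tanh(),
    nn.Linear(2000, N_PARAMS * OUT_WIN),   # predicts a whole window of parameters
)

def predict_window(phone_window):
    """phone_window: LongTensor of IN_WIN phoneme ids -> (OUT_WIN, N_PARAMS) parameters."""
    onehot = torch.nn.functional.one_hot(phone_window, N_PHONES).float().flatten()
    return model(onehot).reshape(OUT_WIN, N_PARAMS)

params = predict_window(torch.randint(0, N_PHONES, (IN_WIN,)))
```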


International Conference on Acoustics, Speech, and Signal Processing | 2014

The effect of speaking rate on audio and visual speech

Sarah Taylor; Barry-John Theobald; Iain A. Matthews

The speed at which an utterance is spoken affects both the duration of the speech and the position of the articulators. Consequently, the sounds that are produced are modified, as are the position and appearance of the lips, teeth, tongue and other visible articulators. We describe an experiment designed to measure the effect of variable speaking rate on audio and visual speech by comparing sequences of phonemes and dynamic visemes appearing in the same sentences spoken at different speeds. We find that both audio and visual speech production are affected by varying the rate of speech; however, the effect is significantly more prominent in visual speech.
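
One simple way to quantify such an effect (purely illustrative, not the paper's analysis) is a normalised edit distance between the label sequences, phonemes or dynamic visemes, produced for slow and fast readings of the same sentence, as sketched below.

```python
# Purely illustrative, not the paper's analysis: normalised edit distance between
# two label sequences (phonemes or dynamic visemes) from slow and fast productions.
def edit_distance(a, b):
    """Classic Levenshtein distance between two label sequences."""
    d = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)]
         for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1,                           # deletion
                          d[i][j - 1] + 1,                           # insertion
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))  # substitution
    return d[len(a)][len(b)]

def rate_change(slow_labels, fast_labels):
    """Fraction of the longer sequence that must change to match the other."""
    return edit_distance(slow_labels, fast_labels) / max(len(slow_labels), len(fast_labels), 1)

print(rate_change(["p", "l", "iy", "z"], ["p", "l", "z"]))  # 0.25
```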


Conference of the International Speech Communication Association | 2016

Audio-to-Visual Speech Conversion using Deep Neural Networks

Sarah Taylor; Akihiro Kato; Iain A. Matthews; Ben Milner

We study the problem of mapping from acoustic to visual speech with the goal of generating accurate, perceptually natural speech animation automatically from an audio speech signal. We present a sliding window deep neural network that learns a mapping from a window of acoustic features to a window of visual features from a large audio-visual speech dataset. Overlapping visual predictions are averaged to generate continuous, smoothly varying speech animation. We outperform a baseline HMM inversion approach in both objective and subjective evaluations and perform a thorough analysis of our results.
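
The overlap-and-average step can be sketched as follows (array shapes and the window length are assumptions, and this is not the paper's code): each frame's predicted window of visual features is accumulated into a frame buffer and divided by the overlap count to give a smooth trajectory.

```python
# Sketch of the overlap-and-average step (shapes and window length are assumptions).
import numpy as np

def overlap_average(window_preds, n_frames, win=5):
    """window_preds: (n_frames, win, D) visual-feature windows centred on each frame."""
    D = window_preds.shape[-1]
    acc = np.zeros((n_frames, D))
    count = np.zeros((n_frames, 1))
    half = win // 2
    for t, w in enumerate(window_preds):
        lo, hi = max(0, t - half), min(n_frames, t + half + 1)
        acc[lo:hi] += w[lo - t + half:hi - t + half]
        count[lo:hi] += 1
    return acc / np.maximum(count, 1)   # (n_frames, D) smoothed trajectory

smooth = overlap_average(np.random.randn(100, 5, 30), n_frames=100)
```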


International Conference on Acoustics, Speech, and Signal Processing | 2015

A mouth full of words: Visually consistent acoustic redubbing

Sarah Taylor; Barry-John Theobald; Iain A. Matthews

This paper introduces a method for automatic redubbing of video that exploits the many-to-many mapping of phoneme sequences to lip movements modelled as dynamic visemes (Taylor et al., 2012). For a given utterance, the corresponding dynamic viseme sequence is sampled to construct a graph of possible phoneme sequences that synchronize with the video. When composed with a pronunciation dictionary and language model, this produces a vast number of word sequences that are in sync with the original video, literally putting plausible words into the mouth of the speaker. We demonstrate that traditional, many-to-one, static visemes lack flexibility for this application as they produce significantly fewer word sequences. This work explores the natural ambiguity in visual speech and offers insight into automatic speech recognition and the importance of language modeling.
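
The composition idea can be illustrated with a toy example (the viseme-to-phoneme mappings and pronunciation dictionary below are invented for illustration and far smaller than in practice): enumerate the phoneme strings consistent with each dynamic viseme, compose them across the sequence, and keep those that spell entries in the dictionary.

```python
# Toy illustration only: the viseme-to-phoneme options and the pronunciation
# dictionary below are invented and far smaller than a real lexicon.
from itertools import product

viseme_to_phones = {                      # dynamic viseme id -> candidate phoneme strings
    0: [("p", "ae"), ("b", "ae"), ("m", "ae")],
    1: [("t",), ("d",), ("n",)],
}
pron_dict = {("p", "ae", "t"): "pat",     # phoneme string -> word
             ("b", "ae", "d"): "bad",
             ("m", "ae", "n"): "man"}

def redub_candidates(viseme_seq):
    """Enumerate phoneme sequences consistent with the visemes and look up words."""
    words = []
    for choice in product(*(viseme_to_phones[v] for v in viseme_seq)):
        phones = tuple(p for chunk in choice for p in chunk)
        if phones in pron_dict:
            words.append(pron_dict[phones])
    return words

print(redub_candidates([0, 1]))  # ['pat', 'bad', 'man']
```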


International Conference on Data Mining | 2016

HIVE-COTE: The Hierarchical Vote Collective of Transformation-Based Ensembles for Time Series Classification

Jason Lines; Sarah Taylor; Anthony J. Bagnall

There have been many new algorithms proposed over the last five years for solving time series classification (TSC) problems. A recent experimental comparison of the leading TSC algorithms has demonstrated that one approach is significantly more accurate than all others over 85 datasets. That approach, the Flat Collective of Transformation-based Ensembles (Flat-COTE), achieves superior accuracy through combining predictions of 35 individual classifiers built on four representations of the data into a flat hierarchy. Outside of TSC, deep learning approaches such as convolutional neural networks (CNNs) have seen a recent surge in popularity and are now state of the art in many fields. An obvious question is whether CNNs could be equally transformative in the field of TSC. To test this, we implement a common CNN structure and compare performance to Flat-COTE and a recently proposed time series-specific CNN implementation. We find that Flat-COTE is significantly more accurate than both deep learning approaches on 85 datasets. These results are impressive, but Flat-COTE is not without deficiencies. We improve the collective by adding new components and proposing a modular hierarchical structure with a probabilistic voting scheme that allows us to encapsulate the classifiers built on each transformation. We add two new modules representing dictionary and interval-based classifiers, and significantly improve upon the existing frequency domain classifiers with a novel spectral ensemble. The resulting classifier, the Hierarchical Vote Collective of Transformation-based Ensembles (HIVE-COTE), is significantly more accurate than Flat-COTE and represents a new state of the art for TSC. HIVE-COTE captures more sources of possible discriminatory features in time series and has a more modular, intuitive structure.
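
A minimal sketch of the probabilistic voting idea (not HIVE-COTE's actual modules or weighting scheme) combines each module's class-probability estimates, weighted by an estimate of that module's accuracy such as one obtained from cross-validation.

```python
# Sketch of probability-weighted voting (placeholder weights and probabilities;
# not HIVE-COTE's actual modules).
import numpy as np

def weighted_vote(module_probs, module_accs):
    """Combine per-module class-probability estimates, weighted by module accuracy."""
    combined = np.zeros_like(module_probs[0], dtype=float)
    for probs, acc in zip(module_probs, module_accs):
        combined += acc * probs
    return combined / combined.sum()      # normalised class probabilities

probs = weighted_vote(
    [np.array([0.7, 0.3]), np.array([0.4, 0.6]), np.array([0.9, 0.1])],
    [0.85, 0.60, 0.92],                   # e.g. cross-validation accuracy per module
)
print(probs, probs.argmax())
```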


ACM Transactions on Knowledge Discovery From Data | 2018

Time Series Classification with HIVE-COTE: The Hierarchical Vote Collective of Transformation-Based Ensembles

Jason Lines; Sarah Taylor; Anthony J. Bagnall

A recent experimental evaluation assessed 19 time series classification (TSC) algorithms and found that one was significantly more accurate than all others: the Flat Collective of Transformation-based Ensembles (Flat-COTE). Flat-COTE is an ensemble that combines 35 classifiers over four data representations. However, while comprehensive, the evaluation did not consider deep learning approaches. Convolutional neural networks (CNNs) have seen a surge in popularity and are now state of the art in many fields, raising the question of whether CNNs could be equally transformative for TSC. We implement a benchmark CNN for TSC using a common structure and use results from a TSC-specific CNN from the literature. We compare both to Flat-COTE and find that the collective is significantly more accurate than both CNNs. These results are impressive, but Flat-COTE is not without deficiencies. We significantly improve the collective by proposing a new hierarchical structure with probabilistic voting, defining and including two novel ensemble classifiers built in existing feature spaces, and adding further modules to represent two additional transformation domains. The resulting classifier, the Hierarchical Vote Collective of Transformation-based Ensembles (HIVE-COTE), encapsulates classifiers built on five data representations. We demonstrate that HIVE-COTE is significantly more accurate than Flat-COTE (and all other TSC algorithms that we are aware of) over 100 resamples of 85 TSC problems and is the new state of the art for TSC. Further analysis is included through the introduction and evaluation of three new case studies and extensive experimentation on 1,000 simulated datasets of five different types.
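
For context, a generic benchmark CNN for univariate time series classification might look like the sketch below; the layer sizes are assumptions, and this is not the specific architecture evaluated in the paper.

```python
# Generic benchmark CNN for univariate time series classification (layer sizes are
# assumptions; not the specific architecture evaluated in the paper).
import torch
import torch.nn as nn

class SimpleTSCNN(nn.Module):
    def __init__(self, series_length, n_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=7, padding=3), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=7, padding=3), nn.ReLU(), nn.MaxPool1d(2),
        )
        self.classify = nn.Linear(32 * (series_length // 4), n_classes)

    def forward(self, x):                 # x: (batch, 1, series_length)
        return self.classify(self.features(x).flatten(1))

logits = SimpleTSCNN(series_length=128, n_classes=5)(torch.randn(8, 1, 128))
```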


conference of the international speech communication association | 2016

Visual speech synthesis using dynamic visemes, contextual features and DNNs

Ausdang Thangthai; Ben Milner; Sarah Taylor

This paper examines methods to improve visual speech synthesis from a text input using a deep neural network (DNN). Two representations of the input text are considered: phoneme sequences and dynamic viseme sequences. From these sequences, contextual features are extracted that include information at varying linguistic levels, from the frame level up to the utterance level. These are extracted from a broad sliding window that captures context and produces features that are input into the DNN to estimate visual features. Experiments first compare the accuracy of these visual features against an HMM baseline method, establishing that both the phoneme and dynamic viseme systems perform better, with the best performance obtained by a combined phoneme-dynamic viseme system. An investigation into the features then reveals the importance of the frame-level information, which is able to avoid discontinuities in the visual feature sequence and produces a smooth and realistic output.
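
A hedged sketch of the contextual-feature construction (not the paper's exact feature set): for each video frame, one-hot identities of the linguistic units in a sliding window are concatenated with a frame-level feature such as the fractional position within the current unit, and the resulting vector would be the DNN input. The unit inventory size and window width are assumptions.

```python
# Sketch of windowed contextual features (unit inventory size and window width are
# assumptions; not the paper's exact feature set).
import numpy as np

def contextual_features(unit_ids, frame_unit_index, frame_pos_in_unit, win=3, n_units=50):
    """One feature vector per frame: one-hot unit identities over a window, plus
    the frame's fractional position within its current unit."""
    feats = []
    for u_idx, pos in zip(frame_unit_index, frame_pos_in_unit):
        window = []
        for off in range(-win, win + 1):
            onehot = np.zeros(n_units)
            j = u_idx + off
            if 0 <= j < len(unit_ids):
                onehot[unit_ids[j]] = 1.0
            window.append(onehot)
        feats.append(np.concatenate(window + [np.array([pos])]))
    return np.stack(feats)   # (n_frames, (2*win + 1)*n_units + 1)

# X = contextual_features(unit_ids=[3, 17, 8],
#                         frame_unit_index=[0, 0, 1, 1, 2],
#                         frame_pos_in_unit=[0.0, 0.5, 0.0, 0.5, 0.0])
```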


Journal of the Acoustical Society of America | 2016

Learning from visual speech

Sarah Taylor; Taehwan Kim; Yisong Yue; Ben Milner; Iain A. Matthews

Realistic animation of faces has proven to be challenging due to our sensitivity in perceiving visual articulation errors. We study the problem of learning to automatically generate natural speech-related facial motion from audio speech which can be used to drive both CG and robotic talking heads with low bandwidth requirements and low latency. A many-to-one mapping from acoustic phones to lip shapes (i.e., static “visemes”) is a poor approximation to the complex, context-dependent relationship visual speech truly has with acoustic speech production. We introduced “dynamic visemes” as data-derived visual-only speech units associated with distributions of phone strings and demonstrated they capture context and co-articulation. Further improvement in predicting visual speech can be achieved using an end-to-end deep learning approach. We train a sliding window deep neural network that learns a mapping from a window of phone labels or acoustic features to a window of visual features. This approach removes the...

Collaboration


Dive into Sarah Taylor's collaborations.

Top Co-Authors

Ben Milner, University of East Anglia
Taehwan Kim, Toyota Technological Institute at Chicago
Yisong Yue, California Institute of Technology
Jason Lines, University of East Anglia
Barry Theobald, University of East Anglia