Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Ricard Marxer is active.

Publication


Featured research published by Ricard Marxer.


IEEE Automatic Speech Recognition and Understanding Workshop | 2015

The third ‘CHiME’ speech separation and recognition challenge: Dataset, task and baselines

Jon Barker; Ricard Marxer; Emmanuel Vincent; Shinji Watanabe

The CHiME challenge series aims to advance far-field speech recognition technology by promoting research at the interface of signal processing and automatic speech recognition. This paper presents the design and outcomes of the 3rd CHiME Challenge, which targets the performance of automatic speech recognition in a real-world, commercially-motivated scenario: a person talking to a tablet device that has been fitted with a six-channel microphone array. The paper describes the data collection, the task definition and the baseline systems for data simulation, enhancement and recognition. The paper then presents an overview of the 26 systems that were submitted to the challenge, focusing on the strategies that proved to be most successful relative to the MVDR array processing and DNN acoustic modeling reference system. Challenge findings related to the role of simulated data in system training and evaluation are discussed.
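For readers unfamiliar with the reference enhancement, the following is a minimal sketch of an MVDR beamformer in the STFT domain, assuming per-frequency noise covariance estimates and steering vectors are computed elsewhere. It illustrates the general technique only; it is not the official CHiME-3 baseline code.

```python
import numpy as np

def mvdr_weights(noise_cov, steering):
    """noise_cov: (C, C) noise spatial covariance for one frequency bin.
    steering:  (C,) steering vector towards the target speaker."""
    inv_cov = np.linalg.pinv(noise_cov)
    numerator = inv_cov @ steering
    # w = R^{-1} d / (d^H R^{-1} d)
    return numerator / (steering.conj() @ numerator)

def apply_mvdr(stft, noise_cov, steering):
    """stft: (F, T, C) multichannel STFT, noise_cov: (F, C, C), steering: (F, C).
    Returns the (F, T) beamformed single-channel STFT."""
    out = np.empty(stft.shape[:2], dtype=complex)
    for f in range(stft.shape[0]):
        w = mvdr_weights(noise_cov[f], steering[f])
        out[f] = stft[f] @ w.conj()  # y[t] = w^H x[t] for every frame t
    return out
```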


Computer Speech & Language | 2017

An analysis of environment, microphone and data simulation mismatches in robust speech recognition

Emmanuel Vincent; Shinji Watanabe; Aditya Arie Nugraha; Jon Barker; Ricard Marxer

Highlights:
- An analysis of the impact of environment, microphone and data simulation mismatches between training and test data on the performance of robust ASR.
- Based on a critical analysis of the results published on the CHiME-3 dataset and new experiments.
- Result: with the exception of MVDR beamforming, these mismatches have little effect on the ASR performance.
- Contribution: the CHiME-4 challenge, which revisits the CHiME-3 dataset and reduces the number of microphones available for testing.

Speech enhancement and automatic speech recognition (ASR) are most often evaluated in matched (or multi-condition) settings where the acoustic conditions of the training data match (or cover) those of the test data. Few studies have systematically assessed the impact of acoustic mismatches between training and test data, especially concerning recent speech enhancement and state-of-the-art ASR techniques. In this article, we study this issue in the context of the CHiME-3 dataset, which consists of sentences spoken by talkers situated in challenging noisy environments recorded using a 6-channel tablet-based microphone array. We provide a critical analysis of the results published on this dataset for various signal enhancement, feature extraction, and ASR backend techniques and perform a number of new experiments in order to separately assess the impact of different noise environments, different numbers and positions of microphones, or simulated vs. real data on speech enhancement and ASR performance. We show that, with the exception of minimum variance distortionless response (MVDR) beamforming, most algorithms perform consistently on real and simulated data and can benefit from training on simulated data. We also find that training on different noise environments and different microphones barely affects the ASR performance, especially when several environments are present in the training data: only the number of microphones has a significant impact. Based on these results, we introduce the CHiME-4 Speech Separation and Recognition Challenge, which revisits the CHiME-3 dataset and makes it more challenging by reducing the number of microphones available for testing.
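As an illustration of how simulated noisy training data can be produced, the sketch below mixes clean speech with recorded background noise at a chosen SNR. The function and its RMS-based scaling are illustrative assumptions; the actual CHiME-3 simulation pipeline additionally models channel and reverberation effects.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Mix clean speech with background noise (assumed at least as long as
    the speech) so that the mixture has the requested speech-to-noise ratio,
    measured on RMS energy."""
    speech_rms = np.sqrt(np.mean(speech ** 2))
    noise_seg = noise[: len(speech)]
    noise_rms = np.sqrt(np.mean(noise_seg ** 2))
    # Gain that places the noise snr_db decibels below the speech level.
    gain = (speech_rms / noise_rms) / (10.0 ** (snr_db / 20.0))
    return speech + gain * noise_seg
```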


Annual Meeting of the Special Interest Group on Discourse and Dialogue | 2015

Knowledge transfer between speakers for personalised dialogue management

Iñigo Casanueva; Thomas Hain; Heidi Christensen; Ricard Marxer; Phil D. Green

Model-free reinforcement learning has been shown to be a promising data-driven approach for automatic dialogue policy optimization, but a relatively large number of dialogue interactions is needed before the system reaches reasonable performance. Recently, Gaussian process based reinforcement learning methods have been shown to reduce the number of dialogues needed to reach optimal performance, and pre-training the policy with data gathered from different dialogue systems has further reduced this amount. Following this idea, a dialogue system designed for a single speaker can be initialised with data from other speakers, but if the dynamics of the speakers are very different the model will perform poorly. When data gathered from different speakers is available, selecting the data from the most similar ones might improve the performance. We propose a method which automatically selects the data to transfer by defining a similarity measure between speakers, and uses this measure to weight the influence of the data from each speaker in the policy model. The methods are tested by simulating users with different severities of dysarthria interacting with a voice-enabled environmental control system.
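A rough sketch of similarity-weighted transfer is given below. The RBF similarity over per-speaker feature vectors and the per-sample weighting scheme are illustrative assumptions; the paper defines its own similarity measure and integrates the weights directly into a Gaussian process policy model.

```python
import numpy as np

def speaker_similarity(target_features, source_features, length_scale=1.0):
    """RBF similarity between one target speaker and each source speaker,
    given a fixed-length feature vector per speaker.
    target_features: (D,), source_features: (S, D)."""
    dists = np.linalg.norm(source_features - target_features, axis=1)
    return np.exp(-0.5 * (dists / length_scale) ** 2)

def weighted_pool(source_datasets, similarities):
    """Attach each source speaker's similarity as a weight to their dialogue
    samples, so data from more similar speakers influences the policy more."""
    pooled = []
    for samples, sim in zip(source_datasets, similarities):
        pooled.extend((sample, sim) for sample in samples)
    return pooled
```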


IEEE Automatic Speech Recognition and Understanding Workshop | 2015

Exploiting synchrony spectra and deep neural networks for noise-robust automatic speech recognition

Ning Ma; Ricard Marxer; Jon Barker; Guy J. Brown

This paper presents a novel system that exploits synchrony spectra and deep neural networks (DNNs) for automatic speech recognition (ASR) in challenging noisy environments. Synchrony spectra measure the extent to which each frequency channel in an auditory model is entrained to a particular pitch period, and they are used together with F0 estimates either in a DNN for time-frequency (T-F) mask estimation or to augment the input features for a DNN-based ASR system. The proposed approach was evaluated in the context of the CHiME-3 Challenge. Our experiments show that the synchrony spectra features work best when augmenting the input features to the DNN-based ASR system. Compared to the CHiME-3 baseline system, our best system provides a word error rate (WER) reduction of more than 14% absolute, achieving a WER of 18.56% on the evaluation test set.
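The sketch below illustrates a synchrony-spectrum-like feature: for each channel of an auditory filterbank, the normalised autocorrelation at a candidate pitch lag measures how strongly that channel is entrained to the pitch period. The filterbank front end and the exact normalisation used in the paper may differ; this is only an approximation of the idea.

```python
import numpy as np

def synchrony_spectrum(channels, pitch_lag):
    """channels: (n_channels, n_samples) auditory filterbank outputs for one
    analysis frame; pitch_lag: candidate pitch period in samples, e.g.
    int(fs / f0). Returns one entrainment value per channel in [-1, 1]."""
    sync = np.zeros(channels.shape[0])
    for c, x in enumerate(channels):
        shifted = x[pitch_lag:]
        base = x[: len(shifted)]
        denom = np.sqrt(np.sum(base ** 2) * np.sum(shifted ** 2)) + 1e-12
        sync[c] = np.sum(base * shifted) / denom
    return sync
```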


New Era for Robust Speech Recognition: Exploiting Deep Learning | 2017

The CHiME Challenges: Robust Speech Recognition in Everyday Environments

Jon Barker; Ricard Marxer; Emmanuel Vincent; Shinji Watanabe

The CHiME challenge series has been aiming to advance the development of robust automatic speech recognition for use in everyday environments by encouraging research at the interface of signal processing and statistical modelling. The series has been running since 2011 and is now entering its 4th iteration. This chapter provides an overview of the CHiME series, including a description of the datasets that have been collected and the tasks that have been defined for each edition. In particular, the chapter describes novel approaches that have been developed for producing simulated data for system training and evaluation, and presents conclusions about the validity of using simulated data for robust-speech-recognition development. We also provide a brief overview of the systems and specific techniques that have proved successful for each task. These systems have demonstrated the remarkable robustness that can be achieved through a combination of training data simulation and multicondition training, well-engineered multichannel enhancement, and state-of-the-art discriminative acoustic and language modelling techniques.


Conference of the International Speech Communication Association | 2016

Progress and Prospects for Spoken Language Technology: Results from Four Sexennial Surveys.

Roger K. Moore; Ricard Marxer

Since 1997, a survey has been conducted every six years at the IEEE workshop on Automatic Speech Recognition and Understanding (ASRU) in order to ascertain the research community’s perspective on future progress and prospects in spoken language technology. These surveys have been based on a set of ‘statements’, each of which portrays a possible future scenario, and respondents are asked to estimate the year in which each given scenario might become true. Many of the statements have appeared in several of the surveys, hence it is possible to track changes in opinion over time. This paper presents the combined results of all four surveys, the most recent of which was conducted at ASRU-2015. The results give an insight into the key trends that are taking place in the spoken language technology field, and reveal the realism that pervades the research community. They also suggest that there is growing confidence that some of the scenarios will indeed be realised at some point in the future.


Frontiers in Robotics and AI | 2016

Vocal Interactivity in-and-between Humans, Animals, and Robots

Roger K. Moore; Ricard Marxer; Serge Thill

Almost all animals exploit vocal signals for a range of ecologically-motivated purposes: detecting predators/prey and marking territory, expressing emotions, establishing social relations and sharing information. Whether it is a bird raising an alarm, a whale calling to potential partners, a dog responding to human commands, a parent reading a story with a child, or a business-person accessing stock prices using Siri, vocalisation provides a valuable communication channel through which behaviour may be coordinated and controlled, and information may be distributed and acquired. Indeed, the ubiquity of vocal interaction has led to research across an extremely diverse array of fields, from assessing animal welfare, to understanding the precursors of human language, to developing voice-based human-machine interaction. Opportunities for cross-fertilisation between these fields abound; for example, using artificial cognitive agents to investigate contemporary theories of language grounding, using machine learning to analyse different habitats, or adding vocal expressivity to the next generation of language-enabled autonomous social agents. However, much of the research is conducted within well-defined disciplinary boundaries, and many fundamental issues remain. This paper attempts to redress the balance by presenting a comparative review of vocal interaction within-and-between humans, animals and artificial agents (such as robots), and it identifies a rich set of open research questions that may benefit from an inter-disciplinary analysis.


Conference of the International Speech Communication Association | 2015

Automatic dysfluency detection in dysarthric speech using deep belief networks

Stacey Oue; Ricard Marxer; Frank Rudzicz

Dysarthria is a speech disorder caused by difficulties in controlling muscles, such as the tongue and lips, that are needed to produce speech. These differences in motor skills cause speech to be slurred, mumbled, and spoken relatively slowly, and can also increase the likelihood of dysfluency. This includes non-speech sounds and ‘stuttering’, defined here as a disruption in the fluency of speech manifested by prolongations, stop-gaps, and repetitions. This paper investigates different types of input features used by deep neural networks (DNNs) to automatically detect repetition stuttering and non-speech dysfluencies within dysarthric speech. The experiments test the effects of dimensionality within Mel-frequency cepstral coefficients (MFCCs) and linear predictive cepstral coefficients (LPCCs), and explore the detection capabilities in dysarthric versus non-dysarthric speech. The results obtained using MFCC and LPCC features produced similar recognition accuracies; repetition stuttering was identified correctly at approximately 86% in dysarthric speech and 84% in non-dysarthric speech. Non-speech sounds were recognized with approximately 75% accuracy in dysarthric speakers.
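A minimal sketch of the MFCC feature extraction step is shown below, using librosa. The number of coefficients and default frame settings are illustrative assumptions; the paper experiments with several dimensionalities and also with LPCC features.

```python
import librosa

def mfcc_features(wav_path, n_mfcc=13):
    """Load an utterance and return one MFCC vector per frame."""
    signal, sr = librosa.load(wav_path, sr=None)
    # librosa returns (n_mfcc, n_frames); transpose to one row per frame.
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc).T
```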


Conference of the International Speech Communication Association | 2015

Remote Speech Technology for Speech Professionals - the CloudCAST initiative

Phil D. Green; Ricard Marxer; Stuart P. Cunningham; Heidi Christensen; Frank Rudzicz; Maria Yancheva; André Coy; Massimiliano Malavasi; Lorenzo Desideri

Clinical applications of speech technology face two challenges. The first is data sparsity. There is little data available to underpin techniques which are based on machine learning and, because it is difficult to collect disordered speech corpora, the only way to address this problem is by pooling what is produced from systems which are already in use. The second is personalisation. This field demands individual solutions: technology which adapts to its user rather than demanding that the user adapt to it. Here we introduce a project, CloudCAST, which addresses these two problems by making remote, adaptive technology available to professionals who work with speech: therapists, educators and clinicians.

Index Terms: assistive technology, clinical applications of speech technology


Speech Communication | 2018

The impact of the Lombard effect on audio and visual speech recognition systems

Ricard Marxer; Jon Barker; Najwa Alghamdi; Steve C. Maddock

When producing speech in noisy backgrounds talkers reflexively adapt their speaking style in ways that increase speech-in-noise intelligibility. This adaptation, known as the Lombard effect, is likely to have an adverse effect on the performance of automatic speech recognition systems that have not been designed to anticipate it. However, previous studies of this impact have used very small amounts of data and recognition systems that lack modern adaptation strategies. This paper aims to rectify this by using a new audio-visual Lombard corpus containing speech from 54 different speakers – significantly larger than any previously available – and modern state-of-the-art speech recognition techniques. The paper is organised as three speech-in-noise recognition studies. The first examines the case in which a system is presented with Lombard speech having been exclusively trained on normal speech. It was found that the Lombard mismatch caused a significant decrease in performance even if the level of the Lombard speech was normalised to match the level of normal speech. However, the size of the mismatch was highly speaker-dependent, thus explaining conflicting results presented in previous smaller studies. The second study compares systems trained in matched conditions (i.e., training and testing with the same speaking style). Here the Lombard speech affords a large increase in recognition performance. Part of this is due to the greater energy leading to a reduction in noise masking, but performance improvements persist even after the effect of the signal-to-noise level difference is compensated. An analysis across speakers shows that the Lombard speech energy is spectro-temporally distributed in a way that reduces energetic masking, and this reduction in masking is associated with an increase in recognition performance. The final study repeats the first two using a recognition system trained on visual speech. In the visual domain, performance differences are not confounded by differences in noise masking. It was found that in matched conditions, Lombard speech supports better recognition performance than normal speech. The benefit was consistently present across all speakers but to a varying degree. Surprisingly, the Lombard benefit was observed to a small degree even when training on mismatched non-Lombard visual speech, i.e., the increased clarity of the Lombard speech outweighed the impact of the mismatch. The paper presents two generally applicable conclusions: i) systems that are designed to operate in noise will benefit from being trained on well-matched Lombard speech data; ii) the results of speech recognition evaluations that employ artificial speech and noise mixing need to be treated with caution: they are overly optimistic to the extent that they ignore a significant source of mismatch, but at the same time overly pessimistic in that they do not anticipate the potential increased intelligibility of the Lombard speaking style.
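The level normalisation mentioned in the first study can be illustrated with a small sketch that scales an utterance so its RMS energy matches that of a reference recording before mixing with noise. The exact procedure used in the paper may differ; this only conveys the idea.

```python
import numpy as np

def match_rms(signal, reference):
    """Scale `signal` so its RMS level equals that of `reference`."""
    sig_rms = np.sqrt(np.mean(signal ** 2))
    ref_rms = np.sqrt(np.mean(reference ** 2))
    return signal * (ref_rms / (sig_rms + 1e-12))
```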

Collaboration


Dive into Ricard Marxer's collaborations.

Top Co-Authors

Jon Barker (University of Sheffield)
Shinji Watanabe (Mitsubishi Electric Research Laboratories)
André Coy (University of Sheffield)
Guy J. Brown (University of Sheffield)