Publication


Featured research published by Jon Barker.


IEEE Automatic Speech Recognition and Understanding Workshop | 2015

The third ‘CHiME’ speech separation and recognition challenge: Dataset, task and baselines

Jon Barker; Ricard Marxer; Emmanuel Vincent; Shinji Watanabe

The CHiME challenge series aims to advance far field speech recognition technology by promoting research at the interface of signal processing and automatic speech recognition. This paper presents the design and outcomes of the 3rd CHiME Challenge, which targets the performance of automatic speech recognition in a real-world, commercially-motivated scenario: a person talking to a tablet device that has been fitted with a six-channel microphone array. The paper describes the data collection, the task definition and the baseline systems for data simulation, enhancement and recognition. The paper then presents an overview of the 26 systems that were submitted to the challenge, focusing on the strategies that proved to be most successful relative to the MVDR array processing and DNN acoustic modeling reference system. Challenge findings related to the role of simulated data in system training and evaluation are discussed.
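
For readers unfamiliar with the MVDR reference system mentioned above, the sketch below shows the standard textbook computation of MVDR beamforming weights for a single frequency bin. It is a generic illustration in Python/NumPy, not the CHiME-3 baseline code; the steering vector and noise statistics are toy values chosen only to make the example run.

```python
# Minimal sketch of MVDR beamforming weights for one frequency bin,
# illustrating the kind of array processing used as a reference system.
# Generic textbook formulation, not the challenge baseline code.
import numpy as np

def mvdr_weights(noise_cov, steering_vec):
    """w = R_n^{-1} d / (d^H R_n^{-1} d), computed per frequency bin."""
    r_inv_d = np.linalg.solve(noise_cov, steering_vec)
    return r_inv_d / (steering_vec.conj().T @ r_inv_d)

# Toy example: 6-channel array (as on the CHiME-3 tablet), one frequency bin.
rng = np.random.default_rng(0)
noise = rng.standard_normal((6, 200)) + 1j * rng.standard_normal((6, 200))
noise_cov = noise @ noise.conj().T / noise.shape[1]   # estimated noise covariance
steering = np.exp(-2j * np.pi * 0.1 * np.arange(6))   # assumed steering vector
w = mvdr_weights(noise_cov, steering)
print(np.abs(w.conj().T @ steering))  # distortionless constraint: ~1 toward the target
```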


Computer Speech & Language | 2013

The PASCAL CHiME speech separation and recognition challenge

Jon Barker; Emmanuel Vincent; Ning Ma; Heidi Christensen; Phil D. Green

Distant microphone speech recognition systems that operate with human-like robustness remain a distant goal. The key difficulty is that operating in everyday listening conditions entails processing a speech signal that is reverberantly mixed into a noise background composed of multiple competing sound sources. This paper describes a recent speech recognition evaluation that was designed to bring together researchers from multiple communities in order to foster novel approaches to this problem. The task was to identify keywords from sentences reverberantly mixed into audio backgrounds binaurally recorded in a busy domestic environment. The challenge was designed to model the essential difficulties of the multisource environment problem while remaining on a scale that would make it accessible to a wide audience. Compared to previous ASR evaluations, a particular novelty of the task is that the utterances to be recognised were provided in a continuous audio background rather than as pre-segmented utterances, thus allowing a range of background modelling techniques to be employed. The challenge attracted thirteen submissions. This paper describes the challenge problem, provides an overview of the systems that were entered and provides a comparison alongside both a baseline recognition system and human performance. The paper discusses insights gained from the challenge and lessons learnt for the design of future such evaluations.


Journal of the Acoustical Society of America | 2008

The foreign language cocktail party problem: Energetic and informational masking effects in non-native speech perception

Martin Cooke; M. L. Garcia Lecumberri; Jon Barker

Studies comparing native and non-native listener performance on speech perception tasks can distinguish the roles of general auditory and language-independent processes from those involving prior knowledge of a given language. Previous experiments have demonstrated a performance disparity between native and non-native listeners on tasks involving sentence processing in noise. However, the effects of energetic and informational masking have not been explicitly distinguished. Here, English and Spanish listener groups identified keywords in English sentences in quiet and masked by either stationary noise or a competing utterance, conditions known to produce predominantly energetic and informational masking, respectively. In the stationary noise conditions, non-native listeners suffered more from increasing levels of noise for two of the three keywords scored. In the competing talker condition, the performance differential also increased with masker level. A computer model of energetic masking in the competing talker condition ruled out the possibility that the native advantage could be explained wholly by energetic masking. Both groups drew equal benefit from differences in mean F0 between target and masker, suggesting that processes which make use of this cue do not engage language-specific knowledge.
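
As a rough illustration of what an energetic masking model can quantify, the sketch below computes a simple glimpse-style measure: the proportion of spectro-temporal cells in which the target exceeds the masker by a local SNR threshold. The 3 dB threshold and the toy spectrograms are assumptions for illustration; this is not the specific model used in the paper.

```python
# Sketch of a 'glimpsing'-style energetic masking measure: the fraction of
# spectro-temporal cells where the target is above the masker by some local
# SNR threshold. The 3 dB value is an illustrative assumption.
import numpy as np

def glimpse_proportion(target_spec, masker_spec, threshold_db=3.0):
    """target_spec, masker_spec: (channels, frames) power spectrograms."""
    local_snr_db = 10 * np.log10(target_spec / (masker_spec + 1e-12) + 1e-12)
    return np.mean(local_snr_db > threshold_db)

rng = np.random.default_rng(3)
target = rng.gamma(2.0, 1.0, size=(32, 200))   # toy target spectrogram
masker = rng.gamma(2.0, 1.0, size=(32, 200))   # toy masker spectrogram
print(glimpse_proportion(target, masker))      # proportion of 'glimpsed' cells
```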


Speech Communication | 2004

Techniques for handling convolutional distortion with ‘missing data’ automatic speech recognition

Kalle J. Palomäki; Guy J. Brown; Jon Barker

In this study we describe two techniques for handling convolutional distortion with ‘missing data’ speech recognition using spectral features. The missing data approach to automatic speech recognition (ASR) is motivated by a model of human speech perception, and involves the modification of a hidden Markov model (HMM) classifier to deal with missing or unreliable features. Although the missing data paradigm was proposed as a means of handling additive noise in ASR, we demonstrate that it can also be effective in dealing with convolutional distortion. Firstly, we propose a normalisation technique for handling spectral distortions and changes of input level (possibly in the presence of additive noise). The technique computes a normalising factor only from the most intense regions of the speech spectrum, which are likely to remain intact across various noise conditions. We show that the proposed normalisation method improves performance compared to a conventional missing data approach with spectrally distorted and noise contaminated speech, and in conditions where the gain of the input signal varies. Secondly, we propose a method for handling reverberated speech which attempts to identify time-frequency regions that are not badly contaminated by reverberation and have strong speech energy. This is achieved by using modulation filtering to identify ‘reliable’ regions of the speech spectrum. We demonstrate that our approach improves recognition performance in cases where the reverberation time T60 exceeds 0.7 s, compared to a baseline system which uses acoustic features derived from perceptual linear prediction and the modulation-filtered spectrogram.
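
The first technique above can be illustrated with a minimal sketch: estimate the normalising factor from only the most intense time-frequency cells, which are the ones most likely to remain speech-dominated when noise is added. The 10% fraction and the toy spectrogram are illustrative assumptions, not values taken from the paper.

```python
# Sketch of gain normalisation computed from the most intense spectral
# regions only. The 10% fraction is an illustrative assumption.
import numpy as np

def peak_based_gain(spectrogram, fraction=0.1):
    """Estimate a normalising factor from the loudest time-frequency cells,
    which are the most likely to stay speech-dominated under added noise."""
    values = np.sort(spectrogram.ravel())[::-1]
    n_peak = max(1, int(fraction * values.size))
    return values[:n_peak].mean()

rng = np.random.default_rng(1)
spec = rng.gamma(shape=2.0, scale=1.0, size=(40, 100))   # toy magnitude spectrogram
spec_scaled = 3.0 * spec                                  # same speech, higher input gain
print(peak_based_gain(spec_scaled) / peak_based_gain(spec))  # 3.0: gain change recovered
```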


Computer Speech & Language | 2010

Speech fragment decoding techniques for simultaneous speaker identification and speech recognition

Jon Barker; Ning Ma; André Coy; Martin Cooke

This paper addresses the problem of recognising speech in the presence of a competing speaker. We review a speech fragment decoding technique that treats segregation and recognition as coupled problems. Data-driven techniques are used to segment a spectro-temporal representation into a set of fragments, such that each fragment is dominated by one or other of the speech sources. A speech fragment decoder is used which employs missing data techniques and clean speech models to simultaneously search for the set of fragments and the word sequence that best matches the target speaker model. The paper investigates the performance of the system on a recognition task employing artificially mixed target and masker speech utterances. The fragment decoder produces significantly lower error rates than a conventional recogniser, and mimics the pattern of human performance that is produced by the interplay between energetic and informational masking. However, at around 0 dB the performance is generally quite poor. An analysis of the errors shows that a large number of target/masker confusions are being made. The paper presents a novel fragment-based speaker identification approach that allows the target speaker to be reliably identified across a wide range of SNRs. This component is combined with the recognition system to produce significant improvements. When the target and masker utterance have the same gender, the recognition system has a performance at 0 dB equal to that of humans; in other conditions the error rate is roughly twice the human error rate.
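
The core scoring step of a fragment decoder can be pictured as a missing-data likelihood: cells hypothesised to be dominated by the target are scored against a clean-speech Gaussian model, while the remaining cells are marginalised out. The sketch below uses full marginalisation and toy model parameters (a bounded variant would instead integrate only up to the observed energy); it illustrates the principle and is not the authors' decoder.

```python
# Sketch of a missing-data likelihood for one frame and one model state:
# reliable time-frequency cells are scored against a clean-speech Gaussian,
# unreliable cells are marginalised out (full marginalisation here).
# Model parameters below are toy values, not taken from the paper.
import numpy as np
from scipy.stats import norm

def missing_data_log_likelihood(obs, mask, mean, std):
    """obs, mask, mean, std: 1-D arrays over frequency channels.
    mask[i] is True where the cell is hypothesised to be target-dominated."""
    reliable = norm.logpdf(obs[mask], mean[mask], std[mask]).sum()
    return reliable  # marginalised cells contribute log(1) = 0

obs  = np.array([2.0, 0.5, 3.1, 0.2])
mask = np.array([True, False, True, False])   # a hypothesised fragment labelling
mean = np.array([2.2, 1.0, 3.0, 1.5])
std  = np.ones(4)
print(missing_data_log_likelihood(obs, mask, mean, std))
```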


Computer Speech & Language | 2017

An analysis of environment, microphone and data simulation mismatches in robust speech recognition

Emmanuel Vincent; Shinji Watanabe; Aditya Arie Nugraha; Jon Barker; Ricard Marxer

Highlights: an analysis of the impact of environment, microphone and data simulation mismatches between training and test data on robust ASR performance, based on a critical analysis of the results published on the CHiME-3 dataset and on new experiments. Result: with the exception of MVDR beamforming, these mismatches have little effect on ASR performance. Contribution: the CHiME-4 challenge, which revisits the CHiME-3 dataset and reduces the number of microphones available for testing.

Speech enhancement and automatic speech recognition (ASR) are most often evaluated in matched (or multi-condition) settings where the acoustic conditions of the training data match (or cover) those of the test data. Few studies have systematically assessed the impact of acoustic mismatches between training and test data, especially concerning recent speech enhancement and state-of-the-art ASR techniques. In this article, we study this issue in the context of the CHiME-3 dataset, which consists of sentences spoken by talkers situated in challenging noisy environments recorded using a 6-channel tablet-based microphone array. We provide a critical analysis of the results published on this dataset for various signal enhancement, feature extraction, and ASR backend techniques and perform a number of new experiments in order to separately assess the impact of different noise environments, different numbers and positions of microphones, or simulated vs. real data on speech enhancement and ASR performance. We show that, with the exception of minimum variance distortionless response (MVDR) beamforming, most algorithms perform consistently on real and simulated data and can benefit from training on simulated data. We also find that training on different noise environments and different microphones barely affects the ASR performance, especially when several environments are present in the training data: only the number of microphones has a significant impact. Based on these results, we introduce the CHiME-4 Speech Separation and Recognition Challenge, which revisits the CHiME-3 dataset and makes it more challenging by reducing the number of microphones available for testing.


International Conference on Acoustics, Speech, and Signal Processing | 2002

Missing data speech recognition in reverberant conditions

Kalle J. Palomäki; Guy J. Brown; Jon Barker

In this study we describe an auditory processing front-end for missing data speech recognition, which is robust in the presence of reverberation. The model attempts to identify time-frequency regions that are not badly contaminated by reverberation and have strong speech energy. This is achieved by applying reverberation masking. Subsequently, reliable time-frequency regions are passed to a ‘missing data’ speech recogniser for classification. We demonstrate that the model improves recognition performance in three different virtual rooms where reverberation time T60 varies from 0.7 sec to 2.7 sec. We also discuss the advantages of our approach over RASTA and modulation filtered spectrograms.
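
A minimal sketch of the kind of reliability masking described above: band-pass filter each channel's energy envelope at speech-typical modulation rates and keep the cells whose filtered energy is strong. The 1 to 16 Hz band, the threshold and the toy envelopes are assumptions for illustration, not the paper's settings.

```python
# Sketch of building a 'reliable region' mask by modulation filtering, in the
# spirit of the reverberation masking described above. The 1-16 Hz band-pass
# and the -6 dB threshold are illustrative assumptions.
import numpy as np
from scipy.signal import butter, filtfilt

def reliability_mask(envelopes, frame_rate=100.0, threshold_db=-6.0):
    """envelopes: (channels, frames) array of sub-band energy envelopes.
    Keep cells whose speech-rate modulation energy is strong."""
    b, a = butter(2, [1.0 / (frame_rate / 2), 16.0 / (frame_rate / 2)], btype='band')
    filtered = filtfilt(b, a, envelopes, axis=1)
    ref = np.max(np.abs(filtered), axis=1, keepdims=True)
    return 20 * np.log10(np.abs(filtered) / ref + 1e-12) > threshold_db

env = np.abs(np.random.default_rng(2).standard_normal((32, 500)))  # toy envelopes
mask = reliability_mask(env)
print(mask.shape, mask.mean())   # fraction of cells treated as reliable
```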


International Conference on Spoken Language Processing | 2000

Decoding speech in the presence of other sound sources

Jon Barker; Martin Cooke; Daniel P. W. Ellis

Conventional speech recognition is notoriously vulnerable to additive noise, and even the best compensation methods are defeated if the noise is nonstationary. To address this problem, we propose a new integration of bottom-up techniques to identify ‘coherent fragments’ of spectro-temporal energy (based on local features), with the top-down hypothesis search of conventional speech recognition, extended to search also across possible assignments of each fragment as speech or interference. Initial tests demonstrate the feasibility of this approach, and achieve a reduction in word error rate of more than 25% relative at 5 dB SNR over stationary noise missing data recognition.
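
The extended hypothesis search can be pictured, in its simplest exhaustive form, as scoring every assignment of fragments to speech or interference and keeping the best one. The sketch below does exactly that with a stand-in scoring function (score_with_missing_data is a hypothetical placeholder, not part of the authors' system); a real decoder folds this search into the recognition pass rather than enumerating assignments.

```python
# Sketch of the exhaustive version of the fragment-assignment search: every
# labelling of fragments as 'speech' or 'interference' induces a mask, and
# each mask is scored by a missing-data recogniser (here a toy stand-in).
from itertools import product

def best_fragment_assignment(fragments, score_with_missing_data):
    """Return the highest-scoring speech/interference labelling of the fragments."""
    best = None
    for labels in product(('speech', 'interference'), repeat=len(fragments)):
        speech_mask = [f for f, lab in zip(fragments, labels) if lab == 'speech']
        score = score_with_missing_data(speech_mask)
        if best is None or score > best[0]:
            best = (score, labels)
    return best

# Toy usage: three fragments and a stand-in scorer that prefers fragments 0 and 2.
frags = [0, 1, 2]
toy_score = lambda speech: sum(1.0 if f in (0, 2) else -1.0 for f in speech)
print(best_fragment_assignment(frags, toy_score))  # (2.0, ('speech', 'interference', 'speech'))
```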


International Conference on Acoustics, Speech, and Signal Processing | 2009

A speech fragment approach to localising multiple speakers in reverberant environments

Heidi Christensen; Ning Ma; Stuart N. Wrigley; Jon Barker

Sound source localisation cues are severely degraded when multiple acoustic sources are active in the presence of reverberation. We present a binaural system for localising simultaneous speakers which exploits the fact that in a speech mixture there exist spectro-temporal regions or ‘fragments’, where the energy is dominated by just one of the speakers. A fragment-level localisation model is proposed that integrates the localisation cues within a fragment using a weighted mean. The weights are based on local estimates of the degree of reverberation in a given spectro-temporal cell. The paper investigates different weight estimation approaches based variously on: (i) an established model of the perceptual precedence effect; (ii) a measure of interaural coherence between the left and right ear signals; (iii) a data-driven approach trained in matched acoustic conditions. Experiments with reverberant binaural data with two simultaneous speakers show that appropriate weighting can improve frame-based localisation performance by up to 24%.
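
The fragment-level model can be illustrated with a small sketch: per-cell azimuth estimates within a fragment are combined by a weighted mean, with weights given here by interaural coherence (option (ii) above). The numbers are toy values chosen to show how a low-coherence, reverberation-corrupted cell gets down-weighted.

```python
# Sketch of the fragment-level weighted-mean localisation idea: per-cell
# azimuth estimates inside one fragment are averaged with weights that
# favour cells with high interaural coherence. Toy values for illustration.
import numpy as np

def fragment_azimuth(cell_azimuths, cell_coherence):
    """Weighted mean of per-cell azimuth estimates within one fragment."""
    return np.average(cell_azimuths, weights=np.asarray(cell_coherence))

azimuths  = np.array([18.0, 22.0, 60.0, 20.0, 19.0])   # degrees; one reverberant outlier
coherence = np.array([0.9, 0.85, 0.2, 0.95, 0.9])      # low coherence on the outlier
print(fragment_azimuth(azimuths, coherence))  # ~21.8 degrees, vs 27.8 for an unweighted mean
```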


Speech Communication | 2007

An automatic speech recognition system based on the scene analysis account of auditory perception

André Coy; Jon Barker

Despite many years of concentrated research, the performance gap between automatic speech recognition (ASR) and human speech recognition (HSR) remains large. The difference between ASR and HSR is particularly evident when considering the response to additive noise. Whereas human performance is remarkably robust, ASR systems are brittle and only operate well within the narrow range of noise conditions for which they were designed. This paper considers how humans may achieve noise robustness. We take the view that robustness is achieved because the human perceptual system treats the problems of speech recognition and sound source separation as being tightly coupled. Taking inspiration from Bregman's Auditory Scene Analysis account of auditory organisation, we present a speech recognition system which couples these processes by using a combination of primitive and schema-driven processes: first, a set of coherent spectro-temporal fragments is generated by primitive segmentation techniques; then, a decoder based on statistical ASR techniques performs a simultaneous search for the correct background/foreground segmentation and word sequence hypothesis. Mutually supporting solutions to both the source segmentation and speech recognition problems arise as a result. The decoder is tested on a challenging corpus of connected digit strings mixed monaurally at 0 dB and recognition performance is compared with that achieved by listeners using identical data. The results, although preliminary, are encouraging and suggest that techniques which interface ASA and statistical ASR have great potential. The paper concludes with a discussion of future research directions that may further develop this class of perceptually motivated ASR solutions.
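
As a concrete, if crude, picture of the primitive segmentation stage, the sketch below groups contiguous high-energy spectro-temporal cells into fragments by connected-component labelling. Real primitive grouping uses richer cues such as pitch and common onset; the threshold and toy spectrogram are illustrative assumptions, not the paper's method.

```python
# Sketch of a very simple 'primitive' segmentation stage: contiguous
# high-energy spectro-temporal regions are grouped into fragments via
# connected-component labelling. Only meant to illustrate what a fragment is.
import numpy as np
from scipy.ndimage import label

def primitive_fragments(spectrogram, threshold_db=-20.0):
    """Return an integer map where each connected high-energy region gets a fragment id."""
    level_db = 10 * np.log10(spectrogram / spectrogram.max() + 1e-12)
    fragment_map, n_fragments = label(level_db > threshold_db)
    return fragment_map, n_fragments

spec = np.abs(np.random.default_rng(4).standard_normal((32, 200))) ** 2  # toy spectrogram
frag_map, n = primitive_fragments(spec)
print(n, frag_map.shape)   # number of fragments and the shape of the fragment map
```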

Collaboration


Dive into Jon Barker's collaborations.

Top Co-Authors

Ning Ma (University of Sheffield)

Martin Cooke (University of the Basque Country)

Guy J. Brown (University of Sheffield)

Shinji Watanabe (Mitsubishi Electric Research Laboratories)

Thomas Hain (University of Sheffield)

André Coy (University of Sheffield)