Publication


Featured research published by Tom Barker.


IEEE Transactions on Audio, Speech, and Language Processing | 2016

Blind Separation of Audio Mixtures Through Nonnegative Tensor Factorization of Modulation Spectrograms

Tom Barker; Tuomas Virtanen

This paper presents an algorithm for unsupervised single-channel source separation of audio mixtures. The approach specifically addresses the challenging case of separation where no training data are available. By representing mixtures in the modulation spectrogram (MS) domain, we exploit underlying similarities in patterns present across frequency. A three-dimensional tensor factorization is able to take advantage of these redundant patterns, and is used to separate a mixture into an approximated sum of components by minimizing a divergence cost. Furthermore, we show that the basic tensor factorization can be extended with convolution in time to improve separation results, and we provide update rules for learning components in this manner. Following factorization, sources are reconstructed in the audio domain from estimated components using a novel approach based on reconstruction masks that are learned using MS activations, and then applied to a mixture spectrogram. We demonstrate that the proposed method produces superior separation performance to a spectrally based nonnegative matrix factorization approach, in terms of source-to-distortion ratio. We also compare separation using the perceptually motivated interference-related perceptual score metric and identify cases where performance is higher.
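
As a rough illustration of the kind of decomposition described above, the sketch below factorises a three-way non-negative tensor (frequency x time x modulation frequency) into rank-one components with multiplicative updates minimising a generalised Kullback-Leibler divergence. The shapes, component count, and random data are placeholders; the convolutive extension and mask-based reconstruction from the paper are not shown.

```python
# Minimal sketch: non-negative CP/PARAFAC factorisation of a 3-way
# modulation-spectrogram tensor V (freq x time x modulation-freq) with
# multiplicative updates minimising generalised KL divergence.
# Shapes and variable names are illustrative, not taken from the paper.
import numpy as np

def ntf_kl(V, n_components=20, n_iter=200, eps=1e-12):
    F, T, M = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((F, n_components)) + eps   # spectral bases
    H = rng.random((T, n_components)) + eps   # temporal activations
    Q = rng.random((M, n_components)) + eps   # modulation-frequency profiles
    for _ in range(n_iter):
        R = V / (np.einsum('fk,tk,mk->ftm', W, H, Q) + eps)
        W *= np.einsum('ftm,tk,mk->fk', R, H, Q) / (H.sum(0) * Q.sum(0) + eps)
        R = V / (np.einsum('fk,tk,mk->ftm', W, H, Q) + eps)
        H *= np.einsum('ftm,fk,mk->tk', R, W, Q) / (W.sum(0) * Q.sum(0) + eps)
        R = V / (np.einsum('fk,tk,mk->ftm', W, H, Q) + eps)
        Q *= np.einsum('ftm,fk,tk->mk', R, W, H) / (W.sum(0) * H.sum(0) + eps)
    return W, H, Q

# Example: factorise a random non-negative tensor standing in for a
# modulation spectrogram of a two-source mixture.
V = np.abs(np.random.default_rng(1).standard_normal((64, 100, 16)))
W, H, Q = ntf_kl(V, n_components=8, n_iter=50)
```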


Spoken Language Technology Workshop | 2014

Exemplar-based noise robust automatic speech recognition using modulation spectrogram features

Deepak Baby; Tuomas Virtanen; Jort F. Gemmeke; Tom Barker; Hugo Van hamme

We propose a novel exemplar-based feature enhancement method for automatic speech recognition which uses coupled dictionaries: an input dictionary containing atoms sampled in the modulation (envelope) spectrogram domain and an output dictionary with atoms in the Mel or full-resolution frequency domain. The input modulation representation is chosen for its ability to separate speech from noise and for its relation to human auditory processing. The output representation is one which can be processed by the ASR back-end. The proposed method was evaluated on the AURORA-2 and AURORA-4 databases, and improved word error rates (WER) were obtained compared to a system which uses Mel features in the input exemplars. The paper also proposes a hybrid system combining the baseline and the proposed algorithm, which yielded a further improvement over both individual algorithms on the AURORA-2 database.
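
A minimal sketch of the coupled-dictionary idea, assuming paired exemplar dictionaries and placeholder feature dimensions: activations are estimated against the input (modulation-domain) dictionary with the dictionary held fixed, then re-used with the coupled output (Mel-domain) dictionary, keeping only the speech exemplars for the enhanced features.

```python
# Hedged sketch of coupled-dictionary feature enhancement: activations are
# estimated in one feature domain and re-used with a coupled output
# dictionary in the domain the ASR back-end expects.
# Dictionary contents and dimensions here are placeholders.
import numpy as np

def estimate_activations(X, D_in, n_iter=100, eps=1e-12):
    """Non-negative activations A such that D_in @ A ~ X (KL divergence),
    with the dictionary kept fixed (standard multiplicative updates)."""
    A = np.full((D_in.shape[1], X.shape[1]), 0.1)
    for _ in range(n_iter):
        Xhat = D_in @ A + eps
        A *= (D_in.T @ (X / Xhat)) / (D_in.sum(0)[:, None] + eps)
    return A

# Coupled dictionaries: columns are paired exemplars of the same frames,
# D_in in the modulation (envelope) domain, D_out in the Mel domain.
# The first n_speech columns are speech exemplars, the rest noise exemplars.
rng = np.random.default_rng(0)
n_speech, n_noise = 50, 30
D_in  = np.abs(rng.standard_normal((40, n_speech + n_noise)))
D_out = np.abs(rng.standard_normal((26, n_speech + n_noise)))
X_noisy = np.abs(rng.standard_normal((40, 200)))          # noisy input features

A = estimate_activations(X_noisy, D_in)
speech_mel = D_out[:, :n_speech] @ A[:n_speech]           # enhanced Mel features
```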


International Conference on Acoustics, Speech, and Signal Processing | 2015

Low-latency sound-source-separation using non-negative matrix factorisation with coupled analysis and synthesis dictionaries

Tom Barker; Tuomas Virtanen; Niels Henrik Pontoppidan

For real-time or close to real-time applications, sound source separation can be performed on-line, where new frames of incoming data for a mixture signal are processed as they arrive, at very low delay. We propose an approach which generates the separation filters for short synthesis frames to achieve low-latency source separation, based on a compositional model of the audio mixture to be separated. Filter parameters are derived from a longer temporal context than the current processing frame through the use of a longer analysis frame. A pair of dictionaries is used, one for analysis and one for reconstruction. With this approach we are able to increase separation performance whilst retaining the low latency provided by the use of short synthesis frames. The proposed data handling scheme and parameters can be adjusted to achieve real-time performance, given sufficient computational power. Low-latency output allows a human listener to use the results of such a separation scheme directly, without a perceptible delay. With the proposed method, separated source-to-distortion ratios (SDRs) can be improved by over 1 dB for latencies below 20 ms, without any effect on latency.
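
The sketch below illustrates the coupled analysis/synthesis dictionary idea under simplifying assumptions (random placeholder dictionaries, a single analysis frame, magnitude spectra): activations are solved against long-frame analysis atoms and then applied to short-frame synthesis atoms to build a Wiener-like filter for the current low-latency output frame.

```python
# Rough sketch of the coupled analysis/synthesis idea. Dictionaries here are
# random placeholders; in practice they would be learned from training
# material for each source.
import numpy as np

rng = np.random.default_rng(0)
K1, K2 = 30, 30                       # components for source 1 and source 2
F_long, F_short = 1024, 256           # analysis / synthesis spectrum sizes

D_ana = np.abs(rng.standard_normal((F_long, K1 + K2)))   # analysis dictionary
D_syn = np.abs(rng.standard_normal((F_short, K1 + K2)))  # coupled synthesis dictionary

def activations(x, D, n_iter=80, eps=1e-12):
    """Non-negative activations for one frame, dictionary held fixed."""
    a = np.full(D.shape[1], 0.1)
    for _ in range(n_iter):
        a *= (D.T @ (x / (D @ a + eps))) / (D.sum(0) + eps)
    return a

x_long  = np.abs(rng.standard_normal(F_long))    # magnitude spectrum, long analysis frame
x_short = np.abs(rng.standard_normal(F_short))   # magnitude spectrum, current short frame

a = activations(x_long, D_ana)                   # solved on the long analysis frame
s1 = D_syn[:, :K1] @ a[:K1]                      # short-frame estimate, source 1
s2 = D_syn[:, K1:] @ a[K1:]                      # short-frame estimate, source 2
mask1 = s1 / (s1 + s2 + 1e-12)                   # Wiener-like separation filter
y1 = mask1 * x_short                             # separated magnitude for source 1
```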


International Symposium on Neural Networks | 2014

Semi-supervised non-negative tensor factorisation of modulation spectrograms for monaural speech separation

Tom Barker; Tuomas Virtanen

This paper details the use of a semi-supervised approach to audio source separation. Where only a single source model is available, the model for an unknown source must be estimated. A mixture signal is separated through factorisation of a feature-tensor representation based on the modulation spectrogram. Harmonically related components tend to modulate in a similar fashion, and this redundancy of patterns can be isolated. This feature representation requires fewer parameters than spectrally based methods and so minimises overfitting. Following the tensor factorisation, the separated signals are reconstructed by learning appropriate Wiener-filter spectral parameters which are constrained by the activation parameters learned in the first stage. Strong results were obtained for two-speaker mixtures, where source separation performance exceeded that of the benchmark methods. Specifically, the proposed semi-supervised method outperformed both semi-supervised non-negative matrix factorisation and blind non-negative modulation spectrum tensor factorisation.
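
For reference, a common Wiener-style reconstruction of this kind, written here for a two-dimensional spectrogram rather than the full modulation-spectrogram tensor, assigns to each source s the fraction of the modelled mixture energy explained by its components:

\hat{S}_s(f,t) = \frac{\sum_{k \in \mathcal{K}_s} W_{fk} H_{tk}}{\sum_{k} W_{fk} H_{tk}} \, X(f,t)

where X(f,t) is the mixture spectrogram, \mathcal{K}_s indexes the components assigned to source s, the known source's bases are fixed from its training model, and the remaining bases are estimated from the mixture itself. The paper's actual reconstruction learns Wiener-filter spectral parameters constrained by the tensor-factorisation activations, so this formula is only an approximation of that step.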


International Conference on Acoustics, Speech, and Signal Processing | 2014

Ultrasound-coupled semi-supervised nonnegative matrix factorisation for speech enhancement

Tom Barker; Tuomas Virtanen; Olivier Delhomme

We present an extension to an existing speech enhancement technique, whereby the incorporation of easily obtained Doppler-based ultrasound data, obtained from frequency shifts caused by a talker's mouth movements, is shown to improve speech enhancement results. Noisy speech mixtures were enhanced using semi-supervised nonnegative matrix factorisation (NMF). Ultrasound data recorded alongside the speech are transformed into the spectral domain and used in addition to the audio of the mixture to be separated. Speech components are learned from a training set, whilst noise components are estimated from the mixture signal. We show that the ultrasound data can improve source-to-distortion ratios for the enhanced speech, relative to both the non-ultrasound NMF case and an established Wiener filter-based speech enhancement method.
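
A hedged sketch of the coupling step, with random placeholder data and illustrative dimensions: audio and ultrasound spectral features are stacked into one observation matrix so that the NMF activations must explain both modalities jointly; speech bases are fixed from training, while noise bases are estimated from the mixture and assumed to carry little ultrasound energy.

```python
# Illustrative sketch of ultrasound-coupled semi-supervised NMF.
# All data here are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
F_audio, F_us, T = 257, 32, 100
K_speech, K_noise = 40, 10
eps = 1e-12

V = np.vstack([np.abs(rng.standard_normal((F_audio, T))),   # noisy speech spectrogram
               np.abs(rng.standard_normal((F_us, T)))])     # ultrasound Doppler features

W_speech = np.abs(rng.standard_normal((F_audio + F_us, K_speech)))  # fixed, from training
W_noise = np.abs(rng.standard_normal((F_audio + F_us, K_noise)))
W_noise[F_audio:, :] *= 1e-3          # assumption: noise carries little ultrasound energy
H = np.full((K_speech + K_noise, T), 0.1)

for _ in range(100):                   # semi-supervised KL-NMF updates
    W = np.hstack([W_speech, W_noise])
    R = V / (W @ H + eps)
    H *= (W.T @ R) / (W.sum(0)[:, None] + eps)
    R = V / (W @ H + eps)              # update only the noise bases
    W_noise *= (R @ H[K_speech:].T) / (H[K_speech:].sum(1)[None, :] + eps)

speech_est = (W_speech @ H[:K_speech])[:F_audio]   # enhanced speech (audio rows only)
```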


IEEE Global Conference on Signal and Information Processing | 2016

Low-latency sound source separation using deep neural networks

Gaurav Naithani; Giambattista Parascandolo; Tom Barker; Niels Henrik Pontoppidan; Tuomas Virtanen

Sound source separation at low latency requires that each incoming frame of audio data be processed at very low delay and output as soon as possible. For practical purposes involving human listeners, a 20 ms algorithmic delay is the upper limit which is comfortable to the listener. In this paper, we propose a low-latency (algorithmic delay < 20 ms) deep neural network (DNN) based source separation method. The proposed method takes advantage of an extended past context, outputting soft time-frequency masking filters which are then applied to incoming audio frames, giving better separation performance compared to an NMF baseline. Acoustic mixtures from five pairs of speakers from the CMU Arctic database [1] were used for the experiments. At least 1 dB average improvement in source-to-distortion ratio (SDR) was observed for our DNN-based system over a low-latency NMF baseline for different processing and analysis frame lengths. Incorporating previous temporal context into the DNN inputs yielded significant improvements in SDR for short processing frame lengths.
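
A minimal sketch of the masking approach, assuming a feedforward network, illustrative layer sizes, and a context of five past frames (the paper's actual architecture and context length may differ): the network maps the current frame plus past context to a soft mask for the current frame only, so the algorithmic delay stays at one frame.

```python
# Hedged sketch of low-latency DNN mask estimation; sizes are assumptions.
import torch
import torch.nn as nn

n_freq, n_context = 257, 5                # current frame + 5 past frames (assumed sizes)

mask_net = nn.Sequential(
    nn.Linear(n_freq * (n_context + 1), 1024),
    nn.ReLU(),
    nn.Linear(1024, 1024),
    nn.ReLU(),
    nn.Linear(1024, n_freq),
    nn.Sigmoid(),                          # soft mask in [0, 1]
)

# Inference for one incoming frame: stack the past context, predict the mask,
# and apply it to the current mixture frame's magnitude spectrum.
context = torch.rand(1, n_freq * (n_context + 1))   # placeholder features
current_frame_mag = torch.rand(1, n_freq)
mask = mask_net(context)
separated_mag = mask * current_frame_mag
```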


Archive | 2018

Separation of Known Sources Using Non-negative Spectrogram Factorisation

Tuomas Virtanen; Tom Barker

This chapter presents non-negative spectrogram factorisation (NMF) techniques which can be used to separate sources in cases where source-specific training material is available in advance. We first present the basic NMF formulation for sound mixtures and then present criteria and algorithms for estimating the model parameters. We introduce selected methods for training the NMF source models using either vector quantisation, convexity constraints, archetypal analysis, or discriminative methods. We also explain how the learned dictionaries can be adapted to deal with mismatches between the training data and the usage scenario. We also present how semi-supervised learning can be used to deal with unknown noise sources within a mixture, and finally we introduce a coupled NMF method which can be used to model a large temporal context while retaining low algorithmic latency.
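
For reference, the basic NMF model that such methods build on, stated here in its standard form with the generalised Kullback-Leibler divergence and multiplicative updates rather than quoted from the chapter:

\mathbf{X} \approx \mathbf{W}\mathbf{H}, \qquad
D_{\mathrm{KL}}(\mathbf{X} \,\|\, \mathbf{W}\mathbf{H}) = \sum_{f,t} \left( X_{ft} \log \frac{X_{ft}}{[\mathbf{W}\mathbf{H}]_{ft}} - X_{ft} + [\mathbf{W}\mathbf{H}]_{ft} \right)

\mathbf{H} \leftarrow \mathbf{H} \otimes \frac{\mathbf{W}^{\top} (\mathbf{X} \oslash \mathbf{W}\mathbf{H})}{\mathbf{W}^{\top} \mathbf{1}}, \qquad
\mathbf{W} \leftarrow \mathbf{W} \otimes \frac{(\mathbf{X} \oslash \mathbf{W}\mathbf{H}) \mathbf{H}^{\top}}{\mathbf{1} \mathbf{H}^{\top}}

where \otimes and \oslash denote element-wise multiplication and division and \mathbf{1} is an all-ones matrix. In the supervised case, \mathbf{W} is the concatenation of dictionaries learned in advance for each source and only \mathbf{H} is updated.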


Journal of the Acoustical Society of America | 2018

Improving competing voices segregation for hearing impaired listeners using a low-latency deep neural network algorithm

Lars Bramsløw; Gaurav Naithani; Atefeh Hafez; Tom Barker; Niels Henrik Pontoppidan; Tuomas Virtanen

Hearing aid users are challenged in listening situations with noise and especially speech-on-speech situations with two or more competing voices. Specifically, the task of attending to and segregating two competing voices is particularly hard, unlike for normal-hearing listeners, as shown in a small sub-experiment. In the main experiment, the competing voices benefit of a deep neural network (DNN) based stream segregation enhancement algorithm was tested on hearing-impaired listeners. A mixture of two voices was separated using a DNN and presented to the two ears as individual streams and tested for word score. Compared to the unseparated mixture, there was a 13%-point benefit from the separation, while attending to both voices. If only one output was selected as in a traditional target-masker scenario, a larger benefit of 37%-points was found. The results agreed well with objective metrics and show that for hearing-impaired listeners, DNNs have a large potential for improving stream segregation and speech intelligibility in difficult scenarios with two equally important targets without any prior selection of a primary target stream. An even higher benefit can be obtained if the user can select the preferred target via remote control.


Workshop on Applications of Signal Processing to Audio and Acoustics | 2017

Low latency sound source separation using convolutional recurrent neural networks

Gaurav Naithani; Tom Barker; Giambattista Parascandolo; Lars Bramsløw; Niels Henrik Pontoppidan; Tuomas Virtanen

Deep neural networks (DNNs) have been successfully employed for the problem of monaural sound source separation, achieving state-of-the-art results. In this paper, we propose using a convolutional recurrent neural network (CRNN) architecture for tackling this problem. We focus on a scenario where low algorithmic delay (< 10 ms) is paramount and relatively little training data is available. We show that the proposed architecture can achieve slightly better performance than feedforward DNNs and long short-term memory (LSTM) networks. In addition to reporting separation performance metrics (i.e., source-to-distortion ratios), we also report extended short-term objective intelligibility (ESTOI) scores, which better predict intelligibility performance in the presence of non-stationary interferers.
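
An illustrative CRNN mask estimator under assumed layer sizes and kernel widths (not the exact configuration reported in the paper): convolution over the frequency axis of each frame, a unidirectional recurrent layer over time to keep the network causal, and a sigmoid output producing a soft mask per frame.

```python
# Hedged sketch of a CRNN soft-mask estimator; all sizes are assumptions.
import torch
import torch.nn as nn

class CRNNMask(nn.Module):
    def __init__(self, n_freq=257, conv_ch=32, rnn_units=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, conv_ch, kernel_size=5, padding=2),   # conv over frequency
            nn.ReLU(),
            nn.Conv1d(conv_ch, conv_ch, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        self.rnn = nn.GRU(conv_ch * n_freq, rnn_units, batch_first=True)
        self.out = nn.Sequential(nn.Linear(rnn_units, n_freq), nn.Sigmoid())

    def forward(self, x):                 # x: (batch, time, freq)
        b, t, f = x.shape
        z = self.conv(x.reshape(b * t, 1, f)).reshape(b, t, -1)
        h, _ = self.rnn(z)                # unidirectional GRU keeps the model causal
        return self.out(h)                # soft mask, (batch, time, freq)

# One forward pass on placeholder spectrogram frames.
mix = torch.rand(2, 50, 257)
mask = CRNNMask()(mix)
separated = mask * mix
```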


Journal of the Acoustical Society of America | 2013

Reflection orders and auditory distance

Catarina Mendonça; João Lamas; Tom Barker; Guilherme Campos; Paulo Dias; Ville Pulkki; Carlos A. Silva; Jorge A. Santos

The perception of sound distance has been sparsely studied so far. It is assumed to depend on familiar loudness, reverberation, sound spectrum, and parallax, but most of these factors have never been carefully addressed. Reverberation has been mostly analyzed in terms of ratio between direct and indirect sound, and total duration. Here we were interested in assessing the impact of each reflection order on distance localization. We compared sound source discrimination at an intermediate and at a distant location with direct sound only, one, two, three, and four reflection orders in a 2AFC task. At the intermediate distances, normalized psychophysical curves reveal no differentiation between direct sound and up to three reflection orders, but sounds with four reflection orders have significantly lower thresholds. For the distant sources, sounds with four reflection orders yielded the best discrimination slopes, but there was also a clear benefit for sounds with three reflection orders. We discuss the result...

Collaboration


Dive into Tom Barker's collaborations.

Top Co-Authors

Tuomas Virtanen
Tampere University of Technology

Gaurav Naithani
Tampere University of Technology

Giambattista Parascandolo
Tampere University of Technology

Hugo Van hamme
Katholieke Universiteit Leuven