Publications


Featured research published by Anton Ragni.


Conference of the International Speech Communication Association | 2015

Joint decoding of tandem and hybrid systems for improved keyword spotting on low resource languages

Haipeng Wang; Anton Ragni; Mark J. F. Gales; Kate Knill; Philip C. Woodland; Chao Zhang



IEEE Signal Processing Letters | 2010

Structured Log Linear Models for Noise Robust Speech Recognition

Shi-Xiong Zhang; Anton Ragni; Mark J. F. Gales

The use of discriminative models for structured classification tasks, such as speech recognition, is becoming increasingly popular. This letter examines the use of structured log-linear models for noise robust speech recognition. An important aspect of log-linear models is the form of the features. By using generative models to derive the features, state-of-the-art model-based compensation schemes can be used to make the system robust to noise. Previous work in this area is extended in two important directions. First, large margin training of sentence-level log-linear models is proposed for automatic speech recognition (ASR). This form of model is shown to be similar to the recently proposed structured support vector machines (SVMs). Second, based on the designed joint features, efficient lattice-based training and decoding are performed. The resulting model combines generative kernels, discriminative models, efficient lattice-based large margin training and model-based noise compensation. It is evaluated on a noise-corrupted continuous digit task: AURORA 2.0.
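
To make the combination concrete, the following is a minimal sketch (not the paper's implementation) of a log-linear classifier whose features are log-likelihoods from class-conditional Gaussian generative models, trained with a simple large-margin update; all data and model settings are toy assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy generative models: one unit-variance diagonal Gaussian per class.
    means = np.array([[0.0, 0.0], [2.0, 2.0]])

    def log_lik(x, k):
        # Log-likelihood of x under class k's Gaussian (up to a constant).
        d = x - means[k]
        return -0.5 * np.sum(d * d)

    def features(x):
        # Generative score-space features: the per-class log-likelihoods.
        return np.array([log_lik(x, 0), log_lik(x, 1)])

    # Toy data drawn from the two class Gaussians.
    X = np.vstack([rng.normal(means[0], 1.0, (50, 2)),
                   rng.normal(means[1], 1.0, (50, 2))])
    y = np.array([0] * 50 + [1] * 50)

    # Log-linear model: one weight vector per class over the feature space.
    W = np.zeros((2, 2))

    # Large-margin training: the correct class score must beat the
    # competitor by a margin, otherwise take a perceptron-style step.
    lr, margin = 0.1, 1.0
    for _ in range(20):
        for x, t in zip(X, y):
            phi = features(x)
            scores = W @ phi
            c = 1 - t  # the competing class
            if scores[t] - scores[c] < margin:
                W[t] += lr * phi
                W[c] -= lr * phi

    pred = np.argmax(W @ np.array([features(x) for x in X]).T, axis=0)
    print("training accuracy:", np.mean(pred == y))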


International Conference on Acoustics, Speech, and Signal Processing | 2014

Investigation of unsupervised adaptation of DNN acoustic models with filter bank input

Takuya Yoshioka; Anton Ragni; Mark J. F. Gales

Adaptation to speaker variations is an essential component of speech recognition systems. One common approach to adapting deep neural network (DNN) acoustic models is to perform global constrained maximum likelihood linear regression (CMLLR) at some point in the system. Using CMLLR (or, more generally, generative approaches) is advantageous especially in unsupervised adaptation scenarios with high baseline error rates. On the other hand, as DNNs are less sensitive than GMMs to increases in input dimensionality, it is becoming more popular to use rich speech representations, such as log mel-filter bank channel outputs, instead of conventional low-dimensional feature vectors, such as MFCCs and PLP coefficients. This work discusses and compares three different configurations of DNN acoustic models that allow CMLLR-based speaker adaptive training (SAT) to be performed in systems with filter bank inputs. Results of unsupervised adaptation experiments conducted on three different data sets are presented, demonstrating that, by choosing an appropriate configuration, SAT with CMLLR can improve the performance of a well-trained filter bank-based speaker independent DNN system by 10.6% relative in a challenging task with a baseline error rate above 40%. It is also shown that the filter bank features remain advantageous over the conventional features even when they are used with SAT models. Some other insights are also presented, including the effects of block diagonal transforms and system combination.
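
As a rough illustration of the CMLLR idea, here is a heavily simplified sketch: a per-speaker affine feature transform estimated so the adapted data better fits a speaker-independent model. Real CMLLR optimises a full transform row by row against GMM-HMM statistics; this toy version assumes a single diagonal Gaussian model, for which the maximum likelihood transform reduces to a per-dimension scale and offset.

    import numpy as np

    rng = np.random.default_rng(1)

    # Speaker-independent "model": zero mean, unit variance per dimension.
    model_mean = np.zeros(3)
    model_var = np.ones(3)

    # Unlabelled speaker data with a mismatched mean and scale
    # (a stand-in for the speaker effect).
    spk = rng.normal(loc=[1.5, -0.7, 0.3], scale=[2.0, 0.5, 1.2], size=(200, 3))

    # Diagonal "CMLLR": x' = a * x + b, estimated in closed form so the
    # transformed data matches the model's per-dimension mean and variance.
    a = np.sqrt(model_var / spk.var(axis=0))  # diagonal of the transform
    b = model_mean - a * spk.mean(axis=0)     # bias

    adapted = a * spk + b
    print("mean after adaptation:", adapted.mean(axis=0).round(3))
    print("var  after adaptation:", adapted.var(axis=0).round(3))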


International Conference on Acoustics, Speech, and Signal Processing | 2015

Unicode-based graphemic systems for limited resource languages

Mark J. F. Gales; Kate Knill; Anton Ragni

Large vocabulary continuous speech recognition systems require a mapping from words, or tokens, into sub-word units to enable robust estimation of acoustic model parameters, and to model words not seen in the training data. The standard approach to achieve this is to manually generate a lexicon where words are mapped into phones, often with attributes associated with each of these phones. Context-dependent acoustic models are then constructed using decision trees where questions are asked based on the phones and phone attributes. For low-resource languages, it may not be practical to manually generate a lexicon. An alternative approach is to use a graphemic lexicon, where the “pronunciation” for a word is defined by the letters forming that word. This paper proposes a simple approach for building graphemic systems for any language written in Unicode. The attributes for graphemes are automatically derived using features from the Unicode character descriptions. These attributes are then used in decision tree construction. This approach is examined on the IARPA Babel Option Period 2 languages and a Levantine Arabic CTS task. The described approach achieves comparable, and complementary, performance to phonetic lexicon-based approaches.
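
The attribute-derivation step can be illustrated directly with Python's standard unicodedata module; the sketch below shows plausible attributes drawn from Unicode character descriptions, though the exact attribute set used in the paper is not reproduced here.

    import unicodedata

    def grapheme_attributes(ch):
        # Derive simple attributes for one character from Unicode metadata.
        name = unicodedata.name(ch, "UNKNOWN")
        return {
            "char": ch,
            "category": unicodedata.category(ch),  # e.g. 'Lu', 'Ll', 'Mn'
            "script_hint": name.split()[0],        # e.g. 'LATIN', 'ARABIC'
            "with_mark": "WITH" in name,           # diacritic named in description
            "combining": unicodedata.combining(ch) != 0,
        }

    def graphemic_pronunciation(word):
        # A word's "pronunciation" is its letters plus their attributes.
        return [grapheme_attributes(ch) for ch in unicodedata.normalize("NFC", word)]

    for entry in graphemic_pronunciation("Şehir"):
        print(entry)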


IEEE Automatic Speech Recognition and Understanding Workshop | 2015

Multilingual representations for low resource speech recognition and keyword search

Jia Cui; Brian Kingsbury; Bhuvana Ramabhadran; Abhinav Sethy; Kartik Audhkhasi; Xiaodong Cui; Ellen Kislal; Lidia Mangu; Markus Nussbaum-Thom; Michael Picheny; Zoltán Tüske; Pavel Golik; Ralf Schlüter; Hermann Ney; Mark J. F. Gales; Kate Knill; Anton Ragni; Haipeng Wang; Phil Woodland

This paper examines the impact of multilingual (ML) acoustic representations on Automatic Speech Recognition (ASR) and keyword search (KWS) for low resource languages in the context of the OpenKWS15 evaluation of the IARPA Babel program. The task is to develop Swahili ASR and KWS systems within two weeks using as little as 3 hours of transcribed data. Multilingual acoustic representations proved to be crucial for building these systems under strict time constraints. The paper discusses several key insights on how these representations are derived and used. First, we present a data sampling strategy that can speed up the training of multilingual representations without appreciable loss in ASR performance. Second, we show that fusion of diverse multilingual representations developed at different LORELEI sites yields substantial ASR and KWS gains. Speaker adaptation and data augmentation of these representations improves both ASR and KWS performance (up to 8.7% relative). Third, incorporating untranscribed data through semi-supervised learning improves WER and KWS performance. Finally, we show that these multilingual representations significantly improve ASR and KWS performance (9% relative for WER and 5% for MTWV) even when forty hours of transcribed audio in the target language is available. Multilingual representations significantly contributed to the LORELEI KWS systems winning the OpenKWS15 evaluation.
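
The data sampling idea might look something like the following hypothetical sketch, which draws a fixed per-language budget of utterances each epoch so that training a shared representation does not sweep every corpus in full; the corpus names and budget are invented for illustration.

    import random

    random.seed(0)

    # Toy corpora: language -> list of utterance ids (sizes deliberately skewed).
    corpora = {
        "cantonese": [f"yue_{i}" for i in range(5000)],
        "swahili": [f"swh_{i}" for i in range(300)],
        "tamil": [f"tam_{i}" for i in range(2000)],
    }

    def sample_epoch(corpora, budget_per_lang=200):
        # Sample up to budget_per_lang utterances per language, then shuffle.
        batch = []
        for lang, utts in corpora.items():
            k = min(budget_per_lang, len(utts))
            batch.extend(random.sample(utts, k))
        random.shuffle(batch)
        return batch

    epoch = sample_epoch(corpora)
    print(len(epoch), "utterances sampled; first few:", epoch[:5])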


IEEE Automatic Speech Recognition and Understanding Workshop | 2011

Derivative kernels for noise robust ASR

Anton Ragni; Mark J. F. Gales

Recently there has been interest in combining generative and discriminative classifiers. In these classifiers, features for the discriminative models are derived from generative kernels. One advantage of using generative kernels is that systematic approaches exist to introduce complex dependencies into the feature-space. Furthermore, as the features are based on generative models, standard model-based compensation and adaptation techniques can be applied to make discriminative models robust to noise and speaker conditions. This paper extends previous work in this framework in several directions. First, it introduces derivative kernels based on context-dependent generative models. Second, it describes how derivative kernels can be incorporated in structured discriminative models. Third, it addresses the issues associated with the large numbers of classes and parameters that arise when context-dependent models and high-dimensional feature-spaces of derivative kernels are used. The approach is evaluated on two noise-corrupted tasks: the small vocabulary AURORA 2 task and the medium-to-large vocabulary AURORA 4 task.
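
Derivative kernels map an observation into the derivatives of a generative model's log-likelihood with respect to its parameters. The sketch below computes such zeroth- and first-order score-space features for a single diagonal Gaussian, a toy stand-in for the context-dependent HMMs used in the paper.

    import numpy as np

    mean = np.array([0.5, -1.0])
    var = np.array([1.0, 2.0])

    def derivative_features(x):
        # Zeroth-order feature: the log-likelihood itself.
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)
        # First-order features: derivatives w.r.t. mean and variance.
        d_mean = (x - mean) / var
        d_var = 0.5 * ((x - mean) ** 2 / var**2 - 1.0 / var)
        return np.concatenate([[log_lik], d_mean, d_var])

    x = np.array([1.0, 0.0])
    print(derivative_features(x))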


IEEE Automatic Speech Recognition and Understanding Workshop | 2009

Support vector machines for noise robust ASR

Mark J. F. Gales; Anton Ragni; H. AlDamarki; C. Gautier

Using discriminative classifiers, such as Support Vector Machines (SVMs), in combination with, or as an alternative to, Hidden Markov Models (HMMs) has a number of advantages for difficult speech recognition tasks. For example, the models can make use of dependencies in the observation sequences beyond those captured by HMMs, provided an appropriate form of kernel is used. However, standard SVMs are binary classifiers, and speech is a multi-class problem. Furthermore, training SVMs to distinguish word pairs requires that each word appears in the training data. This paper examines both of these limitations. Tree-based reduction approaches for multiclass classification are described, as well as some of the issues in applying them to dynamic data, such as speech. To address the training data issues, a simplified version of HMM-based synthesis can be used, which allows data for any word pair to be generated. These approaches are evaluated on two noise corrupted digit sequence tasks: AURORA 2.0 and data collected in actual in-car conditions.
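
A tree-based reduction can be sketched as follows: each internal node splits the remaining classes in half and trains one binary SVM. This toy version uses fixed-length vectors rather than the sequence kernels and HMM-synthesised data discussed above; the class layout and split rule are assumptions made for illustration.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)

    # Four toy classes at the corners of a square.
    centers = np.array([[0, 0], [0, 4], [4, 0], [4, 4]])
    X = np.vstack([rng.normal(c, 0.5, (40, 2)) for c in centers])
    y = np.repeat(np.arange(4), 40)

    def train_tree(classes, X, y):
        # Recursively train binary SVMs over a split of the class set.
        if len(classes) == 1:
            return classes[0]
        mid = len(classes) // 2
        left, right = classes[:mid], classes[mid:]
        mask = np.isin(y, classes)
        svm = SVC(kernel="rbf").fit(X[mask], np.isin(y[mask], right).astype(int))
        return (svm, train_tree(left, X, y), train_tree(right, X, y))

    def predict_tree(node, x):
        # Walk the tree of binary SVMs down to a leaf class.
        while isinstance(node, tuple):
            svm, left, right = node
            node = right if svm.predict(x.reshape(1, -1))[0] == 1 else left
        return node

    tree = train_tree(list(range(4)), X, y)
    preds = np.array([predict_tree(tree, x) for x in X])
    print("training accuracy:", np.mean(preds == y))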


International Conference on Acoustics, Speech, and Signal Processing | 2016

System combination with log-linear models

J. Yang; Chao Zhang; Anton Ragni; Mark J. F. Gales; Philip C. Woodland

Improved speech recognition performance can often be obtained by combining multiple systems together. Joint decoding, where scores from multiple systems are combined during decoding rather than by combining hypotheses, is one efficient approach to system combination. In standard joint decoding the frame log-likelihoods from each system are used as the scores. These scores are then weighted and summed to yield the final score for a frame. The system combination weights for this process are usually set empirically. In this paper, a recently proposed scheme for learning these system weights is investigated for a standard noise-robust speech recognition task, AURORA 4. High performance tandem and hybrid systems for this task are described. By applying state-of-the-art training approaches and configurations for the bottleneck features of the tandem system, the difference in performance between the tandem and hybrid systems is significantly smaller than usually observed on this task. A log-linear model is then used to estimate system weights between these systems. Trained system weights yield additional gains over empirically set weights when used for decoding, and further gains can be obtained when the model is applied in a lattice rescoring fashion.
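
A minimal sketch of the weight-learning step, under toy assumptions: per-frame, per-class scores from two simulated systems are combined log-linearly, and the two weights are learned by gradient ascent on the log-posterior of the correct class. The paper operates on lattices during decoding; frame-level classification here is a simplification.

    import numpy as np

    rng = np.random.default_rng(0)
    n_frames, n_classes = 500, 5

    # Toy per-frame, per-class log-likelihood scores from two imperfect systems.
    truth = rng.integers(0, n_classes, n_frames)

    def fake_scores(noise):
        s = rng.normal(0, noise, (n_frames, n_classes))
        s[np.arange(n_frames), truth] += 1.0  # true class tends to score higher
        return s

    sys1, sys2 = fake_scores(1.0), fake_scores(2.0)

    w = np.array([0.5, 0.5])  # initial combination weights
    lr = 0.05
    for _ in range(100):
        combined = w[0] * sys1 + w[1] * sys2
        p = np.exp(combined - combined.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        # Gradient of the mean log-posterior of the correct class w.r.t. w.
        resid = -p
        resid[np.arange(n_frames), truth] += 1.0
        w += lr * np.array([(resid * sys1).sum(axis=1).mean(),
                            (resid * sys2).sum(axis=1).mean()])

    print("learned weights:", w.round(3))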


International Conference on Acoustics, Speech, and Signal Processing | 2015

A language space representation for speech recognition

Anton Ragni; Mark J. F. Gales; Kate Knill

The number of languages for which speech recognition systems have become available is growing each year. This paper proposes to view languages as points in a rich space, termed the language space, whose bases are eigen-languages; each language's point is determined by its projection onto these bases. Such an approach could not only reduce development costs for each new language but also provide automatic means for language analysis. As an initial proof of concept, this paper adopts cluster adaptive training (CAT), known for inducing similar spaces for speaker adaptation needs. The CAT approach used in this paper builds on previous work on language adaptation in speech synthesis and extends it to Gaussian mixture modelling, which is more appropriate for speech recognition. Experiments conducted on IARPA Babel program languages show that such language space representations can outperform language independent models and discover closely related languages in an automatic way.
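
The CAT construction can be sketched in a few lines: each Gaussian mean is a linear combination of eigen-language cluster means, and a language is represented by its small interpolation vector. The single-Gaussian setup and all numbers below are toy assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    dim, n_clusters = 4, 2
    M = rng.normal(size=(dim, n_clusters))  # columns = eigen-language means

    # Simulate data from an unseen language whose true point is lam_true.
    lam_true = np.array([0.7, -0.3])
    data = rng.normal(M @ lam_true, 0.1, size=(1000, dim))

    # ML estimate of the language's point; with isotropic covariance this
    # reduces to least squares: lam = (M^T M)^{-1} M^T mean(data).
    lam_hat = np.linalg.solve(M.T @ M, M.T @ data.mean(axis=0))
    print("true point:", lam_true, "estimated:", lam_hat.round(3))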


International Conference on Acoustics, Speech, and Signal Processing | 2012

Inference algorithms for generative score-spaces

Anton Ragni; Mark J. F. Gales

Using generative models, for example hidden Markov models (HMMs), to derive features for a discriminative classifier has a number of advantages, including the ability to make the features robust to speaker and noise changes. An interesting attribute of the derived features is that they may not have the same conditional independence assumptions as the underlying generative models, which are typically first-order Markovian. For efficiency, these features are derived given a particular segmentation. This paper describes a general algorithm for obtaining the optimal segmentation with combined generative and discriminative models. Previous results, where the features were constrained to have first-order Markovian dependencies, are extended to allow the use of derivative features that are non-Markovian in nature. As an example, inference with zeroth- and first-order HMM score-spaces is considered. Experimental results are presented on a noise-corrupted continuous digit string recognition task: AURORA 2.
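
The segmentation search itself is a dynamic program over segment boundaries. The sketch below illustrates that search with a trivial, invented segment score; it shows the inference structure, not the paper's generative score-spaces.

    import numpy as np

    obs = np.array([0.1, 0.2, 0.1, 3.0, 3.1, 2.9, 0.0, 0.1])

    def segment_score(seg):
        # Toy score: reward low within-segment variance, penalise short segments.
        return -np.var(seg) * len(seg) - 0.5

    n = len(obs)
    best = np.full(n + 1, -np.inf)
    best[0] = 0.0
    back = np.zeros(n + 1, dtype=int)
    for t in range(1, n + 1):
        for s in range(t):  # candidate segment covers obs[s:t]
            score = best[s] + segment_score(obs[s:t])
            if score > best[t]:
                best[t], back[t] = score, s

    # Trace back the optimal segmentation boundaries.
    bounds, t = [], n
    while t > 0:
        bounds.append((back[t], t))
        t = back[t]
    print("optimal segments:", bounds[::-1])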

Collaboration


Dive into Anton Ragni's collaboration.

Top Co-Authors

Kate Knill, University of Cambridge
Xie Chen, University of Cambridge
Chao Zhang, University of Cambridge
Haipeng Wang, University of Cambridge
J. Yang, University of Cambridge
Yu Wang, University of Cambridge
J Vasilakes, University of Cambridge