Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Kate Knill is active.

Publication


Featured research published by Kate Knill.


International Conference on Spoken Language Processing | 1996

Use of Gaussian selection in large vocabulary continuous speech recognition using HMMs

Kate Knill; Mark J. F. Gales; Steve J. Young

This paper investigates the use of Gaussian Selection (GS) to reduce the state likelihood computation in HMM-based systems. These likelihood calculations contribute significantly (30 to 70%) to the computational load. Previously, it has been reported that when GS is used on large systems the recognition accuracy tends to degrade above a ×3 reduction in likelihood computation. To explain this degradation, this paper investigates the trade-offs necessary between achieving good state likelihoods and low computation. In addition, the problem of unseen states in a cluster is examined. It is shown that further improvements are possible. For example, using a different assignment measure, with a constraint on the number of components per state per cluster, enabled the recognition accuracy on a 5k speaker-independent task to be maintained up to a ×5 reduction in likelihood computation.
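
To make the Gaussian selection idea concrete, here is a minimal sketch: cluster the Gaussian components offline, then at run time evaluate only the components in the cluster nearest the observation and back everything else off to a floor score. The clustering method, floor value, and single-nearest-cluster lookup are simplifying assumptions, not the paper's implementation.

```python
import numpy as np

def log_gauss(x, mean, var):
    # Diagonal-covariance Gaussian log-likelihood.
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def build_clusters(means, n_clusters, iters=10, seed=0):
    # Crude k-means over component means: each cluster keeps the
    # indices of the Gaussians assigned to it.
    rng = np.random.default_rng(seed)
    centroids = means[rng.choice(len(means), n_clusters, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((means[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for c in range(n_clusters):
            if np.any(assign == c):
                centroids[c] = means[assign == c].mean(axis=0)
    return centroids, assign

def selected_log_likelihood(x, means, variances, centroids, assign, floor=-60.0):
    # Evaluate only the Gaussians in the cluster nearest to x; all other
    # components get a flat back-off score, which is where the
    # accuracy/computation trade-off discussed in the paper lives.
    nearest = np.argmin(((x - centroids) ** 2).sum(-1))
    scores = np.full(len(means), floor)
    for i in np.where(assign == nearest)[0]:
        scores[i] = log_gauss(x, means[i], variances[i])
    return scores

# Toy usage: 256 Gaussians grouped into 8 clusters, one observation.
rng = np.random.default_rng(1)
means = rng.normal(size=(256, 13))
variances = np.ones((256, 13))
centroids, assign = build_clusters(means, n_clusters=8)
x = rng.normal(size=13)
print(selected_log_likelihood(x, means, variances, centroids, assign)[:5])
```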


IEEE Automatic Speech Recognition and Understanding Workshop | 2013

Investigation of multilingual deep neural networks for spoken term detection

Kate Knill; Mark J. F. Gales; Shakti P. Rath; Philip C. Woodland; Chao Zhang; Shi-Xiong Zhang

The development of high-performance speech processing systems for low-resource languages is a challenging area. One approach to address the lack of resources is to make use of data from multiple languages. A popular direction in recent years is to use bottleneck features, or hybrid systems, trained on multilingual data for speech-to-text (STT) systems. This paper presents an investigation into the application of these multilingual approaches to spoken term detection. Experiments were run using the IARPA Babel limited language pack corpora (~10 hours/language), with 4 languages used for initial multilingual system development and an additional held-out target language. STT gains achieved through using multilingual bottleneck features in a Tandem configuration are shown to also apply to keyword search (KWS). Further improvements in both STT and KWS were observed by incorporating language questions into the Tandem GMM-HMM decision trees for the training-set languages. Adapted hybrid systems performed slightly worse on average than the adapted Tandem systems. Testing a language-independent acoustic model on the target language showed that at least minimal retraining or adaptation of the acoustic models to the target language is currently needed to achieve reasonable performance.
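
A minimal sketch of the Tandem feature-extraction step described above: bottleneck activations from a (here untrained, randomly initialised) network are appended to the base acoustic features before conventional GMM-HMM training. The layer sizes and input splicing are illustrative assumptions, and the multilingual training itself (shared hidden layers with per-language outputs) is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy bottleneck network: wide hidden layers with a narrow (39-dim)
# bottleneck before the final layer. Sizes are illustrative only.
sizes = [40 * 9, 1024, 1024, 39, 1024]   # input = 9 spliced 40-dim frames
weights = [rng.normal(scale=0.05, size=(a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
BOTTLENECK_LAYER = 3  # index of the 39-dim activations

def bottleneck_features(frames):
    """Forward spliced frames through the net, return bottleneck activations."""
    h = frames
    acts = [h]
    for w in weights:
        h = np.tanh(h @ w)
        acts.append(h)
    return acts[BOTTLENECK_LAYER]

# Tandem configuration: append bottleneck features to the original
# acoustic features and feed the result to a GMM-HMM system.
spliced = rng.normal(size=(100, 40 * 9))   # 100 frames of spliced input
plp = rng.normal(size=(100, 13))           # base acoustic features
tandem = np.concatenate([plp, bottleneck_features(spliced)], axis=1)
print(tandem.shape)   # (100, 52)
```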


International Conference on Acoustics, Speech, and Signal Processing | 2013

System combination and score normalization for spoken term detection

Jonathan Mamou; Jia Cui; Xiaodong Cui; Mark J. F. Gales; Brian Kingsbury; Kate Knill; Lidia Mangu; David Nolden; Michael Picheny; Bhuvana Ramabhadran; Ralf Schlüter; Abhinav Sethy; Philip C. Woodland

Spoken content in languages of emerging importance needs to be searchable to provide access to the underlying information. In this paper, we investigate the problem of extending data fusion methodologies from Information Retrieval to Spoken Term Detection on low-resource languages in the framework of the IARPA Babel program. We describe a number of alternative methods for improving keyword search performance. We apply these methods to Cantonese, a language that presents some new issues in terms of reduced resources and shorter query lengths. First, we show a score normalization methodology that improves keyword search performance by 20% on average. Second, we show that properly combining the outputs of diverse ASR systems performs 14% better than the best normalized single ASR system.
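
One common form of keyword-specific score normalization in Babel-era spoken term detection is sum-to-one normalization; the sketch below is a generic version of that idea, not necessarily the exact methodology of the paper (the gamma exponent is an assumption).

```python
from collections import defaultdict

def sum_to_one_normalize(hits, gamma=1.0):
    """Normalize detection scores per keyword so they sum to one.

    hits: list of (keyword, time, score) tuples from an STD system.
    gamma: optional exponent applied before normalizing (an assumption
    here; the exact scheme in the paper may differ).
    """
    totals = defaultdict(float)
    for kw, _, score in hits:
        totals[kw] += score ** gamma
    return [(kw, t, (score ** gamma) / totals[kw]) for kw, t, score in hits]

# Toy usage: rare keywords get their few hits boosted relative to
# frequent ones, which helps with term-weighted metrics.
hits = [("hello", 1.2, 0.9), ("hello", 7.5, 0.3), ("world", 2.0, 0.5)]
for kw, t, s in sum_to_one_normalize(hits):
    print(f"{kw:6s} t={t:4.1f} score={s:.3f}")
```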


International Conference on Acoustics, Speech, and Signal Processing | 2013

A high-performance Cantonese keyword search system

Brian Kingsbury; Jia Cui; Xiaodong Cui; Mark J. F. Gales; Kate Knill; Jonathan Mamou; Lidia Mangu; David Nolden; Michael Picheny; Bhuvana Ramabhadran; Ralf Schlüter; Abhinav Sethy; Philip C. Woodland

We present a system for keyword search on Cantonese conversational telephony audio, collected for the IARPA Babel program, that achieves good performance by combining postings lists produced by diverse speech recognition systems from three different research groups. We describe the keyword search task, the data on which the work was done, four different speech recognition systems, and our approach to system combination for keyword search. We show that the combination of four systems outperforms the best single system by 7%, achieving an actual term-weighted value of 0.517.
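
A toy sketch of combining postings lists: hits for the same keyword at (nearly) the same time are merged and their scores combined, with a CombMNZ-style vote boost. The time tolerance and scoring rule are assumptions; the combination in the paper is more sophisticated.

```python
def combine_postings(systems, tolerance=0.5):
    """Merge keyword hits from several recognizers.

    systems: list of postings lists, each a list of (keyword, time, score).
    Hits from different systems whose times fall within `tolerance`
    seconds are treated as the same detection and their scores summed.
    """
    combined = []   # entries: [keyword, time, score, n_votes]
    for postings in systems:
        for kw, t, score in postings:
            for entry in combined:
                if entry[0] == kw and abs(entry[1] - t) <= tolerance:
                    entry[2] += score
                    entry[3] += 1
                    break
            else:
                combined.append([kw, t, score, 1])
    # CombMNZ-style boost: scale the summed score by the vote count,
    # rewarding detections that multiple systems agree on.
    return [(kw, t, s * n) for kw, t, s, n in combined]

sys_a = [("hello", 1.2, 0.8), ("world", 3.0, 0.4)]
sys_b = [("hello", 1.4, 0.6)]
print(combine_postings([sys_a, sys_b]))
```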


Conference of the International Speech Communication Association | 2015

Joint decoding of tandem and hybrid systems for improved keyword spotting on low resource languages

Haipeng Wang; Anton Ragni; Mark J. F. Gales; Kate Knill; Philip C. Woodland; Chao Zhang



International Conference on Acoustics, Speech, and Signal Processing | 2015

Unicode-based graphemic systems for limited resource languages

Mark J. F. Gales; Kate Knill; Anton Ragni

Large vocabulary continuous speech recognition systems require a mapping from words, or tokens, into sub-word units to enable robust estimation of acoustic model parameters, and to model words not seen in the training data. The standard approach to achieve this is to manually generate a lexicon where words are mapped into phones, often with attributes associated with each of these phones. Context-dependent acoustic models are then constructed using decision trees where questions are asked based on the phones and phone attributes. For low-resource languages, it may not be practical to manually generate a lexicon. An alternative approach is to use a graphemic lexicon, where the “pronunciation” for a word is defined by the letters forming that word. This paper proposes a simple approach for building graphemic systems for any language written in Unicode. The attributes for graphemes are automatically derived using features from the Unicode character descriptions. These attributes are then used in decision tree construction. This approach is examined on the IARPA Babel Option Period 2 languages, and a Levantine Arabic CTS task. The described approach achieves comparable, and complementary, performance to phonetic lexicon-based approaches.
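
The Unicode metadata the paper leans on is directly available in, for example, Python's standard unicodedata module. Below is a minimal sketch of deriving grapheme attributes this way; the attribute set, and the crude vowel test in particular, are illustrative, not the paper's.

```python
import unicodedata

def grapheme_attributes(ch):
    """Derive decision-tree attributes for a grapheme from Unicode metadata.

    A sketch only: it shows the kind of information the Unicode
    character description provides for any script, not the actual
    attribute set used in the paper.
    """
    decomposed = unicodedata.normalize("NFD", ch)
    base = decomposed[0]                           # strip combining marks
    base_name = unicodedata.name(base, "UNKNOWN")  # e.g. 'LATIN SMALL LETTER E'
    return {
        "grapheme": ch,
        "base": base,
        "category": unicodedata.category(ch),      # e.g. 'Ll', 'Lo', 'Mn'
        "has_diacritic": len(decomposed) > 1,
        # Crude, Latin-centric vowel test on the base letter's name:
        "is_vowel_letter": base_name.split()[-1] in {"A", "E", "I", "O", "U"},
    }

for ch in ("a", "é", "\u0627"):   # Latin a, e-acute, Arabic alef
    print(grapheme_attributes(ch))
```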


International Conference on Acoustics, Speech, and Signal Processing | 1996

Fast implementation methods for Viterbi-based word-spotting

Kate Knill; Steve J. Young

This paper explores methods of increasing the speed of a Viterbi-based word-spotting system for audio document retrieval. Fast processing is essential since the user expects to receive the results of a keyword search many times faster than the actual length of the speech. A number of computational short-cuts to the standard Viterbi word-spotter are presented. These are based on exploiting the background Viterbi phone recognition path that is computed to provide a normalisation base. An initial approximation using the phone transition boundaries reduces the retrieval time by a factor of 5, while achieving a slight improvement in word-spotting performance. To further reduce retrieval time, pattern matching, feature selection, and Gaussian selection techniques are applied to this approximate pass to give a total ×50 increase in speed with little loss in performance. In addition, a low memory requirement means that these approaches can be implemented on any platform, including hand-held devices.
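
A sketch of the first approximation described above: restrict keyword-model start times to the phone boundaries produced by the background recognition pass, instead of launching a hypothesis at every frame. The scoring function and threshold below are stand-ins, not the paper's Viterbi match.

```python
def spot_keyword(score_fn, n_frames, phone_boundaries, kw_len):
    """Approximate word-spotting pass.

    Start a keyword hypothesis only at the phone boundaries found by
    the background recognition pass. score_fn(start, end) is assumed
    to return a normalised keyword log-score for that span; it stands
    in for the real keyword-model Viterbi match.
    """
    hits = []
    for start in phone_boundaries:            # instead of range(n_frames)
        end = start + kw_len
        if end <= n_frames:
            score = score_fn(start, end)
            if score > -1.0:                  # illustrative threshold
                hits.append((start, end, score))
    return hits

# Toy usage with a fake scorer that likes a span beginning at frame 30.
boundaries = [0, 12, 30, 55, 80]
fake = lambda s, e: -0.2 if s == 30 else -5.0
print(spot_keyword(fake, n_frames=100, phone_boundaries=boundaries, kw_len=25))
```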


IEEE Automatic Speech Recognition and Understanding Workshop | 2015

Multilingual representations for low resource speech recognition and keyword search

Jia Cui; Brian Kingsbury; Bhuvana Ramabhadran; Abhinav Sethy; Kartik Audhkhasi; Xiaodong Cui; Ellen Kislal; Lidia Mangu; Markus Nussbaum-Thom; Michael Picheny; Zoltán Tüske; Pavel Golik; Ralf Schlüter; Hermann Ney; Mark J. F. Gales; Kate Knill; Anton Ragni; Haipeng Wang; Phil Woodland

This paper examines the impact of multilingual (ML) acoustic representations on Automatic Speech Recognition (ASR) and keyword search (KWS) for low resource languages in the context of the OpenKWS15 evaluation of the IARPA Babel program. The task is to develop Swahili ASR and KWS systems within two weeks using as little as 3 hours of transcribed data. Multilingual acoustic representations proved to be crucial for building these systems under strict time constraints. The paper discusses several key insights on how these representations are derived and used. First, we present a data sampling strategy that can speed up the training of multilingual representations without appreciable loss in ASR performance. Second, we show that fusion of diverse multilingual representations developed at different LORELEI sites yields substantial ASR and KWS gains; speaker adaptation and data augmentation of these representations improve both ASR and KWS performance (up to 8.7% relative). Third, incorporating untranscribed data through semi-supervised learning improves WER and KWS performance. Finally, we show that these multilingual representations significantly improve ASR and KWS performance (relative 9% for WER and 5% for MTWV) even when forty hours of transcribed audio in the target language is available. Multilingual representations significantly contributed to the LORELEI KWS systems winning the OpenKWS15 evaluation.
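
A sketch of the per-language data sampling idea mentioned above, assuming a simple fixed time budget per language; the paper's actual strategy, and its interaction with DNN training, is richer than this.

```python
import random

def sample_multilingual(corpora, hours_per_language, seed=0):
    """Down-sample each language's training set to a fixed time budget.

    corpora: dict mapping language -> list of (utterance_id, duration_s).
    Only a sketch of the *idea* of per-language sampling to cap
    multilingual training time.
    """
    rng = random.Random(seed)
    budget = hours_per_language * 3600.0
    selected = {}
    for lang, utts in corpora.items():
        utts = utts[:]
        rng.shuffle(utts)          # random selection within each language
        picked, total = [], 0.0
        for utt_id, dur in utts:
            if total + dur > budget:
                break
            picked.append(utt_id)
            total += dur
        selected[lang] = picked
    return selected

# Toy usage: cap two corpora at 3 hours each (utterances of 6 s).
corpora = {"swahili": [(f"sw_{i}", 6.0) for i in range(2400)],
           "tamil":   [(f"ta_{i}", 6.0) for i in range(6000)]}
print({k: len(v) for k, v in sample_multilingual(corpora, 3).items()})
```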


Archive | 1997

Hidden Markov Models in Speech and Language Processing

Kate Knill; Steve J. Young

Speech is an acoustic representation of a word or sequence of words, characterized by a slowly changing spectral envelope. Humans perceive this spectral envelope, and convert it into the underlying word string and its associated meaning. The ultimate goal of speech and language processing is to mimic this process so that a machine can hold a natural conversation with a human. Speech and language processing has a far wider role to play, however, in performing less complex tasks such as transcription, language identification, or audio document retrieval, all of which are feasible to a certain extent now. The basic step in all of these systems is to perform the inverse mapping of the speech into the underlying sequence of symbols, usually words, that produced it, as shown in Fig. 2.1. This chapter describes a statistical approach to solving the automatic speech recognition problem, based on stochastic Markov process models.
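
The inverse mapping described here is conventionally written as the Bayes decision rule, which chapters like this one build on (the standard formulation, not quoted from the chapter itself):

```latex
\hat{W} \;=\; \arg\max_{W} P(W \mid O)
        \;=\; \arg\max_{W} \; p(O \mid W)\, P(W)
```

where O is the observed acoustic feature sequence, p(O | W) is the acoustic model (here, a hidden Markov model) and P(W) is the language model.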


International Conference on Acoustics, Speech, and Signal Processing | 2015

Improving multiple-crowd-sourced transcriptions using a speech recogniser

R. C. van Dalen; Kate Knill; Pirros Tsiakoulis; Mark J. F. Gales

This paper introduces a method to produce high-quality transcriptions of speech data from only two crowd-sourced transcriptions. These transcriptions, produced cheaply by people on the Internet, for example through Amazon Mechanical Turk, are often of low quality. Often, multiple crowd-sourced transcriptions are combined to form one transcription of higher quality. However, the state of the art is essentially a form of majority voting, which requires at least three transcriptions for each utterance. This paper shows how to refine this approach to work with only two transcriptions. It then introduces a method that uses a speech recogniser (bootstrapped on a simple combination scheme) to combine transcriptions. When only two crowd-sourced transcriptions are available, on a noisy data set this improves the word error rate against gold-standard transcriptions by 21% relative.
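
A minimal sketch of combining two transcriptions: align them, keep the words where they agree, and defer to a recogniser where they disagree. The alignment and the arbitration stub below are simplifying assumptions, not the paper's method.

```python
import difflib

def combine_two(trans_a, trans_b, recogniser_pick):
    """Combine two crowd-sourced transcriptions of one utterance.

    Where the transcriptions agree, keep the word; where they differ,
    defer to recogniser_pick(words_a, words_b), a stand-in for scoring
    both alternatives with a bootstrapped speech recogniser as the
    abstract describes.
    """
    a, b = trans_a.split(), trans_b.split()
    out = []
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            out.extend(a[i1:i2])
        else:
            out.extend(recogniser_pick(a[i1:i2], b[j1:j2]))
    return " ".join(out)

# Toy arbitration: prefer the longer alternative (a real system would
# rescore both with the recogniser and keep the higher-likelihood one).
pick = lambda wa, wb: wa if len(wa) >= len(wb) else wb
print(combine_two("the cat sat on mat", "the cat sat on the mat", pick))
```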

Collaboration


Dive into Kate Knill's collaborations.

Top Co-Authors

Anton Ragni

University of Cambridge

Yu Wang

University of Cambridge
