John Dines
Idiap Research Institute
Publications
Featured research published by John Dines.
International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2007
Thomas Hain; Vincent Wan; Lukas Burget; Martin Karafiát; John Dines; Jithendra Vepa; Giulia Garau; Mike Lincoln
In this paper we describe the 2005 AMI system for the transcription of speech in meetings, used in the 2005 NIST RT evaluations. The system was designed for participation in the speech-to-text part of the evaluations, in particular for the transcription of speech recorded with multiple distant microphones and individual headset microphones. System performance was tested on both conference room and lecture style meetings. Although input sources are processed using different front-ends, the recognition process is based on a unified system architecture. The system operates in multiple passes and makes use of state-of-the-art technologies such as discriminative training, vocal tract length normalisation, heteroscedastic linear discriminant analysis, speaker adaptation with maximum likelihood linear regression, and minimum word error rate decoding. In this paper we describe the system performance on the official development and test sets for the NIST RT05s evaluations. The system was jointly developed in less than 10 months by a multi-site team and was shown to achieve competitive performance.
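The HLDA step mentioned above projects high-dimensional concatenated features into a compact, discriminative space. As a rough flavor, here is plain linear discriminant analysis, the homoscedastic special case of HLDA, in NumPy/SciPy; a toy sketch with synthetic data, not the AMI front-end.

```python
import numpy as np
from scipy.linalg import eigh

def lda_projection(features, labels, n_dims):
    """Estimate an LDA projection (the homoscedastic special case of HLDA).

    features: (n_frames, d) array; labels: (n_frames,) class indices.
    Returns a (d, n_dims) projection maximizing between-class scatter
    relative to within-class scatter. LDA yields at most n_classes - 1
    informative directions.
    """
    mean = features.mean(axis=0)
    d = features.shape[1]
    sw = np.zeros((d, d))  # within-class scatter
    sb = np.zeros((d, d))  # between-class scatter
    for c in np.unique(labels):
        x = features[labels == c]
        mu = x.mean(axis=0)
        sw += (x - mu).T @ (x - mu)
        diff = (mu - mean)[:, None]
        sb += len(x) * (diff @ diff.T)
    # Generalized eigenproblem Sb v = lambda Sw v; keep the top directions.
    vals, vecs = eigh(sb, sw)
    return vecs[:, np.argsort(vals)[::-1][:n_dims]]

# Toy usage: project 39-dim frames from 3 synthetic phone classes to 2 dims.
rng = np.random.default_rng(0)
frames = rng.normal(size=(300, 39)) + np.repeat(rng.normal(size=(3, 39)), 100, axis=0)
phones = np.repeat([0, 1, 2], 100)
projected = frames @ lda_projection(frames, phones, n_dims=2)
```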
IEEE Transactions on Audio, Speech, and Language Processing | 2012
Thomas Hain; Lukas Burget; John Dines; Philip N. Garner; Frantisek Grezl; Asmaa El Hannani; Marijn Huijbregts; Martin Karafiát; Mike Lincoln; Vincent Wan
In this paper, we give an overview of the AMIDA systems for transcription of conference and lecture room meetings. The systems were developed for participation in the Rich Transcription evaluations conducted by the National Institute of Standards and Technology in the years 2007 and 2009, and can process close-talking and far-field microphone recordings. The paper first discusses fundamental properties of meeting data, with special focus on the AMI/AMIDA corpora. This is followed by a description and analysis of improved processing and modeling, with focus on techniques specifically addressing meeting transcription issues such as multi-room recordings or domain variability. In 2007 and 2009, two different system-building strategies were followed. While in 2007 we used our traditional system design based on cross-adaptation, the 2009 systems were constructed semi-automatically, supported by improved decoders and a new method for system representation. Overall, these changes gave a 6%-13% relative reduction in word error rate compared to our 2007 results, while at the same time requiring less training material and reducing the real-time factor fivefold. The meeting transcription systems are available at www.webasr.org.
IEEE Transactions on Audio, Speech, and Language Processing | 2010
Junichi Yamagishi; Bela Usabaev; Simon King; Oliver Watts; John Dines; Jilei Tian; Yong Guan; Rile Hu; Keiichiro Oura; Yi-Jian Wu; Keiichi Tokuda; Reima Karhila; Mikko Kurimo
In conventional speech synthesis, large amounts of phonetically balanced speech data recorded in highly controlled recording studio environments are typically required to build a voice. Although using such data is a straightforward solution for high-quality synthesis, the number of voices available will always be limited, because recording costs are high. On the other hand, our recent experiments with HMM-based speech synthesis systems have demonstrated that speaker-adaptive HMM-based speech synthesis (which uses an “average voice model” plus model adaptation) is robust to non-ideal speech data that are recorded under various conditions and with varying microphones, that are not perfectly clean, and/or that lack phonetic balance. This enables us to consider building high-quality voices on “non-TTS” corpora such as ASR corpora. Since ASR corpora generally include a large number of speakers, this leads to the possibility of producing an enormous number of voices automatically. In this paper, we demonstrate the thousands of voices for HMM-based speech synthesis that we have made from several popular ASR corpora, such as the Wall Street Journal (WSJ0, WSJ1, and WSJCAM0), Resource Management, GlobalPhone, and SPEECON databases. We also present the results of associated analysis based on perceptual evaluation, and discuss remaining issues.
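The “average voice model plus model adaptation” recipe transforms the average-voice Gaussians toward each target speaker. Below is a minimal sketch of a global MLLR mean transform estimated by least squares, assuming identity covariances and a known frame-to-state alignment; the actual systems use structured constrained transforms (e.g. CSMAPLR), and all data here are synthetic.

```python
import numpy as np

def estimate_mllr_mean_transform(frames, state_ids, means):
    """Global MLLR mean transform W such that mu' = W @ [1, mu].

    Least-squares solution assuming identity covariances -- a toy
    version of the structured transforms used in speaker-adaptive HTS.
    frames: (T, d) adaptation data; state_ids: (T,) Gaussian index per
    frame; means: (n_states, d) average-voice means.
    """
    xi = np.hstack([np.ones((len(frames), 1)), means[state_ids]])  # (T, d+1)
    # Solve frames ~= xi @ W.T in the least-squares sense.
    w_t, *_ = np.linalg.lstsq(xi, frames, rcond=None)
    return w_t.T  # (d, d+1)

# Toy usage: adapt 5 average-voice Gaussians with 200 aligned frames
# drawn from a synthetic "target speaker" (scaled and shifted means).
rng = np.random.default_rng(1)
avg_means = rng.normal(size=(5, 13))
align = rng.integers(0, 5, size=200)
target = 1.1 * avg_means[align] + 0.5 + rng.normal(scale=0.1, size=(200, 13))
W = estimate_mllr_mean_transform(target, align, avg_means)
adapted_means = np.hstack([np.ones((5, 1)), avg_means]) @ W.T
```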
International Conference on Machine Learning | 2006
Darren J. Moore; John Dines; Mathew Magimai.-Doss; Jithendra Vepa; Octavian Cheng; Thomas Hain
A major component in the development of any speech recognition system is the decoder. As task complexities and, consequently, system complexities have continued to increase, the decoding problem has become an increasingly significant component in the overall speech recognition system development effort, with efficient decoder design significantly improving the trade-off between decoding time and search errors. In this paper we present the “Juicer” (from transducer) large vocabulary continuous speech recognition (LVCSR) decoder based on weighted finite-state transducers (WFSTs). We begin with a discussion of the need for open source, state-of-the-art decoding software in LVCSR research and how this led to the development of Juicer, followed by a brief overview of decoding techniques and major issues in decoder design. We present Juicer and its major features, emphasising its potential not only as a critical component in the development of LVCSR systems, but also as an important research tool in itself, being based around the flexible WFST paradigm. We also provide results of benchmarking tests that have been carried out to date, demonstrating that in many respects Juicer, while still in its early development, is already achieving state-of-the-art performance. These benchmarking tests serve not only to demonstrate the utility of Juicer in its present state, but are also being used to guide future development; hence, we conclude with a brief discussion of some of the extensions that are currently under way or being considered for Juicer.
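Decoding over a WFST amounts to finding the cheapest path through a composed graph in the tropical semiring, where arc weights are negative log probabilities accumulated by addition. The self-contained toy below illustrates that search; the graph, labels, and costs are invented for illustration, and this is not Juicer's implementation or API.

```python
import math

# A toy WFST in the tropical semiring.
# arcs[state] = list of (input_label, output_label, weight, next_state).
arcs = {
    0: [("ih", "", 0.4, 1), ("ah", "", 1.2, 1)],
    1: [("t", "it", 0.3, 2), ("z", "is", 0.9, 2)],
    2: [],
}
FINAL = {2: 0.0}  # final states and their exit weights

def viterbi(observation_costs):
    """Cheapest path through the toy WFST given per-frame label costs.

    observation_costs: list of dicts mapping input label -> acoustic cost
    (-log likelihood) for that frame. Returns (total_cost, output_words).
    """
    best = {0: (0.0, [])}  # best[state] = (cost so far, output so far)
    for costs in observation_costs:
        nxt = {}
        for state, (cost, out) in best.items():
            for ilabel, olabel, w, dest in arcs[state]:
                c = cost + w + costs.get(ilabel, math.inf)
                if dest not in nxt or c < nxt[dest][0]:
                    nxt[dest] = (c, out + ([olabel] if olabel else []))
        best = nxt
    finals = [(c + FINAL[s], out) for s, (c, out) in best.items() if s in FINAL]
    return min(finals, default=(math.inf, []))

# Two frames of acoustic scores: "ih" then "t" are cheapest -> outputs "it".
print(viterbi([{"ih": 0.2, "ah": 1.0}, {"t": 0.1, "z": 0.8}]))
```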
Multimodal Technologies for Perception of Humans | 2008
Thomas Hain; Lukas Burget; John Dines; Giulia Garau; Martin Karafiát; David A. van Leeuwen; Mike Lincoln; Vincent Wan
Meeting transcription is one of the main tasks for large vocabulary automatic speech recognition (ASR) and is supported by several large international projects in the area. The conversational nature, the difficult acoustics, and the necessity of high-quality speech transcripts for higher-level processing make ASR of meeting recordings an interesting challenge. This paper describes the development and system architecture of the 2007 AMIDA meeting transcription system, the third such system developed in a collaboration of six research sites. Different variants of the system participated in all speech-to-text transcription tasks of the 2007 NIST RT evaluations and showed very competitive performance. The best result was obtained on close-talking microphone data, where a final word error rate of 24.9% was obtained.
International Conference on Machine Learning | 2005
Thomas Hain; Lukas Burget; John Dines; Iain A. McCowan; Giulia Garau; Martin Karafiát; Mike Lincoln; Darren Moore; Vincent Wan; Roeland Ordelman; Steve Renals
This paper describes the AMI transcription system for speech in meetings, developed in collaboration by five research groups. The system includes generic techniques such as discriminative and speaker adaptive training, vocal tract length normalisation, heteroscedastic linear discriminant analysis, maximum likelihood linear regression, and phone posterior based features, as well as techniques specifically designed for meeting data. These include segmentation and cross-talk suppression, beam-forming, domain adaptation, Web-data collection, and channel adaptive training. The system was improved by more than 20% relative in word error rate compared to our previous system and was used in the NIST RT'06 evaluations, where it was found to yield competitive performance.
International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2010
Hui Liang; John Dines; Lakshmi Saheer
The EMIME project aims to build a personalized speech-to-speech translator, such that spoken input from a user in one language is used to produce spoken output in another language that still sounds like the user's voice. This distinctiveness makes unsupervised cross-lingual speaker adaptation one key to the project's success. So far, research has been conducted into the unsupervised and cross-lingual cases separately, by means of decision tree marginalization and HMM state mapping respectively. In this paper we combine the two techniques to perform unsupervised cross-lingual speaker adaptation. The performance of eight speaker adaptation systems (supervised vs. unsupervised, intra-lingual vs. cross-lingual) is compared using objective and subjective evaluations. Experimental results show that the performance of unsupervised cross-lingual speaker adaptation is comparable to that of the supervised case in terms of spectrum adaptation in the EMIME scenario, even though automatically obtained transcriptions have a very high phoneme error rate.
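HMM state mapping associates each state of one language's model with the closest state of the other's. Here is a minimal sketch using symmetric KL divergence between a single diagonal Gaussian per state; the EMIME systems map between full context-clustered state distributions, and all model parameters below are synthetic.

```python
import numpy as np

def kl_diag_gauss(m1, v1, m2, v2):
    """KL(N1 || N2) for diagonal Gaussians with means m*, variances v*."""
    return 0.5 * np.sum(np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def map_states(src_means, src_vars, tgt_means, tgt_vars):
    """Map each source-language state to its nearest target-language state
    under symmetric KL divergence -- a toy version of the HMM state
    mapping used for cross-lingual adaptation."""
    mapping = []
    for m1, v1 in zip(src_means, src_vars):
        dists = [kl_diag_gauss(m1, v1, m2, v2) + kl_diag_gauss(m2, v2, m1, v1)
                 for m2, v2 in zip(tgt_means, tgt_vars)]
        mapping.append(int(np.argmin(dists)))
    return mapping

# Toy usage: 4 source states mapped onto 6 target states (5-dim features).
rng = np.random.default_rng(2)
src_m, tgt_m = rng.normal(size=(4, 5)), rng.normal(size=(6, 5))
src_v, tgt_v = rng.uniform(0.5, 2.0, (4, 5)), rng.uniform(0.5, 2.0, (6, 5))
print(map_states(src_m, src_v, tgt_m, tgt_v))
```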
IEEE Journal of Selected Topics in Signal Processing | 2010
John Dines; Junichi Yamagishi; Simon King
The EMIME European project is conducting research in the development of technologies for mobile, personalized speech-to-speech translation systems. The hidden Markov model (HMM) is being used as the underlying technology in both automatic speech recognition (ASR) and text-to-speech synthesis (TTS) components; thus, the investigation of unified statistical modeling approaches has become an implicit goal of our research. As one of the first steps towards this goal, we have been investigating commonalities and differences between HMM-based ASR and TTS. In this paper, we present results and analysis of a series of experiments that have been conducted on English ASR and TTS systems measuring their performance with respect to phone set and lexicon, acoustic feature type and dimensionality, HMM topology, and speaker adaptation. Our results show that, although the fundamental statistical model may be essentially the same, optimal ASR and TTS performance often demands diametrically opposed system designs. This represents a major challenge to be addressed in the investigation of such unified modeling approaches.
International Conference on Multimodal Interfaces | 2008
Sarah Favre; Hugues Salamin; John Dines; Alessandro Vinciarelli
This paper presents an approach for the recognition of roles in multiparty recordings. The approach includes two major stages: extraction of Social Affiliation Networks (speaker diarization and representation of people in terms of their social interactions), and role recognition (application of discrete probability distributions to map people into roles). The experiments are performed over several corpora, including broadcast data and meeting recordings, for a total of roughly 90 hours of material. The results are satisfactory for the broadcast data (around 80 percent of the data time correctly labeled in terms of role), while they still must be improved in the case of the meeting recordings (around 45 percent of the data time correctly labeled). In both cases, the approach significantly outperforms chance.
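The role recognition stage assigns each person the role whose discrete distribution best explains that person's interaction features. A minimal maximum-likelihood version follows; the roles, feature bins, and probabilities are invented for illustration and are not the distributions learned in the paper.

```python
import numpy as np

# Hypothetical per-role distributions over a quantized interaction feature
# (e.g. binned fraction of speaking time); each row sums to 1.
ROLE_DISTS = {
    "anchor": np.array([0.05, 0.15, 0.80]),   # mostly high speaking time
    "guest":  np.array([0.30, 0.50, 0.20]),
    "caller": np.array([0.70, 0.25, 0.05]),
}

def recognize_role(feature_counts):
    """Assign the role maximizing the log-likelihood of the observed bin
    counts under that role's multinomial distribution."""
    counts = np.asarray(feature_counts, dtype=float)
    scores = {role: float(counts @ np.log(p)) for role, p in ROLE_DISTS.items()}
    return max(scores, key=scores.get), scores

# Someone observed 8 times in the high-speaking-time bin, once elsewhere.
print(recognize_role([0, 1, 8]))  # -> ("anchor", ...)
```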
International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2010
Lakshmi Saheer; Philip N. Garner; John Dines; Hui Liang
The advent of statistical speech synthesis has enabled the unification of the basic techniques used in speech synthesis and recognition. Adaptation techniques that have been successfully used in recognition systems can now be applied to synthesis systems to improve the quality of the synthesized speech. The application of vocal tract length normalization (VTLN) for synthesis is explored in this paper. VTLN-based adaptation requires estimation of a single warping factor, which can be accurately estimated from very little adaptation data and gives additive improvements over constrained maximum likelihood linear regression (CMLLR) adaptation. The challenge of estimating accurate warping factors using higher-order features is solved by initializing warping factor estimation with the values calculated from lower-order features.
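Estimating the single VTLN warping factor is typically a maximum-likelihood grid search. In the toy below, "warping" merely rescales scalar features and the reference model is a standard Gaussian, whereas real VTLN warps the frequency axis of the filterbank and scores full acoustic models; the coarse-then-fine search only illustrates the initialization idea (the paper initializes higher-order-feature estimation from lower-order-feature estimates). Everything here is synthetic.

```python
import numpy as np

def loglik_std_normal(x):
    """Log-likelihood of samples under a standard-normal reference model."""
    return float(np.sum(-0.5 * (x ** 2) - 0.5 * np.log(2 * np.pi)))

def estimate_warp(features, grid):
    """Pick the warp factor whose warped features best fit the reference.

    Toy stand-in for VTLN: warping here just rescales scalar features.
    The log-Jacobian term keeps likelihoods comparable across factors.
    """
    scores = [loglik_std_normal(features / a) - len(features) * np.log(a)
              for a in grid]
    return grid[int(np.argmax(scores))]

rng = np.random.default_rng(3)
speaker = 1.15 * rng.normal(size=500)  # true warp factor: 1.15

# Coarse search on a subset, then a fine search centered on that estimate,
# illustrating initialization of one search from a cheaper one.
coarse = estimate_warp(speaker[:200], np.arange(0.7, 1.35, 0.1))
fine = estimate_warp(speaker, np.arange(coarse - 0.1, coarse + 0.11, 0.01))
print(coarse, round(fine, 2))  # estimates near the true factor of 1.15
```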