Olivier Siohan
Publications
Featured research published by Olivier Siohan.
international conference on acoustics, speech, and signal processing | 2009
Christopher Alberti; Michiel Bacchiani; Ari Bezman; Ciprian Chelba; Anastassia Drofa; Hank Liao; Pedro J. Moreno; Ted Power; Arnaud Sahuguet; Maria Shugrina; Olivier Siohan
In the 2008 presidential election race in the United States, the prospective candidates made extensive use of YouTube to post video material. We developed a scalable system that transcribes this material, makes the content searchable (by indexing the meta-data and transcripts of the videos), and allows the user to navigate through the video material based on content. The system is available as an iGoogle gadget as well as a Labs product (labs.google.com/gaudi). Given the large exposure, special emphasis was put on the scalability and reliability of the system. This paper describes the design and implementation of this system.
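As a rough illustration of the transcript indexing that makes such content searchable, here is a toy inverted index keyed on spoken words with timestamps. All names and the data layout are hypothetical; the production pipeline described in the paper is far more elaborate.

```python
from collections import defaultdict

# Toy inverted index over video transcripts: maps each spoken term to the
# (video_id, timestamp) positions where it occurs, so a search hit can
# seek the player directly to the moment the word was said.
index = defaultdict(list)

def add_transcript(video_id, timed_words):
    """timed_words: list of (word, start_time_in_seconds) pairs."""
    for word, start in timed_words:
        index[word.lower()].append((video_id, start))

def search(term):
    """Return all (video_id, timestamp) hits for a query term."""
    return index.get(term.lower(), [])

add_transcript("abc123", [("health", 12.4), ("care", 12.9), ("reform", 13.3)])
print(search("health"))  # [('abc123', 12.4)]
```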
international conference on acoustics, speech, and signal processing | 2014
Olivier Siohan
In this paper we construct a data set for semi-supervised acoustic model training by selecting spoken utterances from a massive collection of anonymized Google Voice Search utterances. Semi-supervised training usually retains high-confidence utterances, which are presumed to have an accurate hypothesized transcript, a necessary condition for successful training. Selecting high-confidence utterances can, however, restrict the diversity of the resulting data set. We propose to introduce a constraint enforcing that the distribution of the context-dependent state symbols, obtained by running forced alignment of the hypothesized transcript, matches a reference distribution estimated from a curated development set. The quality of the resulting training set is illustrated on large-scale Voice Search recognition experiments, where the proposed selection outperforms random selection of high-confidence utterances.
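One simple way to realize the distribution-matching constraint is a greedy selection that, at each step, picks the high-confidence utterance whose context-dependent state counts move the selected set's distribution closest to the reference. This is a minimal sketch under that assumption (KL divergence as the match criterion and the 0.9 confidence threshold are illustrative choices, not the paper's):

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two distributions over CD-state symbols."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

def select_utterances(candidates, ref_dist, budget, conf_threshold=0.9):
    """candidates: list of (confidence, state_counts), where state_counts
    is a vector of CD-state occupancies from forced alignment of the
    hypothesized transcript. Greedily pick utterances whose counts move
    the selected set's state distribution toward ref_dist."""
    pool = [(c, np.asarray(k, dtype=float))
            for c, k in candidates if c > conf_threshold]
    selected, totals = [], np.zeros_like(ref_dist, dtype=float)
    for _ in range(min(budget, len(pool))):
        best_i, best_kl = None, np.inf
        for i, (_, counts) in enumerate(pool):
            t = totals + counts
            d = kl(t / t.sum(), ref_dist)
            if d < best_kl:
                best_i, best_kl = i, d
        conf, counts = pool.pop(best_i)
        totals += counts
        selected.append((conf, counts))
    return selected

ref = np.array([0.5, 0.3, 0.2])
cands = [(0.95, [10, 0, 0]), (0.97, [0, 5, 5]), (0.92, [5, 3, 2]), (0.4, [9, 9, 9])]
print(len(select_utterances(cands, ref, budget=2)))  # 2 utterances selected
```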
ieee automatic speech recognition and understanding workshop | 2015
Olivier Siohan; David Rybach
In this paper we investigate the performance of an ensemble of convolutional, long short-term memory deep neural networks (CLDNNs) on a large-vocabulary speech recognition task. To reduce the computational complexity of running multiple recognizers in parallel, we instead propose an early system combination approach, which requires the construction of a static decoding network encoding the multiple context-dependent state inventories of the distinct acoustic models. To further reduce the computational load, the hidden units of those models can be shared while keeping the output layers distinct, leading to a multitask training formulation. However, in contrast to traditional multitask training, our formulation uses all predicted outputs, leading to a multitask system combination strategy. Results are presented on a Voice Search task designed for children and outperform our current production system.
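The shared-hidden-units, distinct-output-layers arrangement can be pictured with a toy model: one shared trunk computed once per frame, with one softmax head per state inventory, all of whose posteriors are used during decoding. This numpy sketch stands in for the paper's CLDNNs, and the layer sizes are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared hidden trunk: one set of hidden units reused by every model in
# the ensemble, so the ensemble costs little more than a single model.
W_shared = rng.normal(size=(40, 128))   # 40-dim input features (assumed)

# Distinct output layers, one per context-dependent state inventory.
heads = [rng.normal(size=(128, 512)), rng.normal(size=(128, 600))]

def posteriors(frame):
    h = np.tanh(frame @ W_shared)       # shared computation, done once
    outs = []
    for W_out in heads:                 # one softmax head per model
        z = h @ W_out
        e = np.exp(z - z.max())
        outs.append(e / e.sum())
    return outs  # unlike plain multitask training, all heads feed decoding

frame = rng.normal(size=40)
print([p.shape for p in posteriors(frame)])  # [(512,), (600,)]
```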
international conference on acoustics, speech, and signal processing | 2016
Victor Soto; Olivier Siohan; Mohamed G. Elfeky; Pedro J. Moreno
While research has often shown that building dialect-specific automatic speech recognizers is the optimal approach to dealing with dialectal variations of the same language, we have observed that dialect-specific recognizers do not always output the best recognition. Often enough, another dialectal recognizer outputs a better recognition than the dialect-specific one. In this paper, we present two methods to select and combine the best decoded hypothesis from a pool of dialectal recognizers. We follow a machine learning approach, extracting features from the speech recognition output along with word embeddings, and use shallow neural networks for classification. Our experiments using Dictation and Voice Search data from the four main Arabic dialects show good WER improvements for the hypothesis selection scheme, reducing the WER by 2.1 to 12.1% depending on the test set, and promising results for the hypothesis combination scheme.
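The hypothesis-selection scheme can be sketched as a small classifier over per-recognizer features: given features from each dialectal recognizer's output, predict which hypothesis to keep. Feature contents and dimensions below are assumptions standing in for the paper's recognizer scores and word embeddings:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Toy training data: each row concatenates per-recognizer features
# (e.g. scores, confidence, averaged word embedding of the hypothesis);
# the label is the index of the dialectal recognizer whose hypothesis
# was best (lowest WER) for that utterance.
X = rng.normal(size=(200, 4 * 16))   # 4 dialect recognizers x 16-dim features
y = rng.integers(0, 4, size=200)     # which recognizer's hypothesis to keep

# A shallow network: a single small hidden layer for classification.
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
clf.fit(X, y)

new_utt = rng.normal(size=(1, 64))
print("pick hypothesis from recognizer", clf.predict(new_utt)[0])
```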
international conference on acoustics, speech, and signal processing | 2015
Yanbo Xu; Olivier Siohan; David Simcha; Sanjiv Kumar; Hank Liao
This paper describes a large-scale exemplar-based acoustic modeling approach for large-vocabulary continuous speech recognition. We construct an index of labeled training frames using high-level features extracted from the bottleneck layer of a deep neural network as indexing features. At recognition time, each test frame is turned into a query and a set of k-nearest-neighbor frames is retrieved from the index. This set is further filtered using majority voting, and the remaining frames are used to derive an estimate of the context-dependent state posteriors of the query, which can then be used for recognition. Using an approximate nearest neighbor search approach based on asymmetric hashing, we are able to construct an index on over 25,000 hours of training data. We present both frame classification and recognition experiments on a Voice Search task.
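A brute-force sketch of the retrieval and posterior-estimation step is below. The production index uses approximate search via asymmetric hashing, which this toy version replaces with exact distances, and the vote-count filter is a simplification of the paper's majority-voting step:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

# Index: bottleneck features of labeled training frames together with
# their context-dependent state labels (random stand-ins here).
train_feats = rng.normal(size=(10000, 40))
train_labels = rng.integers(0, 500, size=10000)

def state_posteriors(query, k=50, min_votes=3):
    """Retrieve the k nearest training frames, filter by vote counts,
    and turn the surviving labels into CD-state posterior estimates."""
    d = np.linalg.norm(train_feats - query, axis=1)   # exact, not hashed
    nn = np.argsort(d)[:k]
    votes = Counter(train_labels[nn])
    # Voting filter: discard neighbors whose state label is rare among
    # the retrieved set, then normalize the remaining counts.
    kept = {s: c for s, c in votes.items() if c >= min_votes}
    total = sum(kept.values()) or 1
    return {state: c / total for state, c in kept.items()}

print(state_posteriors(rng.normal(size=40)))
```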
spoken language technology workshop | 2016
Mortaza Doulaty; Richard Rose; Olivier Siohan
Speech recognition performance using deep neural network based acoustic models is known to degrade when the acoustic environment and the speaker population in the target utterances differ significantly from the conditions represented in the training data. To address these mismatched scenarios, multi-style training (MTR) has been used to perturb utterances in an existing uncorrupted and potentially mismatched training speech corpus to better match target-domain utterances. This paper addresses the problem of determining the distribution of perturbation levels, for a given set of perturbation types, that best matches the target speech utterances. An approach is presented that, given a small set of utterances from a target domain, automatically identifies an empirical distribution of perturbation levels that can be applied to utterances in an existing training set. Distributions are estimated for perturbation types that include acoustic background environments, reverberant room configurations, and speaker-related variation such as frequency and temporal warping. The end goal is for the resulting perturbed training set to characterize the variability in the target domain and thereby optimize ASR performance. An experimental study evaluates the impact of this approach on ASR performance when the target utterances are taken from a simulated far-field acoustic environment.
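A rough sketch of the two-step idea, for a single perturbation type: estimate an empirical distribution over perturbation levels from a small target set, then sample that distribution when perturbing training utterances. The SNR grid and the simulated target measurements are hypothetical stand-ins for the paper's estimation procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Candidate perturbation levels for one perturbation type, e.g.
# signal-to-noise ratios (dB) of an added background environment.
snr_levels = np.array([0, 5, 10, 15, 20])

# Step 1: from a small set of target-domain utterances, build an
# empirical distribution over the candidate levels by assigning each
# measured utterance to its nearest level.
target_snrs = rng.normal(loc=8.0, scale=4.0, size=100)  # assumed measurements
counts = np.zeros(len(snr_levels))
for s in target_snrs:
    counts[int(np.argmin(np.abs(snr_levels - s)))] += 1
level_dist = counts / counts.sum()

# Step 2: perturb each clean training utterance at a level drawn from
# that distribution, so the perturbed corpus mirrors the target domain.
def sample_level():
    return rng.choice(snr_levels, p=level_dist)

print("estimated level distribution:", dict(zip(snr_levels, level_dist.round(2))))
print("sampled level for one training utterance:", sample_level())
```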
Archive | 2013
Brian Strope; Francoise Beaufays; Olivier Siohan
conference of the international speech communication association | 2013
Olivier Siohan; Michiel Bacchiani
international conference on acoustics, speech, and signal processing | 2016
Olivier Siohan
Archive | 2016
Olga Kapralova; John Paul Alex; Eugene Weinstein; Pedro J. Moreno Mengibar; Olivier Siohan; Ignacio Lopez Moreno