Publication


Featured research published by Francoise Beaufays.


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2015

Learning acoustic frame labeling for speech recognition with recurrent neural networks

Hasim Sak; Andrew W. Senior; Kanishka Rao; Ozan Irsoy; Alex Graves; Francoise Beaufays; Johan Schalkwyk

We explore alternative acoustic modeling techniques for large vocabulary speech recognition using Long Short-Term Memory recurrent neural networks. For an acoustic frame labeling task, we compare the conventional approach of cross-entropy (CE) training using fixed forced alignments of frames and labels with the Connectionist Temporal Classification (CTC) method proposed for labeling unsegmented sequence data. We demonstrate that the latter can be implemented with finite state transducers. We experiment with phones and context-dependent HMM states as acoustic modeling units. We also investigate the effect of context in the acoustic input by training unidirectional and bidirectional LSTM RNN models. We show that a bidirectional LSTM RNN CTC model using phone units can perform as well as an LSTM RNN model trained with CE using HMM state alignments. Finally, we examine the effect of sequence-discriminative training on these models and present the first results for sMBR training of CTC models.
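
The contrast between the two training regimes is easy to make concrete. Below is a minimal sketch in PyTorch (an assumption; the paper's own implementation builds on finite state transducers) showing the same LSTM encoder trained either with cross-entropy against fixed frame alignments or with CTC against unsegmented phone sequences.

```python
# Minimal sketch (PyTorch, not the paper's implementation) contrasting
# cross-entropy training on fixed frame alignments with CTC training
# on unsegmented phone sequences, over the same LSTM encoder.
import torch
import torch.nn as nn

T, B, F, C = 200, 4, 40, 42          # frames, batch, features, labels (+1 blank for CTC)

lstm = nn.LSTM(input_size=F, hidden_size=320)
proj = nn.Linear(320, C)

feats = torch.randn(T, B, F)          # acoustic frames
hidden, _ = lstm(feats)
logits = proj(hidden)                 # (T, B, C)

# (a) Cross-entropy: every frame carries a forced-alignment label.
frame_labels = torch.randint(0, C, (T, B))
ce_loss = nn.CrossEntropyLoss()(logits.reshape(T * B, C), frame_labels.reshape(T * B))

# (b) CTC: only the phone sequence is given; the alignment is marginalized out.
targets = torch.randint(1, C, (B, 30))            # phone ids; 0 is reserved for blank
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 30, dtype=torch.long)
ctc_loss = nn.CTCLoss(blank=0)(logits.log_softmax(-1), targets, input_lengths, target_lengths)
```

The key difference is CTC's blank symbol, which lets the network defer label emissions and thereby removes the need for forced alignments.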


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2008

Deploying GOOG-411: Early lessons in data, measurement, and testing

Michiel Bacchiani; Francoise Beaufays; Johan Schalkwyk; Mike Schuster; Brian Strope

We describe our early experience building and optimizing GOOG-411, a fully automated, voice-enabled, business finder. We show how taking an iterative approach to system development allows us to optimize the various components of the system, thereby progressively improving user-facing metrics. We show the contributions of different data sources to recognition accuracy. For business listing language models, we see a nearly linear performance increase with the logarithm of the amount of training data. To date, we have improved our correct accept rate by 25% absolute, and increased our transfer rate by 35% absolute.
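
The log-linear scaling claim corresponds to a straight-line fit of accuracy against log(data size). The sketch below shows that fit in NumPy with invented numbers, purely for illustration; the paper does not publish this table.

```python
# Illustrative sketch (made-up numbers, not GOOG-411 data): "accuracy grows
# nearly linearly with the logarithm of the amount of training data" means
# a linear fit of accuracy against log10(corpus size).
import numpy as np

size = np.array([1e2, 1e3, 1e4, 1e5])          # hypothetical corpus sizes
accuracy = np.array([0.62, 0.70, 0.79, 0.86])  # hypothetical accuracies

slope, intercept = np.polyfit(np.log10(size), accuracy, deg=1)
print(f"~{slope:.3f} absolute accuracy gained per 10x more data")
```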


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2016

Personalized speech recognition on mobile devices

Ian McGraw; Rohit Prabhavalkar; Raziel Alvarez; Montse Gonzalez Arenas; Kanishka Rao; David Rybach; Ouais Alsharif; Hasim Sak; Alexander H. Gruenstein; Francoise Beaufays; Carolina Parada

We describe a large vocabulary speech recognition system that is accurate, has low latency, and yet has a small enough memory and computational footprint to run faster than real-time on a Nexus 5 Android smartphone. We employ a quantized Long Short-Term Memory (LSTM) acoustic model trained with connectionist temporal classification (CTC) to directly predict phoneme targets, and further reduce its memory footprint using an SVD-based compression scheme. Additionally, we minimize our memory footprint by using a single language model for both dictation and voice command domains, constructed using Bayesian interpolation. Finally, in order to properly handle device-specific information, such as proper names and other context-dependent information, we inject vocabulary items into the decoder graph and bias the language model on-the-fly. Our system achieves 13.5% word error rate on an open-ended dictation task, running with a median speed that is seven times faster than real-time.
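
As one illustration of the compression step, here is how an SVD-based low-rank factorization of a weight matrix works in NumPy; the layer sizes and rank are assumptions, not values from the paper.

```python
# Sketch of the SVD-based compression idea (assumed sizes): replace a dense
# weight matrix W (m x n) with low-rank factors U_k (m x k) and V_k (k x n),
# cutting parameters from m*n to k*(m + n).
import numpy as np

m, n, k = 1024, 320, 64                 # hypothetical layer dimensions and rank
W = np.random.randn(m, n)

U, s, Vt = np.linalg.svd(W, full_matrices=False)
U_k = U[:, :k] * s[:k]                  # fold singular values into one factor
V_k = Vt[:k, :]

W_approx = U_k @ V_k                    # applied as two smaller matmuls at inference
compression = (m * n) / (k * (m + n))
print(f"rank-{k} approximation, {compression:.1f}x fewer parameters")
```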


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2013

Language model verbalization for automatic speech recognition

Hasim Sak; Francoise Beaufays; Kaisuke Nakajima; Cyril Allauzen

Transcribing speech in properly formatted written language presents some challenges for automatic speech recognition systems. The difficulty arises from the conversion ambiguity between verbal and written language in both directions. Non-lexical vocabulary items such as numeric entities, dates, times, abbreviations and acronyms are particularly ambiguous. This paper describes a finite-state transducer based approach that improves proper transcription of these entities. The approach involves training a language model in the written language domain, and integrating verbal expansions of vocabulary items as a finite-state model into the decoding graph construction. We build an inverted finite-state transducer to map written vocabulary items to alternate verbal expansions using rewrite rules. Then, this verbalizer transducer is composed with the n-gram language model to obtain a verbalized language model, whose input labels are in the verbal language domain while output labels are in the written language domain. We show that the proposed approach is very effective in improving the recognition accuracy of numeric entities.
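
The core relation the verbalizer encodes can be shown with a toy table. The paper implements it as a finite-state transducer built from rewrite rules and composed with the n-gram language model; the plain-Python sketch below only illustrates the written-to-verbal mapping itself, with invented entries.

```python
# Pure-Python sketch of the written-to-verbal rewrite idea. The paper builds
# this as an FST and composes it with the n-gram LM; this toy table merely
# shows the one-to-many mapping the verbalizer captures.
VERBALIZATIONS = {
    "2013": ["two thousand thirteen", "twenty thirteen"],
    "3:30": ["three thirty", "half past three"],
    "dr.":  ["doctor", "drive"],
}

def verbal_expansions(written_token: str) -> list[str]:
    """Return the verbal-domain expansions of a written-domain token."""
    return VERBALIZATIONS.get(written_token.lower(), [written_token])

# A verbalized LM accepts verbal input symbols but emits written output:
for verbal in verbal_expansions("2013"):
    print(f"{verbal}  ->  2013")
```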


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2011

Recognizing English queries in Mandarin Voice Search

Hung-An Chang; Yun-hsuan Sung; Brian Strope; Francoise Beaufays

Recent improvements in speech recognition technology, along with increased computing power and bigger datasets, have considerably improved the state of the art in the field, making it possible for commercial apps such as Google Voice Search to serve users in their everyday mobile search needs. Deploying such systems in various countries has shown us the extent to which multilingualism is present in some cultures, and the need for better solutions to handle it in our speech recognition systems. In this paper, we describe a few early data sharing and model combination experiments we did to improve the recognition of English queries made to Mandarin Voice Search, in Taiwan. We obtained a 12% relative sentence accuracy improvement over a baseline system already including some support for English queries.


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2015

Long short term memory neural network for keyboard gesture decoding

Ouais Alsharif; Tom Ouyang; Francoise Beaufays; Shumin Zhai; Thomas M. Breuel; Johan Schalkwyk

Gesture typing is an efficient input method for phones and tablets using continuous traces created by a pointed object (e.g., finger or stylus). Translating such continuous gestures into textual input is a challenging task, as gesture inputs exhibit many features found in speech and handwriting, such as high variability, co-articulation, and elision. In this work, we address these challenges with a hybrid approach, combining a variant of recurrent networks, namely Long Short Term Memories [1], with conventional Finite State Transducer decoding [2]. Results using our approach show considerable improvement relative to a baseline shape-matching-based system, amounting to 4% and 22% absolute gains for small- and large-lexicon decoding, respectively, on real datasets, and 2% on a synthetic large-scale dataset.
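
A minimal sketch of the model side in PyTorch (an assumption): an LSTM reads a gesture trace as a sequence of sampled points and emits per-step character logits. The (x, y, dt) feature choice and output inventory are illustrative, and the paper decodes with a finite state transducer rather than the greedy argmax shown here.

```python
# Minimal sketch (PyTorch; assumed features and label set): an LSTM over a
# gesture trace of (x, y, dt) points produces character logits per step.
import torch
import torch.nn as nn

POINTS, CHARS = 60, 28                 # trace length; a-z + blank + apostrophe (assumed)

encoder = nn.LSTM(input_size=3, hidden_size=128, batch_first=True)
head = nn.Linear(128, CHARS)

trace = torch.randn(1, POINTS, 3)      # one gesture: (x, y, dt) per sampled point
states, _ = encoder(trace)
logits = head(states)                  # (1, POINTS, CHARS)

greedy = logits.argmax(dim=-1)         # FST lexicon decoding would replace this argmax
```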


IEEE Journal of Selected Topics in Signal Processing | 2015

A Real-Time End-to-End Multilingual Speech Recognition Architecture

Javier Gonzalez-Dominguez; David Eustis; Ignacio Lopez-Moreno; Andrew W. Senior; Francoise Beaufays; Pedro J. Moreno

Automatic speech recognition (ASR) systems are used daily by millions of people worldwide to dictate messages, control devices, initiate searches or to facilitate data input in small devices. The user experience in these scenarios depends on the quality of the speech transcriptions and on the responsiveness of the system. For multilingual users, a further obstacle to natural interaction is the monolingual character of many ASR systems, in which users are constrained to a single preset language. In this work, we present an end-to-end multi-language ASR architecture, developed and deployed at Google, that allows users to select arbitrary combinations of spoken languages. We leverage recent advances in language identification and a novel method of real-time language selection to achieve similar recognition accuracy and nearly-identical latency characteristics as a monolingual system.
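
Schematically, the real-time language selection might look like the following sketch; the `recognize` callables and `identify_language` function are hypothetical stand-ins, and the deployed system streams audio rather than passing complete utterances.

```python
# Schematic sketch (not Google's implementation) of real-time language
# selection: decode with one recognizer per candidate language in parallel,
# then keep the result whose language the identifier scores highest.
from concurrent.futures import ThreadPoolExecutor

def transcribe_multilingual(audio, recognizers: dict, identify_language):
    """recognizers maps a language code to a recognize(audio) callable."""
    with ThreadPoolExecutor() as pool:
        futures = {lang: pool.submit(rec, audio) for lang, rec in recognizers.items()}
        lid_scores = identify_language(audio)          # {lang: posterior}, hypothetical
        best_lang = max(futures, key=lambda lang: lid_scores.get(lang, 0.0))
        return best_lang, futures[best_lang].result()
```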


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2013

Language model capitalization

Francoise Beaufays; Brian Strope

In many speech recognition systems, capitalization is not an inherent component of the language model: training corpora are down-cased, and counts are accumulated for sequences of lower-cased words. This level of modeling is sufficient for automating voice commands or otherwise enabling users to communicate with a machine, but when the recognized speech is intended to be read by a person, such as in email dictation or even some web search applications, the lack of capitalization of the user's input can add an extra cognitive load on the reader. For these cases, speech recognition systems often post-process the recognized text to restore capitalization. We propose folding capitalization directly into the recognition language model. Instead of post-processing, we take the approach that language should be represented in all its richness, with capitalization, diacritics, and other special symbols. With that perspective, we describe a strategy to handle poorly capitalized or uncapitalized training corpora for language modeling. The resulting recognition system retains the accuracy/latency/memory tradeoff of our uncapitalized production recognizer, while providing properly cased outputs.
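
One simple strategy in this spirit, sketched below in plain Python (the paper's actual recipe is more involved): learn each word's dominant casing from well-formatted text, re-case noisy corpora with that table, and only then accumulate cased n-gram counts.

```python
# Toy sketch of one way to handle poorly capitalized training text: learn
# each word's most frequent casing from well-formatted text, then re-case
# noisy corpora before counting cased n-grams.
from collections import Counter, defaultdict

def casing_table(well_formatted_lines):
    counts = defaultdict(Counter)
    for line in well_formatted_lines:
        for token in line.split():
            counts[token.lower()][token] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def recase(line, table):
    return " ".join(table.get(t.lower(), t) for t in line.split())

table = casing_table(["Call John in New York", "email John tomorrow"])
print(recase("call john in new york", table))   # -> "Call John in New York"
```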


IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | 2013

Search results based N-best hypothesis rescoring with maximum entropy classification

Fuchun Peng; Scott Roy; Ben Shahshahani; Francoise Beaufays

We propose a simple yet effective method for improving speech recognition by reranking the N-best speech recognition hypotheses using search results. We model N-best reranking as a binary classification problem and select the hypothesis with the highest classification confidence. We use query-specific features extracted from the search results to encode domain knowledge, and use them with a maximum entropy classifier to rescore the N-best list. We show that even when rescoring only the top two hypotheses, we obtain a significant 3% absolute sentence accuracy (SACC) improvement over a strong baseline on production traffic from an entertainment domain.
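
Since maximum entropy classification over binary labels reduces to logistic regression, the recipe can be sketched with scikit-learn. The features below are invented stand-ins for the query-specific search-result features the paper describes.

```python
# Hedged sketch of N-best rescoring as binary classification, using
# scikit-learn logistic regression as the maximum entropy classifier.
# Feature columns are hypothetical: [ASR confidence, #search results,
# title-match score]; label 1 iff the hypothesis is correct.
import numpy as np
from sklearn.linear_model import LogisticRegression

X_train = np.array([[0.9, 120, 0.8], [0.7, 3, 0.1], [0.6, 90, 0.7], [0.8, 1, 0.0]])
y_train = np.array([1, 0, 1, 0])

maxent = LogisticRegression().fit(X_train, y_train)

def rescore(nbest_features):
    """Pick the hypothesis whose P(correct) is highest."""
    probs = maxent.predict_proba(nbest_features)[:, 1]
    return int(np.argmax(probs))

# Expected to favor the hypothesis backed by strong search results:
print(rescore(np.array([[0.85, 5, 0.2], [0.80, 150, 0.9]])))
```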


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2012

Recognition of multilingual speech in mobile applications

Hui Lin; Jui-ting Huang; Francoise Beaufays; Brian Strope; Yun-hsuan Sung

We evaluate different architectures to recognize multilingual speech for real-time mobile applications. In particular, we show that combining the results of several recognizers greatly outperforms other solutions such as training a single large multilingual system or using an explicit language identification system to select the appropriate recognizer. Experiments are conducted on a trilingual English-French-Mandarin mobile speech task. The data set includes Google searches, Maps queries, as well as more general inputs such as email and short message dictation. Without pre-specifying the input language, the combined system achieves comparable accuracy to that of the monolingual systems when the input language is known. The combined system is also roughly 5% absolute better than an explicit language identification approach, and 10% better than a single large multilingual system.
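
The winning architecture reduces to a simple selection rule, provided the per-recognizer confidences are calibrated to be comparable across systems, a nontrivial assumption this sketch glosses over.

```python
# Minimal sketch of the combination architecture the paper favors: run one
# recognizer per language and keep the best-scoring hypothesis, instead of
# routing through an explicit language-ID front end or one giant model.
def combine(audio, recognizers):
    """recognizers: {lang: callable returning (transcript, confidence)}."""
    results = {lang: rec(audio) for lang, rec in recognizers.items()}
    best_lang = max(results, key=lambda lang: results[lang][1])
    transcript, confidence = results[best_lang]
    return best_lang, transcript, confidence
```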
