Publications


Featured research published by Johan Schalkwyk.


International Conference on Implementation and Application of Automata | 2007

OpenFst: a general and efficient weighted finite-state transducer library

Cyril Allauzen; Michael Riley; Johan Schalkwyk; Wojciech Skut; Mehryar Mohri

We describe OpenFst, an open-source library for weighted finite-state transducers (WFSTs). OpenFst consists of a C++ template library with efficient WFST representations and over twenty-five operations for constructing, combining, optimizing, and searching them. At the shell-command level, there are corresponding transducer file representations and programs that operate on them. OpenFst is designed to be both very efficient in time and space and to scale to very large problems. The library has key applications in speech, image, and natural language processing, pattern and string matching, and machine learning. We give an overview of the library, examples of its use, and details of its design that allow customizing the labels, states, and weights, as well as the lazy evaluation of many of its operations. Further information and a download of the OpenFst library can be obtained from http://www.openfst.org.
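
To make the C++ template API concrete, here is a minimal sketch of building and serializing a small transducer through the public OpenFst interface; the labels, weight, and file name are arbitrary choices for illustration:

    #include <fst/fstlib.h>

    int main() {
      // Build a two-state transducer over the default tropical semiring.
      fst::StdVectorFst f;
      const auto s0 = f.AddState();
      const auto s1 = f.AddState();
      f.SetStart(s0);
      // One arc: input label 1, output label 2, weight 0.5, destination s1.
      f.AddArc(s0, fst::StdArc(1, 2, 0.5, s1));
      f.SetFinal(s1, fst::TropicalWeight::One());
      // Write a binary FST file that the shell-level tools can also read.
      f.Write("example.fst");
      return 0;
    }

The same transducer can also be assembled at the shell level from a text description with fstcompile and inspected with fstprint, mirroring the file-level programs mentioned in the abstract.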


Archive | 2010

“Your Word is my Command”: Google Search by Voice: A Case Study

Johan Schalkwyk; Doug Beeferman; Françoise Beaufays; Bill Byrne; Ciprian Chelba; Mike Cohen; Maryam Kamvar; Brian Strope

An important goal at Google is to make spoken access ubiquitously available. Achieving ubiquity requires two things: availability (i.e., built into every possible interaction where speech input or output can make sense) and performance (i.e., works so well that the modality adds no friction to the interaction).


International Conference on Acoustics, Speech, and Signal Processing | 2015

Learning acoustic frame labeling for speech recognition with recurrent neural networks

Hasim Sak; Andrew W. Senior; Kanishka Rao; Ozan Irsoy; Alex Graves; Françoise Beaufays; Johan Schalkwyk

We explore alternative acoustic modeling techniques for large vocabulary speech recognition using Long Short-Term Memory recurrent neural networks. For an acoustic frame labeling task, we compare the conventional approach of cross-entropy (CE) training using fixed forced alignments of frames and labels with the Connectionist Temporal Classification (CTC) method proposed for labeling unsegmented sequence data. We demonstrate that the latter can be implemented with finite-state transducers. We experiment with phones and context-dependent HMM states as acoustic modeling units. We also investigate the effect of context in the acoustic input by training unidirectional and bidirectional LSTM RNN models. We show that a bidirectional LSTM RNN CTC model using phone units can perform as well as an LSTM RNN model trained with CE using HMM state alignments. Finally, we show the effect of sequence discriminative training on these models and present the first results for sMBR training of CTC models.
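
As an illustration of the remark that CTC can be implemented with finite-state transducers, the sketch below builds a CTC-style collapsing topology for a single output unit using OpenFst; the label ids and exact topology here are assumptions made for the example, not the construction used in the paper:

    #include <fst/fstlib.h>

    // Accepts frame-level CTC label sequences for one output unit
    // (optional blanks, the unit possibly repeated, trailing blanks)
    // and emits the collapsed unit exactly once.
    // Label ids are arbitrary here: 0 = epsilon, 1 = <blank>, 2 = the unit.
    fst::StdVectorFst CtcTopologyForOneUnit() {
      using fst::StdArc;
      const int kEps = 0, kBlank = 1, kUnit = 2;
      fst::StdVectorFst t;
      const auto start = t.AddState();  // leading blanks
      const auto emit = t.AddState();   // unit seen, absorb repeats/blanks
      t.SetStart(start);
      t.AddArc(start, StdArc(kBlank, kEps, 0.0, start));  // skip blanks
      t.AddArc(start, StdArc(kUnit, kUnit, 0.0, emit));   // first unit frame
      t.AddArc(emit, StdArc(kUnit, kEps, 0.0, emit));     // repeated frames
      t.AddArc(emit, StdArc(kBlank, kEps, 0.0, emit));    // trailing blanks
      t.SetFinal(emit, fst::TropicalWeight::One());
      return t;
    }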


International Conference on Acoustics, Speech, and Signal Processing | 2008

Deploying GOOG-411: Early lessons in data, measurement, and testing

Michiel Bacchiani; Françoise Beaufays; Johan Schalkwyk; Mike Schuster; Brian Strope

We describe our early experience building and optimizing GOOG-411, a fully automated, voice-enabled business finder. We show how taking an iterative approach to system development allows us to optimize the various components of the system, thereby progressively improving user-facing metrics. We show the contributions of different data sources to recognition accuracy. For business listing language models, we see a nearly linear performance increase with the logarithm of the amount of training data. To date, we have improved our correct-accept rate by 25% absolute and increased our transfer rate by 35% absolute.


Spoken Language Technology Workshop | 2010

Query language modeling for voice search

Ciprian Chelba; Johan Schalkwyk; Thorsten Brants; Vida Ha; Boulos Harb; Will Neveitt; Carolina Parada; Peng Xu

The paper presents an empirical exploration of language modeling for the google.com query stream. We describe the normalization of the typed query stream, resulting in out-of-vocabulary (OOV) rates below 1% for a one-million-word vocabulary. We present a comprehensive set of experiments that guided the design decisions for a voice search service. In the process we rediscovered a lesser-known interaction between Kneser-Ney smoothing and entropy pruning, and found empirical evidence that hints at non-stationarity of the query stream, as well as a strong dependence on the English locale (USA, Britain, and Australia).
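
For context, the interpolated Kneser-Ney estimate referred to above has the standard textbook form (not reproduced from the paper):

    \[
      P_{\mathrm{KN}}(w \mid h) = \frac{\max\{c(h,w) - D,\, 0\}}{\sum_{w'} c(h,w')} + \lambda(h)\, P_{\mathrm{KN}}(w \mid h'),
    \]

where h' is the history h with its oldest word dropped, D is a discount, and \(\lambda(h)\) normalizes the distribution. Because the lower-order distributions are estimated from continuation counts rather than raw counts, they make poor back-off estimates for the higher-order n-grams that entropy pruning removes; this is the commonly cited reason the two techniques interact badly.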


International Conference on Implementation and Application of Automata | 2010

Filters for efficient composition of weighted finite-state transducers

Cyril Allauzen; Michael Riley; Johan Schalkwyk

This paper describes a weighted finite-state transducer composition algorithm that generalizes the concept of the composition filter and presents various filters that process epsilon transitions, look ahead along paths, and push forward labels along epsilon paths. These filters, either individually or in combination, make it possible to compose some transducers much more efficiently in time and space than otherwise possible. We present examples of this drawn, in part, from demanding speech-processing applications. The generalized composition algorithm and many of these filters have been included in OpenFst, an open-source weighted transducer library.
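
For reference, ordinary composition through OpenFst's public interface looks like the sketch below; the specialized epsilon, look-ahead, and label-pushing filters described in the paper are selected through the options of the lazy ComposeFst class rather than this convenience call. The function name and arc-sorting recipe are standard OpenFst usage, not code from the paper:

    #include <fst/fstlib.h>

    // Compose two transducers: a's output labels must line up with b's
    // input labels, so the usual recipe arc-sorts each side accordingly.
    fst::StdVectorFst ComposePair(fst::StdVectorFst a, fst::StdVectorFst b) {
      fst::ArcSort(&a, fst::OLabelCompare<fst::StdArc>());
      fst::ArcSort(&b, fst::ILabelCompare<fst::StdArc>());
      fst::StdVectorFst result;
      // The default composition filter handles epsilon transitions.
      fst::Compose(a, b, &result);
      return result;
    }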


International Conference on Acoustics, Speech, and Signal Processing | 2015

Long short term memory neural network for keyboard gesture decoding

Ouais Alsharif; Tom Ouyang; Françoise Beaufays; Shumin Zhai; Thomas M. Breuel; Johan Schalkwyk

Gesture typing is an efficient input method for phones and tablets that uses continuous traces created by a pointed object (e.g., a finger or stylus). Translating such continuous gestures into textual input is a challenging task, as gesture inputs exhibit many features found in speech and handwriting, such as high variability, co-articulation, and elision. In this work, we address these challenges with a hybrid approach, combining a variant of recurrent networks, namely Long Short-Term Memories [1], with conventional finite-state transducer decoding [2]. Results using our approach show considerable improvement relative to a baseline shape-matching-based system, amounting to 4% and 22% absolute improvement for small- and large-lexicon decoding, respectively, on real datasets, and 2% on a synthetic large-scale dataset.


Archive | 2013

Empirical Exploration of Language Modeling for the google.com Query Stream as Applied to Mobile Voice Search

Ciprian Chelba; Johan Schalkwyk

Mobile is poised to become the predominant platform over which people access the World Wide Web. Recent developments in speech recognition and understanding, backed by high-bandwidth coverage and high-quality speech signal acquisition on smartphones and tablets, are presenting users with the choice of speaking their web search queries instead of typing them. A critical component of a speech recognition system targeting web search is the language model. The chapter presents an empirical exploration of the google.com query stream with the end goal of high-quality statistical language modeling for mobile voice search. Our experiments show that after text normalization the query stream is not as “wild” as it seems at first sight. One can achieve out-of-vocabulary rates below 1% using a one-million-word vocabulary, and excellent n-gram hit ratios of 77% and 88% even at high orders such as n = 5 and n = 4, respectively. A more careful analysis shows that a significantly larger vocabulary (approximately 10 million words) may be required to guarantee at most a 1% out-of-vocabulary rate for a large percentage (95%) of users. Using large-scale, distributed language models can improve performance significantly, with up to 10% relative reductions in word error rate over conventional models used in speech recognition. We also find that the query stream is non-stationary, which means that adding more past training data beyond a certain point provides diminishing returns and may even degrade performance slightly. Perhaps less surprisingly, we have shown that locale matters significantly for English query data across the USA, Great Britain, and Australia. In an attempt to leverage the speech data in voice search logs, we successfully build large-scale discriminative n-gram language models and derive small but significant gains in recognition performance.


International Conference on Acoustics, Speech, and Signal Processing | 2009

Mobile media search

Berna Erol; Jordan Cohen; Minoru Etoh; Hsiao-Wuen Hon; Jiebo Luo; Johan Schalkwyk

This panel paper presents motivations for discussing mobile media search and contains statements from the panelists, who are industry research leaders in this field.


International Journal of Foundations of Computer Science | 2011

A Filter-based Algorithm for Efficient Composition of Finite-State Transducers

Cyril Allauzen; Michael Riley; Johan Schalkwyk

This paper describes a weighted finite-state transducer composition algorithm that generalizes the concept of the composition filter and presents various filters that process epsilon transitions, look ahead along paths, and push forward labels along epsilon paths. These filters, either individually or in combination, make it possible to compose some transducers much more efficiently in time and space than otherwise possible. We present examples of this drawn, in part, from demanding speech-processing applications. The generalized composition algorithm and many of these filters have been included in OpenFst, an open-source weighted transducer library.
