Cyril Allauzen | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Cyril Allauzen is active.

Explore More

Publication

Featured researches published by Cyril Allauzen.

international conference on implementation and application of automata | 2007

OpenFst: a general and efficient weighted finite-state transducer library

Cyril Allauzen; Michael Riley; Johan Schalkwyk; Wojciech Skut; Mehryar Mohri

We describe OpenFst, an open-source library for weighted finite-state transducers (WFSTs). OpenFst consists of a C++ template library with efficient WFST representations and over twenty-five operations for constructing, combining, optimizing, and searching them. At the shell-command level, there are corresponding transducer file representations and programs that operate on them. OpenFst is designed to be both very efficient in time and space and to scale to very large problems. This library has key applications speech, image, and natural language processing, pattern and string matching, and machine learning. We give an overview of the library, examples of its use, details of its design that allow customizing the labels, states, and weights and the lazy evaluation of many of its operations. Further information and a download of the OpenFst library can be obtained from http://www.openfst.org.

conference on current trends in theory and practice of informatics | 1999

Factor Oracle: A New Structure for Pattern Matching

Cyril Allauzen; Maxime Crochemore; Mathieu Raffinot

We introduce a new automaton on a word p, sequence of letters taken in an alphabet ?, that we call factor oracle. This automaton is acyclic, recognizes at least the factors of p, has m+1 states and a linear number of transitions. We give an on-line construction to build it. We use this new structure in string matching algorithms that we conjecture optimal according to the experimental results. These algorithms are as efficient as the ones that already exist using less memory and being more easy to implement.

meeting of the association for computational linguistics | 2003

Generalized Algorithms for Constructing Statistical Language Models

Cyril Allauzen; Mehryar Mohri; Brian Roark

Recent text and speech processing applications such as speech mining raise new and more general problems related to the construction of language models. We present and describe in detail several new and efficient algorithms to address these more general problems and report experimental results demonstrating their usefulness. We give an algorithm for computing efficiently the expected counts of any sequence in a word lattice output by a speech recognizer or any arbitrary weighted automaton; describe a new technique for creating exact representations of n-gram language models by weighted automata whose size is practical for offline use even for a vocabulary size of about 500,000 words and an n-gram order n = 6; and present a simple and more general technique for constructing class-based language models that allows each class to represent an arbitrary weighted automaton. An efficient implementation of our algorithms and techniques has been incorporated in a general software library for language modeling, the GRM Library, that includes many other text and grammar processing functionalities.

north american chapter of the association for computational linguistics | 2004

General indexation of weighted automata: application to spoken utterance retrieval

Cyril Allauzen; Mehryar Mohri; Murat Saraclar

Much of the massive quantities of digitized data widely available, e.g., text, speech, hand-written sequences, are either given directly, or, as a result of some prior processing, as weighted automata. These are compact representations of a large number of alternative sequences and their weights reflecting the uncertainty or variability of the data. Thus, the indexation of such data requires indexing weighted automata. We present a general algorithm for the indexation of weighted automata. The resulting index is represented by a deterministic weighted transducer that is optimal for search: the search for an input string takes time linear in the sum of the size of that string and the number of indices of the weighted automata where it appears. We also introduce a general framework based on weighted transducers that generalizes this indexation to enable the search for more complex patterns including syntactic information or for different types of sequences, e.g., word sequences instead of phonemic sequences. The use of this framework is illustrated with several examples. We applied our general indexation algorithm and framework to the problem of indexation of speech utterances and report the results of our experiments in several tasks demonstrating that our techniques yield comparable results to previous methods, while providing greater generality, including the possibility of searching for arbitrary patterns represented by weighted automata.

international conference on acoustics, speech, and signal processing | 2005

The AT&T WATSON speech recognizer

Vincent Goffin; Cyril Allauzen; Enrico Bocchieri; Dilek Hakkani-Tür; Andrej Ljolje; Sarangarajan Parthasarathy; Mazin G. Rahim; Giuseppe Riccardi; Murat Saraclar

This paper describes the AT&T WATSON real-time speech recognizer, the product of several decades of research at AT&T. The recognizer handles a wide range of vocabulary sizes and is based on continuous-density hidden Markov models for acoustic modeling and finite state networks for language modeling. The recognition network is optimized for efficient search. We identify the algorithms used for high-accuracy, real-time and low-latency recognition. We present results for small and large vocabulary tasks taken from the AT&T VoiceTone/sup /spl reg// service, showing word accuracy improvement of about 5% absolute and real-time processing speed-up by a factor between 2 and 3.

international conference on acoustics, speech, and signal processing | 2004

A generalized construction of integrated speech recognition transducers

Cyril Allauzen; Mehryar Mohri; Michael Riley; Brian Roark

We showed in previous work that weighted finite-state transducers provide a common representation for many components of a speech recognition system and described general algorithms for combining these representations to build a single optimized and compact transducer integrating all these components, directly mapping from HMM states to words. This approach works well for certain well-controlled input transducers, but presents some problems related to the efficiency of composition and the applicability of determinization and weight-pushing with more general transducers. We generalize our prior construction of the integrated speech recognition transducer to work with an arbitrary number of component transducers and, to a large extent, release the constraints imposed on the type of input transducers by providing more general solutions to these problems. This generalization allowed us to deal with cases where our prior optimization did not apply. Our experiments in the AT&T HMIHY 0300 task and an AT&T VoiceTone task show the efficiency of our generalized optimization technique. We report a 1.6 recognition speed-up in the HMIHY 0300 task, 1.8 speed-up in a VoiceTone task using a word-based language model, and 1.7 using a class-based model.

combinatorial pattern matching | 2001

Efficient Experimental String Matching by Weak Factor Recognition

Cyril Allauzen; Maxime Crochemore; Mathieu Raffinot

We introduce a new notion of weak factor recognition that is the foundation of new data structures and on-line string matching algorithms. We define a new automaton built on a string p = p1p2 ... pm that acts like an oracle on the set of factors pi ... pj. If a string is recognized by this automaton, it may be a factor of p. But, if it is rejected, it is surely not a factor. We call it factor oracle. More precisely, this automaton is acyclic, recognizes at least the factors of p, has m+ 1 states and a linear number of transitions. We give a very simple sequential construction algorithm to build it. Using this automaton, we design an efficient experimental on-line string matching algorithm (we conjecture its optimality in regard to the experimental results) that is really simple to implement. We also extend the factor oracle to predict that a string could be a suffix (i.e. in the set pi ... pm) of p. We obtain the suffix oracle, that enables in some cases a tricky improvement of the previous string matching algorithm.

mathematical foundations of computer science | 2006

A unified construction of the glushkov, follow, and antimirov automata

Cyril Allauzen; Mehryar Mohri

A number of different techniques have been introduced in the last few decades to create e-free automata representing regular expressions such as the Glushkov automata, follow automata, or Antimirov automata. This paper presents a simple and unified view of all these construction methods both for unweighted and weighted regular expressions. It describes simpler algorithms with time complexities at least as favorable as that of the best previously known techniques, and provides a concise proof of their correctness. Our algorithms are all based on two standard automata operations: epsilon-removal and minimization. This contrasts with the multitude of complicated and special-purpose techniques previously described in the literature, and makes it straightforward to generalize these algorithms to the weighted case. In particular, we extend the definition and construction of follow automata to the case of weighted regular expressions over a closed semiring and present the first algorithm to compute weighted Antimirov automata.

meeting of the association for computational linguistics | 2004

Statistical Modeling for Unit Selection in Speech Synthesis

Mehryar Mohri; Cyril Allauzen; Michael Riley

Traditional concatenative speech synthesis systems use a number of heuristics to define the target and concatenation costs, essential for the design of the unit selection component. In contrast to these approaches, we introduce a general statistical modeling framework for unit selection inspired by automatic speech recognition. Given appropriate data, techniques based on that framework can result in a more accurate unit selection, thereby improving the general quality of a speech synthesizer. They can also lead to a more modular and a substantially more efficient system.We present a new unit selection system based on statistical modeling. To overcome the original absence of data, we use an existing high-quality unit selection system to generate a corpus of unit sequences. We show that the concatenation cost can be accurately estimated from this corpus using a statistical n-gram language model over units. We used weighted automata and transducers for the representation of the components of the system and designed a new and more efficient composition algorithm making use of string potentials for their combination. The resulting statistical unit selection is shown to be about 2.6 times faster than the last release of the AT&T Natural Voices Product while preserving the same quality, and offers much flexibility for the use and integration of new and more complex components.

International Journal of Foundations of Computer Science | 2005

THE DESIGN PRINCIPLES AND ALGORITHMS OF A WEIGHTED GRAMMAR LIBRARY

Cyril Allauzen; Mehryar Mohri; Brian Roark

We present the software design principles, algorithms, and utilities of a general weighted grammar library, the GRM Library, that can be used in a variety of applications in text, speech, and biosequence processing. Several of the algorithms and utilities of this library are described, including in some cases their pseudocodes and pointers to their use in applications. The algorithms and the utilities were designed to support a wide variety of semirings and the representation and use of large grammars and automata of several hundred million rules or transitions.

Explore More