
Publication


Featured research published by Ciprian Chelba.


Computer Speech & Language | 2000

Structured language modeling

Ciprian Chelba; Frederick Jelinek

This paper presents an attempt at using the syntactic structure in natural language for improved language models for speech recognition. The structured language model merges techniques in automatic parsing and language modeling using an original probabilistic parameterization of a shift-reduce parser. A maximum likelihood re-estimation procedure belonging to the class of expectation-maximization algorithms is employed for training the model. Experiments on the Wall Street Journal and Switchboard corpora show improvement in both perplexity and word error rate (word lattice rescoring) over the standard 3-gram language model.
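
For readers unfamiliar with the model, a simplified sketch of its factorization follows; the notation (h_{-1}, h_{-2} for the two most recently exposed headwords, \pi_k for the parser moves taken after word w_k) is assumed here for illustration and is not taken from the abstract:

    P(W, T) = \prod_k P(w_k \mid h_{-1}, h_{-2}) \; P(t_k \mid w_k, h_{-1}, h_{-2}) \; P(\pi_k \mid h_{-1}, h_{-2})
    P(W)    = \sum_T P(W, T)

The three factors correspond to a word predictor, a tagger, and the shift-reduce parser component; the word-level probability P(W) follows by marginalizing over parses T.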


Meeting of the Association for Computational Linguistics | 1998

Exploiting Syntactic Structure for Language Modeling

Ciprian Chelba; Frederick Jelinek

The paper presents a language model that develops syntactic structure and uses it to extract meaningful information from the word history, thus enabling the use of long-distance dependencies. The model assigns a probability to every joint sequence of words and binary parse structure with headword annotation, and operates in a left-to-right manner, which makes it usable for automatic speech recognition. The model, its probabilistic parameterization, and a set of experiments meant to evaluate its predictive power are presented; an improvement over standard trigram modeling is achieved.
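
As a rough illustration of the left-to-right operation and headword annotation described above, the Python sketch below keeps a stack of partial constituents with exposed headwords and applies the kinds of parser moves usually associated with this model (shift, adjoin-left, adjoin-right); the class and function names are illustrative assumptions, not the authors' implementation.

    class Constituent:
        """A partial parse node; 'headword' is the word exposed to the predictor."""
        def __init__(self, headword, children=()):
            self.headword = headword
            self.children = list(children)

    def shift(stack, word):
        # push the newly predicted word as a one-word constituent
        stack.append(Constituent(word))

    def adjoin_left(stack):
        # merge the top two constituents; the left child's headword percolates up
        right, left = stack.pop(), stack.pop()
        stack.append(Constituent(left.headword, [left, right]))

    def adjoin_right(stack):
        # merge the top two constituents; the right child's headword percolates up
        right, left = stack.pop(), stack.pop()
        stack.append(Constituent(right.headword, [left, right]))

    def exposed_heads(stack):
        # the two most recently exposed headwords condition the next-word prediction
        heads = [c.headword for c in stack[-2:]]
        return ["<s>"] * (2 - len(heads)) + heads

    stack = []
    shift(stack, "the"); shift(stack, "contract")
    adjoin_right(stack)            # "the contract" is headed by "contract"
    shift(stack, "ended")
    print(exposed_heads(stack))    # ['contract', 'ended']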


IEEE Signal Processing Magazine | 2008

Retrieval and browsing of spoken content

Ciprian Chelba; Timothy J. Hazen; Murat Saraclar

Ever-increasing computing power and connectivity bandwidth, together with falling storage costs, are resulting in an overwhelming amount of data of various types being produced, exchanged, and stored. Consequently, information search and retrieval has emerged as a key application area. Text-based search is the most active area, with applications that range from Web and local network search to searching for personal information residing on one's own hard drive. Speech search has received less attention, perhaps because large collections of spoken material have previously not been available. However, with cheaper storage and increased broadband access, there has been a corresponding increase in the availability of online spoken audio content such as news broadcasts, podcasts, and academic lectures. A variety of personal and commercial uses also exist. As data availability increases, the lack of adequate technology for processing spoken documents becomes the limiting factor to large-scale access to spoken content. In this article, we strive to discuss the technical issues involved in the development of information retrieval systems for spoken audio documents, concentrating on the issue of handling the errorful or incomplete output provided by ASR systems. We focus on the usage case where a user enters search terms into a search engine and is returned a collection of spoken document hits.


Archive | 2010

“Your Word is my Command”: Google Search by Voice: A Case Study

Johan Schalkwyk; Doug Beeferman; Françoise Beaufays; Bill Byrne; Ciprian Chelba; Mike Cohen; Maryam Kamvar; Brian Strope

An important goal at Google is to make spoken access ubiquitously available. Achieving ubiquity requires two things: availability (i.e., built into every possible interaction where speech input or output can make sense) and performance (i.e., works so well that the modality adds no friction to the interaction).


IEEE Automatic Speech Recognition and Understanding Workshop | 2003

Is word error rate a good indicator for spoken language understanding accuracy

Ye-Yi Wang; Alex Acero; Ciprian Chelba

It is conventional wisdom in the speech community that better speech recognition accuracy is a good indicator of better spoken language understanding accuracy, given a fixed understanding component. The findings in this work reveal that this is not always the case. More important than reducing word error rate, the language model used for recognition should be trained to match the optimization objective for understanding. In this work, we applied a spoken language understanding model as the language model in speech recognition. The model was obtained with an example-based learning algorithm that optimized the understanding accuracy. Although the resulting speech recognition word error rate is 46% higher than that of the trigram model, the overall slot understanding error can be reduced by as much as 17%.
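
A tiny, hypothetical illustration of why the two metrics can diverge (the utterance and slot values below are invented, not data from the paper): a hypothesis can drop filler words, hurting word error rate, while still preserving every slot the understanding component needs.

    def edit_distance(ref, hyp):
        """Levenshtein distance between two token sequences."""
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
        return d[len(ref)][len(hyp)]

    def error_rate(ref, hyp):
        return edit_distance(ref, hyp) / max(len(ref), 1)

    ref_words = "show me flights from seattle to boston tomorrow morning please".split()
    hyp_words = "show flights seattle to boston tomorrow morning".split()

    ref_slots = [("origin", "seattle"), ("destination", "boston"), ("date", "tomorrow")]
    hyp_slots = [("origin", "seattle"), ("destination", "boston"), ("date", "tomorrow")]

    print("word error rate:", error_rate(ref_words, hyp_words))   # 0.3, despite...
    print("slot error rate:", error_rate(ref_slots, hyp_slots))   # 0.0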


Meeting of the Association for Computational Linguistics | 2005

Position Specific Posterior Lattices for Indexing Speech

Ciprian Chelba; Alex Acero

The paper presents the Position Specific Posterior Lattice, a novel representation of automatic speech recognition lattices that naturally lends itself to efficient indexing of position information and subsequent relevance ranking of spoken documents using proximity. In experiments performed on a collection of lecture recordings --- MIT iCampus data --- the spoken document ranking accuracy was improved by 20% relative over the commonly used baseline of indexing the 1-best output from an automatic speech recognizer. The Mean Average Precision (MAP) increased from 0.53 when using 1-best output to 0.62 when using the new lattice representation. The reference used for evaluation is the output of a standard retrieval engine working on the manual transcription of the speech collection. Albeit lossy, the PSPL lattice is also much more compact than the ASR 3-gram lattice from which it is computed --- which translates into a reduced inverted index size as well --- at virtually no degradation in word-error-rate performance. Since new paths are introduced in the lattice, the oracle accuracy increases over the original ASR lattice.
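
The abstract does not spell out the construction, so the sketch below is only an approximation: it builds a position-specific posterior table from an n-best list instead of a full lattice (the real PSPL accumulates path posteriors over the entire lattice), with assumed data structures.

    # Approximate sketch: accumulate, for each word position n, the posterior mass
    # of every word hypothesized at that position. An n-best list stands in for
    # the ASR lattice only to keep the example short.
    from collections import defaultdict

    def build_pspl(nbest):
        """nbest: list of (posterior, word_sequence) pairs whose posteriors sum to ~1."""
        pspl = defaultdict(lambda: defaultdict(float))  # position -> word -> posterior
        for posterior, words in nbest:
            for position, word in enumerate(words):
                pspl[position][word] += posterior
        return pspl

    nbest = [
        (0.6, ["position", "specific", "posterior", "lattices"]),
        (0.3, ["position", "specific", "posterior", "lattice"]),
        (0.1, ["positions", "specific", "posterior", "lattices"]),
    ]
    pspl = build_pspl(nbest)
    print(pspl[0]["position"])   # 0.9
    print(pspl[3]["lattices"])   # 0.7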


Computer Speech & Language | 2007

Soft indexing of speech content for search in spoken documents

Ciprian Chelba; Jorge F. Silva; Alex Acero

The paper presents the Position Specific Posterior Lattice (PSPL), a novel lossy representation of automatic speech recognition lattices that naturally lends itself to efficient indexing and subsequent relevance ranking of spoken documents. This technique explicitly takes content uncertainty into consideration by means of soft hits. Indexing position information allows one to approximate N-gram expected counts and at the same time use more general proximity features in the relevance score calculation. In fact, one can easily port any state-of-the-art text-retrieval algorithm to the scenario of indexing ASR lattices for spoken documents, rather than using the 1-best recognition result. Experiments performed on a collection of lecture recordings (MIT iCampus database) show that the spoken document ranking performance was improved by 17-26% relative over the commonly used baseline of indexing the 1-best output from an automatic speech recognizer (ASR). The paper also addresses the problem of integrating speech and text content sources for the document search problem, as well as its usefulness from an ad hoc retrieval (keyword search) point of view. In this context, the PSPL formulation is naturally extended to deal with both speech and text content for a given document, and a new relevance ranking framework is proposed for integrating the different sources of information available. Experimental results on the MIT iCampus corpus show a relative improvement of 302% in Mean Average Precision (MAP) when using speech content and text-only metadata as opposed to just text-only metadata (which constitutes about 1% of the amount of data in the transcription of the speech content, measured in number of words). Further experiments show that even in scenarios where the metadata size is artificially augmented so that it contains more than 10% of the spoken document transcription, the speech content still provides significant MAP gains with respect to using only the text metadata for relevance ranking.
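
One common form of the PSPL relevance score, reconstructed here from the PSPL line of work rather than quoted from this abstract (w_N are assumed tunable interpolation weights, P(w, s | D) the position-specific posterior of word w at position s in document D):

    S_N(D, Q) = \sum_{i=1}^{|Q| - N + 1} \log \Big( 1 + \sum_{s} \prod_{r=0}^{N-1} P(q_{i+r}, s + r \mid D) \Big)
    S(D, Q)   = \sum_{N=1}^{N_{\max}} w_N \, S_N(D, Q)

The inner sum over start positions s is the expected count of the query N-gram in the document, so exact phrase matches and near-misses both contribute soft hits to the score.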


International Conference on Acoustics, Speech, and Signal Processing | 2009

An audio indexing system for election video material

Christopher Alberti; Michiel Bacchiani; Ari Bezman; Ciprian Chelba; Anastassia Drofa; Hank Liao; Pedro J. Moreno; Ted Power; Arnaud Sahuguet; Maria Shugrina; Olivier Siohan

In the 2008 presidential election race in the United States, the prospective candidates made extensive use of YouTube to post video material. We developed a scalable system that transcribes this material, makes the content searchable (by indexing the meta-data and transcripts of the videos), and allows the user to navigate through the video material based on content. The system is available as an iGoogle gadget as well as a Labs product (labs.google.com/gaudi). Given the large exposure, special emphasis was put on the scalability and reliability of the system. This paper describes the design and implementation of this system.


Language and Technology Conference | 2006

Towards Spoken-Document Retrieval for the Internet: Lattice Indexing For Large-Scale Web-Search Architectures

Zheng-Yu Zhou; Peng Yu; Ciprian Chelba; Frank Seide

Large-scale web-search engines are generally designed for linear text. The linear text representation is suboptimal for audio search, where accuracy can be significantly improved if the search includes alternate recognition candidates, commonly represented as word lattices. This paper proposes a method for indexing word lattices that is suitable for large-scale web-search engines, requiring only limited code changes. The proposed method, called Time-based Merging for Indexing (TMI), first converts the word lattice to a posterior-probability representation and then merges word hypotheses with similar time boundaries to reduce the index size. Four alternative approximations are presented, which differ in index size and the strictness of the phrase-matching constraints. Results are presented for three types of typical web audio content (podcasts, video clips, and online lectures) for phrase spotting and relevance ranking. Using TMI indexes that are only five times larger than corresponding linear-text indexes, phrase spotting was improved over searching top-1 transcripts by 25-35%, and relevance ranking by 14%, at only a small loss compared to unindexed lattice search.
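
The sketch below illustrates only the general idea of time-based merging with assumed data structures and an assumed tolerance; the paper's exact merging rule and its four approximations are not reproduced here.

    # Rough illustration: lattice word hypotheses carry posterior probabilities;
    # hypotheses of the same word whose time midpoints fall close together are
    # collapsed into a single indexed entry, summing their posteriors.

    def merge_by_time(hyps, tolerance=0.05):
        """hyps: list of dicts {word, start, end, posterior}, times in seconds."""
        merged = []
        for h in sorted(hyps, key=lambda h: (h["word"], (h["start"] + h["end"]) / 2)):
            mid = (h["start"] + h["end"]) / 2
            last = merged[-1] if merged else None
            if last and last["word"] == h["word"] and abs(last["mid"] - mid) <= tolerance:
                last["posterior"] += h["posterior"]   # collapse near-duplicate hypotheses
            else:
                merged.append({"word": h["word"], "mid": mid, "posterior": h["posterior"]})
        return merged

    lattice_hyps = [
        {"word": "lattice", "start": 1.20, "end": 1.55, "posterior": 0.40},
        {"word": "lattice", "start": 1.22, "end": 1.58, "posterior": 0.35},
        {"word": "lettuce", "start": 1.21, "end": 1.57, "posterior": 0.25},
    ]
    print(merge_by_time(lattice_hyps))   # two entries: "lattice" (0.75) and "lettuce" (0.25)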


IEEE Transactions on Speech and Audio Processing | 2002

Distributed speech processing in miPad's multimodal user interface

Li Deng; Kuansan Wang; Alex Acero; Hsiao-Wuen Hon; Jasha Droppo; Constantinos Boulis; Ye-Yi Wang; Derek Jacoby; Milind Mahajan; Ciprian Chelba; Xuedong Huang

This paper describes the main components of MiPad (multimodal interactive PAD) and especially its distributed speech processing aspects. MiPad is a wireless mobile PDA prototype that enables users to accomplish many common tasks using a multimodal spoken language interface and wireless-data technologies. It fully integrates continuous speech recognition and spoken language understanding, and provides a novel solution for data entry in PDAs or smart phones, often done by pecking with tiny styluses or typing on minuscule keyboards. Our user study indicates that the throughput of MiPad is significantly superior to that of the existing pen-based PDA interface. Acoustic modeling and noise robustness in distributed speech recognition are key components in MiPad's design and implementation. In a typical scenario, the user speaks to the device at a distance so that he or she can see the screen. The built-in microphone thus picks up a lot of background noise, which requires MiPad to be noise-robust. For complex tasks, such as dictating e-mails, resource limitations demand the use of a client-server (peer-to-peer) architecture, where the PDA performs primitive feature extraction, feature quantization, and error protection, while the features transmitted to the server are subject to further speech feature enhancement, speech decoding, and understanding before a dialog is carried out and actions are rendered. Noise robustness can be achieved at the client, at the server, or both. Various speech processing aspects of this type of distributed computation as related to MiPad's potential deployment are presented. Previous user interface study results are also described. Finally, we point out future research directions related to several key MiPad functionalities.
