Izhak Shafran
University of Washington
Publications
Featured research published by Izhak Shafran.
international conference on acoustics, speech, and signal processing | 2003
Izhak Shafran; Richard C. Rose
This paper provides a solution for robust speech detection that can be applied across a variety of tasks. The solution is based on an algorithm that performs non-parametric estimation of the background noise spectrum using minimum statistics of the smoothed short-time Fourier transform (STFT). It is shown that the new algorithm can operate effectively under varying signal-to-noise ratios. Results are reported on two tasks, HMIHY and SPINE, which differ in their speaking style, background noise type, and bandwidth. With a computational cost of less than 2% of real time on a 1 GHz Pentium III machine and a latency of 400 ms, it is suitable for real-time ASR applications.
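As an aside on the minimum-statistics idea described in the abstract, the sketch below shows one common way to estimate a non-parametric noise floor from the smoothed short-time power spectrum by tracking its minimum over a sliding window. The smoothing constant, window length, and function name are illustrative assumptions, not the paper's actual parameters.

```python
import numpy as np

def minimum_statistics_noise(power_spec, alpha=0.85, win=100):
    """Estimate the background-noise power spectrum by tracking the minimum
    of a recursively smoothed short-time power spectrum.

    power_spec: array of shape (frames, bins) holding |STFT|^2 values.
    alpha:      smoothing constant (assumed value, not from the paper).
    win:        number of frames over which the minimum is tracked.
    """
    smoothed = np.empty_like(power_spec)
    smoothed[0] = power_spec[0]
    for t in range(1, len(power_spec)):
        # First-order recursive smoothing of the power spectrum.
        smoothed[t] = alpha * smoothed[t - 1] + (1 - alpha) * power_spec[t]

    noise = np.empty_like(power_spec)
    for t in range(len(power_spec)):
        # The noise floor is the minimum of the smoothed spectrum within a
        # sliding window ending at the current frame.
        noise[t] = smoothed[max(0, t - win + 1):t + 1].min(axis=0)
    return noise
```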
international conference on acoustics, speech, and signal processing | 2006
Christopher M. White; Izhak Shafran; Jean-Luc Gauvain
Most language recognition systems consist of a cascade of three stages: (1) tokenizers that produce parallel phone streams, (2) phonotactic models that score the match between each phone stream and the phonotactic constraints in the target language, and (3) a final stage that combines the scores from the parallel streams appropriately (M.A. Zissman, 1996). This paper reports a series of contrastive experiments to assess the impact of replacing the second and third stages with large-margin discriminative classifiers. In addition, it investigates how sounds that are not represented in the tokenizers of the first stage can be approximated with composite units that utilize cross-stream dependencies obtained via multi-string alignments. This leads to a discriminative framework that can potentially incorporate a richer set of features such as prosodic and lexical cues. Experiments are reported on the NIST LRE 1996 and 2003 tasks, and the results show that the new techniques give substantial gains over a competitive PPRLM baseline.
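As a rough illustration of replacing the phonotactic scoring and fusion stages with a large-margin classifier, the sketch below turns tokenized phone streams into phone n-gram counts and feeds them to a linear SVM. The phone strings, language labels, and hyperparameters are invented for illustration and are not the paper's system.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy phone streams (space-separated phone symbols) and language labels.
phone_streams = ["sil ah b ah sil", "sil k ah t sil",
                 "sil b ah k ah sil", "sil t ah k sil"]
languages = ["en", "en", "es", "es"]

# Phone n-gram counts stand in for phonotactic features; a linear
# large-margin classifier replaces the per-language scoring models.
model = make_pipeline(
    CountVectorizer(analyzer="word", token_pattern=r"\S+", ngram_range=(1, 3)),
    LinearSVC(C=1.0),
)
model.fit(phone_streams, languages)
print(model.predict(["sil k ah b ah sil"]))
```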
Computer Speech & Language | 2003
Izhak Shafran; Mari Ostendorf
Current speech recognition systems perform poorly on conversational speech as compared to read speech, arguably due to the large acoustic variability inherent in conversational speech. Our hypothesis is that there are systematic effects in local context, associated with syllabic structure, that are not being captured in the current acoustic models. Such variation may be modeled using a broader definition of context than in traditional systems, which restrict context to the neighboring phonemes. In this paper, we study the use of word- and syllable-level context conditioning in recognizing conversational speech. We describe a method to extend standard tree-based clustering to incorporate a large number of features, and we report results on the Switchboard task which indicate that syllable structure outperforms pentaphones and incurs less computational cost. It has been hypothesized that previous work on using syllable models for recognition of English was limited because it ignored the phenomenon of resyllabification (change of syllable structure at word boundaries), but our analysis shows that accounting for resyllabification does not impact recognition performance.
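As a simplified illustration of likelihood-based tree clustering extended to arbitrary binary context questions (word position, syllable onset/coda, and so on), the sketch below picks the question whose split maximizes the gain in single-Gaussian log-likelihood. The feature names and the diagonal-covariance criterion are assumptions for exposition, not the paper's exact setup.

```python
import numpy as np

def gaussian_loglik(frames):
    """Log-likelihood of frames under an ML-fit diagonal-covariance Gaussian,
    the usual objective in decision-tree state clustering."""
    n = len(frames)
    var = frames.var(axis=0) + 1e-6
    return -0.5 * n * np.sum(np.log(2 * np.pi * var) + 1)

def best_question(frames, contexts, questions):
    """Pick the binary context question whose split yields the largest
    likelihood gain over keeping the frames pooled.

    frames:    (N, dim) acoustic observations assigned to one state.
    contexts:  list of N dicts of binary context features, e.g.
               {"syllable_onset": True, "word_final": False} (hypothetical names).
    questions: candidate feature names to split on.
    """
    base = gaussian_loglik(frames)
    best, best_gain = None, 0.0
    for q in questions:
        mask = np.array([c.get(q, False) for c in contexts])
        if mask.all() or not mask.any():
            continue  # the question does not separate the data
        gain = gaussian_loglik(frames[mask]) + gaussian_loglik(frames[~mask]) - base
        if gain > best_gain:
            best, best_gain = q, gain
    return best, best_gain
```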
international conference on acoustics, speech, and signal processing | 2000
Izhak Shafran; Mari Ostendorf
Current speech recognition systems perform poorly on conversational speech as compared to read speech, largely because of the additional acoustic variability observed in conversational speech. Our hypothesis is that there are systematic effects, related to higher level structures, that are not being captured in the current acoustic models. In this paper we describe a method to extend standard clustering to incorporate such features in estimating acoustic models. We report recognition improvements obtained on the Switchboard task over triphones and pentaphones by the use of word- and syllable-level features. In addition, we report preliminary studies on clustering with prosodic information.
conference of the international speech communication association | 2016
Ehsan Variani; Tara N. Sainath; Izhak Shafran; Michiel Bacchiani
State-of-the-art automatic speech recognition (ASR) systems typically rely on pre-processed features. This paper studies the time-frequency duality in ASR feature extraction methods and proposes extending the standard acoustic model with a complex-valued linear projection layer to learn and optimize features that minimize standard cost functions such as cross-entropy. The proposed Complex Linear Projection (CLP) features achieve superior performance compared to pre-processed Log Mel features.
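A minimal sketch of the Complex Linear Projection idea: a complex-valued matrix projects the complex spectrum of a frame and a log-magnitude nonlinearity follows. In the paper this layer is trained jointly with the acoustic model; here the weights are random and the dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
fft_bins, out_dims = 257, 40          # illustrative sizes

# Complex STFT of one frame (random stand-in for real audio).
X = rng.standard_normal(fft_bins) + 1j * rng.standard_normal(fft_bins)

# Complex-valued projection matrix; in the paper this is learned jointly
# with the acoustic model (e.g., to minimize cross-entropy).
W = rng.standard_normal((out_dims, fft_bins)) + 1j * rng.standard_normal((out_dims, fft_bins))

# Complex linear projection followed by a log-magnitude nonlinearity,
# producing a learned alternative to log-mel filterbank features.
clp_features = np.log(np.abs(W @ X) + 1e-6)
print(clp_features.shape)  # (40,)
```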
ieee automatic speech recognition and understanding workshop | 2011
Izhak Shafran; Richard Sproat; Brian Roark
Speech and language processing systems routinely face the need to apply finite state operations (e.g., POS tagging) on results from intermediate stages (e.g., ASR output) that are naturally represented in a compact lattice form. Currently, such needs are met by converting the lattices into linear sequences (n-best scoring sequences) before and after applying the finite state operations. In this paper, we eliminate the need for this unnecessary conversion by addressing the problem of picking only the single best-scoring output labels for every input sequence. For this purpose, we define a categorial semiring that allows determinization over strings and incorporate it into a 〈Tropical, Categorial〉 lexicographic semiring. Through examples and empirical evaluations we show how determinization in this lexicographic semiring produces the desired output. The proposed solution is general in nature and can be applied to multi-tape weighted transducers that arise in many applications.
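A much-simplified sketch of the weight pairs involved: a 〈tropical cost, label string〉 value whose times operation adds costs and concatenates labels, and whose plus keeps the lower-cost pair. The categorial semiring in the paper additionally defines string division so that determinization is well defined; that machinery is omitted here, so this is only a toy illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TropStr:
    """Toy <tropical cost, label string> pair: `times` adds costs and
    concatenates labels; `plus` keeps the lower-cost pair. The string
    division needed for determinization is not modeled here."""
    cost: float
    labels: str

    def times(self, other):
        return TropStr(self.cost + other.cost, self.labels + other.labels)

    def plus(self, other):
        return self if self.cost <= other.cost else other

a = TropStr(1.5, "NN ")
b = TropStr(0.7, "VB ")
print(a.times(b))        # TropStr(cost=2.2, labels='NN VB ')
print(a.plus(b).labels)  # 'VB '
```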
conference of the international speech communication association | 2016
Tara N. Sainath; Arun Narayanan; Ron J. Weiss; Ehsan Variani; Kevin W. Wilson; Michiel Bacchiani; Izhak Shafran
Recently, we presented a multichannel neural network model trained to perform speech enhancement jointly with acoustic modeling [1], directly from raw waveform input signals. While this model achieved over a 10% relative improvement compared to a single channel model, it came at a large cost in computational complexity, particularly in the convolutions used to implement a time-domain filterbank. In this paper we present several different approaches to reduce the complexity of this model by reducing the stride of the convolution operation and by implementing filters in the frequency domain. These optimizations reduce the computational complexity of the model by a factor of 3 with no loss in accuracy on a 2,000 hour Voice Search task.
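The savings from moving the filterbank to the frequency domain rest on the standard circular-convolution identity: multiplying spectra element-wise is equivalent to convolving in time, so one FFT of the frame can be shared across all filters. The sketch below checks that equivalence on arbitrary data; the sizes are illustrative and this is not the paper's actual filterbank.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 512
frame = rng.standard_normal(N)   # one windowed waveform frame
filt = rng.standard_normal(N)    # one time-domain filter

# Time-domain filterbank: an O(N^2) circular convolution per filter.
k = np.arange(N)
time_out = np.array([np.sum(frame * filt[(i - k) % N]) for i in range(N)])

# Frequency-domain filterbank: one FFT of the frame, then a cheap
# element-wise multiply per filter (convolution theorem).
freq_out = np.fft.ifft(np.fft.fft(frame) * np.fft.fft(filt)).real

print(np.allclose(time_out, freq_out))  # True
```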
ieee automatic speech recognition and understanding workshop | 2003
Izhak Shafran; M. Riley; Mehryar Mohri
language resources and evaluation | 2006
Brian Roark; Mary P. Harper; Eugene Charniak; Bonnie J. Dorr; Mark Johnson; Jeremy G. Kahn; Yang Liu; Mari Ostendorf; John Hale; Anna Krasnyanskaya; Matthew Lease; Izhak Shafran; Matthew G. Snover; Robin Stewart; Lisa Yung
Archive | 2003
Mari Ostendorf; Izhak Shafran; Rebecca A. Bates