Izhak Shafran
University of Washington
Publications
Featured research published by Izhak Shafran.
international conference on acoustics, speech, and signal processing | 2003
Izhak Shafran; Richard C. Rose
This paper provides a solution for robust speech detection that can be applied across a variety of tasks. The solution is based on an algorithm that performs non-parametric estimation of the background noise spectrum using minimum statistics of the smoothed short-time Fourier transform (STFT). It is shown that the new algorithm can operate effectively under varying signal-to-noise ratios. Results are reported on two tasks, HMIHY and SPINE, which differ in their speaking style, background noise type, and bandwidth. With a computational cost of less than 2% of real time on a 1 GHz Pentium III machine and a latency of 400 ms, it is suitable for real-time ASR applications.
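As an aside on the minimum-statistics idea described in the abstract, the sketch below shows one common way to estimate a non-parametric noise floor from the smoothed short-time power spectrum by tracking its minimum over a sliding window. The smoothing constant, window length, and function name are illustrative assumptions, not the paper's actual parameters.

```python
import numpy as np

def minimum_statistics_noise(power_spec, alpha=0.85, win=100):
    """Estimate the background-noise power spectrum by tracking the minimum
    of a recursively smoothed short-time power spectrum.

    power_spec: array of shape (frames, bins) holding |STFT|^2 values.
    alpha:      smoothing constant (assumed value, not from the paper).
    win:        number of frames over which the minimum is tracked.
    """
    smoothed = np.empty_like(power_spec)
    smoothed[0] = power_spec[0]
    for t in range(1, len(power_spec)):
        # First-order recursive smoothing of the power spectrum.
        smoothed[t] = alpha * smoothed[t - 1] + (1 - alpha) * power_spec[t]

    noise = np.empty_like(power_spec)
    for t in range(len(power_spec)):
        # The noise floor is the minimum of the smoothed spectrum within a
        # sliding window ending at the current frame.
        noise[t] = smoothed[max(0, t - win + 1):t + 1].min(axis=0)
    return noise
```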
international conference on acoustics, speech, and signal processing | 2006
Christopher M. White; Izhak Shafran; Jean-Luc Gauvain
Most language recognition systems consist of a cascade of three stages: (1) tokenizers that produce parallel phone streams, (2) phonotactic models that score the match between each phone stream and the phonotactic constraints in the target language, and (3) a final stage that combines the scores from the parallel streams appropriately (M.A. Zissman, 1996). This paper reports a series of contrastive experiments to assess the impact of replacing the second and third stages with large-margin discriminative classifiers. In addition, it investigates how sounds that are not represented in the tokenizers of the first stage can be approximated with composite units that utilize cross-stream dependencies obtained via multi-string alignments. This leads to a discriminative framework that can potentially incorporate a richer set of features such as prosodic and lexical cues. Experiments are reported on the NIST LRE 1996 and 2003 tasks, and the results show that the new techniques give substantial gains over a competitive PPRLM baseline.
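As a rough illustration of replacing the phonotactic scoring and fusion stages with a large-margin classifier, the sketch below turns tokenized phone streams into phone n-gram counts and feeds them to a linear SVM. The phone strings, language labels, and hyperparameters are invented for illustration and are not the paper's system.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy phone streams (space-separated phone symbols) and language labels.
phone_streams = ["sil ah b ah sil", "sil k ah t sil",
                 "sil b ah k ah sil", "sil t ah k sil"]
languages = ["en", "en", "es", "es"]

# Phone n-gram counts stand in for phonotactic features; a linear
# large-margin classifier replaces the per-language scoring models.
model = make_pipeline(
    CountVectorizer(analyzer="word", token_pattern=r"\S+", ngram_range=(1, 3)),
    LinearSVC(C=1.0),
)
model.fit(phone_streams, languages)
print(model.predict(["sil k ah b ah sil"]))
```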
Computer Speech & Language | 2003
Izhak Shafran; Mari Ostendorf
Current speech recognition systems perform poorly on conversational speech as compared to read speech, arguably due to the large acoustic variability inherent in conversational speech. Our hypothesis is that there are systematic effects in local context, associated with syllabic structure, that are not being captured in the current acoustic models. Such variation may be modeled using a broader definition of context than in traditional systems, which restrict context to the neighboring phonemes. In this paper, we study the use of word- and syllable-level context conditioning in recognizing conversational speech. We describe a method to extend standard tree-based clustering to incorporate a large number of features, and we report results on the Switchboard task which indicate that syllable structure outperforms pentaphones and incurs less computational cost. It has been hypothesized that previous work on using syllable models for recognition of English was limited because it ignored the phenomenon of resyllabification (change of syllable structure at word boundaries), but our analysis shows that accounting for resyllabification does not impact recognition performance.
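As a simplified illustration of likelihood-based tree clustering extended to arbitrary binary context questions (word position, syllable onset/coda, and so on), the sketch below picks the question whose split maximizes the gain in single-Gaussian log-likelihood. The feature names and the diagonal-covariance criterion are assumptions for exposition, not the paper's exact setup.

```python
import numpy as np

def gaussian_loglik(frames):
    """Log-likelihood of frames under an ML-fit diagonal-covariance Gaussian,
    the usual objective in decision-tree state clustering."""
    n = len(frames)
    var = frames.var(axis=0) + 1e-6
    return -0.5 * n * np.sum(np.log(2 * np.pi * var) + 1)

def best_question(frames, contexts, questions):
    """Pick the binary context question whose split yields the largest
    likelihood gain over keeping the frames pooled.

    frames:    (N, dim) acoustic observations assigned to one state.
    contexts:  list of N dicts of binary context features, e.g.
               {"syllable_onset": True, "word_final": False} (hypothetical names).
    questions: candidate feature names to split on.
    """
    base = gaussian_loglik(frames)
    best, best_gain = None, 0.0
    for q in questions:
        mask = np.array([c.get(q, False) for c in contexts])
        if mask.all() or not mask.any():
            continue  # the question does not separate the data
        gain = gaussian_loglik(frames[mask]) + gaussian_loglik(frames[~mask]) - base
        if gain > best_gain:
            best, best_gain = q, gain
    return best, best_gain
```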
international conference on acoustics, speech, and signal processing | 2000
Izhak Shafran; Mari Ostendorf
Current speech recognition systems perform poorly on conversational speech as compared to read speech, largely because of the additional acoustic variability observed in conversational speech. Our hypothesis is that there are systematic effects, related to higher level structures, that are not being captured in the current acoustic models. In this paper we describe a method to extend standard clustering to incorporate such features in estimating acoustic models. We report recognition improvements obtained on the Switchboard task over triphones and pentaphones by the use of word- and syllable-level features. In addition, we report preliminary studies on clustering with prosodic information.
conference of the international speech communication association | 2016
Ehsan Variani; Tara N. Sainath; Izhak Shafran; Michiel Bacchiani
State-of-the-art automatic speech recognition (ASR) systems typically rely on pre-processed features. This paper studies the time-frequency duality in ASR feature extraction methods and proposes extending the standard acoustic model with a complex-valued linear projection layer to learn and optimize features that minimize standard cost functions such as cross-entropy. The proposed Complex Linear Projection (CLP) features achieve superior performance compared to pre-processed Log Mel features.
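A minimal sketch of the Complex Linear Projection idea: a complex-valued matrix projects the complex spectrum of a frame and a log-magnitude nonlinearity follows. In the paper this layer is trained jointly with the acoustic model; here the weights are random and the dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
fft_bins, out_dims = 257, 40          # illustrative sizes

# Complex STFT of one frame (random stand-in for real audio).
X = rng.standard_normal(fft_bins) + 1j * rng.standard_normal(fft_bins)

# Complex-valued projection matrix; in the paper this is learned jointly
# with the acoustic model (e.g., to minimize cross-entropy).
W = rng.standard_normal((out_dims, fft_bins)) + 1j * rng.standard_normal((out_dims, fft_bins))

# Complex linear projection followed by a log-magnitude nonlinearity,
# producing a learned alternative to log-mel filterbank features.
clp_features = np.log(np.abs(W @ X) + 1e-6)
print(clp_features.shape)  # (40,)
```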
ieee automatic speech recognition and understanding workshop | 2011
Izhak Shafran; Richard Sproat; Brian Roark
Speech and language processing systems routinely face the need to apply finite state operations (e.g., POS tagging) on results from intermediate stages (e.g., ASR output) that are naturally represented in a compact lattice form. Currently, such needs are met by converting the lattices into linear sequences (n-best scoring sequences) before and after applying the finite state operations. In this paper, we eliminate the need for this unnecessary conversion by addressing the problem of picking only the single best-scoring output labels for every input sequence. For this purpose, we define a categorial semiring that allows determinization over strings and incorporate it into a 〈Tropical, Categorial〉 lexicographic semiring. Through examples and empirical evaluations we show how determinization in this lexicographic semiring produces the desired output. The proposed solution is general in nature and can be applied to multi-tape weighted transducers that arise in many applications.
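A much-simplified sketch of the weight pairs involved: a 〈tropical cost, label string〉 value whose times operation adds costs and concatenates labels, and whose plus keeps the lower-cost pair. The categorial semiring in the paper additionally defines string division so that determinization is well defined; that machinery is omitted here, so this is only a toy illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TropStr:
    """Toy <tropical cost, label string> pair: `times` adds costs and
    concatenates labels; `plus` keeps the lower-cost pair. The string
    division needed for determinization is not modeled here."""
    cost: float
    labels: str

    def times(self, other):
        return TropStr(self.cost + other.cost, self.labels + other.labels)

    def plus(self, other):
        return self if self.cost <= other.cost else other

a = TropStr(1.5, "NN ")
b = TropStr(0.7, "VB ")
print(a.times(b))        # TropStr(cost=2.2, labels='NN VB ')
print(a.plus(b).labels)  # 'VB '
```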
conference of the international speech communication association | 2016
Tara N. Sainath; Arun Narayanan; Ron J. Weiss; Ehsan Variani; Kevin W. Wilson; Michiel Bacchiani; Izhak Shafran
Recently, we presented a multichannel neural network model trained to perform speech enhancement jointly with acoustic modeling [1], directly from raw waveform input signals. While this model achieved over a 10% relative improvement compared to a single channel model, it came at a large cost in computational complexity, particularly in the convolutions used to implement a time-domain filterbank. In this paper we present several different approaches to reduce the complexity of this model by reducing the stride of the convolution operation and by implementing filters in the frequency domain. These optimizations reduce the computational complexity of the model by a factor of 3 with no loss in accuracy on a 2,000 hour Voice Search task.
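The savings from moving the filterbank to the frequency domain rest on the standard circular-convolution identity: multiplying spectra element-wise is equivalent to convolving in time, so one FFT of the frame can be shared across all filters. The sketch below checks that equivalence on arbitrary data; the sizes are illustrative and this is not the paper's actual filterbank.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 512
frame = rng.standard_normal(N)   # one windowed waveform frame
filt = rng.standard_normal(N)    # one time-domain filter

# Time-domain filterbank: an O(N^2) circular convolution per filter.
k = np.arange(N)
time_out = np.array([np.sum(frame * filt[(i - k) % N]) for i in range(N)])

# Frequency-domain filterbank: one FFT of the frame, then a cheap
# element-wise multiply per filter (convolution theorem).
freq_out = np.fft.ifft(np.fft.fft(frame) * np.fft.fft(filt)).real

print(np.allclose(time_out, freq_out))  # True
```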
ieee automatic speech recognition and understanding workshop | 2003
Izhak Shafran; M. Riley; Mehryar Mohri
language resources and evaluation | 2006
Brian Roark; Mary P. Harper; Eugene Charniak; Bonnie J. Dorr; Mark Johnson; Jeremy G. Kahn; Yang Liu; Mari Ostendorf; John Hale; Anna Krasnyanskaya; Matthew Lease; Izhak Shafran; Matthew G. Snover; Robin Stewart; Lisa Yung
Archive | 2003
Mari Ostendorf; Izhak Shafran; Rebecca A. Bates