
Publication


Featured research published by Hynek Hermansky.


IEEE Transactions on Speech and Audio Processing | 1994

RASTA processing of speech

Hynek Hermansky; Nelson Morgan

Performance of even the best current stochastic recognizers severely degrades in an unexpected communications environment. In some cases, the environmental effect can be modeled by a set of simple transformations and, in particular, by convolution with an environmental impulse response and the addition of some environmental noise. Often, the temporal properties of these environmental effects are quite different from the temporal properties of speech. We have been experimenting with filtering approaches that attempt to exploit these differences to produce robust representations for speech recognition and enhancement and have called this class of representations relative spectra (RASTA). In this paper, we review the theoretical and experimental foundations of the method, discuss the relationship with human auditory perception, and extend the original method to combinations of additive noise and convolutional noise. We discuss the relationship between RASTA features and the nature of the recognition models that are required and the relationship of these features to delta features and to cepstral mean subtraction. Finally, we show an application of the RASTA technique to speech enhancement.
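The core idea, band-pass filtering the time trajectory of each log-spectral component, can be sketched numerically. The following is a minimal illustration (not the paper's exact implementation) using the widely quoted RASTA transfer function H(z) = 0.1(2 + z⁻¹ - z⁻³ - 2z⁻⁴)/(1 - 0.94z⁻¹); its zero at DC is what suppresses a constant (convolutional) channel offset:

```python
import numpy as np

def rasta_filter(log_spec_traj):
    """Band-pass filter one log-spectral trajectory over time.

    Uses the commonly quoted RASTA coefficients; the numerator taps
    sum to zero, so any constant offset in the log spectrum (a
    stationary channel effect) is filtered out.
    """
    b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])  # FIR numerator
    a_pole = 0.94                                     # single IIR pole
    x = np.asarray(log_spec_traj, dtype=float)
    y = np.zeros_like(x)
    for n in range(len(x)):
        acc = 0.0
        for k in range(len(b)):
            if n - k >= 0:
                acc += b[k] * x[n - k]
        y[n] = acc + (a_pole * y[n - 1] if n > 0 else 0.0)
    return y

# A constant log spectrum (pure channel offset) is driven to zero:
out = rasta_filter(np.full(300, 5.0))
print(abs(out[-1]))  # decays toward 0 after the initial transient
```

In a real front end this filter would be applied independently to each critical-band trajectory before the remaining feature computation.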


International Conference on Acoustics, Speech, and Signal Processing | 2000

Tandem connectionist feature extraction for conventional HMM systems

Hynek Hermansky; Daniel P. W. Ellis; Sangita Sharma

Hidden Markov model speech recognition systems typically use Gaussian mixture models to estimate the distributions of decorrelated acoustic feature vectors that correspond to individual subword units. By contrast, hybrid connectionist-HMM systems use discriminatively-trained neural networks to estimate the probability distribution among subword units given the acoustic observations. In this work we show a large improvement in word recognition performance by combining neural-net discriminative feature processing with Gaussian-mixture distribution modeling. By training the network to generate the subword probability posteriors, then using transformations of these estimates as the base features for a conventionally-trained Gaussian-mixture based system, we achieve relative error rate reductions of 35% or more on the multicondition Aurora noisy continuous digits task.
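The tandem pipeline described above can be sketched as follows. This is an illustrative reduction, with made-up dimensions: take per-frame subword posteriors from a network, move to the log domain (undoing the softmax compression), and decorrelate with PCA so that diagonal-covariance Gaussian mixtures can model the result:

```python
import numpy as np

def tandem_features(posteriors, n_keep=8):
    """Turn per-frame subword posteriors into decorrelated features.

    Sketch of the tandem idea: log posteriors, then a PCA rotation.
    The PCA basis diagonalizes the sample covariance, which suits
    the diagonal-covariance GMMs of a conventional HMM back end.
    """
    logp = np.log(np.clip(posteriors, 1e-10, None))  # log posteriors
    centered = logp - logp.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)             # ascending order
    basis = eigvec[:, ::-1][:, :n_keep]              # top components
    return centered @ basis

rng = np.random.default_rng(0)
# Fake posteriors: 500 frames over 40 subword units (illustrative):
raw = rng.random((500, 40))
post = raw / raw.sum(axis=1, keepdims=True)
feats = tandem_features(post)
cov = np.cov(feats, rowvar=False)
off_diag = np.abs(cov - np.diag(np.diag(cov))).max()
print(feats.shape, off_diag)  # decorrelated: off-diagonals near 0
```

The transformed features then simply replace cepstra as input to a conventionally trained Gaussian-mixture system.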


International Conference on Acoustics, Speech, and Signal Processing | 1992

RASTA-PLP speech analysis technique

Hynek Hermansky; Nelson Morgan; Aruna Bayya; Phil Kohn

Most speech parameter estimation techniques are easily influenced by the frequency response of the communication channel. The authors have developed a technique that is more robust to such steady-state spectral factors in speech. The approach is conceptually simple and computationally efficient. The new method is described, and experimental results are presented that show significant advantages for the proposed method.


Speech Communication | 1998

Should recognizers have ears?

Hynek Hermansky

Recently, techniques motivated by human auditory perception have been applied in mainstream speech technology, and there seems to be renewed interest in implementing more knowledge of human speech communication in the design of a speech recognizer. The paper discusses the author's experience with applying auditory knowledge to automatic recognition of speech. It advances the notion that the reason for applying such knowledge in speech engineering should be the ability of perception to suppress some of the irrelevant information in the speech message, and argues against the blind implementation of scattered, accidental knowledge which may be irrelevant to a speech recognition task. The following three properties of human speech perception are discussed in some detail:

• limited spectral resolution,
• use of information from about syllable-length segments,
• ability to ignore corrupted or irrelevant components of speech.

It shows, by referring to published works, that selective use of auditory knowledge, optimized on and in some cases derived from real speech data, can be consistent with current stochastic approaches to ASR and could yield advantages in practical engineering applications.


International Conference on Acoustics, Speech, and Signal Processing | 1993

Recognition of speech in additive and convolutional noise based on RASTA spectral processing

Hynek Hermansky; Nelson Morgan; Hans-Günter Hirsch

RASTA (relative spectral) processing is studied in a spectral domain which is linear-like for small spectral values and logarithmic-like for large spectral values. Experiments with a recognizer trained on clean speech and test data degraded by both convolutional and additive noise show that doing RASTA processing in the new domain yields results comparable with those obtained by training the recognizer on known noise.
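The "linear-like for small values, logarithmic-like for large values" domain is usually written as y = ln(1 + J·x). A quick numeric check of the two regimes (the constant J here is purely illustrative, not the paper's value):

```python
import math

def lin_log(x, J=1.0e-6):
    """Compression that is ~linear for small J*x and ~log for large.

    ln(1 + J*x) ~ J*x when J*x << 1 (additive noise region), and
    ~ ln(J*x) when J*x >> 1, where multiplicative channel effects
    become additive offsets that RASTA filtering can remove.
    """
    return math.log1p(J * x)

small = lin_log(1.0)               # J*x = 1e-6: linear regime
large = lin_log(1.0e12)            # J*x = 1e6: logarithmic regime
print(small / 1.0e-6)              # ratio near 1 -> behaves like J*x
print(large / math.log(1.0e6))     # ratio near 1 -> behaves like ln(J*x)
```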


International Conference on Acoustics, Speech, and Signal Processing | 1999

Temporal patterns (TRAPs) in ASR of noisy speech

Hynek Hermansky; Sangita Sharma

We study a new approach to processing temporal information for automatic speech recognition (ASR). Specifically, we study the use of rather long-time temporal patterns (TRAPs) of spectral energies in place of the conventional spectral patterns for ASR. The proposed neural TRAPs are found to yield a significant amount of information complementary to that of the conventional spectral-feature-based ASR system. A combination of these two ASR systems is shown to result in improved robustness to several types of additive and convolutive environmental degradations.
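The basic feature extraction can be sketched as follows: instead of a spectral vector per frame, each example is the trajectory of a single band's energy over roughly a second of context. Window length and normalization here are illustrative choices, not the paper's exact settings:

```python
import numpy as np

def band_traps(energies, context=50):
    """Extract a temporal pattern (TRAP) per frame for one band.

    Each pattern spans 2*context+1 frames, roughly 1 s at a 10 ms
    frame step when context=50. Each TRAP would then feed a
    band-specific neural classifier.
    """
    e = np.asarray(energies, dtype=float)
    width = 2 * context + 1
    n = len(e) - width + 1
    traps = np.stack([e[i:i + width] for i in range(n)])
    # Mean-normalize each pattern before classification:
    return traps - traps.mean(axis=1, keepdims=True)

traj = np.sin(np.linspace(0, 6 * np.pi, 300))  # toy band-energy trajectory
t = band_traps(traj, context=50)
print(t.shape)  # (200, 101): one 101-frame pattern per valid frame
```

In the full system, one such classifier per critical band produces posteriors that are then merged by a second net.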


IEEE Transactions on Speech and Audio Processing | 1995

The challenge of spoken language systems: Research directions for the nineties

Ron Cole; L. Hirschman; L. Atlas; M. Beckman; Alan W. Biermann; M. Bush; Mark A. Clements; L. Cohen; Oscar N. Garcia; B. Hanson; Hynek Hermansky; S. Levinson; Kathleen R. McKeown; Nelson Morgan; David G. Novick; Mari Ostendorf; Sharon L. Oviatt; Patti Price; Harvey F. Silverman; J. Spiitz; Alex Waibel; Cliff Weinstein; Stephen A. Zahorian; Victor W. Zue

A spoken language system combines speech recognition, natural language processing and human interface technology. It functions by recognizing the person's words, interpreting the sequence of words to obtain a meaning in terms of the application, and providing an appropriate response back to the user. Potential applications of spoken language systems range from simple tasks, such as retrieving information from an existing database (traffic reports, airline schedules), to interactive problem solving tasks involving complex planning and reasoning (travel planning, traffic routing), to support for multilingual interactions. We examine eight key areas in which basic research is needed to produce spoken language systems: (1) robust speech recognition; (2) automatic training and adaptation; (3) spontaneous speech; (4) dialogue models; (5) natural language response generation; (6) speech synthesis and speech generation; (7) multilingual systems; and (8) interactive multimodal systems. In each area, we identify key research challenges, the infrastructure needed to support research, and the expected benefits. We conclude by reviewing the need for multidisciplinary research, for development of shared corpora and related resources, for computational support and for rapid communication among researchers. The successful development of this technology will increase accessibility of computers to a wide range of users, will facilitate multinational communication and trade, and will create new research specialties and jobs in this rapidly expanding area.


International Conference on Acoustics, Speech, and Signal Processing | 1997

Sub-band based recognition of noisy speech

Sangita Tibrewala; Hynek Hermansky

A new approach to automatic speech recognition based on independent class-conditional probability estimates in several frequency sub-bands is presented. The approach is shown to be especially applicable to environments which cause partial corruption of the frequency spectrum of the signal. Some of the issues involved in the implementation of the approach are also addressed.
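One simple way to merge the independent per-band estimates is to average their log posteriors, optionally down-weighting bands judged corrupted. The merger and weighting scheme below are one possibility for illustration, not necessarily the paper's:

```python
import numpy as np

def merge_subband_posteriors(band_posteriors, band_weights=None):
    """Combine independent per-band class posteriors into one estimate.

    Each sub-band recognizer emits its own posterior over classes;
    here they are merged by a (weighted) average of log probabilities.
    A corrupted band with a near-flat posterior contributes little
    evidence, which is the robustness argument for the approach.
    """
    logs = np.log(np.clip(np.asarray(band_posteriors), 1e-10, None))
    if band_weights is None:
        band_weights = np.ones(len(logs)) / len(logs)
    combined = np.tensordot(band_weights, logs, axes=1)
    p = np.exp(combined - combined.max())  # renormalize stably
    return p / p.sum()

# Two bands agree on class 0; one band is noise-corrupted (flat):
bands = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.34, 0.33, 0.33],  # corrupted band carries little evidence
]
p = merge_subband_posteriors(bands)
print(np.argmax(p))  # class 0 still wins despite the corrupted band
```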


Speech Communication | 1999

On the relative importance of various components of the modulation spectrum for automatic speech recognition

Noboru Kanedera; Takayuki Arai; Hynek Hermansky; Misha Pavel

We measured the accuracy of speech recognition as a function of band-pass filtering of the time trajectories of spectral envelopes. We examined (i) several types of recognizers such as dynamic time warping (DTW) and hidden Markov model (HMM), and (ii) several types of features, such as filter bank output, mel-frequency cepstral coefficients (MFCC), and perceptual linear predictive (PLP) coefficients. We used the resulting recognition data to determine the relative importance of information in different modulation spectral components of speech for automatic speech recognition. We concluded that: (1) most of the useful linguistic information is in modulation frequency components from the range between 1 and 16 Hz, with the dominant component at around 4 Hz; (2) in some realistic environments, the use of components from the range below 2 Hz or above 16 Hz can degrade the recognition accuracy.
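The quantity being analyzed, the modulation spectrum of a spectral-envelope time trajectory, can be computed with a plain FFT over frames. A toy example with an envelope modulated at the syllable-like rate of 4 Hz (frame rate and durations are illustrative):

```python
import numpy as np

def modulation_spectrum(trajectory, frame_rate=100.0):
    """Magnitude spectrum of a spectral-envelope time trajectory.

    With a 10 ms frame step (100 frames/s), modulation frequencies
    up to 50 Hz are observable; the paper finds most linguistic
    information between 1 and 16 Hz, peaking near 4 Hz.
    """
    x = np.asarray(trajectory, dtype=float)
    mag = np.abs(np.fft.rfft(x - x.mean()))       # drop the DC term
    freqs = np.fft.rfftfreq(len(x), d=1.0 / frame_rate)
    return freqs, mag

# Toy envelope trajectory modulated at ~4 Hz, 2 s at 100 frames/s:
t = np.arange(0, 2.0, 0.01)
env = 1.0 + 0.5 * np.sin(2 * np.pi * 4 * t)
freqs, mag = modulation_spectrum(env)
print(freqs[np.argmax(mag)])  # peak at 4.0 Hz
```

Band-pass filtering these trajectories, as in the experiments above, amounts to keeping selected regions of this modulation spectrum.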


IEEE Signal Processing Magazine | 2005

Pushing the envelope - aside [speech recognition]

Nelson Morgan; Qifeng Zhu; Andreas Stolcke; K. Sonmez; S. Sivadas; T. Shinozaki; Mari Ostendorf; P. Jain; Hynek Hermansky; Daniel P. W. Ellis; G. Doddington; Barry Y. Chen; O. Cretin; H. Bourlard; M. Athineos

Despite successes, there are still significant limitations to speech recognition performance, particularly for conversational speech and/or for speech with significant acoustic degradations from noise or reverberation. For this reason, the authors have proposed methods that incorporate different (and larger) analysis windows, which are described in this article. We note in passing that we and many others have already taken advantage of processing techniques that incorporate information over long time ranges, for instance for normalization (by cepstral mean subtraction, as in B. Atal (1974), or relative spectral analysis (RASTA), as in H. Hermansky and N. Morgan (1994)). The authors have also proposed features that are based on speech sound class posterior probabilities, which have good properties for both classification and stream combination.

Collaboration


Dive into Hynek Hermansky's collaboration.

Top Co-Authors

Nelson Morgan
University of California

Petr Motlicek
Idiap Research Institute

Aren Jansen
Johns Hopkins University