Network


Latest external collaboration at the country level. Dive into details by clicking on the dots.

Hotspot


Dive into the research topics where Alex Waibel is active.

Publication


Featured research published by Alex Waibel.


IEEE Transactions on Acoustics, Speech, and Signal Processing | 1989

Phoneme recognition using time-delay neural networks

Alex Waibel; Toshiyuki Hanazawa; Geoffrey E. Hinton; Kiyohiro Shikano; Kevin J. Lang

The authors present a time-delay neural network (TDNN) approach to phoneme recognition which is characterized by two important properties: (1) using a three-layer arrangement of simple computing units, a hierarchy can be constructed that allows for the formation of arbitrary nonlinear decision surfaces, which the TDNN learns automatically using error backpropagation; and (2) the time-delay arrangement enables the network to discover acoustic-phonetic features and the temporal relationships between them independently of position in time and therefore not blurred by temporal shifts in the input. As a recognition task, the speaker-dependent recognition of the phonemes B, D, and G in varying phonetic contexts was chosen. For comparison, several discrete hidden Markov models (HMM) were trained to perform the same task. Performance evaluation over 1946 testing tokens from three speakers showed that the TDNN achieves a recognition rate of 98.5% correct while the rate obtained by the best of the HMMs was only 93.7%.
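The time-delay arrangement can be sketched in a few lines: one weight matrix is applied to every window of consecutive frames, so a learned acoustic feature fires wherever its pattern occurs in time. This is a minimal illustration of the shift-invariance idea; all names and dimensions below are illustrative, not taken from the paper.

```python
import numpy as np

def tdnn_layer(frames, weights, bias):
    """One time-delay layer: the same weight matrix is applied to every
    window of consecutive frames, so a learned feature responds the same
    way regardless of where in time its acoustic pattern occurs.

    frames:  (T, F)    T time frames of F spectral coefficients
    weights: (D, F, H) D delays, F inputs, H hidden units
    bias:    (H,)
    """
    D, F, H = weights.shape
    T = frames.shape[0]
    out = np.empty((T - D + 1, H))
    for t in range(T - D + 1):
        window = frames[t:t + D]                    # (D, F) delayed inputs
        out[t] = np.tanh(np.einsum('df,dfh->h', window, weights) + bias)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((15, 16))                   # 15 frames, 16 coefficients
w = rng.standard_normal((3, 16, 8)) * 0.1           # 3-frame delay window
h = tdnn_layer(x, w, np.zeros(8))
print(h.shape)                                      # (13, 8)
```

Because the weights are shared across time, shifting the input by one frame simply shifts the output by one frame, which is the property the paper exploits so that training tokens need not be precisely aligned.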


Workshop on Applications of Computer Vision | 1996

A real-time face tracker

Jie Yang; Alex Waibel

The authors present a real-time face tracker. The system has achieved a rate of 30+ frames/second using an HP-9000 workstation with a frame grabber and a Canon VC-C1 camera. It can track a person's face while the person moves freely (e.g., walks, jumps, sits down and stands up) in a room. Three types of models have been employed in developing the system. First, they present a stochastic model to characterize skin color distributions of human faces. The information provided by the model is sufficient for tracking a human face in various poses and views. This model is adaptable to different people and different lighting conditions in real-time. Second, a motion model is used to estimate image motion and to predict the search window. Third, a camera model is used to predict and compensate for camera motion. The system can be applied to teleconferencing and many HCI applications including lip reading and gaze tracking. The principle in developing this system can be extended to other tracking problems such as tracking the human hand.
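The motion model's role of predicting the next search window can be sketched as a constant-velocity extrapolation from the last two observed positions. This is a simplified stand-in for whatever estimator the system actually uses; the class and its names are hypothetical.

```python
import numpy as np

class MotionPredictor:
    """Constant-velocity motion model (a toy stand-in for the paper's
    motion model): predict where to center the next search window from
    the last two observed face positions."""

    def __init__(self):
        self.prev = None
        self.velocity = np.zeros(2)

    def update(self, position):
        position = np.asarray(position, dtype=float)
        if self.prev is not None:
            self.velocity = position - self.prev    # pixels per frame
        self.prev = position

    def predict(self):
        # Center the next search window at the extrapolated position.
        return self.prev + self.velocity

tracker = MotionPredictor()
tracker.update((100, 120))
tracker.update((104, 122))       # face moved (+4, +2) in one frame
print(tracker.predict())         # [108. 124.]
```

Restricting the skin-color search to such a predicted window is what keeps per-frame cost low enough for 30+ frames/second.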


Neural Networks | 1990

A time-delay neural network architecture for isolated word recognition

Kevin J. Lang; Alex Waibel; Geoffrey E. Hinton

A translation-invariant back-propagation network is described that performs better than a sophisticated continuous acoustic parameter hidden Markov model on a noisy, 100-speaker confusable vocabulary isolated word recognition task. The network's replicated architecture permits it to extract precise information from unaligned training patterns selected by a naive segmentation rule.


Asian Conference on Computer Vision | 1998

Skin-Color Modeling and Adaptation

Jie Yang; Weier Lu; Alex Waibel

This paper studies a statistical skin-color model and its adaptation. It is revealed that (1) human skin colors cluster in a small region of a color space; (2) the variance of a skin-color cluster can be reduced by intensity normalization; and (3) under a certain lighting condition, a skin-color distribution can be characterized by a multivariate normal distribution in the normalized color space. We then propose an adaptive model to characterize human skin-color distributions for tracking human faces under different lighting conditions. The parameters of the model are adapted based on the maximum likelihood criterion. The model has been successfully applied to a real-time face tracker and other applications.
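The modeling steps the abstract names can be sketched directly: normalize out intensity to get chromaticity, fit a multivariate normal over the skin cluster, and score pixels by its density. The constants and toy data below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def chromatic(rgb):
    """Intensity normalization: project RGB onto the (r, g) chromatic
    plane, which shrinks the skin-color cluster as the paper reports."""
    rgb = np.asarray(rgb, dtype=float)
    s = rgb.sum(axis=-1, keepdims=True)
    return (rgb / np.maximum(s, 1e-9))[..., :2]     # keep r, g; b = 1 - r - g

def fit_skin_model(skin_pixels):
    """Fit the multivariate normal N(mu, Sigma) over normalized (r, g)."""
    rg = chromatic(skin_pixels)
    return rg.mean(axis=0), np.cov(rg.T)

def skin_likelihood(pixels, mu, sigma):
    """Gaussian density of each pixel's chromaticity under the model."""
    d = chromatic(pixels) - mu
    m = np.einsum('ni,ij,nj->n', d, np.linalg.inv(sigma), d)  # Mahalanobis^2
    return np.exp(-0.5 * m) / (2 * np.pi * np.sqrt(np.linalg.det(sigma)))

# Toy data: a "skin" cluster in RGB, then one skin-like and one blue pixel.
rng = np.random.default_rng(1)
skin = np.array([200, 140, 110]) + rng.normal(0, 12, size=(500, 3))
mu, sigma = fit_skin_model(skin)
p = skin_likelihood(np.array([[205, 145, 115], [60, 70, 200]]), mu, sigma)
print(p[0] > p[1])                                  # True: first pixel is skin-like
```

Adapting the model to new lighting then amounts to re-estimating (mu, Sigma) from recently classified skin pixels, per the maximum likelihood criterion the abstract mentions.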


International Conference on Spoken Language Processing | 1996

Recognizing emotion in speech

Frank Dellaert; Thomas Polzin; Alex Waibel

The paper explores several statistical pattern recognition techniques to classify utterances according to their emotional content. The authors have recorded a corpus containing emotional speech with over 1000 utterances from different speakers. They present a new method of extracting prosodic features from speech, based on a smoothing spline approximation of the pitch contour. To make maximal use of the limited amount of training data available, they introduce a novel pattern recognition technique: majority voting of subspace specialists. Using this technique, they obtain classification performance that is close to human performance on the task.
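A toy rendering of the "majority voting of subspace specialists" idea: train many small classifiers, each seeing only a random subset of the feature dimensions, and let them vote. The specialist type (nearest-mean), the parameters, and the synthetic data are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np
from collections import Counter

class NearestMean:
    """Minimal classifier: assign the class whose training mean is closest."""
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.means = np.array([X[y == c].mean(axis=0) for c in self.classes])
        return self
    def predict(self, X):
        d = np.linalg.norm(X[:, None, :] - self.means[None], axis=2)
        return self.classes[d.argmin(axis=1)]

def train_subspace_specialists(X, y, n_specialists=15, subspace_dim=3, seed=0):
    """Each specialist sees only a random subset of feature dimensions
    (its 'subspace'), keeping individual models small when data is scarce."""
    rng = np.random.default_rng(seed)
    specialists = []
    for _ in range(n_specialists):
        dims = rng.choice(X.shape[1], size=subspace_dim, replace=False)
        specialists.append((dims, NearestMean().fit(X[:, dims], y)))
    return specialists

def majority_vote(specialists, X):
    votes = np.array([clf.predict(X[:, dims]) for dims, clf in specialists])
    return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])

# Toy prosodic-feature data: two "emotions" separated along every dimension.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, (40, 8)), rng.normal(2.5, 1.0, (40, 8))])
y = np.array([0] * 40 + [1] * 40)
ensemble = train_subspace_specialists(X, y)
pred = majority_vote(ensemble, X)
print((pred == y).mean() > 0.9)                     # True on this easy data
```

The appeal with limited training data is that each specialist estimates far fewer parameters than one classifier over the full feature space, while the vote smooths out their individual mistakes.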


Speech Communication | 2001

Language-independent and language-adaptive acoustic modeling for speech recognition

Tanja Schultz; Alex Waibel

With the distribution of speech technology products all over the world, the portability to new target languages becomes a practical concern. As a consequence, our research focuses on the question of how to port large vocabulary continuous speech recognition (LVCSR) systems in a fast and efficient way. More specifically, we want to estimate acoustic models for a new target language using speech data from varied source languages, but only limited data from the target language. For this purpose, we introduce different methods for multilingual acoustic model combination and a polyphone decision tree specialization procedure. Recognition results using language-dependent, language-independent, and language-adaptive acoustic models are presented and discussed in the framework of our GlobalPhone project, which investigates LVCSR systems in 15 languages.


Neural Computation | 1989

Modular construction of time-delay neural networks for speech recognition

Alex Waibel

Several strategies are described that overcome limitations of basic network models as steps towards the design of large connectionist speech recognition systems. The two major areas of concern are the problem of time and the problem of scaling. Speech signals continuously vary over time and encode and transmit enormous amounts of human knowledge. To decode these signals, neural networks must be able to use appropriate representations of time and it must be possible to extend these nets to almost arbitrary sizes and complexity within finite resources. The problem of time is addressed by the development of a Time-Delay Neural Network; the problem of scaling by Modularity and Incremental Design of large nets based on smaller subcomponent nets. It is shown that small networks trained to perform limited tasks develop time invariant, hidden abstractions that can subsequently be exploited to train larger, more complex nets efficiently. Using these techniques, phoneme recognition networks of increasing complexity can be constructed that all achieve superior recognition performance.


IEEE Transactions on Image Processing | 2004

Automatic detection and recognition of signs from natural scenes

Xilin Chen; Jie Yang; Jing Zhang; Alex Waibel

In this paper, we present an approach to automatic detection and recognition of signs from natural scenes, and its application to a sign translation task. The proposed approach embeds multiresolution and multiscale edge detection, adaptive searching, color analysis, and affine rectification in a hierarchical framework for sign detection, with different emphases at each phase to handle the text in different sizes, orientations, color distributions and backgrounds. We use affine rectification to recover deformation of the text regions caused by an inappropriate camera view angle. The procedure can significantly improve text detection rate and optical character recognition (OCR) accuracy. Instead of using binary information for OCR, we extract features from an intensity image directly. We propose a local intensity normalization method to effectively handle lighting variations, followed by a Gabor transform to obtain local features, and finally a linear discriminant analysis (LDA) method for feature selection. We have applied the approach in developing a Chinese sign translation system, which can automatically detect and recognize Chinese signs as input from a camera, and translate the recognized text into English.
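The recognition feature stage the abstract describes (local intensity normalization, then Gabor responses, with LDA selection afterwards) can be sketched roughly as below. The kernel parameters, patch size, and orientations are illustrative assumptions, not values from the paper.

```python
import numpy as np

def gabor_kernel(size=9, sigma=2.5, theta=0.0, lam=4.0):
    """Real (cosine-phase) Gabor kernel at orientation theta, wavelength lam."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    g = np.exp(-(x**2 + y**2) / (2 * sigma**2)) * np.cos(2 * np.pi * xr / lam)
    return g - g.mean()                 # zero mean: flat patches give 0 response

def gabor_features(patch, orientations=(0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    """Local intensity normalization followed by per-orientation Gabor
    responses of one kernel-sized patch (LDA feature selection would
    come after this stage in the pipeline)."""
    p = (patch - patch.mean()) / (patch.std() + 1e-9)   # lighting normalization
    return np.array([np.sum(p * gabor_kernel(theta=th)) for th in orientations])

# A patch of vertical stripes responds most strongly to the 0-rad kernel,
# whose carrier varies along x.
half = 4
_, x = np.mgrid[-half:half + 1, -half:half + 1]
stripes = np.cos(2 * np.pi * x / 4.0)
feats = gabor_features(stripes)
print(int(np.argmax(feats)))            # 0
```

Because the patch is normalized to zero mean and unit variance before filtering, a uniform lighting change across the character leaves these features unchanged, which is the point of doing normalization before the Gabor transform.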


International Symposium on Wearable Computers | 1999

Smart Sight: a tourist assistant system

Jie Yang; Weiyi Yang; Matthias Denecke; Alex Waibel

In this paper, we present our efforts towards developing an intelligent tourist system. The system is equipped with a unique combination of sensors and software. The hardware includes two computers, a GPS receiver, a lapel microphone plus an earphone, a video camera and a head-mounted display. This combination includes a multimodal interface to take advantage of speech and gesture input to provide assistance for a tourist. The software supports natural language processing, speech recognition, machine translation, handwriting recognition and multimodal fusion. A vision module is trained to locate and read written language, is able to adapt to new environments, and is able to interpret intentions offered by the user such as a spoken clarification or pointing gesture. We illustrate the applications of the system using two examples.


Journal of the Acoustical Society of America | 1998

Locating and correcting erroneously recognized portions of utterances by rescoring based on two n-best lists

Alex Waibel; Arthur E. McNair

A method of repairing machine-recognized speech comprises the steps of receiving from a recognition engine a first n-best list of hypotheses and scores for each hypothesis generated in response to a primary utterance to be recognized. An error within the hypothesis having the highest score is located. Control signals are generated from the first n-best list which are input to the recognition engine to constrain the generation of a second n-best list of hypotheses, and scores for each hypothesis, in response to an event independent of the primary utterance. The scores for the hypotheses in the first n-best list are combined with the scores for the hypotheses in the second n-best list. The hypothesis having the highest combined score is selected as the replacement for the located error.
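The score-combination step can be sketched minimally: sum each hypothesis's scores across the two n-best lists and take the highest total. The scores and the unweighted sum here are illustrative, not the patent's exact scheme.

```python
def combine_nbest(primary, secondary):
    """Combine scores from two n-best lists (primary utterance and an
    independent repair event) and return the hypothesis with the highest
    combined score. Hypotheses absent from one list get no contribution
    from it; score scale and weighting are illustrative."""
    combined = {}
    for hyp, score in primary + secondary:
        combined[hyp] = combined.get(hyp, 0.0) + score
    return max(combined.items(), key=lambda kv: kv[1])

# The primary recognition ranks the wrong word first; scores from the
# independent repair event shift the combined ranking to the correct one.
primary = [("meet me at eight", 0.48), ("beat me at eight", 0.52)]
repair  = [("meet me at eight", 0.70), ("beat me at eight", 0.30)]
best, score = combine_nbest(primary, repair)
print(best)   # meet me at eight
```

The key property is that the second list is generated from an event independent of the primary utterance, so a recognition error is unlikely to be repeated, and summing the scores lets the correct hypothesis overtake it.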

Collaboration


Dive into Alex Waibel's collaborations.

Top Co-Authors

Jan Niehues, Karlsruhe Institute of Technology
Sebastian Stüker, Karlsruhe Institute of Technology
Jie Yang, Carnegie Mellon University
Stephan Vogel, Carnegie Mellon University
Florian Metze, Carnegie Mellon University
Rainer Stiefelhagen, Karlsruhe Institute of Technology
Alon Lavie, Carnegie Mellon University
Michael Finke, Carnegie Mellon University
Markus Müller, Karlsruhe Institute of Technology