
Publication


Featured research published by Harald Singer.


Computer Speech & Language | 1997

HMM topology design using maximum likelihood successive state splitting

Mari Ostendorf; Harald Singer

Modelling contextual variations of phones is widely accepted as an important aspect of a continuous speech recognition system, and HMM distribution clustering has been successfully used to obtain robust models of context through distribution tying. However, as systems move to the challenge of spontaneous speech, temporal variation also becomes important. This paper describes a method for designing HMM topologies that learn both temporal and contextual variation, extending previous work on successive state splitting (SSS). The new approach uses a maximum likelihood criterion consistently at each step, overcoming the previous SSS limitation to speaker-dependent training. Initial experiments show both performance gains and training cost reduction over SSS with the reformulated algorithm.


international conference on acoustics, speech, and signal processing | 1993

A dynamic cepstrum incorporating time-frequency masking and its application to continuous speech recognition

Kiyoaki Aikawa; Harald Singer; Hideki Kawahara; Yoh'ichi Tohkura

A dynamic cepstrum parameter that incorporates the time-frequency characteristics of auditory forward masking is proposed. A masking model is derived from psychological experimental results. A novel operational method using a lifter array is derived to perform the time-frequency masking. The parameter simulates the effective input spectrum at the front-end of the auditory system and can enhance the spectral dynamics. The parameter represents both the instantaneous and transitional aspects of a spectral time series. Phoneme and continuous speech recognition experiments demonstrated that the dynamic cepstrum outperforms the conventional cepstrum individually and in various combinations with other spectral parameters. The phoneme recognition results were improved for ten male and ten female speakers. The masking lifter with a Gaussian window provided a better performance than that with a square window.


international conference on acoustics, speech, and signal processing | 1996

Maximum likelihood successive state splitting

Harald Singer; Mari Ostendorf

Modeling contextual variations of phones is widely accepted as an important aspect of a continuous speech recognition system, and much research has been devoted to finding robust models of context for HMM systems. In particular, decision tree clustering has been used to tie output distributions across pre-defined states, and successive state splitting (SSS) has been used to define parsimonious HMM topologies. We describe a new HMM design algorithm, called maximum likelihood successive state splitting (ML-SSS), that combines advantages of both these approaches. Specifically, an HMM topology is designed using a greedy search for the best temporal and contextual splits using a constrained EM algorithm. In Japanese phone recognition experiments, ML-SSS shows recognition performance gains and training cost reduction over SSS under several training conditions.
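The greedy split selection at the heart of ML-SSS can be illustrated with a toy likelihood criterion. The sketch below scores each candidate split of a state's training frames by the gain in single-Gaussian log-likelihood over leaving the state unsplit; the function names and the diagonal-Gaussian scoring are illustrative simplifications, not the paper's exact formulation, which uses a constrained EM algorithm over the full HMM.

```python
import numpy as np

def gaussian_loglik(frames):
    """Log-likelihood of frames under one diagonal Gaussian fit to them."""
    frames = np.asarray(frames, dtype=float)
    mean = frames.mean(axis=0)
    var = frames.var(axis=0) + 1e-6           # variance floor avoids log(0)
    diff = frames - mean
    return float(-0.5 * np.sum(diff**2 / var + np.log(2 * np.pi * var)))

def best_split(state_frames, candidate_splits):
    """Greedily pick the candidate split (a partition of the state's frames
    into two groups) with the largest likelihood gain over the unsplit state.
    candidate_splits: {name: (left_frames, right_frames)}."""
    base = gaussian_loglik(state_frames)
    best_gain, best_name = 0.0, None
    for name, (left, right) in candidate_splits.items():
        gain = gaussian_loglik(left) + gaussian_loglik(right) - base
        if gain > best_gain:
            best_gain, best_name = gain, name
    return best_name, best_gain
```

In ML-SSS proper, the candidates are temporal splits (in sequence) and contextual splits (by phone context), and the same maximum-likelihood criterion ranks both kinds.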


international conference on acoustics, speech, and signal processing | 1992

Pitch dependent phone modelling for HMM based speech recognition

Harald Singer; Shigeki Sagayama

The authors propose a novel method of incorporating pitch information into a hidden Markov model (HMM) phoneme recognizer by exploiting the correlation between pitch and spectral parameters, e.g. cepstrum. Pitch patterns are not used explicitly; instead, spectral parameters are normalized framewise according to the pitch value. Evidence is given to show that the use of pitch information consistently improves the recognition performance. Experiments with 24 phoneme labels showed that the phoneme error rate for fast continuous speech could be improved by about 10%.
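The framewise normalization of spectral parameters by pitch can be illustrated with a linear-regression residual: each cepstral coefficient has its pitch-correlated component estimated and removed. This is a hedged sketch of the idea of exploiting the pitch-spectrum correlation; the paper's actual normalization scheme may differ.

```python
import numpy as np

def pitch_normalize(ceps, pitch):
    """Remove the pitch-correlated component of each cepstral coefficient
    by per-coefficient linear regression on pitch (illustrative sketch).
    ceps: (T, Q) cepstrum sequence; pitch: (T,) pitch values."""
    ceps = np.asarray(ceps, dtype=float)
    pitch = np.asarray(pitch, dtype=float)
    p = pitch - pitch.mean()                  # centered pitch
    out = np.empty_like(ceps)
    for j in range(ceps.shape[1]):
        c = ceps[:, j]
        slope = np.dot(p, c - c.mean()) / np.dot(p, p)
        out[:, j] = c - slope * p             # keep the mean, drop pitch trend
    return out
```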


Journal of the Acoustical Society of America | 1996

Cepstral representation of speech motivated by time–frequency masking: An application to speech recognition

Kiyoaki Aikawa; Harald Singer; Hideki Kawahara; Yoh’ichi Tohkura

A new spectral representation incorporating time-frequency forward masking is proposed. This masked spectral representation is efficiently represented by a quefrency domain parameter called dynamic-cepstrum (DyC). Automatic speech recognition experiments have demonstrated that DyC powerfully improves performance in phoneme classification and phrase recognition. This new spectral representation simulates a perceived spectrum. It enhances formant transition, which provides relevant cues for phoneme perception, while suppressing temporally stationary spectral properties, such as the effect of microphone frequency characteristics or the speaker-dependent time-invariant spectral feature. These features are advantageous for speaker-independent speech recognition. DyC can efficiently represent both the instantaneous and transitional aspects of a running spectrum with a vector of the same size as a conventional cepstrum. DyC is calculated from a cepstrum time sequence using a matrix lifter. Each column vector of the matrix lifter performs spectral smoothing. Smoothing characteristics are a function of the time interval between a masker and a signal. DyC outperformed a conventional cepstrum parameter obtained through linear predictive coding (LPC) analysis for both phoneme classification and phrase recognition by using hidden Markov models (HMMs). Compared with speaker-dependent recognition, an even greater improvement over the cepstrum parameter was found in speaker-independent speech recognition. Furthermore, DyC with only 16 coefficients exhibited higher speech recognition performance than a combination of the cepstrum and a delta-cepstrum with 32 coefficients for the classification experiment of phonemes contaminated by noises.
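The matrix-lifter computation described above can be sketched as a liftered, decaying sum over past cepstrum frames subtracted from the current frame: older maskers get a narrower quefrency window (more spectral smoothing), and stationary components cancel while transitions survive. The Gaussian lifter shape and the `lags=3`, `decay=0.5`, width-factor `0.3` constants are assumptions for illustration, not the paper's values.

```python
import numpy as np

def dynamic_cepstrum(ceps, lags=3, decay=0.5):
    """Illustrative dynamic-cepstrum: subtract a forward-masking estimate,
    built by liftering (quefrency-weighting) past frames, from each frame.
    ceps: (T, Q) cepstrum time sequence."""
    ceps = np.asarray(ceps, dtype=float)
    T, Q = ceps.shape
    q = np.arange(Q)
    out = np.zeros_like(ceps)
    for t in range(T):
        masker = np.zeros(Q)
        for k in range(1, lags + 1):
            if t - k < 0:
                break
            # Older maskers are smoothed more: the Gaussian lifter narrows
            # in quefrency as the masker-signal interval k grows (assumed form).
            lifter = np.exp(-0.5 * (q * k * 0.3) ** 2) * decay**k
            masker += lifter * ceps[t - k]
        out[t] = ceps[t] - masker
    return out
```

A stationary input (e.g. a fixed microphone coloration) is largely cancelled, which is the suppression effect the abstract describes.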


international conference on acoustics, speech, and signal processing | 1990

Low bit quantization of the smoothed group delay spectrum for speech recognition

Harald Singer; Taizo Umezaki; Fumitada Itakura

The coefficients of the smoothed group delay spectrum (SGDS) are calculated by discrete-time Fourier transform of the linear prediction coefficients, i.e. the representation is in the frequency domain. Isolated word recognition experiments with a low bit quantization of these SGDS coefficients are reported. It is shown that recognition accuracy can be maintained using only 26 bits/frame as compared to the conventional calculation with floating-point accuracy. Using a Bark-scale representation, the error rate can be reduced even further.
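Low-bit coding of frame coefficients, as in the 26 bits/frame result, can be illustrated with uniform scalar quantization; for instance, 13 coefficients at 2 bits each would give 26 bits/frame, though the paper's actual bit allocation and quantizer design are not specified here, and the range `[lo, hi]` and mid-rise reconstruction below are assumptions.

```python
import numpy as np

def quantize_frame(coeffs, bits_per_coeff, lo=-1.0, hi=1.0):
    """Uniform scalar quantization of one frame of coefficients (sketch).
    Total bit budget per frame = len(coeffs) * bits_per_coeff."""
    coeffs = np.asarray(coeffs, dtype=float)
    levels = 2 ** bits_per_coeff
    step = (hi - lo) / levels
    # Quantizer index, clipped to the representable range.
    idx = np.clip(((coeffs - lo) / step).astype(int), 0, levels - 1)
    deq = lo + (idx + 0.5) * step             # mid-rise reconstruction
    return idx, deq
```

For in-range inputs the reconstruction error is bounded by half a quantization step.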


Computer Speech & Language | 1999

Multiple pronunciation dictionary using HMM-state confusion characteristics

Yumi Wakita; Harald Singer; Yoshinori Sagisaka

In this paper, we propose a POS (part-of-speech)-dependent multiple pronunciation dictionary generation method using HMM-state confusions spanning several phonemes. When used in a multi-pass search, a dictionary generated from the method makes it possible to recover missing words that are lost during the first pass of the search process in continuous speech recognition using a single pronunciation dictionary. The new pronunciations are added to a dictionary that considers the POS dependency of the confusion characteristics. Continuous word recognition experiments have confirmed that the best results are obtained when (1) confusions expressed by HMM-state sequences and (2) pronunciation variations considering the POS-dependent confusion characteristics are used.


international conference on acoustics, speech, and signal processing | 1994

Non-uniform unit parsing for SSS-LR continuous speech recognition

Harald Singer; Jun-ichi Takami; Shoichi Matsunaga

We describe recent improvements in ATR's experimental speech recognition system ATREUS, which serves as a recognition front end for the speech translation system ASURA. Our next goal is spontaneous speech translation. To constrain the potentially huge search space, better prosodic control, better probabilistic language models and better acoustic models are proposed. The SSS-LR parser was modified to work with non-uniform unit type acoustic and duration models. Experimental results showed that, for example, the use of mora trigram probabilities improved the phrase error rate from 17% to 14%.
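A maximum-likelihood mora trigram, of the kind the language-model improvement above refers to, can be sketched in a few lines. The padding symbols and the absence of smoothing are simplifications; a practical model would smooth or back off to bigrams and unigrams.

```python
from collections import defaultdict

def train_trigram(sequences):
    """Maximum-likelihood trigram model over mora sequences (sketch).
    Returns prob(w, u, v) = P(w | u, v) estimated by relative frequency."""
    counts = defaultdict(int)
    context = defaultdict(int)
    for seq in sequences:
        padded = ["<s>", "<s>"] + list(seq) + ["</s>"]
        for i in range(2, len(padded)):
            counts[tuple(padded[i - 2:i + 1])] += 1
            context[tuple(padded[i - 2:i])] += 1
    def prob(w, u, v):
        c = counts[(u, v, w)]
        return c / context[(u, v)] if context[(u, v)] else 0.0
    return prob
```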


Computing Prosody | 1997

Accent Phrase Segmentation by F0 Clustering Using Superpositional Modelling

Mitsuru Nakai; Harald Singer; Yoshinori Sagisaka; Hiroshi Shimodaira

We propose an automatic method for detecting minor phrase boundaries in Japanese continuous speech by using F0 information. In the training phase, F0 contours of hand-labelled minor phrases are parameterized according to a superpositional model proposed by Fujisaki and Hirose, and assigned to clusters by a clustering method, in which the model parameters of the reference templates are calculated as an approximation of each cluster's centroid. In the segmentation phase, automatic N-best extraction of boundaries is performed by one-stage Dynamic Programming (DP) matching between the reference templates and the target F0 contour. About 90% of minor phrase boundaries were correctly detected in speaker-independent experiments with the ATR (Advanced Telecommunications Research Institute International) Japanese continuous speech database.
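The one-stage DP matching between reference F0 templates and a target contour builds on the same recurrence as ordinary dynamic time warping. A minimal DTW distance between one template and one contour segment looks like this; the actual one-stage algorithm matches a concatenation of templates against the whole utterance and recovers boundaries from the best path, which this sketch omits.

```python
import numpy as np

def dtw_distance(template, contour):
    """Plain DTW between an F0 reference template and a target contour
    segment (a minimal stand-in for one-stage DP matching)."""
    n, m = len(template), len(contour)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(template[i - 1] - contour[j - 1])
            # Standard symmetric recurrence: match, insertion, or deletion.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```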


international conference on acoustics, speech, and signal processing | 1995

Time-synchronous continuous speech recognizer driven by a context-free grammar

Tohru Shimizu; Seikou Monzen; Harald Singer; Shoichi Matsunaga

This paper proposes a time-synchronous continuous speech recognizer driven by a context-free grammar that integrates generalized-LR-parser-based phoneme context prediction and context-dependent HMMs. In this method, a phoneme hypotheses trie is introduced for the phoneme history representation of possible LR states, and an LR state network is introduced for LR path merging. Both techniques reduce the amount of computation. The experimental results show that this new method is more efficient than the conventional LR-parser-driven phoneme-synchronous continuous speech recognizer.
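The computation-reducing role of the phoneme hypotheses trie comes from merging shared phoneme-history prefixes, so equal histories are represented once. A minimal trie makes this concrete; the class and method names are illustrative, not from the paper.

```python
class PhonemeTrie:
    """Minimal trie over phoneme histories (illustrative): hypotheses that
    share a prefix share trie nodes, which is what saves computation."""

    def __init__(self):
        self.children = {}
        self.count = 0          # hypotheses ending exactly at this node

    def add(self, history):
        node = self
        for ph in history:
            node = node.children.setdefault(ph, PhonemeTrie())
        node.count += 1

    def n_nodes(self):
        return 1 + sum(c.n_nodes() for c in self.children.values())
```

Two histories ["a", "k", "a"] and ["a", "k", "i"] occupy five nodes instead of seven, because the ["a", "k"] prefix is stored once.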

Collaboration


Dive into Harald Singer's collaborations.

Top Co-Authors

Atsushi Nakamura

Nippon Telegraph and Telephone

Kiyoaki Aikawa

Tokyo University of Technology

Satoshi Nakamura

Nara Institute of Science and Technology

Mari Ostendorf

University of Washington

Akira Kurematsu

University of Electro-Communications
