Publication


Featured research published by Harald Höge.


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2006

Text-Independent Voice Conversion Based on Unit Selection

David Sündermann; Harald Höge; Antonio Bonafonte; Hermann Ney; Alan W. Black; Shrikanth Narayanan

So far, most voice conversion training procedures are text-dependent, i.e., they are based on parallel training utterances of the source and target speakers. Since several applications (e.g. speech-to-speech translation or dubbing) require text-independent training, training techniques that use non-parallel data have been proposed over the last two years. In this paper, we present a new approach that applies unit selection to find corresponding time frames in source and target speech. A subjective experiment shows that this technique achieves the same performance as conventional text-dependent training.
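The frame-matching step described above can be sketched as a nearest-neighbour search over feature frames (a minimal illustration using only a target cost; the function and variable names are assumptions, and a full unit-selection search would also score concatenation smoothness between consecutive units):

```python
import numpy as np

def match_frames(source_frames, target_frames):
    """For each source frame, select the index of the closest target
    frame by Euclidean distance. This is the target cost only; a full
    unit-selection search would also add a concatenation cost."""
    matches = []
    for s in source_frames:
        dists = np.linalg.norm(target_frames - s, axis=1)
        matches.append(int(np.argmin(dists)))
    return matches
```

Matched frame pairs can then serve as pseudo-parallel data for training a conventional conversion function.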


IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | 2003

VTLN-based cross-language voice conversion

David Sündermann; Hermann Ney; Harald Höge

In speech recognition, vocal tract length normalization (VTLN) is a well-studied technique for speaker normalization. As cross-language voice conversion aims at transforming a source speaker's voice into that of a target speaker using a different language, we want to investigate whether VTLN is an appropriate method to adapt the voice characteristics. After applying several conventional VTLN warping functions, we extend the conventional piece-wise linear function to several segments, allowing a more detailed warping of the source spectrum. Experiments on cross-language voice conversion are performed on three corpora of two languages and both speaker genders.
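The conventional two-segment piece-wise linear warping function can be sketched as follows (the parameter names and the 0.7 knee position are illustrative assumptions; the paper's extension adds further segments to this basic shape):

```python
import numpy as np

def piecewise_linear_warp(freqs, nyquist, alpha, knee=0.7):
    """Two-segment piece-wise linear VTLN warp: scale frequencies by
    alpha up to a knee frequency, then interpolate linearly so that
    the Nyquist frequency still maps onto itself (alpha near 1 is
    assumed, so the warp stays monotonic)."""
    knee_f = knee * nyquist
    return np.where(
        freqs <= knee_f,
        alpha * freqs,
        alpha * knee_f
        + (nyquist - alpha * knee_f) * (freqs - knee_f) / (nyquist - knee_f),
    )
```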


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 1997

European speech databases for telephone applications

Harald Höge; H.S. Tropf; R. Winski; H. van den Heuvel; R. Haeb-Umbach; Khalid Choukri

The SpeechDat project aims to produce speech databases for all official languages of the European Union and some major dialectal variants and minority languages, resulting in 28 speech databases. They will be recorded over fixed and mobile telephone networks. This will provide a realistic basis for training and assessment of both isolated and continuous-speech utterances, employing whole-word or subword approaches, and thus can be used for developing voice-driven teleservices including speaker verification. The specification of the databases has been developed jointly and is essentially the same for each language to facilitate dissemination and use. There will be a controlled variation among the speakers concerning sex, age, dialect, environment of call, etc. The validation of all databases will be carried out centrally. The SpeechDat databases will be transferred to ELRA for distribution. The next databases to be recorded will cover East European languages.


IEEE Transactions on Speech and Audio Processing | 2002

ASR in mobile phones - an industrial approach

Imre Varga; Stefanie Aalburg; Bernt Andrassy; Sergey Astrov; Josef Bauer; Christophe Beaugeant; Christian Geissler; Harald Höge

In order to make hidden Markov model (HMM) speech recognition suitable for mobile phone applications, Siemens developed a recognizer, the Very Smart Recognizer (VSR), for deployment in future mobile phone generations. Typical applications will be name dialling and command-and-control operations suited for different environments, for example in cars. The paper describes research and development issues of a speech recognizer in mobile devices, focusing on noise robustness, memory efficiency, and integer implementation. The VSR is shown to reach a word error rate as low as 4.1% on continuous digits recorded in a car environment. Furthermore, by means of discriminative training and HMM-parameter coding, the memory requirements of the VSR HMMs are kept below 64 kBytes.


Journal of the Acoustical Society of America | 2006

Method and apparatus for an adaptive speech recognition system utilizing HMM models

Udo Bub; Harald Höge

In speech recognition, phonemes of a language are modelled by a hidden Markov model, whereby each state of the hidden Markov model is described by a probability density function. For speech recognition of a modified vocabulary, the probability density function is split into a first and a second probability density function. As a result, it is possible to compensate for variations in the speaking habits of a speaker or to add a new word to the vocabulary of the speech recognition unit, thereby assuring that the new word is distinguished with adequate quality from the words already present in the speech recognition unit and is thus recognized.
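The splitting of a density can be illustrated with a common Gaussian mixture-splitting heuristic (a sketch of the general technique under assumed details, not the specific method claimed in the patent):

```python
import numpy as np

def split_gaussian(mean, var, eps=0.2):
    """Split one Gaussian density into two by offsetting each new mean
    by +/- eps standard deviations; variances are left unchanged. In a
    mixture, the parent's weight would typically be halved between them."""
    offset = eps * np.sqrt(var)
    return (mean - offset, var), (mean + offset, var)
```

After splitting, both densities are re-estimated on adaptation data, letting each specialize to one mode of the observed variation.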


International Symposium on Signal Processing and Information Technology (ISSPIT) | 2004

Time domain vocal tract length normalization

David Sündermann; Antonio Bonafonte; Hermann Ney; Harald Höge

Recently, the speaker normalization technique VTLN (vocal tract length normalization), known from speech recognition, was applied to voice conversion. So far, VTLN has been performed in frequency domain. However, to accelerate the conversion process, it is helpful to apply VTLN directly to the time frames of a speech signal. In this paper, we propose a technique which directly manipulates the time signal. By means of subjective tests, it is shown that the performance of voice conversion techniques based on frequency domain and time domain VTLN are equivalent in terms of speech quality, while the latter requires about 20 times less processing time.
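A direct time-domain warp can be sketched by resampling each frame, since stretching a frame in time compresses its spectrum and vice versa (an illustrative simplification using linear interpolation, not the authors' exact algorithm):

```python
import numpy as np

def time_warp_frame(frame, alpha):
    """Resample a time frame by factor alpha using linear interpolation.
    Lengthening the frame (alpha > 1) scales its spectrum down by
    1/alpha, mimicking a linear VTLN warp without any FFT."""
    n = len(frame)
    m = max(2, int(round(n * alpha)))
    x_old = np.linspace(0.0, 1.0, n)
    x_new = np.linspace(0.0, 1.0, m)
    return np.interp(x_new, x_old, frame)
```

Avoiding the transform to the frequency domain and back is what makes this route attractive for low-latency conversion.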


Signal Processing | 2009

Noise robust F0 determination and epoch-marking algorithms

Bojan Kotnik; Harald Höge; Zdravko Kacic

This paper presents a combined pitch frequency (F0) determination and epoch (pitch period) marking procedure, CPDMA, using merged normalized forward-backward correlation. The algorithm consists of several processing steps: preprocessing of the input speech signal, voicing detection using artificial neural networks, an F0 determination stage based on normalized correlation, F0 contour postprocessing applying partial Viterbi traceback, and, finally, epoch (or pitch period) marking. To evaluate the proposed CPDMA procedure against other algorithms, a manually segmented PDA/PMA reference database based on the real-life SPEECON Spanish speech database has been created. A set of criteria was proposed to objectively and compactly evaluate the performance of any PDA/PMA or voicing detection algorithm. The performance of the proposed CPDMA was compared with that of the well-known and publicly available PRAAT toolkit. The PDA and PMA performances achieved with the proposed CPDMA algorithm significantly outperformed the PRAAT toolkit in all four considered configurations: the autocorrelation method (PRAAT_AC), the cross-correlation method (PRAAT_CC), SHS (PRAAT_SHS), and point process (PRAAT_PP). The superior noise robustness of CPDMA is achieved at the expense of a more complex algorithm, and consequently a worse real-time factor, compared to PRAAT.
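The normalized-correlation F0 stage can be illustrated with a single-frame sketch (the full CPDMA merges forward and backward correlations, uses neural-network voicing detection, and smooths the contour with a partial Viterbi traceback, none of which are shown here):

```python
import numpy as np

def f0_normalized_corr(frame, sr, fmin=60.0, fmax=400.0):
    """Pick F0 by maximizing the normalized correlation between the
    frame and a lagged copy of itself over the plausible lag range.
    A single-frame simplification of a correlation-based PDA."""
    lag_min = int(sr / fmax)
    lag_max = min(int(sr / fmin), len(frame) // 2)
    best_lag, best_r = lag_min, -1.0
    for lag in range(lag_min, lag_max + 1):
        a, b = frame[:-lag], frame[lag:]
        r = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
        if r > best_r:
            best_r, best_lag = r, lag
    return sr / best_lag
```

Without the contour postprocessing, such a per-frame estimate is prone to octave errors, which is precisely what the Viterbi traceback in CPDMA is there to suppress.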


IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | 2005

Residual prediction based on unit selection

David Sündermann; Harald Höge; Antonio Bonafonte; Hermann Ney; Alan W. Black

Recently, we presented a study on residual prediction techniques that can be applied to voice conversion based on linear transformation or hidden Markov model-based speech synthesis. Our voice conversion experiments showed that none of the six compared techniques was capable of successfully converting the voice while achieving fair speech quality. In this paper, we propose a novel residual prediction technique based on unit selection that outperforms the others in terms of speech quality (mean opinion score = 3) while maintaining the conversion performance.


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 1989

Word recognition in continuous speech using a phonological based two-network matching parser and a synthesis based prediction

M. Brenner; Harald Höge; Erwin Marschall; J. Romano

The present state of the SPICOS project, a speech understanding dialog system for fluent German speech using a 1000-word lexicon, is described. The first system version was demonstrated in 1986; the second will be finished in 1990. The authors present new SPICOS I results and some preliminary SPICOS II results with regard to phoneme and word recognition. A data-driven two-network matching parser compares the input network of alternative phonological units with a word lexicon organized as a cyclic network. This lexicon not only contains the standard pronunciation but also models interword and intraword assimilations. Substitutions, deletions, and insertions of single phonemes are taken into account during the match. The output of the parser is a network of word hypotheses. An additional module predicts long words after parts of them have already been hypothesized. The long word patterns are then synthesized by concatenating demisyllables and are verified parametrically. Well-matched words are added to the output network.


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 1989

Real-time recognition of subword units on a hybrid multi-DSP/ASIC based acoustic front-end

Abdulmesih Aktas; Harald Höge

A description is given of the hardware and software structure of the acoustic-phonetic decoding done in real time within the speaker-adaptive continuous speech understanding system SPICOS (Siemens, Philips, IPO continuous speech recognition and understanding). SPICOS is designed as a German language man-machine dialogue interface system consisting of acoustic-phonetic decoding, linguistic analysis, dialogue-modeling, and speech-synthesis modules. The acoustic-phonetic decoding is based on an articulatory feature vector, which is used to recognize subword units with hidden Markov models (HMM). Feature extraction and recognition are supported by special hardware. For the formant extraction, 16 LPC reflection coefficients are calculated by a signal processor and mapped onto a codebook with 4000 codes containing formant hypotheses. The latter task is performed by a dedicated application-specific integrated circuit designed for vector quantization.

Collaboration


Dive into Harald Höge's collaboration.

Top Co-Authors

Hermann Ney
RWTH Aachen University

Antonio Bonafonte
Polytechnic University of Catalonia