
Publication


Featured research published by Raimo Bakis.


International Conference on Acoustics, Speech, and Signal Processing | 1987

Experiments with the Tangora 20,000 word speech recognizer

Amir Averbuch; Lalit R. Bahl; Raimo Bakis; Peter F. Brown; G. Daggett; Subhro Das; K. Davies; S. De Gennaro; P. V. de Souza; Edward A. Epstein; D. Fraleigh; Frederick Jelinek; Burn L. Lewis; Robert Leroy Mercer; J. Moorhead; Arthur Nádas; David Nahamoo; Michael Picheny; G. Shichman; P. Spinelli; D. Van Compernolle; H. Wilkens

The Speech Recognition Group at IBM Research in Yorktown Heights has developed a real-time, isolated-utterance speech recognizer for natural language based on the IBM Personal Computer AT and IBM Signal Processors. The system has recently been enhanced by expanding the vocabulary from 5,000 words to 20,000 words and by the addition of a speech workstation to support usability studies on document creation by voice. The system supports spelling and interactive personalization to augment the vocabularies. This paper describes the implementation, user interface, and comparative performance of the recognizer.


Journal of the Acoustical Society of America | 2011

System and method for rescoring N-best hypotheses of an automatic speech recognition system

Raimo Bakis; Ellen Eide

A system and method for rescoring the N-best hypotheses from an automatic speech recognition system by comparing an original speech waveform to synthetic speech waveforms that are generated for each text sequence of the N-best hypotheses. A distance is calculated from the original speech waveform to each of the synthesized waveforms, and the text associated with the synthesized waveform that is determined to be closest to the original waveform is selected as the final hypothesis. The original waveform and each synthesized waveform are aligned to a corresponding text sequence on a phoneme level. The mean of the feature vectors which align to each phoneme is computed for the original waveform as well as for each of the synthesized hypotheses. The distance of a synthesized hypothesis to the original speech signal is then computed as the sum over all phonemes in the hypothesis of the Euclidean distance between the means of the feature vectors of the frames aligning to that phoneme for the original and the synthesized signals. The text of the hypothesis which is closest under the above metric to the original waveform is chosen as the final system output.
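The rescoring metric above is simple enough to sketch directly: align frames to phonemes, average the feature vectors per phoneme, and sum Euclidean distances between the original and each synthesized hypothesis. The sketch below is a minimal illustration assuming precomputed feature frames and frame-level phoneme alignments; the feature extraction, synthesis, and alignment steps themselves are outside its scope.

```python
import numpy as np

def phoneme_distance(orig_frames, synth_frames, orig_align, synth_align, phonemes):
    """Sum over phonemes of the Euclidean distance between the mean feature
    vectors of the frames aligned to each phoneme, as the patent describes."""
    total = 0.0
    for p in phonemes:
        mu_orig = orig_frames[orig_align == p].mean(axis=0)
        mu_synth = synth_frames[synth_align == p].mean(axis=0)
        total += np.linalg.norm(mu_orig - mu_synth)
    return total

def rescore(orig_frames, orig_align, hypotheses):
    """Select the hypothesis whose synthesized waveform is closest to the
    original. `hypotheses` maps text -> (synth_frames, synth_align, phonemes)."""
    best_text, best_dist = None, float("inf")
    for text, (frames, align, phonemes) in hypotheses.items():
        d = phoneme_distance(orig_frames, frames, orig_align, align, phonemes)
        if d < best_dist:
            best_text, best_dist = text, d
    return best_text
```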


International Conference on Acoustics, Speech, and Signal Processing | 2003

Recent improvements to the IBM trainable speech synthesis system

Ellen Eide; Andrew Aaron; Raimo Bakis; R. Cohen; Robert E. Donovan; Wael Hamza; T. Mathes; Michael Picheny; M. Polkosky; M. Smith; M. Viswanathan

In this paper we describe the current status of the trainable text-to-speech system at IBM. Recent algorithmic and database changes to the system have led to significant gains in the output quality. On the algorithms side, we have introduced statistical models for predicting pitch and duration targets which replace the rule-based target generation previously employed. Additionally, we have changed the cost function and the search strategy, introduced a post-search pitch smoothing algorithm, and improved our method of preselection. Through the combined data and algorithmic contributions, we have been able to significantly improve (p < 0.0001) the mean opinion score (MOS) of our female voice, from 3.68 to 4.85 when heard over loudspeakers and to 5.42 when heard over the telephone (seven point scale).


International Conference on Acoustics, Speech, and Signal Processing | 1993

Influence of background noise and microphone on the performance of the IBM Tangora speech recognition system

Subrata K. Das; Raimo Bakis; Arthur Nádas; David Nahamoo; Michael Picheny

With the intention of developing a robust speech recognizer largely immune to the vagaries of extrinsic changes, the authors investigated the effects of various background noises and microphones on the performance of the Tangora system. They identified several noisy locations such as the cafeteria and a secretary's office and included a relatively quiet office for comparison. They recorded isolated-word training and test data from one male and one female speaker at different locations employing several varieties of microphones. A typical experiment consisted of designing a speaker-independent HMM (hidden Markov model) system with one set of training data and decoding the test data collected at all locations. It was found that microphone characteristics had a significant impact on the robustness of the system. It was also observed that controlled contamination of the quiet training data with ambient noise improved the noise immunity of the recognizer, discounting the role of the Lombard effect in the studies.
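The "controlled contamination" finding amounts to mixing recorded ambient noise into the quiet training audio at a chosen signal-to-noise ratio. A minimal sketch of such mixing, assuming clean and noise signals as NumPy sample arrays (the paper does not specify its exact mixing procedure):

```python
import numpy as np

def contaminate(clean, noise, snr_db):
    """Mix ambient noise into quiet training audio at a target SNR in dB.
    The noise is tiled/truncated to the clean signal's length, then scaled
    so that mean-power ratio clean/noise equals 10**(snr_db/10)."""
    noise = np.resize(noise, clean.shape)
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise
```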


International Conference on Acoustics, Speech, and Signal Processing | 1988

Obtaining candidate words by polling in a large vocabulary speech recognition system

Lalit R. Bahl; Raimo Bakis; P. V. de Souza; Robert L. Mercer

Considers the problem of rapidly obtaining a short list of candidate words for more detailed inspection in a large vocabulary, vector-quantizing speech recognition system. An approach called polling is advocated, in which each label produced by the vector quantizer casts a varying, real-valued vote for each word in the vocabulary. The words receiving the highest votes are placed on a short list to be matched in detail at a later stage of processing. Expressions are derived for these votes under the assumption that for any given word, the observed label frequencies have Poisson distributions. Although the method is more general, particular attention is paid to the implementation of polling in speech recognition systems which use hidden Markov models during the acoustic match computation. Results are presented of experiments with speaker-dependent and speaker-independent Markov models on two different isolated word recognition tasks.
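Under the Poisson assumption, each occurrence of label l casts a vote of log λ(w, l) for word w, and each word also pays a count-independent constant equal to its total expected label rate. A minimal sketch of that scoring, assuming a precomputed per-word table of expected label rates (the table values here are hypothetical, not from the paper):

```python
import numpy as np

def poll(label_rates, observed_labels, shortlist_size):
    """Rank words by Poisson log-likelihood of the observed VQ labels.
    `label_rates`: (num_words, num_labels) expected occurrence rates.
    Each observed label l adds log(rate[w, l]) to word w's score, and every
    word subtracts its total expected rate (the Poisson normalizer)."""
    votes = np.log(label_rates + 1e-10)          # per-occurrence votes
    scores = votes[:, observed_labels].sum(axis=1) - label_rates.sum(axis=1)
    return np.argsort(scores)[::-1][:shortlist_size]
```

The label-factorial terms of the Poisson likelihood are omitted because they are identical for every word and cancel in the ranking.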


Journal of the Acoustical Society of America | 2005

Method and apparatus for translating natural-language speech using multiple output phrases

Raimo Bakis; Mark E. Epstein; William Stuart Meisel; Miroslav Novak; Michael Picheny; Ridley M. Whitaker

A multi-lingual translation system that provides multiple output sentences for a given word or phrase. Each output sentence for a given word or phrase reflects, for example, a different emotional emphasis, dialect, accents, loudness or rates of speech. A given output sentence could be selected automatically, or manually as desired, to create a desired effect. For example, the same output sentence for a given word or phrase can be recorded three times, to selectively reflect excitement, sadness or fear. The multi-lingual translation system includes a phrase-spotting mechanism, a translation mechanism, a speech output mechanism and optionally, a language understanding mechanism or an event measuring mechanism or both. The phrase-spotting mechanism identifies a spoken phrase from a restricted domain of phrases. The language understanding mechanism, if present, maps the identified phrase onto a small set of formal phrases. The translation mechanism maps the formal phrase onto a well-formed phrase in one or more target languages. The speech output mechanism produces high-quality output speech. The speech output may be time synchronized to the spoken phrase using the output of the event measuring mechanism.
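The pipeline the patent describes (phrase spotting over a restricted domain, mapping to a small set of formal phrases, then translation to a well-formed target phrase) can be sketched as two table lookups. The table contents below are hypothetical illustrations, not from the patent:

```python
def translate(spoken_phrase, formal_map, target_map):
    """Restricted-domain translation: spot the phrase, map it to a formal
    phrase, then map the formal phrase to a well-formed target-language
    phrase. Returns None when the phrase is outside the domain."""
    formal = formal_map.get(spoken_phrase)   # phrase spotting + understanding
    if formal is None:
        return None
    return target_map[formal]                # translation mechanism
```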


International Conference on Acoustics, Speech, and Signal Processing | 2006

High Quality Sinusoidal Modeling of Wideband Speech for the Purposes of Speech Synthesis and Modification

Dan Chazan; Ron Hoory; Ariel Sagi; Slava Shechtman; Alexander Sorin; Zhiwei Shuang; Raimo Bakis

This paper describes an efficient sinusoidal modeling framework for high quality wide band (WB) speech synthesis and modification. This technique may serve as a basis for speech compression in the context of small footprint concatenative Text to Speech systems. In addition, it is a useful representation for voice transformation and morphing purposes, e.g., simultaneous pitch modification and spectral envelope warping. The conventional sinusoidal modeling is enhanced with an adaptive frequency dithering mechanism, based on a degree of voicing analysis. Considerable reduction of the amount of model parameters is achieved by high band phase extension. The proposed model is evaluated and compared to the alternative STRAIGHT framework [1]. Being simpler and considerably more efficient than STRAIGHT, it outperforms it in speech quality for both speech reconstruction and transformation.
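The core of any sinusoidal model is reconstructing a frame as a sum of sinusoids with estimated frequencies, amplitudes, and phases. A minimal reconstruction sketch follows; the paper's contributions (adaptive frequency dithering driven by degree-of-voicing analysis, high-band phase extension) are omitted here:

```python
import numpy as np

def synth_sinusoidal(freqs, amps, phases, n_samples, sr):
    """Reconstruct one frame as a sum of sinusoids: sum_k a_k * cos(2*pi*f_k*t + p_k)."""
    t = np.arange(n_samples) / sr
    frame = np.zeros(n_samples)
    for f, a, p in zip(freqs, amps, phases):
        frame += a * np.cos(2 * np.pi * f * t + p)
    return frame
```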


International Conference on Acoustics, Speech, and Signal Processing | 1997

Transcription of broadcast news-system robustness issues and adaptation techniques

Raimo Bakis; Scott Shaobing Chen; Ponani S. Gopalakrishnan; Ramesh A. Gopinath; Stephane Herman Maes; L. Polymenalos

This paper describes some of the main problems and issues specific to the transcription of broadcast news and describes some of the methods for solving them that have been incorporated into the IBM Large Vocabulary Continuous Speech Recognition System.


Journal of the Acoustical Society of America | 1997

Speech coding apparatus and method for generating acoustic feature vector component values by combining values of the same features for multiple time intervals

Raimo Bakis; Ponani S. Gopalakrishnan; Dimitri Kanevsky; Arthur Nádas; David Nahamoo; Michael Picheny; Jan Sedivy

A speech coding apparatus and method measures the values of at least first and second different features of an utterance during each of a series of successive time intervals. For each time interval, a feature vector signal has a first component value equal to a first weighted combination of the values of only one feature of the utterance for at least two time intervals. The feature vector signal has a second component value equal to a second weighted combination, different from the first weighted combination, of the values of only one feature of the utterance for at least two time intervals. The resulting feature vector signals for a series of successive time intervals form a coded representation of the utterance. In one embodiment, a first weighted mixture signal has a value equal to a first weighted mixture of the values of the features of the utterance during a single time interval. A second weighted mixture signal has a value equal to a second weighted mixture, different from the first weighted mixture, of the values of the features of the utterance during a single time interval. The first component value of each feature vector signal is equal to a first weighted combination of the values of only the first weighted mixture signals for at least two time intervals, and the second component value of each feature vector signal is equal to a second weighted combination, different from the first weighted combination, of the values of only the second weighted mixture for at least two time intervals.
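The first embodiment can be sketched directly: each output component combines the values of a single feature across a window of time intervals, with a different weight vector per component. The window length, padding scheme, and weight values below are illustrative assumptions, not taken from the patent:

```python
import numpy as np

def code_features(features, w1, w2):
    """Build a 2-component coded vector per time interval. Component 1 is the
    weighted combination w1 of feature 0's values over a context window;
    component 2 is a different combination w2 of feature 1's values.
    `features`: (T, num_features); `w1`, `w2`: weights over the window."""
    T = features.shape[0]
    k = len(w1) // 2
    padded = np.pad(features, ((k, k), (0, 0)), mode="edge")
    out = np.empty((T, 2))
    for t in range(T):
        win = padded[t:t + len(w1)]
        out[t, 0] = win[:, 0] @ w1   # only feature 0, combined across intervals
        out[t, 1] = win[:, 1] @ w2   # only feature 1, different weights
    return out
```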


Tsinghua Science & Technology | 2008

IBM voice conversion systems for 2007 TC-STAR evaluation

Zhiwei Shuang; Raimo Bakis; Yong Qin

This paper proposes a novel voice conversion method by frequency warping. The frequency warping function is generated based on mapping formants of the source speaker and the target speaker. In addition to frequency warping, fundamental frequency adjustment, spectral envelope equalization, breathiness addition, and duration modification are also used to improve the similarity to the target speaker. The proposed voice conversion method needs only a very small amount of training data for generating the warping function, thereby greatly facilitating its application. Systems based on the proposed method were used for the 2007 TC-STAR intra-lingual voice conversion evaluation for English and Spanish and a cross-lingual voice conversion evaluation for Spanish. The evaluation results show that the proposed method can achieve a much better quality of converted speech than other methods as well as a good balance between quality and similarity. The IBM system was ranked No. 1 for English evaluation and No. 2 for Spanish evaluation. Evaluation results also show that the proposed method is a convenient and competitive method for cross-lingual voice conversion tasks.
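A formant-mapped warping function can be sketched as piecewise-linear interpolation anchored at corresponding source and target formant frequencies. The anchor values below are hypothetical; the paper does not publish its estimated formants here:

```python
import numpy as np

def warp_function(src_formants, tgt_formants, nyquist):
    """Piecewise-linear frequency warp mapping the source speaker's formant
    frequencies onto the target speaker's, pinned at 0 Hz and Nyquist."""
    xs = np.concatenate(([0.0], src_formants, [nyquist]))
    ys = np.concatenate(([0.0], tgt_formants, [nyquist]))
    return lambda f: np.interp(f, xs, ys)
```

Applying the returned function to the frequency axis of the source spectral envelope shifts its formant structure toward the target speaker's.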
