
Publication


Featured research published by Toshiaki Fukada.


International Conference on Acoustics, Speech, and Signal Processing | 1992

An adaptive algorithm for mel-cepstral analysis of speech

Toshiaki Fukada; Keiichi Tokuda; Takao Kobayashi; Satoshi Imai

The authors describe a mel-cepstral analysis method and its adaptive algorithm. In the proposed method, the authors apply the criterion used in the unbiased estimation of log spectrum to the spectral model represented by the mel-cepstral coefficients. To solve the nonlinear minimization problem involved in the method, they give an iterative algorithm whose convergence is guaranteed. Furthermore, they derive an adaptive algorithm for the mel-cepstral analysis by introducing an instantaneous estimate of the gradient of the criterion. The adaptive mel-cepstral analysis system is implemented with an IIR adaptive filter which has an exponential transfer function, and whose stability is guaranteed. The authors also present examples of speech analysis and results of an isolated word recognition experiment.
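The mel-cepstral representation above rests on warping the frequency axis with a first-order all-pass function. A minimal sketch of that warping, assuming the common all-pass parameter α = 0.42 (roughly the mel scale at a 16 kHz sampling rate); this illustrates only the frequency mapping, not the full adaptive algorithm:

```python
import math

def mel_warp(omega: float, alpha: float = 0.42) -> float:
    """First-order all-pass frequency warping: the phase response of
    (z^-1 - alpha) / (1 - alpha * z^-1), which maps linear frequency
    omega (radians) to an approximately mel-scaled frequency."""
    return omega + 2.0 * math.atan2(alpha * math.sin(omega),
                                    1.0 - alpha * math.cos(omega))
```

For α > 0 the map stretches low frequencies and compresses high ones, while fixing 0 and π, which is why low-frequency spectral detail dominates the mel-cepstral coefficients.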


Speech Communication | 1999

Automatic generation of multiple pronunciations based on neural networks

Toshiaki Fukada; Takayoshi Yoshimura; Yoshinori Sagisaka

We propose a method for automatically generating a pronunciation dictionary based on a pronunciation neural network that can predict plausible pronunciations (alternative pronunciations) from the canonical pronunciation. This method can generate multiple forms of alternative pronunciations using the pronunciation network. For generating a sophisticated alternative pronunciation dictionary, two techniques are described: (1) alternative pronunciations with likelihoods and (2) alternative pronunciations for word boundary phonemes. Experimental results on spontaneous speech show that the automatically derived pronunciation dictionaries give consistently higher recognition rates than a conventional dictionary.
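Technique (1), alternative pronunciations with likelihoods, can be sketched as follows, assuming the network supplies per-position phone posteriors (the posteriors and phone labels below are made-up stand-ins, not the paper's data):

```python
from itertools import product

def alternatives(posteriors, top_n=3):
    """Enumerate alternative pronunciations with likelihoods from
    per-position phone posteriors, keeping the top_n most likely.
    posteriors: list of {phone: probability} dicts, one per position."""
    hyps = []
    for phones in product(*[list(p.items()) for p in posteriors]):
        seq = [ph for ph, _ in phones]
        like = 1.0
        for _, pr in phones:
            like *= pr            # independence across positions assumed
        hyps.append((seq, like))
    hyps.sort(key=lambda h: -h[1])
    return hyps[:top_n]
```

A dictionary entry would then keep each surviving sequence together with its likelihood, so the recognizer can weight alternatives rather than treat them as equally probable.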


Journal of the Acoustical Society of America | 1998

Speech synthesis apparatus and method for synthesizing speech from a character series comprising a text and pitch information

Mitsuru Otsuka; Yasunori Ohora; Takashi Aso; Toshiaki Fukada

A speech synthesis method and apparatus for synthesizing speech from a character series comprising a text and pitch information. The apparatus includes a parameter generator for generating power spectrum envelopes as parameters of a speech waveform to be synthesized representing the input text in accordance with the input character series. The apparatus also includes a pitch waveform generator for generating pitch waveforms whose period equals the pitch specified by the pitch information. The pitch waveform generator generates the pitch waveforms from the input pitch information and the power spectrum envelopes generated by the parameter generator. Also provided is a speech waveform output device for outputting the speech waveform obtained by connecting the generated pitch waveforms.
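The pitch waveform generator described above can be illustrated with a zero-phase inverse cosine transform of the power spectrum envelope, one waveform per pitch period; this is a simplified sketch of the idea, not the patented method itself:

```python
import math

def pitch_waveform(envelope, period):
    """One pitch period of a waveform from a power spectrum envelope,
    assuming zero phase: amplitudes are the square roots of the power
    samples, summed as harmonic cosines over `period` output samples."""
    amps = [math.sqrt(p) for p in envelope]
    return [sum(a * math.cos(2.0 * math.pi * k * n / period)
                for k, a in enumerate(amps))
            for n in range(period)]
```

Concatenating such periods, with the period length set from the pitch information, yields a speech waveform whose pitch matches the input specification while the envelope controls the timbre.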


Journal of the Acoustical Society of America | 1998

Syllable-beat-point synchronized rule-based speech synthesis from coded utterance-speed-independent phoneme combination parameters

Mitsuru Ohtsuka; Yasunori Ohora; Takashi Asou; Takeshi Fujita; Toshiaki Fukada

In a speech synthesizer, each frame for generating a speech waveform has an expansion degree to which the frame is expanded or compressed in accordance with the production speed of synthetic speech. The time interval between beat synchronization points is determined from the set speech production speed, and the time length of each frame between the beat synchronization points is determined from the expansion degree of the frame. Parameters for producing a speech waveform in each frame are generated according to the time length determined for the frame. In the speech synthesizer for outputting a speech signal by coupling phonemes constituted by one or a plurality of frames having phoneme vowel-consonant combination parameters (VcV, cV, or V) of the speech waveform, the number of frames can be held constant regardless of a change in the speech production speed. This prevents degradation in the tone quality or a variation in the processing quantity resulting from a change in the speech production speed.
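The allocation of frame time lengths between beat synchronization points can be sketched as a proportional rule: each frame receives a share of the beat-to-beat interval in proportion to its expansion degree. The proportional split is an assumption for illustration; the patent does not spell out this exact formula:

```python
def frame_lengths(interval_ms, degrees):
    """Split the interval between two beat synchronization points across
    frames in proportion to each frame's expansion degree, so the frame
    count stays constant while total duration tracks the speech rate."""
    total = sum(degrees)
    return [interval_ms * d / total for d in degrees]
```

Under this rule, speeding up or slowing down speech only rescales `interval_ms`; the per-frame degrees (and hence the number of frames) are untouched, which is the property the abstract highlights.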


International Conference on Spoken Language Processing | 1996

Speech recognition based on acoustically derived segment units

Toshiaki Fukada; Michiel Bacchiani; Kuldip Kumar Paliwal; Yoshinori Sagisaka

The paper describes a new method of word model generation based on acoustically derived segment units (henceforth ASUs). An ASU-based approach has the advantages of not being tied to human pre-determined phonemes and of consistently generating acoustic units using the maximum likelihood (ML) criterion. The former advantage matters when it is difficult to map acoustics to a phone, as with highly co-articulated spontaneous speech. To implement an ASU-based modeling approach in a speech recognition system, two problems must first be solved: (1) how to design an inventory of acoustically derived segmental units, and (2) how to model the pronunciations of lexical entries in terms of the ASUs. For the second problem, the authors propose an ASU-based word model generation method that composes the ASU statistics, that is, their means, variances and durations. The effectiveness of the proposed method is shown through spontaneous word recognition experiments.
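Composing a word model from ASU statistics can be sketched as concatenating each unit's (mean, variance, mean duration) entry from the inventory; the inventory layout and unit names below are illustrative, not taken from the paper:

```python
import math

def compose_word_model(inventory, unit_sequence):
    """Build a word model by concatenating per-ASU statistics
    (mean, variance, mean duration in frames)."""
    return [inventory[u] for u in unit_sequence]

def gauss_loglik(x, mean, var):
    """Log-likelihood of a scalar observation under one ASU's Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

# Illustrative two-unit inventory and a word spelled as "a1 a2".
inventory = {"a1": (0.0, 1.0, 3), "a2": (2.0, 0.5, 2)}
model = compose_word_model(inventory, ["a1", "a2"])
dur = sum(d for _, _, d in model)  # expected word duration: 5 frames
```

Scoring a word hypothesis then reduces to aligning frames against this chain of Gaussians, with the composed durations constraining how long each unit may occupy.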


Computer Speech & Language | 1998

Model parameter estimation for mixture density polynomial segment models

Toshiaki Fukada; Kuldip Kumar Paliwal; Yoshinori Sagisaka

In this paper, we propose parameter estimation techniques for mixture density polynomial segment models (MDPSMs) where their trajectories are specified with an arbitrary regression order. MDPSM parameters can be trained in one of three different ways: (1) segment clustering; (2) expectation maximization (EM) training of mean trajectories; and (3) EM training of mean and variance trajectories. These parameter estimation methods were evaluated in TIMIT vowel classification experiments. The experimental results showed that modelling both the mean and variance trajectories is consistently superior to modelling only the mean trajectory. We also found that modelling both trajectories results in significant improvements over the conventional HMM.
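The regression at the heart of a polynomial segment model can be sketched at order 1: fit a linear mean trajectory c0 + c1·t over normalized time t ∈ [0, 1] to a segment's (scalar) features by least squares. Higher orders and the mixture/EM machinery of the paper are omitted:

```python
def fit_linear_trajectory(frames):
    """Least-squares fit of a first-order mean trajectory c0 + c1*t over
    normalized time to a segment of at least two scalar feature values.
    Solves the 2x2 normal equations in closed form."""
    n = len(frames)
    ts = [i / (n - 1) for i in range(n)]       # normalized time, assumes n >= 2
    st, stt = sum(ts), sum(t * t for t in ts)
    sx = sum(frames)
    stx = sum(t * x for t, x in zip(ts, frames))
    det = n * stt - st * st
    c1 = (n * stx - st * sx) / det
    c0 = (sx - c1 * st) / n
    return c0, c1
```

Normalizing time to [0, 1] is what makes the fitted coefficients comparable across segments of different lengths, a prerequisite for clustering segments and for EM re-estimation of trajectory parameters.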


Systems and Computers in Japan | 1999

Phoneme boundary estimation using bidirectional recurrent neural networks and its applications

Toshiaki Fukada; Mike Schuster; Yoshinori Sagisaka

This paper describes a phoneme boundary estimation method based on bidirectional recurrent neural networks (BRNNs). Experimental results showed that the proposed method could estimate segment boundaries significantly better than an HMM or a multilayer perceptron-based method. Furthermore, we incorporated the BRNN-based segment boundary estimator into the HMM-based and segment model-based recognition systems. As a result, we confirmed that (1) BRNN outputs were effective for improving the recognition rate and reducing computational time in an HMM-based recognition system and (2) segment lattices obtained by the proposed methods dramatically reduce the computational complexity of segment model-based recognition.
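Turning per-frame boundary posteriors into segment boundaries can be sketched as thresholded peak-picking; here the posteriors are given directly (in the paper a BRNN would supply them), so this shows only the post-processing step:

```python
def boundary_frames(posteriors, threshold=0.5):
    """Return frame indices that are local maxima of the boundary
    posterior and exceed the threshold: candidate phoneme boundaries."""
    return [i for i in range(1, len(posteriors) - 1)
            if posteriors[i] >= threshold
            and posteriors[i] > posteriors[i - 1]
            and posteriors[i] >= posteriors[i + 1]]
```

Restricting segment hypotheses to start and end near these candidate frames is what shrinks the segment lattice, which is the source of the computational savings the abstract reports.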


International Conference on Acoustics, Speech, and Signal Processing | 1998

Speaker normalized acoustic modeling based on 3-D Viterbi decoding

Toshiaki Fukada; Yoshinori Sagisaka

This paper describes a novel method for speaker normalization based on a frequency warping approach to reduce variations due to speaker-induced factors such as the vocal tract length. In our approach, a speaker normalized acoustic model is trained using time-varying (i.e., state, phoneme or word dependent) warping factors, while in the conventional approaches, the frequency warping factor is fixed for each speaker. These time-varying frequency warping factors are determined by a 3-dimensional (i.e., input frames, HMM states and warping factors) Viterbi decoding procedure. Experimental results on Japanese spontaneous speech recognition show that the proposed method yields a 9.7% improvement in speech recognition accuracy compared to the conventional speaker-independent model.
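The decoding idea can be sketched as a Viterbi search over (frame, warping factor): each frame is scored under each candidate warping factor, and changing the factor between frames incurs a penalty so the chosen factors vary smoothly over time. This folds the HMM-state axis of the paper's 3-dimensional search into the per-frame scores, so it is a simplification:

```python
def best_warp_path(scores, switch_penalty=1.0):
    """Viterbi over warping factors: scores[t][w] is the acoustic
    log-likelihood of frame t under warp factor w; switching factors
    between adjacent frames costs switch_penalty. Returns one factor
    index per frame."""
    n_w = len(scores[0])
    best = list(scores[0])
    back = []
    for t in range(1, len(scores)):
        prev, best, ptr = best, [0.0] * n_w, [0] * n_w
        for w in range(n_w):
            cands = [prev[v] - (switch_penalty if v != w else 0.0)
                     for v in range(n_w)]
            ptr[w] = max(range(n_w), key=lambda v: cands[v])
            best[w] = scores[t][w] + cands[ptr[w]]
        back.append(ptr)
    w = max(range(n_w), key=lambda v: best[v])
    path = [w]
    for ptr in reversed(back):
        w = ptr[w]
        path.append(w)
    return path[::-1]
```

With the penalty set high the search degenerates to a single per-speaker factor, recovering the conventional fixed-warp approach; lowering it allows the state- or phoneme-dependent factors the paper advocates.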


International Conference on Acoustics, Speech, and Signal Processing | 2004

A differential spectral voice activity detector

Philip N. Garner; Toshiaki Fukada; Yasuhiro Komori

The voice activity detection (VAD) problem is placed into a decision theoretic framework, and the Gaussian VAD model of Sohn et al. (1998, 1999) is then shown to fit well with the framework. It is argued that the Gaussian model can be made more robust to correlation and expected spectral shapes of speech and noise by using a differential spectral representation. Such a model is formulated theoretically. The differential spectral VAD is then shown by experiment to compare favourably with the basic Gaussian VAD in a speech recognition setting, especially for noisy environments.
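The baseline Gaussian VAD that the paper builds on can be sketched as a per-frame log likelihood ratio averaged over spectral bins, assuming the noise PSD and a-priori SNR per bin are already known (in practice both must be estimated, and the paper's differential spectral representation is not shown here):

```python
import math

def gaussian_vad_llr(power, noise_var, xi):
    """Frame-level log likelihood ratio of the basic Gaussian VAD:
    power     = per-bin |Y_k|^2 of the observed frame,
    noise_var = per-bin noise PSD lambda_k (assumed known),
    xi        = per-bin a-priori SNR (assumed known)."""
    llr = 0.0
    for p, lam, x in zip(power, noise_var, xi):
        gamma = p / lam                         # a-posteriori SNR
        llr += gamma * x / (1.0 + x) - math.log(1.0 + x)
    return llr / len(power)

def is_speech(power, noise_var, xi, threshold=0.0):
    """Classify a frame as speech when the mean LLR exceeds a threshold."""
    return gaussian_vad_llr(power, noise_var, xi) > threshold
```

The LLR grows with the a-posteriori SNR, so frames whose observed power is large relative to the noise floor are flagged as speech; the threshold trades false alarms against missed speech.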


Journal of the Acoustical Society of America | 2007

Speech information processing method, apparatus and storage medium performing speech synthesis based on durations of phonemes

Toshiaki Fukada

A speech information processing apparatus which accurately sets the duration of a phonological series, setting natural phoneme durations in accordance with the phonemic/linguistic environment. For this purpose, the duration of a predetermined unit of the phonological series is obtained from a duration model for the entire segment. The duration of each phoneme constructing the phonological series is then obtained from a duration model for a partial segment. Finally, the duration of each phoneme is set based on both the duration of the phonological series and the duration of each phoneme.
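Reconciling the two duration models can be sketched as a proportional rescaling: the partial-segment model proposes per-phoneme durations, which are scaled so their sum matches the whole-segment model's total. The proportional rule is an assumption for illustration, not necessarily the patented combination:

```python
def set_durations(total_ms, partial_ms):
    """Rescale the partial-segment model's per-phoneme durations so they
    sum exactly to the whole-segment model's total duration."""
    s = sum(partial_ms)
    return [total_ms * d / s for d in partial_ms]
```

This keeps the relative timing the phoneme-level model learned while enforcing the overall rhythm predicted for the full phonological series.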
