Takehiko Kagoshima
Toshiba
Publication
Featured research published by Takehiko Kagoshima.
IEICE Transactions on Information and Systems | 2005
Tatsuya Mizutani; Takehiko Kagoshima
This paper proposes a novel speech synthesis method to generate human-like natural speech. The conventional unit-selection-based synthesis method selects speech units from a large database and concatenates them, with or without modifying the prosody, to generate synthetic speech. This method features highly human-like voice quality. However, it has the problem that a suitable speech unit is not necessarily selected. Since unsuitable speech unit selection causes discontinuity between consecutive speech units, the synthesized speech quality deteriorates. One might expect the conventional method to attain higher speech quality if the database size increases. However, preparing a larger database requires a longer recording time, and the narrator's voice quality does not remain constant throughout the recording period. This fact degrades the database quality and still leaves the problem of unsuitable selection. We propose the plural unit selection and fusion method, which avoids this problem. This method integrates the unit fusion used in the unit-training-based method with the conventional unit-selection-based method. The proposed method selects plural speech units for each segment, fuses the selected speech units for each segment, modifies the prosody of the fused speech units, and concatenates them to generate synthetic speech. This unit fusion creates speech units that connect to one another with much less voice discontinuity, realizing high-quality speech. A subjective evaluation test showed that the proposed method greatly improves speech quality compared with the conventional method. It also showed that the speech quality of the proposed method remains high regardless of the database size, from small (10 minutes) to large (40 minutes). The proposed method is a new framework in the sense that it is a hybrid of the unit-selection-based and unit-training-based methods. Within the framework, the unit selection and unit fusion algorithms are interchangeable with more efficient techniques, so the framework is expected to lead to new synthesis methods.
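As a rough illustration of the select-fuse-modify-concatenate pipeline described in this abstract, the following Python sketch fuses plural candidate units per segment by averaging pitch-cycle waveforms. The function names, the toy distance-based cost, and the fixed waveform length are assumptions for illustration only, not the authors' implementation.

```python
# Minimal sketch of plural unit selection and fusion, assuming a toy
# database of fixed-length pitch-cycle waveforms per segment; the cost
# function and all names are illustrative, not the paper's actual ones.
import numpy as np

def select_plural_units(candidates, target, n_best=3):
    """Pick the n_best candidate units closest to the target (toy cost)."""
    costs = [np.linalg.norm(c - target) for c in candidates]
    order = np.argsort(costs)[:n_best]
    return [candidates[i] for i in order]

def fuse_units(units):
    """Fuse selected units by averaging their pitch-cycle waveforms."""
    return np.mean(np.stack(units), axis=0)

# Toy example: one segment, five candidate pitch-cycle waveforms.
rng = np.random.default_rng(0)
target = np.sin(np.linspace(0, 2 * np.pi, 64))
candidates = [target + 0.1 * rng.standard_normal(64) for _ in range(5)]

fused = fuse_units(select_plural_units(candidates, target))
# Averaging several nearby candidates smooths unit-specific noise, which
# is why fused units concatenate with less discontinuity than one unit.
print(fused.shape)
```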
international conference on acoustics, speech, and signal processing | 2006
Dawei Xu; Haifeng Wang; Guohua Li; Takehiko Kagoshima
In Mandarin prosody synthesis by means of a hierarchical prosodic structure, the naturalness of the output depends largely on the parsing of the prosodic structure. We propose a machine learning approach to improve prosodic structure parsing in cases where full syntactic parsing is omitted for practical reasons. The novel aspect of our approach is a new attribute in the input vector, named the connective degree, calculated from the occurrence rate of punctuation marks between Chinese characters in a large text corpus. Experimental results show that the connective degree makes a remarkable contribution to the parsing of hierarchical Mandarin prosodic structure.
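A hedged sketch of how a connective-degree-like feature could be estimated from punctuation statistics follows; the paper does not give its exact formula here, so the probability estimate, the punctuation set, and the tiny corpus are all assumptions.

```python
# For each character bigram, estimate how often the two characters occur
# adjacently versus separated by punctuation in a corpus. A high value
# suggests the pair is strongly connected (no prosodic break likely).
from collections import defaultdict

PUNCT = set("，。、；：？！")  # a sample of Chinese punctuation marks

def connective_degree_table(corpus_lines):
    """Return P(no punctuation between c1 and c2) per character bigram."""
    together = defaultdict(int)   # c1 immediately followed by c2
    separated = defaultdict(int)  # c1 and c2 separated by punctuation
    for line in corpus_lines:
        for i in range(len(line) - 2):
            a, b, c = line[i], line[i + 1], line[i + 2]
            if a not in PUNCT and c not in PUNCT and b in PUNCT:
                separated[(a, c)] += 1
        for i in range(len(line) - 1):
            a, b = line[i], line[i + 1]
            if a not in PUNCT and b not in PUNCT:
                together[(a, b)] += 1
    degree = {}
    for pair in set(together) | set(separated):
        t, s = together[pair], separated[pair]
        degree[pair] = t / (t + s) if (t + s) else 0.0
    return degree

table = connective_degree_table(["我们，今天去公园。我们今天很高兴。"])
print(table.get(("们", "今")))  # 0.5 here; higher = more strongly connected
```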
international conference on acoustics, speech, and signal processing | 2005
Masatsune Tamura; Tatsuya Mizutani; Takehiko Kagoshima
Recently, concatenative speech synthesizers with large databases have been widely developed for high-quality speech synthesis. However, some platforms require a speech synthesis system that can work within limited memory footprint or computational cost. In this paper, we propose a scalable concatenative speech synthesizer based on the plural speech unit selection and fusion method. To realize scalability, we propose an offline unit fusion method in which pitch-cycle waveforms for voiced segments are fused in advance. The experimental results show that the synthetic speech of the offline unit fusion method with a half-size waveform database is comparable to that of the online unit fusion method, while the computational cost is reduced to 1/10.
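To make the offline idea concrete, here is a small Python sketch that pre-fuses a unit's candidate pitch-cycle waveforms into a few representatives before synthesis; the energy-based grouping below is a stand-in assumption, since the paper's actual clustering is not described in this abstract.

```python
# Illustrative offline unit fusion: fuse pitch-cycle waveforms ahead of
# time so runtime synthesis only needs selection, prosody modification,
# and concatenation. The grouping heuristic is an assumption.
import numpy as np

def offline_fuse(unit_waveforms, n_fused=2):
    """Pre-fuse candidate pitch-cycle waveforms into n_fused averages."""
    units = np.stack(unit_waveforms)
    # Trivial split by energy as a stand-in for a real clustering step.
    energy = (units ** 2).sum(axis=1)
    groups = np.array_split(units[np.argsort(energy)], n_fused)
    return [g.mean(axis=0) for g in groups]

rng = np.random.default_rng(1)
waveforms = [rng.standard_normal(64) for _ in range(8)]
prefused = offline_fuse(waveforms)
# At runtime the synthesizer picks one pre-fused waveform per segment,
# avoiding the costly online averaging step entirely.
print(len(prefused), prefused[0].shape)
```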
international conference on acoustics, speech, and signal processing | 1997
Takehiko Kagoshima; Masami Akamine
This paper proposes a new method for automatically generating speech synthesis units. A small set of synthesis units is selected from a large speech database by the proposed closed-loop training (CLT) method. Because CLT is based on the evaluation and minimization of the distortion caused by the synthesis process, such as prosodic modification, the selected synthesis units are the most suitable for synthesizers. CLT is applied to a waveform-concatenation-based synthesizer whose basic unit is the CV/VC (diphone). It is shown that synthesis units can be efficiently generated by CLT from a labeled speech database with a small amount of computation. Moreover, the synthesized speech is clear and smooth even though the storage size of the waveform dictionary is small.
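The defining idea, measuring distortion after the synthesis process rather than on raw units, can be sketched as below. The toy "synthesis" step (a gain rescale standing in for prosodic modification) and the exhaustive search are assumptions; the paper's actual distortion measure and search are not reproduced here.

```python
# Hedged sketch of closed-loop training (CLT): pick the unit whose
# SYNTHESIZED versions best match the training targets, i.e. minimize
# post-synthesis distortion, not raw unit-to-target distance.
import numpy as np

def synthesize(unit, target_gain):
    """Toy stand-in for prosodic modification during synthesis."""
    return unit * target_gain

def clt_select(database, training_targets):
    """Return the unit minimizing total distortion after synthesis."""
    best_unit, best_cost = None, np.inf
    for unit in database:
        cost = 0.0
        for target in training_targets:
            gain = np.linalg.norm(target) / (np.linalg.norm(unit) + 1e-9)
            cost += np.linalg.norm(synthesize(unit, gain) - target) ** 2
        if cost < best_cost:
            best_unit, best_cost = unit, cost
    return best_unit, best_cost

rng = np.random.default_rng(2)
db = [rng.standard_normal(32) for _ in range(20)]
targets = [db[3] * g for g in (0.5, 1.5, 2.0)]  # derived from unit 3
unit, cost = clt_select(db, targets)
print(np.allclose(unit, db[3]), round(cost, 6))  # unit 3 wins, ~0 cost
```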
international conference on acoustics, speech, and signal processing | 2011
Masatsune Tamura; Masahiro Morita; Takehiko Kagoshima; Masami Akamine
This paper presents a rapid voice adaptation algorithm using GMM-based frequency warping and shift with parameters of a sub-band basis spectrum model (SBM) [1]. The SBM parameter represents the spectral shape of speech and is calculated by fitting a sub-band basis to the log-spectrum. Since the parameter is a frequency-domain representation, frequency warping can be applied directly to it. A frequency warping function that minimizes the distance between source and target SBM parameter pairs in each mixture component of a GMM is derived using a dynamic programming (DP) algorithm. The proposed method is evaluated in a unit-selection-based voice adaptation framework applied to a unit-fusion-based text-to-speech synthesizer. The experimental results show that the proposed adaptation method is effective for rapid voice adaptation using just one sentence, compared to the conventional GMM-based linear transformation of mel-cepstra.
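As a sketch of deriving a frequency warping by DP, the snippet below aligns the frequency bins of a source and a target spectral envelope (stand-ins for SBM parameters) with a monotonic DTW path. The paper's slope constraints and per-mixture GMM handling are not reproduced; the Gaussian envelopes are illustrative data.

```python
# DP-based frequency warping: find a monotonic mapping between source
# and target frequency bins minimizing total squared difference (DTW).
import numpy as np

def dp_frequency_warp(src, tgt):
    """Return a monotonic (src_bin, tgt_bin) path via DTW backtracking."""
    n, m = len(src), len(tgt)
    cost = np.full((n, m), np.inf)
    cost[0, 0] = (src[0] - tgt[0]) ** 2
    for i in range(n):
        for j in range(m):
            if i == j == 0:
                continue
            prev = min(cost[i - 1, j] if i else np.inf,
                       cost[i, j - 1] if j else np.inf,
                       cost[i - 1, j - 1] if i and j else np.inf)
            cost[i, j] = (src[i] - tgt[j]) ** 2 + prev
    # Backtrack from (n-1, m-1) to recover the warping path.
    path, i, j = [], n - 1, m - 1
    while i or j:
        path.append((i, j))
        moves = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min((p for p in moves if p[0] >= 0 and p[1] >= 0),
                   key=lambda p: cost[p])
    path.append((0, 0))
    return path[::-1]

freq = np.linspace(0, 1, 32)
src = np.exp(-((freq - 0.3) ** 2) / 0.01)  # source formant peak at 0.3
tgt = np.exp(-((freq - 0.4) ** 2) / 0.01)  # target peak shifted to 0.4
print(dp_frequency_warp(src, tgt)[:5])     # source-to-target bin mapping
```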
IEICE Transactions on Information and Systems | 2007
Masatsune Tamura; Tatsuya Mizutani; Takehiko Kagoshima
We have previously developed a concatenative speech synthesizer based on the plural speech unit selection and fusion method that can synthesize stable and human-like speech. In this method, plural speech units for each speech segment are selected using a cost function and fused by averaging pitch-cycle waveforms. The method has a large computational cost, but some platforms require a speech synthesis system that can work within limited hardware resources. In this paper, we propose an offline unit fusion method that reduces the computational cost. In the proposed method, speech units are fused in advance to build a pre-fused speech unit database. At synthesis time, a speech unit for each segment is selected from the pre-fused speech unit database, and the speech waveform is synthesized by applying prosodic modification and concatenation without the computationally expensive unit fusion process. We compared several algorithms for constructing the pre-fused speech unit database. The subjective and objective evaluations confirm the effectiveness of the proposed method: the quality of synthetic speech from the offline unit fusion method with a 100 MB database is close to that of the online unit fusion method with a 93 MB JP database and slightly lower than that of the 390 MB US database, while the computation time is reduced by 80%. We also show that the frequency-weighted VQ-based method is effective for constructing the pre-fused speech unit database.
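A minimal sketch of a frequency-weighted VQ step for building such a pre-fused database follows: cluster candidate unit spectra under a per-bin weighted distance and store the cluster means as pre-fused units. The weight profile and the plain k-means loop are assumptions; the paper's actual weighting is not given in this abstract.

```python
# Frequency-weighted VQ: weighted k-means over unit spectra, keeping the
# cluster means as the pre-fused speech unit database entries.
import numpy as np

def weighted_vq(spectra, weights, n_codes=4, n_iter=20, seed=0):
    """Weighted k-means; returns a codebook of fused representatives."""
    rng = np.random.default_rng(seed)
    data = np.stack(spectra)
    codes = data[rng.choice(len(data), n_codes, replace=False)]
    for _ in range(n_iter):
        # Assign each spectrum to the nearest code (weighted distance).
        d = ((data[:, None, :] - codes[None]) ** 2 * weights).sum(axis=2)
        assign = d.argmin(axis=1)
        for k in range(n_codes):
            members = data[assign == k]
            if len(members):
                codes[k] = members.mean(axis=0)
    return codes

rng = np.random.default_rng(3)
spectra = [np.abs(rng.standard_normal(16)) for _ in range(40)]
weights = np.linspace(2.0, 0.5, 16)  # emphasize lower frequency bins
codebook = weighted_vq(spectra, weights)
print(codebook.shape)  # (4, 16): four pre-fused spectral representatives
```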
international conference on acoustics, speech, and signal processing | 2010
Masatsune Tamura; Norbert Braunschweiler; Takehiko Kagoshima; Masami Akamine
In this paper, we propose a speech synthesis method that combines a natural-waveform-concatenation-based speech synthesis method with our baseline plural unit selection and fusion method. The two main features of the proposed method are (i) prosody regeneration from selected speech units and (ii) the use of multiple speech units at non-adjacent segments. A non-adjacent segment is a segment whose previous or following speech unit in the optimum speech unit sequence is not adjacent to it in the database. By using the prosody of the selected speech units, the original prosodic expressions and sounds of the recorded speech are retained, while discontinuities are reduced by using multiple speech units at non-adjacent segments. MOS evaluations showed that the proposed method provides a clear improvement over both the conventional unit selection method and our baseline method.
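The non-adjacency test itself is simple to illustrate. In the sketch below, each selected unit carries its position index in the original recordings, and a segment is flagged when either boundary joins units that were not contiguous in the database; the integer-index representation is an assumption for illustration.

```python
# Flag "non-adjacent segments": a selected unit whose neighbor in the
# chosen sequence was not its neighbor in the recording database.
def non_adjacent_segments(selected_ids):
    """Return indices of segments with a database break at a boundary."""
    flags = []
    for i, uid in enumerate(selected_ids):
        left_break = i > 0 and selected_ids[i - 1] != uid - 1
        right_break = (i < len(selected_ids) - 1
                       and selected_ids[i + 1] != uid + 1)
        if left_break or right_break:
            flags.append(i)
    return flags

# Units 10, 11, 12 come from one contiguous stretch; 57 and 90 do not,
# so fusion of multiple units would apply at segments 2, 3, and 4.
print(non_adjacent_segments([10, 11, 12, 57, 90]))  # -> [2, 3, 4]
```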
Journal of the Acoustical Society of America | 2002
Takehiko Kagoshima; Masami Akamine
Archive | 2006
Masatsune Tamura; Takehiko Kagoshima
Archive | 1999
Takashi Ida; Yoko Sanbonsugi; Takehiko Kagoshima; Hiroshi Takahashi