
Publication


Featured research published by Norio Higuchi.


Journal of the Acoustical Society of America | 1992

Speech synthesis system by rule using phonemes as synthesis units

Seiichi Yamamoto; Norio Higuchi; Toru Shimizu

A speech synthesizer synthesizes speech by actuating a voice source and a filter, which processes the output of the voice source according to speech parameters in each successive short time interval, following feature vectors that include formant frequencies, formant bandwidths, speech rate, and so on. Each feature vector, or speech parameter, is defined by two target points (r1, r2), a value at each target point, and a connection curve between the target points. The speech rate is defined by a speech rate curve that specifies elongation or shortening of the speech rate through a start point (d1) of elongation (or shortening), an end point (d2), and the elongation ratio between d1 and d2. The ratios between the relative time of each speech parameter and absolute time are calculated in advance from the speech rate table in each predetermined short interval.
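
The target-point scheme above can be sketched as follows. This is an illustrative sketch, not the patented implementation: the linear connection curve, the function names, and the unit default rate are assumptions made here.

```python
def interpolate_parameter(t, r1, v1, r2, v2):
    """Value of a speech parameter at relative time t, given target
    points (r1, v1) and (r2, v2) joined by a straight connection curve.
    Outside [r1, r2] the parameter holds its target value."""
    if t <= r1:
        return v1
    if t >= r2:
        return v2
    return v1 + (v2 - v1) * (t - r1) / (r2 - r1)

def speech_rate(t, d1, d2, ratio):
    """Speech-rate curve: the elongation (or shortening) ratio applies
    between start point d1 and end point d2, and is 1.0 elsewhere."""
    return ratio if d1 <= t <= d2 else 1.0
```

For example, a formant frequency with targets (0.0, 100.0) and (1.0, 200.0) takes the value 150.0 midway between the target points under the assumed linear connection.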


Journal of the Acoustical Society of America | 1992

Pitch frequency generation system in a speech synthesis system

Norio Higuchi; Seiichi Yamamoto; Toru Shimizu

A speech synthesis system comprises an input terminal for accepting text code, accent code, and phrase code; a converter for converting the text code into speech parameters for speech synthesis; an accent command generator, coupled to an output of the converter, for providing a train of accent commands; a phrase command generator, coupled to an output of the converter, for providing a train of phrase commands; an accent command buffer for storing the accent commands; a phrase command buffer for storing the phrase commands; an accent component calculator, operably coupled to the accent command buffer, for providing the pitch-frequency contour due to the accent component; a phrase component calculator, operably coupled to the phrase command buffer, for providing the pitch-frequency contour due to the phrase component; an adder for summing the output signals of the accent and phrase component calculators; a device, coupled to the output of the adder, for providing the fundamental frequency of voicing; a speech synthesizer coupled to the output of that device and to the output of the converter; and an output terminal, coupled to the output of the speech synthesizer, for providing synthesized speech to an external circuit.
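
The adder structure, summing an accent component and a phrase component to form the pitch contour, matches the well-known superpositional (Fujisaki-style) command-response formulation. The sketch below uses that formulation with illustrative response functions and constants; these are assumptions made here, not the patent's exact formulas.

```python
import math

def phrase_component(t, t0, alpha=3.0):
    """Response to a phrase command (impulse) issued at time t0."""
    x = t - t0
    return alpha * alpha * x * math.exp(-alpha * x) if x >= 0 else 0.0

def accent_component(t, t1, t2, beta=20.0):
    """Response to an accent command (step) between onset t1 and offset t2."""
    def step(x):
        if x < 0:
            return 0.0
        return min(1.0 - (1.0 + beta * x) * math.exp(-beta * x), 0.9)
    return step(t - t1) - step(t - t2)

def log_f0(t, fb, phrases, accents):
    """ln F0(t) = ln Fb + sum of phrase responses + sum of accent responses.
    phrases: list of (amplitude, t0); accents: list of (amplitude, t1, t2)."""
    v = math.log(fb)
    v += sum(ap * phrase_component(t, t0) for ap, t0 in phrases)
    v += sum(aa * accent_component(t, t1, t2) for aa, t1, t2 in accents)
    return v
```

With positive command amplitudes, the contour rises above the baseline frequency Fb wherever a phrase or accent response is active, which is the role of the adder in the system described above.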


Speech Communication | 1999

Robust speech detection method for telephone speech recognition system

Shingo Kuroiwa; Masaki Naito; Seiichi Yamamoto; Norio Higuchi

This paper describes speech endpoint detection methods for continuous speech recognition systems used over telephone networks. Speech input to these systems may be contaminated not only by various ambient noises but also by irrelevant sounds generated by users, such as coughs, tongue clicking, lip noises, and certain out-of-task utterances. Under these adverse conditions, robust speech endpoint detection remains an unsolved problem. We found, in fact, that speech endpoint detection errors occurred in over 10% of the inputs in field trials of a voice-activated telephone extension system. These errors were caused by (1) low SNR, (2) long pauses between phrases, and (3) irrelevant sounds prior to task sentences. To solve the first two problems, we propose a real-time speech ending point detection algorithm based on the implicit approach, which finds a sentence end by comparing the likelihood of a complete sentence hypothesis with that of other hypotheses. For the third problem, we propose a speech beginning point detection algorithm that rejects irrelevant sounds using likelihood-ratio and duration conditions. The effectiveness of these methods was evaluated under various conditions. We found that the ending point detection algorithm was not affected by long pauses and that the beginning point detection algorithm successfully rejected irrelevant sounds by using phone HMMs that fit the task. Furthermore, a garbage model of irrelevant sounds was also evaluated; we found that the garbage modeling technique and the proposed method compensated for each other's weak points, and that the best recognition accuracy was achieved by integrating the two.
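
The beginning-point rejection, which combines a likelihood-ratio condition with a duration condition, can be illustrated minimally as follows. The function name, threshold, and minimum-duration values are hypothetical, not taken from the paper.

```python
def accept_beginning(frame_loglik_speech, frame_loglik_garbage,
                     llr_threshold=0.0, min_frames=10):
    """Accept a candidate speech onset only if (a) the segment satisfies
    a duration condition and (b) the average per-frame log-likelihood
    ratio favors the speech model over the garbage/irrelevant-sound model."""
    n = len(frame_loglik_speech)
    if n < min_frames:          # duration condition: too short, reject
        return False
    llr = sum(s - g for s, g in zip(frame_loglik_speech, frame_loglik_garbage)) / n
    return llr > llr_threshold  # likelihood-ratio condition
```

A cough that scores better under the garbage model, or a segment shorter than the duration condition allows, is rejected instead of being taken as the utterance start.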


Archive | 1997

Automatic Extraction of F0 Control Rules Using Statistical Analysis

Toshio Hirai; Naoto Iwahashi; Norio Higuchi; Yoshinori Sagisaka

This chapter describes an automatic derivation of F0 control rules using a superpositional F0 control model and a tree-generation-type statistical model. In this derivation, the superpositional model was used to parameterize the F0 contours of speech data. The F0 control rules, which predict the parameters from linguistic information, were formed statistically by analyzing the relationship between the parameter values and the linguistic information. An experiment using 200 read Japanese sentences showed the effectiveness of this derivation algorithm. Throughout this experiment, two clear F0-control characteristics of the superpositional model were derived: (1) the dominant factor controlling the amplitude of the accent command was the accent type, and (2) for the phrase command, the dominant factor was the number of morae in the previous phrase.
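
The first finding, that accent type dominates accent-command amplitude, can be illustrated with the simplest possible statistical rule: a table of mean amplitudes per accent type, i.e. a one-level regression tree. The names and data below are hypothetical.

```python
def rule_table(samples):
    """Learn a one-question 'rule': the mean accent-command amplitude
    for each accent type, from (accent_type, amplitude) training pairs."""
    table = {}
    for accent_type, amplitude in samples:
        table.setdefault(accent_type, []).append(amplitude)
    # predict, for each accent type, the mean amplitude seen in training
    return {k: sum(v) / len(v) for k, v in table.items()}
```

The actual chapter grows a full tree over many linguistic factors; this sketch only shows the prediction structure at a single dominant factor.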


international conference on acoustics, speech, and signal processing | 1997

Fast and robust joint estimation of vocal tract and voice source parameters

Wen Ding; Nick Campbell; Norio Higuchi; Hideki Kasuya

A new pitch-synchronous method of joint estimation is described that estimates vocal tract and voice source parameters from speech signals based on an autoregressive exogenous-input (ARX) model. The method uses Kalman filtering to estimate the time-varying coefficients and simulated annealing to handle the non-linear optimization of the Rosenberg-Klatt parameters. A compact formulation is incorporated into the algorithm to reduce the computational cost. Further, an automatic model order selection method is proposed to determine the proper analysis pole order of the ARX model, based on the estimated formant bandwidths. The new method has been shown to be much faster than our previous method, and the order selection technique has been shown to be effective. Finally, an ATR two-channel speech database including varying sentence-level prominence patterns is used to verify the proposed method.
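
The Kalman-filtering idea for tracking time-varying coefficients can be sketched in the scalar case. The actual method estimates a vector of ARX coefficients pitch-synchronously; the random-walk state model and the noise variances here are illustrative assumptions.

```python
def kalman_coeff_update(theta, p, phi, y, q=1e-4, r=1.0):
    """One Kalman step tracking a single time-varying coefficient in
    y[n] = theta[n] * phi[n] + noise, with theta following a random walk.
    theta: current estimate; p: its variance; phi: regressor; y: sample."""
    p = p + q                              # predict: random-walk state model
    k = p * phi / (phi * phi * p + r)      # Kalman gain
    theta = theta + k * (y - theta * phi)  # correct with the innovation
    p = (1.0 - k * phi) * p                # update the estimate variance
    return theta, p
```

Running this update over successive samples drives the estimate toward the true coefficient while remaining able to follow slow changes, which is the property needed for time-varying vocal tract coefficients.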


Journal of the Acoustical Society of America | 2002

Mouth shape synthesizing

Masahide Kaneko; Atsushi Koike; Yoshinori Hatori; Seiichi Yamamoto; Norio Higuchi

A picture synthesizing apparatus and method for synthesizing a moving picture of a person's face with mouth-shape variations from a train of input characters. The method comprises: developing from the train of input characters a train of phonemes; using a speech synthesis technique to output, for each phoneme, a corresponding vocal sound feature, including the articulation mode and duration of the phoneme; determining for each phoneme a mouth-shape feature on the basis of the corresponding vocal sound feature, the mouth-shape feature including the degree of opening of the mouth, the degree of roundness of the lips, the height of the lower jaw in raised and lowered positions, and the degree to which the tongue is visible; determining, for each phoneme, values of mouth-shape parameters representing a concrete mouth shape on the basis of the mouth-shape feature; and controlling the values of the mouth-shape parameters for each frame of the moving picture in accordance with the duration of each phoneme, thereby synthesizing a moving picture whose mouth-shape variations match the speech output heard when the train of input characters is read aloud.
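
The duration-driven frame control can be sketched as follows. The frame rate, the parameter names, and holding parameter values constant within a phoneme (rather than interpolating between phonemes) are simplifying assumptions made here.

```python
def frames_for_phonemes(phonemes, frame_rate=30.0):
    """Expand (phoneme, duration_sec, mouth_params) triples into a
    per-frame sequence of mouth-shape parameter dicts for the moving
    picture, allotting each phoneme frames according to its duration."""
    frames = []
    for name, duration, params in phonemes:
        n = max(1, round(duration * frame_rate))
        frames.extend([dict(params, phoneme=name)] * n)
    return frames
```

At 30 frames per second, two phonemes of 0.1 s each yield three frames apiece, so the mouth shape on screen stays synchronized with the audible phoneme durations.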


international conference on spoken language processing | 1996

Training data selection for voice conversion using speaker selection and vector field smoothing

Makoto Hashimoto; Norio Higuchi

We have previously proposed a spectral mapping method (SSVFS) for voice conversion with a small amount of training data, using speaker selection and vector field smoothing techniques. It has already been shown, by both objective and subjective evaluations, that SSVFS is effective for spectral mapping and that it can operate with a very small amount of training data, as little as a single word (Hashimoto and Higuchi, 1995). Here we propose a criterion for selecting effective training data for SSVFS: the coverage of the parameter space with respect to the training procedure of SSVFS. This criterion is useful not only for selecting effective training samples, which is important for the efficient learning of spectral characteristics, but also for estimating the degree to which learning has been carried out. To evaluate the validity of the proposed criterion, we measured the correlation between spectral resemblance and coverage. The mean correlation coefficient over eight target speakers was -0.74 with the proposed criterion, compared with -0.59 when the training procedure was not taken into account. We conclude that the proposed criterion is useful for selecting effective training samples for SSVFS.
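
The reported figures are ordinary Pearson correlation coefficients between coverage and a spectral-resemblance measure. A minimal sketch of that computation, on hypothetical data:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```

A strongly negative coefficient, as reported above, means that higher coverage of the parameter space goes with lower spectral distance, i.e. better resemblance to the target speaker.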


international conference on acoustics, speech, and signal processing | 1995

Stochastic modeling of pause insertion using context-free grammar

Shigeru Fujio; Yoshinori Sagisaka; Norio Higuchi

We propose a model for predicting pause insertion using a stochastic context-free grammar (SCFG) applied to an input part-of-speech sequence. In this model, word attributes and stochastic phrasing information obtained by an SCFG trained on phrase dependency bracketings and on bracketings based on pause locations are used. Using the inside-outside algorithm, corpora with phrase dependency brackets are first used to train the SCFG from scratch. Next, this SCFG is re-trained using the same corpora with bracketings based on pause locations. Then the probability of each bracketing structure is computed using the SCFG, and these probabilities are used as parameters in the prediction of pause locations. Experiments were carried out to confirm the effectiveness of the stochastic model for the prediction of pause locations. In tests with open data, 85.2% of the pause boundaries and 90.9% of the no-pause boundaries were correctly predicted.
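
The two reported figures are per-class accuracies over pause and no-pause boundaries. A minimal sketch of that evaluation, with hypothetical labels:

```python
def boundary_accuracies(gold, pred):
    """Fraction of pause boundaries and of no-pause boundaries predicted
    correctly, as reported separately in the evaluation above. Labels are
    truthy for a pause at the boundary, falsy for no pause."""
    pause = [(g, p) for g, p in zip(gold, pred) if g]
    no_pause = [(g, p) for g, p in zip(gold, pred) if not g]
    def acc(pairs):
        return sum(g == p for g, p in pairs) / len(pairs)
    return acc(pause), acc(no_pause)
```

Reporting the two classes separately matters because no-pause boundaries greatly outnumber pause boundaries, so a single overall accuracy would hide missed pauses.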


Systems and Computers in Japan | 2002

Tree-based clustering for Gaussian mixture HMMs

Tsuneo Kato; Shingo Kuroiwa; Tohru Shimizu; Norio Higuchi

Tree-based clustering is an effective method for sharing the states of an HMM, in which clustering is applied to a set of context-dependent models with the phoneme context as the splitting condition. In past papers, the method has been restricted to the single-Gaussian HMM. The single-Gaussian HMM, however, is insufficient for representing the acoustic features, and an adequate topology (sharing of HMM states) will not necessarily be realized. Furthermore, in order to arrive at a state-sharing model with the desired number of mixtures, the process of doubling the number of mixtures and the embedded training must be iterated after the tree-based clustering, which increases the training time. Consequently, this paper proposes a method in which the tree-based clustering algorithm for the single-Gaussian HMM is extended to the clustering of Gaussian mixture HMMs. The proposed method reduces the training time to approximately one-third of that of the conventional method based on the single-Gaussian HMM. A recognition experiment using a phone typewriter and a continuous word recognition experiment demonstrate that the recognition rate is improved by one to two points.
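
The underlying single-Gaussian splitting criterion, which this paper extends to mixtures, greedily picks the phonetic-context question with the largest log-likelihood gain. Below is a one-dimensional sketch under that standard formulation; the question set and data are hypothetical.

```python
import math

def node_loglik(values):
    """Log-likelihood of modeling 1-D values with a single Gaussian
    fitted to them: -0.5 * n * (log(2*pi*var) + 1)."""
    n = len(values)
    m = sum(values) / n
    var = max(sum((v - m) ** 2 for v in values) / n, 1e-8)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1.0)

def best_split(items, questions):
    """Greedy tree-based clustering step: pick the context question
    (a predicate on the phoneme context) with the largest gain in
    log-likelihood over leaving the node unsplit.
    items: list of (context, value) pairs."""
    parent = node_loglik([v for _, v in items])
    best = None
    for q in questions:
        yes = [v for ctx, v in items if q(ctx)]
        no = [v for ctx, v in items if not q(ctx)]
        if not yes or not no:
            continue
        gain = node_loglik(yes) + node_loglik(no) - parent
        if best is None or gain > best[1]:
            best = (q, gain)
    return best
```

Splitting contexts whose values cluster tightly apart yields a large positive gain, so the tree separates acoustically distinct contexts first.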


Proceedings 1998 IEEE 4th Workshop Interactive Voice Technology for Telecommunications Applications. IVTTA '98 (Cat. No.98TH8376) | 1998

Area code, country code, and time difference information system and its field trial

Tsuneo Kato; Shingo Kuroiwa; Norio Higuchi

This paper describes an ASR system that responds to customer inquiries over a telephone network. Inquiries about area codes, country codes, and time differences are among the most frequent in international telecommunication information services. The system, called ACTIS (Area code, Country code and Time difference Information System), responds to these inquiries. ACTIS recognizes continuous Japanese speech with a vocabulary that includes the names of 299 countries and 721 major cities throughout the world. We report on several technical features of the system, including (1) new acoustic models using additional acoustic parameters, namely the acceleration of the MFCC parameters and a log energy term, to increase the recognition rate; (2) CMS (cepstral mean subtraction) with compensation by recognition results, for normalizing various channel characteristics while allowing real-time operation; and (3) robust speech detection and out-of-vocabulary word detection, for improving robustness to ambient noise, irrelevant sounds, and out-of-vocabulary words, along with their effects in computer simulations. We also report the results of a field trial at KDD and a subjective assessment by users.
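
CMS itself is simple: subtract the per-utterance mean of each cepstral dimension to normalize channel characteristics. A minimal sketch (the compensation by recognition results described above is not modeled here):

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-utterance mean of each cepstral dimension.
    frames: list of cepstral vectors (lists of floats) for one utterance.
    A constant channel transfer function adds a constant offset in the
    cepstral domain, so removing the mean normalizes the channel."""
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    return [[f[d] - means[d] for d in range(dims)] for f in frames]
```

For real-time operation the mean is not available until the utterance ends, which is why the paper pairs CMS with a compensation scheme based on recognition results.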

Collaboration

Norio Higuchi's top co-authors include:

Tohru Shimizu, National Institute of Information and Communications Technology
Hisashi Kawai, National Institute of Information and Communications Technology
Naoto Iwahashi, National Institute of Information and Communications Technology
Hitoshi Iida, Tokyo University of Technology