Sumio Ohno

Tokyo University of Technology

Publication

Featured research published by Sumio Ohno.

Speech Communication | 2005

Analysis and synthesis of fundamental frequency contours of Standard Chinese using the command–response model

Hiroya Fujisaki; Changfu Wang; Sumio Ohno; Wentao Gu

While the tonal characteristics of Chinese syllables have been qualitatively described in traditional phonetics, quantitative analysis requires a mathematical model. This paper presents such a model for the fundamental frequency contours of Standard Chinese, based on an extension of a model that has already been proved to be applicable to non-tone languages including Japanese, English, and others. The model allows one to interpret a given fundamental frequency contour in terms of tone commands and phrase commands, and to analyze various tonal phenomena in quantitative terms. The paper then describes the results of analysis of fundamental frequency contours of a number of utterances, revealing systematic relationships between the timing of the tone commands and the final of each syllable. The results are used to derive constraints for tone and phrase command generation in speech synthesis. The validity of the rules is confirmed by evaluating the naturalness of prosody of synthetic speech. The validity of introducing these constraints in speech synthesis of Standard Chinese is confirmed by perceptual tests on naturalness of prosody as well as on intelligibility of tones, using speech synthesized with and without these constraints.

International Conference on Spoken Language Processing | 1996

Prosodic parameterization of spoken Japanese based on a model of the generation process of F0 contours

Hiroya Fujisaki; Sumio Ohno

The process of generating an F0 contour from a small number of linguistically meaningful parameters has been modeled quite accurately, and the model has been used extensively in speech synthesis. The study deals with the inverse problem, i.e., that of extracting the model parameters from a given contour, which can only be solved by successive approximation. This paper presents a method for deriving a first-order approximation to a given F0 contour from the linguistic information of the utterance, and refining the approximation by analysis-by-synthesis. The validity of the method has been confirmed experimentally.

International Conference on Spoken Language Processing | 1996

On the levels of accentuation in spoken Japanese

Hiroya Fujisaki; Sumio Ohno; Osamu Tomita

Accentuation serves to express both the discrete information concerning the accent type of a prosodic word and the continuous information concerning its prominence. The paper examines the latter aspect of accentuation using recorded radio news read by announcers. The amplitude of the accent command was extracted from an F0 contour and used as an index for the level of accentuation. Statistical analysis of the accent command amplitude confirmed the difference between accented and unaccented types. Further analysis of the relationship between amplitudes of two adjoining accent commands also revealed a marked difference in characteristics of these two types.

International Conference on Acoustics, Speech, and Signal Processing | 1993

Utterance normalization using vowel features in a spoken word recognition system for multiple speakers

Sumio Ohno; Keikichi Hirose; H. Fujisaki

The authors propose a novel method of normalization based on linear transformation of acoustic features of input speech using only one isolated utterance each of the five vowels of Japanese by each individual speaker. Experiments on isolated word recognition combining the proposed normalization method and multiple-template DP matching showed a marked improvement in the recognition rate, especially for smaller numbers of templates per word. The proposed method gives consistently higher word recognition scores than the four-dimensional representation on the Karhunen-Loeve transformation, and also gives higher scores than the original 16-dimensional representation of filter-bank outputs, especially when the number of templates is small. Together with the fact that this method reduces the dimension of the feature vector by a factor of four, the results demonstrate the validity of the proposed method.

International Conference on Signal Processing | 1996

Automatic parameter extraction of fundamental frequency contours of speech based on a generative model

Hiroya Fujisaki; Sumio Ohno; Osamu Tomita

The process of generating an F0 contour from a small number of linguistically meaningful parameters has been modeled quite accurately, and the model has been used extensively in speech synthesis. The paper deals with the inverse problem, i.e., that of extracting the model parameters from a given contour, which can only be solved by successive approximation. It presents a method for deriving a first-order approximation to a given F0 contour from the linguistic information of the utterance, and refining the approximation by analysis-by-synthesis. The validity of the method has been confirmed experimentally.

International Conference on Acoustics, Speech, and Signal Processing | 1990

Spoken word recognition for multiple speakers based on path-limited DP matching and a method for speaker normalization

Keikichi Hirose; Hiroya Fujisaki; Sumio Ohno; H. Mio

A method of path-limitation is proposed for the efficient DP matching of spoken words. Results of a preliminary recognition experiment show that this method can reduce the recognition errors as well. A method for effective speaker normalization is proposed. This method is based on the transformation of the relative positions of the five vowels of a given speaker in the parameter space. As a result of transformation, the dimension of the parameter space is reduced by a factor of four, yielding a great advantage in the computation time for matching. A study on the realization of a spoken-word recognition device using systolic array processors is described. A cyclic configuration that permits real-time matching of 600 templates with a single LSI chip is proposed.

9th International Conference on Speech Prosody 2018 | 2018

Consistency of base frequency labelling for the F0 contour generation model using expressive emotional speech corpora

Yoshiko Arimoto; Yasuo Horiuchi; Sumio Ohno

To investigate the consistency of base frequency (Fb) labelling of the F0 contour generation model for expressive and/or authentic emotional speech, an Fb labelling experiment was conducted with three trained labellers using a parallel corpus of emotional speech, the Online-gaming voice chat corpus with emotional labelling (OGVC). Twenty-four utterances from spontaneous dialog speech and emotion-acted speech in the OGVC were labelled with the Fb, phrase command, and accent command by the three labellers. A repeated-measures analysis of variance was performed on the Fb value of each utterance, with corpus type, gender, speaker, emotion, and labeller as factors. The results show significant main effects of gender, speaker, and emotion, and a significant interaction between speaker and emotion. The results also indicate that the value of Fb varied when different emotions were expressed, even when uttered by the same speaker. Moreover, close inspection of the Fb of each utterance suggests that the Fb also varied when the linguistic content of the utterances differed, even if the same emotion was expressed in those utterances.

Journal of the Acoustical Society of America | 2016

A framework for systematic studies of attitudes in speech

Hiroya Fujisaki; Sumio Ohno; Wentao Gu

We present here a framework for a systematic study of attitudes expressed by speech. Although the term “attitude” is commonly used to refer to phenomena at several different levels, we shall assign separate terms to each of these levels for the sake of clarity. While the term “attitude” can refer both to a person and an object (either concrete or abstract), here we shall be concerned only with personal relationships. Stances: Those (attitudes) related to official relationships, often influenced by social factors together with long-term personal factors (such as commanding, subordinate, etc.). Attitudes (in the narrow sense): Those referring to private but mid- to long-term relationships, often involving the whole personality (such as kind, friendly, remote, etc.). Manners: Those (attitudes) referring to short-term but somewhat consistent individual behaviors/acts (such as polite, rude, abrupt, etc.). For example, when a teacher tries to call the attention of a young child to an imminent danger, his/her utterance ...

International Conference on Interactive Collaborative Learning | 2012

Construction of a personally adapted e-learning system using collective intelligence

Jinhua She; Xiaoxia Zhang; Shumei Chen; Hiroyuki Kameda; Sumio Ohno

To meet the strong demand for training in technical Chinese, we have built an e-learning system for a technical Chinese course. In this study, we devised a method of adapting the course material to a learner's ability. For each new learner, the system uses collective intelligence to construct a preliminary vocabulary test. Then, it uses the results to construct a personalized course by selecting from the database the course material that is most suitable to the learner.

Journal of the Acoustical Society of America | 2008

Study on voice quality parameters for anger degree estimation

Yoshiko Arimoto; Sumio Ohno; Hitoshi Iida

With the great advances in automatic speech recognition (ASR), ASR systems and voice command systems are expected to be more sensitive to users' intentions or emotions. These systems currently process linguistic information, but do not process the nonlinguistic or paralinguistic information that users express during dialogs. For that reason, computers can obtain less information about a user through a dialog than human listeners can. If computers could recognize users' emotions conveyed by acoustic information, more appropriate responses could be made toward users. To realize emotion recognition, we have continued our study on anger degree estimation using both prosodic and segmental features, with anger utterances recorded during two kinds of pseudo‐dialogs. This report focuses only on segmental features related to voice quality and examines their capability to estimate anger degree. The first cepstral coefficient of anger utterances has been analyzed to obtain acoustic parameters r...