Shizuka Nakamura
Kyoto University
Publication
Featured research published by Shizuka Nakamura.
International Conference on Acoustics, Speech, and Signal Processing | 2009
Shizuka Nakamura; Shigeki Matsuda; Hiroaki Kato; Minoru Tsuzaki; Yoshinori Sagisaka
Automatic evaluation of English timing control proficiency is carried out by comparing segmental duration differences between learners and reference native speakers. To obtain an objective measure matched to human subjective evaluation, we introduced a measure reflecting perceptual characteristics. The proposed measure evaluates duration differences weighted by the loudness of the corresponding speech segment and the differences or jumps in loudness from the two adjacent speech segments. Experiments showed that estimated scores using the new perception-based measure provided a correlation coefficient of 0.72 with subjective evaluation scores given by native English speakers on the basis of naturalness in timing control. This correlation turned out to be significantly higher than that of 0.54 obtained when using a simple duration difference measure.
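As an illustration of the measure described above, here is a minimal sketch of a loudness-weighted duration-difference score. The segment data, exact weighting scheme, and function name are assumptions for illustration, not the authors' precise formulation.

```python
import numpy as np

def weighted_duration_score(learner_dur, native_dur, loudness):
    """Duration-difference measure weighted by segment loudness and by
    loudness jumps from the two adjacent segments (illustrative)."""
    learner_dur = np.asarray(learner_dur, dtype=float)
    native_dur = np.asarray(native_dur, dtype=float)
    loudness = np.asarray(loudness, dtype=float)

    # Absolute duration difference per segment.
    diff = np.abs(learner_dur - native_dur)

    # Loudness jumps to the left and right neighbours (edges padded).
    padded = np.pad(loudness, 1, mode="edge")
    jump = np.abs(padded[2:] - loudness) + np.abs(loudness - padded[:-2])

    # Weight each difference by its loudness and neighbouring jumps.
    weights = loudness + jump
    return float(np.sum(weights * diff) / np.sum(weights))

# Example: three segments (durations in ms, loudness in sones).
print(weighted_duration_score([120, 80, 150], [100, 90, 140], [12.0, 4.5, 9.0]))
```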
Robot and Human Interactive Communication | 2004
Tsuneo Nitta; Shigeki Sagayama; Yoichi Yamashita; Tatsuya Kawahara; Shigeo Morishima; Shizuka Nakamura; Atsushi Yamada; Koji Ito; M. Kai; A. Li; Masato Mimura; Keikichi Hirose; Takao Kobayashi; Keiichi Tokuda; Nobuaki Minematsu; Yasuharu Den; Takehito Utsuro; Tatsuo Yotsukura; Hiroshi Shimodaira; M. Araki; Takuya Nishimoto; N. Kawaguchi; H. Banno; Kouichi Katsurada
The Interactive Speech Technology Consortium (ISTC), established in November 2003 after three years of activity in the Galatea project supported by the Information-technology Promotion Agency (IPA) of Japan, aims at supporting the development of open-source, free software for multi-modal interaction (MMI) with human-like agents. The software, named Galatea-toolkit, was developed by 24 researchers from 16 research institutes in Japan. It includes a Japanese speech recognition engine, a Japanese speech synthesis engine, and a facial image synthesis engine used for developing an anthropomorphic agent, as well as a dialogue manager that integrates multiple modalities, interprets them, and decides on an action, differentiating it across the multiple media of voice and facial expression. ISTC provides members with a one-day technical seminar and a one-week training course to master Galatea-toolkit, as well as an annual software set (CD-ROM).
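Galatea-toolkit itself is a large system; the following toy sketch, with entirely hypothetical names, only illustrates the idea of a dialogue manager that decides one action and differentiates it across the media of voice and facial expression.

```python
from dataclasses import dataclass

@dataclass
class Action:
    """A dialogue act rendered across several output media."""
    text: str        # would be sent to the speech-synthesis engine
    expression: str  # would be sent to the facial-image-synthesis engine

class DialogueManager:
    """Toy manager: interpret recognized speech, decide one action,
    and split it into voice and facial-expression commands."""

    def decide(self, recognized: str) -> Action:
        if "hello" in recognized.lower():
            return Action(text="Hello! How can I help you?", expression="smile")
        return Action(text="Could you say that again?", expression="neutral")

    def render(self, action: Action) -> None:
        # In Galatea these would be calls to the respective engines;
        # here we just print the per-medium commands.
        print(f"[voice] {action.text}")
        print(f"[face]  {action.expression}")

dm = DialogueManager()
dm.render(dm.decide("Hello there"))
```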
Proceedings of the Workshop on Multimodal Analyses enabling Artificial Agents in Human-Machine Interaction | 2016
Koji Inoue; Divesh Lala; Shizuka Nakamura; Katsuya Takanashi; Tatsuya Kawahara
We address the annotation of engagement in the context of human-machine interaction. Engagement represents how interested a user is in the current interaction and how willing they are to continue it. The conversational data used in the annotation work is a human-robot interaction corpus in which a human subject talks with the android ERICA, which is remotely operated by another human subject. The annotation work was done by multiple third-party annotators, whose task was to detect the time points at which the level of engagement becomes high. The annotation results indicate agreement among the annotators, although the number of annotated points differs among them. It was also found that the level of engagement is related to turn-taking behaviors. Furthermore, we conducted interviews with the annotators to identify the behaviors used to signal a high level of engagement. The results suggest that laughing, backchannels, and nodding are related to the level of engagement.
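A minimal sketch of one way such time-point annotations could be compared across annotators, assuming a simple greedy matching within a tolerance window; the tolerance value and matching scheme are illustrative, not the paper's protocol.

```python
def timepoint_agreement(points_a, points_b, tolerance=1.0):
    """Greedy one-to-one matching of annotated time points (seconds).
    Returns the fraction of points that found a partner within
    `tolerance`, relative to the larger annotation set."""
    remaining = sorted(points_b)
    matched = 0
    for t in sorted(points_a):
        for i, u in enumerate(remaining):
            if abs(t - u) <= tolerance:
                matched += 1
                del remaining[i]
                break
    return matched / max(len(points_a), len(points_b))

# Two annotators marked engagement onsets at slightly different times.
print(timepoint_agreement([3.2, 10.5, 42.0], [3.5, 11.2, 41.5, 55.0]))  # 0.75
```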
IWSDS | 2019
Pierrick Milhorat; Divesh Lala; Koji Inoue; Tianyu Zhao; Masanari Ishida; Katsuya Takanashi; Shizuka Nakamura; Tatsuya Kawahara
We present a dialogue system for a conversational robot, Erica. Our goal is for Erica to engage in more human-like conversation, rather than being a simple question-answering robot. Our dialogue manager integrates question-answering with a statement response component which generates dialogue by asking about focused words detected in the user’s utterance, and a proactive initiator which generates dialogue based on events detected by Erica. We evaluate the statement response component and find that it produces coherent responses to a majority of user utterances taken from a human-machine dialogue corpus. An initial study with real users also shows that it reduces the number of fallback utterances by half. Our system is beneficial for producing mixed-initiative conversation.
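A minimal sketch of the kind of dispatch logic described above: question-answering is tried first, then the statement response component keyed on focus words, then the proactive initiator, with a generic fallback last. All names, rules, and data here are hypothetical stand-ins for the actual components.

```python
import random

def answer_question(utterance):
    """Placeholder QA component (a real system would query a knowledge base)."""
    return "It opens at nine." if "when" in utterance else None

def statement_response(utterance, focus_words):
    """Ask about a focus word detected in the user's utterance."""
    for word in focus_words:
        if word in utterance:
            return f"Oh, {word}? Tell me more about that."
    return None

def proactive_initiator(events):
    """Start a topic from an event the robot has detected."""
    return f"By the way, I noticed {events[0]}." if events else None

def respond(utterance, focus_words=("kyoto", "music"), events=()):
    # Try each component in priority order; fall back only as a last resort.
    for candidate in (answer_question(utterance),
                      statement_response(utterance, focus_words),
                      proactive_initiator(list(events))):
        if candidate:
            return candidate
    return random.choice(["I see.", "Uh-huh."])  # fallback backchannel

print(respond("i went to kyoto last week"))
```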
Journal of the Acoustical Society of America | 2016
Shizuka Nakamura
To verify the possibility of regular periodicity in English rhythm, each sentence was divided into rhythm segments and the properties of their durations were analyzed. A rhythm segment (RhySeg) was defined as a segment comprising one syllable carrying a primary or secondary stress together with the adjacent unstressed syllable(s) attached to it. The following locations of the stressed syllable within a RhySeg were compared: forward, semi-forward, middle, semi-back, and back. To reflect perceptual effects in the RhySeg constitution, factors that equally compress all of the unstressed syllables were compared, from 0.1 to 1.0 at intervals of 0.1. To find the RhySeg constitution showing regular periodicity, the criterion applied was not only the degree of concentration of the duration distribution, but also the closeness of the duration of a RhySeg with a secondary stress to 1/2 of that of a RhySeg with a primary stress, a relationship whose involvement in regular periodicity was indicated in previous studies. Comparative experiments showed the best results when the stressed syl...
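A minimal sketch of the compression-factor comparison, assuming invented syllable durations: unstressed syllables are compressed by a common factor before the RhySeg duration is computed, and the spread of the resulting durations is compared across factors (here via the coefficient of variation, one plausible concentration measure).

```python
import statistics

def rhyseg_duration(stressed_ms, unstressed_ms, factor):
    """RhySeg duration with all unstressed syllables compressed equally."""
    return stressed_ms + factor * sum(unstressed_ms)

# Each RhySeg: (stressed syllable duration, [unstressed durations]) in ms.
segments = [(180, [70, 60]), (200, [50]), (170, [80, 40]), (190, [65])]

# Compare factors 0.1 .. 1.0 by the relative spread of RhySeg durations.
for f in [i / 10 for i in range(1, 11)]:
    durs = [rhyseg_duration(s, u, f) for s, u in segments]
    cv = statistics.stdev(durs) / statistics.mean(durs)  # coefficient of variation
    print(f"factor {f:.1f}: CV = {cv:.3f}")
```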
Journal of the Acoustical Society of America | 2013
Shizuka Nakamura
In this author’s previous study [Nakamura, J. Acoust. Soc. Am. 131(4, Pt. 2), 3347 (2012)] on the acoustical analysis of the duration structure of rhythm in English speech observed in short sentences uttered by native speakers, the durations of the following rhythm unit showed the smallest variance among native speakers: 1/4 of the preceding unstressed syllable(s) + stressed syllable + 3/4 of the succeeding unstressed syllable(s). The durations of the rhythm unit with a secondary stress were concentrated at half of those of the unit with a primary stress. Therefore, the rhythm can be described by a series of rhythm units with primary and secondary stresses, where the latter unit is half the duration of the former. In this study, the relationship between the rhythm units with primary and secondary stresses was investigated from the viewpoints of the position of syllables with primary and secondary stresses in a sentence, correlation among rhythm units in a sentence, and individual differences among native speak...
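The rhythm unit reported above lends itself to a direct computation; the following sketch assumes hypothetical syllable durations in milliseconds.

```python
def rhythm_unit_duration(preceding_unstressed, stressed, succeeding_unstressed):
    """Rhythm unit from the previous study: 1/4 of the preceding
    unstressed syllable(s) + stressed syllable + 3/4 of the
    succeeding unstressed syllable(s), all in milliseconds."""
    return (0.25 * sum(preceding_unstressed)
            + stressed
            + 0.75 * sum(succeeding_unstressed))

# Hypothetical durations: "a BIG dog" -> unit around the stressed "BIG".
print(rhythm_unit_duration([60], 210, [90]))  # 60/4 + 210 + 90*3/4 = 292.5
```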
Journal of the Acoustical Society of America | 2012
Shizuka Nakamura
With the aim of describing the periodicity of English rhythm, this study investigated the properties of the duration of rhythm segments, each comprising a set of stressed and unstressed syllables. The data were speech recordings of short sentences, each including three to five stressed syllables, spoken by a total of 20 native speakers. Five rhythm segment structures were identified based on the location of the stressed syllable and adopted for comparative judgment: a stressed syllable backward, semibackward, middle, semiforward, or forward in a segment. Measurements based on detailed acoustical analysis showed that a reasonable rhythm segment structure lies somewhere around FORWARD and SEMIFORWARD. In these structures, the periodicity of the rhythm is clearly shown in the duration of the rhythm segment. Furthermore, it was revealed that the duration of a rhythm segment with a secondary stress was distributed around half of that of a rhythm segment with a primary stress. That is, the rhythm can be described by a series of basic periods and their half periods.
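A minimal sketch of checking the reported half-period relation on invented data: ratios of secondary-stress segment durations to the mean primary-stress segment duration should cluster near 0.5.

```python
import statistics

primary = [410, 395, 420, 400]    # durations (ms) of primary-stress segments
secondary = [210, 195, 205, 190]  # durations (ms) of secondary-stress segments

mean_primary = statistics.mean(primary)
ratios = [d / mean_primary for d in secondary]
print(f"mean ratio = {statistics.mean(ratios):.2f}")  # near 0.5 under the claim
```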
Journal of the Acoustical Society of America | 2009
Hiroaki Kato; Shizuka Nakamura; Shigeki Matsuda; Minoru Tsuzaki; Yoshinori Sagisaka
An empirical study is carried out to develop a computer-based methodology for evaluating a speaker’s accent in a second language as an alternative to a native-speaker tutor. Its primary target is disfluency in the temporal aspects of an English learner’s speech. Conventional approaches commonly use measures based solely on the acoustic features of the given speech, such as segmental duration differences between learners and reference native speakers. However, our auditory system, unlike a microphone, is not transparent: it does not send incoming acoustic signals into the brain without any treatment. Therefore, this study uses auditory perceptual characteristics as weighting factors on the conventional measure. These are the loudness of the corresponding speech segment and the magnitude of the jump in loudness between this target segment and each of the two adjacent speech segments. These factors were originally found through general psychoacoustical procedures [H. Kato et al., JASA, 101, 2311–2322 (1997);...
2009 Oriental COCOSDA International Conference on Speech Database and Assessments | 2009
Yoshinori Sagisaka; Hiroaki Kato; Minoru Tsuzaki; Shizuka Nakamura; Chatchawarn Hansakunbuntheung
In this paper, we introduce the Japanese segmental duration characteristics and computational modeling that we have been studying for around three decades in speech synthesis. A series of experimental results is also shown on loudness dependence in duration perception. This computational duration modeling and these perceptual studies on the sensitivity of duration errors to loudness give some insights for computational modeling of human spoken-language capability. As a first trial to figure out how these findings could be efficiently employed in other fields such as language learning, we introduce our current efforts on the objective evaluation of second-language speaking skill and the research consortium AESOP (Asian English Speech cOrpus Project), in which researchers in Asian countries have started to work together.
Journal of the Acoustical Society of America | 2006
Hajime Tsubaki; Shizuka Nakamura; Yoshinori Sagisaka
This paper introduces analysis results on the temporal characteristics of English uttered by Japanese subjects, aiming at automatic evaluation of L2 prosody control. As a preliminary study, 11 sentences uttered by about 200 Japanese children were compared with native speakers’ speech. The correlation between subjective scores on temporal naturalness and differences in segmental durations was measured for different speech units at every position, sentence by sentence. A wide range of correlations was observed: from −0.70 to 0.23 at unstressed syllables, from −0.33 to 0.19 at stressed syllables, and from −0.70 to −0.04 at function words. The strongest negative correlation, −0.70 at unstressed syllables and at function words, reflects the L1 mora-timing characteristics of Japanese. Further analysis results will be given at the conference with a large additional L2 speech database currently under construction. [Work supported in part by the Waseda Univ. RISE research project of ‘‘Analy...
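A minimal sketch of the per-unit correlation analysis on placeholder numbers: subjective naturalness scores are correlated with duration differences separately for each unit category, using Pearson's r. None of the values below are the study's data.

```python
import statistics

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Per-speaker subjective naturalness scores and absolute duration
# differences (ms) from native speech, per unit category (placeholders).
scores = [4.2, 3.1, 2.5, 3.8, 1.9]
diffs = {
    "unstressed syllables": [20, 45, 60, 30, 80],
    "stressed syllables":   [15, 25, 30, 20, 35],
    "function words":       [10, 40, 55, 25, 70],
}
for unit, d in diffs.items():
    print(f"{unit}: r = {pearson_r(scores, d):+.2f}")
```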
Collaboration
Shizuka Nakamura's collaborating institutions include the National Institute of Information and Communications Technology and the Thailand National Science and Technology Development Agency.