Network


Latest external collaboration at the country level.

Hotspot


Dive into the research topics where Shiro Kumano is active.

Publication


Featured research published by Shiro Kumano.


Asian Conference on Computer Vision | 2007

Pose-invariant facial expression recognition using variable-intensity templates

Shiro Kumano; Kazuhiro Otsuka; Junji Yamato; Eisaku Maeda; Yoichi Sato

In this paper, we propose a method for pose-invariant facial expression recognition from monocular video sequences. The advantage of our method is that, unlike existing methods, it uses a very simple model, called the variable-intensity template, to describe different facial expressions, making it possible to prepare a model for each person with very little time and effort. Variable-intensity templates describe how the intensity of multiple points, defined in the vicinity of facial parts, varies across facial expressions. By using this model in the framework of a particle filter, our method can estimate facial poses and expressions simultaneously. Experiments demonstrate the effectiveness of our method: a recognition rate of over 90% was achieved for horizontal facial orientations within ±40 degrees of the frontal view.
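
As a rough illustration of the joint estimation idea, the sketch below runs one particle-filter step in which each particle hypothesizes a head pose and an expression, and is weighted by how well a Gaussian variable-intensity template explains observed point intensities. The label set, point count, and the observation step are simplified placeholders, not the authors' implementation.

    # Minimal sketch, assuming a Gaussian variable-intensity template per expression.
    import numpy as np

    N_PARTICLES = 500
    N_POINTS = 50                                   # intensity points near facial parts (assumed)
    EXPRESSIONS = ["neutral", "happy", "surprised", "angry"]   # assumed label set
    rng = np.random.default_rng(0)

    def log_likelihood(observed, mean, var):
        """Gaussian log-likelihood of point intensities under a variable-intensity template."""
        return -0.5 * np.sum((observed - mean) ** 2 / var, axis=-1)

    def step(poses, exprs, observed, templates):
        """One filtering step: weight particles by template fit, then resample and diffuse."""
        logw = log_likelihood(observed, templates["mean"][exprs], templates["var"][exprs])
        w = np.exp(logw - logw.max())               # stabilise before normalising
        w /= w.sum()
        idx = rng.choice(N_PARTICLES, size=N_PARTICLES, p=w)
        poses = poses[idx] + rng.normal(0, 1.0, size=poses.shape)   # diffuse head pose
        exprs = exprs[idx].copy()
        flip = rng.random(N_PARTICLES) < 0.05       # occasionally switch expression hypothesis
        exprs[flip] = rng.integers(0, len(EXPRESSIONS), size=int(flip.sum()))
        return poses, exprs

    # Toy run with random templates and a random observation; a real tracker would sample
    # intensities at the template points projected under each particle's pose.
    templates = {"mean": rng.uniform(80, 170, size=(len(EXPRESSIONS), N_POINTS)),
                 "var": np.full((len(EXPRESSIONS), N_POINTS), 100.0)}
    poses = rng.uniform(-40, 40, size=(N_PARTICLES, 3))             # yaw/pitch/roll in degrees
    exprs = rng.integers(0, len(EXPRESSIONS), size=N_PARTICLES)
    poses, exprs = step(poses, exprs, rng.uniform(80, 170, size=N_POINTS), templates)
    print(EXPRESSIONS[np.bincount(exprs).argmax()])                 # most supported expression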


International Conference on Multimodal Interfaces | 2014

Analysis of Respiration for Prediction of "Who Will Be Next Speaker and When?" in Multi-Party Meetings

Ryo Ishii; Kazuhiro Otsuka; Shiro Kumano; Junji Yamato

To build a model for predicting the next speaker and the start time of the next utterance in multi-party meetings, we performed a fundamental study of how respiration could be effective for the prediction model. The analysis reveals that a speaker inhales more rapidly right after the end of a unit of utterance in turn-keeping, and that the next speaker takes a bigger breath before speaking in turn-changing than listeners who will not become the next speaker. Based on these results, we constructed prediction models to evaluate how effective the parameters are. The evaluation suggests that the speaker's inhalation right after a unit of utterance, such as the start time from the end of the unit of utterance and the slope and duration of the inhalation phase, is effective for predicting whether turn-keeping or turn-changing happens, about 350 ms before the start time of the next utterance on average. It also suggests that listeners' inhalation before the next utterance, such as the maximal inspiration and amplitude of the inhalation phase, is effective for predicting the next speaker in turn-changing, about 900 ms before the start time of the next utterance on average.
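
The sketch below shows one hypothetical way to compute the respiration features named above (onset delay, slope, duration, amplitude, maximal inspiration) from a respiration trace after an utterance ends; the window length and detection rule are assumptions, not the paper's procedure.

    # Rough sketch of inhalation-phase features; constants are illustrative only.
    import numpy as np

    def inhalation_features(resp, utterance_end, fs=50):
        """Features of the first inhalation phase after an utterance ends.

        resp: 1-D respiration amplitude signal; utterance_end: sample index; fs: sampling rate (Hz).
        """
        seg = resp[utterance_end:utterance_end + 3 * fs]      # look 3 s ahead (assumed window)
        d = np.diff(seg)
        rising = np.flatnonzero(d > 0)
        if rising.size == 0:
            return None                                       # no inhalation found in the window
        start = rising[0]
        later = np.flatnonzero(d[start:] <= 0)                # inhalation ends when the rise stops
        end = start + (later[0] if later.size else d.size - start)
        amp = float(seg[end] - seg[start])
        dur = (end - start) / fs
        return {"onset_delay_s": start / fs,                  # start time from end of utterance
                "duration_s": dur,
                "slope": amp / max(dur, 1e-6),
                "amplitude": amp,
                "max_inspiration": float(seg[end])}           # maximal inspiration level

    fs = 50
    t = np.arange(0, 5, 1 / fs)
    resp = np.sin(2 * np.pi * 0.3 * t)                        # toy respiration trace (~0.3 Hz)
    print(inhalation_features(resp, utterance_end=60, fs=fs))
    # Such per-boundary feature vectors would feed a turn-keeping/turn-changing classifier
    # and a next-speaker classifier, as described in the abstract.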


Mobile and Ubiquitous Multimedia | 2013

Inferring mood in ubiquitous conversational video

Dairazalia Sanchez-Cortes; Joan-Isaac Biel; Shiro Kumano; Junji Yamato; Kazuhiro Otsuka; Daniel Gatica-Perez

Conversational social video is becoming a worldwide trend. Video communication allows more natural interaction when sharing personal news, ideas, and opinions, because it transmits both verbal content and nonverbal behavior. However, the automatic analysis of natural mood is challenging, since mood is displayed in parallel via voice, face, and body. This paper presents an automatic approach to infer 11 natural mood categories in conversational social video using single-channel and multimodal nonverbal cues extracted from video blogs (vlogs) on YouTube. The mood labels used in our work were collected via crowdsourcing. Our approach is promising for several of the studied mood categories. Our study demonstrates that although multimodal features perform better than single-channel features, not all the available channels are always needed to accurately discriminate mood in videos.
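
The following is an illustrative sketch, on synthetic data, of comparing single-channel and early-fusion multimodal classification for 11 mood categories; the feature sets, dimensions, and classifier are assumptions, not the pipeline used in the paper.

    # Early fusion = concatenating channel features before classification (toy data).
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    n_vlogs = 200
    audio_feats = rng.normal(size=(n_vlogs, 16))    # e.g. prosodic statistics (assumed)
    visual_feats = rng.normal(size=(n_vlogs, 24))   # e.g. facial-activity statistics (assumed)
    labels = rng.integers(0, 11, size=n_vlogs)      # 11 crowdsourced mood categories

    for name, X in [("audio only", audio_feats),
                    ("visual only", visual_feats),
                    ("audio+visual", np.hstack([audio_feats, visual_feats]))]:
        acc = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=0),
                              X, labels, cv=3).mean()
        print(f"{name}: mean CV accuracy = {acc:.2f}")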


International Conference on Acoustics, Speech, and Signal Processing | 2014

Analysis and modeling of next speaking start timing based on gaze behavior in multi-party meetings

Ryo Ishii; Kazuhiro Otsuka; Shiro Kumano; Junji Yamato

To realize a conversational interface in which an agent system can smoothly communicate with multiple persons, it is imperative to know how the start timing of speaking is decided. In this research, we demonstrate a relationship between gaze transition patterns and the start timing of the next utterance relative to the end of the preceding one in multi-party meetings. We then construct a prediction model for the start timing using gaze transition patterns near the end of an utterance. An analysis of data collected from natural multi-party meetings reveals a strong relationship between the gaze transition patterns of the speaker, next speaker, and listener and the start timing of the next speaker. On the basis of these results, we used the gaze transition patterns of the speaker, next speaker, and listener, together with mutual gaze, as variables and devised several prediction models. A model using all features performed best and was able to predict the start timing well.
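
As a minimal illustration of how such categorical gaze features could drive a timing regressor, the sketch below one-hot encodes gaze transition patterns and a mutual-gaze flag and fits a ridge regression; the pattern names, numbers, and model choice are invented for illustration, not the authors' coding scheme.

    # Toy encoding of gaze transition patterns per role at an utterance end.
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import Ridge

    samples = [  # one sample per end of utterance (toy data)
        {"speaker": "listener->next_speaker", "next_speaker": "speaker->speaker",
         "listener": "speaker->aversion", "mutual_gaze": 1},
        {"speaker": "listener->aversion", "next_speaker": "aversion->speaker",
         "listener": "speaker->speaker", "mutual_gaze": 0},
    ]
    start_offsets_ms = [420.0, 780.0]   # next-utterance start time after utterance end (toy values)

    X = DictVectorizer(sparse=False).fit_transform(samples)   # one-hot encode the patterns
    model = Ridge(alpha=1.0).fit(X, start_offsets_ms)
    print(model.predict(X))                                   # predicted start timings (ms)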


KSII Transactions on Internet and Information Systems | 2016

Prediction of Who Will Be the Next Speaker and When Using Gaze Behavior in Multiparty Meetings

Ryo Ishii; Kazuhiro Otsuka; Shiro Kumano; Junji Yamato

In multiparty meetings, participants need to predict the end of the speaker’s utterance and who will start speaking next, as well as consider a strategy for good timing to speak next. Gaze behavior plays an important role in smooth turn-changing. This article proposes a prediction model with three processing steps to predict (I) whether turn-changing or turn-keeping will occur, (II) who will be the next speaker in turn-changing, and (III) the timing of the start of the next speaker’s utterance. For the feature values of the model, we focused on gaze transition patterns and the timing structure of eye contact between a speaker and a listener near the end of the speaker’s utterance. Gaze transition patterns provide information about the order in which gaze behavior changes. The timing structure of eye contact is defined as who looks at whom and who looks away first, the speaker or the listener, when eye contact between them occurs. We collected corpus data of multiparty meetings and used it to demonstrate relationships between gaze transition patterns, the timing structure of eye contact, and situations (I), (II), and (III). The results of our analyses indicate that the gaze transition patterns of the speaker and listener and the timing structure of eye contact have a strong association with turn-changing, the next speaker in turn-changing, and the start time of the next utterance. On the basis of these results, we constructed prediction models using the gaze transition patterns and timing structure. The gaze transition patterns were found to be useful in predicting turn-changing, the next speaker in turn-changing, and the start time of the next utterance. Contrary to expectations, we did not find the timing structure useful for predicting the next speaker and the start time. This study opens up new possibilities for predicting the next speaker and the timing of the next utterance using gaze transition patterns in multiparty meetings.
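
Purely as a structural sketch of the three-step scheme (I, then II, then III, with II and III only reached in turn-changing), the code below chains three stand-in predictors; the rules and feature names are invented, not the article's models.

    # Structural sketch only: each step would be a trained model in the actual article.
    class TurnPredictor:
        def __init__(self, step1, step2, step3):
            self.step1 = step1   # (I)   turn-changing vs. turn-keeping
            self.step2 = step2   # (II)  next speaker, only when turn-changing
            self.step3 = step3   # (III) start time of the next utterance (ms after utterance end)

        def predict(self, gaze_features):
            if not self.step1(gaze_features):
                return {"turn": "keeping"}
            return {"turn": "changing",
                    "next_speaker": self.step2(gaze_features),
                    "start_ms": self.step3(gaze_features)}

    predictor = TurnPredictor(
        step1=lambda f: f["speaker_gazed_at_listener"],
        step2=lambda f: f["gazed_listener_id"],
        step3=lambda f: 500.0 if f["mutual_gaze"] else 900.0,
    )
    print(predictor.predict({"speaker_gazed_at_listener": True,
                             "gazed_listener_id": "B", "mutual_gaze": True}))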


International Conference on Multimodal Interfaces | 2015

Multimodal Fusion using Respiration and Gaze for Predicting Next Speaker in Multi-Party Meetings

Ryo Ishii; Shiro Kumano; Kazuhiro Otsuka

Techniques that use nonverbal behaviors to predict turn-taking situations, such as who will be the next speaker and the timing of the next utterance in multi-party meetings, have been receiving a lot of attention recently. It has long been known that gaze is a physical behavior that plays an important role in transferring the speaking turn between humans. More recently, a line of research has focused on the relationship between turn-taking and respiration, a biological signal that conveys information about the intention, or preliminary action, to start speaking. It has been demonstrated that respiration and gaze behavior separately have the potential to predict the next speaker and the next utterance timing in multi-party meetings. As a multimodal fusion for predicting the next speaker in multi-party meetings, we integrated respiration and gaze behavior, which are extracted from different modalities and are completely different in quality, and implemented a model that uses both to predict the next speaker at the end of an utterance. The model has two processing steps: the first predicts whether turn-keeping or turn-taking happens; the second predicts the next speaker in turn-taking. We constructed prediction models with either respiration or gaze behavior and with both as features and compared their performance. The results suggest that the model using both respiration and gaze behavior performs better than the one using only respiration or only gaze behavior, revealing that multimodal fusion of respiration and gaze behavior is effective for predicting the next speaker in multi-party meetings. We also found that gaze behavior is more useful than respiration for predicting turn-keeping/turn-taking, whereas respiration is more useful for predicting the next speaker in turn-taking.
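
The toy sketch below shows the kind of comparison described for the first step (turn-keeping vs. turn-taking) on synthetic data: unimodal feature sets versus their concatenation. Feature names, dimensions, and the classifier are assumptions, not the paper's setup.

    # Compare respiration-only, gaze-only, and fused feature sets (toy data).
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(1)
    n = 300
    resp = rng.normal(size=(n, 5))    # e.g. inhalation onset, slope, duration, amplitude, depth
    gaze = rng.normal(size=(n, 8))    # e.g. encoded gaze transition patterns near utterance end
    y = rng.integers(0, 2, size=n)    # 0 = turn-keeping, 1 = turn-taking (toy labels)

    for name, X in [("respiration", resp), ("gaze", gaze),
                    ("respiration+gaze", np.hstack([resp, gaze]))]:
        acc = cross_val_score(SVC(kernel="rbf"), X, y, cv=5).mean()
        print(f"{name}: {acc:.2f}")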


KSII Transactions on Internet and Information Systems | 2015

In the Mood for Vlog: Multimodal Inference in Conversational Social Video

Dairazalia Sanchez-Cortes; Shiro Kumano; Kazuhiro Otsuka; Daniel Gatica-Perez

The prevalent “share what’s on your mind” paradigm of social media can be examined from the perspective of mood: short-term affective states revealed by the shared data. This view takes on new relevance given the emergence of conversational social video as a popular genre among viewers looking for entertainment and among video contributors as a channel for debate, expertise sharing, and artistic expression. From the perspective of human behavior understanding, in conversational social video both verbal and nonverbal information is conveyed by speakers and decoded by viewers. We present a systematic study of classification and ranking of mood impressions in social video, using vlogs from YouTube. Our approach considers eleven natural mood categories labeled through crowdsourcing by external observers on a diverse set of conversational vlogs. We extract a comprehensive set of nonverbal and verbal behavioral cues from the audio and video channels to characterize the mood of vloggers, and then implement and validate vlog classification and vlog ranking tasks using supervised learning methods. Following a reliability and correlation analysis of the mood impression data, our study demonstrates that, while the problem is challenging, several mood categories can be inferred with promising performance. Furthermore, multimodal features perform consistently better than single-channel features. Finally, we show that addressing mood as a ranking problem is a promising practical direction for several of the mood categories studied.
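
To illustrate the ranking formulation in particular, the hedged sketch below orders held-out vlogs by a predicted score for one mood category and checks rank agreement with observer scores using Spearman correlation; the features, regressor, and split are assumptions on synthetic data, not the paper's setup.

    # Ranking sketch: evaluate how well predicted scores order the vlogs (toy data).
    import numpy as np
    from scipy.stats import spearmanr
    from sklearn.ensemble import GradientBoostingRegressor

    rng = np.random.default_rng(2)
    n = 120
    X = rng.normal(size=(n, 30))                                # multimodal features (assumed)
    mood = 0.8 * X[:, 0] + rng.normal(scale=0.5, size=n)        # toy mood impression score

    split = 80
    reg = GradientBoostingRegressor(random_state=0).fit(X[:split], mood[:split])
    rho, _ = spearmanr(reg.predict(X[split:]), mood[split:])
    print(f"rank correlation on held-out vlogs: {rho:.2f}")     # ranking quality, not accuracy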


IEEE International Conference on Automatic Face and Gesture Recognition | 2013

Analyzing perceived empathy/antipathy based on reaction time in behavioral coordination

Shiro Kumano; Kazuhiro Otsuka; Masafumi Matsuda; Junji Yamato

This study analyzes emotions established between people interacting in face-to-face conversation. By focusing on empathy and antipathy, especially the process by which they are perceived by external observers, this paper aims to elucidate tendencies in how they are perceived and, from these, to develop a computational model that automatically estimates perceived empathy/antipathy. This paper makes two main contributions. First, an experiment demonstrates that an observer's perception of an interacting pair is affected by the time lags between their actions and reactions in facial expressions and by whether their expressions are congruent or not; for example, a congruent but delayed reaction is unlikely to be perceived as empathy. Based on these findings, we propose a probabilistic model that relates the perceived empathy/antipathy of external observers to the actions and reactions of conversation participants. An experiment is conducted on ten conversations performed by 16 women, in which the perceptions of nine external observers are gathered. The results demonstrate that timing cues are useful in improving estimation performance, especially for perceived antipathy.
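
A toy version of the underlying intuition, not the paper's probabilistic model, is sketched below: tabulating how the probability of a perceived label varies with reaction delay and expression congruence. The example records and label set are invented.

    # Conditional label frequencies given congruence and reaction-delay bin (toy data).
    from collections import Counter, defaultdict

    # Invented examples of (expressions_congruent, reaction_delay_bin, observer_label).
    observations = [
        (True,  "fast",    "empathy"),
        (True,  "fast",    "empathy"),
        (True,  "delayed", "neither"),   # congruent but late: less likely perceived as empathy
        (False, "fast",    "antipathy"),
        (False, "delayed", "neither"),
    ]

    counts = defaultdict(Counter)
    for congruent, delay, label in observations:
        counts[(congruent, delay)][label] += 1

    def p_label(congruent, delay):
        c = counts[(congruent, delay)]
        total = sum(c.values())
        return {lab: n / total for lab, n in c.items()} if total else {}

    print(p_label(True, "fast"))         # -> {'empathy': 1.0} on this toy data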


ACM Multimedia | 2011

A system for reconstructing multiparty conversation field based on augmented head motion by dynamic projection

Kazuhiro Otsuka; Kamil Sebastian Mucha; Shiro Kumano; Dan Mikami; Masafumi Matsuda; Junji Yamato

A novel system is presented for reconstructing, in the real world, multiparty face-to-face conversation scenes; it uses dynamic projection to augment human head motion. The system aims to display and play back pre-recorded conversations to viewers as if the remote people were talking in front of them. It consists of multiple projectors and transparent screens. Each screen separately displays the life-size face of one meeting participant, and the screens are spatially arranged to recreate the actual scene. The main feature of this system is dynamic projection: screen pose is dynamically controlled to emulate the head motions of the participants, especially rotation around the vertical axis, which is typical of shifts in visual attention, i.e., turning one's gaze from one person to another. This recreation of head motion by physical screen motion, in addition to image motion, aims to more clearly express the interactions involving visual attention among the participants. The minimal design, a frameless projector screen with augmented head motion, is expected to create the feeling that the remote participants are actually present in the same room. This demo presents our initial system and discusses its potential impact on future visual communications.
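
A minimal sketch of the control idea, assuming a hypothetical actuator interface and rotation limit, is to drive each screen's rotation from the recorded participant's head yaw:

    # Map recorded head yaw to a clamped screen-rotation command (limits are assumed).
    def screen_yaw_command(head_yaw_deg, max_screen_yaw=35.0, gain=1.0):
        """Clamp the emulated rotation so the physical screen stays within its range."""
        target = gain * head_yaw_deg
        return max(-max_screen_yaw, min(max_screen_yaw, target))

    for yaw in [0.0, 12.5, 38.0, -20.0]:                 # replaying a recorded yaw trajectory
        print(screen_yaw_command(yaw))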


Systems, Man and Cybernetics | 2011

Early facial expression recognition with high-frame rate 3D sensing

Lumei Su; Shiro Kumano; Kazuhiro Otsuka; Dan Mikami; Junji Yamato; Yoichi Sato

This work investigates a new, challenging problem: how to exactly recognize facial expressions as early as possible, whereas most previous work focuses on improving the recognition rate of facial expression recognition. The features of facial expressions in their early stage are unfortunately very sensitive to noise because of their low intensity. We therefore propose a novel wavelet spectral subtraction method to spatio-temporally refine subtle facial expression features. Moreover, in order to achieve early facial expression recognition, we introduce an early AdaBoost algorithm for the facial expression recognition problem. Experiments on our database, established with high-frame-rate 3D sensing, showed that the proposed method has promising performance on early facial expression recognition.
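
As a simplified stand-in for the refinement step, the sketch below applies generic wavelet soft-threshold denoising (not the paper's wavelet spectral subtraction) to a single weak feature trace; the wavelet choice and threshold rule are assumptions.

    # Suppress low-intensity noise in a subtle facial-feature time series (illustrative).
    import numpy as np
    import pywt

    def denoise_feature_trace(x, wavelet="db4", level=3):
        coeffs = pywt.wavedec(x, wavelet, level=level)
        sigma = np.median(np.abs(coeffs[-1])) / 0.6745           # robust noise estimate
        thr = sigma * np.sqrt(2 * np.log(x.size))                # universal threshold
        coeffs = [coeffs[0]] + [pywt.threshold(c, thr, mode="soft") for c in coeffs[1:]]
        return pywt.waverec(coeffs, wavelet)[: x.size]

    t = np.linspace(0, 1, 256)
    subtle = 0.05 * np.sin(2 * np.pi * 3 * t)                    # weak early-expression signal
    noisy = subtle + 0.02 * np.random.default_rng(0).normal(size=t.size)
    print(float(np.abs(denoise_feature_trace(noisy) - subtle).mean()))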

Collaboration


Dive into Shiro Kumano's collaborations.

Top Co-Authors

Kazuhiro Otsuka (Nippon Telegraph and Telephone)
Junji Yamato (Nippon Telegraph and Telephone)
Ryo Ishii (Nippon Telegraph and Telephone)