Kazumasa Murai
Fuji Xerox
Publications
Featured research published by Kazumasa Murai.
international conference on acoustics, speech, and signal processing | 2002
Shigeo Morishima; Shin Ogata; Kazumasa Murai; Satoshi Nakamura
Speech-to-speech translation has been studied to realize natural human communication beyond language barriers. Toward further multi-modal natural communication, visual information such as face and lip movements will be necessary. In this paper, we introduce a multi-modal English-to-Japanese and Japanese-to-English translation system that also translates the speaker's speech motion while synchronizing it to the translated speech. To retain the speaker's facial expression, we substitute only the image of the speech organs with a synthesized one, generated by a three-dimensional wire-frame model that is adaptable to any speaker. Our approach enables image synthesis and translation with an extremely small database. We conduct a subjective evaluation by connected-digit discrimination using data with and without audio-visual lip synchronicity. The results confirm that the proposed audio-visual translation system achieves sufficient quality.
international conference on multimedia and expo | 2001
Shin Ogata; Kazumasa Murai; Satoshi Nakamura; Shigeo Morishima
In this paper, we introduce a multi-modal English-to-Japanese and Japanese-to-English translation system that also translates the speaker's speech motion while synchronizing it to the translated speech. To retain the speaker's facial expression, we substitute only the image of the speech organs with a synthesized one, generated by a three-dimensional wire-frame model that is adaptable to any speaker. Our approach enables image synthesis and translation with an extremely small database.
multimedia signal processing | 2004
Alejandro Jaimes; Naofumi Yoshida; Kazumasa Murai; Kazutaka Hirata; Jun Miyazaki
We present a new approach to segment and visualize informally captured multi-stream meeting videos. We process the visual content in each stream individually by analyzing the differences between frames in each sequence to find change areas. These results are combined with face detection to determine visual activity in each of the streams. We then combine the activity scores from multiple streams and automatically generate a 3D representation of the video. Our representation allows the user to obtain an at-a-glance view of the video at different granularities of activity, view multiple streams simultaneously, and select particular points in time for viewing. We present experiments that suggest that low-level visual analysis can be effective for finding highlights that can be used for browsing multi-stream meeting videos.
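The per-stream visual analysis described above can be sketched as inter-frame differencing followed by cross-stream fusion. This is a minimal illustrative version, not the paper's actual pipeline: the change threshold, the fraction-of-changed-pixels score, and simple averaging across streams are all assumptions for the sketch.

```python
def activity_score(prev, cur, threshold=25):
    """Visual activity between two grayscale frames (2-D lists, 0-255):
    the fraction of pixels whose value changed by more than `threshold`."""
    changed = sum(
        1
        for row_p, row_c in zip(prev, cur)
        for p, c in zip(row_p, row_c)
        if abs(c - p) > threshold
    )
    return changed / (len(prev) * len(prev[0]))

def combine_streams(per_stream):
    """Combine activity scores from multiple streams by averaging them
    at each time step (one simple fusion choice among many)."""
    return [sum(col) / len(col) for col in zip(*per_stream)]
```

A stream where half the pixels change between frames would score 0.5 at that step; averaging the per-stream scores gives the overall activity curve used to drive the 3D visualization.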
international conference on multimedia and expo | 2001
Takafumi Misawa; Kazumasa Murai; Satoshi Nakamura; Shigeo Morishima
A stand-in is a common technique for movies and TV programs in foreign languages. The current stand-in, which substitutes only the voice channel, results in awkward matching to the mouth motion. Videophones with automatic voice translation are expected to be widely used in the near future, and they may face the same problem without lip-synchronized speaking-face image translation. In this paper, we propose a method to track the motion of the face from the video image, which is one of the key technologies for speaking-image translation. Almost all previous tracking algorithms aim to detect feature points of the face. However, these algorithms suffer from problems such as blurring of a feature point between frames and occlusion of a feature point hidden by rotation of the head. We propose a method that detects movement and rotation of the head, given the three-dimensional shape of the face, by template matching using a 3D personal face wire-frame model. Evaluation experiments are carried out with measured reference data of the head. The proposed method achieves an average angle error of 0.48, confirming its effectiveness.
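The template-matching idea can be illustrated with a toy version: rotate the 3D model points by each candidate head angle, project them to the image plane, and keep the angle with the smallest matching error. Everything here is a simplifying assumption (orthographic projection, yaw-only rotation, sparse points with SSD instead of dense image templates); the paper matches a full 3D personal wire-frame model against the video frame.

```python
import math

def estimate_yaw(model_pts, observed_2d, angles):
    """Pick the yaw angle whose rotated-and-projected model points best
    match the observed 2-D points under sum-of-squared-differences.

    model_pts:   list of (x, y, z) head-model points.
    observed_2d: list of (x, y) image points, same order as model_pts.
    angles:      candidate yaw angles in radians.
    """
    def project(pts, yaw):
        c, s = math.cos(yaw), math.sin(yaw)
        # rotate about the vertical (y) axis, then drop z (orthographic)
        return [(c * x + s * z, y) for x, y, z in pts]

    def ssd(a, b):
        return sum((ax - bx) ** 2 + (ay - by) ** 2
                   for (ax, ay), (bx, by) in zip(a, b))

    return min(angles, key=lambda a: ssd(project(model_pts, a), observed_2d))
```

Searching over a full rigid transform (three rotations plus translation) works the same way, only with a larger candidate grid or a coarse-to-fine search.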
IS&T/SPIE's Symposium on Electronic Imaging: Science & Technology | 1995
Hitoshi Ogatsu; Kazumasa Murai; Shinji Kita
To improve the color fidelity of four-color reproduction and to increase the flexibility of Gray Component Replacement (GCR) for text and continuous-tone images, a novel GCR algorithm based on CIE L*a*b* signals is proposed. The algorithm consists of (1) a maximum (achromatic) black determination part, (2) a black adjustment part based on chroma, and (3) a three-color determination part. In this configuration, the black signal is determined ahead of the CMY signals, and the freedom of the 3-input (L*a*b*) to 4-output (CMYBk) conversion is concentrated in part (2). The algorithm is examined on a xerographic color printer, using a neural-network technique to resolve the conversion. The results show that the algorithm conserves color fidelity at any GCR rate and is applicable to both text and continuous-tone images.
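For intuition, the textbook CMY-space form of GCR replaces the achromatic component shared by C, M, and Y with black. This sketch is only that classic version; the paper's algorithm instead determines black from CIE L*a*b* signals and adjusts it by chroma.

```python
def gcr(c, m, y, rate=1.0):
    """Classic Gray Component Replacement in CMY space.

    The gray component min(c, m, y) is removed from the chromatic inks
    and printed as black instead. `rate` = 1.0 is full replacement;
    smaller rates keep part of the gray in CMY.
    """
    k = min(c, m, y) * rate        # black (Bk) signal
    return c - k, m - k, y - k, k  # adjusted C, M, Y plus Bk
```

Determining black first and then resolving the remaining three colors mirrors the structure of the proposed algorithm, where the black signal is fixed before the CMY signals.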
international conference on multimedia and expo | 2001
Kazumasa Murai; Kenichi Kumatani; Satoshi Nakamura
CIMWOS is a multimedia, multimodal and multilingual system supporting content-based indexing, archiving, retrieval, and on-demand delivery of audiovisual content. The system uses a multifaceted approach to locate important segments within multimedia material, employing state-of-the-art algorithms for text, speech and image processing. The audio processing operations employ robust continuous speech recognition, speech/non-speech classification, speaker clustering and speaker identification. Text processing tools operate on the text stream produced by the speech recogniser and perform named-entity detection, term recognition, topic detection, and story segmentation. Image processing includes video segmentation and key-frame extraction, face detection and face identification, object and scene recognition, video text detection and character recognition. All outputs converge to a textual XML metadata annotation scheme following the MPEG-7 standard. These XML annotations are further merged and loaded into the CIMWOS multimedia database. Additionally, they can be dynamically transformed for interchanging semantic-based information. The retrieval engine is based on a weighted Boolean model with intelligent indexing components. An ergonomic and user-friendly web-based interface allows the user to efficiently retrieve video segments by a combination of media description, content metadata and natural language text. The system is currently under evaluation.
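A weighted Boolean retrieval model can be sketched as scoring each document by the weight of the query terms it satisfies. This is a generic illustration of the model family, not CIMWOS's actual engine; the term weights and normalization are assumptions.

```python
def weighted_boolean_score(doc_terms, query):
    """Score a document against a weighted Boolean (OR-style) query.

    doc_terms: set of index terms present in the document.
    query:     dict mapping query term -> weight.
    Returns the matched weight normalized by the total query weight.
    """
    total = sum(query.values())
    if total == 0:
        return 0.0
    matched = sum(w for term, w in query.items() if term in doc_terms)
    return matched / total
```

Ranking the annotated video segments by this score, highest first, gives the result list a retrieval interface would display.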
Proceedings of SPIE, the International Society for Optical Engineering | 2005
Yasuji Seko; Yasuyuki Saguchi; Yoshinori Yamaguchi; Hiroyuki Hotta; Kazumasa Murai; Jun Miyazaki; Hiroyasu Koshimizu
We demonstrate real-time 3D position sensing of multiple light sources by capturing their ring images, which are formed by a hemispherical lens system with large spherical aberration. The ring images change in diameter in accordance with the distance to the light sources, and the ring center positions determine the directions toward them. Therefore, the 3D positions of the light sources are calculated by detecting the diameters and center positions of the circles. We succeeded in measuring the 3D positions of multiple light sources simultaneously in real time by extracting and tracking the circle patterns individually. Each circle is extracted by a Hough-transform technique that uses three edge points that are not closely distributed and searches for primary votes above a threshold, and is tracked by predicting its successive positions with a Kalman filter. These processes make it possible to measure the 3D positions of the light sources even when several circles overlap. In the experiment, we could track several circle patterns while measuring their center positions and diameters, that is, measuring the 3D positions of LEDs in real space. The measurement error of the 3D position for an LED was 6.8 mm on average over 150 sampling points ranging from 450 mm to 950 mm in distance.
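The core geometric step of three-edge-point Hough voting is that any three non-collinear edge points determine a unique circle (its circumcircle), whose center and radius can then be voted into an accumulator. The sketch below shows only that circumcircle computation, a standard construction, not the paper's full voting and tracking pipeline.

```python
def circle_from_three_points(p1, p2, p3):
    """Circumcircle ((cx, cy), r) through three points, or None if they
    are (nearly) collinear and no unique circle exists."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    d = 2 * (x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2))
    if abs(d) < 1e-9:
        return None  # collinear points: skip this triple
    s1, s2, s3 = x1**2 + y1**2, x2**2 + y2**2, x3**2 + y3**2
    cx = (s1 * (y2 - y3) + s2 * (y3 - y1) + s3 * (y1 - y2)) / d
    cy = (s1 * (x3 - x2) + s2 * (x1 - x3) + s3 * (x2 - x1)) / d
    r = ((x1 - cx) ** 2 + (y1 - cy) ** 2) ** 0.5
    return (cx, cy), r
```

In a voting scheme, each sampled triple of edge points contributes one (center, radius) hypothesis; circles whose accumulator counts exceed a threshold are accepted, and picking triples that are not closely spaced keeps the hypotheses numerically stable.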
signal processing systems | 2004
Kazumasa Murai; Satoshi Nakamura
This paper discusses robust speech section detection using audio and video modalities. Most of today's speech recognition systems require speech section detection prior to any further analysis, and the accuracy of the detected speech sections is said to affect speech recognition accuracy. Because the audio modality is intrinsically disturbed by audio noise, we have been researching video-modality speech section detection, which detects deformations in images of the speech organs. The video modality is robust to audio noise, but its detected sections are longer than the audio speech sections, because deformations of the related organs start before speech to prepare for the articulation of the first phoneme, and because the settling-down motion lasts longer than the speech. We have verified that the inaccurate sections caused by this excess length degrade the speech recognition rate, leading to recognition errors through insertions. To reduce insertion errors and enhance the robustness of speech detection, we propose a method that takes advantage of both modalities. Our experiments confirm that the proposed method reduces the insertion error rate as well as increases the recognition rate in noisy environments.
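One simple way to combine the two modalities, given that video sections over-extend around the true speech, is to keep only the overlap between audio-detected and video-detected sections. This AND-style fusion is an illustrative assumption, not the paper's exact method.

```python
def fuse_sections(audio, video):
    """Intersect audio- and video-detected speech sections.

    audio, video: lists of (start, end) times in seconds.
    Returns the overlapping portions, trimming the excess length the
    video modality adds before and after the actual speech.
    """
    fused = []
    for a_start, a_end in audio:
        for v_start, v_end in video:
            start, end = max(a_start, v_start), min(a_end, v_end)
            if start < end:  # non-empty overlap
                fused.append((start, end))
    return fused
```

A spurious audio section caused by noise that falls outside every video section is discarded entirely, which is how this style of fusion suppresses insertion errors.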
pacific rim conference on communications, computers and signal processing | 2003
Yasuji Seko; Kazumasa Murai; X. Kenju; Hiroyuki Hotta; Jun Miyazaki
We demonstrate a new method to measure the 3D position of a light source by tracking the ring images formed through a single hemispherical lens. The ring images are produced by the large spherical aberration of the lens and are always sharp, regardless of the distance and angle of the light source, without any focusing mechanism. The dynamic change of the diameter and center position of the ring image enables accurate, high-resolution 3D position measurement of the light source. Experimental results verified that this new principle solves the problems of 3D optical measurement, namely complex structures, narrow-angle views, and slow measuring speed due to focusing mechanisms, indicating its potential for wide application.