Kazumasa Murai
Fuji Xerox
Publications
Featured research published by Kazumasa Murai.
international conference on acoustics, speech, and signal processing | 2002
Shigeo Morishima; Shin Ogata; Kazumasa Murai; Satoshi Nakamura
Speech-to-speech translation has been studied to realize natural human communication beyond language barriers. Toward further multi-modal natural communication, visual information such as face and lip movements will be necessary. In this paper, we introduce a multi-modal English-to-Japanese and Japanese-to-English translation system that also translates the speaker's speech motion while synchronizing it to the translated speech. To retain the speaker's facial expression, we substitute only the image of the speech organs with a synthesized one, generated by a three-dimensional wire-frame model that is adaptable to any speaker. Our approach enables image synthesis and translation with an extremely small database. We conduct a subjective evaluation by connected-digit discrimination using data with and without audio-visual lip synchronicity. The results confirm that the proposed audio-visual translation system achieves sufficient quality.
international conference on multimedia and expo | 2001
Shin Ogata; Kazumasa Murai; Satoshi Nakamura; Shigeo Morishima
In this paper, we introduce a multi-modal English-to-Japanese and Japanese-to-English translation system that also translates the speaker's speech motion while synchronizing it to the translated speech. To retain the speaker's facial expression, we substitute only the image of the speech organs with a synthesized one, generated by a three-dimensional wire-frame model that is adaptable to any speaker. Our approach enables image synthesis and translation with an extremely small database.
multimedia signal processing | 2004
Alejandro Jaimes; Naofumi Yoshida; Kazumasa Murai; Kazutaka Hirata; Jun Miyazaki
We present a new approach to segment and visualize informally captured multi-stream meeting videos. We process the visual content in each stream individually by analyzing the differences between frames in each sequence to find change areas. These results are combined with face detection to determine visual activity in each of the streams. We then combine the activity scores from multiple streams and automatically generate a 3D representation of the video. Our representation allows the user to obtain an at-a-glance view of the video at different granularities of activity, view multiple streams simultaneously, and select particular points in time for viewing. We present experiments that suggest that low-level visual analysis can be effective for finding highlights that can be used for browsing multi-stream meeting videos.
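The per-stream visual analysis described above can be sketched as inter-frame differencing followed by cross-stream fusion. This is a minimal illustrative version, not the paper's actual pipeline: the change threshold, the fraction-of-changed-pixels score, and simple averaging across streams are all assumptions for the sketch.

```python
def activity_score(prev, cur, threshold=25):
    """Visual activity between two grayscale frames (2-D lists, 0-255):
    the fraction of pixels whose value changed by more than `threshold`."""
    changed = sum(
        1
        for row_p, row_c in zip(prev, cur)
        for p, c in zip(row_p, row_c)
        if abs(c - p) > threshold
    )
    return changed / (len(prev) * len(prev[0]))

def combine_streams(per_stream):
    """Combine activity scores from multiple streams by averaging them
    at each time step (one simple fusion choice among many)."""
    return [sum(col) / len(col) for col in zip(*per_stream)]
```

A stream where half the pixels change between frames would score 0.5 at that step; averaging the per-stream scores gives the overall activity curve used to drive the 3D visualization.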
international conference on multimedia and expo | 2001
Takafumi Misawa; Kazumasa Murai; Satoshi Nakamura; Shigeo Morishima
A stand-in is a common technique for movies and TV programs in foreign languages. The current stand-in, which substitutes only the voice channel, results in awkward matching to the mouth motion. Videophones with automatic voice translation are expected to be widely used in the near future, and they may face the same problem without lip-synchronized speaking-face image translation. In this paper, we propose a method to track the motion of the face from the video image, which is one of the key technologies for speaking-image translation. Almost all previous tracking algorithms aim to detect feature points of the face. However, these algorithms suffer from problems such as blurring of a feature point between frames and occlusion of a feature point hidden by rotation of the head. We propose a method that detects movement and rotation of the head, given the three-dimensional shape of the face, by template matching using a 3D personal face wire-frame model. Evaluation experiments are carried out with measured reference data of the head. The proposed method achieves an average angle error of 0.48, confirming its effectiveness.
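The template-matching idea can be illustrated with a toy version: rotate the 3D model points by each candidate head angle, project them to the image plane, and keep the angle with the smallest matching error. Everything here is a simplifying assumption (orthographic projection, yaw-only rotation, sparse points with SSD instead of dense image templates); the paper matches a full 3D personal wire-frame model against the video frame.

```python
import math

def estimate_yaw(model_pts, observed_2d, angles):
    """Pick the yaw angle whose rotated-and-projected model points best
    match the observed 2-D points under sum-of-squared-differences.

    model_pts:   list of (x, y, z) head-model points.
    observed_2d: list of (x, y) image points, same order as model_pts.
    angles:      candidate yaw angles in radians.
    """
    def project(pts, yaw):
        c, s = math.cos(yaw), math.sin(yaw)
        # rotate about the vertical (y) axis, then drop z (orthographic)
        return [(c * x + s * z, y) for x, y, z in pts]

    def ssd(a, b):
        return sum((ax - bx) ** 2 + (ay - by) ** 2
                   for (ax, ay), (bx, by) in zip(a, b))

    return min(angles, key=lambda a: ssd(project(model_pts, a), observed_2d))
```

Searching over a full rigid transform (three rotations plus translation) works the same way, only with a larger candidate grid or a coarse-to-fine search.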
IS&T/SPIE's Symposium on Electronic Imaging: Science & Technology | 1995
Hitoshi Ogatsu; Kazumasa Murai; Shinji Kita
To improve the color fidelity of four-color reproduction and to increase the flexibility of Gray Component Replacement (GCR) for text and continuous-tone images, a novel GCR algorithm based on CIE L*a*b* signals is proposed. The algorithm consists of (1) a maximum (achromatic) black determination part, (2) a black adjustment part based on chroma, and (3) a three-color determination part. In this configuration, the black signal is determined ahead of the CMY signals, and the freedom of the 3-input (L*a*b*) to 4-output (CMYBk) conversion is concentrated in part (2). The algorithm is examined on a xerographic color printer, using a neural-network technique to resolve the conversion. The results show that the algorithm conserves color fidelity at any GCR rate and is applicable to both text and continuous-tone images.
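For intuition, the textbook CMY-space form of GCR replaces the achromatic component shared by C, M, and Y with black. This sketch is only that classic version; the paper's algorithm instead determines black from CIE L*a*b* signals and adjusts it by chroma.

```python
def gcr(c, m, y, rate=1.0):
    """Classic Gray Component Replacement in CMY space.

    The gray component min(c, m, y) is removed from the chromatic inks
    and printed as black instead. `rate` = 1.0 is full replacement;
    smaller rates keep part of the gray in CMY.
    """
    k = min(c, m, y) * rate        # black (Bk) signal
    return c - k, m - k, y - k, k  # adjusted C, M, Y plus Bk
```

Determining black first and then resolving the remaining three colors mirrors the structure of the proposed algorithm, where the black signal is fixed before the CMY signals.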
international conference on multimedia and expo | 2001
Kazumasa Murai; Kenichi Kumatani; Satoshi Nakamura
CIMWOS is a multimedia, multimodal and multilingual system supporting content-based indexing, archiving, retrieval, and on-demand delivery of audiovisual content. The system uses a multifaceted approach to locate important segments within multimedia material, employing state-of-the-art algorithms for text, speech and image processing. The audio processing operations employ robust continuous speech recognition, speech/non-speech classification, speaker clustering and speaker identification. Text processing tools operate on the text stream produced by the speech recogniser and perform named-entity detection, term recognition, topic detection, and story segmentation. Image processing includes video segmentation and key-frame extraction, face detection and face identification, object and scene recognition, video text detection and character recognition. All outputs converge to a textual XML metadata annotation scheme following the MPEG-7 standard. These XML annotations are further merged and loaded into the CIMWOS multimedia database. Additionally, they can be dynamically transformed for interchanging semantic-based information. The retrieval engine is based on a weighted Boolean model with intelligent indexing components. An ergonomic and user-friendly web-based interface allows the user to efficiently retrieve video segments by a combination of media description, content metadata and natural language text. The system is currently under evaluation.
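A weighted Boolean retrieval model can be sketched as scoring each document by the weight of the query terms it satisfies. This is a generic illustration of the model family, not CIMWOS's actual engine; the term weights and normalization are assumptions.

```python
def weighted_boolean_score(doc_terms, query):
    """Score a document against a weighted Boolean (OR-style) query.

    doc_terms: set of index terms present in the document.
    query:     dict mapping query term -> weight.
    Returns the matched weight normalized by the total query weight.
    """
    total = sum(query.values())
    if total == 0:
        return 0.0
    matched = sum(w for term, w in query.items() if term in doc_terms)
    return matched / total
```

Ranking the annotated video segments by this score, highest first, gives the result list a retrieval interface would display.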
Proceedings of SPIE, the International Society for Optical Engineering | 2005
Yasuji Seko; Yasuyuki Saguchi; Yoshinori Yamaguchi; Hiroyuki Hotta; Kazumasa Murai; Jun Miyazaki; Hiroyasu Koshimizu
We demonstrate real-time 3D position sensing of multiple light sources by capturing their ring images, which are formed by a hemispherical lens system with large spherical aberration. The ring images change in diameter in accordance with the distance to the light sources, and the ring center positions determine the directions toward them. Therefore, the 3D positions of the light sources are calculated by detecting the diameters and center positions of the circles. We succeeded in measuring the 3D positions of multiple light sources simultaneously in real time by extracting and tracking the circle patterns individually. Each circle is extracted by a Hough-transform technique that uses three edge points that are not closely distributed and searches for primary votes above a threshold, and is tracked by predicting its successive positions with a Kalman filter. These processes make it possible to measure the 3D positions of the light sources even when several circles overlap. In the experiment, we could track several circle patterns while measuring their center positions and diameters, that is, measuring the 3D positions of LEDs in real space. The measurement error of the 3D position for an LED was 6.8 mm on average over 150 sampling points ranging from 450 mm to 950 mm in distance.
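The core geometric step of three-edge-point Hough voting is that any three non-collinear edge points determine a unique circle (its circumcircle), whose center and radius can then be voted into an accumulator. The sketch below shows only that circumcircle computation, a standard construction, not the paper's full voting and tracking pipeline.

```python
def circle_from_three_points(p1, p2, p3):
    """Circumcircle ((cx, cy), r) through three points, or None if they
    are (nearly) collinear and no unique circle exists."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    d = 2 * (x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2))
    if abs(d) < 1e-9:
        return None  # collinear points: skip this triple
    s1, s2, s3 = x1**2 + y1**2, x2**2 + y2**2, x3**2 + y3**2
    cx = (s1 * (y2 - y3) + s2 * (y3 - y1) + s3 * (y1 - y2)) / d
    cy = (s1 * (x3 - x2) + s2 * (x1 - x3) + s3 * (x2 - x1)) / d
    r = ((x1 - cx) ** 2 + (y1 - cy) ** 2) ** 0.5
    return (cx, cy), r
```

In a voting scheme, each sampled triple of edge points contributes one (center, radius) hypothesis; circles whose accumulator counts exceed a threshold are accepted, and picking triples that are not closely spaced keeps the hypotheses numerically stable.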
signal processing systems | 2004
Kazumasa Murai; Satoshi Nakamura
This paper discusses robust speech section detection using audio and video modalities. Most of today's speech recognition systems require speech section detection prior to any further analysis, and the accuracy of the detected speech sections is said to affect speech recognition accuracy. Because the audio modality is intrinsically disturbed by audio noise, we have been researching video-modality speech section detection, which detects deformations in images of the speech organs. The video modality is robust to audio noise, but its detected sections are longer than the audio speech sections, because deformations of the related organs start before speech to prepare for the articulation of the first phoneme, and because the settling-down motion lasts longer than the speech. We have verified that the inaccurate sections caused by this excess length degrade the speech recognition rate, leading to recognition errors through insertions. To reduce insertion errors and enhance the robustness of speech detection, we propose a method that takes advantage of both modalities. Our experiments confirm that the proposed method reduces the insertion error rate as well as increases the recognition rate in noisy environments.
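One simple way to combine the two modalities, given that video sections over-extend around the true speech, is to keep only the overlap between audio-detected and video-detected sections. This AND-style fusion is an illustrative assumption, not the paper's exact method.

```python
def fuse_sections(audio, video):
    """Intersect audio- and video-detected speech sections.

    audio, video: lists of (start, end) times in seconds.
    Returns the overlapping portions, trimming the excess length the
    video modality adds before and after the actual speech.
    """
    fused = []
    for a_start, a_end in audio:
        for v_start, v_end in video:
            start, end = max(a_start, v_start), min(a_end, v_end)
            if start < end:  # non-empty overlap
                fused.append((start, end))
    return fused
```

A spurious audio section caused by noise that falls outside every video section is discarded entirely, which is how this style of fusion suppresses insertion errors.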
pacific rim conference on communications, computers and signal processing | 2003
Yasuji Seko; Kazumasa Murai; X. Kenju; Hiroyuki Hotta; Jun Miyazaki
We demonstrate a new method to measure the 3D position of a light source by tracking the ring images formed through a single hemispherical lens. The ring images are produced by the large spherical aberration of the lens and are always sharp, regardless of the distance and angle of the light source, without any focusing mechanism. The dynamic change of the diameter and center position of the ring image enables accurate, high-resolution 3D position measurement of the light source. Experimental results verified that this new principle solves the problems of 3D optical measurement, namely complex structures, narrow-angle views, and slow measuring speed due to focusing mechanisms, indicating its potential for wide application.