Katunobu Itou
Nagoya University
Publication
Featured research published by Katunobu Itou.
IEICE Transactions on Information and Systems | 2006
Toshihiro Wakita; Koji Ozawa; Chiyomi Miyajima; Kei Igarashi; Katunobu Itou; Kazuya Takeda; Fumitada Itakura
In this paper, we propose a driver identification method based on the driving behavior signals observed while the driver is following another vehicle. Driving behavior signals, such as accelerator pedal use, brake pedal use, vehicle velocity, and distance from the vehicle in front, are measured using a driving simulator. We compared the identification rates obtained with different identification models and different features. We found that nonparametric models outperformed parametric models, and that the driver's operation signals were more informative than road environment signals and car behavior signals. The identification rate for thirty drivers in actual vehicle driving in a city area was 73%.
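As a rough illustration of the nonparametric side of that comparison, the sketch below labels a query drive with the majority driver among its nearest neighbors in feature space; the feature dimensionality, the k value, and the toy data are assumptions for illustration, not the paper's setup.

    import numpy as np

    def knn_identify(train_feats, train_drivers, query_feat, k=5):
        """Nonparametric driver identification: label a query feature
        vector with the majority driver among its k nearest training
        vectors (Euclidean distance)."""
        dists = np.linalg.norm(train_feats - query_feat, axis=1)
        nearest = np.argsort(dists)[:k]
        labels = [train_drivers[i] for i in nearest]
        return max(set(labels), key=labels.count)

    # Illustrative usage: 30 drivers, 64-dim vectors standing in for
    # pedal/velocity features (dimensions and counts are assumptions).
    rng = np.random.default_rng(0)
    train = rng.normal(size=(300, 64))
    drivers = [i % 30 for i in range(300)]
    print(knn_identify(train, drivers, train[0], k=1))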
International ACM SIGIR Conference on Research and Development in Information Retrieval | 2001
Atsushi Fujii; Katunobu Itou; Tetsuya Ishikawa
Speech recognition has recently become a practical technology for real-world applications. Aiming at speech-driven text retrieval, which facilitates retrieving information with spoken queries, we propose a method to integrate speech recognition and retrieval methods. Since users speak content related to a target collection, we adapt the statistical language models used for speech recognition to the target collection, so as to improve both recognition and retrieval accuracy. Experiments using existing test collections combined with dictated queries showed the effectiveness of our method.
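A minimal sketch of the adaptation idea, assuming a unigram view of the language model and a simple interpolation weight (the paper's actual recognizer uses full n-gram models; every name and constant below is illustrative):

    from collections import Counter

    def collection_unigram(docs):
        """Estimate a unigram LM from the target document collection."""
        counts = Counter(w for d in docs for w in d.split())
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    def adapt(base_lm, coll_lm, lam=0.5):
        """Interpolate a base LM with the collection LM so queries
        about the collection are recognized more reliably."""
        vocab = set(base_lm) | set(coll_lm)
        return {w: lam * base_lm.get(w, 0) + (1 - lam) * coll_lm.get(w, 0)
                for w in vocab}

    docs = ["speech retrieval with spoken queries",
            "statistical language models for recognition"]
    base = {"speech": 0.01, "the": 0.05}   # toy base-LM probabilities
    print(adapt(base, collection_unigram(docs))["speech"])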
Speech Communication | 2006
Atsushi Fujii; Katunobu Itou; Tetsuya Ishikawa
We propose a cross-media lecture-on-demand system, called lodem, which searches a lecture video for specific segments in response to a text query. We utilize the benefits of the text, audio, and video data corresponding to a single lecture. lodem extracts the audio track from a target lecture video, generates a transcription by large-vocabulary continuous speech recognition, and produces a text index. A user can formulate text queries using the textbook related to the target lecture and can selectively view specific video segments by submitting those queries. Experimental results showed that by adapting speech recognition to the lecturer and the topic of the target lecture, recognition accuracy increased and, consequently, retrieval accuracy was comparable with that obtained from a human transcription. lodem is implemented as a client–server system on the Web to facilitate e-learning.
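The indexing-and-lookup step might look roughly like the following toy sketch, where transcribed segments carry their time stamps so that matching segments can be played back selectively; the inverted index and word-overlap scoring are assumptions, not lodem's documented internals.

    def build_index(segments):
        """segments: list of (start_sec, end_sec, transcript).
        Returns an inverted index from word -> segment ids."""
        index = {}
        for i, (_, _, text) in enumerate(segments):
            for w in set(text.lower().split()):
                index.setdefault(w, set()).add(i)
        return index

    def search(index, segments, query):
        """Rank segments by query-word overlap and return their
        (start, end) times for selective playback."""
        hits = {}
        for w in query.lower().split():
            for i in index.get(w, ()):
                hits[i] = hits.get(i, 0) + 1
        ranked = sorted(hits, key=hits.get, reverse=True)
        return [segments[i][:2] for i in ranked]

    segs = [(0, 30, "introduction to speech recognition"),
            (30, 75, "language model adaptation for lectures")]
    print(search(build_index(segs), segs, "language adaptation"))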
Empirical Methods in Natural Language Processing | 2002
Atsushi Fujii; Katunobu Itou; Tetsuya Ishikawa
While recent retrieval techniques do not limit the number of index terms, out-of-vocabulary (OOV) words are crucial in speech recognition. Aiming at retrieving information with spoken queries, we bridge the gap between speech recognition and text retrieval in terms of vocabulary size. Given a spoken query, we generate a transcription and detect OOV words through speech recognition. We then match the detected OOV words to terms indexed in a target collection to complete the transcription, and search the collection for documents relevant to the completed transcription. We show the effectiveness of our method through experiments.
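One plausible shape for the completion step is approximate matching between the phone sequence recognized for an OOV region and the pronunciations of indexed terms; the sketch below uses plain edit distance, and the phone inventory and example pronunciations are invented for illustration.

    def edit_distance(a, b):
        """Levenshtein distance between two phone sequences."""
        dp = list(range(len(b) + 1))
        for i, pa in enumerate(a, 1):
            prev, dp[0] = dp[0], i
            for j, pb in enumerate(b, 1):
                prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                         prev + (pa != pb))
        return dp[-1]

    def complete_oov(oov_phones, index_terms):
        """Replace a detected OOV region with the indexed term whose
        pronunciation is closest to the recognized phone sequence."""
        return min(index_terms,
                   key=lambda t: edit_distance(oov_phones, index_terms[t]))

    terms = {"itakura": list("itakura"), "retrieval": list("ritrivl")}
    print(complete_oov(list("itakula"), terms))  # -> "itakura"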
Journal of Information Processing | 2009
Tomoyosi Akiba; Kiyoaki Aikawa; Yoshiaki Itoh; Tatsuya Kawahara; Hiroaki Nanjo; Hiromitsu Nishizaki; Norihito Yasuda; Yoichi Yamashita; Katunobu Itou
The lecture is one of the most valuable genres of audiovisual data. Although spoken document processing is a promising technology for utilizing lectures in various ways, it is difficult to evaluate because the evaluation requires subjective judgments and/or the verification of large quantities of data. In this paper, a test collection for the evaluation of spoken lecture retrieval is reported. The test collection consists of target spoken documents comprising about 2,700 lectures (604 hours) taken from the Corpus of Spontaneous Japanese (CSJ), 39 retrieval queries, the relevant passages in the target documents for each query, and automatic transcriptions of the target speech data. This paper also reports the retrieval performance on the constructed test collection obtained by applying a standard spoken document retrieval (SDR) method, which serves as a baseline for forthcoming SDR studies using the test collection.
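A baseline in the spirit of the standard SDR method mentioned above could be plain TF-IDF retrieval over the automatic transcriptions; the weighting scheme and tokenization below are assumptions rather than the paper's exact configuration.

    import math
    from collections import Counter

    def tfidf_search(passages, query):
        """Rank transcribed passages against a text query with
        TF-IDF weighting and a simple dot-product score."""
        docs = [Counter(p.split()) for p in passages]
        n = len(docs)
        df = Counter(w for d in docs for w in d)
        idf = {w: math.log(n / df[w]) for w in df}
        q = Counter(query.split())
        scores = [sum(q[w] * d[w] * idf.get(w, 0) ** 2 for w in q)
                  for d in docs]
        return sorted(range(n), key=lambda i: -scores[i])

    passages = ["spontaneous japanese lecture speech",
                "retrieval of spoken documents"]
    print(tfidf_search(passages, "spoken documents retrieval"))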
International Conference on Spoken Language Processing | 1996
Satoru Hayamizu; Osamu Hasegawa; Katunobu Itou; Katsuhiko Sakaue; Kazuyo Tanaka; Shigeki Nagaya; Masayuki Nakazawa; Takehiro Endoh; Fumio Togawa; Kenji Sakamoto; Kazuhiko Yamamoto
The paper describes the design policy and prototype data collection of the RWC (Real World Computing Program) multimodal database. The database is intended for research and development on the integration of spoken language and visual information for human-computer interaction. The interactions are supposed to use image recognition, image synthesis, speech recognition, and speech synthesis. Visual information also includes non-verbal communication, such as interactions using hand gestures and facial expressions between a human and a human-like CG (computer graphics) agent with a face and hands. Based on experiments with these interaction modes, specifications of the database are discussed from the viewpoint of controlling variability and collection cost.
International Conference on Acoustics, Speech, and Signal Processing | 2005
Weifeng Li; Katunobu Itou; Kazuya Takeda; Fumitada Itakura
We present a two-stage noise spectra estimation approach. After first-stage noise estimation using the improved minima controlled recursive averaging (IMCRA) method, second-stage noise estimation is performed by employing a maximum a posteriori (MAP) noise amplitude estimator. We also develop a regression-based speech enhancement system that approximates the clean speech from the estimated noise and the original noisy speech. Evaluation experiments show that the proposed two-stage noise estimation method results in lower estimation error for all tested noise types. Compared to the original noisy speech, the proposed regression-based approach obtains an average relative word error rate (WER) reduction of 65% in isolated word recognition experiments conducted in 12 real car environments.
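As a heavily simplified stand-in for the first stage, the sketch below tracks the noise power spectrum by recursive averaging over frames judged speech-absent; IMCRA and the MAP second stage are far more elaborate, and the energy threshold and smoothing constant here are assumptions.

    import numpy as np

    def recursive_noise_estimate(power_spectra, alpha=0.95):
        """Track the noise power spectrum frame by frame: update by
        recursive averaging only when a frame looks speech-absent
        (here crudely decided by total frame energy)."""
        noise = power_spectra[0].copy()
        threshold = 2.0 * power_spectra[0].sum()  # assumed heuristic
        for frame in power_spectra[1:]:
            if frame.sum() < threshold:           # likely noise-only
                noise = alpha * noise + (1 - alpha) * frame
        return noise

    rng = np.random.default_rng(0)
    frames = rng.chisquare(2, size=(100, 129))  # fake |STFT|^2 frames
    print(recursive_noise_estimate(frames).mean())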
Ubiquitous Computing Systems | 2009
Kiichiro Yamano; Katunobu Itou
The use of a log of personal life experiences recorded with cameras, microphones, GPS devices, etc., is studied. Such a record of a person's life is called a life-log. Since the amount of data stored in a life-log system is vast and may include redundant data, methods for retrieving and summarizing the data are required for effective use. In this paper, an audio life-log recorded with wearable microphones is described. The purpose of this study is to classify the audio life-log according to place, speaker, and time. However, the location information stored in an audio life-log is obtained from GPS devices, which cannot identify rooms inside buildings. In this study, experiments were carried out in which the audio life-log was divided into segments and clustered by spectral envelope according to room. The experiments covered two situations, one in which GPS location information was captured and one in which it was not, and the results showed that the location information helped in room clustering. Audio life-log browsing on a map using GPS is also proposed.
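The room-clustering step might be sketched as follows: represent each segment by an averaged spectral envelope and cluster those vectors. The mean log power spectrum and vanilla k-means used here are stand-ins, not necessarily the study's actual features or clustering method.

    import numpy as np

    def segment_envelope(signal, frame=256):
        """Mean log power spectrum of a segment: a crude stand-in for
        a spectral envelope reflecting room acoustics."""
        frames = signal[:len(signal) // frame * frame].reshape(-1, frame)
        spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2
        return np.log(spec + 1e-10).mean(axis=0)

    def kmeans(X, k, iters=20, seed=0):
        """Plain k-means over envelope vectors."""
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), k, replace=False)]
        for _ in range(iters):
            labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
            for j in range(k):
                if (labels == j).any():
                    centers[j] = X[labels == j].mean(axis=0)
        return labels

    rng = np.random.default_rng(1)
    segs = [rng.normal(size=4096) for _ in range(10)]  # fake segments
    X = np.stack([segment_envelope(s) for s in segs])
    print(kmeans(X, k=2))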
International Conference on Multimodal Interfaces | 2007
Kazuhiro Morimoto; Chiyomi Miyajima; Norihide Kitaoka; Katunobu Itou; Kazuya Takeda
This paper presents a virtual push-button interface created by drawing a shape or line in the air with a fingertip. As an example of such a gesture-based interface, we developed a four-button interface for entering multi-digit numbers with pushing gestures on an invisible 2×2 button matrix inside a square drawn by the user. Trajectories of fingertip movements while entering randomly chosen multi-digit numbers are captured with a 3D position sensor mounted on the tip of the forefinger. We propose a statistical method for segmenting the trajectory of movements and a normalization method that accounts for the direction and size of the gestures. The performance of the proposed method is evaluated in HMM-based gesture recognition, where the recognition rate improved from 60.0% to 91.3% after applying the normalization method.
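One common way to normalize for gesture size and direction, sketched below under the assumption of 2-D trajectories, is to center each trajectory, scale it to unit spread, and rotate its principal axis onto a fixed direction; the paper's actual normalization may differ.

    import numpy as np

    def normalize_trajectory(points):
        """points: (N, 2) fingertip positions. Center, scale to unit
        RMS radius, and rotate the principal axis onto the x-axis so
        gestures of different sizes and orientations align."""
        p = points - points.mean(axis=0)
        p /= np.sqrt((p ** 2).sum(axis=1).mean()) + 1e-12
        # principal direction via the covariance eigenvector
        _, vecs = np.linalg.eigh(np.cov(p.T))
        axis = vecs[:, -1]
        rot = np.array([[axis[0], axis[1]], [-axis[1], axis[0]]])
        return p @ rot.T

    # Illustrative: a slanted square of fingertip samples
    sq = np.array([[0, 0], [2, 1], [1, 3], [-1, 2]], float)
    print(normalize_trajectory(sq).round(2))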
International Conference on Acoustics, Speech, and Signal Processing | 2009
Jun Ogata; Masataka Goto; Katunobu Itou
In recognizing spontaneous speech, the performance of typical speech recognizers tends to be degraded by filled and silent pauses, hesitation phenomena that occur frequently in such speech. In this paper, we present a method for improving recognizer performance by detecting and handling both filled pauses (lengthened vowels) and silent (unfilled) pauses. Our method automatically detects these pauses using a bottom-up acoustic analysis running in parallel with a typical speech decoding process, and then incorporates the detected results into the decoding process. Results of experiments conducted using the CIAIR spontaneous speech corpus confirmed the effectiveness of the proposed method.
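A crude illustration of the bottom-up cue for lengthened vowels is given below: stretches where the spectrum barely changes from frame to frame for long enough are flagged. The thresholds and features are assumptions, and the published detector is more elaborate.

    import numpy as np

    def detect_filled_pauses(log_spectra, delta_thresh=0.05, min_frames=20):
        """Flag stretches where the log spectrum barely changes from
        frame to frame for long enough, a rough cue for lengthened
        vowels (filled pauses)."""
        deltas = np.abs(np.diff(log_spectra, axis=0)).mean(axis=1)
        spans, start = [], None
        for i, d in enumerate(deltas):
            if d < delta_thresh:
                start = i if start is None else start
            else:
                if start is not None and i - start >= min_frames:
                    spans.append((start, i))
                start = None
        if start is not None and len(deltas) - start >= min_frames:
            spans.append((start, len(deltas)))
        return spans

    # Fake data: 30 near-constant frames embedded in fluctuating ones.
    rng = np.random.default_rng(0)
    spec = rng.normal(size=(80, 64))
    spec[25:55] = spec[25]
    print(detect_filled_pauses(spec))  # -> [(25, 54)]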