Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Norihide Kitaoka is active.

Publications


Featured research published by Norihide Kitaoka.


Speech Communication | 2007

Robust distant speaker recognition based on position-dependent CMN by combining speaker-specific GMM with speaker-adapted HMM

Longbiao Wang; Norihide Kitaoka; Seiichi Nakagawa

In this paper, we propose a robust speaker recognition method based on position-dependent cepstral mean normalization (CMN) to compensate for channel distortion that depends on the speaker's position. In the training stage, the system measures the transmission characteristics from several grid points in the room to the microphone and estimates the compensation parameters a priori. In the recognition stage, the system estimates the speaker's position, adopts the compensation parameters corresponding to that position, applies CMN to the speech, and performs speaker recognition. In a previous study, we proposed a text-independent speaker recognition method that combines speaker-specific Gaussian mixture models (GMMs) with syllable-based HMMs adapted to the speakers by MAP [Nakagawa, S., Zhang, W., Takahashi, M., 2004. Text-independent speaker recognition by combining speaker-specific GMM with speaker-adapted syllable-based HMM. Proc. ICASSP-2004 1, 81-84]. The robustness of this method to changes in speaking style in a close-talking environment was evaluated in (Nakagawa et al., 2004). In this paper, we extend the combination method to distant speaker recognition and integrate it with the proposed position-dependent CMN. Our experiments showed that the proposed method remarkably improved speaker recognition performance in a distant environment.
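
As a concrete illustration of the normalization step, here is a minimal Python sketch of position-dependent CMN as described above: the cepstral mean measured a priori at the grid point nearest the estimated speaker position is subtracted from the features. The function name, grid representation, and distance rule are illustrative assumptions, not details from the paper.

```python
import numpy as np

def position_dependent_cmn(cepstra, position, position_means):
    """Subtract the cepstral mean precomputed for the nearest grid position.

    cepstra:        (T, D) array of cepstral feature vectors
    position:       estimated (x, y) speaker position
    position_means: dict mapping grid point (x, y) -> (D,) cepstral mean
                    measured a priori in the training stage
    """
    # Pick the grid point closest to the estimated speaker position.
    nearest = min(position_means,
                  key=lambda p: np.hypot(p[0] - position[0], p[1] - position[1]))
    # Position-dependent CMN: remove the channel distortion associated
    # with that position by subtracting its stored cepstral mean.
    return cepstra - position_means[nearest]
```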


IEEE Transactions on Intelligent Transportation Systems | 2011

Analysis of Real-World Driver's Frustration

Lucas Malta; Chiyomi Miyajima; Norihide Kitaoka; Kazuya Takeda

This paper investigates a method for estimating a driver's spontaneous frustration in the real world. In line with a specific definition of emotion, the proposed method integrates information about the environment, the driver's emotional state, and the driver's responses in a single model. Driving data are recorded using an instrumented vehicle on which multiple sensors are mounted. While driving, drivers also interact with an automatic speech recognition (ASR) system to retrieve and play music. Using a Bayesian network, we combine knowledge of the driving environment assessed through data annotation, speech recognition errors, the driver's emotional state (frustration), and the driver's responses measured through facial expressions, physiological condition, and gas- and brake-pedal actuation. Experiments are performed with data from 20 drivers. We discuss the relevance of the proposed model and features for frustration estimation. When all of the available information is used, the overall estimation achieves a true positive rate of 80% and a false positive rate of 9% (i.e., the system correctly estimates 80% of the frustration and, when drivers are not frustrated, makes mistakes 9% of the time).
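
The abstract does not specify the structure or parameters of the Bayesian network, so the following is only a simplified stand-in: a naive Bayes combination of binary cues into a posterior probability of frustration. All cue names, priors, and likelihoods are hypothetical.

```python
import math

# Hypothetical binary cues and probabilities; the paper's actual network
# structure and parameters are not given in the abstract.
PRIOR_FRUSTRATED = 0.3
P_CUE_GIVEN_STATE = {            # P(cue present | state)
    "asr_error":  {"frustrated": 0.6, "neutral": 0.2},
    "frown":      {"frustrated": 0.5, "neutral": 0.1},
    "hard_brake": {"frustrated": 0.4, "neutral": 0.2},
}

def posterior_frustration(cues):
    """Naive-Bayes posterior P(frustrated | cues) for binary cues."""
    log_f = math.log(PRIOR_FRUSTRATED)
    log_n = math.log(1.0 - PRIOR_FRUSTRATED)
    for name, present in cues.items():
        p = P_CUE_GIVEN_STATE[name]
        log_f += math.log(p["frustrated"] if present else 1.0 - p["frustrated"])
        log_n += math.log(p["neutral"] if present else 1.0 - p["neutral"])
    return 1.0 / (1.0 + math.exp(log_n - log_f))

# e.g. posterior_frustration({"asr_error": True, "frown": True, "hard_brake": False})
```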


IEEE Automatic Speech Recognition and Understanding Workshop | 2007

Development of VAD evaluation framework CENSREC-1-C and investigation of relationship between VAD and speech recognition performance

Norihide Kitaoka; Kazumasa Yamamoto; Tomohiro Kusamizu; Seiichi Nakagawa; Takeshi Yamada; Satoru Tsuge; Chiyomi Miyajima; Takanobu Nishiura; Masato Nakayama; Yuki Denda; Masakiyo Fujimoto; Tetsuya Takiguchi; Satoshi Tamura; Shingo Kuroiwa; Kazuya Takeda; Satoshi Nakamura

Voice activity detection (VAD) plays an important role in speech processing, including speech recognition, speech enhancement, and speech coding in noisy environments. We developed an evaluation framework for VAD in such environments, called CENSREC-1-C (Corpus and Environment for Noisy Speech RECognition 1, Concatenated). This framework consists of noisy continuous digit utterances and evaluation tools for VAD results. By adopting two evaluation measures, one for frame-level detection performance and the other for utterance-level detection performance, we provide the evaluation results of a power-based VAD method as a baseline. When VAD is used in a speech recognizer, the detected speech segments are extended to avoid losing speech frames, and the pause segments are then absorbed by a pause model. We investigate the balance between explicit segmentation by VAD and implicit segmentation by a pause model using an experimental simulation of segment extension, and show that a small extension improves speech recognition.
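
A power-based VAD baseline with segment extension, of the kind evaluated here, can be sketched as follows; the threshold, frame sizes, and extension length are illustrative guesses, not the CENSREC-1-C settings.

```python
import numpy as np

def power_vad(signal, sr, frame_ms=25, hop_ms=10, threshold_db=-40, extend_frames=5):
    """Power-based VAD with segment extension.

    Frames whose log power is within `threshold_db` of the loudest frame
    are marked as speech; detected segments are then extended by
    `extend_frames` frames on each side, mimicking the segment extension
    discussed in the paper. Returns a boolean speech/non-speech mask.
    """
    signal = np.asarray(signal, dtype=float)
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = max(0, (len(signal) - frame) // hop + 1)
    power = np.array([np.mean(signal[i * hop:i * hop + frame] ** 2) + 1e-12
                      for i in range(n_frames)])
    log_power = 10.0 * np.log10(power)
    speech = log_power > (log_power.max() + threshold_db)
    # Extend each detected frame to avoid clipping speech onsets/offsets;
    # the recognizer's pause model absorbs the extra non-speech frames.
    extended = speech.copy()
    for i in np.flatnonzero(speech):
        lo, hi = max(0, i - extend_frames), min(n_frames, i + extend_frames + 1)
        extended[lo:hi] = True
    return extended
```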


EURASIP Journal on Advances in Signal Processing | 2006

Robust distant speech recognition by combining multiple microphone-array processing with position-dependent CMN

Longbiao Wang; Norihide Kitaoka; Seiichi Nakagawa

We propose robust distant speech recognition that combines multiple microphone-array processing with position-dependent cepstral mean normalization (CMN). In the recognition stage, the system estimates the speaker position and adopts compensation parameters, estimated a priori, corresponding to the estimated position. The system then applies CMN to the speech (i.e., position-dependent CMN) and performs speech recognition for each channel. The features obtained from the multiple channels are integrated with the following two types of processing. The first method uses the maximum vote or the maximum summed likelihood of the recognition results from multiple channels to obtain the final result; this is called multiple-decoder processing. The second method calculates the output probability of each input at the frame level, and a single decoder using these output probabilities performs speech recognition; this is called single-decoder processing, and it has a lower computational cost. We combine delay-and-sum beamforming with multiple-decoder or single-decoder processing, which we term multiple microphone-array processing. We conducted experiments on our proposed method using a limited-vocabulary (100-word) distant isolated word recognition task in a real environment. The proposed multiple microphone-array processing using multiple decoders with position-dependent CMN achieved a 3.2% improvement (50% relative error reduction) over delay-and-sum beamforming with conventional CMN (i.e., the conventional method). The multiple microphone-array processing using a single decoder needs about one-third the computational time of the multiple-decoder version without degrading speech recognition performance.
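
A minimal sketch of two building blocks named above, delay-and-sum beamforming and maximum-vote multiple-decoder integration, assuming per-channel integer sample delays and per-channel recognition hypotheses are already available; the likelihood-summation and single-decoder variants are omitted.

```python
import numpy as np
from collections import Counter

def delay_and_sum(channels, delays):
    """Delay-and-sum beamforming with integer sample delays.

    channels: list of 1-D arrays, one per microphone
    delays:   non-negative per-channel delays (in samples) that
              time-align each channel toward the estimated speaker
    """
    length = min(len(c) - d for c, d in zip(channels, delays))
    aligned = [c[d:d + length] for c, d in zip(channels, delays)]
    return np.mean(aligned, axis=0)

def maximum_vote(channel_hypotheses):
    """Multiple-decoder integration: majority vote over the per-channel
    recognition results."""
    return Counter(channel_hypotheses).most_common(1)[0][0]

# e.g. maximum_vote(["hello", "hello", "yellow"]) -> "hello"
```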


International Conference on Acoustics, Speech, and Signal Processing | 2012

Physical characteristics of vocal folds during speech under stress

Xiao Yao; Takatoshi Jitsuhiro; Chiyomi Miyajima; Norihide Kitaoka; Kazuya Takeda

We focus on variations in the glottal source of speech production, which is essential for understanding the generation of speech under psychological stress. In this paper, a two-mass vocal fold model is fitted to estimate the stiffness parameters of the vocal folds during speech, and the estimated parameters are then analyzed to classify recorded samples as neutral or stressed speech. Mechanisms of the vocal folds under stress are derived from the experimental results. We propose using a Muscle Tension Ratio (MTR) to identify speech under stress. Our results show that MTR is more effective than a conventional method of stress measurement.


International Conference on Acoustics, Speech, and Signal Processing | 2007

Robust Distant Speech Recognition by Combining Position-Dependent CMN with Conventional CMN

Longbiao Wang; Norihide Kitaoka; Seiichi Nakagawa

We previously proposed an environmentally robust speech recognition method based on position-dependent cepstral mean normalization (PDCMN) to compensate for channel distortion that depends on the speaker position. PDCMN can efficiently compensate for channel transmission characteristics, but it cannot normalize speaker variation because the position-dependent cepstral mean does not contain speaker characteristics. Conversely, conventional CMN can compensate for speaker variation, but it cannot achieve good recognition performance for short utterances. In this paper, we propose robust distant speech recognition that combines position-dependent CMN with conventional CMN to address these problems. The position-dependent cepstral mean is linearly combined with the conventional cepstral mean using the following two types of processing. The first method uses a fixed weighting coefficient over the whole test data to obtain the combined mean; this is called fixed-weight combinational CMN. The second method calculates the output probability of multiple features compensated by a variable weighting coefficient at each frame, and a single decoder using these output probabilities performs speech recognition; this is called variable-weight combinational CMN. We conducted experiments on our proposed method using a small-vocabulary (100-word) distant isolated word recognition task in a real environment. The proposed variable-weight combinational CMN achieved relative error reduction rates of 56.3% over conventional CMN and 22.2% over PDCMN.
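
The fixed-weight variant reduces to subtracting a weighted combination of the two cepstral means. A minimal sketch, with the weight as a free parameter that would be tuned on development data:

```python
import numpy as np

def combinational_cmn(cepstra, mu_position, mu_conventional, weight):
    """Fixed-weight combinational CMN.

    Linearly combines the position-dependent cepstral mean (capturing
    channel characteristics) with the conventional per-utterance mean
    (capturing speaker characteristics), then subtracts the result.

    cepstra:         (T, D) cepstral feature vectors
    mu_position:     (D,) position-dependent cepstral mean
    mu_conventional: (D,) conventional cepstral mean, e.g. cepstra.mean(axis=0)
    weight:          combination weight in [0, 1]
    """
    combined_mean = weight * mu_position + (1.0 - weight) * mu_conventional
    return cepstra - combined_mean
```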


EURASIP Journal on Audio, Speech, and Music Processing | 2014

Improvement of multimodal gesture and speech recognition performance using time intervals between gestures and accompanying speech

Madoka Miki; Norihide Kitaoka; Chiyomi Miyajima; Takanori Nishino; Kazuya Takeda

We propose an integrative method for recognizing gestures, such as pointing, together with the accompanying speech. Speech generated simultaneously with a gesture can assist in recognizing the gesture, and, since the two are complementary, gestures can likewise assist in recognizing speech. Our integrative recognition method uses a probability distribution over the time interval between the starting time of a gesture and that of the corresponding utterance. We evaluate the improvement achieved by the proposed integrative recognition method on a task involving the solution of a geometry problem.
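
A minimal sketch of the integration idea, assuming a Gaussian model of the gesture-to-utterance onset interval; the paper learns this distribution from data, and the parameter values here are hypothetical.

```python
import math

def joint_score(speech_logp, gesture_logp, t_speech, t_gesture,
                mu=0.3, sigma=0.5):
    """Combine per-hypothesis speech and gesture scores with a Gaussian
    log-density of the interval between gesture onset and utterance onset.

    mu and sigma (seconds) are hypothetical; the paper estimates the
    interval distribution from data.
    """
    dt = t_speech - t_gesture
    interval_logp = (-0.5 * ((dt - mu) / sigma) ** 2
                     - math.log(sigma * math.sqrt(2.0 * math.pi)))
    return speech_logp + gesture_logp + interval_logp

# The recognizer would pick the (speech, gesture) hypothesis pair that
# maximizes joint_score over all candidate pairs.
```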


International Conference on Acoustics, Speech, and Signal Processing | 2011

Driver risk evaluation based on acceleration, deceleration, and steering behavior

Chiyomi Miyajima; Hiroki Ukai; Atsumi Naito; Hideomi Amata; Norihide Kitaoka; Kazuya Takeda

We propose a driver risk evaluation method based on the analysis of driving data captured with drive recorders. To evaluate the acceleration behavior of each driver, we plot the maximum acceleration per minute against velocity on a two-dimensional plane and approximate the distribution by linear regression. We assume that the higher the y-intercept of the fitted line, the more quickly the driver accelerates from a stop, and the higher the x-intercept, the higher the driver's preferred speed of travel. To evaluate deceleration behavior, brake-pedal operation patterns are classified into four types based on how the brake is depressed and released, and deceleration risk levels are evaluated from these four braking-pattern categories. Steering behavior is evaluated from the relationship between the radius of road curvature and the road design speed as defined in the road construction ordinance. Some correlation is observed between our evaluation results and those manually scored by risk consultants.
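
The acceleration analysis amounts to a line fit and two intercepts. A minimal sketch, assuming the fitted slope is negative, as it would be for a driver whose maximum acceleration falls with speed:

```python
import numpy as np

def acceleration_profile(velocity, max_accel_per_min):
    """Fit max-acceleration-vs-velocity points with a line and return the
    y-intercept (proxy for how quickly the driver accelerates from a stop)
    and x-intercept (proxy for the driver's preferred travel speed).
    """
    slope, y_intercept = np.polyfit(velocity, max_accel_per_min, deg=1)
    x_intercept = -y_intercept / slope   # where the fitted line reaches zero acceleration
    return y_intercept, x_intercept
```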


IEEE Intelligent Vehicles Symposium | 2007

Generation of Pedal Operation Patterns of Individual Drivers in Car-Following for Personalized Cruise Control

Yoshihiro Nishiwaki; Chiyomi Miyajima; Norihide Kitaoka; Katsunobu Itou; Kazuya Takeda

This paper presents a method for generating car-following patterns for individual drivers. We assume that driving is a recursive process: a driver recognizes the road environment, such as velocity and following distance, and adjusts the gas and brake pedal positions; the vehicle status changes according to the driver's operation, and the road environment changes according to the vehicle status. The driving patterns of each driver are modeled with a Gaussian mixture model (GMM), trained as a joint probability distribution of following distance, velocity, pedal position signals, and their dynamics. Gas and brake pedal operation patterns are generated from the GMMs under a maximum likelihood criterion, so that the conditional probability is maximized for a given environment, i.e., following distance and velocity. Experimental results on a driving simulator show that car-following patterns generated from the GMMs of three different drivers maintain their individual driving characteristics.
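
A common simplification of this generation scheme is to take the conditional expectation of the pedal dimensions given the environment, mixing the per-component conditional means by their responsibilities. The sketch below assumes a scikit-learn GMM with full covariances trained on joint [environment | pedal] vectors; it omits the dynamic (delta) features used in the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def conditional_pedal(gmm: GaussianMixture, x, n_env):
    """Expected pedal-operation vector given an environment vector x.

    gmm:   GaussianMixture (covariance_type='full') trained on joint
           vectors whose first `n_env` dimensions are environment
           features (following distance, velocity, ...) and whose
           remaining dimensions are pedal signals.
    x:     (n_env,) observed environment vector
    """
    means, covs, w = gmm.means_, gmm.covariances_, gmm.weights_
    resp, cond_means = [], []
    for k in range(gmm.n_components):
        mu_x, mu_y = means[k][:n_env], means[k][n_env:]
        S_xx = covs[k][:n_env, :n_env]
        S_yx = covs[k][n_env:, :n_env]
        # Component responsibility given the observed environment.
        resp.append(w[k] * multivariate_normal.pdf(x, mu_x, S_xx))
        # Conditional mean of the pedal dimensions for this component.
        cond_means.append(mu_y + S_yx @ np.linalg.solve(S_xx, x - mu_x))
    resp = np.array(resp) / np.sum(resp)
    return np.sum(resp[:, None] * np.array(cond_means), axis=0)
```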


Asia Pacific Signal and Information Processing Association Annual Summit and Conference | 2015

Audio-visual speech recognition using deep bottleneck features and high-performance lipreading

Satoshi Tamura; Hiroshi Ninomiya; Norihide Kitaoka; Shin Osuga; Yurie Iribe; Kazuya Takeda; Satoru Hayamizu

This paper develops an audio-visual speech recognition (AVSR) method by (1) exploring high-performance visual features, (2) applying audio and visual deep bottleneck features to improve AVSR performance, and (3) investigating the effectiveness of voice activity detection (VAD) in the visual modality. In our approach, many kinds of visual features are incorporated and subsequently converted into bottleneck features using deep learning. Using the proposed features, we achieved 73.66% lipreading accuracy in a speaker-independent open condition and about 90% AVSR accuracy on average in noisy environments. In addition, extracting speech segments from the visual features yielded 77.80% lipreading accuracy. We find that VAD is useful in both the audio and visual modalities for better lipreading and AVSR.
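
Deep bottleneck features come from a classifier network with one deliberately narrow hidden layer; after training, that layer's activations replace the input features. A toy PyTorch sketch, with layer sizes and class counts that are illustrative only, not those of the paper:

```python
import torch
import torch.nn as nn

class BottleneckNet(nn.Module):
    """Toy deep bottleneck network: a frame classifier whose narrow
    hidden layer serves as the bottleneck feature extractor."""

    def __init__(self, in_dim=120, bottleneck=40, n_classes=500):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, bottleneck),          # the bottleneck layer
        )
        self.classifier = nn.Sequential(
            nn.ReLU(), nn.Linear(bottleneck, n_classes),
        )

    def forward(self, x):
        # Standard classification path used during training.
        return self.classifier(self.encoder(x))

    def bottleneck_features(self, x):
        # After training, the narrow layer's activations are extracted
        # and fed to the recognition back end as features.
        with torch.no_grad():
            return self.encoder(x)
```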

Collaboration


Dive into Norihide Kitaoka's collaborations.

Top Co-Authors

Kazumasa Yamamoto

Toyohashi University of Technology

Satoshi Nakamura

Nara Institute of Science and Technology
