Publication


Featured research published by Osamu Ichikawa.


IEEE Journal of Selected Topics in Signal Processing | 2010

Long-Term Spectro-Temporal and Static Harmonic Features for Voice Activity Detection

Takashi Fukuda; Osamu Ichikawa; Masafumi Nishimura

Accurate voice activity detection (VAD) is important for robust automatic speech recognition (ASR) systems. This paper proposes a statistical-model-based noise-robust VAD algorithm using long-term temporal information and harmonic-structure-based features in speech. Long-term temporal information has recently become an ASR focus, but has not yet been deeply investigated for VAD. In this paper, we first consider the temporal features in a cepstral domain calculated over the average phoneme duration. In contrast, the harmonic structures are well-known bearers of acoustic information in human voices, but that information is difficult to exploit statistically. This paper further describes a new method to exploit the harmonic structure information with statistical models, providing additional noise robustness. The proposed method including both the long-term temporal and the static harmonic features led to considerable improvements under low SNR conditions, with 77.7% error reduction on average as compared with the ETSI AFE-VAD in our VAD testing. In addition, the word error rate was reduced by 29.1% in a test that included a full ASR system.
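
As an illustration of the long-term temporal idea, the sketch below derives features from a cepstral context window whose length roughly matches an average phoneme duration; the window length, DCT compression, and frame shift are assumptions for illustration, not the paper's exact recipe.

```python
# Minimal sketch: summarizing each cepstral coefficient's trajectory over a
# context window of roughly one average phoneme duration. Window length and
# DCT compression are illustrative assumptions.
import numpy as np
from scipy.fftpack import dct

def long_term_features(cepstra, context=5, n_dct=3):
    """cepstra: (T, D) MFCC frames at a 10 ms shift.
    Returns (T, D * n_dct) features from a 2*context+1 frame window (~110 ms)."""
    T, D = cepstra.shape
    padded = np.pad(cepstra, ((context, context), (0, 0)), mode="edge")
    feats = np.empty((T, D * n_dct))
    for t in range(T):
        window = padded[t:t + 2 * context + 1]                    # (2*context+1, D)
        traj = dct(window, type=2, axis=0, norm="ortho")[:n_dct]  # (n_dct, D)
        feats[t] = traj.T.reshape(-1)
    return feats

# Example with random frames standing in for real MFCCs.
mfcc = np.random.randn(200, 13)
print(long_term_features(mfcc).shape)  # (200, 39)
```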


EURASIP Journal on Advances in Signal Processing | 2010

DOA estimation with local-peak-weighted CSP

Osamu Ichikawa; Takashi Fukuda; Masafumi Nishimura

This paper proposes a novel weighting algorithm for Cross-power Spectrum Phase (CSP) analysis to improve the accuracy of direction of arrival (DOA) estimation for beamforming in a noisy environment. Our sound source is a human speaker and the noise is broadband noise in an automobile. The harmonic structures in the human speech spectrum can be used for weighting the CSP analysis, because harmonic bins must contain more speech power than the others and thus give us more reliable information. However, most conventional methods leveraging harmonic structures require pitch estimation with voiced-unvoiced classification, which is not sufficiently accurate in noisy environments. In our new approach, the observed power spectrum is directly converted into weights for the CSP analysis by retaining only the local peaks considered to be harmonic structures. Our experiments showed that the proposed approach significantly reduced localization errors, with further improvements when combined with other weighting algorithms.
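
The core of the approach is a weighted CSP (phase-transform) correlation in which only local peaks of the observed power spectrum contribute. The sketch below illustrates this for a two-microphone frame; the specific peak-picking and normalization rules are simplified assumptions rather than the paper's exact weighting.

```python
# Minimal sketch of local-peak-weighted CSP time-delay estimation for a
# two-microphone frame. The peak-picking rule is a simple stand-in for the
# paper's weighting.
import numpy as np

def local_peak_weights(power_spec):
    """Keep only bins that are local maxima of the observed power spectrum."""
    w = np.zeros_like(power_spec)
    peaks = (power_spec[1:-1] > power_spec[:-2]) & (power_spec[1:-1] > power_spec[2:])
    w[1:-1][peaks] = power_spec[1:-1][peaks]
    return w / (w.sum() + 1e-12)

def weighted_csp_delay(x1, x2):
    """Estimate the delay of x2 relative to x1, in samples."""
    X1, X2 = np.fft.rfft(x1), np.fft.rfft(x2)
    cross = np.conj(X1) * X2
    phat = cross / (np.abs(cross) + 1e-12)       # CSP / phase transform
    w = local_peak_weights(np.abs(X1) ** 2)      # emphasize harmonic bins
    corr = np.fft.fftshift(np.fft.irfft(w * phat, n=len(x1)))
    return np.argmax(corr) - len(x1) // 2        # lag of the correlation peak

# Toy usage: the second channel is the first delayed by 3 samples.
x1 = np.random.randn(1024)
x2 = np.roll(x1, 3)
print(weighted_csp_delay(x1, x2))  # approximately 3
```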


International Conference on Acoustics, Speech, and Signal Processing | 2010

Improved voice activity detection using static harmonic features

Takashi Fukuda; Osamu Ichikawa; Masafumi Nishimura

Accurate voice activity detection (VAD) is important for robust automatic speech recognition (ASR) systems. We have proposed a statistical-model-based VAD using the long-term temporal information in speech, which shows good robustness against noise in an automobile environment. For further improvement, this paper describes a new method to exploit harmonic structure information with statistical models. In our approach, local peaks considered to be harmonic structures are extracted, without explicit pitch detection and voiced-unvoiced classification. The proposed method including both long-term temporal and static harmonic features led to considerable improvements under low SNR conditions in our VAD testing. In addition, the word error rate was reduced by 29.1% in a test that included a full ASR system.
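
A minimal sketch of extracting a static harmonic feature from local spectral peaks, without pitch detection or voiced-unvoiced classification, is shown below; the envelope smoothing and thresholds are illustrative assumptions, not the paper's feature definition.

```python
# Minimal sketch of a static harmonic feature per frame, without pitch
# detection: local peaks that stand out from a smoothed spectral envelope are
# treated as harmonic evidence. Window, smoother, and threshold are assumptions.
import numpy as np
from scipy.ndimage import median_filter

def harmonic_feature(frame, n_fft=512):
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n_fft)) ** 2
    envelope = median_filter(spec, size=9)             # coarse spectral envelope
    is_peak = np.zeros_like(spec, dtype=bool)
    is_peak[1:-1] = (spec[1:-1] > spec[:-2]) & (spec[1:-1] > spec[2:])
    harmonic = is_peak & (spec > 2.0 * envelope)        # peaks well above envelope
    # Fraction of the frame's energy carried by the retained local peaks.
    return spec[harmonic].sum() / (spec.sum() + 1e-10)

# A voiced-like (harmonic) frame versus a white-noise frame.
t = np.arange(400) / 16000.0
voiced = sum(np.sin(2 * np.pi * 150 * k * t) for k in range(1, 8))
noise = np.random.randn(400)
print(harmonic_feature(voiced), harmonic_feature(noise))
```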


IEEE Journal of Selected Topics in Signal Processing | 2010

Dynamic Features in the Linear-Logarithmic Hybrid Domain for Automatic Speech Recognition in a Reverberant Environment

Osamu Ichikawa; Takashi Fukuda; Masafumi Nishimura

Static and dynamic features using Mel frequency cepstral coefficients (MFCCs) are widely used in automatic speech recognition. Since the MFCCs are calculated from logarithmic spectra, the delta and delta-delta are considered to be difference operations in the logarithmic domain. In a reverberant environment, speech signals contain late reverberations whose power decays exponentially over a long period. This tends to cause the logarithmic delta to remain nearly constant for a long time. This paper considers new schemes for calculating delta and delta-delta features that quickly diminish in the reverberant segments. Experiments using the evaluation framework for reverberant environments (CENSREC-4) showed significant improvements by simply replacing the MFCC dynamic features with the proposed dynamic features.
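
The sketch below illustrates the underlying observation: a delta computed on log power holds a constant slope throughout an exponentially decaying reverberation tail, whereas a delta computed on a hybrid linear-logarithmic compression shrinks as the power decays. The mapping log(x + theta) is only an assumed stand-in for the paper's hybrid domain.

```python
# Log-domain deltas stay large through late reverberation, while a hybrid
# compression lets them die out. The mapping log(x + theta) is only an assumed
# stand-in for the paper's linear-logarithmic hybrid domain.
import numpy as np

def delta(seq, span=2):
    """Standard regression-based delta over +/- span frames."""
    num = sum(k * (np.roll(seq, -k) - np.roll(seq, k)) for k in range(1, span + 1))
    den = 2 * sum(k * k for k in range(1, span + 1))
    return num / den

frames = np.arange(60)
power = np.where(frames < 20, 1.0, np.exp(-0.3 * (frames - 20)))  # late-reverb tail

log_delta = delta(np.log(power + 1e-10))
hybrid_delta = delta(np.log(power + 0.05))   # theta = 0.05 is an assumed constant

# In the tail, the pure log delta stays near -0.3 frame after frame, while the
# hybrid delta shrinks toward zero as the power decays.
print(log_delta[40:45])
print(hybrid_delta[40:45])
```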


IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences | 2005

Simultaneous Adaptation of Echo Cancellation and Spectral Subtraction for In-Car Speech Recognition

Osamu Ichikawa; Masafumi Nishimura

Automatic speech recognition in cars now has practical uses in applications such as car navigation and hands-free telephone dialing. For noise robustness, current systems rely on the assumption that only stationary cruising noise is present. Therefore, the recognition rate is greatly reduced when music or news is playing from a radio or CD player in the car. Since reference signals are available from such in-vehicle units, there is great potential for echo cancellers to eliminate the echo component in the observed noisy signals. However, previous research reported that the performance of an echo canceller is degraded in very noisy conditions. This implies that it is desirable to combine the processes of echo cancellation and noise reduction. In this paper, we propose a system that uses echo cancellation and spectral subtraction simultaneously. A stationary noise component for spectral subtraction is estimated through the adaptation of the echo canceller. In our experiments, this system significantly reduced the errors in automatic speech recognition compared with the conventional combination of echo cancellation and spectral subtraction.
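
A simplified pipeline combining the two processes might look like the sketch below: an NLMS echo canceller removes the component predictable from the head-unit reference, and a stationary-noise estimate taken from the residual drives spectral subtraction. This is an illustrative arrangement, not the paper's exact simultaneous-adaptation scheme.

```python
# Illustrative pipeline: NLMS echo cancellation of the head-unit reference,
# followed by spectral subtraction with a stationary-noise estimate taken from
# the residual. Not the paper's exact simultaneous-adaptation scheme.
import numpy as np

def nlms(d, x, taps=64, mu=0.5, eps=1e-6):
    """Remove the component of microphone signal d predictable from reference x."""
    w = np.zeros(taps)
    e = np.zeros_like(d)
    for n in range(taps, len(d)):
        u = x[n - taps:n][::-1]
        e[n] = d[n] - w @ u
        w += mu * e[n] * u / (u @ u + eps)
    return e

def spectral_subtract(residual, frame=256, alpha=2.0, beta=0.01):
    enhanced = []
    noise = None
    for start in range(0, len(residual) - frame, frame):
        X = np.fft.rfft(residual[start:start + frame] * np.hanning(frame))
        p = np.abs(X) ** 2
        # Crude stationary-noise estimate: slow recursive average of residual
        # power (a real system would gate this with VAD or minimum statistics).
        noise = p if noise is None else 0.95 * noise + 0.05 * p
        clean = np.maximum(p - alpha * noise, beta * p)
        enhanced.append(np.sqrt(clean) * np.exp(1j * np.angle(X)))
    return enhanced  # enhanced frame spectra (overlap-add resynthesis omitted)

fs = 16000
ref = np.random.randn(fs)                          # music from the head unit
mic = np.convolve(ref, np.ones(8) / 8.0, mode="same") + 0.1 * np.random.randn(fs)
enhanced = spectral_subtract(nlms(mic, ref))
```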


International Conference on Acoustics, Speech, and Signal Processing | 2013

Channel-mapping for speech corpus recycling

Osamu Ichikawa; Steven J. Rennie; Takashi Fukuda; Masafumi Nishimura

The performance of automatic speech recognition (ASR) is heavily dependent on the acoustic environment in the target domain. Large investments have been made in recording speech data in specific environments. In contrast, recent Internet services using hand-held devices such as smartphones have created opportunities to acquire huge amounts of “live” speech data at low cost. There are practical demands to reuse this abundant data in different acoustic environments. To transform such source data for a target domain, developers can use channel mapping and noise addition. However, channel mapping of the data is difficult without stereo mapping data or impulse response data. We tested GMM-based channel mapping with a vector Taylor series (VTS) formulation on a per-utterance basis. We found this type of channel mapping effectively simulated our target domain data.
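
As a rough illustration of per-utterance channel mapping, the sketch below models the channel as a cepstral bias and estimates it by EM against a GMM trained on target-domain features; it omits the VTS treatment of additive noise used in the paper, and all sizes and data are toy assumptions.

```python
# Rough sketch: treat the channel as a cepstral bias and estimate it per
# utterance by EM against a GMM of target-domain features (sklearn GMM here).
# The VTS handling of additive noise from the paper is omitted; data are toys.
import numpy as np
from sklearn.mixture import GaussianMixture

def estimate_channel_bias(utterance, gmm, n_iter=5):
    """utterance: (T, D) source-domain cepstra; gmm: GMM fit on target cepstra."""
    bias = np.zeros(utterance.shape[1])
    for _ in range(n_iter):
        shifted = utterance + bias
        post = gmm.predict_proba(shifted)                       # (T, K)
        resid = gmm.means_[None, :, :] - shifted[:, None, :]    # (T, K, D)
        # Posterior-weighted update (exact M-step under equal covariances;
        # a full derivation would weight by inverse variances).
        bias += (post[:, :, None] * resid).sum(axis=(0, 1)) / len(utterance)
    return bias

# Toy domains: the source utterance differs from the target by a fixed channel.
target = np.random.randn(2000, 13)
source_utt = np.random.randn(300, 13) - 0.8      # channel offset of -0.8
gmm = GaussianMixture(n_components=8, covariance_type="diag").fit(target)
mapped_utt = source_utt + estimate_channel_bias(source_utt, gmm)
```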


International Conference on Acoustics, Speech, and Signal Processing | 2012

Model-based noise reduction leveraging frequency-wise confidence metric for in-car speech recognition

Osamu Ichikawa; Steven J. Rennie; Takashi Fukuda; Masafumi Nishimura

Model-based approaches for noise reduction effectively improve the performance of automatic speech recognition in noisy environments. Most of them use the Minimum Mean Square Error (MMSE) criterion for the de-noised speech estimates. In general, an observation has speech-dominant bands and noise-dominant bands in the Mel spectral domain. This paper introduces a method to add weight to speech-dominant bands when evaluating the posterior probability of each speech state, as these bands are generally more reliable. To leverage high-resolution information in the Mel domain, we use the Local Peak Weight (LPW) as the confidence metric for the degree of speech dominance. This information is also used to regulate the amount of compensation that is applied to each frequency band during feature reconstruction under an integrated probabilistic model. The method produced relative word error rate improvements of up to 33.8% over the baseline MMSE method on an isolated word task with car noise.
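
A minimal sketch of the weighting idea: per-band Gaussian log-likelihoods for each speech state are combined with a frequency-wise confidence vector before forming state posteriors. The confidence values below are placeholders standing in for the Local Peak Weight.

```python
# Sketch: per-band Gaussian log-likelihoods weighted by a frequency-wise
# confidence (a placeholder for the Local Peak Weight) before forming state
# posteriors over a noisy Mel-spectral observation.
import numpy as np

def weighted_state_posteriors(obs, means, variances, priors, confidence):
    """obs: (D,); means, variances: (S, D) per-state Gaussians; priors: (S,);
    confidence: (D,) in [0, 1], larger for speech-dominant bands."""
    ll = -0.5 * (np.log(2 * np.pi * variances) + (obs - means) ** 2 / variances)  # (S, D)
    score = ll @ confidence + np.log(priors)   # reliable bands count more
    score -= score.max()
    post = np.exp(score)
    return post / post.sum()

rng = np.random.default_rng(0)
S, D = 4, 24
means = rng.normal(size=(S, D))
variances = np.ones((S, D))
priors = np.full(S, 1.0 / S)
obs = means[2] + 0.1 * rng.normal(size=D)        # frame close to state 2
confidence = rng.uniform(0.2, 1.0, size=D)       # stand-in for per-band LPW
print(weighted_state_posteriors(obs, means, variances, priors, confidence))
```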


International Conference on Acoustics, Speech, and Signal Processing | 2017

Effective joint training of denoising feature space transforms and Neural Network based acoustic models

Takashi Fukuda; Osamu Ichikawa; Gakuto Kurata; Ryuki Tachibana; Samuel Thomas; Bhuvana Ramabhadran

Neural Network (NN) based acoustic frontends, such as denoising autoencoders, are actively being investigated to improve the robustness of NN-based acoustic models to various noise conditions. In recent work, the joint training of such frontends with backend NNs has been shown to significantly improve speech recognition performance. In this paper, we propose an effective algorithm to jointly train such a denoising feature space transform and an NN-based acoustic model with various kinds of data. Our proposed method first pretrains a Convolutional Neural Network (CNN) based denoising frontend and then jointly trains this frontend with an NN backend acoustic model. In the unsupervised pretraining stage, the frontend is designed to estimate clean log Mel-filterbank features from noisy log-power spectral input features. A subsequent multi-stage training of the proposed frontend, with the dropout technique applied only at the joint layer between the frontend and backend NNs, leads to significant improvements in the overall performance. On the Aurora-4 task, our proposed system achieves an average WER of 9.98%, a 9.0% relative improvement over the performance of one of the best reported speaker-independent baseline systems. A final semi-supervised adaptation of the frontend NN, similar to feature space adaptation, reduces the average WER to 7.39%, a further relative WER improvement of 25%.
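
A minimal PyTorch sketch of the joint architecture is shown below: a convolutional denoising frontend feeds a DNN backend, with dropout applied only at the layer joining the two, and gradients flowing through both networks during training. Feature sizes, layer widths, and the omitted pretraining schedule are assumptions.

```python
# Minimal PyTorch sketch of joint training: a convolutional denoising frontend
# feeding a DNN acoustic model, with dropout applied only at the joint layer.
# Sizes are illustrative; the paper's pretraining and multi-stage schedule are omitted.
import torch
import torch.nn as nn

class DenoisingFrontend(nn.Module):
    """Noisy log-power spectra (1 x 11 frames x 257 bins) -> 40 log-Mel estimates."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(3, 5), padding=(1, 2)), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=(3, 5), padding=(1, 2)), nn.ReLU(),
        )
        self.proj = nn.Linear(32 * 11 * 257, 40)

    def forward(self, x):                 # x: (B, 1, 11, 257)
        return self.proj(self.conv(x).flatten(1))   # (B, 40) estimated clean log-Mel

class JointModel(nn.Module):
    def __init__(self, n_states=2000):
        super().__init__()
        self.frontend = DenoisingFrontend()
        self.joint_dropout = nn.Dropout(0.2)   # dropout only at the joint layer
        self.backend = nn.Sequential(
            nn.Linear(40, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, n_states),
        )

    def forward(self, x):
        return self.backend(self.joint_dropout(self.frontend(x)))

model = JointModel()
x = torch.randn(8, 1, 11, 257)            # a batch of noisy input windows
targets = torch.randint(0, 2000, (8,))    # senone labels
loss = nn.CrossEntropyLoss()(model(x), targets)
loss.backward()                            # gradients flow through both networks
```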


International Conference on Acoustics, Speech, and Signal Processing | 2016

Convolutional neural network pre-trained with projection matrices on linear discriminant analysis

Takashi Fukuda; Osamu Ichikawa; Ryuki Tachibana

Recently, the hybrid architecture of a neural network (NN) and a hidden Markov model (HMM) has shown significant improvement in automatic speech recognition (ASR) over the conventional Gaussian mixture model (GMM)-based system. The convolutional neural network (CNN), a successful NN-based system, can represent local spectral variations spanning the time-frequency space. Meanwhile, spectro-temporal features have been widely studied to make ASR more robust. Typically, the spectro-temporal features are extracted from acoustic spectral patterns using a 2D filtering process. Convolutional layers in a CNN that have various local windows can also be regarded as an efficient feature extractor to capture 2D spectral variations. In a standard procedure, the local windows in a CNN are initialized randomly before pre-training and are iteratively updated with a back-propagation algorithm in the pre-training and fine-tuning steps. In this paper, we explore using projection matrices composed of eigenvectors estimated by the linear discriminant analysis (LDA) objective function as initial weights for the first convolutional layer in a CNN. From analysis of the local windows trained by the proposed method, we can see that the eigenvectors of LDA have desirable properties as initial weights for a CNN. The proposed method yielded an 8.1% relative improvement compared to a CNN with randomly initialized local weights.
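
The sketch below illustrates the initialization idea: estimate within- and between-class scatter on flattened time-frequency patches, solve the generalized LDA eigenproblem, and copy the leading eigenvectors into the first convolutional layer's weights. Patch size, class labels, and the regularization term are illustrative assumptions.

```python
# Hedged sketch of initializing the first convolutional layer with LDA
# projection matrices: solve the generalized eigenproblem on patch scatter
# matrices and reshape the leading eigenvectors into filter weights.
import numpy as np
from scipy.linalg import eigh
import torch
import torch.nn as nn

def lda_filters(patches, labels, n_filters, eps=1e-3):
    """patches: (N, h*w) flattened local windows; labels: (N,) class ids."""
    dim = patches.shape[1]
    overall = patches.mean(axis=0)
    sw = np.zeros((dim, dim))
    sb = np.zeros((dim, dim))
    for c in np.unique(labels):
        xc = patches[labels == c]
        mc = xc.mean(axis=0)
        sw += (xc - mc).T @ (xc - mc)
        sb += len(xc) * np.outer(mc - overall, mc - overall)
    # Generalized symmetric eigenproblem Sb v = lambda Sw v (ridge on Sw).
    vals, vecs = eigh(sb, sw + eps * np.eye(dim))
    order = np.argsort(vals)[::-1][:n_filters]
    return vecs[:, order].T                    # (n_filters, h*w)

# Toy patches: 9x9 time-frequency windows with integer phone-class labels.
h = w = 9
patches = np.random.randn(5000, h * w)
labels = np.random.randint(0, 10, size=5000)

conv = nn.Conv2d(1, 16, kernel_size=(h, w))
init = lda_filters(patches, labels, n_filters=16).reshape(16, 1, h, w)
with torch.no_grad():
    conv.weight.copy_(torch.from_numpy(init).float())  # LDA eigenvectors as initial weights
```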


Speech Communication | 2018

Detecting breathing sounds in realistic Japanese telephone conversations and its application to automatic speech recognition

Takashi Fukuda; Osamu Ichikawa; Masafumi Nishimura

Non-verbal sound detection has long attracted attention in the speech analytics field. Although detecting laughter, coughs, and lip smacking has been well studied in the literature, breath-event detection has not been investigated much despite the need for it. Breath events are highly correlated with major prosodic breaks, meaning that the positions of breath events can be used as a delimiter of utterances in combination with a voice activity detection (VAD) technique. Silence intervals approximately 20 ms long right before and after breathing sounds, called “edges”, are clearly observed in speech signals. In the literature, capturing these edges has been shown to be very effective in reducing false alarms in the detection of breath events. However, the edges often disappear when breaths are taken in spontaneous speech. In this work, we focus on the robustness of breath-event detection in spontaneous speech. The breath detection method we have developed leverages acoustic information that is specialized for breathing sounds, leading to a two-step approach that can detect breath events with an accuracy of 97.4%. We also propose splitting unsegmented speech signals into semantically grouped utterances by leveraging the breath events. The speech segmentation based on accurate breath-event detection provided a 3.8% relative error reduction in automatic speech recognition (ASR).
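
As a small illustration of the “edge” cue discussed above, the sketch below checks whether roughly 20 ms of near-silence surrounds a candidate breath segment; the thresholds and window sizes are assumptions, and the paper's detector relies on richer breath-specific acoustic features and a two-step classifier.

```python
# Simple illustration of the "edge" cue: roughly 20 ms of near-silence right
# before and after a candidate breath segment. Thresholds and window sizes
# are assumptions, not the paper's detector.
import numpy as np

def has_silence_edges(signal, start, end, fs=8000, edge_ms=20, rel_thresh=0.1):
    """Return True if low-energy edges surround signal[start:end]."""
    edge = int(fs * edge_ms / 1000)
    if start - edge < 0 or end + edge > len(signal):
        return False
    seg_rms = np.sqrt(np.mean(signal[start:end] ** 2) + 1e-12)
    pre_rms = np.sqrt(np.mean(signal[start - edge:start] ** 2) + 1e-12)
    post_rms = np.sqrt(np.mean(signal[end:end + edge] ** 2) + 1e-12)
    return pre_rms < rel_thresh * seg_rms and post_rms < rel_thresh * seg_rms

# Toy example: a noise burst (breath-like) flanked by silence.
fs = 8000
sig = np.zeros(fs)
sig[3000:4000] = 0.3 * np.random.randn(1000)
print(has_silence_edges(sig, 3000, 4000, fs))   # True
```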
