Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Masakiyo Fujimoto is active.

Publication


Featured research published by Masakiyo Fujimoto.


IEICE Transactions on Information and Systems | 2008

Noise Robust Voice Activity Detection Based on Switching Kalman Filter

Masakiyo Fujimoto; Kentaro Ishizuka

This paper addresses the problem of voice activity detection (VAD) in noisy environments. The VAD method proposed in this paper is based on a statistical model approach, and estimates statistical models sequentially without a priori knowledge of noise. Namely, the proposed method constructs a clean speech / silence state transition model beforehand, and sequentially adapts the model to the noisy environment by using a switching Kalman filter when a signal is observed. In this paper, we carried out two evaluations. In the first, we observed that the proposed method significantly outperforms conventional methods as regards voice activity detection accuracy in simulated noise environments. Second, we evaluated the proposed method on a VAD evaluation framework, CENSREC-1-C. The evaluation results revealed that the proposed method significantly outperforms the baseline results of CENSREC-1-C as regards VAD accuracy in real environments. In addition, we confirmed that the proposed method helps to improve the accuracy of concatenated speech recognition in real environments.
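The statistical-model decision at the heart of such VAD methods can be illustrated as a two-state (silence/speech) forward pass in which per-frame likelihood ratios are smoothed by a state-transition model. The sketch below is a generic illustration, not the paper's adapted switching Kalman filter; the self-transition probability `p_stay` and the toy log-likelihood ratios are hypothetical inputs:

```python
import math

def hmm_vad(frame_llrs, p_stay=0.9):
    """Two-state forward-probability VAD: state 0 = silence, state 1 = speech.

    frame_llrs: per-frame log-likelihood ratios, log p(x|speech) - log p(x|silence).
    p_stay: illustrative self-transition probability of the state model.
    Returns one boolean speech decision per frame.
    """
    alpha = [0.5, 0.5]                      # normalized forward probabilities
    decisions = []
    trans = [[p_stay, 1.0 - p_stay], [1.0 - p_stay, p_stay]]
    for llr in frame_llrs:
        lik = [1.0, math.exp(llr)]          # relative frame likelihoods
        new = [sum(alpha[s0] * trans[s0][s1] for s0 in range(2)) * lik[s1]
               for s1 in range(2)]
        z = sum(new)
        alpha = [a / z for a in new]        # renormalize
        decisions.append(alpha[1] > 0.5)
    return decisions

# a single noisy dip between confident speech frames is smoothed over
smoothed = hmm_vad([3.0, 3.0, -0.5, 3.0, 3.0])
```

The transition model is what gives the temporal smoothing: a lone low-likelihood frame inside a speech run is still labeled speech.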


IEEE Transactions on Audio, Speech, and Language Processing | 2012

Low-Latency Real-Time Meeting Recognition and Understanding Using Distant Microphones and Omni-Directional Camera

Takaaki Hori; Shoko Araki; Takuya Yoshioka; Masakiyo Fujimoto; Shinji Watanabe; Takanobu Oba; Atsunori Ogawa; Kazuhiro Otsuka; Dan Mikami; Keisuke Kinoshita; Tomohiro Nakatani; Atsushi Nakamura; Junji Yamato

This paper presents our real-time meeting analyzer for monitoring conversations in an ongoing group meeting. The goal of the system is to recognize automatically “who is speaking what” in an online manner for meeting assistance. Our system continuously captures the utterances and face poses of each speaker using a microphone array and an omni-directional camera positioned at the center of the meeting table. Through a series of advanced audio processing operations, an overlapping speech signal is enhanced and the components are separated into individual speaker channels. Then the utterances are sequentially transcribed by our speech recognizer with low latency. In parallel with speech recognition, the activity of each participant (e.g., speaking, laughing, watching someone) and the circumstances of the meeting (e.g., topic, activeness, casualness) are detected and displayed in a browser together with the transcripts. In this paper, we describe our techniques and our attempt to achieve the low-latency monitoring of meetings, and we show our experimental results for real-time meeting transcription.


international conference on acoustics, speech, and signal processing | 2008

A voice activity detection based on the adaptive integration of multiple speech features and a signal decision scheme

Masakiyo Fujimoto; Kentaro Ishizuka; Tomohiro Nakatani

This paper addresses the problem of voice activity detection (VAD) in noisy environments. The VAD method proposed in this paper integrates multiple speech features and a signal decision scheme, namely the speech periodic-to-aperiodic component ratio and a switching Kalman filter. The integration is carried out by using the weighted sum of the likelihoods output by each VAD stream. The stream weights are adapted for each short-time frame. The evaluation is carried out by using a VAD evaluation framework, CENSREC-1-C. The evaluation results revealed that the proposed method significantly outperforms the baseline results of CENSREC-1-C as regards VAD accuracy in real environments. In addition, we carried out speech recognition evaluations by using the detected speech signals, and confirmed that the proposed method contributes to an improvement in speech recognition accuracy.
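The weighted-sum integration of stream likelihoods can be sketched as follows. The toy log-likelihood ratios and per-frame weights below are made up for illustration; in the paper, the weights are adapted frame by frame rather than supplied by hand:

```python
import numpy as np

def fuse_vad_streams(log_likelihood_ratios, weights):
    """Fuse per-stream VAD log-likelihood ratios with a weighted sum.

    log_likelihood_ratios: shape (n_streams, n_frames); each entry is
        log p(x|speech) - log p(x|nonspeech) for one stream (feature).
    weights: per-frame stream weights, shape (n_streams, n_frames),
        with each column summing to 1.
    Returns a boolean speech/non-speech decision per frame.
    """
    llr = np.asarray(log_likelihood_ratios, dtype=float)
    w = np.asarray(weights, dtype=float)
    fused = (w * llr).sum(axis=0)           # weighted combination per frame
    return fused > 0.0                      # positive fused LLR -> speech

# toy example: two streams, three frames; frame 3 is decided by the weighting
llr = [[2.0, -1.0, 0.5],
       [1.0, -0.5, -2.0]]
w = [[0.5, 0.5, 0.7],
     [0.5, 0.5, 0.3]]
decisions = fuse_vad_streams(llr, w)
```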


Speech Communication | 2010

Noise robust voice activity detection based on periodic to aperiodic component ratio

Kentaro Ishizuka; Tomohiro Nakatani; Masakiyo Fujimoto; Noboru Miyazaki

This paper proposes a noise robust voice activity detection (VAD) technique called PARADE (PAR based Activity DEtection) that employs the periodic component to aperiodic component ratio (PAR). Conventional noise robust features for VAD are still sensitive to non-stationary noise, which yields variations in the signal-to-noise ratio, and sometimes require a priori noise power estimation, although the characteristics of environmental noise change dynamically in the real world. To overcome this problem, we adopt the PAR, which is insensitive to both stationary and non-stationary noise, as an acoustic feature for VAD. By considering both periodic and aperiodic components simultaneously in the PAR, we can mitigate the effect of the non-stationarity of noise. PARADE first estimates the fundamental frequencies of the dominant periodic components of the observed signals, decomposes the power of the observed signals into the powers of its periodic and aperiodic components by taking account of the power of the aperiodic components at the frequencies where the periodic components exist, and calculates the PAR based on the decomposed powers. Then it detects the presence of target speech signals by estimating the voice activity likelihood defined in relation to the PAR. Comparisons of the VAD performance for noisy speech data confirmed that PARADE outperforms the conventional VAD algorithms even in the presence of non-stationary noise. In addition, PARADE is applied to a front-end processing technique for automatic speech recognition (ASR) that employs a robust feature extraction method called SPADE (Subband based Periodicity and Aperiodicity DEcomposition).
Comparisons of the ASR performance for noisy speech show that the SPADE front-end combined with PARADE achieves significantly higher word accuracies than those achieved by MFCC (Mel-frequency Cepstral Coefficient) based feature extraction, which is widely used for conventional ASR systems, the SPADE front-end without PARADE, and other standard noise robust front-end processing techniques (ETSI ES 202 050 and ETSI ES 202 212). This result confirmed that PARADE can improve the performance of front-end processing for ASR.
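A minimal sketch of a periodic-to-aperiodic power ratio is given below, assuming a known F0 bin and a crude harmonic-neighborhood decomposition. This is only an illustration of the idea: PARADE's actual decomposition also accounts for the aperiodic power located at the harmonic frequencies themselves.

```python
import numpy as np

def periodic_aperiodic_ratio(power_spectrum, f0_bin, tol=1):
    """Toy periodic-to-aperiodic power ratio (PAR) for one analysis frame.

    power_spectrum: per-bin power of the frame.
    f0_bin: FFT bin index of the estimated fundamental frequency.
    Bins within `tol` bins of a harmonic of f0_bin count as periodic
    power; all remaining bins count as aperiodic power.
    """
    ps = np.asarray(power_spectrum, dtype=float)
    bins = np.arange(len(ps))
    is_periodic = np.zeros(len(ps), dtype=bool)
    for h in range(f0_bin, len(ps), f0_bin):    # harmonics of f0_bin
        is_periodic |= np.abs(bins - h) <= tol
    p_periodic = ps[is_periodic].sum()
    p_aperiodic = ps[~is_periodic].sum()
    return p_periodic / max(p_aperiodic, 1e-12)

# a harmonic (voiced-like) spectrum scores well above a flat (noise-like) one
harm = [1.0] * 16
for k in (4, 8, 12):
    harm[k] = 10.0
flat = [1.0] * 16
voiced_par = periodic_aperiodic_ratio(harm, f0_bin=4)
noise_par = periodic_aperiodic_ratio(flat, f0_bin=4)
```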


ieee automatic speech recognition and understanding workshop | 2007

Development of VAD evaluation framework CENSREC-1-C and investigation of relationship between VAD and speech recognition performance

Norihide Kitaoka; Kazumasa Yamamoto; Tomohiro Kusamizu; Seiichi Nakagawa; Takeshi Yamada; Satoru Tsuge; Chiyomi Miyajima; Takanobu Nishiura; Masato Nakayama; Yuki Denda; Masakiyo Fujimoto; Tetsuya Takiguchi; Satoshi Tamura; Shingo Kuroiwa; Kazuya Takeda; Satoshi Nakamura

Voice activity detection (VAD) plays an important role in speech processing, including speech recognition, speech enhancement, and speech coding in noisy environments. We developed an evaluation framework for VAD in such environments, called corpus and environment for noisy speech recognition 1 concatenated (CENSREC-1-C). This framework consists of noisy continuous digit utterances and evaluation tools for VAD results. By adopting two evaluation measures, one for frame-level detection performance and the other for utterance-level detection performance, we provide the evaluation results of a power-based VAD method as a baseline. When using VAD in a speech recognizer, the detected speech segments are extended to avoid the loss of speech frames, and the pause segments are then absorbed by a pause model. We investigate the balance between explicit segmentation by VAD and implicit segmentation by a pause model using an experimental simulation of segment extension, and show that a small extension improves speech recognition.
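The frame-level side of such a two-level evaluation can be illustrated with two standard error rates, false rejection (speech frames missed) and false acceptance (non-speech frames accepted). This is a generic sketch, not the CENSREC-1-C toolkit itself:

```python
def vad_frame_scores(reference, hypothesis):
    """Frame-level VAD scoring.

    reference, hypothesis: equal-length sequences of 0/1 frame labels
        (1 = speech). Returns (false rejection rate, false acceptance rate).
    """
    assert len(reference) == len(hypothesis)
    n_speech = sum(1 for r in reference if r == 1)
    n_nonspeech = len(reference) - n_speech
    # speech frames the detector missed
    frr = sum(1 for r, h in zip(reference, hypothesis)
              if r == 1 and h == 0) / max(n_speech, 1)
    # non-speech frames the detector accepted as speech
    far = sum(1 for r, h in zip(reference, hypothesis)
              if r == 0 and h == 1) / max(n_nonspeech, 1)
    return frr, far

frr, far = vad_frame_scores([1, 1, 1, 0, 0], [1, 0, 1, 1, 0])
```

Utterance-level measures aggregate the same kinds of errors over whole detected segments rather than individual frames.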


international conference on acoustics, speech, and signal processing | 2000

Noisy speech recognition using noise reduction method based on Kalman filter

Masakiyo Fujimoto; Yasuo Ariki

In this paper, we propose a noise reduction method based on a Kalman filter for noisy speech recognition. The proposed method aims to achieve blind source segregation in real time. Since the Kalman filter requires a huge amount of computation, it had never been used for real-time processing. Our proposed method, using a fast Kalman filter, greatly reduces the computation and achieves processing in 1.5-2.0 times real time without losing accuracy. To evaluate the proposed method, we carried out experiments to extract the clean speech signal from noisy speech, and compared our method with the conventionally used spectral subtraction and parallel model combination in terms of word recognition accuracy. As a result, the proposed method obtained a word recognition rate equal or superior to that of parallel model combination.
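The core recursion behind any Kalman-filter-based enhancement is the predict/update step sketched below in scalar form. The parameters are illustrative (a: AR coefficient of the speech model, q: process noise variance, r: observation noise variance); the paper's fast Kalman filter operates on vector speech states:

```python
def kalman_step(x_prev, p_prev, y, a, q, r):
    """One scalar Kalman filter predict/update step.

    x_prev, p_prev: previous state estimate and its error variance.
    y: current noisy observation.
    a, q, r: state transition coefficient, process noise variance,
        and observation noise variance (illustrative model parameters).
    Returns the updated state estimate and error variance.
    """
    # predict from the state model
    x_pred = a * x_prev
    p_pred = a * a * p_prev + q
    # correct with the noisy observation
    k = p_pred / (p_pred + r)               # Kalman gain
    x_new = x_pred + k * (y - x_pred)
    p_new = (1.0 - k) * p_pred
    return x_new, p_new

# near-noiseless observation (tiny r): the estimate snaps to y
x, p = kalman_step(x_prev=0.0, p_prev=1.0, y=5.0, a=1.0, q=0.01, r=1e-9)
```

With a large observation noise variance r the gain shrinks toward zero and the filter trusts its prediction instead.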


IEEE Transactions on Audio, Speech, and Language Processing | 2013

Dominance Based Integration of Spatial and Spectral Features for Speech Enhancement

Tomohiro Nakatani; Shoko Araki; Takuya Yoshioka; Marc Delcroix; Masakiyo Fujimoto

This paper proposes a versatile technique for integrating two conventional speech enhancement approaches, a spatial clustering approach (SCA) and a factorial model approach (FMA), which are based on two different features of signals, namely spatial and spectral features, respectively. When used separately, the conventional approaches simply identify time-frequency (TF) bins that are dominated by interference for speech enhancement. Integrating the two approaches makes this identification more reliable, and allows us to estimate speech spectra more accurately even in highly non-stationary interference environments. This paper also proposes extensions of the FMA for further elaboration of the proposed technique, including one that uses spectral models based on mel-frequency cepstral coefficients and another that copes with mismatches, such as channel mismatches, between the captured signals and the spectral models. Experiments using simulated and real recordings show that the proposed technique can effectively improve audible speech quality and the automatic speech recognition score.
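The idea of combining two independent views of which TF bins are speech-dominated can be illustrated with soft dominance masks. The geometric-mean combination below is a simplified stand-in for the paper's probabilistic integration, shown only to make the mechanism concrete:

```python
import numpy as np

def combine_dominance_masks(spatial_mask, spectral_mask, mix_spectrogram):
    """Combine two soft speech-dominance masks and apply them to a mixture.

    spatial_mask, spectral_mask: per-TF-bin speech-presence probabilities
        in [0, 1], one derived from spatial cues, one from spectral cues.
    mix_spectrogram: magnitude spectrogram of the mixture.
    Returns the masked (enhanced) spectrogram.
    """
    joint = np.sqrt(np.clip(spatial_mask, 0.0, 1.0) *
                    np.clip(spectral_mask, 0.0, 1.0))  # geometric mean
    return joint * mix_spectrogram

# one frame, two bins: a bin either mask rejects is suppressed
spatial = np.array([[1.0, 0.0]])
spectral = np.array([[1.0, 1.0]])
mixture = np.array([[2.0, 3.0]])
enhanced = combine_dominance_masks(spatial, spectral, mixture)
```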


international conference on acoustics, speech, and signal processing | 2006

Sequential Non-Stationary Noise Tracking Using Particle Filtering with Switching Dynamical System

Masakiyo Fujimoto; Satoshi Nakamura

This paper addresses a speech recognition problem in non-stationary noise environments: the estimation of noise sequences. To solve this problem, we present a particle filter-based sequential noise estimation method for the front-end processing of speech recognition. In the proposed method, the particle filter is defined by a dynamical system based on Polyak averaging and feedback. We also introduce a switching dynamical system into the particle filter to cope with the state transition characteristics of non-stationary noise. In the evaluations, we observed that the proposed method improves the speech recognition accuracy achieved in non-stationary noise environments by a noise compensation method that assumes stationary noise.
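A generic predict/weight/resample particle-filter step for tracking a slowly varying noise parameter might look like the sketch below. Simple random-walk dynamics stand in for the paper's switching dynamical system with Polyak averaging and feedback, and the noise standard deviations are made-up values:

```python
import numpy as np

def particle_filter_step(particles, weights, observation, rng,
                         proc_std=0.1, obs_std=0.2):
    """One sequential estimation step for a scalar noise parameter.

    particles, weights: current particle values and normalized weights.
    observation: current noisy measurement of the tracked parameter.
    rng: a numpy random Generator.
    """
    # predict: propagate particles through random-walk dynamics
    particles = particles + rng.normal(0.0, proc_std, size=particles.shape)
    # weight: Gaussian observation likelihood around each particle
    w = weights * np.exp(-0.5 * ((observation - particles) / obs_std) ** 2)
    w = w / w.sum()
    # resample to avoid weight degeneracy
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))

rng = np.random.default_rng(0)
particles = rng.normal(0.0, 1.0, 500)       # initial guesses for the level
weights = np.full(500, 1.0 / 500)
for _ in range(20):                         # observe a constant level of 1.0
    particles, weights = particle_filter_step(particles, weights, 1.0, rng)
# the particle cloud now concentrates near the observed level
```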


international conference on acoustics, speech, and signal processing | 2015

Exploring multi-channel features for denoising-autoencoder-based speech enhancement

Shoko Araki; Tomoki Hayashi; Marc Delcroix; Masakiyo Fujimoto; Kazuya Takeda; Tomohiro Nakatani

This paper investigates a multi-channel denoising autoencoder (DAE)-based speech enhancement approach. In recent years, deep neural network (DNN)-based monaural speech enhancement and robust automatic speech recognition (ASR) approaches have attracted much attention due to their high performance. Although multi-channel speech enhancement usually outperforms single channel approaches, there has been little research on the use of multi-channel processing in the context of DAE. In this paper, we explore the use of several multi-channel features as DAE input to confirm whether multi-channel information can improve performance. Experimental results show that certain multi-channel features outperform both a monaural DAE and a conventional time-frequency-mask-based speech enhancement method.


2008 Hands-Free Speech Communication and Microphone Arrays | 2008

A DOA Based Speaker Diarization System for Real Meetings

Shoko Araki; Masakiyo Fujimoto; Kentaro Ishizuka; Hiroshi Sawada; Shoji Makino

This paper presents a speaker diarization system that estimates who spoke when in a meeting. Our proposed system is realized by using a noise robust voice activity detector (VAD), a direction of arrival (DOA) estimator, and a DOA classifier. Our previous system utilized the generalized cross correlation method with the phase transform (GCC-PHAT) approach for the DOA estimation. Because the GCC-PHAT can estimate just one DOA per frame, it was difficult to handle speaker overlaps. This paper tries to deal with this issue by employing a DOA at each time-frequency slot (TFDOA), and reports how it improves diarization performance for real meetings / conversations recorded in a room with a reverberation time of 350 ms.
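The classification of per-TF-slot DOA estimates into speakers can be sketched as a nearest-direction assignment with circular distance. The 20-degree threshold and the fixed speaker directions below are made-up illustrations; the paper's classifier is more elaborate:

```python
def assign_doa_to_speaker(doa_deg, speaker_doas_deg, max_sep=20.0):
    """Assign one TF-slot DOA estimate to the nearest known speaker direction.

    doa_deg: estimated direction of arrival in degrees.
    speaker_doas_deg: representative direction for each speaker.
    Returns the index of the nearest speaker within max_sep degrees,
    or None if no speaker direction is close enough.
    """
    best, best_dist = None, max_sep
    for i, s in enumerate(speaker_doas_deg):
        # circular (wrap-around) angular distance in degrees
        dist = abs((doa_deg - s + 180.0) % 360.0 - 180.0)
        if dist <= best_dist:
            best, best_dist = i, dist
    return best
```

Because each TF slot gets its own DOA, overlapping speakers can be assigned simultaneously, which a single per-frame DOA (as with GCC-PHAT) cannot do.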

Collaboration


Dive into Masakiyo Fujimoto's collaborations.

Top Co-Authors

Tomohiro Nakatani
Nippon Telegraph and Telephone

Shoko Araki
Nippon Telegraph and Telephone

Satoshi Nakamura
Nara Institute of Science and Technology

Kentaro Ishizuka
Nippon Telegraph and Telephone

Kazumasa Yamamoto
Toyohashi University of Technology