Zhipeng Zhang
Tokyo Institute of Technology
Publications
Featured research published by Zhipeng Zhang.
Speech Communication | 2004
Zhipeng Zhang; Sadaoki Furui
Abstract: This paper proposes a new method using piecewise-linear transformation for adapting phone HMMs to noisy speech. Various noises are clustered according to their spectral properties, and a noisy-speech HMM corresponding to each clustered noise and SNR condition is made. Based on the likelihood-maximization criterion, the HMM that best matches the input noisy speech is selected and further adapted using linear transformation. The proposed method is evaluated by its ability to recognize noisy broadcast-news speech. It is confirmed that the method is effective in recognizing both artificially noise-added speech and actual noisy speech under various noise conditions. The method minimizes mismatches between the noisy input speech and the HMMs, sentence by sentence, without requiring online noise spectrum/model estimation, and is therefore easily applicable to real-world conditions with frequently changing noise.
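The model-selection step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: each noisy-speech HMM is stood in for by a single diagonal Gaussian, and the model names and parameters are hypothetical.

```python
import numpy as np

def log_likelihood(frames, mean, var):
    """Total log-likelihood of the frames under a diagonal Gaussian
    (a stand-in here for a full noisy-speech HMM)."""
    return float(np.sum(-0.5 * (np.log(2 * np.pi * var) + (frames - mean) ** 2 / var)))

def select_best_model(frames, models):
    """Pick the model (noise cluster x SNR condition) whose likelihood
    of the input utterance is maximal."""
    return max(models, key=lambda name: log_likelihood(frames, *models[name]))

# hypothetical noise-cluster x SNR models: (mean, variance) per dimension
models = {
    "babble_10dB": (np.zeros(16), np.ones(16)),
    "car_20dB": (np.full(16, 3.0), np.ones(16)),
}
rng = np.random.default_rng(0)
frames = rng.normal(3.0, 1.0, size=(50, 16))  # utterance resembling "car_20dB"
print(select_best_model(frames, models))  # car_20dB
```

The selected model would then be further adapted with a linear transform before decoding, as the abstract describes.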
Communications of The ACM | 2000
Sadaoki Furui; Katsutoshi Ohtsuki; Zhipeng Zhang
Inspired by the activities within the DARPA research community, we have been developing a large-vocabulary continuous-speech recognition (LVCSR) system for Japanese broadcast-news transcription [4]. This is part of a joint research project with NHK broadcasting whose goal is the closed-captioning of TV programs. While some of the problems we have investigated are Japanese-specific, others are language-independent.

The broadcast-news manuscripts used for constructing our language models were taken from NHK news broadcasts between July 1992 and May 1996, and comprised roughly 500,000 sentences and 22 million words. To calculate word n-gram language models, we segmented the manuscripts into words with a morphological analyzer, since Japanese sentences are written without spaces between words. A word-frequency list was derived from the manuscripts, and the 20,000 most frequently used words were selected as the vocabulary. This 20,000-word vocabulary covered approximately 98% of the words in the manuscripts. We calculated bigrams and trigrams and estimated unseen n-grams using Katz's back-off smoothing method.

The feature vector consisted of 16 cepstral coefficients, normalized logarithmic power, and their delta features (derivatives), for a total of 34 parameters per vector. Cepstral coefficients were normalized by the cepstral mean subtraction (CMS) method.

The acoustic models were gender-dependent shared-state triphone hidden Markov models (HMMs) designed using tree-based clustering. They were trained on phonetically balanced sentences and dialogues read by 53 male and 56 female speakers. The total number of training utterances was 13,270 for males and 13,367 for females, and the total length of the training data was approximately 20 hours per gender.
The total number of HMM states was approximately 2,000 for each gender, and the number of Gaussian mixture components per state was four. News speech data, from TV broadcasts in July 1996, were divided into two parts, a clean part and a noisy part, and were separately evaluated. The clean part consisted of utterances with no background noise, and the noisy part consisted of utterances with background noise. The noisy part included spontaneous speech such as reports by correspondents. We extracted 50 male utterances and 50 female utterances for each part. Each set included utterances by five or six speakers. All utterances were manually segmented into sentences. Due to space limitations, we report only the results for the clean part here. Reading-dependent language modeling. …
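The vocabulary-selection step described above (a word-frequency list, the top 20,000 words, roughly 98% token coverage) can be sketched as follows; `build_vocab` and the toy token list are illustrative, not from the paper.

```python
from collections import Counter

def build_vocab(words, size):
    """Select the `size` most frequent words and report token coverage."""
    counts = Counter(words)
    vocab = [w for w, _ in counts.most_common(size)]
    covered = sum(counts[w] for w in vocab)
    return vocab, covered / len(words)

# toy corpus; the paper used ~22 million words and a 20,000-word vocabulary
tokens = "a b a c a b d a".split()
vocab, coverage = build_vocab(tokens, 2)
print(vocab, coverage)  # ['a', 'b'] 0.75
```

Words outside the selected vocabulary would be mapped to an unknown-word token before n-gram counting.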
international conference on acoustics, speech, and signal processing | 2000
Zhipeng Zhang; Sadaoki Furui; Katsutoshi Ohtsuki
To improve the performance of speech recognition systems when speakers change frequently and each utters a series of several sentences, a new unsupervised, online, and incremental speaker adaptation technique combined with automatic detection of speaker changes is proposed. Speaker changes are detected by comparing likelihoods computed with speaker-independent and speaker-adaptive Gaussian mixture models (GMMs). Both the phone HMMs and the GMMs are adapted by MLLR transformation. In a broadcast-news transcription task, this method reduces the word error rate by 10.0%. Compared with a conventional method that uses HMMs for speaker-change detection, the GMM-based method requires significantly less computation at the cost of only a slightly lower word recognition rate.
Speech Communication | 2002
Zhipeng Zhang; Sadaoki Furui; Katsutoshi Ohtsuki
Abstract: This paper describes a new unsupervised, on-line and incremental speaker adaptation technique that improves the performance of speech recognition systems when there are frequent changes in speaker identity and each speaker utters a series of several sentences. The speaker change is detected using speaker-independent (SI) and speaker-adaptive (SA) Gaussian mixture models (GMMs), and both phone hidden Markov model (HMM) and GMM are adapted by maximum likelihood linear regression (MLLR) transformation. Using this method, the word error rate of a broadcast news transcription task was reduced by 10.0% relative to the results using the SI models.
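The speaker-change decision rule can be sketched as follows. This is a simplified illustration rather than the paper's system: the SI and SA GMMs are stood in for by single diagonal Gaussians, and a change is declared whenever the speaker-independent model out-scores the currently adapted one on the new utterance.

```python
import numpy as np

def gaussian_loglik(frames, mean, var):
    """Total log-likelihood of frames under a diagonal Gaussian
    (a stand-in here for a full GMM)."""
    return float(np.sum(-0.5 * (np.log(2 * np.pi * var) + (frames - mean) ** 2 / var)))

def speaker_changed(frames, si_model, sa_model):
    """Declare a speaker change when the speaker-independent (SI) model
    out-scores the current speaker-adapted (SA) model."""
    return gaussian_loglik(frames, *si_model) > gaussian_loglik(frames, *sa_model)

# hypothetical models: SA is sharply tuned to the current speaker,
# SI is broad enough to cover any speaker
sa_model = (np.zeros(4), np.ones(4))
si_model = (np.zeros(4), np.full(4, 9.0))

same_speaker = np.zeros((20, 4))      # frames matching the adapted model
new_speaker = np.full((20, 4), 5.0)   # frames far from the adapted model
print(speaker_changed(same_speaker, si_model, sa_model))  # False
print(speaker_changed(new_speaker, si_model, sa_model))   # True
```

On a detected change, the system would re-initialize adaptation from the SI models; otherwise it would continue incremental MLLR updates for the current speaker.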
Archive | 2007
Zhipeng Zhang; Kei Kikuiri; Nobuhiko Naka; Tomoyuki Ohya
Obtaining a clean speech signal in noisy environments is a crucial issue for improving the performance of mobile phones. We propose supplementing the existing air-conductive microphone with a bone-conductive microphone for noise reduction, and applying an ICA (Independent Component Analysis)-based technique to the combined air- and bone-conductive microphone signals for speech enhancement. The signal from the bone-conductive microphone has a very high SNR, which supports the generation of a clean speech signal in combination with the normal microphone. We evaluate this method on a Japanese digit recognition task. The results confirm that the proposed method allows a mobile phone to obtain a clean speech signal even when the background noise is relatively high.
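The abstract does not spell out which ICA variant is used. Below is a generic minimal symmetric FastICA sketch for a two-channel mixture, an assumed stand-in rather than the authors' algorithm; the synthetic sources and the air/bone mixing matrix are hypothetical.

```python
import numpy as np

def fastica_2ch(X, iters=200, seed=0):
    """Minimal symmetric FastICA (tanh contrast) for a 2-channel mixture.
    X has shape (2, n_samples); returns the estimated source signals."""
    X = X - X.mean(axis=1, keepdims=True)
    # whiten: decorrelate the channels and scale to unit variance
    d, E = np.linalg.eigh(X @ X.T / X.shape[1])
    Z = (E @ np.diag(d ** -0.5) @ E.T) @ X
    W = np.random.default_rng(seed).normal(size=(2, 2))
    for _ in range(iters):
        G = np.tanh(W @ Z)
        W_new = (G @ Z.T) / Z.shape[1] - np.diag((1 - G ** 2).mean(axis=1)) @ W
        u, _, vt = np.linalg.svd(W_new)
        W = u @ vt  # symmetric decorrelation keeps the rows orthonormal
    return W @ Z

# synthetic demo: a sine (stand-in for speech) plus uniform noise,
# mixed through a hypothetical air/bone mixing matrix
n = 2000
t = np.linspace(0, 8, n)
S = np.vstack([np.sin(2 * np.pi * t),
               np.random.default_rng(1).uniform(-1, 1, n)])
A = np.array([[1.0, 0.5], [0.3, 1.0]])
recovered = fastica_2ch(A @ S)
```

As with any ICA method, the recovered components match the sources only up to sign and permutation; a real system would pick the speech component, e.g. by correlation with the high-SNR bone-conduction channel.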
conference of the international speech communication association | 1999
Katsutoshi Ohtsuki; Sadaoki Furui; Naoyuki Sakurai; Atsushi Iwasaki; Zhipeng Zhang
Archive | 1999
Katsutoshi Ohtsuki; Sadaoki Furui; Naoyuki Sakurai; Atsushi Iwasaki; Zhipeng Zhang
conference of the international speech communication association | 2003
Zhipeng Zhang; Kiyotaka Otsuji; Sadaoki Furui
conference of the international speech communication association | 2000
Zhipeng Zhang; Sadaoki Furui
Archive | 2005
Sadaoki Furui; Zhipeng Zhang; Tsutomu Horikoshi; Toshiaki Sugimura