Shuwu Zhang | Researchain

Publication

Featured research published by Shuwu Zhang.

international conference natural language processing | 2005

SVM-based audio scene classification

Hongchen Jiang; Junmei Bai; Shuwu Zhang; Bo Xu

Audio scene classification is very important in audio indexing, retrieval and video content analysis. In this paper we present our approach that uses a support vector machine (SVM) for audio scene classification, which classifies audio clips into one of five classes: pure speech, non-pure speech, music, environment sound, and silence. Among them, non-pure speech may further be divided into speech with music and speech with noise. We also describe two methods to select effective and robust audio feature sets. Based on these feature sets, we have evaluated and compared the performance of two kinds of classification frameworks on a testing database that is composed of about 4 hours of audio data. The experimental results have shown that the SVM-based method yields high accuracy with high processing speed.

international conference natural language processing | 2003

Chinese-English bilingual speech recognition

Shengmin Yu; Sheng Hu; Shuwu Zhang; Bo Xu

Two methods of constructing a Chinese-English bilingual phone inventory are proposed and investigated. Our research focuses on a robust, suitable and compact phone combination of the two utterly different languages. The first method is to combine Chinese phonemes and English phonemes together. It can provide the required consistency with the western languages. The second method is to combine Chinese INITIALs and FINALs (IFs) with English phonemes in the bilingual acoustic modeling. Experimental results show that the first method is more compact and flexible in acoustic modeling than the second method, but its performance decreases significantly, by about 1.9% and 3.8% in the Chinese and English tests respectively. In contrast, the second method achieves higher word accuracy than the first: its performance degrades by only 0.3% and 2.2% for the two languages, at the cost of more parameters in the acoustic model. Some issues of building this bilingual speech recognizer are also addressed.

international conference on biometrics | 2006

A comparative study of feature and score normalization for speaker verification

Rong Zheng; Shuwu Zhang; Bo Xu

In speaker verification, it is necessary to reduce the influence of different environmental conditions. In this paper, two stages of normalization techniques, feature normalization and score normalization, are examined for decreasing the mismatch between training and testing acoustic conditions. At the first stage, cepstral mean and variance normalization (CMVN) is modified to normalize the cepstral coefficients with similar segmental parameter statistics. Next, due to score variability between verification trials, Test-dependent zero-score normalization (TZnorm) and Zero-dependent test-score normalization (ZTnorm) are comparatively presented to transform the output scores entirely and make the speaker-independent decision threshold more robust under adverse conditions. Experiments on the NIST 2002 SRE corpus show that the normalizations with CMVN in the feature stage and ZTnorm in the score stage achieved a 20.3% relative reduction of EER and an 18.1% relative reduction of the minimal DCF compared to the baseline system using CMN and zero normalization.
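As an illustration of the feature-normalization stage, the following is a minimal sketch of plain per-utterance CMVN. It is not the modified, segment-based variant the abstract refers to, whose details are not given here. The function name `cmvn`, the `eps` guard, and the toy frame values are assumptions made for this example.

```python
import math

def cmvn(frames, eps=1e-8):
    """Plain per-utterance CMVN (assumed baseline, not the paper's modified version).

    frames: list of cepstral vectors, one per frame.
    Returns frames with each dimension shifted to zero mean and scaled to unit variance.
    """
    n = len(frames)
    dim = len(frames[0])
    # Per-dimension mean over all frames of the utterance.
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    # Per-dimension population standard deviation.
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dim)
    ]
    # eps avoids division by zero for constant dimensions.
    return [
        [(f[d] - means[d]) / (stds[d] + eps) for d in range(dim)]
        for f in frames
    ]

# Toy two-dimensional "cepstral" frames (made-up values).
frames = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = cmvn(frames)
```

After normalization, both dimensions have the same scale even though the raw second dimension is ten times larger, which is the property that reduces mismatch between recording conditions.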

international conference on acoustics, speech, and signal processing | 2003

A vector statistical piecewise polynomial approximation algorithm for environment compensation in telephone LVCSR

Zhaobing Han; Shuwu Zhang; Huayun Zhang; Bo Xu

A vector statistical piecewise polynomial (VPP) approximation algorithm is proposed for environment compensation in speech signals that are degraded by both additive and convolutive noise. By investigating the model of the telephone environment, we use a piecewise polynomial, namely two linear polynomials and a quadratic polynomial, to approximate the environment function precisely. The VPP can be applied to either stationary or non-stationary noise. In the first case, batch EM is used in the log-spectral domain; in the second case, recursive EM with iterative stochastic approximation is developed in the cepstral domain. Both approaches are based on the minimum mean squared error (MMSE) criterion. Experimental results are presented on the application of this approach in improving the performance of Mandarin large vocabulary continuous speech recognition (LVCSR) in background noise and different transmission channels (such as fixed telephone line and GSM). The method can reduce the average character error rate (CER) by about 18%.

international conference natural language processing | 2005

A robust unsupervised speaker clustering of speech utterances

Shilei Zhang; Shuwu Zhang; Bo Xu

This paper aims at developing and investigating an efficient, robust and unsupervised algorithm for speaker clustering. Each utterance is modeled as a single Gaussian distribution. A novel distance metric is proposed in this paper for the purpose of determining stopping criteria. The advantage of the proposed method is that it achieves comparable performance without requiring an adjusting threshold term. In this paper, we adopt the framework of agglomerative hierarchical clustering (AHC) with the merging criterion using Kullback-Leibler (KL) distance. The proposed stopping criterion can ensure the right number of speaker clusters. The efficiency of the proposed algorithm is demonstrated with various experiments on data from NIST and HUB5, respectively.
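The AHC framework described above can be sketched roughly as follows. This is an illustration under several assumptions: each utterance is a diagonal-covariance Gaussian stored as `(frame_count, means, variances)`, the merging criterion is the symmetric form of the KL divergence, and merging stops at a fixed `target_clusters` count. That fixed count is only a stand-in, since the abstract does not describe the paper's novel stopping criterion. All names and the toy data are hypothetical.

```python
def sym_kl(a, b):
    """Symmetric KL divergence between two diagonal Gaussians (count, means, variances)."""
    _, ma, va = a
    _, mb, vb = b
    return 0.5 * sum(
        x / y + y / x - 2.0 + (m - k) ** 2 * (1.0 / x + 1.0 / y)
        for m, k, x, y in zip(ma, mb, va, vb)
    )

def merge(a, b):
    """Pool two Gaussians by combining their first and second moments."""
    na, ma, va = a
    nb, mb, vb = b
    n = na + nb
    mean = [(na * m + nb * k) / n for m, k in zip(ma, mb)]
    var = [
        (na * (x + m * m) + nb * (y + k * k)) / n - mu * mu
        for m, k, x, y, mu in zip(ma, mb, va, vb, mean)
    ]
    return (n, mean, var)

def ahc(models, target_clusters):
    """Greedy bottom-up clustering; stops at a fixed cluster count (assumed stand-in)."""
    clusters = [(m, [i]) for i, m in enumerate(models)]
    while len(clusters) > target_clusters:
        # Find the closest pair of clusters under the symmetric KL divergence.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda p: sym_kl(clusters[p[0]][0], clusters[p[1]][0]),
        )
        merged = (merge(clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [sorted(members) for _, members in clusters]

# Four toy utterance models: two near the origin, two near (5, 5).
utterances = [
    (100, [0.0, 0.0], [1.0, 1.0]),
    (120, [0.1, -0.1], [1.1, 0.9]),
    (90, [5.0, 5.0], [1.0, 1.0]),
    (110, [5.2, 4.9], [0.9, 1.2]),
]
groups = ahc(utterances, target_clusters=2)
```

With these toy models, utterances 0 and 1 end up in one cluster and utterances 2 and 3 in the other. In practice the number of speakers is unknown, which is why a stopping criterion that needs no tuned threshold matters.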

international conference natural language processing | 2003

Multi-layer structure MLLR adaptation algorithm based on subspace regression classes

Xiangyu Mu; Shuwu Zhang; Bo Xu

Many adaptation algorithms were proposed in the last decade, most notably MAP estimation and MLLR transformation. When the amount of adaptation data is limited, adaptation can be done by grouping similar Gaussians together to form regression classes and then transforming the Gaussians in groups. We propose a rapid MLLR adaptation algorithm with a multi-layer structure, which is called SRCMLLR. The method groups the Gaussians at a finer acoustic subspace level, which is constructed in a target-driven manner. It generates the regression classes dynamically for each subspace, based on the outcome of the former MLLR transformation. Because of the new algorithm's special transformation structure and cluster space, there are fewer parameters to estimate for the subsequent MLLR transformation matrix, so the computational load of performing the transformation is much reduced. Experiments show that SRCMLLR is more effective than other methods when the adaptation data is scarce.

international conference natural language processing | 2005

Relative effectiveness of score normalization methods in speaker identification fusing acoustic and prosodic information

Rong Zheng; Shuwu Zhang; Bo Xu

In this paper, an investigation on the fusion of acoustic and prosodic information for GMM-UBM based text-independent speaker identification is presented. When acoustic and prosodic based systems are established, it is advantageous to normalize the dynamic ranges of the score dimensions, that is, likelihood scores from acoustic- and prosodic-based models of different quality. Score normalization methods, linear scaling to unit range and linear scaling to unit variance, are applied to transform the output scores using the background instances so as to obtain meaningful comparison between speaker models. In this fusion system based on a linear score weighting approach, the performance of speaker identification is further improved when incorporating prosodic-level information. Experimental results on part of the NIST 1999 SRE corpus are reported.
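The two score normalization methods and the linear score weighting can be sketched as below. This is a minimal illustration, assuming the usual definitions (min-max scaling for unit range, mean and standard deviation scaling for unit variance, both estimated from background scores) and a single fusion weight. The function names, the weight of 0.8, and all score values are invented for the example and do not come from the paper.

```python
import statistics

def scale_unit_range(scores, background):
    """Linear scaling to unit range, using min and max of the background scores."""
    lo, hi = min(background), max(background)
    return [(s - lo) / (hi - lo) for s in scores]

def scale_unit_variance(scores, background):
    """Linear scaling to zero mean and unit variance of the background scores."""
    mu = statistics.mean(background)
    sigma = statistics.pstdev(background)
    return [(s - mu) / sigma for s in scores]

def fuse(acoustic, prosodic, weight):
    """Linear score weighting: weight on the acoustic stream, the rest on the prosodic."""
    return [weight * a + (1.0 - weight) * p for a, p in zip(acoustic, prosodic)]

# Made-up scores for three candidate speakers from each subsystem,
# plus made-up background scores used to estimate the scaling parameters.
acoustic_scores = [-42.0, -44.0, -48.0]
acoustic_background = [-50.0, -45.0, -40.0]
prosodic_scores = [1.0, 3.0, 2.0]
prosodic_background = [0.0, 2.0, 4.0]

acoustic_norm = scale_unit_range(acoustic_scores, acoustic_background)
prosodic_norm = scale_unit_range(prosodic_scores, prosodic_background)
fused = fuse(acoustic_norm, prosodic_norm, weight=0.8)
best_speaker = max(range(len(fused)), key=fused.__getitem__)
```

Without normalization, the acoustic log-likelihoods (around -45) and the prosodic scores (around 2) live on different scales, so a weighted sum would be dominated by whichever stream has the larger dynamic range.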

international conference natural language processing | 2003

A new combined modeling of continuous speech recognition

Zhaobing Han; Lei Jia; Shuwu Zhang; Bo Xu

Robust estimation of a large number of parameters relative to the available training data is a crucial issue in triphone-based continuous speech recognition. To cope with the issue, two major context-clustering methods, agglomerative (AGG) and tree-based (TB), have been widely studied. In this paper, we analyze the two algorithms with respect to their advantages and disadvantages and introduce a novel combined method that takes advantage of each method to cluster and tie similar acoustic states for highly detailed acoustic models. In addition, we devise a two-level clustering approach for TB, which uses the tree-based state tying for rare acoustic phonetic events twice. For LVCSR, the experimental results showed that performance could be substantially improved by using the proposed combined method, compared with using the popular TB method alone.

conference of the international speech communication association | 2006