Publication


Featured research published by Jian-Lai Zhou.


International Conference on Acoustics, Speech, and Signal Processing | 2004

Segmental tonal modeling for phone set design in Mandarin LVCSR

Chao Huang; Yu Shi; Jian-Lai Zhou; Min Chu; Terry Wang; Eric Chang

Modeling units play a very important role in state-of-the-art speech recognition systems, and their design and selection directly impact the performance of the final recognition engine. As a tonal language, Mandarin requires special treatment of tone in its modeling units. In this paper, after fully investigating several dominant modeling strategies, we propose a new phone set design strategy for Mandarin, called segmental tonal modeling. Instead of modeling tone types directly, we realize them implicitly and jointly by two segments, both of which carry tonal information. Experiments based on both HTK and SAPI confirm that the method is very efficient: in addition to improving accuracy by 9-23%, it greatly reduces decoding time, by 30-45%. At a similar decoding speed, the new phone set configuration reduces the error rate by a relative 35%.
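
The paper's phone inventory is not listed here, but the core idea — realizing a tone jointly by two tone-carrying segments instead of a single tonal unit — can be sketched with a hypothetical mapping; unit names like "a3_l"/"a3_r" are invented for illustration and are not the paper's actual units:

# Minimal sketch of a segmental tonal phone mapping (hypothetical inventory).
# Instead of one unit per tonal final (e.g. "a3"), the final is split into
# two segments that jointly and implicitly encode the tone.

def segmental_tonal_units(initial: str, final: str, tone: int) -> list:
    """Map a Mandarin syllable to a segmental tonal unit sequence.

    The initial stays toneless; the final is realized as two
    tone-carrying half-segments, e.g. ("m", "a", 3) -> ["m", "a3_l", "a3_r"].
    """
    return [initial, f"{final}{tone}_l", f"{final}{tone}_r"]

if __name__ == "__main__":
    # "ma3" (horse): one toneless initial + two tonal segments
    print(segmental_tonal_units("m", "a", 3))   # ['m', 'a3_l', 'a3_r']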


International Conference on Acoustics, Speech, and Signal Processing | 2004

Tone articulation modeling for Mandarin spontaneous speech recognition

Jian-Lai Zhou; Ye Tian; Yu Shi; Chao Huang; Eric Chang

Tone modeling is an unavoidable problem in Mandarin speech recognition. In continuous speech, the pitch contour exhibits variable patterns and is strongly influenced by its tone context. Although several effective methods have been proposed to improve the accuracy of tonal syllable recognition in Mandarin continuous speech, many recognition errors are still caused by the poor tone discrimination capability of the acoustic model, and the situation is worse for spontaneous speech. In this paper, we report our work on tone articulation modeling. Tone-context-dependent models are used to model the unstable pitch patterns caused by co-articulation in continuous speech, and the corresponding acoustic features are investigated as well. Our methods are evaluated on two test sets: one of reading-style speech, the other of spontaneous speech. The experimental results show that on the casual-speech test set the proposed method is more effective than a tone-context-independent model, while the two are comparable on the reading-style test set. Several factors with the potential to improve the proposed method are discussed in the final part of the paper.
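
As a rough illustration of tone-context-dependent modeling (not the paper's exact model), the sketch below groups tone tokens by their left and right tone context, in the same spirit as triphone context labels; the notation and the tone-0 padding convention are assumptions:

from collections import defaultdict

# Sketch: index tone tokens by their left/right tone context, so that
# context-dependent tone models can be trained per tri-tone label.

def tone_context_label(left: int, tone: int, right: int) -> str:
    # Analogous to triphone notation: left-center+right
    return f"{left}-{tone}+{right}"

def group_by_context(tone_sequence):
    """Group positions of a tone sequence by tri-tone context.
    Utterance boundaries are padded with tone 0 (none/neutral)."""
    padded = [0] + list(tone_sequence) + [0]
    groups = defaultdict(list)
    for i in range(1, len(padded) - 1):
        key = tone_context_label(padded[i - 1], padded[i], padded[i + 1])
        groups[key].append(i - 1)
    return groups

if __name__ == "__main__":
    # A 3-3 sandhi context shows up as its own model class, e.g. '3-3+4'
    print(dict(group_by_context([3, 3, 4, 1])))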


International Conference on Acoustics, Speech, and Signal Processing | 2004

Refining segmental boundaries for TTS database using fine contextual-dependent boundary models

Lijuan Wang; Yong Zhao; Min Chu; Jian-Lai Zhou; Zhigang Cao

This paper proposes a post-refining method with fine context-dependent GMMs for the auto-segmentation task. A GMM trained on a super feature vector, extracted from multiple evenly spaced frames near the boundary, is used to describe the waveform evolution across that boundary. CART is used to cluster acoustically similar GMMs, so that the GMM for each leaf node can be reliably trained from the limited set of manually labeled boundaries. An accuracy of 90% is achieved when only 250 manually labeled sentences are provided to train the refining models.
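
A minimal sketch of the boundary-refinement idea follows, assuming a single diagonal Gaussian as a stand-in for the paper's GMM, a made-up frame spacing, and a simple ±5-frame search window:

import numpy as np

# Sketch of the "super feature vector" idea: stack several evenly spaced
# frames around a candidate boundary and score the stack with a Gaussian
# boundary model; the best-scoring shift becomes the refined boundary.

def super_vector(frames, boundary, n_ctx=2, step=2):
    """Concatenate frames at boundary + k*step for k in [-n_ctx, n_ctx]."""
    idx = [boundary + k * step for k in range(-n_ctx, n_ctx + 1)]
    return frames[idx].ravel()

def log_gauss(x, mean, var):
    """Diagonal-covariance Gaussian log-likelihood (stand-in for the GMM)."""
    return float(-0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var))

def refine_boundary(frames, rough, model, search=5):
    """Move the rough boundary to the best-scoring position within +/-search frames."""
    mean, var = model
    cands = range(rough - search, rough + search + 1)
    return max(cands, key=lambda b: log_gauss(super_vector(frames, b), mean, var))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(100, 13))   # 100 frames of 13-dim features
    dim = 5 * 13                          # 2*n_ctx+1 = 5 stacked frames
    model = (np.zeros(dim), np.ones(dim))
    print(refine_boundary(frames, rough=50, model=model))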


International Conference on Acoustics, Speech, and Signal Processing | 2004

Tone recognition with fractionized models and outlined features

Ye Tian; Jian-Lai Zhou; Min Chu; Eric Chang

Different feature extraction and tone modeling schemes are investigated on both speaker-dependent and speaker-independent continuous speech databases. Tone recognition features can be classified as detailed features, which use the entire F0 curve, and outlined features, which capture only the main structure of the F0 curve. Tone models of different sizes, ranging from very simple one-tone-one-model schemes to complex phoneme-dependent tone models, have different abilities to characterize tone. Our experiments support two conclusions. First, the detailed information of the F0 curve is not necessary for tone recognition: outlined features not only reduce the number of parameters but also improve tone recognition accuracy, and the proposed subsection average F0 and ΔF0 are shown to be effective outlined features. Second, the one-tone-one-model scheme is not sufficient; building phoneme-dependent tone models greatly improves recognition accuracy, especially on speaker-independent data. We therefore suggest using fractionized models, trained with the outlined features, for tone recognition.
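
The subsection-average features lend themselves to a short worked example; the choice of three subsections and the use of np.gradient for ΔF0 are assumptions made here for illustration:

import numpy as np

# Sketch of "outlined" tone features: split the F0 curve of a syllable into a
# few subsections and keep only each subsection's mean F0 and mean delta-F0,
# discarding the detailed frame-by-frame contour.

def outlined_features(f0, n_sub=3):
    d_f0 = np.gradient(f0)                       # frame-level delta-F0
    sub_f0 = [s.mean() for s in np.array_split(f0, n_sub)]
    sub_d = [s.mean() for s in np.array_split(d_f0, n_sub)]
    return np.array(sub_f0 + sub_d)              # 2 * n_sub values per syllable

if __name__ == "__main__":
    # A falling-then-rising contour, roughly like Mandarin tone 3
    f0 = np.concatenate([np.linspace(180, 140, 10), np.linspace(140, 200, 10)])
    print(outlined_features(f0).round(1))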


IEICE Transactions on Information and Systems | 2006

Context-Dependent Boundary Model for Refining Boundaries Segmentation of TTS Units

Lijuan Wang; Yong Zhao; Min Chu; Frank K. Soong; Jian-Lai Zhou; Zhigang Cao

For producing high quality synthesis, a concatenation-based Text-to-Speech (TTS) system usually requires a large number of segmental units to cover various acoustic-phonetic contexts. However, careful manual labeling and segmentation by human experts, which is still the most reliable way to prepare such units, is labor intensive. In this paper we adopt a two-step procedure to automate the labeling, segmentation and refinement process. In the first step, coarse segmentation of the speech data is performed by aligning speech signals with the corresponding sequence of Hidden Markov Models (HMMs). In the second step, segment boundaries are refined with the proposed Context-Dependent Boundary Model (CDBM). A Classification and Regression Tree (CART) is adopted to organize the available data into a structured hierarchical tree, where acoustically similar boundaries are clustered together to train tied CDBMs for boundary refinement. Optimal CDBM parameters and training conditions are found through a series of experimental studies. Compared with the manual segmentation reference, segmentation accuracy (within a tolerance of 20 ms) is improved by the CDBMs from 78.1% (baseline) to 94.8% for Mandarin Chinese and from 81.4% to 92.7% for English, with about 1,000 manually segmented sentences used to train the models. To further reduce the amount of manual data needed to train CDBMs for a new speaker, we adapt a well-trained CDBM via efficient adaptation algorithms. With only 10-20 manually segmented sentences as adaptation data, the adapted CDBM achieves a segmentation accuracy of 90%.
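
The adaptation algorithms themselves are not detailed in this abstract; as a generic stand-in, the sketch below shows a simple MAP update of a boundary model's Gaussian mean from a handful of adaptation samples, which captures the flavor of adapting with only 10-20 sentences:

import numpy as np

# Generic MAP adaptation sketch (not the paper's specific algorithm):
# shrink the new speaker's sample mean toward the well-trained prior mean.

def map_adapt_mean(prior_mean, data, tau=10.0):
    """MAP mean update: tau acts as a prior count, so with little
    adaptation data the estimate stays close to the prior model."""
    n = len(data)
    return (tau * prior_mean + data.sum(axis=0)) / (tau + n)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    prior = np.zeros(4)                                # well-trained model mean
    new_speaker = rng.normal(loc=0.5, size=(15, 4))    # ~15 adaptation samples
    print(map_adapt_mean(prior, new_speaker).round(2))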


International Symposium on Chinese Spoken Language Processing | 2006

Improved Mandarin speech recognition by lattice rescoring with enhanced tone models

Huanliang Wang; Yao Qian; Frank K. Soong; Jian-Lai Zhou; Jiqing Han

Tone plays an important lexical role in spoken tonal languages like Mandarin Chinese. In this paper we propose a two-pass search strategy for improving tonal syllable recognition performance. In the first pass, instantaneous F0 information is employed along with the corresponding cepstral information in two-stream HMM-based decoding. The F0 stream, which incorporates both discrete voiced/unvoiced information and the continuous F0 contour, is modeled with a multi-space distribution. With the first-pass decoding alone, we recently reported a 24% relative reduction of tonal syllable recognition errors on a Mandarin Chinese database [5]. In the second pass, F0 information over a longer, horizontal time span is used to build explicit tone models for rescoring the lattice generated in the first pass. Experimental results on the same Mandarin database show that an additional 8% relative error reduction in tonal syllable recognition is obtained by the second-pass search, i.e., lattice rescoring with enhanced tone models.
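
A minimal sketch of the two first-pass ingredients follows: a multi-space distribution for the F0 stream (a discrete voiced/unvoiced mass plus a continuous density for voiced frames) and a weighted two-stream frame score. The densities and stream weights are illustrative placeholders, not the paper's values:

import math

# Sketch of a two-stream frame score with an MSD-style F0 stream.

def msd_f0_loglik(f0, p_voiced, mean, var):
    """Multi-space distribution: unvoiced frames use only the discrete mass;
    voiced frames combine the voiced prior with a 1-D Gaussian over F0."""
    if f0 is None:                      # unvoiced frame: F0 is undefined
        return math.log(1.0 - p_voiced)
    g = -0.5 * (math.log(2 * math.pi * var) + (f0 - mean) ** 2 / var)
    return math.log(p_voiced) + g

def two_stream_score(cep_loglik, f0_loglik, w_cep=1.0, w_f0=0.5):
    """Weighted combination of stream log-likelihoods, as in multi-stream HMMs."""
    return w_cep * cep_loglik + w_f0 * f0_loglik

if __name__ == "__main__":
    print(two_stream_score(-42.0, msd_f0_loglik(210.0, 0.8, 200.0, 400.0)))
    print(two_stream_score(-42.0, msd_f0_loglik(None, 0.8, 200.0, 400.0)))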


International Conference on Acoustics, Speech, and Signal Processing | 2006

Improved Chinese Character Input by Merging Speech and Handwriting Recognition Hypotheses

Xi Zhou; Ye Tian; Jian-Lai Zhou; Frank K. Soong; Bei-qian Dai

In this paper we propose to merge speech and handwriting recognition hypotheses for improving Chinese character input. The recognition result of handwritten character input can be reliable when the character is written rather squarely; however, more legible, square handwriting tends to slow down the input (stroke-writing) speed. On the other hand, speech input is fairly efficient, but the large number of homonyms and its vulnerability to adverse environments prevent speech from being used as a robust Chinese character input method on its own. The handwriting stroke information and the acoustic speech information are, in many cases, complementary. In this study we use independent, statistically trained HMMs to recognize each input mode individually, but merge the recognition hypotheses from the two recognizers. Generalized posterior probabilities are used to synchronize, compare, and merge the hypotheses appropriately. Experimental results show that a significant input speedup can be obtained while maintaining the same recognition performance.
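
As a hedged sketch of hypothesis merging (the paper's exact generalized-posterior formulation may differ), the code below log-linearly combines two posterior distributions over candidate characters and renormalizes; the floor value and the equal weighting are assumptions:

# Sketch: merge per-character hypothesis lists from a speech recognizer and
# a handwriting recognizer by combining their posterior scores.

def merge_hypotheses(speech_post, writing_post, w=0.5):
    """Log-linear combination of two posteriors over characters.
    Characters missing from one list get a small floor probability."""
    floor = 1e-6
    chars = set(speech_post) | set(writing_post)
    merged = {c: (speech_post.get(c, floor) ** w) *
                 (writing_post.get(c, floor) ** (1.0 - w)) for c in chars}
    z = sum(merged.values())
    return sorted(((c, p / z) for c, p in merged.items()),
                  key=lambda cp: -cp[1])

if __name__ == "__main__":
    speech = {"马": 0.5, "妈": 0.3, "麻": 0.2}   # same-syllable confusions
    writing = {"马": 0.6, "鸟": 0.4}             # visually confusable shapes
    print(merge_hypotheses(speech, writing)[:2])  # the two modes disambiguate "马"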


International Conference on Acoustics, Speech, and Signal Processing | 2006

Auto-Segmentation Based Partitioning and Clustering Approach to Robust Endpointing

Yu Shi; Frank K. Soong; Jian-Lai Zhou

An auto-segmentation-based partitioning and clustering approach to robust voice activity detection (VAD) is proposed. It is carried out in two successive steps: homogeneous frame partitioning and segment clustering. The first step, due to its auto-segmentation nature, needs no noise model and is applicable to different noise types and SNRs. The algorithm is a dynamic-programming-based procedure and performs gracefully in finding segmentation thresholds. Multiple parameters such as energy, pitch, and voicing information can easily be incorporated into the procedure. The algorithm is evaluated on the test sets of the Aurora2 database; it is robust in low-SNR operating environments, and the endpoint estimation errors are shown to have small variance.
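
A minimal sketch of the two steps follows, assuming frame energy as the only feature and a fixed number of segments: a dynamic-programming partition that minimizes within-segment variance, then a crude two-class labeling of segments by mean energy:

import numpy as np

# Step 1: DP partition of a frame-level feature into k homogeneous segments.
# Step 2: label segments as speech/noise by mean energy (illustrative only).

def seg_cost(prefix, prefix_sq, i, j):
    """Sum of squared deviations of frames i..j-1 from their mean."""
    n = j - i
    s = prefix[j] - prefix[i]
    return (prefix_sq[j] - prefix_sq[i]) - s * s / n

def dp_partition(x, k):
    """Optimal k-segment partition minimizing within-segment variance."""
    n = len(x)
    prefix = np.concatenate([[0.0], np.cumsum(x)])
    prefix_sq = np.concatenate([[0.0], np.cumsum(x * x)])
    cost = np.full((k + 1, n + 1), np.inf)
    back = np.zeros((k + 1, n + 1), dtype=int)
    cost[0, 0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                c = cost[m - 1, i] + seg_cost(prefix, prefix_sq, i, j)
                if c < cost[m, j]:
                    cost[m, j], back[m, j] = c, i
    bounds, j = [n], n
    for m in range(k, 0, -1):
        j = back[m, j]
        bounds.append(j)
    return bounds[::-1]                 # segment boundaries, 0..n

if __name__ == "__main__":
    x = np.concatenate([np.full(30, 0.1), np.full(40, 1.0), np.full(30, 0.1)])
    b = dp_partition(x, 3)              # ~[0, 30, 70, 100]
    segs = [(b[i], b[i + 1]) for i in range(len(b) - 1)]
    means = [x[s:e].mean() for s, e in segs]
    thr = (min(means) + max(means)) / 2
    print([(s, e, "speech" if m > thr else "noise")
           for (s, e), m in zip(segs, means)])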


IEEE Transactions on Audio, Speech, and Language Processing | 2006

Tree-Based Covariance Modeling of Hidden Markov Models

Ye Tian; Jian-Lai Zhou; Hui Lin; Hui Jiang

In this paper, we present a tree-based, full-covariance hidden Markov modeling technique for automatic speech recognition. A multilayered tree is first built to organize all covariance matrices into a hierarchical structure: Kullback-Leibler divergence is used to measure inter-Gaussian distortion, and successive splitting constructs the multilayer covariance tree. To cope with the data sparseness problem in estimating a full covariance matrix, we interpolate the diagonal covariance matrix of a leaf node at the bottom of the tree with the full covariances of its parent and ancestors along the path up to the root node. The interpolation coefficients are estimated in the maximum-likelihood sense via the EM algorithm. The interpolation is performed in three different parametric forms: 1) the inverse covariance matrix, 2) the covariance matrix, and 3) the off-diagonal terms of the full covariance matrix. The proposed algorithm is tested on three databases: 1) DARPA Resource Management (RM), 2) Switchboard, and 3) a Chinese dictation task. On all three databases, the proposed tree-based full-covariance modeling consistently outperforms baseline diagonal-covariance modeling. It also outperforms other covariance modeling techniques, including 1) semi-tied covariance modeling (STC), 2) heteroscedastic linear discriminant analysis (HLDA), 3) mixtures of inverse covariances (MIC), and 4) direct full-covariance modeling.
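
The covariance-matrix form of the interpolation can be sketched directly; the fixed weights below stand in for the EM-estimated coefficients described in the abstract:

import numpy as np

# Sketch of interpolating a leaf's diagonal covariance with the full
# covariances of its ancestors along a tree path (covariance-matrix form,
# one of the paper's three parametric forms).

def interpolate_covariance(leaf_diag, ancestors, weights):
    """Sigma = w0 * diag(leaf) + sum_k wk * Sigma_ancestor_k; weights sum to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9 and len(weights) == len(ancestors) + 1
    sigma = weights[0] * np.diag(leaf_diag)
    for w, full in zip(weights[1:], ancestors):
        sigma = sigma + w * full
    return sigma

if __name__ == "__main__":
    leaf = np.array([1.0, 2.0, 0.5])              # data-sparse leaf: diagonal only
    parent = np.array([[1.0, 0.3, 0.1],
                       [0.3, 2.0, 0.2],
                       [0.1, 0.2, 1.0]])          # better-estimated ancestor
    print(interpolate_covariance(leaf, [parent], [0.6, 0.4]))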


International Conference on Acoustics, Speech, and Signal Processing | 2006

Weighted Likelihood Ratio (WLR) Hidden Markov Model for Noisy Speech Recognition

Chao Huang; Yingchun Huang; Frank K. Soong; Jian-Lai Zhou

In this paper we present a weighted likelihood ratio (WLR) based hidden Markov model and apply it to speech recognition in noise. The WLR measure emphasizes spectral peaks over valleys when comparing two speech spectra. It is more consistent with human perception of speech formants, where the natural resonances of the vocal tract lie, and tends to be more robust to broad-band noise interference than other measures. A complete HMM framework for this measure is derived, and a mixture of exponential kernels is used to model the output probability density function. The new WLR-HMM is tested on the Aurora2 connected-digits database in noise and shows more robust performance than the MFCC-trained GMM baseline system. When combined with dynamic cepstral features, the multiple-stream WLR-HMM shows a 39% relative improvement over the baseline system.
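
The WLR spectral measure itself is compact enough for a worked sketch: the log-spectral difference at each frequency bin is weighted by the linear-spectral difference, so mismatches at high-energy peaks (formants) dominate and mismatches in low-energy valleys, where broad-band noise sits, count little. The toy spectra below are invented for illustration:

import numpy as np

# Sketch of the weighted likelihood ratio (WLR) spectral distortion:
# d = sum_k (log s1_k - log s2_k) * (s1_k - s2_k), for power spectra s1, s2.
# Each term is non-negative, and terms at spectral peaks carry large weight.

def wlr_distance(s1, s2):
    return float(np.sum((np.log(s1) - np.log(s2)) * (s1 - s2)))

if __name__ == "__main__":
    k = np.arange(1, 65)
    peak = 1.0 + 10.0 * np.exp(-0.5 * ((k - 20) / 3.0) ** 2)     # one formant
    shifted = 1.0 + 10.0 * np.exp(-0.5 * ((k - 24) / 3.0) ** 2)  # formant moved
    noisy = peak + 0.2                                           # low-level noise floor
    print(wlr_distance(peak, shifted))   # large: formant mismatch
    print(wlr_distance(peak, noisy))     # small: valley-level noise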

Collaboration


Dive into Jian-Lai Zhou's collaboration.

Top Co-Authors

Ren-Hua Wang

University of Science and Technology of China
