
Publications


Featured research published by Zheng-Hua Tan.


Archive | 2008

Automatic speech recognition on mobile devices and over communication networks

Zheng-Hua Tan; Børge Lindberg

Contents:
- Network Speech Recognition
- Network, Distributed and Embedded Speech Recognition: An Overview
- Speech Coding and Packet Loss Effects on Speech and Speaker Recognition
- Speech Recognition Over Mobile Networks
- Speech Recognition Over IP Networks
- Distributed Speech Recognition
- Distributed Speech Recognition Standards
- Speech Feature Extraction and Reconstruction
- Quantization of Speech Features: Source Coding
- Error Recovery: Channel Coding and Packetization
- Error Concealment
- Embedded Speech Recognition
- Algorithm Optimizations: Low Computational Complexity
- Algorithm Optimizations: Low Memory Footprint
- Fixed-Point Arithmetic
- Systems and Applications
- Software Architectures for Networked Mobile Speech Applications
- Speech Recognition in Mobile Phones
- Handheld Speech to Speech Translation System
- Automotive Speech Recognition
- Energy Aware Speech Recognition for Mobile Devices


IEEE Journal of Selected Topics in Signal Processing | 2010

Low-Complexity Variable Frame Rate Analysis for Speech Recognition and Voice Activity Detection

Zheng-Hua Tan; Børge Lindberg

Frame-based speech processing inherently assumes a stationary behavior of speech signals in a short period of time. Over a long time, the characteristics of the signals can change significantly and frames are not equally important, underscoring the need for frame selection. In this paper, we present a low-complexity and effective frame selection approach based on a posteriori signal-to-noise ratio (SNR) weighted energy distance: The use of an energy distance, instead of, e.g., a standard cepstral distance, makes the approach computationally efficient and enables fine granularity search, and the use of a posteriori SNR weighting emphasizes the reliable regions in noisy speech signals. It is experimentally found that the approach is able to assign a higher frame rate to fast changing events such as consonants, a lower frame rate to steady regions like vowels and no frames to silence, even for very low SNR signals. The resulting variable frame rate analysis method is applied to three speech processing tasks that are essential to natural interaction with intelligent environments. First, it is used for improving speech recognition performance in noisy environments. Second, the method is used for scalable source coding schemes in distributed speech recognition where the target bit rate is met by adjusting the frame rate. Third, it is applied to voice activity detection. Very encouraging results are obtained for all three speech processing tasks.
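To make the frame-selection idea concrete, the sketch below shows an a-posteriori-SNR-weighted energy-distance selection rule in Python. The thresholding scheme, the scalar noise-energy input, and the log-domain weighting are simplifying assumptions for illustration, not the exact recipe from the paper.

```python
import numpy as np

def select_frames(frames, noise_energy, threshold=2.0, eps=1e-10):
    """frames: (num_frames, frame_len) array from a finely shifted analysis.
    noise_energy: scalar noise energy estimate (e.g., from leading frames).
    Returns indices of frames retained by the variable frame rate analysis."""
    energy = np.log(np.sum(frames ** 2, axis=1) + eps)          # log frame energy
    snr = np.maximum(energy - np.log(noise_energy + eps), 0.0)  # a posteriori SNR proxy (log domain)
    selected = [0]
    acc = 0.0
    for t in range(1, len(frames)):
        # energy distance to the last selected frame, weighted by the frame's SNR
        acc += abs(energy[t] - energy[selected[-1]]) * snr[t]
        if acc >= threshold:   # enough weighted change accumulated -> keep this frame
            selected.append(t)
            acc = 0.0
    return np.array(selected)
```

Fast-changing, high-SNR regions accumulate distance quickly and receive many frames, while steady or silent regions accumulate little and are skipped.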


Speech Communication | 2005

Automatic Speech Recognition over Error-Prone Wireless Networks

Zheng-Hua Tan; Paul Dalsgaard; Børge Lindberg

The past decade has witnessed a growing interest in deploying automatic speech recognition (ASR) in communication networks. Networks such as wireless networks present a number of challenges due to, e.g., bandwidth constraints and transmission errors. The introduction of distributed speech recognition (DSR) largely eliminates the bandwidth limitations, and the presence of transmission errors becomes the key robustness issue. This paper reviews the techniques that have been developed for ASR robustness against transmission errors. A model of network degradations and robustness techniques is presented, and the techniques are classified into three categories: error detection, error recovery and error concealment (EC). A one-frame error detection scheme is described and compared with a frame-pair scheme. As opposed to vector-level techniques, a technique for error detection and EC at the sub-vector level is presented. A number of error recovery techniques, such as forward error correction and interleaving, are discussed, in addition to a review of both feature-reconstruction and ASR-decoder based EC techniques. To enable the comparison of some of these techniques, evaluations have been conducted on the same speech database and channel. Special attention is given to the unique characteristics of DSR as compared to streaming audio, e.g., voice-over-IP. Additionally, a technique for adapting ASR to the varying quality of networks is presented: the frame error rate is used to adjust the discrimination threshold with the goal of optimising out-of-vocabulary detection. The paper concludes with a discussion of the applicability of different techniques based on the channel characteristics and the system requirements.
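As a concrete illustration of the simplest vector-level error-concealment baseline in this family of techniques, the sketch below replaces each frame flagged by error detection with its nearest correctly received neighbour. The function name, the frame layout, and the repetition rule are illustrative assumptions, not the specific schemes evaluated in the paper.

```python
import numpy as np

def conceal_by_repetition(features, erased):
    """features: (T, D) received feature vectors; erased: boolean mask over frames
    flagged by error detection. Assumes at least one frame was received correctly.
    Each erased frame is replaced by its nearest correctly received neighbour."""
    good = np.flatnonzero(~erased)
    out = features.copy()
    for t in np.flatnonzero(erased):
        out[t] = features[good[np.argmin(np.abs(good - t))]]
    return out
```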


IEEE Transactions on Audio, Speech, and Language Processing | 2012

A Joint Approach for Single-Channel Speaker Identification and Speech Separation

Pejman Mowlaee; Rahim Saeidi; Mads Græsbøll Christensen; Zheng-Hua Tan; Tomi Kinnunen; Pasi Fränti; Søren Holdt Jensen

In this paper, we present a novel system for joint speaker identification and speech separation. For speaker identification, a single-channel speaker identification algorithm is proposed which provides an estimate of the signal-to-signal ratio (SSR) as a by-product. For speech separation, we propose a sinusoidal model-based algorithm. The speech separation algorithm consists of a double-talk/single-talk detector followed by a minimum mean square error estimator of sinusoidal parameters for finding optimal codevectors from pre-trained speaker codebooks. In evaluating the proposed system, we start from a situation where we have prior information on codebook indices, speaker identities and SSR level, and then, by relaxing these assumptions one by one, we demonstrate the efficiency of the proposed fully blind system. In contrast to previous studies that mostly focus on automatic speech recognition (ASR) accuracy, we report objective and subjective results as well. The results show that the proposed system performs as well as the best of the state of the art in terms of perceived quality, while its performance in terms of speaker identification and automatic speech recognition is generally lower. It outperforms the state of the art in terms of intelligibility, showing that the ASR results are not conclusive. The proposed method achieves, on average, 52.3% ASR accuracy, 41.2 points in MUSHRA and 85.9% in speech intelligibility.


international conference on acoustics, speech, and signal processing | 2017

Permutation invariant training of deep models for speaker-independent multi-talker speech separation

Dong Yu; Morten Kolbæk; Zheng-Hua Tan; Jesper Jensen

We propose a novel deep learning training criterion, named permutation invariant training (PIT), for speaker-independent multi-talker speech separation, commonly known as the cocktail-party problem. Different from the multi-class regression technique and the deep clustering (DPCL) technique, our novel approach minimizes the separation error directly. This strategy effectively solves the long-standing label permutation problem that has prevented progress on deep learning based techniques for speech separation. We evaluated PIT on the WSJ0 and Danish mixed-speech separation tasks and found that it compares favorably to non-negative matrix factorization (NMF), computational auditory scene analysis (CASA), and DPCL and generalizes well over unseen speakers and languages. Since PIT is simple to implement and can be easily integrated and combined with other advanced techniques, we believe improvements built upon PIT can eventually solve the cocktail-party problem.
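The PIT criterion itself is compact enough to spell out. The sketch below computes, for one utterance, the minimum over all permutations of the output streams of the summed per-stream errors. The numpy shapes and the MSE loss are illustrative assumptions; in the paper the criterion is applied inside a deep network to estimated masks/spectra.

```python
import itertools
import numpy as np

def pit_mse(estimates, targets):
    """estimates, targets: (num_speakers, T, F) magnitude spectra.
    Returns the minimum mean squared error over all assignments of
    estimated streams to reference streams."""
    S = estimates.shape[0]
    best = np.inf
    for perm in itertools.permutations(range(S)):
        err = np.mean([np.mean((estimates[i] - targets[p]) ** 2)
                       for i, p in enumerate(perm)])
        best = min(best, err)
    return best
```

Because the loss is taken over the best assignment, the network is free to output the speakers in any order, which removes the label permutation ambiguity during training.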


IEEE Transactions on Neural Networks | 2018

Decorrelation of Neutral Vector Variables: Theory and Applications

Zhanyu Ma; Jing-Hao Xue; Arne Leijon; Zheng-Hua Tan; Zhen Yang; Jun Guo

In this paper, we propose novel strategies for neutral vector variable decorrelation. Two fundamental invertible transformations, namely serial nonlinear transformation and parallel nonlinear transformation, are proposed to carry out the decorrelation. For a neutral vector variable, which is not multivariate-Gaussian distributed, conventional principal component analysis cannot yield mutually independent scalar variables. With the two proposed transformations, a highly negatively correlated neutral vector can be transformed into a set of mutually independent scalar variables with the same degrees of freedom. We also evaluate the decorrelation performance for vectors generated from a single Dirichlet distribution and from a mixture of Dirichlet distributions. The mutual independence is verified with the distance correlation measure. The advantages of the proposed decorrelation strategies are extensively studied and demonstrated with synthesized data and practical application evaluations.
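The Dirichlet distribution is the canonical example of a completely neutral vector, and a serial-style transformation can be sketched directly on Dirichlet samples: the stick-breaking ratios u_k = x_k / (1 - x_1 - ... - x_{k-1}) are mutually independent Beta variables. The snippet below is a minimal illustration under that assumption; it only checks ordinary linear correlation, not the distance correlation measure used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([2.0, 3.0, 4.0, 5.0])
x = rng.dirichlet(alpha, size=100_000)    # completely neutral vectors (rows sum to 1)

# serial-style transformation: u_k = x_k / (1 - x_1 - ... - x_{k-1})
remaining = 1.0 - np.concatenate(
    [np.zeros((x.shape[0], 1)), np.cumsum(x[:, :-1], axis=1)], axis=1)
u = x / remaining

# original coordinates are strongly negatively correlated ...
print(np.corrcoef(x[:, :-1], rowvar=False).round(2))
# ... while the transformed ones are close to uncorrelated independent Betas
# (the last column of u is identically 1 and is dropped)
print(np.corrcoef(u[:, :-1], rowvar=False).round(2))
```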


IEEE Transactions on Audio, Speech, and Language Processing | 2007

Exploiting Temporal Correlation of Speech for Error Robust and Bandwidth Flexible Distributed Speech Recognition

Zheng-Hua Tan; Paul Dalsgaard; Børge Lindberg

In this paper, the temporal correlation of speech is exploited in front-end feature extraction, client-based error recovery, and server-based error concealment (EC) for distributed speech recognition. First, the paper investigates a half frame rate (HFR) front-end that uses double frame shifting at the client side. At the server side, each HFR feature vector is duplicated to construct a full frame rate (FFR) feature sequence. This HFR front-end gives comparable performance to the FFR front-end but contains only half the FFR features. Second, different arrangements of the other half of the FFR features create a set of error recovery techniques encompassing multiple description coding and interleaving schemes, where interleaving has the advantage of not introducing a delay when there are no transmission errors. Third, a subvector-based EC technique is presented where error detection and concealment are conducted at the subvector level, as opposed to conventional techniques where an entire vector is replaced even though only a single bit error occurs. The subvector EC is further combined with weighted Viterbi decoding. Encouraging recognition results are observed for the proposed techniques. Lastly, to understand the effects of applying various EC techniques, this paper introduces three analysis approaches based on speech feature, dynamic programming distance, and hidden Markov model state duration comparison.
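To make the interleaving idea concrete, here is a minimal block-interleaving sketch: frames are written row by row into a depth-by-width block and read out column by column, so a burst of consecutive losses on the channel becomes isolated frame losses after de-interleaving. The block shape and the requirement that the number of frames be a multiple of the depth are simplifying assumptions, not the exact scheme from the paper.

```python
import numpy as np

def interleave(frames, depth=4):
    """frames: (T, D) feature vectors, T assumed to be a multiple of depth.
    Write row by row into a (depth x width) block, read out column by column."""
    T, D = frames.shape
    width = T // depth
    return frames.reshape(depth, width, D).transpose(1, 0, 2).reshape(T, D)

def deinterleave(frames, depth=4):
    """Inverse of interleave(); a burst of consecutive channel losses maps back
    to isolated frame losses spaced `width` frames apart in the original order."""
    T, D = frames.shape
    width = T // depth
    return frames.reshape(width, depth, D).transpose(1, 0, 2).reshape(T, D)
```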


IEEE Transactions on Audio, Speech, and Language Processing | 2017

Speech Intelligibility Potential of General and Specialized Deep Neural Network Based Speech Enhancement Systems

Morten Kolbæk; Zheng-Hua Tan; Jesper Jensen

In this paper, we study aspects of single microphone speech enhancement (SE) based on deep neural networks (DNNs). Specifically, we explore the generalizability of state-of-the-art DNN-based SE systems with respect to the background noise type, the gender of the target speaker, and the signal-to-noise ratio (SNR). Furthermore, we investigate how specialized DNN-based SE systems, which have been trained to be either noise type specific, speaker specific or SNR specific, perform relative to DNN-based SE systems that have been trained to be noise type general, speaker general, and SNR general. Finally, we compare how a DNN-based SE system trained to be noise type general, speaker general, and SNR general performs relative to a state-of-the-art short-time spectral amplitude minimum mean square error (STSA-MMSE) based SE algorithm. We show that DNN-based SE systems, when trained specifically to handle certain speakers, noise types and SNRs, are capable of achieving large improvements in estimated speech quality (SQ) and speech intelligibility (SI) when tested in matched conditions. Furthermore, we show that improvements in estimated SQ and SI can be achieved by a DNN-based SE system when exposed to unseen speakers, genders and noise types, provided that a large number of speakers and noise types have been used in training the system. In addition, we show that a DNN-based SE system that has been trained using a large number of speakers and a wide range of noise types outperforms a state-of-the-art STSA-MMSE based SE method when tested using a range of unseen speakers and noise types. Finally, a listening test using several DNN-based SE systems tested in unseen speaker conditions shows that these systems can improve SI for some SNR and noise type configurations but degrade SI for others.


international conference on acoustics, speech, and signal processing | 2013

Developing a speaker identification system for the DARPA RATS project

Oldrich Plchot; Spyros Matsoukas; Pavel Matejka; Najim Dehak; Jeff Z. Ma; Sandro Cumani; Ondrej Glembek; Hynek Hermansky; Sri Harish Reddy Mallidi; Nima Mesgarani; Richard M. Schwartz; Mehdi Soufifar; Zheng-Hua Tan; Samuel Thomas; Bing Zhang; Xinhui Zhou

This paper describes the speaker identification (SID) system developed by the Patrol team for the first phase of the DARPA RATS (Robust Automatic Transcription of Speech) program, which seeks to advance state-of-the-art detection capabilities on audio from highly degraded communication channels. We present results using multiple SID systems differing mainly in the algorithm used for voice activity detection (VAD) and feature extraction. We show that (a) unsupervised VAD performs as well as supervised methods in terms of downstream SID performance, (b) noise-robust feature extraction methods such as CFCCs outperform MFCC front-ends on noisy audio, and (c) fusion of multiple systems provides a 24% relative improvement in EER compared to the single best system when using a novel SVM-based fusion algorithm that uses side information such as gender, language, and channel ID.


international conference on pattern recognition | 2010

Signal-to-Signal Ratio Independent Speaker Identification for Co-channel Speech Signals

Rahim Saeidi; Pejman Mowlaee; Tomi Kinnunen; Zheng-Hua Tan; Mads Græsbøll Christensen; Søren Holdt Jensen; Pasi Fränti

In this paper, we consider speaker identification for the co-channel scenario in which a speech mixture from two speakers is recorded by one microphone only. The goal is to identify both of the speakers from their mixed signal. High recognition accuracies have already been reported when an accurately estimated signal-to-signal ratio (SSR) is available. In this paper, we approach the problem without estimating the SSR. We show that a simple method based on fusion of adapted Gaussian mixture models and the Kullback-Leibler divergence calculated between models achieves accuracies of 97% and 93% when the two target speakers are listed among the three and two most probable speakers, respectively.
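As an illustration of one ingredient mentioned above, the sketch below estimates the Kullback-Leibler divergence between two Gaussian mixture speaker models by Monte Carlo sampling. The use of scikit-learn, the sampling-based approximation, and the toy model sizes are assumptions for illustration; the exact adaptation and fusion procedure from the paper is not reproduced here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def kl_divergence_gmm(gmm_p, gmm_q, n_samples=10_000):
    """Monte Carlo estimate of KL(p || q) = E_p[log p(x) - log q(x)],
    using samples drawn from the first mixture."""
    X, _ = gmm_p.sample(n_samples)
    return float(np.mean(gmm_p.score_samples(X) - gmm_q.score_samples(X)))

# toy usage: gmm_a = GaussianMixture(8).fit(feats_a)
#            gmm_b = GaussianMixture(8).fit(feats_b)
#            d = kl_divergence_gmm(gmm_a, gmm_b)
```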

Collaboration


Dive into Zheng-Hua Tan's collaboration.

Top Co-Authors

Zhanyu Ma
Beijing University of Posts and Telecommunications

Tomi Kinnunen
University of Eastern Finland

Jun Guo
Beijing University of Posts and Telecommunications