Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Kehuang Li is active.

Publication


Featured research published by Kehuang Li.


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2015

A deep neural network approach to speech bandwidth expansion

Kehuang Li; Chin-Hui Lee

We propose a deep neural network (DNN) approach to speech bandwidth expansion (BWE) by estimating the spectral mapping function from narrowband (4 kHz in bandwidth) to wideband (8 kHz in bandwidth) speech. Log-spectrum power is used as the input and output features to perform the required nonlinear transformation, and DNNs are trained to realize this high-dimensional mapping function. When evaluating the proposed approach on a large-scale 10-hour test set, we found that the DNN-expanded speech signals give excellent objective quality measures in terms of segmental signal-to-noise ratio and log-spectral distortion when compared with conventional BWE based on Gaussian mixture models (GMMs). Subjective listening tests also give a 69% preference score for DNN-expanded speech over 31% for GMM when the phase information is assumed known. For tests in real operation, when the phase information is imaged from the given narrowband signal, the preference comparison goes up to 84% versus 16%. A correct phase recovery can further increase the BWE performance of the proposed DNN method.
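
The log-spectral distortion (LSD) measure reported in this evaluation can be sketched as follows; this is a common NumPy formulation, and the exact framing and weighting conventions in the paper may differ:

```python
import numpy as np

def log_spectral_distortion(ref_power, est_power, eps=1e-12):
    """Mean LSD in dB between two power spectra.

    ref_power, est_power: arrays of shape (frames, bins) holding the
    power spectra of the reference (wideband) and estimated signals.
    """
    ref_db = 10.0 * np.log10(np.maximum(ref_power, eps))
    est_db = 10.0 * np.log10(np.maximum(est_power, eps))
    # Per-frame RMS difference across frequency bins, averaged over frames.
    per_frame = np.sqrt(np.mean((ref_db - est_db) ** 2, axis=1))
    return float(np.mean(per_frame))

# Identical spectra give zero distortion.
spec = np.abs(np.random.default_rng(0).normal(size=(10, 129))) ** 2
print(log_spectral_distortion(spec, spec))  # 0.0
```

A uniform gain mismatch of 2x in power shows up as a constant 10·log10(2) ≈ 3.01 dB distortion, which is a quick sanity check for the formulation.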


IEEE Transactions on Audio, Speech, and Language Processing | 2017

A Reverberation-Time-Aware Approach to Speech Dereverberation Based on Deep Neural Networks

Bo Wu; Kehuang Li; Minglei Yang; Chin-Hui Lee

A reverberation-time-aware deep-neural-network (DNN)-based speech dereverberation framework is proposed to handle a wide range of reverberation times. There are three key steps in designing a robust system. First, in contrast to sigmoid activation and min-max normalization in state-of-the-art algorithms, a linear activation function at the output layer and global mean-variance normalization of target features are adopted to learn the complicated nonlinear mapping function from reverberant to anechoic speech and to improve the restoration of the low-frequency and intermediate-frequency contents. Next, two key design parameters, namely, frame shift size in speech framing and acoustic context window size at the DNN input, are investigated to show that RT60-dependent parameters are needed in the DNN training stage in order to optimize the system performance in diverse reverberant environments. Finally, the reverberation time is estimated to select the proper frame shift and context window sizes for feature extraction before feeding the log-power spectrum features to the trained DNNs for speech dereverberation. Our experimental results indicate that the proposed framework outperforms the conventional DNNs without taking the reverberation time into account, while achieving a performance only slightly worse than the oracle cases with known reverberation times even for extremely weak and severe reverberant conditions. It also generalizes well to unseen room sizes, loudspeaker and microphone positions, and recorded room impulse responses.
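
The RT60-dependent feature extraction described above can be sketched as a context-window stacking step whose parameters are switched on the estimated reverberation time. The breakpoints and sizes below are illustrative placeholders, not the values tuned in the paper:

```python
import numpy as np

def rt60_to_params(rt60):
    """Map an estimated RT60 (seconds) to (frame_shift, context) sizes.
    The thresholds and values here are hypothetical examples."""
    if rt60 < 0.4:
        return 8, 3   # short reverberation: small shift, narrow context
    elif rt60 < 0.7:
        return 16, 5
    return 32, 7      # long reverberation: larger shift, wider context

def stack_context(frames, context):
    """Stack +/- context neighbouring frames around each frame
    (edges padded by repetition) to form the DNN input features."""
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(frames)] for i in range(2 * context + 1)])

feats = np.zeros((100, 257))          # 100 frames of log-power spectra
shift, ctx = rt60_to_params(0.55)
dnn_input = stack_context(feats, ctx)
print(dnn_input.shape)  # (100, 2827), i.e. 257 bins x 11 frames
```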


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2014

A maximal figure-of-merit learning approach to maximizing mean average precision with deep neural network based classifiers

Kehuang Li; Zhen Huang; You-Chi Cheng; Chin-Hui Lee

We propose a maximal figure-of-merit (MFoM) learning framework to directly maximize mean average precision (MAP) which is a key performance metric in many multi-class classification tasks. Conventional classifiers based on support vector machines cannot be easily adopted to optimize the MAP metric. On the other hand, classifiers based on deep neural networks (DNNs) have recently been shown to deliver a great discrimination capability in automatic speech recognition and image classification as well. However, DNNs are usually optimized with the minimum cross entropy criterion. In contrast to most conventional classification methods, our proposed approach can be formulated to embed DNNs and MAP into the objective function to be optimized during training. The combination of the proposed maximum MAP (MMAP) technique and DNNs introduces nonlinearity to the linear discriminant function (LDF) in order to increase the flexibility and discriminant power of the original MFoM-trained LDF based classifiers. Tested on both automatic image annotation and audio event classification, the experimental results show consistent improvements of MAP on both datasets when compared with other state-of-the-art classifiers without using MMAP.
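
The mean average precision metric being optimized here is the standard ranking measure; a minimal NumPy sketch of its definition:

```python
import numpy as np

def average_precision(scores, labels):
    """AP for one class: labels are 0/1 relevance, scores rank the items."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    rel = np.asarray(labels)[order]
    hits = np.cumsum(rel)
    precision_at_k = hits / (np.arange(len(rel)) + 1)
    return float(np.sum(precision_at_k * rel) / max(rel.sum(), 1))

def mean_average_precision(score_matrix, label_matrix):
    """MAP averaged over classes (one class per column)."""
    return float(np.mean([average_precision(score_matrix[:, c], label_matrix[:, c])
                          for c in range(score_matrix.shape[1])]))

# A perfect ranking gives AP = 1.0.
print(average_precision([0.9, 0.8, 0.1], [1, 1, 0]))  # 1.0
```

Note that AP is a function of the ranking induced by the scores, not of the scores themselves, which is why it cannot be optimized directly by a smooth loss and motivates the MFoM-style surrogate described above.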


Hands-free Speech Communications and Microphone Arrays (HSCMA) | 2017

A unified deep modeling approach to simultaneous speech dereverberation and recognition for the REVERB challenge

Bo Wu; Kehuang Li; Zhen Huang; Sabato Marco Siniscalchi; Minglei Yang; Chin-Hui Lee

We propose a unified deep neural network (DNN) approach to achieve both high-quality enhanced speech and high-accuracy automatic speech recognition (ASR) simultaneously on the recent REverberant Voice Enhancement and Recognition Benchmark (REVERB) Challenge. These two goals are accomplished by two proposed techniques, namely DNN-based regression to enhance reverberant and noisy speech, followed by DNN-based multi-condition training that takes clean-condition, multi-condition and enhanced speech all into consideration. We first report objective measures of enhanced speech superior to those listed in the 2014 REVERB Challenge Workshop. We then show that in clean-condition training, we obtain the best word error rate (WER) of 13.28% on the 1-channel REVERB simulated evaluation data with the proposed DNN-based pre-processing scheme. Similarly, we attain a competitive single-system WER of 8.75% with the proposed multi-condition training strategy and the same less-discriminative log-power spectrum features used in the enhancement stage. Finally, by leveraging upon joint training with more discriminative ASR features and improved neural-network-based language models, a state-of-the-art WER of 4.46% is attained with a single ASR system and single-channel information. Another state-of-the-art WER of 4.10% is achieved through system combination.


Conference of the International Speech Communication Association (INTERSPEECH) | 2016

Detecting Mispronunciations of L2 Learners and Providing Corrective Feedback Using Knowledge-Guided and Data-Driven Decision Trees.

Wei Li; Kehuang Li; Sabato Marco Siniscalchi; Nancy F. Chen; Chin-Hui Lee

We propose a novel decision tree based framework to detect phonetic mispronunciations produced by L2 learners caused by using inaccurate speech attributes, such as manner and place of articulation. Compared with conventional score-based CAPT (computer assisted pronunciation training) systems, our proposed framework has three advantages: (1) each mispronunciation in a tree can be interpreted and communicated to the L2 learners by traversing the corresponding path from a leaf node to the root node; (2) corrective feedback based on speech attribute features, which are directly used to describe how consonants and vowels are produced using related articulators, can be provided to the L2 learners; and (3) by building the phone-dependent decision tree, the relative importance of the speech attribute features of a target phone can be automatically learned and used to distinguish itself from other phones. This information can provide L2 learners speech attribute feedback that is ranked in order of importance. In addition to the abovementioned advantages, experimental results confirm that the proposed approach can detect most pronunciation errors and provide accurate diagnostic feedback.
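
The interpretable feedback in advantage (1) amounts to walking from the leaf that flagged the mispronunciation back to the root and reading off the speech-attribute tests along the way. A minimal sketch; the node questions and the /s/-vs-/th/ example are illustrative, not taken from the paper's trees:

```python
# Each node stores its speech-attribute question and a parent link; a
# leaf's feedback is the chain of questions from it back to the root.
class Node:
    def __init__(self, question, parent=None):
        self.question = question
        self.parent = parent

root = Node("is the target phone a consonant?")
manner = Node("is the manner of articulation fricative?", parent=root)
place = Node("is the place of articulation alveolar?", parent=manner)
leaf = Node("mispronounced: /s/ produced as /th/", parent=place)

def path_to_root(node):
    """Collect the questions from a leaf up to the root."""
    chain = []
    while node is not None:
        chain.append(node.question)
        node = node.parent
    return chain

for step in path_to_root(leaf):
    print(step)
```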


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2017

A transfer learning and progressive stacking approach to reducing deep model sizes with an application to speech enhancement

Sicheng Wang; Kehuang Li; Zhen Huang; Sabato Marco Siniscalchi; Chin-Hui Lee

Leveraging upon transfer learning, we distill the knowledge in a conventional wide and deep neural network (DNN) into a narrower yet deeper model with fewer parameters and comparable system performance for speech enhancement. We present three transfer-learning solutions to accomplish our goal. First, the knowledge embedded in the form of the output values of a high-performance DNN is used to guide the training of a smaller DNN model in sequential transfer learning. In the second, multi-task transfer learning solution, the smaller DNN is trained to learn the output values of the larger DNN and the speech enhancement task in parallel. Finally, progressive stacking transfer learning is accomplished through multi-task learning and DNN stacking. Our experimental evidence demonstrates a five-fold parameter reduction while maintaining similar enhancement performance with the proposed framework.
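
The second (multi-task) solution combines a distillation term against the large teacher DNN's outputs with the original enhancement regression target. A minimal NumPy sketch of the combined loss; the weighting alpha is an illustrative hyper-parameter, not a value from the paper:

```python
import numpy as np

def multitask_loss(student_out, teacher_out, clean_target, alpha=0.5):
    """Weighted sum of (i) distillation MSE toward the teacher DNN's
    outputs and (ii) task MSE toward the clean-speech regression target."""
    distill = np.mean((student_out - teacher_out) ** 2)
    task = np.mean((student_out - clean_target) ** 2)
    return alpha * distill + (1.0 - alpha) * task

s = np.array([0.2, 0.4])  # student DNN outputs (toy values)
t = np.array([0.2, 0.4])  # teacher DNN outputs
c = np.array([0.0, 0.0])  # clean-speech target
print(multitask_loss(s, t, c, alpha=1.0))  # 0.0: student matches teacher
```

With alpha = 1 this degenerates to pure sequential distillation, and with alpha = 0 to ordinary enhancement training, so the same loss covers the first two solutions as special cases.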


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2014

Deep learning vector quantization for acoustic information retrieval

Zhen Huang; Chao Weng; Kehuang Li; You-Chi Cheng; Chin-Hui Lee

We propose a novel deep learning vector quantization (DLVQ) algorithm based on deep neural networks (DNNs). Utilizing the strong representation power of this deep learning framework, with any vector quantization (VQ) method as an initializer, the proposed DLVQ technique is capable of learning a code-constrained codebook and thus improves over conventional VQ for use in classification problems. Tested on an audio information retrieval task, the proposed DLVQ achieves quite promising performance when initialized by the k-means VQ technique. A 10.5% relative gain in mean average precision (MAP) is obtained after fusing the k-means and DLVQ results together.
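
The k-means VQ initializer mentioned above can be sketched in a few lines of NumPy (plain Lloyd's algorithm; the DLVQ refinement itself is not shown):

```python
import numpy as np

def kmeans(data, k, iters=20, seed=0):
    """Plain Lloyd's k-means: returns a (k, dim) codebook."""
    rng = np.random.default_rng(seed)
    codebook = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iters):
        # Assign each vector to its nearest codeword.
        d = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=2)
        assign = np.argmin(d, axis=1)
        # Re-estimate each codeword as the mean of its cluster.
        for j in range(k):
            if np.any(assign == j):
                codebook[j] = data[assign == j].mean(axis=0)
    return codebook

# Two well-separated clusters recover their means.
data = np.vstack([np.zeros((50, 2)), np.ones((50, 2)) * 10.0])
cb = kmeans(data, 2)
print(np.sort(cb[:, 0]))  # approx [0., 10.]
```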


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2013

Online whole-word and stroke-based modeling for hand-written letter recognition in in-car environments

You-Chi Cheng; Kehuang Li; Zhe Feng; Fuliang Weng; Chin-Hui Lee

A finger-written, camera-based hand gesture recognition framework for English letters in an in-vehicle environment, based on hidden Markov models (HMMs), is proposed. Due to the nature of the constrained hand movements on the steering column, we are confronted with at least two challenging research issues, namely varying illumination conditions and noisy hand gestures. The first difficulty is alleviated by utilizing contrast for background-foreground separation and skin model adaptation. We also adopt sub-letter stroke modeling to reduce the noisy frames at the beginning and ending parts of the letter gestures, followed by trajectory re-normalization. Moreover, the geometric relationship between letter pairs is utilized to distinguish highly confusable letters. Finally, score fusion between whole-letter and sub-stroke models can be used to further improve performance. When compared with a baseline system using simple features, our experimental results show that an overall relative error reduction of 66.03% can be achieved by integrating the above four new pieces of information.
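
The final score-fusion step can be sketched as a log-linear combination of the whole-letter and sub-stroke model scores; the fusion weight and the toy letter scores below are illustrative, not values from the paper:

```python
def fuse_scores(whole_letter_scores, stroke_scores, weight=0.6):
    """Per-letter weighted fusion of two models' log-scores; returns
    the letter with the highest fused score plus all fused scores."""
    fused = {letter: weight * whole_letter_scores[letter]
                     + (1.0 - weight) * stroke_scores[letter]
             for letter in whole_letter_scores}
    return max(fused, key=fused.get), fused

# Toy log-scores for two highly confusable letters: the stroke model
# overturns the whole-letter model's slight preference for "O".
best, fused = fuse_scores({"O": -3.0, "Q": -3.2}, {"O": -5.0, "Q": -2.0})
print(best)  # Q
```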


Conference of the International Speech Communication Association (INTERSPEECH) | 2016

An Iterative Phase Recovery Framework with Phase Mask for Spectral Mapping with an Application to Speech Enhancement.

Kehuang Li; Bo Wu; Chin-Hui Lee

We propose an iterative phase recovery framework to improve spectral mapping, with an application to improving the performance of state-of-the-art speech enhancement systems using magnitude-based spectral mapping with deep neural networks (DNNs). We further propose to use an estimated time-frequency mask to reduce sign uncertainty in the overlap-add waveform reconstruction algorithm. In a series of enhancement experiments using a DNN baseline system, by directly replacing the original phase of noisy speech with the estimated phase obtained with a classical phase recovery algorithm, the proposed iterative technique reduces the log-spectral distortion (LSD) by 0.41 dB from the DNN baseline, and increases the perceptual evaluation of speech quality (PESQ) score by 0.05 over the DNN baseline, averaging over a wide range of signal and noise conditions. The proposed phase mask mechanism further increases the segmental signal-to-noise ratio (SegSNR) by 0.44 dB at the expense of a slight degradation in LSD and PESQ compared with the algorithm without any phase mask.
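
The segmental SNR measure reported above can be sketched as follows; this is a common NumPy formulation, and the frame size and clamping thresholds vary between implementations:

```python
import numpy as np

def segmental_snr(clean, estimate, frame=256, floor=-10.0, ceil=35.0):
    """Frame-wise SNR in dB, clamped to [floor, ceil] and averaged."""
    n = (len(clean) // frame) * frame
    c = clean[:n].reshape(-1, frame)
    e = estimate[:n].reshape(-1, frame)
    noise = c - e
    snr = 10.0 * np.log10(np.sum(c ** 2, axis=1)
                          / (np.sum(noise ** 2, axis=1) + 1e-12) + 1e-12)
    return float(np.mean(np.clip(snr, floor, ceil)))

rng = np.random.default_rng(0)
x = rng.normal(size=4096)
print(segmental_snr(x, x))        # 35.0: perfect reconstruction hits the ceiling
print(segmental_snr(x, x * 0.5))  # about 6 dB (a uniform 0.5x amplitude error)
```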


IEEE Journal of Selected Topics in Signal Processing | 2017

An End-to-End Deep Learning Approach to Simultaneous Speech Dereverberation and Acoustic Modeling for Robust Speech Recognition

Bo Wu; Kehuang Li; Fengpei Ge; Zhen Huang; Minglei Yang; Sabato Marco Siniscalchi; Chin-Hui Lee

We propose an integrated end-to-end automatic speech recognition (ASR) paradigm by joint learning of the front-end speech signal processing and back-end acoustic modeling. We believe that “only good signal processing can lead to top ASR performance” in challenging acoustic environments. This notion leads to a unified deep neural network (DNN) framework for distant speech processing that can achieve both high-quality enhanced speech and high-accuracy ASR simultaneously. Our goal is accomplished by two techniques, namely: (i) a reverberation-time-aware DNN based speech dereverberation architecture that can handle a wide range of reverberation times to enhance speech quality of reverberant and noisy speech, followed by (ii) DNN-based multicondition training that takes both clean-condition and multicondition speech into consideration, leveraging upon an exploitation of the data acquired and processed with multichannel microphone arrays, to improve ASR performance. The final end-to-end system is established by a joint optimization of the speech enhancement and recognition DNNs. The recent REverberant Voice Enhancement and Recognition Benchmark (REVERB) Challenge task is used as a test bed for evaluating our proposed framework. We first report objective measures of enhanced speech superior to those listed in the 2014 REVERB Challenge Workshop on the simulated data test set. Moreover, we obtain the best single-system word error rate (WER) of 13.28% on the 1-channel REVERB simulated data with the proposed DNN-based pre-processing algorithm and clean-condition training. Leveraging upon joint training with more discriminative ASR features and improved neural-network-based language models, a low single-system WER of 4.46% is attained. Next, a new multi-channel-condition joint learning and testing scheme delivers a state-of-the-art WER of 3.76% on the 8-channel simulated data with a single ASR system. Finally, we also report on a preliminary yet promising experimentation with the REVERB real test data.
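
The WER figures quoted throughout these abstracts are the standard word-level Levenshtein-alignment measure; a minimal sketch:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed by word-level edit distance via dynamic programming."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[-1][-1] / len(ref)

print(word_error_rate("the cat sat", "the cat sat"))       # 0.0
print(word_error_rate("the cat sat", "the bat sat down"))  # 1 sub + 1 ins over 3 words
```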

Collaboration


Dive into Kehuang Li's collaborations.

Top Co-Authors

Chin-Hui Lee (Georgia Institute of Technology)
Zhen Huang (Georgia Institute of Technology)
You-Chi Cheng (Georgia Institute of Technology)
Jianhua Tao (Chinese Academy of Sciences)
Zhengqi Wen (Chinese Academy of Sciences)
Ville Hautamäki (University of Eastern Finland)