Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Ming Tu is active.

Publication


Featured research published by Ming Tu.


IEEE Transactions on Audio, Speech, and Language Processing | 2015

Convex weighting criteria for speaking rate estimation

Yishan Jiao; Visar Berisha; Ming Tu; Julie M. Liss

Speaking rate estimation directly from the speech waveform is a long-standing problem in speech signal processing. In this paper, we pose speaking rate estimation as the problem of estimating a temporal density function whose integral over a given interval yields the speaking rate within that interval. In contrast to many existing methods, we avoid the more difficult task of detecting individual phonemes within the speech signal, and we avoid heuristics such as thresholding the temporal envelope to estimate the number of vowels. Rather, the proposed method aims to learn an optimal weighting function that can be applied directly to time-frequency features of a speech signal to yield a temporal density function. We propose two convex cost functions for learning the weighting functions and an adaptation strategy that customizes the approach to a particular speaker using minimal training. The algorithms are evaluated on the TIMIT corpus, on a dysarthric speech corpus, and on the ICSI Switchboard spontaneous speech corpus. Results show that the proposed methods outperform three competing methods on both healthy and dysarthric speech. In addition, for spontaneous speech rate estimation, the results show a high correlation between the estimated speaking rate and ground-truth values.
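
As a concrete illustration of the approach, here is a minimal Python sketch. The inputs (per-frame time-frequency energies, labeled intervals, and syllable counts) are hypothetical placeholders, and nonnegative least squares is used as one convex choice of cost; the paper's exact objectives may differ.

```python
# Minimal sketch: learn a nonnegative weighting over frequency channels so
# that the weighted time-frequency energy, summed over a labeled interval,
# approximates the syllable count in that interval.
import numpy as np
from scipy.optimize import nnls

def fit_weighting(features, intervals, counts):
    """features: (T, F) time-frequency energies; intervals: list of
    (start, end) frame ranges; counts: syllable count per interval.
    All inputs are hypothetical placeholders."""
    # Each row aggregates the features over one labeled interval.
    A = np.stack([features[s:e].sum(axis=0) for s, e in intervals])
    w, _ = nnls(A, np.asarray(counts, dtype=float))  # convex, enforces w >= 0
    return w

def density(features, w):
    # Temporal density: its sum over any interval estimates syllables spoken.
    return features @ w
```

The speaking rate over any window is then the learned density summed over that window, divided by the window's duration in seconds.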


International Conference on Acoustics, Speech, and Signal Processing | 2017

Objective assessment of pathological speech using distribution regression

Ming Tu; Visar Berisha; Julie M. Liss

Objective assessment of pathological speech is an important part of existing systems for automatic diagnosis and treatment of various speech disorders. In this paper, we propose a new regression method for this application. Rather than treating speech samples from each speaker as individual data instances, we treat each speaker's data as a probability distribution. We propose a simple non-parametric learning method that makes predictions for out-of-sample speakers based on a probability distance measure to the speakers in the training set, in contrast to traditional learning methods that rely on Euclidean distances between individual instances. We evaluate the method on two pathological speech data sets with promising results.
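
A minimal sketch of the idea, assuming hypothetical frame-level features per speaker: squared maximum mean discrepancy (MMD) with an RBF kernel stands in for whatever probability distance the paper adopts, and the prediction is a simple kernel-weighted average of training labels.

```python
import numpy as np

def mmd2(X, Y, gamma=1.0):
    """Squared MMD with an RBF kernel between two frame-feature sets
    X: (n, d), Y: (m, d). A stand-in for the paper's distance measure."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

def predict_score(test_frames, train_speakers, train_scores, h=1.0):
    """Non-parametric prediction for an out-of-sample speaker: a
    Nadaraya-Watson style average, weighted by distribution distance."""
    d = np.array([mmd2(test_frames, S) for S in train_speakers])
    w = np.exp(-d / h)                       # closer speakers weigh more
    return float(w @ np.asarray(train_scores) / w.sum())
```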


International Conference on Acoustics, Speech, and Signal Processing | 2016

Ranking the parameters of deep neural networks using the Fisher information

Ming Tu; Visar Berisha; Martin Woolf; Jae-sun Seo; Yu Cao

The large number of parameters in deep neural networks (DNNs) often makes them prohibitive for low-power devices such as field-programmable gate arrays (FPGAs). In this paper, we propose a method to determine the relative importance of all network parameters by measuring the amount of information that the network output carries about each of them: the Fisher information. Based on the importance ranking, we design a complexity reduction scheme that discards unimportant parameters and assigns more quantization bits to more important ones. For evaluation, we construct a deep autoencoder and learn a non-linear dimensionality reduction scheme for accelerometer data measuring the gait of individuals with Parkinson's disease. Experimental results confirm that the proposed ranking method can reduce the complexity of the network with minimal impact on performance.
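
A simplified PyTorch sketch of one way to estimate and rank the diagonal Fisher information, approximating each parameter's Fisher information by its average squared loss gradient. `model`, `loss_fn`, and `data_loader` are assumed placeholders, and the paper's estimator may differ in detail.

```python
import torch

def fisher_ranking(model, loss_fn, data_loader):
    """Accumulate squared gradients as a diagonal Fisher estimate, then
    rank all parameters by it (a simplified sketch)."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    n_batches = 0
    for x, y in data_loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2   # squared-gradient estimate
        n_batches += 1
    scores = torch.cat([f.flatten() / n_batches for f in fisher.values()])
    # Indices of parameters, most important first.
    return fisher, torch.argsort(scores, descending=True)
```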


International Conference on Acoustics, Speech, and Signal Processing | 2016

Online speaking rate estimation using recurrent neural networks

Yishan Jiao; Ming Tu; Visar Berisha; Julie M. Liss

A reliable online speaking rate estimation tool is useful in many domains, including speech recognition, speech therapy intervention, and speaker identification. This paper proposes an online speaking rate estimation model based on recurrent neural networks (RNNs). Speaking rate is a long-term feature of speech, which depends on how many syllables are spoken over an extended time window (seconds). We posit that since RNNs can capture long-term dependencies through the memory of previous hidden states, they are a good match for the speaking rate estimation task. Here we train a long short-term memory (LSTM) RNN on a set of speech features that are known to correlate with speech rhythm. An evaluation on spontaneous speech shows that the method yields a higher correlation between the estimated rate and the ground-truth rate than state-of-the-art alternatives. The evaluation on longitudinal pathological speech shows that the proposed method can capture both long-term and short-term changes in speaking rate.
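
A minimal PyTorch sketch of such an estimator; the feature dimensionality, layer sizes, and regression head are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class RateLSTM(nn.Module):
    """Sketch of an LSTM rate estimator: frame-level rhythm features in,
    per-frame syllable density (and hence speaking rate) out."""
    def __init__(self, n_features=19, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):                 # x: (batch, frames, n_features)
        h, _ = self.lstm(x)               # hidden states carry long-term context
        return self.out(h).squeeze(-1)    # per-frame density estimate

# Speaking rate over a window = predicted density summed over the window,
# divided by the window duration in seconds.
```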


Journal of the Acoustical Society of America | 2016

The relationship between perceptual disturbances in dysarthric speech and automatic speech recognition performance

Ming Tu; Alan Wisler; Visar Berisha; Julie M. Liss

State-of-the-art automatic speech recognition (ASR) engines perform well on healthy speech; however, recent studies show that their performance on dysarthric speech is highly variable, owing to the acoustic variability associated with the different dysarthria subtypes. This paper aims to develop a better understanding of how perceptual disturbances in dysarthric speech relate to ASR performance. Accurate ratings of a representative set of 32 dysarthric speakers along different perceptual dimensions are obtained, and the performance of a representative ASR algorithm on the same set of speakers is analyzed. This work explores the relationship between these ratings and ASR performance and reveals that ASR performance can be predicted from perceptual disturbances in dysarthric speech, with articulatory precision contributing the most to the prediction, followed by prosody.
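
As an illustration of the analysis, a small sketch that regresses ASR error rates on perceptual ratings and reads off the coefficients. The arrays are shape-only placeholders and the dimension names are assumptions, not the paper's data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholders only: 32 speakers rated along 3 hypothetical perceptual
# dimensions, plus a per-speaker ASR word error rate. No real data here.
rng = np.random.default_rng(0)
ratings = rng.random((32, 3))       # perceptual ratings per speaker
wer = rng.random(32)                # ASR word error rate per speaker

model = LinearRegression().fit(ratings, wer)
# Coefficient magnitudes indicate how much each perceptual dimension
# contributes to predicting ASR performance (the paper finds articulatory
# precision contributes most, followed by prosody).
print(dict(zip(["articulation", "prosody", "vocal_quality"], model.coef_)))
```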


International Conference on Acoustics, Speech, and Signal Processing | 2017

Speech enhancement based on Deep Neural Networks with skip connections

Ming Tu; Xianxian Zhang

Speech enhancement under noisy conditions has long been an active research topic. In this paper, we propose a new Deep Neural Network (DNN) based architecture for speech enhancement. In contrast to a standard feedforward architecture, we add skip connections between the network's inputs and outputs, indirectly forcing the DNN to learn an ideal ratio mask. We also show that performance can be further improved by stacking multiple such network blocks. Experimental results demonstrate that the proposed architecture achieves considerably better performance than an existing method in terms of three commonly used objective measures under two real noise conditions.
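
A sketch of one plausible reading of this architecture in PyTorch: each block predicts a [0, 1] mask that gates the noisy input spectrum, so the input-to-output skip path pushes the network toward a ratio mask, and blocks can be stacked. Layer sizes and the spectral dimension are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class MaskBlock(nn.Module):
    """One enhancement block: a feedforward net predicts a [0, 1] mask
    which multiplies the noisy input (the skip connection)."""
    def __init__(self, n_bins=257, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_bins, hidden), nn.ReLU(),
            nn.Linear(hidden, n_bins), nn.Sigmoid(),
        )

    def forward(self, noisy):            # noisy: (batch, n_bins) magnitudes
        return self.net(noisy) * noisy   # skip connection: mask x input

class StackedEnhancer(nn.Module):
    """Stack several blocks, which the paper reports improves performance."""
    def __init__(self, n_blocks=3, n_bins=257):
        super().__init__()
        self.blocks = nn.ModuleList(MaskBlock(n_bins) for _ in range(n_blocks))

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x
```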


IEEE Computer Society Annual Symposium on VLSI | 2016

Reducing the Model Order of Deep Neural Networks Using Information Theory

Ming Tu; Visar Berisha; Yu Cao; Jae-sun Seo

Deep neural networks are typically represented by many more parameters than shallow models, making them prohibitive for small-footprint devices. Recent research shows that there is considerable redundancy in the parameter space of deep neural networks. In this paper, we propose a method to compress deep neural networks using the Fisher information metric, which we estimate with a stochastic optimization method that tracks second-order information in the network. We first remove unimportant parameters and then use non-uniform fixed-point quantization to assign more bits to parameters with higher Fisher information estimates. We evaluate the method on a classification task with a convolutional neural network trained on the MNIST data set. Experimental results show that our method outperforms existing approaches to both network pruning and quantization.
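
A toy sketch of the compression step, assuming per-parameter Fisher scores are already available (e.g., from an estimator like the one sketched earlier): prune the lowest-scoring fraction, then quantize the survivors with more bits where the Fisher estimate is higher. Group boundaries and bit widths are illustrative choices, not the paper's scheme.

```python
import numpy as np

def compress(params, fisher, prune_frac=0.5, bit_levels=(2, 4, 8)):
    """params, fisher: flat arrays of weights and their Fisher scores.
    Returns a pruned, non-uniformly quantized copy (simplified sketch)."""
    order = np.argsort(fisher)                      # least important first
    out = params.copy()
    n_prune = int(prune_frac * params.size)
    out[order[:n_prune]] = 0.0                      # prune unimportant weights

    kept = order[n_prune:]
    groups = np.array_split(kept, len(bit_levels))  # low -> high importance
    for idx, bits in zip(groups, bit_levels):
        if idx.size == 0:
            continue
        lo, hi = out[idx].min(), out[idx].max()
        step = (hi - lo) / max(2 ** bits - 1, 1)
        if step > 0:                                # uniform levels per group
            out[idx] = lo + np.round((out[idx] - lo) / step) * step
    return out
```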


Conference of the International Speech Communication Association | 2016

Accent identification by combining deep neural networks and recurrent neural networks trained on long and short term features

Yishan Jiao; Ming Tu; Visar Berisha; Julie M. Liss

Automatic identification of foreign accents is valuable for many speech systems, such as speech recognition, speaker identification, and voice conversion. The INTERSPEECH 2016 Native Language Sub-Challenge task is to identify the native languages of non-native English speakers from eleven countries. Since differences in accent stem from both prosodic and articulatory characteristics, this paper proposes a combination of long-term and short-term training. Each speech sample is split into multiple segments of equal length. For each segment, deep neural networks (DNNs) are trained on long-term statistical features, while recurrent neural networks (RNNs) are trained on short-term acoustic features. The prediction for each speech sample is obtained by linearly fusing the outputs of the two sets of networks over all segments. The performance of the proposed system greatly surpasses the provided baseline, and fusing its results with the baseline system improves performance further.
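
A minimal sketch of the fusion step, assuming per-segment posterior matrices from the two networks; the fusion weight `alpha` is an illustrative parameter.

```python
import numpy as np

def fuse_accent_scores(dnn_probs, rnn_probs, alpha=0.5):
    """dnn_probs, rnn_probs: (n_segments, n_languages) posteriors for one
    speech sample. Linearly fuse them, average over segments, and pick the
    highest-scoring language."""
    fused = alpha * dnn_probs + (1.0 - alpha) * rnn_probs
    sample_scores = fused.mean(axis=0)     # aggregate all segments
    return int(np.argmax(sample_scores))   # predicted native-language index
```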


Asilomar Conference on Signals, Systems and Computers | 2016

Models for objective evaluation of dysarthric speech from data annotated by multiple listeners

Ming Tu; Yishan Jiao; Visar Berisha; Julie M. Liss

In subjective evaluation of dysarthric speech, inter-rater agreement between clinicians can be low. Disagreement among clinicians results from differences in their perceptual assessment abilities, familiarity with a client, clinical experience, etc. Recently, there has been interest in developing signal processing and machine learning models for objective evaluation of subjective speech quality. In this paper, we propose a new method that addresses this problem by collecting subjective ratings from multiple evaluators and modeling the reliability of each annotator within a machine learning framework. In contrast to previous work, our model explicitly models the dependence of an evaluator's reliability on the speaker being rated. We evaluate the model in a series of experiments on a dysarthric speech database and show that our method outperforms other similar approaches.
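
A simplified, speaker-independent sketch of the reliability idea (the paper additionally makes reliability depend on the speaker): alternately estimate each speaker's consensus score as a reliability-weighted mean, and re-estimate each annotator's reliability from their residuals.

```python
import numpy as np

def weighted_consensus(ratings, n_iters=20, eps=1e-6):
    """ratings: (n_speakers, n_annotators) subjective scores.
    Returns consensus scores and normalized annotator reliabilities."""
    n_annotators = ratings.shape[1]
    reliability = np.ones(n_annotators)
    for _ in range(n_iters):
        # Consensus per speaker: reliability-weighted mean of the ratings.
        truth = ratings @ reliability / reliability.sum()
        # Annotators whose ratings sit close to the consensus are more reliable.
        residual_var = ((ratings - truth[:, None]) ** 2).mean(axis=0)
        reliability = 1.0 / (residual_var + eps)
    return truth, reliability / reliability.sum()
```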


Asilomar Conference on Signals, Systems and Computers | 2015

Estimating speaking rate in spontaneous discourse

Yishan Jiao; Visar Berisha; Ming Tu; Timothy Huston; Julie M. Liss

In this paper, we consider the problem of estimating the speaking rate directly from the speech waveform. We propose an algorithm that poses speaking rate estimation as a convex optimization problem. In contrast to existing methods, we avoid the more difficult task of detecting individual syllables within the speech signal and avoid heuristics such as thresholding a loudness function. The algorithm was evaluated on the ICSI Switchboard spontaneous speech corpus and on a speech corpus obtained from publicly available interviews on YouTube.
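
Written out as an explicit convex program, e.g., with cvxpy, the estimation looks like the following. The matrices are shape-only placeholders and the exact objective in the paper may differ.

```python
import cvxpy as cp
import numpy as np

# Placeholders: A aggregates time-frequency features over labeled intervals,
# y holds the corresponding syllable counts (40 intervals, 64 channels).
rng = np.random.default_rng(0)
A = rng.random((40, 64))
y = rng.random(40) * 10

w = cp.Variable(64, nonneg=True)                       # nonnegative weighting
problem = cp.Problem(cp.Minimize(cp.sum_squares(A @ w - y)))
problem.solve()
# w.value is the learned weighting; applying it to per-frame features gives a
# temporal density whose integral over a window estimates the speaking rate.
```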

Collaboration


Dive into Ming Tu's collaborations.

Top Co-Authors


Visar Berisha

Arizona State University


Julie M. Liss

Arizona State University


Yishan Jiao

Arizona State University


Jae-sun Seo

Arizona State University


Yu Cao

Arizona State University


Alan Wisler

Arizona State University


Maxim Bazhenov

University of California


Shimeng Yu

Arizona State University
