Thomas Fang Zheng
Tsinghua University
Publication
Featured research published by Thomas Fang Zheng.
Meeting of the Association for Computational Linguistics | 2014
Miao Fan; Deli Zhao; Qiang Zhou; Zhiyuan Liu; Thomas Fang Zheng; Edward Y. Chang
The essence of distantly supervised relation extraction is that it is an incomplete multi-label classification problem with sparse and noisy features. To tackle the sparsity and noise challenges, we propose solving the classification problem via matrix completion on a factorized matrix of minimized rank. We formulate relation classification as completing the unknown labels of testing items (entity pairs) in a sparse matrix that concatenates training and testing textual features with training labels. Our algorithmic framework is based on the assumption that the joint item-by-feature and item-by-label matrix is of low rank. We apply two optimization models to recover the underlying low-rank matrix, leveraging the sparsity of the feature-label matrix. The matrix completion problem is then solved by the fixed point continuation (FPC) algorithm, which can find the global optimum. Experiments on two widely used datasets with different dimensions of textual features demonstrate that our low-rank matrix completion approach significantly outperforms the baseline and state-of-the-art methods.
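The joint-matrix construction and the fixed-point iteration behind this idea can be sketched in a few lines. The sketch below is a simplified stand-in for the paper's FPC solver, using plain singular-value shrinkage; the matrix layout follows the description above, while the step size, threshold, iteration count, and function names are illustrative assumptions.

```python
import numpy as np

def svt(M, tau):
    """Soft-threshold the singular values of M by tau (the nuclear-norm prox)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def complete_labels(X_train, Y_train, X_test, tau=1.0, mu=0.1, n_iter=200):
    """Recover the unknown test-label block by low-rank completion of the
    joint [features | labels] matrix stacked over training and test items."""
    n_tr, n_te = X_train.shape[0], X_test.shape[0]
    d, k = X_train.shape[1], Y_train.shape[1]
    observed = np.ones((n_tr + n_te, d + k), dtype=bool)
    observed[n_tr:, d:] = False                  # test labels are unobserved
    M_obs = np.zeros((n_tr + n_te, d + k))
    M_obs[:n_tr, :d] = X_train
    M_obs[:n_tr, d:] = Y_train
    M_obs[n_tr:, :d] = X_test

    Z = M_obs.copy()
    for _ in range(n_iter):
        # Gradient step on the observed entries, then singular-value shrinkage.
        G = np.where(observed, Z - M_obs, 0.0)
        Z = svt(Z - mu * G, mu * tau)
    return Z[n_tr:, d:]                          # completed test-label scores
```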
IEEE Transactions on Audio, Speech, and Language Processing | 2016
Meng Sun; Xiongwei Zhang; Hugo Van hamme; Thomas Fang Zheng
Unseen noise estimation is a key yet challenging step in making a speech enhancement algorithm work in adverse environments. At worst, the only prior knowledge about the encountered noise is that it differs from the speech involved. Therefore, by subtracting the components that cannot be adequately represented by a well-defined speech model, the noise can be estimated and removed. Given the good performance of deep learning in signal representation, a deep auto-encoder (DAE) is employed in this work to accurately model the clean speech spectrum. In the subsequent speech enhancement stage, an extra DAE is introduced to represent the residual part obtained by subtracting the estimated clean speech spectrum (produced by the pre-trained DAE) from the noisy speech spectrum. By adjusting the estimated clean speech spectrum and the unknown parameters of the noise DAE, one can reach a stationary point that minimizes the total reconstruction error of the noisy speech spectrum. The enhanced speech signal is then obtained by transforming the estimated clean speech spectrum back into the time domain. The proposed technique is called the separable deep auto-encoder (SDAE). Given the under-determined nature of the above optimization problem, the clean speech reconstruction is confined to the convex hull spanned by a pre-trained speech dictionary. New learning algorithms are investigated to respect the non-negativity of the parameters in the SDAE. Experimental results on TIMIT with 20 noise types at various noise levels demonstrate the superiority of the proposed method over conventional baselines.
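As a rough illustration of the separable decomposition, the sketch below replaces both DAEs with a linear stand-in: the clean spectrum is constrained to non-negative combinations of a pre-trained speech dictionary, and a non-negative residual absorbs the noise, with both parts updated by projected gradient on the total reconstruction error. The dictionary `W_speech`, step size, and iteration count are illustrative assumptions, not the paper's actual models.

```python
import numpy as np

def enhance(Y, W_speech, n_iter=500, lr=1e-3):
    """Y: noisy magnitude spectrogram (freq x frames); W_speech: freq x atoms."""
    rng = np.random.default_rng(0)
    H = rng.random((W_speech.shape[1], Y.shape[1]))   # speech activations
    N = np.zeros_like(Y)                              # residual (noise) spectrum
    for _ in range(n_iter):
        R = W_speech @ H + N - Y                      # total reconstruction error
        H -= lr * (W_speech.T @ R)                    # gradient step on activations
        N -= lr * R                                   # gradient step on the residual
        H = np.maximum(H, 0.0)                        # respect non-negativity
        N = np.maximum(N, 0.0)
    return W_speech @ H                               # estimated clean spectrum
```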
International Conference on Acoustics, Speech, and Signal Processing | 2001
William Byrne; Veera Venkataramani; Terri Kamm; Thomas Fang Zheng; Zhanjiang Song; Pascale Fung; Y. Lui; Umar Ruhi
Pronunciation modeling for large vocabulary speech recognition attempts to improve recognition accuracy by identifying and modeling pronunciations that are not in the ASR system's pronunciation lexicon. Pronunciation variability in spontaneous Mandarin is studied using the newly created CASS corpus of phonetically annotated spontaneous speech. Pronunciation modeling techniques developed for English are applied to this corpus to train pronunciation models, which are then used for Mandarin broadcast news transcription.
Journal of Computer Science and Technology | 2006
Jing Li; Thomas Fang Zheng; William Byrne; Daniel Jurafsky
A framework for dialectal Chinese speech recognition is proposed and studied, in which a relatively small dialectal Chinese (in other words, Chinese influenced by the native dialect) speech corpus and dialect-related knowledge are adopted to transform a standard Chinese (Putonghua, abbreviated PTH) speech recognizer into a dialectal Chinese speech recognizer. Two kinds of knowledge sources are explored: expert knowledge and a small dialectal Chinese corpus. These sources provide information at four levels: the phonetic level, the lexicon level, the language level, and the acoustic decoder level. This paper takes Wu dialectal Chinese (WDC) as an example target language. The goal is to establish a WDC speech recognizer from an existing PTH speech recognizer, based on the Initial-Final structure of the Chinese language and a study of how dialectal Chinese speakers speak Putonghua. The authors propose using context-independent PTH-IF mappings (where IF means either a Chinese Initial or a Chinese Final), context-independent WDC-IF mappings, and syllable-dependent WDC-IF mappings (obtained from either experts or data), and combining them with supervised maximum likelihood linear regression (MLLR) acoustic model adaptation. To reduce the size of the multi-pronunciation lexicon introduced by the IF mappings, which might also increase lexicon confusion and hence degrade performance, a Multi-Pronunciation Expansion (MPE) method based on the accumulated uni-gram probability (AUP) is proposed. In addition, some commonly used WDC words are selected and added to the lexicon. Compared with the original PTH speech recognizer, the resulting WDC speech recognizer achieves a 10–18% absolute Character Error Rate (CER) reduction when recognizing WDC, with only a 0.62% CER increase when recognizing PTH. The proposed framework and methods are expected to work not only for Wu dialectal Chinese but also for other dialectal Chinese languages and even other languages.
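The AUP-based pruning step can be illustrated with a short sketch: for each word, the candidate pronunciations produced by the IF mappings are kept in order of probability until their accumulated mass reaches a threshold. The lexicon format, the 0.9 threshold, and the example entries are illustrative assumptions.

```python
def prune_lexicon(multi_pron_lexicon, aup_threshold=0.9):
    """multi_pron_lexicon: dict mapping word -> list of (pronunciation, probability).
    Keep the most probable variants of each word until their accumulated
    probability reaches the threshold."""
    pruned = {}
    for word, variants in multi_pron_lexicon.items():
        kept, acc = [], 0.0
        for pron, prob in sorted(variants, key=lambda v: v[1], reverse=True):
            kept.append(pron)
            acc += prob
            if acc >= aup_threshold:   # enough accumulated probability mass
                break
        pruned[word] = kept
    return pruned

# Hypothetical word with three candidate variants produced by the IF mappings.
lexicon = {"word_1": [("pron_a", 0.55), ("pron_b", 0.35), ("pron_c", 0.10)]}
print(prune_lexicon(lexicon))   # -> {'word_1': ['pron_a', 'pron_b']}
```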
IEEE Transactions on Signal Processing | 2013
Dong Wang; Ravichander Vipperla; Nicholas W. D. Evans; Thomas Fang Zheng
The unsupervised learning of spectro-temporal patterns within speech signals is of interest in a broad range of applications. Where patterns are non-negative and convolutive in nature, relevant learning algorithms include convolutive non-negative matrix factorization (CNMF) and its sparse alternative, convolutive non-negative sparse coding (CNSC). Both algorithms, however, place unrealistic demands on computing power and memory, which prohibits their application in large-scale tasks. This paper proposes a new online implementation of CNMF and CNSC which processes input data piece by piece and updates the learned patterns gradually with accumulated statistics. The proposed approach facilitates pattern learning with huge volumes of training data that are beyond the capability of existing alternatives. We show that, with unlimited data and computing resources, the new online learning algorithm almost surely converges to a local minimum of the objective cost function. In more realistic situations, where the amount of data is large and computing power is limited, online learning tends to obtain a lower empirical cost than conventional batch learning.
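The accumulate-and-update pattern can be illustrated with a short sketch. For brevity it shows plain (non-convolutive) NMF with multiplicative updates, accumulating the sufficient statistics needed for the basis update across data chunks; the paper's algorithm extends this scheme to convolutive bases and sparse coding, so the details below are illustrative rather than the published update rules.

```python
import numpy as np

def online_nmf(chunks, n_basis, n_inner=50, eps=1e-9, seed=0):
    """chunks: iterable of (n_features, n_frames) non-negative arrays."""
    rng = np.random.default_rng(seed)
    W = A = B = None                  # basis and accumulated H H^T / X H^T statistics
    for X in chunks:
        if W is None:
            W = rng.random((X.shape[0], n_basis)) + eps
            A = np.zeros((n_basis, n_basis))
            B = np.zeros((X.shape[0], n_basis))
        H = rng.random((n_basis, X.shape[1])) + eps
        for _ in range(n_inner):      # infer activations for this chunk only
            H *= (W.T @ X) / (W.T @ W @ H + eps)
        A += H @ H.T                  # accumulate statistics from this chunk
        B += X @ H.T
        W *= B / (W @ A + eps)        # refine basis with all statistics so far
        W = np.maximum(W, eps)
    return W
```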
International Symposium on Chinese Spoken Language Processing | 2006
Jian Liu; Thomas Fang Zheng; Wenhu Wu
In this paper, a novel pitch-mean-based frequency warping (PMFW) method is proposed to reduce pitch variability in speech signals at the front end of speech recognition. The warp factors used in this process are calculated from the average pitch of a speech segment. Two functions describing the relation between the frequency warping factor and the pitch mean are defined and compared. We use a simple method to perform frequency warping of the Mel filter-bank frequencies based on the different warping factors. To solve the problem of bandwidth mismatch between the original and warped spectra, a Mel-filter selection strategy is proposed. Finally, PMFW mel-frequency cepstral coefficients (MFCCs) are extracted following the regular MFCC procedure with several modifications. Experimental results show that the new PMFW MFCCs are more distinctive than the regular MFCCs.
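The core front-end step, mapping a segment's mean pitch to a warp factor and rescaling the Mel filter-bank frequencies, can be sketched as follows. The linear pitch-to-warp mapping, its reference pitch and slope, and the filter-bank layout are illustrative assumptions, not the two functions defined in the paper.

```python
import numpy as np

def warp_factor_from_pitch(mean_pitch_hz, ref_pitch_hz=150.0, slope=0.001):
    """Map a segment's mean pitch to a warp factor (hypothetical linear form)."""
    return 1.0 + slope * (mean_pitch_hz - ref_pitch_hz)

def warp_mel_centers(center_freqs_hz, alpha, nyquist_hz=8000.0):
    """Scale filter center frequencies by alpha, clipped to the usable bandwidth."""
    return np.clip(np.asarray(center_freqs_hz) * alpha, 0.0, nyquist_hz)

centers = np.linspace(100.0, 7600.0, 26)     # nominal Mel filter centers (Hz)
alpha = warp_factor_from_pitch(mean_pitch_hz=210.0)
print(alpha, warp_mel_centers(centers, alpha)[:3])
```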
Speech Communication | 2006
Zhenyu Xiong; Thomas Fang Zheng; Zhanjiang Song; Frank K. Soong; Wenhu Wu
We propose a tree-based kernel selection (TBKS) algorithm as a computationally efficient approach to Gaussian mixture model–universal background model (GMM–UBM) based speaker identification. All Gaussian components in the universal background model are first clustered hierarchically into a tree, and the corresponding acoustic space is mapped into structurally partitioned regions. When identifying a speaker, each test input feature vector is scored against only a small subset of all Gaussian components. As a result of this TBKS process, computational complexity can be significantly reduced. We further improve the efficiency of the proposed system by applying a previously proposed observation reordering based pruning (ORBP) to screen out unlikely candidate speakers. The approach is evaluated on a speech database of 1031 speakers, in both clean and noisy conditions. The experimental results show that by integrating TBKS and ORBP we can speed up computation by a factor of 15.8 with only a very slight degradation in identification performance, i.e., a 1% relative increase in error rate, compared with a baseline GMM–UBM system. The improved search efficiency is also robust to additive noise.
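The kernel-selection idea can be sketched with a one-level grouping of the UBM Gaussians (the paper builds a full hierarchical tree): each frame is scored only against the Gaussians in its closest clusters. Diagonal covariances, the cluster count, and the beam width `top_c` are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_groups(means, n_clusters=16, seed=0):
    """Cluster the UBM component means; return cluster centers and member indices."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(means)
    groups = [np.where(km.labels_ == c)[0] for c in range(n_clusters)]
    return km.cluster_centers_, groups

def frame_loglike(x, weights, means, variances, centers, groups, top_c=2):
    """Score one frame against only the Gaussians in the top_c closest clusters."""
    d = np.sum((centers - x) ** 2, axis=1)
    idx = np.concatenate([groups[c] for c in np.argsort(d)[:top_c]])
    diff = x - means[idx]
    log_norm = -0.5 * (means.shape[1] * np.log(2 * np.pi)
                       + np.sum(np.log(variances[idx]), axis=1))
    logp = log_norm - 0.5 * np.sum(diff * diff / variances[idx], axis=1)
    return np.log(np.sum(weights[idx] * np.exp(logp)) + 1e-300)
```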
IEEE Transactions on Audio, Speech, and Language Processing | 2016
Lantian Li; Dong Wang; Chenhao Zhang; Thomas Fang Zheng
Short utterance speaker recognition (SUSR) is highly challenging due to the limited enrollment and/or test data. We argue that the difficulty can be largely attributed to the mismatch between the prior distributions of the speech data used to train the universal background model (UBM) and of the data used for enrollment and test. This paper presents a novel solution that distributes speech signals into a multitude of acoustic subregions defined by speech units, and models speakers within these subregions. To avoid data sparsity, a data-driven approach is proposed to cluster speech units into speech unit classes, based on which robust subregion models can be constructed. Furthermore, we propose a model synthesis approach based on maximum likelihood linear regression (MLLR) to deal with speech unit classes that have no data. The experiments were conducted on the publicly available SUD12 database. The results demonstrate that on a text-independent speaker recognition task where the test utterances are no longer than 2 seconds and mostly shorter than 0.5 seconds, the proposed subregion modeling offers a 21.51% relative reduction in equal error rate (EER) compared with the standard GMM-UBM baseline. In addition, with the model synthesis approach, the performance can be greatly improved in scenarios where no enrollment data are available for some speech unit classes.
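A rough sketch of the subregion scoring step is shown below: frames are routed to speech-unit classes, and the utterance score averages the per-class speaker-versus-UBM log-likelihood ratios over the classes that actually occur. The unit-to-class mapping and the per-class model objects (anything exposing a log_likelihood(frames) method) are illustrative assumptions; the MLLR-based synthesis for empty classes is not shown.

```python
import numpy as np
from collections import defaultdict

def score_utterance(frames, frame_units, unit_to_class, spk_models, ubm_models):
    """frames: (n_frames, dim) features; frame_units: per-frame speech-unit labels."""
    by_class = defaultdict(list)
    for x, unit in zip(frames, frame_units):
        by_class[unit_to_class[unit]].append(x)   # route each frame to its subregion
    ratios = []
    for cls, xs in by_class.items():
        xs = np.asarray(xs)
        ratios.append(spk_models[cls].log_likelihood(xs)
                      - ubm_models[cls].log_likelihood(xs))
    return float(np.mean(ratios))                 # average per-class likelihood ratio
```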
International Conference on Audio, Language and Image Processing | 2010
Jue Hou; Yi Liu; Thomas Fang Zheng; Jesper Olsen; Jilei Tian
In this paper, we propose a multi-layered feature combination approach associated with a support vector machine (SVM) for Chinese accent identification. The multi-layered features include both segmental and supra-segmental information, such as MFCCs and the pitch contour, to capture the diversity of variations in Chinese accented speech. The pitch contour is estimated using a cubic polynomial method to model the varying characteristics of different Chinese accents. We train two GMM acoustic models to express the features of a given accent. Since the original criterion of the GMM model cannot handle such multi-layered features, an SVM is used to make the decision. The effectiveness of the proposed approach was evaluated on the 863 Chinese accent corpus. Our approach yields a significant 10% relative error rate reduction in Chinese accented speech identification compared with traditional approaches that use a single feature at a single level.
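The feature-combination step can be sketched as follows: a cubic polynomial is fitted to the voiced part of the pitch contour (the supra-segmental layer), its coefficients are concatenated with segment-level MFCC statistics (the segmental layer), and an SVM is trained on the combined vectors. The specific statistics, SVM settings, and data layout are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def segment_features(mfcc, pitch):
    """mfcc: (frames, n_ceps); pitch: (frames,) with 0 for unvoiced frames."""
    voiced = pitch > 0
    t = np.linspace(0.0, 1.0, len(pitch))
    # Cubic polynomial fit to the voiced pitch contour (4 coefficients).
    coeffs = np.polyfit(t[voiced], pitch[voiced], deg=3) if voiced.sum() > 3 else np.zeros(4)
    return np.concatenate([mfcc.mean(axis=0), mfcc.std(axis=0), coeffs])

def train_accent_classifier(segments, labels):
    """segments: list of (mfcc, pitch) pairs; labels: accent class per segment."""
    X = np.stack([segment_features(m, p) for m, p in segments])
    return SVC(kernel="rbf", C=1.0).fit(X, labels)
```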
Web Intelligence | 2015
Miao Fan; Qiang Zhou; Thomas Fang Zheng; Ralph Grishman
The traditional way of storing facts as triplets (head_entity, relation, tail_entity), abbreviated as (h, r, t), allows knowledge to be intuitively displayed and easily acquired by human beings, but it can hardly be computed with, or even reasoned about, by machines. Inspired by the success of applying distributed representations to AI-related fields, recent studies represent each entity and relation with a unique low-dimensional embedding, a departure from the symbolic and atomic framework of displaying knowledge in triplets. In this way, knowledge computing and reasoning can be greatly facilitated by means of a simple vector calculation, i.e. h + r ≈ t. We contribute an effective model that learns better embeddings satisfying this formula by pulling the positive tail entities t+ together and close to h + r (Nearest Neighbor), while simultaneously pushing the negatives t− away from the positives t+ by keeping a Large Margin. We also design a corresponding learning algorithm that efficiently finds the optimal solution with Stochastic Gradient Descent in an iterative fashion. Quantitative experiments illustrate that our approach achieves state-of-the-art performance compared with several recent methods on benchmark datasets for two classical applications, i.e. link prediction and triplet classification. Moreover, we analyze the parameter complexities of all the evaluated models, and the analysis indicates that our model needs fewer computational resources while outperforming the other methods.
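The translation-based objective h + r ≈ t with a large margin can be illustrated with a TransE-style training sketch: for each true triplet a corrupted tail is sampled, and an SGD step is taken whenever the margin between the true and corrupted distances is violated. The embedding dimension, margin, learning rate, and corruption scheme are illustrative assumptions, and the paper's specific nearest-neighbor pull / large-margin push formulation differs in detail.

```python
import numpy as np

def train_embeddings(triplets, n_entities, n_relations, dim=50, margin=1.0,
                     lr=0.01, epochs=100, seed=0):
    """triplets: list of (head_id, relation_id, tail_id) integer tuples."""
    rng = np.random.default_rng(seed)
    E = rng.normal(scale=0.1, size=(n_entities, dim))    # entity embeddings
    R = rng.normal(scale=0.1, size=(n_relations, dim))   # relation embeddings
    for _ in range(epochs):
        for h, r, t in triplets:
            t_neg = rng.integers(n_entities)              # corrupted tail
            d_pos = E[h] + R[r] - E[t]
            d_neg = E[h] + R[r] - E[t_neg]
            loss = margin + np.sum(d_pos ** 2) - np.sum(d_neg ** 2)
            if loss > 0:                                  # margin violated: take a step
                E[h] -= lr * 2 * (d_pos - d_neg)
                R[r] -= lr * 2 * (d_pos - d_neg)
                E[t] += lr * 2 * d_pos
                E[t_neg] -= lr * 2 * d_neg
            E[h] /= max(np.linalg.norm(E[h]), 1.0)        # keep entity norms bounded
            E[t] /= max(np.linalg.norm(E[t]), 1.0)
    return E, R
```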