Publication


Featured research published by Tom Ko.


Speech Communication | 2014

Eigentrigraphemes for under-resourced languages

Tom Ko; Brian Mak

Grapheme-based modeling has an advantage over phone-based modeling in automatic speech recognition for under-resourced languages when a good dictionary is not available. Recently we proposed a new method for parameter estimation of context-dependent hidden Markov models (HMMs) called eigentriphone modeling. Eigentriphone modeling outperforms conventional tied-state HMMs by eliminating the quantization errors among the tied states. The eigentriphone modeling framework is very flexible and can be applied to any group of modeling units, provided that they can be represented by vectors of the same dimension. In this paper, we port the eigentriphone modeling method from a phone-based system to a grapheme-based system; we call the new method eigentrigrapheme modeling. Experiments on four official South African under-resourced languages (Afrikaans, South African English, Sesotho, siSwati) show that the new eigentrigrapheme modeling method reduces the word error rates of conventional tied-state trigrapheme modeling by an average of 4.08% relative.
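
Since a grapheme-based system derives pronunciations directly from spelling, the dictionary step can be automated. The sketch below illustrates that idea; the example words and the trigrapheme notation are illustrative choices, not details from the paper:

```python
# Minimal sketch: derive grapheme "pronunciations" from spellings and
# expand them into context-dependent trigrapheme units.
# The example words below are illustrative, not taken from the paper.

def grapheme_pronunciation(word):
    """A word's grapheme sequence is simply its letters."""
    return list(word.lower())

def to_trigraphemes(graphemes):
    """Expand a grapheme sequence into left-context/unit/right-context triples."""
    padded = ["<s>"] + graphemes + ["</s>"]
    return [f"{padded[i-1]}-{padded[i]}+{padded[i+1]}"
            for i in range(1, len(padded) - 1)]

for word in ["taal", "leloko"]:   # Afrikaans / Sesotho examples (illustrative)
    print(word, "->", to_trigraphemes(grapheme_pronunciation(word)))
```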


IEEE Transactions on Audio, Speech, and Language Processing | 2013

Eigentriphones for Context-Dependent Acoustic Modeling

Tom Ko; Brian Mak

Most automatic speech recognizers employ tied-state triphone hidden Markov models (HMMs), in which the corresponding triphone states of the same base phone are tied. State tying is commonly performed with the use of a phonetic regression class tree, which makes robust context-dependent modeling possible by carefully balancing the amount of training data with the degree of tying. However, tying inevitably introduces quantization error: triphones tied to the same state are not distinguishable in that state. Recently we proposed a new triphone modeling approach called eigentriphone modeling in which all triphone models are, in general, distinct. The idea is to create an eigenbasis for each base phone (or phone state) so that all its triphones (or triphone states) are represented as distinct points in the space spanned by the basis. We have shown that triphone HMMs trained using model-based or state-based eigentriphones perform at least as well as conventional tied-state HMMs. In this paper, we further generalize the definition of eigentriphones over clusters of acoustic units. Our experiments on TIMIT phone recognition and the Wall Street Journal 5K-vocabulary continuous speech recognition task show that eigentriphones estimated from state clusters defined by the nodes of the same phonetic regression class tree used in state tying result in a further performance gain.
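
The construction can be sketched compactly: for one base phone, stack the mean supervectors of its triphone states, derive an eigenbasis by PCA, and give every triphone state its own coordinates in that basis. A minimal sketch, assuming the supervectors have already been extracted from trained HMMs (the sizes and random data are placeholders):

```python
import numpy as np

# Illustrative sketch of the eigentriphone construction for one base phone.
rng = np.random.default_rng(0)
n_triphones, dim = 200, 117            # placeholder sizes, not from the paper
supervectors = rng.normal(size=(n_triphones, dim))

mean = supervectors.mean(axis=0)
centered = supervectors - mean

# PCA via SVD; the rows of `eigentriphones` are the basis vectors.
_, _, vt = np.linalg.svd(centered, full_matrices=False)
k = 20                                  # number of eigentriphones kept
eigentriphones = vt[:k]                 # shape (k, dim)

# Coordinates of each triphone state in the basis: no two need coincide,
# so there is no quantization error from tying.
coords = centered @ eigentriphones.T    # shape (n_triphones, k)
reconstructed = mean + coords @ eigentriphones
err = np.linalg.norm(reconstructed - supervectors) / np.linalg.norm(supervectors)
print(f"relative reconstruction error with k={k}: {err:.3f}")
```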


International Conference on Acoustics, Speech, and Signal Processing | 2011

Eigentriphones: A basis for context-dependent acoustic modeling

Tom Ko; Brian Mak

In context-dependent acoustic modeling, it is important to strike a balance between detailed modeling and data sufficiency for robust estimation of model parameters. In the past, parameter sharing or tying has been one of the most common techniques for solving the problem. In recent years, another family of techniques, which may be loosely and collectively called the subspace approach, tries to express a phonetic or sub-phonetic unit in terms of a small set of canonical vectors or units. In this paper, we investigate the development of an eigenbasis over the triphones and model each triphone as a point in the space spanned by the basis. We call the eigenvectors in the basis eigentriphones. From another perspective, we investigate the use of the eigenvoice adaptation method as a general acoustic modeling method for training triphones, especially the less frequent ones, without tying their states, so that all the triphones are truly distinct from each other and thus may be more discriminative. Experimental evaluation on the 5K-vocabulary Hub2 recognition task shows that a triphone HMM system trained using only eigentriphones without state tying may achieve slightly better performance than the common tied-state triphones.
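
Viewed as eigenvoice-style adaptation, an infrequent triphone's point in the basis can be found by fitting its data-poor statistics to the eigenvectors. A toy sketch using a ridge-regularized least-squares fit, with all quantities invented:

```python
import numpy as np

# Toy sketch: estimate an infrequent triphone's coordinates in a fixed
# eigenbasis by ridge-regularized least squares (all data is invented).
rng = np.random.default_rng(1)
k, dim = 10, 117
basis = rng.normal(size=(k, dim))   # eigentriphones from frequent triphones
target = rng.normal(size=dim)       # noisy mean statistics of a rare triphone

lam = 0.1                           # regularizer guards against overfitting
A = basis @ basis.T + lam * np.eye(k)
b = basis @ target
weights = np.linalg.solve(A, b)     # the triphone's point in the basis

adapted_mean = weights @ basis      # a distinct, untied mean supervector
print(weights.round(3))
```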


International Conference on Acoustics, Speech, and Signal Processing | 2012

Derivation of eigentriphones by weighted principal component analysis

Tom Ko; Brian Mak

Last year we proposed a new acoustic modeling method called eigentriphones in which all triphones are distinct (with no tied states) so that they may be more discriminative. In our method, frequent triphones are used to derive an eigenbasis using PCA, and the infrequent triphones are then "adapted" as a linear combination of the eigenvectors, which are also called eigentriphones. Although the eigentriphone method compares favorably with traditional tied-state triphones, the PCA procedure has two limitations: (1) only the frequent triphones are employed, and (2) they are treated as "equal" even though some are more robust than others. In this paper, weighted PCA is proposed to solve both problems so that all triphones, frequent and infrequent, may contribute to the derivation of the eigentriphones, each to an extent that depends on its sample count. Experimental evaluation on the WSJ 5K-vocabulary speech recognition task shows that weighted PCA produces better models than simple PCA, and its performance is fairly independent of the number of eigentriphones once more than 20% of them are used. As a consequence, all triphones may be represented by fewer eigentriphones, resulting in a more compact model.
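
A minimal sketch of the weighted-PCA step: each triphone's contribution to the scatter matrix is scaled by its normalized sample count before eigendecomposition. The counts and dimensions below are invented placeholders:

```python
import numpy as np

# Sketch of weighted PCA: every triphone, frequent or not, contributes to
# the eigenbasis in proportion to its sample count (all data is invented).
rng = np.random.default_rng(2)
n, dim = 300, 117
supervectors = rng.normal(size=(n, dim))
counts = rng.integers(1, 1000, size=n).astype(float)  # per-triphone counts

w = counts / counts.sum()            # normalized weights
mean = w @ supervectors              # count-weighted mean
centered = supervectors - mean

# Count-weighted scatter matrix; its leading eigenvectors are the
# eigentriphones.
scatter = (centered * w[:, None]).T @ centered
eigvals, eigvecs = np.linalg.eigh(scatter)            # ascending order
k = 30
eigentriphones = eigvecs[:, ::-1][:, :k].T            # top-k, shape (k, dim)
print(eigentriphones.shape)
```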


International Conference on Acoustics, Speech, and Signal Processing | 2010

Improving speech recognition by explicit modeling of phone deletions

Tom Ko; Brian Mak

In a 1998 paper, Greenberg reported that in conversational speech the phone deletion rate may be as high as 12%, whereas the syllable deletion rate is only about 1%. The finding prompted a new research direction of syllable modeling for speech recognition. To date, the syllable approach has not yet fulfilled its promise. On the other hand, there have been few attempts to model phone deletions explicitly in current ASR systems. In this paper, fragmented word models are derived from well-trained cross-word triphone models, and phone deletion is implemented by skip arcs for words consisting of at least four phonemes. An evaluation on the CSR-II WSJ1 Hub2 5K task shows that even with this limited implementation of phone deletions in read speech, we obtained a word error rate reduction of 6.73%.
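
The skip-arc idea can be pictured as a small graph construction: alongside each internal phone's normal arc, an extra epsilon arc lets the decoder bypass that phone. A toy sketch follows; restricting deletion to word-internal phones and the example pronunciation are assumptions of this illustration, not details from the paper:

```python
# Toy sketch of modeling phone deletion with skip arcs in a word's phone
# graph. States 0..N sit between phones; arc (i, i+1) emits phones[i], and
# an extra epsilon arc over the same state pair lets the decoder delete
# that phone. Words shorter than four phonemes get no skips, as in the
# paper; keeping first/last phones mandatory is an assumption here.

def phone_graph_with_skips(phones, min_len=4):
    """Return (n_states, arcs); each arc is (src, dst, label)."""
    n = len(phones)
    arcs = [(i, i + 1, p) for i, p in enumerate(phones)]  # normal path
    if n >= min_len:
        for i in range(1, n - 1):              # internal phones only
            arcs.append((i, i + 1, "<eps>"))   # deletes phones[i]
    return n + 1, arcs

states, arcs = phone_graph_with_skips(["p", "r", "aa", "b", "l", "ax", "m"])
for arc in sorted(arcs):
    print(arc)
```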


International Conference on Acoustics, Speech, and Signal Processing | 2014

Subspace Gaussian mixture model with state-dependent subspace dimensions

Tom Ko; Brian Mak; Cheung-Chi Leung

In recent years, under the hidden Markov modeling (HMM) framework, the use of subspace Gaussian mixture models (SGMMs) has demonstrated better recognition performance than traditional Gaussian mixture models (GMMs) in automatic speech recognition. In the state-of-the-art SGMM formulation, a fixed subspace dimension is assigned to every phone state. While a constant subspace dimension is easier to implement, it may lead to overfitting or underfitting of some state models, as the data is usually distributed unevenly among the states. In a later extension of SGMM, states are split into sub-states with an appropriate objective function so that the problem is eased by increasing the state-specific parameters of the underfitting states. In this paper, we propose another solution: we allow each sub-state to have a different subspace dimension depending on its number of training frames so that the state-specific parameters can be robustly estimated. Experimental evaluation on the Switchboard recognition task shows that our proposed method improves the existing SGMM training procedure.
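
The proposed remedy reduces to a simple rule: pick each sub-state's subspace dimension from its training-frame count, so data-poor sub-states carry fewer state-specific parameters. A sketch with an invented threshold schedule (not the paper's actual settings):

```python
# Sketch: assign each SGMM sub-state a subspace dimension based on its
# training-frame count. The thresholds and dimensions are invented
# placeholders, not the schedule used in the paper.

def subspace_dim(n_frames, full_dim=50):
    """More frames allow a larger, more robustly estimable sub-state vector."""
    schedule = [(500, 10), (2000, 20), (10000, 35)]
    for max_frames, dim in schedule:
        if n_frames < max_frames:
            return dim
    return full_dim

for frames in (120, 1500, 8000, 60000):
    print(frames, "->", subspace_dim(frames))
```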


International Symposium on Chinese Spoken Language Processing | 2010

Problems of modeling phone deletion in conversational speech for speech recognition

Brian Mak; Tom Ko

Recently we proposed a novel method to explicitly model the phone deletion phenomenon in speech, and introduced the context-dependent fragmented word model (CD-FWM). An evaluation on the WSJ1 Hub2 5K task shows that even in read speech, CD-FWM could reduce the word error rate (WER) by a relative 10.3%. Since phone deletion is generally expected to be more pronounced in conversational and spontaneous speech than in read speech, in this paper we extend our investigation of phone deletion modeling to conversational speech, applying CD-FWM to the SVitchboard 500-word task. To our surprise, a much smaller recognition gain is obtained. Through a series of analyses, we present some plausible explanations for why phone deletion modeling is more successful in read speech than in conversational speech, and suggest future directions for improving CD-FWM for recognizing conversational speech.


Conference of the International Speech Communication Association | 2015

Audio augmentation for speech recognition

Tom Ko; Vijayaditya Peddinti; Daniel Povey; Sanjeev Khudanpur


Conference of the International Speech Communication Association | 2011

A Fully Automated Derivation of State-based Eigentriphones for Triphone Modeling with No Tied States Using Regularization

Tom Ko; Brian Mak


Conference of the International Speech Communication Association | 2009

Automatic estimation of decoding parameters using large-margin iterative linear programming

Brian Mak; Tom Ko

Collaboration


Dive into Tom Ko's collaborations. Top co-authors:

Brian Mak (Hong Kong University of Science and Technology)
Daniel Povey (Johns Hopkins University)
Dongpeng Chen (Hong Kong University of Science and Technology)
Yingke Zhu (Hong Kong University of Science and Technology)
David Snyder (Johns Hopkins University)