
Publication


Featured research published by Brian Mak.


IEEE Transactions on Speech and Audio Processing | 1994

A robust algorithm for word boundary detection in the presence of noise

Jean-Claude Junqua; Brian Mak; Ben Reaves

The authors address the problem of automatic word boundary detection in quiet and in the presence of noise. Attention has been given to automatic word boundary detection for both additive noise and noise-induced changes in the talker's speech production (the Lombard reflex). After a comparison of several automatic word boundary detection algorithms in different noisy-Lombard conditions, they propose a new algorithm that is robust in the presence of noise. This new algorithm identifies islands of reliability (essentially the portion of speech contained between the first and the last vowel) using time- and frequency-based features and then, after a noise classification, applies a noise-adaptive procedure to refine the boundaries. It is shown that this new algorithm outperforms the commonly used algorithm developed by Lamel et al. (1981) and several other recently developed methods. They evaluated the average recognition error rate due to word boundary detection in an HMM-based recognition system across several signal-to-noise ratios and noise conditions. The recognition error rate decreased to about 20%, compared to an average of approximately 50% obtained with a modified version of the Lamel et al. algorithm.
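
For orientation, the sketch below shows a bare-bones log-energy endpoint detector of the kind such algorithms improve upon; it is only an illustration of the baseline idea, not the authors' algorithm, and omits the islands-of-reliability detection and noise-adaptive refinement described above. The frame size and threshold ratio are arbitrary choices.

import numpy as np

def detect_endpoints(signal, sample_rate, frame_ms=10, threshold_ratio=0.15):
    """Crude log-energy endpoint detector (illustration only)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = np.asarray(signal[:n_frames * frame_len], dtype=float).reshape(n_frames, frame_len)
    log_energy = np.log(np.sum(frames ** 2, axis=1) + 1e-10)
    # Call a frame "speech" if its log energy clears a fraction of the dynamic range.
    threshold = log_energy.min() + threshold_ratio * (log_energy.max() - log_energy.min())
    speech_frames = np.flatnonzero(log_energy > threshold)
    if speech_frames.size == 0:
        return None
    start = speech_frames[0] * frame_len / sample_rate
    end = (speech_frames[-1] + 1) * frame_len / sample_rate
    return start, end  # estimated word boundaries in seconds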


IEEE Transactions on Speech and Audio Processing | 2001

Subspace distribution clustering hidden Markov model

Enrico Bocchieri; Brian Mak

Most contemporary laboratory recognizers require too much memory to run, and are too slow for mass applications. One major cause of the problem is the large parameter space of their acoustic models. In this paper, we propose a new acoustic modeling methodology which we call subspace distribution clustering hidden Markov modeling (SDCHMM), with the aim of achieving much more compact acoustic models. The theory of SDCHMM is based on tying the parameters of a new unit, namely the subspace distribution, of continuous density hidden Markov models (CDHMMs). SDCHMMs can be converted from CDHMMs by projecting the distributions of the CDHMMs onto orthogonal subspaces, and then tying similar subspace distributions over all states and all acoustic models in each subspace. By exploiting the combinatorial effect of subspace distribution encoding, all original full-space distributions can be represented by combinations of a small number of subspace distribution prototypes. Consequently, there is a great reduction in the number of model parameters, and thus substantial savings in memory and computation. This renders SDCHMM very attractive in the practical implementation of acoustic models. Evaluation on the Airline Travel Information System (ATIS) task shows that, in comparison to its parent CDHMM system, a converted SDCHMM system achieves a seven- to 18-fold reduction in memory requirement for acoustic models and runs 30%-60% faster without any loss of recognition accuracy.
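
A minimal sketch of the factorization behind the tying, in generic notation rather than the paper's: with diagonal covariances and K orthogonal feature subspaces, a state-observation density factors as

\[
b_j(\mathbf{o}) \;=\; \sum_{m} c_{jm} \prod_{k=1}^{K}
\mathcal{N}\!\left(\mathbf{o}^{(k)};\, \boldsymbol{\mu}_{jm}^{(k)},\, \boldsymbol{\Sigma}_{jm}^{(k)}\right),
\]

and SDCHMM tying replaces each subspace Gaussian \(\mathcal{N}(\,\cdot\,;\boldsymbol{\mu}_{jm}^{(k)},\boldsymbol{\Sigma}_{jm}^{(k)})\) with a prototype from a small pool shared across all states \(j\) and mixture components \(m\) within subspace \(k\). As an illustrative count (the actual stream and codebook sizes are the paper's choices, not reproduced here), 64 prototypes in each of 20 subspaces can index \(64^{20}\) distinct full-space densities while storing only \(20 \times 64\) subspace Gaussians.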


International Conference on Spoken Language Processing | 1996

Phone clustering using the Bhattacharyya distance

Brian Mak; Etienne Barnard

The authors study the use of the classification-based Bhattacharyya distance measure to guide biphone clustering. The Bhattacharyya distance is a theoretical distance measure between two Gaussian distributions which is equivalent to an upper bound on the optimal Bayesian classification error probability. It also has the desirable properties of being computationally simple and extensible to more Gaussian mixtures. Using the Bhattacharyya distance measure in a data-driven approach, together with a novel a-level agglomerative hierarchical biphone clustering algorithm, generalized left/right biphones (BGBs) are derived. A neural-net-based phone recognizer trained on the BGBs is found to have better frame-level phone recognition than one trained on broad-category generalized biphones (BCGBs) derived from a set of commonly used broad categories. They further evaluate the new BGBs on an isolated-word recognition task of perplexity 40 and obtain a 16.2% error reduction over the BCGBs and a 41.8% error reduction over the monophones.
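
For reference, the Bhattacharyya distance between two Gaussians \(\mathcal{N}(\boldsymbol{\mu}_1,\boldsymbol{\Sigma}_1)\) and \(\mathcal{N}(\boldsymbol{\mu}_2,\boldsymbol{\Sigma}_2)\) takes the standard form (not specific to this paper)

\[
D_B \;=\; \frac{1}{8}\,(\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)^{\top}\,\bar{\boldsymbol{\Sigma}}^{-1}\,(\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)
\;+\; \frac{1}{2}\ln\frac{|\bar{\boldsymbol{\Sigma}}|}{\sqrt{|\boldsymbol{\Sigma}_1|\,|\boldsymbol{\Sigma}_2|}},
\qquad \bar{\boldsymbol{\Sigma}} = \frac{\boldsymbol{\Sigma}_1+\boldsymbol{\Sigma}_2}{2},
\]

and for two equiprobable classes the Bayes error is bounded above by \(\tfrac{1}{2}e^{-D_B}\), which is the sense in which a larger distance guarantees a smaller worst-case classification error between the two phone models.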


IEEE Transactions on Speech and Audio Processing | 1995

Tone recognition of isolated Cantonese syllables

Tan Lee; P. C. Ching; Lai-Wan Chan; Y. H. Cheng; Brian Mak

Tone identification is essential for the recognition of the Chinese language, specifically for Cantonese, which is well known for being very rich in tones. The paper presents an efficient method for tone recognition of isolated Cantonese syllables. Suprasegmental feature parameters are extracted from the voiced portion of a monosyllabic utterance, and a three-layer feedforward neural network is used to classify these feature vectors. Using a phonologically complete vocabulary of 234 distinct syllables, the recognition accuracy for single-speaker and multispeaker recognition is 89.0% and 87.6%, respectively.
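
A minimal sketch of the classification stage is shown below; the feature dimension, hidden-layer width, and six-way tone inventory are placeholder assumptions, not the paper's configuration, and the weights are untrained stand-ins for a network trained on labeled syllables.

import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Assumed sizes: a suprasegmental feature vector (e.g., sampled F0 contour plus
# duration/energy cues) mapped to a hypothetical six-way tone inventory.
n_in, n_hidden, n_tones = 16, 32, 6

W1 = 0.1 * rng.standard_normal((n_in, n_hidden))
b1 = np.zeros(n_hidden)
W2 = 0.1 * rng.standard_normal((n_hidden, n_tones))
b2 = np.zeros(n_tones)

def classify_tone(x):
    h = np.tanh(x @ W1 + b1)        # hidden layer
    return softmax(h @ W2 + b2)     # posterior over tone classes

probs = classify_tone(rng.standard_normal(n_in))
print("predicted tone index:", int(np.argmax(probs)))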


International Conference on Acoustics, Speech, and Signal Processing | 1996

The contribution of consonants versus vowels to word recognition in fluent speech

Ronald A. Cole; Yonghong Yan; Brian Mak; Mark A. Fanty; Troy Bailey

Three perceptual experiments were conducted to test the relative importance of vowels vs. consonants to recognition of fluent speech. Sentences were selected from the TIMIT corpus to obtain approximately equal numbers of vowels and consonants within each sentence and equal durations across the set of sentences. In experiments 1 and 2, subjects listened to (a) unaltered TIMIT sentences; (b) sentences in which all of the vowels were replaced by noise; or (c) sentences in which all of the consonants were replaced by noise. The subjects listened to each sentence five times and attempted to transcribe what they heard. The results of these experiments show that recognition of words depends more upon vowels than consonants: about twice as many words are recognized when vowels are retained in the speech. The effect was observed whether occurrences of [l], [r], [w], [y], [m], and [n] were included in the sentences (experiment 1) or replaced by noise (experiment 2). Experiment 3 tested the hypothesis that vowel boundaries contain more information about the neighboring consonants than vice versa.


IEEE Transactions on Speech and Audio Processing | 2005

Kernel eigenvoice speaker adaptation

Brian Mak; James Tin-Yau Kwok; Simon Ka-Lung Ho

Eigenvoice-based methods have been shown to be effective for fast speaker adaptation when only a small amount of adaptation data, say, less than 10 s, is available. At the heart of the method is principal component analysis (PCA), employed to find the most important eigenvoices. In this paper, we postulate that nonlinear PCA using kernel methods may be even more effective. The eigenvoices thus derived will be called kernel eigenvoices (KEV), and we will call our new adaptation method kernel eigenvoice speaker adaptation. However, unlike the standard eigenvoice (EV) method, an adapted speaker model found by the kernel eigenvoice method resides in the high-dimensional kernel-induced feature space, which, in general, cannot be mapped back to an exact preimage in the input speaker supervector space. Consequently, it is not clear how to obtain the constituent Gaussians of the adapted model that are needed for the computation of state observation likelihoods during the estimation of eigenvoice weights and subsequent decoding. Our solution is the use of composite kernels in such a way that state observation likelihoods can be computed using only kernel functions, without the need of a speaker-adapted model in the input supervector space. In this paper, we investigate two different composite kernels for KEV adaptation: the direct sum kernel and the tensor product kernel. In an evaluation on the TIDIGITS task, it is found that KEV speaker adaptation using either form of composite Gaussian kernel is equally effective, and it outperforms a speaker-independent model and adapted models found by EV, MAP, or MLLR adaptation using 2.1 and 4.1 s of speech. For example, with 2.1 s of adaptation data, KEV adaptation outperforms the speaker-independent model by 27.5%, whereas EV, MAP, and MLLR adaptation are not effective at all.
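
In generic notation (splitting each speaker supervector \(\mathbf{x}\) into its per-Gaussian blocks \(\mathbf{x}^{(1)},\dots,\mathbf{x}^{(R)}\); the symbols here are a common rendering, not copied from the paper), the two composite kernels take the forms

\[
k_{\oplus}(\mathbf{x},\mathbf{y}) \;=\; \sum_{r=1}^{R} k_r\!\left(\mathbf{x}^{(r)},\mathbf{y}^{(r)}\right),
\qquad
k_{\otimes}(\mathbf{x},\mathbf{y}) \;=\; \prod_{r=1}^{R} k_r\!\left(\mathbf{x}^{(r)},\mathbf{y}^{(r)}\right),
\]

with each base kernel \(k_r\) a Gaussian kernel such as \(k_r(\mathbf{a},\mathbf{b}) = \exp(-\beta_r\|\mathbf{a}-\mathbf{b}\|^2)\). Because the blocks align with the constituent Gaussians, the state observation likelihoods needed during weight estimation and decoding can be assembled from the \(k_r\) terms without reconstructing a supervector preimage, which is the point of the composite construction described above.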


International Conference on Acoustics, Speech, and Signal Processing | 2014

Joint acoustic modeling of triphones and trigraphemes by multi-task learning deep neural networks for low-resource speech recognition

Dongpeng Chen; Brian Mak; Cheung-Chi Leung; Sunil Sivadas

It is well known in machine learning that multitask learning (MTL) can help improve the generalization performance of individual learning tasks if the tasks being trained in parallel are related, especially when the amount of training data is relatively small. In this paper, we investigate the estimation of triphone acoustic models in parallel with the estimation of trigrapheme acoustic models under the MTL framework using deep neural networks (DNNs). As triphone modeling and trigrapheme modeling are highly related learning tasks, a better shared internal representation (the hidden layers) can be learned to improve their generalization performance. Experimental evaluation on three low-resource South African languages shows that triphone DNNs trained by the MTL approach perform significantly better than triphone DNNs trained by the single-task learning (STL) approach, by ~3-13%. The MTL-DNN triphone models also outperform the ROVER result that combines a triphone STL-DNN and a trigrapheme STL-DNN.
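
A minimal PyTorch-style sketch of the shared-hidden-layer arrangement follows; the layer sizes, activation, and state counts are placeholders rather than the configuration used in the paper. Only the two output layers are task-specific, so gradients from both losses update the shared stack, which is where the cross-task regularization comes from.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MTLAcousticDNN(nn.Module):
    """Shared hidden layers with two task-specific softmax heads
    (triphone states and trigrapheme states). Sizes are illustrative."""
    def __init__(self, n_input=440, n_hidden=1024, n_layers=4,
                 n_triphone_states=2000, n_trigrapheme_states=1500):
        super().__init__()
        layers, dim = [], n_input
        for _ in range(n_layers):
            layers += [nn.Linear(dim, n_hidden), nn.Sigmoid()]
            dim = n_hidden
        self.shared = nn.Sequential(*layers)
        self.triphone_head = nn.Linear(n_hidden, n_triphone_states)
        self.trigrapheme_head = nn.Linear(n_hidden, n_trigrapheme_states)

    def forward(self, x):
        h = self.shared(x)
        return self.triphone_head(h), self.trigrapheme_head(h)

model = MTLAcousticDNN()
feats = torch.randn(8, 440)                       # a batch of acoustic frames
tri_logits, graph_logits = model(feats)
tri_targets = torch.randint(0, 2000, (8,))        # dummy state labels
graph_targets = torch.randint(0, 1500, (8,))
# Joint objective: sum (or weighted sum) of the two cross-entropy losses.
loss = F.cross_entropy(tri_logits, tri_targets) + \
       F.cross_entropy(graph_logits, graph_targets)
loss.backward()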


International Conference on Acoustics, Speech, and Signal Processing | 2006

Improving Reference Speaker Weighting Adaptation by the Use of Maximum-Likelihood Reference Speakers

Brian Mak; Tsz-Chung Lai; Roger Hsiao

We would like to revisit a simple fast adaptation technique called reference speaker weighting (RSW). RSW is similar to eigenvoice (EV) adaptation, and simply requires the model of a new speaker to lie in the span of a set of reference speaker vectors. In the original RSW, the reference speakers are computed through a hierarchical speaker clustering (HSC) algorithm using information such as the gender and speaking rate. We show in this paper that RSW adaptation may be improved if those training speakers that have the highest likelihoods of the adaptation data are selected as the reference speakers; we call them the maximum-likelihood (ML) reference speakers. When RSW adaptation was evaluated on WSJ0 using 5 s of adaptation speech, the word error rate reduction was boosted from 2.54% to 9.15% by using 10 ML reference speakers instead of reference speakers determined from HSC. Moreover, when compared with EV, MAP, MLLR, and eKEV on fast adaptation, we were surprised that the algorithmically simplest RSW technique actually gave the best performance.
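
In outline (generic notation, not the paper's), RSW constrains the adapted speaker supervector to

\[
\mathbf{s} \;=\; \sum_{r=1}^{R} w_r\, \mathbf{s}_r,
\]

where the \(\mathbf{s}_r\) are reference speaker supervectors and the weights \(w_r\) are estimated by maximizing the likelihood of the adaptation data. The modification studied here ranks all training speakers by that same adaptation-data likelihood under their individual models and keeps the top R (R = 10 in the experiment above) as references, instead of choosing them by hierarchical speaker clustering.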


International Conference on Acoustics, Speech, and Signal Processing | 2004

A study of various composite kernels for kernel eigenvoice speaker adaptation

Brian Mak; James Tin-Yau Kwok; Simon Ka-Lung Ho

Eigenvoice-based methods have been shown to be effective for fast speaker adaptation when the amount of adaptation data is small, say, less than 10 seconds. In traditional eigenvoice (EV) speaker adaptation, linear principal component analysis (PCA) is used to derive the eigenvoices. Recently, we proposed that eigenvoices found by nonlinear kernel PCA could be more effective, and the eigenvoices thus derived were called kernel eigenvoices (KEV). One of our novelties is the use of a composite kernel that makes it possible to compute state observation likelihoods via kernel functions. We investigate two different composite kernels, the direct sum kernel and the tensor product kernel, for KEV adaptation. In an evaluation on the TIDIGITS task, it is found that KEV speaker adaptation using either form of composite kernel is equally effective, and it outperforms a speaker-independent model and the adapted models from EV, MAP, or MLLR adaptation using 2.1 s and 4.1 s of speech. For example, with 2.1 s of adaptation data, KEV adaptation outperforms the speaker-independent model by 27.5%, whereas EV, MAP, and MLLR adaptations are not effective at all.


IEEE Transactions on Audio, Speech, and Language Processing | 2015

Multitask learning of deep neural networks for low-resource speech recognition

Dongpeng Chen; Brian Mak

We propose a multitask learning (MTL) approach to improve low-resource automatic speech recognition using deep neural networks (DNNs) without requiring additional language resources. We first demonstrate that the performance of the phone models of a single low-resource language can be improved by training its grapheme models in parallel under the MTL framework. If multiple low-resource languages are trained together, we investigate learning a universal phone set (UPS) as an additional task, again in the MTL framework, to improve the performance of the phone models of all the involved languages. In both cases, the heuristic guideline is to select a task that may exploit extra information from the training data of the primary task(s). In the first method, the extra information is the phone-to-grapheme mappings, whereas in the second method, the UPS helps to implicitly map the phones of the multiple languages to one another. In a series of experiments using three low-resource South African languages in the Lwazi corpus, the proposed MTL methods obtain significant word recognition gains when compared with single-task learning (STL) of the corresponding DNNs or with ROVER, which combines results from several STL-trained DNNs.

Collaboration


Dive into Brian Mak's collaborations.

Top Co-Authors

Simon Ka-Lung Ho, Hong Kong University of Science and Technology
Tom Ko, Hong Kong University of Science and Technology
Roger Hsiao, Carnegie Mellon University
James Tin-Yau Kwok, Hong Kong University of Science and Technology
Yik-Cheung Tam, Carnegie Mellon University
Dongpeng Chen, Hong Kong University of Science and Technology
Guoli Ye, Hong Kong University of Science and Technology
Man-Wai Mak, Hong Kong Polytechnic University
Yingke Zhu, Hong Kong University of Science and Technology