Ka-Yee Leung
Hong Kong University of Science and Technology
Publications
Featured research published by Ka-Yee Leung.
international conference on acoustics, speech, and signal processing | 2004
Ka-Yee Leung; Man-Wai Mak; Sun-Yuan Kung
This paper presents an approach that uses articulatory features (AF) derived from spectral features for telephone-based speaker verification. To minimize the acoustic mismatch caused by different handsets, handset-specific normalization is applied to the spectral features before the AF are extracted. Experimental results based on 150 speakers using 10 different handsets show that AF contain useful speaker-specific information for speaker verification, and that handset-specific normalization significantly lowers the error rates under handset-mismatched conditions. Results also demonstrate that fusing the scores obtained from an AF-based system with those obtained from a spectral feature-based (MFCC) system lowers the error rates of the individual systems.
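As a rough illustration of the score-level fusion described above, here is a minimal sketch in Python. The weighted-sum rule and the weight `alpha` are assumptions for illustration only; the abstract does not specify the fusion rule used.

```python
import numpy as np

def fuse_scores(af_score: float, mfcc_score: float, alpha: float = 0.5) -> float:
    """Linearly fuse the scores of an AF-based and an MFCC-based
    speaker verification system. alpha is a hypothetical fusion
    weight; the paper does not state the exact rule or weight."""
    return alpha * af_score + (1.0 - alpha) * mfcc_score

# Example: accept the claimed identity if the fused score exceeds a threshold.
if fuse_scores(af_score=1.2, mfcc_score=0.8, alpha=0.4) > 0.0:
    print("accept")
```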
Speech Communication | 2006
Ka-Yee Leung; Man-Wai Mak; Man-Hung Siu; Sun-Yuan Kung
Because of differences in educational background, accent, and other factors, individuals pronounce words differently, so pronunciation patterns can be used as features for discriminating between speakers. This paper exploits the pronunciation characteristics of speakers and proposes a new conditional pronunciation modeling (CPM) technique for speaker verification. The proposed technique establishes a link between articulatory properties (e.g., manners and places of articulation) and phoneme sequences produced by a speaker. This is achieved by aligning two articulatory feature (AF) streams with a phoneme sequence determined by a phoneme recognizer, and then formulating the probabilities of articulatory classes conditioned on the phonemes as speaker-dependent discrete probabilistic models. The scores obtained from the AF-based pronunciation models are then fused with those obtained from spectral-based acoustic models. A frame-weighted fusion approach is introduced to weight the frame-based fused scores by the confidence of observing the articulatory classes. The effectiveness of AF-based CPM and the frame-weighted approach is demonstrated on a speaker verification task.
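A minimal sketch of the frame-weighted scoring idea, assuming the speaker and background CPMs are stored as nested dictionaries mapping phoneme to articulatory-class probabilities and that each frame carries a confidence weight; all names and the exact weighting rule below are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def frame_weighted_cpm_score(af_labels, phonemes, frame_conf,
                             speaker_cpm, background_cpm):
    """Score an utterance with frame-weighted conditional pronunciation
    models. Each frame contributes the log-likelihood ratio of the
    speaker model to the background model for its (phoneme, AF-class)
    pair, weighted by the confidence of observing that AF class.
    The nested-dict model layout is a hypothetical representation."""
    num, den = 0.0, 0.0
    for af, ph, w in zip(af_labels, phonemes, frame_conf):
        llr = np.log(speaker_cpm[ph][af]) - np.log(background_cpm[ph][af])
        num += w * llr
        den += w
    return num / max(den, 1e-9)
```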
Computer Speech & Language | 2006
Ka-Yee Leung; Man-Hung Siu
Confidence measures estimate the certainty that target acoustic units are spoken in specific speech segments. They are applied in tasks such as keyword verification and utterance verification. Because many confidence measures use the same set of models and features as recognition, the resulting scores may not provide an independent measure of reliability. In this paper, we propose two articulatory feature (AF) based phoneme confidence measures that estimate acoustic reliability from the match in AF properties. While acoustic features such as Mel-frequency cepstral coefficients (MFCC) are widely used in speech processing, some recent work has focused on linguistically motivated features, such as articulatory features, which relate directly to the human articulatory process and may better capture speech characteristics. Articulatory features can either replace or complement acoustic features in speech processing. The proposed AF-based measures were evaluated, in comparison and in combination, with HMM-based scores on phoneme and keyword verification tasks using children’s speech collected for a computer-based English pronunciation learning project. To fully evaluate their usefulness, the proposed measures and combinations were evaluated on both native and non-native data, and under field-test conditions that mismatch the training conditions. The experimental results show that, across these environments, combinations of the AF scores with the HMM-based scores outperform HMM-based scores alone on phoneme and keyword verification.
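To make the idea of a "match in AF properties" concrete, here is a hypothetical sketch: the confidence of a phoneme hypothesis is the average log posterior, over the segment's frames, of that phoneme's canonical articulatory classes. The CANONICAL_AF table and the posterior layout are illustrative assumptions, not the paper's exact measure.

```python
import numpy as np

# Hypothetical canonical articulatory classes for a few phonemes.
CANONICAL_AF = {
    "b": {"manner": "stop", "place": "bilabial", "voicing": "voiced"},
    "s": {"manner": "fricative", "place": "alveolar", "voicing": "voiceless"},
}

def af_phoneme_confidence(af_posteriors, target_phoneme):
    """af_posteriors: one dict per frame, mapping each AF type to a
    dict of class posteriors produced by the AF classifiers.
    Returns the mean log posterior of the target phoneme's canonical
    AF classes over the segment's frames."""
    canon = CANONICAL_AF[target_phoneme]
    frame_scores = [
        np.mean([np.log(frame[af_type][cls]) for af_type, cls in canon.items()])
        for frame in af_posteriors
    ]
    return float(np.mean(frame_scores))
```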
international conference on acoustics, speech, and signal processing | 2003
Ka-Yee Leung; Man-Hung Siu
Confidence measures are used in a number of applications to verify user input or to measure the certainty of recognition outputs. Most HMM-based systems use MFCC features with Gaussian mixture models to estimate confidence. We propose a new approach that estimates confidence by combining the posterior probabilities of articulatory features (AF) computed by a set of AF classifiers. This AF-based confidence measure gives performance comparable to the Gaussian mixture-based approach in terms of classification equal error rate (EER), while reducing computation by 50% (measured by the approximate number of multiplications) and using less memory. When the AF-based confidence is combined with the Gaussian mixture confidence, the EER is further reduced. AF-based confidence can be particularly useful for platforms with limited computing resources, such as hand-held devices.
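A minimal sketch of combining AF-classifier posteriors into one confidence score and then interpolating it with a Gaussian-mixture-based confidence. The top-class combination rule and the interpolation weight are assumptions for illustration.

```python
import numpy as np

def af_confidence(frame_posteriors):
    """frame_posteriors: list over frames of dicts mapping each AF
    classifier to its class-posterior vector. The confidence is the
    mean (over frames and classifiers) of the top-class log posterior;
    this particular combination rule is an assumption."""
    per_frame = [np.mean([np.log(np.max(p)) for p in frame.values()])
                 for frame in frame_posteriors]
    return float(np.mean(per_frame))

def combined_confidence(af_conf, gmm_conf, w=0.5):
    """Interpolate AF-based and GMM-based confidences with a
    hypothetical weight w."""
    return w * af_conf + (1.0 - w) * gmm_conf
```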
Information Fusion | 2004
Ka-Yee Leung; Man-Hung Siu
In speech recognition, fusing multiple systems often improves recognition accuracy or robustness. Previously proposed system fusions have focused mainly on the recognition process, while training is performed independently for each system. In this paper, we investigate the combination of a Mel-frequency cepstral coefficient (MFCC) based acoustic feature (ACF) system and an articulatory feature (AF) based system. In addition to proposing an asynchronous combination that makes state combination more flexible during recognition, we propose an efficient combination approach at the model training stage. We show that combining the models during training not only improves performance but also simplifies the fusion process during recognition. Because fusion during training removes inconsistencies between the individual models, such as in state or phoneme alignments, it is particularly useful for highly constrained recognition fusion such as synchronous model combination. Compared with fusing separately trained AF and ACF systems, fusing jointly trained AF and ACF models reduced the absolute phoneme recognition error on the TIMIT corpus by more than 3% for synchronous combination and by 1% for asynchronous combination.
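The synchronous combination can be pictured as a standard multi-stream log-likelihood sum per frame and state. In the sketch below the stream weight is a hypothetical parameter, and the code does not capture the joint training procedure itself.

```python
import numpy as np

def synchronous_state_loglik(mfcc_ll, af_ll, stream_weight=0.5):
    """Per-frame, per-state log-likelihood for a synchronous
    multi-stream combination: a weighted sum of the MFCC-stream and
    AF-stream log-likelihoods. stream_weight is a hypothetical value,
    not one reported by the paper."""
    return stream_weight * np.asarray(mfcc_ll) + \
           (1.0 - stream_weight) * np.asarray(af_ll)
```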
international symposium on chinese spoken language processing | 2004
Ka-Yee Leung; Man-Wai Mak; Man-Hung Siu; Sun-Yuan Kung
This paper proposes an articulatory feature-based conditional pronunciation modeling (AFCPM) technique for speaker verification. The technique models the pronunciation behavior of speakers by creating a link between the actual phones produced by the speakers and the states of articulation during speech production. Speaker models consisting of conditional probabilities of two articulatory classes are adapted from a set of universal background models (UBM) using MAP adaptation. This adaptation approach prevents over-fitting of the speaker models when the amount of speaker data is insufficient for direct estimation. Experimental results show that the adaptation technique enhances the discriminating power of speaker models by establishing a tighter coupling between the speaker models and the UBM. Results also show that fusing the scores derived from an AFCPM-based system and a conventional spectral-based system achieves a significantly lower error rate than either individual system, suggesting that AFCPM and spectral features are complementary.
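The MAP adaptation of the discrete speaker models can be sketched as relevance-style interpolation between the speaker's observed counts and the UBM probabilities; the relevance factor r below is a hypothetical parameter, as the abstract does not give the adaptation equation.

```python
import numpy as np

def map_adapt_discrete(speaker_counts, ubm_probs, r=16.0):
    """MAP-adapt one discrete conditional probability table
    P(articulatory class | phoneme) from the UBM.

    speaker_counts: counts of each articulatory class observed in the
    speaker's data for a given phoneme; ubm_probs: the matching UBM
    probabilities; r: a hypothetical relevance factor."""
    speaker_counts = np.asarray(speaker_counts, dtype=float)
    n = speaker_counts.sum()
    alpha = n / (n + r)                 # more speaker data -> trust ML more
    ml = speaker_counts / max(n, 1e-9)  # maximum-likelihood estimate
    return alpha * ml + (1.0 - alpha) * np.asarray(ubm_probs)
```

With little speaker data (small n), the adapted model stays close to the UBM; as data accumulates, it moves toward the speaker's maximum-likelihood estimate, which is what protects the speaker models from over-fitting.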
conference of the international speech communication association | 2004
Ka-Yee Leung; Man-Wai Mak; Sun-Yuan Kung
conference of the international speech communication association | 2002
Ka-Yee Leung; Man-Hung Siu
Archive | 2002
Ka-Yee Leung
conference of the international speech communication association | 2005
Ka-Yee Leung; Man-Wai Mak; Man-Hung Siu; Sun-Yuan Kung