Publication


Featured research published by Eng Siong Chng.


International Conference on Acoustics, Speech, and Signal Processing | 2012

Vulnerability of speaker verification systems against voice conversion spoofing attacks: The case of telephone speech

Tomi Kinnunen; Zhizheng Wu; Kong Aik Lee; Filip Sedlak; Eng Siong Chng; Haizhou Li

Voice conversion - the methodology of automatically converting one's utterances to sound as if spoken by another speaker - presents a threat to applications relying on speaker verification. We study the vulnerability of text-independent speaker verification systems against voice conversion attacks using telephone speech. We implemented voice conversion systems with two types of features and nonparallel frame alignment methods, and five speaker verification systems ranging from simple Gaussian mixture models (GMMs) to a state-of-the-art joint factor analysis (JFA) recognizer. Experiments on a subset of the NIST 2006 SRE corpus indicate that the JFA method is the most resilient against conversion attacks, but even it experiences a more than five-fold increase in false acceptance rate, from 3.24% to 17.33%.
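
As a concrete illustration of the kind of system under attack, the sketch below scores a trial with a simple GMM-UBM verifier and measures the false acceptance rate over impostor (e.g. voice-converted) trials. This is a minimal sketch of the simplest system studied, not the JFA recognizer; the function names and the use of scikit-learn are assumptions.

```python
# Minimal GMM-UBM verification sketch; feature extraction (e.g. MFCCs) is
# assumed to have happened already. Names here are illustrative.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(background_features, n_components=512):
    """Fit a universal background model (UBM) on pooled background speech."""
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag")
    ubm.fit(background_features)
    return ubm

def verification_score(target_gmm, ubm, trial_features):
    """Average log-likelihood ratio of a trial: target model vs. background."""
    return target_gmm.score(trial_features) - ubm.score(trial_features)

def false_acceptance_rate(impostor_scores, threshold):
    """Fraction of impostor (e.g. voice-converted) trials wrongly accepted."""
    return float(np.mean(np.asarray(impostor_scores) > threshold))
```

A spoofing attack succeeds when voice conversion pushes impostor scores above the decision threshold; the jump in false acceptance rate reported above quantifies exactly that effect.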


IEEE Transactions on Audio, Speech, and Language Processing | 2008

Normalization of the Speech Modulation Spectra for Robust Speech Recognition

Xiong Xiao; Eng Siong Chng; Haizhou Li

In this paper, we study a novel technique that normalizes the modulation spectra of speech signals for robust speech recognition. The modulation spectra of a speech signal are the power spectral density (PSD) functions of the feature trajectories generated from the signal, hence they describe the temporal structure of the features. The modulation spectra are distorted when the speech signal is corrupted by noise. We propose the temporal structure normalization (TSN) filter to reduce the noise effects by normalizing the modulation spectra to reference spectra. The TSN filter differs from feature normalization methods such as histogram equalization (HEQ), which only normalize the probability distributions of the speech features. Our previous work showed promising results for TSN on the small-vocabulary Aurora-2 task. In this paper, we investigate the theoretical and practical issues of the TSN filter as follows. 1) We investigate the effects of noise on the speech modulation spectra and show the general characteristics of noisy speech modulation spectra; these observations help to further explain and justify the TSN filter. 2) We evaluate the TSN filter on the Aurora-4 task and demonstrate its effectiveness for a large-vocabulary task. 3) We propose a segment-based implementation of the TSN filter that reduces the processing delay significantly without affecting performance. Overall, the TSN filter produces significant improvements over the baseline systems and delivers competitive results compared to other state-of-the-art temporal filters.
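
One plausible frequency-domain realization of the TSN idea is sketched below: estimate each feature trajectory's modulation spectrum (its PSD) and rescale the trajectory's spectrum toward a reference PSD estimated offline from clean training data. The Welch settings, the interpolation, and the per-utterance processing are assumptions, not the paper's exact filter design.

```python
# A sketch of temporal structure normalization (TSN) in the frequency domain;
# assumes each utterance has at least `nperseg` frames so all PSD grids match.
# ref_psds would be estimated offline from clean training features with the
# same Welch settings. Details here are illustrative assumptions.
import numpy as np
from scipy.signal import welch

def modulation_psd(trajectory, nperseg=64):
    """PSD of one feature trajectory, i.e. its modulation spectrum."""
    _, psd = welch(trajectory, nperseg=nperseg)
    return psd

def tsn_normalize(features, ref_psds, nperseg=64):
    """Rescale each trajectory's spectrum toward a reference modulation PSD.

    features: (T, D) feature matrix; ref_psds: (D, nperseg // 2 + 1) references.
    """
    T, D = features.shape
    out = np.empty((T, D))
    fft_freqs = np.fft.rfftfreq(T)                    # normalized, 0 to 0.5
    psd_freqs = np.linspace(0.0, 0.5, nperseg // 2 + 1)
    for d in range(D):
        traj = features[:, d]
        mean = traj.mean()
        spec = np.fft.rfft(traj - mean)
        obs = np.interp(fft_freqs, psd_freqs, modulation_psd(traj - mean, nperseg))
        ref = np.interp(fft_freqs, psd_freqs, ref_psds[d])
        gain = np.sqrt(ref / np.maximum(obs, 1e-12))  # equalize magnitudes
        out[:, d] = np.fft.irfft(spec * gain, n=T) + mean
    return out
```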


ACM Multimedia | 2004

Automatic replay generation for soccer video broadcasting

Jinjun Wang; Changsheng Xu; Eng Siong Chng; Kongwah Wah; Qi Tian

While most current approaches for sports video analysis are based on broadcast video, in this paper we present a novel approach for highlight detection and automatic replay generation for soccer videos taken by the main camera. This research is important because current soccer highlight detection and replay generation from a live game is a labor-intensive process. A robust multi-level, multi-model event detection framework is proposed to detect events and event boundaries from the video taken by the main camera. This framework explores the available analysis cues, using a mid-level representation to bridge the gap between low-level features and high-level events. The event detection results and mid-level representation are used to generate replays, which are automatically inserted into the video. Experimental results are promising: the generated replays are comparable with those produced by broadcast professionals.


International Conference on Acoustics, Speech, and Signal Processing | 2013

Synthetic speech detection using temporal modulation feature

Zhizheng Wu; Xiong Xiao; Eng Siong Chng; Haizhou Li

Voice conversion and speaker adaptation techniques present a threat to current state-of-the-art speaker verification systems. To prevent such spoofing attacks and enhance the security of speaker verification systems, the development of anti-spoofing techniques to distinguish synthetic from human speech is necessary. In this study, we continue the quest to discriminate synthetic and human speech. Motivated by the fact that current analysis-synthesis techniques operate at the frame level and assume frame-by-frame independence, we propose to adopt magnitude/phase modulation features to detect synthetic speech. Modulation features derived from the magnitude/phase spectrum carry long-term temporal information of speech and may be able to detect the temporal artifacts caused by frame-by-frame processing in speech synthesis. Our synthetic speech detection results show that the modulation features provide information complementary to magnitude/phase features. The best detection performance is obtained by fusing phase modulation features and phase features, yielding an equal error rate of 0.89%, significantly lower than the 1.25% of phase features and the 10.98% of MFCC features.
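
The magnitude side of the idea can be sketched as follows: compute a log-magnitude spectrogram, then take a second Fourier transform along time within each sub-band, so long-term temporal structure (and any frame-by-frame synthesis artifacts) shows up in the modulation spectrum. Window sizes, hop, and averaging are illustrative assumptions; the paper's exact recipe and its phase counterpart may differ.

```python
# Sketch of magnitude modulation feature extraction: spectrogram first, then a
# second FFT along time per sub-band. Parameters are illustrative assumptions;
# the signal is assumed long enough for at least one modulation window.
import numpy as np
from scipy.signal import stft

def magnitude_modulation_features(signal, fs=16000, n_mod=32):
    """Average modulation magnitude spectrum per acoustic sub-band.

    Returns an (n_bands, n_mod // 2 + 1) array: for each sub-band, the
    magnitude spectrum of its log-envelope over windows of n_mod frames.
    """
    _, _, spec = stft(signal, fs=fs, nperseg=512, noverlap=384)
    env = np.log(np.abs(spec) + 1e-10)            # log-magnitude trajectories
    env -= env.mean(axis=1, keepdims=True)        # per-band mean removal
    windows = []
    for start in range(0, env.shape[1] - n_mod + 1, n_mod // 2):
        seg = env[:, start:start + n_mod]
        windows.append(np.abs(np.fft.rfft(seg, axis=1)))
    return np.mean(windows, axis=0)
```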


International Conference on Multimedia and Expo | 2004

Sports highlight detection from keyword sequences using HMM

Jinjun Wang; Changsheng Xu; Eng Siong Chng; Qi Tian

Sports video highlight detection is a popular research topic. We describe a multi-layer sports event detection framework. At the mid-level of this framework, visual and audio keywords are created from low-level features and the original video is converted into a keyword sequence. At the high level, the temporal pattern of keyword sequences is analyzed by an HMM classifier. The creation of visual and audio keywords helps bridge the gap between low-level features and high-level semantics, and the HMM classifier automatically learns the temporal dynamics of an event, instead of relying on rule-based heuristic modeling to map keyword sequences to events. Experiments with our model on soccer games produced promising results.
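
The classification step can be illustrated with the forward algorithm: score the keyword sequence under one discrete HMM per event type and pick the most likely. The plain-numpy layout below is an assumption for illustration, not the paper's trained models.

```python
# Sketch of HMM-based event classification over keyword sequences; model
# parameters (log_pi, log_A, log_B) are assumed to be trained elsewhere.
import numpy as np

def log_forward(obs, log_pi, log_A, log_B):
    """Log-likelihood of a keyword index sequence under a discrete HMM.

    log_pi: (S,) initial log-probs; log_A: (S, S) transition log-probs;
    log_B: (S, V) emission log-probs over a keyword vocabulary of size V.
    """
    alpha = log_pi + log_B[:, obs[0]]
    for o in obs[1:]:
        # sum over previous states in log space, then emit the next keyword
        alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0) + log_B[:, o]
    return np.logaddexp.reduce(alpha)

def classify_event(obs, event_models):
    """Pick the event whose HMM gives the keyword sequence highest likelihood."""
    return max(event_models, key=lambda e: log_forward(obs, *event_models[e]))
```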


IEEE Transactions on Audio, Speech, and Language Processing | 2014

Exemplar-based sparse representation with residual compensation for voice conversion

Zhizheng Wu; Tuomas Virtanen; Eng Siong Chng; Haizhou Li

We propose a nonparametric framework for voice conversion: exemplar-based sparse representation with residual compensation. In this framework, a spectrogram is reconstructed as a weighted linear combination of speech segments, called exemplars, which span multiple consecutive frames. The linear combination weights are constrained to be sparse to avoid over-smoothing, and high-resolution spectra are employed in the exemplars directly, without dimensionality reduction, to maintain spectral details. In addition, a spectral compression factor and a residual compensation technique are included in the framework to enhance conversion performance. We conducted experiments on the VOICES database to compare the proposed method with a large set of state-of-the-art baseline methods, including the maximum likelihood Gaussian mixture model (ML-GMM) with dynamic feature constraint and partial least squares (PLS) regression based methods. The experimental results show that, with the proposed method, the objective spectral distortion of ML-GMM is reduced from 5.19 dB to 4.92 dB, and the subjective mean opinion score and the speaker identification rate are increased from 2.49 and 73.50% to 3.15 and 79.50%, respectively. The results also show the superiority of our method over PLS-based methods. In addition, subjective listening tests indicate that the naturalness of speech converted by the proposed method is comparable with that of the ML-GMM method with global variance constraint.
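
The sparse coding step can be illustrated with standard nonnegative matrix factorization machinery: fix the exemplar dictionary, solve for sparse activations with multiplicative KL-divergence updates, then reuse the activations with paired target exemplars. The L1-penalized update below is a common choice assumed here; the spectral compression factor and residual compensation are omitted.

```python
# Sketch of exemplar-based sparse representation: V ~= W @ H with a fixed
# exemplar dictionary W and sparse nonnegative activations H. The penalized
# multiplicative update is a standard assumption, not the paper's exact recipe.
import numpy as np

def sparse_activations(V, W, sparsity=0.1, n_iter=100, eps=1e-10):
    """Minimize KL(V || W @ H) + sparsity * sum(H) over H >= 0.

    V: (F, T) source magnitude spectrogram; W: (F, N) stacked exemplars.
    """
    H = np.random.rand(W.shape[1], V.shape[1]) + eps
    ones = np.ones_like(V)
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.T @ ones + sparsity)
    return H

# Conversion sketch: activations estimated against source exemplars are reused
# with the paired target exemplars to build the converted spectrogram.
# converted = W_target @ sparse_activations(V_source, W_source)
```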


International Conference on Pattern Recognition | 2006

Automatic Sports Video Genre Classification using Pseudo-2D-HMM

Jinjun Wang; Changsheng Xu; Eng Siong Chng

Building a generic content-based sports video analysis system remains challenging because the diversity in sports rules and game features makes it difficult to discover generic low-level features or high-level modeling algorithms. One possible alternative is to first classify the sports genre and then apply sport-specific domain knowledge to perform the analysis. In this paper we describe a multi-level framework to automatically recognize the genre of a sports video. The system consists of a pseudo-2D-HMM classifier that uses low-level visual/audio features to evaluate video clips. The experimental results are satisfactory, and an extension of the framework to a generic sports video analysis system is being implemented.


International Conference on Acoustics, Speech, and Signal Processing | 2012

A first speech recognition system for Mandarin-English code-switch conversational speech

Ngoc Thang Vu; Dau-Cheng Lyu; Jochen Weiner; Dominic Telaar; Tim Schlippe; Fabian Blaicher; Eng Siong Chng; Tanja Schultz; Haizhou Li

This paper presents first steps toward a large vocabulary continuous speech recognition (LVCSR) system for conversational Mandarin-English code-switching (CS) speech. We applied state-of-the-art techniques such as speaker adaptive and discriminative training to build the first baseline system on the SEAME (South East Asia Mandarin-English) corpus [1]. For acoustic modeling, we applied different phone merging approaches based on the International Phonetic Alphabet (IPA) and the Bhattacharyya distance, in combination with discriminative training, to improve accuracy. At the language model level, we investigated statistical machine translation (SMT) based text generation approaches for building code-switching language models. Furthermore, we integrated the information provided by a language identification (LID) system into the decoding process using a multi-stream approach. Our best 2-pass system achieves a Mixed Error Rate (MER) of 36.6% on the SEAME development set.
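
The Bhattacharyya-distance criterion for phone merging can be sketched as below, modelling each phone as a single Gaussian over its acoustic features and proposing cross-lingual merges when the distance falls under a threshold. The single-Gaussian simplification and the threshold value are assumptions for illustration.

```python
# Sketch of Bhattacharyya-distance phone merging; each phone is summarized by
# a single Gaussian (mu, cov), a simplification assumed for illustration.
import numpy as np

def bhattacharyya(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two multivariate Gaussians."""
    cov = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    maha = 0.125 * diff @ np.linalg.solve(cov, diff)
    logdets = [np.linalg.slogdet(c)[1] for c in (cov, cov1, cov2)]
    return maha + 0.5 * (logdets[0] - 0.5 * (logdets[1] + logdets[2]))

def mergeable_phone_pairs(phone_stats, threshold=0.5):
    """Phone pairs close enough to share one acoustic model.

    phone_stats: dict mapping phone name -> (mean vector, covariance matrix).
    """
    names = sorted(phone_stats)
    return [(a, b)
            for i, a in enumerate(names) for b in names[i + 1:]
            if bhattacharyya(*phone_stats[a], *phone_stats[b]) < threshold]
```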


International Conference on Acoustics, Speech, and Signal Processing | 2006

Integrating Acoustic, Prosodic and Phonotactic Features for Spoken Language Identification

Rong Tong; Bin Ma; Donglai Zhu; Haizhou Li; Eng Siong Chng

The fundamental issue in automatic language identification is to find effective discriminative cues for languages. This paper studies the fusion of five features at different levels of abstraction for language identification: spectral, duration, pitch, n-gram phonotactic, and bag-of-sounds features. We built a system and report test results on the NIST 1996 and 2003 LRE datasets; the system also participated in the NIST 2005 LRE. The experimental results show that different levels of information provide complementary language cues: prosodic features are more effective for shorter utterances, while phonotactic features work better for longer utterances. For the 12-language task, the system fusing all five features achieved 2.38% EER for 30-second speech segments on the NIST 1996 dataset.
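
The fusion-and-scoring loop can be sketched as follows: combine per-subsystem scores with a weighted sum and compute the equal error rate (EER) by sweeping a decision threshold. Linear fusion with hand-set weights is an assumption; in practice the weights would be trained, for example by logistic regression.

```python
# Sketch of linear score fusion and EER computation; fusion weights are
# assumed given here, though they would normally be trained on held-out data.
import numpy as np

def fuse_scores(subsystem_scores, weights):
    """Weighted sum of equal-length score arrays, one per subsystem."""
    return sum(w * np.asarray(s) for w, s in zip(weights, subsystem_scores))

def equal_error_rate(scores, labels):
    """EER for detection scores with labels 1 (target) / 0 (non-target)."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    fars, frrs = [], []
    for t in np.sort(scores):
        accept = scores >= t
        fars.append(np.mean(accept[labels == 0]))   # false acceptance rate
        frrs.append(np.mean(~accept[labels == 1]))  # false rejection rate
    i = int(np.argmin(np.abs(np.array(fars) - np.array(frrs))))
    return 0.5 * (fars[i] + frrs[i])
```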


ACM Multimedia | 2005

Automatic generation of personalized music sports video

Jinjun Wang; Changsheng Xu; Eng Siong Chng; Ling-Yu Duan; Kongwah Wan; Qi Tian

In this paper, we propose a novel automatic approach for personalized music sports video generation. Two research challenges, semantic sports video content selection and automatic video composition, are addressed. For the first challenge, we propose to use multi-modal (audio, video and text) feature analysis and alignment to detect the semantics of events in sports video. For the second challenge, we propose video-centric and music-centric composition schemes to automatically generate personalized music sports videos based on user preferences. The experimental results and user evaluations are promising and show that the music sports videos generated by our system are comparable to manually generated ones. The proposed approach greatly facilitates automatic music sports video generation for both professionals and amateurs.

Collaboration


Dive into Eng Siong Chng's collaborations.

Top Co-Authors

Haizhou Li
National University of Singapore

Xiong Xiao
Nanyang Technological University

Haihua Xu
Nanyang Technological University

Xiaohai Tian
Nanyang Technological University

Zhizheng Wu
University of Edinburgh

Lei Xie
Northwestern Polytechnical University

Van Tung Pham
Nanyang Technological University

Van Hai Do
Nanyang Technological University