Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Xiaodan Zhuang is active.

Publication


Featured research published by Xiaodan Zhuang.


ACM Multimedia | 2008

SIFT-Bag kernel for video event analysis

Xi Zhou; Xiaodan Zhuang; Shuicheng Yan; Shih-Fu Chang; Mark Hasegawa-Johnson; Thomas S. Huang

In this work, we present a SIFT-Bag based generative-to-discriminative framework for video event recognition in unconstrained news videos. In the generative stage, each video clip is encoded as a bag of SIFT feature vectors, whose distribution is described by a Gaussian Mixture Model (GMM). In the discriminative stage, the SIFT-Bag kernel is designed to characterize the Kullback-Leibler divergence between the specialized GMMs of any two video clips, and this kernel is then used for supervised learning in two ways. On one hand, the kernel's discriminating power is further refined for centroid-based video event classification using Within-Class Covariance Normalization, which suppresses kernel components with high variability among video clips of the same event. On the other hand, the SIFT-Bag kernel is used in a Support Vector Machine for margin-based video event classification. Finally, the outputs of these two classifiers are fused for the final decision. Experiments on the TRECVID 2005 corpus demonstrate that our framework boosts the mean average precision from the best reported 38.2% in [36] to 60.4%.
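To make the kernel stage concrete, the sketch below estimates the symmetrized Kullback-Leibler divergence between two per-clip GMMs by Monte Carlo sampling and exponentiates it into a kernel value. This is a simplified stand-in for the paper's closed-form approximation: scikit-learn, the function names, and the kernel width are assumptions, and the WCCN refinement is omitted.

```python
# Minimal sketch, assuming numpy/scikit-learn and 128-D SIFT descriptors
# already extracted per clip. All function names are hypothetical.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_clip_gmm(sift_descriptors, n_components=32, seed=0):
    """Fit a GMM over the bag of SIFT descriptors of one video clip."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag", random_state=seed)
    gmm.fit(sift_descriptors)
    return gmm

def symmetric_kl(gmm_p, gmm_q, n_samples=5000):
    """Monte-Carlo estimate of the symmetrized KL divergence between two
    GMMs (the quantity the SIFT-Bag kernel is built on)."""
    xp, _ = gmm_p.sample(n_samples)
    xq, _ = gmm_q.sample(n_samples)
    kl_pq = np.mean(gmm_p.score_samples(xp) - gmm_q.score_samples(xp))
    kl_qp = np.mean(gmm_q.score_samples(xq) - gmm_p.score_samples(xq))
    return kl_pq + kl_qp

def sift_bag_kernel(gmm_p, gmm_q, gamma=0.01):
    """Exponential kernel on the divergence, usable in an SVM."""
    return np.exp(-gamma * symmetric_kl(gmm_p, gmm_q))
```

A precomputed Gram matrix of such kernel values can be passed directly to an SVM (e.g. SVC(kernel="precomputed")) for the margin-based classifier described above.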


Pattern Recognition Letters | 2010

Real-world acoustic event detection

Xiaodan Zhuang; Xi Zhou; Mark Hasegawa-Johnson; Thomas S. Huang

Acoustic Event Detection (AED) aims to identify both the timestamps and types of events in an audio stream. This becomes very challenging when going beyond restricted highlight events and well-controlled recordings. We propose extracting discriminative features for AED using a boosting approach; these features outperform classical speech perceptual features such as Mel-frequency Cepstral Coefficients and log-frequency filterbank parameters. We also propose statistical models that better fit the task. First, a tandem connectionist-HMM approach combines the sequence-modeling capabilities of the HMM with the high-accuracy, context-dependent discriminative capabilities of an artificial neural network trained with the minimum cross entropy criterion. Second, an SVM-GMM-supervector approach uses noise-adaptive kernels that better approximate the KL divergence between feature distributions in different audio segments. Experiments on the CLEAR 2007 AED Evaluation setup demonstrate that the presented features and models yield over 45% relative performance improvement, and also outperform the best system in the CLEAR AED Evaluation, on detection of twelve general acoustic events in a real seminar environment.
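As an illustration of the tandem connectionist-HMM idea, the sketch below trains a neural network with the cross-entropy criterion to produce frame posteriors, then smooths them into contiguous event segments with a Viterbi pass over a sticky transition matrix. scikit-learn, the layer size, and the self-loop probability are assumptions; the hand-rolled decoder is a stand-in for a full HMM back-end.

```python
# Minimal sketch, assuming frame-level features X [n_frames, n_dims] and
# event labels y are precomputed; the transition matrix is hypothetical.
import numpy as np
from sklearn.neural_network import MLPClassifier

def frame_posteriors(X_train, y_train, X_test):
    """ANN trained with cross entropy, as in the tandem front-end."""
    net = MLPClassifier(hidden_layer_sizes=(256,), max_iter=200,
                        random_state=0)
    net.fit(X_train, y_train)
    return net.predict_proba(X_test)           # [n_frames, n_events]

def viterbi_smooth(log_post, self_loop=0.98):
    """Viterbi decoding with sticky self-loops, turning noisy frame
    posteriors into contiguous event segments."""
    n_frames, n_states = log_post.shape
    log_trans = np.full((n_states, n_states),
                        np.log((1 - self_loop) / (n_states - 1)))
    np.fill_diagonal(log_trans, np.log(self_loop))
    delta = log_post[0].copy()
    back = np.zeros((n_frames, n_states), dtype=int)
    for t in range(1, n_frames):
        scores = delta[:, None] + log_trans    # scores[i, j]: i -> j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_post[t]
    path = [int(delta.argmax())]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t, path[-1]])
    return np.array(path[::-1])                # state index per frame
```

A typical call chains the two stages: viterbi_smooth(np.log(frame_posteriors(X_train, y_train, X_test) + 1e-10)).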


International Conference on Acoustics, Speech, and Signal Processing | 2009

Acoustic fall detection using Gaussian mixture models and GMM supervectors

Xiaodan Zhuang; Jing Huang; Gerasimos Potamianos; Mark Hasegawa-Johnson

We present a system that detects human falls in the home environment, distinguishing them from competing noise, using only the audio signal from a single far-field microphone. The proposed system models each fall or noise segment by a Gaussian mixture model (GMM) supervector, whose Euclidean distance measures the pairwise difference between audio segments. A support vector machine built on a kernel between GMM supervectors is employed to classify audio segments into falls and various types of noise. Experiments on a dataset of human falls, collected as part of the Netcarity project, show that the method improves the fall classification F-score to 67%, from the 59% achieved by a baseline GMM classifier. The approach also effectively addresses the more difficult fall detection problem, where audio segment boundaries are unknown. Specifically, we employ it to reclassify confusable segments produced by a dynamic programming scheme based on traditional GMMs. This post-processing improves a fall detection accuracy metric by 5% relative.
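A minimal sketch of the GMM-supervector front-end follows, assuming scikit-learn and precomputed per-segment acoustic frames. MAP adaptation is reduced to a single relevance-weighted update for brevity, and the weight/covariance scaling is the usual choice that makes Euclidean distance between supervectors approximate the underlying divergence; the function names are hypothetical.

```python
# Minimal sketch: universal background model + per-segment supervectors.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

def train_ubm(all_frames, n_components=64):
    """Background GMM trained on frames pooled from all segments."""
    ubm = GaussianMixture(n_components=n_components,
                          covariance_type="diag", random_state=0)
    ubm.fit(all_frames)
    return ubm

def supervector(ubm, frames, relevance=16.0):
    """MAP-adapt the UBM means to one audio segment and stack them."""
    post = ubm.predict_proba(frames)            # [n_frames, K]
    n_k = post.sum(axis=0)                      # soft counts per component
    f_k = post.T @ frames                       # first-order statistics
    alpha = (n_k / (n_k + relevance))[:, None]  # relevance weighting
    means = (alpha * (f_k / np.maximum(n_k[:, None], 1e-8))
             + (1 - alpha) * ubm.means_)
    scale = np.sqrt(ubm.weights_)[:, None] / np.sqrt(ubm.covariances_)
    return (scale * means).ravel()

# Classification: a linear SVM on one supervector per audio segment, e.g.
# SVC(kernel="linear").fit(np.stack(train_supervectors), train_labels)
```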


International Conference on Acoustics, Speech, and Signal Processing | 2008

Feature analysis and selection for acoustic event detection

Xiaodan Zhuang; Xi Zhou; Thomas S. Huang; Mark Hasegawa-Johnson

Speech perceptual features, such as Mel-frequency Cepstral Coefficients (MFCC), have been widely used in acoustic event detection. However, the spectral structures of acoustic events differ from those of speech, which degrades the performance of these speech-oriented feature sets. We propose quantifying the discriminative capability of each feature component according to an approximated Bayesian accuracy and deriving a discriminative feature set for acoustic event detection. Compared to MFCC, feature sets derived using the proposed approaches achieve about 30% relative accuracy improvement in acoustic event detection.
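The per-component scoring could be sketched as below, with a two-class 1-D Gaussian Bayes classifier per feature dimension standing in for the paper's approximated Bayesian accuracy; numpy and scipy are assumed, and X, y denote frame features and labels.

```python
# Minimal sketch: score each feature dimension by the accuracy of a 1-D
# Gaussian Bayes classifier, then keep the highest-scoring dimensions.
import numpy as np
from scipy.stats import norm

def bayes_accuracy_per_dim(X, y):
    """Approximate per-dimension Bayes accuracy on the training frames."""
    classes = np.unique(y)
    scores = []
    for d in range(X.shape[1]):
        loglik = np.stack([
            norm.logpdf(X[:, d],
                        X[y == c, d].mean(),
                        X[y == c, d].std() + 1e-8)
            + np.log(np.mean(y == c))           # class prior
            for c in classes])
        pred = classes[loglik.argmax(axis=0)]
        scores.append(np.mean(pred == y))
    return np.array(scores)

# Feature selection: keep the top-n dimensions by score, e.g.
# selected = np.argsort(bayes_accuracy_per_dim(X, y))[::-1][:20]
```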


Multimodal Technologies for Perception of Humans | 2008

HMM-Based Acoustic Event Detection with AdaBoost Feature Selection

Xi Zhou; Xiaodan Zhuang; Ming Liu; Hao Tang; Mark Hasegawa-Johnson; Thomas S. Huang

Because of the spectral difference between speech and acoustic events, we propose using the Kullback-Leibler distance to quantify the discriminant capability of each speech feature component in acoustic event detection. Based on these distances, we use AdaBoost to select a discriminant feature set and demonstrate that this feature set outperforms classical speech feature sets such as MFCC in one-pass HMM-based acoustic event detection. We implement an HMM-based acoustic event detection system with lattice rescoring, using a feature set selected by the AdaBoost-based approach above.
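A rough sketch of both ingredients follows, assuming scikit-learn: a per-dimension symmetric KL distance between Gaussians fit to speech and event frames, and an AdaBoost run over decision stumps whose accumulated importances select the feature set. The paper's exact selection protocol differs.

```python
# Minimal sketch: per-dimension KL distance plus AdaBoost stump selection.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def gaussian_kl_per_dim(X_speech, X_event):
    """Symmetric KL between 1-D Gaussians fit to speech vs. event frames,
    one value per feature component."""
    m1, v1 = X_speech.mean(0), X_speech.var(0) + 1e-8
    m2, v2 = X_event.mean(0), X_event.var(0) + 1e-8
    return 0.5 * ((v1 / v2 + v2 / v1)
                  + (m1 - m2) ** 2 * (1 / v1 + 1 / v2) - 2)

def adaboost_selected_dims(X, y, n_keep=20):
    """Boost decision stumps and keep the dimensions that accumulate the
    most importance across rounds."""
    booster = AdaBoostClassifier(n_estimators=200, random_state=0).fit(X, y)
    return np.argsort(booster.feature_importances_)[::-1][:n_keep]
```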


International Conference on Pattern Recognition | 2008

Face age estimation using patch-based hidden Markov model supervectors

Xiaodan Zhuang; Xi Zhou; Mark Hasegawa-Johnson; Thomas S. Huang

Recent studies of patch-based Gaussian Mixture Model (GMM) approaches to face age estimation present promising results. We propose using a hidden Markov model (HMM) supervector to represent face image patches, improving on the previous GMM supervector approach by capturing the spatial structure of human faces and loosening the assumption that face patches are identically distributed within a face image. The Euclidean distance between HMM supervectors constructed from two face images measures the similarity of the faces; it derives from the approximated Kullback-Leibler divergence between the joint patch distributions, with implicit unsupervised alignment of different regions in the two faces. The proposed HMM supervector approach compares favorably with the GMM supervector approach in face age estimation on a large face dataset.
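A minimal sketch of the supervector construction, under the assumption that the hmmlearn package is available and that patches are scanned in a fixed raster order so the left-to-right state structure can reflect spatial layout; the paper's covariance and weight scaling of the stacked means is omitted here.

```python
# Minimal sketch: one HMM per face image, state means stacked into a
# supervector whose Euclidean distances compare faces.
import numpy as np
from hmmlearn.hmm import GaussianHMM

def hmm_supervector(patch_features, n_states=8):
    """patch_features: [n_patches, n_dims], patches in raster order."""
    hmm = GaussianHMM(n_components=n_states, covariance_type="diag",
                      n_iter=20, random_state=0)
    hmm.fit(patch_features)
    return hmm.means_.ravel()   # proper scaling omitted in this sketch

# Comparing two faces then reduces to
# np.linalg.norm(hmm_supervector(a) - hmm_supervector(b))
```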


International Conference on Pattern Recognition | 2010

Novel Gaussianized vector representation for improved natural scene categorization

Xi Zhou; Xiaodan Zhuang; Hao Tang; Mark Hasegawa-Johnson; Thomas S. Huang

We present a novel Gaussianized vector representation for scene images, obtained by an unsupervised approach. Each image is first encoded as an orderless bag of features. A global Gaussian Mixture Model (GMM) learned from all images is then used to randomly distribute each feature into one Gaussian component by a multinomial trial, with the posteriors of the feature on all Gaussian components serving as the parameters of the multinomial distribution. Finally, the normalized means of the features distributed to each Gaussian component are concatenated to form a supervector, a compact representation of the scene image. We prove that these supervectors follow the standard normal distribution. The Gaussianized vector representation is a generalization of the widely used histogram representation. Our experiments on scene categorization tasks using this vector representation show significantly improved performance compared with the histogram-of-features representation. This paper is an extended version of our work that won the IBM Best Student Paper Award at the 2008 International Conference on Pattern Recognition (ICPR 2008) (Zhou et al., 2008).
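The encoding could be sketched as follows, assuming scikit-learn and precomputed local descriptors per image. The centering and scaling at the end is a common normalization choice made here for illustration; the paper derives the exact normalization under which the supervectors follow the standard normal distribution.

```python
# Minimal sketch: Gaussianized vector via multinomial trials on GMM posteriors.
import numpy as np
from sklearn.mixture import GaussianMixture

def gaussianized_vector(global_gmm, descriptors, seed=0):
    """global_gmm: GMM fit on descriptors pooled from all images.
    descriptors: [n_features, n_dims] for one image."""
    rng = np.random.default_rng(seed)
    post = global_gmm.predict_proba(descriptors)   # multinomial parameters
    K, D = global_gmm.means_.shape
    sums, counts = np.zeros((K, D)), np.zeros(K)
    for x, p in zip(descriptors, post):
        k = rng.choice(K, p=p / p.sum())           # one multinomial trial
        sums[k] += x
        counts[k] += 1
    means = sums / np.maximum(counts[:, None], 1)  # per-component means
    scale = np.sqrt(global_gmm.weights_)[:, None] / np.sqrt(global_gmm.covariances_)
    return (scale * (means - global_gmm.means_)).ravel()
```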


International Conference on Acoustics, Speech, and Signal Processing | 2011

Synthesizing visual speech trajectory with minimum generation error

Lijuan Wang; Yi-Jian Wu; Xiaodan Zhuang; Frank K. Soong

In this paper, we propose a minimum generation error (MGE) training method that refines the audio-visual HMM to improve visual speech trajectory synthesis. Compared with traditional maximum likelihood (ML) estimation, the proposed MGE training explicitly optimizes the quality of the generated visual speech trajectory: the audio-visual HMM is jointly refined using a heuristic method to find the optimal state alignment and a probabilistic descent algorithm to optimize the model parameters under the MGE criterion. In objective evaluation, the proposed MGE-based method achieves consistent improvement over the ML-based method in mean square error reduction, correlation increase, and recovery of global variance. It also improves naturalness and audio-visual consistency perceptually in the subjective test.
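Stripped of dynamic features and the alignment search, the MGE update reduces to gradient descent on the squared generation error with respect to the state means, as in this sketch; all names are hypothetical and this is only the core of the criterion, not the paper's full procedure.

```python
# Minimal sketch: refine visual-feature state means to shrink the error
# between the generated trajectory and the ground truth.
import numpy as np

def mge_refine_means(means, alignment, target, lr=0.1, n_iter=50):
    """means: [n_states, n_dims]; alignment: state index per frame;
    target: [n_frames, n_dims] ground-truth visual trajectory."""
    means = means.copy()
    for _ in range(n_iter):
        generated = means[alignment]        # trajectory from aligned means
        error = generated - target          # generation error per frame
        for s in np.unique(alignment):
            # descend along the gradient of the summed squared error
            means[s] -= lr * error[alignment == s].mean(axis=0)
    return means
```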


International Conference on Pattern Recognition | 2008

A novel Gaussianized vector representation for natural scene categorization

Xi Zhou; Xiaodan Zhuang; Hao Tang; Mark Hasegawa-Johnson; Thomas S. Huang

This paper presents a novel Gaussianized vector representation for scene images, obtained by an unsupervised approach. First, each image is encoded as an orderless bag of features; a global Gaussian Mixture Model (GMM) learned from all images is then used to randomly distribute each feature into one Gaussian component by a multinomial trial. The parameters of the multinomial distribution are given by the posteriors of the feature on all Gaussian components. Finally, the normalized means of the features distributed to each Gaussian component are concatenated to form a supervector, a compact representation of the scene image. We prove that these supervectors follow the standard normal distribution. Our experiments on scene categorization tasks using this vector representation show significantly improved performance compared with the bag-of-features representation.


International Conference on Acoustics, Speech, and Signal Processing | 2012

Improving faster-than-real-time human acoustic event detection by saliency-maximized audio visualization

Kai Hsiang Lin; Xiaodan Zhuang; Camille Goudeseune; Sarah King; Mark Hasegawa-Johnson; Thomas S. Huang

We propose a saliency-maximized audio spectrogram as a representation that lets human analysts quickly search for and detect events in audio recordings. By rendering target events as visually salient patterns, this representation minimizes the time and effort needed to examine a recording. In particular, we propose a transformation of a conventional spectrogram that maximizes the mutual information between the spectrograms of isolated target events and the estimated saliency of the overall visual representation. When subjects are shown spectrograms that are saliency-maximized, they perform significantly better in a 1/10-real-time acoustic event detection task.
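The paper's transformation is learned by maximizing mutual information with estimated visual saliency; as a much simpler stand-in, the sketch below boosts the local contrast of a log-spectrogram by subtracting a smoothed background, which already makes transient events easier for a viewer to spot. scipy, the window sizes, and the sample rate are assumptions.

```python
# Minimal sketch: log-spectrogram with a smoothed background removed,
# a simplified stand-in for the saliency-maximized visualization.
import numpy as np
from scipy.signal import spectrogram
from scipy.ndimage import uniform_filter

def salient_spectrogram(audio, fs=16000, win=0.025, hop=0.010, bg=25):
    """audio: mono waveform; returns an image-like array for display."""
    f, t, S = spectrogram(audio, fs, nperseg=int(win * fs),
                          noverlap=int((win - hop) * fs))
    log_s = np.log(S + 1e-10)
    background = uniform_filter(log_s, size=bg)  # slowly varying backdrop
    return log_s - background   # transient events stand out against it
```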

Collaboration


Dive into Xiaodan Zhuang's collaboration.

Top Co-Authors

Xi Zhou

Chinese Academy of Sciences

Brandyn White

University of Central Florida
