Publication


Featured research published by Huy Dat Tran.


IEEE Transactions on Audio, Speech, and Language Processing | 2013

Image Feature Representation of the Subband Power Distribution for Robust Sound Event Classification

Jonathan William Dennis; Huy Dat Tran; Eng Siong Chng

The ability to automatically recognize a wide range of sound events in real-world conditions is an important part of applications such as acoustic surveillance and machine hearing. Our approach takes inspiration from both audio and image processing fields, and is based on transforming the sound into a two-dimensional representation, then extracting an image feature for classification. This provided the motivation for our previous work on the spectrogram image feature (SIF). In this paper, we propose a novel method to improve the sound event classification performance in severe mismatched noise conditions. This is based on the subband power distribution (SPD) image - a novel two-dimensional representation that characterizes the spectral power distribution over time in each frequency subband. Here, the high-powered reliable elements of the spectrogram are transformed to a localized region of the SPD, and hence can be easily separated from the noise. We then extract an image feature from the SPD, using the same approach as for the SIF, and develop a novel missing-feature classification approach based on a nearest neighbor classifier (kNN). We carry out comprehensive experiments on a database of 50 environmental sound classes over a range of challenging noise conditions. The results demonstrate that the SPD-IF is both discriminative over the broad range of sound classes, and robust in severe non-stationary noise.
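
Although the paper's exact construction is not reproduced here, the core idea of the SPD image is easy to sketch: take a dB spectrogram and, for each frequency subband, histogram its power values over time. A minimal Python sketch, where the bin count, dB range, and STFT settings are illustrative assumptions:

```python
import numpy as np
from scipy import signal

def spd_image(x, fs, n_fft=512, n_bins=60, db_range=(-80.0, 0.0)):
    """Subband power distribution (SPD) image: for each frequency
    subband, a histogram of its dB power values over time."""
    # Magnitude spectrogram, converted to dB and clipped to a fixed range.
    _, _, S = signal.spectrogram(x, fs, nperseg=n_fft, noverlap=n_fft // 2)
    S_db = 10.0 * np.log10(S + 1e-12)
    S_db = np.clip(S_db, *db_range)

    # One histogram per subband (row): distribution of power over time.
    edges = np.linspace(db_range[0], db_range[1], n_bins + 1)
    spd = np.stack([np.histogram(row, bins=edges)[0] for row in S_db])
    return spd / S_db.shape[1]   # normalise by the number of frames

# Example: SPD of one second of white noise at 16 kHz.
x = np.random.randn(16000)
img = spd_image(x, fs=16000)
print(img.shape)  # (n_subbands, n_bins)
```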


IEEE Transactions on Audio, Speech, and Language Processing | 2011

Sound Event Recognition With Probabilistic Distance SVMs

Huy Dat Tran; Haizhou Li

Unlike other audio or speech signals, sound events have a relatively short time span. They are usually distinguished by their unique spectro-temporal signature. This paper proposes a novel classification method based on probabilistic distance support vector machines (SVMs). We study a parametric approach to characterizing sound signals using the distribution of the subband temporal envelope (STE), and kernel techniques for the subband probabilistic distance (SPD) under the framework of SVM. We show that generalized gamma modeling is well suited to sound characterization, and that the probabilistic distance kernel provides a closed-form solution to the calculation of divergence distance, which tremendously reduces computational cost. We conducted experiments on a database of ten types of sound events. The results show that the proposed classification method significantly outperforms conventional SVM classifiers with Mel-frequency cepstral coefficients (MFCCs). The rapid computation of probabilistic distance also makes the proposed method an obvious choice for online sound event recognition.
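
A rough illustration of the probabilistic distance kernel idea follows. This is not the paper's exact SPD kernel: the sketch substitutes a plain gamma fit for the generalized gamma model and a symmetric KL divergence (which has a closed form for gammas) for the paper's divergence, and the kernel width alpha is an arbitrary assumption:

```python
import numpy as np
from scipy.special import digamma, gammaln
from scipy.stats import gamma

def gamma_kl(kp, tp, kq, tq):
    """Closed-form KL divergence between Gamma(shape kp, scale tp)
    and Gamma(shape kq, scale tq)."""
    return ((kp - kq) * digamma(kp) - gammaln(kp) + gammaln(kq)
            + kq * np.log(tq / tp) + kp * (tp - tq) / tq)

def envelope_params(env):
    """Fit a gamma model to one subband temporal envelope (MLE, loc fixed at 0)."""
    k, _, theta = gamma.fit(env, floc=0.0)
    return k, theta

def distance_kernel(params_a, params_b, alpha=0.1):
    """Exponential kernel on the symmetric KL summed across subbands."""
    d = sum(gamma_kl(*pa, *pb) + gamma_kl(*pb, *pa)
            for pa, pb in zip(params_a, params_b))
    return np.exp(-alpha * d)

# Example with two synthetic two-subband sounds.
rng = np.random.default_rng(0)
a = [envelope_params(rng.gamma(2.0, 1.0, 400)) for _ in range(2)]
b = [envelope_params(rng.gamma(5.0, 0.5, 400)) for _ in range(2)]
print(distance_kernel(a, a), distance_kernel(a, b))  # self ~1, cross < 1
```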


International Conference on Acoustics, Speech, and Signal Processing | 2013

Temporal coding of local spectrogram features for robust sound recognition

Jonathan William Dennis; Qiang Yu; Huajin Tang; Huy Dat Tran; Haizhou Li

There is much evidence to suggest that the human auditory system uses localised time-frequency information for the robust recognition of sounds. Despite this, conventional systems typically rely on features extracted from short windowed frames over time, covering the whole frequency spectrum. Such approaches are not inherently robust to noise, as each frame will contain a mixture of the spectral information from noise and signal. Here, we propose a novel approach based on the temporal coding of Local Spectrogram Features (LSFs), which generate spikes that are used to train a Spiking Neural Network (SNN) with temporal learning. LSFs represent robust location information in the spectrogram surrounding keypoints, which are detected in a signal-driven manner such that the effect of noise on the temporal coding is reduced. Our experiments demonstrate the robust performance of our approach across a variety of noise conditions, such that it is able to outperform the conventional frame-based baseline methods.
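
The keypoint detection step can be sketched generically: find local maxima of the spectrogram that stand sufficiently far above the noise floor, and use their frame times directly as spike times. The filter size and threshold below are assumptions; the paper's detector and the subsequent LSF descriptors are more elaborate:

```python
import numpy as np
from scipy import signal
from scipy.ndimage import maximum_filter

def spectrogram_keypoints(x, fs, n_fft=512, size=5, rel_thresh_db=30.0):
    """Detect local spectral peaks and return (spike_time, frequency)
    pairs: the spike time is simply the frame time of each keypoint."""
    f, t, S = signal.spectrogram(x, fs, nperseg=n_fft, noverlap=n_fft // 2)
    S_db = 10.0 * np.log10(S + 1e-12)

    # Signal-driven detection: a keypoint is a local maximum that also
    # lies within rel_thresh_db of the global peak, which suppresses
    # spurious maxima in the noise floor.
    local_max = S_db == maximum_filter(S_db, size=size)
    strong = S_db > S_db.max() - rel_thresh_db
    fi, ti = np.nonzero(local_max & strong)
    return [(t[j], f[i]) for i, j in zip(fi, ti)]

x = np.random.randn(16000)
spikes = spectrogram_keypoints(x, fs=16000)
print(len(spikes), spikes[:3])
```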


International Conference on Acoustics, Speech, and Signal Processing | 2011

Probabilistic distance SVM with Hellinger-Exponential Kernel for sound event classification

Huy Dat Tran; Haizhou Li

This paper presents a novel method for sound event classification based on probabilistic distance SVM. The basic idea is to embed probabilistic distances into a classical SVM to classify sound events. The main point of this method is that the long-term characteristics of sound events are better exploited in classification than in conventional methods. Furthermore, taking into account the relatively short time span of sound events, we develop a probabilistic distance SVM approach based on the Hellinger distance from exponential modeling of temporal subband envelopes. An experiment on classifying 10 types of sound events was carried out and showed promising results for the proposed method compared to conventional methods.
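
The closed form at the heart of this method is standard: the squared Hellinger distance between two exponential densities with rates λ1 and λ2 is 1 − 2√(λ1λ2)/(λ1+λ2). A sketch of a Hellinger-Exponential style kernel built from it (the summation over subbands and the kernel width alpha are illustrative assumptions):

```python
import numpy as np

def hellinger2_exp(l1, l2):
    """Squared Hellinger distance between Exp(l1) and Exp(l2); closed form."""
    return 1.0 - 2.0 * np.sqrt(l1 * l2) / (l1 + l2)

def hellinger_exp_kernel(rates_a, rates_b, alpha=1.0):
    """Exponential kernel on squared Hellinger distances summed over subbands."""
    d = sum(hellinger2_exp(a, b) for a, b in zip(rates_a, rates_b))
    return np.exp(-alpha * d)

def envelope_rate(env):
    """MLE rate of an exponential fit to a subband temporal envelope."""
    return 1.0 / np.mean(env)

rng = np.random.default_rng(1)
a = [envelope_rate(rng.exponential(1.0, 400)) for _ in range(4)]
b = [envelope_rate(rng.exponential(3.0, 400)) for _ in range(4)]
print(hellinger_exp_kernel(a, a), hellinger_exp_kernel(a, b))  # 1.0, < 1.0
```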


International Conference on Acoustics, Speech, and Signal Processing | 2009

Sound event classification based on Feature Integration, Recursive Feature Elimination and Structured Classification

Huy Dat Tran; Haizhou Li

This paper proposes a novel system for sound event classification based on Feature Integration, Recursive Feature Elimination Support Vector Machine (RFESVM) and Structured Classification. The key points of the proposed method can be summarized as follows: 1) the integration of various feature extraction methods coming from different research communities in one system; 2) the use of feature selection to analyze and select the optimal subset of the integrated features; 3) the adoption of a knowledge-based taxonomic structured classification scheme. In particular, six groups of features including temporal shape, spectral shape, spectrogram, perceptual cepstral coefficients, harmonic and rhythmic feature sets are investigated in this paper. For the feature selection, the RFESVM method enables selection of the optimal feature subset while taking the features' mutual information into account. We further develop different feature elimination strategies for RFESVM depending on complexity requirements. The RFESVM is combined with a structured classification scheme designed for our task in surveillance and security applications. The proposed method is tested in two realistic environments, and the experimental results show good improvements in classification performance compared to the conventional method.
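
The RFE-SVM loop itself is the textbook procedure: train a linear SVM, rank features by the magnitude of their weights, drop the lowest-ranked ones, and repeat. A minimal sketch with scikit-learn on synthetic data standing in for the integrated feature set (the feature counts and step size are assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

# Synthetic stand-in for the integrated feature set (six groups, flattened).
X, y = make_classification(n_samples=300, n_features=60, n_informative=10,
                           random_state=0)

# Rank features by |w| of a linear SVM, dropping `step` features per round.
selector = RFE(LinearSVC(C=1.0, dual=False), n_features_to_select=10, step=5)
selector.fit(X, y)
print("selected feature indices:", np.flatnonzero(selector.support_))
```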


International Conference on Acoustics, Speech, and Signal Processing | 2011

Jump Function Kolmogorov for overlapping audio event classification

Huy Dat Tran; Haizhou Li

This paper presents a novel method for audio event classification in overlapping conditions. The method is based on the Jump Function Kolmogorov (JFK), a stochastic representation which is (a) additive, so the sum of signal and noise yields the sum of their JFKs; and (b) sparse, so audio events are separable in this domain. The proposed method is an extension of our previous work on classification under noise-mismatch conditions. As in that approach, the robustness of the JFK feature is obtained by limiting it to confidence intervals, which can be learned in advance. However, in order to classify overlapped events, we design the classification system as a set of event detectors and develop a novel approach which maps JFKs to a specific feature for each detector. Experiments show that the proposed method achieves promising results in very challenging overlapping conditions.
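
The JFK representation itself is not reproduced here, but the confidence-interval limiting the abstract describes can be sketched generically: learn per-dimension intervals from clean training features and clip test features into them, so values pushed outside the clean range by noise or an overlapping event cannot dominate the classifier. The percentile-based intervals below are an assumed concrete choice:

```python
import numpy as np

def learn_intervals(features, lo=5.0, hi=95.0):
    """Learn per-dimension confidence intervals from clean training features."""
    return np.percentile(features, [lo, hi], axis=0)

def limit_features(x, intervals):
    """Clip a feature vector into the pre-learned intervals."""
    return np.clip(x, intervals[0], intervals[1])

rng = np.random.default_rng(2)
train = rng.normal(0.0, 1.0, size=(500, 8))      # clean training features
iv = learn_intervals(train)
# A test vector with two dimensions corrupted far outside the clean range.
noisy = rng.normal(0.0, 1.0, 8) + np.array([0, 0, 6, 0, 0, -6, 0, 0])
print(limit_features(noisy, iv))
```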


International Conference on Acoustics, Speech, and Signal Processing | 2015

Combining robust spike coding with spiking neural networks for sound event classification

Jonathan William Dennis; Huy Dat Tran; Haizhou Li

This paper proposes a novel biologically inspired method for sound event classification which combines spike coding with a spiking neural network (SNN). Our spike coding extracts keypoints that represent the local maxima components of the sound spectrogram, and encodes them based on their local time-frequency information; hence both location and spectral information are extracted. We then design a modified tempotron SNN that, unlike the original tempotron, allows the network to learn the temporal distributions of the spike coding input, in a manner analogous to the generalized Hough transform. The proposed method simultaneously enhances the sparsity of the sound event spectrogram, producing a representation which is robust against noise, and maximises the discriminability of the spike coding input in terms of its temporal information, which is important for sound event classification. Experimental results on a large dataset of 50 environmental sound events show the superiority of both the spike coding versus the raw spectrogram and the SNN versus conventional cross-entropy neural networks.
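
For reference, a sketch of the original tempotron rule that the paper modifies (Gutig and Sompolinsky's formulation): the membrane potential is a weighted sum of PSP kernels triggered by input spikes, and on a classification error the weights are nudged at the time of maximum potential. The time constants, threshold, and learning rate below are illustrative:

```python
import numpy as np

TAU, TAU_S = 15.0, 3.75          # membrane / synaptic time constants (ms)
T_PEAK = TAU * TAU_S * np.log(TAU / TAU_S) / (TAU - TAU_S)
V0 = 1.0 / (np.exp(-T_PEAK / TAU) - np.exp(-T_PEAK / TAU_S))

def kernel(dt):
    """Tempotron PSP kernel, peak normalised to 1; zero for dt < 0."""
    dt = np.asarray(dt, dtype=float)
    k = V0 * (np.exp(-dt / TAU) - np.exp(-dt / TAU_S))
    return np.where(dt >= 0.0, k, 0.0)

def potential(w, spikes, t):
    """Membrane potential at time t; spikes[i] lists spike times of input i."""
    return sum(w[i] * kernel(t - np.asarray(s)).sum()
               for i, s in enumerate(spikes))

def train_step(w, spikes, label, theta=1.0, lr=1e-3,
               t_grid=np.arange(0.0, 100.0, 1.0)):
    """One tempotron update: if the fire/no-fire decision is wrong, adjust
    each weight by its summed kernel value at the time of maximum potential."""
    v = np.array([potential(w, spikes, t) for t in t_grid])
    if (v.max() >= theta) == bool(label):
        return w                                  # already correct
    t_max = t_grid[v.argmax()]
    grad = np.array([kernel(t_max - np.asarray(s)).sum() for s in spikes])
    return w + (lr if label else -lr) * grad

# Toy usage: 3 input neurons, one positive spike pattern.
rng = np.random.default_rng(4)
spikes = [list(rng.uniform(0, 100, 5)) for _ in range(3)]
w = rng.normal(0.0, 0.01, 3)
for _ in range(100):
    w = train_step(w, spikes, label=True)
```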


IEEE Transactions on Audio, Speech, and Language Processing | 2015

Generalized Hough transform for speech pattern classification

Jonathan William Dennis; Huy Dat Tran; Haizhou Li

While typical hybrid neural network architectures for automatic speech recognition (ASR) use a context window of frame-based features, this may not be the best approach to capture the wider temporal context, which contains phonetic and linguistic information that is equally important. In this paper, we introduce a system that integrates both the spectral and geometrical shape information from the acoustic spectrum, inspired by research in the field of machine vision. In particular, we focus on the Generalized Hough Transform (GHT), which is a sophisticated technique that can model the geometrical distribution of speech information over the wider temporal context. To integrate the GHT as part of a hybrid-ASR system, we propose to use a neural network, with features derived from the probabilistic Hough voting step of the GHT, to implement an improved version of the GHT where the output of the network represents the conventional target class posteriors. A major advantage of our approach is that each step of the GHT is highly interpretable, particularly compared to deep neural network (DNN) systems which are commonly treated as powerful black-box classifiers that give little insight into how the output is achieved. Experiments are carried out on two speech pattern classification tasks. The first is the TIMIT phoneme classification, which demonstrates the performance of the approach on a standard ASR task. The second is a spoken word recognition challenge, which highlights the flexibility of the approach to capture phonetic information within a longer temporal context.
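
The GHT voting step can be sketched in a few lines: each local feature is matched to a learned codebook entry, which casts a vote for its class at a position displaced by the entry's stored offset. The hard nearest-neighbour matching below is a simplification of the probabilistic Hough voting the paper uses:

```python
import numpy as np

def ght_votes(features, codebook, n_frames, n_classes):
    """Generalised Hough Transform voting into a (class, position) accumulator.

    features: list of (frame_index, descriptor) pairs
    codebook: list of (descriptor, class_id, frame_offset) learned in training
    """
    acc = np.zeros((n_classes, n_frames))
    for t, d in features:
        # Hard nearest-neighbour matching; the probabilistic version
        # spreads each vote over soft codebook matches instead.
        dists = [np.linalg.norm(d - cd) for cd, _, _ in codebook]
        _, cls, off = codebook[int(np.argmin(dists))]
        pos = t + off
        if 0 <= pos < n_frames:
            acc[cls, pos] += 1.0
    return acc  # class scores can be read off as acc.max(axis=1)

# Toy example with a two-entry codebook for two classes.
codebook = [(np.array([1.0, 0.0]), 0, 3), (np.array([0.0, 1.0]), 1, -2)]
feats = [(5, np.array([0.9, 0.1])), (7, np.array([1.1, -0.1]))]
acc = ght_votes(feats, codebook, n_frames=20, n_classes=2)
print(acc.max(axis=1))  # both features match class-0 entries -> [1. 0.]
```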


International Conference on Acoustics, Speech, and Signal Processing | 2014

A discriminatively trained Hough Transform for frame-level phoneme recognition

Jonathan William Dennis; Huy Dat Tran; Haizhou Li; Eng Siong Chng

Despite recent advances in the use of Artificial Neural Network (ANN) architectures for automatic speech recognition (ASR), relatively little attention has been given to using feature inputs beyond MFCCs in such systems. In this paper, we propose an alternative to conventional MFCC or filterbank features, using an approach based on the Generalised Hough Transform (GHT). The GHT is a common approach used in the field of image processing for the task of object detection, where the idea is to learn the spatial distribution of a codebook of feature information relative to the location of the target class. During recognition, a simple weighted summation of the codebook activations is commonly used to detect the presence of the target classes. Here we propose to learn the weighting discriminatively in an ANN, where the aim is to optimise the static phone classification error at the output of the network. As such an ANN is common to hybrid ASR architectures, the output activations from the GHT can be considered as a novel feature for ASR. Experimental results on the TIMIT phoneme recognition task demonstrate the state-of-the-art performance of the approach.
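
The discriminative step replaces the GHT's fixed weighted summation of codebook activations with a network trained on frame-level phone labels. A toy sketch with synthetic activations (the codebook size, hidden layer, and random data are placeholders; the paper trains on TIMIT):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Stand-in for per-frame GHT codebook activations (one row per frame)
# and frame-level phone labels.
rng = np.random.default_rng(3)
activations = rng.random((1000, 64))   # hypothetical 64-entry codebook
phones = rng.integers(0, 39, 1000)     # 39 folded phone classes

# Learn the vote weighting discriminatively: the network minimises the
# frame-level classification error at its output.
net = MLPClassifier(hidden_layer_sizes=(128,), max_iter=200, random_state=0)
net.fit(activations, phones)
print(net.predict(activations[:5]))
```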


Asia-Pacific Signal and Information Processing Association Annual Summit and Conference | 2014

Enhanced local feature approach for overlapping sound event recognition

Jonathan William Dennis; Huy Dat Tran

In this paper, we propose a feature-based approach to address the challenging task of recognising overlapping sound events from single channel audio. Our approach is based on our previous work on Local Spectrogram Features (LSFs), where we combined a local spectral representation of the spectrogram with the Generalised Hough Transform (GHT) voting system for recognition. Here we propose to take the output from the GHT and use it as a feature for classification, and demonstrate that such an approach can improve upon the previous knowledge-based scoring system. Experiments are carried out on a challenging set of five overlapping sound events, with the addition of non-stationary background noise and volume change. The results show that the proposed system can achieve a detection rate of 99% and 91% in clean and 0dB noise conditions respectively, which is a strong improvement over our previous work.

Collaboration


An overview of Huy Dat Tran's main collaborators.

Top Co-Authors

Haizhou Li

National University of Singapore

Eng Siong Chng

Nanyang Technological University
