Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Huy Phan is active.

Publication


Featured research published by Huy Phan.


IEEE Transactions on Audio, Speech, and Language Processing | 2015

Random regression forests for acoustic event detection and classification

Huy Phan; Marco Maaß; Radoslaw Mazur; Alfred Mertins

Despite the success of the automatic speech recognition framework in its own application field, its adaptation to the problem of acoustic event detection has resulted in limited success. In this paper, instead of treating the problem similarly to the segmentation and classification tasks in speech recognition, we pose it as a regression task and propose an approach based on random forest regression. Furthermore, event localization in time can be efficiently handled as a joint problem. We first decompose the training audio signals into multiple interleaved superframes, which are annotated with the corresponding event class labels and their displacements to the temporal onsets and offsets of the events. For a specific event category, a random-forest regression model is learned using the displacement information. Given an unseen superframe, the learned regressor outputs continuous estimates of the onset and offset locations of the events. To deal with multiple event categories, prior to the category-specific regression phase, a superframe-wise recognition phase is performed to reject the background superframes and to classify the event superframes into different event categories. Beyond the novelty of jointly posing event detection and localization as a regression problem, the superior performance on the ITC-Irst and UPC-TALP databases demonstrates the efficiency and potential of the proposed approach.
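A minimal sketch of the regression idea in Python, assuming superframe features and per-superframe event boundaries are already available; scikit-learn's RandomForestRegressor stands in for the paper's random-forest learner, and the median-based aggregation is an illustrative choice, not the authors' exact decision rule.

```python
# Sketch: regress onset/offset displacements from superframe features.
# Feature extraction and the recognition phase are assumed to happen elsewhere.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def train_displacement_regressor(superframes, onsets, offsets, centers):
    """superframes: (N, D) features; onsets/offsets: event boundaries (s);
    centers: temporal centers of the superframes (s)."""
    # Targets: displacements from each superframe to the event onset and offset.
    y = np.stack([centers - onsets, offsets - centers], axis=1)
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(superframes, y)
    return model

def estimate_boundaries(model, superframes, centers):
    # Each superframe votes for an (onset, offset) pair; aggregate by median
    # (illustrative aggregation, not necessarily the paper's rule).
    d = model.predict(superframes)          # (N, 2) predicted displacements
    onsets = centers - d[:, 0]
    offsets = centers + d[:, 1]
    return np.median(onsets), np.median(offsets)
```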


Conference of the International Speech Communication Association | 2016

Robust Audio Event Recognition with 1-Max Pooling Convolutional Neural Networks.

Huy Phan; Lars Hertel; Marco Maaß; Alfred Mertins

We present in this paper a simple yet efficient convolutional neural network (CNN) architecture for robust audio event recognition. In contrast to deep CNN architectures with multiple convolutional and pooling layers topped with multiple fully connected layers, the proposed network consists of only three layers: a convolutional layer, a pooling layer, and a softmax layer. Two further features distinguish it from the deep architectures that have been proposed for the task: varying-size convolutional filters at the convolutional layer and a 1-max pooling scheme at the pooling layer. Intuitively, the network tends to select the most discriminative features from the whole audio signal for recognition. Our proposed CNN not only shows state-of-the-art performance on the standard task of robust audio event recognition but also outperforms other deep architectures by up to 4.5% in terms of recognition accuracy, which is equivalent to a 76.3% relative error reduction.
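A hedged PyTorch sketch of the three-layer architecture described above; the filter widths, filter count, and input dimensionality are illustrative placeholders rather than the paper's settings.

```python
# Sketch: parallel 1-D convolutions of several widths, 1-max pooling over time,
# and a softmax classifier (applied inside the loss). Sizes are illustrative.
import torch
import torch.nn as nn

class OneMaxPoolCNN(nn.Module):
    def __init__(self, n_classes, in_dim=40, widths=(3, 5, 7), n_filters=100):
        super().__init__()
        # One convolution per filter width, applied along the time axis.
        self.convs = nn.ModuleList(
            [nn.Conv1d(in_dim, n_filters, w) for w in widths])
        self.fc = nn.Linear(n_filters * len(widths), n_classes)

    def forward(self, x):                        # x: (batch, in_dim, time)
        pooled = []
        for conv in self.convs:
            h = torch.relu(conv(x))              # (batch, n_filters, time')
            pooled.append(h.max(dim=2).values)   # 1-max pooling over time
        z = torch.cat(pooled, dim=1)
        return self.fc(z)                        # class logits
```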


International Joint Conference on Neural Networks | 2016

Comparing time and frequency domain for audio event recognition using deep learning

Lars Hertel; Huy Phan; Alfred Mertins

Recognizing acoustic events is an intricate problem for a machine and an emerging field of research. Deep neural networks achieve convincing results and are currently the state-of-the-art approach for many tasks. One advantage is their implicit feature learning, as opposed to explicit feature extraction from the input signal. In this work, we analyzed whether more discriminative features can be learned from the time-domain or the frequency-domain representation of the audio signal. For this purpose, we trained multiple deep networks with different architectures on the Freiburg-106 and ESC-10 datasets. Our results show that feature learning from the frequency domain is superior to the time domain. Moreover, additionally using convolution and pooling layers to explore local structures of the audio signal significantly improves the recognition performance and achieves state-of-the-art results.
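A small sketch of the two input representations being compared; the frame length, hop size, and Hann window are assumed values, and the downstream networks are omitted.

```python
# Sketch: the same audio can be fed to a network either as framed raw
# waveform (time domain) or as a log-magnitude spectrogram (frequency domain).
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    # Time-domain input: overlapping frames of the raw waveform.
    n = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n)[:, None]
    return x[idx]                               # (n_frames, frame_len)

def log_spectrogram(x, frame_len=400, hop=160):
    # Frequency-domain input: log-magnitude FFT of windowed frames.
    frames = frame_signal(x, frame_len, hop) * np.hanning(frame_len)
    spec = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(spec + 1e-10)                 # (n_frames, frame_len // 2 + 1)
```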


International Conference of the IEEE Engineering in Medicine and Biology Society | 2013

Metric learning for automatic sleep stage classification

Huy Phan; Quan Do; The-Luan Do; Duc-Lung Vu

We introduce in this paper a metric learning approach for automatic sleep stage classification based on single-channel EEG data. We show that by learning a global metric from training data instead of using the default Euclidean metric, the k-nearest neighbor classification rule outperforms state-of-the-art methods on the Sleep-EDF dataset with various classification settings. The overall accuracies for the Awake/Sleep and 4-class classification settings are 98.32% and 94.49%, respectively. Furthermore, this superior accuracy is achieved by performing classification on a low-dimensional feature space derived from the time and frequency domains and without the need for artifact removal as a preprocessing step.
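A hedged sketch of the pipeline, using the metric-learn package's LMNN as a representative global metric learner and scikit-learn's k-nearest-neighbor classifier; the paper's exact metric-learning algorithm, feature set, and hyperparameters are not reproduced here.

```python
# Sketch: learn a global metric from training data, then classify sleep
# stages with k-NN in the transformed space. LMNN is a stand-in learner.
from metric_learn import LMNN
from sklearn.neighbors import KNeighborsClassifier

def train_sleep_stager(X_train, y_train, k=5):
    metric = LMNN()                              # representative metric learner
    metric.fit(X_train, y_train)                 # learn the global metric
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(metric.transform(X_train), y_train)  # k-NN in the learned space
    return metric, knn

def predict_stages(metric, knn, X_test):
    return knn.predict(metric.transform(X_test))
```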


Workshop on Applications of Signal Processing to Audio and Acoustics | 2015

A multi-channel fusion framework for audio event detection

Huy Phan; Marco Maass; Lars Hertel; Radoslaw Mazur; Alfred Mertins

We propose in this paper a simple yet efficient multi-channel fusion framework for joint acoustic event detection and classification. The joint problem on individual channels is posed as a regression problem to estimate event onset and offset positions. As an intermediate result, we also obtain posterior probabilities which measure the confidence that event onsets and offsets are present at a temporal position. This facilitates the fusion problem: the posterior probabilities of the different channels are accumulated, and the detection hypotheses are then determined based on the summed posterior probabilities. While the proposed fusion framework is simple and natural, it significantly outperforms all single-channel baseline systems on the ITC-Irst database. We also show that adding channels one by one into the fusion system yields performance improvements, and the performance of the fusion system is always better than that of the individual-channel counterparts.
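A minimal sketch of the fusion rule, assuming each channel's detector already yields a confidence curve over time; the averaging and the peak-picking threshold are illustrative choices rather than the paper's exact hypothesis-selection procedure.

```python
# Sketch: sum per-channel posterior curves and pick peaks as detection hypotheses.
import numpy as np

def fuse_channels(posteriors, threshold=0.5):
    """posteriors: list of (T,) confidence curves, one per channel."""
    fused = np.sum(posteriors, axis=0) / len(posteriors)   # accumulate and normalize
    # Simple hypothesis selection: local maxima above a threshold (illustrative).
    peaks = [t for t in range(1, len(fused) - 1)
             if fused[t] > threshold
             and fused[t] >= fused[t - 1] and fused[t] >= fused[t + 1]]
    return fused, peaks
```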


ACM Multimedia | 2016

Label Tree Embeddings for Acoustic Scene Classification

Huy Phan; Lars Hertel; Marco Maass; Philipp Koch; Alfred Mertins

We present in this paper an efficient approach for acoustic scene classification by exploring the structure of class labels. Given a set of class labels, a category taxonomy is automatically learned by collectively optimizing a clustering of the labels into multiple meta-classes in a tree structure. An acoustic scene instance is then embedded into a low-dimensional feature representation which consists of the likelihoods that it belongs to the meta-classes. We demonstrate state-of-the-art results on two different datasets for the acoustic scene classification task, including the DCASE 2013 and LITIS Rouen datasets.
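A hedged sketch of how a binary label tree might be built by recursively splitting the label set with spectral clustering of a label-affinity matrix; the paper's actual tree-learning objective and the subsequent meta-class likelihood embedding are not reproduced here.

```python
# Sketch: recursively split class labels into two meta-classes using spectral
# clustering of a label-affinity matrix (e.g., derived from class confusions).
import numpy as np
from sklearn.cluster import SpectralClustering

def build_label_tree(labels, similarity):
    """labels: list of class indices; similarity: (C, C) label-affinity matrix."""
    if len(labels) <= 1:
        return {"labels": labels, "children": []}
    sub = similarity[np.ix_(labels, labels)]
    split = SpectralClustering(n_clusters=2, affinity="precomputed").fit_predict(sub)
    left = [l for l, s in zip(labels, split) if s == 0]
    right = [l for l, s in zip(labels, split) if s == 1]
    if not left or not right:                    # degenerate split, stop recursing
        return {"labels": labels, "children": []}
    return {"labels": labels,
            "children": [build_label_tree(left, similarity),
                         build_label_tree(right, similarity)]}
```

Each internal node of the resulting tree defines a meta-class; a scene instance can then be represented by the likelihoods that it belongs to these meta-classes, as described in the abstract.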


IEEE Transactions on Audio, Speech, and Language Processing | 2016

Learning representations for nonspeech audio events through their similarities to speech patterns

Huy Phan; Lars Hertel; Marco Maass; Radoslaw Mazur; Alfred Mertins

The human auditory system is very well matched to both human speech and environmental sounds. Therefore, the question arises whether human speech material may provide useful information for training systems for analyzing nonspeech audio signals, e.g., in a classification task. In order to answer this question, we consider speech patterns as basic acoustic concepts, which embody and represent the target nonspeech signal. To find out how similar the nonspeech signal is to speech, we classify it with a classifier trained on the speech patterns and use the classification posteriors to represent the closeness to the speech bases. The speech similarities are finally employed as a descriptor to represent the target signal. We further show that a better descriptor can be obtained by learning to organize the speech categories hierarchically with a tree structure. Furthermore, these descriptors are generic. That is, once the speech classifier has been learned, it can be employed as a feature extractor for different datasets without retraining. Lastly, we propose an algorithm to select a sufficient subset of speech patterns that approximates the representation capability of the entire set. We conduct experiments for the application of audio event analysis. Phone triplets from the TIMIT dataset were used as speech patterns to learn the descriptors for audio events from three datasets of different complexity: UPC-TALP, Freiburg-106, and NAR. The experimental results on the event classification task show that good performance can be easily obtained even with a simple linear classifier. Furthermore, fusion of the learned descriptors as an additional source leads to state-of-the-art performance on all three target datasets.
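A minimal sketch of the descriptor idea: a classifier trained on speech patterns is reused as a feature extractor for nonspeech audio. Logistic regression and the feature inputs are placeholders, and the hierarchical organization of speech categories is omitted.

```python
# Sketch: train on speech patterns, then use the classification posteriors of
# nonspeech audio as its "similarity to speech" descriptor.
from sklearn.linear_model import LogisticRegression

def learn_speech_bases(speech_features, phone_labels):
    # Placeholder classifier over the speech patterns (e.g., phone triplets).
    clf = LogisticRegression(max_iter=1000)
    clf.fit(speech_features, phone_labels)
    return clf

def speech_similarity_descriptor(clf, nonspeech_features):
    # Posterior over the speech classes serves as the descriptor of the event.
    return clf.predict_proba(nonspeech_features)
```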


International Conference on Multimedia and Expo | 2015

Early event detection in audio streams

Huy Phan; Marco Maass; Radoslaw Mazur; Alfred Mertins

Audio event detection has been an active field of research in recent years. However, most of the proposed methods, if not all, analyze and detect complete events, and little attention has been paid to early detection. In this paper, we present a system that enables early audio event detection in continuous audio recordings, in which an event can be reliably recognized when only a partial duration has been observed. Our evaluation on the ITC-Irst database, one of the standard databases of the CLEAR 2006 evaluation, shows that, on the one hand, the proposed system outperforms the best baseline system by 16% and 8% in terms of detection error rate and detection accuracy, respectively; on the other hand, even partial events are enough to achieve the performance obtainable when whole events are observed.
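A hedged sketch of the early-detection loop, where a placeholder scoring function (score_prefix, assumed here) is applied to a growing prefix of the stream and a detection is reported once its confidence crosses a threshold; the paper's actual detector and decision rule are not reproduced.

```python
# Sketch: report an event as soon as the detector is confident enough on a
# partial observation, instead of waiting for the complete event.
def early_detect(stream_frames, score_prefix, threshold=0.8):
    """stream_frames: sequence of audio frames; score_prefix: placeholder
    callable returning a detection confidence for a prefix of the stream."""
    confidence = 0.0
    for t in range(1, len(stream_frames) + 1):
        confidence = score_prefix(stream_frames[:t])   # detector on the prefix
        if confidence >= threshold:
            return t, confidence      # fired after observing only a partial event
    return None, confidence           # no event detected in this stream
```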


IEEE Transactions on Audio, Speech, and Language Processing | 2017

Improved Audio Scene Classification Based on Label-Tree Embeddings and Convolutional Neural Networks

Huy Phan; Lars Hertel; Marco Maass; Philipp Koch; Radoslaw Mazur; Alfred Mertins

In this paper, we present an efficient approach for audio scene classification. We aim at learning representations for scene examples by exploring the structure of their class labels. A category taxonomy is automatically learned by collectively optimizing a tree-structured clustering of the given labels into multiple metaclasses. A scene recording is then transformed into a label-tree embedding image. Elements of the image represent the likelihoods that the scene instance belongs to the metaclasses. We investigate classification with label-tree embedding features learned from different low-level features as well as their fusion. We show that the combination of multiple features is essential to obtain good performance. While averaging label-tree embedding images over time yields good performance, we argue that average pooling possesses an intrinsic shortcoming. We alternatively propose an improved classification scheme to bypass this limitation. We aim at automatically learning common templates that are useful for the classification task from these images using simple but tailored convolutional neural networks. The trained networks are then employed as a feature extractor that matches the learned templates across a label-tree embedding image and produces the maximum matching scores as features for classification. Since audio scenes exhibit rich content, template learning and matching on low-level features would be inefficient. With label-tree embedding features, we have quantized and reduced the low-level features into the likelihoods of the metaclasses, on which the template learning and matching are efficient. We study both convolutional neural networks trained on stacked label-tree embedding images and multistream networks. Experimental results on the DCASE2016 and LITIS Rouen datasets demonstrate the efficiency of the proposed methods.
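A hedged PyTorch sketch of the template-matching view: learned convolutional filters are matched across a label-tree embedding image and each filter's maximum response is kept as a feature; the filter count and kernel size are illustrative, not the paper's configuration.

```python
# Sketch: convolutional "templates" matched over a label-tree embedding image,
# with the maximum matching score of each template used for classification.
import torch
import torch.nn as nn

class LTETemplateMatcher(nn.Module):
    def __init__(self, n_classes, n_templates=64, kernel=(3, 3)):
        super().__init__()
        self.conv = nn.Conv2d(1, n_templates, kernel)   # learned templates
        self.fc = nn.Linear(n_templates, n_classes)

    def forward(self, lte_image):   # lte_image: (batch, 1, metaclasses, time)
        h = torch.relu(self.conv(lte_image))
        # Maximum matching score of every template over the whole image,
        # replacing plain average pooling over time.
        scores = torch.amax(h, dim=(2, 3))
        return self.fc(scores)
```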


International Conference on Acoustics, Speech, and Signal Processing | 2017

CNN-LTE: A class of 1-X pooling convolutional neural networks on label tree embeddings for audio scene classification

Huy Phan; Philipp Koch; Lars Hertel; Marco Maass; Radoslaw Mazur; Alfred Mertins

We present in this work an approach for audio scene classification. Firstly, given the label set of the scenes, a label tree is automatically constructed where the labels are grouped into meta-classes. This category taxonomy is then used in the feature extraction step, in which an audio scene instance is transformed into a label tree embedding image. Elements of the image indicate the likelihoods that the scene instance belongs to different meta-classes. A class of simple 1-X (i.e., 1-max, 1-mean, and 1-mix) pooling convolutional neural networks, tailored for the task at hand, is finally learned on top of the image features for scene recognition. Experimental results on the DCASE 2013 and DCASE 2016 datasets demonstrate the efficiency of the proposed method.
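A minimal sketch of the 1-X pooling variants on a convolutional feature map; the mixing rule used for 1-mix here (a fixed convex combination of max and mean) is an assumption rather than the paper's definition.

```python
# Sketch: the three 1-X pooling operations applied over the time axis of a
# convolutional feature map h of shape (batch, filters, time).
import torch

def one_max(h):
    return torch.amax(h, dim=2)         # keep the strongest activation

def one_mean(h):
    return torch.mean(h, dim=2)         # average all activations

def one_mix(h, alpha=0.5):
    # Assumed mixing rule: convex combination of 1-max and 1-mean pooling.
    return alpha * one_max(h) + (1 - alpha) * one_mean(h)
```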

Collaboration


Dive into Huy Phan's collaborations.
