Publication


Featured research published by Po-Sen Huang.


Conference on Information and Knowledge Management | 2013

Learning deep structured semantic models for web search using clickthrough data

Po-Sen Huang; Xiaodong He; Jianfeng Gao; Li Deng; Alex Acero; Larry P. Heck

Latent semantic models, such as LSA, aim to map a query to its relevant documents at the semantic level, where keyword-based matching often fails. In this study we develop a series of new latent semantic models with a deep structure that project queries and documents into a common low-dimensional space, where the relevance of a document given a query is readily computed as the distance between them. The proposed deep structured semantic models are discriminatively trained by maximizing the conditional likelihood of the clicked documents given a query using the clickthrough data. To make our models applicable to large-scale Web search, we also use a technique called word hashing, which is shown to effectively scale up our semantic models to handle the large vocabularies common in such tasks. The new models are evaluated on a Web document ranking task using a real-world data set. Results show that our best model significantly outperforms other latent semantic models that were considered state-of-the-art prior to the work presented in this paper.
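As a rough illustration of the word-hashing idea, each word can be broken into letter trigrams with `#` boundary markers, shrinking a large word vocabulary to a much smaller trigram vocabulary. This is a minimal sketch under stated assumptions; `letter_ngrams` and `hash_vocabulary` are hypothetical helper names, and the paper's exact scheme may differ in details:

```python
def letter_ngrams(word, n=3):
    """Break a word into letter n-grams, padding with '#' boundary markers."""
    padded = "#" + word + "#"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def hash_vocabulary(words):
    """Collect the letter-trigram vocabulary induced by a word vocabulary.

    Many distinct words share trigrams, so this set grows far more slowly
    than the word vocabulary itself.
    """
    trigram_set = set()
    for w in words:
        trigram_set.update(letter_ngrams(w))
    return sorted(trigram_set)

# Example: "good" maps to the trigrams ['#go', 'goo', 'ood', 'od#'].
```

A word is then represented as a sparse vector over this trigram vocabulary rather than as a one-hot vector over all words, which is what makes the input layer tractable for Web-scale vocabularies.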


International Conference on Acoustics, Speech, and Signal Processing | 2012

Singing-voice separation from monaural recordings using robust principal component analysis

Po-Sen Huang; Scott Chen; Paris Smaragdis; Mark Hasegawa-Johnson

Separating singing voices from music accompaniment is an important task in many applications, such as music information retrieval and lyric recognition and alignment. Music accompaniment can be assumed to lie in a low-rank subspace because of its repetitive structure, whereas singing voices can be regarded as relatively sparse within songs. In this paper, based on this assumption, we propose using robust principal component analysis for separating singing voices from music accompaniment. Moreover, we refine the separation result using a binary time-frequency masking method. Evaluations on the MIR-1K dataset show that this method achieves around 1-1.4 dB higher GNSDR than two state-of-the-art approaches, without using prior training or requiring particular features.
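The binary masking step can be sketched as follows, assuming the RPCA decomposition has already produced a low-rank (accompaniment) and a sparse (voice) magnitude estimate; `binary_mask`, `apply_mask`, and the `gain` threshold parameter are hypothetical names for illustration:

```python
import numpy as np

def binary_mask(sparse_mag, lowrank_mag, gain=1.0):
    """Binary time-frequency mask: assign each bin to the singing voice
    when the sparse component dominates the low-rank one."""
    return (np.abs(sparse_mag) > gain * np.abs(lowrank_mag)).astype(float)

def apply_mask(mixture_spec, mask):
    """Split the mixture spectrogram into voice and accompaniment parts;
    the two outputs sum exactly back to the mixture."""
    return mask * mixture_spec, (1.0 - mask) * mixture_spec
```

Applying the mask to the mixture spectrogram (rather than taking the sparse component directly) keeps the separated sources consistent with the observed signal.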


International Conference on Acoustics, Speech, and Signal Processing | 2014

Deep learning for monaural speech separation

Po-Sen Huang; Minje Kim; Mark Hasegawa-Johnson; Paris Smaragdis

Monaural source separation is useful for many real-world applications, though it is a challenging problem. In this paper, we study deep learning for monaural speech separation. We propose jointly optimizing deep learning models (deep neural networks and recurrent neural networks) with an extra masking layer, which enforces a reconstruction constraint. Moreover, we explore a discriminative training criterion for the neural networks to further enhance the separation performance. We evaluate our approaches on the TIMIT speech corpus for a monaural speech separation task. Our proposed models achieve about 3.8-4.9 dB SIR gain compared to NMF models, while maintaining better SDRs and SARs.


IEEE Transactions on Audio, Speech, and Language Processing | 2015

Joint optimization of masks and deep recurrent neural networks for monaural source separation

Po-Sen Huang; Minje Kim; Mark Hasegawa-Johnson; Paris Smaragdis

Monaural source separation is important for many real-world applications. It is challenging because, with only a single channel of information available and no further constraints, an infinite number of solutions are possible. In this paper, we explore joint optimization of masking functions and deep recurrent neural networks for monaural source separation tasks, including speech separation, singing voice separation, and speech denoising. The joint optimization of the deep recurrent neural networks with an extra masking layer enforces a reconstruction constraint. Moreover, we explore a discriminative criterion for training neural networks to further enhance the separation performance. We evaluate the proposed system on the TSP, MIR-1K, and TIMIT datasets for speech separation, singing voice separation, and speech denoising tasks, respectively. Our approaches achieve 2.30-4.98 dB SDR gain compared to NMF models in the speech separation task, 2.30-2.48 dB GNSDR gain and 4.32-5.42 dB GSIR gain compared to existing models in the singing voice separation task, and outperform NMF and DNN baselines in the speech denoising task.
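The extra masking layer described above can be sketched with NumPy: the network's two raw source estimates are turned into a soft mask that rescales the mixture, so the separated sources always sum back to the input. This is a minimal sketch, not the paper's implementation; `masking_layer` and `eps` are hypothetical names:

```python
import numpy as np

def masking_layer(y1, y2, mixture, eps=1e-8):
    """Soft time-frequency masking layer: rescale the two network outputs
    so that the separated sources sum exactly to the mixture spectrogram
    (the reconstruction constraint)."""
    m1 = np.abs(y1) / (np.abs(y1) + np.abs(y2) + eps)
    return m1 * mixture, (1.0 - m1) * mixture
```

Because the mask is a deterministic, differentiable function of the network outputs, it can be placed on top of a DNN or RNN and trained jointly with the rest of the model by backpropagation.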


International Conference on Acoustics, Speech, and Signal Processing | 2014

Kernel methods match Deep Neural Networks on TIMIT

Po-Sen Huang; Haim Avron; Tara N. Sainath; Vikas Sindhwani; Bhuvana Ramabhadran

Despite their theoretical appeal and grounding in tractable convex optimization techniques, kernel methods are often not the first choice for large-scale speech applications due to their significant memory requirements and computational expense. In recent years, randomized approximate feature maps have emerged as an elegant mechanism to scale up kernel methods. Still, in practice, a large number of random features is required to obtain acceptable accuracy in predictive tasks. In this paper, we develop two algorithmic schemes to address this computational bottleneck in the context of kernel ridge regression. The first scheme is a specialized distributed block coordinate descent procedure that avoids the explicit materialization of the feature-space data matrix, while the second gains efficiency by combining multiple weak random feature models in an ensemble learning framework. We demonstrate that these schemes enable kernel methods to match the performance of state-of-the-art Deep Neural Networks on TIMIT for speech recognition and classification tasks. In particular, we obtain the best classification error rates reported on TIMIT using kernel methods.
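The random-feature approximation underlying this line of work can be sketched with standard Rahimi-Recht random Fourier features followed by closed-form ridge regression in the feature space. This is a minimal single-machine sketch, assuming a Gaussian kernel; the paper's distributed block coordinate descent and ensemble schemes are not shown, and the function names are hypothetical:

```python
import numpy as np

def random_fourier_features(X, n_features, gamma, rng):
    """Approximate the Gaussian kernel k(x, y) = exp(-gamma * ||x - y||^2)
    with random Fourier features: z(x) = sqrt(2/D) * cos(x W + b)."""
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

def ridge_fit(Z, y, lam):
    """Closed-form ridge regression on the random feature matrix Z."""
    D = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(D), Z.T @ y)
```

With enough features, the inner product of the feature maps converges to the kernel value, so linear ridge regression on `Z` approximates kernel ridge regression without ever forming the full kernel matrix.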


International Conference on Consumer Electronics | 2009

A block-based 2D-to-3D conversion system with bilateral filter

Chao-Chung Cheng; Chung-Te Li; Po-Sen Huang; Tsung-Kai Lin; Yi-Min Tsai; Liang-Gee Chen

Three-dimensional (3D) displays provide a dramatic improvement in visual quality over 2D displays, and converting existing 2D videos to 3D is necessary for multimedia applications. This paper presents an automatic and robust system to convert 2D videos to 3D. The proposed 2D-to-3D conversion combines two major depth generation modules: depth from motion and depth from geometrical perspective. A block-based algorithm cooperates with a bilateral filter to diminish blocking artifacts and generate a comfortable depth map. After the depth map is generated, multi-view video is rendered for 3D display.


International Conference on Acoustics, Speech, and Signal Processing | 2011

Improving acoustic event detection using generalizable visual features and multi-modality modeling

Po-Sen Huang; Xiaodan Zhuang; Mark Hasegawa-Johnson

Acoustic event detection (AED) aims to identify both the timestamps and types of multiple events and has been found to be very challenging. The cues for these events often exist in both audio and vision, but not necessarily in a synchronized fashion. We study improving the detection and classification of the events using cues from both modalities. We propose optical-flow-based spatial pyramid histograms as a generalizable visual representation that does not require training on labeled video data. Hidden Markov models (HMMs) are used for audio-only modeling, and multi-stream HMMs or coupled HMMs (CHMMs) are used for audio-visual joint modeling. To allow for audio-visual state asynchrony, we explore effective CHMM training via HMM state-space mapping, parameter tying, and different initialization schemes. The proposed methods improve acoustic event classification and detection on a multimedia meeting room dataset containing eleven types of general non-speech events, without using any data resource beyond the video stream accompanying the audio observations. Our systems perform favorably compared to previously reported systems that leverage ad-hoc visual cue detectors and localization information obtained from multiple microphones.


International Conference on Acoustics, Speech, and Signal Processing | 2013

Predicting speech recognition confidence using deep learning with word identity and score features

Po-Sen Huang; Kshitiz Kumar; Chaojun Liu; Yifan Gong; Li Deng

Confidence classifiers for automatic speech recognition (ASR) provide a quantitative representation of the reliability of ASR decoding. In this paper, we improve ASR confidence measure performance for an utterance using two distinct approaches: (1) defining and incorporating additional predictors in the confidence classifier, including those based on word identity and on aggregated words, and (2) training the confidence classifier on deep learning architectures, including the deep neural network (DNN) and the kernel deep convex network (K-DCN). Our experiments show that adding the new predictors to our multi-layer perceptron (MLP)-based baseline classifier provides a 38.6% relative reduction in the correct-reject rate, our measure of classifier performance. Further, replacing the MLP with the DNN and K-DCN provides additional relative performance gains of 14.5% and 47.5%, respectively.


International Conference on Acoustics, Speech, and Signal Processing | 2012

How to put it into words - using random forests to extract symbol level descriptions from audio content for concept detection

Po-Sen Huang; Robert Mertens; Ajay Divakaran; Gerald Friedland; Mark Hasegawa-Johnson

This paper presents a system that uses symbolic representations of audio concepts as words to describe audio tracks, enabling it to go beyond the state of the art (audio event classification of a small number of audio classes in constrained settings) to large-scale classification in the wild. These audio words may be less meaningful to a human annotator, but they are descriptive for computer algorithms. We devise a random-forest vocabulary learning method with an audio word weighting scheme based on TF-IDF and TD-IDD, combining the computational simplicity and accurate multi-class classification of the random forest with the data-driven discriminative power of the TF-IDF/TD-IDD methods. The proposed random forest clustering with text-retrieval methods significantly outperforms two state-of-the-art methods on the dry-run set and the full set of the TRECVID MED 2010 dataset.
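Once each track is a bag of audio words, the TF-IDF part of the weighting scheme is the standard text-retrieval formula. A minimal sketch is shown below (the TD-IDD variant is omitted, and `tf_idf` is a hypothetical name, not the paper's code):

```python
import math

def tf_idf(doc_term_counts):
    """TF-IDF weighting over a list of bag-of-audio-words documents,
    each given as a dict mapping audio word -> count."""
    n_docs = len(doc_term_counts)
    # Document frequency: in how many documents does each word appear?
    df = {}
    for doc in doc_term_counts:
        for term in doc:
            df[term] = df.get(term, 0) + 1
    weighted = []
    for doc in doc_term_counts:
        total = sum(doc.values())
        weighted.append({t: (c / total) * math.log(n_docs / df[t])
                         for t, c in doc.items()})
    return weighted
```

Words that occur in every track get weight zero, while words concentrated in few tracks are up-weighted, which is the discriminative effect the abstract refers to.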


North American Chapter of the Association for Computational Linguistics | 2018

Discourse-aware neural rewards for coherent text generation

Antoine Bosselut; Asli Celikyilmaz; Xiaodong He; Jianfeng Gao; Po-Sen Huang; Yejin Choi

In this paper, we investigate the use of discourse-aware rewards with reinforcement learning to guide a model to generate long, coherent text. In particular, we propose to learn neural rewards to model cross-sentence ordering as a means to approximate desired discourse structure. Empirical results demonstrate that a generator trained with the learned reward produces more coherent and less repetitive text than models trained with cross-entropy or with reinforcement learning with commonly used scores as rewards.

Collaboration


Dive into Po-Sen Huang's collaborations.

Top Co-Authors

Gerald Friedland
International Computer Science Institute

Robert Mertens
University of California

Luke R. Gottlieb
International Computer Science Institute