Publication


Featured research published by Kyu Jeong Han.


Computer Speech & Language | 2013

Automatic speaker age and gender recognition using acoustic and prosodic level information fusion

Ming Li; Kyu Jeong Han; Shrikanth Narayanan

The paper presents a novel automatic speaker age and gender identification approach which combines seven different methods at both acoustic and prosodic levels to improve the baseline performance. The three baseline subsystems are (1) Gaussian mixture model (GMM) based on mel-frequency cepstral coefficient (MFCC) features, (2) Support vector machine (SVM) based on GMM mean supervectors and (3) SVM based on 450-dimensional utterance level features including acoustic, prosodic and voice quality information. In addition, we propose four subsystems: (1) SVM based on UBM weight posterior probability supervectors using the Bhattacharyya probability product kernel, (2) Sparse representation based on UBM weight posterior probability supervectors, (3) SVM based on GMM maximum likelihood linear regression (MLLR) matrix supervectors and (4) SVM based on the polynomial expansion coefficients of the syllable level prosodic feature contours in voiced speech segments. Contours of pitch, time domain energy, frequency domain harmonic structure energy and formant for each syllable (segmented using energy information in the voiced speech segment) are considered for analysis in subsystem (4). The proposed four subsystems have been demonstrated to be effective and able to achieve competitive results in classifying different age and gender groups. To further improve the overall classification performance, weighted summation based fusion of these seven subsystems at the score level is demonstrated. Experiment results are reported on the development and test set of the 2010 Interspeech Paralinguistic Challenge aGender database. Compared to the SVM baseline system (3), which is the baseline system suggested by the challenge committee, the proposed fusion system achieves 5.6% absolute improvement in unweighted accuracy for the age task and 4.2% for the gender task on the development set. On the final test set, we obtain 3.1% and 3.8% absolute improvement, respectively.
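As a loose illustration of the score-level fusion step described above, the sketch below combines per-class scores from a few subsystems by weighted summation; the class layout, subsystem labels, and weights are invented for the example and are not taken from the paper.

```python
import numpy as np

# Hypothetical per-class scores from three subsystems for one utterance
# (rows: subsystems, columns: age/gender classes); values are illustrative only.
subsystem_scores = np.array([
    [0.2, 0.5, 0.1, 0.2],   # e.g., GMM-MFCC subsystem
    [0.1, 0.6, 0.2, 0.1],   # e.g., GMM mean supervector SVM subsystem
    [0.3, 0.4, 0.1, 0.2],   # e.g., utterance-level feature SVM subsystem
])

# Fusion weights, assumed here to be tuned on a development set.
weights = np.array([0.3, 0.5, 0.2])

# Weighted summation at the score level, then pick the highest-scoring class.
fused_scores = weights @ subsystem_scores
predicted_class = int(np.argmax(fused_scores))
print(fused_scores, predicted_class)
```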


IEEE Transactions on Audio, Speech, and Language Processing | 2008

Strategies to Improve the Robustness of Agglomerative Hierarchical Clustering Under Data Source Variation for Speaker Diarization

Kyu Jeong Han; Samuel Kim; Shrikanth Narayanan

Many current state-of-the-art speaker diarization systems exploit agglomerative hierarchical clustering (AHC) as their speaker clustering strategy, due to its simple processing structure and acceptable level of performance. However, AHC is known to suffer from performance robustness under data source variation. In this paper, we address this problem. We specifically focus on the issues associated with the widely used clustering stopping method based on Bayesian information criterion (BIC) and the merging-cluster selection scheme based on generalized likelihood ratio (GLR). First, we propose a novel alternative stopping method for AHC based on information change rate (ICR). Through experiments on several meeting corpora, the proposed method is demonstrated to be more robust to data source variation than the BIC-based one. The average improvement obtained in diarization error rate (DER) by this method is 8.76% (absolute) or 35.77% (relative). We also introduce a selective AHC (SAHC) in the paper, which first runs AHC with the ICR-based stopping method only on speech segments longer than 3 s and then classifies shorter speech segments into one of the clusters given by the initial AHC. This modified version of AHC is motivated by our previous analysis that the proportion of short speech turns (or segments) in a data source is a significant factor contributing to the robustness problem arising in the GLR-based merging-cluster selection scheme. The additional performance improvement obtained by SAHC is 3.45% (absolute) or 14.08% (relative) in terms of averaged DER.
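A rough sketch of the selective AHC idea under simplifying assumptions: each cluster is modeled by a single diagonal Gaussian rather than the paper's cluster models, ICR is approximated as the per-frame log-likelihood loss incurred by merging, and the stopping threshold, frame rate, and 3 s cutoff are illustrative parameters rather than the paper's settings.

```python
import numpy as np

def gauss_loglik(frames, mean, var):
    """Total log-likelihood of frames under a diagonal Gaussian."""
    return float(np.sum(-0.5 * (np.log(2 * np.pi * var) + (frames - mean) ** 2 / var)))

def icr(a, b):
    """Per-frame log-likelihood loss from merging two clusters of frames,
    used here as a stand-in for the paper's information change rate."""
    merged = np.vstack([a, b])
    def ll(x):
        return gauss_loglik(x, x.mean(axis=0), x.var(axis=0) + 1e-6)
    return (ll(a) + ll(b) - ll(merged)) / len(merged)

def selective_ahc(segments, frame_rate=100, min_dur_s=3.0, icr_threshold=0.5):
    """Cluster segments longer than min_dur_s with ICR-stopped AHC, then
    assign the shorter segments to the closest resulting cluster.
    Each segment is a (frames, dims) NumPy array; at least one long
    segment is assumed to exist."""
    long_segs = [s for s in segments if len(s) >= min_dur_s * frame_rate]
    short_segs = [s for s in segments if len(s) < min_dur_s * frame_rate]

    clusters = [s.copy() for s in long_segs]
    while len(clusters) > 1:
        # Find the closest cluster pair under ICR.
        best, i, j = min((icr(clusters[i], clusters[j]), i, j)
                         for i in range(len(clusters))
                         for j in range(i + 1, len(clusters)))
        if best > icr_threshold:      # ICR-based stopping criterion
            break
        clusters[i] = np.vstack([clusters[i], clusters[j]])
        del clusters[j]

    # Classify each short segment into the best-matching cluster.
    for s in short_segs:
        scores = [gauss_loglik(s, c.mean(axis=0), c.var(axis=0) + 1e-6) for c in clusters]
        k = int(np.argmax(scores))
        clusters[k] = np.vstack([clusters[k], s])
    return clusters
```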


IEEE Automatic Speech Recognition and Understanding Workshop | 2007

Robust speaker clustering strategies to data source variation for improved speaker diarization

Kyu Jeong Han; Samuel Kim; Shrikanth Narayanan

Agglomerative hierarchical clustering (AHC) has been widely used in speaker diarization systems to classify speech segments in a given data source by speaker identity, but is known not to be robust to data source variation. In this paper, we identify one of the key potential sources of this variability that negatively affects clustering error rate (CER), namely short speech segments, and propose three solutions to tackle this issue. Through experiments on various meeting conversation excerpts, the proposed methods are shown to outperform simple AHC in terms of relative CER improvements in the range of 17-32%.


International Conference on Acoustics, Speech, and Signal Processing | 2008

Novel inter-cluster distance measure combining GLR and ICR for improved agglomerative hierarchical speaker clustering

Kyu Jeong Han; Shrikanth Narayanan

Agglomerative hierarchical clustering (AHC) has been a popular strategy for speaker clustering, due to its simple structure and acceptable level of performance. One of the main challenges in AHC that affects clustering performance is how to select the closest cluster pair for merging at every recursion. For this, the generalized likelihood ratio (GLR) has been widely adopted as an inter-cluster distance measure. However, it tends to be affected by the size of the clusters considered, which could result in erroneous selection of the cluster pair to be merged during AHC. To tackle this problem, we propose a novel alternative to GLR in this paper, which is a combination of GLR and information change rate (ICR) that we recently introduced for addressing the aforementioned tendency of GLR. Experiments on various meeting speech data show that this combined measure improves clustering performance on average by around 30% (relative).
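The sketch below computes a generalized likelihood ratio between two clusters under single full-covariance Gaussian models and shows one plausible way of blending it with a size-normalized term in the spirit of ICR; the actual cluster models and combination rule in the paper may differ.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_lik(frames):
    """Log-likelihood of frames under one full-covariance Gaussian fitted
    to those same frames (a common simplification for GLR computation)."""
    mean = frames.mean(axis=0)
    cov = np.cov(frames, rowvar=False) + 1e-6 * np.eye(frames.shape[1])
    return float(multivariate_normal(mean, cov).logpdf(frames).sum())

def log_glr(x, y):
    """Log generalized likelihood ratio between two clusters of frames;
    larger values indicate more dissimilar clusters."""
    return log_lik(x) + log_lik(y) - log_lik(np.vstack([x, y]))

def combined_distance(x, y, alpha=0.5):
    """Illustrative GLR/ICR blend: ICR is taken here as GLR normalized by
    the merged-cluster size, and alpha trades off the two terms. The exact
    combination used in the paper may differ."""
    glr = log_glr(x, y)
    return alpha * glr + (1 - alpha) * glr / (len(x) + len(y))
```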


Journal of Multimedia | 2010

Multimodal Speaker Segmentation and Identification in Presence of Overlapped Speech Segments

Viktor Rozgic; Kyu Jeong Han; Panayiotis G. Georgiou; Shrikanth Narayanan

We describe a multimodal algorithm for speaker segmentation and identification with two main contributions. First, we propose a hidden Markov model architecture that performs fusion of three information sources: a multicamera system for participant localization, a microphone array for speaker localization, and a speaker identification system. Second, we present a novel likelihood model for the microphone array observations for dealing with overlapped speech. We propose a modification of the Steered Power Response Generalized Cross Correlation Phase Transform (SPR-GCC-PHAT) function that takes into account possible microphone occlusions, and use its local maxima as microphone array observations. The likelihood of the extracted local maxima given positions of active speakers is modeled using the Joint Probabilistic Data Association (JPDA) framework. The state in the proposed hidden Markov model is a vector of the speaker activity indicators of present participants, and the unknown parameter is the mapping of participants’ locations to the set of all possible participants’ identities. We present and compare two ways for the joint estimation of the states and the unknown parameter: the first, a forward Bayesian filter that performs sequential estimate updates as new observations arrive, and the second, a batch decoding using the Viterbi algorithm. Results show that, for both decoding algorithms, the proposed method outperforms standard speaker segmentation systems based on (a) speaker identification and (b) microphone array processing, for a dataset with a significant portion (27.4%) of overlapped speech, and scores as high as 94.4% on the F-measure scale.
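For context, the snippet below is a generic GCC-PHAT time-delay estimator between two microphones, the standard building block that SPR-GCC-PHAT steers over candidate source positions; it does not include the occlusion handling or steered power response described above.

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Generic GCC-PHAT time-delay estimate between two microphone signals;
    the standard building block, not the paper's occlusion-aware
    SPR-GCC-PHAT variant."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-12          # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / float(fs)                # estimated delay in seconds
```

A call such as gcc_phat(mic1, mic2, fs=16000, max_tau=0.001) returns the estimated inter-microphone delay in seconds, which can then be related to candidate speaker positions.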


Spoken Language Technology Workshop | 2012

Frame-based phonotactic Language Identification

Kyu Jeong Han; Jason W. Pelecanos

This paper describes a frame-based phonotactic Language Identification (LID) system, which was used for the LID evaluation of the Robust Automatic Transcription of Speech (RATS) program by the Defense Advanced Research Projects Agency (DARPA). The proposed approach utilizes features derived from frame-level phone log-likelihoods from a phone recognizer. It is an attempt to capture not only phone sequence information but also short-term timing information for phone N-gram events, which is lacking in conventional phonotactic LID systems that simply count phone N-gram events. Based on this new method, we achieved 26% relative improvement in terms of Cavg for the RATS LID evaluation data compared to phone N-gram counts modeling. We also observed that it had a significant impact on score combination with our best acoustic system based on Mel-Frequency Cepstral Coefficients (MFCCs).
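As a point of reference, a conventional phonotactic front end simply counts phone N-gram events from a recognizer's output, as in the minimal sketch below; the frame-based approach of the paper additionally folds in frame-level phone log-likelihoods and short-term timing, which this sketch does not attempt.

```python
from collections import Counter

def phone_ngram_counts(phone_sequence, n=3):
    """Bag of phone n-grams from a phone recognizer's 1-best output,
    the conventional phonotactic representation that the frame-based
    approach extends with frame-level timing information."""
    padded = ["<s>"] + list(phone_sequence) + ["</s>"]
    return Counter(tuple(padded[i:i + n]) for i in range(len(padded) - n + 1))

# Illustrative phone string (not taken from the RATS data):
print(phone_ngram_counts(["k", "ae", "t", "s"]))
```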


International Symposium on Multimedia | 2009

A Low-Complexity Dynamic Face-Voice Feature Fusion Approach to Multimodal Person Recognition

Dhaval Shah; Kyu Jeong Han; Shrikanth Narayanan

In this paper, we show the importance of face-voice correlation for audio-visual person recognition. We evaluate the performance of a system which uses the correlation between audio-visual features during speech against audio-only, video-only and audio-visual systems which use audio and visual features independently, neglecting the interdependency of a person's spoken utterance and the associated facial movements. Experiments performed on the VidTIMIT dataset show that the proposed multimodal scheme has a lower error rate than all other comparison conditions and is more robust against replay attacks. The simplicity of the fusion technique also allows the use of only one classifier, which greatly simplifies system design and allows for a simple real-time DSP implementation.
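A minimal sketch of the kind of low-complexity feature-level fusion described above: the two streams are brought to a common frame count and concatenated per frame so that a single classifier sees face and voice jointly. The nearest-neighbor alignment and the array shapes assumed here are illustrative, not the paper's exact front end.

```python
import numpy as np

def fuse_av_features(audio_feats, video_feats):
    """Concatenate time-aligned audio and visual feature streams so a single
    classifier can model their correlation. Both inputs are assumed to be
    (frames, dims) NumPy arrays sampled at possibly different rates."""
    n = min(len(audio_feats), len(video_feats))
    # Nearest-neighbor resampling of each stream to a common frame count.
    a_idx = np.linspace(0, len(audio_feats) - 1, n).round().astype(int)
    v_idx = np.linspace(0, len(video_feats) - 1, n).round().astype(int)
    return np.hstack([audio_feats[a_idx], video_feats[v_idx]])
```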


International Conference on Acoustics, Speech, and Signal Processing | 2011

Forensically inspired approaches to automatic speaker recognition

Kyu Jeong Han; Mohamed Kamal Omar; Jason W. Pelecanos; Cezar Pendus; Sibel Yaman; Weizhong Zhu

This paper presents ongoing research leveraging forensic methods for automatic speaker recognition. Some of the methods forensic scientists employ include identifying speaker-distinctive audio segments and comparing these segments using features such as pitch, formants, and other information. Other approaches have also involved performing a phonetic analysis to recognize idiolectal attributes, and an implicit analysis of the demographics of speakers. Inspired by these forensic phonetic approaches, we target three threads of work: hot-spot analysis, speaker style and pronunciation modelling, and demographics analysis. As a result of this work, we show that a phonetic analysis conditioned on select speech events (or hot-spots) can outperform a phonetic analysis performed over all speech without conditioning. In the area of pronunciation modelling, one set of results demonstrates significantly improved robustness by exploiting phonetic structure in an automatic speech recognition system. For demographics analysis, we present state-of-the-art results of systems capable of detecting dialect, non-nativeness and native language.


International Journal of Semantic Computing | 2010

Robust Multimodal Person Recognition Using Low-Complexity Audio-Visual Feature Fusion Approaches

Dhaval Shah; Kyu Jeong Han; Shrikanth Narayanan

In this paper, we first show the importance of face-voice correlation for audio-visual person recognition. We propose a simple multimodal fusion technique which preserves the correlation between audio-visual features during speech and evaluate the performance of such a system against audio-only, video-only, and audio-visual systems which use audio and visual features while neglecting the interdependency of a person's spoken utterance and the associated facial movements. Experiments performed on the VidTIMIT dataset show that the proposed multimodal fusion scheme has a lower error rate than all other comparison conditions and is more robust against replay attacks. The simplicity of the fusion technique allows for low-complexity designs for a simple low-cost real-time DSP implementation. We then discuss some problems associated with the previously proposed design and, as a solution to those problems, propose two novel classifier designs which provide more flexibility and a convenient way to represent multimodal data where each modality has different characteristics. We also show that these novel classifier designs offer superior performance in terms of both accuracy and robustness.


Multimedia Signal Processing | 2008

The SAIL speaker diarization system for analysis of spontaneous meetings

Kyu Jeong Han; Panayiotis G. Georgiou; Shrikanth Narayanan

In this paper, we propose a novel approach to speaker diarization of spontaneous meetings in our own multimodal SmartRoom environment. The proposed speaker diarization system first applies a sequential clustering concept to segmentation of a given audio data source, and then performs agglomerative hierarchical clustering for speaker-specific classification (or speaker clustering) of speech segments. The speaker clustering algorithm utilizes an incremental Gaussian mixture cluster modeling strategy, and a stopping point estimation method based on information change rate. Through experiments on various meeting conversation data of approximately 200 minutes in total length, this system is demonstrated to provide a diarization error rate of 18.90% on average.

Collaboration


Dive into Kyu Jeong Han's collaborations.

Top Co-Authors

Shrikanth Narayanan, University of Southern California
Ian R. Lane, Carnegie Mellon University
Panayiotis G. Georgiou, University of Southern California
Jungsuk Kim, Carnegie Mellon University
Ming Li, University of Southern California
Dhaval Shah, University of Southern California