
Publication


Featured research published by Wooil Kim.


Speech Communication | 2009

Feature compensation in the cepstral domain employing model combination

Wooil Kim; John H. L. Hansen

In this paper, we present an effective cepstral feature compensation scheme which leverages knowledge of the speech model in order to achieve robust speech recognition. In the proposed scheme, the requirement for a prior noisy speech database in off-line training is eliminated by employing parallel model combination for the noise-corrupted speech model. Gaussian mixture models of clean speech and noise are used for the model combination. The adaptation of the noisy speech model is accomplished solely by updating the noise model. This method has the advantage of reduced computational expense and improved accuracy for model estimation, since it is applied in the cepstral domain. In order to cope with time-varying background noise, a novel interpolation method over multiple models is employed. By sequentially calculating the posterior probability of each environmental model, the compensation procedure can be applied on a frame-by-frame basis. In order to reduce the computational expense of the multiple-model method, a technique of sharing similar Gaussian components is proposed. Acoustically similar components across an inventory of environmental models are selected by the proposed sub-optimal algorithm, which employs the Kullback-Leibler similarity distance. The combined hybrid model, which consists of the selected Gaussian components, is used as the shared noisy speech model. The performance is examined using Aurora2 and speech data from an in-vehicle environment. The proposed feature compensation algorithm is compared with standard methods in the field (e.g., CMN, spectral subtraction, RATZ). The experimental results demonstrate that the proposed feature compensation schemes are very effective in realizing robust speech recognition in adverse noisy environments, and that the proposed model combination-based feature compensation method is superior to existing model-based feature compensation methods. Of particular interest is that the proposed method shows up to an 11.59% relative WER reduction compared to the ETSI AFE front-end method. The multi-model approach is effective at coping with changing noise conditions in the input speech, producing performance comparable to the matched model condition. Applying the mixture sharing method brings a significant reduction in computational overhead while maintaining recognition performance at a reasonable level with near real-time operation.
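As a rough illustration of the parallel model combination step described in the abstract, the sketch below combines a clean-speech mean and a noise mean in the cepstral domain by passing through the linear-spectral domain (a minimal single-Gaussian sketch under a log-add approximation; the DCT setup and all names are assumptions, not the authors' implementation):

```python
import numpy as np

def pmc_mean(clean_cep_mean, noise_cep_mean, dct, idct, gain=1.0):
    """Combine clean-speech and noise mean vectors (cepstral domain) into a
    noisy-speech mean: cepstrum -> log-spectrum -> linear spectrum, add the
    noise there (noise is additive in the linear domain), and map back."""
    clean_lin = np.exp(idct @ clean_cep_mean)   # cepstrum -> linear spectrum
    noise_lin = np.exp(idct @ noise_cep_mean)
    noisy_lin = clean_lin + gain * noise_lin    # additive combination
    return dct @ np.log(noisy_lin)              # back to the cepstral domain

# Toy example with a 4-band log-spectral space and an orthonormal DCT-II basis.
n = 4
dct = np.array([[np.cos(np.pi * k * (2 * m + 1) / (2 * n)) for m in range(n)]
                for k in range(n)]) * np.sqrt(2.0 / n)
dct[0] /= np.sqrt(2.0)
idct = np.linalg.inv(dct)

clean = dct @ np.log(np.array([4.0, 2.0, 1.0, 0.5]))   # clean linear spectrum
noise = dct @ np.log(np.array([1.0, 1.0, 1.0, 1.0]))   # flat noise spectrum
noisy = pmc_mean(clean, noise, dct, idct)
print(idct @ noisy)  # log-spectrum of the combined (noisy-speech) model
```

In this toy case the combined model's linear spectrum is simply the element-wise sum of the clean and noise spectra.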


Speech Communication | 2010

Automatic voice onset time detection for unvoiced stops (/p/,/t/,/k/) with application to accent classification

John H. L. Hansen; Sharmistha Gray; Wooil Kim

Articulation characteristics of particular phonemes can provide cues to distinguish accents in spoken English. For example, as shown in Arslan and Hansen (1996, 1997), Voice Onset Time (VOT) can be used to classify Mandarin, Turkish, German, and American accented English. Our goal in this study is to develop an automatic system that classifies accents using VOT in unvoiced stops. VOT is an important temporal feature which is often overlooked in speech perception and speech recognition, as well as in accent detection; fixed-length frame-based speech processing inherently ignores it. In this paper, a more effective VOT detection scheme using the non-linear energy tracking algorithm Teager Energy Operator (TEO), applied across a sub-frequency band partition for unvoiced stops (/p/, /t/, and /k/), is introduced. The proposed VOT detection algorithm also incorporates spectral differences in the Voice Onset Region (VOR) and the succeeding vowel of a given stop-vowel sequence to classify speakers having accents due to different ethnic origins. The spectral cues are enhanced using one of four feature parameterizations: the Discrete Mellin Transform (DMT), the Discrete Mellin Fourier Transform (DMFT), and the Discrete Wavelet Transform using the lowest and highest frequency resolutions (DWTlfr and DWThfr). A Hidden Markov Model (HMM) classifier is employed with these extracted parameters and applied to the problem of accent classification. Three language groups (American English, Chinese, and Indian) are used from the CU-Accent database. The VOT is detected with less than 10% error when compared to manually detected VOT, with success rates of 79.90%, 87.32%, and 47.73% for English, Chinese, and Indian speakers (including atypical cases for the Indian group), respectively. It is noted that the DMT and DWTlfr features are good for parameterizing speech samples which exhibit substitution of the succeeding vowel after the stop in accented speech. The accent classification success rates of the DMT and DWTlfr features are 66.13% and 71.67% for /p/ and /t/, respectively, for pairwise accent detection. Alternatively, the DMFT feature works on all accent-sensitive words considered, with a success rate of 70.63%. This study shows that effective VOT detection can be achieved using integrated TEO processing with spectral difference analysis in the VOR, which can be employed for accent classification.
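The Teager Energy Operator at the core of the proposed VOT detector is a simple three-sample nonlinear operator; a minimal discrete-time sketch (illustrative only, not the paper's sub-band implementation):

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager Energy Operator: psi[x](n) = x(n)^2 - x(n-1)*x(n+1).
    For a pure tone A*cos(w*n + p) the output equals A^2 * sin(w)^2, so it
    tracks both the amplitude and the frequency of the signal."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

# For a unit-amplitude tone at digital frequency w, the output is sin(w)^2.
w = 0.3
tone = np.cos(w * np.arange(1000))
print(np.allclose(teager_energy(tone), np.sin(w) ** 2))  # True
```

Because the operator is local (three samples), it responds quickly to the burst and onset events that delimit VOT.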


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2006

Band-Independent Mask Estimation for Missing-Feature Reconstruction in the Presence of Unknown Background Noise

Wooil Kim; Richard M. Stern

An effective mask estimation scheme for missing-feature reconstruction is described that achieves robust speech recognition in the presence of unknown noise. In previous work on Bayesian classification for mask estimation, white noise and colored noise were used for training mask estimators. This paper, which is concerned both with the simulation of a more diverse set of background environments and with mitigating the sparse training problem, describes a new Bayesian mask-estimation procedure in which each frequency band is trained independently. The new method employs colored noise for training, which is obtained by partitioning each frequency subband. We also propose a reevaluation method for voiced/unvoiced decisions to alleviate performance degradation caused by errors in pitch detection. Experimental results indicate that the proposed procedure, in conjunction with cluster-based missing-feature imputation, improves speech recognition accuracy on the Aurora 2.0 database in the presence of all types of background noise considered.
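Band-independent mask estimation trains a separate reliable/unreliable classifier for each frequency band. A heavily simplified sketch with one Gaussian per class (the paper uses richer Bayesian classifiers and features; the function names and synthetic data here are assumptions):

```python
import numpy as np

def train_band_classifier(feats_reliable, feats_unreliable):
    """Fit one full-covariance Gaussian per class for a single frequency
    band. A small diagonal term regularizes the covariance estimate."""
    stats = []
    for f in (feats_reliable, feats_unreliable):
        cov = np.cov(f, rowvar=False) + 1e-6 * np.eye(f.shape[1])
        stats.append((np.mean(f, axis=0), cov))
    return stats

def classify_reliable(x, stats, prior_reliable=0.5):
    """Return True if the band's feature vector x is classified reliable,
    via the Gaussian log-likelihood ratio plus the log-prior ratio."""
    def loglik(x, mu, cov):
        d = x - mu
        return -0.5 * (d @ np.linalg.solve(cov, d) + np.log(np.linalg.det(cov)))
    lr = loglik(x, *stats[0]) - loglik(x, *stats[1])
    return bool(lr + np.log(prior_reliable / (1 - prior_reliable)) > 0)

# Synthetic per-band features: reliable near the origin, unreliable near (3, 3).
rng = np.random.default_rng(0)
rel = rng.normal(0.0, 1.0, (500, 2))
unr = rng.normal(3.0, 1.0, (500, 2))
stats = train_band_classifier(rel, unr)
print(classify_reliable(np.array([0.0, 0.0]), stats))  # True
print(classify_reliable(np.array([3.0, 3.0]), stats))  # False
```

Training one such classifier per band is what makes the scheme band-independent: each band's decision uses only that band's statistics.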


EURASIP Journal on Advances in Signal Processing | 2011

Robust Emotional Stressed Speech Detection Using Weighted Frequency Subbands

John H. L. Hansen; Wooil Kim; Mandar A. Rahurkar; Evan Ruzanski; James L. Meyerhoff

The problem of detecting psychological stress from speech is challenging due to differences in how speakers convey stress. Changes in speech production due to speaker state are not linearly dependent on changes in stress. Research is further complicated by the existence of different stress types and the lack of metrics capable of discriminating stress levels. This study addresses the problem of automatic detection of speech under stress using a previously developed feature extraction scheme based on the Teager Energy Operator (TEO). To improve detection performance, (i) a selected sub-band frequency-partitioned weighting scheme and (ii) a weighting scheme for all frequency bands are proposed. Using the traditional TEO-based feature vector with a closed-speaker Hidden Markov Model-trained stressed speech classifier, error rates of 22.5/13.0% for stress/neutral speech are obtained. With the new weighted sub-band detection scheme, closed-speaker error rates are reduced to 4.7/4.6% for stress/neutral detection, a relative error reduction of 79.1/64.6%, respectively. For the open-speaker case, stress/neutral speech detection error rates of 69.7/16.2% using traditional features are reduced to 13.1/4.0% (a relative 81.3/75.4% reduction) with the proposed automatic frequency sub-band weighting scheme. Finally, issues related to speaker dependent/independent scenarios, vowel duration, and mismatched vowel type on stress detection performance are discussed.


IEEE Transactions on Audio, Speech, and Language Processing | 2009

Time–Frequency Correlation-Based Missing-Feature Reconstruction for Robust Speech Recognition in Band-Restricted Conditions

Wooil Kim; John H. L. Hansen

Band-limited speech represents one of the most challenging factors for robust speech recognition. This is especially true in supporting audio corpora from sources that have a range of conditions in spoken document retrieval requiring effective automatic speech recognition. The missing-feature reconstruction method has a problem when applied to band-limited speech reconstruction, since it assumes the observations in the unreliable regions are always greater than the latent original clean speech. The approach developed here depends only on reliable components to calculate the posterior probability, in order to mitigate the problem. This study proposes an advanced method to effectively utilize the correlation information of the spectral components across the time and frequency axes, in an effort to increase the performance of missing-feature reconstruction in band-limited conditions. We employ an F1 area window and a cutoff border window in order to include more knowledge on reliable components which are highly correlated with the cutoff frequency band. To detect the cutoff regions for missing-feature reconstruction, blind mask estimation is also presented, which employs the synthesized band-limited speech model without secondary training data. Experiments to evaluate the performance of the proposed methods are conducted using the SPHINX3 speech recognition engine and the TIMIT corpus. Experimental results demonstrate that the proposed time-frequency (TF) correlation-based missing-feature reconstruction method is significantly more effective at improving band-limited speech recognition accuracy. By employing the proposed TF missing-feature reconstruction method, we obtain up to a 14.61% average relative improvement in word error rate (WER) across four available bandwidths with cutoff frequencies of 1.0, 1.5, 2.0, and 2.5 kHz, compared to earlier formulated methods. Experimental results on the National Gallery of the Spoken Word (NGSW) corpus also show the proposed method is effective at improving band-limited speech recognition in real-life spoken document conditions.
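The core reconstruction step, with the estimate conditioned only on reliable components, can be sketched for a single Gaussian (the actual method uses a GMM plus the window schemes described in the abstract; all names and numbers here are illustrative):

```python
import numpy as np

def reconstruct_missing(x, reliable, mu, cov):
    """Estimate the unreliable components of x by the Gaussian conditional
    mean given the reliable components:
        x_u = mu_u + Cov_ur * Cov_rr^{-1} * (x_r - mu_r)
    (single-Gaussian sketch of cluster-based missing-feature reconstruction)."""
    r = np.asarray(reliable)
    u = ~r
    cov_ur = cov[np.ix_(u, r)]
    cov_rr = cov[np.ix_(r, r)]
    x_hat = x.copy()
    x_hat[u] = mu[u] + cov_ur @ np.linalg.solve(cov_rr, x[r] - mu[r])
    return x_hat

# Toy 3-band example: band 2 is masked, bands 0-1 are reliable.
mu = np.zeros(3)
cov = np.array([[1.0, 0.5, 0.4],
                [0.5, 1.0, 0.3],
                [0.4, 0.3, 1.0]])
x = np.array([1.0, 2.0, -9.9])           # -9.9 is a corrupted observation
rel = np.array([True, True, False])
print(reconstruct_missing(x, rel, mu, cov))  # third band re-estimated
```

The corrupted value never enters the computation; only the reliable bands and the model's cross-band correlations determine the filled-in estimate.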


Speech Communication | 2011

Mask classification for missing-feature reconstruction for robust speech recognition in unknown background noise

Wooil Kim; Richard M. Stern

Missing-feature techniques to improve speech recognition accuracy are based on the blind determination of which cells in a spectrogram-like display of speech are corrupted by the effects of noise or other types of disturbance (and hence are missing). In this paper we present three new approaches that improve the speech recognition accuracy obtained using missing-feature techniques. It had been found in previous studies (e.g. Seltzer et al., 2004) that Bayesian approaches to missing-feature classification are effective in ameliorating the effects of various types of additive noise. While Seltzer et al. primarily used white noise for training their Bayesian classifier, we have found that this is not the best type of training signal when noise with greater spectral and/or temporal variation is encountered in the testing environment. The first innovation introduced in this paper, referred to as frequency-dependent classification, involves independent classification in each of the various frequency bands in which the incoming speech is analyzed based on parallel sets of frequency-dependent features. The second innovation, referred to as colored-noise generation using multi-band partitioning, involves the use of masking noises with artificially-introduced spectral and temporal variation in training the Bayesian classifier used to determine which spectro-temporal components of incoming speech are corrupted by noise in unknown testing environments. The third innovation consists of an adaptive method to estimate the a priori values of the mask classifier that determines whether a particular time-frequency segment of the test data should be considered to be reliable or not. It is shown that these innovations provide improved speech recognition accuracy on a small vocabulary test when missing-feature restoration is applied to incoming speech that is corrupted by additive noise of an unknown nature, especially at lower signal-to-noise ratios.


IEEE Transactions on Audio, Speech, and Language Processing | 2011

A Novel Mask Estimation Method Employing Posterior-Based Representative Mean Estimate for Missing-Feature Speech Recognition

Wooil Kim; John H. L. Hansen

This paper proposes a novel mask estimation method for missing-feature reconstruction to improve speech recognition performance in various types of background noise conditions. A conventional mask estimation method based on spectral subtraction degrades performance, due to incorrect estimation of the noise signal which fails to accurately represent the variations of background noise during the incoming speech utterance. The proposed mask estimation method utilizes a Posterior-based Representative Mean (PRM) estimate for determining the reliability of the input speech spectral components, which is obtained as a weighted sum of the mean parameters of the speech model using the posterior probability. To obtain the noise-corrupted speech model, a model combination method is employed, which was proposed in our previous study for a feature compensation method. Experimental results demonstrate that the proposed mask estimation method provides more separable distributions for the reliable/unreliable component classifier compared to the conventional mask estimation method. The recognition performance is evaluated using the Aurora 2.0 framework over various types of background noise conditions and the CU-Move real-life in-vehicle corpus. The performance evaluation shows that the proposed mask estimation method is considerably more effective at increasing speech recognition performance in various types of background noise conditions, compared to the conventional mask estimation method which is based on spectral subtraction. By employing the proposed PRM-based mask estimation for missing-feature reconstruction, we obtain +23.41% and +9.45% average relative improvements in word error rate for all four types of noise conditions and CU-Move corpus, respectively, compared to conventional mask estimation methods.
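The PRM estimate itself, a posterior-weighted sum of the mixture means, can be sketched as follows (assuming a diagonal-covariance GMM; names and values are illustrative):

```python
import numpy as np

def prm_estimate(x, weights, means, variances):
    """Posterior-based Representative Mean: weight each mixture mean by the
    posterior p(k | x) under a diagonal-covariance GMM and sum over k."""
    # log N(x; mu_k, diag(var_k)) per component, up to an additive constant
    log_lik = -0.5 * np.sum((x - means) ** 2 / variances + np.log(variances), axis=1)
    log_post = np.log(weights) + log_lik
    log_post -= log_post.max()            # numerical stability
    post = np.exp(log_post)
    post /= post.sum()
    return post @ means                   # sum_k p(k | x) * mu_k

# Two well-separated components: an x near mu_0 yields a PRM close to mu_0.
w = np.array([0.5, 0.5])
mu = np.array([[0.0, 0.0], [10.0, 10.0]])
var = np.ones((2, 2))
print(prm_estimate(np.array([0.1, -0.2]), w, mu, var))  # ~ [0, 0]
```

The reliability mask then compares the observed spectral component against this representative mean rather than against a spectral-subtraction noise estimate.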


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2010

Angry emotion detection from real-life conversational speech by leveraging content structure

Wooil Kim; John H. L. Hansen

This study proposes an effective angry speech detection approach by leveraging content structure within the input speech. A classifier based on an “emotional” language model score is formulated and combined with acoustic-feature-based classifiers, including a TEO-based feature and conventional Mel frequency cepstral coefficients (MFCC). The proposed detection algorithm is evaluated on real-life conversational speech which was recorded between customers and call center operators over a telephone network. Analysis of the conversational speech corpus reveals distinctive differences between neutral and angry speech in word distribution and frequently occurring words. An improvement of up to 6.23% in Equal Error Rate (EER) is obtained by combining the TEO-based and MFCC feature classifiers with the emotional language model score based classifier.


IEEE Transactions on Audio, Speech, and Language Processing | 2010

Missing-Feature Reconstruction by Leveraging Temporal Spectral Correlation for Robust Speech Recognition in Background Noise Conditions

Wooil Kim; John H. L. Hansen

This paper proposes a novel missing-feature reconstruction method to improve speech recognition in background noise environments. The existing missing-feature reconstruction method utilizes log-spectral correlation across frequency bands. In this paper, we propose to employ a temporal spectral feature analysis to improve missing-feature reconstruction performance by leveraging temporal correlation across neighboring frames. In a similar manner to the conventional method, a Gaussian mixture model is trained over the temporal spectral feature set. The final estimates for missing-feature reconstruction are obtained by a selective combination of the original frequency correlation-based method and the proposed temporal correlation-based method. Performance of the proposed method is evaluated on the TIMIT speech corpus using various types of background noise conditions and on the CU-Move in-vehicle speech corpus. Experimental results demonstrate that the proposed method is more effective at increasing speech recognition performance in adverse conditions. By employing the proposed temporal-frequency based reconstruction method, a +17.71% average relative improvement in word error rate (WER) is obtained for white, car, speech babble, and background music conditions over 5-, 10-, and 15-dB SNR, compared to the original frequency correlation-based method. We also obtain a +16.72% relative improvement in real-life in-vehicle conditions using data from the CU-Move corpus.
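Building a temporal spectral feature set by stacking each frame with its neighbors can be sketched as follows (a hedged illustration; the context width and feature choice are assumptions, not the paper's exact setup):

```python
import numpy as np

def stack_frames(logspec, context=1):
    """Build temporal feature vectors by stacking each frame with its
    +/- context neighbors (edges padded by repetition). A GMM trained on
    these stacked vectors captures correlation across neighboring frames
    in addition to correlation across frequency bands."""
    T, D = logspec.shape
    padded = np.pad(logspec, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])

X = np.arange(12, dtype=float).reshape(4, 3)  # 4 frames, 3 log-spectral bands
print(stack_frames(X).shape)  # (4, 9): each frame plus one neighbor each side
```

Reconstruction on these stacked vectors lets a masked component borrow evidence from reliable components in adjacent frames, not just from other bands of the same frame.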


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2012

Feature compensation employing online GMM adaptation for speech recognition in unknown severely adverse environments

Wooil Kim; John H. L. Hansen

This study proposes an effective feature compensation method to improve speech recognition in real-life speech conditions, where (i) severe background noise and channel distortion simultaneously exist, (ii) no development data is available, and (iii) the clean data for ASR training and the latent clean speech in the test data are mismatched in acoustic structure. The proposed feature compensation method employs an online GMM adaptation procedure based on MLLR, and a minimum statistics replacement technique for non-speech segments. The DARPA Tank corpus, which includes severe real-life noisy conditions, is used for performance evaluation. The clean Broadcast News (BN) corpus is used for training the speech recognition system in this study. Experimental results show that the proposed feature compensation scheme outperforms GMM-based FMLLR and the ETSI AFE on the DARPA Tank data, achieving a +5.56% relative improvement compared to FMLLR. These results demonstrate that the proposed feature compensation scheme is effective at improving speech recognition performance in unknown real-life adverse environments.
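As a rough illustration of adapting a GMM online to unseen test conditions, the sketch below estimates a single global bias on the mixture means from the test frames themselves, i.e., a degenerate MLLR transform with the rotation fixed to the identity (an assumption for illustration; the paper's method estimates a full MLLR transform online):

```python
import numpy as np

def global_bias_adapt(frames, weights, means, variances):
    """Shift all GMM means by one shared bias vector estimated as the
    posterior-weighted average residual between the test frames and the
    mixture means (MLLR mean update restricted to A = I)."""
    num = np.zeros(means.shape[1])
    count = 0.0
    for x in frames:
        # component posteriors p(k | x) under a diagonal-covariance GMM
        log_lik = -0.5 * np.sum((x - means) ** 2 / variances + np.log(variances), axis=1)
        log_post = np.log(weights) + log_lik
        log_post -= log_post.max()
        post = np.exp(log_post)
        post /= post.sum()
        num += post @ (x - means)     # posterior-weighted residuals
        count += 1.0
    return means + num / count        # all means shifted by the same bias

# A constant channel offset of +2 on every frame is recovered as the bias.
w = np.array([0.5, 0.5])
mu = np.array([[0.0], [10.0]])
var = np.ones((2, 1))
frames = np.array([[2.0], [12.0], [2.0], [12.0]])
print(global_bias_adapt(frames, w, mu, var))  # ~ [[2.], [12.]]
```

Because the bias is estimated from the test utterance alone, no development data is required, which matches the no-development-data constraint the abstract describes.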

Collaboration



Top Co-Authors

John H. L. Hansen
University of Texas at Dallas

Richard M. Stern
Carnegie Mellon University

Jun Won Suh
University of Texas at Dallas