Younggwan Kim
KAIST
                                 Network
                            
                            Latest external collaboration on country level. Dive into details by clicking on the dots.
                                 Publication
                            
                            Featured researches published by Younggwan Kim.
IEEE Transactions on Audio, Speech, and Language Processing | 2015
Myung Jong Kim; Younggwan Kim; Hoirin Kim
This paper presents a new method for automatically assessing the speech intelligibility of patients with dysarthria, which is a motor speech disorder impeding the physical production of speech. The proposed method consists of two main steps: feature representation and prediction. In the feature representation step, the speech utterance is converted into a phone sequence using an automatic speech recognition technique and is then aligned with a canonical phone sequence from a pronunciation dictionary using a weighted finite state transducer to capture the pronunciation mappings such as match, substitution, and deletion. The histograms of the pronunciation mappings on a pre-defined word set are used for features. Next, in the prediction step, a structured sparse linear model incorporated with phonological knowledge that simultaneously addresses phonologically structured sparse feature selection and intelligibility prediction is proposed. Evaluation of the proposed method on a database of 109 speakers consisting of 94 dysarthric and 15 control speakers yielded a root mean square error of 8.14 compared to subjectively rated scores in the range of 0 to 100. This is a promising performance in which the system can be successfully applied to help speech therapists in diagnosing the degree of speech disorder.
IEEE Transactions on Neural Systems and Rehabilitation Engineering | 2017
Myungjong Kim; Younggwan Kim; Joohong Yoo; Jun Wang; Hoirin Kim
This paper addresses the problem of recognizing the speech uttered by patients with dysarthria, which is a motor speech disorder impeding the physical production of speech. Patients with dysarthria have articulatory limitation, and therefore, they often have trouble in pronouncing certain sounds, resulting in undesirable phonetic variation. Modern automatic speech recognition systems designed for regular speakers are ineffective for dysarthric sufferers due to the phonetic variation. To capture the phonetic variation, Kullback–Leibler divergence-based hidden Markov model (KL-HMM) is adopted, where the emission probability of state is parameterized by a categorical distribution using phoneme posterior probabilities obtained from a deep neural network-based acoustic model. To further reflect speaker-specific phonetic variation patterns, a speaker adaptation method based on a combination of L2 regularization and confusion-reducing regularization, which can enhance discriminability between categorical distributions of the KL-HMM states while preserving speaker-specific information is proposed. Evaluation of the proposed speaker adaptation method on a database of several hundred words for 30 speakers consisting of 12 mildly dysarthric, 8 moderately dysarthric, and 10 non-dysarthric control speakers showed that the proposed approach significantly outperformed the conventional deep neural network-based speaker adapted system on dysarthric as well as non-dysarthric speech.
international conference on acoustics, speech, and signal processing | 2014
Younggwan Kim; Hoirin Kim
Maximum a posterior (MAP) adaptation is one of the popular and powerful methods for obtaining a speaker-specific acoustic model. Basically, MAP adaptation needs a data storage for speaker adaptive (SA) model as much as speaker independent (SI) model needs. Modern speech recognition systems have a huge number of parameters and deal with millions of users. To reduce the data storage for SA models, in this paper, we propose a constrained maximum likelihood estimation-based speaker adaptation with L1 regularization. By the proposed method, we can more efficiently perform the model adjustments for SA models without almost any loss of phone recognition performance than the conventional sparse MAP adaptation method.
acm multimedia | 2010
Myung Jong Kim; Younggwan Kim; JaeDeok Lim; Hoirin Kim
This paper addresses the problem of recognizing malicious sounds, such as sexual scream or moan, to detect and block the objectionable multimedia contents. The malicious sounds show the distinct characteristics that have large temporal variations and fast spectral transitions. Therefore, extracting appropriate features to properly represent these characteristics is important in achieving a better performance. In this paper, we employ segment-based two-dimensional Mel-frequency cepstral coefficients and histograms of gradient directions as a feature set to characterize both the temporal variations and spectral transitions within a long-range segment of the target signal. Gaussian mixture model (GMM) is adopted to statistically represent the malicious and non-malicious sounds, and the test sounds are classified by a maximum a posterior probability (MAP) method. Evaluation of the proposed feature extraction method on a database of several hundred malicious and non-malicious sound clips yielded precision of 91.31% and recall of 94.27%. This result suggests that this approach could be used as an alternative to the image-based methods.
EURASIP Journal on Advances in Signal Processing | 2011
Younggwan Kim; Youngjoo Suh; Hoirin Kim
The role of the statistical model-based voice activity detector (SMVAD) is to detect speech regions from input signals using the statistical models of noise and noisy speech. The decision rule of SMVAD is based on the likelihood ratio test (LRT). The LRT-based decision rule may cause detection errors because of statistical properties of noise and speech signals. In this article, we first analyze the reasons why the detection errors occur and then propose two modified decision rules using reliable likelihood ratios (LRs). We also propose an effective weighting scheme considering spectral characteristics of noise and speech signals. In the experiments proposed in this study, with almost no additional computations, the proposed methods show significant performance improvement in various noise conditions. Experimental results also show that the proposed weighting scheme provides additional performance improvement over the two proposed SMVADs.
conference of the international speech communication association | 2016
Jahyun Goo; Younggwan Kim; Hyungjun Lim; Hoirin Kim
In this paper, we propose a simple speaker normalization for deep neural network (DNN) using i-vectors, the state-of-the-art technique for speaker recognition, for automatic speech recognition. There have been already many techniques using ivectors for speaker adaptation or speaker variability reduction of DNN acoustic models. However, in order to add the speaker information into the acoustic feature, most of those techniques have to train a large number of parameters while dimensionality of the i-vector is quite small. We tried to apply a componentwise shift to the acoustic features by linearly transformed ivector, and then achieved the better performance than typical approaches. On top of that, we propose to modify this structure to adapt each frame of the features, reducing the number of parameters. Experiments were conducted on the TED-LIUM release-1 corpus, and the proposed method showed some performance gains.
Journal of the Korean society of speech sciences | 2015
Jahyun Goo; Younggwan Kim; Hoirin Kim
In this paper, we propose L1-norm regularization for state vector adaptation of subspace Gaussian mixture model (SGMM). When you design a speaker adaptation system with GMM-HMM acoustic model, MAP is the most typical technique to be considered. However, in MAP adaptation procedure, large number of parameters should be updated simultaneously. We can adopt sparse adaptation such as L1-norm regularization or sparse MAP to cope with that, but the performance of sparse adaptation is not good as MAP adaptation. However, SGMM does not suffer a lot from sparse adaptation as GMM-HMM because each Gaussian mean vector in SGMM is defined as a weighted sum of basis vectors, which is much robust to the fluctuation of parameters. Since there are only a few adaptation techniques appropriate for SGMM, our proposed method could be powerful especially when the number of adaptation data is limited. Experimental results show that error reduction rate of the proposed method is better than the result of MAP adaptation of SGMM, even with small adaptation data.
EURASIP Journal on Advances in Signal Processing | 2015
Younggwan Kim; Myung Jong Kim; Hoirin Kim
To reduce data storage for speaker adaptive (SA) models, in our previous work, we proposed a sparse speaker adaptation method which can efficiently reduce the number of adapted parameters by using Euclidean projection onto the L1-ball (EPL1) while maintaining recognition performance comparable to maximum a posteriori (MAP) adaptation. In the EPL1-based sparse speaker adaptation framework, however, the adapted Gaussian mean vectors are mostly concentrated on dimensions having large variances because of assuming unit variance for all dimensions. To make EPL1 more flexible, in this paper, we propose scaled norm-based Euclidean projection (SNEP) which can consider dimension-specific variances. By using SNEP, we also propose a new sparse speaker adaptation method which can consider the variances of a speaker-independent model. Our experiments show that the adapted components of mean vectors are evenly distributed in all dimensions, and we can obtain sparsely adapted models with no loss of phone recognition performance from the proposed method compared with MAP adaptation.
international conference on multimedia and expo | 2013
Myung Jong Kim; Younggwan Kim; Seungki Hong; Hoirin Kim
This paper addresses the problem of automatically detecting infant crying sounds. Infant crying sounds show the distinct and regular time-frequency patterns that include a clear harmonic structure and a unique melody. Therefore, extracting appropriate features to properly represent these characteristics is important in achieving a good performance. In this paper, we propose weighted segment-based two-dimensional linear-frequency cepstral coefficients to characterize the time-frequency patterns within a long-range segment of the target signal. A Gaussian mixture model is adopted to statistically represent the crying and non-crying sounds, and test sounds are classified by using a likelihood ratio test. Evaluation of the proposed feature extraction method on a database of several hundred crying and non-crying sound clips yields an average equal error rate of 4.42% in various noisy environments, showing over 20% relative improvements compared to conventional feature extraction methods.
conference of the international speech communication association | 2015
Myung Jong Kim; Joohong Yoo; Younggwan Kim; Hoirin Kim
