Myung Jong Kim
KAIST
Publications
Featured research published by Myung Jong Kim.
IEEE Transactions on Audio, Speech, and Language Processing | 2015
Myung Jong Kim; Younggwan Kim; Hoirin Kim
This paper presents a new method for automatically assessing the speech intelligibility of patients with dysarthria, a motor speech disorder that impedes the physical production of speech. The proposed method consists of two main steps: feature representation and prediction. In the feature representation step, the speech utterance is converted into a phone sequence using an automatic speech recognition technique and is then aligned with a canonical phone sequence from a pronunciation dictionary using a weighted finite-state transducer to capture pronunciation mappings such as match, substitution, and deletion. The histograms of the pronunciation mappings over a pre-defined word set are used as features. In the prediction step, a structured sparse linear model incorporating phonological knowledge is proposed, which simultaneously performs phonologically structured sparse feature selection and intelligibility prediction. Evaluation of the proposed method on a database of 109 speakers, consisting of 94 dysarthric and 15 control speakers, yielded a root mean square error of 8.14 against subjectively rated scores in the range of 0 to 100. This promising performance suggests that the system can be applied to help speech therapists diagnose the degree of speech disorder.
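The alignment-based features can be illustrated with a small sketch. The snippet below uses a plain edit-distance alignment in place of the paper's weighted finite-state transducer and counts match, substitution, insertion, and deletion events between a recognized and a canonical phone sequence; the example phone sequences and the function name align_counts are illustrative, not from the paper.

# Minimal sketch of the pronunciation-mapping feature idea: align a recognized
# phone sequence with a canonical one and count match/substitution/insertion/deletion
# events. A standard edit-distance alignment stands in for the paper's WFST; the
# histogram of mapping types per word would then feed a sparse linear predictor.
from collections import Counter

def align_counts(recognized, canonical):
    """Dynamic-programming alignment; returns counts of mapping types."""
    n, m = len(recognized), len(canonical)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for j in range(1, m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if recognized[i - 1] == canonical[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost,  # match / substitution
                           dp[i - 1][j] + 1,         # insertion (extra phone)
                           dp[i][j - 1] + 1)         # deletion (missing phone)
    # Trace back to label each alignment step
    counts, i, j = Counter(), n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (recognized[i - 1] != canonical[j - 1]):
            counts["match" if recognized[i - 1] == canonical[j - 1] else "substitution"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            counts["insertion"] += 1
            i -= 1
        else:
            counts["deletion"] += 1
            j -= 1
    return counts

# Example: canonical /k ae t/ recognized as /k ah t/ -> 2 matches, 1 substitution
print(align_counts(["k", "ah", "t"], ["k", "ae", "t"]))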
IEEE Transactions on Multimedia | 2012
Myung Jong Kim; Hoirin Kim
In this paper, the problem of detecting objectionable sounds, such as sexual screaming or moaning, in order to classify and block objectionable multimedia content is addressed. Objectionable sounds show distinctive characteristics, such as large temporal variations and fast spectral transitions, which differ from general audio signals such as speech and music. To represent these characteristics, segment-based two-dimensional Mel-frequency cepstral coefficients and histograms of gradient directions are used as a feature set to characterize the time-frequency dynamics within a long-range segment of the target signal. The extracted features are then projected to a lower-dimensional space while preserving discriminative information, using linear discriminant analysis based on a combination of global and local Fisher criteria. A Gaussian mixture model is adopted to statistically represent objectionable and non-objectionable sounds, and test sounds are classified using a likelihood ratio test. Evaluation of the proposed feature extraction method on a database of several hundred objectionable and non-objectionable sound clips yielded a precision/recall break-even point of 91.25%, a promising result indicating that the system can complement image-based approaches to blocking such multimedia content.
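The likelihood-ratio classification step can be sketched as follows, with random vectors standing in for the paper's segment-based 2D-MFCC and gradient-direction features; the mixture sizes, class labels, and decision threshold are assumptions for illustration only.

# Minimal sketch of a GMM likelihood-ratio test, using synthetic feature vectors
# in place of the paper's segment-based 2D-MFCC / gradient-direction features.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
target_feats = rng.normal(loc=1.0, scale=0.5, size=(500, 20))       # "objectionable"
background_feats = rng.normal(loc=-1.0, scale=0.5, size=(500, 20))   # "non-objectionable"

gmm_target = GaussianMixture(n_components=4, random_state=0).fit(target_feats)
gmm_background = GaussianMixture(n_components=4, random_state=0).fit(background_feats)

def classify(segment_feats, threshold=0.0):
    # Log-likelihood ratio averaged over the segment's feature frames
    llr = gmm_target.score(segment_feats) - gmm_background.score(segment_feats)
    return ("objectionable" if llr > threshold else "non-objectionable"), llr

print(classify(rng.normal(loc=1.0, scale=0.5, size=(50, 20))))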
content based multimedia indexing | 2011
Myung Jong Kim; Hoirin Kim
This paper focuses on the problem of classifying pornographic sounds, such as sexual screams or moans, to detect and block objectionable multimedia content. To represent the large temporal variations of pornographic sounds, we propose a novel feature extraction method based on the Radon transform. The Radon transform extracts the global trend of orientations in a 2-D region, so it can be applied to the time-frequency spectrogram of a long-range segment to capture the large temporal variations of such sounds. The Radon features are extracted as histograms and flux of the Radon coefficients. We adopt a Gaussian mixture model to statistically represent pornographic and non-pornographic sounds, and test sounds are classified using a likelihood ratio test. Evaluations on several hundred pornographic and non-pornographic sound clips indicate that the proposed features achieve satisfactory results, suggesting that this approach could be used as an alternative to image-based methods.
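A minimal sketch of a Radon-based descriptor over a spectrogram segment follows, assuming a synthetic chirp as the input clip and a simple energy/flux pooling of the Radon projections; the paper's exact histogram and flux definitions may differ.

# Minimal sketch: compute a log spectrogram of a long-range segment, take its Radon
# transform over a set of angles, and pool projection energy and flux as a descriptor.
import numpy as np
from scipy.signal import chirp, spectrogram
from skimage.transform import radon

sr = 16000
t = np.linspace(0, 2.0, int(2.0 * sr), endpoint=False)
y = chirp(t, f0=200, f1=3000, t1=2.0)                 # synthetic sweep as a stand-in clip

f, times, spec = spectrogram(y, fs=sr, nperseg=512)
log_spec = 10 * np.log10(spec + 1e-10)

angles = np.linspace(0.0, 180.0, 36, endpoint=False)
sinogram = radon(log_spec, theta=angles, circle=False)  # projections over orientations

# Simple orientation descriptor: energy per projection angle plus its "flux"
energy = np.linalg.norm(sinogram, axis=0)
flux = np.abs(np.diff(energy))
radon_feature = np.concatenate([energy / energy.sum(), flux / (flux.sum() + 1e-8)])
print(radon_feature.shape)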
international conference on acoustics, speech, and signal processing | 2016
Hyungjun Lim; Myung Jong Kim; Hoirin Kim
A well-trained acoustic model that effectively captures the characteristics of sound events is a critical factor in developing a more reliable system for sound event classification. A deep neural network (DNN), which can extract discriminative representations of features, is a good candidate for the acoustic model of sound events. Compared to other data such as speech or images, however, the amount of available sound data is often insufficient to train a DNN properly, resulting in overfitting. In this paper, we propose a cross-acoustic transfer learning framework that can effectively train the DNN even with insufficient sound data by employing rich speech data. Three datasets are used to evaluate the proposed method: one sound dataset from the Real World Computing Partnership (RWCP) DB and two speech datasets from the Resource Management (RM) and Wall Street Journal (WSJ) DBs. A series of experiments verifies that cross-acoustic transfer learning performs significantly better than the baseline DNN trained only on sound data, achieving a 26.24% relative classification error rate (CER) improvement over the DNN baseline system.
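A minimal sketch of the cross-acoustic transfer idea: pre-train the hidden layers on an abundant (here synthetic) speech task, then attach a fresh output layer and fine-tune on the smaller sound-event task. Layer sizes, optimizer, and the two-stage schedule below are illustrative assumptions rather than the paper's configuration.

# Minimal sketch of cross-acoustic transfer learning with a small feed-forward DNN.
import torch
import torch.nn as nn

feat_dim, n_speech_classes, n_sound_classes = 40, 100, 10

hidden = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                       nn.Linear(256, 256), nn.ReLU())
speech_head = nn.Linear(256, n_speech_classes)

def train(model, x, y, epochs=5):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()

# Stage 1: pre-train the shared hidden layers on abundant "speech" data
x_speech = torch.randn(2000, feat_dim)
y_speech = torch.randint(0, n_speech_classes, (2000,))
train(nn.Sequential(hidden, speech_head), x_speech, y_speech)

# Stage 2: keep the hidden layers, attach a fresh sound-event head, fine-tune
sound_head = nn.Linear(256, n_sound_classes)
x_sound = torch.randn(200, feat_dim)          # far less sound data
y_sound = torch.randint(0, n_sound_classes, (200,))
train(nn.Sequential(hidden, sound_head), x_sound, y_sound)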
conference of the international speech communication association | 2016
Myung Jong Kim; Jun Wang; Hoirin Kim
Dysarthria is a neuro-motor speech disorder that impedes the physical production of speech. Patients with dysarthria often have trouble pronouncing certain sounds, resulting in undesirable phonetic variation. Current automatic speech recognition systems designed for the general public are ineffective for speakers with dysarthria because of this phonetic variation. In this paper, we investigate dysarthric speech recognition using Kullback-Leibler divergence-based hidden Markov models. In this model, the emission probability of each state is a categorical distribution over phoneme posterior probabilities produced by a deep neural network, so it can effectively capture the phonetic variation of dysarthric speech. Experimental evaluation on a database of several hundred words uttered by 30 speakers, consisting of 12 mildly dysarthric, 8 moderately dysarthric, and 10 control speakers, showed that our approach provides substantial improvement over conventional Gaussian mixture model and deep neural network based speech recognition systems.
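A minimal sketch of the KL-HMM scoring idea: each state holds a categorical distribution over phoneme classes, a frame's cost is the KL divergence between that distribution and the DNN posterior, and a small left-to-right Viterbi pass accumulates the costs. All dimensions and distributions below are synthetic.

# Minimal sketch of KL-HMM local scores plus a tiny left-to-right Viterbi.
import numpy as np

def kl(p, q, eps=1e-10):
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)))

n_states, n_phones, n_frames = 3, 5, 8
rng = np.random.default_rng(0)
state_cats = rng.dirichlet(np.ones(n_phones), size=n_states)   # per-state categoricals
posteriors = rng.dirichlet(np.ones(n_phones), size=n_frames)   # DNN phoneme posteriors

# Frame-level emission cost: KL(state distribution || frame posterior)
cost = np.array([[kl(state_cats[s], posteriors[t]) for t in range(n_frames)]
                 for s in range(n_states)])

# Left-to-right Viterbi over the costs (stay in a state or advance by one)
dp = np.full((n_states, n_frames), np.inf)
dp[0, 0] = cost[0, 0]
for t in range(1, n_frames):
    for s in range(n_states):
        prev = dp[s, t - 1] if s == 0 else min(dp[s, t - 1], dp[s - 1, t - 1])
        dp[s, t] = cost[s, t] + prev
print("best path cost:", dp[-1, -1])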
international conference on computers helping people with special needs | 2012
Myung Jong Kim; Hoirin Kim
This paper addresses the problem of assessing the speech intelligibility of patients with dysarthria, a motor speech disorder. Dysarthric speech exhibits spectral distortion caused by poor articulation. To characterize the distorted spectral information, several features related to phonetic quality are extracted. We then find the best feature subset, one that not only produces a small prediction error but also keeps the mutual dependency among features low. Finally, the selected features are linearly combined using a multiple regression model. Evaluation of the proposed method on a database of 94 patients with dysarthria demonstrates its effectiveness in predicting subjectively rated scores.
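A minimal sketch of the selection-then-regression idea, assuming a greedy correlation-based rule as a stand-in for the paper's exact criterion: pick features that correlate with the intelligibility score while staying weakly correlated with already-selected features, then fit a multiple linear regression. The data, thresholds, and number of selected features are synthetic.

# Minimal sketch: greedy low-redundancy feature selection + multiple regression.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(94, 12))                    # 94 speakers, 12 candidate features
y = X[:, 0] * 3 + X[:, 4] * 2 + rng.normal(scale=0.5, size=94)  # synthetic scores

selected = []
for _ in range(3):
    best, best_corr = None, 0.0
    for j in range(X.shape[1]):
        if j in selected:
            continue
        # Skip features strongly dependent on already-selected ones
        if any(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) > 0.6 for k in selected):
            continue
        c = abs(np.corrcoef(X[:, j], y)[0, 1])
        if c > best_corr:
            best, best_corr = j, c
    if best is not None:
        selected.append(best)

model = LinearRegression().fit(X[:, selected], y)
print("selected features:", selected, "R^2:", round(model.score(X[:, selected], y), 3))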
IEEE Transactions on Audio, Speech, and Language Processing | 2017
Myung Jong Kim; Beiming Cao; Ted Mau; Jun Wang
Silent speech recognition (SSR) converts non-audio information such as articulatory movements into text. SSR has the potential to enable persons with laryngectomy to communicate through natural spoken expression. Current SSR systems have largely relied on speaker-dependent recognition models. The high degree of variability in articulatory patterns across speakers has been a barrier to developing effective speaker-independent SSR approaches, yet such approaches are critical for reducing the amount of training data required from each speaker. In this paper, we investigate speaker-independent SSR from the movements of flesh points on the tongue and lips, using articulatory normalization methods that reduce inter-speaker variation. To minimize across-speaker physiological differences of the articulators, we propose Procrustes matching-based articulatory normalization, which removes locational, rotational, and scaling differences. To further normalize the articulatory data, we apply feature-space maximum likelihood linear regression and i-vectors. We adopt a bidirectional long short-term memory recurrent neural network (BLSTM) as the articulatory model to effectively capture articulatory movements with long-range history. A silent speech dataset with flesh-point articulatory movements was collected using an electromagnetic articulograph from 12 healthy and 2 laryngectomized English speakers. Experimental results showed the effectiveness of our speaker-independent SSR approaches on healthy as well as laryngectomized speakers. In addition, the BLSTM outperformed the standard deep neural network, and the best performance was obtained by the BLSTM with all three normalization approaches combined.
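The Procrustes-matching normalization can be sketched as follows, with scipy's procrustes routine standing in for the paper's exact procedure and synthetic 2-D flesh-point coordinates as data: translational, rotational, and scaling differences between a speaker and a reference shape are removed before recognition.

# Minimal sketch of Procrustes-based articulatory normalization on synthetic shapes.
import numpy as np
from scipy.spatial import procrustes

rng = np.random.default_rng(0)
reference = rng.normal(size=(6, 2))              # 6 flesh points (tongue/lips), 2-D

# Simulate another speaker: same shape but rotated, scaled, and shifted
theta = np.deg2rad(20)
R = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
speaker = 1.4 * reference @ R.T + np.array([3.0, -2.0]) + rng.normal(scale=0.01, size=(6, 2))

ref_std, speaker_norm, disparity = procrustes(reference, speaker)
print("residual disparity after normalization:", round(disparity, 5))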
acm multimedia | 2010
Myung Jong Kim; Younggwan Kim; JaeDeok Lim; Hoirin Kim
This paper addresses the problem of recognizing malicious sounds, such as sexual screams or moans, to detect and block objectionable multimedia content. Malicious sounds show distinctive characteristics, namely large temporal variations and fast spectral transitions, so extracting features that properly represent these characteristics is important for achieving better performance. In this paper, we employ segment-based two-dimensional Mel-frequency cepstral coefficients and histograms of gradient directions as a feature set to characterize both the temporal variations and spectral transitions within a long-range segment of the target signal. A Gaussian mixture model (GMM) is adopted to statistically represent malicious and non-malicious sounds, and test sounds are classified by a maximum a posteriori (MAP) decision rule. Evaluation of the proposed feature extraction method on a database of several hundred malicious and non-malicious sound clips yielded a precision of 91.31% and a recall of 94.27%, suggesting that this approach could be used as an alternative to image-based methods.
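One common reading of the segment-based two-dimensional cepstral feature is a 2-D DCT over a long-range block of log-mel frames, keeping low-order coefficients to summarize joint spectro-temporal dynamics; the block size and number of retained coefficients below are assumptions for illustration, not the paper's settings.

# Minimal sketch of a segment-level 2-D cepstral descriptor via a 2-D DCT.
import numpy as np
from scipy.fft import dctn

rng = np.random.default_rng(0)
log_mel_segment = rng.normal(size=(40, 100))     # 40 mel bands x 100 frames (~1 s)

coeffs = dctn(log_mel_segment, norm="ortho")     # 2-D DCT over frequency and time
segment_feature = coeffs[:8, :8].ravel()         # keep the low-order 8x8 block
print(segment_feature.shape)                     # (64,)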
conference of the international speech communication association | 2018
Beiming Cao; Myung Jong Kim; Jun R. Wang; Jan P. H. van Santen; Ted Mau; Jun Wang
Articulation-to-speech (ATS) synthesis generates audio waveforms directly from articulatory information. Current work in ATS has used only articulatory movement information (spatial coordinates). The orientation information of articulatory flesh points has rarely been used, although some devices (e.g., electromagnetic articulography) provide it, and previous work indicated that orientation information carries significant information for speech production. In this paper, we explored the effect of adding the orientation information of flesh points on the articulators (i.e., tongue, lips, and jaw) in ATS. Experiments using articulatory movement information with or without orientation information were conducted using standard deep neural networks (DNNs) and long short-term memory recurrent neural networks (LSTM-RNNs). Both objective and subjective evaluations indicated that adding the orientation information of flesh points on the articulators to the movement information generated higher-quality speech output than using movement information alone.
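A minimal sketch of an articulation-to-speech mapping with an LSTM, where input frames are flesh-point positions optionally concatenated with orientation features and output frames are acoustic parameters for a vocoder; the dimensions, network size, and loss are illustrative assumptions, not the paper's setup.

# Minimal sketch of an LSTM-based articulation-to-speech regression on random data.
import torch
import torch.nn as nn

pos_dim, orient_dim, acoustic_dim = 12, 12, 25    # e.g., 4 sensors x 3 coords / angles

class ATSNet(nn.Module):
    def __init__(self, in_dim, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, acoustic_dim)
    def forward(self, x):
        h, _ = self.lstm(x)
        return self.out(h)

# With orientation: concatenate positions and orientations along the feature axis
seq_len, batch = 200, 4
positions = torch.randn(batch, seq_len, pos_dim)
orientations = torch.randn(batch, seq_len, orient_dim)
x = torch.cat([positions, orientations], dim=-1)

model = ATSNet(pos_dim + orient_dim)
target = torch.randn(batch, seq_len, acoustic_dim)
loss = nn.MSELoss()(model(x), target)
loss.backward()
print("training loss on random data:", float(loss))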
international conference on acoustics, speech, and signal processing | 2017
Jun Wang; Myung Jong Kim; Angel W. Hernandez-Mulero; Daragh Heitzman; Paul Ferrari
Patients with locked-in syndrome (fully paralyzed but aware) struggle with daily life and communication. Providing even a basic level of communication offers these patients a chance to resume a meaningful life. Current brain-computer interface (BCI) communication requires users to build words from single letters selected on a screen, which is extremely inefficient, so faster approaches to speech communication are highly needed. This project investigated the possibility of decoding spoken phrases from non-invasive brain activity (MEG) signals. This direct brain-to-text mapping approach may provide a significantly faster communication rate than current BCIs. We used dynamic time warping and Wiener filtering for noise reduction, and then a Gaussian mixture model and an artificial neural network as decoders. Preliminary results showed the possibility of decoding speech production from non-invasive brain signals; the best phrase classification accuracy was up to 94.54% from single-trial whole-head MEG recordings.
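The phrase-decoding step can be illustrated with a small dynamic-time-warping sketch: each candidate phrase has a template signal, and a test trial is assigned to the phrase whose template it warps to most cheaply. The synthetic sinusoids below stand in for denoised single-channel MEG recordings; the paper's actual decoders are a GMM and an artificial neural network.

# Minimal sketch of DTW-based template matching for phrase classification.
import numpy as np

def dtw_distance(a, b):
    n, m = len(a), len(b)
    dp = np.full((n + 1, m + 1), np.inf)
    dp[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            dp[i, j] = cost + min(dp[i - 1, j], dp[i, j - 1], dp[i - 1, j - 1])
    return dp[n, m]

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 200)
templates = {"phrase_a": np.sin(2 * np.pi * 3 * t), "phrase_b": np.sin(2 * np.pi * 7 * t)}

# A "trial": phrase_a produced slightly slower, with additive noise
trial = np.sin(2 * np.pi * 3 * np.linspace(0, 0.9, 200)) + rng.normal(scale=0.1, size=200)

scores = {name: dtw_distance(trial, tpl) for name, tpl in templates.items()}
print(min(scores, key=scores.get))                # expected: phrase_a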