
Publication


Featured research published by Ki-Seung Lee.


IEEE Transactions on Biomedical Engineering | 2008

EMG-Based Speech Recognition Using Hidden Markov Models With Global Control Variables

Ki-Seung Lee

It is well known that a strong relationship exists between human voices and the movement of articulatory facial muscles. In this paper, we utilize this knowledge to implement an automatic speech recognition scheme which uses solely surface electromyogram (EMG) signals. The sequence of EMG signals for each word is modelled by a hidden Markov model (HMM) framework. The main objective of the work involves building a model for the state observation density when multichannel observation sequences are given. The proposed model reflects the dependencies between each of the EMG signals, which are described by introducing a global control variable. We also develop an efficient model training method, based on a maximum likelihood criterion. In a preliminary study, 60 isolated words were used as recognition variables. EMG signals were acquired from three articulatory facial muscles. The findings indicate that such a system may have the capacity to recognize speech signals with an accuracy of up to 87.07%, superior to that of the independent probabilistic model.
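The coupling between EMG channels described above can be sketched as follows. This is an illustrative simplification, not the paper's exact model: a single HMM state whose channels are conditionally independent given a discrete global control variable, so that marginalizing over that variable couples them. All names and shapes are assumptions.

```python
import numpy as np

def state_likelihood(obs, means, variances, control_prior):
    """Observation density for one HMM state with C EMG channels.

    Channels are conditionally independent given a discrete global
    control variable g; marginalizing over g introduces dependency.
    obs:           (C,) one scalar feature per channel
    means:         (G, C) per-control-value, per-channel means
    variances:     (G, C) per-control-value, per-channel variances
    control_prior: (G,) prior P(g)
    """
    # Gaussian log-density of each channel under each control value
    log_norm = -0.5 * (np.log(2 * np.pi * variances)
                       + (obs - means) ** 2 / variances)      # (G, C)
    # channels independent given g -> sum channel log-densities
    log_joint = log_norm.sum(axis=1) + np.log(control_prior)  # (G,)
    # marginalize over the global control variable
    return float(np.exp(log_joint).sum())
```

With a single control value (G = 1) this collapses to the independent probabilistic model the paper compares against; with G > 1 the mixture over g captures cross-channel dependency.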


IEEE Transactions on Audio, Speech, and Language Processing | 2007

Statistical Approach for Voice Personality Transformation

Ki-Seung Lee

A voice transformation method which changes the source speaker's utterances so as to sound similar to those of a target speaker is described. Speaker individuality transformation is achieved by altering the LPC cepstrum, the average pitch period and the average speaking rate. The main objective of the work involves building a nonlinear relationship between the parameters for the acoustical features of two speakers, based on a probabilistic model. The conversion rules involve probabilistic classification and a cross-correlation probability between the acoustic features of the two speakers. The parameters of the conversion rules are estimated by maximum likelihood over the training data. To obtain transformed speech signals which are perceptually closer to the target speaker's voice, prosody modification is also involved. Prosody modification is achieved by scaling the excitation spectrum and by time-scale modification with appropriate modification factors. An evaluation by objective tests and informal listening tests clearly indicated the effectiveness of the proposed transformation method. We also confirmed that the proposed method leads to smoothly evolving spectral contours over time, which, from a perceptual standpoint, produced results that were superior to conventional vector quantization (VQ)-based methods.
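A frame-level conversion of the kind this family of statistical methods uses can be sketched as below: a soft probabilistic classification of the source frame followed by a posterior-weighted linear map into the target speaker's feature space. The parameter names and the diagonal-Gaussian form are assumptions for illustration, not the paper's exact rules.

```python
import numpy as np

def convert_frame(x, weights, src_means, src_vars, offsets, slopes):
    """Convert one source feature frame x (shape (D,)) to the target space.

    Each of the M classes has a soft posterior p(m|x); the converted
    vector is the posterior-weighted sum of per-class linear maps.
    weights:   (M,) class priors
    src_means: (M, D) source-space class means
    src_vars:  (M, D) source-space class variances (diagonal)
    offsets:   (M, D) per-class map offsets (illustrative)
    slopes:    (M, D) per-class map slopes (illustrative)
    """
    # class-conditional log-likelihoods under diagonal Gaussians
    ll = -0.5 * np.sum(np.log(2 * np.pi * src_vars)
                       + (x - src_means) ** 2 / src_vars, axis=1)
    ll += np.log(weights)
    post = np.exp(ll - ll.max())
    post /= post.sum()                   # p(m | x), soft classification
    mapped = offsets + slopes * x        # (M, D) per-class linear maps
    return post @ mapped                 # posterior-weighted combination
```

Because the class assignment is soft rather than a hard VQ codebook lookup, consecutive frames that drift between classes are mapped through a blend of maps, which is what yields the smoothly evolving spectral contours mentioned above.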


IEEE Transactions on Audio, Speech, and Language Processing | 2006

MLP-based phone boundary refining for a TTS database

Ki-Seung Lee

The automatic labeling of a large speech corpus plays an important role in the development of a high-quality Text-To-Speech (TTS) synthesis system. This paper describes a method for the automatic labeling of speech signals, which mainly involves the construction of a large database for a TTS synthesis system. The main objective of the work involves the refinement of an initial estimation of phone boundaries which are provided by an alignment, based on a Hidden Markov Model. A multilayer perceptron (MLP) was employed to refine the phone boundaries. To increase the accuracy of phoneme segmentation, several specialized MLPs were individually trained based on phonetic transition. The optimum partitioning of the entire phonetic transition space and the corresponding MLPs were constructed from the standpoint of minimizing the overall deviation from the hand-labeling position. The experimental results showed that more than 93% of all phone boundaries have a boundary deviation from a reference position smaller than 20 ms. We also confirmed that the database constructed using the proposed method produced results that were perceptually comparable to a hand-labeled database, based on subjective listening tests.
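The refinement step described above can be sketched as a local search around the HMM-aligned boundary, with a trained MLP scoring each candidate frame. The window construction, search width and scorer interface here are assumptions for illustration, not the paper's configuration.

```python
import numpy as np

def refine_boundary(frames, init_idx, mlp_score, search=5):
    """Refine an HMM-aligned phone boundary with an MLP scorer.

    frames:    (T, D) acoustic feature frames
    init_idx:  initial boundary frame from the HMM alignment
    mlp_score: callable mapping a context window to a boundary score
               (stands in for the trained, transition-specific MLP)
    search:    half-width of the search region, in frames
    Returns the frame in [init_idx - search, init_idx + search] that
    the MLP rates most boundary-like.
    """
    lo = max(1, init_idx - search)
    hi = min(len(frames) - 1, init_idx + search)
    candidates = range(lo, hi + 1)
    # score each candidate on a two-frame context window
    scores = [mlp_score(np.concatenate([frames[t - 1], frames[t]]))
              for t in candidates]
    return lo + int(np.argmax(scores))
```

Training one such scorer per phonetic-transition class, as the paper does, simply means selecting which `mlp_score` to call based on the phone pair at the boundary.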


Speech Communication | 2002

A segmental speech coder based on a concatenative TTS

Ki-Seung Lee; Richard V. Cox

An extremely low bit rate speech coder based on a recognition/synthesis paradigm is proposed. In our speech coder, the speech signal is produced in a way which is similar to concatenative speech synthesis in text-to-speech (TTS). Hence, database construction, unit selection and prosody modification, which are the major parts of concatenative TTS, are employed to implement the speech coder. The synthesis units are automatically found in a large database using a joint segmentation/classification scheme. Dynamic programming (DP) is applied to unit selection, in which two cost functions, an acoustic target cost and a concatenation cost, are used to increase naturalness as well as intelligibility. Prosodic differences between the selected unit and the input segment are compensated for by time-scale and pitch modifications which are based on the harmonic plus noise model (HNM) framework. In single speaker tests, the proposed scheme gave intelligible and natural sounding speech at an average bit rate of about 580 b/s.
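The DP unit selection described above is a Viterbi search over candidate units, trading off the target cost of each unit against the concatenation cost of joining adjacent units. A minimal sketch, assuming precomputed cost tables (the actual costs in the paper are acoustic measures, not given here):

```python
import numpy as np

def select_units(target_costs, concat_costs):
    """Viterbi unit selection over T segments and U candidate units.

    target_costs: (T, U) cost of assigning unit u to segment t
    concat_costs: (U, U) cost of joining unit u after unit v
    Returns the minimum-total-cost unit index sequence (length T).
    """
    T, U = target_costs.shape
    cost = target_costs[0].copy()        # best cost ending in each unit
    back = np.zeros((T, U), dtype=int)   # backpointers
    for t in range(1, T):
        # cost of every (previous unit v, current unit u) transition
        trans = cost[:, None] + concat_costs        # (U, U)
        back[t] = trans.argmin(axis=0)
        cost = trans.min(axis=0) + target_costs[t]
    # trace back the optimal path
    path = [int(cost.argmin())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

At very low bit rates only the chosen unit indices (plus coarse prosody parameters) need to be transmitted, which is what pushes the rate down to a few hundred bits per second.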


IEEE Transactions on Consumer Electronics | 2010

A real-time audio system for adjusting the sweet spot to the listener's position

Ki-Seung Lee; Seok-Pil Lee

In the present study, a new stereophonic playback system was proposed, in which the crosstalk signals are cancelled at an arbitrary listener position. The system was composed of two major parts: the listener position tracking part and the sound rendering part. The position of the listener was estimated using acoustic signals from the listener (i.e. voice or hand-clapping signals). A direction of arrival (DOA) algorithm was adopted to estimate the directions of acoustic sources, where the room reverberation effects were taken into consideration. A crosstalk cancellation filter was designed using a free-field model. To determine the maximum tolerable shift of the listener position, a quantitative analysis of the channel separation ratio according to the displacement of the listener position was performed. Prototype hardware was implemented using a microprocessor board, a DSP board, a multi-channel ADC board and an analog front end. The results showed that the average mean square error between the true direction of a listener and the estimated direction was about 5 degrees. More than 80% of the tested subjects indicated that better stereo images were obtained by the proposed system, compared with the non-processed signals.
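A common two-microphone DOA estimator of the kind such a tracking part could use is GCC-PHAT: cross-correlate the phase-normalized spectra, find the dominant delay, and convert it to an arrival angle. This is a generic sketch under a free-field assumption, not the paper's reverberation-aware algorithm; all parameter values below are illustrative.

```python
import numpy as np

def estimate_doa(x1, x2, fs, mic_dist, c=343.0):
    """Two-microphone DOA estimate (degrees, 0 = broadside) via GCC-PHAT.

    x1, x2:   time-domain signals from the two microphones
    fs:       sample rate in Hz
    mic_dist: microphone spacing in metres
    c:        speed of sound in m/s
    """
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12          # PHAT weighting
    cc = np.fft.irfft(cross, n)
    # restrict to physically possible delays
    max_lag = int(fs * mic_dist / c)
    cc = np.concatenate([cc[-max_lag:], cc[:max_lag + 1]])
    lag = int(np.argmax(cc)) - max_lag
    tau = lag / fs                          # inter-mic time delay
    sin_theta = np.clip(tau * c / mic_dist, -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_theta)))
```

The PHAT weighting discards magnitude and keeps only phase, which is what makes this estimator comparatively robust to room reverberation.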


IEEE Transactions on Biomedical Engineering | 2008

SNR-Adaptive Stream Weighting for Audio-MES ASR

Ki-Seung Lee

Myoelectric signals (MESs) from the speaker's mouth region have been shown to improve the noise robustness of automatic speech recognizers (ASRs), thus promising to extend their usability in noisy conditions. In the recognition system presented herein, extracted audio and facial MES features were integrated by a decision fusion method, where the likelihood score of the audio-MES observation vector was given by a linear combination of the class-conditional observation log-likelihoods of two classifiers, using appropriate weights. We developed a weighting process adaptive to SNRs. The main objective of the paper involves determining the optimal SNR classification boundaries and constructing a set of optimum stream weights for each SNR class. These two parameters were determined by a method based on a maximum mutual information criterion. Acoustic and facial MES data were collected from five subjects, using a 60-word vocabulary. Four types of acoustic noise, including babble, car, aircraft, and white noise, were acoustically added to clean speech signals with SNR ranging from -14 to 31 dB. The classification accuracy of the audio ASR was as low as 25.5%, whereas that of the MES ASR was 85.2%. The classification accuracy could be further improved by employing the proposed audio-MES weighting method, reaching 89.4% in the case of babble noise. A similar result was also found for the other types of noise.
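The fusion rule itself is simple once the SNR class boundaries and per-class weights have been trained: look up the audio stream weight for the current SNR class and combine the two log-likelihood vectors. A minimal sketch, with illustrative boundaries and weights (the paper optimizes both under a maximum mutual information criterion):

```python
import numpy as np

def fuse_scores(audio_loglik, mes_loglik, snr_db, boundaries, weights):
    """SNR-adaptive decision fusion of audio and MES classifier scores.

    audio_loglik, mes_loglik: per-class log-likelihoods from the two
                              classifiers (same length)
    snr_db:     estimated SNR of the current utterance
    boundaries: ascending SNR thresholds separating the SNR classes
    weights:    audio stream weight per SNR class,
                len(weights) == len(boundaries) + 1
    Returns the fused per-class scores.
    """
    snr_class = int(np.searchsorted(boundaries, snr_db))
    w = weights[snr_class]                 # audio weight in [0, 1]
    return w * np.asarray(audio_loglik) + (1 - w) * np.asarray(mes_loglik)
```

At very low SNR the lookup yields a small audio weight, so the decision is dominated by the noise-immune MES stream; at high SNR the audio stream dominates.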


IEEE Transactions on Biomedical Engineering | 2010

Prediction of Acoustic Feature Parameters Using Myoelectric Signals

Ki-Seung Lee

It is well known that a clear relationship exists between human voices and myoelectric signals (MESs) from the area of the speaker's mouth. In this study, we utilized this information to implement a speech synthesis scheme in which MES alone was used to predict the parameters characterizing the vocal-tract transfer function of specific speech signals. Several feature parameters derived from MES were investigated to find the optimal feature for maximization of the mutual information between the acoustic and the MES features. After the optimal feature was determined, an estimation rule for the acoustic parameters was proposed, based on a minimum mean square error (MMSE) criterion. In a preliminary study, 60 isolated words were used for both objective and subjective evaluations. The results showed that the average Euclidean distance between the original and predicted acoustic parameters was reduced by about 30% compared with the average Euclidean distance of the original parameters. The intelligibility of the synthesized speech signals using the predicted features was also evaluated. A word-level identification ratio of 65.5% and a syllable-level identification ratio of 73% were obtained through a listening test.
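The feature-selection criterion above, maximizing mutual information between acoustic and MES features, can be illustrated with the discrete definition of mutual information; in practice the paper's features are continuous and the estimate would come from quantized or model-based densities. A minimal sketch over a joint probability table:

```python
import numpy as np

def mutual_information(joint):
    """Mutual information I(X;Y) in bits from a joint pmf table.

    joint: 2-D array of joint probabilities p(x, y); normalized here
           so an unnormalized count table also works.
    """
    joint = np.asarray(joint, dtype=float)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)   # marginal p(x)
    py = joint.sum(axis=0, keepdims=True)   # marginal p(y)
    nz = joint > 0                          # skip zero cells (0 log 0 = 0)
    return float((joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum())
```

Candidate MES features would each be scored this way against the acoustic parameters, and the feature with the highest mutual information retained for the MMSE estimator.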


international conference on acoustics, speech, and signal processing | 2003

Context-adaptive phone boundary refining for a TTS database

Ki-Seung Lee; Jeong-Su Kim

A method for the automatic segmentation of speech signals is described. The method is dedicated to the construction of a large database for a Text-To-Speech (TTS) synthesis system. The main issue of the work involves the refinement of an initial estimation of phone boundaries which are provided by an alignment, based on a Hidden Markov Model (HMM). A multilayer perceptron (MLP) was used as a phone boundary detector. To increase the performance of segmentation, a technique which individually trains an MLP according to phonetic transition is proposed. The optimum partitioning of the entire phonetic transition space is constructed from the standpoint of minimizing the overall deviation from hand-labeled positions. With single speaker stimuli, the experimental results showed that more than 95% of all phone boundaries have a boundary deviation from the reference position smaller than 20 ms, and the refinement of the boundaries reduces the root mean square error by about 25%.


IEEE Transactions on Audio, Speech, and Language Processing | 2011

A Relevant Distance Criterion for Interpolation of Head-Related Transfer Functions

Ki-Seung Lee; Seok-Pil Lee

In binaural synthesis, in order to realize more precise and accurate spatial sound, it would be desirable to measure a large number of head-related transfer functions (HRTFs) in various directions. To reduce the size of the HRTF set, interpolation is often employed, whereby the HRTF for any direction can be obtained from a limited number of representative HRTFs. In this paper, it is determined which distortion measure for interpolation of the HRTFs in the horizontal plane is most suitable for predicting audible differences in sound location. Four kinds of HRTF sets, measured using three human heads and one mannequin (KEMAR), were prepared for this study. Using various objective distortion criteria, the differences between interpolated and measured HRTFs were computed. These were then related to the results from the listening tests through receiver operating characteristic (ROC) curves. The results of the present study indicated that for the HRTF sets measured from three human heads, the best predictor of performance was obtained using the distortion measurement computed from the mel-cepstral coefficients, whereas the distortion measurement associated with interaural time delay predicted audible differences in sound location reasonably well for the KEMAR HRTF set. A feasibility test was conducted to verify the usefulness of the selected distortion measurement.
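The simplest member of the interpolation family discussed above is linear interpolation of the magnitude responses between two measured azimuths, and the simplest objective criterion is a log-spectral distortion between the interpolated and measured responses. Both sketches below are illustrative baselines, not the paper's preferred mel-cepstral or interaural-time-delay measures.

```python
import numpy as np

def interpolate_hrtf(h1, h2, az1, az2, az):
    """Linearly interpolate HRTF magnitude responses measured at
    azimuths az1 and az2 to an intermediate azimuth az."""
    w = (az - az1) / (az2 - az1)
    return (1 - w) * np.asarray(h1, float) + w * np.asarray(h2, float)

def log_spectral_distortion(h_ref, h_int):
    """RMS log-magnitude distortion in dB between a measured reference
    response and an interpolated one (a generic objective criterion)."""
    d = 20.0 * np.log10(np.abs(h_ref) / np.abs(h_int))
    return float(np.sqrt(np.mean(d ** 2)))
```

The paper's question is precisely which such objective number tracks the listening-test outcomes best; for the human-head sets it was a mel-cepstral measure rather than this plain log-spectral one.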


Speech Communication | 2014

A unit selection approach for voice transformation

Ki-Seung Lee

A voice transformation (VT) method that can make the utterance of a source speaker mimic that of a target speaker is described. Speaker individuality transformation is achieved by altering four feature parameters, which include the linear prediction coefficients cepstrum (LPCC), Δ LPCC, LP-residual and pitch period. The main objective of this study involves construction of an optimal sequence of features selected from a target speaker’s database, to maximize both the correlation probabilities between the transformed and the source features and the likelihood of the transformed features with respect to the target model. A set of two-pass conversion rules is proposed: in the first pass, candidate feature parameters are selected from the database; in the second pass, the optimal sequence of the feature parameters is constructed. The conversion rules were developed using a statistical approach that employed a maximum likelihood criterion. In constructing an optimal sequence of the features, a hidden Markov model (HMM) with global control variables (GCV) was employed to find the most likely combination of the features with respect to the target speaker’s model. The effectiveness of the proposed transformation method was evaluated using objective tests and formal listening tests. We confirmed that the proposed method leads to perceptually more preferred results, compared with conventional methods.
