Dongsuk Yook
Korea University
Publication
Featured research published by Dongsuk Yook.
IEEE Transactions on Consumer Electronics | 2008
In Chul Yoo; Dongsuk Yook
We present a wearable sound recognition system to assist the hearing impaired. Traditionally, hearing aid dogs are specially trained to facilitate the daily life of the hearing impaired. However, since training hearing aid dogs is costly and time-consuming, it would be desirable to replace them with an automatic sound recognition system using speech recognition technologies. Because the sound recognition system will be used in home environments with high levels of background noise and reverberation, conventional speech recognition techniques are not directly applicable, since their performance drops off rapidly in these environments. In this paper, we introduce a new sound recognition algorithm that is optimized for mechanical sounds such as doorbells. The new algorithm uses a new distance measure called the normalized peak domination ratio (NPDR), which is based on the characteristic spectral peaks of these sounds. The proposed algorithm showed a sound recognition accuracy of 99.7% and a noise rejection accuracy of 99.7%.
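The abstract does not spell out the NPDR formula, but the idea of scoring a frame by how much of its spectral energy falls at a target sound's characteristic peak bins can be sketched as follows; the bin positions, threshold, and function names are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

def npdr_score(frame, peak_bins, n_fft=512):
    """Hypothetical normalized peak domination ratio (NPDR) sketch.

    Measures how strongly the spectrum of `frame` is dominated by the
    characteristic peak bins of a target mechanical sound (e.g., a doorbell).
    The exact definition in the paper may differ; this only illustrates
    the idea of a peak-energy-to-total-energy ratio.
    """
    spectrum = np.abs(np.fft.rfft(frame, n_fft)) ** 2
    peak_energy = spectrum[peak_bins].sum()
    total_energy = spectrum.sum() + 1e-12   # avoid division by zero
    return peak_energy / total_energy

# Example: decide "doorbell" if the score exceeds a tuned threshold.
frame = np.random.randn(512)                # stand-in for one audio frame
doorbell_peaks = np.array([40, 80, 120])    # illustrative peak bins
is_doorbell = npdr_score(frame, doorbell_peaks) > 0.6
```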
IEEE Transactions on Consumer Electronics | 2009
Youngkyu Cho; Dongsuk Yook; Sukmoon Chang; Hyunsoo Kim
Sound source localization (SSL) is a major function of robot auditory systems for intelligent home robots. The steered response power-phase transform (SRP-PHAT) is a widely used method for robust SSL. However, it is too slow to run in real time, since SRP-PHAT searches a large number of candidate sound source locations. This paper proposes a search space clustering method designed to speed up the SRP-PHAT based sound source localization algorithm for intelligent home robots equipped with small-scale microphone arrays. The proposed method reduces the number of candidate sound source locations by 30.6% and achieves 46.7% error reduction compared to conventional methods.
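For context, a minimal sketch of the exhaustive SRP-PHAT search that the clustering method accelerates is shown below; the microphone geometry, FFT size, and function names are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def gcc_phat(x, y, n_fft=1024):
    """GCC-PHAT cross-correlation between two microphone signals."""
    X = np.fft.rfft(x, n_fft)
    Y = np.fft.rfft(y, n_fft)
    cross = X * np.conj(Y)
    cross /= np.abs(cross) + 1e-12          # PHAT weighting
    return np.fft.irfft(cross, n_fft)

def srp_phat(signals, mic_pos, candidates, fs=16000, c=343.0):
    """Exhaustive SRP-PHAT baseline: sum the GCC-PHAT values at the TDOA
    implied by each candidate location and keep the maximum. The paper's
    contribution is to prune this candidate set by clustering."""
    n_mics = len(signals)
    powers = np.zeros(len(candidates))
    for i in range(n_mics):
        for j in range(i + 1, n_mics):
            cc = gcc_phat(signals[i], signals[j])
            for k, loc in enumerate(candidates):
                tdoa = (np.linalg.norm(loc - mic_pos[i]) -
                        np.linalg.norm(loc - mic_pos[j])) / c
                lag = int(round(tdoa * fs)) % len(cc)   # wrap negative lags
                powers[k] += cc[lag]
    return candidates[np.argmax(powers)]
```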
IEEE Transactions on Consumer Electronics | 2009
Hyeopwoo Lee; Sukmoon Chang; Dongsuk Yook; Yong-Serk Kim
Voice activity detection plays an important role in providing an efficient voice interface between humans and mobile devices, since it can be used as a trigger to activate the automatic speech recognition module of a mobile device. If the input speech signal can be recognized as a predefined magic word spoken by a legitimate user, it can be utilized as a trigger. In this paper, we propose a voice trigger system using a keyword-dependent speaker recognition technique. The voice trigger must be able to perform keyword recognition, as well as speaker recognition, without using computationally demanding speech recognizers, so that it can properly trigger a mobile device with low computational power consumption. We propose a template based method and a hidden Markov model (HMM) based method for the voice trigger to solve this problem. Experiments using a Korean word corpus show that the template based method ran 4.1 times faster than the HMM based method, while the HMM based method reduced the recognition error by 27.8% relative to the template based method. The proposed methods are complementary and can be used selectively depending on the device of interest.
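One common way to realize a template based trigger like this is dynamic time warping (DTW) over acoustic feature sequences; the sketch below assumes MFCC-like feature matrices and an illustrative threshold, and is not taken from the paper.

```python
import numpy as np

def dtw_distance(template, utterance):
    """Dynamic time warping distance between two feature sequences
    (frames x dims). A small DTW like this is one way to realize the
    template based trigger; the paper's exact setup may differ."""
    n, m = len(template), len(utterance)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(template[i - 1] - utterance[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m] / (n + m)

def is_trigger(enrolled_templates, features, threshold=1.0):
    """Fire only when the input is close to an enrolled magic-word template,
    which implicitly checks both the keyword and the enrolled speaker."""
    return min(dtw_distance(t, features) for t in enrolled_templates) < threshold
```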
pacific rim international conference on artificial intelligence | 2002
Soonkyu Lee; Dongsuk Yook
We describe audio-to-visual conversion techniques for efficient multimedia communications. The audio signals are automatically converted to visual images of mouth shape. The visual speech can be represented as a sequence of visemes, which are the generic face images corresponding to particular sounds. Visual images synchronized with audio signals can provide a user-friendly interface for man-machine interaction. They can also be used to help people with impaired hearing. We use HMMs (hidden Markov models) to convert audio signals to a sequence of visemes. In this paper, we compare two approaches to using HMMs. In the first approach, an HMM is trained for each viseme, and the audio signals are directly recognized as a sequence of visemes. In the second approach, each phoneme is modeled with an HMM, and a general phoneme recognizer is utilized to produce a phoneme sequence from the audio signals. The phoneme sequence is then converted to a viseme sequence. We implemented the two approaches and tested them on the TIMIT speech corpus. The viseme recognizer shows a 33.9% error rate, and the phoneme-based approach exhibits a 29.7% viseme recognition error rate. When similar viseme classes are merged, we have found that the error rates can be reduced to 20.5% and 13.9%, respectively.
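The final step of the second approach, collapsing a recognized phoneme sequence into visemes, can be illustrated with a small mapping table; the viseme classes and phoneme symbols below are assumptions for illustration, not the paper's mapping.

```python
# Illustrative phoneme-to-viseme mapping (not the paper's exact table).
PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "iy": "spread_vowel", "ih": "spread_vowel",
    "aa": "open_vowel", "ae": "open_vowel",
    "uw": "rounded_vowel", "ow": "rounded_vowel",
}

def phonemes_to_visemes(phoneme_sequence):
    """Map a phoneme sequence from the recognizer to a viseme sequence,
    merging consecutive duplicates so each mouth shape appears once."""
    visemes = []
    for ph in phoneme_sequence:
        v = PHONEME_TO_VISEME.get(ph, "neutral")
        if not visemes or visemes[-1] != v:
            visemes.append(v)
    return visemes

print(phonemes_to_visemes(["m", "aa", "m", "iy"]))
# ['bilabial', 'open_vowel', 'bilabial', 'spread_vowel']
```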
IEEE Transactions on Audio, Speech, and Language Processing | 2015
In Chul Yoo; Hyeontaek Lim; Dongsuk Yook
Voice activity detection (VAD) can be used to distinguish human speech from other sounds, and various applications can benefit from VAD, including speech coding and speech recognition. To accurately detect voice activity, the algorithm must take into account the characteristic features of human speech and/or background noise. In many real-life applications, noise frequently occurs in an unexpected manner, and in such situations it is difficult to determine the characteristics of the noise with sufficient accuracy. As a result, robust VAD algorithms that depend less on making correct noise estimates are desirable for real-life applications. Formants are the major spectral peaks of the human voice and are highly useful for distinguishing vowel sounds. These spectral peaks are likely to survive in a signal even after severe corruption by noise, which makes formants attractive features for voice activity detection under low signal-to-noise ratio (SNR) conditions. However, it is difficult to accurately extract formants from noisy signals when background noise introduces unrelated spectral peaks. Therefore, this paper proposes a simple formant-based VAD algorithm to overcome the problem of detecting formants under severe noise conditions. The proposed method achieves a much faster processing time and outperforms standard VAD algorithms under various noise conditions. The proposed method is robust against various types of noise and imposes a light computational load, so it is suitable for use in various applications.
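A minimal sketch of the underlying idea, declaring speech when a prominent peak survives in the typical formant band, might look like the following; the band limits, window, and prominence threshold are assumptions, not the paper's decision rule.

```python
import numpy as np

def formant_peak_vad(frame, fs=16000, n_fft=512, prominence_db=10.0):
    """Toy formant-peak VAD: mark a frame as speech if the strongest
    spectral peak inside the typical formant band (roughly 300-3500 Hz)
    stands well above the band's median level. This sketches the idea of
    relying on spectral peaks that survive noise, not the paper's rule."""
    windowed = frame * np.hanning(len(frame))
    spectrum_db = 20 * np.log10(np.abs(np.fft.rfft(windowed, n_fft)) + 1e-12)
    freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)
    band = (freqs >= 300) & (freqs <= 3500)
    peak = spectrum_db[band].max()
    floor = np.median(spectrum_db[band])
    return (peak - floor) > prominence_db
```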
IEEE Signal Processing Letters | 2007
Dong-Hyun Kim; Dongsuk Yook
This paper presents a transformation-based rapid adaptation technique for robust speech recognition using a linear spectral transformation (LST) and a maximum mutual information (MMI) criterion. Previously, a maximum likelihood linear spectral transformation (ML-LST) algorithm was proposed for fast adaptation in unknown environments. Since the MMI estimation method does not require evenly distributed training data and increases the a posteriori probability of the word sequences of the training data, we combine the linear spectral transformation method and the MMI estimation technique in order to achieve extremely rapid adaptation using only one word of adaptation data. The proposed algorithm, called MMI-LST, was implemented using the extended Baum-Welch algorithm and phonetic lattices, and evaluated on the TIMIT and FFMTIMIT corpora. It provides a relative reduction in the speech recognition error rate of 11.1% using only 0.25 s of adaptation data.
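The sketch below only shows where a linear spectral transformation sits in the recognition front end; estimating the transform parameters with the MMI criterion via the extended Baum-Welch algorithm and phonetic lattices is the paper's contribution and is not reproduced here. Array shapes and names are illustrative assumptions.

```python
import numpy as np

def apply_lst(spectral_frames, A, b):
    """Apply a linear spectral transformation x' = A x + b to each
    spectral (or log-filterbank) frame before recognition. How A and b
    are estimated (ML in ML-LST, MMI in MMI-LST) is outside this sketch."""
    return spectral_frames @ A.T + b

# Illustrative usage: adapt 24-band log-filterbank features.
frames = np.random.randn(100, 24)       # stand-in adaptation data
A = np.eye(24)                          # identity = no adaptation yet
b = np.zeros(24)
adapted = apply_lst(frames, A, b)
```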
IEEE Transactions on Systems, Man, and Cybernetics | 2016
Dongsuk Yook; Taewoo Lee; Youngkyu Cho
Steered response power phase transform (SRP-PHAT) is a method that is widely used for robust sound source localization (SSL). However, since SRP-PHAT searches over a large number of candidate locations, it is too slow to run in real time for large-scale microphone array systems. In this paper, we propose a robust two-level search space clustering method to speed up SRP-PHAT-based SSL. The proposed method divides the candidate locations of the sound source into a set of groups and finds a small number of groups that are likely to contain the maximum power location. By searching only within this small number of groups, the computational costs are reduced by 61.8% compared to a previously proposed method without loss of accuracy.
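A coarse-to-fine search in this spirit can be sketched as follows; grouping candidates around randomly chosen centroids and keeping a fixed number of top groups are illustrative assumptions, not necessarily the paper's clustering scheme.

```python
import numpy as np

def two_level_srp_search(power_fn, candidates, n_groups=32, top_groups=3):
    """Coarse-to-fine SRP-PHAT search sketch. `power_fn(loc)` returns the
    steered response power at one candidate location; `candidates` is an
    (N, 3) array of positions."""
    rng = np.random.default_rng(0)
    centroids = candidates[rng.choice(len(candidates), n_groups, replace=False)]
    # Assign every candidate to its nearest centroid (the grouping step).
    labels = np.argmin(
        np.linalg.norm(candidates[:, None, :] - centroids[None, :, :], axis=2),
        axis=1)
    # Level 1: score only the group representatives.
    rep_scores = np.array([power_fn(c) for c in centroids])
    best_groups = np.argsort(rep_scores)[-top_groups:]
    # Level 2: full search inside the selected groups only.
    fine = candidates[np.isin(labels, best_groups)]
    fine_scores = np.array([power_fn(c) for c in fine])
    return fine[np.argmax(fine_scores)]
```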
intelligent data engineering and automated learning | 2002
Soonkyu Lee; Dongsuk Yook
Visual images synchronized with audio signals can provide a user-friendly interface for man-machine interaction. The visual speech can be represented as a sequence of visemes, which are the generic face images corresponding to particular sounds. We use HMMs (hidden Markov models) to convert audio signals to a sequence of visemes. In this paper, we compare two approaches to using HMMs. In the first approach, an HMM is trained for each triviseme, which is a viseme with its left and right context, and the audio signals are directly recognized as a sequence of trivisemes. In the second approach, each triphone is modeled with an HMM, and a general triphone recognizer is used to produce a triphone sequence from the audio signals. The triviseme or triphone sequence is then converted to a viseme sequence. The performances of the two viseme recognition systems are evaluated on the TIMIT speech corpus.
IEEE Transactions on Consumer Electronics | 2013
Sunhyung Lee; Dongsuk Yook; Sukmoon Chang
The conventional audio fingerprinting system by Haitsma uses a lookup table to identify candidate songs in a database containing the sub-fingerprints of the songs, and then searches the candidates to find the song with the lowest bit error rate. However, this approach has the drawback that the number of database accesses increases dramatically, especially when the database contains a large number of songs or when a matching sub-fingerprint is not found in the lookup table because of a heavily degraded input signal. In this paper, a novel search method is proposed to overcome these difficulties. The proposed method partitions each song found via the lookup table into blocks, assigns a weight to each block, and uses the weight as a search priority to speed up the search process while reducing the number of database accesses. Experimental results show a significant improvement in search speed while maintaining search accuracy comparable to the conventional method.
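A minimal sketch of the baseline lookup-table-plus-bit-error-rate search is given below; the vote-count ordering used as a search priority is only a stand-in for the paper's block weights, and all names and thresholds are assumptions.

```python
import numpy as np

def bit_error_rate(query_fp, db_fp):
    """Fraction of differing bits between two fingerprint segments,
    stored here as uint32 sub-fingerprint arrays of equal length."""
    xor = np.bitwise_xor(query_fp, db_fp)
    diff_bits = np.unpackbits(xor.view(np.uint8)).sum()
    return diff_bits / (32 * len(query_fp))

def search(query_fp, lookup_table, db_fingerprints, ber_threshold=0.35):
    """Haitsma-style search: each 32-bit sub-fingerprint of the query
    indexes the lookup table to fetch candidate (song, offset) pairs,
    which are then verified by bit error rate. The paper additionally
    partitions songs into weighted blocks and searches high-weight blocks
    first; here, vote counts from the lookup table play that role."""
    votes = {}
    for pos, sub_fp in enumerate(query_fp):
        for song_id, offset in lookup_table.get(int(sub_fp), []):
            key = (song_id, offset - pos)
            votes[key] = votes.get(key, 0) + 1
    for (song_id, start), _ in sorted(votes.items(), key=lambda kv: -kv[1]):
        if start < 0:
            continue
        db_fp = db_fingerprints[song_id][start:start + len(query_fp)]
        if len(db_fp) == len(query_fp) and bit_error_rate(query_fp, db_fp) < ber_threshold:
            return song_id
    return None
```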
IEEE Transactions on Consumer Electronics | 2009
Hyeopwoo Lee; Dongsuk Yook
When speech-based interfaces are used on small handheld devices such as cellular phones and personal digital assistants in mobile environments with unknown noises and surrounding talkers, all signals except the legitimate user's voice must be rejected by the system as noise. This paper proposes a new algorithm that detects the user's voice in the spatial and temporal domains using directional and spectral information. It rejects undesirable signals that originate from noise sources or surrounding talkers. Experimental results indicate that the proposed algorithm reduces the voice activity detection error rate by 34.3% relative to conventional methods.
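A toy gate that combines a directional check with a spectral check, in the spirit of the proposed spatial and spectral detection, might look like the following; the two-microphone setup, thresholds, and band limits are assumptions for illustration, not the paper's method.

```python
import numpy as np

def spatial_spectral_vad(frame_left, frame_right, fs=16000,
                         max_user_tdoa=1e-4, energy_threshold=1e-3):
    """Toy two-stage gate: accept a frame only if (1) its estimated TDOA
    between two microphones is small enough to lie in the user's expected
    direction, and (2) its energy in the speech band is high enough."""
    # Spatial cue: TDOA from plain cross-correlation.
    cc = np.correlate(frame_left, frame_right, mode="full")
    lag = np.argmax(cc) - (len(frame_right) - 1)
    tdoa = lag / fs
    in_user_direction = abs(tdoa) <= max_user_tdoa

    # Spectral cue: mean energy between 300 Hz and 3400 Hz.
    spectrum = np.abs(np.fft.rfft(frame_left)) ** 2
    freqs = np.fft.rfftfreq(len(frame_left), 1.0 / fs)
    speech_energy = spectrum[(freqs >= 300) & (freqs <= 3400)].mean()
    return in_user_direction and speech_energy > energy_threshold
```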