Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Soonil Kwon is active.

Publication


Featured research published by Soonil Kwon.


Multimedia Tools and Applications | 2018

Determining speaker attributes from stress-affected speech in emergency situations with hybrid SVM-DNN architecture

Jamil Ahmad; Muhammad Sajjad; Seungmin Rho; Soonil Kwon; Mi Young Lee; Sung Wook Baik

Of the millions of emergency reporting calls made each year, about a quarter are non-emergencies. To avoid responding to such situations, forensic examination of the reported situation, with speech as evidence, has become an indispensable requirement for emergency response centers. Caller profile information determined from emergency calls, such as gender, age, emotional state, transcript, and contextual sounds, can be highly beneficial for sophisticated forensic analysis. However, callers reporting emergencies often express emotional stress, which causes variations in speech production. Furthermore, low voice quality and background noise make it very difficult to recognize caller attributes efficiently in such unconstrained environments. To overcome the limitations of traditional classification systems in these situations, a hybrid two-stage classification scheme is proposed in this paper. Our framework consists of an ensemble of support vector machines (e-SVM) and a deep neural network (DNN) in a cascade. The first-stage e-SVM consists of two models discriminatively trained on normal and stressful speech from emergency calls. The deep neural network, forming the second stage of the classification pipeline, is utilized only in case of ambiguous prediction results from the first stage. The adaptive nature of this two-stage scheme helps achieve both efficiency and high performance. Experiments conducted with a large dataset affirm the suitability of the proposed architecture for efficient real-time speaker attribute recognition. The framework is evaluated for gender recognition from emergency calls in the presence of emotions and background noise, and yields significant performance improvements in comparison with other state-of-the-art gender recognition approaches.
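A minimal sketch of the two-stage idea described above, assuming scikit-learn style models: an ensemble of SVMs makes the first pass, and a neural network is consulted only for samples whose posterior falls inside an ambiguity band. The feature extraction, the margin threshold, and all model settings are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

AMBIGUITY_MARGIN = 0.15  # assumed band around 0.5 where the e-SVM is "unsure"

class TwoStageClassifier:
    """Cascade of an SVM ensemble (stage 1) and a neural network (stage 2)."""

    def __init__(self):
        # stage 1: two SVMs, imagined as trained on normal and stressed speech
        self.svm_normal = SVC(probability=True)
        self.svm_stressed = SVC(probability=True)
        # stage 2: a small neural network used only for ambiguous cases
        self.dnn = MLPClassifier(hidden_layer_sizes=(256, 128), max_iter=500)

    def fit(self, X, y, stressed_mask):
        # X: acoustic feature vectors, y: 0/1 gender labels,
        # stressed_mask: boolean array marking stressful-speech samples
        self.svm_normal.fit(X[~stressed_mask], y[~stressed_mask])
        self.svm_stressed.fit(X[stressed_mask], y[stressed_mask])
        self.dnn.fit(X, y)
        return self

    def predict(self, X):
        # average the posteriors of the two first-stage models
        p = (self.svm_normal.predict_proba(X)[:, 1] +
             self.svm_stressed.predict_proba(X)[:, 1]) / 2.0
        preds = (p >= 0.5).astype(int)
        # route ambiguous samples (posterior near 0.5) to the heavier DNN
        ambiguous = np.abs(p - 0.5) < AMBIGUITY_MARGIN
        if ambiguous.any():
            preds[ambiguous] = self.dnn.predict(X[ambiguous])
        return preds
```

The adaptive routing is what buys efficiency: the heavier second-stage model runs only on the ambiguous fraction of calls.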


International Journal of Human-Computer Interaction | 2012

Voice-Driven Sound Effect Manipulation

Soonil Kwon

Authoring tools for sketching the motion of characters to be animated have been studied for content such as computer animations, games, and user-created content. However, a natural interface for sound editing has not been sufficiently studied. This article proposes an intuitive interface method in which a sound sample is selected and edited by speaking sound-imitation words (onomatopoeia). An experiment with the method, based on the statistical models generally used for pattern recognition, showed up to 99% recognition accuracy. In another experiment on sound editing, syllable segmentation was first executed, and the syllabic time scale of the sound samples was then modified by the Synchronized Overlap-Add algorithm. The per-syllable energy was then modified according to the utterances of the sound-imitation words. The experiment showed that the proposed method, compared to modifying the sample as a whole, achieved about 20.4% and 65.6% relative improvement in the time displacement of peaks and syllabic boundaries between the modified sound samples and the sound-imitation utterances.
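The Synchronized Overlap-Add step can be illustrated with a compact sketch: input frames are relocated onto a new time grid and cross-faded at the offset that best correlates with what has already been written. The frame length, overlap, and search range below are assumed values; the paper's syllable-level application would call such a routine per syllable.

```python
import numpy as np

def sola_time_stretch(x, rate, frame_len=1024, overlap=256, search=128):
    """Time-scale modification by Synchronized Overlap-Add (SOLA).
    rate > 1 speeds the sound up, rate < 1 slows it down."""
    hop_in = frame_len - overlap                  # analysis hop in the input
    hop_out = int(round(hop_in / rate))           # synthesis hop in the output
    n_frames = max(1, (len(x) - frame_len) // hop_in)
    # generous output buffer: stretched length plus room for the alignment search
    out = np.zeros(n_frames * (hop_out + search) + frame_len)
    head = np.asarray(x[:frame_len], dtype=float)
    out[:head.size] = head
    pos = 0                                       # write position of the previous frame
    fade = np.linspace(0.0, 1.0, overlap)
    for i in range(1, n_frames):
        frame = np.asarray(x[i * hop_in : i * hop_in + frame_len], dtype=float)
        target = pos + hop_out
        # pick the offset in [-search, search] whose overlap correlates best
        offsets = list(range(max(-search, -target), search + 1))
        corrs = [np.dot(out[target + k : target + k + overlap], frame[:overlap])
                 for k in offsets]
        start = target + offsets[int(np.argmax(corrs))]
        # cross-fade the overlapping region, then copy the remainder of the frame
        out[start : start + overlap] = (1 - fade) * out[start : start + overlap] + fade * frame[:overlap]
        out[start + overlap : start + frame_len] = frame[overlap:]
        pos = start
    return out[: pos + frame_len]
```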


IEICE Electronics Express | 2011

Focused word spotting in spoken Korean based on fundamental frequency

Soonil Kwon

Focused word spotting contributes to keyword extraction and speech understanding in spoken Korean. The average and variance of the fundamental frequency were expected to be important factors for detecting a speaker's verbal focus. In an experiment on spotting the focused word in each sentence by statistically modeling those factors, I achieved a spotting accuracy of 84%.
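A hedged sketch of the F0 statistics the abstract relies on, using librosa's pYIN tracker for pitch. The word boundaries and the scoring model are assumed inputs rather than anything specified in the paper.

```python
import numpy as np
import librosa

def f0_stats(y, sr):
    """Return (mean, variance) of the voiced fundamental frequency of a segment."""
    f0, voiced, _ = librosa.pyin(y, fmin=60.0, fmax=400.0, sr=sr)
    f0 = f0[~np.isnan(f0)]                     # keep voiced frames only
    if f0.size == 0:
        return 0.0, 0.0
    return float(np.mean(f0)), float(np.var(f0))

def spot_focused_words(y, sr, word_segments, focus_model):
    """Flag words whose F0 statistics look 'focused' under a statistical model.
    word_segments: list of (start_sec, end_sec) boundaries (assumed given).
    focus_model: any object with a score(mean, var) -> log-likelihood ratio method
    (hypothetical interface for illustration)."""
    flags = []
    for start, end in word_segments:
        seg = y[int(start * sr):int(end * sr)]
        mean, var = f0_stats(seg, sr)
        flags.append(focus_model.score(mean, var) > 0.0)   # assumed decision threshold
    return flags
```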


Computer Speech & Language | 2016

Preprocessing for elderly speech recognition of smart devices

Soonil Kwon; Sung-Jae Kim; Joon Yeon Choeh

Highlights:
Preprocessed elderly voice signals were tested with an Android smartphone.
Speech recognition accuracy increased by 1.5% by increasing the speech rate.
Speech recognition accuracy increased by 4.2% by eliminating inter-syllabic pauses.
Speech recognition accuracy increased by 6% by boosting formant frequency bands.
After all the preprocessing, a 12% increase in recognition accuracy was achieved.

Due to the aging population in modern society and the proliferation of smart devices, there is a need to enhance speech recognition on smart devices so that information is as easily accessible to the elderly as it is to the younger population. In general, speech recognition systems are optimized for an average adult's voice and tend to exhibit a lower accuracy rate when recognizing an elderly person's voice, due to the effects of speech articulation and speaking style. Additional costs are bound to be incurred when modifying current speech recognition systems for better recognition of elderly users. Thus, using a preprocessing application on a smart device can not only deliver better speech recognition but also substantially reduce any added costs. Audio samples of 50 words uttered by 80 elderly and young adults were collected and comparatively analyzed. The speech of the elderly shows a slower speech rate, longer inter-syllabic silences, and slightly lower intelligibility. The speech recognition rate for elderly adults could be improved by increasing the speech rate (a 1.5% increase in accuracy), eliminating silence periods (another 4.2% increase), and boosting the energy of the formant frequency bands (a further 6% increase). After all the preprocessing, a 12% increase in the accuracy of elderly speech recognition was achieved. Through this study, we show that speech recognition of elderly voices can be improved by modifying specific aspects of speech articulation and speaking style. In the future, we will conduct studies on methods that can precisely measure and adjust speech rate and find additional factors that impact intelligibility.
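A hedged sketch of the three preprocessing steps, using common signal-processing tools (librosa and SciPy). The 300-3000 Hz formant band, the stretch rate, the silence threshold, and the gain are illustrative assumptions, not the paper's settings.

```python
import numpy as np
import librosa
from scipy.signal import butter, sosfilt

def preprocess_elderly_speech(y, sr, rate=1.1, top_db=35, band=(300.0, 3000.0), gain=2.0):
    # 1) increase the speech rate slightly
    #    (phase-vocoder stretch used as a stand-in for the paper's method)
    y = librosa.effects.time_stretch(y, rate=rate)

    # 2) drop long silent gaps between syllables
    intervals = librosa.effects.split(y, top_db=top_db)
    if len(intervals):
        y = np.concatenate([y[s:e] for s, e in intervals])

    # 3) boost energy in an assumed formant band by mixing in a band-passed copy
    sos = butter(4, band, btype="bandpass", fs=sr, output="sos")
    boosted = y + (gain - 1.0) * sosfilt(sos, y)
    return boosted / (np.max(np.abs(boosted)) + 1e-9)   # normalize to avoid clipping

# Assumed usage with a 16 kHz mono recording "elderly.wav" (hypothetical file):
# y, sr = librosa.load("elderly.wav", sr=16000)
# y_clean = preprocess_elderly_speech(y, sr)
```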


Archive | 2014

Computer Assisted English Learning System with Gestures for Young Children

Seng Il Jung; Joon Yeon Choeh; Sung-Wook Baik; Soonil Kwon; Jong-Weon Lee

Kids use computer-assisted language learning systems to learn English. The contents of these systems are well designed and kids enjoy them. From cognitive psychology we found that gestures play a useful role in learning, so we developed a language learning system utilizing gestures. The system provides content similar to the existing system but additionally asks users to perform gestures related to the given words. We compared the proposed system with an existing one in terms of memorization test scores. The average improvement achieved using the proposed system was slightly better than that achieved using the existing system.


International Journal of Computational Intelligence Systems | 2013

User-Personality Classification Based on the Non-Verbal Cues from Spoken Conversations

Soonil Kwon; Joon Yeon Choeh; Jong-Weon Lee

Technology that detects user personality from speech signals must be researched to enhance the interaction between a user and a virtual agent that takes place through a speech interface. In this study, personality patterns were automatically classified as either extroverted or introverted. Personality patterns were recognized based on non-verbal cues such as the rate, energy, pitch, and silent intervals of speech, along with the patterns of their change. Through experimentation, a maximum pattern classification accuracy of 86.3% was achieved. Using the same data, another pattern classification test was carried out manually by people to see how well the automatic classification of personal traits performed. This second, manual test showed an accuracy of 86.6%. This shows that the automatic pattern classification of personal traits can achieve results comparable to the level of performance accomplished by humans. The Silent Intervals feature of the automatic pattern cl...
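A hedged sketch of the kind of non-verbal cue vector the abstract lists (energy, pitch, silent intervals, and a speaking-rate proxy) feeding a binary SVM. The exact feature definitions below are assumptions, not the authors' feature set.

```python
import numpy as np
import librosa
from sklearn.svm import SVC

def nonverbal_cues(y, sr):
    rms = librosa.feature.rms(y=y)[0]                      # frame energy
    f0, voiced, _ = librosa.pyin(y, fmin=60.0, fmax=400.0, sr=sr)
    f0 = f0[~np.isnan(f0)]
    intervals = librosa.effects.split(y, top_db=30)        # speech regions
    total_dur = len(y) / sr
    speech_dur = sum(e - s for s, e in intervals) / sr
    return np.array([
        rms.mean(), rms.std(),                             # energy level and variability
        f0.mean() if f0.size else 0.0,
        f0.std() if f0.size else 0.0,                      # pitch level and variability
        (total_dur - speech_dur) / total_dur,              # fraction of silence
        len(intervals) / total_dur,                        # rough speaking-rate proxy
    ])

# Assumed usage:
# X = np.stack([nonverbal_cues(y, sr) for y in utterances])
# clf = SVC(kernel="rbf").fit(X, labels)   # labels: 0 = introverted, 1 = extroverted
```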


International Conference on Ubiquitous Information Management and Communication | 2011

Sound sketching via voice

Soonil Kwon; Lae-Hyun Kim

Content authoring tools require a natural interface method for easy editing of sound samples. We propose a novel method that modifies the time scale and energy of sound samples by speaking sound-imitation words. The method makes it possible to modify sound samples according to the user's intention through simple utterances. The syllabic time scale of the sound samples is modified by the Synchronized Overlap-Add algorithm, and the per-syllable energy is then modified according to the energy of the sound-imitation utterances. The experiment showed that the proposed method achieved a 20.4% improvement in peak alignment compared with modifying the sample as a whole.
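A hedged sketch of the energy-matching step mentioned above: each syllable of the sound sample is rescaled so its RMS energy matches the corresponding syllable of the spoken imitation. Syllable boundaries are assumed to be given; the function name and the RMS-based matching are illustrative.

```python
import numpy as np

def match_syllable_energy(sample, imitation, sample_bounds, imitation_bounds):
    """sample_bounds / imitation_bounds: lists of (start, end) sample indices,
    one pair per syllable, equal in length (assumed to come from segmentation)."""
    out = np.array(sample, dtype=float)                    # work on a float copy
    for (s0, s1), (i0, i1) in zip(sample_bounds, imitation_bounds):
        rms_sample = np.sqrt(np.mean(np.square(sample[s0:s1], dtype=float)) + 1e-12)
        rms_imitation = np.sqrt(np.mean(np.square(imitation[i0:i1], dtype=float)) + 1e-12)
        out[s0:s1] *= rms_imitation / rms_sample           # scale syllable energy to match
    return out
```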


Multimedia Tools and Applications | 2018

Fear emotion classification in speech by acoustic and behavioral cues

Shin-ae Yoon; Guiyoung Son; Soonil Kwon

Machine-based emotional speech classification has become a requirement for natural and familiar human-computer interaction. Because emotional speech recognition systems use a person's voice to spontaneously detect their emotional status and take appropriate subsequent actions, they can be used widely for various purposes in call centers and emotion-based media services. Emotional speech recognition systems are primarily developed using emotional acoustic data. While there are several emotional acoustic databases available for emotion recognition systems in other countries, there is currently no real situational data related to the "fear emotion" available. Thus, in this study, we collected acoustic recordings representing real urgent and fearful situations from an emergency call center. To classify callers' emotions more accurately, we also included an additional behavioral feature, "interjection", a type of disfluency observed in spontaneous speech that arises from cognitive dysfunction when a speaker becomes hyperemotional. We used Support Vector Machines (SVM) with the interjection feature, as well as conventionally used acoustic features (i.e., F0 variability, voice intensity variability, and Mel-Frequency Cepstral Coefficients; MFCCs), to identify which emotional category the acoustic data fell into. The results of our study revealed that the MFCC was the best acoustic feature for spontaneous fear speech classification. In addition, we demonstrated the validity of behavioral features as an important criterion for improving emotional classification.
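A hedged sketch of how the described feature set could be assembled: MFCC statistics, F0 and intensity variability, and an interjection count (assumed to come from a transcript), concatenated into one vector for an SVM. All parameter values and names are illustrative assumptions.

```python
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def fear_features(y, sr, n_interjections):
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    f0, _, _ = librosa.pyin(y, fmin=60.0, fmax=500.0, sr=sr)
    f0 = f0[~np.isnan(f0)]
    rms = librosa.feature.rms(y=y)[0]
    return np.concatenate([
        mfcc.mean(axis=1), mfcc.std(axis=1),           # spectral shape statistics
        [f0.std() if f0.size else 0.0],                # F0 variability
        [rms.std()],                                   # voice intensity variability
        [n_interjections],                             # behavioral cue from the transcript
    ])

# Assumed usage:
# X = np.stack([fear_features(y, sr, n) for y, n in zip(calls, interjection_counts)])
# clf = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X, labels)  # 1 = fear
```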


Multimedia Tools and Applications | 2017

Deep features-based speech emotion recognition for smart affective services

Abdul Malik Badshah; Nasir Rahim; Noor Ullah; Jamil Ahmad; Khan Muhammad; Mi Young Lee; Soonil Kwon; Sung Wook Baik

Emotion recognition from speech signals is an interesting research area with several applications, such as smart healthcare, autonomous voice response systems, assessing situational seriousness by caller affective state analysis in emergency centers, and other smart affective services. In this paper, we present a study of speech emotion recognition based on features extracted from spectrograms using a deep convolutional neural network (CNN) with rectangular kernels. Typically, CNNs have square-shaped kernels and pooling operators at various layers, which are suited to 2D image data. However, in the case of spectrograms, the information is encoded in a slightly different manner: time is represented along the x-axis, the y-axis shows the frequency of the speech signal, and the amplitude is indicated by the intensity value of the spectrogram at a particular position. To analyze speech through spectrograms, we propose rectangular kernels of varying shapes and sizes, along with max pooling over rectangular neighborhoods, to extract discriminative features. The proposed scheme effectively learns discriminative features from speech spectrograms and performs better than many state-of-the-art techniques when evaluated on the Emo-DB and a Korean speech dataset.
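A minimal sketch of a CNN with rectangular kernels and rectangular pooling over spectrograms, the core idea of the paper. The specific kernel shapes, layer count, and number of emotion classes are illustrative assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class RectKernelCNN(nn.Module):
    def __init__(self, n_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            # wide-in-time kernel: captures temporal modulation within a frequency band
            nn.Conv2d(1, 16, kernel_size=(3, 9), padding=(1, 4)), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2, 4)),          # rectangular pooling neighborhood
            # tall-in-frequency kernel: captures spectral shape at a given time
            nn.Conv2d(16, 32, kernel_size=(9, 3), padding=(4, 1)), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(4, 2)),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.classifier = nn.Linear(32 * 4 * 4, n_classes)

    def forward(self, spec):                            # spec: (batch, 1, freq, time)
        x = self.features(spec)
        return self.classifier(x.flatten(1))

# Assumed usage with a log-mel spectrogram of shape (batch, 1, 128, 256):
# logits = RectKernelCNN()(torch.randn(1, 1, 128, 256))
```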


Journal of New Music Research | 2017

High-Accuracy Frequency Analysis of Harmonic Signals Using Improved Phase Difference Estimation and Window Switching

Hee Suk Pang; Jun-Seok Lim; Soonil Kwon

Accurate frequency tracking of harmonic signals is an essential process for analysis of musical instrument and singing sounds. For fine frequency estimation, we propose two methods for enhancement ...
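A hedged sketch of the baseline technique that phase-difference frequency estimation builds on: refining a bin's frequency from the phase advance between two short-time spectra separated by a small hop. Parameters are illustrative, and the paper's improved estimator and window switching are not reproduced here.

```python
import numpy as np

def phase_difference_frequency(x, sr, n_fft=4096, hop=256):
    """Return refined frequency estimates (Hz) for every FFT bin,
    using the phase advance between two frames hop samples apart."""
    window = np.hanning(n_fft)
    X1 = np.fft.rfft(window * x[:n_fft])
    X2 = np.fft.rfft(window * x[hop:hop + n_fft])
    k = np.arange(len(X1))
    expected = 2.0 * np.pi * hop * k / n_fft             # phase advance of each bin centre
    delta = np.angle(X2) - np.angle(X1) - expected
    delta = np.angle(np.exp(1j * delta))                  # wrap to (-pi, pi]
    return (k / n_fft + delta / (2.0 * np.pi * hop)) * sr

# Assumed usage for a 440 Hz tone sampled at 44.1 kHz:
# t = np.arange(44100) / 44100.0
# x = np.sin(2 * np.pi * 440.0 * t)
# freqs = phase_difference_frequency(x, 44100)
# peak_bin = np.argmax(np.abs(np.fft.rfft(np.hanning(4096) * x[:4096])))
# print(freqs[peak_bin])   # close to 440 Hz
```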

Collaboration


Dive into Soonil Kwon's collaboration.
