
Publication


Featured research published by Do Gil Lee.


IEEE Intelligent Systems | 2007

Automatic Word Spacing Using Probabilistic Models Based on Character n-grams

Do Gil Lee; Hae Chang Rim; Dongsuk Yook

On the Internet, information is largely in text form, which often includes errors such as spelling mistakes. These errors complicate natural language processing because most NLP applications aren't robust and assume that the input data is noise free. Preprocessing is necessary to deal with these errors and meet the growing need for automatic text processing. One such preprocessing step is automatic word spacing. This process decides correct boundaries between words in a sentence containing spacing errors, which are a type of spelling error. Except for some Asian languages such as Chinese and Japanese, most languages have explicit word spacing. In these languages, word spacing is crucial to increase readability and to accurately communicate a text's meaning. Automatic word spacing plays an important role not only as a spell-checker module but also as a preprocessor for a morphological analyzer, which is a fundamental tool for NLP applications. Furthermore, automatic word spacing can serve as a postprocessor for optical character recognition and speech recognition systems.
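
To make the character n-gram idea concrete, here is a minimal sketch in which the decision to insert a space between two adjacent characters is made by comparing how often that character pair was seen with and without an intervening space in a space-annotated corpus. The helper names (train_spacing_model, respace) and the toy training sentences are illustrative assumptions, not the probabilistic model described in the paper.

```python
from collections import Counter

def train_spacing_model(sentences):
    """Count character pairs that occur with and without an intervening space
    in space-annotated training sentences (illustrative sketch)."""
    joined, spaced = Counter(), Counter()
    for sent in sentences:
        chars = list(sent)
        for i in range(len(chars) - 1):
            a, b = chars[i], chars[i + 1]
            if b == " ":
                if i + 2 < len(chars):
                    spaced[(a, chars[i + 2])] += 1
            elif a != " ":
                joined[(a, b)] += 1
    return joined, spaced

def respace(text, joined, spaced):
    """Re-insert spaces: between each pair of adjacent characters, insert a
    space only if the pair was seen with a space more often than without."""
    chars = [c for c in text if c != " "]  # discard the unreliable input spacing
    if not chars:
        return ""
    out = [chars[0]]
    for a, b in zip(chars, chars[1:]):
        if spaced[(a, b)] > joined[(a, b)]:
            out.append(" ")
        out.append(b)
    return "".join(out)

# Toy usage with hypothetical training data
joined, spaced = train_spacing_model(["나는 학교에 간다", "나는 밥을 먹는다"])
print(respace("나는학교에간다", joined, spaced))  # 나는 학교에 간다
```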


IEEE Transactions on Audio, Speech, and Language Processing | 2009

Probabilistic Modeling of Korean Morphology

Do Gil Lee; Hae Chang Rim

This paper proposes new probabilistic models for analyzing Korean morphology. In order to take advantage of the characteristics of Korean morphology, the proposed models are based on three linguistic units: eojeol (a Korean spacing unit), morpheme, and syllable. Unlike previous approaches that are based on rules and dictionaries, the probabilistic approach proposed in this study can automatically acquire complete linguistic knowledge from part-of-speech (POS) tagged corpora. In addition, this approach, without any system modification, is easily applicable to other corpora with different tag sets and annotation guidelines. The three different models and their combinations are evaluated on three corpora over a wide range of conditions. The eojeol-unit and syllable-unit models compensate for the weaknesses of the morpheme-unit model. The eojeol-unit model performed efficiently and improved precision. The syllable-unit model also improved precision, showing particularly robust performance in treating unknown words. The proposed approach is also shown to outperform previous approaches.
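
As a rough illustration of the eojeol-unit idea, the sketch below estimates the probability of an analysis given an eojeol by relative frequency over a POS-tagged corpus; unknown eojeols simply return None here, whereas the paper backs off to morpheme- and syllable-unit models. The corpus, tag set, and analyses shown are hypothetical.

```python
from collections import Counter, defaultdict

def train_eojeol_model(tagged_corpus):
    """Relative-frequency estimate of P(analysis | eojeol) from a POS-tagged
    corpus (a sketch of the eojeol-unit idea, not the paper's full model)."""
    counts = defaultdict(Counter)
    for sentence in tagged_corpus:
        for eojeol, analysis in sentence:
            counts[eojeol][analysis] += 1
    return counts

def analyze(eojeol, counts):
    """Return the most probable analysis seen in training, or None for unknowns."""
    if eojeol not in counts:
        return None
    analysis, freq = counts[eojeol].most_common(1)[0]
    return analysis, freq / sum(counts[eojeol].values())

# Hypothetical tagged corpus: each eojeol paired with its morpheme/POS analysis
corpus = [[("감기는", "감기/NNG+는/JX"), ("감기는", "감/VV+기/ETN+는/JX")],
          [("감기는", "감기/NNG+는/JX")]]
model = train_eojeol_model(corpus)
print(analyze("감기는", model))  # ('감기/NNG+는/JX', 0.666...)
```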


Expert Systems With Applications | 2015

Knowledge-based question answering using the semantic embedding space

Min Chul Yang; Do Gil Lee; So Young Park; Hae Chang Rim

Highlights: We extract semantic links of words and logical properties from unstructured data. We jointly encode the semantics of words and logical properties into an embedding space. The embedding space provides semantic similarities between words and logical properties. Questions and potential answers can be represented in the embedding space. Potential answers are ranked based on semantic similarities with a given question.

Semantic transformation of a natural language question into its corresponding logical form is crucial for knowledge-based question answering systems. Most previous methods have tried to achieve this goal by using syntax-based grammar formalisms and rule-based logical inference. However, these approaches are usually limited in terms of the coverage of the lexical trigger, which performs a mapping task from words to the logical properties of the knowledge base, and thus it is easy to ignore implicit and broken relations between properties by not interpreting the full knowledge base. In this study, our goal is to answer questions in any domain by using a semantic embedding space in which the embeddings encode the semantics of words and logical properties. In the latent space, the semantic associations between existing features can be exploited based on their embeddings without using a manually produced lexicon and rules. This embedding-based inference approach for question answering allows the mapping of factoid questions posed in natural language onto logical representations of the correct answers guided by the knowledge base. In terms of overall question answering performance, our experimental results and examples demonstrate that the proposed method outperforms previous knowledge-based question answering baselines on a publicly released question answering evaluation dataset: WebQuestions.
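
A minimal sketch of the ranking step, under stated assumptions: the question is embedded (here by simply averaging word vectors) and candidate answers are ranked by cosine similarity between that embedding and the embedding of each candidate's logical property. The toy vectors, the Freebase-style property names, and the candidate list are made up for illustration; the paper learns the joint embedding space rather than using fixed vectors.

```python
import numpy as np

def embed_question(question, word_vecs, dim=4):
    """Average the word embeddings of the question tokens
    (a simplified stand-in for a jointly learned question encoder)."""
    vecs = [word_vecs[w] for w in question.lower().split() if w in word_vecs]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def rank_answers(question, candidates, word_vecs, prop_vecs):
    """Rank candidate answers by cosine similarity between the question
    embedding and the embedding of each candidate's logical property."""
    q = embed_question(question, word_vecs)
    scored = []
    for answer, logical_property in candidates:
        p = prop_vecs[logical_property]
        sim = float(q @ p / (np.linalg.norm(q) * np.linalg.norm(p) + 1e-9))
        scored.append((sim, answer))
    return sorted(scored, reverse=True)

# Hypothetical toy embeddings; real ones would be learned from text and the KB
word_vecs = {"born": np.array([1.0, 0.0, 0.0, 0.2]),
             "where": np.array([0.1, 0.9, 0.0, 0.0])}
prop_vecs = {"people.person.place_of_birth": np.array([0.9, 0.3, 0.0, 0.1]),
             "people.person.profession": np.array([0.0, 0.1, 1.0, 0.0])}
candidates = [("Honolulu", "people.person.place_of_birth"),
              ("Lawyer", "people.person.profession")]
print(rank_answers("where was obama born", candidates, word_vecs, prop_vecs))
```

Because question words and logical properties live in the same space, no hand-written lexicon is needed to bridge them; similarity alone drives the ranking.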


IEEE Transactions on Consumer Electronics | 2010

Natural language-based user interface for mobile devices with limited resources

So Young Park; Jeunghyun Byun; Hae Chang Rim; Do Gil Lee; Heuiseok Lim

In this paper, we propose a natural language-based interface model that enables a user to articulate a request without having any specific knowledge about a mobile device. In consideration of the very limited computing and memory capacity of the mobile device, and to keep the development cost low, the proposed model does not depend on typical natural language techniques but on a ranking technique that is simplified through a mathematical derivation process under the following assumptions. One assumption is that a device control command consists of a function and its parameters. The other assumption is that a parameter is represented by a few predictable patterns, whereas a function can be represented by various sentence patterns. To deal with these various sentence patterns, the proposed model selects the top-ranked command candidate with the highest score after generating all possible candidates with their scores. Furthermore, the ranking score function is designed to achieve a high discriminative capability by simulating the process of generating every candidate. Experimental results show that the proposed model, which requires 2.9 megabytes, achieves 96.27% accuracy, slightly lower than the 97.06% of the baseline model, which requires 135.2 megabytes.
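
The generate-and-rank idea can be sketched as follows, assuming a made-up command inventory, a toy stopword list, and a simple keyword-overlap score in place of the paper's derived ranking function: every candidate pairs a function with one token as its parameter, and the highest-scoring candidate is selected.

```python
# Hypothetical command inventory and stopword list (not from the paper)
COMMANDS = {
    "call": {"keywords": {"call", "dial", "phone"}},
    "set_alarm": {"keywords": {"alarm", "wake", "alert"}},
}
STOPWORDS = {"please", "the", "a", "to", "my"}

def generate_candidates(tokens):
    """Pair every known function with every token as a possible parameter value."""
    for func, spec in COMMANDS.items():
        for tok in tokens:
            yield func, tok, spec

def score(tokens, param_value, spec):
    """Keyword overlap scores the function part; a parameter token that is
    neither a keyword nor a stopword gets a bonus (a crude stand-in for the
    paper's pattern-based parameter handling)."""
    overlap = len(set(tokens) & spec["keywords"])
    bonus = 1.0 if param_value not in spec["keywords"] | STOPWORDS else 0.0
    return overlap + bonus

def interpret(utterance):
    """Generate all (function, parameter) candidates and return the top-ranked one."""
    tokens = utterance.lower().split()
    ranked = sorted(((score(tokens, p, s), f, p)
                     for f, p, s in generate_candidates(tokens)), reverse=True)
    _, func, param = ranked[0]
    return func, param

print(interpret("please call mom"))  # ('call', 'mom')
```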


Journal of Information Processing Systems | 2015

A maximum entropy-based bio-molecular event extraction model that considers event generation

Hyoung Gyu Lee; So Young Park; Hae Chang Rim; Do Gil Lee; Hong Woo Chun

In this paper, we propose a maximum entropy-based model, which can mathematically explain the bio-molecular event extraction problem. The proposed model generates an event table, which can represent the relationship between an event trigger and its arguments. Complex sentences with distinctive event structures can also be represented by the event table. Previous approaches intuitively designed a pipeline system, which sequentially performs trigger detection and argument recognition, and thus did not clearly explain the relationship between identified triggers and arguments. On the other hand, the proposed model generates an event table that can represent triggers, their arguments, and their relationships. The desired events can be easily extracted from the event table. Experimental results show that the proposed model can cover 91.36% of events in the training dataset and that it can achieve a 50.44% recall on the test dataset by using the event table.
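
A rough sketch of the event-table idea, assuming toy features, role labels, and training pairs: a maximum entropy model (here multinomial logistic regression) classifies the role of each (trigger, argument) candidate pair, and the predictions are collected into a table from which events are read off. The feature set and the scikit-learn classifier are stand-ins, not the paper's implementation.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical (features, role) training pairs for trigger-argument candidates
train = [
    ({"trigger": "expression", "arg": "IL-2", "dist": 1}, "Theme"),
    ({"trigger": "expression", "arg": "T cells", "dist": 3}, "None"),
    ({"trigger": "regulates", "arg": "IL-2", "dist": 2}, "Theme"),
    ({"trigger": "regulates", "arg": "p53", "dist": 1}, "Cause"),
]

vec = DictVectorizer()
X = vec.fit_transform([features for features, _ in train])
y = [role for _, role in train]
maxent = LogisticRegression(max_iter=1000).fit(X, y)  # maximum entropy classifier

def event_table(candidate_pairs):
    """Predict a role for each candidate pair and keep the non-'None' rows;
    the surviving rows form the extracted events."""
    table = {}
    for features in candidate_pairs:
        role = maxent.predict(vec.transform([features]))[0]
        if role != "None":
            table[(features["trigger"], features["arg"])] = role
    return table

print(event_table([{"trigger": "expression", "arg": "IL-2", "dist": 1}]))
# likely {('expression', 'IL-2'): 'Theme'} on this toy data
```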


Digital Scholarship in the Humanities | 2016

An All-Words Sense Tagging Method for Resource-Deficient Languages

Bong Jun Yi; Do Gil Lee; Hae Chang Rim

All-words sense tagging is the task of determining the correct senses of all content words in a given text. Many methods utilizing various language resources, such as a machine-readable dictionary (MRD), a sense-tagged corpus, and WordNet, have been proposed for tagging senses to all words rather than a small number of sample words. However, sense tagging methods that require vast resources cannot be used for resource-deficient languages. The conventional sense tagging method for resource-deficient languages, which utilizes only an MRD, suffers from low recall and low precision because it determines senses only when a gloss word in the dictionary exactly matches a context word. In this study, we propose an all-words sense tagging method that is effective for resource-deficient languages in particular. It requires an MRD, which is the essential resource for all-words sense tagging, and a raw corpus, which is easily acquired and freely available. The proposed sense tagging method attempts to find semantically related context words based on the co-occurrence information extracted from the raw corpus and utilizes these words for tagging the senses of the target word. The experimental results of an evaluation of the proposed sense tagging algorithm on a Korean test corpus consisting of approximately 15 million words show that it can tag senses to all content words automatically with high precision. Furthermore, we also show that a semantic concordancer can be developed based on the automatic sense tagged corpus.
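
The co-occurrence-based matching can be sketched as follows, assuming a toy MRD entry and a tiny raw corpus: instead of requiring an exact match between gloss words and context words, each dictionary sense is scored by how strongly its gloss words co-occur with the context words in the raw corpus.

```python
from collections import Counter, defaultdict

def cooccurrence(raw_corpus, window=5):
    """Count word co-occurrences within a fixed window over a raw (untagged) corpus."""
    co = defaultdict(Counter)
    for sent in raw_corpus:
        tokens = sent.lower().split()
        for i, w in enumerate(tokens):
            for v in tokens[max(0, i - window): i + window + 1]:
                if v != w:
                    co[w][v] += 1
    return co

def tag_sense(context, mrd_senses, co):
    """Score each sense by the co-occurrence strength between its gloss words
    and the context words, and return the best-scoring sense."""
    context = [w.lower() for w in context]
    best = None
    for sense_id, gloss in mrd_senses.items():
        score = sum(co[g][c] for g in gloss.lower().split() for c in context)
        if best is None or score > best[0]:
            best = (score, sense_id)
    return best[1]

# Hypothetical MRD entry and raw corpus for the ambiguous word "bank"
mrd = {"bank#1": "financial institution that accepts deposits",
       "bank#2": "sloping land beside a river"}
raw = ["the financial institution raised its deposit rate",
       "we walked along the sloping land beside the river"]
co = cooccurrence(raw)
print(tag_sense(["deposit", "rate"], mrd, co))  # bank#1
```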


Mathematical Problems in Engineering | 2015

The Effects of Feature Optimization on High-Dimensional Essay Data

Bong Jun Yi; Do Gil Lee; Hae Chang Rim

Current machine learning (ML) based automated essay scoring (AES) systems employ a large and varied set of features, which have proven useful in improving AES performance. However, the high-dimensional feature space is not properly represented due to the large volume of features extracted from the limited training data. As a result, this problem gives rise to poor performance and increased training time for the system. In this paper, we experiment with and analyze the effects of feature optimization, including normalization, discretization, and feature selection techniques for different ML algorithms, while taking into consideration the size of the feature space and the performance of the AES. Accordingly, we show that appropriate feature optimization techniques can reduce the feature dimensionality, thus contributing to efficient training and performance improvement of AES.
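
As an illustration of the three optimization steps discussed above, the sketch below chains normalization, discretization, and feature selection in a scikit-learn pipeline over made-up essay features; the paper's actual feature set, algorithms, and parameter choices may differ.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, KBinsDiscretizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((200, 50))          # 200 essays, 50 hypothetical features
y = rng.integers(0, 4, size=200)   # toy score levels 0-3

pipeline = Pipeline([
    ("normalize", MinMaxScaler()),                                 # rescale to [0, 1]
    ("discretize", KBinsDiscretizer(n_bins=5, encode="ordinal")),  # bin each feature
    ("select", SelectKBest(f_classif, k=10)),                      # keep 10 features
    ("classify", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X, y)
print(pipeline.named_steps["select"].get_support().sum())  # 10 features retained
```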


Meeting of the Association for Computational Linguistics | 2009

A Novel Word Segmentation Approach for Written Languages with Word Boundary Markers

Han Cheol Cho; Do Gil Lee; Jung Tae Lee; Pontus Stenetorp; Jun’ichi Tsujii; Hae Chang Rim

Most NLP applications work under the assumption that a user input is error-free; thus, word segmentation (WS) for written languages that use word boundary markers (WBMs), such as spaces, has been regarded as a trivial issue. However, noisy real-world texts, such as blogs, e-mails, and SMS, may contain spacing errors that require correction before further processing may take place. For the Korean language, many researchers have adopted a traditional WS approach, which eliminates all spaces in the user input and re-inserts proper word boundaries. Unfortunately, such an approach often degrades the word spacing quality of user input that has few or no spacing errors, because a perfect WS model does not exist. In this paper, we propose a novel WS method that takes into consideration the initial word spacing information of the user input. Our method generates a better output than the original user input, even if the user input has few spacing errors. Moreover, the proposed method significantly outperforms a state-of-the-art Korean WS model when the user input initially contains less than 10% spacing errors, and performs comparably for cases containing more spacing errors. We believe that the proposed method will be a very practical preprocessing module.
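
A minimal sketch of how the initial spacing can be used, under stated assumptions: each candidate spacing is scored by a spacing model (here a hypothetical known-word dictionary) plus a weighted agreement term against the user's original spacing, so nearly correct input is not made worse. The exhaustive candidate enumeration and the 0.5 weight are illustrative; a real system would search with dynamic programming and a trained model.

```python
from itertools import product

def candidate_spacings(text):
    """Enumerate every possible spacing of the input characters
    (exponential; shown only for clarity)."""
    chars = [c for c in text if c != " "]
    for gaps in product([False, True], repeat=len(chars) - 1):
        out = [chars[0]]
        for ch, gap in zip(chars[1:], gaps):
            if gap:
                out.append(" ")
            out.append(ch)
        yield "".join(out)

def rescore(candidate, original, model_score, agreement_weight=0.5):
    """Combine a spacing model's score with character-level agreement to the
    user's original spacing (illustrative weighting)."""
    agreement = sum(a == b for a, b in zip(candidate, original)) / max(len(original), 1)
    return model_score(candidate) + agreement_weight * agreement

# Hypothetical spacing model: count how many space-delimited chunks are known words
WORDS = {"나는", "학교에", "간다"}
def model_score(candidate):
    return sum(chunk in WORDS for chunk in candidate.split())

original = "나는 학교에간다"   # one missing space, otherwise correct
best = max(candidate_spacings(original),
           key=lambda c: rescore(c, original, model_score))
print(best)  # 나는 학교에 간다
```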


IEICE Transactions on Information and Systems | 2010

Minimizing Human Intervention for Constructing Korean Part-of-Speech Tagged Corpus

Do Gil Lee; Gumwon Hong; Seok Kee Lee; Hae Chang Rim


IEICE Transactions on Information and Systems | 2009

Utilizing the Web for Automatic Word Spacing

Gumwon Hong; Jeong Hoon Lee; Young In Song; Do Gil Lee; Hae Chang Rim

Collaboration


Dive into Do Gil Lee's collaboration.

Top Co-Authors

Hong Woo Chun

Korea Institute of Science and Technology Information
