Byung-Won On | Researchain

Publication

Featured research published by Byung-Won On.

information quality in information systems | 2005

Effective and scalable solutions for mixed and split citation problems in digital libraries

Dongwon Lee; Byung-Won On; Jaewoo Kang; Sanghyun Park

In this paper, we consider two important problems that commonly occur in bibliographic digital libraries and seriously degrade their data quality: the Mixed Citation (MC) problem (i.e., citations of different scholars with homonymous names are mixed together) and the Split Citation (SC) problem (i.e., citations of the same author appear under different name variants). In particular, we investigate an effective yet scalable solution, since citation collections in such digital libraries tend to be large-scale. After formally defining the problems and accompanying challenges, we present an effective solution that is based on the state-of-the-art sampling-based approximate join algorithm. Our claim is verified through preliminary experimental results.

Communications of The ACM | 2007

Are your citations clean?

Dongwon Lee; Jaewoo Kang; Prasenjit Mitra; C. Lee Giles; Byung-Won On

If they are, only one can refer to a distinct document; if not, many can refer to the same document.

international conference on data mining | 2006

Improving Grouped-Entity Resolution Using Quasi-Cliques

Byung-Won On; Ergin Elmacioglu; Dongwon Lee; Jaewoo Kang; Jian Pei

The entity resolution (ER) problem, which identifies duplicate entities that refer to the same real-world entity, is essential in many applications. In this paper, we focus on resolving entities that contain a group of related elements (e.g., an author entity with a list of citations, a singer entity with a song list, or an intermediate result of a GROUP BY SQL query). Such entities, called grouped-entities, occur frequently in many applications. Previous approaches to grouped-entity resolution often rely on textual similarity and produce a large number of false positives. As a complementary technique, we present our experience of applying a recently proposed graph mining technique, Quasi-Clique, atop conventional ER solutions. Our approach exploits contextual information mined from the group of elements per entity in addition to syntactic similarity. Extensive experiments verify that our proposal improves precision and recall up to 83% when used together with a variety of existing ER solutions, but never worsens them.
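
To make the graph notion concrete, the following minimal sketch checks whether a set of vertices forms a quasi-clique. It assumes the common degree-based definition (each member is adjacent to at least gamma * (n - 1) of the other members). The abstract does not spell out the definition, and the function name, the `gamma` parameter, and the toy graph are illustrative assumptions, not the authors' implementation.

```python
def is_quasi_clique(adjacency, vertices, gamma):
    """Return True if `vertices` induce a gamma-quasi-clique.

    Assumed degree-based definition: every member must be adjacent to at
    least gamma * (n - 1) of the other n - 1 members.
    `adjacency` maps each vertex to the set of its neighbours.
    """
    members = set(vertices)
    n = len(members)
    if n < 2:
        # A single vertex is trivially a quasi-clique; an empty set is not.
        return n == 1
    required = gamma * (n - 1)
    for v in members:
        others = members - {v}
        degree_inside = len(adjacency.get(v, set()) & others)
        if degree_inside < required:
            return False
    return True


# Hypothetical toy graph, e.g. co-author links around one author entity.
toy_graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c"},
}

loose = is_quasi_clique(toy_graph, ["a", "b", "c", "d"], gamma=0.6)
strict = is_quasi_clique(toy_graph, ["a", "b", "c", "d"], gamma=0.8)
```

With gamma = 0.6 each vertex needs at least 1.8 neighbours inside the set, which all four vertices have; with gamma = 0.8 the threshold rises to 2.4 and vertices "b" and "d" fall short.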

international conference theory and practice digital libraries | 2004

System support for name authority control problem in digital libraries: OpenDBLP approach

Yoojin Hong; Byung-Won On; Dongwon Lee

In maintaining Digital Libraries, keeping bibliographic data up to date is critical, yet minor irregularities can often cause information isolation. Unlike documents, for which various kinds of unique ID systems exist (e.g., DOI, ISBN), other bibliographic entities such as author and publication venue do not have unique IDs. Therefore, in current Digital Libraries, tracking such bibliographic entities is not trivial. For instance, suppose a scholar changes her last name from A to B. Then a user searching for her publications under the new name B cannot get old publications that appeared under A, although they are by the same person. In such a scenario, since both A and B are the same person, it would be desirable for Digital Libraries to track their identities accordingly. In this paper, we investigate this problem, known as name authority control, and present our system-oriented solution. We first identify three core building blocks that underlie the phenomenon and show a taxonomy of the different combinations in which the building blocks can occur. Then, we consider how systems can support the problem in two common functions of Digital Libraries: Update and Search. Finally, we present our test-bed, called OpenDBLP, in which the suggested solution is fully implemented as a proof of concept.

acm/ieee joint conference on digital libraries | 2006

An effective approach to entity resolution problem using quasi-clique and its application to digital libraries

Byung-Won On; Ergin Elmacioglu; Dongwon Lee; Jaewoo Kang; Jian Pei

We study how to resolve entities that contain a group of related elements (e.g., an author entity with a list of citations or an intermediate result of a GROUP BY SQL query). Such entities, called grouped-entities, occur frequently in many applications. By exploiting contextual information mined from the group of elements per entity in addition to syntactic similarity, we show that our approach, Quasi-Clique, improves precision and recall up to 91% when used together with a variety of existing entity resolution solutions, but never worsens them.

international conference on data engineering | 2009

GuruMine: A Pattern Mining System for Discovering Leaders and Tribes

Amit Goyal; Byung-Won On; Francesco Bonchi; Laks V. S. Lakshmanan

In this demo we introduce GuruMine, a pattern mining system for the discovery of leaders, i.e., influential users in social networks, and their tribes, i.e., sets of users usually influenced by the same leader over several actions. GuruMine is built upon a novel pattern mining framework for leader discovery that we introduced in [1]. In particular, we consider social networks where users perform actions. Actions may be as simple as tagging resources (URLs) as in del.icio.us, rating songs as in Yahoo! Music or movies as in Yahoo! Movies, or buying gadgets such as cameras and handhelds and blogging a review of them. The assumption is that actions performed by a user can be seen by their network friends. Users seeing their friends' actions are sometimes tempted to perform those actions. On the basis of the propagation of such influence, in [1] we provided various notions of leaders and developed algorithms for their efficient discovery. GuruMine provides users with a friendly graphical interface for selecting the actions of interest and the kind of leaders to mine. The set of parameters driving the pattern discovery process can be iteratively refined, and the result is updated, if possible without incurring a completely new computation. Once a set of leaders has been extracted, GuruMine can easily validate them on a set of actions unseen during the pattern mining, by analyzing the portion of the network reached by the influence of the selected leaders on the unseen actions. GuruMine also offers various visualizations over the social networks: the propagation of an action, the leaders, their tribes, and the interactions between different leaders and tribes. In this demo we will show: (i) how the pattern mining process can be driven towards the discovery of a good set of leaders, (ii) the ease of use of the GuruMine system, and (iii) its outstanding performance on large real-world social networks and action databases.

Knowledge and Information Systems | 2012

Scalable clustering methods for the name disambiguation problem

Byung-Won On; Ingyu Lee; Dongwon Lee

When non-unique values are used as the identifiers of entities, confusion can occur because of homonyms. In particular, when (part of) the “names” of entities are used as their identifiers, the problem is often referred to as the name disambiguation problem, where the goal is to sort out the entities that are erroneously merged due to name homonyms (e.g., if only the last name is used as the identifier, one cannot distinguish “Masao Obama” from “Norio Obama”). In this paper, we study the scalability issue of the name disambiguation problem, which arises when (1) a small number of entities with large contents or (2) a large number of entities become indistinguishable due to homonyms. First, we carefully examine two state-of-the-art solutions to the name disambiguation problem and point out their limitations with respect to scalability. Then, we propose two scalable graph partitioning algorithms, multi-level graph partitioning and multi-level graph partitioning and merging, to solve the large-scale name disambiguation problem. Our claim is empirically validated via experimentation: our proposal shows orders of magnitude improvement in terms of performance while maintaining equivalent or reasonable accuracy compared to competing solutions.

advances in social networks analysis and mining | 2010

Mining Interaction Behaviors for Email Reply Order Prediction

Byung-Won On; Ee-Peng Lim; Jing Jiang; Amruta Purandare; Loo-Nin Teow

In email networks, user behaviors affect the way emails are sent and replied to. While knowing these user behaviors can help to create more intelligent email services, there has not been much research into mining these behaviors. In this paper, we investigate user engagingness and responsiveness as two interaction behaviors that give us useful insights into how users email one another. Engaging users are those who can effectively solicit responses from other users. Responsive users are those who are willing to respond to other users. By modeling such behaviors, we are able to mine them and to identify engaging or responsive users. This paper proposes four types of models to quantify the engagingness and responsiveness of users. These behaviors can be used as features in the email reply order prediction task, which predicts the reply order given an email pair. Our experiments show that engagingness and responsiveness behavior features are more useful than other non-behavior features in building a classifier for the email reply order prediction task. When combining behavior and non-behavior features, our classifier is also shown to predict the email reply order with good accuracy.
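
As a rough illustration of the two behaviors, the sketch below scores engagingness as the fraction of a user's sent emails that received a reply and responsiveness as the fraction of received emails the user replied to. This simple ratio is an assumption made for illustration only; it is not one of the four models proposed in the paper, and the record layout (`id`, `sender`, `recipient`, `reply_to`) and the single-recipient simplification are hypothetical.

```python
from collections import defaultdict


def interaction_scores(emails):
    """Return (engagingness, responsiveness) dicts keyed by user.

    Each email is a dict with keys: id, sender, recipient, reply_to
    (id of the email being answered, or None). One recipient per email
    is assumed to keep the sketch short.
    """
    by_id = {e["id"]: e for e in emails}
    sent = defaultdict(int)
    received = defaultdict(int)
    got_reply = defaultdict(set)  # sender -> ids of their emails that were answered
    answered = defaultdict(set)   # user -> ids of received emails they answered

    for e in emails:
        sent[e["sender"]] += 1
        received[e["recipient"]] += 1
        parent = by_id.get(e["reply_to"])
        # Count a reply only when it comes from the original recipient.
        if parent is not None and parent["recipient"] == e["sender"]:
            got_reply[parent["sender"]].add(parent["id"])
            answered[e["sender"]].add(parent["id"])

    engagingness = {u: len(got_reply[u]) / sent[u] for u in sent}
    responsiveness = {u: len(answered[u]) / received[u] for u in received}
    return engagingness, responsiveness


# Made-up toy mailbox.
toy_emails = [
    {"id": 1, "sender": "alice", "recipient": "bob", "reply_to": None},
    {"id": 2, "sender": "bob", "recipient": "alice", "reply_to": 1},
    {"id": 3, "sender": "alice", "recipient": "carol", "reply_to": None},
    {"id": 4, "sender": "carol", "recipient": "bob", "reply_to": None},
]
engagingness, responsiveness = interaction_scores(toy_emails)
```

In the toy data, one of alice's two emails is answered, and bob answers one of the two emails he receives. Scores of this kind could then serve as classifier features alongside non-behavior features.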

international conference on digital information management | 2014

A big data management system for energy consumption prediction models

Wonjin Lee; Byung-Won On; Ingyu Lee; Jungin Choi

In this work, we develop a prototype of a big data management system for storing, indexing, and searching huge-scale energy usage data. Unlike existing commercial relational databases such as Oracle and IBM DB2, this system provides high availability and performance at low cost. It can also manage unstructured data and store big data in a distributed environment. In addition, target data can be retrieved quickly from the proposed system through data access APIs. To make use of our prototype system, we also propose an energy consumption prediction model based on penalized linear regression-based map/reduce algorithms. We then exploit discriminative features derived from the time stamp. Finally, given a time stamp (e.g., 2014-01-05 12:01:08), the proposed learning model returns a predicted value of the energy usage (e.g., 90 watts) at that time. According to our experimental results, obtained from about 7.5 million records, each of which consists of an energy usage value and a time stamp collected during three months in 2014, our prediction model predicts values that are very close to the actual energy usage at that time and is about 1.72 times faster than on a single machine.
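
One way to picture a penalized linear regression in map/reduce form is sketched below: each partition is mapped to a tuple of sufficient statistics, the tuples are summed in the reduce step, and a ridge-penalized line is solved from the totals. The single hour-of-day feature, the ridge (L2) penalty, the function names, and the toy readings are all assumptions for illustration; the abstract does not specify the paper's actual features or penalty.

```python
from datetime import datetime
from functools import reduce


def hour_of_day(stamp):
    """Turn 'YYYY-MM-DD HH:MM:SS' into a fractional hour (assumed feature)."""
    t = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    return t.hour + t.minute / 60 + t.second / 3600


def map_partition(records):
    """Map step: reduce one partition to (n, sum x, sum y, sum x^2, sum xy)."""
    n = sx = sy = sxx = sxy = 0.0
    for stamp, watts in records:
        x = hour_of_day(stamp)
        n += 1
        sx += x
        sy += watts
        sxx += x * x
        sxy += x * watts
    return (n, sx, sy, sxx, sxy)


def combine(a, b):
    """Reduce step: sufficient statistics simply add up."""
    return tuple(p + q for p, q in zip(a, b))


def fit_ridge(stats, lam):
    """Solve y = intercept + slope * x with an L2 penalty `lam` on the slope."""
    n, sx, sy, sxx, sxy = stats
    slope = (sxy - sx * sy / n) / (sxx - sx * sx / n + lam)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Made-up readings split across two hypothetical partitions.
partitions = [
    [("2014-01-05 00:00:00", 50.0), ("2014-01-05 06:00:00", 62.0)],
    [("2014-01-05 12:00:00", 74.0), ("2014-01-05 18:00:00", 86.0)],
]
totals = reduce(combine, (map_partition(p) for p in partitions))
intercept, slope = fit_ridge(totals, lam=0.0)
predicted = intercept + slope * hour_of_day("2014-01-05 12:01:08")
```

Because the statistics are additive, the partitions can be processed on different machines and merged in any order. Setting `lam` above zero shrinks the slope toward zero.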

Microprocessors and Microsystems | 2011

Study of the performance impact of a cache buffer in solid-state disks

Gyu Sang Choi; Byung-Won On

An SSD generally has a small memory, called a cache buffer, to increase its performance, and frequently accessed data are kept in this cache buffer. The cached data must be written back periodically to NAND Flash memory to prevent data loss from a sudden power-off, and on receiving a flush command the SSD must immediately flush all dirty data items to non-volatile storage media (i.e., NAND Flash memory); the flush command is supported in Serial ATA (SATA) and Serial Attached SCSI (SAS). Thus, the flush command is an important factor with a significant impact on SSD performance. In this paper, we investigate the impact of the flush command on SSD performance and conduct in-depth experiments with diverse workloads, using a modified FlashSim simulator. Our performance measurements using PC and server workloads provide several interesting conclusions. First, a cache buffer without a flush command can improve SSD performance as the cache buffer size increases, since more requests can be handled in the cache buffer. Second, our experiments reveal that a flush command can have a negative impact on SSD performance: as the cache buffer size increases, the average response time per request with a flush command becomes worse than without flush command support. Finally, we propose a backend flushing scheme to nullify the negative performance impact of the flush command. The backend flushing scheme first writes the requested data into the cache buffer and sends the acknowledgment of request completion to the host system. Then, it writes the data in the cache buffer back to NAND Flash memory. Thus, the proposed scheme can improve SSD performance, since it may reduce the number of dirty data items in the cache buffer that must be written back to NAND Flash memory.
All these results suggest that a flush command can have a negative impact on SSD performance and that our proposed backend flushing scheme can improve SSD performance while still supporting the flush command.
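
The interaction between a write-back cache buffer, a flush command, and background write-back can be pictured with the toy model below. It is only a sketch under simplifying assumptions (FIFO eviction, one write per page, no timing model); the class and method names are hypothetical, and it is not the modified FlashSim simulator or the authors' backend flushing implementation.

```python
class CacheBuffer:
    """Toy write-back cache buffer sitting in front of NAND Flash."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = {}      # page number -> data, in insertion order
        self.dirty = set()   # pages not yet written back to NAND
        self.nand = {}       # stand-in for NAND Flash contents
        self.nand_writes = 0

    def _write_back(self, page):
        self.nand[page] = self.pages[page]
        self.nand_writes += 1
        self.dirty.discard(page)

    def write(self, page, data):
        """Buffer the write and acknowledge at once (write-back policy)."""
        if page not in self.pages and len(self.pages) >= self.capacity:
            victim = next(iter(self.pages))  # FIFO eviction (an assumption)
            if victim in self.dirty:
                self._write_back(victim)
            del self.pages[victim]
        self.pages[page] = data
        self.dirty.add(page)
        return "ack"

    def background_flush(self, budget):
        """Write back up to `budget` dirty pages, e.g. while the host is idle."""
        chosen = sorted(self.dirty)[:budget]
        for page in chosen:
            self._write_back(page)
        return len(chosen)

    def flush(self):
        """Flush command: synchronously write back every dirty page."""
        pending = sorted(self.dirty)
        for page in pending:
            self._write_back(page)
        return len(pending)


# Without background write-back, the flush command must handle all dirty pages.
plain = CacheBuffer(capacity=4)
for p in range(3):
    plain.write(p, f"data{p}")
plain_pending = plain.flush()

# With background write-back, fewer dirty pages remain when flush arrives.
backend = CacheBuffer(capacity=4)
for p in range(3):
    backend.write(p, f"data{p}")
backend.background_flush(budget=2)
backend_pending = backend.flush()
```

In this toy run the flush command has three dirty pages to write in the first case and only one in the second, which mirrors the intuition in the abstract that writing back early leaves less work for a later flush.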
