Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Myeong-Seon Gil is active.

Publication


Featured research published by Myeong-Seon Gil.


The Journal of Supercomputing | 2017

Moving metadata from ad hoc files to database tables for robust, highly available, and scalable HDFS

Hee-Sun Won; Minh Chau Nguyen; Myeong-Seon Gil; Yang-Sae Moon; Kyu-Young Whang

As a representative large-scale data management technology, Apache Hadoop is an open-source framework for processing a variety of data such as SNS, medical, weather, and IoT data. Hadoop largely consists of HDFS, MapReduce, and YARN. Among them, we focus on improving the HDFS metadata management scheme, which is responsible for storing and managing big data. We note that the current HDFS incurs many problems in system utilization due to its file-based metadata management. To solve these problems, we propose a novel RDBMS-based metadata management scheme that improves the functional aspects of HDFS. Through analysis of the latest HDFS, we first present five problems caused by its metadata management and derive three requirements, robustness, availability, and scalability, for resolving these problems. We then design an overall architecture of the advanced HDFS, A-HDFS, which satisfies these requirements. In particular, we define functional modules according to HDFS operations and present a detailed design strategy for adding or modifying the individual components in the corresponding modules. Finally, through an implementation of the proposed A-HDFS, we validate its correctness by experimental evaluation and show that A-HDFS satisfies all the requirements. The proposed A-HDFS significantly enhances the HDFS metadata management scheme and, as a result, improves the robustness, availability, and scalability of the entire system. Thus, the improved distributed file system based on A-HDFS can be exploited in various fields, and we can expect more applications to be actively developed on top of it.
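The core idea of moving metadata from ad hoc files into database tables can be illustrated with a minimal sketch. The table layout below is hypothetical, not the actual A-HDFS schema, which is not reproduced in the abstract; it only shows how file-system metadata (inodes and blocks) becomes ordinary relational rows that SQL can query and protect with constraints.

```python
import sqlite3

# Hypothetical sketch: file-based metadata moved into an RDBMS.
# Table and column names are illustrative, not the A-HDFS schema.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE inode (
        inode_id    INTEGER PRIMARY KEY,
        parent_id   INTEGER,
        name        TEXT NOT NULL,
        is_dir      INTEGER NOT NULL,
        replication INTEGER,
        mtime       INTEGER
    )""")
conn.execute("""
    CREATE TABLE block (
        block_id  INTEGER PRIMARY KEY,
        inode_id  INTEGER REFERENCES inode(inode_id),
        size      INTEGER NOT NULL
    )""")

# Register a file with two blocks, as a NameNode would on file creation.
conn.execute("INSERT INTO inode VALUES (1, 0, 'data.log', 0, 3, 0)")
conn.executemany("INSERT INTO block VALUES (?, 1, ?)", [(101, 128), (102, 64)])

# Metadata queries become ordinary SQL instead of in-memory file scans.
total, = conn.execute(
    "SELECT SUM(size) FROM block WHERE inode_id = 1").fetchone()
print(total)  # 192
```

Keeping metadata in tables is what buys the robustness and availability the paper targets: transactions, constraints, and replication come from the database rather than from hand-rolled file handling.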


international conference on big data and smart computing | 2017

Anomaly detection for big log data using a Hadoop ecosystem

Siwoon Son; Myeong-Seon Gil; Yang-Sae Moon

In this paper, we present a novel method to efficiently manage and analyze a large amount of log data. First, we present a new Apache Hive-based data storage and analysis architecture to process a large volume of Hadoop log data, which is rapidly generated across multiple nodes. Second, we design and implement three simple but efficient anomaly detection methods. These methods use moving-average and 3-sigma techniques to detect anomalies in log data. Finally, we show that all three methods detect abnormal intervals properly, and that the weighted anomaly detection methods are more precise than the basic one. These results indicate that our approach is a simple yet effective way to detect anomalies in log data on a Hadoop ecosystem.
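The moving-average/3-sigma idea can be sketched in a few lines: a point is anomalous when it falls outside three standard deviations of a recent window. This is a minimal illustration only; the paper's weighted variants and the Hive-based pipeline are not reproduced here, and the window size is an assumption.

```python
from collections import deque
from statistics import mean, stdev

def detect_anomalies(values, window=5):
    """Flag points outside mean ± 3*sigma of a moving window.

    Minimal sketch of the moving-average / 3-sigma technique;
    the weighted variants from the paper are not shown.
    """
    win = deque(maxlen=window)
    anomalies = []
    for i, v in enumerate(values):
        if len(win) == window:
            m, s = mean(win), stdev(win)
            if s > 0 and abs(v - m) > 3 * s:
                anomalies.append(i)
        win.append(v)
    return anomalies

# A stable series with one spike at index 10.
series = [10, 11, 9, 10, 10, 11, 10, 9, 10, 11, 95, 10, 9, 11]
print(detect_anomalies(series))  # [10]
```

The same statistic is easy to express as a Hive window-function query over log tables, which is presumably why the method suits that architecture.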


The Journal of Supercomputing | 2017

Prefetching-based metadata management in Advanced Multitenant Hadoop

Minh Chau Nguyen; Hee-Sun Won; Siwoon Son; Myeong-Seon Gil; Yang-Sae Moon

Metadata management is an essential part of Apache Hadoop. Optimizing metadata accesses enhances big data storage, processing, and analysis, especially in multitenant environments. Nevertheless, as environmental complexity increases, metadata management becomes more challenging and costly because of heavy performance overheads. In this paper, we propose a novel approach, based on a prefetching mechanism, to improve the performance of metadata management for Hadoop in multitenant environments. We create metadata access graphs from historical access values, define access patterns, and then prefetch potentially needed items for near-future requests to minimize latency. We present a formal algorithm that applies the prefetching mechanism to the Hadoop system and implement it on a recent Hadoop release. Experimental results show that the proposed approach enables high-performance metadata management while maintaining advanced multitenancy features.
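History-based prefetching of this kind can be sketched with a toy transition graph: record how often one metadata item is accessed after another, and prefetch the most frequent successor. The graph construction, pattern mining, and Hadoop integration from the paper are simplified away; this class and its policy are illustrative assumptions.

```python
from collections import defaultdict

class Prefetcher:
    """Toy sketch of history-based metadata prefetching.

    Records pairwise transition counts between accessed items and
    prefetches the most frequently observed successor. The paper's
    access-graph construction and Hadoop integration are not shown.
    """
    def __init__(self):
        self.edges = defaultdict(lambda: defaultdict(int))
        self.prev = None
        self.cache = set()

    def access(self, item):
        if self.prev is not None:
            self.edges[self.prev][item] += 1
        self.prev = item
        succ = self.edges[item]
        if succ:
            # Warm the cache with the likeliest next request.
            self.cache.add(max(succ, key=succ.get))

p = Prefetcher()
for item in ["a", "b", "a", "b", "a"]:
    p.access(item)
print("b" in p.cache)  # True
```

After observing the alternating pattern, accessing "a" prefetches "b", so the near-future request is served from cache instead of incurring metadata-lookup latency.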


international conference on big data and smart computing | 2015

Fast index construction for distortion-free subsequence matching in time-series databases

Myeong-Seon Gil; Bum-Soo Kim; Mi-Jung Choi; Yang-Sae Moon

In this paper, we address the problem of efficiently constructing a multidimensional index for distortion-free subsequence matching. In previous distortion-free subsequence matching, index construction is a very time-consuming process since it generates a huge number of data subsequences to cover all possible positions and all possible query lengths. Experimental results show that index construction takes several hours for a time-series with a million entries, which makes it impractical for large time-series databases. To solve this problem, we first formally analyze the index construction steps, then optimize the performance of each step, and finally propose two advanced algorithms that construct a multidimensional index very fast. In particular, we present the novel store-and-reuse principle, a dynamic programming technique that stores intermediate results and reuses them repeatedly in subsequent steps. Through the store-and-reuse principle, the proposed algorithms construct a multidimensional index much faster than the previous algorithm. Analytical and empirical evaluations show the superiority of the proposed algorithms. For a time-series of length 300,000, we reduce the index construction time from 100 minutes to 7.5 minutes, an improvement of more than an order of magnitude.
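The store-and-reuse principle can be illustrated with the classic sliding-window sum: instead of recomputing each window from scratch, each result is derived from the previous one in constant time. The paper applies the principle to the intermediate results of index construction, which are more involved than plain sums; this is only an analogy for the technique.

```python
def window_sums_naive(ts, w):
    """Recompute each window sum from scratch: O(n*w)."""
    return [sum(ts[i:i + w]) for i in range(len(ts) - w + 1)]

def window_sums_reuse(ts, w):
    """Store-and-reuse sketch: each sum reuses the previous one in
    O(1), so the whole pass is O(n). The paper applies the same idea
    to index-construction intermediates; this only shows the pattern."""
    sums = [sum(ts[:w])]
    for i in range(1, len(ts) - w + 1):
        sums.append(sums[-1] - ts[i - 1] + ts[i + w - 1])
    return sums

ts = [3, 1, 4, 1, 5, 9, 2, 6]
print(window_sums_reuse(ts, 3) == window_sums_naive(ts, 3))  # True
```

Because every position and every query length must be covered, avoiding recomputation per subsequence is exactly what turns hours of construction into minutes.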


Journal of KIISE | 2015

Secure Multiparty Computation of Principal Component Analysis

Sang-Pil Kim; Sanghun Lee; Myeong-Seon Gil; Yang-Sae Moon; Hee-Sun Won

In recent years, many research efforts have been made on privacy-preserving data mining (PPDM) over large volumes of data. In this paper, we propose a PPDM solution based on principal component analysis (PCA), which can be widely used in computing correlations among sensitive data sets. The usual way to compute PCA is to collect all the data spread across multiple nodes into a single node before starting the computation; however, this approach discloses the sensitive data of individual nodes, involves a large amount of computation, and incurs large communication overheads. To solve this problem, we present an efficient method that securely computes PCA without collecting all the data. The proposed method shares only limited information among individual nodes yet obtains the same result as the original PCA. In addition, we present a dimensionality reduction technique for the proposed method and use it to improve the performance of secure similar-document detection. Finally, through various experiments, we show that the proposed method works effectively and efficiently on large amounts of multi-dimensional data.
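One standard building block for aggregating statistics across nodes without disclosing individual values is an additive secret-sharing secure sum. The sketch below shows only that building block, not the paper's actual protocol, which is not detailed in the abstract; the modulus and party count are arbitrary choices.

```python
import random

def share(value, n_parties, modulus=2**31 - 1):
    """Split an integer into n additive shares that sum to value mod
    modulus; no single share reveals anything about the value."""
    shares = [random.randrange(modulus) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % modulus)
    return shares

# Three nodes each hold a private value and want the global sum
# (e.g., an entry of a covariance matrix for PCA) without disclosure.
# Each node splits its value into shares, distributes one share per
# node, and each node publishes only the sum of shares it received.
modulus = 2**31 - 1
private = [42, 17, 99]
all_shares = [share(v, 3, modulus) for v in private]
partial = [sum(all_shares[src][dst] for src in range(3)) % modulus
           for dst in range(3)]
print(sum(partial) % modulus)  # 158
```

Summing the published partials reconstructs exactly the total, which is how limited shared information can still yield the same result as a centralized computation.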


database systems for advanced applications | 2018

Secure Computation of Pearson Correlation Coefficients for High-Quality Data Analytics.

Sun-Kyong Hong; Myeong-Seon Gil; Yang-Sae Moon

In this paper, we present a secure method of computing Pearson correlation coefficients that preserves data privacy as well as data quality in a distributed computing environment. In general data analysis and mining processes, individual data owners need to provide their original data to third parties. In many cases, however, the original data contain sensitive information, and the data owners do not want to disclose their data in the original form for the purpose of privacy preservation. In this paper, we address the problem of secure multiparty computation of Pearson correlation coefficients. For the secure Pearson correlation computation, we first propose an advanced solution by exploiting the secure scalar product. We then present an approximate solution by adopting a lower-dimensional transformation. We finally show empirically that the proposed solutions are practical in terms of execution time and data quality.
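Why a secure scalar product suffices becomes clear once Pearson's r is written purely in terms of scalar products and sums. The sketch below shows that algebraic decomposition only; the actual secure scalar-product protocol from the paper is not reproduced, and the inputs here are plain (unprotected) vectors.

```python
import math

def pearson_from_scalar_products(xx, yy, xy, sx, sy, n):
    """Pearson's r from scalar products and sums alone:
    xx = x·x, yy = y·y, xy = x·y, sx = Σx, sy = Σy.
    If each product is computed with a secure scalar-product
    protocol, r follows without exchanging the raw vectors."""
    num = n * xy - sx * sy
    den = math.sqrt(n * xx - sx * sx) * math.sqrt(n * yy - sy * sy)
    return num / den

# Perfectly correlated toy vectors held by two (hypothetical) parties.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
r = pearson_from_scalar_products(
    sum(a * a for a in x), sum(b * b for b in y),
    sum(a * b for a, b in zip(x, y)), sum(x), sum(y), len(x))
print(round(r, 6))  # 1.0
```

Each quantity the formula needs is either local to one party (x·x, Σx) or a single cross-party scalar product (x·y), which is exactly the shape a secure scalar-product primitive handles.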


International Journal of Distributed Sensor Networks | 2018

Variable size sampling to support high uniformity confidence in sensor data streams

Hajin Kim; Myeong-Seon Gil; Yang-Sae Moon; Mi-Jung Choi

In order to rapidly process large amounts of sensor stream data, it is effective to extract and use samples that reflect the characteristics and patterns of the data stream well. In this article, we focus on improving the uniformity confidence of KSample, which has the characteristics of random sampling in the stream environment. For this, we first analyze the uniformity confidence of KSample and derive two uniformity confidence degradation problems: (1) initial degradation, which rapidly decreases the uniformity confidence in the initial stage, and (2) continuous degradation, which gradually decreases the uniformity confidence in the later stages. We note that the initial degradation is caused by the sample range limitation and the past sample invariance, and the continuous degradation by the sampling range increase. For each problem, we present a corresponding solution: the sample range extension for the sample range limitation, the past sample change for the past sample invariance, and the use of a UC-window for the sampling range increase. Reflecting these solutions, we then propose a novel sampling method, named UC-KSample, which largely improves the uniformity confidence. Experimental results show that UC-KSample improves the uniformity confidence over KSample by 2.2 times on average and always keeps the uniformity confidence higher than the user-specified threshold. We also note that the sampling accuracy of UC-KSample is higher than that of KSample on both numeric sensor data and text data. Uniformity confidence is an important sampling metric in sensor data streams, and this is the first attempt to apply it to KSample. We believe that the proposed UC-KSample is an excellent approach that retains KSample's advantage of dynamic sampling over a fixed sampling ratio while improving the uniformity confidence.
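For context on the uniformity property being measured, classic reservoir sampling is the textbook stream sampler in which every item seen so far has equal probability k/n of being in the sample. KSample and UC-KSample are different, dynamic-ratio schemes not shown here; this sketch only illustrates the uniformity baseline that the uniformity-confidence metric evaluates against.

```python
import random

def reservoir_sample(stream, k, rng=random.Random(7)):
    """Classic reservoir sampling: after n items, each item is in the
    sample with probability k/n. Shown only as the uniform baseline;
    KSample/UC-KSample use a dynamic sampling ratio instead."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randrange(i + 1)
            if j < k:
                sample[j] = item
    return sample

s = reservoir_sample(range(1000), 10)
print(len(s), all(0 <= v < 1000 for v in s))  # 10 True
```

The trade-off the paper addresses is that a fixed-size reservoir gives perfect uniformity but a shrinking sampling ratio, whereas dynamic-ratio samplers like KSample keep the ratio but can lose uniformity, which UC-KSample restores.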


Archive | 2016

Hive-Based Anomaly Detection in Hadoop Log Data Management

Siwoon Son; Myeong-Seon Gil; Seokwoo Yang; Yang-Sae Moon

In this paper, we address how to manage and analyze a large volume of log data, which is difficult to handle in traditional computing environments. To handle the large volume of Hadoop log data rapidly generated across multiple servers, we present a new data storage architecture that efficiently analyzes big log data through Apache Hive. We then design and implement a simple but efficient anomaly detection method, based on moving-average and 3-sigma techniques, that identifies abnormal server status from log data. We also show the effectiveness of the proposed detection method by demonstrating that it properly detects anomalies in Hadoop log data.


international conference on big data and smart computing | 2015

Partial denoising boundary image matching using time-series matching techniques

Bum-Soo Kim; Myeong-Seon Gil; Mi-Jung Choi; Yang-Sae Moon

Removing noise, called denoising, is essential for achieving intuitive and accurate results in boundary image matching. This paper deals with a partial denoising problem that allows a limited amount of noise embedded in boundary images. To solve this problem, we first define partial denoising time-series, which can be generated from an original image time-series by removing a variety of partial noises. We then propose an efficient mechanism that quickly obtains those partial denoising time-series in the time-series domain rather than the image domain. Next, we present the partial denoising distance, which is the minimum distance from a query time-series to all possible partial denoising time-series generated from a data time-series, and we use this distance as a similarity measure in boundary image matching. Using the partial denoising distance, however, incurs a severe computational overhead since there are a large number of partial denoising time-series to consider. To solve this problem, we derive a tight lower bound for the partial denoising distance and formally prove its correctness. We also propose partial denoising boundary image matching, which exploits the partial denoising distance. Through extensive experiments, we finally show that our lower bound-based approach improves search performance by up to an order of magnitude in partial denoising-based boundary image matching.
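The lower-bound speed-up follows a generic pruning pattern: evaluate a cheap bound first, and compute the expensive exact distance only when the bound could beat the best match so far. In the sketch below, `exact_dist` and `lower_bound` are hypothetical stand-ins for the partial denoising distance and the paper's bound, which are not reproduced here; correctness only requires that the bound never exceed the exact distance.

```python
def best_match(query, candidates, exact_dist, lower_bound):
    """Lower-bound pruning: skip candidates whose cheap bound already
    exceeds the best exact distance found so far. Requires
    lower_bound(q, c) <= exact_dist(q, c) for every candidate."""
    best, best_d = None, float("inf")
    for c in candidates:
        if lower_bound(query, c) >= best_d:
            continue  # cannot beat the current best; prune
        d = exact_dist(query, c)
        if d < best_d:
            best, best_d = c, d
    return best, best_d

# Toy instance: Euclidean distance with a valid first-element bound.
euclid = lambda q, c: sum((a - b) ** 2 for a, b in zip(q, c)) ** 0.5
first = lambda q, c: abs(q[0] - c[0])  # always <= euclid(q, c)
q = [0.0, 0.0]
cands = [[5.0, 5.0], [0.1, 0.2], [9.0, 1.0]]
m, d = best_match(q, cands, euclid, first)
print(m)  # [0.1, 0.2]
```

With a tight bound, most of the many partial denoising time-series are pruned before the expensive minimum-distance computation, which is where the order-of-magnitude gain comes from.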


database systems for advanced applications | 2015

Performance Analysis of Hadoop-Based SQL and NoSQL for Processing Log Data

Siwoon Son; Myeong-Seon Gil; Yang-Sae Moon; Hee-Sun Won

Recently, many companies and research organizations have been seeking scalable solutions using Hadoop ecosystems. Log data management, with its large-scale and real-time properties, is one of the appropriate applications on top of Hadoop. In this paper, we focus on the SQL and NoSQL choices for building a Hadoop-based log data management system. For this purpose, we first select major products supporting SQL and NoSQL, and we then present an appropriate schema for each product by considering its own characteristics. All the schemas support real-time monitoring and analysis of log data. For each product, we implement insertion and selection operations of log data in Hadoop, and we analyze the performance of these operations. The analysis results show that MariaDB and MongoDB are fast for insertion, and PostgreSQL and HBase are fast for selection. We believe that our evaluation results will be very helpful for users choosing Hadoop SQL and NoSQL products for handling large-scale, real-time log data.

Collaboration


Dive into Myeong-Seon Gil's collaborations.

Top Co-Authors

Yang-Sae Moon (Kangwon National University)
Siwoon Son (Kangwon National University)
Hee-Sun Won (Electronics and Telecommunications Research Institute)
Mi-Jung Choi (Kangwon National University)
Minh Chau Nguyen (Electronics and Telecommunications Research Institute)
Sanghun Lee (Kangwon National University)
Jinho Kim (Kangwon National University)
Hajin Kim (Kangwon National University)