Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Myungho Lee is active.

Publication


Featured research published by Myungho Lee.


International Conference on Computational Science and Its Applications | 2006

An intelligent garbage collection algorithm for flash memory storages

Longzhe Han; Yeonseung Ryu; Tae-sun Chung; Myungho Lee; Sukwon Hong

Flash memory cannot be overwritten unless it is erased in advance. To avoid an erase operation on every update, non-in-place-update schemes have been used. Since updates are not performed in place, obsolete data must later be reclaimed by garbage collection. In this paper, we study a new garbage collection algorithm that reduces cleaning costs such as the number of erase operations and the number of data copies. The proposed scheme automatically predicts the future I/O workload and intelligently selects victim blocks according to the prediction. Experimental results show that the proposed scheme performs especially well when the degree of locality is high.
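
The victim-selection idea in this abstract can be illustrated with a small host-side sketch. The block fields, the hotness predictor, and the cost weights below are hypothetical stand-ins for the paper's actual policy, shown only to make the workload-aware cost comparison concrete.

```cuda
// Hedged sketch of workload-aware victim selection for flash garbage collection.
// Field names and the weighting of copy cost vs. erase wear are illustrative.
#include <cstdio>
#include <limits>
#include <vector>

struct FlashBlock {
    int valid_pages;    // pages that must be copied before the block is erased
    int erase_count;    // accumulated wear on this block
    double hotness;     // predicted probability of near-future rewrites
};

// Pick the block whose reclamation is cheapest under the predicted workload:
// cold blocks with few valid pages are preferred, since their copied data is
// unlikely to be invalidated again soon.
int select_victim(const std::vector<FlashBlock>& blocks) {
    int victim = -1;
    double best = std::numeric_limits<double>::max();
    for (size_t i = 0; i < blocks.size(); ++i) {
        const FlashBlock& b = blocks[i];
        double cost = b.valid_pages * (1.0 + b.hotness)  // copy cost, penalized for hot data
                    + 0.1 * b.erase_count;               // mild wear-leveling term
        if (cost < best) { best = cost; victim = static_cast<int>(i); }
    }
    return victim;
}

int main() {
    std::vector<FlashBlock> blocks = {{12, 3, 0.9}, {4, 10, 0.1}, {4, 2, 0.8}};
    std::printf("victim block: %d\n", select_victim(blocks));  // expect block 1
    return 0;
}
```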


Parallel Computing | 2006

Performance impact of resource conflicts on chip multi-processor servers

Myungho Lee; Yeonseung Ryu; Sugwon Hong; Chungki Lee

Chip Multi-Processors (CMPs) are becoming mainstream microprocessors for both High Performance Computing and commercial business applications. Multiple CPU cores on a CMP allow multiple software threads to execute on the same chip at the same time, promising a higher volume of computation per chip in a given time interval. However, resource sharing among the threads executing on the same chip can cause conflicts and lead to performance degradation. In order to obtain high performance and scalability on CMP servers, it is therefore crucial to first understand the performance impact that resource conflicts have on the target applications. In this paper, we evaluate the performance impact of resource conflicts on an example high-end CMP server, the Sun Fire E25K, using a standard OpenMP benchmark suite, SPEC OMPL.
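
The kind of resource conflict the paper measures can be reproduced with a far simpler microbenchmark than SPEC OMPL. The sketch below is not from the paper: it runs a memory-bound OpenMP loop at increasing thread counts, and on a CMP the scaling typically flattens once co-scheduled threads saturate the shared caches and memory ports.

```cuda
// Illustrative microbenchmark (not SPEC OMPL): a memory-bound OpenMP loop whose
// scaling flattens when threads sharing a chip contend for caches and memory ports.
// Build e.g. with: nvcc -Xcompiler -fopenmp conflict.cu   (or g++ -O2 -fopenmp)
#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
    const long long n = 1 << 24;                  // large enough to spill the shared caches
    std::vector<double> a(n), b(n, 1.0), c(n, 2.0);

    for (int threads = 1; threads <= omp_get_max_threads(); threads *= 2) {
        double t0 = omp_get_wtime();
        #pragma omp parallel for num_threads(threads) schedule(static)
        for (long long i = 0; i < n; ++i)
            a[i] = b[i] + 3.0 * c[i];             // streams three arrays through memory
        double t1 = omp_get_wtime();
        std::printf("%2d threads: %.3f s\n", threads, t1 - t0);
    }
    return 0;
}
```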


IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2013

High Throughput Parallel Implementation of Aho-Corasick Algorithm on a GPU

Nhat-Phuong Tran; Myungho Lee; Sugwon Hong; Jaeyoung Choi

Pattern matching is an important operation in applications such as computer and network security, bioinformatics, and image processing, among many others. The Aho-Corasick (AC) algorithm is a multiple-pattern matching algorithm commonly used in such applications. In order to meet their highly demanding performance requirements, achieving high performance for the AC algorithm is crucial. In this paper, we present a high performance parallel implementation of the AC algorithm on a Graphics Processing Unit (GPU) which efficiently utilizes the high degree of on-chip parallelism and the memory hierarchy of the GPU so that the aggregate performance (throughput) of the GPU is maximized. Our approach carefully places and caches the input text data and the reference pattern data used for pattern matching in the on-chip shared memories and the texture caches of the GPU. Furthermore, it schedules the off-chip global memory loads and the shared memory stores so as to minimize the overhead of loading the input data into the shared memories and to minimize shared memory bank conflicts. The approach significantly cuts down the effective memory access latencies and leads to impressive performance improvements. Experimental results on an Nvidia GeForce GTX 285 GPU show that our approach delivers up to 127 Gbps of throughput and up to a 222-times speedup compared with a serial version running on a 2.2 GHz Intel Core2 Duo processor.
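
A simplified sketch of the kernel structure this abstract describes is shown below. It assumes a GPU of compute capability 3.5 or higher and substitutes __ldg() (the read-only data cache) for the paper's texture-cache placement; the pattern set, tile sizes, and the dense-table automaton construction are illustrative choices, not the paper's implementation.

```cuda
// Hedged sketch: AC matching with a shared-memory text tile, a DFA read through
// __ldg(), and per-thread segments with a backward overlap (requires cc >= 3.5).
#include <cstdio>
#include <queue>
#include <string>
#include <vector>
#include <cuda_runtime.h>

#define ALPHA 256
#define SEG   64              // bytes scanned per thread
#define TPB   128             // threads per block
#define TILE  (SEG * TPB)     // bytes owned by one block
#define MAXOVL 255            // overlap cap: longest pattern length - 1

// Host side: build a dense AC automaton (goto and failure transitions collapsed).
struct AcDfa { std::vector<int> next; std::vector<int> out; int states = 1; };

AcDfa build_dfa(const std::vector<std::string>& pats) {
    int cap = 1; for (const auto& p : pats) cap += (int)p.size();
    AcDfa d; d.next.assign((size_t)cap * ALPHA, 0); d.out.assign(cap, 0);
    for (const auto& p : pats) {                                  // trie insertion
        int s = 0;
        for (unsigned char c : p) {
            if (!d.next[(size_t)s * ALPHA + c]) d.next[(size_t)s * ALPHA + c] = d.states++;
            s = d.next[(size_t)s * ALPHA + c];
        }
        d.out[s]++;
    }
    std::vector<int> fail(d.states, 0);
    std::queue<int> q;
    for (int c = 0; c < ALPHA; ++c) if (d.next[c]) q.push(d.next[c]);
    while (!q.empty()) {                                          // BFS over the trie
        int s = q.front(); q.pop();
        d.out[s] += d.out[fail[s]];
        for (int c = 0; c < ALPHA; ++c) {
            int& t = d.next[(size_t)s * ALPHA + c];
            if (t) { fail[t] = d.next[(size_t)fail[s] * ALPHA + c]; q.push(t); }
            else   { t = d.next[(size_t)fail[s] * ALPHA + c]; }
        }
    }
    return d;
}

// Device side: one block per TILE bytes; thread t owns bytes [t*SEG, (t+1)*SEG) of the tile.
__global__ void ac_count(const unsigned char* text, long n, const int* dfa,
                         const int* out, int ovl, unsigned long long* hits) {
    __shared__ unsigned char tile[TILE + MAXOVL];
    long gbase = (long)blockIdx.x * TILE - ovl;        // tile starts `ovl` bytes early
    for (int i = threadIdx.x; i < TILE + ovl; i += blockDim.x) {
        long g = gbase + i;
        tile[i] = (g >= 0 && g < n) ? text[g] : 0;     // coalesced staging into shared memory
    }
    __syncthreads();

    int seg = ovl + threadIdx.x * SEG;                 // own segment start within the tile
    if (gbase + seg >= n) return;
    int i0 = seg - ovl;                                // back up to catch straddling matches
    if (gbase + i0 < 0) i0 = ovl;                      // block 0: nothing precedes byte 0
    int state = 0;
    unsigned long long local = 0;
    for (int i = i0; i < seg + SEG && gbase + i < n; ++i) {
        state = __ldg(&dfa[(size_t)state * ALPHA + tile[i]]);
        if (i >= seg) local += __ldg(&out[state]);     // count matches ending in own segment
    }
    if (local) atomicAdd(hits, local);
}

int main() {
    std::vector<std::string> pats = {"attack", "shell", "worm"};  // toy pattern set
    AcDfa dfa = build_dfa(pats);
    std::string text(4 << 20, 'x');
    text.replace(100, 6, "attack");
    text.replace(TILE * 3 - 2, 4, "worm");             // straddles a block boundary
    int ovl = 5;                                       // longest pattern length - 1

    unsigned char* d_text; int *d_next, *d_out; unsigned long long* d_hits;
    cudaMalloc(&d_text, text.size());
    cudaMalloc(&d_next, dfa.next.size() * sizeof(int));
    cudaMalloc(&d_out, dfa.out.size() * sizeof(int));
    cudaMalloc(&d_hits, sizeof(unsigned long long));
    cudaMemcpy(d_text, text.data(), text.size(), cudaMemcpyHostToDevice);
    cudaMemcpy(d_next, dfa.next.data(), dfa.next.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_out, dfa.out.data(), dfa.out.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_hits, 0, sizeof(unsigned long long));

    int blocks = (int)((text.size() + TILE - 1) / TILE);
    ac_count<<<blocks, TPB>>>(d_text, (long)text.size(), d_next, d_out, ovl, d_hits);
    unsigned long long hits = 0;
    cudaMemcpy(&hits, d_hits, sizeof(hits), cudaMemcpyDeviceToHost);
    std::printf("matches: %llu\n", hits);              // expect 2
    cudaFree(d_text); cudaFree(d_next); cudaFree(d_out); cudaFree(d_hits);
    return 0;
}
```

Each thread begins its scan a few bytes before its own segment so that a pattern occurrence crossing a thread or block boundary is still seen, and it only credits matches ending inside its own segment, so nothing is counted twice.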


IEEE International Conference on High Performance Computing, Data, and Analytics | 2012

Memory Efficient Parallelization for Aho-Corasick Algorithm on a GPU

Nhat-Phuong Tran; Myungho Lee; Sugwon Hong; Minho Shin

Pattern matching is a commonly used operation in many applications, including image processing, computer and network security, and bioinformatics, among many others. The Aho-Corasick (AC) algorithm is one of the well-known pattern matching techniques and is used intensively in computer and network security. In order to meet the real-time performance requirements imposed on these security applications, a high-speed parallelization technique for the AC algorithm is essential. In this paper, we present a new memory-efficient parallelization technique which efficiently places and caches the input text data and the reference data in the on-chip shared memories and texture caches of the Graphics Processing Unit (GPU). Furthermore, the new approach schedules memory accesses so as to minimize the overhead of loading data into the on-chip shared memories. The approach cuts down the effective memory access latencies and leads to significant performance improvements. Experimental results on an Nvidia GeForce 9500GT GPU show up to a 15-times speedup compared with a serial version on a 2.2 GHz Intel Core2 Duo processor, and 15 Gbps of throughput.
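
One of the placements this abstract mentions, keeping the reference data in the texture cache, can be sketched with the modern texture-object API as below. The GPUs used in the paper predate this API, and the two-state table bound here is a placeholder rather than a real pattern automaton; it only shows how a transition table might be routed through the texture path.

```cuda
// Hedged sketch: binding a (placeholder) state-transition table to the texture
// path with the texture-object API and reading it with tex1Dfetch().
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

__global__ void scan(cudaTextureObject_t table, const unsigned char* text, int n, int* end_state) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {     // single thread: only the table access matters here
        int s = 0;
        for (int i = 0; i < n; ++i)
            s = tex1Dfetch<int>(table, s * 256 + text[i]);   // cached transition lookup
        *end_state = s;
    }
}

int main() {
    // Placeholder two-state automaton: state 1 is entered on byte 'x' and is absorbing.
    int h_table[2 * 256] = {0};
    for (int c = 0; c < 256; ++c) { h_table[c] = (c == 'x') ? 1 : 0; h_table[256 + c] = 1; }

    int* d_table;
    cudaMalloc(&d_table, sizeof(h_table));
    cudaMemcpy(d_table, h_table, sizeof(h_table), cudaMemcpyHostToDevice);

    cudaResourceDesc res{};                        // describe the linear table buffer
    res.resType = cudaResourceTypeLinear;
    res.res.linear.devPtr = d_table;
    res.res.linear.desc = cudaCreateChannelDesc<int>();
    res.res.linear.sizeInBytes = sizeof(h_table);
    cudaTextureDesc td{};                          // element-type reads, no filtering
    td.readMode = cudaReadModeElementType;
    cudaTextureObject_t tex;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);

    const char* msg = "payload ... x ... payload";
    unsigned char* d_text; int* d_state;
    cudaMalloc(&d_text, std::strlen(msg));
    cudaMalloc(&d_state, sizeof(int));
    cudaMemcpy(d_text, msg, std::strlen(msg), cudaMemcpyHostToDevice);

    scan<<<1, 1>>>(tex, d_text, (int)std::strlen(msg), d_state);
    int s = 0;
    cudaMemcpy(&s, d_state, sizeof(int), cudaMemcpyDeviceToHost);
    std::printf("end state: %d (1 means the marker byte was seen)\n", s);

    cudaDestroyTextureObject(tex);
    cudaFree(d_table); cudaFree(d_text); cudaFree(d_state);
    return 0;
}
```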


Cluster Computing | 2014

Towards an integrated management system based on abstraction of heterogeneous virtual resources

Yongseong Cho; Jongsun Choi; Jaeyoung Choi; Myungho Lee

Virtualization technology reduces the costs of server installation, operation, and maintenance, and it can simplify the development of distributed systems. Currently, there are various virtualization technologies such as Xen, KVM, and VMware, each of which supports its own virtualization functions on heterogeneous platforms. It is therefore important to be able to integrate and manage these heterogeneous virtualized resources in order to develop distributed systems on top of current virtualization techniques. This paper presents an integrated management system that provides information on the usage of heterogeneous virtual resources and also controls them. The main focus of the system is to abstract various virtual resources and to reconfigure them flexibly. To this end, an integrated management system has been designed and implemented based on the libvirt virtualization API and the Data Distribution Service (DDS).
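
The libvirt layer the paper builds on already hides hypervisor differences behind one C API. The sketch below is not the paper's system; it simply opens a connection and lists the domains with their resource usage, and the connection URI and printed fields are illustrative.

```cuda
// Hedged sketch: querying heterogeneous hypervisors through the libvirt C API.
// Build e.g. with: g++ monitor.cpp -lvirt
#include <cstdio>
#include <cstdlib>
#include <libvirt/libvirt.h>

int main() {
    // "qemu:///system" targets local KVM/QEMU; a Xen host would use "xen:///system".
    virConnectPtr conn = virConnectOpen("qemu:///system");
    if (!conn) { std::fprintf(stderr, "failed to connect to the hypervisor\n"); return 1; }

    virDomainPtr* domains = nullptr;
    int n = virConnectListAllDomains(conn, &domains, 0);   // active and inactive domains
    for (int i = 0; i < n; ++i) {
        virDomainInfo info;
        if (virDomainGetInfo(domains[i], &info) == 0)
            std::printf("%-24s vcpus=%u mem=%lu KiB state=%d\n",
                        virDomainGetName(domains[i]),
                        (unsigned)info.nrVirtCpu, info.memory, (int)info.state);
        virDomainFree(domains[i]);
    }
    free(domains);                                          // the array itself is freed with free()
    virConnectClose(conn);
    return 0;
}
```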


International Conference on ICT for Smart Society | 2013

High performance string matching for security applications

Nhat-Phuong Tran; Myungho Lee

The Aho-Corasick (AC) algorithm is a commonly used string matching algorithm. It performs multiple-pattern matching for computer and network security applications. These applications impose high computational requirements, so efficient parallelization of the AC algorithm is crucial. In this paper, we present a multi-stream parallelization approach for string matching using the AC algorithm on the latest Nvidia Kepler architecture. Our approach utilizes the Hyper-Q feature of the Kepler GPU so that multiple streams generated by a number of OpenMP threads running on the host multicore processor can be executed efficiently on the large number of fine-grain processing cores of the GPU. Experimental results show that our approach delivers up to 585 Gbps of throughput on an Nvidia Tesla K20 GPU.
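
The multi-stream structure described here follows a standard Hyper-Q usage pattern: one CUDA stream per OpenMP host thread, each issuing its own copies and kernel launches. The sketch below uses a trivial placeholder kernel instead of the AC matcher, and pageable host buffers; pinned memory (cudaMallocHost) would be needed for real copy/compute overlap.

```cuda
// Hedged sketch: one CUDA stream per OpenMP host thread; with Hyper-Q these
// streams map to independent hardware work queues and can run concurrently.
#include <omp.h>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void process_chunk(const char* buf, int n, int* hits) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && buf[i] == 'a') atomicAdd(hits, 1);   // placeholder "matching" work
}

int main() {
    const int chunks = 8, len = 1 << 20;
    std::vector<char> host((size_t)chunks * len, 'a');

    char* d_buf; int* d_hits;
    cudaMalloc(&d_buf, (size_t)chunks * len);
    cudaMalloc(&d_hits, sizeof(int));
    cudaMemset(d_hits, 0, sizeof(int));

    // Each host thread pushes its own copy + kernel launch into its own stream.
    #pragma omp parallel num_threads(chunks)
    {
        int t = omp_get_thread_num();
        cudaStream_t s;
        cudaStreamCreate(&s);
        cudaMemcpyAsync(d_buf + (size_t)t * len, host.data() + (size_t)t * len,
                        len, cudaMemcpyHostToDevice, s);
        process_chunk<<<(len + 255) / 256, 256, 0, s>>>(d_buf + (size_t)t * len, len, d_hits);
        cudaStreamSynchronize(s);
        cudaStreamDestroy(s);
    }

    int hits = 0;
    cudaMemcpy(&hits, d_hits, sizeof(int), cudaMemcpyDeviceToHost);
    std::printf("total matches: %d\n", hits);
    cudaFree(d_buf); cudaFree(d_hits);
    return 0;
}
```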


International Conference on Computational Science and Its Applications | 2005

A space-efficient flash memory software for mobile devices

Yeonseung Ryu; Tae-Sun Chung; Myungho Lee

Flash memory is becoming a popular storage medium for mobile computing devices. In this paper, we study a new block management scheme in the Flash Translation Layer (FTL) for flash memory storage which takes space utilization into account. The proposed scheme classifies data blocks according to their write access frequencies and improves space utilization by managing the blocks according to their degree of hotness. To evaluate the proposed scheme, we developed a simulator and performed trace-driven simulations.
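
A minimal sketch of hotness-based block management is given below. The write-count classifier, the threshold, and the two active blocks are hypothetical simplifications used only to illustrate separating hot and cold data; they are not the paper's FTL.

```cuda
// Hedged sketch: classify logical pages by recent write frequency and steer
// hot and cold writes to different physical blocks.
#include <cstdio>
#include <unordered_map>

struct HotColdFtl {
    std::unordered_map<int, int> write_count;   // logical page -> recent write count
    int hot_threshold = 4;                      // hypothetical cut-off
    int hot_block = 0, cold_block = 1;          // currently active physical blocks

    // Return the physical block that should receive this write.
    int place_write(int logical_page) {
        int c = ++write_count[logical_page];
        return (c >= hot_threshold) ? hot_block : cold_block;
    }
};

int main() {
    HotColdFtl ftl;
    for (int i = 0; i < 6; ++i) ftl.place_write(7);       // page 7 is rewritten often -> hot
    std::printf("page 7 -> block %d, page 42 -> block %d\n",
                ftl.place_write(7), ftl.place_write(42)); // expect 0 (hot) and 1 (cold)
    return 0;
}
```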


Scientific Programming | 2017

Performance Optimization of 3D Lattice Boltzmann Flow Solver on a GPU

Nhat-Phuong Tran; Myungho Lee; Sugwon Hong

The Lattice Boltzmann Method (LBM) is a powerful numerical method for simulating fluid flow. With its data-parallel nature, it is a promising candidate for a parallel implementation on a GPU. The LBM, however, is heavily data intensive and memory bound. In particular, moving the data to the adjacent cells in the streaming phase incurs a lot of uncoalesced accesses on the GPU, which hurts overall performance. Furthermore, the main computation kernels of the LBM use a large number of registers per thread, which limits the thread parallelism available at run time because the number of registers on the GPU is fixed. In this paper, we develop a high performance parallelization of the LBM on a GPU by minimizing the overheads associated with uncoalesced memory accesses while improving cache locality through a tiling optimization combined with a data layout change. Furthermore, we aggressively reduce the register usage of the LBM kernels in order to increase run-time thread parallelism. Experimental results on the Nvidia Tesla K20 GPU show that our approach delivers an impressive throughput of 1210.63 Million Lattice Updates Per Second (MLUPS).
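
The data layout change underlying the coalescing optimization can be illustrated as follows. The sketch assumes a D3Q19 lattice stored in structure-of-arrays (SoA) order and uses a toy kernel that touches one direction; the grid dimensions and the kernel body are illustrative, not the flow solver itself.

```cuda
// Hedged sketch: SoA layout for D3Q19 distribution functions so that, for a
// fixed direction q, consecutive lattice cells are contiguous and a warp's
// accesses are coalesced.
#include <cstdio>
#include <cuda_runtime.h>

#define Q 19                       // D3Q19 velocity set
#define NX 64
#define NY 64
#define NZ 64
#define NCELLS (NX * NY * NZ)

// AoS:  f[cell * Q + q]      -> stride-Q accesses, uncoalesced
// SoA:  f[q * NCELLS + cell] -> stride-1 accesses per direction, coalesced
__host__ __device__ inline long soa_idx(int q, int cell) {
    return (long)q * NCELLS + cell;
}

__global__ void scale_direction(float* f, int q, float w) {
    int cell = blockIdx.x * blockDim.x + threadIdx.x;
    if (cell < NCELLS)
        f[soa_idx(q, cell)] *= w;   // a warp reads/writes a contiguous run of floats
}

int main() {
    float* d_f;
    cudaMalloc(&d_f, sizeof(float) * Q * NCELLS);
    cudaMemset(d_f, 0, sizeof(float) * Q * NCELLS);
    scale_direction<<<(NCELLS + 255) / 256, 256>>>(d_f, 5, 0.5f);
    cudaDeviceSynchronize();
    std::printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(d_f);
    return 0;
}
```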


IEEE International Conference on High Performance Computing, Data, and Analytics | 2015

Memory-Efficient Parallelization of 3D Lattice Boltzmann Flow Solver on a GPU

Nhat-Phuong Tran; Myungho Lee; Dong Hoon Choi

The Lattice Boltzmann Method (LBM) is a powerful numerical method for simulating fluid flow. With its data-parallel nature and simple kernel structure, it is a promising candidate for a parallel implementation on a GPU. The LBM, however, is heavily data intensive and memory bound. In particular, moving the data to the adjacent cells in the streaming phase of the LBM incurs a lot of uncoalesced accesses on the GPU, which hurts overall performance. In this paper, we parallelize the LBM on a GPU by incorporating memory-efficient techniques such as a tiling optimization with data layout changes and a data update scheme known as the pull scheme. Furthermore, we develop optimizations such as removing branch divergence, reducing register usage, and reducing the number of double-precision floating-point instructions. Experimental results on an Nvidia Tesla K20 GPU show that our approach delivers up to 1105 MLUPS (Million Lattice Updates Per Second) and a 156-times speedup compared with a serial implementation.
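
A minimal sketch of the pull-style streaming step is shown below for a single lattice direction with periodic wrap. The grid sizes and the kernel are illustrative; the collision step and the remaining directions of a D3Q19 solver are omitted.

```cuda
// Hedged sketch of "pull" streaming: each thread gathers the value propagating
// toward its own cell from the upstream neighbor, so the write stays perfectly
// coalesced and the slightly shifted read is absorbed by the cache.
#include <cstdio>
#include <cuda_runtime.h>

#define NX 64
#define NY 64
#define NZ 64
#define NCELLS (NX * NY * NZ)

__global__ void stream_pull_xplus(const float* f_src, float* f_dst) {
    int cell = blockIdx.x * blockDim.x + threadIdx.x;
    if (cell >= NCELLS) return;
    int x = cell % NX, yz = cell / NX;
    int xm = (x + NX - 1) % NX;                 // upstream neighbor in the +x direction
    f_dst[cell] = f_src[yz * NX + xm];          // gather: write coalesced, read shifted by one
}

int main() {
    float *d_src, *d_dst;
    cudaMalloc(&d_src, sizeof(float) * NCELLS);
    cudaMalloc(&d_dst, sizeof(float) * NCELLS);
    cudaMemset(d_src, 0, sizeof(float) * NCELLS);
    stream_pull_xplus<<<(NCELLS + 255) / 256, 256>>>(d_src, d_dst);
    cudaDeviceSynchronize();
    std::printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(d_src); cudaFree(d_dst);
    return 0;
}
```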


Cluster Computing | 2015

High performance parallelization of Boyer–Moore algorithm on many-core accelerators

Yosang Jeong; Myungho Lee; Dukyun Nam; Jik-Soo Kim; Soonwook Hwang

The Boyer–Moore (BM) algorithm is a single-pattern string matching algorithm. It is considered one of the most efficient string matching algorithms and is used in many applications. In a preprocessing phase, the algorithm first calculates two shift rules from the given pattern string. Using these shift rules, pattern matching is then performed against the target input string in a second phase; the shift rules let the algorithm skip parts of the input where no matches can occur. The second phase is time consuming and needs to be parallelized in order to achieve high performance string matching. In this paper, we parallelize the BM algorithm on the latest many-core accelerators, the Intel Xeon Phi and the Nvidia Tesla K20 GPU, as well as on general-purpose multi-core microprocessors. For the parallel string matching, the target input is partitioned among multiple threads. Data lying on the boundaries between threads is searched redundantly so that a pattern occurrence straddling the boundary between two neighboring threads is not missed. These redundant search overheads increase significantly as the number of threads grows. For a fixed input length, the number of possible matches decreases as the pattern length increases, and the positions of the pattern occurrences are spread randomly over the input, which leads to unbalanced workload distribution among threads. We employ dynamic scheduling and multithreading techniques to deal with this load balancing issue, and we use an algorithmic cascading technique to maximize the benefit of multithreading and to reduce the overheads of the redundant search between neighboring threads. Our parallel implementation leads to
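
The boundary-overlap partitioning described in this abstract can be sketched on a multicore host as below, using OpenMP and C++17's std::boyer_moore_searcher in place of the authors' hand-written BM code. Each thread scans its chunk extended by (pattern length - 1) bytes, and an occurrence is credited to the chunk in which it starts, so nothing is double counted.

```cuda
// Hedged sketch: parallel Boyer-Moore search with overlapping chunk boundaries,
// using the C++17 standard searcher rather than the paper's implementation.
#include <omp.h>
#include <algorithm>
#include <cstdio>
#include <functional>
#include <string>

int main() {
    std::string text(1 << 22, 'a');
    for (size_t i = 0; i < text.size(); i += 1000) text.replace(i, 3, "abc");
    const std::string pat = "abc";

    long long total = 0;
    const int nthreads = omp_get_max_threads();
    const size_t chunk = (text.size() + nthreads - 1) / nthreads;

    #pragma omp parallel for reduction(+ : total)
    for (int t = 0; t < nthreads; ++t) {
        size_t begin = (size_t)t * chunk;
        if (begin >= text.size()) continue;
        size_t end = std::min(text.size(), begin + chunk + pat.size() - 1);  // boundary overlap
        auto searcher = std::boyer_moore_searcher(pat.begin(), pat.end());
        auto it = text.begin() + begin;
        const auto last = text.begin() + end;
        while ((it = std::search(it, last, searcher)) != last) {
            size_t pos = (size_t)(it - text.begin());
            if (pos < begin + chunk) ++total;   // count only matches starting in own chunk
            ++it;
        }
    }
    std::printf("occurrences of \"%s\": %lld\n", pat.c_str(), total);
    return 0;
}
```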

Collaboration


Dive into Myungho Lee's collaborations.

Top Co-Authors

Dong Hoon Choi | Korea Institute of Science and Technology Information
Heeseung Jo | Chonbuk National University
Jik-Soo Kim | Korea Institute of Science and Technology Information
Soonwook Hwang | Korea Institute of Science and Technology Information
Dukyun Nam | Korea Institute of Science and Technology Information