Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Eun-Jin Im is active.

Publication


Featured research published by Eun-Jin Im.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2004

Sparsity: Optimization Framework for Sparse Matrix Kernels

Eun-Jin Im; Katherine A. Yelick; Richard W. Vuduc

Sparse matrix–vector multiplication is an important computational kernel that performs poorly on most modern processors due to a low compute-to-memory ratio and irregular memory access patterns. Optimization is difficult because of the complexity of cache-based memory systems and because performance is highly dependent on the non-zero structure of the matrix. The SPARSITY system is designed to address these problems by allowing users to automatically build sparse matrix kernels that are tuned to their matrices and machines. SPARSITY combines traditional techniques such as loop transformations with data structure transformations and optimization heuristics that are specific to sparse matrices. It provides a novel framework for selecting optimization parameters, such as block size, using a combination of performance models and search. In this paper we discuss the optimization of two operations: a sparse matrix times a dense vector and a sparse matrix times a set of dense vectors. Our experience indicates that register level optimizations are effective for matrices arising in certain scientific simulations, in particular finite-element problems. Cache level optimizations are important when the vector used in multiplication is larger than the cache size, especially for matrices in which the non-zero structure is random. For applications involving multiple vectors, reorganizing the computation to perform the entire set of multiplications as a single operation produces significant speedups. We describe the different optimizations and parameter selection techniques and evaluate them on several machines using over 40 matrices taken from a broad set of application domains. Our results demonstrate speedups of up to 4× for the single vector case and up to 10× for the multiple vector case.
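The two kernels the abstract optimizes can be written down compactly. Below is a minimal, unoptimized Python sketch (my own illustration, not SPARSITY code) of a sparse matrix times a dense vector in CSR format, and the multiple-vector variant in which each matrix element, loaded once, is reused across all k vectors, which is where the multiple-vector speedup comes from.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """y = A*x for A stored in compressed sparse row (CSR) format."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        acc = 0.0
        for j in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[j] * x[col_idx[j]]
        y[i] = acc
    return y

def spmm_csr(values, col_idx, row_ptr, X):
    """Y = A*X for k dense vectors: each matrix element, loaded once,
    is reused k times -- the source of the multiple-vector speedup."""
    n = len(row_ptr) - 1
    k = len(X[0])
    Y = [[0.0] * k for _ in range(n)]
    for i in range(n):
        for j in range(row_ptr[i], row_ptr[i + 1]):
            a, c = values[j], col_idx[j]
            for v in range(k):
                Y[i][v] += a * X[c][v]
    return Y
```

The inner loop of `spmm_csr` is exactly the "entire set of multiplications as a single operation" reorganization the abstract describes, stated serially.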


International Conference on Computational Science | 2001

Optimizing Sparse Matrix Computations for Register Reuse in SPARSITY

Eun-Jin Im; Katherine A. Yelick

Sparse matrix-vector multiplication is an important computational kernel that tends to perform poorly on modern processors, largely because of its high ratio of memory operations to arithmetic operations. Optimizing this algorithm is difficult, both because of the complexity of memory systems and because the performance is highly dependent on the nonzero structure of the matrix. The Sparsity system is designed to address these problems by allowing users to automatically build sparse matrix kernels that are tuned to their matrices and machines. The most difficult aspect of optimizing these algorithms is selecting among a large set of possible transformations and choosing parameters, such as block size. In this paper we discuss the optimization of two operations: a sparse matrix times a dense vector and a sparse matrix times a set of dense vectors. Our experience indicates that for matrices arising in scientific simulations, register-level optimizations are critical, and we focus here on the optimizations and parameter selection techniques used in Sparsity for register-level optimizations. We demonstrate speedups of up to 2× for the single vector case and 5× for the multiple vector case.
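Register-level blocking, the optimization this paper focuses on, can be illustrated with a fixed-size block kernel. The following is a hypothetical Python sketch of the BCSR idea, not Sparsity's generated code: the 2×2 block size is hard-coded so that a compiler can keep the accumulators and block entries in registers.

```python
def spmv_bcsr_2x2(blocks, bcol_idx, brow_ptr, x):
    """y = A*x where A is stored as dense 2x2 blocks (row-major per
    block): blocks[b] holds [a00, a01, a10, a11] for block b."""
    nb = len(brow_ptr) - 1              # number of block rows
    y = [0.0] * (2 * nb)
    for I in range(nb):
        y0 = y1 = 0.0                   # accumulators held in registers
        for b in range(brow_ptr[I], brow_ptr[I + 1]):
            a00, a01, a10, a11 = blocks[b]
            c = 2 * bcol_idx[b]
            x0, x1 = x[c], x[c + 1]     # one index load serves 4 multiplies
            y0 += a00 * x0 + a01 * x1
            y1 += a10 * x0 + a11 * x1
        y[2 * I], y[2 * I + 1] = y0, y1
    return y
```

Choosing the block size (here fixed at 2×2) is exactly the parameter-selection problem the paper addresses with models and search.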


IEEE International Conference on High Performance Computing, Data, and Analytics | 2011

Gyrokinetic toroidal simulations on leading multi- and manycore HPC systems

Kamesh Madduri; Khaled Z. Ibrahim; Samuel Williams; Eun-Jin Im; Stephane Ethier; John Shalf; Leonid Oliker

The gyrokinetic Particle-in-Cell (PIC) method is a critical computational tool enabling petascale fusion simulation research. In this work, we present novel multi- and manycore-centric optimizations to enhance the performance of GTC, a PIC-based production code for studying plasma microturbulence in tokamak devices. Our optimizations encompass all six GTC subroutines and include multi-level particle and grid decompositions designed to improve multi-node parallel scaling, particle binning for improved load balance, GPU acceleration of key subroutines, and memory-centric optimizations to improve single-node scaling and reduce memory utilization. The new hybrid MPI-OpenMP and MPI-OpenMP-CUDA GTC versions achieve up to a 2× speedup over the production Fortran code on four parallel systems: clusters based on the AMD Magny-Cours, Intel Nehalem-EP, IBM BlueGene/P, and NVIDIA Fermi architectures. Finally, strong scaling experiments provide insight into parallel scalability, memory utilization, and programmability trade-offs for large-scale gyrokinetic PIC simulations, while attaining a 1.6× speedup on 49,152 XE6 cores.
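One of the listed optimizations, particle binning, can be sketched independently of GTC. The toy function below (names and layout are my own, not GTC's) groups particle indices by grid cell so that particles touching the same cell are processed together, which is what improves locality and load balance when cells are distributed among threads.

```python
def bin_particles(positions, cell_size, n_cells):
    """Group particle indices by the 1-D grid cell they fall in, so each
    cell's particles are contiguous before the deposition phase."""
    bins = [[] for _ in range(n_cells)]
    for i, p in enumerate(positions):
        c = min(int(p / cell_size), n_cells - 1)   # clamp to last cell
        bins[c].append(i)
    return bins
```

In the real code this binning is periodic (particles drift between cells each step), and the bin-to-thread assignment is what balances the load.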


Parallel Computing | 2011

Gyrokinetic particle-in-cell optimization on emerging multi- and manycore platforms

Kamesh Madduri; Eun-Jin Im; Khaled Z. Ibrahim; Samuel Williams; Stephane Ethier; Leonid Oliker

The next decade of high-performance computing (HPC) systems will see a rapid evolution and divergence of multi- and manycore architectures as power and cooling constraints limit increases in microprocessor clock speeds. Understanding efficient optimization methodologies on diverse multicore designs in the context of demanding numerical methods is one of the greatest challenges faced today by the HPC community. In this work, we examine the efficient multicore optimization of GTC, a petascale gyrokinetic toroidal fusion code for studying plasma microturbulence in tokamak devices. For GTC's key computational components (charge deposition and particle push), we explore efficient parallelization strategies across a broad range of emerging multicore designs, including the recently released Intel Nehalem-EX, the AMD Opteron Istanbul, and the highly multithreaded Sun UltraSPARC T2+. We also present the first study on tuning gyrokinetic particle-in-cell (PIC) algorithms for graphics processors, using the NVIDIA C2050 (Fermi). Our work discusses several novel optimization approaches for gyrokinetic PIC, including mixed-precision computation, particle binning and decomposition strategies, grid replication, SIMDized atomic floating-point operations, and effective GPU texture memory utilization. Overall, we achieve significant performance improvements of 1.3–4.7× on these complex PIC kernels, despite the inherent challenges of data dependency and locality. Our work also points to several architectural and programming features that could significantly enhance PIC performance and productivity on next-generation architectures.
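Among the optimizations named above, grid replication is easy to model in scalar code. The sketch below is a stand-in for the real gyrokinetic deposition kernel (with a simple linear-weight deposit and a round-robin work split, both my own simplifications): each worker deposits into a private grid copy, and the copies are summed afterward, trading memory for the elimination of conflicting atomic updates.

```python
def deposit_replicated(particles, n_grid, n_threads):
    """Charge deposition with one private grid per 'thread', reduced
    at the end -- no two workers ever write the same grid entry."""
    grids = [[0.0] * n_grid for _ in range(n_threads)]
    for t in range(n_threads):
        for x, q in particles[t::n_threads]:       # round-robin work split
            i = int(x)                             # left grid point
            w = x - i                              # linear weight
            grids[t][i] += q * (1.0 - w)
            grids[t][(i + 1) % n_grid] += q * w
    # reduction: sum the private copies into a single grid
    return [sum(g[j] for g in grids) for j in range(n_grid)]
```

The alternative the paper also explores, atomic scatter into one shared grid, saves the replicated memory but pays for contended floating-point atomics.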


International Conference on Information and Communication Technology Convergence | 2011

A homogeneous parallel brute force cracking algorithm on the GPU

Anh-Duy Vu; Jea-Il Han; Hong-An Nguyen; Young-Man Kim; Eun-Jin Im

From the early days of computing, passwords have been considered the essential authentication method for protecting access to computer systems and user accounts. Because of their importance, sensitivity, and confidentiality, many cryptographic mechanisms have been used to secure password storage. Among them, cryptographic hash functions are the most popular solution. A cryptographic hash function converts plaintext passwords into unreadable message digests, which hinders attackers who exploit system failures from stealing stored passwords. It is, however, still possible to recover plaintext passwords from the digests. We examined brute-force attacks for obtaining the original passwords from the hashed ones and studied some existing GPU-based brute-force cracking tools. These applications implement a hybrid algorithm that generates candidate passwords on the CPU side and hashes them in parallel on the GPU side. In this paper, we propose a new homogeneous parallel brute-force cracking algorithm that performs all of the work on the GPU side. In our experiments, we successfully cracked many kinds of passwords. For example, with 6-digit passwords, it took about 0.23 ms for initialization, 1.97 ms for combination generation, and 52.81 ms for the brute-force search, so less than one second is needed to crack passwords of this kind.
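The core of the homogeneous scheme is that each GPU thread can derive its own candidate password directly from its global thread index, with no CPU-side generation step. A serial Python model of that mapping (my own illustration; MD5 is used purely as an example digest, as the abstract does not tie the method to one hash) might look like:

```python
import hashlib

def candidate(index, charset, length):
    """Map an integer index to a fixed-length password by treating the
    index as a base-|charset| number -- what each GPU thread would do
    with its global thread index."""
    chars = []
    for _ in range(length):
        index, r = divmod(index, len(charset))
        chars.append(charset[r])
    return "".join(chars)

def crack(target_digest, charset, length):
    """Serial stand-in for the GPU search: every index maps to exactly
    one candidate, so the index space covers the password space."""
    for i in range(len(charset) ** length):
        pw = candidate(i, charset, length)
        if hashlib.md5(pw.encode()).hexdigest() == target_digest:
            return pw
    return None
```

On the GPU, the loop over `i` becomes the thread grid, and each thread performs generation and hashing, which is what makes the algorithm homogeneous.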


Journal of Parallel and Distributed Computing | 2014

A grand spread estimator using a graphics processing unit

Seon-Ho Shin; Eun-Jin Im; MyungKeun Yoon

The spread of a source is defined as the number of distinct destinations to which the source has sent packets during a measurement period. Spread estimation is essential in traffic monitoring, measurement, and intrusion detection, to name a few. To support high-speed networking, recent research suggests implementing a spread estimator in fast but small on-chip memory such as SRAM. A state-of-the-art estimator can hold succinct information about 10 million distinct packets using 1 MB of SRAM. This implies that a measurement period must restart whenever 10 million distinct packets fill up the SRAM. Spread estimation is a challenging problem because two spread values from different measurement periods cannot be aggregated to derive the total value. Therefore, current spread estimators have a serious limitation on the length of the measurement period, because only a few megabytes of SRAM are available at most. In this paper, we propose a spread estimator that utilizes the large memory space of a graphics processing unit on a commodity PC. The proposed estimator utilizes 1 GB of memory, a hundred times larger than those of current spread estimators, and its throughput is still around 160 Gbps. According to our experiments, the proposed scheme can cover a measurement period of a few dozen hours, while the current state of the art can cover only one hour. To the best of our knowledge, this has not been achieved by any spread estimator thus far.
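To make the memory versus measurement-period trade-off concrete, here is a toy bitmap estimator in the spirit of the succinct sketches discussed above; it is an illustrative stand-in (linear counting), not the paper's GPU data structure. With m bits, the distinct count is estimated as n ≈ -m·ln(V), where V is the fraction of bits still zero, and a saturated bitmap is exactly the point at which a measurement period must restart.

```python
import hashlib
import math

def estimate_spread(destinations, m=1024):
    """Linear-counting sketch: set one bit per hashed destination,
    then estimate the number of distinct destinations from the
    fraction of zero bits. Duplicates set the same bit, so only
    distinct destinations affect the estimate."""
    bits = [0] * m
    for d in destinations:
        h = int(hashlib.sha1(d.encode()).hexdigest(), 16)
        bits[h % m] = 1
    zeros = m - sum(bits)
    if zeros == 0:
        return float("inf")      # sketch saturated: period must restart
    return -m * math.log(zeros / m)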


International Conference on Ubiquitous and Future Networks | 2012

Social network analysis algorithm on a many-core GPU

Sang Won Seo; Joohyun Kyong; Eun-Jin Im

The proliferation of social network services permeates people's lives on the Internet, and their social, political, and cultural significance prompts the need to understand and analyze the content and structure of such services. The sheer volume of these social networks is enormous, necessitating the development and implementation of efficient social network analysis algorithms. Influence maximization is one such algorithm: its objective is to find a small subset of nodes, so-called seed nodes, that maximizes the spread of influence through the edges of the graph representing the social network's connections. As the cost-efficient, high-performance computing power of many-core GPUs is widely utilized in nearly all areas of computing, we apply our expertise in GPU parallelization to the influence maximization algorithm. Graph algorithms are known as sparse algorithms, since their irregular data structures require indirect memory accesses, resulting in low-bandwidth memory access. Efficient parallelization of sparse algorithms is an active research area, because such algorithms are used universally and are thus vital to broad application areas, from scientific simulation to social studies. In this paper, we introduce an algorithm for influence maximization in social networks and adapt it for parallel implementation on a many-core GPU. We also analyze our implementation in terms of the factors affecting GPU performance.
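For reference, the standard formulation of influence maximization is the greedy algorithm under the independent cascade model, where each new seed is the node with the largest estimated marginal gain in expected spread. The serial Python sketch below shows that formulation (my own compact version; the paper's contribution is parallelizing the many Monte Carlo cascade simulations on the GPU, which this sketch does not attempt).

```python
import random

def cascade(graph, seeds, p, rng):
    """Simulate one independent cascade with uniform edge probability p;
    return the set of activated nodes."""
    active, frontier = set(seeds), list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v in graph.get(u, []):
                if v not in active and rng.random() < p:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return active

def greedy_seeds(graph, k, p=0.5, trials=200, seed=0):
    """Greedily pick k seed nodes, each time adding the node with the
    largest Monte Carlo estimate of expected spread."""
    rng = random.Random(seed)
    chosen = []
    nodes = set(graph) | {v for vs in graph.values() for v in vs}
    for _ in range(k):
        best, best_gain = None, -1.0
        for cand in nodes - set(chosen):
            gain = sum(len(cascade(graph, chosen + [cand], p, rng))
                       for _ in range(trials)) / trials
            if gain > best_gain:
                best, best_gain = cand, gain
        chosen.append(best)
    return chosen
```

The inner `trials` loop is embarrassingly parallel, which is why this algorithm is a natural fit for a many-core GPU despite its sparse, irregular graph accesses.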


International Conference on Information and Communication Technology Convergence | 2010

Fast forwarding table lookup exploiting GPU memory architecture

Youngjun Lee; Minseon Jeong; Sanghwan Lee; Eun-Jin Im

As Internet traffic increases and diversifies, the need for fast, flexible routers has led researchers to work on software routers. Existing software router systems may utilize clusters of multiple machines or GPU systems. In particular, PacketShader, which exploits the GPU's extensive parallelism, shows higher performance than other existing software routers. However, PacketShader does not exploit the memory architecture of the GPU system. A GPU has different types of memory, such as constant, texture, and global memory, and these types differ in memory access performance. We therefore propose a unique index architecture for the forwarding table that exploits the memory architecture of the GPU system. Through a preliminary evaluation, we show that forwarding table lookup using the GPU can outperform a CPU-only system.
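A common flat-table layout for GPU forwarding lookup resolves a packet's next hop with a single indexed memory read. The scaled-down Python model below is my own illustration of such a structure (indexing by the top 16 bits of the address rather than the DIR-24-8 style 24 bits, just to keep the table small); which GPU memory type such a table should live in is the question the paper studies.

```python
def build_table(prefixes):
    """prefixes: list of (network, prefix_len, next_hop) with
    prefix_len <= 16. Writing shorter prefixes first lets longer ones
    overwrite them, giving longest-prefix match by construction."""
    table = [None] * (1 << 16)
    for net, plen, hop in sorted(prefixes, key=lambda t: t[1]):
        base = net >> 16                  # top 16 bits of the network
        span = 1 << (16 - plen)           # table slots this prefix covers
        for i in range(base, base + span):
            table[i] = hop
    return table

def lookup(table, addr):
    """One array read: index by the top 16 bits of the destination."""
    return table[addr >> 16]
```

On a GPU, `table` would be placed in constant, texture, or global memory, and the access pattern of `lookup` is what makes that placement matter.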


International Conference on Service-Oriented Computing | 2013

A Reliable, Safe, and Secure Run-Time Platform for Cyber Physical Systems

Sung-Soo Lim; Eun-Jin Im; Nikil D. Dutt; Kyung-woo Lee; Insik Shin; Chang-Gun Lee; Insup Lee

This paper introduces a global research collaboration project performed by a Korea-USA research group. The project aims at designing and implementing a run-time platform for reliable, safe, and secure cyber-physical systems (CPS). The project consists of layered sub-projects, including SoC design for reliable systems, virtualized software architecture for dynamically upgradable systems, and middleware architecture for safety-critical networked applications. This paper describes the objectives of each sub-project and the current accomplishments.


Parallel Computing | 2004

Performance tuning of matrix triple products based on matrix structure

Eun-Jin Im; Ismail Bustany; Cleve Ashcraft; James Demmel; Katherine A. Yelick

Sparse matrix computations arise in many scientific and engineering applications, but their performance is limited by the growing gap between processor and memory speed. In this paper, we present a case study of an important sparse matrix triple product problem that commonly arises in primal-dual optimization methods. Instead of a generic two-phase algorithm, we devise and implement a single-pass algorithm that exploits the block diagonal structure of the matrix. Our algorithm uses fewer floating point operations and roughly half the memory of the two-phase algorithm. The speedup of the one-phase scheme over the two-phase scheme is 2.04 on a 900 MHz Intel Itanium-2, 1.63 on a 1 GHz Power-4, and 1.99 on a 900 MHz Sun Ultra-3.
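The single-pass idea can be illustrated for a triple product P = R^T * D * R with block diagonal D: each diagonal block touches only its own slice of rows of R, so P can be accumulated block by block in one sweep, never materializing the intermediate D*R. The small Python model below is my own illustration of that structure, not the paper's implementation.

```python
def triple_product_blockdiag(blocks, R):
    """P = R^T * D * R accumulated in a single pass over the diagonal
    blocks of D. blocks: list of (row_offset, dense square block);
    R: dense n x m matrix as a list of rows."""
    m = len(R[0])
    P = [[0.0] * m for _ in range(m)]
    for off, D in blocks:
        b = len(D)
        Rb = R[off:off + b]               # the only rows this block touches
        for i in range(b):
            # t = row i of (D_block * Rb), computed on the fly
            t = [sum(D[i][k] * Rb[k][j] for k in range(b)) for j in range(m)]
            # accumulate Rb^T * t into P
            for a in range(m):
                ria = Rb[i][a]
                for j in range(m):
                    P[a][j] += ria * t[j]
    return P
```

Because each block's contribution is finished before moving on, only one block-row of the intermediate product ever exists at a time, which is the source of the memory saving the abstract reports.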

Collaboration


Dive into Eun-Jin Im's collaborations.

Top Co-Authors

Katherine A. Yelick
Lawrence Berkeley National Laboratory

Kamesh Madduri
Pennsylvania State University

Khaled Z. Ibrahim
Lawrence Berkeley National Laboratory

Leonid Oliker
Lawrence Berkeley National Laboratory

Samuel Williams
Lawrence Berkeley National Laboratory

Stephane Ethier
Princeton Plasma Physics Laboratory

Chang-Gun Lee
Seoul National University