Publication


Featured research published by Hyojong Kim.


IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2013

OpenCL Performance Evaluation on Modern Multi Core CPUs

Joo Hwan Lee; Kaushik Patel; Nimit Nigania; Hyojong Kim; Hyesoon Kim

Utilizing heterogeneous platforms for computation has become a general trend, making portability an important issue. OpenCL (Open Computing Language) serves this purpose by enabling portable execution on heterogeneous architectures. However, unpredictable performance variation across platforms has become a burden for programmers who write OpenCL programs. This is especially true for conventional multicore CPUs, where the performance of typical OpenCL applications lags behind what programmers expect from conventional parallel programming models. In this paper, we evaluate the performance of OpenCL programs on out-of-order multicore CPUs from an architectural perspective. We evaluate OpenCL programs along several dimensions, including scheduling overhead, instruction-level parallelism, address space, data location, locality, and vectorization, comparing OpenCL to conventional parallel programming models for CPUs. Our evaluation reveals the distinct performance characteristics of OpenCL programs and provides insight into the optimization metrics for better performance on CPUs.
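
To make the contrast concrete, here is a minimal C++ sketch (my own illustration, not code from the paper; all names such as saxpy_chunk are hypothetical) of the gap the paper measures: OpenCL conceptually schedules one work-item per element, while conventional CPU models give each thread a contiguous chunk that amortizes scheduling overhead and vectorizes well.

```cpp
// Minimal sketch, not from the paper: contrasts OpenCL's fine-grained
// work-item view with the coarse-grained per-thread partitioning used by
// conventional CPU programming models.
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// OpenCL-style view: one logical "work-item" per element. Dispatching work
// at this granularity is where scheduling overhead accumulates on a CPU.
void saxpy_work_item(float a, const float* x, float* y, std::size_t i) {
    y[i] = a * x[i] + y[i];
}

// Conventional CPU view: each thread owns a contiguous chunk, which the
// compiler can vectorize and which amortizes scheduling cost.
void saxpy_chunk(float a, const float* x, float* y,
                 std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i)
        y[i] = a * x[i] + y[i];
}

int main() {
    const std::size_t n = 1 << 20;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);
    const unsigned t = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (unsigned k = 0; k < t; ++k)
        pool.emplace_back(saxpy_chunk, 2.0f, x.data(), y.data(),
                          n * k / t, n * (k + 1) / t);
    for (auto& th : pool) th.join();
}
```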


IEEE International Symposium on High-Performance Computer Architecture (HPCA) | 2017

GraphPIM: Enabling Instruction-Level PIM Offloading in Graph Computing Frameworks

Lifeng Nai; Ramyad Hadidi; Jaewoong Sim; Hyojong Kim; Pranith Kumar; Hyesoon Kim

With the emergence of data science, graph computing has become increasingly important. Unfortunately, graph computing typically suffers from poor performance when mapped to modern computing systems because of the overhead of executing atomic operations and the inefficient utilization of the memory subsystem. Meanwhile, emerging technologies such as the Hybrid Memory Cube (HMC) enable processing-in-memory (PIM) functionality by offloading operations at the instruction level. Instruction offloading to the PIM side has considerable potential to overcome the performance bottleneck of graph computing. Nevertheless, this functionality for graph workloads has not been fully explored, and its applications and shortcomings have not been well identified thus far. In this paper, we present GraphPIM, a full-stack solution for graph computing that achieves higher performance using PIM functionality. We perform an analysis of modern graph workloads to assess the applicability of PIM offloading and present hardware and software mechanisms to efficiently make use of the PIM functionality. Following the real-world HMC 2.0 specification, GraphPIM provides performance benefits for graph applications without any user code modification or ISA changes. In addition, we propose an extension to PIM operations that can further improve performance for more graph applications. The evaluation results show that GraphPIM achieves up to a 2.4x speedup with a 37% reduction in energy consumption.
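
As a sketch of the offloading target (a shape assumed from the abstract, not the paper's code; compile as C++20 for std::atomic<float>::fetch_add), the per-edge atomic update below is the kind of cache-unfriendly host instruction that a GraphPIM-like scheme would map to an HMC in-memory atomic:

```cpp
// Hedged sketch of a graph kernel whose atomic updates are PIM-offloading
// candidates. Requires C++20 for std::atomic<float>::fetch_add.
#include <atomic>
#include <cstddef>
#include <vector>

struct CSRGraph {
    std::vector<std::size_t> row_ptr;  // |V|+1 offsets into col_idx
    std::vector<std::size_t> col_idx;  // destination vertex of each edge
};

// One PageRank-style scatter step. The fetch_add is the instruction-level
// offloading candidate: executed in the memory stack, it avoids bouncing
// a whole cache line to the core for a tiny read-modify-write.
void scatter(const CSRGraph& g, const std::vector<float>& contrib,
             std::vector<std::atomic<float>>& next) {
    for (std::size_t v = 0; v + 1 < g.row_ptr.size(); ++v)
        for (std::size_t e = g.row_ptr[v]; e < g.row_ptr[v + 1]; ++e)
            next[g.col_idx[e]].fetch_add(contrib[v]);
}

int main() {
    CSRGraph g{{0, 2, 3}, {1, 0, 0}};          // toy 2-vertex graph
    std::vector<float> contrib{0.5f, 0.25f};
    std::vector<std::atomic<float>> next(2);   // zero-initialized in C++20
    scatter(g, contrib, next);
}
```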


IEEE Micro | 2015

Accelerating Application Start-up with Nonvolatile Memory in Android Systems

Hyojong Kim; Hong-Yeol Lim; Dilan Manatunga; Hyesoon Kim; Gi-Ho Park

Application launch time in mobile systems is critical because it can adversely affect the user experience. Android employs several software techniques to reduce application launch time. For example, Android shares memory space among applications to reduce the loading time of libraries, and it keeps applications in memory even after they are terminated to reduce start-up time. However, little research has been done from a hardware perspective to reduce application launch time. In this article, the authors analyze the memory usage patterns of Android applications and suggest several hardware optimization techniques. They also demonstrate the benefit of using a nonvolatile memory, such as phase-change memory, to accelerate start-up time.


Scientific Programming | 2015

OpenCL performance evaluation on modern multicore CPUs

Joo Hwan Lee; Nimit Nigania; Hyesoon Kim; Kaushik Patel; Hyojong Kim

Utilizing heterogeneous platforms for computation has become a general trend, making portability an important issue. OpenCL (Open Computing Language) serves this purpose by enabling portable execution on heterogeneous architectures. However, unpredictable performance variation across platforms has become a burden for programmers who write OpenCL programs. This is especially true for conventional multicore CPUs, where the performance of typical OpenCL applications lags behind what programmers expect from conventional parallel programming models. In this paper, we evaluate the performance of OpenCL programs on out-of-order multicore CPUs from an architectural perspective. We evaluate OpenCL programs along several dimensions, including scheduling overhead, instruction-level parallelism, address space, data location, locality, and vectorization, comparing OpenCL to conventional parallel programming models for CPUs. Our evaluation reveals the distinct performance characteristics of OpenCL programs and provides insight into the optimization metrics for better performance on CPUs.


Proceedings of the 2015 International Symposium on Memory Systems | 2015

Understanding Energy Aspects of Processing-near-Memory for HPC Workloads

Hyojong Kim; Hyesoon Kim; Sudhakar Yalamanchili; Arun Rodrigues

Interest in the concept of processing-near-memory (PNM) has been reignited by recent improvements in 3D integration technology. In this work, we analyze the energy consumption characteristics of a system that comprises a conventional processor and a 3D memory stack with fully programmable cores. We construct a high-level analytical energy model based on the underlying architecture and the technology with which each component is built. From preliminary experiments with 11 HPC benchmarks from the Mantevo benchmark suite, we observe that the misses per kilo instructions (MPKI) of the last-level cache (LLC) is one of the most important characteristics in determining how well suited an application is to PNM execution.
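
The paper's actual model is not reproduced here, but a first-order sketch of how LLC MPKI could decide PNM friendliness (all constants and names below are hypothetical) looks like this:

```cpp
// Illustrative first-order energy model (my assumption, not the paper's
// published equations): energy per kilo-instruction is core energy plus
// memory-access energy driven by LLC MPKI. High MPKI favors PNM, whose
// cores sit next to DRAM and pay far less per off-chip access.
#include <cstdio>

struct Params {
    double core_epi;       // host core energy per instruction (nJ)
    double pnm_epi;        // simpler PNM core energy per instruction (nJ)
    double host_miss_nj;   // energy per LLC miss serviced off-chip (nJ)
    double pnm_access_nj;  // energy per in-stack memory access (nJ)
};

double host_energy_per_kilo_insn(const Params& p, double mpki) {
    return 1000.0 * p.core_epi + mpki * p.host_miss_nj;
}

double pnm_energy_per_kilo_insn(const Params& p, double mpki) {
    return 1000.0 * p.pnm_epi + mpki * p.pnm_access_nj;
}

int main() {
    Params p{0.5, 0.3, 20.0, 4.0};  // hypothetical technology numbers
    for (double mpki : {1.0, 10.0, 50.0})
        std::printf("MPKI %5.1f: host %7.1f nJ, PNM %7.1f nJ\n",
                    mpki, host_energy_per_kilo_insn(p, mpki),
                    pnm_energy_per_kilo_insn(p, mpki));
}
```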


ACM Transactions on Architecture and Code Optimization | 2017

CAIRO: A Compiler-Assisted Technique for Enabling Instruction-Level Offloading of Processing-In-Memory

Ramyad Hadidi; Lifeng Nai; Hyojong Kim; Hyesoon Kim

Three-dimensional (3D) stacking technology and the memory-wall problem have popularized processing-in-memory (PIM) concepts again, offering bandwidth and energy savings by offloading computations to functional units inside the memory. Several memory vendors have also started to integrate computation logic into the memory, such as the Hybrid Memory Cube (HMC), the latest version of which supports up to 18 in-memory atomic instructions. Although industry prototypes have motivated studies of efficient methods and architectures for PIM, researchers have not proposed a systematic way to identify the benefits of instruction-level PIM offloading. As a result, compiler support for recognizing offloading candidates and utilizing instruction-level PIM offloading is unavailable. In this article, we analyze the advantages of instruction-level PIM offloading in the context of HMC-atomic instructions for graph-computing applications and propose CAIRO, a compiler-assisted technique and decision model for enabling instruction-level PIM offloading without any burden on programmers. To develop CAIRO, we analyzed how instruction offloading enables performance gains in both CPU and GPU workloads. Our studies show that the performance gain from bandwidth savings, the ratio of cache misses to total cache accesses, and the overhead of host atomic instructions are the key factors in selecting an offloading candidate. Based on our analytical models, we characterize the properties of beneficial and nonbeneficial candidates for offloading. We evaluate CAIRO with 27 multithreaded CPU and 36 GPU benchmarks. In our evaluation, CAIRO not only doubles the speedup for a set of PIM-beneficial workloads by exploiting HMC-atomic instructions but also prevents slowdown caused by incorrect offloading decisions for other workloads.
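
A hedged sketch of what such a compile-time decision could look like follows (the structure, fields, and thresholds are assumptions drawn from the abstract, not CAIRO's published model):

```cpp
// Illustrative offload decision in the spirit of CAIRO; all fields and
// cutoffs are hypothetical stand-ins for the paper's analytical model.
#include <cstdio>

struct CandidateProfile {
    double miss_ratio;        // cache misses / total accesses for the data
    double host_atomic_cost;  // relative overhead of the atomic on the host
    double bytes_host;        // bytes moved if executed on the host
    double bytes_pim;         // bytes moved if offloaded as a PIM atomic
};

bool should_offload(const CandidateProfile& c) {
    // High miss ratio: the line is rarely on chip anyway, so in-memory
    // execution saves round trips. Low miss ratio: offloading forfeits
    // cache hits, so stay on the host unless host atomics are costly.
    const double bandwidth_saving = c.bytes_host - c.bytes_pim;
    return (c.miss_ratio > 0.5 && bandwidth_saving > 0.0) ||
           c.host_atomic_cost > 1.5;  // hypothetical cutoffs
}

int main() {
    CandidateProfile hot{0.8, 1.0, 128.0, 16.0};   // miss-heavy: offload
    CandidateProfile cold{0.1, 1.0, 128.0, 16.0};  // cache-friendly: keep
    std::printf("hot -> %d, cold -> %d\n",
                should_offload(hot), should_offload(cold));
}
```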


IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM) | 2014

Harmonica: An FPGA-Based Data Parallel Soft Core

Chad D. Kersey; Sudhakar Yalamanchili; Hyojong Kim; Nimit Nigania; Hyesoon Kim

In this poster, we introduce Harmonica, a customizable FPGA-hosted core for massively parallel, data-intensive applications, designed for use in the proposed Cymric processor-near-memory architecture. We also discuss the deployment of Harmonica in the Cymric prototype, its first use in a full FPGA-based system incorporating a memory hierarchy. Given the nascent state of processor-near-memory (PNM) architectures, Cymric has adopted an approach based on the ability to rapidly develop and analyze a space of potential architectural solutions. Harmonica facilitates this by providing configurability at both the instruction-set and microarchitecture levels, allowing the same core design to be used in widely varying applications. Core dimensions such as SIMD width and word size can be configured, as well as the set and types of functional units present. This flexibility is enabled by HARP, a family of instruction set architectures for SIMT cores, and CHDL, a C++-based environment used to describe the processor in lieu of a traditional HDL.
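
The configurability it describes can be pictured with an ordinary C++ template (a loose analogy of my own; this is not CHDL or HARP code): SIMD width and word size become parameters of a single core description.

```cpp
// Loose illustration of design-time configurability; not CHDL or HARP.
#include <array>
#include <cstdint>

template <unsigned SimdWidth, typename Word>
struct SimdAlu {
    // One description of an add unit yields many concrete dimensions.
    std::array<Word, SimdWidth> add(const std::array<Word, SimdWidth>& a,
                                    const std::array<Word, SimdWidth>& b) const {
        std::array<Word, SimdWidth> out{};
        for (unsigned i = 0; i < SimdWidth; ++i) out[i] = a[i] + b[i];
        return out;
    }
};

int main() {
    SimdAlu<4, std::uint32_t> alu;      // wide-word configuration
    SimdAlu<16, std::uint16_t> narrow;  // narrow-word, wider-SIMD configuration
    (void)narrow;
    auto r = alu.add({{1, 2, 3, 4}}, {{10, 20, 30, 40}});
    return r[0] == 11 ? 0 : 1;
}
```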


ACM Transactions on Architecture and Code Optimization | 2018

CODA: Enabling Co-location of Computation and Data for Multiple GPU Systems

Hyojong Kim; Ramyad Hadidi; Lifeng Nai; Hyesoon Kim; Nuwan Jayasena; Yasuko Eckert; Onur Kayiran; Gabriel H. Loh

To exploit the parallelism and scalability of multiple GPUs in a system, it is critical to place compute and data together. However, two key techniques used in traditional GPU systems to hide memory latency and improve thread-level parallelism (TLP), memory interleaving and thread block scheduling, are at odds with the efficient use of multiple GPUs. Distributing data across multiple GPUs to improve overall memory bandwidth utilization incurs high remote traffic when data and compute are misaligned, and nondeterministic thread block scheduling to improve compute resource utilization impedes the co-placement of compute and data. Our goal in this work is to enable the co-placement of compute and data in the presence of fine-grained interleaved memory with a low-cost approach. To this end, we propose a mechanism that identifies exclusively accessed data and places the data, along with the thread block that accesses it, in the same GPU. The key ideas are (1) the amount of data exclusively used by a thread block can be estimated, and that exclusive data (of any size) can be localized to one GPU with coarse-grained interleaved pages; (2) with an affinity-based thread block scheduling policy, we can co-place compute and data; and (3) by using a dual address mode with lightweight changes to virtual-to-physical page mappings, we can selectively choose different interleaved memory pages for each data structure. Our evaluations across a wide range of workloads show that the proposed mechanism improves performance by 31% and reduces remote traffic by 38% over a baseline system.
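
A hedged sketch of the co-placement idea (shapes assumed from the abstract; page size, GPU count, and function names are hypothetical): under coarse-grained page interleaving, the GPU owning a page is a simple function of the page number, so an affinity-based scheduler can send each thread block to the GPU that owns its exclusive data.

```cpp
// Illustrative affinity-based placement; not the paper's implementation.
#include <cstdint>
#include <cstdio>

constexpr std::uint64_t kPageBytes = 2ull << 20;  // hypothetical 2 MiB pages
constexpr unsigned kNumGpus = 4;

// Owner of an address under coarse-grained page interleaving.
unsigned owner_gpu(std::uint64_t addr) {
    return static_cast<unsigned>((addr / kPageBytes) % kNumGpus);
}

// Affinity-based scheduling: place a thread block where its exclusively
// accessed data lives, instead of round-robin across GPUs.
unsigned schedule_block(std::uint64_t block_exclusive_base) {
    return owner_gpu(block_exclusive_base);
}

int main() {
    std::uint64_t base = 5 * kPageBytes + 4096;  // some block's exclusive data
    std::printf("block scheduled on GPU %u\n", schedule_block(base));
}
```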


IEEE International Parallel and Distributed Processing Symposium (IPDPS) | 2017

SimProf: A Sampling Framework for Data Analytic Workloads

Jen-Cheng Huang; Lifeng Nai; Pranith Kumar; Hyojong Kim; Hyesoon Kim

Today, there is a steep rise in the amount of data being collected from diverse applications, and data analytic workloads are gaining popularity as a way to extract insights that benefit applications such as financial trading and social media analysis. Architectural simulation is one of the most common approaches to studying the architectural behavior of these workloads. However, because of their long-running nature, it is not trivial to identify which parts of the analysis to simulate. In this work, we introduce SimProf, a sampling framework for data analytic workloads. Using this tool, we can select representative simulation points based on the phase behavior of the analysis at method-level granularity. This provides a better understanding of each simulation point and also reduces simulation time across different input sets. We present the framework for Apache Hadoop and Apache Spark, and it can be easily extended to other data analytic workloads.
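
As a toy version of the selection step (a simplified, SimPoint-style sketch of my own; SimProf's real phase analysis is more involved), one can summarize each execution interval as a per-method frequency vector and pick the interval nearest the centroid as the representative simulation point:

```cpp
// Simplified, SimPoint-style selection; not SimProf's actual algorithm.
#include <cstddef>
#include <cstdio>
#include <limits>
#include <vector>

using MethodVector = std::vector<double>;  // per-method execution frequency

// Pick the interval closest to the centroid of all interval vectors. Real
// phase analysis would cluster intervals into phases and pick one point
// per phase; a single centroid keeps the sketch short.
std::size_t representative_interval(const std::vector<MethodVector>& ivs) {
    const std::size_t n = ivs.size(), d = ivs.front().size();
    MethodVector centroid(d, 0.0);
    for (const auto& v : ivs)
        for (std::size_t j = 0; j < d; ++j) centroid[j] += v[j] / n;
    std::size_t best = 0;
    double best_dist = std::numeric_limits<double>::infinity();
    for (std::size_t i = 0; i < n; ++i) {
        double dist = 0.0;
        for (std::size_t j = 0; j < d; ++j) {
            const double diff = ivs[i][j] - centroid[j];
            dist += diff * diff;
        }
        if (dist < best_dist) { best_dist = dist; best = i; }
    }
    return best;
}

int main() {
    std::vector<MethodVector> ivs{
        {1.0, 0.0, 2.0}, {1.0, 1.0, 1.0}, {0.0, 3.0, 0.0}};
    std::printf("representative interval: %zu\n", representative_interval(ivs));
}
```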


IEEE International Parallel and Distributed Processing Symposium (IPDPS) | 2018

CoolPIM: Thermal-Aware Source Throttling for Efficient PIM Instruction Offloading

Lifeng Nai; Ramyad Hadidi; He Xiao; Hyojong Kim; Jaewoong Sim; Hyesoon Kim

Collaboration


Dive into Hyojong Kim's collaboration network.

Top Co-Authors

Hyesoon Kim, Georgia Institute of Technology
Lifeng Nai, Georgia Institute of Technology
Ramyad Hadidi, Georgia Institute of Technology
Nimit Nigania, Georgia Institute of Technology
Pranith Kumar, Georgia Institute of Technology
Gabriel H. Loh, Georgia Institute of Technology
Jaewoong Sim, Georgia Institute of Technology
Joo Hwan Lee, Georgia Institute of Technology
Kaushik Patel, Georgia Institute of Technology