Publications


Featured research published by Junwhan Ahn.


International Symposium on Computer Architecture | 2015

A scalable processing-in-memory accelerator for parallel graph processing

Junwhan Ahn; Sungpack Hong; Sungjoo Yoo; Onur Mutlu; Kiyoung Choi

The explosion of digital data and the ever-growing need for fast data analysis have made in-memory big-data processing in computer systems increasingly important. In particular, large-scale graph processing is gaining attention due to its broad applicability from social science to machine learning. However, scalable hardware design that can efficiently process large graphs in main memory is still an open problem. Ideally, a cost-effective and scalable graph processing system would be one whose performance increases proportionally with the size of the graph it can store, which is extremely challenging in conventional systems due to severe memory bandwidth limitations. In this work, we argue that the conventional concept of processing-in-memory (PIM) can be a viable solution to achieve such an objective. The key modern enabler for PIM is the recent advancement of 3D integration technology, which facilitates stacking logic and memory dies in a single package and was not available when the PIM concept was originally examined. In order to take advantage of such a new technology to enable memory-capacity-proportional performance, we design a programmable PIM accelerator for large-scale graph processing called Tesseract. Tesseract is composed of (1) a new hardware architecture that fully utilizes the available memory bandwidth, (2) an efficient method of communication between different memory partitions, and (3) a programming interface that reflects and exploits the unique hardware design. It also includes two hardware prefetchers specialized for the memory access patterns of graph processing, which operate based on hints provided by our programming model. Our comprehensive evaluations using five state-of-the-art graph processing workloads with large real-world graphs show that the proposed architecture improves average system performance by a factor of ten and achieves an 87% average energy reduction over conventional systems.
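The partition-local, message-passing style the abstract describes can be pictured with a small sketch. The following is illustrative only, not Tesseract's actual API: each memory partition owns a slice of the vertices, and updates destined for remote vertices are sent as messages that are applied locally after a barrier, so random accesses stay within each partition's own high-bandwidth memory stack. All names (Partition, owner, pagerank_step) and the static vertex mapping are assumptions.

```python
NUM_PARTITIONS = 4  # assumed number of memory partitions, each with its own core

class Partition:
    def __init__(self):
        self.ranks = {}   # vertex id -> rank, for vertices owned by this partition
        self.inbox = []   # (vertex id, contribution) messages sent to this partition

def owner(vid):
    return vid % NUM_PARTITIONS   # assumed static vertex-to-partition mapping

def pagerank_step(parts, edges):
    # Phase 1: send contributions to the partitions owning the neighbors,
    # instead of issuing remote random reads across partitions.
    for p in parts:
        for vid, rank in p.ranks.items():
            for dst in edges.get(vid, []):
                parts[owner(dst)].inbox.append((dst, rank / len(edges[vid])))
    # Phase 2 (after a global barrier in such a design): apply queued updates
    # locally, so every random access hits the partition's own memory stack.
    for p in parts:
        acc = {vid: 0.0 for vid in p.ranks}
        for vid, contrib in p.inbox:
            acc[vid] += contrib
        p.inbox.clear()
        for vid in p.ranks:
            p.ranks[vid] = 0.15 + 0.85 * acc[vid]

parts = [Partition() for _ in range(NUM_PARTITIONS)]
for v in range(8):
    parts[owner(v)].ranks[v] = 1.0
edges = {0: [1, 2], 1: [2], 2: [0, 3], 3: [4, 5], 4: [6], 5: [7], 6: [0], 7: [3]}
pagerank_step(parts, edges)
print(parts[0].ranks)   # ranks of vertices owned by partition 0
```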


International Symposium on Computer Architecture | 2015

PIM-enabled instructions: a low-overhead, locality-aware processing-in-memory architecture

Junwhan Ahn; Sungjoo Yoo; Onur Mutlu; Kiyoung Choi

Processing-in-memory (PIM) is rapidly rising as a viable solution for the memory wall crisis, rebounding from its unsuccessful attempts in the 1990s due to practicality concerns, which are alleviated with recent advances in 3D stacking technologies. However, it is still challenging to integrate PIM architectures with existing systems in a seamless manner due to two common characteristics: unconventional programming models for in-memory computation units and lack of ability to utilize large on-chip caches. In this paper, we propose a new PIM architecture that (1) does not change the existing sequential programming models and (2) automatically decides whether to execute PIM operations in memory or processors depending on the locality of data. The key idea is to implement simple in-memory computation using compute-capable memory commands and use specialized instructions, which we call PIM-enabled instructions, to invoke in-memory computation. This allows PIM operations to be interoperable with existing programming models, cache coherence protocols, and virtual memory mechanisms with no modification. In addition, we introduce a simple hardware structure that monitors the locality of data accessed by a PIM-enabled instruction at runtime to adaptively execute the instruction at the host processor (instead of in memory) when the instruction can benefit from large on-chip caches. Consequently, our architecture provides the illusion that PIM operations are executed as if they were host processor instructions. We provide a case study of how ten emerging data-intensive workloads can benefit from our new PIM abstraction and its hardware implementation. Evaluations show that our architecture significantly improves system performance and, more importantly, combines the best parts of conventional and PIM architectures by adapting to data locality of applications.
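As a rough illustration of the locality-aware execution decision, here is a minimal sketch with assumed names and sizes (LocalityMonitor and pim_add are illustrative, not the paper's hardware interface): a small LRU tag array approximates whether the target cache line is hot on-chip, and the operation executes at the host when it is, in memory otherwise.

```python
from collections import OrderedDict

class LocalityMonitor:
    """Tiny LRU tag array approximating 'is this line hot in the caches?'."""
    def __init__(self, capacity=1024):   # assumed capacity
        self.capacity = capacity
        self.tags = OrderedDict()

    def is_local(self, line):
        hit = line in self.tags
        self.tags[line] = True
        self.tags.move_to_end(line)
        if len(self.tags) > self.capacity:
            self.tags.popitem(last=False)    # evict the least-recently-seen tag
        return hit

memory = {0x40: 10}
monitor = LocalityMonitor()

def offload_to_memory(addr, value):
    # Stand-in for a compute-capable memory command with the same effect.
    memory[addr] += value

def pim_add(addr, value):
    line = addr // 64    # 64-byte cache lines assumed
    if monitor.is_local(line):
        memory[addr] += value            # host-side execution: data likely cached
    else:
        offload_to_memory(addr, value)   # in-memory execution: avoid moving data

pim_add(0x40, 1)      # first touch: predicted non-local, executed in memory
pim_add(0x40, 1)      # second touch: predicted local, executed at the host
print(memory[0x40])   # 12
```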


High-Performance Computer Architecture | 2014

DASCA: Dead Write Prediction Assisted STT-RAM Cache Architecture

Junwhan Ahn; Sungjoo Yoo; Kiyoung Choi

Spin-Transfer Torque RAM (STT-RAM) has been considered a promising candidate for on-chip last-level caches, replacing SRAM for better energy efficiency, smaller die footprint, and scalability. However, it also introduces several new challenges into last-level cache design that need to be overcome for feasible deployment of STT-RAM caches. Among other things, mitigating the impact of slow and energy-hungry write operations is of the utmost importance. In this paper, we propose a new mechanism to reduce write activities of STT-RAM last-level caches. The key observation is that a significant amount of data written to last-level caches is never re-referenced during the lifetime of the corresponding cache blocks. Such write operations, which we call dead writes, can by definition bypass the cache without incurring extra misses. Based on this, we propose Dead Write Prediction Assisted STT-RAM Cache Architecture (DASCA), which predicts and bypasses dead writes for write energy reduction. For this purpose, we first propose a novel classification of dead writes, composed of dead-on-arrival fills, dead-value fills, and closing writes, as a theoretical model for redundant write elimination. On top of that, we present a dead write predictor based on a state-of-the-art dead block predictor. Evaluations show that our architecture achieves an energy reduction of 68% (62%) in last-level caches and an additional energy reduction of 10% (16%) in main memory, and even improves system performance by 6% (14%) on average, compared to the STT-RAM baseline in a single-core (quad-core) system.
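A heavily simplified sketch of the dead-on-arrival bypass decision follows (the table organization, threshold, and names are assumptions; DASCA's actual predictor builds on a sampling-based dead block predictor): fills are tagged with the PC of the missing instruction, evictions train a per-PC saturating counter with whether the block was ever reused, and confidently dead fills skip the costly STT-RAM write.

```python
from dataclasses import dataclass

@dataclass
class Block:
    fill_pc: int
    reused: bool = False

class DeadWritePredictor:
    """Per-PC saturating counters; sizes and threshold are illustrative."""
    def __init__(self, threshold=2):
        self.counters = {}
        self.threshold = threshold

    def predict_dead(self, pc):
        return self.counters.get(pc, 0) >= self.threshold

    def train(self, pc, was_reused):
        c = self.counters.get(pc, 0)
        # Evictions without reuse push the counter up; reuse pushes it down.
        self.counters[pc] = max(c - 1, 0) if was_reused else min(c + 1, 3)

cache = {}                  # addr -> Block, a stand-in for the STT-RAM LLC
pred = DeadWritePredictor()

def on_fill(pc, addr):
    if pred.predict_dead(pc):
        return False        # predicted dead-on-arrival: bypass, skip the write
    cache[addr] = Block(pc)
    return True

def on_hit(addr):
    cache[addr].reused = True

def on_evict(addr):
    blk = cache.pop(addr)
    pred.train(blk.fill_pc, blk.reused)

on_fill(pc=7, addr=0x100); on_evict(0x100)   # dies unused: trains toward bypass
on_fill(pc=7, addr=0x140); on_evict(0x140)
print(pred.predict_dead(7))                  # True: future fills from PC 7 bypass
```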


International Symposium on Low Power Electronics and Design | 2013

Write intensity prediction for energy-efficient non-volatile caches

Junwhan Ahn; Sungjoo Yoo; Kiyoung Choi

This paper presents a novel concept called write intensity prediction for energy-efficient non-volatile caches as well as the architecture that implements the concept. The key idea is to correlate write intensity of cache blocks with addresses of memory access instructions that incur cache misses of those blocks. The predictor keeps track of instructions that tend to load write-intensive blocks and utilizes that information to predict write intensity of blocks. Based on this concept, we propose a block placement strategy driven by write intensity prediction for SRAM/STT-RAM hybrid caches. Experimental results show that the proposed approach reduces write energy consumption by 55% on average compared to the existing hybrid cache architecture.
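A minimal sketch of this predictor, with assumed parameters (the table size, cutoff, and hashing are placeholders rather than the paper's design): the miss PC indexes a table that is queried at fill time and trained at eviction time with the write count the block accumulated while resident.

```python
TABLE_SIZE = 4096          # assumed table size
WRITE_INTENSE = 4          # assumed cutoff separating write-intensive blocks

table = [0] * TABLE_SIZE   # per-PC estimate of writes per cache residency

def index(pc):
    return pc % TABLE_SIZE          # stand-in for the predictor's indexing/hash

def predict_write_intensive(miss_pc):
    # Queried on a cache miss, before the incoming block is placed.
    return table[index(miss_pc)] >= WRITE_INTENSE

def train(miss_pc, writes_observed):
    # Called on eviction; a moving average stands in for the narrow
    # saturating counters real hardware would use.
    i = index(miss_pc)
    table[i] = (table[i] + writes_observed) // 2

# A PC that keeps loading heavily written blocks drifts above the cutoff.
for _ in range(4):
    train(0x400be0, writes_observed=12)
print(predict_write_intensive(0x400be0))   # True
```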


International Symposium on Circuits and Systems | 2012

Lower-bits cache for low power STT-RAM caches

Junwhan Ahn; Kiyoung Choi

As power-efficient design becomes more important, spin-transfer torque RAM (STT-RAM) has drawn a lot of attention due to its ability to offer both high performance and low power consumption. However, its high write energy increases dynamic power consumption and may offset the power saving from its low static power. This paper proposes a novel technique called the lower-bits cache for reducing write activities of STT-RAM L2 caches. Based on the observation that upper bits of data do not change as frequently as lower bits in most applications, the technique hides frequent bit changes in lower bits from the L2 cache. Experimental results show that, compared to the STT-RAM baseline, our architecture reduces the energy consumed by the L2 cache by 25 percent while slightly improving performance at the same time.
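The observation the technique rests on is easy to check numerically. The sketch below is illustrative (the 16/16 bit split and the update pattern are assumptions): for values updated by small deltas, bit flips concentrate overwhelmingly in the low-order bits, which is what lets a small lower-bits structure absorb most of the write activity.

```python
def flipped_bits(old, new, width=32):
    diff = (old ^ new) & ((1 << width) - 1)
    return [b for b in range(width) if (diff >> b) & 1]

v, low, high = 0, 0, 0
for _ in range(100000):
    nv = v + 3                 # small-delta update, typical of counters/pointers
    for b in flipped_bits(v, nv):
        if b < 16:
            low += 1           # flip in the lower half of the word
        else:
            high += 1          # flip in the upper half of the word
    v = nv
print(low, high)               # lower-half flips dominate by orders of magnitude
```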


IEEE Transactions on Computers | 2016

Prediction Hybrid Cache: An Energy-Efficient STT-RAM Cache Architecture

Junwhan Ahn; Sungjoo Yoo; Kiyoung Choi

Spin-transfer torque RAM (STT-RAM) has emerged as an energy-efficient and high-density alternative to SRAM for large on-chip caches. However, its high write energy has been considered a serious drawback. Hybrid caches mitigate this problem by incorporating a small SRAM cache for write-intensive data along with an STT-RAM cache. In such architectures, choosing cache blocks to be placed into the SRAM cache is the key to their energy efficiency. This paper proposes a new hybrid cache architecture called the prediction hybrid cache. The key idea is to predict the write intensity of cache blocks at the time of cache misses and determine block placement based on the prediction. We design a write intensity predictor that realizes the idea by exploiting a correlation between the write intensity of blocks and the memory access instructions that incur cache misses of those blocks. It includes a mechanism to dynamically adapt the predictor to application characteristics. We also design a hybrid cache architecture in which write-intensive blocks identified by the predictor are placed into the SRAM region. Evaluations show that our scheme reduces the energy consumption of hybrid caches by 28 percent (31 percent) on average compared to the existing hybrid cache architecture in a single-core (quad-core) system.
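A compact sketch of the placement side of this idea (region sizes, the threshold, and the feedback are placeholders; the paper's predictor and its dynamic adaptation mechanism are more involved): the miss PC's learned write intensity routes the incoming block to the SRAM or STT-RAM region, and evictions feed the observed write counts back into the per-PC estimate.

```python
class HybridCache:
    """Illustrative model, not the paper's design: two regions plus a PC table."""
    def __init__(self, threshold=4):
        self.sram = {}              # small region, cheap writes
        self.sttram = {}            # large region, costly writes
        self.pc_writes = {}         # miss PC -> estimated writes per residency
        self.threshold = threshold  # adapted dynamically in the real design

    def place(self, addr, miss_pc):
        hot = self.pc_writes.get(miss_pc, 0) >= self.threshold
        region = self.sram if hot else self.sttram
        region[addr] = {"pc": miss_pc, "writes": 0}

    def write(self, addr):
        for region in (self.sram, self.sttram):
            if addr in region:
                region[addr]["writes"] += 1

    def evict(self, addr):
        for region in (self.sram, self.sttram):
            if addr in region:
                blk = region.pop(addr)
                old = self.pc_writes.get(blk["pc"], 0)
                # Feed observed write counts back into the per-PC estimate.
                self.pc_writes[blk["pc"]] = (old + blk["writes"]) // 2
                return

cache = HybridCache()
cache.place(0x100, miss_pc=0xA)    # unknown PC: placed in STT-RAM
for _ in range(10):
    cache.write(0x100)
cache.evict(0x100)                 # trains PC 0xA as write-intensive
cache.place(0x140, miss_pc=0xA)    # now routed to the SRAM region
print(0x140 in cache.sram)         # True
```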


ACM Transactions on Architecture and Code Optimization | 2013

Power-Efficient Predication Techniques for Acceleration of Control Flow Execution on CGRA

Kyuseung Han; Junwhan Ahn; Kiyoung Choi

A coarse-grained reconfigurable architecture (CGRA) typically has an array of processing elements (PEs) that are controlled by a centralized unit, which makes it difficult to execute programs with control divergence among PEs without predication. However, conventional predication techniques have a negative impact on both performance and power consumption due to longer instruction words and unnecessary instruction fetching, decoding, and nullifying steps. This article reveals performance and power issues in predicated execution that have not yet been well addressed. Furthermore, it proposes fast and power-efficient predication mechanisms. Experiments conducted through gate-level simulation show that our mechanisms improve the energy-delay product by 11.9% to 23.8% on average.
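A back-of-the-envelope model shows why naive predication is power-hungry on a centrally controlled PE array (the energy numbers are illustrative, not from the paper): PEs on the untaken path still fetch and decode the configuration word before nullifying it, so they burn front-end energy for no useful work, which gating that front-end would avoid.

```python
E_FETCH, E_DECODE, E_EXEC = 1.0, 1.0, 3.0   # hypothetical relative energy costs

def array_energy(num_pes, taken_fraction, front_end_gated):
    active = int(num_pes * taken_fraction)   # PEs whose predicate is true
    nullified = num_pes - active             # PEs on the untaken path
    energy = active * (E_FETCH + E_DECODE + E_EXEC)
    if not front_end_gated:
        # Conventional predication: nullified PEs still fetch and decode the
        # configuration word before discarding it.
        energy += nullified * (E_FETCH + E_DECODE)
    return energy

# With half the PEs predicated off, gating the nullified PEs' front end
# removes a sizable share of the array's dynamic energy for this step.
print(array_energy(16, 0.5, front_end_gated=False))  # 56.0
print(array_energy(16, 0.5, front_end_gated=True))   # 40.0
```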


Design Automation Conference | 2014

Dynamic Power Management of Off-Chip Links for Hybrid Memory Cubes

Junwhan Ahn; Sungjoo Yoo; Kiyoung Choi

The Hybrid Memory Cube (HMC) is a 3D-stacked DRAM architecture designed for substantially improved memory bandwidth. In particular, its I/O interface achieves up to 320 GB/s of external bandwidth through high-speed serial links. However, this comes at the cost of the large static power of the off-chip links, which dominates the total power consumption of HMCs. Therefore, we propose an adaptive mechanism that partially disables off-chip links of HMCs with minimal performance impact. We also present two-level prefetching with in-HMC prefetch buffers to further improve its efficiency in the presence of prefetching. Evaluations show that our scheme reduces the energy consumption of HMCs by 51% on average.
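A sketch of epoch-based link power management follows (the policy details, margin, and link organization are assumptions, not the paper's exact algorithm): measure demanded bandwidth over an epoch, then keep only as many serial links awake as that demand requires, with some headroom so bursts do not immediately stall.

```python
import math

TOTAL_LINKS = 8
GBPS_PER_LINK = 40.0   # assumed 8 links x 40 GB/s = 320 GB/s peak, as cited above
MARGIN = 1.25          # assumed headroom so short bursts do not stall

def links_needed(observed_gbps):
    need = math.ceil(observed_gbps * MARGIN / GBPS_PER_LINK)
    return max(1, min(TOTAL_LINKS, need))   # keep at least one link awake

# A phase demanding ~70 GB/s needs only 3 of the 8 links; the static power of
# the other 5 can be saved until demand rises again.
print(links_needed(70.0))    # 3
print(links_needed(300.0))   # 8
```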


International Conference on Hardware/Software Codesign and System Synthesis | 2016

Zero and data reuse-aware fast convolution for deep neural networks on GPU

Hyunsun Park; Dong-Young Kim; Junwhan Ahn; Sungjoo Yoo

Convolution operations dominate the total execution time of deep convolutional neural networks (CNNs). In this paper, we aim at enhancing the performance of the state-of-the-art convolution algorithm (called Winograd convolution) on the GPU. Our work is based on two observations: (1) CNNs often have abundant zero weights and (2) the performance benefit of Winograd convolution is limited mainly due to extra additions incurred during data transformation. In order to exploit abundant zero weights, we propose a low-overhead and efficient hardware mechanism that skips multiplications that will always give zero results regardless of input data (called ZeroSkip). In addition, to leverage the second observation, we present data reuse optimization for addition operations in Winograd convolution (called AddOpt), which improves the utilization of local registers, thereby reducing on-chip cache accesses. Our experiments with a real-world deep CNN, VGG-16, on GPGPU-Sim and Titan X show that the proposed methods, ZeroSkip and AddOpt, achieve 51.8% higher convolution performance than the baseline Winograd convolution. Moreover, even without any hardware modification, AddOpt alone gives 35.6% higher performance on a real hardware platform, Titan X.
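For the Winograd part, a tiny worked example may help. The sketch below implements the standard 1D F(2,3) Winograd transforms (the well-known Lavin-Gray matrices, not something specific to this paper) and marks where a ZeroSkip-style check can skip element-wise multiplies whose transformed-filter operand is zero; the skip condition here is illustrative, not the proposed hardware.

```python
def winograd_f23(d, g):
    # Filter transform U = G g: 4 values from a 3-tap filter (precomputable).
    U = [g[0],
         0.5 * (g[0] + g[1] + g[2]),
         0.5 * (g[0] - g[1] + g[2]),
         g[2]]
    # Data transform V = B^T d: 4 values from a 4-element input tile.
    V = [d[0] - d[2],
         d[1] + d[2],
         d[2] - d[1],
         d[1] - d[3]]
    # Element-wise stage: 4 multiplies instead of 6 for direct convolution.
    # U is fixed once weights are trained, so U[i] == 0 is known statically
    # and the corresponding multiply always yields zero regardless of input.
    M = [0.0 if U[i] == 0 else U[i] * V[i] for i in range(4)]
    # Inverse transform A^T M: two output points.
    return [M[0] + M[1] + M[2], M[1] - M[2] - M[3]]

# Sanity check against direct convolution; g's zero weight also zeroes U[1], U[2].
d, g = [1.0, 2.0, 3.0, 4.0], [1.0, 0.0, -1.0]
print(winograd_f23(d, g))                                          # [-2.0, -2.0]
print([sum(g[k] * d[i + k] for k in range(3)) for i in range(2)])  # [-2.0, -2.0]
```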


Computer Vision and Pattern Recognition | 2017

Weighted-Entropy-Based Quantization for Deep Neural Networks

Eunhyeok Park; Junwhan Ahn; Sungjoo Yoo

Quantization is considered one of the most effective methods to optimize the inference cost of neural network models for their deployment to mobile and embedded systems, which have tight resource constraints. In such approaches, it is critical to provide low-cost quantization under a tight accuracy loss constraint (e.g., 1%). In this paper, we propose a novel method for quantizing weights and activations based on the concept of weighted entropy. Unlike recent work on binary-weight neural networks, our approach is multi-bit quantization, in which weights and activations can be quantized by any number of bits depending on the target accuracy. This facilitates much more flexible exploitation of the accuracy-performance trade-off provided by different levels of quantization. Moreover, our scheme provides an automated quantization flow based on conventional training algorithms, which greatly reduces the design-time effort to quantize the network. According to our extensive evaluations based on practical neural network models for image classification (AlexNet, GoogLeNet and ResNet-50/101), object detection (R-FCN with 50-layer ResNet), and language modeling (an LSTM network), our method achieves significant reductions in both the model size and the amount of computation with minimal accuracy loss. Also, compared to existing quantization schemes, ours provides higher accuracy with a similar resource constraint and requires much lower design effort.
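As a rough illustration of the objective (the importance function and the two-way split below are assumptions for exposition; the paper defines its own importance measure and a full multi-bit flow): cluster probabilities are importance-weighted rather than count-weighted, so boundary choices that balance importance across quantization clusters score a higher weighted entropy.

```python
import math

def weighted_entropy(groups, importance):
    # Each group's probability is its share of total importance, not its count.
    total = sum(importance(w) for g in groups for w in g)
    s = 0.0
    for g in groups:
        p = sum(importance(w) for w in g) / total
        if p > 0:
            s -= p * math.log2(p)
    return s

imp = lambda w: w * w   # assumed importance: squared magnitude

weights = sorted([0.01, 0.02, 0.05, 0.1, 0.4, 0.5, 0.9, 1.2])
by_count = [weights[:4], weights[4:]]          # split by element count
by_importance = [weights[:6], weights[6:]]     # split closer to equal importance
print(weighted_entropy(by_count, imp))         # ~0.04: nearly all mass in one bin
print(weighted_entropy(by_importance, imp))    # ~0.63: preferred boundary
```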

Collaboration


Dive into Junwhan Ahn's collaborations.

Top Co-Authors

Kiyoung Choi (Seoul National University)
Sungjoo Yoo (Seoul National University)
Hyunsun Park (Pohang University of Science and Technology)
Dong-Young Kim (Seoul National University)
Eunhyeok Park (Seoul National University)
Imyong Lee (Seoul National University)
Mungyu Son (Pohang University of Science and Technology)
Di Wu (Seoul National University)
Dong Soo Lee (Seoul National University)
Jinho Lee (Seoul National University)