Hyeran Jeon
University of Southern California
Publications
Featured research published by Hyeran Jeon.
international symposium on computer architecture | 2015
Sangpil Lee; Keunsoo Kim; Gunjae Koo; Hyeran Jeon; Won Woo Ro; Murali Annavaram
This paper presents Warped-Compression, a warp-level register compression scheme for reducing GPU power consumption. This work is motivated by the observation that the register values of threads within the same warp are similar; that is, the arithmetic difference between the registers of two successive threads is small. Removing data redundancy of register values through register compression reduces the effective register width, thereby enabling power reduction opportunities. GPU register files are huge because they must hold concurrent execution contexts and enable fast context switching. As a result, the register file consumes a large fraction of the total GPU chip power. GPU design trends show that the register file size will continue to increase to enable even more thread-level parallelism. To reduce register file data redundancy, Warped-Compression uses the low-cost and implementation-efficient base-delta-immediate (BDI) compression scheme, which takes advantage of the banked register file organization used in GPUs. Since threads within a warp write values with strong similarity, BDI can quickly compress and decompress by selecting either a single register, or one of the register banks, as the primary base and then computing delta values of all the other registers, or banks. Warped-Compression can be used to reduce both dynamic and leakage power. By compressing register values, each warp-level register access activates fewer register banks, which leads to a reduction in dynamic power. When fewer banks are used to store the register content, leakage power can be reduced by power gating the unused banks. Evaluation results show that register compression saves 25% of the total register file power consumption.
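To make the base-delta idea above concrete, here is a minimal sketch (Python, illustrative only): one thread's value serves as the base and the remaining lanes are stored as narrow deltas. The 32-lane warp width is real; the 8-bit delta width and the per-register granularity are assumptions of this sketch, not parameters from the paper.

```python
# Illustrative sketch of base-delta-immediate (BDI) compression applied to one
# warp-level register: pick the first thread's value as the base and check
# whether every other lane's value fits in a small signed delta.

def bdi_compress(lane_values, delta_bits=8):
    """Return (base, deltas) if the warp register compresses, else None.

    lane_values: 32-bit register value of each thread in the warp.
    delta_bits:  width of each per-lane delta (an assumed parameter).
    """
    base = lane_values[0]
    lo, hi = -(1 << (delta_bits - 1)), (1 << (delta_bits - 1)) - 1
    deltas = [v - base for v in lane_values]
    if all(lo <= d <= hi for d in deltas):
        return base, deltas          # compressed: 32 bits + 32 * delta_bits
    return None                      # incompressible: keep the full 32 * 32 bits

def bdi_decompress(base, deltas):
    return [base + d for d in deltas]

if __name__ == "__main__":
    # Threads in a warp often hold nearby values (e.g., consecutive addresses).
    warp_reg = [0x1000_0000 + 4 * lane for lane in range(32)]
    compressed = bdi_compress(warp_reg)
    assert compressed is not None
    assert bdi_decompress(*compressed) == warp_reg
```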
ieee international symposium on workload characterization | 2014
Qiumin Xu; Hyeran Jeon; Murali Annavaram
Large-scale graph processing is now a critical component of many data analytics workloads. Graph processing is used in applications ranging from social networking Web sites that provide context-aware services based on user connectivity data to medical informatics systems that diagnose a disease from a given set of symptoms. Graph processing has several inherently parallel computation steps interspersed with synchronization needs. Graphics processing units (GPUs) are being proposed as a power-efficient choice for exploiting this inherent parallelism. There have been several efforts to efficiently map graph applications to GPUs. However, there have not been many characterization studies that provide an in-depth understanding of the interaction between GPGPU hardware components and the graph applications mapped to execute on GPUs. In this study, we compiled 12 graph applications and collected performance and utilization statistics for the core components of the GPU while running the applications on both a cycle-accurate simulator and a real GPU card. We present detailed application execution characteristics on GPUs. Then, we discuss and suggest several approaches to optimizing GPU hardware to enhance graph application performance.
international symposium on microarchitecture | 2012
Hyeran Jeon; Murali Annavaram
General-purpose graphics processing units (GPGPUs) are feature-rich GPUs that provide general-purpose computing ability with a massive number of parallel threads. The massive parallelism combined with programmability has made GPGPUs the most attractive choice in supercomputing centers. Unsurprisingly, most GPGPU-based studies have focused on performance improvement, leveraging GPGPUs' high degree of parallelism. However, for many scientific applications that commonly run on supercomputers, program correctness is as important as performance. A few soft or hard errors can lead to corrupted results and can potentially waste days or even months of computing effort. In this research we exploit unique architectural characteristics of GPGPUs to propose a lightweight error detection method, called Warped Dual Modular Redundancy (Warped-DMR). Warped-DMR detects errors in computation by relying on opportunistic spatial and temporal dual-modular execution of code. Warped-DMR is lightweight because it exploits the underutilized parallelism in GPGPU computing for error detection. Error detection spans both within a warp and between warps, called intra-warp and inter-warp DMR, respectively. Warped-DMR achieves 96% error coverage while incurring a worst-case 16% performance overhead, without extra execution units or programmer effort.
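A minimal sketch of the intra-warp DMR idea follows: when branch divergence leaves some lanes idle, an active lane's operation is replayed on a spare lane and the two results are compared. The lane-pairing policy and the software-level modeling are simplifying assumptions of this sketch; the paper realizes the mechanism in hardware.

```python
# Sketch of opportunistic intra-warp dual-modular redundancy: when a warp
# diverges and some lanes are inactive, re-run an active lane's operation on a
# spare inactive lane and compare the two results. The fault model and the
# pairing of lanes are simplified for illustration.

def execute_warp_dmr(op, operands, active_mask):
    """operands: per-lane operand tuples; active_mask: which lanes are live."""
    n = len(active_mask)
    results = [op(*operands[i]) if active_mask[i] else None for i in range(n)]

    inactive = [i for i in range(n) if not active_mask[i]]
    mismatches = []
    for i in range(n):
        if active_mask[i] and inactive:
            inactive.pop()                        # borrow one idle lane
            shadow = op(*operands[i])             # redundant execution
            if shadow != results[i]:
                mismatches.append(i)              # error detected on lane i
    return results, mismatches

if __name__ == "__main__":
    add = lambda a, b: a + b
    ops = [(lane, 10) for lane in range(8)]
    mask = [True, True, False, False, True, False, True, False]  # divergence
    res, errs = execute_warp_dmr(add, ops, mask)
    print(res, errs)
```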
international symposium on computer architecture | 2016
Qiumin Xu; Hyeran Jeon; Keunsoo Kim; Won Woo Ro; Murali Annavaram
As technology scales, GPUs are forecasted to incorporate an ever-increasing amount of computing resources to support thread-level parallelism. But even with the best effort, exposing massive thread-level parallelism from a single GPU kernel, particularly from general purpose applications, is going to be a difficult challenge. In some cases, even if there is sufficient thread-level parallelism in a kernel, there may not be enough available memory bandwidth to support such massive concurrent thread execution. Hence, GPU resources may be underutilized as more general purpose applications are ported to execute on GPUs. In this paper, we explore multiprogramming GPUs as a way to resolve the resource underutilization issue. There is growing hardware support for multiprogramming on GPUs. Hyper-Q, introduced in the Kepler architecture, enables multiple kernels to be invoked via tens of hardware queue streams. Spatial multitasking has been proposed to partition GPU resources across multiple kernels. But the partitioning is done at the coarse granularity of streaming multiprocessors (SMs), where each kernel is assigned to a subset of SMs. In this paper, we advocate for partitioning a single SM across multiple kernels, which we term intra-SM slicing. We explore various intra-SM slicing strategies that slice resources within each SM to concurrently run multiple kernels on the SM. Our results show that no single intra-SM slicing strategy derives the best performance for all application pairs. We propose Warped-Slicer, a dynamic intra-SM slicing strategy that uses an analytical method for calculating the SM resource partitioning across different kernels that maximizes performance. The model relies on a set of short online profile runs to determine how each kernel's performance varies as more thread blocks from each kernel are assigned to an SM. The model takes into account the interference effect of shared resource usage across multiple kernels. The model is also computationally efficient and can determine the resource partitioning quickly, enabling dynamic decision making as new kernels enter the system. We demonstrate that the proposed Warped-Slicer approach improves performance by 23% over the baseline multiprogramming approach with minimal hardware overhead.
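The sketch below illustrates the partitioning decision only, as a brute-force stand-in for the paper's analytical method: given profiled performance curves for two kernels as a function of resident thread blocks, pick the block split that maximizes combined throughput. The curves and the block budget are invented numbers.

```python
# Illustrative sketch of profile-driven intra-SM slicing: given each kernel's
# profiled throughput as a function of how many of its thread blocks run on
# one SM, search for the block split that maximizes the combined throughput.

def best_split(perf_a, perf_b, max_blocks):
    """perf_x[i] = normalized throughput of kernel x with i blocks on the SM."""
    best = (0.0, 0, 0)
    for a in range(min(len(perf_a), max_blocks + 1)):
        for b in range(min(len(perf_b), max_blocks - a + 1)):
            total = perf_a[a] + perf_b[b]
            if total > best[0]:
                best = (total, a, b)
    return best

if __name__ == "__main__":
    # Hypothetical profiles: kernel A saturates early, kernel B keeps scaling.
    perf_a = [0.0, 0.55, 0.80, 0.90, 0.92, 0.92, 0.92]
    perf_b = [0.0, 0.20, 0.40, 0.58, 0.72, 0.84, 0.95]
    total, blocks_a, blocks_b = best_split(perf_a, perf_b, max_blocks=8)
    print(f"assign {blocks_a} blocks of A and {blocks_b} of B -> {total:.2f}")
```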
IEEE Transactions on Computers | 2010
Hyeran Jeon; Woo Hyong Lee; Sung Woo Chung
Load balancing has long been known as an essential feature for enhancing the performance of distributed systems. For embedded systems, however, this is not always true, since load balancing leads to lavish power consumption by fully utilizing all the embedded cores even for a small number of tasks. Furthermore, previously proposed load unbalancing strategies do not give much consideration to the characteristics of embedded systems' real workloads. In this paper, to resolve this problem, we propose a novel load unbalancing strategy based on task characteristics: periodic and aperiodic. In the proposed strategy, the periodic tasks that are more likely to be executed repeatedly are concentrated on the minimum number of cores, whereas the aperiodic tasks that are not likely to occur again soon are distributed to the maximum number of cores. The experimental results on an ARM11 MPCore test chip show that the proposed strategy reduces power consumption and the mean waiting time of aperiodic tasks by up to 26 percent and 82 percent, respectively, compared to the load balancing strategy. Compared to the aggressive load unbalancing strategy, the proposed strategy also reduces the mean waiting time of aperiodic tasks by 92 percent with similar power efficiency.
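A minimal sketch of the placement policy described above, under assumed load metrics and capacity limits: periodic tasks are packed onto as few cores as possible (so idle cores can be powered down), while aperiodic tasks are spread to the least-loaded core to cut their waiting time.

```python
# Sketch of a task dispatcher that packs periodic tasks and spreads aperiodic
# ones. The utilization model and capacity threshold are illustrative only.

class Core:
    def __init__(self, cid):
        self.cid = cid
        self.load = 0.0            # fraction of core utilization in use
        self.tasks = []

def place_task(cores, task_id, utilization, periodic, capacity=1.0):
    if periodic:
        # Pack: first core (lowest id) that still has room.
        ordered = sorted(cores, key=lambda c: c.cid)
        target = next((c for c in ordered if c.load + utilization <= capacity),
                      min(cores, key=lambda c: c.load))
    else:
        # Spread: least-loaded core, so aperiodic tasks rarely wait.
        target = min(cores, key=lambda c: c.load)
    target.load += utilization
    target.tasks.append(task_id)
    return target.cid

if __name__ == "__main__":
    cores = [Core(i) for i in range(4)]
    for t in range(6):
        place_task(cores, f"periodic-{t}", 0.3, periodic=True)
    for t in range(4):
        place_task(cores, f"aperiodic-{t}", 0.1, periodic=False)
    for c in cores:
        print(c.cid, round(c.load, 2), c.tasks)
```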
international symposium on microarchitecture | 2015
Hyeran Jeon; Gokul Subramanian Ravi; Nam Sung Kim; Murali Annavaram
To support a massive number of parallel thread contexts, Graphics Processing Units (GPUs) use a huge register file, which is responsible for a large fraction of the GPU's total power and area. The conventional belief is that a large register file is inevitable for accommodating more parallel thread contexts, and technology scaling makes it feasible to incorporate an ever-increasing register file size. In this paper, we demonstrate that the register file size need not be large to accommodate more thread contexts. We first characterize the useful lifetime of a register and show that register lifetimes vary drastically across the various registers allocated to a kernel. While some registers are alive for the entire duration of the kernel execution, some registers have a short lifespan. We propose GPU register file virtualization, which allows multiple warps to share physical registers. Since warps may be scheduled for execution at different points in time, we propose to proactively release dead registers from one warp and re-allocate them to a different warp that may occur later in time, thereby reducing the needless demand for physical registers. By using register virtualization, we shrink the architected register space to a smaller physical register space. By under-provisioning the physical register file to be smaller than the architected register file, we reduce dynamic and static power consumption. We then develop a new register throttling mechanism to run applications that exceed the size of the under-provisioned register file without any deadlock. Our evaluation shows that even after halving the architected register file size using the proposed GPU register file virtualization, applications run successfully with negligible performance overhead.
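The renaming mechanism can be sketched as a small mapping table with a free list: a physical register is allocated on an architected register's first write and released at its last read, so warps active at different times reuse the same physical entries. The explicit last-use flag stands in for the compiler/lifetime information the paper relies on; the structure below is an illustration, not the paper's hardware design.

```python
# Sketch of register-file virtualization: architected registers of each warp
# map to physical registers on first write and return to a free list after
# their last read, letting a physical file smaller than the architected space
# serve all warps.

class VirtualRegisterFile:
    def __init__(self, num_physical):
        self.free = list(range(num_physical))
        self.map = {}                              # (warp, arch_reg) -> phys

    def write(self, warp, arch_reg):
        key = (warp, arch_reg)
        if key not in self.map:
            if not self.free:
                raise RuntimeError("physical registers exhausted: throttle warps")
            self.map[key] = self.free.pop()
        return self.map[key]

    def read(self, warp, arch_reg, last_use=False):
        phys = self.map[(warp, arch_reg)]
        if last_use:                               # dead register: release early
            self.free.append(self.map.pop((warp, arch_reg)))
        return phys

if __name__ == "__main__":
    rf = VirtualRegisterFile(num_physical=4)       # smaller than architected space
    rf.write(0, "r1"); rf.write(0, "r2")
    rf.read(0, "r1", last_use=True)                # r1 freed, reusable by warp 1
    rf.write(1, "r1")
    print(len(rf.free), "physical registers still free")
```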
international conference on parallel processing | 2010
Hyeran Jeon; Yinglong Xia; Viktor K. Prasanna
Exact inference is a key problem in exploring probabilistic graphical models. The computational complexity of inference increases dramatically with the parameters of the graphical model. Achieving scalability over hundreds of threads remains a fundamental challenge. In this paper, we use a lightweight scheduler hosted by the CPU to allocate cliques in junction trees to the GPGPU at run time. The scheduler merges multiple small cliques or splits large cliques dynamically so as to maximize the utilization of the GPGPU resources. We implement node-level primitives on the GPGPU to process the cliques assigned by the CPU. We propose a conflict-free potential table organization and an efficient data layout for coalescing memory accesses. In addition, we develop a double-buffering-based asynchronous data transfer between the CPU and GPGPU to overlap clique processing on the GPGPU with data transfer and scheduling activities. Our implementation achieved a 30X speedup compared with state-of-the-art multicore processors.
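A rough sketch of the merge/split scheduling idea: small cliques are batched together and oversized cliques are chopped up so that each unit of work sent to the GPGPU is close to a target size. The target size and the work metric are assumptions for illustration, not values from the paper.

```python
# Sketch of a CPU-side scheduler that merges small cliques and splits large
# ones into batches of roughly `target` work before dispatching them.

def schedule_cliques(clique_sizes, target):
    """Yield batches of (clique_id, chunk_size) totaling about `target` work."""
    batch, filled = [], 0
    for cid, size in enumerate(clique_sizes):
        while size > 0:
            take = min(size, target - filled)      # split a large clique
            batch.append((cid, take))
            filled += take
            size -= take
            if filled == target:                   # batch full: dispatch it
                yield batch
                batch, filled = [], 0
    if batch:
        yield batch                                # last, possibly partial batch

if __name__ == "__main__":
    sizes = [100, 30, 20, 500, 60]                 # e.g., potential-table entries
    for i, b in enumerate(schedule_cliques(sizes, target=128)):
        print("GPU batch", i, b)
```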
computing frontiers | 2013
Hyeran Jeon; Kaoutar El Maghraoui; Gokul B. Kandiraju
The Flash Translation Layer (FTL) is the core engine of Solid State Disks (SSDs). It is responsible for managing the virtual-to-physical address mappings and emulating the functionality of a normal block-level device. SSD performance is highly dependent on the design of the FTL. Over the last few years, several FTL schemes have been proposed. Hybrid FTL schemes have gained popularity since they try to combine the benefits of both page-level mapping and block-level mapping schemes; examples include BAST, FAST, and LAST. To provide high performance, FTL designers face several cross-cutting issues: the right balance between coarse- and fine-grain address mapping, the asymmetric nature of reads and writes, the write amplification property of Flash memory, and the wear-out behavior of Flash. MapReduce has become a very popular paradigm for performing parallel and distributed computations on large data. Hadoop, an open-source implementation of MapReduce, has accelerated MapReduce adoption. Flash SSDs are increasingly being used as a storage solution in Hadoop deployments for faster processing and better energy utilization. Little work has been done to understand the endurance implications of SSDs on Hadoop-based workloads. In this paper, using a highly flexible and reconfigurable kernel-level simulation infrastructure, we investigate the internal characteristics of various hybrid FTL schemes using a representative set of Hadoop workloads. Our investigation brings out the wear-out behavior of SSDs for Hadoop-based workloads, including wear-leveling details, garbage collection, translation, and block/page mappings, and advocates the need for dynamic tuning of FTL parameters for these workloads.
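For readers unfamiliar with FTL mapping, the sketch below shows the page-level half of the design space: writes go out of place, the logical-to-physical map is updated, and the stale copy is marked invalid for later garbage collection. Hybrid schemes such as BAST/FAST/LAST layer block-level log regions on top of this; the geometry here is invented.

```python
# Minimal sketch of a page-level FTL write/read path with out-of-place writes.

class PageLevelFTL:
    def __init__(self, num_pages):
        self.l2p = {}                                # logical -> physical page
        self.free_pages = list(range(num_pages))
        self.invalid = set()                         # candidates for GC

    def write(self, lpn, data, flash):
        if not self.free_pages:
            raise RuntimeError("no free pages: garbage collection needed")
        ppn = self.free_pages.pop(0)
        flash[ppn] = data                            # program the new page
        if lpn in self.l2p:
            self.invalid.add(self.l2p[lpn])          # old copy becomes invalid
        self.l2p[lpn] = ppn

    def read(self, lpn, flash):
        return flash[self.l2p[lpn]]

if __name__ == "__main__":
    flash = {}
    ftl = PageLevelFTL(num_pages=8)
    ftl.write(0, "v1", flash)
    ftl.write(0, "v2", flash)                        # overwrite goes out of place
    print(ftl.read(0, flash), "invalid pages:", ftl.invalid)
```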
international test conference | 2014
Hyeran Jeon; Gabriel H. Loh; Murali Annavaram
Die-stacked (3D) DRAM is a promising memory architecture to satisfy the high bandwidth and low latency needs of many computing systems. However, with technology scaling, memory devices are expected to experience a significant increase in single- and multi-bit errors. 3D DRAM has the added burden of protecting against single through-silicon-via (TSV) failures, which translate into multiple bit errors in a single cache line, as well as multiple TSV failures that may lead to an entire channel failure. To exploit the wide interface capability of 3D DRAM, large chunks of data are laid out contiguously in a single channel and an entire cache line is sourced from a single channel. Conventional approaches such as ECC DIMM and chipkill-correct are inefficient in the context of 3D DRAM because they spread data across multiple DRAM layers to protect against failures and also place restrictions on the number of memory layers that must be protected together. This paper proposes a two-level error correction technique that takes 3D DRAM's unique organization into account. First, we propose a new symbol-based ECC layout to cover 3D DRAM-specific errors as well as various well-known DRAM failure modes. Then, an XOR-based correction code (XCC) is used to correct the multi-bit errors that are not correctable by the symbol-based ECC. To take advantage of 3D DRAM's channel-level parallelism, a permutation-based ECC placement is used. As an optimization, an ECC cache and decoupled XCC update are used to improve read and write performance without compromising reliability. The proposed approaches effectively reduce the FIT rate with almost negligible performance overhead.
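The XOR-based correction layer can be illustrated with a small sketch: keep an XOR parity over the chunks a line group maps to across channels, and rebuild a chunk lost to a multi-bit or channel failure from the survivors plus the parity. Chunk size and grouping are assumptions; the symbol-based ECC that detects and locates the failing chunk is not modeled here.

```python
# Sketch of XOR-based reconstruction across channels: the parity of all chunks
# lets any single failed chunk be rebuilt from the surviving chunks.
from functools import reduce

def xor_parity(chunks):
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), chunks)

def rebuild(chunks, parity, failed_index):
    survivors = [c for i, c in enumerate(chunks) if i != failed_index]
    return xor_parity(survivors + [parity])

if __name__ == "__main__":
    channels = [bytes([i] * 8) for i in range(4)]    # 4 channels, 8-byte chunks
    parity = xor_parity(channels)
    corrupted = list(channels)
    corrupted[2] = bytes(8)                          # channel 2 fails
    recovered = rebuild(corrupted, parity, failed_index=2)
    assert recovered == channels[2]
    print("recovered:", recovered.hex())
```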
high-performance computer architecture | 2017
Mohammad Abdel-Majeed; Alireza Shafaei; Hyeran Jeon; Massoud Pedram; Murali Annavaram
GPU adoption for general purpose computing has been accelerating. To support a large number of concurrently active threads, GPUs are provisioned with a very large register file (RF). The RF power consumption is a critical concern. One option to reduce the power consumption dramatically is to use near-threshold voltage (NTV) to operate the RF. However, operating MOSFET devices at NTV is fraught with stability and reliability concerns. The adoption of FinFET devices in the chip industry is providing a promising path to operate the RF at NTV while satisfactorily tackling the stability and reliability concerns. However, the fundamental problem of NTV operation, namely slow access latency, remains. To tackle this challenge, in this paper we propose to build a partitioned RF using FinFET technology. The partitioned RF design exploits our observation that applications exhibit a strong preference to utilize a small subset of their registers. One way to exploit this behavior is to cache the RF content, as has been proposed in recent works. However, caching leads to unnecessary area overheads since a fraction of the RF must be replicated. Furthermore, we show that caching is not efficient as we increase the number of issued instructions per cycle, which is the expected trend in GPU designs. The proposed partitioned RF splits the registers into two partitions: the highly accessed registers are stored in a small RF that switches between high and low power modes. We use the FinFET's back gate control to provide low-overhead switching between the two power modes. The remaining registers are stored in a large RF partition that always operates at NTV. The assignment of the registers to the two partitions is based on statistics collected by a hybrid profiling technique that combines compiler-based profiling and the pilot warp profiling technique proposed in this paper. The partitioned FinFET RF is able to save 39% and 54% of the RF leakage and dynamic energy, respectively, and suffers less than 2% performance overhead.
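The partitioning decision itself can be sketched simply: rank registers by profiled access count and place the hottest ones in the small fast partition, leaving the rest in the large NTV partition. The partition size and the profile below are invented; a plain dictionary of counts stands in for the paper's hybrid compiler/pilot-warp profiling.

```python
# Sketch of the register-partitioning decision: hot registers go to the small
# fast partition, cold registers to the large NTV partition.

def partition_registers(access_counts, fast_slots):
    ranked = sorted(access_counts, key=access_counts.get, reverse=True)
    fast = set(ranked[:fast_slots])            # hot registers: fast partition
    ntv = set(ranked[fast_slots:])             # cold registers: NTV partition
    return fast, ntv

if __name__ == "__main__":
    # Hypothetical profile: most accesses hit a few registers.
    profile = {"r0": 950, "r1": 900, "r2": 40, "r3": 25, "r4": 5, "r5": 2}
    fast, ntv = partition_registers(profile, fast_slots=2)
    print("fast partition:", sorted(fast))
    print("NTV partition :", sorted(ntv))
```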