
Publication


Featured research published by Sangyeun Cho.


International Symposium on Microarchitecture | 2009

Flip-N-Write: a simple deterministic technique to improve PRAM write performance, energy and endurance

Sangyeun Cho; Hyunjin Lee

The phase-change random access memory (PRAM) technology is fast maturing to production levels. The main advantages of PRAM are non-volatility, byte addressability, in-place programmability, low-power operation, and higher write endurance than that of current flash memories. However, the relatively low write bandwidth and the less-than-desirable write endurance of PRAM leave room for improvement. This paper proposes and evaluates Flip-N-Write, a simple microarchitectural technique to replace a PRAM write operation with a more efficient read-modify-write operation. On a write, after a quick bit-by-bit inspection of the original data word and the new data word, Flip-N-Write writes either the new data word or the “flipped” value of it. Flip-N-Write introduces a single bit associated with each PRAM word to indicate whether the PRAM word has been flipped. We analytically and experimentally show that the proposed technique reduces the PRAM write time by half, more than doubles the write endurance, and achieves commensurate savings in write energy under the same instantaneous write power constraint. Due to its simplicity, Flip-N-Write is straightforward to implement within a PRAM device.
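The flip-or-not decision the abstract describes fits in a few lines. The sketch below is an illustrative model only, not the paper's circuit-level implementation; the 32-bit word width and the function names are assumptions.

```python
MASK = (1 << 32) - 1  # hypothetical 32-bit PRAM word

def flip_n_write(stored_raw: int, new_word: int):
    """One Flip-N-Write update: inspect the raw cell contents, then program
    either new_word or its complement, whichever changes fewer cells.
    Returns (raw value to store, flip bit, number of cells changed)."""
    plain_changes = bin(stored_raw ^ new_word).count("1")
    flipped_word = new_word ^ MASK
    flipped_changes = bin(stored_raw ^ flipped_word).count("1")
    if plain_changes <= flipped_changes:
        return new_word, 0, plain_changes    # store as-is, flip bit = 0
    return flipped_word, 1, flipped_changes  # store complement, flip bit = 1

def read_word(stored_raw: int, flip_bit: int) -> int:
    """Reads undo the flip using the extra per-word bit."""
    return stored_raw ^ MASK if flip_bit else stored_raw
```

Because the chosen encoding never differs from the stored value in more than half the bit positions, at most 16 of the 32 cells are ever programmed per write, which is the source of the claimed halving of write time under a fixed bits-per-cycle power budget.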


IEEE Computer Architecture Letters | 2008

Corollaries to Amdahl's Law for Energy

Sangyeun Cho; Rami G. Melhem

This paper studies the important interaction between parallelization and energy consumption in a parallelizable application. Given the ratio of serial and parallel portions in an application and the number of processors, we first derive the optimal frequencies allocated to the serial and parallel regions in the application to minimize the total energy consumption while the execution time is preserved (i.e., speedup = 1). We show that the dynamic energy improvement due to parallelization rises faster with an increasing number of processors than the speed improvement given by the well-known Amdahl's Law. Furthermore, we determine the conditions under which one can obtain both energy and speed improvement, as well as the amount of improvement. The formulas we obtain capture the fundamental relationship between parallelization, speedup, and energy consumption and can be directly utilized in energy-aware processor resource management. Our results form a basis for several interesting research directions in the area of power- and energy-aware parallel processing.
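The qualitative claim can be illustrated with a toy model: Amdahl's speedup formula, plus an energy model that assumes dynamic power scales as f^3. If the parallel region is slowed to f = 1/n so total execution time matches the one-core run (speedup = 1), the dynamic-energy saving grows with an n-squared term while speedup grows only with n. The f^3 assumption and the resulting formula are this sketch's, not necessarily the paper's exact derivation.

```python
def amdahl_speedup(s: float, n: int) -> float:
    """Classic Amdahl's Law: serial fraction s, n processors."""
    return 1.0 / (s + (1.0 - s) / n)

def dynamic_energy_ratio(s: float, n: int) -> float:
    """Baseline dynamic energy divided by energy when the parallel region
    runs n-wide at frequency 1/n (execution time preserved), assuming
    dynamic power ~ f**3: each core burns (1/n)**3 power for the same
    duration, so the parallel region's energy shrinks by n**2."""
    return 1.0 / (s + (1.0 - s) / n**2)
```

For a 20% serial fraction on 4 processors, the speedup bound is 2.5x while the dynamic-energy improvement under this model is 4x, matching the claim that the energy function rises faster than the speedup function.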


IEEE Transactions on Parallel and Distributed Systems | 2010

On the Interplay of Parallelization, Program Performance, and Energy Consumption

Sangyeun Cho; Rami G. Melhem

This paper derives simple, yet fundamental formulas to describe the interplay between parallelism of an application, program performance, and energy consumption. Given the ratio of serial and parallel portions in an application and the number of processors, we derive optimal frequencies allocated to the serial and parallel regions in an application to either minimize the total energy consumption or minimize the energy-delay product. The impact of static power is revealed by considering the ratio between static and dynamic power and quantifying the advantages of adding to the architecture the capability to turn off individual processors and save static energy. We further determine the conditions under which one can obtain both energy and speed improvement, as well as the amount of improvement. While the formulas we obtain use simplifying assumptions, they provide valuable theoretical insights into energy-aware processor resource management. Our results form a basis for several interesting research directions in the area of energy-aware multicore processor architectures.


International Conference on Parallel Processing | 2012

Characterizing Machines and Workloads on a Google Cluster

Zitao Liu; Sangyeun Cho

Cloud computing offers high scalability, flexibility and cost-effectiveness to meet emerging computing requirements. Understanding the characteristics of real workloads on a large production cloud cluster benefits not only cloud service providers but also researchers and daily users. This paper studies a large-scale Google cluster usage trace dataset and characterizes how the machines in the cluster are managed and the workloads submitted during a 29-day period behave. We focus on the frequency and pattern of machine maintenance events, job- and task-level workload behavior, and how the overall cluster resources are utilized.
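A characterization like the one described boils down to grouping trace rows by machine and event type. The rows and field layout below are placeholders in the style of the released trace's machine-events table (timestamps in microseconds; event types 0 = add, 1 = remove, 2 = update); the exact schema should be taken from the trace documentation.

```python
from collections import Counter

# Hypothetical machine-event rows: (timestamp_us, machine_id, event_type).
events = [
    (0, 101, 0), (0, 102, 0),
    (3_600_000_000, 101, 1),  # machine 101 taken offline (e.g., maintenance)
    (7_200_000_000, 101, 0),  # ...and re-added an hour later
]

def event_frequency(rows):
    """Count how often each machine is removed from the cluster,
    a rough proxy for maintenance-event frequency per machine."""
    return Counter(m for _, m, etype in rows if etype == 1)
```

The same loop, run over the full 29-day table, yields the per-machine maintenance frequencies and patterns the paper reports on.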


High-Performance Computer Architecture | 2011

CloudCache: Expanding and shrinking private caches

Hyunjin Lee; Sangyeun Cho; Bruce R. Childers

The number of cores in a single chip multiprocessor is expected to grow in coming years. Likewise, aggregate on-chip cache capacity is increasing fast and its effective utilization is becoming ever more important. Furthermore, available cores are expected to be underutilized due to the power wall and highly heterogeneous future workloads. This trend makes existing L2 cache management techniques less effective for two problems: increased capacity interference between working cores and longer L2 access latency. We propose a novel scalable cache management framework called CloudCache that creates dynamically expanding and shrinking L2 caches for working threads with fine-grained hardware monitoring and control. The key architectural components of CloudCache are L2 cache chaining, inter- and intra-bank cache partitioning, and a performance-optimized coherence protocol. Our extensive experimental evaluation demonstrates that CloudCache significantly improves performance of a wide range of workloads when all or a subset of cores are occupied.
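As a cartoon of the expanding-and-shrinking idea, the sketch below reassigns L2 banks to threads in proportion to their measured demand. Both the function and the proportional policy are illustrative assumptions; CloudCache's actual mechanisms are cache chaining, inter- and intra-bank partitioning, and a modified coherence protocol, none of which this toy captures.

```python
def allocate_banks(demands: dict, total_banks: int) -> dict:
    """Give each active thread L2 banks roughly in proportion to its
    measured demand (e.g., recent miss counts); idle threads get none,
    so their capacity flows to working threads. Assumes at least one
    thread has nonzero demand; rounding may leave a small imbalance
    that a real controller would rebalance."""
    active = {t: d for t, d in demands.items() if d > 0}
    total = sum(active.values())
    return {t: max(1, round(total_banks * d / total))
            for t, d in active.items()}
```

When a thread goes idle its allocation shrinks to nothing and the freed banks expand the caches of the remaining working cores, which is the underutilization scenario the abstract targets.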


International Symposium on Computer Architecture | 1999

Decoupling local variable accesses in a wide-issue superscalar processor

Sangyeun Cho; Pen Chung Yew; Gyungho Lee

Providing adequate data bandwidth is extremely important for a wide-issue superscalar processor to achieve its full performance potential. Adding a large number of ports to a data cache, however, becomes increasingly inefficient and can add to the hardware complexity significantly. This paper takes an alternative or complementary approach for providing more data bandwidth, called the data-decoupled architecture. The approach, with support from the compiler and/or hardware, partitions the memory stream into two independent streams early in the processor pipeline, and feeds each stream to a separate memory access queue and cache. Under this model, the paper studies the potential of decoupling memory accesses to a program's local variables that are allocated on the run-time stack. Using a set of integer and floating-point programs from the SPEC95 benchmark suite, it is shown that local variable accesses constitute a large portion of all the memory references, while their reference space is very small, averaging around 7 words per (static) procedure. To service local variable accesses quickly, two optimizations, fast data forwarding and access combining, are proposed and studied. Some of the important design parameters, such as the cache size, the number of cache ports, and the degree of access combining, are studied based on simulations. The potential performance of the proposed scheme is measured using various configurations, and it is concluded that the scheme can become a viable alternative to building a single multi-ported data cache.
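The stream-splitting step can be pictured as a simple address classifier. The stack address range below is a made-up example (a real implementation would use compiler annotations or the stack-pointer-relative addressing mode rather than fixed bounds):

```python
# Hypothetical address map: stack occupies the top of a 32-bit space.
STACK_BASE = 0x7FF0_0000
STACK_TOP = 0x8000_0000

def partition_stream(accesses):
    """Split one memory-reference stream into a local-variable (stack)
    stream and an everything-else stream, as the data-decoupled
    architecture does early in the pipeline. Each stream then feeds
    its own access queue and cache."""
    stack_q, other_q = [], []
    for addr in accesses:
        (stack_q if STACK_BASE <= addr < STACK_TOP else other_q).append(addr)
    return stack_q, other_q
```

Because the stack stream's footprint is tiny (around 7 words per static procedure per the paper), it can be served by a small, fast, separately ported structure instead of extra ports on the main data cache.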


IEEE Computer Society Annual Symposium on VLSI | 2007

Performance of Graceful Degradation for Cache Faults

Hyunjin Lee; Sangyeun Cho; Bruce R. Childers

In sub-90nm technologies, more frequent hard faults pose a serious burden on processor design and yield control. In addition to manufacturing-time chip repair schemes, microarchitectural techniques to make processor components resilient to hard faults become increasingly important. This paper considers defects in cache memory and studies their impact on program performance using a fault-degradable cache model. We first describe how defects at the circuit level in cache manifest themselves at the microarchitecture level. We then examine several strategies for masking faults by disabling faulty resources, such as lines, sets, ways, ports, or even the whole cache. We also propose an efficient cache set remapping scheme to recover performance lost due to failed sets. Using a new simulation tool, called CAFE, we study how cache faults impact program performance under the various masking schemes.
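The set-remapping idea can be sketched as an indirection table in front of the index function. The round-robin borrowing policy and 64-byte line size here are assumptions for illustration, not the paper's specific scheme:

```python
def build_remap(num_sets: int, faulty: set) -> dict:
    """Map each cache set index to a healthy set: healthy sets keep
    their own index, faulty sets borrow a healthy one (so accesses to
    a failed set still hit somewhere instead of always missing)."""
    healthy = [s for s in range(num_sets) if s not in faulty]
    return {s: (s if s not in faulty else healthy[s % len(healthy)])
            for s in range(num_sets)}

def lookup_set(addr: int, num_sets: int, remap: dict) -> int:
    """Index computation with remapping; 64-byte cache lines assumed."""
    return remap[(addr // 64) % num_sets]
```

Borrowed sets see extra conflict pressure, which is why disabling versus remapping is a performance trade-off worth measuring, as the paper does with its simulator.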


International Conference on Supercomputing | 2013

Active disk meets flash: a case for intelligent SSDs

Sangyeun Cho; Chanik Park; Hyunok Oh; Sungchan Kim; Youngmin Yi; Gregory R. Ganger

Intelligent solid-state drives (iSSDs) allow execution of limited application functions (e.g., data filtering or aggregation) on their internal hardware resources, exploiting SSD characteristics and trends to provide large and growing performance and energy efficiency benefits. Most notably, internal flash media bandwidth can be significantly (2-4x or more) higher than the external bandwidth with which the SSD is connected to a host system, and the higher internal bandwidth can be exploited within an iSSD. Also, SSD bandwidth is projected to increase rapidly over time, creating a substantial energy cost for streaming of data to an external CPU for processing, which can be avoided via iSSD processing. This paper makes a case for iSSDs by detailing these trends, quantifying the potential benefits across a range of application activities, describing how SSD architectures could be extended cost-effectively, and demonstrating the concept with measurements of a prototype iSSD running simple data scan functions. Our analyses indicate that, with less than a 2% increase in hardware cost over a traditional SSD, an iSSD can provide 2-4x performance increases and 5-27x energy efficiency gains for a range of data-intensive computations.
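The bandwidth argument reduces to where the filter runs relative to the host link. This toy comparison (function names and the transferred-record count are illustrative, not a real SSD interface) shows why a selective scan benefits from in-drive execution:

```python
def host_side_scan(records, predicate):
    """Traditional path: every record crosses the external host link,
    then the host CPU filters. Returns (matches, records transferred)."""
    matches = [r for r in records if predicate(r)]
    return matches, len(records)

def issd_scan(records, predicate):
    """iSSD path: the drive's internal cores filter at full internal
    flash bandwidth; only matches cross the slower external link."""
    matches = [r for r in records if predicate(r)]
    return matches, len(matches)
```

For a scan selecting 10% of records, the iSSD path moves 10x less data over the external interface while producing the identical result, which is the mechanism behind the reported performance and energy gains.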


International Conference on Parallel Processing | 2008

TPTS: A Novel Framework for Very Fast Manycore Processor Architecture Simulation

Sangyeun Cho; Socrates Demetriades; Shayne Evans; Lei Jin; Hyunjin Lee; Kiyeon Lee; Michael Moeng

The slow speed of conventional execution-driven architecture simulators is a serious impediment to obtaining desirable research productivity. This paper proposes and evaluates a fast manycore processor simulation framework called two-phase trace-driven simulation (TPTS), which splits detailed timing simulation into a trace generation phase and a trace simulation phase. Much of the simulation overhead caused by uninteresting architectural events is only incurred once during the trace generation phase and can be omitted in the repeated trace-driven simulations. We design and implement tsim, an event-driven manycore processor simulator that models detailed memory hierarchy, interconnect, and coherence protocol models based on the proposed TPTS framework. By applying aggressive event filtering, tsim achieves an impressive simulation speed of 146 MIPS, when running 16-thread parallel applications.
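The two-phase split can be sketched as "record once, replay cheaply." The toy below uses a random address stream and a trivial direct-mapped cache as the timing model; both are stand-ins for tsim's detailed functional and timing models:

```python
import random

def generate_trace(n_refs: int, seed: int = 1):
    """Phase 1 (run once): the slow functional model emits only the
    events that matter for timing, here just memory addresses."""
    rng = random.Random(seed)
    return [rng.randrange(0, 1 << 20) for _ in range(n_refs)]

def replay(trace, hit_latency: int, miss_latency: int, cache_sets: int = 256):
    """Phase 2 (repeated cheaply): drive a timing model from the trace.
    A new configuration means a new replay, not a new functional run."""
    tags, cycles = {}, 0
    for addr in trace:
        line = addr >> 6                       # 64-byte lines assumed
        s = line % cache_sets
        if tags.get(s) == line:
            cycles += hit_latency
        else:
            tags[s] = line
            cycles += miss_latency
    return cycles

trace = generate_trace(10_000)                 # pay the functional cost once
fast = replay(trace, hit_latency=1, miss_latency=50)
slow = replay(trace, hit_latency=2, miss_latency=200)  # new config, no regen
```

The savings come from amortization: the expensive phase-1 cost is incurred once, while each design point explored afterwards costs only a phase-2 replay.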


International Symposium on Computer Architecture | 2016

Biscuit: a framework for near-data processing of big data workloads

Bon-Cheol Gu; Andre S. Yoon; Duck-Ho Bae; Insoon Jo; Jin-Young Lee; Jonghyun Yoon; Jeong-Uk Kang; Moon-sang Kwon; Chanho Yoon; Sangyeun Cho; Jaeheon Jeong; Duckhyun Chang

Data-intensive queries are common in business intelligence, data warehousing and analytics applications. Typically, processing a query involves full inspection of large in-storage data sets by CPUs. An intuitive way to speed up such queries is to reduce the volume of data transferred over the storage network to a host system. This can be achieved by filtering out extraneous data within the storage, motivating a form of near-data processing. This work presents Biscuit, a novel near-data processing framework designed for modern solid-state drives. It allows programmers to write a data-intensive application to run on the host system and the storage system in a distributed, yet seamless manner. In order to offer a high-level programming model, Biscuit builds on the concept of data flow. Data processing tasks communicate through typed and data-ordered ports. Biscuit does not distinguish tasks that run on the host system and the storage system. As a result, Biscuit has desirable traits like generality and expressiveness, while promoting code reuse and naturally exposing concurrency. We implement Biscuit on a host system that runs the Linux OS and a high-performance solid-state drive. We demonstrate the effectiveness of our approach and implementation with experimental results. When data filtering is done by hardware in the solid-state drive, the average speed-up obtained for the top five queries of TPC-H is over 15x.
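The port-based dataflow model can be sketched with queues standing in for Biscuit's typed, data-ordered ports. This is a minimal analogue, not Biscuit's API; in the real framework either task could transparently be placed inside the SSD:

```python
from queue import Queue
from threading import Thread

def scan_task(out_port: Queue, rows):
    """Producer: streams rows out of 'storage' through its output port."""
    for row in rows:
        out_port.put(row)
    out_port.put(None)  # end-of-stream marker

def filter_task(in_port: Queue, out_port: Queue, predicate):
    """Consumer/producer: forwards only matching rows, preserving order.
    It never knows whether its peer runs on the host or in the drive."""
    while (row := in_port.get()) is not None:
        if predicate(row):
            out_port.put(row)
    out_port.put(None)

def run_pipeline(rows, predicate):
    a, b = Queue(), Queue()
    Thread(target=scan_task, args=(a, rows)).start()
    Thread(target=filter_task, args=(a, b, predicate)).start()
    out = []
    while (row := b.get()) is not None:
        out.append(row)
    return out
```

Because tasks only see their ports, moving `filter_task` into the drive changes where data is culled without changing the program, which is how Biscuit gets the in-storage filtering speedups it reports.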

Collaboration


Dive into Sangyeun Cho's collaborations.

Top Co-Authors

Rami G. Melhem (University of Pittsburgh)
Hyunjin Lee (University of Pittsburgh)
Kiyeon Lee (University of Pittsburgh)
Lei Jin (University of Pittsburgh)
Mohammad Hammoud (Carnegie Mellon University)
Michel Hanna (University of Pittsburgh)