Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Yeongon Cho is active.

Publication


Featured research published by Yeongon Cho.


high-performance computer architecture | 2014

Improving GPGPU resource utilization through alternative thread block scheduling

Minseok Lee; Seokwoo Song; Joosik Moon; John Kim; Woong Seo; Yeongon Cho; Soojung Ryu

High performance in GPGPU workloads is obtained by maximizing parallelism and fully utilizing the available resources. Thousands of threads are assigned to each core in units of CTAs (Cooperative Thread Arrays), or thread blocks, with each thread block consisting of multiple warps or wavefronts. The scheduling of these threads can have a significant impact on overall performance. In this work, we explore alternative thread block or CTA scheduling; in particular, we exploit the interaction between the thread block scheduler and the warp scheduler to improve performance. We explore two aspects of thread block scheduling: (1) LCS (lazy CTA scheduling), which restricts the maximum number of thread blocks allocated to each core, and (2) BCS (block CTA scheduling), where consecutive thread blocks are assigned to the same core. For LCS, we leverage a greedy warp scheduler to help determine the optimal number of thread blocks by measuring only the number of instructions issued, while for BCS, we propose an alternative warp scheduler that is aware of the "block" of CTAs allocated to a core. Building on LCS and the observation that the maximum number of CTAs does not necessarily maximize performance, we also propose mixed concurrent kernel execution, which enables multiple kernels to be allocated to the same core to maximize resource utilization and improve overall performance.
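The core of the LCS idea can be illustrated with a toy sketch: profile a kernel briefly under several candidate per-core CTA limits and keep the limit that issued the most instructions. The function name and the profile numbers below are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of lazy CTA scheduling (LCS): pick the thread-block
# limit per core that maximized instructions issued in a sampling window.
def lazy_cta_limit(issued_per_limit):
    """issued_per_limit maps candidate CTA limits to the number of
    instructions issued during a short profiling window; return the
    best-performing limit."""
    return max(issued_per_limit, key=issued_per_limit.get)

# Made-up profile: issue rate stalls when too many CTAs contend for cache
# and memory, so the maximum CTA count (8) is not necessarily optimal.
profile = {2: 1200, 4: 2100, 6: 1900, 8: 1500}
print(lazy_cta_limit(profile))  # 4
```

The point of the sketch is only that the best CTA count is found by measurement rather than by always allocating the hardware maximum.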


field-programmable technology | 2012

ULP-SRP: Ultra Low-Power Samsung Reconfigurable Processor for Biomedical Applications

Changmoo Kim; Moo-Kyoung Chung; Yeongon Cho; Mario Konijnenburg; Soojung Ryu; Jeongwook Kim

The latest biomedical applications require low energy consumption, high performance, and wide energy-performance scalability to adapt to various working environments. This paper presents ULP-SRP, an energy-efficient reconfigurable processor for biomedical applications. ULP-SRP uses a Coarse-Grained Reconfigurable Array (CGRA) for high-performance data processing with low energy consumption. For scalability, we propose three performance modes and a Unified Memory Architecture (UMA). Energy optimization is accomplished by run-time mode switching along with automatic power gating. Experimental results show that ULP-SRP achieves a 46.1% energy reduction compared to previous work.


international conference on computer design | 2012

Providing cost-effective on-chip network bandwidth in GPGPUs

Hanjoon Kim; John Kim; Woong Seo; Yeongon Cho; Soojung Ryu

Network-on-chip (NoC) bandwidth has a significant impact on overall performance in throughput-oriented processors such as GPGPUs. Although it has been commonly assumed that high NoC bandwidth can be provided through abundant on-chip wires, we show that increasing NoC router frequency results in a more cost-effective NoC. However, the router arbitration critical path can limit NoC router frequency. Thus, we propose a direct all-to-all network overlaid on mesh (DA2mesh) NoC architecture that exploits the traffic characteristics of GPGPUs and removes arbitration from the router pipeline. DA2mesh simplifies the router pipeline, improving performance by 36% while reducing NoC energy by 15%.


IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2013

Mapping and Scheduling of Tasks and Communications on Many-Core SoC Under Local Memory Constraint

Jinho Lee; Moo-Kyoung Chung; Yeongon Cho; Soojung Ryu; Jung Ho Ahn; Kiyoung Choi

There has been extensive research on mapping and scheduling tasks on many-core SoCs. However, none of it considers the optimization of communication types, which can significantly affect the performance, energy consumption, and local memory usage of the SoC. This paper presents an approach to the automatic mapping and scheduling of tasks and communications on a many-core SoC. The key idea is to decide, during mapping and scheduling, whether each communication uses message passing or shared memory. By assigning a proper type to each communication, we can optimize energy consumption, performance, or energy-delay product. To solve the optimization problem, the approach adopts a probabilistic algorithm coupled with heuristics. To enhance system throughput, it performs software-pipelined scheduling of the tasks using a modified iterative modulo scheduling technique. Experiments show that our algorithm achieves on average 50.1% lower energy consumption, 21.0% higher throughput, and a 64.9% lower energy-delay product compared to shared-memory-only communication.
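The per-communication type decision can be sketched as a toy cost comparison: for each producer-consumer edge, pick message passing or shared memory depending on which has the lower estimated energy-delay cost. The edge names and cost values below are made-up placeholders, and the real paper optimizes globally with a probabilistic algorithm rather than greedily per edge.

```python
# Toy sketch: choose a communication type per task-graph edge by
# comparing estimated energy-delay costs. Costs here are placeholders.
def assign_comm_types(edges):
    """edges: list of (name, mp_cost, shm_cost) tuples, where each cost
    is an energy-delay estimate for message passing vs. shared memory."""
    return {name: ("message-passing" if mp < shm else "shared-memory")
            for name, mp, shm in edges}

plan = assign_comm_types([("t0->t1", 3.2, 5.0), ("t1->t2", 7.5, 4.1)])
print(plan)  # {'t0->t1': 'message-passing', 't1->t2': 'shared-memory'}
```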


high-performance computer architecture | 2016

iPAWS: Instruction-issue pattern-based adaptive warp scheduling for GPGPUs

Minseok Lee; Gwangsun Kim; John Kim; Woong Seo; Yeongon Cho; Soojung Ryu

Thread or warp scheduling in GPGPUs has been shown to have a significant impact on overall performance. Recently proposed warp schedulers have been based on greedy warp scheduling, where some warps are prioritized over others. However, a single warp scheduling policy does not necessarily provide good performance across all types of workloads; in particular, we show that greedy warp schedulers are not necessarily optimal for workloads with inter-warp locality, where a simple round-robin warp scheduler provides better performance. Thus, we argue that instead of a single, static warp scheduling policy, an adaptive warp scheduler that dynamically changes policy based on workload characteristics should be leveraged. In this work, we propose an instruction-issue pattern-based adaptive warp scheduler (iPAWS) that dynamically adapts between a greedy warp scheduler and a fair, round-robin scheduler. We exploit the observation that workloads that favor a greedy warp scheduler have an instruction-issue pattern biased towards some warps, while workloads that favor a fair, round-robin warp scheduler tend to issue instructions across all of the warps. Our evaluations show that iPAWS is able to adapt to the more optimal warp scheduler dynamically and achieve performance within a few percent of the statically determined, more optimal warp scheduler. We also show that iPAWS can be extended to other warp schedulers, including cache-conscious wavefront scheduling (CCWS) and Memory Aware Scheduling and Cache Access Re-execution (MASCAR), to exploit the benefits of those schedulers while still providing adaptivity in warp scheduling.
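The observation iPAWS exploits can be sketched as a simple decision rule: if instruction issues are concentrated in a few warps, a greedy scheduler is likely the better fit; if issues spread evenly across warps, round-robin is. The function name and the 0.5 bias threshold are illustrative assumptions, not the paper's actual detection mechanism.

```python
# Hypothetical sketch of an iPAWS-style decision rule: classify the
# instruction-issue pattern across warps and pick a scheduler to match.
def choose_scheduler(issue_counts, bias_threshold=0.5):
    """issue_counts: instructions issued per warp over a sampling window.
    A pattern dominated by one warp suggests greedy scheduling; an even
    pattern suggests fair round-robin scheduling."""
    top_share = max(issue_counts) / sum(issue_counts)
    return "greedy" if top_share > bias_threshold else "round-robin"

print(choose_scheduler([900, 50, 30, 20]))     # greedy: one warp dominates
print(choose_scheduler([240, 260, 250, 250]))  # round-robin: even pattern
```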


design, automation, and test in europe | 2014

Energy-efficient scheduling for memory-intensive GPGPU workloads

Seokwoo Song; Minseok Lee; John Kim; Woong Seo; Yeongon Cho; Soojung Ryu

High performance for a GPGPU workload is obtained by maximizing parallelism and fully utilizing the available resources. However, this is not necessarily energy-efficient, especially for memory-intensive GPGPU workloads. In this work, we propose Throttle CTA (cooperative thread array) Scheduling (TCS), which leverages two types of throttling - throttling the number of active cores and throttling warp execution within the cores - to improve energy efficiency for memory-intensive GPGPU workloads. The algorithm requires the global CTA or thread block scheduler to reduce the number of cores with assigned thread blocks, while leveraging the local warp scheduler to throttle memory requests on some of the cores to further reduce power consumption. The proposed TCS scheduling requires no off-line analysis and can be applied dynamically during execution. Instead of relying on conventional metrics such as misses per kilo-instruction (MPKI), we leverage memory access latency to determine the memory intensity of a workload. Our evaluations show that TCS reduces energy by up to 48% (38% on average) across different memory-intensive workloads while having very little impact on the performance of compute-intensive workloads.
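The core throttling decision can be illustrated with a toy sketch: classify a kernel as memory-intensive from its observed memory access latency, and if so, cut the number of active cores. The function name, the latency threshold, and the throttle factor below are all illustrative assumptions, not values from the paper.

```python
# Illustrative sketch of the TCS idea: use observed memory access latency
# (rather than MPKI) to detect memory-bound kernels, then throttle the
# number of cores that receive thread blocks.
def tcs_active_cores(avg_mem_latency_cycles, total_cores,
                     latency_threshold=400, throttle_factor=0.5):
    if avg_mem_latency_cycles > latency_threshold:
        # Memory-bound: fewer active cores still saturate DRAM bandwidth,
        # so the rest can be left idle to save energy.
        return max(1, int(total_cores * throttle_factor))
    return total_cores  # compute-bound: keep all cores active

print(tcs_active_cores(650, 16))  # 8  (memory-intensive, throttled)
print(tcs_active_cores(120, 16))  # 16 (compute-intensive, untouched)
```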


field-programmable technology | 2012

Implementation of a volume rendering on coarse-grained reconfigurable multiprocessor

Seunghun Jin; Sangheon Lee; Moo-Kyoung Chung; Yeongon Cho; Soojung Ryu

In this paper, we present a reconfigurable multiprocessor architecture for volume rendering. The multiprocessor consists of sixteen reconfigurable processors to exploit the data parallelism of volume rendering. Each processor has a VLIW core and a coarse-grained reconfigurable array, specialized for the control-intensive and data-intensive parts of the program, respectively. The coarse-grained array can be configured dynamically, so it can efficiently process the different kernels of the volume rendering pipeline. The multiprocessor is implemented in Verilog HDL and realized on a commercial FPGA-based prototyping system. Experimental results show that the presented multiprocessor achieves performance comparable to high-end desktop GPUs.


international conference on computer design | 2012

Efficient code compression for coarse grained reconfigurable architectures

Moo-Kyoung Chung; Yeongon Cho; Soojung Ryu

Though Coarse-Grained Reconfigurable Architecture (CGRA) is a flexible alternative for high-performance computing, it has a crucial problem: its instruction code is so large that the instruction memory takes a significant portion of the silicon area and power consumption. This article proposes an efficient dictionary-based compression method for CGRA instruction code, in which code bit-fields are rearranged and grouped according to their locality characteristics, and the most efficient compression mode is selected for each group and kernel. The proposed method can reinstall the dictionary contents adaptively for each kernel. Experimental results show that the proposed method achieves an average compression ratio of 0.56 on a 4×4 array of function units for well-optimized applications.
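A minimal sketch of plain dictionary-based code compression clarifies what a compression ratio like 0.56 measures: frequent instruction words are replaced by short dictionary indices plus a flag bit, and the ratio is compressed bits over original bits. The word width, dictionary size, and flag-bit encoding below are assumptions for illustration; the paper's bit-field rearrangement and per-kernel mode selection are not modeled.

```python
from collections import Counter

# Minimal sketch of dictionary-based compression for instruction words:
# the most frequent words get short indices; everything else is emitted
# verbatim. Each word carries a 1-bit "compressed?" flag.
def compression_ratio(words, word_bits=32, dict_size=16):
    index_bits = dict_size.bit_length() - 1      # 4-bit index for 16 entries
    dictionary = {w for w, _ in Counter(words).most_common(dict_size)}
    compressed = sum(1 + (index_bits if w in dictionary else word_bits)
                     for w in words)
    return compressed / (len(words) * word_bits)

# All words fit in the dictionary here, so every 32-bit word shrinks to
# 5 bits (flag + index): 30 / 192 bits.
ratio = compression_ratio(["add", "mul", "add", "add", "nop", "add"])
print(round(ratio, 3))  # 0.156
```

Lower is better for this metric, so a ratio of 0.56 means the compressed code occupies roughly half the original instruction memory.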


field-programmable technology | 2013

Adaptive compression for instruction code of Coarse Grained Reconfigurable Architectures

Moo-Kyoung Chung; Jun-Kyoung Kim; Yeongon Cho; Soojung Ryu

Coarse-Grained Reconfigurable Architecture (CGRA) achieves high performance by exploiting instruction-level parallelism through software pipelining. Its large instruction memory is, however, a critical problem, requiring significant silicon area and power consumption. Code compression is a promising technique to reduce memory area, bandwidth requirements, and power consumption. We present an adaptive code compression scheme for CGRA instructions based on dictionary-based compression, where the compression mode and dictionary contents are adaptively selected for each execution kernel and compression group. In addition, the hardware decompressor can be implemented efficiently, with a two-cycle latency and negligible silicon overhead. The proposed method achieves an average compression ratio of 0.52 on a 16-functional-unit CGRA array in experiments with well-optimized applications.


international conference on consumer electronics | 2014

Low-power reconfigurable audio processor for mobile devices

Seunghun Jin; Woong Seo; Yeongon Cho; Soojung Ryu

Low-power processing of multimedia data is mandatory for recent mobile devices. In this paper, we present a coarse-grained reconfigurable processor for low-power audio processing. By utilizing a perfect instruction cache and tightly coupled scratchpad memory, we eliminate all bandwidth consumption caused by external memory accesses while decoding compressed audio data. Accelerating audio data processing through software pipelining also helps reduce power consumption by decreasing overall processing time. The presented audio processor is integrated into commercial application processors and shipped with market-leading mobile devices, providing over 100 hours of audio playback.

Collaboration


Dive into Yeongon Cho's collaborations.

Top Co-Authors


Bernhard Egger

Seoul National University
