Woong Seo
Samsung
Publications
Featured research published by Woong Seo.
high-performance computer architecture | 2014
Minseok Lee; Seokwoo Song; Joosik Moon; John Kim; Woong Seo; Yeongon Cho; Soojung Ryu
High performance in GPGPU workloads is obtained by maximizing parallelism and fully utilizing the available resources. Thousands of threads are assigned to each core in units of CTAs (cooperative thread arrays) or thread blocks, with each thread block consisting of multiple warps or wavefronts. The scheduling of these threads can have a significant impact on overall performance. In this work, we explore alternative thread block or CTA scheduling; in particular, we exploit the interaction between the thread block scheduler and the warp scheduler to improve performance. We explore two aspects of thread block scheduling: (1) LCS (lazy CTA scheduling), which restricts the maximum number of thread blocks allocated to each core, and (2) BCS (block CTA scheduling), where consecutive thread blocks are assigned to the same core. For LCS, we leverage a greedy warp scheduler to help determine the optimal number of thread blocks by measuring only the number of instructions issued, while for BCS, we propose an alternative warp scheduler that is aware of the “block” of CTAs allocated to a core. Based on LCS and the observation that the maximum number of CTAs does not necessarily maximize performance, we also propose mixed concurrent kernel execution, which enables multiple kernels to be allocated to the same core to maximize resource utilization and improve overall performance.
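As a minimal sketch of the LCS idea, the search over per-core CTA limits can be illustrated with a toy model: sample the instructions issued under each limit and keep the best-performing one. The issue-rate model below is an assumption for illustration, not the paper's simulator.

```python
# Hypothetical sketch of lazy CTA scheduling (LCS): restrict resident CTAs
# per core, sample instructions issued under each limit, keep the best.

def issued_instructions(cta_limit, sample_cycles=1000):
    # Toy model (assumption): more CTAs add parallelism up to a point,
    # but too many increase cache and memory contention.
    parallelism = min(cta_limit, 8)
    contention = 0.1 * max(0, cta_limit - 4)
    return int(sample_cycles * parallelism / (1.0 + contention))

def lazy_cta_scheduling(hw_max_ctas):
    """Return the CTA-per-core limit that maximizes the sampled issue count."""
    samples = {limit: issued_instructions(limit)
               for limit in range(1, hw_max_ctas + 1)}
    return max(samples, key=samples.get)

print(lazy_cta_scheduling(hw_max_ctas=16))  # picks a limit below the hw max
```

With a greedy warp scheduler, warps from later CTAs only issue when earlier ones stall, which is why sampling the issue count is a reasonable proxy for each configuration's performance.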
international conference on computer design | 2012
Hanjoon Kim; John Kim; Woong Seo; Yeongon Cho; Soojung Ryu
Network-on-chip (NoC) bandwidth has a significant impact on overall performance in throughput-oriented processors such as GPGPUs. Although it has been commonly assumed that high NoC bandwidth can be provided through abundant on-chip wires, we show that increasing NoC router frequency results in a more cost-effective NoC. However, the router arbitration critical path can limit the NoC router frequency. Thus, we propose a direct all-to-all network overlaid on mesh (DA2mesh) NoC architecture that exploits the traffic characteristics of GPGPUs and removes arbitration from the router pipeline. DA2mesh simplifies the router pipeline, improving performance by 36% while reducing NoC energy by 15%.
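One way to see why removing arbitration can help is a back-of-the-envelope frequency model: if switch arbitration sets the router's critical path, dropping it lets the router clock higher. The stage names and delays below are illustrative assumptions, not measurements from the paper.

```python
# Hypothetical critical-path model: the router clock is limited by the
# slowest pipeline stage; dedicated all-to-all paths need no arbiter.

baseline_stages = {"buffer_write": 0.30, "route_compute": 0.25,
                   "switch_arbitration": 0.45, "switch_traversal": 0.35}  # ns
da2mesh_stages = {k: v for k, v in baseline_stages.items()
                  if k != "switch_arbitration"}  # arbitration removed

def max_frequency_ghz(stages):
    return 1.0 / max(stages.values())  # limited by the slowest stage

print(f"baseline: {max_frequency_ghz(baseline_stages):.2f} GHz")
print(f"DA2mesh : {max_frequency_ghz(da2mesh_stages):.2f} GHz")
```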
high-performance computer architecture | 2016
Minseok Lee; Gwangsun Kim; John Kim; Woong Seo; Yeongon Cho; Soojung Ryu
Thread or warp scheduling in GPGPUs has been shown to have a significant impact on overall performance. Recently proposed warp schedulers have been based on a greedy warp scheduler where some warps are prioritized over others. However, a single warp scheduling policy does not necessarily provide good performance across all types of workloads; in particular, we show that greedy warp schedulers are not necessarily optimal for workloads with inter-warp locality, where a simple round-robin warp scheduler provides better performance. Thus, we argue that instead of a single, static warp scheduling policy, an adaptive warp scheduler that dynamically changes the scheduling policy based on workload characteristics should be leveraged. In this work, we propose an instruction-issue pattern-based adaptive warp scheduler (iPAWS) that dynamically adapts between a greedy warp scheduler and a fair, round-robin scheduler. We exploit the observation that workloads that favor a greedy warp scheduler will have an instruction-issue pattern that is biased towards some warps, while workloads that favor a fair, round-robin warp scheduler will tend to issue instructions across all of the warps. Our evaluations show that iPAWS is able to adapt dynamically to the more optimal warp scheduler and achieve performance within a few percent of the statically determined, more optimal warp scheduler. We also show that iPAWS can be extended to other warp schedulers, including Cache-Conscious Wavefront Scheduling (CCWS) and Memory Aware Scheduling and Cache Access Re-execution (MASCAR), to exploit their benefits while still providing adaptivity in warp scheduling.
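The adaptation decision can be sketched as follows: sample how many instructions each warp issued in a window, then pick greedy scheduling if the pattern is biased toward a few warps and round-robin if it is flat. The bias metric and threshold below are illustrative assumptions rather than the paper's exact mechanism.

```python
# Hypothetical sketch of an iPAWS-style decision rule based on the
# instruction-issue pattern across warps in a sampling window.

def choose_scheduler(issue_counts, bias_threshold=0.5):
    """issue_counts: instructions issued per warp in the sampling window."""
    total = sum(issue_counts)
    if total == 0:
        return "round_robin"
    mean = total / len(issue_counts)
    # Fraction of issue slots above an even share: ~0 when flat, ~1 when biased.
    excess = sum(max(0, c - mean) for c in issue_counts) / total
    return "greedy" if excess > bias_threshold else "round_robin"

print(choose_scheduler([900, 850, 10, 5, 0, 0, 0, 0]))          # biased -> greedy
print(choose_scheduler([100, 110, 95, 105, 98, 102, 97, 93]))   # flat -> round_robin
```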
design, automation, and test in europe | 2014
Seokwoo Song; Minseok Lee; John Kim; Woong Seo; Yeongon Cho; Soojung Ryu
High performance for a GPGPU workload is obtained by maximizing parallelism and fully utilizing the available resources. However, this is not necessarily energy-efficient, especially for memory-intensive GPGPU workloads. In this work, we propose Throttle CTA (cooperative thread array) Scheduling (TCS), where we leverage two types of throttling - throttling the number of active cores and throttling warp execution within the cores - to improve energy efficiency for memory-intensive GPGPU workloads. The algorithm requires the global CTA or thread block scheduler to reduce the number of cores with assigned thread blocks while leveraging the local warp scheduler to throttle memory requests on some of the cores to further reduce power consumption. The proposed TCS scheduling does not require off-line analysis and can be done dynamically during execution. Instead of relying on conventional metrics such as misses per kilo-instruction (MPKI), we leverage the memory access latency metric to determine the memory intensity of the workloads. Our evaluations show that TCS reduces energy by up to 48% (38% on average) across different memory-intensive workloads while having very little impact on performance for compute-intensive workloads.
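A minimal sketch of such a control loop appears below, assuming an illustrative latency threshold and throttling split; neither is the paper's exact mechanism, and the Core container is a stand-in.

```python
# Hypothetical TCS-style control loop: classify the workload by observed
# memory access latency, then throttle cores and warp execution.

from dataclasses import dataclass

@dataclass
class Core:
    active: bool = True
    warp_throttled: bool = False

def throttle_cta_scheduling(cores, avg_mem_latency_cycles,
                            latency_threshold=400):
    memory_intensive = avg_mem_latency_cycles > latency_threshold
    for i, core in enumerate(cores):
        if not memory_intensive:
            core.active, core.warp_throttled = True, False  # run at full tilt
        elif i >= len(cores) // 2:
            core.active = False                 # deactivate half the cores
        else:
            core.warp_throttled = (i % 2 == 1)  # throttle warps on some cores

cores = [Core() for _ in range(8)]
throttle_cta_scheduling(cores, avg_mem_latency_cycles=650)
print([(c.active, c.warp_throttled) for c in cores])
```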
ifip ieee international conference on very large scale integration | 2015
Namhyung Kim; Junwhan Ahn; Woong Seo; Kiyoung Choi
This paper presents an energy-efficient exclusive last-level cache design based on STT-RAM, an emerging memory technology that offers higher density and lower static power than SRAM. Exclusive caches are known to provide higher effective cache capacity than inclusive caches by removing duplicated copies of cache blocks across hierarchies. However, in exclusive cache hierarchies, every block evicted from the lower-level cache is written back to the last-level cache regardless of its dirtiness, thereby incurring extra write overhead. This makes it challenging to use STT-RAM for exclusive last-level caches due to its high write energy and long write latency. To mitigate this problem, we design an SRAM/STT-RAM hybrid cache architecture based on reuse distance prediction. In our architecture, among the cache blocks evicted from the lower-level cache, blocks that are likely to be accessed again soon are inserted into the SRAM region, while blocks that are unlikely to be reused are forced to bypass the last-level cache. Evaluation results show that the proposed architecture significantly reduces the energy consumption of the last-level cache while slightly improving system performance.
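The insertion decision can be sketched as follows, assuming a hypothetical reuse-distance cutoff for the SRAM region; the cutoff value and function names are illustrative, not the paper's parameters.

```python
# Hypothetical placement rule for the hybrid LLC: blocks predicted to be
# reused soon go to the SRAM region (cheap writes); the rest bypass the
# last-level cache entirely, avoiding a costly STT-RAM write.

SRAM_REUSE_LIMIT = 64  # assumed cutoff (in accesses) for "reused soon"

def place_evicted_block(predicted_reuse_distance):
    """Decide where an L2-evicted block goes in the hybrid LLC."""
    if predicted_reuse_distance <= SRAM_REUSE_LIMIT:
        return "sram_region"   # likely reused soon: keep it in SRAM
    return "bypass"            # unlikely to be reused: skip the LLC

print(place_evicted_block(12))    # -> sram_region
print(place_evicted_block(5000))  # -> bypass
```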
international conference on consumer electronics | 2014
Seunghun Jin; Woong Seo; Yeongon Cho; Soojung Ryu
Low-power processing of multimedia data is mandatory for recent mobile devices. In this paper, we present a coarse-grained reconfigurable processor for low-power audio processing. By utilizing a perfect instruction cache and a tightly-coupled scratchpad memory, we eliminate all the bandwidth consumption caused by external memory accesses while decoding compressed audio data. Accelerating audio data processing with a software pipelining technique also helps reduce power consumption by decreasing overall processing time. The presented audio processor has been integrated into commercial application processors and shipped with market-leading mobile devices, providing over 100 hours of audio playback.
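A minimal sketch of the software-pipelined, double-buffered scratchpad decode loop this describes is shown below; the frame size and the dma/decode stand-ins are assumptions for illustration, not the product's firmware.

```python
# Hypothetical double-buffered decode loop: while one scratchpad buffer is
# being decoded, the next frame is fetched into the other, so memory
# traffic overlaps with computation (software pipelining).

FRAME_BYTES = 1024  # assumed frame size

def dma_fill(frame_index):
    return bytes(FRAME_BYTES)   # stand-in for a DMA transfer into scratchpad

def decode(buffer):
    return sum(buffer)          # stand-in for the audio decoder kernel

def play_stream(num_frames):
    buffers = [dma_fill(0), None]        # prologue: prefetch the first frame
    for i in range(num_frames):
        if i + 1 < num_frames:
            buffers[(i + 1) % 2] = dma_fill(i + 1)  # prefetch next frame
        decode(buffers[i % 2])                       # decode current frame

play_stream(8)
```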
Archive | 2011
Young-Chul Cho; Soojung Ryu; Yoon-Jin Kim; Woong Seo; Il-hyun Park; Tae-wook Oh
Archive | 2012
Moo-Kyoung Chung; Soojung Ryu; Ho-Young Kim; Woong Seo; Young-Chul Cho
Archive | 2011
Young Chul Cho; Soojung Ryu; Moo-Kyoung Chung; Ho-Young Kim; Woong Seo
Archive | 2011
Tae-wook Oh; Soojung Ryu; Yoon-Jin Kim; Woong Seo; Young-Chul Cho; Il-hyun Park