Dong-hoon Yoo
Samsung
Publications
Featured research published by Dong-hoon Yoo.
field-programmable technology | 2012
Won-Sub Kim; Dong-hoon Yoo; Haewoo Park; Min-wook Ahn
Coarse-grained reconfigurable array (CGRA) architectures aim to offer high performance at low power consumption, especially for digital signal processing and streaming applications. To fully exploit the computing capability of a CGRA, it is essential to develop a scheduling algorithm that maps operations onto the processing elements of the CGRA. Modulo scheduling [1] is known as the state-of-the-art algorithm for CGRA scheduling, and there are many variants [2][3][4]. However, they struggle with inter-iteration dependences, called recurrences, that form cyclic dependences. Hence we propose a new scheduling technique that efficiently handles these cyclic dependences. The key techniques are grouping all mutually dependent recurrence cycles into a strongly connected component (SCC), and scheduling the input data flow graph (DFG) based on SCCs. Since grouping removes all recurrence cycles from the DFG, the resulting SCC graph becomes a directed acyclic graph (DAG) in which the scheduler can track the total order of SCCs. While processing SCCs one by one, our intra-SCC scheduler analyzes the dependences between every pair of operations inside an SCC and produces their schedule. Thanks to the well-structured form of the SCC-based graph, we obtain a more efficient schedule than the previous CGRA scheduling algorithm [2]. The experimental results show that the proposed technique improves the performance of recurrence-dominant loops by up to 3.5X and raises the success rate of modulo scheduling compared to the previous CGRA scheduling algorithm [2].
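The core idea, condensing recurrence cycles into SCCs so that the remaining graph is acyclic and can be processed in order, can be illustrated with a minimal sketch. The toy DFG, node names, and use of Kosaraju's algorithm below are illustrative assumptions, not the paper's implementation.

```python
# Sketch: condense the recurrence cycles of a data-flow graph (DFG) into
# strongly connected components (SCCs), then build the acyclic SCC graph.
from collections import defaultdict

def strongly_connected_components(graph):
    """Kosaraju's algorithm: returns a list of SCCs (each a set of nodes)."""
    order, visited = [], set()

    def dfs(v, g, out):
        stack = [(v, iter(g[v]))]
        visited.add(v)
        while stack:
            node, it = stack[-1]
            advanced = False
            for nxt in it:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(g[nxt])))
                    advanced = True
                    break
            if not advanced:
                stack.pop()
                out.append(node)  # post-order finish

    for v in graph:
        if v not in visited:
            dfs(v, graph, order)

    reverse = defaultdict(list)
    for u in graph:
        for v in graph[u]:
            reverse[v].append(u)

    visited.clear()
    sccs = []
    for v in reversed(order):
        if v not in visited:
            comp = []
            dfs(v, reverse, comp)
            sccs.append(set(comp))
    return sccs

# A toy DFG with one recurrence cycle: b -> c -> d -> b.
dfg = {"a": ["b"], "b": ["c"], "c": ["d", "e"], "d": ["b"], "e": []}
sccs = strongly_connected_components(dfg)
print(sccs)  # the cycle {b, c, d} collapses into one SCC; a and e are singletons

# Condense: map each node to its SCC index and keep only inter-SCC edges.
scc_of = {v: i for i, comp in enumerate(sccs) for v in comp}
dag_edges = {(scc_of[u], scc_of[v]) for u in dfg for v in dfg[u]
             if scc_of[u] != scc_of[v]}
print(dag_edges)  # edges between SCCs only; the condensed graph is acyclic
```

With the recurrences hidden inside SCC nodes, a scheduler can walk the condensed DAG in topological order and handle intra-SCC dependences locally, which is the structure the paper exploits.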
languages, compilers, and tools for embedded systems | 2011
Choonki Jang; Jungwon Kim; Jaejin Lee; Hee-Seok Kim; Dong-hoon Yoo; Suk-Jin Kim; Hong-Seok Kim; Soojung Ryu
In this paper, we propose a data partitioning technique for a memory subsystem that consists of a multi-ported scratchpad memory (SPM) unit and a single-ported data cache in a coarse-grained reconfigurable array (CGRA) architecture. The embedded reconfigurable processor executes programs by switching between the Non-VLIW and VLIW modes depending on the type of code region to achieve high performance: the VLIW mode executes code regions with high ILP that require high memory bandwidth, and the Non-VLIW mode executes those with low ILP that require low memory latency. Our data partitioning technique between the SPM and the data cache is based on data interference graph reduction and profiling information. Given an SPM size, it finds the optimal data partitions by taking the VLIW instruction schedule into consideration. We evaluate our data partitioning technique for the CGRA architecture with three representative multimedia applications.
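To make the placement problem concrete, here is a heavily simplified sketch: a greedy profit-per-byte heuristic assigns profiled data objects to the SPM under a size budget. The paper's actual method uses interference graph reduction and the VLIW schedule; the object names, sizes, and access counts below are made up for illustration.

```python
# Sketch: partition profiled data objects between a scratchpad (SPM) and the
# data cache under a fixed SPM size budget (greedy stand-in, not the paper's
# interference-graph-based algorithm).
from typing import NamedTuple

class DataObject(NamedTuple):
    name: str
    size_bytes: int
    vliw_accesses: int   # accesses from high-bandwidth (VLIW-mode) regions

def partition(objects, spm_size):
    spm, cache, used = [], [], 0
    # Prefer objects that deliver the most VLIW-mode accesses per SPM byte.
    for obj in sorted(objects, key=lambda o: o.vliw_accesses / o.size_bytes,
                      reverse=True):
        if used + obj.size_bytes <= spm_size:
            spm.append(obj.name)
            used += obj.size_bytes
        else:
            cache.append(obj.name)
    return spm, cache

objs = [
    DataObject("coeff_table", 2048, 120000),
    DataObject("frame_buf",   65536, 90000),
    DataObject("temp_line",   4096,  30000),
]
spm, cache = partition(objs, spm_size=8192)
print("SPM:", spm)      # ['coeff_table', 'temp_line']
print("cache:", cache)  # ['frame_buf']
```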
compilers, architecture, and synthesis for embedded systems | 2012
Narasinga Rao Miniskar; Pankaj Shailendra Gode; Soma Kohli; Dong-hoon Yoo
The next generation of SoCs for consumer electronics needs software solutions for faster time-to-market, lower development cost, and higher performance while maintaining lower energy consumption and area. As a result, reconfigurable processors (RPs), which provide just enough flexibility to accept software solutions while offering application-specific hardware reconfigurability, have become increasingly important. Samsung Electronics has developed a reconfigurable processor called the Samsung Reconfigurable Processor (SRP), which is the basis of our work. Although the SRP is a powerful processor, it requires a smart and intelligent compiler to compile application software while exploiting its reconfigurable architecture. The existing compiler for the SRP does not support functional inlining and loop unrolling, and no study has yet been done on these optimizations for RPs. In this paper, we study the impact of these optimizations on application performance for the SRP, and we also show how they are supported in the SRP compiler. We analyze the performance improvement due to these optimizations on various benchmarks, namely the Sobel edge filter, the JPEG decoder, and the luma deblocking filter of the H.264 standard. Our experimental results show about an 83% performance gain with the functional inlining and loop unrolling optimizations compared to the original code for the Sobel filter and JPEG encoder, and an 11% performance gain for the luma deblocking filter.
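As a reminder of what the loop unrolling optimization does, here is a minimal, language-neutral illustration written in Python for brevity; a real compiler applies the same transformation on its IR, and the unroll factor of 4 and function names are arbitrary choices for illustration.

```python
# Sketch: loop unrolling by a factor of 4 with a remainder (epilogue) loop.

def dot_rolled(a, b):
    """Original loop: one multiply-accumulate per iteration."""
    acc = 0
    for i in range(len(a)):
        acc += a[i] * b[i]
    return acc

def dot_unrolled_by_4(a, b):
    """Unrolled by 4: fewer loop-control steps and more work exposed per
    iteration, which a VLIW/CGRA scheduler can pack into wide instructions."""
    n = len(a)
    acc = 0
    for i in range(0, n - n % 4, 4):
        acc += a[i] * b[i]
        acc += a[i + 1] * b[i + 1]
        acc += a[i + 2] * b[i + 2]
        acc += a[i + 3] * b[i + 3]
    for i in range(n - n % 4, n):  # remainder (epilogue) loop
        acc += a[i] * b[i]
    return acc

a, b = list(range(10)), list(range(10))
assert dot_rolled(a, b) == dot_unrolled_by_4(a, b)
```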
Journal of Systems Architecture | 2016
Shin-Haeng Kang; Dong-hoon Yoo; Soonhoi Ha
Timing simulation of a processor is a key enabling technique for exploring the design space of a system architecture or for developing software before hardware is available. We propose a fast cycle-approximate simulation technique for modern superscalar out-of-order processors. The proposed simulation technique is designed in two parts: the front-end provides correct functional execution of the guest application, and the back-end provides a timing model. For the back-end, we developed a novel processor timing model that combines a simple formula-based analytical model with a scheduling analysis of sampled traces, boosting simulation speed with minimal accuracy loss. Together with a cache simulator, a branch predictor, and a trace analyzer, the proposed technique is implemented on top of the popular and portable QEMU emulator and named TQSIM (Timed QEMU-based SIMulator). Sacrificing around 8 percent of accuracy, TQSIM enables simulation that is one to two orders of magnitude faster than a reference cycle-accurate simulation when the target architecture is an ARM Cortex-A15 processor. TQSIM is an open-source project currently available online.
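The general flavor of such a hybrid timing back-end can be sketched very roughly: a base analytical CPI plus event penalties accumulated from sampled traces. The penalty values, sample structure, and formula below are illustrative assumptions only, not TQSIM's actual model.

```python
# Sketch: cycle estimate = base analytical CPI * instructions, plus penalties
# for cache misses and branch mispredictions counted in sampled traces.
from dataclasses import dataclass

@dataclass
class SampleStats:
    instructions: int
    l1_misses: int
    l2_misses: int
    branch_mispredicts: int

def estimate_cycles(samples, base_cpi=0.7,
                    l1_penalty=12, l2_penalty=80, branch_penalty=14):
    """Estimate total cycles from per-sample event counts (toy model)."""
    total = 0.0
    for s in samples:
        total += (s.instructions * base_cpi
                  + s.l1_misses * l1_penalty
                  + s.l2_misses * l2_penalty
                  + s.branch_mispredicts * branch_penalty)
    return total

samples = [
    SampleStats(instructions=100_000, l1_misses=1_200, l2_misses=90,
                branch_mispredicts=800),
    SampleStats(instructions=100_000, l1_misses=300, l2_misses=20,
                branch_mispredicts=1_500),
]
cycles = estimate_cycles(samples)
print(f"estimated cycles: {cycles:.0f}, CPI: {cycles / 200_000:.2f}")
```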
international conference on consumer electronics | 2013
Min-wook Ahn; Dong-hoon Yoo; Soojung Ryu; Jeongwook Kim
Modern consumer electronics require efficient and versatile processors at their heart to support various functions. This paper shows how our reconfigurable processor (RP) accelerates various multimedia applications with very different characteristics. The RP provides an acceleration mode powered by a modulo scheduling algorithm and intrinsics specialized to each application. By exploiting these two, the RP can enhance performance by up to 71.54x for audio, video, image signal processing (ISP), 3D graphics, and major communication channel (DVB-T2) applications.
Metals and Materials International | 2015
Hyunki Kim; Dongun Kim; Kanghwan Ahn; Dong-hoon Yoo; Hyun-Sung Son; Gyosung Kim; Kwansoo Chung
In order to measure the flow curves of steel sheets at high temperatures, which depend on strain and strain rate as well as temperature and temperature history, a tensile test machine and specimens were newly developed in this work. In addition, an indirect method to characterize mechanical properties at high temperatures was developed by combining experiments with their numerical analysis, in which temperature history was also accounted for. Ultimately, a modified Johnson-Cook type hardening law, accounting for the dependence of hardening behavior (including its deterioration) on strain rate as well as temperature, was successfully developed, covering both pre- and post-ultimate-tensile-strength ranges for a hot press forming steel sheet. The calibrated hardening law obtained with the inverse characterization method was then applied to and validated for hot press forming of a 2-D mini-bumper in terms of the distributions of temperature history, thickness, and hardness, considering the continuous cooling transformation diagram. The results showed reasonably good agreement with experiments.
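For reference, the classical Johnson-Cook hardening law, which the paper modifies (the abstract does not give the modified form), expresses flow stress as a product of strain-hardening, strain-rate, and thermal-softening terms:

```latex
\sigma = \left(A + B\,\varepsilon_p^{\,n}\right)
         \left(1 + C \ln \frac{\dot{\varepsilon}}{\dot{\varepsilon}_0}\right)
         \left(1 - {T^{*}}^{m}\right),
\qquad
T^{*} = \frac{T - T_\mathrm{room}}{T_\mathrm{melt} - T_\mathrm{room}},
```

where A, B, n, C, and m are material parameters, ε̇₀ is the reference strain rate, and ε_p is the equivalent plastic strain. The paper's modified law additionally captures the strain-rate- and temperature-dependent deterioration of hardening beyond the ultimate tensile strength.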
international conference on consumer electronics | 2014
Tai-song Jin; Min-wook Ahn; Dong-hoon Yoo; Dong-kwan Suh; Yoonseo Choi; Do-Hyung Kim; Shihwa Lee
VLIW (Very Long Instruction Word) is one of the most popular architectures in embedded systems because it offers low power consumption and low hardware cost. Due to the nature of the VLIW architecture, such as bundled instructions and large register files, VLIW processors run with large instruction code sizes at relatively low clock frequencies. However, compact instruction size and high clock frequency are among the most important requirements of modern embedded consumer electronics. In this paper we propose a novel instruction compression scheme to address this problem. The experiments show that the proposed scheme can reduce instruction size by 23% and improve clock frequency by 25% on average compared with conventional compression schemes.
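The abstract does not detail the compression scheme itself, so the sketch below shows one classic form of VLIW instruction compression as a stand-in: drop NOP slots from each bundle and keep a per-bundle bit mask recording which slots are occupied. The slot count, NOP encoding, and operation values are illustrative.

```python
# Sketch: NOP-slot removal with a per-bundle occupancy mask (a standard VLIW
# compression idea, not necessarily the paper's scheme).
NOP = 0
SLOTS = 4  # issue width of the hypothetical VLIW machine

def compress(bundles):
    """Each bundle is a list of SLOTS operations; NOP slots are removed."""
    out = []
    for bundle in bundles:
        mask, ops = 0, []
        for slot, op in enumerate(bundle):
            if op != NOP:
                mask |= 1 << slot
                ops.append(op)
        out.append((mask, ops))
    return out

def decompress(compressed):
    """Re-expand to full-width bundles using the slot mask."""
    bundles = []
    for mask, ops in compressed:
        it = iter(ops)
        bundles.append([next(it) if mask & (1 << s) else NOP
                        for s in range(SLOTS)])
    return bundles

code = [[0x11, NOP, NOP, 0x22], [NOP, 0x33, NOP, NOP]]
packed = compress(code)
assert decompress(packed) == code
print(packed)  # slot masks 0b1001 and 0b0010 plus the surviving operations
```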
international conference on computer graphics and interactive techniques | 2014
Seok Joong Hwang; Ankur Deshwal; Dong-hoon Yoo; Won-Jong Lee; Youngsam Shin; Jae Don Lee; Soojung Ryu; Jeongwook Kim
This paper presents a shading language compiler optimized for a mobile ray tracing accelerator, SGRT (Samsung GPU Ray Tracing). With this compiler, application development productivity is dramatically improved: 1) as an application-specific abstraction layer, the shading language lets application developers implement ray generators and shaders much more easily and intuitively than a general low-level language such as standard C (up to 81.6% fewer source lines); 2) high performance is achievable without a deep understanding of the programmable shader architecture, which is based on a CGRA (Coarse-Grained Reconfigurable Array) and is complex to optimize, especially in the presence of many conditional branches (up to 1.58 times higher throughput).
compilers, architecture, and synthesis for embedded systems | 2014
Narasinga Rao Miniskar; Soma Kohli; Haewoo Park; Dong-hoon Yoo
Reconfigurable processors such as the SRP (Samsung Reconfigurable Processor) have become increasingly important: they provide just enough flexibility to accept software solutions while offering application-specific hardware configurability for faster time-to-market, lower development cost, and higher performance at lower energy consumption and area. The reconfigurable processor compilation framework supports a wide range of architectures through an architecture description template for different application domains such as image processing, multimedia, video, and graphics. These architectures support several domain-specific compound instructions (also called intrinsics), which are computationally efficient compared to the processor's general instructions. Application developers have to use these intrinsics in their programs according to the architecture, which can result in very inefficient usage and is tedious and error-prone. Moreover, using the intrinsics provided by the architecture requires constant reference to the intrinsics file during development. In this paper, we propose a novel retargetable methodology for automatically generating compound instructions for a given architecture and application source code at compile time. Our approach is able to consider ~75% of the total intrinsics in the architectures, with a success rate of over 90% in identifying intrinsics in benchmarks such as the AVC, OpenGL Full Engine, and OpenGL Vector benchmarks.
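A minimal sketch of the flavor of automatic intrinsic selection: match a small operation pattern in an expression tree and rewrite it into one compound node. The "mac" (multiply-accumulate) pattern, node layout, and names below are illustrative assumptions, not the paper's algorithm.

```python
# Sketch: fold add(x, mul(a, b)) subtrees into a single "mac" compound node.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Node:
    op: str                      # "add", "mul", "mac", or "leaf"
    args: Tuple["Node", ...] = ()
    name: Optional[str] = None   # set for leaves

def match_mac(node):
    """Return (acc, a, b) if node computes acc + a * b, else None."""
    if node.op != "add":
        return None
    for acc, other in (node.args, node.args[::-1]):
        if other.op == "mul":
            return acc, other.args[0], other.args[1]
    return None

def select_intrinsics(node):
    """Bottom-up rewrite: replace matched patterns with compound nodes."""
    node = Node(node.op, tuple(select_intrinsics(a) for a in node.args),
                node.name)
    m = match_mac(node)
    return Node("mac", m) if m else node

leaf = lambda n: Node("leaf", name=n)
expr = Node("add", (leaf("acc"), Node("mul", (leaf("a"), leaf("b")))))
print(select_intrinsics(expr).op)  # "mac"
```

A real system would carry a library of such patterns derived from the architecture description and check operand constraints before rewriting, but the match-and-rewrite structure is the same.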
ACM Transactions on Architecture and Code Optimization | 2018
Namhyung Kim; Junwhan Ahn; Kiyoung Choi; Daniel Sanchez; Dong-hoon Yoo; Soojung Ryu
This article proposes Benzene, an energy-efficient distributed SRAM/STT-RAM hybrid cache for manycore systems running multiple applications. It is based on the observation that a naïve application of hybrid cache techniques to distributed caches in a manycore architecture suffers from limited energy reduction due to uneven utilization of scarce SRAM. We propose two-level optimization techniques: intra-bank and inter-bank. Intra-bank optimization leverages highly associative cache design, achieving more uniform distribution of writes within a bank. Inter-bank optimization evenly balances the amount of write-intensive data across the banks. Our evaluation results show that Benzene significantly reduces energy consumption of distributed hybrid caches.
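The spirit of the inter-bank optimization, spreading write-intensive data evenly so no single bank's scarce SRAM is overwhelmed, can be sketched with a simple greedy least-loaded placement. The heuristic and the page/write-count data below are illustrative assumptions, not the paper's actual mechanism.

```python
# Sketch: assign write-heavy pages to the currently least write-loaded bank.
import heapq

def balance_writes(page_write_counts, num_banks):
    """Place pages (heaviest writers first) onto the least-loaded bank.
    Returns a mapping of bank id -> list of pages."""
    heap = [(0, bank) for bank in range(num_banks)]  # (write load, bank id)
    heapq.heapify(heap)
    placement = {bank: [] for bank in range(num_banks)}
    for page, writes in sorted(page_write_counts.items(),
                               key=lambda kv: kv[1], reverse=True):
        load, bank = heapq.heappop(heap)
        placement[bank].append(page)
        heapq.heappush(heap, (load + writes, bank))
    return placement

writes = {"pA": 9000, "pB": 7000, "pC": 4000, "pD": 3500, "pE": 500}
print(balance_writes(writes, num_banks=2))
# {0: ['pA', 'pD'], 1: ['pB', 'pC', 'pE']} -- write loads 12500 vs 11500
```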