Kyuseung Han | Researchain

Publication

Featured research published by Kyuseung Han.

field-programmable technology | 2009

FloRA: Coarse-grained reconfigurable architecture with floating-point operation capability

Dongwook Lee; Manhwee Jo; Kyuseung Han; Kiyoung Choi

This paper demonstrates a chip implementation of a coarse-grained reconfigurable architecture named FloRA. A two-dimensional array of integer processing elements in FloRA is configured at run time to perform integer functions as well as floating-point functions. FloRA is implemented in the Dongbu HiTek 130nm process and evaluated by running applications including a physics engine and a JPEG decoder.

field-programmable technology | 2010

Acceleration of control flow on CGRA using advanced predicated execution

Kyuseung Han; Jong Kyung Paek; Kiyoung Choi

A coarse-grained reconfigurable array is a very attractive architecture from the viewpoint of performance and flexibility. However, because the performance improvement is achieved by exploiting parallelism, the architecture is typically poor at handling control flow, which is sequential in nature. There have been many attempts to overcome this problem by using predicated execution techniques; however, they do not support all types of control flow or suffer from performance degradation in doing so. In addition, predicated execution schemes in general require a longer execution time because both the if- and else-paths are always executed. This paper proposes advanced predicated execution techniques that can handle and accelerate all types of control flow with only 2% hardware overhead. These techniques can also be easily extended to general SIMD machines. We implemented these techniques on a coarse-grained reconfigurable array architecture and verified their functionality and effectiveness by accelerating an H.264 deblocking filter, a kernel which is both data- and control-intensive. The results show that the proposed approach achieves up to 43% improvement in execution time compared to speculation by sacrificing 76% code size, and 24% improvement in execution time compared to the previous full predication approach, with a smaller code size.

ACM Transactions on Architecture and Code Optimization | 2013

Power-Efficient Predication Techniques for Acceleration of Control Flow Execution on CGRA

Kyuseung Han; Junwhan Ahn; Kiyoung Choi

A coarse-grained reconfigurable architecture typically has an array of processing elements (PEs) which are controlled by a centralized unit. This makes it difficult to execute programs having control divergence among PEs without predication. However, conventional predication techniques have a negative impact on both performance and power consumption due to longer instruction words and unnecessary instruction fetching, decoding, and nullifying steps. This article reveals performance and power issues in predicated execution which have not yet been well addressed. Furthermore, it proposes fast and power-efficient predication mechanisms. Experiments conducted through gate-level simulation show that our mechanism improves energy-delay product by 11.9% to 23.8% on average.

design, automation, and test in europe | 2013

Compiling control-intensive loops for CGRAs with state-based full predication

Kyuseung Han; Kiyoung Choi; Jongeun Lee

Predication is an essential technique to accelerate kernels with control flow on CGRAs. While state-based full predication (SFP) can remove the wasteful power consumption of issuing and decoding instructions in conventional full predication, generating code for SFP is challenging for general CGRAs, especially when multiple conditionals must be handled as a result of exploiting data-level parallelism. In this paper, we present a novel compiler framework addressing central issues such as how to express the parallelism between multiple conditionals, and how to allocate resources to them to maximize the parallelism. In particular, by separating the handling of control flow and data flow, our framework can be integrated with conventional mapping algorithms for mapping data flow. Experimental results demonstrate that our framework can find and exploit parallelism between multiple conditionals, thereby leading to 2.21 times higher performance on average than a naive approach.

design, automation, and test in europe | 2012

State-based full predication for low power coarse-grained reconfigurable architecture

Kyuseung Han; Seongsik Park; Kiyoung Choi

It has been one of the most fundamental challenges in architecture design to achieve high performance with low power while maintaining flexibility. Parallel architectures such as coarse-grained reconfigurable architecture, where multiple PEs are tightly coupled with each other, can be a viable solution to the problem. However, the PEs are typically controlled by a centralized control unit, which makes it hard to parallelize programs requiring different control of each PE. To overcome this limitation, it is essential to convert control flows into data flows by adopting the predicated execution technique, but it may incur additional power consumption. This paper reveals power issues in the predicated execution and proposes a novel technique to mitigate power overhead of predicated execution. Contrary to the conventional approach, the proposed mechanism can decide whether to suppress instruction execution or not without decoding the instructions and does not require additional instruction bits, thereby resulting in energy savings. Experimental results show that energy consumed by the reconfigurable array and its configuration memory is reduced by up to 23.9%.

IEEE Transactions on Dependable and Secure Computing | 2014

Software-Level Approaches for Tolerating Transient Faults in a Coarse-Grained Reconfigurable Architecture

Kyuseung Han; Ganghee Lee; Kiyoung Choi

Coarse-grained reconfigurable architectures have drawn increasing attention due to their merits in performance and flexibility. Typically, they have many processing elements in the form of an array, which is suitable for implementing the spatial redundancy used in fault-tolerant system design. This paper presents a purely software-level approach to implementing transient-fault tolerance on an existing processing element array without any modification to the architecture. It includes an automated design flow to construct a fault-tolerant system and mathematical modeling for analyzing system reliability. Experiments with real-world applications show the effectiveness of the proposed approaches in terms of yield enhancement and system reliability.

international soc design conference | 2011

A host-accelerator communication architecture design for efficient binary acceleration

Yangsu Kim; Kyuseung Han; Kiyoung Choi

Binary acceleration of a kernel on an accelerator may have a data duplication problem. Some data in an address range may be copied into the local memory of the accelerator incurring data copy overhead as well as a coherence problem. Configurable Range Memory (CRM) is a memory shared by the host processor and the accelerator, which can specify its own address range such that the data within the range can be directly loaded into it. However, the memory may need to be carefully designed considering the memory access patterns of the accelerator not to incur any unnecessary overhead. This work presents a new CRM architecture and shows how it improves the system performance with a novel Coarse-Grained Reconfigurable Array (CGRA) architecture.

Integration | 2014

Design of a coarse-grained reconfigurable architecture with floating-point support and comparative study

Manhwee Jo; Dongwook Lee; Kyuseung Han; Kiyoung Choi

With a huge increase in demand for various kinds of compute-intensive applications in electronic systems, researchers have focused on coarse-grained reconfigurable architectures because of their advantages: high performance and flexibility. This paper presents FloRA, a coarse-grained reconfigurable architecture with floating-point support. A two-dimensional array of integer processing elements in FloRA is configured at run time to perform floating-point operations as well as integer operations. The chip is fabricated in a 130nm process, and the total area overhead due to additional hardware for floating-point operations is about 7.4% compared to the previous architecture, which does not support floating-point operations. The fabricated chip runs at a 125MHz clock frequency with a 1.2V power supply. Experiments show 11.6x speedup on average compared to ARM9 with a vector-floating-point unit for integer-only benchmark programs as well as programs containing floating-point operations. Compared with other similar approaches including XPP and Butter, the proposed architecture shows much higher performance for integer applications, while maintaining about half the performance of Butter for floating-point applications.

Journal of Applied Animal Research | 2001

Effect of Different Dietary CP Levels on the Growth, Nutrient Utilization and Carcass Characteristics of Finishing Barrows and Gilts Reared in Phase Feeding Regimen

Jun-Yeong Lee; Jungwoo Kim; J. H. Kim; In-Kyoung Kim; Kyuseung Han

Abstract Lee, J. H., Kim, J. D., Kim, J. H., Kim, I. H., Han, In K. 2001. Effect of different dietary CP levels on the growth, nutrient utilization and carcass characteristics of finishing barrows and gilts reared in phase feeding regimen. J. Appl. Anim. Res., 19: 145–163. This experiment was conducted to investigate the effects of different crude protein (CP) sequences on growth performance, nutrient utilization and carcass characteristics of finishing barrows and gilts under a three-phase feeding regimen. A total of 120 finishing pigs (Landrace × Large White × Duroc) averaging 53.3±0.91 kg of body weight and 30 pigs (averaging 52.5±0.57, 81.8±0.79 and 100.7±0.89, respectively) were assigned to the feeding and the metabolic trial, respectively. The experiment was arranged as a 2×3 factorial design, with two sexes (barrows and gilts) and three dietary treatments. Each treatment had four replicates with five pigs per replicate. The finishing period (53 to 107 kg) was divided into three phases (53 to 69 kg, 69 to 88 kg and 88 to 107 kg). Dietary treatments included 1) 17%-15%-13% (high CP), 2) 16%-14%-12% (medium CP), 3) 15%-13%-11% (low CP) sequence for the finishing period. During the overall experimental period (53 to 107 kg), there was no interaction between sex and dietary CP level on growth performance. Barrows consumed more feed (p<0.01) and grew faster (p<0.01) than gilts did. ADG of pigs in the high dietary CP feeding group was significantly higher than that of pigs in the low dietary CP feeding group (p<0.05). Average digestibilities of essential amino acids (EAA), non-essential amino acids (NEAA) and total amino acids were generally not influenced by dietary CP level or sex. However, fecal daily nitrogen (N) excretion averaged over all periods was significantly higher in the high dietary CP feeding group than in the medium and low dietary CP feeding groups (p<0.05). Average blood urea nitrogen (BUN) concentration was greater (p<0.05) in barrows than in gilts and increased as dietary CP concentration increased (p<0.05). Backfat (BF) thickness was greater in barrows than in gilts (p<0.05). Longissimus muscle area (LEA) was greater in gilts than in barrows (p<0.01) and was greater in the high and medium dietary CP feeding regimens than in the low dietary CP feeding regimen (p<0.05). It is concluded that the 16%-14%-12% dietary CP sequence is desirable with respect to economics and environment for a practical three-phase feeding regimen for gilts and barrows during the finishing period.

asia and south pacific design automation conference | 2014

Leveraging parallelism in the presence of control flow on CGRAs

Jihyun Ryoo; Kyuseung Han; Kiyoung Choi

Coarse-Grained Reconfigurable Architectures (CGRAs) are suitable for accelerating data-intensive applications in embedded systems due to their high performance and power efficiency. However, as application programs become more complex and contain more control flow, it becomes harder to accelerate them on CGRAs. Previous research on this issue has focused on correct execution of control flow rather than its acceleration. This paper reveals how control flow degrades the performance of programs and proposes software approaches to accelerating control flow by exploiting the parallelism residing within each conditional as well as among conditionals. Experiments show that our proposed techniques improve performance by 2.51 times on average.
