
Publication


Featured research published by Choonki Jang.


Compilers, Architecture, and Synthesis for Embedded Systems | 2006

A dynamic code placement technique for scratchpad memory using postpass optimization

Bernhard Egger; Chihun Kim; Choonki Jang; Yoonsung Nam; Jaejin Lee; Sang Lyul Min

In this paper, we propose a fully automatic dynamic scratchpad memory (SPM) management technique for instructions. Our technique loads required code segments into the SPM on demand at runtime. The approach is based on postpass analysis and optimization techniques and handles the whole program, including libraries. The code mapping is determined by solving a mixed integer linear programming formulation that approximates our demand paging technique. We increase the effectiveness of demand paging by extracting natural loops from functions; these loops are smaller than the enclosing functions and have higher instruction fetch counts. The postpass optimizer analyzes the object files of an application and transforms them into an application binary image that enables demand paging to the SPM. We evaluate our technique on eleven embedded applications and compare it to a processor core with an instruction cache in terms of performance and energy consumption. The cache size is about 20% of the executed code size, and the SPM size is chosen such that its die area is equal to that of the cache. The experimental results show that, on average, the energy consumption of the processor core and memory subsystem can be reduced by 21.6% and performance improved by 20.2%. Moreover, in comparison with the optimal static placement strategy, our technique reduces energy consumption by 23.7% and improves performance by 22.9%, on average.
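
The runtime half of such a scheme can be pictured as a small loader that call stubs invoke to bring a code segment into the SPM before it executes. Below is a minimal sketch in C of that demand-paging idea; the segment table, the bump allocation, and names such as spm_load_segment are illustrative assumptions, not the paper's MILP-driven manager.

```c
/* Minimal sketch of an on-demand code loader for a scratchpad memory (SPM).
 * Hypothetical names: code_segment, spm_load_segment, SPM_SIZE. The real
 * manager is generated by a postpass optimizer and driven by an MILP-based
 * code mapping; this only illustrates the demand-paging idea. */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define SPM_SIZE 4096          /* bytes of scratchpad reserved for paged code */

typedef struct {
    const uint8_t *src;        /* segment image in external memory (flash/DRAM) */
    uint32_t size;             /* segment size in bytes                         */
    int32_t  spm_off;          /* current offset in SPM, or -1 if not resident  */
} code_segment;

static uint8_t spm[SPM_SIZE];  /* models the scratchpad memory                  */
static uint32_t spm_next = 0;  /* naive bump allocator; real managers evict     */
                               /* according to the precomputed mapping          */

/* Load a code segment into the SPM on demand and return its SPM address. */
static void *spm_load_segment(code_segment *seg)
{
    if (seg->spm_off >= 0)                     /* hit: already resident      */
        return &spm[seg->spm_off];
    if (spm_next + seg->size > SPM_SIZE)       /* out of space: wipe the SPM */
        spm_next = 0;                          /* (other segments' offsets   */
                                               /*  would be invalidated too) */
    memcpy(&spm[spm_next], seg->src, seg->size);
    seg->spm_off = (int32_t)spm_next;
    spm_next += seg->size;
    return &spm[seg->spm_off];
}

int main(void)
{
    static const uint8_t loop_body[64] = {0};  /* stands in for a natural loop */
    code_segment hot_loop = { loop_body, sizeof loop_body, -1 };

    void *first  = spm_load_segment(&hot_loop); /* miss: copied into SPM */
    void *second = spm_load_segment(&hot_loop); /* hit: no copy needed   */
    printf("resident at %p (hit returns same address: %d)\n",
           first, first == second);
    return 0;
}
```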


Languages, Compilers, and Tools for Embedded Systems | 2008

FaCSim: a fast and cycle-accurate architecture simulator for embedded systems

Jaejin Lee; Jung-Hyun Kim; Choonki Jang; Seungkyun Kim; Bernhard Egger; Kwangsub Kim; SangYong Han

There has been strong demand for fast and cycle-accurate virtual platforms in the embedded systems area, where developers can do meaningful software development, including performance debugging, in the context of the entire platform. In this paper, we describe the design and implementation of a fast and cycle-accurate architecture simulator called FaCSim as a first step towards such a virtual platform. FaCSim accurately models the ARM9E-S processor core and the ARM926EJ-S processor's memory subsystem. It accurately simulates exceptions and interrupts to enable whole-system simulation, including the OS. Since it is implemented in a modular manner in C++, it can be easily extended with other system components by subclassing or adding new classes. FaCSim is based on an interpretive simulation technique to provide flexibility while still achieving high speed. It enables fast cycle-accurate architecture simulation by means of three mechanisms. First, it computes the elapsed cycles in each pipeline stage as a chunk and incrementally adds them up to advance the core clock, instead of performing cycle-by-cycle simulation. Second, it uses a basic-block cache that caches decoded instructions at the basic-block level. Finally, it is parallelized to exploit the multicore systems that are widely available today. Using 21 applications from the EEMBC benchmark suite, FaCSim's accuracy is validated against the ARM926EJ-S development board from ARM and falls within a ±7% error margin. Due to basic-block-level caching and parallelization, FaCSim is, on average, more than three times faster than ARMulator and more than six times faster than SimpleScalar.
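
The basic-block cache in the second mechanism amounts to a map from a block's start address to its already-decoded instructions, so repeated executions of a hot block skip the decode step. The following C sketch illustrates that idea only; the structure names (decoded_insn, bb_cache_lookup) and the stand-in decoder are assumptions, not FaCSim's actual implementation.

```c
/* Sketch of a basic-block cache: decoded instructions are cached per block
 * start address so they are decoded once and reused on later executions.
 * Names (decoded_insn, bb_cache_lookup, BB_CACHE_SIZE) are illustrative. */
#include <stdint.h>
#include <stdio.h>

#define BB_CACHE_SIZE 1024          /* direct-mapped, power of two        */
#define MAX_BB_LEN    32            /* max instructions per cached block  */

typedef struct {
    uint32_t opcode;                /* fields a real decoder would fill in */
    uint32_t operands;
} decoded_insn;

typedef struct {
    uint32_t     start_pc;          /* block start address (tag)          */
    int          len;               /* number of decoded instructions     */
    decoded_insn insns[MAX_BB_LEN];
    int          valid;
} bb_entry;

static bb_entry bb_cache[BB_CACHE_SIZE];

/* Stand-in decoder: a real simulator parses the full instruction word. */
static decoded_insn decode_one(uint32_t word)
{
    decoded_insn d = { word >> 24, word & 0xFFFFFF };
    return d;
}

/* Return the cached decoded block for start_pc, decoding it on a miss. */
static const bb_entry *bb_cache_lookup(uint32_t start_pc,
                                       const uint32_t *mem, int block_len)
{
    bb_entry *e = &bb_cache[(start_pc >> 2) & (BB_CACHE_SIZE - 1)];
    if (!e->valid || e->start_pc != start_pc) {        /* miss: decode block */
        e->start_pc = start_pc;
        e->len = block_len;
        for (int i = 0; i < block_len; ++i)
            e->insns[i] = decode_one(mem[i]);
        e->valid = 1;
    }
    return e;                                          /* hit: reuse decode  */
}

int main(void)
{
    uint32_t code[4] = {0xE3A00001, 0xE2800001, 0xE3500005, 0x1AFFFFFC};
    const bb_entry *b1 = bb_cache_lookup(0x8000, code, 4);  /* decode      */
    const bb_entry *b2 = bb_cache_lookup(0x8000, code, 4);  /* cached hit  */
    printf("same cached block: %d, first opcode 0x%x\n", b1 == b2,
           (unsigned)b1->insns[0].opcode);
    return 0;
}
```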


IEEE Transactions on Computers | 2010

Scratchpad Memory Management Techniques for Code in Embedded Systems without an MMU

Bernhard Egger; Seungkyun Kim; Choonki Jang; Jaejin Lee; Sang Lyul Min; Heonshik Shin

We propose a code scratchpad memory (SPM) management technique with demand paging for embedded systems that have no memory management unit. Based on profiling information, a postpass optimizer analyzes and optimizes application binaries in a fully automated process. It classifies the code of the application, including libraries, into three classes based on a mixed integer linear programming formulation: external code is executed directly from external memory; pinned code is loaded into the SPM when the application starts and stays in the SPM; and paged code is loaded into and unloaded from the SPM on demand. We evaluate the proposed technique by running 14 embedded applications on both a cycle-accurate ARM processor simulator and an ARM1136JF-S core. On the simulator, the reference case, a four-way set-associative cache, is compared to a direct-mapped cache and an SPM of comparable die area. On average, we observe an improvement of 12 percent in runtime performance and a 21 percent reduction in energy consumption. On the ARM11 board, the reference case running with the 16-KB four-way set-associative cache is compared to the demand paging solution on the 16-KB SPM, optionally supported by the cache. The measured results show both a runtime performance improvement and a reduction in energy consumption of 23 percent, on average.
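
At run time the three classes behave differently: only paged code ever goes through a loader. The C sketch below illustrates that split with a hypothetical function table and resolver (resolve_code_addr); the actual classification and placement come from the MILP-based postpass optimizer and are not reproduced here.

```c
/* Sketch of the three code classes used by the SPM manager: external code
 * runs from external memory, pinned code is copied into the SPM once at
 * startup, and paged code is loaded on demand. All names (code_class,
 * resolve_code_addr, ...) are illustrative only. */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

typedef enum { CODE_EXTERNAL, CODE_PINNED, CODE_PAGED } code_class;

typedef struct {
    const char *name;
    code_class  cls;
    const void *ext_addr;   /* location in external memory              */
    uint32_t    size;
    void       *spm_addr;   /* SPM location (pinned/paged), else NULL   */
} func_desc;

#define SPM_SIZE 8192
static uint8_t spm[SPM_SIZE];
static uint32_t pinned_top = 0;      /* pinned region grows from the bottom */

/* Copy all pinned functions into the SPM once, when the application starts. */
static void spm_init_pinned(func_desc *funcs, int n)
{
    for (int i = 0; i < n; ++i) {
        if (funcs[i].cls != CODE_PINNED) continue;
        memcpy(&spm[pinned_top], funcs[i].ext_addr, funcs[i].size);
        funcs[i].spm_addr = &spm[pinned_top];
        pinned_top += funcs[i].size;
    }
}

/* Return the address a call stub should branch to for this function. */
static const void *resolve_code_addr(func_desc *f)
{
    switch (f->cls) {
    case CODE_EXTERNAL: return f->ext_addr;            /* run in place        */
    case CODE_PINNED:   return f->spm_addr;            /* loaded at startup   */
    case CODE_PAGED:                                   /* load on demand      */
        if (f->spm_addr == NULL) {                     /* naive placement:    */
            f->spm_addr = &spm[pinned_top];            /* the real manager    */
            memcpy(f->spm_addr, f->ext_addr, f->size); /* uses the MILP map   */
        }
        return f->spm_addr;
    }
    return NULL;
}

int main(void)
{
    static const uint8_t img[3][128] = {{0}};
    func_desc funcs[3] = {
        { "init",    CODE_EXTERNAL, img[0], 128, NULL },
        { "hot_dsp", CODE_PINNED,   img[1], 128, NULL },
        { "ui_draw", CODE_PAGED,    img[2], 128, NULL },
    };
    spm_init_pinned(funcs, 3);
    for (int i = 0; i < 3; ++i)
        printf("%s -> %p\n", funcs[i].name, (void *)resolve_code_addr(&funcs[i]));
    return 0;
}
```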


ACM Transactions on Architecture and Code Optimization | 2012

Automatic code overlay generation and partially redundant code fetch elimination

Choonki Jang; Jaejin Lee; Bernhard Egger; Soojung Ryu

There is increasing interest in explicitly managed memory hierarchies, where a hierarchy of distinct memories is exposed to the programmer and managed explicitly in software. Such hierarchies can be found in typical embedded systems and in an emerging class of multicore architectures. To run an application that requires more code memory than the available higher-level memory, an overlay structure is typically needed. The overlay structure is generated manually by the programmer or automatically by a specialized linker. Manual code overlaying requires the programmer to deeply understand the program structure to achieve maximum memory savings with minimum performance degradation. Although a linker can automatically generate the code overlay structure, its memory savings are limited and it can even cause significant performance degradation, because traditional techniques do not consider the program context. In this article, we propose an automatic code overlay generation technique that overcomes the limitations of traditional automatic code overlaying techniques. We target a system context that imposes two distinct constraints: (1) no hardware support for address translation, and (2) a spatially and temporally coarse-grained faulting mechanism at the function level. Our approach addresses these two constraints as efficiently as possible. Our technique statically computes the Worst-Case Number of Conflict misses (WCNC) between two different code segments using path expressions. It then constructs a static temporal relationship graph with the WCNCs and emits an overlay structure for a given higher-level memory size. We also propose an interprocedural partial redundancy elimination technique that minimizes redundant code copying caused by the generated overlay structure. Experimental results show that our approach is promising.
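
One way to picture how the temporal relationship graph is used: code segments connected by heavy WCNC edges should not share an overlay region, because they would repeatedly overwrite each other. The C sketch below makes that concrete with a greedy region assignment over a small, made-up WCNC matrix; it ignores the memory-size constraint and the path-expression analysis, and it is an illustrative stand-in rather than the paper's algorithm.

```c
/* Sketch of overlay-region assignment driven by a temporal relationship
 * graph. wcnc[i][j] is assumed to be a precomputed Worst-Case Number of
 * Conflict misses between code segments i and j (the path-expression
 * analysis that would produce it is not shown). Segments mapped to the
 * same overlay region overwrite each other in the on-chip memory, so the
 * greedy pass places each segment in the region with the smallest added
 * conflict cost. Names and values are illustrative. */
#include <stdio.h>

#define NSEG    5
#define NREGION 2

static const long wcnc[NSEG][NSEG] = {     /* symmetric conflict estimates */
    { 0, 90,  5,  2,  1},
    {90,  0,  4, 60,  1},
    { 5,  4,  0,  3, 40},
    { 2, 60,  3,  0,  2},
    { 1,  1, 40,  2,  0},
};

int main(void)
{
    int region_of[NSEG];

    for (int s = 0; s < NSEG; ++s) {
        long best_cost = -1;
        int  best_r = 0;
        for (int r = 0; r < NREGION; ++r) {
            long cost = 0;                 /* conflicts added by joining r  */
            for (int t = 0; t < s; ++t)
                if (region_of[t] == r)
                    cost += wcnc[s][t];
            if (best_cost < 0 || cost < best_cost) {
                best_cost = cost;
                best_r = r;
            }
        }
        region_of[s] = best_r;
        printf("segment %d -> overlay region %d (added conflict cost %ld)\n",
               s, best_r, best_cost);
    }
    return 0;
}
```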


Languages, Compilers, and Tools for Embedded Systems | 2011

An instruction-scheduling-aware data partitioning technique for coarse-grained reconfigurable architectures

Choonki Jang; Jungwon Kim; Jaejin Lee; Hee-Seok Kim; Dong-hoon Yoo; Suk-Jin Kim; Hong-Seok Kim; Soojung Ryu

In this paper, we propose a data partitioning technique for a memory subsystem that consists of a multi-ported scratchpad memory (SPM) unit and a single-ported data cache in a coarse-grained reconfigurable array (CGRA) architecture. The embedded reconfigurable processor executes programs by switching between the Non-VLIW and VLIW modes, depending on the type of the code region, to achieve high performance. The VLIW mode executes code regions with high instruction-level parallelism (ILP) that require high memory bandwidth, and the Non-VLIW mode executes regions with low ILP that require low memory latency. Our data partitioning technique between the SPM and the data cache is based on data interference graph reduction and profiling information. Given an SPM size, it finds the optimal data partitions by taking the VLIW instruction schedule into consideration. We evaluate our data partitioning technique for the CGRA architecture with three representative multimedia applications.
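
Intuitively, the data objects worth promoting to the multi-ported SPM are those whose accesses would otherwise stall the tightly scheduled VLIW regions, subject to the SPM capacity. The following C sketch shows a knapsack-style greedy stand-in for that decision, using a made-up profiled benefit per array; the paper's interference-graph reduction and schedule-aware cost model are not reproduced here.

```c
/* Sketch of SPM-vs-cache data partitioning as a benefit-per-byte greedy
 * choice under an SPM capacity budget. profiled_benefit stands in for a
 * schedule-aware cost model (stall cycles saved if the array is served
 * from the multi-ported SPM instead of the single-ported cache). All
 * names and numbers are illustrative. */
#include <stdio.h>

typedef struct {
    const char   *name;
    unsigned      size;             /* bytes                               */
    unsigned long profiled_benefit; /* est. cycles saved if placed in SPM  */
    int           in_spm;
} array_info;

int main(void)
{
    array_info arrays[] = {
        { "coeff",    2048,  90000, 0 },
        { "frame",   16384, 120000, 0 },
        { "scratch",  4096,  70000, 0 },
        { "lut",      1024,  30000, 0 },
    };
    const int n = sizeof arrays / sizeof arrays[0];
    unsigned spm_free = 8192;       /* SPM budget in bytes */

    /* Repeatedly pick the unplaced array with the best benefit per byte
     * that still fits; everything left over stays behind the data cache. */
    for (;;) {
        int best = -1;
        double best_ratio = 0.0;
        for (int i = 0; i < n; ++i) {
            if (arrays[i].in_spm || arrays[i].size > spm_free) continue;
            double ratio = (double)arrays[i].profiled_benefit / arrays[i].size;
            if (ratio > best_ratio) { best_ratio = ratio; best = i; }
        }
        if (best < 0) break;
        arrays[best].in_spm = 1;
        spm_free -= arrays[best].size;
    }

    for (int i = 0; i < n; ++i)
        printf("%-8s -> %s\n", arrays[i].name,
               arrays[i].in_spm ? "SPM" : "data cache");
    return 0;
}
```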


ACM Transactions on Embedded Computing Systems | 2011

Demand Paging Techniques for Flash Memory Using Compiler Post-Pass Optimizations

Seungkyun Kim; Kiwon Kwon; Chihun Kim; Choonki Jang; Jaejin Lee; Sang Lyul Min

In this article, we propose an application-specific demand paging mechanism for low-end embedded systems that use flash memory as secondary storage. These systems are not equipped with virtual memory. A small memory space called an execution buffer is used to page the code of an application, and an application-specific page manager manages this buffer. The page manager is automatically generated by a compiler post-pass optimizer and combined with the application image. The post-pass optimizer analyzes the executable image and transforms function call/return instructions into calls to the page manager. As a result, each function in the code can be loaded into memory on demand at runtime. To minimize the overhead incurred by the demand paging technique, code clustering algorithms are also presented. We evaluate our techniques with ten embedded applications; our approach reduces the code memory size by 39.5% on average, with less than 10% performance degradation and 14% more energy consumption on average. Our demand paging technique gives embedded system designers a mechanism for controlling the trade-off between cost, performance, and energy efficiency: they can choose the code memory size depending on their cost, energy, and performance requirements.
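
The rewritten call sites can be pictured as calls into the page manager that first make the callee's code resident in the execution buffer and then transfer control. The C sketch below simulates only the bookkeeping (residency, flash loads); paged_call, the page table, and the whole-buffer eviction policy are illustrative assumptions, not the generated manager.

```c
/* Sketch of the call-through-page-manager idea: the post-pass optimizer
 * rewrites each call so that it first asks the page manager to make the
 * callee's code resident in the execution buffer, then transfers control.
 * Residency is only simulated here (flash loads are counted and the
 * function is invoked directly); on a real target the manager copies the
 * function's code from flash into the buffer and branches into the copy. */
#include <stdio.h>

#define EXEC_BUF_SIZE 4096

typedef struct {
    void   (*entry)(void);     /* the function itself (stands in for its code) */
    unsigned size;             /* code size in bytes, from the post-pass tool  */
    int      resident;         /* 1 if currently in the execution buffer       */
} page_entry;

static unsigned buf_used    = 0;  /* bytes of the execution buffer in use    */
static unsigned flash_loads = 0;  /* how many times code was read from flash */

static void decode_frame(void) { puts("decode_frame running"); }
static void draw_ui(void)      { puts("draw_ui running"); }

static page_entry pages[] = {
    { decode_frame, 1500, 0 },
    { draw_ui,      3000, 0 },
};

/* What a rewritten call site invokes instead of a direct call instruction. */
static void paged_call(int callee)
{
    page_entry *p = &pages[callee];
    if (!p->resident) {
        if (buf_used + p->size > EXEC_BUF_SIZE) {   /* evict everything;   */
            for (unsigned i = 0; i < sizeof pages / sizeof pages[0]; ++i)
                pages[i].resident = 0;              /* real managers evict  */
            buf_used = 0;                           /* by code cluster      */
        }
        buf_used += p->size;                        /* "copy" from flash    */
        flash_loads++;
        p->resident = 1;
    }
    p->entry();                                     /* branch into the copy */
}

int main(void)
{
    paged_call(0);           /* miss: load decode_frame from flash */
    paged_call(0);           /* hit: no flash access               */
    paged_call(1);           /* miss: load draw_ui                 */
    printf("flash loads: %u, buffer bytes used: %u\n", flash_loads, buf_used);
    return 0;
}
```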


International Conference on Parallel Architectures and Compilation Techniques | 2011

A Software-Managed Coherent Memory Architecture for Manycores

Jungho Park; Choonki Jang; Jaejin Lee

Cache-coherent Non-Uniform Memory Access (cc-NUMA) architectures have been widely used for chip multiprocessors (CMPs). However, they require complicated hardware to properly handle the cache coherence problem, and coherence enforcement generates heavy on-chip network traffic. In this work, we propose a simple software-managed coherent memory architecture for manycores. Our memory architecture exploits explicitly addressed local stores. Instead of implementing a complicated cache coherence protocol in hardware, coherence and consistency are supported by software, such as a runtime or an operating system. The local stores, together with this software, leverage conventional caches to make the architecture much simpler and to generate much less network traffic than conventional cc-NUMA-based CMPs. Experimental results indicate that our approach is promising.
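
In such a design, coherence actions that a hardware protocol would perform implicitly become explicit runtime operations: write back a core's modified local-store copy at a release point and invalidate or refetch stale copies at an acquire point. The C sketch below simulates two cores sequentially to show that handoff; ls_fetch, ls_writeback, and ls_invalidate are hypothetical runtime calls, not a real API.

```c
/* Sketch of software-managed coherence over explicitly addressed local
 * stores: each core works on a private copy of shared data, and the
 * runtime inserts explicit writeback/invalidate operations at
 * synchronization points instead of relying on a hardware protocol.
 * Cores are simulated sequentially; all names are illustrative. */
#include <stdio.h>
#include <string.h>

#define LINE_WORDS 8

static int shared_mem[LINE_WORDS];          /* off-chip shared memory       */

typedef struct {
    int copy[LINE_WORDS];                   /* local-store copy of the line */
    int valid;                              /* software-maintained state    */
    int dirty;
} local_store;

static void ls_fetch(local_store *ls)       /* DMA the line into the store  */
{
    memcpy(ls->copy, shared_mem, sizeof shared_mem);
    ls->valid = 1;
    ls->dirty = 0;
}

static void ls_writeback(local_store *ls)   /* flush modified data          */
{
    if (ls->dirty) memcpy(shared_mem, ls->copy, sizeof shared_mem);
    ls->dirty = 0;
}

static void ls_invalidate(local_store *ls)  /* drop a possibly stale copy   */
{
    ls->valid = 0;
}

int main(void)
{
    local_store core0 = {{0}}, core1 = {{0}};

    /* Core 0 produces data in its local store. */
    ls_fetch(&core0);
    core0.copy[0] = 42;
    core0.dirty = 1;

    /* Release on core 0 / acquire on core 1: the runtime makes the update
     * visible with explicit operations, not a hardware coherence message. */
    ls_writeback(&core0);
    ls_invalidate(&core1);
    ls_fetch(&core1);

    printf("core 1 sees %d\n", core1.copy[0]);   /* prints 42 */
    return 0;
}
```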


International Conference on Supercomputing | 2011

SRC: an automatic code overlaying technique for multicores with explicitly-managed memory hierarchies

Choonki Jang

In this paper, we propose an efficient code overlay technique that automatically generates an overlay structure for a given memory size on multicores with explicitly managed memory hierarchies. We observe that finding an efficient overlay structure with minimal memory-copying overhead is similar to the problem of finding a code placement that minimizes conflict misses in the instruction cache. Our algorithm exploits the temporal-ordering information between functions during program execution. Experimental results on the Cell BE processor indicate that our approach is effective and promising.
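
The temporal-ordering signal can be gathered from a profiled trace of function entries: pairs of functions that execution switches between frequently would thrash if they shared an overlay region, much like conflicting lines in an instruction cache. The C sketch below only counts such switches over a made-up trace; the overlay construction itself and the Cell BE specifics are not shown.

```c
/* Sketch of collecting temporal-ordering information from a profiled
 * function trace: count how often execution switches between each pair of
 * functions. Pairs with high switch counts are the ones that would thrash
 * if they shared an overlay region. The trace and names are illustrative. */
#include <stdio.h>

#define NFUNC 4

int main(void)
{
    /* Profiled order in which functions were entered. */
    const int trace[] = {0, 1, 0, 1, 2, 3, 2, 3, 2, 1, 0};
    const int n = sizeof trace / sizeof trace[0];

    long transitions[NFUNC][NFUNC] = {{0}};

    for (int i = 1; i < n; ++i) {
        int a = trace[i - 1], b = trace[i];
        if (a == b) continue;
        transitions[a][b]++;            /* directed count ...              */
        transitions[b][a]++;            /* ... kept symmetric for grouping */
    }

    /* Functions joined by heavy edges should go to different overlay
     * regions; light edges mark pairs that can safely share one. */
    for (int a = 0; a < NFUNC; ++a)
        for (int b = a + 1; b < NFUNC; ++b)
            if (transitions[a][b] > 0)
                printf("f%d <-> f%d : %ld switches\n", a, b, transitions[a][b]);
    return 0;
}
```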


Archive | 2011

MULTIPROCESSOR USING A SHARED VIRTUAL MEMORY AND METHOD OF GENERATING A TRANSLATION TABLE

Choonki Jang; Jaejin Lee; Soojung Ryu; Bernhard Egger; Yoon-Jin Kim; Woong Seo; Young-Chul Cho


Archive | 2015

METHOD AND APPARATUS FOR CONTROLLING RENDERING QUALITY

Choonki Jang; Haewoo Park; Yoonseo Choi

Collaboration


Dive into Choonki Jang's collaboration.

Top Co-Authors

Jaejin Lee
Seoul National University

Bernhard Egger
Seoul National University

Sang Lyul Min
Seoul National University

Seungkyun Kim
Seoul National University

Chihun Kim
Seoul National University