Min-wook Ahn | Researchain

Publication

Featured research published by Min-wook Ahn.

asia and south pacific design automation conference | 2013

Reevaluating the latency claims of 3D stacked memories

Daniel W. Chang; Gyung-Su Byun; Ho-Young Kim; Min-wook Ahn; Soojung Ryu; Nam Sung Kim; Michael J. Schulte

In recent years, 3D technology has been a popular area of study that has allowed researchers to explore a number of novel computer architectures. One of the more popular topics is that of integrating 3D main memory dies below the computing die and connecting them with through-silicon vias (TSVs). This is assumed to reduce off-chip main memory access latencies by roughly 45% to 60%. Our detailed circuit-level models, however, demonstrate that this latency reduction from the TSVs is significantly less. In this paper, we present these models, compare 2D and 3D main memory latencies, and show that the reduction in latency from using 3D main memory is no more than 2.4 ns. We also show that although the wider I/O bus width enabled by using TSVs increases performance, it may do so with an increase in power consumption. Although TSVs consume less power per bit transfer than off-chip metal interconnects (11.2 times less power per bit transfer), TSVs typically use considerably more bits and may result in a net increase in power due to the large number of bits in the memory I/O bus. Our analysis shows that although a 3D memory hierarchy exploiting a wider memory bus can increase performance, this performance increase may not justify the net increase in power consumption.

field-programmable technology | 2012

SCC based modulo scheduling for coarse-grained reconfigurable processors

Won-Sub Kim; Dong-hoon Yoo; Haewoo Park; Min-wook Ahn

Coarse-grained reconfigurable array (CGRA) architectures aim to offer high performance at low power consumption, especially for digital signal processing and streaming applications. To fully exploit the computing capability of a CGRA, it is essential to develop a scheduling algorithm which maps operations onto the processing elements of the CGRA. Modulo scheduling [1] is known as the state-of-the-art algorithm for CGRA scheduling, and there are many variants [2][3][4]. However, they struggle with inter-iteration dependences, called recurrences, that form cyclic dependences. Hence we propose a new scheduling technique that efficiently handles the cyclic dependences. The key techniques are grouping all the mutually dependent recurrence cycles into a strongly connected component (SCC), and scheduling the input data flow graph (DFG) based on SCCs. Since grouping removes all the recurrence cycles from the DFG, the resulting SCC graph becomes a directed acyclic graph (DAG) in which the scheduler can track the total order of SCCs. While processing SCCs one by one, our intra-SCC scheduler analyzes the dependences between every pair of distinct operations inside the SCC and produces a schedule for them. Thanks to the well-structured form of the SCC-based graph, we obtain a more efficient schedule compared to the previous CGRA scheduling algorithm [2]. The experimental results show that the proposed technique enhances the performance of recurrence-dominant loops by up to 3.5X and raises the success rate of modulo scheduling compared to the previous CGRA scheduling algorithm [2].
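The grouping step described in this abstract can be illustrated with a minimal sketch. It assumes the DFG is given as an adjacency dictionary and uses Tarjan's algorithm to find SCCs; the abstract does not say which SCC algorithm the authors use, and the names `strongly_connected_components`, `condense`, and the example graph are hypothetical. The intra-SCC scheduler itself is not shown.

```python
from collections import defaultdict


def strongly_connected_components(graph):
    """Tarjan's algorithm (an assumed choice, not taken from the paper).

    `graph` maps each node to a list of successor nodes.
    Returns SCCs as lists of nodes, in reverse topological order.
    """
    index, lowlink = {}, {}
    stack, on_stack = [], set()
    sccs = []
    counter = 0

    def visit(v):
        nonlocal counter
        index[v] = lowlink[v] = counter
        counter += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                visit(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:
                lowlink[v] = min(lowlink[v], index[w])
        if lowlink[v] == index[v]:
            # v is the root of an SCC: pop its members off the stack.
            component = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.append(w)
                if w == v:
                    break
            sccs.append(component)

    for v in graph:
        if v not in index:
            visit(v)
    return sccs


def condense(graph):
    """Collapse each SCC into one node; return (sccs, dag).

    `sccs` is in topological order, and `dag` maps an SCC index to the
    set of SCC indices it feeds. Recurrence cycles vanish inside SCCs.
    """
    sccs = strongly_connected_components(graph)
    sccs.reverse()  # Tarjan emits SCCs in reverse topological order
    scc_of = {node: i for i, comp in enumerate(sccs) for node in comp}
    dag = defaultdict(set)
    for v, successors in graph.items():
        for w in successors:
            if scc_of[v] != scc_of[w]:
                dag[scc_of[v]].add(scc_of[w])
    return sccs, dict(dag)


# Hypothetical DFG: "add" and "mul" form a recurrence cycle.
dfg = {"load": ["add"], "add": ["mul"], "mul": ["add", "store"], "store": []}
sccs, dag = condense(dfg)
```

In this example the cycle between `add` and `mul` collapses into a single SCC, so a scheduler could walk the three SCCs in list order without ever meeting a back edge.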

field-programmable technology | 2012

Design evaluation of OpenCL compiler framework for Coarse-Grained Reconfigurable Arrays

Hee Seok Kim; Min-wook Ahn; John A. Stratton; Wen-mei W. Hwu

OpenCL is undoubtedly becoming one of the most popular parallel programming languages as it provides a standardized and portable programming model. However, adopting OpenCL for Coarse-Grained Reconfigurable Arrays (CGRAs) is challenging because their architectural capabilities diverge from those of GPUs. In particular, CGRAs are designed to accelerate loop execution by software pipelining on a grid of functional units exploiting instruction-level parallelism. This is vastly different from a GPU, which executes data-parallel kernels using a large number of parallel threads. Therefore, an OpenCL compiler and runtime for CGRAs must map the threaded parallel programming model to a loop-parallel execution model so that the architecture can best utilize its resources. In this paper, we propose and evaluate a design for an OpenCL compiler framework for CGRAs. The proposed design is composed of a serializer and a post optimizer. The serializer transforms parallel execution of work-items to an equivalent loop-based iterative execution in order to avoid expensive multithreading on CGRAs. The resulting code is further optimized by the post optimizer to maximize the coverage of software-pipelinable innermost loops. To achieve this goal, various loop-level optimizations can take place in the post optimizer using the loops introduced by the serializer for iterative execution of OpenCL kernels. We provide an analysis of the proposed framework on a set of well-studied standard OpenCL kernels by comparing the performance of various implementations of the benchmarks.
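The serializer idea can be sketched in a few lines, under the assumption of a one-dimensional kernel with no barriers. This is an illustration of the general work-item-to-loop transformation, not the authors' compiler; `vector_add_kernel` and `run_serialized` are hypothetical names, and Python stands in for the generated code.

```python
def vector_add_kernel(global_id, a, b, out):
    # Body of a hypothetical OpenCL-style kernel:
    # each work-item processes exactly one element.
    out[global_id] = a[global_id] + b[global_id]


def run_serialized(kernel, global_size, *args):
    # Serialized execution: the work-items become iterations of an
    # ordinary loop, so no per-work-item thread is needed.
    for global_id in range(global_size):
        kernel(global_id, *args)


a = [1, 2, 3]
b = [10, 20, 30]
out = [0] * 3
run_serialized(vector_add_kernel, len(out), a, b, out)
```

The loop over `global_id` is the kind of innermost loop a software-pipelining backend could then target. A kernel containing barriers would need additional handling that this sketch omits.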

international conference on computer aided design | 2013

Dynamic bandwidth scaling for embedded DSPs with 3D-stacked DRAM and wide I/Os

Daniel W. Chang; Young Hoon Son; Jung Ho Ahn; Ho-Young Kim; Min-wook Ahn; Michael J. Schulte; Nam Sung Kim

3D main memory is an emerging technology that stacks DRAM dies underneath the processor die using through-silicon vias (TSVs). Prior studies assumed that such technology would decrease main memory access latency by 45% to 60%, while also allowing designers to increase main memory bandwidth. Although the latter is true, it was recently shown that the latency savings of 3D main memory are only 6.3%. In this paper, we first analyze memory latency reduction opportunities in a 3D main memory system with Wide I/O by taking better advantage of 3D integration technology and quantify their benefit. Specifically, redesigning the DRAM-to-memory-controller synchronizers and placing the address, command, and data pads closer to the DRAM banks can decrease 3D main memory latency by 24.7%. We show that current 3D DRAM with Wide I/O can increase the geometric mean performance of an embedded processor that is similar to a Texas Instruments C67x DSP by 9.7% (and up to 23.3%). Second, we observe that 3D DRAM with Wide I/O can increase average system energy consumption of energy-constrained embedded DSPs by 2.6% (and up to 8.9%). To improve I/O energy efficiency, we propose to dynamically scale memory bandwidth (i.e., the I/O width) at runtime based on an application's program phases. Our dynamic bandwidth scaling algorithms increase average performance by 6.6% while increasing average energy consumption by only 0.5%.

international conference on consumer electronics | 2013

The acceleration of various multimedia applications on reconfigurable processor

Min-wook Ahn; Dong-hoon Yoo; Soojung Ryu; Jeongwook Kim

Modern consumer electronics require efficient and versatile processors at their heart to support various functions. This paper shows how our reconfigurable processor (RP) accelerates various multimedia applications, even those with different features. The RP provides an acceleration mode powered by a modulo scheduling algorithm and intrinsics specialized for each application. By exploiting these two, the RP can enhance performance by up to 71.54x for audio, video, image signal processing (ISP), 3D graphics and major communication channel applications (DVBT2).

international parallel and distributed processing symposium | 2012

Dynamic Operands Insertion for VLIW Architecture with a Reduced Bit-width Instruction Set

Jongwon Lee; Jonghee M. Youn; Jihoon Lee; Min-wook Ahn; Yunheung Paek

Performance, code size and power consumption are all primary concerns in embedded systems. To this end, VLIW architecture has proven to be useful for embedded applications with abundant instruction-level parallelism. But due to the long instruction bus width, it often consumes more power and memory space than necessary. One way to lessen this problem is to adopt a reduced bit-width instruction set architecture (ISA) that has a narrower instruction word length. This facilitates a more efficient hardware implementation in terms of area and power by decreasing bus-bandwidth requirements and the power dissipation associated with instruction fetches. Earlier studies also reported that it helps to reduce the code size considerably. In practice, however, it is impossible to convert a given ISA fully into an equivalent reduced bit-width one because the narrow instruction word, due to bit-width restrictions, can encode only a small subset of normal instructions in the original ISA. Consequently, existing processors provide narrow instructions in very limited cases along with severe restrictions on register accessibility. The objective of this work is to explore the possibility of complete conversion, as a case study, of an existing 32-bit VLIW ISA into a 16-bit one that supports effectively all 32-bit instructions. To this end, we attempt to circumvent the bit-width restrictions by dynamically extending the effective instruction word length of the converted 16-bit operations. At compile time, when a 32-bit operation is converted to a 16-bit word format, we compute how many bits are additionally needed to represent the whole 32-bit operation and store the bits separately in the VLIW code. Then at run time, these bits are retrieved on demand and inserted into the proper 16-bit operation to reconstruct the original 32-bit representation. According to our experiment, the code size becomes significantly smaller after the conversion to 16-bit VLIW code. Also, at a slight run-time cost, the machine with the 16-bit ISA consumes much less energy than the original machine.

international conference on consumer electronics | 2014

JTS-based static branch prediction

Tai-song Jin; Jin-Seok Lee; Min-wook Ahn; Yoonseo Choi; Do Hyung Kim; Shihwa Lee

VLIW architectures are popular design choices in the embedded computing market because of their capability of delivering performance at low power. Branch prediction plays a key role in minimizing pipeline stalls due to control hazards. Though a hardware branch predictor can produce good predictions, its hardware cost often hinders it from being used in low-power VLIW architectures. On the other hand, software branch prediction by the compiler can achieve comparable prediction quality by utilizing delay slots intelligently without hardware cost. In this paper, we propose a novel static branch prediction technique using jump target setting (JTS) instructions. The JTS-enabled VLIW architecture has been successfully shipped in several commercial consumer electronic devices from Samsung. In our experiment using multimedia applications, the proposed branch prediction scheme outperforms the conventional static branch prediction with delay slots by 9%.

international conference on consumer electronics | 2014

Nop compression scheme for high speed DSPs based on VLIW architecture

Tai-song Jin; Min-wook Ahn; Dong-hoon Yoo; Dong-kwan Suh; Yoonseo Choi; Do-Hyung Kim; Shihwa Lee

VLIW (Very Long Instruction Word) is one of the most popular architectures in embedded systems because it offers low power consumption and low hardware cost. Due to the nature of VLIW architecture, such as bundled instructions and large register files, VLIW processors run with large instruction code sizes at relatively low clock frequencies. However, compact instruction size and high clock frequency are the most important requirements of modern embedded consumer electronics. In this paper, we propose a novel instruction compression scheme to solve this problem. The experiment shows that the proposed scheme can reduce instruction size by 23% and improve clock frequency by 25% on average compared with conventional compression schemes.

microprocessor test and verification | 2012

Verification of CGRA Executable Code and Debugging of Memory Dependence Violation

Heejun Shim; Min-wook Ahn; Jin-Sae Jung; Yen-Jo Han; Soojung Ryu

We present verification and debugging of highly optimized executable code that is generated from C source code to run on a CGRA (Coarse-Grained Reconfigurable Array). To generate the executable code, the CGRA compiler uses a software pipelining technique that maps instructions in a loop body to multiple FUs (functional units) of the CGRA for concurrent execution. Often, the programmer chooses to use aggressive optimization as a way to obtain high-performing executable code. For example, the programmer may turn off the memory dependence check in order to suppress false dependences that would otherwise result in overly conservative, and therefore poorly performing, executable code. The trouble is that it is not easy to verify the correctness of the resulting executable code. In this paper, we propose a method to verify CGRA executable code, to detect memory dependence violations, and to report the source code position where each violation occurs. We use the behavior of VLIW code as a reference and compare it with the behavior of CGRA code. To guide the comparison, compiler-generated mapping table information is used.

international conference on consumer electronics | 2015

Source level offloading for special-purpose hardware accelerators

Shin-gyu Kim; Chae-seok Im; Min-wook Ahn; Seung-Won Lee

This paper presents CLOSH, a source level offloading tool for special-purpose hardware accelerators. CLOSH is designed to make it easy to accelerate existing applications when their source code is available. We evaluated CLOSH with one application on our new TV platform and found that the required cycles decreased by 27.4%.
