Jaewoo Ahn | Researchain

Publication

Featured research published by Jaewoo Ahn.

architectural support for programming languages and operating systems | 2012

Reducing off-chip memory traffic by selective cache management scheme in GPGPUs

Hyojin Choi; Jaewoo Ahn; Wonyong Sung

The performance of General Purpose Graphics Processing Units (GPGPUs) is frequently limited by the off-chip memory bandwidth. To mitigate this bandwidth wall problem, recent GPUs are equipped with on-chip L1 and L2 caches. However, there has been little work on better utilizing on-chip shared caches in GPGPUs. In this paper, we propose two cache management schemes: write-buffering and read-bypassing. The write-buffering technique tries to utilize the shared cache for inter-block communication, and thereby reduces the DRAM accesses as much as the capacity of the cache allows. The read-bypassing scheme prevents the shared cache from being polluted by streamed data that are consumed only within a thread-block. The proposed schemes can be selectively applied to global memory instructions using newly defined cache operators. We evaluate the effects of the proposed schemes for a few GPGPU applications by simulation and show that the off-chip memory accesses can be successfully reduced by the proposed techniques. We also analyze the effectiveness of these methods when the throughput gap between cores and off-chip memory becomes wider.

signal processing systems | 1998

Pentium-MMX-based implementation of a digital copier

Jaewoo Ahn; Wonyong Sung

We develop real-time image processing programs for a digital copier using a general-purpose microprocessor. To exploit the inherent data parallelism in many image processing algorithms, we use Intel's Pentium processor with multimedia extension (MMX). Each step of the digital copier process, including the X-Zoom and the error diffusion halftoning, is aggressively optimized for the Pentium MMX processor. The X-Zoom process, which is based on the linear interpolation method, is optimized using the software pipelining technique. For the error diffusion halftoning, which requires nonlinear feedback, we exploit both the control-level and data-level parallelism. For the latter approach, a speculative quantization method is developed to break the dependency relation due to feedback and quantization operations. Our implementation achieves a maximum throughput of 30 ppm for A4-size paper using one 166 MHz Pentium MMX CPU, which is approximately five times faster than the code without MMX optimization.

signal processing systems | 1999

A 2 way VLIW processor architecture for embedded multimedia applications

Jiyang Kang; Jaewoo Ahn; Jiyoung Cho; Ki-Il Kum; Wonyong Sung

As the complexity of multimedia applications increases, the need for efficient and compiler-friendly processor architectures also grows. In this paper, a new multimedia processor architecture is proposed. This processor has a 2-issue VLIW architecture with 64-bit SIMD arithmetic functional units to exploit the instruction-level and subword data parallelism found in multimedia applications. Moreover, densely encoded instructions supporting memory operands, DSP-like addressing modes, and SIMD capability boost the performance while keeping the code size and hardware cost small. To maximally utilize this architecture, a software environment including a code converter, a VLIW compiler system, and a compiled simulator has also been developed. The processor core has been synthesized for LSI Logic 0.25 μm library, which results in the total gate count of 102 K. In spite of the relatively smaller issue rate, the proposed processor shows a comparable or higher performance in terms of both the cycle count and the code size when compared to the 8-issue TMS320C62xx, for DSP benchmark kernels and an H.263 video encoder.

signal processing systems | 2009

SIMD processor based implementation of recursive filtering equations

Jaewoo Ahn; Hoseok Chang; Junho Cho; Wonyong Sung

Implementation of recursive equations using parallel computer architecture has long been of interest because the dependency problem makes it difficult to achieve significant speed-up. In this paper, efficient implementation of recursive filtering equations on partitioned data-path SIMD (Single Instruction Multiple Data) processors is studied. In particular, three parallel computation techniques (block filtering, recursive doubling, and multi-block filtering) are implemented and their performances are compared using a Pentium CPU based system. The performance evaluation result of the multi-block processing method on a scalable SIMD processor is also presented.
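
To make the recursive doubling idea concrete, here is a minimal sketch in plain Python. It assumes a first-order recurrence y[n] = a*y[n-1] + x[n] with y[-1] = 0; the function names and the choice of filter are illustrative assumptions, not taken from the paper. Each sample is treated as an affine map of the previous output, and the maps are composed over distances 1, 2, 4, and so on, so every update inside one pass is independent of the others and could in principle be mapped to SIMD lanes.

```python
def recursive_filter_sequential(x, a):
    """Reference loop: each output needs the previous output."""
    y, prev = [], 0.0
    for sample in x:
        prev = a * prev + sample
        y.append(prev)
    return y


def recursive_filter_doubling(x, a):
    """Recursive doubling for y[n] = a*y[n-1] + x[n], y[-1] = 0.

    Invariant after each pass with distance `step`:
        y[i] = coef[i] * y[i - 2*step] + acc[i]
    where outputs at negative indices are taken as zero.
    """
    n = len(x)
    coef = [a] * n          # multiplier applied to the earlier output
    acc = list(x)           # accumulated input contribution
    step = 1
    while step < n:
        new_coef, new_acc = coef[:], acc[:]
        # All iterations of this loop are independent of each other.
        for i in range(step, n):
            new_acc[i] = coef[i] * acc[i - step] + acc[i]
            new_coef[i] = coef[i] * coef[i - step]
        coef, acc = new_coef, new_acc
        step *= 2
    return acc
```

The doubling version needs about log2(N) passes instead of N dependent steps, at the cost of more total multiplications, which is the usual trade-off for this family of methods.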

international conference on image processing | 2010

Parallel implementation of an error diffusion halftoning algorithm with a general purpose graphics processing unit

Becksang Seong; Jaewoo Ahn; Wonyong Sung

General purpose graphics processing units (GPGPUs) contain many execution units, which makes them very attractive for high-speed image processing. However, the error diffusion halftoning algorithm can hardly exploit a massively parallel processing architecture because it uses feedback of the output error as well as the results of neighboring pixels. In this study, pixels that can be processed without dependency are found by examining the dependency graph. Also, a parallel processing method requiring less synchronization overhead is developed by considering the characteristics of GPGPUs.
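
The dependency structure described above can be sketched as follows. This example assumes the Floyd-Steinberg error filter, which the abstract does not name, and all function names are hypothetical. With that filter, pixel (r, c) depends only on pixels whose index c + 2*r is smaller, so pixels that share the same value of c + 2*r form a front with no mutual dependency. The sketch runs the fronts one after another in ordinary Python and checks against the raster-order result; it is not the authors' GPU implementation.

```python
def _quantize_and_diffuse(buf, out, r, c):
    """Threshold one pixel and push its error to unprocessed neighbors."""
    h, w = len(buf), len(buf[0])
    old = buf[r][c]
    new = 255 if old >= 128 else 0
    out[r][c] = new
    err = old - new
    if c + 1 < w:
        buf[r][c + 1] += err * 7 / 16
    if r + 1 < h:
        if c > 0:
            buf[r + 1][c - 1] += err * 3 / 16
        buf[r + 1][c] += err * 5 / 16
        if c + 1 < w:
            buf[r + 1][c + 1] += err * 1 / 16


def halftone_raster(image):
    """Conventional raster-order error diffusion."""
    buf = [[float(v) for v in row] for row in image]
    out = [[0] * len(image[0]) for _ in image]
    for r in range(len(image)):
        for c in range(len(image[0])):
            _quantize_and_diffuse(buf, out, r, c)
    return out


def halftone_wavefront(image):
    """Same algorithm, visiting pixels front by front (t = c + 2*r).

    Pixels on one front do not read each other's results. A truly
    parallel version would still need a gather formulation or atomics,
    because two pixels on a front can add error to the same neighbor.
    """
    h, w = len(image), len(image[0])
    buf = [[float(v) for v in row] for row in image]
    out = [[0] * w for _ in range(h)]
    for t in range(w + 2 * (h - 1)):
        for r in range(h):
            c = t - 2 * r
            if 0 <= c < w:
                _quantize_and_diffuse(buf, out, r, c)
    return out
```

The number of fronts is w + 2*(h - 1), so the available parallelism per front is bounded by roughly min(h, w/2) pixels, which illustrates why this algorithm is hard to spread across many execution units.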

international symposium on circuits and systems | 2012

A simulation-based study for DRAM power reduction strategies in GPGPUs

Hyojin Choi; Kyuyeon Hwang; Jaewoo Ahn; Wonyong Sung

General Purpose Graphics Processing Units (GPGPUs) operate many threads concurrently; however, they demand heavy DRAM access because of the small internal memory assigned to each thread. As a result, the power consumption in DRAM components becomes increasingly significant. We have examined a few techniques that can reduce DRAM power consumption in GPGPUs. A GPGPU simulator supporting L2 cache is used for this study. The effects of changing the memory channel organization, DRAM clock frequency, row buffer management policy (open or closed), and the L2 cache memory system are studied. Not only the total DRAM energy consumption but also that due to each DRAM operation, such as active-precharge, burst, and background, is estimated. The examined DRAM power reduction techniques bring negligible execution time changes for solving compute-bound problems, but they result in 12–27% savings of DRAM power consumption.

international symposium on circuits and systems | 2012

Performance analysis of multi-bank DRAM with increased clock frequency

Su-Jin Cho; Jaewoo Ahn; Hyojin Choi; Wonyong Sung

As the performance of computer systems improves, the peak bandwidth of the DRAM system needs to be increased. In this study, we analyze the performance of multi-bank DRAMs when increasing the clock frequency by employing three metrics: data bus busy time, bank busy time and inter-bank interference time. We use a cycle-accurate DRAM model simulator to quantitatively measure each metric. Increasing the DRAM clock frequency obviously contributes to lowering the data bus busy time. From the analysis result, we find that raising the number of banks is needed when increasing the DRAM clock frequency. However, the inter-bank interference time becomes the performance bottleneck as the number of banks increases. We suggest that future multi-bank DRAM systems should tackle this side effect to efficiently exploit the faster clock frequency.

signal processing systems | 2012

Parallel Computation of Adaptive Filtering Algorithms on Multi-Core Systems

Dong-hwan Lee; Jaewoo Ahn; Wonyong Sung

The performance of recent CPUs has been rapidly increasing with the help of parallel architectural supports, such as SIMD (Single Instruction Multiple Data) extensions and multi-core architecture. However, efficient use of such parallel supports for adaptive filtering is difficult due to feedback loops that induce the data dependency problem. In this paper, efficient parallel computation of adaptive filters is studied for multi-core architecture with SIMD arithmetic support. Control- and data-level parallel computation methods are considered, where the former finds parallelism in the evaluation of one output sample, while the latter processes multiple output samples at a time to increase the degree of parallelism. The control-level parallel approach frequently utilizes the pipelining technique to uncover the parallelism, whereas the data-level approach employs a parallel computation method for linear recurrence equations to resolve the dependency. Not only adaptive transversal LMS (Least Mean Square) but also gradient adaptive lattice (GAL) and QR-decomposition based least-square lattice (QRD-LSL) filters are implemented on a PC that employs both SIMD and multi-core architecture.
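
As a point of reference for the feedback dependency mentioned above, the following is a minimal sequential sketch of a transversal LMS filter in plain Python. It is a textbook baseline, not the parallel method of the paper; the function names, step size, and test signal are assumptions chosen for illustration. The weight update for sample n uses the error at sample n, which in turn uses the weights produced by sample n-1, and that loop-carried dependency is what makes processing several output samples at once difficult.

```python
import random


def lms_filter(x, d, num_taps, mu):
    """Transversal LMS: returns final weights and the error sequence."""
    weights = [0.0] * num_taps
    window = [0.0] * num_taps          # most recent input sample first
    errors = []
    for sample, desired in zip(x, d):
        window = [sample] + window[:-1]
        y = sum(w * v for w, v in zip(weights, window))   # filter output
        e = desired - y                                   # feedback error
        # The update depends on e, so the next output depends on this one.
        weights = [w + mu * e * v for w, v in zip(weights, window)]
        errors.append(e)
    return weights, errors


def identify_two_tap_system():
    """Hypothetical demo: recover a known 2-tap FIR system without noise."""
    rng = random.Random(0)
    unknown = [0.5, -0.3]
    x = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
    d = [unknown[0] * x[n] + (unknown[1] * x[n - 1] if n > 0 else 0.0)
         for n in range(len(x))]
    weights, _ = lms_filter(x, d, num_taps=2, mu=0.05)
    return weights
```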

international symposium on signals, circuits and systems | 2011

Accelerating tetrahedral interpolation with data-level and Thread-Level Parallel optimization

Jaewoo Ahn; Becksang Seong; Wonyong Sung

The tetrahedral interpolation method for color space conversion consumes the longest time in the entire color management process. This makes it difficult to implement a purely software-based high-end image processing system. In this study, SIMD (Single Instruction Multiple Data) and GPGPU (General Purpose Graphics Processing Unit) based optimizations for tetrahedral interpolation are implemented. To exploit DLP (Data-Level Parallelism) with SIMD extensions, the program is restructured and conditional branches are removed so that inter-pixel parallelism is used for tetrahedron determination, while inter-output-channel parallelism is employed for the table lookup and weighted sum. TLP (Thread-Level Parallelism) is exploited with GPGPU by allocating different input pixels to each thread. Memory access cycles are minimized by using constant memory for the color lookup table. We conclude that both DLP and TLP optimizations are essential for recent multi-core CPUs with wider SIMD registers, and that reducing the communication overhead between the host and the device is critical for TLP optimization with GPGPUs.
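
For readers unfamiliar with the method, the two stages named above (tetrahedron determination, then table lookup and weighted sum) can be sketched in scalar Python as follows. This is a generic textbook formulation under stated assumptions: a cubic lookup table with n grid points per axis, inputs normalized to [0, 1], and hypothetical function names. It does not reproduce the SIMD or GPGPU optimizations of the paper. Sorting the three fractional offsets selects one of the six tetrahedra without an explicit chain of comparisons.

```python
def tetrahedral_interpolate(lut, n, point):
    """Interpolate lut[i][j][k] (a tuple of output channels) at `point`.

    lut   : n x n x n nested lists of equal-length tuples
    point : three input coordinates, each clamped to [0, 1]
    """
    scaled = [min(max(p, 0.0), 1.0) * (n - 1) for p in point]
    base = [min(int(s), n - 2) for s in scaled]      # lower cube corner
    frac = [s - b for s, b in zip(scaled, base)]     # offsets in the cube

    # Tetrahedron determination: axes ordered by decreasing offset.
    order = sorted(range(3), key=lambda axis: frac[axis], reverse=True)
    f0, f1, f2 = (frac[axis] for axis in order)
    weights = [1.0 - f0, f0 - f1, f1 - f2, f2]

    # Walk from the lower corner to the opposite corner, one axis at a time.
    idx = list(base)
    vertices = [tuple(idx)]
    for axis in order:
        idx[axis] += 1
        vertices.append(tuple(idx))

    # Table lookup and weighted sum over the four vertices, per channel.
    channels = len(lut[0][0][0])
    return tuple(
        sum(w * lut[i][j][k][ch] for w, (i, j, k) in zip(weights, vertices))
        for ch in range(channels)
    )


def make_identity_lut(n):
    """Hypothetical helper: a table that maps every color to itself."""
    step = 1.0 / (n - 1)
    return [[[(i * step, j * step, k * step) for k in range(n)]
             for j in range(n)] for i in range(n)]
```

Because tetrahedral interpolation is exact for linear functions, an identity table should return the input color, which gives a simple sanity check for any implementation.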

international symposium on circuits and systems | 2001

Feedback-directed memory disambiguation for embedded multimedia VLIW computing

Jaewoo Ahn; Soo-Mook Moon; Wonyong Sung

Recently developed VLIW processors have opened a new era of VLIW multimedia computing. Many of these processors are equipped with VLIW scheduling compilers that automate their code generation. Multimedia VLIW compilers should exploit the characteristics specific to multimedia and digital signal processing (DSP) application programs, the most important of which is the enormous memory parallelism inherent in them. In this paper we propose an iterative rescheduling scheme based on feedback-directed memory disambiguation which is performed by the interaction of the scheduling compiler and the application programmer. Our experimental results indicate that the proposed technique is particularly effective in exploiting memory parallelism from multimedia and DSP programs, enhancing the overall performance by as much as 81.3% for a JPEG decoder.
