Donglok Kim | Researchain

Publication

Featured research published by Donglok Kim.

International Conference on Computer Design | 2000

A register file with transposed access mode

Yoochang Jung; Stefan G. Berg; Donglok Kim; Yongmin Kim

We introduce a new register file architecture that provides both row-wise and column-wise accesses, thus allowing partitioned instructions to be used in column-wise processing without transposition overhead. This feature can accelerate 2D separable image and video processing algorithms, such as 2D convolution and 2D discrete cosine transform (DCT), by eliminating the transposition steps.
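To illustrate where the transposition overhead comes from, here is a minimal software sketch of a 2D separable convolution built from a row-wise pass only. It is not the register file design from the paper; the function names and the border-replication policy are assumptions made for the example. The two `transpose` calls are the steps that column-wise register access would make unnecessary.

```python
def convolve_rows(image, kernel):
    """Convolve each row with a 1D odd-length kernel, replicating border pixels."""
    half = len(kernel) // 2
    width = len(image[0])
    out = []
    for row in image:
        new_row = []
        for x in range(width):
            acc = 0
            for k, weight in enumerate(kernel):
                # Convolution indexes the source in reverse kernel order;
                # clamp so that out-of-range taps reuse the border pixel.
                src = min(max(x + half - k, 0), width - 1)
                acc += weight * row[src]
            new_row.append(acc)
        out.append(new_row)
    return out


def transpose(image):
    """Swap rows and columns."""
    return [list(col) for col in zip(*image)]


def separable_convolve(image, row_kernel, col_kernel):
    """Row pass, transpose, row pass on the former columns, transpose back."""
    rows_done = convolve_rows(image, row_kernel)
    cols_done = convolve_rows(transpose(rows_done), col_kernel)
    return transpose(cols_done)
```

For example, `separable_convolve(img, [1, 1, 1], [1, 1, 1])` applies a 3 × 3 box sum using two 1D passes instead of nine taps per pixel.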

International Symposium on Microarchitecture | 2001

Data cache and direct memory access in programming mediaprocessors

Donglok Kim; Ravi Managuli; Yongmin Kim

Mediaprocessors provide high performance by using both instruction- and data-level parallelism. Because of the increased computing power, transferring data between off- and on-chip memories without slowing down the core processor's performance is challenging. Two methods, data cache and direct memory access, address this problem in different ways.

Journal of Electronic Imaging | 2000

Mapping of two-dimensional convolution on very long instruction word media processors for real-time performance

Ravi Managuli; George York; Donglok Kim; Yongmin Kim

Programmable media processors have been emerging to meet the continuously increasing computational demand in complex digital media applications, such as HDTV and MPEG-4, at an affordable cost. These media processors provide the flexibility to implement various image computing algorithms along with high performance, unlike the hardwired approach that has provided high performance for a particular algorithm, but lacks flexibility. However, to achieve high performance on these media processors, a careful and sometimes innovative design of algorithms is essential. In addition, programming techniques, e.g., software pipelining and loop unrolling, are needed to speed up the computations while the data flow can be optimized using a programmable DMA controller. In this paper, we describe an algorithm for two-dimensional convolution, which can be implemented efficiently on many media processors. Implemented on a new media processor called the MAP1000, it takes 7.9 ms to convolve a 512 × 512 image with a 7 × 7 kernel, which is much faster than the previously reported software-based convolution and is comparable with the hardwired implementations. High performance in two-dimensional convolution and other algorithms on the MAP1000 clearly demonstrates the feasibility of software-based solutions in demanding imaging and video applications.

Parallel Computing | 2003

Efficient 2D FFT implementation on mediaprocessors

Coskun Mermer; Donglok Kim; Yongmin Kim

We have developed an efficient implementation to compute the 2D fast Fourier transform (FFT) on a new very long instruction word programmable mediaprocessor. Using instruction-level parallelism and a multimedia instruction set, our radix-4 Cooley-Tukey algorithm optimally maps the FFT computation to the processing resources of the Hitachi/Equator's MAP mediaprocessor. We have also achieved more efficient data I/O and lower data transfer time compared to traditional implementations by processing several columns in parallel during the column-wise stage of row-column decomposition. We used a programmable direct memory access engine and a double-buffering scheme in the data cache to perform the computation and the data transfer in parallel. Our implementation resulted in 22.4 ms total execution time for a 512 × 512-point 2D complex FFT, which is faster than previous single-chip programmable or dedicated solutions. The implementations on two other mediaprocessors, the TriMedia TM1100 and the BOPS ManArray, illustrate the importance of the instruction set architecture for achieving high performance and the trend of data I/O becoming the limitation on the 2D FFT performance in newer mediaprocessors.
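The row-column decomposition mentioned in this abstract can be sketched in a few lines. The example below is only an illustration: it uses a plain recursive radix-2 Cooley-Tukey FFT rather than the paper's radix-4 mapping, it does no DMA or double buffering, and the function names are assumptions. It shows the structure only: 1D FFTs over every row, then 1D FFTs over every column of the result.

```python
import cmath


def fft_1d(values):
    """Recursive radix-2 Cooley-Tukey FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft_1d(values[0::2])
    odd = fft_1d(values[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor applied to the odd-indexed half.
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def fft_2d(matrix):
    """Row-column decomposition: FFT each row, then FFT each column."""
    row_pass = [fft_1d(row) for row in matrix]
    # zip(*row_pass) walks the columns; each column is transformed as a 1D signal.
    col_pass = [fft_1d(list(col)) for col in zip(*row_pass)]
    # col_pass holds transformed columns as rows, so transpose back.
    return [list(row) for row in zip(*col_pass)]
```

The column pass is where strided memory access arises in a row-major image, which is the stage the abstract says was handled by processing several columns in parallel.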

International Journal of Imaging Systems and Technology | 1998

High-performance image computing with modern microprocessors

Chris Basoglu; Donglok Kim; Robert J. Gove; Yongmin Kim

Various levels of parallelism have recently been introduced in advanced microprocessors to meet the demanding computing need in digital video processing and other multimedia applications. Because many imaging algorithms are easily parallelizable, these architectural features and their wide availability at low cost have become a powerful tool in tackling both existing and new imaging applications. At the lowest level, the subword parallelism is used in the new instructions aimed at processing multiple multimedia data simultaneously. Instruction‐level parallelism including subword parallelism is realized in either very long instruction word or superscalar architectures, while on‐chip and/or off‐chip multiprocessing capability is available for easier multiprocessor system designs. One of the difficulties in maximizing the computing throughput via parallelism has been the level of programming in that to obtain the optimal performance, assembly‐level programming has typically been required. We review the architectural features in several modern microprocessors such as TMS320C60, TM‐1000, PowerPC 604, Pentium II, R10000, Alpha 21264, PA‐RISC 8200, UltraSPARC‐II, and TMS320C80. Various obstacles to obtaining the best performance from these microprocessors with high‐level and assembly languages are discussed, and several approaches to overcome these difficulties in diverse imaging applications are presented.
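As a rough illustration of subword parallelism, the sketch below emulates a partitioned add in software: four unsigned 8-bit values packed into one 32-bit word are added lane by lane with wraparound, without carries leaking between lanes. This is a generic bit-manipulation technique, not an instruction from any of the processors listed above, and the function name is hypothetical.

```python
def packed_add_u8(a, b):
    """Add four unsigned 8-bit lanes packed in 32-bit words, wrapping per lane."""
    low7 = 0x7F7F7F7F   # low seven bits of every lane
    high = 0x80808080   # top bit of every lane
    # Adding only the low seven bits cannot carry past a lane's top bit.
    partial = (a & low7) + (b & low7)
    # Fold the top bits back in with XOR, which is an add without carry-out.
    return (partial ^ ((a ^ b) & high)) & 0xFFFFFFFF
```

A hardware partitioned instruction does the same work in one operation, which is why such instructions suit 8-bit and 16-bit pixel data.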

Conference on Multimedia Computing and Networking | 1998

Image computing library for a next-generation VLIW multimedia processor

Inga Stotland; Donglok Kim; Yongmin Kim

We have developed the University of Washington Image Computing Library (UWICL), a high-performance image processing library for a next-generation mediaprocessor currently under development, named the Media Accelerated Processor (MAP1000). The primary goal of this library is to provide algorithm developers and application programmers with a flexible and efficient library of core image computing functions. The UWICL is organized as a set of three hierarchical layers. Each function in this multilayered framework consists of an application module, a function module, and a tight-loop module. The MAP has an intelligent DMA controller called the Data Streamer that allows efficient data flow management. In cache-based architectures, streaming image data from the external memory generates many costly data-cache misses, in many cases leading to a severe performance bottleneck. The MAP's Data Streamer is designed to address this problem. To reduce the number of data-cache misses further, a ping-ponging data flow scheme is employed in UWICL functions, i.e., while execution units are processing a block of data currently in the data cache, the Data Streamer brings the next data block to the data cache before it is actually needed. We compare the performance of key imaging functions on the MAP and the Texas Instruments TMS320C80, one of the most powerful mediaprocessors currently available. Typically, a MAP function is 1.5 to 6.6 times faster than the corresponding TMS320C80 implementation. Also, we demonstrate the advantages of the UWICL's multilayered library organization over the single-layered approach with an example in implementing Canny's edge detector. The multilayered implementation of this algorithm outperforms the single-layered version by 26%.
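The ping-ponging data flow described in this abstract can be sketched as a loop over two alternating buffers. The example is a sequential model of the ordering only: the assignment standing in for the Data Streamer transfer runs inline here, whereas on the hardware it would overlap with the computation. The function name and the `process` callback are assumptions, not part of the UWICL.

```python
def process_with_ping_pong(blocks, process):
    """Process blocks in order, alternating between two buffers."""
    results = []
    if not blocks:
        return results
    buffers = [None, None]
    buffers[0] = blocks[0]  # initial fetch before the loop starts
    for i in range(len(blocks)):
        current = i % 2
        nxt = 1 - current
        if i + 1 < len(blocks):
            # Stand-in for the DMA transfer: the next block is placed in the
            # idle buffer before the current block is processed.
            buffers[nxt] = blocks[i + 1]
        results.append(process(buffers[current]))
    return results
```

For instance, `process_with_ping_pong([[1, 2], [3, 4], [5]], sum)` processes each block while the following one is already staged in the other buffer.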

Conference on Multimedia Computing and Networking | 1998

Critical review of programmable media processor architectures

Stefan G. Berg; Weiyun Sun; Donglok Kim; Yongmin Kim

In the past several years, there has been a surge of new programmable mediaprocessors introduced to provide an alternative solution to ASICs and dedicated hardware circuitries in the multimedia PC and embedded consumer electronics markets. These processors attempt to combine the programmability of multimedia-enhanced general purpose processors with the performance and low cost of dedicated hardware. We have reviewed five current multimedia architectures and evaluated their strengths and weaknesses.

International Journal of Imaging Systems and Technology | 1998

Generalized image warping using enhanced lookup tables

Peter Mattson; Donglok Kim; Yongmin Kim

Image warping includes a wide variety of algorithms used in the spatial transformation of images. Warping typically involves a geometric transformation governed by a set of polynomial equations followed by some form of interpolation. When the geometric transformation does not change from one image to the next, or a more generalized warp is desired, a spatial lookup table (LUT) composed of precalculated inverse (or forward) mapping coordinates has been used instead of performing the same geometric transformation on every image. This article presents an algorithm that takes the LUT approach one step further. It uses an enhanced LUT (ELUT) that incorporates precomputed data transfer and interpolation information. Once computed, it speeds up processing by making data transfers more efficient and eliminating the need for the pixel address and interpolation coefficient calculations. This article also describes the implementation of ELUT‐based image warping on the high‐performance TMS320C80 Multimedia Video Processor (MVP). To produce an interesting and difficult generalized warp on a 512 × 512 8‐bit gray‐scale and 32‐bit color image, it takes only 15.1 and 32.4 ms, respectively.
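A minimal sketch of the lookup-table idea follows, assuming bilinear interpolation and an inverse mapping supplied as a function. The table stores, for every output pixel, the four source pixel coordinates and their interpolation weights, so applying the warp to each new image needs no address or coefficient calculation. The data-transfer information that the ELUT also precomputes is not modeled, and all names here are hypothetical.

```python
def build_lut(width, height, inverse_map):
    """Precompute source corners and bilinear weights for every output pixel."""
    lut = []
    for y in range(height):
        for x in range(width):
            sx, sy = inverse_map(x, y)
            # Clamp the source position to the image bounds.
            sx = min(max(sx, 0.0), width - 1.0)
            sy = min(max(sy, 0.0), height - 1.0)
            x0, y0 = int(sx), int(sy)
            x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
            fx, fy = sx - x0, sy - y0
            lut.append((
                (x0, y0, (1 - fx) * (1 - fy)),
                (x1, y0, fx * (1 - fy)),
                (x0, y1, (1 - fx) * fy),
                (x1, y1, fx * fy),
            ))
    return lut


def apply_lut(image, lut, width):
    """Warp an image using only table lookups and weighted sums."""
    flat = [sum(w * image[py][px] for px, py, w in entry) for entry in lut]
    # Reshape the flat output back into rows of the given width.
    return [flat[i:i + width] for i in range(0, len(flat), width)]
```

Because the table depends only on the geometry, it can be built once with `build_lut` and reused with `apply_lut` for every image in a sequence.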

Proceedings of SPIE | 1995

Networking requirements and the role of multimedia systems in telemedicine

Donglok Kim; James E. Cabral; Yongmin Kim

Programmable multimedia imaging workstations over a high bandwidth network (ATM) were used to explore the role of multimedia workstations in the telemedicine application. The multimedia workstation is based on the MediaStation 5000 which uses a TMS320C80 Multimedia Video Processor to handle the multimedia bitstreams and image display/processing functions. Although the telemedicine workstation exhibited its high performance and programmability, we found from the experiment that a tighter integration of the multimedia capability with the networking component should be one of the most desired improvements for the telemedicine system to become applicable in routine clinical environments.

IEEE Transactions on Circuits and Systems for Video Technology | 2002

Boundary macroblock padding in MPEG-4 video decoding using a graphics coprocessor

Rohit Garg; Chris Y. Chung; Donglok Kim; Yongmin Kim

MPEG-4 is the latest multimedia coding standard that supports object-based coding and manipulation of natural video and synthetic graphics objects. Due to its various features and high coding efficiency, MPEG-4 is becoming popular in video streaming applications. Many graphics coprocessors provide the acceleration of inverse discrete cosine transform (IDCT) and motion compensation for real-time video decoding. Therefore, it is desired to use the graphics coprocessors to accelerate MPEG-4 video decoding as well. Since MPEG-4 video decoding for rectangular video objects is similar to other video coding standards, e.g., MPEG-2, the IDCT and motion compensation can still be executed on the graphics coprocessors. However, we have found that boundary macroblock padding, which is an essential processing step in decoding arbitrarily shaped video objects, could not be efficiently accelerated on the graphics coprocessors due to its complexity. Although we can implement the boundary macroblock padding on the host processor, the frame data processed on the graphics coprocessor need to be transferred to the host processor for padding. In addition, the padded data on the host processor need to be sent back to the graphics coprocessor to be used as a reference for subsequent frames. To avoid this overhead, we present two approaches of boundary macroblock padding. In the first approach, the boundary macroblock padding is partitioned into two tasks, one of which the host processor can perform without the overhead of data transfers. In the second approach, we propose two new instructions and an algorithm that can be easily adopted in the next-generation graphics coprocessors or mediaprocessors, which gives a performance improvement of up to a factor of nine compared to that with the Pentium III.
