Moo-Kyoung Chung | Researchain

Publication

Featured researches published by Moo-Kyoung Chung.

field-programmable technology | 2012

ULP-SRP: Ultra Low-Power Samsung Reconfigurable Processor for Biomedical Applications

Changmoo Kim; Moo-Kyoung Chung; Yeongon Cho; Mario Konijnenburg; Soojung Ryu; Jeongwook Kim

The latest biomedical applications require low energy consumption, high performance and wide energy-performance scalability to adapt to various working environments. This paper presents ULP-SRP, an energy-efficient reconfigurable processor for biomedical applications. ULP-SRP uses a Coarse Grained Reconfigurable Array (CGRA) for high-performance data processing with low energy consumption. For scalability, we propose three performance modes and a Unified Memory Architecture (UMA). Energy optimization is accomplished by run-time mode switching along with automatic power gating. Experimental results show that ULP-SRP achieved a 46.1% energy reduction compared to previous works.

IEEE Transactions on Consumer Electronics | 2009

Lossless frame memory recompression for video codec preserving random accessibility of coding unit

Sang-Heon Lee; Moo-Kyoung Chung; Sung-Mo Park; Chong-Min Kyung

In recent video applications such as MPEG or H.264/AVC, the bandwidth requirement for frame memory has become one of the most critical problems. Compressing pixel data before storing it in off-chip frame memory is required to alleviate this problem. In this paper, we propose a lossless frame memory recompression scheme including 1) a lossless pixel compression algorithm, 2) an efficient address table organization method for random accessibility, and 3) a frame memory placement scheme for compressed data to reduce the effective access time of SDRAM by suppressing row switching. Experimental results show that the proposed method reduces the frame data to 48% of its uncompressed size with an H.264/AVC high profile encoder system, where 6.1 kB of SRAM is required for the address table of full HD video.
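
The role of the address table in the abstract above can be illustrated with a minimal sketch. It is not the authors' scheme: `zlib` stands in for the paper's pixel compression algorithm, and the names `pack_blocks` and `read_block` are hypothetical. The sketch only shows why a table of start offsets keeps each coding unit randomly accessible once compression makes block sizes variable.

```python
import zlib


def pack_blocks(blocks):
    """Compress each block independently and record where each one starts.

    zlib is a stand-in compressor (an assumption), not the paper's algorithm.
    """
    table = [0]                 # table[i] = start offset of block i
    payload = bytearray()
    for block in blocks:
        payload += zlib.compress(block)
        table.append(len(payload))   # end of block i = start of block i + 1
    return bytes(payload), table


def read_block(payload, table, index):
    """Fetch and decompress one block without touching the others."""
    start, end = table[index], table[index + 1]
    return zlib.decompress(payload[start:end])


# Four 64-byte blocks standing in for 8x8 pixel blocks.
blocks = [bytes([i] * 64) for i in range(4)]
payload, table = pack_blocks(blocks)
assert read_block(payload, table, 2) == blocks[2]
```

Without the table, finding block 2 would require decoding blocks 0 and 1 first. The cost is the table itself, which is the kind of SRAM overhead the abstract quantifies.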

IEEE Transactions on Computers | 2006

Enhancing performance of HW/SW cosimulation and coemulation by reducing communication overhead

Moo-Kyoung Chung; Chong-Min Kyung

For system-level simulation of a complex system-on-chip design, multiple hardware simulators and emulators can be combined to work together. The simulation performance in this case is often limited by the communication overhead between simulators and emulators. To reduce the amount of communication in this heterogeneous simulation environment, we propose novel methods to find a time interval during which there are no transactions among simulators based on a dynamic prediction of transaction occurrence time for both software and hardware models. We also propose a simulator scheduling algorithm which allows the simulator to work alone without interaction with others when there is no transaction. By so doing, we reduced the amount of pure communication by a factor of 15 to 67 and, as a result, achieved a speed-up factor of 4 to 40 compared to existing lock-step simulation, as shown by experimental results with various application examples.

international symposium on vlsi design, automation and test | 2005

System-level HW/SW co-simulation framework for multiprocessor and multithread SoC

Moo-Kyoung Chung; Sangjun Yang; Sang-Hoon Lee; Chong-Min Kyung

C/C++-based languages such as SystemC or SpecC can be used for both hardware and software description by raising the level of abstraction for hardware. This paper proposes techniques for fast and accurate high-level co-simulation for multithread and multiprocessor SoC design using SystemC for hardware and legacy C with RTOS (real-time operating system) API for software. Automatically modified legacy C synchronizes with SystemC clock events, and communicates with other modules through IO (input/output) variables and transaction level bus models. A generic RTOS scheduler and POSIX APIs are also provided for the real-time application. A co-simulation speed about three times faster than ISS-based co-simulation was achieved, along with various profiling data with 95% accuracy.

workshop on parallel and distributed simulation | 2006

Improving Lookahead in Parallel Multiprocessor Simulation Using Dynamic Execution Path Prediction

Moo-Kyoung Chung; Chong-Min Kyung

Simulation performance is dominated by lookahead in null message-based conservative time management of parallel discrete event simulation (PDES). This paper proposes a scheme for software execution path prediction to extend lookahead in parallel multiprocessor simulation. Templates for predicting the program execution path are generated by software analysis; a processor model then obtains lookaheads by evaluating the templates at simulation time. We reduced the number of null messages by a factor of 10 to 50 in parallel simulation with eight clustered workstations and, as a result, achieved a speedup factor of 4 to 7 compared to a conventional method having constant lookahead.

international symposium on vlsi design, automation and test | 2005

System-level performance analysis of embedded system using behavioral C/C++ model

Moo-Kyoung Chung; Sangkwon Na; Chong-Min Kyung

Design iteration time in the SoC design flow is reduced through performance exploration at a higher level of abstraction. This paper proposes an accurate and fast performance analysis method for the early stage of the design process using a behavioral model written in C/C++. We made a cycle-accurate but fast and flexible compiled instruction set simulator (ISS) and IP models that represent hardware functionality and performance. A system performance analyzer configured by the target communication architecture analyzes the performance utilizing event traces obtained by running the ISS and IP models. This solution is automated and implemented in the tool HIPA. We obtain diverse performance profiling results and achieve 95% accuracy using an abstracted C model. We also achieve about 20 times speed-up over corresponding co-simulation tools.

international symposium on circuits and systems | 2014

SimParallel: A high performance parallel SystemC simulator using hierarchical multi-threading

Moo-Kyoung Chung; Jun-Kyoung Kim; Soojung Ryu

As system complexity increases, simulation performance becomes one of the most important issues in virtual prototyping. Parallel simulation is an attractive technique for high-speed simulation utilizing state-of-the-art multi-core processors on a host workstation, but the efficiency of parallel simulation is low because of synchronization and communication overhead and unbalanced workloads among cores in the host. This paper proposes a novel technique, hierarchical multi-threading, for the efficient parallel simulation of SystemC models where the host cores are able to be maximally utilized with the same number of thread groups. We also present an efficient synchronization and dynamic load balancing scheme for the proposed parallel simulation. Experimental results show that the proposed method achieves a speed-up of 2.9 to 3.3 on a quad-core host workstation.

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2013

Mapping and Scheduling of Tasks and Communications on Many-Core SoC Under Local Memory Constraint

Jinho Lee; Moo-Kyoung Chung; Yeongon Cho; Soojung Ryu; Jung Ho Ahn; Kiyoung Choi

There has been extensive research on mapping and scheduling tasks on a many-core SoC. However, none considers the optimization of communication types, which can significantly affect performance, energy consumption, and local memory usage of the SoC. This paper presents an approach to automatic mapping and scheduling of tasks and communications on a many-core SoC. The key idea is to decide the type of each communication between message passing and shared memory when we do the mapping and scheduling. By assigning a proper type to each communication, we can optimize the energy consumption, performance, or energy-delay product. To solve the optimization problem, the approach adopts a probabilistic algorithm coupled with some heuristics. To enhance throughput of the system, it performs software pipelined scheduling of the tasks using a modified iterative modulo scheduling technique. Experiments show that our algorithm achieves on average 50.1% lower energy consumption, 21.0% higher throughput, and 64.9% lower energy-delay product, compared to shared memory only communication.

international conference on multimedia and expo | 2010

Low latency variable length coding scheme for frame memory recompression

Sang-Heon Lee; Nak-Woong Eum; Moo-Kyoung Chung; Chong-Min Kyung

In frame memory recompression, decompression latency consists of two components, i.e., memory access cycles for compressed data fetch, and decompression time. In contrast to most earlier works, which focused mainly on the compression ratio and therefore reduced only memory access cycles, this paper proposes a low-latency variable-length coding method called the non-zero bit selection scheme (NBS). The proposed NBS enables highly parallel decompression, achieving a three-cycle decompression for an 8×8 block, compared to previous methods requiring as many as twelve clock cycles for the case of exponential Golomb code. Notably, the proposed NBS scheme achieves this without deterioration of the compression ratio. Experimental results on a number of full HD videos show that the compression ratio of the proposed method is, on average, no worse than that obtained with the exponential Golomb code, while the decompression time is reduced to 25% of that of the exponential Golomb code.
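
For readers unfamiliar with the baseline mentioned above, the following is a minimal sketch of the standard order-0 exponential Golomb code. It does not implement the NBS scheme, whose details the abstract does not give, and the function names are illustrative. The decoder shows why this code is hard to parallelize: the start of each codeword is known only after the previous one has been parsed.

```python
def exp_golomb_encode(n):
    """Order-0 exponential Golomb codeword for a non-negative integer, as a bit string."""
    if n < 0:
        raise ValueError("n must be non-negative")
    binary = bin(n + 1)[2:]                    # binary form of n + 1
    return "0" * (len(binary) - 1) + binary    # zero prefix, then the binary digits


def exp_golomb_decode(bits):
    """Decode a concatenation of well-formed order-0 codewords into integers."""
    values = []
    pos = 0
    while pos < len(bits):
        zeros = 0
        while bits[pos + zeros] == "0":        # count the leading zeros
            zeros += 1
        end = pos + 2 * zeros + 1              # codeword length is 2 * zeros + 1
        values.append(int(bits[pos + zeros:end], 2) - 1)
        pos = end                              # next codeword start is only known now
    return values


stream = "".join(exp_golomb_encode(v) for v in [0, 1, 3])
assert stream == "1" + "010" + "00100"
assert exp_golomb_decode(stream) == [0, 1, 3]
```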

field-programmable technology | 2012

Implementation of a volume rendering on coarse-grained reconfigurable multiprocessor

Seunghun Jin; Sangheon Lee; Moo-Kyoung Chung; Yeongon Cho; Soojung Ryu

In this paper, we present a reconfigurable multiprocessor architecture for volume rendering. The multiprocessor consists of sixteen reconfigurable processors to exploit the data parallelism of volume rendering. Each processor has a VLIW core and a reconfigurable coarse-grained array, specialized for the control-intensive and data-intensive parts of the program, respectively. The coarse-grained array can be configured dynamically, so that it can efficiently process different kernels of the volume rendering. The multiprocessor is implemented in Verilog HDL and realized on a commercial FPGA-based prototyping system. The experimental results show that the presented multiprocessor has performance comparable to high-end desktop GPUs.