Publication


Featured research published by Young-Jun Kim.


IEEE Journal of Solid-State Circuits | 2006

An SoC with 1.3 Gtexels/s 3-D graphics full pipeline for consumer applications

Dong-Hyun Kim; Kyusik Chung; Chang-Hyo Yu; Chun-Ho Kim; Inho Lee; Jun-Sang Bae; Young-Jun Kim; Jae-Hyeon Park; Sungbeen Kim; Yong-Ha Park; Nak-Hee Seong; Jin-Aeon Lee; Jaehong Park; Sung Yong Oh; Seh-Woong Jeong; Lee-Sup Kim

A high-speed three-dimensional (3-D) graphics SoC for consumer applications is presented. A 166-MHz 3-D graphics full pipeline engine with performance of 33 Mvertices/s and 1.3 Gtexels/s, a 333-MHz ARM11 RISC processor, and video composition IPs are integrated together on a single chip. The geometry part of the 3-D graphics IP provides full programmability at the vertex and triangle level, and two-level multi-texturing with trilinear MIPMAP filtering is realized in the rasterization part. Per-pixel effects such as fog effects, alpha blending, and stencil test are also implemented in the proposed 3-D graphics IP. The rasterization architecture is designed to reduce external memory accesses in order to achieve the peak performance. The chip is fabricated using 0.13 μm CMOS technology and its area is 7.1 × 7.0 mm².
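The throughput figures quoted in this abstract can be related to the engine clock with a little arithmetic. The sketch below only divides the stated rates by the stated 166 MHz clock; the function and variable names are illustrative, and the abstract does not say how the pipeline is organized to reach these rates.

```python
# Figures quoted in the abstract.
CLOCK_HZ = 166e6        # 3-D graphics engine clock
TEXELS_PER_S = 1.3e9    # peak texel rate
VERTICES_PER_S = 33e6   # peak vertex rate


def per_clock(rate_per_s: float, clock_hz: float) -> float:
    """Return how many items are processed per clock cycle."""
    return rate_per_s / clock_hz


texels_per_clock = per_clock(TEXELS_PER_S, CLOCK_HZ)
cycles_per_vertex = CLOCK_HZ / VERTICES_PER_S

print(f"{texels_per_clock:.2f} texels per clock")
print(f"{cycles_per_vertex:.2f} clocks per vertex")
```

Under these numbers the engine delivers roughly eight texels per clock cycle and one vertex about every five cycles.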


IEEE Transactions on Circuits and Systems | 2013

A Reconfigurable SIMT Processor for Mobile Ray Tracing With Contention Reduction in Shared Memory

Hong-Yun Kim; Young-Jun Kim; Jiehwan Oh; Lee-Sup Kim

In this paper, we present a reconfigurable SIMT multi-core processor with a shared memory for mobile ray tracing. The proposed processor addresses two issues of the SIMT architecture: branch divergence of concurrently executed threads and contention in a shared memory. Performance degradation due to branch divergence is reduced by dividing a wide SIMT datapath into several narrow SIMT cores that execute independent threads asynchronously. The contention in the shared memory caused by the multiple SIMT cores is alleviated by introducing a new time-division multiplexing (TDM) scheme using multi-phase clocks. Synchronizing the SIMT cores with multi-phase clocks makes them send their requests to the shared memory sequentially rather than concurrently, which hides arbitration delays. The processor achieves the same datapath utilization as 4-wide SIMT, which has been widely used by CPU-based ray tracers, while its area remains 68% of the 4-wide SIMT. As a result, the performance normalized to area is improved by 26% compared to previous work with negligible overheads (2.6% for area and 1% for power consumption). The chip was fabricated in 90 nm CMOS technology, and it contains 2.3 M logic gates and 19.3 KB SRAM. It consumes 221 mW at 100 MHz with Vdd = 1.2 V.
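The idea of time-division multiplexing the shared memory can be illustrated with a small software model. This is a minimal sketch, not the chip's circuit: the core count of four is a hypothetical value (the abstract does not state it), and the hardware multi-phase clocks are represented simply as integer slot numbers.

```python
from collections import Counter

NUM_CORES = 4  # hypothetical; the abstract does not give the core count


def slot_collisions(issue_slots):
    """Count requests that share a time slot with an earlier request."""
    per_slot = Counter(issue_slots)
    return sum(count - 1 for count in per_slot.values())


def concurrent_slots(num_cores):
    """All cores issue in the same slot, so they must be arbitrated."""
    return [0] * num_cores


def tdm_slots(num_cores):
    """Each core issues in its own phase slot, one after another."""
    return list(range(num_cores))


print("concurrent collisions:", slot_collisions(concurrent_slots(NUM_CORES)))
print("TDM collisions:", slot_collisions(tdm_slots(NUM_CORES)))
```

In this model, staggering the cores by one slot each removes every collision, which is the effect the abstract attributes to the multi-phase clocking.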


International Solid-State Circuits Conference | 2005

An SoC with 1.3 Gtexels/s 3D graphics full pipeline engine for consumer applications

Dong-Hyun Kim; Kyusik Chung; Chang-Hyo Yu; Chun-Ho Kim; Inho Lee; Jaewan Bae; Young-Jun Kim; Young-Jin Chung; Sungbeen Kim; Yong-Ha Park; Nak-Hee Seong; Jin-Aeon Lee; Jaehong Park; Sung Yong Oh; Seh-Woong Jeong; Lee-Sup Kim

A 3D graphics SoC whose performance is 33 Mvertices/s and 1.3 Gtexels/s is designed for consumer applications. The SoC integrates an ARM11 RISC processor, a dedicated 3D graphics full pipeline engine, and video composition IPs. The SoC contains 17.9 M transistors in 50 mm² area and is fabricated in a 0.13 μm 7M CMOS process.


IEEE Transactions on Very Large Scale Integration Systems | 2012

Homogeneous Stream Processors With Embedded Special Function Units for High-Utilization Programmable Shaders

Young-Jun Kim; Hyo-Eun Kim; Seok-Hoon Kim; Jun-Seok Park; Seungwook Paek; Lee-Sup Kim

We embed special function units (SFUs) in homogeneous stream processors (SPs) within a graphics processing unit (GPU) to improve its performance in running modern programmable shaders, which make poor use of a single-instruction multiple-data (SIMD) architecture. We also compact instructions to reduce the size of the instruction memory, and reduce area requirements by using a partial SFU in the SPs and a lookup table that is shared between multiple SFUs. The result is an increase of 88% in utilization and a reduction of 27% in the normalized area-delay product, compared to a baseline SIMD architecture. We verified our architecture on a field-programmable gate array evaluation platform with an ARM9 host processor and a full 3-D graphics pipeline.


International Solid-State Circuits Conference | 2011

A 275mW heterogeneous multimedia processor for IC-stacking on Si-interposer

Hyo-Eun Kim; Jae-Sung Yoon; Kyu-Dong Hwang; Young-Jun Kim; Jun-Seok Park; Lee-Sup Kim

Most data-intensive operations for multimedia applications such as image processing, vision, and 3D graphics require high external memory bandwidth. In augmented-reality (AR) processors [1], both 3D graphics and vision operations are required, so memory bandwidth becomes even more critical. In [1], however, memory bandwidth is not considered, floating-point processing is not supported, and there is no cache memory for texturing, which is a performance bottleneck of common graphics pipelines. In this work, a heterogeneous multimedia processor is presented to process various mobile multimedia applications in a single chip on Si-interposer for high memory bandwidth. The implemented processor has 4 key features: (1) a transceiver pool (TRx) that reconfigures the strength of output drivers according to the channel loss for IC-stacking on Si-interposer, (2) a mode-configurable vector processing unit (MCVPU) for frame-level parallelism, (3) an energy-efficient unified filtering unit (UFU) with an adaptive block selection (ABS) algorithm for memory-access-efficient texturing, and (4) a unified shader (US) with floating-point scalar processing elements (SPE) and partial special function units (PSFU) to enhance graphics processing performance and quality. With these techniques, we achieve 1.7× frame rate and 8× memory bandwidth improvement in full AR operation.


IEEE Journal of Solid-State Circuits | 2010

A 116 fps/74 mW Heterogeneous 3D-Media Processor for 3-D Display Applications

Seok-Hoon Kim; Hong-Yun Kim; Young-Jun Kim; Kyusik Chung; Dong-Hyun Kim; Lee-Sup Kim

In this paper, a heterogeneous 3D-media processor is presented, which supports all 3-D display applications by combining a 3-D display IP with a 3-D graphics IP and a stereo video decoder. For mobile environments, an adaptive power management scheme is proposed, which reduces power consumption by up to 186 mW by turning off idle functional blocks based on the target application, the target performance, and the run-time ratio between different IPs. As a result, the minimum power consumption of the processor is only 15 mW, while the overall power consumption is 201 mW. In addition to the reduction in power consumption, this work improves performance. The proposed fast modulo operators and the adopted division-free algorithm reduce the critical latencies of 3-D display image processing. The proposed fast datapath with a parallel architecture increases the synthesis rate up to 116 fps, which is 17 times faster than a previous work. In addition, a reordered operation sequence fixes the memory bandwidth regardless of the number of images to be produced. In the 3-D graphics IP and the decoding IP, redundant datapaths are merged using an IEEE 754 compliant floating-point vector unit to save both chip area and power consumption, which also reduces the critical latency by 30%.
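The power and frame-rate figures in this abstract are mutually consistent, which a short calculation makes explicit. The sketch uses only numbers quoted in the abstract; the frame rate of the previous work is inferred from the "17 times faster" statement and is not a figure the abstract reports.

```python
# Figures quoted in the abstract.
OVERALL_MW = 201       # overall power consumption
MAX_SAVING_MW = 186    # maximum saving from adaptive power management
SYNTHESIS_FPS = 116    # peak synthesis rate
SPEEDUP = 17           # stated speedup over a previous work

# Minimum power is what remains after the maximum saving.
minimum_mw = OVERALL_MW - MAX_SAVING_MW

# Fraction of the overall power that can be switched off.
saving_fraction = MAX_SAVING_MW / OVERALL_MW

# Inferred rate of the previous work (an estimate, not a reported value).
previous_fps = SYNTHESIS_FPS / SPEEDUP

print(f"minimum power: {minimum_mw} mW")
print(f"maximum saving: {saving_fraction:.1%}")
print(f"inferred previous rate: {previous_fps:.1f} fps")
```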


IEEE Journal of Solid-State Circuits | 2012

MRTP: Mobile Ray Tracing Processor With Reconfigurable Stream Multi-Processors for High Datapath Utilization

Hong-Yun Kim; Young-Jun Kim; Lee-Sup Kim

This paper presents a mobile ray tracing processor (MRTP) with reconfigurable stream multi-processors (RSMPs) for high datapath utilization. The MRTP includes three RSMPs that operate asynchronously in multiple instruction multiple data (MIMD) mode to exploit instruction-level parallelism. Each RSMP is based on a single instruction multiple thread (SIMT) architecture to exploit thread-level parallelism. An RSMP consists of twelve scalar processing elements (SPEs) that run multiple threads synchronously in parallel: twelve scalar threads or four vector threads, depending on the operating mode. The low datapath utilization caused by branch divergence in the SIMT architecture is improved by 19.9% on average by reconfiguring the twelve SPEs between scalar SIMT and vector SIMT, with 0.1% area overhead. Special function instructions occupy only 2% to 8% of kernel instructions, so a partial special function unit (PSFU) is implemented instead of a large dedicated SFU. The access conflicts at a look-up table (LUT) caused by concurrent accesses from the twelve SPEs are reduced by a table loader (TBLD). The TBLD monitors concurrent requests from the twelve SPEs and reduces the LUT access count by distributing a coefficient to multiple SPEs with only one read access to the LUT. The MRTP, with an area of 4 × 4 mm², has been fabricated in 0.13 μm CMOS technology. It achieves a peak performance of 673 K rays per second while consuming 156 mW at 100 MHz with VDD = 1.2 V.
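The table loader's behavior, as described in this abstract, amounts to reading each requested coefficient once and distributing it to every SPE that asked for it. The following is a minimal software sketch of that idea; the LUT contents, the request pattern, and the function name are hypothetical, and the real TBLD is a hardware unit whose details the abstract does not give.

```python
def serve_with_table_loader(requests, lut):
    """Serve concurrent LUT requests with one read per distinct index.

    Returns the value delivered to each requester and the number of
    LUT reads that were needed.
    """
    fetched = {}
    reads = 0
    for index in requests:
        if index not in fetched:
            fetched[index] = lut[index]  # the only actual LUT read
            reads += 1
    values = [fetched[index] for index in requests]
    return values, reads


# Hypothetical LUT and one round of requests from twelve SPEs.
lut = [i * 0.5 for i in range(8)]
requests = [3, 3, 3, 5, 5, 3, 1, 1, 3, 5, 3, 3]

values, reads = serve_with_table_loader(requests, lut)
print(f"{len(requests)} requests served with {reads} LUT reads")
```

Without sharing, each of the twelve requests would need its own read; here the count drops to the number of distinct indices.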


Lab on a Chip | 2016

Dual-patterned immunofiltration (DIF) device for the rapid efficient negative selection of heterogeneous circulating tumor cells

Jiyoon Bu; Yoon-Tae Kang; Young-Jun Kim; Young-Ho Cho; Hee Jin Chang; Hojoong Kim; Byung-In Moon; Ho Gak Kim

The analysis of circulating tumor cells (CTCs) is an emerging field for estimating the metastatic relapse and tumor burden of cancer patients. However, the isolation of CTCs is still challenging due to their ambiguity, rarity, and heterogeneity. Here, we present an anti-CD45 antibody based dual-patterned immunofiltration (DIF) device for the enrichment of heterogeneous CTC subtypes by effective elimination of leukocytes. Our uniquely designed dual-patterned layers significantly enhance the binding chance between immuno-patterns and leukocytes due to the fluidic whirling and the increased binding sites, thus achieving superior negative selection in terms of high throughput and high purity. In experiments using lung cancer cells, 97.07 ± 2.79% of leukocytes were eliminated with less than 10% loss of cancerous cells at a flow rate of 1 mL h⁻¹. To verify the device as a potential diagnostic tool, CTCs were collected from the blood of 11 cancer patients, and an average of 283.3 CTC-like cells were identified, while fewer than 1 CTC-like cell was found in healthy donors. The samples were also analyzed by immunohistochemistry and the reverse transcription polymerase chain reaction to identify their heterogeneous characteristics. These results demonstrate that the present device could help to understand the unknown properties or undiscovered roles of CTCs with a non-biased view.


International Symposium on Circuits and Systems | 2009

Bank-partition and multi-fetch scheme for floating-point special function units in multi-core systems

Young-Jun Kim; Kyusik Chung; Lee-Sup Kim; Seong Mo Park

A table loader unit with bank-partition and multi-fetch features is proposed for multiple special function units in multi-core systems. By sharing the look-up tables among special function units, the proposed scheme reduces look-up table size by 54% and read power consumption by 75% for an 8-core/4-bank configuration. The associated performance loss is less than 25% for various benchmark applications. As the number of cores increases in future multi-core systems, the area and power savings of the proposed scheme increase while the performance loss decreases.


IEEE Transactions on Circuits and Systems for Video Technology | 2012

A Reconfigurable Heterogeneous Multimedia Processor for IC-Stacking on Si-Interposer

Hyo-Eun Kim; Jae-Sung Yoon; Kyu-Dong Hwang; Young-Jun Kim; Jun-Seok Park; Lee-Sup Kim

This paper presents a heterogeneous multimedia processor for embedded media applications such as image processing, vision, 3-D graphics, and augmented reality (AR), assuming integrated circuit (IC)-stacking on Si-interposer. The processor embeds reconfigurable output drivers for the external memory interface to increase memory bandwidth even in a mobile environment. The implemented output driver reconfigures its driving strength according to the channel loss between the processor and the memory, so it enables high-speed data communication while achieving 8× higher memory bandwidth compared to previous embedded media processors. The implemented processor includes three main programmable intellectual properties: mode-configurable vector processing units (MCVPUs), a unified filtering unit (UFU), and a unified shader. The MCVPUs have 32 integer (16 bit) cores in order to support dual-mode operation between image-level processing and graphics processing. This mode configuration enables frame-level pipelining in AR applications, so the proposed processor achieves a 1.7× higher frame rate compared to sequential AR processing. The UFU supports 16 types of filtering operations with only a single instruction. Most image-level processing consists of various types of filtering operations, so the UFU can improve media processing performance and energy efficiency. The UFU also supports texture filtering, which is a performance bottleneck of common graphics pipelines. A memory-access-efficient (off-chip memory) texturing algorithm named adaptive block selection is proposed to enhance texturing performance in the 3-D graphics pipeline. The UFU has a two-level on-chip memory hierarchy: a 512B level-0 (L0) data buffer and an 8kB level-1 (L1) static random-access memory (SRAM) cache. The small-sized L0 data buffer limits direct references to the large-sized L1 SRAM cache to reduce energy consumed in on-chip memories. The unified shader consists of four homogeneous scalar processing elements (SPEs) for geometry operations in 3-D graphics. Each SPE has single-precision floating-point datapaths, since the precision of geometry operations in 3-D graphics is important in today's high-resolution handheld devices. The proposed media processor is fabricated in 0.13 μm CMOS technology with a 4 mm × 4 mm chip size, and dissipates 275 mW for full AR operation.
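How a small L0 buffer can limit references to a larger L1 cache can be shown with a toy model. This is a sketch under stated assumptions: the 512-byte L0 size comes from the abstract, but the direct-mapped organization, the 32-byte line size, and the sequential access pattern are all hypothetical choices made for illustration, not details of the UFU.

```python
def count_l1_references(addresses, l0_bytes=512, line_bytes=32):
    """Count accesses that miss a direct-mapped L0 buffer and reach L1.

    l0_bytes matches the abstract; line_bytes and the direct-mapped
    layout are assumptions of this sketch.
    """
    num_lines = l0_bytes // line_bytes
    l0_tags = [None] * num_lines
    l1_references = 0
    for address in addresses:
        line = address // line_bytes
        slot = line % num_lines
        if l0_tags[slot] != line:
            l0_tags[slot] = line   # fill the L0 line from L1
            l1_references += 1
    return l1_references


# Hypothetical pattern: 64 sequential 4-byte reads over 256 bytes.
addresses = list(range(0, 256, 4))
l1_refs = count_l1_references(addresses)
print(f"{len(addresses)} accesses, {l1_refs} L1 references")
```

With this access pattern only one access per 32-byte line reaches L1, so most accesses are served from the smaller, cheaper buffer.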
