
Taek-Jun Kwon

University of Southern California


Publication

Featured research published by Taek-Jun Kwon.

networks on chips | 2010

Fault-Tolerant Flow Control in On-chip Networks

Young Hoon Kang; Taek-Jun Kwon; Jeffrey Draper

Scaling of interconnects exacerbates the already challenging reliability of on-chip networks. Although many researchers have provided various fault handling techniques in chip multi-processors (CMPs), the fault-tolerance of the interconnection network is yet to adequately evolve. As an end-to-end recovery approach delays fault detection and complicates recovery to a consistent global state in such a system, link-level retransmission is endorsed for recovery, keeping the higher-level protocol simple. In this paper, we introduce a fault-tolerant flow control scheme for soft error handling in on-chip networks. The fault-tolerant flow control recovers errors at the link level by requesting retransmission and ensures error-free transmission on a per-flit basis by incorporating dynamic packet fragmentation. Dynamic packet fragmentation is adopted as a part of fault-tolerant flow control to disengage flits from the fault-containment and recover the faulty flit transmission. Thus, the proposed router provides a high level of dependability at the link level for both datapath and control planes. In simulation with injected faults, the proposed router is observed to perform well, gracefully degrading while exhibiting 97% error coverage in datapath elements. The proposed router has been implemented using a TSMC 45nm standard cell library. Compared to a router which employs triple modular redundancy (TMR) in datapath elements, the proposed router takes 58% less area and consumes 40% less energy per packet on average.
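The link-level recovery idea in this abstract can be illustrated with a minimal sketch. It assumes a single parity bit per flit as the error detector and a hypothetical `faulty_attempts` set that injects single-bit soft errors; the abstract does not state which detection code or retry policy the actual router uses, so the names and parameters here are illustrative only.

```python
def parity(word):
    """Return the even-parity bit of an integer flit payload."""
    return bin(word).count("1") & 1


def send_over_link(flits, faulty_attempts, max_retries=3):
    """Send flits across one link, retransmitting any flit that fails its check.

    faulty_attempts is a hypothetical fault injector: a set of
    (flit_index, attempt) pairs on which one bit is flipped in flight.
    """
    delivered = []
    transmissions = 0
    for index, flit in enumerate(flits):
        for attempt in range(max_retries + 1):
            transmissions += 1
            wire_data, wire_parity = flit, parity(flit)
            if (index, attempt) in faulty_attempts:
                wire_data ^= 1  # injected single-bit soft error
            if parity(wire_data) == wire_parity:
                delivered.append(wire_data)  # receiver accepts the flit
                break
            # Check failed: the receiver requests retransmission and the
            # sender resends the same flit on the next loop iteration.
        else:
            raise RuntimeError(f"flit {index} not delivered after retries")
    return delivered, transmissions


flits = [0b1011, 0b0110, 0b1111]
received, sent = send_over_link(flits, faulty_attempts={(1, 0)})
```

Because the error is caught and repaired on the link where it occurs, the layers above see an error-free flit stream, which is the simplification the abstract attributes to link-level recovery.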

Microelectronics Journal | 2009

Floating-point division and square root using a Taylor-series expansion algorithm

Taek-Jun Kwon; Jeffrey Draper

Hardware support for floating-point (FP) arithmetic is a mandatory feature of modern microprocessor design. Although division and square root are relatively infrequent operations in traditional general-purpose applications, they are indispensable and becoming increasingly important in many modern applications. Therefore, overall performance can be greatly affected by the algorithms and the implementations used for designing FP-div and FP-sqrt units. In this paper, a fused floating-point multiply/divide/square root unit based on a Taylor-series expansion algorithm is proposed. We extended an existing multiply/divide fused unit to incorporate the square root function with little area and latency overhead, since Taylor's theorem enables us to compute approximations for many well-known functions with very similar forms. The proposed arithmetic unit exhibits a reasonably good area-performance balance.
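The remark that Taylor's theorem gives very similar forms for different functions can be seen in a small numerical sketch. This is a generic textbook formulation, not the paper's datapath: the function names, the 8-bit table index, and the three-term truncation are assumptions, and the seed value 1/b0 is computed directly where hardware would read it from a lookup table.

```python
import math


def _seed_and_error(b, table_bits):
    """Split b in [1, 2) into a table point b0 and a small relative offset e."""
    if not 1.0 <= b < 2.0:
        raise ValueError("expects a normalized significand in [1, 2)")
    scale = 1 << table_bits
    b0 = (int(b * scale) + 0.5) / scale  # midpoint of the table interval
    e = (b - b0) / b0                    # b = b0 * (1 + e), with |e| small
    return b0, e


def taylor_reciprocal(b, table_bits=8):
    """Approximate 1/b as (1/b0) * (1 - e + e^2)."""
    b0, e = _seed_and_error(b, table_bits)
    inv_b0 = 1.0 / b0  # stands in for a lookup-table read
    return inv_b0 * (1.0 - e + e * e)


def taylor_sqrt(b, table_bits=8):
    """Approximate sqrt(b) as sqrt(b0) * (1 + e/2 - e^2/8)."""
    b0, e = _seed_and_error(b, table_bits)
    sqrt_b0 = math.sqrt(b0)  # stands in for a lookup-table read
    return sqrt_b0 * (1.0 + e / 2.0 - e * e / 8.0)


quotient = 3.0 * taylor_reciprocal(1.37)  # division as multiply-by-reciprocal
root = taylor_sqrt(1.37)
```

Both functions evaluate a table seed times a short polynomial in the same offset e; only the seed and the coefficients differ. That shared shape is plausibly what lets a multiply/divide unit be extended to square root at low cost.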

international symposium on circuits and systems | 2005

Design trade-offs in floating-point unit implementation for embedded and processing-in-memory systems

Taek-Jun Kwon; Jeff Sondeen; Jeffrey Draper

Hardware support for floating-point (FP) arithmetic is a mandatory feature of modern microprocessor design. There are many alternatives in floating-point unit (FPU) design, and overall performance can be greatly affected by the organization of a floating-point unit. In this paper, design considerations and trade-off factors are evaluated for two types of floating-point unit architecture and implementation optimized under different design goals. The implementation results of the proposed FPUs based on standard cell methodology in TSMC 0.18 μm technology show that both designs are well optimized for their target applications. A single-instruction issue design is implemented in very small area; however, a design capable of concurrently executing FP add and multiply instructions is achievable with only a modest 24% area increase.

international conference on electronics, circuits, and systems | 2008

Floating-point division and square root implementation using a Taylor-series expansion algorithm

Taek-Jun Kwon; Jeff Sondeen; Jeffrey Draper

Hardware support for floating-point (FP) arithmetic is an essential feature of modern microprocessor design. Although division and square root are relatively infrequent operations in traditional general-purpose applications, they are indispensable and becoming increasingly important in many modern applications. In this paper, a fused floating-point multiply/divide/square root unit based on a Taylor-series expansion algorithm is presented. The implementation results of the proposed fused unit based on standard cell methodology in IBM 90 nm technology show that incorporating the square root function into an existing multiply/divide unit requires only a modest 23% area increase, and that the same low latency (12 cycles) can be achieved for divide and square root operations. The proposed arithmetic unit also exhibits a reasonably good area-performance balance.

european solid-state circuits conference | 2003

An area-efficient standard-cell floating-point unit design for a processing-in-memory system

Joong-Seok Moon; Taek-Jun Kwon; Jeff Sondeen; Jeffrey Draper

The data-intensive architecture (DIVA) system incorporates processing-in-memory (PIM) chips as smart-memory coprocessors to a microprocessor. This architecture exploits inherent memory bandwidth both on chip and across the system to target several classes of bandwidth-limited applications. One of the key capabilities of this architecture is wideword floating-point computation, which enables aggregate floating-point operations. Each PIM chip includes eight basic instructions and IEEE-754 compliant rounding and exceptions. Through pipeline scheduling and a hardware-efficient division algorithm, the resulting FPU is well-balanced between area and performance. This paper details the design and implementation of this FPU based on standard cell methodology in 0.18 μm CMOS technology. Area, power dissipation and performance are also discussed.

midwest symposium on circuits and systems | 2008

Floating-point division and square root implementation using a Taylor-series expansion algorithm with reduced look-up tables

Taek-Jun Kwon; Jeffrey Draper

Hardware support for floating-point (FP) arithmetic is an essential feature of modern microprocessor design. Although division and square root are relatively infrequent operations in traditional general-purpose applications, they are indispensable and becoming increasingly important in many modern applications. In this paper, a fused floating-point multiply/divide/square root unit using the Taylor-series expansion algorithm with reduced lookup tables is presented. The implementation results of the proposed fused unit based on standard cell methodology in IBM 90 nm technology show that incorporating the square root function into an existing multiply/divide unit requires only a modest 20% area increase, and that the same low latency (12 cycles) can be achieved for divide and square root operations. The proposed arithmetic unit also exhibits a reasonably good area-performance balance.

international symposium on circuits and systems | 2004

A 0.18 μm implementation of a floating-point unit for a processing-in-memory system

Taek-Jun Kwon; Joong-Seok Moon; Jeff Sondeen; Jeffrey Draper

The Data-Intensive Architecture (DIVA) system incorporates Processing-In-Memory (PIM) chips as smart-memory coprocessors to a microprocessor. This architecture exploits inherent memory bandwidth both on chip and across the system to target several classes of bandwidth-limited applications. A key capability of this architecture is the support of parallel single-precision floating-point operations. Each PIM chip includes eight single-precision FPUs, each of which supports eight basic instructions and IEEE-754 compliant rounding and exceptions. Through block sharing and a hardware-efficient division algorithm, the resulting FPU is well-balanced between area and performance. This paper focuses on the novel divide algorithm implemented and documents the fabrication and testing of a prototype FPU based on standard cell methodology in TSMC 0.18 μm CMOS technology.

networks on chips | 2009

Dynamic packet fragmentation for increased virtual channel utilization in on-chip routers

Young Hoon Kang; Taek-Jun Kwon; Jeffrey Draper

Conventional packet-switched on-chip routers provide good resource sharing while minimizing latencies through various techniques. A virtual channel (VC) is allocated on a per-packet basis and held until the entire packet exits the VC buffer. This sometimes leads to inefficient use of VCs at high network loads. A blocked packet can affect adjacent routers, resulting in a congestion propagation effect. In such a scenario, VC buffers may be empty although they are regarded as fully occupied by a blocked packet. This paper proposes a dynamic packet fragmentation technique which releases empty VC buffers by fragmenting packets and allowing other packets to use the freed VC buffers. Thus, fragmentation increases VC utilization. Simulation experiments show performance improvement in terms of latency and throughput up to 20% and 7.5%, respectively.
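The difference between conventional per-packet VC allocation and the fragmentation idea can be sketched with a toy model. The `VirtualChannel` class and its `fragmentation` flag are hypothetical names for illustration; the abstract does not describe the router's actual control logic, and a real design would also have to give the remainder of a fragmented packet its own head information.

```python
from collections import deque


class VirtualChannel:
    """Toy VC: one owner packet at a time, with a FIFO of buffered flits."""

    def __init__(self):
        self.owner = None
        self.buffer = deque()

    def enqueue(self, packet_id, is_tail):
        if self.owner not in (None, packet_id):
            raise ValueError("VC is held by another packet")
        self.owner = packet_id
        self.buffer.append(is_tail)

    def dequeue(self, fragmentation):
        is_tail = self.buffer.popleft()
        # Conventional policy: release the VC only when the tail flit leaves.
        # Fragmentation policy (assumed form): also release it when the
        # buffer drains before the tail arrives, so another packet can use it.
        if is_tail or (fragmentation and not self.buffer):
            self.owner = None
        return is_tail


conventional = VirtualChannel()
conventional.enqueue(packet_id=1, is_tail=False)
conventional.dequeue(fragmentation=False)   # buffer empty, VC still held

fragmenting = VirtualChannel()
fragmenting.enqueue(packet_id=1, is_tail=False)
fragmenting.dequeue(fragmentation=True)     # buffer empty, VC released
fragmenting.enqueue(packet_id=2, is_tail=False)
```

In the conventional case the VC stays reserved for packet 1 even though its buffer is empty, which is the situation the abstract describes as empty buffers being regarded as fully occupied.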

international symposium on circuits and systems | 2006

A double-data rate (DDR) processing-in-memory (PIM) device with wideword floating-point capability

Tim Barrett; Sumit Dharampal Mediratta; Taek-Jun Kwon; Ravinder Singh; Sachit Chandra; Jeff Sondeen; Jeffrey Draper

The data-intensive architecture (DIVA) system incorporates processing-in-memory (PIM) chips as smart-memory coprocessors to a microprocessor. This architecture exploits inherent memory bandwidth both on chip and across the system to target several classes of bandwidth-limited applications. A recently developed PIM chip in TSMC 0.18 μm technology incorporates a DDR SDRAM interface for its inclusion in commodity systems, such as the HP zx6000 workstation used on this project. Each PIM chip includes eight single-precision floating-point units (FPU) in the wideword pipeline, enabling significant speedups in the target system. This paper focuses on the integration of new subcomponents into the PIM chip design, system integration, and measured system results, demonstrating the significant GFLOP/W feature offered by PIM computing.

international symposium on circuits and systems | 2010

Implementation of adaptive grain signatures for transactional memories

Woojin Choi; Young Hoon Kang; Taek-Jun Kwon; Jeffrey Draper

Hardware signatures for Transactional Memory (TM) systems have been proposed as an efficient mechanism for conflict detection, an essential element in TM for maintaining correctness. A signature misses no conflicts, but could falsely declare conflicts even when no true conflict exists (false positives). In this paper, we show that some false positives can help performance by triggering the early abort of a transaction which would encounter a true conflict later anyway. We propose an adaptive grain signature to improve TM performance by dynamically changing the range of address keys based on the history. With architecture-level simulation and Verilog HDL implementation, we demonstrate that a TM system with our design frequently outperforms baseline TM systems, with marginal area overhead.
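The property that a signature misses no conflicts but may report false ones follows from its Bloom-filter-like structure, sketched below. The `Signature` class, its two multiplicative hash functions, and the `grain_shift` parameter are assumptions made for illustration; the abstract does not specify the hash design or the history-based policy that adjusts the grain, so no adaptation policy is modeled here.

```python
class Signature:
    """Bloom-filter-style address set: no false negatives, possible false positives."""

    def __init__(self, bits=64, grain_shift=6):
        self.bits = bits                # width of the signature bit vector
        self.grain_shift = grain_shift  # low address bits dropped to form the key
        self.vector = 0

    def _mask(self, address):
        key = address >> self.grain_shift
        # Two illustrative multiplicative hashes (constants chosen arbitrarily).
        h1 = ((key * 0x9E3779B1) >> 8) % self.bits
        h2 = ((key * 0x85EBCA6B) >> 12) % self.bits
        return (1 << h1) | (1 << h2)

    def insert(self, address):
        self.vector |= self._mask(address)

    def may_conflict(self, address):
        mask = self._mask(address)
        return (self.vector & mask) == mask


write_set = Signature(grain_shift=6)    # 64-byte grain
write_set.insert(0x1000)
same_address = write_set.may_conflict(0x1000)
same_block = write_set.may_conflict(0x1008)
```

A larger `grain_shift` maps more neighboring addresses to the same key, so the signature reports conflicts over a wider address range. Varying that range at run time is the kind of knob an adaptive grain scheme could turn.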
