Publication


Featured research published by Taemin Kim.


International Symposium on Computer Architecture | 2016

Energy efficient architecture for graph analytics accelerators

Muhammet Mustafa Ozdal; Serif Yesil; Taemin Kim; Andrey Ayupov; John Greth; Steven M. Burns; Ozcan Ozturk

Specialized hardware accelerators can significantly improve the performance and power efficiency of compute systems. In this paper, we focus on hardware accelerators for graph analytics applications and propose a configurable architecture template that is specifically optimized for iterative vertex-centric graph applications with irregular access patterns and asymmetric convergence. The proposed architecture addresses the limitations of the existing multi-core CPU and GPU architectures for these types of applications. The SystemC-based template we provide can be customized easily for different vertex-centric applications by inserting application-level data structures and functions. After that, a cycle-accurate simulator and RTL can be generated to model the target hardware accelerators. In our experiments, we study several graph-parallel applications, and show that the hardware accelerators generated by our template can outperform a 24-core high-end server CPU system by up to 3x in terms of performance. We also estimate the area requirement and power consumption of these hardware accelerators through physical-aware logic synthesis, and show up to 65x better power consumption with significantly smaller area.
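As an illustration of the programming model described in this abstract (not the paper's actual SystemC template), the sketch below shows a PageRank-style vertex program with hypothetical gather/apply hooks and a host reference loop that iterates until no vertex changes; all names and interfaces are assumptions for illustration only.

```cpp
// Hypothetical vertex-centric program of the kind such a template could be
// specialized with; not the paper's SystemC interface.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct Graph {
    std::vector<std::vector<uint32_t>> in_neighbors;  // irregular, per-vertex lists
    std::vector<uint32_t> out_degree;
};

struct PageRank {
    static constexpr float kDamping = 0.85f;
    static constexpr float kEpsilon = 1e-4f;
    std::vector<float> rank;

    // gather(): pull contributions from in-neighbors (random memory accesses).
    float gather(const Graph& g, uint32_t v) const {
        float sum = 0.0f;
        for (uint32_t u : g.in_neighbors[v])
            sum += rank[u] / std::max<uint32_t>(g.out_degree[u], 1u);
        return sum;
    }

    // apply(): update vertex state and report whether it is still changing.
    bool apply(uint32_t v, float gathered) {
        float updated = (1.0f - kDamping) + kDamping * gathered;
        bool still_active = std::fabs(updated - rank[v]) > kEpsilon;
        rank[v] = updated;
        return still_active;
    }
};

// Reference loop: vertices converge at different rates ("asymmetric
// convergence"), so iteration stops once no vertex changes significantly.
inline void run(const Graph& g, PageRank& pr, int max_iters = 100) {
    for (int it = 0; it < max_iters; ++it) {
        bool any_active = false;
        for (uint32_t v = 0; v < pr.rank.size(); ++v)
            any_active |= pr.apply(v, pr.gather(g, v));
        if (!any_active) break;
    }
}
```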


International Conference on Computer-Aided Design | 2015

A Polyhedral-based SystemC Modeling and Generation Framework for Effective Low-power Design Space Exploration

Wei Zuo; Warren Kemmerer; Jong Bin Lim; Louis-Noël Pouchet; Andrey Ayupov; Taemin Kim; Kyungtae Han; Deming Chen

With the prevalence of System-on-Chips, there is a growing need for automation and acceleration of the design process. A classical approach is to take a C/C++ specification of the application, convert it to a SystemC (or equivalent) description of hardware implementing this application, and perform successive refinement of the description to improve various design metrics. In this work, we present an automated SystemC generation and design space exploration flow alleviating several productivity and design time issues encountered in the current design process. We first automatically convert a subset of C/C++, namely affine program regions, into a full SystemC description through polyhedral model-based techniques while performing powerful data locality and parallelism transformations. We then leverage key properties of affine computations to design a fast and accurate latency and power characterization flow. Using this flow, we build analytical models of power and performance that can effectively prune away a large number of inferior design points very fast and generate Pareto-optimal solution points. Experimental results show that (1) our SystemC models can evaluate system performance and power that are only 0.57% and 5.04% away from gate-level evaluation results, respectively; (2) our latency and power analytical models are 3.24% and 5.31% away from the actual Pareto points generated by SystemC simulation, with 2091x faster design-space exploration time on average. The generated Pareto-optimal points provide effective low-power design solutions given different latency constraints.
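The pruning step can be pictured with the minimal sketch below, which keeps only Pareto-optimal (latency, power) points from a set of analytically characterized designs; the structures and field names are assumptions for illustration, not the framework's API.

```cpp
// Illustrative Pareto-front extraction over (latency, power) design points,
// the kind of filtering a fast analytical model enables before simulation.
#include <algorithm>
#include <limits>
#include <vector>

struct DesignPoint {
    double latency;    // predicted by the analytical latency model (cycles)
    double power;      // predicted by the analytical power model (mW)
    int config_id;     // e.g., an unrolling / resource-binding choice
};

// Keep only points not dominated by another point that is no worse in both
// metrics: sort by latency, then sweep keeping strict improvements in power.
inline std::vector<DesignPoint> pareto_front(std::vector<DesignPoint> pts) {
    std::sort(pts.begin(), pts.end(),
              [](const DesignPoint& a, const DesignPoint& b) {
                  return a.latency < b.latency ||
                         (a.latency == b.latency && a.power < b.power);
              });
    std::vector<DesignPoint> front;
    double best_power = std::numeric_limits<double>::infinity();
    for (const DesignPoint& p : pts) {
        if (p.power < best_power) {     // strictly improves power
            front.push_back(p);
            best_power = p.power;
        }
    }
    return front;
}
```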


International Conference on Computer-Aided Design | 2015

Learning-Based Power Modeling of System-Level Black-Box IPs

Dongwook Lee; Taemin Kim; Kyungtae Han; Yatin Hoskote; Lizy Kurian John; Andreas Gerstlauer

Virtual platform prototypes are widely utilized to enable early system-level design space exploration. Accurate power models for hardware components at high levels of abstraction are needed to enable system-level power analysis and optimization. However, the limited observability of third-party IPs renders traditional power modeling methods challenging and inaccurate. In this paper, we present a novel approach for extending behavioral models of black-box hardware IPs with an accurate power estimate. We leverage state-of-the-art machine learning techniques to synthesize an abstract power model. Our model uses input and output history to track data-dependent pipeline behavior. Furthermore, we introduce a specialized ensemble learner composed of individually selected cycle-by-cycle models to reduce overall complexity and further increase estimation accuracy. Results of applying our approach to various industrial-strength design examples show that our models predict average power consumption to within 3% of a commercial gate-level power estimation tool, all while running several orders of magnitude faster.
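A minimal sketch of the modeling idea follows: per-cycle features are derived from input/output history (here, simple toggle counts, an assumed feature choice), and several simple cycle-by-cycle models are combined into an ensemble estimate. The linear model form and all names are illustrative only, not the paper's learned models.

```cpp
// Toy cycle-level power estimate built from I/O history features and an
// ensemble of simple per-cycle models.
#include <bitset>
#include <cstdint>
#include <vector>

constexpr int kWidth = 32;

// Hamming distance between consecutive port values approximates switching.
inline double toggles(uint32_t prev, uint32_t cur) {
    return static_cast<double>(std::bitset<kWidth>(prev ^ cur).count());
}

struct LinearModel {
    std::vector<double> weights;   // one weight per feature
    double bias = 0.0;
    double predict(const std::vector<double>& f) const {
        double p = bias;
        for (size_t i = 0; i < f.size() && i < weights.size(); ++i)
            p += weights[i] * f[i];
        return p;
    }
};

// Ensemble: average several individually trained cycle-by-cycle models,
// mirroring the idea of combining selected models for accuracy.
inline double estimate_cycle_power(const std::vector<LinearModel>& ensemble,
                                   const std::vector<uint32_t>& prev_io,
                                   const std::vector<uint32_t>& cur_io) {
    std::vector<double> features;
    for (size_t i = 0; i < cur_io.size(); ++i)
        features.push_back(toggles(prev_io[i], cur_io[i]));
    double sum = 0.0;
    for (const LinearModel& m : ensemble) sum += m.predict(features);
    return ensemble.empty() ? 0.0 : sum / ensemble.size();
}
```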


International Conference on Computer-Aided Design | 2015

Architectural Requirements for Energy Efficient Execution of Graph Analytics Applications

Muhammet Mustafa Ozdal; Serif Yesil; Taemin Kim; Andrey Ayupov; Steven M. Burns; Ozcan Ozturk

Intelligent data analysis has become more important in the last decade, especially because of the significant increase in the size and availability of data. In this paper, we focus on the common execution models and characteristics of iterative graph analytics applications. We show that the features that improve work efficiency can lead to significant overheads on existing systems. We identify the opportunities for custom hardware implementation, and outline the desired architectural features for energy efficient computation of graph analytics applications.


Asia and South Pacific Design Automation Conference | 2014

Edit distance based instruction merging technique to improve flexibility of custom instructions toward flexible accelerator design

Hui Huang; Taemin Kim; Yatin Hoskote

Due to the ever-shortening time-to-market of a system-on-a-chip (SoC) and the increasing NRE cost of designing accelerators in the SoC, a design methodology for a flexible accelerator is desirable. We propose a novel technique to make custom instructions (CIs) of an application-specific instruction-set processor (ASIP) flexible. By doing so, CIs can support applications that were not considered at the design time of the ASIP, which is difficult to do with a conventional CI design method. We have shown that custom instructions generated by our technique can support future applications up to 7X better than those from a conventional method.
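The edit-distance building block can be sketched as follows: each custom instruction is treated as a sequence of operations, and a standard Levenshtein distance ranks how cheaply two CIs could be merged into one flexible datapath. The merging heuristics themselves are not reproduced here, and all names are illustrative.

```cpp
// Levenshtein distance between the operation sequences of two custom
// instructions; low distance suggests good candidates for merging.
#include <algorithm>
#include <string>
#include <vector>

// Each custom instruction is abstracted as a sequence of operation mnemonics.
inline int edit_distance(const std::vector<std::string>& a,
                         const std::vector<std::string>& b) {
    const size_t n = a.size(), m = b.size();
    std::vector<std::vector<int>> d(n + 1, std::vector<int>(m + 1, 0));
    for (size_t i = 0; i <= n; ++i) d[i][0] = static_cast<int>(i);
    for (size_t j = 0; j <= m; ++j) d[0][j] = static_cast<int>(j);
    for (size_t i = 1; i <= n; ++i)
        for (size_t j = 1; j <= m; ++j) {
            int subst = (a[i - 1] == b[j - 1]) ? 0 : 1;
            d[i][j] = std::min({d[i - 1][j] + 1,          // delete
                                d[i][j - 1] + 1,          // insert
                                d[i - 1][j - 1] + subst}); // substitute
        }
    return d[n][m];
}

// Example: {"mul","add","shl"} vs {"mul","sub","shl"} have distance 1, so
// their datapaths differ by a single operation and merge cheaply.
```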


Design Automation Conference | 2017

Accurate High-level Modeling and Automated Hardware/Software Co-design for Effective SoC Design Space Exploration

Wei Zuo; Louis-Noël Pouchet; Andrey Ayupov; Taemin Kim; Chung-Wei Lin; Shinichi Shiraishi; Deming Chen

A desirable feature of a development tool for SoC design is that, given the important applications in the domain to be targeted by the SoC, a powerful hardware-software partitioning engine is available to determine which function(s) shall be mapped to hardware. However, to provide high-quality partitioning, this engine must be able to consider a rich design space of possible alternate hardware and software implementations for each program region that is a candidate for hardware acceleration, in turn making the task of finding the optimal mapping very difficult given the number of design points to consider and the need for accurate modeling of latency, power and area. In this work we propose a novel framework to enable hardware acceleration of performance-critical parts of an application, by addressing the problem of hardware/software partitioning under power and area constraints to minimize the overall program latency. Our flow is based on the LLVM compiler, and focuses on building a scalable compile-time partitioning algorithm while considering large sets of alternative hardware and software implementations for a particular region. To this end we develop a hybrid approach based on mixing semi-random selection of hardware design points and an Integer Linear Programming formulation of the mapping decision, along with iterative refinements of the solution. Experimental results demonstrate the capability of our approach to consider complex designs and yet output near-optimal partitioning decisions. Our package is named RIP (Randomized ILP-based Partitioning), and is open source to benefit the research community.
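To make the shape of the partitioning problem concrete, the sketch below solves a simplified stand-in: pick regions to accelerate under an area budget so that the total latency saving is maximized, via a 0/1 knapsack DP. The paper's actual flow uses an ILP with randomized design-point selection; this is only an illustration with hypothetical names and units.

```cpp
// Simplified hardware/software partitioning: choose regions to accelerate
// under an area budget to maximize total latency reduction (0/1 knapsack).
#include <algorithm>
#include <vector>

struct Region {
    long long sw_latency;   // cycles if kept in software
    long long hw_latency;   // cycles if mapped to an accelerator
    int area;               // area cost of the accelerator (arbitrary units)
};

inline long long best_latency_saving(const std::vector<Region>& regions,
                                     int area_budget) {
    std::vector<long long> best(area_budget + 1, 0);   // best saving per budget
    for (const Region& r : regions) {
        long long saving = std::max(0LL, r.sw_latency - r.hw_latency);
        if (r.area > area_budget || saving == 0) continue;
        for (int a = area_budget; a >= r.area; --a)    // reverse sweep: 0/1 choice
            best[a] = std::max(best[a], best[a - r.area] + saving);
    }
    return best[area_budget];
}
```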


Design, Automation, and Test in Europe | 2014

Automatic generation of custom SIMD instructions for Superword Level Parallelism

Taemin Kim; Yatin Hoskote

Application-specific instruction-set processors (ASIPs) have drawn significant attention from the System-on-a-Chip (SoC) community due to their fine-grained flexibility and customizability. In order to maximize the benefit of an ASIP, automatic instruction set extension (ISE) is required. In the past decade, there has been a plethora of research on automatic ISE for custom scalar instructions. However, given the increasing use of SIMD instructions to exploit data-level parallelism (DLP), which exists both across loop iterations and within a basic block (the latter known as Superword Level Parallelism, SLP), automatic generation of custom SIMD instructions is the natural next step for automatic ISE. In this paper, we propose an algorithm that automatically generates custom SIMD instructions from a set of custom scalar instructions to exploit SLP. We have demonstrated 52.4% and 30.8% performance improvement on average over the base instruction set and additional custom scalar instructions, respectively.
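A toy version of the SLP packing idea is sketched below: isomorphic, independent scalar operations within a basic block are grouped into fixed-width packs, each of which could be realized as one custom SIMD instruction. Real SLP packing also checks dependences and memory adjacency; everything here, including the data structures, is illustrative.

```cpp
// Group isomorphic scalar operations in a basic block into SIMD-width packs;
// each pack is a candidate custom SIMD instruction.
#include <map>
#include <string>
#include <utility>
#include <vector>

struct ScalarOp {
    std::string opcode;   // e.g. "add", "mul"
    std::string type;     // e.g. "i32"
    int id;               // position in the basic block
};

inline std::vector<std::vector<ScalarOp>>
build_simd_packs(const std::vector<ScalarOp>& block, size_t lanes = 4) {
    // Bucket operations by (opcode, type): only isomorphic ops can be packed.
    std::map<std::pair<std::string, std::string>, std::vector<ScalarOp>> groups;
    for (const ScalarOp& op : block) groups[{op.opcode, op.type}].push_back(op);

    // Cut each bucket into fixed-width packs of `lanes` operations.
    std::vector<std::vector<ScalarOp>> packs;
    for (auto& entry : groups) {
        std::vector<ScalarOp>& ops = entry.second;
        for (size_t i = 0; i + lanes <= ops.size(); i += lanes)
            packs.emplace_back(ops.begin() + i, ops.begin() + i + lanes);
    }
    return packs;
}
```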


International Conference on Computer-Aided Design | 2015

Hardware Accelerator Design for Data Centers

Serif Yesil; Muhammet Mustafa Ozdal; Taemin Kim; Andrey Ayupov; Steven M. Burns; Ozcan Ozturk

As the size of available data is increasing, it is becoming inefficient to scale the computational power of traditional systems. To overcome this problem, customized application-specific accelerators are becoming integral parts of modern system on chip (SOC) architectures. In this paper, we summarize existing hardware accelerators for data centers and discuss the techniques to implement and embed them along with the existing SOCs.


IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2018

A Template-Based Design Methodology for Graph-Parallel Hardware Accelerators

Andrey Ayupov; Serif Yesil; Muhammet Mustafa Ozdal; Taemin Kim; Steven M. Burns; Ozcan Ozturk

Graph applications have been gaining importance in the last decade due to emerging big data analytics problems such as Web graphs, social networks, and biological networks. For these applications, traditional CPU and GPU architectures suffer in terms of performance and power consumption due to irregular communications, random memory accesses, and load balancing problems. It has been shown that specialized hardware accelerators can achieve much better power and energy efficiency compared to the general purpose CPUs and GPUs. In this paper, we present a template-based methodology specifically targeted for hardware accelerator design of big-data graph applications. Important architectural features that are key for energy efficient execution are implemented in a common template. The proposed template-based methodology is used to design hardware accelerators for different graph applications with little effort. Compared to an application-specific high-level synthesis methodology, we show that the proposed methodology can generate hardware accelerators with up to 18x


Archive | 2012

Hybrid display frame buffer for display subsystem

Kyungtae Han; Paul S. Diefenbaugh; Taemin Kim; Nithyananda S. Jeganathan; Sameer Abhinkar


Collaboration


Dive into Taemin Kim's collaborations.

Top Co-Authors


Andreas Gerstlauer

University of Texas at Austin


Hui Huang

University of California
