Publications


Featured research published by Chen-Han Ho.


IEEE International Symposium on High-Performance Computer Architecture (HPCA) | 2011

Dynamically Specialized Datapaths for energy efficient computing

Venkatraman Govindaraju; Chen-Han Ho; Karthikeyan Sankaralingam

Due to limits in technology scaling, energy efficiency of logic devices is decreasing in successive generations. To provide continued performance improvements without increasing power, regardless of the sequential or parallel nature of the application, microarchitectural energy efficiency must improve. We propose Dynamically Specialized Datapaths to improve the energy efficiency of general-purpose programmable processors. The key insights of this work are the following. First, applications execute in phases, and these phases can be determined by creating a path-tree of basic blocks rooted at the inner-most loop. Second, specialized datapaths corresponding to these path-trees, which we refer to as DySER blocks, can be constructed by interconnecting a set of heterogeneous computation units with a circuit-switched network. These blocks can be easily integrated with a processor pipeline. A synthesized RTL implementation using an industry 55nm technology library shows that a 64-functional-unit DySER block occupies approximately the same area as a 64 KB single-ported SRAM and can execute at 2 GHz. We extend the GCC compiler to identify path-trees and map code to DySER, and evaluate the PARSEC, SPEC, and Parboil benchmark suites. Our results show that in most cases two DySER blocks can achieve the same performance (within 5%) as having a specialized hardware module for each path-tree. A 64-FU DySER block can cover 12% to 100% of the dynamically executed instruction stream. When integrated with a dual-issue out-of-order processor, two DySER blocks provide a geometric mean speedup of 2.1X (1.15X to 10X) and a geometric mean energy reduction of 40% (up to 70%), rising to a 60% energy reduction if no performance improvement is required.
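The path-tree idea is easiest to picture on a small loop. The C sketch below is purely illustrative (the loop, the clamp, and the functional-unit assignments in the comments are hypothetical, not taken from the paper): the arithmetic chain in the loop body is the kind of region a DySER-style compiler would carve out and map onto a block of interconnected functional units, while loads, stores, and loop control remain on the host pipeline.

    /* Illustrative only: a small inner loop of the kind DySER-style
     * specialization targets. The comments sketch how the loop body's
     * dataflow graph could be mapped onto a grid of functional units
     * wired together by a circuit-switched network; the mapping and
     * names are hypothetical, not from the paper. */
    #include <stdio.h>
    #include <stddef.h>

    static void saxpy_clamp(float *y, const float *x, float a, size_t n) {
        for (size_t i = 0; i < n; i++) {    /* loop control, loads, stores: stay on the core */
            float t = a * x[i] + y[i];      /* multiply + add: two FUs in the block          */
            y[i] = (t > 1.0f) ? 1.0f : t;   /* compare + select: two more FUs                */
        }
    }

    int main(void) {
        float x[4] = {0.1f, 0.2f, 0.3f, 0.4f}, y[4] = {0.5f, 0.5f, 0.5f, 0.5f};
        saxpy_clamp(y, x, 2.0f, 4);         /* y becomes {0.7, 0.9, 1.0, 1.0} (last two clamped) */
        printf("y[3] = %f\n", y[3]);
        return 0;
    }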


IEEE Micro | 2012

DySER: Unifying Functionality and Parallelism Specialization for Energy-Efficient Computing

Venkatraman Govindaraju; Chen-Han Ho; Tony Nowatzki; Jatin Chhugani; Nadathur Satish; Karthikeyan Sankaralingam; Changkyu Kim

The DySER (Dynamically Specializing Execution Resources) architecture supports both functionality specialization and parallelism specialization. By dynamically specializing frequently executing regions and applying parallelism mechanisms, DySER provides efficient functionality and parallelism specialization. It outperforms an out-of-order CPU, Streaming SIMD Extensions (SSE) acceleration, and GPU acceleration while consuming less energy. The full-system field-programmable gate array (FPGA) prototype of DySER integrated into OpenSparc demonstrates a practical implementation.


IEEE International Symposium on High-Performance Computer Architecture (HPCA) | 2012

Design, integration and implementation of the DySER hardware accelerator into OpenSPARC

Jesse Benson; Ryan Cofell; Chris Frericks; Chen-Han Ho; Venkatraman Govindaraju; Tony Nowatzki; Karthikeyan Sankaralingam

Accelerators and specialization in various forms are emerging as a way to increase processor performance. Examples include Navigo, Conservation-Cores, BERET, and DySER. While each of these employs different primitives and principles to achieve specialization, they share some common concerns with regard to implementation. Two of these concerns are: how to integrate them with a commercial processor and how to develop their compiler toolchain. This paper undertakes an implementation study of one design point: integration of DySER into OpenSPARC, a design we call OpenSPlySER. We report on our implementation exercise and quantitative results, and conclude with a set of lessons learned. We demonstrate that DySER delivers on its goal of providing a non-intrusive accelerator design. OpenSPlySER runs on a Virtex-5 FPGA, boots unmodified Linux, and runs most of the SPECINT benchmarks with our compiler. Due to physical design constraints, speedups on full benchmarks are modest for the FPGA prototype. On targeted microbenchmarks, OpenSPlySER delivers up to a 31-fold speedup over the baseline OpenSPARC. We conclude with some lessons learned from this somewhat unique exercise of significantly modifying a commercial processor. To the best of our knowledge, this work is one of the most ambitious extensions of OpenSPARC.


ACM Transactions on Architecture and Code Optimization | 2015

Enabling GPGPU Low-Level Hardware Explorations with MIAOW: An Open-Source RTL Implementation of a GPGPU

Raghuraman Balasubramanian; Vinay Gangadhar; Ziliang Guo; Chen-Han Ho; Cherin Joseph; Jaikrishnan Menon; Mario Paulo Drumond; Robin Paul; Sharath Prasad; Pradip Valathol; Karthikeyan Sankaralingam

Graphics Processing Unit (GPU) based general-purpose computing is developing as a viable alternative to CPU-based computing in many domains. In this paper, we introduce MIAOW (Many-core Integrated Accelerator Of Wisconsin), an open-source RTL implementation of the AMD Southern Islands GPGPU ISA, capable of running unmodified OpenCL-based applications. We present our design, motivated by our goals to create a realistic, flexible, OpenCL-compatible GPGPU capable of emulating a full system. We demonstrate that MIAOW enables disruptive and transformative research and has the potential to bring all of the benefits of open-source development to GPUs in real products in the long term.


International Symposium on Computer Architecture (ISCA) | 2015

Efficient execution of memory access phases using dataflow specialization

Chen-Han Ho; Sung Jin Kim; Karthikeyan Sankaralingam

This paper identifies a new opportunity for improving the efficiency of a processor core: memory access phases of programs. These are dynamic regions of programs where most of the instructions are devoted to memory access or address computation. Such phases occur naturally in programs because of workload properties, or are induced when an in-core accelerator is employed and the code remaining on the core is access code. We observe that such code requires an OOO core's dataflow and dynamism to run fast and does not execute well on an in-order processor. However, an OOO core consumes significant power, effectively increasing energy consumption and reducing the energy efficiency of in-core accelerators. We develop an execution model called memory access dataflow (MAD) that encodes dataflow computation, event-condition-action rules, and explicit actions. Using it, we build a specialized engine that provides an OOO core's performance at a fraction of the power. Such an engine can serve as a general way for any accelerator to execute its respective induced phase, thus providing a common interface and implementation for current and future accelerators. We have designed and implemented MAD in RTL, and we demonstrate its generality and flexibility through integration with four diverse accelerators (SSE, DySER, NPU, and C-Cores). Our quantitative results show that, relative to in-order, 2-wide OOO, and 4-wide OOO cores, MAD provides 2.4×, 1.4×, and equivalent performance, respectively, while consuming 0.8×, 0.6×, and 0.4× the energy.
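As a rough software analogy (our own assumptions, not the paper's hardware encoding), the event-condition-action idea can be pictured as a rule that re-fires each time a load completes: the completed load is the event, the rule's condition decides whether to continue, and the action consumes the value and issues the next access. The names and structure below are hypothetical.

    /* A software analogy of MAD's event-condition-action rules: an
     * access phase walks a linked structure, and each completed "load"
     * is an event whose rule fires the next address computation and
     * load, with no instruction window or reorder buffer involved. */
    #include <stdio.h>
    #include <stddef.h>

    struct node { int val; struct node *next; };

    typedef struct {
        struct node *arrived;                        /* event payload: the load that completed */
        int (*cond)(struct node *);                  /* condition: should the rule fire?       */
        struct node *(*act)(struct node *, int *);   /* action: consume value, issue next load */
    } eca_rule;

    static int not_null(struct node *n) { return n != NULL; }

    static struct node *consume_and_follow(struct node *n, int *sum) {
        *sum += n->val;                              /* dataflow computation on the loaded value   */
        return n->next;                              /* "issue" the next load by following the ptr */
    }

    int main(void) {
        struct node c = {3, NULL}, b = {2, &c}, a = {1, &b};
        int sum = 0;
        eca_rule rule = { &a, not_null, consume_and_follow };
        while (rule.cond(rule.arrived))              /* each completed load re-triggers the rule */
            rule.arrived = rule.act(rule.arrived, &sum);
        printf("sum = %d\n", sum);                   /* prints 6 */
        return 0;
    }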


International Conference on Parallel Processing (ICPP) | 2012

Mechanisms and Evaluation of Cross-Layer Fault-Tolerance for Supercomputing

Chen-Han Ho; Marc de Kruijf; Karthikeyan Sankaralingam; Barry Rountree; Martin Schulz; Bronis R. de Supinski

Reliability is emerging as an important constraint for future microprocessors. Cooperative hardware and software approaches for error tolerance can solve this hardware reliability challenge. Cross-layer fault tolerance frameworks expose hardware failures to upper layers, such as the compiler, to help correct faults. Such cooperative approaches require less hardware complexity than masking all faults at the hardware level and are generally more energy efficient. This paper provides a detailed design and an implementation study of cross-layer fault tolerance for supercomputing. Since supercomputers necessarily involve large component counts, they fail more frequently than consumer electronics and small systems. Conventionally, these systems use redundancy and checkpointing to achieve reliable computing. However, redundancy increases acquisition costs as well as recurring energy costs. This paper describes a simple language-level mechanism coupled with complementary compilation and lightweight hardware error detection that provides efficient reliability and cross-layer fault tolerance for supercomputers. Our evaluation focuses on strong scaling problems for which we can trade computing power for redundancy. Our results show speedups of 1.07× to 2.5× when employing cross-layer error tolerance compared to conventional full dual modular redundancy (DMR) that contains all errors within hardware. Further, we demonstrate that the approach can sustain 7% to 50% lower energy consumption. The most important result of this work is qualitative: we can use a simplified hardware design with relaxed architectural correctness guarantees.
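A conceptual sketch of the cross-layer division of labor, under our own assumptions rather than the paper's actual language-level construct: hardware only detects a fault (modeled here by a hypothetical fault_detected() flag), and software recovers by re-executing a side-effect-free region, avoiding the cost of full dual modular redundancy.

    /* Conceptual sketch only: software re-executes a protected region
     * when lightweight hardware detection reports a fault, instead of
     * running every computation twice (full DMR). All names are
     * hypothetical and not taken from the paper. */
    #include <stdio.h>

    /* Stand-in for a lightweight hardware fault-detection signal. */
    static int fault_detected(void) {
        static int injected = 1;         /* pretend one transient fault occurs */
        return injected-- > 0;
    }

    /* Protected region: reads inputs, writes only its own outputs,
     * so re-executing it after a detected fault is safe. */
    static void stencil(const double *in, double *out, int n) {
        for (int i = 1; i < n - 1; i++)
            out[i] = 0.5 * (in[i - 1] + in[i + 1]);
    }

    int main(void) {
        double in[8] = {1, 2, 3, 4, 5, 6, 7, 8}, out[8] = {0};
        do {
            stencil(in, out, 8);         /* run the region once                        */
        } while (fault_detected());      /* re-execute only if hardware flags an error */
        printf("out[3] = %f\n", out[3]); /* prints 4.0 */
        return 0;
    }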


IEEE Symposium on Low-Power and High-Speed Chips (COOL CHIPS XVIII) | 2015

MIAOW - An open source RTL implementation of a GPGPU

Raghuraman Balasubramanian; Vinay Gangadhar; Ziliang Guo; Chen-Han Ho; Cherin Joseph; Jaikrishnan Menon; Mario Paulo Drumond; Robin Paul; Sharath Prasad; Pradip Valathol; Karthikeyan Sankaralingam

Graphics Processing Unit (GPU) based general-purpose computing is developing as a viable alternative to CPU-based computing in many domains. In this paper, we introduce MIAOW (Many-core Integrated Accelerator Of Wisconsin), an open-source RTL implementation of the AMD Southern Islands GPGPU ISA, capable of running unmodified OpenCL-based applications. We present our design, motivated by our goals to create a realistic, flexible, OpenCL-compatible GPGPU capable of emulating a full system. We demonstrate that MIAOW enables disruptive and transformative research and has the potential to bring all of the benefits of open-source development to GPUs in real products in the long term.


IEEE Hot Chips Symposium | 2012

Prototyping the DySER specialization architecture with OpenSPARC

Jesse Benson; Ryan Cofell; Chris Frericks; Venkatraman Govindaraju; Chen-Han Ho; Zachary Marzec; Tony Nowatzki; Karu Sankaralingam

This paper describes the prototype implementation of the DySER specialization architecture integrated into the OpenSPARC processor. The paper covers the hardware, the compiler, and application tuning. The prototype system provides speedups of up to 14× over OpenSPARC (geometric mean 5×). The architecture is more flexible than SIMD and GPU-based acceleration while supporting a more diverse set of workloads.


IEEE Micro | 2016

Accelerating the Accelerator Memory Interface with Access-Execute and Dataflow

Chen-Han Ho; Sung Jin Kim; Karthikeyan Sankaralingam

The Memory Access Dataflow execution model and hardware architecture combine principles of decoupled access-execute, dataflow computation, and event-condition-action rules to rebuild the main primitives of an out-of-order core in a power-efficient way. It targets the memory access phases that occur naturally in programs or are induced when work is offloaded to an accelerator. Such a mechanism allows in-core accelerators to integrate with high- or low-performance cores without compromising performance, lets the core be powered down during such phases to run at low power, and provides high energy savings.
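The decoupled access-execute principle can be sketched in plain C (an illustration under our own assumptions, not the paper's hardware): an access slice performs all address computation and loads into a queue, and an execute slice then consumes the queued operands without doing any address arithmetic, which is roughly the split an access engine would exploit while the main core idles.

    /* Minimal software sketch of decoupled access-execute. The access
     * slice gathers operands through indirect addressing into a queue;
     * the execute slice is pure computation on the queued values. */
    #include <stdio.h>

    #define N 8

    int main(void) {
        int data[N] = {1, 2, 3, 4, 5, 6, 7, 8};
        int idx[N]  = {7, 6, 5, 4, 3, 2, 1, 0};
        int queue[N];

        /* Access slice: address computation and loads only. */
        for (int i = 0; i < N; i++)
            queue[i] = data[idx[i]];

        /* Execute slice: consumes queued operands, no address math. */
        int sum = 0;
        for (int i = 0; i < N; i++)
            sum += queue[i] * queue[i];

        printf("sum of squares = %d\n", sum);   /* prints 204 */
        return 0;
    }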


International Symposium on Computer Architecture (ISCA) | 2011

Sampling + DMR: practical and low-overhead permanent fault detection

Shuou Nomura; Matthew D. Sinclair; Chen-Han Ho; Venkatraman Govindaraju; Marc de Kruijf; Karthikeyan Sankaralingam

Collaboration


Dive into Chen-Han Ho's collaborations.

Top Co-Authors

Venkatraman Govindaraju, University of Wisconsin-Madison
Tony Nowatzki, University of Wisconsin-Madison
Jaikrishnan Menon, University of Wisconsin-Madison
Marc de Kruijf, University of Wisconsin-Madison
Cherin Joseph, University of Wisconsin-Madison
Chris Frericks, University of Wisconsin-Madison
Jesse Benson, University of Wisconsin-Madison
Mario Paulo Drumond, University of Wisconsin-Madison
Pradip Valathol, University of Wisconsin-Madison