Steve Dai | Researchain

Publication

Featured research published by Steve Dai.

International Conference on Computer Aided Design | 2015

ElasticFlow: A Complexity-Effective Approach for Pipelining Irregular Loop Nests

Mingxing Tan; Gai Liu; Ritchie Zhao; Steve Dai; Zhiru Zhang

Modern high-level synthesis (HLS) tools commonly employ pipelining to achieve efficient loop acceleration by overlapping the execution of successive loop iterations. However, existing HLS techniques provide inadequate support for pipelining irregular loop nests that contain dynamic-bound inner loops, where unrolling is either very expensive or not even applicable. To overcome this major limitation, we propose ElasticFlow, a novel architectural synthesis approach capable of dynamically distributing inner loops to an array of loop processing units (LPUs) in a complexity-effective manner. These LPUs can be either specialized to execute an individual loop or shared amongst multiple inner loops for area reduction. We evaluate ElasticFlow using a variety of real-life applications and demonstrate significant performance improvements over a widely used commercial HLS tool for Xilinx FPGAs.

International Conference on Computer Aided Design | 2014

Multithreaded pipeline synthesis for data-parallel kernels

Mingxing Tan; Bin Liu; Steve Dai; Zhiru Zhang

Pipelining is an important technique in high-level synthesis, which overlaps the execution of successive loop iterations or threads to achieve high throughput for loop/function kernels. Since existing pipelining techniques typically enforce in-order thread execution, a variable-latency operation in one thread would block all subsequent threads, resulting in considerable performance degradation. In this paper, we propose a multithreaded pipelining approach that enables context switching to allow out-of-order thread execution for data-parallel kernels. To ensure that the synthesized pipeline is complexity effective, we further propose efficient scheduling algorithms for minimizing the hardware overhead associated with context management. Experimental results show that our proposed techniques can significantly improve the effective pipeline throughput over conventional approaches while conserving hardware resources.

Design Automation Conference | 2014

Flushing-Enabled Loop Pipelining for High-Level Synthesis

Steve Dai; Mingxing Tan; Kecheng Hao; Zhiru Zhang

Loop pipelining is a widely accepted technique in high-level synthesis to enable pipelined execution of successive loop iterations to achieve high performance. Existing loop pipelining methods provide inadequate support for pipeline flushing. In this paper, we study the problem of enabling flushing in pipeline synthesis and examine its implications in scheduling and binding. We propose novel techniques for synthesizing a conflict-aware flushing-enabled pipeline that is robust against potential resource collisions. Experiments with real-life benchmarks show that our methods significantly reduce the possibility of resource collisions compared to conventional approaches while conserving hardware resources and achieving near-optimal performance.

IPSJ Transactions on System LSI Design Methodology | 2015

High-level Synthesis for Low-power Design

Zhiru Zhang; Deming Chen; Steve Dai; Keith A. Campbell

Power and energy efficiency have emerged as first-order design constraints across the computing spectrum from handheld devices to warehouse-sized datacenters. As the number of transistors continues to scale, effectively managing design complexity under stringent power constraints has become an imminent challenge for the IC industry. The manual process of power optimization in RTL design has become increasingly difficult, if not already unsustainable. Complexity scaling dictates that this process must be automated with robust analysis and synthesis algorithms at a higher level of abstraction. Along this line, high-level synthesis (HLS) is a promising technology to improve design productivity and enable new opportunities for power optimization for higher design quality. By allowing early access to the system architecture, high-level decisions during HLS can have a significant impact on the power and energy efficiency of the synthesized design. In this paper, we discuss the recent research development of using HLS to effectively explore a multi-dimensional design space and derive low-power implementations. We provide in-depth coverage of HLS low-power optimization techniques and synthesis algorithms proposed in the last decade. We also describe the key power optimization challenges facing HLS today and outline potential opportunities in tackling these challenges.

Field Programmable Gate Arrays | 2015

Mapping-Aware Constrained Scheduling for LUT-Based FPGAs

Mingxing Tan; Steve Dai; Udit Gupta; Zhiru Zhang

Scheduling plays a central role in high-level synthesis, as it inserts clock boundaries into the untimed behavioral model and greatly impacts the performance, power, and area of the synthesized circuits. While current scheduling techniques can make use of pre-characterized delay values of individual operations, it is difficult to obtain accurate timing estimation on a cluster of operations without considering technology mapping. This limitation is particularly pronounced for FPGAs, where a large logic network can be mapped to only a few levels of look-up tables (LUTs). In this paper, we propose MAPS, a mapping-aware constrained scheduling algorithm for LUT-based FPGAs. Instead of simply summing up the estimated delay values of individual operations, MAPS jointly performs technology mapping and scheduling, creating the opportunity for more aggressive operation chaining to minimize latency and reduce area. We show that MAPS can produce a latency-optimal solution, while supporting a variety of design timing requirements expressed in a system of difference constraints. We also present an efficient incremental scheduling technique for MAPS to effectively handle resource constraints. Experimental results with real-life benchmarks demonstrate that our proposed algorithm achieves very promising improvements in performance and resource usage when compared to a state-of-the-art commercial high-level synthesis tool targeting Xilinx FPGAs.
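
As a rough illustration of what "a system of difference constraints" means for scheduling in general (this is not the MAPS algorithm, and it ignores technology mapping entirely), the sketch below computes an as-soon-as-possible schedule by Bellman-Ford-style relaxation. The function name `asap_schedule`, the four-operation example, and its latencies are hypothetical and chosen only for illustration.

```python
def asap_schedule(num_ops, constraints):
    """Earliest start cycle for each operation under difference constraints.

    Each constraint (u, v, c) means start[v] - start[u] >= c.
    A dependence "v runs at least c cycles after u" is (u, v, c).
    A maximum-latency bound "v starts at most d cycles after u" is (v, u, -d).
    """
    start = [0] * num_ops  # every operation may start at cycle 0 by default
    for _ in range(num_ops):
        changed = False
        for u, v, c in constraints:
            if start[u] + c > start[v]:
                start[v] = start[u] + c  # push v later to satisfy the constraint
                changed = True
        if not changed:
            return start
    # Still changing after num_ops passes: the constraints form a positive cycle.
    raise ValueError("infeasible: constraints cannot all be satisfied")


# Hypothetical example: 0 = load, 1 = multiply, 2 = add, 3 = store.
deps = [(0, 1, 1), (1, 2, 2), (0, 2, 1), (2, 3, 1)]
schedule = asap_schedule(4, deps)
print(schedule)  # [0, 1, 3, 4]
```

Adding the bound `(3, 0, -3)`, meaning the store must start within 3 cycles of the load, makes this example infeasible, and the function raises `ValueError`.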

Design Automation Conference | 2015

Area-efficient pipelining for FPGA-targeted high-level synthesis

Ritchie Zhao; Mingxing Tan; Steve Dai; Zhiru Zhang

Traditional techniques for pipeline scheduling in high-level synthesis for FPGAs assume an additive delay model where each operation incurs a pre-characterized delay. While a good approximation for some operation types, this fails to consider technology mapping, where a group of logic operations can be mapped to a single look-up table (LUT) and together incur one LUT worth of delay. We propose an exact formulation of the throughput-constrained, mapping-aware pipeline scheduling problem for FPGA-targeted high-level synthesis with area minimization being a primary objective. By taking this cross-layered approach, our technique is able to mitigate the pessimism inherent in static delay estimates and reduce the usage of LUTs and pipeline registers. Experimental results using our method demonstrate improved resource utilization for a number of logic-intensive, real-life benchmarks compared to a state-of-the-art commercial HLS tool for Xilinx FPGAs.
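
To make the gap between an additive delay model and a mapping-aware one concrete, here is a minimal sketch, not taken from the paper, that compares the logic depth of a wide AND reduction under two assumptions: every 2-input gate is charged one LUT delay (the additive view), versus every K-input LUT absorbing any function of up to K inputs (the mapped view). The function names and the choice of K = 6 are assumptions for illustration.

```python
def additive_levels(num_inputs):
    """Depth of a balanced tree of 2-input gates, one delay unit per gate."""
    levels, width = 0, num_inputs
    while width > 1:
        width = (width + 1) // 2  # each gate merges two signals into one
        levels += 1
    return levels


def mapped_levels(num_inputs, k=6):
    """Depth when each K-input LUT implements any function of up to K inputs."""
    levels, width = 0, num_inputs
    while width > 1:
        width = (width + k - 1) // k  # each LUT merges up to k signals into one
        levels += 1
    return levels


for n in (6, 8, 32):
    print(n, additive_levels(n), mapped_levels(n))
# 6 3 1
# 8 3 2
# 32 5 2
```

Under these assumptions, a 6-input reduction that the additive model charges three delay units fits in a single LUT, which is the kind of pessimism a mapping-aware scheduler can exploit when deciding where pipeline registers go.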

Field Programmable Gate Arrays | 2017

Dynamic Hazard Resolution for Pipelining Irregular Loops in High-Level Synthesis

Steve Dai; Ritchie Zhao; Gai Liu; Shreesha Srinath; Udit Gupta; Christopher Batten; Zhiru Zhang

The current pipelining approach in high-level synthesis (HLS) achieves high performance for applications with regular and statically analyzable memory access patterns. However, it cannot effectively handle infrequent data-dependent structural and data hazards because they are conservatively assumed to always occur in the synthesized pipeline. To enable high-throughput pipelining of irregular loops, we study the problem of augmenting HLS with application-specific dynamic hazard resolution, and examine its implications on scheduling and quality of results. We propose to generate an aggressive pipeline at compile-time while resolving hazards with memory port arbitration and squash-and-replay at run-time. Our experiments targeting a Xilinx FPGA demonstrate promising performance improvement across a suite of representative benchmarks.

Field Programmable Gate Arrays | 2017

Accelerating Face Detection on Programmable SoC Using C-Based Synthesis

Nitish Kumar Srivastava; Steve Dai; Rajit Manohar; Zhiru Zhang

High-level synthesis (HLS) enables designing at a higher level of abstraction to effectively cope with the design complexity of emerging applications on modern programmable system-on-chip (SoC). While HLS continues to evolve with a growing set of algorithms, methodologies, and tools to efficiently map software designs onto optimized hardware architectures, there continues to be a lack of realistic benchmark applications with sufficient complexity and enforceable constraints. In this paper we present a case study of accelerating face detection based on the Viola-Jones algorithm on a programmable SoC using a C-based HLS flow. We also share our insights in porting a software-based design into a synthesizable implementation with HLS-specific data structures and optimizations. Our design is able to achieve a frame rate of 30 frames per second, which is suitable for real-time applications. Our performance and quality of results are comparable to those of many traditional RTL implementations.

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2017

Architecture and Synthesis for Area-Efficient Pipelining of Irregular Loop Nests

Gai Liu; Mingxing Tan; Steve Dai; Ritchie Zhao; Zhiru Zhang

Modern high-level synthesis (HLS) tools commonly employ pipelining to achieve efficient loop acceleration by overlapping the execution of successive loop iterations. While existing HLS pipelining techniques obtain good performance with low complexity for regular loop nests, they provide inadequate support for effectively synthesizing irregular loop nests. For loop nests with dynamic-bound inner loops, current pipelining techniques require unrolling of the inner loops, which is either very expensive in resources or even inapplicable due to dynamic loop bounds. To address this major limitation, this paper proposes ElasticFlow, a novel architecture capable of dynamically distributing inner loops to an array of loop processing units (LPUs) in an area-efficient manner. The proposed LPUs can be either specialized to execute an individual inner loop or shared among multiple inner loops to balance the tradeoff between performance and area. A customized banked memory architecture is proposed to coordinate memory accesses among different LPUs to maximize memory bandwidth without significantly increasing memory footprint. We evaluate ElasticFlow using a variety of real-life applications and demonstrate significant performance improvements over a state-of-the-art commercial HLS tool for Xilinx FPGAs.

Proceedings of SPIE | 2013

Design, simulation, and evaluation of imaging oximeters

Steve Dai; Ye Tian; Joyce E. Farrell

Computer simulations have played an important role in the design and evaluation of imaging sensors with applications in remote sensing and consumer photography. In this paper, we provide an example of computer simulations used to guide the design of imaging sensors for a biomedical application: we consider how sensor design, illumination, measurement geometry, and skin type influence the ability to detect blood oxygen saturation from non-invasive measurements of skin reflectance. The methodology we describe in this paper can be used to design, simulate, and evaluate other biomedical imaging systems.
