Gai Liu | Researchain

Publication

Featured research published by Gai Liu.

international conference on computer aided design | 2015

ElasticFlow: A Complexity-Effective Approach for Pipelining Irregular Loop Nests

Mingxing Tan; Gai Liu; Ritchie Zhao; Steve Dai; Zhiru Zhang

Modern high-level synthesis (HLS) tools commonly employ pipelining to achieve efficient loop acceleration by overlapping the execution of successive loop iterations. However, existing HLS techniques provide inadequate support for pipelining irregular loop nests that contain dynamic-bound inner loops, where unrolling is either very expensive or not even applicable. To overcome this major limitation, we propose ElasticFlow, a novel architectural synthesis approach capable of dynamically distributing inner loops to an array of loop processing units (LPUs) in a complexity-effective manner. These LPUs can be either specialized to execute an individual loop or shared amongst multiple inner loops for area reduction. We evaluate ElasticFlow using a variety of real-life applications and demonstrate significant performance improvements over a widely used commercial HLS tool for Xilinx FPGAs.

international symposium on microarchitecture | 2014

Architectural Specialization for Inter-Iteration Loop Dependence Patterns

Shreesha Srinath; Berkin Ilbeyi; Mingxing Tan; Gai Liu; Zhiru Zhang; Christopher Batten

Hardware specialization is an increasingly common technique to enable improved performance and energy efficiency in spite of the diminished benefits of technology scaling. This paper proposes a new approach called explicit loop specialization (XLOOPS) based on the idea of elegantly encoding inter-iteration loop dependence patterns in the instruction set. XLOOPS supports a variety of inter-iteration data- and control-dependence patterns for both single and nested loops. The XLOOPS hardware/software abstraction requires only lightweight changes to a general-purpose compiler to generate XLOOPS binaries and enables executing these binaries on: (1) traditional microarchitectures with minimal performance impact, (2) specialized microarchitectures to improve performance and/or energy efficiency, and (3) adaptive microarchitectures that can seamlessly migrate loops between traditional and specialized execution to dynamically trade off performance vs. energy efficiency. We evaluate XLOOPS using a vertically integrated research methodology and show compelling performance and energy efficiency improvements compared to both simple and complex general-purpose processors.

international symposium on low power electronics and design | 2014

CASA: correlation-aware speculative adders

Gai Liu; Ye Tao; Mingxing Tan; Zhiru Zhang

Speculative adders divide addition into subgroups and execute them in parallel for higher execution speed and energy efficiency, but at the risk of generating incorrect results. In this paper, we propose a lightweight correlation-aware speculative addition (CASA) method, which exploits the correlation between input data and carry-in values observed in real-life benchmarks to improve the accuracy of speculative adders. Experimental results show that applying the CASA method leads to a significant reduction in error rate with only marginal overhead in timing, area, and power consumption.
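The basic idea of a speculative adder can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the CASA method itself: each block simply assumes a fixed carry-in (`guess_carry`, a hypothetical parameter), whereas CASA chooses its speculation from the correlation between input data and carry-in values. The function names and the 16-bit/4-bit sizes are illustrative only.

```python
def speculative_add(a, b, width=16, block=4, guess_carry=0):
    """Add two unsigned `width`-bit integers block by block.

    Every block except the lowest assumes `guess_carry` as its carry-in
    instead of waiting for the real carry, so all blocks could be computed
    in parallel in hardware. The result may therefore be wrong.
    """
    assert width % block == 0, "width must be a multiple of block"
    mask = (1 << block) - 1
    result = 0
    for shift in range(0, width, block):
        carry_in = 0 if shift == 0 else guess_carry  # speculated, not computed
        x = (a >> shift) & mask
        y = (b >> shift) & mask
        result |= ((x + y + carry_in) & mask) << shift
    return result


def exact_add(a, b, width=16):
    """Reference result: ordinary addition truncated to `width` bits."""
    return (a + b) & ((1 << width) - 1)


def error_rate(pairs, width=16, block=4, guess_carry=0):
    """Fraction of operand pairs for which the speculation is wrong."""
    wrong = sum(
        speculative_add(a, b, width, block, guess_carry) != exact_add(a, b, width)
        for a, b in pairs
    )
    return wrong / len(pairs)
```

For instance, `speculative_add(0x000F, 0x0001)` loses the carry out of the lowest block and so differs from `exact_add(0x000F, 0x0001)`; a better carry-in guess is what reduces the error rate measured by `error_rate`.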

field programmable gate arrays | 2017

Dynamic Hazard Resolution for Pipelining Irregular Loops in High-Level Synthesis

Steve Dai; Ritchie Zhao; Gai Liu; Shreesha Srinath; Udit Gupta; Christopher Batten; Zhiru Zhang

The current pipelining approach in high-level synthesis (HLS) achieves high performance for applications with regular and statically analyzable memory access patterns. However, it cannot effectively handle infrequent data-dependent structural and data hazards because they are conservatively assumed to always occur in the synthesized pipeline. To enable high-throughput pipelining of irregular loops, we study the problem of augmenting HLS with application-specific dynamic hazard resolution, and examine its implications on scheduling and quality of results. We propose to generate an aggressive pipeline at compile-time while resolving hazards with memory port arbitration and squash-and-replay at run-time. Our experiments targeting a Xilinx FPGA demonstrate promising performance improvement across a suite of representative benchmarks.

design automation conference | 2016

Improving high-level synthesis with decoupled data structure optimization

Ritchie Zhao; Gai Liu; Shreesha Srinath; Christopher Batten; Zhiru Zhang

Existing high-level synthesis (HLS) tools are mostly effective on algorithm-dominated programs that only use primitive data structures such as fixed size arrays and queues. However, many widely used data structures such as priority queues, heaps, and trees feature complex member methods with data-dependent work and irregular memory access patterns. These methods can be inlined to their call sites, but this does not address the aforementioned issues and may further complicate conventional HLS optimizations, resulting in a low-performance hardware implementation. To overcome this deficiency, we propose a novel HLS architectural template in which complex data structures are decoupled from the algorithm using a latency-insensitive interface. This enables overlapped execution of the algorithm and data structure methods, as well as parallel and out-of-order execution of independent methods on multiple decoupled lanes. Experimental results across a variety of real-life benchmarks show our approach is capable of achieving very promising speedups without causing significant area overhead.

field programmable gate arrays | 2017

A Parallel Bandit-Based Approach for Autotuning FPGA Compilation

Chang Xu; Gai Liu; Ritchie Zhao; Stephen Yang; Guojie Luo; Zhiru Zhang

Mainstream FPGA CAD tools provide an extensive collection of optimization options that have a significant impact on the quality of the final design. These options together create an enormous and complex design space that cannot effectively be explored by human effort alone. Instead, we propose to search this parameter space using autotuning, which is a popular approach in the compiler optimization domain. Specifically, we study the effectiveness of applying the multi-armed bandit (MAB) technique to automatically tune the options for a complete FPGA compilation flow from RTL to bitstream, including RTL/logic synthesis, technology mapping, placement, and routing. To mitigate the high runtime cost incurred by the complex FPGA implementation process, we devise an efficient parallelization scheme that enables multiple MAB-based autotuners to explore the design space simultaneously. In particular, we propose a dynamic solution space partitioning and resource allocation technique that intelligently allocates computing resources to promising search regions based on the runtime information of search quality from previous iterations. Experiments on academic and commercial FPGA CAD tools demonstrate promising improvements in quality and convergence rate across a variety of real-life designs.
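As a rough illustration of the multi-armed bandit idea, the sketch below implements the textbook UCB1 policy in Python. It is not the paper's parallel autotuner: here each arm stands for a hypothetical tool-option configuration, and `reward_fn` is an assumed callback that would run the flow and return a normalized quality score in [0, 1]. The names `ucb1`, `reward_fn`, and the exploration constant `c` are illustrative choices.

```python
import math


def ucb1(reward_fn, n_arms, rounds, c=2.0):
    """Run the UCB1 bandit policy for `rounds` pulls.

    reward_fn(arm) -> float in [0, 1]; assumed to evaluate one candidate
    configuration (for example, one set of CAD tool options).
    Returns (counts, totals): pulls and summed reward per arm.
    """
    counts = [0] * n_arms
    totals = [0.0] * n_arms
    for t in range(1, rounds + 1):
        if t <= n_arms:
            arm = t - 1  # try every arm once before using the index
        else:
            # Pick the arm with the best mean reward plus exploration bonus.
            arm = max(
                range(n_arms),
                key=lambda i: totals[i] / counts[i]
                + math.sqrt(c * math.log(t) / counts[i]),
            )
        counts[arm] += 1
        totals[arm] += reward_fn(arm)
    return counts, totals


# Toy usage with fixed, made-up rewards per configuration.
toy_quality = [0.2, 0.5, 0.9]
counts, totals = ucb1(lambda arm: toy_quality[arm], n_arms=3, rounds=500)
best_arm = counts.index(max(counts))
```

In this toy run most of the 500 evaluations go to the arm with the highest reward, while the weaker arms are still sampled occasionally. A real flow would replace the lambda with a call that launches the tool and scores the result.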

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2017

Architecture and Synthesis for Area-Efficient Pipelining of Irregular Loop Nests

Gai Liu; Mingxing Tan; Steve Dai; Ritchie Zhao; Zhiru Zhang

Modern high-level synthesis (HLS) tools commonly employ pipelining to achieve efficient loop acceleration by overlapping the execution of successive loop iterations. While existing HLS pipelining techniques obtain good performance with low complexity for regular loop nests, they provide inadequate support for effectively synthesizing irregular loop nests. For loop nests with dynamic-bound inner loops, current pipelining techniques require unrolling of the inner loops, which is either very expensive in resources or even inapplicable due to dynamic loop bounds. To address this major limitation, this paper proposes ElasticFlow, a novel architecture capable of dynamically distributing inner loops to an array of loop processing units (LPUs) in an area-efficient manner. The proposed LPUs can be either specialized to execute an individual inner loop or shared among multiple inner loops to balance the tradeoff between performance and area. A customized banked memory architecture is proposed to coordinate memory accesses among different LPUs to maximize memory bandwidth without significantly increasing memory footprint. We evaluate ElasticFlow using a variety of real-life applications and demonstrate significant performance improvements over a state-of-the-art commercial HLS tool for Xilinx FPGAs.

design automation conference | 2015

A reconfigurable analog substrate for highly efficient maximum flow computation

Gai Liu; Zhiru Zhang

We present the design and analysis of a novel analog reconfigurable substrate that enables fast and efficient computation of maximum flow on directed graphs. The substrate is composed of memristors and standard analog circuit components, where the on/off states of the crossbar switches encode the graph topology. We show that upon convergence, the steady-state voltages in the circuit capture the solution to the maximum flow problem. We also provide techniques to minimize the impacts of variability and non-ideal circuit components on the solution quality, enabling practical implementation of the proposed substrate. Performance evaluation demonstrates orders of magnitude improvements in speed and energy efficiency compared to a standard CPU implementation.

field programmable gate arrays | 2018

Rosetta: A Realistic High-Level Synthesis Benchmark Suite for Software Programmable FPGAs

Yuan Zhou; Udit Gupta; Steve Dai; Ritchie Zhao; Nitish Kumar Srivastava; Hanchen Jin; Joseph Featherston; Yi-Hsiang Lai; Gai Liu; Gustavo Velásquez; Wenping Wang; Zhiru Zhang

Modern high-level synthesis (HLS) tools greatly reduce the turn-around time of designing and implementing complex FPGA-based accelerators. They also expose various optimization opportunities, which cannot be easily explored at the register-transfer level. With the increasing adoption of the HLS design methodology and continued advances of synthesis optimization, there is a growing need for realistic benchmarks to (1) facilitate comparisons between tools, (2) evaluate and stress-test new synthesis techniques, and (3) establish meaningful performance baselines to track progress of the HLS technology. While several HLS benchmark suites already exist, they are primarily comprised of small textbook-style function kernels, instead of complete and complex applications. To address this limitation, we introduce Rosetta, a realistic benchmark suite for software programmable FPGAs. Designs in Rosetta are fully-developed applications. They are associated with realistic performance constraints, and optimized with advanced features of modern HLS tools. We believe that Rosetta is not only useful for the HLS research community, but can also serve as a set of design tutorials for non-expert HLS users. In this paper we describe the characteristics of our benchmarks and the optimization techniques applied to them. We further report experimental results on an embedded FPGA device as well as a cloud FPGA platform.

field programmable gate arrays | 2018

DATuner: An Extensible Distributed Autotuning Framework for FPGA Design and Design Automation: (Abstract Only)

Gai Liu; Ecenur Ustun; Shaojie Xiang; Chang Xu; Guojie Luo; Zhiru Zhang

Mainstream FPGA tools contain an extensive set of user-controlled compilation options and internal optimization strategies that significantly impact the design quality. These compilation and optimization parameters create a complex design space that human designers may not be able to effectively explore in a time-efficient manner. In this work we describe DATuner, an open-source extensible distributed autotuning framework for optimizing FPGA designs and design automation tools using an ensemble of search techniques managed by multi-armed bandit algorithms. DATuner is designed for a distributed environment that uses parallel searches to amortize the significant runtime overhead of the CAD tools. DATuner provides a convenient interface for extension to user-supplied tools, which enables end users to apply DATuner to the design tools/flows of their interest. We demonstrate the effectiveness and extensibility of DATuner using three case studies, which include clock frequency optimization for FPGA compilation, fixed-point optimization, and autotuning logic synthesis transformations.