Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Mingxing Tan is active.

Publication


Featured research published by Mingxing Tan.


field programmable gate arrays | 2010

Bit-level optimization for high-level synthesis and FPGA-based acceleration

Jiyu Zhang; Zhiru Zhang; Sheng Zhou; Mingxing Tan; Xianhua Liu; Xu Cheng; Jason Cong

Automated hardware design from behavior-level abstraction has drawn wide interest in the FPGA-based acceleration and configurable computing research fields. However, for many high-level programming languages, such as C/C++, the description of bitwise access and computation is not as direct as in hardware description languages, and high-level synthesis of algorithmic descriptions may generate suboptimal implementations for bitwise computation-intensive applications. In this paper we introduce a bit-level transformation and optimization approach to assisting high-level synthesis of algorithmic descriptions. We introduce a bit-flow graph to capture bit-value information. Analysis and optimizing transformations can be performed on this representation, and the optimized results are transformed back to the standard data-flow graphs extended with a few instructions representing bitwise access. This allows high-level synthesis tools to automatically generate circuits with higher quality. Experiments show that our algorithm can reduce slice usage by 29.8% on average for a set of real-life benchmarks on Xilinx Virtex-4 FPGAs. At the same time, the clock period is reduced by 13.6% on average, with an 11.4% latency reduction.
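The kind of code this targets can be pictured with a small, hypothetical example: in software, the bit-reversal routine below is a loop of shifts and masks, but a bit-level analysis along the lines the paper describes can recognize that every output bit is a pure rewiring of one input bit, so the synthesized circuit needs no adders or shifters at all. The function name and word size are illustrative only.

#include <cstdint>
#include <cstdio>

// Reverse the bits of a 16-bit word: bitwise computation-intensive in software,
// yet every output bit is just a copy of one input bit at the bit level.
static uint16_t bit_reverse16(uint16_t x) {
    uint16_t r = 0;
    for (int i = 0; i < 16; ++i) {
        r = static_cast<uint16_t>((r << 1) | ((x >> i) & 1u));  // move bit i to position 15 - i
    }
    return r;
}

int main() {
    std::printf("%04x\n", static_cast<unsigned>(bit_reverse16(0x00ffu)));  // prints ff00
    return 0;
}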


international conference on computer aided design | 2015

ElasticFlow: A Complexity-Effective Approach for Pipelining Irregular Loop Nests

Mingxing Tan; Gai Liu; Ritchie Zhao; Steve Dai; Zhiru Zhang

Modern high-level synthesis (HLS) tools commonly employ pipelining to achieve efficient loop acceleration by overlapping the execution of successive loop iterations. However, existing HLS techniques provide inadequate support for pipelining irregular loop nests that contain dynamic-bound inner loops, where unrolling is either very expensive or not even applicable. To overcome this major limitation, we propose ElasticFlow, a novel architectural synthesis approach capable of dynamically distributing inner loops to an array of loop processing units (LPUs) in a complexity-effective manner. These LPUs can be either specialized to execute an individual loop or shared amongst multiple inner loops for area reduction. We evaluate ElasticFlow using a variety of real-life applications and demonstrate significant performance improvements over a widely used commercial HLS tool for Xilinx FPGAs.
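As a rough illustration (our own example, not one of the paper's benchmarks), sparse matrix-vector multiplication in CSR form has exactly the shape ElasticFlow targets: a regular outer loop whose inner loop's trip count depends on runtime data, so it cannot be fully unrolled for conventional pipelining.

#include <vector>
#include <cstdio>

// Sparse matrix-vector multiply in CSR format: y = A * x.
void spmv_csr(const std::vector<int>& row_ptr, const std::vector<int>& col_idx,
              const std::vector<float>& val, const std::vector<float>& x,
              std::vector<float>& y) {
    for (size_t i = 0; i + 1 < row_ptr.size(); ++i) {          // regular outer loop over rows
        float acc = 0.0f;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {    // dynamic-bound inner loop
            acc += val[k] * x[col_idx[k]];
        }
        y[i] = acc;
    }
}

int main() {
    // 2x2 matrix [[1, 2], [0, 3]] in CSR form.
    std::vector<int> row_ptr = {0, 2, 3}, col_idx = {0, 1, 1};
    std::vector<float> val = {1, 2, 3}, x = {1, 1}, y(2);
    spmv_csr(row_ptr, col_idx, val, x, y);
    std::printf("%g %g\n", y[0], y[1]);  // 3 3
    return 0;
}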


international conference on computer aided design | 2014

Multithreaded pipeline synthesis for data-parallel kernels

Mingxing Tan; Bin Liu; Steve Dai; Zhiru Zhang

Pipelining is an important technique in high-level synthesis, which overlaps the execution of successive loop iterations or threads to achieve high throughput for loop/function kernels. Since existing pipelining techniques typically enforce in-order thread execution, a variable-latency operation in one thread would block all subsequent threads, resulting in considerable performance degradation. In this paper, we propose a multithreaded pipelining approach that enables context switching to allow out-of-order thread execution for data-parallel kernels. To ensure that the synthesized pipeline is complexity effective, we further propose efficient scheduling algorithms for minimizing the hardware overhead associated with context management. Experimental results show that our proposed techniques can significantly improve the effective pipeline throughput over conventional approaches while conserving hardware resources.
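A hypothetical data-parallel kernel of the kind described here is sketched below: each loop iteration performs a linear-probing hash lookup whose latency depends on the key, so in-order pipelining would let one long probe stall all younger iterations, whereas context switching lets independent iterations complete out of order. Names and sizes are illustrative.

#include <cstdint>
#include <cstdio>

constexpr int TABLE_SIZE = 16;

// Linear-probing lookup: latency varies with the data-dependent probe distance.
int probe(const uint32_t table[TABLE_SIZE], uint32_t key) {
    for (int step = 0; step < TABLE_SIZE; ++step) {
        int slot = static_cast<int>((key + step) % TABLE_SIZE);
        if (table[slot] == key) return slot;   // hit after a variable number of steps
    }
    return -1;                                  // miss
}

int main() {
    uint32_t table[TABLE_SIZE] = {};
    table[3] = 19;                              // key 19 hashes to slot 3 (19 % 16)
    table[5] = 3;                               // key 3 displaced from slot 3 to slot 5
    std::printf("%d %d\n", probe(table, 19), probe(table, 3));  // 3 5
    return 0;
}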


design automation conference | 2014

Flushing-Enabled Loop Pipelining for High-Level Synthesis

Steve Dai; Mingxing Tan; Kecheng Hao; Zhiru Zhang

Loop pipelining is a widely accepted technique in high-level synthesis to enable pipelined execution of successive loop iterations to achieve high performance. Existing loop pipelining methods provide inadequate support for pipeline flushing. In this paper, we study the problem of enabling flushing in pipeline synthesis and examine its implications in scheduling and binding. We propose novel techniques for synthesizing a conflict-aware flushing-enabled pipeline that is robust against potential resource collisions. Experiments with real-life benchmarks show that our methods significantly reduce the possibility of resource collisions compared to conventional approaches while conserving hardware resources and achieving near-optimal performance.
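One situation where flushing matters, sketched below with our own toy example (a software analogy only, not the paper's hardware model), is a streaming loop whose trip count depends on how much input is currently available: items already issued into the pipeline should be able to drain to the output rather than being held behind a stalled read.

#include <queue>
#include <cstdio>

// Process whatever is currently available in the input stream.
void process_available(std::queue<int>& in, std::queue<int>& out) {
    while (!in.empty()) {          // candidate pipelined loop; trip count is data-dependent
        int x = in.front();
        in.pop();
        out.push(x * x + 1);       // per-item computation occupying the pipeline
    }
}

int main() {
    std::queue<int> in, out;
    for (int v : {1, 2, 3}) in.push(v);
    process_available(in, out);     // in-flight items should drain even if no more input arrives
    while (!out.empty()) { std::printf("%d ", out.front()); out.pop(); }  // 2 5 10
    std::printf("\n");
    return 0;
}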


international symposium on microarchitecture | 2014

Architectural Specialization for Inter-Iteration Loop Dependence Patterns

Shreesha Srinath; Berkin Ilbeyi; Mingxing Tan; Gai Liu; Zhiru Zhang; Christopher Batten

Hardware specialization is an increasingly common technique to enable improved performance and energy efficiency in spite of the diminished benefits of technology scaling. This paper proposes a new approach called explicit loop specialization (XLOOPS) based on the idea of elegantly encoding inter-iteration loop dependence patterns in the instruction set. XLOOPS supports a variety of inter-iteration data- and control-dependence patterns for both single and nested loops. The XLOOPS hardware/software abstraction requires only lightweight changes to a general-purpose compiler to generate XLOOPS binaries and enables executing these binaries on: (1) traditional microarchitectures with minimal performance impact, (2) specialized microarchitectures to improve performance and/or energy efficiency, and (3) adaptive microarchitectures that can seamlessly migrate loops between traditional and specialized execution to dynamically trade off performance vs. energy efficiency. We evaluate XLOOPS using a vertically integrated research methodology and show compelling performance and energy efficiency improvements compared to both simple and complex general-purpose processors.
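The dependence patterns in question can be seen in two ordinary C++ loops (our illustration, not the paper's actual encoding): the first has no inter-iteration dependence and its iterations may execute concurrently, while the second carries a true dependence through an accumulator and must respect iteration order.

#include <cstdio>

int main() {
    int a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    int b[8];

    // No inter-iteration dependence: iterations may execute concurrently.
    for (int i = 0; i < 8; ++i) {
        b[i] = a[i] * 2;
    }

    // Inter-iteration data dependence through `acc`: each iteration reads the
    // value written by the previous one (an "ordered" dependence pattern).
    int acc = 0;
    for (int i = 0; i < 8; ++i) {
        acc += b[i];
    }

    std::printf("%d\n", acc);  // 72
    return 0;
}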


field programmable gate arrays | 2015

Mapping-Aware Constrained Scheduling for LUT-Based FPGAs

Mingxing Tan; Steve Dai; Udit Gupta; Zhiru Zhang

Scheduling plays a central role in high-level synthesis, as it inserts clock boundaries into the untimed behavioral model and greatly impacts the performance, power, and area of the synthesized circuits. While current scheduling techniques can make use of pre-characterized delay values of individual operations, it is difficult to obtain accurate timing estimation on a cluster of operations without considering technology mapping. This limitation is particularly pronounced for FPGAs where a large logic network can be mapped to only a few levels of look-up tables (LUTs). In this paper, we propose MAPS, a mapping-aware constrained scheduling algorithm for LUT-based FPGAs. Instead of simply summing up the estimated delay values of individual operations, MAPS jointly performs technology mapping and scheduling, creating the opportunity for more aggressive operation chaining to minimize latency and reduce area. We show that MAPS can produce a latency-optimal solution, while supporting a variety of design timing requirements expressed in a system of difference constraints. We also present an efficient incremental scheduling technique for MAPS to effectively handle resource constraints. Experimental results with real-life benchmarks demonstrate that our proposed algorithm achieves very promising improvements in performance and resource usage when compared to a state-of-the-art commercial high-level synthesis tool targeting Xilinx FPGAs.
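The difference-constraint formulation mentioned above can be sketched in a few lines. The toy solver below is our own simplification (MAPS additionally folds technology mapping into the constraints, which is not modeled here): each constraint s[v] - s[u] <= c becomes a weighted edge, and Bellman-Ford over the constraint graph yields a feasible schedule whenever no negative cycle exists.

#include <cstdio>
#include <vector>

struct Constraint { int u, v, c; };  // encodes s[v] - s[u] <= c

// Solve the system as a shortest-path problem: each constraint is an edge
// u -> v with weight c, with a virtual source reaching every variable at
// distance 0; the resulting distances satisfy all constraints if feasible.
std::vector<int> solve_sdc(int n, const std::vector<Constraint>& cons) {
    std::vector<int> s(n, 0);  // distances from the virtual source
    for (int iter = 0; iter < n; ++iter) {
        for (const Constraint& e : cons) {
            if (s[e.u] + e.c < s[e.v]) s[e.v] = s[e.u] + e.c;
        }
    }
    return s;
}

int main() {
    // Three chained operations: the mapping forbids chaining them in one cycle,
    // and the overall latency must stay within two cycles.
    std::vector<Constraint> cons = {
        {1, 0, -1},  // s[0] - s[1] <= -1  (op1 at least one cycle after op0)
        {2, 1, -1},  // s[1] - s[2] <= -1  (op2 at least one cycle after op1)
        {0, 2,  2},  // s[2] - s[0] <=  2  (total latency bound)
    };
    std::vector<int> s = solve_sdc(3, cons);
    int base = s[0];
    for (int v : s) if (v < base) base = v;
    for (int v : s) std::printf("%d ", v - base);  // 0 1 2
    std::printf("\n");
    return 0;
}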


design automation conference | 2015

Area-efficient pipelining for FPGA-targeted high-level synthesis

Ritchie Zhao; Mingxing Tan; Steve Dai; Zhiru Zhang

Traditional techniques for pipeline scheduling in high-level synthesis for FPGAs assume an additive delay model where each operation incurs a pre-characterized delay. While a good approximation for some operation types, this fails to consider technology mapping, where a group of logic operations can be mapped to a single look-up table (LUT) and together incur one LUT worth of delay. We propose an exact formulation of the throughput-constrained, mapping-aware pipeline scheduling problem for FPGA-targeted high-level synthesis with area minimization being a primary objective. By taking this cross-layered approach, our technique is able to mitigate the pessimism inherent in static delay estimates and reduce the usage of LUTs and pipeline registers. Experimental results using our method demonstrate improved resource utilization for a number of logic-intensive, real-life benchmarks compared to a state-of-the-art commercial HLS tool for Xilinx FPGAs.
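A small, hypothetical example of why additive delays are pessimistic: the five-input Boolean expression below looks like three levels of gates (XOR, AND, OR) if each operation is charged its own delay, yet after technology mapping it fits in a single 6-input LUT and incurs only one LUT worth of delay.

#include <cstdio>

// Five 1-bit inputs, one 1-bit output: a candidate for a single 6-input LUT.
static bool f(bool a, bool b, bool c, bool d, bool e) {
    return ((a ^ b) & c) | (d & e);
}

int main() {
    std::printf("%d\n", f(true, false, true, false, false));  // 1
    return 0;
}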


international symposium on low power electronics and design | 2014

CASA: correlation-aware speculative adders

Gai Liu; Ye Tao; Mingxing Tan; Zhiru Zhang

Speculative adders divide addition into subgroups and execute them in parallel for higher execution speed and energy efficiency, but at the risk of generating incorrect results. In this paper, we propose a lightweight correlation-aware speculative addition (CASA) method, which exploits the correlation between input data and carry-in values observed in real-life benchmarks to improve the accuracy of speculative adders. Experimental results show that applying the CASA method leads to a significant reduction in error rate with only marginal overhead in timing, area, and power consumption.
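A simplified software model of a speculative adder is shown below (our own sketch, not the circuit evaluated in the paper): the 32-bit addition is split into four 8-bit segments added in parallel, each speculating a carry-in of 0. The result is wrong exactly when a real carry crosses a segment boundary; CASA's contribution is to predict those carry-ins from input correlation rather than always assuming 0.

#include <cstdint>
#include <cstdio>

// Segment-parallel addition with all inter-segment carries speculated as 0.
uint32_t speculative_add(uint32_t a, uint32_t b) {
    uint32_t result = 0;
    for (int seg = 0; seg < 4; ++seg) {
        uint32_t mask = 0xffu << (8 * seg);
        uint32_t sum = ((a & mask) + (b & mask)) & mask;  // carry out of the segment is dropped
        result |= sum;
    }
    return result;
}

int main() {
    std::printf("%u\n", speculative_add(0x000000f0u, 0x0000000fu));  // 255: exact, no cross-segment carry
    std::printf("%u\n", speculative_add(0x000000ffu, 0x00000001u));  // 0: carry into segment 1 was mispredicted
    return 0;
}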


international conference on supercomputing | 2012

CVP: an energy-efficient indirect branch prediction with compiler-guided value pattern

Mingxing Tan; Xianhua Liu; Tong Tong; Xu Cheng

Indirect branch prediction is becoming increasingly important in modern high-performance processors. However, previous indirect branch predictors either require a significant amount of hardware storage and complexity, or heavily rely on expensive manual profiling. In this paper, we propose Compiler-Guided Value Pattern (CVP) prediction, an energy-efficient and accurate indirect branch prediction via compiler-microarchitecture cooperation. The key idea of CVP prediction is to use the compiler-guided value pattern as the correlated information to hint the dynamic predictor. The value pattern reflects the regularity of the value correlation, and thus significantly improves the prediction accuracy even in the presence of deep pipelines or long memory latencies. CVP prediction relies on the compiler to automatically identify the primary value correlation based on three high-level program substructures: virtual function calls, switch-case statements, and function pointer calls. The compiler-identified information is then fed back to the dynamic predictor and used to guide indirect branch prediction at runtime. We show that CVP prediction can be implemented in modern processors with little extra hardware support. Evaluations show that CVP prediction can significantly improve the prediction accuracy by 46% over traditional BTB-based prediction, leading to a performance improvement of 20%. Compared with the state-of-the-art aggressive ITTAGE and VBBI predictors, CVP prediction can improve performance by 5.5% and 4.2% respectively.
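The first of the three substructures, a virtual function call, is illustrated below with our own example: the target of the indirect call is determined by the dynamic type of each object, so the compiler can point the predictor at that correlated value (for instance, the vtable pointer) rather than relying on branch history alone.

#include <cstdio>

struct Shape { virtual int sides() const = 0; virtual ~Shape() = default; };
struct Triangle : Shape { int sides() const override { return 3; } };
struct Square   : Shape { int sides() const override { return 4; } };

int total_sides(const Shape* const* shapes, int n) {
    int total = 0;
    for (int i = 0; i < n; ++i) {
        total += shapes[i]->sides();  // indirect branch; target correlates with the object's type
    }
    return total;
}

int main() {
    Triangle t; Square s;
    const Shape* shapes[] = {&t, &s, &t};
    std::printf("%d\n", total_sides(shapes, 3));  // 10
    return 0;
}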


design, automation, and test in europe | 2012

Energy-efficient branch prediction with compiler-guided history stack

Mingxing Tan; Xianhua Liu; Zichao Xie; Dong Tong; Xu Cheng

Branch prediction is critical in exploiting instruction-level parallelism in modern processors. Previous aggressive branch predictors generally require a significant amount of hardware storage and complexity to pursue high prediction accuracy. This paper proposes the Compiler-guided History Stack (CHS), an energy-efficient compiler-microarchitecture cooperative technique for branch prediction. The key idea is to track very-long-distance branch correlation using a low-cost compiler-guided history stack. It relies on the compiler to identify branch correlation based on two program substructures, loops and procedures, and to feed the information to the predictor by inserting guiding instructions. At runtime, the processor dynamically saves and restores the global history using a low-cost history stack structure according to the compiler-guided information. The modification of the global history enables the predictor to track very-long-distance branch correlation and thus improves the prediction accuracy. We show that CHS can be combined with most existing branch predictors and is especially effective with small and simple predictors. Our evaluations show that the CHS technique can reduce average branch mispredictions by 28.7% over the gshare predictor, resulting in an average performance improvement of 10.4%. Furthermore, it also improves aggressive perceptron, OGEHL, and TAGE predictors.
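The mechanism can be pictured with a toy software model (our own simplification, not the hardware design itself): the global history register is saved before a deeply iterating region and restored afterwards, so a branch following the region still sees the history bits that preceded it instead of bits flushed out by thousands of loop-branch outcomes.

#include <cstdint>
#include <cstdio>
#include <vector>

struct HistoryStack {
    uint64_t ghr = 0;                 // global history register (1 bit per branch outcome)
    std::vector<uint64_t> stack;

    void record(bool taken) { ghr = (ghr << 1) | (taken ? 1 : 0); }
    void save()    { stack.push_back(ghr); }                  // inserted by the compiler before a loop/call
    void restore() { ghr = stack.back(); stack.pop_back(); }  // inserted after it
};

int main() {
    HistoryStack h;
    h.record(true);                    // long-distance correlated branch outcome
    h.save();
    for (int i = 0; i < 64; ++i) h.record(i % 2 == 0);  // loop branches would evict the bit above
    h.restore();
    std::printf("%llu\n", static_cast<unsigned long long>(h.ghr & 1));  // 1: correlated bit preserved
    return 0;
}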

Collaboration


Dive into Mingxing Tan's collaborations.
