Dajiang Liu | Researchain



Publication

Featured researches published by Dajiang Liu.

design automation conference | 2013

Polyhedral model based mapping optimization of loop nests for CGRAs

Dajiang Liu; Shouyi Yin; Leibo Liu; Shaojun Wei

The coarse-grained reconfigurable architecture (CGRA) is a promising platform that provides both high performance and high power efficiency. The compute-intensive portions of an application (e.g., loops) are often mapped onto a CGRA for acceleration. To optimize the mapping of loop nests onto CGRAs, this paper makes two contributions: i) establishing a precise CGRA performance model and formulating loop nest mapping as a nonlinear optimization problem based on the polyhedral model, and ii) deriving an efficient heuristic loop transformation and mapping algorithm (PolyMAP) to improve mapping performance. Experimental results on most kernels of PolyBench and on real-life applications show that the proposed approach improves kernel performance by 21% on average compared with EPIMap, one of the best existing mapping algorithms. The runtime complexity of PolyMAP is also acceptable.

IEEE Transactions on Very Large Scale Integration Systems | 2016

Improving Nested Loop Pipelining on Coarse-Grained Reconfigurable Architectures

Shouyi Yin; Dajiang Liu; Yu Peng; Leibo Liu; Shaojun Wei

Coarse-grained reconfigurable architecture (CGRA) is a promising architecture offering high performance, high power efficiency, and attractive flexibility. The computation-intensive portions of applications, i.e., loops, are often implemented on CGRAs for acceleration. Loop pipelining techniques are usually used to exploit the parallelism of loops. However, for nested loops, existing loop pipelining methods often result in poor hardware utilization and low execution performance. To tackle this problem, this paper makes three contributions: 1) we propose the use of affine transformation to facilitate nested loop pipelining; 2) based on the polyhedral model, we present a precise and general formulation of the nested loop pipelining problem on a CGRA; and 3) using insights from the problem formulation, we design a joint affine transformation and multipipeline merging approach to improve the performance of nested loops on CGRAs. The experimental results show that our approach can improve the performance of nested loops by up to 35% on average, compared with the state-of-the-art techniques.

IEEE Transactions on Very Large Scale Integration Systems | 2015

Optimizing Spatial Mapping of Nested Loop for Coarse-Grained Reconfigurable Architectures

Dajiang Liu; Shouyi Yin; Yu Peng; Leibo Liu; Shaojun Wei

Coarse-grained reconfigurable architectures (CGRAs) have drawn increasing attention due to their flexibility and efficiency. Loops in applications are often mapped onto CGRAs for acceleration, and mapping loops onto a CGRA is quite challenging because of the parallel execution paradigm and constrained hardware resources. To map loops onto CGRAs efficiently, it is important to transform loops into pieces that obey hardware resource constraints with less overhead (e.g., communication and configuration overhead). In this paper, we tackle this problem by formulating a performance optimization problem that covers loop transformation and back-end placement and routing. A novel search strategy is also designed to find the optimal result efficiently. Finally, we build a complete flow for mapping loop nests onto CGRAs. Experimental results on most kernels of PolyBench show that our proposed approach can improve the performance of the kernels by 42% on average, compared with the state-of-the-art methods. The runtime complexity of our approach is also acceptable.

IEEE Transactions on Very Large Scale Integration Systems | 2016

Memory-Aware Loop Mapping on Coarse-Grained Reconfigurable Architectures

Shouyi Yin; Xianqing Yao; Dajiang Liu; Leibo Liu; Shaojun Wei

Coarse-grained reconfigurable architectures (CGRAs) are a promising class of architectures with the advantages of high performance and high power efficiency. The compute-intensive parts of an application (e.g., loops) are often mapped onto a CGRA for acceleration. Because of the extra overhead of memory access and the limited communication bandwidth between the processing element (PE) array and local memory, previous works on the routing problem are mainly confined to the internal resources of PE arrays (e.g., PEs and registers). Inevitably, routing with PEs or registers consumes a lot of computational resources and increases the initiation interval. To solve this problem, this paper makes two contributions: 1) establishing a precise formulation of the CGRA mapping problem that uses shared local data memory as a routing resource and 2) deriving an effective approach for mapping loops to CGRAs. The experimental results on loops from SPEC2006, Livermore, and MiBench show that our approach (called MEMMap) can improve the performance of the kernels on a CGRA by up to 1.62×, 1.58×, 1.28×, and 1.23× compared with edge-centric modulo scheduling, EPIMap, REGIMap, and force-directed map, respectively, with an acceptable increase in compilation time.
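
As a rough illustration of why routing through PEs can raise the initiation interval, the sketch below computes the standard resource-constrained lower bound used in modulo scheduling, ceil(operations / PEs). It is not the formulation from the paper above: the function name `res_mii`, the operation counts, and the 4x4 array size are all hypothetical, and the model simply assumes that every routing operation occupies one PE slot.

```python
import math


def res_mii(num_ops: int, num_pes: int) -> int:
    """Resource-constrained lower bound on the initiation interval.

    Assumes each operation (compute or routing) occupies one PE for one cycle.
    """
    if num_pes <= 0:
        raise ValueError("num_pes must be positive")
    return max(1, math.ceil(num_ops / num_pes))


# Hypothetical kernel: 14 compute operations on a 4x4 PE array.
compute_ops = 14
routing_ops = 6  # assumed number of PE slots spent only on forwarding values
pes = 16

# Bound when no PE slots are spent on routing.
ii_without_routing = res_mii(compute_ops, pes)
# Bound when routing operations compete with compute operations for PEs.
ii_with_routing = res_mii(compute_ops + routing_ops, pes)

print(ii_without_routing, ii_with_routing)
```

Under these assumed numbers the bound rises once routing operations are counted, which is the effect that moving routing into shared local memory is meant to avoid.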

field programmable gate arrays | 2017

Learning Convolutional Neural Networks for Data-Flow Graph Mapping on Spatial Programmable Architectures (Abstract Only)

Shouyi Yin; Dajiang Liu; Lifeng Sun; Xinhan Lin; Leibo Liu; Shaojun Wei

Data flow graph (DFG) mapping is critical for compiling to spatial programmable architectures, where compilation time is a key factor for both time-to-market requirements and the mapping success rate. Inspired by the great progress made in tree-search games using deep neural networks, we propose a framework for learning convolutional neural networks for mapping DFGs onto spatial programmable architectures. Considering that mapping is a process from source to target, we present a dual-input neural network that captures features from both the DFGs in applications and the Process Element Array (PEA) in spatial programmable architectures. To train the neural network, algorithms are designed to automatically generate a data set from PEA intermediate states of preprocessed DFGs. Finally, we demonstrate that the trained neural network identifies mapping quality with high accuracy and that our proposed mapping approach is competitive with state-of-the-art DFG mapping algorithms in performance while the compilation time is greatly reduced.

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2016

Joint Modulo Scheduling and $V_{\mathrm{dd}}$ Assignment for Loop Mapping on Dual-$V_{\mathrm{dd}}$ CGRAs

Shouyi Yin; Jiangyuan Gu; Dajiang Liu; Leibo Liu; Shaojun Wei

Coarse-grained reconfigurable architecture (CGRA) is becoming an increasingly attractive platform because of its high performance and power (or energy) efficiency. To reduce energy consumption, the dual-Vdd technique has been employed in CGRAs, and the modulo scheduling technique is widely used to improve performance of applications. To achieve both high performance and energy-efficiency simultaneously, this paper formulates the solution as a biobjective optimization problem of energy consumption and initiation interval of loop pipelines on CGRAs, and proposes a joint modulo scheduling and dual-Vdd assignment approach. The experimental results show that the proposed approach can bring a significant energy reduction of 24.8% and kernel energy efficiency acceleration of 1.41× on average, while the performance is maintained.

design, automation, and test in europe | 2015


Shouyi Yin; Dajiang Liu; Leibo Liu; Shaojun Wei; Yike Guo

Coarse-grained reconfigurable architectures (CGRAs) are promising architectures with high performance, high power efficiency, and attractive flexibility. The computation-intensive portions of applications, i.e., loops, are often implemented on CGRAs for acceleration. Loop pipelining techniques are usually used to exploit the parallelism of loops. However, for nested loops, existing loop pipelining methods often result in poor hardware utilization and low execution performance. To tackle this problem, this paper makes two contributions: 1) a pipelining-beneficial affine transformation method that can optimize the initiation interval (II) of nested loops and enable the merging of multiple loop pipelines; 2) a multi-pipeline merging method that can further improve hardware utilization. The experimental results show that our approach can improve the performance of nested loops by up to 56% on average, compared with the state-of-the-art techniques.

international conference on cyber-physical systems | 2011


Dajiang Liu; Shouyi Yin; Jianfeng Chen; Hui Gao; Shaojun Wei

Environmental monitoring is required in almost every walk of life, especially in factories, truck farms, and hospital wards. However, collecting data from many different monitoring sensors for combined analysis is difficult because most existing sensors work alone and there is a variety of sensor interfaces. This paper presents a novel environmental monitoring system based on a wireless sensor network with an SD memory interface to improve off-the-shelf sensor products. We designed a main node with an SD memory interface to gather sub-node sensor data, and five small sub-node boards designed to be installed in existing sensors, such as temperature/humidity sensors, infrared sensors, smoke detectors, and door magnetic detectors. Via the general SD memory interface, a portable tablet can read environmental information instantaneously, and the information can be recorded in large-capacity NAND flash storage at the same time. The performance evaluation shows that our monitoring system is flexible and convenient to apply to off-the-shelf commercial sensors.

international symposium on circuits and systems | 2017


Shouyi Yin; Dajiang Liu; Lifeng Sun; Leibo Liu; Shaojun Wei

The coarse-grained reconfigurable architecture (CGRA) is a promising platform that provides both high performance and high power efficiency. Dataflow graph (DFG) mapping is critical to tapping the potential of CGRAs. Inspired by the great progress made in tree-search games using deep neural networks, we propose a framework for learning convolutional neural networks for mapping DFGs onto spatial programmable CGRAs. Considering the mapping process, we present a dual-input neural network that captures features from both the DFGs in applications and the Process Element Array (PEA) in the CGRA. To train the neural network, algorithms are designed to automatically generate a data set from PEA intermediate states of preprocessed DFGs. Finally, experimental results demonstrate that our proposed mapping approach is competitive with state-of-the-art DFG mapping algorithms in performance while the compilation time is greatly reduced.

field-programmable custom computing machines | 2014


Dajiang Liu; Shouyi Yin; Leibo Liu; Shaojun Wei

Coarse-grained reconfigurable architectures (CGRAs) are promising parallel architectures with both high performance and high power efficiency. Inner loop pipelining and outer loop merging techniques are usually used to improve execution performance when mapping loops onto a CGRA. However, the number of concurrently executable operators (CEOs) from the kernel still cannot make the best use of the PEs in a PEA, as the kernel width is limited by the loop body size. In this paper, we evaluate the number of CEOs from the kernel and propose a novel outer loop parallelization scheme to increase the number of CEOs and further reduce the execution time of loops. Our experimental results using loops from PolyBench demonstrate that our approach can increase the performance of nested loops by up to 1.37 times compared with the epilog-prolog merging method. Moreover, only modest compilation time is needed to generate the final solution.
