Lei Gong
University of Science and Technology of China
Publications
Featured research published by Lei Gong.
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2017
Chao Wang; Lei Gong; Qi Yu; Xi Li; Yuan Xie; Xuehai Zhou
As an emerging field of machine learning, deep learning shows excellent ability in solving complex learning problems. However, the size of the networks has become increasingly large due to the demands of practical applications, which poses a significant challenge to constructing high-performance implementations of deep learning neural networks. To improve performance while maintaining a low power cost, in this paper we design the Deep Learning Accelerator Unit (DLAU), a scalable accelerator architecture for large-scale deep learning networks that uses a field-programmable gate array (FPGA) as the hardware prototype. The DLAU accelerator employs three pipelined processing units to improve throughput and utilizes tile techniques to exploit locality in deep learning applications. Experimental results on a state-of-the-art Xilinx FPGA board demonstrate that the DLAU accelerator achieves up to 36.1× speedup compared to Intel Core2 processors, with a power consumption of 234 mW.
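The tile technique described above is, in essence, blocked computation that keeps reused data resident in small on-chip buffers. A minimal software sketch of the idea for a fully connected layer, computed as a tiled matrix-vector product; the tile size, names, and data layout are illustrative assumptions, not taken from the paper:

#include <stddef.h>

#define TILE 32  /* illustrative tile size, not from the paper */

/* y = W * x for one fully connected layer, processed in TILE-wide blocks of
 * the input vector so each block can stay resident in a small on-chip
 * buffer -- the locality that a tile technique exposes. */
void fc_layer_tiled(const float *W, const float *x, float *y,
                    size_t rows, size_t cols)
{
    for (size_t i = 0; i < rows; i++)
        y[i] = 0.0f;
    for (size_t jt = 0; jt < cols; jt += TILE) {      /* tile over inputs */
        size_t jend = (jt + TILE < cols) ? jt + TILE : cols;
        for (size_t i = 0; i < rows; i++) {
            float acc = y[i];
            for (size_t j = jt; j < jend; j++)        /* reuses x[jt..jend) */
                acc += W[i * cols + j] * x[j];
            y[i] = acc;
        }
    }
}

On an FPGA the inner loop would map onto the pipelined processing units; the blocking here only illustrates the data reuse the tiles create.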
International Conference on Web Services | 2017
Chongchong Xu; Jinhong Zhou; Yuntao Lu; Fan Sun; Lei Gong; Chao Wang; Xi Li; Xuehai Zhou
International Conference on Cluster Computing | 2017
Fan Sun; Chao Wang; Lei Gong; Chongchong Xu; Yiwei Zhang; Yuntao Lu; Xi Li; Xuehai Zhou
Field Programmable Gate Arrays | 2018
Chongchong Xu; Chao Wang; Yiwei Zhang; Lei Gong; Xi Li; Xuehai Zhou
Large-scale data is often represented as graphs in modern cloud computing, and graph processing attracts increasing attention as cloud computing services are adopted. With growing interest in processing massive graphs (e.g., social networks, web graphs, transport networks, and bioinformatics), many state-of-the-art open-source single-node graph computing systems have been proposed, including GraphChi, X-Stream, and GridGraph. GraphChi adopts a vertex-centric model, while the latter two adopt an edge-centric model. However, there is a lack of evaluation and analysis of the performance of these systems, which makes it difficult for users to choose the best system for their applications. In this paper, to help graph processing provide excellent cloud services, we propose an evaluation framework and conduct a series of extensive experiments to evaluate the performance and analyze the bottlenecks of these systems on graphs with different characteristics and with different kinds of algorithms. The metrics we adopt are principles for designing single-node graph computing systems, such as runtime, CPU utilization, and data locality. The results demonstrate the trade-offs among the different graph frameworks and show that X-Stream is more suitable than GridGraph for processing transport networks with WCC and BFS. Besides, we present several discussions on GridGraph. The results of our work serve as a reference for users, researchers, and developers.
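To make the two compared models concrete, here is a minimal sketch of one label-propagation step (the core of WCC) in each style. The edge-list and CSR layouts and the function names are illustrative assumptions, not drawn from GraphChi, X-Stream, or GridGraph:

#include <stddef.h>

typedef struct { size_t src, dst; } Edge;  /* illustrative edge record */

/* Edge-centric step (X-Stream / GridGraph style): stream every edge once,
 * propagating the smaller component label across it. */
int step_edge_centric(const Edge *edges, size_t m, size_t *label)
{
    int changed = 0;
    for (size_t e = 0; e < m; e++)
        if (label[edges[e].src] < label[edges[e].dst]) {
            label[edges[e].dst] = label[edges[e].src];
            changed = 1;
        }
    return changed;
}

/* Vertex-centric step (GraphChi style): iterate vertices over their
 * out-edges, given CSR offsets off[0..n] into the neighbor array adj. */
int step_vertex_centric(const size_t *off, const size_t *adj,
                        size_t n, size_t *label)
{
    int changed = 0;
    for (size_t v = 0; v < n; v++)
        for (size_t k = off[v]; k < off[v + 1]; k++)
            if (label[v] < label[adj[k]]) {
                label[adj[k]] = label[v];
                changed = 1;
            }
    return changed;
}

Repeating either step until nothing changes yields the components; the paper's metrics (runtime, CPU utilization, data locality) differ between the two styles largely because they touch memory in very different orders.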
International Journal of Parallel Programming | 2018
Yuntao Lu; Chao Wang; Lei Gong; Xuehai Zhou
Convolutional neural networks (CNNs) have been widely applied in various applications. However, the computation-intensive convolutional layers and memory-intensive fully connected layers bring many challenges to implementing CNNs on embedded platforms. To overcome this problem, this work proposes a power-efficient accelerator for CNNs, applying different methods to optimize the convolutional and fully connected layers. For the convolutional layer, the accelerator rearranges the input features into matrices on the fly while storing them to the on-chip buffers, so that the convolutional layer can be computed as a matrix multiplication. For the fully connected layer, a batch-based method is used to reduce the required memory bandwidth; this layer, too, is completed through matrix multiplication. A two-layer pipelined computation method for matrix multiplication is then proposed to increase throughput. As a case study, we implement a widely used CNN model, LeNet-5, on an embedded device. It achieves a peak performance of 34.48 GOP/s and a power efficiency of 19.45 GOP/s/W at a 100 MHz clock frequency, outperforming previous approaches.
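The on-the-fly rearrangement the abstract describes matches the well-known im2col transformation, which turns a convolution into a single matrix multiplication. A minimal single-channel, unit-stride, no-padding sketch; the buffer layout and names are assumptions for illustration, not the accelerator's actual format:

#include <stddef.h>

/* Rearrange a single-channel H x W input into a matrix whose columns are
 * the K x K patches, so the convolution becomes one matrix multiplication.
 * col must hold (K*K) * (H-K+1) * (W-K+1) floats. */
void im2col(const float *in, float *col, size_t H, size_t W, size_t K)
{
    size_t oh = H - K + 1, ow = W - K + 1;    /* output height and width */
    for (size_t y = 0; y < oh; y++)
        for (size_t x = 0; x < ow; x++)
            for (size_t ky = 0; ky < K; ky++)
                for (size_t kx = 0; kx < K; kx++)
                    /* row = patch element, column = output position */
                    col[(ky * K + kx) * (oh * ow) + (y * ow + x)] =
                        in[(y + ky) * W + (x + kx)];
}

Multiplying the (filters × K·K) weight matrix by this column matrix then produces every output pixel of the layer in one matrix product, which is what lets a single matrix-multiplication pipeline serve both layer types.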
International Journal of Parallel Programming | 2018
Fan Sun; Chao Wang; Lei Gong; Yiwei Zhang; Chongchong Xu; Yuntao Lu; Xi Li; Xuehai Zhou
Large-scale graph processing, which draws the attention of many researchers, applies to a wide range of domains, such as social networks, web graphs, and transport networks. However, processing large-scale graphs on general-purpose processors suffers from computation and memory inefficiency, so hardware accelerators for graph processing have recently become a hot research topic. Meanwhile, as a power-efficient and reconfigurable resource, the FPGA is a potential platform on which to design and deploy graph processing algorithms. In this paper, we propose Domino, an asynchronous and energy-efficient hardware accelerator for graph processing. Domino adopts the asynchronous model to process graphs, which is efficient for most graph algorithms, such as Breadth-First Search, Depth-First Search, and Single-Source Shortest Path. Domino also introduces a specific row-vector-based data structure, named Batch Row Vector, to represent graphs. Our work adopts a naive update mechanism and a bisect update mechanism to perform asynchronous control. Ultimately, we implement Domino on an advanced Xilinx Virtex-7 board, and experimental results demonstrate that Domino delivers significant performance and energy improvements, especially for graphs with a large diameter (e.g., roadNet-CA and USA-Road). Case studies in Domino achieve 1.47x-7.84x and 0.47x-2.52x average speedup for small-diameter graphs (e.g., com-youtube, WikiTalk, and soc-LiveJournal) over GraphChi on the Intel Core2 and Core i7 processors, respectively. Besides, compared to the Intel Core i7 processor, Domino also achieves significant energy efficiency: 2.03x-10.08x for three small-diameter graphs and 27.98x-134.50x for roadNet-CA, a graph with a relatively large diameter.
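A minimal software sketch of the asynchronous model Domino adopts, using Single-Source Shortest Path: a relaxed distance becomes visible to later edges in the same pass instead of waiting for the next superstep, which is what speeds convergence on large-diameter graphs. The worklist and CSR layout are illustrative assumptions and are not Domino's Batch Row Vector format:

#include <stdlib.h>

/* Asynchronous SSSP over a CSR graph (off[0..n] indexes adj/w). A vertex
 * popped from the worklist relaxes its out-edges immediately; since each
 * vertex is enqueued at most once at a time, a circular buffer of size n
 * suffices. */
void sssp_async(const size_t *off, const size_t *adj, const float *w,
                size_t n, size_t src, float *dist)
{
    size_t *q = malloc(n * sizeof *q);   /* circular worklist */
    char *inq = calloc(n, 1);            /* is vertex currently queued? */
    for (size_t v = 0; v < n; v++) dist[v] = 1e30f;
    dist[src] = 0.0f;
    size_t head = 0, count = 1;
    q[0] = src; inq[src] = 1;
    while (count > 0) {
        size_t u = q[head]; head = (head + 1) % n;
        count--; inq[u] = 0;
        for (size_t k = off[u]; k < off[u + 1]; k++) {
            size_t v = adj[k];
            if (dist[u] + w[k] < dist[v]) {
                dist[v] = dist[u] + w[k];  /* visible immediately */
                if (!inq[v]) {             /* enqueue at most once */
                    q[(head + count) % n] = v;
                    count++; inq[v] = 1;
                }
            }
        }
    }
    free(q); free(inq);
}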
International Conference on Web Services | 2017
Chao Wang; Jinhong Zhou; Lei Gong; Xi Li; Aili Wang; Xuehai Zhou
Neural networks have been widely used as a powerful representation in various research domains, such as computer vision, natural language processing, and artificial intelligence. However, to achieve better application results, the increasing numbers of neurons and synapses make neural networks both computationally and memory intensive, and therefore difficult to deploy on resource-limited platforms. Sparse methods can reduce redundant neurons and synapses, but conventional accelerators cannot benefit from the sparsity. In this paper, we propose an efficient accelerating method for sparse neural networks, which compresses synapse weights and processes the compressed structure with an FPGA accelerator. Our method achieves 40% and 20% compression ratios of synapse weights in convolutional and fully connected layers, respectively. The experimental results demonstrate that our accelerating method can boost an FPGA accelerator to achieve a 3× speedup.
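A minimal sketch of how an accelerator can consume compressed synapse weights: only nonzero weights are stored, each with the index of the input it connects to (a CSR-like layout), so zero synapses cost neither storage nor multiply-accumulates. The encoding here is an illustrative assumption, not the paper's actual compression format:

#include <stddef.h>

/* Compressed fully connected weights: val/col hold only the nonzero
 * synapses, and rowptr[i]..rowptr[i+1] is output neuron i's range. */
typedef struct {
    const float  *val;    /* nonzero weights */
    const size_t *col;    /* input index of each nonzero weight */
    const size_t *rowptr; /* per-output-neuron range into val/col */
} SparseWeights;

/* y = W_sparse * x: iterate only the stored (nonzero) synapses. */
void fc_sparse(const SparseWeights *sw, const float *x, float *y, size_t rows)
{
    for (size_t i = 0; i < rows; i++) {
        float acc = 0.0f;
        for (size_t k = sw->rowptr[i]; k < sw->rowptr[i + 1]; k++)
            acc += sw->val[k] * x[sw->col[k]];
        y[i] = acc;
    }
}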
International Conference on Web Services | 2017
Chao Wang; Haijie Fang; Shiming Lei; Lei Gong; Aili Wang; Xi Li; Xuehai Zhou
International Conference on Hardware/Software Codesign and System Synthesis | 2017
Lei Gong; Chao Wang; Xi Li; Huaping Chen; Xuehai Zhou
International Conference on Cluster Computing | 2017
Yiwei Zhang; Chao Wang; Lei Gong; Yuntao Lu; Fan Sun; Chongchong Xu; Xi Li; Xuehai Zhou