Publication


Featured research published by Chongchong Xu.


International Conference on Web Services | 2017

Evaluation and Trade-offs of Graph Processing for Cloud Services

Chongchong Xu; Jinhong Zhou; Yuntao Lu; Fan Sun; Lei Gong; Chao Wang; Xi Li; Xuehai Zhou

Large-scale data is often represented as graphs in modern cloud computing, and graph processing attracts more and more attention as cloud services are used on massive graphs (e.g., social networks, web graphs, transport networks, and bioinformatics). Many state-of-the-art open-source single-node graph computing systems have been proposed, including GraphChi, X-Stream, and GridGraph. GraphChi adopts a vertex-centric model, while the latter two adopt an edge-centric model. However, there is a lack of evaluation and analysis of the performance of these systems, which makes it difficult for users to choose the best system for their applications. In this paper, to help graph processing provide excellent cloud services to users, we propose an evaluation framework and conduct a series of extensive experiments to evaluate the performance and analyze the bottlenecks of these systems on graphs with different characteristics and with different kinds of algorithms. The metrics we adopt are guiding principles for designing single-node graph computing systems, such as runtime, CPU utilization, and data locality. The results demonstrate the trade-offs among the different graph frameworks; for example, X-Stream is more suitable than GridGraph for processing transport networks with WCC and BFS. We also present several discussions on GridGraph. The results of our work serve as a reference for users, researchers, and developers.
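The vertex-centric versus edge-centric distinction evaluated above can be illustrated with a minimal in-memory sketch. This is a toy illustration only: real systems such as GraphChi, X-Stream, and GridGraph stream shards from disk, and the function names here are hypothetical.

```python
# Toy contrast of the two iteration orders the evaluation compares.
# Vertex-centric: loop over vertices, scatter along each vertex's edges
# (random access to neighbors). Edge-centric: loop over a flat edge list
# in storage order (sequential, streaming-friendly). Same PageRank result.

def pagerank_vertex_centric(vertices, out_edges, iters=10, d=0.85):
    """Vertex-centric model: iterate vertices, each scatters rank to neighbors."""
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(iters):
        nxt = {v: (1 - d) / n for v in vertices}
        for v in vertices:                       # iterate vertices
            if out_edges.get(v):
                share = d * rank[v] / len(out_edges[v])
                for w in out_edges[v]:           # scatter to neighbors
                    nxt[w] += share
        rank = nxt
    return rank

def pagerank_edge_centric(vertices, edges, iters=10, d=0.85):
    """Edge-centric model: iterate a flat edge list in sequential order."""
    n = len(vertices)
    outdeg = {v: 0 for v in vertices}
    for u, _ in edges:
        outdeg[u] += 1
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(iters):
        nxt = {v: (1 - d) / n for v in vertices}
        for u, w in edges:                       # stream edges sequentially
            nxt[w] += d * rank[u] / outdeg[u]
        rank = nxt
    return rank
```

Both compute the same ranks; the difference is access pattern, which is exactly the kind of data-locality trade-off the evaluation's metrics (runtime, CPU utilization, data locality) are meant to expose.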


International Conference on Cluster Computing | 2017

A Power-Efficient Accelerator for Convolutional Neural Networks

Fan Sun; Chao Wang; Lei Gong; Chongchong Xu; Yiwei Zhang; Yuntao Lu; Xi Li; Xuehai Zhou

Convolutional neural networks (CNNs) have been widely applied in various applications. However, the computation-intensive convolutional layers and memory-intensive fully connected layers bring many challenges to implementing CNNs on embedded platforms. To overcome this problem, this work proposes a power-efficient accelerator for CNNs in which different methods are applied to optimize the convolutional and fully connected layers. For the convolutional layer, the accelerator rearranges the input features into a matrix on the fly while storing them to the on-chip buffers, so the convolutional layer can be computed as a matrix multiplication. For the fully connected layer, a batch-based method is used to reduce the required memory bandwidth; this layer can also be computed as a matrix multiplication. A two-layer pipelined computation method for matrix multiplication is then proposed to increase throughput. As a case study, we implement a widely used CNN model, LeNet-5, on an embedded device. It achieves a peak performance of 34.48 GOP/s and a power efficiency of 19.45 GOP/s/W at a 100 MHz clock frequency, which outperforms previous approaches.
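The on-the-fly rearrangement of input features into a matrix is the scheme commonly known as im2col; a software sketch of the idea follows. The function names and the stride-1, no-padding, single-channel shapes are illustrative assumptions, not the paper's actual hardware buffer design.

```python
import numpy as np

def im2col(x, k):
    """Rearrange an H x W input so each column holds one k x k patch.
    Convolution then becomes one matrix multiplication (stride 1, no padding)."""
    h, w = x.shape
    oh, ow = h - k + 1, w - k + 1
    cols = np.empty((k * k, oh * ow))
    for i in range(oh):
        for j in range(ow):
            cols[:, i * ow + j] = x[i:i + k, j:j + k].ravel()
    return cols

def conv2d_as_matmul(x, kernel):
    """Convolution expressed as matrix multiplication, the form the
    accelerator's matrix-multiply pipeline consumes."""
    k = kernel.shape[0]
    oh, ow = x.shape[0] - k + 1, x.shape[1] - k + 1
    return (kernel.ravel() @ im2col(x, k)).reshape(oh, ow)
```

Once both layer types are phrased as matrix multiplication like this, a single pipelined multiply unit can serve the whole network, which is the uniformity the design exploits.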


Field-Programmable Gate Arrays | 2018

Domino: An Asynchronous and Energy-efficient Accelerator for Graph Processing: (Abstract Only)

Chongchong Xu; Chao Wang; Yiwei Zhang; Lei Gong; Xi Li; Xuehai Zhou

Large-scale graph processing, which draws the attention of many researchers, applies to a wide range of domains such as social networks, web graphs, and transport networks. However, processing large-scale graphs on general-purpose processors suffers from computation and memory inefficiency, so hardware accelerators for graph processing have recently become a hot research topic. Meanwhile, as a power-efficient and reconfigurable resource, the FPGA is a promising platform on which to design and deploy graph processing algorithms. In this paper, we propose Domino, an asynchronous and energy-efficient hardware accelerator for graph processing. Domino adopts the asynchronous model to process graphs, which is efficient for most graph algorithms, such as Breadth-First Search, Depth-First Search, and Single-Source Shortest Path. Domino also proposes a specific row-vector-based data structure, named Batch Row Vector, to represent graphs, and adopts naive and bisect update mechanisms to perform asynchronous control. Ultimately, we implement Domino on an advanced Xilinx Virtex-7 board, and experimental results demonstrate that Domino delivers significant performance and energy improvements, especially for graphs with a large diameter (e.g., roadNet-CA and USA-Road). Case studies in Domino achieve 1.47x-7.84x and 0.47x-2.52x average speedup for small-diameter graphs (e.g., com-youtube, WikiTalk, and soc-LiveJournal) over GraphChi on Intel Core 2 and Core i7 processors, respectively. Moreover, compared to the Intel Core i7 processor, Domino also achieves significant energy-efficiency gains: 2.03x-10.08x for the three small-diameter graphs and 27.98x-134.50x for roadNet-CA, a graph with a relatively large diameter.
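The synchronous versus asynchronous distinction behind Domino's model can be sketched for BFS in plain software. This is a generic illustration of the two update disciplines, not Domino's hardware pipeline or its Batch Row Vector structure; the function names are hypothetical.

```python
from collections import deque

def bfs_synchronous(adj, src):
    """Synchronous model: levels advance in lock-step; an update produced
    in one iteration is only consumed in the next."""
    dist = {src: 0}
    frontier = [src]
    while frontier:
        nxt = []
        for u in frontier:
            for v in adj.get(u, []):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    nxt.append(v)
        frontier = nxt
    return dist

def bfs_asynchronous(adj, src):
    """Asynchronous model: an update is visible as soon as it is produced,
    which tends to shorten convergence on large-diameter graphs
    (e.g., road networks) -- the case where Domino shines."""
    dist = {src: 0}
    work = deque([src])
    while work:
        u = work.popleft()
        for v in adj.get(u, []):
            if dist[u] + 1 < dist.get(v, float("inf")):
                dist[v] = dist[u] + 1   # immediately visible to later pops
                work.append(v)
    return dist
```

Both variants reach the same distances; the asynchronous one simply removes the per-level barrier, which is what reduces the number of passes on high-diameter graphs.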


International Journal of Parallel Programming | 2018

UniCNN: A Pipelined Accelerator Towards Uniformed Computing for CNNs

Fan Sun; Chao Wang; Lei Gong; Yiwei Zhang; Chongchong Xu; Yuntao Lu; Xi Li; Xuehai Zhou

Convolutional neural networks (CNNs) have been widely applied for image recognition, face detection, and video analysis because of their ability to achieve accuracy close to, or even better than, human-level perception. However, the different characteristics of convolution layers and fully connected layers bring many challenges to implementing CNNs on FPGA platforms, because different accelerator units must be designed to process the whole network. To overcome this problem, this work proposes a pipelined accelerator towards uniformed computing for convolutional neural networks. For the convolution layer, the accelerator repositions the input features into a matrix on the fly when they are stored to FPGA on-chip buffers, so the convolution layer can be computed as a matrix multiplication. For the fully connected layer, a batch-based method is used to reduce the required memory bandwidth; this layer can also be computed as a matrix multiplication. A pipelined computation method for matrix multiplication is then proposed to increase throughput and reduce the buffer requirement. The experimental results show that the proposed accelerator surpasses CPU and GPU platforms in terms of energy efficiency. It achieves a throughput of 49.31 GFLOPS using only 198 DSP modules. Compared to the state-of-the-art implementation, our accelerator has better hardware utilization efficiency.


International Conference on Cluster Computing | 2017

A Power-Efficient Accelerator Based on FPGAs for LSTM Network

Yiwei Zhang; Chao Wang; Lei Gong; Yuntao Lu; Fan Sun; Chongchong Xu; Xi Li; Xuehai Zhou

Today, artificial neural networks (ANNs) are widely used in a variety of applications, including speech recognition, face detection, and disease diagnosis. Long Short-Term Memory (LSTM), an emerging kind of ANN, is a recurrent neural network (RNN) that contains complex computational logic. To achieve high accuracy, researchers often build large-scale LSTM networks that are both time-consuming and power-consuming. In this paper, we present a hardware accelerator for the LSTM neural network layer based on the FPGA Zedboard and use pipelining to parallelize the forward computation. We also implement a sparse LSTM hidden layer, which consumes fewer storage resources than the dense network. Our accelerator is power-efficient and faster than an ARM Cortex-A9 processor.
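The forward computation that such an accelerator pipelines is the standard LSTM cell recurrence; a plain NumPy sketch follows. The weight layout (four gates stacked in one matrix) and names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step. The four gate computations below are the
    independent matrix-vector products a pipelined accelerator can
    overlap. Shapes: W is (4H, X), U is (4H, H), b is (4H,)."""
    H = h.shape[0]
    z = W @ x + U @ h + b
    i = sigmoid(z[0:H])           # input gate
    f = sigmoid(z[H:2 * H])       # forget gate
    o = sigmoid(z[2 * H:3 * H])   # output gate
    g = np.tanh(z[3 * H:4 * H])   # candidate cell state
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new
```

Pruning rows of W and U (a sparse hidden layer, as the paper does) shrinks both the storage and the multiply-accumulate count of this step.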


International Conference on Cluster Computing | 2017

OmniGraph: A Scalable Hardware Accelerator for Graph Processing

Chongchong Xu; Chao Wang; Lei Gong; Yuntao Lu; Fan Sun; Yiwei Zhang; Xi Li; Xuehai Zhou

Large-scale graph processing attracts more and more attention and has been widely applied in many application domains. The FPGA is a promising platform for implementing graph processing algorithms with high power efficiency and parallelism. In this paper, we propose OmniGraph, a scalable hardware accelerator for graph processing. OmniGraph can adaptively process graphs of different sizes and is adaptable to various graph algorithms. OmniGraph improves the preprocessing methodology based on Interval-Shard and consists of three computation engines: vertices on-chip and edges on-chip, vertices on-chip and edges off-chip, and vertices off-chip and edges off-chip. Experimental results on a state-of-the-art Xilinx Virtex-7 board demonstrate that case studies in OmniGraph achieve 1.03x-8.13x average speedup compared to GraphChi on an Intel Core 2 processor.


IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing | 2017

Mermaid: Integrating Vertex-Centric with Edge-Centric for Real-World Graph Processing

Jinhong Zhou; Chongchong Xu; Xianglan Chen; Chao Wang; Xuehai Zhou

There has been increasing interest in processing large-scale real-world graphs, and many graph systems have recently been proposed. Vertex-centric GAS (Gather-Apply-Scatter) and edge-centric GAS are two widely adopted graph computation models, and existing graph analytics systems commonly follow only one of them, which is not the best choice for real-world graph processing. In fact, vertex degrees in real-world graphs often obey skewed power-law distributions: most vertices have relatively few neighbors while a few have many. We observe that vertex-centric GAS for high-degree vertices and edge-centric GAS for low-degree vertices is a much better choice for real-world graph processing. In this paper, we present Mermaid, a system for processing large-scale real-world graphs on a single machine. Mermaid skillfully integrates vertex-centric GAS with edge-centric GAS through a novel vertex-mapping mechanism and supports streamlined graph processing. On a total of six practical natural-graph processing tasks, we demonstrate that, on average, Mermaid achieves 1.83x better performance than the state-of-the-art single-machine graph system.
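The degree-based split that motivates the hybrid design can be sketched as follows. The threshold and function names are illustrative; this is not Mermaid's actual vertex-mapping mechanism.

```python
def partition_by_degree(adj, threshold):
    """Split vertices by out-degree, mirroring the observation above:
    route high-degree vertices (hubs) to the vertex-centric engine and
    low-degree vertices to the edge-centric engine."""
    high = {v for v, nbrs in adj.items() if len(nbrs) >= threshold}
    low = set(adj) - high
    return high, low

def degree_histogram(adj):
    """Degree histogram: on a power-law graph this shows many low-degree
    vertices and a short tail of hubs, which is why a single model
    fits neither group well."""
    hist = {}
    for nbrs in adj.values():
        hist[len(nbrs)] = hist.get(len(nbrs), 0) + 1
    return hist
```

With such a partition, each engine only sees the workload shape it handles efficiently, which is the intuition behind the reported speedup.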


Compilers, Architecture, and Synthesis for Embedded Systems | 2017

A high-performance FPGA accelerator for sparse neural networks: work-in-progress

Yuntao Lu; Lei Gong; Chongchong Xu; Fan Sun; Yiwei Zhang; Chao Wang; Xuehai Zhou

Neural networks have been widely used in a large range of domains, and researchers tune the numbers of layers, neurons, and synapses to adapt them to various applications. As a consequence, neural network models are both computation- and memory-intensive, and their large memory and computing requirements make them difficult to deploy on resource-limited platforms. Sparse neural networks, which prune redundant neurons and synapses, alleviate this computation and memory pressure. However, conventional accelerators cannot benefit from sparsity. In this paper, we propose a high-performance FPGA accelerator for sparse neural networks that eliminates unnecessary computations and storage. This work compresses sparse weights and processes the compressed data directly. Experimental results demonstrate that our accelerator reduces the storage of convolutional and fully connected layers by 50% and 10%, respectively, and achieves a 3x performance speedup over an optimized conventional FPGA accelerator.
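Compressing pruned weights and computing on the compressed form directly can be sketched with a CSR-like layout. The format and names here are a common illustration, not necessarily the accelerator's exact on-chip encoding.

```python
def compress_weights(dense):
    """Compress a pruned weight matrix into CSR-like arrays: nonzero
    values, their column indices, and per-row offsets. Pruned (zero)
    synapses are dropped entirely, which is the storage saving."""
    vals, cols, rowptr = [], [], [0]
    for row in dense:
        for j, w in enumerate(row):
            if w != 0.0:
                vals.append(w)
                cols.append(j)
        rowptr.append(len(vals))
    return vals, cols, rowptr

def sparse_matvec(vals, cols, rowptr, x):
    """Fully connected layer on the compressed weights: only nonzero
    multiply-accumulates are performed, which is the compute saving."""
    y = []
    for r in range(len(rowptr) - 1):
        acc = 0.0
        for k in range(rowptr[r], rowptr[r + 1]):
            acc += vals[k] * x[cols[k]]
        y.append(acc)
    return y
```

A conventional dense accelerator would still multiply by the zeros; operating on the compressed arrays is what lets a sparsity-aware design skip them.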




International Conference on Web Services | 2018

Domino: Graph Processing Services on Energy-Efficient Hardware Accelerator

Chongchong Xu; Chao Wang; Lei Gong; Lihui Jin; Xi Li; Xuehai Zhou

Collaboration


Dive into Chongchong Xu's collaborations.

Top Co-Authors

Chao Wang (University of Science and Technology of China)
Xuehai Zhou (University of Science and Technology of China)
Lei Gong (University of Science and Technology of China)
Fan Sun (University of Science and Technology of China)
Xi Li (University of Science and Technology of China)
Yiwei Zhang (University of Science and Technology of China)
Yuntao Lu (University of Science and Technology of China)
Jinhong Zhou (University of Science and Technology of China)
Xianglan Chen (University of Science and Technology of China)