Xuegong Zhou | Researchain

Publication

Featured research published by Xuegong Zhou.

field-programmable technology | 2006

On-line scheduling of real-time tasks for reconfigurable computing system

Xuegong Zhou; Ying Wang; Xun-zhang Huang; Cheng-Lian Peng

Efficient task scheduling is essential for obtaining high performance in reconfigurable computing systems. Previous research has mostly concentrated on the spatial placement of tasks and paid little attention to temporal factors. This paper focuses on the on-line scheduling of real-time tasks with known execution times and introduces the notion of recognition-complete for scheduling algorithms: a recognition-complete algorithm arranges the start time of a newly arrived task as early as possible. A new on-line scheduling algorithm is proposed that achieves recognition-completeness by using a time-window technique. Simulation results show that the proposed algorithm markedly improves scheduling performance over previous algorithms.
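The idea of starting a newly arrived task as early as possible can be illustrated with a minimal sketch. It assumes a simplified one-dimensional device model (tasks occupy contiguous columns) and is not the algorithm from the paper; the names `Placed` and `earliest_start` are hypothetical. In this model the earliest feasible start time is either the arrival time or the end time of an already scheduled task, so only those instants are tried.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Placed:
    x: int      # leftmost column occupied by the task
    width: int  # number of contiguous columns
    start: int  # start time (inclusive)
    end: int    # end time (exclusive)

def earliest_start(placed, total_cols, width, duration, arrival):
    """Return (start_time, left_column) of the earliest feasible slot, or None.

    Assumes width >= 1 and duration >= 1.
    """
    if width > total_cols:
        return None
    # Candidate start times: the arrival time and every later task end time.
    candidates = sorted({arrival} | {p.end for p in placed if p.end > arrival})
    for t in candidates:
        # Mark columns used by any task overlapping [t, t + duration).
        busy = [False] * total_cols
        for p in placed:
            if p.start < t + duration and t < p.end:
                for c in range(p.x, p.x + p.width):
                    busy[c] = True
        # Scan for the first run of `width` free columns.
        run = 0
        for c in range(total_cols):
            run = 0 if busy[c] else run + 1
            if run == width:
                return t, c - width + 1
    return None

placed = [Placed(x=0, width=6, start=0, end=10), Placed(x=6, width=4, start=0, end=4)]
print(earliest_start(placed, total_cols=10, width=4, duration=3, arrival=1))
```

In the usage above, the new task cannot start on arrival because all columns are busy, so it is placed at the instant the narrower task finishes.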

field programmable logic and applications | 2016

A high performance FPGA-based accelerator for large-scale convolutional neural networks

Huimin Li; Xitian Fan; Li Jiao; Wei Cao; Xuegong Zhou; Lingli Wang

In recent years, machine learning algorithms based on convolutional neural networks (CNNs) have been widely applied in computer vision. However, large-scale CNNs are computation-intensive, memory-intensive and resource-consuming, which poses many challenges for CNN implementations. This work proposes an end-to-end FPGA-based CNN accelerator with all the layers mapped on one chip so that different layers can work concurrently in a pipelined structure to increase the throughput. A methodology that finds the optimized parallelism strategy for each layer is proposed to achieve high throughput and high resource utilization. In addition, a batch-based computing method is implemented and applied to the fully connected layers (FC layers) to increase memory bandwidth utilization, since these layers are memory-intensive. Further, by applying two different computing patterns to the FC layers, the required on-chip buffers can be reduced significantly. As a case study, a state-of-the-art large-scale CNN, AlexNet, is implemented on a Xilinx VC709. It achieves a peak performance of 565.94 GOP/s and 391 FPS at a 156 MHz clock frequency, which outperforms previous approaches.
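Why batching helps a fully connected layer can be shown with a small software sketch. It is an illustration under simplifying assumptions, not the accelerator's design: each weight access is counted as one fetch from external memory, and the function names `fc_per_sample` and `fc_batched` are hypothetical. Both functions compute the same outputs, but the batched version fetches each weight once and reuses it for every input in the batch.

```python
def fc_per_sample(weights, batch):
    """Process inputs one at a time; every weight is fetched once per input."""
    fetches = 0
    outputs = []
    for x in batch:
        y = []
        for row in weights:
            acc = 0
            for w, xi in zip(row, x):
                fetches += 1
                acc += w * xi
            y.append(acc)
        outputs.append(y)
    return outputs, fetches

def fc_batched(weights, batch):
    """Fetch each weight once and apply it to all inputs in the batch."""
    fetches = 0
    outputs = [[0] * len(weights) for _ in batch]
    for j, row in enumerate(weights):
        for i, w in enumerate(row):
            fetches += 1
            for b, x in enumerate(batch):
                outputs[b][j] += w * x[i]
    return outputs, fetches

weights = [[1, 2], [3, 4], [5, 6]]   # 3 outputs, 2 inputs
batch = [[1, 1], [2, 0]]             # batch of 2 input vectors
print(fc_per_sample(weights, batch))
print(fc_batched(weights, batch))
```

With a batch of size B, the number of weight fetches drops by a factor of B while the number of multiply-accumulate operations stays the same.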

IEEE Transactions on Very Large Scale Integration Systems | 2013

SPREAD: A Streaming-Based Partially Reconfigurable Architecture and Programming Model

Ying Wang; Xuegong Zhou; Lingli Wang; Jian Yan; Wayne Luk; Cheng-Lian Peng; Jiarong Tong

Partially reconfigurable systems are promising computing platforms for streaming applications, which demand both hardware efficiency and reconfigurable flexibility. To realize the full potential of these systems, a streaming-based partially reconfigurable architecture and unified software/hardware multithreaded programming model (SPREAD) is presented in this paper. SPREAD is a reconfigurable architecture with a unified software/hardware thread interface and high throughput point-to-point streaming structure. It supports dynamic computing resource allocation, runtime software/hardware switching, and streaming-based multithreaded management at the operating system level. SPREAD is designed to provide programmers of streaming applications with a unified view of threads, allowing them to exploit thread, data, and pipeline parallelism; it enhances hardware efficiency while simplifying the development of streaming applications for partially reconfigurable systems. Experimental results targeting cryptography applications demonstrate the feasibility and superior performance of SPREAD. Moreover, the parallelized Advanced Encryption Standard (AES), Data Encryption Standard (DES), and Triple DES (3DES) hardware threads on field-programmable gate arrays show 1.61-4.59 times higher power efficiency than their implementations on state-of-the-art graphics processing units.

computer and information technology | 2006

A MDA based SoC Modeling Approach using UML and SystemC

Ying Wang; Xuegong Zhou; Bo Zhou; Liang Liang; Cheng-Lian Peng

Modeling is an efficient way to improve SoC design efficiency. In this paper, a Model Driven Architecture (MDA) based approach is proposed to combine the capabilities of the newly released Unified Modeling Language (UML) 2.0 with SystemC, extending UML to express SystemC concepts while maintaining the mappings between them. This approach promotes stepwise, semiautomatic conversion from a UML specification to executable SystemC code. We intend to build a smooth SoC design flow in which the implementation can be derived directly from the specification. The proposed design flow makes use of the graphical modeling capability of UML and produces SystemC code for further analysis.

Journal of International Medical Research | 2011

Comparison of the Predictability, Uniformity and Stability of a Laser in Situ Keratomileusis Corneal Flap Created with a VisuMax Femtosecond Laser or a Moria Microkeratome

Peijun Yao; Yingxiao Xu; Xuegong Zhou

This prospective study compared the predictability, uniformity and stability of laser in situ keratomileusis corneal flap thickness created by a femtosecond laser or a classic microkeratome. Twenty-five femtosecond laser (VisuMax, Carl Zeiss Meditec) flaps and 38 microkeratome (Moria M3) flaps were measured using anterior segment optical coherence tomography at 1 week, 1 month and 6 months postoperatively. Flap thickness at seven points on each of the four meridians was calculated. At 6 months, VisuMax flaps showed better prediction than Moria flaps for central flap thickness. The standard deviation within individual flaps was smaller for VisuMax flaps and their index of symmetry was better. The mean thicknesses among the four eccentricities in the VisuMax flaps were the same, while Moria flaps were thicker at the 3-mm radius compared with the centre. The VisuMax femtosecond laser created corneal flaps with better predictability and uniformity, and similar reproducibility and stability, compared with the microkeratome.

field-programmable technology | 2012

A partially reconfigurable architecture supporting hardware threads

Ying Wang; Jian Yan; Xuegong Zhou; Lingli Wang; Wayne Luk; Cheng-Lian Peng; Jiarong Tong

As a promising computing platform for stream processing, partially reconfigurable systems have shown their hardware efficiency and reconfiguration flexibility. This paper presents a partially reconfigurable architecture supporting hardware threads. It provides a unified software/hardware thread interface and a high-throughput point-to-point streaming structure. Dynamic computing resource allocation and streaming-based multi-threaded management are also provided at the operating system level. Programmers can exploit the inherent thread, data and pipeline parallelism through a unified view of threads, which enhances hardware efficiency while improving productivity. Experimental results on a cryptography application demonstrate the feasibility and superior performance of the architecture. Moreover, the parallelized AES, DES and 3DES hardware threads on field-programmable gate arrays show 1.61-4.59 times higher power efficiency than their implementations on state-of-the-art graphics processing units.

field-programmable logic and applications | 2007

Fast On-Line Task Placement and Scheduling on Reconfigurable Devices

Xuegong Zhou; Ying Wang; Xun-zhang Huang; Cheng-Lian Peng

This paper focuses on on-line placement and scheduling of tasks with known execution times on reconfigurable devices. The notion of recognition-earliest for scheduling algorithms is introduced: such an algorithm arranges the start time of a newly arrived task as early as possible. A new scheduling algorithm is proposed that attains recognition-earliest by exploiting knowledge of the temporal properties of each task. A fast placement algorithm is also presented. The evaluation results show that the proposed placement algorithm is one of the fastest algorithms, and that the proposed scheduling algorithm achieves the best performance compared with previous algorithms while incurring a low runtime cost.

field-programmable technology | 2013

Implementation of high performance hardware architecture of OpenSURF algorithm on FPGA

Xitian Fan; Chenlu Wu; Wei Cao; Xuegong Zhou; Shengye Wang; Lingli Wang

This paper proposes a high performance hardware architecture of the Speeded Up Robust Features (SURF) algorithm based on OpenSURF. To achieve a high processing frame rate, the hardware architecture is designed with several characteristics. Firstly, a sliding window method is proposed to extract feature points in parallel at selected scale levels. As a result, the time cost of feature extraction is greatly reduced. Secondly, a data reuse strategy is proposed in orientation generation and descriptor generation to reduce the number of memory accesses. In this way, speedups of 3.87x and 2.25x are achieved, respectively. Thirdly, the integral image is segmented and buffered in different memory blocks to support multiple data accesses in one clock cycle, which further reduces the overall computation time of the implementation. The hardware architecture is implemented on an XC6VSX475T FPGA at 156 MHz, and its maximal frame rate for VGA format images reaches 356 frames per second (fps), which is 6.25 times the frame rate of OpenSURF running on a server with a Xeon 5650 processor, and 6 times the reported frame rate of a recent implementation on three Virtex-4 FPGAs [8].
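The role of the integral image can be made concrete with a short software sketch. It shows the standard integral-image technique that SURF relies on, not the hardware design described above; the function names `integral_image` and `box_sum` are hypothetical. Any rectangular sum needs exactly four reads from the integral image, which suggests why being able to serve several reads in one clock cycle matters.

```python
def integral_image(img):
    """Return an (h+1) x (w+1) table where ii[r][c] is the sum of img[:r][:c]."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        row_sum = 0
        for c in range(w):
            row_sum += img[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii

def box_sum(ii, top, left, height, width):
    """Sum of the height x width box whose top-left pixel is (top, left)."""
    bottom, right = top + height, left + width
    # Four reads, independent of the box size.
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
print(box_sum(ii, 0, 0, 3, 3))  # whole image
print(box_sum(ii, 1, 1, 2, 2))  # bottom-right 2x2 box
```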

international conference on solid-state and integrated circuits technology | 2008

A novel packing algorithm for sparse crossbar FPGA architectures

Kanwen Wang; Meng Yang; Lingli Wang; Xuegong Zhou; Jiarong Tong

Cluster-based FPGAs can significantly improve timing and routability. Packing is introduced in the CAD flow to pack logic elements into clusters. In order to reduce unnecessary connectivity within a cluster, sparse crossbar FPGA architectures are under investigation. This paper proposes a novel packing algorithm using a direct graph searching method and a connection gain function. Experimental results show that a half-populated crossbar FPGA architecture achieves a 7% area improvement compared to its fully populated counterpart, with only a 3% overhead in the number of external nets.

computer supported cooperative work in design | 2007

Online Hybrid Task Scheduling in Reconfigurable Systems

Liang Liang; Xuegong Zhou; Ying Wang; Cheng-Lian Peng

This paper mainly discusses the online task scheduling problem on hybrid CPU-FPGA reconfigurable systems. In these systems, hybrid tasks may be binary code executed on the CPU as well as hardware logic circuits implemented on the FPGA. Task scheduling algorithms of conventional operating systems are not suitable for scheduling hybrid tasks on a CPU-FPGA architecture. Based on a real reconfigurable system prototype, we present a task scheduler model and a corresponding algorithm for scheduling software, hardware and hybrid tasks. The algorithm combines task allocation, task placement and task migration. Simulation results demonstrate that the algorithm provides better scheduling performance and reduces the scheduling rejection rate by exploiting the flexibility of hybrid tasks.
