Network


Latest external collaboration at the country level. Dive into details by clicking on the dots.

Hotspot


Dive into the research topics where Xuehai Zhou is active.

Publication


Featured research published by Xuehai Zhou.


architectural support for programming languages and operating systems | 2015

PuDianNao: A Polyvalent Machine Learning Accelerator

Daofu Liu; Tianshi Chen; Shaoli Liu; Jinhong Zhou; Shengyuan Zhou; Olivier Temam; Xiaobing Feng; Xuehai Zhou; Yunji Chen

Machine Learning (ML) techniques are pervasive tools in various emerging commercial applications, but they must be accommodated by powerful computer systems to process very large data. Although general-purpose CPUs and GPUs provide straightforward solutions, their energy efficiency is limited by their excessive support for flexibility. Hardware accelerators can achieve better energy efficiency, but each accelerator often accommodates only a single ML technique (family). According to the famous No-Free-Lunch theorem in the ML domain, however, an ML technique that performs well on one dataset may perform poorly on another, which implies that such an accelerator may sometimes lead to poor learning accuracy. Even setting learning accuracy aside, such an accelerator can become inapplicable simply because the concrete ML task is altered, or because the user chooses another ML technique. In this study, we present an ML accelerator called PuDianNao, which accommodates seven representative ML techniques: k-means, k-nearest neighbors, naive Bayes, support vector machines, linear regression, classification trees, and deep neural networks. Benefiting from a thorough analysis of the computational primitives and locality properties of different ML techniques, PuDianNao can perform up to 1056 GOP/s (e.g., additions and multiplications) in an area of 3.51 mm^2, and consumes only 596 mW. Compared with the NVIDIA K20M GPU (28nm process), PuDianNao (65nm process) is 1.20x faster, and reduces energy consumption by 128.41x.
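As a quick sanity check on the figures quoted in the abstract above, the reported throughput, power, and area can be turned into efficiency numbers. The derived quantities below are our own back-of-envelope arithmetic, not values stated in the paper:

```python
# Back-of-envelope check of PuDianNao's reported figures.
# Inputs come from the abstract above; the derived efficiency
# numbers are our own arithmetic, not values from the paper.
peak_gops = 1056.0   # peak throughput, GOP/s
power_w = 0.596      # reported power, 596 mW
area_mm2 = 3.51      # reported die area

gops_per_watt = peak_gops / power_w   # energy efficiency
gops_per_mm2 = peak_gops / area_mm2   # area efficiency

print(round(gops_per_watt))   # roughly 1772 GOP/s per watt
print(round(gops_per_mm2))    # roughly 301 GOP/s per mm^2
```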


great lakes symposium on vlsi | 2010

Write activity reduction on flash main memory via smart victim cache

Liang Shi; Chun Jason Xue; Jingtong Hu; Wei-Che Tseng; Xuehai Zhou; Edwin Hsing-Mean Sha

Flash memory is a desirable candidate for main memory replacement in embedded systems due to its low leakage power consumption, high density, and non-volatility. There are two challenges in using flash memory as main memory. First, write operations are much slower than read operations. Second, the lifetime of flash memory depends on the number of write/erase operations. In this paper, we introduce a smart victim cache architecture that reduces write activity by exploiting the coarse-grained access characteristics of NAND flash memory. Experimental results show that the proposed approaches reduce write activity on flash main memory by 65.38% on average compared to a traditional architecture.
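The core idea of a victim cache in front of flash is that lines evicted from the main cache get a second chance before being written back, so re-referenced lines never cost a flash write. The sketch below is an illustrative toy model of that idea (not the paper's actual scheme, and the class and method names are our own):

```python
from collections import OrderedDict

class VictimCache:
    """Toy victim cache in front of flash main memory: lines evicted
    from the main cache are parked here, and a later hit rescues them
    without costing a flash write. Illustrative only, not the paper's
    exact architecture."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()   # line address -> data, LRU order
        self.flash_writes = 0        # writes that actually reached flash

    def evict_into(self, addr, data):
        """Main cache evicts a dirty line into the victim cache."""
        if len(self.lines) >= self.capacity:
            # The victim cache's own LRU victim finally goes to flash.
            self.lines.popitem(last=False)
            self.flash_writes += 1
        self.lines[addr] = data

    def rescue(self, addr):
        """A re-reference hit pulls the line back, avoiding a flash write."""
        if addr in self.lines:
            self.lines.move_to_end(addr)
            return self.lines[addr]
        return None
```

Driving this model with a write trace and comparing `flash_writes` against the no-victim-cache case (every eviction hits flash) shows where the reported reduction in write activity comes from.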


IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2017

DLAU: A Scalable Deep Learning Accelerator Unit on FPGA

Chao Wang; Lei Gong; Qi Yu; Xi Li; Yuan Xie; Xuehai Zhou

As an emerging field of machine learning, deep learning shows excellent ability in solving complex learning problems. However, network sizes become increasingly large to meet the demands of practical applications, which poses a significant challenge to constructing high-performance implementations of deep learning neural networks. To improve performance while maintaining low power cost, in this paper we design the deep learning accelerator unit (DLAU), a scalable accelerator architecture for large-scale deep learning networks using a field-programmable gate array (FPGA) as the hardware prototype. The DLAU accelerator employs three pipelined processing units to improve throughput and utilizes tile techniques to exploit locality in deep learning applications. Experimental results on a state-of-the-art Xilinx FPGA board demonstrate that the DLAU accelerator achieves up to 36.1x speedup compared to the Intel Core2 processors, with power consumption of 234 mW.


embedded software | 2011

ExLRU: a unified write buffer cache management for flash memory

Liang Shi; Jianhua Li; Chun Jason Xue; Chengmo Yang; Xuehai Zhou

NAND flash memory has been widely adopted in embedded systems as secondary storage. Yet the further development of flash memory strongly hinges on tackling its inherent unfavorable characteristics, including read/write speed asymmetry, the inability to update in place, and costly erase operations. While the Write Buffer Cache (WBC) has been proposed to enhance the performance of write operations, developing a unified WBC management scheme that is effective for diverse access patterns remains a challenging task. In this paper, a novel WBC management scheme named Expectation-based LRU (ExLRU) is proposed to improve the performance of write operations while reducing the number of erase operations on flash memory. ExLRU accurately maintains access history information in the WBC, based on which a new cost model selects the data with minimum write cost to be written to flash memory. An efficient ExLRU implementation with negligible hardware overhead is further developed. Simulation results show that ExLRU outperforms state-of-the-art WBC management schemes under various workloads.


IEEE/ACM Transactions on Computational Biology and Bioinformatics | 2015

Heterogeneous cloud framework for big data genome sequencing

Chao Wang; Xi Li; Peng Chen; Aili Wang; Xuehai Zhou; Hong Yu

Next-generation genome sequencing with short (long) reads is an emerging field in numerous scientific and big data research domains. However, data sizes and the need for easy access by scientific researchers are growing, and most current methodologies rely on a single acceleration approach, so they cannot meet the requirements imposed by explosive data scales and complexities. In this paper, we propose a novel FPGA-based acceleration solution with a MapReduce framework on multiple hardware accelerators. The combination of hardware acceleration and the MapReduce execution flow can greatly accelerate the task of aligning short reads to a known reference genome. To evaluate performance and other metrics, we conducted a theoretical speedup analysis on a MapReduce programming platform, which demonstrates that our proposed architecture has significant potential to improve the speedup of large-scale genome sequencing applications. As a practical study, we have also built a hardware prototype on a real Xilinx FPGA chip. Significant metrics on speedup, sensitivity, mapping quality, error rate, and hardware cost are evaluated. Experimental results demonstrate that the proposed platform can efficiently accelerate next-generation sequencing with satisfactory accuracy and acceptable hardware cost.


design, automation, and test in europe | 2015

SODA: software defined FPGA based accelerators for big data

Chao Wang; Xi Li; Xuehai Zhou

FPGA has become an emerging substrate for novel big data architectures and systems, due to its high efficiency and low power consumption. It enables researchers to deploy massive accelerators within a single chip. In this paper, we present a software-defined FPGA-based accelerator framework for big data, named SODA, which can reconstruct and reorganize its acceleration engines according to the requirements of various data-intensive applications. SODA decomposes large and complex applications into coarse-grained, single-purpose RTL code libraries that perform specialized tasks in out-of-order hardware. We built a prototype system and evaluated the SODA framework with constrained shortest path finding (CSPF) case studies. SODA achieves up to 43.75x speedup on a 128-node application. Furthermore, the hardware cost of the SODA framework demonstrates that it can achieve high speedup with moderate hardware utilization.


ieee international conference on services computing | 2011

SOMP: Service-Oriented Multi Processors

Chao Wang; Junneng Zhang; Xuehai Zhou; Xiaojing Feng; Xiaoning Nie

Multi-processor system-on-chip (MPSoC) has been widely applied in embedded systems design. However, designing and implementing prototype chips for diverse applications poses great challenges due to differing instruction set architectures (ISAs), programming interfaces, and software tool chains. To solve this problem, we introduce SOA into MPSoC design, because it provides flexibility and extensibility for MPSoC chip design at lower cost by adopting reusable, self-contained modules in the design process. In this paper, we propose SOMP, a service-oriented multi-processor that integrates embedded processors and hardware IP cores as computing servants on a single chip. SOMP provides unified programming interfaces for users by utilizing diverse computing resources. To demonstrate the performance of SOMP, we implemented it on a Digilent Virtex5 LX110T FPGA board and designed several sample test applications for verification purposes. The experimental results show that SOMP greatly improves parallelism and achieves 95.7% of the theoretical speedup on average.


ACM Transactions on Architecture and Code Optimization | 2013

MP-Tomasulo: A Dependency-Aware Automatic Parallel Execution Engine for Sequential Programs

Chao Wang; Xi Li; Junneng Zhang; Xuehai Zhou; Xiaoning Nie

This article presents MP-Tomasulo, a dependency-aware automatic parallel task execution engine for sequential programs. Applying the instruction-level Tomasulo algorithm to MPSoC environments, MP-Tomasulo detects and eliminates Write-After-Write (WAW) and Write-After-Read (WAR) inter-task dependencies in the dataflow execution, thereby enabling out-of-order task execution on heterogeneous units. We implemented the prototype system within a single FPGA. Experimental results on EEMBC applications demonstrate that MP-Tomasulo can execute tasks out of order to achieve as much as 93.6% to 97.6% of the ideal peak speedup. A comparative study against a state-of-the-art dataflow execution scheme is illustrated with a classic JPEG application. The promising results show that MP-Tomasulo enables programmers to uncover more task-level parallelism on heterogeneous systems, as well as easing the programming burden.


real time technology and applications symposium | 2011

Cooperating Write Buffer Cache and Virtual Memory Management for Flash Memory Based Systems

Liang Shi; Chun Jason Xue; Xuehai Zhou


international conference on smart grid communications | 2012

Smart Grid communication using next generation heterogeneous wireless networks

Rahul Amin; Jim Martin; Xuehai Zhou
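The MP-Tomasulo abstract above describes eliminating WAR/WAW inter-task dependencies so that only true data dependencies order execution. A toy sketch of that renaming idea (our own illustration, not the authors' implementation, with hypothetical function and variable names) looks like this:

```python
# Toy illustration of Tomasulo-style renaming for tasks, not the
# authors' engine: each write gets a fresh "tag", so only true
# read-after-write (RAW) dependencies order the tasks, while
# WAR/WAW hazards impose no ordering at all.
def rename(tasks):
    """tasks: list of (dsts, srcs) tuples over named buffers.
    Returns, per task, the sorted indices of producer tasks it
    must wait for (its RAW dependencies)."""
    latest_writer = {}   # buffer name -> index of the last writing task
    deps = []
    for i, (dsts, srcs) in enumerate(tasks):
        # Only RAW edges survive renaming.
        deps.append(sorted({latest_writer[s] for s in srcs
                            if s in latest_writer}))
        for d in dsts:
            latest_writer[d] = i   # fresh tag: later readers see task i
    return deps

# Task 2 rewrites buffer "a" (WAW with task 0, WAR with task 1),
# yet after renaming it depends on nothing and can run out of order.
print(rename([(["a"], []), (["b"], ["a"]), (["a"], ["c"])]))
```

Here task 1 carries the only surviving edge (it reads "a" produced by task 0); tasks 0 and 2 are free to execute on any idle unit, which is the source of the out-of-order speedup the abstract reports.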

Collaboration


Dive into Xuehai Zhou's collaboration.

Top Co-Authors

Chao Wang, University of Science and Technology of China
Xi Li, University of Science and Technology of China
Aili Wang, University of Science and Technology of China
Junneng Zhang, University of Science and Technology of China
Peng Chen, University of Science and Technology of China
Lei Gong, University of Science and Technology of China
Changlong Li, University of Science and Technology of China
Hang Zhuang, University of Science and Technology of China
Xiaojing Feng, University of Science and Technology of China
Xianglan Chen, University of Science and Technology of China