Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Xuehai Qian is active.

Publication


Featured research published by Xuehai Qian.


high-performance computer architecture | 2017

PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning

Linghao Song; Xuehai Qian; Hai Li; Yiran Chen

Convolutional neural networks (CNNs) are the heart of deep learning applications. Recent works PRIME [1] and ISAAC [2] demonstrated the promise of using resistive random access memory (ReRAM) to perform neural computations in memory. We found that training cannot be efficiently supported with the current schemes. First, they do not consider weight update and the complex data dependencies in the training procedure. Second, ISAAC attempts to increase system throughput with a very deep pipeline, which is only beneficial when a large number of consecutive images can be fed into the architecture. In training, the notion of a batch (e.g., 64) limits the number of images that can be processed consecutively, because the images in the next batch need to be processed based on the updated weights. Third, the deep pipeline in ISAAC is vulnerable to pipeline bubbles and execution stalls. In this paper, we present PipeLayer, a ReRAM-based PIM accelerator for CNNs that supports both training and testing. We analyze data dependency and weight update in training algorithms and propose an efficient pipeline to exploit inter-layer parallelism. To exploit intra-layer parallelism, we propose a highly parallel design based on the notions of parallelism granularity and weight replication. With these design choices, PipeLayer enables highly pipelined execution of both training and testing, without introducing the potential stalls of previous work. The experimental results show that PipeLayer achieves an average speedup of 42.45x over a GPU platform, with an average energy saving of 7.17x compared with the GPU implementation.
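The batch argument above can be made concrete with a back-of-envelope model (illustrative only, not from the paper; the layer count, image count, and batch size are made-up numbers): a deep pipeline amortizes its fill time only when images stream continuously, while draining at every batch boundary for a weight update pays that fill time again and again.

```python
# Illustrative model of pipeline fill/drain cost; all parameters are hypothetical.
def pipelined_cycles(num_images, depth, batch=None):
    """Cycles to push `num_images` through a `depth`-stage pipeline.
    If `batch` is given, the pipeline must drain at every batch boundary
    because the next batch must use the updated weights."""
    if batch is None:
        return depth + num_images - 1          # fill once, then one image per cycle
    full, rem = divmod(num_images, batch)
    groups = [batch] * full + ([rem] if rem else [])
    return sum(depth + g - 1 for g in groups)  # refill for every batch

layers, images, batch = 22, 4096, 64
print(pipelined_cycles(images, layers))         # inference-style streaming: 4117
print(pipelined_cycles(images, layers, batch))  # training with batch drains: 5440
```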


architectural support for programming languages and operating systems | 2017

SC-DCNN: Highly-Scalable Deep Convolutional Neural Network using Stochastic Computing

Ao Ren; Zhe Li; Caiwen Ding; Qinru Qiu; Yanzhi Wang; Ji Li; Xuehai Qian; Bo Yuan

With the recent advances in wearable devices and the Internet of Things (IoT), it becomes attractive to implement Deep Convolutional Neural Networks (DCNNs) in embedded and portable systems. Currently, executing software-based DCNNs requires high-performance servers, restricting their widespread deployment on embedded and mobile IoT devices. To overcome this obstacle, considerable research effort has been made to develop highly parallel, specialized DCNN accelerators using GPGPUs, FPGAs or ASICs. Stochastic Computing (SC), which uses a bit-stream to represent a number within [-1, 1] by counting the number of ones in the bit-stream, has high potential for implementing DCNNs with high scalability and an ultra-low hardware footprint. Since multiplications and additions can be calculated using AND gates and multiplexers in SC, significant reductions in power (energy) and hardware footprint can be achieved compared to conventional binary arithmetic implementations. The tremendous savings in power (energy) and hardware resources open an immense design space for enhancing the scalability and robustness of hardware DCNNs. This paper presents SC-DCNN, the first comprehensive design and optimization framework for SC-based DCNNs, built using a bottom-up approach. We first present designs of function blocks that perform the basic operations in a DCNN, including the inner product, pooling, and activation function. We then propose four designs of feature extraction blocks, which are in charge of extracting features from input feature maps, by connecting the basic function blocks with joint optimization. Moreover, efficient weight storage methods are proposed to reduce area and power (energy) consumption. Putting it all together, with carefully selected feature extraction blocks, SC-DCNN is holistically optimized to minimize area and power (energy) consumption while maintaining high network accuracy. Experimental results demonstrate that the LeNet-5 implemented in SC-DCNN consumes only 17 mm² of area and 1.53 W of power, and achieves a throughput of 781,250 images/s, an area efficiency of 45,946 images/s/mm², and an energy efficiency of 510,734 images/J.
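As a rough illustration of why SC arithmetic is so cheap, the sketch below uses the simpler unipolar encoding (values in [0, 1]) rather than the paper's bipolar [-1, 1] encoding; the principle is the same: a number is the fraction of ones in a bit-stream, multiplying two independent streams is a bitwise AND, and scaled addition is a multiplexer.

```python
import random

N = 10_000  # stream length; longer streams reduce the sampling error

def encode(x):
    """Encode x in [0, 1] as a random bit-stream whose mean is x."""
    return [1 if random.random() < x else 0 for _ in range(N)]

def decode(stream):
    return sum(stream) / len(stream)

a, b = encode(0.6), encode(0.3)

# Multiplication: the AND of two independent streams has mean 0.6 * 0.3.
product = [x & y for x, y in zip(a, b)]

# Scaled addition: a multiplexer that selects each input with probability 1/2
# produces a stream with mean (0.6 + 0.3) / 2.
select = encode(0.5)
scaled_sum = [x if s else y for s, x, y in zip(select, a, b)]

print(decode(product))     # ~0.18
print(decode(scaled_sum))  # ~0.45
```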


architectural support for programming languages and operating systems | 2017

DudeTM: Building Durable Transactions with Decoupling for Persistent Memory

Mengxing Liu; Mingxing Zhang; Kang Chen; Xuehai Qian; Yongwei Wu; Weimin Zheng; Jinglei Ren

Emerging non-volatile memory (NVM) offers non-volatility, byte-addressability and fast access at the same time. To make the best use of these properties, empirical evidence has shown that programs should access NVM directly through CPU load and store instructions, so that the overhead of a traditional file system or database can be avoided. Thus, durable transactions have become a common choice for applications accessing persistent memory data in a crash-consistent manner. However, existing durable transaction systems employ either undo logging, which requires a fence for every memory write, or redo logging, which requires intercepting all memory reads within transactions. This paper presents DudeTM, a crash-consistent durable transaction system that avoids the drawbacks of both undo logging and redo logging. DudeTM uses shadow DRAM to decouple the execution of a durable transaction into three fully asynchronous steps. The advantage is that only minimal fences and no memory-read instrumentation are required. This design also enables an out-of-the-box transactional memory (TM) to be used as an independent component in our system. The evaluation results show that DudeTM adds durability to a TM system with only 7.4% to 24.6% throughput degradation. Compared to existing durable transaction systems, DudeTM provides 1.7x to 4.4x higher throughput. Moreover, DudeTM can be implemented with existing hardware TMs with minor hardware modifications, leading to a further 1.7x speedup.
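A conceptual sketch of the decoupling idea follows (this is not DudeTM's code: the names are invented, a lock stands in for the off-the-shelf transactional memory, and on real NVM the persist step would issue cache-line write-backs and fences). The point is that the transaction's critical path only runs step 1; persisting and replaying the redo log happen asynchronously.

```python
import threading

shadow = {}          # volatile shadow DRAM copy of the data
persistent = {}      # stands in for the data image kept in NVM
persisted_logs = []  # stands in for a redo-log region in NVM
tm_lock = threading.Lock()

def perform(writes):
    """Step 1: run the transaction on the shadow copy and build a redo log."""
    with tm_lock:                      # an off-the-shelf TM would go here
        redo_log = list(writes.items())
        shadow.update(writes)
    return redo_log

def persist(redo_log):
    """Step 2: make the redo log durable (flush + fence on real hardware)."""
    persisted_logs.append(redo_log)

def reproduce():
    """Step 3: lazily apply persisted logs to the persistent copy."""
    while persisted_logs:
        for key, value in persisted_logs.pop(0):
            persistent[key] = value

log = perform({"balance:alice": 40, "balance:bob": 60})
persist(log)     # may run asynchronously after perform returns
reproduce()      # may run in the background, possibly batched
print(persistent)
```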


architectural support for programming languages and operating systems | 2013

Volition: scalable and precise sequential consistency violation detection

Xuehai Qian; Josep Torrellas; Benjamin Sahelices; Depei Qian

Sequential Consistency (SC) is the most intuitive memory model, and SC Violations (SCVs) produce unintuitive, typically incorrect executions. Most prior SCV detection schemes have used data races as proxies for SCVs, which is highly imprecise. Other schemes that target data-race cycles are either too conservative or are designed only for two-processor cycles and snoopy-based systems. This paper presents Volition, the first hardware scheme that detects SCVs in a relaxed-consistency machine precisely, in a scalable manner, and for an arbitrary number of processors in the cycle. Volition leverages cache coherence protocol transactions to dynamically detect cycles in memory-access orders across threads. When a cycle is about to occur, an exception is triggered. Volition can be used in both directory- and snoopy-based coherence protocols. Our simulations of Volition in a 64-processor multicore with directory-based coherence running SPLASH-2 and PARSEC programs show that Volition induces negligible traffic and execution overhead. In addition, it can detect SCVs involving several processors. Volition is suitable for on-the-fly use.
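The core observation can be illustrated in software (a simplified sketch only; Volition tracks these orders in hardware through coherence-protocol transactions rather than an explicit graph): an SC violation shows up as a cycle in the order among conflicting accesses from different threads, so detection reduces to cycle detection.

```python
from collections import defaultdict

edges = defaultdict(set)  # "a is ordered before b" edges among memory accesses

def add_order(a, b):
    """Record that access `a` is ordered before access `b`."""
    edges[a].add(b)

def has_cycle():
    WHITE, GRAY, BLACK = 0, 1, 2
    color = defaultdict(int)
    def dfs(node):
        color[node] = GRAY
        for succ in edges[node]:
            if color[succ] == GRAY or (color[succ] == WHITE and dfs(succ)):
                return True
        color[node] = BLACK
        return False
    return any(color[n] == WHITE and dfs(n) for n in list(edges))

# Dekker-style two-thread pattern: each read observes the value from before
# the other thread's write, yet program order places each write first.
add_order("T0:write x", "T0:read y")   # program order in T0
add_order("T0:read y", "T1:write y")   # T0 read y before T1 wrote it
add_order("T1:write y", "T1:read x")   # program order in T1
add_order("T1:read x", "T0:write x")   # T1 read x before T0 wrote it
print(has_cycle())  # True: no single SC ordering explains this execution
```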


international symposium on microarchitecture | 2017

CirCNN: accelerating and compressing deep neural networks using block-circulant weight matrices

Caiwen Ding; Siyu Liao; Yanzhi Wang; Zhe Li; Ning Liu; Youwei Zhuo; Chao Wang; Xuehai Qian; Yu Bai; Geng Yuan; Xiaolong Ma; Yipeng Zhang; Jian Tang; Qinru Qiu; Xue Lin; Bo Yuan

Large-scale deep neural networks (DNNs) are both compute and memory intensive. As the size of DNNs continues to grow, it is critical to improve their energy efficiency and performance while maintaining accuracy. For DNNs, the model size is an important factor affecting performance, scalability and energy efficiency. Weight pruning achieves good compression ratios but suffers from three drawbacks: 1) the irregular network structure after pruning, which affects performance and throughput; 2) the increased training complexity; and 3) the lack of a rigorous guarantee of compression ratio and inference accuracy. To overcome these limitations, this paper proposes CirCNN, a principled approach to represent weights and process neural networks using block-circulant matrices. CirCNN utilizes Fast Fourier Transform (FFT)-based fast multiplication, simultaneously reducing the computational complexity (both in inference and training) from O(n^2) to O(n log n) and the storage complexity from O(n^2) to O(n).
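The complexity claim is easy to check numerically (a small NumPy sketch, not the paper's hardware design): a circulant block is determined by n values, and its matrix-vector product is a circular convolution, which the FFT computes in O(n log n) instead of the O(n^2) dense product.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
c = rng.standard_normal(n)   # the n values defining the circulant block (its first column)
x = rng.standard_normal(n)   # input activations

# Dense reference: materialize the full n x n circulant matrix, O(n^2) storage and work.
C = np.array([np.roll(c, k) for k in range(n)]).T
y_dense = C @ x

# FFT path: elementwise product in the frequency domain, O(n log n) work, O(n) storage.
y_fft = np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

print(np.allclose(y_dense, y_fft))  # True
```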


high-performance computer architecture | 2012

BulkSMT: Designing SMT processors for atomic-block execution

Xuehai Qian; Benjamin Sahelices; Josep Torrellas



high-performance computer architecture | 2013

Rainbow: Efficient memory dependence recording with high replay parallelism for relaxed memory model

Xuehai Qian; He Huang; Benjamin Sahelices; Depei Qian



international symposium on computer architecture | 2014

Pacifier: record and replay for relaxed-consistency multiprocessors with distributed directory protocol

Xuehai Qian; Benjamin Sahelices; Depei Qian



architectural support for programming languages and operating systems | 2018

Wonderland: A Novel Abstraction-Based Out-Of-Core Graph Processing System

Mingxing Zhang; Yongwei Wu; Youwei Zhuo; Xuehai Qian; Chengying Huan; Kang Chen



architectural support for programming languages and operating systems | 2018

Datasize-Aware High Dimensional Configurations Auto-Tuning of In-Memory Cluster Computing

Zhibin Yu; Zhendong Bei; Xuehai Qian


Collaboration


Dive into Xuehai Qian's collaborations.

Top Co-Authors

Bo Yuan

City University of New York
