Sicheng Li
University of Pittsburgh
Publications
Featured research published by Sicheng Li.
field-programmable custom computing machines | 2015
Sicheng Li; Chunpeng Wu; Hai Li; Boxun Li; Yu Wang; Qinru Qiu
The recurrent neural network (RNN) based language model (RNNLM) is a biologically inspired model for natural language processing. It records historical information through additional recurrent connections and is therefore very effective in capturing the semantics of sentences. However, the use of RNNLMs has been greatly hindered by the high computation cost of training. This work presents an FPGA implementation framework for RNNLM training acceleration. At the architectural level, we improve the parallelism of the RNN training scheme and reduce the computing resource requirement to enhance computation efficiency. The hardware implementation primarily targets reducing the data communication load. A multi-thread computation engine is utilized that can successfully mask the long memory latency and reuse frequently accessed data. The evaluation based on the Microsoft Research Sentence Completion Challenge shows that the proposed FPGA implementation outperforms traditional class-based modest-size recurrent networks and achieves 46.2% training accuracy. Moreover, experiments at different network sizes demonstrate the great scalability of the proposed framework.
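The recurrent connection the abstract describes can be made concrete with a minimal Elman-style RNN language model step. This is an illustrative sketch only: the vocabulary size, hidden size, and weight names are assumptions, not details of the paper's FPGA design.

```python
import numpy as np

# Minimal RNN language model step: the recurrence W_hh folds the
# previous hidden state (the "historical information") into the new one.
VOCAB, HIDDEN = 10, 8
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(HIDDEN, VOCAB))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(HIDDEN, HIDDEN))  # hidden -> hidden (recurrent)
W_hy = rng.normal(scale=0.1, size=(VOCAB, HIDDEN))   # hidden -> output

def step(word_id, h_prev):
    """One forward step: new hidden state and next-word distribution."""
    x = np.zeros(VOCAB)
    x[word_id] = 1.0                                 # one-hot current word
    h = np.tanh(W_xh @ x + W_hh @ h_prev)            # history enters via W_hh
    logits = W_hy @ h
    p = np.exp(logits - logits.max())
    p /= p.sum()                                     # softmax over vocabulary
    return h, p

h = np.zeros(HIDDEN)
for w in [1, 4, 2]:                                  # toy sentence prefix
    h, p = step(w, h)
```

Each `step` is dominated by the three matrix-vector products, which is why training cost grows quickly with network size and motivates the hardware acceleration above.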
design, automation, and test in europe | 2013
Jie Guo; Wujie Wen; Yaojun Zhang; Sicheng Li; Hai Li; Yiran Chen
Program disturb, read disturb, and the retention time limit are three major causes of bit errors in NAND flash memory. The adoption of multi-level cell (MLC) technology and technology scaling further aggravates this reliability issue by narrowing threshold-voltage noise margins and introducing larger device variations. Besides implementing error correction codes (ECC) in NAND flash modules, RAID-5 is often deployed at the system level to protect the data integrity of NAND flash storage systems (NFSS), though with significant performance degradation. In this work, we propose a technique called “DA-RAID-5” to improve the performance of enterprise NFSS under RAID-5 protection without harming its reliability (here DA stands for “disturb aware”). Three schemes, namely unbound-disturb limiting (UDL), PE-aware RAID-5, and hybrid caching (HC), are proposed to protect the NFSS at different stages of its lifetime. The experimental results show that, compared to the best prior work, DA-RAID-5 improves NFSS response time by 9.7% on average.
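The RAID-5 protection the abstract builds on is XOR parity across chunks: any single lost chunk can be rebuilt from the survivors. The sketch below shows only this basic mechanism; the paper's DA-RAID-5 schemes (UDL, PE-aware RAID-5, HC) add disturb- and wear-awareness on top of it, which is not modeled here.

```python
# RAID-5 style XOR parity: parity = chunk0 ^ chunk1 ^ ... ^ chunkN.
# XOR-ing parity with all surviving chunks reconstructs the lost one.
def xor_parity(chunks):
    parity = bytes(len(chunks[0]))
    for c in chunks:
        parity = bytes(a ^ b for a, b in zip(parity, c))
    return parity

stripe = [b"\x01\x02", b"\x10\x20", b"\x0f\x0f"]   # three data chunks
p = xor_parity(stripe)

# Simulate losing chunk 1, then rebuild it from the survivors + parity.
rebuilt = xor_parity([stripe[0], stripe[2], p])
assert rebuilt == stripe[1]
```

The performance cost the abstract mentions comes from this same mechanism: every small write must also update the stripe's parity, turning one write into a read-modify-write sequence.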
Integration | 2017
Yiran Chen; Hai Li; Chunpeng Wu; Chang Song; Sicheng Li; Chuhan Min; Hsin-Pai Cheng; Wei Wen; Xiaoxiao Liu
Neuromorphic computing originally referred to hardware that mimics neuro-biological architectures to implement models of neural systems. The concept was then extended to computing systems that can run bio-inspired computing models, e.g., neural networks and deep learning networks. In recent years, the rapid growth of cognitive applications and the limited processing capability of the conventional von Neumann architecture on these applications have motivated worldwide research on neuromorphic computing systems. In this paper, we review the evolution of neuromorphic computing techniques, in both computing models and hardware implementations, from a historical perspective. Various implementation methods and practices are also discussed. Finally, we present some emerging technologies that may change the landscape of neuromorphic computing in the future, e.g., new devices and interdisciplinary computing architectures.
international symposium on circuits and systems | 2016
Sicheng Li; Xiaoxiao Liu; Mengjie Mao; Hai Helen Li; Yiran Chen; Boxun Li; Yu Wang
Developing a heterogeneous system with a hardware accelerator is a promising solution for implementing high-performance applications where explicitly programmed, rule-based algorithms are either infeasible or inefficient. However, mapping a neural network model to a hardware representation is a complex process, in which balancing computation resources and memory accesses is crucial. In this work, we present a systematic approach to optimize a heterogeneous system with an FPGA-based neuromorphic computing accelerator (NCA). For any application, the neural network topology and computation flow of the accelerator can be configured through an NCA-aware compiler. The FPGA-based NCA contains a generic multi-layer neural network composed of a set of parallel neural processing elements. Such a scheme imitates the human cognition process and follows the hierarchy of the neocortex. At the architectural level, we decrease the computing resource requirement to enhance computation efficiency. The hardware implementation primarily targets reducing the data communication load: a multi-thread computation engine is utilized to mask the long memory latency. Such a combined solution can well accommodate the ever-increasing complexity and scale of machine learning applications and improve system performance and efficiency. Through evaluation across eight representative benchmarks, we observed on average 12.1× speedup and 45.8× energy reduction, with marginal accuracy loss compared with CPU-only computation.
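The idea of parallel neural processing elements can be illustrated with a toy partitioning of one layer's matrix-vector product: each PE owns a row block of the weight matrix and computes its output slice independently. The PE count and layer sizes here are illustrative assumptions, not parameters of the NCA design.

```python
import numpy as np

# Toy model of distributing a neural-network layer across parallel
# processing elements (PEs): each PE computes a disjoint slice of the
# output, so the slices can run concurrently in hardware.
N_PE, OUT, IN = 4, 16, 8
rng = np.random.default_rng(1)
W = rng.normal(size=(OUT, IN))     # layer weight matrix
x = rng.normal(size=IN)            # layer input activations

rows_per_pe = OUT // N_PE
partials = [W[i * rows_per_pe:(i + 1) * rows_per_pe] @ x
            for i in range(N_PE)]  # each PE's independent slice
y = np.concatenate(partials)       # gather the PE outputs

assert np.allclose(y, W @ x)       # identical to the monolithic product
```

The partition makes the compute trivially parallel, but note that every PE still reads the full input vector `x`, which is exactly the data-communication load the hardware design above tries to minimize.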
embedded systems for real time multimedia | 2016
Chunpeng Wu; Hsin-Pai Cheng; Sicheng Li; Hai Helen Li; Yiran Chen
Autonomous driving can effectively reduce traffic congestion and road accidents. It is therefore necessary to implement an efficient high-level scene understanding model on an embedded device with limited power and resources. Toward this goal, we propose ApesNet, an efficient pixel-wise segmentation network that understands road scenes in real time and achieves promising accuracy. The key result of our experiments is a significantly lower classification time together with high accuracy compared to other conventional segmentation methods. The model is characterized by efficient training and sufficiently fast testing. Experimentally, we use the well-known CamVid road scene dataset to show the advantages provided by our contributions. We compare our proposed architecture's accuracy and time performance with SegNet. In CamVid training and testing, ApesNet outperforms SegNet in accuracy on eight classes. Additionally, our model is 37% smaller than SegNet. With this advantage, the combined encoding and decoding time for each image is 1.45 to 2.47 times faster than SegNet.
IET Cyber-Physical Systems: Theory & Applications | 2016
Chunpeng Wu; Hsin-Pai Cheng; Sicheng Li; Hai Li; Yiran Chen
Road scene understanding and semantic segmentation are ongoing challenges in computer vision. A precise segmentation can help a machine learning model understand the real world more accurately. In addition, a well-designed efficient model can be used on resource-limited devices. The authors aim to implement an efficient high-level scene understanding model on an embedded device with finite power and resources. Toward this goal, the authors propose ApesNet, an efficient pixel-wise segmentation network which understands road scenes in near real time and achieves promising accuracy. The key result of the authors' experiments is a significantly lower classification time together with high accuracy compared with other conventional segmentation methods. The model is characterised by efficient training and sufficiently fast testing. Experimentally, the authors use two road scene benchmarks, CamVid and Cityscapes, to show the advantages of ApesNet. The authors compare the proposed architecture's accuracy and time performance with SegNet-Basic, a deep convolutional encoder–decoder architecture. ApesNet is 37% smaller than SegNet-Basic in terms of model size. With this advantage, the combined encoding and decoding time for each image is 2.5 times faster than SegNet-Basic.
field programmable custom computing machines | 2017
Sicheng Li; Wei Wen; Yu Wang; Song Han; Yiran Chen; Hai Li
Convolutional neural networks (CNNs) have recently broken many performance records in image recognition and object detection. The success of CNNs, to a great extent, is enabled by the fast scaling-up of networks that learn from huge volumes of data. The deployment of big CNN models can be both computation-intensive and memory-intensive, posing severe challenges for hardware implementations. In recent years, sparsification techniques that prune redundant connections in the networks while retaining similar accuracy have emerged as promising solutions to alleviate the computation overheads associated with CNNs [1]. However, imposing sparsity in CNNs usually generates random network connections, and thus the irregular data access pattern results in poor data locality. The low computation efficiency of sparse networks, caused by the incurred imbalance in computing resource consumption and low memory bandwidth utilization, significantly offsets the theoretical reduction in computation complexity and limits the execution scalability of CNNs on general-purpose architectures [2]. For instance, sparse convolution, an important computation kernel in CNNs, is usually accelerated using data compression schemes in which only the nonzero elements of the kernel weights are stored and sent to multiplication-accumulation (MAC) units at runtime. However, the relevant executions on CPUs and GPUs reach only 0.1% to 10% of the system peak performance, even when dedicated software libraries are applied (e.g., the MKL library for CPUs and the cuSPARSE library for GPUs). Field-programmable gate arrays (FPGAs) have also been extensively studied as an important hardware platform for CNN computations [3]. Different from general-purpose architectures, an FPGA allows users to customize the functions and organization of the designed hardware in order to adapt to various resource needs and data usage patterns.
This characteristic, as we identify in this work, can be leveraged to effectively overcome the main challenges in the execution of sparse CNNs through close coordination between software and hardware. In particular, the reconfigurability of FPGAs helps to 1) better map the sparse CNN onto the hardware to improve computation parallelism and execution efficiency, and 2) eliminate the computation cost associated with zero weights and enhance data reuse to alleviate the adverse impact of irregular data accesses. In this work, we propose a hardware-software co-design framework to address the above challenges in sparse CNN acceleration. First, we introduce a data locality-aware sparsification scheme that optimizes the structure of the sparse CNN during the training phase to make it friendly for hardware mapping. Both memory allocation and data access regularization are considered in the optimization process. Second, we develop a distributed architecture composed of customized processing elements (PEs) that enables high computation parallelism and a high data reuse rate for the compressed network. Moreover, a holistic sparse optimization is introduced into our design framework for hardware platforms with different requirements. We evaluate our proposed framework by executing AlexNet on a Xilinx Zynq ZC706. Our FPGA accelerator obtains a processing power of 71.2 GOPS, corresponding to 271.6 GOPS on the dense CNN model. On average, our FPGA design runs 11.5× faster than a well-tuned CPU implementation on an Intel Xeon E5-2630, and has 3.2× better energy efficiency than the GPU realization on an Nvidia Pascal Titan X. Compared to state-of-the-art FPGA designs [4], our accelerator reduces the classification time by 2.1×.
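Why hardware-friendly sparsification helps can be seen in a toy block-pruning routine: zeroing whole blocks (scored by magnitude) keeps memory accesses regular, unlike elementwise pruning, which scatters nonzeros. The block size, keep ratio, and scoring rule below are illustrative assumptions, not the paper's locality-aware sparsification algorithm.

```python
import numpy as np

# Block pruning sketch: score each (block x block) tile of the weight
# matrix by total magnitude, keep the top fraction, and zero the rest
# as whole tiles so the surviving nonzeros stay contiguous.
def block_prune(W, block=4, keep_ratio=0.5):
    out, inn = W.shape
    Wb = W.reshape(out // block, block, inn // block, block)
    norms = np.abs(Wb).sum(axis=(1, 3))             # one score per tile
    k = int(norms.size * keep_ratio)                # number of tiles to keep
    thresh = np.sort(norms, axis=None)[-k]
    mask = (norms >= thresh)[:, None, :, None]      # drop low-score tiles whole
    return (Wb * mask).reshape(out, inn)

rng = np.random.default_rng(2)
W = rng.normal(size=(8, 8))
Ws = block_prune(W)                                 # half the tiles zeroed
```

Because zeros come in aligned tiles, a hardware accelerator can skip them with a single index per tile rather than per element, which is the kind of regularity the co-design framework above exploits.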
design, automation, and test in europe | 2017
Hsin-Pai Cheng; Wei Wen; Chunpeng Wu; Sicheng Li; Hai Helen Li; Yiran Chen
As a large-scale commercial spiking-based neuromorphic computing platform, the IBM TrueNorth processor has received tremendous attention. However, one of the known issues in the TrueNorth design is the limited precision of synaptic weights. The current workaround is running multiple neural network copies in which the average value of each synaptic weight is close to that in the original network. We theoretically analyze the impact of the TrueNorth chip's low data precision on inference accuracy, core occupation, and performance, and present a probability-biased learning method to enhance inference accuracy by reducing the random variance of each computation copy. Our experimental results show that the proposed techniques considerably improve the computation accuracy of the TrueNorth platform and reduce the incurred hardware and performance overheads. Among all the tested methods, L1TEA regularization achieved the best result, up to 2.74% accuracy enhancement when deploying the MNIST application onto the TrueNorth platform. In May 2016, the IBM TrueNorth team implemented convolutional neural networks (CNNs) on the TrueNorth processor and coincidentally used a similar method, namely trinary weights {-1, 0, 1}, achieving near state-of-the-art accuracy on 8 standard datasets. In addition, to further evaluate TrueNorth performance on CNNs, we test similar deep convolutional networks on TrueNorth, a GPU, and an FPGA. Among all, the GPU has the highest throughput, but in terms of energy consumption, the TrueNorth processor is the most energy-efficient, at over 6,000 frames/sec/Watt.
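The trinary-weight idea mentioned in the abstract can be sketched as a simple magnitude-threshold quantizer mapping real-valued weights into {-1, 0, 1}. The threshold rule and its value here are illustrative assumptions; neither the paper's probability-biased learning method nor IBM's training procedure is reproduced.

```python
import numpy as np

# Trinarize weights by magnitude: small weights snap to 0, the rest to
# their sign. 'delta' controls how aggressively weights are zeroed.
def trinarize(w, delta=0.3):
    q = np.zeros_like(w)
    q[w > delta] = 1.0
    q[w < -delta] = -1.0
    return q

w = np.array([0.9, -0.05, -0.7, 0.2, -0.4])
q = trinarize(w)   # -> [1, 0, -1, 0, -1]
```

With only three weight values, multiplications reduce to additions, subtractions, and skips, which is what makes such low-precision schemes attractive on spiking hardware like TrueNorth.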
international conference on computer aided design | 2016
Sicheng Li; Yandan Wang; Wujie Wen; Yu Wang; Yiran Chen; Hai Li
Sparse matrix-vector multiplication (SpMV) is an important computational kernel in many applications. For performance improvement, software libraries dedicated to SpMV computation have been introduced, e.g., the MKL library for CPUs and the cuSPARSE library for GPUs. However, the computational throughput of these libraries is far below the peak floating-point performance offered by the hardware platforms, because the efficiency of the SpMV kernel is greatly constrained by the limited memory bandwidth and irregular data access patterns. In this work, we propose a data locality-aware design framework for FPGA-based SpMV acceleration. We first include the hardware constraints in sparse matrix compression at the software level to regularize memory allocation and accesses. Moreover, a distributed architecture composed of processing elements is developed to improve computation parallelism. We implement the reconfigurable SpMV kernel on a Convey HC-2ex and conduct the evaluation using the University of Florida sparse matrix collection. The experiments demonstrate an average computational efficiency of 48.2%, substantially higher than that of the CPU and GPU implementations. Our FPGA-based kernel has a runtime comparable to the GPU and achieves a 2.1× reduction over the CPU. Moreover, our design obtains substantial savings in energy consumption: 9.3× and 5.6× better than the CPU and GPU implementations, respectively.
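The irregular access pattern the abstract blames for low SpMV efficiency is visible in a minimal CSR (compressed sparse row) kernel: the stored column indices force a gather from scattered positions of the input vector. This is a generic CSR sketch, not the paper's hardware-constrained compression format.

```python
import numpy as np

# Build CSR storage (values, column indices, row pointers) for a dense
# matrix, then run the SpMV kernel over the compressed form.
def to_csr(A):
    vals, cols, rowptr = [], [], [0]
    for row in A:
        for j, v in enumerate(row):
            if v != 0:
                vals.append(v)
                cols.append(j)
        rowptr.append(len(vals))
    return np.array(vals, float), np.array(cols), np.array(rowptr)

def spmv(vals, cols, rowptr, x):
    y = np.zeros(len(rowptr) - 1)
    for i in range(len(y)):
        lo, hi = rowptr[i], rowptr[i + 1]
        y[i] = vals[lo:hi] @ x[cols[lo:hi]]  # gather: irregular reads of x
    return y

A = np.array([[2.0, 0, 0], [0, 0, 3.0], [1.0, 0, 4.0]])
x = np.array([1.0, 2.0, 3.0])
y = spmv(*to_csr(A), x)
assert np.allclose(y, A @ x)
```

Only the nonzeros are stored and multiplied, but each `x[cols[lo:hi]]` gather touches unpredictable addresses, which is why memory bandwidth, not arithmetic, bounds SpMV throughput on general-purpose hardware.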
ieee computer society annual symposium on vlsi | 2018
Chang Song; Hsin-Pai Cheng; Huanrui Yang; Sicheng Li; Chunpeng Wu; Qing Wu; Yiran Chen; Hai Li