Ninghui Sun | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Ninghui Sun is active.

Explore More

Publication

Featured researches published by Ninghui Sun.

architectural support for programming languages and operating systems | 2014

DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning

Tianshi Chen; Zidong Du; Ninghui Sun; Jia Wang; Chueh-Hung Wu; Yunji Chen; Olivier Temam

Machine-Learning tasks are becoming pervasive in a broad range of domains, and in a broad range of systems (from embedded systems to data centers). At the same time, a small set of machine-learning algorithms (especially Convolutional and Deep Neural Networks, i.e., CNNs and DNNs) are proving to be state-of-the-art across many applications. As architectures evolve towards heterogeneous multi-cores composed of a mix of cores and accelerators, a machine-learning accelerator can achieve the rare combination of efficiency (due to the small number of target algorithms) and broad application scope. Until now, most machine-learning accelerator designs have focused on efficiently implementing the computational part of the algorithms. However, recent state-of-the-art CNNs and DNNs are characterized by their large size. In this study, we design an accelerator for large-scale CNNs and DNNs, with a special emphasis on the impact of memory on accelerator design, performance and energy. We show that it is possible to design an accelerator with a high throughput, capable of performing 452 GOP/s (key NN operations such as synaptic weight multiplications and neurons outputs additions) in a small footprint of 3.02 mm2 and 485 mW; compared to a 128-bit 2GHz SIMD processor, the accelerator is 117.87x faster, and it can reduce the total energy by 21.08x. The accelerator characteristics are obtained after layout at 65 nm. Such a high throughput in a small footprint can open up the usage of state-of-the-art machine-learning algorithms in a broad set of systems and for a broad set of applications.

international symposium on microarchitecture | 2014

DaDianNao: A Machine-Learning Supercomputer

Yunji Chen; Tao Luo; Shaoli Liu; Shijin Zhang; Liqiang He; Jia Wang; Ling Li; Tianshi Chen; Zhiwei Xu; Ninghui Sun; Olivier Temam

Many companies are deploying services, either for consumers or industry, which are largely based on machine-learning algorithms for sophisticated processing of large amounts of data. The state-of-the-art and most popular such machine-learning algorithms are Convolutional and Deep Neural Networks (CNNs and DNNs), which are known to be both computationally and memory intensive. A number of neural network accelerators have been recently proposed which can offer high computational capacity/area ratio, but which remain hampered by memory accesses. However, unlike the memory wall faced by processors on general-purpose workloads, the CNNs and DNNs memory footprint, while large, is not beyond the capability of the on chip storage of a multi-chip system. This property, combined with the CNN/DNN algorithmic characteristics, can lead to high internal bandwidth and low external communications, which can in turn enable high-degree parallelism at a reasonable area cost. In this article, we introduce a custom multi-chip machine-learning architecture along those lines. We show that, on a subset of the largest known neural network layers, it is possible to achieve a speedup of 450.65x over a GPU, and reduce the energy by 150.31x on average for a 64-chip system. We implement the node down to the place and route at 28nm, containing a combination of custom storage and computational units, with industry-grade interconnects.

Nucleic Acids Research | 2016

NONCODE 2016: an informative and valuable data source of long non-coding RNAs

Yi Zhao; Hui Li; Shuangsang Fang; Yue Kang; Wei Wu; Yajing Hao; Ziyang Li; Dechao Bu; Ninghui Sun; Michael Q. Zhang; Runsheng Chen

NONCODE (http://www.bioinfo.org/noncode/) is an interactive database that aims to present the most complete collection and annotation of non-coding RNAs, especially long non-coding RNAs (lncRNAs). The recently reduced cost of RNA sequencing has produced an explosion of newly identified data. Revolutionary third-generation sequencing methods have also contributed to more accurate annotations. Accumulative experimental data also provides more comprehensive knowledge of lncRNA functions. In this update, NONCODE has added six new species, bringing the total to 16 species altogether. The lncRNAs in NONCODE have increased from 210 831 to 527,336. For human and mouse, the lncRNA numbers are 167,150 and 130,558, respectively. NONCODE 2016 has also introduced three important new features: (i) conservation annotation; (ii) the relationships between lncRNAs and diseases; and (iii) an interface to choose high-quality datasets through predicted scores, literature support and long-read sequencing method support. NONCODE is also accessible through http://www.noncode.org/.

ieee international conference on high performance computing data and analytics | 2011

Fast implementation of DGEMM on Fermi GPU

Guangming Tan; Linchuan Li; Sean Triechle; Everett H. Phillips; Yungang Bao; Ninghui Sun

In this paper we present a thorough experience on tuning double-precision matrix-matrix multiplication (DGEM-M) on the Fermi GPU architecture. We choose an optimal algorithm with blocking in both shared memory and registers to satisfy the constraints of the Fermi memory hierarchy. Our optimization strategy is further guided by a performance modeling based on micro-architecture benchmarks. Our optimizations include software pipelining, use of vector memory operations, and instruction scheduling. Our best CUDA algorithm achieves comparable performance with the latest CUBLAS library1. We further improve upon this with an implementation in the native machine language, leading to 20% increase in performance. That is, the achieved peak performance (efficiency) is improved from 302Gflop/s (58%) to 362Gflop/s (70%).

acm symposium on parallel algorithms and architectures | 2007

A parallel dynamic programming algorithm on a multi-core architecture

Guangming Tan; Ninghui Sun; Guang R. Gao

Dynamic programming is an efficient technique to solve combinatorial search and optimization problem. There have been many parallel dynamic programming algorithms. The purpose of this paper is to study a family of dynamic programming algorithm where data dependence appear between non-consecutive stages, in other words, the data dependence is non-uniform. This kind of dynnamic programming is typically called nonserial polyadic dynamic programming. Owing to the non-uniform data dependence, it is harder to optimize this problem for parallelism and locality on parallel architectures. In this paper, we address the chanllenge of exploiting fine grain parallelism and locality of nonserial polyadic dynamic programming on a multi-core architecture. We present a programming and execution model for multi-core architectures with memory hierarchy. In the framework of the new model, the parallelism and locality benifit from a data dependence transformation. We propose a parallel pipelined algorithm for filling the dynamic programming matrix by decomposing the computation operators. The new parallel algorithm tolerates the memory access latency using multi-thread and is easily improved with tile technique. We formulate and analytically solve the optimization problem determing the tile size that minimizes the total execution time. The experiments on a simulator give a validation of the proposed model and show that the fine grain parallel algorithm achieves sub-linear speedup and that a potential high scalability on multi-core arichitecture.

Frontiers of Computer Science in China | 2012

CloudRank-D: benchmarking and ranking cloud computing systems for data processing applications

Chunjie Luo; Jianfeng Zhan; Zhen Jia; Lei Wang; Gang Lu; Lixin Zhang; Cheng Zhong Xu; Ninghui Sun

With the explosive growth of information, more and more organizations are deploying private cloud systems or renting public cloud systems to process big data. However, there is no existing benchmark suite for evaluating cloud performance on the whole system level. To the best of our knowledge, this paper proposes the first benchmark suite CloudRank-D to benchmark and rank cloud computing systems that are shared for running big data applications. We analyze the limitations of previous metrics, e.g., floating point operations, for evaluating a cloud computing system, and propose two simple metrics: data processed per second and data processed per Joule as two complementary metrics for evaluating cloud computing systems. We detail the design of CloudRank-D that considers representative applications, diversity of data characteristics, and dynamic behaviors of both applications and system software platforms. Through experiments, we demonstrate the advantages of our proposed metrics. In several case studies, we evaluate two small-scale deployments of cloud computing systems using CloudRank-D.

IEEE Transactions on Circuits and Systems Ii-express Briefs | 2007

A Reconfigurable Accelerator for Smith–Waterman Algorithm

Xianyang Jiang; Xinchun Liu; Lin Xu; Peiheng Zhang; Ninghui Sun

Scanning bio-sequence database and finding similarities among DNA and protein sequences is basic and important work in bioinformatics field. To solve this problem, Needleman-Wunschh (NW) algorithm is a classical and precise tool, and Smith-Waterman (SW) algorithm is more practical for its capability to find similarities between subsequences. Such algorithms have computational complexity proportional to the length product of both involved sequences, hence processing time becomes insufferable due to exponential growth speed and great amount of bio-sequence database. To alleviate this serious problem, a reconfigurable accelerator for SW algorithm is presented. In the accelerator, a modified equation is proposed to improve mapping efficiency of a processing element (PE), and a special floor plan is applied to a fine-grain parallel PE array and interface components to cut down their routing delay. Basing on the two techniques, the proposed accelerator can reach at 82-MHz frequency in an Altera EP1S30 device. Experiments demonstrate the accelerator provides more than 330 speedup as compared to a standard desktop platform with a 2.8-GHz Xeon processor and 4-GB memory and has 50% improvement on the peak performance of a transferred traditional implementation without using the two special techniques. Our implementation is also about 9% faster than the fastest implementation in a most recent family of SW algorithm accelerators.

programming language design and implementation | 2013

SMAT: an input adaptive auto-tuner for sparse matrix-vector multiplication

Jiajia Li; Guangming Tan; Mingyu Chen; Ninghui Sun

Sparse Matrix Vector multiplication (SpMV) is an important kernel in both traditional high performance computing and emerging data-intensive applications. By far, SpMV libraries are optimized by either application-specific or architecture-specific approaches, making the libraries become too complicated to be used extensively in real applications. In this work we develop a Sparse Matrix-vector multiplication Auto-Tuning system (SMAT) to bridge the gap between specific optimizations and general-purpose usage. SMAT provides users with a unified programming interface in compressed sparse row (CSR) format and automatically determines the optimal format and implementation for any input sparse matrix at runtime. For this purpose, SMAT leverages a learning model, which is generated in an off-line stage by a machine learning method with a training set of more than 2000 matrices from the UF sparse matrix collection, to quickly predict the best combination of the matrix feature parameters. Our experiments show that SMAT achieves impressive performance of up to 51GFLOPS in single-precision and 37GFLOPS in double-precision on mainstream x86 multi-core processors, which are both more than 3 times faster than the Intel MKL library. We also demonstrate its adaptability in an algebraic multigrid solver from Hypre library with above 20% performance improvement reported.

conference on high performance computing (supercomputing) | 2006

Locality and parallelism optimization for dynamic programming algorithm in bioinformatics

Guangming Tan; Shengzhong Feng; Ninghui Sun

Dynamic programming has been one of the most efficient approaches to sequence analysis and structure prediction in biology. However, their performance is limited due to the drastic increase in both the number of biological data and variety of the computer architectures. With regard to such predicament, this paper creates excellent algorithms aimed at addressing the challenges of improving memory efficiency and network latency tolerance for nonserial polyadic dynamic programming where the dependences are nonuniform. By relaxing the nonuniform dependences, we proposed a new cache oblivious scheme to enhance its performance on memory hierarchy architectures. Moreover we develop and extend a tiling technique to parallelize this nonserial polyadic dynamic programming using an alternate block-cyclic mapping strategy for balancing the computational and memory load, where an analytical parameterized model is formulated to determine the tile volume size that minimizes the total execution time and an algorithmic transformation is used to schedule the tile to overlap communication with computation to further minimize communication overhead on parallel architectures. The numerical experiments were carried out on several high performance computer systems. The new cache-oblivious dynamic programming algorithm achieve 2-10 speedup and the parallel tiling algorithm with communication-computation overlapping shows a desired potential for fine-grained parallel computing on massively parallel computer systems

field-programmable custom computing machines | 2012

Accelerating Millions of Short Reads Mapping on a Heterogeneous Architecture with FPGA Accelerator

Wen Tang; Wendi Wang; Bo Duan; Chunming Zhang; Guangming Tan; Peiheng Zhang; Ninghui Sun

The explosion of Next Generation Sequencing (NGS) data with over one billion reads per day poses a great challenge to the capability of current computing systems. In this paper, we proposed a CPU-FPGA heterogeneous architecture for accelerating a short reads mapping algorithm, which was built upon the concept of hash-index. In particular, by extracting and mapping the most time-consuming and basic operations to specialized processing elements (PEs), our new algorithm is favorable to efficient acceleration on FPGAs. The proposed architecture is implemented and evaluated on a customized FPGA accelerator card with a Xilinx Virtex5 LX330 FPGA resided. Limited by available data transfer bandwidth, our NGS mapping accelerator, which operates at 175MHz, integrates up to 100 PEs. Compared to an Intel six-cores CPU, the speedup of our accelerator ranges from 22.2 times to 42.9 times.

Explore More