Peiheng Zhang
Chinese Academy of Sciences
Publication
Featured research published by Peiheng Zhang.
international workshop on high performance reconfigurable computing technology and applications | 2007
Peiheng Zhang; Guangming Tan; Guang R. Gao
An innovative reconfigurable supercomputing platform, the XD1000, was developed by XtremeData Inc. to exploit the rapid progress of FPGA technology and the high performance of HyperTransport interconnection. In this paper, we present implementations of the Smith-Waterman algorithm for both DNA and protein sequences on this platform. The main features include: (1) a multistage PE (processing element) design that significantly reduces FPGA resource usage and hence allows more parallelism to be exploited; (2) a pipelined control mechanism with uneven stage latencies, a key to minimizing the overall PE pipeline cycle time; and (3) a compressed substitution-matrix storage structure, resulting in a substantial decrease in on-chip SRAM usage. Finally, we implement a 384-PE systolic array running at 66.7 MHz, which achieves 25.6 GCUPS peak performance. Compared with the 2.2 GHz AMD Opteron host processor, the FPGA coprocessor achieves speedups of 185X and 250X, respectively.
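The systolic array above parallelizes the Smith-Waterman recurrence across PEs. As a point of reference, a minimal software sketch of the basic local-alignment recurrence (not the paper's FPGA design; the scoring parameters here are illustrative) might look like:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Minimal Smith-Waterman local alignment score (software reference)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            score = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + score,  # diagonal: match/mismatch
                          H[i - 1][j] + gap,        # up: gap in b
                          H[i][j - 1] + gap)        # left: gap in a
            best = max(best, H[i][j])
    return best

print(smith_waterman("GATTACA", "GCATGCU"))
```

On the FPGA, each anti-diagonal of the matrix H is computed in parallel by the PE array, which is what makes the quadratic recurrence amenable to systolic acceleration.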
IEEE Transactions on Circuits and Systems II: Express Briefs | 2007
Xianyang Jiang; Xinchun Liu; Lin Xu; Peiheng Zhang; Ninghui Sun
Scanning bio-sequence databases and finding similarities among DNA and protein sequences is a basic and important task in bioinformatics. The Needleman-Wunsch (NW) algorithm is a classical and precise tool for this problem, and the Smith-Waterman (SW) algorithm is more practical for its capability to find similarities between subsequences. Such algorithms have computational complexity proportional to the product of the lengths of the two sequences involved, so processing time becomes prohibitive given the exponential growth and sheer size of bio-sequence databases. To alleviate this problem, a reconfigurable accelerator for the SW algorithm is presented. In the accelerator, a modified equation is proposed to improve the mapping efficiency of a processing element (PE), and a special floor plan is applied to a fine-grained parallel PE array and interface components to cut down routing delay. Based on these two techniques, the proposed accelerator reaches a frequency of 82 MHz in an Altera EP1S30 device. Experiments demonstrate that the accelerator provides more than a 330x speedup over a standard desktop platform with a 2.8-GHz Xeon processor and 4 GB of memory, and a 50% improvement in peak performance over a direct port of the traditional implementation without the two techniques. Our implementation is also about 9% faster than the fastest implementation in the most recent family of SW algorithm accelerators.
field-programmable custom computing machines | 2012
Wen Tang; Wendi Wang; Bo Duan; Chunming Zhang; Guangming Tan; Peiheng Zhang; Ninghui Sun
The explosion of Next Generation Sequencing (NGS) data, with over one billion reads per day, poses a great challenge to the capability of current computing systems. In this paper, we propose a CPU-FPGA heterogeneous architecture for accelerating a short-read mapping algorithm built upon the concept of a hash index. In particular, by extracting the most time-consuming and basic operations and mapping them to specialized processing elements (PEs), our new algorithm lends itself to efficient acceleration on FPGAs. The proposed architecture is implemented and evaluated on a customized FPGA accelerator card hosting a Xilinx Virtex-5 LX330 FPGA. Limited by the available data transfer bandwidth, our NGS mapping accelerator, which operates at 175 MHz, integrates up to 100 PEs. Compared to a six-core Intel CPU, the speedup of our accelerator ranges from 22.2x to 42.9x.
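The hash-index concept underlying the accelerated mapper can be illustrated with a minimal software sketch: build a table from k-mer seeds to reference positions, then look up a read's seed to get candidate anchoring positions. This is a simplified model with a hypothetical seed length, not the paper's PE design:

```python
from collections import defaultdict

def build_index(reference, k):
    """Build a hash index mapping each k-mer (seed) to its positions in the reference."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def map_read(read, index, k):
    """Return candidate reference positions where the read's leading seed occurs."""
    return list(index.get(read[:k], []))

ref = "ACGTACGTTGCA"
idx = build_index(ref, 4)
print(map_read("ACGTTG", idx, 4))  # -> [0, 4]
```

A real mapper would verify each candidate position with a full comparison or alignment step; it is that verification work the paper offloads to the PEs.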
field-programmable technology | 2011
Bo Duan; Wendi Wang; Xingjian Li; Chunming Zhang; Peiheng Zhang; Ninghui Sun
Over the past decades, FPGA technology has advanced enormously. The topic of floating-point acceleration on FPGAs has gained renewed interest due to increased device sizes and the emergence of fast hardware floating-point libraries. The popularity of the FFT makes it easy to justify spending substantial effort on detailed optimization. However, the ever-increasing data sizes in some compelling application domains remain beyond the capability of existing FFT accelerators, and the demand for more performance remains an active research topic. In this paper, leveraging a structured description of FFT algorithms, we propose an FPGA-based FFT core generation framework that emits Verilog HDL code from a high-level algorithmic description and handles radix-2 as well as prime-radix problem sizes. In particular, the proposed framework is optimized for 2D FFT and real-valued FFT. The performance of our implementation is comparable with a commercial FFT IP. When compared with the latest results on GPUs and CPUs, measured in peak floating-point performance and energy efficiency, GPUs have outperformed FPGAs for FFT acceleration; however, we consider that FPGAs still hold an advantage in some situations.
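As a software reference for the kind of computation such a generator emits in hardware, a minimal recursive radix-2 Cooley-Tukey FFT might be sketched as follows (the paper's framework also handles prime radices, which this sketch does not):

```python
import cmath

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])          # transform of even-indexed samples
    odd = fft(x[1::2])           # transform of odd-indexed samples
    # Twiddle factors e^{-2*pi*i*k/n} combine the two half-size transforms.
    tw = [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]
    return ([even[k] + tw[k] * odd[k] for k in range(n // 2)] +
            [even[k] - tw[k] * odd[k] for k in range(n // 2)])
```

In a hardware generator, the same butterfly structure is unrolled and pipelined rather than executed recursively, which is where the structured algorithmic description pays off.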
The Journal of Supercomputing | 2007
Xianyang Jiang; Peiheng Zhang; Xinchun Liu; Stephen S.-T. Yau
Many homology search algorithms have been investigated and studied to date. However, a good classification method and a comprehensive comparison of these algorithms have been absent, especially for index-based homology search algorithms. This paper briefly introduces the main index construction methods. According to these methods, index-based homology search algorithms are classified into three categories: length-based indexes, transformation-based indexes, and their combination. Based on this classification, the characteristics of the currently popular index-based homology search algorithms are compared and analyzed, and several promising new index techniques are also discussed. As a whole, the paper provides a survey of index-based homology search algorithms.
reconfigurable computing and fpgas | 2010
Wendi Wang; Bo Duan; Chunming Zhang; Peiheng Zhang; Ninghui Sun
The emergence of embedded and multimedia applications, which have a data-centric flavor to them, has a great influence on the design methodology of future systems. The 2D FFT is of particular importance to these applications. In this paper, leveraging the reconfigurable features of off-the-shelf FPGAs, we propose a stream architecture suitable for accelerating 2D FFTs with variable and non-power-of-two problem sizes. The system is built upon the concept of a coarse-grained, application-specific pipeline and is optimized specifically for 2D and real-data FFTs. By separating the data flow from the computing flow and providing hardware support for the data-flow permutation involved in the 2D FFT, our design outperforms previous work by 1.8 to 21 times. Moreover, for real-valued data, the speed further increases by nearly 100%. To increase designer productivity, we reinforced the traditional HDL-based hardware design flow with High-Level Synthesis concepts. The prototype system on a customized FPGA-based accelerator card operates stably at 200 MHz.
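The row-column decomposition behind the 2D FFT, and the transpose permutation that the stream architecture supports in hardware, can be sketched in software. Here a direct DFT stands in for the pipelined 1D cores (illustrative only, not the paper's design):

```python
import cmath

def dft(x):
    """Direct O(n^2) DFT of one row or column (stand-in for a pipelined FFT core)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def fft2d(m):
    """Row-column 2D transform: 1D transform along rows, transpose,
    1D transform along the (former) columns, transpose back.
    The transposes are the data-flow permutation done in hardware."""
    rows = [dft(r) for r in m]                 # transform every row
    cols = [dft(list(c)) for c in zip(*rows)]  # transpose, transform columns
    return [list(r) for r in zip(*cols)]       # transpose back to row-major
```

Because the transpose is a pure data movement with a strided access pattern, separating it from the compute flow (as the paper does) lets the FFT cores stay fully fed.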
BMC Bioinformatics | 2009
Wendi Wang; Peiheng Zhang; Xinchun Liu
Background: The emerging next-generation sequencing methods based on PCR technology boost genome sequencing speed considerably while also reducing cost, and have been utilized to address a broad range of bioinformatics problems. Limited by the reliable output sequence length of next-generation sequencing technologies, we are generally confined to studying gene fragments of 30-50 bp, relatively shorter than traditional gene fragment lengths. Anchoring gene fragments in a long reference sequence is an essential prerequisite for further assembly and analysis. Due to the sheer number of fragments produced by next-generation sequencing technologies and the huge size of reference sequences, anchoring rapidly becomes a computational bottleneck. Results and discussion: We compared algorithm efficiency for BLAT, SOAP, and EMBF, where efficiency is defined as the count of total output results divided by the time consumed to retrieve them. The data show that our algorithm, EMBF, has a 3-4 times efficiency advantage over SOAP, and at least 150 times over BLAT. Moreover, as the reference sequence size increases, the efficiency of SOAP degrades by as much as 30%, while EMBF shows a favorable increasing tendency. Conclusion: We conclude that EMBF is more suitable for the short-fragment anchoring problem, where result completeness and accuracy are predominant and the reference sequences are relatively large.
international parallel and distributed processing symposium | 2012
Wendi Wang; Wen Tang; Linchuan Li; Guangming Tan; Peiheng Zhang; Ninghui Sun
Next Generation Sequencing (NGS) is gaining interest due to increased demand and decreased sequencing cost. An important prerequisite step for most NGS applications is mapping short sequences, called reads, to the template reference sequences. Both the explosion of NGS data, with billions of reads generated each day, and the data-intensive computation pose great challenges to the capability of existing computing systems. In this paper, we take a hash-index-based algorithm (PerM) as an example to investigate optimization approaches for accelerating NGS read mapping on multi-core architectures. First, we propose a new parallel algorithm that reorders bucket accesses in the hash index among multiple threads so that data locality in the shared cache is improved. Second, in order to reduce the number of empty hash buckets, we propose a serialized hash-index compression algorithm, which coincides with the sequential access pattern of our new parallel algorithm. With the reduced hash index size, it also becomes possible to use longer hash keys, which alleviates hash conflicts and improves query performance. Our experiments on an 8-socket, 8-core Intel Xeon X7550 SMP with 128 GB of memory show that the new parallel algorithm reduces the LLC miss ratio to 8%-15% of the original algorithm's, and overall performance improves by 4-11 times (6 times on average).
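The idea of compressing out empty hash buckets can be sketched as packing all bucket entries contiguously and keeping only an offsets array, CSR-style, so empty buckets cost one integer instead of a whole slot. This is a simplified model, not PerM's actual on-disk layout:

```python
def compress_index(buckets):
    """Pack a sparse bucket table into (offsets, entries) arrays.
    Bucket i occupies entries[offsets[i]:offsets[i+1]]; empty buckets
    contribute no entry storage, only a repeated offset."""
    offsets = [0]
    entries = []
    for bucket in buckets:
        entries.extend(bucket)
        offsets.append(len(entries))
    return offsets, entries

def lookup(offsets, entries, key):
    """Fetch the entry list for one hash bucket from the compressed layout."""
    return entries[offsets[key]:offsets[key + 1]]

offs, ents = compress_index([[5], [], [7, 9], []])
print(offs, ents)  # -> [0, 1, 1, 3, 3] [5, 7, 9]
```

Because both arrays are read front to back, this layout matches the sequential, reordered bucket access of the parallel algorithm described above.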
high performance distributed computing | 2011
Linchuan Li; Xingjian Li; Guangming Tan; Mingyu Chen; Peiheng Zhang
Heterogeneous architecture is becoming an important way to build massively parallel computer systems, e.g., the CPU-GPU heterogeneous systems ranked in the Top500 list. However, it is a challenge to efficiently exploit the massive parallelism of both applications and architectures on such heterogeneous systems. In this paper we present a practical approach to exploiting and orchestrating parallelism at the algorithm level to take advantage of the underlying parallelism at the architecture level. A potential petaflops application, cryo-EM 3D reconstruction, is selected as an example. We exploit all available parallelism in cryo-EM 3D reconstruction and leverage a self-adaptive dynamic scheduling algorithm to create a proper parallelism mapping between the application and the architecture. The parallelized programs are evaluated on a subsystem of the Dawning Nebulae supercomputer, each node of which comprises two six-core Intel Xeon CPUs and one Nvidia Fermi GPU. The experiments confirm that hierarchical parallelism is an efficient pattern for utilizing the capabilities of both CPU and GPU in a heterogeneous system. The CUDA kernels run more than 3 times faster than the OpenMP-parallelized ones using 12 cores (threads). Over the GPU-only version, the hybrid CPU-GPU program further improves whole-application performance by 30% on average.
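A self-adaptive dynamic scheduler can be modeled, in a much-simplified form, as idle devices pulling the next work item from a shared queue, so a faster device naturally claims more tasks without any static partitioning. The names and structure below are illustrative, not the paper's implementation:

```python
import queue
import threading

def dynamic_schedule(tasks, devices):
    """Each device worker pulls tasks from a shared queue until it is empty.
    Returns which device processed which tasks."""
    q = queue.Queue()
    for t in tasks:
        q.put(t)
    assignment = {d: [] for d in devices}

    def worker(dev):
        while True:
            try:
                task = q.get_nowait()
            except queue.Empty:
                return
            # A real worker would launch the CPU or GPU kernel for `task` here.
            assignment[dev].append(task)

    threads = [threading.Thread(target=worker, args=(d,)) for d in devices]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return assignment

result = dynamic_schedule(list(range(10)), ["cpu", "gpu"])
```

The pull model is what makes the mapping self-adaptive: if GPU kernels finish 3x faster than their OpenMP counterparts, the GPU worker simply dequeues roughly 3x as many tasks.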
field programmable gate arrays | 2012
Wendi Wang; Bo Duan; Wen Tang; Chunming Zhang; Guangming Tang; Peiheng Zhang; Ninghui Sun
The wide acceptance of bioinformatics, medical imaging, and multimedia applications, which have a data-centric flavor to them, requires more efficient, application-specific systems to be built. Due to recent advances in FPGA technology, there has been a resurgence in research on accelerator designs that leverage FPGAs to accelerate large-scale scientific applications. In this paper, we follow this trend in FPGA-based accelerator design and provide a proof-of-concept, comprehensive case study of an FPGA-based accelerator for a single-particle 3D reconstruction application in single-precision floating-point format. The proposed stream architecture is built by first offloading compute-intensive software kernels to dedicated hardware modules, which emphasizes the importance of optimizing computation-dominated data access patterns. Configurable computing streams are then constructed by arranging the hardware modules and bypass channels into a linear deep pipeline. The efficiency of the proposed stream architecture is demonstrated by a 2.54x speedup over a four-core CPU. In terms of power efficiency, our FPGA-based accelerator achieves 7.33x and 3.4x improvements over a four-core CPU and an up-to-date GPU device, respectively.
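The notion of a configurable computing stream, a linear pipeline of compute modules where bypass channels route data around stages a given configuration does not need, can be sketched in software as follows (a toy model of the concept, not the accelerator's hardware):

```python
def pipeline(stream, stages):
    """Run data through a linear pipeline of (enabled, kernel) stages.
    Disabled stages are bypassed: data flows past them unchanged,
    mimicking a hardware bypass channel."""
    for enabled, kernel in stages:
        if enabled:
            stream = map(kernel, stream)
    return list(stream)

out = pipeline(range(4), [
    (True,  lambda x: x + 1),    # stage 1: active
    (False, lambda x: x * 100),  # stage 2: bypassed for this configuration
    (True,  lambda x: x * 2),    # stage 3: active
])
print(out)  # -> [2, 4, 6, 8]
```

In hardware, keeping the modules in one fixed linear order and selecting among them with bypasses avoids a full crossbar while still letting different kernels share the same deep pipeline.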