Yuanwu Lei
National University of Defense Technology
Publication
Featured research published by Yuanwu Lei.
Neurocomputing | 2016
Yueqing Wang; Zhige Xie; Kai Xu; Yong Dou; Yuanwu Lei
3D shape features play a crucial role in graphics applications, such as 3D shape matching, recognition, and retrieval. Various 3D shape descriptors have been developed over the last two decades; however, existing descriptors are handcrafted features that are labor-intensive to design and cannot extract discriminative information from a large set of data. In this paper, we propose a rapid 3D feature learning method, namely, a convolutional auto-encoder extreme learning machine (CAE-ELM) that combines the advantages of the convolutional neural network, auto-encoder, and extreme learning machine (ELM). This method performs better and faster than other methods. In addition, we define a novel architecture based on CAE-ELM. The architecture accepts two types of 3D shape representation, namely, voxel data and signed distance field (SDF) data, as inputs to extract the global and local features of 3D shapes. Voxel data describe structural information, whereas SDF data contain details on 3D shapes. Moreover, the proposed CAE-ELM can be used in practical graphics applications, such as 3D shape completion. Experiments show that the features extracted by CAE-ELM are superior to existing hand-crafted features and to those of other deep learning methods or ELM models. Moreover, the classification accuracy of the proposed architecture is superior to that of other methods on ModelNet10 (91.4%) and ModelNet40 (84.35%). The training process also runs faster than existing deep learning methods by approximately two orders of magnitude.
high performance computing and communications | 2008
Jie Zhou; Yong Dou; Yuanwu Lei; Jinbo Xu; Yazhuo Dong
FPGA chips have become a promising option for accelerating scientific applications, which involve many floating-point transcendental functions, such as sin, log, exp, and sqrt. In this paper, we present a 64-bit ANSI/IEEE floating-point CORDIC co-processor on FPGA that provides all known CORDIC functions. To our knowledge, no 64-bit CORDIC implementation on FPGA has been reported before. We propose a hybrid-mode CORDIC algorithm that combines hybrid rotation angle methods with an argument reduction algorithm to reduce hardware area usage while keeping an unlimited convergence domain for any floating-point input of the functions. Our hybrid-mode CORDIC co-processor is organized into three phases (argument reduction, CORDIC calculation, and normalization) with 69 pipeline stages for FPGA implementation. The synthesis results show that the clock frequency can reach 173 MHz on a Xilinx Virtex5 FPGA. Compared with a general-purpose microprocessor on three scientific program kernels, the CORDIC co-processor achieves a maximum speedup of 49.3 times and an average speedup of 28.7 times.
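The two ideas named in this abstract, argument reduction followed by CORDIC shift-and-add rotations, can be sketched in software as follows. This is a minimal textbook rotation-mode CORDIC for sine and cosine, not the paper's hybrid-mode algorithm; the function names `cordic_sin_cos` and `sin_cos`, the iteration count, and the reduction modulo pi are all assumptions made for illustration.

```python
import math


def cordic_sin_cos(theta, iterations=40):
    """Rotation-mode CORDIC, valid for |theta| <= pi/2 (illustrative sketch)."""
    # Elementary rotation angles atan(2^-i); hardware would store these in a table.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    # Each micro-rotation stretches the vector; pre-scale x by the inverse gain.
    gain = 1.0
    for i in range(iterations):
        gain /= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = gain, 0.0, theta
    for i in range(iterations):
        d = 1.0 if z >= 0.0 else -1.0  # rotate toward z == 0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return y, x  # (sin, cos)


def sin_cos(theta):
    """Argument reduction to [-pi/2, pi/2], then CORDIC (hypothetical helper)."""
    k = round(theta / math.pi)
    reduced = theta - k * math.pi
    s, c = cordic_sin_cos(reduced)
    # sin(r + k*pi) = (-1)^k sin(r), and likewise for cos.
    if k % 2:
        s, c = -s, -c
    return s, c
```

In hardware the multiplications by 2^-i become shifts, which is why CORDIC maps well to FPGA logic; the floating-point multiplies here only stand in for those shifts.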
field-programmable custom computing machines | 2009
Guiming Wu; Yong Dou; Yuanwu Lei; Jie Zhou; Miao Wang; Jingfei Jiang
Previous works have projected that the peak performance of FPGAs can outperform that of general-purpose processors. However, no work actually compares the performance of FPGAs and CPUs using standard benchmarks such as the LINPACK benchmark. We propose and implement an FPGA-based hardware design of the LINPACK benchmark, the key step of which is LU decomposition with pivoting. We introduce a fine-grained pipelined LU decomposition algorithm that enables optimum performance by exploiting fine-grained pipeline parallelism. A scalable linear array of processing elements (PEs), which is the core component of our hardware design, is proposed to implement this algorithm. To the best of our knowledge, this is the first reported FPGA-based pipelined implementation of LU decomposition with pivoting. A total of 19 PEs can be integrated into an Altera Stratix II EP2S130F1020C5 on our self-designed development board. Experimental results show that a speedup of up to 6.14 can be achieved relative to a Pentium 4 processor for the LINPACK benchmark.
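For readers unfamiliar with the kernel at the heart of LINPACK, the sketch below shows sequential LU decomposition with partial (row) pivoting and the matching triangular solves. It illustrates the numerical method only; the pipelined PE-array schedule described in the abstract is not reproduced, and the names `lu_decompose` and `lu_solve` are assumptions.

```python
def lu_decompose(a):
    """Return (lu, perm): L and U packed in one matrix, plus the row permutation."""
    n = len(a)
    lu = [row[:] for row in a]
    perm = list(range(n))
    for k in range(n):
        # Partial pivoting: pick the row with the largest magnitude in column k.
        p = max(range(k, n), key=lambda r: abs(lu[r][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            lu[k], lu[p] = lu[p], lu[k]
            perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]  # multiplier, stored in the L part
            m = lu[i][k]
            for j in range(k + 1, n):
                lu[i][j] -= m * lu[k][j]  # update the trailing submatrix
    return lu, perm


def lu_solve(lu, perm, b):
    """Solve A x = b from the packed factors."""
    n = len(lu)
    # Forward substitution with unit-diagonal L on the permuted right-hand side.
    y = [b[perm[i]] for i in range(n)]
    for i in range(n):
        for j in range(i):
            y[i] -= lu[i][j] * y[j]
    # Back substitution with U.
    x = y[:]
    for i in reversed(range(n)):
        for j in range(i + 1, n):
            x[i] -= lu[i][j] * x[j]
        x[i] /= lu[i][i]
    return x
```

The innermost update loop is the part that dominates the O(n^3) cost, and it is the natural candidate for fine-grained pipelining.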
Neurocomputing | 2016
Yueqing Wang; Yong Dou; Xinwang Liu; Yuanwu Lei
Extreme learning machine (ELM) has been intensively studied during the last decade due to its high efficiency, effectiveness, and ease of implementation. Recently, many variants, such as parallel ELM (P-ELM), incremental ELM, and online sequential ELM (OS-ELM), have been proposed to improve its timing performance and to enable incremental learning. In this paper, we propose two parallel variants, termed data parallel regularized ELM (DPR-ELM) and model parallel regularized ELM (MPR-ELM), to further improve the computational efficiency of ELM in handling large-scale learning tasks. Collectively, these two variants are called parallel regularized ELM (PR-ELM). Specifically, our proposed algorithms are implemented on a cluster with a Message Passing Interface (MPI) environment. In summary, the advantages of the proposed PR-ELM algorithms over existing variants are as follows: (1) They have better parallelism since they train each data block or each sub-model independently. (2) They dramatically reduce the runtime memory requirement since the whole dataset or the whole model is split into small chunks or sub-models. (3) Both DPR-ELM and MPR-ELM have better scalability since they can be configured on clusters with many more computing nodes. Extensive experiments have been conducted to validate the effectiveness of the proposed algorithms. As shown, DPR-ELM and MPR-ELM achieve 5.15× and 3.5× speedup on a cluster with six nodes, respectively. Moreover, the speedup of DPR-ELM increases to 5.85× as the size of the dataset increases, and the speedup of MPR-ELM increases to 4× as the number of hidden nodes increases.
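The abstract does not give the DPR-ELM formulas, so the following is only a sketch of one common way to make ridge-regularized ELM training data-parallel: each block contributes its own H^T H and H^T T, the contributions are summed (on a cluster this would be a reduction across nodes), and the output weights are obtained by solving (I/C + H^T H) beta = H^T T. All names (`hidden_layer`, `block_statistics`, `solve`, `train_data_parallel`) and the sigmoid activation are assumptions.

```python
import math


def hidden_layer(x_rows, weights, biases):
    """Random-feature hidden layer: sigmoid(x . w + b) for each hidden node."""
    return [[1.0 / (1.0 + math.exp(-(sum(xi * wi for xi, wi in zip(x, w)) + b)))
             for w, b in zip(weights, biases)] for x in x_rows]


def block_statistics(h_rows, t_rows):
    """Per-block sufficient statistics H^T H and H^T T (computed independently)."""
    n_hidden, n_out = len(h_rows[0]), len(t_rows[0])
    hth = [[sum(h[i] * h[j] for h in h_rows) for j in range(n_hidden)]
           for i in range(n_hidden)]
    htt = [[sum(h[i] * t[k] for h, t in zip(h_rows, t_rows)) for k in range(n_out)]
           for i in range(n_hidden)]
    return hth, htt


def solve(a, b):
    """Gauss-Jordan elimination with partial pivoting for A X = B."""
    n = len(a)
    aug = [ra[:] + rb[:] for ra, rb in zip(a, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(aug[r][k]))
        aug[k], aug[p] = aug[p], aug[k]
        pivot = aug[k][k]
        aug[k] = [v / pivot for v in aug[k]]
        for i in range(n):
            if i != k:
                f = aug[i][k]
                aug[i] = [vi - f * vk for vi, vk in zip(aug[i], aug[k])]
    return [row[n:] for row in aug]


def train_data_parallel(blocks, weights, biases, c=100.0):
    """Sum per-block statistics, then solve (I/C + H^T H) beta = H^T T."""
    n_hidden = len(weights)
    gram = [[(1.0 / c if i == j else 0.0) for j in range(n_hidden)]
            for i in range(n_hidden)]
    cross = None
    for x_rows, t_rows in blocks:  # each iteration could run on a separate node
        hth, htt = block_statistics(hidden_layer(x_rows, weights, biases), t_rows)
        gram = [[g + h for g, h in zip(gr, hr)] for gr, hr in zip(gram, hth)]
        cross = htt if cross is None else [
            [u + v for u, v in zip(cr, hr)] for cr, hr in zip(cross, htt)]
    return solve(gram, cross)
```

Because the per-block statistics have a fixed size that depends only on the number of hidden nodes, no node ever needs to hold the full hidden-layer matrix, which is consistent with the memory advantage claimed in point (2).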
international workshop on computer architecture for machine perception | 2007
Yong Dou; Jie Zhou; Yuanwu Lei; Xingming Zhou
In this paper, we present a design of an FPGA SAR processor with four 1D FFT processing elements, double internal RAM buffers, and double external SDRAM modules. Instead of the traditional corner turn phase, we propose a data layout scheme that maps one row of the logical matrix into a rectangular window in the physical banks of SDRAM in order to increase the practical I/O throughput between the SDRAM modules and the SAR processing elements. In addition, we theoretically analyze the optimal window size to minimize the total number of page openings and closings when performing 2D FFT by balancing the number of physical pages handled between row accesses and column accesses. The experimental results show that our window layout approach achieves 650 MB/s of effective bandwidth, reaching nearly 82% of peak bandwidth, a 58.1% increase compared to traditional corner turn approaches. The proposed SAR processor has been implemented in an FPGA test-bed, outperforming related works in both computing speed and image scale.
international conference on supercomputing | 2010
Yong Dou; Yuanwu Lei; Guiming Wu; Song Guo; Jie Zhou; Li Shen
In this paper we explore the capability and flexibility of FPGA solutions for accelerating scientific computing applications that require very high precision arithmetic, based on 128-bit or even 256-bit floating-point number representation. This paper addresses the accuracy of LU decomposition on large-scale matrices. In future ExaScale computing environments, accuracy errors are expected to increase up to a level which leaves only 11 significant bits in the mantissa. This is caused by the large number of accumulation operations required, which is on the order of O(n^3). Using exact long fixed-point numbers instead of the usual floating-point numbers in the accumulation process leads to exact accumulation results with only one bit of error, originating from the rounding in the last normalization step. We have developed two types of High Precision Multiplication and Accumulation (HP-MAC) units, for Double-Double (128 bits) and Quad-Double (256 bits) floating-point, respectively, and implemented them on FPGA devices. We propose a two-level RAM bank scheme to store and add long fixed-point numbers with minimized critical data path lengths. We also introduce a scheme of partial summation to enhance the pipeline throughput of MAC operations, by dividing the summation function into 4 partial operations, processed in 4 banks. To prove the concept, we prototyped six 128-bit HP-MAC units on a Xilinx Virtex-5 XC5VLX330 FPGA chip and performed LU decomposition. The experimental results show an accuracy improvement of 10 to 24 bits, compared to a software approach with similar precision arithmetic. Moreover, our LU decomposition implementation, based on an FPGA running at 133 MHz, achieves 29X to 56X better performance and much lower power consumption compared to a software-based library running on an Intel Core2 Quad Q8200 CPU at 2.33 GHz.
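The benefit of exact accumulation with a single final rounding can be illustrated in a few lines of software. This sketch uses Python's `fractions.Fraction` as a stand-in for the long fixed-point accumulator (an assumption; the paper uses hardware RAM banks), and the round-robin assignment of products to four partial sums is likewise only a guess at how the "4 partial operations, processed in 4 banks" might be scheduled.

```python
from fractions import Fraction


def naive_dot(xs, ys):
    """Ordinary floating-point accumulation: rounds after every addition."""
    total = 0.0
    for x, y in zip(xs, ys):
        total += x * y
    return total


def exact_dot(xs, ys, banks=4):
    """Exact accumulation into several partial sums, rounded once at the end."""
    partial = [Fraction(0)] * banks
    for i, (x, y) in enumerate(zip(xs, ys)):
        # Every finite double is an exact rational, so the product is exact too.
        partial[i % banks] += Fraction(x) * Fraction(y)
    # Combine the banks exactly; float() performs the single final rounding.
    return float(sum(partial))
```

For example, with `xs = [1e16, 1.0, -1e16]` and `ys = [1.0, 1.0, 1.0]`, the naive loop absorbs the 1.0 into 1e16 and loses it, while the exact version keeps it until the final rounding.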
Concurrency and Computation: Practice and Experience | 2014
Yueqing Wang; Yong Dou; Song Guo; Yuanwu Lei; Dan Zou
Gadget is a simulation application for N‐body and smoothed particle hydrodynamics problems in cosmology, and it is widely applied in solving a series of cosmological problems. N‐body simulation focuses on the motion of N interacting particles, and smoothed particle hydrodynamics is a fluid simulation algorithm that studies the movement of fluid through particle simulation. Most scholars focus their attention on accelerating Gadget on multi‐core CPU or graphics processing unit (GPU) platforms. However, these research activities failed to achieve CPU–GPU hybrid computing, which resulted in a tremendous waste of CPU computing resources.
field-programmable logic and applications | 2011
Yuanwu Lei; Yong Dou; Jie Zhou; Sufeng Wang
Many scientific computing applications require efficient variable-precision floating-point arithmetic. This paper presents a special-purpose Variable-Precision Floating-Point Arithmetic Processor (VPFPAP) based on a Very Long Instruction Word (VLIW) structure. The proposed processor uses a unified hardware structure, equipped with multiple custom basic variable-precision arithmetic units, to implement various variable-precision algebraic and transcendental functions. The performance is improved by the explicitly parallel technology of VLIW instructions and by dynamically varying the precision of intermediate computation. Finally, we prototype the VPFPAP unit on a Xilinx Virtex-6 XC6VLX760-2FF1760 FPGA chip. The experimental results show that our design, based on an FPGA running at 253 MHz, outperforms a software-based library running on an Intel Core i3 530 CPU at 2.93 GHz by a factor of 5X to 38X for basic variable-precision arithmetic operations and elementary functions.
field-programmable logic and applications | 2009
Yong Dou; Jie Zhou; Xiaoyang Chen; Yuanwu Lei; Jinbo Xu
Many FPGA implementations of QR decomposition have been studied for small-scale matrices, and all of them are presented individually. However, to the best of our knowledge, there is no FPGA-based accelerator for large-scale QR decomposition. In this paper, we propose a unified FPGA accelerator structure for large-scale QR decomposition. To exploit the computational potential of FPGA, we introduce a fine-grained parallel algorithm for QR decomposition. A scalable linear array of processing elements (PEs), which is the core component of the FPGA accelerator, is proposed to implement this algorithm. A total of 15 PEs can be integrated into an Altera Stratix II EP2S130F1020C5 on our self-designed board. Experimental results show that a factor of 4 speedup and a maximum power-performance of 60.9 can be achieved compared to a Pentium Dual CPU with double SSE threads.
The Journal of Supercomputing | 2013
Yuanwu Lei; Yong Dou; Yazhuo Dong; Jie Zhou; Fei Xia
The current paper explores the capability and flexibility of field-programmable gate arrays (FPGAs) to implement variable-precision floating-point (VP) arithmetic. First, the VP exact dot product algorithm, which uses exact fixed-point operations to obtain an exact result, is presented. A VP multiplication and accumulation unit (VPMAC) on FPGA is then proposed. In the proposed design, the parallel multipliers generate the partial products of mantissa multiplication in parallel, which is the most time-consuming part of the VP multiplication and accumulation operation. This method fully utilizes DSP performance on FPGAs to enhance the performance of the VPMAC unit. Several other schemes, such as two-level RAM banks, carry-save accumulation, and partial summation, are used to achieve high frequency and pipeline throughput in the product accumulation stage. Typical algorithms in Basic Linear Algebra Subprograms (i.e., vector dot product, general matrix-vector product, and general matrix-matrix product), LU decomposition, and Modified Gram–Schmidt QR decomposition are used to evaluate the performance of the VPMAC unit. Two schemes, called the VPMAC coprocessor and the matrix accelerator, are presented to implement these applications. Finally, prototypes of the VPMAC unit and of the matrix accelerator based on the VPMAC unit are created on a Xilinx XC6VLX760 FPGA chip. Compared with a parallel software implementation based on OpenMP running on an Intel Xeon Quad-core E5620 CPU, the VPMAC coprocessor, equipped with one VPMAC unit, achieves a maximum acceleration factor of 18X. Moreover, the matrix accelerator, which mainly consists of a linear array of eight processing elements, achieves 12X–65X better performance.
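Of the evaluation kernels listed in this abstract, Modified Gram–Schmidt QR is the one whose inner loops are almost entirely dot products, which is why it suits a multiply-accumulate unit. The sketch below is the standard double-precision algorithm, not the paper's variable-precision implementation; the function name `mgs_qr` is an assumption.

```python
import math


def mgs_qr(a):
    """Modified Gram-Schmidt: return (Q, R) with A = Q R, Q having orthonormal columns."""
    m, n = len(a), len(a[0])
    # Work on columns, since every step is a column norm or column dot product.
    v = [[a[i][j] for i in range(m)] for j in range(n)]
    q_cols = []
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        r[j][j] = math.sqrt(sum(x * x for x in v[j]))
        if r[j][j] == 0.0:
            raise ValueError("columns are linearly dependent")
        qj = [x / r[j][j] for x in v[j]]
        q_cols.append(qj)
        # "Modified" variant: immediately remove the q_j component
        # from every remaining column, using the already-updated columns.
        for k in range(j + 1, n):
            r[j][k] = sum(qx * vx for qx, vx in zip(qj, v[k]))
            v[k] = [vx - r[j][k] * qx for vx, qx in zip(v[k], qj)]
    # Return Q in the same row-major layout as the input.
    q = [[q_cols[j][i] for j in range(n)] for i in range(m)]
    return q, r
```

Each `sum(...)` call above is a dot product, so replacing it with an exact or higher-precision accumulation changes the accuracy of the factorization without changing its structure.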