Luo-Feng Geng
Hefei University of Technology
Publication
Featured research published by Luo-Feng Geng.
international conference on asic | 2007
Wen-Ting Zhang; Luo-Feng Geng; Duo-Li Zhang; Gaoming Du; Ming-Lun Gao; Wei Zhang; Ning Hou; Yi-hua Tang
To achieve a balance between high performance and energy efficiency, embedded systems often use heterogeneous multiprocessor platforms which are tuned for a well-defined application domain. Meanwhile, FPGAs are known for providing designers with several benefits in system design, the most important being high programmability and low risk. In this paper we demonstrate the design of an FPGA-based heterogeneous multiprocessor system integrating 4 Nios II soft cores and 1 ARM core. The ARM core is the central controller of the whole system, and the 4 Nios II cores serve as slaves, which are commanded by the ARM core and are responsible for processing regular, high-volume data. The ARM core and the Nios II cores cooperate and work in parallel to accomplish each task. The current implementation uses 13% of the FPGA, requiring 19,593 ALUTs on an Altera Stratix II EP2S180.
international conference on anti counterfeiting security and identification | 2009
Luo-Feng Geng; Duoli Zhang; Ming-Lun Gao; Ying-Chun Chen; Gaoming Du
The Multiprocessor System-on-Chip (MPSoC) is a promising solution for future complex computer and embedded systems, and the Network-on-Chip (NoC) has been proposed as the future on-chip interconnection. However, NoCs bring more challenges to parallel programming and to the synchronization of different processor cores. This paper proposes a new cluster-based homogeneous MPSoC architecture, which adopts a hybrid interconnection composed of both bus-based and NoC architectures. This architecture has been implemented as a prototype on an FPGA device, integrating 17 processor cores. The performance of this prototype is evaluated with two real applications, matrix chain multiplication and JPEG picture decoding. The speedup ratio of this prototype is up to 15.850.
international conference on asic | 2009
Junqiao Huang; Gaoming Du; Duoli Zhang; Yukun Song; Luo-Feng Geng; Ming-Lun Gao
A VLSI design of a complex Quadrature Mirror Filterbank (QMF) for the MPEG-4 High Efficiency Advanced Audio Coding (MPEG-4 HE-AAC) decoder using a resource-sharing technique is proposed. An algorithm that uses the conventional discrete cosine transform of type IV (DCT-IV) to optimize the complex-QMF is derived in this paper. By using the proposed algorithm, the VLSI design of the complex-valued analysis quadrature mirror filterbank (complex-AQMF) and synthesis quadrature mirror filterbank (complex-SQMF) can improve resource efficiency by sharing the same DCT module. Experimental results show that the computational complexity of the complex-QMF can be reduced up to 8.59%, and that the VLSI architecture of the proposed algorithm can save about 53% of area and 50% of memory due to the shared DCT-IV resources.
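One reason a single DCT-IV block can plausibly serve both the analysis and the synthesis side is that the DCT-IV is its own inverse up to a scale factor of 2/N. The sketch below is a direct O(N²) Python illustration of that property only; it is not the paper's QMF mapping or its hardware architecture, and the function name `dct4` is chosen here for illustration.

```python
import math

def dct4(x):
    """Direct (unnormalized) DCT-IV: X[k] = sum_n x[n] * cos(pi/N * (n + 1/2) * (k + 1/2))."""
    n = len(x)
    return [
        sum(x[j] * math.cos(math.pi / n * (j + 0.5) * (k + 0.5)) for j in range(n))
        for k in range(n)
    ]

def idct4(y):
    """Inverse DCT-IV: the same transform again, scaled by 2/N."""
    n = len(y)
    return [2.0 / n * v for v in dct4(y)]

samples = [1.0, 2.0, 3.0, 4.0]
restored = idct4(dct4(samples))
```

Because the forward and inverse transforms differ only by a constant, the same datapath can in principle be reused for both directions.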
international conference on solid state and integrated circuits technology | 2006
Wei Zhang; Gaoming Du; Yi Xu; Ming-lun Gao; Luo-Feng Geng; Bing Zhang; Zhao-yu Jiang; Ning Hou; Yi-hua Tang
The increasing system resources available on field-programmable gate arrays (FPGAs) enable the integration of a complex system on one programmable chip. This paper focuses on the design and implementation of a hierarchy-bus based multi-processor system-on-chip (MPSoC) integrating 4 ARM processors on an FPGA. Experimental results were obtained running at 60 MHz, with a total area requiring 34% of the adaptive look-up tables (ALUTs) of an Altera Stratix II EP2S180 and a maximum performance speedup of 3.2.
pacific-asia workshop on computational intelligence and industrial application | 2008
Gaoming Du; Duoli Zhang; Yukun Song; Ming-Lun Gao; Luo-Feng Geng; Ning Hou
With the development of IC technology and the increasing demand for processing power, more and more processing cores are being integrated into a single chip. One of the key problems is the communication efficiency between the processing cores, and the network on chip (NoC) has been proposed as a promising architecture. In this paper, the scalability of a 2-D mesh based NoC is analyzed. First, a mesh based NoC router using the XY routing algorithm is designed and implemented in an FPGA prototype. Second, 2×2 and 3×3 NoCs are constructed using this router module, with each router connected to a processing core via the resource network interface (RNI). Finally, pipelined matrix multiplications and FFT are executed to evaluate the performance of the 2-D mesh based NoC, together with the router area overhead as the number of processing nodes increases. Experiments showed that the 2-D mesh based NoC architecture scales easily to more processing nodes with small resource overhead.
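As an illustration of the XY routing named in this abstract, the following minimal Python sketch computes the hop sequence on a 2-D mesh by moving along the X dimension first and then along Y. It models dimension-order routing in general, not the paper's router hardware; the coordinate convention and the function name `xy_route` are assumptions made here.

```python
def xy_route(src, dst):
    """Return the list of (x, y) nodes visited from src to dst under XY routing."""
    x, y = src
    dst_x, dst_y = dst
    path = [(x, y)]
    # Phase 1: travel along the X dimension until the column matches.
    while x != dst_x:
        x += 1 if dst_x > x else -1
        path.append((x, y))
    # Phase 2: travel along the Y dimension until the row matches.
    while y != dst_y:
        y += 1 if dst_y > y else -1
        path.append((x, y))
    return path

# Route across a 3x3 mesh from corner (0, 0) to node (2, 1).
example_path = xy_route((0, 0), (2, 1))
```

The number of hops always equals the Manhattan distance between the two nodes, and because a packet never turns from Y back to X, this ordering is deadlock-free on a mesh.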
international conference on anti-counterfeiting, security, and identification | 2008
Liang Ma; Gaoming Du; Duoli Zhang; Yukun Song; Luo-Feng Geng; Ming-Lun Gao
A new architecture based on parallel FIR systolic arrays for motion compensation interpolation in H.264/AVC is presented in this paper. Unlike other interpolation architectures based on a traditional adder tree or a single systolic FIR, this design combines the pipelining of the systolic FIR filter with a high degree of parallelism. It has the following characteristics. First, it uses several strategies to reduce the number of memory accesses: the design fully exploits the recursive relation between the fractional-pel samples, adopts appropriate interpolation orders for different situations, and provides two buffers for storing immediate values. Second, it can increase the system clock frequency by using the systolic FIR filter to replace the traditional adder tree. Third, it can enhance the interpolation throughput by generating four fractional-pel samples in parallel. Fourth, it doesn't need high memory bandwidth and can work with different bus widths by changing the number of systolic FIR filters. The design is synthesized with Synopsys Design Compiler using TSMC 0.18 µm standard cell CMOS technology. The synthesis result shows that this architecture can achieve 230 MHz and meet the interpolation needs of an H.264 decoder for SDTV or HDTV.
international conference on anti counterfeiting security and identification | 2009
Haihua Wen; Gaoming Du; Duoli Zhang; Luo-Feng Geng; Ming-Lun Gao; Ying-Chun Chen
Performance evaluation for Network-on-Chip (NoC) is still a challenging problem. This paper presents the design of an on-line configurable traffic generator (OCTG) that provides a fast and effective traffic generation environment for evaluating NoC communication performance. The novelty of the proposed OCTG architecture is that, rather than merely exposing some configurable parameters to improve flexibility as conventional designs do, it supports on-line configuration. Parameters are transferred to the configuration engine through a JTAG interface, and the configuration engine then generates configuration signals for the OCTGs to perform on-line configuration. The OCTG comprises two traffic modes: broadcast transmission (BT) and node to node transmission (NTNT). The OCTG can restart communication immediately after configuration completes, without any other operations, even when a communication transaction is running. Some communication traffic modes can be exactly emulated by the OCTG, so the NoC communication architecture can be evaluated in different traffic modes, or the performance of different NoC architectures can be compared. Experiments showed that, for the same architecture, NoC performance in NTNT (node (i, j) to node (j, i)) is better than in BT. XY routing also performs better than the odd-even router in BT when the injection rate is above 0.2. When the injection rate is below 0.2, however, the latter is better than the former only in average packet latency.
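The two traffic modes can be pictured with a small Python sketch. It assumes an n×n mesh, takes NTNT to be the transpose pattern the abstract mentions (node (i, j) sends to node (j, i)), and assumes BT means one source sending to every other node; the exact BT semantics and all function names here are illustrative assumptions, not details from the paper.

```python
def ntnt_transpose(n):
    """Map each node (i, j) of an n x n mesh to its NTNT destination (j, i)."""
    return {(i, j): (j, i) for i in range(n) for j in range(n)}

def broadcast_destinations(n, src):
    """Destinations for one broadcasting source: every other node (assumed semantics)."""
    return [(i, j) for i in range(n) for j in range(n) if (i, j) != src]

def mean_hops(pairs):
    """Average Manhattan distance, i.e. the hop count under minimal mesh routing."""
    pairs = list(pairs)
    total = sum(abs(a - c) + abs(b - d) for (a, b), (c, d) in pairs)
    return total / len(pairs)

transpose_3x3 = ntnt_transpose(3)
avg_transpose_hops = mean_hops(transpose_3x3.items())
center_broadcast = broadcast_destinations(3, (1, 1))
```

In the transpose pattern each node has exactly one destination and diagonal nodes send to themselves, whereas a broadcast source injects one packet per remaining node, which loads the network far more heavily.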
international conference on asic | 2009
Ying-Chun Chen; Gaoming Du; Luo-Feng Geng; Duoli Zhang; Ming-Lun Gao
A dynamically reconfigurable computing system based on network-on-chip (DReNoC) is proposed, which consists of computing nodes and communication nodes. Each computing node is a complete coarse-grained dynamically reconfigurable SoC named DReSoC, and the DReSoCs communicate with each other through on-chip network routers. The proposed DReNoC has been implemented on the Altera Stratix II EP2S180 DSP development board with 48063 combinational ALUTs and 26211 logic registers. Experimental results for sequential 8×8 matrix multiplications showed that, compared with a single-core system-on-chip (SoC) based on the standard Nios II processor, the speed-up ratio can reach 124.911.
2009 Fourth International Conference on Embedded and Multimedia Computing | 2009
Luo-Feng Geng; Duo-li Zhang; Ming-Lun Gao
The cluster-based multiprocessor system-on-chip (MPSoC), which adopts a hybrid interconnection composed of both bus-based and NoC architectures, is a new infrastructure for MPSoC. To enable fast exploration of multiple hardware (HW) and software (SW) implementation alternatives, with accurate performance estimates for tuning the MPSoC architecture at an early stage of the design process, this paper uses an FPGA device to prototype the cluster-based MPSoC with 17 processing cores. A suite of benchmarks, including several parallel applications with different characteristics of parallelism, workload and communication pattern, is designed and presented in this paper. The experimental results show that the highest speedup ratio is up to 15.850. Index Terms: Multiprocessor System-on-Chip, Network-on-Chip, Performance Evaluation.
international conference on asic | 2007
Ya-Jun He; Duo-Li Zhang; Bin Shen; Luo-Feng Geng
An efficient Huffman decoding method is presented in this paper. This new method first partitions a Huffman tree into subtrees. Then some look-up tables are used to represent these subtrees and the symbols of some subtrees are decoded by direct combinational logic. It is shown that by employing this technique decoding operations become significantly faster, and the memory consumption also becomes much smaller compared to the normal Huffman decoding.
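The table-based part of this idea can be sketched in Python as a two-level lookup: a root table indexed by the first k bits resolves short codes directly, and longer codes fall through to a per-prefix subtree table. This is a generic illustration under stated assumptions, not the paper's partitioning scheme; it does not model the direct combinational-logic decoding of some subtrees, and the names `build_tables`, `decode`, and the example code set are hypothetical.

```python
def build_tables(codes, k):
    """Build a root table of 2**k entries from a complete prefix code {symbol: bitstring}."""
    root = [None] * (1 << k)
    subtrees = {}  # k-bit prefix -> {remaining bits: symbol}
    for sym, code in codes.items():
        if len(code) <= k:
            # Short code: fill every root slot that starts with this code.
            pad = k - len(code)
            base = int(code, 2) << pad
            for i in range(1 << pad):
                root[base + i] = ("sym", sym, len(code))
        else:
            subtrees.setdefault(code[:k], {})[code[k:]] = sym
    for prefix, tails in subtrees.items():
        depth = max(len(t) for t in tails)
        table = [None] * (1 << depth)
        for tail, sym in tails.items():
            pad = depth - len(tail)
            base = int(tail, 2) << pad
            for i in range(1 << pad):
                table[base + i] = (sym, len(tail))
        root[int(prefix, 2)] = ("sub", table, depth)
    return root

def decode(bits, root, k):
    """Decode a bitstring with at most two table lookups per symbol."""
    out, pos = [], 0
    while pos < len(bits):
        entry = root[int(bits[pos:pos + k].ljust(k, "0"), 2)]
        if entry[0] == "sym":
            _, sym, length = entry
        else:
            _, table, depth = entry
            window = bits[pos + k:pos + k + depth].ljust(depth, "0")
            sym, tail_len = table[int(window, 2)]
            length = k + tail_len
        out.append(sym)
        pos += length
    return out

# Hypothetical complete prefix code used only for this example.
codes = {"a": "0", "b": "10", "c": "110", "d": "1110", "e": "1111"}
root = build_tables(codes, 2)
decoded = decode("01011011101111", root, 2)
```

Compared with walking the tree one bit at a time, each symbol costs a bounded number of lookups, and the subtree tables stay much smaller than one flat table indexed by the longest code length.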