Duoli Zhang
Hefei University of Technology
Publication
Featured research published by Duoli Zhang.
international conference on anti-counterfeiting, security, and identification | 2009
Fu-ming Xiao; Dong-sheng Li; Gaoming Du; Yukun Song; Duoli Zhang; Ming-Lun Gao
While computational cores are becoming faster and faster, the communication efficiency between processors has become a bottleneck that limits the performance of multiprocessor systems-on-chip (MPSoC). This paper focuses on the design and implementation of an AXI bus protocol-based MPSoC architecture. First, RTL models of four NIOS II processors using an AXI communication architecture are developed. Then the MPSoC is implemented in an Altera Stratix II EP2S180 FPGA. Lastly, the performance is evaluated using a matrix operation benchmark and compared with previous in-house architectures. Experiments showed that the proposed prototype can run at 100 MHz, requires 8963 adaptive look-up tables (ALUTs), achieves a maximum speedup ratio of up to 3.81, and performs better than the traditional bus (AHB bus) and 2-D mesh NoC architectures.
international conference on anti-counterfeiting, security, and identification | 2008
Fuhui Du; Gaoming Du; Yukun Song; Duoli Zhang; Ming-Lun Gao
MPEG-2 AAC is a widely used audio standard that is becoming more popular in commercial use. In AAC, the filterbank tool, which is composed of IMDCT, windowing and overlap-add, has the highest computational complexity. Of these three steps, IMDCT is the most important. Hence, most published filterbank algorithms focus mainly on the implementation of IMDCT but overlook the relationship between the steps. This paper proposes a novel architecture for the filterbank tool and its hardware implementation. A fast IMDCT algorithm consisting of pre-IFFT, N/4-point IFFT and post-IFFT stages is employed. To improve the efficiency of memory access, the windowing and overlap-add operations are combined with the post-IFFT stage, so no storage elements are required between them and the post-IFFT results are windowed and overlap-added directly. The proposed architecture contains three hardware modules, each of which is further optimized. In total, four multipliers are time-shared among the modules. Each module reads data continuously from RAM, in a pipelined fashion. As a result, the new architecture improves memory access efficiency, with a speedup of 75% in computation time over the unoptimized one.
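The three filterbank steps named in this abstract (IMDCT, windowing, overlap-add) can be sketched in a few lines. The sketch below is a direct O(N²) reference form, not the paper's fast pre-IFFT / N/4-point IFFT / post-IFFT algorithm or its hardware architecture. The function names, the sine window, and the 2/N scaling convention are assumptions chosen so that the windowed transform reconstructs its input; the forward `mdct` helper is included only to make the example checkable.

```python
import math

def sine_window(length):
    # Sine window (assumed here); it satisfies w[n]^2 + w[n + N]^2 = 1.
    return [math.sin(math.pi * (n + 0.5) / length) for n in range(length)]

def mdct(frame):
    # Forward MDCT of a 2N-sample frame into N coefficients (helper for the demo).
    n = len(frame) // 2
    return [sum(frame[t] * math.cos(math.pi / n * (t + 0.5 + n / 2) * (k + 0.5))
                for t in range(2 * n))
            for k in range(n)]

def imdct(coeffs):
    # Step 1: direct inverse MDCT, N coefficients to 2N time-aliased samples.
    n = len(coeffs)
    return [(2.0 / n) * sum(coeffs[k] * math.cos(math.pi / n * (t + 0.5 + n / 2) * (k + 0.5))
                            for k in range(n))
            for t in range(2 * n)]

def filterbank(coeff_frames, n):
    # Steps 2 and 3: window each IMDCT output, then overlap-add with the
    # second half of the previous frame. Emits N samples per input frame.
    window = sine_window(2 * n)
    overlap = [0.0] * n
    out = []
    for coeffs in coeff_frames:
        y = imdct(coeffs)
        y = [y[t] * window[t] for t in range(2 * n)]
        out.extend(y[t] + overlap[t] for t in range(n))
        overlap = y[n:]
    return out
```

Applying the window and the overlap-add directly to each IMDCT result, as in `filterbank`, mirrors the abstract's point that no intermediate storage is needed between those steps beyond the N-sample overlap buffer.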
international conference on anti-counterfeiting, security, and identification | 2010
Xin Jin; Yukun Song; Duoli Zhang
To improve the performance of SoCs for real-time applications, more and more processing cores have been integrated into a single chip, known as an MPSoC. One of the key problems is how to design the MPSoC architecture to improve overall performance. In this paper, a cluster-based MPSoC using hierarchical on-chip communication is proposed. At the top level, an on-chip network is used as the communication backbone between clusters. At the cluster level, processing cores and IPs communicate with each other via a hierarchical bus. This paper focuses on the design and verification of the computation cluster, which consists of several RISC processors and storage components. Separate control and data paths are designed to meet the performance requirements. The proposed architecture is implemented in an FPGA prototype, and a video application is mapped onto the prototype to verify its functionality. Experiments show that the proposed MPSoC can work at 90 MHz and successfully performs real-time fade-in-fade-out processing of four video lanes at 320×240 and 24 frames per second.
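The abstract does not say how the fade-in-fade-out effect is computed. A common formulation is a linear cross-fade between an outgoing and an incoming frame, sketched below for 8-bit grayscale frames stored as lists of rows. The function names, the frame representation, and the rounding rule are all assumptions for illustration. The second helper only restates the workload from the abstract as a pixel rate.

```python
def crossfade(frame_out, frame_in, step, total_steps):
    # Linear blend of two equally sized 8-bit frames (hypothetical helper).
    # step = 0 returns frame_out, step = total_steps returns frame_in.
    keep = total_steps - step
    half = total_steps // 2  # rounding offset for the integer division
    return [[(a * keep + b * step + half) // total_steps
             for a, b in zip(row_out, row_in)]
            for row_out, row_in in zip(frame_out, frame_in)]

def pixel_rate(lanes, width, height, fps):
    # Output pixels per second for the given number of video lanes.
    return lanes * width * height * fps

# Workload quoted in the abstract: 4 lanes of 320x240 video at 24 frames/s.
rate = pixel_rate(4, 320, 240, 24)
# Clock cycles available per output pixel at the reported 90 MHz.
cycles_per_pixel = 90_000_000 / rate
```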
international conference on anti-counterfeiting, security, and identification | 2009
Luo-Feng Geng; Duoli Zhang; Ming-Lun Gao; Ying-Chun Chen; Gaoming Du
The multiprocessor system-on-chip (MPSoC) is a promising solution for future complex computer and embedded systems, and the network-on-chip (NoC) has been proposed as the future on-chip interconnect. However, NoCs make parallel programming and the synchronization of different processor cores more challenging. This paper proposes a new cluster-based homogeneous MPSoC architecture that adopts a hybrid interconnect composed of both bus-based and NoC architectures. The architecture has been implemented as a prototype on an FPGA device, integrating 17 processor cores. The performance of the prototype is evaluated with two real applications, matrix chain multiplication and JPEG picture decoding. The speedup ratio of the prototype is up to 15.850.
international conference on anti-counterfeiting, security, and identification | 2009
Ning Hou; Duoli Zhang; Gaoming Du; Yukun Song; Haihua Wen
Recent trends point to multiprocessor systems-on-chip (MPSoCs) as a promising solution for high-performance embedded systems, and the key challenge is how to improve communication efficiency. Network-on-chip (NoC) is considered a new paradigm for next-generation communication architectures because of its extensibility and power efficiency. The router is the fundamental unit of a NoC. In this paper, a NoC prototype consisting of 6 ARM-compatible cores and a router-based on-chip network is designed and implemented on an FPGA device. Unlike our earlier prototypes, this prototype comprises more cores and uses virtual-channel routers instead of basic routers. In particular, to evaluate network performance, we present a run-time network monitor system that monitors the on-chip network by calculating performance parameters such as average latency and throughput. The experimental results show that this prototype with 2×3 virtual-channel routers has lower average latency than the former basic-router prototype and improves throughput by up to 62%. Furthermore, a JPEG decoding application is run on this prototype, which works stably at 50 MHz, and decoding is accelerated by the use of two decoding lanes.
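The two performance parameters the monitor reports can be made concrete with a small sketch. The abstract does not give the monitor's exact definitions, so the ones below are assumptions: latency is the number of cycles between injection and ejection of a packet, and throughput is delivered flits per cycle per node. The record type and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class PacketRecord:
    inject_cycle: int  # cycle at which the packet entered the network
    eject_cycle: int   # cycle at which the packet left the network
    flits: int         # packet length in flits

def average_latency(records):
    # Mean injection-to-ejection latency in cycles.
    if not records:
        return 0.0
    return sum(r.eject_cycle - r.inject_cycle for r in records) / len(records)

def throughput(records, window_cycles, num_nodes):
    # Delivered flits per cycle per node over the observation window.
    delivered = sum(r.flits for r in records)
    return delivered / (window_cycles * num_nodes)
```

With these definitions, a 62% throughput improvement corresponds to a ratio of 1.62 between the values measured on the new and the old prototype over the same window.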
international conference on asic | 2009
Junqiao Huang; Gaoming Du; Duoli Zhang; Yukun Song; Luo-Feng Geng; Ming-Lun Gao
A VLSI design of a complex quadrature mirror filterbank (QMF) for the MPEG-4 High Efficiency Advanced Audio Coding (MPEG-4 HE-AAC) decoder using a resource-sharing technique is proposed. An algorithm that uses the conventional discrete cosine transform of type IV (DCT-IV) to optimize the complex QMF is derived in this paper. With the proposed algorithm, the VLSI design of the complex-valued analysis quadrature mirror filterbank (complex-AQMF) and synthesis quadrature mirror filterbank (complex-SQMF) improves resource efficiency by sharing the same DCT module. Experimental results show that the computational complexity of the complex-QMF can be reduced by up to 8.59%, and that the VLSI architecture of the proposed algorithm saves about 53% of the area and 50% of the memory owing to the shared DCT-IV resources.
international conference on anti-counterfeiting, security, and identification | 2010
Chunhua Chen; Gaoming Du; Duoli Zhang; Yukun Song; Ning Hou
Inter-processor communication synchronization in a multiprocessor system-on-chip (MPSoC) is one of the key factors in whole-chip performance. It not only affects the efficiency of task-level parallelism but also depends heavily on the MPSoC hardware architecture. Two synchronization mechanisms, mailbox and packet switching, are studied and analyzed in a network-on-chip-based MPSoC. First, the two schemes are implemented and verified in stand-alone mode and analyzed in terms of communication latency, communication bandwidth and resource utilization. Then the two schemes are analyzed in an MPSoC prototype environment that runs real-time fade-in fade-out video processing. Experimental results show that the mailbox-based synchronization scheme has low latency and low resource overhead, but it is not feasible for a large number of clusters due to physical limitations. Although the packet-based scheme has higher latency, it is more scalable and feasible.
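The mailbox idea can be illustrated in software as a single-slot message register with a full/empty flag: the sender waits while the slot is full, and the receiver waits while it is empty. The sketch below uses Python threads as a stand-in for two processors. It is an assumption about the general mechanism, not a model of the paper's hardware mailbox, and the class and method names are hypothetical.

```python
import threading

class Mailbox:
    # Single-slot mailbox guarded by a full/empty flag (illustrative only).
    def __init__(self):
        self._cond = threading.Condition()
        self._full = False
        self._msg = None

    def post(self, msg):
        # Sender side: block until the slot is empty, then write the message.
        with self._cond:
            while self._full:
                self._cond.wait()
            self._msg = msg
            self._full = True
            self._cond.notify_all()

    def fetch(self):
        # Receiver side: block until the slot is full, then read and clear it.
        with self._cond:
            while not self._full:
                self._cond.wait()
            msg = self._msg
            self._full = False
            self._cond.notify_all()
            return msg
```

Each sender and receiver pair needs its own slot and flag in this scheme, which gives a rough intuition for the abstract's remark that mailboxes become impractical as the number of clusters grows.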
pacific-asia workshop on computational intelligence and industrial application | 2008
Gaoming Du; Duoli Zhang; Yukun Song; Ming-Lun Gao; Luo-Feng Geng; Ning Hou
With the development of IC technology and increasing demand for processing power, more and more processing cores are being integrated into a single chip. One of the key problems is the communication efficiency between the processing cores, and network-on-chip (NoC) has been proposed as a promising architecture. In this paper, the scalability of a 2-D mesh-based NoC is analyzed. First, a mesh-based NoC router using the XY routing algorithm is designed and implemented in an FPGA prototype. Second, 2×2 and 3×3 NoCs are constructed from this router module, with each router connected to a processing core via a resource network interface (RNI). Finally, pipelined matrix multiplications and FFT are executed to evaluate the performance of the 2-D mesh-based NoC, together with the router area overhead as the number of processing nodes increases. Experiments showed that the 2-D mesh-based NoC architecture scales easily as the number of processing nodes increases, with small resource overhead.
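XY routing, the algorithm named in this abstract, is dimension-ordered: a packet first travels along the X axis until its column matches the destination, then along the Y axis. A minimal sketch of the per-router decision follows. The port names and the coordinate convention (x grows to the east, y grows to the north) are assumptions for illustration.

```python
# Coordinate change for each output port (assumed orientation).
STEP = {"EAST": (1, 0), "WEST": (-1, 0), "NORTH": (0, 1), "SOUTH": (0, -1)}

def next_port(cur, dst):
    # Output port chosen by the router at `cur` for a packet headed to `dst`.
    (cx, cy), (dx, dy) = cur, dst
    if dx > cx:
        return "EAST"
    if dx < cx:
        return "WEST"
    if dy > cy:
        return "NORTH"
    if dy < cy:
        return "SOUTH"
    return "LOCAL"  # arrived: deliver to the attached processing core

def xy_route(src, dst):
    # Full list of routers visited from src to dst, both included.
    path = [src]
    cur = src
    while True:
        port = next_port(cur, dst)
        if port == "LOCAL":
            return path
        sx, sy = STEP[port]
        cur = (cur[0] + sx, cur[1] + sy)
        path.append(cur)
```

For example, `xy_route((0, 0), (2, 1))` visits `(0, 0)`, `(1, 0)`, `(2, 0)` and `(2, 1)`. Because every packet resolves X before Y, the decision needs only two coordinate comparisons per hop, which keeps the router logic small.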
international conference on anti-counterfeiting, security, and identification | 2008
Liang Ma; Gaoming Du; Duoli Zhang; Yukun Song; Luo-Feng Geng; Ming-Lun Gao
A new architecture based on parallel FIR systolic arrays for motion compensation interpolation in H.264/AVC is presented in this paper. Unlike other interpolation architectures based on a traditional adder tree or a single systolic FIR filter, this design combines the pipelining of the systolic FIR filter with a high degree of parallelism. It has the following characteristics. First, it uses several strategies to reduce the number of memory accesses: the design fully exploits the recursive relation between the fractional-pel samples, adopts appropriate interpolation orders for different situations, and uses two buffers to store intermediate values. Second, it increases the system clock frequency by replacing the traditional adder tree with the systolic FIR filter. Third, it enhances interpolation throughput by generating four fractional-pel samples in parallel. Fourth, it does not need high memory bandwidth and can work with different bus widths by changing the number of systolic FIR filters. The design is synthesized with Synopsys Design Compiler using TSMC 0.18 µm standard-cell CMOS technology. The synthesis results show that this architecture can run at 230 MHz and meets the interpolation requirements of an H.264 decoder for SDTV or HDTV.
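For context, the arithmetic being accelerated is the H.264 luma interpolation filter: half-pel samples come from a six-tap FIR filter with taps (1, -5, 20, 20, -5, 1), and quarter-pel samples are rounded averages of two neighbouring samples. The sketch below shows only that arithmetic for the positions computed directly from integer samples. It does not model the systolic architecture, and the function names are hypothetical.

```python
TAPS = (1, -5, 20, 20, -5, 1)  # H.264 six-tap luma filter, coefficients sum to 32

def clip8(value):
    # Clamp to the 8-bit sample range.
    return max(0, min(255, value))

def half_pel(samples):
    # Half-pel sample between samples[2] and samples[3], computed from six
    # consecutive integer-position samples in a row or column.
    acc = sum(t * s for t, s in zip(TAPS, samples))
    return clip8((acc + 16) >> 5)  # round, then divide by 32

def quarter_pel(a, b):
    # Quarter-pel sample as the rounded average of two neighbouring samples.
    return (a + b + 1) >> 1
```

Quarter-pel samples reuse already computed half-pel values, which is one form of the dependency between fractional-pel samples that the abstract describes exploiting to cut memory accesses.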
international conference on computer science and network technology | 2011
Qian Zhou; Yukun Song; Duoli Zhang; Gaoming Du
With the development of large-scale integrated circuits and semiconductor manufacturing processes, multiprocessor systems-on-chip provide a feasible solution for highly parallel computation and communication. In this paper, taking JPEG decoding as the starting point, a multi-core decoding system based on the Avalon bus is presented. The paper briefly introduces the principle of JPEG decoding and the hardware architecture of the system, and analyzes how JPEG decoding is parallelized. To compare resource usage, the system is compared with a 4-core JPEG decoding system based on the AHB bus on the EP2S180 FPGA development board. According to the experiments, the multi-core JPEG decoding system based on the Avalon bus uses fewer resources, and compared with a single-core system based on the Avalon bus, it reduces the total decoding time for the same four pictures by about 66.7%.
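A time saving and a speedup ratio describe the same measurement in two ways. If the multi-core system takes a fraction (1 - s) of the single-core time, the speedup is 1 / (1 - s). The small sketch below converts between the two, with hypothetical function names, and applies it to the 66.7% figure quoted in the abstract.

```python
def speedup_from_saving(saving):
    # Speedup ratio implied by a fractional reduction in run time.
    if not 0.0 <= saving < 1.0:
        raise ValueError("saving must be in [0, 1)")
    return 1.0 / (1.0 - saving)

def saving_from_speedup(speedup):
    # Fractional reduction in run time implied by a speedup ratio.
    if speedup <= 0.0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup

# The 66.7% saving reported in the abstract corresponds to roughly a 3x speedup.
reported = speedup_from_saving(0.667)
```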