Zhenqi Wei
Shanghai Jiao Tong University
Publication
Featured research published by Zhenqi Wei.
Symposium on Application Specific Processors | 2010
Ji Kong; Peilin Liu; Xianmin Chen; Jin Wang; Xingguang Pan; Jun Wang; He Xiao; Zhenqi Wei; Rendong Ying
For next-generation audio applications, the dominant trends are much higher sample rates, larger word lengths and more audio channels for playback. Traditional DSPs and embedded processors are inefficient for such applications because of their non-specific or limited computing capabilities and their on-chip memory architectures. This paper proposes an embedded audio processor aimed at next-generation audio applications. Its audio-specific instruction set architecture is based on an analysis of the requirements of next-generation audio processing. In addition, a novel tightly coupled audio memory is proposed to support extremely high audio data throughput and flexible audio data transfers with main memory. To evaluate the proposed audio processor, a set of benchmarks based on an analysis of next-generation audio applications is used. The implementation and evaluation results indicate that the proposed audio processor is highly efficient and cost-effective for next-generation audio applications.
International Symposium on Circuits and Systems | 2015
Zhenqi Wei; Peilin Liu; Rongdi Sun; Rendong Ying
As one of the most widely used synchronization schemes in parallel programming on multi-core processors, barrier synchronization has been extensively studied in prior work. In a conventional master-slave or tree barrier, one central core is usually selected to collect barrier arrival messages and to broadcast barrier release messages. Unfortunately, the barrier core is sometimes located away from the center of the network, which can degrade synchronization efficiency. We propose a hybrid tree-based all-to-all (TAB) barrier for NoC-based many-core processors to relieve the performance degradation caused by an off-center barrier core. The performance of the TAB barrier is compared with canonical algorithms and a former solution; almost 20% of the time is saved in off-center scenarios, with marginal area and power overhead.
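The cost of an off-center barrier core can be illustrated with a small hop-count calculation. This is only a sketch under stated assumptions, not the TAB barrier itself: it assumes a 2D mesh where a message between two nodes travels their Manhattan distance in hops, and the function names `max_hops` and `barrier_round_trip` are hypothetical.

```python
def max_hops(width, height, core):
    """Worst-case hop count from any mesh node to the barrier core.

    Assumes a width x height 2D mesh and shortest-path routing, so the
    hop count between two nodes is their Manhattan distance.
    """
    cx, cy = core
    return max(abs(x - cx) + abs(y - cy)
               for x in range(width)
               for y in range(height))


def barrier_round_trip(width, height, core):
    """Hops on the critical path of a master-slave barrier.

    The farthest node must first reach the barrier core (arrival),
    and the release must then travel back to it.
    """
    return 2 * max_hops(width, height, core)


if __name__ == "__main__":
    # Compare a near-central barrier core with one placed in a corner.
    print(barrier_round_trip(8, 8, (3, 3)))
    print(barrier_round_trip(8, 8, (0, 0)))
```

In an 8x8 mesh the corner placement nearly doubles the worst-case path length, which is the kind of degradation the abstract attributes to an off-center barrier core.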
International Symposium on Circuits and Systems | 2014
Zhenqi Wei; Peilin Liu; Zhencheng Zeng; Jiangwei Xu; Rendong Ying
Parallelized applications running on many-core Network-on-Chip (NoC) processors may spend a large part of their execution time synchronizing threads mapped onto multiple NoC nodes if synchronization is not carefully designed. In this paper, we propose an instruction-based synchronization solution for a packet-switched many-core NoC processor with a 2D mesh topology. Return links are added to the on-chip network to transmit acknowledgements of read requests, while a specific instruction, SET, is designed as an instruction set extension to the original pipeline to perform atomic read-modify-write operations. To support various synchronization schemes, a hardware unit, SYNC, containing globally addressable registers used as shared variables handles synchronization requests from both local and remote NoC nodes. Additionally, a FIFO located in the SYNC unit can store these synchronization requests and poll on shared variables locally. Thus, network contention due to busy-wait synchronization algorithms is greatly reduced. Synchronization schemes including spinlock, barrier, FIFO spinlock and semaphore are implemented as inline assembly functions. Synthesis results under a 55 nm process suggest low area and power overhead for the hardware design. The performance of the synchronization schemes is evaluated and compared with conventional methods and prior work, showing that the proposed solution is more efficient.
Asia Pacific Conference on Circuits and Systems | 2014
Zhenqi Wei; Peilin Liu; Rongdi Sun; Rendong Ying
As one of the most widely used synchronization schemes in parallel programming, the spin lock is supported in most off-the-shelf multi-/many-core processors. However, classical spin lock synchronization may lead to contention in acquiring the single lock and to starvation of some threads that busy-wait to be served. Queue-based spin locks have therefore been put forward to eliminate both the contention and the unfairness of conventional schemes. However, applying queue-based spin lock synchronization in NoC processors introduces additional on-chip traffic to preserve the serving sequence of participating cores. In this paper we propose a hardware solution for queue-based spin locks in NoC processors. A new instruction is designed to perform atomic read-after-write operations within a single instruction, and a synchronization controller is used to handle global synchronization requests efficiently. Experimental results show that our proposal outperforms former solutions and can save more than half of the time in some cases, with marginal hardware overhead.
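The fairness property of a FIFO spin lock can be sketched in software with a ticket lock, one simple lock that serves waiters in arrival order. This is an illustration only, not the hardware scheme of the paper: the class `TicketLock` and the helper `run_demo` are hypothetical, and a `threading.Lock` stands in for the atomic instruction that hands out tickets.

```python
import threading
import time


class TicketLock:
    """FIFO spin lock: threads are served in the order they took a ticket."""

    def __init__(self):
        self._next_ticket = 0
        self._now_serving = 0
        # Stands in for an atomic fetch-and-increment instruction
        # (an assumption of this sketch).
        self._atomic = threading.Lock()

    def acquire(self):
        # Atomically take the next ticket number.
        with self._atomic:
            ticket = self._next_ticket
            self._next_ticket += 1
        # Busy-wait until this ticket is being served.
        while self._now_serving != ticket:
            time.sleep(0)  # yield so other threads can make progress
        return ticket

    def release(self):
        # Only the current holder writes this field, so no extra
        # protection is needed in this sketch.
        self._now_serving += 1


def run_demo(num_threads=4, rounds=25):
    """Increment a shared counter under the lock from several threads."""
    lock = TicketLock()
    state = {"counter": 0}
    served = []  # ticket numbers in the order they entered the critical section

    def worker():
        for _ in range(rounds):
            ticket = lock.acquire()
            served.append(ticket)
            state["counter"] += 1
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["counter"], served


if __name__ == "__main__":
    counter, served = run_demo()
    print(counter, served == sorted(served))
```

Because every waiter spins on the shared `_now_serving` value, each release is observed by all waiters; on a NoC this polling becomes on-chip traffic, which is the overhead the abstract sets out to reduce.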
International Symposium on Circuits and Systems | 2013
Zhenqi Wei; Peilin Liu; Cun Yu; Hongbin Zhou; Ying Ye; Ji Kong; Rendong Ying
Server-terminal based distributed speech recognition (DSR) applications are widely adopted on mobile devices. In this paper, we implement a power-efficient, high-performance DSR solution for real-time speech processing. The DSR frontend algorithms are carefully optimized in assembly code using acceleration techniques provided by a previously released audio DSP, such as binary scaling operations in a deep instruction pipeline, an automatic memory addressing method, and parallel processing of packaged data. The performance of the DSR frontend software running on the DSP is greatly improved, and our work is the most efficient among the former solutions compared. The clock frequency required for real-time processing of 16 kHz input streams is 124.3 MHz, only about 30% of what is required on a TI C64x DSP. Based on simulation under the SMIC 130 nm process, the power consumed for DSR frontend processing is 23 mW. In addition, the presented implementation of the algorithms is integrated in a server-terminal demo system and is shown to work well in real speech recognition applications.
International Symposium on Circuits and Systems | 2011
Ji Kong; Peilin Liu; Zhenqi Wei; Kun Yang; Ying Ye; Rendong Ying
In the stream programming style, computation and memory accesses are decoupled as much as possible. This programming style brings gains in both performance and power efficiency for digital signal processing. In this paper, the architecture of a stream-programming-oriented, power-efficient digital signal processor named StreamPoP is presented for consumer audio applications. The instruction and data supply subsystem of StreamPoP is designed for flexible stream programming and high power efficiency. To further reduce power consumption, an audio-computation-specific deep instruction pipeline (DIP) is used in the micro-architecture of StreamPoP. To evaluate the performance of StreamPoP for consumer audio applications, a set of audio benchmarks is used. The results show that StreamPoP outperforms conventional high-performance DSPs while being much more power-efficient for audio applications. The simulated power consumption of StreamPoP under the TSMC 90 nm process is 5.1 mW for real-time AAC decoding.
Science in China Series F: Information Sciences | 2017
Zhenqi Wei; Peilin Liu; Rongdi Sun
Realizing barrier synchronization in multi-/many-core processors with high efficiency becomes more and more challenging as the number of cores integrated in a single chip keeps growing. Quite a few barrier solutions have been proposed, but they provide limited improvements when synchronizing large numbers of cores or impose unfavorable restrictions on concurrent barriers. This paper presents HyBar, a hardware barrier based on a hybrid switching NoC that adopts packet switching and circuit switching in two sub-networks respectively. Dedicated channels in the circuit-switching sub-network are dynamically built and removed as barrier requests traverse the packet-switching sub-network according to a modified dimension-order routing algorithm. The efficiency of inter-core communication for concurrent barriers is improved by merging barrier arrival requests and broadcasting release requests along the circuit channels. The execution times of synthetic cases, benchmark kernels and parallel applications using various barrier solutions are evaluated on an RTL-based simulation platform. Experimental results show that our proposal provides about 15%–50% performance improvement compared to previous solutions, while the hardware overhead is marginal under SMIC 40 nm technology. Moreover, HyBar incurs only a minor efficiency loss for concurrent barriers and places no limitation on the layout of their participating cores in the on-chip network.
IEICE Electronics Express | 2016
Zhenqi Wei; Peilin Liu; Rongdi Sun; Zunquan Zhou; Ke Jin; Dajiang Zhou
With a growing number of cores integrated in a single chip, the efficiency of inter-core direct memory access (DMA) transfers has an increasingly significant impact on the overall performance of parallel applications running on network-on-chip (NoC) processors. In this paper we propose HyDMA, a low-latency inter-core DMA approach based on a hybrid packet-circuit switching NoC. With dynamic setup and lengthening of circuit channels composed of bidirectional links, HyDMA achieves both the high flexibility of packet switching and the low communication latency of circuit switching for concurrent DMA transfers. Experimental results show that HyDMA exhibits high efficiency with marginal hardware overhead.
IEEE Transactions on Very Large Scale Integration Systems | 2016
Zhenqi Wei; Peilin Liu; Rongdi Sun; Jun Dai; Zunquan Zhou; Xiangming Geng; Rendong Ying
Low-bit-rate vocoders are widely used in military and security fields, where multiple vocoder channels must run efficiently on embedded devices. We propose HAVA, a multicore Application Specific Instruction Set Processor for multichannel low-bit-rate vocoders with real-time performance. To provide both flexibility and efficiency, HAVA integrates two types of processing cores and a shared-memory core on a 2-D-mesh on-chip network. Adopting a single-Instruction Set Architecture heterogeneous multicore architecture, HAVA cuts down the real-time performance requirement of vocoders by over 40% compared with other platforms. By leveraging the on-chip network for intercore communication, HAVA can run multichannel vocoders with a marginal efficiency loss. The chip implementation of HAVA is finished in a 40-nm CMOS technology, and it dissipates 149 mW at a 100-MHz operating frequency for four channels of encoders.
2016 Fourth International Conference on Ubiquitous Positioning, Indoor Navigation and Location Based Services (UPINLBS) | 2016
Ke Jin; Peilin Liu; Rongdi Sun; Zhenqi Wei; Zunquan Zhou
This paper presents a real-time plane segmentation method that can be used in navigation systems for the visually impaired to avoid indoor obstacles. The proposed method is based on surface normal estimation in range images. Efficiency and overall accuracy are the two main challenges for plane segmentation algorithms that use depth information. Our method exploits integral images to make normal estimation more efficient, and it dynamically determines the smoothing region to improve overall accuracy. Compared to the methods in the Point Cloud Library (PCL), our method consumes less time and performs better over a wide range of depths (1 to 8 m). The proposed method is implemented on the Robot Operating System (ROS) and runs at 30 fps. Our method makes it possible to obtain a robust, real-time indoor navigation system integrated with commercial time-of-flight (TOF) sensors.
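The integral-image idea mentioned in the abstract can be sketched as follows. This is a generic summed-area table, not the authors' implementation: once the table is built, the sum over any rectangular smoothing region costs four lookups regardless of its size. The function names `integral_image` and `box_sum` are hypothetical.

```python
def integral_image(img):
    """Build a summed-area table with one extra row and column of zeros.

    table[r][c] holds the sum of img[i][j] for all i < r and j < c.
    """
    rows, cols = len(img), len(img[0])
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        row_sum = 0
        for c in range(cols):
            row_sum += img[r][c]
            table[r + 1][c + 1] = table[r][c + 1] + row_sum
    return table


def box_sum(table, r0, c0, r1, c1):
    """Sum of img over rows r0..r1-1 and columns c0..c1-1 in O(1)."""
    return table[r1][c1] - table[r0][c1] - table[r1][c0] + table[r0][c0]


if __name__ == "__main__":
    depth = [[1, 2, 3],
             [4, 5, 6],
             [7, 8, 9]]
    table = integral_image(depth)
    # Mean over the bottom-right 2x2 window, as a box smoothing filter would use.
    print(box_sum(table, 1, 1, 3, 3) / 4)
```

Because the lookup cost is independent of the window size, the size of the smoothing region can be changed per pixel without changing the per-pixel cost.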