Publication


Featured research published by Rongdi Sun.


International Symposium on Circuits and Systems | 2015

TAB barrier: Hybrid barrier synchronization for NoC-based processors

Zhenqi Wei; Peilin Liu; Rongdi Sun; Rendong Ying

Barrier synchronization is one of the most widely used synchronization schemes in parallel programming on multi-core processors and has been studied extensively in prior work. In a conventional master-slave or tree barrier, one central core is usually selected to collect barrier-arrival messages and to broadcast barrier-release messages. However, the barrier core is sometimes located away from the center of the chip, which can degrade synchronization efficiency. We propose a hybrid tree-based all-to-all (TAB) barrier for NoC-based many-core processors to relieve the performance degradation caused by an off-center barrier core. The TAB barrier is compared with canonical algorithms and a previous solution; almost 20% of the time is saved in off-center scenarios, with marginal area and power overhead.
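The cost of an off-center barrier core can be illustrated with a small calculation. The sketch below is not taken from the paper: it assumes an 8x8 mesh and uses Manhattan hop count as a rough proxy for message latency, and the names `max_hops` and `total_hops` are hypothetical helpers.

```python
def max_hops(core, width, height):
    """Largest Manhattan distance from `core` to any node of the mesh."""
    x, y = core
    return max(x, width - 1 - x) + max(y, height - 1 - y)


def total_hops(core, width, height):
    """Sum of Manhattan distances from `core` to every node of the mesh."""
    x, y = core
    return sum(abs(x - i) + abs(y - j)
               for i in range(width) for j in range(height))


# Assumed 8x8 mesh: a near-central barrier core versus one in a corner.
center_core, corner_core = (3, 3), (0, 0)
print(max_hops(center_core, 8, 8), total_hops(center_core, 8, 8))
print(max_hops(corner_core, 8, 8), total_hops(corner_core, 8, 8))
```

Under these assumptions the corner core is farther from the remaining nodes, both in the worst case and in total, which is the kind of degradation the abstract attributes to an off-center barrier core.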


Asia Pacific Conference on Circuits and Systems | 2014

High-efficient queue-based spin locks for Network-on-Chip processors

Zhenqi Wei; Peilin Liu; Rongdi Sun; Rendong Ying

Spin locks are one of the most widely used synchronization schemes in parallel programming and are supported in most off-the-shelf multi-/many-core processors. However, classical spin lock synchronization can lead to contention for the single lock and to starvation of some threads that busy-wait to be served. Queue-based spin locks have therefore been put forward to eliminate both the contention and the unfairness of conventional schemes. Applying queue-based spin lock synchronization in NoC processors, however, introduces additional on-chip traffic to preserve the serving order of the participating cores. In this paper we propose a hardware solution for queue-based spin locks on NoC processors. A new instruction is designed to perform atomic read-after-write operations within a single instruction, and a synchronization controller is used to handle global synchronization requests efficiently. Experimental results show that our proposal outperforms previous solutions and can save more than half of the time in some cases with marginal hardware overhead.
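The fairness property that motivates queue-based locks can be sketched in software. The example below is a ticket lock, one of the simplest FIFO spin locks, standing in for the queue-based family; it is not the hardware scheme proposed in the paper. The class `TicketLock` and the function `run_demo` are hypothetical names, and a `threading.Lock` is assumed here as a stand-in for an atomic fetch-and-increment instruction.

```python
import threading
import time


class TicketLock:
    """FIFO spin lock: threads are served in the order they drew tickets."""

    def __init__(self):
        self._next_ticket = 0
        self._now_serving = 0
        # Assumption: models a hardware atomic fetch-and-increment.
        self._atomic = threading.Lock()

    def acquire(self):
        with self._atomic:
            ticket = self._next_ticket
            self._next_ticket += 1
        while self._now_serving != ticket:
            time.sleep(0)  # yield the interpreter while spinning
        return ticket

    def release(self):
        # Only the current holder writes this field, so no extra guard is used.
        self._now_serving += 1


def run_demo(num_threads=4, rounds=25):
    """Return the order in which tickets entered the critical section."""
    lock = TicketLock()
    served = []

    def worker():
        for _ in range(rounds):
            ticket = lock.acquire()
            served.append(ticket)  # critical section
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return served


served = run_demo()
print(served == sorted(served))
```

Because every thread spins on the shared `_now_serving` value, a ticket lock still generates traffic on each release; queue-based designs such as MCS or CLH locks let each waiter spin on its own location instead.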


International Symposium on Circuits and Systems | 2017

A low latency feature extraction accelerator with reduced internal memory

Rongdi Sun; Peilin Liu; Jun Wang; Zunquan Zhou

ORB (Oriented FAST and Rotated BRIEF) feature extraction is popular in embedded vision applications such as visual navigation because of its high speed and robustness in many situations. However, feature description in ORB still accesses a large number of image patches, especially when an image pyramid is built. To reduce internal memory cost while maintaining low-latency processing, we design a hybrid pipeline architecture for ORB feature extraction. The accelerator combines different levels of computing granularity and migrates image pyramids to external memory. In addition, a data reuse scheme is adopted in descriptor generation to minimize external memory access and to support operation at multiple scales. The synthesis result shows an internal memory cost of 700kb and a low power consumption of 24.5mW. Experiments demonstrate a 22% bandwidth reduction on average from the data reuse scheme. The system is verified on an FPGA platform and can provide 4000 features per frame, achieving up to 81fps at 1080p resolution at a frequency of 100MHz.


Science in China Series F: Information Sciences | 2017

HyBar: high efficient barrier synchronization based on a hybrid packet-circuit switching Network-on-Chip

Zhenqi Wei; Peilin Liu; Rongdi Sun

Realizing barrier synchronization in multi-/many-core processors with high efficiency becomes more and more challenging as the number of cores integrated in a single chip keeps growing. Quite a few barrier solutions have been proposed, but they provide limited improvements for synchronizing large numbers of cores or impose unfavorable restrictions on performing concurrent barriers. This paper presents HyBar, a hardware barrier based on a hybrid switching NoC that adopts packet switching and circuit switching in two separate sub-networks. Dedicated channels in the circuit-switching sub-network are dynamically built and removed as barrier requests traverse the packet-switching sub-network according to a modified dimension-order routing algorithm. The efficiency of inter-core communication for concurrent barriers is improved by merging barrier arrival requests and broadcasting release requests along the circuit channels. The execution times of synthetic cases, benchmark kernels and parallel applications using various barrier solutions are evaluated on an RTL-based simulation platform. Experimental results show that our proposal provides about 15%–50% performance improvement compared with previous solutions, while the hardware overhead is marginal under SMIC 40 nm technology. Moreover, HyBar introduces only a minor efficiency loss for concurrent barriers, with no limitation on the layout of participating cores in the on-chip network.
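For readers unfamiliar with dimension-order routing, a minimal sketch of the textbook XY variant on a 2-D mesh follows. The abstract refers to a modified algorithm whose details are not given here, so this shows only the unmodified baseline; `xy_route` is a hypothetical helper name.

```python
def xy_route(src, dst):
    """Return the mesh nodes visited from src to dst, routing X first, then Y."""
    x, y = src
    path = [(x, y)]
    # Resolve the X offset completely before touching Y.
    step = 1 if dst[0] > x else -1
    while x != dst[0]:
        x += step
        path.append((x, y))
    # Then resolve the Y offset.
    step = 1 if dst[1] > y else -1
    while y != dst[1]:
        y += step
        path.append((x, y))
    return path


print(xy_route((0, 0), (2, 1)))  # [(0, 0), (1, 0), (2, 0), (2, 1)]
```

Because every packet finishes its X movement before turning, the path between two nodes is deterministic, which is what makes it possible to predict where requests heading to the same destination will meet.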


IEICE Electronics Express | 2016

HyDMA: low-latency inter-core DMA based on a hybrid packet-circuit switching network-on-chip

Zhenqi Wei; Peilin Liu; Rongdi Sun; Zunquan Zhou; Ke Jin; Dajiang Zhou

With a growing number of cores integrated in a single chip, the efficiency of inter-core direct memory access (DMA) transfers has an increasingly significant impact on the overall performance of parallel applications running on network-on-chip (NoC) processors. In this paper we propose HyDMA, a low-latency inter-core DMA approach based on a hybrid packet-circuit switching NoC. With dynamic setup and lengthening of circuit channels composed of bidirectional links, HyDMA achieves both the high flexibility of packet switching and the low communication latency of circuit switching for concurrent DMA transfers. Experimental results show that HyDMA exhibits high efficiency with marginal hardware overhead.


IEEE Transactions on Very Large Scale Integration Systems | 2016

HAVA: Heterogeneous Multicore ASIP for Multichannel Low-Bit-Rate Vocoder Applications

Zhenqi Wei; Peilin Liu; Rongdi Sun; Jun Dai; Zunquan Zhou; Xiangming Geng; Rendong Ying

Low-bit-rate vocoders are widely used in military and security fields, where multiple vocoder channels are required to run efficiently on embedded devices. We propose HAVA, a multicore Application Specific Instruction Set Processor for multichannel low-bit-rate vocoders with real-time performance. To provide both flexibility and efficiency, HAVA integrates two types of processing cores and a shared-memory core on a 2-D-mesh on-chip network. Adopting a single-Instruction Set Architecture heterogeneous multicore architecture, HAVA cuts down the real-time performance requirement of vocoders by over 40% compared with other platforms. By leveraging the on-chip network for intercore communication, HAVA can run multichannel vocoders with a marginal efficiency loss. The chip implementation of HAVA is completed in a 40-nm CMOS technology and dissipates 149 mW at a 100-MHz operating frequency for four channels of encoders.


2016 Fourth International Conference on Ubiquitous Positioning, Indoor Navigation and Location Based Services (UPINLBS) | 2016

Real-time plane segmentation in a ROS-based navigation system for the visually impaired

Ke Jin; Peilin Liu; Rongdi Sun; Zhenqi Wei; Zunquan Zhou

This paper presents a real-time plane segmentation method that can be used in navigation systems for the visually impaired to avoid indoor obstacles. The proposed method is based on surface normal estimation in range images. Efficiency and overall accuracy are the two main challenges for plane segmentation algorithms that use depth information. Our method exploits integral images to improve the efficiency of normal estimation, and a dynamic determination of the smoothing region is proposed to improve the overall accuracy. Compared with the methods in the Point Cloud Library (PCL), our method consumes less time and performs better over a wide range of depths (1–8 m). The proposed method is implemented on the Robot Operating System (ROS) at 30fps. Our method makes it possible to obtain a robust, real-time indoor navigation system integrated with commercial time-of-flight (TOF) sensors.
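The efficiency gain from integral images comes from answering any rectangular sum with four table lookups. The sketch below shows that mechanism on a tiny range image; how the paper feeds these sums into normal estimation is not described here, and the names `integral_image` and `box_sum` are hypothetical.

```python
def integral_image(img):
    """Summed-area table with an extra zero row and column at the top-left."""
    h, w = len(img), len(img[0])
    sat = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        row_sum = 0
        for c in range(w):
            row_sum += img[r][c]
            sat[r + 1][c + 1] = sat[r][c + 1] + row_sum
    return sat


def box_sum(sat, top, left, bottom, right):
    """Sum over rows top..bottom-1 and columns left..right-1 in O(1)."""
    return (sat[bottom][right] - sat[top][right]
            - sat[bottom][left] + sat[top][left])


# Assumed toy depth values, for illustration only.
depth = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
sat = integral_image(depth)
print(box_sum(sat, 1, 1, 3, 3))  # 5 + 6 + 8 + 9 = 28
```

Dividing a box sum by the window area gives a smoothed value at constant cost regardless of window size, which is why the smoothing region can be varied per pixel without a time penalty.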


2016 Fourth International Conference on Ubiquitous Positioning, Indoor Navigation and Location Based Services (UPINLBS) | 2016

A flexible processing circuit of morphological transform for obstacle detection

Rongdi Sun; Peilin Liu; Zhenqi Wei

Morphological transforms offer a simple yet effective means of obstacle detection in visual navigation. As the resolution of captured images increases, this paper presents a high-speed circuit for grayscale morphological transforms to satisfy real-time processing requirements. The design adopts a pipelined architecture with by-pass lines to maximize flexibility and scalability. It supports flat structuring elements of any shape and size through simple configuration. The overall architecture is implemented on a Xilinx Virtex-4 FPGA chip with low resource overhead and a high synthesized frequency of 290MHz. It achieves over 100fps when processing 1080P images.
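As a software reference for what such a circuit computes, the sketch below applies grayscale erosion and dilation with a flat structuring element of arbitrary shape. It is a plain illustration of the operation, not the pipelined design from the paper; `morph` is a hypothetical helper, and ignoring neighbours outside the image is an assumed border policy.

```python
def morph(img, offsets, op):
    """Apply `op` (min for erosion, max for dilation) over a flat element.

    `offsets` lists (dr, dc) positions relative to the element's origin and
    is assumed to include (0, 0); neighbours outside the image are ignored.
    """
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            vals = [img[r + dr][c + dc] for dr, dc in offsets
                    if 0 <= r + dr < h and 0 <= c + dc < w]
            out[r][c] = op(vals)
    return out


# A cross-shaped element; it is symmetric, so the reflection used in the
# formal definition of dilation makes no difference here.
cross = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]
img = [[5, 5, 5],
       [5, 1, 5],
       [5, 5, 5]]
eroded = morph(img, cross, min)
dilated = morph(img, cross, max)
print(eroded)
print(dilated)
```

Changing the shape or size of the element only changes the `offsets` list, which mirrors the configurability the abstract describes.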


2016 Fourth International Conference on Ubiquitous Positioning, Indoor Navigation and Location Based Services (UPINLBS) | 2016

A hardware accelerated Scale Invariant Feature detector for real-time visual localization and mapping

Zunquan Zhou; Rendong Ying; Rongdi Sun; Zhenqi Wei; Ke Jin; Peilin Liu

The Scale Invariant Feature Transform (SIFT) has recently drawn attention in the field of computer vision. SIFT has been adopted in many visual localization and mapping applications for its robustness to scale, rotation and illumination changes. However, its high computational cost limits its use in practical scenarios. In this paper, we present a real-time FPGA-based hardware accelerator for SIFT. The design is composed of two main parts: a key-point detection component and a feature generation component. The key-point detection component applies an octave-interleaved, scale-parallel pipeline structure as a tradeoff between frame rate and resource consumption. The feature generation component works in task-level burst mode for each key-point. A buffer, together with buffer management logic, enables quasi-parallelism between the two components and also enables task-level quasi-parallelism between main orientation generation and local descriptor generation in the feature generation component. Our proposal can perform feature extraction on 720p video in real time at 42fps with a clock frequency of 100MHz.


International Symposium on Circuits and Systems | 2018

A 974GOPS/W Multi-level Parallel Architecture for Binary Weight Network Acceleration

Rongdi Sun; Peilin Liu; Cecil Accetti; Abid A. Naqvi; Haroon Ahmed; Jiuchao Qian

Collaboration


Top Co-Authors

Peilin Liu, Shanghai Jiao Tong University

Zhenqi Wei, Shanghai Jiao Tong University

Zunquan Zhou, Shanghai Jiao Tong University

Ke Jin, Shanghai Jiao Tong University

Rendong Ying, Shanghai Jiao Tong University

Abid A. Naqvi, Shanghai Jiao Tong University

Cecil Accetti, Shanghai Jiao Tong University

Jun Wang, Shanghai Jiao Tong University

Jun Dai, Shanghai Jiao Tong University

Xiangming Geng, Shanghai Jiao Tong University