Bei Hua

University of Science and Technology of China

Publication

Featured research published by Bei Hua.

compilers, architecture, and synthesis for embedded systems | 2006

High-performance packet classification algorithm for many-core and multithreaded network processor

Duo Liu; Bei Hua; Xianghui Hu; Xinan Tang

Packet classification is crucial for the Internet to provide more value-added services and guaranteed quality of service. Besides hardware-based solutions, many software-based classification algorithms have been proposed. However, classifying at 10Gbps speed or higher is a challenging problem and it is still one of the performance bottlenecks in core routers. In general, classification algorithms face the same challenge of balancing high classification speed against low memory requirements. This paper proposes a modified Recursive Flow Classification (RFC) algorithm, Bitmap-RFC, which significantly reduces the memory requirements of RFC by applying a bitmap compression technique. To increase classification speed, we experiment with exploiting the architectural features of a many-core and multithreaded architecture from algorithm design to algorithm implementation. As a result, Bitmap-RFC strikes a good balance between speed and space. It not only keeps high classification speed but also reduces memory space significantly. This paper investigates the main NPU software design aspects that have dramatic performance impacts on any NPU-based implementation: memory space reduction, instruction selection, data allocation, task partitioning, and latency hiding. We experiment with an architecture-aware design principle to guarantee the high performance of the classification algorithm on an NPU implementation. The experimental results show that the Bitmap-RFC algorithm achieves 10Gbps speed or higher and scales well on the Intel IXP2800 NP.
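
The abstract does not spell out the bitmap layout, so the following is only a minimal sketch of a generic bitmap compression scheme for a lookup table with long runs of repeated entries. The names `compress` and `lookup` and the run-based layout are assumptions for illustration, not the Bitmap-RFC data structure itself.

```python
def compress(table):
    """Run-compress `table`.

    Bit i of the returned bitmap is set when entry i starts a new run,
    and `values` keeps one element per run. (Hypothetical layout.)
    """
    bitmap = 0
    values = []
    for i, v in enumerate(table):
        if i == 0 or v != table[i - 1]:
            bitmap |= 1 << i
            values.append(v)
    return bitmap, values


def lookup(bitmap, values, i):
    """Return the original table[i] from the compressed form."""
    # Count the set bits at positions 0..i; that count identifies the run.
    mask = (1 << (i + 1)) - 1
    run_index = bin(bitmap & mask).count("1") - 1
    return values[run_index]


table = [7, 7, 7, 3, 3, 9, 9, 9]
bitmap, values = compress(table)
```

Here eight table entries shrink to three stored values plus an 8-bit bitmap, at the cost of one population count per lookup. On hardware with a fast bit-count instruction, that trade usually favors the compressed form.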

international conference on supercomputing | 2009

Practice of parallelizing network applications on multi-core architectures

Junchang Wang; Haipeng Cheng; Bei Hua; Xinan Tang

The industry-wide shift to multi-core architectures has aroused great interest in parallelizing sequential applications. However, it is very difficult to parallelize fine-grained applications for multi-core architectures due to insufficient hardware support for fast communication and synchronization. Fortunately, network applications can be decomposed into pipelined structures that are amenable to streaming-based parallel processing. Realizing the potential of pipelining on multi-core architectures requires reevaluating the basic tradeoffs in parallel processing, including those between load balance and data locality and between general lock mechanisms and special lock-free data structures. This paper presents the practice of building a high-performance multi-core based network processing platform in which connection-affinity and lock-free design principles are applied effectively for better data locality and faster core-to-core synchronization and communication. We parallelize a complete Layer 2 to Layer 7 (L2-L7) network processing system on an Intel Core 2 Quad processor, including a TCP/IP stack based on Libnids (L2-L4) and a port-independent protocol identification engine using deep packet inspection (L7+). Furthermore, we develop a compiling method to transform sequential network applications into parallel ones so that those applications can run on multi-core architectures. Our experience suggests that (1) fine-grained pipelining can be a good software solution for parallelizing network applications on multi-core architectures if connection-affinity and lock-free design are used as the first design principles; (2) a delicate partitioning scheme is required to map pipelined structures onto a specific multi-core architecture; (3) an automatic parallelization approach can work if domain knowledge is considered in the parallelizing process. Our multi-core based network processing platform can deliver not only 6Gbps processing speed for large packet sizes but also the more challenging 2Gbps speed for smaller packets.
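
One common way to obtain connection affinity is to hash the connection's 5-tuple so that every packet of a flow, in either direction, is dispatched to the same core. The sketch below illustrates that idea only; the function name `core_for`, the use of CRC32, and the key format are assumptions, not details taken from the paper.

```python
import zlib


def core_for(src_ip, src_port, dst_ip, dst_port, proto, n_cores):
    """Pick a core for a packet so that both directions of one
    connection land on the same core (hypothetical helper)."""
    # Order the two endpoints so A->B and B->A produce the same key.
    lo, hi = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    key = f"{lo[0]}:{lo[1]}|{hi[0]}:{hi[1]}|{proto}".encode()
    return zlib.crc32(key) % n_cores


forward = core_for("10.0.0.1", 40000, "192.0.2.7", 80, "tcp", 4)
reverse = core_for("192.0.2.7", 80, "10.0.0.1", 40000, "tcp", 4)
```

Because per-connection state is then touched by a single core, it stays in that core's cache and needs no lock. The price is that load balance depends on how evenly the hash spreads flows.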

acm sigplan symposium on principles and practice of parallel programming | 2006

High-performance IPv6 forwarding algorithm for multi-core and multithreaded network processor

Xianghui Hu; Xinan Tang; Bei Hua

IP forwarding is one of the main bottlenecks in Internet backbone routers, as it requires performing the longest-prefix match at 10Gbps speed or higher. IPv6 forwarding further exacerbates the situation because its search space is quadrupled. We propose a high-performance IPv6 forwarding algorithm, TrieC, and implement it efficiently on the Intel IXP2800 network processor (NPU). Programming the multi-core and multithreaded NPU is a daunting task. We study the interaction between the parallel algorithm design and the architecture mapping to facilitate efficient algorithm implementation. We experiment with an architecture-aware design principle to guarantee the high performance of the resulting algorithm. This paper investigates the main software design issues that have dramatic performance impacts on any NPU-based implementation: memory space reduction, instruction selection, data allocation, task partitioning, latency hiding, and thread synchronization. In the paper, we provide insight into how to design an NPU-aware algorithm for high-performance networking applications. Based on the detailed performance analysis of the TrieC algorithm, we provide guidance on developing high-performance networking applications for the multi-core and multithreaded architecture.
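
For readers unfamiliar with longest-prefix matching, the sketch below shows the operation on a plain one-bit-per-level binary trie. It is not TrieC, whose structure the abstract does not describe; the class name `PrefixTrie` and the next-hop labels are illustrative assumptions.

```python
import ipaddress


class PrefixTrie:
    """Plain binary trie for longest-prefix match (illustrative only)."""

    def __init__(self):
        self.root = {}

    def insert(self, prefix, next_hop):
        net = ipaddress.ip_network(prefix)
        bits = int(net.network_address)
        width = net.max_prefixlen
        node = self.root
        # Walk the prefix bits from the most significant end.
        for i in range(net.prefixlen):
            bit = (bits >> (width - 1 - i)) & 1
            node = node.setdefault(bit, {})
        node["hop"] = next_hop

    def lookup(self, address):
        addr = ipaddress.ip_address(address)
        bits = int(addr)
        width = addr.max_prefixlen
        node = self.root
        best = node.get("hop")
        for i in range(width):
            node = node.get((bits >> (width - 1 - i)) & 1)
            if node is None:
                break
            # Remember the deepest next hop seen so far.
            best = node.get("hop", best)
        return best


trie = PrefixTrie()
trie.insert("2001:db8::/32", "A")
trie.insert("2001:db8:1::/48", "B")
```

A lookup in this naive form may touch up to 128 nodes for IPv6, one memory access each. That cost is the reason practical forwarding algorithms compress the trie or consume several bits per step.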

acm sigplan symposium on principles and practice of parallel programming | 2008

Scalable packet classification using interpreting: a cross-platform multi-core solution

Haipeng Cheng; Zheng Chen; Bei Hua; Xinan Tang

Packet classification is an enabling technology to support advanced Internet services. It is still a challenge for a software solution to achieve 10Gbps (line-rate) classification speed. This paper presents a classification algorithm that can be efficiently implemented on a multi-core architecture with or without cache. The algorithm embraces the holistic notion of exploiting application characteristics, considering the capabilities of the CPU and the memory hierarchy, and performing appropriate data partitioning. The classification algorithm adopts two stages: searching on a reduction tree and searching on a list of ranges. This decision is based on a classification heuristic: the size of the range list is limited after the first-stage search. Optimizations are then designed to speed up the two-stage execution. To exploit the speed gap (1) between the CPU and external memory and (2) between internal memory (cache) and external memory, an interpreter is used to trade idle CPU cycles for demanding memory access requirements. Applying a CISC style of instruction encoding to compress the range expressions not only significantly reduces the total memory requirement but also makes effective use of the internal memory (cache) bandwidth. We show that compressing data structures is an effective optimization across multi-core architectures. We implement this algorithm on both the Intel IXP2800 network processor and the Core 2 Duo x86 architecture, and experiment with the classification benchmark ClassBench. By incorporating architecture-awareness in algorithm design and taking into account the memory hierarchy, data partitioning, and latency hiding in algorithm implementation, the resulting algorithm shows good scalability on the Intel IXP2800. By effectively using the cache system, the algorithm also runs faster than the previous fastest RFC on the Core 2 Duo architecture.

Frontiers of Computer Science in China | 2008

A robust localization algorithm in wireless sensor networks

Xin Li; Bei Hua; Yi Shang; Yan Xiong

Most of the state-of-the-art localization algorithms in wireless sensor networks (WSNs) are vulnerable to various kinds of location attacks, whereas secure localization schemes proposed so far are too complex to apply to power-constrained WSNs. This paper provides a distributed robust localization algorithm called Bilateration that employs a unified way to deal with all kinds of location attacks as well as other kinds of information distortion caused by node malfunction or abnormal environmental noise. Bilateration directly calculates two candidate positions for every two heard anchors, and then uses the average of a maximum set of close-by candidate positions as the location estimate. The basic idea behind Bilateration is that candidate positions calculated from reasonable (i.e., error-bounded) anchor positions and distance measurements tend to be close to each other, whereas candidate positions calculated from false anchor positions or distance measurements are highly unlikely to be close to each other if the false information is not coordinated. By using bilateration instead of classical multilateration to compute the location estimate, Bilateration requires much lower computational complexity, yet still retains the same localization accuracy. This paper also evaluates and compares Bilateration with three multilateration-based localization algorithms, and the simulation results show that Bilateration achieves the best comprehensive performance and is more suitable for real wireless sensor networks.
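
The two steps named in the abstract, computing two candidate positions per anchor pair and averaging the largest group of close-by candidates, can be sketched as follows. This is a simplified illustration: the grouping rule (all candidates within `radius` of some seed candidate), the parameter `radius`, and the function names are assumptions, not the paper's procedure.

```python
import math
from itertools import combinations


def circle_intersections(p0, r0, p1, r1):
    """Return the 0 or 2 intersection points of two circles."""
    (x0, y0), (x1, y1) = p0, p1
    d = math.hypot(x1 - x0, y1 - y0)
    if d == 0 or d > r0 + r1 or d < abs(r0 - r1):
        return []
    a = (r0 * r0 - r1 * r1 + d * d) / (2 * d)
    h = math.sqrt(max(r0 * r0 - a * a, 0.0))
    mx, my = x0 + a * (x1 - x0) / d, y0 + a * (y1 - y0) / d
    ox, oy = h * (y1 - y0) / d, h * (x1 - x0) / d
    return [(mx + ox, my - oy), (mx - ox, my + oy)]


def estimate(anchors, radius):
    """anchors: list of ((x, y), measured_distance) pairs."""
    candidates = []
    for (p0, r0), (p1, r1) in combinations(anchors, 2):
        candidates += circle_intersections(p0, r0, p1, r1)
    if not candidates:
        return None
    # Assumed grouping rule: largest set within `radius` of one seed.
    groups = ([c for c in candidates if math.dist(c, s) <= radius]
              for s in candidates)
    best = max(groups, key=len)
    return (sum(x for x, _ in best) / len(best),
            sum(y for _, y in best) / len(best))


# Node at (3, 4); the last anchor reports a false distance.
anchors = [((0, 0), 5.0), ((10, 0), math.sqrt(65)),
           ((0, 10), math.sqrt(45)), ((10, 10), 8.0)]
position = estimate(anchors, radius=0.5)
```

In this example the three honest anchors each contribute a candidate at (3, 4), while the candidates produced with the false anchor land elsewhere and fall outside the winning group, so `position` comes out at approximately (3, 4).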

ACM Transactions on Embedded Computing Systems | 2008

High-performance packet classification algorithm for multithreaded IXP network processor

Duo Liu; Zheng Chen; Bei Hua; Nenghai Yu; Xinan Tang

Packet classification is crucial for the Internet to provide more value-added services and guaranteed quality of service. Besides hardware-based solutions, many software-based classification algorithms have been proposed. However, classifying at 10 Gbps speed or higher is a challenging problem and it is still one of the performance bottlenecks in core routers. In general, classification algorithms face the same challenge of balancing high classification speed against low memory requirements. This paper proposes a modified recursive flow classification (RFC) algorithm, Bitmap-RFC, which significantly reduces the memory requirements of RFC by applying a bitmap compression technique. To increase classification speed, we exploit the multithreaded architectural features in various algorithm development stages from algorithm design to algorithm implementation. As a result, Bitmap-RFC strikes a good balance between speed and space. It maintains high classification speed while significantly reducing memory space consumption. This paper investigates the main NPU software design aspects that have dramatic performance impacts on any NPU-based implementation: memory space reduction, instruction selection, data allocation, task partitioning, and latency hiding. We experiment with an architecture-aware design principle to guarantee the high performance of the classification algorithm on an NPU implementation. The experimental results show that the Bitmap-RFC algorithm achieves 10 Gbps speed or higher and scales well on the Intel IXP2800 NPU.

international conference on future generation communication and networking | 2007

Energy-Based Target Numeration in Wireless Sensor Networks

Yan Guo; Bei Hua; Lihua Yue

Target numeration is of great importance for activity monitoring applications in wireless sensor networks (WSNs); however, it is also a challenging problem in a WSN equipped only with simple amplitude sensors. Only a few algorithms have been proposed to solve the problem of target counting, and their accuracy and computational complexity are not satisfactory. This paper provides a two-step energy-based target numeration (EBTN) algorithm that first groups the sensor nodes that detect a target into separate clusters, and then calculates the number of targets covered by each cluster based on the total signal energy collected over the cluster. A polynomial regression function is used to approximate the signal strength over a cluster, and the total energy is estimated by taking the integral of the function over the area. Combined with the preliminary clustering step, energy-based target counting greatly improves the counting accuracy. Experiments also show that EBTN requires lower node density and computational complexity compared with other algorithms.

international conference on communications | 2007

Link State Based Annulus Localization Algorithm for Wireless Sensor Networks

Xin Li; Bei Hua; Yan Guo

Most of the state-of-the-art localization systems assume an idealistic radio propagation model that is far from reality, which leads to lower localization accuracy in real wireless sensor networks. In this paper we describe a coarse-grained link state based annulus (LSBA) localization algorithm that takes into account the anisotropic feature of real radio propagation to improve the localization accuracy and also adapts to deployment irregularity. We compare LSBA with four typical coarse-grained localization algorithms (Centroid, APIT, DV-HOP, and Amorphous) in simulated realistic settings, and experimental results show that LSBA achieves the best tradeoff between localization accuracy and convergence speed in networks with a moderate number of anchors. Based on our observation, we also suggest an improvement to DV-HOP and Amorphous that redefines the concept of neighboring nodes to reflect the radio irregularity, and simulation results show that both improved algorithms achieve higher localization accuracy.

international conference on communications | 2014

Realizing video streaming multicast over SDN networks

Siyuan Tang; Bei Hua; Dongyang Wang

Video streaming multicast is an urgently needed application, but it is hard to implement in traditional networks because it requires support for multicast routing and QoS guarantees. Software-Defined Networking (SDN) is an innovative network architecture that abstracts network control and the global network view into a logically centralized controller, and therefore simplifies the introduction of new network services and protocols. In this paper, we realize a video streaming multicast application on SDN networks, which can efficiently deliver hierarchically coded video streams to multiple recipients with as good a quality of service as the network and terminals allow. We describe the framework and implementation details of the application, and verify it on an emulation platform.

international conference on parallel and distributed systems | 2011

Building High-Performance Application Protocol Parsers on Multi-core Architectures

Kai Zhang; Junchang Wang; Bei Hua; Xinan Tang

Parsing packet payloads according to the syntax and semantics of an application protocol is a key step in analyzing network traffic. However, it is still a challenge to fulfill this task at high speed (10Gbps+) because parsing packets through deep-content analysis to build a corresponding syntax tree requires tremendous computing resources. Multi-core architectures provide a viable solution for building high-performance parsers for application protocols. Existing sequential application protocol parsers are hard to reuse, and building a new protocol parser from scratch is error-prone and time-consuming. This paper proposes a general and efficient approach to building high-performance parallel application protocol parsers on multi-core platforms. First, the open-source lexical analyzer FLEX is used to describe a protocol and generate a sequential parser. Then a source-to-source translation is performed to transform the sequential parser into a parallel one. Finally, an efficient parallel run-time system is built by employing lock-free design principles from top to bottom to support multi-threaded execution on multi-core processors. Experimental results show that our parsers achieve nearly 20Gbps for average HTTP packets and 5Gbps for the more challenging smaller FIX packets.
