Publications


Featured research published by Yun Rock Qu.


Architectures for Networking and Communications Systems (ANCS) | 2013

High-performance architecture for dynamically updatable packet classification on FPGA

Yun Rock Qu; Shijie Zhou; Viktor K. Prasanna

Algorithms and FPGA-based implementations for packet classification have been studied over the past decade. Algorithmic solutions have focused on high throughput; however, supporting dynamic updates has been challenging. In this paper, we present a 2-dimensional pipelined architecture for packet classification on FPGA, which achieves high throughput while supporting dynamic updates. Fine-grained processing elements are arranged in a 2-dimensional array; each processing element accesses its designated memory locally, resulting in a scalable architecture. The entire array is both horizontally and vertically pipelined. As a result, it supports a high clock rate that does not deteriorate as the length of the packet header or the size of the rule set increases. The performance of the architecture does not depend on rule set features such as the number of unique values in each field. The architecture also efficiently supports range searches in individual fields. The total memory is proportional to the rule set size. Dynamic updates (modify, delete and insert operations on the rule set during run-time) are also supported on the self-reconfigurable processing elements with very little impact on the sustained throughput. Experimental results show that, for a 1K 15-tuple rule set, a state-of-the-art FPGA can sustain 190 Gbps throughput with 1 million updates/second. To the best of our knowledge, no prior packet classification approach simultaneously supports both high throughput and dynamic updates of the rule set. Our architecture demonstrates 4× energy efficiency while achieving 2× throughput compared to TCAM.


IEEE Transactions on Parallel and Distributed Systems | 2016

High-Performance and Dynamically Updatable Packet Classification Engine on FPGA

Yun Rock Qu; Viktor K. Prasanna

High-performance and dynamically updatable hardware architectures for multi-field packet classification have regained much interest in the research community. For example, software defined networking requires 15 fields of the packets to be checked against a predefined rule set. Many algorithmic solutions for packet classification have been studied over the past decade. FPGA-based packet classification engines can achieve very high throughput; however, supporting dynamic updates remains challenging. In this paper, we present a two-dimensional pipelined architecture for packet classification on FPGA; this architecture achieves high throughput while supporting dynamic updates. In this architecture, modular Processing Elements (PEs) are arranged in a two-dimensional array. Each PE accesses its designated memory locally, and supports prefix match and exact match efficiently. The entire array is both horizontally and vertically pipelined. We exploit striding, clustering, dual-port memory, and power gating techniques to further improve the performance of our architecture. The total memory is proportional to the rule set size. Our architecture sustains a high clock rate even if we scale up (1) the length of each packet header and/or (2) the number of rules in the rule set. The performance of the entire architecture does not depend on rule set features such as the number of unique values in each field. The PEs are also self-reconfigurable; they support dynamic updates of the rule set during run-time with very little throughput degradation. Experimental results show that, for a 1K 15-tuple rule set, a state-of-the-art FPGA can sustain a throughput of 650 Million Packets Per Second (MPPS) with 1 million updates/second. Compared to TCAM, our architecture demonstrates at least four-fold energy efficiency while achieving two-fold throughput.


Architectures for Networking and Communications Systems (ANCS) | 2015

Optimizing Many-field Packet Classification on FPGA, Multi-core General Purpose Processor, and GPU

Yun Rock Qu; Hao H. Zhang; Shijie Zhou; Viktor K. Prasanna

Due to the rapid growth of the Internet, there is an increasing need for efficiently classifying packets with many header fields in large rule sets. For example, in Software Defined Networking (SDN), the OpenFlow table lookup can require 15 packet header fields to be examined. In this paper, we present several decomposition-based packet classification implementations with efficient optimization techniques. In the searching phase, packet headers are split or combined. In the merging phase, the partial searching results from all the fields are merged to generate the final result. We prototype our implementations on state-of-the-art Field Programmable Gate Array (FPGA), multi-core General Purpose Processor (GPP), and Graphics Processing Unit (GPU). On FPGA, we propose two optimization techniques to divide generic ranges; modular processing elements are constructed and concatenated into a systolic array. On multi-core GPP, we parallelize both the searching and merging phases using parallel program threads. On the GPU-accelerated platform, we minimize branch divergence and reduce the data communication overhead. Experimental results show that 500 Million Packets Per Second (MPPS) throughput and 3 μs latency can be achieved for 1.5K rule sets on FPGA. We achieve 14.7 MPPS throughput and 30.5 MPPS throughput for 32K rule sets on multi-core GPP and GPU-accelerated platforms, respectively. As a heterogeneous solution, our GPU-accelerated packet classifier shows 2× speedup compared to the implementation using multi-core GPP only. Compared with prior works, our designs can match long packet headers against very complex rule sets.
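The decomposition idea in this abstract (a per-field searching phase followed by a merging phase) can be pictured with a minimal Python sketch. The `Rule` layout, the three-field toy header, and the helper names below are our illustrative assumptions, not the paper's implementation; a linear scan stands in for the optimized per-field search structures.

```python
from dataclasses import dataclass

# Hypothetical rule format: one inclusive (lo, hi) range per header
# field, plus a priority (lower number = higher priority).
@dataclass
class Rule:
    ranges: list          # list of (lo, hi) tuples, one per field
    priority: int

def search_field(rules, field_idx, value):
    """Searching phase: rule ids whose range in this field covers value."""
    return {i for i, r in enumerate(rules)
            if r.ranges[field_idx][0] <= value <= r.ranges[field_idx][1]}

def classify(rules, header):
    """Merging phase: intersect per-field partial results and return
    the highest-priority matching rule id (or None)."""
    partial = [search_field(rules, f, v) for f, v in enumerate(header)]
    matches = set.intersection(*partial)
    return min(matches, key=lambda i: rules[i].priority) if matches else None

rules = [
    Rule([(0, 15), (100, 200), (0, 63)], priority=0),
    Rule([(0, 255), (0, 255), (0, 255)], priority=1),  # default rule
]
print(classify(rules, (10, 150, 40)))  # matches both rules; rule 0 wins
print(classify(rules, (99, 150, 40)))  # only the default rule matches
```

The two phases are independent per field, which is what makes the approach map naturally onto parallel hardware, threads, or GPU warps as the paper describes.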


Parallel Computing Technologies (PaCT) | 2013

Multi-core Implementation of Decomposition-Based Packet Classification Algorithms

Shijie Zhou; Yun Rock Qu; Viktor K. Prasanna

Multi-field packet classification is a network kernel function where packets are classified based on a set of predefined rules. Many algorithms and hardware architectures have been proposed to accelerate packet classification. Among them, decomposition-based classification approaches are of major interest to the research community because of the parallel search in each packet header field. This paper presents four decomposition-based approaches on multi-core processors. We search in parallel for all the fields using linear search or range-tree search; we store the partial results in a linked list or a bit vector. The partial results are merged to produce the final packet header match. We evaluate the performance with respect to latency and throughput, varying the rule set size from 1K to 64K and the number of threads per core from 1 to 12. Experimental results show that our approaches can achieve 128 ns processing latency per packet and 11.5 Gbps overall throughput on state-of-the-art 16-core platforms.
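The two partial-result representations this abstract contrasts (linked list vs. bit vector) differ mainly in how the merge step works. A toy Python comparison, with helper names that are ours rather than the paper's (Python lists and arbitrary-precision ints stand in for linked lists and fixed-width hardware bit vectors):

```python
def merge_sorted_lists(a, b):
    """Linked-list variant: intersect two sorted rule-id lists by
    walking both in lockstep."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def merge_bit_vectors(a, b):
    """Bit-vector variant: bit i is set iff rule i matched the field,
    so the whole merge collapses to a single AND."""
    return a & b

# Rules {0, 2, 5} match field 1; rules {2, 5, 7} match field 2.
print(merge_sorted_lists([0, 2, 5], [2, 5, 7]))        # [2, 5]
print(bin(merge_bit_vectors(0b00100101, 0b10100100)))  # bits 2 and 5 set
```

The list merge is work-proportional to the number of matches, while the bit-vector AND is constant-width per word, which is why the representation choice affects throughput as the rule set scales.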


High Performance Switching and Routing (HPSR) | 2014

High-throughput traffic classification on multi-core processors

Da Tong; Yun Rock Qu; Viktor K. Prasanna

Traffic classification is a critical task in network management. Decision-trees are commonly used in Machine Learning (ML)-based traffic classification algorithms. Most of the existing implementations are hardware-based, while a new trend for network applications is to use software-based solutions. Since the decision-tree used for traffic classification is highly unbalanced, it is challenging to achieve high throughput for decision-tree-based traffic classification on multi-core platforms. In this paper, we present a high-throughput traffic classifier employing a scalable data structure on multi-core platforms. We convert decision-trees used in ML-based algorithms into a compact rule set table. Based on this data structure, we develop a divide-and-conquer algorithm by (1) searching all the columns of this table in parallel, and (2) merging the outcomes from all the columns into the final classification result. High throughput is sustained using our approach even if the size of the rule set table is scaled up with respect to (1) the number of decision-tree leaves and (2) the number of features examined during the classification process. We prototype our design on state-of-the-art multi-core platforms. For a typical decision-tree-based traffic classifier consisting of 128 leaf nodes and 6 flow-level features, our implementation achieves a throughput of 98 Million Lookups Per Second (MLPS). Our traffic classifier sustains high throughput even for highly unbalanced decision-trees. We achieve 1.5× throughput compared with the C4.5 decision-tree-based implementations, and 13× throughput compared with the SVM based traffic classifiers on multi-core platforms.


Application-Specific Systems, Architectures, and Processors (ASAP) | 2015

Large-scale packet classification on FPGA

Shijie Zhou; Yun Rock Qu; Viktor K. Prasanna

Packet classification is a key network function enabling a variety of network applications, such as network security, Quality of Service (QoS) routing, and other value-added services. Routers perform packet classification based on a predefined rule set. Packet classification faces two challenges: (1) the data rate of the network traffic keeps increasing, and (2) the size of the rule sets is becoming very large. In this paper, we propose an FPGA-based packet classification engine for large rule sets. We present a decomposition-based approach, where each field of the packet header is searched separately. Then we merge the partial search results from all the fields using a merging network. Experimental results show that our design can achieve a throughput of 147 Million Packets Per Second (MPPS), while supporting up to 256K rules on a state-of-the-art FPGA. Compared to the prior works on FPGA or multi-core processors, our design demonstrates significant performance improvements.


Symposium on Computer Architecture and High Performance Computing (SBAC-PAD) | 2014

Compact Hash Tables for High-Performance Traffic Classification on Multi-core Processors

Yun Rock Qu; Viktor K. Prasanna

Traffic classification is one of the kernel applications in network management. Many Machine Learning (ML) traffic classification algorithms are based on decision-trees. While most of the existing implementations of decision-trees are hardware-based, a new trend in network applications is to use software-based solutions. The decision-tree used for traffic classification is highly unbalanced; it is challenging to achieve high performance on multi-core platforms. In this paper, we present a high-throughput and low-latency traffic classification engine on multi-core platforms. We convert the decision-tree used in the C4.5 algorithm into multiple compact tables. All the compact tables are searched in parallel; efficient hashing techniques are employed to reduce the processing latency. The outcomes from all the tables are merged into the final classification result. High throughput can be sustained even if we scale up (1) the number of concurrent traffic classifiers, (2) the number of decision-tree leaves and (3) the number of features examined during the classification process. We prototype our design on state-of-the-art AMD and Intel multi-core platforms. For a typical C4.5 decision-tree consisting of 92 leaf nodes and 7 flow-level features, we achieve 134.15 Million Lookups Per Second (MLPS) throughput and 238.53 ns processing latency per lookup. Even for highly unbalanced decision-trees or large decision-trees consisting of up to 2K leaf nodes, our traffic classification engine sustains high throughput and low latency without sacrificing classification accuracy (≥ 98.15%). We achieve 2.7× throughput compared with the classic C4.5 decision-tree-based implementations, and at least 12× speed-up compared with the existing traffic classifiers on multi-core platforms.
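One way to picture the tree-to-table conversion that this and the related traffic-classification papers build on: each root-to-leaf path of the decision-tree accumulates one (lo, hi) range per feature, so every leaf becomes one row of a flat table, and lookup becomes independent per-feature range checks plus a row-wise merge. The sketch below is a minimal illustration under that reading; the dictionary-based tree encoding and the tiny example tree are our assumptions, not the papers' data structures.

```python
# Flatten a binary decision tree (feature <= threshold goes left) into
# a table of per-feature ranges, one row per leaf.
INF = float("inf")

def flatten(node, bounds, table):
    """Walk the tree; at each leaf, record the accumulated bounds."""
    if "label" in node:                      # leaf node
        table.append((list(bounds), node["label"]))
        return
    f, t = node["feature"], node["threshold"]
    lo, hi = bounds[f]
    bounds[f] = (lo, min(hi, t))             # left branch: feature <= t
    flatten(node["left"], bounds, table)
    bounds[f] = (max(lo, t), hi)             # right branch: feature > t
    flatten(node["right"], bounds, table)
    bounds[f] = (lo, hi)                     # restore for the caller

def lookup(table, sample):
    """A row matches iff every feature falls inside its range."""
    for bounds, label in table:
        if all(lo < x <= hi for (lo, hi), x in zip(bounds, sample)):
            return label
    return None

# Illustrative 2-feature tree with 3 leaves (labels are made up).
tree = {"feature": 0, "threshold": 10,
        "left":  {"label": "web"},
        "right": {"feature": 1, "threshold": 5,
                  "left":  {"label": "dns"},
                  "right": {"label": "p2p"}}}
table = []
flatten(tree, [(-INF, INF), (-INF, INF)], table)
print(len(table))             # 3 leaves -> 3 rows
print(lookup(table, (20, 3))) # feature0 > 10, feature1 <= 5 -> "dns"
```

Because each row's per-feature checks are independent, the rows and columns of this table can be searched in parallel regardless of how unbalanced the original tree was, which is the property the papers exploit.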


Field-Programmable Logic and Applications (FPL) | 2013

Fast dynamically updatable packet classifier on FPGA

Yun Rock Qu; Viktor K. Prasanna

Packet classification requires multiple fields of the packet header to be matched against entries in a prioritized table; it is still challenging to support dynamic updates for packet classification without sacrificing throughput performance. In this paper, we present a high-throughput pipelined architecture for packet classification on FPGA supporting dynamic updates of the rule set. This architecture is based on Dynamic Bit Vector (Dynamic-BV) approach and supports modify, delete and insert operations during run-time with very little impact on sustained throughput. Experimental results show that, for a 1K rule set on a state-of-the-art FPGA, a throughput of 120 Gbps with 1 million updates/second can be sustained using a single pipeline.
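The appeal of a bit-vector (BV) scheme for dynamic updates is that inserting or deleting a rule only sets or clears that one rule's bit, leaving all other rules' state untouched. A minimal software sketch of that idea, assuming tiny fully enumerable field domains; the class name, table layout, and field width are ours for illustration (the paper's Dynamic-BV design distributes such state across pipelined hardware, not Python tables):

```python
FIELD_SIZE = 16  # each toy field takes values 0..15

class DynamicBV:
    """Per-field tables mapping a field value to a bit vector of
    matching rules; bit i set <=> rule i covers that value."""

    def __init__(self, num_fields):
        self.tables = [[0] * FIELD_SIZE for _ in range(num_fields)]
        self.next_id = 0  # next rule id / bit position

    def insert(self, ranges):
        """Insert a rule: set its bit for every value it covers."""
        rid = self.next_id
        self.next_id += 1
        for table, (lo, hi) in zip(self.tables, ranges):
            for v in range(lo, hi + 1):
                table[v] |= 1 << rid
        return rid

    def delete(self, rid):
        """Delete a rule: clear its bit wherever it was set."""
        mask = ~(1 << rid)
        for table in self.tables:
            for v in range(FIELD_SIZE):
                table[v] &= mask

    def classify(self, header):
        """AND the per-field vectors; lowest set bit wins (priority)."""
        bv = -1
        for table, v in zip(self.tables, header):
            bv &= table[v]
        return (bv & -bv).bit_length() - 1 if bv > 0 else None

clf = DynamicBV(num_fields=2)
r0 = clf.insert([(0, 7), (4, 11)])
r1 = clf.insert([(0, 15), (0, 15)])  # default rule
print(clf.classify((3, 5)))  # both rules match; rule 0 wins
clf.delete(r0)
print(clf.classify((3, 5)))  # now only the default rule matches
```

Lookups keep running while an update flips a single bit per affected entry, which is the software analogue of the paper's claim that run-time modify/delete/insert operations barely disturb sustained throughput.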


IEEE Transactions on Parallel and Distributed Systems | 2017

Accelerating Decision Tree Based Traffic Classification on FPGA and Multicore Platforms

Da Tong; Yun Rock Qu; Viktor K. Prasanna

Machine learning (ML) algorithms have been shown to be effective in classifying a broad range of applications in the Internet traffic. In this paper, we propose algorithms and architectures to realize online traffic classification using flow-level features. First, we develop a traffic classifier based on the C4.5 decision tree algorithm and the Entropy-MDL (Minimum Description Length) discretization algorithm. It achieves an overall accuracy of 97.92 percent for classifying eight major applications. Next we propose approaches to accelerate the classifier on FPGA (Field Programmable Gate Array) and multicore platforms. We optimize the original classifier by merging it with discretization. Our implementation of this optimized decision tree achieves 7500+ Million Classifications Per Second (MCPS) on a state-of-the-art FPGA platform and 75-150 MCPS on two state-of-the-art multicore platforms. We also propose a divide-and-conquer approach to handle imbalanced decision trees. Our implementation of the divide-and-conquer approach achieves 10,000+ MCPS on a state-of-the-art FPGA platform and 130-340 MCPS on two state-of-the-art multicore platforms. We conduct extensive experiments on both platforms for various application scenarios to compare the two approaches.


IEEE High Performance Extreme Computing Conference (HPEC) | 2014

Scalable and dynamically updatable lookup engine for decision-trees on FPGA

Yun Rock Qu; Viktor K. Prasanna

Architectures for tree structures on FPGAs as well as ASICs have been proposed over the years. The exponential growth in the memory size with respect to the number of tree levels restricts the scalability of these architectures. In this paper, we propose a scalable lookup engine on FPGA for large decision-trees; this engine sustains high throughput even if the tree is scaled up with respect to (1) the number of fields and (2) the number of leaf nodes. The proposed engine is a 2-dimensional pipelined architecture; this architecture also supports dynamic updates of the decision-tree. Each leaf node of the tree is mapped onto a horizontal pipeline; each field of the tree corresponds to a vertical pipeline. We use dual-port distributed RAM (distRAM) in each individual Processing Element (PE); the resulting architecture for a generic decision-tree accepts two search requests per clock cycle. Post place-and-route results show that, for a typical decision-tree consisting of 512 leaf nodes, with each node storing 320-bit data, our lookup engine can perform 536 Million Lookups Per Second (MLPS). Compared to the state-of-the-art implementation of a binary decision-tree on FPGA, we achieve 2× speed-up; the throughput is sustained even if frequent dynamic updates are performed.

Collaboration


Top Co-Authors

Viktor K. Prasanna (University of Southern California)
Shijie Zhou (University of Southern California)
Da Tong (University of Southern California)
Hao H. Zhang (University of Southern California)
Vaibhav Gandhi (Indian Institute of Technology Gandhinagar)