Yi-Hua E. Yang
University of Southern California
Publications
Featured research published by Yi-Hua E. Yang.
Architectures for Networking and Communications Systems | 2008
Yi-Hua E. Yang; Weirong Jiang; Viktor K. Prasanna
In this paper we present a novel architecture for high-speed and high-capacity regular expression matching (REM) on FPGA. The proposed REM architecture, based on a nondeterministic finite automaton (RE-NFA), efficiently constructs regular expression matching engines (REMEs) of arbitrary regular patterns and character classes in a uniform structure, utilizing both logic slices and block memory (BRAM) available on modern FPGA devices. The resulting circuits take advantage of synthesis and routing optimizations to achieve high operating speed and area efficiency. The uniform structure of our RE-NFA design can be stacked in a simple way to produce multi-character input circuits that scale up throughput further. An n-state m-character input REME takes only O(n × log2 m) time to construct and occupies no more than O(n × m) logic units. The REMEs can be staged and pipelined in large numbers to achieve high parallelism without sacrificing clock frequency. Using the proposed RE-NFA architecture, we are able to implement 3 copies of two-character input REMEs, each with 760 regular expressions, 18715 states and 371 character classes, on a single Xilinx Virtex-4 LX-100-12 device. Each copy processes 2 characters per clock cycle at 300 MHz, resulting in a concurrent throughput of 14.4 Gbps for 760 REMEs. Compared with the automatic NFA-to-VHDL REME compilation [13], our approach achieves over 9x the throughput efficiency (Gbps*state/LUT). Compared with state-of-the-art REMEs on FPGA, our approach also achieves up to 70% better throughput efficiency.
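As a rough software analogue of the one-hot RE-NFA update described above (each NFA state maps to one flip-flop, and all states update in parallel on every input character), the following minimal C++ sketch simulates a 4-state NFA for the pattern a[bc]d. The pattern and state layout are illustrative, not taken from the paper.

```cpp
#include <bitset>
#include <cstdio>
#include <string>

int main() {
    const int N = 4;              // states 0..3; state 3 accepts
    std::bitset<N> active;        // one-hot state vector (one "flip-flop" per state)
    active[0] = true;             // start state stays live for unanchored matching
    std::string input = "xxabdy";
    for (char ch : input) {
        std::bitset<N> next;
        next[0] = true;                                   // start state is always live
        next[1] = active[0] && ch == 'a';                 // edge 0 -> 1 on 'a'
        next[2] = active[1] && (ch == 'b' || ch == 'c');  // edge 1 -> 2 on [bc]
        next[3] = active[2] && ch == 'd';                 // edge 2 -> 3 on 'd'
        active = next;                                    // all states update together
        if (active[3]) std::printf("match ends at '%c'\n", ch);
    }
    return 0;
}
```

In hardware, the per-state AND/OR logic becomes LUTs and the state vector becomes flip-flops, which is what lets every state update in a single clock cycle regardless of pattern count.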
International Conference on Computer Communications | 2011
Yi-Hua E. Yang; Viktor K. Prasanna
Regular expression matching (REM) with nondeterministic finite automata (NFA) can be computationally expensive when a large number of patterns are matched concurrently. On the other hand, converting the NFA to a deterministic finite automaton (DFA) can cause state explosion, where the number of states and transitions in the DFA is exponentially larger than in the NFA. In this paper, we seek to answer the following question: to match an arbitrary set of regular expressions, is there a finite automaton that lies between the NFA and DFA in terms of computation and memory complexities? We introduce the semi-deterministic finite automaton (SFA) and the state convolvement test to construct an SFA from a given NFA. An SFA consists of a fixed number (p) of constituent DFAs (c-DFAs) running in parallel; each c-DFA is responsible for a subset of states in the original NFA. To match a set of regular expressions with n overlapping symbols (symbols that can match the same input character concurrently), the NFA can require O(n) computation per input character, whereas the DFA can have a state transition table with O(2^n) states. By exploiting the state convolvements during the SFA construction, an equivalent SFA reduces the computation complexity to O(p^2/c^2) per input character while limiting the space requirement to O(|Σ| × p^2 × (n/p)^c) states, where Σ is the alphabet and c ≥ 1 is a small design constant. Although the problem of constructing the optimal (minimum-sized) SFA is shown to be NP-complete, we develop a greedy heuristic to quickly construct a near-optimal SFA in time and space quadratic in the number of states in the original NFA. We demonstrate our SFA construction using real-world regular expressions taken from the Snort IDS.
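The greedy heuristic itself is not spelled out in the abstract; the sketch below illustrates one plausible reading of the core idea: spread "convolved" NFA states (states that can be active at the same time) across p groups so that each group's constituent DFA stays small. The conflict graph and group count here are invented for illustration.

```cpp
#include <cstdio>
#include <vector>

int main() {
    int n = 6, p = 2;  // 6 NFA states, 2 constituent DFAs (illustrative sizes)
    // conflict[u][v] = true if states u and v can be active concurrently
    std::vector<std::vector<bool>> conflict(n, std::vector<bool>(n, false));
    auto add = [&](int u, int v) { conflict[u][v] = conflict[v][u] = true; };
    add(0, 1); add(0, 2); add(1, 2); add(3, 4);   // hypothetical convolvements

    std::vector<int> group(n, -1);
    for (int s = 0; s < n; ++s) {
        // Greedily place s in the group where it conflicts with the fewest
        // already-placed states, keeping each c-DFA's state set "thin".
        int best = 0, bestCost = n + 1;
        for (int g = 0; g < p; ++g) {
            int cost = 0;
            for (int t = 0; t < s; ++t)
                if (group[t] == g && conflict[s][t]) ++cost;
            if (cost < bestCost) { bestCost = cost; best = g; }
        }
        group[s] = best;
    }
    for (int s = 0; s < n; ++s)
        std::printf("NFA state %d -> c-DFA %d\n", s, group[s]);
    return 0;
}
```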
IEEE Transactions on Parallel and Distributed Systems | 2013
Yi-Hua E. Yang; Viktor K. Prasanna
Conventionally, dictionary-based string pattern matching (SPM) has been implemented as an Aho-Corasick deterministic finite automaton (AC-DFA). Due to its large memory footprint, a large-dictionary AC-DFA can experience poor cache performance when matching against inputs with high match ratios on multicore processors. We propose a head-body finite automaton (HBFA), which implements SPM in two parts: a head DFA (H-DFA) and a body NFA (B-NFA). The H-DFA matches the dictionary up to a predefined prefix length in the same way as an AC-DFA, but with a much smaller memory footprint. The B-NFA extends the matching to full dictionary lengths in a compact variable-stride branch data structure, accelerated by single-instruction multiple-data (SIMD) operations. A branch grafting mechanism is proposed to opportunistically advance the state of the H-DFA with the matching progress in the B-NFA. Compared with a fully populated AC-DFA, our HBFA prototype takes less than 1/5 the construction time, requires less than 1/20 the runtime memory, and achieves 3x to 8x the throughput when matching real-life large dictionaries against inputs with high match ratios. The throughput scales up 27x, to over 34 Gbps, on a 32-core Intel Manycore Testing Lab machine based on Intel Xeon X7560 processors.
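A hedged sketch of the head-body split in plain C++: a hash map stands in for the H-DFA over fixed-length prefixes, and plain byte comparison stands in for the SIMD-accelerated B-NFA body verification. The dictionary, the prefix length HEAD, and the input text are all illustrative.

```cpp
#include <cstdio>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
    const size_t HEAD = 4;  // prefix length handled by the "head" stage
    std::vector<std::string> dict = {"attack-vector", "attachment", "overflow"};

    // "Head": map each length-HEAD prefix to the patterns sharing it.
    std::unordered_map<std::string, std::vector<int>> head;
    for (int i = 0; i < (int)dict.size(); ++i)
        head[dict[i].substr(0, HEAD)].push_back(i);

    std::string text = "an attachment with an attack-vector inside";
    for (size_t pos = 0; pos + HEAD <= text.size(); ++pos) {
        auto it = head.find(text.substr(pos, HEAD));
        if (it == head.end()) continue;      // no head match at this offset
        // "Body": extend each candidate head match to full pattern length.
        for (int id : it->second) {
            const std::string& pat = dict[id];
            if (text.compare(pos, pat.size(), pat) == 0)
                std::printf("matched \"%s\" at offset %zu\n", pat.c_str(), pos);
        }
    }
    return 0;
}
```

Note how "attack-vector" and "attachment" share the head "atta": the head stage stays small while the body stage disambiguates, which is the memory-footprint argument in miniature.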
International Parallel and Distributed Processing Symposium | 2010
Yi-Hua E. Yang; Viktor K. Prasanna; Chenqian Jiang
Dictionary-based string matching (DBSM) is a critical component of Deep Packet Inspection (DPI), where thousands of malicious patterns are matched against high-bandwidth network traffic. Deterministic finite automata constructed with the Aho-Corasick algorithm (AC-DFA) have been widely used for solving this problem. However, the state transition table (STT) of a large-scale DBSM AC-DFA can span hundreds of megabytes of system memory, whose limited bandwidth and long latency can become the performance bottleneck. We propose a novel partitioning algorithm which converts an AC-DFA into a "head" part and a "body" part. The head part behaves as a traditional AC-DFA that matches the pattern prefixes up to a predefined length; the body part extends any head match to the full pattern length in parallel body-tree traversals. Taking advantage of the SIMD instructions in modern x86-64 multi-core processors, we design compact and efficient data structures packing multi-path and multi-stride pattern segments in the body-tree. Compared with an optimized AC-DFA solution, our head-body matching (HBM) implementation achieves 1.2x to 3x the throughput as the input match (attack) ratio varies from 2% to 32%. Our HBM data structure is over 20x smaller than a fully-populated AC-DFA for both the Snort and ClamAV dictionaries. The aggregated throughput of our HBM approach scales almost 7x with 8 threads, to over 10 Gbps on a dual-socket quad-core Opteron (Shanghai) server.
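One building block worth illustrating is the SIMD segment comparison: the sketch below uses SSE2 intrinsics (supported by the x86-64 processors named above) to compare a 16-byte packed pattern segment against 16 input bytes in a single instruction. The segment layout and mask handling are illustrative, not the paper's exact data structure.

```cpp
#include <cstdio>
#include <emmintrin.h>  // SSE2 intrinsics

int main() {
    // A 15-character pattern segment plus NUL, packed into 16 aligned bytes.
    alignas(16) char segment[16] = "malware-payload";
    alignas(16) char window[16]  = "malware-payload";  // 16 bytes of "input"

    __m128i a  = _mm_load_si128(reinterpret_cast<const __m128i*>(segment));
    __m128i b  = _mm_load_si128(reinterpret_cast<const __m128i*>(window));
    __m128i eq = _mm_cmpeq_epi8(a, b);       // per-byte equality, 16 lanes at once
    int mask   = _mm_movemask_epi8(eq);      // one result bit per byte lane

    // Require the 15 pattern bytes to match (bit 15 covers the padding NUL).
    if ((mask & 0x7FFF) == 0x7FFF)
        std::printf("16-byte stride matched (mask=0x%04x)\n", mask);
    return 0;
}
```

Packing multiple pattern bytes per comparison is what lets one instruction advance the body traversal by a whole stride instead of one character.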
Reconfigurable Computing and FPGAs | 2008
Yi-Hua E. Yang; Viktor K. Prasanna
We present algorithms for implementing large-scale regular expression matching (REM) on FPGA. Based on the proposed algorithms, we develop tools that first transform regular expressions into corresponding non-deterministic finite automata (RE-NFA), then convert the RE-NFA into structural VHDL that utilizes both logic slices and block memory (BRAM) available on modern FPGA devices. An n-state m-character input regular expression matching engine (REME) can be constructed in O(n × m log2 m) time using O(n × m) memory space, resulting in a circuit that occupies no more than O(n × m) slices on FPGA. A large number of REMEs are placed automatically onto a two-dimensional staged pipeline, allowing scalability to hundreds of REMEs with linear area increase, running at over 300 MHz on Xilinx Virtex-4 devices.
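The RE-NFA-to-VHDL conversion can be pictured as emitting one next-state equation per NFA state. The C++ sketch below prints such equations for a toy 4-state RE-NFA; the signal names (state_q, state_d, cc_a, ...) are invented for illustration, not the tool's actual output.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// One NFA edge: from-state, to-state, and the character-class match signal
// that gates the transition (in hardware, a shared BRAM-based classifier).
struct Edge { int from, to; std::string charClass; };

int main() {
    // Toy RE-NFA for a[bc]d: states 0..3, three gated edges.
    std::vector<Edge> nfa = { {0, 1, "cc_a"}, {1, 2, "cc_bc"}, {2, 3, "cc_d"} };
    int nStates = 4;

    // Emit one VHDL-style next-state assignment per state: OR over all
    // incoming edges of (predecessor flip-flop AND class-match signal).
    for (int s = 1; s < nStates; ++s) {
        std::printf("state_d(%d) <= ", s);
        bool first = true;
        for (const Edge& e : nfa)
            if (e.to == s) {
                std::printf("%s(state_q(%d) and %s)", first ? "" : " or ",
                            e.from, e.charClass.c_str());
                first = false;
            }
        std::printf(";\n");
    }
    return 0;
}
```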
Reconfigurable Computing and FPGAs | 2011
Yun Qu; Yi-Hua E. Yang; Viktor K. Prasanna
Regular expression matching (REM) is widely used by the networking community for deep packet inspection and network intrusion detection. Most existing REM solutions on FPGA address only single-stream matching. In many real-life scenarios, however, multiple data streams are interleaved on a high-bandwidth input to be matched by a set of regular expressions. Each data stream, for example, can consist of the payloads of a packet flow in the network. This paper presents the design and implementation of a multi-stream regular expression matching engine on FPGA. Our approach includes: (1) a flexible distributed-RAM-based context storage design, and (2) an efficient context switching mechanism with single-cycle switching overhead. We implemented a multi-stream REM engine on FPGA for matching against up to 96 concurrent input streams. Using our design, a state-of-the-art FPGA device can match ~1,000 regular expressions, each of length up to 100 characters, against up to 64 concurrent input streams. Place-and-route results show that our design achieves 270 MHz while matching 4 input characters per cycle, resulting in a maximum matching throughput of 8.6 Gbps.
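A software caricature of the context switch: keep one saved state vector per stream and swap it in and out around each input symbol, which is O(1) per switch just as the single-cycle hardware mechanism is. The toy 4-state NFA (pattern a[bc]d) and the interleaved arrivals below are illustrative.

```cpp
#include <bitset>
#include <cstdio>
#include <utility>
#include <vector>

const int N = 4;  // toy NFA: states 0..3, state 3 accepts

// One parallel NFA step on a one-hot state vector.
std::bitset<N> step(std::bitset<N> s, char ch) {
    std::bitset<N> n;
    n[0] = true;                                  // start state always live
    n[1] = s[0] && ch == 'a';                     // 0 -> 1 on 'a'
    n[2] = s[1] && (ch == 'b' || ch == 'c');      // 1 -> 2 on [bc]
    n[3] = s[2] && ch == 'd';                     // 2 -> 3 on 'd'
    return n;
}

int main() {
    std::vector<std::bitset<N>> context(2);       // saved state vector per stream
    context[0][0] = context[1][0] = true;         // both streams start in state 0

    // Interleaved (stream id, character) arrivals: stream 0 carries "abd",
    // stream 1 carries "acd" -- both should match despite interleaving.
    std::pair<int, char> arrivals[] =
        {{0, 'a'}, {1, 'a'}, {0, 'b'}, {1, 'c'}, {0, 'd'}, {1, 'd'}};
    for (auto [sid, ch] : arrivals) {
        context[sid] = step(context[sid], ch);    // restore, step, save in one go
        if (context[sid][3])
            std::printf("stream %d: match ending on '%c'\n", sid, ch);
    }
    return 0;
}
```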
Reconfigurable Computing and FPGAs | 2014
Andrea Sanny; Yi-Hua E. Yang; Viktor K. Prasanna
The construction of histograms is an integral part of image processing pipelines, useful for image editing features such as histogram matching, thresholding and histogram equalization. In the past, research on kernels used in image processing pipelines has targeted high throughput, area efficiency and low cost. However, a growing topic of interest that has not been fully explored is the use of energy efficiency as a key metric. In this work, we focus on developing an energy-efficient histogram implementation with a frame rate of at least 30 frames per second. We determine the components that consume the most power and propose an optimized histogram implementation that uses multiple optimizations to achieve notable improvements in energy efficiency while maintaining suitable throughput for use within image processing pipelines. These optimizations include a data-defined memory activation schedule, a careful data layout, and circuit-level pipelining. Our architecture is implemented for commonly used image sizes ranging from 240×128 to 1216×912, assuming a pixel width of 16 bits per pixel. The post place-and-route results show that our optimized architecture has up to 15.3× higher energy efficiency than the baseline architecture.
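The data-defined memory activation schedule can be approximated in software as banked histogram bins where each pixel touches exactly one bank; on FPGA the point is that only the addressed BRAM bank is activated per pixel. Bank count, bin count, and pixel data below are illustrative.

```cpp
#include <cstdio>
#include <vector>

int main() {
    const int BINS = 256, BANKS = 4, BANK_SZ = BINS / BANKS;
    // One vector per bank, standing in for independently-enabled BRAMs.
    std::vector<std::vector<int>> bank(BANKS, std::vector<int>(BANK_SZ, 0));

    unsigned char pixels[] = {12, 12, 200, 63, 64, 200, 255};
    for (unsigned char p : pixels) {
        int b   = p / BANK_SZ;   // the pixel value selects which bank to activate
        int off = p % BANK_SZ;   // bin offset within that bank
        ++bank[b][off];          // read-modify-write touches only one bank
    }

    for (int b = 0; b < BANKS; ++b)
        for (int i = 0; i < BANK_SZ; ++i)
            if (bank[b][i])
                std::printf("bin %3d (bank %d): %d\n", b * BANK_SZ + i, b, bank[b][i]);
    return 0;
}
```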
High Performance Switching and Routing | 2012
Yun Qu; Yi-Hua E. Yang; Viktor K. Prasanna
High-throughput regular expression matching (REM) over a single packet flow for deep packet inspection in routers has been well studied. In many real-world cases, however, packet processing operations are performed on a large number of packet flows, each supported by many run-time states. To handle a large number of flows, the architecture should support a mechanism to perform rapid context switches without adversely affecting the throughput. As the number of flows increases, large-capacity memory is needed to store the per-flow matching states. In this paper, we propose a hardware-accelerated context switch mechanism for managing a large number of per-flow states in memory efficiently. With sufficiently large off-chip memory, a state-of-the-art FPGA device can be multiplexed by millions of packet flows with negligible throughput degradation for large packets. Post-place-and-route results show that when 8 characters are matched per cycle, our design can achieve a 180 MHz clock rate, leading to a throughput of 11.8 Gbps.
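A plausible software model of the scaled-up context store: a small "on-chip" cache holds recently active flow states while a large "off-chip" table backs millions of flows, with per-flow state moved between them on demand. Cache size, eviction policy, and the flow trace below are invented for illustration.

```cpp
#include <cstdio>
#include <unordered_map>
#include <utility>
#include <vector>

int main() {
    const size_t CACHE_SLOTS = 4;                     // tiny "on-chip" capacity
    std::unordered_map<unsigned, unsigned> offchip;   // flow id -> saved state
    std::vector<std::pair<unsigned, unsigned>> cache; // fully-associative cache

    // Fetch a flow's state, filling the cache and evicting the oldest entry
    // to the off-chip table when full.
    auto load = [&](unsigned flow) -> unsigned {
        for (auto& e : cache) if (e.first == flow) return e.second;
        unsigned st = offchip.count(flow) ? offchip[flow] : 0;  // 0 = initial state
        if (cache.size() == CACHE_SLOTS) {
            offchip[cache.front().first] = cache.front().second;
            cache.erase(cache.begin());
        }
        cache.push_back({flow, st});
        return st;
    };
    auto store = [&](unsigned flow, unsigned st) {
        for (auto& e : cache) if (e.first == flow) { e.second = st; return; }
        offchip[flow] = st;
    };

    unsigned flows[] = {1, 2, 3, 4, 5, 1};            // flow 1 returns after eviction
    for (unsigned f : flows) {
        unsigned st = load(f);
        store(f, st + 1);                             // pretend matching advanced
        std::printf("flow %u advanced to state %u\n", f, st + 1);
    }
    return 0;
}
```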
High Performance Switching and Routing | 2013
Yi-Hua E. Yang; Yun Qu; Swapnil Haria; Viktor K. Prasanna
We propose a unified methodology for optimizing IPv4 and IPv6 lookup engines based on the balanced range tree (BRTree) architecture on FPGA. A general BRTree-based IP lookup solution features one or more linear pipelines with a large and complex design space. To allow fast exploration of the design space, we develop a concise set of performance models to characterize the tradeoffs among throughput, table size, lookup latency, and resource requirements of the IP lookup engine. In particular, a simple but realistic model of DDR3 memory is used to accurately estimate off-chip memory performance. The models are then utilized by the proposed methodology to optimize for high lookup rates, large prefix tables, and a fixed maximum lookup latency, respectively. In our prototyping scenarios, a state-of-the-art FPGA could support (1) up to 24 M IPv6 prefixes at 400 Mlps (million lookups per second); (2) up to 1.6 Blps (billion lookups per second) with 1.1 M IPv4 prefixes; and (3) up to 554 K IPv4 prefixes at 400 Mlps with the lookup latency bounded by 400 ns. All our designs achieve 5.6x to 70x the energy efficiency of TCAM, and their performance is independent of the prefix distribution.
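In software terms, the range-tree idea reduces to a binary search over sorted range boundaries derived from the prefix table. The sketch below performs one such lookup over a tiny hand-built IPv4 table (disjoint ranges only, for simplicity); the BRTree pipelining and DDR3 modeling from the paper are not reproduced, and the next-hop ids are illustrative.

```cpp
#include <algorithm>
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <vector>

// Each entry marks where a range starts; the range extends to the next start.
struct Range { uint32_t start; int nextHop; };

int main() {
    // Ranges derived by hand from two disjoint prefixes (-1 = no route).
    std::vector<Range> table = {
        {0x00000000, -1},
        {0x0A000000,  1},   // 10.0.0.0/8     -> hop 1
        {0x0B000000, -1},
        {0xC0A80000,  2},   // 192.168.0.0/16 -> hop 2
        {0xC0A90000, -1},
    };

    uint32_t addr = 0xC0A80101;  // 192.168.1.1
    // Binary search: first range starting beyond addr, then step back one.
    auto it = std::upper_bound(table.begin(), table.end(), addr,
        [](uint32_t a, const Range& r) { return a < r.start; });
    int hop = (--it)->nextHop;   // predecessor range covers addr
    std::printf("lookup 0x%08" PRIX32 " -> next hop %d\n", addr, hop);
    return 0;
}
```

Balancing that search tree and cutting it into pipeline stages is what turns this log-time search into one lookup per clock cycle in hardware.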
Reconfigurable Computing and FPGAs | 2012
Da Tong; Yi-Hua E. Yang; Viktor K. Prasanna
High-speed IP lookup remains a challenging problem in next-generation routers due to ever-increasing line rates and routing table sizes. The evolution towards IPv6 results in longer prefixes, sparse prefix distributions, and potentially very large routing tables. In this paper we propose a memory-efficient IPv6 lookup engine on Field Programmable Gate Array (FPGA). Static data structures are employed to reduce the on-chip memory requirement. We design two novel techniques, implicit match identification and implicit match relay, to enhance the overall memory efficiency. Our experimental results show that the proposed techniques reduce memory usage by 30%. Using our architecture, state-of-the-art FPGA devices can support two copies of an IPv6 routing table containing around 330K routing prefixes. Using dual-ported BRAM and external SRAM, 4 pipelines can be implemented on a single device, achieving a throughput of 720 million lookups per second (MLPS).
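As context for the two techniques, the sketch below shows the baseline they optimize: a bit-level trie walk for longest-prefix match, where each trie level would map to one pipeline stage in hardware. It uses 16-bit toy addresses in place of IPv6 and does not reproduce implicit match identification or implicit match relay.

```cpp
#include <cstdint>
#include <cstdio>
#include <memory>

// One trie node per prefix bit; nextHop == -1 means no prefix ends here.
struct Node {
    int nextHop = -1;
    std::unique_ptr<Node> child[2];
};

// Insert a prefix of the given bit length with its next hop.
void insert(Node* root, uint16_t prefix, int len, int hop) {
    Node* n = root;
    for (int i = 15; i >= 16 - len; --i) {
        int b = (prefix >> i) & 1;
        if (!n->child[b]) n->child[b] = std::make_unique<Node>();
        n = n->child[b].get();
    }
    n->nextHop = hop;
}

// Walk the trie, remembering the deepest (longest) matching prefix seen.
int lookup(const Node* root, uint16_t addr) {
    const Node* n = root;
    int best = -1;
    for (int i = 15; i >= 0 && n; --i) {
        if (n->nextHop != -1) best = n->nextHop;
        n = n->child[(addr >> i) & 1].get();
    }
    if (n && n->nextHop != -1) best = n->nextHop;
    return best;
}

int main() {
    Node root;
    insert(&root, 0b1010000000000000, 3, 7);   // 101*/3    -> hop 7
    insert(&root, 0b1010110000000000, 6, 9);   // 101011*/6 -> hop 9
    std::printf("hop = %d\n", lookup(&root, 0b1010110101010101));  // expect 9
    return 0;
}
```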