Publication


Featured research published by Yi-Gang Tai.


IEEE Transactions on Parallel and Distributed Systems | 2012

Accelerating Matrix Operations with Improved Deeply Pipelined Vector Reduction

Yi-Gang Tai; Chia-Tien Dan Lo; Kleanthis Psarris

Many scientific or engineering applications involve matrix operations, in which reduction of vectors is a common operation. If the core operator of the reduction is deeply pipelined, which is usually the case, dependencies between the input data elements cause data hazards. To tackle this problem, we propose a new reduction method with low latency and high pipeline utilization. The performance of the proposed design is evaluated for both single data set and multiple data set scenarios. Further, QR decomposition is used to demonstrate how the proposed method can accelerate its execution. We implement the design on an FPGA and compare its results to other methods.
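
The data hazard described in this abstract can be illustrated with a coarse cycle-count model. This is a minimal sketch, not the reduction circuit proposed in the paper: it assumes an adder pipeline with `latency` stages and uses the well-known technique of interleaving `latency` independent partial sums. The function names and the cost model (one pipeline latency charged per level of the final combining tree) are assumptions made for illustration.

```python
def serial_reduction_cycles(n, latency):
    """Cycles to sum n values when each add must wait for the previous result."""
    # n - 1 dependent additions, each occupying the full pipeline latency.
    return max(n - 1, 0) * latency


def interleaved_reduction(values, latency):
    """Sum `values` with `latency` independent partial accumulators.

    Returns (total, estimated_cycles) under the coarse model described above.
    """
    # Element i goes to accumulator i % latency, so the same accumulator is
    # reused only every `latency` cycles and no add waits on a pending result.
    partial = [0] * latency
    for i, v in enumerate(values):
        partial[i % latency] += v

    # One add issued per cycle, plus one latency to drain the pipeline.
    cycles = len(values) + latency

    # Combine the partial sums pairwise; charge one latency per tree level.
    level = partial
    while len(level) > 1:
        paired = [level[j] + level[j + 1] for j in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            paired.append(level[-1])
        level = paired
        cycles += latency
    return level[0], cycles


total, cycles = interleaved_reduction(list(range(1, 101)), latency=8)
print(total, cycles, serial_reduction_cycles(100, 8))
```

For 100 values and an 8-stage adder, this model charges 792 cycles to the serial schedule and 132 to the interleaved one. The final combining phase is the part that dedicated reduction designs aim to shorten.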


ACM Symposium on Applied Computing | 2008

Hardware implementation for network intrusion detection rules with regular expression support

Chia-Tien Dan Lo; Yi-Gang Tai; Kleanthis Psarris

Signature-based network intrusion detection systems (NIDSs), such as Snort and Bro, rely on a rule database that describes traffic patterns for known attacks. They examine each packet flowing through a network segment and report suspicious packets to assure security. An attack signature may be represented in terms of fields in a packet such as source/destination IP addresses, source/destination ports, protocols, specific contents in payload, etc. Typically, a Perl Compatible Regular Expression (PCRE) is used to describe specific content in the payload that may identify an attack. Our study shows that over 60% of the execution time in an NIDS is spent performing string comparisons against a signature database of over 5,950 tokens and over 1,763 PCREs. This paper proposes to extend a bit-parallel algorithm to support multi-byte processing and PCRE. This design takes a segment of bytes from the payload of a packet and detects all possible tokens including those crossing text segment boundaries. A tool is designed to generate VHDL code from a rule set automatically. Performance results are reported.
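
To make the idea of bit-parallel matching concrete, here is a minimal software sketch of the classic Shift-And algorithm. The abstract does not say which bit-parallel algorithm the paper extends, so choosing Shift-And is an assumption. The sketch also processes one byte per step and handles only a literal token; it does not cover the multi-byte processing or PCRE support the paper proposes. The name `shift_and_search` is hypothetical.

```python
def shift_and_search(pattern, text):
    """Return the start offsets of every occurrence of `pattern` in `text`.

    Both arguments are bytes objects. Bit i of `state` is set when the first
    i + 1 bytes of the pattern match the text ending at the current position.
    """
    m = len(pattern)
    if m == 0:
        return []

    # masks[b] has bit i set wherever pattern[i] == b.
    masks = {}
    for i, b in enumerate(pattern):
        masks[b] = masks.get(b, 0) | (1 << i)

    accept = 1 << (m - 1)  # bit that marks a complete match
    state = 0
    hits = []
    for pos, b in enumerate(text):
        # Extend every partial match by one byte and start a new one.
        state = ((state << 1) | 1) & masks.get(b, 0)
        if state & accept:
            hits.append(pos - m + 1)
    return hits


print(shift_and_search(b"abra", b"abracadabra"))  # [0, 7]
```

All partial matches advance with one shift and one AND per input byte, which is why the scheme maps naturally onto registers and gates in hardware.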


Field-Programmable Logic and Applications | 2011

Synthesizing Tiled Matrix Decomposition on FPGAs

Yi-Gang Tai; Kleanthis Psarris; Chia-Tien Dan Lo

Hardware accelerators such as FPGAs and GPUs in heterogeneous systems are becoming increasingly important for many applications. For high performance computing, the flexibility and efficiency of FPGAs make them an attractive alternative to other approaches. However, for applications that have to decompose large matrices, not many scalable solutions can be found on FPGAs. In this paper, we propose a scalable QR matrix decomposer on FPGAs based on the latest advances in tiled matrix decomposition algorithms for high performance linear algebra. The proposed design can decompose a matrix of size limited only by off-chip memory and has the potential to achieve high performance.


Microprocessors and Microsystems | 2013

Scalable matrix decompositions with multiple cores on FPGAs

Yi-Gang Tai; Chia-Tien Dan Lo; Kleanthis Psarris

Hardware accelerators are getting increasingly important in heterogeneous systems for many applications, including those that employ matrix decompositions. In recent years, a class of tiled matrix decomposition algorithms has been proposed for out-of-memory computations and multi-core architectures including GPU-based heterogeneous systems. However, on FPGAs these scalable solutions for large matrices are rarely found. In this paper we use the latest tiled decomposition algorithms from high performance linear algebra for off-chip memory access and loop mapping on multiple processing cores for on-chip computation to perform scalable and high performance QR and LU matrix decompositions on FPGAs.


Field-Programmable Logic and Applications | 2007

Applying Out-of-Core QR Decomposition Algorithms on FPGA-Based Systems

Yi-Gang Tai; Chia-Tien Dan Lo; Kleanthis Psarris

QR decomposition, especially by means of the Householder transformation, is often used to solve least squares problems. A matrix to be decomposed with this method is usually very large, often too large to fit into the main memory of a workstation, let alone the internal memory of an FPGA. Efficient out-of-core algorithms have been developed to address the factorization of large matrices. This paper describes the application of variants of Householder QR decomposition on FPGA-based systems. More specifically, issues in applying out-of-core algorithms to the relatively small internal memory architecture of FPGAs are investigated.
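
For readers unfamiliar with the underlying factorization, the following is a minimal in-memory sketch of textbook Householder QR in plain Python. It is not the out-of-core or FPGA variant studied in the paper, and the function name `householder_qr` is hypothetical. The input is assumed to be a small dense matrix given as a list of rows.

```python
import math


def householder_qr(a):
    """Factor an m x n matrix as a = q * r using Householder reflections.

    Returns (q, r) where q is m x m orthogonal and r is m x n upper triangular.
    """
    m, n = len(a), len(a[0])
    r = [row[:] for row in a]
    q = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]

    for k in range(min(m - 1, n)):
        # Column k from the diagonal downwards.
        x = [r[i][k] for i in range(k, m)]
        norm_x = math.sqrt(sum(t * t for t in x))
        if norm_x == 0.0:
            continue  # nothing to eliminate in this column

        # Reflection vector v = x + sign(x0) * ||x|| * e1; the sign choice
        # avoids cancellation when x0 and ||x|| are close in magnitude.
        v = x[:]
        v[0] += math.copysign(norm_x, x[0])
        vtv = sum(t * t for t in v)

        # r <- H r, with H = I - 2 v v^T / (v^T v) acting on rows k..m-1.
        for j in range(n):
            f = 2.0 * sum(v[i] * r[k + i][j] for i in range(len(v))) / vtv
            for i in range(len(v)):
                r[k + i][j] -= f * v[i]

        # q <- q H, so that q accumulates the product of all reflections.
        for i in range(m):
            f = 2.0 * sum(q[i][k + j] * v[j] for j in range(len(v))) / vtv
            for j in range(len(v)):
                q[i][k + j] -= f * v[j]
    return q, r


q, r = householder_qr([[12.0, -51.0, 4.0], [6.0, 167.0, -68.0], [-4.0, 24.0, -41.0]])
for row in r:
    print([round(t, 6) for t in row])
```

Each step touches every remaining column of the matrix, which is why the working set quickly exceeds a small on-chip memory and motivates the out-of-core formulations mentioned in the abstract.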


Field-Programmable Technology | 2010

Multiple data set reduction on FPGAs

Yi-Gang Tai; Chia-Tien Dan Lo; Kleanthis Psarris

Many scientific or engineering applications perform reduction of sets of sequential data streams. If the core operator of the reduction is deeply pipelined, dependencies between the input data elements cause data hazards in the pipeline. To tackle this problem, we propose a multiple set variable length reduction design with low latency and high pipeline utilization in this paper. We prove the buffer size and execution time bounds, and then show its performance on practical multiple data set scenarios. We apply the proposed method to the Householder QR decomposition and compare its performance to other methods with superior results. The proposed design is implemented on FPGAs with resource usage and performance presented.


International Parallel and Distributed Processing Symposium | 2008

Accelerating matrix decomposition with replications

Yi-Gang Tai; Chia-Tien Dan Lo; Kleanthis Psarris

Matrix decomposition applications that involve large matrix operations can take advantage of the flexibility and adaptability of reconfigurable computing systems to improve performance. The benefits come from replication, which includes vertical replication and horizontal replication. If viewed on a space-time chart, vertical replication allows multiple computations to be executed in parallel, and horizontal replication renders multiple functions on the same piece of hardware. In this paper, a reconfigurable architecture that supports replications for matrix decomposition applications on reconfigurable computing systems is described, and issues including the comparison of algorithms on the system and data movement between the internal computation cores and the external memory subsystem are addressed. A prototype of such a system is implemented to prove the concept. It is expected to improve the performance and scalability of matrix decomposition involving large matrices.


Systems, Man and Cybernetics | 2009

An improved reduction algorithm with deeply pipelined operators

Yi-Gang Tai; Chia-Tien Dan Lo; Kleanthis Psarris

Many scientific applications involve reduction or accumulation operations on sequential data streams. Examples such as matrix-vector multiplication include multiple inner product operations on different data sets. If the core operator of the reduction is deeply pipelined, which is usually the case, dependencies between the input data cause data hazards in the pipeline and call for a careful design. In this paper, we propose a modified design of the reduction operation based on Sips and Lin's method. We analyze the performance of the proposed design to prove the correctness of the timing and demonstrate its performance against previous methods.


ACM Transactions on Reconfigurable Technology and Systems | 2009

Space Optimization on Counters for FPGA-Based Perl Compatible Regular Expressions

Chia-Tien Dan Lo; Yi-Gang Tai

With their expressiveness and simplicity, Perl compatible regular expressions (PCREs) have been adopted in mainstream signature-based network intrusion detection systems (NIDSs) to describe known attack signatures, especially for polymorphic worms. NIDSs rely on an underlying string matching engine that simulates PCREs to inspect each network packet. PCRE is a superset of traditional regular expressions, and provides advanced features. However, this pattern matching becomes a performance bottleneck of software-based NIDSs, causing a large portion of their execution time to be dedicated to payload inspection, which results in an unacceptable packet drop rate. The penetration of these unexamined packets creates a security hole in such systems. Over the past decade, hardware acceleration for pattern matching has been studied extensively and marginal performance improvement has been achieved. Among hardware approaches, FPGA-based acceleration engines provide great flexibility because new signatures can be compiled and programmed into their reconfigurable architecture. As more and more malicious signatures are discovered, it becomes harder to map a complete set of malicious signatures specified in PCREs to an FPGA chip. One of the space-consuming components is the counter used in the constrained repetitions for PCREs. Therefore, we propose a space-efficient SelectRAM counter for PCREs that use counting. The design takes advantage of the basic components contained in a configurable logic block, and thus optimizes space usage. A set of basic PCRE blocks has been built in hardware to implement PCREs. Experimental results show that the proposed scheme outperforms existing designs by at least fivefold.
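
The role of a counter in a constrained repetition can be shown with a small software sketch. It is a simplified illustration, not the SelectRAM counter design from the paper: it matches a pattern of the form `a b{lo,hi} c` and assumes the prefix, repeated, and suffix symbols are three distinct single characters, so that one counter is enough. The function name and parameters are hypothetical.

```python
def match_counted(text, prefix, repeated, suffix, lo, hi):
    """Return the end indices of matches of  prefix repeated{lo,hi} suffix.

    Assumes prefix, repeated and suffix are three distinct single characters.
    A single counter replaces the hi unrolled states that a plain automaton
    would need for the repetition.
    """
    hits = []
    count = None  # None means the counter is idle (no prefix seen yet)
    for pos, ch in enumerate(text):
        if ch == prefix:
            count = 0  # (re)start counting after the prefix
        elif ch == repeated and count is not None:
            count += 1
            if count > hi:
                count = None  # too many repetitions, abandon this attempt
        elif ch == suffix and count is not None:
            if count >= lo:
                hits.append(pos)
            count = None
        else:
            count = None  # any other symbol breaks the match
    return hits


print(match_counted("abbc abbbbc ac abc", "a", "b", "c", 2, 3))  # [3]
```

Without the counter, a bound such as `{2,3}` would be expanded into one state per allowed repetition, and large bounds multiply that cost; this is the space pressure the abstract refers to.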


Applied Reconfigurable Computing | 2008

Highly Space Efficient Counters for Perl Compatible Regular Expressions in FPGAs

Chia-Tien Dan Lo; Yi-Gang Tai

Signature-based network intrusion detection systems (NIDSs) rely on an underlying string matching engine that inspects each network packet against a known malicious pattern database. Traditional static pattern descriptions may not efficiently represent sophisticated attack signatures. Recently, most NIDSs have adopted regular expressions such as Perl compatible regular expressions (PCREs) to describe an attack signature, especially for polymorphic worms. PCRE is a superset of traditional regular expressions, in which no counters are involved. However, this burdens software-based NIDSs, causing a large portion of their execution time to be dedicated to pattern matching. Over the past decade, hardware acceleration for pattern matching has been studied extensively and marginal performance improvement has been achieved. Among hardware approaches, FPGA-based acceleration engines provide great flexibility because new signatures can be compiled and programmed into their reconfigurable architecture. As more and more malicious signatures are discovered, it becomes harder to map a complete set of malicious signatures specified in PCREs to an FPGA chip. Worse, the counters used in PCREs typically take a great deal of hardware resources. Therefore, we propose a space-efficient SelectRAM counter for PCREs that involve counting. The design takes advantage of the components contained in a configurable logic block, and thus optimizes space usage. A set of PCRE blocks has been built in hardware to implement PCREs used in Snort/Bro. Experimental results show that the proposed scheme outperforms existing designs by at least fivefold. Performance results are reported in this paper.

Collaboration


Dive into Yi-Gang Tai's collaborations.

Top Co-Authors

Chia-Tien Dan Lo

Southern Polytechnic State University

Kleanthis Psarris

University of Texas at San Antonio
