Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Nhat-Phuong Tran is active.

Publication


Featured research published by Nhat-Phuong Tran.


IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2013

High Throughput Parallel Implementation of Aho-Corasick Algorithm on a GPU

Nhat-Phuong Tran; Myungho Lee; Sugwon Hong; Jaeyoung Choi

Pattern matching is an important operation in applications such as computer and network security, bioinformatics, and image processing, among many others. The Aho-Corasick (AC) algorithm is a multiple-pattern matching algorithm commonly used for such applications. In order to meet the highly demanding performance requirements imposed on these applications, achieving high performance for the AC algorithm is crucial. In this paper, we present a high performance parallel implementation of the AC algorithm on a Graphics Processing Unit (GPU) which efficiently utilizes the high degree of on-chip parallelism and the memory hierarchy of the GPU so that the aggregate performance (or throughput) of the GPU can be maximized. For this purpose, our approach carefully places and caches the input text data and the reference pattern data used for pattern matching in the on-chip shared memories and the texture caches of the GPU. Furthermore, it efficiently schedules the off-chip global memory loads and the shared memory stores in order to minimize the overhead of loading the input data into the shared memories and to minimize shared memory bank conflicts. The proposed approach significantly cuts down the effective memory access latencies and leads to impressive performance improvements. Experimental results on an Nvidia GeForce GTX 285 GPU show that our approach delivers up to 127 Gbps throughput and up to a 222-times speedup compared with a serial version running on a 2.2 GHz Intel Core 2 Duo processor.
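A minimal CUDA sketch of the general idea described above: each thread block stages a tile of the input text in shared memory, and each thread scans its chunk against a precomputed AC state-transition table held in read-only memory (the texture path in the paper). The kernel name, tile sizes, and table layout are illustrative assumptions, not the authors' actual code.

```cuda
// Illustrative sketch: multi-pattern matching with a precomputed Aho-Corasick
// state-transition table (STT) on the GPU. stt[state * 256 + byte] gives the
// next state; out[state] != 0 marks a match state. Assumed names/sizes only.

#define TILE_SIZE 4096   // bytes of input text staged per thread block
#define THREADS   256    // threads per block

__global__ void ac_match_kernel(const unsigned char *text, size_t text_len,
                                const int *__restrict__ stt,   // state-transition table
                                const int *__restrict__ out,   // match flag per state
                                unsigned int *match_count)
{
    __shared__ unsigned char tile[TILE_SIZE];

    size_t block_base = (size_t)blockIdx.x * TILE_SIZE;
    size_t block_len  = 0;
    if (block_base < text_len)
        block_len = (block_base + TILE_SIZE <= text_len) ? TILE_SIZE : text_len - block_base;

    // Coalesced, strided copy of this block's tile into shared memory.
    for (size_t i = threadIdx.x; i < block_len; i += blockDim.x)
        tile[i] = text[block_base + i];
    __syncthreads();

    // Each thread scans one chunk of the tile from the AC initial state.
    // (A real implementation also overlaps adjacent chunks by maxPatternLen-1
    //  bytes so matches spanning a chunk boundary are not missed.)
    size_t chunk = (block_len + blockDim.x - 1) / blockDim.x;
    size_t begin = (size_t)threadIdx.x * chunk;
    size_t end   = begin + chunk;
    if (end > block_len) end = block_len;

    int state = 0;
    unsigned int local = 0;
    for (size_t i = begin; i < end; ++i) {
        state = stt[state * 256 + tile[i]];   // read-only table, cached on chip
        if (out[state]) ++local;
    }
    if (local) atomicAdd(match_count, local);
}
```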


IEEE International Conference on High Performance Computing, Data, and Analytics | 2012

Memory Efficient Parallelization for Aho-Corasick Algorithm on a GPU

Nhat-Phuong Tran; Myungho Lee; Sugwon Hong; Minho Shin

Pattern matching is a commonly used operation in many applications including image processing, computer and network security, and bioinformatics, among many others. The Aho-Corasick (AC) algorithm is a well-known pattern matching technique that is intensively used in computer and network security. In order to meet the real-time performance requirements imposed on these security applications, developing a high-speed parallelization technique for the AC algorithm is essential. In this paper, we present a new memory-efficient parallelization technique which efficiently places and caches the input text data and the reference data in the on-chip shared memories and texture caches of the Graphics Processing Unit (GPU). Furthermore, the new approach efficiently schedules memory accesses in order to minimize the overhead of loading data into the on-chip shared memories. The approach cuts down the effective memory access latencies and leads to significant performance improvements. Experimental results on an Nvidia GeForce 9500GT GPU show up to a 15-times speedup compared with a serial version on a 2.2 GHz Intel Core 2 Duo processor, and up to 15 Gbps throughput.


International Conference on ICT for Smart Society | 2013

High performance string matching for security applications

Nhat-Phuong Tran; Myungho Lee

The Aho-Corasick (AC) algorithm is a commonly used string matching algorithm that performs multiple-pattern matching for computer and network security applications. These applications impose high computational requirements, thus efficient parallelization of the AC algorithm is crucial. In this paper, we present a multi-stream based parallelization approach for string matching using the AC algorithm on the latest Nvidia Kepler architecture. Our approach efficiently utilizes the Hyper-Q feature of the Kepler GPU so that multiple streams generated from a number of OpenMP threads running on the host multicore processor can be efficiently executed on a large number of fine-grain processing cores of the GPU. Experimental results show that our approach delivers up to 585 Gbps throughput on an Nvidia Tesla K20 GPU.
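A hedged sketch of the multi-stream pattern: each OpenMP host thread owns its own CUDA stream and issues copy/kernel/copy work for its slice of the input, so Hyper-Q can overlap the independent streams on a Kepler GPU. The kernel below is a self-contained placeholder (it only counts a single byte value), and the chunking scheme and names are illustrative assumptions, not the paper's implementation.

```cuda
// Illustrative multi-stream driver (assumed names; compile with: nvcc -Xcompiler -fopenmp).
#include <omp.h>
#include <cuda_runtime.h>

// Placeholder for the real AC matching kernel so the sketch is self-contained;
// it simply counts occurrences of the byte 'A'.
__global__ void ac_match_kernel(const unsigned char *text, size_t len, unsigned int *matches)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < len && text[i] == 'A') atomicAdd(matches, 1u);
}

void match_multistream(const unsigned char *h_text, size_t text_len, int num_streams)
{
    size_t chunk = (text_len + num_streams - 1) / num_streams;

    #pragma omp parallel num_threads(num_streams)
    {
        int tid = omp_get_thread_num();
        size_t offset = (size_t)tid * chunk;
        size_t len = 0;
        if (offset < text_len)
            len = (offset + chunk <= text_len) ? chunk : text_len - offset;

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        unsigned char *d_text = nullptr;
        unsigned int  *d_matches = nullptr;
        cudaMalloc(&d_text, len);
        cudaMalloc(&d_matches, sizeof(unsigned int));
        cudaMemsetAsync(d_matches, 0, sizeof(unsigned int), stream);

        // Copy, launch, and copy back entirely within this thread's stream;
        // Hyper-Q lets independent streams run concurrently on Kepler.
        // (For full copy/compute overlap, h_text should be pinned via cudaMallocHost.)
        cudaMemcpyAsync(d_text, h_text + offset, len, cudaMemcpyHostToDevice, stream);
        ac_match_kernel<<<(unsigned)((len + 255) / 256), 256, 0, stream>>>(d_text, len, d_matches);

        unsigned int h_matches = 0;
        cudaMemcpyAsync(&h_matches, d_matches, sizeof(h_matches), cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);

        cudaFree(d_text);
        cudaFree(d_matches);
        cudaStreamDestroy(stream);
        // h_matches now holds this chunk's match count (combine across threads as needed).
    }
}
```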


Scientific Programming | 2017

Performance Optimization of 3D Lattice Boltzmann Flow Solver on a GPU

Nhat-Phuong Tran; Myungho Lee; Sugwon Hong

The Lattice Boltzmann Method (LBM) is a powerful numerical method for simulating fluid flow. With its data-parallel nature, it is a promising candidate for a parallel implementation on a GPU. The LBM, however, is heavily data intensive and memory bound. In particular, moving the data to the adjacent cells in the streaming computation phase incurs a lot of uncoalesced accesses on the GPU, which hurts the overall performance. Furthermore, the main computation kernels of the LBM use a large number of registers per thread, which limits the thread parallelism available at run time due to the fixed number of registers on the GPU. In this paper, we develop a high performance parallelization of the LBM on a GPU by minimizing the overheads associated with the uncoalesced memory accesses while improving the cache locality using a tiling optimization with a data layout change. Furthermore, we aggressively reduce the register usage of the LBM kernels in order to increase the run-time thread parallelism. Experimental results on the Nvidia Tesla K20 GPU show that our approach delivers impressive throughput performance: 1210.63 Million Lattice Updates Per Second (MLUPS).
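The data-layout point above can be made concrete with a small sketch: storing the distribution functions of a D3Q19 lattice as a structure-of-arrays (one contiguous array per velocity direction) lets neighboring threads read neighboring addresses, which is what makes coalescing possible. The D3Q19 choice, array names, and kernel below are assumptions for illustration, not the paper's code.

```cuda
// Layout sketch for a D3Q19 lattice (illustrative).
// Array-of-structures: f[cell].q[d] -> neighboring threads touch addresses 19 floats apart.
// Structure-of-arrays: f[d][cell]   -> neighboring threads touch consecutive addresses (coalesced).

#define Q 19  // number of discrete velocities in D3Q19

struct LatticeAoS { float q[Q]; };   // uncoalesced when indexed by thread id

struct LatticeSoA {
    float *f[Q];                     // one contiguous array per direction
};

// Linear index of cell (x, y, z) in an (nx, ny, nz) domain.
__host__ __device__ inline size_t cell_index(int x, int y, int z, int nx, int ny)
{
    return (size_t)z * nx * ny + (size_t)y * nx + x;
}

__global__ void relax_soa(LatticeSoA f, int nx, int ny, int nz, float omega)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z;
    if (x >= nx || y >= ny || z >= nz) return;
    size_t idx = cell_index(x, y, z, nx, ny);

    // Each of the Q loads below is coalesced across a warp because consecutive
    // threads (consecutive x) read consecutive elements of f.f[d].
    float rho = 0.f;
    for (int d = 0; d < Q; ++d) rho += f.f[d][idx];

    // ... the collision step would go here; a BGK update relaxes each f_d
    //     toward its equilibrium with rate omega.
    (void)rho; (void)omega;
}
```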


IEEE International Conference on High Performance Computing, Data, and Analytics | 2015

Memory-Efficient Parallelization of 3D Lattice Boltzmann Flow Solver on a GPU

Nhat-Phuong Tran; Myungho Lee; Dong Hoon Choi

The Lattice Boltzmann Method (LBM) is a powerful numerical method for simulating fluid flow. With its data-parallel nature and simple kernel structure, it is a promising candidate for a parallel implementation on a GPU. The LBM, however, is heavily data-intensive and memory bound. In particular, moving the data to the adjacent cells in the streaming computation phase of the LBM incurs a lot of uncoalesced accesses on the GPU, which hurts the overall performance. In this paper, we parallelize the LBM on a GPU by incorporating memory-efficient techniques such as the tiling optimization with data layout changes and a data update scheme known as the pull scheme. Furthermore, we develop optimization techniques such as removing branch divergence, reducing register usage, and reducing the number of double precision floating-point instructions. Experimental results on an Nvidia Tesla K20 GPU show that our approach delivers up to 1105 MLUPS (Million Lattice Updates Per Second) and a 156-times speedup compared with a serial implementation.
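A compact sketch of the pull scheme mentioned above: instead of writing ("pushing") post-collision values to neighbor cells, each cell gathers ("pulls") the distributions streamed toward it from its neighbors, so the scattered accesses become reads rather than writes and the writes stay coalesced. The direction tables, periodic wrap, and names below are illustrative assumptions.

```cuda
// Pull-scheme streaming sketch for a D3Q19 lattice (illustrative).
// src holds post-collision distributions from the previous time step;
// each cell *reads* f_d from the neighbor located opposite to direction d.

#define Q 19

__constant__ int cx[Q], cy[Q], cz[Q];   // discrete velocity components, set from the host

__global__ void stream_pull(const float *const *src,   // src[d]: array of f_d values
                            float *const *dst,         // dst[d]: array of f_d values
                            int nx, int ny, int nz)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z;
    if (x >= nx || y >= ny || z >= nz) return;

    size_t here = (size_t)z * nx * ny + (size_t)y * nx + x;

    for (int d = 0; d < Q; ++d) {
        // Neighbor that streams into this cell along direction d (periodic wrap for brevity).
        int xn = (x - cx[d] + nx) % nx;
        int yn = (y - cy[d] + ny) % ny;
        int zn = (z - cz[d] + nz) % nz;
        size_t from = (size_t)zn * nx * ny + (size_t)yn * nx + xn;

        // The write to dst[d][here] is coalesced; the irregular access is now a read.
        dst[d][here] = src[d][from];
    }
}
```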


Trust, Security and Privacy in Computing and Communications | 2013

Performance Optimization of Aho-Corasick Algorithm on a GPU

Nhat-Phuong Tran; Myungho Lee; Sugwon Hong; Jongwoo Bae

The Aho-Corasick (AC) algorithm is a multiple-pattern matching algorithm commonly used for applications such as computer and network security, bioinformatics, and image processing, among others. These applications are computationally demanding, thus optimizing the performance of the AC algorithm is crucial. In this paper, we present a performance optimization strategy for the AC algorithm on a Graphics Processing Unit (GPU). Our strategy efficiently utilizes the high degree of on-chip parallelism and the complicated memory hierarchy of the GPU so that the aggregate performance (or throughput) of the AC algorithm can be optimized. The strategy significantly cuts down the effective memory access latencies and efficiently utilizes the memory bandwidth. It also maximizes the effect of the GPU's multithreading capability through optimal thread scheduling. Experimental results on an Nvidia GeForce GTX 285 GPU show that our approach delivers up to 127 Gbps throughput and a 222-times speedup compared with a serial version running on a single core of a 2.2 GHz Intel Core 2 Duo processor.


Computational Science and Engineering | 2011

Performance Enhancement of Network Devices with Multi-Core Processors

Nhat-Phuong Tran; Sugwon Hong; Myungho Lee; Seung-Jae Lee

In network-based applications, packet capture is a core operation for traffic monitoring systems and attracts considerable research attention. Along with packet capture, many other functions, such as security, are incorporated into network applications. Specialized hardware and software have been developed and used in order to meet the real-time performance requirements of these functions. Recently, with the prevalence of multi-core processors, researchers have been applying multi-core parallel processing approaches to multi-function network applications. However, parallelizing multiple operations of a network device in an integrated way is difficult because of the asynchronous nature of coordinating these functions. In this paper, we propose a pipelined parallel execution approach using the producer-consumer model, applicable to an environment where a stream of packets passes through two successive processes, each of which performs some tasks on the packets. We also implement a packet capture process and an encryption process inside the parallel model, and show the effectiveness of our approach.
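The pipelined producer-consumer idea can be illustrated with a small host-side sketch: a capture thread pushes packets into a bounded queue while an encryption thread drains it, so the two stages overlap on different cores. The queue type and the stage functions are illustrative stand-ins, not the paper's implementation (which targets real packet capture and encryption).

```cuda
// Host-side sketch of the producer-consumer pipeline (plain C++, valid as CUDA host code).
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct Packet { std::vector<unsigned char> payload; };

class BoundedQueue {                      // fixed-capacity queue shared by the two stages
public:
    explicit BoundedQueue(size_t cap) : cap_(cap) {}
    void push(Packet p) {
        std::unique_lock<std::mutex> lk(m_);
        not_full_.wait(lk, [&] { return q_.size() < cap_; });
        q_.push(std::move(p));
        not_empty_.notify_one();
    }
    Packet pop() {
        std::unique_lock<std::mutex> lk(m_);
        not_empty_.wait(lk, [&] { return !q_.empty(); });
        Packet p = std::move(q_.front());
        q_.pop();
        not_full_.notify_one();
        return p;
    }
private:
    std::queue<Packet> q_;
    size_t cap_;
    std::mutex m_;
    std::condition_variable not_full_, not_empty_;
};

// Stage 1: capture (stand-in for a real packet capture loop).
void capture_stage(BoundedQueue &q, int n_packets) {
    for (int i = 0; i < n_packets; ++i)
        q.push(Packet{std::vector<unsigned char>(1500, (unsigned char)i)});
}

// Stage 2: encrypt (stand-in for a real cipher).
void encrypt_stage(BoundedQueue &q, int n_packets) {
    for (int i = 0; i < n_packets; ++i) {
        Packet p = q.pop();
        for (auto &b : p.payload) b ^= 0x5A;   // placeholder transformation
    }
}

int main() {
    BoundedQueue q(64);                               // bounded buffer between the stages
    std::thread producer(capture_stage, std::ref(q), 1000);
    std::thread consumer(encrypt_stage, std::ref(q), 1000);
    producer.join();
    consumer.join();
    return 0;
}
```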


MUSIC | 2014

Multi-stream Parallel String Matching on Kepler Architecture

Nhat-Phuong Tran; Myungho Lee; Sugwon Hong; Dong Hoon Choi

The Aho-Corasick (AC) algorithm is a commonly used string matching algorithm that performs multiple-pattern matching for computer and network security, bioinformatics, and many other applications. These applications impose high computational requirements, thus efficient parallelization of the AC algorithm is crucial. In this paper, we present a multi-stream based parallelization approach for string matching using the AC algorithm on the latest Nvidia Kepler architecture. Our approach efficiently utilizes the Hyper-Q feature of the Kepler GPU so that multiple streams generated from a number of OpenMP threads running on the host multicore processor can be efficiently executed on a large number of fine-grain processing cores. Experimental results show that our approach delivers up to 420 Gbps throughput on an Nvidia Tesla K20 GPU.


Scientific Programming | 2015

Cache locality-centric parallel string matching on many-core accelerator chips

Nhat-Phuong Tran; Myungho Lee; Dong Hoon Choi

The Aho-Corasick (AC) algorithm is a multiple-pattern string matching algorithm commonly used in computer and network security and bioinformatics, among many other areas. In order to meet the highly demanding computational requirements imposed on these applications, achieving high performance for the AC algorithm is crucial. In this paper, we present a high performance parallelization of the AC algorithm on many-core accelerator chips such as the Graphics Processing Unit (GPU) from Nvidia and the Intel Xeon Phi. Our parallelization approach significantly improves the cache locality of the AC algorithm by partitioning a given set of string patterns into multiple smaller sets of patterns in a space-efficient way. Using the multiple pattern sets, intensive pattern matching operations are concurrently conducted with respect to the whole input text data. Compared with previous approaches where the input data is partitioned among multiple threads instead of partitioning the pattern set, our approach significantly improves the performance. Experimental results show that our approach yields up to a 2.73-times speedup on the Nvidia K20 GPU and a 2.00-times speedup on the Intel Xeon Phi compared with the previous approach. Our parallel implementation delivers up to 693 Gbps throughput on the K20.
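A small host-side sketch of the cache-locality idea: the pattern set is split into several smaller subsets, here greedily capping the total number of characters per subset as a rough proxy for the size of each subset's AC automaton, and each subset's automaton is then matched against the whole input concurrently. The greedy size cap is an illustrative assumption, not the paper's exact partitioning criterion.

```cuda
// Host-side sketch: partition patterns into subsets whose AC automata stay small
// enough to remain cache-resident (illustrative heuristic, assumed names).
#include <string>
#include <vector>

// Greedily packs patterns into subsets of at most max_chars total characters,
// a rough proxy for the size of the resulting Aho-Corasick automaton.
std::vector<std::vector<std::string>>
partition_patterns(const std::vector<std::string> &patterns, size_t max_chars)
{
    std::vector<std::vector<std::string>> subsets;
    std::vector<std::string> current;
    size_t current_chars = 0;

    for (const std::string &p : patterns) {
        if (!current.empty() && current_chars + p.size() > max_chars) {
            subsets.push_back(current);
            current.clear();
            current_chars = 0;
        }
        current.push_back(p);
        current_chars += p.size();
    }
    if (!current.empty()) subsets.push_back(current);
    return subsets;
}

// Each subset then gets its own small, cache-resident AC automaton, and every
// automaton is matched against the *whole* input text concurrently, e.g. one
// subset per thread block on the GPU or per thread on the Xeon Phi.
```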


Cluster Computing | 2017

Parameter based tuning model for optimizing performance on GPU

Nhat-Phuong Tran; Myungho Lee; Jaeyoung Choi

Graphics processing units (GPUs) are becoming increasingly popular for high performance computing applications. Although GPUs provide high peak performance, exploiting their full performance potential for application programs remains a challenging task for the programmer. When launching a parallel kernel of an application on the GPU, the programmer needs to carefully select the number of blocks (grid size) and the number of threads per block (block size). These values determine the degree of SIMD parallelism and multithreading, and greatly influence the performance. With a huge range of possible combinations of these values, choosing the right grid size and block size is not straightforward. In this paper, we propose a mathematical model for tuning the grid size and the block size based on GPU architecture parameters. Using our model, we first calculate a small set of candidate grid size and block size values, then search for the optimal values out of the candidates through experiments. Our approach significantly reduces the potential search space compared with the exhaustive search approaches of previous research, and thus can be practically applied to real applications.
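A hedged sketch of this kind of parameter-based candidate generation: from a few architecture parameters (SM count, maximum threads and blocks per SM, warp size, register file size) one can compute an occupancy estimate for each block size, keep only the block/grid sizes that reach a target occupancy, and then time just those candidates. The formulas, names, and thresholds below form a simplified occupancy model, not the paper's exact model.

```cuda
// Simplified candidate generation for (block size, grid size) from GPU
// architecture parameters (illustrative model, assumed names/thresholds).
#include <vector>

struct GpuParams {
    int num_sms;              // streaming multiprocessors
    int max_threads_per_sm;   // e.g. 2048 on Kepler
    int max_blocks_per_sm;    // e.g. 16 on Kepler
    int warp_size;            // 32
    int regs_per_sm;          // e.g. 65536
};

struct Candidate { int block_size; int grid_size; double occupancy; };

std::vector<Candidate> candidates(const GpuParams &g, int regs_per_thread,
                                  long long total_threads, double min_occupancy)
{
    std::vector<Candidate> out;
    for (int bs = g.warp_size; bs <= 1024; bs += g.warp_size) {
        // Blocks an SM can host, limited by thread count, block slots, and registers.
        int by_threads = g.max_threads_per_sm / bs;
        int by_regs    = g.regs_per_sm / (regs_per_thread * bs);
        int blocks     = by_threads;
        if (g.max_blocks_per_sm < blocks) blocks = g.max_blocks_per_sm;
        if (by_regs < blocks)             blocks = by_regs;
        if (blocks == 0) continue;

        double occ = (double)(blocks * bs) / g.max_threads_per_sm;
        if (occ < min_occupancy) continue;          // prune low-occupancy configurations

        int grid = (int)((total_threads + bs - 1) / bs);
        out.push_back({bs, grid, occ});
    }
    return out;   // a small set to benchmark, instead of an exhaustive search
}
```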

Collaboration


Dive into Nhat-Phuong Tran's collaborations.

Top Co-Authors

Dong Hoon Choi (Korea Institute of Science and Technology Information)

Dukyun Nam (Korea Institute of Science and Technology Information)

Jik-Soo Kim (Korea Institute of Science and Technology Information)

Soonwook Hwang (Korea Institute of Science and Technology Information)