Publication


Featured research published by Phung Huynh.


Very Large Data Bases | 2015

Improving main memory hash joins on Intel Xeon Phi processors: an experimental approach

Saurabh Jha; Beixin Julie He; Mian Lu; Huynh Phung Huynh

Modern processor technologies have driven new designs and implementations in main-memory hash joins. Recently, Intel Many Integrated Core (MIC) co-processors (commonly known as Xeon Phi) embrace emerging x86 single-chip many-core techniques. Compared with contemporary multi-core CPUs, Xeon Phi has quite different architectural features: wider SIMD instructions, many cores and hardware contexts, as well as lower-frequency in-order cores. In this paper, we experimentally revisit the state-of-the-art hash join algorithms on Xeon Phi co-processors. In particular, we study two camps of hash join algorithms: hardware-conscious ones that advocate careful tailoring of the join algorithms to underlying hardware architectures, and hardware-oblivious ones that omit such careful tailoring. For each camp, we study the impact of architectural features and software optimizations on Xeon Phi in comparison with results on multi-core CPUs. Our experiments show two major findings on Xeon Phi, which are quantitatively different from those on multi-core CPUs. First, architectural features and software optimizations behave quite differently on Xeon Phi than on the CPU, which calls for new optimization and tuning on Xeon Phi. Second, hardware-oblivious algorithms can outperform hardware-conscious ones over a wide parameter window. These two findings further shed light on the design and implementation of query processing on new-generation single-chip many-core technologies.
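To make the two camps concrete, here is a minimal sketch of the hardware-oblivious (no-partitioning) approach: one shared hash table built over the smaller relation and probed by many threads, relying on hardware multithreading to hide cache misses. The data layout, names, and use of `std::unordered_multimap` are illustrative assumptions, not the paper's tuned implementation.

```cpp
// Hardware-oblivious ("no-partitioning") hash join sketch: a single
// shared table, with only the probe input partitioned across threads.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <thread>
#include <unordered_map>
#include <vector>

struct Tuple { uint64_t key; uint64_t payload; };

// Join R (build side) with S (probe side); returns the match count.
uint64_t hash_join(const std::vector<Tuple>& R, const std::vector<Tuple>& S,
                   unsigned num_threads) {
    // Build phase: one shared table; this camp relies on hardware
    // threads to hide the resulting cache and TLB misses.
    std::unordered_multimap<uint64_t, uint64_t> table;
    table.reserve(R.size());
    for (const Tuple& t : R) table.emplace(t.key, t.payload);

    // Probe phase: split only the probe input across threads.
    std::vector<uint64_t> counts(num_threads, 0);
    std::vector<std::thread> workers;
    size_t chunk = (S.size() + num_threads - 1) / num_threads;
    for (unsigned w = 0; w < num_threads; ++w) {
        workers.emplace_back([&, w] {
            size_t begin = w * chunk;
            size_t end = std::min(S.size(), begin + chunk);
            for (size_t i = begin; i < end; ++i) {
                auto range = table.equal_range(S[i].key);
                for (auto it = range.first; it != range.second; ++it)
                    ++counts[w];
            }
        });
    }
    for (auto& t : workers) t.join();

    uint64_t total = 0;
    for (uint64_t c : counts) total += c;
    return total;
}

int main() {
    std::vector<Tuple> R{{1, 10}, {2, 20}, {3, 30}};
    std::vector<Tuple> S{{2, 200}, {3, 300}, {3, 301}, {4, 400}};
    std::cout << hash_join(R, S, 2) << " matches\n";  // prints: 3 matches
}
```

A hardware-conscious variant would first radix-partition both relations into cache-sized chunks before building per-partition tables; the finding above is that on Xeon Phi the extra tailoring often does not pay off.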


IEEE Transactions on Parallel and Distributed Systems | 2015

Efficient GPU Spatial-Temporal Multitasking

Yun Liang; Huynh Phung Huynh; Kyle Rupnow; Rick Siow Mong Goh; Deming Chen

Heterogeneous computing nodes are now pervasive throughout computing, and GPUs have emerged as a leading computing device for application acceleration. GPUs have tremendous computing potential for data-parallel applications, and the emergence of GPUs has led to a proliferation of GPU-accelerated applications. This proliferation has also led to systems in which many applications are competing for access to GPU resources, and efficient utilization of the GPU resources is critical to system performance. Prior techniques of temporal multitasking can be employed with GPU resources as well, but not all GPU kernels make full use of the GPU resources. There is, therefore, an unmet need for spatial multitasking in GPUs. Resources used inefficiently by one kernel can instead be assigned to another kernel that can use them more effectively. In this paper we propose a software-hardware solution for efficient spatial-temporal multitasking and a software-based emulation framework for our system. We pair an efficient heuristic in software with hardware leaky-bucket based thread-block interleaving to implement spatial-temporal multitasking. We demonstrate our techniques on various GPU architectures using nine representative benchmarks from the CUDA SDK. Our experiments on the Fermi GTX480 demonstrate performance improvements of up to 46% (average 26%) over sequential GPU task execution and 37% (average 18%) over default concurrent multitasking. Compared with the state-of-the-art Kepler K20 using Hyper-Q technology, our technique achieves up to 40% (average 17%) performance improvement over default concurrent multitasking.
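As a rough illustration of the leaky-bucket idea, the sketch below interleaves thread-block dispatch between two kernels according to resource shares a heuristic has assigned them: each kernel's bucket fills at its share rate, and dispatching a block drains one token. Everything here (the rates, the dispatch loop, running on the host) is an illustrative assumption, not the paper's hardware mechanism.

```cpp
// Leaky-bucket style thread-block interleaving between two kernels.
#include <iostream>
#include <string>
#include <vector>

struct KernelStream {
    std::string name;
    int blocks_left;   // thread blocks still to dispatch
    double rate;       // tokens added per dispatch slot (its SM share)
    double tokens = 0; // current bucket fill level
};

int main() {
    // Heuristic output: kernel A gets 75% of slots, kernel B gets 25%.
    std::vector<KernelStream> ks = {{"A", 12, 0.75}, {"B", 4, 0.25}};

    while (ks[0].blocks_left + ks[1].blocks_left > 0) {
        for (auto& k : ks) k.tokens += k.rate;   // buckets fill...
        // ...and the fullest bucket with work remaining dispatches a block.
        KernelStream* pick = nullptr;
        for (auto& k : ks)
            if (k.blocks_left > 0 && (!pick || k.tokens > pick->tokens))
                pick = &k;
        pick->tokens -= 1.0;                     // dispatch drains one token
        --pick->blocks_left;
        std::cout << pick->name;
    }
    std::cout << "\n";  // prints AABAAABAAABAAABA: B blocks are spread out
}
```

The point of the interleaving is spatial sharing: B's blocks arrive steadily rather than after all of A, so both kernels co-reside on the GPU throughout.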


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2012

Scalable framework for mapping streaming applications onto multi-GPU systems

Huynh Phung Huynh; Andrei Hagiescu; Weng-Fai Wong; Rick Siow Mong Goh

Graphics processing units leverage a large array of parallel processing cores to boost the performance of a specific streaming computation pattern frequently found in graphics applications. Unfortunately, while many other general-purpose applications do exhibit the required streaming behavior, they also possess unfavorable data layouts and poor computation-to-communication ratios that penalize any straightforward execution on the GPU. In this paper we describe an efficient and scalable code generation framework that can map general-purpose streaming applications onto a multi-GPU system. This framework spans the entire core and memory hierarchy exposed by the multi-GPU system. Several key features in our framework ensure the scalability required by complex streaming applications. First, we propose an efficient stream graph partitioning algorithm that partitions the complex application to achieve the best performance under a given shared memory constraint. Next, the resulting partitions are mapped to multiple GPUs using an efficient architecture-driven strategy. The mapping balances the workload while considering the communication overhead. Finally, a highly effective pipelined execution is employed for the execution of the partitions on the multi-GPU system. The framework has been implemented as a back-end of the StreamIt programming language compiler. Our comprehensive experiments show its scalability and significant performance speedup compared with a previous state-of-the-art solution.
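The sketch below illustrates the flavor of the first stage: cutting a stream graph into partitions that each fit a shared-memory budget. For simplicity it uses a linear pipeline and a greedy first-fit cut rule, which is a stand-in for the paper's partitioning algorithm rather than a reproduction of it; all names and sizes are invented.

```cpp
// Partition a streaming pipeline under a per-partition memory budget.
#include <iostream>
#include <vector>

struct Filter { const char* name; int state_bytes; };

int main() {
    const int kSharedMemBudget = 48 * 1024;  // e.g., 48 KB per partition
    std::vector<Filter> pipeline = {
        {"source", 4096}, {"fir", 20000}, {"fft", 30000},
        {"scale", 8000},  {"sink", 4096}};

    // Greedy cut: start a new partition whenever adding the next filter
    // would exceed the shared-memory budget.
    std::vector<std::vector<Filter>> partitions(1);
    int used = 0;
    for (const Filter& f : pipeline) {
        if (used + f.state_bytes > kSharedMemBudget) {
            partitions.emplace_back();
            used = 0;
        }
        partitions.back().push_back(f);
        used += f.state_bytes;
    }

    // Each partition would then be mapped to a GPU and run as a pipeline
    // stage; adjacent partitions communicate through device/host buffers.
    for (size_t p = 0; p < partitions.size(); ++p) {
        std::cout << "partition " << p << ":";
        for (const Filter& f : partitions[p]) std::cout << ' ' << f.name;
        std::cout << '\n';
    }
}
```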


International Conference on Big Data | 2013

Optimizing the MapReduce framework on Intel Xeon Phi coprocessor

Mian Lu; Lei Zhang; Huynh Phung Huynh; Zhongliang Ong; Yun Liang; Bingsheng He; Rick Siow Mong Goh; Richard Huynh

MapReduce has become one of the most popular frameworks for building big-data applications. It was originally designed for distributed computing, and has since been extended to various hardware architectures, e.g., multi-core CPUs, GPUs and FPGAs. In this work, we develop the first MapReduce framework on the recently released Intel Xeon Phi coprocessor. We utilize advanced features of the Xeon Phi to achieve high performance. In order to take advantage of the SIMD vector processing units, we propose a vectorization-friendly technique to assist auto-vectorization, and we develop SIMD hash computation algorithms. Furthermore, we utilize MIMD hyper-threading to pipeline the map and reduce phases to improve resource utilization. We also replace multiple local arrays with low-cost atomic operations on a global array for some applications, which can improve thread scalability and data locality. We conduct comprehensive experiments to compare our optimized MapReduce framework with a state-of-the-art multi-core based MapReduce framework (Phoenix++). Evaluating six real-world applications, the experimental results show that our optimized framework is 1.2X to 38X faster than Phoenix++ for various applications on the Xeon Phi.
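One of the listed optimizations, trading per-thread local arrays for low-cost atomic updates on a single global array, can be shown in miniature with a histogram-style map/reduce. The workload, thread count, and names below are our own placeholders, not the framework's code.

```cpp
// One global array of atomics instead of num_threads private copies:
// better data locality and no final merge step across local arrays.
#include <atomic>
#include <cstdint>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    const int kBuckets = 8;
    std::vector<uint32_t> input(1 << 16);
    for (size_t i = 0; i < input.size(); ++i) input[i] = i % 251;

    std::vector<std::atomic<uint64_t>> global(kBuckets);
    for (auto& g : global) g.store(0);

    const unsigned T = 4;
    std::vector<std::thread> workers;
    size_t chunk = input.size() / T;  // input size divides evenly here
    for (unsigned t = 0; t < T; ++t) {
        workers.emplace_back([&, t] {
            for (size_t i = t * chunk; i < (t + 1) * chunk; ++i) {
                int key = input[i] % kBuckets;                        // "map"
                global[key].fetch_add(1, std::memory_order_relaxed);  // "reduce"
            }
        });
    }
    for (auto& w : workers) w.join();

    for (int k = 0; k < kBuckets; ++k)
        std::cout << "bucket " << k << ": " << global[k] << '\n';
}
```

With hundreds of hardware threads on the Phi, the memory saved by dropping per-thread copies matters more than the atomic-update contention for workloads like this.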


International Parallel and Distributed Processing Symposium | 2011

Automated Architecture-Aware Mapping of Streaming Applications Onto GPUs

Andrei Hagiescu; Huynh Phung Huynh; Weng-Fai Wong; Rick Siow Mong Goh

Graphics Processing Units (GPUs) are made up of many streaming multiprocessors, each consisting of processing cores that interleave the execution of a large number of threads. Groups of threads, called warps and wavefronts in NVIDIA and AMD literature respectively, are selected by the hardware scheduler and executed in lockstep on the available cores. If threads in such a group access the slow off-chip global memory, the entire group has to be stalled, and another group is scheduled instead. The utilization of a given multiprocessor will remain high if there is a sufficient number of alternative thread groups to select from. Many parallel general-purpose applications have been efficiently mapped to GPUs. Unfortunately, many stream processing applications exhibit unfavorable data movement patterns and low computation-to-communication ratios that may lead to poor performance. In this paper, we describe an automated compilation flow that maps most stream processing applications onto GPUs by taking into consideration two important architectural features of NVIDIA GPUs, namely interleaved execution and the small amount of shared memory available in each streaming multiprocessor. In particular, we show that by using a small number of compute threads such that the memory footprint is reduced, we can achieve high utilization of the GPU cores. Our scheme goes against the conventional wisdom of GPU programming, which is to use a large number of homogeneous threads. Instead, it uses a mix of compute and memory access threads, together with a carefully crafted schedule that exploits parallelism in the streaming application while maximizing the effectiveness of the unique memory hierarchy. We have implemented our scheme in the compiler of the StreamIt programming language, and our results show a significant speedup compared to the state-of-the-art solutions.
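The compute/memory-access split can be pictured on the host as classic double buffering: a "memory" role stages the next tile into a small buffer (standing in for the multiprocessor's shared memory) while a "compute" role works on the current one. Tile sizes and names are assumptions; the actual scheme emits a carefully scheduled mix of GPU threads rather than host threads.

```cpp
// Double buffering with separate memory-access and compute roles.
#include <algorithm>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    const size_t kTile = 1024;
    std::vector<float> global(16 * kTile, 1.0f);  // slow "global memory"
    std::vector<float> tile[2] = {std::vector<float>(kTile),
                                  std::vector<float>(kTile)};
    double sum = 0.0;

    // Prefetch tile 0, then overlap: fetch tile i+1 while computing tile i.
    std::copy(global.begin(), global.begin() + kTile, tile[0].begin());
    for (size_t i = 0; i < 16; ++i) {
        std::thread fetcher;  // the "memory access" role
        if (i + 1 < 16)
            fetcher = std::thread([&, i] {
                std::copy(global.begin() + (i + 1) * kTile,
                          global.begin() + (i + 2) * kTile,
                          tile[(i + 1) % 2].begin());
            });
        for (float v : tile[i % 2]) sum += v;    // the "compute" role
        if (fetcher.joinable()) fetcher.join();  // barrier before buffer reuse
    }
    std::cout << sum << '\n';  // prints 16384
}
```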


Compilers, Architecture, and Synthesis for Embedded Systems | 2007

An efficient framework for dynamic reconfiguration of instruction-set customization

Huynh Phung Huynh; Joon Edward Sim; Tulika Mitra

We present an efficient framework for dynamic reconfiguration of application-specific instruction-set customization. A key component of this framework is an iterative algorithm for temporal and spatial partitioning of the loop kernels. Our algorithm maximizes performance gain of an application while taking into consideration the dynamic reconfiguration cost. It selects the appropriate custom instruction-sets for the loops and maps them into appropriate configurations. We model the temporal partitioning problem as a k-way graph partitioning problem. A dynamic programming based solution is used for the spatial partitioning. Comprehensive experimental results indicate that our iterative algorithm is highly scalable while producing optimal or near-optimal (99% of the optimal) performance gain.
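The spatial-partitioning step has the flavor of a 0/1 knapsack solved by dynamic programming: choose a subset of candidate custom instructions whose total area fits the reconfigurable fabric while maximizing performance gain. The sketch below shows that core recurrence with invented area and gain numbers; the paper's actual formulation is richer and is coupled with the k-way temporal partitioning.

```cpp
// 0/1 knapsack DP: best performance gain under an area budget.
#include <algorithm>
#include <iostream>
#include <vector>

struct CustomInsn { int area; int gain; };  // area units, cycles saved

int main() {
    const int kAreaBudget = 10;
    std::vector<CustomInsn> cand = {{3, 40}, {4, 50}, {5, 70}, {2, 15}};

    // dp[a] = best gain achievable using at most 'a' area units.
    std::vector<int> dp(kAreaBudget + 1, 0);
    for (const CustomInsn& c : cand)
        for (int a = kAreaBudget; a >= c.area; --a)
            dp[a] = std::max(dp[a], dp[a - c.area] + c.gain);

    std::cout << "best gain within area budget: " << dp[kAreaBudget] << '\n';
    // prints 125 (instructions with area 3, 5, and 2)
}
```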


Symposium on Code Generation and Optimization | 2015

Improving GPGPU energy-efficiency through concurrent kernel execution and DVFS

Qing Jiao; Mian Lu; Huynh Phung Huynh; Tulika Mitra

Current generation GPUs can accelerate high-performance, compute-intensive applications by exploiting massive thread-level parallelism. The high performance, however, comes at the cost of increased power consumption. Recently, commercial GPGPU architectures have introduced support for concurrent kernel execution to better utilize the computational/memory resources and thereby improve overall throughput. In this paper, we argue and experimentally validate the benefits of concurrent kernels towards energy-efficient execution. We design power-performance models to carefully select the appropriate kernel combinations to be executed concurrently, the relative contributions of the kernels to the thread mix, along with the frequency choices for the cores and the memory, to achieve a high performance-per-watt metric. Our experimental evaluation shows that concurrent kernel execution in combination with DVFS can improve energy-efficiency by up to 34.5% compared to the most energy-efficient sequential execution.
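The selection problem the power-performance models feed can be pictured as a small search over frequency settings that maximizes performance per watt. The analytic forms and constants below are invented placeholders; the paper derives its models empirically and also chooses the concurrent kernel mix, which this toy omits.

```cpp
// Pick the core/memory frequency pair maximizing performance per watt.
#include <iostream>
#include <vector>

int main() {
    std::vector<double> core_mhz = {500, 700, 900};
    std::vector<double> mem_mhz  = {1500, 1800};

    double best_eff = 0, best_c = 0, best_m = 0;
    for (double c : core_mhz) {
        for (double m : mem_mhz) {
            // Placeholder models: throughput limited by both frequency
            // domains; power grows superlinearly with core frequency.
            double perf  = 1.0 / (0.6 / c + 0.4 / m);     // arbitrary units
            double power = 30 + 1e-4 * c * c + 0.01 * m;  // watts
            double eff   = perf / power;                  // perf per watt
            if (eff > best_eff) { best_eff = eff; best_c = c; best_m = m; }
        }
    }
    std::cout << "best setting: core " << best_c << " MHz, mem "
              << best_m << " MHz\n";
    // With these constants the lowest pair wins: peak frequency is not
    // the energy-optimal operating point.
}
```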


International Conference on Parallel Processing | 2013

Hierarchical parallel algorithm for modularity-based community detection using GPUs

Chun Yew Cheong; Huynh Phung Huynh; David Lo; Rick Siow Mong Goh

This paper describes the design of a hierarchical parallel algorithm for accelerating community detection, which involves partitioning a network into communities of densely connected nodes. The algorithm is based on the Louvain method developed at the Université catholique de Louvain, which uses modularity to measure community quality and has been successfully applied to many different types of networks. The proposed hierarchical parallel algorithm targets three levels of parallelism in the Louvain method and has been implemented on single-GPU and multi-GPU architectures. Benchmarking results on several large web-based networks and popular social networks show that on top of offering speedups of up to 5x, the single-GPU version is able to find better quality communities. On average, the multi-GPU version provides an additional 2x speedup over the single-GPU version but with a 3% degradation in community quality.
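The quantity the parallel threads evaluate over and over is the modularity gain of moving a node into a neighboring community. One common simplified form of that gain, with toy numbers, is shown below; symbols follow the standard Louvain formulation, and the graph values are invented.

```cpp
// Modularity gain dQ for moving node i into community C (one common
// simplified form):  dQ = k_i_in / m - (Sigma_tot * k_i) / (2 m^2)
#include <iostream>

int main() {
    double m         = 50.0;  // total edge weight in the graph
    double k_i       = 6.0;   // weighted degree of node i
    double k_i_in    = 4.0;   // weight of links from i into community C
    double sigma_tot = 20.0;  // total degree of nodes already in C

    double dQ = k_i_in / m - (sigma_tot * k_i) / (2 * m * m);
    std::cout << "modularity gain dQ = " << dQ << '\n';  // prints 0.056
}
```

Each node evaluates this gain for every neighboring community and moves to the best one; the GPU algorithm parallelizes these evaluations across nodes, edges, and communities.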


IEEE Transactions on Parallel and Distributed Systems | 2015

MrPhi: An Optimized MapReduce Framework on Intel Xeon Phi Coprocessors

Mian Lu; Yun Liang; Huynh Phung Huynh; Zhongliang Ong; Bingsheng He; Rick Siow Mong Goh

In this work, we develop MrPhi, an optimized MapReduce framework on a heterogeneous computing platform, particularly one equipped with multiple Intel Xeon Phi coprocessors. To the best of our knowledge, this is the first work to optimize the MapReduce framework on the Xeon Phi. We first focus on employing advanced features of the Xeon Phi to achieve high performance on a single coprocessor. We propose a vectorization-friendly technique and SIMD hash computation algorithms to utilize the SIMD vectors. Then we pipeline the map and reduce phases to improve resource utilization. Furthermore, we eliminate multiple local arrays and instead use low-cost atomic operations on a global array to improve thread scalability. For a given application, our framework is able to automatically detect suitable techniques to apply. Moreover, we extend our framework to a heterogeneous platform to utilize all hardware resources effectively. We adopt non-blocking data transfer to hide the communication overhead. We also adopt aligned memory transfer in order to fully utilize the PCIe bandwidth between the host and the coprocessor. We conduct comprehensive experiments to benchmark the Xeon Phi and compare our optimized MapReduce framework with a state-of-the-art multi-core based MapReduce framework (Phoenix++). Evaluating six real-world applications, the experimental results show that our optimized framework is 1.2X to 38X faster than Phoenix++ for various applications on a single Xeon Phi. Additionally, four applications achieve linear scalability on a platform equipped with up to four Xeon Phi coprocessors.
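The two host-side transfer optimizations named above can be sketched generically: allocate aligned buffers (64 bytes matches the Phi's cache-line and vector width) and overlap a non-blocking transfer with host work. The transfer is mocked with a memcpy; a real port would use the coprocessor's offload API, and all names here are our own.

```cpp
// Aligned buffers plus a non-blocking, overlapped "transfer".
#include <cstdlib>
#include <cstring>
#include <future>
#include <iostream>

int main() {
    const size_t kBytes = 1 << 20;  // multiple of the 64-byte alignment

    float* staging = static_cast<float*>(std::aligned_alloc(64, kBytes));
    float* device  = static_cast<float*>(std::aligned_alloc(64, kBytes));
    for (size_t i = 0; i < kBytes / sizeof(float); ++i) staging[i] = 1.0f;

    // Non-blocking transfer: kick it off, keep doing host-side work,
    // synchronize only when the data is actually needed.
    auto xfer = std::async(std::launch::async, [&] {
        std::memcpy(device, staging, kBytes);  // stand-in for PCIe DMA
    });
    double host_work = 0;
    for (int i = 0; i < 1000; ++i) host_work += i;  // overlapped host work
    xfer.wait();

    std::cout << "host work " << host_work << ", first element "
              << device[0] << '\n';
    std::free(staging);
    std::free(device);
}
```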


International Conference/Workshop on Embedded Computer Systems: Architectures, Modeling and Simulation | 2009

Runtime Adaptive Extensible Embedded Processors -- A Survey

Huynh Phung Huynh; Tulika Mitra

Current-generation embedded applications demand that the computation engine offer performance similar to custom hardware circuits while preserving the flexibility of software solutions. Customizable and extensible embedded processors, where the processor core can be enhanced with application-specific instructions, provide a potential solution to these conflicting requirements of performance and flexibility. However, due to the limited area available for implementing custom instructions in the datapath of the processor core, we may not be able to exploit all custom instruction enhancements of an application. Moreover, a static extensible processor is fundamentally at odds with highly dynamic applications whose custom instruction requirements vary substantially at runtime. In this context, a runtime adaptive extensible processor that can quickly morph its custom instructions and the corresponding custom functional units at runtime, depending on workload characteristics, is a promising solution. In this article, we provide a detailed survey of contemporary architectures that offer such dynamic instruction-set support and discuss compiler and/or runtime techniques to exploit such architectures.

Collaboration


Dive into Phung Huynh's collaborations.

Top Co-Authors

Tulika Mitra (National University of Singapore)
Andrei Hagiescu (National University of Singapore)
Weng-Fai Wong (National University of Singapore)
Bingsheng He (National University of Singapore)
Abhishek Ray (Nanyang Technological University)