Publication


Featured research published by Rick Siow Mong Goh.


IEEE Transactions on Parallel and Distributed Systems | 2015

Efficient GPU Spatial-Temporal Multitasking

Yun Liang; Huynh Phung Huynh; Kyle Rupnow; Rick Siow Mong Goh; Deming Chen

Heterogeneous computing nodes are now pervasive throughout computing, and GPUs have emerged as a leading computing device for application acceleration. GPUs have tremendous computing potential for data-parallel applications, and the emergence of GPUs has led to a proliferation of GPU-accelerated applications. This proliferation has also led to systems in which many applications compete for access to GPU resources, so efficient utilization of those resources is critical to system performance. Prior techniques of temporal multitasking can be employed with GPU resources as well, but not all GPU kernels make full use of the GPU resources. There is, therefore, an unmet need for spatial multitasking in GPUs: resources used inefficiently by one kernel can instead be assigned to another kernel that can use them more effectively. In this paper we propose a software-hardware solution for efficient spatial-temporal multitasking and a software-based emulation framework for our system. We pair an efficient heuristic in software with hardware leaky-bucket based thread-block interleaving to implement spatial-temporal multitasking. We demonstrate our techniques on various GPU architectures using nine representative benchmarks from the CUDA SDK. Our experiments on the Fermi GTX480 demonstrate performance improvements of up to 46% (average 26%) over sequential GPU task execution and 37% (average 18%) over default concurrent multitasking. Compared with the state-of-the-art Kepler K20 using Hyper-Q technology, our technique achieves up to 40% (average 17%) performance improvement over default concurrent multitasking.
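
To make the leaky-bucket interleaving concrete, here is a minimal host-side sketch in Python. It illustrates the general policy, not the paper's hardware mechanism: each kernel accrues credit in proportion to its assigned share of SM resources, and the dispatcher issues the next thread block from the kernel with the most credit. The Kernel class, the share values, and the dispatch loop are all assumptions made for the sketch.

```python
class Kernel:
    def __init__(self, name, num_blocks, share):
        self.name = name             # kernel identifier
        self.remaining = num_blocks  # thread blocks left to issue
        self.share = share           # fraction of SM resources assigned
        self.credit = 0.0            # leaky-bucket fill level

def interleave(kernels):
    """Return the order in which thread blocks are dispatched."""
    order = []
    while any(k.remaining for k in kernels):
        # Each scheduling step, every unfinished kernel earns credit
        # proportional to its resource share.
        for k in kernels:
            if k.remaining:
                k.credit += k.share
        # Issue one block from the kernel with the most accumulated credit;
        # the bucket "leaks" one unit per block issued.
        k = max((k for k in kernels if k.remaining), key=lambda c: c.credit)
        k.credit -= 1.0
        k.remaining -= 1
        order.append(k.name)
    return order

# Example: kernel A is assigned 75% of the SMs, kernel B 25%.
print(interleave([Kernel("A", 6, 0.75), Kernel("B", 2, 0.25)]))
```

Running the example interleaves roughly three blocks of A per block of B, the kind of proportional spatial sharing the scheduler is meant to enforce.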


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2012

Scalable framework for mapping streaming applications onto multi-GPU systems

Huynh Phung Huynh; Andrei Hagiescu; Weng-Fai Wong; Rick Siow Mong Goh

Graphics processing units leverage a large array of parallel processing cores to boost the performance of a specific streaming computation pattern frequently found in graphics applications. Unfortunately, while many other general-purpose applications do exhibit the required streaming behavior, they also possess unfavorable data layouts and poor computation-to-communication ratios that penalize any straightforward execution on the GPU. In this paper we describe an efficient and scalable code generation framework that maps general-purpose streaming applications onto a multi-GPU system. This framework spans the entire core and memory hierarchy exposed by the multi-GPU system. Several key features in our framework ensure the scalability required by complex streaming applications. First, we propose an efficient stream graph partitioning algorithm that partitions the complex application to achieve the best performance under a given shared memory constraint. Next, the resulting partitions are mapped to multiple GPUs using an efficient architecture-driven strategy. The mapping balances the workload while considering the communication overhead. Finally, a highly effective pipeline execution is employed for the execution of the partitions on the multi-GPU system. The framework has been implemented as a back-end of the StreamIt programming language compiler. Our comprehensive experiments show its scalability and significant performance speedup compared with a previous state-of-the-art solution.
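
A minimal sketch of the shared-memory-constrained partitioning step, assuming a purely linear pipeline and a greedy packing policy rather than the paper's partitioning algorithm; the filter names and state sizes are invented for illustration.

```python
def partition_pipeline(filters, smem_budget):
    """filters: list of (name, bytes_of_state), in pipeline order.
    Greedily packs consecutive filters into partitions whose combined
    state fits within the per-SM shared-memory budget."""
    partitions, current, used = [], [], 0
    for name, state_bytes in filters:
        if current and used + state_bytes > smem_budget:
            partitions.append(current)  # close the full partition
            current, used = [], 0
        current.append(name)
        used += state_bytes
    if current:
        partitions.append(current)
    return partitions

# Hypothetical filter pipeline with per-filter state sizes in bytes.
pipeline = [("src", 4096), ("fir", 16384), ("fft", 24576),
            ("mag", 8192), ("sink", 4096)]
print(partition_pipeline(pipeline, smem_budget=48 * 1024))
# [['src', 'fir', 'fft'], ['mag', 'sink']]
```

The real problem is harder: stream graphs are DAGs with splits and joins, and partition quality must balance computation against inter-partition communication, which is why the paper formulates it as an optimization rather than a single greedy pass.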


International Conference on Big Data | 2013

Optimizing the MapReduce framework on Intel Xeon Phi coprocessor

Mian Lu; Lei Zhang; Huynh Phung Huynh; Zhongliang Ong; Yun Liang; Bingsheng He; Rick Siow Mong Goh; Richard Huynh

MapReduce has become one of the most popular frameworks for building big-data applications. It was originally designed for distributed computing, and has since been extended to various hardware architectures, e.g., multi-core CPUs, GPUs and FPGAs. In this work, we develop the first MapReduce framework on the recently released Intel Xeon Phi coprocessor. We utilize advanced features of the Xeon Phi to achieve high performance. In order to take advantage of the SIMD vector processing units, we propose a vectorization-friendly technique to assist auto-vectorization, and develop SIMD hash computation algorithms. Furthermore, we utilize MIMD hyper-threading to pipeline the map and reduce phases to improve resource utilization. For some applications, we also replace multiple local arrays with low-cost atomic operations on a global array, which can improve thread scalability and data locality. We conduct comprehensive experiments to compare our optimized MapReduce framework with a state-of-the-art multi-core based MapReduce framework (Phoenix++). Evaluating six real-world applications, the experimental results show that our optimized framework is 1.2X to 38X faster than Phoenix++ for various applications on the Xeon Phi.
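
As a loose analogue of the SIMD hash computation, the NumPy sketch below hashes a whole batch of keys with element-wise, branch-free operations so all vector lanes stay busy; the FNV-style constants and the single-round hash are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

FNV_PRIME = np.uint32(16777619)
FNV_OFFSET = np.uint32(2166136261)

def simd_style_hash(keys, num_buckets):
    """Hash a batch of uint32 keys into bucket ids with vector ops only."""
    h = np.full(keys.shape, FNV_OFFSET, dtype=np.uint32)
    h = (h ^ keys) * FNV_PRIME           # one FNV-1a round, all lanes at once
    return h % np.uint32(num_buckets)    # bucket id per key, still vectorized

keys = np.arange(16, dtype=np.uint32)
print(simd_style_hash(keys, num_buckets=8))
```

Hashing a batch at a time removes the per-key branching that otherwise defeats auto-vectorization, which is the spirit of the vectorization-friendly technique.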


International Parallel and Distributed Processing Symposium | 2011

Automated Architecture-Aware Mapping of Streaming Applications Onto GPUs

Andrei Hagiescu; Huynh Phung Huynh; Weng-Fai Wong; Rick Siow Mong Goh

Graphic Processing Units (GPUs) are made up of many streaming multiprocessors, each consisting of processing cores that interleave the execution of a large number of threads. Groups of threads - called warps and wavefronts in nVidia and AMD literature, respectively - are selected by the hardware scheduler and executed in lockstep on the available cores. If threads in such a group access the slow off-chip global memory, the entire group has to be stalled, and another group is scheduled instead. The utilization of a given multiprocessor will remain high if there is a sufficient number of alternative thread groups to select from. Many parallel general-purpose applications have been efficiently mapped to GPUs. Unfortunately, many stream processing applications exhibit unfavorable data movement patterns and low computation-to-communication ratios that may lead to poor performance. In this paper, we describe an automated compilation flow that maps most stream processing applications onto GPUs by taking into consideration two important architectural features of nVidia GPUs, namely interleaved execution and the small amount of shared memory available in each streaming multiprocessor. In particular, we show that by using a small number of compute threads such that the memory footprint is reduced, we can achieve high utilization of the GPU cores. Our scheme goes against the conventional wisdom of GPU programming, which is to use a large number of homogeneous threads. Instead, it uses a mix of compute and memory access threads, together with a carefully crafted schedule that exploits parallelism in the streaming application while maximizing the effectiveness of the unique memory hierarchy. We have implemented our scheme in the compiler of the StreamIt programming language, and our results show a significant speedup compared to the state-of-the-art solutions.
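
The compute/memory-thread split can be imitated on the host with plain Python threads: a bounded buffer stands in for the small shared memory, a "memory" thread streams tiles into it, and a "compute" thread drains it. Everything below is a stand-in for illustration; the actual schedule in the paper is produced by the compiler and runs on the GPU.

```python
import threading
import queue

def memory_thread(tiles, buf):
    """Streams tiles toward the compute thread; blocks when 'shared memory'
    (the bounded queue) is full, as memory-access threads must wait."""
    for tile in tiles:
        buf.put(tile)
    buf.put(None)  # sentinel: no more tiles

def compute_thread(buf, results):
    """Consumes tiles as they arrive; the per-tile sum stands in for the
    real work done by compute threads."""
    while (tile := buf.get()) is not None:
        results.append(sum(tile))

tiles = [list(range(i, i + 4)) for i in range(0, 16, 4)]
buf = queue.Queue(maxsize=2)   # only two tiles "fit on chip" at once
results = []
t_mem = threading.Thread(target=memory_thread, args=(tiles, buf))
t_cmp = threading.Thread(target=compute_thread, args=(buf, results))
t_mem.start(); t_cmp.start(); t_mem.join(); t_cmp.join()
print(results)                 # [6, 22, 38, 54]
```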


IEEE International Conference on High Performance Computing, Data, and Analytics | 2013

Accelerating sparse matrix-vector multiplication on GPUs using bit-representation-optimized schemes

Wai Teng Tang; Wen Jun Tan; Rajarshi Ray; Yi Wen Wong; Weiguang Chen; Shyh-hao Kuo; Rick Siow Mong Goh; Stephen John Turner; Weng-Fai Wong

The sparse matrix-vector (SpMV) multiplication routine is an important building block used in many iterative algorithms for solving scientific and engineering problems. One of the main challenges of SpMV is its memory-boundedness. Although compression has been proposed previously to improve SpMV performance on CPUs, its use has not been demonstrated on the GPU because of the serial nature of many compression and decompression schemes. In this paper, we introduce a family of bit-representation-optimized (BRO) compression schemes for representing sparse matrices on GPUs. The proposed schemes, BRO-ELL, BRO-COO, and BRO-HYB, perform compression on index data and help to speed up SpMV on GPUs through reduction of memory traffic. Furthermore, we formulate a BRO-aware matrix reordering scheme as a data clustering problem and use it to increase compression ratios. With the proposed schemes, experiments show that average speedups of 1.5× compared to ELLPACK and HYB can be achieved for SpMV on GPUs.
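
The core bit-representation idea can be sketched as delta-encoding of sorted column indices followed by narrow bit-width storage; the exact packing formats of BRO-ELL, BRO-COO, and BRO-HYB differ, so treat this as an illustration of why index data compresses well.

```python
import numpy as np

def bro_style_compress(col_indices):
    """Delta-encode one row's sorted column indices; returns the deltas and
    the bit width needed to store them (vs. 32 bits for raw indices)."""
    cols = np.asarray(col_indices, dtype=np.uint32)
    deltas = np.diff(cols, prepend=cols[:1])
    deltas[0] = cols[0]                      # first entry stays absolute
    width = max(1, int(deltas.max()).bit_length())
    return deltas, width

def bro_style_decompress(deltas):
    """A prefix sum restores the original indices (GPU-friendly)."""
    return np.cumsum(deltas, dtype=np.uint32)

cols = [3, 5, 6, 40, 41]
deltas, width = bro_style_compress(cols)
print(deltas, width)                         # [ 3  2  1 34  1] 6
assert list(bro_style_decompress(deltas)) == cols
```

Because decompression is just a prefix sum, it maps onto the GPU's parallel scan primitives rather than the serial decoders that held back earlier CPU compression schemes.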


Symposium on Code Generation and Optimization | 2015

Optimizing and auto-tuning scale-free sparse matrix-vector multiplication on Intel Xeon Phi

Wai Teng Tang; Ruizhe Zhao; Mian Lu; Yun Liang; Huynh Phung Huynh; Xibai Li; Rick Siow Mong Goh

Recently, the Intel Xeon Phi coprocessor has received increasing attention in high performance computing due to its simple programming model and highly parallel architecture. In this paper, we implement sparse matrix-vector multiplication (SpMV) for scale-free matrices on the Xeon Phi architecture and optimize its performance. Scale-free sparse matrices are widely used in various application domains, such as in the study of social networks, gene networks and web graphs. We propose a novel SpMV format called vectorized hybrid COO+CSR (VHCC). Our SpMV implementation employs 2D jagged partitioning, tiling and vectorized prefix sum computations to improve hardware resource utilization, and thus overall performance. As the achieved performance depends on the number of vertical panels, we also develop a performance tuning method to guide its selection. Experimental results demonstrate that our SpMV implementation achieves an average 3× speedup over Intel MKL for a wide range of scale-free matrices.
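
The segmented-sum style of SpMV that vectorized formats like VHCC build on can be sketched with NumPy: form all nonzero products in one vector pass, then reduce each row's segment. This shows the general technique on row-sorted COO data, not the VHCC kernel or its 2D jagged partitioning.

```python
import numpy as np

def spmv_coo_segmented(rows, cols, vals, x, num_rows):
    """y = A @ x for a matrix in row-sorted COO form."""
    prod = vals * x[cols]                    # one vectorized multiply pass
    # Segment boundaries: positions where a new row's nonzeros begin.
    starts = np.flatnonzero(np.r_[True, rows[1:] != rows[:-1]])
    sums = np.add.reduceat(prod, starts)     # vectorized segmented sum
    y = np.zeros(num_rows)
    y[rows[starts]] = sums                   # scatter; empty rows stay zero
    return y

# 3x3 example: [[10, 0, 2], [0, 3, 0], [0, 0, 7]] times [1, 1, 1]
rows = np.array([0, 0, 1, 2]); cols = np.array([0, 2, 1, 2])
vals = np.array([10.0, 2.0, 3.0, 7.0]); x = np.ones(3)
print(spmv_coo_segmented(rows, cols, vals, x, 3))   # [12.  3.  7.]
```

Expressing the row reductions as prefix/segmented sums is what makes the kernel amenable to the Phi's wide SIMD units even when row lengths are wildly skewed, as they are in scale-free matrices.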


Distributed Simulation and Real-Time Applications | 2012

QoS-Aware Revenue-Cost Optimization for Latency-Sensitive Services in IaaS Clouds

Ta Nguyen Binh Duong; Xiaorong Li; Rick Siow Mong Goh; Xueyan Tang; Wentong Cai

Recently, application service providers have been employing Infrastructure-as-a-Service (IaaS) clouds such as Amazon EC2 to scale their computing resources on demand to adapt to dynamic workloads. Existing research has focused mainly on cloud resource scaling for batch-processing, non-latency-sensitive applications. In this paper, we consider the problem of revenue-cost optimization for cloud-based application service providers with stringent QoS requirements, e.g., online gaming services. We propose an integrated approach which combines resource provisioning algorithms and request scheduling disciplines. The main goal is to maximize the service provider's revenue by satisfying pre-defined QoS requirements while, at the same time, minimizing cloud resource cost. We have implemented the proposed resource provisioning algorithms and scheduling disciplines in a cloud scaling framework developed in our previous work. Extensive experiments have been conducted with a fully functional implementation and realistic workloads modeled after real traces of popular online game servers. The results demonstrate the effectiveness of our proposed approach.
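
A toy version of the revenue-cost trade-off, assuming a linear capacity model in which each VM serves a fixed request rate within the latency QoS; the numbers and the profit function are illustrative assumptions, not the paper's provisioning algorithms.

```python
def best_vm_count(arrival_rate, cap, revenue_per_req, cost_per_vm, max_vms=50):
    """Pick the VM count that maximizes revenue minus cost, where only
    requests within provisioned capacity meet the QoS and earn revenue."""
    def profit(n):
        served = min(arrival_rate, n * cap)   # QoS-compliant throughput
        return served * revenue_per_req - n * cost_per_vm
    return max(range(1, max_vms + 1), key=profit)

# 900 req/s arriving; each VM handles 100 req/s within the latency target.
print(best_vm_count(arrival_rate=900, cap=100,
                    revenue_per_req=0.002, cost_per_vm=0.1))   # 9
```

Under-provisioning loses revenue to QoS violations while over-provisioning pays for idle VMs; that tension is what the integrated provisioning-plus-scheduling approach manages under real, time-varying workloads.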


International Conference on Parallel and Distributed Systems | 2009

A Tabu Search for the Heterogeneous DAG Scheduling Problem

Yi Wen Wong; Rick Siow Mong Goh; Shyh-hao Kuo; Malcolm Yoke Hean Low

Scheduling parallel applications on heterogeneous processors/architectures with different computational speeds is a difficult problem. Here, a tabu search metaheuristic is developed to improve the schedule generated by list scheduling. Three neighbourhood variants are proposed and examined, including a novel neighbourhood that takes the shape of the task graph into account. The effectiveness is evaluated on a set of modified random benchmark graphs, including task graphs of real-world applications. Factors affecting algorithm performance are also examined. We found that the proposed variants were able to reduce the schedule length produced by HEFT by an average of up to 30%, and by an average of up to 20% for the standard graphs. The results also show that using information about the shape of the task graph is a viable strategy.
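
The tabu search mechanics can be outlined as follows, using a swap neighbourhood over task orderings and a simplified makespan that ignores precedence constraints; the paper's three neighbourhoods and its DAG-aware evaluation are not reproduced here.

```python
import random

def makespan(order, work, speeds):
    """Greedy list scheduling of (here, independent) tasks onto processors
    with different speeds; returns the finish time of the last task."""
    finish = [0.0] * len(speeds)
    for t in order:
        p = min(range(len(speeds)), key=lambda q: finish[q] + work[t] / speeds[q])
        finish[p] += work[t] / speeds[p]
    return max(finish)

def tabu_search(work, speeds, iters=200, tenure=7, seed=0):
    rng = random.Random(seed)
    order = list(range(len(work)))
    best, best_cost = order[:], makespan(order, work, speeds)
    tabu = {}                                    # move -> iteration it expires
    moves = [(i, j) for i in range(len(order)) for j in range(i + 1, len(order))]
    for it in range(iters):
        candidates = []
        for i, j in rng.sample(moves, min(30, len(moves))):
            order[i], order[j] = order[j], order[i]
            cost = makespan(order, work, speeds)
            order[i], order[j] = order[j], order[i]
            # Tabu moves are allowed only if they beat the best (aspiration).
            if tabu.get((i, j), -1) < it or cost < best_cost:
                candidates.append((cost, (i, j)))
        cost, (i, j) = min(candidates)
        order[i], order[j] = order[j], order[i]  # apply the best allowed swap
        tabu[(i, j)] = it + tenure               # forbid undoing it for a while
        if cost < best_cost:
            best, best_cost = order[:], cost
    return best, best_cost

work = [4, 7, 2, 9, 5, 3]        # task sizes
speeds = [1.0, 1.5]              # heterogeneous processor speeds
print(tabu_search(work, speeds))
```

The tabu list keeps the search from immediately undoing recent swaps, letting it escape the local optima where plain hill climbing on a HEFT schedule would stall.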


International Conference on Parallel Processing | 2013

Hierarchical parallel algorithm for modularity-based community detection using GPUs

Chun Yew Cheong; Huynh Phung Huynh; David Lo; Rick Siow Mong Goh

This paper describes the design of a hierarchical parallel algorithm for accelerating community detection, which involves partitioning a network into communities of densely connected nodes. The algorithm is based on the Louvain method developed at the Université Catholique de Louvain, which uses modularity to measure community quality and has been successfully applied to many different types of networks. The proposed hierarchical parallel algorithm targets three levels of parallelism in the Louvain method, and it has been implemented on single-GPU and multi-GPU architectures. Benchmarking results on several large web-based networks and popular social networks show that, on top of offering speedups of up to 5x, the single-GPU version is able to find better-quality communities. On average, the multi-GPU version provides an additional 2x speedup over the single-GPU version, but with a 3% degradation in community quality.
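
The per-node step that the hierarchical algorithm parallelizes corresponds to the Louvain local-move phase. The sketch below uses the standard modularity-gain formula for moving one node into a neighbouring community; the GPU mapping and the other levels of parallelism are not shown.

```python
from collections import defaultdict

def best_move(node, adj, degree, community, sigma_tot, m):
    """adj: {node: {neighbour: edge_weight}}; sigma_tot: total degree per
    community; m: total edge weight of the graph. Returns the community
    giving the largest modularity gain for `node` (standard Louvain step)."""
    k_i = degree[node]
    k_in = defaultdict(float)      # node's link weight into each community
    for nbr, w in adj[node].items():
        if nbr != node:
            k_in[community[nbr]] += w
    best_c, best_gain = community[node], 0.0
    for c, kin in k_in.items():
        if c == community[node]:
            continue
        gain = kin / m - sigma_tot[c] * k_i / (2.0 * m * m)
        if gain > best_gain:
            best_c, best_gain = c, gain
    return best_c, best_gain

# Tiny triangle: node 0 is alone; nodes 1 and 2 already form community "A".
adj = {0: {1: 1.0, 2: 1.0}, 1: {0: 1.0, 2: 1.0}, 2: {0: 1.0, 1: 1.0}}
degree = {0: 2.0, 1: 2.0, 2: 2.0}
community = {0: 0, 1: "A", 2: "A"}
sigma_tot = {0: 2.0, "A": 4.0}
print(best_move(0, adj, degree, community, sigma_tot, m=3.0))  # ('A', 0.222...)
```

In a parallel implementation, gains like this are evaluated for many nodes concurrently, which is one of the sources of the reported GPU speedups.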


IEEE Transactions on Parallel and Distributed Systems | 2015

MrPhi: An Optimized MapReduce Framework on Intel Xeon Phi Coprocessors

Mian Lu; Yun Liang; Huynh Phung Huynh; Zhongliang Ong; Bingsheng He; Rick Siow Mong Goh

In this work, we develop MrPhi, an optimized MapReduce framework on a heterogeneous computing platform equipped with multiple Intel Xeon Phi coprocessors. To the best of our knowledge, this is the first work to optimize the MapReduce framework on the Xeon Phi. We first focus on employing advanced features of the Xeon Phi to achieve high performance on a single coprocessor. We propose a vectorization-friendly technique and SIMD hash computation algorithms to utilize the SIMD vectors. Then we pipeline the map and reduce phases to improve resource utilization. Furthermore, we replace multiple local arrays with low-cost atomic operations on the global array to improve thread scalability. For a given application, our framework is able to automatically detect suitable techniques to apply. Moreover, we extend our framework to a heterogeneous platform to utilize all hardware resources effectively. We adopt non-blocking data transfer to hide the communication overhead, and aligned memory transfer to fully utilize the PCIe bandwidth between the host and the coprocessor. We conduct comprehensive experiments to compare our optimized MapReduce framework with a state-of-the-art multi-core based MapReduce framework (Phoenix++). Evaluating six real-world applications, the experimental results show that our optimized framework is 1.2× to 38× faster than Phoenix++ for various applications on a single Xeon Phi. Additionally, four of the applications achieve linear scalability on a platform equipped with up to four Xeon Phi coprocessors.
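
The non-blocking transfer can be pictured as a double-buffered pipeline: while one chunk is being processed, the next is already in flight, so transfer time hides behind compute time. The transfer and process functions below are stand-ins with artificial delays, not the framework's API.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def transfer(chunk):      # stand-in for a non-blocking host-to-Phi copy
    time.sleep(0.01)
    return chunk

def process(chunk):       # stand-in for the map/reduce work on the card
    time.sleep(0.02)
    return sum(chunk)

def pipelined(chunks):
    results = []
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(transfer, chunks[0])   # prefetch the first chunk
        for nxt in chunks[1:]:
            ready = pending.result()               # wait for in-flight copy
            pending = io.submit(transfer, nxt)     # start the next copy...
            results.append(process(ready))         # ...and compute meanwhile
        results.append(process(pending.result()))
    return results

chunks = [list(range(i, i + 4)) for i in range(0, 16, 4)]
print(pipelined(chunks))  # [6, 22, 38, 54]
```

With 0.01 s transfers overlapped against 0.02 s of compute, total time approaches the compute-bound limit, mirroring how the framework keeps PCIe copies from serializing with the coprocessor's work.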

Collaboration


Dive into Rick Siow Mong Goh's collaborations.

Top Co-Authors

Weng-Fai Wong

National University of Singapore

Stephen John Turner

Nanyang Technological University
