
Publications


Featured research published by Tien-Hsiung Weng.


Cluster Computing | 2015

Scaling up MapReduce-based Big Data Processing on Multi-GPU systems

Hai Jiang; Yi Chen; Zhi Qiao; Tien-Hsiung Weng; Kuan-Ching Li

MapReduce is a popular data-parallel processing model that has accompanied recent advances in computing technology and has been widely exploited for large-scale data analysis. The high demand for MapReduce has stimulated the investigation of MapReduce implementations on different architectural models and computing paradigms, such as multi-core clusters, Clouds, Cubieboards, and GPUs. In particular, current GPU-based MapReduce approaches mainly focus on single-GPU algorithms and cannot handle large data sets, due to the limited capacity of GPU memory. Building on the previous multi-GPU MapReduce version MGMR, this paper proposes an upgraded version, MGMR++, to eliminate the GPU memory limitation, and a pipelined version, PMGMR, to handle the Big Data challenge through both CPU memory and hard disks. MGMR++ extends MGMR with flexible C++ templates and CPU memory utilization, while PMGMR fine-tunes performance through the latest GPU features, such as streams and Hyper-Q, as well as hard disk utilization. Compared to MGMR (Jiang et al., Cluster Computing 2013), the proposed schemes achieve about a 2.5-fold performance improvement, increase system scalability, and allow programmers to write straightforward MapReduce code for Big Data.
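The core idea behind the pipelined scheme, processing input in chunks that fit a fixed memory budget and merging partial results, can be sketched in plain C. This is a minimal illustration, not MGMR or PMGMR code; the names `CHUNK`, `map_square`, `reduce_sum`, and `chunked_mapreduce` are invented for the sketch.

```c
#include <stddef.h>

#define CHUNK 4  /* stand-in for a device-memory budget */

static long map_square(int x) { return (long)x * x; }

static long reduce_sum(long a, long b) { return a + b; }

/* Stream the data chunk by chunk, as a pipelined multi-GPU
 * MapReduce would, keeping only one chunk "resident" at a time
 * and folding each chunk's partial result into an accumulator. */
long chunked_mapreduce(const int *data, size_t n) {
    long acc = 0;
    for (size_t base = 0; base < n; base += CHUNK) {
        size_t end = base + CHUNK < n ? base + CHUNK : n;
        long partial = 0;
        for (size_t i = base; i < end; i++)
            partial = reduce_sum(partial, map_square(data[i]));
        acc = reduce_sum(acc, partial);  /* merge partial results */
    }
    return acc;
}
```

In the real system the per-chunk map and reduce would run on a GPU while the next chunk is staged from CPU memory or disk; the chunking structure is what lets the data set exceed GPU memory.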


IEEE International Conference on High Performance Computing, Data and Analytics | 2004

Towards optimisation of OpenMP codes for synchronisation and data reuse

Tien-Hsiung Weng; Barbara M. Chapman

In this paper, we present a compiler transformation of OpenMP code into an ordered collection of tasks, together with the compile-time as well as runtime mapping of the resulting task graph to threads for data reuse. The ordering of tasks is relaxed where possible so that the code may be executed in a more loosely synchronous fashion. Our current implementation uses a runtime system that permits tasks to begin execution as soon as their predecessors have completed. A comparison of the performance of two example programs in their original OpenMP form and in the code resulting from our translation is encouraging.
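The behaviour described above, a task starting as soon as its predecessors finish rather than waiting at a full barrier, can be sketched with modern OpenMP task dependences. Note that `depend` clauses postdate this 2004 paper and are not the authors' runtime system; this is an illustrative analogue only. Without OpenMP the pragmas are ignored and the code runs serially with the same result.

```c
/* Three tasks: one producer and two consumers. With OpenMP, the
 * consumers may start as soon as `a` is written, and they may run
 * concurrently with each other since neither depends on the other. */
int pipeline(int x) {
    int a = 0, b = 0, c = 0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: a)
        a = x + 1;                       /* producer */
        #pragma omp task depend(in: a) depend(out: b)
        b = a * 2;                       /* runs once a is ready */
        #pragma omp task depend(in: a) depend(out: c)
        c = a + 10;                      /* independent of b */
        #pragma omp taskwait
    }
    return b + c;
}
```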


Parallel and Distributed Computing: Applications and Technologies | 2003

Dragon: an Open64-based interactive program analysis tool for large applications

Barbara M. Chapman; Oscar R. Hernandez; Lei Huang; Tien-Hsiung Weng; Zhenying Liu; Laksono Adhianto; Yi Wen

A program analysis tool can play an important role in helping users understand and improve large application codes. Dragon is a robust interactive program analysis tool based on the Open64 compiler, an open-source C/C++/Fortran 77/90 compiler for Intel Itanium systems. We designed and developed the Dragon analysis tool to support manual optimization and parallelization of large applications by exploiting the powerful analyses of the Open64 compiler. Dragon enables users to visualize and print the essential program structure of their large applications and to obtain information about them. Current features include the call graph, flow graph, and data dependences. Ongoing work extends both Open64 and Dragon with a new call graph construction algorithm and its related interprocedural analysis, global variable definition and usage analysis, and an external interface that other tools such as profilers and debuggers can use to share program analysis information. Future work includes supporting the creation and optimization of shared memory parallel programs written using OpenMP.
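A call graph of the kind Dragon displays is, at bottom, a directed graph of caller-to-callee edges over which interprocedural questions are answered. The following tiny sketch (invented names, not Dragon's data structures) shows one such question, reachability between two functions:

```c
#define MAXF 8  /* small fixed function count for the sketch */

/* Adjacency matrix: edge[f][g] != 0 means function f calls g. */
static int edge[MAXF][MAXF];

void add_call(int caller, int callee) { edge[caller][callee] = 1; }

/* Can f reach g through a chain of calls? A depth-first walk,
 * the kind of query an interactive call-graph browser supports. */
int reach(int f, int g) {
    int seen[MAXF] = {0}, stack[MAXF * MAXF], top = 0;
    stack[top++] = f;
    while (top) {
        int cur = stack[--top];
        if (cur == g) return 1;
        if (seen[cur]) continue;
        seen[cur] = 1;
        for (int h = 0; h < MAXF; h++)
            if (edge[cur][h] && !seen[h]) stack[top++] = h;
    }
    return f == g;
}
```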


International Workshop on OpenMP | 2003

Improving the performance of OpenMP by array privatization

Zhenying Liu; Barbara M. Chapman; Tien-Hsiung Weng; Oscar R. Hernandez

The scalability of an OpenMP program on a ccNUMA system with a large number of processors suffers from remote memory accesses, cache misses, and false sharing. Good data locality is needed to overcome these problems, whereas OpenMP offers limited capabilities to control it on ccNUMA architectures. A so-called SPMD-style OpenMP program can achieve data locality by means of array privatization, and this approach has shown good performance in previous research. Since it is hard to write SPMD OpenMP code, we are building a tool that relieves users of this task by automatically converting OpenMP programs into equivalent SPMD-style OpenMP. We illustrate the translation process by considering how to modify array declarations and parallel loops, and by showing how to handle a variety of OpenMP constructs, including the REDUCTION and ORDERED clauses and synchronization. We are currently implementing these translations in an interactive tool based on the Open64 compiler.
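The essence of the SPMD transformation is that each thread owns a contiguous block of the logical array and works on a private copy, so all accesses during the computation are local. A minimal hand-written illustration (not the tool's output; `N`, `NTHREADS`, and the function names are invented for the sketch, and thread ids are passed explicitly so the code also runs serially):

```c
#include <string.h>

#define N 16
#define NTHREADS 4  /* hypothetical thread count */

/* SPMD-style body: "thread" id owns the block [lo, lo+chunk) and
 * scales it in a privatized local array, writing back once. */
void spmd_scale(const double *shared, double *out, int id, double f) {
    int chunk = N / NTHREADS;
    int lo = id * chunk;
    double priv[N / NTHREADS];               /* privatized block */
    memcpy(priv, shared + lo, sizeof priv);
    for (int i = 0; i < chunk; i++)
        priv[i] *= f;                        /* purely local work */
    memcpy(out + lo, priv, sizeof priv);     /* single write-back */
}

/* Run the whole computation serially and return one result element. */
double spmd_demo(int idx, double f) {
    double in[N], out[N];
    for (int i = 0; i < N; i++) { in[i] = i; out[i] = 0.0; }
    for (int id = 0; id < NTHREADS; id++)
        spmd_scale(in, out, id, f);
    return out[idx];
}
```

On a ccNUMA machine the private block would live in memory local to the owning thread, which is what eliminates the remote accesses and false sharing the abstract describes.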


International Workshop on OpenMP | 2003

Analyses for the translation of OpenMP codes into SPMD style with array privatization

Zhenying Liu; Barbara M. Chapman; Yi Wen; Lei Huang; Tien-Hsiung Weng; Oscar R. Hernandez

A so-called SPMD-style OpenMP program can achieve scalability on ccNUMA systems by means of array privatization, and earlier research has shown good performance under this approach. Since it is hard to write SPMD OpenMP code, we presented a strategy for the automatic translation of many OpenMP constructs into SPMD style in our previous work. In this paper, we first explain how to detect interprocedurally whether an OpenMP program schedules its parallel loops consistently. If the parallel loops are consistently scheduled, we may carry out array privatization according to OpenMP semantics. We give two examples of code patterns that can be handled despite not being consistent, where the translation strategy differs from the straightforward approach that can otherwise be applied.
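As a rough illustration of what "consistently scheduled" means (this is not the paper's interprocedural analysis, just a toy check on loop headers): under static scheduling, two parallel loops assign the same iterations, and hence the same array sections, to each thread exactly when their bounds and stride match, which is the condition that makes privatizing those sections safe.

```c
/* Invented representation of a statically scheduled loop header. */
typedef struct { int lo, hi, step; } loop_t;

/* Two static loops give every thread identical array sections
 * iff they iterate over the same index space the same way. */
int consistently_scheduled(loop_t a, loop_t b) {
    return a.lo == b.lo && a.hi == b.hi && a.step == b.step;
}
```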


International Parallel and Distributed Processing Symposium | 2002

Implementing OpenMP using dataflow execution model for data locality and efficient parallel execution

Tien-Hsiung Weng; Barbara M. Chapman

In this paper, we show the potential benefits of translating OpenMP code to low-level parallel code using a dataflow execution model, instead of targeting it directly to a multi-threaded program. Our goal is to improve data locality as well as reduce synchronization overheads without introducing data distribution directives to OpenMP. We outline an API that enables us to realize this model using SMARTS (Shared Memory Asynchronous Run-Time System), describe the work of the compiler, and discuss the benefits of translating OpenMP to parallel code using the dataflow execution model. We show experimental results based on part of the Parallel Ocean Program (POP) code and a Jacobi kernel running on an SGI Origin 2000.


The Journal of Supercomputing | 2009

Performance-based parallel application toolkit for high-performance clusters

Kuan-Ching Li; Tien-Hsiung Weng

Advances in computer technology, together with the rapid emergence of multicore processors, have made many-core personal computers available and affordable. The availability of networks of workstations and clusters of many-core SMPs has made them an attractive solution for high-performance computing, providing computational power equal or superior to supercomputers or mainframes at an affordable cost using commodity components. Finding ways to extract unused and idle computing power from these resources to improve overall performance, and to fully utilize the underlying new hardware platforms, are major topics in this field of research. In this paper, the design rationale and implementation of an effective toolkit for performance measurement and analysis of parallel applications in cluster environments is introduced; it not only generates timing graph representations of parallel applications but also provides performance data charts of application executions. The goal in developing this toolkit is to let application developers better understand their application's behavior on the computing nodes selected for a particular execution. Additionally, multiple execution results of a given application under development can be combined and overlapped, permitting application developers to perform "what-if" analysis, i.e., to understand more deeply the utilization of allocated computational resources. Experiments using this toolkit have shown its effectiveness in the development and performance tuning of parallel applications, extending its use to the teaching of message-passing and shared-memory parallel programming courses.


international workshop on openmp | 2004

Efficient implementation of OpenMP for clusters with implicit data distribution

Zhenying Liu; Lei Huang; Barbara M. Chapman; Tien-Hsiung Weng

This paper discusses an approach to implement OpenMP on clusters by translating it to Global Arrays (GA). The basic translation strategy from OpenMP to GA is described. GA requires a data distribution; we do not expect the user to supply this; rather, we show how we perform data distribution and work distribution according to OpenMP static loop scheduling. An inspector-executor strategy is employed for irregular applications in order to gather information on accesses to potentially non-local data, group non-local data transfers and overlap communications with local computations. Furthermore, a new directive INVARIANT is proposed to provide information about the dynamic scope of data access patterns. This directive can help us generate efficient codes for irregular applications using the inspector-executor approach. Our experiments show promising results for the corresponding regular and irregular GA codes.
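The mapping from OpenMP static loop scheduling to a block data distribution amounts to a bounds computation per process. A minimal illustration (not the translator's actual code; `static_block` is an invented name), assuming the common one-chunk-per-process static schedule:

```c
/* Process p of nprocs owns iterations [*lo, *hi) of a loop of n
 * iterations under a block (static, one chunk each) schedule; the
 * translation to Global Arrays distributes array data the same way
 * so each process's iterations touch mostly local elements. */
void static_block(int n, int nprocs, int p, int *lo, int *hi) {
    int chunk = (n + nprocs - 1) / nprocs;  /* ceiling division */
    *lo = p * chunk;
    *hi = (*lo + chunk < n) ? *lo + chunk : n;
}
```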


International Conference on Computational Science | 2003

Asynchronous execution of OpenMP code

Tien-Hsiung Weng; Barbara M. Chapman

This paper presents the transformation of OpenMP source code into a macro-task graph, an internal representation of the parallel program as a collection of tasks, which can later be asynchronously scheduled for out-of-order execution and optimized for locality reuse. The transformation is based on array region analysis. We also show the potential benefits of targeting OpenMP code to a macro-task graph instead of directly generating a multi-threaded program. We present experimental results for a Jacobi kernel and part of the POP code written in OpenMP, compiled traditionally versus executed under the macro-dataflow execution model using the SMARTS runtime system on an SGI Origin 2000.
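For concreteness, a Jacobi sweep of the kind used in the experiments looks like this in OpenMP C. This is our own minimal version, not the paper's benchmark code; the array region a task reads (the four neighbours of its block) is exactly what array region analysis would extract to build the macro-task graph's dependence edges.

```c
#define N 8  /* small grid for the sketch */

/* One Jacobi sweep: each interior point becomes the average of its
 * four neighbours, reading u and writing unew (boundaries untouched). */
void jacobi_sweep(double u[N][N], double unew[N][N]) {
    #pragma omp parallel for
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            unew[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] +
                                 u[i][j-1] + u[i][j+1]);
}

/* Helper: run one sweep on a uniform field of value v and return the
 * centre point, which a correct sweep leaves at v. */
double jacobi_uniform_center(double v) {
    double u[N][N], un[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { u[i][j] = v; un[i][j] = 0.0; }
    jacobi_sweep(u, un);
    return un[N/2][N/2];
}
```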


Archive | 2016

GPU Computations on Hadoop Clusters for Massive Data Processing

Wenbo Chen; Shungou Xu; Hai Jiang; Tien-Hsiung Weng; Mario Donato Marino; Yi-Siang Chen; Kuan-Ching Li

Hadoop is a well-designed approach for handling massive amounts of data. Built around the Hadoop File System and MapReduce at its core, it schedules processing by orchestrating the distributed servers, providing redundancy and fault tolerance. In terms of performance, however, Hadoop still falls short of high-performance capacity due to the limited parallelism of CPUs. GPU-accelerated computing involves using a GPU together with a CPU to accelerate applications, moving data processing onto GPU clusters for higher efficiency. However, GPU clusters have low data storage capacity. In this chapter, we exploit a hybrid model of GPU and Hadoop to make the best use of both capabilities, and the design and implementation of applications using Hadoop and CUDA is presented through two interfaces: Hadoop Streaming and Hadoop Pipes. Experimental results on the K-means algorithm are presented and their performance is discussed.
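Hadoop Streaming lets any executable act as a mapper by reading records on stdin and writing tab-separated key/value lines on stdout, which is what makes a CUDA-backed binary pluggable into a Hadoop job. A toy record formatter (illustrative only; the chapter's K-means mapper would offload the per-record work to the GPU instead of this line-length example):

```c
#include <stdio.h>
#include <string.h>

/* Format one Streaming output record "key\tvalue" for an input line,
 * here using the key "len" and the line length as the value. A real
 * mapper would wrap this in a loop: fgets() on stdin, puts() of the
 * record on stdout, one record per input line. */
int map_line(const char *line, char *out, size_t outsz) {
    size_t len = strcspn(line, "\n");      /* drop trailing newline */
    return snprintf(out, outsz, "len\t%zu", len);
}
```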

Collaboration


Dive into Tien-Hsiung Weng's collaborations.

Top Co-Authors

Hai Jiang | Arkansas State University

Lei Huang | University of Houston

Oscar R. Hernandez | Oak Ridge National Laboratory