Wenguang Chen | Researchain

Publication

Featured research published by Wenguang Chen.

International Conference on Parallel Architectures and Compilation Techniques | 2010

MapCG: writing parallel program portable between CPU and GPU

Chuntao Hong; Dehao Chen; Wenguang Chen; Weimin Zheng; Haibo Lin

Graphics Processing Units (GPUs) have recently come to play an important role in the general-purpose computing market. The common approach to programming GPUs today is to write GPU-specific code with low-level GPU APIs such as CUDA. Although this approach can achieve very good performance, it raises serious portability issues: programmers must write a specific version of the code for each potential target architecture, which results in high development and maintenance costs.

ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2010

PHANTOM: predicting performance of parallel applications on large-scale parallel machines using a single node

Jidong Zhai; Wenguang Chen; Weimin Zheng

For designers of large-scale parallel computers, it is greatly desired that performance of parallel applications can be predicted at the design phase. However, this is difficult because the execution time of parallel applications is determined by several factors, including sequential computation time in each process, communication time and their convolution. Despite previous efforts, it remains an open problem to estimate sequential computation time in each process accurately and efficiently for large-scale parallel applications on non-existing target machines. This paper proposes a novel approach to predict the sequential computation time accurately and efficiently. We assume that there is at least one node of the target platform but the whole target system need not be available. We make two main technical contributions. First, we employ deterministic replay techniques to execute any process of a parallel application on a single node at real speed. As a result, we can simply measure the real sequential computation time on a target node for each process one by one. Second, we observe that computation behavior of processes in parallel applications can be clustered into a few groups while processes in each group have similar computation behavior. This observation helps us reduce measurement time significantly because we only need to execute representative parallel processes instead of all of them. We have implemented a performance prediction framework, called PHANTOM, which integrates the above computation-time acquisition approach with a trace-driven network simulator. We validate our approach on several platforms. For ASCI Sweep3D, the error of our approach is less than 5% on 1024 processor cores. Compared to a recent regression-based prediction approach, PHANTOM presents better prediction accuracy across different platforms.
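The second contribution above (measure only representative processes, then reuse the result within each group) can be illustrated with a minimal sketch. This is not PHANTOM's implementation: the per-process "signature" vectors, the greedy grouping rule, the `tolerance` parameter, and the `fake_measure` stand-in for a replayed measurement are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Sequence


def group_processes(signatures: Dict[int, Sequence[float]],
                    tolerance: float) -> List[List[int]]:
    """Greedily group ranks whose signature is within `tolerance`
    (max absolute difference) of a group's first member."""
    groups: List[List[int]] = []
    for rank in sorted(signatures):
        for group in groups:
            rep = signatures[group[0]]
            distance = max(abs(a - b) for a, b in zip(rep, signatures[rank]))
            if distance <= tolerance:
                group.append(rank)
                break
        else:
            groups.append([rank])  # no close group found: start a new one
    return groups


def estimate_times(groups: List[List[int]],
                   measure: Callable[[int], float]) -> Dict[int, float]:
    """Measure only the first rank of each group and reuse that
    time for every other rank in the same group."""
    times: Dict[int, float] = {}
    for group in groups:
        t = measure(group[0])
        for rank in group:
            times[rank] = t
    return times


# Hypothetical signatures (e.g. counts of computation events per rank).
signatures = {0: (100.0, 5.0), 1: (101.0, 5.0), 2: (300.0, 9.0), 3: (299.0, 9.0)}
groups = group_processes(signatures, tolerance=2.0)

measured: List[int] = []  # records which ranks were actually measured


def fake_measure(rank: int) -> float:
    """Stand-in for running one replayed process on a target node."""
    measured.append(rank)
    return {0: 1.5, 2: 4.0}[rank]


times = estimate_times(groups, fake_measure)
```

With four ranks falling into two groups, only two measurements are taken, which is the source of the reduction in measurement time that the abstract describes.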

Concurrency and Computation: Practice and Experience | 2007

OpenUH: an optimizing, portable OpenMP compiler

Chunhua Liao; Oscar R. Hernandez; Barbara M. Chapman; Wenguang Chen; Weimin Zheng

OpenMP has gained wide popularity as an API for parallel programming on shared memory and distributed shared memory platforms. Despite its broad availability, there remains a need for a portable, robust, open source, optimizing OpenMP compiler for C/C++/Fortran 90, especially for teaching and research, for example into its use on new target architectures, such as SMPs with chip multi-threading, as well as learning how to translate for clusters of SMPs. In this paper, we present our efforts to design and implement such an OpenMP compiler on top of Open64, an open source compiler framework, by extending its existing analysis and optimization and adopting a source-to-source translator approach where a native back end is not available. The compilation strategy we have adopted and the corresponding runtime support are described. The OpenMP validation suite is used to determine the correctness of the translation. The compiler's behavior is evaluated using benchmark tests from the EPCC microbenchmarks and the NAS parallel benchmark.

International Conference on Supercomputing | 2006

MPIPP: an automatic profile-guided parallel process placement toolset for SMP clusters and multiclusters

Hu Chen; Wenguang Chen; Jian Huang; Bob Robert; H. Kuhn

SMP clusters and multiclusters are widely used to execute message-passing parallel applications. The way parallel processes are mapped to processors (or cores) can affect application performance significantly because of the non-uniform communication cost in such systems. A tool that maps parallel processes to processors (or cores) automatically is therefore desirable. Although there have been various efforts to address this issue, the existing solutions either require intensive user intervention or cannot handle multiclusters well. In this paper, we propose a profile-guided approach that automatically finds an optimized mapping to minimize the cost of point-to-point communications for arbitrary message-passing applications. The implemented toolset is called MPIPP (MPI Process Placement toolset), and it includes several components: 1) a tool to get the communication profile of MPI applications; 2) a tool to get the network topology of target clusters; 3) an algorithm to find an optimized mapping, which is more effective than existing graph partition algorithms, especially for multiclusters. We evaluated the performance of our tool with the NPB benchmarks and three other applications on several clusters. Experimental results show that the optimized process placement generated by our tools can achieve significant speedup.
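To make the placement objective concrete, the sketch below scores a mapping as traffic volume times core-to-core distance and improves it with a simple pairwise-swap local search. The swap search is a stand-in chosen for brevity, not the MPIPP algorithm, and the communication matrix `comm`, the distance matrix `dist`, and all numbers are hypothetical.

```python
from itertools import combinations
from typing import List, Tuple

Matrix = List[List[int]]


def mapping_cost(comm: Matrix, dist: Matrix, placement: List[int]) -> int:
    """Sum over process pairs of traffic volume times the distance
    between the cores that host the two processes."""
    n = len(placement)
    return sum(comm[i][j] * dist[placement[i]][placement[j]]
               for i in range(n) for j in range(i + 1, n))


def improve_by_swaps(comm: Matrix, dist: Matrix,
                     placement: List[int]) -> Tuple[List[int], int]:
    """Keep swapping the cores of two processes while a swap lowers the cost."""
    best = list(placement)
    best_cost = mapping_cost(comm, dist, best)
    improved = True
    while improved:
        improved = False
        for i, j in combinations(range(len(best)), 2):
            best[i], best[j] = best[j], best[i]
            cost = mapping_cost(comm, dist, best)
            if cost < best_cost:
                best_cost = cost
                improved = True
            else:
                best[i], best[j] = best[j], best[i]  # undo a non-improving swap
    return best, best_cost


# Hypothetical profile: processes 0<->2 and 1<->3 exchange heavy traffic.
comm = [[0, 1, 100, 1],
        [1, 0, 1, 100],
        [100, 1, 0, 1],
        [1, 100, 1, 0]]
# Hypothetical topology: cores 0,1 share a node, cores 2,3 share another.
dist = [[0, 1, 10, 10],
        [1, 0, 10, 10],
        [10, 10, 0, 1],
        [10, 10, 1, 0]]

initial = [0, 1, 2, 3]  # process i runs on core initial[i]
initial_cost = mapping_cost(comm, dist, initial)
placement, cost = improve_by_swaps(comm, dist, initial)
```

In this toy case the search co-locates each heavily communicating pair on one node. A local search like this can stall in a local minimum on larger inputs, which is one reason a dedicated placement algorithm is needed in practice.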

IEEE International Conference on High Performance Computing Data and Analytics | 2011

Cloud versus in-house cluster: evaluating Amazon cluster compute instances for running MPI applications

Yan Zhai; Mingliang Liu; Jidong Zhai; Xiaosong Ma; Wenguang Chen

The emergence of cloud services brings new possibilities for constructing and using HPC platforms. However, while cloud services provide the flexibility and convenience of customized, pay-as-you-go parallel computing, multiple previous studies in the past three years have indicated that cloud-based clusters need a significant performance boost to become a competitive choice, especially for tightly coupled parallel applications. In this work, we examine the feasibility of running HPC applications in clouds. This study distinguishes itself from existing investigations in several ways: 1) We carry out a comprehensive examination of issues relevant to the HPC community, including performance, cost, user experience, and range of user activities. 2) We compare an Amazon EC2-based platform built upon its newly available HPC-oriented virtual machines with typical local cluster and supercomputer options, using benchmarks and applications with scale and problem size unprecedented in previous cloud HPC studies. 3) We perform detailed performance and scalability analysis to locate the chief limiting factors of the state-of-the-art cloud based clusters. 4) We present a case study on the impact of per-application parallel I/O system configuration uniquely enabled by cloud services. Our results reveal that though the scalability of EC2-based virtual clusters still lags behind traditional HPC alternatives, they are rapidly gaining in overall performance and cost-effectiveness, making them feasible candidates for performing tightly coupled scientific computing. In addition, our detailed benchmarking and profiling discloses and analyzes several problems regarding the performance and performance stability on EC2.

ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2009

MPIWiz: subgroup reproducible replay of MPI applications

Ruini Xue; Xuezheng Liu; Ming Wu; Zhengyu Guo; Wenguang Chen; Weimin Zheng; Zheng Zhang; Geoffrey M. Voelker

Message Passing Interface (MPI) is a widely used standard for managing coarse-grained concurrency on distributed computers. Debugging parallel MPI applications, however, has always been a particularly challenging task due to their high degree of concurrent execution and non-deterministic behavior. Deterministic replay is a potentially powerful technique for addressing these challenges, with existing MPI replay tools adopting either data-replay or order-replay approaches. Unfortunately, each approach has its tradeoffs. Data-replay generates substantial log sizes by recording every communication message. Order-replay generates small logs, but requires all processes to be replayed together. We believe that these drawbacks are the primary reasons that inhibit the wide adoption of deterministic replay as the critical enabler of cyclic debugging of MPI applications. This paper describes subgroup reproducible replay (SRR), a hybrid deterministic replay method that provides the benefits of both data-replay and order-replay while balancing their trade-offs. SRR divides all processes into disjoint groups. It records the contents of messages crossing group boundaries as in data-replay, but records just message orderings for communication within a group as in order-replay. In this way, SRR can exploit the communication locality of traffic patterns in MPI applications. During replay, developers can then replay each group individually. SRR reduces recording overhead by not recording intra-group communication, and reduces replay overhead by limiting the size of each replay group. Exposing these tradeoffs gives the user the necessary control for making deterministic replay practical for MPI applications. We have implemented a prototype, MPIWiz, to demonstrate and evaluate SRR. MPIWiz employs a replay framework that allows transparent binary instrumentation of both library and system calls. As a result, MPIWiz replays MPI applications with no source code modification and relinking, and handles non-determinism in both MPI and OS system calls. Our preliminary results show that MPIWiz can reduce recording overhead by over a factor of four relative to data-replay, yet without requiring the entire application to be replayed as in order-replay. Recording increases execution time by 27% while the application can be replayed in just 53% of its base execution time.
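The SRR recording rule (log message contents only when a message crosses a group boundary, otherwise log just the ordering) can be sketched as follows. The `LogEntry` layout, the `group_of` table, and the sample messages are hypothetical and do not reflect MPIWiz's actual log format.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class LogEntry:
    seq: int                  # global order in which the message was observed
    src: int
    dst: int
    payload: Optional[bytes]  # None means only the ordering was recorded


def record(log: List[LogEntry], group_of: Dict[int, int],
           seq: int, src: int, dst: int, payload: bytes) -> None:
    """Store the payload only for messages that cross a group boundary."""
    crosses_boundary = group_of[src] != group_of[dst]
    log.append(LogEntry(seq, src, dst, payload if crosses_boundary else None))


def logged_bytes(log: List[LogEntry]) -> int:
    """Total payload bytes kept in the log."""
    return sum(len(e.payload) for e in log if e.payload is not None)


# Hypothetical grouping: ranks 0,1 form one replay group, ranks 2,3 another.
group_of = {0: 0, 1: 0, 2: 1, 3: 1}
messages = [(0, 1, b"aaaa"), (1, 2, b"bb"), (2, 3, b"cccccc"), (3, 0, b"d")]

log: List[LogEntry] = []
for seq, (src, dst, payload) in enumerate(messages):
    record(log, group_of, seq, src, dst, payload)

srr_bytes = logged_bytes(log)
data_replay_bytes = sum(len(p) for _, _, p in messages)  # pure data-replay logs everything
```

Here only the two cross-group payloads are stored, so the log holds 3 bytes instead of the 13 that pure data-replay would keep. How much is saved in practice depends on how well the grouping matches the application's communication locality.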

Symposium on Code Generation and Optimization | 2010

Taming hardware event samples for FDO compilation

Dehao Chen; Neil Vachharajani; Robert Hundt; Shih-Wei Liao; Vinodha Ramasamy; Paul Yuan; Wenguang Chen; Weimin Zheng

Feedback-directed optimization (FDO) is effective in improving application runtime performance, but has not been widely adopted due to the tedious dual-compilation model, the difficulties in generating representative training data sets, and the high runtime overhead of profile collection. The use of hardware-event sampling to generate estimated edge profiles overcomes these drawbacks. Yet, hardware event samples are typically not precise at the instruction or basic-block granularity. These inaccuracies lead to missed performance when compared to instrumentation-based FDO. In this paper, we use multiple hardware event profiles and supervised learning techniques to generate heuristics for improved precision of basic-block-level sample profiles, and to further improve the smoothing algorithms used to construct edge profiles. We demonstrate that sampling-based FDO can achieve an average of 78% of the performance gains obtained using instrumentation-based exact edge profiles for SPEC2000 benchmarks, matching or beating instrumentation-based FDO in many cases. The overhead of collection is only 0.74% on average, while compiler-based instrumentation incurs 6.8%-53.5% overhead (and 10x overhead on an industrial web search application), and dynamic instrumentation incurs 28.6%-1639.2% overhead.

IEEE Transactions on Parallel and Distributed Systems | 2016

Cloud Performance Modeling with Benchmark Evaluation of Elastic Scaling Strategies

Kai Hwang; Xiaoying Bai; Yue Shi; Muyang Li; Wenguang Chen; Yongwei Wu

In this paper, we present generic cloud performance models for evaluating IaaS, PaaS, SaaS, and mashup or hybrid clouds. We test clouds with real-life benchmark programs and propose some new performance metrics. Our benchmark experiments are conducted mainly on IaaS cloud platforms over scale-out and scale-up workloads. Cloud benchmarking results are analyzed with respect to the efficiency, elasticity, QoS, productivity, and scalability of cloud performance. Five cloud benchmarks were tested on the Amazon EC2 IaaS cloud: YCSB, CloudSuite, HiBench, BenchClouds, and TPC-W. To satisfy production services, the choice of scale-up or scale-out solutions should be made primarily according to the workload patterns and the resource utilization rates required. Scaling out machine instances incurs much lower overhead than that experienced in scale-up experiments. However, scaling up is found to be more cost-effective in sustaining heavier workloads. Cloud productivity is largely attributable to system elasticity, efficiency, QoS, and scalability. We find that auto-scaling is easy to implement but tends to over-provision resources. A lower resource utilization rate may result from auto-scaling, compared with using scale-out or scale-up strategies. We also demonstrate that the proposed cloud performance models are applicable to evaluating PaaS, SaaS, and hybrid clouds as well.

International Conference on Software Engineering | 2011

RACEZ: a lightweight and non-invasive race detection tool for production applications

Tianwei Sheng; Neil Vachharajani; Stephane Eranian; Robert Hundt; Wenguang Chen; Weimin Zheng

Concurrency bugs, particularly data races, are notoriously difficult to debug and are a significant source of unreliability in multithreaded applications. Many tools to catch data races rely on program instrumentation to obtain memory instruction traces. Unfortunately, this instrumentation introduces significant runtime overhead, is extremely invasive, or has a limited domain of applicability, making these tools unsuitable for many production systems. Consequently, these tools are typically used during application testing, where many data races go undetected. This paper proposes RACEZ, a novel race detection mechanism which uses a sampled memory trace collected by the hardware performance monitoring unit rather than invasive instrumentation. The approach introduces only a modest overhead, making it usable in production environments. We validate RACEZ using two open source server applications and the PARSEC benchmarks. Our experiments show that RACEZ catches a set of known bugs with reasonable probability while introducing only 2.8% runtime slowdown on average.

International Conference on Parallel Architectures and Compilation Techniques | 2009

Cache Sharing Management for Performance Fairness in Chip Multiprocessors

Xing Zhou; Wenguang Chen; Weimin Zheng

Resource sharing can cause unfair and unpredictable performance of concurrently executing applications in chip multiprocessors (CMPs). The shared last-level cache is one of the most important shared resources because off-chip request latency may account for a significant part of total execution cycles for data-intensive applications. Instead of enforcing performance fairness directly, prior work addressing the fairness issue of cache sharing mainly focuses on fairness metrics based on cache miss counts or miss rates. However, because cache miss penalty varies, fairness in cache misses cannot guarantee performance fairness. Cache sharing management that directly addresses performance fairness is needed for CMP systems. This paper introduces a model to analyze the performance impact of cache sharing, and proposes a cache sharing management mechanism to provide performance fairness for concurrently executing applications. The proposed mechanism monitors the actual penalty of all cache misses and dynamically estimates the cache misses that would occur with dedicated caches while the applications are actually running with a shared cache. The estimated relative slowdown of each core from the dedicated environment to the shared environment is used to guide cache sharing in order to guarantee performance fairness. Experimental results show that the proposed mechanism always improves the performance fairness metric, and provides throughput no worse than the scenario without any management mechanism.
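The idea of estimating each core's relative slowdown from measured miss penalties can be sketched as below. The formula is an illustrative assumption, not the model from the paper: it treats the cycles lost to sharing as the number of extra misses times the average measured miss penalty. The `CoreSample` fields and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CoreSample:
    cycles_shared: float        # cycles observed while running with the shared cache
    misses_shared: int          # misses observed with the shared cache
    miss_penalty_cycles: float  # total stall cycles measured for those misses
    misses_dedicated_est: int   # estimated misses had the cache been dedicated


def relative_slowdown(s: CoreSample) -> float:
    """Estimated shared-vs-dedicated slowdown for one core.
    Assumes total miss penalty is smaller than total cycles."""
    if s.misses_shared == 0:
        return 1.0
    avg_penalty = s.miss_penalty_cycles / s.misses_shared
    extra_misses = max(s.misses_shared - s.misses_dedicated_est, 0)
    cycles_dedicated_est = s.cycles_shared - extra_misses * avg_penalty
    return s.cycles_shared / cycles_dedicated_est


def most_slowed_core(samples: List[CoreSample]) -> int:
    """Index of the core that a fairness policy would favor next."""
    return max(range(len(samples)), key=lambda i: relative_slowdown(samples[i]))


# Core 0 suffers fewer extra misses than core 1, but each miss costs more.
samples = [
    CoreSample(cycles_shared=1000, misses_shared=10,
               miss_penalty_cycles=400, misses_dedicated_est=5),
    CoreSample(cycles_shared=1000, misses_shared=20,
               miss_penalty_cycles=200, misses_dedicated_est=10),
]
slowdowns = [relative_slowdown(s) for s in samples]
victim = most_slowed_core(samples)
```

In this example core 1 has twice as many extra misses as core 0 yet the smaller estimated slowdown, which mirrors the abstract's point that fairness in miss counts does not imply performance fairness.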
