Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Wei Wang is active.

Publication


Featured research published by Wei Wang.


International Symposium on Performance Analysis of Systems and Software | 2011

Characterizing multi-threaded applications based on shared-resource contention

Tanima Dey; Wei Wang; Jack W. Davidson; Mary Lou Soffa

For higher processing and computing power, chip multiprocessors (CMPs) have become the new mainstream architecture. This shift to CMPs has created many challenges for fully utilizing the power of multiple execution cores. One of these challenges is managing contention for shared resources. Most recent research addresses contention for shared resources by single-threaded applications. However, as CMPs scale up to many cores, application design has shifted toward multi-threaded programming and new parallel models to fully utilize the underlying hardware. Single- and multi-threaded applications contend for shared resources differently. Therefore, to develop approaches that reduce shared-resource contention for emerging multi-threaded applications, it is crucial to understand how their performance is affected by contention for a particular shared resource. In this research, we propose and evaluate a general methodology for characterizing multi-threaded applications by determining the effect of shared-resource contention on performance. To demonstrate the methodology, we characterize the applications in the widely used PARSEC benchmark suite for shared-memory resource contention. The characterization reveals several interesting aspects of the benchmark suite. Three of twelve PARSEC benchmarks exhibit no contention for cache resources. Nine of the benchmarks exhibit contention for the L2 cache. Of these nine, only three exhibit contention between their own threads; most contention is due to competition with a co-runner. Interestingly, contention for the Front Side Bus is a major factor for all but two of the benchmarks and degrades performance by more than 11%.
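As a rough illustration of the co-run methodology the abstract describes, the sketch below measures how much a benchmark slows down when a synthetic contender stresses a shared resource. The script names and the contender are hypothetical placeholders; this is not the paper's tooling.

```python
# Illustrative sketch (not the paper's tooling): estimate sensitivity to a
# shared resource by comparing a benchmark's solo runtime against its runtime
# while a synthetic co-runner stresses that resource.
import subprocess
import time

def timed_run(cmd):
    """Run a command and return its wall-clock runtime in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start

def contention_degradation(benchmark_cmd, contender_cmd):
    """Return the fractional slowdown of the benchmark under co-location."""
    solo = timed_run(benchmark_cmd)
    contender = subprocess.Popen(contender_cmd)      # stress a shared resource
    try:
        corun = timed_run(benchmark_cmd)
    finally:
        contender.terminate()
        contender.wait()
    return (corun - solo) / solo

if __name__ == "__main__":
    # Hypothetical commands: a PARSEC benchmark and an L2-cache-thrashing kernel.
    slowdown = contention_degradation(["./run_benchmark.sh"], ["./cache_thrasher"])
    print(f"Performance degradation under contention: {slowdown:.1%}")
```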


Architectural Support for Programming Languages and Operating Systems | 2013

ReQoS: reactive static/dynamic compilation for QoS in warehouse scale computers

Lingjia Tang; Jason Mars; Wei Wang; Tanima Dey; Mary Lou Soffa

As multicore processors with expanding core counts continue to dominate the server market, the overall utilization of the class of datacenters known as warehouse scale computers (WSCs) depends heavily on colocation of multiple workloads on each server to take advantage of the computational power provided by modern processors. However, many of the applications running in WSCs, such as web search, are user-facing and have quality of service (QoS) requirements. When multiple applications are co-located on a multicore machine, contention for shared memory resources threatens application QoS as severe cross-core performance interference may occur. WSC operators are left with two options: either disregard QoS to maximize WSC utilization, or disallow the co-location of high-priority user-facing applications with other applications, resulting in low machine utilization and millions of dollars wasted. This paper presents ReQoS, a static/dynamic compilation approach that enables low-priority applications to adaptively manipulate their own contentiousness to ensure the QoS of high-priority co-runners. ReQoS is composed of a profile-guided compilation technique that identifies and inserts markers in contentious code regions in low-priority applications, and a lightweight runtime that monitors the QoS of high-priority applications and reactively reduces the pressure low-priority applications place on the memory subsystem when cross-core interference is detected. In this work, we show that ReQoS can accurately diagnose contention and significantly reduce performance interference to ensure application QoS. Applying ReQoS to SPEC2006 and SmashBench workloads on real multicore machines, we are able to improve machine utilization by more than 70% in many cases, and more than 50% on average, while enforcing a 90% QoS threshold. We are also able to improve the energy efficiency of modern multicore machines by 47% on average over a policy of disallowing co-locations.
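The following is a minimal sketch of the reactive control loop the abstract describes, assuming a marker callback in low-priority code and a QoS probe for the high-priority application; the function names, thresholds, and back-off policy are illustrative assumptions, not ReQoS's actual implementation.

```python
# Hedged sketch of a ReQoS-style control loop, not the actual ReQoS runtime:
# monitor a high-priority application's QoS and, when it drops below target,
# throttle low-priority work at compiler-inserted markers.
import time

QOS_TARGET = 0.90          # enforce 90% of the high-priority app's solo performance
throttle_level = 0.0       # fraction of time low-priority code naps per marked region

def read_qos():
    """Placeholder: return high-priority QoS normalized to solo performance."""
    raise NotImplementedError("read from performance counters / latency probes")

def on_marker():
    """Called at compiler-inserted markers in contentious low-priority regions."""
    if throttle_level > 0.0:
        time.sleep(throttle_level * 0.001)   # back off to relieve memory pressure

def monitor_loop(interval_s=0.05):
    global throttle_level
    while True:
        qos = read_qos()
        if qos < QOS_TARGET:
            throttle_level = min(1.0, throttle_level + 0.1)   # interference detected
        else:
            throttle_level = max(0.0, throttle_level - 0.05)  # relax when QoS is safe
        time.sleep(interval_s)
```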


International Symposium on Performance Analysis of Systems and Software | 2012

Performance analysis of thread mappings with a holistic view of the hardware resources

Wei Wang; Tanima Dey; Jason Mars; Lingjia Tang; Jack W. Davidson; Mary Lou Soffa

With the shift to chip multiprocessors, managing shared resources has become a critical issue in realizing their full potential. Previous research has shown that thread mapping is a powerful tool for resource management. However, the difficulty of simultaneously managing multiple hardware resources and the varying nature of workloads have impeded the efficiency of thread mapping algorithms. To overcome these difficulties, the interaction between various microarchitectural resources and thread characteristics must be well understood. This paper presents an in-depth analysis of PARSEC benchmarks running under different thread mappings to investigate how various thread mappings interact with microarchitectural resources, including L1 I/D-caches, I/D TLBs, L2 caches, hardware prefetchers, off-chip memory interconnects, branch predictors, memory disambiguation units, and the cores. For each resource, the analysis provides guidelines for improving its utilization when mapping threads with different characteristics. We also analyze how the relative importance of these resources varies depending on the workload. Our experiments show that when only memory resources are considered, thread mapping improves an application's performance by as much as 14% over the default Linux scheduler. In contrast, when both memory and processor resources are considered, the mapping algorithm achieves performance improvements of as much as 28%. Additionally, we demonstrate that thread mapping should consider L2 caches, prefetchers, and off-chip memory interconnects as one resource, and we present a new metric, the L2-misses-memory-latency-product (L2MP), for evaluating their aggregated performance impact.
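Since the abstract introduces the L2MP metric by name (L2-misses-memory-latency-product), a tiny worked example of how such a product could be computed and used to compare two mappings is sketched below; the counter values and sampling scheme are assumed for illustration.

```python
# Illustrative computation of an L2MP-style metric (L2 misses times average
# memory latency) per measurement interval; the sampling details and numbers
# below are assumptions, not the paper's experimental setup.
def l2mp(l2_misses: int, avg_memory_latency_cycles: float) -> float:
    """L2-misses-memory-latency-product for one measurement interval."""
    return l2_misses * avg_memory_latency_cycles

# Example: compare two thread mappings by their aggregated L2MP (lower is better).
mapping_a = l2mp(l2_misses=4_200_000, avg_memory_latency_cycles=180.0)
mapping_b = l2mp(l2_misses=3_100_000, avg_memory_latency_cycles=210.0)
print("Prefer mapping", "A" if mapping_a < mapping_b else "B")
```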


High-Performance Computer Architecture | 2014

DraMon: Predicting memory bandwidth usage of multi-threaded programs with high accuracy and low overhead

Wei Wang; Tanima Dey; Jack W. Davidson; Mary Lou Soffa

Memory bandwidth severely limits the scalability and performance of today's multi-core systems. Because of this limitation, many studies focused on improving multi-core scalability rely on bandwidth usage predictions to achieve the best results. However, existing bandwidth prediction models have low accuracy, causing these studies to reach inaccurate conclusions or perform sub-optimally. Most of these models make predictions based on the bandwidth usage samples of a few trial runs, overlooking many factors that affect bandwidth usage as well as the complexity of DRAM operations. This paper presents DraMon, a model that predicts bandwidth usage for multi-threaded programs with low overhead. It achieves high accuracy through highly accurate predictions of DRAM contention and DRAM concurrency, as well as by considering a wide range of hardware and software factors that impact bandwidth usage. We implemented two versions of DraMon: DraMon-T, a memory-trace-based model, and DraMon-R, a run-time model that uses hardware performance counters. When evaluated on a real machine with memory-intensive benchmarks, DraMon-T has average accuracies of 99.17% and 94.70% for DRAM contention predictions and bandwidth predictions, respectively. DraMon-R has average accuracies of 98.55% and 93.37% for DRAM contention and bandwidth predictions, respectively, with only 0.50% overhead on average.
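As a toy illustration only (not DraMon's actual model), the snippet below shows how the two quantities the abstract emphasizes, DRAM contention and DRAM concurrency, could feed a simple bandwidth estimate; the formula and numbers are placeholder assumptions.

```python
# Toy illustration, not DraMon's model: combine a concurrency factor and a
# contention probability into a rough sustained-bandwidth estimate.
def predict_bandwidth(peak_gbps: float,
                      concurrency: float,      # avg. fraction of DRAM banks kept busy
                      contention_prob: float   # probability a request waits on a busy bank/bus
                      ) -> float:
    """Estimate sustained bandwidth from concurrency and contention factors."""
    effective = peak_gbps * concurrency * (1.0 - contention_prob)
    return max(0.0, effective)

print(f"{predict_bandwidth(25.6, concurrency=0.7, contention_prob=0.2):.1f} GB/s")
```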


ACM Transactions on Architecture and Code Optimization | 2013

ReSense: Mapping dynamic workloads of colocated multithreaded applications using resource sensitivity

Tanima Dey; Wei Wang; Jack W. Davidson; Mary Lou Soffa

To utilize the full potential of modern chip multiprocessors and obtain scalable performance improvements, it is critical to mitigate resource contention created by multithreaded workloads. In this article, we describe ReSense, the first runtime system that uses application characteristics to dynamically map multithreaded applications from dynamic workloads, that is, workloads where multithreaded applications arrive, execute, and terminate continuously in unpredictable ways. ReSense mitigates contention for shared resources in the memory hierarchy by applying a novel thread-mapping algorithm that dynamically adjusts the mapping of threads from dynamic workloads using a precalculated sensitivity score. The sensitivity score quantifies an application's sensitivity to sharing a particular memory resource and is calculated by an efficient characterization process that involves running the multithreaded application by itself on the target platform. To measure ReSense's effectiveness, sensitivity scores were determined for 21 benchmarks from PARSEC-2.1 and NPB-OMP-3.3 for the shared resources in the memory hierarchy on four different platforms. Using three different-sized dynamic workloads composed of randomly selected two, four, and eight co-running benchmarks with randomly selected start times, ReSense improved the average response time of the three workloads by up to 27.03%, 20.89%, and 29.34%, and throughput by up to 19.97%, 46.56%, and 29.86%, respectively, over the native OS on real hardware. By estimating and comparing ReSense's effectiveness with the optimal thread mapping for two different workloads, we found that the maximum average difference from the experimentally determined optimal performance was 1.49% for average response time and 2.08% for throughput.
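A minimal sketch of a sensitivity-score-style characterization is shown below, assuming the score is the relative slowdown when the target memory resource is shared; the exact formula and the mapping rule are assumptions for illustration, not ReSense's published algorithm.

```python
# Minimal sketch of a sensitivity-score characterization: run the application
# alone, then with the target memory resource shared/stressed, and score the
# relative slowdown. The scoring formula and decision rule are assumed forms.
def sensitivity_score(solo_runtime: float, shared_runtime: float) -> float:
    """Higher score = more sensitive to sharing this memory resource."""
    return (shared_runtime - solo_runtime) / solo_runtime

def pick_mapping(score: float, threshold: float = 0.10) -> str:
    # Hypothetical decision rule: spread sensitive applications across
    # caches/sockets, pack insensitive ones to free resources for others.
    return "spread across LLCs" if score > threshold else "pack onto one LLC"

print(pick_mapping(sensitivity_score(solo_runtime=42.0, shared_runtime=51.0)))
```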


High-Performance Computer Architecture | 2016

Predicting the memory bandwidth and optimal core allocations for multi-threaded applications on large-scale NUMA machines

Wei Wang; Jack W. Davidson; Mary Lou Soffa

Modern NUMA platforms offer large numbers of cores to boost performance through parallelism and multi-threading. However, because performance scalability is limited by available memory bandwidth, the strategy of allocating all cores can result in degraded performance. Consequently, accurately predicting optimal (best-performing) core allocations, and executing applications with these allocations, is crucial for achieving the best performance. Previous research focused on predicting the optimal number of cores. However, in this paper we show that, because of asymmetric NUMA memory configurations and asymmetric application memory behavior, optimal core allocations are not merely optimal numbers of cores. Additionally, previous studies do not adequately consider NUMA memory resources, which further limits their ability to accurately predict optimal core allocations. In this paper, we present a model, NuCore, which predicts both memory bandwidth usage and optimal core allocations. NuCore considers various memory resources and NUMA asymmetry, and employs Integer Programming to achieve high accuracy and low overhead. Experimental results from real NUMA machines show that the core allocations predicted by NuCore provide a 1.27x average speedup over using all cores while allocating only 75.6% of the cores. NuCore also provides 1.18x and 1.21x average speedups over two state-of-the-art techniques. Our results also show that NuCore faithfully models NUMA memory systems and predicts memory bandwidth usage with only 10% average error.
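To make the core-allocation idea concrete, the sketch below brute-forces per-node core counts under a per-node bandwidth cap; NuCore itself uses Integer Programming and far richer memory models, so the scaling model and bandwidth numbers here are placeholder assumptions.

```python
# Illustrative brute-force stand-in for the optimization NuCore performs with
# Integer Programming: choose per-node core counts that maximize a predicted
# speedup without exceeding each NUMA node's memory bandwidth.
from itertools import product

CORES_PER_NODE = [8, 8]          # two NUMA nodes (assumed topology)
BW_LIMIT_GBPS = [20.0, 20.0]     # per-node sustainable bandwidth (assumed)
BW_PER_CORE_GBPS = [3.0, 4.5]    # asymmetric per-core demand (assumed)

def predicted_speedup(alloc):
    # Placeholder scaling model: diminishing returns per extra core.
    return sum(n ** 0.8 for n in alloc)

def best_allocation():
    best, best_score = None, -1.0
    for alloc in product(*(range(n + 1) for n in CORES_PER_NODE)):
        if sum(alloc) == 0:
            continue
        if any(a * BW_PER_CORE_GBPS[i] > BW_LIMIT_GBPS[i] for i, a in enumerate(alloc)):
            continue  # would saturate this node's memory bandwidth
        score = predicted_speedup(alloc)
        if score > best_score:
            best, best_score = alloc, score
    return best

print("Cores per NUMA node:", best_allocation())
```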


Virtual Execution Environments | 2012

REEact: a customizable virtual execution manager for multicore platforms

Wei Wang; Tanima Dey; Ryan W. Moore; Mahmut Aktasoglu; Bruce R. Childers; Jack W. Davidson; Mary Jane Irwin; Mahmut T. Kandemir; Mary Lou Soffa

With the shift to many-core chip multiprocessors (CMPs), a critical issue is how to effectively coordinate and manage the execution of applications and hardware resources to overcome the performance, power consumption, and reliability challenges stemming from the hardware and application variations inherent in this new computing environment. Effective resource and application management on CMPs requires consideration of user-, application-, and hardware-specific requirements and dynamic adaptation of management decisions based on the actual run-time environment. However, designing an algorithm that manages resources and applications and dynamically adapts to the run-time environment is difficult because most resource and application management and monitoring facilities are only available at the operating system level. This paper presents REEact, an infrastructure that provides the capability to specify user-level management policies with dynamic adaptation. REEact is a virtual execution environment that provides a framework and core services to quickly enable the design of custom management policies for dynamically managing resources and applications. To demonstrate the capabilities and usefulness of REEact, this paper describes three case studies, each illustrating the use of REEact to apply a specific dynamic management policy on a real CMP. Through these case studies, we demonstrate that REEact can effectively and efficiently implement policies to dynamically manage resources and adapt application execution.
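One plausible shape of a user-level management policy in a REEact-like framework is sketched below; the class and method names are hypothetical, intended only to illustrate the idea of pluggable, dynamically adapting policies rather than REEact's actual API.

```python
# Hypothetical policy interface for a REEact-like virtual execution manager;
# names and signatures are assumptions for illustration.
from abc import ABC, abstractmethod

class ManagementPolicy(ABC):
    @abstractmethod
    def on_application_start(self, app_id: str, requested_threads: int) -> list[int]:
        """Return the initial set of core IDs to run this application on."""

    @abstractmethod
    def on_monitor_tick(self, app_id: str, counters: dict[str, float]) -> list[int] | None:
        """Inspect runtime counters; return a new core set to migrate to, or None."""

class SpreadWhenContendedPolicy(ManagementPolicy):
    """Example policy: pack threads by default, spread them when contention appears."""

    def __init__(self, total_cores: int):
        self.total_cores = total_cores

    def on_application_start(self, app_id, requested_threads):
        return list(range(requested_threads))              # pack onto the first cores

    def on_monitor_tick(self, app_id, counters):
        if counters.get("llc_miss_rate", 0.0) > 0.05:      # contention detected
            return list(range(0, self.total_cores, 2))     # spread to every other core
        return None
```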


International Conference on Cloud Computing | 2016

Empirical Evaluation of Workload Forecasting Techniques for Predictive Cloud Resource Scaling

In Kee Kim; Wei Wang; Yanjun Qi; Marty Humphrey

Many predictive resource scaling approaches have been proposed to overcome the limitations of the conventional reactive approaches most often used in clouds today. In general, due to the complexity of clouds, these predictive approaches have often been forced to make significant limiting assumptions about either the operating conditions/requirements or the expected workload patterns. As such, it is extremely difficult for cloud users to know which existing workload predictor, if any, will work best for their particular cloud activity, especially when considering highly variable workload patterns, non-trivial billing models, the variety of resources to add or remove, and so on. To address this problem, we conduct a comprehensive evaluation of a variety of workload predictors under real-world cloud configurations. The predictors comprise 21 predictors in four classes: naive, regression, temporal, and non-temporal methods. We simulate a cloud application under four realistic workload patterns, two different cloud billing models, and three different styles of predictive scaling. Our evaluation confirms that no workload predictor is universally best for all workload patterns, and shows that Predictive Scaling-out + Predictive Scaling-in has the best cost efficiency and the lowest job deadline miss rate in cloud resource management, on average providing 30% better cost efficiency and an 80% lower job deadline miss rate than the other styles of predictive scaling.
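To make the predictor comparison concrete, here is a small sketch evaluating two of the simplest predictor classes mentioned (naive and a moving-average style) on a synthetic request trace; the paper's 21 predictors and cloud simulation are far more extensive.

```python
# Sketch of a workload-predictor comparison on a synthetic request trace;
# the trace and error metric are assumptions for illustration only.
def naive_predict(history):
    """Predict the next value as the last observed value."""
    return history[-1]

def moving_average_predict(history, window=3):
    """Predict the next value as the mean of the last `window` observations."""
    recent = history[-window:]
    return sum(recent) / len(recent)

def mean_absolute_error(trace, predictor):
    errors = [abs(predictor(trace[:i]) - trace[i]) for i in range(1, len(trace))]
    return sum(errors) / len(errors)

requests_per_min = [120, 135, 150, 170, 160, 180, 210, 205, 230]   # synthetic workload
for name, p in [("naive", naive_predict), ("moving average", moving_average_predict)]:
    print(f"{name}: MAE = {mean_absolute_error(requests_per_min, p):.1f}")
```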


International Conference on Cloud Computing | 2015

PICS: A Public IaaS Cloud Simulator

In Kee Kim; Wei Wang; Marty Humphrey


International Symposium on Parallel and Distributed Computing | 2018

Orchestra: Guaranteeing Performance SLAs for Cloud Applications by Avoiding Resource Storms

In Kee Kim; Jinho Hwang; Wei Wang; Marty Humphrey

Collaboration


Dive into Wei Wang's collaborations.

Top Co-Authors

Tanima Dey (University of Virginia)

In Kee Kim (University of Virginia)

Jason Mars (University of Michigan)

Yanjun Qi (University of Virginia)