Vincent M. Weaver
Cornell University
Publication
Featured research published by Vincent M. Weaver.
international conference on parallel processing | 2012
Vincent M. Weaver; Matt Johnson; Kiran Kasichayanula; James Ralph; Piotr Luszczek; Daniel Terpstra; Shirley Moore
Energy and power consumption are becoming critical metrics in the design and usage of high performance systems. We have extended the Performance API (PAPI) analysis library to measure and report energy and power values. These values are reported using the existing PAPI API, allowing code previously instrumented for performance counters to also measure power and energy. Higher level tools that build on PAPI will automatically gain support for power and energy readings when used with the newest version of PAPI. We describe in detail the types of energy and power readings available through PAPI. We support external power meters, as well as values provided internally by recent CPUs and GPUs. Measurements are provided directly to the instrumented process, allowing immediate code analysis in real time. We provide examples showing results that can be obtained with our infrastructure.
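As a rough illustration of how energy and power readings relate, the sketch below turns a series of cumulative energy readings into average power per interval. The sample format (time in seconds, cumulative energy in joules) and the function name are assumptions made for this example; they are not PAPI's interface.

```python
def average_power_watts(samples):
    """Return the average power (watts) for each interval between readings.

    `samples` is a hypothetical list of (time_seconds, energy_joules) pairs,
    where energy is a cumulative counter that only increases.
    """
    powers = []
    for (t0, e0), (t1, e1) in zip(samples, samples[1:]):
        if t1 <= t0:
            raise ValueError("timestamps must be strictly increasing")
        # Power is energy consumed divided by elapsed time.
        powers.append((e1 - e0) / (t1 - t0))
    return powers


# Made-up readings: 20 J used in the first second, 30 J in the second.
readings = [(0.0, 0.0), (1.0, 20.0), (2.0, 50.0)]
interval_power = average_power_watts(readings)
```

Working from cumulative energy rather than instantaneous power means a missed sample only coarsens the time resolution; the total energy stays correct.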
ieee international symposium on workload characterization | 2009
Major Bhadauria; Vincent M. Weaver; Sally A. McKee
PARSEC is a reference application suite used in industry and academia to assess new Chip Multiprocessor (CMP) designs. No investigation to date has profiled PARSEC on real hardware to better understand scaling properties and bottlenecks. This understanding is crucial in guiding future CMP designs for these kinds of emerging workloads. We use hardware performance counters, taking a systems-level approach and varying common architectural parameters: number of out-of-order cores, memory hierarchy configurations, number of multiple simultaneous threads, number of memory channels, and processor frequencies. We find these programs to be largely compute-bound, and thus limited by number of cores, micro-architectural resources, and cache-to-cache transfers, rather than by off-chip memory or system bus bandwidth. Half the suite fails to scale linearly with increasing number of threads, and some applications saturate performance at few threads on all platforms tested. Exploiting thread level parallelism delivers greater payoffs than exploiting instruction level parallelism. To reduce power and improve performance, we recommend increasing the number of arithmetic units per core, increasing support for TLP, and reducing support for ILP.
ieee international symposium on workload characterization | 2008
Vincent M. Weaver; Sally A. McKee
When creating architectural tools, it is essential to know whether the generated results make sense. Comparing a tool's outputs against hardware performance counters on an actual machine is a common means of executing a quick sanity check. If the results do not match, this can indicate problems with the tool, unknown interactions with the benchmarks being investigated, or even unexpected behavior of the real hardware. To make future analyses of this type easier, we explore the behavior of the SPEC benchmarks with both dynamic binary instrumentation (DBI) tools and hardware counters. We collect retired instruction performance counter data from the full SPEC CPU 2000 and 2006 benchmark suites on nine different implementations of the x86 architecture. When run with no special preparation, hardware counters have a coefficient of variation of up to 1.07%. After analyzing results in depth, we find that minor changes to the experimental setup reduce observed errors to less than 0.002% for all benchmarks. The fact that subtle changes in how experiments are conducted can largely impact observed results is unexpected, and it is important that researchers using these counters be aware of the issues involved.
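The coefficient of variation quoted above is the standard deviation of repeated measurements divided by their mean. A minimal sketch of that arithmetic follows; the counts are made-up numbers, and the use of the sample (n - 1) standard deviation is an assumption, since the abstract does not say which form was used.

```python
import statistics


def coefficient_of_variation_percent(counts):
    """Return stdev / mean as a percentage for repeated counter readings."""
    mean = statistics.mean(counts)
    if mean == 0:
        raise ValueError("mean is zero; coefficient of variation is undefined")
    # Assumption: sample standard deviation (n - 1 in the denominator).
    return statistics.stdev(counts) / mean * 100.0


# Hypothetical retired-instruction counts from five runs of one benchmark.
runs = [1_000_000_120, 1_000_000_095, 1_000_000_130, 1_000_000_101, 1_000_000_118]
cv = coefficient_of_variation_percent(runs)
```

Because the result is normalized by the mean, it can be compared across benchmarks whose instruction counts differ by orders of magnitude.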
international conference on cloud and green computing | 2012
Jack J. Dongarra; Hatem Ltaief; Piotr Luszczek; Vincent M. Weaver
We propose to study the impact on the energy footprint of two advanced algorithmic strategies in the context of high performance dense linear algebra libraries: (1) mixed precision algorithms with iterative refinement, which run at the peak performance of single precision floating-point arithmetic while achieving double precision accuracy, and (2) a tree reduction technique, which exposes more parallelism when factorizing tall and skinny matrices for solving overdetermined systems of linear equations or calculating the singular value decomposition. Integrated within the PLASMA library using tile algorithms, which will eventually supersede the block algorithms from LAPACK, both strategies further excel in performance in the presence of a dynamic task scheduler while targeting multicore architectures. Energy consumption measurements are reported along with parallel performance numbers on a dual-socket quad-core Intel Xeon as well as a quad-socket quad-core Intel Sandy Bridge chip, both providing component-based energy monitoring at all levels of the system, through the Power Pack framework and the Running Average Power Limit model, respectively.
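The idea behind mixed precision with iterative refinement can be sketched in a few lines: factor and solve in single precision, then compute residuals in double precision and correct the solution. The code below is an illustrative toy, not the PLASMA implementation. It emulates single precision by rounding through `struct`, skips pivoting (so it assumes a diagonally dominant matrix), and all function names are hypothetical.

```python
import struct


def f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def lu_factor_f32(a):
    """LU factorization without pivoting, each operation rounded to single."""
    n = len(a)
    lu = [[f32(v) for v in row] for row in a]
    for k in range(n):
        for i in range(k + 1, n):
            lu[i][k] = f32(lu[i][k] / lu[k][k])
            for j in range(k + 1, n):
                lu[i][j] = f32(lu[i][j] - f32(lu[i][k] * lu[k][j]))
    return lu


def lu_solve_f32(lu, b):
    """Forward and back substitution in emulated single precision."""
    n = len(lu)
    x = [f32(v) for v in b]
    for i in range(n):  # solve L y = b (unit diagonal)
        for j in range(i):
            x[i] = f32(x[i] - f32(lu[i][j] * x[j]))
    for i in reversed(range(n)):  # solve U x = y
        for j in range(i + 1, n):
            x[i] = f32(x[i] - f32(lu[i][j] * x[j]))
        x[i] = f32(x[i] / lu[i][i])
    return x


def residual_f64(a, x, b):
    """Residual b - A x computed in double precision."""
    return [bi - sum(aij * xj for aij, xj in zip(row, x)) for row, bi in zip(a, b)]


def solve_mixed_precision(a, b, iterations=5):
    """Single-precision solve followed by double-precision refinement."""
    lu = lu_factor_f32(a)      # expensive step, done once in single precision
    x = lu_solve_f32(lu, b)
    for _ in range(iterations):
        r = residual_f64(a, x, b)          # cheap step in double precision
        d = lu_solve_f32(lu, r)            # reuse the single-precision factors
        x = [xi + di for xi, di in zip(x, d)]
    return x


# Small diagonally dominant system with known solution [1, -2, 3].
A = [[4.0, 1.0, 2.0], [1.0, 5.0, 1.0], [2.0, 1.0, 6.0]]
b = [8.0, -6.0, 18.0]
x_refined = solve_mixed_precision(A, b)
```

The factorization dominates the cost and runs entirely in single precision; only the residual and the update use double precision, which is why the approach can approach single-precision speed on well-conditioned problems.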
international symposium on performance analysis of systems and software | 2013
Vincent M. Weaver; Daniel Terpstra; Shirley Moore
Ideal hardware performance counters provide exact deterministic results. Real-world performance monitoring unit (PMU) implementations do not always live up to this ideal. Events that should be exact and deterministic (such as retired instructions) show run-to-run variation and overcount on x86_64 machines, even when run in strictly controlled environments. These effects are non-intuitive to casual users and cause difficulties when strict determinism is desirable, such as when implementing deterministic replay or deterministic threading libraries. We investigate eleven different x86_64 CPU implementations and discover the sources of divergence from expected count totals. Of all the counter events investigated, we find only a few that exhibit enough determinism to be used without adjustment in deterministic execution environments. We also briefly investigate ARM, IA64, POWER and SPARC systems and find that on these platforms the counter events have more determinism. We explore various methods of working around the limitations of the x86_64 events, but in many cases this is not possible and would require architectural redesign of the underlying PMU.
international conference on parallel processing | 2011
Piotr Luszczek; Eric Meek; Shirley Moore; Daniel Terpstra; Vincent M. Weaver; Jack J. Dongarra
This paper evaluates the performance of the HPC Challenge benchmarks in several virtual environments, including VMware, KVM and VirtualBox. The HPC Challenge benchmarks consist of a suite of tests that examine the performance of HPC architectures using kernels with memory access patterns more challenging than those of the High Performance LINPACK (HPL) benchmark used in the TOP500 list. The tests include four local (matrix-matrix multiply, STREAM, RandomAccess and FFT) and four global (High Performance Linpack (HPL), parallel matrix transpose (PTRANS), RandomAccess and FFT) kernel benchmarks. The purpose of our experiments is to evaluate the overheads of the different virtual environments and investigate how different aspects of the system are affected by virtualization. We ran the benchmarks on an 8-core system with Core i7 processors using Open MPI. We did runs on the bare hardware and in each of the virtual environments for a range of problem sizes. As expected, the HPL results had some overhead in all the virtual environments, with the overhead becoming less significant with larger problem sizes. The RandomAccess results show drastically different behavior and we attempt to explain it with pertinent experiments. We show the cause of variability of performance results as well as major causes of measurement error.
ieee international conference on high performance computing data and analytics | 2014
Michael F. Cloutier; Chad Paradis; Vincent M. Weaver
A growing number of supercomputers are being built using processors with low-power embedded ancestry, rather than traditional high-performance cores. In order to evaluate this approach we investigate the energy and performance tradeoffs found with ten different 32-bit ARM development boards while running the HPL Linpack and STREAM benchmarks. Based on these results (and other practical concerns) we chose the Raspberry Pi as a basis for a power-aware embedded cluster computing testbed. Each node of the cluster is instrumented with power measurement circuitry so that detailed cluster-wide power measurements can be obtained, enabling power / performance co-design experiments. While our cluster lags recent x86 machines in performance, the power, visualization, and thermal features make it an excellent low-cost platform for education and experimentation.
international symposium on performance analysis of systems and software | 2013
Vincent M. Weaver; Daniel Terpstra; Heike McCraw; Matt Johnson; Kiran Kasichayanula; James Ralph; John S. Nelson; Philip Mucci; Tushar Mohan; Shirley Moore
The PAPI library [1] was originally developed to provide portable access to the hardware performance counters found on a diverse collection of modern microprocessors. Rather than learning and writing to a new performance infrastructure each time code is moved to a new machine, measurement code can be written to the PAPI API which abstracts away the underlying interface. Over time, other system components besides the processor have gained performance interfaces (for example, GPUs and network interfaces). PAPI was redesigned to have a component architecture to allow modular access to these new sources of performance data [2]. In addition to incremental changes in processor support, the recent PAPI 5 release adds support for two emerging concerns in the high-performance landscape: energy consumption and cloud computing. As processor densities climb, the thermal properties and energy usage of high performance systems are becoming increasingly important. We have extended the PAPI interface to simultaneously monitor processor metrics, thermal sensors, and power meters to provide clues for correlating algorithmic activity with thermal response and energy consumption. We have also extended PAPI to provide support for running inside of Virtual Machines (VMs). This ongoing work will enable developers to use PAPI to engage in performance analysis in a virtualized cloud environment.
international conference on parallel processing | 2012
Matt Johnson; Heike McCraw; Shirley Moore; Philip Mucci; John S. Nelson; Daniel Terpstra; Vincent M. Weaver; Tushar Mohan
This paper describes extensions to the PAPI hardware counter library for virtual environments, called PAPI-V. The extensions support timing routines, I/O measurements, and processor counters. The PAPI-V extensions will allow application and tool developers to use a familiar interface to obtain relevant hardware performance monitoring information in virtual environments.
international conference on computer design | 2009
Vincent M. Weaver; Sally A. McKee
Reducing a program's instruction count can improve cache behavior and bandwidth utilization, lower power consumption, and increase overall performance. Nonetheless, code density is an often overlooked feature in studying processor architectures. We hand-optimize an assembly language embedded benchmark for size on 21 different instruction set architectures, finding up to a factor of three difference in code sizes from ISA alone. We find that the architectural features that contribute most heavily to code density are instruction length, number of registers, availability of a zero register, bit-width, hardware divide units, number of instruction operands, and the availability of unaligned loads and stores. We extend our results to investigate operating system, compiler, and system library effects on code density. We find that the executable starting address, executable format, and system call interface all affect program size. While ISA effects are important, the efficiency of the entire system stack must be taken into account when developing a new dense instruction set architecture.