Network


Latest external collaborations at the country level. Dive into details by clicking on the dots.

Hotspot


Dive into the research topics where Allan Porterfield is active.

Publication


Featured research published by Allan Porterfield.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2012

OpenMP task scheduling strategies for multicore NUMA systems

Stephen L. Olivier; Allan Porterfield; Kyle Wheeler; Michael Spiegel; Jan F. Prins

The recent addition of task parallelism to the OpenMP shared memory API allows programmers to express concurrency at a high level of abstraction and places the burden of scheduling parallel execution on the OpenMP run-time system. Efficient scheduling of tasks on modern multi-socket multicore shared memory systems requires careful consideration of an increasingly complex memory hierarchy, including shared caches and non-uniform memory access (NUMA) characteristics. In order to evaluate scheduling strategies, we extended the open source Qthreads threading library to implement different scheduler designs, accepting OpenMP programs through the ROSE compiler. Our comprehensive performance study of diverse OpenMP task-parallel benchmarks compares seven different task-parallel run-time scheduler implementations on an Intel Nehalem multi-socket multicore system: our proposed hierarchical work-stealing scheduler, a per-core work-stealing scheduler, a centralized scheduler, and LIFO and FIFO versions of the Qthreads round-robin scheduler. In addition, we compare our results against the Intel and GNU OpenMP implementations. Our hierarchical scheduling strategy leverages different scheduling methods at different levels of the hierarchy. By allowing one thread to steal work on behalf of all of the threads within a single chip that share a cache, the scheduler limits the number of costly remote steals. For cores on the same chip, a shared LIFO queue allows exploitation of cache locality between sibling tasks as well as between a parent task and its newly created child tasks. In the performance evaluation, our Qthreads hierarchical scheduler is competitive on all benchmarks tested. On five of the seven benchmarks, it demonstrates speedup and absolute performance superior to both the Intel and GNU OpenMP run-time systems. Our run-time also demonstrates similar performance benefits on AMD Magny Cours and SGI Altix systems, enabling several benchmarks to successfully scale to 192 CPUs of an SGI Altix.
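
The chip-level delegation at the heart of this scheduler is easy to see in miniature. The sketch below is a hypothetical simplification in plain C with OpenMP locks, not the Qthreads code: threads on a chip pop from a shared LIFO queue, and a single leader thread per chip performs bulk FIFO steals from a remote chip on behalf of its siblings. All names, sizes, and the termination heuristic are illustrative.

```c
/* Hypothetical miniature of the hierarchical policy described above
 * (not the Qthreads implementation).  Each chip's threads share one
 * LIFO queue; only a per-chip "leader" steals, in bulk and in FIFO
 * order, from remote chips on behalf of all of its siblings. */
#include <omp.h>
#include <stdio.h>

#define CHIPS     2
#define CORES_PER 4
#define QCAP      1024

typedef struct {
    int tasks[QCAP];
    int head, tail;          /* steal at head (FIFO), pop at tail (LIFO) */
    omp_lock_t lock;
} chip_queue;

static chip_queue q[CHIPS];

/* LIFO pop: sibling and parent/child tasks stay cache-warm */
static int pop_local(int chip, int *t)
{
    int ok = 0;
    omp_set_lock(&q[chip].lock);
    if (q[chip].tail > q[chip].head) { *t = q[chip].tasks[--q[chip].tail]; ok = 1; }
    omp_unset_lock(&q[chip].lock);
    return ok;
}

/* One remote steal moves enough cold (FIFO-end) tasks for a whole chip,
 * so a single costly cross-chip transaction serves CORES_PER threads. */
static int steal_bulk(int thief_chip, int victim_chip)
{
    int buf[CORES_PER], n = 0;
    omp_set_lock(&q[victim_chip].lock);
    while (n < CORES_PER && q[victim_chip].tail > q[victim_chip].head)
        buf[n++] = q[victim_chip].tasks[q[victim_chip].head++];
    omp_unset_lock(&q[victim_chip].lock);
    omp_set_lock(&q[thief_chip].lock);
    for (int i = 0; i < n; i++)
        q[thief_chip].tasks[q[thief_chip].tail++] = buf[i];
    omp_unset_lock(&q[thief_chip].lock);
    return n;
}

int main(void)
{
    long done[CHIPS * CORES_PER] = { 0 };
    for (int c = 0; c < CHIPS; c++) {
        q[c].head = q[c].tail = 0;
        omp_init_lock(&q[c].lock);
    }
    for (int i = 0; i < 800; i++)            /* seed all work on chip 0 */
        q[0].tasks[q[0].tail++] = i;

    #pragma omp parallel num_threads(CHIPS * CORES_PER)
    {
        int id = omp_get_thread_num();
        int chip = id / CORES_PER;
        int leader = (id % CORES_PER == 0);  /* one thief per chip */
        int t, idle = 0;
        while (idle < 100000) {              /* crude termination heuristic */
            if (pop_local(chip, &t)) { done[id]++; idle = 0; }
            else if (leader && steal_bulk(chip, (chip + 1) % CHIPS)) idle = 0;
            else idle++;
        }
    }
    for (int i = 0; i < CHIPS * CORES_PER; i++)
        printf("thread %d ran %ld tasks\n", i, done[i]);
    return 0;
}
```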


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2010

Analyzing lock contention in multithreaded applications

Nathan R. Tallent; John M. Mellor-Crummey; Allan Porterfield

Many programs exploit shared-memory parallelism using multithreading. Threaded codes typically use locks to coordinate access to shared data. In many cases, contention for locks reduces parallel efficiency and hurts scalability. Being able to quantify and attribute lock contention is important for understanding where a multithreaded program needs improvement. This paper proposes and evaluates three strategies for gaining insight into performance losses due to lock contention. First, we consider using a straightforward strategy based on call stack profiling to attribute idle time and show that it fails to yield insight into lock contention. Second, we consider an approach that builds on a strategy previously used for analyzing idleness in work-stealing computations; we show that this strategy does not yield insight into lock contention. Finally, we propose a new technique for measurement and analysis of lock contention that uses data associated with locks to blame lock holders for the idleness of spinning threads. Our approach incurs ≤ 5% overhead on a quantum chemistry application that makes extensive use of locking (65M distinct locks, a maximum of 340K live locks, and an average of 30K lock acquisitions per second per thread) and attributes lock contention to its full static and dynamic calling contexts. Our strategy, implemented in HPCToolkit, is fully distributed and should scale well to systems with large core counts.
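
The core of the third strategy can be sketched compactly: keep an idleness accumulator with each lock, let spinning threads deposit their measured wait time into it, and let the releasing holder collect the accumulated blame. The C11 sketch below is a minimal illustration under those assumptions, not HPCToolkit's implementation; in the real tool the collected blame is attributed to the holder's full static and dynamic calling context rather than to a per-thread total.

```c
/* Minimal illustration of blaming lock holders for spin idleness:
 * waiters deposit wait time in a counter stored with the lock; the
 * holder collects it on release.  Not HPCToolkit's implementation. */
#include <omp.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

typedef struct {
    atomic_int   held;       /* 0 = free, 1 = held */
    atomic_llong blame_ns;   /* idleness deposited by spinning threads */
} blame_lock;

static long long now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

static void lock_acquire(blame_lock *l)
{
    long long t0 = now_ns();
    int expect = 0;
    while (!atomic_compare_exchange_weak(&l->held, &expect, 1)) {
        expect = 0;
        long long t1 = now_ns();
        atomic_fetch_add(&l->blame_ns, t1 - t0);  /* deposit idleness */
        t0 = t1;
    }
}

static void lock_release(blame_lock *l, long long *my_blame)
{
    /* collect the idleness caused while this thread held the lock */
    *my_blame += atomic_exchange(&l->blame_ns, 0);
    atomic_store(&l->held, 0);
}

int main(void)
{
    blame_lock l = { 0, 0 };
    long long blame[4] = { 0 };
    long shared = 0;

    #pragma omp parallel num_threads(4)
    {
        int id = omp_get_thread_num();
        for (int i = 0; i < 200000; i++) {
            lock_acquire(&l);
            shared++;
            lock_release(&l, &blame[id]);
        }
    }
    for (int i = 0; i < 4; i++)
        printf("thread %d blamed for %lld ns of peer spinning\n", i, blame[i]);
    printf("count = %ld\n", shared);
    return 0;
}
```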


High Performance Distributed Computing | 2010

SoftPower: fine-grain power estimations using performance counters

Min Yeol Lim; Allan Porterfield; Robert J. Fowler

We present and evaluate a surrogate model, based on hardware performance counter measurements, to estimate computer system power consumption. Power and energy are especially important in the design and operation of large data centers and of clusters used for scientific computing. Tradeoffs are made between performance and power consumption, and these tradeoffs need to be dynamic because activity varies over time. While it is possible to instrument systems for fine-grain power monitoring, such instrumentation is costly and not commonly available. Furthermore, the latency and sampling periods of hardware power monitors can be large compared to the time scales at which workloads change and dynamic power controls operate. Given these limitations, we argue that surrogate models of the kind we present here can provide low-cost and accurate estimates of power consumption to drive on-line dynamic control mechanisms and for use in off-line tuning. In this brief paper, we discuss a general approach to building system power estimation models based on hardware performance counters. Using this technique, we then present a model for an Intel Core i7 system that has a median absolute estimation error of 5.32% and acceptable data-collection overheads on varying workloads, CPU power states (frequency and voltage), and numbers of active cores. Since this method is based on event sampling of hardware counters, one can make a tradeoff between estimation accuracy and data-collection overhead.
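
The general shape of such a surrogate is a fitted linear map from per-interval hardware-event rates to watts. The sketch below shows only that shape; the event set, coefficients, and baseline are invented placeholders, since the paper fits its own model per platform.

```c
/* Sketch of a counter-based power surrogate: a linear model over
 * per-interval event rates.  Coefficients and events are hypothetical
 * placeholders, not the paper's fitted Core i7 model. */
#include <stdio.h>

#define NEVENTS 4

/* hypothetical fitted coefficients (watts per million events/sec) for
 * {instructions, LLC misses, memory accesses, stall cycles} */
static const double beta[NEVENTS] = { 1.8e-2, 3.1e-2, 9.5e-2, 4.0e-3 };
static const double beta0 = 18.5;   /* assumed idle/baseline watts */

/* rates observed over one sampling interval, e.g. collected via PAPI */
static double estimate_watts(const double rate_meps[NEVENTS])
{
    double w = beta0;
    for (int i = 0; i < NEVENTS; i++)
        w += beta[i] * rate_meps[i];
    return w;
}

int main(void)
{
    double sample[NEVENTS] = { 2400.0, 12.0, 85.0, 600.0 }; /* Mevents/s */
    printf("estimated power: %.1f W\n", estimate_watts(sample));
    return 0;
}
```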


IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2013

Power Measurement and Concurrency Throttling for Energy Reduction in OpenMP Programs

Allan Porterfield; Stephen L. Olivier; Sridutt Bhalachandra; Jan F. Prins

Understanding on-node application power and performance characteristics is critical to the push toward exascale computing. In this paper, we present an analysis of factors that impact both performance and energy usage of OpenMP applications. Using hardware performance counters in the Intel Sandy Bridge x86-64 architecture, we measure energy usage and power draw for a variety of OpenMP programs: simple micro-benchmarks, a task parallel benchmark suite, and a hydrodynamics mini-app of a few thousand lines. The evaluation reveals substantial variations in energy usage depending on the algorithm, the compiler, the optimization level, the number of threads, and even the temperature of the chip. Variations of 20% were common and in the extreme were over 2X. In most cases, performance increases and energy usage decreases as more threads are used. However, for programs with sub-linear speedup, minimal energy usage often occurs at a lower thread count than peak performance. Our findings informed the design and implementation of an adaptive runtime system that automatically throttles concurrency using data measured on-line from hardware performance counters. Without source code changes or user intervention, the thread scheduler accurately decides when energy can be conserved by limiting the number of active threads. For the target programs, dynamic runtime throttling consistently reduces power and overall energy usage by up to 3%.
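
The throttling decision can be sketched as a downward search over thread counts that stops when measured energy stops improving. In the sketch below, energy_joules() is a stand-in for a real measurement (e.g., RAPL counters) and its power constants are invented; note that the paper's runtime makes the equivalent decision on-line inside the scheduler, without rerunning the workload as this offline sketch does.

```c
/* Offline sketch of the throttling decision only: walk down from the
 * maximum thread count while measured energy keeps improving.
 * energy_joules() is a placeholder for a RAPL or power-meter reading. */
#include <omp.h>
#include <stdio.h>

static double energy_joules(double elapsed_s, int active_threads)
{
    const double base_w = 20.0, per_thread_w = 8.0;  /* assumed constants */
    return elapsed_s * (base_w + per_thread_w * active_threads);
}

static double run_workload(int nthreads)   /* stand-in parallel kernel */
{
    double t0 = omp_get_wtime(), sum = 0.0;
    #pragma omp parallel for num_threads(nthreads) reduction(+:sum)
    for (long i = 0; i < 80 * 1000 * 1000L; i++)
        sum += 1.0 / (double)(i + 1);
    if (sum < 0.0) printf("unreachable\n");  /* keep the loop live */
    return omp_get_wtime() - t0;
}

int main(void)
{
    int max_t = omp_get_max_threads(), best_t = max_t;
    double best_e = 1e300;
    for (int t = max_t; t >= 1; t--) {
        double e = energy_joules(run_workload(t), t);
        printf("%2d threads: %.1f J\n", t, e);
        if (e < best_e) { best_e = e; best_t = t; }
        else break;   /* sub-linear speedup: energy got worse, stop */
    }
    printf("throttle to %d threads\n", best_t);
    return 0;
}
```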


International Symposium on Performance Analysis of Systems and Software | 2010

Modeling memory concurrency for multi-socket multi-core systems

Anirban Mandal; Rob Fowler; Allan Porterfield

Multi-core computers are ubiquitous and multi-socket versions dominate as nodes in compute clusters. Given the high level of parallelism inherent in processor chips, the ability of memory systems to serve a large number of concurrent memory access operations is becoming a critical performance problem. The most common model of memory performance uses just two numbers, peak bandwidth and typical access latency. We introduce concurrency as an explicit parameter of the measurement and modeling processes to characterize more accurately the complexity of memory behavior of multi-socket, multi-core systems. We present a detailed experimental multi-socket, multi-core memory study based on the PCHASE benchmark, which can vary memory loads by controlling the number of concurrent memory references per thread. The make-up and structure of the memory have a major impact on achievable bandwidth. Three discrete bottlenecks were observed at different levels of the hardware architecture: limits on the number of references outstanding per core; limits to the memory requests serviced by a single memory controller; and limits on the global memory concurrency. We use these results to build a memory performance model that ties concurrency, latency and bandwidth together to create a more accurate model of overall performance. We show that current commodity memory sub-systems cannot handle the load offered by high-end processor chips.
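
The resulting model is essentially Little's law with three concurrency clamps: delivered bandwidth is bytes in flight divided by latency, where the bytes in flight are limited first per core, then per memory controller, then globally. The sketch below shows that structure with invented limit values; the paper derives the actual limits from PCHASE measurements.

```c
/* Sketch of the three-bottleneck concurrency model: Little's law,
 * with concurrency clipped by per-core, per-controller, and global
 * limits.  All limit values here are illustrative, not measured. */
#include <stdio.h>

static double model_bw_gbs(int refs_per_core, int cores, int controllers)
{
    const double line_bytes = 64.0, latency_ns = 90.0;  /* assumed */
    const int core_limit   = 10;   /* outstanding refs per core      */
    const int ctrl_limit   = 32;   /* refs serviced per controller   */
    const int global_limit = 120;  /* system-wide concurrency cap    */

    int per_core = refs_per_core < core_limit ? refs_per_core : core_limit;
    int conc     = per_core * cores;
    int ctrl_cap = ctrl_limit * controllers;
    if (conc > ctrl_cap)     conc = ctrl_cap;
    if (conc > global_limit) conc = global_limit;

    return conc * line_bytes / latency_ns;   /* bytes/ns == GB/s */
}

int main(void)
{
    for (int refs = 1; refs <= 16; refs *= 2)
        printf("%2d refs/core x 8 cores: %.1f GB/s\n",
               refs, model_bw_gbs(refs, 8, 2));
    return 0;
}
```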


International Workshop on Runtime and Operating Systems for Supercomputers | 2015

Application Runtime Variability and Power Optimization for Exascale Computers

Allan Porterfield; Rob Fowler; Sridutt Bhalachandra; Barry Rountree; Diptorup Deb; Rob Lewis

Exascale computing will be as much about solving problems with the least power/energy as about solving them quickly. In the future, systems are likely to be over-provisioned with processors and will rely on hardware-enforced power bounds to allow operation within power budget/thermal design limits. Variability in High Performance Computing (HPC) makes scheduling and application optimization difficult. For three HPC applications (ADCIRC, WRF, and LQCD), we show the effects of heterogeneity on run-to-run execution consistency, with and without a power limit applied. A 4% hardware run-to-run variation is seen in the case of a perfectly balanced compute-bound application, while up to 6% variation was observed for an application with minor load imbalances. A simple model based on Dynamic Duty Cycle Modulation (DDCM) is introduced. The model was implemented within the MPI profiling interface for collective operations and used in conjunction with the three applications without source code changes. Energy savings of over 10% are obtained for ADCIRC with only a 1--3% slowdown. With a power limit in place, the energy savings are partially translated into performance improvement. With a power limit of 50W, one version of the model executes 3% faster while saving 6% in energy, and a second version executes 1% faster while saving over 10% in energy.
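
Because the mechanism lives behind the MPI profiling interface, it needs no application changes: a wrapper lowers the core's duty cycle on entry to a collective and restores it on exit. The sketch below shows that interposition pattern for MPI_Allreduce; set_duty_cycle() is a placeholder for the platform-specific IA32_CLOCK_MODULATION write, and the 50% setting is illustrative rather than the paper's model.

```c
/* Sketch of applying DDCM around a collective via the MPI profiling
 * (PMPI) interface, so no application source changes are needed.
 * set_duty_cycle() is a placeholder for a per-core MSR write. */
#include <mpi.h>

static void set_duty_cycle(double fraction)
{
    (void)fraction;  /* platform-specific IA32_CLOCK_MODULATION write
                        omitted in this sketch */
}

/* Intercept MPI_Allreduce: slow the core while it waits in the
 * collective, restore full speed before returning to computation. */
int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count,
                  MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
{
    set_duty_cycle(0.5);   /* e.g. 50% duty cycle while waiting */
    int rc = PMPI_Allreduce(sendbuf, recvbuf, count, datatype, op, comm);
    set_duty_cycle(1.0);   /* back to full speed */
    return rc;
}
```

Compiled into a shared library and preloaded or linked ahead of the MPI library, this wrapper transparently intercepts every MPI_Allreduce call the application makes.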


International Workshop on Runtime and Operating Systems for Supercomputers | 2013

An early prototype of an autonomic performance environment for exascale

Kevin A. Huck; Sameer Shende; Allen D. Malony; Hartmut Kaiser; Allan Porterfield; Robert J. Fowler; Ron Brightwell

Extreme-scale computing requires a new perspective on the role of performance observation in the Exascale system software stack. Because of the anticipated high concurrency and dynamic operation in these systems, it is no longer reasonable to expect that a post-mortem performance measurement and analysis methodology will suffice. Rather, there is a strong need for performance observation that merges first- and third-person observation, in situ analysis, and introspection across stack layers, and that serves online dynamic feedback and adaptation. In this paper we describe the DOE-funded XPRESS project and the role of autonomic performance support in Exascale systems. XPRESS will build an integrated Exascale software stack (called OpenX) that supports the ParalleX execution model and is targeted towards future Exascale platforms. An initial version of an autonomic performance environment called APEX has been developed for OpenX using the current TAU performance technology, and results are presented that highlight the challenges of highly integrative observation and runtime analysis.


International Conference on Parallel Processing | 2015

Using Per-Loop CPU Clock Modulation for Energy Efficiency in OpenMP Applications

Wei Wang; Allan Porterfield; John Cavazos; Sridutt Bhalachandra

As the HPC community moves into the exascale computing era, application energy is becoming as large a concern as performance. Optimizing for energy will be essential in the effort to overcome the limited power envelope. Existing efforts to optimize energy in applications employ Dynamic Voltage and Frequency Scaling (DVFS) to maximize energy savings in less compute-intensive regions or non-critical execution paths. However, we found that DVFS has high power-state switching overhead, preventing its use when a more fine-grained technique is necessary. In this work, we take advantage of the low transition overhead of CPU clock modulation and apply it to fine-grained OpenMP parallel loops. The energy behavior of OpenMP parallel regions is first characterized by changing the effective frequency using clock modulation. The clock modulation setting that achieves the best energy efficiency is then determined for each region. Finally, different CPU clock modulation settings are applied to the different loops within the same application. The resulting multi-frequency execution of OpenMP applications achieves a better energy-delay trade-off than any single frequency setting. In the best case, the multi-frequency approach achieved 8.6% energy savings with less than 1.5% execution time increase. Concurrency throttling (i.e., reducing the number of hardware threads used by an application) saves more energy and can be combined with CPU clock modulation. Using both, we see energy savings of 21% and an improvement in energy-delay product (EDP) of 16%.
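
The per-loop mechanism reduces to bracketing each parallel region with the modulation setting previously characterized as best for it. The sketch below illustrates that pattern only; set_clock_mod() stands in for the per-core IA32_CLOCK_MODULATION (MSR 0x19A) write, and the regions, duty levels, and lookup table are hypothetical.

```c
/* Hypothetical per-loop clock-modulation pattern (not the paper's
 * code).  set_clock_mod() stands in for writing IA32_CLOCK_MODULATION
 * (MSR 0x19A) on the calling core; levels and regions are illustrative. */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

static void set_clock_mod(double duty)
{
    (void)duty;  /* real code: per-core MSR 0x19A write via a driver */
}

enum { REG_STREAM = 0, REG_COMPUTE = 1 };
static const double best_duty[] = { 0.625, 1.0 };  /* found by profiling */

static void apply_duty(int region)
{
    /* each worker thread sets its own core's modulation level */
    #pragma omp parallel
    set_clock_mod(best_duty[region]);
}

int main(void)
{
    enum { N = 1 << 20 };
    double *a = malloc(N * sizeof *a), *b = malloc(N * sizeof *b);
    for (int i = 0; i < N; i++) b[i] = i;

    apply_duty(REG_STREAM);      /* memory-bound loop: lower duty cycle */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i];

    apply_duty(REG_COMPUTE);     /* compute-bound loop: full speed */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = a[i] * a[i] / (b[i] + 1.0);

    printf("a[42] = %f\n", a[42]);
    free(a); free(b);
    return 0;
}
```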


International Parallel and Distributed Processing Symposium | 2017

An Adaptive Core-Specific Runtime for Energy Efficiency

Sridutt Bhalachandra; Allan Porterfield; Stephen L. Olivier; Jan F. Prins

Energy efficiency in high performance computing (HPC) will be critical to limit operating costs and carbon footprints in future supercomputing centers. The energy efficiency of a computation can be improved by reducing time to completion without a substantial increase in power drawn, or by reducing power with only a small increase in time to completion. We present an Adaptive Core-specific Runtime (ACR) that dynamically adapts core frequencies to workload characteristics, and show examples of both reductions in power and improvements in average performance. This improvement in energy efficiency is obtained without changes to the application. The adaptation policy embedded in the runtime uses existing core-specific power controls like software-controlled clock modulation and per-core Dynamic Voltage and Frequency Scaling (DVFS) introduced in Intel Haswell. Experiments on six standard MPI benchmarks and a real-world application show an overall 20% improvement in energy efficiency with less than 1% increase in execution time on 32 nodes (1024 cores) using per-core DVFS. An improvement in energy efficiency of up to 42% is obtained with the real-world application ParaDiS through a combination of speedup and power reduction. For one configuration, ParaDiS achieves an average speedup of 11% while power is lowered by about 31%. The average improvement in performance seen is a direct result of the reduction in run-to-run variation and of running at turbo frequencies.
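
On Linux, the per-core actuation half of such a runtime can be as simple as writing a core's cpufreq sysfs file (with the userspace governor selected and suitable privileges). The sketch below shows only that actuation path, with an assumed core number and frequency; the adaptive policy that decides when and what to write is the substance of ACR and is not reproduced here.

```c
/* Sketch of per-core DVFS actuation through the Linux cpufreq sysfs
 * interface (requires the userspace governor and root).  The core and
 * frequency below are assumed values for illustration. */
#include <stdio.h>

static int set_core_khz(int core, long khz)
{
    char path[128];
    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_setspeed",
             core);
    FILE *f = fopen(path, "w");
    if (!f) return -1;
    fprintf(f, "%ld\n", khz);
    return fclose(f);
}

int main(void)
{
    /* e.g. a core stuck waiting can idle at 1.2 GHz while others stay high */
    if (set_core_khz(3, 1200000) != 0)
        perror("scaling_setspeed");
    return 0;
}
```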


International Workshop on Energy Efficient Supercomputing | 2017

Improving Energy Efficiency in Memory-constrained Applications Using Core-specific Power Control

Sridutt Bhalachandra; Allan Porterfield; Stephen L. Olivier; Jan F. Prins; Robert J. Fowler

Power is increasingly the limiting factor in High Performance Computing (HPC) at Exascale and will continue to influence future advancements in supercomputing. Recent processors equipped with on-board hardware counters allow real-time monitoring of operating conditions such as energy and temperature, in addition to performance measures such as instructions retired and memory accesses. An experimental memory study on modern CPU architectures, Intel Sandy Bridge and Haswell, identifies a metric, TORo_core, that detects bandwidth saturation and increased latency. TORo_core is used to construct a dynamic policy, applied at coarse and fine-grained levels, to modulate per-core power controls on Haswell machines. The coarse- and fine-grained applications of the dynamic policy show best-case energy savings of 32.1% and 19.5%, respectively, with a 2% slowdown in both cases. On average, for six MPI applications, the fine-grained dynamic policy speeds execution by 1%, while the coarse-grained application results in a 3% slowdown. Energy savings through frequency reduction not only provide cost advantages, they also reduce resource contention and create additional thermal headroom for non-throttled cores, improving performance.
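
Structurally, the dynamic policy is a periodic sample-and-threshold loop per core: read the saturation metric, then modulate that core's power control when the metric indicates bandwidth saturation. The sketch below shows that control shape only; read_toro_core(), the threshold, the duty levels, and the interval are all placeholders rather than the paper's derivation of TORo_core from hardware counters.

```c
/* Sketch of the control loop's shape: per-core sample-and-threshold
 * modulation.  read_toro_core() and set_core_duty() are stubs; the
 * threshold, duty levels, and interval are assumed, not measured. */
#include <stdio.h>
#include <unistd.h>

static double read_toro_core(int core) { (void)core; return 0.7; } /* stub */

static void set_core_duty(int core, double duty)
{
    (void)core; (void)duty;  /* e.g. per-core IA32_CLOCK_MODULATION write */
}

int main(void)
{
    const double saturated = 0.6;     /* assumed saturation threshold */
    for (int iter = 0; iter < 10; iter++) {
        for (int core = 0; core < 4; core++) {
            double m = read_toro_core(core);
            /* a memory-saturated core gains little from full clocks */
            set_core_duty(core, m > saturated ? 0.75 : 1.0);
        }
        usleep(10000);                /* 10 ms control interval */
    }
    return 0;
}
```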

Collaboration


Dive into Allan Porterfield's collaborations.

Top Co-Authors

Robert J. Fowler, University of North Carolina at Chapel Hill
Sridutt Bhalachandra, University of North Carolina at Chapel Hill
Stephen L. Olivier, Sandia National Laboratories
Jan F. Prins, University of North Carolina at Chapel Hill
Rob Fowler, University of North Carolina at Chapel Hill
Kyle Wheeler, Sandia National Laboratories
Anirban Mandal, University of North Carolina at Chapel Hill
Diptorup Deb, University of North Carolina at Chapel Hill
Hartmut Kaiser, Louisiana State University