
Publication


Featured research published by Kevin J. Barker.


International Parallel and Distributed Processing Symposium | 2014

MIC-SVM: Designing a Highly Efficient Support Vector Machine for Advanced Modern Multi-core and Many-Core Architectures

Yang You; Shuaiwen Leon Song; Haohuan Fu; Andres Marquez; Maryam Mehri Dehnavi; Kevin J. Barker; Kirk W. Cameron; Amanda Randles; Guangwen Yang

The Support Vector Machine (SVM) has been widely used in data-mining and Big Data applications as modern commercial databases attach increasing importance to analytic capabilities. In recent years, SVM has also been adapted to the field of High Performance Computing for power/performance prediction, auto-tuning, and runtime scheduling. However, to avoid significant runtime training overhead, researchers can only afford offline model training, even at the risk of losing prediction accuracy due to insufficient runtime information. Advanced multi- and many-core architectures offer massive parallelism with complex memory hierarchies, which can make runtime training possible but also forms a barrier to efficient parallel SVM design. To address these challenges, we designed and implemented MIC-SVM, a highly efficient parallel SVM for x86-based multi-core and many-core architectures, such as Intel Ivy Bridge CPUs and the Intel Xeon Phi co-processor (MIC). We propose several novel analysis methods and optimization techniques that fully utilize the multilevel parallelism provided by these architectures and that can serve as general optimization methods for other machine learning tools. MIC-SVM achieves 4.4-84x and 18-47x speedups over the popular LIBSVM, on MIC and Ivy Bridge CPUs respectively, for several real-world data-mining datasets. Even compared with GPUSVM running on a top-of-the-line NVIDIA K20X GPU, the performance of MIC-SVM is competitive. We also conduct a cross-platform performance comparison across Ivy Bridge CPUs, MIC, and GPUs, and provide insights on how to select the most suitable architecture for specific algorithms and input data patterns.
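
The core computational kernel behind such an SVM is the repeated evaluation of kernel rows against the training set; a minimal sketch of how that hot loop maps onto thread-level and SIMD parallelism is shown below. It is an illustration only, not the MIC-SVM code, and the function name and data layout are assumptions.

```c
/* Hypothetical sketch of the hot loop in SMO-style SVM training:
 * computing one RBF kernel row K(x_i, x_j) for all j, using OpenMP
 * threads across samples and SIMD within each distance computation.
 * An illustration of multilevel parallelism, not the MIC-SVM code. */
#include <math.h>

void kernel_row_rbf(const float *X,  /* n_samples x n_features, row-major */
                    int n_samples, int n_features,
                    int i, float gamma, float *row /* out: n_samples */)
{
    const float *xi = X + (long)i * n_features;

    #pragma omp parallel for schedule(static)
    for (int j = 0; j < n_samples; ++j) {
        const float *xj = X + (long)j * n_features;
        float dist2 = 0.0f;

        /* Unit-stride inner loop so the compiler can vectorize it
         * (AVX on Ivy Bridge, 512-bit vectors on Xeon Phi). */
        #pragma omp simd reduction(+:dist2)
        for (int k = 0; k < n_features; ++k) {
            float d = xi[k] - xj[k];
            dist2 += d * d;
        }
        row[j] = expf(-gamma * dist2);
    }
}
```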


Green Computing and Communications | 2010

Designing Energy Efficient Communication Runtime Systems for Data Centric Programming Models

Abhinav Vishnu; Shuaiwen Song; Andres Marquez; Kevin J. Barker; Darren J. Kerbyson; Kirk W. Cameron; Pavan Balaji

The insatiable demand for high performance computing is being driven by the most computationally intensive applications, such as computational chemistry, climate modeling, and nuclear physics. The last couple of decades have seen a tremendous rise in supercomputers, with architectures ranging from traditional clusters to systems-on-a-chip, in the push to break the petaflop computing barrier. With the advent of petaflop-plus computing, however, we have entered an era in which a power-efficient system software stack is imperative for execution on exascale systems and beyond. At the same time, computationally intensive applications are exploring programming models beyond traditional message passing, such as Partitioned Global Address Space (PGAS) languages and libraries, which provide a one-sided communication paradigm with put, get, and accumulate primitives. To support PGAS models, it is critical to design power-efficient and high-performance one-sided communication runtime systems. In this paper, we design and implement PASCoL, a high-performance, power-aware one-sided communication library built on the Aggregate Remote Memory Copy Interface (ARMCI), the communication runtime system of Global Arrays. For the communication primitives provided by ARMCI, we study the impact of Dynamic Voltage/Frequency Scaling (DVFS) and of the interrupt-based (blocking) and polling-based mechanisms provided by most modern interconnects. We implement our design and evaluate it with synthetic benchmarks on an InfiniBand cluster. Our results indicate that PASCoL can achieve a significant reduction in the energy consumed per byte transferred, without additional penalty, across various one-sided communication primitives, message sizes, and data transfer patterns.
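
A minimal sketch of the underlying idea follows: during a large, NIC-driven one-sided transfer the initiating core can be scaled to a lower frequency. The set_core_frequency() helper and the 64 KiB threshold are hypothetical assumptions; PASCoL integrates this logic into the ARMCI runtime rather than the application.

```c
/* Sketch of scaling down a core during a one-sided transfer, assuming
 * an InfiniBand NIC performs the RDMA without CPU involvement.
 * set_core_frequency() is a hypothetical DVFS hook (e.g. via the
 * cpufreq userspace governor); it is not part of ARMCI or PASCoL. */
#include <armci.h>

extern void set_core_frequency(long khz);   /* hypothetical helper */

void energy_aware_get(void *remote_src, void *local_dst,
                      int bytes, int proc,
                      long low_khz, long high_khz)
{
    /* Large transfers are NIC-driven, so the initiating core is mostly
     * idle: drop to a low P-state for the duration of the blocking call. */
    if (bytes >= 64 * 1024)
        set_core_frequency(low_khz);

    ARMCI_Get(remote_src, local_dst, bytes, proc);

    /* Restore the nominal frequency before returning to computation. */
    if (bytes >= 64 * 1024)
        set_core_frequency(high_khz);
}
```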


The Journal of Supercomputing | 2013

Designing energy efficient communication runtime systems: a view from PGAS models

Abhinav Vishnu; Shuaiwen Song; Andres Marquez; Kevin J. Barker; Darren J. Kerbyson; Kirk W. Cameron; Pavan Balaji

As the march to exascale computing gains momentum, the energy consumption of supercomputers has emerged as a critical roadblock. While architectural innovations are imperative in achieving computing at this scale, it largely falls to the systems software to leverage those innovations. Parallel applications in many computationally intensive domains have been designed to leverage these supercomputers using legacy two-sided communication semantics with the Message Passing Interface. At the same time, Partitioned Global Address Space (PGAS) models are being designed that provide global address space abstractions and one-sided communication for exploiting data locality and communication optimizations. PGAS models rely on one-sided communication runtime systems to leverage high-speed networks and achieve the best possible performance. In this paper, we present a design for a Power-Aware One-Sided Communication Library (PASCoL). The proposed design detects communication slack and leverages Dynamic Voltage and Frequency Scaling (DVFS) and interrupt-driven execution to exploit the detected slack for energy efficiency. We implement our design and evaluate it using synthetic benchmarks for the one-sided communication primitives Put, Get, and Accumulate, as well as uniformly noncontiguous data transfers. Our performance evaluation indicates that we can achieve a significant reduction in energy consumption without performance loss for multiple one-sided communication primitives. The achieved results are close to the theoretical peak available on the experimental testbed.


International Workshop on Energy Efficient Supercomputing | 2013

Unified performance and power modeling of scientific workloads

Shuaiwen Leon Song; Kevin J. Barker; Darren J. Kerbyson

Scientific applications executing on future large-scale HPC systems are expected to be optimized not only in terms of performance but also in terms of power consumption. As power and energy become increasingly constrained resources, researchers and developers must have access to tools that allow for accurate prediction of both performance and power consumption. Reasoning about performance and power consumption in concert will be critical for achieving maximum utilization of limited resources on future HPC systems. To this end, we present a unified performance and power model for the Nek-Bone mini-application developed as part of the DOE's CESAR Exascale Co-Design Center. Our models consider the impact of computation, point-to-point communication, and collective communication individually and quantitatively predict their impact on both performance and energy efficiency. These models are demonstrated to be accurate on currently available HPC system architectures. In this paper, we present our modeling methodology, the performance and power models for the Nek-Bone mini-application, and validation results that indicate the accuracy of these models.
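
The abstract does not give the model equations; a generic decomposed form consistent with the components listed (computation, point-to-point communication, and collective communication) might look like the following. The symbols are illustrative and are not taken from the paper.

```latex
% Illustrative decomposition (not the paper's exact formulation):
% total runtime is split by activity, and energy weights each
% activity's time by its measured average power draw.
\[
  T_{\text{total}} = T_{\text{comp}} + T_{\text{p2p}} + T_{\text{coll}}
\]
\[
  E_{\text{total}} = P_{\text{comp}}\, T_{\text{comp}}
                   + P_{\text{p2p}}\, T_{\text{p2p}}
                   + P_{\text{coll}}\, T_{\text{coll}}
\]
```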


International Conference on Cluster Computing | 2011

Energy Templates: Exploiting Application Information to Save Energy

Darren J. Kerbyson; Abhinav Vishnu; Kevin J. Barker

In this work we consider a novel application-centric approach to saving energy on large-scale parallel systems. By using a priori information on expected application behavior, we identify points at which processor cores will wait for incoming data and thus may be placed in a low-power state to save energy. The approach is general and complements many existing approaches that rely on saving energy at points of global synchronization. We capture the expected application behavior in an Energy Template, whose purpose is to identify when cores are expected to be idle and to allow the runtime to use the template information to change the power state of each core. We prototype an Energy Template for a wavefront algorithm that contains a complex processing pattern in which cores wait for incoming data before processing local data and whose wait time varies from phase to phase. The implementation uses PMPI and requires minimal changes to the application code. Using a power-instrumented cluster, we demonstrate that using an Energy Template for the wavefront application lowers the power requirements by 8% when using 216 cores, from the system maximum of 23%, and the energy requirements by 4%. We also show that the wavefront's inherent parallel activity will lead to increased savings on larger systems.
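
Since the prototype interposes on MPI via PMPI, the general pattern can be sketched as below. The power-state helpers are hypothetical stand-ins for the Energy Template runtime, and the sketch omits the a priori template logic that decides whether a given wait is worth the transition.

```c
/* Sketch of PMPI interposition: intercept a blocking receive, move the
 * core to a low-power state while it waits for incoming data, and
 * restore it before computation resumes. The power-state helpers are
 * hypothetical stand-ins for the Energy Template runtime. */
#include <mpi.h>

extern void lower_power_state(void);    /* hypothetical */
extern void restore_power_state(void);  /* hypothetical */

int MPI_Recv(void *buf, int count, MPI_Datatype datatype,
             int source, int tag, MPI_Comm comm, MPI_Status *status)
{
    int rc;

    /* An Energy Template would consult a priori application knowledge
     * here to decide whether this wait justifies a power-state change. */
    lower_power_state();
    rc = PMPI_Recv(buf, count, datatype, source, tag, comm, status);
    restore_power_state();

    return rc;
}
```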


International Conference on Cluster Computing | 2011

Analyzing the Performance Bottlenecks of the POWER7-IH Network

Darren J. Kerbyson; Kevin J. Barker

In this work we provide an early performance analysis of the communication network in a small-scale POWER7-IH processing system from IBM. Using a set of communication micro-benchmarks, we quantify the achievable bandwidth of the communication links available in the system, which differ in their peak performance characteristics. We also identify the bottlenecks within the communication network and show that the bandwidth a single node can inject into the network is considerably less than the bandwidth available to the IBM hub chip, which acts as a NIC for the node as well as being an integral part of the P7-IH network. Using a communication pattern that is representative of the activity in many scientific applications with regular communication patterns, we show that the default task-to-core assignment on the P7-IH achieves sub-optimal performance in most cases. We also show that with a diagonal-cyclic assignment, developed in this work to take into account both the network topology and the routing strategy, communication performance can be improved by up to 75%. We expect even greater improvements in communication performance on larger P7-IH systems.
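
The mapping formula itself is not given in the abstract; purely as an illustration of applying an alternative task placement, the sketch below builds a reordered MPI communicator from a hypothetical diagonal-style permutation. It is not the diagonal-cyclic mapping developed in the paper.

```c
/* Generic sketch of applying an alternative task placement by creating
 * a reordered communicator. new_rank() is a hypothetical diagonal-style
 * permutation of a PX x PY logical grid, NOT the diagonal-cyclic
 * mapping developed in the paper. */
#include <mpi.h>

/* Hypothetical: shift each column cyclically so logical neighbors
 * spread across different parts of the machine. */
static int new_rank(int old_rank, int PX, int PY)
{
    int x = old_rank % PX, y = old_rank / PX;
    return ((x + y) % PY) * PX + x;   /* illustrative permutation only */
}

MPI_Comm remapped_comm(MPI_Comm comm, int PX, int PY)
{
    int rank;
    MPI_Comm newcomm;
    MPI_Comm_rank(comm, &rank);

    /* Same color for every rank; the key defines the new rank order. */
    MPI_Comm_split(comm, 0, new_rank(rank, PX, PY), &newcomm);
    return newcomm;
}
```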


International Conference on Parallel and Distributed Systems | 2012

Comparing the Performance of Blue Gene/Q with Leading Cray XE6 and InfiniBand Systems

Darren J. Kerbyson; Kevin J. Barker; Abhinav Vishnu; Adolfy Hoisie

Three types of systems dominate the current High Performance Computing landscape: the Cray XE6, the IBM Blue Gene, and commodity clusters using InfiniBand. These systems have quite different characteristics, making the choice for a particular deployment difficult. The XE6 uses Cray's proprietary Gemini 3-D torus interconnect with two nodes at each network endpoint. The latest IBM Blue Gene/Q uses a single socket integrating processor and communication in a 5-D torus network. InfiniBand provides the flexibility of using nodes from many vendors connected in many possible topologies. The performance characteristics of these systems vary widely, as do their utilization models. In this work we compare the performance of the three systems using a combination of micro-benchmarks and a set of production applications. We also discuss the causes of performance variability across the systems and quantify where performance is lost using a combination of measurements and models. Our results show that significant performance can be lost in normal production operation of the Cray XE6 and InfiniBand clusters in comparison to Blue Gene/Q.
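
The micro-benchmark suite is not listed in the abstract; a minimal ping-pong bandwidth test of the kind typically used for such cross-system comparisons is sketched below (it is not the authors' benchmark code).

```c
/* Minimal two-rank ping-pong bandwidth test of the kind typically used
 * in cross-system comparisons; a generic sketch, not the authors'
 * benchmark suite. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iters = 1000, bytes = 1 << 20;   /* 1 MiB messages */
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = malloc(bytes);
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int i = 0; i < iters; ++i) {
        if (rank == 0) {
            MPI_Send(buf, bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }

    double elapsed = MPI_Wtime() - t0;
    if (rank == 0)   /* one-way time = elapsed / (2 * iters) */
        printf("ping-pong bandwidth: %.1f MB/s\n",
               2.0 * iters * bytes / elapsed / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```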


IEEE International Conference on High Performance Computing, Data and Analytics | 2011

An early performance analysis of POWER7-IH HPC systems

Kevin J. Barker; Adolfy Hoisie; Darren J. Kerbyson

In this work we present a performance evaluation of the POWER7-IH processor and of integrated systems built from it. We describe the architecture of the P7-IH with an emphasis on the characteristics that have a direct impact on performance for large-scale HPC systems and applications. An important area of emphasis is the memory and communication subsystems and their impact on achievable application performance. Results from a set of micro-benchmarks are presented that cover memory, communication, and OS-noise characteristics. In addition, results from several production-level applications are analyzed and their performance linked to the micro-benchmark results through the use of accurate performance models. The models are also employed to explore the achievable performance of these applications on much larger systems.
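
As one example of the micro-benchmark style referred to here, a fixed-work-quantum probe for OS noise can be sketched as follows; this is a generic illustration, not the benchmark used in the paper.

```c
/* Fixed-work-quantum style OS-noise probe: time the same small unit of
 * work many times; spikes above the minimum indicate interference from
 * the OS or other system activity. A generic sketch only. */
#include <stdio.h>
#include <time.h>

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    enum { SAMPLES = 100000, WORK = 20000 };
    static double dt[SAMPLES];
    volatile double x = 0.0;

    for (int i = 0; i < SAMPLES; ++i) {
        double t0 = now();
        for (int k = 0; k < WORK; ++k)   /* fixed work quantum */
            x += k * 1e-9;
        dt[i] = now() - t0;
    }

    double min = dt[0], max = dt[0];
    for (int i = 1; i < SAMPLES; ++i) {
        if (dt[i] < min) min = dt[i];
        if (dt[i] > max) max = dt[i];
    }
    printf("min %.3f us  max %.3f us  (max/min %.2f)\n",
           min * 1e6, max * 1e6, max / min);
    return 0;
}
```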


Future Generation Computer Systems | 2014

A performance comparison of current HPC systems: Blue Gene/Q, Cray XE6 and InfiniBand systems

Darren J. Kerbyson; Kevin J. Barker; Abhinav Vishnu; Adolfy Hoisie

We present here a performance analysis of three current architectures that have become commonplace in the High Performance Computing world. Blue Gene/Q is the third generation of systems from IBM that use modestly performing cores, but at large scale, in order to achieve high performance. The XE6 is the latest in a long line of Cray systems that use a 3-D torus topology, but the first to use the Gemini interconnection network. InfiniBand provides the flexibility of using compute nodes from many vendors that can be connected in many possible topologies. The performance characteristics of each vary vastly, and the way in which nodes are allocated in each type of system can significantly impact achieved performance. In this work we compare these three systems using a combination of micro-benchmarks and a set of production applications. In addition, we examine the differences in performance variability observed on each system and quantify the lost performance using a combination of empirical measurements and performance models. Our results show that significant performance can be lost in normal production operation of the Cray XE6 and InfiniBand clusters in comparison to Blue Gene/Q. Highlights: performance analysis of sub-system and production-application performance on three HPC systems; demonstration of the higher scalability of Blue Gene/Q in comparison to Cray XE6 and InfiniBand clusters; quantification of the performance lost due to production usage using validated performance models.


International Workshop on Energy Efficient Supercomputing | 2014

On the feasibility of dynamic power steering

Kevin J. Barker; Darren J. Kerbyson; Eric Anger

While high performance has always been the primary constraint behind large-scale system design, future systems will be built with increasing energy efficiency in mind. Mechanisms such as fine-grained power scaling and gating will provide tools with which system-software and application developers can ensure the most efficient use of tightly constrained power budgets. To date, such approaches have focused on node-level optimizations to impact overall system energy efficiency. In this work we introduce Dynamic Power Steering, in which power can be dynamically routed across a system to the resources where it will be of most benefit, and away from other resources, so as to maintain a near-constant overall power budget. This higher-level algorithmic approach to improving energy efficiency considers the whole extent of the system being used by an application and can be applied to applications whose load imbalance varies over the course of execution. Using two classes of applications, namely those that contain wavefront-type processing and a particle-in-cell code, we quantify the benefit of Dynamic Power Steering for a variety of workload characteristics and derive insight into the ways in which workload behavior affects its applicability.
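
The steering policy itself is not specified in the abstract; one simple way to illustrate the idea is to redistribute a fixed machine-wide power budget in proportion to each node's measured load, as sketched below. The load metric and the set_node_power_cap() interface are hypothetical.

```c
/* Illustrative power-steering step: redistribute a fixed machine-wide
 * power budget so that heavily loaded nodes receive a larger cap. The
 * per-node load metric and set_node_power_cap() are hypothetical; this
 * is not the policy from the paper. */
extern void set_node_power_cap(int node, double watts);  /* hypothetical */

void steer_power(const double *load,      /* per-node load estimate */
                 int nodes,
                 double total_budget_w,   /* fixed system power budget */
                 double floor_w)          /* minimum cap per node */
{
    double sum = 0.0;
    for (int n = 0; n < nodes; ++n)
        sum += load[n];

    /* Every node keeps a floor; the remainder follows the load. */
    double steerable = total_budget_w - floor_w * nodes;
    if (steerable < 0.0 || sum <= 0.0) {
        for (int n = 0; n < nodes; ++n)
            set_node_power_cap(n, total_budget_w / nodes);
        return;
    }

    for (int n = 0; n < nodes; ++n)
        set_node_power_cap(n, floor_w + steerable * (load[n] / sum));
}
```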

Collaboration


Dive into Kevin J. Barker's collaborations.

Top Co-Authors

Darren J. Kerbyson (Pacific Northwest National Laboratory)
Abhinav Vishnu (Pacific Northwest National Laboratory)
Adolfy Hoisie (Pacific Northwest National Laboratory)
Andres Marquez (Pacific Northwest National Laboratory)
Joseph B. Manzano (Pacific Northwest National Laboratory)
Shuaiwen Leon Song (Pacific Northwest National Laboratory)
Joshua Landwehr (Pacific Northwest National Laboratory)
Joshua Suetterlein (Pacific Northwest National Laboratory)