Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Joshua Peraza is active.

Publications


Featured research published by Joshua Peraza.


International Conference on Cloud and Green Computing | 2012

Green Queue: Customized Large-Scale Clock Frequency Scaling

Ananta Tiwari; Michael A. Laurenzano; Joshua Peraza; Laura Carrington; Allan Snavely

We examine the scalability of a set of techniques related to Dynamic Voltage-Frequency Scaling (DVFS) on HPC systems to reduce the energy consumption of scientific applications through an application-aware analysis and runtime framework, Green Queue. Green Queue supports making CPU clock frequency changes in response to intra-node and inter-node observations about application behavior. Our intra-node approach reduces CPU clock frequencies, and therefore power consumption, while CPUs lack computational work due to inefficient data movement. Our inter-node approach reduces clock frequencies for MPI ranks that lack computational work. We investigate these techniques on a set of large scientific applications on 1024 cores of Gordon, an Intel Sandy Bridge-based supercomputer at the San Diego Supercomputer Center. Our optimal intra-node technique showed an average measured energy savings of 10.6% and a maximum of 21.0% over regular application runs. Our optimal inter-node technique showed an average of 17.4% and a maximum of 31.7% energy savings.
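
For concreteness, a minimal sketch of the inter-node idea follows (this is not Green Queue itself): a lightly loaded MPI rank lowers its own core's clock frequency through the Linux cpufreq sysfs interface before blocking at a collective, then restores it afterwards. The frequencies, the imbalance threshold, and the stand-in compute kernel are illustrative assumptions.

```c
/* Sketch of inter-node DVFS: a rank with little work drops its core
 * frequency before blocking in a collective, then restores it.
 * Assumes the userspace cpufreq governor and writable sysfs; the
 * frequencies, threshold, and stand-in kernel are placeholders. */
#include <mpi.h>
#include <stdio.h>

static void set_cpu_khz(int cpu, long khz)
{
    char path[128];
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_setspeed", cpu);
    FILE *f = fopen(path, "w");
    if (f) { fprintf(f, "%ld", khz); fclose(f); }
}

void imbalanced_phase(int cpu, double my_work, double avg_work)
{
    if (my_work < 0.8 * avg_work)      /* this rank has little to do  */
        set_cpu_khz(cpu, 1200000);     /* drop to a low frequency     */

    volatile double acc = 0.0;         /* stand-in for the app kernel */
    for (long i = 0; i < (long)(my_work * 1e6); i++)
        acc += i * 1e-9;

    MPI_Barrier(MPI_COMM_WORLD);       /* heavier ranks catch up here */
    set_cpu_khz(cpu, 2600000);         /* restore nominal frequency   */
}
```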


European Conference on Parallel Processing | 2014

Characterizing the Performance-Energy Tradeoff of Small ARM Cores in HPC Computation

Michael A. Laurenzano; Ananta Tiwari; Adam Jundt; Joshua Peraza; William A. Ward; Roy L. Campbell; Laura Carrington

Deploying large numbers of small, low-power cores has been gaining traction recently as a system design strategy in high performance computing (HPC). The ARM platform, which dominates the embedded and mobile computing segments, is now being considered as an alternative to the high-end x86 processors that largely dominate HPC, because peak performance per watt may be substantially improved using off-the-shelf commodity processors.


Concurrency and Computation: Practice and Experience | 2016

PMaC's green queue: a framework for selecting energy optimal DVFS configurations in large scale MPI applications

Joshua Peraza; Ananta Tiwari; Michael A. Laurenzano; Laura Carrington; Allan Snavely

This article presents Green Queue, a production quality tracing and analysis framework for implementing application-aware dynamic voltage and frequency scaling (DVFS) for Message Passing Interface (MPI) applications in high performance computing. Green Queue makes use of both intertask and intratask DVFS techniques. The intertask technique targets applications where the workload is imbalanced by reducing CPU clock frequency, and therefore power draw, for ranks with lighter workloads. The intratask technique targets balanced workloads where all tasks are synchronously running the same code. The strategy identifies program phases and selects the energy-optimal frequency for each by predicting power and measuring the performance responses of each phase to frequency changes. The success of these techniques is evaluated on 1024 cores of Gordon, a supercomputer at the San Diego Supercomputer Center built using Intel Xeon E5-2670 (Sandy Bridge) processors. Green Queue achieves up to 21% and 32% energy savings for the intratask and intertask DVFS strategies, respectively.
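
The per-phase selection step described above can be illustrated with a small sketch: estimate the energy of each candidate frequency as predicted power multiplied by phase runtime and keep the minimum. The input arrays are assumed to come from power prediction and timing measurements; this is not Green Queue's actual model.

```c
/* Sketch of per-phase frequency selection: estimate energy at each
 * candidate frequency as predicted power times phase runtime and keep
 * the minimum. The inputs are assumed, not Green Queue's own models. */
#include <stddef.h>

long pick_energy_optimal_khz(const long *freqs_khz,
                             const double *watts,    /* predicted power  */
                             const double *seconds,  /* phase runtime    */
                             size_t n)
{
    long best_f = freqs_khz[0];
    double best_e = watts[0] * seconds[0];
    for (size_t i = 1; i < n; i++) {
        double e = watts[i] * seconds[i];   /* energy = power * time */
        if (e < best_e) { best_e = e; best_f = freqs_khz[i]; }
    }
    return best_f;   /* frequency with the lowest estimated energy */
}
```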


IEEE International Conference on High Performance Computing Data and Analytics | 2015

Performance and energy efficiency analysis of 64-bit ARM using GAMESS

Ananta Tiwari; Kristopher Keipert; Adam Jundt; Joshua Peraza; Sarom Sok Leang; Michael A. Laurenzano; Mark S. Gordon; Laura Carrington

Power efficiency is one of the key challenges facing the HPC co-design community, sparking interest in the ARM processor architecture as a low-power, high-efficiency alternative to the high-powered systems that dominate today. Recent advances in the ARM architecture, including the introduction of 64-bit support, have only fueled more interest in ARM. While ARM-based clusters have proven to be useful for data server applications, their viability for HPC applications requires an in-depth analysis of on-node and inter-node performance. To that end, as a co-design exercise, the viability of a commercially available 64-bit ARM cluster is investigated in terms of performance and energy efficiency with the widely used quantum chemistry package GAMESS. The performance and energy efficiency metrics are also compared to a conventional x86 Intel Ivy Bridge system. A 2:1 Moonshot core to Ivy Bridge core performance ratio is observed for the GAMESS calculation types considered. Doubling the number of cores to complete the execution faster on the 64-bit ARM cluster leads to better energy efficiency compared to the Ivy Bridge system; i.e., a 32-core execution of a GAMESS calculation has approximately the same performance as, and better energy-to-solution than, a 16-core execution of the same calculation on the Ivy Bridge system.
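
The comparison above reduces to energy-to-solution, i.e., average power multiplied by wall-clock time: with a 2:1 per-core performance ratio, a 32-core ARM run and a 16-core Ivy Bridge run take roughly the same time, so the comparison comes down to their average power draws. A minimal hedged helper is shown below; its inputs are placeholders, not measured values from the paper.

```c
/* Energy-to-solution is average power times wall-clock time; with a
 * 2:1 per-core performance ratio, doubling the ARM core count roughly
 * equalizes runtime, so the energy comparison reduces to comparing
 * average power draws. Inputs are placeholders, not measured values. */
double energy_to_solution_joules(double avg_power_watts,
                                 double wall_time_seconds)
{
    return avg_power_watts * wall_time_seconds;
}
```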


IEEE International Conference on High Performance Computing Data and Analytics | 2012

A Static Binary Instrumentation Threading Model for Fast Memory Trace Collection

Michael A. Laurenzano; Joshua Peraza; Laura Carrington; Ananta Tiwari; William A. Ward; Roy L. Campbell

In order to achieve a high level of performance, data intensive applications such as the real-time processing of surveillance feeds from unmanned aerial vehicles will require the strategic application of multi/many-core processors and coprocessors using a hybrid of inter-process message passing (e.g., MPI and SHMEM) and intra-process threading (e.g., pthreads and OpenMP). To facilitate program design decisions, memory traces gathered through binary instrumentation can be used to understand the low-level interactions between a data intensive code and the memory subsystem of a multi-core processor or many-core coprocessor. Toward this end, this paper introduces the addition of threading support for PMaC's Efficient Binary Instrumentation Toolkit for Linux/x86 (PEBIL) and compares PEBIL's threading model to the threading models of two other popular Linux/x86 binary instrumentation platforms, Pin and Dyninst, on both theoretical and empirical grounds. The empirical comparisons are based on experiments that collect memory address traces for the OpenMP-threaded implementations of the NASA Advanced Supercomputing Parallel Benchmarks (NPBs). This work shows that the overhead of collecting full memory address traces for multithreaded programs is higher in PEBIL (7.7x) than in Pin (4.7x), both of which are significantly lower than Dyninst (897x). This work also shows that PEBIL, uniquely, is able to take advantage of interval-based sampling of a memory address trace by rapidly disabling and re-enabling instrumentation at the transitions into and out of sampling periods in order to achieve significant decreases in the overhead of memory address trace collection. For collecting the memory address streams of each of the NPBs at a 10% sampling rate, PEBIL incurs an average slowdown of 2.9x, compared to 4.4x with Pin and 897x with Dyninst.
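
The interval-based sampling idea can be sketched as follows: record addresses only inside periodic sampling windows (a 10% duty cycle here) and skip the recording work otherwise. The window lengths and output handling are illustrative assumptions, not PEBIL's internals, which go further by disabling the instrumentation itself outside sampling periods.

```c
/* Sketch of interval-based sampling of a memory address trace. */
#include <stdint.h>
#include <stdio.h>

#define WINDOW_LEN  100000ULL   /* accesses per full interval     */
#define SAMPLE_LEN   10000ULL   /* accesses recorded per interval */

static uint64_t access_count = 0;

/* Callback an instrumentation tool would invoke at each memory access. */
void on_memory_access(uint64_t address)
{
    uint64_t pos = access_count++ % WINDOW_LEN;
    if (pos < SAMPLE_LEN)       /* inside the sampling window */
        printf("%llu 0x%llx\n",
               (unsigned long long)access_count,
               (unsigned long long)address);
    /* outside the window: near-zero extra work here; PEBIL instead
     * disables and later re-enables the instrumentation entirely */
}
```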


International Workshop on Energy Efficient Supercomputing | 2015

Compute bottlenecks on the new 64-bit ARM

Adam Jundt; Allyson Cauble-Chantrenne; Ananta Tiwari; Joshua Peraza; Michael A. Laurenzano; Laura Carrington

The trifecta of power, performance, and programmability has spurred significant interest in the 64-bit ARMv8 platform. These new systems provide energy efficiency, a traditional CPU programming model, and the potential of high performance when enough cores are thrown at the problem. However, it remains unclear how well the ARM architecture will work as a design point for the High Performance Computing market. In this paper, we characterize and investigate the key architectural factors that impact power and performance on a current ARMv8 offering (X-Gene 1) and Intel's Sandy Bridge processor. Using Principal Component Analysis, multiple linear regression models, and variable importance analysis, we conclude that the CPU frontend has the biggest impact on performance on both the X-Gene and Sandy Bridge processors.
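
As a much-simplified illustration of the modeling step (not the paper's PCA plus variable-importance pipeline), the sketch below fits an ordinary least-squares line relating a single hardware counter, e.g., frontend stall cycles, to runtime across benchmark runs; the counter choice is an assumption.

```c
/* One-variable least-squares fit of runtime against a hardware counter.
 * The paper uses PCA plus multiple linear regression; this shows only
 * the shape of the regression step. Assumes n >= 2 and non-constant x. */
#include <stddef.h>

void fit_line(const double *x, const double *y, size_t n,
              double *slope, double *intercept)
{
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (size_t i = 0; i < n; i++) {
        sx  += x[i];        sy  += y[i];
        sxx += x[i] * x[i]; sxy += x[i] * y[i];
    }
    double denom = (double)n * sxx - sx * sx;
    *slope = ((double)n * sxy - sx * sy) / denom;
    *intercept = (sy - *slope * sx) / (double)n;
}
```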


International Conference on Cluster Computing | 2013

Understanding the performance of stencil computations on Intel's Xeon Phi

Joshua Peraza; Ananta Tiwari; Michael A. Laurenzano; Laura Carrington; William A. Ward; Roy L. Campbell

Accelerators are becoming prevalent in high performance computing as a way of achieving increased computational capacity within a smaller power budget. Effectively utilizing the raw compute capacity made available by these systems, however, remains a challenge because it can require a substantial investment of programmer time to port and optimize code to effectively use novel accelerator hardware. In this paper we present a methodology for isolating and modeling the performance of common performance-critical patterns of code (so-called idioms) and other relevant behavioral characteristics from large-scale HPC applications that are likely to perform favorably on the Intel Xeon Phi. The benefits of the methodology are twofold: (1) it directs programmer efforts toward the regions of code most likely to benefit from porting to the Xeon Phi and (2) it provides speedup estimates for porting those regions of code. We then apply the methodology to the stencil idiom, showing performance improvements of up to a factor of 4.7× on stencil-based benchmark codes.
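
A minimal example of the stencil idiom the methodology targets is shown below: a 5-point 2-D Jacobi sweep with OpenMP, the kind of regular, unit-stride loop nest expected to vectorize and thread well on the Xeon Phi. The array size and kernel are illustrative, not taken from the benchmarks in the paper.

```c
/* A 5-point 2-D Jacobi sweep: an instance of the stencil idiom.
 * Regular unit-stride inner loops like this map well onto wide
 * vector units. The size N is an illustrative placeholder. */
#define N 1024

void jacobi_sweep(double out[N][N], double in[N][N])
{
    #pragma omp parallel for
    for (int i = 1; i < N - 1; i++) {
        #pragma omp simd
        for (int j = 1; j < N - 1; j++)
            out[i][j] = 0.25 * (in[i - 1][j] + in[i + 1][j] +
                                in[i][j - 1] + in[i][j + 1]);
    }
}
```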


IEEE International Conference on High Performance Computing Data and Analytics | 2012

Efficient HPC Data Motion via Scratchpad Memory

Kayla O Seager; Ananta Tiwari; Michael A. Laurenzano; Joshua Peraza; Pietro Cicotti; Laura Carrington

The energy required to move data accounts for a significant portion of the energy consumption of a modern supercomputer. To make systems of today more energy efficient and to bring exascale computing closer to the realm of possibilities, data motion must be made more energy efficient. Because the motion of each bit throughout the memory hierarchy has a large energy and performance cost, energy efficiency will improve if we can ensure that only the bits absolutely necessary for the computation are moved through the hierarchy. Toward reaching that end, in this work we explore the possible benefits of using a software-managed scratchpad memory for HPC applications. Our goal is to observe how data movement (and the associated energy costs) changes when we utilize software-managed scratchpad memory (SPM) instead of the traditional hardware-managed caches. Using an approximate but plausible model for the behavior of SPM, we show via memory simulation tools that HPC applications can benefit from hardware containing both scratchpad and traditional cache memory in order to move an average of 39% fewer bits to and from main memory, with a maximum improvement of 69%.
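
The contrast the paper studies can be sketched briefly: a hardware cache moves whole lines on demand, whereas software-managed scratchpad code copies in only the words it will actually use before computing on them. The spm buffer and block size below are hypothetical stand-ins; real scratchpad hardware exposes its own interface.

```c
/* Sketch of software-managed scratchpad use: stage only the needed
 * words into a small fast buffer, compute there, then move on. spm[]
 * is a stand-in for a real scratchpad; a cache would instead pull in
 * whole lines around each touched element. */
#include <stddef.h>

#define SPM_WORDS 4096
static double spm[SPM_WORDS];

/* Sum every stride-th element of a large array, staging only those
 * elements through the "scratchpad" in blocks. */
double strided_sum(const double *big, size_t n, size_t stride)
{
    double total = 0.0;
    size_t used = 0;
    for (size_t i = 0; i < n; i += stride) {
        spm[used++] = big[i];                 /* move only the needed word */
        if (used == SPM_WORDS || i + stride >= n) {
            for (size_t k = 0; k < used; k++)
                total += spm[k];
            used = 0;
        }
    }
    return total;
}
```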


International Conference on Cluster Computing | 2015

VecMeter: Measuring Vectorization on the Xeon Phi

Joshua Peraza; Ananta Tiwari; William A. Ward; Roy L. Campbell; Laura Carrington

Wide vector units in Intel's Xeon Phi accelerator cards can significantly boost application performance when used effectively. However, there is a lack of performance tools that provide programmers with accurate information about the level of vectorization in their codes. This paper presents VecMeter, an easy-to-use tool to measure vectorization on the Xeon Phi. VecMeter utilizes binary instrumentation and therefore does not require source code modifications. This paper describes the design of VecMeter, demonstrates its accuracy, defines a metric for quantifying vectorization, and provides an example where the tool can guide code optimization to improve performance by up to 33%.
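
The paper defines a metric for quantifying vectorization; one plausible form (an assumption, since the exact definition is not reproduced here) is the fraction of element operations executed by vector instructions, weighting each vector instruction by its lane count (8 double-precision lanes for the Xeon Phi's 512-bit units).

```c
/* A plausible vectorization metric (an assumption, not necessarily
 * VecMeter's exact definition): the fraction of element operations
 * performed by vector instructions, with each vector instruction
 * counted as `lanes` element operations. */
double vectorization_ratio(unsigned long long scalar_ops,
                           unsigned long long vector_insts,
                           unsigned lanes /* 8 for 512-bit doubles */)
{
    unsigned long long vector_ops = vector_insts * (unsigned long long)lanes;
    unsigned long long total = scalar_ops + vector_ops;
    return total ? (double)vector_ops / (double)total : 0.0;
}
```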


International Conference on Supercomputing | 2011

An idiom-finding tool for increasing productivity of accelerators

Laura Carrington; Mustafa M. Tikir; Catherine Olschanowsky; Michael A. Laurenzano; Joshua Peraza; Allan Snavely; Stephen W. Poole

Collaboration


Dive into Joshua Peraza's collaborations.

Top Co-Authors

Laura Carrington
San Diego Supercomputer Center

Ananta Tiwari
San Diego Supercomputer Center

William A. Ward
United States Department of Defense

Adam Jundt
San Diego Supercomputer Center

Allan Snavely
University of California

Joseph D. Baum
Science Applications International Corporation

Kayla O Seager
University of California