Publication


Featured research published by Edgar A. León.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2009

Instruction-level simulation of a cluster at scale

Edgar A. León; Rolf Riesen; Arthur B. Maccabe; Patrick G. Bridges

Instruction-level simulation is necessary to evaluate new architectures. However, single-node simulation cannot predict the behavior of a parallel application on a supercomputer. We present a scalable simulator that couples a cycle-accurate node simulator with a supercomputer network model. Our simulator executes individual instances of IBM's Mambo PowerPC simulator on hundreds of cores. We integrated a NIC emulator into Mambo and model the network instead of fully simulating it. This decouples the individual node simulators and makes our design scalable. Our simulator runs unmodified parallel message-passing applications on hundreds of nodes. We can change network and detailed node parameters, inject network traffic directly into caches, and use different policies to decide when such injection is advantageous. This paper describes our simulator in detail, evaluates it, and demonstrates its scalability. We show its suitability for architecture research by evaluating the impact of cache injection on parallel application performance.


Proceedings of the 2015 International Symposium on Memory Systems | 2015

HpMC: An Energy-aware Management System of Multi-level Memory Architectures

Chun-Yi Su; David A. Roberts; Edgar A. León; Kirk W. Cameron; Bronis R. de Supinski; Gabriel H. Loh; Dimitrios S. Nikolopoulos

DRAM technology faces density and power challenges to increase capacity because of limitations of physical cell design. To overcome these limitations, system designers are exploring alternative solutions that combine DRAM and emerging NVRAM technologies. Previous work on heterogeneous memories focuses mainly on two system designs: PCache, a hierarchical, inclusive memory system, and HRank, a flat, non-inclusive memory system. We demonstrate that neither of these designs can universally achieve high performance and energy efficiency across a suite of HPC workloads. In this work, we investigate the impact of a number of multi-level memory designs on the performance, power, and energy consumption of applications. To achieve this goal and overcome the limited number of available tools to study heterogeneous memories, we created HMsim, an infrastructure that enables n-level, heterogeneous memory studies by leveraging existing memory simulators. We then propose HpMC, a new memory controller design that combines the best aspects of existing management policies to improve performance and energy. Our energy-aware memory management system dynamically switches between PCache and HRank based on the temporal locality of applications. Our results show that HpMC reduces energy consumption by 13% to 45% compared to PCache and HRank, while providing the same bandwidth and higher capacity than a conventional DRAM system.


High Performance Interconnects | 2007

Reducing the Impact of the Memory Wall for I/O Using Cache Injection

Edgar A. León; Kurt B. Ferreira; Arthur B. Maccabe

Cache injection addresses the continuing disparity between processor and memory speeds by placing data into a processor's cache directly from the I/O bus. This disparity adversely affects the performance of memory-bound applications including certain scientific computations, encryption, image processing, and some graphics applications. Cache injection can reduce memory latency and memory pressure for I/O. The performance of cache injection depends on several factors including timely usage of data, the amount of data, and the application's data usage patterns. We show that cache injection provides significant advantages over data prefetching by reducing the pressure on the memory controller by up to 96%. Despite its benefits, cache injection may degrade application performance due to early injection of data. To overcome this limitation, we propose injection policies to determine when and where to inject data. These policies are based on OS, compiler, and application information.


IEEE International Symposium on Workload Characterization | 2012

Model-based, memory-centric performance and power optimization on NUMA multiprocessors

Chun-Yi Su; Dong Li; Dimitrios S. Nikolopoulos; Kirk W. Cameron; Bronis R. de Supinski; Edgar A. León

Non-Uniform Memory Access (NUMA) architectures are ubiquitous in HPC systems. NUMA, along with other factors including socket layout, data placement, and memory contention, significantly increases the search space to find an optimal mapping of applications to NUMA systems. This search space may be intractable for online optimization and challenging for efficient offline search. This paper presents DyNUMA, a framework for dynamic optimization of programs on NUMA architectures. DyNUMA uses simple, memory-centric performance and energy models with non-linear terms to capture the complex and interacting effects of system layout, program concurrency, data placement, and memory controller contention. DyNUMA leverages an artificial neural network (ANN) with input, output, and intermediate layers that emulate program threads, memory controllers, processor cores, and their interactions. Using an ANN in conjunction with critical path analysis, DyNUMA autonomously optimizes programs for performance or energy-efficiency metrics. We used DyNUMA on a variety of benchmarks from the NPB and ASC Sequoia suites on three different architectures (a 16-core AMD Barcelona system, a 32-core AMD Magny-Cours system, and a 64-core Tilera TilePro64 system). Our results show that DyNUMA achieves on average 8.7% improvement in performance (12.9% in the best case), 16% improvement in Energy-Delay (30.6% in the best case) and 9.1% improvement in MFLOPS/Watt (10.7% in the best case) compared to the default Linux scheduling.


IEEE International Conference on Cloud Engineering | 2015

A Container-Based Approach to OS Specialization for Exascale Computing

Judicael Zounmevo; Swann Perarnau; Kamil Iskra; Kazutomo Yoshii; Roberto Gioiosa; Brian Van Essen; Maya Gokhale; Edgar A. León

Future exascale systems will impose several conflicting challenges on the operating system (OS) running on the compute nodes of such machines. On the one hand, the targeted extreme scale requires the kind of high resource usage efficiency that is best provided by lightweight OSes. At the same time, substantial changes in hardware are expected for exascale systems. Compute nodes are expected to host a mix of general-purpose and special-purpose processors or accelerators tailored for serial, parallel, compute-intensive, or I/O-intensive workloads. Similarly, the deeper and more complex memory hierarchy will expose multiple coherence domains and NUMA nodes in addition to incorporating nonvolatile RAM. This expected heterogeneity and complexity in workloads and hardware is incompatible with the simplicity that characterizes high-performance lightweight kernels. In this work, we describe the Argo Exascale node OS, our approach to providing, in a single kernel, the required OS environments for the two aforementioned conflicting goals. We build multiple OS specializations on top of a single Linux kernel coupled with multiple containers.


International Parallel and Distributed Processing Symposium | 2014

Characterizing the Impact of Program Optimizations on Power and Energy for Explicit Hydrodynamics

Edgar A. León; Ian Karlin

With the end of Dennard scaling, future systems will be constrained by power and energy. This will impact application developers by forcing them to restructure and optimize their algorithms in terms of these resources. In this paper, we analyze the impact of different code optimizations on power, energy, and execution time. Our optimizations include loop fusion, data structure transformations, global allocation, and compiler selection. We analyze the static and dynamic components of power and energy as applied to the processor chip and memory domains within a system. In addition, our analysis correlates energy and power changes with performance events and shows that data motion is highly correlated with memory power and energy usage, and that executed instructions are partially correlated with processor power and energy. Our results demonstrate key tradeoffs among power, energy, and execution time for explicit hydrodynamics via a representative kernel. In particular, we observe that loop fusion and compiler selection improve all objectives, while global allocation and data layout transformations present tradeoffs that are objective-dependent.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2016

Characterizing parallel scientific applications on commodity clusters: an empirical study of a tapered fat-tree

Edgar A. León; Ian Karlin; Abhinav Bhatele; Steven H. Langer; Chris Chambreau; Louis H. Howell; Trent D'Hooge; Matthew L. Leininger

Understanding the characteristics and requirements of applications that run on commodity clusters is key to properly configuring current machines and, more importantly, procuring future systems effectively. There are only a few studies, however, that are current and characterize realistic workloads. For HPC practitioners and researchers, this limits our ability to design solutions that will have an impact on real systems. We present a systematic study that characterizes applications with an emphasis on communication requirements. It includes cluster utilization data, identifying a representative set of applications from a U.S. Department of Energy laboratory, and characterizing their communication requirements. The driver for this work is understanding application sensitivity to a tapered fat-tree network. These results provided key insights into the procurement of our next generation commodity systems. We believe this investigation can provide valuable input to the HPC community in terms of workload characterization and requirements from a large supercomputing center.


High Performance Distributed Computing | 2011

Cache injection for parallel applications

Edgar A. León; Rolf Riesen; Kurt Brian Ferreira; Arthur B. Maccabe

For two decades, the memory wall has affected many applications in their ability to benefit from improvements in processor speed. Cache injection addresses this disparity for I/O by writing data into a processor's cache directly from the I/O bus. This technique reduces data latency and, unlike data prefetching, improves memory bandwidth utilization. These improvements are significant for data-intensive applications whose performance is dominated by compulsory cache misses. We present an empirical evaluation of three injection policies and their effect on the performance of two parallel applications and several collective micro-benchmarks. We demonstrate that the effectiveness of cache injection on performance is a function of the communication characteristics of applications, the injection policy, the target cache, and the severity of the memory wall. For example, we show that injecting message payloads to the L3 cache can improve the performance of network-bandwidth-limited applications. In addition, we show that cache injection improves the performance of several collective operations, but not all-to-all operations (an implementation-dependent result). Our study shows negligible pollution to the target caches.


International Conference on Cluster Computing | 2015

Optimizing Explicit Hydrodynamics for Power, Energy, and Performance

Edgar A. León; Ian Karlin; Ryan E. Grant

Practical considerations for future supercomputer designs will impose limits on both instantaneous power consumption and total energy consumption. Working within these constraints while providing the maximum possible performance, application developers will need to optimize their code for speed alongside power and energy concerns. This paper analyzes the effectiveness of several code optimizations including loop fusion, data structure transformations, and global allocations. A per component measurement and analysis of different architectures is performed, enabling the examination of code optimizations on different compute subsystems. Using an explicit hydrodynamics proxy application from the U.S. Department of Energy, LULESH, we show how code optimizations impact different computational phases of the simulation. This provides insight for simulation developers into the best optimizations to use during particular simulation compute phases when optimizing code for future supercomputing platforms. We examine and contrast both x86 and Blue Gene architectures with respect to these optimizations.


Local Computer Networks | 2002

Instrumenting LogP parameters in GM: implementation and validation

Edgar A. León; Arthur B. Maccabe; Ron Brightwell

This paper describes an apparatus that can be used to vary communication performance parameters for MPI applications, and provides a tool to analyze the impact of communication performance on parallel applications. Our apparatus is based on Myrinet (along with GM). We use an extension of the LogP model to allow greater flexibility in determining the parameters to which parallel applications may be most sensitive. We show that individual communication parameters can be controlled within a small percentage error, and that the other parameters remain unchanged.

Collaboration


Explore Edgar A. León's collaborations.

Top Co-Authors

Ian Karlin, Lawrence Livermore National Laboratory
Arthur B. Maccabe, Oak Ridge National Laboratory
Abhinav Bhatele, Lawrence Livermore National Laboratory
Bronis R. de Supinski, Lawrence Livermore National Laboratory
Dong Li, Oak Ridge National Laboratory
Louis H. Howell, Lawrence Livermore National Laboratory