Alex Ramírez | Researchain

Publication

Featured research published by Alex Ramírez.

IEEE International Conference on High Performance Computing, Data, and Analytics | 2016

The Mont-Blanc prototype: an alternative approach for HPC systems

Nikola Rajovic; Alejandro Rico; F. Mantovani; Daniel Ruiz; Josep Oriol Vilarrubi; Constantino Gómez; Luna Backes; Diego Nieto; Harald Servat; Xavier Martorell; Jesús Labarta; Eduard Ayguadé; Chris Adeniyi-Jones; Said Derradji; Hervé Gloaguen; Piero Lanucara; Nico Sanna; Jean-François Méhaut; Kevin Pouget; Brice Videau; Eric Boyer; Momme Allalen; Axel Auweter; David Brayford; Daniele Tafani; Volker Weinberg; Dirk Brömmel; Rene Halver; Jan H. Meinke; Ramón Beivide

High-performance computing (HPC) is recognized as one of the pillars for further progress in science, industry, medicine, and education. Current HPC systems are being developed to overcome emerging architectural challenges in order to reach Exascale-level performance, projected for the year 2020. The much larger embedded and mobile market allows for rapid development of intellectual property (IP) blocks and provides more flexibility in designing an application-specific system-on-chip (SoC), which in turn makes it possible to balance performance, energy efficiency, and cost. In the Mont-Blanc project, we advocate for HPC systems being built from such commodity IP blocks, currently used in embedded and mobile SoCs. As a first demonstrator of such an approach, we present the Mont-Blanc prototype: the first HPC system built with commodity SoCs, memories, and network interface cards (NICs) from the embedded and mobile domain, and off-the-shelf HPC networking, storage, cooling, and integration solutions. We present the system's architecture and evaluate both performance and energy efficiency. Further, we compare the system's abilities against a production-level supercomputer. Finally, we discuss parallel scalability and estimate the maximum scalability point of this approach across a set of applications.

IEEE Micro | 2015

Designing Efficient Heterogeneous Memory Architectures

Evgeny Bolotin; David W. Nellans; Oreste Villa; Mike O'Connor; Alex Ramírez; Stephen W. Keckler

Recent packaging technologies that enable DRAM chips to be stacked inside the processor package or on top of the processor chip can lower DRAM energy-per-bit costs, provide wider interfaces, and offer higher bandwidth. However, these technologies are limited in capacity and come at a higher price than traditional off-package memories, requiring system designers to balance price, performance, and capacity tradeoffs. The most obvious means to achieve this balance is to employ both on- and off-package memory in a heterogeneous memory architecture. However, designers must then decide whether to deploy the on-package memory as an additional cache-hierarchy level (controlled by hardware or software) or as a memory peer to the off-package DRAM in a NUMA configuration. This article presents a model and analysis of energy, bandwidth, and latency for current and emerging DRAM technologies that enable an exploration of memory hierarchies combining heterogeneous memory technologies with different attributes. The analysis shows that the gap between on- and off-package DRAM technologies is narrower than what is found between cache layers in traditional memory hierarchies. As a result, heterogeneous memory caches must achieve very high hit rates or risk degrading both system energy and bandwidth efficiency.
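The closing claim of this abstract can be illustrated with a deliberately simple energy model. This is a sketch under stated assumptions, not the model from the article: it assumes every request touches the on-package DRAM once (a hit reads it, a miss fills it) and that a miss additionally reads the off-package DRAM. The function names and the energy values are hypothetical and expressed in arbitrary relative units.

```python
def cached_energy_per_bit(hit_rate, e_on, e_off):
    """Average energy per requested bit when on-package DRAM acts as a cache.

    Assumption (not from the article): each request costs one on-package
    access, and each miss costs one extra off-package access.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be in [0, 1]")
    return e_on + (1.0 - hit_rate) * e_off


def break_even_hit_rate(e_on, e_off):
    """Hit rate at which the cache costs the same as going straight off-package.

    Solving e_on + (1 - h) * e_off = e_off for h gives h = e_on / e_off.
    """
    if e_on <= 0 or e_off <= 0:
        raise ValueError("energies must be positive")
    return e_on / e_off


if __name__ == "__main__":
    # Hypothetical relative energies per bit; off-package is normalized to 1.0.
    for label, e_on in [("wide gap (SRAM-like cache)", 0.1),
                        ("narrow gap (on-package DRAM)", 0.5)]:
        h = break_even_hit_rate(e_on, 1.0)
        print(f"{label}: break-even hit rate = {h:.2f}")
```

Under these assumptions the break-even hit rate equals the ratio of on-package to off-package energy per bit, so the narrower the gap between the two technologies, the higher the hit rate a cache needs before it saves any energy at all.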

International Conference on Program Comprehension | 2015

Limpio: LIghtweight MPI instrumentatiOn

Milan Pavlovic; Milan Radulovic; Alex Ramírez; Petar Radojković

Characterization of high-performance computing applications often has to be done without access to the source code. Computer architects, therefore, have a narrower choice of instrumentation tools. Moreover, the potentially large amount of collected data can make it impractical to create a full time-stamped event trace and analyze it post-mortem. This paper describes Limpio, a lightweight MPI instrumentation framework that allows dynamic instrumentation of user-selected MPI calls and customization of data gathering, analysis, and visualization.

IEEE International Symposium on Workload Characterization | 2016

Rebalancing the core front-end through HPC code analysis

Ugljesa Milic; Paul M. Carpenter; Alejandro Rico; Alex Ramírez

There is a need to increase performance under the same power and area envelope to achieve Exascale technology in high-performance computing (HPC). Today's chip multiprocessor (CMP) design is tailored to traditional desktop and server workloads, which differ from the parallel applications commonly run in HPC. In this work, we focus on HPC code characteristics and on the processor front-end, which accounts for around 30% of core power and area on the emerging lean-core type of processors used in HPC. Separating serial from parallel code sections inside applications, we characterize three HPC benchmark suites and compare them to a traditional set of desktop integer workloads. HPC applications have biased and mostly backward-taken branches, small dynamic instruction footprints, and long basic blocks. Our findings suggest smaller branch predictors (BP) with an additional loop BP, smaller branch target buffers (BTB), and smaller L1 instruction caches (I-cache) with wider lines. Still, this downsizing applies only to the cores meant to run parallel code. The difference between serial and parallel code sections in HPC applications points to an asymmetric CMP design, with one baseline core for sequential code and many HPC-tailored cores designed for parallel code. Predictions using the Sniper simulator and McPAT show that an HPC-tailored lean core saves 16% of the core area and 7% of power compared to a baseline core, without performance loss. Using the area savings to add an extra core, an asymmetric CMP with one baseline and eight tailored cores has the same area budget as a symmetric CMP composed of eight baseline cores, while demanding 4% more power and providing 12% shorter execution time on average.
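The area argument at the end of this abstract can be checked with a few lines of arithmetic. The sketch below uses only the 16% per-core area saving quoted above and measures area in units of one baseline core; the function names are hypothetical, and the calculation ignores any shared structures such as caches or interconnect.

```python
import math

AREA_SAVING = 0.16  # per-core area saving of an HPC-tailored core (from the abstract)


def tailored_core_area(saving=AREA_SAVING):
    """Area of one tailored core, in units of one baseline core."""
    return 1.0 - saving


def asymmetric_area(n_tailored, n_baseline=1, saving=AREA_SAVING):
    """Total core area of a CMP with baseline and tailored cores."""
    return n_baseline + n_tailored * tailored_core_area(saving)


def max_tailored_cores(area_budget, n_baseline=1, saving=AREA_SAVING):
    """Largest number of tailored cores that fits next to the baseline cores."""
    remaining = area_budget - n_baseline
    return math.floor(remaining / tailored_core_area(saving))


if __name__ == "__main__":
    symmetric = 8.0  # eight baseline cores
    print(f"1 baseline + 8 tailored cores: {asymmetric_area(8):.2f} baseline-core areas")
    print(f"tailored cores fitting in the symmetric budget: {max_tailored_cores(symmetric)}")
```

One baseline core plus eight tailored cores occupies about 7.72 baseline-core areas, which fits within the budget of eight baseline cores, and eight is also the largest number of tailored cores that fits under this simplified accounting.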

International Conference on Computer Design | 2015

Exploring multiple sleep modes in on/off based energy efficient HPC networks

Karthikeyan P. Saravanan; Paul M. Carpenter; Alex Ramírez

Energy efficiency is one of the key challenges in high-performance computing (HPC). The current target of 1 ExaFlop in 20 MW requires a ten-fold improvement in energy efficiency, which is only possible through significant improvements in energy efficiency throughout the system. Interconnects are particularly inefficient, since their links are always on, consuming full power in order to provide low latency, even though the average interconnect utilization is low. To address this, the Ethernet standards committee in charge of 40/100/400Gb Ethernet has opted to include protocols that define low-power modes, specifically Fast-Wake, alongside the older Deep-Sleep, to make interconnect links energy proportional. With these standards ratified as recently as March 2014, it is unclear how these low-power modes can be used in HPC. While energy efficiency is critical, techniques with excessive performance overheads are unlikely to be adopted in HPC. To this end, this paper performs the first detailed analysis of Fast-Wake mode for link energy savings in the context of HPC. Our results show that a combination of Fast-Wake and Deep-Sleep can reduce link energy by up to 70% with less than 1% performance overhead. However, we show that the parameters of these low-power modes must be carefully configured to obtain the right trade-offs in energy and performance. We believe that our analysis could benefit interconnect vendors looking to use these low-power modes for deployment in HPC.
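Two pieces of arithmetic sit behind this abstract: the efficiency implied by the 1 ExaFlop in 20 MW target, and the way time spent in low-power modes translates into link energy. The sketch below works through both. The per-state power fractions and the residency split are hypothetical placeholders chosen for illustration; they are not taken from the paper or from the Ethernet standard, and the names are invented for this example.

```python
from dataclasses import dataclass


def target_gflops_per_watt(flops=1e18, watts=20e6):
    """Efficiency implied by a performance target within a power budget."""
    return flops / watts / 1e9


@dataclass(frozen=True)
class LinkState:
    name: str
    power_fraction: float  # fraction of full link power drawn in this state


# Hypothetical power fractions, for illustration only.
ACTIVE = LinkState("active", 1.0)
FAST_WAKE = LinkState("fast_wake", 0.6)
DEEP_SLEEP = LinkState("deep_sleep", 0.1)


def relative_link_energy(residency):
    """Energy relative to an always-on link.

    residency maps each LinkState to the fraction of time spent in it;
    the fractions must sum to 1.
    """
    if abs(sum(residency.values()) - 1.0) > 1e-9:
        raise ValueError("residency fractions must sum to 1")
    return sum(state.power_fraction * share for state, share in residency.items())


def link_energy_saving(residency):
    """Fraction of link energy saved compared with an always-on link."""
    return 1.0 - relative_link_energy(residency)


if __name__ == "__main__":
    print(f"target efficiency: {target_gflops_per_watt():.0f} GFlops/W")
    # Hypothetical residency: mostly idle link, as the abstract describes.
    residency = {ACTIVE: 0.2, FAST_WAKE: 0.2, DEEP_SLEEP: 0.6}
    print(f"link energy saving: {link_energy_saving(residency):.0%}")
```

The target works out to 50 GFlops/W. The residency model captures energy only; it says nothing about wake-up latency, which is the source of the performance overhead the abstract warns about when these modes are configured poorly.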

Archive | 1998