Alejandro Rico
Barcelona Supercomputing Center
Publication
Featured research published by Alejandro Rico.
International Symposium on Performance Analysis of Systems and Software | 2011
Alejandro Rico; Alejandro Duran; Felipe Cabarcas; Yoav Etsion; Alex Ramirez; Mateo Valero
Over the past few years, computer architecture research has moved towards execution-driven simulation, due to the inability of traces to capture timing-dependent thread execution interleaving. However, trace-driven simulation has many advantages over execution-driven simulation that are being missed in multithreaded application simulations. We present a methodology to properly simulate multithreaded applications using trace-driven environments. We distinguish the intrinsic application behavior from the computation for managing parallelism. Application traces capture the intrinsic behavior in the sections of code that are independent from the dynamic multithreaded nature, and the points where parallelism-management computation occurs. The simulation framework is composed of a trace-driven simulation engine and a dynamic-behavior component that implements the parallelism-management operations for the application. Then, at simulation time, these operations are reproduced by invoking their implementation in the dynamic-behavior component. The decisions made by these operations are based on the simulated architecture, which allows sections of code taken from the trace to be dynamically rescheduled onto the target simulated components. As the captured sections of code are independent from the parallel state of the application, they can be simulated on the trace-driven engine, while the parallelism-management operations, which must be re-executed, are carried out by the execution-driven component, thus achieving the best of both trace- and execution-driven worlds. This simulation methodology creates several new research opportunities, including research on scheduling and other parallelism-management techniques for future architectures, and hardware support for programming models.
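The split described in this abstract can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the authors' framework: the trace is reduced to a list of task durations, the tasks are assumed independent, and the hypothetical `simulate` function plays the role of the dynamic-behavior component by making a greedy first-come scheduling decision at simulation time.

```python
import heapq

def simulate(task_durations, num_cores):
    """Replay traced task durations on a simulated machine.

    The durations come from the trace and never change. The choice of
    core is made here, at simulation time, so it adapts to the
    simulated architecture (assumption: independent tasks, greedy
    assignment to the earliest-free core).
    """
    core_free_at = [0] * num_cores      # time at which each core becomes idle
    heapq.heapify(core_free_at)
    for duration in task_durations:
        start = heapq.heappop(core_free_at)            # earliest-free core
        heapq.heappush(core_free_at, start + duration)
    return max(core_free_at)            # simulated makespan

trace = [4, 3, 2, 1, 2]                 # hypothetical traced task durations
one_core = simulate(trace, num_cores=1)
two_cores = simulate(trace, num_cores=2)
```

The same trace yields a different schedule and makespan for each simulated core count, which is the property a purely static trace replay would lack.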
High Performance Embedded Architectures and Compilers | 2012
Alejandro Rico; Felipe Cabarcas; Carlos Villavieja; Milan Pavlovic; Augusto Vega; Yoav Etsion; Alex Ramirez; Mateo Valero
Simulation is a key tool for computer architecture research. In particular, cycle-accurate simulators are extremely important for microarchitecture exploration and detailed design decisions, but they are slow and therefore unsuitable for simulating large-scale architectures, a purpose for which they were never intended. Moreover, microarchitecture design decisions are irrelevant, or even misleading, for early processor design stages and high-level explorations. This allows one to raise the abstraction level of the simulated architecture, and also the application abstraction level, as it does not necessarily have to be represented as an instruction stream. In this paper we introduce a definition of different application abstraction levels, and how these are employed in TaskSim, a multi-core architecture simulator, to provide several architecture modeling abstractions, and simulate large-scale architectures with hundreds of cores. We compare the simulation speed of these abstraction levels to the ones in existing simulation tools, and also evaluate their utility and accuracy. Our simulations show that a very high-level abstraction, which may be even faster than native execution, is useful for scalability studies on parallel applications; and that, by simulating only explicit memory transfers, we achieve accurate simulations for architectures using non-coherent scratchpad memories, with just a 25x slowdown compared to native execution. Furthermore, we revisit trace memory simulation techniques, which are more abstract than instruction-by-instruction simulations and provide an 18x simulation speedup.
Design, Automation, and Test in Europe | 2013
Nikola Rajovic; Alejandro Rico; James Vipond; Isaac Gelado; Nikola Puzovic; Alex Ramirez
The performance of High Performance Computing (HPC) systems is already limited by their power consumption. The majority of top HPC systems today are built from commodity server components that were designed for maximizing compute performance. The Mont-Blanc project aims at using low-power parts from the mobile domain for HPC. In this paper we present our first experiences with the use of mobile processors and accelerators for the HPC domain, based on the research performed in the project. We present an initial evaluation of the NVIDIA Tegra 2 and Tegra 3 mobile SoCs and the NVIDIA Quadro 1000M GPU with a set of HPC micro-benchmarks to assess their potential for energy-efficient HPC.
IEEE International Conference on High Performance Computing Data and Analytics | 2016
Nikola Rajovic; Alejandro Rico; F. Mantovani; Daniel Ruiz; Josep Oriol Vilarrubi; Constantino Gómez; Luna Backes; Diego Nieto; Harald Servat; Xavier Martorell; Jesús Labarta; Eduard Ayguadé; Chris Adeniyi-Jones; Said Derradji; Hervé Gloaguen; Piero Lanucara; Nico Sanna; Jean-François Méhaut; Kevin Pouget; Brice Videau; Eric Boyer; Momme Allalen; Axel Auweter; David Brayford; Daniele Tafani; Volker Weinberg; Dirk Brömmel; Rene Halver; Jan H. Meinke; Ramón Beivide
High-performance computing (HPC) is recognized as one of the pillars for further progress in science, industry, medicine, and education. Current HPC systems are being developed to overcome emerging architectural challenges in order to reach Exascale level of performance, projected for the year 2020. The much larger embedded and mobile market allows for rapid development of intellectual property (IP) blocks and provides more flexibility in designing an application-specific system-on-chip (SoC), in turn providing the possibility of balancing performance, energy efficiency, and cost. In the Mont-Blanc project, we advocate for HPC systems being built from such commodity IP blocks, currently used in embedded and mobile SoCs. As a first demonstrator of such an approach, we present the Mont-Blanc prototype: the first HPC system built with commodity SoCs, memories, and network interface cards (NICs) from the embedded and mobile domain, and off-the-shelf HPC networking, storage, cooling, and integration solutions. We present the system's architecture and evaluate both performance and energy efficiency. Further, we compare the system's abilities against a production-level supercomputer. Finally, we discuss parallel scalability and estimate the maximum scalability point of this approach across a set of applications.
International Conference on Supercomputing | 2015
Kallia Chronaki; Alejandro Rico; Rosa M. Badia; Eduard Ayguadé; Jesús Labarta; Mateo Valero
Current and future parallel programming models need to be portable and efficient when moving to heterogeneous multi-core systems. OmpSs is a task-based programming model with dependency tracking and dynamic scheduling. This paper describes the OmpSs approach to scheduling dependent tasks onto the asymmetric cores of a heterogeneous system. The proposed scheduling policy improves performance by prioritizing the newly created tasks at runtime, detecting the longest path of the dynamic task dependency graph, and assigning critical tasks to fast cores. While previous works use profiling information and are static, this dynamic scheduling approach uses information that is discoverable at runtime, which makes it implementable and functional without the need for an oracle or profiling. The evaluation results show that our proposal outperforms a dynamic implementation of Heterogeneous Earliest Finish Time by up to 1.15x, and the default breadth-first OmpSs scheduler by up to 1.3x in an 8-core heterogeneous platform and up to 2.7x in a simulated 128-core chip.
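The idea of finding the longest path in a task dependency graph and sending its tasks to fast cores can be sketched in Python. This is an illustrative, static simplification and not the OmpSs scheduler: the graph, the per-task costs, and the function names `bottom_levels` and `critical_path` are all assumptions made for the example, whereas the paper's policy works on a graph that is discovered at runtime.

```python
from functools import lru_cache

def bottom_levels(successors, cost):
    """Length of the longest path from each task to the end of the graph."""
    @lru_cache(maxsize=None)
    def level(task):
        below = max((level(s) for s in successors.get(task, ())), default=0)
        return cost[task] + below
    return {task: level(task) for task in cost}

def critical_path(successors, cost):
    """Follow the largest bottom level from the entry task to a sink."""
    levels = bottom_levels(successors, cost)
    task = max(levels, key=levels.get)      # entry task of the longest path
    path = [task]
    while successors.get(task):
        task = max(successors[task], key=levels.get)
        path.append(task)
    return path

# Hypothetical diamond-shaped graph: a -> {b, c} -> d, with assumed costs.
successors = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
cost = {"a": 1, "b": 5, "c": 2, "d": 1}

path = critical_path(successors, cost)
assignment = {t: ("fast" if t in path else "slow") for t in cost}
```

In this example the tasks on the longest path (`a`, `b`, `d`) are marked for fast cores and the off-path task `c` for a slow core.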
IEEE International Conference on High Performance Computing Data and Analytics | 2009
Alejandro Rico; Alex Ramirez; Mateo Valero
There is a clear industrial trend towards chip multiprocessors (CMP) as the most power-efficient way of further increasing performance. Heterogeneous CMP architectures take one more step along this power efficiency trend by using multiple types of processors, tailored to the workloads they will execute. Programming these CMP architectures has been identified as one of the main challenges in the near future, and programming heterogeneous systems is even more challenging. High-level programming models which allow the programmer to identify parallel tasks, and the runtime management of the inter-task dependencies, have been identified as a suitable model for programming such heterogeneous CMP architectures. In this paper we analyze the performance of Cell Superscalar, a task-based programming model for the Cell Broadband Engine Architecture, in terms of its scalability to a higher number of on-chip processors. Our results show that the low performance of the PPE component limits the scalability of some applications to fewer than 16 processors. Since the PPE has been identified as the limiting element, we perform a set of simulation studies evaluating the impact of out-of-order execution, branch prediction and larger caches on the task management overhead. We conclude that out-of-order execution is a very desirable feature, since it increases task management performance by 50%. We also identify memory latency as a fundamental aspect in performance, while the working set is not that large. We expect a significant performance impact if task management ran using a fast private memory to store the task dependency graph instead of relying on the cache hierarchy.
IEEE Micro | 2017
Nigel John Stephens; Stuart David Biles; Matthias Boettcher; Jacob Eapen; Mbou Eyole; Giacomo Gabrielli; Matt Horsnell; Grigorios Magklis; Alejandro Vicente Martinez; Nathanael Premillieu; Alastair Reid; Alejandro Rico; Paul Walker
This article describes the ARM Scalable Vector Extension (SVE). Several goals guided the design of the architecture. First was the need to extend the vector processing capability associated with the ARM AArch64 execution state to better address the computational requirements in domains such as high-performance computing, data analytics, computer vision, and machine learning. Second was the desire to introduce an extension that can scale across multiple implementations, both now and into the future, allowing CPU designers to choose the vector length most suitable for their power, performance, and area targets. Finally, the architecture should avoid imposing a software development cost as the vector length changes and where possible reduce it by improving the reach of compiler auto-vectorization technologies. SVE achieves these goals. It allows implementations to choose a vector register length between 128 and 2,048 bits. It supports a vector-length agnostic programming model that lets code run and scale automatically across all vector lengths without recompilation. Finally, it introduces several innovative features that begin to overcome some of the traditional barriers to autovectorization.
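The vector-length agnostic model mentioned in this abstract can be illustrated with a plain Python model of a predicated loop. This is a conceptual sketch only and does not use SVE instructions or intrinsics: the function `vla_add` and its `vector_lanes` parameter are hypothetical, and the `active` count merely stands in for a governing predicate that disables lanes past the end of the data.

```python
def vla_add(a, b, vector_lanes):
    """Element-wise add written without assuming a fixed vector length."""
    n = len(a)
    out = []
    i = 0
    while i < n:
        # Number of lanes enabled this iteration; the final iteration may
        # be partial, so no scalar tail loop is needed.
        active = min(vector_lanes, n - i)
        out.extend(x + y for x, y in zip(a[i:i + active], b[i:i + active]))
        i += vector_lanes
    return out

a = list(range(10))
b = [10] * 10
# The same loop gives the same result for any lane count.
results = [vla_add(a, b, lanes) for lanes in (2, 4, 16)]
```

The loop body never refers to a specific vector width, which is the property that lets one binary run across implementations with different register lengths.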
IEEE International Symposium on Workload Characterization | 2016
Victor Garcia; Juan Gómez-Luna; Thomas Grass; Alejandro Rico; Eduard Ayguadé; Antonio J. Peña
Heterogeneous systems are ubiquitous in the field of High-Performance Computing (HPC). Graphics processing units (GPUs) are widely used as accelerators for their enormous computing potential and energy efficiency; furthermore, on-die integration of GPUs and general-purpose cores (CPUs) enables unified virtual address spaces and seamless sharing of data structures, improving programmability and softening the entry barrier for heterogeneous programming. Although on-die GPU integration seems to be the trend among the major microprocessor manufacturers, there are still many open questions regarding the architectural design of these systems. This paper is a step towards understanding the effect of on-chip resource sharing between GPU and CPU cores, and in particular, the impact of last-level cache (LLC) sharing in heterogeneous computations. To this end, we analyze the behavior of a variety of heterogeneous GPU-CPU benchmarks on different cache configurations. We perform an evaluation of the popular Rodinia benchmark suite modified to leverage the unified memory address space. We find such GPGPU workloads to be mostly insensitive to changes in the cache hierarchy due to the limited interaction and data sharing between GPU and CPU. We then evaluate a set of heterogeneous benchmarks specifically designed to take advantage of the fine-grained data sharing and low-overhead synchronization between GPU and CPU cores that these integrated architectures enable. We show how these algorithms are more sensitive to the design of the cache hierarchy, and find that when GPU and CPU share the LLC, execution times are reduced by 25% on average, and energy-to-solution by over 20% for all benchmarks.
International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation | 2010
Felipe Cabarcas; Alejandro Rico; Yoav Etsion; Alex Ramirez
Memory bandwidth has always been a critical factor for the performance of many data-intensive applications. Increasing processor performance and the advent of single-chip multiprocessors have raised memory bandwidth demands beyond what a single commodity memory device can provide. The immediate solution is to use more than one memory device, and interleave data across them so they can be used in parallel as if they were a single device of higher bandwidth. In this paper we show that fine-grain memory interleaving on the evaluated many-core architectures with many DRAM channels is critical to achieving high memory bandwidth efficiency. Our results show that performance can degrade by up to 50% when the achievable bandwidth is far from the maximum installed.
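The effect of interleaving granularity can be illustrated with a short Python sketch. The mapping below (block index modulo channel count) is one common interleaving scheme and is assumed here for illustration; the abstract does not specify the exact mapping, channel count, or granularities the authors evaluated, and the names `channel_of` and `channel_histogram` are hypothetical.

```python
from collections import Counter

def channel_of(addr, num_channels, granularity):
    """Map a byte address to a memory channel by block interleaving."""
    return (addr // granularity) % num_channels

def channel_histogram(addresses, num_channels, granularity):
    """Count how many accesses land on each channel."""
    counts = Counter(channel_of(a, num_channels, granularity) for a in addresses)
    return [counts.get(c, 0) for c in range(num_channels)]

# A sequential 4 KiB stream issued as 64-byte accesses (assumed workload).
stream = range(0, 4096, 64)
fine = channel_histogram(stream, num_channels=4, granularity=64)
coarse = channel_histogram(stream, num_channels=4, granularity=4096)
```

With 64-byte interleaving the stream is spread evenly over all four channels, while with 4 KiB interleaving every access hits the same channel, so only a quarter of the installed bandwidth is usable for this stream.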
Proceedings of the Second International Forum on Next-Generation Multicore/Manycore Technologies | 2010
Augusto Vega; Alejandro Rico; Felipe Cabarcas; Alex Ramirez; Mateo Valero
The emergence of hardware accelerators, such as graphics processing units (GPUs), has challenged the interaction between processing elements (PEs) and main memory. In architectures like the Cell/B.E. or GPUs, the PEs incorporate local memories which are fed with data transferred from memory using direct memory accesses (DMAs). We expect that chip multiprocessors (CMP) with DMA-managed local memories will become more popular in the near future due to the increasing interest in accelerators. In this work we show that, in that case, the way cache hierarchies are conceived should be revised. Particularly for last-level caches, the norm today is to use latency-aware organizations. For instance, in dynamic nonuniform cache architectures (D-NUCA) data is migrated closer to the requester processor to optimize latency. However, in DMA-based scenarios, the memory system latency becomes irrelevant compared with the time consumed for moving the DMA data, so latency-aware designs are, a priori, inefficient. In this work, we revisit the last-level cache designs in DMA-based CMP architectures with master-worker execution. Two scenarios are evaluated. First, we consider a set of private caches with data replication across them, where coherency of the copies is ensured through a hardware protocol. In this scenario, a PE has a nearby copy of the datum, improving cache access latency. Second, we consider a partitioned cache, where the allocation of a datum to a cache block is determined based on its physical address. In this scenario, there are no copies of data, and access to a datum has a variable latency. In contrast with traditional load/store-based architectures, we found that the partitioned last-level cache scheme outperforms the cache with data replication for DMA-based scenarios.