
Publications


Featured research published by Lluc Alvarez.


International Symposium on Computer Architecture | 2015

Coherence protocol for transparent management of scratchpad memories in shared memory manycore architectures

Lluc Alvarez; Lluis Vilanova; Miquel Moreto; Marc Casas; Marc Gonzàlez; Xavier Martorell; Nacho Navarro; Eduard Ayguadé; Mateo Valero

The increasing number of cores in manycore architectures causes important power and scalability problems in the memory subsystem. One solution is to introduce scratchpad memories alongside the cache hierarchy, forming a hybrid memory system. Scratchpad memories are more power-efficient than caches and they do not generate coherence traffic, but they suffer from poor programmability. A good way to hide the programmability difficulties from the programmer is to give the compiler the responsibility of generating code to manage the scratchpad memories. Unfortunately, compilers do not succeed in generating this code in the presence of random memory accesses with unknown aliasing hazards. This paper proposes a coherence protocol for the hybrid memory system that allows the compiler to always generate code to manage the scratchpad memories. In coordination with the compiler, memory accesses that may access stale copies of data are identified and diverted to the valid copy of the data. The proposal allows the architecture to be exposed to the programmer as a shared memory manycore, maintaining the programming simplicity of shared memory models and preserving backwards compatibility. In a 64-core manycore, the coherence protocol adds overheads of 4% in performance, 8% in network traffic and 9% in energy consumption to enable the usage of the hybrid memory system that, compared to a cache-based system, achieves a speedup of 1.14x and reduces on-chip network traffic and energy consumption by 29% and 17%, respectively.
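A minimal Python sketch of the access-diversion idea described above (all names are hypothetical; this models the concept, not the actual protocol): data mapped to a scratchpad leaves a possibly stale copy in shared memory, and accesses with unknown aliasing are diverted by the hardware to the valid copy.

```python
class HybridMemory:
    """Toy model of a hybrid cache/scratchpad memory system."""

    def __init__(self):
        self.shared = {}      # shared memory (cache hierarchy); may hold stale copies
        self.scratchpad = {}  # compiler-managed scratchpad copies

    def map_in(self, addr):
        # Compiler-generated transfer: copy data into the scratchpad.
        self.scratchpad[addr] = self.shared.get(addr, 0)

    def spm_store(self, addr, value):
        # Store emitted for data the compiler mapped to the scratchpad.
        self.scratchpad[addr] = value  # the shared copy is now stale

    def guarded_load(self, addr):
        # Access with unknown aliasing: check whether the address is
        # mapped in the scratchpad and divert to the valid copy.
        if addr in self.scratchpad:
            return self.scratchpad[addr]
        return self.shared.get(addr, 0)

    def write_back(self, addr):
        # Compiler-generated transfer: flush the valid copy back.
        self.shared[addr] = self.scratchpad.pop(addr)
```

A guarded load issued while `addr` is mapped returns the up-to-date scratchpad value instead of the stale shared copy; after `write_back`, both views agree again.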


European Conference on Parallel Processing | 2015

Runtime-aware architectures

Marc Casas; Miquel Moreto; Lluc Alvarez; Emilio Castillo; Dimitrios Chasapis; Timothy Hayes; Luc Jaulmes; Oscar Palomar; Osman S. Unsal; Adrián Cristal; Eduard Ayguadé; Jesús Labarta; Mateo Valero

In the last few years, the traditional ways to keep hardware performance increasing at the rate predicted by Moore's Law have vanished. When uni-cores were the norm, hardware design was decoupled from the software stack thanks to a well-defined Instruction Set Architecture (ISA). This simple interface allowed applications to be developed without worrying too much about the underlying hardware, while hardware designers were able to aggressively exploit instruction-level parallelism (ILP) in superscalar processors. Current multi-cores are designed as simple symmetric multiprocessors (SMP) on a chip. However, we believe that this is not enough to overcome all the problems that multi-cores face. The runtime system of the parallel programming model has to drive the design of future multi-cores to overcome the restrictions in terms of power, memory, programmability and resilience that multi-cores have. In this paper, we introduce an approach towards a Runtime-Aware Architecture (RAA), a massively parallel architecture designed from the runtime's perspective.


IEEE International Conference on High Performance Computing, Data and Analytics | 2012

Hardware-software coherence protocol for the coexistence of caches and local memories

Lluc Alvarez; Lluis Vilanova; Marc Gonzàlez; Xavier Martorell; Nacho Navarro; Eduard Ayguadé

Cache coherence protocols limit the scalability of chip multiprocessors. One solution is to introduce a local memory alongside the cache hierarchy, forming a hybrid memory system. Local memories are more power-efficient than caches and they do not generate coherence traffic, but they suffer from poor programmability. When non-predictable memory access patterns are found, compilers do not succeed in generating code because of the incoherence between the two storages. This paper proposes a coherence protocol for hybrid memory systems that allows the compiler to generate code even in the presence of memory aliasing problems. Coherence is ensured by a simple software/hardware co-design where the compiler identifies potentially incoherent memory accesses and the hardware diverts them to the correct copy of the data. The coherence protocol introduces overheads of 0.24% in execution time and of 1.06% in energy consumption to enable the usage of the hybrid memory system.


IEEE Transactions on Computers | 2015

Hardware–Software Coherence Protocol for the Coexistence of Caches and Local Memories

Lluc Alvarez; Lluis Vilanova; Marc Gonzàlez; Xavier Martorell; Nacho Navarro; Eduard Ayguadé

Cache coherence protocols limit the scalability of multicore and manycore architectures and are responsible for an important amount of the power consumed in the chip. A good way to alleviate these problems is to introduce a local memory alongside the cache hierarchy, forming a hybrid memory system. Local memories are more power-efficient than caches and do not generate coherence traffic, but they suffer from poor programmability. When non-predictable memory access patterns are found, compilers do not succeed in generating code because of the incoherence between the two storages. This paper proposes a coherence protocol for hybrid memory systems that allows the compiler to generate code even in the presence of memory aliasing problems. Coherence is ensured by a software/hardware co-design where the compiler identifies potentially incoherent memory accesses and the hardware diverts them to the correct copy of the data. The coherence protocol introduces overheads of 0.26% in execution time and of 2.03% in energy consumption to enable the usage of the hybrid memory system, which outperforms cache-based systems with a speedup of 38% and an energy reduction of 27%.


International Conference on Parallel Architectures and Compilation Techniques | 2015

Runtime-Guided Management of Scratchpad Memories in Multicore Architectures

Lluc Alvarez; Miquel Moreto; Marc Casas; Emilio Castillo; Xavier Martorell; Jesús Labarta; Eduard Ayguadé; Mateo Valero

The increasing number of cores and the anticipated level of heterogeneity in upcoming multicore architectures cause important problems in traditional cache hierarchies. A good way to alleviate these problems is to add scratchpad memories alongside the cache hierarchy, forming a hybrid memory hierarchy. This memory organization has the potential to improve performance and to reduce the power consumption and the on-chip network traffic, but exposing such a complex memory model to the programmer has a very negative impact on the programmability of the architecture. Emerging task-based programming models are a promising alternative to program heterogeneous multicore architectures. In these models the runtime system manages the execution of the tasks on the architecture, allowing many optimizations to be applied in a generic way at the runtime system level. This paper proposes giving the runtime system the responsibility to manage the scratchpad memories of a hybrid memory hierarchy in multicore processors, transparently to the programmer. In the envisioned system, the runtime system takes advantage of the information found in the task dependences to map the inputs and outputs of a task to the scratchpad memory of the core that is going to execute it. In addition, the paper exploits two mechanisms to overlap the data transfers with computation and a locality-aware scheduler to reduce the data motion. In a 32-core multicore architecture, the hybrid memory hierarchy outperforms cache-only hierarchies by up to 16%, reduces on-chip network traffic by up to 31% and saves up to 22% of the consumed power.
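The dependence-driven mapping can be illustrated with a toy Python runtime (hypothetical API, not the actual implementation): task inputs and outputs are staged into the scratchpad of the chosen core, and the locality-aware scheduler prefers the core that produced an input.

```python
class Core:
    def __init__(self, cid):
        self.cid = cid
        self.scratchpad = {}   # per-core scratchpad, managed by the runtime


class Runtime:
    def __init__(self, n_cores):
        self.cores = [Core(i) for i in range(n_cores)]
        self.memory = {}       # off-chip shared memory
        self.last_writer = {}  # region -> core that produced it (locality hint)

    def run_task(self, func, inputs, outputs):
        # Locality-aware choice: prefer a core that produced one of the inputs.
        core = next((self.cores[self.last_writer[r]] for r in inputs
                     if r in self.last_writer), self.cores[0])
        # Stage inputs into the scratchpad: the runtime, not the programmer.
        for r in inputs:
            core.scratchpad[r] = self.memory.get(r, 0)
        func(core.scratchpad)
        # Write outputs back and remember who produced them.
        for r in outputs:
            self.memory[r] = core.scratchpad[r]
            self.last_writer[r] = core.cid
        return core.cid
```

A consumer task reading region `'b'` is scheduled on the core whose scratchpad already produced `'b'`, so the data never moves.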


International Symposium on Performance Analysis of Systems and Software | 2009

Cetra: A trace and analysis framework for the evaluation of Cell BE systems

Julio Merino; Lluc Alvarez; Marisa Gil; Nacho Navarro

The Cell Broadband Engine Architecture (CBEA) is a heterogeneous multiprocessor architecture developed by Sony, Toshiba and IBM. The major implementation of this architecture is the Cell Broadband Engine (Cell for short), a processor that contains one generic PowerPC core and eight accelerators. The Cell is targeted at high-performance computing systems and consumer-level devices that have high computational requirements. The workloads for the former are generally run in a queue-based environment, while those for the latter are multiprogrammed.


IEEE Transactions on Parallel and Distributed Systems | 2018

Reducing Cache Coherence Traffic with a NUMA-Aware Runtime Approach

Paul Caheny; Lluc Alvarez; Said Derradji; Mateo Valero; Miquel Moreto; Marc Casas

Cache Coherent NUMA (ccNUMA) architectures are a widespread paradigm due to the benefits they provide for scaling core count and memory capacity. Also, the flat memory address space they offer considerably improves programmability. However, ccNUMA architectures require sophisticated and expensive cache coherence protocols to enforce correctness during parallel executions, which trigger a significant amount of on- and off-chip traffic in the system. This paper analyses how coherence traffic may be best constrained in a large, real ccNUMA platform comprising 288 cores through the use of a joint hardware/software approach. For several benchmarks, we study coherence traffic in detail under the influence of an added hierarchical cache layer in the directory protocol combined with runtime managed NUMA-aware scheduling and data allocation techniques to make most efficient use of the added hardware. The effectiveness of this joint approach is demonstrated by speedups of 3.14x to 9.97x and coherence traffic reductions of up to 99 percent in comparison to NUMA-oblivious scheduling and data allocation.
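A toy Python model of why NUMA-aware co-location reduces traffic (hypothetical names and a deliberately simplified cost model): each data block has a home node, and every access from another node counts as remote coherence traffic.

```python
class NumaSystem:
    """Toy ccNUMA model: counts remote accesses as a proxy for traffic."""

    def __init__(self, n_nodes):
        self.n_nodes = n_nodes
        self.home = {}            # block -> home NUMA node
        self.remote_accesses = 0

    def allocate(self, block, node):
        self.home[block] = node

    def run(self, task_blocks, node):
        # Every access to a block homed on another node is remote traffic.
        for b in task_blocks:
            if self.home[b] != node:
                self.remote_accesses += 1

    def numa_aware_node(self, task_blocks):
        # Schedule the task on the node that homes most of its data.
        counts = [0] * self.n_nodes
        for b in task_blocks:
            counts[self.home[b]] += 1
        return counts.index(max(counts))
```

Running a task on the node that homes most of its blocks cuts the remote-access count, which is the effect the runtime-managed scheduling and allocation exploit at scale.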


International Parallel and Distributed Processing Symposium | 2016

CATA: Criticality Aware Task Acceleration for Multicore Processors

Emilio Castillo; Miquel Moreto; Marc Casas; Lluc Alvarez; Enrique Vallejo; Kallia Chronaki; Rosa M. Badia; José Luis Bosque; Ramón Beivide; Eduard Ayguadé; Jesús Labarta; Mateo Valero

Managing criticality in task-based programming models opens a wide range of performance and power optimization opportunities in future manycore systems. Criticality-aware task schedulers can benefit from these opportunities by scheduling tasks to the most appropriate cores. However, these schedulers may suffer from priority inversion and static binding problems that limit their expected improvements. Based on the observation that task criticality information can be exploited to drive hardware reconfigurations, we propose a Criticality Aware Task Acceleration (CATA) mechanism that dynamically adapts the computational power of a task depending on its criticality. As a result, CATA achieves significant improvements over a baseline static scheduler, reaching average improvements of up to 18.4% in execution time and 30.1% in Energy-Delay Product (EDP) on a simulated 32-core system. The cost of reconfiguring hardware by means of a software-only solution rises with the number of cores due to lock contention and reconfiguration overhead. Therefore, novel architectural support is proposed to eliminate these overheads on future manycore systems. This architectural support minimally extends hardware structures already present in current processors, which allows further improvements in performance with negligible overhead. As a consequence, average improvements of up to 20.4% in execution time and 34.0% in EDP are obtained, outperforming state-of-the-art acceleration proposals not aware of task criticality.
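The intuition behind criticality-aware acceleration can be sketched in a few lines of Python (a deliberately simplified model, not the CATA mechanism itself): critical tasks form a serial chain, and running that chain on a faster core configuration shortens the overall makespan.

```python
FAST, SLOW = 2.0, 1.0  # relative speeds of the two core configurations


def makespan(tasks, accelerate_critical):
    """tasks: list of (work, is_critical) pairs.

    Critical tasks form the serial chain; non-critical tasks run
    concurrently on other cores at the slow configuration.
    """
    critical_time = sum(w / (FAST if accelerate_critical else SLOW)
                        for w, critical in tasks if critical)
    other = max((w / SLOW for w, critical in tasks if not critical),
                default=0.0)
    # Execution finishes when both the critical chain and the slowest
    # concurrent non-critical task are done.
    return max(critical_time, other)
```

With two critical tasks of 4 work units each and one 3-unit non-critical task, accelerating only the critical chain halves the makespan from 8 to 4 time units, without speeding up (or spending power on) the non-critical work.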


Languages and Compilers for Parallel Computing | 2009

Adaptive and speculative memory consistency support for multi-core architectures with on-chip local memories

Nikola Vujic; Lluc Alvarez; Marc Gonzalez Tallada; Xavier Martorell; Eduard Ayguadé

Software caching has been shown to be a robust approach in multi-core systems with no hardware support for transparent data transfers between local and global memories. A software cache provides the user with a transparent view of the memory architecture and considerably improves the programmability of such systems. But this software approach can suffer from poor performance due to considerable overheads related to the software mechanisms that maintain memory consistency. This paper presents a set of alternatives to mitigate these overheads. A specific write-back mechanism is introduced, based on some degree of speculation regarding the number of threads actually modifying the same cache lines. A case study based on the Cell BE processor is described. Performance evaluation indicates that improvements due to the optimized software-cache structures, combined with the proposed code optimizations, translate into speedup factors of 20% up to 40% compared to a traditional software cache approach.
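A Python sketch of the speculative write-back idea (hypothetical structures; the real implementation targets the Cell's local stores): assume a single writer per cache line and flush the whole line cheaply, falling back to a per-byte dirty-bit merge when the assumption fails.

```python
class SoftwareCacheLine:
    """Toy software-cache line with speculative write-back."""

    def __init__(self, size):
        self.data = [0] * size
        self.dirty = [False] * size  # per-byte dirty bits (conservative path)
        self.writers = set()         # threads that modified this line

    def store(self, tid, offset, value):
        self.data[offset] = value
        self.dirty[offset] = True
        self.writers.add(tid)

    def write_back(self, memory, base):
        if len(self.writers) <= 1:
            # Speculation holds: single writer, flush the whole line at once.
            memory[base:base + len(self.data)] = self.data
        else:
            # Fallback: merge only the bytes this cache actually modified,
            # so concurrent writers to the same line are not clobbered.
            for i, d in enumerate(self.dirty):
                if d:
                    memory[base + i] = self.data[i]
```

The single-writer fast path avoids the per-byte bookkeeping cost that dominates conservative software-cache consistency, which is where the speedups in the evaluation come from.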


International Conference on Supercomputing | 2018

Runtime-Guided Management of Stacked DRAM Memories in Task Parallel Programs

Lluc Alvarez; Marc Casas; Jesús Labarta; Eduard Ayguadé; Mateo Valero; Miquel Moreto

Stacked DRAM memories have become a reality in High-Performance Computing (HPC) architectures. These memories provide much higher bandwidth while consuming less power than traditional off-chip memories, but their limited memory capacity is insufficient for modern HPC systems. For this reason, both stacked DRAM and off-chip memories are expected to co-exist in HPC architectures, giving rise to different approaches for architecting the stacked DRAM in the system. This paper proposes a runtime approach to transparently manage stacked DRAM memories in task-based programming models. In this approach the runtime system is in charge of copying the data accessed by the tasks to the stacked DRAM, without complex hardware support or modifications to the application code. To mitigate the cost of copying data between the stacked DRAM and the off-chip memory, the proposal includes an optimization to parallelize the copies across idle or additional helper threads. In addition, the runtime system is aware of the reuse pattern of the data accessed by the tasks, and can exploit this information to avoid unnecessary copies of data to the stacked DRAM. Results on the Intel Knights Landing processor show that the proposed techniques achieve an average speedup of 14% against the state-of-the-art library to manage the stacked DRAM and 29% against a stacked DRAM architected as a hardware cache.
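A toy Python version of the reuse-aware copy decision (hypothetical policy and a toy capacity; not the actual runtime): a region is copied to the stacked DRAM only if later tasks will reuse it, so one-shot data skips the copy cost entirely.

```python
HBM_CAPACITY = 4  # regions that fit in the stacked DRAM (toy number)


def plan_copies(tasks):
    """tasks: list of lists of region ids accessed by each task.

    Returns the regions worth copying to stacked DRAM, i.e. those
    with remaining reuse after their first access, capacity permitting.
    """
    # Count how many times each region is accessed across all tasks.
    future_uses = {}
    for regions in tasks:
        for r in regions:
            future_uses[r] = future_uses.get(r, 0) + 1

    hbm, copies = set(), []
    for regions in tasks:
        for r in regions:
            future_uses[r] -= 1
            # Copy only if the region will be accessed again later.
            if future_uses[r] > 0 and r not in hbm and len(hbm) < HBM_CAPACITY:
                hbm.add(r)
                copies.append(r)
    return copies
```

In a task graph where region `'a'` is read by three tasks but `'b'` and `'c'` are each touched once, only `'a'` earns a copy; the one-shot regions are served directly from off-chip memory.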

Collaboration


Top co-authors of Lluc Alvarez:

Eduard Ayguadé (Barcelona Supercomputing Center)
Marc Casas (Barcelona Supercomputing Center)
Mateo Valero (Polytechnic University of Catalonia)
Miquel Moreto (Polytechnic University of Catalonia)
Xavier Martorell (Polytechnic University of Catalonia)
Marc Gonzàlez (Polytechnic University of Catalonia)
Nacho Navarro (Polytechnic University of Catalonia)
Jesús Labarta (Barcelona Supercomputing Center)
Lluis Vilanova (Polytechnic University of Catalonia)