Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where José M. Llabería is active.

Publication


Featured research published by José M. Llabería.


international symposium on computer architecture | 1992

Increasing the number of strides for conflict-free vector access

Mateo Valero; Tomás Lang; José M. Llabería; Montse Peiron; Eduard Ayguadé; Juan J. Navarro

Address transformation schemes, such as skewing and linear transformations, have been proposed to achieve conflict-free vector access for some strides in vector processors with multi-module memories. In this paper, we extend these schemes to achieve this conflict-free access for a larger number of strides. The basic idea is to perform an out-of-order access to vectors of fixed length, equal to that of the vector registers of the processor. Both matched and unmatched memories are considered; we show that the number of conflict-free strides is even larger in the latter case. The hardware for address calculations and access control is described and shown to be of similar complexity to that required for in-order access.
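
The scheme itself is hardware, but the scheduling idea can be sketched in a few lines of Python. The toy below is ours, not the paper's design: the module count, busy time (a matched memory, where busy time equals the number of modules), and the simple skewed storage scheme are illustrative assumptions. It compares in-order access against an out-of-order schedule that issues requests round-robin by module; for strides whose module sequence is bursty, the reordered access finishes earlier.

    M = 8            # number of memory modules (assumption)
    L = 16           # vector register length (assumption)
    BUSY = 8         # module busy time in cycles (matched memory: BUSY == M)

    def module(i, stride):
        # Simple skewed storage scheme (one of the address transformations
        # the paper builds on): element i lives at address i*stride.
        addr = i * stride
        return (addr + addr // M) % M

    def access_cycles(order, stride):
        """Cycles to serve all elements, issuing one request per cycle and
        stalling while the target module is still busy."""
        free = [0] * M               # cycle at which each module becomes free
        t = 0
        for i in order:
            m = module(i, stride)
            t = max(t + 1, free[m])  # stall until the module can accept it
            free[m] = t + BUSY
        return t

    def out_of_order(stride):
        """Reorder the L elements round-robin by module, so repeated visits
        to the same module are spread as far apart as possible."""
        queues = {}
        for i in range(L):
            queues.setdefault(module(i, stride), []).append(i)
        order, qs = [], list(queues.values())
        while any(qs):
            for q in qs:
                if q:
                    order.append(q.pop(0))
        return order

    for s in (1, 4, 5, 6):
        print(f"stride {s:2}: in-order {access_cycles(range(L), s):3} cycles,"
              f" out-of-order {access_cycles(out_of_order(s), s):3} cycles")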


ACM Transactions on Programming Languages and Systems | 2002

Register tiling in nonrectangular iteration spaces

Marta Jiménez; José M. Llabería; Agustin Fernández

Loop tiling is a well-known loop transformation generally used to expose coarse-grain parallelism and to exploit data reuse at the cache level. Tiling can also be used to exploit data reuse at the register level and to improve a program's ILP. However, previous proposals in the literature (as well as commercial compilers) are only able to perform multidimensional tiling for the register level when the iteration space is rectangular. In this article we present a new general algorithm to perform multidimensional tiling for the register level in both rectangular and nonrectangular iteration spaces. We also propose a simple heuristic to determine the tiling parameters at this level. Finally, we evaluate our method using as benchmarks typical linear algebra algorithms having nonrectangular iteration spaces and compare our proposal against hand-optimized vendor-supplied numerical libraries and against commercial compilers able to perform optimizing code transformations such as inner unrolling, unroll-and-jam, and software pipelining. Measurements were taken on three different superscalar microprocessors. Results show that our method outperforms the native compilers (with average speedups of 2.5) and matches the performance of vendor-supplied numerical libraries. The general conclusion is that compiler technology can make it possible for nonrectangular loop nests to achieve performance as high as that of hand-optimized codes.
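
In miniature, register tiling of a nonrectangular space looks like the following sketch (ours, not the article's general algorithm): a lower-triangular matrix-vector product with the i-loop tiled by two, accumulators held in scalar variables, and explicit code for the jagged edge of the triangle.

    import random

    N = 7
    A = [[random.random() for _ in range(N)] for _ in range(N)]
    x = [random.random() for _ in range(N)]

    def naive():
        y = [0.0] * N
        for i in range(N):
            for j in range(i + 1):        # triangular bound: j <= i
                y[i] += A[i][j] * x[j]
        return y

    def tiled():
        y = [0.0] * N
        i = 0
        while i + 1 < N:
            y0 = y1 = 0.0                 # accumulators held in "registers"
            for j in range(i + 1):        # rectangular core shared by both rows
                y0 += A[i][j] * x[j]
                y1 += A[i + 1][j] * x[j]
            y1 += A[i + 1][i + 1] * x[i + 1]   # jagged edge of the triangle
            y[i], y[i + 1] = y0, y1
            i += 2
        if i < N:                         # remainder row when N is odd
            for j in range(i + 1):
                y[i] += A[i][j] * x[j]
        return y

    assert all(abs(a - b) < 1e-9 for a, b in zip(naive(), tiled()))

The point of the transformation is that y0 and y1 stay in registers across the whole j-loop, so each x[j] is loaded once but reused twice.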


high-performance computer architecture | 2003

Tradeoffs in buffering memory state for thread-level speculation in multiprocessors

María Jesús Garzarán; Milos Prvulovic; José M. Llabería; Víctor Viñals; Lawrence Rauchwerger; Josep Torrellas

Thread-level speculation provides architectural support to aggressively run hard-to-analyze code in parallel. As speculative tasks run concurrently, they generate unsafe or speculative memory state that needs to be separately buffered and managed in the presence of distributed caches and buffers. Such state may contain multiple versions of the same variable. In this paper, we introduce a novel taxonomy of approaches to buffering and managing multi-version speculative memory state in multiprocessors. We also present a detailed complexity-benefit tradeoff analysis of the different approaches. Finally, we use numerical applications to evaluate the performance of the approaches under a single architectural framework. Our key insights are that support for buffering the state of multiple speculative tasks and versions per processor is more complexity-effective than support for merging the state of tasks with main memory lazily. Moreover, both supports can be gainfully combined and, in large machines, their effect is nearly fully additive. Finally, the more complex support for future state in main memory can boost performance when buffers are under pressure, but hurts performance when squashes are frequent.


ACM Transactions on Architecture and Code Optimization | 2005

Tradeoffs in buffering speculative memory state for thread-level speculation in multiprocessors

María Jesús Garzarán; Milos Prvulovic; José M. Llabería; Víctor Viñals; Lawrence Rauchwerger; Josep Torrellas

Thread-Level Speculation (TLS) provides architectural support to aggressively run hard-to-analyze code in parallel. As speculative tasks run concurrently, they generate unsafe or speculative memory state that needs to be separately buffered and managed in the presence of distributed caches and buffers. Such state may contain multiple versions of the same variable. In this paper, we introduce a novel taxonomy of approaches to buffer and manage multiversion speculative memory state in multiprocessors. We also present a detailed complexity-benefit tradeoff analysis of the different approaches. Finally, we use numerical applications to evaluate the performance of the approaches under a single architectural framework. Our key insights are that support for buffering the state of multiple speculative tasks and versions per processor is more complexity-effective than support for lazily merging the state of tasks with main memory. Moreover, both supports can be gainfully combined and, in large machines, their effect is nearly fully additive. Finally, the more complex support for storing future state in the main memory can boost performance when buffers are under pressure, but hurts performance when squashes are frequent.
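
As a functional toy of the state discussed in these two companion papers, the sketch below (our simplification; task IDs, the dictionary-based buffers, and the eager commit are illustrative choices) buffers one version per task and variable, lets a reader see its closest predecessor's version, merges state into memory on commit, and discards it on squash.

    memory = {"x": 0}
    buffers = {}                      # task id -> {variable: speculative value}

    def spec_write(task, var, val):
        buffers.setdefault(task, {})[var] = val

    def spec_read(task, var):
        for t in sorted((t for t in buffers if t <= task), reverse=True):
            if var in buffers[t]:
                return buffers[t][var]    # closest earlier speculative version
        return memory[var]                # fall back to safe state

    def commit(task):
        memory.update(buffers.pop(task, {}))   # merge task state with memory

    def squash(task):
        buffers.pop(task, None)                # speculative versions discarded

    spec_write(1, "x", 10)            # task 1 speculatively writes x
    spec_write(2, "x", 20)            # task 2 creates a second version of x
    assert spec_read(3, "x") == 20    # task 3 sees its closest predecessor
    squash(2)                         # e.g. a dependence violation
    assert spec_read(3, "x") == 10
    commit(1)
    assert memory["x"] == 10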


international conference on parallel architectures and compilation techniques | 2001

Recovery Mechanism for Latency Misprediction

Enric Morancho; José M. Llabería; Àngel Olivé

Signalling result availability from the functional units to the instruction scheduler can increase the cycle time and/or the effective latency of the instructions. Knowledge of all instruction latencies would allow the instruction scheduler to operate without the need for external signalling. However, the latency of some instructions is unknown; the scheduler can nevertheless optimistically predict the latency of these instructions and speculatively issue their dependent instructions. Although prediction techniques have great performance potential, their gain can vanish due to misprediction handling. For instance, holding speculatively scheduled instructions in the issue queue reduces its capacity to look ahead for independent instructions. This paper evaluates a recovery mechanism for latency mispredictions that retains the speculatively issued instructions in a structure apart from the issue queue: the recovery buffer. When data becomes available after a latency misprediction, the dependent instructions are re-issued from the recovery buffer. Moreover, to simplify the reissue logic, instructions are recorded in the recovery buffer in issue order. On mispredictions, the recovery buffer increases the effective capacity of the issue queue to hold instructions waiting for operands. Our evaluations on integer benchmarks show that the recovery-buffer mechanism reduces issue-queue size requirements by about 20-25%. This mechanism is also less sensitive to the verification delay than a recovery mechanism that retains the instructions in the issue queue.
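
A schematic sketch of the flow (ours, with a hypothetical four-instruction stream and deliberately trivial dependence tracking): dependents of an optimistically scheduled load are issued, leave the issue queue, and are parked in issue order in the recovery buffer, from which they are re-issued on a latency misprediction.

    from collections import deque

    # Registers whose current value came from a speculatively predicted
    # latency (here: a load optimistically assumed to hit in the cache).
    spec_regs = {"r1"}

    def parse(insn):
        op, regs = insn.split()
        dst, *srcs = regs.split(",")
        return dst, srcs

    def depends_on_load(insn):
        dst, srcs = parse(insn)
        if any(s in spec_regs for s in srcs):
            spec_regs.add(dst)       # its result is speculative too
            return True
        return False

    issue_queue = deque(["load r1", "add r2,r1", "mul r3,r2", "sub r4,r9"])
    recovery_buffer = deque()        # speculatively issued dependents, in issue order

    def issue_speculatively():
        """Issue everything under the optimistic latency prediction; dependents
        of the predicted load leave the issue queue for the recovery buffer,
        freeing issue-queue entries to look ahead for independent work."""
        while issue_queue:
            insn = issue_queue.popleft()
            print("issued:", insn)
            if depends_on_load(insn):
                recovery_buffer.append(insn)

    def on_latency_misprediction():
        """The load actually missed: once its data arrives, re-issue the
        parked dependents in the order they were recorded."""
        while recovery_buffer:
            print("re-issued:", recovery_buffer.popleft())

    issue_speculatively()
    on_latency_misprediction()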


international symposium on computer architecture | 2005

Store Buffer Design in First-Level Multibanked Data Caches

Enrique F. Torres; Pablo Ibáñez; Víctor Viñals; José M. Llabería

This paper focuses on how to design a store buffer (STB) well suited to first-level multibanked data caches. Our goal is to forward data from in-flight stores to dependent loads with the latency of a cache bank. For that we propose a particular two-level STB design in which forwarding is done speculatively from a distributed first-level STB made of extremely small banks, while a centralized, second-level STB enforces correct store-load ordering a few cycles later. To that end we have identified several important design decisions: i) delaying allocation of first-level STB entries until stores execute; ii) deallocating first-level STB entries before stores commit; and iii) selecting a recovery policy well-matched to data forwarding misspeculations. Moreover, the two-level STB admits two enhancements that simplify the design leaving performance almost unchanged: i) removing the data forwarding capability from the second-level STB; and ii) not checking instruction age in first-level STB prior to forwarding data to loads. Following our guidelines and running SPECint-2K over an 8-way out-of-order processor, a two-level STB (first level with four STB banks of 8 entries each) performs similarly to an ideal, single-level STB with 128-entry banks working at the first-level cache latency.
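
A highly simplified functional model of the two levels' division of labor (ours, not the paper's hardware; the bank count, entry counts, and address-to-bank mapping are placeholders): the first level forwards speculatively with no age check, while the second level keeps the complete program-ordered store stream and detects forwarding misspeculations a few cycles later.

    NUM_BANKS, L1_ENTRIES = 4, 8

    l1_stb = [[] for _ in range(NUM_BANKS)]   # tiny banks, filled at store execute
    l2_stb = []                               # complete, program-ordered stores

    def bank(addr):
        return addr % NUM_BANKS

    def store_execute(age, addr, val):
        b = l1_stb[bank(addr)]
        if len(b) == L1_ENTRIES:
            b.pop(0)                  # deallocate oldest before the store commits
        b.append((age, addr, val))
        l2_stb.append((age, addr, val))

    def load_forward(addr):
        """Speculative first-level forwarding: youngest matching entry in the
        bank, with no instruction-age check."""
        for age, a, v in reversed(l1_stb[bank(addr)]):
            if a == addr:
                return v
        return None                   # no match: read from the cache bank

    def load_verify(load_age, addr, forwarded):
        """Second-level check: the correct source is the youngest older store.
        A mismatch means a forwarding misspeculation and triggers recovery."""
        older = [(g, v) for g, a, v in l2_stb if a == addr and g < load_age]
        correct = older[-1][1] if older else None
        return forwarded == correct

    store_execute(0, 0x10, 1)
    store_execute(2, 0x10, 2)
    v = load_forward(0x10)            # speculatively forwards the value 2
    assert load_verify(3, 0x10, v)    # verified: load age 3 sees store age 2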


international symposium on microarchitecture | 2013

The reuse cache: downsizing the shared last-level cache

Jorge Albericio; Pablo Ibáñez; Víctor Viñals; José M. Llabería

Over recent years, a growing body of research has shown that a considerable portion of the shared last-level cache (SLLC) is dead, meaning that the corresponding cache lines are stored but will not receive any further hits before being replaced. Conversely, most hits observed by the SLLC come from a small subset of already reused lines. In this paper, we propose the reuse cache, a decoupled tag/data SLLC designed to store only the data of lines that have been reused. Thus, the size of the data array can be dramatically reduced. Specifically, we (i) introduce a selective data allocation policy to exploit reuse locality and maintain reused data in the SLLC, (ii) tune the data allocation with a suitable replacement policy and coherence protocol, and finally, (iii) explore different ways of organizing the data/tag arrays and study the performance sensitivity to the size of the resulting structures. The experimental part of this work investigates the ability of the reuse cache to maintain performance with decreasing sizes, simulating multiprogrammed and multithreaded workloads on an eight-core chip multiprocessor. As an example, we show that a reuse cache with a tag array equivalent to a conventional 4 MB cache and only a 1 MB data array performs as well as a conventional cache of 8 MB, requiring only 16.7% of the storage capacity.
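
Functionally, the selective allocation policy can be sketched as follows (our toy, assuming fully associative tag and data arrays with LRU replacement; the sizes are arbitrary): every line gets a tag entry on first touch, but data is allocated only when the tag hits again, i.e. once the line has demonstrated reuse.

    from collections import OrderedDict

    TAG_ENTRIES, DATA_ENTRIES = 8, 2
    tags = OrderedDict()    # line -> seen (tag array tracks many more lines)
    data = OrderedDict()    # line -> payload (small data array, reused lines only)

    def access(line):
        if line in data:
            data.move_to_end(line)            # data hit
            return "data hit"
        if line in tags:
            tags.move_to_end(line)
            if len(data) == DATA_ENTRIES:
                data.popitem(last=False)      # evict the LRU reused line
            data[line] = f"<data {line}>"     # allocate data on first reuse
            return "tag hit: data allocated"
        if len(tags) == TAG_ENTRIES:
            tags.popitem(last=False)
        tags[line] = True                     # first touch: tag only, no data
        return "miss: tag allocated only"

    for line in ["A", "B", "A", "C", "A"]:
        print(line, "->", access(line))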


high performance embedded architectures and compilers | 2012

ABS: A low-cost adaptive controller for prefetching in a banked shared last-level cache

Jorge Albericio; Ruben Gran; Pablo Ibáñez; Víctor Viñals; José M. Llabería

Hardware data prefetch is a very well known technique for hiding memory latencies. However, in a multicore system fitted with a shared Last-Level Cache (LLC), prefetch induced by a core consumes common resources such as shared cache space and main memory bandwidth. This may degrade the performance of other cores and even the overall system performance unless the prefetch aggressiveness of each core is controlled from a system standpoint. On the other hand, LLCs in commercial chip multiprocessors are more and more frequently organized in independent banks. In this contribution, we target, for the first time, prefetch in a banked LLC organization and propose ABS, a low-cost controller with a hill-climbing approach that runs stand-alone at each LLC bank without requiring inter-bank communication. Using multiprogrammed SPEC2K6 workloads, our analysis shows that the mechanism improves both user-oriented metrics (Harmonic Mean of Speedups by 27% and Fairness by 11%) and system-oriented metrics (Weighted Speedup increases 22% and Memory Bandwidth Consumption decreases 14%) over an eight-core baseline system that uses aggressive sequential prefetch with a fixed degree. Similar conclusions can be drawn by varying the number of cores or the LLC size, when running parallel applications, or when other prefetch engines are controlled.
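
The controller's core is a hill climber that each bank runs independently; the bare-bones sketch below is our simplification, not ABS's exact metrics or thresholds. Each epoch it perturbs the prefetch degree, keeps the direction of the move if the locally observed metric improved, and reverses it otherwise.

    class BankController:
        def __init__(self):
            self.degree = 4          # current prefetch degree for this bank
            self.step = 1            # direction of the last move (+1 or -1)
            self.last_metric = 0.0

        def end_of_epoch(self, metric):
            """metric: any locally measurable goodness (e.g. the bank's
            useful-prefetch ratio); no inter-bank communication is needed."""
            if metric < self.last_metric:
                self.step = -self.step            # last move hurt: reverse it
            self.degree = max(0, min(16, self.degree + self.step))
            self.last_metric = metric
            return self.degree

    ctrl = BankController()
    for m in (0.50, 0.55, 0.52, 0.58):   # per-epoch local metric samples
        print("degree for next epoch:", ctrl.end_of_epoch(m))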


international conference on parallel architectures and compilation techniques | 1998

Split last-address predictor

Enric Morancho; José M. Llabería; Àngel Olivé

Recent works have proposed using prediction techniques to speculatively execute true data-dependent operations. However, predictability is not spread uniformly among operations. We therefore propose run-time classification of instructions to increase the efficiency of the predictors. At run time, the proposed mechanism classifies instructions according to their predictability, decoupling this classification from the prediction table; the classification is then used to prevent unpredictable instructions from becoming candidates to allocate an entry in the prediction table. This idea of run-time classification is applied to the last-address predictor (Split Last-Address Predictor), whose goal is to reduce the latency of load instructions: the memory access is performed after the effective address is predicted, concurrently with instruction fetch, after which true data-dependent instructions can be executed speculatively. We show that our proposal captures the same predictability as the last-address predictor proposed in the literature, increases its accuracy, and reduces its area cost by 19%.
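
The decoupling can be sketched in a few lines (our illustration; the counter range and threshold are arbitrary, and for brevity the training address is kept alongside the classifier rather than in a separate structure): a load must saturate its confidence counter before it is allowed to allocate an entry in the last-address table.

    CLASS_MAX, PREDICT_AT = 3, 2     # saturating-counter range and threshold

    class_table = {}   # pc -> (last address seen, confidence); classification
    pred_table = {}    # pc -> predicted address; classified-predictable loads only

    def predict(pc):
        """Predicted effective address, or None: unclassified loads do not
        occupy (or thrash) the prediction table."""
        return pred_table.get(pc)

    def update(pc, addr):
        last, conf = class_table.get(pc, (None, 0))
        conf = min(conf + 1, CLASS_MAX) if addr == last else max(conf - 1, 0)
        class_table[pc] = (addr, conf)
        if conf >= PREDICT_AT:
            pred_table[pc] = addr     # classified predictable: allocate an entry
        else:
            pred_table.pop(pc, None)  # unpredictable loads stay out of the table

    for addr in (0x100, 0x100, 0x100, 0x100):
        print("prediction:", predict(0x40))   # pc 0x40: a load that repeats
        update(0x40, addr)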


IEEE Transactions on Parallel and Distributed Systems | 1995

Loop transformation using nonunimodular matrices

Agustin Fernández; José M. Llabería; Miguel Valero-García

Linear transformations are widely used to vectorize and parallelize loops. A subset of these transformations are unimodular transformations. When a unimodular transformation is used, the exact bounds of the transformed loop nest are easily computed and the steps of the loops are equal to 1. Unimodular loop transformations have been widely used since they permit the implementation of many useful loop transformations. Recently, nonunimodular transformations have been proposed to reduce communication requirements or to use the memory hierarchy efficiently. The methods used for unimodular transformations do not work in the case of nonunimodular transformations, since they do not produce the exact bounds of the transformed loop nest. In this paper, we present a method for nested loop transformation which gives the exact bounds for both unimodular and nonunimodular transformations. The basic idea is to use the Hermite Normal Form (HNF) of the transformation matrix.
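
The key fact the method rests on can be checked in a few lines of Python (our own toy for nonsingular 2x2 integer matrices; the example matrix is illustrative): the column-style Hermite Normal Form H = TU, with U unimodular, generates exactly the same image lattice as T, and since H is lower triangular with a positive diagonal, exact bounds and steps for the transformed loop nest can be read off its entries.

    def _ext_gcd(a, b):
        """Extended gcd: returns (x, y, g) with x*a + y*b == g == gcd >= 0."""
        if b == 0:
            return (1 if a >= 0 else -1), 0, abs(a)
        x, y, g = _ext_gcd(b, a % b)
        return y, x - (a // b) * y, g

    def hnf_2x2(T):
        """Column-style HNF of a nonsingular 2x2 integer matrix: H = T @ U
        with U unimodular; H is lower triangular with positive diagonal."""
        (a, b), (c, d) = T
        x, y, g = _ext_gcd(a, b)          # combine columns: gcd lands in row 0
        u, v = b // g, a // g
        H = [[g, 0], [c * x + d * y, -c * u + d * v]]
        if H[1][1] < 0:                   # negate column 2 (its top entry is 0)
            H[1][1] = -H[1][1]
        H[1][0] %= H[1][1]                # subtract multiples of column 2
        return H

    T = [[1, 1], [2, 0]]                  # nonunimodular: |det T| = 2
    H = hnf_2x2(T)
    print("H =", H)                       # [[1, 0], [0, 2]] for this T

    # Within a window, T and H generate the same image lattice, so the
    # transformed loop can step by H's diagonal (outer step 1, inner step 2).
    R = 8
    def image(A):
        return {(A[0][0]*i + A[0][1]*j, A[1][0]*i + A[1][1]*j)
                for i in range(-R, R) for j in range(-R, R)}
    window = lambda pts: {p for p in pts if max(abs(p[0]), abs(p[1])) <= R // 2}
    assert window(image(T)) == window(image(H))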

Collaboration


Dive into José M. Llabería's collaboration.

Top Co-Authors

Mateo Valero, Polytechnic University of Catalonia
Enric Morancho, Polytechnic University of Catalonia
Marta Jiménez, Polytechnic University of Catalonia
Juan J. Navarro, Polytechnic University of Catalonia
Agustin Fernández, Polytechnic University of Catalonia
Miguel Valero-García, Polytechnic University of Catalonia
Àngel Olivé, Polytechnic University of Catalonia
A. M. del Corral, Polytechnic University of Catalonia