Pablo Ibáñez
University of Zaragoza
Publications
Featured research published by Pablo Ibáñez.
international symposium on computer architecture | 2005
Enrique F. Torres; Pablo Ibáñez; Víctor Viñals; José M. Llabería
This paper focuses on how to design a store buffer (STB) well suited to first-level multibanked data caches. Our goal is to forward data from in-flight stores to dependent loads with the latency of a cache bank. For that we propose a particular two-level STB design in which forwarding is done speculatively from a distributed first-level STB made of extremely small banks, while a centralized, second-level STB enforces correct store-load ordering a few cycles later. To that end we have identified several important design decisions: i) delaying allocation of first-level STB entries until stores execute; ii) deallocating first-level STB entries before stores commit; and iii) selecting a recovery policy well-matched to data forwarding misspeculations. Moreover, the two-level STB admits two enhancements that simplify the design leaving performance almost unchanged: i) removing the data forwarding capability from the second-level STB; and ii) not checking instruction age in first-level STB prior to forwarding data to loads. Following our guidelines and running SPECint-2K over an 8-way out-of-order processor, a two-level STB (first level with four STB banks of 8 entries each) performs similarly to an ideal, single-level STB with 128-entry banks working at the first-level cache latency.
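A minimal sketch of the two-level forwarding flow described above, under simplified assumptions (no banking, Python stand-ins for hardware structures; all names here are hypothetical, not the authors' simulator):

```python
# Sketch of two-level store-buffer (STB) forwarding: speculative forwarding
# from a tiny first level, verified later against a larger second level.
from collections import deque

class TwoLevelSTB:
    def __init__(self, l1_entries=8, l2_entries=128):
        self.l1 = deque(maxlen=l1_entries)  # small bank: allocated at store *execute*
        self.l2 = deque(maxlen=l2_entries)  # centralized: kept until commit

    def store_execute(self, age, addr, data):
        # (i) allocate the first-level entry only when the store executes;
        # FIFO displacement deallocates it before the store commits (ii).
        self.l1.append((age, addr, data))
        self.l2.append((age, addr, data))

    def load_execute(self, addr):
        # Speculative forward from the youngest first-level match, *without*
        # checking instruction age (enhancement ii in the abstract).
        for s_age, s_addr, s_data in reversed(self.l1):
            if s_addr == addr:
                return s_data           # may be wrong; verified below
        return None                     # load reads from the cache instead

    def verify(self, age, addr, forwarded):
        # Second-level check a few cycles later: the correct producer is the
        # youngest *older* store to the same address; a mismatch triggers recovery.
        best_age, correct = -1, None
        for s_age, s_addr, s_data in self.l2:
            if s_addr == addr and best_age < s_age < age:
                best_age, correct = s_age, s_data
        return correct == forwarded
```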
international symposium on microarchitecture | 2013
Jorge Albericio; Pablo Ibáñez; Víctor Viñals; José M. Llabería
Over recent years, a growing body of research has shown that a considerable portion of the shared last-level cache (SLLC) is dead, meaning that the corresponding cache lines are stored but they will not receive any further hits before being replaced. Conversely, most hits observed by the SLLC come from a small subset of already reused lines. In this paper, we propose the reuse cache, a decoupled tag/data SLLC which is designed to only store the data of lines that have been reused. Thus, the size of the data array can be dramatically reduced. Specifically, we (i) introduce a selective data allocation policy to exploit reuse locality and maintain reused data in the SLLC, (ii) tune the data allocation with a suitable replacement policy and coherence protocol, and finally, (iii) explore different ways of organizing the data/tag arrays and study the performance sensitivity to the size of the resulting structures. The role of a reuse cache in maintaining performance with decreasing sizes is investigated in the experimental part of this work, by simulating multiprogrammed and multithreaded workloads in an eight-core chip multiprocessor. As an example, we show that a reuse cache with a tag array equivalent to a conventional 4 MB cache and only a 1 MB data array would perform as well as a conventional cache of 8 MB, requiring only 16.7% of the storage capacity.
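A minimal sketch of the selective data allocation idea, with illustrative Python structures (class and method names are assumptions; the real tag and data arrays are set-associative with their own replacement policy):

```python
# Sketch of a reuse cache: tags are allocated on first touch, but data is
# allocated only on the second touch, i.e. once the line has shown reuse.
class ReuseCache:
    def __init__(self, tag_entries, data_entries):
        self.tags = {}   # addr -> reused flag (large tag array)
        self.data = {}   # addr -> block      (small data array)
        self.tag_cap, self.data_cap = tag_entries, data_entries

    def access(self, addr, fetch_block):
        if addr in self.data:                 # hit on already-reused data
            return self.data[addr]
        block = fetch_block(addr)             # from memory / lower level
        if addr in self.tags:
            # Second touch: the line exhibits reuse, so allocate its data now.
            self.tags[addr] = True
            if len(self.data) >= self.data_cap:
                self.data.pop(next(iter(self.data)))  # stand-in for a real victim pick
            self.data[addr] = block
        else:
            # First touch: allocate only the tag; the data bypasses the SLLC.
            if len(self.tags) >= self.tag_cap:
                self.tags.pop(next(iter(self.tags)))
            self.tags[addr] = False
        return block
```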
high performance embedded architectures and compilers | 2012
Jorge Albericio; Ruben Gran; Pablo Ibáñez; Víctor Viñals; José M. Llabería
Hardware data prefetching is a well-known technique for hiding memory latency. However, in a multicore system fitted with a shared Last-Level Cache (LLC), prefetch induced by a core consumes common resources such as shared cache space and main memory bandwidth. This may degrade the performance of other cores and even the overall system performance unless the prefetch aggressiveness of each core is controlled from a system standpoint. On the other hand, LLCs in commercial chip multiprocessors are more and more frequently organized in independent banks. In this contribution, we target prefetch in a banked LLC organization for the first time and propose ABS, a low-cost controller with a hill-climbing approach that runs stand-alone at each LLC bank without requiring inter-bank communication. Using multiprogrammed SPEC2K6 workloads, our analysis shows that the mechanism improves both user-oriented metrics (Harmonic Mean of Speedups by 27% and Fairness by 11%) and system-oriented metrics (Weighted Speedup increases 22% and Memory Bandwidth Consumption decreases 14%) over an eight-core baseline system that uses aggressive sequential prefetch with a fixed degree. Similar conclusions can be drawn by varying the number of cores or the LLC size, when running parallel applications, or when other prefetch engines are controlled.
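A hill-climbing throttle in the spirit of ABS might look like the sketch below; the class name, epoch interface, and the goodness metric are assumptions, not the paper's exact design:

```python
# Sketch of a per-bank prefetch-aggressiveness controller: each LLC bank
# locally hill-climbs over a ladder of prefetch degrees, epoch by epoch.
class HillClimbingThrottle:
    DEGREES = [0, 1, 2, 4, 8, 16]        # candidate prefetch degrees

    def __init__(self):
        self.i = 3                        # start at a mid aggressiveness (degree 4)
        self.direction = 1                # +1 = more aggressive, -1 = less
        self.prev_score = None

    def end_of_epoch(self, score):
        # 'score' is any locally measurable goodness metric for this bank,
        # e.g. useful prefetches minus an estimate of induced pollution.
        if self.prev_score is not None and score < self.prev_score:
            self.direction = -self.direction   # last move hurt: reverse course
        self.i = min(max(self.i + self.direction, 0), len(self.DEGREES) - 1)
        self.prev_score = score
        return self.DEGREES[self.i]            # degree to use next epoch
```

Because the score is measured per bank, the controller needs no inter-bank communication, which is the property the abstract highlights.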
international conference on supercomputing | 1998
Pablo Ibáñez; Víctor Viñals; José Luis Briz; María Jesús Garzarán
A common mechanism to perform hardware-based prefetching for regular accesses to arrays and chained lists is based on a Load/Store cache (LSC). An LSC associates the address of a ld/st instruction with its individual behavior at every entry. We show that the implementation cost of the LSC is rather high, and that using it is inefficient. We aim to decrease the cost of the LSC but not its performance. This may be done by preventing useless instructions from being stored in the LSC. We propose eliminating those instructions that never miss, and those that follow a sequential pattern. This may be carried out by inserting a ld/st instruction in the LSC whenever it misses in the data cache (on-miss insertion), and issuing sequential prefetching simultaneously. After having analyzed the performance of this proposal through a cycle-by-cycle simulation over a set of 25 benchmarks selected from SPEC95, SPEC92 and Perfect Club, we conclude that an LSC of only 8 entries, which combines on-miss insertion and sequential prefetching, performs better than a conventional LSC of 512 entries. We think that the low cost of the proposal makes it worth taking into account for the development of future microprocessors.
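A rough sketch of on-miss insertion combined with sequential prefetching, with hypothetical names and a deliberately simplified table format (the paper evaluates the real design in a cycle-by-cycle simulator):

```python
# Sketch of a tiny Load/Store cache with on-miss insertion: sequential
# patterns are served by a parallel next-block prefetch, so only
# non-sequential, missing instructions occupy LSC entries.
from collections import OrderedDict

class OnMissLSC:
    def __init__(self, entries=8, block=64):
        self.table = OrderedDict()    # pc -> last missing address
        self.entries, self.block = entries, block

    def on_cache_miss(self, pc, addr, issue_prefetch):
        # Sequential prefetch covers strictly sequential streams, and
        # instructions that never miss never reach this point at all.
        issue_prefetch(addr + self.block)
        if pc in self.table:
            stride = addr - self.table[pc]
            if stride not in (0, self.block):   # non-sequential pattern
                issue_prefetch(addr + stride)   # stride prefetch from the LSC
            self.table.move_to_end(pc)
        elif len(self.table) >= self.entries:
            self.table.popitem(last=False)      # evict the LRU entry
        self.table[pc] = addr
```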
high performance embedded architectures and compilers | 2013
Jorge Albericio; Pablo Ibáñez; Víctor Viñals; José M. Llabería
Optimization of the replacement policy used for Shared Last-Level Cache (SLLC) management in a Chip-MultiProcessor (CMP) is critical for avoiding off-chip accesses. Temporal locality, while being exploited by first levels of private cache memories, is only slightly exhibited by the stream of references arriving at the SLLC. Thus, traditional replacement algorithms based on recency are bad choices for governing SLLC replacement. Recent proposals involve SLLC replacement policies that attempt to exploit reuse either by segmenting the replacement list or improving the rereference interval prediction. On the other hand, inclusive SLLCs are commonplace in the CMP market, but the interaction between replacement policy and the enforcement of inclusion has barely been discussed. After analyzing that interaction, this article introduces two simple replacement policies exploiting reuse locality and targeting inclusive SLLCs: Least Recently Reused (LRR) and Not Recently Reused (NRR). NRR has the same implementation cost as NRU, and LRR only adds one bit per line to the LRU cost. After considering reuse locality and its interaction with the invalidations induced by inclusion, the proposals are evaluated by simulating multiprogrammed workloads in an 8-core system with two private cache levels and an SLLC. LRR outperforms LRU by 4.5% (performing better in 97 out of 100 mixes) and NRR outperforms NRU by 4.2% (performing better in 99 out of 100 mixes). We also show that our mechanisms outperform rereference interval prediction, a recently proposed SLLC replacement policy, and that similar conclusions can be drawn by varying the associativity or the SLLC size.
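A sketch of the NRR idea at the granularity of one cache set, purely illustrative (hardware would keep one reuse bit per line, exactly as NRU keeps its use bit; the class and reset policy here are assumptions):

```python
# Sketch of Not Recently Reused (NRR) replacement: NRU-style hardware, but
# the per-line bit tracks *reuse* (a hit in the SLLC) rather than recency.
class NRRSet:
    def __init__(self, ways=16):
        self.lines = [None] * ways      # tag per way
        self.reused = [False] * ways    # one reuse bit per line

    def access(self, addr):
        if addr in self.lines:
            w = self.lines.index(addr)
            self.reused[w] = True       # an SLLC hit marks the line as reused
            if all(self.reused):        # keep at least one replacement candidate
                self.reused = [i == w for i in range(len(self.lines))]
            return True                 # hit
        # Miss: victimize an empty or not-recently-reused way.
        for w in range(len(self.lines)):
            if self.lines[w] is None or not self.reused[w]:
                self.lines[w], self.reused[w] = addr, False
                return False
        return False                    # unreachable: the reset above guarantees a victim
```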
ACM Sigarch Computer Architecture News | 2006
Luis Ramos; José Luis Briz; Pablo Ibáñez; Víctor Viñals
In this paper we evaluate four hardware data prefetchers in the context of a high-performance three-level on-chip cache hierarchy with high bandwidth and capacity. We consider two classic prefetchers (Sequential Tagged and Stride) and two correlating prefetchers: PC/DC, a recent method with superior scores and small tables, and P-DFCM, a new method. Like PC/DC, P-DFCM focuses on local delta sequences, but it is based on the DFCM value predictor. We explore different prefetch degrees and distances. Running SPEC2000, Olden and IAbench applications, results show that this kind of cache hierarchy turns prefetching aggressiveness into success for the four prefetchers. Sequential Tagged is the best, and deserves further attention to cut its losses in some applications. PC/DC results are matched or even improved by P-DFCM, using far fewer table accesses while keeping table sizes low.
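A sketch of a DFCM-style predictor applied to per-PC address deltas, in the spirit of P-DFCM (the two-level table layout is the classic DFCM shape, but the naming and sizing here are assumptions, not the paper's design):

```python
# Sketch of a DFCM-style delta prefetcher: a first-level table keeps each
# PC's last address and recent delta history; a second-level table maps a
# delta history to the predicted next delta.
class DFCMPrefetcher:
    def __init__(self, order=2):
        self.l1 = {}         # pc -> (last_addr, tuple of recent deltas)
        self.l2 = {}         # delta-history -> predicted next delta
        self.order = order   # length of the local delta context

    def observe(self, pc, addr):
        if pc not in self.l1:
            self.l1[pc] = (addr, ())
            return None                          # nothing to predict yet
        last, hist = self.l1[pc]
        delta = addr - last
        if hist:
            self.l2[hist] = delta                # learn: history -> next delta
        hist = (hist + (delta,))[-self.order:]   # slide the delta context
        self.l1[pc] = (addr, hist)
        nxt = self.l2.get(hist)
        return addr + nxt if nxt else None       # prefetch candidate (skip unknown/zero)
```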
euromicro workshop on parallel and distributed processing | 2001
María Jesús Garzarán; José Luis Briz; Pablo Ibáñez; Víctor Viñals
Data prefetching has been widely studied as a technique to hide memory access latency in multiprocessors. Most recent research on hardware prefetching focuses either on uniprocessors, or on distributed shared memory (DSM) and other non-bus-based organizations. However, in the context of bus-based SMPs, prefetching poses a number of problems related to the lack of scalability and limited bus bandwidth of these modest-sized machines. This paper considers how the number of processors and the memory access patterns in the program influence the relative performance of sequential and non-sequential prefetching mechanisms in a bus-based SMP. We compare the performance of four inexpensive hardware prefetching techniques, varying the number of processors. After a breakdown of the results based on a performance model, we propose a cost-effective hardware prefetching solution for implementation on such modest-sized multiprocessors.
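For reference, a sketch of sequential tagged prefetch, the simplest of the mechanism classes compared in studies like this one (illustrative only; the paper's contribution is the evaluation under bus-bandwidth constraints, not this mechanism):

```python
# Sketch of sequential tagged prefetch: prefetch the next block on a miss,
# and again on the first demand hit to a block that arrived via prefetch.
class SequentialTagged:
    def __init__(self, block=64):
        self.block = block
        self.tagged = set()   # prefetched blocks not yet demanded

    def on_access(self, addr, miss, issue_prefetch):
        blk = addr - addr % self.block
        if miss or blk in self.tagged:
            self.tagged.discard(blk)            # first demand touch consumes the tag
            issue_prefetch(blk + self.block)    # fetch the sequentially next block
            self.tagged.add(blk + self.block)
```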
memory performance dealing with applications systems and architecture | 2007
Ana Bosque; Pablo Ibáñez; Víctor Viñals; Per Stenström; José M. Llabería
Computer manufacturers today offer multicores with multi-threading capabilities and a broad range of core counts. An important market for these multicores is the server domain. Web servers are a class of servers widely used to provide access to files and also as front-ends of more complex services. In this paper, the performance of the Apache web server is characterized on multicore chips using SPECweb2005 as the URL request generator. This benchmark provides three workloads in order to characterize different usage environments. We also compare its performance against Surge, which simulates a static web-page URL request generator. We find that the L2 data miss rate per instruction is below 1.4%, that more than 60% of the misses are classified as cold or capacity misses, and that true sharing misses represent between 12% and 38% of all misses. We observe that although the data miss rate is small, accesses to main memory represent up to 42% of the execution time. By contrast, the true sharing misses, which can be up to 38% of all misses, represent a small fraction of the time due to the low latency of cache-to-cache transfers inside the chip.
IEEE Transactions on Computers | 2016
Alexandra Ferrerón; Darío Suárez-Gracia; Jesús Alastruey-Benedé; Teresa Monreal-Arnal; Pablo Ibáñez
Scaling supply voltage to values near the threshold voltage allows a dramatic decrease in the power consumption of processors; however, the lower the voltage, the higher the sensitivity to process variation, and, hence, the lower the reliability. Large SRAM structures, like the last-level cache (LLC), are extremely vulnerable to process variation because they are aggressively sized to satisfy high density requirements. In this paper, we propose Concertina, an LLC designed to enable reliable operation at low voltages with conventional SRAM cells. Based on the observation that for many applications the LLC contains large amounts of null data, Concertina compresses cache blocks so that they can be allocated to cache entries with faulty cells, enabling use of 100 percent of the LLC capacity. To distribute blocks among cache entries, Concertina implements a compression- and fault-aware insertion/replacement policy that reduces the LLC miss rate. With a modest storage overhead, Concertina reaches the performance of an ideal system whose LLC does not suffer from parameter variation. Specifically, performance degrades by less than 2 percent even when using small SRAM cells, which implies over 90 percent of cache entries having defective cells; this represents a notable improvement over previously proposed techniques.
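A sketch of the null-subblock compression test underlying Concertina, under the simplifying assumption that a block fits an entry whenever its non-null subblocks do not outnumber the entry's healthy subblock slots (function names and the example data are hypothetical):

```python
# Sketch of Concertina-style null-subblock compression: zero subblocks need
# not be stored, so a compressed block can live in a partially faulty entry.
SUBBLOCKS = 8   # e.g. 8 subblocks of 8 bytes in a 64-byte line

def null_map(block):
    """Bitmap of subblocks that are entirely zero and need not be stored."""
    return [all(b == 0 for b in sub) for sub in block]

def fits(block, faulty):
    """Simplified fit test: every non-null subblock needs a fault-free slot."""
    non_null = sum(not z for z in null_map(block))
    healthy = sum(not f for f in faulty)
    return non_null <= healthy

# Usage: a line with many zero subblocks still fits an entry with faulty
# cells, recovering capacity at very low supply voltage.
line = [bytes(8)] * 6 + [bytes([1] * 8)] * 2   # 6 null + 2 non-null subblocks
entry_faults = [True, True, False, False, False, False, True, False]
assert fits(line, entry_faults)                # 2 non-null <= 5 healthy slots
```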
european conference on parallel processing | 2009
Benjamin Sahelices; Pablo Ibáñez; Víctor Viñals; José M. Llabería
Understanding and optimizing the synchronization operations of parallel programs in distributed shared memory (DSM) multiprocessors is one of the most important factors leading to significant reductions in execution time. This paper introduces a new methodology for tuning the performance of parallel programs. We focus on the critical sections used to ensure exclusive access to critical resources and data structures, proposing a specific dynamic characterization of every critical section in order to a) measure the lock contention, b) measure the degree of data sharing in consecutive executions, and c) break down the execution time, reflecting the different overheads that can appear. All the required measurements are taken using a multiprocessor simulator with a detailed timing model of the processor and memory system. We also propose a static classification of critical sections that takes into account how locks are associated with their protected data. The dynamic characterization and the static classification are correlated to identify key critical sections and infer code optimization opportunities (e.g. data layout), which when applied can lead to significant reductions in execution time (up to 33% in the SPLASH-2 scientific benchmark suite). By using the simulator we can also evaluate whether or not the performance of the applied code optimizations is sensitive to common hardware optimizations.
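Purely as an illustration of the kind of per-critical-section metrics the paper gathers (the real measurements come from a timing simulator, not software instrumentation; all names below are hypothetical):

```python
# Sketch of per-critical-section characterization: count acquisitions and
# contended acquisitions, and break time into waiting and holding.
import threading
import time
from collections import defaultdict

stats = defaultdict(lambda: {"acquires": 0, "contended": 0,
                             "wait": 0.0, "held": 0.0})

class ProfiledLock:
    """Wraps a lock to record contention, waiting time, and holding time."""
    def __init__(self, name, lock=None):
        self.name = name
        self.lock = lock or threading.Lock()

    def __enter__(self):
        s = stats[self.name]
        s["acquires"] += 1
        t0 = time.perf_counter()
        if not self.lock.acquire(blocking=False):   # probe: was it contended?
            s["contended"] += 1
            self.lock.acquire()                     # now actually wait
        s["wait"] += time.perf_counter() - t0
        self._t_held = time.perf_counter()
        return self

    def __exit__(self, *exc):
        stats[self.name]["held"] += time.perf_counter() - self._t_held
        self.lock.release()

# Usage: `with ProfiledLock("task_queue"): ...` around each critical section;
# stats["task_queue"] then yields contention counts and a time breakdown.
```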