
Publication


Featured research published by Víctor Viñals.


international conference on parallel processing | 2002

Hardware schemes for early register release

Teresa Monreal; Víctor Viñals; Antonio González; Mateo Valero

Register files are becoming one of the critical components of current out-of-order processors in terms of delay and power consumption, since the potential to exploit instruction-level parallelism is closely related to the size and number of ports of the register file. In conventional register renaming schemes, register releasing is done conservatively: a register is released only after the instruction that redefines it is committed. Instead, we propose a scheme that releases registers as soon as the processor knows that there will be no further use of them. We present two early releasing hardware implementations with different performance/complexity trade-offs. Detailed cycle-level simulations show either a significant speedup for a given register file size, or a reduction in register file size for a given performance level.
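
As a rough software illustration of the idea (not the hardware scheme proposed in the paper), the difference between conventional and early release comes down to the condition under which a physical register returns to the free list; the structures and field names below are assumptions made for exposition only.

```python
# Illustrative sketch of conservative vs. early physical-register release.
# All structures and field names are hypothetical, for explanation only.

class PhysReg:
    def __init__(self, idx):
        self.idx = idx
        self.pending_readers = 0   # in-flight instructions still needing the value
        self.redefined = False     # a younger instruction has renamed this arch reg

free_list = []

def conservative_release(reg, redefiner_committed):
    # Conventional renaming: free only when the redefining instruction commits.
    if redefiner_committed:
        free_list.append(reg.idx)

def early_release(reg):
    # Early release: free as soon as no further use is possible --
    # every reader has read the value and the register has been redefined.
    if reg.redefined and reg.pending_readers == 0:
        free_list.append(reg.idx)
```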


high-performance computer architecture | 2003

Tradeoffs in buffering memory state for thread-level speculation in multiprocessors

María Jesús Garzarán; Milos Prvulovic; José M. Llabería; Víctor Viñals; Lawrence Rauchwerger; Josep Torrellas

Thread-level speculation provides architectural support to aggressively run hard-to-analyze code in parallel. As speculative tasks run concurrently, they generate unsafe or speculative memory state that needs to be separately buffered and managed in the presence of distributed caches and buffers. Such state may contain multiple versions of the same variable. In this paper, we introduce a novel taxonomy of approaches to buffering and managing multi-version speculative memory state in multiprocessors. We also present a detailed complexity-benefit tradeoff analysis of the different approaches. Finally, we use numerical applications to evaluate the performance of the approaches under a single architectural framework. Our key insights are that support for buffering the state of multiple speculative tasks and versions per processor is more complexity-effective than support for merging the state of tasks with main memory lazily. Moreover, both supports can be gainfully combined and, in large machines, their effect is nearly fully additive. Finally, the more complex support for future state in main memory can boost performance when buffers are under pressure, but hurts performance when squashes are frequent.
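
As a rough software analogy, and purely as an assumption of this summary rather than the paper's taxonomy, per-processor buffering of multi-version speculative state can be pictured as a map from address to a list of (task, value) versions, with commit and squash merging or discarding a task's versions.

```python
# Toy model of multi-version speculative memory state kept per processor.
# Task IDs are assumed to encode sequential order (lower = older); this is
# an illustrative simplification, not the hardware organization in the paper.

class SpeculativeBuffer:
    def __init__(self):
        # address -> list of (task_id, value), one entry per speculative version
        self.versions = {}

    def spec_store(self, task_id, addr, value):
        self.versions.setdefault(addr, []).append((task_id, value))

    def spec_load(self, task_id, addr, main_memory):
        # A load must see the closest earlier (or own) version; fall back to memory.
        candidates = [(t, v) for t, v in self.versions.get(addr, []) if t <= task_id]
        if candidates:
            return max(candidates)[1]          # most recent version not younger than us
        return main_memory.get(addr, 0)

    def commit(self, task_id, main_memory):
        # Merge the task's versions into main memory when it becomes non-speculative.
        for addr, vs in self.versions.items():
            for t, v in vs:
                if t == task_id:
                    main_memory[addr] = v
        self.versions = {a: [(t, v) for t, v in vs if t != task_id]
                         for a, vs in self.versions.items()}

    def squash(self, task_id):
        # Discard all speculative state produced by a squashed task.
        self.versions = {a: [(t, v) for t, v in vs if t != task_id]
                         for a, vs in self.versions.items()}
```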


ACM Transactions on Architecture and Code Optimization | 2005

Tradeoffs in buffering speculative memory state for thread-level speculation in multiprocessors

María Jesús Garzarán; Milos Prvulovic; José M. Llabería; Víctor Viñals; Lawrence Rauchwerger; Josep Torrellas

Thread-Level Speculation (TLS) provides architectural support to aggressively run hard-to-analyze code in parallel. As speculative tasks run concurrently, they generate unsafe or speculative memory state that needs to be separately buffered and managed in the presence of distributed caches and buffers. Such state may contain multiple versions of the same variable. In this paper, we introduce a novel taxonomy of approaches to buffer and manage multiversion speculative memory state in multiprocessors. We also present a detailed complexity-benefit tradeoff analysis of the different approaches. Finally, we use numerical applications to evaluate the performance of the approaches under a single architectural framework. Our key insights are that support for buffering the state of multiple speculative tasks and versions per processor is more complexity-effective than support for lazily merging the state of tasks with main memory. Moreover, both supports can be gainfully combined and, in large machines, their effect is nearly fully additive. Finally, the more complex support for storing future state in main memory can boost performance when buffers are under pressure, but hurts performance when squashes are frequent.


international symposium on computer architecture | 2005

Store Buffer Design in First-Level Multibanked Data Caches

Enrique F. Torres; Pablo Ibáñez; Víctor Viñals; José M. Llabería

This paper focuses on how to design a store buffer (STB) well suited to first-level multibanked data caches. Our goal is to forward data from in-flight stores to dependent loads with the latency of a cache bank. For that we propose a particular two-level STB design in which forwarding is done speculatively from a distributed first-level STB made of extremely small banks, while a centralized, second-level STB enforces correct store-load ordering a few cycles later. To that end we have identified several important design decisions: i) delaying allocation of first-level STB entries until stores execute; ii) deallocating first-level STB entries before stores commit; and iii) selecting a recovery policy well-matched to data forwarding misspeculations. Moreover, the two-level STB admits two enhancements that simplify the design leaving performance almost unchanged: i) removing the data forwarding capability from the second-level STB; and ii) not checking instruction age in first-level STB prior to forwarding data to loads. Following our guidelines and running SPECint-2K over an 8-way out-of-order processor, a two-level STB (first level with four STB banks of 8 entries each) performs similarly to an ideal, single-level STB with 128-entry banks working at the first-level cache latency.
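
The following functional sketch (not cycle-accurate, with invented structure names and sizes) shows the two-level flavor: tiny first-level banks forward speculatively, and the second-level STB later verifies store-load ordering, signaling recovery on a misforwarding.

```python
# Functional sketch of two-level store-buffer forwarding (not cycle-accurate).
# STB1 banks are tiny and may forward speculatively; STB2 holds all in-flight
# stores in age order and verifies the forwarding decision a few cycles later.

STB1_BANK_ENTRIES = 8

class TwoLevelSTB:
    def __init__(self, num_banks=4):
        self.stb1 = [[] for _ in range(num_banks)]   # per-bank: (age, addr, data)
        self.stb2 = []                               # global, age-ordered stores

    def bank(self, addr):
        return addr % len(self.stb1)

    def store_execute(self, age, addr, data):
        b = self.stb1[self.bank(addr)]
        b.append((age, addr, data))
        if len(b) > STB1_BANK_ENTRIES:               # deallocate before commit
            b.pop(0)
        self.stb2.append((age, addr, data))

    def load_forward(self, load_age, addr):
        # Speculative forwarding: youngest matching store in the bank,
        # without checking instruction age (one of the paper's simplifications).
        for age, a, d in reversed(self.stb1[self.bank(addr)]):
            if a == addr:
                return d
        return None                                   # value comes from the cache

    def verify(self, load_age, addr, forwarded):
        # STB2 check: the correct producer is the youngest older store to addr.
        older = [(age, d) for age, a, d in self.stb2 if a == addr and age < load_age]
        correct = max(older)[1] if older else None
        return forwarded == correct                   # False -> recovery needed
```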


international symposium on microarchitecture | 2013

The reuse cache: downsizing the shared last-level cache

Jorge Albericio; Pablo Ibáñez; Víctor Viñals; José M. Llabería

Over recent years, a growing body of research has shown that a considerable portion of the shared last-level cache (SLLC) is dead, meaning that the corresponding cache lines are stored but will not receive any further hits before being replaced. Conversely, most hits observed by the SLLC come from a small subset of already reused lines. In this paper, we propose the reuse cache, a decoupled tag/data SLLC designed to store only the data of lines that have been reused. Thus, the size of the data array can be dramatically reduced. Specifically, we (i) introduce a selective data allocation policy to exploit reuse locality and maintain reused data in the SLLC, (ii) tune the data allocation with a suitable replacement policy and coherence protocol, and finally, (iii) explore different ways of organizing the data/tag arrays and study the performance sensitivity to the size of the resulting structures. The ability of a reuse cache to maintain performance with decreasing sizes is investigated in the experimental part of this work by simulating multiprogrammed and multithreaded workloads in an eight-core chip multiprocessor. As an example, we show that a reuse cache with a tag array equivalent to a conventional 4 MB cache and only a 1 MB data array performs as well as a conventional cache of 8 MB, requiring only 16.7% of the storage capacity.
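
A toy model of the selective data allocation policy is sketched below, assuming FIFO replacement and simplified structures; the actual replacement policy and coherence interplay are the subject of the paper itself.

```python
# Toy model of the reuse cache's selective data allocation: the tag array
# tracks all lines, but the (much smaller) data array is allocated only on
# the second access, i.e., once a line has shown reuse. Structures are
# illustrative; replacement is simplified to FIFO.

from collections import OrderedDict

class ReuseCache:
    def __init__(self, tag_entries, data_entries):
        self.tags = OrderedDict()    # line addr -> has_data flag
        self.data = OrderedDict()    # line addr -> cached data block
        self.tag_entries = tag_entries
        self.data_entries = data_entries

    def access(self, addr, fetch_from_memory):
        if addr in self.tags:
            if not self.tags[addr]:
                # First reuse: only now allocate a data entry for this line.
                if len(self.data) >= self.data_entries:
                    evicted, _ = self.data.popitem(last=False)
                    self.tags[evicted] = False
                self.data[addr] = fetch_from_memory(addr)
                self.tags[addr] = True
            return self.data.get(addr, fetch_from_memory(addr))
        # Miss: allocate only a tag entry; data stays off-chip until reuse.
        if len(self.tags) >= self.tag_entries:
            old, _ = self.tags.popitem(last=False)
            self.data.pop(old, None)
        self.tags[addr] = False
        return fetch_from_memory(addr)
```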


IEEE Transactions on Computers | 2004

Late allocation and early release of physical registers

Teresa Monreal; Víctor Viñals; Jose Gonzalez; Antonio González; Mateo Valero

The register file is one of the critical components of current processors in terms of access time and power consumption. Among other things, the potential to exploit instruction-level parallelism is closely related to the size and number of ports of the register file. In conventional register renaming schemes, both register allocation and releasing are conservatively done, the former at the rename stage, before registers are loaded with values, and the latter at the commit stage of the instruction redefining the same register, once registers are not used any more. We introduce VP-LAER, a renaming scheme that allocates registers later and releases them earlier than conventional schemes. Specifically, physical registers are allocated at the end of the execution stage and released as soon as the processor realizes that there will be no further use of them. VP-LAER enhances register utilization, that is, the fraction of allocated registers having a value to be read in the future. Detailed cycle-level simulations show either a significant speedup for a given register file size or a reduction in the register file size for a given performance level, especially for floating-point codes, where the register file pressure is usually high.
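
A minimal sketch of the late-allocation half of the idea, assuming a virtual-tag namespace for dependence tracking; the names and sizes below are illustrative, not the VP-LAER hardware design.

```python
# Sketch of late allocation: at rename, a destination gets only a virtual tag
# for dependence tracking; a real physical register is bound at the end of
# execution, when the value is actually produced. Names are illustrative.

import itertools

virtual_tags = itertools.count()       # unbounded namespace, tracks dependences
free_physical = list(range(64))        # the scarce resource

def rename_destination():
    # Conventional schemes would take a physical register here; late allocation
    # defers that and hands out a virtual tag instead.
    return next(virtual_tags)

def execute_writeback(vtag, value, mapping):
    # Only now is a physical register consumed, so it is never held empty.
    preg = free_physical.pop()
    mapping[vtag] = (preg, value)
    return preg
```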


Journal of Systems Architecture | 2011

Improving the WCET computation in the presence of a lockable instruction cache in multitasking real-time systems

Luis C. Aparicio; Juan Segarra; Clemente Rodríguez; Víctor Viñals

In multitasking real-time systems it is required to compute the WCET of each task and also the effects of interferences between tasks in the worst case. This is very complex with variable latency hardware, such as instruction cache memories, or, to a lesser extent, the line buffers usually found in the fetch path of commercial processors. Some methods disable cache replacement so that it is easier to model the cache behavior. The difficulty in these cache-locking methods lies in obtaining a good selection of the memory lines to be locked into cache. In this paper, we propose an ILP-based method to select the best lines to be loaded and locked into the instruction cache at each context switch (dynamic locking), taking into account both intra-task and inter-task interferences, and we compare it with static locking. Our results show that, without cache, the spatial locality captured by a line buffer doubles the performance of the processor. When adding a lockable instruction cache, dynamic locking systems are schedulable with a cache size between 12.5% and 50% of the cache size required by static locking. Additionally, the computation time of our analysis method does not depend on the number of possible paths in the task. This allows us to analyze large codes in a relatively short time (100 KB with 10^65 paths in less than 3 minutes).
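
The paper's formulation models intra-task and inter-task interference in detail; the sketch below, written with the PuLP library and made-up per-line costs, only illustrates the general shape of such an ILP line-selection problem: one binary variable per memory line under a cache-capacity budget.

```python
# Illustrative ILP for instruction-cache line locking, sketched with PuLP.
# This is NOT the paper's formulation: it only shows the shape of the problem
# (a 0/1 choice per memory line under a cache-capacity budget).

from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary

# Hypothetical per-line worst-case miss cost if the line is NOT locked
# (e.g., executions along the worst path times the miss penalty).
miss_cost = {"lineA": 1200, "lineB": 900, "lineC": 450, "lineD": 300}
cache_capacity_lines = 2

prob = LpProblem("lock_selection", LpMinimize)
lock = {l: LpVariable(f"lock_{l}", cat=LpBinary) for l in miss_cost}

# Objective: minimise the WCET contribution of unlocked lines.
prob += lpSum(miss_cost[l] * (1 - lock[l]) for l in miss_cost)

# A locked line occupies one cache line for the whole task execution.
prob += lpSum(lock[l] for l in miss_cost) <= cache_capacity_lines

prob.solve()
print({l: int(lock[l].value()) for l in miss_cost})
```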


high performance embedded architectures and compilers | 2012

ABS: A low-cost adaptive controller for prefetching in a banked shared last-level cache

Jorge Albericio; Ruben Gran; Pablo Ibáñez; Víctor Viñals; José M. Llabería

Hardware data prefetching is a well-known technique for hiding memory latency. However, in a multicore system fitted with a shared Last-Level Cache (LLC), prefetches issued by one core consume shared resources such as cache space and main memory bandwidth. This may degrade the performance of the other cores, and even overall system performance, unless the prefetch aggressiveness of each core is controlled from a system standpoint. On the other hand, LLCs in commercial chip multiprocessors are increasingly organized as independent banks. In this contribution, we target for the first time prefetching in a banked LLC organization and propose ABS, a low-cost controller with a hill-climbing approach that runs stand-alone at each LLC bank without requiring inter-bank communication. Using multiprogrammed SPEC2K6 workloads, our analysis shows that the mechanism improves both user-oriented metrics (Harmonic Mean of Speedups by 27% and Fairness by 11%) and system-oriented metrics (Weighted Speedup increases 22% and Memory Bandwidth Consumption decreases 14%) over an eight-core baseline system that uses aggressive sequential prefetch with a fixed degree. Similar conclusions can be drawn by varying the number of cores or the LLC size, when running parallel applications, or when other prefetch engines are controlled.
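
As an illustration only (the metric, epoch length, and degree levels below are assumptions, not the ABS cost model), a per-bank hill-climbing controller can adjust the prefetch degree each epoch depending on whether the last move helped.

```python
# Sketch of a hill-climbing prefetch-degree controller running independently
# at one LLC bank. Metric, epoch length and degree bounds are assumptions for
# illustration; the paper's ABS controller defines its own benefit metric.

class HillClimbController:
    def __init__(self, degrees=(0, 1, 2, 4, 8)):
        self.degrees = degrees
        self.idx = 2                 # start at a middle aggressiveness
        self.direction = +1
        self.last_metric = None

    def current_degree(self):
        return self.degrees[self.idx]

    def end_of_epoch(self, metric):
        # metric: higher is better (e.g., useful prefetches minus induced misses).
        if self.last_metric is not None and metric < self.last_metric:
            self.direction = -self.direction          # last move hurt: reverse it
        self.idx = max(0, min(len(self.degrees) - 1, self.idx + self.direction))
        self.last_metric = metric
        return self.current_degree()
```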


international conference on supercomputing | 1998

Characterization and improvement of load/store cache-based prefetching

Pablo Ibáñez; Víctor Viñals; José Luis Briz; María Jesús Garzarán

A common mechanism to perform hardware-based prefetching for regular accesses to arrays and chained lists is based on a Load/Store cache (LSC). An LSC associates the address of a ld/st instruction with its individual behavior at every entry. We show that the implementation cost of the LSC is rather high, and that using it is inefficient. We aim to decrease the cost of the LSC but not its performance. This may be done by preventing useless instructions from being stored in the LSC. We propose eliminating those instructions that never miss, and those that follow a sequential pattern. This may be carried out by inserting a ld/st instruction in the LSC whenever it misses in the data cache (on-miss insertion), and issuing sequential prefetching simultaneously. After having analyzed the performance of this proposal through a cycle-by-cycle simulation over a set of 25 benchmarks selected from SPEC95, SPEC92 and Perfect Club, we conclude that an LSC of only 8 entries, which combines on-miss insertion and sequential prefetching, performs better than a conventional LSC of 512 entries. We think that the low cost of the proposal makes it worth being taken into account for the development of future microprocessors.
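
The sketch below combines the proposal's two ingredients, on-miss insertion plus simultaneous sequential prefetching, in simplified form; the table organization and line size are illustrative assumptions.

```python
# Sketch of a small load/store cache (LSC) with on-miss insertion: a ld/st
# PC is recorded only when it misses in the data cache, and a sequential
# (next-line) prefetch is issued at the same time. Stride detection is the
# usual PC-indexed scheme; sizes and fields are illustrative.

LINE_SIZE = 64

class LoadStoreCache:
    def __init__(self, entries=8):
        self.entries = entries
        self.table = {}                      # pc -> (last_addr, stride)

    def on_access(self, pc, addr, data_cache_hit, prefetch):
        if pc in self.table:
            last_addr, stride = self.table[pc]
            new_stride = addr - last_addr
            if new_stride == stride and stride != 0:
                prefetch(addr + stride)      # confirmed strided pattern
            self.table[pc] = (addr, new_stride)
        elif not data_cache_hit:
            # On-miss insertion keeps never-missing instructions out of the
            # tiny LSC...
            if len(self.table) >= self.entries:
                self.table.pop(next(iter(self.table)))
            self.table[pc] = (addr, 0)
            # ...while sequential prefetching covers the next-line case.
            prefetch((addr // LINE_SIZE + 1) * LINE_SIZE)
```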


embedded and real-time computing systems and applications | 2010

Combining Prefetch with Instruction Cache Locking in Multitasking Real-Time Systems

Luis C. Aparicio; Juan Segarra; Clemente Rodríguez; Víctor Viñals

In multitasking real-time systems it is required to compute the WCET of each task and also the effects of interferences between tasks in the worst case. This is complex with the variable latency hardware usually found in the fetch path of commercial processors. Some methods disable cache replacement so that it is easier to model the cache behavior. Lock-MS is an ILP-based method to obtain the best selection of memory lines to be locked in a dynamic locking instruction cache. In this paper we first propose a simple memory architecture implementing next-line tagged prefetch, specially designed for hard real-time systems. Then, we extend Lock-MS to add support for hardware instruction prefetch. Our results show that the WCET of a system with prefetch and an instruction cache whose size is 5% of the total code size is better than that of a system having no prefetch and a cache size of 80% of the code. We also evaluate the effects of the prefetch penalty on the resulting WCET, showing that a system without prefetch penalties achieves worst-case performance within 95% of the ideal case. This highlights the importance of a good prefetch design. Finally, the computation time of our analysis method is relatively short, analyzing tasks of 96 KB with 10^65 paths in less than 3 minutes.
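
Next-line tagged prefetch itself is a classic scheme; the sketch below illustrates it in simplified form (line size and structures are assumptions, not the memory architecture proposed in the paper).

```python
# Sketch of next-line tagged prefetch: every cache line carries a tag bit set
# when the line is brought in by a prefetch; a demand miss, or the first demand
# access to a prefetched line, triggers a prefetch of the sequentially next
# line. Structures below are illustrative only.

LINE_SIZE = 32

class TaggedPrefetchICache:
    def __init__(self):
        self.lines = {}                       # line addr -> prefetched-tag bit

    def fetch(self, addr, prefetch):
        line = addr - (addr % LINE_SIZE)
        nxt = line + LINE_SIZE
        if line not in self.lines:
            self.lines[line] = False          # demand miss
            prefetch(nxt)
        elif self.lines[line]:
            self.lines[line] = False          # first use of a prefetched line
            prefetch(nxt)

    def install_prefetched(self, line_addr):
        self.lines.setdefault(line_addr, True)
```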

Collaboration


Dive into Víctor Viñals's collaborations.

Top Co-Authors

José M. Llabería

Polytechnic University of Catalonia

Mateo Valero

Polytechnic University of Catalonia

Clemente Rodríguez

University of the Basque Country

Ruben Gran

University of Zaragoza

Luis Ramos

University of Southern California
