Publications


Featured research published by Teresa Monreal.


International Symposium on Microarchitecture | 1999

Delaying physical register allocation through virtual-physical registers

Teresa Monreal; Antonio González; Mateo Valero; José González; Victor Viñals

Register file access time represents one of the critical delays of current microprocessors, and it is expected to become more critical as future processors increase the instruction window size and the issue width. This paper presents a novel physical register management scheme that allows for late allocation (at the end of execution) of registers. We show that it can provide significant savings in the number of registers and can thus significantly shorten the register file access time. The approach is based on virtual-physical registers, which we presented in previous work, extended with a new register allocation policy. This policy consists of on-demand allocation to maximize register usage, combined with a stealing mechanism that prevents older instructions from being delayed by younger ones. This shortens the average number of cycles that each physical register is allocated and allows for early execution of instructions, since they can obtain a physical register for their destination earlier than with the conventional scheme. Early execution is especially beneficial for branches and memory operations, since the former can be resolved earlier and the latter can prefetch their data in advance.
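
As a rough illustration of the idea (not the paper's hardware design), the following Python sketch renames destinations to virtual tags and defers the physical register grab until writeback; the class, its behavior on an empty free list, and the toy numbers are assumptions made for the example.

    # Minimal sketch (not the paper's implementation) of late physical register
    # allocation with virtual-physical registers: rename assigns only a virtual
    # tag, and a physical register is taken from the free list at writeback.
    from collections import deque

    class VirtualPhysicalRenamer:
        def __init__(self, num_physical):
            self.free_physical = deque(range(num_physical))  # free PR identifiers
            self.next_virtual = 0                            # virtual tag counter
            self.virtual_to_physical = {}                    # filled at writeback

        def rename(self):
            """Rename stage: allocate only a virtual register (always succeeds)."""
            vr = self.next_virtual
            self.next_virtual += 1
            return vr

        def writeback(self, vr):
            """End of execution: only now claim a real physical register."""
            if not self.free_physical:
                # The paper handles this case with its allocation/stealing policy;
                # here we simply report the shortage.
                raise RuntimeError("no physical register free at writeback")
            pr = self.free_physical.popleft()
            self.virtual_to_physical[vr] = pr
            return pr

        def release(self, vr):
            """Return the physical register once its value is dead."""
            self.free_physical.append(self.virtual_to_physical.pop(vr))

    renamer = VirtualPhysicalRenamer(num_physical=4)
    tags = [renamer.rename() for _ in range(6)]      # 6 in-flight destinations, 4 PRs
    print([renamer.writeback(t) for t in tags[:4]])  # only executed results hold PRs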


International Conference on Parallel Processing | 2002

Hardware schemes for early register release

Teresa Monreal; Víctor Viñals; Antonio González; Mateo Valero

Register files are becoming one of the critical components of current out-of-order processors in terms of delay and power consumption, since their potential to exploit instruction-level parallelism is closely related to the size and number of ports of the register file. In conventional register renaming schemes, register releasing is conservatively done only after the instruction that redefines the same register is committed. Instead, we propose a scheme that releases registers as soon as the processor knows that there will be no further use of them. We present two early releasing hardware implementations with different performance/complexity trade-offs. Detailed cycle-level simulations show either a significant speedup for a given register file size, or a reduction in register file size for a given performance level.
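
The Python fragment below is one illustrative way to model "release as soon as there is no further use", by combining a count of outstanding readers with a redefinition flag; the tracking structure is an assumption for the example, not necessarily either of the hardware implementations the paper evaluates.

    # Illustrative model of early release: a physical register is freed as soon
    # as it has been redefined *and* all of its in-flight readers have finished,
    # rather than waiting for the redefining instruction to commit.
    class EarlyReleaseTracker:
        def __init__(self):
            self.pending_readers = {}   # physical reg -> outstanding reads
            self.redefined = {}         # physical reg -> newer mapping exists?
            self.freed = []

        def allocate(self, pr):
            self.pending_readers[pr] = 0
            self.redefined[pr] = False

        def add_reader(self, pr):
            self.pending_readers[pr] += 1

        def reader_done(self, pr):
            self.pending_readers[pr] -= 1
            self._try_release(pr)

        def redefine(self, pr):
            self.redefined[pr] = True
            self._try_release(pr)

        def _try_release(self, pr):
            if self.redefined[pr] and self.pending_readers[pr] == 0:
                self.freed.append(pr)

    t = EarlyReleaseTracker()
    t.allocate(pr=7); t.add_reader(7); t.redefine(7)   # a newer value was renamed
    t.reader_done(7)                                   # last reader finishes...
    print(t.freed)                                     # ...so pr 7 is freed: [7]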


IEEE Transactions on Computers | 2004

Late allocation and early release of physical registers

Teresa Monreal; Víctor Viñals; José González; Antonio González; Mateo Valero

The register file is one of the critical components of current processors in terms of access time and power consumption. Among other things, the potential to exploit instruction-level parallelism is closely related to the size and number of ports of the register file. In conventional register renaming schemes, both register allocation and releasing are conservatively done, the former at the rename stage, before registers are loaded with values, and the latter at the commit stage of the instruction redefining the same register, once registers are not used any more. We introduce VP-LAER, a renaming scheme that allocates registers later and releases them earlier than conventional schemes. Specifically, physical registers are allocated at the end of the execution stage and released as soon as the processor realizes that there will be no further use of them. VP-LAER enhances register utilization, that is, the fraction of allocated registers having a value to be read in the future. Detailed cycle-level simulations show either a significant speedup for a given register file size or a reduction in the register file size for a given performance level, especially for floating-point codes, where the register file pressure is usually high.
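
As a toy illustration of the register utilization metric mentioned above, the sketch below computes, at a given time, the fraction of allocated registers whose value will still be read later; the helper name and the trace are invented for the example.

    # Back-of-the-envelope sketch of register utilization: the fraction of
    # currently allocated physical registers whose value will be read again.
    def utilization_at(time, allocations, reads):
        """allocations: {reg: (alloc_time, release_time)}; reads: {reg: [read times]}"""
        live = [r for r, (alloc, rel) in allocations.items() if alloc <= time < rel]
        useful = [r for r in live if any(t > time for t in reads.get(r, []))]
        return len(useful) / len(live) if live else 1.0

    allocs = {"p1": (0, 10), "p2": (2, 10)}   # p2 stays allocated long after its last read
    reads  = {"p1": [3, 8],  "p2": [4]}
    print(utilization_at(5, allocs, reads))   # 0.5: p2 is allocated but already dead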


International Parallel and Distributed Processing Symposium | 2007

Microarchitectural Support for Speculative Register Renaming

Jesús Alastruey; Teresa Monreal; Víctor Viñals; Mateo Valero

This paper proposes and evaluates a new microarchitecture for out-of-order processors that supports speculative renaming. We use the term speculative renaming to refer to the speculative omission of physical register allocation together with the speculative early release of physical registers. These renaming policies may cause a register operand not to be kept in the physical register file (PRF). Thus, we add a low-ported auxiliary register file (XRF), located outside the processor core, that keeps the values absent from the PRF and supplies them at higher latency. To track whether a register operand resides in the PRF or the XRF, we use virtual registers. We consider omission and release policies directed by hardware prediction. Namely, we use a single last-use predictor that directs both speculative omission and release. We call this mechanism SR-LUP (speculative renaming based on last-use prediction). Two last-use predictor designs of increasing complexity and performance are analyzed. In a 256-ROB, 8-way processor with an 80int+80fp PRF, SR-LUP with an 11-port 256int+256fp XRF speeds up computation by up to 11.5% and 29% for INT and FP SPEC2K benchmarks, respectively. For FP benchmarks, if the PRF limits the clock frequency, a conventionally managed 128int+128fp PRF can be replaced using SR-LUP by a 64int+64fp PRF backed by a 10-port 224int+224fp XRF, yielding a 19% IPS gain.
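
A simplified last-use predictor in the spirit of the mechanism described above might look like the Python sketch below; the table organization, PC indexing, and saturating-counter update rule are assumptions for illustration, not the two predictor designs evaluated in the paper.

    # Hypothetical, simplified last-use predictor: a small table indexed by the
    # reading instruction's PC holds a saturating counter; high counts predict
    # "this read is the last use of its source register".
    class LastUsePredictor:
        def __init__(self, entries=1024, threshold=2, max_count=3):
            self.table = [0] * entries
            self.entries, self.threshold, self.max_count = entries, threshold, max_count

        def _index(self, pc):
            return pc % self.entries

        def predict_last_use(self, pc):
            return self.table[self._index(pc)] >= self.threshold

        def update(self, pc, was_last_use):
            i = self._index(pc)
            if was_last_use:
                self.table[i] = min(self.table[i] + 1, self.max_count)
            else:
                self.table[i] = max(self.table[i] - 1, 0)

    p = LastUsePredictor()
    for _ in range(3):
        p.update(pc=0x400a10, was_last_use=True)   # train on a read that kills its source
    print(p.predict_last_use(0x400a10))            # True: omit/release speculatively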


Symposium on Computer Architecture and High Performance Computing | 2008

Selection of the Register File Size and the Resource Allocation Policy on SMT Processors

Jesús Alastruey; Teresa Monreal; Francisco J. Cazorla; Víctor Viñals; Mateo Valero

The performance impact of the Physical Register File (PRF) size on Simultaneous Multithreading processors has not been extensively studied in spite of being a critical shared resource. In this paper we analyze the effect on performance of the PRF size for a broad set of resource allocation policies (Icount, Stall, Flush, Flush++, Static, Dcra and Hill-climbing) and evaluate them under two metrics: instructions per second (IPS) for throughput and harmonic mean of weighted IPCs (Hmean-wIPC) for fairness. We have found that the resource allocation policy and the PRF size should be considered together in order to obtain the best score in the proposed metrics. For instance, for the analyzed 2- and 4-threaded SPEC CPU2000 workloads, small PRFs are best managed by Flush, whereas for larger PRFs, Hill-climbing and Static lead to the best values for the throughput and fairness metrics, respectively. The second contribution of this work is a simple procedure that, for a given resource allocation policy, selects the PRF size that maximizes IPS and obtains for Hmean-wIPC a value close to its maximum. According to our results, Hill-climbing with a 320-entry PRF achieves the best figures for 2-threaded workloads. When executing 4-threaded workloads, Hill-climbing with a 384-entry PRF achieves the best throughput whereas Static obtains the best throughput-fairness balance.
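
The abstract does not detail the selection procedure, so the snippet below is only a hypothetical reading of its goal: sweep candidate PRF sizes under a fixed allocation policy and keep the size with the highest simulated IPS. The helper name and all numbers are invented.

    # Hypothetical illustration of the idea of selecting a PRF size from
    # cycle-level simulation results; not the paper's actual procedure.
    def select_prf_size(results):
        """results: {prf_size: (ips, hmean_wipc)} measured by simulation."""
        best_size = max(results, key=lambda size: results[size][0])  # maximize throughput
        return best_size, results[best_size]

    toy_results = {256: (5.1e9, 0.62), 320: (5.6e9, 0.66), 384: (5.5e9, 0.67)}
    print(select_prf_size(toy_results))   # picks the 320-entry point in this toy data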


Design, Automation, and Test in Europe | 2009

Light NUCA: a proposal for bridging the inter-cache latency gap

Dario Suarez; Teresa Monreal; Fernando Vallejo; Ramón Beivide; Víctor Viñals

To deal with the “memory wall” problem, microprocessors include large secondary on-chip caches. But as these caches grow, they open a new latency gap between them and the fast L1 caches (the inter-cache latency gap). Recently, Non-Uniform Cache Architectures (NUCAs) have been proposed to sustain the size growth trend of secondary caches, which is threatened by wire-delay problems. NUCAs are size-oriented, and they were not conceived to close the inter-cache latency gap. To tackle this problem, we propose Light NUCAs (L-NUCAs), which leverage on-chip wire density to interconnect small tiles through specialized networks that convey packets with distributed and dynamic routing. Our design reduces the tile delay (cache access plus one-hop routing) to a single processor cycle and places cache lines at a finer granularity than conventional caches, reducing cache latency. Our evaluations show that, in general, an L-NUCA simultaneously improves performance, energy, and area when integrated into either conventional or D-NUCA hierarchies.
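
To give a feel for how a tiled organization narrows the latency gap, the toy model below assumes one cycle per tile (access plus one-hop routing) and charges a hit roughly its Manhattan hop distance plus one cycle; this simple cost model is an assumption for illustration, not the paper's exact network.

    # Toy latency model of the L-NUCA idea: each tile costs one processor cycle
    # (cache access plus one-hop routing), so a hit in a tile at Manhattan
    # distance d from the cache root costs roughly d + 1 cycles.
    def lnuca_hit_latency(tile_x, tile_y):
        return abs(tile_x) + abs(tile_y) + 1   # hops to reach the tile, plus its access

    # Nearby tiles answer in very few cycles, shrinking the L1-to-L2 latency gap.
    print([lnuca_hit_latency(x, 0) for x in range(4)])   # [1, 2, 3, 4]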


Symposium on Computer Architecture and High Performance Computing | 2014

Block Disabling Characterization and Improvements in CMPs Operating at Ultra-low Voltages

Alexandra Ferrerón; Darío Suárez-Gracia; Jesús Alastruey-Benedé; Teresa Monreal; Víctor Viñals

Power density has become the limiting factor in technology scaling, as the power budget restricts the amount of hardware that can be active at the same time. Reducing the supply voltage to ultra-low ranges close to the threshold region promises large energy savings. However, the potential savings of voltage scaling are limited by the correct operation of SRAM cells, which is not guaranteed below Vddmin, the minimum voltage at which cache structures operate reliably. Understanding the effects of operating below Vddmin requires complex modelling, so we introduce an updated probability-of-failure model of SRAM cells at 22nm and explore the reliability impact of lowering the chip supply voltage below Vddmin in shared-memory coherent chip multiprocessors (CMPs) running a variety of parallel workloads. Block disabling is a microarchitectural technique to cope with cache reliability at ultra-low voltages; however, in many cases the savings in on-chip caches do not compensate for the consumption in the rest of the system, as the consumption increase of the off-chip memory may offset the on-chip gain. We make the case that existing coherence mechanisms can provide the substrate to improve energy savings with block disabling and propose two low-complexity techniques. Taking the best of both techniques, we can scale voltage below Vddmin and reduce system energy by up to 39% and system energy-delay by up to 10%. In addition, by lowering the CMP consumption in a power-constrained scenario, we could activate offline cores, reaching a potential speedup between 3.7 and 4.4.
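
As a rough sketch of plain block disabling (the baseline technique the paper builds on, not its coherence-based proposals), the Python fragment below marks a cache block unusable if any of its cells fails at the chosen voltage and counts how much capacity survives; the failure probability and cache geometry are invented for the example.

    # Minimal sketch of block disabling: a block with any faulty SRAM cell at
    # the chosen ultra-low voltage is disabled, and accesses to it must go to
    # the next cache level instead.
    import random

    def usable_blocks(num_blocks, bits_per_block, p_bit_fail, seed=0):
        random.seed(seed)
        ok = 0
        for _ in range(num_blocks):
            # a block survives only if every one of its cells works at this voltage
            faulty = any(random.random() < p_bit_fail for _ in range(bits_per_block))
            ok += 0 if faulty else 1
        return ok

    blocks = 512                                  # e.g. a 32 KB cache of 64-byte blocks
    print(usable_blocks(blocks, 64 * 8, 1e-3))    # blocks still usable below Vddmin (of 512)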


Computing Frontiers | 2006

Speculative early register release

Jesús Alastruey; Teresa Monreal; Víctor Viñals; Mateo Valero

The late release policy of conventional renaming keeps many registers in the register file allocated even though they hold values that will never be read again. In this work, we study the potential of a novel scheme that speculatively releases a physical register as soon as it has been read by the predicted last instruction that references its value. An auxiliary register file placed outside the critical paths of the processor pipeline holds the early released values in case they are unexpectedly referenced by some instruction. In addition to demonstrating the feasibility of a last-use predictor, this paper also analyzes the auxiliary register file (latency and size) required to support a speculative early release mechanism that uses a perfect predictor. The results set the performance bound that any real speculative early release implementation can reach. We show that in a processor with a 64int+64fp register file, a perfect early release supported by an unbounded auxiliary register file has the potential to speed up computation by up to 23% and 47% for SPECint2000 and SPECfp2000 benchmarks, respectively. Speculative early release can also be used to reduce register file size without losing performance. For instance, a processor with a conventionally managed 96int+96fp register file could be replaced, at equal IPC, by a 64int+64fp register file managed with perfect early register release and backed by a 64int+64fp auxiliary register file, which represents a 12% IPS (instructions per second) increase if the processor frequency were constrained by the register file access time.
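
The interplay between the main register file and the auxiliary one described above can be sketched as follows in Python; the latencies and the exact hand-off policy are assumptions made for illustration rather than the paper's implementation.

    # Minimal sketch: when a read is predicted to be the last use, the value is
    # copied to a slower auxiliary register file (XRF) and the physical register
    # is freed early; a mispredicted later read still finds the value, just at
    # higher latency.
    class RegisterFiles:
        PRF_LATENCY, XRF_LATENCY = 1, 4            # cycles; both values are assumptions

        def __init__(self):
            self.prf, self.xrf, self.free = {}, {}, []

        def write(self, pr, value):
            self.prf[pr] = value

        def read(self, pr, predicted_last_use=False):
            if pr in self.prf:
                value, latency = self.prf[pr], self.PRF_LATENCY
                if predicted_last_use:             # speculative early release
                    self.xrf[pr] = self.prf.pop(pr)
                    self.free.append(pr)
                return value, latency
            return self.xrf[pr], self.XRF_LATENCY  # safety net for mispredictions

    rf = RegisterFiles()
    rf.write(3, 42)
    print(rf.read(3, predicted_last_use=True))     # (42, 1) and pr 3 is freed
    print(rf.read(3))                              # (42, 4): value recovered from the XRF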


IEEE International Conference on High Performance Computing, Data, and Analytics | 2005

Hardware support for early register release

Teresa Monreal; Víctor Viñals; Antonio González; Mateo Valero

Register files are becoming one of the critical components of current out-of-order processors in terms of delay and power consumption, since their potential to exploit instruction-level parallelism is closely related to the size and number of ports of the register file. In conventional register-renaming schemes, register releasing is conservatively done only after the instruction that redefines the same register is committed. Instead, we propose a scheme that releases registers as soon as the processor knows that there will be no further use of them. We present two early releasing hardware implementations with different performance/complexity trade-offs. Detailed cycle-level simulations show either a significant speedup for a given register file size, or a reduction in register file size for a given performance level.


IEEE International Conference on High Performance Computing, Data, and Analytics | 1997

Virtual registers

Antonio González; Mateo Valero; José González; Teresa Monreal

Collaboration


Dive into Teresa Monreal's collaborations.

Top Co-Authors

Mateo Valero
Polytechnic University of Catalonia

Antonio González
Polytechnic University of Catalonia

José González
Polytechnic University of Catalonia

Francisco J. Cazorla
Barcelona Supercomputing Center