
Publication


Featured research published by Oliverio J. Santana.


ACM Transactions on Architecture and Code Optimization | 2004

Toward kilo-instruction processors

Adrián Cristal; Oliverio J. Santana; Mateo Valero; Jose F. Martinez

The continuously increasing gap between processor and memory speeds is a serious limitation to the performance achievable by future microprocessors. Currently, processors tolerate long-latency memory operations largely by maintaining a high number of in-flight instructions. In the future, this may require supporting many hundreds, or even thousands, of in-flight instructions. Unfortunately, the traditional approach of scaling up critical processor structures to provide such support is impractical at these levels, due to area, power, and cycle time constraints.

In this paper we show that, in order to overcome this resource-scalability problem, the way in which critical processor resources are managed must be changed. Instead of simply upsizing the processor structures, we propose a smarter use of the available resources, supported by a selective checkpointing mechanism. This mechanism allows instructions to commit out of order, and makes a reorder buffer unnecessary. We present a set of techniques such as multilevel instruction queues, late allocation and early release of registers, and early release of load/store queue entries. Together, these techniques constitute what we call a kilo-instruction processor, an architecture that can support thousands of in-flight instructions, and thus may achieve high performance even in the presence of large memory access latencies.
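The scaling pressure the abstract describes follows from a Little's-law style estimate: to keep issuing while a load waits on memory, the window must hold roughly latency times issue width instructions. A minimal sketch of that arithmetic (the latency and width figures below are illustrative assumptions, not numbers from the paper):

```python
def inflight_needed(memory_latency_cycles: int, issue_width: int) -> int:
    """Instructions the window must hold so a processor can keep issuing
    at full width while one load waits on main memory."""
    return memory_latency_cycles * issue_width

# A 4-wide core facing a 300-cycle miss must keep about 1200 instructions
# in flight -- far beyond what a conventional reorder buffer can hold,
# which is the motivation for the checkpoint-based techniques above.
print(inflight_needed(300, 4))
```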


international symposium on microarchitecture | 2005

Kilo-instruction processors: overcoming the memory wall

Adrian Cristal; Oliverio J. Santana; Francisco J. Cazorla; Marco Galluzzi; Tanausu Ramirez; Miquel Pericàs; Mateo Valero

Historically, advances in integrated circuit technology have driven improvements in processor microarchitecture and led to today's microprocessors with sophisticated pipelines operating at very high clock frequencies. However, performance improvements achievable by high-frequency microprocessors have become seriously limited by main-memory access latencies because main-memory speeds have improved at a much slower pace than microprocessor speeds. It's crucial to deal with this performance disparity, commonly known as the memory wall, to enable future high-frequency microprocessors to achieve their performance potential. To overcome the memory wall, we propose kilo-instruction processors: superscalar processors that can maintain a thousand or more simultaneous in-flight instructions. Doing so means designing key hardware structures so that the processor can satisfy the high resource requirements without significantly decreasing processor efficiency or increasing energy consumption.


international symposium on microarchitecture | 2002

Fetching instruction streams

Alex Ramirez; Oliverio J. Santana; Josep-lluis Larriba-pey; Mateo Valero

Fetch performance is a very important factor because it effectively limits the overall processor performance. However, there is little performance advantage in increasing front-end performance beyond what the back-end can consume. For each processor design, the target is to build the best possible fetch engine for the required performance level. A fetch engine will be better if it provides better performance, but also if it takes fewer resources, requires less chip area, or consumes less power. In this paper we propose a novel fetch architecture based on the execution of long streams of sequential instructions, taking maximum advantage of code layout optimizations. We describe our architecture in detail, and show that it requires less complexity and resources than other high performance fetch architectures like the trace cache, while providing a high fetch performance suitable for wide-issue superscalar processors. Our results show that our fetch architecture combined with code layout optimizations achieves 10% higher performance than the EV8 fetch architecture, and 4% higher than the FTB architecture using state-of-the-art branch predictors, while being only 1.5% slower than the trace cache. Even in the absence of code layout optimizations, fetching instruction streams is still 10% faster than the EV8, and only 4% slower than the trace cache. Fetching instruction streams effectively exploits the special characteristics of layout optimized codes to provide a high fetch performance, close to that of a trace cache, but has a much lower cost and complexity, similar to that of a basic block architecture.
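The notion of a stream can be sketched as splitting a dynamic instruction trace into maximal sequential runs, each ended by a taken branch. A minimal sketch (a simplification of the paper's definition; the toy trace below is made up for illustration):

```python
def split_into_streams(trace):
    """Split a dynamic trace of (pc, taken_branch) pairs into streams:
    maximal runs of sequential instructions ended by a taken branch.
    Longer streams mean fewer predictions per fetched instruction."""
    streams, current = [], []
    for pc, taken in trace:
        current.append(pc)
        if taken:                 # a taken branch ends the current stream
            streams.append(current)
            current = []
    if current:                   # trailing partial stream
        streams.append(current)
    return streams

trace = [(0, False), (4, False), (8, True),   # stream 1: falls through twice, then jumps
         (40, False), (44, True),             # stream 2
         (12, False)]                         # stream 3 (trace ends)
print(split_into_streams(trace))
```

Code layout optimizations help precisely because they turn most branches into not-taken fall-throughs, lengthening the streams this function would produce.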


international conference on parallel architectures and compilation techniques | 2007

FAME: FAirly MEasuring Multithreaded Architectures

Javier Vera; Francisco J. Cazorla; Alex Pajuelo; Oliverio J. Santana; Enrique Fernández; Mateo Valero

Nowadays, multithreaded architectures are becoming more and more popular. In order to evaluate their behavior, several methodologies and metrics have been proposed. A methodology defines when the measurements of a given workload execution are taken. A metric combines those measurements to obtain a final evaluation result. However, since current evaluation methodologies do not provide representative measurements for these metrics, the analysis and evaluation of novel ideas could be either unfair or misleading. Given the potential impact of multithreaded architectures on current and future processor designs, it is crucial to develop an accurate evaluation methodology for them. This paper presents FAME, a new evaluation methodology aimed at fairly measuring the performance of multithreaded processors. FAME re-executes all traces in a multithreaded workload until all of them are fairly represented in the final measurements taken from the workload. We compare FAME with previously used methodologies for both architectural research simulators and real processors. Our results show that FAME provides more accurate measurements than other methodologies, becoming an ideal evaluation methodology to analyze proposals for multithreaded architectures.
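The fairness problem FAME targets can be illustrated with a toy calculation: when a measurement window cuts a looped program mid-run, the measurements over-weight the early part of that program. A sketch under assumed cycle counts (invented for illustration, not data from the paper):

```python
def full_runs_and_fraction(length, window):
    """For one program re-executed in a loop over a measurement window,
    return (completed full executions, leftover partial fraction).
    A large leftover fraction means the program is unfairly represented."""
    return window // length, (window % length) / length

# With a 1000-cycle window, a 400-cycle program is caught mid-execution:
# its measurements over-weight the first half of the program...
print(full_runs_and_fraction(400, 1000))
# ...while a 250-cycle program happens to be fairly represented.
print(full_runs_and_fraction(250, 1000))
```

FAME's re-execution rule amounts to extending the window until every program's leftover fraction is small enough to be negligible.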


high-performance computer architecture | 2008

Runahead Threads to improve SMT performance

Tanausu Ramirez; Alex Pajuelo; Oliverio J. Santana; Mateo Valero

In this paper, we propose runahead threads (RaT) as a valuable solution for both reducing resource contention and exploiting memory-level parallelism in simultaneous multithreaded (SMT) processors. Our technique converts a resource intensive memory-bound thread to a speculative light thread under long-latency blocking memory operations. These speculative threads prefetch data and instructions with minimal resources, reducing critical resource conflicts between threads. We compare an SMT architecture using RaT to both state-of-the-art static fetch policies and dynamic resource control policies. In terms of throughput and fairness, our results show that RaT performs better than any other policy. The proposed mechanism improves average throughput by 37% relative to previous static fetch policies and by 28% compared to previous dynamic resource scheduling mechanisms. RaT also improves fairness by 36% and 30%, respectively. In addition, the proposed mechanism permits register file size reduction of up to 60% in an SMT processor without performance degradation.
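The mode switch RaT performs can be caricatured as a small state machine: a thread blocked on a long-latency miss becomes a speculative light thread whose only useful side effect is prefetching. A toy sketch, with the class and method names invented for illustration (the real mechanism is hardware, not software):

```python
class RunaheadThread:
    """Toy model of the RaT mode switch, not the paper's hardware design."""

    def __init__(self, name):
        self.name = name
        self.runahead = False
        self.prefetched = []

    def long_latency_miss(self):
        # Instead of holding shared SMT resources while blocked,
        # the thread enters speculative runahead mode.
        self.runahead = True

    def execute(self, addr):
        if self.runahead:
            # Runahead results are thrown away, but each memory access
            # warms the caches for the eventual real execution.
            self.prefetched.append(addr)

    def miss_completes(self):
        # Discard all speculative state and resume normal execution.
        self.runahead = False
        self.prefetched.clear()

t = RunaheadThread("t0")
t.long_latency_miss()
t.execute(0x1000)
t.execute(0x1040)
print(t.runahead, t.prefetched)
t.miss_completes()
print(t.runahead, t.prefetched)
```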


Innovative Architecture for Future Generation High-Performance Processors and Systems, 2003 | 2003

Latency tolerant branch predictors

Oliverio J. Santana; Alex Ramirez; Mateo Valero

The access latency of branch predictors is a well known problem of fetch engine design. Prediction overriding techniques are commonly accepted to overcome this problem. However, prediction overriding requires a complex recovery mechanism to discard the wrong speculative work based on overridden predictions. In this paper, we show that stream and trace predictors, which use long basic prediction units, can tolerate access latency without needing overriding, thus reducing fetch engine complexity. We show that both the stream fetch engine and the trace cache architecture not using overriding outperform other efficient fetch engines, such as an EV8-like fetch architecture or the FTB fetch engine, even when they do use overriding.
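The argument for latency-tolerant predictors reduces to a one-line check: a prediction covering a long unit (a stream or trace) supplies several cycles of fetch work, which can hide the predictor's multi-cycle access without an overriding mechanism. A sketch with illustrative numbers (assumed for the example, not measurements from the paper):

```python
def override_free(unit_length, fetch_width, predictor_latency):
    """True if one prediction supplies enough cycles of fetch work
    (unit_length / fetch_width) to cover the predictor's access latency,
    making a prediction-overriding mechanism unnecessary."""
    return unit_length / fetch_width >= predictor_latency

# A ~5-instruction basic block cannot hide a 2-cycle predictor on an
# 8-wide fetch engine, but a ~16-instruction stream can.
print(override_free(5, 8, 2), override_free(16, 8, 2))
```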


ACM Transactions on Architecture and Code Optimization | 2004

A low-complexity fetch architecture for high-performance superscalar processors

Oliverio J. Santana; Alex Ramirez; Josep-lluis Larriba-pey; Mateo Valero

Fetch engine performance is a key topic in superscalar processors, since it limits the instruction-level parallelism that can be exploited by the execution core. In the search of high performance, the fetch engine has evolved toward more efficient designs, but its complexity has also increased.

In this paper, we present the stream fetch engine, a novel architecture based on the execution of long streams of sequential instructions, taking maximum advantage of code layout optimizations. We describe our design in detail, showing that it achieves high fetch performance, while requiring less complexity than other state-of-the-art fetch architectures.


ieee international conference on high performance computing data and analytics | 2003

Tolerating Branch Predictor Latency on SMT

Ayose Falcón; Oliverio J. Santana; Alex Ramirez; Mateo Valero

Simultaneous Multithreading (SMT) tolerates latency by executing instructions from multiple threads. If a thread is stalled, resources can be used by other threads. However, fetch stall conditions caused by multi-cycle branch predictors prevent SMT from achieving all its potential performance, since the flow of fetched instructions is halted.


ieee international conference on high performance computing data and analytics | 2002

A Comprehensive Analysis of Indirect Branch Prediction

Oliverio J. Santana; Ayose Falcón; Enrique Fernández; Pedro Medina; Alex Ramirez; Mateo Valero

Indirect branch prediction is a performance limiting factor for current computer systems, preventing superscalar processors from exploiting the available ILP. Indirect branches are responsible for 55.7% of mispredictions in our benchmark set, although they represent only 15.5% of dynamic branches. Moreover, a 10.8% average IPC speedup is achievable by perfectly predicting all indirect branches.

The Multi-Stage Cascaded Predictor (MSCP) is a mechanism proposed for improving indirect branch prediction. In this paper, we show that a MSCP can replace a BTB and accurately predict the target address of both indirect and non-indirect branches. We do a detailed analysis of MSCP behavior and evaluate it in a realistic setup, showing that a 5.7% average IPC speedup is achievable.
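The cascaded idea can be sketched as a two-stage lookup: a history-indexed second stage, filled only when the first stage mispredicts, overrides a simple last-target first stage. A toy Python model (the class and its tables are invented for illustration and are much simpler than the actual MSCP design):

```python
class CascadedPredictor:
    """Toy two-stage cascaded indirect-branch predictor."""

    def __init__(self):
        self.last_target = {}     # stage 1: pc -> last observed target
        self.history_table = {}   # stage 2: (pc, branch history) -> target

    def predict(self, pc, history):
        # Stage 2 overrides on a hit; otherwise fall back to stage 1.
        if (pc, history) in self.history_table:
            return self.history_table[(pc, history)]
        return self.last_target.get(pc)

    def update(self, pc, history, target):
        # Stage-2 entries are allocated only when stage 1 would have
        # mispredicted, reserving the larger table for hard branches.
        if self.last_target.get(pc) != target:
            self.history_table[(pc, history)] = target
        self.last_target[pc] = target

p = CascadedPredictor()
# A virtual call site (pc=0x20) whose target correlates with history.
p.update(0x20, "AB", 0x100)
p.update(0x20, "BA", 0x200)
print(hex(p.predict(0x20, "BA")))  # stage-2 hit
print(hex(p.predict(0x20, "AB")))  # stage-2 hit
print(hex(p.predict(0x20, "XY")))  # stage-2 miss: stage 1's last target
```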


IEEE Transactions on Computers | 2010

On the Problem of Evaluating the Performance of Multiprogrammed Workloads

Francisco J. Cazorla; Alex Pajuelo; Oliverio J. Santana; Enrique Fernández; Mateo Valero

Multithreaded architectures are becoming more and more popular. In order to evaluate their behavior, several methodologies and metrics have been proposed. A methodology defines when the measurements for a given workload execution are taken. A metric combines those measurements to obtain a final evaluation result. However, since current evaluation methodologies do not provide representative measurements for these metrics, the analysis and evaluation of novel ideas could be either unfair or misleading. Given the potential impact of multithreaded architectures on current and future processor designs, it is crucial to develop an accurate evaluation methodology for them. This paper presents FAME, a new evaluation methodology aimed at fairly measuring the performance of multithreaded processors executing multiprogrammed workloads. FAME re-executes all programs in the workload until all of them are fairly represented in the final measurements taken. We compare FAME with previously used methodologies, showing that it provides more accurate measurements, becoming an ideal evaluation methodology to analyze proposals for multithreaded architectures.

Collaboration


Dive into Oliverio J. Santana's collaboration.

Top Co-Authors

Mateo Valero, University of Las Palmas de Gran Canaria

Alex Ramirez, Polytechnic University of Catalonia

Alex Pajuelo, Polytechnic University of Catalonia

Enrique Fernández, University of Las Palmas de Gran Canaria

Francisco J. Cazorla, Barcelona Supercomputing Center

Adrián Cristal, Barcelona Supercomputing Center

Javier Vera, Barcelona Supercomputing Center