Network


Latest external collaboration on country level. Dive into details by clicking on the dots.

Hotspot


Dive into the research topics where Enrique Fernández is active.

Publication


Featured researches published by Enrique Fernández.


international symposium on microarchitecture | 2004

Dynamically Controlled Resource Allocation in SMT Processors

Francisco J. Cazorla; Alex Ramirez; Mateo Valero; Enrique Fernández

SMT processors increase performance by executing instructions from several threads simultaneously. These threads use the resources of the processor better by sharing them but, at the same time, threads are competing for these resources. The way critical resources are distributed among threads determines the final performance. Currently, processor resources are distributed among threads as determined by the fetch policy that decides which threads enter the processor to compete for resources. However, current fetch policies only use indirect indicators of resource usage in their decision, which can lead to resource monopolization by a single thread or to resource waste when no thread can use them. Both situations can harm performance and happen, for example, after an L2 cache miss. In this paper, we introduce the concept of dynamic resource control in SMT processors. Using this concept, we propose a novel resource allocation policy for SMT processors. This policy directly monitors the usage of resources by each thread and guarantees that all threads get their fair share of the critical shared resources, avoiding monopolization. We also define a mechanism to allow a thread to borrow resources from another thread if that thread does not require them, thereby reducing resource under-use. Simulation results show that our dynamic resource allocation policy outperforms a static resource allocation policy by 8%, on average. It also improves the best dynamic resource-conscious fetch policies like FLUSH++ by 4%, on average, using the harmonic mean as a metric. This indicates that our policy does not obtain the ILP boost by unfairly running high ILP threads over slow memory-bounded threads. Instead, it achieves a better throughput-fairness balance.SMT processors increase performance by executing instructions from several threads simultaneously. These threads use the resources of the processor better by sharing them but, at the same time, threads are competing for these resources. The way critical resources are distributed among threads determines the final performance. Currently, processor resources are distributed among threads as determined by the fetch policy that decides which threads enter the processor to compete for resources. However, current fetch policies only use indirect indicators of resource usage in their decision, which can lead to resource monopolization by a single thread or to resource waste when no thread can use them. Both situations can harm performance and happen, for example, after an L2 cache miss. In this paper, we introduce the concept of dynamic resource control in SMT processors. Using this concept, we propose a novel resource allocation policy for SMT processors. This policy directly monitors the usage of resources by each thread and guarantees that all threads get their fair share of the critical shared resources, avoiding monopolization. We also define a mechanism to allow a thread to borrow resources from another thread if that thread does not require them, thereby reducing resource under-use. Simulation results show that our dynamic resource allocation policy outperforms a static resource allocation policy by 8%, on average. It also improves the best dynamic resource-conscious fetch policies like FLUSH++ by 4%, on average, using the harmonic mean as a metric. This indicates that our policy does not obtain the ILP boost by unfairly running high ILP threads over slow memory-bounded threads. Instead, it achieves a better throughput-fairness balance.


IEEE Transactions on Computers | 2006

Predictable performance in SMT processors: synergy between the OS and SMTs

Francisco J. Cazorla; Peter M. W. Knijnenburg; Rizos Sakellariou; Enrique Fernández; Alex Ramirez; Mateo Valero

Current operating systems (OS) perceive the different contexts of simultaneous multithreaded (SMT) processors as multiple independent processing units, although, in reality, threads executed in these units compete for the same hardware resources. Furthermore, hardware resources are assigned to threads implicitly as determined by the SMT instruction fetch (Ifetch) policy, without the control of the OS. Both factors cause a lack of control over how individual threads are executed, which can frustrate the work of the job scheduler. This presents a problem for general purpose systems, where the OS job scheduler cannot enforce priorities, and also for embedded systems, where it would be difficult to guarantee worst-case execution times. In this paper, we propose a novel strategy that enables a two-way interaction between the OS and the SMT processor and allows the OS to run jobs at a certain percentage of their maximum speed, regardless of the workload in which these jobs are executed. In contrast to previous approaches, our approach enables the OS to run time-critical jobs without dedicating all internal resources to them so that non-time-critical jobs can make significant progress as well and without significantly compromising overall throughput. In fact, our mechanism, in addition to fulfilling OS requirements, achieves 90 percent of the throughput of one of the best currently known fetch policies for SMTs.


computing frontiers | 2004

Predictable performance in SMT processors

Francisco J. Cazorla; Peter M. W. Knijnenburg; Rizos Sakellariou; Enrique Fernández; Alex Ramirez; Mateo Valero

Current instruction fetch policies in SMT processors are oriented towards optimization of overall throughput and/or fairness. However, they provide no control over how individual threads are executed, leading to performance unpredictability, since the IPC of a thread depends on the workload it is executed in and on the fetch policy used.From the point of view of the Operating System (OS), it is the job scheduler that determines how jobs are executed. However, when the OS runs on an SMT processor, the job scheduler cannot guarantee execution time constraints of any job due to this performance unpredictability.In this paper we propose a novel kind of collaboration between the OS and the SMT hardware that enables the OS to enforce that a high priority thread runs at a specific fraction of its full speed. We present an extensive evaluation using many different workloads, that shows that this mechanism gives the required performance in more than 97% of all cases considered, and even more than 99% for the less extreme cases. At the same time, our mechanism does not need to trade off predictability against overall throughput, as it maximizes the IPC of the remaining low priority threads, giving 94% on average (and 97.5% on average for the less extreme cases) of the throughput obtained using instruction fetch policies oriented toward throughput maximization, such as icount.


international parallel and distributed processing symposium | 2004

Dcache Warn: an I-fetch policy to increase SMT efficiency

Francisco J. Cazorla; Alex Ramirez; Mateo Valero; Enrique Fernández

Summary form only given. Simultaneous multithreading (SMT) processors increase performance by executing instructions from multiple threads simultaneously. These threads share the processors resources, but also compete for them. In this environment, a thread missing in the L2 cache may allocate a large number of resources for a long time, causing other threads to run much slower than they could. To prevent this problem we should know in advance if a thread is going to miss in the L2 cache. L1 misses are a clear indicator of a possible L2 miss. However, to stall a thread on every L1 miss is too severe, because not all L1 misses lead to an L2 miss, and this would cause an unnecessary stall and resource under-use. Also, to wait until an L2 miss is declared and squash the thread to free up the allocated resources is too expensive in terms of complexity and reexecuted instructions. We propose a novel fetch policy, which we call DWarn. DWarn uses L1 misses as indicators of L2 misses, giving higher priority to threads with no outstanding L1 misses. DWarn acts on L1 misses, before L2 misses happen in a controlled manner to reduce resource under-use and to avoid harming a thread when L1 misses do not lead to L2 misses. Our results show that DWarn outperforms previously proposed policies, in both throughput and fairness, while requiring fewer resources and avoiding instruction reexecution.


international symposium on microarchitecture | 2004

QoS for high-performance SMT processors in embedded systems

Francisco J. Cazorla; Alex Ramirez; Mateo Valero; Peter M. W. Knijnenburg; Rizos Sakellariou; Enrique Fernández

Although simultaneous multithreading processors provide a good cost-performance tradeoff, they exhibit unpredictable performance in real-time applications. We present a resource management scheme that eliminates a major cause of performance unpredictability in SMTs, making them suitable for many types of embedded systems.


ieee international conference on high performance computing data and analytics | 2003

Improving Memory Latency Aware Fetch Policies for SMT Processors

Francisco J. Cazorla; Enrique Fernández; Alex Ramirez; Mateo Valero

In SMT processors several threads run simultaneously to increase available ILP, sharing but competing for resources. The instruction fetch policy plays a key role, determining how shared resources are allocated.


international conference on parallel architectures and compilation techniques | 2007

FAME: FAirly MEasuring Multithreaded Architectures

Javier Vera; Francisco J. Cazorla; Alex Pajuelo; Oliverio J. Santana; Enrique Fernández; Mateo Valero

Nowadays, multithreaded architectures are becoming more and more popular. In order to evaluate their behavior, several methodologies and metrics have been proposed. A methodology defines when the measurements of a given workload execution are taken. A metric combines those measurements to obtain a final evaluation result. However, since current evaluation methodologies do not provide representative measurements for these metrics, the analysis and evaluation of novel ideas could be either unfair or misleading. Given the potential impact of multithreaded architectures on current and future processor designs, it is crucial to develop an accurate evaluation methodology for them. This paper presents FAME, a new evaluation methodology aimed to fairly measure the performance of multithreaded processors. FAME re-executes all traces in a multithreaded workload until all of them are fairly represented in the final measurements taken from the workload. We compare FAME with previously used methodologies for both architectural research simulators and real processors. Our results show that FAME provides more accurate measurements than other methodologies, becoming an ideal evaluation methodology to analyze proposals for multithreaded architectures.


compilers, architecture, and synthesis for embedded systems | 2005

Architectural support for real-time task scheduling in SMT processors

Francisco J. Cazorla; Peter M. W. Knijnenburg; Rizos Sakellariou; Enrique Fernández; Alex Ramirez; Mateo Valero

In Simultaneous Multithreaded (SMT) architectures most hardware resources are shared between threads. This provides a good cost/performance trade-off which renders these architectures suitable for use in embedded systems. However, since threads share many resources, they also interfere with each other. As a result, execution times of applications become highly unpredictable and dependent on the context in which an application is executed. Obviously, this poses problems if an SMT is to be used in a real-time system.In this paper, we propose two novel hardware mechanisms that can be used to reduce this performance variability. In contrast to previous approaches, our proposed mechanisms do not need any information beyond the information already known by traditional job schedulers. Nor do they require extensive profiling of workloads to determine optimal schedules. Our mechanisms are based on dynamic resource partitioning. The OS level job scheduler needs to be slightly adapted in order to provide the hardware resource allocator some information on how this resource partitioning needs to be done. We show that our mechanisms provide high stability for SMT architectures to be used in real-time systems: the real time benchmarks we used meet their deadlines in more than 98% of the cases considered while the other thread in the workload still achieves high throughput.


ieee international conference on high performance computing data and analytics | 2004

Optimising long-latency-load-aware fetch policies for SMT processors

Francisco J. Cazorla; Alex Ramirez; Mateo Valero; Enrique Fernández

Simultaneous multithreading (SMT) processors fetch instructions from several threads, increasing the available instruction level parallelism of each thread exposed to the processor. In an SMT the fetch engine decides which threads enter the processor and have priority in using resources. Hence, the fetch engine determines how shared resources are allocated, playing a key role in the final performance of the machine. When a thread experiences an L2 cache miss, critical resources can be monopolised for a long time, throttling the execution of remaining threads. Several approaches have been proposed to cope with this problem. The first contribution of this paper is the evaluation and comparison of the three best published policies addressing the long latency load problem. The second and main contributions of this paper are that we have proposed improved versions of these three policies. Our results show that the improved versions significantly enhance the original ones in both throughput and fairness.


ieee international conference on high performance computing data and analytics | 2002

A Comprehensive Analysis of Indirect Branch Prediction

Oliverio J. Santana; Ayose Falcón; Enrique Fernández; Pedro Medina; Alex Ramirez; Mateo Valero

Indirect branch prediction is a performance limiting factor for current computer systems, preventing superscalar processors from exploiting the available ILP. Indirect branches are responsible for 55.7% of mispredictions in our benchmark set, although they only stand for 15.5% of dynamic branches. Moreover, a 10.8% average IPC speedup is achievable by perfectly predicting all indirect branches.The Multi-Stage Cascaded Predictor (MSCP) is a mechanism proposed for improving indirect branch prediction. In this paper, we show that a MSCP can replace a BTB and accurately predict the target address of both indirect and non-indirect branches. We do a detailed analysis of MSCP behavior and evaluate it in a realistic setup, showing that a 5.7% average IPC speedup is achievable.

Collaboration


Dive into the Enrique Fernández's collaboration.

Top Co-Authors

Avatar

Mateo Valero

University of Las Palmas de Gran Canaria

View shared research outputs
Top Co-Authors

Avatar

Francisco J. Cazorla

Barcelona Supercomputing Center

View shared research outputs
Top Co-Authors

Avatar

Alex Ramirez

Polytechnic University of Catalonia

View shared research outputs
Top Co-Authors

Avatar
Top Co-Authors

Avatar
Top Co-Authors

Avatar

Oliverio J. Santana

Polytechnic University of Catalonia

View shared research outputs
Top Co-Authors

Avatar
Top Co-Authors

Avatar

Alex Pajuelo

Polytechnic University of Catalonia

View shared research outputs
Top Co-Authors

Avatar

Alejandro García

Polytechnic University of Catalonia

View shared research outputs
Researchain Logo
Decentralizing Knowledge