Publication


Featured research published by Fernando Latorre.


international conference on supercomputing | 2004

Back-end assignment schemes for clustered multithreaded processors

Fernando Latorre; José González; Antonio González

Power consumption and wire delays are two important limiting factors for current and forthcoming processors. Monolithic designs that keep power consumption reasonable and operate at high clock frequencies are ever harder to implement. In this paper we propose a novel multithreaded clustered microarchitecture that consists of a clustered front-end capable of fetching instructions from multiple threads and a clustered back-end where instructions are executed. This microarchitecture combines the concepts of multithreading and clustering to attack both problems: power consumption and wire delays. A key aspect of this microarchitecture is the assignment of resources to the simultaneously running threads. We propose two back-end assignment schemes: in the Static Back-end Assignment (SBA) the back-ends are statically assigned to the front-ends, while in the Dynamic Back-end Assignment (DBA) the back-ends are dynamically assigned according to the demands of each front-end. A limit study of the potential performance of DBA shows a minor benefit compared to SBA. The reasons why the DBA scheme does not perform as initially expected are investigated and the main limiting factors of this architecture are evaluated. Finally, we point out the advantages of DBA versus SBA.


international symposium on computer architecture | 2009

Boosting single-thread performance in multi-core systems through fine-grain multi-threading

Carlos Madriles; Pedro Lopez; Josep M. Codina; Enric Gibert; Fernando Latorre; Alejandro Martínez; Raúl Martínez; Antonio González

Industry has shifted towards multi-core designs as we have hit the memory and power walls. However, single-thread performance remains of paramount importance since some applications have limited thread-level parallelism (TLP), and even a small part with limited TLP imposes important constraints on the global performance, as explained by Amdahl's law. In this paper we propose a novel approach for leveraging multiple cores to improve single-thread performance in a multi-core design. The proposed technique features a set of novel hardware mechanisms that support the execution of threads generated at compile time. These threads result from a fine-grain speculative decomposition of the original application and they are executed under a modified multi-core system that includes: (1) mechanisms to support multiple versions; (2) mechanisms to detect violations among threads; (3) mechanisms to reconstruct the original sequential order; and (4) mechanisms to checkpoint the architectural state and recover from misspeculations. The proposed scheme outperforms previous hardware-only schemes that combine cores to execute single-thread applications in a multi-core design by more than 10% on average on Spec2006 for all configurations. Moreover, single-thread performance is improved by 41% on average when the proposed scheme is used on a Tiny Core, and up to 2.6x for some selected applications.
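
As a reminder of why a small serial portion matters, Amdahl's law can be written as follows. The symbols are our own notation, not taken from the paper: p is the fraction of execution time that can be parallelized and n is the number of cores.

```latex
S(n) = \frac{1}{(1 - p) + \frac{p}{n}},
\qquad
\lim_{n \to \infty} S(n) = \frac{1}{1 - p}

% Illustrative numbers only: with p = 0.9 and n = 8,
% S(8) = 1 / (0.1 + 0.9/8) = 1 / 0.2125 \approx 4.71,
% and no core count can push the speedup past 1 / 0.1 = 10.
```

In this illustrative case, a 10% serial part caps the speedup at 10x regardless of how many cores are added, which is the motivation for improving single-thread performance.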


international conference on parallel architectures and compilation techniques | 2009

Anaphase: A Fine-Grain Thread Decomposition Scheme for Speculative Multithreading

Carlos Madriles; Pedro Lopez; Josep M. Codina; Enric Gibert; Fernando Latorre; Alejandro Martínez; Raúl Martínez; Antonio González

Industry is moving towards multi-core designs as we have hit the memory and power walls. Multi-core designs are very effective at exploiting thread-level parallelism (TLP) but do not provide benefits when executing serial code (applications with low TLP, serial parts of a parallel application and legacy code). In this paper we propose Anaphase, a novel approach for speculative multithreading to improve single-thread performance in a multi-core design. The proposed technique is based on a graph partitioning technique which decomposes applications into speculative threads at instruction granularity. Moreover, the proposed technique leverages communications and pre-computation slices to deal with inter-thread dependences. Results presented in this paper show that this approach improves single-thread performance by 32% on average and up to 2.15x for some selected applications of the Spec2006 suite. In addition, the proposed technique outperforms schemes in which thread decomposition is performed at a coarser granularity by 21% on average.


international symposium on computer architecture | 2004

Cache organizations for clustered microarchitectures

José González; Fernando Latorre; Antonio González

Clustered microarchitectures are an effective organization to deal with the problem of wire delays and complexity by partitioning some of the processor resources. The organization of the data cache is a key factor in these processors due to its effect on cache miss rate and inter-cluster communications. This paper investigates alternative designs of the data cache: centralized, distributed, replicated and physically distributed cache architectures are analyzed. Results show similar average performance but significant performance variations depending on the application features, especially cache miss ratio and communications. In addition, we propose a novel instruction steering scheme in order to reduce communications. This scheme conditionally stalls the dispatch of instructions depending on the occupancy of the clusters, whenever the current instruction cannot be steered to the cluster holding most of the inputs. This new steering outperforms traditional schemes. Results show an average speedup of 5% and up to 15% for some applications.
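
The steering idea described above can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract does not give the exact stall condition, so the function `steer`, its parameters, and the rule "stall unless some other cluster is below `stall_threshold`" are all assumptions made for illustration.

```python
from collections import Counter
from typing import Optional, Sequence


def steer(input_clusters: Sequence[int],
          occupancy: Sequence[int],
          capacity: int,
          stall_threshold: int) -> Optional[int]:
    """Return the cluster to dispatch to, or None to stall this cycle.

    input_clusters: cluster id holding each source operand (hypothetical input).
    occupancy:      current number of queued instructions per cluster.
    capacity:       maximum queue entries per cluster.
    stall_threshold: assumed tuning knob; see the lead-in paragraph.
    """
    # Least-occupied cluster; ties go to the lowest index.
    least_loaded = min(range(len(occupancy)), key=lambda c: occupancy[c])

    if not input_clusters:
        # No register inputs: plain load balancing.
        return least_loaded if occupancy[least_loaded] < capacity else None

    counts = Counter(input_clusters)
    # Cluster holding most of the inputs; ties go to the lowest index.
    preferred = min(counts, key=lambda c: (-counts[c], c))
    if occupancy[preferred] < capacity:
        return preferred

    # Preferred cluster is full. Assumed rule: go elsewhere only if another
    # cluster is lightly loaded; otherwise stall and retry next cycle, which
    # avoids paying for inter-cluster communication.
    if occupancy[least_loaded] < min(stall_threshold, capacity):
        return least_loaded
    return None


print(steer([0, 0, 1], [2, 5], capacity=8, stall_threshold=4))
print(steer([0, 0, 1], [8, 6], capacity=8, stall_threshold=4))
```

The first call steers to the cluster that holds most of the inputs; the second stalls because that cluster is full and the alternative is not lightly loaded under the assumed threshold.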


virtual execution environments | 2012

DDGacc: boosting dynamic DDG-based binary optimizations through specialized hardware support

Demos Pavlou; Enric Gibert; Fernando Latorre; Antonio González

Software-based Dynamic Binary Translation (DBT) and Dynamic Binary Optimization (DBO) are widely used for several reasons, including performance, design simplification and virtualization. However, the software layer in such systems introduces non-negligible overheads which affect performance and user experience. Hence, reducing DBT/DBO overheads is of paramount importance. In addition, reduced overheads have interesting collateral effects on the rest of the software layer, such as allowing optimizations to be applied earlier. A cost-effective solution to this problem is to provide hardware support to speed up the primitives of the software layer, paying special attention to automating DBT/DBO mechanisms while leaving the heuristics to the software, which is more flexible. In this work, we have characterized the overheads of a DBO system using DynamoRIO implementing several basic optimizations. We have seen that the computation of the Data Dependence Graph (DDG) accounts for 5%-10% of the execution time. For this reason, we propose to add hardware support for this task in the form of a new functional unit, called DDGacc, which is integrated in a conventional pipelined processor and is operated through new ISA instructions. Our evaluation shows that DDGacc reduces the cost of computing the DDG by 32x, which reduces overall execution time by 5%-10% on average and up to 18% for applications where the DBO optimizes large code footprints.
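
To make concrete the kind of work a DDG computation involves, here is a minimal Python sketch that builds a data dependence graph for a straight-line block. It tracks only read-after-write register dependences; the instruction encoding `(dest, srcs)` and the function `build_ddg` are hypothetical and say nothing about how DynamoRIO or DDGacc represent or compute the graph.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical encoding: (destination register or "", list of source registers).
Instr = Tuple[str, List[str]]


def build_ddg(block: List[Instr]) -> Dict[int, Set[int]]:
    """Map each instruction index to the indices of the producers it reads."""
    last_writer: Dict[str, int] = {}  # register -> index of its latest writer
    ddg: Dict[int, Set[int]] = {}
    for i, (dest, srcs) in enumerate(block):
        # A source depends on the most recent earlier write to that register;
        # registers never written in the block are live-ins with no edge.
        ddg[i] = {last_writer[s] for s in srcs if s in last_writer}
        if dest:
            last_writer[dest] = i
    return ddg


example_block: List[Instr] = [
    ("r1", ["r0"]),        # 0: r1 = load [r0]
    ("r2", ["r1", "r3"]),  # 1: r2 = r1 + r3
    ("r1", ["r2"]),        # 2: r1 = r2 * r2
    ("", ["r1", "r0"]),    # 3: store r1 -> [r0]
]

for idx, producers in build_ddg(example_block).items():
    print(idx, sorted(producers))
```

Each instruction requires a lookup and an update of the last-writer table, so the cost grows with the amount of code being optimized, which is consistent with the abstract's observation that large code footprints benefit most.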


high-performance computer architecture | 2011

Fg-STP: Fine-Grain Single Thread Partitioning on Multicores

Rakesh Ranjan; Fernando Latorre; Pedro Marcuello; Antonio González

Power and complexity issues have led the microprocessor industry to shift to Chip Multiprocessors (CMPs) in order to better utilize the additional transistors ensured by Moore's law. While parallel programs will be able to take the most advantage of these CMPs, single-thread applications are not equipped to benefit from them. In this paper we propose Fine-Grain Single-Thread Partitioning (Fg-STP), a hardware-only scheme that takes advantage of CMP designs to speed up single-threaded applications. Our proposal improves single-thread performance by reconfiguring two cores so that they collaborate on the fetching and execution of the instructions. These cores are basically conventional out-of-order cores in which execution is orchestrated using dedicated hardware that has minimal and localized impact on the original design of the cores. This approach partitions the code at instruction granularity and differs from previous proposals in its extensive use of dependence speculation, replication and communication. These features are combined with the ability to look for parallelism on large instruction windows without any software intervention (no re-compilation or profiling hints are needed). These characteristics allow Fg-STP to speed up single-thread execution by 18% and 7% on average over similar hardware-only approaches like Core Fusion, on medium-sized and small-sized 2-core CMPs respectively, for Spec 2006 benchmarks.


ieee international conference on high performance computing, data, and analytics | 2009

P-slice based efficient speculative multithreading

Rakesh Ranjan; Pedro Marcuello; Fernando Latorre; Antonio González

The microprocessor industry has recently shifted towards multi-core to take advantage of the ever-increasing number of transistors provided by new technologies. Unfortunately, the multi-core approach does not allow single-threaded applications to benefit from the additional cores to improve their execution time. Speculative multithreading (SpMT) has been proposed in the past to boost performance of irregular applications in multi-core environments. In this work, we study the main bottlenecks of these architectures, such as the memory behavior and the pre-computation slices, and propose two novel schemes that allow SpMT to get 25% average speedup over single-threaded execution. We propose Selective Replication as a technique to improve the performance of the SpMT memory system. This technique does not introduce additional traffic in the bus and improves the performance of a conventional SpMT memory model by 6% on average and up to 21% for some applications. Also, we propose a scheme called Slice Specialization that reduces the number of instructions in the pre-computation slices by adapting the slice to every single speculative thread spawned. The latter proposal outperforms previous schemes with slices by 15% and overall, both techniques combined achieve an improvement of 20% over a conventional SpMT processor.


high performance embedded architectures and compilers | 2011

CROB: implementing a large instruction window through compression

Fernando Latorre; Grigorios Magklis; Jose Gonzalez; Pedro Chaparro; Antonio González

Current processors require a large number of in-flight instructions in order to look for further parallelism and hide the increasing gap between memory latency and processor cycle time. These in-flight instructions are typically stored in a centralized structure called the reorder buffer (ROB), which is central to handling precise exceptions and recovering a safe state in the event of a branch misprediction. However, this structure is becoming so big that it is difficult to fit it in the power budget of future processor designs. In this paper we propose a novel ROB microarchitecture named CROB (Compressed ROB) that can compress ROB entries and therefore give the illusion of having a larger virtual ROB than the number of ROB entries. The performance study of CROB shows a substantial benefit, with an average speedup of 20% and 12% for a 128-entry and 256-entry ROB respectively. For some benchmark categories such as SpecFP2000, speedups rise to 30%.


international parallel and distributed processing symposium | 2008

Efficient resources assignment schemes for clustered multithreaded processors

Fernando Latorre; José González; Antonio González

New feature sizes provide a larger number of transistors per chip that architects could use in order to further exploit instruction-level parallelism. However, these technologies also bring new challenges that complicate conventional monolithic processor designs. On the one hand, exploiting instruction-level parallelism is leading to diminishing returns, and therefore exploiting other sources of parallelism such as thread-level parallelism is needed in order to keep raising performance with a reasonable hardware complexity. On the other hand, clustering architectures have been widely studied in order to reduce the inherent complexity of current monolithic processors. This paper studies the synergies and trade-offs between two concepts, clustering and simultaneous multithreading (SMT), in order to understand the reasons why conventional SMT resource assignment schemes are not so effective in clustered processors. These trade-offs are used to propose a novel resource assignment scheme that achieves an average speedup of 17.6% over Icount while improving fairness by 24%.


memory performance dealing with applications systems and architecture | 2007

Building a large instruction window through ROB compression

Fernando Latorre; Grigorios Magklis; Jose Gonzalez; Pedro Chaparro; Antonio González

Current processors require a large number of in-flight instructions in order to look for further parallelism and hide the increasing gap between memory latency and processor cycle time. These in-flight instructions are typically stored in a centralized structure called the reorder buffer (ROB), which is central to handling precise exceptions and recovering a safe state in the event of a branch misprediction. However, this structure is becoming so big that it is difficult to fit it in the power budget of future processor designs. In this paper we propose a novel ROB microarchitecture named CROB (Compressed ROB) that can compress ROB entries and therefore give the illusion of having a larger virtual ROB than the number of ROB entries. The performance study of CROB shows a substantial benefit, with an average speedup of 20% and 12% for a 128-entry and 256-entry ROB respectively. For some benchmark categories such as SpecFP2000, speedups rise to 30%.
