Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Enric Gibert is active.

Publication


Featured research published by Enric Gibert.


international symposium on computer architecture | 2009

Boosting single-thread performance in multi-core systems through fine-grain multi-threading

Carlos Madriles; Pedro Lopez; Josep M. Codina; Enric Gibert; Fernando Latorre; Alejandro Martínez; Raúl Martínez; Antonio González

Industry has shifted towards multi-core designs as we have hit the memory and power walls. However, single-thread performance remains of paramount importance, since some applications have limited thread-level parallelism (TLP), and even a small portion of code with limited TLP imposes important constraints on global performance, as explained by Amdahl's law. In this paper we propose a novel approach for leveraging multiple cores to improve single-thread performance in a multi-core design. The proposed technique features a set of novel hardware mechanisms that support the execution of threads generated at compile time. These threads result from a fine-grain speculative decomposition of the original application, and they are executed on a modified multi-core system that includes: (1) mechanisms to support multiple versions; (2) mechanisms to detect violations among threads; (3) mechanisms to reconstruct the original sequential order; and (4) mechanisms to checkpoint the architectural state and recover from misspeculations. The proposed scheme outperforms previous hardware-only schemes for combining cores to execute single-thread applications in a multi-core design by more than 10% on average on Spec2006 for all configurations. Moreover, single-thread performance is improved by 41% on average when the proposed scheme is used on a Tiny Core, and by up to 2.6x for some selected applications.
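The mechanisms listed above are hardware, but the violation-detection idea can be outlined in a few lines. Below is a minimal sketch, assuming each memory access carries its original sequential-program order (seq); the class and names are illustrative, not the paper's design.

```python
# Toy model of cross-thread memory violation detection for speculative
# decomposition: a write must not arrive after a later-ordered read of
# the same address has already consumed a stale value.

class VersionedMemory:
    def __init__(self):
        self.reads = {}  # addr -> seq numbers of instructions that read it

    def read(self, addr, seq):
        self.reads.setdefault(addr, []).append(seq)

    def write(self, addr, seq):
        # Violation: an instruction later in sequential order already read
        # this address, so it saw a stale value.
        if any(r > seq for r in self.reads.get(addr, [])):
            raise RuntimeError(f"misspeculation at {hex(addr)}: roll back to checkpoint")

mem = VersionedMemory()
mem.read(0x100, seq=7)       # a speculative thread reads ahead of order
try:
    mem.write(0x100, seq=3)  # an earlier-ordered write arrives late
except RuntimeError as e:
    print(e)                 # would trigger recovery from the last checkpoint
```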


international symposium on microarchitecture | 2002

Effective instruction scheduling techniques for an interleaved cache clustered VLIW processor

Enric Gibert; F. Jesús Sánchez; Antonio González

Clustering is a common technique to overcome the wire-delay problem incurred by the evolution of technology. Fully distributed architectures, where the register file, the functional units, and the data cache are partitioned, are particularly effective at dealing with these constraints and, in addition, are very scalable. In this paper, effective instruction scheduling techniques for a clustered VLIW processor with a word-interleaved cache are proposed. Such scheduling techniques rely on: (i) loop unrolling and variable alignment to increase the percentage of local accesses, (ii) a latency assignment process to schedule memory operations with an appropriate latency, and (iii) different heuristics to assign instructions to clusters. In particular, the number of local accesses is increased by more than 25% when these techniques are used, and the ratio of stall time to compute time is small. Next, the main source of remote accesses and stall time is investigated. Stall time is mainly due to remote hits, and Attraction Buffers are used to increase local accesses and reduce stall time. Stall time is reduced by 29% or 34%, depending on the scheduling heuristic. IPC results for a word-interleaved cache clustered VLIW processor are similar to those of the multiVLIW (a cache-coherent clustered processor with a more complex hardware design), and are 10% and 5% better (depending on the scheduling heuristic) than the IPC of a clustered processor with a unified cache.
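The role of loop unrolling and variable alignment is easiest to see by computing where a word-interleaved cache maps each address. A small sketch, assuming 4 clusters and 4-byte words (illustrative parameters, not the paper's configuration):

```python
# In a word-interleaved cache, consecutive words live in consecutive
# clusters, so a[i] and a[i + NUM_CLUSTERS] share a home cluster.

NUM_CLUSTERS = 4
WORD_BYTES = 4

def home_cluster(addr):
    return (addr // WORD_BYTES) % NUM_CLUSTERS

base = 0x1000  # an assumed word-aligned array a[]
for i in range(8):
    print(f"a[{i}] -> cluster {home_cluster(base + i * WORD_BYTES)}")

# Since a[i], a[i+4], a[i+8], ... map to the same cluster, unrolling the
# loop 4 times lets the scheduler place each unrolled copy on the cluster
# that holds its data, turning remote accesses into local ones.
```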


international symposium on microarchitecture | 2003

Flexible compiler-managed L0 buffers for clustered VLIW processors

Enric Gibert; F. Jesús Sánchez; Antonio González

Wire delays are a major concern for current and forthcoming processors. One approach to attack this problem is to divide the processor into semi-independent units referred to as clusters. A cluster usually consists of a local register file and a subset of the functional units, while the data cache remains centralized. However, as technology evolves, the latency of such a centralized cache increases, leading to an important performance impact. In this paper, we propose to include flexible low-latency buffers in each cluster in order to reduce the performance impact of higher cache latencies. The reduced number of entries in each buffer permits the design of flexible ways to map data from L1 to these buffers. The proposed L0 buffers are managed by the compiler, which is responsible for deciding which memory instructions make use of them. Effective instruction scheduling techniques are proposed to generate code that exploits these buffers. Results for the Mediabench benchmark suite show that the performance of a clustered VLIW processor with a unified L1 data cache is improved by 16% when such buffers are used. In addition, the proposed architecture also shows significant advantages over both MultiVLIW processors and clustered processors with a word-interleaved cache, two state-of-the-art designs with a distributed L1 data cache.
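As a rough illustration of a compile-time policy for such buffers, the sketch below greedily assigns the most frequently executed loads to the few L0 entries; the heuristic and the numbers are assumptions for illustration, not the paper's algorithm.

```python
# Greedy selection of which loads should be scheduled with the short
# L0-buffer latency: spend the few entries on the hottest loads.

L0_ENTRIES = 4

loads = [  # (name, estimated execution frequency from profiling)
    ("load_a", 1000), ("load_b", 950), ("load_c", 40),
    ("load_d", 900), ("load_e", 30), ("load_f", 870),
]

chosen = sorted(loads, key=lambda l: l[1], reverse=True)[:L0_ENTRIES]
for name, freq in chosen:
    print(f"{name}: schedule with L0 latency (freq={freq})")
# Remaining loads keep the full L1 latency in the schedule.
```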


international conference on supercomputing | 2002

An interleaved cache clustered VLIW processor

Enric Gibert; F. Jesús Sánchez; Antonio González

Clustered microarchitectures are becoming a common organization due to their potential to reduce the penalties caused by wire delays and power consumption. Fully distributed architectures are particularly effective at dealing with these constraints, and in addition they are very scalable. However, the distribution of the data cache memory poses a significant challenge and may be critical for performance. In this work, a distributed data cache VLIW architecture based on an interleaved cache organization is proposed, along with cyclic scheduling techniques. Moreover, the use of Attraction Buffers for such an architecture is introduced. Attraction Buffers are a novel hardware mechanism to increase the percentage of local accesses; the idea is to allow the movement of some data towards the clusters that need it. Performance results for 9 Mediabench benchmarks show that our scheduling techniques are able to hide the increased memory latency when accessing data mapped in a remote cluster. In addition, the local hit ratio is increased by 15% and stall time is reduced by 30% when using the same scheduling techniques on an interleaved cache clustered processor with Attraction Buffers. Finally, the proposed architecture is compared with a state-of-the-art distributed architecture, the multiVLIW. Results show that the performance of an interleaved cache clustered VLIW processor with Attraction Buffers is similar to that of the multiVLIW architecture, whereas the former has lower hardware complexity.
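The Attraction Buffer idea can be modeled as a tiny per-cluster structure that captures a word on the first remote access so later accesses hit locally. A toy sketch with an assumed size and LRU policy (illustrative only):

```python
# Toy Attraction Buffer: pay the remote latency once, then serve the word
# locally until it is evicted.

from collections import OrderedDict

class AttractionBuffer:
    def __init__(self, entries=4):
        self.entries = entries
        self.buf = OrderedDict()  # addr -> word, in LRU order

    def access(self, addr, remote_fetch):
        if addr in self.buf:
            self.buf.move_to_end(addr)       # refresh LRU position
            return "local hit", self.buf[addr]
        word = remote_fetch(addr)            # remote access, paid once
        if len(self.buf) >= self.entries:
            self.buf.popitem(last=False)     # evict least recently used
        self.buf[addr] = word
        return "remote access", word

ab = AttractionBuffer()
fetch = lambda a: a ^ 0xFF                   # stand-in for a remote cluster
print(ab.access(0x40, fetch))                # ('remote access', ...)
print(ab.access(0x40, fetch))                # ('local hit', ...)
```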


international conference on parallel architectures and compilation techniques | 2009

Anaphase: A Fine-Grain Thread Decomposition Scheme for Speculative Multithreading

Carlos Madriles; Pedro Lopez; Josep M. Codina; Enric Gibert; Fernando Latorre; Alejandro Martínez; Raúl Martínez; Antonio González

Industry is moving towards multi-core designs as we have hit the memory and power walls. Multi-core designs are very effective at exploiting thread-level parallelism (TLP), but they do not provide benefits when executing serial code (applications with low TLP, serial parts of a parallel application, and legacy code). In this paper we propose Anaphase, a novel approach to speculative multithreading for improving single-thread performance in a multi-core design. The proposed technique is based on a graph partitioning algorithm that decomposes applications into speculative threads at instruction granularity. Moreover, the proposed technique leverages communications and pre-computation slices to deal with inter-thread dependences. Results presented in this paper show that this approach improves single-thread performance by 32% on average, and by up to 2.15x for some selected applications of the Spec2006 suite. In addition, the proposed technique outperforms, by 21% on average, schemes in which thread decomposition is performed at a coarser granularity.
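To make the graph-partitioning formulation concrete, the sketch below splits a small dependence graph into two threads while counting inter-thread dependences (the edges that would need communications or pre-computation slices). This is a simple greedy pass for illustration, not Anaphase's actual partitioner, and it omits the workload balancing a real decomposer needs.

```python
# Instruction-granularity decomposition as graph partitioning: assign each
# instruction to thread 0 or 1, minimizing cut edges (inter-thread deps).

edges = [("i0", "i1"), ("i0", "i2"), ("i1", "i3"),
         ("i2", "i4"), ("i3", "i5"), ("i4", "i5")]
nodes = sorted({n for e in edges for n in e})

def cut_size(part):
    return sum(part[a] != part[b] for a, b in edges)

part = {n: i % 2 for i, n in enumerate(nodes)}  # naive alternating split
best = cut_size(part)
improved = True
while improved:
    improved = False
    for n in nodes:
        part[n] ^= 1                  # tentatively move n to the other thread
        c = cut_size(part)
        if c < best:
            best, improved = c, True  # keep the move
        else:
            part[n] ^= 1              # revert

print(part, "inter-thread dependences:", best)
```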


IEEE Transactions on Computers | 2005

Distributed data cache designs for clustered VLIW processors

Enric Gibert; F. Jesús Sánchez; Antonio González

Wire delays are a major concern for current and forthcoming processors. One approach to deal with this problem is to divide the processor into semi-independent units referred to as clusters. A cluster usually consists of a local register file and a subset of the functional units, while the L1 data cache typically remains centralized in what we call partially distributed architectures. However, as technology evolves, the relative latency of such a centralized cache will increase, leading to an important impact on performance. In this paper, we propose partitioning the L1 data cache among clusters for clustered VLIW processors. We refer to this kind of design as fully distributed processors. In particular, we propose and evaluate three different configurations: a snoop-based cache coherence scheme, a word-interleaved cache, and flexible L0 buffers managed by the compiler. For each alternative, instruction scheduling techniques targeted to cyclic code are developed. Results for the Mediabench suite show that the performance of such fully distributed architectures is always better than the performance of a partially distributed one with the same amount of resources. In addition, the key aspects of each fully distributed configuration are explored.
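Of the three configurations, the snoop-based scheme is the only one not sketched above. Below is a toy valid/invalid model of snooping, far simpler than a real coherence protocol; its structure is assumed for illustration, not taken from the paper.

```python
# Minimal snoop sketch: on a write, a cluster broadcasts an invalidation
# on the shared bus so other clusters drop their copies.

class SnoopyCache:
    def __init__(self, cluster_id, bus):
        self.id, self.data, self.bus = cluster_id, {}, bus
        bus.append(self)

    def write(self, addr, value):
        for c in self.bus:              # snoop: invalidate other copies
            if c is not self:
                c.data.pop(addr, None)
        self.data[addr] = value

    def read(self, addr, mem):
        return self.data.setdefault(addr, mem.get(addr, 0))

bus, mem = [], {0x10: 5}
c0, c1 = SnoopyCache(0, bus), SnoopyCache(1, bus)
print(c1.read(0x10, mem))   # 5, now cached in cluster 1
c0.write(0x10, 9)           # invalidates cluster 1's copy
print(0x10 in c1.data)      # False
```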


virtual execution environments | 2012

DDGacc: boosting dynamic DDG-based binary optimizations through specialized hardware support

Demos Pavlou; Enric Gibert; Fernando Latorre; Antonio González

Dynamic Binary Translators (DBT) and Dynamic Binary Optimization (DBO) by software are widely used for several reasons, including performance, design simplification, and virtualization. However, the software layer in such systems introduces non-negligible overheads which affect performance and user experience. Hence, reducing DBT/DBO overheads is of paramount importance. In addition, reduced overheads have interesting collateral effects in the rest of the software layer, such as allowing optimizations to be applied earlier. A cost-effective solution to this problem is to provide hardware support to speed up the primitives of the software layer, paying special attention to automating DBT/DBO mechanisms while leaving the heuristics to the software, which is more flexible. In this work, we have characterized the overheads of a DBO system using DynamoRIO implementing several basic optimizations. We have seen that the computation of the Data Dependence Graph (DDG) accounts for 5%-10% of the execution time. For this reason, we propose to add hardware support for this task in the form of a new functional unit, called DDGacc, which is integrated in a conventional pipeline processor and is operated through new ISA instructions. Our evaluation shows that DDGacc reduces the cost of computing the DDG by 32x, which reduces overall execution time by 5%-10% on average and up to 18% for applications where the DBO optimizes large code footprints.
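The DDG computation that DDGacc accelerates amounts to linking every register use to its most recent definition within a code region. A minimal sketch, using an assumed (dest, src1, src2) instruction format rather than DynamoRIO's IR:

```python
# Build a DDG over a straight-line region: one pass, tracking the last
# instruction that defined each register and adding RAW edges.

block = [
    ("r1", "r2", "r3"),  # 0: r1 = r2 op r3
    ("r4", "r1", "r3"),  # 1: r4 = r1 op r3
    ("r1", "r4", "r4"),  # 2: r1 = r4 op r4
]

last_def, edges = {}, []
for i, (dest, *srcs) in enumerate(block):
    for s in srcs:
        if s in last_def:
            edges.append((last_def[s], i))  # flow (RAW) dependence
    last_def[dest] = i

print(sorted(set(edges)))  # [(0, 1), (1, 2)]
```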


symposium on code generation and optimization | 2003

Local scheduling techniques for memory coherence in a clustered VLIW processor with a distributed data cache

Enric Gibert; F. Jesús Sánchez; Antonio González

Clustering is a common technique to deal with wire delays. Fully distributed architectures, where the register file, the functional units, and the cache memory are partitioned, are particularly effective at dealing with these constraints, and in addition they are very scalable. However, the distribution of the data cache introduces a new problem: memory instructions may reach the cache in an order different from the sequential program order, thus possibly violating memory coherence. In this paper, two local scheduling mechanisms that guarantee the serialization of aliased memory instructions are proposed and evaluated: the construction of memory dependence chains (the MDC solution), and two transformations (store replication and load-store synchronization) applied to the original data dependence graph (the DDGT solution). These solutions do not require any extra hardware. The proposed scheduling techniques are evaluated for a word-interleaved cache clustered VLIW processor (although they can also be used for any other distributed cache configuration). Results for the Mediabench benchmark suite demonstrate the effectiveness of such techniques. In particular, the DDGT solution increases the proportion of local accesses by 16% compared to MDC, and stall time is reduced by 32% since load instructions can be freely scheduled in any cluster. However, the MDC solution reduces compute time and often outperforms the former. Finally, the impact of both techniques on an architecture with Attraction Buffers is studied and evaluated.
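A sketch of the dependence-chain idea: add scheduling edges so that potentially aliased memory instructions, at least one of which is a store, are forced into sequential order. The may_alias check stands in for the compiler's alias analysis; everything here is illustrative, not the paper's exact construction.

```python
# Serialize aliased memory operations with explicit scheduling edges.

mem_ops = ["st A", "ld B", "st A", "ld A"]  # in sequential program order

def may_alias(x, y):                 # toy analysis: same symbolic address
    return x.split()[1] == y.split()[1]

chain_edges = []
for j in range(len(mem_ops)):
    for i in range(j):
        has_store = "st" in (mem_ops[i][:2], mem_ops[j][:2])
        if has_store and may_alias(mem_ops[i], mem_ops[j]):
            chain_edges.append((i, j))  # scheduling edge: i must precede j

print(chain_edges)  # [(0, 2), (0, 3), (2, 3)]
# ld B carries no edges, so it can be scheduled freely in any cluster.
```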


architectural support for programming languages and operating systems | 2014

Speculative hardware/software co-designed floating-point multiply-add fusion

Marc Lupon; Enric Gibert; Grigorios Magklis; Sridhar Samudrala; Raúl Martínez; Kyriakos Stavrou; David R. Ditzel

A Fused Multiply-Add (FMA) instruction is currently available in many general-purpose processors. It increases performance by reducing the latency of dependent operations, and it increases precision by computing the result as an indivisible operation with no intermediate rounding. However, since the arithmetic behavior of a single-rounding FMA operation is different from that of an independent FP multiply followed by an FP add instruction, some algorithms require significant revalidation and rewriting efforts to work as expected when they are compiled to operate with FMA, a cost that developers may not be willing to pay. Because of that, abundant legacy applications are not able to utilize FMA instructions. In this paper we propose a novel HW/SW collaborative technique that is able to efficiently execute workloads with increased utilization of FMA by adding the option to obtain the same numerical result as separate FP multiply and FP add pairs. In particular, we extend the host ISA of a HW/SW co-designed processor with a new Combined Multiply-Add (CMA) instruction that performs an FMA operation with an intermediate rounding. This new instruction is used by a transparent dynamic translation software layer that applies a speculative instruction-fusion optimization to transform FP multiply and FP add sequences into CMA instructions. The FMA unit has been slightly modified to support both single-rounding and double-rounding fused instructions without increasing their latency and to provide a conservative fall-back path in case of misspeculation. Evaluation on a cycle-accurate timing simulator showed that CMA improved SPECfp performance by 6.3% and reduced executed instructions by 4.7%.
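The numerical gap that CMA bridges is easy to demonstrate: a single-rounding fused multiply-add and a separately rounded multiply then add can produce different results. The sketch below emulates single rounding with exact rational arithmetic; the inputs are chosen only to expose the difference.

```python
# Single rounding (FMA-style) vs. double rounding (mul, then add).

from fractions import Fraction

a = b = 1.0 + 2**-30
c = -1.0

# Two roundings: the multiply rounds away the 2**-60 term before the add.
two_round = a * b + c

# One rounding: compute a*b + c exactly, then round once to a double.
one_round = float(Fraction(a) * Fraction(b) + Fraction(c))

print(two_round)               # 2**-29
print(one_round)               # 2**-29 + 2**-60
print(two_round == one_round)  # False: the fused result keeps the low term
```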


symposium on code generation and optimization | 2014

Warm-Up Simulation Methodology for HW/SW Co-Designed Processors

Aleksandar Branković; Kyriakos Stavrou; Enric Gibert; Antonio González

Evaluation techniques in microprocessor design are mostly based on simulating selected application samples using a cycle-accurate simulator. In order to achieve accurate results, microarchitectural structures are warmed up for a few million instructions prior to statistics collection. Unfortunately, this strategy cannot be applied to HW/SW co-designed processors, in which a Transparent Optimization Layer (TOL) translates and optimizes code on-the-fly from a guest ISA to an internal host custom microarchitecture. We show that the warm-up period in this case needs to be 3-4 orders of magnitude longer than what is needed for traditional microprocessor designs, because the TOL state needs to be warmed up as well. In this paper, we propose a novel simulation technique for HW/SW co-designed processors based on adapting the optimization promotion thresholds using high-level application statistics, in order to find the best trade-off between accuracy and simulation cost. In particular, the proposed technique reduces the simulation cost by 65x with an average error of just 0.75%. Furthermore, as opposed to other alternatives, the proposed technique satisfies the additional requirement of allowing evaluation using different TOL and microarchitectural configurations.
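One way to picture adapting promotion thresholds to a short sample: scale the threshold by the fraction of a region's executions that the sample actually contains, so hot regions still reach the optimized translation. The scaling rule below is an assumption for illustration, not the paper's exact model.

```python
# Scale a TOL promotion threshold to a short simulation sample.

FULL_THRESHOLD = 10_000  # executions before a region is promoted/optimized

def scaled_threshold(full_run_execs, sample_execs):
    # Keep the same fraction of executions before promotion.
    frac = sample_execs / max(full_run_execs, 1)
    return max(1, int(FULL_THRESHOLD * frac))

# A region that runs 2M times in the full app but only 50K in the sample
# gets a proportionally lower threshold, so it is still promoted in time.
print(scaled_threshold(2_000_000, 50_000))  # 250
```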

Collaboration


Dive into Enric Gibert's collaborations.

Top Co-Authors

Antonio González, Polytechnic University of Catalonia
F. Jesús Sánchez, Polytechnic University of Catalonia
Aleksandar Branković, Polytechnic University of Catalonia
Enric Herrero, Polytechnic University of Catalonia