Raúl Martínez
Intel
Publication
Featured research published by Raúl Martínez.
international symposium on computer architecture | 2009
Carlos Madriles; Pedro Lopez; Josep M. Codina; Enric Gibert; Fernando Latorre; Alejandro Martínez; Raúl Martínez; Antonio González
Industry has shifted towards multi-core designs as we have hit the memory and power walls. However, single-thread performance remains of paramount importance, since some applications have limited thread-level parallelism (TLP), and even a small part with limited TLP imposes important constraints on global performance, as explained by Amdahl's law. In this paper we propose a novel approach for leveraging multiple cores to improve single-thread performance in a multi-core design. The proposed technique features a set of novel hardware mechanisms that support the execution of threads generated at compile time. These threads result from a fine-grain speculative decomposition of the original application and they are executed under a modified multi-core system that includes: (1) mechanisms to support multiple versions; (2) mechanisms to detect violations among threads; (3) mechanisms to reconstruct the original sequential order; and (4) mechanisms to checkpoint the architectural state and recover from misspeculations. The proposed scheme outperforms previous hardware-only schemes that combine cores to execute single-thread applications in a multi-core design by more than 10% on average on Spec2006 for all configurations. Moreover, single-thread performance is improved by 41% on average when the proposed scheme is used on a Tiny Core, and by up to 2.6x for some selected applications.
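The Amdahl's law argument in the abstract above can be illustrated with a short sketch. The function name and the example fractions are illustrative assumptions, not values taken from the paper.

```python
def amdahl_speedup(parallel_fraction: float, speedup_of_parallel_part: float) -> float:
    """Overall speedup when only `parallel_fraction` of the run time is accelerated.

    Amdahl's law: S = 1 / ((1 - p) + p / s)
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if speedup_of_parallel_part <= 0.0:
        raise ValueError("speedup_of_parallel_part must be positive")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / speedup_of_parallel_part)


def amdahl_limit(parallel_fraction: float) -> float:
    """Upper bound on the speedup as the parallel part becomes infinitely fast."""
    return 1.0 / (1.0 - parallel_fraction)


# Hypothetical workload: 90% of the time parallelizes perfectly over 16 cores.
example = amdahl_speedup(0.9, 16.0)
```

With these assumed numbers the 10% serial part caps the achievable speedup at 10x no matter how many cores are added, which is why speeding up the serial part itself matters.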
international conference on parallel architectures and compilation techniques | 2009
Carlos Madriles; Pedro Lopez; Josep M. Codina; Enric Gibert; Fernando Latorre; Alejandro Martínez; Raúl Martínez; Antonio González
Industry is moving towards multi-core designs as we have hit the memory and power walls. Multi-core designs are very effective at exploiting thread-level parallelism (TLP) but do not provide benefits when executing serial code (applications with low TLP, serial parts of a parallel application, and legacy code). In this paper we propose Anaphase, a novel approach for speculative multithreading to improve single-thread performance in a multi-core design. The proposed technique is based on a graph partitioning technique that decomposes applications into speculative threads at instruction granularity. Moreover, the proposed technique leverages communications and pre-computation slices to deal with inter-thread dependences. Results presented in this paper show that this approach improves single-thread performance by 32% on average and by up to 2.15x for some selected applications of the Spec2006 suite. In addition, the proposed technique outperforms schemes in which thread decomposition is performed at a coarser granularity by 21% on average.
international conference on parallel and distributed systems | 2006
Raúl Martínez; Francisco José Alfaro; José L. Sánchez
Advanced switching (AS) is a new fabric-interconnect technology that provides the advanced features of existing proprietary fabrics in an open standard. AS is intended to proliferate in multiprocessor, storage, networking, server, and embedded platform environments. The provision of quality of service (QoS) in computing and communication environments is currently the focus of much discussion and research in industry and academia. AS provides some mechanisms that, when used correctly, allow us to provide QoS. In this paper, we examine these mechanisms and show how to provide QoS based on bandwidth and latency requirements. Furthermore, we propose a new algorithm based on the self-clocked weighted fair queuing (SCFQ) algorithm, which we call SCFQ credit aware (SCFQ-CA), as an implementation of the AS minimum bandwidth egress link scheduler. Finally, we show that the AS table-based scheduler does not work properly with variable packet sizes, and we propose a modification of the table scheduler, based on the deficit table (DTable) scheduler, to solve this drawback.
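For readers unfamiliar with self-clocked fair queuing, the sketch below shows the textbook SCFQ tagging rule only. It is not the SCFQ-CA algorithm from the paper: it ignores AS credit-based flow control entirely, and the class name, method names, and weights are assumptions made for illustration.

```python
import heapq


class SCFQ:
    """Plain self-clocked fair queuing (no credit awareness).

    Each packet gets a finish tag F = max(F_prev_of_flow, v) + size / weight,
    where v is the tag of the packet most recently put into service.
    Packets are served in increasing tag order.
    """

    def __init__(self, weights):
        self.weights = dict(weights)            # flow id -> relative weight
        self.virtual_time = 0.0                 # tag of the packet in service
        self.last_tag = {flow: 0.0 for flow in self.weights}
        self._heap = []                         # (tag, arrival sequence, flow, size)
        self._seq = 0                           # breaks ties by arrival order

    def enqueue(self, flow, size):
        start = max(self.last_tag[flow], self.virtual_time)
        tag = start + size / self.weights[flow]
        self.last_tag[flow] = tag
        heapq.heappush(self._heap, (tag, self._seq, flow, size))
        self._seq += 1

    def dequeue(self):
        if not self._heap:
            return None
        tag, _, flow, size = heapq.heappop(self._heap)
        self.virtual_time = tag                 # the "self-clock" update
        return flow, size
```

In this simplified form, a flow with twice the weight of another receives roughly twice the bandwidth while both are backlogged. A complete implementation would also reset the virtual time when the link goes idle.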
international conference on parallel processing | 2006
Raúl Martínez; Francisco José Alfaro; José L. Sánchez
The provision of quality of service (QoS) in computing and communication environments is currently the focus of much discussion and research in industry and academia. A key component for networks with QoS support is the output scheduling algorithm. Some of the latest network technology proposals define scheduling algorithms that use an arbitration table to select the next packet to be transmitted. These table-based schedulers are simple to implement and can offer good latency performance. However, the versions proposed until now do not work properly with variable packet sizes. Moreover, they face the problem of bounding the bandwidth and latency assignments. In this paper, we propose a new table-based scheduler, which we call deficit table (DTable), that works properly with variable packet sizes. We also propose a methodology to decouple the bandwidth and latency assignments.
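The general idea of combining an arbitration table with a deficit counter can be sketched as follows. This is an assumption-laden illustration in the spirit of deficit round robin, not the DTable algorithm as specified in the paper; the function name, the table layout, and the byte weights are all hypothetical.

```python
from collections import deque


def table_deficit_schedule(table, queues, rounds=1):
    """Cycle through an arbitration table, carrying unused allowance forward.

    table:  list of (flow_id, weight_in_bytes) entries, visited in order
    queues: dict mapping flow_id -> deque of packet sizes in bytes
    Returns the list of (flow_id, packet_size) in transmission order.
    """
    deficit = {flow: 0 for flow, _ in table}
    sent = []
    for _ in range(rounds):
        for flow, weight in table:
            queue = queues[flow]
            if not queue:
                deficit[flow] = 0          # idle flows do not hoard allowance
                continue
            deficit[flow] += weight
            # Send whole packets while the accumulated allowance covers them.
            while queue and queue[0] <= deficit[flow]:
                size = queue.popleft()
                deficit[flow] -= size
                sent.append((flow, size))
            if not queue:
                deficit[flow] = 0
    return sent


# Hypothetical example: two flows with equal weights but different packet sizes.
demo_queues = {"A": deque([150, 50]), "B": deque([60, 60])}
demo_order = table_deficit_schedule([("A", 100), ("B", 100)], demo_queues, rounds=2)
```

Because a packet larger than one table entry's weight simply waits until enough allowance has accumulated, variable packet sizes no longer distort the bandwidth share each flow receives.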
parallel, distributed and network-based processing | 2015
Sergi Abadal; Albert Mestres; Raúl Martínez; Eduard Alarcón; Albert Cabellos-Aparicio
The scalability of Network-on-Chip (NoC) designs has become a rising concern as we enter the many-core era. Multicast support represents a particular yet relevant case within this context and has been the focus of different research efforts, mainly due to the poor performance of NoCs in the presence of this increasingly important type of traffic. However, most of the proposed schemes have been evaluated using synthetic traffic or within a full system, which is either unrealistic or costly. While traffic models would allow a better assessment of their performance, existing proposals do not distinguish between unicast and multicast flows and are often bound to a given number of cores. In this paper, a trace-based multicast traffic characterization is presented with the aim of providing guidelines for the modeling of multicast communications in many-core settings. To this end, the scaling trends of aspects such as the multicast traffic intensity or the spatiotemporal injection distribution are analyzed. The novelty of this work lies both in its scalability-oriented approach and in the use of correlation metrics to evaluate potential prediction opportunities.
architectural support for programming languages and operating systems | 2014
Marc Lupon; Enric Gibert; Grigorios Magklis; Sridhar Samudrala; Raúl Martínez; Kyriakos Stavrou; David R. Ditzel
A Fused Multiply-Add (FMA) instruction is currently available in many general-purpose processors. It increases performance by reducing the latency of dependent operations and increases precision by computing the result as an indivisible operation with no intermediate rounding. However, since the arithmetic behavior of a single-rounding FMA operation differs from that of an independent FP multiply followed by an FP add, some algorithms require significant revalidation and rewriting efforts to work as expected when they are compiled to operate with FMA, a cost that developers may not be willing to pay. As a result, many legacy applications are not able to utilize FMA instructions. In this paper we propose a novel HW/SW collaborative technique that is able to efficiently execute workloads with increased utilization of FMA, by adding the option to get the same numerical result as separate FP multiply and FP add pairs. In particular, we extended the host ISA of a HW/SW co-designed processor with a new Combined Multiply-Add (CMA) instruction that performs an FMA operation with an intermediate rounding. This new instruction is used by a transparent dynamic translation software layer that uses a speculative instruction-fusion optimization to transform FP multiply and FP add sequences into CMA instructions. The FMA unit has been slightly modified to support both single-rounding and double-rounding fused instructions without increasing their latency and to provide a conservative fall-back path in case of misspeculation. Evaluation on a cycle-accurate timing simulator showed that CMA improved SPECfp performance by 6.3% and reduced executed instructions by 4.7%.
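The numerical difference between single rounding and double rounding that motivates CMA can be reproduced in software. The sketch below emulates a single-rounding FMA with exact rational arithmetic and compares it with an ordinary multiply followed by an add. The function names and operand values are assumptions chosen for illustration, and the code assumes Python floats are IEEE 754 doubles evaluated without extended precision.

```python
from fractions import Fraction


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulate a single-rounding FMA: compute a*b + c exactly, then round once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)                 # one rounding, to the nearest double


def separate_multiply_add(a: float, b: float, c: float) -> float:
    """Multiply, round, then add, round: the result a CMA-style operation must match."""
    product = a * b                     # first rounding
    return product + c                  # second rounding


# Hypothetical operands: the exact product is 1 - 2**-60, which is not
# representable as a double, so the separate multiply loses it.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

single_rounding = fused_multiply_add(a, b, c)
double_rounding = separate_multiply_add(a, b, c)
```

For these operands the two paths disagree: the separate sequence cancels to zero, while the fused form keeps the small residual. Code validated against one behavior can therefore change results under the other, which is the compatibility problem the abstract describes.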
IEEE Transactions on Parallel and Distributed Systems | 2008
Raúl Martínez; Francisco José Alfaro; José L. Sánchez
Advanced switching (AS) is a network technology that expands the capabilities of PCI Express, adding new features such as peer-to-peer communication. Together, PCI Express and AS have the potential for building the next generation of interconnects. Furthermore, the provision of quality of service (QoS) in computing and communication environments is currently the focus of much discussion and research in industry and academia. In this paper we propose a framework to provide QoS based on bandwidth, latency, and jitter over AS, employing the mechanisms provided by AS. We also present several implementations of the output scheduling mechanism. Finally, we evaluate our proposals by simulation, comparing the performance of the schedulers that we propose and their implementation complexity.
network computing and applications | 2006
Raúl Martínez; Francisco José Alfaro; José L. Sánchez
Advanced switching (AS) is a new fabric-interconnect technology that further enhances the capabilities of PCI Express, which is the next PCI generation. Meanwhile, the provision of quality of service (QoS) in computing and communication environments is currently the focus of much discussion and research in industry and academia. One of the mechanisms that AS provides to support QoS is the minimum bandwidth egress link scheduler, or just MinBW scheduler. In this paper, we propose several implementations of the MinBW scheduler and compare their performance by simulation. These implementations fulfill all the properties that an AS MinBW scheduler must have, including the interaction with the AS link layer flow control.
Computers & Electrical Engineering | 2016
Sergi Abadal; Raúl Martínez; Josep Solé-Pareta; Eduard Alarcón; Albert Cabellos-Aparicio
Multicast traffic is characterized and modeled with an emphasis on scalability. Intensity, concentration and burstiness increase with the system size. Growing correlation suggests the use of prediction to optimize NoC designs. Simple multicast source predictors achieve modest but promising accuracies. The scalability of Network-on-Chip (NoC) designs has become a rising concern as we enter the manycore era. Multicast support represents a particular yet relevant case within this context, mainly due to the poor performance of NoCs in the presence of this type of traffic. Multicast techniques are typically evaluated using synthetic traffic or within a full system, which is either simplistic or costly, given the lack of realistic traffic models that distinguish between unicast and multicast flows. To bridge this gap, this paper presents a trace-based multicast traffic characterization, which explores the scaling trends of aspects such as the multicast intensity or the spatiotemporal injection distribution for different coherence schemes. This analysis is the basis upon which the concept of multicast source prediction is proposed, and upon which a multicast traffic model is built. Both aspects pave the way for the development and accurate evaluation of advanced NoCs in the context of manycore computing.
Journal of Systems Architecture | 2007
Alejandro Martínez; Raúl Martínez; Francisco José Alfaro; José L. Sánchez
Advanced Switching (AS) is an open-standard fabric-interconnect technology that is built over the same physical and link layers as PCI Express technology. Moreover, it includes an optimized transaction layer to enable essential communication capabilities, including protocol encapsulation, peer-to-peer communications, mechanisms to provide quality of service (QoS), enhanced fail-over, high availability, multicast communications, and congestion and system management. In this paper, we propose a strategy for using the AS resources that provides good performance and QoS support at a low cost. When the system is considered as a whole rather than each element being taken separately, it is possible to use only two virtual channels (VCs) at the switches to provide a service similar to that obtained with many more VCs. As a result, we obtain a noticeable reduction in silicon area and arbitration time. Our proposal is fully compatible with the AS specification and allows us to provide adequate performance both for typical multimedia applications and for best-effort traffic.