Mateo Valero Cortés
Barcelona Supercomputing Center
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Mateo Valero Cortés.
international conference on parallel architectures and compilation techniques | 2007
Miquel Moreto Planas; Francisco Javier Cazorla Almeida; Alejandro Ramírez Bellido; Mateo Valero Cortés
The limitation imposed by instruction-level parallelism (ILP) has motivated the use of thread-level parallelism (TLP) as a common strategy for improving processor performance. TLP paradigms such as simultaneous multithreading (SMT), chip multiprocessing (CMP) and combinations of both offer the opportunity to obtain higher throughputs. However, they also have to face the challenge of sharing resources of the architecture. Simply avoiding any resource control can lead to undesired situations where one thread is monopolizing all the resources and harming the other threads. Some studies deal with the resource sharing problem in SMTs at core level resources like issue queues, registers, etc. In CMPs, resource sharing is lower than in SMT, focusing in the cache hierarchy.
international symposium on computer architecture | 1985
José M. Llaberia Griñó; Mateo Valero Cortés; Enrique Herrada Lillo; Jesús José Labarta Mancho
Performance issues of a single-bus interconnection network for multiprocessor systems, operating in a multiplexed way, are presented in this paper. Several models are developed and used nto allow system performance evaluation. Comparisons with equivalent crossbar systems are provided. It is shown how crossbar EBW values can be reached and exceeded when appropriate operation parameters are chosen in a multiplexed nsingle-bus system. Another architectural feature is considered, concerning the utilization of buffers at the memory modules. With the buffering scheme, memory interference can be reduced so that the system performance is practically improved.
principles of advanced discrete simulation | 2018
Mateo Valero Cortés
In the last years the traditional ways to keep the increase of hardware performance to the rate predicted by the Moores Law vanished. When uni-cores were the norm, hardware design was decoupled from the software stack thanks to a well defined Instruction Set Architecture (ISA). This simple interface allowed developing applications without worrying too much about the underlying hardware, while computer architects proposed techniques to aggressively exploit Instruction-Level Parallelism (ILP) in superscalar processors. Current multi-cores are designed as simple symmetric multiprocessors on a chip. While these designs are able to compensate the clock frequency stagnation, they face multiple problems in terms of power consumption, programmability, resilience or memory. The solution is to give more responsibility to the runtime system and to let it tightly collaborate with the hardware. The runtime has to drive the design of future multi-cores architectures. In this talk, we introduce an approach towards a Runtime-Aware Architecture (RAA), a massively parallel architecture designed from the runtimes perspective. RAA aims at supporting the activity the parallel runtime system in three ways: First, to enable fine-grain tasking and support the opportunities it offers; second, to improve the performance of the memory subsystem by exposing hybrid hierarchies to the runtime system and, third, to improve performance by using vector units. During the talk, we will give a general overview of the problems RAA aims to solve and provide some examples of hardware components supporting the activity of the runtime system in the context of multi-core chips.
symposium on computer architecture and high performance computing | 2009
Carmelo Alexis Acosta Ojeda; Francisco Javier Cazorla Almeida; Alejandro Ramírez Bellido; Mateo Valero Cortés
State-of-the-art high-performance processors like the IBM POWER5 and Intel i7 show a trend in industry towards on-chip Multiprocessors (CMP) involving Simultaneous Multithreading (SMT) in each core. In these processors, the way in which applications are assigned to cores plays a key role in the performance of each application and the overall system performance. In this paper we show that the system throughput highly depends on the Thread to Core Assignment (TCA), regardless the SMT Instruction Fetch (IFetch) Policy implemented in the cores. Our results indicate that a good TCA can improve the results of any underlying IFetch Policy, yielding speedups of up to 28%. Given the relevance of TCA, we propose an algorithm to manage it in CMP+SMT processors. The proposed throughput-oriented TCA Algorithm takes into account the workload characteristics and the underlying SMT IFetch Policy. Our results show that the TCA Algorithm obtains thread-to-core assignments 3% close to the optimal assignation for each case, yielding system throughput improvements up to 21%.State-of-the-art high-performance processors like the IBM POWER5 and Intel i7 show a trend in industry towards on-chip Multiprocessors (CMP) involving Simultaneous Multithreading (SMT) in each core. In these processors, the way in which applications are assigned to cores plays a key role in the performance of each application and the overall system performance. In this paper we show that the system throughput highly depends on the Thread to Core Assignment (TCA), regardless the SMT Instruction Fetch (IFetch) Policy implemented in the cores. Our results indicate that a good TCA can improve the results of any underlying IFetch Policy, yielding speedups of up to 28%. Given the relevance of TCA, we propose an algorithm to manage it in CMP+SMT processors. The proposed throughput-oriented TCA Algorithm takes into account the workload characteristics and the underlying SMT IFetch Policy. Our results show that the TCA Algorithm obtains thread-to-core assignments 3% close to the optimal assignation for each case, yielding system throughput improvements up to 21%.
international conference on parallel architectures and compilation techniques | 2007
Tanausu Ramírez García; Manuel Alejandro Pajuelo González; Oliverio J. Santana Jaria; Mateo Valero Cortés
In this work, we propose Runahead threads as a valuable solution for both exploiting memory-level parallelism and reducing resource contention in simultaneous multithreaded processors.
ieee international symposium on workload characterization | 2006
Friman Sánchez Castaño; Esther Salamí San Juan; Alejandro Ramírez Bellido; Mateo Valero Cortés
Advances in molecular biology have led to a continued growth in the biological information generated by the scientific community. Additionally, this area has become a multi-disciplinary field, including components of mathematics, biology, chemistry, and computer science, generating several challenges in the scientific community from different points of view. For this reason, bioinformatic applications represent an increasingly important workload. However, even though the importance of this field is clear, common bioinformatic applications and their implications on micro-architectural design have not received enough attention from the computer architecture community. This paper presents a micro-architecture performance analysis of recognized bioinformatic applications for the comparison and alignment of biological sequences, including BLAST, FASTA and some recognized parallel implementations of the Smith-Waterman algorithm that use the Altivec SIMD extension to speed-up the performance. We adopt a simulation-based methodology to perform a detailed workload characterization. We analyze architectural and micro-architectural aspects like pipeline configurations, issue widths, functional unit mixes, memory hierarchy and their implications on the performance behavior. We have found that the memory subsystem is the component with more impact in the performance of the BLAST heuristic, the branch predictor is responsible for the major performance loss for FASTA and SSEARCH34, and long dependency chains are the limiting factor in the SIMD implementations of Smith-WatermanAdvances in molecular biology have led to a continued growth in the biological information generated by the scientific community. Additionally, this area has become a multi-disciplinary field, including components of mathematics, biology, chemistry, and computer science, generating several challenges in the scientific community from different points of view. For this reason, bioinformatic applications represent an increasingly important workload. However, even though the importance of this field is clear, common bioinformatic applications and their implications on micro-architectural design have not received enough attention from the computer architecture community. This paper presents a micro-architecture performance analysis of recognized bioinformatic applications for the comparison and alignment of biological sequences, including BLAST, FASTA and some recognized parallel implementations of the Smith-Waterman algorithm that use the Altivec SIMD extension to speed-up the performance. We adopt a simulation-based methodology to perform a detailed workload characterization. We analyze architectural and micro-architectural aspects like pipeline configurations, issue widths, functional unit mixes, memory hierarchy and their implications on the performance behavior. We have found that the memory subsystem is the component with more impact in the performance of the BLAST heuristic, the branch predictor is responsible for the major performance loss for FASTA and SSEARCH34, and long dependency chains are the limiting factor in the SIMD implementations of Smith-Waterman.
international symposium on performance analysis of systems and software | 2005
Friman Sánchez Castaño; Mauricio Álvarez Mesa; Esther Salamí San Juan; Alejandro Ramírez Bellido; Mateo Valero Cortés
SIMD extensions are the most common technique used in current processors for multimedia computing. In order to obtain more performance for emerging applications SIMD extensions need to be scaled. In this paper we perform a scalability analysis of SIMD extensions for multimedia applications. Scaling a 1-dimensional extension, like Intel MMX, was compared to scaling a 2-dimensional (matrix) extension. Evaluations have demonstrated that the 2-d architecture is able to use more parallel hardware than the 1-d extension. Speed-ups over a 2-way superscalar processor with MMX-like extension go up to 4X for kernels and up to 3.3X for complete applications and the matrix architecture can deliver, in some cases, more performance with simpler processor configurations. The experiments also show that the scaled matrix architecture is reaching the limits of the DLP available in the internal loops of common multimedia kernelsSIMD extensions are the most common technique used in current processors for multimedia computing. In order to obtain more performance for emerging applications SIMD extensions need to be scaled. In this paper we perform a scalability analysis of SIMD extensions for multimedia applications. Scaling a 1-dimensional extension, like Intel MMX, was compared to scaling a 2-dimensional (matrix) extension. Evaluations have demonstrated that the 2-d architecture is able to use more parallel hardware than the 1-d extension. Speed-ups over a 2-way superscalar processor with MMX-like extension go up to 4X for kernels and up to 3.3X for complete applications and the matrix architecture can deliver, in some cases, more performance with simpler processor configurations. The experiments also show that the scaled matrix architecture is reaching the limits of the DLP available in the internal loops of common multimedia kernels.
international symposium on performance analysis of systems and software | 2005
Raimir Holanda Filho; Javier Verdú Mulà; Jorge García Vidal; Mateo Valero Cortés
In this paper we study the properties of a new packet trace compression method based on clustering of TCP flows. With our proposed method, the compression ratio that we achieve is around 3%, reducing the file size, for instance, from 100 MB to 3 MB. Although this specification defines a lossy compressed data format, it preserves important statistical properties present into original trace. In order to validate the method, memory performance studies were done with the Radix Tree algorithm executing a trace generated by our method. To give support to these studies, measurements were taken of memory access and cache miss ratio. For the time, the results have showed that our proposed method provides a good solution for packet trace compressionIn this paper we study the properties of a new packet trace compression method based on clustering of TCP flows. With our proposed method, the compression ratio that we achieve is around 3%, reducing the file size, for instance, from 100 MB to 3 MB. Although this specification defines a lossy compressed data format, it preserves important statistical properties present into original trace. In order to validate the method, memory performance studies were done with the Radix Tree algorithm executing a trace generated by our method. To give support to these studies, measurements were taken of memory access and cache miss ratio. For the time, the results have showed that our proposed method provides a good solution for packet trace compression.
international symposium on microarchitecture | 2004
Francisco Javier Cazorla Almeida; Alejandro Ramírez Bellido; Mateo Valero Cortés; Enrique Fernandez Prieto
SMT processors increase performance by executing instructions from several threads simultaneously. These threads use the resources of the processor better by sharing them but, at the same time, threads are competing for these resources. The way critical resources are distributed among threads determines the final performance. Currently, processor resources are distributed among threads as determined by the fetch policy that decides which threads enter the processor to compete for resources. However, current fetch policies only use indirect indicators of resource usage in their decision, which can lead to resource monopolization by a single thread or to resource waste when no thread can use them. Both situations can harm performance and happen, for example, after an L2 cache miss. In this paper, we introduce the concept of dynamic resource control in SMT processors. Using this concept, we propose a novel resource allocation policy for SMT processors. This policy directly monitors the usage of resources by each thread and guarantees that all threads get their fair share of the critical shared resources, avoiding monopolization. We also define a mechanism to allow a thread to borrow resources from another thread if that thread does not require them, thereby reducing resource under-use. Simulation results show that our dynamic resource allocation policy outperforms a static resource allocation policy by 8%, on average. It also improves the best dynamic resource-conscious fetch policies like FLUSH++ by 4%, on average, using the harmonic mean as a metric. This indicates that our policy does not obtain the ILP boost by unfairly running high ILP threads over slow memory-bounded threads. Instead, it achieves a better throughput-fairness balance.SMT processors increase performance by executing instructions from several threads simultaneously. These threads use the resources of the processor better by sharing them but, at the same time, threads are competing for these resources. The way critical resources are distributed among threads determines the final performance. Currently, processor resources are distributed among threads as determined by the fetch policy that decides which threads enter the processor to compete for resources. However, current fetch policies only use indirect indicators of resource usage in their decision, which can lead to resource monopolization by a single thread or to resource waste when no thread can use them. Both situations can harm performance and happen, for example, after an L2 cache miss. In this paper, we introduce the concept of dynamic resource control in SMT processors. Using this concept, we propose a novel resource allocation policy for SMT processors. This policy directly monitors the usage of resources by each thread and guarantees that all threads get their fair share of the critical shared resources, avoiding monopolization. We also define a mechanism to allow a thread to borrow resources from another thread if that thread does not require them, thereby reducing resource under-use. Simulation results show that our dynamic resource allocation policy outperforms a static resource allocation policy by 8%, on average. It also improves the best dynamic resource-conscious fetch policies like FLUSH++ by 4%, on average, using the harmonic mean as a metric. This indicates that our policy does not obtain the ILP boost by unfairly running high ILP threads over slow memory-bounded threads. Instead, it achieves a better throughput-fairness balance.
IEE Proceedings- Software | 2004
Ayose Jesús Falcón Samper; Alejandro Ramírez Bellido; Mateo Valero Cortés
Simultaneous multithreading (SMT) is an architectural technique that allows for the parallel execution of several threads simultaneously. Fetch performance has been identified as the most important bottleneck for SMT processors. The commonly adopted solution has been fetching from more than one thread each cycle. Recent studies have proposed a plethora of fetch policies to deal with fetch priority among threads, trying to increase fetch performance. We demonstrate that the simultaneous sharing of the fetch unit, apart from increasing the complexity of the fetch unit, can be counterproductive in terms of performance. We evaluate the use of high-performance fetch units in the context of SMT. Our new fetch architecture proposal allows us to feed an 8-way processor fetching from a single thread each cycle, reducing complexity, and increasing the usefulness of proposed fetch policies. Our results show that using new high-performance fetch units, like the FTB or the stream fetch, provides higher performance than fetching from two threads using common SMT fetch architectures. Furthermore, our results show that our design obtains better average performance for any kind of workloads (both ILP and memory bounded benchmarks), in contrast to previously proposed solutions.Simultaneous multithreading (SMT) is an architectural technique that allows for the parallel execution of several threads simultaneously. Fetch performance has been identified as the most important bottleneck for SMT processors. The commonly adopted solution has been fetching from more than one thread each cycle. Recent studies have proposed a plethora of fetch policies to deal with fetch priority among threads, trying to increase fetch performance. We demonstrate that the simultaneous sharing of the fetch unit, apart from increasing the complexity of the fetch unit, can be counterproductive in terms of performance. We evaluate the use of high-performance fetch units in the context of SMT. Our new fetch architecture proposal allows us to feed an 8-way processor fetching from a single thread each cycle, reducing complexity, and increasing the usefulness of proposed fetch policies. Our results show that using new high-performance fetch units, like the FTB or the stream fetch, provides higher performance than fetching from two threads using common SMT fetch architectures. Furthermore, our results show that our design obtains better average performance for any kind of workloads (both ILP and memory bounded benchmarks), in contrast to previously proposed solutions.