Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Alex Ramirez is active.

Publication


Featured research published by Alex Ramirez.


International Symposium on Microarchitecture | 2004

Dynamically Controlled Resource Allocation in SMT Processors

Francisco J. Cazorla; Alex Ramirez; Mateo Valero; Enrique Fernández

SMT processors increase performance by executing instructions from several threads simultaneously. These threads use the resources of the processor better by sharing them but, at the same time, compete for those resources. The way critical resources are distributed among threads determines the final performance. Currently, processor resources are distributed among threads as determined by the fetch policy, which decides which threads enter the processor to compete for resources. However, current fetch policies use only indirect indicators of resource usage in their decisions, which can lead to resource monopolization by a single thread or to resource waste when no thread can use them. Both situations can harm performance and happen, for example, after an L2 cache miss. In this paper, we introduce the concept of dynamic resource control in SMT processors. Using this concept, we propose a novel resource allocation policy for SMT processors. This policy directly monitors the usage of resources by each thread and guarantees that all threads get their fair share of the critical shared resources, avoiding monopolization. We also define a mechanism to allow a thread to borrow resources from another thread if that thread does not require them, thereby reducing resource under-use. Simulation results show that our dynamic resource allocation policy outperforms a static resource allocation policy by 8%, on average. It also improves on the best dynamic resource-conscious fetch policies, like FLUSH++, by 4%, on average, using the harmonic mean as a metric. This indicates that our policy does not obtain the ILP boost by unfairly running high-ILP threads over slow memory-bound threads. Instead, it achieves a better throughput-fairness balance.
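The heart of the policy can be illustrated with a toy model: cap each thread at its fair share of a critical resource, then let threads that need less lend their surplus to threads that need more. The sketch below is only a schematic of that idea, not the paper's implementation; the entry count, per-thread demands, and the borrowing order are illustrative assumptions.

```python
"""Toy model of dynamic resource sharing among SMT threads.

Illustrative sketch only: the entry count, demands, and borrowing
rule are assumptions, not the paper's actual mechanism.
"""

def allocate(total_entries, demands):
    """Cap every thread at its fair share of a critical resource;
    threads that need less lend the surplus to threads that need more."""
    fair = total_entries // len(demands)
    # Phase 1: no thread may exceed its fair share.
    grant = {t: min(d, fair) for t, d in demands.items()}
    surplus = total_entries - sum(grant.values())
    # Phase 2: needy threads borrow the unused entries.
    for t, d in sorted(demands.items(), key=lambda kv: -kv[1]):
        extra = min(d - grant[t], surplus)
        grant[t] += extra
        surplus -= extra
    return grant

# Example: thread B stalls on an L2 miss and would otherwise hoard entries.
demands = {"A": 10, "B": 48, "C": 6}   # instantaneous demand per thread
print(allocate(32, demands))           # {'A': 10, 'B': 16, 'C': 6}
```

In the example, thread B is prevented from monopolizing all 32 entries, yet the 6 entries that A and C do not need are lent to it rather than left idle.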


International Symposium on Microarchitecture | 2008

Multicore Resource Management

Kyle J. Nesbit; Miquel Moreto; Francisco J. Cazorla; Alex Ramirez; Mateo Valero; James E. Smith

Current resource management mechanisms and policies are inadequate for future multicore systems. Instead, a hardware/software interface based on the virtual private machine (VPM) abstraction would allow software policies to explicitly manage microarchitecture resources. VPM policies, implemented primarily in software, translate application and system objectives into VPM resource assignments. Then, VPM mechanisms securely multiplex, arbitrate, or distribute hardware resources to satisfy the VPM assignments.
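As a rough illustration of the abstraction, a VPM policy can be thought of as a table of per-application resource fractions that the mechanisms must be able to satisfy. The sketch below is a minimal model under that reading; the resource names, fractions, and feasibility rule are invented for illustration and are not the paper's interface.

```python
"""Minimal sketch of the virtual private machine (VPM) idea:
a software policy assigns each application a fraction of each
microarchitecture resource, and the mechanism layer refuses
infeasible assignments. All names and fractions are illustrative."""

RESOURCES = ("cores", "l2_cache", "mem_bandwidth")

class VPMPolicy:
    def __init__(self):
        self.assignments = {}          # app -> {resource: fraction}

    def assign(self, app, **fractions):
        """Translate an application objective into a VPM assignment."""
        for r in fractions:
            if r not in RESOURCES:
                raise ValueError(f"unknown resource: {r}")
        self.assignments[app] = fractions

    def feasible(self):
        """Mechanisms can only satisfy assignments that do not
        oversubscribe any physical resource."""
        return all(
            sum(a.get(r, 0.0) for a in self.assignments.values()) <= 1.0
            for r in RESOURCES
        )

policy = VPMPolicy()
policy.assign("video_decoder", cores=0.50, l2_cache=0.25, mem_bandwidth=0.40)
policy.assign("batch_job",     cores=0.25, l2_cache=0.25, mem_bandwidth=0.30)
print(policy.feasible())   # True: spare capacity is left for best effort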


Signal Processing Systems | 2009

Parallel Scalability of Video Decoders

Cor Meenderinck; Arnaldo Azevedo; Ben H. H. Juurlink; Mauricio Alvarez Mesa; Alex Ramirez

An important question is whether emerging and future applications exhibit sufficient parallelism, in particular thread-level parallelism, to exploit the large numbers of cores future chip multiprocessors (CMPs) are expected to contain. As a case study we investigate the parallelism available in video decoders, an important application domain now and in the future. Specifically, we analyze the parallel scalability of the H.264 decoding process. First we discuss the data structures and dependencies of H.264 and show which types of parallelism it allows to be exploited. We also show that previously proposed parallelization strategies, such as slice-level, frame-level, and intra-frame macroblock (MB) level parallelism, are not sufficiently scalable. Based on the observation that inter-frame dependencies have a limited spatial range, we propose a new parallelization strategy, called Dynamic 3D-Wave. It allows certain MBs of consecutive frames to be decoded in parallel. Using this new strategy we analyze the limits to the available MB-level parallelism in H.264. Using real movie sequences we find a maximum MB parallelism ranging from 4000 to 7000. We also perform a case study to assess the practical value and possibilities of a highly parallelized H.264 application. The results show that H.264 exhibits sufficient parallelism to efficiently exploit the capabilities of future manycore CMPs.
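The intra-frame limit the authors point to can be reproduced with a little arithmetic. Within one frame, MB(x, y) depends on its left, upper-left, upper, and upper-right neighbours, so all macroblocks with equal x + 2y can be decoded together (the 2D-Wave). The sketch below assumes a 1080p frame of 120 x 68 macroblocks; the peak intra-frame parallelism it finds, about 60 MBs, shows why overlapping consecutive frames is needed to reach the thousands of parallel MBs reported above.

```python
"""2D-Wave macroblock parallelism inside one H.264 frame.

MB(x, y) can start once MB(x-1, y) and MB(x+1, y-1) are done, so every
MB with x + 2*y == t is independent at step t. The frame size
(1080p => 120 x 68 macroblocks) is an illustrative assumption; the
paper's Dynamic 3D-Wave additionally overlaps consecutive frames.
"""

def wavefront_schedule(cols, rows):
    steps = {}
    for y in range(rows):
        for x in range(cols):
            steps.setdefault(x + 2 * y, []).append((x, y))
    return steps

steps = wavefront_schedule(cols=120, rows=68)
peak = max(len(mbs) for mbs in steps.values())
print(f"decoding steps: {len(steps)}, peak parallel MBs: {peak}")
# decoding steps: 254, peak parallel MBs: 60
```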


IEEE International Conference on High Performance Computing, Data, and Analytics | 2013

Supercomputing with commodity CPUs: are mobile SoCs ready for HPC?

Nikola Rajovic; Paul M. Carpenter; Isaac Gelado; Nikola Puzovic; Alex Ramirez; Mateo Valero

In the late 1990s, powerful economic forces led to the adoption of commodity desktop processors in high-performance computing. This transformation has been so effective that the June 2013 TOP500 list is still dominated by x86. In 2013, the largest commodity market in computing is not PCs or servers, but mobile computing, comprising smartphones and tablets, most of which are built with ARM-based SoCs. This leads to the suggestion that, once mobile SoCs deliver sufficient performance, they can help reduce the cost of HPC. This paper addresses this question in detail. We analyze the trend in mobile SoC performance, comparing it with the similar trend in the 1990s. We also present our experience evaluating the performance and efficiency of mobile SoCs, deploying a cluster, and evaluating the network and scalability of production applications. In summary, we give a first answer as to whether mobile SoCs are ready for HPC.


IEEE Transactions on Computers | 2006

Predictable performance in SMT processors: synergy between the OS and SMTs

Francisco J. Cazorla; Peter M. W. Knijnenburg; Rizos Sakellariou; Enrique Fernández; Alex Ramirez; Mateo Valero

Current operating systems (OS) perceive the different contexts of simultaneous multithreaded (SMT) processors as multiple independent processing units, although, in reality, threads executed in these units compete for the same hardware resources. Furthermore, hardware resources are assigned to threads implicitly as determined by the SMT instruction fetch (Ifetch) policy, without the control of the OS. Both factors cause a lack of control over how individual threads are executed, which can frustrate the work of the job scheduler. This presents a problem for general-purpose systems, where the OS job scheduler cannot enforce priorities, and also for embedded systems, where it would be difficult to guarantee worst-case execution times. In this paper, we propose a novel strategy that enables a two-way interaction between the OS and the SMT processor and allows the OS to run jobs at a certain percentage of their maximum speed, regardless of the workload in which these jobs are executed. In contrast to previous approaches, our approach enables the OS to run time-critical jobs without dedicating all internal resources to them, so that non-time-critical jobs can make significant progress as well, without significantly compromising overall throughput. In fact, our mechanism, in addition to fulfilling OS requirements, achieves 90 percent of the throughput of one of the best currently known fetch policies for SMTs.
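The flavor of this two-way interaction can be sketched as a feedback loop: the OS posts a target fraction of a job's standalone speed, and the processor periodically samples the job's IPC and grows or shrinks its resource share until the target is met. The sketch below is only a schematic of such a loop; the ipc() performance model, step size, and tolerance are stand-in assumptions, not the paper's mechanism.

```python
"""Feedback sketch of OS/SMT cooperation: the OS asks that a
time-critical job run at a target fraction of its standalone speed,
and the hardware tunes the job's resource share until the sampled
IPC meets the target. The ipc() model is a stand-in assumption."""

def ipc(share, full_ipc=2.0):
    # Stand-in performance model: diminishing returns in resource share.
    return full_ipc * min(1.0, share / 0.8)

def tune_share(target_fraction, step=0.05, tolerance=0.03):
    share = 0.5                      # start from an even split
    full = ipc(1.0)                  # the job's standalone speed
    for _ in range(100):             # sample-and-adjust intervals
        fraction = ipc(share) / full
        if abs(fraction - target_fraction) <= tolerance:
            break
        share += step if fraction < target_fraction else -step
        share = min(max(share, 0.0), 1.0)
    return share

share = tune_share(target_fraction=0.9)
print(f"share granted to the critical job: {share:.2f}")   # 0.70
```

Note that the job reaches 90% of full speed with only 70% of the resources in this toy model, leaving the remainder for low-priority threads, which is the balance the paper aims for.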


International Symposium on Computer Architecture | 2014

Enabling preemptive multiprogramming on GPUs

Ivan Tanasic; Isaac Gelado; Javier Cabezas; Alex Ramirez; Nacho Navarro; Mateo Valero

GPUs are being increasingly adopted as compute accelerators in many domains, spanning environments from mobile systems to cloud computing. These systems are usually running multiple applications, from one or several users. However, GPUs do not provide the support for resource sharing traditionally expected in these scenarios. Thus, such systems are unable to provide key multiprogrammed workload requirements, such as responsiveness, fairness, or quality of service. In this paper, we propose a set of hardware extensions that allow GPUs to efficiently support multiprogrammed GPU workloads. We argue for preemptive multitasking and design two preemption mechanisms that can be used to implement GPU scheduling policies. We extend the architecture to allow concurrent execution of GPU kernels from different user processes and implement a scheduling policy that dynamically distributes the GPU cores among concurrently running kernels, according to their priorities. We extend an NVIDIA GK110 (Kepler)-like GPU architecture with our proposals and evaluate them on a set of multiprogrammed workloads with up to eight concurrent processes. Our proposals improve the execution time of high-priority processes by 15.6x, average application turnaround time by 1.5x to 2x, and system fairness by up to 3.4x.
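One ingredient, distributing GPU cores among concurrent kernels according to priority, can be illustrated with a simple proportional-share rule. The sketch below is an assumption-laden stand-in for the paper's scheduling policy: the largest-remainder rounding, the one-SM minimum, and the kernel names are all illustrative.

```python
"""Sketch of priority-proportional sharing of GPU cores (SMs) among
concurrent kernels, in the spirit of dynamic spatial scheduling;
the proportional rule and the counts are illustrative assumptions."""

def distribute_sms(total_sms, kernels):
    """kernels: {name: priority}. Every kernel gets at least one SM;
    the rest are split proportionally to priority (largest remainder)."""
    weight = sum(kernels.values())
    exact = {k: total_sms * p / weight for k, p in kernels.items()}
    grant = {k: max(1, int(v)) for k, v in exact.items()}
    # Hand out SMs lost to rounding, biggest fractional part first.
    leftover = total_sms - sum(grant.values())
    for k in sorted(exact, key=lambda k: exact[k] - int(exact[k]),
                    reverse=True):
        if leftover <= 0:
            break
        grant[k] += 1
        leftover -= 1
    return grant

# A high-priority interactive kernel arrives next to two batch kernels.
print(distribute_sms(15, {"interactive": 8, "batch1": 1, "batch2": 1}))
# {'interactive': 12, 'batch1': 2, 'batch2': 1}
```

A preemptive scheduler would rerun this distribution whenever a kernel arrives or finishes, preempting SMs whose grant shrank.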


High Performance Embedded Architectures and Compilers | 2007

High-Performance Embedded Architecture and Compilation Roadmap

Koen De Bosschere; Wayne Luk; Xavier Martorell; Nacho Navarro; Michael F. P. O'Boyle; Dionisios N. Pnevmatikatos; Alex Ramirez; Pascal Sainrat; André Seznec; Per Stenström; Olivier Temam

One of the key deliverables of the EU HiPEAC FP6 Network of Excellence is a roadmap on high-performance embedded architecture and compilation --- the HiPEAC Roadmap for short. This paper is the result of the roadmapping process that took place within the HiPEAC community and beyond. It concisely describes the key research challenges ahead of us and will be used to steer the HiPEAC research efforts. The roadmap details several of the key challenges that need to be tackled in the coming decade in order to achieve scalable performance in multi-core systems and to make them a practical mainstream technology for high-performance embedded systems. The HiPEAC roadmap is organized around 10 central themes: (i) single core architecture, (ii) multi-core architecture, (iii) interconnection networks, (iv) programming models and tools, (v) compilation, (vi) run-time systems, (vii) benchmarking, (viii) simulation and system modeling, (ix) reconfigurable computing, and (x) real-time systems. For each theme, a list of challenges is identified; in total, 55 key challenges are listed in this roadmap. The list of challenges can serve as a valuable source of reference for researchers active in the field, can help companies build their own R&D roadmaps, and --- although not intended as a tutorial document --- can even serve as an introduction for scientists and professionals interested in learning about high-performance embedded architecture and compilation.


International Conference on Parallel Architectures and Compilation Techniques | 2011

DiDi: Mitigating the Performance Impact of TLB Shootdowns Using a Shared TLB Directory

Carlos Villavieja; Vasileios Karakostas; Lluis Vilanova; Yoav Etsion; Alex Ramirez; Avi Mendelson; Nacho Navarro; Adrián Cristal; Osman S. Unsal

Translation Lookaside Buffers (TLBs) are ubiquitously used in modern architectures to cache virtual-to-physical mappings and, as they are looked up on every memory access, are paramount to performance scalability. The emergence of chip multiprocessors (CMPs) with per-core TLBs has brought the problem of TLB coherence to the forefront. TLBs are kept coherent at the software level by the operating system (OS). Whenever the OS modifies page permissions in a page table, it must initiate a coherency transaction among TLBs, a process known as a TLB shootdown. Current CMPs rely on the OS to approximate the set of TLBs caching a mapping and synchronize TLBs using costly Inter-Processor Interrupts (IPIs) and software handlers. In this paper, we characterize the impact of TLB shootdowns on multiprocessor performance and scalability, and present the design of a scalable TLB coherence mechanism. First, we show that both TLB shootdown cost and frequency increase with the number of processors, and we project that software-based TLB shootdowns would thwart the performance of large multiprocessors. We then present a scalable architectural mechanism that couples a shared TLB directory with load/store queue support for lightweight TLB invalidation, thereby eliminating the need for costly IPIs. Finally, we show that the proposed mechanism reduces the fraction of machine cycles wasted on TLB shootdowns by an order of magnitude.
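The essence of the directory is a map from each cached translation to the set of cores sharing it, so that a page-table change touches only those cores. The sketch below models that bookkeeping in software; the class layout and method names are illustrative, not the paper's hardware design.

```python
"""Sketch of a shared TLB directory in the spirit of DiDi: track which
cores cache each virtual-page translation so a page-table change
invalidates only the actual sharers instead of IPI-broadcasting to
every core. Structures are illustrative, not the paper's hardware."""

from collections import defaultdict

class TLBDirectory:
    def __init__(self):
        self.sharers = defaultdict(set)   # virtual page -> cores caching it

    def fill(self, core, vpage):
        """A core's TLB caches a translation: record it as a sharer."""
        self.sharers[vpage].add(core)

    def shootdown(self, vpage):
        """OS changed the mapping: only the recorded sharers need their
        TLB entry invalidated; no broadcast IPI is required."""
        return self.sharers.pop(vpage, set())

directory = TLBDirectory()
directory.fill(core=0, vpage=0x7f3a2)
directory.fill(core=5, vpage=0x7f3a2)
directory.fill(core=2, vpage=0x10c44)

# Only cores 0 and 5 are disturbed; every other core keeps running.
print(sorted(directory.shootdown(0x7f3a2)))   # [0, 5]
```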


IEEE International Symposium on Workload Characterization | 2005

A performance characterization of high definition digital video decoding using H.264/AVC

Mauricio Alvarez; Esther Salamí; Alex Ramirez; Mateo Valero

H.264/AVC is a new international video coding standard that provides higher coding efficiency with respect to previous standards at the expense of higher computational complexity. The complexity is even higher when H.264/AVC is used in applications with high bandwidth and high quality, like high definition (HD) video decoding. In this paper, we analyze the computational requirements of the H.264 decoder with a special emphasis on HD video, and we compare it with previous standards and lower resolutions. The analysis was done with a SIMD-optimized decoder using hardware performance monitoring. The main objective is to identify the application bottlenecks and to suggest the necessary support in the architecture for processing HD video efficiently. We have found that H.264/AVC decoding of HD video performs many more operations per frame than MPEG-4 and MPEG-2, has new kernels with more demanding memory access patterns, and has many data-dependent branches that are difficult to predict. In order to improve the H.264/AVC decoding process at HD, it is necessary to explore better support in media instructions, specialized prefetching techniques, and possibly the use of some kind of multiprocessor architecture.
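Part of the complexity gap is visible from macroblock counts alone, as the back-of-the-envelope sketch below shows. The resolutions are standard examples, and since per-macroblock cost also grows in H.264 (deblocking, sub-pixel interpolation), raw MB counts only bound the workload growth from below.

```python
"""Back-of-the-envelope macroblock counts: one reason HD decoding is
so much more demanding than SD. Resolutions are standard examples;
per-MB costs vary widely, so this only bounds the workload growth."""

def macroblocks(width, height, mb=16):
    # H.264 pads frame dimensions up to a multiple of the 16x16 MB size.
    return -(-width // mb) * -(-height // mb)   # ceiling division

for name, w, h in [("SD 720x576", 720, 576),
                   ("HD 1280x720", 1280, 720),
                   ("HD 1920x1080", 1920, 1080)]:
    print(f"{name}: {macroblocks(w, h)} MBs/frame")
# SD 720x576: 1620 MBs/frame
# HD 1280x720: 3600 MBs/frame
# HD 1920x1080: 8160 MBs/frame
```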


Computing Frontiers | 2004

Predictable performance in SMT processors

Francisco J. Cazorla; Peter M. W. Knijnenburg; Rizos Sakellariou; Enrique Fernández; Alex Ramirez; Mateo Valero

Current instruction fetch policies in SMT processors are oriented towards optimization of overall throughput and/or fairness. However, they provide no control over how individual threads are executed, leading to performance unpredictability, since the IPC of a thread depends on the workload it is executed in and on the fetch policy used. From the point of view of the Operating System (OS), it is the job scheduler that determines how jobs are executed. However, when the OS runs on an SMT processor, the job scheduler cannot guarantee execution time constraints of any job due to this performance unpredictability. In this paper we propose a novel kind of collaboration between the OS and the SMT hardware that enables the OS to enforce that a high-priority thread runs at a specific fraction of its full speed. We present an extensive evaluation using many different workloads, which shows that this mechanism gives the required performance in more than 97% of all cases considered, and even more than 99% for the less extreme cases. At the same time, our mechanism does not need to trade off predictability against overall throughput, as it maximizes the IPC of the remaining low-priority threads, giving 94% on average (and 97.5% on average for the less extreme cases) of the throughput obtained using instruction fetch policies oriented toward throughput maximization, such as icount.

Collaboration


Dive into Alex Ramirez's collaborations.

Top Co-Authors

Mateo Valero
Polytechnic University of Catalonia

Francisco J. Cazorla
Barcelona Supercomputing Center

Enrique Fernández
University of Las Palmas de Gran Canaria

Felipe Cabarcas
Polytechnic University of Catalonia

Oliverio J. Santana
Polytechnic University of Catalonia

Alejandro Rico
Barcelona Supercomputing Center

Friman Sánchez
Polytechnic University of Catalonia

Yoav Etsion
Technion – Israel Institute of Technology

Eduard Ayguadé
Barcelona Supercomputing Center