Javier Cabezas
Barcelona Supercomputing Center
Publications
Featured research published by Javier Cabezas.
Architectural Support for Programming Languages and Operating Systems | 2010
Isaac Gelado; John E. Stone; Javier Cabezas; Sanjay J. Patel; Nacho Navarro; Wen-mei W. Hwu
Heterogeneous computing combines general-purpose CPUs with accelerators to efficiently execute both sequential control-intensive and data-parallel phases of applications. Existing programming models for heterogeneous computing rely on programmers to explicitly manage data transfers between the CPU system memory and accelerator memory. This paper presents a new programming model for heterogeneous computing, called Asymmetric Distributed Shared Memory (ADSM), that maintains a shared logical memory space in which CPUs can access objects in the accelerator physical memory, but not vice versa. The asymmetry allows lightweight implementations that avoid common pitfalls of symmetric distributed shared memory systems. ADSM allows programmers to assign data objects to performance-critical methods. When a method is selected for accelerator execution, its associated data objects are allocated within the shared logical memory space, which is hosted in the accelerator physical memory and transparently accessible by methods executed on CPUs. We argue that ADSM reduces programming effort for heterogeneous computing systems and enhances application portability. We present a software implementation of ADSM, called GMAC, on top of CUDA in a GNU/Linux environment. We show that applications written in ADSM and running on top of GMAC achieve performance comparable to their counterparts using programmer-managed data transfers. This paper presents the GMAC system and evaluates different design choices. We further suggest additional architectural support that will likely allow GMAC to achieve higher application performance than the current CUDA model.
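The contrast ADSM draws with programmer-managed transfers is easiest to see in code. Below is a minimal sketch using CUDA's managed memory, which realizes the same shared-logical-space idea in modern CUDA; it is an analogous mechanism, not GMAC's own API, and the `scale` kernel is hypothetical:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical data-parallel method selected for accelerator execution.
__global__ void scale(float *v, int n, float k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= k;
}

int main() {
    const int n = 1 << 20;
    float *v;
    // ADSM-style shared logical memory space: a single pointer is valid on
    // both CPU and GPU, so no explicit cudaMemcpy calls are needed.
    cudaMallocManaged(&v, n * sizeof(float));
    for (int i = 0; i < n; ++i) v[i] = 1.0f;     // CPU writes, no copy-in
    scale<<<(n + 255) / 256, 256>>>(v, n, 2.0f); // GPU reads and writes
    cudaDeviceSynchronize();
    printf("v[0] = %f\n", v[0]);                 // CPU reads, no copy-out
    cudaFree(v);
    return 0;
}
```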
International Symposium on Computer Architecture | 2014
Ivan Tanasic; Isaac Gelado; Javier Cabezas; Alex Ramirez; Nacho Navarro; Mateo Valero
GPUs are being increasingly adopted as compute accelerators in many domains, spanning environments from mobile systems to cloud computing. These systems are usually running multiple applications, from one or several users. However, GPUs do not provide the support for resource sharing traditionally expected in these scenarios. Thus, such systems are unable to provide key multiprogrammed workload requirements, such as responsiveness, fairness, or quality of service. In this paper, we propose a set of hardware extensions that allow GPUs to efficiently support multiprogrammed GPU workloads. We argue for preemptive multitasking and design two preemption mechanisms that can be used to implement GPU scheduling policies. We extend the architecture to allow concurrent execution of GPU kernels from different user processes and implement a scheduling policy that dynamically distributes the GPU cores among concurrently running kernels, according to their priorities. We extend an NVIDIA GK110 (Kepler)-like GPU architecture with our proposals and evaluate them on a set of multiprogrammed workloads with up to eight concurrent processes. Our proposals improve the execution time of high-priority processes by 15.6x, the average application turnaround time by 1.5x to 2x, and system fairness by up to 3.4x.
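The proposed hardware preemption mechanisms are not themselves user-programmable, but CUDA already exposes a software-visible analogue of prioritized concurrent kernels through stream priorities. A minimal sketch using standard CUDA runtime calls (the `busy` kernel is hypothetical):

```cuda
#include <cuda_runtime.h>

// Hypothetical long-running kernel standing in for a user workload.
__global__ void busy(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

int main() {
    int least, greatest;  // numerically lower value = higher priority
    cudaDeviceGetStreamPriorityRange(&least, &greatest);

    cudaStream_t high, low;
    cudaStreamCreateWithPriority(&high, cudaStreamNonBlocking, greatest);
    cudaStreamCreateWithPriority(&low,  cudaStreamNonBlocking, least);

    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    // Both kernels are eligible to run concurrently; when resources are
    // contended, the scheduler favors blocks from the high-priority stream.
    busy<<<n / 256, 256, 0, low>>>(a, n);
    busy<<<n / 256, 256, 0, high>>>(b, n);
    cudaDeviceSynchronize();
    return 0;
}
```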
IEEE Transactions on Parallel and Distributed Systems | 2011
Mauricio Araya-Polo; Javier Cabezas; Mauricio Hanzich; Miquel Pericàs; Felix Rubio; Isaac Gelado; Muhammad Shafiq; Enric Morancho; Nacho Navarro; Eduard Ayguadé; José María Cela; Mateo Valero
Oil and gas companies trust Reverse Time Migration (RTM), the most advanced seismic imaging technique, with crucial decisions on drilling investments. The economic value of the oil reserves whose localization requires RTM is on the order of 10^13 dollars. But RTM requires vast computational power, which has somewhat hindered its practical success. Although accelerator-based architectures deliver enormous computational power, little attention has been devoted to assessing the effort of implementing RTM on them. The aim of this paper is to identify the major limitations imposed by different accelerators during RTM implementations, as well as potential bottlenecks in architectural features. Moreover, we suggest a wish list of features that, from our experience, should be included in the next generation of accelerators to cope with the requirements of applications like RTM. We present RTM algorithm mappings to the IBM Cell/B.E., NVIDIA Tesla, and an FPGA platform modeled after the Convey HC-1. All three implementations outperform a traditional processor (Intel Harpertown) in terms of performance (10x), but at the cost of huge development effort, mainly due to immature development frameworks and the lack of well-suited programming models. These results show that accelerators are well-positioned platforms for this kind of workload. Because our RTM implementation is based on an explicit high-order finite difference scheme, some conclusions of this work can be extrapolated to applications with similar numerical schemes, for instance magneto-hydrodynamics or atmospheric flow simulations.
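The computational core shared by all three platform mappings is an explicit high-order finite difference time step over the wavefield. A minimal, illustrative 1D version with the standard 8th-order central-difference coefficients (a simplification of the paper's 3D RTM kernel, not its actual code):

```cuda
#include <vector>

// Illustrative 8th-order-in-space, 2nd-order-in-time acoustic wave update:
// the explicit finite difference core that RTM repeats over the wavefield.
// c[] holds the standard 8th-order central-difference weights.
void fd_step(const std::vector<float>& prev, const std::vector<float>& cur,
             std::vector<float>& next, const std::vector<float>& vel2_dt2,
             float inv_dx2) {
    static const float c[5] = {-205.0f / 72.0f, 8.0f / 5.0f, -1.0f / 5.0f,
                               8.0f / 315.0f, -1.0f / 560.0f};
    for (size_t i = 4; i + 4 < cur.size(); ++i) {
        float lap = c[0] * cur[i];
        for (int r = 1; r <= 4; ++r)
            lap += c[r] * (cur[i - r] + cur[i + r]);
        // u(t+dt) = 2u(t) - u(t-dt) + v^2 dt^2 * d2u/dx2
        next[i] = 2.0f * cur[i] - prev[i] + vel2_dt2[i] * lap * inv_dx2;
    }
}
```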
Architectural Support for Programming Languages and Operating Systems | 2013
Ivan Tanasic; Lluis Vilanova; Marc Jordà; Javier Cabezas; Isaac Gelado; Nacho Navarro; Wen-mei W. Hwu
As a basic building block of many applications, sorting algorithms that efficiently run on modern machines are key to the performance of these applications. With the recent shift to using GPUs for general-purpose computing, researchers have proposed several sorting algorithms for single-GPU systems. However, some workstations and HPC systems have multiple GPUs, and applications running on them are designed to use all available GPUs in the system. In this paper we present a high-performance multi-GPU merge sort algorithm that solves the problem of sorting data distributed across several GPUs. Our merge sort algorithm first sorts the data on each GPU using an existing single-GPU sorting algorithm. Then, a series of merge steps produces a globally sorted array distributed across all the GPUs in the system. This merge phase is enabled by a novel pivot selection algorithm that ensures that merge steps always distribute data evenly among all GPUs. We also present the implementation of our sorting algorithm in CUDA, as well as a novel inter-GPU communication technique that enables this pivot selection algorithm. Experimental results show that an efficient implementation of our algorithm achieves a speedup of 1.9x when running on two GPUs and 3.3x when running on four GPUs, compared to sorting on a single GPU. At the same time, it is able to sort two and four times more records, compared to sorting on one GPU.
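The pivot selection can be pictured as a binary search for a balanced split point over two sorted sequences. The sketch below shows a merge-path-style partition; it illustrates the idea of evenly splitting a merge step and is not necessarily the paper's exact algorithm:

```cuda
#include <algorithm>
#include <cstddef>
#include <utility>

// Given sorted arrays a and b, find (i, j) with i + j == k such that merging
// a[0..i) with b[0..j) yields exactly the k smallest elements overall. The
// binary search keeps every merge step balanced, so each GPU gets equal work.
std::pair<size_t, size_t> select_pivot(const int* a, size_t na,
                                       const int* b, size_t nb, size_t k) {
    size_t lo = (k > nb) ? k - nb : 0;  // fewest elements a can contribute
    size_t hi = std::min(k, na);        // most elements a can contribute
    while (lo < hi) {
        size_t i = (lo + hi) / 2;       // candidate contribution from a
        size_t j = k - i;               // implied contribution from b
        if (a[i] < b[j - 1])
            lo = i + 1;                 // a must contribute more elements
        else
            hi = i;                     // a contributes at most i elements
    }
    return {lo, k - lo};
}
```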
International Conference of the Chilean Computer Science Society | 2009
Javier Cabezas; Mauricio Araya-Polo; Isaac Gelado; Nacho Navarro; Enric Morancho; José María Cela
Partial Differential Equations (PDEs) are at the heart of most simulations in many scientific fields, from fluid mechanics to astrophysics. One of the most popular mathematical schemes to solve a PDE is Finite Difference (FD). In this work we map a PDE-FD algorithm called Reverse Time Migration to a GPU using CUDA. This seismic imaging (geophysics) algorithm is widely used in the oil industry. GPUs are natural contenders in the aftermath of the clock race, in particular for high-performance computing (HPC). Due to GPU characteristics, the parallelism paradigm shifts from the classical threads-plus-SIMD approach to Single Program Multiple Data (SPMD). The NVIDIA GTX 280 implementation outperforms homogeneous CPUs by up to 9x (Intel Harpertown E5420) and up to 14x (IBM PPC 970). These preliminary results confirm that GPUs are a real option for HPC, from performance to programmability.
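Under SPMD, every GPU thread executes the same program on its own grid point, with the block and thread indices supplying the data coordinates. A heavily simplified, hypothetical sketch of that mapping (RTM's real kernel applies a high-order stencil at each point):

```cuda
// SPMD mapping: every thread runs this same program; blockIdx/threadIdx pick
// the (x, y) grid point, and each thread sweeps one column in z. The update
// shown is a placeholder; RTM applies a high-order stencil here.
__global__ void rtm_step(const float* u, float* u_next,
                         int nx, int ny, int nz) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= nx || y >= ny) return;
    for (int z = 0; z < nz; ++z) {
        int idx = (z * ny + y) * nx + x;
        u_next[idx] = u[idx];  // stencil computation would go here
    }
}
```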
International Conference on Supercomputing | 2015
Javier Cabezas; Lluis Vilanova; Isaac Gelado; Thomas B. Jablin; Nacho Navarro; Wen-mei W. Hwu
In this paper we present AMGE, a programming framework and runtime system that transparently decomposes GPU kernels and executes them on multiple GPUs in parallel. AMGE exploits the remote memory access capability of modern GPUs to ensure that data can be accessed regardless of its physical location, allowing our runtime to safely decompose and distribute arrays across GPU memories. It optionally performs a compiler analysis that detects array access patterns in GPU kernels. Using this information, the runtime can choose more efficient computation and data distribution configurations than previous work. The GPU execution model allows AMGE to hide the cost of remote accesses if they are kept below 5%. We demonstrate that a thread block scheduling policy that distributes remote accesses throughout the kernel execution further reduces their overhead. Results show 1.98× and 3.89× execution speedups for 2 and 4 GPUs for a wide range of dense computations compared to the original versions on a single GPU.
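The remote memory access capability AMGE builds on is exposed in CUDA as peer-to-peer access: once enabled, a kernel on one GPU can dereference a pointer that physically resides in another GPU's memory. A minimal sketch using the standard runtime calls (the `sum` kernel is hypothetical):

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: reduces an array that may live on another GPU.
__global__ void sum(const float* remote, float* out, int n) {
    float s = 0.0f;
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        s += remote[i];           // loads travel the inter-GPU interconnect
    atomicAdd(out, s);
}

int main() {
    const int n = 1024;
    int can = 0;
    cudaDeviceCanAccessPeer(&can, /*device=*/0, /*peerDevice=*/1);
    if (!can) return 1;

    cudaSetDevice(1);
    float* data;
    cudaMalloc(&data, n * sizeof(float));   // physically resides on GPU 1
    cudaMemset(data, 0, n * sizeof(float));

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);       // GPU 0 maps GPU 1's memory
    float* out;
    cudaMalloc(&out, sizeof(float));
    cudaMemset(out, 0, sizeof(float));
    sum<<<1, 256>>>(data, out, n);          // GPU 0 kernel reads GPU 1 data
    cudaDeviceSynchronize();
    return 0;
}
```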
International Conference on Parallel Architectures and Compilation Techniques | 2014
Javier Cabezas; Lluis Vilanova; Isaac Gelado; Thomas B. Jablin; Nacho Navarro; Wen-mei W. Hwu
We present AMGE, a programming framework and runtime system to decompose data and GPU kernels and execute them on multiple GPUs concurrently. AMGE exploits the remote memory access capability of recent GPUs to guarantee data accessibility regardless of its physical location, thus allowing AMGE to safely decompose and distribute arrays across GPU memories. AMGE also includes a compiler analysis to detect array access patterns in GPU kernels. The runtime uses this information to automatically choose the best computation and data distribution configuration. Through effective use of GPU caches, AMGE achieves good scalability in spite of the limited interconnect bandwidth between GPUs. Results show 1.95× and 3.73× execution speedups for 2 and 4 GPUs for a wide range of dense computations compared to the original versions on a single GPU.
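The distribution AMGE automates can be pictured as splitting an array's outermost dimension into one contiguous band per GPU. An illustrative host-side helper under that assumption (`distribute_rows` is hypothetical, not AMGE's interface):

```cuda
#include <algorithm>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper (not AMGE's interface): split a rows-by-cols array into
// contiguous row bands, one per GPU, mirroring a 1D data distribution.
std::vector<float*> distribute_rows(const float* host, int rows, int cols) {
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);
    std::vector<float*> parts(ngpus, nullptr);
    int band = (rows + ngpus - 1) / ngpus;  // rows per GPU, last may be short
    for (int d = 0; d < ngpus; ++d) {
        int first = d * band;
        int count = std::min(band, rows - first);
        if (count <= 0) break;
        cudaSetDevice(d);
        cudaMalloc(&parts[d], (size_t)count * cols * sizeof(float));
        cudaMemcpy(parts[d], host + (size_t)first * cols,
                   (size_t)count * cols * sizeof(float),
                   cudaMemcpyHostToDevice);
    }
    return parts;
}
```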
IEEE Transactions on Parallel and Distributed Systems | 2015
Javier Cabezas; Isaac Gelado; John E. Stone; Nacho Navarro; David B. Kirk; Wen-mei W. Hwu
Heterogeneous parallel computing applications often process large data sets that require multiple GPUs to jointly meet their needs for physical memory capacity and compute throughput. However, the lack of high-level abstractions in previous heterogeneous parallel programming models forces programmers to resort to multiple code versions, complex data copy steps, and synchronization schemes when exchanging data between multiple GPU devices, which results in high software development cost, poor maintainability, and even poor performance. This paper describes the HPE runtime system, and the associated architecture support, which enables a simple, efficient programming interface for exchanging data between multiple GPUs through either interconnects or cross-node network interfaces. The runtime and architecture support presented in this paper can also be used to support other types of accelerators. We show that the simplified programming interface reduces programming complexity. The research presented in this paper started in 2009. It has been implemented and tested extensively in several generations of HPE runtime systems, and adopted into NVIDIA GPU hardware and drivers for CUDA 4.0 and beyond since 2011. The availability of real hardware that supports key HPE features gives rise to a rare opportunity for studying the effectiveness of the hardware support by running important benchmarks on real runtime and hardware. Experimental results show that, in an exemplar heterogeneous system, peer DMA combined with software techniques such as double buffering and pinned buffers can improve the inter-accelerator data communication bandwidth by 2×. They can also improve the execution speed by 1.6× for a 3D finite difference, 2.5× for 1D FFT, and 1.6× for merge sort, all measured on real hardware. The proposed architecture support enables the HPE runtime to transparently deploy these optimizations under simple portable user code, allowing system designers to freely employ devices of different capabilities. We further argue that simple interfaces such as HPE are needed for most applications to benefit from advanced hardware features in practice.
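Two of the software techniques measured here, pinned host buffers and double buffering, can be sketched with standard CUDA runtime calls: two staging buffers alternate so that the copy of one chunk overlaps the processing of the previous one (the `process` kernel is hypothetical):

```cuda
#include <cuda_runtime.h>

// Hypothetical per-chunk computation.
__global__ void process(float* chunk, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) chunk[i] += 1.0f;
}

int main() {
    const int chunk = 1 << 20, chunks = 16;
    float *pinned[2], *dev[2];
    cudaStream_t s[2];
    for (int b = 0; b < 2; ++b) {
        cudaMallocHost(&pinned[b], chunk * sizeof(float)); // pinned: async DMA
        cudaMalloc(&dev[b], chunk * sizeof(float));
        cudaStreamCreate(&s[b]);
    }
    for (int c = 0; c < chunks; ++c) {
        int b = c & 1;                    // alternate between the two buffers
        cudaStreamSynchronize(s[b]);      // wait until this buffer is free
        // ... fill pinned[b] with the next input chunk on the CPU ...
        cudaMemcpyAsync(dev[b], pinned[b], chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s[b]);
        // This copy and kernel overlap with the other buffer's work.
        process<<<chunk / 256, 256, 0, s[b]>>>(dev[b], chunk);
    }
    cudaDeviceSynchronize();
    return 0;
}
```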
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2015
Javier Cabezas; Marc Jordà; Isaac Gelado; Nacho Navarro; Wen-mei W. Hwu
Discrete GPUs in modern multi-GPU systems can transparently access each other's memories through the PCIe interconnect. Future systems will improve this capability by including better GPU interconnects such as NVLink. However, remote memory access across GPUs has gone largely unnoticed among programmers, and multi-GPU systems are still programmed like distributed systems, in which each GPU only accesses its own memory. This increases the complexity of the host code, as programmers need to explicitly communicate data across GPU memories. In this paper we present GPU-SM, a set of guidelines for programming multi-GPU systems like NUMA shared-memory systems with minimal performance overheads. Using GPU-SM, data structures can be decomposed across several GPU memories, and data that resides on a different GPU is accessed remotely through the PCIe interconnect. The programmability benefits of the shared-memory model on GPUs are shown using finite difference and image filtering applications. We also present a detailed performance analysis of the PCIe interconnect and the impact of remote accesses on kernel performance. While PCIe imposes long latency and offers limited bandwidth compared to the local GPU memory, we show that the highly multithreaded GPU execution model can help reduce its cost. Evaluation of finite difference and image filtering GPU-SM implementations shows close-to-linear speedups on a system with 4 GPUs, with much simpler code than the original implementations (e.g., a 40% SLOC reduction in the host code of the finite difference application).
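In the GPU-SM style, a kernel at the edge of its GPU's partition simply dereferences the neighboring GPU's memory instead of waiting for an explicit halo exchange. An illustrative sketch for a band-decomposed 2D stencil (hypothetical decomposition; it assumes peer access is already enabled, and the top halo is omitted for brevity):

```cuda
// Band-decomposed 2D stencil in the GPU-SM style: each GPU owns band_rows
// consecutive rows; threads on the last row read the first row of the next
// GPU's band directly through peer access instead of a halo exchange.
__global__ void stencil_band(const float* mine, const float* below_remote,
                             float* out, int cols, int band_rows) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x <= 0 || x >= cols - 1 || y <= 0 || y >= band_rows) return;
    const float* south = (y + 1 < band_rows)
                             ? &mine[(y + 1) * cols]   // local neighbor row
                             : &below_remote[0];       // remote neighbor row
    out[y * cols + x] = 0.25f * (mine[y * cols + x - 1] +
                                 mine[y * cols + x + 1] +
                                 mine[(y - 1) * cols + x] +
                                 south[x]);
}
```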
High Performance Embedded Architectures and Compilers | 2009
Dominique Chanet; Javier Cabezas; Enric Morancho; Nacho Navarro; Koen De Bosschere
There is a growing trend to use general-purpose operating systems like Linux in embedded systems. Previous research focused on using compaction and specialization techniques to adapt a general-purpose OS to the memory-constrained environment of most embedded systems. However, there is still room for improvement: it has been shown that even after applying the aforementioned techniques, more than 50% of the kernel code remains unexecuted under normal system operation. We introduce a new technique that reduces the Linux kernel code memory footprint through on-demand loading of infrequently executed code, for systems that support virtual memory. In this paper, we describe our general approach and study code placement algorithms that minimize the performance impact of the code loading. A code size reduction of 68% is achieved, with a 2.2% speedup in system-mode execution time, for a case study based on the MediaBench II benchmark suite.
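The mechanism is essentially demand paging applied to kernel code: a page of infrequently executed code occupies physical memory only after it is first executed. A user-space analogue of the same idea via an executable file mapping (an illustration only; the paper implements this inside the Linux kernel, and `cold_code.bin` is a hypothetical code image):

```cuda
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main() {
    // Hypothetical image holding the infrequently executed code.
    int fd = open("cold_code.bin", O_RDONLY);
    if (fd < 0) return 1;
    off_t len = lseek(fd, 0, SEEK_END);

    // Map the code executable but do not read it: the MMU faults each page
    // in only when an instruction fetch first touches it, so code that is
    // never executed never occupies physical memory.
    void* code = mmap(nullptr, (size_t)len, PROT_READ | PROT_EXEC,
                      MAP_PRIVATE, fd, 0);
    if (code == MAP_FAILED) return 1;

    // ... jump into 'code' as needed; pages load on demand ...
    munmap(code, (size_t)len);
    close(fd);
    return 0;
}
```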