
Publications


Featured research published by Antonio J. Peña.


International Conference on High Performance Computing and Simulation | 2010

rCUDA: Reducing the number of GPU-based accelerators in high performance clusters

José Duato; Antonio J. Peña; Federico Silla; Rafael Mayo; Enrique S. Quintana-Ortí

The increasing computing requirements for GPUs (Graphics Processing Units) have favored the design and marketing of commodity devices that nowadays can also be used to accelerate general-purpose computing. Therefore, future high performance clusters intended for HPC (High Performance Computing) will likely include such devices. However, high-end GPU-based accelerators used in HPC exhibit considerable energy consumption, so attaching a GPU to every node of a cluster has a strong impact on its overall power consumption. In this paper we detail a framework that enables remote GPU acceleration in HPC clusters, thus allowing a reduction in the number of accelerators installed in the cluster. This leads to energy, acquisition, maintenance, and space savings.
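
To make the mechanism concrete, below is a minimal sketch of the API-remoting technique that rCUDA implements: a client-side stub forwards a CUDA call to the node that actually hosts the accelerator. The wire opcode, message layout, and helper names are hypothetical; the real middleware covers the complete CUDA API over optimized transports.

    /* Conceptual sketch of CUDA API remoting, the technique behind rCUDA.
     * The opcode value, message layout, and helper names are hypothetical. */
    #include <stdint.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <unistd.h>

    enum { OP_MALLOC = 1 };  /* hypothetical wire opcode */

    /* Connect to the daemon running on the node that hosts the GPU. */
    int connect_gpu_server(const char *ip, uint16_t port)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in a = { 0 };
        a.sin_family = AF_INET;
        a.sin_port = htons(port);
        inet_pton(AF_INET, ip, &a.sin_addr);
        return connect(fd, (struct sockaddr *)&a, sizeof a) == 0 ? fd : -1;
    }

    /* Client-side stub: instead of touching a local GPU, forward the
     * allocation request and keep the handle the server returns. */
    uint64_t remote_cuda_malloc(int fd, uint64_t size)
    {
        uint8_t op = OP_MALLOC;
        uint64_t devptr = 0;
        write(fd, &op, sizeof op);
        write(fd, &size, sizeof size);     /* request */
        read(fd, &devptr, sizeof devptr);  /* remote device pointer */
        return devptr;
    }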


IEEE International Conference on High Performance Computing, Data, and Analytics | 2011

Enabling CUDA acceleration within virtual machines using rCUDA

José Duato; Antonio J. Peña; Federico Silla; Juan Carlos Fernández; Rafael Mayo; Enrique S. Quintana-Ortí

The hardware and software advances of Graphics Processing Units (GPUs) have favored the development of GPGPU (General-Purpose Computation on GPUs) and its adoption in many scientific, engineering, and industrial areas. Thus, GPUs are increasingly being introduced in high-performance computing systems as well as in datacenters. On the other hand, virtualization technologies are also receiving rising interest in these domains because of the acquisition and maintenance savings they bring. There are currently several works on GPU virtualization. However, there is no standard solution allowing access to GPGPU capabilities from virtual machine environments such as VMware, Xen, VirtualBox, or KVM, and this lack is delaying the integration of GPGPU into these domains. In this paper, we propose a first step towards a general and open-source approach for using GPGPU features within VMs. In particular, we describe the use of rCUDA, a GPGPU virtualization framework, to permit the execution of GPU-accelerated applications within virtual machines, thus enabling GPGPU capabilities in any virtualized environment. Our experiments with rCUDA in the context of KVM and VirtualBox on a system equipped with two NVIDIA GeForce 9800 GX2 cards illustrate the overhead introduced by the rCUDA middleware and prove the feasibility and scalability of this general virtualization solution. Experimental results show that the overhead is proportional to the dataset size, while the scalability is similar to that of the native environment.
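
The reported proportionality of the overhead to the dataset size can be probed with a plain copy micro-benchmark such as the sketch below: running it natively and then inside a VM under a remoting layer exposes a timing gap that grows with the buffer size. Buffer sizes and the loop bound are illustrative.

    /* Micro-benchmark sketch: time host-to-device copies of growing sizes.
     * Comparing native vs. virtualized/remoted runs exposes an overhead
     * that grows with the dataset size, as the paper reports. */
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        for (size_t mb = 1; mb <= 256; mb *= 2) {
            size_t bytes = mb << 20;
            void *h, *d;
            cudaMallocHost(&h, bytes);   /* pinned host buffer */
            cudaMalloc(&d, bytes);

            cudaEvent_t t0, t1;
            cudaEventCreate(&t0);
            cudaEventCreate(&t1);
            cudaEventRecord(t0);
            cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
            cudaEventRecord(t1);
            cudaEventSynchronize(t1);

            float ms = 0.0f;
            cudaEventElapsedTime(&ms, t0, t1);
            printf("%4zu MB: %8.3f ms (%6.2f GB/s)\n",
                   mb, ms, bytes / ms / 1e6);

            cudaEventDestroy(t0); cudaEventDestroy(t1);
            cudaFree(d); cudaFreeHost(h);
        }
        return 0;
    }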


Parallel Computing | 2014

A complete and efficient CUDA-sharing solution for HPC clusters

Antonio J. Peña; Federico Silla; Rafael Mayo; Enrique S. Quintana-Ortí; José Duato

Highlights: we detail a complete and up-to-date GPGPU remote virtualization solution; we describe efficient mechanisms for remote GPU data transfers; we detail an optimized InfiniBand implementation of rCUDA communications; and we provide an extensive experimental evaluation of remote GPU acceleration.

In this paper we detail the key features, architectural design, and implementation of rCUDA, an advanced framework to enable remote and transparent GPGPU acceleration in HPC clusters. rCUDA allows decoupling GPUs from nodes, forming pools of shared accelerators, which brings enhanced flexibility to cluster configurations. This opens the door to configurations with fewer accelerators than nodes, and also permits a single node to exploit the whole set of GPUs installed in the cluster. In our proposal, CUDA applications can seamlessly interact with any GPU in the cluster, independently of its physical location. Thus, GPUs can be either distributed among compute nodes or concentrated in dedicated GPGPU servers, depending on the cluster administrators' policy. This proposal leads to savings not only in space but also in energy, acquisition, and maintenance costs. The performance evaluation in this paper with a series of benchmarks and a production application clearly demonstrates the viability of this proposal. Concretely, experiments with the matrix-matrix product reveal excellent performance compared with regular executions on the local GPU; on a much more complex application, the GPU-accelerated LAMMPS, we attain up to 11x speedup employing 8 remote accelerators from a single node with respect to a 12-core CPU-only execution. GPGPU service interaction in compute nodes, remote acceleration in dedicated GPGPU servers, and data transfer performance of similar GPU virtualization frameworks are also evaluated.
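
The matrix-matrix product experiment can be reproduced in spirit with standard cuBLAS code; a key point of the proposal is that such code needs no source changes to run against a remote GPU. The matrix size below is illustrative.

    /* Sketch of the matrix-matrix product benchmark: plain cuBLAS SGEMM.
     * Under a remoting framework like rCUDA the same code can transparently
     * target a remote GPU; nothing here is rCUDA-specific. */
    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        const int n = 4096;  /* illustrative matrix dimension */
        const float alpha = 1.0f, beta = 0.0f;
        size_t bytes = (size_t)n * n * sizeof(float);

        float *A, *B, *C;
        cudaMalloc(&A, bytes);
        cudaMalloc(&B, bytes);
        cudaMalloc(&C, bytes);

        cublasHandle_t h;
        cublasCreate(&h);
        /* C = alpha*A*B + beta*C on whichever GPU the runtime resolves,
         * local or remote. */
        cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, A, n, B, n, &beta, C, n);
        cudaDeviceSynchronize();

        cublasDestroy(h);
        cudaFree(A); cudaFree(B); cudaFree(C);
        return 0;
    }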


International Conference on Parallel Processing | 2011

Performance of CUDA Virtualized Remote GPUs in High Performance Clusters

José Duato; Antonio J. Peña; Federico Silla; Rafael Mayo; Enrique S. Quintana-Ortí

In a previous work we presented the architecture of rCUDA, a middleware that enables CUDA remoting over a commodity network; that is, the middleware allows an application to use a CUDA-compatible graphics processor (GPU) installed in a remote computer as if it were installed in the computer where the application is being executed. This approach is based on the observation that GPUs in a cluster are not usually fully utilized, and it is intended to reduce the number of GPUs in the cluster, thus lowering the costs related to acquisition and maintenance while keeping performance close to that of the fully-equipped configuration. In this paper we model rCUDA over a series of high-throughput networks in order to assess the influence of the underlying network on the performance of our virtualization technique. For this purpose, we analyze the traces of two different case studies over two different networks. Using these data, we calculate the expected performance for the same case studies over a series of high-throughput networks, in order to characterize the expected behavior of our solution in high performance clusters. The estimations are validated using real 1 Gbps Ethernet and 40 Gbps InfiniBand networks, showing an error rate on the order of 1% for executions involving data transfers above 40 MB. In summary, although our virtualization technique noticeably increases execution time when using a 1 Gbps Ethernet network, it performs almost as efficiently as a local GPU when higher-performance interconnects are used. Therefore, the small overhead incurred by our proposal because of the remote use of GPUs is worth the savings that a cluster configuration with fewer GPUs than nodes provides.
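
The modeling approach can be approximated to first order as latency plus size over bandwidth; the sketch below applies such a model to the two validated networks. The latency and bandwidth figures are nominal, illustrative values, not the measured parameters from the paper.

    /* First-order sketch of the paper's modeling idea: estimate the time to
     * move a dataset over candidate interconnects as latency + size/bandwidth.
     * Link figures are nominal, illustrative values, not measured numbers. */
    #include <stdio.h>

    struct link { const char *name; double lat_us; double gbps; };

    int main(void)
    {
        const struct link nets[] = {
            { "1 Gbps Ethernet",    50.0,  1.0 },
            { "40 Gbps InfiniBand",  2.0, 40.0 },
        };
        const double mb = 40.0;  /* dataset size in MB */

        for (unsigned i = 0; i < sizeof nets / sizeof nets[0]; i++) {
            double sec = nets[i].lat_us * 1e-6
                       + mb * 8e6 / (nets[i].gbps * 1e9);
            printf("%-20s %8.3f ms for %.0f MB\n",
                   nets[i].name, sec * 1e3, mb);
        }
        return 0;
    }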


International Conference on Cluster Computing | 2013

Influence of InfiniBand FDR on the performance of remote GPU virtualization

Rafael Mayo; Enrique S. Quintana-Ortí; Federico Silla; José Duato; Antonio J. Peña

The use of GPUs to accelerate general-purpose scientific and engineering applications is mainstream today, but their adoption in current high-performance computing clusters is impaired primarily by acquisition costs and power consumption. Therefore, the benefits of sharing a reduced number of GPUs among all the nodes of a cluster can be remarkable for many applications. This approach, usually referred to as remote GPU virtualization, aims at reducing the number of GPUs present in a cluster while increasing their utilization rate. The performance of the interconnection network is key to achieving reasonable performance results by means of remote GPU virtualization. To this end, several networking technologies with throughput comparable to that of PCI Express have appeared recently. In this paper we analyze the influence of InfiniBand FDR on the performance of remote GPU virtualization, comparing its impact on a variety of GPU-accelerated applications with other networking technologies, such as InfiniBand QDR and Gigabit Ethernet. Given the severe limitations of freely available remote GPU virtualization solutions, the rCUDA framework is used as the case study for this analysis. Results show that the new FDR interconnect, featuring higher bandwidth than its predecessors, reduces the overhead of using GPUs remotely, thus making this approach even more appealing.
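
Part of FDR's advantage comes from its encoding: QDR signals at 40 Gb/s with 8b/10b encoding, while FDR signals at 56 Gb/s with 64b/66b, so the usable bandwidth gap is wider than the raw signaling rates suggest. A small sketch of the arithmetic:

    /* Sketch: effective data rates behind the FDR-vs-QDR comparison.
     * QDR: 40 Gb/s signaling with 8b/10b encoding.
     * FDR: 56 Gb/s signaling with 64b/66b encoding. */
    #include <stdio.h>

    int main(void)
    {
        double qdr = 40.0 * 8.0 / 10.0;   /* 32.0 Gb/s effective  */
        double fdr = 56.0 * 64.0 / 66.0;  /* ~54.3 Gb/s effective */
        printf("QDR effective: %.1f Gb/s\n", qdr);
        printf("FDR effective: %.1f Gb/s (%.0f%% more)\n",
               fdr, (fdr / qdr - 1.0) * 100.0);
        return 0;
    }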


IEEE International Conference on High Performance Computing, Data, and Analytics | 2012

CU2rCU: Towards the complete rCUDA remote GPU virtualization and sharing solution

Antonio J. Peña; Federico Silla; José Duato; Rafael Mayo; Enrique S. Quintana-Ortí

GPUs are being increasingly embraced by the high performance computing and computational science communities as an effective way of considerably reducing execution time by accelerating significant parts of their application codes. However, despite their extraordinary computing capabilities, the adoption of GPUs in current HPC clusters may present certain negative side-effects. In particular, to ease job scheduling in these platforms, a GPU is usually attached to every node of the cluster. Besides increasing acquisition costs, this means GPUs frequently remain idle, as applications usually do not fully utilize them. Idle GPUs nevertheless consume non-negligible amounts of energy, which translates into very poor energy efficiency during idle cycles. rCUDA was recently developed as a software solution to address these concerns. Specifically, it is a middleware that allows a reduced number of GPUs to be transparently shared among the nodes in a cluster, thus increasing the GPU utilization rate while easing job scheduling. While the initial prototype versions of rCUDA demonstrated its functionality, they also revealed several concerns related to usability and performance. With respect to usability, in this paper we present a new component of the rCUDA suite that automatically transforms any CUDA source code so that it can be effectively accommodated within this technology. Regarding performance, we briefly present some promising results, which will be analyzed in depth in future publications. The net outcome is a new version of rCUDA that allows any CUDA-compatible program to use remote GPUs in a cluster with minimum overhead.
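
The kind of rewrite CU2rCU automates can be illustrated with the kernel-launch syntax: nvcc's triple-chevron extension is replaced by an equivalent plain runtime API call that an ordinary compiler, and hence the rCUDA client library, can process. The sketch below targets the modern cudaLaunchKernel entry point purely for illustration; the tool's actual output conventions may differ.

    /* Sketch of a CU2rCU-style source transformation (illustrative target).
     * Before: CUDA-C extended launch syntax, understood only by nvcc:
     *     axpy<<<grid, block>>>(n, a, x, y);
     * After: an equivalent plain runtime API call, shown in launch_axpy. */
    #include <cuda_runtime.h>

    __global__ void axpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] += a * x[i];
    }

    void launch_axpy(int n, float a, const float *x, float *y)
    {
        dim3 block(256);
        dim3 grid((n + 255) / 256);
        /* Kernel arguments are marshalled explicitly instead of being
         * hidden behind the <<<...>>> syntactic sugar. */
        void *args[] = { &n, &a, (void *)&x, (void *)&y };
        cudaLaunchKernel((const void *)axpy, grid, block, args, 0, 0);
    }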


International Conference on Cluster Computing | 2014

Toward the efficient use of multiple explicitly managed memory subsystems

Antonio J. Peña; Pavan Balaji

The increasing number of memory technologies offering different features, such as optimized access patterns or capacity/speed ratios, leads us to advocate for future HPC compute nodes equipped with heterogeneous memory subsystems. The aim is to further alleviate the ever-increasing gap between computation and memory access speeds by taking advantage of the benefits these memory technologies provide. Compute nodes equipped with memory technologies such as scratchpad memory, on-chip 3D-stacked memory, or NVRAM-based memory are already a reality. Careful use of the different memory subsystems is mandatory in order to exploit the potential of such supercomputers. While most multiple-memory models concentrate on extending the depth of the memory hierarchy by incorporating more levels of hardware-managed memories, we advocate for compute nodes equipped with heterogeneous software-managed memory subsystems. Although the exact approach to exploiting them efficiently is still uncertain, a software ecosystem is clearly required to assist in an efficient data distribution. We address this problem at the memory-object granularity. In this paper we use an object-differentiated profiling tool we have developed on top of the Valgrind instrumentation framework to assess the most suitable memory subsystem for the different memory objects of two mini-applications from the Mantevo co-design project. Our results, considering two different memory configurations as use cases, reveal the potential benefits of carefully placing the different memory objects of an application among the different memory subsystems.
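
One concrete shape such a software ecosystem can take is an explicit per-object allocation interface, as in the memkind/hbwmalloc library used in the sketch below; which object deserves the fast memory would come from an object-differentiated profile like the one described above. The hot/cold split shown is illustrative, not taken from the paper.

    /* Sketch of object-granularity placement over heterogeneous memory,
     * using the memkind hbwmalloc interface as one possible ecosystem.
     * The choice of which object is "hot" is purely illustrative; in the
     * paper's workflow it would come from the Valgrind-based profiler. */
    #include <hbwmalloc.h>
    #include <stdlib.h>

    int main(void)
    {
        size_t n = 1 << 24;

        /* Profiling flagged this object as bandwidth-bound: place it in
         * the fast (e.g., on-package) memory subsystem. */
        double *hot = (double *)hbw_malloc(n * sizeof *hot);

        /* This object is touched rarely: ordinary DRAM is good enough. */
        double *cold = (double *)malloc(n * sizeof *cold);

        /* ... compute ... */

        hbw_free(hot);
        free(cold);
        return 0;
    }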


IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing | 2013

Evaluation of Inter- and Intra-node Data Transfer Efficiencies between GPU Devices and their Impact on Scalable Applications

Antonio J. Peña; Sadaf R. Alam

Data movement is of high relevance for GPU computing. Communication and performance efficiencies of applications and systems with GPU accelerators depend on on- and off-node data paths, thereby making tuning and optimization an increasingly complex task. In this paper we conduct an in-depth study to establish the parameters that influence the performance of data transfers between GPU devices located on the same node (on-node) and on separate nodes (off-node). We compare the most recent version of MVAPICH2, featuring seamless remote GPU transfers, with our own low-level benchmarks, and discuss the bottlenecks that may arise. Data path performance and bottlenecks between GPU devices are analyzed and compared for two substantially different systems: an IBM cluster with two GPU devices per node, relying on an InfiniBand QDR fabric, and a Cray XK6, featuring a single GPU per node and connected through a Gemini interconnect. Finally, we adapt LAMMPS, a GPU-accelerated application, to benefit from efficient inter-GPU data transfers, and validate our findings.
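
The low-level transfer paths under study can be exercised with a CUDA-aware MPI ping-pong like the sketch below, where device pointers are handed directly to MPI, as the evaluated MVAPICH2 version permits. Message size and iteration count are illustrative.

    /* Sketch of an inter-node GPU-to-GPU ping-pong using CUDA-aware MPI
     * (e.g., MVAPICH2): device buffers are passed straight to MPI calls.
     * Message size and iteration count are illustrative. */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        size_t bytes = 1 << 22;  /* 4 MB test message */
        char *dbuf;
        cudaMalloc(&dbuf, bytes);

        double t0 = MPI_Wtime();
        for (int i = 0; i < 100; i++) {
            if (rank == 0) {
                MPI_Send(dbuf, (int)bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(dbuf, (int)bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(dbuf, (int)bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(dbuf, (int)bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        if (rank == 0)
            printf("mean round trip: %.1f us\n",
                   (MPI_Wtime() - t0) / 100 * 1e6);

        cudaFree(dbuf);
        MPI_Finalize();
        return 0;
    }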


International Symposium on Performance Analysis of Systems and Software | 2017

Chai: Collaborative heterogeneous applications for integrated-architectures

Juan Gómez-Luna; Izzat El Hajj; Li-Wen Chang; Victor Garcia-Flores; Simon Garcia de Gonzalo; Thomas B. Jablin; Antonio J. Peña; Wen-mei W. Hwu

Heterogeneous system architectures are evolving towards tighter integration among devices, with emerging features such as shared virtual memory, memory coherence, and systemwide atomics. Languages, device architectures, system specifications, and applications are rapidly adapting to the challenges and opportunities of tightly integrated heterogeneous platforms. Programming languages such as OpenCL 2.0, CUDA 8.0, and C++ AMP allow programmers to exploit these architectures for productive collaboration between CPU and GPU threads. To evaluate these new architectures and programming languages, and to empower researchers to experiment with new ideas, a suite of benchmarks targeting these architectures with close CPU-GPU collaboration is needed. In this paper, we classify applications that target heterogeneous architectures into generic collaboration patterns including data partitioning, fine-grain task partitioning, and coarse-grain task partitioning. We present Chai, a new suite of 14 benchmarks that cover these patterns and exercise different features of heterogeneous architectures with varying intensity. Each benchmark in Chai has seven different implementations in different programming models such as OpenCL, C++ AMP, and CUDA, and with and without the use of the latest heterogeneous architecture features. We characterize the behavior of each benchmark with respect to varying input sizes and collaboration combinations, and evaluate the impact of using the emerging features of heterogeneous architectures on application performance.
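
The data-partitioning pattern can be sketched as a static split of one array between GPU and CPU workers, relying on the shared-virtual-memory features the suite targets. The 80/20 split and the toy kernel below are illustrative, and concurrent CPU/GPU access to the managed buffer is assumed to be supported by the device.

    /* Sketch of the data-partitioning collaboration pattern: one array
     * statically split between GPU and CPU workers. The split ratio and
     * the toy kernel are illustrative; x is assumed to be managed memory
     * on a device supporting concurrent CPU/GPU access. */
    #include <cuda_runtime.h>

    __global__ void scale_gpu(float *x, int n, float s)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= s;
    }

    void scale_all(float *x, int n, float s)
    {
        int n_gpu = n * 8 / 10;  /* GPU takes 80% of the data */

        /* GPU works on its partition asynchronously... */
        scale_gpu<<<(n_gpu + 255) / 256, 256>>>(x, n_gpu, s);

        /* ...while the CPU covers the remainder in parallel. */
        for (int i = n_gpu; i < n; i++)
            x[i] *= s;

        cudaDeviceSynchronize();  /* join both partitions */
    }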


Proceedings of the 20th European MPI Users' Group Meeting | 2013

Analysis of topology-dependent MPI performance on Gemini networks

Antonio J. Peña; Ralf G. Correa Carvalho; James Dinan; Pavan Balaji; Rajeev Thakur; William Gropp

Current HPC systems utilize a variety of interconnection networks with varying features and communication characteristics. MPI normalizes these interconnects with a common interface used by most HPC applications. However, network properties can have a significant impact on application performance. We explore the impact of the interconnect on application performance on the Blue Waters supercomputer. Blue Waters uses a three-dimensional Cray Gemini torus network, which provides twice as much bandwidth in the X and Z dimensions as in the Y dimension. Through several benchmarks, including a halo-exchange example, we demonstrate that application-level mapping to the network topology yields significant performance improvements.
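
The halo-exchange benchmark follows the standard Cartesian-communicator pattern sketched below, in which MPI is allowed to reorder ranks so the process grid can be matched to the physical torus. Grid shape and halo size are illustrative.

    /* Sketch of a halo exchange on a 3-D process grid, the kind of
     * communication whose mapping onto the Gemini torus the paper studies.
     * Grid shape and halo size are illustrative. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int dims[3] = { 0, 0, 0 }, periods[3] = { 1, 1, 1 };
        int nprocs;
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        MPI_Dims_create(nprocs, 3, dims);

        /* reorder=1 lets MPI renumber ranks so the logical grid can be
         * matched to the physical network topology. */
        MPI_Comm cart;
        MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &cart);

        double halo_out[1024] = { 0 }, halo_in[1024];
        for (int d = 0; d < 3; d++) {  /* exchange along each dimension */
            int lo, hi;
            MPI_Cart_shift(cart, d, 1, &lo, &hi);
            MPI_Sendrecv(halo_out, 1024, MPI_DOUBLE, hi, d,
                         halo_in, 1024, MPI_DOUBLE, lo, d,
                         cart, MPI_STATUS_IGNORE);
        }

        MPI_Comm_free(&cart);
        MPI_Finalize();
        return 0;
    }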

Collaboration


Dive into Antonio J. Peña's collaborations.

Top Co-Authors

Pavan Balaji (Argonne National Laboratory)
José Duato (Polytechnic University of Valencia)
Federico Silla (Polytechnic University of Valencia)
Jesús Labarta (Barcelona Supercomputing Center)
Vicenç Beltran (Barcelona Supercomputing Center)
Pedro Valero-Lara (Barcelona Supercomputing Center)
Raül Sirvent (Barcelona Supercomputing Center)
Xavier Martorell (Polytechnic University of Catalonia)