Yuri Torres
University of Valladolid
Publication
Featured research published by Yuri Torres.
international conference on high performance computing and simulation | 2013
Hector Ortega-Arranz; Yuri Torres; Diego R. Llanos; Arturo Gonzalez-Escribano
The Single-Source Shortest Path (SSSP) problem arises in many different fields. In this paper we present a GPU-based version of the Crauser et al. SSSP algorithm. Our work significantly speeds up the computation of the SSSP, not only with respect to the CPU-based version, but also with respect to another state-of-the-art GPU implementation based on Dijkstra's algorithm, due to Martin et al. Both GPU implementations have been evaluated on the latest Nvidia architecture (Kepler). Our experimental results show that the new GPU-Crauser algorithm leads to speed-ups from 13× to 220× with respect to the CPU version and a performance gain of up to 17% with respect to the GPU-Martin algorithm.
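As an illustration of the per-iteration work such a GPU-based Dijkstra variant performs, here is a minimal CUDA sketch (not the authors' code) of an edge-relaxation kernel over a CSR graph: each thread owns one vertex and, if that vertex belongs to the current frontier, relaxes its outgoing edges with atomic minimum updates. The distinctive part of Crauser et al.'s algorithm is the criterion that decides which vertices may safely enter that frontier in each iteration, which is not shown here.

```
// Minimal sketch: one relaxation step of a Dijkstra-like SSSP on a CSR graph,
// one thread per vertex. Frontier vertices relax their outgoing edges with
// atomicMin so concurrent updates to the same neighbour keep the best value.
__global__ void relax_frontier(const int *row_ptr, const int *col_idx,
                               const unsigned int *weight,
                               const int *in_frontier,
                               unsigned int *dist, int num_vertices)
{
    int u = blockIdx.x * blockDim.x + threadIdx.x;
    if (u >= num_vertices || !in_frontier[u]) return;

    unsigned int du = dist[u];
    for (int e = row_ptr[u]; e < row_ptr[u + 1]; ++e) {
        int v = col_idx[e];
        atomicMin(&dist[v], du + weight[e]);   // tentative distance through u
    }
}
```

The host side would launch this kernel in a loop, rebuilding the frontier after each step until no unsettled vertices remain.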
The Journal of Supercomputing | 2013
Yuri Torres; Arturo Gonzalez-Escribano; Diego R. Llanos
The choice of thread-block size and shape is one of the most important user decisions when a parallel problem is written for any CUDA architecture. The reason is that thread-block geometry has a significant impact on the global performance of the program. Unfortunately, the programmer has not enough information about the subtle interactions between this choice of parameters and the underlying hardware.This paper presents uBench, a complete suite of micro-benchmarks, in order to explore the impact on performance of (1) the thread-block geometry choice criteria, and (2) the GPU hardware resources and configurations. Each micro-benchmark has been designed to be as simple as possible to focus on a single effect derived from the hardware and thread-block parameter choice.As an example of the capabilities of this benchmark suite, this paper shows an experimental evaluation and comparison of Fermi and Kepler architectures. Our study reveals that, in spite of the new hardware details introduced by Kepler, the principles underlying the block geometry selection criteria are similar for both architectures.
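The sketch below shows the kind of experiment such a suite automates (hypothetical kernel and geometries, not uBench's actual code): the same simple kernel is launched with several block shapes of identical size and timed with CUDA events, exposing how geometry alone affects performance.

```
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical micro-kernel with a simple row-major copy access pattern.
__global__ void copy2d(const float *in, float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = in[y * width + x];
}

int main()
{
    const int W = 4096, H = 4096;
    float *in, *out;
    cudaMalloc(&in,  W * H * sizeof(float));
    cudaMalloc(&out, W * H * sizeof(float));

    // Warm-up launch so one-time overheads do not pollute the timings.
    copy2d<<<dim3(W / 256, H), dim3(256, 1)>>>(in, out, W, H);
    cudaDeviceSynchronize();

    // Candidate block geometries, all with 256 threads per block.
    dim3 shapes[] = { dim3(256, 1), dim3(32, 8), dim3(16, 16), dim3(8, 32) };

    for (dim3 b : shapes) {
        dim3 grid((W + b.x - 1) / b.x, (H + b.y - 1) / b.y);
        cudaEvent_t t0, t1;
        cudaEventCreate(&t0); cudaEventCreate(&t1);
        cudaEventRecord(t0);
        copy2d<<<grid, b>>>(in, out, W, H);
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);
        float ms;
        cudaEventElapsedTime(&ms, t0, t1);
        printf("block %3ux%-3u : %.3f ms\n", b.x, b.y, ms);
        cudaEventDestroy(t0); cudaEventDestroy(t1);
    }
    cudaFree(in); cudaFree(out);
    return 0;
}
```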
IEEE Transactions on Parallel and Distributed Systems | 2014
Arturo Gonzalez-Escribano; Yuri Torres; Javier Fresno; Diego R. Llanos
Automatic data distribution is a key feature to obtain efficient implementations from abstract and portable parallel codes. We present a highly efficient and extensible runtime library that integrates techniques for automatic data partition and mapping. It uses a novel approach to define an abstract interface and a plug-in system to encapsulate different types of regular and irregular techniques, helping to generate codes that are independent of the exact mapping functions selected. Currently, it supports hierarchical tiling of arrays with dense and strided domains, which allows the implementation of both data and task parallelism using an SPMD model. It automatically computes appropriate domain partitions for a selected virtual topology, mapping them to the available processors with static or dynamic load-balancing techniques. Our library also allows the construction of reusable communication patterns that efficiently exploit MPI communication capabilities. The use of our library greatly reduces the complexity of data distribution and communication, hiding the details of the underlying architecture. The library can also be used as an abstract layer for building generic tiling operations. Our experimental results show that using this library achieves performance similar to that of carefully implemented manual versions of several well-known parallel kernels and benchmarks on distributed and multicore systems, while substantially reducing programming effort.
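The simplest technique such a library encapsulates is a regular block partition of a dense domain over a virtual topology. The sketch below (hypothetical names, not the library's interface) shows that computation for a one-dimensional domain; the library generalizes this to multidimensional arrays, strided domains, and plug-in mapping policies.

```
// Minimal sketch: contiguous block partition of an N-element domain over
// P processes, with the remainder spread over the first tiles.
#include <cstdio>

struct Tile { long begin; long end; };   // half-open interval [begin, end)

Tile block_partition(long n, int nprocs, int rank)
{
    long base = n / nprocs, rem = n % nprocs;
    long begin = rank * base + (rank < rem ? rank : rem);
    long size  = base + (rank < rem ? 1 : 0);
    return { begin, begin + size };
}

int main()
{
    const long N = 1000;
    const int  P = 8;
    for (int r = 0; r < P; ++r) {
        Tile t = block_partition(N, P, r);
        printf("rank %d -> [%ld, %ld)\n", r, t.begin, t.end);
    }
    return 0;
}
```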
international symposium on parallel and distributed processing and applications | 2012
Yuri Torres; Arturo Gonzalez-Escribano; Diego R. Llanos
NVIDIA graphics processing units (GPUs) are playing an important role as general-purpose programming devices. Implementing parallel codes that exploit the GPU hardware architecture is a task for experienced programmers. The thread-block size and shape choice is one of the most important user decisions when a parallel problem is coded, since the thread-block configuration has a significant impact on the global performance of the program. While in the CUDA parallel programming model it is always necessary to specify the thread-block size and shape, the OpenCL standard also offers an automatic mechanism to make this delicate decision. In this paper we present a study of these criteria for the Fermi architecture, introducing a general approach for thread-block choice and showing that there is considerable room for improvement in OpenCL's automatic strategy.
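For illustration only: the paper studies OpenCL's automatic local-work-size mechanism, but CUDA toolkits released after this work expose an analogous "automatic" suggestion through the occupancy API. The sketch below contrasts that suggestion with an explicit, programmer-chosen geometry of the kind the paper's selection criteria are meant to guide; it is not the paper's method.

```
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel, used only as the target of the block-size query.
__global__ void saxpy(float a, const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main()
{
    // "Automatic" choice: ask the runtime for an occupancy-based block size,
    // analogous in spirit to letting OpenCL pick the local work size.
    int minGrid = 0, autoBlock = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGrid, &autoBlock, saxpy, 0, 0);

    // "Manual" choice: the programmer fixes the geometry explicitly.
    int manualBlock = 256;

    printf("occupancy-suggested block size: %d\n", autoBlock);
    printf("manually chosen block size    : %d\n", manualBlock);
    return 0;
}
```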
international conference on parallel processing | 2012
Yuri Torres; Arturo Gonzalez-Escribano; Diego R. Llanos
Programming models and techniques to exploit parallelism in accelerators, such as GPUs, are different from those used in traditional parallel models for shared- or distributed-memory systems. It is a challenge to blend different programming models to coordinate and exploit devices with very different characteristics and computational power. This paper presents a new extensible framework model to encapsulate run-time decisions related to data partition, granularity, load balance, synchronization, and communication for systems including assorted GPUs. The main parallel code thus becomes independent of them, using internal topology and system information to transparently adapt the computation to the system. The programmer can develop specific functions for each architecture, or use existing specialized library functions for different CPU-core or GPU architectures. The high-level coordination is expressed using a programming model built on top of message passing, providing portability across distributed- or shared-memory systems. We show with an example how to produce a parallel code that can efficiently run on systems ranging from a Beowulf cluster to a machine with mixed GPUs. Our experimental results show how the run-time system, guided by hints about the computational-power ratios of the different devices, can automatically partition and distribute large computations across heterogeneous systems, improving the overall performance.
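The following host-side sketch (hypothetical names, not the framework's interface) illustrates the kind of decision such a run-time makes from computational-power hints: a large iteration space is split across devices in proportion to their relative throughput.

```
#include <cstdio>
#include <vector>

// Relative-throughput hint for one device (CPU cores or a GPU).
struct Device { const char *name; double power; };

// Split n iterations proportionally to the power hints; the last device
// absorbs the rounding remainder so the shares sum exactly to n.
std::vector<long> split_by_power(long n, const std::vector<Device> &devs)
{
    double total = 0;
    for (const Device &d : devs) total += d.power;

    std::vector<long> share(devs.size());
    long assigned = 0;
    for (size_t i = 0; i + 1 < devs.size(); ++i) {
        share[i] = static_cast<long>(n * devs[i].power / total);
        assigned += share[i];
    }
    share.back() = n - assigned;
    return share;
}

int main()
{
    std::vector<Device> devs = { {"cpu-cores", 1.0}, {"gpu0", 6.0}, {"gpu1", 3.0} };
    std::vector<long> share = split_by_power(10000000, devs);
    for (size_t i = 0; i < devs.size(); ++i)
        printf("%-9s -> %ld iterations\n", devs[i].name, share[i]);
    return 0;
}
```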
International Journal of Parallel Programming | 2015
Hector Ortega-Arranz; Yuri Torres; Arturo Gonzalez-Escribano; Diego R. Llanos
During the last decade, parallel processing architectures have become a powerful tool to deal with massively parallel problems that require high-performance computing (HPC). The latest trend in HPC is the use of heterogeneous environments that combine different computational processing devices, such as CPU cores and graphics processing units (GPUs). Maximizing the performance of any GPU parallel implementation of an algorithm requires in-depth knowledge of the underlying GPU architecture, making it a tedious manual effort suited only to experienced programmers. In this paper, we present TuCCompi, a multi-layer abstract model that simplifies programming on heterogeneous systems with hardware accelerators by hiding the details of synchronization, deployment, and tuning. TuCCompi chooses optimal values for its configuration parameters using a kernel characterization provided by the programmer. This model is very useful for tackling problems characterized by independent, high computational-load tasks, such as embarrassingly parallel problems. We have evaluated TuCCompi in different, real-world, heterogeneous environments using the all-pair shortest-path problem as a case study.
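A minimal sketch (hypothetical code, not TuCCompi's API) of the workload class the model targets: the all-pair shortest-path problem decomposes into one independent SSSP task per source vertex, and workers simply pull tasks from a shared counter until none remain.

```
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

int main()
{
    const int num_sources = 1024;     // one independent SSSP task per source
    const int num_workers = 4;
    std::atomic<int> next_task{0};

    auto worker = [&](int id) {
        for (;;) {
            int s = next_task.fetch_add(1);
            if (s >= num_sources) break;
            // A real worker would run an SSSP from source s on its device;
            // here we only record which worker claimed which task.
            if (s % 256 == 0)
                printf("worker %d took source %d\n", id, s);
        }
    };

    std::vector<std::thread> pool;
    for (int w = 0; w < num_workers; ++w) pool.emplace_back(worker, w);
    for (std::thread &t : pool) t.join();
    return 0;
}
```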
international conference on high performance computing and simulation | 2011
Yuri Torres; Arturo Gonzalez-Escribano; Diego R. Llanos
International Journal of Parallel Programming | 2015
Hector Ortega-Arranz; Yuri Torres; Arturo Gonzalez-Escribano; Diego R. Llanos
The Journal of Supercomputing | 2014
Hector Ortega-Arranz; Yuri Torres; Arturo Gonzalez-Escribano; Diego R. Llanos
ieee international conference on high performance computing data and analytics | 2014
Hector Ortega-Arranz; Yuri Torres; Diego R. Llanos; Arturo Gonzalez-Escribano; Emmanuel Jeannot; Julius Žilinskas