Diego R. Llanos
University of Valladolid
Publications
Featured research published by Diego R. Llanos.
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2003
Marcelo Cintra; Diego R. Llanos
With speculative parallelization, code sections that cannot be fully analyzed by the compiler are aggressively executed in parallel. Hardware schemes are fast but expensive and require modifications to the processors and memory system. Software schemes require no extra hardware but can be inefficient. This paper proposes a new software-only speculative parallelization scheme. The scheme is developed after a systematic evaluation of the design options available and is shown to be efficient and robust and to outperform previously proposed schemes. The novelty and performance advantage of the scheme stem from the use of carefully tuned data structures, synchronization policies, and scheduling mechanisms. Experimental results show that our scheme has small overheads and, for applications with few or no data dependence violations, realizes on average 71% of the speedup of a manually parallelized version of the code, outperforming two recently proposed software-only speculative parallelization schemes. For applications with many data dependence violations, our performance monitors and switches can effectively curb the performance degradation.
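The core mechanism can be pictured with a short host C++ sketch. It illustrates the general idea behind software-only speculative parallelization, not the authors' actual scheme: chunks of iterations run optimistically on buffered state, log their reads, and commit in order, squashing any chunk whose reads were overwritten by an earlier commit. All names (Chunk, run_speculative, try_commit) are ours, for illustration only.

    #include <utility>
    #include <vector>

    const int N = 16;
    int a[N];                 // shared array updated by the loop
    bool written[N];          // set when a committed chunk writes an element

    struct Chunk {
        int lo, hi;                                // iteration range
        std::vector<std::pair<int,int>> writes;    // buffered (index, value)
        std::vector<int> reads;                    // indices read speculatively
    };

    // Speculative execution: writes are buffered, reads are logged.
    void run_speculative(Chunk &c) {
        for (int i = c.lo; i < c.hi; ++i) {
            int src = (i + 3) % N;                 // cross-iteration dependence
            c.reads.push_back(src);                // spec_load(a[src])
            c.writes.push_back({i, a[src] + 1});   // spec_store(a[i], ...)
        }
    }

    // In-order commit: fail if an earlier chunk wrote anything we read.
    bool try_commit(Chunk &c) {
        for (int idx : c.reads)
            if (written[idx]) return false;        // violation: squash and retry
        for (auto &w : c.writes) {
            a[w.first] = w.second;
            written[w.first] = true;
        }
        return true;
    }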
IEEE Transactions on Parallel and Distributed Systems | 2005
Marcelo Cintra; Diego R. Llanos
With speculative parallelization, code sections that cannot be fully analyzed by the compiler are optimistically executed in parallel. Hardware schemes are fast but expensive and require modifications to the processors and/or memory system. Software schemes require no changes to the hardware of existing shared-memory systems, but can suffer from the significant overheads involved in speculative execution. In fact, the performance of software schemes is highly dependent on application characteristics, the design and implementation of the scheme, and the system configuration and size. This paper explores the design space of a recently proposed software speculative parallelization scheme. In the process, we gain insight into the most beneficial features of software schemes for speculative parallelization, as well as the most influential application characteristics. For instance, experimental results show that, contrary to intuition, checking for data dependence violations on every speculative store, as opposed to at commit time, leads to little performance degradation in the worst case and to significantly better performance with large configurations. Also, scheduling policies based on windows can perform very close to fully dynamic policies with a fraction of the memory overhead. Finally, experimental results show consistent speedups in the execution of loops that cannot be parallelized at compile time, both with and without RAW data dependences, for 4 to 32 processors.
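The store-time checking policy the paper evaluates can be sketched as follows. This is an illustrative host C++ fragment with invented names (spec_load, spec_store, readers, squashed), not the scheme's real data structures: per-element reader sets let a store detect, immediately, any later chunk that has already consumed a stale value, instead of deferring all checks to commit time.

    #include <bitset>

    const int NELEMS = 16, MAX_CHUNKS = 8;
    std::bitset<MAX_CHUNKS> readers[NELEMS];  // chunks that spec-read each element
    bool squashed[MAX_CHUNKS];

    int spec_load(int a[], int idx, int chunk) {
        readers[idx].set(chunk);              // remember this consumer
        return a[idx];
    }

    void spec_store(int a[], int idx, int val, int chunk) {
        a[idx] = val;
        // Eager check: any chunk later in commit order that already read
        // this element consumed a stale value, so squash it now rather
        // than waiting for commit time.
        for (int k = chunk + 1; k < MAX_CHUNKS; ++k)
            if (readers[idx].test(k)) squashed[k] = true;
    }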
International Conference on Management of Data | 2006
Diego R. Llanos
This paper presents TPCC-UVa, an open-source implementation of the TPC-C benchmark version 5 intended to be used to measure the performance of computer systems. TPCC-UVa is written entirely in C and uses the PostgreSQL database engine. This implementation includes all the functionalities described by the TPC-C standard specification for the measurement of both uni- and multiprocessor systems performance. The major characteristics of the TPC-C specification are discussed, together with a description of the TPCC-UVa implementation, architecture, and performance metrics obtained. As working examples, TPCC-UVa is used in this paper to measure the performance of different file systems under Linux, and to compare the relative performance of multi-core CPU technologies and their single-core counterparts.
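As an illustration of the building blocks such an implementation rests on, the following sketch shows how a fragment of a TPC-C-style transaction can be issued from C through libpq, the PostgreSQL client library. The table and column names follow the TPC-C schema, but the query and the connection string are simplified examples, not TPCC-UVa's actual code.

    #include <libpq-fe.h>
    #include <stdio.h>

    int main(void) {
        /* Hypothetical connection string for a local TPC-C database. */
        PGconn *conn = PQconnectdb("dbname=tpcc user=tpcc");
        if (PQstatus(conn) != CONNECTION_OK) {
            fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
            PQfinish(conn);
            return 1;
        }
        PQclear(PQexec(conn, "BEGIN"));
        /* One piece of the New-Order transaction: read and lock the
           district's next order id. */
        PGresult *res = PQexec(conn,
            "SELECT d_next_o_id FROM district "
            "WHERE d_w_id = 1 AND d_id = 1 FOR UPDATE");
        if (PQresultStatus(res) == PGRES_TUPLES_OK && PQntuples(res) == 1)
            printf("next order id: %s\n", PQgetvalue(res, 0, 0));
        PQclear(res);
        PQclear(PQexec(conn, "COMMIT"));
        PQfinish(conn);
        return 0;
    }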
International Conference on High Performance Computing and Simulation | 2013
Hector Ortega-Arranz; Yuri Torres; Diego R. Llanos; Arturo Gonzalez-Escribano
The Single-Source Shortest Path (SSSP) problem arises in many different fields. In this paper we present a GPU-based version of the Crauser et al. SSSP algorithm. Our work significantly speeds up the computation of the SSSP, not only with respect to the CPU-based version, but also with respect to another state-of-the-art GPU implementation of Dijkstra's algorithm, due to Martín et al. Both GPU implementations have been evaluated using the latest Nvidia architecture (Kepler). Our experimental results show that the new GPU-Crauser algorithm leads to speed-ups from 13× to 220× with respect to the CPU version and a performance gain of up to 17% with respect to the GPU-Martín algorithm.
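The per-step edge relaxation at the heart of such a GPU Dijkstra variant can be sketched as a CUDA kernel. This is an illustration in the spirit of the approach, not the authors' kernel: it assumes a CSR graph representation resident on the device, and leaves the Crauser settling criterion (which decides the frontier for each step) to separate code.

    // Each thread owns one vertex; frontier vertices relax their edges.
    __global__ void relax_kernel(const int *row_ptr, const int *col_idx,
                                 const int *weight, const char *frontier,
                                 int *dist, int n)
    {
        int u = blockIdx.x * blockDim.x + threadIdx.x;
        if (u >= n || !frontier[u]) return;
        for (int e = row_ptr[u]; e < row_ptr[u + 1]; ++e) {
            int v = col_idx[e];
            // atomicMin keeps updates race-free when several frontier
            // vertices relax the same neighbour in the same step.
            atomicMin(&dist[v], dist[u] + weight[e]);
        }
    }

    // Host side, one step: relax_kernel<<<(n + 255) / 256, 256>>>(...);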
The Journal of Supercomputing | 2013
Yuri Torres; Arturo Gonzalez-Escribano; Diego R. Llanos
The choice of thread-block size and shape is one of the most important user decisions when a parallel problem is written for any CUDA architecture, because thread-block geometry has a significant impact on the global performance of the program. Unfortunately, the programmer does not have enough information about the subtle interactions between this choice of parameters and the underlying hardware. This paper presents uBench, a complete suite of micro-benchmarks, to explore the impact on performance of (1) the thread-block geometry choice criteria, and (2) the GPU hardware resources and configurations. Each micro-benchmark has been designed to be as simple as possible, to focus on a single effect derived from the hardware and the thread-block parameter choice. As an example of the capabilities of this benchmark suite, this paper shows an experimental evaluation and comparison of the Fermi and Kepler architectures. Our study reveals that, in spite of the new hardware details introduced by Kepler, the principles underlying the block geometry selection criteria are similar for both architectures.
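A measurement of this kind can be pictured with a small sketch: launch the same trivial kernel under different block geometries and time each shape with CUDA events, so that only the block-shape effect varies. The kernel and problem sizes are illustrative, not taken from the suite.

    __global__ void copy_kernel(const float *in, float *out, int w) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        out[y * w + x] = in[y * w + x];
    }

    // Time one block shape; assumes w and h are multiples of the block dims.
    float time_shape(const float *d_in, float *d_out, int w, int h, dim3 block) {
        dim3 grid(w / block.x, h / block.y);
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        copy_kernel<<<grid, block>>>(d_in, d_out, w);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;   // compare, e.g., dim3(256, 1) against dim3(16, 16)
    }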
IEEE Transactions on Parallel and Distributed Systems | 2014
Arturo Gonzalez-Escribano; Yuri Torres; Javier Fresno; Diego R. Llanos
Automatic data distribution is a key feature for obtaining efficient implementations from abstract and portable parallel codes. We present a highly efficient and extensible runtime library that integrates techniques for automatic data partition and mapping. It uses a novel approach to define an abstract interface and a plug-in system that encapsulates different types of regular and irregular techniques, helping to generate codes that are independent of the exact mapping functions selected. Currently, it supports hierarchical tiling of arrays with dense and strided domains, which allows the implementation of both data and task parallelism using an SPMD model. It automatically computes appropriate domain partitions for a selected virtual topology, mapping them to the available processors with static or dynamic load-balancing techniques. Our library also allows the construction of reusable communication patterns that efficiently exploit MPI communication capabilities. The use of our library greatly reduces the complexity of data distribution and communication, hiding the details of the underlying architecture. The library can also be used as an abstract layer for building generic tiling operations. Our experimental results show that the library achieves performance similar to that of carefully implemented manual versions of several well-known parallel kernels and benchmarks on distributed and multicore systems, while substantially reducing programming effort.
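The simplest instance of what such a layer automates can be sketched in a few lines of host C++: a contiguous block partition of a dense 1-D domain over P processors, handling the remainder. The library described in the paper generalizes far beyond this, to hierarchical tiles, strided domains, and pluggable partition policies; the names here are ours.

    struct Block { int begin, end; };   // half-open range [begin, end)

    Block block_partition(int domain_size, int nprocs, int rank) {
        int base = domain_size / nprocs, rem = domain_size % nprocs;
        // The first `rem` ranks each receive one extra element.
        int begin = rank * base + (rank < rem ? rank : rem);
        int end   = begin + base + (rank < rem ? 1 : 0);
        return {begin, end};
    }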
International Symposium on Parallel and Distributed Processing and Applications | 2012
Yuri Torres; Arturo Gonzalez-Escribano; Diego R. Llanos
NVIDIA graphics processing units (GPUs) are playing an important role as general-purpose programming devices, but implementing parallel codes that exploit the GPU hardware architecture is a task for experienced programmers. The thread-block size and shape choice is one of the most important user decisions when a parallel problem is coded, since the thread-block configuration has a significant impact on the global performance of the program. While the CUDA parallel programming model always requires the programmer to specify the thread-block size and shape, the OpenCL standard also offers an automatic mechanism to make this delicate decision. In this paper we present a study of these criteria for the Fermi architecture, introducing a general approach for thread-block choice and showing that there is considerable room for improvement in OpenCL's automatic strategy.
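Reusing the copy_kernel sketch shown earlier, the effect under study can be illustrated with two launches that run the same 128 threads per block in different shapes. For a row-major 2-D array, the wide shape lets each warp read consecutive addresses (coalesced), while the tall shape scatters each warp across rows; the concrete shapes are illustrative, not recommendations from the paper.

    dim3 wide(32, 4), tall(4, 32);   // same thread count, different geometry
    copy_kernel<<<dim3(w / 32, h / 4),  wide>>>(in, out, w);  // coalesced warps
    copy_kernel<<<dim3(w / 4,  h / 32), tall>>>(in, out, w);  // strided warps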
The Journal of Supercomputing | 2011
Arturo Gonzalez-Escribano; Diego R. Llanos
Programming models of pure nested parallelism are appealing due to their ease of programming and good analysis and debugging properties. Although their simple synchronization structure is appropriate for representing abstract parallel algorithms, it does not take into account many implementation issues. In this work we present Trasgo, a programming system based on high-level, nested-parallel specifications. We show how it allows the programmer to easily express complex combinations of data and task parallelism with a common scheme, hiding the layout and scheduling details. The approach allows the development of a modular compiler in which automatic transformation techniques may exploit lower-level and more complex synchronization structures, overcoming the limitations of pure nested-parallel programming. This article presents an overview of the features of Trasgo and its architecture. We present some performance results using well-known parallel algorithms, and a roadmap of improvements and new features to be added to Trasgo.
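The synchronization structure that pure nested parallelism fixes can be pictured with a schematic host C++ fragment (this is not Trasgo's syntax): a computation forks subtasks that must all join before the parent continues, which is easy to reason about but rules out more irregular synchronization patterns.

    #include <future>

    // Sequential below a threshold, nested fork-join above it.
    long nested_sum(const int *a, int n) {
        if (n <= 1024) {
            long s = 0;
            for (int i = 0; i < n; ++i) s += a[i];
            return s;
        }
        // Fork a nested subtask; the join (get) is total and implicit in
        // the structure, which is what makes analysis and debugging easy.
        auto left = std::async(std::launch::async, nested_sum, a, n / 2);
        long right = nested_sum(a + n / 2, n - n / 2);
        return left.get() + right;
    }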
International Parallel and Distributed Processing Symposium | 2006
Diego R. Llanos; Belén Palop
This paper presents TPCC-UVa, an open-source implementation of the TPC-C benchmark intended to be used in parallel and distributed systems. TPCC-UVa is written entirely in C and uses the PostgreSQL database engine. This implementation includes all the functionalities described by the TPC-C standard specification for the measurement of both uni- and multiprocessor systems performance. The major characteristics of the TPC-C specification are discussed, together with a description of the TPCC-UVa implementation and architecture, and real examples of performance measurements.
IEEE Transactions on Computers | 2007
Diego R. Llanos; David Orden; Belén Palop
In this work, we address the problem of scheduling loops with dependences in the context of speculative parallelization. We show that the scheduling alternatives are highly influenced by the dependence violation pattern the code presents. We center our analysis on those algorithms where dependences are less likely to appear as the execution proceeds. In particular, we focus on randomized incremental algorithms, widely used as a much more efficient solution to many problems than their deterministic counterparts. These important algorithms are, in general, hard to parallelize by hand and represent a challenge for any automatic parallelization scheme. Our analysis led us to the development of MESETA, a new scheduling strategy that takes into account the probability of a dependence violation to determine the number of iterations being scheduled. MESETA is compared with existing techniques, including fixed-size chunking (FSC), the only scheduling alternative used so far in the context of speculative parallelization. Our experimental results show a 5.5 percent to 36.25 percent speedup improvement over FSC, leading to a better extraction of the parallelism inherent to randomized incremental algorithms. Moreover, when the cost of dependence violations is too high to obtain speedups, MESETA curbs the performance degradation.
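The intuition behind a MESETA-style schedule can be sketched with an illustrative chunk-size function (our formula, not the paper's): while violations are still likely, iterations are scheduled in small chunks so that squashes are cheap, and chunks grow as execution advances and the estimated violation probability drops, unlike fixed-size chunking.

    // Hypothetical chunk-size policy: grows linearly with committed progress.
    int next_chunk_size(int scheduled, int total, int min_sz, int max_sz) {
        // Fraction of the iteration space already committed.
        double progress = (double)scheduled / total;
        // Small chunks early (violations likely, squashes cheap),
        // large chunks late (violations unlikely, less overhead).
        return min_sz + (int)((max_sz - min_sz) * progress);
    }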