Diego Andrade
University of A Coruña
Publication
Featured research published by Diego Andrade.
Concurrency and Computation: Practice and Experience | 2007
Diego Andrade; Manuel Arenaz; Basilio B. Fraguela; Juan Touriño; Ramón Doallo
The memory hierarchy plays an essential role in the performance of current computers, so good analysis tools that help in predicting and understanding its behavior are required. Analytical modeling is the ideal basis for such tools if its traditional limitations in accuracy and scope of application can be overcome. While there has been extensive research on the modeling of codes with regular access patterns, less attention has been paid to codes with irregular patterns because of the increased difficulty of analyzing them. Nevertheless, many important applications exhibit this kind of pattern, and their lack of locality makes them more cache-demanding, which makes their study all the more relevant. The focus of this paper is the automation of the Probabilistic Miss Equations (PME) model, an analytical model of cache behavior that provides fast and accurate predictions for codes with irregular access patterns. The information requirements of the PME model are defined, and its integration in the XARK compiler, a research compiler oriented to automatic kernel recognition in scientific codes, is described. We show how to exploit the powerful information-gathering capabilities provided by this compiler to enable the automated modeling of loop-oriented scientific codes. Experimental results that validate the correctness of the automated PME model are also presented.
parallel, distributed and network-based processing | 2009
Diego Andrade; Basilio B. Fraguela; James C. Brodman; David A. Padua
Multicore machines are becoming common. There are many languages, language extensions, and libraries devoted to improving the programmability and performance of these machines. In this paper we compare two libraries that face the problem of programming multicores from two different perspectives: task parallelism and data parallelism. The Intel Threading Building Blocks (TBB) library separates logical task patterns, which are easy to understand, from physical threads, and delegates the scheduling of the tasks to the system. On the other hand, Hierarchically Tiled Arrays (HTAs) are data structures that facilitate locality and parallelism in array-intensive computations with a block-recursive nature, following a data-parallel paradigm. Our comparison considers both the ease of programming and the performance obtained with both approaches. In our experience, HTA programs tend to be as short as or shorter than TBB programs, while the performance of both approaches is very similar.
ACM Transactions on Architecture and Code Optimization | 2007
Diego Andrade; Basilio B. Fraguela; Ramón Doallo
The performance of memory hierarchies, in which caches play an essential role, is critical in today's general-purpose and embedded computing systems because of the growing memory bottleneck problem. Unfortunately, cache behavior is very unstable and difficult to predict. This is particularly true in the presence of irregular access patterns, which exhibit little locality. Such patterns are very common, for example, in applications in which pointers or compressed sparse matrices give rise to indirections. Nevertheless, cache behavior in the presence of irregular access patterns has not been widely studied. In this paper we present an extension of a systematic analytical modeling technique based on PMEs (probabilistic miss equations), previously developed by the authors, that allows the automated analysis of cache behavior for codes with irregular access patterns resulting from indirections. The model generates very accurate predictions despite the irregularities and has very low computing requirements, making it the first model that combines these desirable characteristics with the ability to analyze this kind of code automatically. These properties enable the model to help drive compiler optimizations, as we show with an example.
parallel computing | 2013
Diego Andrade; Basilio B. Fraguela; Ramón Doallo
Multicores are the norm nowadays, and in many of them there are cores that share one or several levels of cache. The theoretical performance gain expected when several cores cooperate in the parallel execution of an application can be reduced in some cases by a cache access bottleneck, as the data they access can interfere in the shared cache levels. In other cases the performance gain can be increased thanks to a greater reuse of the data loaded into the cache. This paper presents an analytical model that can predict the behavior of shared caches when executing applications parallelized at the loop level. To the best of our knowledge, this is the first analytical model that tackles the behavior of multithreaded applications on realistic shared caches without requiring profiling. The experimental results show that the model's predictions are precise and obtained very quickly, and that the model can help a compiler or programmer choose the best parallelization strategy.
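The two opposing effects the abstract mentions can be illustrated with a back-of-the-envelope sketch (a deliberately crude footprint check, not the analytical model of the paper): data private to each thread multiplies with the thread count, while data all threads read is counted once, so the same loop may fit or thrash the shared cache depending on how many threads run it.

```python
# Crude illustration of shared-cache interference vs. reuse; all names and
# sizes below are made up for the example, not taken from the paper.
def shared_footprint(threads, private_bytes, shared_bytes):
    """Rough combined working set of a parallel loop on a shared cache."""
    return threads * private_bytes + shared_bytes

def fits_in_cache(threads, private_bytes, shared_bytes, cache_bytes):
    return shared_footprint(threads, private_bytes, shared_bytes) <= cache_bytes

# Eight threads streaming 1 MiB each plus a 2 MiB table shared by all, on an
# 8 MiB last-level cache: the 10 MiB combined footprint does not fit and the
# threads interfere; with two threads (4 MiB) the shared table is reused.
```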
Journal of Systems Architecture | 2006
Diego Andrade; Basilio B. Fraguela; Ramón Doallo
Several analytical models that predict the memory hierarchy behavior of codes with regular access patterns have been developed. These models help understand this behavior, and they can be used successfully to guide compilers in the application of locality-related optimizations while requiring small computing times. Still, these models suffer from many limitations. The most important of them is their restricted scope of applicability, since real codes exhibit many access patterns they cannot model. The most common source of such accesses is irregular access patterns arising from either data-dependent conditionals or indirections in the code. This paper extends the Probabilistic Miss Equations (PME) model to cope with codes that include data-dependent conditional structures as well. The approach is systematic enough to enable the automatic implementation of the extended model in a compiler framework. Validations show a good degree of accuracy in the predictions despite the irregularity of the access patterns. This opens the possibility of using our model to guide compiler optimizations for this kind of code.
international conference on conceptual structures | 2013
Jorge F. Fabeiro; Diego Andrade; Basilio B. Fraguela
Nowadays, computers include several computational devices with parallel capabilities, such as multicore processors and Graphics Processing Units (GPUs). OpenCL enables the programming of all these kinds of devices. An OpenCL program consists of a host code, which discovers the computational devices available in the host system and queues up commands to them, and a kernel code, which defines the core of the parallel computation executed on the devices. This work addresses two of the most important problems faced by an OpenCL programmer: (1) host codes are quite verbose, but they can be generated automatically if some parameters are known; (2) OpenCL codes that are hand-optimized for a given device do not necessarily achieve good performance on a different one. This paper presents a source-to-source iterative optimization tool, called OCLoptimizer, that aims to generate host codes automatically and to optimize OpenCL kernels, taking as inputs an annotated version of the original kernel and a configuration file. Iterative optimization is a well-known technique that optimizes a given code by exploring different configuration parameters in a systematic manner. For example, we can apply tiling on one loop, and the iterative optimizer will select the optimal tile size by exploring the space of possible tile sizes. The experimental results show that the tool can automatically optimize a set of OpenCL kernels for multicore processors.
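The tile-size example above can be sketched in miniature (a hypothetical stand-in: a plain Python blocked matrix multiply plays the role of the OpenCL kernel, and the candidate tile sizes are made up): time the kernel for each candidate and keep the fastest, which is the essence of iterative optimization.

```python
# Minimal sketch of iterative tile-size search; not OCLoptimizer itself.
import time

def blocked_matmul(a, b, n, tile):
    """Tiled n x n matrix multiply, standing in for a tunable kernel."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + tile, n)):
                            c[i][j] += aik * b[k][j]
    return c

def best_tile(n, candidates):
    """Try each candidate tile size and return the fastest one."""
    a = [[1.0] * n for _ in range(n)]
    b = [[2.0] * n for _ in range(n)]
    timings = {}
    for tile in candidates:
        start = time.perf_counter()
        blocked_matmul(a, b, n, tile)
        timings[tile] = time.perf_counter() - start
    return min(timings, key=timings.get)

tile = best_tile(64, candidates=[8, 16, 32, 64])
```

A real iterative optimizer explores many parameters jointly (tile sizes, unroll factors, work-group shapes), but the loop structure is the same: generate a variant, measure it, keep the best.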
parallel computing | 2016
Jorge F. Fabeiro; Diego Andrade; Basilio B. Fraguela
This paper presents a performance-portable matrix multiplication. The implementation has a set of parameters that can be tuned for each device, and these parameters are tuned using a genetic algorithm. Our approach generates matrix multiplications 74% faster than clBLAS while requiring 18% less autotuning time. There are several frameworks that, while providing functional portability of code across different platforms, do not automatically provide performance portability. As a consequence, programmers have to hand-tune the kernel codes for each device. The Heterogeneous Programming Library (HPL) is one of these libraries, but it has the interesting feature that the kernel codes, which implement the computation to be performed, are generated at run time. This run-time code generation (RTCG) capability can be used, in conjunction with generic parameterized algorithms, to write performance-portable codes. In this paper we explain how these techniques can be applied to a matrix multiplication algorithm. The performance of our implementation is compared to two state-of-the-art adaptive implementations, clBLAS and ViennaCL, on four different platforms, achieving average speedups with respect to them of 1.74 and 1.44, respectively.
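Genetic-algorithm autotuning of this kind can be sketched as follows (a toy illustration, not HPL's tuner: the two parameters, their ranges, and the synthetic cost function standing in for an actual kernel timing are all invented for the example):

```python
# Toy genetic-algorithm parameter search; the cost function is a synthetic
# stand-in for timing a generated kernel on a real device.
import random

def cost(params):
    tile, unroll = params
    return (tile - 32) ** 2 + (unroll - 4) ** 2  # pretend optimum: (32, 4)

def genetic_search(generations=30, pop_size=20, seed=0):
    rng = random.Random(seed)
    pop = [(rng.randrange(1, 65), rng.randrange(1, 9)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=cost)
        parents = pop[: pop_size // 2]             # selection: keep the fittest half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = (a[0], b[1])                    # crossover: one gene from each
            if rng.random() < 0.3:                  # mutation: perturb the tile gene
                child = (max(1, child[0] + rng.randrange(-4, 5)), child[1])
            children.append(child)
        pop = parents + children
    return min(pop, key=cost)

best = genetic_search()
```

Keeping the fittest half in every generation (elitism) guarantees the best candidate never gets worse, which matters when each fitness evaluation is an expensive kernel run.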
real time technology and applications symposium | 2009
Diego Andrade; Basilio B. Fraguela; Ramón Doallo
While caches are essential to reduce execution time and power consumption, they complicate the estimation of the Worst-Case Execution Time (WCET), which is crucial for many Real-Time Systems (RTS). Most research on static worst-case cache behavior prediction has focused on hard RTS, which need complete information on the access patterns and addresses of the data to guarantee that the predicted WCET is a safe upper bound of any execution time. Access patterns are available in codes whose access patterns reach a steady state after the first iteration of a loop (hereafter, regular codes). However, the addresses of the data are not always known at compile time for many reasons: stack variables, dynamically allocated memory, modules compiled separately, etc. Even when available, their usefulness for predicting cache behavior in systems with virtual memory decreases in the presence of physically indexed caches. In this paper we present a model that predicts a reasonable bound on the worst-case behavior of data caches during the execution of regular codes without information on the base addresses of the data structures. In 99.7% of our tests the number of misses remained below the bound predicted by the model. This makes the model a valuable tool, particularly for non-RTS and soft RTS, which tolerate a percentage of the runs exceeding their deadlines.
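Why the unknown base address forces a pessimistic bound can be seen in a small sketch (an illustration of the general idea, not the paper's model): for a sequential sweep over an array, the number of cache lines touched depends on where the array starts within a line, so a safe bound must take the maximum over all possible alignments.

```python
# Illustration only: worst-case lines touched by an n-byte sequential sweep
# when the base address (and thus the offset within a line) is unknown.
def worst_case_lines(n_bytes, line_bytes):
    """Maximum cache lines the sweep can touch, over all base alignments."""
    return max(((off + n_bytes - 1) // line_bytes) - (off // line_bytes) + 1
               for off in range(line_bytes))
```

For example, a 64-byte array on 64-byte lines touches one line when perfectly aligned but two lines when it straddles a boundary, so the safe bound is two. The actual run can only do better, which is why the measured misses stay below such a bound.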
software and compilers for embedded systems | 2003
Diego Andrade; Basilio B. Fraguela; Ramón Doallo
The increasing gap between the speed of the processor and that of the memory makes the role played by the memory hierarchy essential to system performance. There are several methods for studying this behavior. Trace-driven simulation has been the most widely used so far. Nevertheless, analytical modeling requires shorter computing times and provides more information. In recent years a series of fast and reliable strategies for the modeling of set-associative caches with an LRU replacement policy has been presented. However, none of them has considered the modeling of codes with data-dependent conditionals. In this article we present an extension of one of them in this direction.
Journal of Parallel and Distributed Computing | 2017
Moisés Viñas; Basilio B. Fraguela; Diego Andrade; Ramón Doallo
Heterogeneous devices require much more work from programmers than traditional CPUs, particularly when there are several of them, as each one has its own memory space. Multi-device applications require distributing kernel executions and, even worse, portions of arrays that must be kept coherent among the different device memories and the host memory. In addition, when devices with different characteristics participate in a computation, optimally distributing the work among them is not trivial. In this paper we extend an existing framework for the programming of accelerators, called the Heterogeneous Programming Library (HPL), with three kinds of improvements that facilitate these tasks. The first two are the ability to define subarrays and subkernels, which distribute kernels on different devices. The last one is a convenient extension of the subkernel mechanism that distributes computations among heterogeneous devices seeking the best work balance among them. This last contribution includes two analytical models that have proved to automatically provide very good work distributions. Our experiments also show the large programmability advantages of our approach and the negligible overhead incurred. Three approaches to developing multi-device heterogeneous applications are proposed. Easy, efficient, and coherent subarray usage for kernels and data movements is implemented. Simple argument annotations make it easy to split kernels and arrays among devices. Accurate automatic workload balancing is provided by means of a friendly API. The results are very promising both in terms of performance and programmability.
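The core idea behind such workload balancing can be sketched simply (a minimal illustration, not HPL's analytical models: the function name and the use of raw throughput numbers are assumptions for the example): split the iteration space among devices in proportion to their measured speeds, so faster devices finish their larger shares at about the same time as slower ones finish theirs.

```python
# Minimal speed-proportional work split; throughputs would come from
# profiling or a device model in a real system.
def distribute(n, throughputs):
    """Split n loop iterations among devices proportionally to throughput."""
    total = sum(throughputs)
    shares = [n * t // total for t in throughputs]
    # Integer rounding may leave a few iterations unassigned;
    # hand the remainder to the fastest device.
    shares[throughputs.index(max(throughputs))] += n - sum(shares)
    return shares
```

For instance, a GPU three times faster than a CPU gets three quarters of the iterations; the analytical models in the paper refine this idea with per-device characteristics.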