Basilio B. Fraguela | Researchain

Publication

Featured research published by Basilio B. Fraguela.

ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2006

Programming for parallelism and locality with hierarchically tiled arrays

Ganesh Bikshandi; Jia Guo; Daniel Hoeflinger; Gheorghe Almasi; Basilio B. Fraguela; María Jesús Garzarán; David A. Padua; Christoph von Praun

Tiling has proven to be an effective mechanism to develop high performance implementations of algorithms. Tiling can be used to organize computations so that communication costs in parallel programs are reduced and locality in sequential codes or sequential components of parallel programs is enhanced. In this paper, a data type - Hierarchically Tiled Arrays or HTAs - that facilitates the direct manipulation of tiles is introduced. HTA operations are overloaded array operations. We argue that the implementation of HTAs in sequential OO languages transforms these languages into powerful tools for the development of high-performance parallel codes and codes with a high degree of locality. To support this claim, we discuss our experiences with the implementation of HTAs for MATLAB and C++ and the rewriting of the NAS benchmarks and a few other programs into HTA-based parallel form.

European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface | 2009

Performance Evaluation of MPI, UPC and OpenMP on Multicore Architectures

Damián A. Mallón; Guillermo L. Taboada; Carlos Teijeiro; Juan Touriño; Basilio B. Fraguela; Andrés Gómez; Ramón Doallo; J. Carlos Mouriño

The current trend to multicore architectures underscores the need for parallelism. While new languages and alternatives for supporting these systems more efficiently are proposed, MPI faces this new challenge. Therefore, up-to-date performance evaluations of current options for programming multicore systems are needed. This paper evaluates MPI performance against Unified Parallel C (UPC) and OpenMP on multicore architectures. From the analysis of the results, it can be concluded that MPI is generally the best choice on multicore systems with both shared and hybrid shared/distributed memory, as it takes the highest advantage of data locality, the key factor for performance in these systems. Regarding UPC, although it efficiently exploits the data layout in memory, it suffers from remote shared memory accesses, whereas OpenMP usually lacks efficient data locality support and is restricted to shared memory systems, which limits its scalability.

International Conference on Parallel Architectures and Compilation Techniques | 1999

Automatic analytical modeling for the estimation of cache misses

Basilio B. Fraguela; Ramón Doallo; Emilio L. Zapata

Caches play a very important role in the performance of modern computer systems due to the gap between the memory and the processor speed. Among the methods for studying their behaviour, the most widely used has been trace-driven simulation. Nevertheless, analytical modeling gives more information and requires smaller computation times that allow it to be used in the compilation step to drive automatic optimizations on the code. The traditional drawback of analytical modeling has been its limited precision and the lack of techniques to apply it systematically without user intervention. In this work we present a methodology to build analytical models for codes with regular access patterns. These models can be applied to caches with an arbitrary size, line size and associativity. Their validation through simulations using typical scientific code fragments has proved a good degree of accuracy.
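For contrast with the analytical approach, the following is a minimal sketch of the trace-driven simulation baseline the abstract mentions, not the authors' model. It counts misses for a set-associative cache with LRU replacement; the function name `count_misses` and the example trace are assumptions made purely for illustration.

```python
from collections import OrderedDict


def count_misses(addresses, cache_size, line_size, associativity):
    """Count misses of a set-associative LRU cache for a byte-address trace.

    Hypothetical helper for illustration only.
    """
    num_sets = cache_size // (line_size * associativity)
    # One ordered dict per set; insertion order tracks recency (oldest first).
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        line = addr // line_size
        cache_set = sets[line % num_sets]
        if line in cache_set:
            cache_set.move_to_end(line)  # hit: mark as most recently used
        else:
            misses += 1
            if len(cache_set) == associativity:
                cache_set.popitem(last=False)  # evict the least recently used line
            cache_set[line] = True
    return misses


# Two sequential sweeps over 1024 elements of 8 bytes each (8 KiB of data).
trace = [8 * i for i in range(1024)] * 2
misses_4k = count_misses(trace, cache_size=4096, line_size=64, associativity=2)
misses_8k = count_misses(trace, cache_size=8192, line_size=64, associativity=2)
```

A simulator like this must walk every access in the trace, which is the cost that an analytical model avoids by reasoning about the access pattern directly.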

International Symposium on Microarchitecture | 2009

Adaptive line placement with the set balancing cache

Dyer Rolán; Basilio B. Fraguela; Ramón Doallo

Efficient memory hierarchy design is critical due to the increasing gap between the speed of the processors and the memory. One of the sources of inefficiency in current caches is the non-uniform distribution of the memory accesses on the cache sets. Its consequence is that while some cache sets may have working sets that are far from fitting in them, other sets may be underutilized because their working set has fewer lines than the set. In this paper we present a technique that aims to balance the pressure on the cache sets by detecting when it may be beneficial to associate sets, displacing lines from stressed sets to underutilized ones. This new technique, called set balancing cache or SBC, achieved an average reduction of 13% in the miss rate of ten benchmarks from the SPEC CPU2006 suite, resulting in an average IPC improvement of 5%.
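The non-uniform set pressure described above can be illustrated with a small sketch. It does not implement the SBC mechanism; it only counts how many distinct lines map to each set and flags sets whose working set exceeds or falls short of the associativity. The names `set_pressure` and `classify_sets` are hypothetical.

```python
from collections import defaultdict


def set_pressure(addresses, num_sets, line_size):
    """Return, per cache set, the number of distinct lines that map to it."""
    lines_per_set = defaultdict(set)
    for addr in addresses:
        line = addr // line_size
        lines_per_set[line % num_sets].add(line)
    return [len(lines_per_set[s]) for s in range(num_sets)]


def classify_sets(pressure, associativity):
    """Split set indices into stressed (working set too large) and underused."""
    stressed = [s for s, p in enumerate(pressure) if p > associativity]
    underused = [s for s, p in enumerate(pressure) if p < associativity]
    return stressed, underused


# A strided trace whose stride equals num_sets * line_size: every access
# lands in set 0, leaving the other sets empty.
trace = [i * 4 * 64 for i in range(8)]
pressure = set_pressure(trace, num_sets=4, line_size=64)
stressed, underused = classify_sets(pressure, associativity=2)
```

In this toy trace one set carries the whole working set while three sets are idle, which is the kind of imbalance that pairing a stressed set with an underutilized one is meant to relieve.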

ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2008

Programming with tiles

Jia Guo; Ganesh Bikshandi; Basilio B. Fraguela; María Jesús Garzarán; David A. Padua

The importance of tiles or blocks in scientific computing cannot be overstated. Many algorithms, both iterative and recursive, can be expressed naturally if tiles are represented explicitly. From the point of view of performance, tiling, either as a code or a data layout transformation, is one of the most effective ways to exploit locality, which is a must to achieve good performance in current computers because of the significant difference in speed between processor and memory. Furthermore, tiles are also useful to express data distribution in parallel computations. However, despite the importance of tiles, most languages do not support them directly. This gives place to bloated programs populated with numerous subscript expressions which make the code difficult to read and coding mistakes more likely. This paper discusses Hierarchically Tiled Arrays (HTAs), a data type which facilitates the easy manipulation of tiles in object-oriented languages with emphasis on two new features, dynamic partitioning and overlapped tiling. These features facilitate the expression of locality and communication while maintaining the same performance of algorithms written using conventional languages.
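As a rough illustration of why explicit tiles reduce subscript arithmetic, the sketch below splits a matrix into a grid of tiles that can then be addressed by tile coordinates. This is not the HTA API; the helper `tile` is a hypothetical name, and the sketch assumes the matrix dimensions divide evenly by the tile size.

```python
def tile(matrix, tile_rows, tile_cols):
    """Split a 2D list into a grid of tiles; tiles[i][j] is itself a 2D list.

    Hypothetical helper, assuming dimensions are exact multiples of the tile size.
    """
    rows, cols = len(matrix), len(matrix[0])
    assert rows % tile_rows == 0 and cols % tile_cols == 0
    return [
        [
            [row[c:c + tile_cols] for row in matrix[r:r + tile_rows]]
            for c in range(0, cols, tile_cols)
        ]
        for r in range(0, rows, tile_rows)
    ]


# A 4x4 matrix holding 0..15 in row-major order, cut into 2x2 tiles.
matrix = [[4 * r + c for c in range(4)] for r in range(4)]
tiles = tile(matrix, 2, 2)
top_right = tiles[0][1]
bottom_left = tiles[1][0]
```

Once the tiles exist as first-class values, code can refer to `tiles[0][1]` instead of recomputing row and column offsets at every use.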

ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2003

Programming the FlexRAM parallel intelligent memory system

Basilio B. Fraguela; Jose Renau; Paul Feautrier; David A. Padua; Josep Torrellas

In an intelligent memory architecture, the main memory of a computer is enhanced with many simple processors. The result is a highly-parallel, heterogeneous machine that is able to exploit computation in the main memory. While several instantiations of this architecture have been proposed, the question of how to effectively program them with little effort has remained a major challenge. In this paper, we show how to effectively hand-program an intelligent memory architecture at a high level and with very modest effort. We use FlexRAM as a prototype architecture. To program it, we propose a family of high-level compiler directives inspired by OpenMP called CFlex. Such directives enable the processors in memory to execute the program in cooperation with the main processor. In addition, we propose libraries of highly-optimized functions called Intelligent Memory Operations (IMOs). These functions program the processors in memory through CFlex, but make them completely transparent to the programmer. Simulation results show that, with CFlex and IMOs, a server with 64 simple processors in memory runs on average 10 times faster than a conventional server. Moreover, a set of conventional programs with 240 lines on average are transformed into CFlex parallel form with only 7 CFlex directives and 2 additional statements on average.

Proceedings of the 7th Workshop on Languages, Compilers, and Run-Time Support for Scalable Systems | 2004

The Hierarchically Tiled Arrays programming approach

Basilio B. Fraguela; Jia Guo; Ganesh Bikshandi; María Jesús Garzarán; Gheorghe Almasi; José E. Moreira; David A. Padua

In this paper, we show our initial experience with a class of objects, called Hierarchically Tiled Arrays (HTAs), that encapsulate parallelism. HTAs allow the construction of single-threaded parallel programs where a master process distributes tasks to be executed by a collection of servers holding the components (tiles) of the HTAs. The tiled and recursive nature of HTAs facilitates the adaptation of the programs that use them to varying machine configurations, and eases the mapping of data and tasks to parallel computers with a hierarchical organization. We have implemented HTAs as a MATLAB™ toolbox, overloading conventional operators and array functions such that HTA operations appear to the programmer as extensions of MATLAB™. Our experiments show that the resulting environment is ideal for the prototyping of parallel algorithms and greatly improves the ease of development of parallel programs while providing reasonable performance.

IEEE Transactions on Computers | 2003

Probabilistic miss equations: evaluating memory hierarchy performance

Basilio B. Fraguela; Ramón Doallo; Emilio L. Zapata

The increasing gap between processor and main memory speeds makes the role of the memory hierarchy behavior in the system performance essential. Both hardware and software techniques to improve this behavior require good analysis tools that help predict and understand such behavior. Analytical modeling arises as a good choice in this field due to its high speed if its traditional limited precision is overcome. We present a modular analytical modeling strategy for arbitrary set-associative caches with LRU replacement policy. The model differs from all the previous related works in its probabilistic approach. Both perfectly and nonperfectly nested loops as well as reuse between different nests are considered by this model, so it makes the analysis of complete programs with regular computations feasible. Moreover, the model achieves good levels of accuracy while being extremely fast and flexible enough to allow its extension. Our approach has been extensively validated using well-known benchmarks. Finally, the model has also proven its ability to drive code optimizations even more successfully than current production compilers.

International Parallel and Distributed Processing Symposium | 2010

Servet: A benchmark suite for autotuning on multicore clusters

Jorge González-Domínguez; Guillermo L. Taboada; Basilio B. Fraguela; María J. Martín; Juan Touriño

The growing complexity in computer system hierarchies due to the increase in the number of cores per processor, levels of cache (some of them shared) and the number of processors per node, as well as the high-speed interconnects, demands the use of new optimization techniques and libraries that take advantage of their features. In this paper Servet, a suite of benchmarks focused on detecting a set of parameters with high influence in the overall performance of multicore systems, is presented. These benchmarks are able to detect the cache hierarchy, including their size and which caches are shared by each core, bandwidths and bottlenecks in memory accesses, as well as communication latencies among cores. These parameters can be used by auto-tuned codes to increase their performance in multicore clusters. Experimental results using different representative systems show that Servet provides very accurate estimates of the parameters of the machine architecture.

Concurrency and Computation: Practice and Experience | 2007

Automated and accurate cache behavior analysis for codes with irregular access patterns

Diego Andrade; Manuel Arenaz; Basilio B. Fraguela; Juan Touriño; Ramón Doallo

The memory hierarchy plays an essential role in the performance of current computers, so good analysis tools that help in predicting and understanding its behavior are required. Analytical modeling is the ideal base for such tools if its traditional limitations in accuracy and scope of application can be overcome. While there has been extensive research on the modeling of codes with regular access patterns, less attention has been paid to codes with irregular patterns due to the increased difficulty in analyzing them. Nevertheless, many important applications exhibit this kind of pattern, and their lack of locality makes them more cache‐demanding, which makes their study more relevant. The focus of this paper is the automation of the Probabilistic Miss Equations (PME) model, an analytical model of the cache behavior that provides fast and accurate predictions for codes with irregular access patterns. The information requirements of the PME model are defined and its integration in the XARK compiler, a research compiler oriented to automatic kernel recognition in scientific codes, is described. We show how to exploit the powerful information‐gathering capabilities provided by this compiler to allow the automated modeling of loop‐oriented scientific codes. Experimental results that validate the correctness of the automated PME model are also presented.