Agustin Fernández | Researchain

Publication

Featured researches published by Agustin Fernández.

International Symposium on Performance Analysis of Systems and Software | 2006

ATTILA: a cycle-level execution-driven simulator for modern GPU architectures

V.M. del Barrio; Carlos Gonzalez; Jordi Roca; Agustin Fernández; Roger Espasa

This work presents a cycle-level execution-driven simulator for modern GPU architectures. We discuss the simulation model used for our GPU simulator, based on the concept of boxes and signals, and the relation between the timing simulator and the functional emulator. The simulation model helps to increase accuracy and reduce the number of errors in the timing simulator while allowing easy extensibility of the simulated GPU architecture. We also introduce the OpenGL framework used to feed the simulator with traces from real applications (UT2004, Doom3) and a performance debugging tool (Signal Trace Visualizer). The ATTILA simulator supports the simulation of a whole range of GPU configurations and architectures, from the embedded segment to the high-end PC segment, supporting both the unified and non-unified shader architectural models.

Molecular Brain Research | 1994

Identification and characterization of serotonin 5-HT4 receptor binding sites in human brain: comparison with other mammalian species

T. Doménech; J. Beleta; Agustin Fernández; R.W. Gristwood; F.Cruz Sánchez; E. Tolosa; J.M. Palacios

Specific binding for the 5-HT4-selective radioligand [3H]GR 113808 has been identified in human and calf brain membranes. In human tissue, the distribution of the binding was heterogeneous across brain regions, being highest in the caudate nucleus. For this region a Kd value of 0.59 ± 0.08 nM and a Bmax of 225 ± 2.6 fmol/mg were obtained. Other regions with substantial densities were the lenticular nucleus, the substantia nigra, the hippocampus and the frontal cortex, whereas no binding could be detected in the cerebellum. The ability of several standard compounds to displace the radioligand was compatible with the labelling of 5-HT4 receptors. Correlation analysis showed no significant differences amongst data obtained for these compounds using human, calf and guinea-pig membranes.

International Symposium on Microarchitecture | 2005

Shader Performance Analysis on a Modern GPU Architecture

Victor Moya; Carlos Gonzalez; Jordi Roca; Agustin Fernández; Roger Espasa

This paper presents an analysis of the performance of the shader processing units in a modern graphics processing unit (GPU) architecture using real graphics applications. The architecture of a modern GPU is described, and a simulator and associated framework used to evaluate the architecture are introduced. The paper analyses the effects on performance of different configurations of the shader processing units and compares a classic GPU with a unified shader GPU. The evaluated unified shader architecture proves to be 15% to 30% more efficient, in terms of area, with a 2% to 7% improvement in performance when compared with a similar non-unified architecture.

ACM Transactions on Programming Languages and Systems | 2002

Register tiling in nonrectangular iteration spaces

Marta Jiménez; José M. Llabería; Agustin Fernández

Loop tiling is a well-known loop transformation generally used to expose coarse-grain parallelism and to exploit data reuse at the cache level. Tiling can also be used to exploit data reuse at the register level and to improve a program's ILP. However, previous proposals in the literature (as well as commercial compilers) are only able to perform multidimensional tiling for the register level when the iteration space is rectangular. In this article we present a new general algorithm to perform multidimensional tiling for the register level in both rectangular and nonrectangular iteration spaces. We also propose a simple heuristic to determine the tiling parameters at this level. Finally, we evaluate our method using as benchmarks typical linear algebra algorithms having nonrectangular iteration spaces and compare our proposal against hand-optimized vendor-supplied numerical libraries and against commercial compilers able to perform optimizing code transformations such as inner unrolling, unroll-and-jam, and software pipelining. Measurements were taken on three different superscalar microprocessors. Results show that our method outperforms the native compilers (with speedups of 2.5 on average) and matches the performance of vendor-supplied numerical libraries. The general conclusion is that compiler technology can make it possible for nonrectangular loop nests to achieve performance as high as that of hand-optimized codes.
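
As a rough illustration of what tiling for the register level means on a nonrectangular (triangular) iteration space, the sketch below computes a lower-triangular matrix-vector product two rows at a time. It is not the algorithm from the article: the function names, the tile size of 2, and the choice of kernel are assumptions made only for this example, and in Python the "registers" are just local variables. The point is the loop structure: each `x[j]` is read once and reused for two rows, and the extra iteration that exists only for the second row is handled outside the shared loop.

```python
def trmv_reference(A, x):
    """Plain triangular loop nest: y[i] = sum over j <= i of A[i][j] * x[j]."""
    n = len(x)
    y = [0.0] * n
    for i in range(n):
        for j in range(i + 1):
            y[i] += A[i][j] * x[j]
    return y


def trmv_register_tiled(A, x):
    """Same computation, with rows i and i+1 processed together (tile size 2).

    Hypothetical sketch: accumulators and x[j] are held in locals so that
    each x[j] is loaded once and reused for both rows.
    """
    n = len(x)
    y = [0.0] * n
    i = 0
    while i + 1 < n:
        acc0 = 0.0
        acc1 = 0.0
        # Columns 0..i are valid for both rows, so the body is a full tile.
        for j in range(i + 1):
            xj = x[j]
            acc0 += A[i][j] * xj
            acc1 += A[i + 1][j] * xj
        # Column i+1 exists only for row i+1 (the nonrectangular part).
        acc1 += A[i + 1][i + 1] * x[i + 1]
        y[i] = acc0
        y[i + 1] = acc1
        i += 2
    # Cleanup row when n is odd.
    if i < n:
        acc = 0.0
        for j in range(i + 1):
            acc += A[i][j] * x[j]
        y[i] = acc
    return y
```

In a compiled language the same restructuring lets the compiler keep `acc0`, `acc1` and `xj` in registers; the Python version only shows the shape of the transformed loop nest.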

IEEE Transactions on Parallel and Distributed Systems | 1995

Loop transformation using nonunimodular matrices

Agustin Fernández; José M. Llabería; Miguel Valero-García

Linear transformations are widely used to vectorize and parallelize loops. A subset of these transformations are unimodular transformations. When a unimodular transformation is used, the exact bounds of the transformed loop nest are easily computed and the steps of the loops are equal to 1. Unimodular loop transformations have been widely used since they permit the implementation of many useful loop transformations. Recently, nonunimodular transformations have been proposed to reduce communication requirements or to use the memory hierarchy efficiently. The methods used for unimodular transformations do not work in the case of nonunimodular transformations, since they do not produce the exact bounds of the transformed loop nest. In this paper, we present a method for nested loop transformation which gives the exact bounds for both unimodular and nonunimodular transformations. The basic idea is to use the Hermite Normal Form (HNF) of the transformation matrix.
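
A small worked example may help show why nonunimodular transformations need non-unit loop steps. The sketch below is an assumption-laden illustration, not the method from the paper: it takes the transformation T = [[1, 1], [1, -1]] (determinant -2) applied to a square N x N iteration space, with new indices u = i + j and v = i - j. The column Hermite Normal Form of this T, worked out by hand here, is H = [[1, 0], [1, 2]]; its diagonal (1, 2) gives the loop steps, and the off-diagonal entry says that v must have the same parity as u. The bounds on v were also derived by hand from 0 <= i, j <= N - 1. All function names are hypothetical.

```python
def transformed_image(n):
    """Image of the square 0 <= i, j < n under (u, v) = (i + j, i - j)."""
    return {(i + j, i - j) for i in range(n) for j in range(n)}


def scan_with_hnf_steps(n):
    """Scan the transformed space using the steps read off the HNF diagonal."""
    points = []
    top = 2 * (n - 1)
    for u in range(0, top + 1):          # step 1 = H[0][0]
        lo = max(-u, u - top)            # both candidates have the parity of u
        hi = min(u, top - u)
        for v in range(lo, hi + 1, 2):   # step 2 = H[1][1]
            points.append((u, v))
    return points


def scan_with_unit_steps(n):
    """Same bounds but step 1 everywhere, as a unimodular method would assume."""
    points = []
    top = 2 * (n - 1)
    for u in range(0, top + 1):
        for v in range(max(-u, u - top), min(u, top - u) + 1):
            points.append((u, v))
    return points
```

With the step of 2 the scan visits exactly the N * N transformed iterations; with unit steps it also visits points that correspond to no original iteration, which is the inexactness the abstract refers to.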

IEEE International Symposium on Workload Characterization | 2006

Workload Characterization of 3D Games

Jordi Roca; Victor Moya; Carlos Gonzalez; Chema Solis; Agustin Fernández; Roger Espasa

The rapid pace of change in 3D game technology makes workload characterization necessary for every game generation. Compared to CPU characterization, far less quantitative information about games is available. This paper focuses on analyzing a set of modern 3D games at the API call level and at the microarchitectural level using the Attila simulator. In addition to common geometry metrics, and in order to understand tradeoffs in modern GPUs, the microarchitectural-level metrics allow us to analyze key performance characteristics such as the balance between texture and ALU instructions in fragment programs, dynamic anisotropic ratios, and vertex, z-stencil, color and texture cache performance.

High Performance Embedded Architectures and Compilers | 2005

A single (unified) shader GPU microarchitecture for embedded systems

Victor Moya; Carlos Gonzalez; Jordi Roca; Agustin Fernández; Roger Espasa

We present and evaluate the TILA-rin GPU microarchitecture for embedded systems using the ATTILA GPU simulation framework. We use a trace from an execution of the Unreal Tournament 2004 PC game to evaluate and compare the performance of the proposed embedded GPU against a baseline GPU architecture for the PC. We evaluate the different elements that have been removed from the baseline GPU architecture to accommodate the architecture to the restricted power, bandwidth and area budgets of embedded systems. The unified shader architecture we present processes vertices, triangles and fragments in a single processing unit, saving space and reducing hardware complexity. The proposed embedded GPU architecture sustains 20 frames per second on the selected UT 2004 trace.

IEEE Transactions on Parallel and Distributed Systems | 2003

A cost-effective implementation of multilevel tiling

Marta Jiménez; José M. Llabería; Agustin Fernández

This paper presents a new cost-effective algorithm to compute exact loop bounds when multilevel tiling is applied to a loop nest having affine functions as bounds (nonrectangular loop nest). Traditionally, exact loop bounds computation has not been performed because its complexity is doubly exponential in the number of loops in the multilevel tiled code and, therefore, for certain classes of loops (i.e., nonrectangular loop nests), can be extremely time consuming. Although computation of exact loop bounds is not very important when tiling only for cache levels, it is critical when tiling includes the register level. This paper presents an efficient implementation of multilevel tiling that computes exact loop bounds and has a much lower complexity than conventional techniques. To achieve this lower complexity, our technique deals simultaneously with all levels to be tiled, rather than applying tiling level by level as is usually done. For loop nests having very simple affine functions as bounds, results show that our method is between 15 and 28 times faster than conventional techniques. For loop nests having less simple bounds, we have measured speedups as high as 2,300. Additionally, our technique allows eliminating redundant bounds efficiently. Results show that eliminating redundant bounds in our method is between 11 and 22 times faster than in conventional techniques for typical linear algebra programs.

International Conference on Supercomputing | 1998

A general algorithm for tiling the register level

Marta Jiménez; José M. Llabería; Agustin Fernández; Enric Morancho

Tiling is a well-known loop transformation that can be used to exploit data reuse at the register level and to improve a program's ILP. Previous work on tiling and also commercial compilers are able to perform tiling for the register level in more than one dimension when the iteration space is rectangular. However, they either cannot handle or can only handle limited cases of nonrectangular iteration spaces. Nonrectangular iteration spaces are commonly found in linear algebra algorithms or can arise as a result of applying previous transformations such as loop skewing. In this paper we present a new general algorithm to perform tiling for the register level in more than one dimension in both rectangular and nonrectangular iteration spaces. Our method uses index set splitting to distinguish loop nests that traverse boundary tiles of the tiled iteration space from loop nests that traverse nonboundary tiles. We evaluate our method using as benchmarks typical linear algebra algorithms having nonrectangular iteration spaces. Results measured on both ALPHA 21064 and MIPS R10000 machines show that our method achieves speedups in the range of 1.11 to 5.96 over commercial compilers and preprocessors able to perform optimizing code transformations.
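
The idea of separating boundary tiles from nonboundary tiles can be sketched on a triangular iteration space (j <= i, 0 <= i < n) covered by b x b tiles. The code below is only an illustration under assumed names and an assumed tiling scheme, not the algorithm from the paper. Tiles lying entirely inside the triangle get a loop nest with fixed bounds, the kind that can be fully unrolled for register reuse, while tiles cut by the diagonal or by the edge of the space get a loop nest with clamped bounds.

```python
def visit_triangle_split(n, b):
    """Visit all (i, j) with 0 <= j <= i < n, tile by tile.

    Returns the visited points plus counts of full and boundary tiles.
    Hypothetical sketch of index set splitting over b x b tiles.
    """
    visited = []
    full_tiles = 0
    boundary_tiles = 0
    for ii in range(0, n, b):
        # Tile column origins jj = 0, b, ..., ii can intersect the triangle.
        for jj in range(0, ii + b, b):
            inside = ii + b <= n and jj + b - 1 <= ii
            if inside:
                # Nonboundary tile: constant trip counts, no bound checks.
                full_tiles += 1
                for i in range(ii, ii + b):
                    for j in range(jj, jj + b):
                        visited.append((i, j))
            else:
                # Boundary tile: bounds clamped to the iteration space.
                boundary_tiles += 1
                for i in range(ii, min(ii + b, n)):
                    for j in range(jj, min(jj + b, i + 1)):
                        visited.append((i, j))
    return visited, full_tiles, boundary_tiles
```

For example, `visit_triangle_split(7, 2)` covers all 28 points of the triangle exactly once, using the fixed-bound loop nest for the tiles that do not touch the diagonal or the last, partial row of tiles.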

High-Performance Computer Architecture | 1998

Performance evaluation of tiling for the register level

Marta Jiménez; José M. Llabería; Agustin Fernández

Tiling is a well-known loop transformation that is mainly used to expose coarse-grain parallelism and to exploit data reuse at the cache level. However, it can also be used to exploit data reuse at the register level and to improve a program's ILP. Previous work on tiling and also commercial compilers are able to perform tiling for the register level in more than one dimension when the iteration space is rectangular. Non-rectangular iteration spaces are commonly found in linear algebra algorithms or can arise as a result of applying previous transformations such as loop skewing. In this paper we evaluate the technique presented in Jimenez et al. (1996), which is able to perform tiling for the register level in more than one dimension in both rectangular and non-rectangular iteration spaces. We use typical linear algebra algorithms having non-rectangular iteration spaces as benchmarks and compare our proposal against commercial preprocessors able to perform optimizing code transformations such as inner unrolling, outer unrolling and software pipelining. We also present quantitative data showing the benefits of tiling only for the register level, tiling only for the cache level and tiling for both levels simultaneously. Results measured on an ALPHA 21164 processor show that tiling for both cache and register levels improves upon commercial compilers and preprocessors by factors in the range of 1.3 to 6.3.
