Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Daniel Jiménez-González is active.

Publication


Featured research published by Daniel Jiménez-González.


International Journal of Parallel Programming | 2010

Extending OpenMP to Survive the Heterogeneous Multi-Core Era

Eduard Ayguadé; Rosa M. Badia; Pieter Bellens; Daniel Cabrera; Alejandro Duran; Roger Ferrer; Marc Gonzàlez; Francisco D. Igual; Daniel Jiménez-González; Jesus Labarta; Luis Martinell; Xavier Martorell; Rafael Mayo; Josep M. Perez; Judit Planas; Enrique S. Quintana-Ortí

This paper advances the state of the art in programming models for exploiting task-level parallelism on heterogeneous many-core systems, presenting a number of extensions to the OpenMP language inspired by the StarSs programming model. The proposed extensions allow the programmer to easily write portable code for a number of different platforms, relieving them from developing the device-specific code to offload tasks to the accelerators and to synchronize those tasks. Our results from the StarSs instantiations for SMPs, the Cell, and GPUs show reasonable parallel performance. However, the real impact of our approach is the productivity gain it yields for the programmer.
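The extensions described above follow the StarSs style of annotating tasks with data-directionality and target-device information. The sketch below is a minimal illustration of that style under stated assumptions: the clause spellings (target device, copy_deps, inout with an array section) and the block_scale kernel are illustrative, not the paper's verbatim syntax, and a plain OpenMP 3.0 compiler will not accept them.

#define BS 256

void block_scale(float *block, float alpha);   /* hypothetical per-block kernel */

void scale_matrix(float *a, int nblocks, float alpha)
{
    for (int i = 0; i < nblocks; i++) {
        /* the runtime chooses the device, moves the data named in the
         * directionality clause, and synchronizes on the dependences */
        #pragma omp target device(smp, cuda) copy_deps
        #pragma omp task inout(a[i*BS;BS])     /* array section: start;length */
        block_scale(&a[i * BS], alpha);
    }
    #pragma omp taskwait    /* wait for all off-loaded tasks to complete */
}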


International Symposium on Performance Analysis of Systems and Software | 2007

Performance Analysis of Cell Broadband Engine for High Memory Bandwidth Applications

Daniel Jiménez-González; Xavier Martorell; Alex Ramirez

The Cell Broadband Engine (CBE) is designed to be a general-purpose platform offering enormous arithmetic performance: its eight SIMD-only Synergistic Processor Elements (SPEs) can achieve 134.4 GFLOPS (16.8 GFLOPS * 8) at 2.1 GHz, alongside a 64-bit Power Processor Element (PPE). Each SPE has a 256 KB non-coherent local memory and communicates with the other SPEs and with main memory through its DMA controller. CBE main memory is connected to all the CBE processor elements (PPE and SPEs) through the Element Interconnect Bus (EIB), which has a peak bandwidth of 134.4 GB/s at half the processor speed. The CBE platform is therefore well suited to applications based on MPI and streaming programming models, with a high potential performance peak. In this paper we focus on the communication part of those applications and measure the actual memory bandwidth that each of the CBE processor components can sustain. We have measured the sustained bandwidth between the PPE and memory, between an SPE and memory, between two individual SPEs (to determine whether this bandwidth depends on their physical location), between pairs of SPEs (to achieve maximum bandwidth under nearly ideal conditions), and around a cycle of SPEs representing a streaming style of computation. Our results on a real machine show that, following some strict programming rules, individual SPE-to-SPE communication almost achieves the peak bandwidth when the DMA controllers transfer memory chunks of at least 1024 bytes. In addition, SPE-to-memory bandwidth should be considered in streaming programming: for instance, implementing two data streams using 4 SPEs each can be more efficient than a single data stream using all 8 SPEs.
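To make the DMA chunk-size observation concrete, here is a minimal SPE-side sketch using the Cell SDK MFC intrinsics; the 1024-byte chunk follows the result above, while the function name, buffer layout and the way the effective address reaches the SPE are assumptions of the example.

#include <spu_mfcio.h>

#define CHUNK 1024   /* chunks of at least 1024 bytes approach peak DMA bandwidth */

static volatile char buf[CHUNK] __attribute__((aligned(128)));  /* local-store buffer */

void fetch_chunk(unsigned long long ea)   /* 'ea' is a main-memory effective address */
{
    const unsigned int tag = 0;
    mfc_get(buf, ea, CHUNK, tag, 0, 0);   /* start DMA: main memory -> local store */
    mfc_write_tag_mask(1 << tag);         /* select which DMA tag to wait on */
    mfc_read_tag_status_all();            /* block until the transfer completes */
    /* buf now holds CHUNK bytes copied from main memory */
}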


International Conference on Systems | 2009

OpenMP extensions for FPGA accelerators

Daniel Cabrera; Xavier Martorell; Georgi Gaydadjiev; Eduard Ayguadé; Daniel Jiménez-González

Reconfigurable computing is one of the paths to explore towards low-power supercomputing. However, programming these reconfigurable devices is not an easy task and still requires significant research and development effort to become really productive. In addition, using these devices as accelerators in multicore, SMP and ccNUMA architectures adds a further level of programming complexity, since the programmer must specify the offloading of tasks to the reconfigurable devices and their interoperability with current shared-memory programming paradigms such as OpenMP. This paper presents extensions to OpenMP 3.0 that address this second challenge, together with an implementation in a prototype runtime system. With these extensions the programmer can easily express the offloading of an already existing reconfigurable binary code (bitstream) while hiding all the complexity related to device configuration, bitstream loading, data arrangement and data movement to the device memory. Our current prototype implementation targets SGI Altix systems with RASC blades (based on the Virtex-4 FPGA). We analyze the overheads introduced in this implementation and propose a hybrid host/device operational mode that hides some of these overheads, significantly improving application performance. A complete evaluation of the system is done with a matrix multiplication kernel, including an estimate for different FPGA frequencies.
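A hedged sketch of the kind of directive-based off-load the paper argues for, using the matrix multiplication kernel mentioned above. The clause spellings (device(fpga), copy_in/copy_out) and the matmul_fpga entry point are illustrative assumptions rather than the proposed syntax; in the paper's setting, the called function would correspond to an already synthesized bitstream that the runtime configures and loads.

void matmul_fpga(const float *a, const float *b, float *c, int n);  /* backed by a pre-built bitstream */

void matmul(const float *a, const float *b, float *c, int n)
{
    /* the runtime configures the device, loads the bitstream and moves
     * the matrices to and from the device memory */
    #pragma omp target device(fpga) copy_in(a[0:n*n], b[0:n*n]) copy_out(c[0:n*n])
    #pragma omp task
    matmul_fpga(a, b, c, n);

    #pragma omp taskwait   /* block until the accelerated task finishes */
}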


IEEE Transactions on Parallel and Distributed Systems | 2014

Hybrid Dataflow/von-Neumann Architectures

Fahimeh Yazdanpanah; Carlos Alvarez-Martinez; Daniel Jiménez-González; Yoav Etsion

General-purpose hybrid dataflow/von-Neumann architectures are gaining traction as effective parallel platforms. Although implementations differ in how they merge the two conceptually different computational models, they all follow similar principles: they harness the parallelism and data synchronization inherent to the dataflow model, yet maintain the programmability of the von-Neumann model. In this paper, we classify hybrid dataflow/von-Neumann models according to two different taxonomies: one based on the execution model used for inter- and intra-block execution, and the other based on the level of integration of the control-flow and dataflow execution models. The paper reviews the basic concepts of the von-Neumann and dataflow computing models, highlights their inherent advantages and limitations, and motivates the exploration of a synergistic hybrid computing model. Finally, we compare a representative set of recent general-purpose hybrid dataflow/von-Neumann architectures, discuss their different approaches, and explore the evolution of these hybrid processors.


Parallel, Distributed and Network-Based Processing | 2003

CC-Radix: a cache conscious sorting based on Radix sort

Daniel Jiménez-González; Juan J. Navarro; Josep-Lluis Larriba-Pey

We focus on improving data locality for the in-core sequential Radix sort algorithm for 32-bit positive integer keys. We propose a new algorithm, Cache Conscious Radix sort (CC-Radix). CC-Radix improves data locality by dynamically partitioning the data set into subsets that fit in the L2 cache; once a subset fits in that cache level, it is sorted with Radix sort. In order to obtain the best implementations, we analyse the algorithms and derive the algorithmic parameters that minimize the number of misses in the L1 and L2 caches and in the TLB. We present results for a MIPS R10000-based computer, the SGI Origin 2000. They show that our algorithm is about 2 times faster than Quicksort and about 1.4 times faster than Explicit Block Transfer Radix sort, the previous fastest sorting algorithm to our knowledge.
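As a compact illustration of the CC-Radix idea (not the paper's implementation), the sketch below splits 32-bit keys by their most significant digit so each bucket is small enough to stay cache-resident, and then finishes each bucket with an in-cache LSD radix sort on the remaining digits. The 8-bit digit width and the helper names are choices made for the example; the paper derives the actual parameters from the cache and TLB geometry.

#include <stdint.h>
#include <string.h>

#define DIGIT_BITS 8
#define BUCKETS    (1 << DIGIT_BITS)

/* one stable counting-sort pass over the digit selected by 'shift' */
static void radix_pass(const uint32_t *src, uint32_t *dst, size_t n, int shift)
{
    size_t count[BUCKETS] = {0}, offset[BUCKETS];
    for (size_t i = 0; i < n; i++) count[(src[i] >> shift) & (BUCKETS - 1)]++;
    offset[0] = 0;
    for (int b = 1; b < BUCKETS; b++) offset[b] = offset[b - 1] + count[b - 1];
    for (size_t i = 0; i < n; i++)
        dst[offset[(src[i] >> shift) & (BUCKETS - 1)]++] = src[i];
}

void cc_radix_sort(uint32_t *keys, uint32_t *tmp, size_t n)
{
    /* partition by the top digit so each sub-array can fit in the L2 cache */
    radix_pass(keys, tmp, n, 24);

    /* sort each partition on the remaining digits while it is cache-resident */
    size_t start = 0;
    while (start < n) {
        size_t end = start;
        uint32_t top = tmp[start] >> 24;
        while (end < n && (tmp[end] >> 24) == top) end++;
        for (int shift = 0; shift < 24; shift += DIGIT_BITS) {
            radix_pass(tmp + start, keys + start, end - start, shift);
            memcpy(tmp + start, keys + start, (end - start) * sizeof(uint32_t));
        }
        start = end;
    }
    memcpy(keys, tmp, n * sizeof(uint32_t));   /* sorted result back into keys */
}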


Field Programmable Gate Arrays | 2014

OmpSs@Zynq all-programmable SoC ecosystem

Antonio Filgueras; Eduard Gil; Daniel Jiménez-González; Carlos Álvarez; Xavier Martorell; Jan Langer; Juanjo Noguera; Kees A. Vissers

OmpSs is an OpenMP-like, directive-based programming model that includes heterogeneous execution (MIC, GPU, SMP, etc.) and runtime management of task dependencies. Indeed, OmpSs has largely influenced the recently released OpenMP 4.0 specification. The Zynq All-Programmable SoC combines the features of an SMP and an FPGA, and can exploit DLP, ILP and TLP to take advantage of recent technology improvements and chip resource capacities. In this paper, we focus on programmability and heterogeneous execution support, presenting a successful combination of the OmpSs programming model and the Zynq All-Programmable SoC platform.
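For illustration, a minimal OmpSs-style annotation of a task that the runtime may place either on the Zynq FPGA fabric or on the ARM cores. The clause spelling follows the OmpSs style, but the snippet, including the vadd kernel, is an assumption of this example rather than code from the paper.

/* the runtime may execute the task on the FPGA or fall back to the SMP
 * (ARM) cores, and resolves the in/out dependences before and after it */
#pragma omp target device(fpga, smp) copy_deps
#pragma omp task in([n]a, [n]b) out([n]c)
void vadd(const int *a, const int *b, int *c, int n);

void run(const int *a, const int *b, int *c, int n)
{
    vadd(a, b, c, n);        /* spawned as a task by the annotated declaration */
    #pragma omp taskwait     /* wait for the accelerated task to finish */
}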


High Performance Embedded Architectures and Compilers | 2008

Drug design issues on the Cell BE

Harald Servat; Cecilia González-Álvarez; Xavier Aguilar; Daniel Cabrera-Benítez; Daniel Jiménez-González

Structure alignment prediction between proteins (protein docking) is crucial for drug design and a challenging problem for bioinformatics, pharmaceutics, and current and future processors, because it is a very time-consuming process. Here, we analyze a well-known protein docking application from the bioinformatics field, Fourier Transform Docking (FTDock), on a 3.2 GHz Cell Broadband Engine (BE) processor. FTDock is a geometric-complementarity approach to the protein docking problem and the baseline for several protein docking algorithms currently in use. In particular, we measure the performance impact of reducing, tuning and overlapping memory accesses, and the efficiency of different parallelization strategies (SIMD, MPI, OpenMP, etc.) when porting this biomedical application to the Cell BE. The results show the potential of the Cell BE processor for drug design applications, but also that there are important memory and computer architecture aspects that should be considered.


International Conference on Supercomputing | 2001

Fast parallel in-memory 64-bit sorting

Daniel Jiménez-González; Juan J. Navarro; Josep-L. Larriba-Pey

Parallel in-memory 64-bit sorting is an important problem in database management systems and in other applications such as Internet search engines and data mining tools. We propose a new algorithm that we call Parallel Counting Split Radix sort (PCS-Radix sort). The parallel stages of our algorithm increase data locality, balance the load across processors in the presence of data skew, and significantly reduce the amount of data communicated. The local stages of PCS-Radix sort operate only on the bits of the key that have not already been sorted during the parallel stages of the algorithm. Together, these improvements save a significant amount of computation and communication effort. Moreover, PCS-Radix sort adapts to any parallel computer by changing three simple algorithmic parameters. We have implemented the algorithm on a Cray T3E-900, and the results show that it is more than 2 times faster than the previous fastest 64-bit parallel sorting algorithm. PCS-Radix sort achieves a speed-up of more than 23 on 32 processors relative to the fastest sequential algorithm available to us.


International Conference on Conceptual Structures | 2013

Analysis of the Task Superscalar Architecture Hardware Design

Fahimeh Yazdanpanah; Daniel Jiménez-González; Carlos Alvarez-Martinez; Yoav Etsion; Rosa M. Badia

In this paper, we analyze the operational flow of two hardware implementations of the Task Superscalar architecture. The Task Superscalar is an experimental task-based dataflow scheduler that dynamically detects inter-task data dependencies, identifies task-level parallelism, and executes tasks out of order. We present a base implementation of the Task Superscalar architecture as well as a new design with improved performance. We study the behavior of both the base and the improved hardware designs when processing dependent and independent tasks, and compare the simulation results with those of the software runtime implementation.


International Conference on Supercomputing | 1999

Communication conscious radix sort

Daniel Jiménez-González; Josep-lluis Larriba-pey; Juan J. Navarro

The exploitation of data locality in parallel computers is paramount to reducing memory traffic and communication among processing nodes. We focus on the exploitation of locality by Parallel Radix sort. The original Parallel Radix sort has several communication steps, in which a sorting key may have to visit several processing nodes. In response to this, we propose a reorganization of Radix sort that leads to a highly local version of the algorithm at a very low cost. As a key feature, our algorithm performs only one communication step, forcing keys to move at most once between processing nodes. The algorithm also reduces the amount of data communicated and achieves a good load balance, which makes it insensitive to skewed data distributions. We call this new version of Parallel Radix sort, which combines locality and load balance, Communication and Cache Conscious Radix sort (C3-Radix sort). Our results on 16 processors of the SGI Origin 2000 show that C3-Radix sort reduces the execution time of the previous fastest version of Parallel Radix sort by a factor of 3 for data sets larger than 8M keys and by almost a factor of 2 for smaller data sets.
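A rough sketch of the single-communication-step structure behind C3-Radix, written here with standard MPI (an assumption; the paper's implementation is not necessarily MPI-based): each process counts its keys per destination, exchanges the counts, and then moves every key exactly once with one all-to-all. The dest() mapping is a simplification that ignores the skew handling described above, and the local radix sort of the received keys is omitted.

#include <mpi.h>
#include <stdint.h>
#include <stdlib.h>

/* map a key to its destination process by splitting the key range uniformly */
static int dest(uint32_t key, int nprocs)
{
    return (int)(((uint64_t)key * (uint64_t)nprocs) >> 32);
}

void exchange_keys(const uint32_t *keys, int n, uint32_t **recv, int *nrecv, MPI_Comm comm)
{
    int nprocs;
    MPI_Comm_size(comm, &nprocs);

    int *scnt = calloc(nprocs, sizeof(int));
    int *rcnt = malloc(nprocs * sizeof(int));
    int *sdsp = malloc(nprocs * sizeof(int));
    int *rdsp = malloc(nprocs * sizeof(int));

    for (int i = 0; i < n; i++)                     /* count keys per destination */
        scnt[dest(keys[i], nprocs)]++;
    MPI_Alltoall(scnt, 1, MPI_INT, rcnt, 1, MPI_INT, comm);   /* exchange counts */

    sdsp[0] = rdsp[0] = 0;
    for (int p = 1; p < nprocs; p++) {
        sdsp[p] = sdsp[p - 1] + scnt[p - 1];
        rdsp[p] = rdsp[p - 1] + rcnt[p - 1];
    }
    *nrecv = rdsp[nprocs - 1] + rcnt[nprocs - 1];
    *recv  = malloc((size_t)*nrecv * sizeof(uint32_t));

    /* bucket the keys into send order, then move each key exactly once */
    uint32_t *send = malloc((size_t)n * sizeof(uint32_t));
    int *fill = calloc(nprocs, sizeof(int));
    for (int i = 0; i < n; i++) {
        int p = dest(keys[i], nprocs);
        send[sdsp[p] + fill[p]++] = keys[i];
    }
    MPI_Alltoallv(send, scnt, sdsp, MPI_UNSIGNED,
                  *recv, rcnt, rdsp, MPI_UNSIGNED, comm);

    free(scnt); free(rcnt); free(sdsp); free(rdsp); free(send); free(fill);
}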

Collaboration


Dive into Daniel Jiménez-González's collaborations.

Top Co-Authors

Carlos Álvarez, Polytechnic University of Catalonia
Xavier Martorell, Polytechnic University of Catalonia
Eduard Ayguadé, Barcelona Supercomputing Center
Antonio Filgueras, Barcelona Supercomputing Center
Cecilia González-Álvarez, Polytechnic University of Catalonia
Jaume Bosch, Polytechnic University of Catalonia
Javier Bueno, Polytechnic University of Catalonia
Josep-Lluis Larriba-Pey, Polytechnic University of Catalonia
Juan J. Navarro, Polytechnic University of Catalonia