Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Nacho Navarro is active.

Publications


Featured research published by Nacho Navarro.


International Symposium on Microarchitecture | 2003

Beating in-order stalls with "flea-flicker" two-pass pipelining

Ronald D. Barnes; Erik M. Nystrom; John W. Sias; Sanjay J. Patel; Nacho Navarro; Wen-mei W. Hwu

Accommodating the uncertain latency of load instructions is one of the most vexing problems in in-order microarchitecture design and compiler development. Compilers can generate schedules with a high degree of instruction-level parallelism but cannot effectively accommodate unanticipated latencies; incorporating traditional out-of-order execution into the microarchitecture hides some of this latency but redundantly performs work done by the compiler and adds additional pipeline stages. Although effective techniques, such as prefetching and threading, have been proposed to deal with anticipable, long latency misses, the shorter, more diffuse stalls due to difficult-to-anticipate, first- or second-level misses are less easily hidden on in-order architectures. This paper addresses this problem by proposing a microarchitectural technique, referred to as two-pass pipelining, wherein the program executes on two in-order back-end pipelines coupled by a queue. The advance pipeline executes instructions greedily, without stalling on unanticipated latency dependences (executing independent instructions while otherwise blocking instructions are deferred). The backup pipeline allows concurrent resolution of instructions that were deferred in the other pipeline, resulting in the absorption of shorter misses and the overlap of longer ones. This paper argues that this design is both achievable and a good use of transistor resources and shows results indicating that it can deliver significant speedups for in-order processor designs.
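
To make the coupling concrete, here is a minimal sketch of the deferral mechanism. It is a single-issue toy model with invented types, not the paper's microarchitecture: the advance pass never stalls on an unready value, deferring dependent instructions to the queue that couples the two pipelines, and the backup pass re-executes them once the miss has resolved.

```cpp
// Toy model of two-pass pipelining (all types and the program are invented).
#include <cstdio>
#include <deque>
#include <vector>

struct Insn { int dst; int src; bool load_misses; };

int main() {
    // r1 = load (misses); r2 depends on r1; r3 is independent of the miss.
    std::vector<Insn> program = { {1, 0, true}, {2, 1, false}, {3, 0, false} };
    std::vector<bool> ready(8, true);
    std::deque<Insn> queue;                 // couples the two pipelines

    for (const Insn& in : program) {        // advance pass: greedy, no stalls
        if (in.load_misses || !ready[in.src]) {
            queue.push_back(in);            // defer instead of stalling
            ready[in.dst] = false;          // result not yet produced
        } else {
            ready[in.dst] = true;           // independent work proceeds
        }
    }
    for (const Insn& in : queue)            // backup pass: misses have
        ready[in.dst] = true;               // resolved, re-execute in order
    std::printf("advance pass deferred %zu instruction(s)\n", queue.size());
    return 0;
}
```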


Microprocessors and Microsystems | 2014

TERAFLUX: Harnessing dataflow in next generation teradevices

Roberto Giorgi; Rosa M. Badia; François Bodin; Albert Cohen; Paraskevas Evripidou; Paolo Faraboschi; Bernhard Fechner; Guang R. Gao; Arne Garbade; Rahulkumar Gayatri; Sylvain Girbal; Daniel Goodman; Behram Khan; Souad Koliai; Joshua Landwehr; Nhat Minh Lê; Feng Li; Mikel Luján; Avi Mendelson; Laurent Morin; Nacho Navarro; Tomasz Patejko; Antoniu Pop; Pedro Trancoso; Theo Ungerer; Ian Watson; Sebastian Weis; Stéphane Zuckerman; Mateo Valero

The improvements in semiconductor technologies are gradually enabling extreme-scale systems such as teradevices (i.e., chips composed of 1,000 billion transistors), most likely by 2020. Three major challenges have been identified: programmability, manageable architecture design, and reliability. TERAFLUX is a Future and Emerging Technology (FET) large-scale project funded by the European Union, which addresses all of these challenges at once by leveraging dataflow principles. This paper presents an overview of the research carried out by the TERAFLUX partners and some preliminary results. Our platform comprises 1000+ general-purpose cores per chip in order to properly explore the above challenges. An architectural template has been proposed and applications have been ported to the platform. Programming models, compilation tools, and reliability techniques have been developed. The evaluation is carried out by leveraging modifications of the HP-Labs COTSon simulator.


International Symposium on Computer Architecture | 2014

CODOMs: protecting software with code-centric memory domains

Lluis Vilanova; Muli Ben-Yehuda; Nacho Navarro; Yoav Etsion; Mateo Valero

Today's complex software systems are neither secure nor reliable. The rudimentary software protection primitives provided by current hardware force systems to run many distrusting software components (e.g., procedures, libraries, plugins, modules) in the same protection domain, or otherwise suffer degraded performance from address space switches. We present CODOMs (COde-centric memory DOMains), a novel architecture that can provide finer-grained isolation between software components with effectively zero run-time overhead, all at a fraction of the complexity of other approaches. An implementation of CODOMs in a cycle-accurate full-system x86 simulator demonstrates that with the right hardware support, finer-grained protection and run-time performance can peacefully coexist.
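
As a rough software analogy of the code-centric rule (not the paper's hardware design; all names here are invented), access rights follow the code page performing the access rather than the running thread, so calling into another component needs no protection-domain switch:

```cpp
// Software analogy of code-centric memory domains, not the CODOMs hardware.
#include <cstdint>
#include <map>

enum class Domain { Core, Plugin };

std::map<uintptr_t, Domain> code_owner;   // code page -> owning component
std::map<uintptr_t, Domain> data_owner;   // data page -> owning component

// Conceptual check: compare the domain of the *instruction's* code page
// against that of the data page being touched (cross-domain grants omitted).
bool access_allowed(uintptr_t pc_page, uintptr_t data_page) {
    auto c = code_owner.find(pc_page);
    auto d = data_owner.find(data_page);
    return c != code_owner.end() && d != data_owner.end() &&
           c->second == d->second;
}
```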


Field Programmable Logic and Applications | 2012

PPMC: Hardware scheduling and memory management support for multi accelerators

Tassadaq Hussain; Miquel Pericàs; Nacho Navarro; Eduard Ayguadé

A generic multi-accelerator system comprises a microprocessor unit that schedules the accelerators along with the necessary data movements. A system with the processor as control unit encounters multiple delays (memory and task management) which degrade overall system performance. This performance degradation demands an efficient memory manager and a high-speed scheduler that feeds prearranged data to the appropriate accelerator. In this work we propose the integration of an efficient scheduler and an intelligent memory manager into an existing core known as PPMC (Programmable Pattern-based Memory Controller), such that data movement and computational tasks can be handled proficiently. Consequently, the modified PPMC system improves performance by managing data movements and address generation in hardware and scheduling accelerators without the intervention of a control processor or an operating system. The PPMC system is evaluated with six memory-intensive accelerators: Laplacian solver, FIR, FFT, thresholding, matrix multiplication, and 3D stencil. The modified PPMC system is implemented and tested on a Xilinx ML505 evaluation FPGA board. The performance of the system is compared with a microprocessor-based system integrated with the Xilkernel operating system. Results show that the modified PPMC-based multi-accelerator system consumes 50% fewer hardware resources and 32% less on-chip power, and achieves approximately a 27× speed-up compared to the MicroBlaze-based system.
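
To make the "pattern-based" part concrete, here is a small sketch (descriptor fields invented, not PPMC's actual register layout) of how a controller could expand an access-pattern descriptor into the address stream it feeds an accelerator, with no control processor in the loop:

```cpp
// Sketch of pattern-based address generation: the descriptor, not a CPU,
// determines which addresses are streamed to the accelerator.
#include <cstdint>
#include <vector>

struct TaskDesc {
    int      accel_id;   // target accelerator (FIR, FFT, stencil, ...)
    uint64_t base;       // first address of the pattern
    uint64_t stride;     // distance between consecutive elements
    uint64_t count;      // number of elements to fetch
};

std::vector<uint64_t> expand_pattern(const TaskDesc& t) {
    std::vector<uint64_t> addrs;
    addrs.reserve(t.count);
    for (uint64_t i = 0; i < t.count; ++i)
        addrs.push_back(t.base + i * t.stride);  // strided access in hardware
    return addrs;
}
```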


International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation | 2015

The AXIOM project (Agile, eXtensible, fast I/O Module)

Dimitris Theodoropoulos; Dionisios N. Pnevmatikatos; Carlos Álvarez; Eduard Ayguadé; Javier Bueno; Antonio Filgueras; Daniel Jiménez-González; Xavier Martorell; Nacho Navarro; Carlos Segura; Carles Fernández; David Oro; Javier R. Saeta; Paolo Gai; Antonio Rizzo; Roberto Giorgi

The AXIOM project (Agile, eXtensible, fast I/O Module) aims at researching new software/hardware architectures for the future Cyber-Physical Systems (CPSs). These systems are expected to react in real-time, provide enough computational power for the assigned tasks, consume the least possible energy for such task (energy efficiency), scale up through modularity, allow for an easy programmability across performance scaling, and exploit at best existing standards at minimal costs.


International Conference on Parallel Architectures and Compilation Techniques | 2014

Automatic execution of single-GPU computations across multiple GPUs

Javier Cabezas; Lluis Vilanova; Isaac Gelado; Thomas B. Jablin; Nacho Navarro; Wen-mei W. Hwu

We present AMGE, a programming framework and runtime system to decompose data and GPU kernels and execute them on multiple GPUs concurrently. AMGE exploits the remote memory access capability of recent GPUs to guarantee data accessibility regardless of its physical location, thus allowing AMGE to safely decompose and distribute arrays across GPU memories. AMGE also includes a compiler analysis to detect array access patterns in GPU kernels. The runtime uses this information to automatically choose the best computation and data distribution configuration. Through effective use of GPU caches, AMGE achieves good scalability in spite of the limited interconnect bandwidth between GPUs. Results show 1.95× and 3.73× execution speedups for 2 and 4 GPUs for a wide range of dense computations compared to the original versions on a single GPU.
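
The core mechanism, peer memory access letting a single-GPU kernel run unchanged on one partition per device, can be sketched in a few lines of CUDA. The kernel, sizes, and even-split policy below are invented for illustration; AMGE itself chooses the distribution automatically:

```cpp
// Illustrative CUDA sketch: split one array across two GPUs and run the
// unchanged single-GPU kernel on each half. Peer access keeps elements on
// the other GPU reachable if a partition reads across its boundary.
#include <cuda_runtime.h>

__global__ void scale(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20, half = n / 2;
    float* part[2];
    for (int d = 0; d < 2; ++d) {
        cudaSetDevice(d);
        cudaDeviceEnablePeerAccess(1 - d, 0);   // remote memory access
        cudaMalloc(&part[d], half * sizeof(float));
    }
    for (int d = 0; d < 2; ++d) {
        cudaSetDevice(d);                       // same kernel, own partition
        scale<<<(half + 255) / 256, 256>>>(part[d], half);
    }
    for (int d = 0; d < 2; ++d) { cudaSetDevice(d); cudaDeviceSynchronize(); }
    return 0;
}
```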


IEEE Transactions on Parallel and Distributed Systems | 2015

Runtime and Architecture Support for Efficient Data Exchange in Multi-Accelerator Applications

Javier Cabezas; Isaac Gelado; John E. Stone; Nacho Navarro; David B. Kirk; Wen-mei W. Hwu

Heterogeneous parallel computing applications often process large data sets that require multiple GPUs to jointly meet their needs for physical memory capacity and compute throughput. However, the lack of high-level abstractions in previous heterogeneous parallel programming models forces programmers to resort to multiple code versions, complex data copy steps, and synchronization schemes when exchanging data between multiple GPU devices, which results in high software development cost, poor maintainability, and even poor performance. This paper describes the HPE runtime system, and the associated architecture support, which enables a simple, efficient programming interface for exchanging data between multiple GPUs through either interconnects or cross-node network interfaces. The runtime and architecture support presented in this paper can also be used to support other types of accelerators. We show that the simplified programming interface reduces programming complexity. The research presented in this paper started in 2009. It has been implemented and tested extensively in several generations of HPE runtime systems, and adopted into NVIDIA GPU hardware and drivers for CUDA 4.0 and beyond since 2011. The availability of real hardware that supports key HPE features gives rise to a rare opportunity for studying the effectiveness of the hardware support by running important benchmarks on real runtime and hardware. Experimental results show that in an exemplar heterogeneous system, peer DMA and double-buffering, pinned buffers, and software techniques can improve the inter-accelerator data communication bandwidth by 2×. They can also improve the execution speed by 1.6× for a 3D finite difference, 2.5× for 1D FFT, and 1.6× for merge sort, all measured on real hardware. The proposed architecture support enables the HPE runtime to transparently deploy these optimizations under simple portable user code, allowing system designers to freely employ devices of different capabilities. We further argue that simple interfaces such as HPE are needed for most applications to benefit from advanced hardware features in practice.
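
Two of the measured techniques, peer DMA and double-buffering through pinned host memory, can be sketched as follows. This is a simplified two-chunk version with invented names, not the HPE API, which applies such optimizations transparently:

```cpp
// Illustrative CUDA sketch of peer DMA vs. double-buffered staging.
#include <cstddef>
#include <cuda_runtime.h>

void exchange(float* dst_on_gpu1, const float* src_on_gpu0,
              size_t bytes, bool have_peer) {
    if (have_peer) {
        // Peer DMA: one direct GPU0 -> GPU1 transfer, no host staging.
        cudaMemcpyPeer(dst_on_gpu1, 1, src_on_gpu0, 0, bytes);
        return;
    }
    // Fallback: double-buffer through pinned host memory. Each chunk flows
    // GPU0 -> pinned buffer -> GPU1 on its own stream, so the download of
    // one chunk overlaps the upload of the other.
    const size_t chunk = bytes / 2;
    float* pinned[2];
    cudaStream_t stream[2];
    for (int k = 0; k < 2; ++k) {
        cudaMallocHost(&pinned[k], chunk);   // pinned => truly asynchronous
        cudaStreamCreate(&stream[k]);
    }
    for (int k = 0; k < 2; ++k) {
        const char* src = reinterpret_cast<const char*>(src_on_gpu0) + k * chunk;
        char* dst = reinterpret_cast<char*>(dst_on_gpu1) + k * chunk;
        cudaMemcpyAsync(pinned[k], src, chunk, cudaMemcpyDeviceToHost, stream[k]);
        cudaMemcpyAsync(dst, pinned[k], chunk, cudaMemcpyHostToDevice, stream[k]);
    }
    for (int k = 0; k < 2; ++k) {
        cudaStreamSynchronize(stream[k]);
        cudaFreeHost(pinned[k]);
        cudaStreamDestroy(stream[k]);
    }
}
```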


International Conference on Transparent Optical Networks | 2015

iONE: An environment for experimentally assessing in-operation network planning algorithms

Lluis Gifre; Nacho Navarro; Adrià Asensio; Marc Ruiz; Luis Velasco

A huge amount of algorithmic research is being done in the field of optical networks, including Routing and Spectrum Allocation (RSA), elastic operations, spectrum defragmentation, and other re-optimization algorithms. Frequently, those algorithms are developed and executed in simulated environments, where many assumptions are made about network control and management issues. Those issues are relevant, since they might prevent algorithms from being deployed in real scenarios. To completely validate network-related algorithms, we developed an extensible control and management plane test-bed, named iONE, for single-layer and multilayer flexgrid-based optical networks. iONE is based on the Applications-Based Network Operations (ABNO) architecture currently under standardization by the IETF. iONE enables deploying and assessing the designed algorithms by defining workflows. This paper presents the iONE test-bed architecture, describes its components, and experimentally demonstrates its operation with a specific use case.


Measurement and Modeling of Computer Systems | 2012

POTRA: a framework for building power models for next generation multicore architectures

Ramon Bertran; Marc Gonzàlez; Xavier Martorell; Nacho Navarro; Eduard Ayguadé

Author affiliations: Ramon Bertran, Xavier Martorell, Nacho Navarro, and Eduard Ayguadé (Barcelona Supercomputing Center, C. Jordi Girona 1-3, 08034 Barcelona, Spain); Marc Gonzàlez (Universitat Politècnica de Catalunya, C. Jordi Girona 1-3, 08034 Barcelona, Spain).


International Bhurban Conference on Applied Sciences and Technology | 2013

Design space explorations for streaming accelerators using Streaming Architectural Simulator

Muhammad Shafiq; Miquel Pericàs; Nacho Navarro; Eduard Ayguadé

In recent years, streaming accelerators like GPUs have emerged as an effective step towards parallel computing. The wish-list for these devices spans from support for thousands of small cores to behavior very close to general-purpose computing. This makes the design space for future accelerators containing thousands of parallel streaming cores very vast, and complicates choosing the right architectural configuration for next-generation devices. However, accurate design space exploration tools developed for massively parallel architectures can ease this task. The main objectives of this work are twofold: (i) we present a complete environment for a trace-driven simulator named SArcs (Streaming Architectural Simulator) for streaming accelerators, and (ii) we use our simulation tool-chain for design space explorations of GPU-like streaming architectures. Our explorations of different architectural aspects of a GPU-like device are made with reference to a baseline established for NVIDIA's Fermi architecture (GPU Tesla C2050). The explored aspects include the performance effects of variations in the configuration of streaming multiprocessors, global memory bandwidth, the channels between SMs and the memory hierarchy, and the cache hierarchy. The explorations are performed using application kernels for vector reduction, 2D convolution, matrix-matrix multiplication, and 3D stencil. Results show that the computational resources of the current Fermi GPU device could deliver higher performance if the device's global memory bandwidth were further improved.
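
In miniature, a design space sweep of this kind looks as follows. The roofline-style performance model and the workload numbers are invented toys; SArcs replays full instruction traces rather than a two-number trace summary. The C2050 baseline figures (14 SMs, 1.15 GHz, 144 GB/s) match the real device:

```cpp
// Toy design-space sweep: evaluate one kernel's trace summary against
// several architectural configurations and compare estimated times.
#include <algorithm>
#include <cstdio>

struct Config { int sms; double clock_ghz; double mem_bw_gbs; };

// Estimated time: the larger of the compute-bound and bandwidth-bound terms,
// assuming 32 lanes per SM and one operation per lane per cycle.
double est_time(const Config& c, double flops, double bytes) {
    double compute = flops / (c.sms * 32 * c.clock_ghz * 1e9);
    double memory  = bytes / (c.mem_bw_gbs * 1e9);
    return std::max(compute, memory);
}

int main() {
    const Config sweep[] = {
        {14, 1.15, 144.0},   // Fermi C2050-like baseline
        {14, 1.15, 288.0},   // double the memory bandwidth
        {28, 1.15, 144.0},   // double the SM count
    };
    for (const Config& c : sweep)
        std::printf("SMs=%2d BW=%5.1f GB/s -> %.2e s\n",
                    c.sms, c.mem_bw_gbs, est_time(c, 2e9, 8e8));
    return 0;
}
```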

Collaboration


Dive into Nacho Navarro's collaborations.

Top Co-Authors

Xavier Martorell, Polytechnic University of Catalonia
Eduard Ayguadé, Polytechnic University of Catalonia
Eduard Ayguadé, Barcelona Supercomputing Center
Javier Cabezas, Barcelona Supercomputing Center
Jesús Labarta, Polytechnic University of Catalonia
Lluis Vilanova, Barcelona Supercomputing Center
Miquel Pericàs, Chalmers University of Technology
Ivan Tanasic, Barcelona Supercomputing Center
José Oliver, Polytechnic University of Catalonia