
Juan C. Moure

Autonomous University of Barcelona

Publication

Featured research published by Juan C. Moure.

international conference on conceptual structures | 2016

Embedded Real-time Stereo Estimation via Semi-global Matching on the GPU

D. Hernandez-Juarez; A. Chacón; Antonio Espinosa; D. Vázquez; Juan C. Moure; A.M. López

Dense, robust and real-time computation of depth information from stereo-camera systems is a computationally demanding requirement for robotics, advanced driver assistance systems (ADAS) and autonomous vehicles. Semi-Global Matching (SGM) is a widely used algorithm that propagates consistency constraints along several paths across the image. This work presents a real-time system producing reliable disparity estimation results on the new embedded energy-efficient GPU devices. Our design runs on a Tegra X1 at 42 frames per second (fps) for an image size of 640×480, 128 disparity levels, and using 4 path directions for the SGM method.
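The path-wise propagation mentioned in this abstract can be illustrated with a minimal CPU sketch of the commonly cited SGM cost recurrence along a single left-to-right path on one scanline. This is not the paper's GPU implementation. The function name `aggregate_left_to_right`, the penalties `p1` and `p2`, and the toy cost values are assumptions made for illustration only.

```python
def aggregate_left_to_right(cost, p1, p2):
    """Aggregate matching costs along one left-to-right path.

    cost[x][d] is the matching cost of pixel x at disparity d (hypothetical input).
    p1 penalises a disparity change of one level, p2 penalises larger jumps.
    """
    width = len(cost)
    ndisp = len(cost[0])
    agg = [list(cost[0])]  # the first pixel has no predecessor on the path
    for x in range(1, width):
        prev = agg[-1]
        prev_min = min(prev)
        row = []
        for d in range(ndisp):
            best = prev[d]                      # same disparity, no penalty
            if d > 0:
                best = min(best, prev[d - 1] + p1)
            if d + 1 < ndisp:
                best = min(best, prev[d + 1] + p1)
            best = min(best, prev_min + p2)     # any larger jump
            # Subtracting prev_min keeps the aggregated values bounded.
            row.append(cost[x][d] + best - prev_min)
        agg.append(row)
    return agg


# Toy scanline: 3 pixels, 3 disparity levels.
toy_cost = [[0, 5, 5], [5, 0, 5], [5, 5, 0]]
toy_agg = aggregate_left_to_right(toy_cost, p1=1, p2=4)
toy_disparity = [row.index(min(row)) for row in toy_agg]
```

A full SGM pipeline would repeat this aggregation for each path direction (4 in the configuration reported above), sum the results, and pick the disparity with the lowest total cost per pixel.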

IEEE Transactions on Parallel and Distributed Systems | 2015

Implementation of the DWT in a GPU through a Register-based Strategy

Pablo Enfedaque; Francesc Auli-Llinas; Juan C. Moure

The release of the CUDA Kepler architecture in March 2012 provided Nvidia GPUs with a larger register memory space and instructions for the communication of registers among threads. This facilitates a new programming strategy that uses registers for data sharing and reuse, to the detriment of shared memory. Such a programming strategy can significantly improve the performance of applications that reuse data heavily. This paper presents a register-based implementation of the Discrete Wavelet Transform (DWT), the prevailing data decorrelation technique in the field of image coding. Experimental results indicate that the proposed method is at least four times faster than the best GPU implementation of the DWT found in the literature. Furthermore, theoretical analysis coincides with experimental tests in showing that the execution times achieved by the proposed implementation are close to the GPU's performance limits.

international conference on conceptual structures | 2013

n-step FM-Index for Faster Pattern Matching

Alejandro Chacón; Juan C. Moure; Antonio Espinosa; Porfidio Hernández

Fast pattern matching is a requirement for many problems, especially for bioinformatics sequence analysis such as short read mapping applications. This work presents a variation of the FM-index method, denoted n-step FM-index, that is applied in exact match genome search. We propose an alternative two-dimensional FM-index structure that allows backward-search navigation taking steps of n symbols at a time. The main advantages of this arrangement are the reduction of the computational work and, most importantly, the reduction by n of the chain of dependent data accesses and the increase in the temporal locality of the data access pattern. This benefit comes at the expense of increasing the total amount of data required for the index. We present an in-depth performance analysis of a multi-core implementation of the algorithm using large references (up to 1.5G). We identify memory latency as the major performance limiter for single-thread execution and memory bandwidth for multi-thread execution. Our proposal provides speedups ranging from 1.4× to 2.4× when there is no limitation on DRAM capacity. We also analyse the trade-off of compacting the proposed data structure in order to reduce memory capacity requirements, now at the expense of increasing execution time. An extra 33% of DRAM space allows our proposal to improve performance by 1.2×, while doubling DRAM size enables an additional 1.5×. Our n-step algorithm proposal provides an alternative for pseudo-random memory access algorithms to be redesigned to scale in current and future computer systems.
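The idea of taking n symbols per backward-search step can be sketched with a deliberately naive Python model. It keeps the sorted rotations of the text explicitly and counts by scanning, so it shows only how the number of dependent steps drops by n. It does not reproduce the paper's two-dimensional index layout or its performance. All names (`build_rotations`, `backward_step`, `count_occurrences`) are hypothetical.

```python
def build_rotations(text):
    """Return the sorted cyclic rotations of text + '$' (naive index)."""
    t = text + "$"  # '$' is assumed not to occur in text or patterns
    return sorted(t[i:] + t[:i] for i in range(len(t)))


def backward_step(rotations, chunk, lo, hi):
    """Extend the current match to the left by the symbols in chunk."""
    k = len(chunk)
    # Rows whose first k symbols sort before chunk.
    smaller = sum(1 for r in rotations if r[:k] < chunk)

    def occ(i):
        # Rows above i that are preceded (cyclically) by chunk.
        return sum(1 for r in rotations[:i] if r[-k:] == chunk)

    return smaller + occ(lo), smaller + occ(hi)


def count_occurrences(text, pattern, n=1):
    """Count exact matches, consuming up to n pattern symbols per step."""
    rotations = build_rotations(text)
    lo, hi = 0, len(rotations)
    pos, steps = len(pattern), 0
    while pos > 0 and lo < hi:
        start = max(0, pos - n)
        lo, hi = backward_step(rotations, pattern[start:pos], lo, hi)
        pos = start
        steps += 1
    return hi - lo, steps


one_step = count_occurrences("abracadabra", "abra", n=1)
two_step = count_occurrences("abracadabra", "abra", n=2)
```

Both calls report the same number of matches, while the 2-step search needs half as many dependent steps. A practical index would replace the linear scans with precomputed count tables over n-symbol groups, which is where the extra memory cost mentioned in the abstract comes from.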

ACM Transactions on Computing Education / ACM Journal of Educational Resources in Computing | 2002

The KScalar simulator

Juan C. Moure; Dolores Rexachs; Emilio Luque

Modern processors increase their performance with complex microarchitectural mechanisms, which makes them more and more difficult to understand and evaluate. KScalar is a graphical simulation tool that facilitates the study of such processors. It allows students to analyze the performance behavior of a wide range of processor microarchitectures: from a very simple in-order, scalar pipeline, to a detailed out-of-order, superscalar pipeline with non-blocking caches, speculative execution, and complex branch prediction. The simulator interprets executables for the Alpha AXP instruction set: from very short program fragments to large applications. The object's program execution may be simulated in varying levels of detail: either cycle-by-cycle, observing all the pipeline events that determine processor performance, or a million cycles at once, taking statistics of the main performance issues. Instructors may use KScalar in several ways. First, it may be used to provide demonstrations in lectures or online learning environments. Second, it allows students to investigate the characteristics of specific processor microarchitectures as practical short assignments associated with a lecture course. Third, students may undertake major projects involving the optimization of real programs at the software-hardware interface, or involving the optimization of a processor microarchitecture for a given application workload. A preliminary version of KScalar has been successfully used in several lecture courses during the last two years in the University Autónoma of Barcelona. It runs on an x86/Linux/KDE system. The graphical interface has been developed using the KDE and QT libraries. The simulator engine running behind the graphical interface is a heavily-modified version of SimpleScalar. KScalar code is available under the terms of the GNU and SimpleScalar General Public License.

Proceedings of the 20th European MPI Users' Group Meeting on | 2013

Job scheduling for optimizing data locality in Hadoop clusters

Aprigio Bezerra; Porfidio Hernández; Antonio Espinosa; Juan C. Moure

We describe the use of non-dedicated clusters by a known group of local applications sharing the computational resources with additional bioinformatics MapReduce applications. We have studied how to effectively use the resources shared by both application types during their execution. In order to keep local application execution times unaffected, we consider the configuration of a group of parameters of the Hadoop platform. One of the most relevant aspects to consider is the job scheduling policy. Our aim is to group tasks from different jobs that handle the same data blocks so that they run on the node where those blocks are allocated. Experimental results show that our approach outperforms traditional policies.

computing frontiers | 2006

Evaluation of the field-programmable cache: performance and energy consumption

Domingo Benitez; Juan C. Moure; Dolores Rexachs; Emilio Luque

Many authors have proposed power management techniques for general-purpose processors at the cost of degraded performance such as lower IPC or longer delay. Some proposals have focused on cache memories because they consume a significant fraction of total microprocessor power. We propose a reconfigurable and adaptive cache microarchitecture based on field-programmable technology that is intended to deliver high performance at low energy consumption. In this paper, we evaluate the performance and energy consumption of a run-time algorithm when used to manage a field-programmable L1 data cache. The adaptation strategy is based on two techniques: a learning process provides the best cache configuration for each program phase, and a recognition process detects program phase changes by using data working-set signatures to activate a low-overhead reconfiguration mechanism. Our proposals achieve performance improvement and cache energy saving at the same time. Considering a design scenario driven by performance constraints, we show that processor execution time and cache energy consumption can be reduced on average by 15.2% and 9.9% compared to a non-adaptive high-performance microarchitecture. Alternatively, when energy saving is prioritized and considering a non-adaptive energy-efficient microarchitecture as baseline, cache energy and processor execution time are reduced on average by 46.7% and 9.4% respectively. In addition to comparing to conventional microarchitectures, we show that the proposed microarchitecture achieves better performance and more cache energy reduction than other configurable caches.

design, automation, and test in europe | 2010

A reconfigurable cache memory with heterogeneous banks

Domingo Benitez; Juan C. Moure; Dolores Rexachs; Emilio Luque

The optimal size of a large on-chip cache can be different for different programs: at some point, the reduction of cache misses achieved when increasing cache size hits diminishing returns, while the higher cache latency hurts performance. This paper presents the Amorphous Cache (AC), a reconfigurable L2 on-chip cache aimed at improving performance as well as reducing energy consumption. AC is composed of heterogeneous sub-caches as opposed to common caches using homogenous sub-caches. The sub-caches are turned off depending on the application workload to conserve power and minimize latencies. A novel reconfiguration algorithm based on Basic Block Vectors is proposed to recognize program phases, and a learning mechanism is used to select the appropriate cache configuration for each program phase. We compare our reconfigurable cache with existing proposals of adaptive and non-adaptive caches. Our results show that the combination of AC and the novel reconfiguration algorithm provides the best power consumption and performance. For example, on average, it reduces the cache access latency by 55.8%, the cache dynamic energy by 46.5%, and the cache leakage power by 49.3% with respect to a non-adaptive cache.

The Journal of Supercomputing | 2012

Analysis and improvement of map-reduce data distribution in read mapping applications

Antonio Espinosa; Porfidio Hernández; Juan C. Moure; J. Protasio; Ana Ripoll

The map-reduce paradigm has been shown to be a simple and feasible way of filtering and analyzing large data sets in cloud and cluster systems. Algorithms designed for the paradigm must implement regular data distribution patterns so that appropriate use of resources is ensured. Good scalability and performance of Map-Reduce applications greatly depend on the design of regular intermediate data generation-consumption patterns at the map and reduce phases. We describe the data distribution patterns found in current Map-Reduce read mapping bioinformatics applications and show some data decomposition principles to greatly improve their scalability and performance.

international conference on computational science | 2016

GPU-based Pedestrian Detection for Autonomous Driving

V. Campmany; S. Silva; Antonio Espinosa; Juan C. Moure; D. Vázquez; A.M. López

We propose a real-time pedestrian detection system for the embedded Nvidia Tegra X1 GPU-CPU hybrid platform. The detection pipeline is composed of the following state-of-the-art algorithms: feature extraction from the input image using Histograms of Local Binary Patterns (LBP) and Histograms of Oriented Gradients (HOG); candidate generation using the Pyramidal Sliding Window technique; and classification with a Support Vector Machine (SVM). Experimental results show that the Tegra ARM platform is two times more energy efficient than a desktop GPU and at least 8 times faster than a desktop multicore CPU.

Journal of Parallel and Distributed Computing | 2017

Introducing computational thinking, parallel programming and performance engineering in interdisciplinary studies

Eduardo César; Ana Cortés; Antonio Espinosa; Tomàs Margalef; Juan C. Moure; Anna Sikora; Remo Suppi

Nowadays, many fields of science and engineering are evolving through the joint contribution of complementary fields. Computer science, and especially High Performance Computing, has become a key factor in the development of many research fields, establishing a new paradigm called computational science. Researchers and professionals from many different fields require knowledge of High Performance Computing, including parallel programming, to develop fruitful and efficient work in their particular field. Therefore, at Universitat Autonoma of Barcelona (Spain), an interdisciplinary Master on “Modeling for Science and Engineering” was started 5 years ago to provide a thorough knowledge of the application of modeling and simulation to graduate students in different fields (Mathematics, Physics, Chemistry, Engineering, Geology, etc.). In this Master’s degree, “Parallel Programming” appears as a compulsory subject because it is a key topic for them. The concepts learned in this subject must be applied to real applications. Therefore, a complementary subject on “Applied Modeling and Simulation” has also been included. It is very important to show the students how to analyze their particular problems, think about them from a computational perspective and consider the related performance issues. So, in this paper, the methodology and the experience in introducing computational thinking, parallel programming and performance engineering in this interdisciplinary Master’s degree are shown. This overall approach has been refined through the Master’s life, leading to excellent academic results and improving the industry’s and students’ appraisal of this programme.
