Network


Latest external collaborations at the country level.

Hotspot


Research topics where Manuel Ujaldon is active.

Publication


Featured research published by Manuel Ujaldon.


Journal of Parallel and Distributed Computing | 2013

Enhancing data parallelism for Ant Colony Optimization on GPUs

José M. Cecilia; José M. García; Andy Nisbet; Martyn Amos; Manuel Ujaldon

Graphics Processing Units (GPUs) have evolved into highly parallel and fully programmable architectures over the past five years, and the advent of CUDA has eased their adoption in many real-world applications. In this paper, we deal with a GPU implementation of Ant Colony Optimization (ACO), a population-based optimization method which comprises two major stages: tour construction and pheromone update. Because of its inherently parallel nature, ACO is well-suited to GPU implementation, but it also poses significant challenges due to irregular memory access patterns. Our contribution within this context is threefold: (1) a data parallelism scheme for tour construction tailored to GPUs, (2) novel GPU programming strategies for the pheromone update stage, and (3) a new mechanism called I-Roulette to replicate the classic roulette wheel while improving GPU parallelism. Our implementation yields gains exceeding 20x for each of the two stages of the ACO algorithm as applied to the TSP, compared to a sequential counterpart running on a comparable single-threaded high-end CPU. Moreover, an extensive discussion of different implementation paths on GPUs shows how to deal with parallel graph connected components. This, in turn, suggests a broader area of inquiry, where algorithm designers may learn to adapt similar optimization methods to GPU architectures.
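
The I-Roulette mechanism mentioned above replaces the inherently sequential prefix-sum of classic roulette-wheel selection with independent per-candidate draws, reducing selection to a parallel max-reduction. A minimal CPU sketch of the idea (function names are illustrative; as the paper discusses, the resulting distribution only approximates the classic wheel):

```python
import random

def roulette_wheel(weights):
    # Classic roulette wheel: one random draw tested against a running
    # cumulative sum. The running sum makes this inherently sequential.
    r = random.uniform(0, sum(weights))
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if r <= acc:
            return i
    return len(weights) - 1

def i_roulette(weights):
    # I-Roulette: every candidate scales its weight by its own independent
    # random draw; the winner is the argmax. All draws are independent, so
    # on a GPU they run in parallel and selection becomes a max-reduction.
    scores = [w * random.random() for w in weights]
    return scores.index(max(scores))

random.seed(42)
weights = [0.1, 0.2, 0.3, 5.0]
picks = [i_roulette(weights) for _ in range(10000)]
# the heaviest candidate (index 3) wins far more often than index 0
```

Both selectors favour heavier candidates; the point of I-Roulette is that it removes the serial dependency, not that it changes the outcome.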


ieee international symposium on parallel & distributed processing, workshops and phd forum | 2011

Parallelization strategies for ant colony optimisation on GPUs

José M. Cecilia; José M. García; Manuel Ujaldon; Andy Nisbet; Martyn Amos

Ant Colony Optimisation (ACO) is an effective population-based meta-heuristic for the solution of a wide variety of problems. As a population-based algorithm, its computation is intrinsically massively parallel, and it is therefore theoretically well-suited for implementation on Graphics Processing Units (GPUs). The ACO algorithm comprises two main stages: tour construction and pheromone update. The former has been previously implemented on the GPU, using a task-based parallelism approach. However, up until now, the latter has always been implemented on the CPU. In this paper, we discuss several parallelisation strategies for both stages of the ACO algorithm on the GPU. We propose an alternative data-based parallelism scheme for tour construction, which better fits the GPU architecture. We also describe novel GPU programming strategies for the pheromone update stage. Our results show a total speed-up exceeding 28x for the tour construction stage and 20x for pheromone update, and suggest that ACO is a potentially fruitful area for future research in the GPU domain.
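
The data-parallel view of the pheromone update can be sketched as the classic scatter-versus-gather choice: one thread per ant writing along its tour (conflicting writes, needing atomics on a GPU) versus one thread per pheromone-matrix cell scanning all tours (conflict-free, at the price of redundant reads). A plain-Python sketch under those assumptions, with illustrative names:

```python
def tour_edges(tour):
    # Undirected edges of a closed tour, e.g. [0, 1, 2] -> (0,1), (1,2), (2,0).
    return list(zip(tour, tour[1:] + tour[:1]))

def update_scatter(tours, n, deposit):
    # Scatter: each ant writes along its own tour. Ants sharing an edge
    # write to the same cell, which on a GPU would require atomics.
    tau = [[0.0] * n for _ in range(n)]
    for tour in tours:
        for a, b in tour_edges(tour):
            tau[a][b] += deposit
            tau[b][a] += deposit
    return tau

def update_gather(tours, n, deposit):
    # Gather: each matrix cell scans every tour and accumulates privately,
    # so there are no write conflicts at all.
    edge_sets = [set(tour_edges(t)) for t in tours]
    tau = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            hits = sum((i, j) in s or (j, i) in s for s in edge_sets)
            tau[i][j] = deposit * hits
    return tau

tours = [[0, 1, 2, 3], [0, 2, 1, 3]]
assert update_scatter(tours, 4, 1.0) == update_gather(tours, 4, 1.0)
```

Both formulations produce the same pheromone matrix; which one wins on a GPU depends on atomic-operation cost versus redundant memory traffic.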


international symposium on biomedical imaging | 2008

Pathological image segmentation for neuroblastoma using the GPU

Antonio Ruiz; Jun Kong; Manuel Ujaldon; Kim L. Boyer; Joel H. Saltz; Metin N. Gurcan

We present a novel use of GPUs (graphics processing units) for the analysis of histopathological images of neuroblastoma, a childhood cancer. Thanks to the advent of modern microscopy scanners, whole-slide histopathological images can now be acquired, but the computational cost of analyzing these images with sophisticated image analysis algorithms is usually high. In this study, we have implemented previously developed image analysis algorithms on GPUs to exploit their outstanding processing power and memory bandwidth. The resulting GPU code was contrasted and combined with a C++ implementation on a multicore CPU to maximize parallelism on emerging architectures. Our codes were tested on different classes of images, with performance gain factors of about 5.6x when comparing the execution time of Matlab code running on the CPU against code running jointly on CPU and GPU.


soft computing | 2012

The GPU on the simulation of cellular computing models

José M. Cecilia; José M. García; Ginés D. Guerrero; Miguel A. Martínez-del-Amor; Mario J. Pérez-Jiménez; Manuel Ujaldon

Membrane Computing is a discipline aiming to abstract formal computing models, called membrane systems or P systems, from the structure and functioning of living cells as well as from the cooperation of cells in tissues, organs, and other higher-order structures. This framework provides polynomial-time solutions to NP-complete problems by trading space for time, and its efficient simulation poses challenges in three different aspects: the intrinsic massive parallelism of P systems, an exponential computational workspace, and a workload that is not floating-point intensive. In this paper, we analyze the simulation of a family of recognizer P systems with active membranes that solves the Satisfiability problem in linear time on different instances of Graphics Processing Units (GPUs). For efficient handling of the exponential workspace created by the P systems computation, we enable different data policies to increase memory bandwidth and exploit data locality through tiling and dynamic queues. Parallelism inherent to the target P system is also managed to demonstrate that GPUs offer a valid alternative for high-performance computing at a considerably lower cost. Furthermore, scalability is demonstrated up to the largest problem size we were able to run; on Fermi, the new hardware generation from Nvidia, our simulations on the Tesla S2050 server achieve a total speed-up exceeding four orders of magnitude.


bioinformatics and biomedicine | 2007

Pathological Image Analysis Using the GPU: Stroma Classification for Neuroblastoma

Antonio Ruiz; Olcay Sertel; Manuel Ujaldon; Joel H. Saltz; Metin N. Gurcan

Neuroblastoma is one of the most malignant childhood cancers, mostly affecting infants. The current prognosis is based on microscopic examination of slides by expert pathologists, a process that is error-prone, time-consuming, and may lead to inter- and intra-reader variations. Therefore, we are developing a Computer Aided Prognosis (CAP) system which provides computerized image analysis to assist pathologists in their prognosis. Since this system operates on relatively large-scale images and requires sophisticated algorithms, it takes a long time to process whole-slide images. In this paper, we propose a novel and efficient approach for the execution of a CAP system for neuroblastoma prognosis using the graphics processing unit (GPU). By leveraging the high memory bandwidth and strong floating-point capabilities of the GPU, our goal is to achieve an order-of-magnitude reduction in overall execution time as compared to a CPU alone. The proposed approach was tested on a set of test images with a promising accuracy of 99.4% and an execution performance gain factor of up to 45 times compared to C++ code running on the CPU.


Cluster Computing | 2016

Dynamic load balancing on heterogeneous clusters for parallel ant colony optimization

Antonio Llanes; José M. Cecilia; Antonia M. Sánchez; José M. García; Martyn Amos; Manuel Ujaldon

Ant colony optimisation (ACO) is a nature-inspired, population-based metaheuristic that has been used to solve a wide variety of computationally hard problems. In order to take full advantage of the inherently stochastic and distributed nature of the method, we describe a parallelization strategy that leverages these features on heterogeneous and large-scale, massively-parallel hardware systems. Our approach balances workload effectively, by dynamically assigning jobs to heterogeneous resources which then run ACO implementations using different search strategies. Our experimental results confirm that we can obtain significant improvements in terms of both solution quality and energy expenditure, thus opening up new possibilities for the development of metaheuristic-based solutions to “real world” problems on high-performance, energy-efficient contemporary heterogeneous computing platforms.
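
The core of such a dynamic scheme can be sketched as a shared job queue that heterogeneous workers drain at their own pace, so faster devices naturally claim a larger share of the work. A minimal thread-based sketch (worker names, relative speeds, and job sizes are illustrative, not taken from the paper):

```python
import queue
import threading
import time

def dynamic_balance(jobs, speeds):
    # Each worker pulls the next job as soon as it finishes the previous
    # one; no static partitioning of the job list is needed.
    q = queue.Queue()
    for job in jobs:
        q.put(job)
    done = {name: [] for name in speeds}

    def worker(name, speed):
        while True:
            try:
                job = q.get_nowait()
            except queue.Empty:
                return
            time.sleep(job / speed)  # simulated execution time
            done[name].append(job)

    threads = [threading.Thread(target=worker, args=item)
               for item in speeds.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done

# A "gpu" ten times faster than a "cpu" ends up with most of the jobs:
shares = dynamic_balance([0.05] * 20, {"gpu": 10.0, "cpu": 1.0})
```

The paper goes further by also varying the ACO search strategy per resource; the queue above only illustrates the load-balancing half of the idea.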


Journal of Real-time Image Processing | 2012

The 2D wavelet transform on emerging architectures: GPUs and multicores

Joaquín Franco; Gregorio Bernabé; Juan C. Fernandez; Manuel Ujaldon

Because of the computational power of today’s GPUs, they are increasingly being harnessed to assist CPUs in high-performance computing. In addition, a growing number of today’s state-of-the-art supercomputers include commodity GPUs to deliver unprecedented levels of performance in terms of raw GFLOPS and GFLOPS/cost. In this work, we present a GPU implementation of an image processing application of growing popularity: the 2D fast wavelet transform (2D-FWT). Based on a pair of Quadrature Mirror Filters, a complete set of application-specific optimizations is developed from a CUDA perspective to achieve outstanding gain factors over a highly optimized version of the 2D-FWT run on the CPU. An alternative approach based on the Lifting Scheme is also described in Franco et al. (Acceleration of the 2D wavelet transform for CUDA-enabled Devices, 2010). We then turn to hardware improvements such as multicores on the CPU side, and exploit them with thread-level parallelism using the OpenMP API and pthreads. Overall, the GPU exhibits better scalability and parallel performance on large-scale images, making it a solid alternative for computing the 2D-FWT versus thread-level methods run on emerging multicore architectures.
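
The separable structure that makes the 2D-FWT GPU-friendly can be seen in miniature: filter every row, then filter every column. The sketch below uses the simple Haar filter pair for brevity rather than the Quadrature Mirror Filters from the paper; the point is that every row (and then every column) is independent, which is what maps naturally onto one GPU thread or block per row.

```python
def haar_step(signal):
    # One level of the 1D Haar wavelet: pairwise averages (low-pass band)
    # followed by pairwise differences (high-pass band).
    avg = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    dif = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return avg + dif

def fwt2d(image):
    # Separable 2D transform: rows first, then columns. All rows are
    # independent of each other, and likewise all columns.
    rows = [haar_step(row) for row in image]
    cols = [haar_step(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]

# A constant image concentrates all energy in the low-low band:
assert fwt2d([[1, 1], [1, 1]]) == [[1.0, 0.0], [0.0, 0.0]]
```

Real implementations repeat the step recursively on the low-low band and use longer filters, but the row/column independence exploited on the GPU is the same.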


data mining in bioinformatics | 2009

Stroma classification for neuroblastoma on graphics processors

Antonio Ruiz; Olcay Sertel; Manuel Ujaldon; Joel H. Saltz; Metin N. Gurcan

Neuroblastoma is one of the most common childhood cancers. We are developing an image analysis system to assist pathologists in their prognosis. Since this system operates on relatively large-scale images and requires sophisticated algorithms, computerised analysis takes a long time to execute. In this paper, we propose a novel approach to benefit from high memory bandwidth and strong floating-point capabilities of graphics processing units. The proposed approach achieves a promising classification accuracy of 99.4% and an execution performance with a gain factor up to 45 times compared to hand-optimised C++ code running on the CPU.


Pattern Recognition Letters | 2008

On the computation of the Circle Hough Transform by a GPU rasterizer

Manuel Ujaldon; Antonio Ruiz; Nicolás Guil

This paper presents an alternative for fast computation of the Hough transform by taking advantage of commodity graphics processors, which provide a unique combination of low cost and high performance for this sort of algorithm. Optimizations focus on the features of a GPU rasterizer to evaluate, in hardware, the entire spectrum of votes for circle candidates from a small number of key points or seeds computed by the GPU vertex processor in a typical CPU manner. The number of votes and the fidelity of their values are analyzed within the GPU using mathematical models, as a function of the radius of the circles to be detected and the resolution of the texture storing the results. Empirical results validate the output obtained for a much faster execution of the Circle Hough Transform (CHT): on a 1024x1024 sample image containing 20 circles of r=50 pixels, the GPU accelerates the code by an order of magnitude, and its rasterizer contributes an additional 4x factor, for a total speed-up greater than 40x versus a baseline CPU implementation.
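
The voting step at the heart of the CHT can be sketched on the CPU: each edge point votes for every candidate centre at distance r from itself, and accumulator peaks mark detected circles. (The paper's contribution is evaluating these votes with the GPU rasterizer; the plain loop below is only an illustration, with an illustrative image size and radius.)

```python
import math

def circle_hough(points, radius, size):
    # Each point votes along a circle of the given radius around itself;
    # cells collecting many votes are likely circle centres.
    acc = [[0] * size for _ in range(size)]
    for x, y in points:
        for t in range(360):
            a = int(round(x - radius * math.cos(math.radians(t))))
            b = int(round(y - radius * math.sin(math.radians(t))))
            if 0 <= a < size and 0 <= b < size:
                acc[b][a] += 1
    return acc

# Edge points sampled from a circle of radius 10 centred at (32, 32):
points = [(round(32 + 10 * math.cos(math.radians(t))),
           round(32 + 10 * math.sin(math.radians(t))))
          for t in range(0, 360, 10)]
acc = circle_hough(points, 10, 64)
votes, centre = max((v, (a, b))
                    for b, row in enumerate(acc)
                    for a, v in enumerate(row))
# the accumulator peak lands at or next to the true centre (32, 32)
```

The rasterizer-based version in the paper draws each vote circle as geometry into a texture, letting fixed-function hardware perform exactly this accumulation.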


parallel computing | 1995

Sparse Block and Cyclic Data Distributions for Matrix Computations

Rafael Asenjo; Luis F. Romero; Manuel Ujaldon; Emilio L. Zapata

A significant part of scientific codes consists of sparse matrix computations. In this work, we propose two new pseudoregular data distributions for sparse matrices. The Multiple Recursive Decomposition (MRD) partitions the data using the prime factors of the dimensions of a multiprocessor network with mesh topology. Furthermore, we introduce a new storage scheme, storage-by-row-of-blocks, that significantly increases the efficiency of the Scatter distribution; we name this new variant the Block Row Scatter (BRS) distribution. The MRD and BRS methods achieve results that improve on those obtained by the other methods analyzed, while also being easier to implement. In fact, the data distributions resulting from the MRD and BRS methods are a generalization of the Block and Cyclic distributions used for dense matrices.
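
The Cyclic (Scatter) family of distributions that BRS builds on can be sketched as a simple ownership function: the matrix is tiled into fixed-size blocks, and the tiles are dealt out cyclically over a processor mesh, spreading dense and sparse regions alike across all processors. A minimal sketch (the paper's BRS additionally reorders the sparse storage by rows of blocks, which is not modelled here):

```python
def cyclic_owner(i, j, block, mesh_rows, mesh_cols):
    # Block-cyclic (Scatter) mapping: element (i, j) lives in tile
    # (i // block, j // block), and tiles wrap around the mesh cyclically.
    return (i // block) % mesh_rows, (j // block) % mesh_cols

# With 2x2 tiles on a 2x2 mesh, ownership repeats every 4 rows/columns:
assert cyclic_owner(0, 0, 2, 2, 2) == (0, 0)
assert cyclic_owner(4, 4, 2, 2, 2) == (0, 0)
assert cyclic_owner(2, 0, 2, 2, 2) == (1, 0)
```

Because ownership depends only on the tile coordinates modulo the mesh size, nonzeros clustered anywhere in a sparse matrix still end up spread over every processor.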

Collaboration


Manuel Ujaldon's collaborations and top co-authors.

Top Co-Authors

José M. Cecilia
Universidad Católica San Antonio de Murcia

Francisco D. Igual
Complutense University of Madrid

Martyn Amos
Manchester Metropolitan University