Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Cristobal A. Navarro is active.

Publication


Featured research published by Cristobal A. Navarro.


Computer Physics Communications | 2016

Adaptive multi-GPU Exchange Monte Carlo for the 3D Random Field Ising Model

Cristobal A. Navarro; Wei Huang; Youjin Deng

This work presents an adaptive multi-GPU Exchange Monte Carlo approach for the simulation of the 3D Random Field Ising Model (RFIM). The design is based on a two-level parallelization. The first level, spin-level parallelism, maps the parallel computation onto optimal 3D thread-blocks that simulate blocks of spins in shared memory with minimal halo surface, assuming a constant block volume. The second level, replica-level parallelism, uses multi-GPU computation to handle the simulation of an ensemble of replicas. CUDA's concurrent kernel execution feature is used to fill the occupancy of each GPU with many replicas, providing a performance boost that is most noticeable at the smallest values of L. In addition to the two-level parallel design, the work proposes an adaptive multi-GPU approach that dynamically builds a temperature set free of exchange bottlenecks. The strategy is based on mid-point insertions at the temperature gaps where the exchange rate is most compromised. The extra work generated by the insertions is balanced across the GPUs independently of where the mid-point insertions were performed. Performance results show that the spin-level parallelization is approximately two orders of magnitude faster than a single-core CPU version and one order of magnitude faster than a parallel multi-core CPU version running on 16 cores. Multi-GPU performance scales well in a weak scaling setting, reaching up to 99% efficiency as long as the number of GPUs and L increase together. The combination of the adaptive approach with the parallel multi-GPU design has extended the range of feasible simulations to L = 32 and L = 64 on a workstation with two GPUs. Sizes beyond L = 64 can eventually be studied using larger multi-GPU systems.
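
As an illustration of the adaptive temperature-set idea described above, the following host-side sketch (a simplification, not the paper's actual code; the function name, data layout and threshold are assumptions) inserts a mid-point temperature wherever the measured exchange acceptance rate between neighboring replicas drops below a threshold:

    #include <vector>

    // Hypothetical sketch of mid-point temperature insertion: exchange_rate[i]
    // is the measured acceptance rate between temps[i] and temps[i+1].
    std::vector<double> refine_temperatures(const std::vector<double>& temps,
                                            const std::vector<double>& exchange_rate,
                                            double min_rate) {
        std::vector<double> refined;
        for (std::size_t i = 0; i + 1 < temps.size(); ++i) {
            refined.push_back(temps[i]);
            if (exchange_rate[i] < min_rate)                        // bottlenecked gap
                refined.push_back(0.5 * (temps[i] + temps[i + 1])); // mid-point insertion
        }
        refined.push_back(temps.back());
        return refined;  // the extra replicas are then rebalanced across the GPUs
    }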


High Performance Computing and Communications | 2014

GPU Maps for the Space of Computation in Triangular Domain Problems

Cristobal A. Navarro; Nancy Hitschfeld

There is a stage in the GPU computing pipeline where a grid of thread-blocks, or space of computation, is mapped to the problem domain. Normally, the space of computation is a k-dimensional bounding box (BB) that covers a k-dimensional problem. Threads that fall inside the problem domain perform computations and threads that fall outside are discarded, all at runtime. For problems with non-square geometry, this approach makes the space of computation larger than necessary, wasting many threads. Our case of interest is the class of two-dimensional triangular domain problems, alias td-problems, where almost half of the space of computation is unnecessary under the BB approach. Problems such as the Euclidean distance map or collision detection are td-problems, and they appear frequently as part of a larger computational problem. In this work, we study several mapping functions and their contribution to a better space of computation by reducing the number of unnecessary threads. We compare the performance of four existing mapping strategies: the bounding box (BB), the upper-triangular mapping (UTM), the rectangular box (RB) and the recursive partition (REC). In addition, we propose a map g(λ) that sends any block λ to a unique location (i, j) in the triangular domain. The mapping is based on the properties of the lower triangular matrix and works in block space. The theoretical improvement I obtained from using g(λ) is upper bounded as I < 2, and the number of unnecessary blocks is reduced from O(n^2) to O(n). Experimental results using different Nvidia Kepler GPUs show that for computing the Euclidean distance matrix, g(λ) achieves an improvement of up to 18% over the basic bounding box (BB) strategy, runs faster than the UTM and REC strategies and is almost as fast as RB. Performance results on shared-memory 3D collision detection show that g(λ) is the fastest map of all, and the only one capable of surpassing the brute-force (BB) approach, by a margin of up to 7%. These results show that one of the main advantages of g(λ) is its use of block-space mapping, where coordinate values are small in magnitude and thread organization is not compromised, making the map stable in performance under different memory access patterns.
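
To make the block-space idea concrete, here is a minimal CUDA sketch of one possible triangular map in the spirit of g(λ); the formula shown (the standard triangular-root inversion of λ = i(i+1)/2 + j), the kernel name and the launch configuration are illustrative assumptions, not necessarily the exact map of the paper:

    // Illustrative kernel: each thread-block gets a linear index lambda and is
    // mapped to block coordinates (bi, bj) inside the lower-triangular domain,
    // so almost no blocks fall outside the triangle (unlike the bounding-box map).
    __global__ void td_euclidean(const float2* p, float* dist, int n) {
        unsigned int lambda = blockIdx.x;                      // linear block index
        // invert lambda = bi*(bi+1)/2 + bj (a rounding fix may be needed for huge lambda)
        unsigned int bi = (unsigned int)((sqrtf(8.0f * lambda + 1.0f) - 1.0f) * 0.5f);
        unsigned int bj = lambda - bi * (bi + 1) / 2;
        unsigned int i = bi * blockDim.y + threadIdx.y;        // row in the triangle
        unsigned int j = bj * blockDim.x + threadIdx.x;        // column in the triangle
        if (i < n && j <= i) {                                 // keep the lower triangle
            float dx = p[i].x - p[j].x, dy = p[i].y - p[j].y;
            dist[(size_t)i * n + j] = sqrtf(dx * dx + dy * dy);
        }
    }
    // launch sketch: td_euclidean<<<nb*(nb+1)/2, dim3(B, B)>>>(p, dist, n);
    // with B the block side and nb = (n + B - 1) / B blocks per dimension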


Computer Physics Communications | 2015

Parallel family trees for transfer matrices in the Potts model

Cristobal A. Navarro; Fabrizio Canfora; Nancy Hitschfeld; Gonzalo Navarro

The computational cost of transfer matrix methods for the Potts model is related to the question: in how many ways can two layers of a lattice be connected? Answering this question leads to the generation of a combinatorial set of lattice configurations. This set defines the configuration space of the problem, and the smaller it is, the faster the transfer matrix can be computed. The configuration space of generic (q, v) transfer matrix methods for strips is on the order of the Catalan numbers, which grow asymptotically as O(4^m), where m is the width of the strip. Other transfer matrix methods with a smaller configuration space do exist, but they make assumptions on the temperature or the number of spin states, or restrict the structure of the lattice. In this paper we propose a parallel algorithm that uses a sub-Catalan configuration space of O(3^m) to build the generic (q, v) transfer matrix in a compressed form. The improvement is achieved by grouping the original set of Catalan configurations into a forest of family trees, in such a way that the solution to the problem is computed by solving the root node of each family. As a result, the algorithm becomes exponentially faster than the Catalan approach while remaining highly parallel. The resulting matrix is stored in a compressed form using O(3^m × 4^m) space, making numerical evaluation and decompression faster than evaluating the matrix in its O(4^m × 4^m) uncompressed form. Experimental results for different sizes of strip lattices show that the parallel family trees (PFT) strategy indeed runs exponentially faster than the Catalan Parallel Method (CPM), especially when dealing with dense transfer matrices. In terms of parallel performance, we report strong-scaling speedups of up to 5.7× when running on an 8-core shared-memory machine and 28× on a 32-core cluster. The best balance of speedup and efficiency for the multi-core machine was achieved with p = 4 processors, while for the cluster scenario it was in the range p ∈ [8, 10]. Because of the parallel capabilities of the algorithm, a large-scale execution of the parallel family trees strategy on a supercomputer could contribute to the study of wider strip lattices.
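
For reference, the O(4^m) figure quoted above follows from the standard asymptotics of the Catalan numbers, which (up to boundary details) count the non-crossing connectivity states of a layer of width m:

    C_m = \frac{1}{m+1}\binom{2m}{m} \sim \frac{4^m}{m^{3/2}\sqrt{\pi}} \quad (m \to \infty)

Replacing the Catalan set by a family-tree forest of size O(3^m) therefore removes a factor of roughly (4/3)^m from the configuration space.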


Computer Physics Communications | 2018

GPU parallel simulation algorithm of Brownian particles with excluded volume using Delaunay triangulations

Francisco Carter; Nancy Hitschfeld; Cristobal A. Navarro; Rodrigo Soto

A novel parallel simulation algorithm on the GPU, implemented in CUDA and C++, is presented for the simulation of Brownian particles that display excluded volume repulsion and interact with long- and short-range forces. When an explicit Euler–Maruyama integration step is performed to take into account the pairwise forces and Brownian motion, particle overlaps can appear. The excluded volume property brings up the need to correct these overlaps as they happen, since predicting them is not feasible due to the random displacement of Brownian particles. The proposed solution maintains, at each time step, a Delaunay triangulation of the particle positions, because it allows overlaps between particles to be solved efficiently by checking just their neighborhood. The algorithm starts by generating a periodic Delaunay triangulation of the initial particle positions on the CPU, but after that the triangulation is always kept in GPU memory. We used a parallel edge-flip implementation to keep the triangulation updated during each time step, first checking that the triangulation was not rendered invalid by the particle displacements. We designed and implemented an exact long-range force simulation with an all-pairs N-body simulation, tiling the particle interaction computations based on the warp size of the target device architecture. The resulting implementation was validated with two models of active colloidal particles, showing a speedup of up to two orders of magnitude when compared to a sequential implementation. A short-range force simulation using Verlet lists for neighborhood handling was also developed and validated, showing similar performance improvements.
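
For context, a minimal CUDA sketch of the explicit Euler–Maruyama step mentioned above is shown below (kernel and parameter names are assumptions; the overlap correction via the Delaunay neighborhood, which is the core of the paper, is deliberately left out):

    #include <curand_kernel.h>

    // One Euler-Maruyama step per particle: deterministic drift from the
    // already accumulated pairwise force plus a Gaussian Brownian displacement.
    __global__ void euler_maruyama_step(float2* pos, const float2* force,
                                        curandState* rng, int n,
                                        float dt, float mobility, float D) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        curandState s = rng[i];
        float amp = sqrtf(2.0f * D * dt);                 // Brownian amplitude
        pos[i].x += mobility * force[i].x * dt + amp * curand_normal(&s);
        pos[i].y += mobility * force[i].y * dt + amp * curand_normal(&s);
        rng[i] = s;                                       // persist the RNG state
        // Overlaps created by this move must still be resolved afterwards,
        // which the paper does by checking only Delaunay neighbors.
    }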


International Conference on Computer Vision | 2013

Quasi-Delaunay Triangulations Using GPU-Based Edge-Flips

Cristobal A. Navarro; Nancy Hitschfeld; Eliana Scheihing

The edge-flip technique has been widely used for transforming any existing triangular mesh into a Delaunay mesh. Although several tools for generating Delaunay triangulations are known, none offers a real-time solution capable of maintaining the Delaunay condition on dynamically changing triangulations and, in particular, one integrable with the OpenGL rendering pipeline. In this paper we present an iterative GPGPU-based method capable of improving triangulations under the Delaunay criterion. Since the algorithm uses an ε value to handle co-circular or nearly co-circular point configurations, a low percentage of triangles do not fulfill the Delaunay condition. We have compared the triangulations generated by our method with those generated by the Triangle software and by the CGAL library, obtaining less than 0.05% differing triangles for fully random meshes and less than 1% for noise-based ones. Based on our experimental results, we report speedups from 14× to 50× over Lawson's sequential algorithm and of approximately 3× over CGAL's and Triangle's constructive algorithms when processing fully random triangulations. In our noise-based tests we report speedups of up to 36× and 27× over CGAL and Triangle, respectively.
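
As a hedged illustration of the ε handling mentioned above, the device function below sketches the classical in-circle predicate that drives an edge flip; the function name and the exact tolerance policy of the paper are assumptions:

    // For triangle (a, b, c) in counter-clockwise order and opposite vertex d,
    // a positive determinant means d lies inside the circumcircle, i.e. the
    // shared edge is not locally Delaunay and should be flipped. The eps
    // tolerance absorbs co-circular or nearly co-circular configurations.
    __device__ bool edge_needs_flip(float2 a, float2 b, float2 c, float2 d, float eps) {
        float ax = a.x - d.x, ay = a.y - d.y;
        float bx = b.x - d.x, by = b.y - d.y;
        float cx = c.x - d.x, cy = c.y - d.y;
        float det = (ax * ax + ay * ay) * (bx * cy - cx * by)
                  - (bx * bx + by * by) * (ax * cy - cx * ay)
                  + (cx * cx + cy * cy) * (ax * by - bx * ay);
        return det > eps;   // flip only when d is clearly inside the circumcircle
    }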


High Performance Computing and Communications | 2013

Multi-core Computation of Transfer Matrices for Strip Lattices in the Potts Model

Cristobal A. Navarro; Nancy Hitschfeld; Fabrizio Canfora

The transfer-matrix technique is a convenient way of studying strip lattices in the Potts model since the computational costs depend only on the periodic part of the lattice and not on the whole lattice. However, even with this reduced cost, the transfer-matrix technique is still an NP-hard problem, since the time T(|V|, |E|) needed to compute the matrix grows exponentially as a function of the graph width. In this work, we present a parallel transfer-matrix implementation that scales performance on multi-core architectures. The construction of the matrix is based on several repetitions of the deletion–contraction technique, allowing parallelism suitable for multi-core machines. Our experimental results show that the multi-core implementation achieves speedups of 3.7× with p = 4 processors and 5.7× with p = 8. The efficiency of the implementation lies between 60% and 95%, achieving the best balance of speedup and efficiency at p = 4 processors on current multi-core architectures. The algorithm also takes advantage of the lattice symmetry, making the transfer-matrix computation run up to 2× faster than its non-symmetric counterpart while using up to a quarter of the original space.
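
For completeness, the repeated deletion–contraction steps mentioned above rely on the standard identity for the Potts partition function in the (q, v) variables, where e is any edge, G \setminus e the graph with e deleted and G / e the graph with e contracted:

    Z(G; q, v) = Z(G \setminus e; q, v) + v\,Z(G / e; q, v)

Each application branches into two smaller subproblems, which is the source of the exponential cost in the lattice width and also of the independent work that can be distributed across cores.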


EasyChair Preprints | 2018

Analyzing GPU Tensor Core Potential for Fast Reductions

Roberto A. Carrasco Cavieres; Raimundo Vega; Cristobal A. Navarro

The Nvidia GPU architecture has introduced new computing elements such as tensor cores, which are special processing units dedicated to performing fast matrix-multiply-accumulate (MMA) operations and accelerating Deep Learning applications. In this work we present the idea of using tensor cores for a different purpose, namely the parallel arithmetic reduction problem, and propose a new GPU tensor-core based algorithm as well as analyze its potential performance benefits in comparison to a traditional GPU-based one. The proposed method encodes the reduction of n numbers as a set of m × m MMA tensor-core operations (for Nvidia's Volta architecture, m = 16) and takes advantage of the fact that each MMA operation takes just one GPU cycle. When analyzing the cost under a simplified GPU computing model, the result is that the new algorithm manages to reduce a problem of n numbers in T(n) = 5 log_{m^2}(n) steps with a speedup of S = (4/5) log_2(m^2).


Computer Physics Communications | 2018

A high-speed tracking algorithm for dense granular media

Mauricio Cerda; Cristobal A. Navarro; Juan Jorge Silva; Scott Waitukaitis; Nicolás Mujica; Nancy Hitschfeld


arXiv: Distributed, Parallel, and Cluster Computing | 2016

Potential benefits of a block-space GPU approach for discrete tetrahedral domains

Cristobal A. Navarro; Benjamin Bustos; Nancy Hitschfeld


International Conference of the Chilean Computer Science Society | 2015

A multi-GPU approach for the exchange Monte Carlo method

Cristobal A. Navarro; Wei Huang; Youjin Deng

Collaboration


Dive into Cristobal A. Navarro's collaborations.

Top Co-Authors

Eliana Scheihing (Austral University of Chile)
Fabrizio Canfora (Centro de Estudios Científicos)
Youjin Deng (University of Science and Technology of China)