Jorge González-Domínguez

Publication

Featured research published by Jorge González-Domínguez.

IEEE International Conference on High Performance Computing Data and Analytics | 2012

Communication avoiding and overlapping for numerical linear algebra

Evangelos Georganas; Jorge González-Domínguez; Edgar Solomonik; Yili Zheng; Juan Touriño; Katherine A. Yelick

To efficiently scale dense linear algebra problems to future exascale systems, communication cost must be avoided or overlapped. Communication-avoiding 2.5D algorithms improve scalability by reducing inter-processor data transfer volume at the cost of extra memory usage. Communication overlap attempts to hide messaging latency by pipelining messages and overlapping with computational work. We study the interaction and compatibility of these two techniques for two matrix multiplication algorithms (Cannon and SUMMA), triangular solve, and Cholesky factorization. For each algorithm, we construct a detailed performance model that considers both critical path dependencies and idle time. We give novel implementations of 2.5D algorithms with overlap for each of these problems. Our software employs UPC, a partitioned global address space (PGAS) language that provides fast one-sided communication. We show communication avoidance and overlap provide a cumulative benefit as core counts scale, including results using over 24K cores of a Cray XE6 system.
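As a rough illustration of the SUMMA-style block structure mentioned in the abstract, the following is a minimal serial sketch in Python. It is not the authors' UPC implementation: it simulates a hypothetical p x p process grid in a single process, and it models neither 2.5D replication nor communication overlap. All function names (`summa`, `split_blocks`, and so on) are assumptions made for this example.

```python
def matmul(x, y):
    """Plain dense product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def add(x, y):
    """Element-wise sum of two equally sized matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def split_blocks(m, p):
    """Split an n x n matrix into a p x p grid of (n/p) x (n/p) blocks."""
    bs = len(m) // p
    return [[[row[j * bs:(j + 1) * bs] for row in m[i * bs:(i + 1) * bs]]
             for j in range(p)] for i in range(p)]


def join_blocks(blocks):
    """Reassemble a grid of blocks into one matrix."""
    return [sum((blk[r] for blk in block_row), [])
            for block_row in blocks for r in range(len(block_row[0]))]


def summa(a, b, p):
    """Serial simulation of SUMMA on a hypothetical p x p grid.

    In step k, the owner of A[i][k] would broadcast it along grid row i and
    the owner of B[k][j] would broadcast it along grid column j; every
    "process" (i, j) then accumulates the local product into C[i][j].
    """
    n = len(a)
    if n % p:
        raise ValueError("matrix size must be divisible by the grid size")
    bs = n // p
    a_blk, b_blk = split_blocks(a, p), split_blocks(b, p)
    c_blk = [[[[0] * bs for _ in range(bs)] for _ in range(p)] for _ in range(p)]
    for k in range(p):              # one broadcast step per block column/row
        for i in range(p):
            for j in range(p):
                c_blk[i][j] = add(c_blk[i][j], matmul(a_blk[i][k], b_blk[k][j]))
    return join_blocks(c_blk)


print(summa([[1, 2], [3, 4]], [[5, 6], [7, 8]], 2))
```

The loop over `k` is where a real distributed code would communicate, so it is also the place where pipelining messages could hide latency behind the local products.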

International Parallel and Distributed Processing Symposium | 2010

Servet: A benchmark suite for autotuning on multicore clusters

Jorge González-Domínguez; Guillermo L. Taboada; Basilio B. Fraguela; María J. Martín; Juan Touriño

The growing complexity in computer system hierarchies due to the increase in the number of cores per processor, levels of cache (some of them shared) and the number of processors per node, as well as the high-speed interconnects, demands the use of new optimization techniques and libraries that take advantage of their features. In this paper Servet, a suite of benchmarks focused on detecting a set of parameters with high influence in the overall performance of multicore systems, is presented. These benchmarks are able to detect the cache hierarchy, including their size and which caches are shared by each core, bandwidths and bottlenecks in memory accesses, as well as communication latencies among cores. These parameters can be used by auto-tuned codes to increase their performance in multicore clusters. Experimental results using different representative systems show that Servet provides very accurate estimates of the parameters of the machine architecture.

Grid Computing | 2013

Analysis of I/O Performance on an Amazon EC2 Cluster Compute and High I/O Platform

Roberto R. Expósito; Guillermo L. Taboada; Sabela Ramos; Jorge González-Domínguez; Juan Touriño; Ramón Doallo

Cloud computing is currently being explored by the scientific community to assess its suitability for High Performance Computing (HPC) environments. In this novel paradigm, compute and storage resources, as well as applications, can be dynamically provisioned on a pay-per-use basis. This paper presents a thorough evaluation of the I/O storage subsystem using the Amazon EC2 Cluster Compute platform and the recent High I/O instance type, to determine its suitability for I/O-intensive applications. The evaluation has been carried out at different layers using representative benchmarks in order to evaluate the low-level cloud storage devices available in Amazon EC2, ephemeral disks and Elastic Block Store (EBS) volumes, both on local and distributed file systems. In addition, several I/O interfaces (POSIX, MPI-IO and HDF5) commonly used by scientific workloads have also been assessed. Furthermore, the scalability of a representative parallel I/O code has also been analyzed at the application level, taking into account both performance and cost metrics. The analysis of the experimental results has shown that available cloud storage devices can have different performance characteristics and usage constraints. Our comprehensive evaluation can help scientists to increase significantly (up to several times) the performance of I/O-intensive applications in Amazon EC2 cloud. An example of optimal configuration that can maximize I/O performance in this cloud is the use of a RAID 0 of 2 ephemeral disks, TCP with 9,000 bytes MTU, NFS async and MPI-IO on the High I/O instance type, which provides ephemeral disks backed by Solid State Drive (SSD) technology.

IEEE/ACM Transactions on Computational Biology and Bioinformatics | 2015

Parallelizing Epistasis Detection in GWAS on FPGA and GPU-Accelerated Computing Systems

Jorge González-Domínguez; Lars Wienbrandt; Jan Christian Kässens; David Ellinghaus; Manfred Schimmler; Bertil Schmidt

High-throughput genotyping technologies (such as SNP-arrays) allow the rapid collection of up to a few million genetic markers of an individual. Detecting epistasis (based on 2-SNP interactions) in Genome-Wide Association Studies is an important but time-consuming operation since statistical computations have to be performed for each pair of measured markers. Computational methods to detect epistasis therefore suffer from prohibitively long runtimes; e.g., processing a moderately-sized dataset consisting of about 500,000 SNPs and 5,000 samples requires several days using state-of-the-art tools on a standard 3 GHz CPU. In this paper, we demonstrate how this task can be accelerated using a combination of fine-grained and coarse-grained parallelism on two different computing systems. The first architecture is based on reconfigurable hardware (FPGAs) while the second architecture uses multiple GPUs connected to the same host. We show that both systems can achieve speedups of around four orders of magnitude compared to the sequential implementation. This significantly reduces the runtimes for detecting epistasis to only a few minutes for moderately-sized datasets and to a few hours for large-scale datasets.
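To make the scale of the pairwise search concrete, the short Python sketch below counts the SNP pairs in a dataset of the size quoted in the abstract. The throughput passed to `estimate_days` is a purely hypothetical figure chosen for illustration; it is not a measurement from the paper.

```python
import math


def snp_pairs(num_snps):
    """Number of unordered SNP pairs that an exhaustive 2-SNP scan must test."""
    return math.comb(num_snps, 2)


def estimate_days(num_snps, pairs_per_second):
    """Runtime in days at an assumed (hypothetical) pair-testing throughput."""
    return snp_pairs(num_snps) / pairs_per_second / 86_400


pairs = snp_pairs(500_000)
print(f"{pairs:,} pairs")  # 124,999,750,000 pairs
# Hypothetical throughput of one million pair tests per second.
print(f"{estimate_days(500_000, 1e6):.2f} days")
```

Because the pair count grows quadratically with the number of SNPs, doubling the marker count roughly quadruples the work, which is why the abstract turns to massively parallel hardware.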

Concurrency and Computation: Practice and Experience | 2012

UPCBLAS: a library for parallel matrix computations in Unified Parallel C

Jorge González-Domínguez; María J. Martín; Guillermo L. Taboada; Juan Touriño; Ramón Doallo; Damián A. Mallón; Brian Wibecan

The popularity of Partitioned Global Address Space (PGAS) languages has increased in recent years thanks to their high programmability and performance through an efficient exploitation of data locality, especially on hierarchical architectures such as multicore clusters. This paper describes UPCBLAS, a parallel numerical library for dense matrix computations using the PGAS Unified Parallel C language. The routines developed in UPCBLAS are built on top of sequential Basic Linear Algebra Subprograms (BLAS) functions and exploit the particularities of the PGAS paradigm, taking into account data locality in order to achieve good performance. Furthermore, the routines implement other optimization techniques, several of them by automatically taking into account the hardware characteristics of the underlying systems on which they are executed. The library has been experimentally evaluated on a multicore supercomputer and compared with a message-passing-based parallel numerical library, demonstrating good scalability and efficiency.

International Conference on Cluster Computing | 2014

UPC++ for bioinformatics: A case study using genome-wide association studies

Jan Christian Kässens; Jorge González-Domínguez; Lars Wienbrandt; Bertil Schmidt

Modern genotyping technologies are able to obtain up to a few million genetic markers (such as SNPs) of an individual within a few minutes. Detecting epistasis, such as SNP-SNP interactions, in Genome-Wide Association Studies is an important but time-consuming operation since statistical computations have to be performed for each pair of measured markers. Therefore, a variety of HPC architectures have been used to accelerate these studies. In this work we present a parallel approach for multi-core clusters, which is implemented with UPC++ and takes advantage of the features available in the Partitioned Global Address Space and Object Oriented Programming models. Our solution is based on a well-known regression model (used by the popular BOOST tool) to test SNP-pair interactions. Experimental results show that UPC++ is suitable for parallelizing data-intensive bioinformatics applications on clusters. For instance, it reduces the time to analyze a real-world dataset with more than 500,000 SNPs and 5,000 individuals from several days when using a single core to less than one minute using 512 nodes (12,288 cores) of a Cray XC30 supercomputer.

Journal of Computational Science | 2015

High-speed exhaustive 3-locus interaction epistasis analysis on FPGAs

Jan Christian Kässens; Lars Wienbrandt; Jorge González-Domínguez; Bertil Schmidt; Manfred Schimmler

Epistasis, the interaction between genes, has become a major topic in molecular and quantitative genetics. It is believed that these interactions play a significant role in genetic variations causing complex diseases. Several algorithms have been employed to detect pairwise interactions in genome-wide association studies (GWAS), but revealing higher-order interactions remains a computationally challenging task. State-of-the-art tools are not able to perform an exhaustive search for all three-locus interactions in reasonable time, even for relatively small input datasets. In this paper we present how a hardware-assisted design can solve this problem and provide fast, efficient and exhaustive third-order epistasis analysis with up-to-date FPGA technology.

PLOS ONE | 2016

Parallel and Scalable Short-Read Alignment on Multi-Core Clusters Using UPC++

Jorge González-Domínguez; Yongchao Liu; Bertil Schmidt

The growth of next-generation sequencing (NGS) datasets poses a challenge to the alignment of reads to reference genomes in terms of alignment quality and execution speed. Some available aligners have been shown to obtain high quality mappings at the expense of long execution times. Finding fast yet accurate software solutions is of high importance to research, since availability and size of NGS datasets continue to increase. In this work we present an efficient parallelization approach for NGS short-read alignment on multi-core clusters. Our approach takes advantage of a distributed shared memory programming model based on the new UPC++ language. Experimental results using the CUSHAW3 aligner show that our implementation based on dynamic scheduling obtains good scalability on multi-core clusters. Through our evaluation, we are able to complete the single-end and paired-end alignments of 246 million reads of length 150 base-pairs in 11.54 and 16.64 minutes, respectively, using 32 nodes with four AMD Opteron 6272 16-core CPUs per node. In contrast, the multi-threaded original tool needs 2.77 and 5.54 hours to perform the same alignments on the 64 cores of one node. The source code of our parallel implementation is publicly available at the CUSHAW3 homepage (http://cushaw3.sourceforge.net).
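The runtimes quoted in the abstract can be turned into speedup and parallel efficiency figures with a few lines of Python. This is only arithmetic on the reported numbers: the baseline is the multi-threaded tool on one 64-core node, so the values below are node-level speedups on 32 nodes, and the function name is an assumption made for this example.

```python
def speedup_and_efficiency(baseline_minutes, parallel_minutes, node_ratio):
    """Return (speedup, efficiency) relative to a single-node baseline."""
    speedup = baseline_minutes / parallel_minutes
    return speedup, speedup / node_ratio


# Reported runtimes: hours on one node versus minutes on 32 nodes.
single_end = speedup_and_efficiency(2.77 * 60, 11.54, 32)
paired_end = speedup_and_efficiency(5.54 * 60, 16.64, 32)

print(f"single-end: {single_end[0]:.1f}x speedup, {single_end[1]:.0%} efficiency")
print(f"paired-end: {paired_end[0]:.1f}x speedup, {paired_end[1]:.0%} efficiency")
```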

Journal of Computational Science | 2016

Multithreaded and Spark parallelization of feature selection filters

Carlos Eiras-Franco; Verónica Bolón-Canedo; Sabela Ramos; Jorge González-Domínguez; Amparo Alonso-Betanzos; Juan Touriño

Vast amounts of data are generated every day, constituting a volume that is challenging to analyze. Techniques such as feature selection are advisable when tackling large datasets. Among the tools that provide this functionality, Weka is one of the most popular, although the implementations it provides struggle with large datasets, taking too long to be practical. Parallel processing can help alleviate this problem, effectively allowing users to work with Big Data. The computational power of multicore machines can be harnessed by using multithreading and distributed programming, effectively helping to tackle larger problems. Both these techniques can dramatically speed up the feature selection process, allowing users to work with larger datasets. The reimplementation of four popular feature selection algorithms included in Weka is the focus of this work. Multithreaded implementations previously not included in Weka as well as parallel Spark implementations were developed for each algorithm. Experimental results obtained from tests on real-world datasets show that the new versions offer significant reductions in processing times.

Bioinformatics | 2016

ParDRe: Faster Parallel Duplicated Reads Removal Tool for Sequencing Studies

Jorge González-Domínguez; Bertil Schmidt

Current next-generation sequencing technologies often generate duplicated or near-duplicated reads that (depending on the application scenario) do not provide any interesting biological information but can increase memory requirements and computational time of downstream analysis. In this work we present ParDRe, a de novo parallel tool to remove duplicated and near-duplicated reads through the clustering of Single-End or Paired-End sequences from fasta or fastq files. It uses a novel bitwise approach to compare the suffixes of DNA strings and employs hybrid MPI/multithreading to reduce runtime on multicore systems. We show that ParDRe is up to 27.29 times faster than Fulcrum (a representative state-of-the-art tool) on a platform with two 8-core Sandy-Bridge processors. Availability and implementation: Source code in C++ and MPI running on Linux systems as well as a reference manual are available at https://sourceforge.net/projects/pardre/
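The abstract does not spell out the bitwise comparison, so the following Python sketch only illustrates the general idea of comparing DNA strings with bit operations. It assumes a 2-bit encoding per base and counts mismatches with XOR plus a population count; the encoding table and function names are hypothetical and are not taken from the ParDRe source code.

```python
# Assumed 2-bit encoding; any fixed assignment of the four bases works.
ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack(seq):
    """Pack a DNA string into an integer, two bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | ENCODE[base]
    return value


def mismatches(a, b):
    """Count differing bases between two equal-length DNA strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    if not a:
        return 0
    diff = pack(a) ^ pack(b)                 # non-zero 2-bit group = mismatch
    low_bits = int("01" * len(a), 2)         # mask selecting one bit per base
    collapsed = (diff | (diff >> 1)) & low_bits
    return bin(collapsed).count("1")


def near_duplicates(a, b, max_mismatches):
    """True if the two reads differ in at most max_mismatches positions."""
    return mismatches(a, b) <= max_mismatches


print(mismatches("ACGT", "ACGA"))           # 1
print(near_duplicates("ACGT", "ACGA", 1))   # True
```

Comparing packed words instead of individual characters lets one machine instruction handle many bases at once, which is the usual motivation for this kind of encoding.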