Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Carlos Teijeiro is active.

Publications


Featured research published by Carlos Teijeiro.


European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface | 2009

Performance Evaluation of MPI, UPC and OpenMP on Multicore Architectures

Damián A. Mallón; Guillermo L. Taboada; Carlos Teijeiro; Juan Touriño; Basilio B. Fraguela; Andrés Gómez; Ramón Doallo; J. Carlos Mouriño

The current trend toward multicore architectures underscores the need for parallelism. While new languages and alternatives are being proposed to support these systems more efficiently, MPI faces this new challenge as well. Therefore, up-to-date performance evaluations of the current options for programming multicore systems are needed. This paper evaluates the performance of MPI against Unified Parallel C (UPC) and OpenMP on multicore architectures. From the analysis of the results, it can be concluded that MPI is generally the best choice on multicore systems with both shared and hybrid shared/distributed memory, as it takes the greatest advantage of data locality, the key factor for performance in these systems. Regarding UPC, although it exploits the data layout in memory efficiently, it suffers from remote shared-memory accesses, whereas OpenMP usually lacks efficient data locality support and is restricted to shared-memory systems, which limits its scalability.
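
As a hedged illustration of the locality argument above (not code from the paper), the following C/OpenMP kernel shows the pattern such evaluations typically measure: a first-touch initialization places data near the threads that later use it, which is the closest plain OpenMP gets to the explicit data distribution that MPI and UPC offer.

```c
/* Minimal sketch (not from the paper): an OpenMP reduction whose
 * performance hinges on data locality. With a first-touch allocation
 * policy, initializing the array inside a parallel region with the
 * same static schedule as the compute loop places each page near the
 * thread that will later read it. Compile with: cc -fopenmp */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const long n = 1L << 24;
    double *a = malloc(n * sizeof *a);
    double sum = 0.0;

    /* First-touch initialization: same schedule as the compute loop. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        a[i] = 1.0;

    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (long i = 0; i < n; i++)
        sum += a[i];

    printf("sum = %f\n", sum);
    free(a);
    return 0;
}
```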


Proceedings of the Third Conference on Partitioned Global Address Space Programming Models | 2009

Evaluation of UPC programmability using classroom studies

Carlos Teijeiro; Guillermo L. Taboada; Juan Touriño; Basilio B. Fraguela; Ramón Doallo; Damián A. Mallón; Andrés Gómez; José Carlos Mouriño; Brian Wibecan

Studying a language in terms of programmability is a very interesting issue in parallel programming. Traditional approaches in this field have applied different methods, such as counting lines of code or analyzing programs, in order to demonstrate the benefits of one paradigm over another. Nevertheless, these methods usually focus only on code analysis, paying little attention to the conditions of the development process, the learning stage, or the benefits and disadvantages of the language reported by the programmers. In this paper we present a methodology for carrying out a programmability study of UPC (Unified Parallel C) through classroom studies with a group of novice UPC programmers. This work shows the design of these sessions and the analysis of the results obtained (code analysis and survey responses). Thus, it is possible to characterize the current benefits and disadvantages of UPC, as well as to report some desirable features that could be included in the language standard.
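
For readers unfamiliar with the language, a minimal UPC example of the style novice programmers would write in such sessions (illustrative only, not one of the actual classroom exercises) looks like this; the upc_forall affinity expression is exactly the kind of high-level construct whose programmability the study measures.

```c
/* Illustrative UPC example (not an actual classroom exercise):
 * a shared-array vector addition. The affinity expression &c[i]
 * assigns each iteration to the thread that owns element i.
 * Compile with a UPC compiler, e.g. Berkeley UPC (upcc). */
#include <upc_relaxed.h>
#include <stdio.h>

#define N (256 * THREADS)        /* THREADS factor keeps the layout legal
                                    under a dynamic thread count */

shared double a[N], b[N], c[N];  /* default cyclic distribution */

int main(void)
{
    for (int i = MYTHREAD; i < N; i += THREADS)  /* fill my own elements */
        a[i] = i, b[i] = 2.0 * i;
    upc_barrier;

    upc_forall (int i = 0; i < N; i++; &c[i])    /* i runs where c[i] lives */
        c[i] = a[i] + b[i];
    upc_barrier;

    if (MYTHREAD == 0)
        printf("c[N-1] = %f\n", c[N - 1]);
    return 0;
}
```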


International Symposium on Computers and Communications | 2007

High Performance Java Remote Method Invocation for Parallel Computing on Clusters

Guillermo L. Taboada; Carlos Teijeiro; Juan Touriño

This paper presents a more efficient Java Remote Method Invocation (RMI) implementation for high-speed clusters. The use of Java for parallel programming on clusters is limited by the lack of efficient communication middleware and of high-speed cluster interconnect support. This implementation overcomes these limitations through a more efficient Java RMI protocol based on several basic assumptions about clusters. Moreover, the use of a high-performance sockets library provides direct high-speed interconnect support. The performance evaluation of this middleware on a Gigabit Ethernet (GbE) and a Scalable Coherent Interface (SCI) cluster shows experimental evidence of increased throughput. Moreover, qualitative aspects of the solution, such as transparency to the user, interoperability with other systems, and the absence of source code modifications, can improve the performance of existing parallel Java codes and boost the development of new high-performance Java RMI applications.


International Conference on Parallel and Distributed Systems | 2011

Design and Implementation of MapReduce Using the PGAS Programming Model with UPC

Carlos Teijeiro; Guillermo L. Taboada; Juan Touriño; Ramón Doallo

MapReduce is a powerful tool for processing the large data sets used by many applications running in distributed environments. However, despite the increasing number of computationally intensive problems that require low-latency communications, the adoption of MapReduce in High Performance Computing (HPC) is still emerging. Here, languages based on the Partitioned Global Address Space (PGAS) programming model have been shown to be a good choice for implementing parallel applications, taking advantage of the increasing number of cores per node and of the programmability benefits of their global memory view, such as transparent access to remote data. This paper presents the first PGAS-based MapReduce implementation using the Unified Parallel C (UPC) language, which (1) obtains programmability benefits in parallel programming, (2) offers advanced configuration options to define a customized load distribution for different codes, and (3) overcomes the performance penalties and bottlenecks that have traditionally prevented the deployment of MapReduce applications in HPC. The performance evaluation of representative applications on shared and distributed memory environments assesses the scalability of the presented MapReduce framework, confirming its suitability.
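
A rough sketch of the map/reduce split on top of UPC constructs is shown below; the names map_fn and partial are invented for illustration and do not reflect the actual API of the framework described in the paper.

```c
/* Hypothetical MapReduce-style skeleton in UPC (names invented, not
 * the paper's API): each thread maps over the elements it owns,
 * producing a local partial result, and a standard UPC collective
 * performs the reduce phase over the per-thread partials. */
#include <upc_relaxed.h>
#include <upc_collective.h>
#include <stdio.h>

#define N (256 * THREADS)

shared long input[N];
shared long partial[THREADS];    /* one partial result per thread */
shared long result;              /* final result, affinity to thread 0 */

/* User-supplied map: here, classify an element (1 if even, 0 if odd). */
static long map_fn(long v) { return (v % 2 == 0); }

int main(void)
{
    long acc = 0;
    for (int i = MYTHREAD; i < N; i += THREADS)
        input[i] = i;
    upc_barrier;

    /* Map phase: owner-computes over the shared input. */
    upc_forall (int i = 0; i < N; i++; &input[i])
        acc += map_fn(input[i]);
    partial[MYTHREAD] = acc;
    upc_barrier;

    /* Reduce phase: sum the per-thread partials. */
    upc_all_reduceL(&result, partial, UPC_ADD, THREADS, 1,
                    NULL, UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);
    if (MYTHREAD == 0)
        printf("count of even elements = %ld\n", result);
    return 0;
}
```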


High Performance Computing and Communications | 2009

Performance Evaluation of Unified Parallel C Collective Communications

Guillermo L. Taboada; Carlos Teijeiro; Juan Touriño; Basilio B. Fraguela; Ramón Doallo; José Carlos Mouriño; Damián A. Mallón; Andrés Gómez

Unified Parallel C (UPC) is an extension of ANSI C designed for parallel programming. UPC collective primitives, which are part of the UPC standard, increase programming productivity while reducing communication overhead. This paper presents an up-to-date performance evaluation of two publicly available UPC collective implementations in three scenarios: shared, distributed, and hybrid shared/distributed memory architectures. Characterizing the throughput of collective primitives is useful for increasing performance through the runtime selection of the appropriate primitive implementation, which depends on the message size and the memory architecture, as well as for detecting inefficient implementations. In fact, based on the analysis of UPC collectives performance, we propose some optimizations for the current UPC collective libraries. We have also compared the performance of the UPC collective primitives with their MPI counterparts, showing that there is room for improvement. Finally, the paper concludes with an analysis of the influence of UPC collectives performance on a representative communication-intensive application, showing that their optimization is highly important for UPC scalability.
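
As a usage reference (an assumed typical call, not taken from the paper), this is how one of the eight standard collectives being benchmarked, upc_all_broadcast, is invoked; the upc_flag_t argument selects the synchronization mode, one of the factors involved when a runtime chooses among implementations.

```c
/* Minimal usage sketch of a standard UPC collective: thread 0
 * broadcasts a block of doubles to every thread. */
#include <upc_relaxed.h>
#include <upc_collective.h>
#include <stdio.h>

#define NELEMS 256

/* Blocked layout: thread t owns dst[t*NELEMS .. (t+1)*NELEMS-1]. */
shared [NELEMS] double dst[THREADS * NELEMS];
shared [] double src[NELEMS];    /* indefinite block: all on thread 0 */

int main(void)
{
    if (MYTHREAD == 0)
        for (int i = 0; i < NELEMS; i++)
            src[i] = (double)i;

    /* ALLSYNC: barrier semantics on entry and on exit. */
    upc_all_broadcast(dst, src, NELEMS * sizeof(double),
                      UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);

    printf("thread %d got dst[..+1] = %f\n",
           MYTHREAD, dst[MYTHREAD * NELEMS + 1]);
    return 0;
}
```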


Proceedings of the Third Conference on Partitioned Global Address Space Programming Models | 2009

UPC performance evaluation on a multicore system

Damián A. Mallón; Andrés Gómez; José Carlos Mouriño; Guillermo L. Taboada; Carlos Teijeiro; Juan Touriño; Basilio B. Fraguela; Ramón Doallo; Brian Wibecan

As the size and architectural complexity of High Performance Computing systems increase, the need for productive programming tools and languages becomes more important. The UPC language aims to be a good choice for productive parallel programming. However, productivity is influenced not only by the expressiveness of the language, but also by its performance. To assess the current performance of UPC on high-performance multicore systems, and thereby help improve the future productivity of UPC developers, this paper provides an up-to-date UPC performance evaluation at various levels: evaluating two collective implementations, comparing their results with their MPI counterparts, and finally evaluating UPC and MPI performance on computational kernels. This analysis shows a path to optimizing the performance of UPC collectives. The work also provides a performance snapshot of UPC versus MPI, currently the most popular choice for parallel programming. This snapshot, together with the UPC collectives analysis, shows that there is room for improvement and that, despite its lower performance, UPC is suitable for the productive development of most HPC applications.


Computer Physics Communications | 2013

Parallel Brownian dynamics simulations with the message-passing and PGAS programming models

Carlos Teijeiro; Godehard Sutmann; Guillermo L. Taboada; Juan Touriño

The simulation of particle dynamics is among the most important mechanisms for studying the behavior of molecules in a medium under specific conditions of temperature and density. Several models can be used to compute efficiently the forces that act on each particle, as well as the interactions between particles. This work presents the design and implementation of a parallel simulation code for the Brownian motion of particles in a fluid. Two parallelization approaches have been followed: (1) traditional distributed-memory message passing with MPI, and (2) the Partitioned Global Address Space (PGAS) programming model, oriented towards hybrid shared/distributed memory systems, with the Unified Parallel C (UPC) language. Different techniques for domain decomposition and work distribution are analyzed in terms of efficiency and programmability in order to select the most suitable strategy. Performance results on a supercomputer using up to 2048 cores are also presented for both the MPI and UPC codes.
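
A minimal sketch of the message-passing side of such a design, assuming a 1-D slab decomposition with periodic neighbors (the decomposition details here are a placeholder, not the paper's actual scheme):

```c
/* Illustrative sketch (not the paper's code): each MPI process owns
 * a slab of particles and exchanges boundary particles with its
 * neighbors before each force computation. */
#include <mpi.h>
#include <stdio.h>

#define MAX_BOUNDARY 128

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Positions of particles near my upper boundary (placeholder data). */
    double send_up[MAX_BOUNDARY] = {0}, recv_down[MAX_BOUNDARY];
    int up   = (rank + 1) % nprocs;             /* periodic neighbors */
    int down = (rank - 1 + nprocs) % nprocs;

    /* Halo exchange: send my upper boundary up, receive my lower halo
     * from below; MPI_Sendrecv avoids deadlock between neighbors. */
    MPI_Sendrecv(send_up,   MAX_BOUNDARY, MPI_DOUBLE, up,   0,
                 recv_down, MAX_BOUNDARY, MPI_DOUBLE, down, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* ... compute forces using local particles plus the recv_down halo,
     * then integrate the Brownian dynamics time step ... */

    if (rank == 0) printf("halo exchange complete on %d ranks\n", nprocs);
    MPI_Finalize();
    return 0;
}
```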


Cluster Computing | 2014

Scalable PGAS collective operations in NUMA clusters

Damián A. Mallón; Guillermo L. Taboada; Carlos Teijeiro; Jorge González-Domínguez; Andrés Gómez; Brian Wibecan

The increasing number of cores per processor is making manycore-based systems pervasive. This involves dealing with multiple levels of memory in non-uniform memory access (NUMA) systems and with hierarchies of processor cores, accessible via complex interconnects, in order to deliver the increasing amount of data required by the processing elements. The key to the efficient and scalable provision of data is the use of collective communication operations that minimize the impact of bottlenecks. Leveraging one-sided communications becomes more important in these systems, to avoid the unnecessary synchronization between pairs of processes that arises in collective operations implemented in terms of two-sided point-to-point functions. This work proposes a series of algorithms that provide good performance and scalability in collective operations, based on the use of hierarchical trees, overlapping one-sided communications, message pipelining, and the available NUMA binding features. An implementation has been developed for Unified Parallel C, a Partitioned Global Address Space language, which presents a shared memory view across the nodes for programmability while keeping private memory regions for performance. The performance evaluation of the proposed implementation, conducted on five representative systems (JuRoPA, JUDGE, Finis Terrae, SVG and Superdome), has shown generally good performance and scalability, even outperforming MPI in some cases, which confirms the suitability of the developed algorithms for manycore architectures.
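
The core idea can be sketched as a binomial-tree broadcast built on one-sided upc_memput calls. This is a simplified illustration only: a plain upc_barrier stands in for the pairwise synchronization, NUMA-aware tree shapes, and message pipelining of the real algorithms.

```c
/* Simplified sketch of a tree-based, one-sided broadcast: in each
 * round, every thread that already holds the data pushes it to one
 * peer with a single upc_memput (no matching receive needed). */
#include <upc_relaxed.h>
#include <stdio.h>

#define NELEMS 256

/* One block of NELEMS doubles with affinity to each thread. */
shared [NELEMS] double buf[THREADS * NELEMS];

int main(void)
{
    double local[NELEMS];
    if (MYTHREAD == 0)                       /* root prepares the payload */
        for (int i = 0; i < NELEMS; i++)
            local[i] = (double)i;

    for (int mask = 1; mask < THREADS; mask <<= 1) {
        if (MYTHREAD < mask && MYTHREAD + mask < THREADS)
            /* One-sided put into the peer's block of buf. */
            upc_memput(&buf[(MYTHREAD + mask) * NELEMS], local,
                       NELEMS * sizeof(double));
        upc_barrier;                         /* round separator (simplified) */
        if (MYTHREAD >= mask && MYTHREAD < 2 * mask)
            upc_memget(local, &buf[MYTHREAD * NELEMS],
                       NELEMS * sizeof(double));
    }

    printf("thread %d: local[1] = %f\n", MYTHREAD, local[1]);
    return 0;
}
```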


Journal of Computer Science and Technology | 2013

Design and Implementation of an Extended Collectives Library for Unified Parallel C

Carlos Teijeiro; Guillermo L. Taboada; Juan Touriño; Ramón Doallo; José Carlos Mouriño; Damián A. Mallón; Brian Wibecan

Unified Parallel C (UPC) is a parallel extension of ANSI C based on the Partitioned Global Address Space (PGAS) programming model, which provides a shared memory view that simplifies code development while taking advantage of the scalability of distributed memory architectures. UPC thus allows programmers to write parallel applications for hybrid shared/distributed memory architectures, such as multicore clusters, in a more productive way, accessing remote memory by means of high-level language constructs such as assignments to shared variables or collective primitives. However, the standard UPC collectives library includes a reduced set of eight basic primitives with quite limited functionality. This work presents the design and implementation of extended UPC collective functions that overcome the limitations of the standard collectives library, allowing, for example, the use of a specific source or destination thread, or the definition of the amount of data transferred by each particular thread. This library fulfills the demands of the UPC developer community and implements portable algorithms that are independent of the specific UPC compiler/runtime being used. The use of a representative set of these extended collectives has been evaluated using two applications and four kernels as case studies. The results confirm the suitability of the new library for easier programming without trading off performance, thus achieving high productivity in parallel programming to harness the performance of hybrid shared/distributed memory architectures in high performance computing.
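
To make the "amount of data transferred by each particular thread" feature concrete, here is a hypothetical extended collective in the spirit of the library; the name upc_all_scatter_v and its signature are invented for illustration and are not copied from the actual library.

```c
/* Hypothetical extended collective (invented name and signature):
 * a vector scatter in which thread t pulls counts[t] doubles from
 * offset offsets[t] of a source array on thread 0, using a portable
 * one-sided upc_memget. Assumes THREADS*(THREADS+1)/2 <= TOTAL. */
#include <upc_relaxed.h>
#include <stdio.h>

#define TOTAL 1024

shared [] double src[TOTAL];               /* all source data on thread 0 */

static void upc_all_scatter_v(double *dst, shared [] double *s,
                              const int *counts, const int *offsets)
{
    upc_memget(dst, (shared const void *)(s + offsets[MYTHREAD]),
               counts[MYTHREAD] * sizeof(double));
    upc_barrier;
}

int main(void)
{
    int counts[THREADS], offsets[THREADS], off = 0;
    for (int t = 0; t < THREADS; t++) {    /* thread t gets t+1 elements */
        counts[t]  = t + 1;
        offsets[t] = off;
        off += counts[t];
    }
    if (MYTHREAD == 0)
        for (int i = 0; i < off; i++)
            src[i] = (double)i;
    upc_barrier;

    double mine[THREADS];                  /* counts[t] <= THREADS */
    upc_all_scatter_v(mine, src, counts, offsets);
    printf("thread %d received %d elems, first = %f\n",
           MYTHREAD, counts[MYTHREAD], mine[0]);
    return 0;
}
```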


International Journal of High Performance Computing Applications | 2017

Optimized parallel simulations of analytic bond-order potentials on hybrid shared/distributed memory with MPI and OpenMP

Carlos Teijeiro; Godehard Sutmann; Ralf Drautz; Thomas Hammerschmidt

Analytic bond-order potentials (BOPs) make it possible to obtain a highly accurate description of interatomic interactions at a reasonable computational cost. However, for simulations of very large systems, the high memory demands require a parallel implementation, which at the same time should optimize the use of computational resources. The calculations of analytic BOPs are performed within a restricted volume around each atom and have therefore been shown to be well suited to a Message Passing Interface (MPI) parallelization based on a domain decomposition scheme, in which one process manages one large domain using the entire memory of a compute node. Building on this approach, the present work focuses on analyzing and enhancing its performance on shared memory by using OpenMP threads on each MPI process, in order to use the many cores per node to speed up computations and minimize memory bottlenecks. Different algorithms are described and their performance results are presented, showing significant performance gains for highly parallel systems in hybrid MPI/OpenMP simulations with up to several thousand threads.
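
A skeleton of the hybrid scheme described above (illustrative, not the paper's code): one MPI process per domain, OpenMP threads over the per-atom computations, and funneled communication so that only the master thread calls MPI.

```c
/* Illustrative hybrid MPI/OpenMP skeleton: threads share the domain's
 * atom data, avoiding per-process copies and reducing the memory
 * footprint per node; MPI_THREAD_FUNNELED suffices because only the
 * master thread communicates. bop_energy is a stand-in for the
 * per-atom analytic BOP evaluation. */
#include <mpi.h>
#include <stdio.h>

#define LOCAL_ATOMS 100000

static double bop_energy(long atom) { (void)atom; return 1.0; }  /* stand-in */

int main(int argc, char **argv)
{
    int rank, provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local_e = 0.0, total_e = 0.0;

    /* Shared-memory parallelism over this domain's atoms; dynamic
     * scheduling balances uneven per-atom neighborhood costs. */
    #pragma omp parallel for reduction(+:local_e) schedule(dynamic, 64)
    for (long a = 0; a < LOCAL_ATOMS; a++)
        local_e += bop_energy(a);

    /* Funneled communication: only the master thread calls MPI. */
    MPI_Allreduce(&local_e, &total_e, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0) printf("total energy = %f\n", total_e);

    MPI_Finalize();
    return 0;
}
```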

Collaboration


Dive into Carlos Teijeiro's collaborations.
