Eduardo Rocha Rodrigues

Publication

Featured research published by Eduardo Rocha Rodrigues.

International Symposium on Computers and Communications | 2009

Multi-core aware process mapping and its impact on communication overhead of parallel applications

Eduardo Rocha Rodrigues; Felipe Lopes Madruga; Philippe Olivier Alexandre Navaux; Jairo Panetta

We propose an approach to reduce the execution time of applications with a steady communication pattern on clusters of multi-core processors by leveraging the asymmetry of core communication speeds. In addition to the well-known fact that communication link speeds on a fixed cluster vary with processor selection, we consider one effect of multi-core processor chips: link speeds vary with core selection within a single processor chip. The approach requires measuring link speeds among cluster cores as well as the communication volumes and computational loads of the selected application processes. These data are fed into the Dual Recursive Bipartitioning method to obtain a close-to-optimal placement of application processes on cluster cores. We apply this approach to a real-world application, achieving a noticeable reduction in execution time without even recompiling the source code.
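The objective behind such a mapping can be illustrated with a minimal Python sketch. It is not the Dual Recursive Bipartitioning method named in the abstract: it simply scores a placement as communication volume times link cost and, as a stand-in, searches all placements exhaustively, which is only feasible for a handful of processes. The function names and both matrices are hypothetical.

```python
from itertools import permutations


def mapping_cost(comm_volume, link_cost, placement):
    """Sum volume[i][j] * link_cost[core_i][core_j] over all process pairs.

    placement[i] is the core assigned to process i.
    """
    n = len(comm_volume)
    total = 0
    for i in range(n):
        for j in range(i + 1, n):
            total += comm_volume[i][j] * link_cost[placement[i]][placement[j]]
    return total


def best_placement(comm_volume, link_cost):
    """Exhaustive search over placements (a stand-in, not DRB)."""
    n = len(comm_volume)
    return min(
        permutations(range(n)),
        key=lambda p: mapping_cost(comm_volume, link_cost, p),
    )


# Hypothetical data: processes 0<->2 and 1<->3 exchange the most data.
comm_volume = [
    [0, 1, 10, 1],
    [1, 0, 1, 10],
    [10, 1, 0, 1],
    [1, 10, 1, 0],
]
# Hypothetical link costs: cores 0-1 share a chip, as do cores 2-3,
# so intra-chip links are cheaper than inter-chip links.
link_cost = [
    [0, 1, 5, 5],
    [1, 0, 5, 5],
    [5, 5, 0, 1],
    [5, 5, 1, 0],
]

identity = (0, 1, 2, 3)
best = best_placement(comm_volume, link_cost)
print(mapping_cost(comm_volume, link_cost, identity))
print(mapping_cost(comm_volume, link_cost, best))
```

In this toy case the search places each heavily communicating pair on the same chip, which is the effect the abstract attributes to core-aware placement.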

High Performance Computing and Communications | 2010

Evaluating Thread Placement Based on Memory Access Patterns for Multi-core Processors

Matthias Diener; Felipe Lopes Madruga; Eduardo Rocha Rodrigues; Marco Antonio Zanata Alves; Jörg Schneider; Philippe Olivier Alexandre Navaux; Hans-Ulrich Heiss

Process placement is a technique widely used on parallel machines with heterogeneous interconnects to reduce the overall communication time. For instance, two processes that communicate frequently are mapped close to each other. Finding the optimal mapping between threads and cores in a shared-memory environment (for example, OpenMP and Pthreads) is an even more complex task due to implicit communication. In this work, we examine data sharing patterns between threads in different workloads and use those patterns much as messages are used to map processes in cluster computers. We evaluated our technique on a state-of-the-art multi-core processor and achieved moderate improvements in the common case and considerable improvements in some cases, reducing execution time by up to 45%.

Symposium on Computer Architecture and High Performance Computing | 2010

A Comparative Analysis of Load Balancing Algorithms Applied to a Weather Forecast Model

Eduardo Rocha Rodrigues; Philippe Olivier Alexandre Navaux; Jairo Panetta; Alvaro Luiz Fazenda; Celso L. Mendes; Laxmikant V. Kalé

Among the many reasons for load imbalance in weather forecasting models, the dynamic imbalance caused by localized variations in the state of the atmosphere is the hardest one to handle. As an example, active thunderstorms may substantially increase load at a certain time step with respect to previous time steps in an unpredictable manner – after all, tracking storms is one of the reasons for running a weather forecasting model. In this paper, we present a comparative analysis of different load balancing algorithms to deal with this kind of load imbalance. We analyze the impact of these strategies on computation and communication, and the effect on execution time of the frequency at which the load balancer is invoked. This is done without any code modification, employing the concept of processor virtualization, which basically means that the domain is over-decomposed and the unit of rebalance is a sub-domain. With this approach, we were able to reduce the execution time of a full, real-world weather model.
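The idea of rebalancing over-decomposed sub-domains can be sketched in Python as follows. This is a generic greedy strategy (heaviest sub-domain first, always onto the least-loaded processor), chosen here for illustration; the abstract does not say which algorithms the paper compares, and the load values and function names below are hypothetical.

```python
import heapq


def greedy_rebalance(subdomain_loads, num_processors):
    """Assign sub-domains, heaviest first, to the least-loaded processor."""
    heap = [(0.0, p) for p in range(num_processors)]
    heapq.heapify(heap)
    assignment = {}
    ordered = sorted(subdomain_loads.items(), key=lambda kv: -kv[1])
    for sub_id, load in ordered:
        proc_load, proc = heapq.heappop(heap)
        assignment[sub_id] = proc
        heapq.heappush(heap, (proc_load + load, proc))
    return assignment


def max_processor_load(subdomain_loads, assignment, num_processors):
    """Load of the busiest processor, which bounds the time-step duration."""
    totals = [0.0] * num_processors
    for sub_id, proc in assignment.items():
        totals[proc] += subdomain_loads[sub_id]
    return max(totals)


# Hypothetical measured loads: 8 sub-domains on 2 processors, with a
# "storm" making sub-domains 0 and 1 much more expensive than the rest.
loads = {0: 9.0, 1: 8.0, 2: 1.0, 3: 1.0, 4: 1.0, 5: 1.0, 6: 1.0, 7: 1.0}

# Static block decomposition: sub-domains 0-3 on processor 0, 4-7 on 1.
static = {s: s // 4 for s in loads}
balanced = greedy_rebalance(loads, 2)

print(max_processor_load(loads, static, 2))
print(max_processor_load(loads, balanced, 2))
```

Because there are more sub-domains than processors, the balancer has units small enough to move; with one sub-domain per processor there would be nothing to redistribute.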

ACM Symposium on Applied Computing | 2010

A new technique for data privatization in user-level threads and its use in parallel applications

Eduardo Rocha Rodrigues; Philippe Olivier Alexandre Navaux; Jairo Panetta; Celso L. Mendes

User-level threads have been used to implement migratable MPI processes. This is a better strategy for implementing load balancing mechanisms because, in general, these threads are faster to create, manage and migrate than heavyweight processes and kernel threads. However, they present some issues concerning private data because they break the private address space that MPI programs typically assume. In this paper, we propose a new approach to privatize data in user-level threads. This approach is based on Thread Local Storage, which is used by kernel threads. We apply this technique to enable MPI processes based on user-level threads to execute a wider range of parallel programs. We show that this alternative has a more efficient context switch and lower migration cost than other approaches.

Future Generation Computer Systems | 2017

Job placement advisor based on turnaround predictions for HPC hybrid clouds

Renato L. F. Cunha; Eduardo Rocha Rodrigues; Leonardo P. Tizzei; Marco Aurelio Stelmar Netto

Several companies and research institutes are moving their CPU-intensive applications to hybrid High Performance Computing (HPC) cloud environments. Such a shift depends on the creation of software systems that help users decide where a job should be placed, considering execution time and queue wait time to access on-premise clusters. Relying blindly on turnaround prediction techniques will negatively affect response times inside HPC cloud environments. This paper introduces a tool to make job placement decisions in HPC hybrid cloud environments, taking into account the inaccuracy of execution and waiting time predictions. We used job traces from real supercomputing centers to run our experiments, and compared the performance between environments using real speedup curves. We also extended a state-of-the-art machine-learning-based predictor to work with data from the cluster scheduler. Our main findings are: (i) depending on workload characteristics, there is a turning point where predictions should be disregarded in favor of a more conservative decision to minimize job turnaround times and (ii) scheduler data plays a key role in improving predictions generated with machine learning using job trace data; our experiments showed improvements of around 20% in prediction accuracy.

IEEE International Conference on High Performance Computing, Data, and Analytics | 2010

Optimizing an MPI weather forecasting model via processor virtualization

Eduardo Rocha Rodrigues; Philippe Olivier Alexandre Navaux; Jairo Panetta; Celso L. Mendes; Laxmikant V. Kalé

Weather forecasting models are computationally intensive applications. These models are typically executed on parallel machines, and a major obstacle to their scalability is load imbalance. The causes of such imbalance are either static (e.g. topography) or dynamic (e.g. shortwave radiation, moving thunderstorms). Various techniques, often embedded in the application's source code, have been used to address both sources. However, these techniques are inflexible and hard to use in legacy codes. In this paper, we demonstrate the effectiveness of processor virtualization for dynamically balancing the load in BRAMS, a mesoscale weather forecasting model based on MPI parallelization. We use the Charm++ infrastructure, with its over-decomposition and object-migration capabilities, to move subdomains across processors during execution of the model. Processor virtualization enables better overlap between computation and communication and improved cache efficiency. Furthermore, by employing an appropriate load balancer, we achieve better processor utilization while requiring minimal changes to the model's code.

Symposium on Computer Architecture and High Performance Computing | 2004

A parallel engine for graphical interactive molecular dynamics simulations

Eduardo Rocha Rodrigues; Airam Jonatas Preto; Stephan Stephany

The current work proposes a parallel implementation for interactive molecular dynamics (MD) simulations. The interactive capability is modeled by finite automata that are executed in the processing nodes. Any interaction implies communication between the user interface and the finite automata. The ADKS, an interactive sequential MD code that provides graphical output, was chosen as a case study. A parallel version of this code was developed using the MPI communication library to check its parallel performance with and without visualization. Performance results are discussed for both cases, and the influence of visualization on performance is also treated, including image update rate. In order to allow a modular approach, a new parallel version of the ADKS is being implemented employing the PyMPI Python extension.

arXiv: Distributed, Parallel, and Cluster Computing | 2016

Helping HPC users specify job memory requirements via machine learning

Eduardo Rocha Rodrigues; Renato L. F. Cunha; Marco Aurelio Stelmar Netto; Michael J. Spriggs

Resource allocation in High Performance Computing (HPC) settings is still not easy for end-users due to the wide variety of application and environment configuration options. Users have difficulty estimating the number of processors and amount of memory required by their jobs, selecting the queue and partition, and estimating when job output will be available in order to plan their next experiments. Apart from wasting infrastructure resources, wrong allocation decisions can also negatively impact overall user response time. Techniques that exploit batch scheduler systems to predict waiting time and runtime of user jobs have already been proposed. However, we observed that such techniques are not suitable for predicting job memory usage. In this paper we introduce a tool to help users predict their memory requirements using machine learning. We describe the integration of the tool with a batch scheduler system, discuss how batch scheduler log data can be exploited to generate memory usage predictions through machine learning, and present results from two production systems containing thousands of jobs.

Irregular Applications: Architectures and Algorithms | 2013

A novel finite element method assembler for co-processors and accelerators

Nina Hanzlikova; Eduardo Rocha Rodrigues

The finite element method (FEM) is a popular approach to solving differential equations [5]. Among its many attractive features is its ability to handle complex geometries. The domain is discretised using simple elements whose local contributions are assembled into a global system of equations. This is in contrast to the finite difference method (FDM), which can typically only handle regular geometries. However, before a solution is possible, the system of equations of the FEM has to be assembled, a procedure which can be significant to the computational performance of the FEM solver, particularly when coupled with highly parallel execution [3]. In this work we outline a new algorithm for achieving a highly parallel assembler routine compatible with Intel® Xeon Phi and GPU architectures. We also present a performance comparison and analysis of our algorithm and the globalNZ algorithm outlined by Cecka et al. in [2], as implemented on the Intel® Xeon Phi architecture, and compare these to the serial implementation of Hughes [5].
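For readers unfamiliar with the assembly step, the following Python sketch shows a plain serial assembly for the simplest case: linear elements on a uniform 1D mesh, with the standard local stiffness matrix (1/h)[[1, -1], [-1, 1]]. It is neither the paper's algorithm nor globalNZ; the problem setup and the function name are assumptions made for illustration.

```python
def assemble_stiffness(num_elements, length=1.0):
    """Assemble the global stiffness matrix for 1D linear elements.

    Assumes a uniform mesh on [0, length]; returns a dense
    list-of-lists matrix for readability, not performance.
    """
    h = length / num_elements
    n_nodes = num_elements + 1
    K = [[0.0] * n_nodes for _ in range(n_nodes)]

    # Local stiffness matrix of one linear element of size h.
    local = [[1.0 / h, -1.0 / h], [-1.0 / h, 1.0 / h]]

    for e in range(num_elements):
        nodes = (e, e + 1)  # global node indices of element e
        for a in range(2):
            for b in range(2):
                # Scatter-add: neighbouring elements both write to the
                # entry of the node they share.
                K[nodes[a]][nodes[b]] += local[a][b]
    return K


K = assemble_stiffness(4)
for row in K:
    print(row)
```

The scatter-add in the inner loop is the part that complicates parallel assembly: two elements sharing a node update the same matrix entry, so a naive parallel loop over elements would race on those writes.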

2012 13th Symposium on Computer Systems | 2012

Improving the Scalability of an Operational Scientific Application in a Large Multi-core Cluster

Alvaro Luiz Fazenda; Eduardo Rocha Rodrigues; Simone Tomita; Jairo Panetta; Celso L. Mendes

Current high-performance computers use nodes with an increasing number of cores per chip. In this scenario, enhancing the scalability of an existing application requires a comprehensive approach, since system parameters such as memory per core and I/O speeds increase more slowly over time than cores per chip. This work describes the enhancements incorporated in BRAMS, a regional weather forecasting model, to reach a target execution time using 9,600 cores. We show that some common coding techniques may prevent scalability and that I/O and memory are constraints as core counts increase.