Luís Fabrício Wanderley Góes
Pontifícia Universidade Católica de Minas Gerais
Publication
Featured research published by Luís Fabrício Wanderley Góes.
Parallel Computing | 2007
Walfredo Cirne; Francisco Vilar Brasileiro; Daniel Paranhos; Luís Fabrício Wanderley Góes; William Voorsluys
Large distributed systems challenge traditional schedulers, as it is often hard to determine a priori how long each task will take to complete on each resource, information that such schedulers require as input. Task replication has been applied in a variety of scenarios as a way to circumvent this problem. Task replication consists of dispatching multiple replicas of a task and using the result from the first replica to finish. Replication schedulers (i.e. schedulers that employ task replication) are able to achieve good performance even in the absence of information on tasks and resources. They are also less complex than traditional schedulers, making them better suited to large distributed systems. On the other hand, replication schedulers waste cycles on the replicas that are not the first to finish. Moreover, this extra consumption of resources raises severe concerns about the system-wide performance of a distributed system with multiple, competing replication schedulers. This paper presents a comprehensive study of task replication, comparing replication schedulers against traditional information-based schedulers, and establishing their efficacy (the performance delivered to the application), efficiency (the amount of resources wasted), and emergent behavior (the system-wide behavior of a system with multiple replication schedulers). We also introduce a simple access control strategy that can be implemented locally by each resource and greatly improves the overall performance of a system in which multiple replication schedulers compete for resources.
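The trade-off described in this abstract, earlier completion versus wasted cycles, can be illustrated with a minimal sketch. It is not the paper's scheduler: it assumes that all replicas start at the same time and that the losing replicas are cancelled the instant the first one finishes. The names `ReplicationOutcome` and `run_replicated` are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ReplicationOutcome:
    finish_time: float   # time at which the first replica finishes
    winner: int          # index of the resource whose replica finished first
    wasted_time: float   # resource time spent on replicas that did not win


def run_replicated(runtimes: Sequence[float]) -> ReplicationOutcome:
    """Model one task dispatched as several replicas.

    runtimes[i] is how long the replica on resource i would need.
    Assumption: every replica starts at time 0 and the losers are
    cancelled as soon as the winner finishes.
    """
    if not runtimes:
        raise ValueError("at least one replica is required")
    winner = min(range(len(runtimes)), key=lambda i: runtimes[i])
    finish_time = runtimes[winner]
    # Each losing replica ran from time 0 until the winner finished.
    wasted_time = finish_time * (len(runtimes) - 1)
    return ReplicationOutcome(finish_time, winner, wasted_time)


outcome = run_replicated([8.0, 3.0, 5.0])
print(outcome)
```

With three replicas the task finishes as soon as the fastest resource is done, without the scheduler knowing the runtimes in advance, but two resources each spend that same amount of time on work that is discarded.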
Frontiers in Education Conference | 2002
Carlos Augusto Paiva da Silva Martins; João Batista T. Corrêa; Luís Fabrício Wanderley Góes; Luiz E. Ramos; Talles Henrique Medeiros
We present a new method for learning microprocessor architecture based on design and verification using functional simulation. Our main goals are to improve and optimize the learning process, motivating students to study and learn theoretical and practical aspects of microprocessor architecture; to use functional simulators to validate the microprocessor design and to construct knowledge; and to develop research activities during an undergraduate course. Our method is based on learning, constructivism theory, problem-based learning, group projects, the design of academic microprocessors as motivation for theory study and learning, and the verification of the designed microprocessors through functional simulators developed by students. To validate the proposed method we analyze two microprocessors and their functional simulators: a digital signal processor using ASIP and RISC concepts, and a RISC ASIP home automation processor. They were developed in a computer architecture course (computer science, PUC-Minas, Brazil) as an application of this method. In the conclusion, students and professor analyze the results, highlighting the main differences, advantages and disadvantages of the new method.
Journal of Parallel and Distributed Computing | 2014
Márcio Castro; Luís Fabrício Wanderley Góes; Jean-François Méhaut
Transactional Memory (TM) is a programmer-friendly alternative to traditional lock-based concurrency. Although it intends to simplify concurrent programming, the performance of applications still depends on how frequently they synchronize and on the way they access shared data. These aspects must be taken into consideration if one intends to exploit the full potential of modern multicore platforms. Since these platforms feature complex memory hierarchies composed of different levels of cache, applications may suffer from memory latency and bandwidth problems if threads are not properly placed on cores. An interesting approach to efficiently exploit the memory hierarchy is called thread mapping. However, a single fixed thread mapping cannot deliver the best performance when dealing with a large range of transactional workloads, TM systems and platforms. In this article, we propose and implement in a TM system a set of adaptive thread mapping strategies for TM applications to tackle this problem. They range from simple strategies that do not require any prior knowledge to strategies based on Machine Learning techniques. Taking the Linux default strategy as baseline, we achieved performance improvements of up to 64.4% on a set of synthetic applications and an overall performance improvement of up to 16.5% on the standard STAMP benchmark suite.
Archive | 2005
Christiane V. Pousa; Luiz E. Ramos; Luís Fabrício Wanderley Góes; Carlos Augusto Paiva da Silva Martins
In this paper, we present a new version of ClusterSim (Cluster Simulation Tool), in which we included two new modules: Message-Passing (MP) and Distributed Shared Memory (DSM). ClusterSim supports the visual modeling and the simulation of clusters and their workloads for performance analysis. A modeled cluster is composed of single or multi-processed nodes, parallel job schedulers, network topologies, message-passing communications, distributed shared memory and technologies. A modeled workload is represented by users that submit jobs composed of tasks described by probability distributions and their internal structure (CPU, I/O, DSM and MPI instructions). Our main objectives in this paper are: to present a new version of ClusterSim with the inclusion of Message-Passing and Distributed Shared Memory simulation modules; to present the new software architecture and simulation model; to verify the proposal and implementation of MPI collective communication functions using different communication patterns (Message-Passing Module); to verify the proposal and implementation of DSM operations, consistency models and coherence protocols for object sharing (Distributed Shared Memory Module); to analyze ClusterSim v. 1.1 by means of two case studies. Our main contributions are the inclusion of the Message-Passing and Distributed Shared Memory simulation modules, a more detailed simulation model of ClusterSim and new features in the graphical environment.
Concurrency and Computation: Practice and Experience | 2017
Rodrigo C. O. Rocha; Alyson D. Pereira; Luiz E. Ramos; Luís Fabrício Wanderley Góes
The stencil pattern is important in many scientific and engineering domains, spurring great interest from researchers and industry. In recent years, various optimizations have been proposed for parallel stencil applications running on graphics processing units (GPUs). In particular, tiling is a technique that can significantly enhance application performance by improving data locality and by reducing the volume of communication between host memory and GPU. In addition, tiling enables stencil applications to process inputs that are larger than the physical GPU memory. However, implementing tiling efficiently is complex, time-consuming, and error-prone. In this paper, we propose transparently optimized automatic stencil tiling (TOAST), an automatic tiling mechanism for iterative stencil computations running on GPUs. TOAST has three main benefits: (1) it incorporates an optimization model that seeks to maximize data reuse within tiles while respecting the amount of dynamically available GPU memory; (2) it offers a virtualized GPU memory for stencil computations, allowing for large input data; and (3) it performs optimal tiling transparently to the developer of the parallel stencil application. The current implementation of TOAST augments the PSkel framework with an internal solver based on genetic algorithms. Our experimental results show that TOAST improves the performance of iterative stencil applications by up to 13× compared with their multithreaded (central processing unit–based) optimized versions and up to 48× compared with a naive tiling approach on GPU. The TOAST mechanism is able to automatically achieve a low percentage overhead of data management relative to the actual stencil computation.
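The basic idea of tiling a stencil computation can be sketched on the CPU in a few lines. This is not TOAST or the PSkel API: it assumes a one-dimensional 3-point averaging stencil, a single iteration, and a fixed tile size chosen by the caller instead of by an optimization model. The function names are hypothetical. Each tile is copied together with a one-cell halo on each side, which stands in for the data that would be transferred to GPU memory.

```python
from typing import List


def stencil_step(data: List[float]) -> List[float]:
    """One untiled 3-point averaging step; boundary cells stay unchanged."""
    out = list(data)
    for i in range(1, len(data) - 1):
        out[i] = (data[i - 1] + data[i] + data[i + 1]) / 3.0
    return out


def tiled_stencil_step(data: List[float], tile_size: int) -> List[float]:
    """Same step, computed tile by tile with a one-cell halo per side."""
    if tile_size < 1:
        raise ValueError("tile_size must be positive")
    n = len(data)
    out = list(data)
    for start in range(0, n, tile_size):
        end = min(start + tile_size, n)
        lo = max(start - 1, 0)   # left halo, clipped at the array edge
        hi = min(end + 1, n)     # right halo, clipped at the array edge
        chunk = data[lo:hi]      # the only data this tile needs to see
        # Update the interior cells owned by this tile.
        for i in range(max(start, 1), min(end, n - 1)):
            j = i - lo           # position of cell i inside the chunk
            out[i] = (chunk[j - 1] + chunk[j] + chunk[j + 1]) / 3.0
    return out


grid = [float(i * i) for i in range(10)]
assert tiled_stencil_step(grid, 4) == stencil_step(grid)
```

Because every tile carries its own halo, tiles can be processed independently and the working set per tile is bounded by `tile_size + 2` cells, which is what allows inputs larger than the device memory to be handled piece by piece.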
Job Scheduling Strategies for Parallel Processing | 2005
Luís Fabrício Wanderley Góes; Pedro Henrique Calais Guerra; Bruno Coutinho; Leonardo C. da Rocha; Wagner Meira; Renato Ferreira; Dorgival O. Guedes; Walfredo Cirne
Irregular and iterative I/O-intensive jobs need a different approach from parallel job schedulers. The focus in this case is not only the processing requirements anymore: memory, network and storage capacity must all be considered in making a scheduling decision. Job executions are irregular and data dependent, alternating between CPU-bound and I/O-bound phases. In this paper, we propose and implement a parallel job scheduling strategy for such jobs, called AnthillSched, based on a simple heuristic: we map the behavior of a parallel application with minimal resources as we vary its input parameters. From that mapping we infer the best scheduling for a certain set of input parameters given the available resources. To test and verify AnthillSched we used logs obtained from a real system executing data mining jobs. Our main contributions are the implementation of a parallel job scheduling strategy in a real system and the performance analysis of AnthillSched, which allowed us to discard some other scheduling alternatives considered previously.
International Parallel and Distributed Processing Symposium | 2005
Christiane V. Pousa; Luís Fabrício Wanderley Góes; Carlos Augusto Paiva da Silva Martins
Consistency is an important issue in distributed shared memory (DSM) systems. These systems share a set of objects or virtual memory pages. Data sharing enables the applications in a workload to access the data concurrently, but these concurrent accesses can generate inconsistencies in the state of the shared data. Consistency models are responsible for managing the consistency of shared data for the workloads. In this work, we propose, present and analyze a reconfigurable consistency model for object-based DSMs. We call this consistency model ROCoM (reconfigurable object consistency model). The behavior of ROCoM was represented using a reconfigurable algorithm (RA), and its analysis was carried out using a simulation tool (ClusterSim - Cluster Simulation Tool). Our results show that ROCoM, on average, performed 55% better than the other, traditional consistency models.
International Parallel and Distributed Processing Symposium | 2006
Milene Barbosa Carvalho; Luís Fabrício Wanderley Góes; Carlos Augusto Paiva da Silva Martins
In this paper, we present a dynamically reconfigurable cache architecture using adaptive block allocation policy analyzed by means of simulation. Our main objectives are: to propose a reconfigurable cache architecture and to propose, implement and analyze the performance of an adaptive cache block allocation policy. First, we present a proposal of the reconfigurable cache architecture that can adapt according to the workload. Then we present our adaptive policy and do some performance tests comparing our cache architecture with some set associative configurations. In these tests, we use some traces from BYU Trace Distribution Center of SPEC 2000 Benchmark. Finally, we analyze the results based on some metrics like cache miss ratio, response time, etc.
International Parallel and Distributed Processing Symposium | 2005
Christiane V. Pousa; Luís Fabrício Wanderley Góes; Dulcinéia Oliveira da Penha; Carlos Augusto Paiva da Silva Martins
In this paper, we propose, implement and analyze the performance of a reconfigurable sequential consistency algorithm (RSCA) using simulation. Extending the concepts of reconfigurable devices to the algorithmic level, we model RSCA, a reconfigurable sequential consistency algorithm for asynchronous distributed systems that manage the state of concurrent objects. Our main result is that, on average, RSCA performed 36% better than the traditional sequential consistency algorithms. The main contributions of this paper are the definition, proposal, implementation and performance analysis of RSCA.
Frontiers in Education Conference | 2002
Luiz E. Ramos; Luís Fabrício Wanderley Góes; Carlos Augusto Paiva da Silva Martins
In this paper we analyze the teaching and learning of parallel processing through performance analysis using a software tool called Prober. This tool is a functional and performance analyzer of parallel programs that we proposed and developed during an undergraduate research project. Our teaching and learning approach consists of a practical class in which students receive explanations of some parallel processing concepts and of how to use the tool. They then run some simple, guided performance tests on parallel programs and analyze the results using Prober as their only aid. Finally, students answer a self-assessment questionnaire about their background, their knowledge of parallel processing concepts and the usability of Prober. Our main goal is to show that students can learn parallel processing concepts more clearly, quickly and efficiently using our approach.
Collaboration
Carlos Augusto Paiva da Silva Martins
Pontifícia Universidade Católica de Minas Gerais