Luiz E. Ramos
Rutgers University
Publication
Featured research published by Luiz E. Ramos.
international conference on supercomputing | 2011
Luiz E. Ramos; Eugene Gorbatov; Ricardo Bianchini
Phase-Change Memory (PCM) technology has received substantial attention recently. Because PCM is byte-addressable and exhibits access times in the nanosecond range, it can be used in main memory designs. Moreover, PCM has higher density and lower idle power consumption than DRAM. Unfortunately, PCM is also slower than DRAM and has limited endurance. For these reasons, researchers have proposed memory systems that combine a small amount of DRAM and a large amount of PCM. In this paper, we propose a new hybrid design that features a hardware-driven page placement policy. The policy relies on the memory controller (MC) to monitor access patterns, migrate pages between DRAM and PCM, and translate the memory addresses coming from the cores. Periodically, the operating system updates its page mappings based on the translation information used by the MC. Detailed simulations of 27 workloads show that our system is more robust and exhibits a lower energy-delay² product (ED²) than state-of-the-art hybrid systems.
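To make the hardware-driven placement idea concrete, the following is a minimal sketch, assuming a simple "hottest pages go to DRAM" rule evaluated once per fixed window of accesses. It is not the policy from the paper: the class `HybridMemoryController`, its parameters `dram_pages` and `epoch`, and the dictionary standing in for the MC's translation table are all hypothetical. The helper `energy_delay_squared` computes the ED² metric (energy × delay²) used in the evaluation.

```python
from collections import Counter


def energy_delay_squared(energy_j: float, delay_s: float) -> float:
    """ED^2 metric: energy multiplied by the square of execution time (lower is better)."""
    return energy_j * delay_s ** 2


class HybridMemoryController:
    """Toy memory controller (hypothetical) that keeps the most-accessed pages in DRAM.

    Every page starts in PCM. After each `epoch` accesses, the controller ranks
    pages by access count, places the top `dram_pages` in DRAM and the rest in PCM,
    and records the result in `location`, standing in for MC-side address translation.
    """

    def __init__(self, dram_pages: int, epoch: int) -> None:
        self.dram_pages = dram_pages   # DRAM capacity, in pages
        self.epoch = epoch             # accesses per monitoring window
        self.counts = Counter()        # page -> accesses in the current window
        self.location = {}             # translation table: page -> "DRAM" or "PCM"
        self.migrations = 0
        self._accesses = 0

    def access(self, page: int) -> str:
        """Record an access and return the device that served it."""
        device = self.location.setdefault(page, "PCM")
        self.counts[page] += 1
        self._accesses += 1
        if self._accesses % self.epoch == 0:
            self._rebalance()
        return device

    def _rebalance(self) -> None:
        """Migrate pages so that only the hottest pages of this window sit in DRAM."""
        hot = {page for page, _ in self.counts.most_common(self.dram_pages)}
        for page, device in list(self.location.items()):
            target = "DRAM" if page in hot else "PCM"
            if device != target:
                self.location[page] = target
                self.migrations += 1
        self.counts.clear()  # start a fresh monitoring window
```

With `dram_pages=1` and `epoch=4`, accessing pages `1, 1, 1, 2` promotes page 1 to DRAM at the end of the window, so its next access is served from DRAM. A real design would also weigh PCM write endurance and migration cost, which this sketch ignores.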
architectural support for programming languages and operating systems | 2006
Taliver Heath; Ana Paula Centeno; Pradeep George; Luiz E. Ramos; Yogesh Jaluria; Ricardo Bianchini
Power densities have been increasing rapidly at all levels of server systems. To counter the high temperatures resulting from these densities, systems researchers have recently started work on software-based thermal management. Unfortunately, research in this new area has been hindered by the limitations imposed by simulators and real measurements. In this paper, we introduce Mercury, a software suite that avoids these limitations by accurately emulating temperatures based on simple layout, hardware, and component-utilization data. Most importantly, Mercury runs the entire software stack natively, enables repeatable experiments, and allows the study of thermal emergencies without harming hardware reliability. We validate Mercury using real measurements and a widely used commercial simulator. We use Mercury to develop Freon, a system that manages thermal emergencies in a server cluster without unnecessary performance degradation. Mercury will soon become available from http://www.darklab.rutgers.edu.
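Emulating temperature from utilization data can be illustrated with a generic lumped-RC thermal model: power grows linearly with utilization, and temperature moves toward an equilibrium set by thermal resistance and capacitance. This is a minimal sketch under that assumption, not Mercury's actual equations; the `Component` class and all parameter values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """One heat-producing component in a lumped-RC thermal model (hypothetical values)."""
    idle_power_w: float         # power draw at 0% utilization
    max_power_w: float          # power draw at 100% utilization
    resistance_k_per_w: float   # thermal resistance to the surrounding air
    capacitance_j_per_k: float  # heat stored per kelvin of temperature rise
    temp_c: float               # current temperature

    def power(self, utilization: float) -> float:
        """Linear power model: interpolate between idle and maximum power."""
        return self.idle_power_w + utilization * (self.max_power_w - self.idle_power_w)

    def step(self, utilization: float, ambient_c: float, dt_s: float) -> float:
        """Advance the temperature by one explicit-Euler step and return it."""
        heat_in = self.power(utilization)
        heat_out = (self.temp_c - ambient_c) / self.resistance_k_per_w
        self.temp_c += dt_s * (heat_in - heat_out) / self.capacitance_j_per_k
        return self.temp_c
```

At constant utilization the temperature converges to `ambient_c + power * resistance_k_per_w`; for example, a component drawing 60 W through 0.5 K/W in 25 °C air settles at 55 °C. Feeding per-interval utilization samples into `step` is what lets such a model run alongside a real software stack.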
architectural support for programming languages and operating systems | 2011
Qingyuan Deng; David Meisner; Luiz E. Ramos; Thomas F. Wenisch; Ricardo Bianchini
Main memory is responsible for a large and increasing fraction of the energy consumed by servers. Prior work has focused on exploiting DRAM low-power states to conserve energy. However, these states require entire DRAM ranks to be idled, which is difficult to achieve even in lightly loaded servers. In this paper, we propose to conserve memory energy while improving its energy-proportionality by creating active low-power modes for it. Specifically, we propose MemScale, a scheme wherein we apply dynamic voltage and frequency scaling (DVFS) to the memory controller and dynamic frequency scaling (DFS) to the memory channels and DRAM devices. MemScale is guided by an operating system policy that determines the DVFS/DFS mode of the memory subsystem based on the current need for memory bandwidth, the potential energy savings, and the performance degradation that applications are willing to withstand. Our results demonstrate that MemScale reduces energy consumption significantly compared to modern memory energy management approaches. We conclude that the potential benefits of the MemScale mechanisms and policy more than compensate for their small hardware cost.
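The policy side of such a scheme can be sketched as a constrained choice: among the available memory frequency modes, pick the lowest-power one that still meets the current bandwidth demand and keeps predicted slowdown within the tolerated bound. The sketch below assumes a simple model in which only the memory-bound share of runtime stretches inversely with memory frequency and peak bandwidth scales linearly with frequency; `MemoryMode`, `predicted_slowdown`, `choose_mode`, and all numbers are hypothetical and do not reproduce MemScale's actual models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryMode:
    freq_mhz: int
    power_w: float  # hypothetical memory-subsystem power in this mode


def predicted_slowdown(mem_fraction: float, freq_mhz: int, max_freq_mhz: int) -> float:
    """Fractional slowdown if only the memory-bound share of runtime
    stretches inversely with memory frequency (a simplifying assumption)."""
    return mem_fraction * (max_freq_mhz / freq_mhz - 1.0)


def choose_mode(modes: list[MemoryMode], mem_fraction: float, demand_gbps: float,
                max_bw_gbps: float, max_slowdown: float) -> MemoryMode:
    """Pick the lowest-power mode that meets bandwidth demand and the slowdown bound."""
    max_freq = max(m.freq_mhz for m in modes)
    feasible = [
        m for m in modes
        if demand_gbps <= max_bw_gbps * m.freq_mhz / max_freq
        and predicted_slowdown(mem_fraction, m.freq_mhz, max_freq) <= max_slowdown
    ]
    if feasible:
        return min(feasible, key=lambda m: m.power_w)
    # Fall back to the fastest mode when no mode satisfies both constraints.
    return max(modes, key=lambda m: m.freq_mhz)
```

For a workload that spends 20% of its time on memory and tolerates a 10% slowdown, this picks a 640 MHz mode over 800 MHz; loosening the bound to 25% allows 400 MHz, provided its reduced bandwidth still covers demand.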
high-performance computer architecture | 2008
Luiz E. Ramos; Ricardo Bianchini
Designing thermal management policies for today's power-dense server clusters is currently a challenge, since it is difficult to predict the exact temperature and performance that would result from trying to react to a thermal emergency. To address this challenge, in this paper we propose C-Oracle, a software infrastructure for Internet services that dynamically predicts the temperature and performance impact of different thermal management reactions into the future, allowing the thermal management policy to select the best reaction at each point in time. We experimentally evaluate C-Oracle for thermal management policies based on load redistribution and dynamic voltage/frequency scaling in both single-tier and multi-tier services. Our results show that, regardless of management policy or service organization, C-Oracle enables non-trivial decisions that effectively manage thermal emergencies, while avoiding unnecessary performance degradation.
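The "predict, then choose" loop described above can be sketched as follows, assuming each candidate reaction comes with a predictor that returns a peak temperature and a throughput for the look-ahead window. The selection rule (highest throughput among reactions that stay below a temperature red line, otherwise the coolest) is an illustrative assumption, and the names `Prediction`, `pick_reaction`, and the reaction labels are hypothetical rather than C-Oracle's interface.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Prediction:
    peak_temp_c: float  # predicted hottest temperature over the look-ahead window
    throughput: float   # predicted requests served per second


def pick_reaction(reactions: dict[str, Callable[[], Prediction]],
                  red_line_c: float) -> str:
    """Return the reaction predicted to stay below the red line with the highest
    throughput; if none does, fall back to the coolest predicted reaction."""
    predictions = {name: predict() for name, predict in reactions.items()}
    safe = {n: p for n, p in predictions.items() if p.peak_temp_c < red_line_c}
    if safe:
        return max(safe, key=lambda n: safe[n].throughput)
    return min(predictions, key=lambda n: predictions[n].peak_temp_c)
```

In practice each predictor would run a fast thermal and performance model forward in time; here they are plain callables, so a policy can be tested with canned predictions such as "do nothing", "shift part of the load", or "lower the voltage/frequency".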
international symposium on circuits and systems | 2006
Henrique C. Freitas; Milene Barbosa Carvalho; Alexandre Marques Amaral; Amanda Rafaela Diniz; Carlos Augusto Paiva da Silva Martins; Luiz E. Ramos
This paper presents the proposal and development of a reconfigurable crossbar switch (RCS) architecture for network processors. Its main purpose is to increase performance and flexibility in environments with multiprocessors and computer clusters. The results include a VHDL simulation of the RCS and its use in implementing a broadcast function, as found in message-passing support middleware.
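A crossbar's reconfigurability can be modeled behaviorally as an output-to-input map in which every output has at most one driver; a broadcast is then a single configuration in which one input drives all other outputs in the same cycle. This is a minimal Python model under that assumption, not the paper's VHDL design; `CrossbarSwitch` and its methods are hypothetical.

```python
class CrossbarSwitch:
    """Toy N x N crossbar (hypothetical): each output port is driven by at most one
    input port, and reconfiguring the switch means rewriting the output -> input map."""

    def __init__(self, ports: int) -> None:
        self.ports = ports
        self.source = [None] * ports  # source[out] = input port driving `out`, or None

    def reset(self) -> None:
        """Disconnect every output."""
        self.source = [None] * self.ports

    def connect(self, inp: int, out: int) -> None:
        """Drive output `out` from input `inp`; an output cannot have two drivers."""
        current = self.source[out]
        if current is not None and current != inp:
            raise ValueError(f"output {out} is already driven by input {current}")
        self.source[out] = inp

    def broadcast(self, inp: int) -> None:
        """Reconfigure so that one input drives every other output at once."""
        self.reset()
        for out in range(self.ports):
            if out != inp:
                self.connect(inp, out)

    def transfer(self, words: dict[int, str]) -> dict[int, str]:
        """Deliver, in one switch cycle, each input's word to the outputs it drives."""
        return {out: words[inp] for out, inp in enumerate(self.source)
                if inp is not None and inp in words}
```

After `broadcast(0)` on a 4-port switch, one call to `transfer({0: "msg"})` reaches outputs 1, 2, and 3, whereas point-to-point messaging would need one send per destination.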
Concurrency and Computation: Practice and Experience | 2015
Alyson D. Pereira; Luiz E. Ramos; Luís Fabrício Wanderley Góes
The use of Graphics Processing Units (GPUs) for high‐performance computing has gained growing momentum in recent years. Unfortunately, GPU‐programming platforms like Compute Unified Device Architecture (CUDA) are complex and user‐unfriendly, and they increase the complexity of developing high‐performance parallel applications. In addition, runtime systems that execute those applications often fail to fully utilize the parallelism of modern CPU‐GPU systems. Typically, parallel kernels run entirely on the most powerful device available, leaving other devices idle. These observations sparked research in two directions: (1) high‐level approaches to software development for GPUs, which strike a balance between performance and ease of programming; and (2) task partitioning to fully utilize the available devices. In this paper, we propose a framework, called PSkel, that provides a single high‐level abstraction for stencil programming on heterogeneous CPU‐GPU systems, while allowing the programmer to partition and assign data and computation to both CPU and GPU. Our current implementation uses parallel skeletons to transparently leverage Intel Threading Building Blocks (Intel Corporation, Santa Clara, CA, USA) and NVIDIA CUDA (Nvidia Corporation, Santa Clara, CA, USA). In our experiments, we observed that parallel applications with task partitioning can improve average performance by up to 76% and 28% compared with CPU‐only and GPU‐only parallel applications, respectively.
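Task partitioning for a stencil amounts to splitting the output domain between devices while letting each part read the neighbours it needs from the shared input. The sketch below shows that idea on a 1D three-point averaging stencil in plain Python, assuming a single split ratio `gpu_fraction`; both halves actually run on the CPU here, and `stencil_step` and `partitioned_step` are hypothetical names, not PSkel's API.

```python
def stencil_step(grid: list[float], lo: int, hi: int) -> list[float]:
    """Apply a 3-point averaging stencil to cells lo..hi-1, reading
    neighbours from the full input grid and clamping at its edges."""
    n = len(grid)
    out = []
    for i in range(lo, hi):
        left = grid[max(i - 1, 0)]
        right = grid[min(i + 1, n - 1)]
        out.append((left + grid[i] + right) / 3.0)
    return out


def partitioned_step(grid: list[float], gpu_fraction: float) -> list[float]:
    """Split the output range between two workers. In a CPU-GPU runtime the two
    calls would run concurrently on different devices; here both run on the CPU."""
    split = int(len(grid) * (1.0 - gpu_fraction))
    cpu_part = stencil_step(grid, 0, split)          # would run on CPU threads
    gpu_part = stencil_step(grid, split, len(grid))  # would run as a GPU kernel
    return cpu_part + gpu_part
```

Because each part only reads the shared input and writes its own output range, the partitioned result matches the unpartitioned one for any split; choosing `gpu_fraction` to balance the devices' speeds is where the performance gain would come from.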
frontiers in education conference | 2002
Carlos Augusto Paiva da Silva Martins; João Batista T. Corrêa; Luís Fabrício Wanderley Góes; Luiz E. Ramos; Talles Henrique Medeiros
We present a new method for learning microprocessor architecture based on design and verification using functional simulation. Our main goals are to improve and optimize the learning process, motivating students to study and learn theoretical and practical aspects of microprocessor architecture, using functional simulators to validate the microprocessor design and to construct knowledge; and to develop research activities during an undergraduate course. Our method is based on constructivist learning theory, problem-based learning, group projects, the design of academic microprocessors as motivation for studying theory, and the verification of the designed microprocessors through functional simulators developed by the students. To validate the proposed method, we analyze two microprocessors and their functional simulators: a digital signal processor using ASIP and RISC concepts, and a RISC ASIP home automation processor. They were developed in a computer architecture course (computer science, PUC-Minas, Brazil) as an application of this method. In the conclusion, the students and the professor analyze the results, highlighting the main differences, advantages, and disadvantages of the new method.
international symposium on microarchitecture | 2012
Qingyuan Deng; Luiz E. Ramos; Ricardo Bianchini; David Meisner; Thomas F. Wenisch
Main memory accounts for a growing fraction of server energy usage. Investigating active low-power modes for managing main memory, with a system called MemScale, the authors offer a solution for performance-aware energy management. By creating a set of low-power modes, hardware mechanisms and software policies, MemScale trades memory bandwidth for energy savings while tightly limiting the associated performance impact.
Archive | 2005
Christiane V. Pousa; Luiz E. Ramos; Luís Fabrício Wanderley Góes; Carlos Augusto Paiva da Silva Martins
In this paper, we present a new version of ClusterSim (Cluster Simulation Tool), in which we included two new modules: Message-Passing (MP) and Distributed Shared Memory (DSM). ClusterSim supports the visual modeling and the simulation of clusters and their workloads for performance analysis. A modeled cluster is composed of single- or multiprocessor nodes, parallel job schedulers, network topologies, message-passing communications, distributed shared memory and technologies. A modeled workload is represented by users that submit jobs composed of tasks described by probability distributions and their internal structure (CPU, I/O, DSM and MPI instructions). Our main objectives in this paper are: to present a new version of ClusterSim with the inclusion of Message-Passing and Distributed Shared Memory simulation modules; to present the new software architecture and simulation model; to verify the proposal and implementation of MPI collective communication functions using different communication patterns (Message-Passing Module); to verify the proposal and implementation of DSM operations, consistency models and coherence protocols for object sharing (Distributed Shared Memory Module); and to analyze ClusterSim v. 1.1 by means of two case studies. Our main contributions are the inclusion of the Message-Passing and Distributed Shared Memory simulation modules, a more detailed simulation model of ClusterSim, and new features in the graphical environment.
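One common communication pattern for an MPI-style collective broadcast is the binomial tree: in each round, every process that already holds the data forwards it to one more process, so all `n` processes are reached in ⌈log₂ n⌉ rounds instead of the `n − 1` sequential sends of a linear broadcast from the root. The sketch below generates such a schedule as an illustration of the kind of pattern a simulator could model; it is not taken from ClusterSim, and `binomial_broadcast_schedule` is a hypothetical name.

```python
def binomial_broadcast_schedule(num_procs: int, root: int = 0) -> list[list[tuple[int, int]]]:
    """Return, per communication round, the (sender, receiver) pairs of a
    binomial-tree broadcast. Ranks are relabelled relative to `root`."""
    rounds = []
    step = 1
    while step < num_procs:
        pairs = []
        for rel_sender in range(step):      # relative ranks that already hold the data
            rel_receiver = rel_sender + step
            if rel_receiver < num_procs:
                pairs.append(((rel_sender + root) % num_procs,
                              (rel_receiver + root) % num_procs))
        rounds.append(pairs)
        step *= 2
    return rounds
```

For five processes rooted at rank 0, the schedule is `[[(0, 1)], [(0, 2), (1, 3)], [(0, 4)]]`: three rounds, with every non-root rank receiving the data exactly once.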
Concurrency and Computation: Practice and Experience | 2017
Rodrigo C. O. Rocha; Alyson D. Pereira; Luiz E. Ramos; Luís Fabrício Wanderley Góes
The stencil pattern is important in many scientific and engineering domains, spurring great interest from researchers and industry. In recent years, various optimizations have been proposed for parallel stencil applications running on graphics processing units (GPUs). In particular, tiling is a technique that can significantly enhance application performance by improving data locality and by reducing the volume of communication between host memory and GPU. In addition, tiling enables stencil applications to process inputs that are larger than the physical GPU memory. However, implementing tiling efficiently is complex, time‐consuming, and error‐prone. In this paper, we propose transparently optimized automatic stencil tiling (TOAST), an automatic tiling mechanism for iterative stencil computations running on GPUs. TOAST has three main benefits: (1) it incorporates an optimization model that seeks to maximize data reuse within tiles while respecting the amount of dynamically available GPU memory; (2) it offers a virtualized GPU memory for stencil computations, allowing for large input data; and (3) it performs optimal tiling transparently to the developer of the parallel stencil application. The current implementation of TOAST augments the PSkel framework with an internal solver based on genetic algorithms. Our experimental results show that TOAST improves the performance of iterative stencil applications by up to 13× compared with their multithreaded (central processing unit–based) optimized versions and up to 48× compared with a naive tiling approach on GPU. The TOAST mechanism automatically achieves a low percentage overhead of data management relative to the actual stencil computation.
Collaboration
Carlos Augusto Paiva da Silva Martins
Pontifícia Universidade Católica de Minas Gerais