
Publications


Featured research published by Carlos H. Andrade Costa.


IBM Journal of Research and Development | 2015

Active Memory Cube: A processing-in-memory architecture for exascale systems

Ravi Nair; Samuel F. Antao; Carlo Bertolli; Pradip Bose; José R. Brunheroto; Tong Chen; Chen-Yong Cher; Carlos H. Andrade Costa; J. Doi; Constantinos Evangelinos; Bruce M. Fleischer; Thomas W. Fox; Diego S. Gallo; Leopold Grinberg; John A. Gunnels; Arpith C. Jacob; P. Jacob; Hans M. Jacobson; Tejas Karkhanis; Choon Young Kim; Jaime H. Moreno; John Kevin Patrick O'Brien; Martin Ohmacht; Yoonho Park; Daniel A. Prener; Bryan S. Rosenburg; Kyung Dong Ryu; Olivier Sallenave; Mauricio J. Serrano; Patrick Siegl

Many studies point to the difficulty of scaling existing computer architectures to meet the needs of an exascale system (i.e., capable of executing 10^18 floating-point operations per second), consuming no more than 20 MW in power, by around the year 2020. This paper outlines a new architecture, the Active Memory Cube, which reduces the energy of computation significantly by performing computation in the memory module, rather than moving data through large memory hierarchies to the processor core. The architecture leverages a commercially demonstrated 3D memory stack called the Hybrid Memory Cube, placing sophisticated computational elements on the logic layer below its stack of dynamic random-access memory (DRAM) dies. The paper also describes an Active Memory Cube tuned to the requirements of a scientific exascale system. The computational elements have a vector architecture and are capable of performing a comprehensive set of floating-point and integer instructions, predicated operations, and gather-scatter accesses across memory in the Cube. The paper outlines the software infrastructure used to develop applications and to evaluate the architecture, and describes results of experiments on application kernels, along with performance and power projections.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2014

A system software approach to proactive memory-error avoidance

Carlos H. Andrade Costa; Yoonho Park; Bryan S. Rosenburg; Chen-Yong Cher; Kyung Dong Ryu

Today's HPC systems use two mechanisms to address main-memory errors. Error-correcting codes make correctable errors transparent to software, while checkpoint/restart (CR) enables recovery from uncorrectable errors. Unfortunately, CR overhead will be enormous at exascale due to the high failure rate of memory. We propose a new OS-based approach that proactively avoids memory errors using prediction. This scheme exposes correctable-error information to the OS, which migrates pages and offlines unhealthy memory to avoid application crashes. We analyze memory error patterns in extensive logs from a BG/P system and show how correctable-error patterns can be used to identify memory likely to fail. We implement a proactive memory management system on BG/Q by extending the firmware and Linux. We evaluate our approach with a realistic workload and compare our overhead against CR. We show improved resilience with negligible performance overhead for applications.


IEEE Transactions on Parallel and Distributed Systems | 2017

Leveraging Adaptive I/O to Optimize Collective Data Shuffling Patterns for Big Data Analytics

Bogdan Nicolae; Carlos H. Andrade Costa; Claudia Misale; Kostas Katrinis; Yoonho Park

Big data analytics is an indispensable tool in transforming science, engineering, medicine, health-care, finance and ultimately business itself. With the explosion of data sizes and need for shorter time-to-solution, in-memory platforms such as Apache Spark gain increasing popularity. In this context, data shuffling, a particularly difficult transformation pattern, introduces important challenges. Specifically, data shuffling is a key component of complex computations that has a major impact on the overall performance and scalability. Thus, speeding up data shuffling is a critical goal. To this end, state-of-the-art solutions often rely on overlapping the data transfers with the shuffling phase. However, they employ simple mechanisms to decide how much data to fetch and where to fetch it from, which leads to sub-optimal performance and excessive auxiliary memory utilization for the purpose of prefetching. The latter aspect is a growing concern, given evidence that memory per computation unit is continuously decreasing while interconnect bandwidth is increasing. This paper contributes a novel shuffle data transfer strategy that addresses the two aforementioned dimensions by dynamically adapting the prefetching to the computation. We implemented this novel strategy in Spark, a popular in-memory data analytics framework. To demonstrate the benefits of our proposal, we run extensive experiments on an HPC cluster with large core count per node. Compared with the default Spark shuffle strategy, our proposal shows up to 40 percent better performance with 50 percent less memory utilization for buffering, and excellent weak scalability.


Cluster Computing and the Grid | 2016

Towards Memory-Optimized Data Shuffling Patterns for Big Data Analytics

Bogdan Nicolae; Carlos H. Andrade Costa; Claudia Misale; Kostas Katrinis; Yoonho Park

Big data analytics is an indispensable tool in transforming science, engineering, medicine, healthcare, finance and ultimately business itself. With the explosion of data sizes and need for shorter time-to-solution, in-memory platforms such as Apache Spark gain increasing popularity. However, this introduces important challenges, among which data shuffling is particularly difficult: on one hand it is a key part of the computation that has a major impact on the overall performance and scalability, so its efficiency is paramount, while on the other hand it needs to operate with scarce memory in order to leave as much memory available for data caching. In this context, efficient scheduling of data transfers such that it addresses both dimensions of the problem simultaneously is non-trivial. State-of-the-art solutions often rely on simple approaches that yield suboptimal performance and resource usage. This paper contributes a novel shuffle data transfer strategy that dynamically adapts to the computation with minimal memory utilization, which we briefly underline as a series of design principles.


International Conference on Bioinformatics | 2017

SparkGA: A Spark Framework for Cost Effective, Fast and Accurate DNA Analysis at Scale

Hamid Mushtaq; Frank Liu; Carlos H. Andrade Costa; Gang Liu; Peter Hofstee; Zaid Al-Ars

In recent years, the cost of NGS (Next Generation Sequencing) technology has dramatically reduced, making it a viable method for diagnosing genetic diseases. The large amount of data generated by NGS technology, usually in the order of hundreds of gigabytes per experiment, has to be analyzed quickly to generate meaningful variant results. The GATK best practices pipeline from the Broad Institute is one of the most popular computational pipelines for DNA analysis. Many components of the GATK pipeline, however, are not very parallelizable. In this paper, we present a parallel implementation of a DNA analysis pipeline based on the big data Apache Spark framework. This implementation is highly scalable and capable of parallelizing computation by utilizing data-level parallelism as well as load balancing techniques. In order to reduce the analysis cost, the framework can run on nodes with as little memory as 16 GB. For whole genome sequencing experiments, we show that the runtime can be reduced to about 1.5 hours on a 20-node cluster with an accuracy of up to 99.9981%. Our solution is about 71% faster than other state-of-the-art solutions while also being more accurate. The source code of the software described in this paper is publicly available at https://github.com/HamidMushtaq/SparkGA1.git.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2016

Performance Analysis of Spark/GraphX on POWER8 Cluster

Xinyu Que; Lars Schneidenbach; Fabio Checconi; Carlos H. Andrade Costa; Daniele Buono

POWER8, the latest RISC (Reduced Instruction Set Computer) microprocessor of the IBM Power architecture family, was designed to significantly benefit emerging workloads, including business analytics, cloud computing and high performance computing. In this paper, we provide a thorough performance evaluation of a widely used large-scale graph processing framework, Spark/GraphX, on a POWER8 cluster. Note that we use Spark and Java versions out of the box without any optimization. We examine the performance with several important graph kernels such as Breadth-First Search, Connected Components, and PageRank, using both large real-world social graphs and synthetic graphs of billions of edges. We study the Spark/GraphX performance against several architectural aspects and perform the first Spark/GraphX scalability test with up to 16 POWER8 nodes.


Computer and Information Technology | 2016

Application Characterization Assisted System Design

I-Hsin Chung; Carlos H. Andrade Costa; Leopold Grinberg; Hui-Fang Wen

The understanding of application characteristics such as hardware resource requirements and communication patterns is key to building highly utilized high performance computing systems for target workloads at a reasonable cost and with available technology. The characterization drives the design decisions of both hardware and software. Memory access pattern is a key factor, as data movement is a major challenge in today's system design. In this paper, we describe how memory characterization unveils key differences among applications and helps to shape the system design space. With examples, we show how memory characterization can be used in hardware architecture design and application enablement to achieve better performance and system utilization.


Archive | 2014

METHODS, APPARATUS AND SYSTEM FOR SELECTIVE DUPLICATION OF SUBTASKS

Carlos H. Andrade Costa; Chen-Yong Cher; Yoonho Park; Bryan S. Rosenburg; Kyung Dong Ryu


Archive | 2013

METHODS, APPARATUS AND SYSTEM FOR NOTIFICATION OF PREDICTABLE MEMORY FAILURE

Chen-Yong Cher; Carlos H. Andrade Costa; Yoonho Park; Bryan S. Rosenburg; Kyung Dong Ryu


Archive | 2016

METHOD AND APPARATUS FOR SELECTIVE AND POWER-AWARE MEMORY ERROR PROTECTION AND MEMORY MANAGEMENT

Carlos H. Andrade Costa; Chen-Yong Cher; Yoonho Park; Bryan S. Rosenburg; Kyung Dong Ryu
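Several of the works listed above describe an OS-level scheme that tracks correctable-error reports and migrates pages away from unhealthy memory before an uncorrectable error can crash an application. A minimal sketch of such a threshold-based frame-retirement policy follows; the class, the threshold value, and all data structures are illustrative assumptions, not taken from the papers.

```python
# Hedged sketch of threshold-based proactive page offlining.
# All names and the threshold value are illustrative, not from the papers.

CE_THRESHOLD = 3  # correctable errors tolerated before a frame is retired

class MemoryManager:
    def __init__(self, num_frames):
        self.ce_count = {f: 0 for f in range(num_frames)}   # correctable-error tally
        self.offline = set()                                # retired frames
        self.page_of_frame = {}                             # frame -> resident logical page

    def map_page(self, page, frame):
        self.page_of_frame[frame] = page

    def report_correctable_error(self, frame):
        """Called when hardware reports a corrected (masked) memory error."""
        self.ce_count[frame] += 1
        if self.ce_count[frame] >= CE_THRESHOLD and frame not in self.offline:
            self._retire(frame)

    def _retire(self, frame):
        # Migrate the resident page to a healthy, unused frame,
        # then take the suspect frame offline.
        healthy = next(f for f in self.ce_count
                       if f not in self.offline and f not in self.page_of_frame)
        page = self.page_of_frame.pop(frame, None)
        if page is not None:
            self.page_of_frame[healthy] = page  # "copy" the page, then remap
        self.offline.add(frame)
```

In this toy model, three corrected errors on frame 0 would move its page to a healthy frame and retire frame 0, so a later uncorrectable error there never reaches the application.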

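The shuffle papers above argue for adapting the prefetching of remote shuffle blocks to the pace of computation, so that auxiliary buffering stays bounded while transfers still overlap with processing. One way to picture that idea is a fixed-capacity prefetch buffer that refills only as the consumer drains it; the class, block naming, and capacity below are illustrative assumptions, not the papers' actual mechanism.

```python
# Hedged sketch of memory-bounded shuffle prefetching: remote blocks are
# fetched ahead of the consumer only while a fixed-size buffer has room,
# so the prefetch rate adapts to the computation rate.

from collections import deque

class BoundedPrefetcher:
    def __init__(self, blocks, fetch, capacity=2):
        self.pending = deque(blocks)   # remote shuffle blocks still to fetch
        self.buffer = deque()          # prefetched but not yet consumed
        self.fetch = fetch             # callable: block id -> block data
        self.capacity = capacity       # cap on auxiliary prefetch memory

    def _fill(self):
        # Prefetch only while buffered data stays under the memory cap.
        while self.pending and len(self.buffer) < self.capacity:
            self.buffer.append(self.fetch(self.pending.popleft()))

    def __iter__(self):
        self._fill()
        while self.buffer:
            yield self.buffer.popleft()  # consumer takes one block...
            self._fill()                 # ...which frees room to fetch more

# Usage: at most `capacity` blocks are ever buffered, yet all are delivered in order.
fetched = []
def fetch(block_id):
    fetched.append(block_id)
    return f"data-{block_id}"

out = list(BoundedPrefetcher(["b0", "b1", "b2", "b3"], fetch, capacity=2))
```

Replacing the fixed `capacity` with a value tuned to the observed consumption rate is, loosely, the adaptive dimension the papers explore.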