Publication


Featured research published by Marc Casas.


international conference on supercomputing | 2012

Fault resilience of the algebraic multi-grid solver

Marc Casas; Bronis R. de Supinski; Greg Bronevetsky; Martin Schulz

As HPC system sizes grow to millions of cores and chip feature sizes continue to decrease, HPC applications become increasingly exposed to transient hardware faults. These faults can cause aborts and performance degradation. Most importantly, they can corrupt results. Thus, we must evaluate the fault vulnerability of key HPC algorithms to develop cost-effective techniques that improve application resilience. We present an approach that analyzes the vulnerability of applications to faults, systematically reduces it by protecting the most vulnerable components, and predicts application vulnerability at large scales. We initially focus on sparse scientific applications and, in this paper, apply our approach to the Algebraic Multigrid (AMG) algorithm. We empirically analyze AMG's vulnerability to hardware faults in both sequential and parallel (hybrid MPI/OpenMP) executions on up to 1,600 cores, and we propose and evaluate the use of targeted pointer replication to reduce it. Our techniques increase AMG's resilience to transient hardware faults by 50-80% and improve its scalability in faulty computational environments by 35%. Further, we show how to model AMG's scalability in fault-prone environments to accurately predict execution times of large-scale runs.
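The targeted pointer replication evaluated in the paper can be sketched very simply. The fragment below is an illustration only, with assumed names, and is not the authors' implementation: it keeps three copies of a critical pointer and repairs a corrupted copy by majority vote before every dereference.

    #include <stdio.h>
    #include <stdint.h>

    /* Minimal sketch of targeted pointer replication (illustrative only, not
     * the paper's implementation): keep three copies of a critical pointer
     * and repair a silently corrupted copy by majority vote before use. */
    typedef struct {
        double *copy[3];   /* three replicas of the same pointer */
    } protected_ptr;

    static void protected_set(protected_ptr *p, double *target) {
        for (int i = 0; i < 3; i++)
            p->copy[i] = target;
    }

    /* Return the majority value and overwrite any disagreeing replica. */
    static double *protected_get(protected_ptr *p) {
        double *winner = p->copy[0];
        if (p->copy[1] == p->copy[2])   /* copies 1 and 2 agree */
            winner = p->copy[1];
        for (int i = 0; i < 3; i++)
            p->copy[i] = winner;        /* repair the outvoted replica */
        return winner;
    }

    int main(void) {
        double grid[4] = {1.0, 2.0, 3.0, 4.0};
        protected_ptr rows;
        protected_set(&rows, grid);

        /* Simulate a transient bit flip in one replica. */
        rows.copy[0] = (double *)((uintptr_t)rows.copy[0] ^ 0x40);

        double *safe = protected_get(&rows);  /* vote restores the valid pointer */
        printf("%f\n", safe[2]);              /* prints 3.000000 */
        return 0;
    }

This single-fault protection only pays off on the handful of pointers whose corruption is most damaging, which is why the paper targets replication at the most vulnerable components rather than replicating all data.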


ieee international conference on high performance computing data and analytics | 2010

Automatic Phase Detection and Structure Extraction of MPI Applications

Marc Casas; Rosa M. Badia; Jesús Labarta

In this paper we present an automatic system able to detect the internal structure of executions of high-performance computing applications. The system rules out non-significant regions of an execution, detects redundancies and, finally, selects small but significant execution regions. The detection process is based on spectral analysis (wavelet transform, Fourier transform, etc.) and works by detecting the most important frequencies of the application's execution. These main frequencies are strongly related to the internal loops of the application's source code. The automatic selection of small but significant execution regions shown in the paper considerably reduces the cost of the performance analysis process.
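To give a flavor of the frequency-detection step, the sketch below (illustrative only, not the authors' tool, and using a synthetic signal) applies a naive discrete Fourier transform to a sampled performance metric and reports the period of the strongest component, which would correspond to the main iterative loop of the application.

    #include <stdio.h>
    #include <math.h>

    #define N  256                       /* samples of the execution signal */
    #define PI 3.14159265358979323846

    /* Minimal sketch: find the dominant frequency of a sampled performance
     * signal (e.g., a metric per time bin) with a naive DFT. The strongest
     * non-zero bin reveals the main iterative period of the application. */
    int main(void) {
        double signal[N];
        /* Synthetic signal standing in for a trace metric: a strong 32-sample
         * period plus a weaker 8-sample component and a constant offset. */
        for (int n = 0; n < N; n++)
            signal[n] = 3.0 + sin(2.0 * PI * n / 32.0)
                            + 0.3 * sin(2.0 * PI * n / 8.0);

        int best_k = 1;
        double best_mag = 0.0;
        for (int k = 1; k < N / 2; k++) {        /* skip k = 0 (the mean) */
            double re = 0.0, im = 0.0;
            for (int n = 0; n < N; n++) {
                re += signal[n] * cos(2.0 * PI * k * n / N);
                im -= signal[n] * sin(2.0 * PI * k * n / N);
            }
            double mag = sqrt(re * re + im * im);
            if (mag > best_mag) { best_mag = mag; best_k = k; }
        }
        printf("dominant period: %d samples\n", N / best_k);  /* prints 32 */
        return 0;
    }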


european conference on parallel processing | 2007

Automatic structure extraction from MPI applications tracefiles

Marc Casas; Rosa M. Badia; Jesús Labarta

Obtaining useful tracefiles of message-passing applications for performance analysis on supercomputers is a long and tedious task. When using hundreds or thousands of processors, the tracefile size can grow to 10 or 20 GB, so analyzing or even storing these large traces becomes a problem. The methodology we have developed and implemented performs an automatic analysis that can be applied to such huge tracefiles: it extracts their internal structure and selects the meaningful parts of the trace. The paper presents the methodology and the results we have obtained on real applications.


international conference on supercomputing | 2008

Automatic analysis of speedup of MPI applications

Marc Casas; Rosa M. Badia; Jesús Labarta

The complexity of high-performance computing applications has grown very quickly in recent years. Only skilled analysts are able to determine the factors that undermine the performance of up-to-date applications. Analyst time is a very expensive resource and, for that reason, the scientific community has made a strong effort to develop automatic performance analysis methodologies. In this paper, we propose a methodology that automatically detects the main performance problems of an application. It is based on, first, a size reduction of the performance data obtained from the executions and, second, an analytical model, obtained from this reduced data, that fits the speedup of the application in terms of several parameters related to distinct performance issues. The paper also shows results obtained from real up-to-date applications and validates the conclusions automatically derived by the methodology.
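The paper's model uses several parameters; as a much simplified illustration of fitting a speedup model to measured data, the sketch below fits the single-parameter Amdahl's law S(p) = 1 / (f + (1 - f)/p) by linear least squares. The data points are hypothetical and the model is a stand-in, not the paper's multi-factor formulation.

    #include <stdio.h>

    /* Illustrative sketch only: fit Amdahl's law S(p) = 1 / (f + (1 - f)/p)
     * to measured speedups by linear least squares on the serial fraction f,
     * then use the fitted model to extrapolate to larger core counts. */
    int main(void) {
        /* Hypothetical measurements: core counts and observed speedups. */
        double procs[]   = { 2, 4, 8, 16, 32 };
        double speedup[] = { 1.9, 3.6, 6.4, 10.5, 15.2 };
        int m = 5;

        /* Rearranged model: 1/S - 1/p = f * (1 - 1/p); solve for f. */
        double num = 0.0, den = 0.0;
        for (int i = 0; i < m; i++) {
            double x = 1.0 - 1.0 / procs[i];
            double y = 1.0 / speedup[i] - 1.0 / procs[i];
            num += x * y;
            den += x * x;
        }
        double f = num / den;
        printf("estimated serial fraction: %.3f\n", f);
        printf("predicted speedup on 128 cores: %.1f\n",
               1.0 / (f + (1.0 - f) / 128.0));
        return 0;
    }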


IEEE Micro | 2011

Simulating Whole Supercomputer Applications

J. Gonzalez; Marc Casas; Judit Gimenez; Miquel Moreto; Alex Ramirez; Jesús Labarta; Mateo Valero

Detailed simulations of large-scale MPI parallel applications are extremely time consuming and resource intensive. A new methodology that combines signal processing and data mining techniques with multilevel simulation reduces the simulated data by several orders of magnitude. This reduction makes detailed software performance analysis and accurate performance prediction possible in a reasonable time.


ACM Transactions on Architecture and Code Optimization | 2016

PARSECSs: Evaluating the Impact of Task Parallelism in the PARSEC Benchmark Suite

Dimitrios Chasapis; Marc Casas; Miquel Moreto; Raul Vidal; Eduard Ayguadé; Jesús Labarta; Mateo Valero

In this work, we show how parallel applications can be implemented efficiently using task parallelism, and we evaluate the benefits of such a parallel paradigm with respect to other approaches. We use the PARSEC benchmark suite as our test bed, which includes applications representative of a wide range of domains, from HPC to desktop and server applications. We adopt different parallelization techniques, tailored to the needs of each application, to fully exploit the task-based model. Our evaluation shows that task parallelism achieves better performance than thread-based parallelization models such as Pthreads. Our experimental results show that we can obtain scalability improvements of up to 42% on a 16-core system and code size reductions of up to 81%. These reductions are achieved by removing application-specific schedulers or thread pooling systems from the source code and transferring those responsibilities to the runtime system software.
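As a flavor of the programming style being evaluated (the actual ports use OmpSs/OpenMP tasks; the snippet below is only a generic OpenMP illustration, not PARSECSs code), a blocked loop can be expressed as tasks handed to the runtime scheduler instead of a hand-written thread pool.

    #include <stdio.h>
    #include <omp.h>

    #define N     8
    #define BLOCK 1024

    /* Each block of work becomes a task; the runtime scheduler takes over the
     * role of the thread pools removed from the original Pthreads versions. */
    static void process_block(double *block, int n) {
        for (int i = 0; i < n; i++)
            block[i] = block[i] * 2.0 + 1.0;
    }

    int main(void) {
        static double data[N][BLOCK];   /* zero-initialized */

        #pragma omp parallel
        #pragma omp single
        {
            for (int b = 0; b < N; b++) {
                #pragma omp task firstprivate(b)
                process_block(data[b], BLOCK);
            }
            #pragma omp taskwait        /* wait for all block tasks */
        }

        printf("data[0][0] = %.1f\n", data[0][0]);   /* prints 1.0 */
        return 0;
    }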


international symposium on computer architecture | 2015

Coherence protocol for transparent management of scratchpad memories in shared memory manycore architectures

Lluc Alvarez; Lluis Vilanova; Miquel Moreto; Marc Casas; Marc Gonzàlez; Xavier Martorell; Nacho Navarro; Eduard Ayguadé; Mateo Valero

The increasing number of cores in manycore architectures causes important power and scalability problems in the memory subsystem. One solution is to introduce scratchpad memories alongside the cache hierarchy, forming a hybrid memory system. Scratchpad memories are more power-efficient than caches and they do not generate coherence traffic, but they suffer from poor programmability. A good way to hide these programmability difficulties from the programmer is to give the compiler the responsibility of generating the code that manages the scratchpad memories. Unfortunately, compilers cannot generate this code in the presence of random memory accesses with unknown aliasing hazards. This paper proposes a coherence protocol for the hybrid memory system that allows the compiler to always generate code to manage the scratchpad memories. In coordination with the compiler, memory accesses that may touch stale copies of data are identified and diverted to the valid copy of the data. The proposal allows the architecture to be exposed to the programmer as a shared memory manycore, maintaining the programming simplicity of shared memory models and preserving backwards compatibility. In a 64-core manycore, the coherence protocol adds overheads of 4% in performance, 8% in network traffic and 9% in energy consumption to enable the use of the hybrid memory system, which, compared to a cache-based system, achieves a speedup of 1.14x and reduces on-chip network traffic and energy consumption by 29% and 17%, respectively.


ieee international conference on high performance computing data and analytics | 2015

Exploiting asynchrony from exact forward recovery for DUE in iterative solvers

Luc Jaulmes; Marc Casas; Miquel Moreto; Eduard Ayguadé; Jesús Labarta; Mateo Valero

This paper presents a method to protect iterative solvers from Detected and Uncorrected Errors (DUE), relying on error detection techniques already available in commodity hardware. Detection operates at the memory page level, which enables the use of simple algorithmic redundancies to correct errors. Such redundancies would be inapplicable under coarse-grained error detection, but become very powerful when the hardware is able to detect errors precisely. Relations straightforwardly extracted from the solver allow lost data to be recovered exactly. This method is free of the overheads of backward recoveries such as checkpointing, and it does not compromise the mathematical convergence properties of the solver, as restarting would. We apply this recovery to three widely used Krylov subspace methods, CG, GMRES and BiCGStab, and their preconditioned versions. We implement our resilience techniques on CG, considering scenarios from small (8 cores) to large (1,024 cores) scales, and demonstrate very low overheads compared with state-of-the-art solutions. We deploy our recovery techniques either by overlapping them with algorithmic computations or by forcing them onto the critical path of the application. A trade-off exists between the two approaches, depending on the error rate the solver is suffering. Under realistic error rates, overlapping decreases overheads from 5.37% down to 3.59% for a non-preconditioned CG on 8 cores.
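The forward-recovery idea can be illustrated on a tiny dense system (the paper works on sparse matrices at memory-page granularity; this sketch is only an illustrative toy): if entries of the residual r are lost, they can be recomputed exactly from the invariant r = b - Ax using data that is still intact, without restarting or rolling back.

    #include <stdio.h>

    #define N 4

    /* Recompute a lost block of the residual from the invariant r = b - A*x.
     * The recovered values are exact, so convergence is unaffected. */
    static void recover_residual(const double A[N][N], const double b[N],
                                 const double x[N], double r[N],
                                 int lost_first, int lost_last) {
        for (int i = lost_first; i <= lost_last; i++) {
            double ri = b[i];
            for (int j = 0; j < N; j++)
                ri -= A[i][j] * x[j];
            r[i] = ri;
        }
    }

    int main(void) {
        double A[N][N] = {{4,1,0,0},{1,4,1,0},{0,1,4,1},{0,0,1,4}};
        double b[N] = {1, 2, 3, 4};
        double x[N] = {0.1, 0.2, 0.3, 0.4};    /* current iterate */
        double r[N];

        /* Consistent residual for the current iterate. */
        recover_residual(A, b, x, r, 0, N - 1);

        r[1] = r[2] = -999.0;                  /* simulate a lost page holding r[1..2] */
        recover_residual(A, b, x, r, 1, 2);    /* rebuild only the lost entries */

        printf("r = %.2f %.2f %.2f %.2f\n", r[0], r[1], r[2], r[3]);
        return 0;
    }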


european conference on parallel processing | 2015

Runtime-aware architectures

Marc Casas; Miquel Moreto; Lluc Alvarez; Emilio Castillo; Dimitrios Chasapis; Timothy Hayes; Luc Jaulmes; Oscar Palomar; Osman S. Unsal; Adrián Cristal; Eduard Ayguadé; Jesús Labarta; Mateo Valero

In the last few years, the traditional ways of keeping hardware performance growing at the rate predicted by Moore's Law have vanished. When uni-cores were the norm, hardware design was decoupled from the software stack thanks to a well-defined Instruction Set Architecture (ISA). This simple interface allowed applications to be developed without worrying too much about the underlying hardware, while hardware designers could aggressively exploit instruction-level parallelism (ILP) in superscalar processors. Current multi-cores are designed as simple symmetric multiprocessors (SMP) on a chip. However, we believe that this is not enough to overcome all the problems that multi-cores face. The runtime system of the parallel programming model has to drive the design of future multi-cores to overcome their restrictions in terms of power, memory, programmability and resilience. In this paper, we introduce an approach towards a Runtime-Aware Architecture (RAA), a massively parallel architecture designed from the runtime's perspective.


international parallel and distributed processing symposium | 2014

Active Measurement of Memory Resource Consumption

Marc Casas; Greg Bronevetsky

Hierarchical memory is a cornerstone of modern hardware design because it provides high memory performance and capacity at a low cost. However, the use of multiple levels of memory and complex cache management policies makes it very difficult to optimize the performance of applications running on hierarchical memories. As the number of compute cores per chip continues to rise faster than the total amount of available memory, applications will become increasingly starved for memory storage capacity and bandwidth, making the problem of performance optimization even more critical. We propose a new methodology for measuring and modeling the performance of hierarchical memories in terms of the application's utilization of the key memory resources: the capacity of a given memory level and the bandwidth between two levels. This is done by actively interfering with the application's use of these resources. The application's sensitivity to reduced resource availability is measured by observing the effect of interference on application performance. The resulting resource-oriented model of performance both greatly simplifies application performance analysis and makes it possible to predict an application's performance when running under various resource constraints. This is useful for predicting performance on future memory-constrained architectures.
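A minimal sketch of the interference side of this methodology is shown below (illustrative only, not the paper's tool): a kernel streams through a buffer of configurable size to occupy cache capacity and memory bandwidth; co-running it with an application at several buffer sizes and recording the application's slowdown at each size yields its sensitivity curve.

    #include <stdio.h>
    #include <stdlib.h>

    /* Interference kernel sketch: sustain pressure on cache capacity and
     * memory bandwidth by repeatedly streaming through a buffer whose size
     * is given on the command line (in MiB, default 64). */
    int main(int argc, char **argv) {
        size_t mib = (argc > 1) ? (size_t)atol(argv[1]) : 64;
        size_t n = mib * 1024 * 1024 / sizeof(long);
        long *buf = calloc(n, sizeof(long));
        if (!buf) return 1;

        volatile long sink = 0;
        for (int pass = 0; pass < 100; pass++)   /* keep the pressure sustained */
            for (size_t i = 0; i < n; i += 8)    /* one access per 64-byte line */
                sink += buf[i];

        free(buf);
        return (int)(sink & 1);                  /* defeat dead-code elimination */
    }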

Collaboration


Explore Marc Casas's collaborations.

Top Co-Authors

Miquel Moreto (Polytechnic University of Catalonia)
Mateo Valero (Polytechnic University of Catalonia)
Jesús Labarta (Barcelona Supercomputing Center)
Eduard Ayguadé (Barcelona Supercomputing Center)
Rosa M. Badia (Barcelona Supercomputing Center)
Lluc Alvarez (Polytechnic University of Catalonia)
Greg Bronevetsky (Lawrence Livermore National Laboratory)
Alejandro Rico (Barcelona Supercomputing Center)
Dimitrios Chasapis (Polytechnic University of Catalonia)