Marc Gonzàlez
Polytechnic University of Catalonia
Publications
Featured research published by Marc Gonzàlez.
international conference on supercomputing | 2010
Ramon Bertran; Marc Gonzàlez; Xavier Martorell; Nacho Navarro; Eduard Ayguadé
Power modeling based on performance monitoring counters (PMCs) has attracted the interest of researchers because it is a quick approach to understanding and analysing power behavior on real systems. As a result, several power-aware policies use power models to guide their decisions and to trigger low-level mechanisms such as voltage and frequency scaling. Hence, power models that are informative, accurate and capable of detecting power phases are critical to broadening the opportunities for power-aware research and to improving the success of the power-saving techniques based on them. In addition, the design of current processors has varied considerably with the inclusion of multiple cores, with some resources shared on a single die. As a result, PMC-based power models warrant further investigation on current energy-efficient multi-core processors. In this paper, we present a methodology to produce decomposable PMC-based power models on current multicore architectures. Apart from estimating the power consumption accurately, the models provide per-component power consumption, supplying extra insight into power behavior. Moreover, we validate their responsiveness, that is, the capacity to detect power phases. Specifically, we produce a set of power models for an Intel® Core™ 2 Duo. We model one and two cores for a wide set of DVFS configurations. The models are empirically validated using the SPEC CPU2006 benchmark suite, and we compare them to models built using existing approaches. Overall, we demonstrate that the proposed methodology produces more accurate and responsive power models. Concretely, our models show an error range of 1.89% to 6% and almost 100% accuracy in detecting phase variations above 0.5 watts.
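To make the idea of a decomposable model concrete, its general form can be written as a static term plus one activity term per microarchitectural component; the component set and notation below are an illustrative sketch, not the paper's fitted model:

    P_{\text{total}} = P_{\text{static}} + \sum_{c \in C} w_c \cdot \frac{\mathrm{PMC}_c}{\text{cycles}}

where C is a set of components such as the front end, execution units and caches, PMC_c is a counter tracking the activity of component c, and the weights w_c are fitted by regression against measured power. Per-component consumption is then simply the individual term w_c · PMC_c/cycles, which is what makes the model decomposable.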
international workshop on openmp | 2009
Eduard Ayguadé; Rosa M. Badia; Daniel Cabrera; Alejandro Duran; Marc Gonzàlez; Francisco D. Igual; Daniel Jimenez; Jesús Labarta; Xavier Martorell; Rafael Mayo; Josep M. Perez; Enrique S. Quintana-Ortí
OpenMP has recently evolved towards expressing unstructured parallelism, targeting the parallelization of a broader range of applications in the current multicore era. Homogeneous multicore architectures from major vendors have become mainstream, but there are clear indications that a better performance/power ratio can be achieved using more specialized hardware (accelerators), such as SSE-based units or GPUs, clearly deviating from the easy-to-understand shared-memory homogeneous architectures. This paper investigates whether OpenMP can still survive in this new scenario and proposes a possible way to extend the current specification to reasonably integrate heterogeneity while preserving simplicity and portability. The paper builds on a previous proposal that extended tasking with dependencies. The runtime is in charge of data movement, task scheduling based on these data dependencies, and the appropriate selection of the target accelerator depending on system configuration and resource availability.
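A minimal sketch of the flavor of extension under discussion, assuming a hypothetical device clause layered on dependence-aware tasks (the clause names and spellings are illustrative, not the proposal's normative syntax):

    /* Hypothetical syntax: a task whose data movement is handled by the
     * runtime, which may off-load it to an accelerator if one is free. */
    #pragma omp task input(a[0:n]) output(b[0:n]) device(gpu, smp)
    void scale(const float *a, float *b, int n, float k)
    {
        for (int i = 0; i < n; i++)
            b[i] = k * a[i];
    }

Because the input and output clauses describe what the task reads and writes, the runtime can copy data to whichever device it selects and order tasks by their dependencies without further programmer intervention.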
international conference on supercomputing | 1999
Xavier Martorell; Eduard Ayguadé; Nacho Navarro; Julita Corbalan; Marc Gonzàlez; Jesús Labarta
This paper presents some techniques for efficient thread forking and joining in parallel execution environments, taking into consideration the physical structure of NUMA machines and the support for multi-level parallelization and processor grouping. Two work generation schemes and one join mechanism are designed, implemented, evaluated and compared with the ones used in the IRIX MP library, an efficient implementation which supports a single level of parallelism. Supporting multiple levels of parallelism is a current research goal, both in shared and distributed memory machines. Our proposals include a first work generation scheme (GWD, or global work descriptor) which supports multiple levels of parallelism, but not processor grouping. The second work generation scheme (LWD, or local work descriptor) has been designed to support multiple levels of parallelism and processor grouping. Processor grouping is needed to distribute processors among different parts of the computation and maintain the working set of each processor across different parallel constructs. The mechanisms are evaluated using synthetic benchmarks, two SPEC95fp applications and one NAS application. The performance evaluation concludes that: i) the overhead of the proposed mechanisms is similar to the overhead of the existing ones when exploiting a single level of parallelism, and ii) a remarkable improvement in performance is obtained for applications that have multiple levels of parallelism. The comparison with the traditional single-level parallelism exploitation gives an improvement in the range of 30-65% for these applications.
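The execution pattern that the LWD scheme supports can be sketched with standard OpenMP nesting; this uses today's portable API rather than the library's internal GWD/LWD interfaces, and work() is a placeholder:

    #include <omp.h>

    void work(int group, int member);   /* placeholder for real computation */

    void multilevel(int groups, int threads_per_group)
    {
        omp_set_nested(1);              /* allow parallelism inside a region */

        /* Outer level: one thread per processor group. */
        #pragma omp parallel num_threads(groups)
        {
            int group = omp_get_thread_num();

            /* Inner level: each group forks its own team, keeping each
             * processor's working set stable across parallel constructs. */
            #pragma omp parallel num_threads(threads_per_group)
            work(group, omp_get_thread_num());
        }
    }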
International Journal of Parallel Programming | 2010
Eduard Ayguadé; Rosa M. Badia; Pieter Bellens; Daniel Cabrera; Alejandro Duran; Roger Ferrer; Marc Gonzàlez; Francisco D. Igual; Daniel Jiménez-González; Jesús Labarta; Luis Martinell; Xavier Martorell; Rafael Mayo; Josep M. Perez; Judit Planas; Enrique S. Quintana-Ortí
This paper advances the state of the art in programming models for exploiting task-level parallelism on heterogeneous many-core systems, presenting a number of extensions to the OpenMP language inspired by the StarSs programming model. The proposed extensions allow the programmer to easily write portable code for a number of different platforms, relieving them from developing the specific code to off-load tasks to the accelerators and to synchronize those tasks. Our results, obtained from the StarSs instantiations for SMPs, the Cell, and GPUs, report reasonable parallel performance. However, the real impact of our approach is the productivity gain it yields for the programmer.
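A sketch of the annotation style these extensions enable, following StarSs conventions for the dependence clauses (the exact spellings here are assumptions for illustration):

    /* Annotated task declarations: the runtime derives the task graph
     * from the clauses and inserts the transfers and synchronization. */
    #pragma omp task output(c[0:n])
    void init(float *c, int n);

    #pragma omp task input(c[0:n]) inout(acc[0:1])
    void accumulate(const float *c, int n, float *acc);

    void pipeline(float *c, int n, float *acc)
    {
        init(c, n);             /* task 1 */
        accumulate(c, n, acc);  /* task 2, ordered after task 1 by the runtime */
        #pragma omp taskwait    /* wait for both before returning */
    }

The programmer never writes the device-specific off-load or copy code: that is exactly the burden the extensions shift onto the runtime.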
international conference on parallel architectures and compilation techniques | 2008
Marc Gonzàlez; Nikola Vujic; Xavier Martorell; Eduard Ayguadé; Alexandre E. Eichenberger; Tong Chen; Zehra Sura; Tao Zhang; Kevin O'Brien; Kathryn M. O'Brien
Ease of programming is one of the main impediments to the broad acceptance of multi-core systems with no hardware support for transparent data transfer between local and global memories. A software cache is a robust approach to providing the user with a transparent view of the memory architecture, but this software approach can suffer from poor performance. In this paper, we propose a hierarchical, hybrid software-cache architecture that classifies memory accesses at compile time into two classes, high-locality and irregular. Our approach then steers the memory references toward one of two cache structures optimized for their respective access patterns. The specific cache structures are designed to enable high-level compiler optimizations to aggressively unroll loops, reorder cache references, and/or transform surrounding loops so as to practically eliminate the software-cache overhead in the innermost loop. Performance evaluation indicates that the improvements due to the optimized software-cache structures, combined with the proposed code optimizations, translate into speedup factors of 3.5 to 8.4 compared to a traditional software-cache approach. As a result, we demonstrate that the Cell BE processor can be a competitive alternative to a modern server-class multi-core such as the IBM Power5 processor for a set of parallel NAS applications.
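A simplified sketch of the lookup that the high-locality cache structure makes cheap enough to hoist: once the compiler proves that the references of an unrolled loop fall within one cache line, the cost below is paid once per line rather than once per access (the geometry and names are illustrative, not the paper's design):

    #define NLINES     128          /* illustrative cache geometry */
    #define LINE_BYTES 1024

    struct line { unsigned long tag; int dirty; char data[LINE_BYTES]; };
    static struct line cache[NLINES];

    /* Provided elsewhere: DMA the line in, writing back the victim. */
    void miss_handler(struct line *l, unsigned long tag);

    /* Translate a global address into local storage. */
    static char *lookup(unsigned long gaddr)
    {
        unsigned long tag = gaddr / LINE_BYTES;
        unsigned long off = gaddr % LINE_BYTES;
        struct line *l = &cache[tag % NLINES];  /* direct-mapped for brevity */
        if (l->tag != tag)
            miss_handler(l, tag);
        return &l->data[off];
    }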
languages and compilers for parallel computing | 2007
Jairo Balart; Marc Gonzàlez; Xavier Martorell; Eduard Ayguadé; Zehra Sura; Tong Chen; Tao Zhang; Kevin O'Brien; Kathryn M. O'Brien
This paper describes the implementation of a runtime library for asynchronous communication on the Cell BE processor. The runtime library provides several services that allow the compiler to generate code that maximizes the chances of overlapping communication and computation. The library is organized as a software cache, and its main services correspond to mechanisms for data lookup, data placement and replacement, data write-back, memory synchronization, and address translation. The implementation guarantees that all of those services can be fully decoupled when dealing with memory references, giving the compiler the opportunity to organize the generated code so that computation overlaps communication as much as possible. The paper also describes the mechanisms necessary to overlap the communication related to write-back operations with actual computation, and includes a description of the compiler's basic algorithms and optimizations for code generation. The system is evaluated by measuring bandwidth and global update rates with two benchmarks from the HPCC benchmark suite: Stream and RandomAccess.
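The decoupling described above can be pictured as independent service calls that the compiler is free to schedule far apart, so that DMA proceeds while unrelated computation runs; the function names below are illustrative placeholders, not the library's actual interface:

    /* Illustrative, decoupled cache services. */
    void *cache_lookup(void *gaddr);    /* hit test + address translation */
    void  cache_place(void *gaddr);     /* start an asynchronous fetch    */
    void  cache_writeback(void *gaddr); /* start an asynchronous flush    */
    void  cache_sync(void);             /* wait for pending transfers     */

    void unrelated_compute(void);       /* placeholders for real work */
    void produce(void *in, void *out);

    void step(void *in, void *out)
    {
        cache_place(in);        /* kick off the transfer early ...      */
        unrelated_compute();    /* ... and compute while it proceeds    */
        cache_sync();           /* only now wait for the data           */
        produce(cache_lookup(in), out);
        cache_writeback(out);   /* the flush overlaps whatever follows  */
    }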
international conference on parallel processing | 1999
Eduard Ayguadé; Xavier Martorell; Jesús Labarta; Marc Gonzàlez; Nacho Navarro
Most current shared-memory parallel programming environments are based on thread packages that allow the exploitation of a single level of parallelism. These thread packages do not enable the spawning of new parallelism from within a previously activated parallel region. Current initiatives (such as OpenMP) include in their definition the exploitation of multiple levels of parallelism through the nesting of parallel constructs. This paper analyzes the requirements for efficient multi-level parallelization and reports conclusions gathered from the experience of parallelizing two benchmark applications. The underlying system is based on: i) an OpenMP compiler that accepts some extensions to the original definition, and ii) a user-level threads library that supports the exploitation of both fine-grain and multi-level parallelism.
languages and compilers for parallel computing | 2010
Roger Ferrer; Judit Planas; Pieter Bellens; Alejandro Duran; Marc Gonzàlez; Xavier Martorell; Rosa M. Badia; Eduard Ayguadé; Jesús Labarta
In this paper, we present OMPSs, a programming model based on OpenMP and StarSs that can also incorporate the use of OpenCL or CUDA kernels. We evaluate the proposal on three different architectures, SMP, Cell/B.E. and GPUs, showing the wide applicability of the approach. The evaluation is done with four different benchmarks: Matrix Multiply, BlackScholes, Perlin Noise, and Julia Set. We compare the results with the execution of the same benchmarks written in OpenCL on the same architectures. The results show that OMPSs greatly outperforms the OpenCL environment: it is more flexible in exploiting multiple accelerators and, due to the simplicity of its annotations, it increases programmer productivity.
grid computing | 2010
Ramon Bertran; Yolanda Becerra; David Carrera; Vicenç Beltran; Marc Gonzàlez; Xavier Martorell; Jordi Torres; Eduard Ayguadé
Virtualized infrastructure providers demand new methods to increase the accuracy of the accounting models used to charge their customers. Future data centers will be composed of many-core systems that will each host a large number of virtual machines (VMs). While resource-utilization accounting can be achieved with existing system tools, energy accounting is a complex task when per-VM granularity is the goal. In this paper, we propose a methodology that brings new opportunities to energy accounting by adding an unprecedented degree of accuracy to the per-VM measurements. We present a system, which leverages CPU and memory power models based on performance monitoring counters (PMCs), to perform energy accounting in virtualized systems. The contribution of this paper is twofold. First, we show that PMC-based power modeling methods are still valid in virtualized environments. Second, we introduce a novel methodology for accounting of the energy consumed in virtualized systems. Overall, the results for an Intel® Core™ 2 Duo show errors in the energy estimations below 5%. Such an approach brings flexibility to the chargeback models used by service and infrastructure providers. For instance, we show that VMs executed for the same amount of time can present differences of more than 20% in energy consumption, even when only the consumption of the CPU and the memory is taken into account.
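The accounting scheme can be summarized with a worked equation; this is a sketch of the general idea rather than the paper's exact formulation. If Q_i is the set of scheduling quanta during which VM_i ran, the energy attributed to it is

    E_{\mathrm{VM}_i} = \sum_{q \in Q_i} P\big(\mathrm{PMC}(q)\big)\,\Delta t_q

where PMC(q) are the counter values sampled over quantum q, P(·) is the PMC-based CPU-and-memory power model, and Δt_q is the quantum length. Because the counters are read per quantum, each VM is charged for the power its own activity induced rather than a flat share.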
international conference on supercomputing | 2005
Alejandro Duran; Marc Gonzàlez; Julita Corbalan
OpenMP is becoming the standard programming model for shared-memory parallel architectures. One of the most interesting features of the language is its support for nested parallelism. Previous research and parallelization experience have shown the benefits of using nested parallelism as an alternative to combining several programming models such as MPI and OpenMP. However, all of these works rely on the manual definition of an appropriate distribution of all the available threads across the different levels of parallelism. Some proposals have been made to extend the OpenMP language to allow programmers to specify the thread distribution. This paper proposes a mechanism to dynamically compute the most appropriate thread-distribution strategy. The mechanism is based on gathering information at runtime to derive the structure of the nested parallelism. This information is used to determine how the overall computation is distributed between the parallel branches in the outermost level of parallelism, which is kept constant in this work. Threads in the innermost level of parallelism are then distributed accordingly. The proposed mechanism is evaluated in two different environments: a research environment, the Nanos OpenMP research platform, and a commercial environment, the IBM XL runtime library. The performance numbers obtained validate the mechanism in both environments and show the importance of selecting the proper amount of parallelism in the outer level.
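A sketch of the proportional policy such a mechanism can implement once it has measured the work of each outer branch; the proportional weighting is an illustrative assumption, since the paper derives the distribution from its own runtime measurements:

    /* Distribute `total` threads among `n` outer branches in proportion
     * to the measured work of each branch. Assumes total >= n. */
    void distribute(const double work[], int n, int total, int out[])
    {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += work[i];

        int given = 0, heaviest = 0;
        for (int i = 0; i < n; i++) {
            out[i] = (int)(total * work[i] / sum);
            if (out[i] < 1)
                out[i] = 1;             /* every branch needs one thread */
            if (work[i] > work[heaviest])
                heaviest = i;
            given += out[i];
        }
        while (given < total) {         /* leftovers go to the heaviest */
            out[heaviest]++;
            given++;
        }
    }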