Alexandra Carpen-Amarie
Vienna University of Technology
Publication
Featured research published by Alexandra Carpen-Amarie.
Future Generation Computer Systems | 2016
Shadi Ibrahim; Tien-Dat Phan; Alexandra Carpen-Amarie; Houssem-Eddine Chihoub; Diana Moise; Gabriel Antoniu
With increasingly inexpensive storage and growing processing power, the cloud has rapidly become the environment of choice to store and analyze data for a variety of applications. Most large-scale data computations in the cloud heavily rely on the MapReduce paradigm and on its Hadoop implementation. Nevertheless, this exponential growth in popularity has significantly impacted power consumption in cloud infrastructures. In this paper, we focus on MapReduce processing and we investigate the impact of dynamically scaling the frequency of compute nodes on the performance and energy consumption of a Hadoop cluster. To this end, a series of experiments are conducted to explore the implications of Dynamic Voltage and Frequency Scaling (DVFS) settings on power consumption in Hadoop clusters. By enabling various existing DVFS governors (i.e., performance, powersave, ondemand, conservative and userspace) in a Hadoop cluster, we observe significant variation in performance and power consumption across different applications: no single DVFS setting is optimal for all of the representative MapReduce applications. Furthermore, our results reveal that the current CPU governors do not exactly reflect their design goal and may even become ineffective at managing the power consumption in Hadoop clusters. This study aims at providing a clearer understanding of the interplay between performance and power management in Hadoop clusters and therefore offers useful insight into designing power-aware techniques for Hadoop systems. Highlights: an overview of the state of the art for energy efficiency in Hadoop; an evaluation of a set of representative MapReduce workloads with different DVFS settings and governors; a comprehensive analysis of the impact of DVFS settings and governors on Hadoop's performance and energy efficiency; and a discussion of application behavior sensitivity to the parameters employed by the ondemand and conservative governors.
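For context, the governors named above are the standard Linux cpufreq policies. The sketch below is only an illustration (not the tooling used in the paper) of how a governor can be switched on every core through the sysfs interface; it assumes a cpufreq driver that exposes scaling_governor and a process with sufficient privileges.

```c
/* Sketch: switch the cpufreq governor on every core via sysfs.
 * Assumes a Linux cpufreq driver exposing scaling_governor and
 * sufficient privileges (typically root). */
#include <stdio.h>
#include <stdlib.h>

static int set_governor(int cpu, const char *governor)
{
    char path[256];
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_governor", cpu);
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;                     /* CPU absent or no permission */
    fprintf(f, "%s\n", governor);
    fclose(f);
    return 0;
}

int main(int argc, char **argv)
{
    const char *governor = (argc > 1) ? argv[1] : "ondemand";
    int cpu;
    /* try CPUs 0..127; stop at the first one that cannot be configured */
    for (cpu = 0; cpu < 128; cpu++) {
        if (set_governor(cpu, governor) != 0)
            break;
    }
    printf("governor '%s' applied to %d CPU(s)\n", governor, cpu);
    return 0;
}
```

A typical invocation would be, for example, `./set_governor powersave` on each node of the cluster before launching a workload.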
Proceedings of the 21st European MPI Users' Group Meeting on | 2014
Sascha Hunold; Alexandra Carpen-Amarie; Jesper Larsson Träff
The Message Passing Interface (MPI) is the prevalent programming model for supercomputers. Optimizing the performance of individual MPI functions is therefore of great interest for the HPC community. However, a fair comparison of different algorithms and implementations requires a statistically sound analysis. It is often overlooked that the time to complete an MPI communication function depends not only on internal factors, such as the algorithm, but also on external factors, such as system noise. Most noise produced by the system is uncontrollable without changing the software stack, e.g., the memory allocation method used by the operating system. Possibly controllable factors have not yet been identified as such in this context. We investigate whether several possible factors---which have been discovered in other micro-benchmarks---have a significant effect on the execution time of MPI functions. We experimentally and statistically show that results obtained with other common benchmarking methods for MPI functions can be misleading when comparing alternatives. To overcome these issues, we explain how to carefully design MPI micro-benchmarking experiments and how to make a fair, statistically sound comparison of MPI implementations.
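To illustrate the measurement problem, the minimal sketch below (a common benchmarking convention, not the authors' harness) times MPI_Bcast after an MPI_Barrier and reports the maximum local time across processes; these are exactly the kinds of design choices that the paper shows can bias a comparison.

```c
/* Sketch: a naive MPI_Bcast micro-benchmark. Barrier-based process
 * synchronization and the max-reduction of local times are common
 * conventions whose pitfalls this line of work examines. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    const int nreps = 1000, count = 1024;
    double t_start, t_local, t_max = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *buf = malloc(count * sizeof(double));
    for (int j = 0; j < count; j++)
        buf[j] = 0.0;

    for (int i = 0; i < nreps; i++) {
        MPI_Barrier(MPI_COMM_WORLD);          /* process synchronization */
        t_start = MPI_Wtime();
        MPI_Bcast(buf, count, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        t_local = MPI_Wtime() - t_start;
        /* one common convention: the run-time of the collective is the
         * maximum time observed over all processes */
        MPI_Reduce(&t_local, &t_max, 1, MPI_DOUBLE, MPI_MAX, 0,
                   MPI_COMM_WORLD);
        if (rank == 0)
            printf("%d %.9f\n", i, t_max);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```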
IEEE International Conference on Cloud Computing Technology and Science | 2014
Shadi Ibrahim; Diana Moise; Houssem-Eddine Chihoub; Alexandra Carpen-Amarie; Luc Bougé; Gabriel Antoniu
With increasingly inexpensive cloud storage and increasingly powerful cloud processing, the cloud has rapidly become the environment to store and analyze data. Most of the large-scale data computations in the cloud heavily rely on the MapReduce paradigm and its Hadoop implementation. Nevertheless, this exponential growth in popularity has significantly impacted power consumption in cloud infrastructures. In this paper, we focus on MapReduce and we investigate the impact of dynamically scaling the frequency of compute nodes on the performance and energy consumption of a Hadoop cluster. To this end, a series of experiments are conducted to explore the implications of Dynamic Voltage and Frequency Scaling (DVFS) settings on power consumption in Hadoop clusters. By adapting existing DVFS governors (i.e., performance, powersave, ondemand, conservative and userspace) in a Hadoop cluster, we observe significant variation in the performance and power consumption of the cluster across different applications: no single DVFS setting is optimal for all MapReduce applications. Furthermore, our results reveal that the current CPU governors do not exactly reflect their design goal and may even become ineffective at managing the power consumption in Hadoop clusters. This study aims at providing a clearer understanding of the interplay between performance and power management in Hadoop clusters and therefore offers useful insight into designing power-aware techniques for Hadoop systems.
IEEE Transactions on Parallel and Distributed Systems | 2016
Sascha Hunold; Alexandra Carpen-Amarie
The Message Passing Interface (MPI) is the prevalent programming model used on today's supercomputers. Therefore, MPI library developers are looking for the best possible performance (shortest run-time) of individual MPI functions across many different supercomputer architectures. Several MPI benchmark suites have been developed to assess the performance of MPI implementations. Unfortunately, the outcome of these benchmarks is often neither reproducible nor statistically sound. To overcome these issues, we show which experimental factors have an impact on the run-time of blocking collective MPI operations and how to measure their effect. Finally, we present a new experimental method that allows us to obtain reproducible and statistically sound measurements of MPI functions.
European Conference on Parallel Processing | 2016
Sascha Hunold; Alexandra Carpen-Amarie; Felix Donatus Lübbe; Jesper Larsson Träff
The Message Passing Interface (MPI) is the most commonly used application programming interface for process communication on current large-scale parallel systems. Due to the scale and complexity of modern parallel architectures, it is becoming increasingly difficult to optimize MPI libraries, as many factors can influence the communication performance. To assist MPI developers and users, we propose an automatic way to check whether MPI libraries respect self-consistent performance guidelines for collective communication operations. We introduce the PGMPI framework to detect violations of performance guidelines through benchmarking. Our experimental results show that PGMPI can pinpoint undesired and often unexpected performance degradations of collective MPI operations. We demonstrate how to overcome performance issues of several libraries by adapting the algorithmic implementations of their respective collective MPI calls.
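One example of such a self-consistent performance guideline is that MPI_Allgather should not be slower than the semantically equivalent mock-up MPI_Gather followed by MPI_Bcast. The sketch below (a simplified illustration, not the PGMPI framework itself) times both variants once and flags a potential violation; a real check would use many repetitions and a robust statistic.

```c
/* Sketch: check one self-consistent performance guideline,
 * MPI_Allgather(n) <= MPI_Gather(n) + MPI_Bcast(p*n),
 * by timing the collective against its mock-up. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static double time_op(int use_mockup, double *in, double *out,
                      int count, int size, MPI_Comm comm)
{
    MPI_Barrier(comm);
    double t = MPI_Wtime();
    if (use_mockup) {
        MPI_Gather(in, count, MPI_DOUBLE, out, count, MPI_DOUBLE, 0, comm);
        MPI_Bcast(out, count * size, MPI_DOUBLE, 0, comm);
    } else {
        MPI_Allgather(in, count, MPI_DOUBLE, out, count, MPI_DOUBLE, comm);
    }
    t = MPI_Wtime() - t;
    /* use the maximum over all processes as the run-time */
    MPI_Allreduce(MPI_IN_PLACE, &t, 1, MPI_DOUBLE, MPI_MAX, comm);
    return t;
}

int main(int argc, char **argv)
{
    int rank, size, count = 4096;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *in  = malloc(count * sizeof(double));
    double *out = malloc((size_t)count * size * sizeof(double));
    for (int i = 0; i < count; i++)
        in[i] = (double)rank;

    double t_coll = time_op(0, in, out, count, size, MPI_COMM_WORLD);
    double t_mock = time_op(1, in, out, count, size, MPI_COMM_WORLD);

    if (rank == 0 && t_coll > t_mock)
        printf("possible guideline violation: Allgather %.6f s > mock-up %.6f s\n",
               t_coll, t_mock);

    free(in);
    free(out);
    MPI_Finalize();
    return 0;
}
```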
Proceedings of the 22nd European MPI Users' Group Meeting on | 2015
Sascha Hunold; Alexandra Carpen-Amarie
We consider the problem of accurately measuring the time to complete an MPI collective operation, as the result strongly depends on how the time is measured. Our goal is to develop an experimental method that allows for reproducible measurements of MPI collectives. When executing large parallel codes, MPI processes are often skewed in time when entering a collective operation. However, to obtain reproducible measurements, it is a common approach to synchronize all processes before they call the MPI collective operation. We therefore take a closer look at two commonly used process synchronization schemes: (1) relying on MPI_Barrier or (2) applying a window-based scheme using a common global time. We analyze both schemes experimentally and show the strengths and weaknesses of each approach. As window-based schemes require the notion of global time, we thoroughly evaluate different clock synchronization algorithms in various experiments. We also propose a novel clock synchronization algorithm that combines two advantages of known algorithms, which are (1) taking the inherent clock drift into account and (2) using a tree-based synchronization scheme to reduce the synchronization duration.
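The sketch below illustrates the idea of a window-based scheme in a simplified form: all processes agree on a future start time and busy-wait until their measurement window opens. It assumes MPI_Wtime already provides a common global time, which in general it does not; supplying that global time is precisely what the clock synchronization algorithms studied in the paper provide.

```c
/* Sketch: window-based process synchronization for measuring a collective.
 * Simplification: MPI_Wtime is treated as a global clock; a real scheme
 * first runs a clock synchronization algorithm to obtain one. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, buf = 0;
    double start_time = 0.0;
    const double window = 0.01;          /* 10 ms window per measurement */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* rank 0 proposes a start time slightly in the future */
    if (rank == 0)
        start_time = MPI_Wtime() + 0.1;
    MPI_Bcast(&start_time, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    for (int i = 0; i < 10; i++) {
        double t_begin = start_time + i * window;
        while (MPI_Wtime() < t_begin)
            ;                             /* busy-wait until the window opens */
        double t = MPI_Wtime();
        MPI_Bcast(&buf, 1, MPI_INT, 0, MPI_COMM_WORLD);
        t = MPI_Wtime() - t;
        if (rank == 0)
            printf("measurement %d: %.9f s\n", i, t);
    }

    MPI_Finalize();
    return 0;
}
```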
European Conference on Parallel Processing | 2014
Alexandra Carpen-Amarie; Antoine Rougier; Felix Donatus Lübbe
Experimental research plays an important role in parallel computing, as in this field scientific discovery often relies on experimental findings, which complement and validate theoretical models. However, parallel hardware and applications have become extremely complex to study, due to their diversity and rapid evolution. Furthermore, applications are designed to run on thousands of nodes, often spanning several programming models and generating large amounts of data. In this context, reproducibility is essential to foster reliable scientific results. In this paper we aim at studying the requirements and pitfalls of each stage of experimental research, from data acquisition to data analysis, with respect to achieving reproducible results. We investigate state-of-the-art experimental practices in parallel computing by conducting a survey on the papers published in EuroMPI 2013, a major conference targeting the MPI community. Our findings show that while there is a clear concern for reproducibility in the parallel computing community, a better understanding of the criteria for achieving it is necessary.
Parallel Computing | 2017
Alexandra Carpen-Amarie; Sascha Hunold; Jesper Larsson Träff
We are interested in the cost of communicating simple, common, non-contiguous data layouts in various scenarios using the MPI derived datatype mechanism. Our aim is twofold. First, we provide a framework for studying communication performance for non-contiguous data layouts described with MPI derived datatypes in comparison to baseline performance with the same amount of contiguously stored data. Second, we explicate natural expectations on derived datatype communication performance that any MPI library implementation should arguably fulfill. These expectations are stated semi-formally as MPI datatype performance guidelines. Using our framework, we examine several MPI libraries on two different systems. Our findings are in many ways surprising and disappointing. First, using derived datatypes as intended by the MPI standard sometimes performs worse than the semantically equivalent packing and unpacking with the corresponding MPI functionality followed by contiguous communication. Second, communication performance with a single, contiguous datatype can be significantly worse than a repetition of its constituent datatype. Third, the heuristics that are typically employed by MPI libraries at type-commit time turn out to be insufficient to enforce the performance guidelines, showing room for better algorithms and heuristics for representing and processing derived datatypes in MPI libraries. In particular, we show cases where all MPI type constructors are necessary to achieve the expected performance. Our findings provide useful information to MPI library implementers, and hints to application programmers on good use of derived datatypes. Improved MPI libraries can be validated using our framework and approach.
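As a concrete illustration of such a guideline (a minimal sketch, not the paper's benchmarking framework), a column of a row-major matrix can be communicated either directly with an MPI_Type_vector derived datatype or by explicitly packing it with MPI_Pack and sending the contiguous buffer; the datatype performance guidelines state that the direct variant should not be slower than the pack-and-send variant.

```c
/* Sketch: send one column of a row-major n x n matrix either directly,
 * using an MPI_Type_vector derived datatype, or via explicit MPI_Pack
 * into a contiguous buffer followed by a contiguous send.
 * Run with at least two processes. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, n = 1024;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *matrix = calloc((size_t)n * n, sizeof(double));

    /* column 0 of a row-major n x n matrix: n blocks of 1 double, stride n */
    MPI_Datatype column;
    MPI_Type_vector(n, 1, n, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    /* upper bound on the packed size of one column */
    int packsize;
    MPI_Pack_size(1, column, MPI_COMM_WORLD, &packsize);
    char *packed = malloc(packsize);

    if (rank == 0) {
        /* variant 1: communicate the derived datatype directly */
        MPI_Send(matrix, 1, column, 1, 0, MPI_COMM_WORLD);

        /* variant 2: pack explicitly, then send the contiguous buffer */
        int pos = 0;
        MPI_Pack(matrix, 1, column, packed, packsize, &pos, MPI_COMM_WORLD);
        MPI_Send(packed, pos, MPI_PACKED, 1, 1, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(matrix, 1, column, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(packed, packsize, MPI_PACKED, 0, 1, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);     /* unpacking via MPI_Unpack omitted */
    }

    MPI_Type_free(&column);
    free(matrix);
    free(packed);
    MPI_Finalize();
    return 0;
}
```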
IEEE International Conference on High Performance Computing, Data, and Analytics | 2018
Sascha Hunold; Alexandra Carpen-Amarie
MPI collective operations provide a standardized interface for performing data movements within a group of processes. The efficiency of collective communication operations depends on the actual algorithm, its implementation, and the specific communication problem (type of communication, message size, and number of processes). Many MPI libraries provide numerous algorithms for specific collective operations. The strategy for selecting an efficient algorithm is often predefined (hard-coded) in MPI libraries, but some of them, such as Open MPI, allow users to change the algorithm manually. Finding the best algorithm for each case is a hard problem, and several approaches to tune these algorithmic parameters have been proposed. We use an orthogonal approach to the parameter-tuning of MPI collectives, that is, instead of testing individual algorithmic choices provided by an MPI library, we compare the latency of a specific MPI collective operation to the latency of semantically equivalent functions, which we call the mock-up implementations. The structure of the mock-up implementations is defined by self-consistent performance guidelines. The advantage of this approach is that tuning using mock-up implementations is always possible, whether or not an MPI library allows users to select a specific algorithm at run-time. We implement this concept in a library called PGMPITuneLib, which is layered between the user code and the actual MPI implementation. This library selects the best-performing algorithmic pattern of an MPI collective by intercepting MPI calls and redirecting them to our mock-up implementations. Experimental results show that PGMPITuneLib can significantly reduce the latency of MPI collectives, and, equally important, that it can help identify the tuning potential of MPI libraries.
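The interception idea can be illustrated with the standard PMPI profiling interface (a minimal sketch, not the actual PGMPITuneLib code): the wrapper below redirects MPI_Allreduce to the mock-up MPI_Reduce followed by MPI_Bcast, without any change to the application. Compiled into a library and linked before the MPI library, it intercepts every MPI_Allreduce call the application makes.

```c
/* Sketch: intercept an MPI collective through the PMPI profiling interface
 * and redirect it to a semantically equivalent mock-up. For brevity, the
 * MPI_IN_PLACE case and algorithm selection based on measurements are
 * not handled here. */
#include <mpi.h>

/* The application keeps calling MPI_Allreduce; this wrapper replaces it
 * with the mock-up MPI_Reduce followed by MPI_Bcast (root 0). */
int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count,
                  MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
{
    int err = PMPI_Reduce(sendbuf, recvbuf, count, datatype, op, 0, comm);
    if (err != MPI_SUCCESS)
        return err;
    return PMPI_Bcast(recvbuf, count, datatype, 0, comm);
}
```

A tuning library built on this mechanism would time both the intercepted collective and its mock-ups and keep whichever variant is faster for the given message size and process count.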
arXiv: Distributed, Parallel, and Cluster Computing | 2016
Alexandra Carpen-Amarie; Sascha Hunold; Jesper Larsson Träff
Collaboration
Dive into Alexandra Carpen-Amarie's collaborations.
French Institute for Research in Computer Science and Automation