Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Alexandru Calotoiu is active.

Publications


Featured research published by Alexandru Calotoiu.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2013

Using automated performance modeling to find scalability bugs in complex codes

Alexandru Calotoiu; Torsten Hoefler; Marius Poke; Felix Wolf

Many parallel applications suffer from latent performance limitations that may prevent them from scaling to larger machine sizes. Often, such scalability bugs manifest themselves only when an attempt to scale the code is actually being made, a point where remediation can be difficult. However, creating analytical performance models that would allow such issues to be pinpointed earlier is so laborious that application developers attempt it at most for a few selected kernels, running the risk of missing harmful bottlenecks. In this paper, we show how both coverage and speed of this scalability analysis can be substantially improved. Generating an empirical performance model automatically for each part of a parallel program, we can easily identify those parts that will reduce performance at larger core counts. Using a climate simulation as an example, we demonstrate that scalability bugs are not confined to those routines usually chosen as kernels.
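
A minimal sketch of the empirical modeling idea described in this abstract, assuming candidate functions of the common form c0 + c1 · p^i · log2(p)^j and a small, hypothetical exponent grid; the actual search space and model selection used by the authors differ.

```python
# Illustrative sketch, not the authors' implementation: fit candidate functions
#   f(p) = c0 + c1 * p^i * log2(p)^j
# to a handful of measurements per code region and keep the best fit.
import itertools
import numpy as np

def fit_model(cores, times):
    """Return (i, j, coeffs, error) of the best-fitting candidate model."""
    p = np.asarray(cores, dtype=float)
    t = np.asarray(times, dtype=float)
    best = None
    # Hypothetical exponent grid; real tools search and refine this further.
    for i, j in itertools.product([0, 0.5, 1, 1.5, 2], [0, 1, 2]):
        term = p**i * np.log2(p)**j
        A = np.column_stack([np.ones_like(p), term])
        coeffs, *_ = np.linalg.lstsq(A, t, rcond=None)
        err = np.sqrt(np.mean((A @ coeffs - t) ** 2))
        if best is None or err < best[3]:
            best = (i, j, coeffs, err)
    return best

# Example: a region whose runtime roughly doubles with the core count would be
# flagged as a potential scalability bug.
cores = [64, 128, 256, 512, 1024]
times = [1.1, 1.9, 3.8, 7.6, 15.3]
i, j, coeffs, err = fit_model(cores, times)
print(f"best model ~ {coeffs[0]:.2f} + {coeffs[1]:.4f} * p^{i} * log2(p)^{j}")
```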


International Conference on Supercomputing | 2015

Exascaling Your Library: Will Your Implementation Meet Your Expectations?

Sergei Shudler; Alexandru Calotoiu; Torsten Hoefler; Alexandre Strube; Felix Wolf

Many libraries in the HPC field encapsulate sophisticated algorithms with clear theoretical scalability expectations. However, hardware constraints or programming bugs may sometimes render these expectations inaccurate or even plainly wrong. While algorithm engineers have already been advocating the systematic combination of analytical performance models with practical measurements for a very long time, we go one step further and show how this comparison can become part of automated testing procedures. The most important applications of our method include initial validation, regression testing, and benchmarking to compare implementation and platform alternatives. Advancing the concept of performance assertions, we verify asymptotic scaling trends rather than precise analytical expressions, relieving the developer from the burden of having to specify and maintain very fine grained and potentially non-portable expectations. In this way, scalability validation can be continuously applied throughout the whole development cycle with very little effort. Using MPI as an example, we show how our method can help uncover non-obvious limitations of both libraries and underlying platforms.
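
The following sketch illustrates how such an asymptotic expectation check could look, under the simplifying assumption that measured growth is summarized by a single power-law exponent; the function and routine names are hypothetical, not the paper's interface.

```python
# Sketch of a scalability expectation check: compare an empirically fitted
# growth trend against a stated asymptotic bound instead of an exact formula.
import numpy as np

def fitted_exponent(cores, times):
    # Fit t ~ c * p^a in log-log space; 'a' summarizes the measured growth.
    a, _ = np.polyfit(np.log(cores), np.log(times), 1)
    return a

def assert_scaling(cores, times, max_exponent, name):
    a = fitted_exponent(np.asarray(cores, float), np.asarray(times, float))
    if a > max_exponent + 0.1:          # small tolerance for measurement noise
        print(f"FAIL {name}: measured growth p^{a:.2f} exceeds p^{max_exponent}")
    else:
        print(f"PASS {name}: measured growth p^{a:.2f}")

# Expectation: a reduction-style routine should grow sub-linearly with p.
assert_scaling([2, 4, 8, 16, 32], [1.0, 1.4, 1.9, 2.5, 3.1],
               max_exponent=0.5, name="library_allreduce")
```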


International Conference on Cluster Computing | 2016

Fast Multi-parameter Performance Modeling

Alexandru Calotoiu; David Beckingsale; Christopher Earl; Torsten Hoefler; Ian Karlin; Martin Schulz; Felix Wolf

Tuning large applications requires a clever exploration of the design and configuration space. Especially on supercomputers, this space is so large that its exhaustive traversal via performance experiments becomes too expensive, if not impossible. Manually creating analytical performance models provides insights into optimization opportunities but is extremely laborious if done for applications of realistic size. If we must consider multiple performance-relevant parameters and their possible interactions, a common requirement, this task becomes even more complex. We build on previous work on automatic scalability modeling and significantly extend it to allow insightful modeling of any combination of application execution parameters. Multi-parameter modeling has so far been outside the reach of automatic methods due to the exponential growth of the model search space. We develop a new technique to traverse the search space rapidly and generate insightful performance models that enable a wide range of uses from performance predictions for balanced machine design to performance tuning.
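
A rough illustration of the search-space reduction idea, under the assumption that the best single-parameter hypotheses are found first and only those are combined into multi-parameter candidates; the term form and exponent grid are simplified placeholders, not the paper's algorithm.

```python
# Illustrative only: instead of searching all combinations of terms in p (cores)
# and n (problem size) at once, rank simple single-parameter terms first and
# combine only the survivors, keeping the candidate set small.
import itertools
import numpy as np

def single_param_terms(x, y, exponents=(0, 0.5, 1, 2)):
    """Rank power-law terms x^e by how well c * x^e fits y."""
    scored = []
    for e in exponents:
        c = np.dot(x**e, y) / np.dot(x**e, x**e)      # least-squares scale
        err = np.mean((c * x**e - y) ** 2)
        scored.append((err, e))
    return [e for _, e in sorted(scored)[:2]]          # keep the two best

def multi_param_candidates(p_vals, n_vals, times):
    p, n, t = map(np.asarray, (p_vals, n_vals, times))
    best_p = single_param_terms(p, t)
    best_n = single_param_terms(n, t)
    # Candidate multi-parameter models are products of the surviving terms;
    # their coefficients would then be fitted as in the single-parameter case.
    return [(ep, en) for ep, en in itertools.product(best_p, best_n)]

print(multi_param_candidates([2, 4, 8, 16], [100, 200, 400, 800],
                             [1.0, 2.1, 4.3, 8.9]))
```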


European Conference on Parallel Processing | 2015

10,000 Performance Models per Minute – Scalability of the UG4 Simulation Framework

Andreas Vogel; Alexandru Calotoiu; Alexandre Strube; Sebastian Reiter; Arne Nägel; Felix Wolf; Gabriel Wittum

Numerically addressing scientific questions such as simulating drug diffusion through the human stratum corneum is a challenging task requiring complex codes and plenty of computational resources. The UG4 framework is used for such simulations, and though empirical tests have shown good scalability so far, its sheer size precludes analytical modeling of the entire code. We have developed a process which combines the power of our automated performance modeling method and the workflow manager JUBE to create insightful models for entire UG4 simulations. Examining three typical use cases, we identified and resolved a previously unknown latent scalability bottleneck. In collaboration with the code developers, we validated the performance expectations in each of the use cases, creating over 10,000 models in less than a minute, a feat previously impossible without our automation techniques.


European Conference on Parallel Processing | 2015

How Many Threads will be too Many? On the Scalability of OpenMP Implementations

Christian Iwainsky; Sergei Shudler; Alexandru Calotoiu; Alexandre Strube; Michael Knobloch; Christian H. Bischof; Felix Wolf

Exascale systems will exhibit much higher degrees of parallelism both in terms of the number of nodes and the number of cores per node. OpenMP is a widely used standard for exploiting parallelism on the level of individual nodes. Although successfully used on today’s systems, it is unclear how well OpenMP implementations will scale to much higher numbers of threads. In this work, we apply automated performance modeling to examine the scalability of OpenMP constructs across different compilers and platforms. We ran tests on Intel Xeon multi-board, Intel Xeon Phi, and Blue Gene with compilers from GNU, IBM, Intel, and PGI. The resulting models reveal a number of scalability issues in implementations of OpenMP constructs and show unexpected differences between compilers.


European Conference on Parallel Processing | 2014

Catwalk: A Quick Development Path for Performance Models

Felix Wolf; Christian H. Bischof; Torsten Hoefler; Bernd Mohr; Gabriel Wittum; Alexandru Calotoiu; Christian Iwainsky; Alexandre Strube; Andreas Vogel

Many parallel applications suffer from latent performance limitations that may prevent them from scaling to larger machine sizes. Often, such scalability bugs manifest themselves only when an attempt to scale the code is actually being made—a point where remediation can be difficult. However, creating analytical performance models that would allow such issues to be pinpointed earlier is so laborious that application developers attempt it at most for a few selected kernels, running the risk of missing harmful bottlenecks. The objective of the Catwalk project, which is carried out as part of the DFG Priority Programme 1648 Software for Exascale Computing (SPPEXA), is to automate key activities of the performance modeling process, making this powerful methodology easier to use and expanding its coverage. This article gives an overview of the project objectives, describes the results achieved so far, and outlines future work.


European Conference on Parallel Processing | 2017

Following the Blind Seer – Creating Better Performance Models Using Less Information

Patrick Reisert; Alexandru Calotoiu; Sergei Shudler; Felix Wolf

Offering insights into the behavior of applications at higher scale, performance models are useful for finding performance bugs and tuning the system. Extra-P, a tool for automated performance modeling, uses statistical methods to automatically generate, from a small number of performance measurements, models that can be used to predict performance where no measurements are available. However, the current version requires the manual pre-configuration of a search space, which might turn out to be unsuitable for the problem at hand. Furthermore, noise in the data often leads to models that indicate a worse behavior than there actually is. In this paper, we propose a new model-generation algorithm that solves both of the above problems: The search space is built and automatically refined on demand, and a scale-independent error metric tells both when to stop the refinement process and whether a model reflects faithfully enough the behavior the data exhibits. This makes Extra-P easier to use, while also allowing it to produce more accurate results. Using data from previous case studies, we show that the mean relative prediction error decreases from 46% to 13%.
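
The sketch below illustrates the two ingredients named in the abstract, on-demand refinement of the search space and a scale-independent error metric, using SMAPE and a simple bisection-style refinement as stand-ins for the paper's actual algorithm.

```python
# Assumed details for illustration: grow the exponent search space on demand
# and stop refining once a scale-independent error metric (here SMAPE,
# symmetric mean absolute percentage error) no longer improves noticeably.
import numpy as np

def smape(predicted, measured):
    predicted, measured = np.asarray(predicted, float), np.asarray(measured, float)
    return np.mean(np.abs(predicted - measured) /
                   ((np.abs(predicted) + np.abs(measured)) / 2))

def refine_exponent(cores, times, lo=0.0, hi=3.0, steps=4, tol=0.01):
    """Iteratively narrow the exponent range instead of fixing a search space."""
    p, t = np.asarray(cores, float), np.asarray(times, float)
    best_e, best_err = lo, float("inf")
    for _ in range(steps):
        for e in np.linspace(lo, hi, 7):
            c = np.dot(p**e, t) / np.dot(p**e, p**e)
            err = smape(c * p**e, t)
            if err < best_err - tol:                   # refine only if it pays off
                best_e, best_err = e, err
        lo, hi = max(best_e - (hi - lo) / 7, 0.0), best_e + (hi - lo) / 7
    return best_e, best_err

print(refine_exponent([2, 4, 8, 16, 32], [1.0, 2.1, 4.2, 8.0, 16.5]))
```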


European Conference on Parallel Processing | 2017

Off-Road Performance Modeling — How to Deal with Segmented Data

M. Kashif Ilyas; Alexandru Calotoiu; Felix Wolf

Besides correctness, scalability is one of the top priorities of parallel programmers. With manual analytical performance modeling often being too laborious, developers increasingly resort to empirical performance modeling as a viable alternative, which learns performance models from a limited amount of performance measurements. Although powerful automatic techniques exist for this purpose, they usually struggle with the situation where performance data representing two or more different phenomena are conflated into a single performance model. This not only generates an inaccurate model for the given data, but can also either fail to point out existing scalability issues or create the appearance of such issues when none are present. In this paper, we present an algorithm to detect segmentation in a sequence of performance measurements and estimate the point where the behavior changes. Our method correctly identified segmentation in more than 80% of 5.2 million synthetic tests and confirmed expected segmentation in three application case studies.
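
An illustrative take on segmentation detection, assuming a simple split-and-refit strategy with straight-line models; the paper's actual detection and change-point estimation are more elaborate.

```python
# Illustrative sketch: try every possible split point, fit a simple model to
# each segment, and report a change point if two segments fit the measurements
# much better than a single model does.
import numpy as np

def fit_error(x, y):
    """RMSE of a least-squares line through (x, y)."""
    coeffs = np.polyfit(x, y, 1)
    return np.sqrt(np.mean((np.polyval(coeffs, x) - y) ** 2))

def detect_segmentation(cores, times, ratio=0.5):
    p, t = np.asarray(cores, float), np.asarray(times, float)
    single = fit_error(p, t)
    best_split, best_err = None, single
    for k in range(2, len(p) - 1):                    # need >= 2 points per side
        err = max(fit_error(p[:k], t[:k]), fit_error(p[k:], t[k:]))
        if err < best_err:
            best_split, best_err = k, err
    if best_split is not None and best_err < ratio * single:
        return cores[best_split]                      # estimated change point
    return None

# Behavior might change, for example, once the data no longer fits into cache.
print(detect_segmentation([1, 2, 4, 8, 16, 32, 64, 128],
                          [1.0, 1.1, 1.2, 1.3, 2.9, 5.8, 11.5, 23.0]))
```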


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2017

Isoefficiency in Practice: Configuring and Understanding the Performance of Task-based Applications

Sergei Shudler; Alexandru Calotoiu; Torsten Hoefler; Felix Wolf

Task-based programming offers an elegant way to express units of computation and the dependencies among them, making it easier to distribute the computational load evenly across multiple cores. However, this separation of problem decomposition and parallelism requires a sufficiently large input problem to achieve satisfactory efficiency on a given number of cores. Unfortunately, finding a good match between input size and core count usually requires significant experimentation, which is expensive and sometimes even impractical. In this paper, we propose an automated empirical method for finding the isoefficiency function of a task-based program, binding efficiency, core count, and the input size in one analytical expression. This allows the latter two to be adjusted according to given (realistic) efficiency objectives. Moreover, we not only find (i) the actual isoefficiency function but also (ii) the function one would yield if the program execution was free of resource contention and (iii) an upper bound that could only be reached if the program was able to maintain its average parallelism throughout its execution. The difference between the three helps to explain low efficiency, and in particular, it helps to differentiate between resource contention and structural conflicts related to task dependencies or scheduling. The insights gained can be used to co-design programs and shared system resources.
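
A worked toy example of the isoefficiency relation E(p, n) = T(1, n) / (p * T(p, n)), using an assumed runtime model; it shows how the input size must grow with the core count to hold efficiency at a target, which is what the isoefficiency function captures.

```python
# Toy isoefficiency example with a hypothetical runtime model (not measured data).
import math

def t_seq(n):                # assumed sequential runtime model
    return n

def t_par(p, n):             # assumed parallel runtime: work share + log-cost overhead
    return n / p + math.log2(p)

def efficiency(p, n):
    return t_seq(n) / (p * t_par(p, n))

def iso_input_size(p, target=0.8):
    """Smallest n (by doubling) that reaches the target efficiency on p cores."""
    n = 1
    while efficiency(p, n) < target:
        n *= 2
    return n

# The required input size grows with p, tracing out the isoefficiency function.
for p in (16, 64, 256, 1024):
    print(p, iso_input_size(p))
```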


Software for Exascale Computing | 2016

Automated Performance Modeling of the UG4 Simulation Framework

Andreas Vogel; Alexandru Calotoiu; Arne Nägel; Sebastian Reiter; Alexandre Strube; Gabriel Wittum; Felix Wolf

Many scientific research questions, such as drug diffusion through the upper part of the human skin, are formulated in terms of partial differential equations, and their solution is numerically addressed using grid-based finite element methods. For detailed and more realistic physical models, this computational task becomes challenging, and thus complex numerical codes with good scaling properties up to millions of computing cores are required. Employing empirical tests, we presented very good scaling properties for the geometric multigrid solver in Reiter et al. (Comput Vis Sci 16(4):151–164, 2013) using the UG4 framework that is used to address such problems. In order to further validate the scalability of the code, we applied automated performance modeling to UG4 simulations and presented how performance bottlenecks can be detected and resolved in Vogel et al. (10,000 performance models per minute—scalability of the UG4 simulation framework. In: Traff JL, Hunold S, Versaci F (eds) Euro-Par 2015: Parallel processing, theoretical computer science and general issues, vol 9233. Springer, Heidelberg, pp 519–531, 2015). In this paper, we provide an overview of the obtained results, present a more detailed analysis via performance models for the components of the geometric multigrid solver, and comment on how the performance models coincide with our expectations.

Collaboration


Dive into Alexandru Calotoiu's collaborations.

Top Co-Authors

Felix Wolf (Technische Universität Darmstadt)
Sergei Shudler (Technische Universität Darmstadt)
Andreas Vogel (Goethe University Frankfurt)
Gabriel Wittum (Goethe University Frankfurt)
Christian H. Bischof (Technische Universität Darmstadt)
Bernd Mohr (Forschungszentrum Jülich)
Arne Nägel (Goethe University Frankfurt)