Giovanni Mariani | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Giovanni Mariani is active.

Explore More

Publication

Featured researches published by Giovanni Mariani.

embedded systems for real time multimedia | 2014

A Bayesian network approach for compiler auto-tuning for embedded processors

Amir Hossein Ashouri; Giovanni Mariani; Gianluca Palermo; Cristina Silvano

The complexity and diversity of todays architectures require an additional effort from the programmers in porting and tuning the application code across different platforms. The problem is even more complex when considering that also the compiler requires some tuning, since standard optimization options have been customized for specific architectures or designed for the average case. This paper proposes a machine-learning approach for reducing the cost of the compiler auto-tuning phase and to speedup the application performance in embedded architectures. The proposed framework is based on an application characterization done dynamically with microarchitecture independent features and based on the usage of Bayesian Networks. The main characteristic of the Bayesian Network approach consists of not describing the solution as a strict set of compiler transformations to be applied, but as a complex probability distribution function to be sampled. Experimental results, carried out on an ARM platform and GCC transformation space, proved the effectiveness of the proposed methodology for the selected benchmarks. The selected set of solutions (less than 10% of the search space) demonstrated to be very close to the optimal sequence of transformations, showing also an applications performance speedup up to 2.8 (1.5 on average) with respect to -O2 and -O3 for the cBench suite. Additionally, the proposed method demonstrated a 3× speedup in terms of search time with respect to an iterative compilation approach, given the same quality of the solutions1.

ACM Transactions on Architecture and Code Optimization | 2016

COBAYN: Compiler Autotuning Framework Using Bayesian Networks

Amir Hossein Ashouri; Giovanni Mariani; Gianluca Palermo; Eunjung Park; John Cavazos; Cristina Silvano

The variety of today’s architectures forces programmers to spend a great deal of time porting and tuning application codes across different platforms. Compilers themselves need additional tuning, which has considerable complexity as the standard optimization levels, usually designed for the average case and the specific target architecture, often fail to bring the best results.n This article proposes COBAYN: Compiler autotuning framework using BAYesian Networks, an approach for a compiler autotuning methodology using machine learning to speed up application performance and to reduce the cost of the compiler optimization phases. The proposed framework is based on the application characterization done dynamically by using independent microarchitecture features and Bayesian networks. The article also presents an evaluation based on using static analysis and hybrid feature collection approaches. In addition, the article compares Bayesian networks with respect to several state-of-the-art machine-learning models.n Experiments were carried out on an ARM embedded platform and GCC compiler by considering two benchmark suites with 39 applications. The set of compiler configurations, selected by the model (less than 7% of the search space), demonstrated an application performance speedup of up to 4.6 × on Polybench (1.85 × on average) and 3.1 × on cBench (1.54 × on average) with respect to standard optimization levels. Moreover, the comparison of the proposed technique with (i) random iterative compilation, (ii) machine learning--based iterative compilation, and (iii) noniterative predictive modeling techniques shows, on average, 1.2 × , 1.37 × , and 1.48 × speedup, respectively. Finally, the proposed method demonstrates 4 × and 3 × speedup, respectively, on cBench and Polybench in terms of exploration efficiency given the same quality of the solutions generated by the random iterative compilation model.

international conference on computer design | 2015

Analytic processor model for fast design-space exploration

Rik Jongerius; Giovanni Mariani; Andreea Anghel; Gero Dittmann; Erik Vermij; Henk Corporaal

In this paper, we propose an analytic model that takes as inputs a) a parametric microarchitecture-independent characterization of the target workload, and b) a hardware configuration of the core and the memory hierarchy, and returns as output an estimation of processor-core performance. To validate our technique, we compare our performance estimates with measurements on an Intel® Xeon® system. The average error increases from 21% for a state-of-the-art simulator to 25% for our model, but we achieve a speedup of several orders of magnitude. Thus, the model enables fast designspace exploration and represents a first step towards an analytic exascale system model.

International Journal of Parallel Programming | 2016

An Instrumentation Approach for Hardware-Agnostic Software Characterization

Andreea Anghel; Laura Mihaela Vasilescu; Giovanni Mariani; Rik Jongerius; Gero Dittmann

Simulators and empirical profiling data are often used to understand how suitable a specific hardware architecture is for an application. However, simulators can be slow, and empirical profiling-based methods can only provide insights about the existing hardware on which the applications are executed. While the insights obtained in this way are valuable, such methods cannot be used to evaluate a large number of system designs efficiently. Analytical performance evaluation models are fast alternatives, particularly well-suited for system design-space exploration. However, to be truly application-specific, they need to be combined with a workload model that captures relevant application characteristics. In this paper we introduce PISA, a framework based on the LLVM infrastructure that is able to generate such a model for sequential and parallel applications by performing hardware-independent characterization. Characteristics such as instruction-level parallelism, memory access patterns and branch behavior are analyzed per thread or process during application execution. To illustrate the potential of the framework, we provide a detailed characterization of a representative benchmark for graph-based analytics, Graph 500. Finally, we analyze how the properties extracted with PISA across Graph 500 and SPEC CPU2006 applications compare to measurements performed on x86 and POWER8 processors.

ieee acm international symposium cluster cloud and grid computing | 2017

Predicting Cloud Performance for HPC Applications: a User-oriented Approach

Giovanni Mariani; Andreea Anghel; Rik Jongerius; Gero Dittmann

Cloud computing enables end users to execute high-performance computing applications by renting the required computing power. This pay-for-use approach enables small enterprises and startups to run HPC-related businesses with a significant saving in capital investment and a short time to market. When deploying an application in the cloud, the users may a) fail to understand the interactions of the application with the software layers implementing the cloud system, b) be unaware of some hardware details of the cloud system, and c) fail to understand how sharing part of the cloud system with other users might degrade application performance. These misunderstandings may lead the users to select suboptimal cloud configurations in terms of cost or performance. To aid the users in selecting the optimal cloud configuration for their applications, we suggest that the cloud provider generate a prediction model for the provided system. We propose applying machine-learning techniques to generate this prediction model. First, the cloud provider profiles a set of training applications by means of a hardware-independent profiler and then executes these applications on a set of training cloud configurations to collect actual performance values. The prediction model is trained to learn the dependencies of actual performance data on the application profile and cloud configuration parameters. The advantage of using a hardware-independent profiler is that the cloud users and the cloud provider can analyze applications on different machines and interface with the same prediction model. We validate the proposed methodology for a cloud system implemented with OpenStack. We apply the prediction model to the NAS parallel benchmarks. The resulting relative error is below 15% and the Pareto optimal cloud configurations finally found when maximizing application speed and minimizing execution cost on the prediction model are also at most 15% away from the actual optimal solutions.

International Journal of Parallel Programming | 2016

Scaling Properties of Parallel Applications to Exascale

Giovanni Mariani; Andreea Anghel; Rik Jongerius; Gero Dittmann

A detailed profile of exascale applications helps to understand the computation, communication and memory requirements for exascale systems and provides the insight necessary for fine-tuning the computing architecture. Obtaining such a profile is challenging as exascale systems will process unprecedented amounts of data. Profiling applications at the target scale would require the exascale machine itself. In this work we propose a methodology to extrapolate the exascale profile from experimental observations over datasets feasible for today’s machines. Extrapolation models are carefully selected by means of statistical techniques and a high-level complexity analysis is included in the selection process to speed up the learning phase and to improve the accuracy of the final model. We extrapolate run-time properties of the target applications including information about the instruction mix, memory access pattern, instruction-level parallelism, and communication requirements. Compared to state-of-the-art techniques, the proposed methodology reduces the prediction error by an order of magnitude on the instruction count and improves the accuracy by up to 1.3

computing frontiers | 2015