Enrico Calore
University of Ferrara
Publication
Featured research published by Enrico Calore.
Parallel Computing | 2016
Enrico Calore; Alessandro Gabbana; Jiri Kraus; Elisa Pellegrini; Sebastiano Fabio Schifano; R. Tripiccione
This paper describes a massively parallel code for a state-of-the-art thermal lattice Boltzmann method. Our code has been carefully optimized for performance on one GPU and for good scaling behavior on a large number of GPUs. Versions of this code have already been used for large-scale studies of convective turbulence. GPUs are becoming increasingly popular in HPC applications, as they are able to deliver higher performance than traditional processors. Writing efficient programs for large clusters is not an easy task, as codes must adapt to increasingly parallel architectures and the overheads of node-to-node communications must be properly handled. We describe the structure of our code, discussing several key design choices that were guided by theoretical models of performance and experimental benchmarks. We present an extensive set of performance measurements and identify the corresponding main bottlenecks; finally, we compare the results of our GPU code with those measured on other currently available high-performance processors. The results of this work are a production-grade code able to deliver a sustained performance of several tens of Tflops, as well as a design and optimization methodology that can be used for the development of other high-performance applications for computational physics.
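As a toy illustration of the "propagate" step at the heart of such lattice Boltzmann codes (not the paper's D2Q37 CUDA implementation: grid size, velocity set, and function names below are ours), here is a minimal C sketch that streams populations on a tiny periodic 2D grid:

```c
/* Toy lattice Boltzmann "propagate" step on a tiny periodic 2D grid.
 * Illustrative only: the production code described above streams 37
 * populations per site (D2Q37) inside a GPU kernel. */
#define LX 4
#define LY 4
#define NPOP 3                          /* small subset of velocities */

static const int cx[NPOP] = {0, 1, 0};
static const int cy[NPOP] = {0, 0, 1};

/* Move each population one lattice site along its velocity (periodic). */
void propagate(double f[NPOP][LX][LY], double out[NPOP][LX][LY])
{
    for (int i = 0; i < NPOP; i++)
        for (int x = 0; x < LX; x++)
            for (int y = 0; y < LY; y++)
                out[i][(x + cx[i]) % LX][(y + cy[i]) % LY] = f[i][x][y];
}

/* Place a unit population at site (0,0), stream once, and check that
 * each population landed one site along its own velocity. */
int propagate_demo(void)
{
    double f[NPOP][LX][LY] = {{{0}}}, g[NPOP][LX][LY] = {{{0}}};
    for (int i = 0; i < NPOP; i++)
        f[i][0][0] = 1.0;
    propagate(f, g);
    return g[0][0][0] == 1.0 && g[1][1][0] == 1.0 && g[2][0][1] == 1.0;
}
```

The regular, data-parallel structure of this loop nest is what makes the method such a good fit for GPUs: every site can be updated independently.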
Concurrency and Computation: Practice and Experience | 2016
Enrico Calore; Alessandro Gabbana; Jiri Kraus; Sebastiano Fabio Schifano; R. Tripiccione
An increasingly large number of HPC systems rely on heterogeneous architectures combining traditional multi-core CPUs with power-efficient accelerators. Designing efficient applications for these systems has been troublesome in the past, as accelerators could usually be programmed only with device-specific languages, threatening maintainability, portability, and correctness. Several new programming environments try to tackle this problem. Among them, OpenACC offers a high-level approach based on compiler directives that mark regions of existing C, C++, or Fortran code to run on accelerators. This approach directly addresses code portability, leaving to compilers the support of each different accelerator, but one has to carefully assess the relative costs of portable approaches versus computing efficiency. In this paper, we address precisely this issue, using as a test-bench a massively parallel lattice Boltzmann algorithm. We first describe our multi-node implementation and optimization of the algorithm, using OpenACC and MPI. We then benchmark the code on a variety of processors, including traditional CPUs and GPUs, and make accurate performance comparisons with other GPU implementations of the same algorithm using CUDA and OpenCL. We also assess the performance impact associated with portable programming, and the actual portability and performance-portability of OpenACC-based applications across several state-of-the-art architectures.
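The directive-based approach can be illustrated with a minimal sketch (not the paper's code; the kernel and clauses are a generic example): the pragma asks an OpenACC-capable compiler to offload the loop and manage the data transfers, while a compiler without OpenACC support simply ignores it, so the same source still builds and runs on a plain CPU.

```c
/* Minimal OpenACC sketch: the directive marks the loop for offload
 * and names the data movement; without OpenACC support the pragma
 * is ignored and the loop runs sequentially on the host. */
#define N 1024

void axpy(double a, const double *x, double *y)
{
    #pragma acc parallel loop copyin(x[0:N]) copy(y[0:N])
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];
}

/* y = 3*x + y with x = 1 and y = 2 everywhere gives 5 everywhere. */
int axpy_demo(void)
{
    double x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }
    axpy(3.0, x, y);
    return y[0] == 5.0 && y[N - 1] == 5.0;
}
```

This single-source property is exactly the portability argument weighed against raw efficiency in the paper.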
European Conference on Parallel Processing | 2015
Enrico Calore; Sebastiano Fabio Schifano; R. Tripiccione
Energy efficiency is becoming more and more important in the HPC field: high-end processors are quickly evolving towards more advanced power-saving and power-monitoring technologies. On the other hand, low-power processors designed for the mobile market attract interest in the HPC area for their increasing computing capabilities, competitive pricing, and low power consumption. In this work we study the energy and computing performance of a Tegra K1 mobile processor, using an HPC Lattice Boltzmann application as a benchmark. We run this application on the ARM Cortex-A15 CPU and on the GK20A GPU, both available in this processor. Our analysis uses time-accurate measurements obtained with a simple custom-developed current monitor. We discuss several energy and performance metrics, interesting per se and also in view of a prospective use of these processors in an HPC context.
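Energy-to-solution figures in this kind of measurement come from integrating the sampled power over the run time. A generic sketch of that post-processing step (the function name and the constant-power test values are ours, not the paper's data):

```c
/* Energy-to-solution from time-stamped power samples, as produced by
 * a current monitor: integrate P(t) over the run with the trapezoidal
 * rule.  t in seconds, p in watts, n samples in parallel arrays. */
double energy_joules(const double *t, const double *p, int n)
{
    double e = 0.0;
    for (int i = 1; i < n; i++)
        e += 0.5 * (p[i] + p[i - 1]) * (t[i] - t[i - 1]);
    return e;
}

/* A constant 10 W drawn for 2 s integrates to 20 J. */
int energy_demo(void)
{
    double t[3] = {0.0, 1.0, 2.0};
    double p[3] = {10.0, 10.0, 10.0};
    return energy_joules(t, p, 3) == 20.0;
}
```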
European Conference on Parallel Processing | 2014
Enrico Calore; Sebastiano Fabio Schifano; R. Tripiccione
High performance computing increasingly relies on heterogeneous systems, based on multi-core CPUs tightly coupled to accelerators: GPUs or many-core systems. Programming heterogeneous systems raises new issues: reaching high sustained performance means that one must exploit parallelism at several levels, while at the same time the lack of a standard programming environment has an impact on code portability. This paper presents a performance assessment of a massively parallel and portable Lattice Boltzmann code, based on the Open Computing Language (OpenCL) and the Message Passing Interface (MPI). Exactly the same code runs on standard clusters of multi-core CPUs, as well as on hybrid clusters including accelerators. We consider a state-of-the-art Lattice Boltzmann model that accurately reproduces the thermo-hydrodynamics of a fluid in two dimensions. This algorithm has a regular structure well suited to accelerator architectures with a large degree of parallelism, but it is not straightforward to obtain a large fraction of the theoretically available performance. In this work we focus on code portability across several heterogeneous architectures while preserving performance, and on techniques to move data between accelerators that minimize the overheads of communication latencies. We describe the organization of the code and present and analyze performance and scalability results on a cluster of nodes based on NVIDIA K20 GPUs and Intel Xeon Phi accelerators.
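The standard way to hide node-to-node latency in this setting is to process the border (halo) sites first, so their exchange can proceed while the bulk of the lattice is being computed. A schematic sketch of that scheduling (the step names are ours; a real code issues asynchronous MPI and OpenCL operations at each step):

```c
#include <string.h>

/* Schematic communication/computation overlap for one time step.
 * Each step() records its name; real codes launch kernels and
 * asynchronous transfers instead. */
static char sched[128];

static void step(const char *name)
{
    strcat(sched, name);
    strcat(sched, ";");
}

const char *lb_timestep(void)
{
    sched[0] = '\0';
    step("compute_borders");     /* sites the neighbor nodes need    */
    step("start_halo_exchange"); /* asynchronous send/receive        */
    step("compute_bulk");        /* overlaps the transfer in flight  */
    step("wait_halo_exchange");  /* sync before borders are consumed */
    return sched;
}
```

The overlap pays off whenever the bulk computation takes longer than the halo transfer, which is the common case for large local lattices.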
International Conference on Parallel Processing | 2015
Enrico Calore; Nicola Demo; Sebastiano Fabio Schifano; R. Tripiccione
Current development trends for fast processors call for an increasing number of cores, each core featuring wide vector processing units. Applications must exploit both directions of parallelism to run efficiently. In this work we focus on the efficient use of vector instructions; these process several data elements in parallel, and memory data layout plays an important role in making this efficient. An optimal memory layout depends in principle on the access patterns of the algorithm, but also on the architectural features of the processor. However, different parts of the application may have different requirements, so the choice of the most efficient data structure for vectorization has to be carefully assessed. We address these problems for a Lattice Boltzmann (LB) code, widely used in computational fluid dynamics. We consider a state-of-the-art two-dimensional LB model that accurately reproduces the thermo-hydrodynamics of a 2D fluid. We write our codes in C and expose vector parallelism using a directive-based programming approach. We consider different data layouts and analyze the corresponding performance. Our results show that, if an appropriate data layout is selected, it is possible to write a code for this class of applications that is automatically vectorized and performance-portable across several architectures. We end up with a single code that runs efficiently on traditional multi-core processors as well as on recent many-core systems such as the Xeon Phi.
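The data-layout question here is the classic array-of-structures versus structure-of-arrays choice. A minimal sketch of the two layouts (a 2-population toy with made-up names, not the paper's data structures) shows why SoA suits vectorization: the same population at consecutive sites sits at unit stride, so a vector load touches contiguous memory.

```c
#include <stddef.h>

/* Two memory layouts for per-site LB populations.  With AoS, the
 * populations of one site are adjacent; with SoA, population i of
 * consecutive sites is adjacent, giving unit-stride vector loads. */
#define NSITES 8
#define NPOP   2

struct site_aos    { double f[NPOP]; };          /* array of structures */
struct lattice_soa { double f[NPOP][NSITES]; };  /* structure of arrays */

/* SoA: the same population at adjacent sites is one double apart. */
int soa_unit_stride(void)
{
    struct lattice_soa l;
    ptrdiff_t d = (char *)&l.f[0][1] - (char *)&l.f[0][0];
    return d == (ptrdiff_t)sizeof(double);
}

/* AoS: the same population at adjacent sites is NPOP doubles apart. */
int aos_stride(void)
{
    struct site_aos s[NSITES];
    ptrdiff_t d = (char *)&s[1].f[0] - (char *)&s[0].f[0];
    return d == (ptrdiff_t)(NPOP * sizeof(double));
}
```

With 37 populations per site, as in the D2Q37 model, the AoS stride becomes large enough to defeat hardware prefetching and vector gathers, which is why layout choice has a first-order performance impact.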
International Conference on Conceptual Structures | 2014
Enrico Calore; Sebastiano Fabio Schifano; R. Tripiccione
The architecture of high performance computing systems is becoming more and more heterogeneous, as accelerators play an increasingly important role alongside traditional CPUs. Programming heterogeneous systems efficiently is a complex task that often requires the use of specific programming environments. Programming frameworks supporting codes portable across different high performance architectures have recently appeared, but one must carefully assess the relative costs of portability versus computing efficiency, and find a reasonable trade-off point. In this paper we address precisely this issue, using as a test-bench a Lattice Boltzmann code implemented in OpenCL. We analyze its performance on several different state-of-the-art processors: NVIDIA GPUs and Intel Xeon Phi many-core accelerators, as well as more traditional Ivy Bridge and Opteron multi-core commodity CPUs. We also compare with results obtained with codes specifically optimized for each of these systems. Our work shows that a properly structured OpenCL code runs on many different systems, reaching performance levels close to those obtained by architecture-tuned CUDA or C codes.
Concurrency and Computation: Practice and Experience | 2017
Enrico Calore; Alessandro Gabbana; Sebastiano Fabio Schifano; R. Tripiccione
Energy efficiency is becoming increasingly important for computing systems, in particular for large-scale High Performance Computing (HPC) facilities. In this work we evaluate, from a user perspective, the use of Dynamic Voltage and Frequency Scaling (DVFS) techniques, assisted by the power- and energy-monitoring capabilities of modern processors, to tune applications for energy efficiency. We run selected kernels and a full HPC application on two high-end processors widely used in the HPC context, namely an NVIDIA K80 GPU and an Intel Haswell CPU. We evaluate the available trade-offs between energy-to-solution and time-to-solution, attempting a function-by-function frequency tuning. We finally estimate the benefits obtainable by running the full code on an HPC multi-GPU node, with respect to default clock frequency governors. We instrument our code to accurately monitor power consumption and execution time without the need for any additional hardware, and we enable it to change CPU and GPU clock frequencies while running. We analyze our results on the different architectures using a simple energy-performance model, and derive a number of energy-saving strategies that can be easily adopted on recent high-end HPC systems for generic applications.
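The kind of trade-off such an energy-performance model captures can be sketched with illustrative constants (ours, not the paper's measurements): with run time scaling as W/f and power made of a static part plus a dynamic part growing roughly as f³, energy-to-solution E(f) = (P_static + c·f³)·W/f has its minimum at an intermediate frequency rather than at the highest clock.

```c
/* Toy energy-performance model for DVFS tuning.  All constants are
 * made up for illustration.  E(f) = (P_STATIC + C_DYN * f^3) * W / f:
 * the energy minimum need not sit at the highest clock frequency. */
#define P_STATIC 10.0   /* watts (made up)         */
#define C_DYN     2.0   /* watts / GHz^3 (made up) */
#define WORK    100.0   /* work, arbitrary units   */

double energy_at(double f)
{
    return (P_STATIC + C_DYN * f * f * f) * WORK / f;
}

/* Pick the least-energy frequency among the available clock steps. */
double best_frequency(const double *freqs, int n)
{
    double best = freqs[0];
    for (int i = 1; i < n; i++)
        if (energy_at(freqs[i]) < energy_at(best))
            best = freqs[i];
    return best;
}

/* With these constants the optimum is an intermediate frequency. */
int dvfs_demo(void)
{
    double f[4] = {1.0, 1.4, 2.0, 3.0};
    return best_frequency(f, 4) == 1.4 && energy_at(1.4) < energy_at(3.0);
}
```

Function-by-function tuning, as attempted in the paper, amounts to evaluating such a trade-off separately for each kernel, since compute-bound and memory-bound phases have different effective exponents.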
International Journal of Modern Physics C | 2017
Claudio Bonati; Simone Coscetti; Massimo D’Elia; Michele Mesiti; Francesco Negro; Enrico Calore; Sebastiano Fabio Schifano; Giorgio Silvi; R. Tripiccione
The present panorama of HPC architectures is extremely heterogeneous, ranging from traditional multi-core CPU processors, supporting a wide class of applications but delivering moderate computing performance, to many-core Graphics Processing Units (GPUs), exploiting aggressive data-parallelism and delivering higher performance for streaming computing applications. In this scenario, code portability (and performance portability) becomes necessary for easy maintainability of applications; this is very relevant in scientific computing, where code changes are very frequent, making it tedious and error-prone to keep different code versions aligned. In this work, we present the design and optimization of a state-of-the-art production-level LQCD Monte Carlo application, using the directive-based OpenACC programming model. OpenACC abstracts parallel programming to a descriptive level, relieving programmers from specifying how codes should be mapped onto the target architecture. We describe the implementation of a code fully written in OpenACC, and show that we are able to target several different architectures, including state-of-the-art traditional CPUs and GPUs, with the same code. We also measure performance, evaluating the computing efficiency of our OpenACC code on several architectures, comparing with GPU-specific implementations and showing that a good level of performance portability can be reached.
ieee international conference on high performance computing data and analytics | 2015
Claudio Bonati; Enrico Calore; Simone Coscetti; Massimo D'Elia; Michele Mesiti; Francesco Negro; Sebastiano Fabio Schifano; R. Tripiccione
Many scientific software applications that solve complex compute- or data-intensive problems, such as large parallel simulations of physics phenomena, increasingly use HPC systems in order to achieve scientifically relevant results. An increasing number of HPC systems adopt heterogeneous node architectures, combining traditional multi-core CPUs with energy-efficient massively parallel accelerators, such as GPUs. The need to exploit the computing power of these systems, in conjunction with the lack of standardization in their hardware and/or programming frameworks, raises new issues with respect to scientific software development choices, which strongly impact software maintainability, portability, and performance. Several new programming environments have been introduced recently to address these issues. In particular, the OpenACC programming standard has been designed to ease the software development process for codes targeting heterogeneous machines, helping to achieve code and performance portability. In this paper we present, as a specific OpenACC use case, an example of the design, porting, and optimization of an LQCD Monte Carlo code intended to be portable across present and future heterogeneous HPC architectures. We describe the design process and the most critical design choices, and evaluate the trade-off between portability and efficiency.
Proceedings of SPIE | 2012
Enrico Calore; Raffaella Folgieri; Davide Gadia; Daniele Marini
Stereoscopic visualization in cinematography and Virtual Reality (VR) creates an illusion of depth by means of two bidimensional images corresponding to different views of a scene. This perceptual trick is used to enhance the emotional response and the sense of presence and immersion of the observers. An interesting question is whether, and how, it is possible to measure and analyze the level of emotional involvement and attention of observers during the stereoscopic visualization of a movie or of a virtual environment. These research aims represent a challenge, due to the large number of sensorial, physiological, and cognitive stimuli involved. In this paper we begin this research by analyzing possible differences in the brain activity of subjects during the viewing of monoscopic or stereoscopic contents. To this aim, we have performed some preliminary experiments collecting electroencephalographic (EEG) data from a group of users using a Brain-Computer Interface (BCI) during the viewing of stereoscopic and monoscopic short movies in a VR immersive installation.