Alessandro Fanfarillo
National Center for Atmospheric Research
Publications
Featured research published by Alessandro Fanfarillo.
Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models | 2014
Alessandro Fanfarillo; Tobias Burnus; Valeria Cardellini; Salvatore Filippone; Dan Nagle; Damian W. I. Rouson
Coarray Fortran is a set of features of the Fortran 2008 standard that make Fortran a PGAS parallel programming language. Two commercial compilers currently support coarrays: Cray and Intel. Here we present two coarray transport layers provided by the new OpenCoarrays project: one library based on MPI and the other on GASNet. We link the GNU Fortran (GFortran) compiler to either of the two OpenCoarrays implementations and present performance comparisons between executables produced by GFortran and the Cray and Intel compilers. The comparison includes synthetic benchmarks, application prototypes, and an application kernel. In our tests, Intel outperforms GFortran only on intra-node small transfers (in particular, scalars). GFortran outperforms Intel on intra-node array transfers and in all settings that require inter-node transfers. The Cray comparisons are mixed, with either GFortran or Cray being faster depending on the chosen hardware platform, network, and transport layer.
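To make the programming model concrete, here is a minimal Coarray Fortran sketch (not taken from the paper) of the kind of one-sided transfer these transport layers carry; with GFortran it would be compiled against one of the OpenCoarrays libraries described above.

```fortran
! Minimal coarray example: image 1 gathers a value from every other
! image with one-sided gets; OpenCoarrays translates each coindexed
! reference into MPI or GASNet transport calls.
program coarray_gather
  implicit none
  integer :: x[*]            ! one integer per image, remotely addressable
  integer :: i, total

  x = this_image()           ! each image stores its own index
  sync all                   ! make all writes visible before the gets

  if (this_image() == 1) then
    total = 0
    do i = 1, num_images()
      total = total + x[i]   ! one-sided get from image i
    end do
    print *, 'sum of image indices =', total
  end if
end program coarray_gather
```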
ACM Transactions on Mathematical Software | 2017
Salvatore Filippone; Valeria Cardellini; Davide Barbieri; Alessandro Fanfarillo
The multiplication of a sparse matrix by a dense vector (SpMV) is a centerpiece of scientific computing applications: it is the essential kernel for the solution of sparse linear systems and sparse eigenvalue problems by iterative methods. The efficient implementation of the sparse matrix-vector multiplication is therefore crucial and has been the subject of an immense amount of research, with interest renewed with every major new trend in high-performance computing architectures. The introduction of General-Purpose Graphics Processing Units (GPGPUs) is no exception, and many articles have been devoted to this problem. In this article, we review the techniques for implementing the SpMV kernel on GPGPUs that have appeared in the literature of the last few years. We discuss the issues and tradeoffs encountered by the various researchers and present a list of solutions, organized into categories according to common features. We also provide a performance comparison across different GPGPU models on a set of test matrices drawn from various application domains.
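As a point of reference for the formats the survey discusses, the following is a plain serial CSR sparse matrix-vector product in Fortran (a generic textbook kernel, not code from the article); the GPGPU implementations reviewed parallelize this loop nest and vary the storage layout.

```fortran
! Serial CSR sparse matrix-vector product y = A*x. The surveyed GPGPU
! kernels parallelize the outer loop across threads and often switch
! the storage layout (e.g. ELLPACK or hybrid formats) for coalescing.
subroutine csr_spmv(n, row_ptr, col_idx, val, x, y)
  implicit none
  integer, intent(in)  :: n                 ! number of rows
  integer, intent(in)  :: row_ptr(n+1)      ! start of each row in val/col_idx
  integer, intent(in)  :: col_idx(:)        ! column index of each nonzero
  real(8), intent(in)  :: val(:)            ! nonzero values
  real(8), intent(in)  :: x(:)              ! input vector
  real(8), intent(out) :: y(n)              ! result vector
  integer :: i, k
  real(8) :: t

  do i = 1, n
    t = 0.0d0
    do k = row_ptr(i), row_ptr(i+1) - 1
      t = t + val(k) * x(col_idx(k))
    end do
    y(i) = t
  end do
end subroutine csr_spmv
```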
International Parallel and Distributed Processing Symposium | 2016
Valeria Cardellini; Alessandro Fanfarillo; Salvatore Filippone
In order to reach challenging performance goals, computer architectures will change significantly in the near future. Heterogeneous chips, equipped with different types of cores and memory, will compel application developers to deal with irregular communication patterns, high parallelism, and unexpected behaviors. Load balancing among the heterogeneous compute units will be a critical task for exploiting all the computational power provided by such new architectures. In this highly dynamic scenario, Partitioned Global Address Space (PGAS) languages, like Coarray Fortran (CAF), appear to be a promising alternative to standard MPI programming with two-sided communications, in particular because of their one-sided semantics. In this work, we show how Coarray Fortran can be used to implement dynamic load balancing algorithms on an exascale compute node and how these algorithms can produce performance benefits for an Asian option pricing problem, running in symmetric mode on the Intel Xeon Phi (KNC).
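A minimal sketch of the idea, using a chunked task pool drained with a one-sided Fortran 2018 fetch-and-add (illustrative only; the paper's algorithms are more elaborate):

```fortran
! Dynamic load balancing sketch: images atomically grab chunks from a
! shared task counter until the pool is exhausted. The fetch-and-add is
! one-sided, so image 1 needs no matching receive code.
program dynamic_lb
  use iso_fortran_env, only: atomic_int_kind
  implicit none
  integer, parameter :: ntasks = 1000, chunk = 10
  integer(atomic_int_kind) :: next[*]    ! shared task counter, owned by image 1
  integer(atomic_int_kind) :: first
  integer :: i

  if (this_image() == 1) call atomic_define(next, 1)
  sync all

  do
    ! Grab the next chunk without involving image 1's execution
    call atomic_fetch_add(next[1], chunk, first)
    if (first > ntasks) exit
    do i = int(first), min(int(first) + chunk - 1, ntasks)
      ! ... process task i (e.g. one batch of option-pricing paths) ...
    end do
  end do
end program dynamic_lb
```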
Proceedings of the 23rd European MPI Users' Group Meeting | 2016
Alessandro Fanfarillo; Jeff R. Hammond
MPI-3.1 is currently the most recent version of the MPI standard. It adds important extensions to MPI-2, including simplified semantics for the one-sided communication routines and a new tool interface capable of exposing performance data of the MPI implementation to users and libraries. These and other new features make MPI-3 a good candidate for being the transport layer of PGAS languages like Coarray Fortran. OpenCoarrays, the free coarray implementation used by the GNU Fortran compiler, implements almost all Coarray Fortran 2008 features and several Coarray Fortran 2015 features on top of MPI-3. Among the Fortran 2015 features, one of the most relevant for performance improvement is events: a fine-grained synchronization mechanism based on a limited implementation of the well-known semaphore primitives. In this paper, we analyze two possible implementations of events using MPI-3 features and show how to dynamically select the best implementation according to the capabilities provided by the MPI implementation. We also show how events can improve overall performance by reducing idle times in parallel applications.
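In user code the feature looks like this (a minimal producer-consumer sketch, not from the paper); the paper's concern is how the post/wait pair maps onto MPI-3 underneath.

```fortran
! Events as semaphore-like synchronization: image 1 posts after writing
! the payload; image 2 waits for that single post instead of a barrier.
program events_demo
  use iso_fortran_env, only: event_type
  implicit none
  type(event_type) :: ready[*]
  real :: buf(100)[*]

  if (this_image() == 1) then
    buf(:)[2] = 3.14           ! one-sided put of the payload to image 2
    event post (ready[2])      ! notify image 2, semaphore-style
  else if (this_image() == 2) then
    event wait (ready)         ! blocks until one post arrives
    print *, 'received', buf(1)
  end if
end program events_demo
```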
Parallel Computing | 2013
Valeria Cardellini; Alessandro Fanfarillo; Salvatore Filippone
Hybrid GPU/CPU clusters are becoming very popular in the scientific computing community, as attested by the number of such systems in the TOP500 list. In this paper, we address one of the key algorithms for scientific applications: the computation of sparse matrix-vector products, which lies at the heart of iterative solvers for sparse linear systems. We detail how design patterns for sparse matrix computations enable us to adapt easily to such a heterogeneous GPU/CPU platform, using several sparse matrix formats in order to achieve the best performance; we then analyze static load balancing strategies for devising a suitable data decomposition and propose our approach. We discuss our experience in using different sparse matrix formats and data partitioning algorithms through a number of computational experiments executed on three different hybrid GPU/CPU platforms.
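For illustration, a static decomposition of the kind analyzed here can be as simple as splitting matrix rows in proportion to benchmarked per-device throughput (a sketch with assumed inputs, not the paper's actual strategy):

```fortran
! Static row partitioning between CPU and GPU, weighted by measured
! SpMV throughput of each device.
subroutine weighted_split(nrows, gpu_gflops, cpu_gflops, gpu_rows, cpu_rows)
  implicit none
  integer, intent(in)  :: nrows
  real,    intent(in)  :: gpu_gflops, cpu_gflops  ! benchmarked throughput
  integer, intent(out) :: gpu_rows, cpu_rows

  gpu_rows = nint(nrows * gpu_gflops / (gpu_gflops + cpu_gflops))
  cpu_rows = nrows - gpu_rows
end subroutine weighted_split
```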
Parallel Computing | 2017
Valeria Cardellini; Alessandro Fanfarillo; Salvatore Filippone
In order to reach challenging performance goals, computer architecture is expected to change significantly in the near future. Heterogeneous chips, equipped with different types of cores and memory, will force application developers to deal with irregular communication patterns, high levels of parallelism, and unexpected behavior. Load balancing among the heterogeneous compute units will be a critical task for achieving effective usage of the computational power provided by such new architectures. In this highly dynamic scenario, Partitioned Global Address Space (PGAS) languages, like Coarray Fortran, appear to be a promising alternative to standard MPI programming with two-sided communications, in particular because of the one-sided semantics and ease of programmability of PGAS. In this paper, we show how Coarray Fortran can be used to implement dynamic load balancing algorithms on an exascale compute node and how these algorithms can produce performance benefits for an Asian option pricing problem, running in symmetric mode on the Intel Xeon Phi Knights Corner and Knights Landing architectures.
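To fix ideas, the workload being balanced is embarrassingly parallel. A toy Coarray Fortran version of arithmetic-average Asian call pricing (illustrative parameters, per-image RNG seeding omitted; not the paper's code) might look like:

```fortran
! Each image prices a batch of Asian call paths by Monte Carlo, then a
! collective reduces the partial payoff sums on image 1.
program asian_mc
  implicit none
  integer, parameter :: npaths = 10000, nsteps = 250
  real, parameter :: s0 = 100.0, strike = 100.0, r = 0.05, sigma = 0.2
  real, parameter :: dt = 1.0 / nsteps, twopi = 8.0 * atan(1.0)
  real :: s, avg, payoff, z, u1, u2
  integer :: p, t

  payoff = 0.0
  do p = 1, npaths
    s = s0
    avg = 0.0
    do t = 1, nsteps
      call random_number(u1)
      call random_number(u2)
      z = sqrt(-2.0 * log(max(u1, 1.0e-12))) * cos(twopi * u2)  ! Box-Muller
      s = s * exp((r - 0.5 * sigma**2) * dt + sigma * sqrt(dt) * z)
      avg = avg + s
    end do
    payoff = payoff + max(avg / nsteps - strike, 0.0)
  end do

  call co_sum(payoff, result_image=1)          ! combine all images' sums
  if (this_image() == 1) print *, 'Asian call price ~', &
      exp(-r) * payoff / (real(npaths) * num_images())   ! T = 1 year
end program asian_mc
```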
Parallel Computing | 2015
Valeria Cardellini; Alessandro Fanfarillo; Salvatore Filippone; Damian W. I. Rouson
Accelerators such as NVIDIA GPUs and Intel MICs are currently provided as co-processor devices, usable only through a CPU host. For Intel MICs, it is planned that this constraint will be lifted in the near future: CPU and accelerator(s) will then form a single many-core processor capable of a peak performance of several teraflops with high energy efficiency. In order to exploit the available computational power, the user will be compelled to write more “hardware-aware” code, in contrast to the common philosophy of hiding hardware details as much as possible. The simple two-sided communication approach often used in message-passing applications introduces synchronization costs that may limit performance on next-generation machines. PGAS languages, like Coarray Fortran and UPC, propose a one-sided approach in which a process directly accesses the remote memory of another process without interrupting its execution. In this paper, we propose a CUDA-aware coarray implementation, capable of merging the expressive syntax of coarrays with the computational power of GPUs. We propose a new keyword for the Fortran language that allows the user to map some hardware features of many-core machines with a high-level syntax. Our hybrid coarray implementation is based on OpenCoarrays, the coarray transport layer currently adopted by the GNU Fortran compiler.
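The flavor of the proposal is roughly as follows (hypothetical syntax; the actual keyword is defined in the paper and may differ): a coarray marked as device-resident keeps the usual one-sided assignment syntax while the transport layer issues GPU-to-GPU transfers.

```fortran
program hybrid_caf
  implicit none
  real, allocatable :: a(:)[:]             ! ordinary host coarray
  ! real, allocatable, device :: d(:)[:]   ! hypothetical device-resident coarray

  allocate(a(1000)[*])
  a = real(this_image())
  sync all
  if (this_image() == 1) print *, 'from image 2:', a(1)[2]  ! one-sided get

  ! With a device coarray, the same assignment syntax, e.g.
  !   d(:)[2] = d(:)
  ! would become a direct GPU-to-GPU transfer through CUDA-aware MPI.
end program hybrid_caf
```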
High Performance Computing Systems and Applications | 2014
Valeria Cardellini; Alessandro Fanfarillo; Salvatore Filippone
Hybrid nodes containing GPUs are rapidly becoming the norm in parallel machines. We have conducted experiments on how to plug GPU-enabled computational kernels into PSBLAS, an MPI-based library specifically geared towards sparse matrix computations. In this paper, we present our findings on which strategies are most promising in the quest for the optimal compromise among raw performance, speedup, software maintainability, and extensibility. We consider several solutions for implementing the data exchange with the GPU, focusing on data access and transfer, and present an experimental evaluation on a cluster system with up to two GPUs per node. In particular, we compare the pinned-memory and Open MPI approaches, the two most widely used alternatives for multi-GPU communication in a cluster environment. We find that Open MPI is the best solution for large data transfers, while the pinned-memory approach remains a good solution for small transfers between GPUs.
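A sketch of the two communication paths being compared, with hypothetical dev_to_host/host_to_dev helpers standing in for the CUDA copies (this is not PSBLAS code):

```fortran
! Two ways to exchange a GPU-resident buffer with a peer rank:
! staging through a pinned host buffer, or passing the device buffer
! straight to a CUDA-aware Open MPI.
subroutine halo_exchange(dev_buf, host_buf, n, peer, use_pinned, comm)
  use mpi
  implicit none
  integer, intent(in)    :: n, peer, comm
  logical, intent(in)    :: use_pinned
  real(8), intent(inout) :: dev_buf(n)   ! conceptually resident on the GPU
  real(8), intent(inout) :: host_buf(n)  ! page-locked (pinned) staging buffer
  integer :: ierr, status(MPI_STATUS_SIZE)

  if (use_pinned) then
    ! call dev_to_host(dev_buf, host_buf, n)   ! hypothetical D2H copy
    call mpi_sendrecv_replace(host_buf, n, MPI_REAL8, peer, 0, peer, 0, &
                              comm, status, ierr)
    ! call host_to_dev(host_buf, dev_buf, n)   ! hypothetical H2D copy
  else
    ! CUDA-aware Open MPI: hand the device buffer directly to MPI
    call mpi_sendrecv_replace(dev_buf, n, MPI_REAL8, peer, 0, peer, 0, &
                              comm, status, ierr)
  end if
end subroutine halo_exchange
```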
Parallel Computing | 2018
Alessandro Fanfarillo; Davide Del Vento
With the increasing availability of Remote Direct Memory Access (RDMA) support in computer networks, the so-called Partitioned Global Address Space (PGAS) model has evolved over the last few years. Although there are several cases where a PGAS approach can easily solve difficult message-passing situations, as in particle tracking and adaptive mesh refinement applications, the producer-consumer pattern usually adopted in task-based parallelism can only be implemented inefficiently, because of the separation between data transfer and synchronization (which are usually unified in message-passing programming models). In this paper, we provide two contributions: (1) we propose an extension for the Fortran language that provides the concept of Notified Access by associating regular coarray variables with event variables; (2) we demonstrate that the MPI extension proposed by foMPI for Notified Access can be used effectively to implement the same concept in a PGAS run-time library like OpenCoarrays. Moreover, for a hydrodynamics mini-application, we found that Fortran 2018 events always perform better than Fortran 2008 sync statements on many-core processors. Finally, we show how the proposed Notified Access can improve performance even further.
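In today's Fortran the producer-consumer pattern needs two separate operations, as in this minimal sketch (the proposed Notified Access of contribution (1) would let the put and the notification travel as a single RDMA operation, as foMPI does):

```fortran
! Producer-consumer with Fortran 2018 events: transfer and
! synchronization are two distinct network operations.
program notify_demo
  use iso_fortran_env, only: event_type
  implicit none
  type(event_type) :: arrived[*]
  real :: work(1000)[*]

  if (this_image() == 1) then
    work(:)[2] = 1.0           ! (1) data transfer ...
    event post (arrived[2])    ! (2) ... then a separate notification
  else if (this_image() == 2) then
    event wait (arrived)
    print *, sum(work)
  end if
  ! The proposed extension would associate 'work' with 'arrived' so that
  ! steps (1) and (2) become one RDMA notified write.
end program notify_demo
```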
Parallel Computing | 2017
Alessandro Fanfarillo; Davide Del Vento; Patrick Nichols
Since the beginning of distributed computing, overlapping communication with computation has been an attractive technique for masking high communication costs. Although easy for a human to detect, communication/computation overlapping requires knowledge of architectural and network details in order to be performed effectively. When low-level details influence performance and productivity, compilers and run-time libraries play the critical role of translating the high-level statements understandable by humans into efficient commands suitable for machines. With the advent of PGAS languages, parallelism becomes part of the programming language and communication can be expressed with simple variable assignments. As with serial programs, PGAS compilers should be able to optimize all aspects of the language, including communication; unfortunately, this is not yet the case. In this work, we consider parallel scientific programs written in Coarray Fortran and focus on how to build a PGAS compiler capable of optimizing communication, in particular by automatically exploiting opportunities for communication/computation overlapping. We also sketch an extension for the Fortran language that allows one to express the relation between data and synchronization events; finally, we show how this relation can be used by the compiler to perform additional communication optimizations.
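A small example of the optimization opportunity (illustrative, for two images): a naive compiler completes the remote get at (a) before starting (b), whereas an optimizing one can initiate the transfer at (a), delay its completion to just before (c), and so overlap it with the independent computation at (b).

```fortran
! Run with 2 images; 'remote' is fetched from the neighbor image.
program overlap_demo
  implicit none
  real :: remote(1000)[*], local(1000), r(1000)
  integer :: me, nbr

  me  = this_image()
  nbr = merge(2, 1, me == 1)       ! the other image
  remote = real(me)
  call random_number(local)
  sync all

  r = remote(:)[nbr]               ! (a) remote get
  local = sqrt(local) + 1.0        ! (b) computation independent of r
  print *, sum(r) + sum(local)     ! (c) first actual use of r
end program overlap_demo
```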