Publication
Featured research published by Massimiliano Meneghin.
international conference on parallel processing | 2011
Marco Aldinucci; Marco Danelutto; Peter Kilpatrick; Massimiliano Meneghin; Massimo Torquati
FastFlow is a programming framework specifically targeting cache-coherent shared-memory multi-cores. It is implemented as a stack of C++ template libraries built on top of lock-free (and memory-fence-free) synchronization mechanisms. Its philosophy is to combine programmability with performance. This paper presents a new FastFlow programming methodology aimed at supporting the parallelization of existing sequential code via offloading onto a dynamically created software accelerator. The new methodology has been validated using a set of simple micro-benchmarks and some real applications.
parallel, distributed and network-based processing | 2010
Marco Aldinucci; Massimiliano Meneghin; Massimo Torquati
Shared-memory multiprocessors have returned to popularity thanks to the rapid spread of commodity multi-core architectures. However, little attention has been paid to supporting effective streaming applications on these architectures. We describe FastFlow, a low-level programming framework based on lock-free queues explicitly designed to support high-level languages for streaming applications. We compare FastFlow with state-of-the-art programming frameworks such as Cilk, OpenMP, and Intel TBB, and experimentally demonstrate that FastFlow is consistently more efficient on a real-world application: its speedup over the other solutions can be substantial for fine-grained tasks, for example +35% over OpenMP, +226% over Cilk, and +96% over TBB for the alignment of protein P01111 against the UniProt DB using the Smith-Waterman algorithm.
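The lock-free single-producer/single-consumer queue at the heart of this design can be sketched with C++11 atomics. FastFlow's actual queues are more refined (wait-free and cache-optimized, avoiding memory fences on x86 in the original formulation); the bounded ring buffer below is a generic sketch of the idea, not FastFlow's implementation.

```cpp
#include <atomic>
#include <cstddef>

// Minimal bounded SPSC queue: exactly one producer thread calls push()
// and exactly one consumer thread calls pop(). No locks; the threads
// synchronize only through acquire/release loads and stores on the
// head and tail indices.
template <typename T, size_t Capacity>
class SPSCQueue {
public:
    bool push(const T& v) {
        size_t t = tail_.load(std::memory_order_relaxed);
        size_t next = (t + 1) % Capacity;
        if (next == head_.load(std::memory_order_acquire))
            return false;                       // queue full
        buf_[t] = v;
        tail_.store(next, std::memory_order_release);
        return true;
    }

    bool pop(T& out) {
        size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire))
            return false;                       // queue empty
        out = buf_[h];
        head_.store((h + 1) % Capacity, std::memory_order_release);
        return true;
    }

private:
    T buf_[Capacity];
    std::atomic<size_t> head_{0};               // consumer index
    std::atomic<size_t> tail_{0};               // producer index
};
```

Because each index is written by only one thread, no compare-and-swap is needed, which is what makes the SPSC case so much cheaper than general multi-producer queues for fine-grained streaming.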
international conference on parallel processing | 2012
Marco Aldinucci; Marco Danelutto; Peter Kilpatrick; Massimiliano Meneghin; Massimo Torquati
The use of efficient synchronization mechanisms is crucial for implementing fine-grained parallel programs on modern shared-cache multi-core architectures. In this paper we study this problem by considering Single-Producer/Single-Consumer (SPSC) coordination using unbounded queues. We present a novel unbounded SPSC algorithm capable of reducing the raw synchronization latency and speeding up producer-consumer coordination. The algorithm has been extensively tested on a shared-cache multi-core platform, and a sketch proof of correctness is presented. The proposed queues have been used as basic building blocks to implement the FastFlow parallel framework, which has been demonstrated to offer very good performance for fine-grained parallel applications.
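An unbounded SPSC queue can be illustrated with a linked list of nodes in which the producer owns the tail, the consumer owns the head, and the two synchronize only through atomic `next` pointers. This stripped-down sketch is not the paper's algorithm, which instead links bounded SPSC buffers to amortize allocation costs; it only shows the core single-producer/single-consumer ownership idea.

```cpp
#include <atomic>

// Illustrative unbounded SPSC queue: a linked list with a dummy head
// node. push() is called by the single producer, pop() by the single
// consumer; neither ever touches the other's pointer.
template <typename T>
class UnboundedSPSC {
    struct Node {
        T value{};
        std::atomic<Node*> next{nullptr};
    };

public:
    UnboundedSPSC() : head_(new Node), tail_(head_) {}  // dummy node

    ~UnboundedSPSC() {
        for (Node* n = head_; n != nullptr; ) {
            Node* nx = n->next.load(std::memory_order_relaxed);
            delete n;
            n = nx;
        }
    }

    void push(const T& v) {                  // producer side only
        Node* n = new Node;
        n->value = v;
        tail_->next.store(n, std::memory_order_release);
        tail_ = n;
    }

    bool pop(T& out) {                       // consumer side only
        Node* nx = head_->next.load(std::memory_order_acquire);
        if (nx == nullptr) return false;     // queue empty
        out = nx->value;
        delete head_;                        // retire the old dummy
        head_ = nx;                          // nx becomes the new dummy
        return true;
    }

private:
    Node* head_;  // consumer's pointer (always a dummy node)
    Node* tail_;  // producer's pointer (last appended node)
};
```

The per-push allocation here is exactly the overhead the paper's design avoids by chaining bounded buffers, which keeps the common path allocation-free.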
high performance distributed computing | 2012
Davide Pasetto; Massimiliano Meneghin; Hubertus Franke; Fabrizio Petrini; Jimi Xenidis
The three major solutions for increasing the nominal performance of a CPU are multiplying the number of cores per socket, expanding the embedded cache memories, and using multi-threading to reduce the impact of the deep memory hierarchy. Systems with tens or hundreds of hardware threads, all sharing a cache-coherent UMA or NUMA memory space, are today the de-facto standard. While these solutions can easily provide benefits in a multi-program environment, they require recoding of applications to leverage the available parallelism. Threads must synchronize and exchange data, and the overall performance is heavily influenced by the overhead added by these mechanisms, especially as developers try to exploit finer-grain parallelism in order to use all available resources.
international conference on computer communications | 2013
Ken Inoue; Davide Pasetto; Karol Lynch; Massimiliano Meneghin; Kay Muller; John Sheehan
Ultra low-latency networking is critical in many domains, such as high frequency trading and high performance computing (HPC), and highly desirable in many others such as VoIP and on-line gaming. In closed systems - such as those found in HPC - Infiniband, iWARP or RoCE are common choices as system architects have the opportunity to choose the best host configurations and networking fabric. However, the vast majority of networks are built upon Ethernet with nodes exchanging data using the standard TCP/IP stack. On such networks, achieving ultra low-latency while maintaining compatibility with a standard TCP/IP stack is crucial. To date, most efforts for low-latency packet transfers have focused on three main areas: (i) avoiding context switches, (ii) avoiding buffer copies, and (iii) off-loading protocol processing. This paper describes IBM PowerEN™ and its networking stack, showing that an integrated system design which treats Ethernet adapters as first class citizens that share the system bus with CPUs and memory, rather than as peripheral PCI Express attached devices, is a winning solution for achieving minimal latency. The work presents outstanding performance figures, including 1.30μs from wire to wire for UDP, usually the chosen protocol for latency sensitive applications, and excellent latency and bandwidth figures for the more complex TCP.
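The wire-to-wire figures above are measured on specialized hardware, but the application-level path they optimize is the plain BSD datagram socket API. The sketch below sends one UDP datagram between two loopback sockets in a single process; it is a generic illustration of that send/receive path, not the PowerEN stack, and the function name `udp_send_recv` is invented for this example.

```cpp
#include <arpa/inet.h>
#include <string>
#include <sys/socket.h>
#include <unistd.h>

// Send one datagram from a "client" socket to a "server" socket over
// the loopback interface and return the received payload. This is the
// user-level portion of the latency path the paper measures end to end.
std::string udp_send_recv(const std::string& payload) {
    int rx = socket(AF_INET, SOCK_DGRAM, 0);    // receiving socket
    int tx = socket(AF_INET, SOCK_DGRAM, 0);    // sending socket
    if (rx < 0 || tx < 0) return "";

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                          // let the kernel pick a port
    if (bind(rx, (sockaddr*)&addr, sizeof(addr)) < 0) return "";

    socklen_t len = sizeof(addr);
    getsockname(rx, (sockaddr*)&addr, &len);    // learn the chosen port

    sendto(tx, payload.data(), payload.size(), 0, (sockaddr*)&addr, len);

    char buf[1500];                             // one MTU-sized datagram
    ssize_t n = recvfrom(rx, buf, sizeof(buf), 0, nullptr, nullptr);

    close(rx);
    close(tx);
    return n > 0 ? std::string(buf, static_cast<size_t>(n)) : "";
}
```

Every step here (system call entry, buffer copy, protocol processing) is one of the overheads the three optimization areas listed in the abstract target, which is why an integrated adapter design can push the total below two microseconds.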
Archive | 2008
Carlo Bertolli; Massimiliano Meneghin; Joaquim Gabarro
One of the main issues for Grid applications is dealing with frequent failures, due to the dynamic and distributed nature of Grid platforms. This issue becomes even more important if we want to exploit Grid platforms to support High-Performance applications. Our work starts from the choice of structured parallelism (e.g. skeletons) as the programming model to attack this issue. We present our study of the performance impact of failures on the execution time of a specific class of structured parallel programs, namely task parallel computations. We introduce a Markov model for task parallel computations and present a framework to study it. The result is an analytical tool for predicting the completion time of task parallel computations when the number of tasks is known in advance; when such a number is unknown, we can still obtain the steady-state performance. We describe the framework and present preliminary experimental results to validate it.
arXiv: Distributed, Parallel, and Cluster Computing | 2010
Marco Aldinucci; Marco Danelutto; Peter Kilpatrick; Massimiliano Meneghin; Massimo Torquati
iasted international conference on parallel and distributed computing and systems | 2009
Carlo Bertolli; Daniele Buono; Silvia Lametti; Gabriele Mencagli; Massimiliano Meneghin; Alessio Pascucci; Marco Vanneschi; L. B. Pontecorvo
international symposium on performance evaluation of computer and telecommunication systems | 2009
Gianni Antichi; Christian Callegari; Andrea Di Pietro; Domenico Ficara; Stefano Giordano; Fabio Vitucci; Massimiliano Meneghin; Massimo Torquati; Marco Vanneschi; Massimo Coppola
Archive | 2009
Massimiliano Meneghin; Marco Vanneschi