Publications


Featured research published by Davide Rossetti.


IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2013

GPU Peer-to-Peer Techniques Applied to a Cluster Interconnect

Roberto Ammendola; Massimo Bernaschi; Andrea Biagioni; Mauro Bisson; Massimiliano Fatica; Ottorino Frezza; Francesca Lo Cicero; Alessandro Lonardo; Enrico Mastrostefano; Pier Stanislao Paolucci; Davide Rossetti; Francesco Simula; Laura Tosoratto; P. Vicini

Modern GPUs support special protocols to exchange data directly across the PCI Express bus. While these protocols could be used to reduce GPU data transmission times, essentially by avoiding staging through host memory, they require specific hardware features which are not available on current-generation network adapters. In this paper we describe the architectural modifications required to implement peer-to-peer access to NVIDIA Fermi- and Kepler-class GPUs on an FPGA-based cluster interconnect. In addition, the current software implementation, which integrates this feature by minimally extending the RDMA programming model, is discussed, as well as some issues raised while employing it in a higher-level API like MPI. Finally, the current limits of the technique are studied by analyzing the performance improvements on low-level benchmarks and on two GPU-accelerated applications, showing when and how they benefit from the GPU peer-to-peer method.
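A minimal sketch of the generic mechanism, using the standard CUDA runtime API rather than the paper's FPGA/APEnet+ implementation: peer access is enabled between two GPUs and a buffer is copied directly across PCI Express without staging through host memory. The device IDs and buffer size are illustrative.

// Generic GPU peer-to-peer sketch with the CUDA runtime API
// (not the APEnet+ RDMA path described in the paper).
#include <cuda_runtime.h>
#include <stdio.h>

int main() {
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);   // can GPU 0 access GPU 1?
    cudaDeviceCanAccessPeer(&can10, 1, 0);   // can GPU 1 access GPU 0?
    if (!can01 || !can10) {
        printf("peer access not supported between GPU 0 and GPU 1\n");
        return 1;
    }

    const size_t bytes = 1 << 20;
    float *buf0, *buf1;
    cudaSetDevice(0); cudaMalloc((void **)&buf0, bytes);
    cudaSetDevice(1); cudaMalloc((void **)&buf1, bytes);

    // Enable direct loads/stores and DMA between the two devices.
    cudaSetDevice(0); cudaDeviceEnablePeerAccess(1, 0);
    cudaSetDevice(1); cudaDeviceEnablePeerAccess(0, 0);

    // Copy GPU 1 -> GPU 0 directly across PCIe, bypassing host memory.
    cudaMemcpyPeer(buf0, 0, buf1, 1, bytes);
    cudaDeviceSynchronize();

    cudaSetDevice(1); cudaFree(buf1);
    cudaSetDevice(0); cudaFree(buf0);
    return 0;
}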


arXiv: Computational Physics | 2011

APEnet+: high bandwidth 3D torus direct network for petaflops scale commodity clusters

Roberto Ammendola; Andrea Biagioni; Ottorino Frezza; Francesca Lo Cicero; Alessandro Lonardo; Pier Stanislao Paolucci; Davide Rossetti; A. Salamon; G. Salina; Francesco Simula; Laura Tosoratto; P. Vicini

We describe herein the APElink+ board, a PCIe interconnect adapter featuring the latest advances in wire speed and interface technology plus hardware support for an RDMA programming model and experimental acceleration of GPU networking; this design allows us to build a low-latency, high-bandwidth PC cluster, the APEnet+ network, the new generation of our cost-effective, tens-of-thousands-scalable cluster network architecture. Some test results and a characterization of data transmission on a complete testbench, based on a commercial development card mounting an Altera® FPGA, are provided.
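To make the RDMA-with-GPU-acceleration idea concrete, here is a hypothetical sketch of a one-sided put in which a GPU buffer is handed to the network API exactly like a host buffer; the rdma_* names are illustrative placeholders and do not correspond to the actual APEnet+ software interface.

// Hypothetical sketch: an RDMA-style put where a device (GPU) buffer is
// registered and sent like a host buffer. The rdma_* functions are stubs,
// not the real APEnet+ library calls.
#include <cuda_runtime.h>
#include <stddef.h>

typedef struct { void *addr; size_t len; int is_gpu; } rdma_region;

/* Register a buffer so the NIC can DMA to/from it (stub). */
static rdma_region rdma_register(void *addr, size_t len, int is_gpu) {
    rdma_region r = { addr, len, is_gpu };
    return r;
}

/* One-sided put of a local region into a remote node's buffer (stub). */
static int rdma_put(int remote_rank, const rdma_region *local, size_t remote_offset) {
    (void)remote_rank; (void)local; (void)remote_offset;
    return 0; /* a real implementation would post a descriptor to the NIC */
}

int main(void) {
    float *gpu_buf;
    cudaMalloc((void **)&gpu_buf, 1 << 20);

    /* The only difference from a host-memory put is the is_gpu flag: with
       peer-to-peer support the NIC reads the payload straight from GPU
       memory over PCIe, without a bounce through host RAM. */
    rdma_region src = rdma_register(gpu_buf, 1 << 20, 1);
    rdma_put(1, &src, 0);

    cudaFree(gpu_buf);
    return 0;
}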


IEEE International Conference on High Performance Computing, Data, and Analytics | 2011

QUonG: A GPU-based HPC System Dedicated to LQCD Computing

Roberto Ammendola; Andrea Biagioni; Ottorino Frezza; Francesca Lo Cicero; Alessandro Lonardo; Pier Stanislao Paolucci; Davide Rossetti; Francesco Simula; Laura Tosoratto; P. Vicini

QUonG is an INFN (Istituto Nazionale di Fisica Nucleare) initiative targeted at developing a high performance computing system dedicated to Lattice QCD computations. QUonG is a massively parallel computing platform that leverages commodity multi-core processors coupled with last-generation GPUs. Its network mesh exploits the characteristics of the LQCD algorithm in the design of a point-to-point, high-performance, low-latency 3D torus network interconnecting the computing nodes. The network is built upon the APEnet+ project: it consists of an FPGA-based PCI Express board exposing six fully bidirectional off-board links running at 34 Gbps each, implementing the RDMA protocol and an experimental direct network-to-GPU interface that enables a significant reduction of access latency for inter-node data transfers. The final shape of a complete QUonG deployment is an assembly of standard 42U racks, each one delivering 60 TFlops/rack of peak performance, at a cost of 5 k€/TFlops and an estimated power consumption of 25 kW/rack. A first QUonG system prototype is expected to be delivered at the end of 2011.


High Performance Interconnects | 2015

UCX: An Open Source Framework for HPC Network APIs and Beyond

Pavel Shamis; Manjunath Gorentla Venkata; M. Graham Lopez; Matthew B. Baker; Oscar R. Hernandez; Yossi Itigin; Mike Dubman; Gilad Shainer; Richard L. Graham; Liran Liss; Yiftah Shahar; Sreeram Potluri; Davide Rossetti; Donald Becker; Duncan Poole; Christopher Lamb; Sameer Kumar; Craig B. Stunkel; George Bosilca; Aurelien Bouteiller

This paper presents Unified Communication X (UCX), a set of network APIs and their implementations for high-throughput computing. UCX comes from the combined effort of national laboratories, industry, and academia to design and implement a high-performing and highly scalable network stack for next-generation applications and systems. The UCX design provides the ability to tailor its APIs and network functionality to suit a wide variety of application domains and hardware. We envision these APIs satisfying the networking needs of many programming models, such as the Message Passing Interface (MPI), OpenSHMEM, Partitioned Global Address Space (PGAS) languages, task-based paradigms, and I/O-bound applications. To evaluate the design we implement the APIs and protocols, and measure the performance of overhead-critical network primitives fundamental for implementing many parallel programming models and system libraries. Our results show that the latency, bandwidth, and message rate achieved by the portable UCX prototype are very close to those of the underlying driver. With UCX, we achieved a message exchange latency of 0.89 µs, a bandwidth of 6138.5 MB/s, and a message rate of 14 million messages per second. As far as we know, this is the highest bandwidth and message rate achieved by any publicly known network stack on this hardware.
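A minimal initialization sketch against the open-source UCX (UCP) API described above, assuming the ucp.h header from the UCX distribution; it only brings up a context and a worker with tag-matching enabled, while endpoint creation and the send/receive paths that the paper benchmarks are omitted.

/* Minimal UCP initialization sketch (ucp.h from the UCX distribution). */
#include <ucp/api/ucp.h>
#include <stdio.h>

int main(void) {
    ucp_config_t *config;
    ucp_context_h context;
    ucp_worker_h worker;

    if (ucp_config_read(NULL, NULL, &config) != UCS_OK) return 1;

    ucp_params_t params = {0};
    params.field_mask = UCP_PARAM_FIELD_FEATURES;
    params.features   = UCP_FEATURE_TAG;      /* tag-matched messaging, e.g. for MPI */

    if (ucp_init(&params, config, &context) != UCS_OK) return 1;
    ucp_config_release(config);

    ucp_worker_params_t wparams = {0};
    wparams.field_mask  = UCP_WORKER_PARAM_FIELD_THREAD_MODE;
    wparams.thread_mode = UCS_THREAD_MODE_SINGLE;

    if (ucp_worker_create(context, &wparams, &worker) != UCS_OK) return 1;

    printf("UCP context and worker initialized\n");

    /* Endpoint creation and ucp_tag_send/recv would follow here. */
    ucp_worker_destroy(worker);
    ucp_cleanup(context);
    return 0;
}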


Journal of Parallel and Distributed Computing | 2013

Benchmarking of communication techniques for GPUs

Massimo Bernaschi; Mauro Bisson; Davide Rossetti

We report on the performance obtained, at the application level, by two MPI implementations for InfiniBand that allow direct exchange of data stored in the global memory of Graphics Processing Units (GPUs) based on NVIDIA CUDA. For the same purpose, we also tested the Application Programming Interface of APEnet, a custom, high-performance interconnect technology. As a benchmark we consider the time required to update a single spin of the 3D Heisenberg spin glass model by using the over-relaxation algorithm. The results show that CUDA streams are instrumental in achieving the best possible performance.
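A minimal sketch of the kind of exchange being benchmarked, assuming a CUDA-aware MPI build (as in the InfiniBand MPI implementations tested in the paper): the device pointer returned by cudaMalloc is passed directly to MPI_Send/MPI_Recv, so no explicit staging through host memory appears in application code. Ranks, sizes, and tags are illustrative.

// GPU-to-GPU exchange through a CUDA-aware MPI: device pointers are handed
// straight to MPI calls. Assumes an MPI build with CUDA support and one GPU
// per rank.
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    float *d_buf;
    cudaMalloc((void **)&d_buf, n * sizeof(float));

    if (rank == 0) {
        // Device pointer passed directly to MPI; no explicit host staging.
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}

The paper's observation about CUDA streams concerns overlapping the spin-update kernels with the boundary transfers on separate streams; the sketch above shows only the plain exchange.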


IEEE International Conference on High Performance Computing, Data, and Analytics | 2014

Designing efficient small message transfer mechanism for inter-node MPI communication on InfiniBand GPU clusters

Rong Shi; Sreeram Potluri; Khaled Hamidouche; Jonathan L. Perkins; Mingzhe Li; Davide Rossetti; Dhabaleswar K. Panda

An increasing number of MPI applications are being ported to take advantage of the compute power offered by GPUs. Data movement on GPU clusters continues to be the major bottleneck that keeps scientific applications from fully harnessing the potential of GPUs. Previously, GPU-to-GPU inter-node communication had to move data from GPU memory to host memory before sending it over the network. MPI libraries like MVAPICH2 have provided solutions to alleviate this bottleneck using host-based pipelining techniques. In addition, the newly introduced GPUDirect RDMA (GDR) is a promising solution to further address this data movement bottleneck. However, existing designs in MPI libraries apply the rendezvous protocol to all message sizes, which incurs considerable overhead for small-message communication due to extra synchronization message exchanges. In this paper, we propose new techniques to optimize inter-node GPU-to-GPU communication for small message sizes. Our designs to support the eager protocol include efficient support at both the sender and receiver sides. Furthermore, we propose a new data path to provide fast copies between host and GPU memories. To the best of our knowledge, this is the first study to propose efficient designs for GPU communication of small messages using the eager protocol. Our experimental results demonstrate up to 59% and 63% reduction in latency for GPU-to-GPU and CPU-to-GPU point-to-point communications, respectively. These designs boost the uni-directional bandwidth by 7.3x and 1.7x, respectively. We also evaluate our proposed designs with two end applications: GPULBM and HOOMD-blue. Performance numbers on Kepler GPUs show that, compared to the best existing GDR design, our proposed designs achieve up to a 23.4% latency reduction for GPULBM and a 58% increase in average TPS for HOOMD-blue.
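For context, a generic sketch of the host-based pipelining baseline mentioned above (not the paper's proposed eager/GDR design): the GPU buffer is staged through a pinned host bounce buffer in chunks before being sent. Function and parameter names are illustrative.

// Host-staged pipelining sketch: copy chunks of a device buffer into pinned
// host memory with cudaMemcpyAsync, then send each staged chunk with MPI.
// A real pipeline would double-buffer so the next copy overlaps the send;
// the proposed eager/GDR designs avoid this staging entirely for small messages.
#include <mpi.h>
#include <cuda_runtime.h>

// Send 'count' floats from device memory 'd_src' to rank 'dst', staging
// through a pinned host bounce buffer in 'chunk'-sized pieces.
static void send_gpu_pipelined(const float *d_src, int count, int dst,
                               int chunk, MPI_Comm comm) {
    float *h_stage;
    cudaStream_t stream;
    cudaMallocHost((void **)&h_stage, (size_t)chunk * sizeof(float));
    cudaStreamCreate(&stream);

    for (int off = 0; off < count; off += chunk) {
        int len = (count - off < chunk) ? (count - off) : chunk;
        // Device -> pinned host copy for this chunk.
        cudaMemcpyAsync(h_stage, d_src + off, (size_t)len * sizeof(float),
                        cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);
        // Network send of the staged chunk.
        MPI_Send(h_stage, len, MPI_FLOAT, dst, /*tag=*/0, comm);
    }

    cudaStreamDestroy(stream);
    cudaFreeHost(h_stage);
}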


Computer Physics Communications | 1998

The teraflop supercomputer APEmille: architecture, software and project status report

F. Aglietti; A. Bartoloni; C. Battista; S. Cabasino; M. Cosimi; A. Michelotti; A. Monello; Emanuele Panizzi; P.S. Paolucci; W. Rinaldi; Davide Rossetti; Hubert Simma; M. Torelli; P. Vicini; N. Cabibbo; W. Errico; S. Giovannetti; F. Laico; G. Magazzú; R. Tripiccione

APEmille is an SPMD parallel processor under development at INFN, Italy, in cooperation with DESY, Germany. APEmille is suited for grand-challenge computational problems such as QCD simulations, climate modelling, neural networks, computational chemistry, numerical wind tunnels, and seismic and combustion simulations. Its 1 Teraflop/s peak performance and its architecture, together with its language features, allow such applications to execute effectively. APEmille is based on an array of custom arithmetic processors arranged in a three-dimensional torus. The processor is optimized for complex computations and has a peak performance of 528 Mflop/s at 66 MHz. Each processing element has 8 Mbytes of locally addressable RAM. On the software side, particular emphasis is devoted to the programming languages that will be available (TAO and C++) and their object-oriented, dynamic characteristics: with TAO it is possible to develop language extensions similar to the usual HEP notation; with C++, portability to and from different platforms is made possible.


Journal of Systems Architecture | 2016

Dynamic many-process applications on many-tile embedded systems and HPC clusters

Pier Stanislao Paolucci; Andrea Biagioni; Luis Gabriel Murillo; Frédéric Rousseau; Lars Schor; Laura Tosoratto; Iuliana Bacivarov; Robert Lajos Buecs; Clément Deschamps; Ashraf El-Antably; Roberto Ammendola; Nicolas Fournel; Ottorino Frezza; Rainer Leupers; Francesca Lo Cicero; Alessandro Lonardo; Michele Martinelli; Elena Pastorelli; Devendra Rai; Davide Rossetti; Francesco Simula; Lothar Thiele; P. Vicini; Jan Henrik Weinstock

In the next decade, a growing number of scientific and industrial applications will require power-efficient systems providing unprecedented computation, memory, and communication resources. A promising paradigm foresees the use of heterogeneous many-tile architectures. The resulting computing systems are complex: they must be protected against several sources of faults and critical events, and application programmers must be provided with programming paradigms, software environments, and debugging tools adequate to manage such complexity. The EURETILE (European Reference Tiled Architecture Experiment) consortium conceived, designed, and implemented: (1) an innovative many-tile, many-process dynamic fault-tolerant programming paradigm and software environment, grounded on a lightweight operating system generated by an automated software synthesis mechanism that takes into account the architecture and application specificities; (2) a many-tile heterogeneous hardware system, equipped with a high-bandwidth, low-latency, point-to-point 3D-toroidal interconnect, whose inter-tile interconnect processor includes an experimental mechanism for systemic fault awareness; (3) a full-system simulation environment, supported by innovative parallel technologies and equipped with debugging facilities. We also designed and coded a set of application benchmarks representative of the requirements of future HPC and embedded systems, including (4) a set of dynamic multimedia applications and (5) a large-scale simulator of neural activity and synaptic plasticity. The application benchmarks, compiled through the EURETILE software tool-chain, have been efficiently executed on both the many-tile hardware platform and the software simulator, up to a complexity of a few hundred software processes and hardware cores.


Journal of Instrumentation | 2016

NaNet-10: A 10GbE network interface card for the GPU-based low-level trigger of the NA62 RICH detector

Roberto Ammendola; Andrea Biagioni; M. Fiorini; Ottorino Frezza; A. Lonardo; G. Lamanna; F. Lo Cicero; Michele Martinelli; Ilaria Neri; P.S. Paolucci; Elena Pastorelli; R. Piandani; L. Pontisso; Davide Rossetti; Francesco Simula; M. Sozzi; Laura Tosoratto; P. Vicini

A GPU-based low-level (L0) trigger is currently integrated in the experimental setup of the RICH detector of the NA62 experiment to assess the feasibility of building more refined physics-related trigger primitives and thus improve the trigger discriminating power. To ensure the real-time operation of the system, a dedicated data transport mechanism has been implemented: an FPGA-based Network Interface Card (NaNet-10) receives data from the detectors and forwards them with low, predictable latency to the memory of the GPU performing the trigger algorithms. Results of the reconstruction of ring-shaped hit patterns are reported and discussed.


arXiv: High Energy Physics - Lattice | 2002

Status of APEmille

A. Bartoloni; Ph. Boucaud; N. Cabibbo; F. Calvayrac; M. Della Morte; R. De Pietri; P. De Riso; F. Di Carlo; F. Di Renzo; W. Errico; Roberto Frezzotti; T. Giorgino; Jochen Heitger; Alessandro Lonardo; M. Loukianov; G. Magazzú; J. Micheli; V. Morenas; N. Paschedag; O. Pène; R. Petronzio; Dirk Pleiter; F. Rapuano; Juri Rolf; Davide Rossetti; L. Sartori; H. Simma; F. Schifano; M. Torelli; R. Tripiccione

This paper presents the status of the APEmille project, which is essentially complete as far as machine development and construction are concerned. Several large installations of APEmille are in use for physics production runs, leading to many new results presented at this conference. This paper briefly summarizes the APEmille architecture, reviews the status of the installations, and presents some performance figures for physics codes.

Collaboration


Dive into Davide Rossetti's collaborations.

Top Co-Authors

P. Vicini (Sapienza University of Rome)
Francesco Simula (Sapienza University of Rome)
Andrea Biagioni (Sapienza University of Rome)
Ottorino Frezza (Sapienza University of Rome)
Roberto Ammendola (Istituto Nazionale di Fisica Nucleare)
Alessandro Lonardo (Istituto Nazionale di Fisica Nucleare)
Laura Tosoratto (Sapienza University of Rome)
Elena Pastorelli (Sapienza University of Rome)