
Publication


Featured research published by Roberto Ammendola.


Journal of Physics: Conference Series | 2012

APEnet+: a 3D Torus network optimized for GPU-based HPC Systems

Roberto Ammendola; Andrea Biagioni; Ottorino Frezza; F Lo Cicero; A. Lonardo; P.S. Paolucci; D Rossetti; Francesco Simula; Laura Tosoratto; P. Vicini

In the supercomputing arena, the strong rise of GPU-accelerated clusters is a matter of fact. Within INFN, we proposed an initiative, the QUonG project, whose aim is to deploy a high performance computing system dedicated to scientific computations, leveraging commodity multi-core processors coupled with latest-generation GPUs. The inter-node interconnection system is a point-to-point, high performance, low latency 3D torus network built in the framework of the APEnet+ project. It takes the form of an FPGA-based PCIe network card exposing six fully bidirectional links running at 34 Gbps each and implementing the RDMA protocol. To enable a significant access-latency reduction for inter-node data transfers, a direct network-to-GPU interface was built: specialized hardware blocks, integrated in the APEnet+ board, support GPU-initiated communications using so-called PCIe peer-to-peer (P2P) transactions. This development was carried out in close collaboration with the GPU vendor NVIDIA. The final shape of a complete QUonG deployment is an assembly of standard 42U racks, each capable of 80 TFLOPS of peak performance, at a cost of 5 k€/TFLOPS and an estimated power consumption of 25 kW/rack. In this paper we report on the status of the final rack deployment and on the R&D activities for 2012, which will focus on performance enhancement of the APEnet+ hardware through the adoption of new-generation 28 nm FPGAs, allowing the implementation of a PCIe Gen3 host interface and the addition of new fault-tolerance-oriented capabilities.
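The six links per card (±x, ±y, ±z) give the 3D torus its shape, and the wraparound links halve the worst-case distance in each dimension. As an illustrative sketch only (our own C, not APEnet+ firmware; the function names are hypothetical), the minimal hop count under dimension-order routing can be computed as:

```c
#include <stdlib.h>

/* Minimal-hop distance along one ring of an n-node torus dimension:
   a packet can travel in either direction, so the shortest path is
   the smaller of the direct distance and the wraparound distance. */
static int ring_hops(int from, int to, int n) {
    int d = abs(from - to);
    return d < n - d ? d : n - d;
}

/* Total hop count under dimension-order routing on an nx*ny*nz torus. */
int torus_hops(int x0, int y0, int z0,
               int x1, int y1, int z1,
               int nx, int ny, int nz) {
    return ring_hops(x0, x1, nx)
         + ring_hops(y0, y1, ny)
         + ring_hops(z0, z1, nz);
}
```

On a 4x4x4 torus, for example, node (0,0,0) reaches (3,0,0) in a single hop via the wraparound link rather than three direct hops.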


IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2013

GPU Peer-to-Peer Techniques Applied to a Cluster Interconnect

Roberto Ammendola; Massimo Bernaschi; Andrea Biagioni; Mauro Bisson; Massimiliano Fatica; Ottorino Frezza; Francesca Lo Cicero; Alessandro Lonardo; Enrico Mastrostefano; Pier Stanislao Paolucci; Davide Rossetti; Francesco Simula; Laura Tosoratto; P. Vicini

Modern GPUs support special protocols to exchange data directly across the PCI Express bus. While these protocols could be used to reduce GPU data-transmission times, essentially by avoiding staging through host memory, they require specific hardware features that are not available on current-generation network adapters. In this paper we describe the architectural modifications required to implement peer-to-peer access to NVIDIA Fermi- and Kepler-class GPUs on an FPGA-based cluster interconnect. We also discuss the current software implementation, which integrates this feature by minimally extending the RDMA programming model, as well as some issues raised while employing it in a higher-level API such as MPI. Finally, we study the current limits of the technique by analyzing the performance improvements on low-level benchmarks and on two GPU-accelerated applications, showing when and how they benefit from the GPU peer-to-peer method.
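For bandwidth-bound messages, avoiding the staging copy roughly halves the number of PCIe crossings. A toy first-order model (our own, not from the paper; it deliberately ignores latency terms and protocol overhead) makes the saving concrete:

```c
/* Illustrative first-order model: a staged GPU-to-NIC transfer crosses
   the PCIe bus twice (device-to-host copy, then host-to-NIC DMA), while
   a peer-to-peer transfer crosses it once. Bandwidths are in GB/s
   (1 GB = 1e9 bytes); results are in microseconds. */
double staged_us(double bytes, double d2h_GBs, double h2nic_GBs) {
    return bytes / (d2h_GBs * 1e3) + bytes / (h2nic_GBs * 1e3);
}

double p2p_us(double bytes, double gpu2nic_GBs) {
    return bytes / (gpu2nic_GBs * 1e3);
}
```

With equal bandwidth on every path, the staged transfer takes exactly twice as long as the peer-to-peer one; real hardware adds per-message latency, which is why the benchmarks in the paper matter for small messages.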


arXiv: Computational Physics | 2011

APEnet+: high bandwidth 3D torus direct network for petaflops scale commodity clusters

Roberto Ammendola; Andrea Biagioni; Ottorino Frezza; Francesca Lo Cicero; Alessandro Lonardo; Pier Stanislao Paolucci; Davide Rossetti; A. Salamon; G. Salina; Francesco Simula; Laura Tosoratto; P. Vicini

We describe herein the APElink+ board, a PCIe interconnect adapter featuring the latest advances in wire speed and interface technology, plus hardware support for an RDMA programming model and experimental acceleration of GPU networking. This design allows us to build a low latency, high bandwidth PC cluster, the APEnet+ network: the new generation of our cost-effective cluster network architecture, scalable to tens of thousands of nodes. Some test results and a characterization of data transmission on a complete testbench, based on a commercial development card mounting an Altera® FPGA, are provided.


IEEE International Conference on High Performance Computing, Data and Analytics | 2011

QUonG: A GPU-based HPC System Dedicated to LQCD Computing

Roberto Ammendola; Andrea Biagioni; Ottorino Frezza; Francesca Lo Cicero; Alessandro Lonardo; Pier Stanislao Paolucci; Davide Rossetti; Francesco Simula; Laura Tosoratto; P. Vicini

QUonG is an INFN (Istituto Nazionale di Fisica Nucleare) initiative aimed at developing a high performance computing system dedicated to Lattice QCD computations. QUonG is a massively parallel computing platform that leverages commodity multi-core processors coupled with latest-generation GPUs. Its network mesh exploits the characteristics of the LQCD algorithm in the design of a point-to-point, high performance, low latency 3D torus network interconnecting the computing nodes. The network is built upon the APEnet+ project: it consists of an FPGA-based PCI Express board exposing six fully bidirectional off-board links running at 34 Gbps each, implementing the RDMA protocol and an experimental direct network-to-GPU interface that enables a significant access-latency reduction for inter-node data transfers. The final shape of a complete QUonG deployment is an assembly of standard 42U racks, each capable of 60 TFlops of peak performance, at a cost of 5 k€/TFlops and an estimated power consumption of 25 kW/rack. A first QUonG system prototype is expected to be delivered at the end of 2011.
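The quoted rack figures imply both a per-rack cost and an energy efficiency; a back-of-envelope check (our own arithmetic, not stated in the paper):

```c
/* Rack cost in k-euro from peak performance and price-per-TFlops target. */
double rack_cost_keur(double tflops_per_rack, double keur_per_tflops) {
    return tflops_per_rack * keur_per_tflops;
}

/* Implied energy efficiency in GFlops per watt. */
double gflops_per_watt(double tflops_per_rack, double kw_per_rack) {
    return (tflops_per_rack * 1e3) / (kw_per_rack * 1e3);
}
```

At 60 TFlops and 5 k€/TFlops this works out to about 300 k€ per rack, and 60 TFlops over 25 kW gives roughly 2.4 GFlops/W of peak throughput per unit power.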


Journal of Instrumentation | 2013

APEnet+ 34 Gbps data transmission system and custom transmission logic

Roberto Ammendola; Andrea Biagioni; Ottorino Frezza; A. Lonardo; F Lo Cicero; Pier Stanislao Paolucci; D Rossetti; Francesco Simula; Laura Tosoratto; P. Vicini

APEnet+ is a point-to-point, low-latency, 3D-torus network controller integrated in a PCIe Gen2 board based on the Altera Stratix IV FPGA. We characterize the transmission system (embedded transceivers driving external QSFP+ modules), analyzing signal integrity, throughput, latency, BER and jitter at data rates up to 34 Gbps. We estimate the efficiency of custom logic able to sustain 2.6 GB/s per link with an FPGA on-chip memory footprint of 40 KB, providing deadlock-free routing and systemic fault awareness. Finally, we show preliminary results obtained with the embedded transceivers of a next-generation FPGA and outline some ideas for increasing performance within the same FPGA memory footprint.
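The two figures quoted above imply a link efficiency; a quick check (our own arithmetic, assuming decimal units throughout, i.e. 1 Gbps = 1e9 bit/s and 1 GB = 1e9 bytes):

```c
/* Link efficiency from raw line rate and sustained payload bandwidth:
   34 Gbps raw is 34/8 = 4.25 GB/s, so sustaining 2.6 GB/s corresponds
   to roughly 61% of the raw rate, the rest going to encoding and
   protocol overhead. */
double link_efficiency(double raw_gbps, double sustained_GBs) {
    double raw_GBs = raw_gbps / 8.0;   /* bits/s to bytes/s */
    return sustained_GBs / raw_GBs;
}
```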


arXiv: High Energy Physics - Lattice | 2005

APENet: LQCD clusters à la APE

Roberto Ammendola; M. Guagnelli; G. Mazza; F. Palombi; R. Petronzio; Davide Rossetti; A. Salamon; P. Vicini

Developed by the APE group, APENet is a new high speed, low latency, 3-dimensional interconnect architecture optimized for PC clusters running LQCD-like numerical applications. The hardware implementation is based on a single PCI-X 133 MHz network interface card hosting six independent bidirectional channels with a peak bandwidth of 676 MB/s in each direction. We discuss preliminary benchmark results showing performance similar to, or better than, that of high-end commercial network systems.


Journal of Instrumentation | 2014

NaNet: a flexible and configurable low-latency NIC for real-time trigger systems based on GPUs

Roberto Ammendola; Andrea Biagioni; Ottorino Frezza; G. Lamanna; A. Lonardo; F Lo Cicero; Pier Stanislao Paolucci; F. Pantaleo; D Rossetti; Francesco Simula; Marco S. Sozzi; Laura Tosoratto; P. Vicini

NaNet is an FPGA-based PCIe x8 Gen2 NIC supporting 1/10 GbE links and the custom 34 Gbps APElink channel. The design has GPUDirect RDMA capabilities and features a network-stack protocol-offload module, making it suitable for building low-latency, real-time GPU-based computing systems. We provide a detailed description of the modular NaNet hardware architecture and present latency and bandwidth benchmarks for the GbE and APElink channels, followed by a performance analysis of a case study: the GPU-based low-level trigger for the RICH detector in the NA62 CERN experiment, using either the NaNet GbE or the APElink channel. Finally, we outline future project activities.


Field-Programmable Technology | 2013

Virtual-to-Physical address translation for an FPGA-based interconnect with host and GPU remote DMA capabilities

Roberto Ammendola; Andrea Biagioni; Ottorino Frezza; Francesca Lo Cicero; A. Lonardo; Pier Stanislao Paolucci; D Rossetti; Francesco Simula; Laura Tosoratto; P. Vicini

We developed a custom FPGA-based Network Interface Controller named APEnet+, aimed at GPU-accelerated clusters for High Performance Computing. The card exploits peer-to-peer capabilities (GPUDirect RDMA) of the latest NVIDIA GPGPU devices and the RDMA paradigm to perform fast direct communication between computing nodes, offloading network-task execution from the host CPU. In this work we focus on the implementation of a virtual-to-physical address translation mechanism using the FPGA's embedded soft-processor. Address management is the most demanding task on the NIC receive side (we estimated up to 70% of the μC load), making it the main culprit for the data bottleneck. To improve the performance of this task, and hence of data transfer over the network, we added a specialized hardware logic block acting as a Translation Lookaside Buffer. This block uses a Content-Addressable Memory implementation designed for scalability and speed. We present detailed measurements demonstrating the benefits of this custom logic: a substantial reduction in address translation latency (from a measured 1.9 μs to 124 ns) and a performance enhancement of both host-bound and GPU-bound data transfers (up to ~60% bandwidth increase) in given message-size ranges.
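The mechanism can be pictured as a small fully associative cache keyed by virtual page number. The following is a toy software model (our own C, not the APEnet+ RTL; sizes, names, and the slow-path stub are all hypothetical) of a CAM-backed TLB with a fallback walk and round-robin replacement:

```c
#include <stdint.h>

#define TLB_ENTRIES 8
#define PAGE_SHIFT  12                    /* 4 KiB pages */
#define PAGE_MASK   ((1u << PAGE_SHIFT) - 1)

typedef struct {
    uint64_t vpn[TLB_ENTRIES];            /* CAM keys: virtual page numbers */
    uint64_t pfn[TLB_ENTRIES];            /* physical frame numbers */
    int      valid[TLB_ENTRIES];
    int      next;                        /* round-robin victim index */
    long     hits, misses;
} tlb_t;

/* Stand-in slow path: in the real card this is the soft-processor's
   page-table walk; a fixed offset keeps the model self-contained. */
static uint64_t slow_walk(uint64_t vpn) { return vpn + 0x1000; }

uint64_t tlb_translate(tlb_t *t, uint64_t vaddr) {
    uint64_t vpn = vaddr >> PAGE_SHIFT;
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (t->valid[i] && t->vpn[i] == vpn) {   /* CAM hit */
            t->hits++;
            return (t->pfn[i] << PAGE_SHIFT) | (vaddr & PAGE_MASK);
        }
    t->misses++;                                 /* miss: take the slow path */
    uint64_t pfn = slow_walk(vpn);
    int v = t->next;
    t->next = (t->next + 1) % TLB_ENTRIES;
    t->vpn[v] = vpn; t->pfn[v] = pfn; t->valid[v] = 1;
    return (pfn << PAGE_SHIFT) | (vaddr & PAGE_MASK);
}
```

In hardware the linear scan is what the CAM collapses into a single-cycle parallel match, which is where the 1.9 μs to 124 ns reduction comes from.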


Journal of Systems Architecture | 2016

Dynamic many-process applications on many-tile embedded systems and HPC clusters

Pier Stanislao Paolucci; Andrea Biagioni; Luis Gabriel Murillo; Frédéric Rousseau; Lars Schor; Laura Tosoratto; Iuliana Bacivarov; Robert Lajos Buecs; Clément Deschamps; Ashraf El-Antably; Roberto Ammendola; Nicolas Fournel; Ottorino Frezza; Rainer Leupers; Francesca Lo Cicero; Alessandro Lonardo; Michele Martinelli; Elena Pastorelli; Devendra Rai; Davide Rossetti; Francesco Simula; Lothar Thiele; P. Vicini; Jan Henrik Weinstock

In the next decade, a growing number of scientific and industrial applications will require power-efficient systems providing unprecedented computation, memory, and communication resources. A promising paradigm foresees the use of heterogeneous many-tile architectures. The resulting computing systems are complex: they must be protected against several sources of faults and critical events, and application programmers must be provided with programming paradigms, software environments and debugging tools adequate to manage such complexity. The EURETILE (European Reference Tiled Architecture Experiment) consortium conceived, designed, and implemented: (1) an innovative many-tile, many-process, dynamic, fault-tolerant programming paradigm and software environment, grounded on a lightweight operating system generated by an automated software-synthesis mechanism that takes into account the architecture and application specificities; (2) a many-tile heterogeneous hardware system, equipped with a high-bandwidth, low-latency, point-to-point 3D-toroidal interconnect, whose inter-tile interconnect processor carries an experimental mechanism for systemic fault awareness; (3) a full-system simulation environment, supported by innovative parallel technologies and equipped with debugging facilities. We also designed and coded a set of application benchmarks representative of the requirements of future HPC and embedded systems, including (4) a set of dynamic multimedia applications and (5) a large-scale simulator of neural activity and synaptic plasticity. The application benchmarks, compiled through the EURETILE software tool-chain, have been executed efficiently on both the many-tile hardware platform and the software simulator, up to a complexity of a few hundred software processes and hardware cores.


Journal of Instrumentation | 2016

NaNet-10: A 10GbE network interface card for the GPU-based low-level trigger of the NA62 RICH detector

Roberto Ammendola; Andrea Biagioni; M. Fiorini; Ottorino Frezza; A. Lonardo; G. Lamanna; F. Lo Cicero; Michele Martinelli; Ilaria Neri; P.S. Paolucci; Elena Pastorelli; R. Piandani; L. Pontisso; Davide Rossetti; Francesco Simula; M. Sozzi; Laura Tosoratto; P. Vicini

A GPU-based low-level (L0) trigger is currently integrated in the experimental setup of the RICH detector of the NA62 experiment, to assess the feasibility of building more refined physics-related trigger primitives and thus improving the trigger's discriminating power. To ensure real-time operation of the system, a dedicated data-transport mechanism has been implemented: an FPGA-based Network Interface Card (NaNet-10) receives data from the detectors and forwards them with low, predictable latency to the memory of the GPU performing the trigger algorithms. Results of the ring-shaped hit-pattern reconstruction are reported and discussed.

Collaboration


Dive into Roberto Ammendola's collaborations.

Top Co-Authors

P. Vicini (Sapienza University of Rome)
Francesco Simula (Sapienza University of Rome)
Andrea Biagioni (Sapienza University of Rome)
Ottorino Frezza (Sapienza University of Rome)
Laura Tosoratto (Sapienza University of Rome)
Elena Pastorelli (Sapienza University of Rome)
A. Lonardo (Sapienza University of Rome)
Alessandro Lonardo (Istituto Nazionale di Fisica Nucleare)