Pier Stanislao Paolucci

Publication

Featured research published by Pier Stanislao Paolucci.

Computer Physics Communications | 1987

The APE computer: An array processor optimized for lattice gauge theory simulations

M. Albanese; P. Bacilieri; S. Cabasino; N. Cabibbo; F. Costantini; G. Fiorentini; F. Flore; A. Fonti; A. Fucci; M.P. Lombardo; S. Galeotti; P. Giacomelli; P. A. Marchesini; Enzo Marinari; F. Marzano; A. Miotto; Pier Stanislao Paolucci; Giorgio Parisi; D. Pascoli; D. Passuello; S. Petrarca; F. Rapuano; E. Remiddi; R. W. Rusack; G. Salina; R. Tripiccione

The APE computer is a high performance processor designed to provide massive computational power for intrinsically parallel and homogeneous applications. APE is a linear array of processing elements and memory boards that execute in parallel in SIMD mode under the control of a CERN/SLAC 3081/E. Processing elements and memory boards are connected by a ‘circular’ switchnet. The hardware and software architecture of APE, as well as its implementation, are discussed in this paper. Some physics results obtained in the simulation of lattice gauge theories are also presented.

International Journal of High Speed Computing | 1993

THE APE-100 COMPUTER: (I) THE ARCHITECTURE

C. Battista; S. Cabasino; F. Marzano; Pier Stanislao Paolucci; J. Pech; Federico Rapuano; R. Sarno; Gian Marco Todesco; Mario Torelli; W. Tross; P. Vicini; N. Cabibbo; Enzo Marinari; Giorgio Parisi; G. Salina; Filippo del Prete; Adriano Lai; Maria Paola Lombardo; R. Tripiccione; Adolfo Fucci

We describe APE-100, a SIMD, modular parallel processor architecture for large scale scientific computation. The largest configuration that will be implemented in the present design will deliver a peak speed of 100 Gflops. This performance is, for instance, required for high precision computations in Quantum Chromo Dynamics, for which APE-100 is very well suited.

ACM Transactions on Embedded Computing Systems | 2008

Platform-based software design flow for heterogeneous MPSoC

Katalin Popovici; Xavier Guerin; Frédéric Rousseau; Pier Stanislao Paolucci; Ahmed Amine Jerraya

Current multimedia applications demand complex heterogeneous multiprocessor architectures with specific communication infrastructure in order to achieve the required performance. Programming these architectures usually results in writing separate low-level code for the different processors (DSP, microcontroller), implying late global validation of the overall application with the hardware platform. We propose a platform-based software design flow that uses the resources of the architecture efficiently and allows easy experimentation with several mappings of the application onto the platform resources. We use a high-level environment to capture the initial representations of both the application and the architecture. An executable software stack is generated automatically for each processor from the initial model. Software generation and validation are performed gradually, following the different software abstraction levels. Specific software development platforms (abstract models of the architecture) are generated and used to allow debugging of the different software components with explicit hardware-software interaction. We applied this approach to a multimedia platform, involving a high performance DSP and a RISC processor, to explore the communication architecture and generate efficient executable code for a multimedia application. Based on automatic tools, the proposed flow increases productivity and preserves design quality.

IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2013

GPU Peer-to-Peer Techniques Applied to a Cluster Interconnect

Roberto Ammendola; Massimo Bernaschi; Andrea Biagioni; Mauro Bisson; Massimiliano Fatica; Ottorino Frezza; Francesca Lo Cicero; Alessandro Lonardo; Enrico Mastrostefano; Pier Stanislao Paolucci; Davide Rossetti; Francesco Simula; Laura Tosoratto; P. Vicini

Modern GPUs support special protocols to exchange data directly across the PCI Express bus. While these protocols could be used to reduce GPU data transmission times, chiefly by avoiding staging to host memory, they require specific hardware features which are not available on current generation network adapters. In this paper we describe the architectural modifications required to implement peer-to-peer access to NVIDIA Fermi- and Kepler-class GPUs on an FPGA-based cluster interconnect. We also discuss the current software implementation, which integrates this feature by minimally extending the RDMA programming model, as well as some issues raised when employing it under a higher-level API such as MPI. Finally, the current limits of the technique are studied by analyzing the performance improvements on low-level benchmarks and on two GPU-accelerated applications, showing when and how they seem to benefit from the GPU peer-to-peer method.

arXiv: Computational Physics | 2011

APEnet+: high bandwidth 3D torus direct network for petaflops scale commodity clusters

Roberto Ammendola; Andrea Biagioni; Ottorino Frezza; Francesca Lo Cicero; Alessandro Lonardo; Pier Stanislao Paolucci; Davide Rossetti; A. Salamon; G. Salina; Francesco Simula; Laura Tosoratto; P. Vicini

We describe herein the APElink+ board, a PCIe interconnect adapter featuring the latest advances in wire speed and interface technology plus hardware support for an RDMA programming model and experimental acceleration of GPU networking. This design allows us to build a low latency, high bandwidth PC cluster, the APEnet+ network, the new generation of our cost-effective, tens-of-thousands-scalable cluster network architecture. Some test results and characterization of data transmission of a complete testbench, based on a commercial development card mounting an Altera® FPGA, are provided.

IEEE International Conference on High Performance Computing, Data, and Analytics | 2011

QUonG: A GPU-based HPC System Dedicated to LQCD Computing

Roberto Ammendola; Andrea Biagioni; Ottorino Frezza; Francesca Lo Cicero; Alessandro Lonardo; Pier Stanislao Paolucci; Davide Rossetti; Francesco Simula; Laura Tosoratto; P. Vicini

QUonG is an INFN (Istituto Nazionale di Fisica Nucleare) initiative to develop a high performance computing system dedicated to Lattice QCD computations. QUonG is a massively parallel computing platform that leverages commodity multi-core processors coupled with latest-generation GPUs. Its network mesh exploits the characteristics of LQCD algorithms for the design of a point-to-point, high performance, low latency 3D torus network to interconnect the computing nodes. The network is built upon the APEnet+ project: it consists of an FPGA-based PCI Express board exposing six fully bidirectional off-board links running at 34 Gbps each, and implementing the RDMA protocol and an experimental direct network-to-GPU interface, enabling a significant reduction in access latency for inter-node data transfers. The final shape of a complete QUonG deployment is an assembly of standard 42U racks, each capable of 60 TFlops of peak performance, at a cost of 5 k€/TFlops and with an estimated power consumption of 25 kW/rack. A first QUonG system prototype is expected to be delivered at the end of 2011.
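As a rough illustration of the 3D torus topology mentioned in this abstract, the sketch below computes the six neighbours of a node, one per direction along each axis, with wrap-around at the edges. The function name, the coordinate convention and the lattice size are assumptions made for the example; they are not taken from the QUonG or APEnet+ design.

```python
from typing import List, Tuple

Coord = Tuple[int, int, int]


def torus_neighbors(node: Coord, dims: Coord) -> List[Coord]:
    """Return the six neighbours of `node` on a 3D torus of size `dims`.

    Hypothetical helper: coordinates wrap around modulo the torus size,
    so every node has a neighbour in both directions along each axis.
    """
    neighbors = []
    for axis in range(3):
        for step in (+1, -1):
            coord = list(node)
            coord[axis] = (coord[axis] + step) % dims[axis]
            neighbors.append(tuple(coord))
    return neighbors


if __name__ == "__main__":
    # Corner node of an assumed 4x4x4 torus: the -1 steps wrap to index 3.
    print(torus_neighbors((0, 0, 0), (4, 4, 4)))
```

Six neighbours per node match the six off-board links per board quoted in the abstract, which is why a torus of this kind needs no external switches.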

Rapid System Prototyping | 2007

Efficient Software Development Platforms for Multimedia Applications at Different Abstraction Levels

Katalin Popovici; Xavier Guerin; Frédéric Rousseau; Pier Stanislao Paolucci; Ahmed Amine Jerraya

Multimedia applications require heterogeneous multiprocessor architectures with specific I/O components in order to achieve the required computation and communication performance. The different processors run different software stacks made of the application code and the hardware-dependent software layer. Developing this software usually relies on a high-level programming environment that does not handle specific architecture capabilities. We propose abstract software development platforms that allow the different software layers to be debugged incrementally and that accurately estimate the use of the resources of the architecture. The software development platform is an abstract model of the architecture that allows the software to be executed with detailed hardware-software interaction, performance measurement and software debugging. Different software development platforms are generated automatically from an initial Simulink model and are used to debug the different software components and to easily experiment with several mappings of the application onto the platform resources. In this paper we apply the proposed approach to a multimedia platform, involving a high performance DSP and a RISC processor, to validate the executable code for an MJPEG decoder application.

Journal of Instrumentation | 2013

APEnet+ 34 Gbps data transmission system and custom transmission logic

Roberto Ammendola; Andrea Biagioni; Ottorino Frezza; A. Lonardo; F. Lo Cicero; Pier Stanislao Paolucci; D. Rossetti; Francesco Simula; Laura Tosoratto; P. Vicini

APEnet+ is a point-to-point, low-latency, 3D-torus network controller integrated in a PCIe Gen2 board based on the Altera Stratix IV FPGA. We characterize the transmission system (embedded transceivers driving external QSFP+ modules) analyzing signal integrity, throughput, latency, BER and jitter at different data rates up to 34 Gbps. We estimate the efficiency of a custom logic able to sustain 2.6 GB/s per link with an FPGA on-chip memory footprint of 40 KB, providing deadlock-free routing and systemic awareness of faults. Finally, we show the preliminary results obtained with the embedded transceivers of a next-generation FPGA and outline some ideas to increase the performance with the same FPGA memory footprint.

Journal of Instrumentation | 2014

NaNet: a flexible and configurable low-latency NIC for real-time trigger systems based on GPUs

Roberto Ammendola; Andrea Biagioni; Ottorino Frezza; G. Lamanna; A. Lonardo; F. Lo Cicero; Pier Stanislao Paolucci; F. Pantaleo; D. Rossetti; Francesco Simula; Marco S. Sozzi; Laura Tosoratto; P. Vicini

NaNet is an FPGA-based PCIe x8 Gen2 NIC supporting 1/10 GbE links and the custom 34 Gbps APElink channel. The design has GPUDirect RDMA capabilities and features a network stack protocol offloading module, making it suitable for building low-latency, real-time GPU-based computing systems. We provide a detailed description of the NaNet hardware modular architecture. Benchmarks for latency and bandwidth for GbE and APElink channels are presented, followed by a performance analysis on the case study of the GPU-based low level trigger for the RICH detector in the NA62 CERN experiment, using either the NaNet GbE or the APElink channel. Finally, we give an outline of future project activities.

Field-Programmable Technology | 2013

Virtual-to-Physical address translation for an FPGA-based interconnect with host and GPU remote DMA capabilities

Roberto Ammendola; Andrea Biagioni; Ottorino Frezza; Francesca Lo Cicero; A. Lonardo; Pier Stanislao Paolucci; D. Rossetti; Francesco Simula; Laura Tosoratto; P. Vicini

We developed a custom FPGA-based Network Interface Controller named APEnet+ aimed at GPU accelerated clusters for High Performance Computing. The card exploits the peer-to-peer capabilities (GPUDirect RDMA) of the latest NVIDIA GPGPU devices and the RDMA paradigm to perform fast direct communication between computing nodes, offloading network tasks from the host CPU. In this work we focus on the implementation of a Virtual-to-Physical address translation mechanism, using the FPGA embedded soft-processor. Address management is the most demanding task for the NIC receiving side (we estimated up to 70% of the μC load), making it the main cause of the data bottleneck. To improve the performance of this task and hence improve data transfer over the network, we added a specialized hardware logic block acting as a Translation Lookaside Buffer. This block makes use of a specific Content Addressable Memory implementation designed for scalability and speed. We present detailed measurements to demonstrate the benefits coming from the introduction of such custom logic: a substantial address translation latency reduction (from a measured value of 1.9 μs to 124 ns) and a performance enhancement of both host-bound and GPU-bound data transfers (a bandwidth increase of up to ~60%) in given message size ranges.
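To make the idea of a Translation Lookaside Buffer in front of a slower translation path concrete, here is a minimal software sketch. It is not the APEnet+ hardware design: the class name, the 4 KiB page size, the four-entry capacity, the dictionary standing in for the page table, and the least-recently-used replacement policy are all assumptions chosen for the example.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size, not stated in the abstract


class TinyTLB:
    """Hypothetical software model of a small TLB with LRU replacement."""

    def __init__(self, page_table, capacity=4):
        self.page_table = page_table  # virtual page number -> physical page number
        self.capacity = capacity
        self.entries = OrderedDict()  # cached translations, oldest first
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        """Translate a virtual address, caching the page-level mapping."""
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)  # mark as most recently used
        else:
            self.misses += 1
            # Slow path: full lookup (raises KeyError if the page is unmapped).
            self.entries[vpn] = self.page_table[vpn]
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict least recently used
        return self.entries[vpn] * PAGE_SIZE + offset


if __name__ == "__main__":
    tlb = TinyTLB({0: 7, 1: 3})
    for addr in (0x10, PAGE_SIZE + 5, 0x20):
        print(hex(addr), "->", hex(tlb.translate(addr)))
    print("hits:", tlb.hits, "misses:", tlb.misses)
```

In hardware, the dictionary lookup on the fast path would correspond to a parallel match in a content-addressable memory, and the slow path to the translation performed by the embedded soft-processor.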
