Publication


Featured research published by Stefano Dal Pra.


IEEE Transactions on Nuclear Science | 2010

Performance of 10 Gigabit Ethernet Using Commodity Hardware

Marco Bencivenni; Daniela Bortolotti; A. Carbone; Alessandro Cavalli; Andrea Chierici; Stefano Dal Pra; Donato De Girolamo; Luca dell'Agnello; Massimo Donatelli; Armando Fella; Domenico Galli; Antonia Ghiselli; Daniele Gregori; Alessandro Italiano; Rajeev Kumar; U. Marconi; B. Martelli; Mirco Mazzucato; Michele Onofri; Gianluca Peco; S. Perazzini; Andrea Prosperini; Pier Paolo Ricci; Elisabetta Ronchieri; F Rosso; Davide Salomoni; Vladimir Sapunenko; Vincenzo Vagnoni; Riccardo Veraldi; Maria Cristina Vistoli

In the prospect of employing 10 Gigabit Ethernet as networking technology for online systems and offline data analysis centers of High Energy Physics experiments, we performed a series of measurements on the performance of 10 Gigabit Ethernet, using the network interface cards mounted on the PCI-Express bus of commodity PCs both as transmitters and receivers. In real operating conditions, the achievable maximum transfer rate through a network link is not only limited by the capacity of the link itself, but also by that of the memory and peripheral buses and by the ability of the CPUs and of the Operating System to handle packet processing and interrupts raised by the network interface cards in due time. Besides the TCP and UDP maximum data transfer throughputs, we also measured the CPU loads of the sender/receiver processes and of the interrupt and soft-interrupt handlers as a function of the packet size, using either standard or "jumbo" Ethernet frames. In addition, we also performed the same measurements by simultaneously reading data from Fibre Channel links and forwarding them through a 10 Gigabit Ethernet link, hence emulating the behavior of a disk server in a Storage Area Network exporting data to client machines via 10 Gigabit Ethernet.
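
As an illustration of the kind of sender/receiver test described above, the sketch below measures TCP throughput for a configurable message size between two commodity hosts. It is a minimal Python approximation, not the benchmark code used by the authors; the port, message size and duration are arbitrary assumptions.

```python
# Minimal sketch (not the authors' benchmark code): a TCP sender/receiver pair
# for measuring throughput at a given message size, in the spirit of the
# 10 GbE tests described above. Port, message size and duration are assumptions.
import socket
import time

PORT = 5001        # arbitrary test port
MSG_SIZE = 8192    # bytes per send(); would be varied across runs
DURATION = 10      # seconds of sustained transfer per measurement

def receiver():
    """Accept one connection and report the achieved receive rate."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("", PORT))
    srv.listen(1)
    conn, _ = srv.accept()
    total, start = 0, time.time()
    while True:
        data = conn.recv(MSG_SIZE)
        if not data:
            break
        total += len(data)
    elapsed = time.time() - start
    print(f"received {total * 8 / elapsed / 1e9:.2f} Gbit/s")

def sender(host):
    """Send fixed-size messages for DURATION seconds and report the send rate."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect((host, PORT))
    payload = b"\x00" * MSG_SIZE
    sent, deadline = 0, time.time() + DURATION
    while time.time() < deadline:
        sock.sendall(payload)
        sent += len(payload)
    sock.close()
    print(f"sent {sent * 8 / DURATION / 1e9:.2f} Gbit/s")
```

While such a transfer is running, the CPU load of the sender/receiver processes and of the interrupt and soft-interrupt handlers could be sampled separately, for instance from /proc/stat on Linux.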


Journal of Physics: Conference Series | 2017

Improved Cloud resource allocation: how INDIGO-DataCloud is overcoming the current limitations in Cloud schedulers

Álvaro López García; Lisa Zangrando; Massimo Sgaravatto; Vincent Llorens; Sara Vallero; Valentina Zaccolo; S. Bagnasco; Sonia Taneja; Stefano Dal Pra; Davide Salomoni; Giacinto Donvito

Paper presented at: 22nd International Conference on Computing in High Energy and Nuclear Physics (CHEP2016), 10–14 October 2016, San Francisco.


Journal of Physics: Conference Series | 2015

A self-configuring control system for storage and computing departments at INFN-CNAF Tier1

Daniele Gregori; Stefano Dal Pra; Pier Paolo Ricci; Michele Pezzi; Andrea Prosperini; Vladimir Sapunenko

The storage and farming departments at the INFN-CNAF Tier1 [1] manage thousands of computing nodes and several hundred servers that provide access to the disk and tape storage. In particular, the storage server machines provide the following services: efficient access to about 15 petabytes of disk space through different GPFS file system clusters, data transfers between LHC Tier sites (Tier0, Tier1 and Tier2) via a GridFTP cluster and the Xrootd protocol, and finally read and write operations on the magnetic tape backend. An essential requirement for a reliable service is a control system that can warn when problems arise and that is able to perform automatic recovery operations in case of service interruptions or major failures. Moreover, configurations change during daily operations: for example, the roles of GPFS cluster nodes can be modified, so obsolete nodes must be removed from the control system and new servers added to those already present. Managing all these changes manually can be difficult when many changes occur, can take a long time and is easily subject to human error or misconfiguration. For these reasons we have developed a control system able to reconfigure itself whenever a change occurs. This system has now been in production for about a year at the INFN-CNAF Tier1 with good results and hardly any major drawback. There are three key elements in this system. The first is a software configuration service (e.g. Quattor or Puppet) for the server machines that we want to monitor with the control system; this service must ensure the presence of the appropriate sensors and custom scripts on the monitored nodes and must be able to install and update software packages on them. The second key element is a database containing, in a suitable format, information on all the machines in production, able to provide for each of them the principal attributes such as the type of hardware, the network switch to which the machine is connected, whether the machine is physical or virtual, the hypervisor to which it belongs, and so on. The last key element is the control system software itself (in our implementation we chose Nagios), capable of assessing the status of servers and services, attempting to restore a working state, restarting or inhibiting software services, and sending suitable alarm messages to the site administrators. The integration of these three elements is achieved through appropriate scripts and custom implementations that allow the self-configuration of the system according to a decisional logic; the combination of all the above-mentioned components is discussed in detail in this paper.
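
A minimal sketch of the self-configuration idea follows, assuming a hypothetical inventory database and file layout (the table, column and path names below are illustrative, not the CNAF implementation): the monitoring configuration is regenerated from the machine database, so nodes added to or removed from production automatically appear in or disappear from the control system.

```python
# Illustrative sketch, not the production code: regenerate Nagios host
# definitions from an inventory database. Table/column names, the output path
# and the role-to-template mapping are assumptions for this example.
import sqlite3

NAGIOS_HOSTS_FILE = "/etc/nagios/conf.d/generated_hosts.cfg"  # assumed path

def regenerate_hosts(db_path="inventory.db"):
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT hostname, ip_address, role FROM hosts WHERE in_production = 1"
    )
    with open(NAGIOS_HOSTS_FILE, "w") as cfg:
        for hostname, ip_address, role in rows:
            cfg.write(
                "define host {\n"
                f"    host_name  {hostname}\n"
                f"    address    {ip_address}\n"
                f"    use        {role}-template\n"  # role decides which checks apply
                "}\n\n"
            )
    conn.close()
    # After rewriting the file, Nagios would be reloaded so that the new
    # host list takes effect (omitted here).

if __name__ == "__main__":
    regenerate_hosts()
```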


Journal of Physics: Conference Series | 2010

A lightweight high availability strategy for Atlas LCG File Catalogs

B. Martelli; Alessandro De Salvo; Daniela Anzellotti; Lorenzo Rinaldi; Alessandro Cavalli; Stefano Dal Pra; Luca dell'Agnello; Daniele Gregori; Andrea Prosperini; Pier Paolo Ricci; Vladimir Sapunenko

The LCG File Catalog (LFC) is a key component of the LHC Computing Grid middleware [1], as it contains the mapping between Logical File Names and Physical File Names on the Grid. The Atlas computing model foresees multiple local LFCs, hosted at each Tier-1 and at Tier-0, containing all information about the files stored in the regional cloud. As the local LFC contents are currently not replicated anywhere, this represents a dangerous single point of failure for all of the Atlas regional clouds. In order to solve this problem we propose a novel solution for high availability (HA) of Oracle-based Grid services, obtained by combining an Oracle Data Guard deployment with a series of application-level scripts. This approach has the advantage of being very easy to deploy and maintain, and represents a good candidate solution for Tier-2s, which are usually small centres with little manpower dedicated to service operations. We also present the results of a wide range of functionality and performance tests run on a test-bed with characteristics similar to those required for production. The test-bed consists of a failover deployment between the Italian LHC Tier-1 (INFN-CNAF) and an Atlas Tier-2 located at INFN-Roma1. Moreover, we explain how the proposed strategy can be deployed on the present Grid infrastructure without requiring any change to the middleware, in a way that is totally transparent to end users and applications.
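
As an illustration of the application-level scripting layer, the sketch below periodically checks that the primary catalog's Oracle listener answers and raises an alarm after repeated failures, at which point the standby would be activated via Data Guard. Host name, port and thresholds are assumptions; these are not the scripts used by the authors.

```python
# Conceptual sketch only: an application-level availability probe for the
# primary LFC back-end database. Hostname, port and thresholds are assumptions;
# the actual failover to the Data Guard standby is out of scope here.
import socket
import time

PRIMARY_DB_HOST = "lfc-db.example.infn.it"   # hypothetical primary endpoint
ORACLE_LISTENER_PORT = 1521                  # default Oracle listener port
CHECK_INTERVAL = 30                          # seconds between probes
MAX_FAILURES = 5                             # consecutive failures before alarm

def listener_reachable(host, port, timeout=5):
    """Return True if a TCP connection to the listener succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def monitor():
    failures = 0
    while True:
        if listener_reachable(PRIMARY_DB_HOST, ORACLE_LISTENER_PORT):
            failures = 0
        else:
            failures += 1
            if failures >= MAX_FAILURES:
                # Here the real scripts would alert the admins and trigger
                # the switch to the standby database.
                print("primary unreachable: initiate failover procedure")
                break
        time.sleep(CHECK_INTERVAL)
```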


Proceedings of International Symposium on Grids and Clouds (ISGC) 2016 — PoS(ISGC 2016) | 2017

Elastic CNAF DataCenter extension via opportunistic resources

Tommaso Boccali; Stefano Dal Pra; Vincenzo Ciaschini; Luca dell'Agnello; Andrea Chierici; Donato Di Girolamo; Vladimir Sapunenko; Alessandro Italiano

The CNAF computing facility, in Bologna (Italy), is the biggest WLCG computing centre in Italy; it serves all WLCG experiments plus more than 20 non-WLCG Virtual Organizations, and currently deploys more than 200 kHS06 of computing power, more than 20 PB of disk and 40 PB of tape via a GPFS SAN. The centre has started a program to evaluate the possibility of extending its resources onto external entities, either commercial, opportunistic or simply remote, in order to be prepared for future upgrades or temporary bursts in experiment activity. The approach followed is meant to be completely transparent to users, with the additional external resources added directly to the CNAF LSF batch system; several variants are possible, such as the use of VPN tunnels to establish LSF communications between hosts, a multi-master LSF approach, or, in the longer term, the use of HTCondor. Concerning storage, the simplest approach is to use Xrootd fallback to CNAF storage, unfortunately viable only for some experiments; a more transparent approach involves the use of the GPFS/AFM module in order to cache files directly on the remote facilities. In this paper we focus on the technical aspects of the integration and assess the difficulties of using the different remote virtualisation technologies made available at the different sites. A set of benchmarks is provided in order to allow an evaluation of the solution for CPU- and data-intensive workflows. The evaluation of Aruba as a resource provider for CNAF is under test with limited available resources; a ramp-up to a larger scale is being discussed. On a parallel path, this paper shows a similar extension attempt using proprietary resources at ReCaS-Bari; the chosen solution is simpler in its setup, but shares many commonalities.


Proceedings of International Symposium on Grids and Clouds (ISGC) 2016 — PoS(ISGC 2016) | 2017

Elastic Computing from Grid sites to External Clouds

Giuseppe Codispoti; Riccardo Di Maria; Cristina Aiftimiei; D. Bonacorsi; Patrizia Calligola; Vincenzo Ciaschini; Alessandro Costantini; Stefano Dal Pra; Claudio Grandi; Diego Michelotto; Matteo Panella; Gianluca Peco; Vladimir Sapunenko; Massimo Sgaravatto; Sonia Taneja; Giovanni Zizzi; Donato De Girolamo

LHC experiments are now in Run-II data taking and approaching new challenges in the operation of their computing facilities in future Runs. Despite having demonstrated the ability to sustain operations at scale during Run-I, it has become evident that the computing infrastructure for Run-II is dimensioned to cope at most with the average amount of data recorded, not with peak usage. Peaks are frequent, may create large backlogs and have a direct impact on data reconstruction completion times, and hence on data availability for physics analysis. Among others, the CMS experiment has been exploring (since the first Long Shutdown period after Run-I) the access and utilisation of Cloud resources provided by external partners or commercial providers. In this work we present proofs of concept of the elastic extension of a CMS Tier-3 site in Bologna (Italy) onto an external OpenStack infrastructure. We start by presenting the experience of a first exercise on the “Cloud Bursting” of a CMS Grid site, using a novel LSF configuration to dynamically register new worker nodes. We then move to more recent work on a “Cloud Site as-a-Service” prototype, based on a more direct access to, and integration of, OpenStack resources in the CMS workload management system. Results with real CMS workflows and future plans are also presented and discussed.
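
A minimal sketch of the elastic provisioning step is given below, assuming the openstacksdk client and illustrative image, flavor and network names (not the actual CMS/Bologna configuration): a new virtual worker node is booted with a cloud-init payload that, in a real setup, would register it with the batch or workload management system.

```python
# Illustrative sketch, assuming the openstacksdk client; the cloud name, image,
# flavor, network and cloud-init content are placeholders, not the actual
# CMS/Bologna setup. It boots one extra virtual worker node on OpenStack.
import base64
import openstack

# cloud-init payload that would configure the node and register it with the
# batch / workload management system (the script path is hypothetical).
CLOUD_INIT = """#cloud-config
runcmd:
  - /opt/site/bin/join_batch_system.sh
"""

def boot_worker_node(index):
    conn = openstack.connect(cloud="cms-t3")         # entry in clouds.yaml (assumed)
    image = conn.image.find_image("wn-base-image")    # assumed image name
    flavor = conn.compute.find_flavor("m1.large")     # assumed flavor
    network = conn.network.find_network("wn-net")     # assumed network

    server = conn.compute.create_server(
        name=f"dynamic-wn-{index:03d}",
        image_id=image.id,
        flavor_id=flavor.id,
        networks=[{"uuid": network.id}],
        user_data=base64.b64encode(CLOUD_INIT.encode()).decode(),
    )
    # Block until the instance is ACTIVE before handing it to the batch system.
    return conn.compute.wait_for_server(server)
```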


Journal of Physics: Conference Series | 2015

Dynamic partitioning as a way to exploit new computing paradigms: the cloud use case.

Vincenzo Ciaschini; Stefano Dal Pra; Luca dell'Agnello

The WLCG community and many groups in the HEP community have based their computing strategy on the Grid paradigm, which has proved successful and still meets its goals. However, Grid technology has not spread much beyond these communities; in the commercial world, the cloud paradigm is the emerging way to provide computing services. The WLCG experiments aim to integrate their current computing model with cloud deployments and to take advantage of so-called opportunistic resources (including HPC facilities), which are usually not Grid compliant. One feature missing from the most common cloud frameworks is the concept of a job scheduler, which plays a key role in a traditional computing centre by enabling fair-share based access to the resources for the experiments in a scenario where demand greatly outstrips availability. At CNAF we are investigating the possibility of accessing the Tier-1 computing resources as an OpenStack based cloud service. The system, which exploits the dynamic partitioning mechanism already used to enable multicore computing, allows us to avoid a static split of the computing resources in the Tier-1 farm while preserving a share-friendly approach. The hosts in a dynamically partitioned farm may be moved into or out of the partition according to suitable policies for requesting and releasing computing resources. Nodes requested for the partition switch their role and become available to play a different one. In the cloud use case, hosts may switch from acting as Worker Nodes in the batch farm to cloud compute nodes made available to tenants. In this paper we describe the dynamic partitioning concept, its implementation and its integration with our current batch system, LSF.
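
The decision loop of such a dynamic partitioning daemon can be summarised by the sketch below; the helper functions and thresholds are hypothetical placeholders rather than the CNAF implementation, which acts on LSF and OpenStack rather than on in-memory lists.

```python
# Conceptual sketch of a dynamic partitioning decision loop. All helpers and
# thresholds are hypothetical placeholders, not the CNAF implementation.
import time

CHECK_INTERVAL = 300                            # seconds between re-evaluations (assumption)
batch_nodes = ["wn-001", "wn-002", "wn-003"]    # toy inventory
cloud_nodes = []

def pending_batch_jobs():
    """Placeholder: would query the batch system (e.g. LSF) for queued jobs."""
    return 0

def requested_cloud_cores():
    """Placeholder: would query the cloud scheduler for unsatisfied requests."""
    return 0

def drain_and_move(node, target):
    """Placeholder: close the node, wait for it to drain, then switch its role."""
    print(f"moving {node} to the {target} partition")

def rebalance():
    if requested_cloud_cores() > 0 and pending_batch_jobs() == 0 and batch_nodes:
        node = batch_nodes.pop()          # batch side idle, cloud side starved
        drain_and_move(node, "cloud")
        cloud_nodes.append(node)
    elif pending_batch_jobs() > 0 and requested_cloud_cores() == 0 and cloud_nodes:
        node = cloud_nodes.pop()          # batch queue backing up, cloud idle
        drain_and_move(node, "batch")
        batch_nodes.append(node)
    # otherwise keep the current partition; fair-share policies would refine this

if __name__ == "__main__":
    while True:
        rebalance()
        time.sleep(CHECK_INTERVAL)
```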


Journal of Physics: Conference Series | 2014

Changing the batch system in a Tier 1 computing center: why and how

Andrea Chierici; Stefano Dal Pra

At the Italian Tier1 centre at CNAF we are evaluating the possibility of changing the current production batch system. This activity is motivated mainly by the search for a more flexible licensing model and by the wish to avoid vendor lock-in. We performed a technology tracking exercise and, among many possible solutions, we chose to evaluate Grid Engine as an alternative, because its adoption is increasing in the HEPiX community and because it is supported by the EMI middleware that we currently use on our computing farm. Another INFN site evaluated Slurm, and we will compare our results in order to understand the pros and cons of the two solutions. We present the results of our evaluation of Grid Engine, in order to understand whether it can fit the requirements of a Tier-1 centre, compared with the solution we adopted long ago. We performed a survey and a critical re-evaluation of our farming infrastructure: much of the production software (accounting and monitoring above all) relies on our current solution, and changing it required us to write new wrappers and adapt the infrastructure to the new system. We believe the results of this investigation can be very useful to other Tier-1 and Tier-2 centres in a similar situation, where the effort of switching may appear too hard to sustain. We provide guidelines to help understand how difficult this operation can be and how long the change may take.
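
One concrete consequence of such a migration is that site tools such as accounting have to be decoupled from the specific batch system. A minimal sketch of such a wrapper is shown below; the command-line queries shown are the standard LSF and Grid Engine accounting commands, but the wrapper itself is illustrative and not the CNAF production code.

```python
# Illustrative sketch of a batch-system-agnostic accounting wrapper, not the
# CNAF production wrappers. bacct (LSF) and qacct (Grid Engine) are the usual
# accounting queries; output parsing is deliberately left as raw text here.
import subprocess

class AccountingBackend:
    """Common interface the site accounting tools would program against."""
    def raw_record(self, job_id):
        raise NotImplementedError

class LSFAccounting(AccountingBackend):
    def raw_record(self, job_id):
        # LSF: detailed accounting record for a finished job
        return subprocess.run(["bacct", "-l", str(job_id)],
                              capture_output=True, text=True, check=True).stdout

class GridEngineAccounting(AccountingBackend):
    def raw_record(self, job_id):
        # Grid Engine: accounting record for a finished job
        return subprocess.run(["qacct", "-j", str(job_id)],
                              capture_output=True, text=True, check=True).stdout

def get_backend(name):
    """Select the backend once, so the rest of the tooling is unchanged."""
    return {"lsf": LSFAccounting, "gridengine": GridEngineAccounting}[name]()
```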


Journal of Physics: Conference Series | 2012

INFN Tier-1 Testbed Facility

Daniele Gregori; Alessandro Cavalli; Luca dell'Agnello; Stefano Dal Pra; Andrea Prosperini; Pierpaolo Ricci; Elisabetta Ronchieri; Vladimir Sapunenko

INFN-CNAF, located in Bologna, is the Information Technology Centre of the National Institute of Nuclear Physics (INFN). In the framework of the Worldwide LHC Computing Grid, INFN-CNAF is one of the eleven worldwide Tier-1 centres that store and reprocess Large Hadron Collider (LHC) data. The Italian Tier-1 provides the storage resources (disk space for short-term needs and tapes for long-term needs) and the computing power needed for data processing and analysis by the LHC scientific community. Furthermore, the INFN Tier-1 houses computing resources for other particle physics experiments, such as CDF at Fermilab and SuperB at Frascati, as well as for astroparticle and space physics experiments. The computing centre is a very complex infrastructure: the hardware layer includes the network, storage and farming areas, while the software layer includes open source and proprietary software. Software updates and the addition of new hardware can unexpectedly degrade the production activity of the centre; therefore a testbed facility has been set up in order to reproduce and certify the various layers of the Tier-1. In this article we describe the testbed and the checks performed.


Journal of Physics: Conference Series | 2011

The StoRM Certification Process

Elisabetta Ronchieri; Michele Dibenedetto; Riccardo Zappi; Stefano Dal Pra; Cristina Aiftimiei; Sergio Traldi

StoRM is an implementation of version 2.2 of the SRM interface, used by all Large Hadron Collider (LHC) experiments and by non-LHC experiments as an SRM endpoint at different Tiers of the Worldwide LHC Computing Grid. The complexity of its services and the demands of experiments and users are increasing day by day. The growing needs in terms of service level from the StoRM user communities make it necessary to design and implement a more effective testing procedure to quickly and reliably validate new StoRM candidate releases, both on the code side (for example via unit tests and schema validators) and on the final software product (for example via functionality tests and stress tests). Until now, testing of the StoRM service has been a critical quality activity performed in an ad-hoc, informal manner by developers, testers and users. In this paper we describe the certification mechanism used by the StoRM team to increase the robustness and reliability of the StoRM services. We illustrate the various types of tests, such as quality, installation, configuration, functionality, stress and performance tests, defined on the basis of a set of use cases gathered through the collaboration among the StoRM team, the experiments and the users. Each type of test can easily be extended or reduced over time. The proposed mechanism is based on a new configurable test suite. This is executed by the certification team, which is responsible for validating release candidate packages as well as bug-fix (or patch) packages on a testbed that covers all possible use cases. Whenever a failure occurs, the package is returned to the developers, and a new package is then awaited for validation.
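
The idea of a configurable, category-driven test suite can be sketched as below; the category names follow the paper, while the configuration format and runner are illustrative assumptions, not the StoRM certification testsuite itself.

```python
# Illustrative sketch of a configurable, category-driven test runner in the
# spirit of the certification process described above (not the StoRM
# testsuite). Categories follow the paper; the test functions are placeholders.
CATEGORIES = ["quality", "installation", "configuration",
              "functionality", "stress", "performance"]

# Each category maps to a list of (name, callable) pairs; tests can be added
# or removed per release without touching the runner itself.
TESTS = {category: [] for category in CATEGORIES}

def register(category):
    """Decorator used to attach a test function to a category."""
    def wrapper(func):
        TESTS[category].append((func.__name__, func))
        return func
    return wrapper

@register("installation")
def rpm_installs_cleanly():
    return True   # placeholder: would install the candidate package on the testbed

@register("functionality")
def srm_ping_answers():
    return True   # placeholder: would call an SRM client against the endpoint

def run(selected_categories):
    """Run the selected categories and report failures to the certification team."""
    failures = []
    for category in selected_categories:
        for name, test in TESTS[category]:
            if not test():
                failures.append(f"{category}/{name}")
    return failures

if __name__ == "__main__":
    failed = run(["installation", "functionality"])
    print("FAILED:" if failed else "all selected tests passed", *failed)
```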

Collaboration


Top co-authors of Stefano Dal Pra, all affiliated with the Istituto Nazionale di Fisica Nucleare:

Elisabetta Ronchieri
Luca dell'Agnello
Alessandro Italiano
Andrea Chierici
Cristina Aiftimiei
Davide Salomoni
Riccardo Zappi
Vincenzo Ciaschini