L. Magnoni
University of California, Irvine
Publications
Featured research published by L. Magnoni.
Journal of Physics: Conference Series | 2008
Flavia Donno; Lana Abadie; Paolo Badino; Jean Philippe Baud; Ezio Corso; Shaun De Witt; Patrick Fuhrmann; Junmin Gu; B. Koblitz; Sophie Lemaitre; Maarten Litmaath; Dimitry Litvintsev; Giuseppe Lo Presti; L. Magnoni; Gavin McCance; Tigran Mkrtchan; Rémi Mollon; Vijaya Natarajan; Timur Perelmutov; D. Petravick; Arie Shoshani; Alex Sim; David Smith; Paolo Tedesco; Riccardo Zappi
Storage Services are crucial components of the Worldwide LHC Computing Grid (WLCG) infrastructure, which spans more than 200 sites and serves computing and storage resources to the High Energy Physics LHC communities. Up to tens of petabytes of data are collected every year by the four LHC experiments at CERN. To process these large data volumes it is important to establish a protocol and a very efficient interface to the various storage solutions adopted by the WLCG sites. In this work we report on the experience acquired during the definition of the Storage Resource Manager v2.2 protocol. In particular, we focus on the study performed to enhance the interface and make it suitable for use by the WLCG communities. At the moment, five different storage solutions implement the SRM v2.2 interface: BeStMan (LBNL), CASTOR (CERN and RAL), dCache (DESY and FNAL), DPM (CERN), and StoRM (INFN and ICTP). After a detailed review of the protocol, various test suites were written to identify the most effective set of tests: the S2 test suite from CERN and the SRM-Tester test suite from LBNL. These test suites have helped verify the consistency and coherence of the proposed protocol and validate the existing implementations. We conclude our work by describing the results achieved.
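As a rough illustration of the asynchronous request pattern defined by SRM v2.2 (operations such as srmPrepareToGet return a request token that the client then polls until a transfer URL is available), the Python sketch below mocks the client-side loop. The MockSrmClient class, the status strings and the returned URLs are placeholders for illustration, not an actual SRM binding.

import time

class MockSrmClient:
    """Hypothetical stand-in for an SRM v2.2 endpoint, used only to show the call sequence."""
    def __init__(self):
        self._polls = {}

    def prepare_to_get(self, surl):
        # srmPrepareToGet is asynchronous: it immediately returns a request token.
        token = "req-0001"
        self._polls[token] = 0
        return token

    def status_of_get_request(self, token):
        # srmStatusOfGetRequest is polled until the file is staged and a TURL is ready.
        self._polls[token] += 1
        if self._polls[token] < 3:
            return ("SRM_REQUEST_INPROGRESS", None)
        return ("SRM_SUCCESS", "gsiftp://storage.example.org/data/file.root")

client = MockSrmClient()
token = client.prepare_to_get("srm://storage.example.org/data/file.root")
status, turl = client.status_of_get_request(token)
while status == "SRM_REQUEST_INPROGRESS":
    time.sleep(1)                      # back off between polls
    status, turl = client.status_of_get_request(token)
print(status, turl)                    # the transfer can now proceed via the returned TURL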
IEEE Conference on Mass Storage Systems and Technologies | 2007
Lana Abadie; Paolo Badino; J.-P. Baud; Ezio Corso; M. Crawford; S. De Witt; Flavia Donno; A. Forti; Ákos Frohner; Patrick Fuhrmann; G. Grosdidier; Junmin Gu; Jens Jensen; B. Koblitz; Sophie Lemaitre; Maarten Litmaath; D. Litvinsev; G. Lo Presti; L. Magnoni; T. Mkrtchan; Alexander Moibenko; Rémi Mollon; Vijaya Natarajan; Gene Oleynik; Timur Perelmutov; D. Petravick; Arie Shoshani; Alex Sim; David Smith; M. Sponza
Storage management is one of the most important enabling technologies for large-scale scientific investigations. Having to deal with multiple heterogeneous storage and file systems is one of the major bottlenecks in managing, replicating, and accessing files in distributed environments. Storage resource managers (SRMs), named after their Web services control protocol, provide the technology needed to manage the rapidly growing distributed data volumes, as a result of faster and larger computational facilities. SRMs are grid storage services providing interfaces to storage resources, as well as advanced functionality such as dynamic space allocation and file management on shared storage systems. They call on transport services to bring files into their space transparently and provide effective sharing of files. SRMs are based on a common specification that emerged over time and evolved into an international collaboration. This approach of an open specification that can be used by various institutions to adapt to their own storage systems has proven to be a remarkable success - the challenge has been to provide a consistent homogeneous interface to the grid, while allowing sites to have diverse infrastructures. In particular, supporting optional features while preserving interoperability is one of the main challenges we describe in this paper. We also describe using SRM in a large international high energy physics collaboration, called WLCG, to prepare to handle the large volume of data expected when the Large Hadron Collider (LHC) goes online at CERN. This intense collaboration led to refinements and additional functionality in the SRM specification, and the development of multiple interoperating implementations of SRM for various complex multi-component storage systems.
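One of the advanced functions named above, dynamic space allocation, follows the same token-based style: a client reserves space (srmReserveSpace in the v2.2 specification) and then quotes the returned space token when writing files. The sketch below is again a conceptual Python mock with invented names and return values, not a real SRM client.

class MockSrmSpaceManager:
    """Hypothetical mock showing how a space token might be obtained and reused."""
    def __init__(self):
        self._spaces = {}

    def reserve_space(self, desired_bytes, lifetime_s):
        # srmReserveSpace returns a space token identifying the reserved area.
        token = "SPACE-ATLAS-0001"
        self._spaces[token] = {"size": desired_bytes, "lifetime": lifetime_s}
        return token

    def prepare_to_put(self, surl, space_token):
        # Subsequent writes can target the reserved space by quoting the token.
        assert space_token in self._spaces
        return "gsiftp://storage.example.org" + surl

srm = MockSrmSpaceManager()
token = srm.reserve_space(desired_bytes=10 * 1024**3, lifetime_s=86400)
turl = srm.prepare_to_put("/atlas/raw/run001/file.root", space_token=token)
print(token, turl)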
Journal of Physics: Conference Series | 2012
A. Kazarov; G. Lehmann Miotto; L. Magnoni
The Trigger and Data Acquisition (TDAQ) system of the ATLAS experiment at CERN is the infrastructure responsible for collecting and transferring ATLAS experimental data from the detectors to the mass storage system. It relies on a large, distributed computing environment, including thousands of computing nodes with thousands of applications running concurrently. In such a complex environment, information analysis is fundamental for controlling application behavior, error reporting and operational monitoring. During data-taking runs, streams of messages sent by applications via the message reporting system, together with data published by applications via information services, are the main sources of knowledge about the correctness of running operations. The flow of data produced (with an average rate of O(1-10 kHz)) is constantly monitored by experts to detect problems or misbehavior. This requires strong competence and experience in understanding and discovering problems and root causes, and often the meaningful information is not in a single message or update but in the aggregated behavior over a certain time-line. The AAL project aims to reduce the manpower needed and to assure a constantly high quality of problem detection by automating most of the monitoring tasks and providing real-time correlation of data-taking and system metrics. The project combines technologies from different disciplines; in particular it leverages an Event Driven Architecture to unify the flow of data from the ATLAS infrastructure, a Complex Event Processing (CEP) engine for the correlation of events, and a message-oriented architecture for component integration. The project is composed of two main components: a core processing engine, responsible for the correlation of events through expert-defined queries, and a web-based front-end to present real-time information and interact with the system. All components work in a loosely coupled, event-based architecture, with a message broker centralizing all communication between modules. The result is an intelligent system able to extract and compute relevant information from the flow of operational data to provide real-time feedback to human experts, who can promptly react when needed. The paper presents the design and implementation of the AAL project, together with the results of its usage as an automated monitoring assistant for the ATLAS data-taking infrastructure.
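A minimal way to picture the kind of expert-defined correlation the engine performs is a sliding time window over the operational message stream. The plain-Python sketch below (with invented application names and thresholds, and no dependence on the actual CEP engine) flags an application whose error messages exceed a rate threshold within the window.

import time
from collections import deque, Counter

WINDOW_S = 30          # sliding window length in seconds (illustrative value)
THRESHOLD = 100        # error messages per window before raising an alert

window = deque()       # (timestamp, app_name) pairs currently inside the window

def on_error_message(app_name, now=None):
    """Feed one error message into the window and return any triggered alerts."""
    now = now if now is not None else time.time()
    window.append((now, app_name))
    # Drop events that have fallen out of the window.
    while window and now - window[0][0] > WINDOW_S:
        window.popleft()
    counts = Counter(app for _, app in window)
    return [app for app, n in counts.items() if n > THRESHOLD]

# Example: simulate a burst of errors from one application.
for i in range(150):
    alerts = on_error_message("HLT-Supervisor-42", now=1000.0 + i * 0.1)
print(alerts)   # ['HLT-Supervisor-42'] once the per-window count exceeds THRESHOLD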
IEEE-NPSS Real-Time Conference | 2012
A. Kazarov; Alina Corso Radu; L. Magnoni; Giovanna Lehmann Miotto
The Trigger and Data Acquisition (TDAQ) system of the ATLAS experiment at the LHC at CERN is a very complex distributed computing system, composed of O(10000) applications running on a farm of commodity CPUs. The system has been designed and developed by dozens of software engineers and physicists since the end of the 1990s, and it will be maintained in operational mode during the lifetime of the experiment. The TDAQ system is controlled by the Control framework, which includes a set of software components and tools used for system configuration, distributed process handling, synchronization of Run Control state transitions, etc. The huge flow of operational monitoring data produced is constantly monitored by operators and experts in order to detect problems or misbehavior. Given the scale of the system and the rates of data to be analyzed, the automation of the Control framework functionality in the areas of operational monitoring, system verification, error detection and recovery is a strong requirement. The paper describes the requirements, technology choices, high-level design and some implementation aspects of advanced Control tools based on knowledge-base technologies. The main aim of these tools is to store and reuse developers' expertise and operational knowledge in order to help TDAQ operators control the system with maximum efficiency during the lifetime of the experiment.
Journal of Physics: Conference Series | 2015
L. Magnoni; U. Suthakar; C. Cordeiro; M. Georgiou; Julia Andreeva; A. Khan; David R. Smith
Monitoring the WLCG infrastructure requires the gathering and analysis of a high volume of heterogeneous data (e.g. data transfers, job monitoring, site tests) coming from different services and experiment-specific frameworks, in order to provide a uniform and flexible interface for scientists and sites. The current architecture, where relational database systems are used to store, process and serve monitoring data, has limitations in coping with the foreseen increase in the volume (e.g. higher LHC luminosity) and the variety (e.g. new data-transfer protocols and new resource types, such as cloud computing) of WLCG monitoring events. This paper presents a new scalable data store and analytics platform designed by the Support for Distributed Computing (SDC) group at the CERN IT department, which uses a variety of technologies, each targeting a specific aspect of large-scale distributed data processing (an approach commonly referred to as a lambda architecture). Results of data processing on Hadoop for WLCG data-activity monitoring are presented, showing how the new architecture can easily analyze hundreds of millions of transfer logs in a few minutes. Moreover, a comparison of data partitioning, compression and file formats (e.g. CSV, Avro) is presented, with particular attention to how the file structure impacts overall MapReduce performance. In conclusion, the evolution of the current implementation, which focuses on data storage and batch processing, towards a complete lambda architecture is discussed, with consideration of candidate technologies for the serving layer (e.g. Elasticsearch) and a description of a proof-of-concept implementation, based on Apache Spark and Esper, for the real-time part, which compensates for batch-processing latency and automates the detection of problems and failures.
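As a sketch of the batch layer described above, the PySpark snippet below aggregates a plain-text transfer log by (source, destination, status). The field layout, file path and aggregation keys are illustrative assumptions, not the actual WLCG monitoring schema.

from pyspark import SparkContext

sc = SparkContext(appName="transfer-log-aggregation")

# Assumed CSV layout: timestamp,src_site,dst_site,status,bytes  (illustrative only)
lines = sc.textFile("hdfs:///monitoring/transfers/2015-01-*.csv")

def to_pair(line):
    ts, src, dst, status, nbytes = line.split(",")
    return ((src, dst, status), (1, int(nbytes)))

# Sum transfer counts and transferred bytes per (source, destination, status).
totals = (lines.map(to_pair)
               .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])))

for (src, dst, status), (count, nbytes) in totals.take(10):
    print(src, dst, status, count, nbytes)

sc.stop()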
Journal of Physics: Conference Series | 2012
Alexandru Sicoe; Giovanna Lehmann Miotto; L. Magnoni; S. Kolos; Igor Soloviev
This paper describes P-BEAST, a highly scalable, highly available and durable system for archiving monitoring information of the trigger and data acquisition (TDAQ) system of the ATLAS experiment at CERN. Currently this consists of 20,000 applications running on 2,400 interconnected computers, but it is foreseen to grow further in the near future. P-BEAST stores considerable amounts of monitoring information which would otherwise be lost. Making this data accessible facilitates long-term analysis and faster debugging. The novelty of this research consists of using a modern key-value storage technology (Cassandra) to satisfy the massive time-series data rates, flexibility and scalability requirements entailed by the project. The loose schema allows the stored data to evolve seamlessly with the information flowing within the Information Service. An architectural overview of P-BEAST is presented, alongside a discussion of the technologies considered as candidates for storing the data. The arguments which ultimately led to choosing Cassandra are explained. Measurements taken during operation in a production environment illustrate the data volume absorbed by the system and the techniques for reducing the required Cassandra storage space overhead.
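A common way to lay out such time-series data in Cassandra is to put the monitored attribute (plus a time bucket) in the partition key and the sample timestamp in the clustering key, so that writes spread across partitions and time-range reads stay sequential. The CQL below, issued through the Python driver, is a generic sketch of that pattern with invented keyspace, table and column names; it is not the actual P-BEAST schema.

from datetime import datetime
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS monitoring
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# Partition by (application, attribute, day bucket) so partitions stay bounded;
# cluster by timestamp so a time-range scan is a sequential read within one partition.
session.execute("""
    CREATE TABLE IF NOT EXISTS monitoring.attribute_samples (
        application text,
        attribute   text,
        day         text,
        ts          timestamp,
        value       double,
        PRIMARY KEY ((application, attribute, day), ts)
    ) WITH CLUSTERING ORDER BY (ts DESC)
""")

session.execute(
    "INSERT INTO monitoring.attribute_samples (application, attribute, day, ts, value) "
    "VALUES (%s, %s, %s, %s, %s)",
    ("HLT-Supervisor-42", "cpu_load", "2012-05-01", datetime(2012, 5, 1, 12, 0, 0), 0.73),
)
cluster.shutdown()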
Journal of Physics: Conference Series | 2011
Giovanna Lehmann Miotto; L. Magnoni; J. Sloper
The ATLAS Trigger and Data Acquisition (TDAQ) infrastructure is responsible for filtering and transferring ATLAS experimental data from the detectors to the mass storage systems. It relies on a large, distributed computing system composed of thousands of software applications running concurrently. In such a complex environment, information sharing is fundamental for controlling application behavior, error reporting and operational monitoring. During data taking, the streams of messages sent by applications and the data published via information services are constantly monitored by experts to verify the correctness of running operations and to understand problematic situations. To simplify and improve system analysis and error detection tasks, we developed the TDAQ Analytics Dashboard, a web application that aims to collect, correlate and effectively visualize this real-time flow of information. The TDAQ Analytics Dashboard is composed of two main entities that reflect the twofold scope of the application. The first is the engine, a Java service that performs aggregation, processing and filtering of the real-time data stream and computes statistical correlations on sliding windows of time. The results are made available to clients via a simple web interface supporting an SQL-like query syntax. The second is the visualization, provided by an Ajax-based web application that runs in the client's browser. The dashboard approach allows information to be presented in a clear and customizable structure. Several types of interactive graphs are provided as widgets that can be dynamically added to and removed from visualization panels. Each widget acts as a client for the engine, querying the web interface to retrieve data according to the desired criteria. In this paper we present the design, development and evolution of the TDAQ Analytics Dashboard. We also present the statistical analysis computed by the application in this first period of high-energy data-taking operations for the ATLAS experiment.
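The statistical correlation on sliding windows mentioned above can be pictured as computing, over the most recent samples, a correlation coefficient between two operational metrics (for instance a message rate and an event-building rate). The Python sketch below is a generic illustration with invented metric names; it is not the engine's actual Java implementation.

from collections import deque
from math import sqrt

WINDOW = 60   # number of most recent samples to correlate (illustrative)

msg_rate = deque(maxlen=WINDOW)     # e.g. messages/s reported by an application
evt_rate = deque(maxlen=WINDOW)     # e.g. events/s built over the same period

def add_sample(m, e):
    """Append one pair of samples and return the Pearson correlation of the window."""
    msg_rate.append(m)
    evt_rate.append(e)
    n = len(msg_rate)
    if n < 2:
        return None
    mean_m = sum(msg_rate) / n
    mean_e = sum(evt_rate) / n
    cov = sum((a - mean_m) * (b - mean_e) for a, b in zip(msg_rate, evt_rate))
    var_m = sum((a - mean_m) ** 2 for a in msg_rate)
    var_e = sum((b - mean_e) ** 2 for b in evt_rate)
    if var_m == 0 or var_e == 0:
        return None
    return cov / sqrt(var_m * var_e)

for t in range(100):
    r = add_sample(m=1000 + t, e=200 + 0.2 * t)   # perfectly correlated toy streams
print(round(r, 3))   # ~1.0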
Journal of Physics: Conference Series | 2012
Aaron Harwood; G. Lehmann Miotto; L. Magnoni; W. Vandelli; D. Savu
This paper describes a new approach to the visualization of information about the operation of the ATLAS Trigger and Data Acquisition system. ATLAS is one of the two general-purpose detectors positioned along the Large Hadron Collider at CERN. Its data acquisition system consists of several thousand computers interconnected via multiple gigabit Ethernet networks, which are constantly monitored via different tools. Operational parameters ranging from the temperature of the computers to the network utilization are stored in several databases for later analysis. Although the ability to view these data sets individually is already in place, currently there is no way to view this data together, in a uniform format, from one location. The ADAM project has been launched in order to overcome this limitation. It defines a uniform web interface to collect data from multiple providers that have different structures. It is capable of aggregating and correlating the data according to user-defined criteria. Finally, it visualizes the collected data using a flexible and interactive front-end web system. Structurally, the project comprises three main levels of the data collection cycle. Level 0 represents the information sources within ATLAS. These providers do not store information in a uniform fashion, so the first step of the project was to define a common interface with which to expose the stored data. The interface designed for the project originates from the Google Data Protocol API. The idea is to allow read-only access to data providers through HTTP requests similar in format to an SQL query. This provides a standardized way to access the different information sources within ATLAS. Level 1 can be considered the engine of the system. Its primary task is to gather data from multiple data sources via the common interface, to correlate this data together or over a defined time series, and to expose the combined data as a whole to the Level 2 web interface. Level 2 is designed to present the data in a similar style and aesthetic, despite the different data sources. Pages can be constructed, edited and personalized by users to suit the specific data being shown, and can show a collection of graphs displaying data potentially coming from multiple sources. The project as a whole has a great amount of scope thanks to the uniform approach chosen for exposing data and the flexibility of Level 2 in presenting results. The paper describes in detail the design and implementation of this new tool. In particular, we go through the project architecture, the implementation choices and examples of usage of the system in place within the ATLAS TDAQ infrastructure.
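To make the read-only, Google Data Protocol-inspired access concrete, the snippet below shows how a Level 1 component might pull a time slice from a Level 0 provider over HTTP with SQL-like query parameters. The host name, path and parameter names are invented for illustration and are not the actual ADAM interface.

import requests

# Hypothetical Level 0 provider endpoint and query parameters (illustrative only).
BASE_URL = "https://adam-provider.example.cern.ch/data"

params = {
    "select": "hostname,temperature",
    "from":   "rack_sensors",
    "where":  "temperature > 40",
    "start":  "2012-05-01T00:00:00Z",
    "end":    "2012-05-01T06:00:00Z",
    "format": "json",
}

response = requests.get(BASE_URL, params=params, timeout=10)
response.raise_for_status()

for row in response.json():
    print(row["hostname"], row["temperature"])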
Journal of Physics: Conference Series | 2012
A. Corso Radu; G. Lehmann Miotto; L. Magnoni
A large experiment like ATLAS at the LHC (CERN), with over three thousand members and a shift crew of 15 people running the experiment 24/7, needs an easy and reliable tool to gather all the information concerning the experiment's development, installation, deployment and exploitation over its lifetime. With the increasing number of users and the accumulation of stored information since the experiment start-up, the electronic logbook currently in use, ATLOG, started to show its limitations in terms of speed and usability. Its monolithic architecture makes the maintenance and the implementation of new functionality a difficult, almost impossible process. A new tool, ELisA, has been developed to replace the existing ATLOG. It is based on modern web technologies: the Spring framework with a Model-View-Controller architecture was chosen, helping to build flexible and easy-to-maintain applications. The new tool implements all the features of the old electronic logbook with increased performance and better graphics; it uses the same database back-end for portability reasons. In addition, several new requirements have been accommodated which could not be implemented in ATLOG. This paper describes the architecture, implementation and performance of ELisA, with particular emphasis on the choices that allowed a scalable and very fast system to be built and on the aspects that could be re-used in different contexts to build a similar application.
Journal of Physics: Conference Series | 2012
G. Avolio; A. Corso Radu; A. Kazarov; G. Lehmann Miotto; L. Magnoni
The Trigger and Data Acquisition (TDAQ) system of the ATLAS experiment is a very complex distributed computing system, composed of more than 20,000 applications running on more than 2,000 computers. The TDAQ Controls system has to guarantee the smooth and synchronous operation of all the TDAQ components and has to provide the means to minimize the downtime of the system caused by runtime failures. During data-taking runs, streams of information messages sent or published by running applications are the main sources of knowledge about the correctness of running operations. The huge flow of operational monitoring data produced is constantly monitored by experts in order to detect problems or misbehaviours. Given the scale of the system and the rates of data to be analyzed, the automation of the system functionality in the areas of operational monitoring, system verification, error detection and recovery is a strong requirement. To accomplish its objective, the Controls system includes high-level components based on advanced software technologies, namely a rule-based Expert System and Complex Event Processing engines. The chosen techniques allow the knowledge of experts to be formalized, stored and reused, and thus assist the shifters in the ATLAS control room during data-taking activities.
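As a toy illustration of how expert knowledge can be encoded as reusable rules (independently of the specific expert-system and CEP technologies adopted), the Python sketch below matches simple conditions against a snapshot of operational data and returns advice for the shifter. The rule contents, field names and thresholds are invented.

# Each rule pairs a condition over the current operational snapshot with advice
# for the shifter; names and thresholds are illustrative only.
RULES = [
    {
        "name": "ros-busy",
        "condition": lambda s: s.get("ros_busy_fraction", 0.0) > 0.9,
        "advice": "A ReadOut System is saturated: check for downstream backpressure.",
    },
    {
        "name": "app-crash-burst",
        "condition": lambda s: s.get("crashed_apps_last_minute", 0) > 5,
        "advice": "Multiple applications crashed: consider a controlled restart of the segment.",
    },
]

def evaluate(snapshot):
    """Return the advice of every rule whose condition holds for this snapshot."""
    return [(r["name"], r["advice"]) for r in RULES if r["condition"](snapshot)]

snapshot = {"ros_busy_fraction": 0.95, "crashed_apps_last_minute": 1}
for name, advice in evaluate(snapshot):
    print(name, "->", advice)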