Catello Di Martino
University of Illinois at Urbana–Champaign
Publications
Featured research published by Catello Di Martino.
dependable systems and networks | 2014
Catello Di Martino; Zbigniew Kalbarczyk; Ravishankar K. Iyer; Fabio Baccanico; Joseph Fullop; William Kramer
This paper provides an analysis of failures and their impact for Blue Waters, the Cray hybrid (CPU/GPU) supercomputer at the University of Illinois at Urbana-Champaign. The analysis is based on both manual failure reports and automatically generated event logs collected over 261 days. Results include i) a characterization of the root causes of single-node failures, ii) a direct assessment of the effectiveness of system-level failover as well as memory, processor, network, GPU accelerator, and file system error resiliency, and iii) an analysis of system-wide outages. The major findings of this study are as follows. Hardware is not the main cause of system downtime, even though hardware-related failures account for 42% of all failures; failures caused by hardware were responsible for only 23% of the total repair time. These results are partially due to the fact that processor and memory protection mechanisms (x8 and x4 Chipkill, ECC, and parity) are able to handle a sustained rate of errors as high as 250 errors/h while providing a coverage of 99.997% over a set of more than 1.5 million analyzed errors. Only 28 multiple-bit errors bypassed the employed protection mechanisms. Software, on the other hand, was the largest contributor to node repair hours (53%), despite causing only 20% of the total number of failures. A total of 29 out of 39 system-wide outages involved the Lustre file system, with 42% of them caused by the inadequacy of the automated failover procedures.
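As a back-of-the-envelope illustration of how such a coverage figure is obtained, the short Python sketch below computes coverage and the average error rate from aggregate counts; the variable values mirror the numbers quoted above but are used here only as illustrative assumptions, not as the actual Blue Waters data processing.

# Minimal sketch: estimating error-protection coverage and the average error rate
# from aggregate counts, under assumed (hypothetical) values.

total_errors = 1_500_000        # errors observed in the logs (assumed figure)
uncorrected_errors = 28         # multi-bit errors that bypassed protection (assumed)
observation_hours = 261 * 24    # length of the measurement period in hours

coverage = 1.0 - uncorrected_errors / total_errors
avg_rate_per_hour = total_errors / observation_hours   # average, not peak, rate

print(f"coverage: {coverage:.5%}")                      # ~99.998% with these numbers
print(f"average rate: {avg_rate_per_hour:.0f} errors/h")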
cyber security and information intelligence research workshop | 2013
Hui Lin; Adam J. Slagell; Catello Di Martino; Zbigniew Kalbarczyk; Ravishankar K. Iyer
When SCADA systems are exposed to public networks, attackers can more easily penetrate the control systems that operate electrical power grids, water plants, and other critical infrastructures. To detect such attacks, SCADA systems require an intrusion detection technique that can understand the information carried by their usually proprietary network protocols. To achieve that goal, we propose to attach to SCADA systems a specification-based intrusion detection framework based on Bro [7][8], a runtime network traffic analyzer. We have built a parser in Bro to support DNP3, a network protocol widely used in SCADA systems that operate electrical power grids. This built-in parser provides a clear view of all network events related to SCADA systems. Consequently, security policies to analyze SCADA-specific semantics related to the network events can be accurately defined. As a proof of concept, we specify a protocol validation policy to verify that the semantics of the data extracted from network packets conform to protocol definitions. We performed an experimental evaluation to study the processing capabilities of the proposed intrusion detection framework.
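To give a flavor of what a specification-based validation policy checks, here is a minimal Python sketch that validates fields extracted from a parsed request against assumed protocol constraints; the field names and allowed values are illustrative, and the real framework expresses such policies as Bro scripts over the DNP3 analyzer's events.

# Illustrative sketch of a specification-based validation check: verify that
# fields extracted from a (hypothetical) parsed request conform to assumed
# protocol constraints before the request is accepted as legitimate traffic.

ALLOWED_FUNCTION_CODES = {0x01, 0x02, 0x03, 0x04, 0x05}   # assumed subset
MAX_PAYLOAD_LEN = 255                                      # assumed limit

def validate_request(parsed):
    """Return a list of policy violations for one parsed request (empty if clean)."""
    violations = []
    if parsed["function_code"] not in ALLOWED_FUNCTION_CODES:
        violations.append(f"unexpected function code {parsed['function_code']:#x}")
    if not (0 <= parsed["payload_len"] <= MAX_PAYLOAD_LEN):
        violations.append(f"payload length {parsed['payload_len']} out of range")
    return violations

# Usage with a fabricated request record:
print(validate_request({"function_code": 0x81, "payload_len": 12}))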
dependable systems and networks | 2012
Catello Di Martino; Marcello Cinque; Domenico Cotroneo
This paper presents a novel approach to assess time coalescence techniques. These techniques are widely used to reconstruct the failure process of a system and to estimate dependability measurements from its event logs. The approach is based on the use of automatically generated logs, accompanied by exact knowledge of the ground truth on the failure process. The assessment is conducted by comparing the presumed failure process, reconstructed via coalescence, with the ground truth. We focus on supercomputer logs, due to the increasing importance of automatic event log analysis for these systems. Experimental results show how the approach makes it possible to compare different time coalescence techniques and to identify their weaknesses with respect to given system settings. In addition, the results revealed an interesting correlation between errors caused by the coalescence and errors in the estimation of dependability measurements.
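As a concrete illustration of time coalescence itself, the following Python sketch groups time-stamped events into presumed failures using a single fixed window; it is a generic tupling heuristic with assumed parameters, not one of the specific techniques assessed in the paper.

# Minimal sketch of time coalescence (tupling): consecutive events closer than
# `window_s` seconds are merged into the same presumed failure.

def coalesce(timestamps, window_s=300.0):
    """Group sorted event timestamps (seconds) into clusters."""
    clusters = []
    for t in sorted(timestamps):
        if clusters and t - clusters[-1][-1] <= window_s:
            clusters[-1].append(t)      # extend the current cluster
        else:
            clusters.append([t])        # start a new cluster
    return clusters

events = [0, 10, 20, 4000, 4100, 9000]  # fabricated event times
print(len(coalesce(events)))            # 3 presumed failures with a 300 s window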
dependable systems and networks | 2015
Catello Di Martino; William Kramer; Zbigniew Kalbarczyk; Ravishankar K. Iyer
This paper presents an in-depth characterization of the resiliency of more than 5 million HPC application runs completed during the first 518 production days of Blue Waters, a 13.1-petaflop Cray hybrid supercomputer. Unlike past work, we measure the impact of system errors and failures on user applications, i.e., the compiled programs launched by user jobs that can execute across one or more XE (CPU) or XK (CPU+GPU) nodes. The characterization is performed by means of a joint analysis of several data sources, which include workload and error/failure logs. In order to relate system errors and failures to the executed applications, we developed LogDiver, a tool to automate the data pre-processing and metric computation. Some of the lessons learned in this study include: i) while about 1.53% of applications fail due to system problems, the failed applications contribute about 9% of the production node hours executed in the measured period, i.e., the system spends computing resources on work that is lost, and system-related issues represent a potentially significant energy cost, ii) there is a dramatic increase in the application failure probability when executing full-scale applications: 20x (from 0.008 to 0.162) when scaling XE applications from 10,000 to 22,000 nodes, and 6x (from 0.02 to 0.129) when scaling GPU/hybrid applications from 2000 to 4224 nodes, and iii) the resiliency of hybrid applications is impaired by the lack of adequate error detection capabilities in hybrid nodes.
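The scale-dependent failure probabilities quoted above are, at bottom, ratios of system-caused failed runs to total runs within a node-count bin. The Python sketch below shows that computation on fabricated run records; the record layout is an assumption, not LogDiver's actual output format.

# Sketch: application failure probability per node-count bin, computed from
# fabricated (application_nodes, failed_due_to_system) run records.

from collections import defaultdict

runs = [
    (10_000, False), (10_000, False), (22_000, True),
    (22_000, False), (22_000, True), (10_000, False),
]

totals, failures = defaultdict(int), defaultdict(int)
for nodes, failed in runs:
    totals[nodes] += 1
    failures[nodes] += failed

for nodes in sorted(totals):
    print(nodes, "nodes:", failures[nodes] / totals[nodes])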
international parallel and distributed processing symposium | 2009
Marcello Cinque; Domenico Cotroneo; Catello Di Martino; Stefano Russo; Alessandro Testa
As the incidence of faults in real Wireless Sensor Networks (WSNs) increases, fault injection is starting to be adopted to verify and validate their design choices. Following this recent trend, this paper presents a tool, named AVR-INJECT, designed to automate fault injection, and the analysis of its results, on WSN nodes. The tool emulates the injection of hardware faults, such as bit flips, acting via software at the assembly level. This keeps the tool simple while preserving the low level of abstraction needed to inject such faults. The potential of the tool is shown by using it to perform a large number of fault injection experiments, which enable the study of how real WSN software reacts to faults.
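The basic fault model emulated by such tools is a transient bit flip in a register or memory word. The Python sketch below illustrates that model in isolation; it is not AVR-INJECT itself, which performs the injection at the AVR assembly level.

# Sketch of the basic fault model: flip a single randomly chosen bit of an
# 8-bit value, as a software emulation of a transient hardware fault.

import random

def inject_bit_flip(value, width=8):
    """Return `value` with one random bit flipped (default: 8-bit register)."""
    bit = random.randrange(width)
    return value ^ (1 << bit)

random.seed(0)
original = 0b1010_0001
faulty = inject_bit_flip(original)
print(f"{original:08b} -> {faulty:08b}")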
Computer Networks | 2012
Marcello Cinque; Catello Di Martino; Christian Esposito
Middleware plays a key role in achieving the mission of future large-scale complex critical infrastructures, envisioned as federations of several heterogeneous systems over the Internet. However, available approaches for data dissemination remain inadequate, since they are unable to scale and to jointly assure given QoS properties. In addition, the best-effort delivery strategy of the Internet and the occurrence of node failures further hinder the correct and timely delivery of data if the middleware is not equipped with means for tolerating such failures. This paper presents a peer-to-peer approach for resilient and scalable data dissemination over large-scale complex critical infrastructures. The approach is based on the adoption of epidemic dissemination algorithms between peer groups, combined with the semi-active replication of group leaders to tolerate failures and assure the resilient delivery of data, despite the increasing scale and heterogeneity of the federated system. The effectiveness of the approach is shown by means of extensive simulation experiments, based on Stochastic Activity Networks.
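The epidemic dissemination step inside a peer group can be pictured as rounds of push gossip, as in the Python sketch below; the fan-out and peer model are assumptions for illustration, not the configuration evaluated in the paper.

# Minimal sketch of push-based epidemic dissemination inside one peer group:
# in each round, every informed peer forwards the message to `fanout` random peers.

import random

def gossip_rounds(num_peers, fanout=2, seed=0):
    random.seed(seed)
    informed = {0}                       # peer 0 holds the original message
    rounds = 0
    while len(informed) < num_peers:
        newly = set()
        for peer in informed:
            newly.update(random.sample(range(num_peers), fanout))
        informed |= newly
        rounds += 1
    return rounds

print(gossip_rounds(100))                # rounds needed to reach all 100 peers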
international supercomputing conference | 2013
Catello Di Martino
This paper proposes a heuristic to improve the analysis of supercomputer error logs. The heuristic estimates the measurement error induced by the clustering of error events and uses that estimate to drive the analysis. The goal is to reduce errors induced by the clustering and to estimate how much they affect the measurements. The heuristic is validated against 40 synthetic datasets, for different systems ranging from 16k to 256k nodes under different failure assumptions. We show that i) to accurately analyze the complex failure behavior of large computing systems, multiple time windows need to be adopted at the granularity of node subsystems, e.g., memory and I/O, and ii) for large systems, the classical single time window analysis can overestimate the MTBF by more than 150%, while the proposed heuristic can decrease the measurement error by one order of magnitude.
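The effect described in point ii) follows from how MTBF is estimated: the observation time divided by the number of coalesced failures, so over-aggressive coalescing with one global window inflates the result. The Python sketch below contrasts a single global window with per-subsystem windows on fabricated events; the window sizes and event times are assumptions.

# Sketch: MTBF estimated from clustered error events, comparing one global time
# window against separate windows per node subsystem (fabricated data).

def count_clusters(times, window_s):
    clusters, last = 0, None
    for t in sorted(times):
        if last is None or t - last > window_s:
            clusters += 1
        last = t
    return clusters

observation_s = 30 * 24 * 3600
events = {"memory": [100, 200, 50_000], "io": [150, 90_000, 90_200]}

# Single window over all events vs. one window per subsystem.
all_events = [t for ts in events.values() for t in ts]
single = count_clusters(all_events, window_s=3600)
per_subsystem = sum(count_clusters(ts, window_s=3600) for ts in events.values())

print("MTBF (single window):", observation_s / single, "s")
print("MTBF (per subsystem):", observation_s / per_subsystem, "s")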
International Journal of Adaptive, Resilient and Autonomic Systems | 2010
Gabriele D'Avino; Alessandro Testa; Catello Di Martino
Wireless sensor networks (WSNs) are spreading as a common means of reducing installation and management costs in environment monitoring applications. The increasing complexity and heterogeneity of such WSNs and the differentiated needs of users call for innovative solutions for managing and accessing the produced data. This paper presents an architecture, named iCAAS, designed to collect, store, manage, and make available to users data received from heterogeneous WSNs. The objective of the architecture is to adaptively deliver data to users depending on their specific interests, transparently with respect to the terminals they use and to network details. The contribution of the paper is twofold. First, the authors detail the requirements that these types of architectures should meet to fill the gap between sensor details and user needs. Second, the authors describe the structural organization of the proposed architecture, designed by taking into account the defined requirements. Implementation details and case studies are also provided, showing the effectiveness of the architecture when used in real-world application scenarios. Finally, the scalability of the iCAAS architecture is tested against the number of concurrent clients querying it at the same time.
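The interest-based delivery described above can be pictured as matching incoming readings against per-user subscriptions, as in the Python sketch below; the data model and field names are illustrative assumptions, not the iCAAS interfaces.

# Sketch: interest-based delivery of WSN readings to users. Each user registers
# an interest (sensor type and value range); matching readings are routed to them.

subscriptions = {
    "alice": {"type": "temperature", "min": 30.0, "max": 100.0},
    "bob":   {"type": "humidity",    "min": 0.0,  "max": 20.0},
}

def route(reading):
    """Return the users whose registered interest matches this reading."""
    return [
        user for user, s in subscriptions.items()
        if reading["type"] == s["type"] and s["min"] <= reading["value"] <= s["max"]
    ]

print(route({"type": "temperature", "value": 35.2, "node": "wsn-07"}))  # ['alice']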
international symposium on computers and communications | 2010
Marcello Cinque; Domenico Cotroneo; Catello Di Martino; Alessandro Testa
This paper presents an effective approach for injecting faults/errors in WSN node operating systems. The approach is based on the injection of faults at the assembly level. Results show that, depending on the concurrency model and on the memory management, the operating systems react to injected errors differently, indicating that fault containment strategies and hang-checking assertions should be implemented to prevent the spreading and activation of errors.
mobility management and wireless access | 2013
Marcello Cinque; Antonio Coronato; Alessandro Testa; Catello Di Martino
Resiliency is becoming one of the most important non-functional properties of Wireless Sensor Networks (WSNs), as their adoption is increasingly hypothesized in critical application scenarios. Hence, assessing the resiliency of WSN applications, protocols, and platforms is starting to be an important concern for WSN designers and developers. This paper presents the techniques that are currently used in the field of WSN resiliency assessment, categorized as experimental, simulative, analytical, and formal techniques. After reporting the state of the art according to these four categories, the paper discusses research trends, strengths and weaknesses of existing approaches, and concludes with open issues and ongoing challenges.