A Machine Learning Approach to Online Fault Classification in HPC Systems
Alessio Netti, Zeynep Kiziltan, Ozalp Babaoglu, Alina Sirbu, Andrea Bartolini, Andrea Borghesi
Leibniz Supercomputing Centre, Garching bei München, Germany
Department of Computer Science and Engineering, University of Bologna, Italy
Department of Electrical, Electronic and Information Engineering, University of Bologna, Italy
Department of Computer Science, University of Pisa, Italy
[email protected], [email protected], {zeynep.kiziltan, ozalp.babaoglu, a.bartolini, andrea.borghesi3}@unibo.it

Abstract
As High-Performance Computing (HPC) systems strive towards the exascale goal, failure rates both at the hardware and software levels will increase significantly. Thus, detecting and classifying faults in HPC systems as they occur, and initiating corrective actions before they can transform into failures, becomes essential for continued operation. Central to this objective is fault injection, which is the deliberate triggering of faults in a system so as to observe their behavior in a controlled environment. In this paper, we propose a fault classification method for HPC systems based on machine learning. The novelty of our approach rests with the fact that it can be operated on streamed data in an online manner, thus opening the possibility to devise and enact control actions on the target system in real time. We introduce a high-level, easy-to-use fault injection tool called FINJ, with a focus on the management of complex experiments. In order to train and evaluate our machine learning classifiers, we inject faults into an in-house experimental HPC system using FINJ, and generate a fault dataset which we describe extensively. Both FINJ and the dataset are publicly available to facilitate resiliency research in the HPC systems field. Experimental results demonstrate that our approach allows almost perfect classification accuracy to be reached for different fault types with low computational overhead and minimal delay.

Introduction
Motivation
Modern scientific discovery is increasingly being driven by computation [36]. In a growing number of areas where experimentation is either impossible, dangerous or costly, computing is often the only viable alternative towards confirming existing theories or devising new ones. As such, High-Performance Computing (HPC) systems have become fundamental "instruments" for driving scientific discovery and industrial competitiveness. Exascale (10^18 operations per second) is the moonshot for HPC systems. Reaching this goal is bound to produce significant advances in science and technology through higher-fidelity simulations, better predictive models and analysis of greater quantities of data, leading to vastly-improved manufacturing processes and breakthroughs in fundamental sciences ranging from particle physics to cosmology. Future HPC systems will achieve exascale performance through a combination of faster processors and massive parallelism [2].

With Moore's Law having reached its limit, the only viable path towards higher performance has to consider switching from increased transistor density towards increased core counts, and thus increased socket counts. This, however, presents major obstacles. With everything else being equal, the fault rate of a system is directly proportional to the number of sockets used in its construction [9]. But everything else is not equal: exascale HPC systems will also use advanced low-voltage technologies that are much more prone to aging effects [5], together with system-level performance and power modulation techniques, such as dynamic voltage frequency scaling, all of which tend to increase fault rates [16]. Economic forces that push for building HPC systems out of commodity components aimed at mass markets only add to the likelihood of more frequent unmasked hardware faults. Finally, complex system software, often built using open-source components to deal with more complex and heterogeneous hardware, fault masking and energy management, coupled with legacy applications, will significantly increase the potential for faults [23]. It is estimated that large parallel jobs will encounter a wide range of failures as frequently as once every 30 minutes on exascale platforms [33]. At these rates, failures will prevent applications from making progress. Consequently, exascale performance, even if achieved nominally, cannot be sustained for the duration of most applications, which often run for long periods.

To be usable in production environments with acceptable quality of service levels, exascale systems need to improve their resiliency by several orders of magnitude. Therefore, future exascale HPC systems must include automated mechanisms for masking faults, or recovering from them, so that computations can continue with minimal disruptions. In our terminology, a fault is defined as an anomalous behavior at the software or hardware level that can lead to illegal system states (errors) and, in the worst case, to service interruptions (failures) [18]. In this paper, we limit our attention to improving the resiliency of HPC systems through the use of mechanisms for detecting and classifying faults as soon as possible, since they are the root causes of errors and failures. An important technique for this objective is fault injection: the deliberate triggering of faults in a system so as to observe their behavior in a controlled environment, enabling the development of new prediction and response techniques as well as the testing of existing ones [22].
For fault injection to be effective, dedicated tools are necessary, allowing users to trigger complex and realistic fault scenarios in a reproducible manner.

Contributions
The contributions of our work are severalfold. First, we propose and evaluate a fault classification method based on supervised Machine Learning (ML) suitable for online deployment in HPC systems, as part of an infrastructure for building mechanisms to increase their resiliency. Our approach relies on a collection of performance metrics that are readily available in most HPC systems. A novel aspect of our proposal is its ability to work online with live streamed data, as opposed to traditional offline techniques that work with archived data. Our experimental results show that the method we propose can classify almost perfectly several types of faults, ranging from hardware malfunctions to software issues and bugs. In our method, classification can be achieved with little computational overhead and with minimal delay, making it suitable for online use. We characterize the performance of our proposed solution in a realistic online use scenario where live streamed data is fed to fault classifiers both for training and for detection, dealing with issues such as class imbalance and ambiguous system states. Most existing studies, on the contrary, consider offline scenarios and rely on extensive manipulation of archived data, which is not feasible in online scenarios. Furthermore, switching from an offline to an online approach based on streamed data opens up the possibility for devising and enacting control actions in real time.

Second, we introduce an easy-to-use open-source Python fault injection tool called FINJ. A relevant feature of FINJ is the possibility of seamless integration with other injection tools targeted at specific fault types, thus enabling users to coordinate faults from different sources and different system levels. By using FINJ's workload feature, users can also specify lists of applications to be executed and faults to be triggered on multiple nodes at specific times, with specific durations. FINJ thus represents a high-level, flexible tool, enabling users to perform complex and reproducible experiments, aimed at revealing the complex relations that may exist between faults, application behavior and the system itself. FINJ is also extremely easy to use: it can be set up and executed in a matter of minutes, and does not require the writing of additional code in most of its usage scenarios. To the best of our knowledge, FINJ is the first portable, open-source tool that allows users to perform and control complex injection experiments, that can be integrated with heterogeneous fault types and that includes workload support, while retaining ease of use and a quick setup time.

As a third and final contribution, our evaluation is based on a dataset consisting of monitoring data that we acquired from an experimental HPC system (called Antarex) by injecting faults using FINJ. We make the Antarex dataset publicly available and describe it extensively for use in the community. This is an important contribution to the HPC field, since commercial operators are very reluctant to share trace data containing information about faults in their HPC systems [25].
Organization
The rest of the paper is organized as follows. In the next section, we put our work in context with respect to related work. In Section 3, we introduce the FINJ tool, and present a simple use case in Section 4 to show how it can be deployed. In Section 5, we describe the Antarex dataset that will be used for evaluating our supervised ML classification models. In Section 6, we discuss the features extracted from the Antarex dataset to train the classifiers, and in Section 7, we present our experimental results. We conclude in Section 8.
Related Work

Fault injection for prediction and detection purposes is a recent topic of intense activity, and several studies have proposed tools with varying levels of abstraction. Calhoun et al. [8] devised a compiler-level fault injection tool focused on memory bit-flip errors, targeting HPC applications. DeBardeleben et al. [13] proposed a logic error-oriented fault injection tool. This tool is designed to inject faults in virtual machines, by exploiting emulated machine instructions through the open-source virtual machine and processor emulator QEMU. Both works focus on low-level, fault-specific tools and do not provide functionality for the injection of complex workloads, nor for the collection of produced data, if any.

Stott et al. [34] proposed NFTAPE, a high-level and generic tool for fault injection. This tool is designed to be integrated with other fault injection tools and triggers at various levels, allowing for the automation of long and complex experiments. The tool, however, has aged considerably, and is not publicly available. A similar fault injection tool was proposed by Naughton et al. [31], which has never evolved beyond the prototype stage and is also not publicly available, to the best of our knowledge. Moreover, both tools require users to write a fair amount of wrapper and configuration code, resulting in a complex setup process. The Gremlins Python package (https://github.com/toddlipcon/gremlins) also supplies a high-level fault injector. However, it does not support workload or data collection functionalities, and experiments on multiple nodes cannot be performed.

Joshi et al. [24] introduced the PREFAIL tool, which allows for the injection of failures at any code entry point in the underlying operating system. This tool, like NFTAPE, employs a coordinator process for the execution of complex experiments. It is targeted at a specific fault type (code-level errors) and does not permit performing experiments focused on performance degradation and interference. Similarly, the tool proposed by Gunawi et al. [21], named FATE, allows the execution of long experiments; furthermore, it is focused on reproducing specific fault sequences, simulating real scenarios. Like PREFAIL, it is limited to a specific fault type, namely I/O errors, thus greatly limiting its scope.

The FINJ Tool

In this section, we first discuss how fault injection is achieved in FINJ. We then present the architecture of FINJ and discuss its implementation. Customizing FINJ for different purposes is easy, thanks to its portable and modular nature.
Fault injection is achieved through tasks that are executed on target nodes. Each task corresponds to either an HPC application or a fault-triggering program, and has a specification for its execution. As demonstrated by Stott et al. [34], this approach allows for the integration in FINJ of any third-party fault injection framework that can be triggered externally. In any case, many fault-triggering programs are supplied with FINJ (see Section 5.2), allowing users to experiment with anomalies out of the box.

A sequence of tasks defines a workload, which is a succession of scheduled application and fault executions at specific times, reproducing a realistic working environment for the fault injection process. A task is specified by the following attributes:

• args: the full shell command required to run the selected task. The command must refer to an executable file that can be accessed from the target hosts;
• timestamp: the time in seconds at which the task must be started, expressed as a relative offset;
• duration: the task's maximum allowed duration, expressed in seconds, after which it will be abruptly terminated. This duration can serve as an exact duration as well, with FINJ restarting the task if it finishes earlier, and terminating it if it lasts longer. This behavior depends on the FINJ configuration (see Section 3.3);
• isFault: defines whether the task corresponds to a fault-triggering program or to an application;
• seqNum: a sequence number used to uniquely identify the task inside a workload;
• cores: an optional attribute which is the list of CPU cores that the task is allowed to use on target nodes, enforced through a NUMA Control policy [27].

A workload is stored in a workload file, which contains the specifications of all the tasks of a workload in CSV format. The starting time of each task is expressed as a relative offset, in seconds, with respect to the first task in the workload. A particular execution of a given workload then constitutes an injection session.

In our approach, the responsibility of each entity involved in the fault injection process is isolated, as depicted in Figure 1. The fault injection engine of FINJ manages the execution of tasks on target nodes and the collection of their output. Tasks contain all the necessary logic to run applications or to trigger any other low-level fault injection framework (e.g., by means of a writable file or system call). At the last level, the activated low-level fault injector handles the actual triggering of faults, which can be at the software, kernel, or even hardware level.
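As an illustration of the workload format, the short script below writes a three-task workload file. The semicolon-separated layout and field order follow the example shown later in Section 4, while the file name and commands are purely hypothetical.

```python
import csv

# One dictionary per task, using the attributes described above.
# Commands and paths are placeholders; "args" must point to executables
# reachable on the target hosts.
tasks = [
    {"timestamp": 0,    "duration": 1800, "seqNum": 1, "isFault": False,
     "cores": "0-7", "args": "./benchmarks/hpl.sh"},
    {"timestamp": 600,  "duration": 300,  "seqNum": 2, "isFault": True,
     "cores": "6",   "args": "./faults/cpufreq 300"},
    {"timestamp": 1200, "duration": 300,  "seqNum": 3, "isFault": True,
     "cores": "4",   "args": "./faults/leak 300"},
]

fields = ["timestamp", "duration", "seqNum", "isFault", "cores", "args"]
with open("workload.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fields, delimiter=";")
    writer.writeheader()
    writer.writerows(tasks)
```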
Figure 1: A diagram representing the responsibility of each entity involved in the fault injection process.
FINJ consists of a fault injection controller and a fault injection engine, which are designed to run on separate machines. The high-level structure of the FINJ architecture is illustrated in Figure 2.

The controller orchestrates the injection process, and should be run on an external node that is not affected by the faults. The controller maintains connections to all nodes involved in the injection session, which run fault injection engine instances and whose addresses are specified by users. Therefore, injection sessions can be performed on multiple nodes at the same time. The controller reads the task entries from the selected workload file. For each task, the controller sends a command to all target hosts, instructing them to start the new task at the specified time. Finally, the controller collects and stores all status messages (e.g., task start and termination, status changes) produced by the target hosts.

The engine is a daemon running on nodes where faults are to be injected. The engine waits for task commands to be received from remote controller instances. Engines can be connected to multiple controllers at the same time; however, task commands are accepted from only one controller at a time, the master of the injection session. The engine manages received task commands by assigning them to a dedicated thread from a pool. The thread manages all aspects related to the execution of the task, such as spawning the necessary subprocesses and sending status messages to controllers. Whenever a fault causes a target node to crash and reboot, controllers are able to re-establish and recover the previous injection session, provided that the engine is set up to be executed at boot time on the target node.
Figure 2: Architecture of the FINJ tool showing the division between a controller node (top) and a target node (bottom).

FINJ is based on a highly modular architecture, and therefore it is very easy to customize its single components in order to add or tune features.
Network
Engine and controller instances communicate through a network layer, and communication is achieved through a simple message-based protocol. Specifically, we implemented client and server software components for the exchange of messages. Fault injection controllers use a client instance in order to connect to fault injection engines, which in turn run a server instance that listens for incoming connections. A message can be either a command sent by a controller, related to a single task, or a status message, sent by an engine, related to status changes in its machine. All messages are in the form of dictionaries. This component also handles resiliency features, such as automatic re-connection from clients to servers, since temporary connection losses are to be expected in a fault injection context.
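Since messages are plain dictionaries, a command and a status message can be pictured roughly as below; the key names are illustrative guesses, not FINJ's actual protocol fields.

```python
import json
import time

# Hypothetical command message sent by a controller for a single task.
command_msg = {
    "type": "command",
    "task": {"args": "./faults/leak 300", "timestamp": 600, "duration": 300,
             "seqNum": 2, "isFault": True, "cores": "4"},
}

# Hypothetical status message sent back by an engine when the task starts.
status_msg = {"type": "status", "event": "task_start", "seqNum": 2,
              "time": time.time()}

# Dictionaries serialize easily for transport over the TCP connection.
payload = json.dumps(command_msg).encode("utf-8")
```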
Thread Pool
Task commands in the engines are assigned to a thread in a pool as they are received. Each thread manages all aspects of a task assigned to it. Specifically, the thread sleeps until the scheduled starting time of the task (according to its timestamp); then, it spawns a subprocess running the specified task, and sends a message to all connected controllers to inform them of the event. At this point, the thread waits for the task's termination, depending on its duration and on the current configuration. Finally, the thread sends a new status message to all connected hosts informing them of the task's termination, and returns to sleep. The number of threads in the pool, which is a configurable parameter, determines the maximum number of tasks that can be executed concurrently. Since threads in the pool are started only once during the engine's initialization, and wake up for minimal amounts of time when a task needs to be started or terminated, we expect their impact on performance to be negligible. The life cycle of a task, as managed by a worker thread, is represented in Figure 3.

Figure 3: A representation of the life cycle of a task, as managed by the thread pool of the fault injection engine.
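The life cycle above can be sketched with Python's standard thread pool as follows; this is a simplified stand-in for FINJ's InjectionThreadPool, with the notification step reduced to a print statement.

```python
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor

def run_task(task, session_start):
    """Sleep until the task's start time, run it, enforce its duration, report."""
    delay = session_start + task["timestamp"] - time.time()
    if delay > 0:
        time.sleep(delay)                       # wait for the scheduled start time
    print("task_start", task["seqNum"])         # stand-in for a status message
    proc = subprocess.Popen(task["args"], shell=True)
    try:
        proc.wait(timeout=task["duration"])     # wait for natural termination
    except subprocess.TimeoutExpired:
        proc.kill()                             # enforce the maximum duration
    print("task_end", task["seqNum"])

tasks = [{"timestamp": 0, "duration": 5, "seqNum": 1, "args": "sleep 3"},
         {"timestamp": 2, "duration": 5, "seqNum": 2, "args": "sleep 10"}]

session_start = time.time()
with ThreadPoolExecutor(max_workers=4) as pool:  # pool size bounds concurrency
    for task in tasks:
        pool.submit(run_task, task, session_start)
```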
Input and Output
The input and output of all data related to injection sessions are performed by controller instances, and are handled by reader and writer entities. By default, the input/output format is CSV, which was chosen due to its extreme simplicity and generality.
Input is constituted by workload files, which include one entry for each injection task, as described in Section 3.1.
Output, instead, is made up of two parts: I) the execution log, which contains entries corresponding to status changes in the target nodes (e.g., start and termination, encountered errors, and connection loss or recovery events); II) the output produced by single tasks.
Configuration
FINJ's runtime behavior can be customized by means of a configuration file. This file includes several options that alter the behavior of either the controller or engine instances. Among the basic options, it is possible to specify the listening TCP port for engine instances, and the list of addresses of the target hosts to which controller instances should connect at launch time. The latter is useful when injection sessions must be performed on large sets of nodes, whose addresses can be conveniently stored in a file. More complex options are also available. For instance, it is possible to define a series of commands corresponding to tasks that must be launched together with FINJ, and must be terminated with it. This option proves especially useful when users wish to set up monitoring frameworks, such as the Lightweight Distributed Metric Service (LDMS) [1], to be launched together with FINJ in order to collect system performance metrics during injection sessions.
Workload Generation

While writing workload files manually is possible, this is time-consuming and not desirable for long injection sessions. Therefore, we implemented in FINJ a workload generation tool, which can be used to automatically generate workload files with certain statistical features, while trying to combine flexibility and ease of use. The workload generation process is controlled by three parameters: a maximum time span for the total duration of the workload, expressed in seconds, a statistical distribution for the duration of tasks, and another one for their inter-arrival times. We define the inter-arrival time as the interval between the start of two consecutive tasks. These distributions are separated into two sets, for fault and application tasks, thus amounting to a total of four. They can be either specified analytically by the user or fitted from real data, thus reproducing realistic behavior.

A workload is composed as a series of fault and application tasks that are selected from a list of possible shell commands. To control the composition of workloads, users can optionally associate to each command a probability for its selection during the generation process, and a list of CPU cores for its execution, as explained in Section 3.1. By default, commands are picked uniformly. Having defined its parameters, the workload generation process is then fairly simple. Tasks are randomly generated in order to achieve statistical features close to those specified as input, and are written to an output CSV file, until the maximum imposed time span is reached. Alongside the full workload, a probe file is also produced, which contains one entry for each task type, all with a short fixed duration, representing a lightweight workload version. This file can be used during the setup phase to test the correct configuration of the system, making sure that all tasks are correctly found and executed on the target hosts, without having to run the entire heavy workload.
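A condensed sketch of the generation loop described above: draw inter-arrival times and durations from configurable distributions until the requested time span is filled. The distributions, their parameters and the command list are placeholders, not the defaults used by FINJ.

```python
import csv
import random
from scipy import stats

max_span = 8 * 3600                            # total workload span in seconds
iat_dist = stats.expon(scale=2400)             # inter-arrival times (placeholder)
dur_dist = stats.norm(loc=1800, scale=300)     # task durations (placeholder)
commands = ["./benchmarks/hpl.sh", "./benchmarks/stream.sh"]

tasks, t, seq = [], 0.0, 1
while t < max_span:
    tasks.append({"timestamp": int(t),
                  "duration": max(1, int(dur_dist.rvs())),
                  "seqNum": seq,
                  "isFault": False,
                  "cores": "0-7",
                  "args": random.choice(commands)})  # uniform pick by default
    t += iat_dist.rvs()
    seq += 1

with open("generated_workload.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(tasks[0]), delimiter=";")
    writer.writeheader()
    writer.writerows(tasks)
```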
Implementation

FINJ is implemented in Python, an object-oriented, high-level interpreted programming language, and can be used on all major operating systems. All dependencies are included in the Python distribution, and the only optional external dependency is the scipy package, which is needed for the workload generation functionality. The source code is publicly available on GitHub under the MIT license (https://github.com/AlessioNetti/fault_injector), together with its documentation, usage examples and several fault-triggering programs. FINJ works on Python versions 3.4 and above.

In Figure 4, we illustrate the class diagram of FINJ. The fault injection engine and the fault injection controller are represented by the InjectorEngine and InjectorController classes. Users can instantiate these classes and start injection sessions directly, by using the listen_injection method to put the engine in listening mode, and the inject method of the controller, which starts the injection session itself. Scripts are supplied with FINJ to create controller and engine instances from a command-line interface, simplifying the process.

Figure 4: The class diagram of FINJ.

The InjectionThreadPool class, instead, supplies the thread pool implementation used to execute and manage the tasks. The network layer of the tool is represented by the MsgClient and MsgServer classes, which implement the message- and queue-based client and server used for communication. Both classes are implementations of the MsgEntity abstract class, which provides the interface for sending and receiving messages, and implements the basic mechanisms that regulate access to the underlying queue. Input and output are instead handled by the Reader and Writer abstract classes and their implementations: CSVReader and CSVWriter handle the reading and writing of workload files, while ExecutionLogReader and ExecutionLogWriter handle the execution logs generated by injection sessions. Since these classes are all implementations of abstract interfaces, it is easy for users to customize them for different formats. Tasks are modeled by the Task class, which contains all attributes specified in Section 3.1. Lastly, access to the workload generator is provided through the WorkloadGenerator class, which is the interface used to set up and start the generation process. This class is backed by the ElementGenerator class, which offers basic functionality for fitting data and generating random values, and acts as a wrapper around scipy's rv_continuous class, which generates random variables.
Use Case

In this section, we demonstrate the flow of execution of FINJ through a concrete example carried out on a real HPC node, and provide insight on its overhead.

Here we consider a sample fault injection session. The employed CSV workload file is illustrated in Figure 5. The experimental setup for this test, both for fault injection and monitoring, is the same as that for the acquisition of the Antarex dataset, which we present in Section 5. Python scripts are supplied with FINJ to start and configure engine and controller instances: their usage is explained on the GitHub repository of the tool, together with all configuration options.

Figure 5: The CSV workload file employed in the sample injection session (fields: timestamp; duration; seqNum; isFault; cores; args).

In this workload, the first task corresponds to an HPC application and is the Intel Distribution for the well-known High-Performance Linpack (HPL) benchmark, optimized for Intel Xeon CPUs. This task starts at time 0 in the workload, and has a maximum allowed duration of roughly 30 minutes. The following two tasks are fault-triggering programs: cpufreq, which dynamically reduces the maximum allowed CPU frequency, emulating performance degradation, and leak, which creates a memory leak in the system, eventually using all available RAM. The cpufreq program requires appropriate permissions, so that users can access the files controlling Linux CPU governors. Both fault programs are discussed in detail in Section 5. The HPL benchmark is run with 8 threads, pinned on the first 8 cores of the machine, while the cpufreq and leak tasks are forced to run on cores 6 and 4, respectively. Note also that the tasks must be available at the specified path on the systems running the engine, which in this case is relative to the location of the FINJ launching script.

Having defined the workload, the injection engine and controller must be started. For this experiment, we run both on the same machine. The controller instance will then connect to the engine and start executing the workload, storing all output in a CSV file which is unique for each target host. Each entry in this file represents a status change event, which in this case is the start or termination of a task, and is flagged with its absolute time-stamp on the target host. In addition, any errors that were encountered are also reported. When the workload is finished, the controller terminates. It can be clearly seen from this example how easily a FINJ experiment can be configured and run on multiple CPU cores.

At this point, the data generated by FINJ can be easily compared with other data, for example performance metrics collected through a monitoring framework, in order to better understand the system's behavior under faults. In Figure 6, we show the total RAM usage and the CPU frequency of core 0, as monitored by the LDMS framework. The benchmark's profile is simple, showing a constant CPU frequency while RAM usage slowly increases as the application performs tests on increasing matrix sizes. The effect of our fault programs, marked in gray, can be clearly observed in the system. The cpufreq fault causes a sudden drop in CPU frequency, resulting in reduced performance and longer computation times, while the leak fault causes a steady, linear increase in RAM usage. Even though saturation of the available RAM is not reached, this peculiar behavior can be used for prediction purposes.

Figure 6: CPU frequency and RAM usage, as monitored on the target system during a sample FINJ injection session.
We performed tests in order to evaluate the overhead that FINJ may introduce. To do so, we employed the same machine used in Section 5, together with the HPL benchmark, this time configured to use all 16 cores of the machine. We executed the HPL benchmark 20 times directly on the machine by using a shell script, and then repeated the same process by embedding the benchmark in 20 tasks of a FINJ workload file. FINJ was once again instantiated locally. In both conditions, the HPL benchmark scored an average running time of roughly 320 seconds, leading us to conclude that the impact of FINJ on running applications is negligible. This was expected, since FINJ is designed to perform only the bare minimum amount of operations needed to start and manage tasks, without affecting their execution.
The Antarex Dataset

The dataset contains trace data that we collected from an HPC system (called Antarex) located at ETH Zurich while we injected faults using FINJ. The dataset is publicly available for use by the community (https://zenodo.org/record/2553224), and all the details regarding the test environment, as well as the employed applications and faults, are extensively documented. In this section, we give a comprehensive overview of the dataset.

Table 1: Structure of the Antarex dataset.

                Block I      Block II     Block III    Block IV
  Type          CPU-Mem      CPU-Mem      HDD          HDD
  Parallel      No           Yes          No           Yes
  Duration      12 days      12 days      4 days       4 days
  Applications  DGEMM [14], HPCC [29],    IOZone [10], Bonnie++ [12]
                STREAM [30], HPL [15]
  Faults        leak, memeater, ddot,     ioerr, copy
                dial, cpufreq, pagefail
To acquire data, we executed some HPC applications and at the same time injected faults using FINJ in a single compute node of Antarex. We acquired the data in four steps, using four different FINJ workload files. Each data acquisition step consists of application and fault program runs related to either the CPU and memory components, or the hard drive, in either single-core or multi-core mode. This resulted in 4 blocks of nearly 20GB and 32 days of data in total, whose structure is summarized in Table 1. We acquired the data by continuous streaming, thus any study based on it will easily be reproducible on a real HPC system, in an online way.

We acquired data from a single HPC compute node for several reasons. First, we focus on fault detection models that operate on a per-node basis and that do not require knowledge about the behavior of other nodes. Second, we assume that the behavior of different compute nodes under the same faults will be comparable, thus a model generated for one node will be usable for other nodes as well. Third, our fault injection experiments required alterations to the Linux kernel of the compute node or special permissions. These are possible in a test environment, but not in a production one, rendering the acquisition of a large-scale dataset on multiple nodes impractical.

The Antarex compute node is equipped with two Intel Xeon E5-2630 v3 CPUs, 128GB of RAM, a Seagate ST1000NM0055-1V4 1TB hard drive, and runs the CentOS 7.3 operating system. The node has a default Tier-1 computing system configuration. We used FINJ in a Python 3.4 environment. We also used the LDMS monitoring framework to collect performance metrics, so as to create features for fault classification purposes, as described in Section 6. We configured LDMS to sample several metrics each second, which come from the following seven different plug-ins:

1. meminfo collects general information on RAM usage;
2. perfevent collects CPU performance counters;
3. procinterrupts collects information on hardware and software interrupts;
4. procdiskstats collects statistics on hard drive usage;
5. procsensors collects sensor data related to CPU temperature and frequency;
6. procstat collects general metrics about CPU usage;
7. vmstat collects information on virtual memory usage.

Figure 7: Histograms for fault durations (a) and fault inter-arrival times (b) in the Antarex dataset, compared to the PDFs of the Grid5000 data, as fitted on a Johnson SU and an Exponentiated Weibull distribution, respectively.

This configuration resulted in a total of 2094 metrics to be collected each second. Some of the metrics are node-level, and describe the status of the compute node as a whole; others are core-specific and describe the status of one of the 16 available CPU cores. In order to minimize noise and bias in the sampled data, we chose to analyze, execute applications and inject faults into only 8 of the 16 cores available in the system, and therefore used only one CPU. On the other CPU of the system, instead, we executed the FINJ and LDMS tools, which rendered their CPU overhead negligible.
FINJ orchestrates the execution of applications and the injection of faults by means of a workload file, as explained in Section 3.1. For this purpose, we used several FINJ-generated workload files, one for each block of the dataset.
Workload Files
We used two different statistical distributions in the FINJ workload generator to create the durations and inter-arrival times of the tasks corresponding to the applications and fault-triggering programs. The application tasks are characterized by durations and inter-arrival times following normal distributions, and they occupy 75% of the dataset's entire duration. This resulted in tasks having an average duration of 30 minutes and average inter-arrival times of nearly 40 minutes, both exhibiting relatively low variance and spread.

Fault-triggering tasks, on the other hand, are modeled using distributions fitted on the data from the failure trace associated with the Grid5000 cluster [7], available on the Failure Trace Archive (http://fta.scem.uws.edu.au/). We extracted from this trace the inter-arrival times of the host failures. Such data was then scaled and shifted to obtain an average of 10 minutes, while retaining the shape of the distribution. We then fitted the data using an exponentiated Weibull distribution, which is commonly used to characterize failure inter-arrival times [18]. To model durations, we extracted for all hosts the time intervals between successive absent and alive records, which respectively describe connectivity loss and recovery events from the HPC system's resource manager to the host. We then fitted a Johnson SU distribution over a cluster of the data present at the 5-minute point, which required no alteration of the original data. This particular distribution was chosen because of the quality of the fitting. In Figure 7, we show the histograms for the durations (a) and inter-arrival times (b) of the fault tasks in the workload files, together with the original distributions fitted from the Grid5000 data.

Unless configured to use specific probabilities, FINJ generates each task in the workload by randomly picking the respective application or fault program to be executed, from those that are available, with uniform probability. This implies that, statistically, all of the applications we selected will be subject to all of the available fault-triggering programs, given a sufficiently long workload, with different execution overlaps depending on the starting times and durations of the specific tasks. Such a task distribution greatly mitigates overfitting issues. Finally, we do not allow fault-triggering program executions to overlap.
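The fitting step described above can be reproduced with scipy's distribution utilities, as sketched below; synthetic samples stand in for the Grid5000-derived data, and the parameters are illustrative.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the Grid5000-derived samples (in seconds).
rng = np.random.default_rng(0)
inter_arrivals = rng.exponential(scale=600, size=1000)
durations = rng.lognormal(mean=5.5, sigma=0.4, size=1000)

# Fit an exponentiated Weibull to inter-arrival times and a Johnson SU to durations.
iat_params = stats.exponweib.fit(inter_arrivals)
dur_params = stats.johnsonsu.fit(durations)

# The fitted distributions can then be sampled when generating fault tasks.
new_iats = stats.exponweib.rvs(*iat_params, size=10)
new_durs = stats.johnsonsu.rvs(*dur_params, size=10)
```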
HPC Applications

We used a series of well-known benchmark applications, each of which stresses a different part of the node and measures the corresponding performance.

1. DGEMM [14]: performs CPU-heavy matrix-to-matrix multiplication;
2. HPC Challenge (HPCC) [29]: a collection of applications that stress both the CPU and the memory bandwidth of an HPC system;
3. Intel distribution for High-Performance Linpack (HPL) [15]: solves a system of linear equations;
4. STREAM [30]: stresses the system's memory and measures its bandwidth;
5. Bonnie++ [12]: performs HDD read-write operations;
6. IOZone [10]: performs HDD read-write operations.

The different applications provide a diverse environment for fault injection. Since we limit our analysis to a single machine, we use versions of the applications that rely on shared-memory parallelism, for example through the OpenMP library.
Fault Programs
All the fault programs used to reproduce anomalous conditions on Antarex are available in the FINJ GitHub repository. As described by Tuncer et al. [35], each program can also operate in a low-intensity mode, thus doubling the number of possible faults. While we do not physically damage hardware, we closely reproduce several reversible hardware issues, such as I/O and memory allocation errors. Some of the fault programs (ddot and dial) only affect the performance of the CPU core they run on, whereas the others affect the entire compute node. The programs and the generated faults are as follows.

1. leak periodically allocates 16MB arrays that are never released [35], creating a memory leak, causing memory fragmentation and severe system slowdown;
2. memeater allocates, writes into and expands a 36MB array [35], decreasing performance through a memory interference fault and saturating bandwidth;
3. ddot repeatedly calculates the dot product between two equal-size matrices. The sizes of the matrices change periodically between 0.9, 5 and 10 times the CPU cache's size [35]. This produces a CPU and cache interference fault, resulting in degraded performance of the affected CPU;
4. dial repeatedly performs floating-point operations over random numbers [35], producing an ALU interference fault, resulting in degraded performance for applications running on the same core as the program;
5. cpufreq decreases the maximum allowed CPU frequency by 50% through the Linux Intel P-State driver (https://kernel.org/doc/Documentation/cpu-freq). This simulates a system misconfiguration or failing CPU fault, resulting in degraded performance;
6. pagefail makes any page allocation request fail with 50% probability (https://kernel.org/doc/Documentation/fault-injection). This simulates a system misconfiguration or failing memory fault, causing performance degradation and stalling of running applications;
7. ioerr fails one out of 500 hard-drive I/O operations with 20% probability, simulating a failing hard drive fault, and causing degraded performance for I/O-bound applications, as well as potential errors;
8. copy repeatedly writes and then reads back a 400MB file from a hard drive. After such a cycle, the program sleeps for 2 seconds [20]. This simulates an I/O interference or failing hard drive fault by saturating I/O bandwidth, and results in degraded performance for I/O-bound applications.

The faults triggered by these programs can be grouped into three categories according to their nature. The interference faults (i.e., leak, memeater, ddot, dial and copy) usually occur when orphan processes are left running in the system, saturating resources and slowing down the other processes. Misconfiguration faults occur when a component's behavior is outside of its specification, due to a misconfiguration by users or administrators (i.e., cpufreq). Finally, hardware faults are related to physical components in the system that are about to fail, and trigger various kinds of errors (i.e., pagefail or ioerr). We note that some faults may belong to multiple categories, as they can be triggered by different factors in the system.
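As a concrete illustration of how simple these programs can be, the sketch below mimics the leak fault: it periodically allocates 16MB buffers and never releases them. It follows the description above but is not the program shipped with FINJ.

```python
import sys
import time

def leak(duration, chunk_mb=16, period=0.5):
    """Allocate chunk_mb megabytes every period seconds and never release them."""
    hoard = []                                           # references kept forever
    end = time.time() + duration
    while time.time() < end:
        hoard.append(bytearray(chunk_mb * 1024 * 1024))  # grows RAM usage steadily
        time.sleep(period)

if __name__ == "__main__":
    # Duration in seconds, e.g. passed by FINJ as part of the task's args.
    leak(int(sys.argv[1]) if len(sys.argv) > 1 else 300)
```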
Creation of Features

In this section, we describe the process of parsing the metrics collected by LDMS to obtain a set of features capable of describing the state of the system, in particular for classification tasks.
Post-Processing of Data
Firstly, we removed all constant metrics (e.g., the amount of total memory in the node), which were redundant, and replaced the raw monotonic counters captured by the perfevent and procinterrupts plug-ins with their first-order derivatives. Moreover, we created an allocated metric, both at the CPU core and at the node level, and integrated it in the original set. This metric can assume only binary values, and it encapsulates information about hardware component occupation, namely whether there is an application allocated on the node or not. This information would be available also in a real setting, since resource managers in HPC systems always keep track of the running jobs and their allocated computational resources. Lastly, for each above-mentioned metric and for each time point, we computed its first-order derivative and added it to the dataset, as proposed by Guan et al. [19].

After having post-processed the LDMS metrics, we created the feature sets via aggregation. Each feature set corresponds to a 60-second aggregation window and is related to a specific CPU core. The time step between consecutive feature sets is 10 seconds; this fine granularity allows for quick response times in fault detection. For each metric, we computed several statistical measures over the distribution of the values measured within the aggregation window [35]. These measures are the average, standard deviation, median, minimum, maximum, skewness, kurtosis, and the 5th, 25th, 75th and 95th percentiles. Overall, there are 22 statistical features for each metric in the dataset (including also the first-order derivatives). Hence, starting from the initial set of 2094 LDMS metrics, including node-level data as well as data from all CPU cores, the final feature sets contain 3168 elements for each separate CPU core. We note that we did not include the metrics collected by the procinterrupts plug-in, as a preliminary analysis revealed them to be irrelevant for fault classification. All the scripts used to process the data are available in the FINJ GitHub repository.

Figure 8: Architecture of the proposed machine learning-based fault detection system.
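Assuming the post-processed metrics are available as a time-indexed pandas DataFrame, the aggregation step can be sketched as follows; window length, step and statistics mirror the description above, but this is a simplified reimplementation rather than the published scripts.

```python
import pandas as pd

def build_features(df, window="60s", step="10s"):
    """df: per-second metrics (and derivatives) for one core plus node-level data."""
    stats = ["mean", "std", "median", "min", "max", "skew", "kurt"]
    quantiles = [0.05, 0.25, 0.75, 0.95]
    rows = []
    for end in pd.date_range(df.index[0] + pd.Timedelta(window),
                             df.index[-1], freq=step):
        win = df.loc[end - pd.Timedelta(window):end]
        agg = win.agg(stats).unstack()          # one value per (metric, statistic)
        qs = win.quantile(quantiles).unstack()  # percentile features
        rows.append(pd.concat([agg, qs]).rename(end))
    return pd.DataFrame(rows)                   # one feature set per window end time

# Example with placeholder data sampled once per second.
idx = pd.date_range("2018-01-01", periods=600, freq="1s")
metrics = pd.DataFrame({"cpu_freq": 2.4, "nr_free_pages": 1e6}, index=idx)
features = build_features(metrics)
```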
Labeling
To train classifiers to distinguish between faulty and normal states, we labeled the feature sets either according to the fault program (i.e., one of the 8 programs presented in Section 5.2) running within the corresponding aggregation window, or as "healthy" if no fault was running. The logs produced by FINJ (included in the Antarex dataset) provide the details about the fault programs running at each time-stamp. In a realistic deployment scenario, the fault detection model can also be trained using data from spontaneous, real faults. In that case, administrators should provide explicit labels instead of relying on fault injection experiments.

A single aggregation window may capture multiple system states, making labeling non-trivial. For example, a feature set may contain "healthy" time points before and after the start of a fault, or even include two different fault types. We define these feature sets as ambiguous. One of the reasons behind the use of a short aggregation window (60 seconds) is to minimize the impact of such ambiguous system states on fault detection. However, since these situations cannot be avoided, we propose two labeling methods. The first method is mode, where all the labels appearing in the time window are considered. Their distribution is examined and the label with the majority of occurrences is assigned to the feature set. This leads to robust feature sets, whose labels are always representative of the aggregated data. The second method is recent, where the label is obtained by observing the state of the system at the most recent time point in the time window (the last time point). This could correspond to a certain fault type or could be "healthy". This approach could be less robust, for instance when a feature set that is mostly "healthy" is labeled as faulty, but has the advantage of leading to more responsive fault detection, as faults are detected as soon as they are encountered, without having to look at the state over the last 60 seconds.
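The two strategies can be expressed in a few lines; here labels is the per-second sequence of fault labels falling inside one aggregation window (a sketch of the idea, not the exact dataset-processing code).

```python
from collections import Counter

def label_mode(labels):
    """Assign the label that occurs most often within the window."""
    return Counter(labels).most_common(1)[0][0]

def label_recent(labels):
    """Assign the label of the most recent time point in the window."""
    return labels[-1]

window = ["healthy"] * 2 + ["leak"] * 5 + ["healthy"] * 2
print(label_mode(window), label_recent(window))   # -> leak healthy
```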
Detection System Architecture

The fault detection system we propose in this paper is based on an architecture containing an array of classifiers, as shown in Figure 8. A classifier handles a specific computing resource type in the node, such as a CPU core, GPU or MIC. Each classifier is trained with the feature sets of all the resource units of the corresponding type, and is able to perform fault diagnoses for all of them, thus detecting faults both at the node level and at the resource level (e.g., dial and ddot). This is possible as the feature sets of each classifier contain all node-level metrics for the system as well as the resource-specific metrics for the resource unit. Since a feature set contains data from only one resource type, the proposed approach allows us to limit its size, which in turn improves performance and detection accuracy. The key assumption of this approach is that resource units of the same type behave similarly and that the respective feature sets can be combined in a coherent set. It is nevertheless possible to use a separate classifier for each resource unit of the same type, without altering the feature sets themselves, should this assumption prove to be too strong. In our case, the computing nodes have only CPU cores. Therefore, we train one classifier with feature sets that contain both node-level and core-level data, for one core at a time.

The classifiers' training can be performed offline, using labeled data resulting from normal system operation or from fault injection, as we do in this paper. The trained classifiers can then be deployed to detect faults on new, streaming data. By design, our architecture can detect only one fault at any time. If two faults happen simultaneously, the classifier would detect the fault whose effects on the system are deemed more disruptive for the normal, "healthy" state. In this preliminary study, this design choice is reasonable, as our purpose is to automatically distinguish between different fault conditions. Although unlikely, multiple faults could affect the same compute node of an HPC system at the same time. Our approach could be extended to deal with this situation by devising a classifier that does not produce a single output, but rather a composite output, for instance a distribution or a set of probabilities, one for each possible fault type.
Experimental Results

In this section, we present the results of our experiments, whose purpose was to correctly detect which of the 8 faults (as described in Section 5.2) were injected in the Antarex HPC node at any point in time in the Antarex dataset. Along the way, we compared a variety of classification models and the two labeling methods discussed in Section 6, assessed the impact of ambiguous feature sets, estimated the most important metrics for fault detection, and finally evaluated the overhead of our detection method. For each classifier, both the training and test sets are built from the feature set of one randomly-selected core for each time point. We evaluated the classifiers using 5-fold cross-validation, with the average F-score as the performance metric (the F-score is the harmonic mean between precision and recall). The software environment we used is Python 3.4 with the Scikit-learn package [32].

Although data shuffling is a standard technique with proven advantages in machine learning, it is not well suited to the fault detection method we propose in this paper. This is because our design is tailored for online systems, where classifiers are trained using only continuous, streamed, and potentially unbalanced data as it is acquired, while ensuring robustness in training so as to detect faults in the near future. Hence, it is very important to assess the detection accuracy without data shuffling. We reproduced this realistic, online scenario by performing cross-validation on the Antarex dataset using feature sets in time-stamp order. Time-stamp ordering produces cross-validation folds that contain data from specific time intervals. We depict this effect in Figure 9. In any case, we used shuffling in a small subset of the experiments for comparative purposes.

Figure 9: The effect of using time-stamp (a) or shuffled (b) ordering to create the data folds for cross-validation. Blocks with similar color represent feature sets from the same time frame.
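The two evaluation settings can be reproduced with scikit-learn as sketched below, with random placeholders for the feature sets and labels; macro-averaged F1 is used as a stand-in for the average F-score.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Placeholders for the feature sets (in time-stamp order) and their labels.
rng = np.random.default_rng(0)
X = rng.random((500, 30))
y = rng.integers(0, 9, 500)          # 8 fault classes plus "healthy"

clf = RandomForestClassifier(n_estimators=100)

# Online-like evaluation: each fold is a contiguous time interval (no shuffling).
ordered = cross_val_score(clf, X, y, cv=KFold(n_splits=5, shuffle=False),
                          scoring="f1_macro")

# Conventional evaluation: feature sets are shuffled before folding.
shuffled = cross_val_score(clf, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0),
                           scoring="f1_macro")

print(ordered.mean(), shuffled.mean())
```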
We first present results associated with different classifiers, using feature sets in time-stamp order and the mode labeling method. As classification models, we opted for a Random Forest (RF), Decision Tree (DT), Linear Support Vector Classifier (SVC) and Neural Network (NN) with two hidden layers, each having 1000 neurons. A preliminary empirical evaluation revealed that this neural network architecture, as well as the other models, are well suited for our purposes and thus provide good results. Even though we also considered using more complex models, such as Long Short-Term Memory (LSTM) neural networks, we ultimately decided to employ comparable, general-purpose models. This allows us to broaden the scope of our study, evaluating the impact of factors such as data normalization, shuffling and ambiguous system states on fault classification. On the other hand, LSTM-like models have more restrictive training requirements and are more tailored for regression tasks. Finally, since the Antarex dataset was acquired by injecting faults lasting a few minutes each, such a model would not benefit from long-term correlations in system states, where models like RF are capable of near-optimal performance.

The results of each classifier and each fault type are shown in Figure 10, with the overall F-score highlighted. We notice that all classifiers have very good performance, with the overall F-scores well above 0.9. RF is the best classifier, with an overall F-score of 0.98, followed by NN and SVC scoring 0.93. The most difficult fault types to detect for all classifiers are the pagefail and ioerr faults, which have substantially worse scores than the others.

We infer from these results that an RF would be the ideal classifier for an online fault detection system, due to its detection accuracy, which is at least 5% higher than that of the other classifiers in terms of the overall F-score. Additionally, random forests are computationally efficient [26], and would therefore be suitable for use in online environments with strict overhead requirements. As an additional advantage, unlike the NN and SVC classifiers, RF and DT do not require data normalization. Normalization in an online environment is hard to achieve, as many metrics do not have well-defined upper bounds. Although a rolling window-based dynamic normalization approach can be used [20] to solve this problem, such a method is unfeasible for ML-based classification, as it can lead to quickly-degrading detection accuracy and to the necessity of frequent re-training. For all these reasons, we will show only the results of an RF classifier in the following.

Figure 10: Classification results on the Antarex dataset, using all feature sets in time-stamp order, the mode labeling method, and different classification models: (a) Random Forest; (b) Decision Tree; (c) Neural Network; (d) Support Vector Classifier.

Figure 11: RF classification results, using all feature sets in time-stamp or shuffled order, with the mode and recent labeling methods: (a) mode labeling; (b) recent labeling; (c) mode labeling with shuffling; (d) recent labeling with shuffling.
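For reference, the four models compared above map onto scikit-learn as follows; apart from the two 1000-neuron hidden layers stated in the text, all hyperparameters are library defaults and therefore only indicative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

classifiers = {
    "RF":  RandomForestClassifier(),
    "DT":  DecisionTreeClassifier(),
    "SVC": LinearSVC(),                                     # linear support vector classifier
    "NN":  MLPClassifier(hidden_layer_sizes=(1000, 1000)),  # two hidden layers of 1000 neurons
}

# Each model can then be evaluated with the cross-validation setup shown earlier,
# e.g. cross_val_score(model, X, y, cv=..., scoring="f1_macro").
```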
Next, we present the results of the two labeling methods described in Section 6. Figures 11a and 11b report the classification results without data shuffling for, respectively, the mode and the recent labeling. The overall F-scores are 0.98 and 0.96, close to the optimal values. Once again, in both cases the ioerr and pagefail faults are substantially more difficult to detect than the others. This is probably due to the intermittent nature of both of these faults, whose effects depend on the hard drive I/O (ioerr) and memory allocation (pagefail) patterns of the underlying applications.

We observe an interesting behavior with the copy fault program, which gives a worse F-score when using the recent method in Figure 11b. As shown in Section 7.4, a metric related to the read rate of the hard drive used in our experiments (time_read_rate_der_perc95) is deemed important by the DT model for distinguishing faults, and we assume it is useful for detecting hard drive faults in particular, since it has no use for CPU-related faults. However, this is a comparatively slowly-changing metric. For this reason, a feature set may be labeled as copy as soon as the program is started, before the metric has been updated to reflect the new system status. This in turn makes classification more difficult and leads to degraded accuracy. This leads us to conclude that recent may not be well suited for faults whose effects cannot be detected immediately.

Figures 11c and 11d report the detection accuracy for the mode and recent methods, this time obtained after having shuffled the data for the classifier training phase. As expected, data shuffling markedly increases the detection accuracy for both labeling methods, reaching an almost optimal F-score for all fault types and an overall F-score of 0.99. A similar performance boost with data shuffling was obtained also with the remaining classification models introduced in Section 7.1 (the results are not reported here, since they would not add any insight to the experimental analysis). We notice that the recent labeling has a slightly higher detection rate, especially for some fault types. The reason for this improvement most likely lies in the highly reactive nature of this labeling method, as it can capture status changes faster than mode. Another interesting observation is that adding data shuffling grants a larger performance boost to the recent labeling compared to the mode labeling. This happens because the recent method is more sensitive to temporal correlations in the data, which in turn may lead to classification errors. Data shuffling destroys the temporal correlations in the training set and thus improves detection accuracy.

Figure 12: RF classification results on the Antarex dataset, using only non-ambiguous feature sets in (a) time-stamp and (b) shuffled order.
Here we give insights on the impact of ambiguous feature sets in the dataset on the classification process, by excluding them from the training and test sets. As shown in Figure 12, the overall F-scores are above 0.99 both without shuffling (Figure 12a) and with it (Figure 12b), leading to a slightly better classification performance compared to keeping the ambiguous feature sets in the dataset. Around 20% of the feature sets of the Antarex dataset are ambiguous; given this relatively large proportion, the performance gap described above is small, which demonstrates the robustness of our detection method.
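As a rough illustration, ambiguous feature sets could be flagged and filtered as sketched below, under our simplifying assumption that a feature set is ambiguous whenever its aggregation window mixes different system states; the toy data structures are not part of our actual pipeline.

def is_ambiguous(window_labels):
    # A window is ambiguous if it spans a state change (e.g., healthy -> faulty).
    return len(set(window_labels)) > 1

# Toy feature sets: (feature vector, states seen during the 60-second window).
feature_sets = [
    ([0.1, 0.3], ["healthy", "healthy"]),
    ([0.7, 0.9], ["healthy", "cpufreq"]),    # state change -> ambiguous
    ([0.8, 0.9], ["cpufreq", "cpufreq"]),
]
clean = [(f, labels[0]) for f, labels in feature_sets if not is_ambiguous(labels)]
print(len(clean))   # -> 2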
In general, the proportion of ambiguous feature sets in a dataset depends primarily on the length of the aggregation window and on the frequency of state changes in the HPC system. More feature sets become ambiguous as the length of the aggregation window increases, with increasingly adverse effects on classification accuracy. As a practical guideline, we therefore advise using a short aggregation window, such as the 60-second window employed here, to perform online fault detection.
A more concrete example of the behavior of ambiguous feature sets can be seen in Figure 13, where we show the scatter plots of two important metrics (as we will discuss in Section 7.4) for the feature sets related to the ddot, cpufreq and memeater fault programs, respectively. The "healthy" points, marked in blue, and the fault-affected points, marked in orange, are distinctly clustered in all cases. The points representing the ambiguous feature sets, marked in green, are instead sparse and often fall right between the "healthy" and faulty clusters. This is particularly evident with the cpufreq fault program in Figure 13b.
Figure 13: Scatter plots of two important metrics, as quantified by a DT classifier, for three fault types: (a) ddot, (b) cpufreq, (c) memeater. "Healthy" points are marked in blue, fault-affected points in orange, and points related to ambiguous feature sets in green.
A crucial aspect in fault detection is understanding which metrics matter most for detection accuracy. Identifying them can help reduce the amount of collected data, and hence the number of hardware measuring components or software tools that could create additional overhead. Tuncer et al. [35] showed that methods such as principal component analysis (and other methods that rely exclusively on the variance in the data) may discard certain important metrics. An RF classifier, on the other hand, tends to report the same metric as relevant many times, with different statistical indicators; this is caused by its ensemble nature (a random forest is composed of a collection of decision trees) and by the subtle differences among the estimators that compose it. A DT classifier instead provides this information naturally, as the most relevant metrics are those in the top layer of the decision tree: decision trees are built by recursively splitting the data into subsets according to the values of the metrics (attributes in DT terminology), and the attributes providing the highest information gain, i.e., the most relevant ones, are selected first by standard DT training algorithms. We therefore trained a DT classifier on the Antarex dataset.
The results are listed in Table 2, along with their source LDMS plug-ins. Some of the metrics are per-core, while the others are system-wide. We notice that metrics from most of the available plug-ins are used, and that some of these metrics can be directly associated with the behavior of the faults. For instance, the metric related to context switches (context switches der perc25) is tied to the dial and ddot programs, as CPU interference generates an anomalous number of context switches. In general, first-order derivatives (marked with the "der" suffix) are widely used by the classifier, which suggests that computing them is indeed useful for fault detection. On the contrary, more complex statistical indicators such as the skewness and kurtosis do not appear among the most relevant. This suggests that simple features are sufficient for machine learning-driven fault detection on HPC nodes.
Table 2: The most important metrics for fault detection, obtained via a DT classifier.
       Source        Name
  1.   procsensors   cpu freq perc50
  2.   meminfo       active der perc5
  3.   perfevent     hw cache misses perc50
  4.   vmstat        thp split std
  5.   vmstat        nr active file der perc25
  6.   perfevent     hw branch instructions perc95
  7.   meminfo       mapped der avg
  8.   meminfo       nr anon pages der perc95
  9.   procstat      sys der min
 10.   vmstat        nr dirtied der std
 11.   meminfo       kernelstack perc50
 12.   vmstat        numa hit perc5
 13.   procstat      processes der std
 14.   procstat      context switches der perc25
 15.   procstat      procs running perc5
 16.   finj          allocated perc50
 17.   vmstat        nr free pages der min
 18.   diskstats     time read rate der perc95
 19.   vmstat        nr kernel stack der max
 20.   perfevent     hw instructions perc5
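A ranking of this kind could be approximated with scikit-learn [32] as sketched below, using impurity-based feature importances as a proxy for the "top layer" inspection described above. The synthetic X, y and metric_names are stand-ins for the Antarex feature sets and are not part of our actual analysis.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((1000, 30))                    # stand-in feature sets
y = rng.integers(0, 9, 1000)                  # stand-in labels (healthy + 8 faults)
metric_names = [f"metric_{i}" for i in range(X.shape[1])]   # placeholder names

dt = DecisionTreeClassifier(criterion="entropy")   # split on information gain
dt.fit(X, y)

# Metrics used high in the tree accumulate the largest importance scores.
ranking = np.argsort(dt.feature_importances_)[::-1][:20]
for rank, idx in enumerate(ranking, start=1):
    print(f"{rank:2d}. {metric_names[idx]}  {dt.feature_importances_[idx]:.3f}")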
Finally, a critical consideration for assessing the feasibility of a fault detection approach is its overhead, especially if the final target is deployment in a real HPC system. LDMS is proven to have a low overhead at high sampling rates [1]. We also assume that the generation of feature sets and the classification are performed locally on each node (on-edge computation), and that only the resulting fault diagnoses are sent externally, which greatly decreases the network communication requirements and overhead. Under these assumptions, the hundreds of performance metrics used to train the classification models do not need to be sampled and streamed at a fine granularity. Generating a set of feature sets, one for each core in the test node, at a given time point for an aggregation window of 60 seconds requires on average 340 ms using a single thread. This time includes the I/O overhead of reading and parsing the LDMS CSV files and of writing the output feature sets. RF classifiers are very efficient: classifying a single feature set as faulty or not requires 2 ms on average. In total, 342 ms are therefore needed to generate and classify a feature set using a single thread and a 60-second aggregation window, which is more than acceptable for online use and practically negligible. Furthermore, the overhead should be much lower in a real HPC system with direct in-memory access to the streamed data, as a significant fraction of the overhead in our simulation is due to file system I/O operations on the CSV files containing the data. Additionally, the individual statistical features are independent of each other, which means the data can be processed in parallel using multiple threads to further reduce latency and to ensure load balancing across CPU cores, a critical aspect for preventing application slowdown induced by fault detection.
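The sketch below illustrates the kind of per-window computation whose cost is discussed above: statistical features, including first-order derivatives, are derived from the samples of a 60-second window and could then be passed to the trained classifier. The metric names, the number of metrics and the timing harness are illustrative assumptions, not our actual implementation.

import time
import numpy as np

def build_feature_set(window):
    # window: dict mapping each metric to a 1D array of samples from the last 60 s.
    feats = []
    for samples in window.values():
        der = np.diff(samples)                     # first-order derivative ("der")
        for series in (samples, der):
            feats.extend([series.mean(), series.std(), series.min(), series.max(),
                          np.percentile(series, 5), np.percentile(series, 25),
                          np.percentile(series, 50), np.percentile(series, 95)])
    return np.array(feats)

rng = np.random.default_rng(0)
window = {f"metric_{i}": rng.random(60) for i in range(160)}   # 60 samples per metric

start = time.perf_counter()
features = build_feature_set(window)
# classifier.predict(features.reshape(1, -1))   # trained RF, omitted in this sketch
print(f"{(time.perf_counter() - start) * 1000:.1f} ms for {features.size} features")

Since the features of different metrics are computed independently, the loop over metrics could be distributed across threads or processes, which is the parallelization opportunity mentioned above.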
Conclusions
We have studied an ML approach to online fault classification in HPC systems. Our study provides three contributions to the state of the art in resiliency research for HPC systems. The first is FINJ, a fault injection tool that allows for the automation of complex experiments and for reproducing anomalous behaviors in a deterministic, simple way. FINJ is implemented in Python and has no dependencies for its core operation. This, together with the simplicity of its command-line interface, makes the deployment of FINJ on large-scale systems trivial. Since FINJ is based on tasks, which are external executable programs, users can integrate the tool with any existing lower-level fault injection framework that can be triggered in such a way, ranging from the application level down to the kernel or even hardware level. The use of workloads in FINJ also makes it possible to reproduce complex, specific fault conditions and to reliably perform experiments involving multiple nodes at the same time.
The second contribution is the Antarex fault dataset, which we generated using FINJ to enable training and evaluation of our supervised ML classification models, and which we have described extensively. Both FINJ and the Antarex dataset are publicly available to facilitate resiliency research in the HPC systems field. The third contribution is a classification method intended for streamed, online data obtained from a monitoring framework, which is processed and fed to classifiers. The experimental results of our study show almost perfect classification accuracy for all fault types injected by FINJ, with negligible computational overhead for HPC nodes. Moreover, our study reproduces the operating conditions that could be found in a real online system, in particular those related to ambiguous system states and data imbalance in the training and test sets.
As future work, we plan to deploy our fault detection method in a large-scale, real HPC system. This will involve facing a number of new challenges. We need to develop tools to aid online training of machine learning models, as well as to integrate our method in a monitoring framework such as Examon [4]. We also expect to study our approach in online scenarios and adapt it where necessary. For instance, we need to characterize the scalability of FINJ and extend it with the ability to build workloads in which the order of tasks is defined by causal relationships rather than time-stamps, which might simplify the triggering of extremely specific anomalous states in a given system. Since training is performed before HPC nodes move into production (e.g., in a test environment), we also need to characterize how often re-training is needed and devise a procedure for performing it. Finally, we plan to make our detection method robust against faults that were unknown during the training phase, preventing their misclassification, and we expect to evaluate specialized models such as LSTM neural networks in light of the general results obtained in this study.
Acknowledgements
A. Netti has been supported by the EU project Oprecomp - Open Transprecision Computing (grant agreement 732631). A. Sîrbu has been partially funded by the EU project SoBigData Research Infrastructure - Big Data and Social Mining Ecosystem (grant agreement 654024). We thank the Integrated Systems Laboratory of ETH Zurich for granting us control of their Antarex HPC node during this study.
References
[1] A. Agelastos, B. Allan, J. Brandt, P. Cassella, et al. The lightweight distributed metric service: a scalable infrastructure for continuous monitoring of large scale computing systems and applications. In Proc. of SC 2014, pages 154–165. IEEE, 2014.
[2] S. Ashby, P. Beckman, J. Chen, P. Colella, et al. The opportunities and challenges of exascale computing. Summary Report of the Advanced Scientific Computing Advisory Committee (ASCAC) Subcommittee, pages 1–77, 2010.
[3] E. Baseman, S. Blanchard, N. DeBardeleben, A. Bonnie, et al. Interpretable anomaly detection for monitoring of high performance computing systems. In Proc. of the ACM SIGKDD Workshops 2016, 2016.
[4] F. Beneventi, A. Bartolini, C. Cavazzoni, and L. Benini. Continuous learning of HPC infrastructure models using big data analytics and in-memory processing tools. In Proc. of DATE 2017, pages 1038–1043. IEEE, 2017.
[5] K. Bergman, S. Borkar, D. Campbell, W. Carlson, et al. Exascale computing study: Technology challenges in achieving exascale systems. DARPA IPTO, Tech. Rep., 15, 2008.
[6] P. Bodik, M. Goldszmidt, A. Fox, D. B. Woodard, et al. Fingerprinting the datacenter: automated classification of performance crises. In Proc. of EuroSys 2010, pages 111–124. ACM, 2010.
[7] R. Bolze, F. Cappello, E. Caron, M. Daydé, et al. Grid'5000: A large scale and highly reconfigurable experimental grid testbed. The International Journal of High Performance Computing Applications, 20(4):481–494, 2006.
[8] J. Calhoun, L. Olson, and M. Snir. FlipIt: An LLVM based fault injector for HPC. In Proc. of Euro-Par 2014, pages 547–558. Springer, 2014.
[9] F. Cappello, A. Geist, W. Gropp, S. Kale, et al. Toward exascale resilience: 2014 update. Supercomputing Frontiers and Innovations, 1(1):5–28, 2014.
[10] D. Capps and W. Norcott. IOzone filesystem benchmark, 2008.
[11] I. Cohen, J. S. Chase, M. Goldszmidt, T. Kelly, et al. Correlating instrumentation data to system states: A building block for automated diagnosis and control. In OSDI, volume 4, pages 16–16, 2004.
[12] R. Coker. Bonnie++ file-system benchmark, 2012.
[13] N. DeBardeleben, S. Blanchard, Q. Guan, Z. Zhang, et al. Experimental framework for injecting logic errors in a virtual machine to profile applications for soft error resilience. In Proc. of Euro-Par 2011, pages 282–291. Springer, 2011.
[14] J. J. Dongarra, J. D. Cruz, S. Hammarling, and I. S. Duff. Algorithm 679: A set of level 3 basic linear algebra subprograms: model implementation and test programs. ACM Transactions on Mathematical Software (TOMS), 16(1):18–28, 1990.
[15] J. J. Dongarra, P. Luszczek, and A. Petitet. The LINPACK benchmark: past, present and future. Concurrency and Computation: Practice and Experience, 15(9):803–820, 2003.
[16] C. Engelmann and S. Hukerikar. Resilience design patterns: A structured approach to resilience at extreme scale. Supercomputing Frontiers and Innovations, 4(3), 2017.
[17] K. B. Ferreira, P. Bridges, and R. Brightwell. Characterizing application sensitivity to OS interference using kernel-level noise injection. In Proc. of SC 2008, page 19. IEEE Press, 2008.
[18] A. Gainaru and F. Cappello. Errors and faults. In Fault-Tolerance Techniques for High-Performance Computing, pages 89–144. Springer, 2015.
[19] Q. Guan, C.-C. Chiu, and S. Fu. CDA: A cloud dependability analysis framework for characterizing system dependability in cloud computing infrastructures. In Proc. of PRDC 2012, pages 11–20. IEEE, 2012.
[20] Q. Guan and S. Fu. Adaptive anomaly identification by exploring metric subspace in cloud computing infrastructures. In Proc. of SRDS 2013, pages 205–214. IEEE, 2013.
[21] H. S. Gunawi, T. Do, P. Joshi, P. Alvaro, et al. FATE and DESTINI: A framework for cloud recovery testing. In Proc. of NSDI 2011, page 239, 2011.
[22] M.-C. Hsueh, T. K. Tsai, and R. K. Iyer. Fault injection techniques and tools. Computer, 30(4):75–82, 1997.
[23] W. M. Jones, J. T. Daly, and N. DeBardeleben. Application monitoring and checkpointing in HPC: looking towards exascale systems. In Proc. of ACM-SE 2012, pages 262–267. ACM, 2012.
[24] P. Joshi, H. S. Gunawi, and K. Sen. PREFAIL: A programmable tool for multiple-failure injection. In ACM SIGPLAN Notices, volume 46, pages 171–188. ACM, 2011.
[25] D. Kondo, B. Javadi, A. Iosup, and D. Epema. The failure trace archive: Enabling comparative analysis of failures in diverse distributed systems. In Proc. of CCGRID 2010, pages 398–407. IEEE, 2010.
[26] B. Lakshminarayanan, D. M. Roy, and Y. W. Teh. Mondrian forests: Efficient online random forests. In Advances in Neural Information Processing Systems, pages 3140–3148, 2014.
[27] C. Lameter. NUMA (non-uniform memory access): An overview. Queue, 11(7):40, 2013.
[28] Z. Lan, Z. Zheng, and Y. Li. Toward automated anomaly identification in large-scale systems. IEEE Transactions on Parallel and Distributed Systems, 21(2):174–187, 2010.
[29] P. R. Luszczek, D. H. Bailey, J. J. Dongarra, J. Kepner, et al. The HPC Challenge (HPCC) benchmark suite. In Proc. of SC 2006, volume 213. ACM, 2006.
[30] J. McCalpin. STREAM: Sustainable memory bandwidth in high performance computers, 2006.
[31] T. Naughton, W. Bland, G. Vallee, C. Engelmann, et al. Fault injection framework for system resilience evaluation: fake faults for finding future failures. In Proc. of Resilience 2009, pages 23–28. ACM, 2009.
[32] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830, 2011.
[33] M. Snir, R. W. Wisniewski, J. A. Abraham, et al. Addressing failures in exascale computing. The International Journal of High Performance Computing Applications, 28(2):129–173, 2014.
[34] D. T. Stott, B. Floering, D. Burke, Z. Kalbarczpk, et al. NFTAPE: a framework for assessing dependability in distributed systems with lightweight fault injectors. In Proc. of IPDS 2000, pages 91–100. IEEE, 2000.
[35] O. Tuncer, E. Ates, Y. Zhang, A. Turk, et al. Online diagnosis of performance variation in HPC systems using machine learning. IEEE Transactions on Parallel and Distributed Systems, 2018.
[36] O. Villa, D. R. Johnson, M. Oconnor, E. Bolotin, et al. Scaling the power wall: a path to exascale. In Proc. of SC 2014, pages 830–841. IEEE, 2014.
[37] C. Wang, V. Talwar, K. Schwan, and P. Ranganathan. Online detection of utility cloud anomalies using metric distributions. In