A Machine Learning Approach to Online Fault Classification in HPC Systems
Alessio Netti, Zeynep Kiziltan, Ozalp Babaoglu, Alina Sirbu, Andrea Bartolini, Andrea Borghesi
Leibniz Supercomputing Centre, Garching bei München, Germany
Department of Computer Science and Engineering, University of Bologna, Italy
Department of Electrical, Electronic and Information Engineering, University of Bologna, Italy
Department of Computer Science, University of Pisa, Italy
[email protected], [email protected], {zeynep.kiziltan, ozalp.babaoglu, a.bartolini, andrea.borghesi3}@unibo.it

Abstract
As High-Performance Computing (HPC) systems strive towards the exascale goal, failure rates both at the hardware and software levels will increase significantly. Thus, detecting and classifying faults in HPC systems as they occur, and initiating corrective actions before they can transform into failures, becomes essential for continued operation. Central to this objective is fault injection, which is the deliberate triggering of faults in a system so as to observe their behavior in a controlled environment. In this paper, we propose a fault classification method for HPC systems based on machine learning. The novelty of our approach rests with the fact that it can be operated on streamed data in an online manner, thus opening the possibility to devise and enact control actions on the target system in real time. We introduce a high-level, easy-to-use fault injection tool called FINJ, with a focus on the management of complex experiments. In order to train and evaluate our machine learning classifiers, we inject faults into an in-house experimental HPC system using FINJ, and generate a fault dataset which we describe extensively. Both FINJ and the dataset are publicly available to facilitate resiliency research in the HPC systems field. Experimental results demonstrate that our approach allows almost perfect classification accuracy to be reached for different fault types with low computational overhead and minimal delay.

Introduction
Motivation
Modern scientific discovery is increasingly being driven by computation [36]. In a growing number of areas where experimentation is either impossible, dangerous or costly, computing is often the only viable alternative towards confirming existing theories or devising new ones. As such, High-Performance Computing (HPC) systems have become fundamental "instruments" for driving scientific discovery and industrial competitiveness. Exascale (10^18 operations per second) is the moonshot for HPC systems. Reaching this goal is bound to produce significant advances in science and technology through higher-fidelity simulations, better predictive models and analysis of greater quantities of data, leading to vastly-improved manufacturing processes and breakthroughs in fundamental sciences ranging from particle physics to cosmology. Future HPC systems will achieve exascale performance through a combination of faster processors and massive parallelism [2].

With Moore's Law having reached its limit, the only viable path towards higher performance has to consider switching from increased transistor density towards increased core counts, and thus increased socket counts. This, however, presents major obstacles. With everything else being equal, the fault rate of a system is directly proportional to the number of sockets used in its construction [9]. But everything else is not equal: exascale HPC systems will also use advanced low-voltage technologies that are much more prone to aging effects [5], together with system-level performance and power modulation techniques, such as dynamic voltage frequency scaling, all of which tend to increase fault rates [16]. Economic forces that push for building HPC systems out of commodity components aimed at mass markets only add to the likelihood of more frequent unmasked hardware faults. Finally, complex system software, often built using open-source components to deal with more complex and heterogeneous hardware, fault masking and energy management, coupled with legacy applications, will significantly increase the potential for faults [23]. It is estimated that large parallel jobs will encounter a wide range of failures as frequently as once every 30 minutes on exascale platforms [33]. At these rates, failures will prevent applications from making progress. Consequently, exascale performance, even if achieved nominally, cannot be sustained for the duration of most applications, which often run for long periods.

To be usable in production environments with acceptable quality of service levels, exascale systems need to improve their resiliency by several orders of magnitude. Therefore, future exascale HPC systems must include automated mechanisms for masking faults, or recovering from them, so that computations can continue with minimal disruptions. In our terminology, a fault is defined as an anomalous behavior at the software or hardware level that can lead to illegal system states (errors) and, in the worst case, to service interruptions (failures) [18]. In this paper, we limit our attention to improving the resiliency of HPC systems through the use of mechanisms for detecting and classifying faults as soon as possible, since they are the root causes of errors and failures. An important technique for this objective is fault injection: the deliberate triggering of faults in a system so as to observe their behavior in a controlled environment, enabling the development of new prediction and response techniques as well as the testing of existing ones [22].
For fault injection to be effective, dedicated tools are necessary, allowing users to trigger complex and realistic fault scenarios in a reproducible manner.

Contributions
The contributions of our work are severalfold. First, we propose and evaluate a fault classification method based on supervised Machine Learning (ML) suitable for online deployment in HPC systems, as part of an infrastructure for building mechanisms to increase their resiliency. Our approach relies on a collection of performance metrics that are readily available in most HPC systems. A novel aspect of our proposal is its ability to work online with live streamed data, as opposed to traditional offline techniques that work with archived data. Our experimental results show that the method we propose can classify almost perfectly several types of faults, ranging from hardware malfunctions to software issues and bugs. In our method, classification can be achieved with little computational overhead and with minimal delay, making it suitable for online use. We characterize the performance of our proposed solution in a realistic online use scenario where live streamed data is fed to fault classifiers both for training and for detection, dealing with issues such as class imbalance and ambiguous system states. Most existing studies, on the contrary, consider offline scenarios and rely on extensive manipulation of archived data, which is not feasible in online scenarios. Furthermore, switching from an offline to an online approach based on streamed data opens up the possibility for devising and enacting control actions in real time.

Second, we introduce an easy-to-use open-source Python fault injection tool called FINJ. A relevant feature of FINJ is the possibility of seamless integration with other injection tools targeted at specific fault types, thus enabling users to coordinate faults from different sources and different system levels. By using FINJ's workload feature, users can also specify lists of applications to be executed and faults to be triggered on multiple nodes at specific times, with specific durations. FINJ thus represents a high-level, flexible tool, enabling users to perform complex and reproducible experiments, aimed at revealing the complex relations that may exist between faults, application behavior and the system itself. FINJ is also extremely easy to use: it can be set up and executed in a matter of minutes, and does not require the writing of additional code in most of its usage scenarios. To the best of our knowledge, FINJ is the first portable, open-source tool that allows users to perform and control complex injection experiments, that can be integrated with heterogeneous fault types and that includes workload support, while retaining ease of use and a quick setup time.

As a third and final contribution, our evaluation is based on a dataset consisting of monitoring data that we acquired from an experimental HPC system (called Antarex) by injecting faults using FINJ. We make the Antarex dataset publicly available and describe it extensively for use in the community. This is an important contribution to the HPC field, since commercial operators are very reluctant to share trace data containing information about faults in their HPC systems [25].
Organization
The rest of the paper is organized as follows. In the next section, we put our work in context with respect to related work. In Section 3, we introduce the FINJ tool, and present a simple use case in Section 4 to show how it can be deployed. In Section 5, we describe the Antarex dataset that will be used for evaluating our supervised ML classification models. In Section 6, we discuss the features extracted from the Antarex dataset to train the classifiers, and in Section 7, we present our experimental results. We conclude in Section 8.
Related Work

Fault injection for prediction and detection purposes is a recent topic of intense activity, and several studies have proposed tools with varying levels of abstraction. Calhoun et al. [8] devised a compiler-level fault injection tool focused on memory bit-flip errors, targeting HPC applications. DeBardeleben et al. [13] proposed a logic error-oriented fault injection tool. This tool is designed to inject faults in virtual machines, by exploiting emulated machine instructions through the open-source virtual machine and processor emulator QEMU. Both works focus on low-level, fault-specific tools and do not provide functionality for the injection of complex workloads, nor for the collection of produced data, if any.

Stott et al. [34] proposed NFTAPE, a high-level and generic tool for fault injection. This tool is designed to be integrated with other fault injection tools and triggers at various levels, allowing for the automation of long and complex experiments. The tool, however, has aged considerably, and is not publicly available. A similar fault injection tool was proposed by Naughton et al. [31], which has never evolved beyond the prototype stage and is also not publicly available, to the best of our knowledge. Moreover, both tools require users to write a fair amount of wrapper and configuration code, resulting in a complex setup process. The Gremlins Python package (https://github.com/toddlipcon/gremlins) also supplies a high-level fault injector. However, it does not support workload or data collection functionalities, and experiments on multiple nodes cannot be performed.

Joshi et al. [24] introduced the PREFAIL tool, which allows for the injection of failures at any code entry point in the underlying operating system. This tool, like NFTAPE, employs a coordinator process for the execution of complex experiments. It is targeted at a specific fault type (code-level errors) and does not permit performing experiments focused on performance degradation and interference. Similarly, the tool proposed by Gunawi et al. [21], named FATE, allows the execution of long experiments; furthermore, it is focused on reproducing specific fault sequences, simulating real scenarios. Like PREFAIL, it is limited to a specific fault type, namely I/O errors, thus greatly limiting its scope.

The FINJ Tool

In this section, we first discuss how fault injection is achieved in FINJ. We then present the architecture of FINJ and discuss its implementation. Customizing FINJ for different purposes is easy, thanks to its portable and modular nature.
Fault injection is achieved through tasks that are executed on target nodes. Each task corresponds to either an HPC application or a fault-triggering program, and has a specification for its execution. As demonstrated by Stott et al. [34], this approach allows for the integration in FINJ of any third-party fault injection framework that can be triggered externally. In any case, many fault-triggering programs are supplied with FINJ (see Section 5.2), allowing users to experiment with anomalies out of the box.

A sequence of tasks defines a workload, which is a succession of scheduled application and fault executions at specific times, reproducing a realistic working environment for the fault injection process. A task is specified by the following attributes:

• args: the full shell command required to run the selected task. The command must refer to an executable file that can be accessed from the target hosts;
• timestamp: the time in seconds at which the task must be started, expressed as a relative offset;
• duration: the task's maximum allowed duration, expressed in seconds, after which it will be abruptly terminated. This duration can serve as an exact duration as well, with FINJ restarting the task if it finishes earlier, and terminating it if it lasts longer. This behavior depends on the FINJ configuration (see Section 3.3);
• isFault: defines whether the task corresponds to a fault-triggering program or to an application;
• seqNum: a sequence number used to uniquely identify the task inside a workload;
• cores: an optional attribute which is the list of CPU cores that the task is allowed to use on target nodes, enforced through a NUMA Control policy [27].

A workload is stored in a workload file, which contains the specifications of all the tasks of a workload in CSV format. The starting time of each task is expressed as a relative offset, in seconds, with respect to the first task in the workload. A particular execution of a given workload then constitutes an injection session.

In our approach, the responsibility of each entity involved in the fault injection process is isolated, as depicted in Figure 1. The fault injection engine of FINJ manages the execution of tasks on target nodes and the collection of their output. Tasks contain all the necessary logic to run applications or to trigger any other low-level fault injection framework (e.g., by means of a writable file or system call). At the last level, the activated low-level fault injector handles the actual triggering of faults, which can be at the software, kernel, or even hardware level.
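As an illustration of the workload format, the short script below writes a three-task workload file. The semicolon-separated layout and field order follow the example shown later in Section 4, while the file name and commands are purely hypothetical.

```python
import csv

# One dictionary per task, using the attributes described above.
# Commands and paths are placeholders; "args" must point to executables
# reachable on the target hosts.
tasks = [
    {"timestamp": 0,    "duration": 1800, "seqNum": 1, "isFault": False,
     "cores": "0-7", "args": "./benchmarks/hpl.sh"},
    {"timestamp": 600,  "duration": 300,  "seqNum": 2, "isFault": True,
     "cores": "6",   "args": "./faults/cpufreq 300"},
    {"timestamp": 1200, "duration": 300,  "seqNum": 3, "isFault": True,
     "cores": "4",   "args": "./faults/leak 300"},
]

fields = ["timestamp", "duration", "seqNum", "isFault", "cores", "args"]
with open("workload.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fields, delimiter=";")
    writer.writeheader()
    writer.writerows(tasks)
```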
Figure 1: A diagram representing the responsibility of each entity involved in the fault injection process.
FINJ consists of a fault injection controller and a fault injection engine, which are designed to run on separate machines. The high-level structure of the FINJ architecture is illustrated in Figure 2.

The controller orchestrates the injection process, and should be run on an external node that is not affected by the faults. The controller maintains connections to all nodes involved in the injection session, which run fault injection engine instances and whose addresses are specified by users. Therefore, injection sessions can be performed on multiple nodes at the same time. The controller reads the task entries from the selected workload file. For each task, the controller sends a command to all target hosts, instructing them to start the new task at the specified time. Finally, the controller collects and stores all status messages (e.g., task start and termination, status changes) produced by the target hosts.

The engine is a daemon running on nodes where faults are to be injected. The engine waits for task commands to be received from remote controller instances. Engines can be connected to multiple controllers at the same time; however, task commands are accepted from only one controller at a time, the master of the injection session. The engine manages received task commands by assigning them to a dedicated thread from a pool. The thread manages all aspects related to the execution of the task, such as spawning the necessary subprocesses and sending status messages to controllers. Whenever a fault causes a target node to crash and reboot, controllers are able to re-establish and recover the previous injection session, provided that the engine is set up to be executed at boot time on the target node.
Figure 2: Architecture of the FINJ tool showing the division between a controller node (top) and a target node (bottom).

FINJ is based on a highly modular architecture, and therefore it is very easy to customize its single components in order to add or tune features.
Network
Engine and controller instances communicate through a network layer, and communication is achieved through a simple message-based protocol. Specifically, we implemented client and server software components for the exchange of messages. Fault injection controllers use a client instance in order to connect to fault injection engines, which in turn run a server instance that listens for incoming connections. A message can be either a command sent by a controller, related to a single task, or a status message, sent by an engine, related to status changes in its machine. All messages are in the form of dictionaries. This component also handles resiliency features, such as automatic re-connection from clients to servers, since temporary connection losses are to be expected in a fault injection context.
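Since messages are plain dictionaries, a command and a status message can be pictured roughly as below; the key names are illustrative guesses, not FINJ's actual protocol fields.

```python
import json
import time

# Hypothetical command message sent by a controller for a single task.
command_msg = {
    "type": "command",
    "task": {"args": "./faults/leak 300", "timestamp": 600, "duration": 300,
             "seqNum": 2, "isFault": True, "cores": "4"},
}

# Hypothetical status message sent back by an engine when the task starts.
status_msg = {"type": "status", "event": "task_start", "seqNum": 2,
              "time": time.time()}

# Dictionaries serialize easily for transport over the TCP connection.
payload = json.dumps(command_msg).encode("utf-8")
```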
Thread Pool
Task commands in the engines are assigned to a thread in a pool as they are received. Each thread manages all aspects of a task assigned to it. Specifically, the thread sleeps until the scheduled starting time of the task (according to its timestamp); then, it spawns a subprocess running the specified task, and sends a message to all connected controllers to inform them of the event. At this point, the thread waits for the task's termination, depending on its duration and on the current configuration. Finally, the thread sends a new status message to all connected hosts informing them of the task's termination, and returns to sleep. The number of threads in the pool, which is a configurable parameter, determines the maximum number of tasks that can be executed concurrently. Since threads in the pool are started only once during the engine's initialization, and wake up for minimal amounts of time when a task needs to be started or terminated, we expect their impact on performance to be negligible. The life cycle of a task, as managed by a worker thread, is represented in Figure 3.

Figure 3: A representation of the life cycle of a task, as managed by the thread pool of the fault injection engine.
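The life cycle above can be sketched with Python's standard thread pool as follows; this is a simplified stand-in for FINJ's InjectionThreadPool, with the notification step reduced to a print statement.

```python
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor

def run_task(task, session_start):
    """Sleep until the task's start time, run it, enforce its duration, report."""
    delay = session_start + task["timestamp"] - time.time()
    if delay > 0:
        time.sleep(delay)                       # wait for the scheduled start time
    print("task_start", task["seqNum"])         # stand-in for a status message
    proc = subprocess.Popen(task["args"], shell=True)
    try:
        proc.wait(timeout=task["duration"])     # wait for natural termination
    except subprocess.TimeoutExpired:
        proc.kill()                             # enforce the maximum duration
    print("task_end", task["seqNum"])

tasks = [{"timestamp": 0, "duration": 5, "seqNum": 1, "args": "sleep 3"},
         {"timestamp": 2, "duration": 5, "seqNum": 2, "args": "sleep 10"}]

session_start = time.time()
with ThreadPoolExecutor(max_workers=4) as pool:  # pool size bounds concurrency
    for task in tasks:
        pool.submit(run_task, task, session_start)
```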
Input and Output
The input and output of all data related to injection sessions are performed by controller instances, and are handled by reader and writer entities. By default, the input/output format is CSV, which was chosen due to its extreme simplicity and generality.
Input is constituted by workload files, which include one entry for each injection task, as described in Section 3.1.
Output, instead, is made up of two parts: I) the execution log, which contains entries corresponding to status changes in the target nodes (e.g., start and termination, encountered errors, and connection loss or recovery events); II) the output produced by single tasks.
Configuration
FINJ's runtime behavior can be customized by means of a configuration file. This file includes several options that alter the behavior of either the controller or engine instances. Among the basic options, it is possible to specify the listening TCP port for engine instances, and the list of addresses of the target hosts to which controller instances should connect at launch time. The latter is useful when injection sessions must be performed on large sets of nodes, whose addresses can be conveniently stored in a file. More complex options are also available. For instance, it is possible to define a series of commands corresponding to tasks that must be launched together with FINJ, and must be terminated with it. This option proves especially useful when users wish to set up monitoring frameworks, such as the Lightweight Distributed Metric Service (LDMS) [1], to be launched together with FINJ in order to collect system performance metrics during injection sessions.
Workload Generation

While writing workload files manually is possible, this is time-consuming and not desirable for long injection sessions. Therefore, we implemented in FINJ a workload generation tool, which can be used to automatically generate workload files with certain statistical features, while trying to combine flexibility and ease of use. The workload generation process is controlled by three parameters: a maximum time span for the total duration of the workload, expressed in seconds, a statistical distribution for the duration of tasks, and another one for their inter-arrival times. We define the inter-arrival time as the interval between the start of two consecutive tasks. These distributions are separated into two sets, for fault and application tasks, thus amounting to a total of four. They can be either specified analytically by the user or fitted from real data, thus reproducing realistic behavior.

A workload is composed as a series of fault and application tasks that are selected from a list of possible shell commands. To control the composition of workloads, users can optionally associate to each command a probability for its selection during the generation process, and a list of CPU cores for its execution, as explained in Section 3.1. By default, commands are picked uniformly. Having defined its parameters, the workload generation process is then fairly simple. Tasks are randomly generated in order to achieve statistical features close to those specified as input, and are written to an output CSV file, until the maximum imposed time span is reached. Alongside the full workload, a probe file is also produced, which contains one entry for each task type, all with a short fixed duration, representing a lightweight workload version. This file can be used during the setup phase to test the correct configuration of the system, making sure that all tasks are correctly found and executed on the target hosts, without having to run the entire heavy workload.
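A condensed sketch of the generation loop described above: draw inter-arrival times and durations from configurable distributions until the requested time span is filled. The distributions, their parameters and the command list are placeholders, not the defaults used by FINJ.

```python
import csv
import random
from scipy import stats

max_span = 8 * 3600                            # total workload span in seconds
iat_dist = stats.expon(scale=2400)             # inter-arrival times (placeholder)
dur_dist = stats.norm(loc=1800, scale=300)     # task durations (placeholder)
commands = ["./benchmarks/hpl.sh", "./benchmarks/stream.sh"]

tasks, t, seq = [], 0.0, 1
while t < max_span:
    tasks.append({"timestamp": int(t),
                  "duration": max(1, int(dur_dist.rvs())),
                  "seqNum": seq,
                  "isFault": False,
                  "cores": "0-7",
                  "args": random.choice(commands)})  # uniform pick by default
    t += iat_dist.rvs()
    seq += 1

with open("generated_workload.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(tasks[0]), delimiter=";")
    writer.writeheader()
    writer.writerows(tasks)
```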
Implementation

FINJ is implemented in Python, an object-oriented, high-level interpreted programming language, and can be used on all major operating systems. All dependencies are included in the Python distribution, and the only optional external dependency is the scipy package, which is needed for the workload generation functionality. The source code is publicly available on GitHub under the MIT license (https://github.com/AlessioNetti/fault_injector), together with its documentation, usage examples and several fault-triggering programs. FINJ works on Python versions 3.4 and above.

In Figure 4, we illustrate the class diagram of FINJ. The fault injection engine and the fault injection controller are represented by the InjectorEngine and InjectorController classes. Users can instantiate these classes and start injection sessions directly, by using the listen_injection method to put the engine in listening mode, and the inject method of the controller, which starts the injection session itself. Scripts are supplied with FINJ to create controller and engine instances from a command-line interface, simplifying the process.

Figure 4: The class diagram of FINJ.

The InjectionThreadPool class, instead, supplies the thread pool implementation used to execute and manage the tasks. The network layer of the tool is represented by the MsgClient and MsgServer classes, which implement the message- and queue-based client and server used for communication. Both classes are implementations of the MsgEntity abstract class, which provides the interface for sending and receiving messages, and implements the basic mechanisms that regulate access to the underlying queue. Input and output are instead handled by the Reader and Writer abstract classes and their implementations: CSVReader and CSVWriter handle the reading and writing of workload files, while ExecutionLogReader and ExecutionLogWriter handle the execution logs generated by injection sessions. Since these classes are all implementations of abstract interfaces, it is easy for users to customize them for different formats. Tasks are modeled by the Task class, which contains all attributes specified in Section 3.1. Lastly, access to the workload generator is provided through the WorkloadGenerator class, which is the interface used to set up and start the generation process. This class is backed by the ElementGenerator class, which offers basic functionality for fitting data and generating random values, and acts as a wrapper around scipy's rv_continuous class, which generates random variables.
Use Case

In this section, we demonstrate the flow of execution of FINJ through a concrete example carried out on a real HPC node, and provide insight on its overhead.

Here we consider a sample fault injection session. The employed CSV workload file is illustrated in Figure 5. The experimental setup for this test, both for fault injection and monitoring, is the same as that for the acquisition of the Antarex dataset, which we present in Section 5. Python scripts are supplied with FINJ to start and configure engine and controller instances: their usage is explained on the GitHub repository of the tool, together with all configuration options.

Figure 5: The CSV workload file employed in the sample injection session (fields: timestamp; duration; seqNum; isFault; cores; args).

In this workload, the first task corresponds to an HPC application and is the Intel Distribution for the well-known High-Performance Linpack (HPL) benchmark, optimized for Intel Xeon CPUs. This task starts at time 0 in the workload, and has a maximum allowed duration of roughly 30 minutes. The following two tasks are fault-triggering programs: cpufreq, which dynamically reduces the maximum allowed CPU frequency, emulating performance degradation, and leak, which creates a memory leak in the system, eventually using all available RAM. The cpufreq program requires appropriate permissions, so that users can access the files controlling Linux CPU governors. Both fault programs are discussed in detail in Section 5. The HPL benchmark is run with 8 threads, pinned on the first 8 cores of the machine, while the cpufreq and leak tasks are forced to run on cores 6 and 4, respectively. Note also that the tasks must be available at the specified path on the systems running the engine, which in this case is relative to the location of the FINJ launching script.

Having defined the workload, the injection engine and controller must be started. For this experiment, we run both on the same machine. The controller instance will then connect to the engine and start executing the workload, storing all output in a CSV file which is unique for each target host. Each entry in this file represents a status change event, which in this case is the start or termination of a task, and is flagged with its absolute time-stamp on the target host. In addition, any errors that were encountered are also reported. When the workload is finished, the controller terminates. It can be clearly seen from this example how easily a FINJ experiment can be configured and run on multiple CPU cores.

At this point, the data generated by FINJ can be easily compared with other data, for example performance metrics collected through a monitoring framework, in order to better understand the system's behavior under faults. In Figure 6, we show the total RAM usage and the CPU frequency of core 0, as monitored by the LDMS framework. The benchmark's profile is simple, showing a constant CPU frequency while RAM usage slowly increases as the application performs tests on increasing matrix sizes. The effect of our fault programs, marked in gray, can be clearly observed in the system. The cpufreq fault causes a sudden drop in CPU frequency, resulting in reduced performance and longer computation times, while the leak fault causes a steady, linear increase in RAM usage. Even though saturation of the available RAM is not reached, this peculiar behavior can be used for prediction purposes.

Figure 6: CPU frequency and RAM usage, as monitored on the target system during a sample FINJ injection session.
We performed tests in order to evaluate the overhead that FINJ may introduce. To do so, we employed the same machine used in Section 5, together with the HPL benchmark, this time configured to use all 16 cores of the machine. We executed the HPL benchmark 20 times directly on the machine by using a shell script, and then repeated the same process by embedding the benchmark in 20 tasks of a FINJ workload file. FINJ was once again instantiated locally. In both conditions, the HPL benchmark scored an average running time of roughly 320 seconds, leading us to conclude that the impact of FINJ on running applications is negligible. This was expected, since FINJ is designed to perform only the bare minimum amount of operations needed to start and manage tasks, without affecting their execution.
The Antarex Dataset

The dataset contains trace data that we collected from an HPC system (called Antarex) located at ETH Zurich while we injected faults using FINJ. The dataset is publicly available for use by the community (https://zenodo.org/record/2553224), and all the details regarding the test environment, as well as the employed applications and faults, are extensively documented. In this section, we give a comprehensive overview of the dataset.

Table 1: Structure of the Antarex dataset.

                Block I      Block II     Block III    Block IV
  Type          CPU-Mem      CPU-Mem      HDD          HDD
  Parallel      No           Yes          No           Yes
  Duration      12 days      12 days      4 days       4 days
  Applications  DGEMM [14], HPCC [29],    IOZone [10], Bonnie++ [12]
                STREAM [30], HPL [15]
  Faults        leak, memeater, ddot,     ioerr, copy
                dial, cpufreq, pagefail
To acquire data, we executed some HPC applications and at the same time injected faults using FINJ in a single compute node of Antarex. We acquired the data in four steps, using four different FINJ workload files. Each data acquisition step consists of application and fault program runs related to either the CPU and memory components, or the hard drive, in either single-core or multi-core mode. This resulted in 4 blocks of nearly 20GB and 32 days of data in total, whose structure is summarized in Table 1. We acquired the data by continuous streaming, thus any study based on it will easily be reproducible on a real HPC system, in an online way.

We acquired data from a single HPC compute node for several reasons. First, we focus on fault detection models that operate on a per-node basis and that do not require knowledge about the behavior of other nodes. Second, we assume that the behavior of different compute nodes under the same faults will be comparable, thus a model generated for one node will be usable for other nodes as well. Third, our fault injection experiments required alterations to the Linux kernel of the compute node or special permissions. These are possible in a test environment, but not in a production one, rendering the acquisition of a large-scale dataset on multiple nodes impractical.

The Antarex compute node is equipped with two Intel Xeon E5-2630 v3 CPUs, 128GB of RAM, a Seagate ST1000NM0055-1V4 1TB hard drive, and runs the CentOS 7.3 operating system. The node has a default Tier-1 computing system configuration. We used FINJ in a Python 3.4 environment. We also used the LDMS monitoring framework to collect performance metrics, so as to create features for fault classification purposes, as described in Section 6. We configured LDMS to sample several metrics each second, which come from the following seven different plug-ins:

1. meminfo collects general information on RAM usage;
2. perfevent collects CPU performance counters;
3. procinterrupts collects information on hardware and software interrupts;
4. procdiskstats collects statistics on hard drive usage;
5. procsensors collects sensor data related to CPU temperature and frequency;
6. procstat collects general metrics about CPU usage;
7. vmstat collects information on virtual memory usage.

Figure 7: Histograms for fault durations (a) and fault inter-arrival times (b) in the Antarex dataset, compared to the PDFs of the Grid5000 data, as fitted on a Johnson SU and an Exponentiated Weibull distribution, respectively.

This configuration resulted in a total of 2094 metrics to be collected each second. Some of the metrics are node-level, and describe the status of the compute node as a whole; others are core-specific and describe the status of one of the 16 available CPU cores. In order to minimize noise and bias in the sampled data, we chose to analyze, execute applications and inject faults into only 8 of the 16 cores available in the system, and therefore used only one CPU. On the other CPU of the system, instead, we executed the FINJ and LDMS tools, which rendered their CPU overhead negligible.
FINJ orchestrates the execution of applications and the injection of faults by means of a workload file, as explained in Section 3.1. For this purpose, we used several FINJ-generated workload files, one for each block of the dataset.
Workload Files
We used two different statistical distributions in the FINJ workload generator to create the durations and inter-arrival times of the tasks corresponding to the applications and fault-triggering programs. The application tasks are characterized by durations and inter-arrival times following normal distributions, and they occupy 75% of the dataset's entire duration. This resulted in tasks having an average duration of 30 minutes and average inter-arrival times of nearly 40 minutes, both exhibiting relatively low variance and spread.

Fault-triggering tasks, on the other hand, are modeled using distributions fitted on the data from the failure trace associated with the Grid5000 cluster [7], available on the Failure Trace Archive (http://fta.scem.uws.edu.au/). We extracted from this trace the inter-arrival times of the host failures. Such data was then scaled and shifted to obtain an average of 10 minutes, while retaining the shape of the distribution. We then fitted the data using an exponentiated Weibull distribution, which is commonly used to characterize failure inter-arrival times [18]. To model durations, we extracted for all hosts the time intervals between successive absent and alive records, which respectively describe connectivity loss and recovery events from the HPC system's resource manager to the host. We then fitted a Johnson SU distribution over a cluster of the data present at the 5-minute point, which required no alteration of the original data. This particular distribution was chosen because of the quality of the fitting. In Figure 7, we show the histograms for the durations (a) and inter-arrival times (b) of the fault tasks in the workload files, together with the original distributions fitted from the Grid5000 data.

Unless configured to use specific probabilities, FINJ generates each task in the workload by randomly picking the respective application or fault program to be executed, from those that are available, with uniform probability. This implies that, statistically, all of the applications we selected will be subject to all of the available fault-triggering programs, given a sufficiently long workload, with different execution overlaps depending on the starting times and durations of the specific tasks. Such a task distribution greatly mitigates overfitting issues. Finally, we do not allow fault-triggering program executions to overlap.
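The fitting step described above can be reproduced with scipy's distribution utilities, as sketched below; synthetic samples stand in for the Grid5000-derived data, and the parameters are illustrative.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the Grid5000-derived samples (in seconds).
rng = np.random.default_rng(0)
inter_arrivals = rng.exponential(scale=600, size=1000)
durations = rng.lognormal(mean=5.5, sigma=0.4, size=1000)

# Fit an exponentiated Weibull to inter-arrival times and a Johnson SU to durations.
iat_params = stats.exponweib.fit(inter_arrivals)
dur_params = stats.johnsonsu.fit(durations)

# The fitted distributions can then be sampled when generating fault tasks.
new_iats = stats.exponweib.rvs(*iat_params, size=10)
new_durs = stats.johnsonsu.rvs(*dur_params, size=10)
```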
HPC Applications

We used a series of well-known benchmark applications, each of which stresses a different part of the node and measures the corresponding performance.

1. DGEMM [14]: performs CPU-heavy matrix-to-matrix multiplication;
2. HPC Challenge (HPCC) [29]: a collection of applications that stress both the CPU and the memory bandwidth of an HPC system;
3. Intel distribution for High-Performance Linpack (HPL) [15]: solves a system of linear equations;
4. STREAM [30]: stresses the system's memory and measures its bandwidth;
5. Bonnie++ [12]: performs HDD read-write operations;
6. IOZone [10]: performs HDD read-write operations.

The different applications provide a diverse environment for fault injection. Since we limit our analysis to a single machine, we use versions of the applications that rely on shared-memory parallelism, for example through the OpenMP library.
Fault Programs
All the fault programs used to reproduce anomalous conditions on Antarex are available in the FINJ GitHub repository. As described by Tuncer et al. [35], each program can also operate in a low-intensity mode, thus doubling the number of possible faults. While we do not physically damage hardware, we closely reproduce several reversible hardware issues, such as I/O and memory allocation errors. Some of the fault programs (ddot and dial) only affect the performance of the CPU core they run on, whereas the others affect the entire compute node. The programs and the generated faults are as follows.

1. leak periodically allocates 16MB arrays that are never released [35], creating a memory leak, causing memory fragmentation and severe system slowdown;
2. memeater allocates, writes into and expands a 36MB array [35], decreasing performance through a memory interference fault and saturating bandwidth;
3. ddot repeatedly calculates the dot product between two equal-size matrices. The sizes of the matrices change periodically between 0.9, 5 and 10 times the CPU cache's size [35]. This produces a CPU and cache interference fault, resulting in degraded performance of the affected CPU;
4. dial repeatedly performs floating-point operations over random numbers [35], producing an ALU interference fault, resulting in degraded performance for applications running on the same core as the program;
5. cpufreq decreases the maximum allowed CPU frequency by 50% through the Linux Intel P-State driver (https://kernel.org/doc/Documentation/cpu-freq). This simulates a system misconfiguration or failing CPU fault, resulting in degraded performance;
6. pagefail makes any page allocation request fail with 50% probability (https://kernel.org/doc/Documentation/fault-injection). This simulates a system misconfiguration or failing memory fault, causing performance degradation and stalling of running applications;
7. ioerr fails one out of 500 hard-drive I/O operations with 20% probability, simulating a failing hard drive fault, and causing degraded performance for I/O-bound applications, as well as potential errors;
8. copy repeatedly writes and then reads back a 400MB file from a hard drive. After such a cycle, the program sleeps for 2 seconds [20]. This simulates an I/O interference or failing hard drive fault by saturating I/O bandwidth, and results in degraded performance for I/O-bound applications.

The faults triggered by these programs can be grouped into three categories according to their nature. The interference faults (i.e., leak, memeater, ddot, dial and copy) usually occur when orphan processes are left running in the system, saturating resources and slowing down the other processes. Misconfiguration faults occur when a component's behavior is outside of its specification, due to a misconfiguration by users or administrators (i.e., cpufreq). Finally, hardware faults are related to physical components in the system that are about to fail, and trigger various kinds of errors (i.e., pagefail or ioerr). We note that some faults may belong to multiple categories, as they can be triggered by different factors in the system.
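As a concrete illustration of how simple these programs can be, the sketch below mimics the leak fault: it periodically allocates 16MB buffers and never releases them. It follows the description above but is not the program shipped with FINJ.

```python
import sys
import time

def leak(duration, chunk_mb=16, period=0.5):
    """Allocate chunk_mb megabytes every period seconds and never release them."""
    hoard = []                                           # references kept forever
    end = time.time() + duration
    while time.time() < end:
        hoard.append(bytearray(chunk_mb * 1024 * 1024))  # grows RAM usage steadily
        time.sleep(period)

if __name__ == "__main__":
    # Duration in seconds, e.g. passed by FINJ as part of the task's args.
    leak(int(sys.argv[1]) if len(sys.argv) > 1 else 300)
```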
Creation of Features

In this section, we describe the process of parsing the metrics collected by LDMS to obtain a set of features capable of describing the state of the system, in particular for classification tasks.
Post-Processing of Data
Firstly, we removed all constant metrics (e.g., the amount of total memory in the node), which were redundant, and replaced the raw monotonic counters captured by the perfevent and procinterrupts plug-ins with their first-order derivatives. Moreover, we created an allocated metric, both at the CPU core and at the node level, and integrated it in the original set. This metric can assume only binary values, and it encapsulates information about hardware component occupation, namely whether there is an application allocated on the node or not. This information would be available also in a real setting, since resource managers in HPC systems always keep track of the running jobs and their allocated computational resources. Lastly, for each above-mentioned metric and for each time point, we computed its first-order derivative and added it to the dataset, as proposed by Guan et al. [19].

After having post-processed the LDMS metrics, we created the feature sets via aggregation. Each feature set corresponds to a 60-second aggregation window and is related to a specific CPU core. The time step between consecutive feature sets is 10 seconds; this fine granularity allows for quick response times in fault detection. For each metric, we computed several statistical measures over the distribution of the values measured within the aggregation window [35]. These measures are the average, standard deviation, median, minimum, maximum, skewness, kurtosis, and the 5th, 25th, 75th and 95th percentiles. Overall, there are 22 statistical features for each metric in the dataset (including also the first-order derivatives). Hence, starting from the initial set of 2094 LDMS metrics, including node-level data as well as data from all CPU cores, the final feature sets contain 3168 elements for each separate CPU core. We note that we did not include the metrics collected by the procinterrupts plug-in, as a preliminary analysis revealed them to be irrelevant for fault classification. All the scripts used to process the data are available in the FINJ GitHub repository.

Figure 8: Architecture of the proposed machine learning-based fault detection system.
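Assuming the post-processed metrics are available as a time-indexed pandas DataFrame, the aggregation step can be sketched as follows; window length, step and statistics mirror the description above, but this is a simplified reimplementation rather than the published scripts.

```python
import pandas as pd

def build_features(df, window="60s", step="10s"):
    """df: per-second metrics (and derivatives) for one core plus node-level data."""
    stats = ["mean", "std", "median", "min", "max", "skew", "kurt"]
    quantiles = [0.05, 0.25, 0.75, 0.95]
    rows = []
    for end in pd.date_range(df.index[0] + pd.Timedelta(window),
                             df.index[-1], freq=step):
        win = df.loc[end - pd.Timedelta(window):end]
        agg = win.agg(stats).unstack()          # one value per (metric, statistic)
        qs = win.quantile(quantiles).unstack()  # percentile features
        rows.append(pd.concat([agg, qs]).rename(end))
    return pd.DataFrame(rows)                   # one feature set per window end time

# Example with placeholder data sampled once per second.
idx = pd.date_range("2018-01-01", periods=600, freq="1s")
metrics = pd.DataFrame({"cpu_freq": 2.4, "nr_free_pages": 1e6}, index=idx)
features = build_features(metrics)
```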
Labeling
To train classifiers to distinguish between faulty and normal states, we labeled the feature sets either according to the fault program (i.e., one of the 8 programs presented in Section 5.2) running within the corresponding aggregation window, or as "healthy" if no fault was running. The logs produced by FINJ (included in the Antarex dataset) provide the details about the fault programs running at each time-stamp. In a realistic deployment scenario, the fault detection model can also be trained using data from spontaneous, real faults. In that case, administrators should provide explicit labels instead of relying on fault injection experiments.

A single aggregation window may capture multiple system states, making labeling non-trivial. For example, a feature set may contain "healthy" time points before and after the start of a fault, or even include two different fault types. We define these feature sets as ambiguous. One of the reasons behind the use of a short aggregation window (60 seconds) is to minimize the impact of such ambiguous system states on fault detection. However, since these situations cannot be avoided, we propose two labeling methods. The first method is mode, where all the labels appearing in the time window are considered. Their distribution is examined and the label with the majority of occurrences is assigned to the feature set. This leads to robust feature sets, whose labels are always representative of the aggregated data. The second method is recent, where the label is obtained by observing the state of the system at the most recent time point in the time window (the last time point). This could correspond to a certain fault type or could be "healthy". This approach could be less robust, for instance when a feature set that is mostly "healthy" is labeled as faulty, but has the advantage of leading to more responsive fault detection, as faults are detected as soon as they are encountered, without having to look at the state over the last 60 seconds.
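The two strategies can be expressed in a few lines; here labels is the per-second sequence of fault labels falling inside one aggregation window (a sketch of the idea, not the exact dataset-processing code).

```python
from collections import Counter

def label_mode(labels):
    """Assign the label that occurs most often within the window."""
    return Counter(labels).most_common(1)[0][0]

def label_recent(labels):
    """Assign the label of the most recent time point in the window."""
    return labels[-1]

window = ["healthy"] * 2 + ["leak"] * 5 + ["healthy"] * 2
print(label_mode(window), label_recent(window))   # -> leak healthy
```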
Detection System Architecture

The fault detection system we propose in this paper is based on an architecture containing an array of classifiers, as shown in Figure 8. A classifier handles a specific computing resource type in the node, such as a CPU core, GPU or MIC. Each classifier is trained with the feature sets of all the resource units of the corresponding type, and is able to perform fault diagnoses for all of them, thus detecting faults both at the node level and at the resource level (e.g., dial and ddot). This is possible as the feature sets of each classifier contain all node-level metrics for the system as well as the resource-specific metrics for the resource unit. Since a feature set contains data from only one resource type, the proposed approach allows us to limit its size, which in turn improves performance and detection accuracy. The key assumption of this approach is that resource units of the same type behave similarly and that the respective feature sets can be combined in a coherent set. It is nevertheless possible to use a separate classifier for each resource unit of the same type, without altering the feature sets themselves, should this assumption prove to be too strong. In our case, the computing nodes have only CPU cores. Therefore, we train one classifier with feature sets that contain both node-level and core-level data, for one core at a time.

The classifiers' training can be performed offline, using labeled data resulting from normal system operation or from fault injection, as we do in this paper. The trained classifiers can then be deployed to detect faults on new, streaming data. By design, our architecture can detect only one fault at any time. If two faults happen simultaneously, the classifier would detect the fault whose effects on the system are deemed more disruptive for the normal, "healthy" state. In this preliminary study, this design choice is reasonable, as our purpose is to automatically distinguish between different fault conditions. Although unlikely, multiple faults could affect the same compute node of an HPC system at the same time. Our approach could be extended to deal with this situation by devising a classifier that does not produce a single output, but rather a composite output, for instance a distribution or a set of probabilities, one for each possible fault type.
Experimental Results

In this section, we present the results of our experiments, whose purpose was to correctly detect which of the 8 faults (as described in Section 5.2) were injected in the Antarex HPC node at any point in time in the Antarex dataset. Along the way, we compared a variety of classification models and the two labeling methods discussed in Section 6, assessed the impact of ambiguous feature sets, estimated the most important metrics for fault detection, and finally evaluated the overhead of our detection method. For each classifier, both the training and test sets are built from the feature set of one randomly-selected core for each time point. We evaluated the classifiers using 5-fold cross-validation, with the average F-score as the performance metric (the F-score is the harmonic mean between precision and recall). The software environment we used is Python 3.4 with the Scikit-learn package [32].

Although data shuffling is a standard technique with proven advantages in machine learning, it is not well suited to the fault detection method we propose in this paper. This is because our design is tailored for online systems, where classifiers are trained using only continuous, streamed, and potentially unbalanced data as it is acquired, while ensuring robustness in training so as to detect faults in the near future. Hence, it is very important to assess the detection accuracy without data shuffling. We reproduced this realistic, online scenario by performing cross-validation on the Antarex dataset using feature sets in time-stamp order. Time-stamp ordering produces cross-validation folds that contain data from specific time intervals. We depict this effect in Figure 9. In any case, we used shuffling in a small subset of the experiments for comparative purposes.

Figure 9: The effect of using time-stamp (a) or shuffled (b) ordering to create the data folds for cross-validation. Blocks with similar color represent feature sets from the same time frame.
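The two evaluation settings can be reproduced with scikit-learn as sketched below, with random placeholders for the feature sets and labels; macro-averaged F1 is used as a stand-in for the average F-score.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Placeholders for the feature sets (in time-stamp order) and their labels.
rng = np.random.default_rng(0)
X = rng.random((500, 30))
y = rng.integers(0, 9, 500)          # 8 fault classes plus "healthy"

clf = RandomForestClassifier(n_estimators=100)

# Online-like evaluation: each fold is a contiguous time interval (no shuffling).
ordered = cross_val_score(clf, X, y, cv=KFold(n_splits=5, shuffle=False),
                          scoring="f1_macro")

# Conventional evaluation: feature sets are shuffled before folding.
shuffled = cross_val_score(clf, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0),
                           scoring="f1_macro")

print(ordered.mean(), shuffled.mean())
```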
We first present results associated with different classifiers, using feature sets in time-stamp order and the mode labeling method. As classification models, we opted for a Random Forest (RF), Decision Tree (DT), Linear Support Vector Classifier (SVC) and Neural Network (NN) with two hidden layers, each having 1000 neurons. A preliminary empirical evaluation revealed that this neural network architecture, as well as the other models, are well suited for our purposes and thus provide good results. Even though we also considered using more complex models, such as Long Short-Term Memory (LSTM) neural networks, we ultimately decided to employ comparable, general-purpose models. This allows us to broaden the scope of our study, evaluating the impact of factors such as data normalization, shuffling and ambiguous system states on fault classification. On the other hand, LSTM-like models have more restrictive training requirements and are more tailored for regression tasks. Finally, since the Antarex dataset was acquired by injecting faults lasting a few minutes each, such a model would not benefit from long-term correlations in system states, where models like RF are capable of near-optimal performance.

The results of each classifier and each fault type are shown in Figure 10, with the overall F-score highlighted. We notice that all classifiers have very good performance, with the overall F-scores well above 0.9. RF is the best classifier, with an overall F-score of 0.98, followed by NN and SVC scoring 0.93. The most difficult fault types to detect for all classifiers are the pagefail and ioerr faults, which have substantially worse scores than the others.

We infer from these results that an RF would be the ideal classifier for an online fault detection system, due to its detection accuracy, which is at least 5% higher than that of the other classifiers in terms of the overall F-score. Additionally, random forests are computationally efficient [26], and would therefore be suitable for use in online environments with strict overhead requirements. As an additional advantage, unlike the NN and SVC classifiers, RF and DT do not require data normalization. Normalization in an online environment is hard to achieve, as many metrics do not have well-defined upper bounds. Although a rolling window-based dynamic normalization approach can be used [20] to solve this problem, such a method is unfeasible for ML-based classification, as it can lead to quickly-degrading detection accuracy and to the necessity of frequent re-training. For all these reasons, we will show only the results of an RF classifier in the following.

Figure 10: Classification results on the Antarex dataset, using all feature sets in time-stamp order, the mode labeling method, and different classification models: (a) Random Forest; (b) Decision Tree; (c) Neural Network; (d) Support Vector Classifier.

Figure 11: RF classification results, using all feature sets in time-stamp or shuffled order, with the mode and recent labeling methods: (a) mode labeling; (b) recent labeling; (c) mode labeling with shuffling; (d) recent labeling with shuffling.
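For reference, the four models compared above map onto scikit-learn as follows; apart from the two 1000-neuron hidden layers stated in the text, all hyperparameters are library defaults and therefore only indicative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

classifiers = {
    "RF":  RandomForestClassifier(),
    "DT":  DecisionTreeClassifier(),
    "SVC": LinearSVC(),                                     # linear support vector classifier
    "NN":  MLPClassifier(hidden_layer_sizes=(1000, 1000)),  # two hidden layers of 1000 neurons
}

# Each model can then be evaluated with the cross-validation setup shown earlier,
# e.g. cross_val_score(model, X, y, cv=..., scoring="f1_macro").
```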
Next, we present the results of the two labeling methods described in Section 6. Figures 11a and 11b report the classification results without data shuffling for, respectively, the mode and the recent labeling. The overall F-scores are 0.98 and 0.96, close to the optimal values. Once again, in both cases the ioerr and pagefail faults are substantially more difficult to detect than the others. This is probably due to the intermittent nature of both of these faults, whose effects depend on the hard drive I/O (ioerr) and memory allocation (pagefail) patterns of the underlying applications.

We observe an interesting behavior with the copy fault program, which gives a worse F-score when using the recent method in Figure 11b. As shown in Section 7.4, a metric related to the read rate of the hard drive used in our experiments (time_read_rate_der_perc95) is deemed important by the DT model for distinguishing faults, and we assume it is useful for detecting hard drive faults in particular, since it has no use for CPU-related faults. However, this is a comparatively slowly-changing metric. For this reason, a feature set may be labeled as copy as soon as the program is started, before the metric has been updated to reflect the new system status. This in turn makes classification more difficult and leads to degraded accuracy. This leads us to conclude that recent may not be well suited for faults whose effects cannot be detected immediately.

Figures 11c and 11d report the detection accuracy for the mode and recent methods, this time obtained after having shuffled the data for the classifier training phase. As expected, data shuffling markedly increases the detection accuracy for both labeling methods, reaching an almost optimal F-score for all fault types and an overall F-score of 0.99. A similar performance boost with data shuffling was obtained also with the remaining classification models introduced in Section 7.1 (the results are not reported here, since they would not add any insight to the experimental analysis). We notice that the recent labeling has a slightly higher detection rate, especially for some fault types. The reason for this improvement most likely lies in the highly reactive nature of this labeling method, as it can capture status changes faster than mode. Another interesting observation is that adding data shuffling grants a larger performance boost to the recent labeling compared to the mode labeling. This happens because the recent method is more sensitive to temporal correlations in the data, which in turn may lead to classification errors. Data shuffling destroys the temporal correlations in the training set and thus improves detection accuracy.

Figure 12: RF classification results on the Antarex dataset, using only non-ambiguous feature sets in (a) time-stamp and (b) shuffled order.
Here we give insights on the impact of ambiguous feature sets in the dataset on the classification process, by excluding them from the training and test sets. As shown in Figure 12, the overall F-scores are above 0.99 both without shuffling (Figure 12a) and with it (Figure 12b), leading to a slightly better classification performance compared to keeping the ambiguous feature sets in the dataset. Around 20% of the feature sets of the Antarex dataset are ambiguous; given this relatively large proportion, the performance gap described above is small, which demonstrates the robustness of our detection method.
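As a rough illustration, ambiguous feature sets could be flagged and filtered as sketched below, under our simplifying assumption that a feature set is ambiguous whenever its aggregation window mixes different system states; the toy data structures are not part of our actual pipeline.

def is_ambiguous(window_labels):
    # A window is ambiguous if it spans a state change (e.g., healthy -> faulty).
    return len(set(window_labels)) > 1

# Toy feature sets: (feature vector, states seen during the 60-second window).
feature_sets = [
    ([0.1, 0.3], ["healthy", "healthy"]),
    ([0.7, 0.9], ["healthy", "cpufreq"]),    # state change -> ambiguous
    ([0.8, 0.9], ["cpufreq", "cpufreq"]),
]
clean = [(f, labels[0]) for f, labels in feature_sets if not is_ambiguous(labels)]
print(len(clean))   # -> 2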
In general, the proportion of ambiguous feature sets in a dataset depends primarily on the length of the aggregation window and on the frequency of state changes in the HPC system. More feature sets become ambiguous as the length of the aggregation window increases, with increasingly adverse effects on classification accuracy. As a practical guideline, we therefore advise using a short aggregation window, such as the 60-second window employed here, to perform online fault detection.
A more concrete example of the behavior of ambiguous feature sets can be seen in Figure 13, where we show the scatter plots of two important metrics (as we will discuss in Section 7.4) for the feature sets related to the ddot, cpufreq and memeater fault programs, respectively. The "healthy" points, marked in blue, and the fault-affected points, marked in orange, are distinctly clustered in all cases. The points representing the ambiguous feature sets, marked in green, are instead sparse and often fall right between the "healthy" and faulty clusters. This is particularly evident with the cpufreq fault program in Figure 13b.
Figure 13: Scatter plots of two important metrics, as quantified by a DT classifier, for three fault types: (a) ddot, (b) cpufreq, (c) memeater. "Healthy" points are marked in blue, fault-affected points in orange, and points related to ambiguous feature sets in green.
A crucial aspect in fault detection is understanding which metrics matter most for detection accuracy. Identifying them can help reduce the amount of collected data, and hence the number of hardware measuring components or software tools that could create additional overhead. Tuncer et al. [35] showed that methods such as principal component analysis (and other methods that rely exclusively on the variance in the data) may discard certain important metrics. An RF classifier, on the other hand, tends to report the same metric as relevant many times, with different statistical indicators; this is caused by its ensemble nature (a random forest is composed of a collection of decision trees) and by the subtle differences among the estimators that compose it. A DT classifier instead provides this information naturally, as the most relevant metrics are those in the top layer of the decision tree: decision trees are built by recursively splitting the data into subsets according to the values of the metrics (attributes in DT terminology), and the attributes providing the highest information gain, i.e., the most relevant ones, are selected first by standard DT training algorithms. We therefore trained a DT classifier on the Antarex dataset.
The results are listed in Table 2, along with their source LDMS plug-ins. Some of the metrics are per-core, while the others are system-wide. We notice that metrics from most of the available plug-ins are used, and that some of these metrics can be directly associated with the behavior of the faults. For instance, the metric related to context switches (context switches der perc25) is tied to the dial and ddot programs, as CPU interference generates an anomalous number of context switches. In general, first-order derivatives (marked with the "der" suffix) are widely used by the classifier, which suggests that computing them is indeed useful for fault detection. On the contrary, more complex statistical indicators such as the skewness and kurtosis do not appear among the most relevant. This suggests that simple features are sufficient for machine learning-driven fault detection on HPC nodes.
Table 2: The most important metrics for fault detection, obtained via a DT classifier.
       Source        Name
  1.   procsensors   cpu freq perc50
  2.   meminfo       active der perc5
  3.   perfevent     hw cache misses perc50
  4.   vmstat        thp split std
  5.   vmstat        nr active file der perc25
  6.   perfevent     hw branch instructions perc95
  7.   meminfo       mapped der avg
  8.   meminfo       nr anon pages der perc95
  9.   procstat      sys der min
 10.   vmstat        nr dirtied der std
 11.   meminfo       kernelstack perc50
 12.   vmstat        numa hit perc5
 13.   procstat      processes der std
 14.   procstat      context switches der perc25
 15.   procstat      procs running perc5
 16.   finj          allocated perc50
 17.   vmstat        nr free pages der min
 18.   diskstats     time read rate der perc95
 19.   vmstat        nr kernel stack der max
 20.   perfevent     hw instructions perc5
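A ranking of this kind could be approximated with scikit-learn [32] as sketched below, using impurity-based feature importances as a proxy for the "top layer" inspection described above. The synthetic X, y and metric_names are stand-ins for the Antarex feature sets and are not part of our actual analysis.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((1000, 30))                    # stand-in feature sets
y = rng.integers(0, 9, 1000)                  # stand-in labels (healthy + 8 faults)
metric_names = [f"metric_{i}" for i in range(X.shape[1])]   # placeholder names

dt = DecisionTreeClassifier(criterion="entropy")   # split on information gain
dt.fit(X, y)

# Metrics used high in the tree accumulate the largest importance scores.
ranking = np.argsort(dt.feature_importances_)[::-1][:20]
for rank, idx in enumerate(ranking, start=1):
    print(f"{rank:2d}. {metric_names[idx]}  {dt.feature_importances_[idx]:.3f}")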
Finally, a critical consideration for assessing the feasibility of a fault detection approach is its overhead, especially if the final target is deployment in a real HPC system. LDMS is proven to have a low overhead at high sampling rates [1]. We also assume that the generation of feature sets and the classification are performed locally on each node (on-edge computation), and that only the resulting fault diagnoses are sent externally, which greatly decreases the network communication requirements and overhead. Under these assumptions, the hundreds of performance metrics used to train the classification models do not need to be sampled and streamed at a fine granularity. Generating a set of feature sets, one for each core in the test node, at a given time point for an aggregation window of 60 seconds requires on average 340 ms using a single thread. This time includes the I/O overhead of reading and parsing the LDMS CSV files and of writing the output feature sets. RF classifiers are very efficient: classifying a single feature set as faulty or not requires 2 ms on average. In total, 342 ms are therefore needed to generate and classify a feature set using a single thread and a 60-second aggregation window, which is more than acceptable for online use and practically negligible. Furthermore, the overhead should be much lower in a real HPC system with direct in-memory access to the streamed data, as a significant fraction of the overhead in our simulation is due to file system I/O operations on the CSV files containing the data. Additionally, the individual statistical features are independent of each other, which means the data can be processed in parallel using multiple threads to further reduce latency and to ensure load balancing across CPU cores, a critical aspect for preventing application slowdown induced by fault detection.
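The sketch below illustrates the kind of per-window computation whose cost is discussed above: statistical features, including first-order derivatives, are derived from the samples of a 60-second window and could then be passed to the trained classifier. The metric names, the number of metrics and the timing harness are illustrative assumptions, not our actual implementation.

import time
import numpy as np

def build_feature_set(window):
    # window: dict mapping each metric to a 1D array of samples from the last 60 s.
    feats = []
    for samples in window.values():
        der = np.diff(samples)                     # first-order derivative ("der")
        for series in (samples, der):
            feats.extend([series.mean(), series.std(), series.min(), series.max(),
                          np.percentile(series, 5), np.percentile(series, 25),
                          np.percentile(series, 50), np.percentile(series, 95)])
    return np.array(feats)

rng = np.random.default_rng(0)
window = {f"metric_{i}": rng.random(60) for i in range(160)}   # 60 samples per metric

start = time.perf_counter()
features = build_feature_set(window)
# classifier.predict(features.reshape(1, -1))   # trained RF, omitted in this sketch
print(f"{(time.perf_counter() - start) * 1000:.1f} ms for {features.size} features")

Since the features of different metrics are computed independently, the loop over metrics could be distributed across threads or processes, which is the parallelization opportunity mentioned above.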
Conclusions
We have studied an ML approach to online fault classification in HPC systems. Our study provides three contributions to the state of the art in resiliency research for HPC systems. The first is FINJ, a fault injection tool that allows for the automation of complex experiments and for reproducing anomalous behaviors in a deterministic, simple way. FINJ is implemented in Python and has no dependencies for its core operation. This, together with the simplicity of its command-line interface, makes the deployment of FINJ on large-scale systems trivial. Since FINJ is based on tasks, which are external executable programs, users can integrate the tool with any existing lower-level fault injection framework that can be triggered in such a way, ranging from the application level down to the kernel or even hardware level. The use of workloads in FINJ also makes it possible to reproduce complex, specific fault conditions and to reliably perform experiments involving multiple nodes at the same time.
The second contribution is the Antarex fault dataset, which we generated using FINJ to enable training and evaluation of our supervised ML classification models, and which we have described extensively. Both FINJ and the Antarex dataset are publicly available to facilitate resiliency research in the HPC systems field. The third contribution is a classification method intended for streamed, online data obtained from a monitoring framework, which is processed and fed to classifiers. The experimental results of our study show almost perfect classification accuracy for all fault types injected by FINJ, with negligible computational overhead for HPC nodes. Moreover, our study reproduces the operating conditions that could be found in a real online system, in particular those related to ambiguous system states and data imbalance in the training and test sets.
As future work, we plan to deploy our fault detection method in a large-scale, real HPC system. This will involve facing a number of new challenges. We need to develop tools to aid online training of machine learning models, as well as to integrate our method in a monitoring framework such as Examon [4]. We also expect to study our approach in online scenarios and adapt it where necessary. For instance, we need to characterize the scalability of FINJ and extend it with the ability to build workloads in which the order of tasks is defined by causal relationships rather than time-stamps, which might simplify the triggering of extremely specific anomalous states in a given system. Since training is performed before HPC nodes move into production (e.g., in a test environment), we also need to characterize how often re-training is needed and devise a procedure for performing it. Finally, we plan to make our detection method robust against faults that were unknown during the training phase, preventing their misclassification, and we expect to evaluate specialized models such as LSTM neural networks in light of the general results obtained in this study.
Acknowledgements
A. Netti has been supported by the EU project Oprecomp - Open Transprecision Computing (grant agreement 732631). A. Sîrbu has been partially funded by the EU project SoBigData Research Infrastructure - Big Data and Social Mining Ecosystem (grant agreement 654024). We thank the Integrated Systems Laboratory of ETH Zurich for granting us control of their Antarex HPC node during this study.
References
[1] A. Agelastos, B. Allan, J. Brandt, P. Cassella, et al. The lightweight distributed metric service: a scalable infrastructure for continuous monitoring of large scale computing systems and applications. In Proc. of SC 2014, pages 154–165. IEEE, 2014.
[2] S. Ashby, P. Beckman, J. Chen, P. Colella, et al. The opportunities and challenges of exascale computing. Summary Report of the Advanced Scientific Computing Advisory Committee (ASCAC) Subcommittee, pages 1–77, 2010.
[3] E. Baseman, S. Blanchard, N. DeBardeleben, A. Bonnie, et al. Interpretable anomaly detection for monitoring of high performance computing systems. In Proc. of the ACM SIGKDD Workshops 2016, 2016.
[4] F. Beneventi, A. Bartolini, C. Cavazzoni, and L. Benini. Continuous learning of HPC infrastructure models using big data analytics and in-memory processing tools. In Proc. of DATE 2017, pages 1038–1043. IEEE, 2017.
[5] K. Bergman, S. Borkar, D. Campbell, W. Carlson, et al. Exascale computing study: Technology challenges in achieving exascale systems. DARPA IPTO, Tech. Rep., 15, 2008.
[6] P. Bodik, M. Goldszmidt, A. Fox, D. B. Woodard, et al. Fingerprinting the datacenter: automated classification of performance crises. In Proc. of EuroSys 2010, pages 111–124. ACM, 2010.
[7] R. Bolze, F. Cappello, E. Caron, M. Daydé, et al. Grid'5000: A large scale and highly reconfigurable experimental grid testbed. The International Journal of High Performance Computing Applications, 20(4):481–494, 2006.
[8] J. Calhoun, L. Olson, and M. Snir. FlipIt: An LLVM based fault injector for HPC. In Proc. of Euro-Par 2014, pages 547–558. Springer, 2014.
[9] F. Cappello, A. Geist, W. Gropp, S. Kale, et al. Toward exascale resilience: 2014 update. Supercomputing Frontiers and Innovations, 1(1):5–28, 2014.
[10] D. Capps and W. Norcott. IOzone filesystem benchmark, 2008.
[11] I. Cohen, J. S. Chase, M. Goldszmidt, T. Kelly, et al. Correlating instrumentation data to system states: A building block for automated diagnosis and control. In OSDI, volume 4, pages 16–16, 2004.
[12] R. Coker. Bonnie++ file-system benchmark, 2012.
[13] N. DeBardeleben, S. Blanchard, Q. Guan, Z. Zhang, et al. Experimental framework for injecting logic errors in a virtual machine to profile applications for soft error resilience. In Proc. of Euro-Par 2011, pages 282–291. Springer, 2011.
[14] J. J. Dongarra, J. D. Cruz, S. Hammarling, and I. S. Duff. Algorithm 679: A set of level 3 basic linear algebra subprograms: model implementation and test programs. ACM Transactions on Mathematical Software (TOMS), 16(1):18–28, 1990.
[15] J. J. Dongarra, P. Luszczek, and A. Petitet. The LINPACK benchmark: past, present and future. Concurrency and Computation: Practice and Experience, 15(9):803–820, 2003.
[16] C. Engelmann and S. Hukerikar. Resilience design patterns: A structured approach to resilience at extreme scale. Supercomputing Frontiers and Innovations, 4(3), 2017.
[17] K. B. Ferreira, P. Bridges, and R. Brightwell. Characterizing application sensitivity to OS interference using kernel-level noise injection. In Proc. of SC 2008, page 19. IEEE Press, 2008.
[18] A. Gainaru and F. Cappello. Errors and faults. In Fault-Tolerance Techniques for High-Performance Computing, pages 89–144. Springer, 2015.
[19] Q. Guan, C.-C. Chiu, and S. Fu. CDA: A cloud dependability analysis framework for characterizing system dependability in cloud computing infrastructures. In Proc. of PRDC 2012, pages 11–20. IEEE, 2012.
[20] Q. Guan and S. Fu. Adaptive anomaly identification by exploring metric subspace in cloud computing infrastructures. In Proc. of SRDS 2013, pages 205–214. IEEE, 2013.
[21] H. S. Gunawi, T. Do, P. Joshi, P. Alvaro, et al. FATE and DESTINI: A framework for cloud recovery testing. In Proc. of NSDI 2011, page 239, 2011.
[22] M.-C. Hsueh, T. K. Tsai, and R. K. Iyer. Fault injection techniques and tools. Computer, 30(4):75–82, 1997.
[23] W. M. Jones, J. T. Daly, and N. DeBardeleben. Application monitoring and checkpointing in HPC: looking towards exascale systems. In Proc. of ACM-SE 2012, pages 262–267. ACM, 2012.
[24] P. Joshi, H. S. Gunawi, and K. Sen. PREFAIL: A programmable tool for multiple-failure injection. In ACM SIGPLAN Notices, volume 46, pages 171–188. ACM, 2011.
[25] D. Kondo, B. Javadi, A. Iosup, and D. Epema. The failure trace archive: Enabling comparative analysis of failures in diverse distributed systems. In Proc. of CCGRID 2010, pages 398–407. IEEE, 2010.
[26] B. Lakshminarayanan, D. M. Roy, and Y. W. Teh. Mondrian forests: Efficient online random forests. In Advances in Neural Information Processing Systems, pages 3140–3148, 2014.
[27] C. Lameter. NUMA (non-uniform memory access): An overview. Queue, 11(7):40, 2013.
[28] Z. Lan, Z. Zheng, and Y. Li. Toward automated anomaly identification in large-scale systems. IEEE Transactions on Parallel and Distributed Systems, 21(2):174–187, 2010.
[29] P. R. Luszczek, D. H. Bailey, J. J. Dongarra, J. Kepner, et al. The HPC Challenge (HPCC) benchmark suite. In Proc. of SC 2006, volume 213. ACM, 2006.
[30] J. McCalpin. STREAM: Sustainable memory bandwidth in high performance computers, 2006.
[31] T. Naughton, W. Bland, G. Vallee, C. Engelmann, et al. Fault injection framework for system resilience evaluation: fake faults for finding future failures. In Proc. of Resilience 2009, pages 23–28. ACM, 2009.
[32] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830, 2011.
[33] M. Snir, R. W. Wisniewski, J. A. Abraham, et al. Addressing failures in exascale computing. The International Journal of High Performance Computing Applications, 28(2):129–173, 2014.
[34] D. T. Stott, B. Floering, D. Burke, Z. Kalbarczpk, et al. NFTAPE: a framework for assessing dependability in distributed systems with lightweight fault injectors. In Proc. of IPDS 2000, pages 91–100. IEEE, 2000.
[35] O. Tuncer, E. Ates, Y. Zhang, A. Turk, et al. Online diagnosis of performance variation in HPC systems using machine learning. IEEE Transactions on Parallel and Distributed Systems, 2018.
[36] O. Villa, D. R. Johnson, M. Oconnor, E. Bolotin, et al. Scaling the power wall: a path to exascale. In Proc. of SC 2014, pages 830–841. IEEE, 2014.
[37] C. Wang, V. Talwar, K. Schwan, and P. Ranganathan. Online detection of utility cloud anomalies using metric distributions. In