MOARD: Modeling Application Resilience to Transient Faults on Data Objects
Luanzheng Guo
EECS, University of California, [email protected]
Dong Li
EECS, University of California, [email protected]
Abstract—Understanding application resilience (or error tolerance) in the presence of hardware transient faults on data objects is critical to ensure computing integrity and enable efficient application-level fault tolerance mechanisms. However, we lack a method and a tool to quantify application resilience to transient faults on data objects. The traditional method, random fault injection, cannot help, because it loses data semantics and provides insufficient information on how and where errors are tolerated. In this paper, we introduce a method and a tool (called "MOARD") to model and quantify application resilience to transient faults on data objects. Our method is based on systematically quantifying error masking events caused by application-inherent semantics and program constructs. We use MOARD to study how and why errors in data objects can be tolerated by the application. We demonstrate tangible benefits of using MOARD to direct a fault tolerance mechanism to protect data objects.
I. INTRODUCTION
Transient faults due to high energy particle strikes, wear-out, etc. are expected to become a critical contributor to in-field system failures of high performance computing (HPC). If those faults manifest in architecturally visible states (e.g., registers and the memory) and those states hold values of a data object, then we have transient faults on the data object. Transient faults on a data object impact application outcome correctness. Understanding application resilience to transient faults on data objects is critical to ensure computing integrity in future large scale systems.

Furthermore, many common application-level fault tolerance mechanisms focus on data objects. Understanding application resilience to transient faults on data objects can be helpful to direct those mechanisms. Application-level checkpoint is an example of such an application-level fault tolerance mechanism. By periodically saving correct values of some data objects into persistent storage, application-level checkpoint makes the application resumable when a failure happens. Some algorithm-based fault tolerance methods [1], [2] are other examples. They can detect and locate errors in specific data objects. However, those application-level fault tolerance mechanisms can be expensive (e.g., 35% performance overhead in [3]). If data corruptions of a data object are easily tolerable by the application, then we do not need to apply those mechanisms to protect the data object, which will improve performance and energy efficiency. Hence, understanding application resilience to transient faults on data objects is useful to direct those application-level fault tolerance mechanisms.

However, we do not have a method or a tool to quantify application resilience to transient faults on data objects. The current common practice to understand application resilience to transient faults in HPC is application-level random fault injection (RFI) [4], [5], [6].
Although RFI is useful, it cannot study application resilience to transient faults on data objects for the following two reasons. First, RFI loses application semantics (data semantics). RFI randomly selects instructions and triggers random bit flips in input or output operands of the instructions. Typically, RFI performs a large number of random fault injection tests, and then calculates how many of them succeed (i.e., have correct application outcomes). However, we do not know which data object the data value corrupted by RFI belongs to. Second, RFI gives us little knowledge of how and where errors are tolerated [7]. Understanding "how" and "where" is necessary to identify why the application is vulnerable to the value corruption of some data objects, and provides feedback on how to apply application-level fault tolerance mechanisms effectively and efficiently.

In this paper, we introduce a method to model and quantify application resilience to transient faults on data objects. Our method is based on an observation that application resilience to transient faults on data objects is mainly due to application-inherent semantics and program constructs. For example, a corrupted bit in a data structure could be overwritten by an assignment operation, hence does not cause an outcome corruption; a corrupted bit of a molecular representation in a Monte Carlo method-based simulation may not matter to the application outcome because of the statistical nature of the simulation. Based on the above observation, quantifying application resilience to transient faults on data objects is equivalent to quantifying error masking events caused by application-inherent semantics and program constructs, and associating those events with data objects.
By analyzing application execution information (e.g., the architecture-independent LLVM [8] IR trace), we can accurately capture those error masking events, and provide insightful analysis on how and where an error tolerance happens. Furthermore, by analyzing application execution information, we can use memory addresses of data objects and track register allocation to associate data values in registers and memory with data objects. Such a method introduces data semantics into the analysis.

Quantifying application resilience to transient faults on data objects must address a couple of research problems. First, we have little knowledge of the characteristics of error masking events. This creates a major obstacle to recognizing those events and achieving analytical quantification. Second, we do not have a good metric to make the quantification. Simply counting the number of error masking events cannot provide a meaningful quantification, because the number can be accumulated throughout application execution. The fact that a data object has many error masking events does not necessarily mean that the application is resilient to the value corruption of the data object, because those events may be only a small portion of the total operations on data objects. Third, determining the impact of an error occurrence on the correctness of the application outcome is challenging. The error can propagate to many data objects. Tracking all of those errors for analysis is prohibitive. In addition, an error may not impact the correctness of the application outcome because of algorithm semantics in the application. However, recognizing algorithm semantics requires detailed application domain knowledge, which is prohibitive for common users.

Based on the method of quantifying error masking events, we systematically model and quantify application resilience to transient faults on data objects, and address the above problems.
We first characterize error masking events and classify them into three classes: operation-level error masking, error masking during error propagation, and algorithm-level error masking. We further introduce a metric that quantifies how often error masking happens. Based on this metric, the comparison of application resilience to transient faults between different data objects is more meaningful than simply counting error masking events. Our classification of error masking events and the proposed metric are fundamental, because they lay a foundation not only for modeling application resilience to transient faults on data objects, but also for other research, such as the placement of error detectors [9] and application checkpoint [10].

Based on our classification and metric, we introduce a model. Given a data object, our model examines operations in the dynamic instruction trace. For each operation that consumes elements of the data object, the model makes the following inference: if an element consumed by the operation has an error, will the application outcome remain correct? The inference procedure of the model includes three practical techniques to recognize the three classes of error masking events: (1) detecting operation-level error masking based on operation semantics, (2) tracking error propagation with a limited propagation length for analysis, and (3) detecting algorithm-level error masking based on deterministic fault injection. For (2), limiting the propagation length is a technique based on the characterization of error propagation.
This technique does not impact our conclusion on error masking while avoiding expensive analysis; for (3), the deterministic fault injection treats the application as a black box without requiring detailed application domain knowledge.

In summary, this paper makes the following contributions:
• A systematic method and a metric to analytically model application resilience to transient faults on data objects, which is unprecedented;
• A comprehensive classification of error masking events, and methods to recognize them;
• An open-sourced system tool, MOARD [11], to model application resilience to transient faults on data objects;
• An evaluation of representative computational algorithms and two scientific applications to reveal how application-level error masking typically happens on data objects;
• A case study to demonstrate the benefit of using a model-driven approach to direct error tolerance designs.

II. BACKGROUND
In this section, we introduce the fault model and give an introductory description of application-level error masking.
A. Fault Model
We focus on transient faults that change the values of data objects. Those faults are not corrected by hardware (e.g., ECC), propagate through higher levels of the system, and become observable to the application [12].

In terms of application resilience in the existence of corrupted data values, we focus on application outcome correctness. The application outcome is deemed correct as long as it is acceptable. Depending on the notion of acceptance, outcome correctness can refer to precise numerical integrity (e.g., the outcome of a multiplication operation must be numerically precise) or to satisfying a minimum fidelity threshold (e.g., the outcome of an iterative solver must meet certain convergence thresholds).
B. Error Masking
Error masking can happen at the application level and the hardware level. Application-level error masking happens because of application-inherent semantics and program constructs. Hardware-level error masking happens because a fault does not corrupt the precise semantics of hardware [13]. The key to our error tolerance modeling is application-level error masking. We particularly study error masking that happens to individual data objects. We consider, when an error happens in a data object (other data objects remain correct before the error happens), how the error impacts the application outcome correctness. A data object can be an array or another data structure with many data elements. Other than data objects, we do not consider the corruption of other application components (e.g., computing logic).
Hence, we do not aim to model the error tolerance of all application components but focus on data objects. In addition, we focus on errors happening in data objects and directly consumed by the application. Latent errors in data objects (i.e., errors not consumed by the application) are not considered because they do not matter to the application outcome correctness.

III. ERROR TOLERANCE MODELING
We start with a classification of application-level error masking and then introduce a modeling metric.

 1  void func(double *par_A, double *par_b,
 2            double *par_x)
 3  {
 4      double c = 0;
 5
 6      // Pre-processing par_A
 7      par_A[0] = sqrt(initInfo);
 8      c = par_A[2] * 2;
 9      if (c > THR)
10          par_A[4] = (int)c >> bits;  // bit shifting
11
12      // Using the algebraic multi-grid solver
13      AMG_Solver(par_A, par_b, par_x);
14  }

Fig. 1: The example code to show error masking that happens to a data object, par_A.

A. General Description
Error masking that happens to data objects has various representations. Listing 1 gives a synthetic example to illustrate those representations. In this example, we focus on a data object, par_A, which is an array. We study error masking that happens to this data object. We examine every statement in the example code. For each statement, we examine whether any element of the data object is involved. If yes, we examine, if there is a data corruption in the element, how the data corruption impacts the result correctness of the statement, and how the data corruption propagates to the successor statements, which in turn impact the application outcome correctness.

par_A is involved in 4 statements (Lines 7, 8, 10 and 13). The statement at Line 7 has an error masking event: if an error happens at par_A (in particular, the data element par_A[0], which is consumed by the statement), the error can be overwritten by an assignment operation, no matter which bit is flipped in par_A[0]. The statement at Line 8 has no explicit error masking. If an error at par_A[2] occurs, the error propagates to c by multiplication and assignment operations. If the error propagates to Line 10 (bit shifting), depending on which bit is corrupted at Line 8 and how many bits are shifted at Line 10, the corrupted bit can be thrown away or remain. If the corrupted bit is thrown away, then the error in par_A[2] propagating from Line 8 to Line 10 is indirectly masked at Line 10 (not directly masked at Line 8).

Line 13 is an invocation of an algebraic multi-grid solver (AMG) taking par_A as input. AMG treats par_A as a multi-dimensional grid and can tolerate certain data corruptions in the grid, because of the algorithm semantics of AMG (particularly, AMG's iterative structure that mitigates error magnitude and tolerates incorrectness of numerical results [14]).

This example reveals many interesting facts.
In essence, a program can be regarded as a combination of data objects and operations performed on the data objects. An operation (defined at the LLVM instruction level) refers to an arithmetic computation, assignment, logical or comparison instruction, or an invocation of an algorithm implementation. An operation may inherently come with error masking effects, exemplified at Line 7 (error overwriting); an operation may propagate errors, exemplified at Line 8. Different operations have different error masking effects, and hence impact the application outcome differently. Based on the above discussion, we classify application-level error masking into three classes.

(1) Operation-level error masking.
An error that happens to the target data object is masked because of the semantics of the operation. Line 7 in Listing 1 is an example.

(2)
Error masking during error propagation.
Some error masking events are implicit and have to be identified beyond a single operation. In particular, a corrupted bit in a data object is not masked in the current operation (e.g., Line 8 in Listing 1), but the error propagates to another data object (e.g., the variable c) and is masked in another operation (e.g., Line 10). Note that simply relying on isolated operation-level analysis without error propagation analysis is not sufficient to recognize these error masking events.

(3) Algorithm-level error masking.
Identification of some error masking events must include algorithm-level information. The identification of these events is beyond the first two classes. Examples of such events include the multigrid solver [14] and certain sorting algorithms [15]. Algorithm-level error masking can tolerate errors that happen to many variables. For example, the multigrid solver can tolerate certain errors in hundreds of variables [14]. The essence of algorithm-level error masking is typically an algorithm-specific definition of execution fidelity and specific program constructs that mitigate error magnitude during application execution [16]. Limited analysis of individual operations or error propagation is not sufficient to build up a big picture that captures algorithm-level fault tolerance.

Our modeling is analytical and relies on the quantification of the above error masking events on data objects. We create a metric to quantify those events.
B. aDVF: A New Metric
To quantify application resilience to transient faults on a data object, the key is to quantify how often error masking happens to the data object. We introduce a new metric, aDVF (the application-level Data Vulnerability Factor), to quantify application resilience to transient faults on data objects. aDVF is defined as follows.

For an operation with the participation of the target data object (maybe multiple data elements of the target data object), we reason whether, if an error happens to a participating data element of the target data object, the application outcome could remain correct in terms of the outcome value and algorithm semantics. If the error does not cause an incorrect application outcome, then an error masking event happens to the target data object. A single operation can operate on multiple data elements of the target data object. For example, an ADD operation can use two elements of the target data object as operands. For a specific operation, the aDVF of the target data object is defined as the total number of error masking events divided by the number of data elements of the target data object involved in the operation.

For example, suppose an assignment operation a[1] = w happens to a data object, the array a. This operation involves one data element (a[1]) of the target data object a. We calculate the aDVF for a in this operation as follows. If an error happens to a[1], we reason that the erroneous a[1] does not impact correctness of the application outcome and the error in a[1] is always masked (no matter which bit of a[1] is flipped). Hence, the number of error masking events for the target data object a in this operation is 1. Also, the total number of data elements involved in the operation is 1.
Hence, the aDVF value for the target data object in this assignment operation is 1/1.

Based on the above discussion, the definition of aDVF for a data object X in an operation (aDVF_op^X) is formulated in Equation 1, where x_i is a data element of the target data object X involved in the operation and m is the number of data elements involved in the operation; f is a function to count error masking events that can happen to a data element.

aDVF^{X}_{op} = \sum_{i=0}^{m-1} f(x_i) / m \qquad (1)

To calculate aDVF for a data object in a code segment, we examine operations in the code segment one by one. For each operation that involves any element of the target data object, we consider, if a transient fault happens to the element, how many error masking events can happen. In general, the definition of aDVF for a data object in a code segment is similar to the above definition for an operation, except that m is the number of data elements of X involved in all operations of the code segment. According to the above definition, a higher aDVF value for a data object indicates that the application is more resilient to transient faults on the data object; also, an aDVF value should be in [0, 1].

To further explain it, we use a code segment from the LU benchmark in the SNU NPB benchmark suite 1.0.3 (a C-based implementation of the Fortran-based NAS benchmark suite [17]), shown in Listing 2.
We calculate aDVF for the array sum[]. Statement A has an assignment operation involving one data element (sum[m]) and one error masking event (i.e., if an error happens to sum[m], the error is overwritten by the assignment). Considering that there are five iterations in the first loop (iter_num1 = 5), there are five error masking events happening to five data elements of sum[].

Statement B has two operations related to sum[] (an assignment and an addition). The assignment operation involves one data element (sum[m]) and has no error masking, because the new value is added to sum[m] (not overwriting it). The addition operation involves one data element (sum[m]) and may have one error masking event (i.e., certain corruptions in sum[m] can be ignored if (v[k][j][i][m] * v[k][j][i][m]) is significantly larger than sum[m]). This error masking is counted as r' (0 ≤ r' ≤ 1), depending on the corrupted bit position in sum[m] and the error propagation result (see Sections III-C and IV for further discussion). In the loop structure where Statement B is, there are (r' * iter_num2) error masking events that happen to (2 * iter_num2) elements of

[Footnote: If a data element is referenced multiple times in the code segment, this data element is counted multiple times in m.]

void l2norm(int ldx, int ldy, int ldz, int nx0,
            int ny0, int nz0, int ist, int iend,
            int jst, int jend,
            double v[][ldy/2*2+1][ldx/2*2+1][5],
            double sum[5])
{
    int i, j, k, m;

    for (m = 0; m < 5; m++)              // The first loop
        sum[m] = 0.0;                    // Statement A

    for (k = 1; k < nz0-1; k++)          // The second loop
        for (j = jst; j < jend; j++)
            for (i = ist; i < iend; i++)
                for (m = 0; m < 5; m++)
                    sum[m] = sum[m] + v[k][j][i][m]
                           * v[k][j][i][m];    // Statement B

    for (m = 0; m < 5; m++) {            // The third loop
        sum[m] = sqrt(sum[m] / ((nx0-2) *
                 (ny0-2) * (nz0-2)));    // Statement C
    }
}

Fig. 2:
A code segment from LU.

sum[], where r' comes from the addition operation, and iter_num2 is the number of iterations in the second loop.

Statement C has two operations related to sum[] (an assignment and a division), but only the assignment operation has error masking (overwriting). In the loop structure where Statement C is, there are five iterations (iter_num3 = 5). Hence, there are five error masking events that happen on five data elements of the target data object. In summary, the aDVF calculation for sum[] is

aDVF^{sum}_{op} = \frac{1 \cdot iter\_num_1 + r' \cdot iter\_num_2 + 1 \cdot iter\_num_3}{1 \cdot iter\_num_1 + (1+1) \cdot iter\_num_2 + (1+1) \cdot iter\_num_3}, \qquad (2)

where each term in the numerator is the number of error masking events in the first, second, and third loop, respectively; each term in the denominator is the number of target data elements involved in each loop; iter_num1 = 5, iter_num3 = 5, and iter_num2 = (nz0 − 2) * (jend − jst) * (iend − ist) * 5.

To calculate aDVF for a data object, we must rely on effective identification and counting of error masking events (i.e., the function f). In Sections III-C, III-D and III-E, we introduce a series of counting methods based on the classification of error masking events.

C. Operation-Level Analysis
To identify error masking events at the operation level, we analyze all possible operations. In particular, we analyze architecture-independent LLVM instructions and characterize their error tolerance based on operation semantics. We classify operation-level error masking as follows.

(1)
Value overwriting. An operation writes a new value into a data element of the target data object, and the error in the data element (no matter where the corrupted bit is in the data element) is masked. For example, the store operation overwrites the error in the store destination. We also include trunc and bit-shifting operations in this category, because the error could be truncated or shifted away in those operations.

[Footnote from Section III-B: The addition operation with the corrupted sum[m] can propagate the error to the assignment. This error propagation effect is included in r'.]

(2) Logical and comparison operations. If an error in the target data object does not change the correctness of logical and comparison operations, the error is masked. Examples of such operations include logical
AND and the predicate expression in a switch statement.

(3)
Value overshadowing. If the corrupted data value in an operand of an addition or subtraction operation is overshadowed by the other, correct operand involved in the operation, then the corrupted data can have an ignorable impact on the correctness of the application outcome. For example, suppose the data value "10" in an addition operation ("10e+6 + 10") is corrupted and the addition operation becomes "10e+6 + 11". Such data corruption may not matter to the application outcome, because the operand "10e+6" is much larger than the magnitude of the data corruption. We further discuss how the overshadowing effect is determined in Section IV.

The above three kinds of operation-level error masking impact the application outcome differently. Error masking based on value overwriting and on logical and comparison operations can make the application outcome numerically the same as the error-free case. Error masking based on value overshadowing can make the application outcome numerically different from or the same as the error-free case.

For value overshadowing, if the application outcome is numerically different, the application outcome can still be acceptable because of algorithm semantics; if the application outcome is numerically the same, operations after the value overshadowing must help tolerate corrupted bits. For the above two cases, we do not attribute error masking to the algorithm level or the error propagation level. Instead, we attribute it to operation-level value overshadowing, because value overshadowing initiates the error masking. Without value overshadowing, the algorithm or error propagation may not mask errors.

The effectiveness of the above error masking heavily relies on the error pattern.
The error pattern is defined by how erroneous bits are distributed within a corrupted data element (e.g., single-bit vs. spatial multiple-bit, least significant bit vs. most significant bit). Depending on where the erroneous bit is, the error in the data object could or could not be masked. Take as an example the bit shift operation (Line 10) in Listing 1. Depending on the error pattern, the shift operation can remove or keep the corrupted bit.

To determine the existence of the above (2) and (3) error masking, we must consider error patterns (i.e., the spatial aspect of errors [18]). In the practice of our resilience modeling, given an operation to analyze, we enumerate possible error patterns for the target data object. Then, we derive the existence of error masking for each error pattern without application execution. Suppose there are n error patterns and m (0 ≤ m ≤ n) of them have error masking. Then the number of error masking events is calculated as m/n, which is a statistical quantification of possible error masking. In the example of the bit shift (Line 10 in Listing 1), assuming that c is 64-bit and we consider single-bit errors, there are 64 error patterns. For each error pattern, we decide if the corrupted bit is shifted away. If 10 of the 64 fault patterns have the corrupted bit shifted away, then the number of error masking events for the data object c in this shift operation is 10/64.

D. Error Propagation Analysis
If we analyze a specific error pattern in an operation (named the "target error pattern" and "target operation" in the rest of this section) and determine that the error cannot be masked in the target operation, then we use error propagation analysis to capture error masking (i.e., the temporal aspect of errors [18]). Using a dynamic instruction trace as input, the error propagation analysis tracks whether the errors (including the original one and the new ones due to error propagation) are masked in the successor operations, based on the operation-level analysis and without application execution. If all of the errors are masked, and hence the application outcome remains numerically the same as the error-free case, then we claim that the original error in the target operation is masked.

For the error propagation analysis, a big challenge is to track all contaminated data, which can quickly increase as the error propagates. Tracking all the contaminated data significantly increases analysis time and memory usage. A solution to this challenge is deterministic fault injection. Different from random fault injection, deterministic fault injection injects an error at the target operation using the target error pattern and then runs the application to completion. If the application outcome is numerically the same as the error-free case, then the original and the new errors are masked, and error masking based on error propagation takes effect. If the application outcome is numerically different but still accepted, then algorithm-level error masking takes effect.

Because of deterministic fault injection, we do not need to analyze operations one by one to track data flow and error contamination. Hence it is faster. However, deterministic fault injection can still be time-consuming if the application execution time is long. To improve the efficiency of the error propagation analysis, we optimize the analysis based on the characteristics of error propagation.
Optimization: bounding propagation path.
We observe that tracking a limited number of operations (k operations) after the target operation is often sufficient to decide the existence of propagation-based error masking. Our observation is based on 1000 random fault injection tests on 16 data objects from eight benchmarks (see Table I for benchmark details). We observe that 87% of the fault injection tests that cannot mask errors within 10 operations (k = 10) after fault injection lead to numerically incorrect application outcomes; 100% of the fault injection tests that cannot mask errors within 50 operations (k = 50) after fault injection lead to numerically incorrect application outcomes. This fact indicates that errors that are not masked within a limited number of operations have little chance to be masked by further error propagation. The rationale supporting the above observation is as follows. An error in a data object typically propagates to a large amount of data (objects) quickly. After a certain number of operations, it is very unlikely that all errors are able to be masked by further error propagation, and concluding that there is no error masking by error propagation is correct in most cases.

Based on the above observation, we only need to track the first k operations after the target operation to determine the existence of propagation-based error masking. In particular, after analyzing k operations (k = 50 in our evaluation): (1) If not all errors due to error propagation are masked at the operation level, we conclude that the errors will not be masked at the operation level by further error propagation. But those errors may be masked by the algorithm (if the user wants to do algorithm-level analysis), pending further investigation. (2) If all errors due to error propagation are masked, and based on the operation-level analysis we can derive that the application outcome remains numerically correct, then we claim that error masking due to error propagation happens.

E. Algorithm-Level Analysis
Identifying algorithm-level error masking demands domain and algorithm knowledge. In our modeling, we want to minimize the usage of that knowledge, such that the modeling methodology can be general across different domains. The traditional random fault injection treats the program as a black box. Hence, the traditional random fault injection could be an effective tool to identify algorithm-level error masking. However, to avoid the randomness, we use deterministic fault injection again.

In particular, when we analyze a specific error pattern in a target operation and decide that the error cannot be masked in the target operation and the next k operations, we inject an error using the error pattern in the target operation and run the application to completion. If the application outcome is numerically different from the error-free case but acceptable in terms of algorithm semantics, then algorithm-level error masking takes effect. If the application outcome is numerically the same, then error masking due to error propagation happens, which should be rare based on the above discussion on "bounding propagation path".

Discussion: Although we employ deterministic fault injection, it cannot replace our modeling, for two reasons. First, the fault injection space without our modeling is typically huge (trillions of fault injection sites [7]), which is prohibitive for implementation. Second, deterministic fault injection tells us little about how an error is tolerated.

IV. IMPLEMENTATION
To calculate the aDVF value for a data object, we develop a tool, named MOARD (MOdeling Application Resilience to transient faults on Data objects). Figure 3 shows the tool framework and its algorithm. MOARD has three components: an application trace generator, a trace analysis tool, and a deterministic fault injector.

The application trace generator is an LLVM instrumentation pass to generate a dynamic LLVM IR trace. LLVM IR is architecture independent, and each instruction in the dynamic IR trace corresponds to one operation. We extend a trace generator [19] to enable trace generation for MPI applications. During the trace analysis, we consider error propagation by MPI communication, but do not consider those cases where errors happen in the communication itself.

The trace analysis tool is the core of MOARD. Using an application trace as input, the tool can calculate the aDVF value of any data object with a known memory address range. In particular, the trace analysis tool conducts the operation-level and error propagation analyses. For those unresolved analyses, the trace analysis tool outputs a set of fault injection information for the deterministic fault injection. Such information includes dynamic instruction IDs, IDs of the operands that reference the values of the target data object, and the bit locations of the operands that correspond to those error patterns with undetermined error masking. After the fault injection results (i.e., the numerical values of the application outcome and whether the outcome is acceptable) are available from the deterministic fault injector, we re-run the trace analysis tool and use the fault injection results to resolve the unresolved analyses and update the aDVF calculation.

For the error propagation analysis, we associate data semantics (the data object name) with the data values in registers, such that we can identify the data of the target data object in registers.
To associate data semantics with the data in registers, MOARD tracks the register allocation when analyzing the trace, such that we know at any moment which registers hold the data of the target data object.

To determine the existence of value overshadowing in an addition or subtraction operation, we use the deterministic fault injection. In particular, given a target operand in an addition or subtraction operation for value overshadowing analysis, we enumerate all error patterns for deterministic fault injection tests. If the following two conditions are true, then we derive that value overshadowing happens in the operation:

• Some error patterns result in small magnitudes of the operand (smaller than the magnitude of the other operand in the operation), and the application outcome is acceptable.

• The other error patterns result in larger magnitudes of the operand (larger than those in the first condition), but the application outcome is not acceptable.

The error masking of value overshadowing is quantified as x/y, where x is the number of error patterns in the first condition and y is the number of all error patterns. For example, suppose we have an addition operation (a + b, a = 1000 and b = 1), and b is our target data object. We enumerate error patterns in b (assuming 32 single-bit-flip error patterns). If five patterns result in values of b of 0, 3, 5, 9, and 17, which are smaller than a, and the application outcome is acceptable, and the other 27 patterns result in larger b (larger than 0, 3, 5, 9, and 17) but the application outcome is not acceptable, then value overshadowing happens (the corrupted b is overshadowed by a), and is quantified as 5/32.

The deterministic fault injector is a tool to resolve those error masking analyses left undetermined by the trace analysis tool. The input to the deterministic fault injector is a list of fault injection sites generated by the trace analysis tool.

Fig. 3: MOARD, a tool for modeling application resilience to transient faults on data objects.
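The value-overshadowing test described above can be sketched as follows. This is a simplified, hypothetical sketch: it assumes 32 single-bit-flip error patterns on an integer operand, and the `acceptable` callback is a toy stand-in for the application-outcome verdict that MOARD actually obtains by running the application to completion under deterministic fault injection.

```python
# Hypothetical sketch of the value-overshadowing test for b in (a + b).
# MOARD obtains the acceptability verdict by running the application to
# completion; here a toy threshold check stands in for that verdict.

def enumerate_bit_flips(value, bits=32):
    # All single-bit-flip error patterns applied to `value`.
    return [value ^ (1 << i) for i in range(bits)]

def overshadowing_fraction(a, b, outcome_acceptable):
    patterns = enumerate_bit_flips(b)
    small = [cb for cb in patterns
             if abs(cb) < abs(a) and outcome_acceptable(a, cb)]
    return len(small) / len(patterns)   # x / y in the paper's notation

# Toy stand-in for the deterministic fault-injection verdict: accept the
# outcome when the corrupted sum is within 1% of the error-free sum.
def acceptable(a, corrupted_b, b=1):
    return abs((a + corrupted_b) - (a + b)) <= 0.01 * (a + b)

print(overshadowing_fraction(1000, 1, acceptable))   # → 0.125 (4 of 32)
```

Under this toy verdict, the corrupted values 0, 3, 5, and 9 are overshadowed by a = 1000, giving x/y = 4/32; the paper's example arrives at 5/32 because its acceptability decision comes from the actual application run.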
Similar to the application trace generator, the deterministic fault injector is also based on LLVM instrumentation. We use the LLVM instrumentation to count dynamic instructions and trigger bit flips. The application execution triggers a bit flip when a fault injection site is encountered.

To accelerate the calculation of aDVF, we leverage the existing work [7], [20] that explores "error equivalence" based on the similarity of intermediate execution states to avoid repeated analysis and fault injections on instructions. During our evaluation, MOARD calculates aDVF for 16 data objects in eight benchmarks within one day on a cluster of 256 cores, which is comparable to the execution time of existing fault injection work [7], [20].

V. EVALUATION
In this section, we use aDVF as a metric to evaluate application resilience to transient faults on data objects with a set of benchmarks. Furthermore, we validate the accuracy of our aDVF calculation. We also compare the aDVF calculation with the traditional fault injection to show the power and benefits of the aDVF calculation.
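As a loose, hypothetical illustration of what kind of quantity aDVF is (Equation 1, which defines it precisely, is in Section III and not reproduced in this excerpt): aDVF aggregates error masking over the operations that access a data object, so a value near 1 means most operations on the object mask injected errors. A sketch under that assumption:

```python
# Hypothetical sketch only: the exact aDVF definition is Equation 1 in
# Section III (not shown in this excerpt). This illustrates "fraction of
# masked error patterns, averaged over operations touching the object".

def advf_sketch(masked_fraction_per_operation):
    ops = masked_fraction_per_operation
    return sum(ops) / len(ops) if ops else 0.0

# Three operations on a data object mask 30/32, 32/32, and 10/32 of the
# single-bit-flip error patterns, respectively.
print(advf_sketch([30 / 32, 32 / 32, 10 / 32]))   # → 0.75
```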
A. Evaluating Application Resilience to Transient Faults on Data Objects Using aDVF
We study 12 data objects from six benchmarks of the NAS Parallel Benchmark (NPB) suite and four data objects from two scientific applications. Those data objects are chosen to be representative: they have various data access patterns and participate in different execution phases. Table I gives details on the benchmarks and applications. The maximum error propagation path for the aDVF analysis is 50, for which we do not lose analysis accuracy, as discussed in Section III-D. Similar to [7], [20], [23], we only study single-bit errors because they are the most common errors.

Figure 4 shows the aDVF results and breaks them down into the three levels (i.e., the operation level, the error propagation level, and the algorithm level).

Error masking happens commonly in data objects across benchmarks and applications, including those scientific applications (e.g., LULESH and AMG) that are highly sensitive to data correctness. Several data objects (e.g., r in CG, and exp and plane in FT) have aDVF values close to 1 in Figure 4, which indicates that most operations working on these data objects have error masking. Those data objects are double-precision floating-point, and their error masking mainly comes from value overshadowing and overwriting (Figure 5). However, a couple of data objects have much less intensive error masking. For example, the aDVF value of colidx in CG is only 0.28 (Figure 4). Further study reveals that colidx is an integer array storing indexes of sparse matrices, and there is little operation-level or error propagation-level error masking (Figure 5). Its corruption can easily cause segmentation faults, caught by the deterministic fault injection. The array grid_points in SP and BT also has small aDVF values (0.06 and 0.38 for SP and BT, respectively, in Figure 4). Further study reveals that grid_points defines input problems for SP and BT. An error in grid_points can easily cause major changes in computation, caught by the error propagation analysis.
TABLE I: Benchmarks and applications for the study

Name | Benchmark description | Code segment for evaluation | Target data objects
CG (NPB) | Conjugate Gradient, irregular memory access (input class S) | The routine conj_grad in the main loop | The arrays r and colidx
MG (NPB) | Multi-Grid on a sequence of meshes (input class S) | The routine mg3P in the main loop | The arrays u and r
FT (NPB) | Discrete 3D fast Fourier Transform (input class S) | The routine fftXYZ in the main loop | The arrays plane and exp
BT (NPB) | Block Tri-diagonal solver (input class S) | The routine x_solve in the main loop | The arrays grid_points and u
SP (NPB) | Scalar Penta-diagonal solver (input class S) | The routine x_solve in the main loop | The arrays rhoi and grid_points
LU (NPB) | Lower-Upper Gauss-Seidel solver (input class S) | The routine ssor | The arrays u and rsd
LULESH [21] | Unstructured Lagrangian explicit shock hydrodynamics (input 5x5x5) | The routine CalcMonotonicQRegionForElems | The arrays m_elemBC and m_delv_zeta
AMG2013 [22] | An algebraic multigrid solver for linear systems arising from problems on unstructured grids (we use GMRES(10) with AMG preconditioner and a compact version from LLNL with input matrix aniso) | The routine hypre_GMRESSolve | The arrays ipiv and A

Fig. 4: The breakdown of aDVF results based on the three-level analysis. The x axis is the data object name.

Fig. 5: The breakdown of aDVF results based on value overwriting, value overshadowing, and logic and comparison operations at the operation and error propagation levels. The x axis is the data object name. zeta and elemBC in LULESH are m_delv_zeta and m_elemBC.

Evaluation conclusion 1: The above aDVF-based analysis reveals the variation of application resilience to transient faults on data objects and provides insights on whether the corruption of a data object impacts application outcomes, which is useful for directing fault tolerance mechanisms.

We further notice that the data objects colidx and r in CG have 2.19e+09 and 4.54e+07 error masking events (not shown in Figure 4), respectively. Although colidx has more error masking events, CG is not more resilient to errors on colidx than on r. In particular, 75% of the bit flips that happen in the elements of colidx involved in the operations of CG cause incorrect application outcomes or segmentation faults, while less than 1% do so for r. The above observation provides strong support for introducing the metric aDVF.

Evaluation conclusion 2: Simply counting the number of error masking events is not sufficient to evaluate application resilience to errors on data objects.

We further look into the results based on the analysis at the three levels. Operation-level error masking is very common. Figure 4 shows that there are 12 data objects whose operation-level error masking contributes more than 70% of the aDVF values. For exp in FT and rhoi in SP, the contribution of the operation-level error masking is close to 99%.

We further notice that the contribution of error masking at the error propagation level to the aDVF result is very limited. For most of the data objects, the contribution is less than 10% (Figure 4). For five data objects (colidx in CG, grid_points and u in BT, and grid_points and rhoi in SP), there is no such error masking. Note that our analysis at the error propagation level is valid even if we increase the error propagation length. We discuss the impact of error propagation length in Section III-D.

Different from error masking at the error propagation level, the contribution of the algorithm-level error masking to the aDVF result is relatively large. For example, the algorithm-level error masking contributes 19% to the aDVF value for u in MG and 27% for plane in FT (Figure 4). The large contribution for u in MG is consistent with the existing work [14]. For FT (particularly 3D FFT), the large contribution of algorithm-level error masking in plane comes from frequent transpose and 1D FFT computations that average out the data corruption. CG, as an iterative solver, is known to have algorithm-level error masking because of its iterative nature [24]. Interestingly, the algorithm-level error masking in CG contributes most to application resilience to transient faults on colidx, which is a vulnerable integer data object (Figure 4).

Evaluation conclusion 3: The aDVF analysis gives us deep information on how errors are tolerated. This may be useful for refactoring applications (e.g., using different algorithms or different data structures and data types) to improve the error tolerance of data objects.

We further break down the aDVF results into value overwriting, logical and comparison operations, and value overshadowing, based on the analysis at the operation and error propagation levels, shown in Figure 5. We have the following observation.

Value overshadowing is very common, especially for (double-precision) floating-point data objects (e.g., u in BT, zeta in LULESH, and rhoi in SP in Figure 5). This finding has an important implication for studying application-level error tolerance: the impact of data corruption can be correlated with the input problem, because different input problems can have different values of the data objects, which in turn have different effects on value overshadowing.
Hence, the existing conclusions on application-level fault tolerance [4], [5], [6], [15], [25] drawn with single input problems must be re-examined with different input problems to validate the conclusions on application resilience.

B. Model Validation
In this section, we aim to (1) validate the accuracy of our approach to calculate aDVF, and (2) demonstrate that aDVF correctly quantifies application resilience to transient faults on data objects.

We validate our modeling approach by comparing the aDVF result with the result of exhaustive fault injection (particularly, the success rate of exhaustive fault injection tests). The exhaustive fault injection is different from the traditional random fault injection: with an exhaustive fault injection campaign, we inject faults into all valid fault injection sites. A valid fault injection site is a bit in an instruction operand or output that holds a value of the target data object. We use those fault injection sites because we quantify application resilience to transient faults on data objects. The exhaustive fault injection is accurate for quantifying application resilience to transient faults on data objects because of its full coverage of all fault sites. However, the number of valid fault injection sites can be very large (e.g., trillions of sites in CG (Class A)). Hence, although the exhaustive fault injection is accurate and good for model validation in this section, it is not practical compared with aDVF.

Note that the aDVF result cannot be exactly the same as the exhaustive fault injection result, because the definitions of aDVF and exhaustive fault injection are different. Hence, we validate the modeling accuracy by quantifying application resilience to transient faults for multiple data objects and then ranking them based on the quantification. Ideally, the rank order of data objects based on the aDVF calculation should be exactly the same as that based on the exhaustive fault injection. A correct order of data objects in terms of application resilience to transient faults is critical for deciding which data objects should be protected by fault tolerance mechanisms.

We focus on a function (conj_grad()) from CG and a function (CalcMonotonicQRegionForElems()) from LULESH.
We study the major data objects in the two functions (those data objects take most of the memory footprint). We use single-bit flips in fault injection. The results are shown in Figure 6. We notice that aDVF and the exhaustive fault injection rank the data objects in the same order: aDVF correctly reflects application resilience to transient faults on data objects.

C. Comparing aDVF Calculation with the Traditional Random Fault Injection (RFI)
We compare the aDVF calculation with RFI. We aim to reveal the limitations of this traditional approach and demonstrate the predictive power of aDVF compared to RFI.

Fig. 6: Model validation against exhaustive fault injection. The x axis shows the data object name (rowstr, colidx, a, p, and q from CG; m_x, m_y, and m_z from LULESH).

Fig. 7: The RFI results with the margin of error (confidence level 95%) and aDVF results. The results are for three data objects (m_x, m_y, and m_z) from CalcMonotonicQRegionForElems() of LULESH.

RFI: We use the following method for RFI. We use valid fault injection sites, as defined in Section V-B. In each fault injection test, we randomly trigger a single-bit flip in a valid fault injection site. The number of fault injection tests is determined by a statistical approach [26] using a confidence level of 95% to ensure statistical significance. We do seven sets of fault injection tests; the number of fault injection tests in the seven sets ranges from 500 to 3500 with a stride of 500. We use three equal-sized, floating-point arrays (m_x, m_y, and m_z) in the function CalcMonotonicQRegionForElems() of LULESH for this study.

Figure 7 shows the results of RFI (the success rate). The figure also shows the margin of error (shown as small red bars in the figure). The results reveal that the results of RFI are sensitive to the number of fault injection tests. For example, for m_z, the success rates of RFI are 0.28 and 0.19 for 1000 and 3000 random fault injection tests, respectively, a 49% difference between the two results. Furthermore, in terms of application resilience to transient faults on data objects, we cannot rank the three target data objects in a consistent order across the seven test sets. For example, the success rate of RFI for m_x is lower than that for m_z when the number of fault injection tests is 500, 1000, and 1500. However, the observation is the opposite when the number of fault injection tests is 2000 and 3500. In other words, using RFI, we cannot conclude that LULESH is more resilient to transient faults on one data object than on another (even when the margin of error is considered).
The reason is three-fold: the randomness of RFI, the limited confidence level, and the inability to capture error masking events.

aDVF: We measure the aDVF of the three data objects. Figure 7 shows the results (see the last group of bars). We rank the three objects in a deterministic order (i.e., there is no inconsistency in the aDVF calculation results, no matter how many times we calculate aDVF). The order is also verified by the accurate, exhaustive fault injection (see Section V-B for discussion). Having a deterministic order is important for guiding error-tolerant designs (e.g., deciding which data object should be protected by a fault tolerance mechanism).
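The sample-size sensitivity of RFI seen above follows directly from the statistics of estimating a success rate from n random trials. A sketch of the standard binomial margin of error at the 95% confidence level (the exact statistical approach of [26] may differ):

```python
import math

# Margin of error (normal approximation) at 95% confidence for an RFI
# success rate p estimated from n random fault injection tests.
def margin_of_error(p, n, z=1.96):
    return z * math.sqrt(p * (1 - p) / n)

# Around a success rate of ~0.25, the margin stays at a few percentage
# points even at 3500 tests -- enough to reorder data objects whose true
# success rates are close, consistent with the inconsistent RFI rankings.
for n in (500, 1500, 3500):
    print(n, round(margin_of_error(0.25, n), 3))   # → 0.038, 0.022, 0.014
```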
Evaluation conclusion 4:
The calculation of aDVF is deterministic, meaning that we can deterministically rank data objects in terms of application resilience to transient faults on the data objects. Using the traditional RFI, we cannot do so; RFI can be ineffective for guiding error-tolerant designs.

VI. CASE STUDY
In this section, we study a case of using aDVF to help system designers decide whether a specific application-level fault tolerance mechanism is helpful to improve application resilience to transient faults on data objects.

Application-level fault tolerance mechanisms, such as algorithm-based fault tolerance [27], [28], [29], are extensively studied as a means to increase application resilience to transient faults on data objects. However, those mechanisms can come with big performance and energy overheads (e.g., 35% performance loss in [3]). To justify the necessity of using those mechanisms, we must quantify their effectiveness. With the introduction of aDVF, we can evaluate whether application resilience to transient faults on data objects is effectively improved with fault tolerance mechanisms in place.

We focus on a specific application-level fault tolerance mechanism, the algorithm-based fault tolerance (ABFT) for general matrix multiplication (C = A × B) [28]. This ABFT mechanism encodes the matrices A, B, and C into a new form with checksums. If an error happens in an element of C, leveraging the checksums, we are able to detect and correct the erroneous element. We apply the aDVF analysis on this ABFT, with the matrix C as the target data object. We compare the aDVF values of C with and without ABFT. Figure 8 shows the results. The figure shows that ABFT effectively improves the error tolerance of C: the aDVF value increases from 0.0172 to 0.82 (larger is better). The improvement mostly comes from the value overwriting during error propagation. This result is expected because a corrupted element of C is not corrected by ABFT right away.
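The checksum encoding just described can be sketched as follows. This is a simplified single-error sketch in the spirit of the checksum scheme of [28]; the encoding details and the integer toy data are our illustrative assumptions, not MOARD's or the evaluated ABFT's exact implementation.

```python
# Simplified ABFT sketch for C = A x B: append a column-checksum row to A
# and a row-checksum column to B, so the full product carries checksums of
# C that locate and correct a single corrupted element.

def matmul(A, B):
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def encode(A, B):
    n = len(A)
    Ac = A + [[sum(A[i][j] for i in range(n)) for j in range(n)]]
    Br = [row + [sum(row)] for row in B]
    return Ac, Br

def verify_and_correct(Cf):
    # Cf is (n+1) x (n+1); the last row/column hold checksums of C.
    n = len(Cf) - 1
    bad_rows = [i for i in range(n) if sum(Cf[i][:n]) != Cf[i][n]]
    bad_cols = [j for j in range(n)
                if sum(Cf[i][j] for i in range(n)) != Cf[n][j]]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        Cf[i][j] += Cf[i][n] - sum(Cf[i][:n])   # correct the single error
    return [row[:n] for row in Cf[:n]]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
Ac, Br = encode(A, B)
Cf = matmul(Ac, Br)          # (n+1) x (n+1) result with checksums
Cf[0][1] += 100              # inject a transient error into C[0][1]
C = verify_and_correct(Cf)
assert C == matmul(A, B)     # the corrupted element is corrected
```

A real ABFT implementation must also protect the checksums themselves and tolerate floating-point round-off when comparing sums; the sketch uses integers to sidestep that.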
Instead, it will be corrected in a specific verification phase of ABFT during error propagation.

Given the effectiveness of this ABFT, we further explore whether this ABFT can help us improve resilience to transient faults on a data object in an application, Particle Filter (PF) from Rodinia [30], without knowing the application resilience of PF. PF has a critical variable, xe, which is repeatedly used to store vector multiplication results. Given that a vector can be treated as a special matrix, we can apply ABFT to protect xe for those vector multiplications. Using xe as our target data object, we perform the aDVF analysis with and without ABFT. We want to answer a question: Will using ABFT be an effective fault tolerance mechanism for protecting xe in PF?

Fig. 8: Using aDVF analysis to study application resilience to transient faults on C in matrix multiplication (MM). Notation: [C] is MM without applying ABFT on C; ABFT_[C] is MM with ABFT taking effect.

Fig. 9: Using aDVF analysis to study the effectiveness of ABFT for a data object xe in PF. [xe] has no protection of ABFT; ABFT_[xe] has ABFT taking effect on xe.

Figure 9 shows the results. The figure reveals that using ABFT does not much improve application resilience to transient faults on the data object xe: there is only a little change in the aDVF value (0.48 vs. 0.475). We find two reasons for this: (1) the operation-level error masking accounts for a large part of error masking, no matter whether we use ABFT or not; (2) most errors corrected by ABFT are also correctable by PF. The second reason is demonstrated by the following fact: with ABFT, the number of error masking events increases at the error propagation level but decreases at the algorithm level; in total, however, the number of error masking events at the two levels with ABFT is almost the same as without ABFT. This case study is a clear demonstration of how the aDVF analysis can direct error tolerance designs.

VII. DISCUSSIONS
A. Program Optimization by aDVF

aDVF has many potential usages. We discuss two cases that use aDVF to optimize programs.
Code optimization: Programmers have long worked on code optimization to improve performance and energy efficiency. However, the impact of code optimization on application resilience is often ignored. There are cases where optimizing code to improve application resilience is necessary (e.g., [31] and [32]). Code optimization (including common compiler optimizations) can change memory access patterns and runtime values of data objects, which in turn impacts error propagation and value overshadowing. aDVF and its analysis give programmers a feasible tool to study and compare application resilience (from the perspective of data objects) before and after code optimization. The aDVF analysis is also helpful to pinpoint which part of the application code is vulnerable from the perspective of data objects and hence demands further optimization.
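As a toy illustration of the kind of effect involved (our own example, not from the paper): a code change that rounds results to single precision masks low-order-bit errors in a double-precision value that would otherwise remain observable. `to_f32` and `flip_bit` are hypothetical helpers written for this sketch.

```python
import struct

# Toy example (ours, not the paper's): rounding a double-precision value
# to single precision masks low-order mantissa errors that would stay
# observable at full precision.

def to_f32(x):
    # Round a Python float (IEEE-754 double) to float32 precision.
    return struct.unpack('f', struct.pack('f', x))[0]

def flip_bit(x, i):
    # Flip bit i of the IEEE-754 double encoding of x.
    (bits,) = struct.unpack('<Q', struct.pack('<d', x))
    return struct.unpack('<d', struct.pack('<Q', bits ^ (1 << i)))[0]

x = 3.141592653589793
corrupted = flip_bit(x, 10)            # flip a low mantissa bit
print(corrupted == x)                  # False: the error is live in double
print(to_f32(corrupted) == to_f32(x))  # True: float32 rounding masks it
```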
Algorithm choice: To solve a specific computation problem, we can have multiple algorithm choices. For example, to solve Poisson's equation on a 2D grid, we could use a direct method (Cholesky factorization), Multigrid, or red-black successive over-relaxation. Different algorithms have different implications on data distribution, parallelism, and blocking [33]. Which algorithm should be employed depends on the user's requirements on performance, energy/power efficiency, and resilience. aDVF and its analysis can help users (especially those working on HPC) make the algorithm choice from the perspective of application resilience. It would also be interesting to integrate the aDVF analysis with programming languages and compilers for algorithm choice, such as PetaBricks [33].
B. Beyond Single-Bit Errors
MOARD and the aDVF calculation are general, meaning that they can be used for analyzing both single-bit and multi-bit errors. In our study and evaluation, we focus on single-bit errors for two reasons: (1) multi-bit errors rarely occur in HPC systems, and most of the existing studies on application resilience focus on single-bit errors; (2) existing work reveals that multi-bit errors can have similar effects on applications as single-bit errors [34].

To use MOARD and aDVF for analyzing multi-bit errors, we need to make the following extensions. (1) Define multi-bit error patterns. For example, for two-bit errors, the error pattern could be spatially contiguous; it could also be spatially separated (by four bits, for example). (2) Re-implement the function f (defined in Equation 1) in MOARD. This means that we must re-examine error masking. For the operation-level analysis, the effects of logical and comparison operations and value overshadowing will be different from those for single-bit errors; the effect of value overwriting may be the same as that for single-bit errors. For the error propagation analysis, we can use the same method as for single-bit errors to track error propagation, but the empirical bound of error propagation (i.e., the parameter k in Section III-D) must be reset using fault injection tests. For the algorithm-level analysis, we use the same fault injection-based method as for single-bit errors, but the injected errors must follow the defined error pattern.

C. Impact of Input Problems
The aDVF analysis is input dependent. This means that an application with different input problems may have different aDVF values for a data object. Such input dependence arises for multiple reasons.
First, the effectiveness of operation-level error masking is input dependent. For example, a bit-shifting operation on integers, `x >> y`, can tolerate a single-bit error in the least significant bit of x if y = 1, but can tolerate single-bit errors in any of the three least significant bits of x if y = 3. Second, different input problems can result in different control flows, which in turn result in different error propagation.
Third, different input problems can result in the employment of different algorithms, and different algorithms can result in different algorithm-level error masking.

Because of the input-dependent nature of the aDVF analysis, we must redo the aDVF analysis whenever the application changes its input problem. This is a common limitation of many resilience studies, including fault injection, AVF [35], [13], PVF [36], DVF [37], and [18]. However, a static analysis-based method cannot address this limitation because of unresolved branches and data values. Fortunately, MOARD allows a user to easily leverage hardware resources to parallelize the analysis (e.g., deterministic fault injection and trace analysis), making the analysis easy and efficient even if the user has to repeatedly do the aDVF analysis. Furthermore, leveraging the common iterative structures of HPC applications, analyzing a small trace of the application instead of the whole trace is often enough. This makes the repeated aDVF analysis even more feasible. Nevertheless, studying the sensitivity of the aDVF analysis to input problems is our future work.

VIII. RELATED WORK
Application-level random fault injection.
Casas et al. [14] study the resilience of an algebraic multi-grid solver by injecting errors into instruction outputs based on LLVM. Similar work can be found in [4], [15]. Li et al. [5] build a binary instrumentation-based fault injection tool for random fault injection. Hari et al. [7], [20] aggressively employ static and dynamic program analyses to reduce the number of fault injection tests. Menon and Mohror [38] apply algorithmic differentiation to predict the impact of an SDC on application output to avoid fault injection. Those research efforts do not sufficiently consider application semantics (e.g., algorithm-level fault tolerance) and hence provide limited guidance to some application-level fault tolerance mechanisms. However, those research efforts can complement our work by accelerating fault injection. Vishnu et al. [18] associate data semantics with fault injection to build a machine learning model to predict application errors. However, the data semantics is only introduced at main memory, not registers; also, the machine learning model has to be trained and has no accuracy guarantee. Our method does not have the above limitations.
Resilience metrics.
Architectural vulnerability factor (AVF) is a hardware-oriented metric to quantify the probability of an error in a hardware component resulting in incorrect application outcomes. It was first introduced in [35], [13] and then attracted a series of follow-up work, including statistical modeling techniques to accelerate AVF estimation [39] and online AVF estimation [40]. Yu et al. [37] introduce a metric, DVF, which captures the effects of application and hardware on the error tolerance of data objects. In contrast to AVF and DVF, aDVF is a highly application-oriented metric.

IX. CONCLUSIONS
Understanding application resilience (or error tolerance) in the presence of hardware transient faults on data objects is critical to ensure computing integrity and enable efficient application-level fault tolerance mechanisms. The traditional methods (such as random fault injection) cannot help because they lose data semantics and give insufficient information on how and where errors are tolerated. This paper introduces a fundamentally new method to quantify application resilience to transient faults on data objects. In essence, our method measures error masking events at the application level and associates the events with data objects. We perform a comprehensive classification of error masking events and create a series of techniques to recognize them. We develop an open source tool to quantify application resilience from the perspective of data objects. We hope that our method can make such quantification a common practice. Currently, the deployment of fault tolerance mechanisms is often a problem because of the lack of a method to quantify their effectiveness in protecting data objects. Our work provides a tangible solution to this problem.
Acknowledgements.
This work is partially supported by U.S. National Science Foundation (CNS-1617967, CCF-1553645, and CCF-1718194) and LLNL subcontract B629135. We thank the anonymous reviewers for their valuable feedback.

REFERENCES

[1] Z. Chen, "Online-ABFT: An Online ABFT Scheme for Soft Error Detection in Iterative Methods," in PPoPP, 2013.
[2] T. Davies and Z. Chen, "Correcting Soft Errors Online in LU Factorization," in International ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC), 2013.
[3] P. Du, A. Bouteiller, G. Bosilca, T. Herault, and J. Dongarra, "Algorithm-based Fault Tolerance for Dense Matrix Factorizations," in PPoPP, 2012.
[4] J. Calhoun, L. Olson, and M. Snir, "FlipIt: An LLVM Based Fault Injector for HPC," in Workshops in Euro-Par, 2014.
[5] D. Li, J. S. Vetter, and W. Yu, "Classifying Soft Error Vulnerabilities in Extreme-Scale Scientific Applications Using a Binary Instrumentation Tool," in International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2012.
[6] G. Li, K. Pattabiraman, C.-Y. Cher, and P. Bose, "Understanding Error Propagation in GPGPU," in International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2016.
[7] S. K. S. Hari, S. V. Adve, H. Naeimi, and P. Ramachandran, "Relyzer: Exploiting Application-level Fault Equivalence to Analyze Application Resiliency to Transient Faults," in ASPLOS, 2012.
[8] LLVM, "LLVM Language Reference Manual," http://llvm.org.
[9] K. Pattabiraman, Z. Kalbarczyk, and R. K. Iyer, "Application-Based Metrics for Strategic Placement of Detectors," in Pacific Rim International Symposium on Dependable Computing, 2005.
[10] A. Moody, G. Bronevetsky, K. Mohror, and B. R. de Supinski, "Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System," in Conference on High Performance Computing Networking, Storage and Analysis (SC), 2010.
[11] Anonymous, "MOARD: Modeling Application Resilience to Transient Faults on Data Objects," https://github.com/PASAUCMerced/MOARD.
[12] M. Li, P. Ramachandran, S. K. Sahoo, S. V. Adve, V. S. Adve, and Y. Zhou, "Understanding the Propagation of Hard Errors to Software and Implications for Resilient System Design," in ASPLOS, 2008.
[13] S. S. Mukherjee, C. Weaver, J. Emer, S. K. Reinhardt, and T. Austin, "A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor," in International Symposium on Microarchitecture, 2003.
[14] M. Casas, B. R. de Supinski, G. Bronevetsky, and M. Schulz, "Fault Resilience of the Multi-grid Solver," in ICS, 2012.
[15] V. C. Sharma, A. Haran, Z. Rakamaric, and G. Gopalakrishnan, "Towards Formal Approaches to System Resilience," in Pacific Rim International Symposium on Dependable Computing, 2013.
[16] M. Rinard, H. Hoffmann, S. Misailovic, and S. Sidiroglou, "Patterns and Statistical Analysis for Understanding Reduced Resource Computing," in OOPSLA, 2010.
[17] D. H. Bailey, L. Dagum, E. Barszcz, and H. D. Simon, "NAS Parallel Benchmark Results," in International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 1992.
[18] A. Vishnu, H. v. Dam, N. R. Tallent, D. J. Kerbyson, and A. Hoisie, "Fault Modeling of Extreme Scale Applications Using Machine Learning," in IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2016.
[19] Y. S. Shao and D. Brooks, "ISA-Independent Workload Characterization and its Implications for Specialized Architectures," in
IEEE Interna-tional Symposium on Performance Analysis of Systems and Software(ISPASS) , 2013.[20] S. K. Sastry Hari, R. Venkatagiri, S. V. Adve, and H. Naeimi, “GangES:Gang Error Simulation for Hardware Resiliency Evaluation,” in
Inter-national Symposium on Computer Arch. , 2014.[21] I. Karlin, A. Bhatele, and etc., “Exploring Traditional and EmergingParallel Programming Models using a Proxy Application,” in
IEEEInternational Parallel and Distributed Processing Symposium , 2013.[22] V. Henson and U. Yang, “BoomerAMG: A Parallel Multigrid Solver andPreconditioner,”
Appl. Num. Math , vol. 41, 2002.[23] R. Venkatagiri, A. Mahmoud, S. K. S. Hari, and S. V. Adve, “Approx-ilyzer: Towards a Sys. Framework for Instruction-level ApproximateComputing and Its Application to HW Resiliency,” in
MICRO , 2016.[24] M. Shantharam, S. Srinivasmurthy, and P. Raghavan, “Characterizing theImpact of Soft Errors on Iterative Methods in Scientific Computing,” in
International Conference on Supercomputing (ICS) , 2011.[25] X. Li and D. Yeung, “Application-level Correctness and Its Impact onFault Tolerance,” in
International Symposium on Computer Arch. , 2007.[26] R. Leveugle, A. Calvez, P. Maistri, and P. Vanhauwaert, “Statistical FaultInjection: Quantified Error and Confidence,” in
Conference on Design,Automation and Test in Europe (DATE) , 2009.[27] Z. Chen, “Algorithm-based Recovery for Iterative Methods withoutCheckpointing,” in
HPDC , 2011.[28] P. Wu, C. Ding, and etc., “On-line Soft Error Correction in MatrixMultiplication,”
J. of Computational Sci. , vol. 4, no. 6, 2013.[29] K.-H. Huang and J. A. Abraham, “Algorithm-Based Fault Tolerance forMatrix Operations,”
IEEE Transactions on Computers , vol. C-33, no. 6,pp. 518–528, 1984.[30] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, andK. Skadron, “Rodinia: A benchmark suite for heterogeneous computing,”in
IISWC , 2009.[31] Debra Werner, “HPE Supercomputer in Orbit is Ready for Researchers,”https://spacenews.com/hpe-supercomputer.[32] K. Maeng, A. Colin, and B. Lucia, “Alpaca: Intermittent executionwithout checkpoints,”
Proceedings of ACM Programming Language ,2017.[33] J. Ansel, C. Chan, Y. L. Wong, M. Olszewski, Q. Zhao, A. Edelman, andS. Amarasinghe, “Petabricks: A language and compiler for algorithmicchoice,” in
PLDI , 2009.[34] B. Sangchoolie, K. Pattabiraman, and J. Karlsson, “One Bit is (Not)Enough: An Empirical Study of the Impact of Single and Multiple Bit-Flip Errors,” in
International Conference on Dependable Systems andNetworks , 2017.[35] A. Biswas, P. Racunas, R. Cheveresan, J. Emer, S. S. Mukherjee,and R. Rangan, “Computing Arch. Vulnerability Factors for Address-Based Structures,” in
International Symposium of Computer Architecture(ISCA) , 2005.[36] V. Sridharan and D. R. Kaeli, “Eliminating Microarchitectural Depen-dency from Architectural Vulnerability,” in
IEEE International Sympo-sium on High-Performance Computer Architecture (HPCA) , 2009.[37] L. Yu, D. Li, S. Mittal, and J. S. Vetter, “Quantitatively Modeling App.Resiliency with Data Vulnerability Factor,” in SC , 2014.[38] H. Menon and K. Mohror, “DisCVar: Discovering Critical VariablesUsing Algorithmic Differentiation for Transient Faults,” in PPOPP ,2018.[39] L. Duan, B. Li, and L. Peng, “Versatile Prediction and Fast Estimation ofArchitectural Vulnerability Factor from Processor Performance Metrics,”in
HPCA , 2009.[40] X. Li, S. V. Adve, P. Bose, and J. Rivers, “Online Estimation of ArchVulnerability Factor for Soft Errors,” in
ISCA , 2008., 2008.