[PDF] NSan: A Floating-Point Numerical Sanitizer

Abstract

Sanitizers are a relatively recent trend in software engineering. They aim at automatically finding bugs in programs, and they are now commonly available to programmers as part of compiler toolchains. For example, the LLVM project includes out-of-the-box sanitizers to detect thread safety (tsan), memory (asan,msan,lsan), or undefined behaviour (ubsan) bugs. In this article, we present nsan, a new sanitizer for locating and debugging floating-point numerical issues, implemented inside the LLVM sanitizer framework. nsan puts emphasis on practicality. It aims at providing precise, and actionable feedback, in a timely manner. nsan uses compile-time instrumentation to augment each floating-point computation in the program with a higher-precision shadow which is checked for consistency during program execution. This makes nsan between 1 and 4 orders of magnitude faster than existing approaches, which allows running it routinely as part of unit tests, or detecting issues in large production applications.

Full PDF

NNSan: A Floating-Point Numerical Sanitizer

Clement Courbet

Google ResearchFrance [email protected]

Abstract

Sanitizers are a relatively recent trend in software engineer-ing. They aim at automatically finding bugs in programs,and they are now commonly available to programmers aspart of compiler toolchains. For example, the LLVM projectincludes out-of-the-box sanitizers to detect thread safety ( tsan ), memory ( asan,msan,lsan ), or undefined behaviour ( ubsan ) bugs.In this article, we present nsan , a new sanitizer for locatingand debugging floating-point numerical issues, implementedinside the LLVM sanitizer framework. nsan puts emphasison practicality . It aims at providing precise , and actionable feedback, in a timely manner. nsan uses compile-time instrumentation to augment eachfloating-point computation in the program with a higher-precision shadow which is checked for consistency duringprogram execution. This makes nsan between 1 and 4 ordersof magnitude faster than existing approaches, which allowsrunning it routinely as part of unit tests, or detecting issuesin large production applications. CCS Concepts: • Software and its engineering → Dy-namic analysis ; Software verification . Keywords:

Floating Point Arithmetic, Numerical Stability,LLVM, nsan

ACM Reference Format:

Clement Courbet. 2021. NSan: A Floating-Point Numerical Sanitizer.In

Proceedings of the 30th ACM SIGPLAN International Conference onCompiler Construction (CC ’21), March 2–3, 2021, Virtual, Republicof Korea.

ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3446804.3446848

Most programs use IEEE 754[9] for numerical computation.Because speed and efficiency are of major importance, there

Permission to make digital or hard copies of part or all of this work forpersonal or classroom use is granted without fee provided that copies arenot made or distributed for profit or commercial advantage and that copiesbear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contactthe owner/author(s).

CC ’21, March 2–3, 2021, Virtual, Republic of Korea © 2021 Copyright held by the owner/author(s).ACM ISBN 978-1-4503-8325-7/21/03. https://doi.org/10.1145/3446804.3446848 is a constant tension between using larger types for more pre-cision and smaller types for improved performance. Nowa-days, the vast majority of architectures offer hardware sup-port for at least 32-bit ( float ) and 64-bit ( double ) precision.Specialized architectures also support even smaller typesfor improved efficiency, such as bfloat16 [7]. SIMD instruc-tions, whose width is a predetermined byte size, can typicallyprocess twice as many floats as doubles per cycle. There-fore, performance-sensitive applications are very likely tofavor lower-precision alternatives when implementing theiralgorithms.Numerical analysis can be used to provide theoretical guar-antees on the precision of a conforming implementation withrespect to the type chosen for the implementation. However,it is time-consuming and therefore typically applied only tothe critical parts of an application. To automatically detectpotential numerical errors in programs, several approacheshave been proposed. The majority of numerical verification tools use probabilisticmethods to check the accuracy of floating-point computa-tions. They perturbate floating-point computations in theprogram to effectively change its output. Statistical analysiscan then be applied to estimate the number of significantdigits in the result. They come in two flavors: Discrete Sto-chastic Arithmetic (DSA)[13] runs each floating-point oper-ation 𝑁 times with a randomization of the rounding mode.Monte Carlo Arithmetic (MCA)[12] directly perturbates theinput and output values of the floating-point operations. Early approaches to numerical checking, such as CADNA [13],required modifying the source code of the application andmanually inserting the DSA or MCA instrumentation. Whilethis works on very small examples, it is not doable in prac-tice for real-life numerical applications. This has hinderedthe widespread adoption of these methods. To alleviate thisproblem, more recent approaches automatically insert MCAor DSA instrumentation automatically.

Verificarlo [8] is an LLVM pass that intercepts floating-pointinstructions at the IR level ( fadd , fsub , fmul , fdiv , fcmp ) a r X i v : . [ c s . S E ] F e b C ’21, March 2–3, 2021, Virtual, Republic of Korea Clement Courbet and replaces them with calls to a runtime library called back-end . The original paper describes a backend that replacesthe floating-point operations by calls to an MCA library.Since the original publication, the Verificarlo has gainedseveral backends , including an improved MCA backend: mca is about 9 times faster than than the original mca_mpfr backend . VERROU [4] and

CraftHPC [10] are alternative tools thatwork directly from the original application binary. VERROUis based on the

Valgrind framework [11], while CraftHPCis based on

DyninstAPI [3]. In both cases, the applicationbinary is decompiled by the framework into IR, and instru-mentation is performed on the resulting IR. This has theadvantage that the tool does not require re-compilation ofthe program. However, this makes running the analysis rela-tively slow. In terms of instrumentation, VERROU performsthe same MCA perturbation as the mca backend of Verifi-carlo, while CraftHPC detects cancellation issues (similar toVerificarlo’s cancellation backend). A major downside ofworking directly from the binary is that some semantics thatare available at compile time are lost in the binary. For exam-ple, the compiler knows about the semantics of math libraryfunctions such as 𝑐𝑜𝑠 , and knows that it has been designedfor a specific rounding mode. On the other hand, dynamictools like VERROU only see a succession of floating-pointoperations, and blindly apply MCA, which will result in falsepositives.

The main drawback of approaches based on probabilisticmethods, such as Verificarlo and VERROU, is that they mod-ify the state of the application. Just stating that a programhas numerical instabilities is not very useful, so both relyon delta-debugging [14] for locating instabilities. Delta de-bugging is a general framework for locating issues in pro-grams based on a hypothesis-trial-result loop. Because of itsgenerality, it is not immediately well adapted to numericaldebugging. This puts a significant burden on the user whohas to write a configuration for debugging .FpDebug [2] takes a different approach. Like VERROU,FpDebug is a dynamic instrumentation method based onValgrind. However, instead of using MCA for the analysis,it maintains a separate shadow value for each floating-pointvalue in the original application. The shadow value is theresult of performing operations in higher-precision floating-point arithmetic (120 bits of precision by default). By com-paring the original and shadow value, FpDebug is able topinpoint the precise location of the instruction that intro-duces the numerical error. https://github.com/verificarlo/verificarlo

132 and 1167 ms/sample respectively for the example of section 4.1 https://github.com/verificarlo/verificarlo Based on the analysis in section 2, we design nsan aroundthe concept of shadow values : • Every floating-point value 𝑣 at any given time in theprogram has a corresponding shadow value , noted 𝑆 ( 𝑣 ) ,which is kept alongside the original value. The shadowvalue 𝑆 ( 𝑣 ) is typically a higher precision counterpart of 𝑣 . A shadow value is created for every program input,and any computation on original values is applied inparallel in the shadow domain. For example, addingtwo values: 𝑣 = 𝑎𝑑𝑑 ( 𝑣 , 𝑣 ) will create a shadow value 𝑆 ( 𝑣 ) = 𝑎𝑑𝑑 𝑠ℎ𝑎𝑑𝑜𝑤 ( 𝑆 ( 𝑣 ) , 𝑆 ( 𝑣 )) , where 𝑎𝑑𝑑 𝑠ℎ𝑎𝑑𝑜𝑤 isthe addition in the shadow domain. • At any point in the program, 𝑣 and 𝑆 ( 𝑣 ) can be com-pared for consistency. When they differ significantly,we emit a warning (see section 3.3).In our implementation, 𝑆 ( 𝑣 ) is simply a floating pointvalue with a precision that is twice that of 𝑣 : float valueshave double shadow values, double values have quad (a.k.a. fp128 ) shadow values. In the special case of X86’s 80-bit long double , we chose to use an fp128 shadow. Note thatthis does not offer any guarantees that the shadow computa-tions will themselves be stable. However, the stability of theapplication computations implies that of the shadow compu-tations, so any discrepancy between 𝑣 and 𝑆 ( 𝑣 ) means thatthe application is unstable. This allows us to catch unstablecases, even though we might be missing some of them. Inother words, in comparison to approaches based on MCA,we trade some coverage for speed and memory efficiency,while keeping a low rate of false positives. In our experi-ments, doubling the precision was enough to catch mostissues while keeping the shadow value memory reasonablysmall.Conceptually, our design combines the shadow compu-tation technique of FpDebug with the compile-time instru-mentation of Verificarlo. Where our approach diverges sig-nificantly from that of FpDebug is that we implement theshadow computations in LLVM IR, alongside the originalcomputations. This has several advantages: • Speed:

Most computations do not emit runtime librarycalls, the code remains local, and the runtime is ex-tremely simple. The shadow computations are opti-mized by the compiler. This improves the speed byorders of magnitude (see section 4.1), and allows ana-lyzing programs that are beyond the reach of FpDebugin practice (see section 4.2.1). • Scaling:

FpDebug runs on Valgrind, which forces allthreads in the application to run serially . Using com-pile time instrumentation means that nsan scales as well as the original applications. This is a major ad-vantage in modern hardware with tens of cores. • Semantics:

Contrary to dynamic approaches based onValgrind, most of the semantics of the original programare still known at the LLVM IR stage. For example, animplementation that does not know the semantics ofthe program would compute the shadow of a floatcosine as 𝑆 ( 𝑐𝑜𝑠 𝑓 ( 𝑣 )) = 𝑆 ( 𝑐𝑜𝑠 𝑓 ( 𝑆 ( 𝑣 ))) . This would in-troduce numerical errors as cosf ’s implementation iswritten for single-precision. Instead, nsan is able toreplace the cosf by its double-precision counterpart cos : 𝑆 ( 𝑐𝑜𝑠 𝑓 ( 𝑣 )) = 𝑐𝑜𝑠 ( 𝑆 ( 𝑣 )) , . • Simplicity:

From the software engineering perspective,this reduces the maintenance burden by relying on thecompiler for the shadow computation logic. WhereFpDebug requires modified versions of the GNU Mul-tiple Precision Arithmetic Library and GNU MultiplePrecision Floating-Point Reliably in addition to theFpDebug Valgrind tool itself, in our case, LLVM han-dles the lowering (and potential vectorization) of theshadow code.The following sections detail how we construct, track, andcheck shadow values in our implementation. A floating-point value is any LLVM value of type float , double , x86_fp80 , or a vector thereof (e.g. <4 x float> ).We classify floating-point values into several categories: • Temporary values inside a function: These are typi-cally named variables or artifacts of the programminglanguage. They have an IR representation (and we alsocall them

IR values ). During execution, these valuestypically reside within registers. • Parameter (resp. argument ) values: These are the val-ues that are passed (resp. received) through a func-tion call. Because numerical instabilities can span sev-eral functions, it is important that shadow values arepassed to functions alongside their original counter-parts. • Return values: are similar in spirit to parameter values,as the shadow must be returned alongside the originalvalue. • Memory values are values that do not have an IR rep-resentation outside of their materialization through a load instruction.

Temporary values are the sim-plest case: every IR instruction that produces a floating-pointvalue gets a shadow IR instruction of the same opcode, butthe type of the instruction is different and parameters are re-placed by their shadow counterparts. We give a few examplesin Table 1. https://llvm.org/docs/LangRef.html Parameter valuesare maintained in a shadow stack. During a function call,for each floating-point parameter 𝑣 , the caller places 𝑆 ( 𝑣 ) on the shadow stack before entering the call. On entry, thecallee loads 𝑆 ( 𝑣 ) from the shadow stack. The only complex-ity comes from the fact that a non-instrumented functioncan call an instrumented function. Blindly reading from theshadow stack in the callee would result in garbage shadowvalues. To avoid this, the shadow stack is tagged by the ad-dress of the callee. Before calling a function f , the caller tagsthe shadow stack with f . When reading shadow stack val-ues, the callee checks that the shadow stack tag matches itsaddress. If it does, the shadow values are loaded from theshadow stack. Else, the parameters are extended to createnew shadows. In practice, the introduced branch does nothurt performance as it’s typically perfectly predicted.Return values are handled in similar manner. The frame-work has a return slot with a tag and a buffer. Instrumentedfunctions that return a floating-point value set the tag totheir address and put the return value in the shadow returnslot. Instrumented callers check whether the tag matches thecallee and either read from the shadow return slot or extendthe original return value (see Table 1). Note that because theprogram can be multithreaded, the shadow stack and returnslot are thread-local. These are a bit special becausethey do not have a well-defined lifetime and can persist forthe lifetime of the program.

Shadow Memory : Like most LLVM sanitizers, we main-tain a shadow memory alongside the main application mem-ory. The nsan runtime intercepts memory functions (e.g. malloc , realloc , free ). Whenever the application allocatesa memory buffer, a corresponding shadow memory buffer isallocated. The shadow buffer is released when the applica-tion buffer is released. The shadow memory is in a differentaddress space than that of the application, which ensures thatshadow memory cannot be tampered with from the applica-tion. Shadow memory is conceptually very simple: for everyfloating point value 𝑣 in application memory at address 𝐴 ( 𝑣 ) ,we maintain its shadow 𝑆 ( 𝑣 ) at address 𝑀 𝑠 ( 𝐴 ( 𝑣 )) . A loadfrom 𝐴 ( 𝑣 ) to create a value 𝑣 is instrumented as a shadowload from 𝑀 𝑠 ( 𝐴 ( 𝑣 )) to create 𝑆 ( 𝑣 ) ; a store to 𝐴 ( 𝑣 ) creates ashadow store of 𝑆 ( 𝑣 ) to 𝑀 𝑠 ( 𝐴 ( 𝑣 )) . Shadow Types : We have to handle an extra complex-ity: memory is untyped, so there is no guarantee that theapplication does not modify the value at 𝐴 ( 𝑣 ) through non-floating-type stores or partial overwrites by another float.Consider the code of Fig. 1, which modifies the byte represen-tation of a floating-point value in memory. It’s unclear howthis should translate in the shadow space. In that case, wechoose to resume computations by re-extending the originalvalue: 𝑆 (∗ 𝑓 ) = ∗ 𝑓 . C ’21, March 2–3, 2021, Virtual, Republic of Korea Clement Courbet

Table 1.

Example nsan instrumentation.

Operation Example Added Instrumentation binary/unary operation %c = fadd float %a, %b %s_c = fadd double %s_a, %s_b cast %b = fpext <2 x float> %a to <2 x double> %s_b = fpext <2 x double> %s_a to <2 x fp128> select %d = select i1 %c, double %a, double %b %s_d = select i1 %c, fp128 %s_a, fp128 %s_b vector operation %c = shufflevector <2 x float> %a, %s_b = shufflevector <2 x double> %s_a,<2 x float> %b, <2 x i32> <2 x double> %s_b, <2 x i32> known function call %b = call float @fabsf(float %a) %s_b = call double @llvm.fabs.f64(double %s_a) fcmp %r = fcmp oeq double %a, 1.0 %s_r = fcmp oeq fp128 %s_a, 1.0%c = icmp eq i1 %r, %s_rbr i1 %c, label 2, label 11:call void @__fcmp_fail_double(...)br label 22: return ret float %a store i64 i64 %fn_addr, i64* @__ret_tag, align 8%rp = bitcast ([64 x i8]* @__ret_ptr to double*)store double %s_a, double* %rp, align 8 function call %a = call float @returns_float() %tag = load i64, i64* @__ret_tag, align 8%fn_addr = ptrtoint (float ()* @returns_float to i64)%m = icmp eq i64 %tag, i64 %fn_addr%rp = bitcast ([64 x i8]* @__ret_ptr to double*)%l = load double, double* %rp), align 8%e = fpext float %a to double%s_a = select i1 %m, double %l, %e float UntypedMemory ( float * f) { *f = 1.0; *(( char *)f + 2) = 2; return *f; } Figure 1.

A function that modifies the binary representationof a floating point value in memory. How to extend thisoperation to the shadow domain is unclear.To handle this case correctly, we track the type of each bytein application memory. We maintain shadow types memory.For a floating point value 𝑣 in application memory at ad-dress 𝐴 ( 𝑣 ) , each byte in the shadow types memory at address 𝑀 𝑡 ( 𝐴 ( 𝑣 )) + 𝑘 contains the type of the floating point value( unknown , float , double , x86_fp80 ), as well as the position 𝑘 of the byte within the value (see Fig. 2). A shadow value inmemory is valid only if the shadow type memory contains acomplete position sequence [0,...,sizeof(type)-1] , ofthe right type.When storing a floating point value, the shadow instru-mentation retrieves the shadow pointer via a call to a func-tion __shadow_ptr__load , which sets the shadowmemory type to and returns the shadow value ad-dress. When loading a floating-point value, the shadow in-strumentation calls a function __shadow_ptr__load which returns the shadow pointer if the shadow value is valid,and null otherwise. If the shadow is valid, it is loaded fromthe shadow address; else, the instrumentation creates a newshadow by extending the original load. Copying bytes from Figure 2.

Shadow type memory example: The left column isthe address in application memory. For each byte in shadowtype memory, the first character denotes the type: float(f) , double (d) , long double (l) , unknown (_) ; andthe second character is the position of the byte inside thecorresponding floating point value. In this example, theshadow memory contains valid shadows for aligned float sat addresses , , , and ; a double at address ; and a longdouble at address . Note that the double at ad-dress is not valid as it has been overwritten bythe float at address .one memory location to another (either through memcpy() or an untyped load/store pair) copies both the shadow typesand shadow values. Untyped stores and functions with thesemantics of an untyped store (e.g. memset ) set the shadowmemory type to unknown .In practice, subtle binary representation manipulationssuch as that of figure 1 are very uncommon, and most un-typed memory accesses fall in two categories: San: A Floating-Point Numerical Sanitizer CC ’21, March 2–3, 2021, Virtual, Republic of Korea • Setting a memory region to a constant value (typicallyzero), e.g. memset(p, 0, n * sizeof(float)) . Inthat case, the nsan framework sets the shadow typesto unknown , and any subsequent load from this mem-ory region will see a correct shadow value of , re-extended from the original value . • Copying a memory region (typically, an array of floatsor a struct containing a float member), e.g. structS { int32_t i; float f; }; void CopyS(S&s2, const S& s1) { s2 = s1; } . In this case,LLVM might choose to do the structure copy with asingle untyped 8-byte load/store pair. nsan copies theshadow types from 𝑀 𝑡 ( 𝐴 ( 𝑠 )) to 𝑀 𝑡 ( 𝐴 ( 𝑠 )) (8 bytes)and the shadow values from 𝑀 𝑠 ( 𝐴 ( 𝑠 )) to 𝑀 𝑠 ( 𝐴 ( 𝑠 )) (16 bytes). Therefore, assuming that 𝑀 𝑡 ( 𝐴 ( 𝑠 .𝑓 )) con-tains valid types, any subsequent load from s2.f willsee the correct shadow types in 𝑀 𝑡 ( 𝐴 ( 𝑠 .𝑓 )) and loadthe shadow value from 𝑀 𝑠 ( 𝐴 ( 𝑠 .𝑓 )) In the SPECfp2006 benchmark suite, all the floating-pointloads that are done from a location with invalid or unknowntypes have a corresponding application value of + . , whichis a strong indication that shadow types are either correctlytracked or come from an untyped store (or memset ) of thevalue . However, shadow type tracking is necessary forcorrectness and we have found it to be necessary in severalplaces in Google’s large industrial codebase. Memory Usage : All allocations/deallocations are mir-rored, and each original byte uses one byte in the shadowtypes block and two bytes in the shadow values block: quad(resp. double) is twice as big as double (resp. float). So aninstrumented application uses 4 times as much memory asthe original one.

We check for several types of shadow value consistency: • Observable value consistency:

By default, we check con-sistency between 𝑣 and 𝑆 ( 𝑣 ) every time a value canescape from a function, that is: function calls, return,and stores to memory. These values are the only onethat are observable by the environment (the user, orother code inside the application). This is differentfrom the approach of FpDebug, and we’ll see later thatthis decision has an influence on the terseness of theoutput and reduces false positives. • Branch consistency:

For every comparison betweenfloating-point values, we check that the comparison ofthe shadow values yields the same result. This catchesthe case when, even though the values are very close,they can drastically affect the output of the programby taking a different execution path. This approach isalso implemented in Verificarlo and VERROU. • Load consistency:

When loading a floating-point valuefrom memory, we check that its loaded shadow is con-sistent. If not, this means that some uninstrumentedcode modified memory without nsan being aware.This can happen, for example, when the user usedhand-written assembly code which could not be instru-mented. By default, this check does not emit a warningsince this is typically not an issue of the code undertest. It simply resumes computation with 𝑆 ( 𝑣 ) = 𝑣 . Inpractice, we found that this happened extremely rarely,and we provide a flag to disable load tracking whenthe user knows that it cannot happen.In each case, we print a warning with a detailed diagnosticto help the user figure out where the issue appeared. Thediagnostic includes the value and its shadow, how they differ,and a full stack trace of the execution complete with symbolssource code location. An example diagnostic is given in Fig. 3. Runtime Flags.

Determining whether two floating-pointvalues are similar is a surprisingly ill-defined problem [6].nsan implements the epsilon and relative epsilon strategiesfrom [6], and allows the user to customize their tolerances.

Sanitizer Interface.

We provide a set of functions thatcan be used to interact explicitly with the sanitizer. This isuseful when debugging instabilities: • __nsan_check_float(v) emits a consistency checkof 𝑣 . Note that this is a normal function call: the instru-mentation automatically forwards the shadow valueto the runtime in the shadow stack. • __nsan_dump_shadow_mem(addr, size) prints a rep-resentation of shadow memory at address [addr,addr+size] . See Fig. 2 for an example. • __nsan_resume_float(v) Resumes the computationfrom the original value from that point onwards: 𝑆 ( 𝑣 ) = 𝑣 . Suppressions.

The framework might produce false pos-itives. This can happen, for example, when an applicationperforms a computation that might be unstable, but hasways to check for and correct numerical stability afterwards(see section 4). We provide a way to disable these warningsthrough suppressions . Suppressions are specified in an ex-ternal file as a function name or a source filename. If anyfunction or filename within the stack of the warning matchesa suppression, the warning is not emitted. Suppressions canoptionally specify whether to resume computation from theshadow or the original value after a match.

Most applications will at one point or other make use ofcode that is not instrumented. This might be because theyare calling a closed-source library, because they are calling

C ’21, March 2–3, 2021, Virtual, Republic of Korea Clement Courbet

WARNING : NumericalSanitizer : inconsistent shadow results while checking store to address 0 xffda3808double precision ( native ): dec: 0.00000000000002309503 hex: 0x1. a00b086c4888f0000000p -46__float128 precision ( shadow ): dec: 0.00000000000005877381 hex: 0x8 .458 cb4531bef87a00000p -47shadow truncated to double : dec: 0.00000000000005877381 hex: 0x1 .08 b1968a637df0000000p -44Relative error : 60.70% (2^51 epsilons ) (6344632558530384 ULPs == 15.8 digits == 52.5 bits )

Figure 3.

An example nsan warning in a real application. Note that the warning pinpoints the exact location of the issue inthe original source code. The full stack trace was collapsed for clarity.a hand-coded assembly routine, or because they are callinginto the

C runtime library (e.g. memcpy() , or for math func-tions. nsan interacts seamlessly with these libraries thanksto the shadow tagging system described in section 3.2.

In this section, we start by taking a common example ofnumerical instability and compare how Verificarlo, FpDebugand nsan perform in terms of diagnostics and performance.Then, we show how nsan compares in practice on real-lifeapplications, using the SPECfp2006 suite. In particular, wediscuss how the improved speed allows us to analyze binariesthat are not approachable with existing tools, while reducingthe number of false positives (and therefore the burden onthe user).

Summation is probably the best known example of an al-gorithm which is intrinsically unstable when implementednaively. Kahan’s compensated summation [9] works aroundthe unstabilityof the naive summation by introducing a com-pensation term. Example code for both algorithms can befound on Fig. 4.

For each tool, we ran the two summa-tion algorithms of Fig. 4, on the same randomly generatedvector of 10M elements. A perfect tool would warn of aninstability on line in the naive case. Whether it shouldproduce no warnings in the stable case is up for debate: Onthe one hand, the operations on line and result in lossof precision. On the other hand, the only thing that reallymatters in the end is the observable output of the function.All three tools were able to detect the numerical issuewhen compiled with compiler optimizations. The tools differquite a lot in the amount of diagnostic that they produce: • Verificarlo produces an estimate of the number of cor-rect significant digits in both modes. The number of float NaiveSum ( const vector & values ) { float sum = 0.0f; for ( float v : values ) { sum += v; } return sum; } float KahanSum ( const vector & values ) { float sum = 0.0f; float c = 0.0f; for ( float v : values ) { float y = v - c; float t = sum + y; c = (t - sum) - y; sum = t; } return sum; } Figure 4.

Naive summation and Kahan compensated sum-mation.significant digits is lower for the naive case ( . vs . ),which shows the issue. By default, no source code in-formation is provided, though the user can optionallyprovide a debugging script to locate the issue . • FpDebug evaluates the error introduced by each in-structions, and sorts them by magnitude. In the naivecase, FpDebug reports , , discrepancies, thelargest of which (line ) has a relative error of . × − , which is the error introduced by the summation.In the stable case, it reports , , discrepanciesbetween application and shadow value, the largest 2being on line and , with errors of about and https://github.com/verificarlo/verificarlo − respectively. This makes sense because the com-pensation term c is somehow random. The relativeerror for sum is reported to be . × − . • nsan produces a single warning ( lines of output)in naive mode, reporting a relative error of . × − on line ( return sum ). In stable mode, it produces nooutput. On the one hand, nsan avoids producing falsepositives in stable mode, as the temporary variables c , y , t , and sum are only checked when producing ob-servable value (see section 3.3). On the other hand, thediagnostic is made on the location where the observ-able is produced (l. ) instead of the specific locationwhere the error occurs (l. ). We believe that while thisproduces less precise diagnostics, the gain in terseness(in particular, the reduction in what we argue are falsepositives) benefits the user experience. To detect the is-sue, Verificarlo needs to run the program 𝑁 times, where 𝑁 is a large number, and run analysis on the output. In theoriginal article, the authors use 𝑁 = ; it’s unclear howone should pick the right value of 𝑁 . In contrast, FpDebugand nsan are able to detect the issue with a single run of theprogram, and they can pinpoint the exact location where theissue happens.Table 2 compares the performance of running the programwithout instrumentation, with Verificarlo, VERROU, FpDe-bug, and nsan respectively. Simply enabling instrumentationin Verificarlo makes the program run about 6 times slower.This is because all instrumentation is done as function calls.Before every call, registers have to be spilled to respect thecalling convention. The function call additionally preventsmany optimizations because the compiler does not knowwhat happens inside the runtime. Performing the random-ization on top with the MCA backend makes each samplerun about 40 times slower in total. The dynamic approachof FpDebug is also quite slow as it does not benefit fromcompiler optimizations.In contrast, nsan slows down the program by a factor of . when shadowing float computations as double : shadow double computations are done in hardware, and are as fast asthe original ones, and the framework adds a small overhead.When shadowing double computations as quad , the slow-down is around : this is because shadow computationsare done in software, and are therefore much slower (somearchitectures supported by LLVM, such as POWER9[5], havehardware support for quad-precision floats; nsan would bemuch faster on these). Note that all these times are given persample. A typical debugging session in Verificarlo requiresrunning the mca backend for a large number of samples (theVerificarlo authors use samples). Therefore, analyzingeven this trivial program slows it down by a factor . libinterflop_ieee.so libinterflop_mca.so –mode=mca Table 2.

Performance of various approaches on the KahanSum. The second column shows the time (in milliseconds)to run one sample of the compensated sum algorithm fromFig. 4, with 1M elements. The third and fourth columns re-spectively show the slowdown compared to the originalprogram for a single sample, and for the whole analysis (us-ing samples for probabilistic methods). The experimentwas performed on a 6-core Xeon [email protected] with 16MBL3 cache.

Version ms/sample Slowdown Slowdown (1 sample) (full)original program 3.3 1.0x 1.0xVerificarlo, ieee 18.4 5.6x 5600xVerificarlo, mca 132.3 40.0x 40000xVerrou, nearest 96.5 29.2x 29200xVerrou, random 117.0 35.4x 35400xFpDebug, precision=64 1573.3 476.6x 476.6xnsan (double shadow) 7.7 2.3x 2.3xnsan (quad shadow) 56.7 17.2x 17.2x S p ee dup Figure 5.

Parallel Scalability: Speedup of running one sam-ple of the compensated sum algorithm from Fig. 4 (100Melements) vs. number of threads.

If ordering is notimportant, the compensated sum of Fig. 4 can be triviallyparallelized: Each thread is given a portion of the array, and alast pass sums the results for each thread. Figure 5 shows howeach approach scales with the number of threads. BecauseValgrind serializes all threads, both Verrou and FpDebug can-not take advantage of additional parallelism. Methods basedon compile-time instrumentation (Verificarlo and nsan ) scalewith the application. An exception is Verificarlo with theMCA backend, which is actively hurt by multithreading.

Table 3 shows the time it takes to an-alyze each of the C/C++ benchmarks of SPECfp2006 (testset) with FpDebug and nsan . As shown on the simple ex-ample above, Verificarlo and Verrou take too much time to

C ’21, March 2–3, 2021, Virtual, Republic of Korea Clement Courbet analyze large application, so we only provide compare withFpDebug. All experiments were performed on a 6-core [email protected] with 16MB L3 cache. In practice, debugginga floating-point application is likely to involve running theanalysis with the application compiled in debug mode (with-out compiler optimizations), so we include results when theapplication is compiled with compiler optimizations ( opt rows) or without them ( dbg rows).Note that all programs in SPECfp2006 are single-threaded,so this is the best case for FpDebug.

Table 3.

Performance of analyzing SPECfp2006 applications(test set) with fpdebug and nsan, with compiler optimizationsturned on and off. For each experiment, we show the runtimein seconds for each and the speedup factor of nsan vs fpdebug.Note that as noted in [2], the dealII benchmark cannot rununder FpDebug due to limitations in Valgrind.

Benchmark Original FpDebug nsan Speedup milc (opt) 3.73 3118.2 505.4 6.2xnamd (opt) 8.33 5679.8 519.8 10.9xdealII (opt) 7.60 - 356.4 -soplex (opt) 0.01 1.9 0.1 19.0xpovray (opt) 0.31 171.8 12.7 13.5xlbm (opt) 1.47 1343.0 105.4 12.7xsphinx3 (opt) 0.88 304.0 26.7 11.4xmilc (dbg) 13.80 4721.1 502.2 9.4xnamd (dbg) 20.20 11445.2 529.0 21.6xdealII (dbg) 85.40 - 621.6 -soplex (dbg) 0.33 41.0 0.8 52.5xpovray (dbg) 0.85 286.6 18.0 15.9xlbm (dbg) 2.00 1785.0 105.5 16.9xsphinx3 (dbg) 1.79 649.0 27.3 23.8x

To investigate what made nsan much faster, we profiledFpDebug and nsan runs using the Linux perf tool [1]. Table 4shows where the analyzed program spends most of its time.For nsan , we base the breakdown on calls into the compilerruntime (for quad computation) and nsan runtime (shadowvalue load/stores and checking). This underestimates whathappens in reality as the breakdown does not include addi-tional time spent in the original application such as shadowvalue creation, shadow double computations for float values,or register spilling when calling framework functions.For nsan , most time is spent on shadow computations,shadow value tracking is secondary, and checking is neg-ligible. For FpDebug, shadow value computation (calls to mpfr_* ) is a much smaller part of the total. Shadow memorytracking is somehow significant, in particular the memoryinterceptions (calls to vgPlain_* ). Most time is spent exe-cuting Valgrind.Because nsan only adds a constant of work per opera-tion, it scales linearly with respect to problem size. To assessthis experimentally, we used the milc benchmark, which isinteresting because it can scale independently in terms of

Table 4.

Approximate breakdown of where time is spent inan instrumented application (with compiler optimizations).

Benchmark Shadow Memory ValueComputation Tracking Checkingnsan milc 75.5% 4.7% 0.2%namd 83.2% 3.0% 0.7%dealII 73.7% 5.7% 1.2%soplex 39.6% 11.4% 0.4%povray 71.8% 8.3% 0.2%lbm 79.6% 2.3% 1.6%sphinx3 71.2% 7.5% 0.2%

FpDebug milc 49.3% 14.2% 0.0%namd 51.6% 9.0% 0.01%soplex 14.6% 4.0% 0.1%povray 34.5% 7.6% 0.01%lbm 49.1% 10.6% 0.7%sphinx3 34.2% 9.7% 0.1% base runtime (seconds) n s a n r un ti m e ( s ec ond s ) Figure 6.

Scaling of the milc benchmark with respect toproblem size. We run the benchmark uninstrumented (base)and instrumented (nsan) and measure the runtime whilevarying the input problem size in each dimension. Eachpoint represents a benchmark run with a particular valueof steps_per_trajectory ( s ) and the grid resolution inthe time domain ( nt ). Trend lines are represented for eachdimension.memory (grid size, parameter nt ) and number of steps (pa-rameter steps_per_trajectory ). Figure 6 shows that nsan scales linearly with the problem size in both dimensions. Table 5 shows, for each tool, the num-ber of instructions reported as introducing a relative errorlarger than − (a.k.a positives ). This threshold is arbitrary,and corresponds to the default for nsan . For this experiment,compiler optimizations are enabled as this is likely to be theconfiguration of choice when debugging a whole application. San: A Floating-Point Numerical Sanitizer CC ’21, March 2–3, 2021, Virtual, Republic of Korea

Table 5.

Number of instructions introducing a relative errorlarger than − . The first two columns show the number ofwarnings for FpDebug with and without counting the falsepositives from libm . Note that for the value marked with the number of warnings is a lower bound as FpDebug reportsunsupported vector operations Max64Fx2 and

Min64Fx2 . Benchmark FpDebug FpDebug ¬ libm nsan milc 140 0 0namd 100 72 415dealII - - 21soplex 53 50 2povray 772 An important source of false positives for FpDebug (upto of the positives can be false positives) are mathe-matical functions such as sine or cosine. For example, for the milc benchmark, all warnings happen inside the libm . Thisis because the implementation of (e.g.) sin(double) usesspecific constants tailored to the double type. Reproducingthe same operations in quad precision is unlikely to producea correct result. As mentioned in 3.1, LLVM is aware of thesemantics of the functions of the libc and libm , which al-lows nsan to process the shadow value using the extendedprecision version of these functions (e.g. sin(double) for sin(float) ), avoiding the false positives.If we ignore the false positives from libm , nsan tendsto reports fewer issues than FpDebug . Unfortunately, asseen in section 4.1.1, whether a warning is a true or falsepositive is subject to interpretation. We inspected a sampleof positives from FpDebug and nsan . They can roughly beclassified in three buckets: • False positives due to temporary values. This is simi-lar to the false positives in the Kahan sum from 4.1.1.These are mostly from FpDebug, though nsan can alsoproduce them when memory is used as temporaryvalue: Writing a temporary to memory makes it an observable value. Fig. 7 gives examples of such a falsepositives. • False positives due to incorrect shadow value trackingin FpDebug. FpDebug has issues dealing with integerstores that alias floating-point values in memory (a.k.a type punning ). Because nsan tracks shadow memorytypes (see 3.2)), it does not suffer from this problem.Fig. 8 gives an example of this issue. • Computations that are inherently unstable, and theinstability is visible on a partial computation. However,the input is such that the observable output value doesnot differ significantly from its shadow counterpart. The large number of warnings for the namd benchmark is due to theexistence of multiple warnings inside a macro: FpDebug reports one issuefor the macro, while nsan reports an issue for each line inside the macro. 1 // (1). void equal ( double x, double y) { double d = x - y; if ( d > 0.00001 || d < -0.00001 ) { printf (" error : numeric test failed ! ( error= %g)\n",d); exit ( -10); } } // (2). Real delta = 0.1 + 1.0 / thesolver -> basis ().iteration (); ... x = coPenalty_ptr [j] += rhoVec [j]*( beta_q *rhoVec [j] -2* rhov_1 * workVec_ptr [j]); if (x < delta ) coPenalty_ptr [j] = delta ; Figure 7.

Example false positives from the soplex and namd benchmark. For (1), note how a large relative error can becreated by cancellation on line . However, all that matters isthe absolute value compared to . . FpDebug incorrectlywarns on that case, while nsan is silent. (2) is similar inspirit, though more complex. The cancellation potentiallyintroduced on line is handled on line − , but bothFpDebug and nsan report an issue on line − .Fig. 10 illustrates this. Because FpDebug checks partialcomputations, it warns about this case. nsan does not,as it only checks observables. The best tradeoff hereis debatable: On one hand, the computation mightbecome unstable with a different input. On the otherhand, the code might be making assumptions aboutthe data that the instrumentation does not know about.Until the instrumentation sees data that changes theobservable behaviour of the function, it can assumethat the implementation is correct. We have mentioned earlier that nsan only checks observable values within a function, and we have seen previous sectionsthat this approach helps prevent false positives. However,this also makes nsan susceptible to compiler optimizationssuch as inlining (resp. outlining). Because these optimiza-tions change the boundaries of a function, they change its observable values. For example, given the code of Fig. 11,a compiler might decide to inline

NaiveSum into its caller

Print .In that case, the sum value will not be checked by nsan online , because sum is not an observable value of NaiveSum .This is not an issue for detecting numerical stability, asthe sum variable is still tracked within

Print . However, itchanges the source location where nsan reports the error.

C ’21, March 2–3, 2021, Virtual, Republic of Korea Clement Courbet void __attribute__ (( noinline )) Neg( double * v) { *(( unsigned char *)v + 7) ^= 0x80; } double Example ( double v) { double d = v / 0.2 - 3.0; Neg (&d); return d; } Figure 8.

Example false positive with type punning. FpDe-bug can be made to report an arbitrarily large error, as ituses a non-negated shadow value for 𝑆 ( 𝑑 ) after the call to Neg . In

Example , the computation is unstable around v=0.6 ,and FpDebug returns an error of 260% instead of the correctvalue of 60%. nsan is able to detect that the last two bytesof the shadow value have been invalidated by the untypedstore thanks to shadow type tracking.Note: the code was adapted from more complex applicationcode, noinline added to prevent some compiler optimiza-tions. // Unstable loop . for (i = 2; i <= Octaves ; i++) { ... result [Y] += o * value [Y]; result [Z] += o * value [Z]; if (i < Octaves ) { ... o *= Omega ; } } // Division by small value . if (D[Z] > EPSILON ) { ... t = ( Corner1 [Z] - P[Z]) / D[Z]; } Figure 9.

Example true positives from the povray bench-mark. The loop accumulates values of widely different mag-nitudes, which is known to produce large numerical errors.The first one is caught only by nsan , likely because it’s vec-torized by the compiler, and FpDebug does not handle somevector constructs. Both tools catch the second.While the user can easily circumvent the issue by using __nsan_check_float() function to debug where the errorhappens exactly, this degrades the user experience as it re-quires manual intervention.However, LLVM internally tracks function inlining in itsdebug information. In the future we plan to to correct the is-sue above by emitting checks for observable values of inlinedfunctions within their callers. Real SSVector :: length2 () const { Real x = 0.0; for(int i = 0; i < num; ++i) x += val[idx[i]] * val[idx[i]]; return x; } Figure 10.

Example code from the soplex benchmark.While the elements of the sum and the partial sum divergefrom their shadow counterpart, the eventual result does not.FpDebug reports an issue on line , but not on line . nsan does not report an issue. float NaiveSum ( const vector & values ) { float sum = 0.0f; for ( float v : values ) sum += v; return sum; } void Print ( const vector & values ) { float v = NaiveSum ( values ) + 1.0; printf ("%f", v); } Figure 11.

Example code where inlining might change theoutput of nsan. The only observable value of

NaiveSum isits return value. The only observable value of

Print is thesecond argument to the printf call. Depending on whether

NaiveSum is inlined, the warning is emitted on line column , or line column . Even though nsan offers less guarantees than numericalanalysis tools based on probabilistic methods, it was able totackle real-life applications that are not approachable withthese tools in practice due to prohibitive runtimes.We’ve shown that nsan was able to detect a lot of numeri-cal issues in real-life applications, while drastically reducingthe number of false positives compared to FpDebug. Our san-itizer provides precise and actionable diagnostics, offering agood debugging experience to the end user.Because nsan works directly in LLVM IR, shadow com-putations benefit from compiler optimizations, and can belowered to native code, which reduces the analysis cost by atleast an order of magnitude compared to other approaches.We believe that user experience, and in particular execu-tion speed and scalability, was a major factor for the adoptionof toochain-based sanitizers over Valgrind-based tools, andwe aim to emulate this success with nsan . We think that thisnew sanitizer is a step towards wider adoption of numericalanalysis tools.We intend to propose nsan for inclusion within the LLVMproject, complementing the existing sanitizer suite.

San: A Floating-Point Numerical Sanitizer CC ’21, March 2–3, 2021, Virtual, Republic of Korea

References [1] [n. d.]. Linux Perf. https://perf.wiki.kernel.org/ .[2] Florian Benz, Andreas Hildebrandt, and Sebastian Hack. 2012. ADynamic Program Analysis to Find Floating-Point Accuracy Prob-lems. In

Proceedings of the 33rd ACM SIGPLAN Conference on Pro-gramming Language Design and Implementation (PLDI ’12) . Asso-ciation for Computing Machinery, New York, NY, USA, 453–462. https://doi.org/10.1145/2254064.2254118 [3] Bryan Buck and Jeffrey K. Hollingsworth. 2000. An API for RuntimeCode Patching.

Int. J. High Perform. Comput. Appl.

14, 4 (Nov. 2000),317–329. https://doi.org/10.1177/109434200001400404 [4] François Févotte and Bruno Lathuilière. 2016. VERROU: AssessingFloating-Point Accuracy Without Recompiling.[5] IBM Corporation. [n. d.]. Power ISA, Version 3.0 B. https://openpowerfoundation.org/?resource_lib=power-isa-version-3-0 .[6] Bruce Dawson. [n. d.]. Comparing Floating Point Numbers, 2012Edition. https://randomascii.wordpress.com/2012/02/25/comparing-floating-point-numbers-2012-edition/ .[7] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin,Mark Mao, Marc'aurelio Ranzato, Andrew Senior, Paul Tucker, KeYang, Quoc V. Le, and Andrew Y. Ng. 2012. Large Scale DistributedDeep Networks. In

Advances in Neural Information Processing Systems25 , F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.).Curran Associates, Inc., 1223–1231. http://papers.nips.cc/paper/4687- large-scale-distributed-deep-networks.pdf [8] Christophe Denis, Pablo de Oliveira Castro, and Eric Petit. 2016. Veri-ficarlo: Checking Floating Point Accuracy through Monte Carlo Arith-metic. In . 55–62. https://doi.org/10.1109/ARITH.2016.31 [9] Nicholas J. Higham. 2002.

Accuracy and Stability of Numerical Al-gorithms (2nd ed.). Society for Industrial and Applied Mathematics,USA.[10] Michael O. Lam, Jeffrey K. Hollingsworth, and G. W. Stewart. 2013.Dynamic Floating-Point Cancellation Detection.

Parallel Comput. https://doi.org/10.1016/j.parco.2012.08.002 [11] Nicholas Nethercote and Julian Seward. 2007. Valgrind: A Frameworkfor Heavyweight Dynamic Binary Instrumentation.

SIGPLAN Not. https://doi.org/10.1145/1273442.1250746 [12] Douglas Stott Parker and David Langley. 1997. Monte Carlo Arithmetic:exploiting randomness in floating-point arithmetic.[13] Jean Vignes. 2004. Discrete Stochastic Arithmetic for Validating Resultsof Numerical Software.

Numerical Algorithms

37, 1-4 (Dec. 2004), 377–390. https://doi.org/10.1023/B:NUMA.0000049483.75679.ce [14] Andreas Zeller. 2002. Isolating Cause-Effect Chains from ComputerPrograms. In

Proceedings of the 10th ACM SIGSOFT Symposium onFoundations of Software Engineering (SIGSOFT ’02/FSE-10) . Associationfor Computing Machinery, New York, NY, USA, 1–10.. Associationfor Computing Machinery, New York, NY, USA, 1–10.