[PDF] A Novel Approach to Generate Correctly Rounded Math Libraries for New Floating Point Representations

Abstract

Given the importance of floating-point~(FP) performance in numerous domains, several new variants of FP and its alternatives have been proposed (e.g., Bfloat16, TensorFloat32, and Posits). These representations do not have correctly rounded math libraries. Further, the use of existing FP libraries for these new representations can produce incorrect results. This paper proposes a novel approach for generating polynomial approximations that can be used to implement correctly rounded math libraries. Existing methods generate polynomials that approximate the real value of an elementary function f(x) and produce wrong results due to approximation errors and rounding errors in the implementation. In contrast, our approach generates polynomials that approximate the correctly rounded value of f(x) (i.e., the value of f(x) rounded to the target representation). It provides more margin to identify efficient polynomials that produce correctly rounded results for all inputs. We frame the problem of generating efficient polynomials that produce correctly rounded results as a linear programming problem. Our approach guarantees that we produce the correct result even with range reduction techniques. Using our approach, we have developed correctly rounded, yet faster, implementations of elementary functions for multiple target representations.

Full PDF

7753A Novel Approach to Generate Correctly Rounded MathLibraries for New Floating Point Representations

JAY P. LIM,

Rutgers University

MRIDUL AANJANEYA,

Rutgers University

JOHN GUSTAFSON,

National University of Singapore

SANTOSH NAGARAKATTE,

Rutgers UniversityGiven the importance of floating-point (FP) performance in numerous domains, several new variants of FP andits alternatives have been proposed ( e.g. , Bfloat16 , TensorFloat32 , and

Posits ). These representations do nothave correctly rounded math libraries. Further, the use of existing FP libraries for these new representations canproduce incorrect results. This paper proposes a novel methodology for generating polynomial approximationsthat can be used to implement correctly rounded math libraries. Existing methods produce polynomials thatapproximate the real value of an elementary function f ( x ) and experience wrong results due to errors inthe approximation and due to rounding errors in the implementation. In contrast, our approach generatespolynomials that approximate the correctly rounded value of f ( x ) ( i.e. , the value of f ( x ) rounded to thetarget representation). This methodology provides more margin to identify efficient polynomials that producecorrectly rounded results for all inputs. We frame the problem of generating efficient polynomials that producecorrectly rounded results as a linear programming problem. Our approach guarantees that we produce thecorrect result even with range reduction techniques. Using our approach, we have developed correctly rounded,yet faster, implementations of elementary functions for multiple target representations. Our Bfloat16 libraryis 2.3 × faster than the corresponding state-of-the-art while producing correct results for all inputs. Approximating real numbers.

Every programming language has primitive data types to repre-sent numbers. The floating point (FP) representation, which was standardized with the IEEE-754standard [Cowlishaw 2008], is widely used in mainstream languages to approximate real numbers.For example, every number in JavaScript is a FP number! There is an ever-increasing need forimproved FP performance in domains such as machine learning and high performance comput-ing (HPC). Hence, several new variants and alternatives to FP have been proposed recently suchas

Bfloat16 [Tagliavini et al. 2018], Posit [Gustafson 2017; Gustafson and Yonemoto 2017], and

TensorFloat32 [NVIDIA 2020].

Bfloat16 [Tagliavini et al. 2018] is a 16-bit FP representation with 8-bits of exponent and 7-bitsfor the fraction. It is already available in Intel FPGAs [Intel 2019] and Google TPUs [Wang andKanwar 2019].

Bfloat16 ’s dynamic range is similar to a 32-bit float but has lower memory trafficand footprint, which makes it appealing for neural networks [Kalamkar et al. 2019]. Nvidia’s

TensorFloat32 [NVIDIA 2020] is a 19-bit FP representation with 8-bits of exponent and 10-bitsfor the fraction, which is available with Nvidia’s Ampere architecture.

TensorFloat32 providesthe dynamic range of a 32-bit float and the precision of half data type ( i.e. , 16-bit float), whichis intended for machine learning and HPC applications. In contrast to FP, posit [Gustafson 2017;Gustafson and Yonemoto 2017] provides tapered precision with a fixed number of bits. Dependingon the value being represented, the number of precision bits varies. Inspired by posits, a tapered

Authors’ addresses: Jay P. Lim, Rutgers University, [email protected]; Mridul Aanjaneya, Rutgers University, [email protected]; John Gustafson, National University of Singapore, [email protected]; Santosh Nagarakatte,Rutgers University, [email protected]. /2020/7-ART753 $15.00https://doi.org/Rutgers-DCS-TR, Vol. 1, No. Rutgers Computer Science Technical Report, Article 753. Publication date: July 2020. a r X i v : . [ c s . M S ] J u l precision log number system has been shown to be effective with neural networks [Bernstein et al.2020; Johnson 2018]. Correctly rounded math libraries.

Any number system that approximates real numbers needsa math library that provides implementations for elementary functions [Muller 2005] ( i.e. , loд ( x ) , exp ( x ) , sqrt ( x ) , sin ( x ) ). The recent IEEE-754 standard recommends (although it does not require)that the programming language standards define a list of math library functions and implementthem to produce the correctly rounded result [Cowlishaw 2008]. Any application using an erroneousmath library will produce erroneous results.A correctly rounded result of an elementary function f for an input x is defined as the valueproduced by computing the value of f ( x ) with real numbers and then rounding the result accordingto the rounding rule of the target representation. Developing a correct math library is a challengingtask. Hence, there is a large body of work on accurately approximating elementary functions [Brise-barre et al. 2006; Brunie et al. 2015; Bui and Tahar 1999; Chevillard et al. 2011, 2010; Chevillard andLauter 2007; Gustafson 2020; Jeannerod et al. 2011; Kupriianova and Lauter 2014; LefÃĺvre et al.1998; Lim et al. 2020], verifying the correctness of math libraries [Boldo et al. 2009; Daumas et al.2005; de Dinechin et al. 2011; de Dinechin et al. 2006; Harrison 1997a,b; Lee et al. 2017; Sawada2002], and repairing math libraries to increase the accuracy [Yi et al. 2019]. There are a few correctlyrounded math libraries (for float and double types in the IEEE-754 standard) such as the IBMLibUltim (also known as MathLib) [IBM 2008; Ziv 1991], Sun Microsystem’s LibMCR [Microsystems2008], CR-LIBM [Daramy et al. 2003], and MPFR math library [Fousse et al. 2007]. Widely usedmath library ( i.e. , libm in glibc ) does not produce correctly rounded results for all inputs. New representations lack math libraries.

The new FP representations currently do not havemath libraries specifically designed for them. One stop-gap alternative is to promote values fromnew representations to a float / double value and use existing FP libraries for them. For example,we can convert a Bfloat16 value to a 32-bit float and use the FP math library. However, thisapproach can produce wrong results for the

Bfloat16 value even when we use the correctlyrounded float library (see Section 2.6 for a detailed example). This approach also has suboptimalperformance as the math library for float / double types probably uses a polynomial of a largedegree with many terms than necessary to approximate these functions. Prior approaches for creating math libraries.

Most prior approaches use minimax approx-imation methods ( i.e. , Remez algorithm [Remes 1934] or Chebyshev approximations [Trefethen2012]) to generate polynomials that have the smallest error compared to the value of an elementaryfunction when evaluated with real numbers. The resultant polynomial can be evaluated withaddition, subtraction, and multiplication operations. Typically, range reduction techniques are usedto reduce the input domain such that the polynomial only needs to approximate the elementaryfunction for a small input domain. Subsequently, the result of the polynomial evaluation on thesmall input domain is adjusted to produce the result for the entire input domain, which is knownas output compensation. Polynomial evaluation, range reduction, and output compensation areimplemented in some finite representation that has higher precision than the target representation.The approximated result is finally rounded to the target representation.When the result of an elementary function f ( x ) with reals is extremely close to the rounding-boundary ( i.e. , f ( x ) rounds to a value v but f ( x ) + ϵ rounds to a different value v for very smallvalue ϵ ), then the error of the polynomial must be smaller than ϵ to ensure that the result ofthe polynomial produces the correctly rounded value [LefÃĺvre and Muller 2001]. This probablynecessitates a polynomial of a large degree with many terms. Another drawback of prior methods isthat there can be round-off errors in the polynomial evaluation with a finite precision representation.Hence, the result produced may not be the correctly rounded result. Rutgers-DCS-TR, Vol. 1, No. Rutgers Computer Science Technical Report, Article 753. Publication date: July 2020.

Novel Approach to Generate Correctly Rounded Math Libraries for New Floating Point Representations 753:3

RangeReductionReverseOutputCompensationConvertto HIdentify values in H thatround to y (1) (2) (3) GenerateP(x’)using LP(4)(3)Input xin T Input xin Hy = Correctlyroundedf(x) in T Roundinginterval[l, h] Reducedinterval[l’, h’]

An Approach to Generate Correctly Rounded Math Libraries for New Floating Point Variants 1:17 Function

Main( f , T , H , X , RR , OC , OC , d ) : L CalcRndIntervals( f , T , H , X ) if L = ; then return false L CalcRedIntervals( f , L , T , H , RR , OC , OC ) if L = ; then return false CombineRedIntervals( L ) if = ; then return false S , P H SynthesizePoly( , d ) if S = true then return P else return false Fig. 7. Overall algorithm that creates the polynomial approximation P ( x ) that will produce the cor-rectly rounded result. Each function, CalcIntervals , CalcRedIntervals , CombineRedIntervals , and

SynthesizePoly is explained later in this section. Function

CalcRndIntervals( f , T , H , X ) : L ; foreach x X do RN ( f ( x ) , T ) I GetRndInterval( , T , H ) if I = ; then return ; L L [ {( x , I )} end return L Function

GetRndInterval( , T , H ) : t l GetPrecVal( , T ) l min { H | t l , ] and RN ( , T ) = } t u GetSuccVal( , T ) h max { H | , t u ] and RN ( , T ) = } return [ l , h ] Fig. 8. For each input x X , CalcRndIntervals ( f , T , H , X ) identifies the interval I = [ l , h ] where all values in I rounds to the correctly rounded result f ( x ) for a given transcendental function f ( x ) . The GetRndInterval ( , T , H ) function returns the interval I H where all values in I rounds to . GetPrecValue ( , T ) returns thepreceding value of in the T representation and GetSuccValue ( , T ) returns the succeeding value of in T . (2) CalcRedIntervals:

For each pair ( x , I x ) 2 L , we compute the reduced input x . We alsocompute the reduced interval I x = [ l , h ] that denes the range of inputs for the outputcompensation such that any value in I x is output compensated to a value in I x . The pair ( x , I x ) species what the output of P ( x ) needs to be such that A ( x ) rounds to . CalcRedIntervals returns a list L containing all such pair of constraints for all input x .(3) CombineRedIntervals:

Because all inputs are reduced to the reduced input x , there may bemultiple reduced intervals for each reduced input in L . P ( x ) must produce a value within allthe reduced interval for A ( x ) to produce the correct value when rounded. Thus, we combineall reduced interval for each reduced input x and produce the pair ( x , ) where representsthe combined interval. CalcRedIntervals returns a list containing the constraint pair ( x , ) for each reduced input x .(4) SynthesizePoly:

Each pair ( x , ) 2 species the constraint on the output of P ( x ) . Weframe synthesizing P ( x ) that satises all constraints in as an LP problem and generate acorrect P ( x ) . The rst step in our approach is to identify the values that A ( x ) must produces such that the roundedvalue of A ( x ) is equal to the correctly rounded result of = f ( x ) , i.e. RN ( A ( x ) , T ) = RN ( , T ) , for Proc. ACM Program. Lang., Vol. 1, No. POPL, Article 1. Publication date: January 2021.

An Approach to Generate Correctly Rounded Math Libraries for New Floating Point Variants 1:19 Function

Calculate L ( f , L , T , H , RR , OC , OC , d ) : L ; foreach ( x i , [ l i , h i ]) 2 L do t OC H ( l i , x i ) t OC H ( h i , x i ) if OC H is an increasing function then t ; t else // OC H is a decreasing function t ; t end while OC H ( , x i ) < [ l i , h i ] do AdjHigher( , H ) if > then return ; end while OC H ( , x i ) < [ l i , h i ] do AdjLower( , H ) if > then return ; end L L [ {( RR H ( x i ) , [ , ])} end return L Function

Calculate ( L ) : X { x i | ( x i , I x i ) 2 L } ; foreach unique i X do i ; foreach ( x j , I x j ) 2 L do if i = x j then i i [ { I x j } end end i — I xj i I x j if i = ; then return ; [ {( i , i )} end return Fig. 9. The function

CalculateL’ transforms each constraint ( x i , I x i ) 2 L that constrains A f , H ( x i ) into a newconstraint, ( x i , I x i ) , that constraints P H ( x i ) such that A f , H satisfies A f , H ( x i ) 2 I x i even in the presence ofrange reduction as long as P H ( x i ) 2 I x i . The function Calculate combines multiple constraints with thesame reduced input, i.e. ( x i , I x i ) , ( x j , I x j ) 2 L where x i = x j , into a single constraint and creates a finallist of constraints for P H . of I x i , i.e. [ , ] I x i , ; . In particular, values near the boundary of the interval, i.e. or maybe a value such that OC H ( , x i ) < I x i or OC H ( , x i ) < I x i . Therefore, we repeatedly check whetherthe boundary values, and , is correctly output compensated to a value in I x i while reducing theboundary of [ , ] if they do not (lines 10-17). Finally, we store the nal interval I x i = [ , ] where OC H ( , x i ) 2 I x i and OC H ( , x i ) 2 I x i and the corresponding reduced input, x i = RR H ( x i ) in L (line 18). Once the list of constraints L is identied, we merge the constraints ( x , I x i ) , . . . ( x i , I x i ) 2 L where x = · · · = x i . Each of these constraints bound the output of P H ( x ) such that A f , H ( x ) produces the correct value for each input x i , that reduces to the same value x i . The function Calculate in Figure 15 shows how we merge the constraints. First, for all constraints ( x i , I x i ) 2 L , we identify a list of unique reduced inputs, X (line 1). For each unique reduced input X ,we identify all constraints ( x i , I x i ) 2 L where = x i and group the intervals I x i into (line4-10). can be considered as the list of constraint that bounds the output of P H ( ) . Therefore, wecreate a unied constraint by taking intersection of all intervals in (line 11). If the intersectedinterval, is ; , then it means that there is no output P H ( ) that satises all constraints in and ouralgorithm terminates by outputting it (line 12). Finally, Calculate returns the list containingthe merged constraints ( i , i ) for each unique i X . The polynomial P H ( x ) should be generatedsuch that it satises the constraints . Proc. ACM Program. Lang., Vol. 1, No. POPL, Article 1. Publication date: January 2021.

An Approach to Generate Correctly Rounded Math Libraries for New Floating Point Variants 1:17 Function

SynthesizePoly is explained later in this section. Function

CalcRndIntervals( f , T , H , X ) : L ; foreach x X do RN ( f ( x ) , T ) I GetRndInterval( , T , H ) if I = ; then return ; L L [ {( x , I )} end return L Function

An Approach to Generate Correctly Rounded Math Libraries for New Floating Point Variants 1:19 Function

Reducedinputx’Entire input domain

An Approach to Generate Correctly Rounded Math Libraries for New Floating Point Variants 1:19 Function

Fig. 1. Our approach to generate correctly rounded elementary functions for a target representation ( T ).The math library is implemented in representation H . The goal is to synthesize a polynomial P ( x ′ ) usinglinear programming such that the final result after range reduction and output compensation is the correctlyrounded result of f ( x ) in T . (1) For each input x in T , we compute the correctly rounded value of f ( x ) (denotedas y ) using an oracle. (2) Based on y , we identify an interval ( [ l , h ] ) where all values in the interval rounds to y . (3) Then, we compute the reduced input x ′ using range reduction and the reduced interval ( [ l ′ , h ′ ] ) suchthat when the output of the polynomial on the reduced input x ′ is adjusted ( i.e. , output compensation), itproduces the result for the original input and it is in [ l , h ] . (4) Finally, we synthesize P ( x ′ ) that produces avalue in the reduced interval [ l ′ , h ′ ] for each reduced input x ′ . Our approach.

This paper proposes a novel methodology to generate correctly rounded imple-mentations of elementary functions by framing it as a linear programming problem. In contrast toprior approaches that generate polynomials by minimizing the error compared to the real value ofan elementary function f ( x ) , we propose to generate polynomials that directly approximate thecorrectly rounded value of f ( x ) inspired by the Minefield approach [Gustafson 2020]. Specifically,we identify an interval of values for each input that will result in a correctly rounded output anduse that interval to generate the polynomial approximation. For each input x i , we use an oracle togenerate an interval [ l i , h i ] such that all real values in this interval rounds to the correctly roundedvalue of f ( x i ) . Using these intervals, we can subsequently generate a set of constraints, whichis given to a linear programming solver, to generate a polynomial that computes the correctlyrounded result for all inputs. The interval [ l i , h i ] for correctly rounding the output of input x i islarger than [ f ( x i ) − ϵ , f ( x i ) + ϵ ] where ϵ is the maximum error of the polynomial generated usingprior methods. Hence, our approach has larger freedom to generated polynomials that producecorrectly rounded results and also provide better performance. Handling range reduction.

Typically, generating polynomials for a small input domain iseasier than a large input domain. Hence, the input is reduced to a smaller domain with rangereduction. Subsequently, polynomial approximation is used for the reduced input. The resultingvalue is adjusted with output compensation to produce the final output. For example, the inputdomain for loд ( x ) is ( , ∞) . Approximating this function with a polynomial is much easier over thedomain [ , ) when compared to the entire input domain ( , ∞) . Hence, we range reduce the input x into z using x = z ∗ e , where z ∈ [ , ) and e is an integer. We compute y ′ = loд ( z ) using ourpolynomial for the domain [ , ) . We compute the final output y using the range reduced output y ′ and the output compensation function, which is y = y ′ + e . Polynomial evaluation, range reduction,and output compensation are performed with finite precision representation ( e.g. , double ) andcan experience numerical errors. Our approach for generating correctly rounded outputs has toconsider the numerical error with output compensation. To account for rounding errors with rangereduction and output compensation, we constrain the output intervals that we generated for each Rutgers-DCS-TR, Vol. 1, No. Rutgers Computer Science Technical Report, Article 753. Publication date: July 2020. s E E E E sign exponent F F F F mantissa… … (a) Float s E E E E sign exponent F F F F mantissa… … (b) Bfloat16 s E E sign exponent F F mantissa (c) 5-bit floating point (FP5) s R R R |R| sign regime E exponent… F F F |F| fraction…Rguard E … E es (d) PositFig. 2. (a) The bit-string for a 32-bit FP format ( float ). (b) The bit-string for the Bfloat16 representation. (c)a 5-bit FP format used for illustration in the paper. It has 2 bits for the exponent and 2 bits for the fraction. (d)The bit pattern for a posit representation. input x in the entire input domain (see Section 4). When our approach generates a polynomial, it isguaranteed that the polynomial evaluation along with the range reduction and output compensationcan be implemented with finite precision to produce a correctly rounded result for all inputs of anelementary function f ( x ) . Figure 1 pictorially provides an overview of our methodology. OurLibm.

We have developed a collection of correctly rounded math library functions, which wecall OurLibm, for

Bfloat16 , Posits, and floating point using our proposed methodology. Concretely,OurLibm contains ten elementary functions for

Bfloat16 , six elementary functions for 16-bitposits, and loд ( x ) function in the domain [ , ) for a 32-bit float type. We have verified that ourimplementation produces the correctly rounded result for all inputs. In contrast, glibc ’s loд ( x ) function produces wrong results for more than a million inputs in the domain [ , ) . Our libraryfunctions for Bfloat16 are on average 2 . × faster than the glibc ’s double library and 1 . × fasterthan the glibc ’s float library. We also observed that using glibc ’s float library for Bfloat16 produces a wrong result.

Contributions.

This paper makes the following contributions. • Proposes a novel methodology that generates polynomials based on the correctly roundedvalue of an elementary function rather than minimizing the error between the real value andthe approximation. • Demonstrates that the task of generating polynomials with correctly rounded results can beframed as a linear programming problem while accounting for range reduction. • Demonstrates OurLibm, a library of elementary functions that produce correctly roundedresults for all inputs for various new alternatives to floating point such as

Bfloat16 andPosits. Our functions are on average 2 . × faster than state-of-the-art libraries. We provide background on the FP representation and its variants ( i.e. , Bfloat16 ), the posit repre-sentation, the state-of-the-art for developing math libraries, and a motivating example illustratinghow the use of existing libraries for new representations can result in wrong results.

The FP representation F n , | E | , which is specified in the IEEE-754 standard [Cowlishaw 2008], isparameterized by the total number of bits n and the number of bits for the exponent | E | . There arethree components in a FP bit-string: a sign bit s , | E | -bits to represent the exponent, and | F | -bits torepresent the mantissa F where | F | = n − − | E | . Figure 2(a) shows the FP format. If s =

0, then thevalue is positive. If s =

1, then the value is negative. The value represented by the FP bit-string is a

Rutgers-DCS-TR, Vol. 1, No. Rutgers Computer Science Technical Report, Article 753. Publication date: July 2020.

Novel Approach to Generate Correctly Rounded Math Libraries for New Floating Point Representations 753:5 normal value if the bit-string E , when interpreted as an unsigned integer, satisfies 0 < E < | E | − ( + F | F | ) × E − bias , where bias is 2 | E |− −

1. If E =

0, then the FP value is a denormal value. The value of the denormal value is ( F | F | ) × − bias .When E = | E | −

1, the FP bit-strings represent special values. If F =

0, then the bit-string represents ±∞ depending on the value of s and in all other cases, it represents not-a-number (NaN).IEEE-754 specifies a number of default FP types: 16-bit ( F , or half ), 32-bit ( F , or float ), and64-bit ( F , or double ). Beyond the types specified in the IEEE-754 standard, recent extensionshave increased the dynamic range and/or precision. Bfloat16 [Tagliavini et al. 2018], F , , providesincreased dynamic range compared to FP’s half type. Figure 2(b) illustrates the Bfloat16 format.Recently proposed

TensorFloat32 [NVIDIA 2020], F , , increased both the dynamic range andprecision compared to the half type. Posit [Gustafson 2017; Gustafson and Yonemoto 2017] is a new representation that provides taperedprecision with a fixed number of bits. A posit representation, P n , es , is defined by the total numberof bits n and the maximum number of bits for the exponents es . A posit bit-string consists of fivecomponents (see Figure 2(d)): a sign bit s , a number of regime bits R , a regime guard bit R , upto es -bits of the exponent E , and fraction bits F . When the regime bits are not used, they can bere-purposed to represent the fraction, which provides tapered precision. Value of a posit bit-string.

The first bit is a sign bit. If s =

0, then the value is positive. If s = R , R , and E together are used to representthe exponent of the final value. After the sign bit, the next 1 ≤ | R | ≤ n − R . Regime bits consist of consecutive 1’s (or 0’s) and are only terminated if | R | = n − R ). The regime bits represent thesuper exponent. Regime bits contribute useed r to the value of the number where useed = es and r = | R | − R consists of 1’s and r = −| R | if R consists of 0’s.If 2 + | R | < n , then the next min { es , n − − | R |} bits represent the exponent bits. If | E | < es , then E is padded with 0’s to the right until | E | = es . These | es | -bits contribute 2 E to the value of thenumber. Together, the regime and the exponent bits of the posit bit-string contribute useed r × E tothe value of the number. If there are any remaining bits after the es -exponent bits, they representthe fraction bits F . The fraction bits are interpreted like a normal FP value, except the length of F can vary depending on the number of regime bits. They contribute 1 + F | F | . Finally, the value v represented by a posit bit-string is, v = (− ) s × ( + F | F | ) × useed r × E = (− ) s × ( + F | F | ) × es × r + E There are two special cases. A bit-string of all 0’s represents 0. A bit-string of 1 followed by all0 s ’s represents Not-a-Real (NaR).

Example.

Consider the bit-string in the P , configuration. Here, useed = = . Also s = , R = , R = , E = , and F = . Hence, r = −| R | = −

4. The finalexponent resulting from the regime and the exponent bits is ( ) − × = − . The fraction value is1 . . × − . When a real number x cannot be represented in a target representation T , it has to be rounded to avalue v ∈ T . The FP standard defines a number of rounding modes but the default rounding mode Rutgers-DCS-TR, Vol. 1, No. Rutgers Computer Science Technical Report, Article 753. Publication date: July 2020. (0b00100) (0b00011) (0b00101) (0b00110) rounds to rounds to rounds to

Fig. 3. Illustration of Round to Nearest with ties to Even (RNE) rounding mode with our 5-bit FP representation(

FP5 ). There are two

FP5 values ( . and . ) adjacent to the real number . , but both . and . areequidistant from . . In this case, RNE mode specifies that . should round to . because the bitrepresentation of . ( ) is an even number when interpreted as an integer. Similarly, the real number . rounds to . and . rounds to . . is the round-to-nearest-tie-goes-to-even (RNE) mode. The posit standard also specifies RNE roundingmode with a minor difference that any non-zero value does not underflow to or overflow to NaR .We describe our approach with RNE mode but it is applicable to other rounding modes.In the RNE mode, the rounding function v = RN T ( x ) , rounds x ∈ R (Reals) to v ∈ T , such that x is rounded to the nearest representable value in T , i.e. ∀ v ′ ∈ T | x − v | ≤ | x − v ′ | . In the case of atie, where ∃ v , v ∈ T , v (cid:44) v such that | x − v | = | x − v | and ∀ v ′ ∈ T | x − v | ≤ | x − v ′ | , then x isrounded to v if the bit-string encoding the value v is an even number when interpreted as aninteger and to v otherwise. Figure 3 illustrates the RNE mode with a 5-bit FP representation fromFigure 2(c).The result of primitive operations in FP or any other representation experiences rounding errorwhen it cannot be exactly represented. Modern hardware and libraries produce correctly roundedresults for primitive operations. However, this rounding error can get amplified with a series ofprimitive operations because the intermediate result of each primitive operation must be rounded.As math libraries are also implemented with finite precision, numerical errors in the implementationshould also be carefully addressed. The state-of-the-art methods to approximate an elementary function f ( x ) for a target representation( T ) involves two steps. First, approximation theory ( e.g. , minimax methods) is used to develop afunction A R ( x ) that closely approximates f ( x ) using real numbers. Second, A R ( x ) is implementedin a finite precision representation that has higher precision than T . Generating A R ( x ) . Mathematically deriving A R ( x ) can be further split into three steps. First,identify inputs that exhibit special behavior ( e.g. , ±∞ ). Second, reduce the input domain to a smallerinterval, [ a ′ , b ′ ] , with range reduction techniques and perform any other function transformations.Third, generate a polynomial P ( x ) that approximates f ( x ) in the domain [ a ′ , b ′ ] .There are two types of special cases. The first type includes inputs that produce undefined valuesor ±∞ when mathematically evaluating f ( x ) . For example, in the case of f ( x ) = x , f ( x ) = ∞ if x = ∞ . The second type consists of interesting inputs for evaluating RN T ( f ( x )) . These cases includea range of inputs that produce interesting outputs such as RN T ( f ( x )) ∈ {±∞ , } . For example, whileapproximating f ( x ) = x for Bfloat16 ( B ), all values x ∈ (−∞ , − . ] produce RN B ( x ) = x ∈ [− . · · · × − , . · · · × − ] produce RN B ( x ) =

1, and x ∈ [ . , ∞) produces RN B ( x ) = ∞ . These properties are specific to each f ( x ) and T . Range reduction.

It is mathematically simpler to approximate f ( x ) for a small domain of inputs.Hence, most math libraries use range reduction to reduce the entire input domain into a smallerdomain before generating the polynomial. Given an input x ∈ [ a , b ] where [ a , b ] ⊆ T , the goal ofrange reduction is to reduce the input x to x ′ ∈ [ a ′ , b ′ ] , where [ a ′ , b ′ ] ⊂ [ a , b ] . We represent this Rutgers-DCS-TR, Vol. 1, No. Rutgers Computer Science Technical Report, Article 753. Publication date: July 2020.

Novel Approach to Generate Correctly Rounded Math Libraries for New Floating Point Representations 753:7 process of range reduction with x ′ = RR ( x ) . Then, the polynomial P approximates the output y ′ forthe range reduced input ( i.e. , y ′ = P ( x ′ ) ). The output ( y ′ ) of the range reduced input ( x ′ ) has to becompensated to produce the output for the original input ( x ). The output compensation function, OC ( y ′ , x ) , produces the final result by compensating the range reduced output y ′ based on therange reduction performed for input x .For example, consider the function f ( x ) = loд ( x ) where the input domain is defined over ( , ∞) .One way to range reduce the original input is to use the mathematical property loд ( a × b ) = loд ( a ) + b . We decompose the input x as x = x fi × e where x fi ∈ [ , ) and e is an integer.Approximating loд ( x ) is equivalent to approximating loд ( x fi × e ) = loд ( x fi ) + e . Thus, we canrange reduce the original input x ∈ ( , ∞) into x fi ∈ [ , ) . Then, we approximate loд ( x fi ) using P ( x fi ) , which needs to only approximate loд ( x ) for the input domain [ , ) . To produce the outputof loд ( x ) , we compensate the output of the reduced input by computing P ( x ′ ) + e , where e isdependent on the range reduction of x . Polynomial approximation P ( x ) . A common method to approximate an elementary function f ( x ) is with a polynomial function, P ( x ) , which can be implemented with addition, subtraction,and multiplication operations. Typically, P ( x ) for math libraries is generated using the minimaxapproximation technique, which aims to minimize the maximum error, or L ∞ -norm, || P ( x ) − f ( x )|| ∞ = sup x ∈[ a , b ] | P ( x ) − f ( x )| where sup represents the supremum of a set. The minimax approach is attractive because theresulting P ( x ) has a bound on the error ( i.e. , | P ( x − I ) − f ( x i )| ). The most well-known minimaxapproximation method is the Remez algorithm [Remes 1934]. Both CR-LIBM [Daramy et al. 2003]and Metalibm [Kupriianova and Lauter 2014] use a modified Remez algorithm to produce polynomialapproximations [Brisebarre and Chevillard 2007]. Implementation of A R ( x ) with finite precision. Finally, mathematical approximation A R ( x ) is implemented in finite precision to approximate f ( x ) . This implementation typically uses ahigher precision than the intended target representation. We use A H ( x ) to represent that A R ( x ) isimplemented in a representation with higher precision ( H ) where T ⊂ H . Finally, the result of theimplementation A H ( x ) is rounded to the target representation T . An approximation of an elementary function f ( x ) is defined to be a correctly rounded approximationif for all inputs x i ∈ T , it produces RN T ( f ( x i )) . There are two major challenges in creating a correctlyrounded approximation. First, A H ( x ) incurs error because P ( x ) is an approximation of f ( x ) . Second,the evaluation of A H ( x ) has numerical error because it is implemented in a representation withfinite precision ( i.e. , H ). Hence, the rounding of RN T ( A H ( x )) can result in a value different from RN T ( f ( x )) , even if A H ( x ) is arbitrarily close to f ( x ) for some x ∈ T .As A R ( x ) uses a polynomial approximation of f ( x ) , there is an inherent error of | f ( x )− A R ( x )| > A H ( x ) experiences an error of | A H ( x ) − A R ( x )| >

0. It is not possible toreduce both errors to 0. The error in approximating the polynomial can be reduced by using apolynomial of a higher degree or a piece-wise polynomial. The numerical error in the evaluation of A H ( x ) can be reduced by increasing the precision of H . Typically, library developers make trade-offsbetween error and the performance of the implementation.Unfortunately, there is no known general method to analyze and predict the bound on the errorfor A H ( x ) that guarantees RN T ( A H ( x )) = RN T ( f ( x )) for all x because the error may need to bearbitrarily small. This problem is widely known as table-maker’s dilemma [Kahan 2004]. It states Rutgers-DCS-TR, Vol. 1, No. Rutgers Computer Science Technical Report, Article 753. Publication date: July 2020. b2 = 0.9609375b1 = 0.95703125 ) Fig. 4. Using a correctly rounded 32-bit FP math library to approximate x for Bfloat16 results in wrongresults. Horizontal axis represents a real number line. Given an input x = − . that is exactlyrepresentable in Bfloat16 , b and b represent the two closest Bfloat16 values to the real value of x . Thecorrectly rounded Bfloat16 value is b (black star). When we use the 32-bit FP library to compute x , itproduces the value shown with red diamond, which then rounds to b producing an incorrect result. that there is no general method to predict the amount of precision in H such that the result iscorrectly rounded for T . An alternative to developing math libraries for new representations is to use existing libraries. Wecan convert the input x ∈ T to x ′ = RN T ′ ( x ) , , where T is the representation of interest and T ′ isthe representation that has a math library available ( e.g. , double ). Subsequently, we can use a mathlibrary for T ′ and round the result back to T . This strategy is appealing if a correctly rounded mathlibrary for T ′ exists and T ′ has significantly more precision bits than T .However, using a correctly rounded math library designed for T ′ to approximate f ( x ) for T canproduce incorrect results for values in T . We illustrate this behavior by generating an approximationfor the function f ( x ) = x in the Bfloat16 ( B ) representation (Figure 4). Let’s consider the input x = − . ∈ B . The real value of f ( x ) ≈ . . . . (black circle in Figure 4).This oracle result cannot be exactly represented in Bfloat16 and must be rounded. There are two

Bfloat16 values adjacent to f ( x ) , b = . b = . b is closer to f ( x ) ,the correctly rounded result is RN B ( x ) = b , which is represented by a black star in Figure 4.If we use the correctly rounded float math library to approximate 10 x , we get the value, y ′ = . float , y ′ is a correctly rounded result, i.e. y ′ = RN F , ( x ) = . y ′ (cid:60) B , we round y ′ to Bfloat16 based on the rounding rule, RN B ( y ′ ) = b . Therefore, the float math library roundsthe result to b but the correctly rounded result is RN B ( x ) = b . Summary.

Approximating an elementary function for representation T using a math librarydesigned for a higher precision representation T ′ does not guarantee a correctly rounded result.Further, the math library for T ′ probably requires higher accuracy than the one for T . Hence, ituses a higher degree polynomial, which causes it to be slower than the math library tailored for T . We provide a high-level overview of our methodology to generate correctly rounded math libraries.We will illustrate this methodology with an end-to-end example that creates correctly roundedresults for ln ( x ) with FP5 ( i.e. , a 5-bit FP type shown in Figure 2(c)). Given an elementary function f ( x ) and a target representation T , our goal is to synthesize apolynomial that when used with range reduction ( RR ) and output compensation ( OC ) functionproduces the correctly rounded result for all inputs in T . The evaluation of the polynomial, range Rutgers-DCS-TR, Vol. 1, No. Rutgers Computer Science Technical Report, Article 753. Publication date: July 2020.

Novel Approach to Generate Correctly Rounded Math Libraries for New Floating Point Representations 753:9 reduction, and output compensation are implemented in representation H , which has higherprecision than T .Our methodology for generating correctly rounded elementary functions is shown in Figure 1.Our methodology consists of four steps. First, we use an oracle ( i.e. , MPFR [Fousse et al. 2007] witha large number of precision bits) to compute the correctly rounded result of the function f ( x ) foreach input x ∈ T . In this step, a small sample of the entire input space can be used rather thanusing all inputs for a type with a large input domain.Second, we identify an interval [ l , h ] around the correctly rounded result such that any valuein [ l , h ] rounds to the correctly rounded result in T . We call this interval the rounding interval .Since the eventual polynomial evaluation happens in H , the rounding intervals are also in the H representation. The internal computations of the math library evaluated in H should produce avalue in the rounding interval for each input x .Third, we employ range reduction to transform input x to x ′ . The generated polynomial willapproximate the result for x ′ . Subsequently, we have to use an appropriate output compensationcode to produce the final correctly rounded output for x . Both range reduction and output com-pensation happen in the H representation and can experience numerical errors. These numericalerrors should not affect the generation of correctly rounded results. Hence, we infer intervals forthe reduced domain so that the polynomial evaluation over the reduced input domain produces thecorrect results for the entire domain. Given x and its rounding interval [ l , h ] , we can compute thereduced input x ′ with range reduction. The next task before polynomial generation is identifyingthe reduced rounding interval for P ( x ′ ) such that when used with output compensation it producesthe correctly rounded result. We use the inverse of the output compensation function to identifythe reduced interval [ l ′ , h ′ ] . Any value in [ l ′ , h ′ ] when used with the implementation of outputcompensation in H produces the correctly rounded results for the entire domain.Fourth, we synthesize a polynomial of a degree d using an arbitrary precision linear programming(LP) solver that satisfies the constraints ( i.e. , l ′ ≤ P ( x ′ ) ≤ h ′ ) when given a set of inputs x ′ . Sincethe LP solver produces coefficients for the polynomial in arbitrary precision, it is possible that someof the constraints will not be satisfied when evaluated in H . In such cases, we refine the reducedintervals for those inputs whose constraints are violated and repeat the above step. If the LP solveris not able to produce a solution, then the developer of the library has to either increase the degreeof the polynomial or reduce the input domain.If the inputs were sampled in the first step, we check whether the generated polynomial producesthe correctly rounded result for all inputs. If it does not, then the input is added to the sampleand the entire process is repeated. At the end of this process, the polynomial along with rangereduction and output compensation when evaluated in H produces the correctly rounded outputsfor all inputs in T . ln ( x ) for FP5 We provide an end-to-end example of our approach by creating a correctly rounded result of ln ( x ) for the FP5 representation shown in Figure 2(c) with the RNE rounding mode. The ln ( x ) function isdefined over the input domain ( , ∞) . There are 11 values ranging from 0 .

25 to 3 . FP5 within ( , ∞) . We show the generation of the polynomial with FP5 for pedagogical reasons. With

FP5 , it isbeneficial to create a pre-computed table of correctly rounded results for the 11 values.Our strategy is to approximate ln ( x ) by using loд ( x ) . Hence, we perform range reduction andoutput compensation using the properties of logarithm: ln ( x ) = loд ( x ) loд ( e ) and loд ( x × y z ) = loд ( x ) + zloд ( y ) . We decompose the input x as x = x ′ × e where x ′ is the fractional value represented bythe mantissa, i.e. x ′ ∈ [ , ) , and e is the exponent of the value. We use loд ( x ) = loд ( x ′ ) + eloд ( e ) for our Rutgers-DCS-TR, Vol. 1, No. Rutgers Computer Science Technical Report, Article 753. Publication date: July 2020. ln ( x )5-bit FP resultUpper boundLower bound (a) (b) − . · · · ≤ P ( . ) ≤ . . . . . · · · ≤ P ( . ) ≤ . . . . . · · · ≤ P ( . ) ≤ . . . . . · · · ≤ P ( . ) ≤ . . . . (c)  − . . . . . . . . . . . . . . . .  ≤  . . . . . . . .  (cid:20) c c (cid:21) ≤  . . . . . . . . . . . . . . . .  (d) P ( x ) = c + c xc = − . . . . c = . . . . (e) P ( x ) = c + c x Output range for P(1.0)Output range for P(1.25)Output range for P(1.5)Output range for P(1.75) (f)Fig. 5. Our approach for ln ( x ) with FP5 . (a) For each input x in FP5 , we accurately compute the correctlyrounded result (black circle) and identify intervals around the result so that all values round to it. (b) For eachinput and corresponding interval computed in (a), we perform range reduction to obtain the reduced input.The number below a value on the x-axis represents the reduced input. The reduced interval to account forrounding errors in output compensation is also shown. Multiple distinct inputs can map to the same reducedinput after range reduction (intervals with the same color). In such scenarios, we combine the reduce intervalsby computing the common region in the intervals (highlighted in bold for each color with dotted lines). (c) Theset of constraints that must be satisfied by the polynomial for the reduced input. (d) LP formulation for thegeneration of a polynomial of degree one. (e) The coefficients generated by the LP solver for the polynomial.(f) Generated polynomial satisfies the combined intervals. range reduction. We construct the range reduction function RR ( x ) and the output compensationfunction OC ( y ′ , x ) as follows, RR ( x ) = f r ( x ) , OC ( y ′ , x ) = y ′ + exp ( x ) loд ( e ) Rutgers-DCS-TR, Vol. 1, No. Rutgers Computer Science Technical Report, Article 753. Publication date: July 2020.

Novel Approach to Generate Correctly Rounded Math Libraries for New Floating Point Representations 753:11 ( )[ ] (0b00100) (0b00011) (0b00101) (0b00110)

Fig. 6. This figure shows the real number line and a number of adjacent

FP5 values, 0.75, 1.0, 1.25, and 1.5.Any real value in the blue interval [ . , . ] , rounds to 1.0 in FP5 with RNE rounding mode. Similarly, anyvalue in the green interval ( . , . ) rounds to 1.25 in FP5 . where f r ( x ) returns the fractional part of x ( i.e. , x ′ ∈ [ , ) ) and exp ( x ) returns the exponent of x ( i.e. , e ). Then, our polynomial approximation P ( x ′ ) should approximate the function loд ( x ) for thereduced input domain x ′ ∈ [ , ) . The various steps of our approach are illustrated in Figure 5. Step 1: Identifying the correctly rounded result.

There are a total of 11

FP5 values in theinput domain of loд ( x ) , ( , ∞) . These values are shown on the x-axis in Figure 5(a). Other valuesare special cases. They are captured by the precondition for this function ( i.e. , x = x = ∞ ). Ourgoal is to generate the correctly rounded results for these 11 FP5 values. For each of these 11 inputs x , we use an oracle ( i.e. , MPFR math library) to compute y , which is the correctly rounded value of ln ( x ) . Figure 5(a) shows the correctly rounded result for each input as a black dot. Step 2: Identifying the rounding interval [ l , h ] . The range reduction, output compensation,and polynomial evaluation are performed with the double type. The double result of the evaluationis rounded to

FP5 to produce the final result. The next step is to find a rounding interval [ l , h ] inthe double type for each output. Figure 5(a) shows the rounding interval for each FP5 output usingthe blue (upper bound) and orange (lower bound) bracket.Let us suppose that we want to compute the rounding interval for y = .

0, which is the correctlyrounded result of ln ( . ) . To identify the lower bound l of the rounding interval for y = .

0, wefirst identify the preceding

FP5 value, which is 0 .

75. Then we find a value v between 0 .

75 and 1 . v rounds to 1 .

0. In our case, v = . h , we identify the FP5 value succeeding 1 .

0, which is1 .

25. We find a value v such that any value less than or equal to v rounds to 1 .

0. In our case, theupper bound is h = . y = . [ . , . ] . Figure 6 showsthe intervals for a small subset of FP5 . Step 3-a: Computing the reduced input x ′ and the reduced interval [ l ′ , h ′ ] . We performrange reduction and generate a polynomial that computes loд ( x ) for all reduced inputs in [ , ) .The next step is to identify the reduced input and the rounding interval for the reduced input suchthat it accounts for any numerical error in output compensation. Figure 5(b) shows the reducedinput (number below the value on the x-axis) and the reduced interval for each input.To identify the reduced rounding interval, we use the inverse of the output compensationfunction, which exists if OC is continuous and bijective over real numbers. For example, for theinput x = . = . × , the output compensation function is, OC ( y ′ , . ) = y ′ + loд ( e ) The inverse is OC − ( y , . ) = yloд ( e ) − [ l ′ , h ′ ] by computing l ′ = OC − ( l , x ) and h ′ = OC − ( h , x ) . Then, we verify that the outputcompensation result of l ′ ( i.e. , OC ( l ′ , x ) ) and h ′ ( i.e. , OC ( h ′ , x ) ), when evaluated in double lies in [ l , h ] . If it does not, then we iteratively refine the reduced interval by restricting [ l ′ , h ′ ] to a smaller Rutgers-DCS-TR, Vol. 1, No. Rutgers Computer Science Technical Report, Article 753. Publication date: July 2020. interval until both OC ( l ′ , x ) and OC ( h ′ , x ) evaluated in double results lie in [ l , h ′ ] . The vertical barsin Figure 5(b) show the reduced input for each x and its corresponding reduced rounding interval. Step 3-b: Combining the reduced intervals.

Multiple inputs from the original input domaincan map to the same reduced input after range reduction. In our example, both x = .

25 and x = . x ′ = .

25. However, the reduced intervals that we compute for x and x are [ l ′ , h ′ ] and [ l ′ , h ′ ] , respectively. They are not exactly the same. In Figure 5(b), the reduced intervalscorresponding to the original inputs that map to the same reduced input are colored with the samecolor. The reduced interval for x = .

25 and x = . x indicates that P ( . ) must produce a value in [ l ′ , h ′ ] such that thefinal result, after evaluating the output compensation function in double , is the correctly roundedvalue of ln ( . ) . The reduced interval for x indicates that P ( . ) must produce a value in [ l ′ , h ′ ] such that the final result is the correct value of ln ( . ) . To produce the correctly rounded resultfor both inputs x and x , P ( . ) must produce a value that is in both [ l ′ , h ′ ] and [ l ′ , h ′ ] . Thus,we combine all reduced intervals that correspond to the same reduced input by computing thecommon interval. Figure 5(b) shows the common interval for a given reduced input using a darkershade. At the end of this step, we are left with one combined interval for each reduced input. Step 4: Generating the Polynomial for the reduced input.

The combined intervals specifythe constraints on the output of the polynomial for each reduced input, which when used with outputcompensation in double results in a correctly rounded result for the entire domain. Figure 5(c)shows the constraints for P ( x ′ ) for each reduced input.To synthesize a polynomial P ( x ′ ) of a particular degree (the degree is 1 in this example), weencode the problem as a linear programming (LP) problem that solves for the coefficients of P ( x ′ ) .We look for a polynomial that satisfies constraints for each reduced input (Figure 5(d)). We use an LPsolver to solve for the coefficients and find P ( x ′ ) with the coefficients in Figure 5(e). The generatedpolynomial P ( x ′ ) satisfies all the linear constraints as shown in Figure 5(f). Finally, we also verifythat the generated polynomial when used with range reduction and output compensation producesthe correctly rounded results for all inputs in the original domain. Our goal is to create approximations for an elementary function f ( x ) that produces correctlyrounded results for all inputs in the target representation ( T ). Definition 4.1.

A function that approximates an elementary function f ( x ) is a correctly roundedfunction for the target representation T if it produces y = RN T ( f ( x )) for all x ∈ T .Intuitively, the result produced by the approximation should be same as the result obtained when f ( x ) is evaluated with infinite precision and then rounded to the target representation. It may bebeneficial to develop precomputed tables with correctly rounded results of elementary functionsfor small data types ( e.g. , FP5 ). However, it is infeasible (due to memory overheads) to store suchtables for every elementary function even with modestly sized data types.We propose a methodology that produces polynomial approximation and stores a few coefficientsfor evaluating the polynomial. There are three main challenges in generating a correctly roundedresult with polynomial approximations. First, we have to generate polynomial approximations thatproduce the correct result and are efficient to evaluate. Second, the polynomial approximation shouldconsider rounding errors with range reduction and output compensation that are implemented insome finite precision representation. Third, the polynomial evaluation also is implemented withfinite precision and can experience numerical errors.We will use A H ( x ) to represent the approximation of the elementary function f ( x ) producedwith our methodology while using a representation H to perform polynomial evaluation, range Rutgers-DCS-TR, Vol. 1, No. Rutgers Computer Science Technical Report, Article 753. Publication date: July 2020.

Novel Approach to Generate Correctly Rounded Math Libraries for New Floating Point Representations 753:13 Function

CorrectlyRoundedPoly( f , T , H , X , RR H , OC H , d ) : L ← CalcRndIntervals( f , T , H , X ) if L = ∅ then return (false, DNE) L ′ ← CalcRedIntervals( L , H , RR H , OC H ) if L ′ = ∅ then return (false, DNE) Λ ← CombineRedIntervals( L ′ ) if Λ = ∅ then return (false, DNE) S , P H ← GeneratePoly( Λ , d ) if S = true then return (true, P H ) else return (false, DNE) Input Description: f : The oracle that computes the result of f ( x ) in arbitrary precision. T : Target representation of math library. H : Higher precision representation. X : Input domain of A H ( x ) . RR H : The range reduction function. OC H : The output compensation function. d : The degree of polynomial to generate. Fig. 7. Our approach to generate a polynomial approximation P H ( x ) that produces the correctly roundedresult for all inputs. On successfully finding a polynomial, it returns (true, P H ). Otherwise, it returns (false,DNE) where DNE means that the polynomial Does-Not-Exist. Functions, CalcIntervals , CalcRedIntervals , CombineRedIntervals , and

GeneratePoly are shown in Figure 8, Figure 9, and Figure 10, respectively. reduction, and output compensation. The result of A H ( x ) is rounded to T to produce the final result.Hence, A H ( x ) is composed of three functions: A H ( x ) = OC H ( P H ( RR H ( x )) , x ) where y ′ = P H ( x ′ ) is thepolynomial approximation function, x ′ = RR H ( x ) is the range reduction function, and OC H ( y ′ , x ) isthe output compensation function. All three functions, RR H ( x ) , P H ( x ′ ) , and OC H ( y ′ , x ) are evaluatedin H . Given RR H ( x ) and OC H ( y ′ , x ) for a particular elementary function f ( x ) , the task of creating anapproximation that produces correctly rounded results involves synthesizing a polynomial P H ( x ) such that final result generated by A H ( x ) is a correctly rounded result for all inputs x .Our methodology for identifying A H ( x ) that produces correctly rounded outputs is pictoriallyshown in Figure 1. In our approach, we assume the existence of an oracle, which generates the correctreal result, to generate the polynomial approximation for a target representation T . We can useexisting MPFR libraries with large precision as an oracle. Typically, the polynomial approximation isclosely tied to techniques used for range reduction and the resulting output compensation. We alsorequire that the output compensation function ( OC ) is invertible ( i.e. , continuous and bijective). Thedegree of the polynomial is an input provided by the developer of the math library. The top-levelalgorithm shown in Figure 7 identifies a polynomial approximation of degree d . If it is unable tofind one, the developer of the math library should explore one with a higher degree.Our approach has four main steps. First, we compute y ∈ T , the correctly rounded result of f ( x ) , i.e. y = RN T ( f ( x )) for each input x (or a sample of the inputs for a large data type) using our oracle.Then, we identify the rounding interval I = [ l , h ] ⊆ H where all values in the interval rounds to y . The pair ( x , I ) specifies that A H ( x ) must produce a value in I such that A H ( x ) rounds to y . Thefunction CalcRndIntervals in Figure 7 returns a list L that contains a pair ( x , I ) for all inputs x .Second, we compute the reduced input x ′ using range reduction and a reduced interval I ′ = [ l ′ , h ′ ] for each pair ( x , I ) ∈ L . The reduced interval I ′ = [ l ′ , h ′ ] ensures that any value in I ′ when usedwith output compensation code results in a value in I . This pair ( x ′ , I ′ ) specifies the constraints forthe output of the polynomial approximation P H ( x ′ ) so A H ( x ) rounds to the correctly rounded result.The function CalcRedIntervals returns a list L ′ with such reduced constraints for all inputs x .Third, multiple inputs from the original input domain will map to the same input in the reduceddomain after range reduction. Hence, there will be multiple reduced constraints for each reducedinput x ′ . The polynomial approximation, P H ( x ′ ) , must produce a value that satisfies all the reducedconstraints to ensure that A H ( x ) produces the correct value for all inputs when rounded. Thus,we combine all reduced intervals for each unique reduced input x ′ and produce the pair ( x ′ , Ψ ) Rutgers-DCS-TR, Vol. 1, No. Rutgers Computer Science Technical Report, Article 753. Publication date: July 2020. Function

CalcRndIntervals( f , T , H , X ) : L ← ∅ foreach x ∈ X do y ← RN T ( f ( x )) I ← GetRndInterval( y , T , H ) if I = ∅ then return ∅ L ← L ∪ {( x , I )} end return L Function

GetRndInterval( y , T , H ) : t l ← GetPrecVal( y , T ) l ← min { v ∈ H | v ∈ [ t l , y ] and RN T ( v ) = y } t u ← GetSuccVal( y , T ) h ← max { v ∈ H | v ∈ [ y , t u ] and RN T ( v ) = y } return [ l , h ] Fig. 8. For each input x ∈ X , CalcRndIntervals identifies the interval I = [ l , h ] where all values in I roundsto the correctly rounded result. The GetRndInterval function takes the correctly rounded result y and returnsthe interval I ⊆ H where all values in I round to y . GetPrecValue ( y , T ) returns the value preceeding y in T . GetSuccValue ( y , T ) returns the value succeeding y in T . where Ψ represents the combined interval. Function CalcRedIntervals in Figure 7 returns a list Λ containing the constraint pair ( x ′ , Ψ ) for each unique reduced input x ′ . Finally, we generate apolynomial of degree d using linear programming so that all constraints ( x ′ , Ψ ) ∈ Λ are satisfied.Next, we describe these steps in detail. The first step in our approach is to identify the values that A H ( x ) must produce so that the roundedvalue of A H ( x ) is equal to the correctly rounded result of y = f ( x ) , i.e. RN T ( A H ( x )) = RN T ( y ) , foreach input x ∈ X . Our key insight is that it is not necessary to produce the exact value of y to producea correctly rounded result . It is sufficient to produce any value in H that rounds to the correct result.For a given rounding mode and an input, we are looking for an interval I = [ l , h ] around the oracleresult that produces the correctly rounded result. We call this the rounding interval.Given an elementary function f ( x ) and an input x ∈ X , define a interval I that is representablein H such that RN T ( v ) = RN T ( f ( x )) for all v ∈ I . If A H ( x ) ∈ I , then rounding the result of A H ( x ) to T produces the correctly rounded result ( i.e. , RN T ( A ( x )) = RN T ( f ( x )) ). For each input x , if A H ( x ) can produce a value that lies within its corresponding rounding interval, then it will produce acorrectly rounded result. Thus, the pair ( x , I ) for each input x defines constraints on the output of A H ( x ) such that RN T ( A H ( x )) is a correctly rounded result.Figure 8 presents our algorithm to compute constraints ( x , I ) . For each input x in our inputdomain X , we compute the correctly rounded result of f ( x ) using an oracle and produce y . Next,we compute the rounding interval of y where all values in the interval rounds to y . The roundinginterval can be computed as follows. First, we identify t l , the preceding value of y in T (line 11 inFigure 8). Then we find the minimum value l ∈ H between t l and y where l rounds to y (line 12in Figure 8). Similarly for the upper bound, we identify t u , the succeeding value of y in T (line 13in Figure 8), and find the maximum value h ∈ H between y and t u where h rounds to y (line 14in Figure 8). Then, [ l , h ] is the rounding interval of y and all values in [ l , h ] round to y . Thus, thepair ( x , [ l , h ]) specifies a constraint on the output of A H ( x ) to produce the correctly rounded resultfor input x . We generate such constraints for each input in the entire domain (or for a sample ofinputs) and produce a list of such constraints (line 7-9 in Figure 8). After the previous step, we have a list of constraints, ( x , I ) , that need to be satisfied by our approxi-mation A H ( x ) to produce correctly rounded outputs. If we do not perform any range reduction, Rutgers-DCS-TR, Vol. 1, No. Rutgers Computer Science Technical Report, Article 753. Publication date: July 2020.

Novel Approach to Generate Correctly Rounded Math Libraries for New Floating Point Representations 753:15 Function

CalcRedIntervals( L , H , RR H , OC H ) : L ′ ← ∅ foreach ( x , [ l , h ]) ∈ L do x ′ ← RR H ( x ) if OC H is an increasing function then [ α , β ] ← [ OC − H ( l , x ) , OC − H ( h , x )] else [ α , β ] ← [ OC − H ( h , x ) , OC − H ( l , x )] while OC H ( α , x ) (cid:60) [ l , h ] do α ← GetSuccVal( α , H ) if α > β then return ∅ end while OC H ( β , x ) (cid:60) [ l , h ] do β ← GetPrecVal( β , H ) if α > β then return ∅ end L ′ ← L ′ ∪ {( x ′ , [ α , β ])} end return L ′ Function

CombineRedIntervals( L ′ ) : ˆ X ← { x ′ | ( x ′ , I ′ ) ∈ L ′ } Λ ← ∅ foreach ˆ x ∈ ˆ X do Ω ← { I ′ | ( ˆ x , I ′ ) ∈ L ′ Ψ ← (cid:209) I ′ ∈ Ω I ′ if Ψ = ∅ then return ∅ Λ ← Λ ∪ {( ˆ x , Ψ )} end return Λ Fig. 9.

CalRedIntervals computes the reduced input x ′ and the reduced interval I ′ for each constraintpair ( x , I ) in L . The reduced constraint pair ( x ′ , I ′ ) specifies the bound on the output of P H ( x ′ ) such that itproduces the correct value for the input x . CombineRedIntervals combines any reduced constraints withthe same reduced input, i.e. ( x ′ , I ′ ) and ( x ′ , I ′ ) where x ′ = x ′ into a single combined constraint ( x , Ψ ) bycomputing the common interval range in I ′ and I ′ . then we can generate a polynomial that satisfies these constraints. However, it is necessary toperform range reduction ( RR ) in practice to reduce the complexity of the polynomial and to improveperformance. Range reduction is accompanied by output compensation ( OC ) to produce the finaloutput. Hence, A H ( x ) = OC H ( P H ( RR H ( x )) , x ) . Our goal is to synthesize a polynomial P H ( x ′ ) thatoperates on the range reduced input x ′ and A ( x ) = OC H ( P H ( RR H ( x )) , x ) produces a value in I foreach input x , which rounds to the correct output.To synthesize this polynomial, we have to identify the reduced input and the reduced intervalfor an input x such that A H ( x ) produces a value in the rounding interval I corresponding to x . Thereduced input is available by applying range reduction x ′ = RR ( x ) . Next, we need to compute thereduced interval corresponding to x ′ . The output of the polynomial on the reduced input will be fedto the output compensation function to compute the output for the original input. For the reducedinput x ′ corresponding to the original input x , y ′ = P H ( x ′ ) , A H ( x ) = OC H ( y ′ , x ) , and A H ( x ) mustbe within the interval I for input x to produce a correct output. Hence, our high-level strategy is touse the inverse of the output compensation function to compute the reduced interval, which isfeasible when the output compensation function is continuous and bijective. In our experience, allcommonly used output compensation functions are continuous and bijective.However, the output compensation function is evaluated in H , which necessitates us to take anynumerical error in output compensation with H into account. Figure 9 describes our algorithm tocompute reduced constraint ( x ′ , I ′ ) for each ( x , I ) ∈ L when the output compensation is performedin H .To compute the reduced interval I ′ for each constraint pair ( x , [ l , h ]) ∈ L , we evaluate the values v = OC − H ( l , x ) and v = OC − H ( h , x ) and create an interval [ α , β ] = [ v , v ] if OC R ( y ′ , x ) is anincreasing function (lines 5-6 in Figure 9) or [ v , v ] if OC R ( y ′ , x ) is a decreasing function (line 7 in Rutgers-DCS-TR, Vol. 1, No. Rutgers Computer Science Technical Report, Article 753. Publication date: July 2020.

Figure 9). The interval [ α , β ] is a candidate for I ′ . Then, we verify that the output compensatedvalue of α is in [ l , h ] ( i.e. , I ). If it is not, we replace α with the succeeding value in H and repeatthe process until OC H ( α , x ) is in I (lines 8-11 in Figure 9). Similarly, we verify that the outputcompensated value of β is in [ l , h ] and repeatedly replace β with the preceding value in H if it is not(lines 12-15 in Figure 9). If α > β at any point during this process, then it indicates that there is nopolynomial P ( x ′ ) that can produce the correct result for all inputs. As there are only finitely manyvalues between [ α , β ] in H , this process terminates. In the case when our algorithm is not able tofind a polynomial, the user can provide either a different range reduction/output compensationfunction or increase the precision to be higher than H .If the resulting interval [ α , β ] (cid:44) ∅ , then I ′ = [ α , β ] is our reduced interval. The reduced constraintpair, ( x ′ , [ α , β ]) created for each ( x , I ) ∈ L specifies the constraint on the output of P H ( x ′ ) such that A H ( x ) ∈ I . Finally, we create a list L ′ containing such reduced constraints. Each reduced constraint ( x ′ i , I ′ i ) ∈ L ′ corresponds to a constraint ( x i , I i ) ∈ L . It specifies the boundon the output of P H ( x ′ i ) ( i.e. , P H ( x ′ i ) ∈ I ′ i should be satisfied), which ensures A H ( x i ) produces avalue in I i . Range reduction reduces the original input x i in the entire input domain of f ( x ) to areduced input x ′ i in the reduced domain. Hence, multiple inputs in the entire input domain can berange reduced to the same reduced input. More specifically, there can exist multiple constraints ( x , I ) , ( x , I ) , · · · ∈ L such that RR H ( x ) = RR H ( x ) = ˆ x . Consequently, L ′ can contains reducedconstraints ( ˆ x , I ′ ), ( ˆ x , I ′ ) · · · ∈ L ′ . The polynomial P H ( ˆ x ) must produce a value in I ′ to guaranteethat A H ( x ) ∈ I . It must also be within I ′ to guarantee A H ( x ) ∈ I . Hence, for each unique reducedinput ˆ x , P H ( ˆ x ) must satisfy all reduced constraints corresponding to ˆ x , i.e. P H ( ˆ x ) ∈ I ′ ∩ I ′ .The function CombineRedIntervals in Figure 9 combines all reduced constraints with the samereduced input by identifying the common interval ( Ψ in line 24 in Figure 9). If such a commoninterval does not exist, then it is infeasible to find a single polynomial P H ( x ′ ) that produces correctoutputs for all inputs before range reduction. Otherwise, we create a pair ( ˆ x , Ψ ) for each uniquereduced interval ˆ x and produce a list of constraints Λ (line 26 in Figure 9). Each reduced constraint ( x ′ , [ l ′ , h ′ ]) ∈ Λ requires that P H ( x ′ ) satisfy the following condition: l ′ ≤ P H ( x ′ ) ≤ h ′ . This constraint ensures that when P H ( x ′ ) is combined with range reduction andoutput compensation, it produces the correctly rounded result for all inputs. When we are tryingto generate a polynomial of degree d , we can express each of the above constraints in the form: l ′ ≤ c + c x ′ + c ( x ′ ) + ... + c d ( x ′ ) d ≤ h ′ The goal is to find coefficients for the polynomial evaluated in H . Here, x ′ , l ′ and h ′ are constantsfrom perspective of finding the coefficients. We can express all constraints ( x ′ i , [ l ′ i , h ′ i ]) ∈ Λ in asingle system of linear inequalities as shown below, which can be solved using a linear programming(LP) solver.  l ′ l ′ ... l ′| Λ |  ≤  x ′ . . . ( x ′ ) d x ′ . . . ( x ′ ) d ... ... . . . ... x ′| Λ | . . . ( x ′| Λ | ) d   c c ... c d  ≤  h ′ h ′ ... h ′| Λ |  Given a system of inequalities, the LP solver finds a solution for the coefficients with real numbers.The polynomial when evaluated in real ( i.e. P R ( x ′ ) ) satisfies all constraints in Λ . However, numerical Rutgers-DCS-TR, Vol. 1, No. Rutgers Computer Science Technical Report, Article 753. Publication date: July 2020.

Novel Approach to Generate Correctly Rounded Math Libraries for New Floating Point Representations 753:17 Function

GeneratePoly( Λ , H d ) : ϒ ← Λ while true do C ← LPSolve( ϒ , d ) if C = ∅ then return (false, DNE) P H ← CreateP( C , d , H ) ϒ ← Verify( P H , Λ , ϒ , H ) if ϒ = ∅ then return (true, P H ) end Function

Verify( P H , Λ , ϒ , H ) : Z ← {( x ′ , Ψ , ψ ) | ( x ′ , Ψ ) ∈ Λ , ( x ′ , ψ ) ∈ ϒ } foreach ( x ′ , [ l ′ , h ′ ] , [ σ , µ ]) ∈ Z do if P H ( x ′ ) < l ′ then ϒ ← ϒ − {( x ′ , [ σ , µ ])} σ ′ ← GetSuccVal( σ , H ) return ϒ ∪ {( x ′ , [ σ ′ , µ ])} else if P H ( x ′ ) > h ′ then ϒ ← ϒ − {( x ′ , [ σ , µ ])} µ ′ ← GetPrecVal( µ , H ) return ϒ ∪ {( x ′ , [ σ , µ ′ ])} end end return ∅ Fig. 10. The function

GeneratePoly generates a polynomial P H ( x ′ ) of degree d that satisfies all constraintsin Λ when evaluated in H . If it cannot generate such a polynomial, then it returns false . The function LPSolve solves for the real number coefficients of a polynomial P R ( x ) using an LP solver where P R ( x ) satisfies allconstraints in Λ when evaluated in real number. CreateP creates P H ( x ) that evaluates the polynomial P R ( x ) in H . The Verify function checks whether the generated polynomial P H ( x ) satisfies all constraints in Λ whenevaluated in H and refines the constraints to a smaller interval for each constraint that P H ( x ) does not satisfy. errors in polynomial evaluation in H can cause the result to not satisfy Λ . We propose a search-and-refine approach to address this problem. We use the LP solver to solve for the coefficients of P R ( x ′ ) that satisfies Λ and then check if P H ( x ′ ) that evaluates P R ( x ′ ) in H satisfies the constraints in Λ . If P H ( x ′ ) does not satisfy a constraint ( x ′ , [ l ′ , h ′ ]) ∈ Λ , then we refine the reduced interval [ l ′ , h ′ ] to asmaller interval. Subsequently, we use the LP solver to generate the coefficients of P R ( x ′ ) for therefined constraints. This process is repeated until either P H ( x ′ ) satisfies all reduced constraints in Λ or the LP solver determines that there is no polynomial that satisfies all the constraints.Figure 10 provides the algorithm used for generating the coefficients of the polynomial using theLP solver. ϒ tracks the refined constraints for P H ( x ′ ) during our search-and-refine process. Initially, ϒ is set to Λ (line 2 in Figure 10). Here, ϒ is used to generate the polynomial and Λ is used to toverify that the generated polynomial satisfies all constraints. If the generated polynomial does notsatisfy Λ , we restrict the intervals in ϒ .We use an LP solver to solve for the coefficients of the P R ( x ′ ) that satisfies all constraints in ϒ (line 4 in Figure 10). If the LP solver cannot find the coefficients, our algorithm concludes that itis not possible to generate a polynomial and terminates (line 5 in Figure 10). Otherwise, we create P H ( x ′ ) that evaluates P R ( x ′ ) in H by rounding all coefficients to H and perform all operations in H (line 6 in Figure 10). The resulting P H ( x ′ ) is a candidate for the correct polynomial for A H ( x ) .Next, we verify that P H ( x ′ ) satisfies all constraints in Λ (line 7 in Figure 10). If P H ( x ′ ) satisfies allconstraints in Λ , then our algorithm returns the polynomial. If there is a constraint ( x ′ , [ l ′ , h ′ ]) ∈ Λ that is not satisfied by P H ( x ′ ) , then we further restrict the interval ( x ′ , [ σ , µ ]) in ϒ corresponding tothe reduced input x ′ . If P H ( x ′ ) is smaller than the lower bound of the interval constraint in Λ ( i.e. l ′ ), then we restrict the lower bound of the interval constraint σ in ϒ to the value succeeding σ in H (lines 13-16 in Figure 10). This forces the next coefficients for P R ( x ′ ) that we generate usingthe LP solver to produce a value larger than l ′ . Likewise, if P H ( x ′ ) produces a value larger than theupper bound of the interval constraint in Λ ( i.e. h ′ ), then we restrict the upper bound of the intervalconstraint µ in ϒ to the value preceding µ in H (lines 17-20 in Figure 10). Rutgers-DCS-TR, Vol. 1, No. Rutgers Computer Science Technical Report, Article 753. Publication date: July 2020.

We repeat this process of generating a new candidate polynomial with the refined constraints ϒ until it satisfies all constraints in Λ or the LP solver determines that it is infeasible. If a constraint ( x ′ , [ σ , µ ]) ∈ ϒ is restricted to the point where σ > µ (or [ σ , µ ] = ∅ ), then the LP solver willdetermine that it is infeasible to generate the polynomial. When we are successful in generating apolynomial, then P H ( x ) used in tandem with range reduction and the output compensation in H ischecked to ascertain that it produces the correctly rounded results for all inputs. This section describes our prototype for generating correctly rounded elementary functions andOurLibm, a math library that we developed using our approach for

Bfloat16 , posit, and float data types. We present case studies for approximating elementary functions 10 x , ln ( x ) , loд ( x ) , and loд ( x ) with our approach for various types. We also evaluate the performance of our correctlyrounded elementary functions with state-of-the-art approximations. Prototype.

Our prototype for creating correctly rounded math libraries supports

Bfloat16 , Posit16 (16-bit posit type in the Posit standard [Gustafson 2017]), and the 32-bit float typein the FP representation. The user can provide custom range reduction and output compensationfunctions. The prototype uses the MPFR library [Fousse et al. 2007] with 2 ,

000 precision bits asthe oracle to compute the real result of f ( x ) and rounds it to the target representation. Althoughthere is no bound on the precision to compute the oracle result ( i.e. , Table-maker’s dilemma) withthe MPFR library, prior work has shown around 160 precision bits in the worst case is empiricallysufficient for the double representation [LefÃĺvre and Muller 2001]. Hence, we determine thatusing 2 ,

000 precision bits with the MPFR library is sufficient for the oracle result. The prototypeuses SoPlex [Gleixner et al. 2015, 2012], an exact rational LP solver as the arbitrary precision LPsolver for polynomial generation from constraints. Any arbitrary precision LP solver can be usedwith our prototype.

OurLibm.

We created OurLibm, a math library that contains correctly rounded elementaryfunctions for multiple data types. It contains 10 elementary functions for

Bfloat16 and 6 elementaryfunctions for

Posit16 . The library produces correctly rounded outputs for all inputs. To show thatour approach works for large data types, OurLibm also includes a correctly rounded loд ( x ) for the32-bit float type for inputs in [ , ) . Our approximation for loд ( x ) with the 32-bit float type iscreated by sampling a few oracle values rather than all the inputs. There are 2 float values in [ , ) .We initially sample random 10 ,

000 input points and generate the polynomial that produces thecorrectly rounded result for the sample inputs using our approach. Then, we verify that the generatedpolynomial produces the correctly rounded result for all inputs in [ , ) . We add any input that doesnot produce the correctly rounded result to the sample and re-generate the polynomial. We repeatthis process until the generated polynomial produces the correct result for all inputs. Our approachwill also produce the correctly rounded outputs for many other functions. We do not report it becausewe have not exhaustively tested it. OurLibm performs range reduction and output compensationusing the double type. We used state-of-the-art range reduction techniques for various elementaryfunctions. Additionally, we split the reduced domain into multiple disjoint smaller domains usingthe properties of specific elementary functions to generate efficient polynomials. We evaluateall polynomials using the Horner’s method, i.e. P ( x ) = c + x ( c + x ( c + . . . )) [Borwein andErdelyi 1995], which reduces the number of operations in polynomial evaluation. Supplementalmaterial provides the range reduction, output compensation, and the polynomial generated andthe coefficients for each function for each data type. Rutgers-DCS-TR, Vol. 1, No. Rutgers Computer Science Technical Report, Article 753. Publication date: July 2020.

Novel Approach to Generate Correctly Rounded Math Libraries for New Floating Point Representations 753:19

Table 1. (a) The list of

Bfloat16 functions used for our evaluation. The second column shows whetherOurLibm produces the correct result for all inputs. The third column and fourth column shows whether glibc ’s float and double library used with Bfloat16 produces the correct result for all inputs. We use ( ✓ )to indicate correctly rounded results and ✗ , otherwise. (b) The list of Posit16 functions used. The secondcolumn shows whether OurLibm produces the correct results for all inputs. The third column shows whetherthe functions in SoftPosit-Math produces correctly rounded results for all inputs.

N/A indicates that functionis not available in SoftPosit-Math. (c) The float function used. First column indicates whether OurLibmproduces the correctly rounded result for inputs x ∈ [ , ) . In the second and third column, we show whether glibc ’s float and math library produce the correct result for all inputs. Bfloat16

Functions UsingOurLibm UsingFloat mlib UsingDouble mlib ln ( x ) ✓ ✓ ✓ loд ( x ) ✓ ✓ ✓ loд ( x ) ✓ ✓ ✓ exp ( x ) ✓ ✓ ✓ exp ( x ) ✓ ✓ ✓ exp ( x ) ✓ ✗ ✓ sinpi ( x ) ✓ N/A N/A cospi ( x ) ✓ N/A N/A sqrt ( x ) ✓ ✓ ✓ cbrt ( x ) ✓ ✓ ✓ (a) Correctly rounded results with Bfloat16 Posit16

Functions UsingOurLibm UsingSoftPosit-Math ln ( x ) ✓ ✓ loд ( x ) ✓ ✓ loд ( x ) ✓ N/A sinpi ( x ) ✓ ✓ cospi ( x ) ✓ ✓ sqrt ( x ) ✓ ✓ (b) Correctly rounded results with Posit16 float Functions UsingOurLibm UsingFloat mlib Usingdouble mlib loд ( x ) for x ∈ [ , ) ✓ ✗ ✓ (c) Correct results in [ , ) for a 32-bit float The entire OurLibm and the prototype is written in C++. Both the prototype and OurLibm isavailable as open-source and publicly available. Although we have not optimized OurLibm forperformance on a specific target, it already has better performance than the existing state-of-the-artapproaches.

Experimental setup.

We describe our experimental setup to check the correctness and per-formance of OurLibm. To the best of our knowledge, there is no publicly available math libraryspecifically designed for

Bfloat16 . Hence, to compare the performance of our

Bfloat16 elementaryfunctions, we convert

Bfloat16 input to float or double , use glibc ’s float or double libraryfunction, and then convert the result back to Bfloat16 . We use SoftPosit-Math library [Leong 2019]to compare against our

Posit16 elementary functions. Finally, we compare our float functionsto glibc ’s float and double math library. For our performance experiments, we compiled ourOurLibm with g++-9.2.1 and with -O3 optimizations. All experiments were conducted on a machinewith 2.10GHz processor and 32GB of RAM, running the Ubuntu 16.04 LTS operating system. Wecount the number of cycles taken to compute the correctly rounded result for each input usinghardware performance counters. We use both the average number of cycles per input and totalcycles for all inputs to compare performance. Table 1(a) shows that OurLibm produces correctly rounded results for all inputs with numerouselementary functions for the

Bfloat16 representation. In contrast to OurLibm, we discovered thatre-purposing existing glibc ’s float library for Bfloat16 did not produce correctly rounded resultfor all inputs. The case with input x = − . exp ( x ) was already discussed inSection 2.6. This case is interesting because the glibc ’s float math library produces the correctly Rutgers-DCS-TR, Vol. 1, No. Rutgers Computer Science Technical Report, Article 753. Publication date: July 2020. b f l o a t _ l n ( x ) b f l o a t _ l o g ( x ) b f l o a t _ l o g ( x ) b f l o a t _ e x p ( x ) b f l o a t _ e x p ( x ) b f l o a t _ e x p ( x ) b f l o a t _ s i n p i ( x ) b f l o a t _ c o s p i ( x ) b f l o a t _ s q r t ( x ) b f l o a t _ c b r t ( x ) a v e r a g e Sp ee d u p . . . . . . Speedup over float libm Speedup over double libm (a) b f l o a t _ l n ( x ) b f l o a t _ l o g ( x ) b f l o a t _ l o g ( x ) b f l o a t _ e x p ( x ) b f l o a t _ e x p ( x ) b f l o a t _ e x p ( x ) b f l o a t _ s i n p i ( x ) b f l o a t _ c o s p i ( x ) b f l o a t _ s q r t ( x ) b f l o a t _ c b r t ( x ) a v e r a g e Sp ee d u p . . . Speedup over float libm Speedup over double libm (b)Fig. 11. (a) Speedup of OurLibm’s elementary functions compared to a baseline using the float math library(left bar) and to a baseline using the double math library (right bar). These functions take a

Bfloat16 inputand produce a

Bfloat16 output. (b) Speedup with OurLibm’s internal computation compared to a baselineusing the float math library (left bar) and to a baseline with double math library (right bar). Here the inputsand outputs are in double . Hence, it avoids the overheads of cast from

Bfloat16 to double for inputs andfrom double to Bfloat16 for the output. rounded result of exp ( x ) with respect to the float type. However, the result for Bfloat16 iswrong. We found that glibc ’s double library produces the correctly rounded results for all inputsfor Bfloat16 . Our experience during this evaluation illustrates that a correctly rounded functionfor T ′ does not necessarily produce a correctly rounded library for T even if T ′ has more precisionthat T .Table 1(b) reports that OurLibm produces correctly rounded results for all inputs with elementaryfunctions for Posit16 . We found the SoftPosit-Math functions also produce the correctly roundedresults for all but one function loд ( x ) , which is not available in SoftPosit-Math library.Finally, Table 1(c) reports that OurLibm produces the correctly rounded results for loд ( x ) forinputs in the domain [ , ) . The glibc ’s double library produces the correct result for all inputs inthe domain [ , ) . However, glibc ’s float math library does not produce the correctly roundedresults for all inputs in [ , ) . We found 1 . [ , ) where glibc ’s float library produces the wrong result. In summary, we are able to generate correctly rounded resultsfor many elementary functions with various representations using our proposed approach. We empirically compare the performance of the elementary functions in OurLibm for

Bfloat16 , Posit16 , and a 32-bit float type to the corresponding or similar glibc and SoftPosit-Math imple-mentations.

Bfloat16

Functions in OurLibm.

To measure performance, we measure theamount of time it takes for OurLibm to produce a

Bfloat16 result given a

Bfloat16 input for allinputs. Similarly, we measure the time taken by glibc ’s float or double to produce Bfloat16 output given a

Bfloat16 input, which involves a cast from

Bfloat16 to a float or a double value.As sinpi ( x ) and cospi ( x ) are not available in glibc ’s libm, we transform sinpi ( x ) = sin ( πx ) and cospi ( x ) = cos ( πx ) before using glibc ’s sin and cos functions.Figure 11(a) shows the speedup of OurLibm’s functions for Bfloat16 compared to glibc ’s float math library (left bar in the cluster) and the double library (right bar in the cluster). Onaverage, OurLibm’s functions are 1 . × faster when compared to the float library and 2 . × fasterover the double math library. There are two reasons for the speedup. First, the polynomials used Rutgers-DCS-TR, Vol. 1, No. Rutgers Computer Science Technical Report, Article 753. Publication date: July 2020.

Novel Approach to Generate Correctly Rounded Math Libraries for New Floating Point Representations 753:21 p o s i t _ l n ( x ) p o s i t _ l o g ( x ) p o s i t _ s i n p i ( x ) p o s i t _ c o s p i ( x ) p o s i t _ s q r t ( x ) a v e r a g e Sp ee d u p Fig. 12. Performance speedup of OurLibm’s functions compared to SoftPosit-Math library when the inputis available as a double . It avoids the cast from

Posit16 to double with OurLibm. SoftPosit-Math takes asinput a Posit16 value that is internally represented as an integer. for correctly rounded

Bfloat16 elementary functions are simpler ( i.e. , they have a smaller degreeor a few number of piece-wise polynomials) compared to the polynomials in the float or double ’smath library. Second, using the float or double library requires casts from Bfloat16 to float (or double ) and vice versa, for all inputs and the output, which incurs additional overhead.To understand the speedup with polynomial evaluation, we conducted another experiment wherewe measure the time taken to perform polynomial evaluation with out any casts ( i.e. , with double inputs that produce double outputs) for both OurLibm and glibc ’ math libraries. Figure 11(b)shows the speedup of OurLibm’s functions compared to the baseline of float ’s (left bar in thecluster) and double ’s library (right bar in the cluster) just for polynomial evaluation. On average,OurLibm has 1 . × speedup over the float math library and 1 . × speedup over the double mathlibrary. For sqrt ( x ) , OurLibm’s version has a slowdown because both float and double mathlibrary utilizes the hardware instruction, FSQRT , to compute sqrt ( x ) whereas OurLibm performspolynomial evaluation. Our cbrt ( x ) function is slightly slower than the float math library. The float library likely uses sophisticated range reduction and has a lower degree polynomial. Incontrast, OurLibm uses a 6 th degree polynomial that approximates √ x for the input domain [ , ) .Overall, OurLibm’s functions for Bfloat16 not only produce correct results for all inputs but alsoare faster than the state-of-the-art glibc ’s libraries re-purposed for

Bfloat16 . Posit16

Elementary Functions in OurLibm.

Figure 12 shows the speedupof OurLibm’s functions when compared to a baseline that uses SoftPosit-Math functions. The

Posit16 input is cast to the double type before using OurLibm. We did not measure the cost of thiscast, which can incur additional overhead. SoftPosit-Math library does not have an implementationfor loд ( x ) . Hence, we do not report it. On average, OurLibm has similar performance as SoftPosit-Math even when SoftPosit-Math is super-optimized for Posit16 . SoftPosit-Math library performsall computations using integers. OurLibm’s functions are 1 . × faster than SoftPosit-Math library’sfunctions excluding the sqrt function. SoftPosit-Math library computes sqrt ( x ) using the Newton-Raphson refinement method rather than using polynomial approximation functions, and thusproduces a more efficient function. In summary, OurLibm’s functions have similar performanceto a highly optimized library with integer operations. We plan to explore integer operations forinternal computation to further improve OurLibm’s performance. Float . OurLibm’s approximation of loд ( x ) for the 32-bit floating point type has a 1 . × speedup over glibc ’s float math library,which produces wrong results for 1 . [ , ) . Compared to glibc ’s double mathlibrary, which produces the correctly rounded results (in float ) for all float inputs in [ , ) , the float math library function in OurLibm has 1 . × speedup. Rutgers-DCS-TR, Vol. 1, No. Rutgers Computer Science Technical Report, Article 753. Publication date: July 2020. h’ = 1.01171…l’ = 1.003906…g(x’) = 1.003907… P H (x’) = 1.003977…g(x’) - |P H (x’) - g(x’)| = 1.00383… Fig. 13. More freedom in generating a polynomial for x with our approach. The reduced interval [ l ′ , h ′ ] (ingreen box) corresponds to the reduced input x ′ = . . . . . We show the real value of д ( x ′ ) (black circle)and the result produced by the polynomial generated with our approach (red diamond). If we approximatedthe real result д ( x ′ ) instead of the correctly rounded result, the margin of error for any such polynomialwould be lower. We provide case studies to show that our approach (1) has more freedom in generating betterpolynomials, (2) generates different polynomials for the same underlying elementary functionto account for numerical errors in range reduction and output compensation, and (3) generatescorrectly rounded results even when the polynomial evaluation is performed with the double type. x for Bfloat16 . The 10 x function is defined over the input domain (−∞ , ∞) . There are four classes of special cases:Special cases of 10 x =  . x ≤ − . . − . × − ≤ x ≤ . × − ∞ if x ≥ . N aN if x = N aN

A quick initial check returns their result and reduces the overall input that we need to approximate.We approximate 10 x using 2 x , which is easier to compute. We use the property, 10 x = xloд ( ) to approximate 10 x using 2 x . Subsequently, we perform range reduction by decomposing xloд ( ) as xloд ( ) = i + x ′ , where i is an integer and x ′ ∈ [ , ) is the fractional part.Now, 10 x decomposes to 10 x = xloд ( ) = i + x ′ = i x ′ The above decomposition requires us to approximate 2 x ′ where x ′ ∈ [ , ) . Multiplication by 2 i can be performed using integer operations. The range reduction, output compensation, and thefunction we are approximating д ( x ′ ) is as follows: RR ( x ) = x ′ = xloд ( ) − ⌊ xloд ( )⌋ OC ( y ′ , x ) = y ′ i = y ′ ⌊ xloд ( )⌋ д ( x ′ ) = x ′ Our approach generated a 4 th degree polynomial that approximates 2 x ′ in the input domain [ , ) . Our polynomial produces the correctly rounded result for all inputs in the entire domain for10 x when used with range reduction and output compensation.We are able to generate a lower degree polynomial because our approach provides more freedomto generate the correctly rounded results. We illustrate this point with an example. Figure 13presents a reduced interval ( [ l ′ , h ′ ] in green region) for the reduced input ( x ′ = . . . . ) inour approach. The real value of д ( x ′ ) is shown in black circle. In our approach, the polynomialthat approximates д ( x ′ ) has to produce a value in [ l ′ , h ′ ] such that the output compensated valueproduces the correctly rounded result of 10 x for all input x that reduces to x ′ . The value of д ( x ′ ) is extremely close to l ′ with a margin of error ϵ = | д ( x ′ ) − l ′ | ≈ . × − . In contrast to ourapproach, if we approximated the real value of д ( x ′ ) , then we must generate a polynomial withan error of at most ϵ , i.e. the polynomial has to produce a value in [ д ( x ′ ) − ϵ , д ( x ′ ) + ϵ ] , which Rutgers-DCS-TR, Vol. 1, No. Rutgers Computer Science Technical Report, Article 753. Publication date: July 2020.

Novel Approach to Generate Correctly Rounded Math Libraries for New Floating Point Representations 753:23 potentially necessitates a higher degree polynomial. The polynomial that we generate produces avalue shown in Figure 13 with red diamond. This value has an error of | P H ( x ′ ) − д ( x ′ )| ≈ . × − ,which is much larger than ϵ . Still, the 4 th degree polynomial generated by our approach producesthe correctly rounded value when used with the output compensation function for all inputs. ln ( x ) , loд ( x ) , and loд ( x ) for Bfloat16 . While creating the

Bfloat16 approximations for functions ln ( x ) , loд ( x ) , and loд ( x ) , we observed that our approach generatesdifferent polynomials for the same underlying elementary function to account for numerical errorsin range reduction and output compensation. We highlight this observation in this case study.To approximate these functions, we use a slightly modified version of the Cody and Waite rangereduction technique [Cody and Waite 1980]. As a first step, we use mathematical properties oflogarithms, loд b ( x ) = loд ( x ) loд ( b ) to approximate all three functions ln ( x ) , loд ( x ) , and loд ( x ) usingthe approximation for loд ( x ) . As a second step, we perform range reduction by decomposing theinput x as x = t × e where t ∈ [ , ) is the fractional value represented by the mantissa and e is an integer representing the exponent of the value. Then, we use the mathematical property oflogarithms, loд b ( x × y z ) = loд b ( x ) + zloд b ( y ) , to perform range reduction and output compensation.Now, any logarithm function loд b ( x ) can be decomposed to loд b ( x ) = loд ( t ) + eloд ( b ) .As a third step, to ease the job of generating a polynomial for loд ( t ) , we introduce a new variable x ′ = t − t + and transform the function loд ( t ) to a function with rapidly converging polynomialexpansion: д ( x ′ ) = loд (cid:18) + x ′ − x ′ (cid:19) where the function д ( x ′ ) evaluates to loд ( t ) .The above input transformation, attributed to Cody and Waite [Cody and Waite 1980], enablesthe creation of a rapidly convergent odd polynomial, P ( x ) = c x + c x ... , which reduces the numberof operations. In contrast, the polynomial would be of the form P ( x ) = c + c x + c x ... in theabsence of above input transformation, which has terms with both even and odd degrees.When the input x is decomposed into x = t ∗ e where t ∈ [ , ) and e is an integer, the rangereduction function x ′ = RR ( x ) , the output compensation function y = OC ( y ′ , x ) , and the functionthat we need to approximate, y ′ = д ( x ′ ) are as follows, RR ( x ) = x ′ = t − t + , OC ( y ′ , x ) = y ′ + e ( loд ( b )) д ( x ′ ) = loд (cid:18) + x ′ − x ′ (cid:19) Hence, we approximate the same elementary function for ln ( x ) , loд ( x ) and loд ( x ) ( i.e. , д ( x ′ ) ).However, the output compensation functions are different for each of them.We observed that our approach produced different polynomials that produced correct outputfor ln ( x ) , loд ( x ) , and loд ( x ) functions for Bfloat16 , which is primarily to account for numericalerrors in each output compensation function. We produced a 5 th degree odd polynomial for loд ( x ) ,a 5 th degree odd polynomial with different coefficients for loд ( x ) , and a 7 th degree odd polynomialfor ln ( x ) . Our technique also determined that there was no correct 5 th degree odd polynomialfor ln ( x ) . Although these polynomials approximate the same function д ( x ′ ) , they cannot be usedinterchangeably. For example, our experiment show that the 5 th degree polynomial produced for loд ( x ) cannot be used to produce the correctly rounded result of ln ( x ) for all inputs. loд ( x ) for a 32-bit Float . To show that our approach is scalable to datatypes with numerous inputs, we illustrate a correctly rounded loд ( x ) function for a 32-bit float type in the input domain [ , ) . There are 2 values in this input domain. When we generate asingle polynomial for such a large number of points, it is likely that we generate a polynomial of a Rutgers-DCS-TR, Vol. 1, No. Rutgers Computer Science Technical Report, Article 753. Publication date: July 2020. higher degree. When a polynomial of a large degree is evaluated in a finite precision representation,it is likely to have numerical errors in polynomial evaluation. We wanted to test whether ourapproach can generate a polynomial that produces correctly rounded results for all inputs evenwhen the polynomial is evaluated in some finite precision representation ( i.e. , double ).We used the Cody and Waite range reduction described above to reduce the input domain to [ , ) .We used the oracle to generate rounding intervals, then identified reduced and combined intervals,and fed the constraints to the LP solver. We were able to generate a 15-degree odd polynomialthat produces the correct result for all inputs! This 15-degree polynomial produces the correctlyrounded results in the domain [ , ) for the float type when polynomial evaluation is performedin double . This experiment shows that our approach can generate polynomials of a large degreethat produce correct results when evaluated in limited precision representation. Correctly Rounded Math Libraries for FP.

Since the introduction of the floating point stan-dard [Cowlishaw 2008], a number of correctly rounded math library have been proposed includingthe IBM LibUltim (or also known as MathLib) [IBM 2008; Ziv 1991], Sun Microsystem’s LibMCR [Mi-crosystems 2008], CR-LIBM [Daramy et al. 2003], and the MPFR math library [Fousse et al. 2007].MPFR produces correctly rounded result for any arbitrary precision.CR-LIBM [Daramy et al. 2003; LefÃĺvre et al. 1998] is a correctly rounded double math librarydeveloped using Sollya [Chevillard et al. 2010], which is a tool and a library for developing FPcode. Given a degree d , a representation H , and the elementary function f ( x ) , Sollya generatespolynomials of degree d with coefficients in H that has the minimum infinity norm [Brisebarreand Chevillard 2007]. Sollya uses a modified Remez algorithm with lattice basis reduction toproduce polynomials. It also computes the error bound on the polynomial evaluation using intervalarithmetic [Chevillard et al. 2011; Chevillard and Lauter 2007] and produces Gappa [Melquiond2019] proofs for the error bound. Metalibm [Brunie et al. 2015; Kupriianova and Lauter 2014] isa math library function generator built using Sollya. MetaLibm is able to automatically identifyrange reduction and domain splitting techniques for some transcendental functions. It has beenused to create correctly rounded elementary functions for the float and double types.A number of other approaches have been proposed to generate correctly rounded results fordifferent transcendental functions including square root [Jeannerod et al. 2011] and exponentia-tion [Bui and Tahar 1999]. A modified Remez algorithm has also been used to generate polynomialsfor approximating some elementary functions [Arzelier et al. 2019]. It generates a polynomial thatminimizes the infinity norm compared to ideal elementary function and the numerical error in thepolynomial evaluation. It can be used to produce correctly rounded results when range reductionis not necessary. Compared to prior techniques, our approach approximates the correctly roundedvalue RN T ( f ( x )) and the margin of error is much higher, which generates efficient polynomials.Additionally, our approach also takes into account numerical errors in range reduction, outputcompensation, and polynomial evaluation. Posit math libraries.

SoftPosit-Math [Leong 2019] has a number of correctly rounded

Posit16 elementary functions, which are created using the Minefield method [Gustafson 2020]. The Minefieldmethod identifies the interval of values that the internal computation should produce and declaresall other regions as a minefield. Then the goal is to generate a polynomial that avoids the mines.The polynomials in the minefield method were generated by trial and error. Our approach isinspired by the Minefield method. It generalizes it to numerous representations, range reduction,and output compensation. Our approach also automates the process of generating polynomials byencoding the mines as linear constraints and uses an LP solver. Recently, researchers have usedthe CORDIC method to generate a number of approximations to trigonometric functions for the

Rutgers-DCS-TR, Vol. 1, No. Rutgers Computer Science Technical Report, Article 753. Publication date: July 2020.

Novel Approach to Generate Correctly Rounded Math Libraries for New Floating Point Representations 753:25

Posit32 type [Lim et al. 2020]. However, they do not provide correctly rounded results for allinputs.

Verification of math libraries.

As performance and correctness are both important with mathlibraries, there is extensive research to prove the correctness of math libraries. Sollya verifies thatthe generated implementations of elementary functions produce correctly rounded results withthe aid of Gappa [Daumas et al. 2005; de Dinechin et al. 2011; de Dinechin et al. 2006]. Typically,it uses interval arithmetic and has been used to prove the correctness of CR-LIBM. Recently,researchers have also verified that many functions in Intel’s math.h implementations have at most1 ulp error [Lee et al. 2017]. Various elementary function implementations have also been provencorrect using HOL Light [Harrison 1997a,b, 2009]. Similarly, CoQ proof assistant has been used toprove the correctness of argument reduction [Boldo et al. 2009]. Instruction sets of mainstreamprocessors have also been proven correct using proof assistants ( e.g. , division and sqrt ( x ) instructionin IBM Power4 processor [Sawada 2002]). If the oracle includes all inputs, then OurLibm producespolynomial functions that produce correctly rounded results for all inputs. If used with sampling,then we can use prior approaches to prove the correctness of our implementations for all inputs. Rewriting tools.

Mathematical rewriting tools are other alternatives to create correctly roundedfunctions. If the rounding error in the implementation is the root cause of an incorrect result, we canuse tools that detect numerical errors to diagnose them [Benz et al. 2012; Chowdhary et al. 2020; Fuand Su 2019; Goubault 2001; Sanchez-Stern et al. 2018; Yi et al. 2019; Zou et al. 2019]. Subsequently,we can rewrite them using tools such as Herbie [Panchekha et al. 2015] or Salsa [Damouche andMartel 2018]. Recently, a repair tool was proposed specifically for reducing the error of mathlibraries [Yi et al. 2019]. It identifies the domain of inputs that result in high error. Then, it usespiece-wise linear or quadratic equations to repair them for the specific domain. However, currently,these rewriting tools do not guarantee correctly rounded results for all inputs.

Generating numerical code.

There is a body of work on generating numerical code thatprovides verified bounds on the numerical error [Darulova et al. 2018; Darulova and Kuncak 2017;Rocca et al. 2017]. Our approach and techniques produce accurate numerical code for elementaryfunctions and can be extended to expressions that can be approximated using polynomials.

A library to approximate elementary functions is a key component of the FP representation and itsalternatives. We propose a novel methodology to generate approximations for elementary functionsthat produces correctly rounded results for all inputs. The key insight is to identify the amount offreedom available to generate the correctly rounded result. Subsequently, we use this freedom togenerate a polynomial using linear programming that produces the correct rounded results for allinputs. This paper shows that this approach generates polynomial approximations that are fasterthan existing libraries while producing correct results. Our approach can also allow designers ofelementary functions to make pragmatic trade-offs with respect to performance and correct results.More importantly, it can enable standards to mandate correctly rounded results for elementaryfunctions with new representations.

Rutgers-DCS-TR, Vol. 1, No. Rutgers Computer Science Technical Report, Article 753. Publication date: July 2020.

REFERENCES

Denis Arzelier, Florent BrÃľhard, and Mioara Joldes. 2019. Exchange Algorithm for Evaluation and Approximation Error-Optimized Polynomials. In . 30–37.Florian Benz, Andreas Hildebrandt, and Sebastian Hack. 2012. A Dynamic Program Analysis to Find Floating-point AccuracyProblems. In

Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation (Beijing, China) (PLDI ’12) . ACM, New York, NY, USA, 453–462.Jeremy Bernstein, Jiawei Zhao, Markus Meister, Ming-Yu Liu, Anima Anandkumar, and Yisong Yue. 2020. Learningcompositional functions via multiplicative weight updates. arXiv:2006.14560 [cs.NE]Sylvie Boldo, Marc Daumas, and Ren-Cang Li. 2009. Formally Verified Argument Reduction with a Fused Multiply-Add. In

IEEE Transactions on Computers , Vol. 58. 1139–1145.Peter Borwein and Tamas Erdelyi. 1995.

Polynomials and Polynomial Inequalities . Springer New York.Nicolas Brisebarre and Sylvvain Chevillard. 2007. Efficient polynomial L-approximations. In .Nicolas Brisebarre, Jean-Michel Muller, and Arnaud Tisserand. 2006. Computing Machine-Efficient Polynomial Approxima-tions. In

ACM ACM Transactions on Mathematical Software , Vol. 32. Association for Computing Machinery, New York,NY, USA, 236âĂŞ256.Nicolas Brunie, Florent de Dinechin, Olga Kupriianova, and Christoph Lauter. 2015. Code Generators for MathematicalFunctions. In . 66–73.Hung Tien Bui and Sofiene Tahar. 1999. Design and synthesis of an IEEE-754 exponential function. In

Engineering Solutionsfor the Next Millennium. 1999 IEEE Canadian Conference on Electrical and Computer Engineering , Vol. 1. 450–455 vol.1.Sylvain Chevillard, John Harrison, Mioara Joldes, and Christoph Lauter. 2011. Efficient and accurate computation of upperbounds of approximation errors.

Theoretical Computer Science

Mathematical Software - ICMS 2010 (Lecture Notes in Computer Science, Vol. 6327) . Springer, Heidelberg, Germany,28–31.Sylvain Chevillard and Christopher Lauter. 2007. A Certified Infinite Norm for the Implementation of Elementary Functions.In

Seventh International Conference on Quality Software (QSIC 2007) . 153–160.Sangeeta Chowdhary, Jay P. Lim, and Santosh Nagarakatte. 2020. Debugging and Detecting Numerical Errors in Computationwith Posits. In .William J Cody and William M Waite. 1980.

Software manual for the elementary functions . Prentice-Hall, Englewood Cliffs,NJ.Mike Cowlishaw. 2008.

IEEE Standard for Floating-Point Arithmetic . IEEE 754-2008. IEEE Computer Society. 1–70 pages.Nasrine Damouche and Matthieu Martel. 2018. Salsa: An Automatic Tool to Improve the Numerical Accuracy of Programs.In

Automated Formal Methods (Kalpa Publications in Computing, Vol. 5) , Natarajan Shankar and Bruno Dutertre (Eds.).63–76.Catherine Daramy, David Defour, Florent Dinechin, and Jean-Michel Muller. 2003. CR-LIBM: A correctly rounded elementaryfunction library. In

Proceedings of SPIE Vol. 5205: Advanced Signal Processing Algorithms, Architectures, and ImplementationsXIII , Vol. 5205.Eva Darulova, Anastasiia Izycheva, Fariha Nasir, Fabian Ritter, Becker Heiko, and Robert Bastian. 2018. Framework forAnalysis and Optimization of Numerical Programs. In

Daisy - Tools and Algorithms for the Construction and Analysis ofSystems .Eva Darulova and Viktor Kuncak. 2017. Towards a Compiler for Reals. In

ACM Transactions on Programming Languages andSystems .Marc Daumas, Guillaume Melquiond, and Cesar Munoz. 2005. Guaranteed proofs using interval arithmetic. In . 188–195.Florent de Dinechin, Christopher Lauter, and Guillaume Melquiond. 2011. Certifying the Floating-Point Implementation ofan Elementary Function Using Gappa. In

IEEE Transactions on Computers , Vol. 60. 242–253.Florent de Dinechin, Christoph Quirin Lauter, and Guillaume Melquiond. 2006. Assisted Verification of Elementary FunctionsUsing Gappa. In

Proceedings of the 2006 ACM Symposium on Applied Computing (Dijon, France) (SAC âĂŹ06) . Associationfor Computing Machinery, New York, NY, USA, 1318âĂŞ1322. https://doi.org/10.1145/1141277.1141584Laurent Fousse, Guillaume Hanrot, Vincent Lefèvre, Patrick Pélissier, and Paul Zimmermann. 2007. MPFR: A Multiple-precision Binary Floating-point Library with Correct Rounding.

ACM Trans. Math. Software

33, 2, Article 13 (June2007).Zhoulai Fu and Zhendong Su. 2019. Effective Floating-point Analysis via Weak-distance Minimization. In

Proceedings of the40th ACM SIGPLAN Conference on Programming Language Design and Implementation (Phoenix, AZ, USA) (PLDI 2019) .ACM, New York, NY, USA, 439–452.Rutgers-DCS-TR, Vol. 1, No. Rutgers Computer Science Technical Report, Article 753. Publication date: July 2020.

Novel Approach to Generate Correctly Rounded Math Libraries for New Floating Point Representations 753:27

Ambros Gleixner, Daniel E. Steffy, and Kati Wolter. 2015.

Iterative Refinement for Linear Programming . Technical Report15-15. ZIB, Takustr. 7, 14195 Berlin.Ambros M. Gleixner, Daniel E. Steffy, and Kati Wolter. 2012. Improving the Accuracy of Linear Programming Solverswith Iterative Refinement. In

Proceedings of the 37th International Symposium on Symbolic and Algebraic Computation (Grenoble, France) (ISSAC âĂŹ12) . Association for Computing Machinery, New York, NY, USA, 187âĂŞ194. https://doi.org/10.1145/2442829.2442858Eric Goubault. 2001. Static Analyses of the Precision of Floating-Point Operations. In

Proceedings of the 8th InternationalSymposium on Static Analysis (SAS) . Springer, 234–259.John Gustafson. 2017.

Posit Arithmetic . https://posithub.org/docs/Posits4.pdfJohn Gustafson. 2020.

The Minefield Method: A Uniformly Fast Solution to the Table-MakerâĂŹs Dilemma . https://bit.ly/2ZP4kHjJohn Gustafson and Isaac Yonemoto. 2017. Beating Floating Point at Its Own Game: Posit Arithmetic.

SupercomputingFrontiers and Innovations: an International Journal

4, 2 (June 2017), 71–86.John Harrison. 1997a. Floating point verification in HOL light: The exponential function. In

Algebraic Methodology andSoftware Technology , Michael Johnson (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 246–260.John Harrison. 1997b. Verifying the Accuracy of Polynomial Approximations in HOL. In

International Conference on TheoremProving in Higher Order Logics .John Harrison. 2009. HOL Light: An Overview. In

Proceedings of the 22nd International Conference on Theorem Provingin Higher Order Logics, TPHOLs 2009 (Lecture Notes in Computer Science, Vol. 5674) , Stefan Berghofer, Tobias Nipkow,Christian Urban, and Makarius Wenzel (Eds.). Springer-Verlag, Munich, Germany, 60–66.IBM. 2008.

Accurate Portable MathLib . http://oss.software.ibm.com/mathlib/Intel. 2019.

Delivering a New Intelligence with AI at Scale

IEEE Trans. Comput.

60. https://doi.org/10.1109/TC.2010.152Jeff Johnson. 2018.

Rethinking floating point for deep learning . http://export.arxiv.org/abs/1811.01721William Kahan. 2004.

A Logarithm Too Clever by Half . https://people.eecs.berkeley.edu/~wkahan/LOG10HAF.TXTDhiraj D. Kalamkar, Dheevatsa Mudigere, Naveen Mellempudi, Dipankar Das, Kunal Banerjee, Sasikanth Avancha,Dharma Teja Vooturi, Nataraj Jammalamadaka, Jianyu Huang, Hector Yuen, Jiyan Yang, Jongsoo Park, AlexanderHeinecke, Evangelos Georganas, Sudarshan Srinivasan, Abhisek Kundu, Misha Smelyanskiy, Bharat Kaul, and PradeepDubey. 2019. A Study of BFLOAT16 for Deep Learning Training.

ArXiv abs/1905.12322 (2019).Olga Kupriianova and Christoph Lauter. 2014. Metalibm: A Mathematical Functions Code Generator. In .Wonyeol Lee, Rahul Sharma, and Alex Aiken. 2017. On Automatically Proving the Correctness of Math.h Implementations.

Proceedings of the ACM on Programming Languages

2, POPL, Article 47 (Dec. 2017), 32 pages.Vincent LefÃĺvre and Jean-Michel Muller. 2001. Worst Cases for Correct Rounding of the Elementary Functions in DoublePrecision. In . 111–118. https://doi.org/10.1109/ARITH.2001.930110Vincent LefÃĺvre, Jean-Michel Muller, and Arnaud Tisserand. 1998. Toward correctly rounded transcendentals.

IEEE Trans.Comput.

47, 11 (1998), 1235–1243.Cerlane Leong. 2019.

SoftPosit-Math . https://gitlab.com/cerlane/softposit-mathJay P. Lim, Matan Shachnai, and Santosh Nagarakatte. 2020. Approximating Trigonometric Functions for Posits Using theCORDIC Method. In

Proceedings of the 17th ACM International Conference on Computing Frontiers (Catania, Sicily, Italy) (CF âĂŹ20) . Association for Computing Machinery, New York, NY, USA, 19âĂŞ28.Guillaume Melquiond. 2019.

Gappa . http://gappa.gforge.inria.frSun Microsystems. 2008.

LIBMCR 3 "16 February 2008" "libmcr-0.9"

Elementary Functions: Algorithms and Implementation . Birkhauser.NVIDIA. 2020.

TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x . https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/Pavel Panchekha, Alex Sanchez-Stern, James R. Wilcox, and Zachary Tatlock. 2015. Automatically Improving Accuracy forFloating Point Expressions. In

Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design andImplementation , Vol. 50. Association for Computing Machinery, New York, NY, USA, 1âĂŞ11. https://doi.org/10.1145/2813885.2737959Eugene Remes. 1934. Sur un procédé convergent dâĂŹapproximations successives pour déterminer les polynômes dâĂŹap-proximation.

Comptes rendus de l’AcadÃľmie des Sciences

198 (1934), 2063–2065.Rutgers-DCS-TR, Vol. 1, No. Rutgers Computer Science Technical Report, Article 753. Publication date: July 2020.

Alexandre Rocca, Victor Magron, and Thao Dang. 2017. Certified Roundoff Error Bounds Using Bernstein Expansions andSparse Krivine-Stengle Representations. In .Alex Sanchez-Stern, Pavel Panchekha, Sorin Lerner, and Zachary Tatlock. 2018. Finding Root Causes of Floating Point Error.In

Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (Philadelphia,PA, USA) (PLDI 2018) . ACM, New York, NY, USA, 256–269.Joe Sawada. 2002. Formal verification of divide and square root algorithms using series calculation. In .Giuseppe Tagliavini, Stefan Mach, Davide Rossi, Andrea Marongiu, and Luca Benin. 2018. A transprecision floating-pointplatform for ultra-low power computing. In .1051–1056.Lloyd N. Trefethen. 2012.

Approximation Theory and Approximation Practice (Other Titles in Applied Mathematics) . Societyfor Industrial and Applied Mathematics, USA.Shibo Wang and Pankaj Kanwar. 2019.

BFloat16: The secret to high performance on Cloud TPUs . https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpusXin Yi, Liqian Chen, Xiaoguang Mao, and Tao Ji. 2019. Efficient Automated Repair of High Floating-Point Errors in NumericalLibraries.

Proceedings of the ACM on Programming Languages

3, POPL, Article 56 (Jan. 2019), 29 pages.Abraham Ziv. 1991. Fast Evaluation of Elementary Mathematical Functions with Correctly Rounded Last Bit.

ACM Trans.Math. Software

17, 3 (Sept. 1991), 410âĂŞ423. https://doi.org/10.1145/114697.116813Daming Zou, Muhan Zeng, Yingfei Xiong, Zhoulai Fu, Lu Zhang, and Zhendong Su. 2019. Detecting Floating-Point Errorsvia Atomic Conditions.

Proceedings of the ACM on Programming Languages

4, POPL, Article 60 (Dec. 2019), 27 pages.Rutgers-DCS-TR, Vol. 1, No. Rutgers Computer Science Technical Report, Article 753. Publication date: July 2020.

Novel Approach to Generate Correctly Rounded Math Libraries for New Floating Point Representations 753:29

A DETAILS ON OURLIBM

In the appendices, we describe the range reduction technique, special cases, and the polynomialswe generated to create math library functions in OurLibm. We use the same range reductiontechnique for each family of elementary functions across all types, i.e. ln ( x ) , loд ( x ) , and loд ( x ) for all bfloat16 , posit16 , and float use the same range reduction technique. Hence, we firstdescribe the range reduction techniques that we use for each family of elementary functions inAppendix B. In each subsequent section, we describe the specific special cases, range reductionfunction, output compensation function, and the polynomial we use to create each function for bfloat16 (Appendix C), posit16 (Appendix D), and float (Appendix E). B RANGE REDUCTION TECHNIQUES

In this section, we explain the general range reduction technique OurLibm uses to reduce the inputdomain for each class of elementary functions.

B.1 Logarithm functions ( loд b ( x ) ) We use a slightly modified version of Cody and Waite’s range reduction technique [Cody and Waite1980] for all logarithm functions. As a first step, we use the mathematical property of logarithms, loд b ( x ) = loд ( x ) loд ( b ) to approximate logarithm functions using the approximation of loд ( x ) . As asecond step, we decompose the input x as x = t × m where t is the fractional value representedby the mantissa and m is the exponent of the input. Then we use the mathematical property oflogarithm functions, loд ( x × y z ) = loд ( x ) + zloд ( y ) to decompose loд ( x ) . Thus, any logarithmfunction loд b ( x ) can be decomposed to , loд b ( x ) = loд ( t ) + mloд ( b ) As a third step, to ease the job of generating an accurate polynomial for loд ( t ) , we introduce anew variable x ′ = t − t + and transform the function loд ( t ) to a function with rapidly convergingpolynomial expansion: д ( x ′ ) = loд (cid:18) + x ′ − x ′ (cid:19) The function д ( x ′ ) evaluates to loд ( t ) . The polynomial expansion of д ( x ′ ) is an odd polynomial, i.e. P ( x ) = c x + c x + c x . . . . Combining all steps, we decompose loд b ( x ) to, loд b ( x ) = loд (cid:16) + x ′ − x ′ (cid:17) + mloд ( b ) When the input x is decomposed into x = t × e where t ∈ [ , ) and e is an integer, the rangereduction function x ′ = RR ( x ) , the output compensation function y = OC ( y ′ , x ) , and the functionthat we need to approximate, y ′ = д ( x ′ ) can be summarized as follows, RR ( x ) = t − t + OC ( y ′ , x ) = y ′ + mloд ( b ) д ( x ′ ) = loд (cid:18) + x ′ − x ′ (cid:19) With this range reduction technique, we need to approximate д ( x ′ ) for the reduced input domain x ′ ∈ [ , ) . B.2 Exponential Functions ( a x ) We approximate all exponential functions with 2 x . As a first step, we use the mathematical property a x = xloд ( a ) to decompose any exponential function to a function of 2 x . Second, we decompose the Rutgers-DCS-TR, Vol. 1, No. Rutgers Computer Science Technical Report, Article 753. Publication date: July 2020. value xloд ( a ) into the integral part i and the remaining fractional part x ′ ∈ [ , ) , i.e. xloд ( a ) = i + x ′ .We can define i and x ′ more formally as: i = ⌊ xloд ( a )⌋ , x ′ = xloд ( a ) − i where ⌊ x ⌋ is a floor function that rounds down x to an integer. Using the property 2 x + y = x y , a x decomposes to a x = xloд ( a ) = i + x ′ = x ′ i = xloд ( a )−⌊ xloд ( a )⌋ ⌊ xloд ( a )⌋ The above decomposition allows us to approximate any exponential functions by approximating 2 x for x ∈ [ , ) . The range reduction function x ′ = RR ( x ) , output compensation function y = OC ( y ′ , x ) ,and the function we need to approximate y ′ = д ( x ′ ) can be summarized as follows: RR ( x ) = xloд ( b ) − ⌊ xloд ( b )⌋ OC ( y ′ , x ) = y ′ ⌊ xloд ( b )⌋ д ( x ′ ) = x ′ With this range reduction technique, we need to approximate 2 x ′ for the reduced input domain x ′ ∈ [ , ) . B.3 Square Root Function ( √ x ) To perform range reduction on √ x , we first decompose the input x into x = x ′ × m where m is aneven integer and x ′ = x m ∈ [ , ) . Second, using the mathematical properties √ xy = √ x √ y and √ x = x , we decompose √ x to: √ x = √ x ′ × m = √ x ′ × m The above decomposition allows us to approximate the square root function by approximating √ x for x ∈ [ , ) . Since m is an even integer, m is an integer and multiplication of 2 m can be performedusing integer arithmetic.When the input x is decomposed into x = x ′ × m where x ′ ∈ [ , ) and m is an even integer,the range reduction function x ′ = RR ( x ) , output compensation function y = OC ( y ′ , x ) , and thefunction we need to approximate y ′ = д ( x ′ ) can be summarized as follows: RR ( x ) = x ′ OC ( y ′ , x ) = y ′ m д ( x ′ ) = √ x With this range reduction technique, we need to approximate √ x ′ for the reduced input domain x ′ ∈ [ , ) . B.4 Cube Root Function ( √ x ) To perform range reduction on √ x , we first decompose the input x into x = s × x ′ × m . The value s ∈ {− , } represents the sign of x , m is an integer multiple of 3, and x ′ = x m ∈ [ , ) . Second,using the mathematical properties √ xy = √ x √ y and √ x = x , we decompose √ x to: √ x = √ s × x ′ × m = s × √ x ′ × m The above decomposition allow us to approximate the cube root function by approximating √ x for x ∈ [ , ) . Since m is an integer multiple of 3, m is an integer and multiplication of 2 m can beperformed using integer arithmetic.When we decompose the input x into x = s × x ′ × m where s is the sign of the input, x ′ ∈ [ , ) ,and m is an integer multiple of 3, the range reduction function x ′ = RR ( x ) , output compensationfunction y = OC ( y ′ , x ) , and the function we need to approximate y ′ = д ( x ′ ) can be summarized asfollows: RR ( x ) = x ′ OC ( y ′ , x ) = s × y ′ m д ( x ′ ) = √ x Rutgers-DCS-TR, Vol. 1, No. Rutgers Computer Science Technical Report, Article 753. Publication date: July 2020.

Novel Approach to Generate Correctly Rounded Math Libraries for New Floating Point Representations 753:31

With this range reduction technique, we need to approximate √ x ′ for the reduced input domain x ′ ∈ [ , ) . B.5 Sinpi Function ( sin ( πx ) ) To perform range reduction on sin ( πx ) , we use the property of sin ( πx ) that it is a periodic and oddfunction. First, using the property sin (− πx ) = − sin ( πx ) , we decompose the input x into x = s × | x | where s is the sign of the input. The function decomposes to sin ( πx ) = s × sin ( π | x |) .Second, we use the properties sin ( π ( x + z )) = sin ( πx ) where z is an integer and sin ( π ( x + z + )) = − sin ( πx ) . We decompose | x | into | x | = i + t where i is an integer and t ∈ [ , ) is the fractionalpart, i.e. t = | x | − i . More formally, we can define t and i as, i = ⌊| x |⌋ , t = | x | − i If i is an even integer, then sin ( π ( t + i )) = sin ( πt ) (from the property sin ( π ( x + z )) = sin ( πx ) ). If i is an odd integer, then sin ( π ( t + i )) = − sin ( πt ) (from the property sin ( π ( x + z + )) = − sin ( πx ) ).Thus, we can decompose the sin ( πx ) function into, sin ( πx ) = s × sin ( π | x |) = (cid:40) s × sin ( πt ) if i ≡ ( mod )− s × sin ( πt ) if i ≡ ( mod ) Third, we use the property sin ( πt ) = sin ( π ( − t )) for 0 . < t < . x ′ , x ′ = (cid:40) − t if 0 . < t < . t otherwiseSince we perform the subtraction only when 0 . < t < . x ′ can be computed exactly due toSterbenz Lemma [Muller 2005]. The above decomposition reduces the input domain to x ′ ∈ [ , . ] and requires us to approximate sin ( πx ′ ) for the reduced domain.In summary, we decompose the input x into x = s × ( i + t ) where s is the sign of the input, i is an integer, and t ∈ [ , ) is the fractional part of | x | , i.e. | x | = i + t . The range reductionfunction x ′ = RR ( x ) , the output compensation function y = OC ( y ′ , x ) , and the function we need toapproximate, y ′ = д ( x ′ ) are as follows: RR ( x ) = (cid:40) − t if 0 . < t < . t otherwise , OC ( y ′ , x ) = (cid:40) s × y ′ if i ≡ ( mod )− s × y ′ if i ≡ ( mod ) , д ( x ′ ) = sin ( πx ′ ) With this range reduction technique, we need to approximate sin ( πx ′ ) for the reduced input domain x ′ ∈ [ , . ] . B.6 Cospi Function ( cos ( πx ) ) To perform range reductino on cos ( πx ) , we use the property of cos ( πx ) that it is a periodic and evenfunction. First, using the property cos (− πx ) = cos ( πx ) , we decompose the input x into x = s × | x | where s is the sign of the input. The function decomposes to cos ( πx ) = cos ( π | x |) .Second, we use the properties cos ( π ( x + z )) = cos ( πx ) where z is an integer and cos ( π ( x + z + )) = − cos ( πx ) . We decompose | x | into | x | = i + t where i is an integer and t ∈ [ , ) is thefractional p art, i.e. t = | x | − i . More formally, we can define t and i as, i = ⌊| x |⌋ , t = | x | − i If i is an even integer, then cos ( π ( t + i )) = cos ( πt ) (from the property cos ( π ( x + z )) = cos ( πx ) ). If i is an odd integer, then cos ( π ( t + i )) = − cos ( πt ) (from the property cos ( π ( x + z + )) = − cos ( πx ) ). Rutgers-DCS-TR, Vol. 1, No. Rutgers Computer Science Technical Report, Article 753. Publication date: July 2020.

Thus, we can decompose cos ( πx ) into, cos ( πx ) = cos ( π | x |) = (− ) i ( mod ) × cos ( πt ) where i ( mod ) is the modulus operation in base 2.Third, we use the property cos ( πt ) = − cos ( π ( − t )) for 0 . < t < . t to, x ′ = (cid:40) − t if 0 . < t < . t otherwise Since we perform the subtraction only when 0 . < t < .

0, 1 − t can be computed exactly due toSterbenz Lemma. Consequently, cos ( πx ) function decomposes to, cos ( πx ) = (cid:40) − × (− ) i ( mod ) × cos ( πx ′ ) if 0 . < t < . (− ) i ( mod ) × cos ( πx ′ ) otherwiseThe above decomposition reduces the input domain to x ′ ∈ [ , . ] .In summary, we decompose the input x into s ×( i + t ) where s is the sign of the input, i is an integer,and t ∈ [ , ) is the fractional part of | x | , i.e. | x | = i + t . The range reduction function x ′ = RR ( x ) , theoutput compensation function y = OC ( y ′ , x ) , and the function we need to approximate, y ′ = д ( x ′ ) are as follows: RR ( x ) = (cid:40) − t if 0 . < t < . t otherwiseOC ( y ′ , x ) = (cid:40) − × (− ) i ( mod ) × y ′ if 0 . < t < . (− ) i ( mod ) × y ′ otherwise д ( x ′ ) = cos ( πx ′ ) With this range reduction technique, we need to approximate cos ( πx ′ ) for the reduced input domain x ′ ∈ [ , . ] . C DETAILS ON BFLOAT16 FUNCTIONS

In this section, we explain the bfloat16 functions in OurLibm. More specifically, we describethe special cases, the range reduction and output compensation function, the function we mustapproximate, and the polynomials we generated for each bfloat16 math library function in OurLibm.

C.1 ln ( x ) for Bfloat16 The elementary function ln ( x ) is defined over the input domain ( , ∞) . There are three classes ofspecial case inputs: Special case of ln ( x ) =  −∞ if x = ∞ if x = ∞ N aN if x < x = N aN

Novel Approach to Generate Correctly Rounded Math Libraries for New Floating Point Representations 753:33

The value t is the fractional value represented by the mantissa of the input x and m is the exponent, i.e. x = t × m . With this range reduction technique, we need to approximate д ( x ′ ) for x ′ ∈ [ , ) .To approximate д ( x ′ ) , we use a 7 th degree odd polynomial P ( x ) = c x + c x + c x + c x withthe coefficients, c = . c = . c = . c = . C.2 loд ( x ) for Bfloat16 The elementary function loд ( x ) is defined over the input domain ( , ∞) . There are three classes ofspecial case inputs: Special case of loд ( x ) =  −∞ if x = ∞ if x = ∞ N aN if x < x = N aN

We use the range reduction technique described in Appendix B.1. For loд ( x ) , the range reductionfunction ( x ′ = RR ( x ) ), the output compensation function ( y = OC ( y ′ , x ) ), and the function toapproximate ( y ′ = д ( x ′ ) ) can be summarized as follows: RR ( x ) = t − t + OC ( y ′ , x ) = y ′ + m д ( x ′ ) = loд (cid:18) + x ′ − x ′ (cid:19) The value t is the fractional value represented by the mantissa of the input x and m is the exponent, i.e. x = t × m . With this range reduction technique, we need to approximate д ( x ′ ) for x ′ ∈ [ , ) .To approximate д ( x ′ ) , we use a 5 th degree odd polynomial P ( x ) = c x + c x + c x with thecoefficients, c = . c = . c = . C.3 loд ( x ) for Bfloat16 The elementary function loд ( x ) is defined over the input domain ( , ∞) . There are three classesof special case inputs:Special case of loд ( x ) =  −∞ if x = ∞ if x = ∞ N aN if x < x = N aN

We use the range reduction technique described in Appendix B.1. For loд ( x ) , the range reductionfunction ( x ′ = RR ( x ) ), the output compensation function ( y = OC ( y ′ , x ) ), and the function toapproximate ( y ′ = д ( x ′ ) ) can be summarized as follows: RR ( x ) = t − t + OC ( y ′ , x ) = y ′ + mloд ( ) д ( x ′ ) = loд (cid:18) + x ′ − x ′ (cid:19) The value t is the fractional value represented by the mantissa of the input x and m is the exponent, i.e. x = t × m . With this range reduction technique, we need to approximate д ( x ′ ) for x ′ ∈ [ , ) . Rutgers-DCS-TR, Vol. 1, No. Rutgers Computer Science Technical Report, Article 753. Publication date: July 2020.

To approximate д ( x ′ ) , we use a 5 th degree odd polynomial P ( x ) = c x + c x + c x with thecoefficients, c = . c = . c = . C.4 e x for Bfloat16 The elementary function e x is defined over the input domain (−∞ , ∞) . There are four classes ofspecial case inputs:Special case of e x =  . x ≤ − . . − . × − ≤ x ≤ . × − ∞ if x ≥ . N aN if x = N aN

We use the range reduction technique described in Appendix B.2. For e x , the range reductionfunction x ′ = RR ( x ) , output compensation function y = OC ( y ′ , x ) , and the function we have toapproximate to approximate y ′ = д ( x ′ ) is summarized below: RR ( x ) = xloд ( e ) − ⌊ xloд ( e )⌋ OC ( y ′ , x ) = y ′ ⌊ xloд ( e )⌋ д ( x ′ ) = x ′ where ⌊ x ⌋ is a floor function that rounds down x to an integer. With this range reduction technique,we need to approximate 2 x ′ for x ′ ∈ [ , ) .To approximate 2 x ′ , we use a 4 th degree polynomial P ( x ) = c + c x + c x + c x + c x withthe coefficients, c = . c = . c = . c = . × − c = . × − C.5 x for Bfloat16 The elementary function 2 x is defined over the input domain (−∞ , ∞) . There are four classes ofspecial case inputs:Special case of 2 x =  . x ≤ − . . − . × − ≤ x ≤ . × − ∞ if x ≥ . N aN if x = N aN

Novel Approach to Generate Correctly Rounded Math Libraries for New Floating Point Representations 753:35

To approximate 2 x ′ , we use a 4 th degree polynomial P ( x ) = c + c x + c x + c x + c x withthe coefficients, c = . c = . c = . c = . × − c = . × − C.6 x for Bfloat16 The elementary function 10 x is defined over the input domain (−∞ , ∞) . There are four classes ofspecial case inputs:Special case of 10 x =  . x ≤ − . . − . × − ≤ x ≤ . × − ∞ if x ≥ . N aN if x = N aN

Range reduction.

We use the range reduction technique described in Appendix B.2. The rangereduction function x ′ = RR ( x ) , output compensation function y = OC ( y ′ , x ) , and the function wehave to approximate to approximate y ′ = д ( x ′ ) is summarized below: RR ( x ) = xloд ( ) − ⌊ xloд ( )⌋ OC ( y ′ , x ) = y ′ ⌊ xloд ( )⌋ д ( x ′ ) = x ′ where ⌊ x ⌋ is a floor function that rounds down x to an integer. With this range reduction technique,we need to approximate 2 x ′ for x ′ ∈ [ , ) .To approximate 2 x ′ , we use a 4 th degree polynomial P ( x ) = c + c x + c x + c x + c x withthe coefficients, c = . c = . c = . c = . × − c = . × − C.7 √ x for Bfloat16 The elementary function √ x is defined over the input domain [ , ∞) . There are three classes ofspecial case inputs: Special case of √ x =  . x = . ∞ if x = ∞ N aN if x < x = N aN

We use the range reduction technique described in Appendix B.3. The range reduction func-tion x ′ = RR ( x ) , the output compensation function y = OC ( y ′ , x ) and the function we have toapproximate y ′ = д ( x ′ ) can be summarized as follows: RR ( x ) = x ′ OC ( y ′ , x ) = y ′ m д ( x ′ ) = √ x Rutgers-DCS-TR, Vol. 1, No. Rutgers Computer Science Technical Report, Article 753. Publication date: July 2020. where x ′ is a value in [ , ) and m is an even integer such that x = x ′ × m for the input x . Withthis range reduction technique, we need to approximate √ x ′ for x ′ ∈ [ , ) .To approximate √ x ′ , we use a 4 th degree polynomial P ( x ) = c + c x + c x + c x + c x withthe coefficients, c = . c = . c = − . c = . × − c = − . × − C.8 √ x for Bfloat16 The elementary function √ x is defined over the input domain (−∞ , ∞) . There are four classes ofspecial case inputs: Special case of √ x =  . x = . ∞ if x = ∞−∞ if x = −∞ N aN if x = N aN

We use the range reduction technique described in Appendix B.4. The range reduction func-tion x ′ = RR ( x ) , the output compensation function y = OC ( y ′ , x ) and the function we have toapproximate y ′ = д ( x ′ ) can be summarized as follows: RR ( x ) = x ′ OC ( y ′ , x ) = s × y ′ m д ( x ′ ) = √ x where s is the sign of the input x , x ′ is a value in [ , ) and m is integer multiple of 3 such that x = s × x ′ × m . With this range reduction technique, we need to approximate √ x ′ for x ′ ∈ [ , ) .To approximate √ x ′ , we use a 6 th degree polynomial P ( x ) = c + c x + c x + c x + c x + c x + c x with the coefficients, c = . c = . c = − . c = . × − c = − . × − c = . × − c = − . × − C.9 sin ( πx ) for Bfloat16 The elementary function sin ( πx ) is defined over the input domain (−∞ , ∞) . There are two classesof special case inputs:Special case of sin ( πx ) = (cid:40) N aN if x = N aN or x = ±∞ x ≥

256 or x ≤ − x into x = s × ( i + t ) where s is the sign of the input, i is an integer, and t ∈ [ , ) is the fractional part Rutgers-DCS-TR, Vol. 1, No. Rutgers Computer Science Technical Report, Article 753. Publication date: July 2020.

Novel Approach to Generate Correctly Rounded Math Libraries for New Floating Point Representations 753:37 of | x | , i.e. | x | = i + t . The range reduction function x ′ = RR ( x ) , the output compensation function y = OC ( y ′ , x ) , and the function we need to approximate, y ′ = д ( x ′ ) can be summarized as follows: RR ( x ) = (cid:40) − t if 0 . < t < . t otherwise , OC ( y ′ , x ) = (cid:40) s × y ′ if i ≡ ( mod )− s × y ′ if i ≡ ( mod ) , д ( x ′ ) = sin ( πx ) With this range reduction technique, we need to approximate sin ( πx ′ ) for x ′ ∈ [ , . ] .The sin ( πx ) function exhibit a linear-like behavior around x =

0. To approximate sin ( πx ) , weuse a piecewise polynomial consisting of two polynomials: P ( x ) = (cid:40) c x if x ′ ≤ . × − d x + d x + d x + d x otherwisewith the coefficients, c = . d = . d = − . d = . d = − . C.10 cos ( πx ) for Bfloat16 The elementary function cos ( πx ) is defined over the input domain (−∞ , ∞) . There are two classesof special case inputs:Special case of cos ( πx ) = (cid:40) N aN if x = N aN or x = ±∞ x ≥

256 or x ≤ − x into x = s × ( i + t ) where s is the sign of the input, i is an integer, and t ∈ [ , ) is the fractional partof | x | , i.e. | x | = i + t . The range reduction function x ′ = RR ( x ) , the output compensation function y = OC ( y ′ , x ) , and the function we need to approximate y ′ = д ( x ′ ) can be summarized as follows: RR ( x ) = (cid:40) − t if 0 . < t < . t otherwiseOC ( y ′ , x ) = (cid:40) − × (− ) i ( mod ) × y ′ if 0 . < t < . (− ) i ( mod ) × y ′ otherwise д ( x ′ ) = cos ( πx ′ ) With this range reduction technique, we need to approximate cos ( πx ′ ) for x ′ ∈ [ , . ] .The cos ( πx ) function exhibit a linear property around x =

0. To approximate cos ( πx ) , we usethe piecewise polynomial: P ( x ) =  c if x ′ ≤ . × − d + d x + d x + d x if 1 . × − < x ′ < . . x ′ = . Rutgers-DCS-TR, Vol. 1, No. Rutgers Computer Science Technical Report, Article 753. Publication date: July 2020. with the coefficients, c = . d = . d = − . d = . d = − . D DETAILS ON POSIT16 FUNCTIONS

In this section, we explain the posit16 functions in OurLibm. More specifically, we describe thespecial cases, the range reduction technique we used, how we split the reduced domain, and thepolynomials we generated for each posit16 math library function in OurLibm.

D.1 ln ( x ) for Posit16 The elementary function ln ( x ) is defined over the input domain ( , ∞) . There are two classes ofspecial case inputs: Special case of ln ( x ) = (cid:40) N aR if x ≤ N aR if x = N aR

We use the range reduction technique described in Appendix B.1. For ln ( x ) , the range reductionfunction ( x ′ = RR ( x ) ), the output compensation function ( y = OC ( y ′ , x ) ), and the function toapproximate ( y ′ = д ( x ′ ) ) can be summarized as follows: RR ( x ) = t − t + OC ( y ′ , x ) = y ′ + mloд ( e ) д ( x ′ ) = loд (cid:18) + x ′ − x ′ (cid:19) The value t is the fractional value represented by the mantissa of the input x and m is the exponent, i.e. x = t × m . With this range reduction technique, we need to approximate д ( x ′ ) for x ′ ∈ [ , ) .To approximate д ( x ′ ) , we use a 9 th degree odd polynomial P ( x ) = c x + c x + c x + c x + c x with the coefficients, c = . c = . c = . c = . c = . D.2 loд ( x ) for Posit16 The elementary function loд ( x ) is defined over the input domain ( , ∞) . There are two classes ofspecial case inputs: Special case of loд ( x ) = (cid:40) N aR if x ≤ N aR if x = N aR

Novel Approach to Generate Correctly Rounded Math Libraries for New Floating Point Representations 753:39

The value t is the fractional value represented by the mantissa of the input x and m is the exponent, i.e. x = t × m . With this range reduction technique, we need to approximate д ( x ′ ) for x ′ ∈ [ , ) .To approximate д ( x ′ ) , we use a 5 th degree odd polynomial P ( x ) = c x + c x + c x + c x + c x with the coefficients, c = . c = . c = . c = . c = . D.3 loд ( x ) for Posit16 The elementary function loд ( x ) is defined over the input domain ( , ∞) . There are two classes ofspecial case inputs: Special case of loд ( x ) = (cid:40) N aR if x ≤ N aR if x = N aR

We use the range reduction technique described in Appendix B.1. For loд ( x ) , the range reductionfunction ( x ′ = RR ( x ) ), the output compensation function ( y = OC ( y ′ , x ) ), and the function toapproximate ( y ′ = д ( x ′ ) ) can be summarized as follows: RR ( x ) = t − t + OC ( y ′ , x ) = y ′ + mloд ( ) д ( x ′ ) = loд (cid:18) + x ′ − x ′ (cid:19) The value t is the fractional value represented by the mantissa of the input x and m is the exponent, i.e. x = t × m . With this range reduction technique, we need to approximate д ( x ′ ) for x ′ ∈ [ , ) .We approximate д ( x ′ ) with a 9 th degree odd polynomial P ( x ) = c x + c x + c x + c x + c x with the coefficients, c = . c = . c = . c = . c = . D.4 √ x for Posit16 The elementary function √ x is defined over the input domain [ , ∞) . There are two classes ofspecial case inputs: Special case of √ x = (cid:40) . x = . N aR if x < x = N aR

To approximate √ x ′ , we use a piecewise polynomial consisting of two 6 th degree polynomials P ( x ) = (cid:40) c + c x + c x + c x + c x + c x + c x if x ′ ≤ . d + d x + d x + d x + d x + d x + d x otherwisewith the coefficients, c = . c = . c = − . c = . c = − . c = . × − c = − . × − d = . d = . d = − . d = . × − d = − . × − d = . × − d = − . × − D.5 sin ( πx ) for Posit16 The elementary function sin ( πx ) is defined over the input domain (−∞ , ∞) . There is one specialcase input: sin ( πx ) = N aR if x = N aR

We use the range reduction technique described in Appendix B.5. We decompose the input x into x = s × ( i + t ) where s is the sign of the input, i is an integer, and t ∈ [ , ) is the fractional partof | x | , i.e. | x | = i + t . The range reduction function x ′ = RR ( x ) , the output compensation function y = OC ( y ′ , x ) , and the function we need to approximate, y ′ = д ( x ′ ) can be summarized as follows: RR ( x ) = (cid:40) − t if 0 . < t < . t otherwise , OC ( y ′ , x ) = (cid:40) s × y ′ if i ≡ ( mod )− s × y ′ if i ≡ ( mod ) , д ( x ′ ) = sin ( πx ) With this range reduction technique, we need to approximate sin ( πx ′ ) for x ′ ∈ [ , . ] .The sin ( πx ) function exhibit a linear-like behavior around x =

0. To approximate sin ( πx ) , weuse a piecewise polynomial consisting of two polynomials: P ( x ) = (cid:40) c x if x ′ ≤ . × − d x + d x + d x + d x + d x otherwise Rutgers-DCS-TR, Vol. 1, No. Rutgers Computer Science Technical Report, Article 753. Publication date: July 2020.

Novel Approach to Generate Correctly Rounded Math Libraries for New Floating Point Representations 753:41 with the coefficients, c = . d = . d = − . d = . d = − . d = . × − D.6 cos ( πx ) for Posit16 The elementary function cos ( πx ) is defined over the input domain (−∞ , ∞) . There are two classesof special case inputs: cos ( πx ) = N aR if x = N aR

We use the range reduction technique described in Appendix B.6. We decompose the input x into x = s × ( i + t ) where s is the sign of the input, i is an integer, and t ∈ [ , ) is the fractional partof | x | , i.e. | x | = i + t . The range reduction function x ′ = RR ( x ) , the output compensation function y = OC ( y ′ , x ) , and the function we need to approximate y ′ = д ( x ′ ) can be summarized as follows: RR ( x ) = (cid:40) − t if 0 . < t < . t otherwiseOC ( y ′ , x ) = (cid:40) − × (− ) i ( mod ) × y ′ if 0 . < t < . (− ) i ( mod ) × y ′ otherwise д ( x ′ ) = cos ( πx ′ ) With this range reduction technique, we need to approximate cos ( πx ′ ) for x ′ ∈ [ , . ] .The cos ( πx ) function exhibit a linear property around x =

0. To approximate cos ( πx ′ ) , we usethe piecewise polynomial: P ( x ) =  c if x ′ ≤ . × − d + d x + d x + d x + d x if 3 . × − < x ′ < . . x ′ = . c = . d = . d = − . d = . d = − . d = . E loд ( x ) FOR FLOAT

Our loд x function for float in OurLibm is guaranteed to produce the correct results for the inputsin [ , ) . For all other inputs, the result is undefined.We use the range reduction technique described in Appendix B.1 to ease the job of creatingthe polynomial. For loд ( x ) , the range reduction function ( x ′ = RR ( x ) ), the output compensation Rutgers-DCS-TR, Vol. 1, No. Rutgers Computer Science Technical Report, Article 753. Publication date: July 2020. function ( y = OC ( y ′ , x ) ), and the function to approximate ( y ′ = д ( x ′ ) ) can be summarized asfollows: RR ( x ) = t − t + OC ( y ′ , x ) = y ′ д ( x ′ ) = loд (cid:18) + x ′ − x ′ (cid:19) The value t is the fractional value represented by the mantissa of the input x when x is decomposedto x = t × m with an integer exponent m . With this range reduction technique, we need toapproximate д ( x ′ ) for x ′ ∈ [ , ) .To approximate д ( x ′ ) , we use a 15 th degree odd polynomial, P ( x ) = c x + c x + c x + c x + c x + c x + c x + c x with the coefficients, c = . c = . c = . c = . c = . c = . c = . c = .298387164422755202242143468538415618240833282470703125