SLEEF: A Portable Vectorized Library of C Standard Mathematical Functions
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 31, NO. 6, JUNE 2020
Naoki Shibata,
Member, IEEE, and Francesco Petrogalli
Abstract—In this paper, we present techniques used to implement our portable vectorized library of C standard mathematical functions written entirely in C language. In order to make the library portable while maintaining good performance, intrinsic functions of vector extensions are abstracted by inline functions or preprocessor macros. We implemented the functions so that they can use sub-features of vector extensions such as fused multiply-add, mask registers and extraction of mantissa. In order to make computation with SIMD instructions efficient, the library only uses a small number of conditional branches, and all the computation paths are vectorized. We devised a variation of the Payne-Hanek argument reduction for trigonometric functions and a floating point remainder, both of which are suitable for vector computation. We compare the performance of our library to Intel SVML.
Index Terms—Parallel and vector implementations, SIMD processors, elementary functions, floating-point arithmetic

1 INTRODUCTION

The instruction set architecture of most modern processors provides Single Instruction Multiple Data (SIMD) instructions that process multiple instances of data concurrently [1]. The programming model that utilizes these instructions is a key technique for many computing systems to reach their peak performance. Most software SIMD optimizations are introduced manually by programmers. However, this approach introduces a portability problem, because the code needs to be re-written when targeting a new vector extension. In order to improve the portability of code with SIMD optimizations, recent compilers have introduced auto-vectorizing capability [2]. To fully exploit the SIMD capabilities of a system, the auto-vectorization transformation of a compiler must be able to invoke a version of a function that operates on concurrent iterations, or a vector function. This applies particularly to the C mathematical functions defined in math.h that are frequently called in hot loops. In this paper, we describe our implementation of a vectorized library of C standard math functions, called the SLEEF library. SLEEF stands for
SIMD Library for Evaluating Elementary Functions, and implements a vectorized version of all C99 real floating-point math functions. Our library provides a 1-ULP accuracy version and a 3.5-ULP accuracy version for most of the functions. We confirmed that our library satisfies such accuracy requirements on an empirical basis. Our library achieves both good performance and portability by abstracting intrinsic functions. This abstraction enables sub-features of vector extensions such as mask registers to be utilized while the source code of our library is shared among different vector extensions. We also implemented a version of the functions that returns bit-wise consistent results across all platforms. Our library is designed to be used in conjunction with vectorizing compilers. In order to help the development of vectorizing compilers, we collaborated with compiler developers in designing a Vector Function Application Binary Interface (ABI). The main difficulty in vectorizing math functions is that conditional branches are expensive. We implemented many of the functions in our library without conditional branches. We devised reduction methods and adjusted domains of polynomials so that a single polynomial covers the entire input domain. For an increased vector size, a value requiring a slow path is more likely to be contained in a vector. Therefore, we vectorized all the code paths in order to speed up the computation in such cases.

• Naoki Shibata is with the Graduate School of Information Science, Nara Institute of Science and Technology, Nara, Japan.
• Francesco Petrogalli is with ARM, 110 Fulbourn Road, Cambridge, CB1 9NJ, United Kingdom.
Manuscript received 25 Apr. 2019; revised 11 Dec. 2019; accepted 13 Dec. 2019. Date of publication 18 Dec. 2019. (Corresponding author: Naoki Shibata.) Recommended for acceptance by D. S. Nikolopoulos. Digital Object Identifier no. 10.1109/TPDS.2019.2960333
We devised a variation of the Payne-Hanek range reduction and a remainder calculation method that are both suitable for vectorized implementation. We compare the implementation of several selected functions in our library to those in other open-source libraries. We also compare the reciprocal throughput of functions in our library, Intel SVML [3], FDLIBM [4], and Vector-libm [5]. We show that the performance of our library is comparable to that of Intel SVML.

The rest of this paper is organized as follows. Section 2 introduces related work. Section 3 discusses how portability is improved by abstracting vector extensions. Section 4 explains the development of a Vector ABI and a vectorized mathematical library. Section 5 shows an overview of the implementation of SLEEF, while comparing our library with FDLIBM and Vector-libm. Section 6 explains how our library is tested. Section 7 compares our work with prior art. In Section 8, the conclusions are presented.
2 RELATED WORK
The C standard library (libc) includes the standard mathematical library (libm) [6]. There have been many implementations of libm. Among them, FDLIBM [4] and the libm included in the GNU C Library [7] are the most widely used libraries. FDLIBM is a freely distributable libm developed by Sun Microsystems, Inc., and there are many derivations of this library. Gal et al. described the algorithms used in the elementary mathematical library of the IBM Israel Scientific Center [8]. Their algorithms are based on the accurate tables method developed by Gal. It achieves high performance and produces very accurate results. Crlibm is a project to build a correctly rounded mathematical library [9].

There are several existing vectorized implementations of libm. Intel Short Vector Math Library (SVML) is a highly regarded commercial library [3]. This library provides highly optimized subroutines for evaluating elementary functions which can use several kinds of vector extensions available in Intel's processors. However, this library is proprietary and only optimized for Intel's processors. There are also a few commercial and open-source implementations of vectorized libm. AMD provides a vectorized libm called AMD Core Math Library (ACML) [10]. Some of the code from SVML is published under a free software license, and it is now published as Libmvec [11], which is a part of Glibc. This library provides functions with a 4-ULP error bound. It is coded in assembly language, and therefore it does not have good portability. C. K. Anand et al. reported their C implementation of 32 single-precision libm functions tuned for the Cell BE SPU compute engine [12]. They used an environment called Coconut, which enables rapid prototyping of patterns and rapid unit testing of assembly language fragments and patterns, to develop their library. M. Dukhan published an open-source and portable SIMD vector libm library named Yeppp! [13], [14].
Most vectorized implementations of libm utilize assembly coding or intrinsic functions to specify which vector instruction is used for each operator. On the other hand, there are also other implementations of vector versions of libm which are written in a scalar fashion but rely on a vectorizing compiler to generate vector instructions and produce a vectorized binary. Christoph Lauter published an open-source Vector-libm library implemented in plain C [5]. The VDT Mathematical Library [15] is a math library written for the compiler's auto-vectorization feature.
Manilov et al. propose a C source code translator for substituting calls to platform-specific intrinsic functions in a source code with those available on the target machine [16]. This technique utilizes graph-based pattern matching to substitute intrinsics. It can translate SIMD intrinsics between extensions with different vector lengths. This rewriting is carried out through loop unrolling.

N. Gross proposes specialized C++ templates for making the source code easily portable among different vector extensions without sacrificing performance [17]. With these templates, some part of the source code can be written in a way that resembles scalar code. In order to vectorize algorithms that have a lot of control flow, this scheme requires that the bucketing technique be applied, to compute all the paths and choose the relevant results at the end.

Clark et al. propose a method for combining static analysis at compile time and binary translation with a JIT compiler in order to translate SIMD instructions into those that are available on the target machine [18]. In this method, SIMD instructions in the code are first converted into an equivalent scalar representation. Then, a dynamic translation phase turns the scalar representation back into architecture-specific SIMD equivalents.

Leißa et al. propose a C-like language for portable and efficient SIMD programming [19]. With their extension, writing vectorized code is almost as easy as writing traditional scalar code. There is no strict separation between host code and kernels, and scalar and vector programming can be mixed. Switching between them is triggered by the type system. The authors present a formal semantics of their extension and prove the soundness of the type system.

Most of the existing methods aim at translating SIMD intrinsics or instructions to those provided by a different vector extension in order to port a code.
Intrinsics that are unique to a specific extension are not easy to handle, and translation works only if the source and the target architectures have equivalent SIMD instructions. Automatic vectorizers in compilers have a similar weakness. Whenever possible, we have specialized the implementation of the math functions to exploit the SIMD instructions that are specific to a target vector extension. We also handle FMA, rounding and a few other kinds of instructions specially, because these are critical for both execution speed and accuracy. We want to implement a library that is statically optimized and usable with Link Time Optimization (LTO). The users of our library do not appreciate the use of a JIT compiler. In order to minimize dependency on external libraries, we want to write our library in C. In order to fulfill these requirements, we take a cross-layer approach: we have been developing our abstraction layer of intrinsics, the library implementation, and the algorithms together, in order to make our library run fast with any vector extension.
3 ABSTRACTION OF VECTOR EXTENSIONS
Modern processors supporting SIMD instructions have SIMD registers that can contain multiple data elements [1]. For example, a 128-bit wide SIMD register may contain four 32-bit single-precision FP numbers. A SIMD add instruction might take two of these registers as operands, add the four pairs of numbers, and overwrite one of these registers with the resulting four numbers. We call an array of FP numbers contained in a SIMD register a vector.

SIMD registers and instructions can be exposed in a C program with intrinsic functions and types [20]. An intrinsic function is a kind of inline function that exposes the architectural features of an instruction set at the C level. By calling an intrinsic function, a programmer can make a compiler generate a specific instruction without hand-coded assembly. Nevertheless, the compiler can still reorder instructions and allocate registers, and therefore optimize the code. When intrinsic functions corresponding to SIMD instructions are
defined inside a compiler, C data types for representing vectors are also defined.

// Type definition
typedef svbool_t vopmask;
typedef svfloat64_t vdouble;

// Abstraction of intrinsic functions
#define ptrue svptrue_b8()

vdouble vcast_vd_d(double d) { return svdup_n_f64(d); }
vdouble vsub_vd_vd_vd(vdouble x, vdouble y) {
  return svsub_f64_x(ptrue, x, y);
}
vopmask veq_vo_vd_vd(vdouble x, vdouble y) {
  return svcmpeq_f64(ptrue, x, y);
}
vopmask vlt_vo_vd_vd(vdouble x, vdouble y) {
  return svcmplt_f64(ptrue, x, y);
}
vopmask vor_vo_vo_vo(vopmask x, vopmask y) {
  return svorr_b_z(ptrue, x, y);
}
vdouble vsel_vd_vo_vd_vd(vopmask o, vdouble x, vdouble y) {
  return svsel_f64(o, x, y);
}

Fig. 1: A part of the definitions in the VEAL for SVE

vdouble xfdim(vdouble x, vdouble y) {
  vdouble ret = vsub_vd_vd_vd(x, y);
  ret = vsel_vd_vo_vd_vd(
      vor_vo_vo_vo(vlt_vo_vd_vd(ret, vcast_vd_d(0)),
                   veq_vo_vd_vd(x, y)),
      vcast_vd_d(0), ret);
  return ret;
}

Fig. 2: Implementation of a vectorized fdim (positive difference) function with VEAL

In SLEEF, we use intrinsic functions to specify which assembly instruction to use for each operator. We abstract the intrinsic functions for each vector extension by a set of inline functions or preprocessor macros. We implement the functions exported from the library to call these abstract intrinsic functions instead of directly calling intrinsic functions. In this way, it is easy to swap the vector extension in use. We call our set of inline functions for abstracting architecture-specific intrinsics the
Vector Extension Abstraction Layer (VEAL).

In some of the existing vector math libraries, functions are implemented with hand-coded assembly [11]. This approach improves the absolute performance, because it is possible to provide the optimal implementation for each microarchitecture. However, processors with a new microarchitecture are released every few years, and the library needs revision accordingly in order to maintain optimal performance.

In other vector math libraries, the source code is written in a scalar fashion that is easy for compilers to auto-vectorize [5], [15]. Although such libraries have good portability, it is not easy for compilers to generate well-optimized code. In order for each transformation rule in an optimizer to kick in, the source code must satisfy many conditions to guarantee that the optimized code runs correctly and faster. In order to control the level of optimization, a programmer must specify special attributes and compiler options.
There are differences in the features provided by different vector extensions, and we must change the function implementation according to the available features. Thanks to the level of abstraction provided by the VEALs, we implemented the functions so that all the different versions can be built from the same source files with different macros enabled. For example, the availability of FMA instructions is important when implementing double-double (DD) operators [21]. We implemented the DD operators both with and without FMA, utilizing VEALs to manually specify whether the compiler may convert each combination of multiplication and addition instructions into an FMA instruction.

Generally, bit masks are used in vectorized code in order to conditionally choose elements from two vectors. In some vector extensions, a vector register whose width matches that of a vector register for storing FP values is used to store a bit mask. Some vector extensions provide narrower vector registers that are dedicated to this purpose, called opmask registers. SLEEF makes use of these opmask registers by providing a dedicated data type in the VEALs. If a vector extension does not support an opmask, the usual bit mask is used instead. It is also beneficial to have an opmask as an argument of a whole math function and make that function compute only the elements specified by the opmask. By utilizing a VEAL, it is also easy to implement such functionality.
Fig. 1 shows some definitions in the VEAL for SVE [22]. We abstract vector data types and intrinsic functions with typedef statements and inline functions, respectively. The vdouble data type is for storing vectors of double-precision FP numbers. vopmask is the data type for the opmask described in 3.1. The function vcast_vd_d returns a vector in which the given scalar value is copied to all elements. vsub_vd_vd_vd is a function for vector subtraction between two vdouble data. veq_vo_vd_vd compares the elements of two vectors of vdouble type. The results of the comparison can be used, for example, by vsel_vd_vo_vd_vd to choose a value for each element between two vector registers. Fig. 2 shows an implementation of a vectorized positive difference function using a VEAL. This function is a vectorized implementation of the fdim function in the C standard math library.
The method of implementing math functions described so far can deliver computation results that slightly differ depending on the architecture and other conditions, although they all satisfy the accuracy requirements and other specifications. However, in some applications, bit-wise consistent results are required. To this extent, the SLEEF project has been working closely with Unity Technologies, which specializes in developing frameworks for video gaming, and we discovered
1. https://unity3d.com/.
that they have unique requirements for the functionalities of math libraries. Networked video games run on many gaming consoles with different architectures, and they share the same virtual environment. Consistent results of simulation at each terminal and server are required to ensure fairness among all players. For this purpose, fast computation is more important than accurate computation, while the results of computation have to agree perfectly between many computing nodes, which are not guaranteed to rely on the same architecture. Usually, fixed-point arithmetic is used for a purpose like this; however, there is a demand for modifying existing code with FP computation to support networking.

There are also other kinds of simulation in which bit-wise identical reproducibility is important. In [23], the authors show that modeled mean climate states, variability and trends at different scales may be significantly changed or even lead to opposing results due to the round-off errors in climate system simulations. Since reproducibility is a fundamental principle of scientific research, they propose to promote bit-wise identical reproducibility as a worldwide standard.

One way to obtain bit-wise consistent values from math functions is to compute correctly rounded values. However, for applications like networked video games, this might be too expensive. SLEEF provides vectorized math functions that return bit-wise consistent results across all platforms and other settings, and this is also achieved by utilizing VEALs. The basic idea is to always apply the same sequence of operations to the arguments. The IEEE 754 standard guarantees that the basic arithmetic operators give correctly rounded results [24], and therefore the results from these operators are bit-wise consistent.
Because most of the functions except trigonometric functions do not have a conditional branch in our library, producing bit-wise consistent results is fairly straightforward with VEALs. Availability of FMA instructions is another key to making results bit-wise consistent. Since FMA instructions are critical for performance, we cannot simply give up using them. In SLEEF, the bit-wise consistent functions come in two versions, with and without FMA instructions. We provide a non-FMA version of the functions to guarantee bit-wise consistency with extensions such as Intel SSE2 that do not have FMA instructions. Another issue is that the compiler might introduce inconsistency by FP contraction, which is the result of combining a pair of multiplication and addition operations into an FMA. By disabling FP contraction, the compiler strictly preserves the order and the type of FP operations during optimization. It is also important to make the returned values from the scalar functions bit-wise consistent with the vector functions. In order to achieve this, we also made a VEAL that only uses scalar operators and data types. The bit-wise consistent and non-consistent versions of the vector and scalar functions are all built from the same source files, with different VEALs and macros enabled. As described in Section 5, trigonometric functions in SLEEF choose a reduction method according to the maximum argument of all elements in the argument vector. In order to make the returned value bit-wise consistent, the bit-wise consistent version of the functions first applies the reduction method for small arguments to the elements covered by this method. Then it applies the second method only to the elements with larger arguments which the first method does not cover.

4 THE DEVELOPMENT OF A VECTOR FUNCTION
ABI AND SLEEF
Recent compilers are developing new optimization techniques to automatically vectorize code written in standard programming languages that do not support parallelization [25], [26]. Although the first SIMD and vector computing systems [27] appeared a few decades ago, compilers with auto-vectorization capability have not been widely used until recently, because of several difficulties in implementing such functionality for modern SIMD architectures. Such difficulties include verifying whether the compiler can vectorize a loop or not, by determining the data access patterns of the operations in the loop [2], [28]. For languages like C and C++, it is also difficult to determine the data dependencies through the iteration space of the loop, because it is hard to determine the aliasing conditions of the arrays processed in the loop.
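The aliasing difficulty can be seen in a small kernel: without extra information, the compiler must assume that stores through dst may modify src, which blocks vectorization; the C99 restrict qualifier asserts non-aliasing. This is a sketch for illustration, and whether a given compiler actually vectorizes either loop also depends on its flags.

```c
#include <stddef.h>

// May alias: before vectorizing, the compiler must prove (or
// check at run time) that dst does not overlap src.
void scale(double *dst, const double *src, size_t n, double a) {
    for (size_t i = 0; i < n; i++) dst[i] = a * src[i];
}

// restrict promises the pointers do not overlap, removing the
// dependence-analysis obstacle; the loop is trivially vectorizable.
void scale_restrict(double *restrict dst, const double *restrict src,
                    size_t n, double a) {
    for (size_t i = 0; i < n; i++) dst[i] = a * src[i];
}
```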
Vectorizing compilers convert calls to scalar versions of math functions such as sine and exponential into calls to SIMD versions of the math functions. The most recent versions of the Intel Compiler [29], GNU Compiler [30], and Arm Compiler for HPC [31], which is based on Clang/LLVM [32], [33], are capable of this transformation, and rely on the availability of vector math libraries such as SVML [3], Libmvec [11] and SLEEF, respectively, to provide an implementation of the vector function calls that they generate. In order to develop this kind of transformation, a target-dependent
Application Binary Interface (ABI) for calling vectorized functions had to be designed. The Vector Function ABI for the AArch64 architecture [34] was designed in close relationship with the development of SLEEF. This type of ABI must standardize the mapping between scalar functions and vector functions. The existence of a standard enables interoperability across different compilers, linkers and libraries, thanks to the use of standard names defined by the specification.

The ABI includes a name mangling function, a map that converts the scalar signature to the vector one, and the calling conventions that the vector functions must obey. In particular, the name mangling function that takes the name of the scalar function to the vector function must encode all the information that is necessary to reverse the transformation back to the original scalar function. A linker can use this reverse mapping to enable more optimizations (Link Time Optimization) that operate on object files and do not have access to the scalar and vector function prototypes. There is a demand by users for using a different vector math library according to the usage, and the reverse mapping is also handy for this purpose. A vector math library implements a function for each combination of a vector extension, a vector length and a math function to evaluate. As a result, the library exports a large number of functions. Some vector math libraries can only implement
part of all the combinations. By using the reverse mapping mechanism, the compiler can check the availability of the functions by scanning the symbols exported by a library.

The Vector Function ABI is also used with OpenMP [35]. From version 4.0 onwards, OpenMP provides the directive declare simd. A user can decorate a function with this directive to inform the compiler that the function can be safely invoked concurrently on multiple instances of its arguments [36]. This means that the compiler can vectorize the function safely. This is particularly useful when the function is provided via a separate module, or an external library, for example in situations where the compiler is not able to examine the behavior of the function at the call site. The scalar-to-vector function mapping rules stipulated in the Vector Function ABI are based on the classification of vector functions associated with the declare simd directive of OpenMP. Currently, work for implementing these OpenMP directives in LLVM is ongoing.

The Vector Function ABI specifications are provided for the Intel x86 and the Armv8 (AArch64) families of vector extensions [34], [37]. The compiler generates SIMD function calls according to the compiler flags. For example, when targeting AArch64 SVE auto-vectorization, the compiler will transform a call to the standard sin function into a call to the symbol _ZGVsMxv_sin. When targeting Intel AVX-512 [38] auto-vectorization, the compiler would generate a call to the symbol _ZGVeNe8v_sin.

SLEEF is provided as two separate libraries. The first library exposes the functions of SLEEF to programmers for inclusion in their C/C++ code. The second library exposes the functions with names mangled according to the Vector Function ABI. This makes SLEEF a viable alternative to libm and its SIMD counterpart libmvec, in glibc.
This also enables a user work-flow that relies on the auto-vectorization capabilities of a compiler. The compatibility with libmvec enables users to swap from libmvec to libsleef by simply changing compiler options, without changing the code that generated the vector call. The two SLEEF libraries are built from the same source code, which is configured to target the different versions via auto-generative programs that transparently rename the functions according to the rules of the target library.
5 OVERVIEW OF LIBRARY IMPLEMENTATION
One of the objectives of the SLEEF project is to provide a library of vectorized math functions that can be used in conjunction with vectorizing compilers. When non-vectorized code is automatically vectorized, the compiler converts calls to scalar math functions into calls to a SIMD version of the math functions. In order to make this conversion safe and applicable to a wide variety of code, we need functions with a 1-ULP error bound that conform to the ANSI C standard. On the other hand, there are users who need better performance. Our library provides a 1-ULP accuracy version and a 3.5-ULP accuracy version for most of the functions. We confirmed that our library satisfies the accuracy requirements on an empirical basis. For non-finite inputs and outputs, we implemented the functions to return the same results as libm, as specified in the ANSI C standard. They do not set errno nor raise an exception.

In order to optimize a program with SIMD instructions, it is important to eliminate conditional branches as much as possible, and to execute the same sequence of instructions regardless of the argument. If the algorithm requires conditional branches according to the argument, it must prepare for the case where the elements in the input vector contain both values that would make a branch happen and values that would not. Recent processors have a long pipeline, and therefore the branch misprediction penalty can reach more than 10 cycles [39]. Making a decision for a conditional branch also requires non-negligible computation, within the scope of our tests. A conditional move is an operator for choosing one value from two given values according to a condition. This is equivalent to a ternary operator and can be used in vectorized code to replace a conditional branch. Some other operations are also expensive in a vectorized implementation. A table lookup is expensive.
Although in-register table lookup is reported to be fast on the Cell BE SPU [12], it is substantially slower than polynomial evaluation without any table lookup, within the scope of our tests. Most vector extensions do not provide 64-bit integer multiplication or a vector shift operator with which each element of a vector can be shifted by a different number of bits. On the other hand, FMA and round-to-integer instructions are supported by most vector extensions. Due to the nature of the evaluation methods, dependency between operations cannot be completely eliminated. Latencies of operations become an issue when a series of dependent operations is executed. FP division and square root are not too expensive from this aspect.

The actual structure of the pipeline in a processor is complex, and such details are not well documented for most CPUs. Therefore, it is not easy to optimize the code according to such hardware implementation specifics. In this paper, we define the latency and throughput of an instruction or a subroutine as follows [41]. The latency of an instruction or a subroutine is the delay that it generates in a dependency chain. The throughput is the maximum number of instructions or subroutines of the same kind that can be executed per unit time when the inputs are independent of the preceding instructions or subroutines. Several tools and methods have been proposed for automatically constructing models of latency, throughput, and port usage of instructions [42], [43]. Within the scope of our tests, most of the instruction latency in the critical path of evaluating a vector math function tends to be dominated by FMA operations. In many processors, FMA units are implemented in a pipelined manner. Some powerful processors have multiple FMA units with out-of-order execution, and thus the throughput of FMA instructions is large, while the latency is long. In SLEEF, we try to maximize the throughput of computation in a versatile way by only taking account of dependencies among FMA operations.
We regard each FMA operation as a job that can be executed in parallel, and we try to reduce the length of the critical path.
2. The latencies of 256-bit DP add, divide and sqrt instructions are 4, 14 and 18 cycles, respectively, on Intel Skylake processors [40].
In order to evaluate a double-precision (DP) function to 1-ULP accuracy, internal computation with accuracy better than 1 ULP is sometimes required.
Double-double (DD) arithmetic, in which a single value is expressed as a sum of two double-precision FP values [44], [45], is used for this purpose. All the basic operators for DD arithmetic can be implemented without a conditional branch, and therefore it is suitable for vectorized implementation. Because we only need 1-ULP overall accuracy for DP functions, we use simplified DD operators with less than the full DD accuracy. In SLEEF, we omit re-normalization of DD values by default, allowing overlap between the two numbers. We carry out re-normalization only when necessary.

Evaluation of an elementary function often consists of three steps: range reduction, approximation, and reconstruction [21]. The approximation step computes the elementary function using a polynomial. Since this approximation is only valid for a small domain, a number within that range is computed from the argument in the range reduction step. The reconstruction step combines the results of the first two steps to obtain the resulting number.

An argument reduction method that finds an FP remainder of dividing the argument x by π is used in the evaluation of trigonometric functions. The range reduction method suggested by Cody and Waite [46], [47] is used for small arguments. The Payne and Hanek method [48] provides an accurate range reduction for a large argument of a trigonometric function, but it is expensive in terms of operations.

There are tools available for generating the coefficients of the polynomials, such as Maple [49] and Sollya [50]. In order to fine-tune the generated coefficients, we created a tool for generating coefficients that minimizes the maximum relative error. When a SLEEF function evaluates a polynomial, it evaluates a few lowest-degree terms in DD precision while the other terms are computed in double precision, in order to achieve 1-ULP overall accuracy.
Accordingly, coefficients in DD precision, or coefficients that can be represented by FP numbers with only a few most significant bits in the mantissa, are used in the last few terms. We designed our tool to generate such coefficients. We use Estrin's scheme [51] to evaluate a polynomial in order to reduce dependencies between FMA operations. This scheme reduces bubbles in the pipeline, and allows more FMA operations to be executed in parallel. Reducing latency can improve the throughput of evaluating a function, because the latency and the reciprocal throughput of the entire function are close to each other.

Below, we describe and compare the implementations of selected functions in SLEEF, FDLIBM [4] and Christoph Lauter's Vector-libm [5]. We describe the 1-ULP accuracy version of each function in SLEEF. The error bound specification of FDLIBM is 1 ULP.

sin and cos

FDLIBM uses Cody-Waite range reduction if the argument is below a fixed threshold. Otherwise, it uses the Payne-Hanek range reduction. Then, it switches between polynomial approximations of the sine and cosine functions on [−π/4, π/4]. Each polynomial has 6 non-zero terms.

sin and cos in Vector-libm have a 4-ULP error bound. They use a vectorized path if all arguments are greater than 3.05e-151, and less than 5.147 for sine and 2.574 for cosine. In the vectorized paths, a polynomial with 8 and 9 non-zero terms, respectively, is used to approximate the sine function, following Cody-Waite range reduction. In the scalar paths, Vector-libm uses a polynomial with 10 non-zero terms.

SLEEF switches among two Cody-Waite range reduction methods with different sets of constants, and the Payne-Hanek reduction. The first version of the algorithm operates for arguments within [−15, 15], and the second version for arguments within [−10^14, 10^14]. Otherwise, SLEEF uses a vectorized Payne-Hanek reduction, which is described in Sec. 5.8. SLEEF only uses conditional branches for choosing a reduction method from Cody-Waite and Payne-Hanek.
SLEEF uses a polynomial approximation of the sine function on [−π/2, π/2], which has 9 non-zero terms. The sign is set in the reconstruction step.

tan

After Cody-Waite or Payne-Hanek reduction, FDLIBM reduces the argument to [0, 0.67434], and uses a polynomial approximation with 13 non-zero terms. It has 10 if statements after the Cody-Waite reduction.

tan in Vector-libm has an 8-ULP error bound. A vectorized path is used if all arguments are less than 2.574 and greater than 3.05e-151. After Cody-Waite range reduction, a polynomial with 9 non-zero terms approximating the sine function is used twice, to approximate the sine and the cosine of the reduced argument. The result is obtained by dividing these two values. In the scalar path, Vector-libm evaluates a polynomial with 10 non-zero terms twice.

In SLEEF, the argument is reduced in three levels. It first reduces the argument to [−π/2, π/2] with Cody-Waite or Payne-Hanek range reduction. Then, it reduces the argument to [−π/4, π/4] with tan a = 1 / tan(π/2 − a). At the third level, it reduces the argument to [−π/8, π/8] with the double-angle formula. Let a′ be the reduced argument after Cody-Waite or Payne-Hanek. a″ = π/2 − a′ if |a′| > π/4. Otherwise, a″ = a′. Then, SLEEF uses a polynomial approximation of the tangent function on [−π/8, π/8], which has 9 non-zero terms, to approximate tan(a″/2). Let t be the value obtained with this approximation. Then, tan a′ ≈ 2t/(1 − t²) if |a′| ≤ π/4. Otherwise, tan a′ ≈ (1 − t²)/(2t). SLEEF only uses conditional branches for choosing a reduction method from Cody-Waite and Payne-Hanek. Annotated source code of tan is shown in Appendix A.

asin and acos

FDLIBM and SLEEF first reduce the argument to [0, 0.5] using arcsin x = π/2 − 2 arcsin √((1 − x)/2) and arccos x = 2 arcsin √((1 − x)/2). Then, SLEEF uses a polynomial approximation of arcsine on [0, 0.5] with 12 non-zero terms. FDLIBM uses a rational approximation with 11 terms (plus one division).
For computing arcsine, FDLIBM switches the approximation method when the original argument is over 0.975. For computing arccosine, it has three paths that are taken when |x| < 0.5, x ≤ −0.5 and x ≥ 0.5, respectively. It has 7 and 6 if statements in asin and acos, respectively.

asin and acos in Vector-libm have a 6-ULP error bound. asin and acos in Vector-libm use vectorized paths if the arguments are all greater than 3.05e-151 and 2.77e-17, respectively. Vector-libm evaluates polynomials with 3, 8, 8, and 5 terms to compute arcsine. It evaluates a polynomial with 21 terms for arccosine.

atan

FDLIBM reduces the argument to [0, 7/16]. It uses a polynomial approximation of the arctangent function with 11 non-zero terms. It has 9 if statements.

atan in Vector-libm has a 6-ULP error bound. Vector-libm uses vectorized paths if the arguments are all greater than 1.86e-151 and less than 2853. It evaluates four polynomials with 7, 9, 9 and 4 terms in the vectorized path.

SLEEF reduces the argument a to [0, 1] using arctan x = π/2 − arctan(1/x). Let a′ = 1/a if |a| ≥ 1. Otherwise, a′ = a. It then uses a polynomial approximation of the arctangent function with 20 non-zero terms to approximate r ≈ arctan a′. As a reconstruction, it computes arctan a ≈ π/2 − r if |a| ≥ 1. Otherwise, arctan a ≈ r.

log

FDLIBM reduces the argument to [√2/2, √2]. It then approximates the logarithm of the reduced argument with a polynomial that contains 7 non-zero terms, in a similar way to SLEEF. It has 9 if statements.

log in Vector-libm has a 4-ULP error bound. It uses a vectorized path if the input is a normalized number. It uses a polynomial with 20 non-zero terms to approximate the logarithm function on the reduced domain. It does not use division.

SLEEF multiplies the argument a by 2^64 if the argument is a denormal number. Let a′ be the resulting argument, e = ⌊log₂(4a′/3)⌋ and m = a′ · 2^−e. If a is a denormal number, 64 is subtracted from e.
SLEEF uses a polynomial with 7 non-zero terms to evaluate

  log m ≈ Σ_{n=0}^{6} C_n ((m − 1)/(m + 1))^(2n+1),

where C_0, ..., C_6 are constants. As a reconstruction, it computes log a = e log 2 + log m.

exp

All the libraries reduce the argument range to [−(log 2)/2, (log 2)/2] by finding r and an integer k such that x = k log 2 + r, |r| ≤ (log 2)/2.

SLEEF then uses a polynomial approximation with 13 non-zero terms to directly approximate the exponential function on this domain. It achieves a 1-ULP error bound without using a DD operation.

FDLIBM uses a polynomial with 5 non-zero terms to approximate f(r) = r(e^r + 1)/(e^r − 1). It then computes exp(r) = 1 + 2r/(f(r) − r). It has 11 if statements. The reconstruction step adds the integer k to the exponent of the resulting FP number of the above computation.

exp in Vector-libm has a 4-ULP error bound. A vectorized path covers almost all of the input domain. It uses a polynomial with 11 terms to approximate the exponential function.

pow

FDLIBM computes y log₂ x in DD precision. Then, it computes pow(x, y) = e^((log 2) · y log₂ x). It has 44 if statements. Vector-libm does not implement pow.

SLEEF computes e^(y log x). The internal computation is carried out in DD precision. In order to compute the logarithm internally, it uses a polynomial with 11 non-zero terms. The accuracy of the internal logarithm function is around 0.008 ULP. The internal exponential function in pow uses a polynomial with 13 non-zero terms.

Payne-Hanek reduction

Our method computes rfrac(2x/π) · π/2, where rfrac(a) := a − round(a). The argument x is an FP number, and therefore it can be represented as M · 2^E, where M is an integer mantissa and E is an integer exponent. We now denote the integral part and the fractional part of 2^E · 2/π by I(E) and F(E), respectively. Then,

  rfrac(2x/π) = rfrac(M · 2^E · 2/π)
              = rfrac(M · (I(E) + F(E)))
              = rfrac(M · F(E)).
The value F(E) only depends on the exponent of the argument, and therefore it can be calculated and stored in a table in advance. In order to compute rfrac(M · F(E)) in DD precision, F(E) must be kept in quad-double precision. We now denote F(E) = F₀(E) + F₁(E) + F₂(E) + F₃(E), where F₀(E), ..., F₃(E) are DP numbers and |F₀(E)| ≥ |F₁(E)| ≥ |F₂(E)| ≥ |F₃(E)|. Then,

  rfrac(M · F(E))
    = rfrac(M · F₀(E) + M · F₁(E) + M · F₂(E) + M · F₃(E))
    = rfrac(rfrac(rfrac(rfrac(M · F₀(E)) + M · F₁(E)) + M · F₂(E)) + M · F₃(E)),   (1)

because rfrac(a + b) = rfrac(rfrac(a) + b). In our method, we compute (1) in DD precision in order to avoid overflow. The size of the table retaining F₀(E), ..., F₃(E) is 32 Kbytes. Our method is included in the source code of tan shown in Appendix A.

FDLIBM seems to implement the original Payne-Hanek algorithm with more than 100 lines of C code, which include 13 if statements, 18 for loops, 1 switch statement and 1 goto statement. The numbers of iterations of most of the for loops depend on the argument.

Vector-libm implements a non-vectorized variation of the Payne-Hanek algorithm which has some similarity to our method. In order to reduce an argument x, it first decomposes |x| into E and n such that 2^E · n = |x|. A triple-double (TD) approximation to t(E), which is 2^E/π with its high-order integer bits removed, is looked up from a table. It then calculates m = n · t(E) in TD precision. The reduced argument is obtained as the product of π and the fractional part of m.

In Table 1, we compare the numbers of FP operators in the implementations. Note that the method in Vector-libm is used for trigonometric functions with a 4-ULP error bound, while our method is used for functions with a 1-ULP error bound.
Fig. 3: Example computation of FP remainder

TABLE 1: Number of FP operators in the Payne-Hanek implementations
Operator   SLEEF (1-ULP)   Vector-libm (4-ULP)
add/sub         36                71
mul              5                18
FMA             11                 0
round            8                 0
ALGORITHM 1: Exact remainder calculation
Input: Finite positive numbers n and d
Output: Returns n − d⌊n/d⌋

  r_0 := n, k := 0
  while d ≤ r_k do
    q_k := an arbitrary integer satisfying (r_k/d)/2 ≤ q_k ≤ r_k/d
    r_{k+1} := r_k − q_k d
    k := k + 1
  end while
  return r_k

We devised an exact remainder calculation method suitable for a vectorized implementation. The method is based on the long division method, where an FP number is regarded as a digit. Fig. 3 shows an example process for calculating the FP remainder of 1e+40 / 0.75. As in a typical long division, we first find the integer quotient 1.333e+40, so that 1.333e+40 · 0.75 does not exceed 1e+40. We multiply the found quotient by 0.75, and then subtract the product from 1e+40 to find the dividend 4.836e+24 for the second iteration.

Our basic algorithm is shown in Algorithm 1. If n, d and q_k are FP numbers of the same precision p, then r_k is representable with an FP number of precision p. In this case, the number of iterations can be minimized by substituting q_k with the largest FP number of precision p within the range specified at line 3. However, the algorithm still works if q_k is any FP number of precision p within the range. By utilizing this property, an implementer can use a division operator that does not return a correctly rounded result. The source code of an implementation of this algorithm is shown in Fig. 7 in Appendix B. A part of the proof of correctness is shown in Appendix C.

FDLIBM uses a shift-and-subtract method. It first converts the mantissas of the two given arguments into 64-bit integers, and calculates the remainder on a bit-by-bit basis. The main loop iterates ix − iy times, where ix and iy are the exponents of the arguments of fmod. This loop includes 10 integer additions and 3 if statements. The number of iterations of the main loop can exceed 1000.

Vector-libm does not implement an FP remainder.
Our implementation gives a value within the specified error bound without special handling of denormal numbers, unless otherwise noted.

When a function has to return a specific value when a specific argument value (such as a NaN or a negative zero) is given, such a condition is checked at the end of the function, and the return value is substituted with the special value if the condition is met. This process is complicated in functions like pow, because they have many conditions for returning special values.

SLEEF functions do not give correct results if the computation mode is different from round-to-nearest. They do not set errno nor raise an exception. This is a common behavior among vectorized math libraries, including Libmvec [11] and SVML [3]. Because of SIMD processing, functions can raise spurious exceptions if they try to raise an exception.
FDLIBM extensively uses conditional branches in order to switch the polynomial according to the argument (sin, cos, tan, log, etc.), to return a special value if the arguments are special values (pow, etc.), and to control the number of iterations (the Payne-Hanek reduction).

Vector-libm switches between a few polynomials in most of the functions. It does not provide functions with a 1-ULP error bound; nevertheless, the numbers of non-zero terms in the polynomials are larger than in the other two libraries for some of the functions. A vectorized path is used only if the argument is smaller than 2.574 in cos and tan, although these functions are frequently evaluated with arguments up to π. In most of the functions, Vector-libm uses a non-vectorized path if the argument is very small or a non-finite number. For example, it processes 0 with non-vectorized paths in many functions, although 0 is a frequently evaluated argument in normal situations. If non-finite numbers are once contained in the data being processed, the whole processing can become significantly slower afterward. Variation in execution time can also be exploited for a side-channel attack in cryptographic applications.

SLEEF uses the fastest paths if all the arguments are under 15 for trigonometric functions, and the same vectorized path is used regardless of the argument in most of the non-trigonometric functions. SLEEF always uses the same polynomial regardless of the argument in all functions. Although reducing the number of conditional branches has several advantages in implementing vector math libraries, it does not seem to be given a high priority in the other libraries.
TESTING
SLEEF includes three kinds of testers. The first two kinds of testers test the accuracy of all functions against high-precision evaluation using the MPFR library. In these tests, the computation error in ULPs is calculated by comparing the values output by each SLEEF function with the values output by the corresponding function in the MPFR library, and it is checked whether the error is within the specified bounds.
The first kind of tester carries out a perfunctory set of tests to check that the build is correct. These tests include standards compliance tests, accuracy tests and regression tests.

In the standards compliance tests, we test whether the functions return the correct values when values that require special handling are given as arguments. These argument values include ±Inf, NaN and ±0. atan2 and pow are binary functions and have many combinations of these special argument values, which are also all tested.

In the accuracy tests, we test whether the error of the values returned from the functions is within the specified range when a predefined set of argument values is given. These argument values are basically chosen from a few combinations of two values at regular intervals. The trigonometric functions are also tested against argument values close to integral multiples of π/2. Each function is tested against tens of thousands of argument values in total.

In the regression tests, the functions are tested with argument values that triggered bugs in previous library releases, in order to prevent re-emergence of the same bugs.

The executables are separated into a tester and IUTs (Implementations Under Test). The tests are carried out by making these two executables communicate via an input/output pipeline, in order to enable testing of libraries for architectures which the MPFR library does not support.

The second kind of tester is designed to run continuously. This tester generates random arguments and compares the output from each function with the output calculated by the corresponding function in the MPFR library. This tester is expected to find bugs if it is run for a sufficiently long time. In order to randomly generate an argument, the tester generates random bits of the size of an FP value, and reinterprets the bits as an FP value. The tester executes the randomized test for all the functions in the library at several thousand arguments per second for each function on a computer with a Core i7-6700 CPU.

In the SLEEF project, we use randomized testing in order to check the correctness of the functions, rather than formal verification. It is indeed true that proving the correctness of an implementation contributes to its reliability.
However, formal verification carries a performance overhead, because the implementation is limited to forms whose correctness is easy to prove. There would also be an increased cost of maintaining the library, because of the need to update the proof each time the implementation is modified.
The third kind of tester checks whether bit-identical results are returned from the functions that are supposed to return such results. This test is designed to compare the results among binaries compiled with different vector extensions. For each predetermined list of arguments, we calculate an MD5 hash value of all the outputs from each function. Then, we check whether the hash values match among the functions for the different architectures.
PERFORMANCE COMPARISON
In this section, we present the results of a performance comparison between FDLIBM version 5.3 [4], Vector-libm [5], SLEEF 3.4, and Intel SVML [3] included in Intel C Compiler 19.

We measured the reciprocal throughput of each function by measuring the execution time of a tight loop that repeatedly calls the function in a single-threaded process. In order to obtain useful results, we turned off the optimization flags when compiling the source code of this tight loop, while the libraries were compiled with their default optimization options. We did not use LTO. We confirmed that the calls to the function are not compiled out or inlined by checking the assembly output from the compiler. The execution time of this loop is measured with the clock_gettime function.

We compiled SLEEF and FDLIBM using gcc-7.3.0 with the “-O3 -mavx2 -mfma” optimization options. We compiled Vector-libm using gcc-7.3.0 with the default “-O3 -march=native -ftree-vectorize -ftree-vectorizer-verbose=1 -fno-math-errno” options. We changed VECTORLENGTH in vector.h to 4, and compiled the source code on a computer with an Intel Core i7-6700 CPU. The accuracy of the functions in SVML can be chosen by compiler options. We specified the “-fimf-max-error=1.0” and “-fimf-max-error=4.0” options for icc to obtain the 1-ULP and 4-ULP accuracy results, respectively.

We carried out all the measurements on a physical PC with an Intel Core i7-6700 CPU @ 3.40GHz without any virtual machine. In order to make sure that the CPU was always running at the same 3.4GHz clock speed during the measurements, we turned off Turbo Boost. With this setting, 10 nanoseconds correspond to 34 clock cycles.

The following results compare the reciprocal throughput of each function. If the implementation is vectorized and each vector has N elements of FP numbers, then a single execution evaluates the corresponding mathematical function N times. We generated arguments in advance and stored them in arrays.
Each time a function is executed, we set a randomly generated argument to each element of the argument vector (each element is set to a different value). The measurement results do not include the delay for generating the random numbers.

3. If we turn on the optimizer, there is a concern that the compiler optimizes away the calls to a function. In order to prevent this, we would have to introduce extra operations, but these also introduce overhead. After trying several configurations of the loop and optimizer settings, we decided to turn off the optimizer in favor of reproducibility, simplicity and fairness. We checked the assembly output from the compiler and confirmed that the unoptimized loop simply calls the target function and increments a counter, and therefore that the operations inside the loop are minimal.

4. Vector-libm evaluates functions with 512 bits of vector length by default. Because SLEEF and SVML are 256 bits wide, this setting is changed.

Fig. 4: Reciprocal throughput of double-precision fmod functions (execution time in microseconds versus log10(Numerator / Denominator), for the Intel scalar, FDLIBM scalar and 256-bit SLEEF implementations)
We compared the reciprocal throughput of the double-precision fmod functions in the libm included in Intel C Compiler 19, FDLIBM and SLEEF. All of these FP remainder functions always return a correctly rounded result. We generated a random denominator d uniformly distributed within [1, 2], and a numerator uniformly distributed within an interval around 10^r · d, where r is varied. Fig. 4 shows the reciprocal throughput of the fmod function in each library. Please note that SVML does not contain a vectorized fmod function. The graph of the reciprocal throughput looks like a step function, because the number of iterations of the main loop increases stepwise as the ratio between the numerator and the denominator grows.

We compared the reciprocal throughput of 256-bit wide vectorized double-precision functions in Vector-libm, SLEEF and SVML, and the scalar functions in FDLIBM. We generated random arguments that were uniformly distributed within the indicated intervals for each function. In order to check the execution speed of the fast paths in the trigonometric functions, we also measured the reciprocal throughput with arguments within [0, 6.28]. The results are shown in Table 2.

TABLE 2: Reciprocal throughput in nanoseconds

Func, error bound, domain        Vector-libm   FDLIBM   SLEEF   SVML
sin, 1 ulp, [0, 6.28]                 —         4.927   11.43   13.68
sin, 4 ulps, [0, 6.28]              9.601         —     7.504   6.679
sin, 1 ulp, [0, 1e+6]                 —         18.96   11.41   13.86
sin, 4 ulps, [0, 1e+6]              12.48         —     7.507   6.723
sin, 1 ulp, [0, 1e+100]               —         162.3   48.79   41.72
sin, 4 ulps, [0, 1e+100]            288.6         —     43.82   34.96
cos, 1 ulp, [0, 6.28]                 —         11.42   13.75   12.99
cos, 4 ulps, [0, 6.28]              9.557         —     7.850   7.917
cos, 1 ulp, [0, 1e+6]                 —         18.45   13.74   13.18
cos, 4 ulps, [0, 1e+6]              13.97         —     7.850   7.838
cos, 1 ulp, [0, 1e+100]               —         162.1   50.38   38.38
cos, 4 ulps, [0, 1e+100]            360.3         —     46.09   36.57
tan, 1 ulp, [0, 6.28]                 —         7.819   17.30   15.71
tan, 4+ ulps, [0, 6.28]             15.58         —     9.367   7.570
tan, 1 ulp, [0, 1e+6]                 —         22.24   17.28   15.78
tan, 4+ ulps, [0, 1e+6]             20.16         —     9.367   7.595
tan, 1 ulp, [0, 1e+100]               —         177.0   48.82   43.31
tan, 4+ ulps, [0, 1e+100]           399.4         —     36.54   40.50
asin, 1 ulp, [−1, 1]                  —         14.87   12.99   12.10
asin, 4+ ulps, [−1, 1]              20.75         —     5.552   9.627
acos, 1 ulp, [−1, 1]                  —         12.07   16.09   12.11
acos, 4+ ulps, [−1, 1]              23.62         —     7.572   10.23
atan, 1 ulp, [−1, 1]                  —         10.16   22.12   19.97
atan, 4+ ulps, [−1, 1]              35.54         —     9.251   12.09
log, 1 ulp, [0, 1e+300]               —         31.66   15.46   12.05
log, 4 ulps, [0, 1e+300]            39.64         —     9.636   8.842
exp, 1 ulp, [−700, 700]               —         12.19   7.663   7.968
exp, 4 ulps, [−700, 700]            17.35         —     6.756     —
pow, 1 ulp, [−30, 30]                 —         69.40   55.53   75.18

The reciprocal throughput of the functions in SLEEF is comparable to that of SVML in all cases. This is because the latency of FP operations is generally dominant in the execution time of math functions. Because there are two levels of scheduling mechanisms, namely the optimizer in a compiler and the out-of-order execution hardware, there is little room for making a difference in throughput or latency.

The execution speed of FDLIBM is not very slow despite its many conditional branches. This seems to be because of a smaller number of FP operations, and the faster execution speed of scalar instructions compared to the equivalent SIMD instructions.

Vector-libm is slow even when only the vectorized path is used. This seems to be because Vector-libm evaluates polynomials with a large number of terms. Auto-vectorizers are still developing, and the compiled binary code might not be well optimized. When a slow path has to be used, Vector-libm is even slower, since a scalar evaluation has to be carried out for each of the elements in the vector.

Vector-libm uses Horner's method to evaluate polynomials, which involves a long latency chain of dependent FP operations. In FDLIBM, this latency is reduced by splitting polynomials into even and odd terms, which can be evaluated in parallel. SLEEF uses Estrin's scheme. In our experiments, there was only a small difference between Estrin's scheme and splitting polynomials into even and odd terms with respect to execution speed.
CONCLUSION
In this paper, we showed that our SLEEF library achieves performance comparable to commercial libraries while maintaining good portability. We have been continuously developing SLEEF since 2010 [52]. We distribute SLEEF under the Boost Software License [53], which is a permissive open source license. We actively communicate with developers of compilers and members of other projects in order to understand the needs of real-world users. The Vector Function ABI is important in developing vectorizing compilers. The functions that return bit-identical results were added to our library to reflect requests from our multiple partners. We thoroughly tested these functionalities, and SLEEF is already adopted in multiple commercial products.

APPENDIX A
ANNOTATED SOURCE CODE OF tan
Figs. 5 and 6 show a C source code of our implementation of the tangent function with a 1-ULP error bound. We omitted the second Cody-Waite reduction for the sake of simplicity. The definitions of the DD operators are provided in Table 3. In the implementation of our library, we wrote all the operators with VEAL, as described in Sec. 3. The only conditional branch in this source code is the if statement at line 6. We implemented the other if statements at lines 14, 26, 57, and 62 with conditional move operators. The for loop at line 16 is unrolled. We implemented the round functions at lines 8, 19 and 20 with a single instruction for most of the vector extensions. The macro ESTRIN at line 35 evaluates a polynomial in double precision with Estrin's scheme.

In the loop from line 16 to 23 in the Payne-Hanek reduction, Eq. (1) is computed. The result is multiplied by π/2 at line 24. The numbers of FP operators shown in Table 1 are the numbers of operators from line 12 to line 26. The path with the Payne-Hanek reduction is also taken if the argument is non-finite. In this case, the variable x is set to NaN at line 26, and this NaN propagates to the final result.

APPENDIX B
ANNOTATED SOURCE CODE OF THE FP REMAINDER
Fig. 7 shows a sample C source code of our FP remainder. In this implementation, both the dividend and the divisor are DP numbers. It correctly handles signs and denormal numbers. It executes a division instruction only once throughout the computation of the remainder. The implementation also supports the case where the division operator does not give a correctly rounded result. In this case, the nextafter function must be applied multiple times at line 12, according to the maximum error. We implemented the if statements at lines 3, 21 and 23 with conditional move operators. In the main loop, the algorithm finds the remainder of n/d, and at line 22, the correct sign is assigned to the resulting remainder.
5. https://sleef.org/
TABLE 3: DD Functions

Function name            Output
dd(x, y)                 DD number x + y
ddadd2_d2_d2_d2(x, y)    Sum of DD numbers x and y
ddadd_d2_d2_d2(x, y)     Addition of two DD numbers x and y, where |x| ≥ |y|
ddadd_d2_d_d(x, y)       Addition of two DP numbers x and y, where |x| ≥ |y|
ddadd_d2_d_d2(x, y)      Addition of DP number x and DD number y, where |x| ≥ |y|
ddmul_d2_d2_d2(x, y)     Product of DD numbers x and y
ddmul_d2_d2_d(x, y)      Product of DD number x and DP number y
ddmul_d2_d_d(x, y)       Product of DP numbers x and y
dddiv_d2_d2_d2(x, y)     Returns x/y, where x and y are DD numbers
ddsqu_d2_d2(x)           Returns x², where x is a DD number
ddscale_d2_d2_d(x, y)    Product of DD number x and DP number y, where y = 2^N
ddnormalize_d2_d2(x)     Re-normalized DD number x

The stopping condition r.x ≥ d of the for loop at line 11 is not strictly required, and in a vectorized implementation this loop can be terminated once all the values in the vectors satisfy the stopping condition. Since we assume that all operators return a round-to-nearest FP number, we use the nextafter function at lines 9 and 12 to find a value close to r/d but not exceeding it. This method is applicable only if x/y is smaller than DBL_MAX. In order to find the correct q when r/d is between 1 and 3, we use a few comparisons to detect this case at lines 13 and 14.

APPENDIX C
CORRECTNESS OF FP REMAINDER
It is obvious that Algorithm 1 returns a correct result. We show a partial proof that r_k is representable as a radix-2 FP number of precision p if n, d and q_k are radix-2 FP numbers of precision p.

We first define a property FP_p and a function RD_p. FP_p(x) holds iff there are integers m and e, with 0 ≤ |m| < 2^p, that satisfy x = m · 2^e. If x and y are finite DP numbers, FP_53(x) and FP_106(xy) hold. RD_p(x) denotes the maximum number y that does not exceed x such that FP_p(y) holds.

We now show that FP_p(r_k) and FP_2p(q_k d) hold if FP_p(n) and FP_p(d) hold. By Lemma 1, an integer q_k exists that satisfies FP_p(q_k). Then, FP_p(r_0) and FP_2p(q_k d) hold. FP_p(r_k − q_k d) holds because r_k / 2 ≤ q_k d ≤ r_k. Then, FP_p(r_{k+1}) holds.

Lemma 1.
Given s ≥ 1 and an integer p ≥ 2, there exists an integer q that satisfies FP_p(q) and s/2 ≤ q ≤ s.

Proof: If s < 2^p, FP_p(⌊s⌋) holds. In this case, q can be ⌊s⌋, since ⌊s⌋ ≥ s/2 when s ≥ 1. Otherwise, RD_p(s) is an integer, (s − RD_p(s))/s < 2^{1−p}, and therefore (1 − 2^{1−p}) s < RD_p(s). Because p ≥ 2, s/2 < RD_p(s), and therefore q can be RD_p(s).

We now discuss the number of iterations. Suppose q_k is set to min(RD_p(r_k/d), ⌊r_k/d⌋). If r_k/d < 2^p, then q_k = ⌊r_k/d⌋ and the loop terminates after the k-th iteration. Otherwise, q_k = RD_p(r_k/d) < ⌊r_k/d⌋, and (r_k/d − q_k)/(r_k/d) < 2^{1−p}. Then, r_{k+1} = r_k − q_k d < 2^{1−p} r_k. Therefore, the number of iterations is at most ⌈(log₂(n/d)) / (p − 1)⌉.

In order to show that the number that substitutes q at line 12 in Fig. 7 satisfies the condition at line 3 in Algorithm 1, we prove that (r/d)/2 < ⌊RN(nextafter(RN(r), 0) · nextafter(RN(1.0/d), 0))⌋, assuming p = 53, r/d ≥ 3 and d > 0. RN(x) is the FP number that is the closest to x. We assume no overflow or underflow.

It is obvious that q is substituted with a number that is smaller than or equal to r/d. (r − nextafter(RN(r), 0))/r < 2^{2−p}, where p is the precision, and therefore (1 − 2^{2−p}) · r < nextafter(RN(r), 0). Similarly, (1 − 2^{2−p}) · (1.0/d) < nextafter(RN(1.0/d), 0). Therefore, (1 − 2^{2−p})² (1 − 2^{−p}) r/d < RN(nextafter(RN(r), 0) · nextafter(RN(1.0/d), 0)). Let u = (1 − 2^{2−p})² (1 − 2^{−p}). Then ur/d − 1 ≤ ⌊ur/d⌋, and (r/d)/2 < ur/d − 1 ⟺ 1/(u − 1/2) < r/d. The latter inequality holds, since p = 53 and r/d ≥ 3. Therefore, (r/d)/2 < ⌊RN(nextafter(RN(r), 0) · nextafter(RN(1.0/d), 0))⌋.
ACKNOWLEDGMENTS

We wish to acknowledge Will Lovett and Srinath Vadlamani for their valuable input and suggestions. The authors are particularly grateful to Robert Werner, who reviewed the final manuscript. The authors would like to thank Prof. Leigh McDowell for his suggestions. The authors would also like to thank all the developers who contributed to the SLEEF project, in particular Diana Bite and Alexandre Mutel.

REFERENCES

[1] G. Conte, S. Tommesani, and F. Zanichelli, “The long and winding road to high-performance image processing with MMX/SSE,” in
Proceedings of the Fifth IEEE International Workshop on Computer Architectures for Machine Perception, Sep. 2000, pp. 302–310.
[2] D. Naishlos, “Autovectorization in GCC,” in
Proceedings of the 2004 GCC Developers Summit, 2004.
[6] ISO, ISO/IEC 9899:2011 Information technology — Programming languages — C.
ACM Trans. Math. Softw., vol. 17, no. 1, pp. 26–45, March 1991.
[9] C. Daramy-Loirat, D. Defour, F. de Dinechin, M. Gallet, N. Gast, C. Lauter, and J.-M. Muller, “CR-LIBM: a correctly rounded elementary function library,” in
Proc. SPIE 5205, Advanced SignalProcessing Algorithms, Architectures, and Implementations XIII , vol.5205, December 2003, pp. 458–464.[10] Advanced Micro Devices, Inc. (2013) AMDcore math library. [Online]. Available:http://developer.amd.com/tools-and-sdks/archive/acml-product-features/[11] GNU Project. (2015) Libmvec in glibc. [Online]. Available:https://sourceware.org/glibc/wiki/libmvec[12] C. K. Anand and W. Kahl, “An optimized cell be special functionlibrary generated by coconut,”
IEEE Transactions on Computers ,vol. 58, no. 8, pp. 1126–1138, Aug 2009.[13] M. Dukhan et al. (2013) Yeppp! library. [Online]. Available:https://bitbucket.org/MDukhan/yeppp [14] M. Dukhan and R. Vuduc, “Methods for high-throughput compu-tation of elementary functions,” in
Parallel Processing and AppliedMathematics: 10th International Conference , May 2014, pp. 86–95.[15] D. Piparo, V. Innocente, and T. Hauth, “Speeding up HEP ex-periment software with a library of fast and auto-vectorisablemathematical functions,”
Journal of Physics: Conference Series , vol.513, no. 5, p. 052027, 2014.[16] S. Manilov, B. Franke, A. Magrath, and C. Andrieu, “Freerider: A source-level transformation tool for retargeting platform-specific intrinsic functions,”
ACM Trans. Embed. Comput. Syst. ,vol. 16, no. 2, pp. 38:1–38:24, Dec. 2016. [Online]. Available:http://doi.acm.org/10.1145/2990194[17] M. Gross, “Neat SIMD: Elegant vectorization in C++ by usingspecialized templates,” in , July 2016, pp. 848–857.[18] N. Clark, A. Hormati, S. Yehia, S. Mahlke, and K. Flautner, “LiquidSIMD: Abstracting SIMD Hardware using Lightweight DynamicMapping,” in , Feb 2007, pp. 216–227.[19] R. Leißa, S. Hack, and I. Wald, “Extending a c-like language forportable SIMD programming,”
SIGPLAN Not., vol. 47, no. 8, pp. 65–74, Feb. 2012.
[20] Arm Limited, ARM NEON Intrinsics Reference, September 2014, no. IHI0073A. [Online]. Available: http://infocenter.arm.com/help/topic/com.arm.doc.ihi0073a/IHI0073A_arm_neon_intrinsics_ref.pdf
[21] J.-M. Muller, Elementary Functions: Algorithms and Implementation. Birkhauser Boston, Inc., 1997.
[22] Arm Limited. (2017) ARM C language extensions for SVE documentation. [Online]. Available: https://developer.arm.com/docs/100987/0000
[23] L. Liu, S. Peng, C. Zhang, R. Li, B. Wang, C. Sun, Q. Liu, L. Dong, L. Li, S. Yanyan, Y. He, W. Zhao, and G. Yang, “Importance of bitwise identical reproducibility in earth system modeling and status report,” Geoscientific Model Development Discussions, vol. 8, pp. 4375–4400, June 2015.
[24] J.-M. Muller, N. Brisebarre, F. de Dinechin, C.-P. Jeannerod, V. Lefèvre, G. Melquiond, N. Revol, D. Stehlé, and S. Torres, Handbook of Floating-Point Arithmetic, 1st ed. Birkhauser Boston, Inc., 2009.
[25] O. Krzikalla, K. Feldhoff, R. Müller-Pfefferkorn, and W. E. Nagel, “Auto-Vectorization Techniques for Modern SIMD Architectures,” in Proc. of the 16th Workshop on Compilers for Parallel Computing, January 2012.
[26] D. Nuzman, S. Dyshel, E. Rohou, I. Rosen, K. Williams, D. Yuste, A. Cohen, and A. Zaks, “Vapor SIMD: Auto-vectorize once, run everywhere,” in International Symposium on Code Generation and Optimization, ser. CGO 2011, April 2011, pp. 151–160.
[27] R. M. Russell, “The CRAY-1 computer system,” Commun. ACM, vol. 21, no. 1, pp. 63–72, Jan. 1978.
[28] D. Nuzman, I. Rosen, and A. Zaks, “Auto-vectorization of interleaved data for SIMD,” in Proceedings of the 27th ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI ’06, 2006, pp. 132–143.
[29] X. Tian, A. Bik, M. Girkar, P. Grey, H. Saito, and E. Su, “Intel OpenMP C++/Fortran Compiler for Hyper-Threading Technology: Implementation and Performance,” in
Intel Technology Journal
Proceedings of the 2004 International Symposium on Code Generation and Optimization (CGO’04)
Scaling OpenMP for Exascale Performance and Portability, August 2017, pp. 62–74.
[37] X. Tian, H. Saito, S. Kozhukhov, K. B. Smith, R. Geva, M. Girkar, and S. V. Preis. (2015, November) Vector function application binary interface. Intel Corporation. [Online]. Available: https://software.intel.com/en-us/articles/vector-simd-function-abi
[38] A. Sodani, R. Gramunt, J. Corbal, H. S. Kim, K. Vinod, S. Chinthamani, S. Hutsell, R. Agarwal, and Y. C. Liu, “Knights Landing: Second-Generation Intel Xeon Phi Product,” IEEE Micro, vol. 36, no. 2, pp. 34–46, March 2016.
[39] S. Eyerman, J. E. Smith, and L. Eeckhout, “Characterizing the branch misprediction penalty,” in , March 2006, pp. 48–58.
[40] Intel Corporation, Intel 64 and IA-32 Architectures Optimization Reference Manual
Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ’19. New York, NY, USA: ACM, 2019, pp. 673–686.
[43] Google’s Compiler Research Team and the LLVM compiler infrastructure project. (2018) llvm-exegesis: Automatic measurement of instruction latency/uops. [Online]. Available: https://github.com/google/EXEgesis
[44] T. J. Dekker, “A floating-point technique for extending the available precision,” Numer. Math., vol. 18, no. 3, pp. 224–242, June 1971.
[45] J. Richard Shewchuk, “Adaptive precision floating-point arithmetic and fast robust geometric predicates,” Discrete & Computational Geometry, vol. 18, no. 3, pp. 305–363, Oct 1997.
[46] W. Cody and W. Waite. (1980) Software manual for the elementary functions. Englewood Cliffs, N.J.
[47] W. J. Cody, “Implementation and testing of function software,” in Problems and Methodologies in Mathematical Software Production, November 1980, pp. 24–47.
[48] M. H. Payne and R. N. Hanek, “Radian reduction for trigonometric functions,” SIGNUM Newsl., vol. 18, no. 1, pp. 19–24, January 1983.
[49] B. W. Char, K. O. Geddes, W. M. Gentleman, and G. H. Gonnet, “The design of Maple: A compact, portable, and powerful computer algebra system,” in Computer Algebra, J. A. van Hulzen, Ed. Berlin, Heidelberg: Springer Berlin Heidelberg, 1983, pp. 101–115.
[50] S. Chevillard, M. Joldeş, and C. Lauter, “Sollya: An environment for the development of numerical codes,” in Mathematical Software - ICMS 2010, ser. Lecture Notes in Computer Science, K. Fukuda, J. van der Hoeven, M. Joswig, and N. Takayama, Eds., vol. 6327. Heidelberg, Germany: Springer, September 2010, pp. 28–31.
[51] G. Estrin, “Organization of computer systems: The fixed plus variable structure computer,” in Papers Presented at the May 3-5, 1960, Western Joint IRE-AIEE-ACM Computer Conference, ser. IRE-AIEE-ACM ’60 (Western). New York, NY, USA: ACM, 1960, pp. 33–40. [Online]. Available: http://doi.acm.org/10.1145/1460361.1460365
[52] N. Shibata, “Efficient evaluation methods of elementary functions suitable for SIMD computation,” Computer Science - Research and Development
Naoki Shibata is an associate professor at Nara Institute of Science and Technology. He received the Ph.D. degree in computer science from Osaka University, Japan, in 2001. He was an assistant professor at Nara Institute of Science and Technology from 2001 to 2003 and an associate professor at Shiga University from 2004 to 2012. His research areas include distributed and parallel systems, and intelligent transportation systems. He is a member of IPSJ, ACM and IEEE.
Francesco Petrogalli is a software engineer working on the development of Arm Compiler for HPC. He contributed to the implementation of the Vector Length Agnostic (VLA) vectorizer for the Scalable Vector Extension (SVE) of Arm, to the ABI specifications for the vector functions on AArch64, and to the open source library SLEEF. Francesco has also worked on optimizing a variety of computational kernels that are core to machine learning algorithms.
double xtan(double d) {
  double u;
  double2 x = dd(0, 0), y;
  int q = 0;

  if (fabs(d) < 15) { // Cody-Waite
    q = round(d * M_2_PI);
    u = fma(q, -0x1.921fb54442d18p0, d);
    x = ddadd_d2_d_d(u, q * -0x1.1a62633145c07p-54);
  } else { // Payne-Hanek
    int ex = ilogb(d), M = ex > 700 ? -130 : -2;
    if (ex < 0) ex = 0;
    u = ldexp(d, M);
    for (int i = 0; i < 4; i++) {
      y = ddmul_d2_d_d(u, tab[ex*4+i]);
      x = ddadd2_d2_d2_d2(x, y);
      double r = round(4*x.x);
      q += (int32_t)(r - 4*round(x.x));
      x.x -= r * 0.25;
      x = ddnormalize_d2_d2(x);
    }
    x = ddmul_d2_d2_d2(x, dd(0x1.921fb54442d18p2, 0x1.1a62633145c07p-52));
    if (!isfinite(d)) x.x = NAN; // NaN handling
  }

  // Reduction with the double-angle formula
  y = ddscale_d2_d2_d(x, 0.5);

  // Polynomial evaluation with Estrin's scheme
  // Domain : |y| <= PI/8
  x = ddsqu_d2_d2(y);
  u = ESTRIN(x.x, +0.3245098826639276316e-3,
             +0.5619219738114323735e-3,
             +0.1460781502402784494e-2,
             +0.3591611540792499519e-2,
             +0.8863268409563113126e-2,
             +0.2186948728185535498e-1,
             +0.5396825399517272970e-1,
             +0.1333333333330500581e+0);

  // The last two terms are evaluated with Horner's method
  u = fma(u, x.x, +0.3333333333333343695e+0);

  // The last term is evaluated in DD precision
  x = ddadd_d2_d2_d2(y, ddmul_d2_d2_d(ddmul_d2_d2_d2(x, y), u));

  // Reconstruction with the double-angle formula
  y = ddadd_d2_d_d2(-1, ddsqu_d2_d2(x));
  x = ddscale_d2_d2_d(x, -2);

  // Reconstruction with tan(PI/2 - x) = cot(x)
  if (q & 1) {
    double2 t = x;
    x = y;
    y.x = -t.x;
    y.y = -t.y;
  }
  x = dddiv_d2_d2_d2(x, y);

  if (d == 0) return d; // Negative-zero handling
  return x.x + x.y;
}

Fig. 5: C source code of tan