Custom-Precision Mathematical Library Explorations for Code Profiling and Optimization
David Defour§, Pablo de Oliveira Castro∗†, Matei Iştoan∗†, and Eric Petit†‡
∗University of Versailles – Li-PaRAD, Email: {pablo.oliveira, matei.istoan}@uvsq.fr
†Exascale Computing Research, ECR
‡Intel Corporation, Email: [email protected]
§University of Perpignan, Email: [email protected]
Abstract—The typical processors used for scientific computing have fixed-width data-paths. This implies that mathematical libraries were specifically developed to target each of these fixed precisions (binary16, binary32, binary64). However, to address the increasing energy consumption and throughput requirements of scientific applications, library and hardware designers are moving beyond this one-size-fits-all approach. In this article we propose to study the effects and benefits of using user-defined floating-point formats and target accuracies in calculations involving mathematical functions. Our tool collects input-data profiles and iteratively explores lower precisions for each call-site of a mathematical function in user applications. This profiling data will be a valuable asset for specializing and fine-tuning mathematical function implementations for a given application. We demonstrate the tool's capabilities on SGP4, a satellite tracking application. The profile data shows the potential for specialization and provides insight into where it is useful to provide variable-precision designs for elementary function evaluation.
Index Terms—HPC, libm, floating-point, custom-precision, optimization, specialization
I. INTRODUCTION
A growing interest in adapting floating-point formats to the real needs of applications is becoming ever more ubiquitous. This process has been successfully conducted by the AI community, which has settled on the BF16 [1] and fp16 [2] formats in order to increase performance and efficiency. Similar benefits have been achieved in other domains [3], [4] by reducing the precision of basic operations (+, −, ∗, /) and harnessing hardware support for multiple internal floating-point formats. Oberman et al. [5] demonstrate that neglecting optimizations of infrequent operations, such as division and square root, can severely impact performance. We believe that elementary functions should not be neglected either when optimizing for mixed precision. Even though elementary functions are not widely available in hardware and are infrequently used in applications, their impact can be important: the HPC Patmos neutronic solver [6] spends a substantial share of its execution time in the mathematical library (libm) functions. This confirms similar observations by CERN [7] on their HPC codes.

Various mathematical libraries specialize evaluation schemes for different accuracy/performance trade-offs [8]. Recent developments such as metalibm [9] automatically generate elementary functions to best fit the hardware and accuracy constraints. These works highlight that a trade-off can be explored between performance, accuracy and precision.

In this paper, we propose a tool for collecting input intervals and output required-precision profiles from real applications in order to guide the design of specialized mathematical libraries. Indeed, considering limited input data ranges and application-focused output accuracy could drastically influence the implementation performance. We demonstrate the tool's capabilities on SGP4, a satellite tracking application. The profile data shows the potential for specialization and provides insight into where it is useful to provide variable-precision designs for elementary function evaluation.

II. RELATED WORKS
Contrary to basic operations, the properties of elementary functions are not standardized, mainly because the correctly rounded property is difficult to achieve [10], [11]. As a consequence, there are numerous available implementations of such functions, either in software or hardware, each representing a different trade-off between accuracy, performance, hardware requirements and programming language. A notable mathematical library embedding different trade-offs is the Vector Mathematical Functions (VML) from Intel's MKL [8], which offers three accuracy modes: High Accuracy, Low Accuracy and Enhanced Performance. Another example is Nvidia's GPUs, which embed dedicated hardware for fast approximations of some functions alongside software implementations of more accurate, larger input-range versions [12]. This has led to the OpenCL 2.2 standard, which defines the accuracy requirements of mathematical functions from half to double precision [13].

Developing and maintaining multiple implementations for each function is a daunting endeavor. Several tools have been proposed to automate this task, either for hardware or software implementations of such functions [9], [14].

Porting the concept of memoization to mathematical functions has been explored in [15], [16], where the authors investigated how real input-data profiles can be used to optimize the evaluation. However, they did not evaluate the potential decrease in accuracy.

III. SIMULATING VARIABLE PRECISION AND RANGE FOR MATHEMATICAL FUNCTIONS
Our approach to simulate mathematical functions with reduced range/precision is twofold: first we transparently interpose calls to mathematical functions, then the VPREC-libm library computes the result in a reduced precision.
A. Library Call Interposition
In Linux, the dynamic loader offers the possibility to intercept dynamic library calls, so that a custom library is called instead. This is achieved by setting the LD_PRELOAD environment variable. This interception method works out-of-the-box with a compiled binary and is transparent to the user. We use it to replace standard calls to the libm with custom calls to our own VPREC-libm library, which simulates non-standard precisions and ranges. This approach is flexible, but has two limitations (a sketch of the interposition mechanism follows the list):
• it is not applicable to statically linked programs: for those, the user must manually re-link the program against the VPREC-libm;
• it only intercepts library calls; to ensure that we intercept all operations we disable compiler optimizations which replace calls by hardware intrinsics (such as the sqrtsd assembly instruction on x86-64). Fortunately, the -fno-builtin flag disables these optimizations in most standard compilers.
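As an illustration of the mechanism, the following minimal sketch (our own example, not the actual VPREC-libm source) interposes cos(). It can be built with gcc -shared -fPIC interpose.c -o libinterpose.so -ldl and activated with LD_PRELOAD=./libinterpose.so ./application.

/* interpose.c -- minimal sketch of libm call interposition. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

/* Our cos() shadows the libm symbol because the dynamic loader
 * resolves symbols in LD_PRELOAD-ed libraries first. */
double cos(double x)
{
    /* Look up the next definition of "cos" in link order,
     * i.e. the real libm implementation. */
    static double (*real_cos)(double) = NULL;
    if (!real_cos)
        real_cos = (double (*)(double))dlsym(RTLD_NEXT, "cos");

    double result = real_cos(x);
    /* A real interposer would record the call-site and operand
     * range here, or delegate to a reduced-precision kernel. */
    fprintf(stderr, "cos(%g) = %g\n", x, result);
    return result;
}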
Fig. 1: VPREC-libm optimization process overview. A first profiling run executes the binary under LD_PRELOAD with VPREC-libm to produce a profile; a dichotomic exploration then repeatedly executes it with candidate configurations, yielding the optimized precisions.
B. Implementation of the VPREC-libm
The interposed mathematical call is handled by VPREC-libm, which returns the result in the target floating-point format. For example, in double precision one can customize the bit length of the pseudo-exponent r ∈ [1, 11] and of the pseudo-mantissa p ∈ [0, 52].

VPREC-libm operates in two steps. First, it computes a binary128 result z̃ by calling the corresponding mathematical function from GCC's libquadmath. Then, z̃ is converted to the target format using Verificarlo-VPREC [3]. If z̃ is representable in the target range, a faithfully rounded result at target precision p is returned. If z̃ is outside the target range, VPREC returns ±∞ for overflows and ±0 for underflows. Rounding is achieved by adding a ulp at precision p + 1 followed by a truncation: ⌊z̃ + 2^(e_z̃ − p − 1)⌋_p, where e_z̃ is the exponent of z̃ and ⌊·⌋_p truncates the significand to p bits.
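The following sketch illustrates this rounding step on binary64 values (the actual VPREC-libm applies it to the binary128 result from libquadmath; the function name and structure here are ours):

#include <math.h>
#include <stdint.h>

/* Illustrative sketch: faithfully round z to a pseudo-mantissa of
 * p bits (0 <= p <= 52) by adding a ulp at precision p + 1, i.e.
 * 2^(e_z - p - 1), then truncating the low-order mantissa bits. */
double vprec_round(double z, int p)
{
    if (!isfinite(z) || z == 0.0)
        return z;

    int e;
    frexp(z, &e);            /* z = m * 2^e with 0.5 <= |m| < 1      */
    int e_z = e - 1;         /* IEEE exponent: significand in [1, 2) */

    /* add a ulp at precision p + 1, carrying the sign of z */
    double y = z + ldexp(copysign(1.0, z), e_z - p - 1);

    /* truncate the 52 - p trailing bits of the binary64 significand */
    union { double d; uint64_t u; } v = { .d = y };
    v.u &= ~((UINT64_C(1) << (52 - p)) - 1);
    return v.d;
}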
C. Exploring Precision Requirements Using VPREC-libm

VPREC-libm can be used in two modes: profiling and execution. In profiling mode, VPREC-libm creates a profile of the executed code. For each call, it updates the boundaries of the operand and output intervals, and the number of occurrences for the unique program address and stack trace from which the call is made (a sketch of such a record is given below). This information is aggregated and processed to produce execution statistics.
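A per-call-site profile record can be pictured as follows (field and function names are illustrative, not those of the actual implementation):

#include <stdint.h>

/* Illustrative per-call-site profile record, keyed here by the
 * program address the call is made from. */
typedef struct {
    void    *call_site;         /* program address of the call     */
    uint64_t count;             /* number of occurrences           */
    double   in_min, in_max;    /* operand interval boundaries     */
    double   out_min, out_max;  /* output interval boundaries      */
} profile_entry;

/* Update the record for one intercepted call. */
static void profile_update(profile_entry *e, double in, double out)
{
    if (e->count == 0) {
        e->in_min = e->in_max = in;
        e->out_min = e->out_max = out;
    } else {
        if (in  < e->in_min)  e->in_min  = in;
        if (in  > e->in_max)  e->in_max  = in;
        if (out < e->out_min) e->out_min = out;
        if (out > e->out_max) e->out_max = out;
    }
    e->count++;
}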
In execution mode, VPREC-libm accepts a configuration file which specifies the floating-point formats to use at each call-site. An initial configuration file is generated automatically after a profiling run, containing information for all encountered call-sites, which the user can later modify as required.

Taking advantage of this functionality, we developed a method for exploring the optimization potential of a given floating-point code. On a broad scale, we perform a dichotomic search for the minimal output precision of the VPREC-libm functions which meets the user-specified correctness criteria (a sketch follows). This search is applied sequentially per call-site and converges in logarithmic time in the size of the mantissa. As we are dealing with a vast search-space, the order in which this optimization is performed is quite important. According to Amdahl's law, optimizing the code in which most of the time is spent results in the biggest performance gains. Therefore, as a heuristic, we prioritize the call-site exploration by call frequency. Figure 1 summarizes this process.
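The per-call-site search itself can be sketched as a standard binary search over the mantissa size; check stands for a hypothetical driver that re-runs the application with the call-site configured at the candidate precision and evaluates the user's correctness criterion:

#include <stdbool.h>

/* Illustrative sketch of the dichotomic exploration for one
 * call-site. check(site, p) re-executes the program with the
 * call-site set to a p-bit output precision and returns true if
 * the results still meet the correctness criteria. Assuming check
 * is monotone in p, the loop finds the minimal acceptable
 * precision in O(log p_max) runs. */
int minimal_precision(bool (*check)(int site, int p),
                      int site, int p_max)
{
    int lo = 1, hi = p_max;      /* p_max = 52 for binary64 outputs */
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (check(site, mid))
            hi = mid;            /* mid bits suffice: try lower */
        else
            lo = mid + 1;        /* too coarse: need more bits  */
    }
    return lo;
}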
IV. EXPERIMENTAL RESULTS
The example discussed throughout this section illustrates the potential of our proposed method, applied to a real-world astrophysics application used to predict the position and the velocity of Earth-orbiting objects (most notably satellites).
A. Satellite Tracker: SGP4
In order to track the position and the velocity of satellites, a common technique is to use simplified general perturbations (SGP) models, which predict the influence of drag, the Earth's shape, as well as that of the sun and the moon on the trajectory. Work on SGP started in the 1960s, but at the beginning of the 1980s NORAD (North American Aerospace Defense Command) released the equations and the source code to predict satellite positions [17]. NORAD maintains and periodically refines data sets on all resident space objects, ensuring the accuracy of trajectory predictions. The data-sets were made available to users through NASA, being the only source of readily available orbital data.

However, a user could not just go ahead and use any prediction model she wished; she had to use the same model employed to generate the data-set, even if the user's choice might have had better performance, an aspect highlighted in [17]. This made SGP, especially its SGP4/SDP4 variants, commonplace among users wishing to use NORAD's orbital data, distributed as two-line element (TLE) sets. Near-Earth objects (period less than 225 minutes) are tracked using SGP4, while deep-space objects (period over 225 minutes) are tracked using SDP4, which also models the gravitational effects of the moon and the sun and certain Earth harmonics.

In the period following the original release of SGP, a multitude of code variations came to exist, making interoperability and compatibility an issue. For experimental purposes, we use the SGP4 version documented in [18], which is a community effort to keep a version up to date with the models and TLE data-sets used by NORAD. The mathematical models used throughout are those presented in [19]. The C++ version used for the experiments is the one provided by CelesTrak, available online at https://celestrak.com/publications/AIAA/2006-6753/. Only minimal changes were made, in order to ensure that the program runs correctly on Linux. This has not affected the program outputs, as verified against the provided test-suites.

B. Results
We applied our method to the data-set sgp4-all.tle discussed in [18]. Figure 2 illustrates our profiling and precision exploration results for the libm call-sites with the most calls. At the top we show the number of occurrences for the different libm call-sites. In the middle graph we show, for each call-site, the dynamic range of the input data as used in the original code, extracted from the profile produced by VPREC-libm. Finally, in the bottom graph we show, for each call-site, the required precision of the outputs determined by our exploration with the VPREC-libm method.

We can observe that the second and third call-sites are almost twice as frequent as the next ten entries. The output precision cannot be significantly decreased for the two first call-sites, as shown in the bottom sub-figure. Indeed, they occur at the very end of the algorithm, directly influencing the outputs. On the other hand, the dynamic range of the input is quite reduced at these two call-sites. Analyzing the code, we notice that these calls to sin() and cos() take the same argument, as they are part of a rotation. Therefore, they could be replaced with a call to sincos(), effectively reducing their combined workload by almost a factor of two, similar to what is done in [20] (a sketch is given at the end of this section).

Analyzing the rest of the call-sites, we can observe two general trends. The first are call-sites where the required precision is close to the default one; here we only manage to save a few bits. The second are call-sites where the required precision hovers around a much lower mark, close to single precision. A plausible explanation for these results can be found through a bit of computer-science archaeology. SGP4 was first developed in Fortran IV, on a Honeywell 6000 series computer [17]. This machine had 36-bit words, so floating-point numbers had an 8-bit exponent and a 27- or 63-bit significand, for single or double precision, respectively. As noted in [18], the code originally used a mix of single- and double-precision computations. With the evolution of the underlying hardware, the code was moved to double precision throughout, which made for a smoother behavior but did not improve the accuracy. These observations are indeed coherent with our findings on optimizing the precision of the outputs of the mathematical functions.

Figure 3 shows the results of our method on a data-set containing just satellite number 5, the first satellite in the sgp4-all.tle data-set. This is a near-Earth satellite, which means that its trajectory is tracked using SGP4, not SDP4. Its period, perigee and eccentricity ensure that no corner case is triggered in the model. The only particularity of this example is that it uses the TEME (True Equator, Mean Equinox) coordinate system, which requires a conversion to a more standard coordinate system. Indeed, avoiding exceptional cases in the model shows that considerably lower precisions can be used throughout. The trends for the precision of the math functions' outputs remain mostly the same. We notice that over a third of the functions could be evaluated in single precision, requiring at most 24 bits of precision at the output.

It should be noted that the x-axis indices in Fig. 2 and 3 do not necessarily match, as the execution paths can differ due to the different nature of the two data-sets.
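As a sketch of the rotation rewrite suggested above (variable and function names are ours, not from the SGP4 source):

#define _GNU_SOURCE      /* sincos() is a GNU extension in glibc */
#include <math.h>

/* One sincos() call replaces the separate sin()/cos() calls that
 * share the same argument, so the argument reduction is performed
 * only once for both results. */
static void rotate2d(double angle, double x, double y,
                     double *xr, double *yr)
{
    double s, c;
    sincos(angle, &s, &c);
    *xr = c * x - s * y;
    *yr = s * x + c * y;
}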
V. CONCLUSION AND FUTURE WORK

In this paper, we focus on providing a software tool and methodology to profile the mathematical library usage in a full-scale application. The objective is to measure the potential and drive future ad-hoc optimizations of the math library.

Usually, elementary functions are implemented following a four-step scheme [21]: special-case handling, argument reduction, reduced domain splitting, and interpolation (e.g., polynomial or iterative). When limiting the input domain, the first two steps can be optimized; furthermore, reducing the required accuracy and input domain may lower the interpolation complexity (see the sketch below). It can also diminish the implementation cost in special-purpose architecture designs or re-programmable architectures (FPGA).

Rewriting ad-hoc custom elementary functions with a target accuracy on a given input interval is a costly and error-prone task. We propose to explore existing tools to assist or automate these optimizations and measure empirical speedups on real use cases. An easy first step would be to automatically select the best-fitting implementation among existing libraries, such as Intel MKL VML [8]. Finally, approaches such as metalibm [9] for math function code generation could be leveraged to produce specialized libraries.

To conclude, the work presented in this paper shows promising results on co-designing mathematical libraries from application profiles. One weakness of the approach is that the profile is data-input dependent; further experiments on a larger set of use-cases will be done to demonstrate the generalization of the approach and how one can deal with the speculative aspect of profile-guided optimization for math libraries.
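As a final illustration of this potential, here is a minimal sketch of a domain-specialized kernel, under the assumption (to be validated by a profile) that a call-site only receives |x| ≤ π/4 and needs roughly 20 bits of output accuracy. Plain Taylor coefficients are used; a minimax fit generated by a tool such as metalibm [9] would be tighter.

/* sin() specialized to |x| <= pi/4: special-case handling and
 * argument reduction are skipped, and a short odd polynomial
 * suffices (absolute error around 2^-21 on this interval). */
static double sin_reduced(double x)
{
    const double c3 = -1.0 / 6.0;      /* Taylor coefficients */
    const double c5 =  1.0 / 120.0;
    const double c7 = -1.0 / 5040.0;
    double x2 = x * x;
    return x + x * x2 * (c3 + x2 * (c5 + x2 * c7));
}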
REFERENCES
[1] Intel, "White paper: BFloat16 hardware numerics definition," Tech. Rep., November 2018. [Online]. Available: https://software.intel.com/sites/default/files/managed/40/8b/bf16-hardware-numerics-definition-white-paper.pdf
Fig. 2: Analysis of calls to libm for the all-satellites data-set (top: number of calls; middle: input interval size, log scale; bottom: VPREC-libm output precision in bits, original vs. optimized).
Fig. 3: Analysis of calls to libm for the satellite 5 data-set (same layout as Fig. 2).

[2] NVIDIA, "White paper: Training with mixed precision," Tech. Rep., November 2018. [Online]. Available: https://docs.nvidia.com/deeplearning/sdk/pdf/Training-Mixed-Precision-User-Guide.pdf
[3] Y. Chatelain, E. Petit, P. de Oliveira Castro, G. Lartigue, and D. Defour, "Automatic exploration of reduced floating-point representations in iterative methods," in Euro-Par 2019 Parallel Processing - 25th International Conference, ser. Lecture Notes in Computer Science. Springer, 2019.
[4] A. Haidar, A. Abdelfattah, M. Zounon, P. Wu, S. Pranesh, S. Tomov, and J. Dongarra, "The design of fast and energy-efficient linear solvers: On the potential of half-precision arithmetic and iterative refinement techniques," in International Conference on Computational Science. Springer, 2018, pp. 586–600.
[5] S. F. Oberman and M. J. Flynn, "Design issues in division and other floating-point operations," IEEE Transactions on Computers, vol. 46, no. 2, pp. 154–161, Feb 1997.
[6] E. Brun, S. Chauveau, and F. Malvagi, "Patmos: A prototype Monte Carlo transport code to test high performance architectures," in Proceedings of International Conference on Mathematics & Computational Methods Applied to Nuclear Science & Engineering, Jeju, Korea, 2017.
[7] D. Piparo, V. Innocente, and T. Hauth, "Speeding up HEP experiment software with a library of fast and auto-vectorisable mathematical functions," Journal of Physics: Conference Series, vol. 513, no. 5, 2014.
[8] "Intel® Math Kernel Library, developer reference," 2020. [Online]. Available: https://software.intel.com/en-us/mkl-developer-reference-c
[9] N. Brunie, F. de Dinechin, O. Kupriianova, and C. Lauter, "Code generators for mathematical functions," in 2015 IEEE 22nd Symposium on Computer Arithmetic (ARITH), June 2015, pp. 66–73.
[10] C. M. Black, R. P. Burton, and T. H. Miller, "The need for an industry standard of accuracy for elementary-function programs," ACM Trans. Math. Softw., vol. 10, no. 4, pp. 361–366, Dec. 1984.
[11] D. Defour, G. Hanrot, V. Lefèvre, J.-M. Muller, N. Revol, and P. Zimmermann, "Proposal for a standardization of mathematical function implementation in floating-point arithmetic," Numerical Algorithms, vol. 37, no. 1, pp. 367–375, 2004.
[12] S. Collange, M. Daumas, and D. Defour, "Graphic processors to speed-up simulations for the design of high performance solar receptors," Dec 2010, pp. 110–117.
[15] S. Collange, M. Daumas, D. Defour, and R. Olivès, "Fonctions élémentaires sur GPU exploitant la localité de valeurs" [Elementary functions on GPUs exploiting value locality], in SYMPosium en Architectures nouvelles de machines, 2008, p. 12.
[16] A. Suresh, "Intercepting functions for memoization," Thesis, Université Rennes 1, May 2016.
[17] F. R. Hoots and R. L. Roehrich, "Spacetrack Report No. 3: Models for propagation of NORAD element sets," Aerospace Defense Center, Peterson Air Force Base, Tech. Rep., 1980.
[18] D. Vallado, P. Crawford, R. Hujsak, and T. Kelso, "Revisiting Spacetrack Report #3," AIAA 2006-6753, 2006. [Online]. Available: https://arc.aiaa.org/doi/abs/10.2514/6.2006-6753
[19] F. R. Hoots, P. W. Schumacher, and R. A. Glover, "History of Analytical Orbit Modeling in the U.S. Space Surveillance System," Journal of Guidance, Control, and Dynamics, vol. 27, no. 2, pp. 174–185, 2004.
[20] P. Markstein, "Accelerating sine and cosine evaluation with compiler assistance," in 16th IEEE Symposium on Computer Arithmetic (ARITH), June 2003, pp. 137–140.
[21] C. Lauter and M. Mezzarobba, "Semi-automatic floating-point implementation of special functions," in 2015 IEEE 22nd Symposium on Computer Arithmetic (ARITH), June 2015.