Custom-Precision Mathematical Library Explorations for Code Profiling and Optimization
David Defour§, Pablo de Oliveira Castro∗†, Matei Iştoan∗†, and Eric Petit†‡
∗University of Versailles – Li-PaRAD, Email: {pablo.oliveira, matei.istoan}@uvsq.fr
†Exascale Computing Research, ECR
‡Intel Corporation, Email: [email protected]
§University of Perpignan, Email: [email protected]
Abstract—The typical processors used for scientific computing have fixed-width data-paths. This implies that mathematical libraries were specifically developed to target each of these fixed precisions (binary16, binary32, binary64). However, to address the increasing energy consumption and throughput requirements of scientific applications, library and hardware designers are moving beyond this one-size-fits-all approach. In this article we propose to study the effects and benefits of using user-defined floating-point formats and target accuracies in calculations involving mathematical functions. Our tool collects input-data profiles and iteratively explores lower precisions for each call-site of a mathematical function in user applications. This profiling data will be a valuable asset for specializing and fine-tuning mathematical function implementations for a given application. We demonstrate the tool's capabilities on SGP4, a satellite tracking application. The profile data shows the potential for specialization and provides insight into where it is useful to provide variable-precision designs for elementary function evaluation.
Index Terms—HPC, libm, floating-point, custom-precision, optimization, specialization
I. INTRODUCTION
A growing interest in adapting floating-point formats to the real needs of applications is becoming ever more ubiquitous. This process has been successfully conducted by the AI community, which has settled on the BF16 [1] and fp16 [2] formats in order to increase performance and efficiency. Similar benefits have been achieved in other domains [3], [4] by reducing the precision of basic operations (+, −, ∗, /) and harnessing hardware support for multiple internal floating-point formats. Oberman et al. [5] demonstrate that neglecting optimizations of infrequent operations, such as division and square root, can severely impact performance. We believe that elementary functions should not be neglected either when optimizing for mixed precision. Even though elementary functions are not widely available in hardware and are infrequently used in applications, their impact can be important: the HPC Patmos neutronic solver [6] spends a substantial share of its execution time in the mathematical library (libm) functions. This confirms similar observations by CERN [7] on their HPC codes.

Various mathematical libraries specialize evaluation schemes for different accuracy/performance trade-offs [8]. Recent developments such as metalibm [9] automatically generate elementary functions to best fit the hardware and accuracy constraints. These works highlight that a trade-off can be explored between performance, accuracy and precision.

In this paper, we propose a tool for collecting input intervals and output required-precision profiles from real applications in order to guide the design of specialized mathematical libraries. Indeed, considering limited input data ranges and application-focused output accuracy could drastically influence the implementation performance. We demonstrate the tool's capabilities on SGP4, a satellite tracking application. The profile data shows the potential for specialization and provides insight into where it is useful to provide variable-precision designs for elementary function evaluation.

II. RELATED WORKS
Contrary to basic operations, the properties of elementary functions are not standardized, mainly because the correctly rounded property is difficult to achieve [10], [11]. As a consequence, there are numerous available implementations of such functions, either in software or hardware, each representing a different trade-off between accuracy, performance, hardware requirements and programming language. A notable mathematical library embedding different trade-offs is the Vector Mathematical Functions (VML) from Intel's MKL [8], which offers three accuracy modes: High Accuracy, Low Accuracy and Enhanced Performance. Another example is Nvidia's GPUs, which embed dedicated hardware for fast approximations of some functions alongside software implementations of more accurate, larger input-range versions [12]. This has led to the OpenCL 2.2 standard, which defines the accuracy requirements of mathematical functions from half to double precision [13].

Developing and maintaining multiple implementations for each function is a daunting endeavor. Several tools have been proposed to automate this task, either for hardware or software implementations of such functions [9], [14].

Porting the concept of memoization to mathematical functions has been explored in [15], [16], where the authors investigated how real input-data profiles can be used to optimize the evaluation. However, they did not evaluate the potential decrease in accuracy.

III. SIMULATING VARIABLE PRECISION AND RANGE FOR MATHEMATICAL FUNCTIONS
Our approach to simulate mathematical functions with reduced range/precision is twofold: first we transparently interpose calls to mathematical functions, then the VPREC-libm library computes the result in a reduced precision.
A. Library Call Interposition
In Linux, the dynamic loader offers the possibility to intercept dynamic library calls, so that a custom library is called instead. This is achieved by setting the LD_PRELOAD environment variable. This interception method works out-of-the-box with a compiled binary and is transparent to the user. We use it to replace standard calls to the libm with custom calls to our own VPREC-libm library, which simulates non-standard precisions and ranges. This approach is flexible, but has two limitations (a sketch of the interposition mechanism follows the list):
• it is not applicable to statically linked programs: for those, the user must manually re-link the program against the VPREC-libm;
• it only intercepts library calls; to ensure that we intercept all operations we disable compiler optimizations which replace calls by hardware intrinsics (such as the sqrtsd assembly instruction on x86-64). Fortunately, the -fno-builtin flag disables these optimizations in most standard compilers.
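As an illustration of the mechanism, the following minimal sketch (our own example, not the actual VPREC-libm source) interposes cos(). It can be built with gcc -shared -fPIC interpose.c -o libinterpose.so -ldl and activated with LD_PRELOAD=./libinterpose.so ./application.

/* interpose.c -- minimal sketch of libm call interposition. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

/* Our cos() shadows the libm symbol because the dynamic loader
 * resolves symbols in LD_PRELOAD-ed libraries first. */
double cos(double x)
{
    /* Look up the next definition of "cos" in link order,
     * i.e. the real libm implementation. */
    static double (*real_cos)(double) = NULL;
    if (!real_cos)
        real_cos = (double (*)(double))dlsym(RTLD_NEXT, "cos");

    double result = real_cos(x);
    /* A real interposer would record the call-site and operand
     * range here, or delegate to a reduced-precision kernel. */
    fprintf(stderr, "cos(%g) = %g\n", x, result);
    return result;
}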
Fig. 1: VPREC-libm optimization process overview. A first profiling run executes the binary under LD_PRELOAD with VPREC-libm to produce a profile; a dichotomic exploration then repeatedly executes it with candidate configurations, yielding the optimized precisions.
B. Implementation of the VPREC-libm
The interposed mathematical call is handled by VPREC-libm, which returns the result in the target floating-point format. For example, in double precision one can customize the bit length of the pseudo-exponent r ∈ [1, 11] and of the pseudo-mantissa p ∈ [0, 52].

VPREC-libm operates in two steps. First, it computes a binary128 result z̃ by calling the corresponding mathematical function from GCC's libquadmath. Then, z̃ is converted to the target format using Verificarlo-VPREC [3]. If z̃ is representable in the target range, a faithfully rounded result at target precision p is returned. If z̃ is outside the target range, VPREC returns ±∞ for overflows and ±0 for underflows. Rounding is achieved by adding a ulp at precision p + 1 followed by a truncation: ⌊z̃ + 2^(e_z̃ − p − 1)⌋_p, where e_z̃ is the exponent of z̃ and ⌊·⌋_p truncates the significand to p bits.
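The following sketch illustrates this rounding step on binary64 values (the actual VPREC-libm applies it to the binary128 result from libquadmath; the function name and structure here are ours):

#include <math.h>
#include <stdint.h>

/* Illustrative sketch: faithfully round z to a pseudo-mantissa of
 * p bits (0 <= p <= 52) by adding a ulp at precision p + 1, i.e.
 * 2^(e_z - p - 1), then truncating the low-order mantissa bits. */
double vprec_round(double z, int p)
{
    if (!isfinite(z) || z == 0.0)
        return z;

    int e;
    frexp(z, &e);            /* z = m * 2^e with 0.5 <= |m| < 1      */
    int e_z = e - 1;         /* IEEE exponent: significand in [1, 2) */

    /* add a ulp at precision p + 1, carrying the sign of z */
    double y = z + ldexp(copysign(1.0, z), e_z - p - 1);

    /* truncate the 52 - p trailing bits of the binary64 significand */
    union { double d; uint64_t u; } v = { .d = y };
    v.u &= ~((UINT64_C(1) << (52 - p)) - 1);
    return v.d;
}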
C. Exploring Precision Requirements Using VPREC-libm

VPREC-libm can be used in two modes: profiling and execution. In profiling mode, VPREC-libm creates a profile of the executed code. For each call, it updates the boundaries of the operand and output intervals, and the number of occurrences for the unique program address and stack trace from which the call is made (a sketch of such a record is given below). This information is aggregated and processed to produce execution statistics.
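A per-call-site profile record can be pictured as follows (field and function names are illustrative, not those of the actual implementation):

#include <stdint.h>

/* Illustrative per-call-site profile record, keyed here by the
 * program address the call is made from. */
typedef struct {
    void    *call_site;         /* program address of the call     */
    uint64_t count;             /* number of occurrences           */
    double   in_min, in_max;    /* operand interval boundaries     */
    double   out_min, out_max;  /* output interval boundaries      */
} profile_entry;

/* Update the record for one intercepted call. */
static void profile_update(profile_entry *e, double in, double out)
{
    if (e->count == 0) {
        e->in_min = e->in_max = in;
        e->out_min = e->out_max = out;
    } else {
        if (in  < e->in_min)  e->in_min  = in;
        if (in  > e->in_max)  e->in_max  = in;
        if (out < e->out_min) e->out_min = out;
        if (out > e->out_max) e->out_max = out;
    }
    e->count++;
}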
In execution mode, VPREC-libm accepts a configuration file which specifies the floating-point formats to use at each call-site. An initial configuration file is generated automatically after a profiling run, containing information for all encountered call-sites, which the user can later modify as required.

Taking advantage of this functionality, we developed a method for exploring the optimization potential of a given floating-point code. On a broad scale, we perform a dichotomic search for the minimal output precision of the VPREC-libm functions which meets the user-specified correctness criteria (a sketch follows). This search is applied sequentially per call-site and converges in logarithmic time in the size of the mantissa. As we are dealing with a vast search-space, the order in which this optimization is performed is quite important. According to Amdahl's law, optimizing the code in which most of the time is spent results in the biggest performance gains. Therefore, as a heuristic, we prioritize the call-site exploration by call frequency. Figure 1 summarizes this process.
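The per-call-site search itself can be sketched as a standard binary search over the mantissa size; check stands for a hypothetical driver that re-runs the application with the call-site configured at the candidate precision and evaluates the user's correctness criterion:

#include <stdbool.h>

/* Illustrative sketch of the dichotomic exploration for one
 * call-site. check(site, p) re-executes the program with the
 * call-site set to a p-bit output precision and returns true if
 * the results still meet the correctness criteria. Assuming check
 * is monotone in p, the loop finds the minimal acceptable
 * precision in O(log p_max) runs. */
int minimal_precision(bool (*check)(int site, int p),
                      int site, int p_max)
{
    int lo = 1, hi = p_max;      /* p_max = 52 for binary64 outputs */
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (check(site, mid))
            hi = mid;            /* mid bits suffice: try lower */
        else
            lo = mid + 1;        /* too coarse: need more bits  */
    }
    return lo;
}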
IV. EXPERIMENTAL RESULTS
The example discussed throughout this section illustrates the potential of our proposed method, applied to a real-world astrophysics application used to predict the position and the velocity of Earth-orbiting objects (most notably satellites).
A. Satellite Tracker: SGP4
In order to track the position and the velocity of satellites, a common technique is to use simplified general perturbations (SGP) models, which predict the influence of drag, the Earth's shape, as well as that of the sun and the moon on the trajectory. Work on SGP started in the 1960s, but at the beginning of the 1980s NORAD (North American Aerospace Defense Command) released the equations and the source code to predict satellite positions [17]. NORAD maintains and periodically refines data sets on all resident space objects, ensuring the accuracy of trajectory predictions. The data-sets were made available to users through NASA, being the only source of readily available orbital data.

However, a user could not just go ahead and use any prediction model she wished; she had to use the same model employed to generate the data-set, even if the user's choice might have had better performance, an aspect highlighted in [17]. This made SGP, especially its SGP4/SDP4 variants, commonplace among users wishing to use NORAD's orbital data, distributed as two-line element (TLE) sets. Near-Earth objects (period less than 225 minutes) are tracked using SGP4, while deep-space objects (period over 225 minutes) are tracked using SDP4, which also models the gravitational effects of the moon and the sun and certain Earth harmonics.

In the period following the original release of SGP, a multitude of code variations came to exist, making interoperability and compatibility an issue. For experimental purposes, we use the SGP4 version documented in [18], which is a community effort to keep a version up to date with the models and TLE data-sets used by NORAD. The mathematical models used throughout are those presented in [19]. The C++ version used for the experiments is the one provided by CelesTrak, available online at https://celestrak.com/publications/AIAA/2006-6753/. Only minimal changes were made, in order to ensure that the program runs correctly on Linux. This has not affected the program outputs, as verified against the provided test-suites.

B. Results
We applied our method to the data-set sgp4-all.tle discussed in [18]. Figure 2 illustrates our profiling and precision exploration results for the libm call-sites with the most calls. At the top we show the number of occurrences for the different libm call-sites. In the middle graph we show, for each call-site, the dynamic range of the input data as used in the original code, extracted from the profile produced by VPREC-libm. Finally, in the bottom graph we show, for each call-site, the required precision of the outputs determined by our exploration with the VPREC-libm method.

We can observe that the second and third call-sites are almost twice as frequent as the next ten entries. The output precision cannot be significantly decreased for the two first call-sites, as shown in the bottom sub-figure. Indeed, they occur at the very end of the algorithm, directly influencing the outputs. On the other hand, the dynamic range of the input is quite reduced at these two call-sites. Analyzing the code, we notice that these calls to sin() and cos() take the same argument, as they are part of a rotation. Therefore, they could be replaced with a call to sincos(), effectively reducing their combined workload by almost a factor of two, similar to what is done in [20] (a sketch is given at the end of this section).

Analyzing the rest of the call-sites, we can observe two general trends. The first are call-sites where the required precision is close to the default one; here we only manage to save a few bits. The second are call-sites where the required precision hovers around a much lower mark, close to single precision. A plausible explanation for these results can be found through a bit of computer-science archaeology. SGP4 was first developed in Fortran IV, on a Honeywell 6000 series computer [17]. This machine had 36-bit words, so floating-point numbers had an 8-bit exponent and a 27- or 63-bit significand, for single or double precision, respectively. As noted in [18], the code originally used a mix of single- and double-precision computations. With the evolution of the underlying hardware, the code was moved to double precision throughout, which made for a smoother behavior but did not improve the accuracy. These observations are indeed coherent with our findings on optimizing the precision of the outputs of the mathematical functions.

Figure 3 shows the results of our method on a data-set containing just satellite number 5, the first satellite in the sgp4-all.tle data-set. This is a near-Earth satellite, which means that its trajectory is tracked using SGP4, not SDP4. Its period, perigee and eccentricity ensure that no corner case is triggered in the model. The only particularity of this example is that it uses the TEME (True Equator, Mean Equinox) coordinate system, which requires a conversion to a more standard coordinate system. Indeed, avoiding exceptional cases in the model shows that considerably lower precisions can be used throughout. The trends for the precision of the math functions' outputs remain mostly the same. We notice that over a third of the functions could be evaluated in single precision, requiring at most 24 bits of precision at the output.

It should be noted that the x-axis indices in Fig. 2 and 3 do not necessarily match, as the execution paths can differ due to the different nature of the two data-sets.
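As a sketch of the rotation rewrite suggested above (variable and function names are ours, not from the SGP4 source):

#define _GNU_SOURCE      /* sincos() is a GNU extension in glibc */
#include <math.h>

/* One sincos() call replaces the separate sin()/cos() calls that
 * share the same argument, so the argument reduction is performed
 * only once for both results. */
static void rotate2d(double angle, double x, double y,
                     double *xr, double *yr)
{
    double s, c;
    sincos(angle, &s, &c);
    *xr = c * x - s * y;
    *yr = s * x + c * y;
}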
V. CONCLUSION AND FUTURE WORK

In this paper, we focus on providing a software tool and methodology to profile the mathematical library usage in a full-scale application. The objective is to measure the potential and drive future ad-hoc optimizations of the math library.

Usually, elementary functions are implemented following a four-step scheme [21]: special-case handling, argument reduction, reduced domain splitting, and interpolation (e.g., polynomial or iterative). When limiting the input domain, the first two steps can be optimized; furthermore, reducing the required accuracy and input domain may lower the interpolation complexity (see the sketch below). It can also diminish the implementation cost in special-purpose architecture designs or re-programmable architectures (FPGA).

Rewriting ad-hoc custom elementary functions with a target accuracy on a given input interval is a costly and error-prone task. We propose to explore existing tools to assist or automate these optimizations and measure empirical speedups on real use cases. An easy first step would be to automatically select the best-fitting implementation among existing libraries, such as Intel MKL VML [8]. Finally, approaches such as metalibm [9] for math function code generation could be leveraged to produce specialized libraries.

To conclude, the work presented in this paper shows promising results on co-designing mathematical libraries from application profiles. One weakness of the approach is that the profile is data-input dependent; further experiments on a larger set of use-cases will be done to demonstrate the generalization of the approach and how one can deal with the speculative aspect of profile-guided optimization for math libraries.
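As a final illustration of this potential, here is a minimal sketch of a domain-specialized kernel, under the assumption (to be validated by a profile) that a call-site only receives |x| ≤ π/4 and needs roughly 20 bits of output accuracy. Plain Taylor coefficients are used; a minimax fit generated by a tool such as metalibm [9] would be tighter.

/* sin() specialized to |x| <= pi/4: special-case handling and
 * argument reduction are skipped, and a short odd polynomial
 * suffices (absolute error around 2^-21 on this interval). */
static double sin_reduced(double x)
{
    const double c3 = -1.0 / 6.0;      /* Taylor coefficients */
    const double c5 =  1.0 / 120.0;
    const double c7 = -1.0 / 5040.0;
    double x2 = x * x;
    return x + x * x2 * (c3 + x2 * (c5 + x2 * c7));
}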
REFERENCES
[1] Intel, "White paper: BFloat16 hardware numerics definition," Tech. Rep., November 2018. [Online]. Available: https://software.intel.com/sites/default/files/managed/40/8b/bf16-hardware-numerics-definition-white-paper.pdf
Fig. 2: Analysis of calls to libm for the all-satellites data-set (top: number of calls; middle: input interval size, log scale; bottom: VPREC-libm output precision in bits, original vs. optimized).
Fig. 3: Analysis of calls to libm for the satellite 5 data-set (same layout as Fig. 2).

[2] NVIDIA, "White paper: Training with mixed precision," Tech. Rep., November 2018. [Online]. Available: https://docs.nvidia.com/deeplearning/sdk/pdf/Training-Mixed-Precision-User-Guide.pdf
[3] Y. Chatelain, E. Petit, P. de Oliveira Castro, G. Lartigue, and D. Defour, "Automatic exploration of reduced floating-point representations in iterative methods," in Euro-Par 2019 Parallel Processing - 25th International Conference, ser. Lecture Notes in Computer Science. Springer, 2019.
[4] A. Haidar, A. Abdelfattah, M. Zounon, P. Wu, S. Pranesh, S. Tomov, and J. Dongarra, "The design of fast and energy-efficient linear solvers: On the potential of half-precision arithmetic and iterative refinement techniques," in International Conference on Computational Science. Springer, 2018, pp. 586–600.
[5] S. F. Oberman and M. J. Flynn, "Design issues in division and other floating-point operations," IEEE Transactions on Computers, vol. 46, no. 2, pp. 154–161, Feb 1997.
[6] E. Brun, S. Chauveau, and F. Malvagi, "Patmos: A prototype Monte Carlo transport code to test high performance architectures," in Proceedings of International Conference on Mathematics & Computational Methods Applied to Nuclear Science & Engineering, Jeju, Korea, 2017.
[7] D. Piparo, V. Innocente, and T. Hauth, "Speeding up HEP experiment software with a library of fast and auto-vectorisable mathematical functions," Journal of Physics: Conference Series, vol. 513, no. 5, 2014.
[8] "Intel® Math Kernel Library, developer reference," 2020. [Online]. Available: https://software.intel.com/en-us/mkl-developer-reference-c
[9] N. Brunie, F. de Dinechin, O. Kupriianova, and C. Lauter, "Code generators for mathematical functions," in 2015 IEEE 22nd Symposium on Computer Arithmetic (ARITH), June 2015, pp. 66–73.
[10] C. M. Black, R. P. Burton, and T. H. Miller, "The need for an industry standard of accuracy for elementary-function programs," ACM Trans. Math. Softw., vol. 10, no. 4, pp. 361–366, Dec. 1984.
[11] D. Defour, G. Hanrot, V. Lefèvre, J.-M. Muller, N. Revol, and P. Zimmermann, "Proposal for a standardization of mathematical function implementation in floating-point arithmetic," Numerical Algorithms, vol. 37, no. 1, pp. 367–375, 2004.
[12] S. Collange, M. Daumas, and D. Defour, "Graphic processors to speed-up simulations for the design of high performance solar receptors," Dec 2010, pp. 110–117.
[15] S. Collange, M. Daumas, D. Defour, and R. Olivès, "Fonctions élémentaires sur GPU exploitant la localité de valeurs" [Elementary functions on GPUs exploiting value locality], in SYMPosium en Architectures nouvelles de machines, 2008, p. 12.
[16] A. Suresh, "Intercepting functions for memoization," Thesis, Université Rennes 1, May 2016.
[17] F. R. Hoots and R. L. Roehrich, "Spacetrack Report No. 3: Models for propagation of NORAD element sets," Aerospace Defense Center, Peterson Air Force Base, Tech. Rep., 1980.
[18] D. Vallado, P. Crawford, R. Hujsak, and T. Kelso, "Revisiting Spacetrack Report #3," AIAA 2006-6753, 2006. [Online]. Available: https://arc.aiaa.org/doi/abs/10.2514/6.2006-6753
[19] F. R. Hoots, P. W. Schumacher, and R. A. Glover, "History of Analytical Orbit Modeling in the U.S. Space Surveillance System," Journal of Guidance, Control, and Dynamics, vol. 27, no. 2, pp. 174–185, 2004.
[20] P. Markstein, "Accelerating sine and cosine evaluation with compiler assistance," in 16th IEEE Symposium on Computer Arithmetic (ARITH), June 2003, pp. 137–140.
[21] C. Lauter and M. Mezzarobba, "Semi-automatic floating-point implementation of special functions," in 2015 IEEE 22nd Symposium on Computer Arithmetic (ARITH), June 2015.