Issues with rounding in the GCC implementation of the ISO 18037:2008 standard fixed-point arithmetic
Mantas Mikaitis
Department of Mathematics, The University of Manchester, Manchester, UK. Email: [email protected]
Abstract—We describe various issues caused by the lack of round-to-nearest mode in the gcc compiler implementation of the fixed-point arithmetic data types and operations. We demonstrate that round-to-nearest is not performed in the conversion of constants, in conversion from one numerical type to a less precise type, or on the results of multiplications. Furthermore, we show that mixed-precision operations in fixed-point arithmetic lose precision on arguments, even before carrying out arithmetic operations. The ISO 18037:2008 standard was created to standardize C language extensions, including fixed-point arithmetic, for embedded systems. Embedded systems are usually based on ARM processors, of which approximately 100 billion have been manufactured by now. Therefore, the numerical issues that we discuss in this paper can be rather dangerous and are important to address, given the wide-ranging applications that these embedded systems run.
Index Terms—fixed-point arithmetic, rounding, ISO 18037:2008
I. INTRODUCTION
The ISO 18037:2008 standard [1] defines C programming language extensions to support various unconventional features of embedded processors. Embedded processors are usually low power/performance processors found in trains, planes, fabrication equipment and communication devices [2]. Another notable example is battery-powered medical devices using integer processors such as the ARM Cortex-M3 [3]. One of the main features that the ISO/IEC TR 18037:2008 standard addresses is fixed-point arithmetic and numerical data types for embedded processors. The standard aims to move away from embedded software designed in assembly languages to a more portable and reusable C programming language, since code bases are getting bigger and new platforms are rapidly being developed, with each new one requiring assembly-level changes. Since processors for embedded systems need to be extremely low power, floating-point hardware support is not affordable and either hardware fixed-point support is provided, or, more commonly, integer arithmetic instructions are used to simulate fixed-point arithmetic. However, as the standard states, the C programming language does not provide support for any fixed-point arithmetic types, which leads to the common solution of handcrafted arithmetic libraries in assembly languages. The standard aims to improve this situation by defining numerical types and operations that C compilers can support. In this paper we describe some issues that arise in the gcc compiler implementation of the fixed-point arithmetic defined by this standard. Sections II and III provide background on fixed-point arithmetic. Section IV describes the issues with rounding decimal constants to fixed-point data types. Section V describes rounding in conversions between different types. In Section VI we address mixed-format operations and issues with bit truncation of the arguments, due to the limited support for mixed-format operations in gcc.
Finally, Section VII shows that gcc does not support round-to-nearest mode on the results of fixed-point multiplication and that the pragma that should enable this rounding mode, as indicated by the standard, does not work. All of the experiments were compiled with the gcc compiler version 9.2.1, using the optimization flag -O2, and run on an ARM968 processor.

II. FIXED-POINT ARITHMETIC
The standard defines multiple numerical types for fixed-point arithmetic in the form {s,u}X.Y, where {s,u} defines whether the format is signed or unsigned (if signed, 2's complement representation is used), X defines the number of integer bits and Y defines the number of fractional bits. Machine epsilon of a fixed-point type is defined as ε_{{s,u}X.Y} = 2^-Y, which is the gap between any two neighbouring fixed-point values and is absolute across the dynamic range. Some notable fixed-point numerical formats supported by gcc are: s16.15, u16.16, s0.31, u0.32, s8.7, u8.8, s0.15, u0.16. The s16.15 representation of the real number 1.5 has the 14th and 15th bits set to 1 and the others set to 0, and can be represented as the hexadecimal number 0xC000 or as the integer 2^14 + 2^15 = 49152. This can be converted to a decimal value by multiplying it with ε_{s16.15} = 2^-15: 49152 × 2^-15 = 1.5. Table I shows examples of some decimal values of the three main numerical types explored in this paper.

TABLE I
MINIMUM AND MAXIMUM POSITIVE NUMBERS OF VARIOUS 32-BIT FIXED-POINT NUMERICAL TYPES.

Property          s16.15           u0.32            s0.31
Accuracy (abs.)   2^-15            2^-32            2^-31
Min (exact)       2^-15            2^-32            2^-31
Min (approx.)     3.05 × 10^-5     2.33 × 10^-10    4.66 × 10^-10
Max (exact)       2^16 − 2^-15     1 − 2^-32        1 − 2^-31
Max (approx.)     65535.99996...   0.99999999976... 0.99999999953...

In terms of rounding, fixed-point arithmetic values can be rounded using the same rounding modes as floating-point arithmetic, as defined by the IEEE 754 standard [4]; these are: round-down, round-up, round-toward-zero and round-to-nearest. The only difference worth noting is that bit truncation in fixed-point arithmetic is equivalent to round-down mode, because of the 2's complement notation used to represent negative numbers, whereas in floating point it is equivalent to round-toward-zero. Various libraries have been developed to support fixed-point arithmetic, some of which are based on the ISO 18037 standard.
• The gcc compiler supports fixed-point arithmetic of the ISO 18037 standard [5].
• The MPLAB XC32 C/C++ compiler, a port of gcc for compiling code for the devices developed by the company Microchip [6] (unfortunately, support for rounding is not specified).
• The library libfixmath [7] implements a 32-bit fixed-point type with the possibility to control rounding on operations, including bit truncation and round-to-nearest.
• MATLAB's Fixed-Point Designer tool supports fixed-point types with configurable integer and fraction bit-lengths and includes various rounding modes [8].

III. ISO 18037:2008 STANDARD
The following quotes can be found in Section 4 and Annex A of the ISO standard [1], dealing with fixed-point number rounding:
Quote 1: Conversion of a real numeric value to a fixed-point type may require rounding and/or may overflow. If the source value cannot be represented exactly by the fixed-point type, the source value is rounded to either the closest fixed-point value greater than the source value (rounded up) or to the closest fixed-point value less than the source value (rounded down).
Note that Quote 1 can be interpreted to state that one-way rounding is suitable, either round-up or round-down, since it does not mention that the decision has to be made based on the bits that are rounded off.
Quote 2: Processors that support fixed-point arithmetic in hardware have no problems in attaining the required precision without loss of speed; however, simulations using integer arithmetic may require for multiplication and division extra instructions to get the correct result; often these additional instructions are not needed if the required precision is 2 ulps. The FX_FULL_PRECISION pragma provides a means to inform the implementation when a program requires full precision for these operations (the state of the FX_FULL_PRECISION pragma is "on"), or when the relaxed requirements are allowed (the state of the FX_FULL_PRECISION pragma is "off"). For more discussion on this topic see A.4. Whether rounding is up or down is implementation-defined and may differ for different values and different situations; an implementation may specify that the rounding is indeterminable.
Quote 2 talks about a pragma that can be set in order to improve the accuracy of arithmetic operations and mentions 2 ulp accuracy. However, the standard does not mention the error bounds for the different rounding modes that can be implemented. Lastly, this quote gives some indication about rounding, and the implication that it may differ for different values seems to suggest round-to-nearest, but it does not explicitly state this rounding mode or its maximum error bound of 0.5 ulp that could be achieved if the pragma is set.
Quote 3: Generally it is required that if a value cannot be represented exactly by the fixed-point type, it should be rounded up or down to the nearest representable value in either direction. It was chosen not to specify this further as there is no common path chosen for this in hardware implementations, so it was decided to leave this implementation defined.
Quote 3 seems to indicate that rounding should be in one of the two directions, rather than in whichever direction gives the nearest value.
Quote 4: All conversions between a fixed-point type and another arithmetic type (which can be another fixed-point type) are defined. Rounding and overflow are handled according to the usual rules for the destination type. Conversions from a fixed-point to an integer type round toward zero. The rounding of conversions from a fixed-point type to a floating-point type is unspecified.
This quote provides details about the conversions between different numerical types. It instructs implementations to apply the usual rules when converting between fixed-point types and any other numerical types, probably referring to Quote 3.
Quote 5: If the result type of an arithmetic operation is a fixed-point type, for operators other than * and /, the calculated result is the mathematically exact result with overflow handling and rounding performed to the full precision of the result type [...]. The * and / operators may return either this rounded result or, depending on the state of the FX_FULL_PRECISION pragma, the closest larger or closest smaller value representable by the result fixed-point type. (Between rounding and this optional adjustment, the multiplication and division operations permit a mathematical error of almost 2 units in the last place of the result type.)
This quote is very unclear, since it states that the result of addition/subtraction is mathematically exact, but rounded.

IV. ROUNDING OF CONSTANTS
Firstly, we address rounding of constants. This was commented on previously by us in [9] and also noticed in [10]. A constant, for example 0.04, cannot be represented exactly in a finite-precision arithmetic (Table II) and has to be rounded to the nearest value in the numerical data type of the constant. For example, the two nearest values in the integer representation in s16.15 are ⌊0.04 × 2^15⌋ = 1310 and ⌈0.04 × 2^15⌉ = 1311, produced by round-down and round-up respectively. These correspond to the real values 0.03997802734375 and 0.040008544921875. Since 0.04 × 2^15 = 1310.72, it makes most sense to represent 0.04 as ⌈0.04 × 2^15⌉ = 1311, since that is closer to the real value 0.04. That is, round 0.04 to the nearest s16.15 value (or any other fixed-point format that is being used to store the constant). This operation is done at compilation, when the constant is written into memory by the compiler, and therefore no run-time performance penalty is incurred. Unfortunately, we found that this is not done by the gcc compiler, which resulted in large total errors due to magnification of these small errors in the constants, for example in ordinary differential equation (ODE) solvers run in fixed-point arithmetic with the gcc compiler [10], [11]. The code for this is

accum a = 0.04k;

where the letter k is used to indicate that this constant is in s16.15 format (not necessary in this context since the destination format is known, but we chose to use it for demonstration). The accum data type is another name in C for the s16.15 data type. We believe this to be an issue independent of Quote 2: the pragma defined there should only be applied to control run-time behaviour, that is, rounding of various values that come up at run time, not at compile time. At compilation we expect all the constants to be rounded to the nearest representable values irrespective of the run-time accuracy settings.
In general, we found that the pragma FX_FULL_PRECISION does not have any effect in gcc and does not turn on rounding either at compilation or at run time.

TABLE II
VALUES OF THE CONSTANT 0.04 IN DIFFERENT DATA TYPES.

Data type   round-to-nearest      next nearest
s16.15      0.040008544921875     0.03997802734375
s0.31       0.0400000000372...    0.0399999995716...
u0.32       0.0400000000372...    0.0399999998044...
binary32    0.0399999991059...    0.0400000028312...

V. ROUNDING ON CONVERSION
Here we show that round-to-nearest is not applied when converting to a fixed-point type a numerical value that is held in a more precise data type. First, we try to convert a value held in s0.31 to s16.15. We choose a value that is smaller than the smallest value representable in s16.15: 2^-16 + 2^-17 = 0.75 × ε_{s16.15}, where ε_{s16.15} = 2^-15. In C code we write:

long fract a = 2.288818359375E-5lr;
accum b = a;

Here long fract is another name for s0.31 and accum for s16.15. The letters lr next to the constant tell the compiler that this is an s0.31 constant, as defined in the ISO standard. Once this code is executed, b evaluates to 0, rather than the nearest representable value of 2^-15; therefore round-to-nearest is not performed on conversion: round-down, or bit truncation, is performed instead. Another test that demonstrates this involves conversion from the single precision floating-point format binary32, defined in the IEEE 754 standard [4], to s16.15.

float a = 0.04;
accum b = a;

This uses the same constant as in Section IV, which, as shown there, is not rounded to the nearest fixed-point value when specified as the decimal value 0.04 in the source code. In this case the binary32 approximation of 0.04 is more accurate than the s16.15 approximation, so the value of 0.04 held as binary32 should be rounded to the nearest value in s16.15. However, b still evaluates to the value below 0.04, meaning that upon conversion from binary32 to s16.15, round-down is used instead of round-to-nearest. Therefore, conversion of fixed-point values does not implement round-to-nearest to minimize the conversion error and either follows the vague specification of the standard in Quotes 1, 3, and 4 or simply performs bit truncation by shifting. Our recommendation is that the specification in Quote 2 should be implemented, with the pragma enabling round-to-nearest on conversions.

VI. ROUNDING OF ARGUMENTS IN MIXED-FORMAT OPERATIONS
We have observed multiple fixed-point arithmetic routines in the assembly from gcc with some loss of precision and speed; for example, the multiplication of a value in s16.15 by u0.32 is performed as the multiplication of two s32.31 values, and the multiplication of s16.15 by u0.16 is performed as the multiplication of two s16.15 values. This is achieved by converting all the arguments to a common internal format, which means that the u0.32 argument in the former case is converted to s32.31, and the u0.16 argument in the latter case to s16.15 (one bit less precision in the fractional part). This causes loss of precision in the arguments, even before the multiplication is performed, and the main reason is that gcc does not support mixed-format multipliers directly, as indicated by the list of internal compiler functions for performing fixed-point arithmetic operations [12]. A test for this is as follows:

unsigned long fract a = pow(2, -32);
accum b = 65535k;
unsigned long fract c = a * b;

We chose a = ε_{u0.32} = 2^-32, since that is the smallest value representable by u0.32 (only the least significant bit set), and b = 65535, the largest integer value representable by s16.15. In this scenario we expect to get c = 65535 × a; however, we get c = 0, because the last bit of a is dropped before the multiplication takes place, causing a = 0. The same issue happens irrespective of what b is set to. Furthermore, we can enclose this code in a conditional that checks the values of a and b; it executes the conditional code and incorrectly updates c to 0, overwriting its previous value:

unsigned long fract a = pow(2, -32);
accum b = 65535k;
unsigned long fract c = 0.8ulr;
if (a > 0 && b > 0)
    c = a * b;

Lastly, we modify the code as follows:

long fract a = pow(2, -31);
long fract b = -1.0lr;
long fract c = a * b;

In this case we are using the signed fractional type s0.31, so that we can represent −1.
Running this third test case we do not get c = 0, but the correct multiplication result of c = −2^−31. Both this and the previous test case are quite similar: multiplying a very small value by a value that is not smaller than 1 in magnitude, the expectation is that the code will scale a, returning |c| ≥ |a|. However, the first provides an unexpected result due to the conversion of a and b to the common numerical type s32.31, as outlined above, whereas the second works as expected since no conversion is needed. This leads to a major problem: we know that a is not zero, but multiplying it by a non-zero value that is not smaller than 1 in magnitude sometimes gives an answer of 0 and sometimes a correct answer, depending on the numerical types used to store a and b. For most users, who do not necessarily think about how exactly the arithmetic is performed at the lowest level, this behaviour would be, and potentially is, very puzzling.

VII. ROUNDING OF MULTIPLICATION RESULTS
In this section we show that there is no support for round-to-nearest in arithmetic operations with fixed-point numbers. Specifically, we test multiplication, which multiplies the two arguments and returns a value in a more precise fixed-point format. The result held in extended precision in most cases requires rounding in order to convert it to the format of one of the arguments. A simple test is to declare two s16.15 values, a = 3 × ε_{s16.15} = 0.000091552734375 and b = 0.25. This should give us a × b = 0.75 × ε_{s16.15}, which should round to the nearest value ε_{s16.15}. The code for this is:

accum a = 0.000091552734375k;
accum b = 0.25k;
accum c = a * b;

In this piece of code c evaluates to 0, which means that the result 0.75 × ε_{s16.15} is rounded down to 0 rather than to the closest value ε_{s16.15}, most likely as a result of the bit truncation that happens when the product in full precision (which has to be stored in the s32.30 format) is shifted right by 15 bits to convert it to s16.15. The pragma described in Quote 2 does not change the rounding mode.

VIII. CONCLUSION
We have shown various numerical accuracy issues in the gcc compiler implementation of the standardized fixed-point arithmetic [1]. The main issue is the lack of rounding in decimal to fixed-point conversion, in conversion from any format to fixed point in general, and in arithmetic operations such as multiplication. Furthermore, there is precision loss in the arguments of mixed-format arithmetic operations. In our understanding, these software bugs exist both because of the vague specification of various fixed-point properties and required features in the ISO 18037 standard, and because of the lack of support for various features of fixed-point arithmetic in gcc. This arithmetic in gcc should be carefully reimplemented, taking care of the various edge cases and all possible mixed-format combinations, to support the embedded systems community. In summary, the current paper should inform the embedded systems community about the numerical accuracy problems in the current implementation of the gcc fixed-point arithmetic, as well as help identify and understand numerical problems in their codes.

IX. ACKNOWLEDGEMENTS
The author thanks Massimiliano Fasi and Nicholas J. Higham for their feedback on early drafts of the manuscript. The author was supported by an EPSRC Doctoral Prize Fellowship.

REFERENCES
[1] ISO/IEC, "Programming languages - C - extensions to support embedded processors, ISO/IEC TR 18037:2008," ISO/IEC JTC 1/SC 22.
[2] Embedded System Design. Berlin, Heidelberg: Springer-Verlag, 2006.
[3] S. R. Sridhara, M. DiRenzo, S. Lingam, S. Lee, R. Blazquez, J. Maxey, S. Ghanem, Y. Lee, R. Abdallah, P. Singh, and M. Goel, "Microwatt embedded processor platform for medical system-on-chip applications," IEEE Journal of Solid-State Circuits, vol. 46, no. 4, pp. 721–730, April 2011. [Online]. Available: https://doi.org/10.1109/JSSC.2011.2108910
[4] IEEE Standard for Floating-Point Arithmetic, IEEE Std 754-2019 (revision of IEEE Std 754-2008). Piscataway, NJ, USA: Institute of Electrical and Electronics Engineers, 2019. [Online]. Available: https://doi.org/10.1109/IEEESTD.2008.4610935
[5] GNU, "Fixed-point arithmetic support," Online: https://gcc.gnu.org/wiki/FixedPointArithmetic, accessed: 22/4/2020.
[6] Microchip, "MPLAB XC32 C/C++ compiler user's guide," Online: http://ww1.microchip.com/downloads/en/DeviceDoc/50001686J.pdf, accessed: 22/4/2020.
[7] Google Code Archive, "libfixmath, cross platform fixed point maths library," Online: https://code.google.com/archive/p/libfixmath/, accessed: 22/4/2020.
[8] MathWorks, "Fixed-Point Designer, perform fixed-point arithmetic," Online: https://uk.mathworks.com/help/fixedpoint/, accessed: 22/4/2020.
[9] M. Hopkins, M. Mikaitis, D. R. Lester, and S. Furber, "Stochastic rounding and reduced-precision fixed-point arithmetic for solving neural ordinary differential equations," Phil. Trans. of the Royal Soc. A: Mathematical, Physical and Engineering Sciences, vol. 378, no. 2166, p. 20190052, January 2020. [Online]. Available: https://royalsocietypublishing.org/doi/abs/10.1098/rsta.2019.0052
[10] G. Trensch, R. Gutzen, I. Blundell, M. Denker, and A. Morrison, "Rigorous neural network simulations: A model substantiation methodology for increasing the correctness of simulation results in the absence of experimental validation data," Frontiers in Neuroinformatics