Issues with rounding in the GCC implementation of the ISO 18037:2008 standard fixed-point arithmetic
Mantas Mikaitis
Department of Mathematics, The University of Manchester, Manchester, UK. Email: [email protected]
Abstract—We describe various issues caused by the lack of round-to-nearest mode in the gcc compiler implementation of the fixed-point arithmetic data types and operations. We demonstrate that round-to-nearest is not performed in the conversion of constants, in conversion from one numerical type to a less precise type, or on the results of multiplications. Furthermore, we show that mixed-precision operations in fixed-point arithmetic lose precision on arguments, even before carrying out arithmetic operations. The ISO 18037:2008 standard was created to standardize C language extensions, including fixed-point arithmetic, for embedded systems. Embedded systems are usually based on ARM processors, of which approximately 100 billion have been manufactured by now. Therefore, the numerical issues that we discuss in this paper can be rather dangerous and are important to address, given the wide-ranging applications that these embedded systems run.
Index Terms—fixed-point arithmetic, rounding, ISO 18037:2008
I. INTRODUCTION
The ISO 18037:2008 standard [1] defines C programming language extensions to support various unconventional features of embedded processors. Embedded processors are usually low power/performance processors found in trains, planes, fabrication equipment and communication devices [2]. Another notable example is battery-powered medical devices using integer processors such as the ARM Cortex-M3 [3]. One of the main features that the ISO/IEC TR 18037:2008 standard addresses is fixed-point arithmetic and numerical data types for embedded processors. The standard aims to move away from embedded software designed in assembly languages to a more portable and reusable C programming language, since code bases are getting bigger and new platforms are rapidly being developed, with each new one requiring assembly-level changes. Since processors for embedded systems need to be extremely low power, floating-point hardware support is not affordable and either hardware fixed-point support is provided, or, more commonly, integer arithmetic instructions are used to simulate fixed-point arithmetic. However, as the standard states, the C programming language does not provide support for any fixed-point arithmetic types, which leads to the common solution of handcrafted arithmetic libraries in assembly languages. The standard aims to improve this situation by defining numerical types and operations that C compilers can support. In this paper we describe some issues that arise in the gcc compiler implementation of the fixed-point arithmetic defined by this standard. Sections II and III provide background on fixed-point arithmetic. Section IV describes the issues with rounding decimal constants to fixed-point data types. Section V describes rounding in conversions between different types. In Section VI we address mixed-format operations and issues with bit truncation of the arguments, due to the limited support for mixed-format operations in gcc.
Finally, Section VII shows that gcc does not support round-to-nearest mode on the results of fixed-point multiplication and that the pragma that should enable this rounding mode, as indicated by the standard, does not work. All of the experiments were compiled with the gcc compiler version 9.2.1, using the optimization flag -O2, and run on an ARM968 processor.

II. FIXED-POINT ARITHMETIC
The standard defines multiple numerical types for fixed-point arithmetic in the form {s,u}X.Y, where {s,u} defines whether the format is signed or unsigned (if signed, 2's complement representation is used), X defines the number of integer bits and Y defines the number of fractional bits. Machine epsilon of a fixed-point type is defined as ε_{{s,u}X.Y} = 2^-Y, which is the gap between any two neighbouring fixed-point values and is absolute across the dynamic range. Some notable fixed-point numerical formats supported by gcc are: s16.15, u16.16, s0.31, u0.32, s8.7, u8.8, s0.15, u0.16. The s16.15 representation of the real number 1.5 has the 14th and 15th bits set to 1 and the others set to 0, and can be represented as the hexadecimal number 0xC000 or as the integer 2^14 + 2^15 = 49152. This can be converted to a decimal value by multiplying it with ε_{s16.15} = 2^-15: 49152 × 2^-15 = 1.5. Table I shows examples of some decimal values of the three main numerical types explored in this paper.

TABLE I
MINIMUM AND MAXIMUM POSITIVE NUMBERS OF VARIOUS 32-BIT FIXED-POINT NUMERICAL TYPES.

Property          s16.15           u0.32            s0.31
Accuracy (abs.)   2^-15            2^-32            2^-31
Min (exact)       2^-15            2^-32            2^-31
Min (approx.)     3.05 × 10^-5     2.33 × 10^-10    4.66 × 10^-10
Max (exact)       2^16 − 2^-15     1 − 2^-32        1 − 2^-31
Max (approx.)     65535.99996...   0.99999999976... 0.99999999953...

In terms of rounding, fixed-point arithmetic values can be rounded using the same rounding modes as floating-point arithmetic, as defined by the IEEE 754 standard [4]; these are: round-down, round-up, round-toward-zero and round-to-nearest. The only difference worth noting is that bit truncation in fixed-point arithmetic is equivalent to round-down mode, because of the 2's complement notation used to represent negative numbers, whereas in floating point it is equivalent to round-toward-zero. Various libraries have been developed to support fixed-point arithmetic, some of which are based on the ISO 18037 standard.
• The gcc compiler supports fixed-point arithmetic of the ISO 18037 standard [5].
• The MPLAB XC32 C/C++ compiler, a port of gcc for compiling code for the devices developed by the company Microchip [6] (unfortunately, support for rounding is not specified).
• The library libfixmath [7] implements a 32-bit fixed-point type with the possibility to control rounding on operations, including bit truncation and round-to-nearest.
• MATLAB's Fixed-Point Designer tool supports fixed-point types with configurable integer and fraction bit-lengths and includes various rounding modes [8].

III. ISO 18037:2008 STANDARD
The following quotes can be found in Section 4 and Annex A of the ISO standard [1], dealing with fixed-point number rounding:
Quote 1: Conversion of a real numeric value to a fixed-point type may require rounding and/or may overflow. If the source value cannot be represented exactly by the fixed-point type, the source value is rounded to either the closest fixed-point value greater than the source value (rounded up) or to the closest fixed-point value less than the source value (rounded down).
Note that Quote 1 can be interpreted to state that one-way rounding is suitable, either round-up or round-down, since it does not mention that the decision has to be made based on the bits that are rounded off.
Quote 2: Processors that support fixed-point arithmetic in hardware have no problems in attaining the required precision without loss of speed; however, simulations using integer arithmetic may require for multiplication and division extra instructions to get the correct result; often these additional instructions are not needed if the required precision is 2 ulps. The FX_FULL_PRECISION pragma provides a means to inform the implementation when a program requires full precision for these operations (the state of the FX_FULL_PRECISION pragma is "on"), or when the relaxed requirements are allowed (the state of the FX_FULL_PRECISION pragma is "off"). For more discussion on this topic see A.4. Whether rounding is up or down is implementation-defined and may differ for different values and different situations; an implementation may specify that the rounding is indeterminable.
Quote 2 talks about a pragma that can be set in order to improve the accuracy of arithmetic operations and mentions 2 ulp accuracy. However, the standard does not mention the error bounds for the different rounding modes that can be implemented. Lastly, this quote gives some indication about rounding, and the implication that it may differ for different values seems to suggest round-to-nearest, but it does not explicitly state this rounding mode or its maximum error bound of 0.5 ulp that could be achieved if the pragma is set.
Quote 3: Generally it is required that if a value cannot be represented exactly by the fixed-point type, it should be rounded up or down to the nearest representable value in either direction. It was chosen not to specify this further as there is no common path chosen for this in hardware implementations, so it was decided to leave this implementation defined.
Quote 3 seems to indicate that rounding should be in one of the two directions, rather than in whichever direction gives the nearest value.
Quote 4: All conversions between a fixed-point type and another arithmetic type (which can be another fixed-point type) are defined. Rounding and overflow are handled according to the usual rules for the destination type. Conversions from a fixed-point to an integer type round toward zero. The rounding of conversions from a fixed-point type to a floating-point type is unspecified.
This quote provides details about the conversions between different numerical types. It instructs implementations to apply the usual rules when converting between fixed-point types and any other numerical types, probably referring to Quote 3.
Quote 5: If the result type of an arithmetic operation is a fixed-point type, for operators other than * and /, the calculated result is the mathematically exact result with overflow handling and rounding performed to the full precision of the result type [...]. The * and / operators may return either this rounded result or, depending on the state of the FX_FULL_PRECISION pragma, the closest larger or closest smaller value representable by the result fixed-point type. (Between rounding and this optional adjustment, the multiplication and division operations permit a mathematical error of almost 2 units in the last place of the result type.)
This quote is very unclear, since it states that the result of addition/subtraction is mathematically exact, but rounded.

IV. ROUNDING OF CONSTANTS
Firstly, we address rounding of constants. This was commented on previously by us in [9] and also noticed in [10]. A constant, for example 0.04, cannot be represented exactly in a finite-precision arithmetic (Table II) and has to be rounded to the nearest value in the numerical data type of the constant. For example, the two nearest values in the integer representation in s16.15 are ⌊0.04 × 2^15⌋ = 1310 and ⌈0.04 × 2^15⌉ = 1311, produced by round-down and round-up respectively. These correspond to the real values 0.03997802734375 and 0.040008544921875. Since 0.04 × 2^15 = 1310.72, it makes most sense to represent 0.04 as ⌈0.04 × 2^15⌉ = 1311, since that is closer to the real value 0.04. That is, round 0.04 to the nearest s16.15 value (or any other fixed-point format that is being used to store the constant). This operation is done at compilation, when the constant is written into memory by the compiler, and therefore no run-time performance penalty is incurred. Unfortunately, we found that this is not done by the gcc compiler, which resulted in large total errors due to magnification of these small errors in the constants, for example in ordinary differential equation (ODE) solvers run in fixed-point arithmetic with the gcc compiler [10], [11]. The code for this is

accum a = 0.04k;

where the letter k is used to indicate that this constant is in s16.15 format (not necessary in this context since the destination format is known, but we chose to use it for demonstration). The accum data type is another name in C for the s16.15 data type. We believe this to be an issue independent of Quote 2: the pragma defined there should only be applied to control run-time behaviour, that is, rounding of various values that come up at run time, not at compile time. At compilation we expect all the constants to be rounded to the nearest representable values irrespective of the run-time accuracy settings.
In general, we found that the pragma FX_FULL_PRECISION does not have any effect in gcc and does not turn on rounding either at compilation or at run time.

TABLE II
VALUES OF THE CONSTANT 0.04 IN DIFFERENT DATA TYPES.

Data type   round-to-nearest      next nearest
s16.15      0.040008544921875     0.03997802734375
s0.31       0.0400000000372...    0.0399999995716...
u0.32       0.0400000000372...    0.0399999998044...
binary32    0.0399999991059...    0.0400000028312...

V. ROUNDING ON CONVERSION
Here we show that round-to-nearest is not applied when converting to a fixed-point type a numerical value that is held in a more precise data type. First, we try to convert a value held in s0.31 to s16.15. We choose a value that is smaller than the smallest value representable in s16.15: 2^-16 + 2^-17 = 0.75 × ε_{s16.15}, where ε_{s16.15} = 2^-15. In C code we write:

long fract a = 2.288818359375E-5lr;
accum b = a;

Here long fract is another name for s0.31 and accum for s16.15. The letters lr next to the constant tell the compiler that this is an s0.31 constant, as defined in the ISO standard. Once this code is executed, b evaluates to 0, rather than the nearest representable value of 2^-15; therefore round-to-nearest is not performed on conversion: round-down, or bit truncation, is performed instead. Another test that demonstrates this involves conversion from the single precision floating-point format binary32, defined in the IEEE 754 standard [4], to s16.15.

float a = 0.04;
accum b = a;

This uses the same constant as in Section IV, which, as shown there, is not rounded to the nearest fixed-point value when specified as the decimal value 0.04 in the source code. In this case the binary32 approximation of 0.04 is more accurate than the s16.15 approximation, so the value of 0.04 held as binary32 should be rounded to the nearest value in s16.15. However, b still evaluates to the value below 0.04, meaning that upon conversion from binary32 to s16.15, round-down is used instead of round-to-nearest. Therefore, conversion of fixed-point values does not implement round-to-nearest to minimize the conversion error and either follows the vague specification of the standard in Quotes 1, 3, and 4 or simply performs bit truncation by shifting. Our recommendation is that the specification in Quote 2 should be implemented, with the pragma enabling round-to-nearest on conversions.

VI. ROUNDING OF ARGUMENTS IN MIXED-FORMAT OPERATIONS
We have observed multiple fixed-point arithmetic routines in the assembly from gcc with some loss of precision and speed; for example, the multiplication of a value in s16.15 by u0.32 is performed as the multiplication of two s32.31 values, and the multiplication of s16.15 by u0.16 is performed as the multiplication of two s16.15 values. This is achieved by converting all the arguments to a common internal format, which means that the u0.32 argument in the former case is converted to s32.31, and the u0.16 argument in the latter case to s16.15 (one bit less precision in the fractional part). This causes loss of precision in the arguments, even before the multiplication is performed, and the main reason is that gcc does not support mixed-format multipliers directly, as indicated by the list of internal compiler functions for performing fixed-point arithmetic operations [12]. A test for this is as follows:

unsigned long fract a = pow(2, -32);
accum b = 65535k;
unsigned long fract c = a * b;

We chose a = ε_{u0.32} = 2^-32, since that is the smallest value representable by u0.32 (only the least significant bit set), and b = 65535, the largest integer value representable by s16.15. In this scenario we expect to get c = 65535 × a; however, we get c = 0, because the last bit of a is dropped before the multiplication takes place, causing a = 0. The same issue happens irrespective of what b is set to. Furthermore, we can enclose this code in a conditional that checks the values of a and b; it executes the conditional code and incorrectly updates c to 0, overwriting its previous value:

unsigned long fract a = pow(2, -32);
accum b = 65535k;
unsigned long fract c = 0.8ulr;
if (a > 0 && b > 0)
    c = a * b;

Lastly, we modify the code as follows:

long fract a = pow(2, -31);
long fract b = -1.0lr;
long fract c = a * b;

In this case we are using the signed fractional type s0.31, so that we can represent −1.
Running this third test case we do not get c = 0, but the correct multiplication result of c = −2^−31. Both this and the previous test case are quite similar: multiplying a very small value by a value that is not smaller than 1 in magnitude, the expectation is that the code will scale a, returning |c| ≥ |a|. However, the first provides an unexpected result due to the conversion of a and b to the common numerical type s32.31, as outlined above, whereas the second works as expected since no conversion is needed. This leads to a major problem: we know that a is not zero, but multiplying it by a non-zero value that is not smaller than 1 in magnitude sometimes gives an answer of 0 and sometimes a correct answer, depending on the numerical types used to store a and b. For most users, who do not necessarily think about how exactly the arithmetic is performed at the lowest level, this behaviour would be, and potentially is, very puzzling.

VII. ROUNDING OF MULTIPLICATION RESULTS
In this section we show that there is no support for round-to-nearest in arithmetic operations with fixed-point numbers. Specifically, we test multiplication, which multiplies the two arguments and returns a value in a more precise fixed-point format. The result held in extended precision in most cases requires rounding in order to convert it to the format of one of the arguments. A simple test is to declare two s16.15 values, a = 3 × ε_{s16.15} = 0.000091552734375 and b = 0.25. This should give us a × b = 0.75 × ε_{s16.15}, which should round to the nearest value ε_{s16.15}. The code for this is:

accum a = 0.000091552734375k;
accum b = 0.25k;
accum c = a * b;

In this piece of code c evaluates to 0, which means that the result 0.75 × ε_{s16.15} is rounded down to 0 rather than to the closest value ε_{s16.15}, most likely as a result of the bit truncation that happens when the product in full precision (which has to be stored in the s32.30 format) is shifted right by 15 bits to convert it to s16.15. The pragma described in Quote 2 does not change the rounding mode.

VIII. CONCLUSION
We have shown various numerical accuracy issues in the gcc compiler implementation of the standardized fixed-point arithmetic [1]. The main issue is the lack of rounding in decimal to fixed-point conversion, in conversion from any format to fixed point in general, and in arithmetic operations such as multiplication. Furthermore, there is precision loss in the arguments of mixed-format arithmetic operations. In our understanding, these software bugs exist both because of the vague specification of various fixed-point properties and required features in the ISO 18037 standard, and because of the lack of support for various features of fixed-point arithmetic in gcc. This arithmetic in gcc should be carefully reimplemented, taking care of the various edge cases and all possible mixed-format combinations, to support the embedded systems community. In summary, the current paper should inform the embedded systems community about the numerical accuracy problems in the current implementation of the gcc fixed-point arithmetic, as well as help identify and understand numerical problems in their codes.

IX. ACKNOWLEDGEMENTS
The author thanks Massimiliano Fasi and Nicholas J. Higham for their feedback on early drafts of the manuscript. The author was supported by an EPSRC Doctoral Prize Fellowship.

REFERENCES
[1] ISO/IEC, "Programming languages - C - extensions to support embedded processors, ISO/IEC TR 18037:2008," ISO/IEC JTC 1/SC 22.
[2] Embedded System Design. Berlin, Heidelberg: Springer-Verlag, 2006.
[3] S. R. Sridhara, M. DiRenzo, S. Lingam, S. Lee, R. Blazquez, J. Maxey, S. Ghanem, Y. Lee, R. Abdallah, P. Singh, and M. Goel, "Microwatt embedded processor platform for medical system-on-chip applications," IEEE Journal of Solid-State Circuits, vol. 46, no. 4, pp. 721–730, April 2011. [Online]. Available: https://doi.org/10.1109/JSSC.2011.2108910
[4] IEEE Standard for Floating-Point Arithmetic, IEEE Std 754-2019 (revision of IEEE Std 754-2008). Piscataway, NJ, USA: Institute of Electrical and Electronics Engineers, 2019. [Online]. Available: https://doi.org/10.1109/IEEESTD.2008.4610935
[5] GNU, "Fixed-point arithmetic support," Online: https://gcc.gnu.org/wiki/FixedPointArithmetic, accessed: 22/4/2020.
[6] Microchip, "MPLAB XC32 C/C++ compiler user's guide," Online: http://ww1.microchip.com/downloads/en/DeviceDoc/50001686J.pdf, accessed: 22/4/2020.
[7] Google Code Archive, "libfixmath, cross platform fixed point maths library," Online: https://code.google.com/archive/p/libfixmath/, accessed: 22/4/2020.
[8] MathWorks, "Fixed-Point Designer, perform fixed-point arithmetic," Online: https://uk.mathworks.com/help/fixedpoint/, accessed: 22/4/2020.
[9] M. Hopkins, M. Mikaitis, D. R. Lester, and S. Furber, "Stochastic rounding and reduced-precision fixed-point arithmetic for solving neural ordinary differential equations," Phil. Trans. of the Royal Soc. A: Mathematical, Physical and Engineering Sciences, vol. 378, no. 2166, p. 20190052, January 2020. [Online]. Available: https://royalsocietypublishing.org/doi/abs/10.1098/rsta.2019.0052
[10] G. Trensch, R. Gutzen, I. Blundell, M. Denker, and A. Morrison, "Rigorous neural network simulations: A model substantiation methodology for increasing the correctness of simulation results in the absence of experimental validation data," Frontiers in Neuroinformatics