Non-asymptotic moment bounds for random variables rounded to non-uniformly spaced sets
Rounding random variables to finite precision
Tyler Chen ∗ July 23, 2020
Abstract
In this paper, we provide general bounds on the mean absolute difference and difference of moments of a random variable X and a perturbation rd(X), when |rd(x) − x| ≤ ε|x| or |rd(x) − x| ≤ δ, which depend linearly on ε and δ. We then show that if the perturbation corresponds to rounding to the nearest point in some fixed discrete set, the bounds on the difference of moments can be improved to quadratic in many cases. When the points in this fixed set are uniformly spaced, our analysis can be viewed as a generalization of Sheppard's corrections. We discuss how our bounds can be used to balance measurement error with sample error in a rigorous way, as well as how they can be used to generalize classical numerical analysis results. The frameworks developed in our analysis can be applied to a wider range of applications than those studied in this paper and may be of general interest.

∗ University of Washington, Seattle WA ([email protected]).

Introduction

Algorithms involving randomness have become commonplace, and in practice these algorithms are often run in finite precision. As a result, some of their theoretical properties, based on the use of exact samples from given distributions, can no longer be guaranteed. Even so, many randomized algorithms appear to perform as well in practice as predicted by theory [HMT11], suggesting that errors resulting from sampling such distributions in finite precision are often negligible. At the same time, especially in the case of Monte Carlo simulations, it is not typically clear how to differentiate the possible effects of rounding errors from the effects of sampling error. In fact, in many areas (such as the numerical solution of stochastic differential equations) this problem is typically addressed by ignoring the effects of rounding errors under the assumption that they are small [KP92]. However, with the recent trend towards lower precision computations in the machine learning [VSM11, GAGN15, WCB+18, etc.] and scientific computing [Ste73, Hig02] communities, and with the massive increase in the amount of data available, the foundational problems of understanding the effect of rounding errors on random variables and the interplay between rounding and sampling error can no longer be ignored.

In many settings, including random matrix theory and numerical methods for stochastic differential equations, there is a range of established theory regarding universality, which, loosely speaking, says that if the distribution of the source of randomness in an algorithm is changed slightly, then the behavior of the algorithm will be essentially unchanged [DMOT14, KP92]. What "changed slightly" means is of course technical, but it is often characterized through certain moment matching conditions for the distribution of the source of randomness. A simple example of this is the central limit theorem, which states that the average of repeated samples of independent and identically distributed (iid) random variables is asymptotically normally distributed, with mean and variance depending only on the mean and variance of the original distribution, but not on any higher moments. Thus, it is critical to understand the effects of rounding on distributions, and more specifically, the effect of rounding on the moments of distributions.

The idea to represent random variables in finite precision by rounding them directly was considered in [Mon85], where such random variables were called ideal approximations. In fact, for many distributions, there are numerical algorithms to generate finite precision samples which are distributed as if they were sampled exactly (in infinite precision) and then rounded to the nearest finite precision number; i.e. samples of an ideal approximation [vN51, Mon79, Kar16].
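For instance, an ideal approximation can be generated by sampling exactly (to double precision) and then rounding. The sketch below is purely illustrative: the uniform grid hZ and the function name are our own choices, and this is not one of the exact algorithms cited above.

```python
import math
import random

def ideal_approximation_exp(h, rng):
    """Sample Exp(1) by inverse transform, then round to the nearest point
    of the uniform grid F = hZ, giving a sample of an 'ideal approximation'."""
    x = -math.log(1.0 - rng.random())     # exact sample, up to double precision
    return h * math.floor(x / h + 0.5)    # round to nearest (ties upward)

rng = random.Random(0)
h = 0.01
samples = [ideal_approximation_exp(h, rng) for _ in range(50000)]
mean = sum(samples) / len(samples)        # close to E[X] = 1
```

With h = 0.01 the rounding perturbs the mean by far less than the Monte Carlo sampling error of roughly 1/√50000 ≈ 0.004, previewing the interplay between rounding and sampling error discussed above.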
However, to the best of our knowledge, the statistical properties of these so-called ideal approximations have not been studied in detail.

Of course, finite precision is not the only situation in which random variables may be rounded. Indeed, understanding the effects of rounding to a uniformly spaced set is particularly relevant to measured data, where samples are often taken with a uniformly graduated device or rounded to some fixed precision [Wil05]. For example, measurements of length may be given in terms of the distance between consecutive markings on a ruler, weight may be measured to the nearest gram, and age may be rounded to the nearest year. The problem of rounding to a uniformly spaced set was studied as early as the late 1890s by Sheppard, who gave the relationship between the moments of a random variable X (satisfying certain niceness conditions) and the moments of a perturbed random variable rd(X), obtained by rounding X to a uniformly spaced set [She97], and continues to remain of interest [Tri90, Jan06, BZZH09, SKA10, UU17].

In this paper we consider a wider range of perturbations. Specifically, for constants δ, ε ≥ 0, we consider how close a perturbation rd(X) is to X when |rd(x) − x| ≤ δ or |rd(x) − x| ≤ ε|x|. We first show that the mean absolute difference of X and rd(X), as well as the difference of the moments of X and rd(X), differ by at most terms linear in δ or ε. We then show that if the rounding function is 'round to nearest' or a specific randomized scheme, then the difference of these quantities for many common random variables is quadratic in δ or ε. Our analysis is derived from assumptions made about the distance between consecutive points, and as a result does not require that the sets to which we round be known explicitly. This allows us to more easily analyze the effects of rounding to sets such as floating point number systems, as well as to handle rounded data subjected to nonlinear effects prior to rounding; e.g. distortion by a camera lens. To obtain such bounds we develop a general framework which is applicable to a wide range of error functions which may be encountered in practice. Finally, we discuss how our bounds can be used to balance measurement error with sample error in a rigorous way, as well as how they can be used to generalize classical numerical analysis results.

Figure 1: left: original distribution. center: distribution after being subjected to a continuous nonlinear transformation rd₁; e.g. a distortion due to a lens. right: center distribution after being subjected to a discretization rd₂; e.g. discretization due to a measurement device. The dashed line is in the position of the original distribution for reference. Note that rd₁ and rd₂ may not commute. Thus, even if rd₁ is invertible, rd₁⁻¹ ∘ rd₂ ∘ rd₁ need not equal rd₂. We discuss the mean absolute error and difference of moments of such distributions.

The study of rounding random variables is closely related to the study of techniques for estimating density functions.
The histogram was introduced in the 1890s by Pearson [Pea95], and is a widely used method for density estimation in practice. Much of the study of histograms focuses on how to select bin widths so that the histogram density best approximates the true density (for a given number of samples) [FD81, CMN98, Knu19].

We additionally contrast our work with probabilistic numerics, which makes statistical assumptions about the rounding errors incurred by a numerical algorithm [Wil63, HM19]. In our paper, we study the effect of deterministic perturbations to random variables; in fact, many of the results in this paper are derived directly from the deterministic structure of rounding errors.
It is often unreasonable to perform computations involving real numbers symbolically. As a result, it is common to fix some set F of real numbers, called a finite precision number system, and perform computations using only numbers in F. Typically |F| < ∞, and on binary computers often |F| = 2^n, where n is the number of bits used to represent numbers in F.

When choosing a finite precision number system, practitioners must balance how well F represents the numbers used in their application with how easily (quickly) basic operations can be performed. Given n bits, F can theoretically consist of any set of 2^n numbers. However, in practice F is typically determined by a simple map from bitstrings of length n to real numbers; for instance IEEE 754 floating point numbers [iee19] and other standard data types. The advantage of such choices of F is that operations such as addition, multiplication, and basic function evaluations can be computed efficiently using highly optimized algorithms and hardware. Indeed, much of the push towards lower precision has been driven by the architecture of Graphics Processing Units (GPUs), which are often designed to perform certain operations with specific data types extremely efficiently. The hardware push has been accompanied by the introduction of a range of lower precision data types such as quarter, half, bfloat16, etc. The recent growth of low precision data types has been primarily driven by the machine learning community [VSM11, GAGN15, WCB+18, etc.], but it is also gaining new traction in other fields, such as scientific computing, where the use of multiple precisions has been studied for many years [Ste73, Hig02].

Once F has been determined, a rounding scheme is required to convert real numbers to numbers in F. There are many approaches to this. The simplest rounding schemes are perhaps 'round up', rd(x) := ⌈x⌉, and 'round down', rd(x) := ⌊x⌋, where we have introduced the notation

⌈x⌉ := inf{z ∈ F : z ≥ x}   and   ⌊x⌋ := sup{z ∈ F : z ≤ x}.

However, such schemes are not symmetric and not particularly common.

Figure 2: Error e(x) := rd(x) − x for selected rounding functions: (a) round up, (b) round down, (c) 'round to nearest', (d) stochastic rounding. Darker colors represent higher probability.

In this paper we consider the rounding schemes

round towards zero:  rd(x) := ⌊x⌋ if x ≥ 0,  ⌈x⌉ if x < 0,
round from zero:     rd(x) := ⌈x⌉ if x ≥ 0,  ⌊x⌋ if x < 0,
round to nearest:    rd(x) := ⌊x⌋ if x < (⌊x⌋ + ⌈x⌉)/2,  ⌈x⌉ if x > (⌊x⌋ + ⌈x⌉)/2.

We will not concern ourselves too much with how ties are broken in the round to nearest scheme, although we note that such intricacies are important in certain applications. We note that the round to nearest scheme obviously minimizes the distance between X and a random variable supported on F in many metrics; e.g. "earth mover" distance, L^p norm, etc.
Thus, by using the exact structure of rd, it is not surprising that we are able to provide higher order bounds on the difference of moments of X and rd(X).

In addition to these commonly used deterministic schemes, we consider a randomized rounding scheme which has gained popularity in machine learning for use with low precision number systems [GAGN15],

stochastic rounding:  rd(x) := ⌊x⌋ with probability 1 − (x − ⌊x⌋)/(⌈x⌉ − ⌊x⌋),  ⌈x⌉ with probability (x − ⌊x⌋)/(⌈x⌉ − ⌊x⌋).

When the rounding function in question involves randomness, further intricacies arise. For instance, if we need to round the same number twice, should it be rounded the same way both times? Two reasonable approaches are to (i) use a single realization of the rounding function for an entire computation, or (ii) use independent realizations of the rounding function each time we round. In computing terms, these two choices can roughly be identified with (i) resetting the seed to the same fixed value prior to every instance of rounding, or (ii) setting a seed once at the beginning of a program.

If the probability of sampling the same number twice is zero, as is the case for continuous distributions, the two approaches are more or less equivalent. On the other hand, if there is a nonzero probability of sampling the same number twice (as is the case for discrete distributions), the two approaches may result in very different behavior. In such cases, the second approach aligns more closely with what we intuitively expect. Moreover, practically speaking, storing a realization of the rounding function would require storing information about how previous samples have been rounded, or using some sort of pseudorandom rounding scheme (i.e. resetting a seed). Thus, we will restrict ourselves to the second case.
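For concreteness, these schemes can be sketched in code for the uniform grid F = hZ. This is an illustrative sketch under that hypothetical grid choice; the function names are ours and none of this is code from the paper.

```python
import math
import random

def round_down(x, h):
    """Largest point of the uniform grid F = hZ that is <= x."""
    return h * math.floor(x / h)

def round_up(x, h):
    """Smallest point of the uniform grid F = hZ that is >= x."""
    return h * math.ceil(x / h)

def round_to_nearest(x, h):
    """Nearest grid point, with ties broken upward (one possible convention)."""
    lo, hi = round_down(x, h), round_up(x, h)
    return lo if x - lo < hi - x else hi

def stochastic_round(x, h, rng=random):
    """Round up with probability (x - floor(x)) / (ceil(x) - floor(x)),
    so that E[rd(x)] = x (the scheme is unbiased)."""
    lo, hi = round_down(x, h), round_up(x, h)
    if lo == hi:
        return lo
    return hi if rng.random() < (x - lo) / (hi - lo) else lo
```

On this grid, |rd(x) − x| ≤ h/2 for 'round to nearest' and |rd(x) − x| ≤ h for the other three schemes.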
Once (F, rd) has been specified, we can consider how performing operations in the finite precision number system compares to performing the operations exactly. It would be exceedingly tedious to perform a separate analysis for every finite precision number system F and perturbation function rd : R → F. This was recognized by numerical analysts decades ago. To simplify analysis, they adopted the following standard assumption, from which most numerical analysis results are derived [Hig02]. For the rest of this paper, unless otherwise specified, we will assume (F, rd) satisfies the following assumption.

Assumption 1. The size of the rounding error is bounded:

i. |rd(x) − x| ≤ ε|x| for all x,
ii. |rd(x) − x| ≤ δ for all x.

rounding scheme     | ε          | δ
round towards zero  | ε/(1 + ε)  | δ
round from zero     | ε          | δ
round to nearest    | ε/2        | δ/2
stochastic rounding | ε          | δ

Table 1: Suppose F is such that ⌈x⌉ ≤ (1 + ε)⌊x⌋ and ⌈x⌉ ≤ ⌊x⌋ + δ for all x ≥ 0, ⌈x⌉, ⌊x⌋ ∈ F. Then the table gives the values of ε and δ for which Assumption 1i. and Assumption 1ii. hold for (F, rd).

For notational convenience, we also define the error function

err(x) := rd(x) − x,

which tells us both the size and direction of rounding errors. Note that Assumption 1 does not require that F ≠ R. Indeed, such an assumption could cover the case of continuous transforms from R to R.

It is well known that Assumption 1 is a worst case error bound, and does not take into account the actual sign or size of err(x). For instance, using this assumption to bound the error of a sum of n numbers results in a bound growing like n, while the true error tends to be closer to √n. In order to understand such behavior, we must take into account more structure about rd. In this paper, we provide bounds on |M_k[rd(X)] − M_k[X]|.
However, rd(X) is not necessarily the random variable supported on F optimizing this difference for k less than some constant. For example, suppose X ∼ N(0, 1) is a standard normal random variable, F is some symmetric set, and rd is 'round to nearest'. Then the mean of rd(X) is zero by symmetry, but the variance of rd(X) may not be exactly one. On the other hand, the mean and variance of a ±1 random variable (assuming ±1 ∈ F) can be represented exactly. Optimizing such a difference would require solving a method of moments type problem, which would be hard if F is large, but may be possible in very low precision. This could be useful for certain applications, such as numerical methods for SDEs, where the noise term can be generated by random variables with a few specified moments [KP92].

We must also consider how applicable our analysis is to the methods used for generating samples of random numbers in practice. While we can theoretically generate ideal approximations using standard inverse transform techniques, this requires being able to evaluate the inverse distribution function F_X⁻¹ to high precision. More commonly used algorithms for generating ideal approximations fall into the class of rejection methods, which tend to be slower than other common approaches to generating random numbers of a specified distribution, such as transformation methods [KW09]. However, while the higher order bounds we derive apply only to ideal approximations, algorithm dependent bounds may be derivable using the same general framework. In any case, the simple bounds from Section 3 can be applied to standard algorithms for generating random numbers, provided some guarantees are given for the accuracy of their outputs. Finally, we note that all of the methods for generating random numbers discussed here rely on a source of uniformly distributed bits, which in practice are typically replaced with bits generated by some pseudorandom number generator.

Finally, we discuss how reasonable the assumptions from the previous section are.
Assumption 1i. is a standard assumption from which many standard results in numerical analysis are derived [Hig02]. Of course, this assumption cannot hold for all real numbers if the finite precision number system is not infinite, and in practice the best that we can hope for is that this assumption holds for some finite subset of the reals (not containing some interval about zero). With IEEE 754 single precision floating point numbers [iee19], which use 32 bits, Assumption 1i. will hold for (−b, −a] ∪ [a, b), while Assumption 1ii. will hold for [−a, a], where

a ≈ 1.2 × 10^-38,  b ≈ 3.4 × 10^38,  ε ≈ 6.0 × 10^-8,  δ ≈ 1.4 × 10^-45.

Thus, for most numerical algorithms, overflow and underflow do not pose a major concern. Of course, for random variables supported on the reals, the probability of sampling numbers which would cause overflow or underflow is non-zero. However, even for heavy tailed distributions, these probabilities are small for most parameter ranges encountered in practice, but it is important to note that our modeling assumptions for floating point arithmetic ignore overflow and underflow. A fully rigorous analysis would involve "preprocessing" an unbounded random variable so that it is bounded and then applying the results in this paper.
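As a rough illustration of the last point (our own calculation using standard tail formulas, with the distributions chosen for illustration), even a heavy-tailed standard Cauchy sample overflows single precision only with probability on the order of 10^-39:

```python
import math

# Largest finite single precision number (approximate; illustrative only).
b = 3.4e38

# Standard normal: P(|X| > b) = erfc(b / sqrt(2)), which underflows to zero.
p_normal = math.erfc(b / math.sqrt(2))

# Standard Cauchy: P(|X| > b) = 1 - (2/pi) * atan(b) = (2/pi) * atan(1/b),
# roughly 2 / (pi * b); the second form avoids catastrophic cancellation.
p_cauchy = (2 / math.pi) * math.atan(1 / b)
```

The identity atan(1/b) = π/2 − atan(b) is used because evaluating 1 − (2/π)·atan(b) directly in double precision would return 0 rather than the tiny true probability.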
We first provide some basic bounds on how the moments of a rounded random variable differ from those of the original random variable. These bounds, while simple to derive, are of practical use due to their generality. Again, recall that we assume (F, rd) satisfy Assumption 1.

Theorem 1 (1st order strong convergence). For any real valued random variable X,

i. E[|rd(X) − X|^n] = E[|err(X)|^n] ≤ E[|X|^n] · ε^n,
ii. E[|rd(X) − X|^n] = E[|err(X)|^n] ≤ δ^n.

The form of Theorem 1 is reminiscent of Assumption 1. This is not surprising, as it is essentially the same as applying Assumption 1 pointwise to the value of X corresponding to each outcome in X's sample space. However, it allows us to trivially generalize a wide range of numerical analysis results to the random variable case; see Section 5.2.

Of course, as mentioned above, sometimes distributional information is more useful in practice. Thus, we would like to be able to characterize the difference in moments of the original random variables and the perturbed random variables.

Lemma 1.
For any real valued random variable X, constant µ, and non-negative integers m and n,

i. |E[(X − µ)^m err(X)^n]| ≤ E[|X − µ|^m |X|^n] · ε^n,
ii. |E[(X − µ)^m err(X)^n]| ≤ E[|X − µ|^m] · δ^n.

Suppose further that rd(x) = −rd(−x) and that X has density function f_X. Define

g_X(s) := min{f_X(s), f_X(−s)},   h_X(s) := f_X(s) − g_X(s).

Then, if m + n is odd,

i. |E[X^m err(X)^n]| ≤ [E[X^{n+m}] − ∫_{−∞}^{∞} x^{n+m} h_X(x) dx] · ε^n;

ii. moreover, if m is odd, |E[X^m err(X)^n]| ≤ [E[X^m] − ∫_{−∞}^{∞} x^m h_X(x) dx] · δ^n.

The first bounds follow directly from Assumption 1; i.e., they use only that |rd(x) − x| ≤ ε|x| or |rd(x) − x| ≤ δ. However, if more is known about rd(x), for instance if rd(x) is symmetric, then we can obtain the stronger bounds in the second part. For many real valued random variables, these bounds may prove more useful. For instance, if f_X is unimodal and symmetric about some point x* > 0, then h_X(x) = 0 for all x < 0. In such cases, the revised bound will involve only moments of the original random variable X, rather than of |X|.

Lemma 1 trivially implies that E[p(rd(X)) − p(X)] converges for any polynomial of degree k, provided that the k-th moments of X are finite. This in turn implies that centered moments converge. We denote the k-th centered moment of Y by M_k[Y]; i.e. M_k[Y] := E[(Y − E[Y])^k].

Theorem 2 (1st order convergence of centered moments). For any real valued random variable X with finite k-th moment,

i. |M_k[rd(X)] − M_k[X]| = O(ε),
ii. |M_k[rd(X)] − M_k[X]| = O(δ).

In deriving the bounds from the previous section, we have thrown away much of the actual structure of the rounding functions. Instead, we relied only on Assumption 1, and possibly the assumption that rd(x) = −rd(−x). While the error functions described in Section 2.2 satisfy these assumptions for commonly used floating point number systems, in many cases we can obtain higher order bounds by taking into account the exact structure of the rounding functions.

This is illustrated in Fig. 3, which depicts an error function corresponding to the 'round to nearest' scheme. The key observation is that the integral of the error function over any finite interval is much smaller than the bound from Assumption 1, as much of the integral cancels away.

Figure 3: Error function err(x) := rd(x) − x for 'round to nearest', together with the envelope ±ε|x|. Note that the contribution of the integral of the 'round to nearest' error function over the interval [a, b] is at most the area of the furthest right darkly shaded triangle: [b^2/2] · ε^2. This is in contrast to the lightly shaded area used by Assumption 1, which is of size [(b^2 − a^2)/2] · ε.

In principle, if we know (F, rd), then we can compute, possibly numerically,

E[X^m err(X)^n] = ∫_a^b x^m err(x)^n f_X(x) dx.    (1)

However, such a computation would be tedious and would require using a precision higher than F. Thus, we seek easier to compute estimates and bounds for expressions similar to Eq. (1).

Intuitively, if ∫_a^b err(x) dx is small, then integrals of the form

∫_a^b f(x) err(x) dx

will also be small, provided f(x) is relatively "well behaved". This is made more precise by the following technical lemma.

First, however, we recall a basic property of the Lebesgue–Stieltjes integral. If g is absolutely integrable with respect to a measure ν, then

µ(A) := ∫_A g dν

is a (signed) measure, and

∫_A f g dν = ∫_A f dµ.

Lemma 2.
Let f : R → R≥0 be lower semi-continuous and g : R → R integrable. Suppose that f g is absolutely integrable and that there exists a function G : R × R → R≥0 such that, for any a, b ∈ R,

∫_a^b g(x) dx ≤ G(a, b).

Let O denote the set of all open subsets of R. Recall that any open set A ∈ O \ {∅} can be written A = ∪_{i=1}^k (a_i, b_i), where the (a_i, b_i) are pairwise disjoint and k ∈ Z>0 ∪ {∞}. Extend G : R × R → R≥0 to a function µ : O → R≥0 ∪ {∞} on open sets by µ(∅) = 0 and

µ(A) = µ(∪_{i=1}^k (a_i, b_i)) = Σ_{i=1}^k G(a_i, b_i),   for all A ∈ O \ {∅}.

Then,

∫_R f(x) g(x) dx ≤ ∫_0^∞ µ({x : f(x) > u}) du.

Proof of Lemma 2.
Define ν(A) := ∫_A g(x) dx and observe that, by definition, ν(A) ≤ µ(A) for all A ∈ O. Then, by definition of the Lebesgue–Stieltjes integral,

∫_R f(x) g(x) dx = ∫_0^∞ ν({x : f(x) > u}) du ≤ ∫_0^∞ µ({x : f(x) > u}) du.

We note that this lemma allows us to provide lower bounds on the integral of f g given lower bounds on the integral of g over [a, b] (simply replace g with −g in the above statement). This bound can be naturally extended to apply to any function f which may have negative outputs by decomposing f = f₊ + f₋, where f₊, −f₋ ≥ 0 and f₊ and −f₋ are lower semi-continuous. Moreover, if G(a, b) is of the form G(a, b) = ∫_a^b h(x) dx, then

∫_0^∞ µ({x : f(x) > u}) du = ∫_R f(x) h(x) dx.

How tight Lemma 2 is depends on how tight the bound for the integral of g is. For instance, if the bound on g is an equality, then the proposition's bound is an equality; in fact it is simply the Lebesgue–Stieltjes integral ∫ f dG, where G is an antiderivative of g. On the other hand, as in the case of our subsequent applications of this proposition, if the bound on g does not take into account some behavior of g, then the bound will be more pessimistic, as it assumes the worst case interaction between f and g.

rounding scheme     | c_rd(k)          | d_rd(k)                                  | β_rd
round towards zero  | 1/(k + 1)        | 1/(k + 1)                                | 1/(1 − ε)
round from zero     | 1/(k + 1)        | 1/(k + 1)                                | 1
round to nearest    | 1/(k + 1)        | 1/(k + 1)                                | 1/(1 − ε)
stochastic rounding | 2/(k^2 + 3k + 2) | (1 − (k + 3) 2^{−(k+1)})/(k^2 + 3k + 2)  | 1

Table 2: Constants for Theorem 3 and Theorem 3'.

In order to use Lemma 2 to bound expressions of the form Eq. (1), we must bound the integral of err(x)^n on some interval [a, b]. As illustrated in Fig. 3, we expect this bound to be of order ε^{n+1} or δ^{n+1}.
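This cancellation, and the c_rd(k) = 1/(k + 1) constant of Table 2, can be observed numerically. The sketch below uses an illustrative uniform grid hZ with 'round to nearest' (so δ = h/2) and a midpoint-rule quadrature; it is a sanity check of ours, not part of the paper's development.

```python
import math

def err_nearest(x, h):
    """err(x) = rd(x) - x for 'round to nearest' on the uniform grid F = hZ."""
    lo = h * math.floor(x / h)
    hi = lo + h
    return (lo if x - lo < hi - x else hi) - x

h = 0.5                 # grid spacing, so delta = h / 2
delta = h / 2
a, b = -1.0, 2.0        # endpoints lying on the grid
k = 2

m = 300000              # midpoint-rule quadrature nodes
dx = (b - a) / m
xs = [a + (i + 0.5) * dx for i in range(m)]

abs_int = sum(abs(err_nearest(x, h)) ** k for x in xs) * dx
signed_int = sum(err_nearest(x, h) for x in xs) * dx

# The integral of |err|^k is (b - a) * delta^k / (k + 1), i.e. c_rd(k) = 1/(k+1),
# while the signed integral cancels to (numerical) zero.
assert abs(abs_int - (b - a) * delta ** k / (k + 1)) < 1e-4
assert abs(signed_int) < 1e-4
```

The signed integral cancels to quadrature accuracy, while ∫_a^b |err(x)|^k dx matches (b − a) · δ^k/(k + 1), consistent with Lemma 3 below.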
This is made more precise in the following theorem. Before we give our main theorem, we provide a lemma which will prove useful in some of our examples.

Lemma 3.
Suppose rd is one of 'round towards zero', 'round from zero', or 'round to nearest'. Then, given any two finite precision numbers a < b ∈ F, integer k > 0, and constant c_rd from Table 2,

(i) ∫_a^b |err(x)|^k dx = [c_rd(k) ∫_a^b |x|^k dx] · ε^k,
(ii) ∫_a^b |err(x)|^k dx = [c_rd(k) ∫_a^b dx] · δ^k.

Moreover, when k is odd and rd is 'round to nearest',

(iii) |∫_a^b err(x)^k dx| = 0,
(iv) |∫_a^b err(x)^k dx| = 0.

Our main result is simply that Lemma 3 approximately holds when the restriction a, b ∈ F is relaxed to a, b ∈ R. This is intuitive, since any real number must be "close" to some finite precision number in the sense of Assumption 1.

Theorem 3.
Suppose rd is one of 'round towards zero', 'round from zero', or 'round to nearest'. Then, given any two real numbers a < b ∈ R, integer k > 0, and constants c_rd, d_rd, and β_rd from Table 2,

(i) ∫_a^b |err(x)|^k dx ≤ [c_rd(k) ∫_a^b |x|^k dx] · ε^k + [c_rd(k) (|a|^{k+1} + |b|^{k+1})] · (β_rd ε)^{k+1},
(ii) ∫_a^b |err(x)|^k dx ≤ [c_rd(k) ∫_a^b dx] · δ^k + [4 c_rd(k)] · δ^{k+1}.

Moreover, when k is odd and rd is 'round to nearest',

(iii) |∫_a^b err(x)^k dx| ≤ [d_rd(k) max(|a|, |b|)^{k+1}] · ε^{k+1},
(iv) |∫_a^b err(x)^k dx| ≤ [d_rd(k)] · δ^{k+1}.

Roughly speaking, these bounds improve the bounds from Theorem 1 by a factor of 1/(n + 1), and the bounds from Theorem 2 by a factor of ε or δ.

The bounds from Theorem 3, combined with the approach from Lemma 2, can be used to bound the error in moments of random variables with "nice" density functions when rounded to (F, rd) in a relatively straightforward manner. We prove a somewhat general purpose corollary, but the bounds could be used more directly as well.

Corollary 1.
Suppose rd is 'round to nearest'. Suppose further that f : R → R≥0 is lower semi-continuous, non-negative, and that there exists x* ∈ R so that f is non-decreasing on (−∞, x*) and non-increasing on (x*, ∞). For x ≥ 0 define

f̂(x) := sup_{|z| > x} f(z) = { f(x*) if x < |x*|;  max{f(x), f(−x)} if x > |x*| },

and for u ≤ f(x*),

f̂⁻(u) := sup{|x| : f(x) > u}.

Then,

(i) |∫ f(x) |err(x)|^k dx| ≤ [∫ |x|^k f(x) dx] · ε^k + [c_rd(k) ∫_0^∞ f̂⁻(u)^{k+1} du] · (β_rd ε)^{k+1}
      = [∫ |x|^k f(x) dx] · ε^k + [(k + 1) c_rd(k) ∫_0^∞ x^k f̂(x) dx] · (β_rd ε)^{k+1},
(ii) |∫ f(x) |err(x)|^k dx| ≤ [c_rd(k) ∫ f(x) dx] · δ^k + [4 c_rd(k) f(x*)] · δ^{k+1}.

Moreover, if k is odd,

(iii) |∫ f(x) err(x)^k dx| ≤ [d_rd(k) ∫_0^∞ f̂⁻(u)^{k+1} du] · ε^{k+1} = [(k + 1) d_rd(k) ∫_0^∞ x^k f̂(x) dx] · ε^{k+1},
(iv) |∫ f(x) err(x)^k dx| ≤ [d_rd(k) f(x*)] · δ^{k+1}.

We can then provide the following result on the difference of moments of X and rd(X) when the density of X is nice. Note that while the conditions for this theorem cover a wide range of common random variables, they are not necessary conditions for quadratic convergence.

Theorem 4 (2nd order convergence of centered moments). Suppose rd is 'round to nearest'. Let X be a random variable with density f_X(x) and finite k-th moment.
Suppose further that |x^n f_X(x)| is lower semi-continuous, bounded, and has a finite number of connected regions of local maxima for n = 1, 2, ..., k. Then,

i. |M_k[rd(X)] − M_k[X]| = O(ε^2),
ii. |M_k[rd(X)] − M_k[X]| = O(δ^2).

Stochastic rounding

We now derive a bound similar to Theorem 3 when rd is the 'stochastic rounding' scheme. This scheme is unbiased, in the sense that for any x the expectation of rd(x) is x. More specifically, for each x, rd(x) is a random variable on (Ω_x, F_x, P_x) taking outcome ⌈x⌉ with probability (x − ⌊x⌋)/(⌈x⌉ − ⌊x⌋) and taking outcome ⌊x⌋ with probability 1 − (x − ⌊x⌋)/(⌈x⌉ − ⌊x⌋), in which case E_x[rd(x)] = x. Ideally, we would view rd : R → F as a random function (variable) on the natural product probability space (Ω_rd, F_rd, P_rd); i.e. Ω_rd = Π_x Ω_x.

Recall that we are interested in the case where we sample rd independently each time we sample X, as opposed to a single time at the beginning of the experiment. That is, every time we sample X, we roll the Ω_rd dice to determine how X should be rounded. Then rd(X) : Ω × Ω_rd → F is a real valued "random variable" on (Ω̃, F̃, P̃), where Ω̃ = Ω × Ω_rd and P̃ = P × P_rd. Suppose g : R → R is a real valued function so that (g ∘ rd)(X) = g(rd(X)) is a random variable on (Ω̃, F̃, P̃). Then, heuristically,

Ẽ[(g ∘ rd)(X)] = Ẽ[Ẽ[(g ∘ rd)(X) | X]] = E[E_rd[g ∘ rd](X)],

where E_rd[g ∘ rd](x) := E_x[(g ∘ rd)(x)].

Lemma 4.
By direct computation,

E_rd[|err(x)|^k] = (x − ⌊x⌋)^k (1 − (x − ⌊x⌋)/(⌈x⌉ − ⌊x⌋)) + (⌈x⌉ − x)^k ((x − ⌊x⌋)/(⌈x⌉ − ⌊x⌋)),

E_rd[err(x)^k] = (⌊x⌋ − x)^k (1 − (x − ⌊x⌋)/(⌈x⌉ − ⌊x⌋)) + (⌈x⌉ − x)^k ((x − ⌊x⌋)/(⌈x⌉ − ⌊x⌋)).

Theorem 3'.
Suppose rd is 'stochastic rounding'. Then, given any two real numbers a and b, integer k > 0, and constants c_rd and d_rd from Table 2,

(i) ∫_a^b E_rd[|err(x)|^k] dx ≤ [c_rd(k) ∫_a^b |x|^k dx] · ε^k + [c_rd(k) (|a|^{k+1} + |b|^{k+1})] · ε^{k+1},
(ii) ∫_a^b E_rd[|err(x)|^k] dx ≤ [c_rd(k) ∫_a^b dx] · δ^k + [4 c_rd(k)] · δ^{k+1}.

Moreover, when k is odd,

(iii) |∫_a^b E_rd[err(x)^k] dx| ≤ [d_rd(k) max(|a|, |b|)^{k+1}] · ε^{k+1},
(iv) |∫_a^b E_rd[err(x)^k] dx| ≤ [d_rd(k)] · δ^{k+1}.

Thus, results relying on Theorem 3 can be extended in a straightforward manner to the case where rd is 'stochastic rounding'. In particular, we retain 2nd order convergence of all centered moments.

We comment briefly on the values of the constants c_rd and d_rd. First, note that d_rd(1) = 0, agreeing with the fact that this rounding scheme is unbiased. However, for all k > 1, these constants are smaller than the corresponding constants for 'round to nearest'. Thus, because the value of ε for a fixed F is twice as large for 'stochastic rounding' as for 'round to nearest' (see Table 1), the bounds given in Theorem 3' are somewhat worse than those given in Theorem 3.

Two-sided bounds for uniform meshes

Sheppard's corrections provide estimates for how the moments of X differ from those of rd(X) when F is a uniformly spaced set and rd is 'round to nearest' [She97]. Specifically, if F = {δz + α : z ∈ Z} and the density of X satisfies certain niceness assumptions, Sheppard's corrections state that

E[rd(X)] ≈ E[X],   V[rd(X)] ≈ V[X] + δ^2/12.

Similar corrections for higher order moments exist.

So far we have only provided upper bounds for the difference of moments of X and rd(X).
This is because our analysis relies on Assumption 1, which does not limit how near together points in F can be. However, if F has equally spaced points, we can provide lower bounds for some of these quantities.

Corollary 2.
Suppose $F = \{2\delta z + \alpha : z \in \mathbb{Z}\}$ and that rd is 'round to nearest'. Then,
$$\left| \int_a^b |\mathrm{err}(x)|^n\,dx - \left[ \frac{1}{n+1} \int_a^b dx \right] \cdot \delta^n \right| \le \left[ \frac{4}{n+1} \right] \cdot \delta^{n+1}.$$
We now relate this to Sheppard's corrections. Roughly speaking, Sheppard's corrections assume that $f_X$ is smooth enough and vanishes outside of some interval. Then $\mathbb{E}[\mathrm{rd}(X)] \approx \mathbb{E}[X]$, because the probability that a sample of X is rounded up will be almost the same as the probability that it is rounded down; i.e. $\mathbb{E}[\mathrm{err}(X)] \approx 0$. This is not generally true. Indeed, as our analysis makes clear, many real valued random variables may have $\mathbb{E}[\mathrm{rd}(X)] \ne \mathbb{E}[X]$. In general, we have
$$\mathbb{V}[\mathrm{rd}(X)] = \mathbb{V}[X] + \mathbb{V}[\mathrm{err}(X)] + 2\,\mathrm{Cov}[X, \mathrm{err}(X)].$$
Sheppard's corrections assume that X and err(X) are approximately uncorrelated. Intuitively, if $f_X$ is smooth enough then, no matter where X is sampled, the probabilities of being rounded up and down are roughly equal. Since $\mathbb{E}[\mathrm{err}(X)] \approx$
0, then $\mathbb{V}[\mathrm{err}(X)] \approx \mathbb{E}[\mathrm{err}(X)^2]$. Corollary 2 shows that if $f_X$ is reasonable,
$$\mathbb{E}\big[\mathrm{err}(X)^2\big] = \left[ \int_a^b f_X(x)\,dx \right] \cdot \frac{\delta^2}{3} + O(\delta^3) = \frac{(2\delta)^2}{12} + O(\delta^3).$$
In this sense, our analysis can be viewed as a generalization of Sheppard's corrections. In particular, since we do not require as strict constraints on the density of X, our bounds can be applied in a wider range of applications.

Our analysis alone cannot easily be used to show that the covariance of X and err(X) is $O(\delta^2)$. This is because Theorem 3 does not take into account the error terms at the ends of the intervals. This means that when we apply Lemma 2 we accumulate these error terms, even though many of them might cancel if we were integrating a nice enough function. This is not strictly a shortcoming of our approach. Indeed, our bounds apply to distributions, such as the uniform distribution, for which the regularity conditions for Sheppard's corrections are not well satisfied, because they require integrability-type conditions rather than smoothness conditions. The lack of applicability of Sheppard's corrections to certain distributions is discussed in more detail in [Jan06, SKA10]. This is additionally illustrated in our subsequent examples; see especially Example 3. Finally, our approach provides explicit constants on higher order terms, allowing for bounds at large mesh sizes, as well as asymptotic bounds for small mesh sizes.

Applications
Often, one would like to estimate statistics of a random variable by repeatedly sampling it. For instance, if $X_1, \ldots, X_n$ are independent and identically distributed (iid) samples of X, then the sample mean $Z := (X_1 + \cdots + X_n)/n$ provides an estimate for the true mean $\mathbb{E}[X]$ in the sense that
$$\mathbb{P}\big[ |Z - \mathbb{E}[X]| > t \big] \le \frac{\mathbb{V}[X]}{n t^2}.$$
In practice, sampling X necessarily incurs some sort of measurement error. This could be due to discretization, biases in the measurement device, random noise, unknown non-linear effects, etc. Intuitively, there is some balance between sampling error and measurement error; if only a few samples are taken then sampling error will dominate, and conversely, if a large number of samples are taken then measurement error will dominate. Our analysis provides several ways to relate the moments of a perturbed random variable rd(X) to those of the underlying random variable X, based on various amounts of information about the perturbation rd and the random variable X. We can use these bounds to balance measurement error and sampling error.

It is easy to imagine scenarios in which more measurements can be taken, provided that they are done less accurately. For instance, recording data in half precision instead of double precision allows four times as many data points to be saved (per unit storage), and less accurate sensors can be produced more cheaply. Naturally then, we may hope to optimize the overall cost subject to an accuracy constraint, the accuracy subject to a cost constraint, or some combination of the two. Such a trade-off for discretization error was explored in [Wil05]. However, that analysis is based on Sheppard's corrections, and is therefore not generally applicable, due to niceness constraints on the density of the random variable to be measured, as well as the fact that measurement errors are not solely due to discretization.

Example 1.
Suppose rd satisfies Assumption 1. This could account for rounding errors, but could also account for systemic errors, non-linear transformations of the data, etc. Define Z as the sample mean of the rounded data, $Z := (\mathrm{rd}(X_1) + \cdots + \mathrm{rd}(X_n))/n$. Then,
$$\mathbb{P}\big[ |Z - \mathbb{E}[\mathrm{rd}(X)]| > t' \big] \le \frac{\mathbb{V}[\mathrm{rd}(X)]}{n (t')^2}.$$
Using that $\mathrm{rd}(X) = X + \mathrm{err}(X)$, and assuming that $t \ge |\mathbb{E}[\mathrm{err}(X)]|$,
$$\mathbb{P}\big[ |Z - \mathbb{E}[X]| \ge t \big] \le \frac{\mathbb{V}[\mathrm{rd}(X)]}{n \big( t - |\mathbb{E}[\mathrm{err}(X)]| \big)^2},$$
where we have used that
$$\mathbb{P}\big[ |Z - \mathbb{E}[X]| \ge t \big] \le \mathbb{P}\big[ |Z - \mathbb{E}[\mathrm{rd}(X)]| \ge t - |\mathbb{E}[\mathrm{err}(X)]| \big].$$
Since $|\mathbb{E}[\mathrm{err}(X)]| \le \delta$, this expression allows us to explicitly balance the number of samples taken with the precision used to take samples in a rigorous way. While the variance of rd(X) is unknown, it can be estimated by repeatedly sampling the rounded distribution. Alternately, we can obtain an expression involving the variance of X using our bounds on the differences of moments of X and rd(X). In particular, we have
$$\mathbb{V}[\mathrm{rd}(X)] \le \mathbb{V}[X] + \delta^2 + 2\,\mathbb{E}\big[ |X - \mathbb{E}[X]| \big]\,\delta.$$
By Jensen's or Hölder's inequality, we have that $\mathbb{E}[|X - \mathbb{E}[X]|] \le \sqrt{\mathbb{V}[X]}$, so
$$\mathbb{P}\big[ |Z - \mathbb{E}[X]| \ge t \big] \le \frac{\mathbb{V}[X] + 2\sqrt{\mathbb{V}[X]}\,\delta + \delta^2}{n (t - \delta)^2} \le \frac{1}{n} \left( \frac{\sqrt{\mathbb{V}[X]} + \delta}{t - \delta} \right)^2. \qquad (2)$$
Setting $t = c\sqrt{\mathbb{V}[X]}$, we can ensure that our sample mean is within c standard deviations of the true mean with probability $1 - p$ by taking $n > 1/(pc^2)$ iid samples of X with maximum absolute measurement error
$$\delta \le \frac{c\sqrt{np} - 1}{\sqrt{np} + 1} \sqrt{\mathbb{V}[X]}.$$
As a quick check, note that if $n \approx 1/(pc^2)$ then we require $\delta \approx$
0, and if $n \to \infty$, we require only $\delta \approx c\sqrt{\mathbb{V}[X]}$. We emphasize that this result is based on a very weak assumption about our perturbation function rd, and therefore holds in a wide range of cases. If more information is known about X, alternate, stronger bounds could be used in place of Chebyshev's inequality. Similarly, if more information about the form of the perturbation is known, then stronger bounds could be used to bound $|\mathbb{E}[\mathrm{err}(X)]|$ and $\mathbb{V}[\mathrm{rd}(X)]$. Example 1 is simply meant to illustrate the idea that measurement error can be balanced with sampling error in a rigorous way.

Theorem 1 can easily be used to generalize many standard numerical analysis results from scalar variables to random variables. We provide a simple example, analyzing the sum of random variables. Such expressions are common in Monte Carlo methods, and in particular in numerical methods for stochastic differential equations [KP92]. Since we use Theorem 1, our bounds are only linear in $\epsilon$, but they do provide a means to determine how much precision is needed in certain algorithms.
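To make the balance in Example 1 concrete, the bound above can be evaluated numerically. The following sketch is our own illustration (the function name and the parameter values p = 0.05, c = 2 are arbitrary choices, not from the text); it computes the largest admissible measurement error $\delta$ for a given number of samples:

```python
import math

def max_rounding_error(n, p, c, var):
    # Largest delta (with |rd(x) - x| <= delta) for which n samples still
    # place the sample mean within c standard deviations of E[X] with
    # probability 1 - p, per the Chebyshev-based bound of Example 1.
    s = math.sqrt(n * p)
    if c * s <= 1:  # need n > 1/(p * c**2) before any delta > 0 suffices
        return 0.0
    return (c * s - 1) / (s + 1) * math.sqrt(var)

p, c, var = 0.05, 2.0, 1.0
# n just above 1/(p * c**2) = 5: only a tiny delta is tolerable
print(max_rounding_error(6, p, c, var))
# very large n: the tolerable delta approaches c * sqrt(var) = 2
print(max_rounding_error(10**8, p, c, var))
```

As in the text, the admissible $\delta$ vanishes as n approaches $1/(pc^2)$ and approaches $c\sqrt{\mathbb{V}[X]}$ as n grows.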
Example 2.
Suppose $X_1, \ldots, X_n$ are random variables. Fix an ordering and let $S_k = X_1 + \cdots + X_k$, and let $\tilde{S}_k = X_1 \oplus \cdots \oplus X_k$. Then $S_k = S_{k-1} + X_k$ and $\tilde{S}_k = \mathrm{rd}(\tilde{S}_{k-1} + X_k)$. We can compute,
$$\mathbb{E}\big[ |S_k - \tilde{S}_k| \big] = \mathbb{E}\big[ |S_{k-1} + X_k - (\tilde{S}_{k-1} + X_k + \mathrm{err}(\tilde{S}_{k-1} + X_k))| \big] = \mathbb{E}\big[ |(S_{k-1} - \tilde{S}_{k-1}) - \mathrm{err}(\tilde{S}_{k-1} + X_k)| \big] \le \mathbb{E}\big[ |S_{k-1} - \tilde{S}_{k-1}| \big] + \mathbb{E}\big[ |\mathrm{err}(\tilde{S}_{k-1} + X_k)| \big].$$
Therefore,
$$\left| \mathbb{E}[S_n] - \mathbb{E}[\tilde{S}_n] \right| \le \sum_{k=1}^{n-1} \mathbb{E}\big[ |\mathrm{err}(\tilde{S}_k + X_{k+1})| \big].$$
By the reverse triangle inequality, $\mathbb{E}[|\tilde{S}_{k-1}| - |S_{k-1}|] \le \mathbb{E}[|\tilde{S}_{k-1} - S_{k-1}|]$, which by induction we may assume is of size $O(\epsilon)$. Thus,
$$\mathbb{E}\big[ |\mathrm{err}(\tilde{S}_{k-1} + X_k)| \big] \le \mathbb{E}\big[ |\tilde{S}_{k-1} + X_k| \big] \cdot \epsilon \le \mathbb{E}\big[ |\tilde{S}_{k-1}| + |X_k| \big] \cdot \epsilon = \mathbb{E}\big[ |S_{k-1}| + |X_k| \big] \cdot \epsilon + O(\epsilon^2) \le \left[ \sum_{i=1}^{k} \mathbb{E}[|X_i|] \right] \cdot \epsilon + O(\epsilon^2).$$
Therefore,
$$\mathbb{E}\big[ |S_n - \tilde{S}_n| \big] \le \left[ \sum_{k=1}^{n-1} \sum_{i=1}^{k+1} \mathbb{E}[|X_i|] \right] \cdot \epsilon + O(\epsilon^2) \le (n-1) \left[ \sum_{i=1}^{n} \mathbb{E}[|X_i|] \right] \cdot \epsilon + O(\epsilon^2).$$
We note that if rd is symmetric, $|\mathrm{rd}(X)| = \mathrm{rd}(|X|)$. This means that if the $X_i$ are obtained by rounding to a symmetric set F using 'round to nearest', then $\mathbb{E}[|X_i|]$ is within a factor of $\epsilon$ of the mean of the absolute value of the original variable (assuming niceness conditions).

In this section we provide a flavor of how our results can be used to provide error bounds.
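Before turning to those examples, the first-order accumulation bound of Example 2 can be sanity-checked numerically. The sketch below is our own toy model, not part of the paper: it emulates 'round to nearest' with t significant bits (so $\epsilon = 2^{-(t+1)}$) and compares the observed mean summation error for uniform random summands against the bound $(n-1)\,[\sum_i \mathbb{E}|X_i|]\,\epsilon$:

```python
import math
import random

def rd(x, t):
    # toy 'round to nearest' keeping t significant bits; the relative
    # error is at most eps = 2**-(t + 1)
    if x == 0:
        return 0.0
    m, e = math.frexp(x)  # x = m * 2**e with 0.5 <= |m| < 1
    return math.ldexp(round(m * 2**t) / 2**t, e)

def fl_sum(xs, t):
    # S~_k = rd(S~_{k-1} + X_k): round after every addition
    s = 0.0
    for x in xs:
        s = rd(s + x, t)
    return s

random.seed(0)
t, n, trials = 10, 50, 2000
eps = 2.0**-(t + 1)
errs = []
for _ in range(trials):
    xs = [random.random() for _ in range(n)]  # X_i ~ Uniform(0, 1), E|X_i| = 1/2
    errs.append(abs(sum(xs) - fl_sum(xs, t)))
mean_err = sum(errs) / trials
bound = (n - 1) * (n * 0.5) * eps  # first-order bound from Example 2
print(mean_err, bound)
```

The observed error is typically far below the bound, reflecting the cancellation of rounding errors that this simple analysis ignores.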
Sheppard's corrections are based on truncating a series derived using the Euler–Maclaurin formula. However, if certain regularity conditions are not satisfied, the error term of the truncation may not converge. The following example illustrates a case where Sheppard's corrections are not particularly accurate, due to a nonzero first derivative at the endpoints of the integration interval. In addition, this example demonstrates how to use a range of our bounds, and how to use symmetry to improve them. As seen in Fig. 4, the exact interaction between (F, rd) and X is actually quite complex. While Sheppard's corrections provide a reasonable zeroth order guess for the relationship between the moments of X and rd(X), in some cases they can be quite inaccurate, so having bounds such as the ones presented in this paper is of interest.

Example 3.
Suppose X is distributed according to the semicircle distribution with parameters r and $\mu$. That is, $f_X(x) = \frac{2}{\pi r^2} \sqrt{r^2 - (x - \mu)^2}$ for $-r \le x - \mu \le r$ and $f_X(x) = 0$ otherwise. We will bound the differences of the means and variances based on increasing amounts of information:

(a) (F, rd) satisfies Assumption 1 with constant $\delta$
(b) rd is 'round to nearest'
(c) F has uniform spacing $2\delta$
(d) $F = \{2\delta z + a : z \in \mathbb{Z}\}$, where $a \in [0, 2\delta]$ is known

Because a is arbitrary, without loss of generality we set $\mu = 0$. We will use the bounds,
$$|\mathbb{E}[\mathrm{rd}(X)] - \mathbb{E}[X]| \le |\mathbb{E}[\mathrm{err}(X)]|, \qquad |\mathbb{V}[\mathrm{rd}(X)] - \mathbb{V}[X]| \le 2\,|\mathbb{E}[X\,\mathrm{err}(X)]| + \mathbb{E}\big[\mathrm{err}(X)^2\big] + \mathbb{E}[\mathrm{err}(X)]^2.$$
The results of our analysis are shown in Fig. 4.

Figure 4: Error bounds when X has the semicircle distribution. The labels of the bounds correspond to the information used to derive each bound; see Example 3. The ranges of $\Delta\mathbb{E}$ and $\Delta\mathbb{V}$ shown are computed by varying a continuously from 0 to $2\delta$ to produce a uniform mesh $F = \{2\delta z + a : z \in \mathbb{Z}\}$.

Suppose that (F, rd) satisfies Assumption 1. Then we can apply Lemma 1 to compute,
$$|\mathbb{E}[\mathrm{err}(X)]| \le \delta, \qquad |\mathbb{E}[X\,\mathrm{err}(X)]| \le \mathbb{E}[|X|] \cdot \delta = \frac{4r}{3\pi} \cdot \delta, \qquad \big|\mathbb{E}\big[\mathrm{err}(X)^2\big]\big| \le \delta^2. \qquad \text{(a)}$$
If we additionally know that rd is 'round to nearest', then we can improve these bounds to quadratic in $\delta$. In particular, using Corollary 1,
$$|\mathbb{E}[\mathrm{err}(X)]| = \left| \int_{\mathbb{R}} \mathrm{err}(x) f_X(x)\,dx \right| \le \left[ \sup_x f_X(x) \right] \cdot \delta^2 = \left[ \frac{2}{\pi r} \right] \cdot \delta^2.$$
(b)

Likewise, but noting that $x f_X(x)$ changes sign (hence the factor of 2),
$$|\mathbb{E}[X\,\mathrm{err}(X)]| = \left| \int_{\mathbb{R}} x\,\mathrm{err}(x) f_X(x)\,dx \right| \le \left[ 2 \sup_x |x f_X(x)| \right] \cdot \delta^2 \le \left[ \frac{2}{\pi} \right] \cdot \delta^2. \qquad \text{(b)}$$
Using only Assumption 1, we already have that $\mathbb{E}[\mathrm{err}(X)^2] = O(\delta^2)$. However, using Corollary 1 we can improve the constant to,
$$\big|\mathbb{E}\big[\mathrm{err}(X)^2\big]\big| = \left| \int_{\mathbb{R}} \mathrm{err}(x)^2 f_X(x)\,dx \right| \le \left[ \frac{1}{3} \int f_X(x)\,dx \right] \cdot \delta^2 + \left[ \frac{4}{3} \sup_x f_X(x) \right] \cdot \delta^3 \le \left[ \frac{1}{3} \right] \cdot \delta^2 + \left[ \frac{8}{3\pi r} \right] \cdot \delta^3. \qquad \text{(b)}$$
If we know that F is uniformly spaced with spacing $2\delta$, then we can use symmetry to improve the constants of the previous bound. Assume that $F = \{2\delta z + a : z \in \mathbb{Z}\}$ and that $a \in [0, 2\delta]$ is possibly unknown. We will decompose $f_X(x)$ as $f_X(x) := g_X(x) + h_X(x)$, where $0 \le g_X(x) \le f_X(x)$ is the part of $f_X(x)$ symmetric about $x = c$. That is,
$$g_X(x) = \min\{ f_X(2c - x), f_X(x) \} = \begin{cases} f_X(2c - x) & x \in [2c - r, c] \\ f_X(x) & x \in [c, r] \\ 0 & \text{otherwise}. \end{cases}$$
Then $\int_{\mathbb{R}} \mathrm{err}(x) g_X(x)\,dx = 0$, so
$$|\mathbb{E}[\mathrm{err}(X)]| = \left| \int_{\mathbb{R}} \mathrm{err}(x) h_X(x)\,dx \right| \le \left[ \sup_x h_X(x) \right] \cdot \delta^2 = \left[ f_X(2c - r) \right] \cdot \delta^2.$$
We have symmetry about any point of the mesh, or any midpoint of the mesh. Thus, the best choice is to take c as near to zero as possible. That is (assuming a is known),
$$c = \begin{cases} a & a \le \delta/2 \\ a - \delta & \delta/2 < a \le 3\delta/2 \\ a - 2\delta & 3\delta/2 < a \le 2\delta. \end{cases} \qquad \text{(d)}$$
Note that in all cases $|c| \le \delta/2$. If a is unknown, we can obtain the bound,
$$|\mathbb{E}[\mathrm{err}(X)]| \le \left[ f_X(\delta - r) \right] \cdot \delta^2. \qquad \text{(c)}$$
In theory, we could apply a similar procedure to improve the constants of our bound for $\mathbb{E}[X\,\mathrm{err}(X)]$ using symmetry. However, this approach yields improvements of at most a constant factor independent of $\delta$, so we omit such analysis for brevity. Finally, if our mesh is uniform, we can use Corollary 2 to provide a two-sided bound. We note the upper bound is the same as our previous upper bound, but we can now provide a corresponding lower bound:
$$\left| \mathbb{E}\big[\mathrm{err}(X)^2\big] - \frac{\delta^2}{3} \right| = \left| \int_{\mathbb{R}} \mathrm{err}(x)^2 f_X(x)\,dx - \frac{\delta^2}{3} \right| \le \left[ \frac{4}{3} \sup_x f_X(x) \right] \cdot \delta^3 = \left[ \frac{8}{3\pi r} \right] \cdot \delta^3. \qquad \text{(c)}$$
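A quick Monte Carlo check of the bounds in Example 3 (a sketch of ours, not from the paper; the parameters r = 1, δ = 0.02 and offset a = 0.013 are arbitrary, and the semicircle is sampled via a scaled Beta(3/2, 3/2) variate):

```python
import math
import random

def rd_mesh(x, delta, a):
    # round to the nearest point of F = {2*delta*z + a : z integer}
    return 2 * delta * round((x - a) / (2 * delta)) + a

random.seed(1)
r, delta, a, n = 1.0, 0.02, 0.013, 400_000
sup_f = 2 / (math.pi * r)  # maximum of the semicircle density

errs = []
for _ in range(n):
    x = r * (2 * random.betavariate(1.5, 1.5) - 1)  # semicircle sample, mu = 0
    errs.append(rd_mesh(x, delta, a) - x)

m1 = sum(errs) / n                 # estimates E[err(X)]
m2 = sum(e * e for e in errs) / n  # estimates E[err(X)^2]
print(abs(m1), sup_f * delta**2)   # compare with the quadratic bound (b)
print(m2, delta**2 / 3)            # compare with the Sheppard-type value
```

The observed $\mathbb{E}[\mathrm{err}(X)]$ sits below the quadratic bound, and $\mathbb{E}[\mathrm{err}(X)^2]$ is close to $\delta^2/3$, as the two-sided bound predicts.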
Floating point number systems

We will consider floating point number systems similar to IEEE 754 [iee19]. This encompasses many emerging number systems, such as bfloat16. For integers m > 0 and $k_{\min} < k_{\max}$, these floating point number systems have $2^m$ uniformly spaced numbers on each interval $[2^i, 2^{i+1})$ for $i = k_{\min}, \ldots, k_{\max} - 1$. This of course leaves a gap between 0 and $2^{k_{\min}}$. Most standards add another $2^m$ uniformly spaced numbers on this interval, commonly referred to as subnormal numbers. A sign bit is then added to include negative numbers. This setup guarantees that the relative spacing between consecutive floating point numbers in the range $[2^{k_{\min}}, 2^{k_{\max}}]$ is no more than $2^{-m}$. Thus, such a system, equipped with the 'round to nearest' scheme, will satisfy Assumption 1 with $\epsilon = 2^{-m-1}$. Define
$$2\delta_{\min} = 2^{k_{\min} - m}, \qquad 2\delta_i = \frac{2^{i+1} - 2^i}{2^m} = 2^{i - m}$$
as the spacings of Assumption 1 on $[0, 2^{k_{\min}})$ and $[2^i, 2^{i+1})$ respectively. We will consider the 'round to nearest' scheme. Then, we can decompose an integral as,
$$\int_{\mathbb{R}} f(x)\,\mathrm{err}(x)^k\,dx = \int_0^{2^{k_{\min}}} f(x)\,\mathrm{err}(x)^k\,dx + \sum_{i = k_{\min}}^{k_{\max} - 1} \int_{2^i}^{2^{i+1}} f(x)\,\mathrm{err}(x)^k\,dx + R, \qquad (4)$$
where R is a remainder term accounting for overflow, given by
$$R = \int_{2^{k_{\max}}}^{\infty} f(x)\,(2^{k_{\max}} - x)^k\,dx,$$
which can often be ignored if $f(x)$ is very small for $x \ge 2^{k_{\max}}$. On each of these intervals (provided $i \le k_{\max} - 1$) our floating point number system is uniformly spaced, and thus we could use symmetry as in Example 3, although this will not be useful if $f(x)$ is monotonic. Otherwise, we note that if a and b are the endpoints of one of the above intervals,
$$\left| \int_a^b f(x)\,\mathrm{err}(x)^k\,dx \right| \le \left[ \frac{2\hat{k}}{k+1} \left( \sup_x f(x) - \inf_x f(x) \right) \right] \cdot (\delta_i)^{k+1} \le \left[ 2^{(k+1)i}\,\frac{2\hat{k}}{k+1} \left( \sup_x f(x) - \inf_x f(x) \right) \right] \cdot \epsilon^{k+1},$$
where $\hat{k}$ is the number of connected regions of local maxima, and the supremum and infimum are taken over $x \in [a, b]$. This follows from the fact that a and b are in our floating point number system, which allows us to apply Lemma 3 to the constant function $\inf_x f(x)$. Then, $\sup_x (f(x) - \inf_x f(x)) = \sup_x f(x) - \inf_x f(x)$.

Example 4.
Suppose X is exponentially distributed with parameter $\lambda$; that is, $f_X(x) = \lambda \exp(-\lambda x)$. We will bound $\mathbb{E}[\mathrm{rd}(X)] - \mathbb{E}[X] = \mathbb{E}[\mathrm{err}(X)]$ when F is a floating point number system and rd is 'round to nearest'. First note that $f_X$ is decreasing, so on any interval $[a, b]$ it has only a single maximum and
$$\sup_x f_X(x) - \inf_x f_X(x) = f_X(a) - f_X(b).$$
Since $f_X$ is decreasing, using symmetry will not help us. Let $a, b \in F$ be such that between a and b, F has uniform spacing $2\delta$. Then,
$$\left| \int_a^b \mathrm{err}(x) f_X(x)\,dx \right| \le \left[ f_X(a) - f_X(b) \right] \cdot \delta^2 = \lambda \left( \exp(-\lambda a) - \exp(-\lambda b) \right) \cdot \delta^2.$$
We can then use this bound on each term of the decomposition Eq. (4). In particular, we have
$$\left| \int_0^{\infty} \mathrm{err}(x) f_X(x)\,dx \right| \le \left| \int_0^{2^{k_{\min}}} \mathrm{err}(x) f_X(x)\,dx \right| + \sum_{i = k_{\min}}^{k_{\max} - 1} \left| \int_{2^i}^{2^{i+1}} \mathrm{err}(x) f_X(x)\,dx \right| + |R|.$$
With $\delta_{\min} = 2^{k_{\min} - m - 1}$ and $\delta_i = 2^{i - m - 1}$, we have
$$\left| \int_0^{2^{k_{\min}}} \mathrm{err}(x) f_X(x)\,dx \right| \le \lambda \left( \exp(0) - \exp(-\lambda 2^{k_{\min}}) \right) \cdot (\delta_{\min})^2,$$
$$\left| \int_{2^i}^{2^{i+1}} \mathrm{err}(x) f_X(x)\,dx \right| \le \lambda \left( \exp(-\lambda 2^i) - \exp(-\lambda 2^{i+1}) \right) \cdot (\delta_i)^2.$$
Given $\lambda$ and the particular parameters of our floating point system, we can now compute a bound for $\mathbb{E}[\mathrm{err}(X)]$. However, if our goal is to determine the necessary precision for some application, we may hope to determine the parameter m, or the corresponding relative precision $\epsilon = 2^{-m-1}$. Factoring this out of each term, we can write
$$|\mathbb{E}[\mathrm{err}(X)]| \le \left[ \lambda 2^{2 k_{\min}} \left( \exp(0) - \exp(-\lambda 2^{k_{\min}}) \right) + \sum_{i = k_{\min}}^{k_{\max} - 1} \left| \lambda 2^{2i} \left( \exp(-\lambda 2^i) - \exp(-\lambda 2^{i+1}) \right) \right| \right] \cdot \epsilon^2 + |R|.$$
This sum can now be bounded independently of $\epsilon$.
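The sketch below (our own toy model, not from the paper) carries out this computation for $\lambda = 1$ with a t-bit mantissa and an unbounded exponent range, so the subnormal and overflow terms are negligible and are folded into a wide range of binades; the empirical mean error is then compared against the bound:

```python
import math
import random

def rd(x, t):
    # toy 'round to nearest' with t significant bits (eps = 2**-(t + 1));
    # the exponent range is unbounded, so subnormals/overflow do not occur
    if x == 0:
        return 0.0
    m, e = math.frexp(x)
    return math.ldexp(round(m * 2**t) / 2**t, e)

lam, t = 1.0, 4
eps = 2.0**-(t + 1)
# bound of Example 4 with eps**2 factored out, summed over binades
# [2**i, 2**(i+1)); the range of i is wide enough that the tails vanish
B = sum(lam * 2.0**(2 * i) * (math.exp(-lam * 2.0**i) - math.exp(-lam * 2.0**(i + 1)))
        for i in range(-60, 60)) * eps**2

random.seed(2)
n = 100_000
total = 0.0
for _ in range(n):
    x = random.expovariate(lam)
    total += rd(x, t) - x
mean_err = total / n
print(mean_err, B)
```

The empirical $|\mathbb{E}[\mathrm{err}(X)]|$ lands within the bound B, which depends on the precision only through the factor $\epsilon^2$.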
If the exact structure of the finite precision numbers is unknown or difficult to work with, then the approaches seen in the above examples will not suffice. This may happen if data are collected from an irregularly placed sensor network, subjected to a known or unknown nonlinear transformation, etc. In such cases, we can return to the more general form of the bounds in Theorem 3. These bounds are derived only from the assumptions that rd is 'round to nearest' and that (F, rd) satisfies Assumption 1.

Example 5.
Suppose rd is 'round to nearest' and (F, rd) satisfies Assumption 1. Let $X \sim \mathcal{N}(\mu, \sigma^2)$, and let m and n be integers, with n odd. We will bound
$$\left| \mathbb{E}\big[ (X - \mu)^m\,\mathrm{err}(X)^n \big] \right|,$$
assuming $\mu > 0$. Define $f(x) := (x - \mu)^m f_X(x)$, where $f_X(x) := \exp(-(x - \mu)^2 / (2\sigma^2)) / \sqrt{2\pi\sigma^2}$ is the density function of X. Using the fact that $f(\mu) = 0$, we can separate the integral as,
$$\left| \int_{\mathbb{R}} f(x)\,\mathrm{err}(x)^n\,dx \right| \le \left| \int_{-\infty}^{\mu} |f(x)|\,\mathrm{err}(x)^n\,dx \right| + \left| \int_{\mu}^{\infty} |f(x)|\,\mathrm{err}(x)^n\,dx \right|,$$
where we use the fact that $f(x)$ does not change sign on $(-\infty, \mu)$ or $(\mu, \infty)$. We note that we could have used symmetry prior to this point, but do not do so for simplicity. We can apply Corollary 1 to both of the right hand side integrals. Note that $|f(x)|$ has even symmetry around $x = \mu$, with local maxima located at $x = \mu \pm \sqrt{m}\sigma$ and a local minimum located at $x = \mu$. Define $x_r := \mu + \sqrt{m}\sigma$ and $x_l := \mu - \sqrt{m}\sigma$, and for $x \ge 0$,
$$\hat{f}_r(x) := \begin{cases} f(x_r) & x < x_r \\ f(x) & x > x_r \end{cases} \qquad \hat{f}_l(x) := \begin{cases} |f(x_l)| & x < |x_l| \\ \max\{ |f(x)|, |f(-x)| \} & x > |x_l|. \end{cases}$$
Observe that $\hat{f}_l(x) \le \hat{f}_r(x)$, so the bound obtained for the integral on $(-\infty, \mu)$ is no greater than the bound obtained for the integral on $(\mu, \infty)$. That is,
$$\int_0^{\infty} x^n \hat{f}_l(x)\,dx \le \int_0^{\infty} x^n \hat{f}_r(x)\,dx.$$
By direct computation,
$$\int_0^{\infty} x^n \hat{f}_r(x)\,dx = \int_0^{x_r} x^n f(x_r)\,dx + \int_{x_r}^{\infty} x^{m+n} f_X(x)\,dx = \frac{1}{n+1} (x_r)^{n+1} f(x_r) + \frac{2^{m/2 - 1}}{\sqrt{\pi}}\,\Gamma\!\left( \frac{m+1}{2}, \frac{m}{2} \right).$$
Thus,
$$\left| \mathbb{E}\big[ (X - \mu)^m\,\mathrm{err}(X)^n \big] \right| \le [\,n_{m,n}\,] \cdot \epsilon^{n+1}, \qquad \text{where} \qquad n_{m,n} := \frac{2}{n+1} (\mu + \sqrt{m}\sigma)^{n+1} (\sqrt{m}\sigma)^m f_X(\mu + \sqrt{m}\sigma) + \frac{2^{m/2}}{\sqrt{\pi}}\,\Gamma\!\left( \frac{m+1}{2}, \frac{m}{2} \right).$$
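As a numerical illustration of these Corollary 1 style bounds in the Gaussian setting (a sketch of ours: we take a uniform mesh $F = \{2\delta z + a\}$, evaluate $\mathbb{E}[\mathrm{err}(X)]$ and $\mathbb{E}[(X - \mu)\,\mathrm{err}(X)]$ by quadrature, and compare with the single-maximum bound and the factor-of-2 sign-split bound used in Example 3; all parameter values are arbitrary):

```python
import math

mu, sigma, delta, a = 0.3, 1.0, 0.05, 0.02

def err(x):
    # signed rounding error for the mesh F = {2*delta*z + a : z integer}
    return 2 * delta * round((x - a) / (2 * delta)) + a - x

def f_X(x):
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)

# midpoint quadrature aligned with the mesh, so that no subinterval
# straddles a discontinuity of err
h = 2 * delta / 400
lo = a + 2 * delta * round((mu - 8 * sigma - a) / (2 * delta))
N = int(16 * sigma / h)
I0 = I1 = 0.0
for j in range(N):
    x = lo + (j + 0.5) * h
    I0 += err(x) * f_X(x) * h              # ~ E[err(X)]
    I1 += (x - mu) * err(x) * f_X(x) * h   # ~ E[(X - mu) err(X)]

bound0 = f_X(mu) * delta**2                      # [sup f_X] * delta^2
bound1 = 2 * sigma * f_X(mu + sigma) * delta**2  # [2 sup |(x - mu) f_X(x)|] * delta^2
print(abs(I0), bound0)
print(abs(I1), bound1)
```

For this smooth density both integrals are far below their bounds, consistent with the Euler–Maclaurin cancellation discussed in connection with Sheppard's corrections.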
Two natural directions for future work are (i) the application of our bounds to numerical algorithms, and (ii) the generalization or improvement of our bounds. In Section 5.2 we showed how the simple bounds from Section 3 can be applied to sums of random variables. This approach does not provide particularly strong bounds, since it does not leverage any cancellation of rounding errors. Even so, it is simple and provides a way to obtain statistical error bounds for algorithms run on random data. For instance, Chebyshev's inequality or other concentration inequalities could be applied to Example 2 to bound the probability of the finite precision sum differing greatly from the deterministic sum. It should be fairly straightforward to apply a similar approach to other algorithms, which may help identify parts of algorithms that can be done in lower precision. For instance, the convergence of numerical methods for stochastic differential equations is well studied in terms of the timestep, but the interplay between convergence and the precision used is not well understood. If the effects of rounding errors in particular steps, such as sampling the noise term, are shown to be small, then perhaps those steps can be done in reduced precision with a provably small effect on convergence.

In Section 4 we give higher order bounds which take into account cancellation of rounding errors. The analysis in that section is not suited to discrete random variables, whose density functions consist of many discrete delta masses. However, it seems plausible that in many cases similar higher order bounds could be obtained. For instance, suppose $X_1$ and $X_2$ are supported on F. Then $X_1 + X_2$ is supported on $\{f_1 + f_2 : f_1, f_2 \in F\}$. If F is IEEE 754 floating point arithmetic, then the relation between these two supporting sets is straightforward and can be exploited. In particular, the sum of many floating point numbers can be computed exactly, and even when the sums cannot be computed exactly, the probability that a sample is rounded up will again be close to the probability it is rounded down (given the correct niceness conditions).

This paper provides a general framework for analyzing the effects of rounding and other perturbations on random variables. We provide bounds for the mean absolute difference of the original and perturbed random variables, as well as bounds on the differences of the moments of these variables. We additionally provide examples of how our bounds can be used. In particular, we show how our bounds can be used to design measurement devices which balance the cost of precise measurements with the number of samples needed to resolve certain statistical properties, and we show how our bounds can be used to generalize standard results in numerical analysis.
Acknowledgments
This material is based upon work supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE-1762114. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the National Science Foundation. The author thanks Yu-Chen Cheng, Matt Lorig, Tom Trogdon, and Ying-Jen Yang for discussions and comments.
References

[BZZH09] Zhidong Bai, Shurong Zheng, Baoxue Zhang, and Guorong Hu. Statistical analysis for rounded data. Journal of Statistical Planning and Inference, 139(8):2526–2542, August 2009.
[CMN98] Surajit Chaudhuri, Rajeev Motwani, and Vivek Narasayya. Random sampling for histogram construction: How much is enough? In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, SIGMOD '98, New York, NY, USA, 1998. ACM.
[DMOT14] Percy A. Deift, Govind Menon, Sheehan Olver, and Thomas Trogdon. Universality in numerical computations with random data. Proceedings of the National Academy of Sciences, 111(42):14973–14978, 2014.
[FD81] David Freedman and Persi Diaconis. On the histogram as a density estimator: L2 theory. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 57(4):453–476, December 1981.
[GAGN15] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In International Conference on Machine Learning, pages 1737–1746, 2015.
[Hig02] Nicholas J. Higham. Accuracy and Stability of Numerical Algorithms. Society for Industrial and Applied Mathematics, second edition, 2002.
[HM19] Nicholas J. Higham and Théo Mary. A new approach to probabilistic rounding error analysis. SIAM Journal on Scientific Computing, 41(5):A2815–A2835, 2019.
[HMT11] Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.
[iee19] IEEE Standard for Floating-Point Arithmetic, July 2019.
[Jan06] Svante Janson. Rounding of continuous random variables and oscillatory asymptotics. The Annals of Probability, 34(5):1807–1826, 2006.
[Kar16] Charles F. F. Karney. Sampling exactly from the normal distribution. ACM Transactions on Mathematical Software (TOMS), 42(1):3, 2016.
[Knu19] Kevin H. Knuth. Optimal data-based binning for histograms and histogram-based probability density models. Digital Signal Processing, 95:102581, December 2019.
[KP92] Peter E. Kloeden and Eckhard Platen. Numerical Solution of Stochastic Differential Equations. Springer Berlin Heidelberg, 1992.
[KW09] Malvin H. Kalos and Paula A. Whitlock. Monte Carlo Methods. John Wiley & Sons, 2009.
[Mon79] John F. Monahan. Extensions of von Neumann's method for generating random variables. Mathematics of Computation, 33(147):1065–1069, 1979.
[Mon85] John F. Monahan. Accuracy in random number generation. Mathematics of Computation, 45(172):559–568, 1985.
[Pea95] Karl Pearson. Contributions to the mathematical theory of evolution. II. Skew variation in homogeneous material. Philosophical Transactions of the Royal Society of London. A, 186:343–414, 1895.
[She97] W. F. Sheppard. On the calculation of the most probable values of frequency-constants, for data arranged according to equidistant division of a scale. Proceedings of the London Mathematical Society, s1-29(1):353–380, 1897.
[SKA10] Hans Schneeweiss, John Komlos, and Amar Ahmad. Symmetric and asymmetric rounding: A review and some new results. AStA Advances in Statistical Analysis, 94:247–271, 2010.
[Ste73] Gilbert W. Stewart. Introduction to Matrix Computations. Elsevier, 1973.
[Tri90] A. R. Tricker. The effect of rounding on the significance level of certain normal test statistics. Journal of Applied Statistics, 17(1):31–38, January 1990.
[UU17] N. G. Ushakov and V. G. Ushakov. Statistical analysis of rounded data: Recovering of information lost due to rounding. Journal of the Korean Statistical Society, 46(3):426–437, 2017.
[vN51] John von Neumann. Various techniques used in connection with random digits. In A. S. Householder, G. E. Forsythe, and H. H. Germond, editors, Monte Carlo Method, volume 12 of National Bureau of Standards Applied Mathematics Series, chapter 13, pages 36–38. US Government Printing Office, Washington, DC, 1951.
[VSM11] Vincent Vanhoucke, Andrew Senior, and Mark Z. Mao. Improving the speed of neural networks on CPUs. In Deep Learning and Unsupervised Feature Learning Workshop, NIPS 2011, 2011.
[WCB+18] Naigang Wang, Jungwook Choi, Daniel Brand, Chia-Yu Chen, and Kailash Gopalakrishnan. Training deep neural networks with 8-bit floating point numbers. In Advances in Neural Information Processing Systems, pages 7675–7684, 2018.
[Wil63] James Hardy Wilkinson. Rounding Errors in Algebraic Processes. Prentice Hall Inc., 1963.
[Wil05] Peter-Th. Wilrich. Rounding of measurement values or derived values. Measurement, 37(1):21–30, January 2005.
Proofs
We provide proofs omitted in the main sections of this paper. We typically prove only the results corresponding to the multiplicative bound (Assumption 1i.), as the proofs for the additive bound (Assumption 1ii.) follow easily from the same general approach.
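The two integral identities at the heart of the proof of Lemma 3 below can be checked numerically. In this sketch (ours; the mesh points 0.75 and 0.875 and the quadrature resolution are arbitrary choices), the integral of $\mathrm{err}(x)^n$ between consecutive mesh points vanishes for odd n, while $\int |\mathrm{err}(x)|^n\,dx$ equals $1/(n+1)$ times the integral of the constant (half spacing)$^n$:

```python
lo_pt, hi_pt = 0.75, 0.875  # two consecutive points of a toy mesh
half = (hi_pt - lo_pt) / 2

def err(x):
    # 'round to nearest' error between the two mesh points
    return (lo_pt - x) if x < lo_pt + half else (hi_pt - x)

N = 100_000
h = (hi_pt - lo_pt) / N
results = {}
for n in (1, 3, 5):
    # midpoint quadrature of err(x)**n and |err(x)|**n over the cell
    odd_int = sum(err(lo_pt + (j + 0.5) * h) ** n for j in range(N)) * h
    abs_int = sum(abs(err(lo_pt + (j + 0.5) * h)) ** n for j in range(N)) * h
    exact = (hi_pt - lo_pt) * half ** n / (n + 1)
    results[n] = (odd_int, abs_int, exact)
    print(n, odd_int, abs_int, exact)
```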
Proof of Lemma 1 part 2.
Let $g_X$ and $h_X$ be as in the statement. Suppose that $m + n$ is odd, so that $x^m \mathrm{err}(x)^n$, and therefore $x^m \mathrm{err}(x)^n g_X(x)$, are odd functions. Then, using that $\int_{\mathbb{R}} x^m \mathrm{err}(x)^n g_X(x)\,dx = 0$,
$$\left| \mathbb{E}\big[ X^m \mathrm{err}(X)^n \big] \right| \le \left[ \left| \int_{\mathbb{R}} |x|^{m+n} h_X(x)\,dx \right| \right] \cdot \epsilon^n.$$
Finally, note that,
$$\int_{\mathbb{R}} |x|^{m+n} h_X(x)\,dx = \int_0^{\infty} x^{m+n} h_X(x)\,dx - \int_{-\infty}^{0} x^{m+n} h_X(x)\,dx.$$
The result follows by observing that $\int_{\mathbb{R}} x^{m+n} g_X(x)\,dx = 0$, since $x^{m+n}$ is odd.

Proof of Lemma 3.
Suppose c ∈ [rd( a ) , rd( b )]. On the interval [ (cid:98) c (cid:99) , (cid:100) c (cid:101) ] we have,err( x ) = (cid:40) (cid:98) c (cid:99) − x x < ( (cid:98) c (cid:99) + (cid:100) c (cid:101) ) (cid:100) c (cid:101) − x x > ( (cid:98) c (cid:99) + (cid:100) c (cid:101) )Thus, assuming that n is odd, (cid:90) (cid:100) c (cid:101)(cid:98) c (cid:99) err( x ) n d x = 0 . That is to say, the integral of err( x ) n is zero between any consecutive finite precision numberes iszero provided n is odd. In such cases, by induction, (cid:90) rd( b )rd( a ) err( x ) n d x = 0 . Similarly, we can directly compute, (cid:90) (cid:100) c (cid:101)(cid:98) c (cid:99) | err( x ) | n d x = 1 n + 1 (cid:90) (cid:100) c (cid:101)(cid:98) c (cid:99) (cid:18) (cid:100) c (cid:101) − (cid:98) c (cid:99) (cid:19) n d x. Since ( F , rd) satisfies Assumption 1, we have that (cid:100) c (cid:101) − (cid:98) c (cid:99) (cid:12)(cid:12)(cid:12)(cid:12) err (cid:18) (cid:98) c (cid:99) + (cid:100) c (cid:101) (cid:19) (cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:15) (cid:12)(cid:12)(cid:12)(cid:12) (cid:98) c (cid:99) + (cid:100) c (cid:101) (cid:12)(cid:12)(cid:12)(cid:12) so that, (cid:90) (cid:100) c (cid:101)(cid:98) c (cid:99) (cid:18) (cid:100) c (cid:101) − (cid:98) c (cid:99) (cid:19) n d x ≤ (cid:34)(cid:90) (cid:100) c (cid:101)(cid:98) c (cid:99) (cid:12)(cid:12)(cid:12)(cid:12) (cid:98) c (cid:99) + (cid:100) c (cid:101) (cid:12)(cid:12)(cid:12)(cid:12) n d x (cid:35) · (cid:15) n ≤ (cid:34)(cid:90) (cid:100) c (cid:101)(cid:98) c (cid:99) | x | n d x (cid:35) · (cid:15) n where the right inequality can be observed by direct computation or noting that | x | n is convex.Thus, by induction, we conclude that, (cid:90) (cid:98) b (cid:99)(cid:100) a (cid:101) | err( x ) | n d x ≤ (cid:34) n + 1 (cid:90) (cid:98) b (cid:99)(cid:100) a (cid:101) | x | n d x (cid:35) · (cid:15) n ≤ (cid:34) n + 1 (cid:90) ba | x | n d x (cid:35) · (cid:15) n . roof of Theorem 3 (‘round to nearest’). 
The general approach is as follows. We bound integrals onthe interval [ a, b ] by bounding a main contribution between two finite precision numbers near a and b , and then individually bounding two smaller integrals near the endpoints. We prove the secondpart of this Theorem first as it is somewhat simpler and will illustrate our general approach.By Lemma 3 we have that when n is odd, (cid:90) (cid:100) c (cid:101)(cid:98) c (cid:99) err( x ) n d x = 0 . We now bound the contributions near the endpoints. Note that on [ c, rd( c )], err( x ) = rd( c ) − x ,and by Assumption 1, | rd( c ) − c | ≤ (cid:15) | c | . Thus,0 ≤ (cid:90) rd( c ) c err( x ) n d x = (cid:90) rd( c ) c (rd( c ) − x ) n d x = (rd( c ) − c ) n +1 n + 1 ≤ (cid:20) | c | n +1 n + 1 (cid:21) · (cid:15) n +1 . Applying this bound to each endpoint we have, (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:90) ba err( x ) n d x (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:90) rd( b )rd( a ) err( x ) n d x + (cid:90) rd( a ) a err( x ) n d x + (cid:90) b rd( b ) err( x ) n d x (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:90) rd( a ) a err( x ) n d x − (cid:90) rd( a ) b err( x ) n d x (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:20) max( | a | , | b | ) n +1 n + 1 (cid:21) · (cid:15) n +1 . y x d c eb c c | err( x ) | n (cid:16) d c e−b c c (cid:17) n (cid:16) b c c + d c e (cid:17) n (cid:15) n | x | n (cid:15) n Figure 5: Note that the contribution of the integral of the n -th power of absolute error functionover the interval [ (cid:98) c (cid:99) , (cid:100) c (cid:101) ] is at most 1 / ( n + 1) the area of the integral of the constant function( (cid:100) c (cid:101) − (cid:98) c (cid:99) ) n / n over this interval, which is itself smaller than the integral of (cid:15) n | x | n over this interval.We now prove the first part of this Theorem in a similar matter. 
Again, by Lemma 3 we have,
\[ \int_{\lceil a\rceil}^{\lfloor b\rfloor} |\operatorname{err}(x)|^n \,\mathrm{d}x \leq \left[ \frac{1}{n+1} \int_a^b |x|^n \,\mathrm{d}x \right] \cdot \epsilon^n. \]
We now aim to bound the integrals near the endpoints. Without loss of generality, assume $c > 0$, so that $\lfloor c\rfloor \leq c$ and $\lceil c\rceil \leq c(1+\epsilon)/(1-\epsilon)$. Then, with $\beta = 1/(1-\epsilon)$,
\[ \frac{\lceil c\rceil - \lfloor c\rfloor}{2} \leq \left( \frac{\lceil c\rceil + \lfloor c\rfloor}{2} \right) \epsilon \leq \left( c + \frac{c\epsilon}{1-\epsilon} \right) \epsilon = \beta c \epsilon. \]
(We note that $c$ may be larger than $\mathrm{rd}(c)$, but use the notation $[c, \mathrm{rd}(c)]$ to indicate the direction of the integral.) Hence, for the cell containing $c$, we have that,
\[ \int_{\lfloor c\rfloor}^{\lceil c\rceil} |\operatorname{err}(x)|^n \,\mathrm{d}x = \frac{1}{n+1} \int_{\lfloor c\rfloor}^{\lceil c\rceil} \left( \frac{\lceil c\rceil - \lfloor c\rfloor}{2} \right)^{\!n} \mathrm{d}x \leq \left[ \frac{2|c|^{n+1}}{n+1} \right] \cdot (\beta\epsilon)^{n+1}. \]
Therefore, similar to before, we have,
\[ \int_a^b |\operatorname{err}(x)|^n \,\mathrm{d}x = \int_{\lceil a\rceil}^{\lfloor b\rfloor} |\operatorname{err}(x)|^n \,\mathrm{d}x + \int_a^{\lceil a\rceil} |\operatorname{err}(x)|^n \,\mathrm{d}x + \int_{\lfloor b\rfloor}^b |\operatorname{err}(x)|^n \,\mathrm{d}x \leq \left[ \frac{1}{n+1} \int_a^b |x|^n \,\mathrm{d}x \right] \cdot \epsilon^n + \left[ \frac{2}{n+1} \left( |a|^{n+1} + |b|^{n+1} \right) \right] \cdot (\beta\epsilon)^{n+1}. \]

Proof of Corollary 1.
We prove the $n$ odd case first, and seek a bound on
\[ \left| \int_{\mathbb{R}} f(x) \operatorname{err}(x)^n \,\mathrm{d}x \right|. \]
Theorem 3 gives us a bound for integrals of $\operatorname{err}(x)^n$,
\[ \left| \int_a^b \operatorname{err}(x)^n \,\mathrm{d}x \right| \leq \left[ \frac{\max(|a|, |b|)^{n+1}}{n+1} \right] \cdot \epsilon^{n+1} =: G(a, b). \]
By assumption, since $f$ is lower-semicontinuous and has a single local maximum (or connected region of local maxima),
\[ \{x : f(x) > u\} = \left( \inf\{x : f(x) > u\}, \; \sup\{x : f(x) > u\} \right) \]
so that, in the notation of Lemma 2,
\[ \mu(\{x : f(x) > u\}) = G\!\left( \inf\{x : f(x) > u\}, \; \sup\{x : f(x) > u\} \right) \leq \frac{\epsilon^{n+1}}{n+1} \hat{f}^-(u)^{n+1}. \]
Therefore, applying Lemma 2,
\[ \left| \int_{\mathbb{R}} f(x) \operatorname{err}(x)^n \,\mathrm{d}x \right| \leq \left| \int_{\mathbb{R}} \mu(\{x : f(x) > u\}) \,\mathrm{d}u \right| \leq \frac{\epsilon^{n+1}}{n+1} \left| \int_{\mathbb{R}} \hat{f}^-(u)^{n+1} \,\mathrm{d}u \right|. \]
By drawing a picture, it is clear that,
\[ \left| \int_{\mathbb{R}} \hat{f}^-(u)^{n+1} \,\mathrm{d}u \right| = \left| \int_{\mathbb{R}} \hat{f}(z^{1/(n+1)}) \,\mathrm{d}z \right|. \]
Finally, through the substitution $x = z^{1/(n+1)}$,
\[ \int_{\mathbb{R}} \hat{f}(z^{1/(n+1)}) \,\mathrm{d}z = (n+1) \int x^n \hat{f}(x) \,\mathrm{d}x. \]
The first case follows from the observation that $|a|^{n+1} + |b|^{n+1} \leq 2\max(|a|, |b|)^{n+1}$. We note that a somewhat improved bound could be obtained using $|a|^{n+1} + |b|^{n+1}$ directly, but the improvement would be at most a factor of 2.

Proof of Corollary 2.
This follows from the proof of Theorem 3 and by noting that,
\[ \int_a^b \operatorname{err}(x)^n \,\mathrm{d}x = \int_{\lfloor a\rfloor}^{\lceil b\rceil} \operatorname{err}(x)^n \,\mathrm{d}x + \int_a^{\lfloor a\rfloor} \operatorname{err}(x)^n \,\mathrm{d}x + \int_{\lceil b\rceil}^b \operatorname{err}(x)^n \,\mathrm{d}x. \]

Proof of Theorem 3' (`stochastic rounding').
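Before giving the argument, the per-cell moment computations it relies on can be verified numerically. In the sketch below (the cell width $h$ is an arbitrary illustrative value), a point at offset $t$ into a cell of width $h$ rounds up with probability $t/h$ and down otherwise, so $\mathbb{E}_{\mathrm{rd}}[\operatorname{err}(x)^k]$ and $\mathbb{E}_{\mathrm{rd}}[|\operatorname{err}(x)|^k]$ have closed forms that can be integrated over the cell; direct integration gives the coefficients $2/((k+1)(k+2))$ and, for odd $k$, $2(1 - 2^{-(k+1)}(k+3))/((k+1)(k+2))$.

```python
# Per-cell moments of stochastic rounding on a cell of width h
# (h is an arbitrary illustrative choice).
h = 0.5

def mean_abs_err_k(t, k):
    """E[|err|^k] at offset t into the cell, under stochastic rounding."""
    return (h - t) ** k * (t / h) + t ** k * (h - t) / h

def mean_signed_err_k(t, k):
    """E[err^k] at offset t into the cell, under stochastic rounding."""
    return (h - t) ** k * (t / h) + (-t) ** k * (h - t) / h

def integrate(f, a, b, m=100_000):
    """Midpoint Riemann sum of f over [a, b]."""
    w = (b - a) / m
    return sum(f(a + (i + 0.5) * w) for i in range(m)) * w

# k = 1: stochastic rounding is unbiased, so E[err] vanishes pointwise.
assert all(abs(mean_signed_err_k(t, 1)) < 1e-12 for t in (0.1, 0.2, 0.4))

for k in (1, 2, 3):   # integral of E[|err|^k] over the cell
    exact = 2 * h ** (k + 1) / ((k + 1) * (k + 2))
    approx = integrate(lambda t: mean_abs_err_k(t, k), 0.0, h)
    assert abs(approx - exact) < 1e-8

for k in (1, 3, 5):   # odd k: integral of |E[err^k]| over the cell
    exact = 2 * (1 - 2.0 ** -(k + 1) * (k + 3)) * h ** (k + 1) / ((k + 1) * (k + 2))
    approx = integrate(lambda t: abs(mean_signed_err_k(t, k)), 0.0, h)
    assert abs(approx - exact) < 1e-8
```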
We follow the proof of Theorem 3 closely. As before, when $k$ is odd there is no contribution from $\mathbb{E}_{\mathrm{rd}}[\operatorname{err}(x)^k]$ between consecutive finite precision numbers. Note that on $[\lfloor c\rfloor, \lceil c\rceil]$, $|\mathbb{E}_{\mathrm{rd}}[\operatorname{err}(x)^k]|$ has even symmetry about $(\lfloor c\rfloor + \lceil c\rceil)/2$. Moreover, $|\mathrm{rd}(c) - c| \leq \lceil c\rceil - \lfloor c\rfloor$, so
\[ \left| \int_c^{\mathrm{rd}(c)} \mathbb{E}_{\mathrm{rd}}[\operatorname{err}(x)^k] \,\mathrm{d}x \right| \leq \int_{\lfloor c\rfloor}^{\lceil c\rceil} \left| \mathbb{E}_{\mathrm{rd}}[\operatorname{err}(x)^k] \right| \mathrm{d}x = 2 \int_{\lfloor c\rfloor}^{(\lfloor c\rfloor + \lceil c\rceil)/2} \left| \mathbb{E}_{\mathrm{rd}}[\operatorname{err}(x)^k] \right| \mathrm{d}x. \]
Now note that since $(\mathcal{F}, \mathrm{rd})$ satisfies Assumption 1,
\[ \lceil c\rceil - \lfloor c\rfloor \leq \epsilon \min\{ |\lfloor c\rfloor|, |\lceil c\rceil| \} \leq \epsilon |c|. \tag{5} \]
We can directly compute,
\[ \int_{\lfloor c\rfloor}^{\lceil c\rceil} \left| \mathbb{E}_{\mathrm{rd}}[\operatorname{err}(x)^k] \right| \mathrm{d}x = \frac{2\left(1 - 2^{-(k+1)}(k+3)\right)}{(k+1)(k+2)} \int_{\lfloor c\rfloor}^{\lceil c\rceil} (\lceil c\rceil - \lfloor c\rfloor)^k \,\mathrm{d}x \]
so that,
\[ \left| \int_c^{\mathrm{rd}(c)} \mathbb{E}_{\mathrm{rd}}[\operatorname{err}(x)^k] \,\mathrm{d}x \right| \leq \left[ \frac{2\left(1 - 2^{-(k+1)}(k+3)\right)}{(k+1)(k+2)} |c|^{k+1} \right] \cdot \epsilon^{k+1}. \]
Similarly, we can directly compute,
\[ \int_{\lfloor c\rfloor}^{\lceil c\rceil} \mathbb{E}_{\mathrm{rd}}\!\left[ |\operatorname{err}(x)|^k \right] \mathrm{d}x = \frac{2}{(k+1)(k+2)} \int_{\lfloor c\rfloor}^{\lceil c\rceil} (\lceil c\rceil - \lfloor c\rfloor)^k \,\mathrm{d}x \]
so that,
\[ \int_{\lfloor c\rfloor}^{\lceil c\rceil} \mathbb{E}_{\mathrm{rd}}\!\left[ |\operatorname{err}(x)|^k \right] \mathrm{d}x \leq \left[ \frac{2}{(k+1)(k+2)} \int_{\lfloor c\rfloor}^{\lceil c\rceil} |x|^k \,\mathrm{d}x \right] \cdot \epsilon^k. \]
Finally, since $[\lfloor c\rfloor, c], [c, \lceil c\rceil] \subseteq [\lfloor c\rfloor, \lceil c\rceil]$,
\[ \max\left( \int_{\lfloor c\rfloor}^c \mathbb{E}_{\mathrm{rd}}\!\left[ |\operatorname{err}(x)|^k \right] \mathrm{d}x, \; \int_c^{\lceil c\rceil} \mathbb{E}_{\mathrm{rd}}\!\left[ |\operatorname{err}(x)|^k \right] \mathrm{d}x \right) \leq \left[ \frac{2}{(k+1)(k+2)} |c|^{k+1} \right] \cdot \epsilon^{k+1}. \]
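As a final numerical aside, the substitution step from the proof of Corollary 1 can be checked with a concrete, illustrative choice of $\hat{f}$: for $\hat{f}(x) = e^{-x}$ on $[0, \infty)$, both $\int_0^\infty \hat{f}(z^{1/(n+1)})\,\mathrm{d}z$ and $(n+1)\int_0^\infty x^n \hat{f}(x)\,\mathrm{d}x$ equal $\Gamma(n+2) = (n+1)!$.

```python
import math

# Check the substitution x = z**(1/(n+1)) from the proof of Corollary 1
# with the illustrative choice fhat(x) = exp(-x) on [0, inf). Truncation
# points and grid sizes are arbitrary choices that keep quadrature and
# truncation error small relative to the tolerance.

def integrate(f, a, b, m=200_000):
    """Midpoint Riemann sum of f over [a, b]."""
    w = (b - a) / m
    return sum(f(a + (i + 0.5) * w) for i in range(m)) * w

n = 2
lhs = integrate(lambda z: math.exp(-z ** (1.0 / (n + 1))), 0.0, 4000.0)
rhs = (n + 1) * integrate(lambda x: x ** n * math.exp(-x), 0.0, 60.0)
assert abs(lhs - math.factorial(n + 1)) < 1e-2   # both sides ~ 3! = 6
assert abs(rhs - math.factorial(n + 1)) < 1e-2
assert abs(lhs - rhs) < 2e-2
```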