Dither computing: a hybrid deterministic-stochastic computing framework
Chai Wah Wu
IBM T.J. Watson Research Center
Yorktown Heights, NY 10598, [email protected]

February 21, 2021
Abstract—Stochastic computing has a long history as an alternative method of performing arithmetic on a computer. While it can be considered an unbiased estimator of real numbers, it has a variance and MSE on the order of $\Omega(N^{-1})$. On the other hand, deterministic variants of stochastic computing remove the stochastic aspect, but cannot approximate arbitrary real numbers with arbitrary precision and are biased estimators. However, they have an asymptotically superior MSE on the order of $O(N^{-2})$. Recent results in deep learning with stochastic rounding suggest that the bias in the rounding can degrade performance. We propose an alternative framework, called dither computing, that combines aspects of stochastic computing and its deterministic variants, can perform computing with similar efficiency, is unbiased, and has a variance and MSE also on the optimal order of $\Theta(N^{-2})$. We also show that it can be beneficial in stochastic rounding applications. We provide implementation details and give experimental results to comparatively show the benefits of the proposed scheme.

I. INTRODUCTION
Stochastic computing [1]–[4] has a long history and is an alternative framework for performing computer arithmetic using stochastic pulses. It can approximate arbitrary real numbers and perform arithmetic on them to the correct value in expectation, but the stochastic nature means that the result is not accurate each time. Recently, Ref. [5] suggested that deterministic variants of stochastic computing can be just as efficient, and do not have the random errors introduced by the random nature of the pulses. Nevertheless, in such deterministic variants the finiteness of the scheme implies that they cannot approximate general real numbers with arbitrary precision. This paper proposes a framework that combines these two approaches to get the best of both worlds, inheriting some of the best properties of each scheme. In the process, we also provide a more complete probabilistic analysis of these schemes. In addition to considering both the first moment of the approximation error (e.g. average error) and the variance of the representation, we also consider the set of real numbers that are represented and processed to be drawn from an independent distribution. This allows us to provide a more complete picture of the tradeoffs among the bias, the variance of the approximation, and the number of pulses, along with the prior distribution of the data.

II. REPRESENTATION OF REAL NUMBERS VIA SEQUENCES
We consider two independent random variables $X$, $Y$ with support in the unit interval $[0,1]$. A common assumption is that $X$ and $Y$ are uniformly distributed. The interpretation is that $X$ and $Y$ generate the real numbers that we want to perform arithmetic on. In order to represent a sample $x \in [0,1]$ from $X$, the main idea of stochastic computing (and other representations such as unary coding [6]) is to use a sequence of $N$ binary pulses. In particular, $x$ is represented by a sequence of $N$ independent Bernoulli trials $X_i$. We estimate $x$ via $X_s = \frac{1}{N}\sum_{i=1}^{N} X_i$. Our standing assumption is that $X$, $Y$, $X_i$ and $Y_i$ are all independent. We are interested in how well $X_s$ approximates a sample $x$ of $X$. In particular, we define $L_x = E((X_s - x)^2 \mid X = x)$ and are interested in the expected mean squared error (EMSE) defined as $L = E_X(L_x)$. Note that $L_x$ consists of two components, bias and variance, and the bias-variance decomposition [7] is given by $L_x = \mathrm{Bias}(X_s, x)^2 + \mathrm{Var}(X_s)$, where $\mathrm{Bias}(X_s, x) = E(X_s) - x$. The following result gives a lower bound on the EMSE:

Theorem 2.1: $L \geq \frac{1}{N^2} \int p_X(x)\,(Nx - \mathrm{round}(Nx))^2\, dx$.

Proof: Follows from the fact that $X_s$ is a rational number with denominator $N$ and thus $|X_s - x| \geq \frac{1}{N}\,|Nx - \mathrm{round}(Nx)|$. □

Given the standard assumption that $X$ is uniformly distributed, this implies that $L \geq \frac{1}{N^2}\int_0^1 (Nx - \mathrm{round}(Nx))^2\, dx = \frac{1}{12N^2}$, i.e. the EMSE cannot decrease faster than $\Theta(N^{-2})$. In the next sections, we analyze how well $X_s$ approximates samples of $X$ asymptotically as $N \to \infty$ by analyzing the error $L$ for various variants of stochastic computing.
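To make these quantities concrete, the following Monte Carlo sketch (our own illustration, not from the paper) estimates the EMSE $L$ for an arbitrary pulse-generation rule; the function name `estimate_emse` and the placeholder `gen_pulses` are hypothetical stand-ins for the schemes analyzed below.

```python
import numpy as np

def estimate_emse(gen_pulses, N, num_x=2000, trials=200, seed=0):
    """Monte Carlo estimate of L = E_X[E((X_s - x)^2 | X = x)] for a
    pulse-generation rule gen_pulses(x, N, rng) -> array of N bits."""
    rng = np.random.default_rng(seed)
    xs = rng.uniform(0.0, 1.0, size=num_x)      # samples of X ~ U[0, 1]
    sq_err = 0.0
    for x in xs:
        for _ in range(trials):
            X_s = gen_pulses(x, N, rng).sum() / N   # the estimator X_s
            sq_err += (X_s - x) ** 2
    return sq_err / (num_x * trials)
```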
A. Stochastic computing

A detailed survey of stochastic computing can be found in [2]. We give here a short description of the unipolar format. Using the notation above, the $X_i$ are chosen to be iid Bernoulli trials with $p(X_i = 1) = x$. Then $E(X_s) = x$ and $X_s$ is an unbiased estimator of $x$. Since $\mathrm{Bias}(X_s, x) = 0$ and $\mathrm{Var}(X_s) = \frac{x(1-x)}{N}$, i.e. $\mathrm{Var}(X_s) = \Omega(N^{-1})$, we have $L_x = \Omega(N^{-1})$ for $x \in (0,1)$. More specifically, if $X$ has a uniform distribution on $[0,1]$, then $L = \frac{1}{N}\int_0^1 x(1-x)\, dx = \frac{1}{6N}$.

B. A deterministic variant of stochastic computing

In [5], deterministic variants of stochastic computing are proposed. Several approaches such as clock dividing and relative prime encoding are introduced and studied. One of the benefits of a deterministic algorithm is the lack of randomness, i.e. the representation of $x$ via $X_i$ does not change between runs and $\mathrm{Var}(X_s) = 0$. However, the bias term $\mathrm{Bias}(X_s, x)$ can be nonzero. Because $x$ is represented by counting the number of 1's in $X_i$, such a scheme can only represent fractions with denominator $N$. For $x = \frac{m}{2N}$ where $m$ is odd, the error is $|X_s - x| = \frac{1}{2N}$. This means that for such values of $x$, $L_x = \mathrm{Bias}(X_s, x)^2 + \mathrm{Var}(X_s) = \frac{1}{4N^2} = O(N^{-2})$. If $X$ is a discrete random variable with support only on the rational points $\frac{m}{N}$ for integers $0 \leq m \leq N$, then $L = 0$. However, in practice we want to represent arbitrary real numbers in $[0,1]$. Assume that $X$ is uniformly distributed on $[0,1]$. By symmetry, we only need to analyze $x \in [0, \frac{1}{2N}]$. Then $|X_s - x| = x$ and $L_x = x^2$. It follows that $L = 2N\int_0^{1/(2N)} x^2\, dx = \frac{1}{12N^2} = O(N^{-2})$.
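For illustration, the two schemes of Sections II-A and II-B can be realized as pulse generators for the sketch above (again our own hedged rendering, not the authors' code):

```python
def stochastic_pulses(x, N, rng):
    """Unipolar stochastic computing (Sec. II-A): N iid Bernoulli(x)
    pulses; unbiased, but Var(X_s) = x(1-x)/N."""
    return (rng.uniform(size=N) < x).astype(np.int8)

def unary_pulses(x, N, rng=None):
    """Deterministic unary variant (Sec. II-B): round(N*x) leading 1s;
    zero variance, but biased by up to 1/(2N)."""
    pulses = np.zeros(N, dtype=np.int8)
    pulses[:int(round(N * x))] = 1
    return pulses

# e.g. estimate_emse(stochastic_pulses, 64) is ~1/(6*64), while
# estimate_emse(unary_pulses, 64) is ~1/(12*64**2).
```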
C. Stochastic rounding

As the deterministic variant (Sect. II-B) has a better asymptotic EMSE than stochastic computing (Sect. II-A), one might wonder why stochastic computing is useful. It is instructive to consider a special case: 1-bit stochastic rounding [8], in which the rounding of a number $x \in [0,1]$ is given by a Bernoulli trial $X$ with $P(X = 1) = x$. This type of rounding is equivalent to the special case $N = 1$ of the stochastic computing mechanism in Sect. II-A. In deterministic rounding, $X = \mathrm{round}(x)$ and the corresponding EMSE is $\tilde{L}_x = (x - \mathrm{round}(x))^2$. For stochastic rounding, $X$ has a Bernoulli distribution. If $P(X = 1) = p$, then $L_x = (p - x)^2 + p(1 - p) = p(1 - 2x) + x^2$. Since $\frac{\partial L_x}{\partial p} = 1 - 2x$, it follows that for $x \in [0, \frac{1}{2}]$, $L_x$ is minimized when $p = 0$, and for $x \in [\frac{1}{2}, 1]$, $L_x$ is minimized when $p = 1$, i.e. $L_x$ is minimized when $p = \mathrm{round}(x)$. Thus $L_x \geq \tilde{L}_x$ with equality exactly when $p = \mathrm{round}(x)$. This shows that the EMSE for deterministic rounding is minimal among all stochastic rounding schemes, so at first glance deterministic rounding is preferred over stochastic rounding. However, while deterministic rounding has a lower EMSE than stochastic rounding, it is a biased estimator. This is problematic for applications such as reduced precision deep learning, where an unbiased estimator such as stochastic rounding has been shown to provide improved performance over a biased estimator such as deterministic rounding. As indicated in [9], part of the reason is that subsequent values that are rounded are correlated, and in this case stochastic rounding prevents stagnation.
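The bias/MSE trade-off above can be checked numerically with a few lines (a hypothetical demo continuing the sketches above; `p` is the rounding probability in the text):

```python
def rounding_mse(x, p, trials=200_000, seed=0):
    """Empirical MSE of 1-bit rounding X ~ Bernoulli(p) against x.
    Should match L_x = p*(1 - 2*x) + x**2 from Sec. II-C."""
    rng = np.random.default_rng(seed)
    X = (rng.uniform(size=trials) < p).astype(float)
    return np.mean((X - x) ** 2)

x = 0.3
print(rounding_mse(x, p=round(x)))  # deterministic rounding: ~x**2 = 0.09
print(rounding_mse(x, p=x))         # stochastic rounding: ~x*(1-x) = 0.21
```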
D. Dither computing: A hybrid deterministic-stochastic computing framework

The main goal of this paper is to introduce dither computing, a hybrid deterministic-stochastic computing framework that combines the benefits of stochastic computing (Sec. II-A) and its deterministic variants (Sec. II-B), eliminating the bias component while preserving the optimal $O(N^{-2})$ asymptotic rate for the EMSE $L$. (In the sequel, for brevity, we will sometimes refer to these schemes simply as "deterministic variants".) The encoding is constructed as follows. Let $\sigma$ be a permutation of $\{1, 2, \cdots, N\}$. For $x \in [0, \frac{1}{2}]$, let $n = \lfloor Nx \rfloor \leq \frac{N}{2}$ and $0 \leq r = x - \frac{n}{N} \leq \frac{1}{N}$. Then we pick the $N$ Bernoulli trials with $P(X_{\sigma(i)} = 1) = 1$ for $1 \leq i \leq n$ and $P(X_{\sigma(i)} = 1) = \delta$ for $n + 1 \leq i \leq N$, with $\delta = \frac{Nr}{N - n}$. Then $E(X_s) = \frac{1}{N}(n + \delta(N - n)) = x$. In addition, since $n \leq \frac{N}{2}$ and $rN \leq 1$, this implies that $\delta \leq \frac{2}{N}$ and $\mathrm{Var}(X_s) = \frac{1}{N^2}(N - n)\,\delta(1 - \delta) \leq \frac{2}{N^2} = O(N^{-2})$. Thus the bias is 0 and the EMSE is of the order $O(N^{-2})$. It is clear that this remains true whether $\sigma$ is a deterministic or a random permutation, as $X_s$ does not depend on $\sigma$.

For $x \in (\frac{1}{2}, 1]$, let $n = \lceil Nx \rceil \geq \frac{N}{2}$ and $0 \leq r = \frac{n}{N} - x \leq \frac{1}{N}$. We pick the $N$ Bernoulli trials with $P(X_{\sigma(i)} = 1) = 1 - \delta$ for $1 \leq i \leq n$ and $P(X_{\sigma(i)} = 1) = 0$ for $n + 1 \leq i \leq N$, with $\delta = \frac{rN}{n}$. Then $E(X_s) = \frac{n(1 - \delta)}{N} = x$. In addition, since $n \geq \frac{N}{2}$ and $rN \leq 1$, this implies that $\delta \leq \frac{2}{N}$ and $\mathrm{Var}(X_s) = \frac{n}{N^2}\,\delta(1 - \delta) \leq \frac{2}{N^2} = O(N^{-2})$. Thus again the bias is 0 and the EMSE is of the order $O(N^{-2})$.

The above analysis shows that dither computing offers a better EMSE than stochastic computing while preserving the zero bias property. In order for such representations to be useful in building computing machinery, we need to show that this advantage persists under arithmetic operations such as multiplication and (scaled) addition.
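A direct transcription of this construction into code might look as follows (a sketch under our reading of Sec. II-D, continuing the sketches above; the permutation defaults to the identity, and edge cases such as $x = 1$ fall out of the second branch):

```python
def dither_pulses(x, N, rng, sigma=None):
    """Dither computing encoding of x in [0, 1] (Sec. II-D): n 'sure'
    pulses plus weak Bernoulli(delta) pulses with delta <= 2/N, so that
    E(X_s) = x exactly and Var(X_s) = O(1/N^2)."""
    if sigma is None:
        sigma = np.arange(N)              # identity permutation
    pulses = np.zeros(N, dtype=np.int8)
    if x <= 0.5:
        n = int(np.floor(N * x))          # n <= N/2
        delta = N * (x - n / N) / (N - n)
        pulses[sigma[:n]] = 1
        pulses[sigma[n:]] = rng.uniform(size=N - n) < delta
    else:
        n = int(np.ceil(N * x))           # n >= N/2
        delta = N * (n / N - x) / n
        pulses[sigma[:n]] = rng.uniform(size=n) < 1 - delta
    return pulses
```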
III. MULTIPLICATION OF VALUES

In this section, we consider whether this advantage is maintained for these schemes under multiplication of sequences via bitwise AND. The sequence corresponding to the product of $X_i$ and $Y_i$ is given by $Z_i = X_i Y_i$, and the product $z = xy$ is estimated via $Z_s = \frac{1}{N}\sum_i Z_i$.

A. Stochastic computing
In this case we want to compute the product $z = xy$. Let $X_i$ and $Y_i$ be independent with $P(X_i = 1) = x$ and $P(Y_i = 1) = y$. Then for $Z_i = X_i Y_i$, the $Z_i$ are Bernoulli with $P(Z_i = 1) = xy$ and $E(Z_s) = xy = z$. Furthermore, $\mathrm{Var}(Z_i) = z(1 - z)$ and $\mathrm{Var}(Z_s) = \frac{z(1-z)}{N} = \Omega(N^{-1})$. Thus $\mathrm{Bias}(Z_s, z) = 0$, and the variance and the MSE of the product maintain the suboptimal $\Omega(N^{-1})$ asymptotic rate.

B. Deterministic variant of stochastic computing
For numbers $x, y \in [0,1]$, we consider a unary encoding for $x$, i.e. $P(X_i = 1) = 1$ for $1 \leq i \leq R$ and $P(X_i = 1) = 0$ otherwise, where $R = \mathrm{round}(Nx)$. For $y$ we have $P(Y_i = 1) = 1$ if $\lfloor iy \rfloor \neq \lfloor (i+1)y \rfloor$ and $P(Y_i = 1) = 0$ otherwise. Let $m$ be the number of indices $i$ such that $Z_i \neq 0$; then $|m - yR| \leq 1$ and $Z_s = m/N$. This means that $|Z_s - z| \leq |\frac{m}{N} - \frac{yR}{N}| + |\frac{yR}{N} - xy|$. Since $|R - Nx| \leq \frac{1}{2}$, this implies that $|Z_s - z| \leq \frac{2}{N}$, and thus the bias is on the order of $O(N^{-1})$ and the EMSE $L$ is on the order of $O(N^{-2})$.

C. Dither computing
For numbers $x, y \in [0,1]$, we consider the encoding in Section II-D with the permutation $\sigma_x$ for $x$ defined as the identity and the permutation $\sigma_y$ for $y$ defined as spreading the 1-bits in a sample $(y_1, \cdots, y_N)$ of $(Y_1, \cdots, Y_N)$ as evenly as possible. In particular, let $y_i$ be a sample of $Y_i$ and $s_y = \sum_i y_i$. Then $\sigma(i) = \lfloor \frac{iN}{s_y} + T \rfloor \bmod N$ for $i = 1, \cdots, s_y$, where $T$ is a uniformly distributed random variable on $[0,1]$ independent from $X_i$ and $Y_i$. We will only consider the case $x, y \in (\frac{1}{2}, 1]$, as the other cases are similar. Let $n_x = \lceil Nx \rceil$, $n_y = \lceil Ny \rceil$, $\delta_x = 1 - \frac{Nx}{n_x}$ and $\delta_y = 1 - \frac{Ny}{n_y}$. Then $P(Z_i = 1) = (1 - \delta_x)(1 - \delta_y) = \frac{N^2 z}{n_x n_y} \neq 0$ for $\frac{n_x n_y}{N}$ of the indices on average, and $P(Z_i = 1) = 0$ otherwise. This implies that $E(Z_s) = z$ and the bias is 0. Similar to the deterministic variant, it can be shown that $|Z_s - z| \leq \frac{c}{N}$ for a constant $c > 0$, and thus $L$ is $O(N^{-2})$.
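The following sketch illustrates the multiplication experiment. Since the exact spreading permutation $\sigma_y$ is stated tersely in the text, we substitute a simple stride permutation with step coprime to $N$, which also places the leading 1-bits roughly evenly over the sequence; this stand-in is our own assumption and achieves approximately, rather than exactly, the behavior analyzed above.

```python
from math import gcd

def stride_permutation(N):
    """Permutation i -> i*g mod N with gcd(g, N) = 1, so the first k
    indices land spread out over 0..N-1 (a stand-in for sigma_y)."""
    g = max(2, round(0.618 * N))
    while gcd(g, N) != 1:
        g += 1
    return (np.arange(N) * g) % N

def product_mse(x, y, N, trials=5000, seed=0):
    """Empirical MSE of Z_s = (1/N) sum_i X_i Y_i (bitwise AND) against
    z = x*y for the dither encodings of Sec. III-C."""
    rng = np.random.default_rng(seed)
    sig_y, z, err = stride_permutation(N), x * y, 0.0
    for _ in range(trials):
        Z = dither_pulses(x, N, rng) & dither_pulses(y, N, rng, sigma=sig_y)
        err += (Z.sum() / N - z) ** 2
    return err / trials
```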
IV. SCALED ADDITION (OR AVERAGING) OF VALUES

For $x, y \in [0,1]$, the output of the scaled addition (or averaging) operation is $u = \frac{x+y}{2} \in [0,1]$. An auxiliary control sequence $W_i$ of bits is defined that is used to toggle between the two sequences, defining $U_i$ as alternating between $X_i$ and $Y_i$: $U_i = W_i X_i + (1 - W_i) Y_i$, and $u$ is estimated via $U_s = \frac{1}{N}\sum_i U_i$.

A. Stochastic computing
The control sequence $W_i$ is defined as $N$ independent Bernoulli trials with $P(W_i = 1) = \frac{1}{2}$. It is assumed that $W_i$, $X_j$ and $Y_k$ are independent. Then $E(U_s) = \frac{1}{2}(E(X_s) + E(Y_s)) = \frac{1}{2}(x + y) = u$, i.e., $\mathrm{Bias}(U_s, u) = 0$. Since the $U_i$ are iid Bernoulli trials with parameter $u$, $\mathrm{Var}(U_s) = \frac{u(1-u)}{N} = \Omega(N^{-1})$. Thus again $L = \Omega(N^{-1})$.

B. Deterministic variant of stochastic computing
For this case the $W_i$ are deterministic and we define $W_i = 1$ if $i$ is even and $W_i = 0$ otherwise. Let $N_e = \lfloor \frac{N}{2} \rfloor$ and $N_o = N - N_e$ be the number of even and odd numbers in $\{1, \cdots, N\}$ respectively. Then $\mathrm{Var}(U_s) = 0$ and $E(U_s) = \frac{N_e}{N} E(X_s) + \frac{N_o}{N} E(Y_s)$. If $N$ is even, $E(U_s) = \frac{1}{2}(E(X_s) + E(Y_s))$. If $N$ is odd, $|\frac{N_e}{N} - \frac{1}{2}| = O(N^{-1})$. In either case, $|E(U_s) - u| = O(N^{-1})$, so $\mathrm{Bias}(U_s, u) = O(N^{-1})$ and $L = O(N^{-2})$.

C. Dither computing
We set $\sigma_x$ and $\sigma_y$ both equal to the identity permutation and define the sequence $\{s_i\}$ with $s_i = 1$ for $i$ odd and $s_i = 0$ otherwise. With probability $\frac{1}{2}$, $W_i = s_i$ for all $i$, and $W_i = 1 - s_i$ for all $i$ otherwise. Thus the two sequences $\{s_i\}$ and $\{1 - s_i\}$ are each chosen with probability $\frac{1}{2}$. Note that $W_i$ and $W_j$ are correlated, with $E(W_i) = \frac{1}{2}$ and $\mathrm{Var}(W_i) = \frac{1}{4}$. This means that $E(U_s) = u$ and the bias is 0. Each of the two sequences for $W_i$ selects a disjoint set of the random variables $X_i$ and $Y_i$, the sum of which has variance $O(N^{-2})$. This implies that $\mathrm{Var}(U_s)$ is $O(N^{-2})$.
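A sketch of this averaging rule, continuing the earlier sketches (our illustration; indices are 0-based here, so `i % 2` plays the role of "$s_i = 1$ for $i$ odd"):

```python
def dither_average_pulses(x_pulses, y_pulses, rng):
    """Scaled addition (Sec. IV-C): choose the alternating mask {s_i} or
    its complement {1 - s_i} with probability 1/2 each, then interleave:
    U_i = W_i * X_i + (1 - W_i) * Y_i."""
    N = len(x_pulses)
    s = np.arange(N) % 2
    W = s if rng.uniform() < 0.5 else 1 - s
    return W * x_pulses + (1 - W) * y_pulses

# U_s = dither_average_pulses(xp, yp, rng).sum() / N estimates (x + y) / 2.
```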
V. NUMERICAL RESULTS

In Figures 1-6 we show the EMSE $L$ and the bias for the computing schemes above, obtained by generating independent pairs $(x, y)$ from a uniform distribution of $X$ and of $Y$. For each pair $(x, y)$, 1000 trials of dither computing and stochastic computing are used to represent the pair and to compute the product $z$ and the average $u$. The set of pairs $(x, y)$ is the same for all three schemes; for the deterministic variant, only one trial is performed, as $X_s$ and $Y_s$ are deterministic.

Fig. 1: Sample estimate of EMSE $L$ to represent $x$ for various values of $N$.

Fig. 2: Sample estimate of $|\mathrm{Bias}|$ to represent $x$ for various values of $N$.

We see that the sample estimates of the bias for $x$, $z$ and $u$ are lower for both the stochastic computing scheme and the dither computing scheme as compared with the deterministic variant. On the other hand, the dither computing scheme has an EMSE on the order of $O(N^{-2})$, similar to the deterministic scheme, whereas the stochastic computing scheme has a higher EMSE on the order of $\Omega(N^{-1})$. Even though both stochastic computing and dither computing have zero bias, the sample estimate of this bias is lower for dither computing than for stochastic computing. This is because the standard error of the mean is proportional to the standard deviation, and dither computing has standard deviation $O(N^{-1})$ vs. $\Omega(N^{-1/2})$ for stochastic computing; this is observed in Figs. 2, 4, 6.

Fig. 3: Sample estimate of EMSE $L$ to represent $z = xy$ for various values of $N$.

Fig. 4: Sample estimate of $|\mathrm{Bias}|$ to represent $z = xy$ for various values of $N$.

Fig. 5: Sample estimate of EMSE $L$ to represent $u = (x+y)/2$ for various values of $N$.

Fig. 6: Sample estimate of $|\mathrm{Bias}|$ to represent $u = (x+y)/2$ for various values of $N$.

                      Stoch. Comp.        Determ. Variant     Dither Comp.
Bias (repr.)          0                   $O(N^{-1})$         0
Variance (repr.)      $\Omega(N^{-1})$    0                   $\Theta(N^{-2})$
EMSE $L$ (repr.)      $\Omega(N^{-1})$    $\Theta(N^{-2})$    $\Theta(N^{-2})$
Bias (mult.)          0                   $O(N^{-1})$         0
Variance (mult.)      $\Omega(N^{-1})$    0                   $\Theta(N^{-2})$
EMSE $L$ (mult.)      $\Omega(N^{-1})$    $\Theta(N^{-2})$    $\Theta(N^{-2})$
Bias (average)        0                   $O(N^{-1})$         0
Variance (average)    $\Omega(N^{-1})$    0                   $\Theta(N^{-2})$
EMSE $L$ (average)    $\Omega(N^{-1})$    $\Theta(N^{-2})$    $\Theta(N^{-2})$

TABLE I: Asymptotic behavior of Bias, Variance and EMSE for stochastic computing, the deterministic variant and dither computing to represent a number, to multiply 2 numbers and to perform the average (scaled addition) operation.

Furthermore, even though the dither computing representations of $x$ and $y$ have worse EMSE than the deterministic variant, the dither computing representations of both the product $z$ and the scaled addition $u$ have better EMSE. The asymptotic behavior of bias and EMSE for these different schemes is listed in Table I.

VI. ASYMMETRY IN OPERANDS
In the dither computing scheme (and in the deterministic variant of stochastic computing as well), the encodings of the two operands $x$ and $y$ are different. For instance, for multiplication $x$ is encoded as a unary number (denoted as Format 1) and $y$ has its 1-bits spread out as much as possible (denoted as Format 2), while both $x$ and $y$ are encoded as unary numbers for scaled addition. For multilevel arithmetic operations, this asymmetry requires additional logic to convert the output of multiplication and scaled addition into these two formats, depending on which operand and which operation the next arithmetical operation is. On the other hand, there are several applications where the need for this additional step is reduced. For instance:

1) In memristive crossbar arrays [10], the sequence of pulses in the product is integrated and converted to digital via an A/D converter, and thus the product sequence of pulses is not used in subsequent computations.

2) In using stochastic computing to implement the matrix-vector multiply-and-add in neural networks [11], one of the operands is always a weight or a bias and thus fixed throughout the inference operation. Thus the weight can be precoded in Format 2 for multiplication and the bias value precoded in Format 1 for addition, whereas the data to be operated on is always in Format 1 and the result recoded to Format 1 for the next operation.

VII. DITHER ROUNDING: STOCHASTIC ROUNDING REVISITED
Recently, stochastic rounding has emerged as an alternative mechanism to deterministic rounding for using reduced precision hardware in applications such as solving differential equations [12] and deep learning [13]. As mentioned in Sec. II-C, 1-bit stochastic rounding can be considered the special case of stochastic computing with $N = 1$. For $k$-bit stochastic rounding, the situation is similar, as only the least significant bit is stochastic. Another alternative interpretation is that stochastic computing is stochastic rounding in time, i.e. $X_i$, $i = 1, \cdots, N$, can be considered as applying stochastic rounding $N$ times. Since the standard error of the mean of dither computing is asymptotically superior to that of stochastic computing, we expect this advantage to persist for rounding as well when applied over time.

Thus we introduce dither rounding as follows. We assume $\alpha \geq 0$, as the case $\alpha < 0$ can be handled similarly. We define dither rounding of a real number $\alpha \geq 0$ as $d(\alpha, i) = \lfloor \alpha \rfloor + X_i$, where $\{X_i\}$ is the dither computing representation of $x = \alpha - \lfloor \alpha \rfloor$ as defined in Sect. II-D and $\alpha - \lfloor \alpha \rfloor$ is the fractional part of $\alpha$. Note that there is an index $i$ in the definition of $d(\cdot, \cdot)$, which is an integer $0 \leq i < N$. In practice we will compute $i$ as $\sigma(i_s \bmod N)$, where $i_s$ counts how many times the dither rounding operation has been applied so far and $\sigma$ is a fixed permutation, one for the left operand and one for the right operand of the scalar multiplier.

To illustrate the performance of these different rounding schemes, consider the problem of matrix-matrix multiplication, a workhorse of computational science and deep learning algorithms. Let $A$ and $B$ be $p \times q$ and $q \times r$ matrices with elements in $[0,1]$. The goal is to compute the matrix $C = AB$. A straightforward algorithm for computing $C$ requires $pqr$ (scalar) multiplications. Let us assume that we have at our disposal only $k$-bit fixed point digital multipliers, and thus floating point real numbers are rounded to $k$ bits before using the multiplier. We want to compare the performance of computing $C = AB$ among traditional rounding, stochastic rounding and dither rounding. In particular, since each element of $A$ is used $r$ times and each element of $B$ is used $p$ times, for dither rounding we set $N = \min(p, r)$. For dither rounding, the computation of each of the $pqr$ partial results $A_{ij} B_{jk}$ is illustrated in Fig. 7, and the other schemes can be obtained by simply replacing the rounding scheme. We measure the error by computing the Frobenius matrix norm $e_f = \| C - \tilde{C} \|_{fro}$, where $\tilde{C}$ is the product matrix computed using the specified rounding method and the $k$-bit fixed point multiplier. In our case this is implemented by rescaling the interval $[0,1]$ to $[0, 2^k - 1]$ and rounding to fixed point $k$-bit integers. Note that the Frobenius matrix norm is equivalent to the $l_2$ vector norm when the matrix is flattened as a vector.

Fig. 7: Dither rounding to compute the partial result $A_{ij} B_{jk}$. $X_i$ and $Y_i$ are the dither computing representations of $A_{ij} - \lfloor A_{ij} \rfloor$ and $B_{jk} - \lfloor B_{jk} \rfloor$.

We expect dither rounding (and stochastic rounding) to outperform traditional rounding when the range of the matrix elements is narrow compared to the quantization interval.
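As an illustration of dither rounding, the following sketch (hypothetical code building on `dither_pulses` above; the cycling index plays the role of $i_s \bmod N$) produces rounded values whose long-run average recovers $\alpha$ exactly, unlike deterministic rounding:

```python
def dither_round_stream(alpha, N, rng):
    """Dither rounding d(alpha, i) = floor(alpha) + X_i (Sec. VII), where
    {X_i} is a dither representation of the fractional part of alpha."""
    base = np.floor(alpha)
    pulses = dither_pulses(alpha - base, N, rng)
    return lambda i: base + pulses[i % N]

rng = np.random.default_rng(0)
d = dither_round_stream(2.3, N=64, rng=rng)
vals = [d(i) for i in range(64)]
print(np.mean(vals))   # ~2.3 on average, vs. round(2.3) = 2.0 every time
```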
For example, take the special case of $A = \alpha J$ and $B = \beta J$, where $J$ is the square matrix of all 1's and $\alpha, \beta \in [0,1]$. When we use traditional rounding to round the elements of $A$ and $B$, the corresponding $\tilde{C}$ is $\gamma J$, where $\gamma = \mathrm{round}((2^k - 1)\alpha) \cdot \mathrm{round}((2^k - 1)\beta) / (2^k - 1)^2$. The analysis in Section III shows that for both dither rounding and stochastic rounding the resulting $\tilde{C}$ satisfies $E(\tilde{C}) = \alpha\beta J = AB$, with $E(e_f) = \Theta(N^{-1})$ for dither rounding and $E(e_f) = \Theta(N^{-1/2})$ for stochastic rounding.

We generate 100 pairs of 100 by 100 matrices $A$ and $B$ whose elements are randomly chosen from the range $[0, 0.1]$, and choose $N = 100$. The average $e_f$ for traditional rounding, stochastic rounding and dither rounding is shown in Fig. 8. We see that dither rounding has smaller $e_f$ than stochastic rounding, and that for small $k$ both dither rounding and stochastic rounding have significantly lower error in computing $AB$ than traditional rounding, which is equivalent to deterministic $k$-bit quantization. Note that for traditional rounding and $k = 1$, $A$ and $B$ are both rounded to the zero matrix, and $e_f = \|AB\|_{fro}$ in this case. There is a threshold $\tilde{k}$ where traditional rounding outperforms dither or stochastic rounding for $k \geq \tilde{k}$, and we expect this threshold to increase as $N, p, q, r$ increase.

Fig. 8: Comparison of various rounding methods for multiplying two 100 by 100 matrices with entries in $[0, 0.1]$.

For the next numerical experiment, we compare stochastic rounding with dither rounding and set $N = p = q = r$ to be 100 or 200, with $A$, $B$ random matrices with elements in $[0,1]$, and compute 100 trials each. The results are shown in Fig. 9, where we plot the error $e_f$ for various $k$. Again we see that dither rounding has a smaller average error in computing $C$ than stochastic rounding. Based on our analysis above, and similar to the previous numerical results, we expect these gaps to widen as $N, p, q, r$ increase.

Fig. 9: Comparison of dither vs. stochastic rounding for multiplying two matrices with entries in $[0,1]$.

VIII. CONCLUSIONS
We present a hybrid stochastic-deterministic scheme that encompasses the best features of stochastic computing and its deterministic variants, achieving the optimal $\Theta(N^{-2})$ asymptotic rate for the EMSE of the deterministic variant while inheriting the zero bias property of stochastic computing schemes. We also show how it can be beneficial in stochastic rounding applications.

REFERENCES
[1] B. R. Gaines, "Stochastic computing," in Proceedings of the AFIPS Spring Joint Computer Conference, pp. 149–156, 1967.
[2] A. Alaghi and J. P. Hayes, "Survey of stochastic computing," ACM Transactions on Embedded Computing Systems, vol. 12, no. 2s, pp. 1–19, 2013.
[3] T.-H. Chen and J. P. Hayes, "Analyzing and controlling accuracy in stochastic circuits," in IEEE 32nd International Conference on Computer Design (ICCD), 2014.
[4] R. P. Duarte, M. Vestias, and H. Neto, "Enhancing stochastic computations via process variation," 2015.
[5] D. Jenson and M. Riedel, "A deterministic approach to stochastic computation," in ICCAD, 2016.
[6] M. D. Davis, R. Sigal, and E. J. Weyuker, Computability, Complexity, and Languages: Fundamentals of Theoretical Computer Science. Academic Press, 1994.
[7] G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning. Springer, 2013.
[8] M. Höhfeld and S. E. Fahlman, "Probabilistic rounding in neural network learning with limited precision," Neurocomputing, vol. 4, no. 6, pp. 291–299, 1992.
[9] M. P. Connolly, N. J. Higham, and T. Mary, "Stochastic rounding and its probabilistic backward error analysis," Tech. Rep. MIMS EPrint 2020.12, The University of Manchester, 2020.
[10] T. Gokmen, M. Onen, and W. Haensch, "Training deep convolutional neural networks with resistive cross-point devices," Frontiers in Neuroscience, vol. 11, 2017.
[11] Y. Liu, S. Liu, Y. Wang, F. Lombardi, and J. Han, "A survey of stochastic computing neural networks for machine learning applications," IEEE Transactions on Neural Networks and Learning Systems, pp. 1–16, 2020.
[12] M. Hopkins, M. Mikaitis, D. R. Lester, and S. Furber, "Stochastic rounding and reduced-precision fixed-point arithmetic for solving neural ordinary differential equations," Philosophical Transactions A, vol. 378, p. 20190052, 2020.
[13] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, "Deep learning with limited numerical precision," in International Conference on Machine Learning (ICML), 2015.