Fast Random Integer Generation in an Interval
DANIEL LEMIRE,
Université du Québec (TELUQ), Canada

In simulations, probabilistic algorithms and statistical tests, we often generate random integers in an interval (e.g., [0, s)). For example, random integers in an interval are essential to the Fisher-Yates random shuffle. Consequently, popular languages like Java, Python, C++, Swift and Go include ranged random integer generation functions as part of their runtime libraries.

Pseudo-random values are usually generated in words of a fixed number of bits (e.g., 32 bits, 64 bits) using algorithms such as a linear congruential generator. We need functions to convert such random words to random integers in an interval ([0, s)) without introducing statistical biases. The standard functions in programming languages such as Java involve integer divisions. Unfortunately, division instructions are relatively expensive. We review an unbiased function to generate ranged integers from a source of random words that avoids integer divisions with high probability. To establish the practical usefulness of the approach, we show that this algorithm can multiply the speed of unbiased random shuffling on x64 processors. Our proposed approach has been adopted by the Go language for its implementation of the shuffle function.

CCS Concepts: • Theory of computation → Pseudorandomness and derandomization; • Software and its engineering → Software performance;

Additional Key Words and Phrases: Random number generation, Rejection method, Randomized algorithms
INTRODUCTION

There are many efficient techniques to generate high-quality pseudo-random numbers such as Mersenne Twister [28], Xorshift [27, 32], linear congruential generators [6, 9, 20, 21, 24] and so forth [22, 26]. Many pseudo-random number generators produce 32-bit or 64-bit words that can be interpreted as integers in [0, 2^32) and [0, 2^64) respectively: the produced values are practically indistinguishable from truly random numbers in [0, 2^32) or [0, 2^64). In particular, no single value is more likely than any other.

However, we often need random integers selected uniformly from an interval [0, s), and this interval may change dynamically. It is useful for selecting an element at random in an array containing s elements, but there are less trivial uses. For example, the Fisher-Yates random shuffle described by Knuth [8, 16] (see Algorithm 1) requires one random integer in an interval for each value in an array to be shuffled. Ideally, we want these values to be generated without bias so that all integer values in [0, s) are equally likely. Only then are all permutations equally likely. A related algorithm is reservoir sampling [38] (see Algorithm 2), which randomly selects a subset of values from a possibly very large array, even when the size of the array is initially unknown.

We use random permutations as part of simulation algorithms [1, 5, 7, 14, 29, 31]. The performance of randomized permutation algorithms is important. Various non-parametric tests in statistics and machine learning repeatedly permute randomly the original data. In some contexts, Hinrichs et al. found that the computational burden due to random permutations can be prohibitive [15]. Unsurprisingly, much work has been done on parallelizing permutations [13, 18, 34, 35, 40] for greater speed.

One might think that going from fixed-bit pseudo-random numbers (e.g., 32-bit integers) to pseudo-random numbers in an interval is a minor, inexpensive operation.
However, this may not be true, at least when the interval is a changing parameter and we desire a uniformly-distributed result.

Author's address: Daniel Lemire, Université du Québec (TELUQ), 5800 Saint-Denis, Office 1105, Montreal, Quebec, H2S 3L5, Canada, [email protected].

ALGORITHM 1:
Fisher-Yates random shuffle: it shuffles an array of size n so that all n! possible permutations are equiprobable.

Require: array A made of n elements indexed from 0 to n − 1
for i = n − 1, . . . , 1 do
    j ← random integer in [0, i]
    exchange A[i] and A[j]
end for

ALGORITHM 2: Reservoir sampling: returns an array R containing k distinct elements picked randomly from an array A of size n so that all (n choose k) possible samples are equiprobable.

Require: array A made of n elements indexed from 0 to n − 1
Require: integer k (0 < k ≤ n)
R ← array of size k
for i = 0, . . . , k − 1 do
    R[i] ← A[i]
end for
for i = k, . . . , n − 1 do
    j ← random integer in [0, i]
    if j < k then
        R[j] ← A[i]
    end if
end for
return R

• Let us consider a common but biased approach to the problem of converting numbers from a large interval [0, n) to numbers in a subinterval [0, s) (s ≤ n): the modulo reduction x → x mod s. On x64 processors, this could be implemented through the division (div) instruction when both s and x are parameters. When applied to 32-bit registers, this instruction has a latency of 26 cycles [10]. With 64-bit registers, the latency ranges from 35 to 88 cycles, with longer running times for small values of s.

• Another biased but common approach consists in using a fixed-point floating-point representation made of the following steps:
    – we convert the random word to a floating-point number in the interval [0, 1),
    – we convert the integer s into a floating-point number,
    – we multiply the two resulting floating-point numbers,
    – and we convert the floating-point result to an integer in [0, s).
When using the typical floating-point standard (IEEE 754), we can at best represent all values in [0, 2^24) divided by 2^24 using a 32-bit floating-point number. Thus we do not get the full 32-bit range: we cannot generate all numbers in [0, s) if s > 2^24. To do so, we must use double-precision floating-point numbers, and then we can represent all values in [0, 2^53) divided by 2^53.
Moreover, converting between floating-point values and integers is not without cost: the corresponding instructions on x64 processors (e.g., cvtsi2ss, cvttss2si) have at least six cycles of latency on Skylake processors [10].

While generating a whole new 64-bit pseudo-random number can take as little as a handful of cycles [33], transforming it into an integer in an interval ([0, s) for s ∈ [0, 2^32)) without bias can take an order of magnitude longer when using division operations.

There is a fast technique that avoids division and does not require floating-point numbers. Indeed, given an integer x in the interval [0, 2^L), we have that the integer (x × s) ÷ 2^L is in [0, s) for any integer s ∈ [0, 2^L]. If the integer x is picked at random in [0, 2^L), then the result (x × s) ÷ 2^L is a random integer in [0, s) [30]. The division by a power of two (÷2^L) can be implemented by a bit shift instruction, which is inexpensive. A multiplication followed by a shift is much more economical on current processors than a division, as it can be completed in only a handful of cycles. It introduces a bias, but we can correct for it efficiently using the rejection method (see § 4). This multiply-and-shift approach is similar in spirit to the multiplication by a floating-point number in the unit interval ([0, 1)) in the sense that (s × x) ÷ 2^L can be intuitively compared with s × x/2^L where x/2^L is a random number in [0, 1).

Though the idea that we can avoid divisions when generating numbers in an interval is not novel, we find that many standard libraries (Java, Go, . . . ) use an approach that incurs at least one integer division per function call. We believe that a better default would be an algorithm that avoids division instructions with high probability. We show experimentally that such an algorithm can provide superior performance.

We let ⌊x⌋ be the largest integer smaller than or equal to x, and we let ⌈x⌉ be the smallest integer greater than or equal to x.
We let x ÷ y be the integer division of x by y, defined as ⌊x/y⌋. We define the remainder of the division of x by y as x mod y: x mod y ≡ x − (x ÷ y) y.

We are interested in the integers in an interval [0, 2^L) where, typically, L = 32 or L = 64. We refer to these integers as L-bit integers.

When we consider the values of x mod s as x goes from 0 to 2^L − 1, we get the 2^L values

    0, 1, . . . , s − 1, 0, 1, . . . , s − 1, . . . , 0, 1, . . . , s − 1, 0, 1, . . . , (2^L mod s) − 1,

that is, (2^L ÷ s) complete blocks of s values followed by a final partial block of 2^L mod s values. We have the following lemma by inspection.

Lemma 2.1.
Given integers a, b, s > 0, there are exactly (b − a) ÷ s multiples of s in [a, b) whenever s divides b − a. More generally, there are exactly (b − a) ÷ s integers in [a, b) having a given remainder with s whenever s divides b − a.

A geometric distribution with success probability p ∈ (0, 1] is a discrete distribution taking value k ∈ {1, 2, . . .} with probability (1 − p)^(k−1) p. The mean of a geometric distribution is 1/p.

Assume that we have a source of uniformly-distributed L-bit random numbers, i.e., integers in [0, 2^L). From such a source of random numbers, we want to produce a uniformly-distributed random integer y in [0, s) for some integer s ∈ [1, 2^L]. That is, all integers from the interval are equally likely: P(y = z) = 1/s for any integer z ∈ [0, s). We then say that the result is unbiased.

If s divides 2^L, i.e., it is a power of two, then we can divide the random integer x from [0, 2^L) by 2^L/s, itself a power of two. However, we are interested in the general case where s may be any integer value in [1, 2^L).

We can achieve an unbiased result by the rejection method [39]. For example, we could generate random integers x ∈ [0, 2^L) until x is in [0, s), rejecting all other cases. Rejecting so many values

For some applications where streams of random numbers need to be synchronized, the rejection method is not applicable [2, 4, 12, 19, 23].

ALGORITHM 3:
The OpenBSD algorithm.
Require: source of uniformly-distributed random integers in [0, 2^L)
Require: target interval [0, s) with s ∈ [1, 2^L)
t ← (2^L − s) mod s {(2^L − s) mod s = 2^L mod s}
x ← random integer in [0, 2^L)
while x < t do {Application of the rejection method}
    x ← random integer in [0, 2^L)
end while
return x mod s

is wasteful. Popular software libraries use more efficient algorithms. We provide code samples in Appendix A.

The C standard libraries in OpenBSD and macOS have an arc4random_uniform function to generate unbiased random integers in an interval [0, s). See Algorithm 3. The Go language (e.g., version 1.9) has adopted the same algorithm for its Int63n and
Int31n functions, with minor implementation differences [37]. The GNU C++ standard library (e.g., version 7.2) also relies on the same algorithm [11].

The interval [2^L mod s, 2^L) has size 2^L − (2^L mod s), which is divisible by s. From Lemma 2.1, if we generate random integers from integer values in [2^L mod s, 2^L) as remainders of a division by s (x mod s), then each of the integers occurs for 2^L ÷ s integers x ∈ [0, 2^L). To produce integers in [2^L mod s, 2^L), we use the rejection method: we generate integers x ∈ [0, 2^L) but reject the result whenever x < 2^L mod s. If we have a source of unbiased random integers in [0, 2^L), then the result of Algorithm 3 is an unbiased random integer in [0, s).

The number of random integers consumed by this algorithm follows a geometric distribution, with a success probability p = 1 − (2^L mod s)/2^L. On average, we need 1/p random words. This average is less than two, irrespective of the value of s. The algorithm always requires the computation of two remainders.

There is a possible trivial variation on the algorithm where instead of rejecting the integer from [0, 2^L) when it is part of the first 2^L mod s values (in [0, 2^L mod s)), we reject the integer from [0, 2^L) when it is part of the last 2^L mod s values (in [2^L − (2^L mod s), 2^L)).

It is unfortunate that Algorithm 3 always requires the computation of two remainders, especially because we anticipate such computations to have high latency. The first remainder is used to determine whether a rejection is necessary (x < (2^L − s) mod s), and the second remainder is used to generate the value in [0, s) as x mod s.

The Java language, in its Random class, uses an approach that often requires a single remainder (e.g., as of OpenJDK 9). Suppose we pick a number x ∈ [0, 2^L) and we compute its remainder x mod s. Having both x and x mod s, we can determine whether x is allowable (is in [0, 2^L − (2^L mod s))) without using another division.
When it is the case, we can then return x mod s without any additional computation. See Algorithm 4.

The number of random words and remainders used by the Java algorithm follows a geometric distribution, with a success probability p = 1 − (2^L mod s)/2^L. Thus, on average, we need 1/p random words and remainders. Thus when s is small (s ≪ 2^L), we need ≈ 1 random word and remainder.

ALGORITHM 4:
The Java algorithm.
Require: source of uniformly-distributed random integers in [0, 2^L)
Require: target interval [0, s) with s ∈ [1, 2^L)
x ← random integer in [0, 2^L)
r ← x mod s
{x − r > 2^L − s ⇔ x ∈ [2^L − (2^L mod s), 2^L)}
while x − r > 2^L − s do {Application of the rejection method}
    x ← random integer in [0, 2^L)
    r ← x mod s
end while
return r

However, the maximal number of divisions required by the OpenBSD algorithm is two, whereas the Java approach could require infinitely many divisions in the worst case.

Though arbitrary integer divisions are relatively expensive on common processors, bit shifts are less expensive, often requiring just one cycle. When working with unsigned integers, a bit shift is equivalent to a division by a power of two. Thus we can compute x ÷ 2^k quickly for any power of two 2^k. Similarly, we can compute the remainder of the division by a power of two as a simple bitwise logical AND: x mod 2^k = x AND (2^k − 1).

Moreover, common general-purpose processors (e.g., x64 or ARM processors) can efficiently compute the full result of a multiplication. That is, when multiplying two 32-bit integers, we can get the full 64-bit result and, in particular, the most significant 32 bits. Similarly, we can multiply two 64-bit integers and get the full 128-bit result or just the most significant 64 bits when using a 64-bit processor. Most modern programming languages (C, C++, Java, Swift, Go. . . ) have native support for 64-bit integers. Thus it is efficient to get the most significant 32 bits of the product of two 32-bit integers. To get the most significant 64 bits of the product of two 64-bit integers in C and C++, we can use either intrinsics (e.g., __umulh when using the Visual Studio compiler) or the __uint128_t extension supported by the GNU GCC and LLVM's clang compilers.
The Swift language has the multipliedFullWidth function that works with both 32-bit and 64-bit integers. It gets compiled to efficient binary code.

Given an integer x ∈ [0, 2^L), we have that (x × s) ÷ 2^L ∈ [0, s). By multiplying by s, we take integer values in the range [0, 2^L) and map them to multiples of s in [0, s × 2^L). By dividing by 2^L, we map all multiples of s in [0, 2^L) to 0, all multiples of s in [2^L, 2 × 2^L) to one, and so forth. The (i + 1)th interval is [i × 2^L, (i + 1) × 2^L). By Lemma 2.1, there are exactly ⌊2^L/s⌋ multiples of s in the intervals [i × 2^L + (2^L mod s), (i + 1) × 2^L) since s divides the size of the interval (2^L − (2^L mod s)). Thus if we reject the multiples of s that appear in [i × 2^L, i × 2^L + (2^L mod s)), we get that all intervals have exactly ⌊2^L/s⌋ multiples of s. We can formalize this result as the next lemma.

Lemma 4.1. Given any integer s ∈ [1, 2^L), we have that for any integer y ∈ [0, s), there are exactly ⌊2^L/s⌋ values x ∈ [0, 2^L) such that (x × s) ÷ 2^L = y and (x × s) mod 2^L ≥ 2^L mod s.

Algorithm 5 is a direct application of Lemma 4.1. It generates unbiased random integers in [0, s) for any integer s ∈ (0, 2^L). This algorithm has the same probabilistic distribution of random words as the previously presented algorithms, requiring the same number of L-bit uniformly-distributed random words on average. However, the number of divisions (or remainders) is more advantageous. Excluding divisions by a power of two, we have a probability (2^L − s)/2^L of using no division at all. Otherwise, if the

ALGORITHM 5:
An efficient algorithm to generate unbiased random integers in an interval.
Require: source of uniformly-distributed random integers in [0, 2^L)
Require: target interval [0, s) with s ∈ [1, 2^L)
x ← random integer in [0, 2^L)
m ← x × s
l ← m mod 2^L
if l < s then {Application of the rejection method}
    t ← (2^L − s) mod s {2^L mod s = (2^L − s) mod s}
    while l < t do
        x ← random integer in [0, 2^L)
        m ← x × s
        l ← m mod 2^L
    end while
end if
return m ÷ 2^L

Table 1. Number of remainder computations (excluding those by known powers of two) in the production of an unbiased random integer in [0, s).

    Algorithm                      expected remainders        expected remainders     maximal remainders
                                   per integer in [0, s)      when s ≪ 2^L            per integer in [0, s)
    OpenBSD (Algorithm 3)          2                          2                       2
    Java (Algorithm 4)             2^L/(2^L − (2^L mod s))    ≈ 1                     ∞
    Our approach (Algorithm 5)     s/2^L                      ≈ 0                     1
Fig. 1. Expected number of remainder computations (excluding those by known powers of two) in the production of an unbiased random integer in [0, s) for 32-bit integers.

initial value of l = (x × s) mod 2^L is less than s, then we incur automatically the cost of a single division (to compute t), but no further division (not counting divisions by a power of two). Table 1 and Fig. 1 compare the three algorithms in terms of the number of remainder computations needed.

EXPERIMENTS
We implemented our software in C++ on a Linux server with an Intel (Skylake) i7-6700 processor running at 3.4 GHz. Our software is compiled with the "-O3 -march=native" flags. To ensure reproducibility, we make our software freely available. Though we implement and benchmark a Java-like approach (Algorithm 4), all our experiments are conducted using C++ code.

For our experiments, we use a convenient and fast linear congruential generator with the recurrence formula X_{n+1} = c × X_n mod 2^128 for a fixed multiplier c (so that X_i ∈ [0, 2^128) for i = 0, 1, 2, . . .), returning X_{n+1} ÷ 2^64 as a 64-bit random word [21]. We start from a 128-bit seed X_0. This well-established generator passes difficult statistical tests such as Big Crush [25]. It is well suited to x64 processors because they have fast 64-bit multipliers.

We benchmark the time required, per element, to randomly shuffle arrays of integers having different sizes. We can consider array indexes to be either 32-bit or 64-bit values. When working with 64-bit indexes, we require 64-bit integer divisions, which are slower than 32-bit integer divisions on x64 processors. We always use the same 64-bit random-number generator, but in the 32-bit case, we only use the least significant 32 bits of each random word. For reference, in both the 32-bit and 64-bit figures, we include the results obtained with the shuffle functions of the standard library (std::shuffle), implemented using our 64-bit random-number generator. For small arrays, std::shuffle has performance similar to our implementation of the OpenBSD algorithm, but it becomes slightly slower when shuffling larger arrays.

We present our experimental results in Fig. 3. We report the wall-clock time averaged over at least 5 shuffles. When the arrays fit in the cache, we expect them to remain in the cache. The time required is normalized by the number of elements in the array. As long as the arrays fit in the CPU cache, the array size does not affect the performance.
As the arrays grow larger, the latency of memory access becomes a factor and the performance decreases.

• In the 32-bit case, the approach with few divisions can be almost twice as fast as the Java-like approach, which itself can be at least 50% faster than the OpenBSD-like approach.

• When shuffling with 64-bit indexes as opposed to 32-bit indexes, our implementations of the OpenBSD-like and Java-like algorithms become significantly slower (up to three times) due to the higher cost of the 64-bit division. Thus our approach can be more than three times faster than the Java version in the 64-bit case.

The relative speed differences between the different algorithms become less significant when the arrays grow larger. In Fig. 2, we present the ratio of the timings of the OpenBSD-like approach and of our approach. We see that the relative benefit of our approach diminishes when the array size increases. In the 32-bit case, for very large arrays, our approach is merely 50% faster whereas it is nearly three times faster for small arrays. Using Linux perf, we estimated the number of cache misses to shuffle an array containing 100 million integers and found that the OpenBSD approach generates about 50% more cache misses than our approach.

To help the processor prefetch memory and reduce the number of cache misses, we can compute the random integers in small blocks, and then shuffle while reading the precomputed integers (see Algorithm 6). The resulting buffered algorithm is equivalent to the conventional Fisher-Yates random shuffle, and it involves the computation of the same number of random indexes, but it differs on how the memory accesses are scheduled. In Fig. 4, we see that the OpenBSD-like approach

https://github.com/lemire/FastShuffleExperiments

Fig. 2. Ratio of the timings of the OpenBSD-like approach and of our approach.
Fig. 3. Wall-clock time in nanoseconds per element to shuffle arrays of random integers. (a) 32-bit indexes; (b) 64-bit indexes.

Fig. 4. Wall-clock time in nanoseconds per element to shuffle arrays of random integers using either regular shuffles or buffered shuffles with a buffer of size 256 (see Algorithm 6). (a) 32-bit indexes; (b) 64-bit indexes.

benefits from the buffering when shuffling large arrays. A significant fraction of the running time of the regular OpenBSD-like implementation is due to caching issues. When applied to our approach, the benefits of the buffering are small, and for small to medium arrays, the buffering is slightly harmful.

ALGORITHM 6:
Buffered version of the Fisher-Yates random shuffle.
Require: array A made of n elements indexed from 0 to n − 1
Require: B ← a small positive constant (the buffer size)
i ← n − 1
Z ← some array with capacity B (the buffer)
while i ≥ B do
    for k ∈ {i, i − 1, . . . , i − B + 1} do
        Z_k ← random integer in [0, k]
    end for
    for k ∈ {i, i − 1, . . . , i − B + 1} do
        exchange A[k] and A[Z_k]
    end for
    i ← i − B
end while
while i > 0 do
    j ← random integer in [0, i]
    exchange A[i] and A[j]
    i ← i − 1
end while

Fig. 5. Wall-clock time in nanoseconds per element to shuffle arrays of random integers using either Algorithm 5 or an approach based on floating-point multiplications. (a) 32-bit indexes; (b) 64-bit indexes.

We also compare Algorithm 5 to the floating-point approach we described in § 1 where we represent the random number as a floating-point value in [0, 1) which we multiply by s to get a number in [0, s). As illustrated in Fig. 5, the floating-point approach is slightly slower (10%–30%) whether we use 32-bit or 64-bit indexes. Moreover, it introduces a small bias and it is limited to s ≤ 2^24 in the 32-bit case and to s ≤ 2^53 in the 64-bit case.

We find that the algorithm often used in OpenBSD and macOS (through the arc4random_uniform function) requires two divisions per random number generation. It is also the slowest in our tests. The Java approach, which often requires only one division, can be faster. We believe that it should be preferred to the OpenBSD algorithm. As we have demonstrated, we can use nearly no division at all and at most one in the worst case. Avoiding divisions can multiply the speed of unbiased random shuffling functions on x64 processors.

For its new random shuffle function, the Go programming language adopted our proposed approach [36].
The Go authors justify this decision by the fact that it results in 30% better speed compared with the application of the OpenBSD approach (Algorithm 3).

Our results are only relevant in the context where the generation of random integers is fast compared with the latency of a division operation. In the case where the generation of random bits is likely the bottleneck, other approaches would be preferable [3]. Moreover, our approach may not be applicable to specialized processors such as Graphics Processing Units (GPUs) that lack support for the computation of the full multiplication [17]. The CUDA API provided by Nvidia offers relevant intrinsic functions such as __umul64hi.

CODE SAMPLES

// returns value in [0, s)
// random64 is a function returning random 64-bit words
uint64_t openbsd(uint64_t s, uint64_t (*random64)(void)) {
  uint64_t t = (-s) % s;
  uint64_t x;
  do {
    x = random64();
  } while (x < t);
  return x % s;
}

uint64_t java(uint64_t s, uint64_t (*random64)(void)) {
  uint64_t x = random64();
  uint64_t r = x % s;
  while (x - r > UINT64_MAX - s + 1) {
    x = random64();
    r = x % s;
  }
  return r;
}

uint64_t nearlydivisionless(uint64_t s, uint64_t (*random64)(void)) {
  uint64_t x = random64();
  __uint128_t m = (__uint128_t)x * (__uint128_t)s;
  uint64_t l = (uint64_t)m;
  if (l < s) {
    uint64_t t = -s % s;
    while (l < t) {
      x = random64();
      m = (__uint128_t)x * (__uint128_t)s;
      l = (uint64_t)m;
    }
  }
  return m >> 64;
}

ACKNOWLEDGMENTS
The work is supported by the Natural Sciences and Engineering Research Council of Canada under grant number RGPIN-2017-03910. The author would like to thank R. Startin and J. Epler for independently reproducing the experimental results and providing valuable feedback.
REFERENCES

[1] Michael Amrein and Hans R. Künsch. 2011. A Variant of Importance Splitting for Rare Event Estimation: Fixed Number of Successes.
ACM Trans. Model. Comput. Simul.
21, 2, Article 13 (Feb. 2011), 20 pages. https://doi.org/10.1145/1899396.1899401[2] Søren Asmussen and Peter W Glynn. 2007.
Stochastic simulation: algorithms and analysis. Springer Science & Business Media, New York, NY, USA.
[3] Axel Bacher, Olivier Bodini, Hsien-Kuei Hwang, and Tsung-Hsi Tsai. 2017. Generating Random Permutations by Coin Tossing: Classical Algorithms, New Analysis, and Modern Implementation.
ACM Trans. Algorithms
13, 2, Article 24(Feb. 2017), 43 pages. https://doi.org/10.1145/3009909[4] Paul Bratley, Bennet L Fox, and Linus E Schrage. 1987.
A guide to simulation (2 ed.). Springer Science & Business Media, New York, NY, USA. [5] James M. Calvin and Marvin K. Nakayama. 1998. Using Permutations in Regenerative Simulations to Reduce Variance.
ACM Trans. Model. Comput. Simul.
8, 2 (April 1998), 153–193. https://doi.org/10.1145/280265.280273[6] A De Matteis and Simonetta Pagnutti. 1988. Parallelization of random number generators and long-range correlations.
Numer. Math.
53, 5 (1988), 595–608.[7] Luc Devroye. 1997. Random Variate Generation for Multivariate Unimodal Densities.
ACM Trans. Model. Comput.Simul.
7, 4 (Oct. 1997), 447–477. https://doi.org/10.1145/268403.268413[8] Richard Durstenfeld. 1964. Algorithm 235: Random Permutation.
Commun. ACM
7, 7 (July 1964), 420–. https://doi.org/10.1145/364520.364540[9] George Fishman. 2013.
Monte Carlo: concepts, algorithms, and applications (corrected ed.). Springer Science & Business Media, Berlin. [10] Agner Fog. 2016.
Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns forIntel, AMD and VIA CPUs
Proceedings of the 2017 Winter Simulation Conference . IEEE Press, New York, NY, USA, 131–157.[13] Jens Gustedt. 2008.
Engineering Parallel In-Place Random Generation of Integer Permutations . Springer Berlin Heidelberg,Berlin, Heidelberg, 129–141. https://doi.org/10.1007/978-3-540-68552-4_10[14] Alejandro S. Hernandez, Thomas W. Lucas, and Matthew Carlyle. 2012. Constructing Nearly Orthogonal LatinHypercubes for Any Nonsaturated Run-variable Combination.
ACM Trans. Model. Comput. Simul.
22, 4, Article 20(Nov. 2012), 17 pages. https://doi.org/10.1145/2379810.2379813[15] Chris Hinrichs, Vamsi K. Ithapu, Qinyuan Sun, Sterling C. Johnson, and Vikas Singh. 2013. Speeding Up PermutationTesting in Neuroimaging. In
Proceedings of the 26th International Conference on Neural Information Processing Systems(NIPS’13) . Curran Associates Inc., USA, 890–898.[16] Donald E. Knuth. 1969.
Seminumerical Algorithms . The Art of Computer Programming, Vol. 2. Addison-Wesley,Boston.[17] W. B. Langdon. 2009. A Fast High Quality Pseudo Random Number Generator for nVidia CUDA. In
Proceedings of the11th Annual Conference Companion on Genetic and Evolutionary Computation Conference: Late Breaking Papers (GECCO’09) . ACM, New York, NY, USA, 2511–2514. https://doi.org/10.1145/1570256.1570353[18] Daniel Langr, Pavel Tvrdík, Tomáš Dytrych, and Jerry P. Draayer. 2014. Algorithm 947: Paraperm—Parallel Generationof Random Permutations with MPI.
ACM Trans. Math. Softw.
41, 1, Article 5 (Oct. 2014), 26 pages. https://doi.org/10.1145/2669372[19] Averill M. Law. 2007.
Simulation Modeling and Analysis (fourth edition ed.). McGraw-Hill, New York, NY, USA.[20] Pierre L’Ecuyer. 1990. Random Numbers for Simulation.
Commun. ACM
33, 10 (Oct. 1990), 85–97. https://doi.org/10.1145/84537.84555[21] Pierre L’Ecuyer. 1999. Tables of linear congruential generators of different sizes and good lattice structure.
Mathematicsof Computation of the American Mathematical Society
68, 225 (1999), 249–260.[22] Pierre L’Ecuyer. 2015. History of uniform random number generation. In
Proceedings of the 2017 Winter SimulationConference . IEEE Press, New York, NY, USA, 1–17.[23] Pierre L’Ecuyer. 2015. Random number generation with multiple streams for sequential and parallel computing. In
Proceedings of the 2015 Winter Simulation Conference . IEEE Press, New York, NY, USA, 31–44.[24] Pierre L’Ecuyer, François Blouin, and Raymond Couture. 1993. A Search for Good Multiple Recursive Random NumberGenerators.
ACM Trans. Model. Comput. Simul.
3, 2 (April 1993), 87–98. https://doi.org/10.1145/169702.169698[25] Pierre L’Ecuyer and Richard Simard. 2007. TestU01: A C Library for Empirical Testing of Random Number Generators.
ACM Trans. Math. Softw.
33, 4, Article 22 (Aug. 2007), 40 pages. https://doi.org/10.1145/1268776.1268777 [26] Pierre L'Ecuyer. 2012. Random number generation. In
Handbook of Computational Statistics . Springer, Berlin,35–71.[27] George Marsaglia. 2003. Xorshift RNGs.
Journal of Statistical Software
8, 14 (2003), 1–6.[28] Makoto Matsumoto and Takuji Nishimura. 1998. Mersenne Twister: A 623-dimensionally Equidistributed UniformPseudo-random Number Generator.
ACM Trans. Model. Comput. Simul.
8, 1 (Jan. 1998), 3–30. https://doi.org/10.1145/272991.272995
[29] Takayuki Osogami. 2009. Finding Probably Best Systems Quickly via Simulations.
ACM Trans. Model. Comput. Simul.
ACM Trans. Model. Comput.Simul.
8, 1 (Jan. 1998), 71–102. https://doi.org/10.1145/272991.273010[32] François Panneton and Pierre L’Ecuyer. 2005. On the Xorshift Random Number Generators.
ACM Trans. Model. Comput.Simul.
15, 4 (Oct. 2005), 346–361. https://doi.org/10.1145/1113316.1113319[33] Mutsuo Saito and Makoto Matsumoto. 2008.
SIMD-Oriented Fast Mersenne Twister: a 128-bit Pseudorandom NumberGenerator . Springer Berlin Heidelberg, Berlin, Heidelberg, 607–622. https://doi.org/10.1007/978-3-540-74496-2_36[34] Peter Sanders. 1998. Random permutations on distributed, external and hierarchical memory.
Inform. Process. Lett.
BMC Bioinformatics
11, 1 (16 Jun 2010), 329. https://doi.org/10.1186/1471-2105-11-329[36] Josh Bleecher Snyder. 2017. math/rand: add Shuffle. https://go-review.googlesource.com/c/go/+/51891 [last checkedOctober 2017]. (2017).[37] The Go authors 2017. Package rand implements pseudo-random number generators. https://github.com/golang/go/blob/master/src/math/rand/rand.go [last checked October 2017]. (2017).[38] Jeffrey S. Vitter. 1985. Random Sampling with a Reservoir.
ACM Trans. Math. Softw.
11, 1 (March 1985), 37–57.https://doi.org/10.1145/3147.3165[39] John Von Neumann. 1951. Various techniques used in connection with random digits.
National Bureau of StandardsSeries
12 (1951), 36–38.[40] Michael Waechter, Kay Hamacher, Franziska Hoffgaard, Sven Widmer, and Michael Goesele. 2012.