[PDF] Quantile Mechanics II: Changes of Variables in Monte Carlo methods and GPU-Optimized Normal Quantiles

Abstract

This article presents differential equations and solution methods for the functions of the form Q(x)= F −1 (G(x)) , where F and G are cumulative distribution functions. Such functions allow the direct recycling of Monte Carlo samples from one distribution into samples from another. The method may be developed analytically for certain special cases, and illuminate the idea that it is a more precise form of the traditional Cornish-Fisher expansion. In this manner the model risk of distributional risk may be assessed free of the Monte Carlo noise associated with resampling. Examples are given of equations for converting normal samples to Student t, and converting exponential to hyperbolic, variance gamma and normal. In the case of the normal distribution, the change of variables employed allows the sampling to take place to good accuracy based on a single rational approximation over a very wide range of the sample space. The avoidance of any branching statement is of use in optimal GPU computations as it avoids the effect of {\it warp divergence}, and we give examples of branch-free normal quantiles that offer performance improvements in a GPU environment, while retaining the best precision characteristics of well-known methods. We also offer models based on a low-probability of warp divergence. Comparisons of new and old forms are made on the Nvidia Quadro 4000, GTX 285 and 480, and Tesla C2050 GPUs. We argue that in single-precision mode, the change-of-variables approach offers performance competitive with the fastest existing scheme while substantially improving precision, and that in double-precision mode, this approach offers the most GPU-optimal Gaussian quantile yet, and without compromise on precision for Monte Carlo applications, working twice as fast as the CUDA 4 library function with increased precision.

Full PDF

QQuantile Mechanics II:Changes of Variables in Monte Carlo methodsandGPU-Optimized Normal Quantiles

William T. Shaw ∗ , Thomas Luu † and Nick Brickman ‡ October 23, 2018

Abstract

With ﬁnancial modelling requiring a better understanding of model risk, it is helpful to be able tovary assumptions about underlying probability distributions in an eﬃcient manner, preferably withoutthe noise induced by resampling distributions managed by Monte Carlo methods. This article presentsdiﬀerential equations and solution methods for the functions of the form Q ( x ) = F − ( G ( x )), where F and G are cumulative distribution functions. Such functions allow the direct recycling of Monte Carlo samplesfrom one distribution into samples from another. The method may be developed analytically for certainspecial cases, and illuminate the idea that it is a more precise form of the traditional Cornish-Fisherexpansion. In this manner the model risk of distributional risk may be assessed free of the Monte Carlonoise associated with resampling. The method may also be regarded as providing both analytical andnumerical bases for doing more precise Cornish-Fisher transformations. Examples are given of equationsfor converting normal samples to Student t, and converting exponential to hyperbolic, variance gammaand normal. In the case of the normal distribution, the change of variables employed allows the samplingto take place to good accuracy based on a single rational approximation over a very wide range of thesample space. The avoidance of any branching statement is of use in optimal GPU computations asit avoids the eﬀect of warp divergence , and we give examples of branch-free normal quantiles that oﬀerperformance improvements in a GPU environment, while retaining the best precision characteristics ofwell-known methods. We also oﬀer models based on a low-probability of warp divergence. Comparisonsof new and old forms are made on the Nvidia Quadro 4000, GTX 285 and 480, and Tesla C2050 GPUs.We argue that in single-precision mode, the change-of-variables approach oﬀers performance competitivewith the fastest existing scheme while substantially improving precision, and that in double-precisionmode, this approach oﬀers the most GPU-optimal Gaussian quantile yet, and without compromise onprecision for Monte Carlo applications, working twice as fast as the CUDA 4 library function withincreased precision. Keywords: Monte Carlo, Student, hyperbolic, variance gamma, computational ﬁnance, quantile mechanics,normal quantile, Gaussian quantile, GPU, Acklam, AS241, inverse error function, erﬁnv, inverse CDF, probit. ∗ Corresponding author: Departments of Mathematics and Computer Science, University College London, Gower Street,London WC1E 6B, England; E-mail: [email protected] † Department of Mathematics, University College London; E-mail: [email protected] ‡ Taylor Brickman Ltd; E-mail: [email protected] a r X i v : . [ q -f i n . C P ] D ec .T. Shaw, T. Luu & N. Brickman: Monte Carlo Changes of Variable The construction of Monte Carlo samples from a distribution is facilitated if one has a knowledge of thequantile function w ( u ) of a distribution. If F ( x ) is the cumulative distribution function, then the quantile w ( u ) is the solution of the equation F ( w ( u )) = u . (1)A knowledge of the function w ( u ) makes Monte Carlo simulation straightforward: given a random sample U from the uniform distribution, a sample from the target distribution characterized by a density anddistribution f ( x ) , F ( x ), is X = w ( U ) . (2)While it is commonplace to use the uniform distribution on the unit interval as the base distribution forsampling, there is in fact no need to do so . In his critique of copula theory [13], T. Mikosch stated Thereis no particular mathematical or practical reason for transforming the marginals to the uniform distributionon (0; 1) and proceeded to consider exponential and normal coordinates. On the other hand, samples fromthe uniform distribution are readily generated, and in the computer science domain it remains a topic ofcontinuing study to develop better uniform random number generators.For example, a great deal of intellectual eﬀort has been expended on highly eﬃcient sampling fromthe normal and other well-known distributions. Given such samples, can we leverage the work done tocreate samples from other distributions in an eﬃcient manner? This article will address this question in theaﬃrmative. In principle the answer is trivial: given a sample Z from a distribution with CDF G ( x ), we ﬁrstwork out G ( X ) which is uniform. Then we can apply the quantile function F − ( x ) associated with a targetdistribution with CDF F and form F − ( G ( x )) as a sample from that target distribution. In general F , G and their inverses can be rather awkward special functions (see e.g. [17]) , so a direct route to the object Q ( x ) = F − ( G ( x )) would be helpful.There are at least two ways of developing this idea. One route is to postulate interesting forms for thecomposite mapping. This has been explored by Shaw and Buckley [19] based on Gilchrist’s theory of quantiletransformations [8]. In this way we can ﬁnd skew and kurtotic variations of any base distribution, whileavoiding, in a controlled manner, the introduction of “negative density” problems that arise in traditionalGram-Charlier methods. The second route is to try to simplify the mapping given a choice of F and G . Sucha route can be found by the method of diﬀerential equations for quantile functions developed by Steinbrecherand Shaw [21]. In the next section we will give a brief review of that approach, and in Section 4 we shalluse this insight to build a new representation of the Student t quantile in terms of a series expansion of aGaussian. This neatly tidies up the well-known expansions already available for the Student distribution asthe degrees of freedom becomes large, and we obtain a more compact power series in the normal randomvariable.In view of the importance of Gaussian sampling, we will also explore how to make normal randomvariables by the change of variables approach, and explore the use of the exponential and even Studentdistributions as intermediaries. In this simple cases we can write down the objects F − ( G ( z )) very explicitlyand the issue becomes one of the eﬃcient approximation and computation of such objects. We will showthat this oﬀers a useful performance beneﬁt in a GPU environment, where branching algorithms may besubject to signiﬁcant performance penalties. Our change-of-variables approach will allow costly branching tobe avoided or reduced to a low probability and we will demonstrate the beneﬁts in the CUDA environmentfor programming NVIDIA GPUs. This part of the paper of necessity involves ideas from computer science asmuch as applied mathematics, but there are lessons in approximation theory to be learnt from this analysis,which reveals the relative merits of using diﬀerent types of error norm in the constructions in terms of theirimpact on the actual precision of the relevant function approximations.The plan of this work is as follows. In Section 2 we review the diﬀerential approach to quantile functionsand introduce a new diﬀerential equation that characterizes the mapping from samples of one probabilitydistribution to another. We give speciﬁc examples for recycling samples of exponential and Gaussian randomvariables. In Section 3 we focus on Gaussian backgrounds and review the transformation from Gaussianrandom variables to those from Student t distribution. We will show how this recasts traditional Cornish-Fisher expansions into power series solutions of non-linear ODEs, and give a brief introduction to how thismight be applied. This rather clear observation was ﬁrst made to me by Peter Jaeckel .T. Shaw, T. Luu & N. Brickman: Monte Carlo Changes of Variable If f ( x ) is the probability density function for a real random variable X , the ﬁrst order quantile ODE isobtained by diﬀerentiating Eqn. (1), to obtain: f ( w ( u )) dw ( u ) du = 1 , (3)where w ( u ) is the quantile function considered as a function of u , with 0 ≤ u ≤

1. Applying the productrule with a further diﬀerentiation we obtain: f ( w ( u )) d w ( u ) du + f (cid:48) ( w ( u )) (cid:18) dw ( u ) du (cid:19) = 0 . (4)This may be reorganized as d w ( u ) du = H ( w ( u )) (cid:18) dw ( u ) du (cid:19) , (5)where H ( w ) = − ddw log { f ( w ) } . (6)and the simple rational form of H ( w ) for many common distributions, particularly the Pearson family, allowsanalytical series solutions to be developed [21]. This last equation we refer to as the second order quantileequation. Now suppose that we make a change of independent variable in the second order quantile equation Eqn (8).We let v = q ( u ), and regard w as a function of v by writing w ( u ) = Q ( v ). Elementary application of thechain rule and some algebra gives us: d Q ( v ) dv + q (cid:48)(cid:48) ( u )[ q (cid:48) ( u )] dQ ( v ) dv = H ( Q ( v )) (cid:18) dQ ( v ) dv (cid:19) , (7)In general this is a rather awkward diﬀerential equation. However, when we regard q ( u ) as being itself aquantile function, we can make some simpliﬁcations. If q ( u ) is a quantile mapping, it satisﬁes an ODE ofthe form d q ( u ) du = ˆ H ( q ( u )) (cid:18) dq ( u ) du (cid:19) , (8)where ˆ H ( w ) = − ddw log { ˆ f ( w ) } . (9) .T. Shaw, T. Luu & N. Brickman: Monte Carlo Changes of Variable f is the probability density function associated with the quantile q ( u ). So we can simplify the ODE to d Q ( v ) dv + ˆ H ( q ( u )) dQ ( v ) dv = H ( Q ( v )) (cid:18) dQ ( v ) dv (cid:19) , (10)and bearing in mind that v = q ( u ) we arrive at the “Recycling Ordinary Diﬀerential Equation”: d Q ( v ) dv + ˆ H ( v ) dQ ( v ) dv = H ( Q ( v )) (cid:18) dQ ( v ) dv (cid:19) , (11)We now turn to two particularly interesting cases, rather inspired by Mikosch’s suggestions [13]. In this case we have the following obvious sequence of manipulations:ˆ f ( x ) = 1 √ π e − x / (12)log ˆ f ( x ) = − / π ) − x / ddx log ˆ f ( x ) = − x (14)ˆ H ( v ) = v (15)and we arrive at the Recycling ODE for a Gaussian background as d Q ( v ) dv + v dQ ( v ) dv = H ( Q ( v )) (cid:18) dQ ( v ) dv (cid:19) , (16)This is an interesting example to consider for target distributions along the entire real line. In this case we have the following obvious sequence of manipulations:ˆ f ( x ) = e − x , log ˆ f ( x ) = − x, ddx log ˆ f ( x ) = − , ˆ H ( v ) = 1 (17)and we arrive at the Recycling ODE for a exponential background as d Q ( v ) dv + dQ ( v ) dv = H ( Q ( v )) (cid:18) dQ ( v ) dv (cid:19) , (18)This is an interesting example to consider for target distributions along the positive real line. For distributionsthat are asymptotically exponential in both directions it can be used in two pieces. In a Gaussian background we work with the Recycling ODE in the form Q (cid:48)(cid:48) + v Q (cid:48) = H ( Q )( Q (cid:48) ) (19)where the explicit dependence on v is suppressed for brevity, and ’ denotes d/dv . The target distributionis encoded through the function H . Note that it is not required in any sense that the target distribution is“close” to, or asymptotic to a Gaussian. This is an exact relationship governing the function Q that is thecomposition of the Gaussian CDF followed by the quantile of the target distribution. But such a relationshipmust contain all information relevant to the creation of an expansion of one distribution in terms of another.In particular, we should be able to re-create known and new expansions of Cornish-Fisher type. GeneralizedCornish-Fisher expansions have been considered in the notable paper by Hill and Davis [9], but the step toconsidering the matter as the solution of a single diﬀerential equation is, so far as this author is aware, anew one. .T. Shaw, T. Luu & N. Brickman: Monte Carlo Changes of Variable This is an interesting case for several reasons:1. We can illustrate the method;2. We can recover a well known asymptotic series;3. We can develop that series to arbitrary numbers of terms;4. We can explore the limitations of the known series;5. We can develop an alternative numerical method and explore purely numerical options.The density and associated H -function for the Student case can be written down as f T n ( x ) = 1 √ nπ Γ[( n + 1) / n/ (cid:18) x n (cid:19) − n +12 , H T n ( Q ) = (cid:18) n (cid:19) Q Q /n (20)where n is the degrees of freedom (not necessarily an integer), and the Recycling ODE can be written in theform (cid:18) Q n (cid:19)(cid:18) Q (cid:48)(cid:48) + v Q (cid:48) (cid:19) = (cid:18) n (cid:19) Q ( Q (cid:48) ) (21)We note that if we let n → ∞ we obtain Q (cid:48)(cid:48) + v Q (cid:48) = Q ( Q (cid:48) ) (22)and this has the desired solution Q = v . More generally we can look at series solutions, but should bemindful of the fact that the term Q /n is present - this is a hint that the behaviour of series for Q (cid:28) √ n and Q (cid:29) √ n could be rather diﬀerent. Such considerations do not always apply if one is thinking in apurely asymptotic framework. For any ﬁnite n , no matter how large, there will always be values of Q suchthat the behaviour is far from Gaussian. This was alluded to in [17], where it was noted that the knownCornish-Fisher expansion always goes wrong in the tails at some point.We also need to consider boundary conditions. The derivative of any ordinary quantile function at apoint is the inverse of the PDF at the corresponding quantile. We ﬁrst work around the point u = 1 / v = 0 in the Gaussian coordinate. If z ( u ) and t ( u ) are the ordinary quantiles for theGaussian and Student distributions, then we have z (1 /

2) = 0 ,z (cid:48) (1 /

2) = √ πt (1 /

2) = 0 ,t (cid:48) (1 /

2) = √ nπ Γ[ n/ n + 1) /

2] (23)It follows that the centre conditions we wish to apply to the Recycling ODE are just: Q (0) = 0 ,Q (cid:48) (0) = γ ≡ (cid:114) n n/ n + 1) /

2] (24)where the latter expression γ arises as the ratio of the derivatives. We now develop a series solution about the centre, and we expect that it will be reasonable to treat thesolution as “close to Gaussian” if Q (cid:28) n . We assume, as both the normal and Student quantiles aresymmetric, that Q ( v ) ∼ ∞ (cid:88) k =0 c k v k +1 (25) .T. Shaw, T. Luu & N. Brickman: Monte Carlo Changes of Variable c = γ . We use the tilde notation to indicate that at this point we have no presumption as to whetherthe resulting series will be convergent for all v or form some kind of asymptotic series. We ﬁnd that c = ( n + 1) γ − nγ nc = (cid:0) n + 8 n + 1 (cid:1) γ + (cid:0) − n − n (cid:1) γ + 3 n γ n (26)Subsequent terms may be generated by iteration of the RODE, and in this case, after some algebra, we ﬁndthat (2 i + 3)(2 i + 2) c i +1 = − (2 i + 1) c i + i (cid:88) l =0 i − l (cid:88) m =0 a lm ( n ) c i − l − m c l c m − θ ( i ) n i − (cid:88) l =0 i − − l (cid:88) m =0 (2 m + 1) c i − − l − m c l c m , (27)where θ (0) = 0 , θ ( i ) = 1 if i ≥

1, and a lm ( n ) = (1 + 1 n )(2 l + 1)(2 m + 1) − n m (2 m + 1) (28) We now develop a series solution about the right tail Q → ∞ . We begin by assuming that Q (cid:29) n . TheRecycling ODE becomes Q ( Q (cid:48)(cid:48) + vQ (cid:48) ) = ( n + 1)( Q (cid:48) ) (29)Following some experimentation, we make the change of variables P ( v ) = 1 Q ( v ) n (30)and this reduces the ODE to P (cid:48)(cid:48) ( v ) + vP (cid:48) ( v ) = 0 (31)The solution of this satisfying the condition that P ( v ) → v → ∞ is P ( v ) ∝ erfc( v √ c , Q ( v ) ∼ c (cid:20)

12 erfc( v √ (cid:21) − /n (33)We see that the solution has emerged naturally as Q ( v ) ∼ c (cid:20) − Φ( v ) (cid:21) − /n (34)where Φ is the Gaussian CDF. The asymptotic diﬀerential equation is scale invariant so we have to determine c by other means. It is possible that it might be possible to determine it by a matching argument, but it issimpler to now appeal to other known properties of the Student distribution. In [17] the tail behaviour ofthe Student CDF was determined (see Eqns. (60-62) of [17]) and we can deduce that c = √ n (cid:20) n √ π Γ( n/ n + 1) / (cid:21) − /n (35)If we step back from these calculations it becomes clear what is happening. The Recycling ODE is startingto reconstruct a solution that combines the change of variable w = 1 − Φ( v ) with the asymptotic power seriesof the ordinary Student quantile. .T. Shaw, T. Luu & N. Brickman: Monte Carlo Changes of Variable Expansions of Cornish-Fisher type can be found in the statistics literature. One that is reasonably wellknown is the expansion of the Student random variable t in terms of the Gaussian random variable z , forlarger values of the degrees of freedom n . It is quoted, for example, as identity 26.7.5 of [3]. t = z + z + z n + 5 z + 16 z + 3 z n + 3 z + 19 z + 17 z − z n + 79 z + 776 z + 1482 z − z − z n + . . . (36)An equation of true Cornish-Fisher type (cf identity 26.2.49 of [3]) can be obtained by transforming (provided n >

2) to a variable s with unit variance : s = t (cid:112) − /n and re-expanding in inverse powers of n . ThatEqn. (36) is somehow incomplete is evident by the fact that z appears in every term, z in all but the ﬁrst,and so on. The matter is resolved nicely by ﬁrst observing that γ = 1 + 14 n + 132 (cid:18) n (cid:19) − (cid:18) n (cid:19) − (cid:0) n (cid:1) (cid:0) n (cid:1) O (cid:32)(cid:18) n (cid:19) (cid:33) , (37)which sums up all the z -terms. Similarly c = 14 n + 16 (cid:18) n (cid:19) + 17384 (cid:18) n (cid:19) − (cid:18) n (cid:19) − (cid:0) n (cid:1) O (cid:32)(cid:18) n (cid:19) (cid:33) (38)and so on. So the series solution of the diﬀerential equation constitutes a re-summation of the knownasymptotic series where the coeﬃcient of each power of z is computed exactly. This means that we can usethe series without assuming that n is large, as it is a series valid in some domain for z , the existence of whichdoes not depend on n being large. We now turn to the quality of the results. This can be assessed precisely by the use of an exact representationof the composite function F − N (Φ( z )), where Φ is the normal CDF and F n the Student CDF. The exactformula for the Student CDF for all real n is given in terms of inverse beta functions by Shaw [17], andthere are known simpler forms for n = 1 , ,

4. These are also given in [17] and are also now available on the

Wikipedia page on quantile functions [16]. The case n = 4 is an interesting case as it is known exactly, inthe boundary case where kurtosis is inﬁnite, and there is some evidence from work by Fergusson and Platen[7] that it is a good case for modelling daily world index log-returns. We shall therefore develop this insome detail. It turns out that working as far as c is a useful point. A detailed calculation shows that theprecision (i.e. relative error) of the central power series is then less than 2 × − on | z | <

4. Given that z is a standard Gaussian this is within the range of ± γ = 43 (cid:114) π ∼ . y = z ∗ z , t = z*(1.06384608107048714 +y*(0.0735313753642658509 +y*(0.00408737916150927847 +y*(0.000157376276663230562 +y*(4.31939824140363509e-6 +y*(9.56881464639227278e-8 +y*(2.09256881803614446e-9 +y*(3.87962938209093352e-11 +y*(2.72326084541915671e-13 +(2.90528930162373328e-15 +4.59490133995901375e-16*y)*y))))))))) .T. Shaw, T. Luu & N. Brickman: Monte Carlo Changes of Variable | z | > n = 4 it is suﬃcient to use just two termsof the known tail series. This gives us, in general, for the positive tail (the negative tail being treated bysymmetry) w = (1 − Φ( z )) n √ π Γ( n/ n + 1) / t = √ nw − /n (1 − n + 12( n + 2) w /n ) (40)and for the case n = 4: w = (1 − Φ( z )) 163 t = 2 w − / (1 − w / ) (41)The optimal crossover is then in fact at z = 3 . . × − overthe entire range of z The analysis for the Student t case, although rather specialized, also allows the appraisal of direct numer-ical schemes. The direct numerical solution of the RODE can be done using standard methods. Within

Mathematica version 6, the use of

NDSolve with high precision and accuracy goals, explicit Runge-Kuttaand sixth-order diﬀerences leads to an precision of better than 5 × − on the range | z | <

6, which is excel-lent. Of course, one must also consider sampling eﬃciency issues arising from such interpolated numericalschemes, but they can be made the basis of a further, e.g. rational approximation if speed is an issue. Sucha numerical scheme will be exploited in the examples considered below. There is clearly much more scopefor transforming between Student distributions of diﬀerent n and Gaussians, and later in this paper we willprovided an example of transforming a T to a Gaussian. Given that there closed-forms for the quantiles for T , T and T these might serve as useful intermediate distributions for other applications as well, but thiswill be explored elsewhere. In this section we consider the issues involved in building more accurate and/or faster quantiles for live useon computer simulations. These will be important for any distribution, but the particular focus here will beon ﬁnding new forms of quantile for the normal distribution. The construction of the normal quantile, alsoknow as “probit”, has a long and interesting history - see [21] and the references therein for details . Thisis important given its base case use in general modelling applications, and in particular given its massivelyrepeated use in the solution of stochastic diﬀerential equations. The issues we consider are1. The target precision;2. The domain over which the target is to be rigorously achieved;3. The behaviour outside the target domain;4. The choice of intermediate probability distribution to be used in the recyling scheme;5. The algorithm’s performance on typical CPU and GPU architectures.The last point applies to the particular approach developed here, but the ﬁrst four points are relevant toany approach. In the following sections we explore these issues and the associated choices. This gives somebasic parameters relevant to our general scheme of implementing “Quantile Mechanics”. In the past, diversechoices have been made for precision and domain, and we think it appropriate to better characterize thesein mathematical form. .T. Shaw, T. Luu & N. Brickman: Monte Carlo Changes of Variable Given an exact representation of a function Q ( u ) and an approximation of it Q a ( u ) we can consider variousnorms on their diﬀerences. We will work in the supremum norm for the relative precision over a domain D to be speciﬁed, and seek to minimize the relative error (cid:15) R = sup D (cid:12)(cid:12)(cid:12)(cid:12) Q a ( u ) Q ( u ) − (cid:12)(cid:12)(cid:12)(cid:12) . (42)What this seeks to do is to maximize the quality of the signiﬁcant ﬁgures of the result. We do have tobear in mind that such a goal may be interpreted in two ways. The ﬁrst approach, that will be taken here,is to set such a target for the theoretical precision of an approximation (typically polynomial or rational),given an abundance of computer precision. The second approach is to set the target based on the actualerrors in a given limited precision environment. While the second approach is clearly more pragmatic,the answers we will get will depend signiﬁcantly on implementation/hardware/compiler details. We willtherefore consider the ﬁrst approach, which allows us to develop the analysis mathematically, and which isin common use in modelling. So, for example, the ﬁrst form (i.e. the simple form without an application ofthe Newton-Raphson-Halley iteration) of the normal quantile developed by Acklam [1] sets (cid:15) R (cid:46) . × − (43)and our own computational analysis of Wichura’s AS241 [23] suggests that in that case (cid:15) R < . × − (44)when worked out in high precision arithmetic. The Acklam error bound is also attained in double-precisionarithmetic, while AS241 errors (in our implementation) peak at about 5 × − .In our own studies we have chosen the set the target theoretical precision based on IEEE levels . Giventhat IEEE single precision has a 24 bit mantissa, we will aim in single, or ﬂoat precision for (cid:15) Rf = 2 − = 2 . × − . (45)Since IEEE double precision has 53 bits of precision we will set (cid:15) Rd = 2 − = 5 . × − (46)for double precision. So these values are half the values often quoted as “machine epsilon” for the two cases.The matter is somewhat ethereal in view of diﬀerent possible rounding methods and compiler variations, andin some circumstances a ﬂoat target of 2 − = 1 . × − may be perfectly acceptable. The error targetmight also be expressed in a diﬀerent norm. So, for example, M. Giles works with an L on the relative errorand a theoretical target of about 10 − in his GPU Computing Gems article [10]. That work also includes avery useful discussion of realized errors. This is a somewhat subtle matter and massively diﬀerent choices are possible. The quantile mapping fordistributions along the entire real line is from the domain (0 ,

1) to the range that is the real line. In practicecomputation will take place on an interval U ( (cid:15) L , (cid:15) R ) = ( (cid:15) L , − (cid:15) R ) (47)and the question for this section is the choice of (cid:15) L , (cid:15) R . The diﬃculty is that there is some asymmetry in thematter, as is well discussed on P. Acklam’s web page [1]. It is clear that the value of (cid:15) R should be related tothe machine epsilon, as there is simply no possibility of getting closer to unity in ﬁnite precision arithmetic.For example, Acklam’s discussion considers the errors up to both the points 1 − (cid:15) Rd and 1 − (cid:15) Rd .The problems are more to do with the choices made for (cid:15) L . One might argue on symmetry groundsthat one should take (cid:15) L = (cid:15) R . Indeed, if the analysis is done for the inverse error function, rather thanthe quantile, this is entirely natural, and is the approach taken by Giles [10], with a symmetric range for We are grateful to M. Pont of NAG for advice on these points .T. Shaw, T. Luu & N. Brickman: Monte Carlo Changes of Variable erﬁnv(x) from ( − (cid:15) G , − (cid:15) G ) where (cid:15) G is about 5 . × − for ﬂoat and 1 . × − for double, basedon the largest required values for w = − log(1 − x ) of 16 and 36.While such symmetric choices are sensible for erﬁnv , it seems to us that they do not go far enough forthe case of the quantile function with the domain of the unit interval, with lower bound at zero. Acklam [1]takes the view that the quantile needs to have a controlled precision down to at least 2 . × − , whichis the smallest full precision number that can be given in double precision, and his results show controllederrors on a still larger range.If one is interested in representing a mathematical function for diverse applications to as much precisionas possible, we take the view that a domain on the lines of the extended form proposed by Acklam is entirelyappropriate. However, for Monte Carlo applications this is massively over-engineered, but we argue also thatthe Giles method does not quite go far enough. To make the case for a diﬀerent choice, we consider thetypical scope of a Monte Carlo simulation.Consider a set of N independent uniformly distributed random numbers, { U i } , i = 1 , . . . , N taken fromthe interval (0 , U min = min { U i } . (48)The distribution of U min is easily worked out using elementary probability: P ( U min < u ) = 1 − P ( U min ≥ u ) = 1 − P ( U i ≥ u, ∀ i ) = 1 − P ( U i ≥ u ) N (49)and given that each U i has a uniform distribution we see that P ( U min < u ) = 1 − (1 − u ) N (50)and it is then easily computed that E ( U min ) = 1 N + 1 , Var( U min ) = N ( N + 1) ( N + 2) , (51)and we see that the average smallest value is about 1 /N for large N , but the standard deviation is about thesame as the mean. Inverting the distribution function we see that the quantile function for this minimum isgiven by Q min ( v ) = 1 − (1 − v ) /N ∼ − N log(1 − u ) as N → ∞ (52)which is an exponential distribution. For large N and small U this quantile function is approximately Q min ( v ) ∼ uN . (53)In other words, we have about a 1% chance of encountering numbers lower than1100 ∗ N (54)in our simulation. Even a rather modest Monte Carlo simulation therefore has a signiﬁcant chance ofproducing numbers going well below the machine epsilon in ﬂoat mode, though not in double precision.One might argue that the inﬂuence of a few such small numbers on the overall computation of expectationsmight be small. The diﬃculty with such a view is that one can imagine situations where the computation isdominated by minima (lookbacks, options on the worst performer of a basket, systems coupled by a Claytoncopula with a lot of downside correlation, deeply out of the money Puts, and so on). Furthermore, the useof antithetic methods might result in low values being reﬂected to high values as well after the applicationof the target distribution’s quantile function.In view of this we argue that the largest allowable low-end truncation point should be no higher thanabout 10 − , meaning that 99% of random numbers would be mapped to the target distribution accuratelyfor samples up to N = 10 in size. Options should be provided to go to smaller values for special problemsrequiring very large samples. What this means is that in the left tail the machine epsilon bound will beignored in ﬂoat mode, and replaced by a number no higher than 10 − . Bearing in mind that our algorithmswill be symmetric in any case, we will take our target region as( ε, − ε ) (55) .T. Shaw, T. Luu & N. Brickman: Monte Carlo Changes of Variable ε < − for ﬂoat and ε < . × − for double . (56)In the case of double precision, it seems to us that the machine epsilon level is already small enough tomanage a super-large Monte Carlo simulation. We see no need to go down to the smallest full precision ﬂoat,about 10 − , let alone to the smallest possible double, for Monte Carlo applications, as such small numbersare very unlikely to occur. Having made the choices in the previous sub-section, we wish to insure that the values of random variablesgenerated if u < ε are not too poor in their error characteristics. That is, we require that the error growthbelow this level is reasonable. For example, if the relative error in a one in a billion sample point is perhaps10 − , it will not matter, but much larger relative errors cannot be accepted. Moving now to the particular case of the normal quantile, we have several options for intermediate distribu-tions of interest. The choices we will explore are suitable scalings of: • The one-sided exponential distribution; • The Student t distribution with two degrees of freedom; • An “almost-normal” distribution; • An “almost-chi-squared” distribution.The one-sided exponential distribution has PDF, CDF and Quantile given by, for x >

0, 0 < u < f E ( x ) = e − x , F E ( x ) = 1 − e − x , Q E ( u ) = − log(1 − u ) . (57)The Student t distribution with two degrees of freedom has PDF, CDF and Quantile given by, for all real x , 0 < u < f t ( x ) = 1(2 + x ) / , F t ( x ) = 12 + x √ x + 2 , Q t ( u ) = 2 u − (cid:112) u (1 − u ) . (58)The almost-normal distribution is literally one that is very close to a standard Gaussian but has a simpleclosed-form quantile. The density and distribution function are given by f AN ( x ) = e − x π | x | π (cid:113) − e − x π (59) F AN ( x ) = 12 + 12 sign( x ) (cid:113) − e − x π (60)and it is then an elementary exercise to compute the quantile function as Q AN ( u ) = sign( u − / ∗ (cid:114) − π u (1 − u )) . (61)The almost-chi-squared distribution is that of a random variable that is the square of an almost-normaldistribution. The density, distribution and quantile are given by f AC ( x ) = e − xπ π (cid:112) − e − xπ (62) F AC ( x ) = (cid:113) − e − xπ (63) This is a distribution one of us (WS) invented for an examination question, and it is left to the reader to compare itsdensity, distribution and quantile with the exact normal. .T. Shaw, T. Luu & N. Brickman: Monte Carlo Changes of Variable Q AC ( u ) = − π − u ) . (64)This distribution can also be simulated via the 2 − u → Q AN ( u ), that has the same image as Q AC ( u ), Q AN ( u ) = − π u (1 − u )) . (65)It is informative to look at the current list of approximate normal quantiles to see how prevalent thesechanges of variable are. • Our earlier branchless formulae [20] relied on the one-sided exponential distribution; • Giles’ method [10] is essentially based on transforming the almost-gamma in the central region and thealmost-normal in the tail; • Acklam’s non-iterated formula [1] is the asymptotic form of almost-normal in the tail, as is AS241 [23].The Moro method [14] uses a double-log in the tail region.

In this section we consider some of the details of computer science that impact the mathematical analysis inthe context of trying to be as eﬃcient as possible.

On a CPU there is normally very sophisticated management of algorithm branching, to the extent that(a) having a branch at all is managed eﬃciently and (b) a slow branch will not compromise the operationof a fast branch. On a GPU we have to confront the issue of warp divergence . A warp is a group of 32threads (in current generation NVIDIA GPUs) that evaluate concurrently. If there is an IF statement andone thread needs to execute a diﬀerent branch, then the current behaviour will be for all threads to executeboth branches. This is especially problematic in quantile evaluation where typically 2 or more regionsare considered for separate evaluation, with (as noted above) a more computationally expensive change ofvariables in the tail. There are two existing approaches: • create an essentially branchless algorithm where one formula applies for all relevant values; • lower the probability of a warp divergence to a much smaller value.The ﬁrst method is that taken by an earlier version of this study [20], and the second approach is that takenby Giles [10]. The essential question here is on the balance of eﬀort required to go essentially branchlessversus that where an unlikely branch remains but with a simpler formula in the middle region. In our ﬁnalmodel for double precision we will develop a third approach based on the idea of warp voting employing the all () function, in which one of two essentially branchless algorithms will be chosen depending on the warpas a whole. The presence in some architectures of a “fused multiply-add” (FMA) operation in a single hardware operationcreates a performance bias in support of algorithms that rely on repeated applications of operations such asare found in Horner’s scheme for fast polynomial evaluation, where, e.g. a general cubic would be evaluatedas p → d ∗ x + c , p → p ∗ x + b , p → p ∗ x + a . (66)Given that CUDA supports FMA we have found that there is no beneﬁt in performing any form of pre-processing to work out polynomials or rational approximations using other devices to reduce the number ofmultiplication operations. So, for example, the method of quadratic factorization (see e.g. section 7 of [15])has been found to produce no speed-up despite reducing the number of multiplications, and special methodsavailable for polynomials of speciﬁc degree (e.g. a trick for degree 6 described in exercise 6 for Section 7.2 of[15]), both produce no speed-up but also generate a loss of precision due to several subtractions. So will willwork with iterated linear maps on the basis of FMA support whenever a polynomial or rational function hasto be evaluated, though we should note that in non-FMA architectures that quadratic factorization may bea useful route for further optimization. .T. Shaw, T. Luu & N. Brickman: Monte Carlo Changes of Variable Given the comments made in Section 4.4 it is important to understand the costs of evaluating log and sqrt compared to working out polynomials of various size. In ﬂoat mode both log and sqrt are similarly expensive,so that for the main block of the algorithm for the central quantile, one application of one of these functionsis helpful, but applying both is ineﬃcient. For example, the main brake on Acklam’s formula is the slowcomputation of the tail, which as we have noted involves both a log and sqrt application.One of the surprises of our analysis is the beneﬁt of the eﬃciency in CUDA of the reciprocal square rootfunction rsqrt , especially in double precision. This means that the T distribution, suitable scaled, is anunexpectedly eﬃcient ﬁltering distribution. We will return to this later. In this section we will explore the precision characteristics of the existing range of commonly employedmethods, in both ﬂoat and double precision. For ﬂoat mode we will consider • Moro’s modiﬁcation of the Beasley-Springer method [14]; • Acklam’s non-iterated method [1]; • Our earlier constructions of an essentially branchless formula for GPUs[20]; • Giles’ construction of a highly-eﬃcient formula for GPUs [10];In each case we explore the theoretical precision of the method, using an implementation in Mathematica employing high-precision arithmetic. In this way we see the essential properties of the formula employed. Weconsider the precision both inside the region certiﬁed by the originators for precision, and, where appropriate,the behaviour outside. First, however, we explore the overall relative precision in the unit interval. This willshow the precision for the majority of samples, and is also indicative of the certiﬁed theoretical precision. In (cid:45) (cid:180) (cid:45) (cid:45) (cid:180) (cid:45) Moro Quantile Precision (cid:45) (cid:180) (cid:45) (cid:45) (cid:180) (cid:45) (cid:180) (cid:45) (cid:180) (cid:45) Acklam Quantile Precision (cid:45) (cid:180) (cid:45) (cid:45) (cid:180) (cid:45) (cid:180) (cid:45) (cid:180) (cid:45) ICNDfloat1 Quantile Precision (cid:45) (cid:180) (cid:45) (cid:45) (cid:180) (cid:45) (cid:45) (cid:180) (cid:45) (cid:180) (cid:45) (cid:180) (cid:45) (cid:180) (cid:45) ICNDfloat2 Quantile Precision (cid:45) (cid:180) (cid:45) (cid:45) (cid:180) (cid:45) (cid:180) (cid:45) (cid:180) (cid:45) Giles Quantile Precision

Figure 1: Theoretical relative precision of standard quantiles - main regioninterpreting Fig. 1, we should note that the Moro formula was certiﬁed and constructed based on absoluterather than relative error, and that Giles’ formula is based on a weighted L model rather than a supremumnorm. Next we consider the behaviour in the left tail region, as show in Fig. 2. In Fig. 2 we have plotted The notebooks employed are available on request from the corresponding author .T. Shaw, T. Luu & N. Brickman: Monte Carlo Changes of Variable (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) Log_10 Quantile Error (cid:45)

Left Tail

ICND2ICND1GilesAcklamMoro

Figure 2: Theoretical relative precision of standard quantiles in left tailthe base-10 logarithm of the absolute value of the precision as a function of the base-10 log of the input u . We should recall that the Giles formula was not designed to perform well for points closer to zero thanmachine epsilon. The growth in the error before we get down to that level is, however, a concern. We haveshown our two early branchless approximations ICND1 and ICND2, where ICND1 [20] was designed to havesimilar precision to the Acklam method, and ICND2 was a quicker less precise option.These plots suggest that there is some room for improvement, at least as regards the criteria given inSections 4.1 and 4.2. We shall now proceed to consider the construction of quantiles optimized for speedsubject to a precision goal of (cid:15) Rf = 2 − = 2 . × − working down to below 10 − , based on the essentiallybranchless approach and an exponential ﬁltering distribution. Now we consider the construction of normal samples from exponential samples, and proceed to a detailedpractical implementation. We work on the right hand region and extend the mapping to the left region byodd symmetry. The recyling ordinary diﬀerential equation in the right hand region, v ≥ d Qdv + dQdv = Q (cid:18) dQdv (cid:19) (67)with the initial conditions Q (0) = 0, Q (cid:48) (0) = (cid:112) π/

2. This has the formal solution Q ( v ) = Φ − (1 − / e − v ) (68)where Φ is the normal CDF. To extract useful representations we proceed as follows. This equation mayﬁrst be solved by the method of series. However, the resulting solution turns out to be an asymptotic seriesbest used to a small number of terms in a neighbourhood of v = 0. The series solution is easily found to be, .T. Shaw, T. Luu & N. Brickman: Monte Carlo Changes of Variable Q ( v ) = (cid:114) π v − (cid:114) π v + (cid:0) √ π + π / (cid:1) v √ − (cid:0) √ π + 3 π / (cid:1) v √ (cid:0) √ π + 50 π / + 7 π / (cid:1) v √ − (cid:0) √ π + 180 π / + 105 π / (cid:1) v √ (cid:0) √ π + 1204 π / + 1960 π / + 127 π / (cid:1) v √ − (cid:0) √ π + 966 π / + 3675 π / + 889 π / (cid:1) v √ (cid:0) √ π + 24200 π / + 194628 π / + 117348 π / + 4369 π / (cid:1) v √ − (cid:0) √ π + 74640 π / + 1190700 π / + 1493520 π / + 196605 π / (cid:1) v √ O (cid:0) v (cid:1) (69)While this expression is interesting, it does not work far enough out to be of much practical use, so adiﬀerent approach is needed - if we wish to retain the use of the above expression we would need to patch inanother algorithm. One could consider solving the diﬀerential equations about several points. However, asalready noted, important point for modern computation is to try to avoid “IF” statements in the computerimplementation. Such branches do not make use of the best features of modern GPU systems, such as theNVIDIA Tesla system [22]. The standard rational approximations all have breaks as follows in the positivequantile region Z ≥ , . ≤ u < • Wichura’s AS241 [23]: two breaks, at u = 0 .

925 and u = 1 − e − . • Moro [14]: breaks at u = 0 . • Acklam Level 1[1]: breaks at u = 0 . • Giles’ quantile: breaks at u = 0 . u = 1 / . ≤ u <

1, and output both Z = Φ − ( u ) and − Z for simulation purposes, i.e. always work antithetically.So we focus on the real breaks as in the list above. This break arises in standard approaches due to the factthat the standard quantile Φ − ( u ) has rather a split personality - it is slowly varying in the central regionwhere u is between a half and about 0 .

9, and then diverges to inﬁnity as u → − . This is shown in Fig. 3.If we work in an exponential base the situation changes. The function Q ( v ) = Φ − (1 − / e − v ) is shown in Figure 3: The normal quantile in standard coordinatesFig. 4 for the region 0 ≤ v ≤

37. This function now has a much simpler quality and we can aim to builda single useful rational approximation. It is then a matter of picking a target range and precision for the .T. Shaw, T. Luu & N. Brickman: Monte Carlo Changes of Variable Figure 4: The normal quantile in exponential coordinatesdesired result. In Fig. 4 we have plotted the function in the range 0 ≤ v ≤

37, showing the single characterof the function, that will allow a single rational approximation to be made.For precision we shall require relative error below (cid:15) Rf = 2 − = 2 . × − , and will consider rationalapproximations designed to work on a target interval down to at least 10 − . This was explored using thehigh-precision arithmetic of Mathematica to work out the normal quantile deep into the tail, and the function MiniMaxApproximation to create the rational approximation. The function actually approximated was Q ( v ) v = 1 v Φ − (1 − / e − v ) (70)and the power series for Q was used in a small neighbourhood of the origin to allow MiniMaxApproximation to work preserving precision near the origin, where Q (0) = 0. The settings employed for the computationwere • Brake -> 20, 20 ; • WorkingPrecision -> 20 ; • MaxIterations -> 400 ; A rational approximation of degree (5 ,

6) was found with the desired precision. The relative error is shownin Fig. 5 and is less than 2 . × − on 0 ≤ v ≤

25, corresponding to u diﬀering from zero by 2 . × − .The resulting form for Q ( v ) is as follows (cid:45) (cid:180) (cid:45) (cid:45) (cid:180) (cid:45) (cid:45) (cid:180) (cid:45) (cid:180) (cid:45) (cid:180) (cid:45) (cid:180) (cid:45) Figure 5: Precision of (5 ,

6) exponential-normal quantile on [0 , Q ( v ) = v ∗ p ( v ) /q ( v ) (71) .T. Shaw, T. Luu & N. Brickman: Monte Carlo Changes of Variable p and q are polynomials of degree 5 and 6 respectively, with nested C-forms as follows, where weproduce the higher-precision output generated by Mathematica. The numerator p is and the denominator q is For completeness, an algorithm for normal samples based on this is (in the ﬁrst two steps we give in bracketsthe better form using a reﬂection and scaling to simplify the ﬁrst part and avoid precision reduction nearunity): • sample u in 1 / ≤ u < < u < • evaluate v = − log[2(1 − u )], (then, better, v = − log[ u ]); • evaluate Z = Q ( v ) with Q given by the rational approximation; • output the antithetic pair Z, − Z .If an exponential base is used we are essentially employing the last two steps. This is what should be doneideally to get maximum precision. A quantile for managing the entire range (0 ,

1) is given in CUDA kernelform in Appendix A. It is that model that is investigated in the next sections with regard to theoreticalprecision, realized precision and speed.

The error characteristics of the two new formula is shown in the Figures 6 and 7. Fig. 6 illustrates the (cid:45) (cid:180) (cid:45) (cid:45) (cid:180) (cid:45) (cid:180) (cid:45) (cid:180) (cid:45) (cid:180) (cid:45) (cid:72) (cid:76) Quantile Precision (cid:45) main region

Figure 6: Theoretical relative precision of new branchless quantile - main regionfact that we have satisﬁed the precision goal in each case for the main region. The behaviour in the deeptail is shown in Fig. 7. This illustrates the depth of precise tail penetration by the two algorithms. Ourview was that 10 − is appropriate for Monte Carlo simulations with about 10 samples. The Mathematica notebook used to generate such schemes can be obtained on request from WS, and it is not diﬃcult togenerate branchless schemes valid over still larger regions.In reality these theoretical precisions computed in full precision will not be reproduced when passedthrough CUDA, or C++ in ﬂoat mode. However, we see a signiﬁcant improvement in the number ofsigniﬁcant ﬁgures for numbers of order u = 10 − . A number of sample computations in CUDA revealed thatthese schemes were generating realized precision of order 3 × − in contrast to about 1 × − from theGiles formula, which has similar characteristics to the CUDA 4 library. .T. Shaw, T. Luu & N. Brickman: Monte Carlo Changes of Variable (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) Log_10 Quantile Error (cid:45)

Left Tail (cid:72) (cid:76)

GilesAcklamMoro

Figure 7: Theoretical relative precision of new quantiles - tail region (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) Acklam Quantile Realized Log_10 Error (cid:45)

Left Tail (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) Giles Quantile Realized Log_10 Error (cid:45)

Left Tail (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:72) (cid:76) Quantile Realized Log_10 Error (cid:45)

Left Tail

Figure 8: Realized relative precision of various quantile in ﬂoat: left tail

To explore this reality we utilized the GPU-link capabilities built in to Mathematica version 8 and passed taildata to the GPU, ran the kernel forms of the Acklam, Giles and (5 ,

6) formulae and then passed the resultsback to Mathematica. The results are shown in Figure 8. The new formula preserves a precision better thanabout 3 . × − down to u = 10 − . We should also point out that none of these ﬂoat formulae can dealwith the right hand region eﬀectively. The value of machine epsilon signiﬁcantly limits the generation ofextreme positive values, which is an argument for moving to double precision. In our earlier study we compared Acklam’s quantile with the codes ICNDﬂoat1 and ICNDﬂoat2 and foundthat Acklam’s quantile remains superior on a CPU, but was improved on by the branchless schemes on aGPU. On balance the avoidance of any “special function” calls means that on a CPU the central simplescheme wins out, but that on a GPU the application of both square roots and logs in the tail algorithm slowsdown the code rather dramatically.Here we will re-analyse the GPU eﬃciency of the Acklam and Giles models compared to our new (5 , .T. Shaw, T. Luu & N. Brickman: Monte Carlo Changes of Variable t [ ms ] t [ ms ] t [ ms ] t [ ms ]GPU GTX285 Quadro 4000 GTX 480 C2050Acklam 197 . .

05 211 . . . .

56 84 . . . .

07 85 . . L error normthat does not provide control over a tail region that is small in that norm. Our understanding is that thealgorithm for erﬁnv in the CUDA 4 toolkit is based on the Giles model but uses some additional internalhardware tricks on log etc. to which we do not have access, so some further optimization of the (5 ,

6) modelis also likely to be possible - the results here are all based on openly available functions.In the full philosophy of this paper, one would of course use an exponential base for many diﬀerentcomputations and distributions and possibly pre-store a large set of exponential samples created by eﬃcientmethods. The overhead of converting to normal is then the evaluation of a simple rational function and theperformance beneﬁts are magniﬁed many times over those given in Table 1.

One can also consider working to double precision on a modern Nvidia GPU. The need for double precision issomewhat contentious in the Monte Carlo framework, but it seems to us that while some computations mightbe satisfactory in ﬂoat, one does need to check. While purely additive Monte Carlo summations are almostcertainly be OK, one can never be sure about any computation in ﬂoat until one has seen that repeatingin double gives consistent answers. Furthermore, the commonplace approach to working out mathematicalderivatives by diﬀerencing can generate a substantial precision loss, as it involves an essential subtraction.These are well-established concerns. To these we add the issues associated with generate regions in the righttail with u close to unity, for which the ﬂoat form of machine epsilon is a killer issue. In our own experimentsthe right tail no longer has a proper distribution of values but instead has a concentration of particularvalues of Φ − (1 − − ) = 5 . , Φ − (1 − − ) = 5 . − (1 − − ) = 8 . , Φ − (1 − − ) = 8 . reﬁned Acklam method, where the level one approximation is fed once through a Newton-Raphson-Halley method.How do we do a quality check on such high precision methods? We will use the Mathematica function

InverseErf as our benchmark. However, this will not be done blindly on the assumption that it is necessarily .T. Shaw, T. Luu & N. Brickman: Monte Carlo Changes of Variable \protect http://en.wikipedia.org/wiki/Probitand as a series has been coded up both in

Mathematica and quadruple-precision FORTRAN based on theAbsoft compiler. Based on these three implementations various cross-veriﬁcations have been carried out.For example, the quad-precision FORTRAN code that agrees with

Mathematica’s internal

InverseErf to aprecision of better than 10 − on the interval [ e, − e ], with e = 0 . Mathematica agrees with

InverseErf to much better than quad precision. Sowe have considerable conﬁdence that our benchmark is precise enough for any double-precision evaluation.In the context of double precision we will restrict attention to realized precision, and this has beenevaluated directly using Mathematica 8’s capability to embed CUDA kernels. We send a controlled sequenceof double-precision numbers in (0 ,

1) to a kernel of interest, send the results back and compare with our high-precision benchmark. First, in Fig. 9 we show the left-tail outcome for doing this for a CUDA implementationof AS241. This conﬁrms that double precision is realized consistently throughout the region of interest.AS241, in common with every scheme, suﬀers a small loss of realized precision in the neighbourhood of u = 0 .

5, where the quantile is zero, and this is shown by the small blip at the right. This is not a matterof concern especially as the absolute precision in a neighbourhood of u = 0 . − log ( approx/exact −

1) as a function of − log ( u ), and the vertical lineshows the range bound set by machine epsilon. In practice real Monte Carlo simulations are only likelyinvolve data well to the right of this line. The corresponding error plot for the iterated Acklam form is shown (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) AS241 Quantile Realized Log_10 Error (cid:45)

Left Tail

Figure 9: Precision of AS241 CUDA kernel in left tailin Fig. 10. There is, in this implementation, a small glitch in precision for values of u below 10 − , but theoverall levels are very good. (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) Acklam Iterated Quantile Realized Log_10 Error (cid:45)

Left Tail

Figure 10: Precision of iterated Acklam in left tail This might, of course, be a bug in our implementation in CUDA, but we have been unable to ﬁnd an error, and acorresponding implementation in high precision in Mathematica produces a diﬀerent manifestation of precision loss. .T. Shaw, T. Luu & N. Brickman: Monte Carlo Changes of Variable In order to produce a faster double-precision quantile for GPU work, we can consider either • aiming to be essentially branchless using a single formula over the entire range; • having a low warp divergence probability but with a faster formula.In our earlier study [20] we produced an essentially branchless formula based on an exponential transforma-tion, and Giles has also generated a double precision analogue in his approach based on a branching formulawith a low warp divergence probability.First the Giles formula is evaluated on the same basis as AS241 and the iterated Acklam model, and theresults are shown in Fig. 11. This, in our view, suﬀers from the same precision degradation as the ﬂoatmodel, and again we suspect that this is due to the use of an L norm rather than a supremum norm. (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) Giles DP Quantile Realized Log_10 Error (cid:45)

Left Tail

Figure 11: Precision of double precision Giles (CUDA 4 similar) model in left tailThe production of suitable approximations to compete with these methods follows the same path asconsidered for the ﬂoat case, except that now • the target for relative precision is now to be less than (cid:15) Rd = 5 . × − • the domain is ( (cid:15) Rd , − (cid:15) Rd ). (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) Branchless Quantile Realized Log_10 Error (cid:45)

Left Tail

Figure 12: Precision of branchless model in left tailIn order to achieve the desired relative precision on the given domain, we ﬁrst considered a branchlesssolution based again on the simple exponential distribution. With the change of variables v = − log(2 ∗ u ) onthe left region 0 < u < /

2, and v = − log(2 ∗ (1 − u )) on the right region 1 / < u < .T. Shaw, T. Luu & N. Brickman: Monte Carlo Changes of Variable ,

13) satisfying an relative theoretical error bound of lessthan 5 . × − on an extension of the target domain up to v = 42, corresponding to a quantile point of u = 2 . × − . This is essentially a small variation of our previous double precision model given in [20],also based on a (13 ,

13) model, adapted to target the bound (cid:15) Rd within the largest domain possible. Thekernel is shown in Appendix B. The realized precision of this model is shown in Fig. 12. An interesting feature of some GPUs is that they involved an unusually eﬃcient implementation of thereciprocal square root function rsqrt(x) ≡ √ x (74)and in double precision this is much faster than the logarithm function. This suggests we explore the sequencein which a uniform random variable is mapped to a suitable sample of the Student t distribution with two degrees of freedom, and then this is re-mapped by appropriate solution of the recycling ODE. Speciﬁcally,we will consider an intermediate random variable that is of the form V = T √ Q v ( u ) = ( u − / √ u − u ∗ u (76)and the mapping from V to a normal random variable is the composition of the CDF of V with probit:Ψ( v ) = Φ − (cid:18) (cid:20) v √ v (cid:21)(cid:19) (77)We have experimented with the rational approximation of Ψ and while it is diﬃcult to give an economicalbranchless solution, it serves as the basis for a very fast branching solution but with low warp divergence.Speciﬁcally, we have found a rational approximation of degree (14 ,

15) that satisﬁes the error bound for0 ≤ v ≤ .

5. Furthermore, Ψ(15 .

5) = 3 . u = 0 . T as an intermediatedistribution, and in the tail the computation is done using the already computed branchless solution. Theprecision characteristics are indistinguishable from the branchless case and are shown in Fig. 13.There is one ﬁnal point about the way we have implemented the branching for this case, as we do nothave warp divergence at all. The algorithm implements the central T ﬁlter if all the threads the warp shouldfollow the central path, otherwise all the threads fall back to our branchless model. This is accomplished inthe code by warp voting using the all () command. The comparisons on a modern GPU are very interesting. We have completed a comparison of seven algo-rithms on four GPUS. The four algorithms are: • Acklam’s one-pass method, but coded in double precision (DP); • The Acklam-Lea DP method; • AS241 (Burkardt code as of early 2009); • the CUDA 4 erﬁnv model; • Giles double precision erﬁnv model; .T. Shaw, T. Luu & N. Brickman: Monte Carlo Changes of Variable (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) (cid:45) T2 Hybrid Quantile Realized Log_10 Error (cid:45)

Left Tail

Figure 13: Precision of hybrid model in left tail • Our latest pure breakless model; • Our T exponential hybrid.Our understanding is that the CUDA and Giles models are diﬀerent for double precision, though we havefound them to have similar precision characteristics. The four GPUs considered are • GTX 285; • Quadro 4000; • GTX 480; • Tesla C2050.The timings in ms for the test program are reported in Table 2 with compilation for compute capability 1.3in the case of the 285, and 2.0 in the case of the Quadro 4000, GTX 480 and Tesla C2050.Algorithm Timing t [ ms ] Timing t [ ms ] Timing t [ ms ] Timing t [ ms ]GPU: GTX 285 Quadro 4000 GTX 480 Tesla C2050Acklam non-iterated as double 2391 .

97 1278 .

25 949 .

46 611 . .

62 3133 .

62 2305 .

16 1488 . .

99 1675 .

05 1285 .

20 797 . .

24 1950 .

91 1441 .

11 930 . .

69 1016 .

50 811 .

51 486 . .

04 1178 .

17 889 .

35 562 . T (Appendix C) .

58 933 .

01 607 .

20 448 . Table 2: Double precision timings for various normal quantiles on GPUThe use of the T as an intermediate distribution results in a clear win for this method on performancegrounds, and given that it also provides clear double precision accuracy over the range of interest for MonteCarlo work, we argue that it is a good candidate for the preferred method on GPUs. We note that thischoice seems best reserved for double precision work - the advantage arises from the relatively fast speed ofthe inverse square root operation compared to the logarithm in double precision under CUDA. We do notsee this advantage in ﬂoat arithmetic. We also note that on GTX cards, it is possible to go slightly fasterstill using a hybrid with more branches and optimized T intermediates for a near tail, but these oﬀer nobeneﬁt on the C2050 Tesla. Finally our latest breakless method comes in 3rd behind the hybrid and Gilesapproaches, but has the merit of robustness with respect to any increase in the number of threads per warp.We also brieﬂy explored polynomial representations of the mapping in the central region in order to avoid acostly divide, but found it hard to get close to the target supremum error bound. An example computationwith degree 32 over the target range [0 , .

5] got us down to only O (10 − ). Requests for code implementing this idea can be made to the authors .T. Shaw, T. Luu & N. Brickman: Monte Carlo Changes of Variable In this section we move to other distributions of interest to ﬁnance. First we consider the hyperbolic distri-bution, and then the variance gamma. These will have in common a non-normal base distribution, and willillustrate the use of a 2-sided exponential base instead.

This was originally motivated by Bagnold’s classic study of sand [4], and was given a clear mathematicaldescription by Barndorﬀ-Nielsen [5], who also generalized it. The applications to ﬁnance have been exploredEberlein and Keller [6]. A direct treatment of the quantile function for the symmetric case has been givenby Xiong [24]. He we shall explore the conversion of samples from a suitable exponential distribution tosamples from the hyperbolic. Hyperbolic distributions can of course be sampled as random mixtures of anormal distribution. Our method facilitates the use of hyperbolic marginals coupled to an arbitrary copula,and and this example also illustrates how cleanly the choice of a suitable base simpliﬁes the computationsof the quantile - the exponential base regularizes the tail in an elegant way.The probability density function is known explicitly as f ( x, α, β, δ, µ ) = γ αδK ( δγ ) exp {− α (cid:112) δ + ( x − µ ) + β ( x − µ ) } (79)where γ = (cid:112) α − β , with | β | < α . In what follows we shall translate the origin so that µ = 0, with density f ( x, α, β, δ ) = γ αδK ( δγ ) exp {− α (cid:112) δ + x + βx } (80)The H -function for the target distribution is given by the negative of the logarithmic derivative: H ( x ) = − ddx log f ( x, α, β, δ ) = αx √ δ + x − β (81)and it is evident that for large x , H ( x ) ∼ sign( x ) α − β = ± α − β . (82)Bearing in mind that the exponential distribution is characterized by a constant H -function, we will use apair of exponential distributions for the base case. When we solve the resulting recycling equations, it willsimplify matters to set Q (0) = 0 for the boundary conditions, and in order to get the proportion of therandom variables that are positive and negative correct with this choice, we let p + = (cid:90) ∞ dx γ αδK ( δγ ) exp {− α (cid:112) δ + x + βx } p − = (cid:90) −∞ dx γ αδK ( δγ ) exp {− α (cid:112) δ + x + βx } (83)so clearly p + + p − = 1. We choose the base distribution to have the form f ( x ) = (cid:40) p + ( α − β ) e − ( α − β ) x if x > p − ( α + β ) e ( α + β ) x if x <

0, (84)The quantile function for sampling from f has the trivial form: v = Q ( u ) = (cid:40) α + β log( u/p − ) if u < p − , − α − β log((1 − u ) /p + ) if u > p − , (85)So samples from the base can be made easily. To convert them into samples from the hyperbolic we solve a left and right diﬀerential equation. The right problem is of the form d Qdv + ( α − β ) dQdv = (cid:18) αQ (cid:112) δ + Q − β (cid:19)(cid:18) dQdv (cid:19) (86) .T. Shaw, T. Luu & N. Brickman: Monte Carlo Changes of Variable v > Q (0) = 0 and dQdv | v =0 = Q (cid:48) ( p − ) Q (cid:48) ( p − ) = f (0 + ) f (0) = p + ( α − β )2 αδγ K ( δγ ) e αδ (87)The left problem is d Qdv − ( α + β ) dQdv = (cid:18) αQ (cid:112) δ + Q − β (cid:19)(cid:18) dQdv (cid:19) (88)on v < Q (0) = 0 and dQdv | v =0 = Q (cid:48) ( p − ) Q (cid:48) ( p − ) = f (0 − ) f (0) = p − ( α + β )2 αδγ K ( δγ ) e αδ (89)The solution to this diﬀerential system is readily visualized. If we use a sixth-order explicit RK method asbefore, with parameters α = 1 = δ, β = 0 for illustration, the result for the recycling mapping is show below,together with the identity map (diagonal line). (cid:45) (cid:45) (cid:45) (cid:45) Figure 14: Plot of the mapping for conversion of exponential to hyperbolic

The variance-gamma density was introduced by Madan and Seneta [12] as a model for share market returns.The density is given, for λ > , α > , | β | < α , by e βx | x | λ − (cid:0) α − β (cid:1) λ K λ − ( α | x | )(2 α ) λ − / √ π Γ( λ ) (90)In the region x > H -function is given by H ( x ) = − ddx log( f ) = αK λ − / ( αx ) K λ − / ( αx ) − β ∼ ( α − β ) + 1 − λx + O (cid:32)(cid:18) x (cid:19) (cid:33) (91)In the region x < H -function is given by H ( x ) = − ddx log( f ) = − αK λ − / ( − αx ) K λ − / ( − αx ) − β ∼ − ( α + β ) + 1 − λx + O (cid:32)(cid:18) x (cid:19) (cid:33) (92)These asymptotic relationships suggest that the VG model may be treated in a similar way to the hyperboliccase, as the asymptotics are closely related with a good match to the exponential base. This time theprobabilities p ± are given by .T. Shaw, T. Luu & N. Brickman: Monte Carlo Changes of Variable p + = (cid:0) α − β (cid:1) λ (2 α ) λ − / √ π Γ( λ ) (cid:90) ∞ dxe βx x λ − K λ − ( αx )= 2 λ − (cid:16) α + βα − β (cid:17) λ Γ (cid:0) λ + (cid:1) F (cid:16) λ, λ ; λ + 1; α + ββ − α (cid:17) √ π Γ( λ + 1) ,p − = (cid:0) α − β (cid:1) λ (2 α ) λ − / √ π Γ( λ ) (cid:90) ∞ dxe − βx x λ − K λ − ( αx )= 2 λ − (cid:16) α − βα + β (cid:17) λ Γ (cid:0) λ + (cid:1) F (cid:16) λ, λ ; λ + 1; β − αα + β (cid:17) √ π Γ( λ + 1) , (93)where we have used identity 6.621.3 from [3] to evaluate the integrals giving the probabilities that the VGrandom variables is positive or negative. It is easily checked that if β = 0 then p + = p − = 1 / λ . First, we note that if λ = 1 the VG model is trivial as it is identical to thebase, so that Q ( v ) ≡ v . If λ > f and H exist at v = 0,with H (0) = 0. The recycling ODE may be solved as before, though many steps may be needed near v = 0if λ remains close to and just above 1. When 0 < λ < H (0) isdivergent, and furthermore the density becomes singular in the range 0 < λ ≤ /

2. The density has a logdivergence when λ = 1 /

2, and otherwise diverges as x λ − . All of these issues may in principle be addressedby doing analytical estimates in a small neighbourhood of the origin and starting the numerical treatmentat a small distance from the origin - as noted several diﬀerent cases must be considered and full details willbe given elsewhere. This paper has developed two related threads of thought. First, risk simulations depend critically on having arealistic (fat-tailed) model of asset returns. The methods developed here allow traditional Gaussian samplesto be converted to other distributions via the application to the samples of the solution of a recyclingdiﬀerential equation. The recycling diﬀerential equation is the ODE for transforming samples from a density f to a density f , and is d Q ( v ) dv + H ( v ) dQ ( v ) dv = H ( Q ( v )) (cid:18) dQ ( v ) dv (cid:19) , where H i = − ddx log[ f i ( x )]We have given an explicit example for the Student t case, where a power series emerges coupled to atail model. Other more complicated distributions with an explicit density may be handled similarly ornumerically, and other base distributions may be treated. In particular we can use changes of variableto construct “essentially IF-less” algorithms for objects like the normal quantile. The eﬃciency of suchalgorithms in GPU computation is of interest, and the methods introduced here can be considered for othertarget distributions. In contrast to the normal case, where there are no parameters beyond the translationand scale, we must ﬁrst solve the RODE with the relevant parameters and then develop a suitable fastapproximation.These methods also simplify the use of a Gaussian or T-copula, since the two steps of mapping to theunit hypercube and back to the marginals may be folded together into one operation, where the solutionto the RODE is applied directly in one step. Of course, the methods developed here rely on the abilityto compute the logarithmic derivatives of the target and base densities. Where the target density is notknown explicitly, but its characteristic function is known, other methods must be used. Investigations of theresulting integro-diﬀerential equations will be reported elsewhere.The second thread of thought concerns the detailed practicalities of building good quantile implemen-tations for Monte Carlo use. We have set out the numerical and computer science issues that are relevant .T. Shaw, T. Luu & N. Brickman: Monte Carlo Changes of Variable ﬂoat and double mode. Inﬂoat mode our approach is competitive on speed with the fast formula developed by Giles but oﬀers, webelieve, a substantial improvement in precision. In double precision we have provided a new formula whichoutperforms all other methods known to us on GPU speed while preserving full double precision on a rangemore than enough for Monte Carlo simulations. Acknowledgments

TL thanks the Knowledge Transfer Network, EPSRC and the Numerical Algorithms Group Ltd for supportunder a KTN-EPSRC CASE award. WS wishes to thank I. Buckley, W. Gilchrist, P. J¨ackel, D. Scott, G.Steinbrecher , A. Munir and Y. Xiong for discussions on various aspects of quantile theory. Presentations byC. Albanese, M.Giles and G. Ziegler and the NAG team at a recent London workshop on GPU computing[2] stimulated the development of the essentially “IF-less” normal quantile. A. Munir provided assistancein producing further Windows binaries for the 1.3 CUDA architecture. Clariﬁcation of precision issues fromseveral members of the Numerical Algorithms Group were invaluable, and we are also grateful to NAG foraccess to their Tesla system. We are also grateful to Wolfram Research for their creation of the CUDAembedding tools in Mathematica 8 that allowed the precision studies here to be made so easily, and to TomWickham-Jones for clarifying some points on double-precision analysis in that environment.

References [1]

P. J. Acklam

An algorithm for computing the inverse normal cumulative distribution function, http://home.online.no/~pjacklam/notes/invnorm/ [2]

C. Albanese, G. Ziegler, D. Sayers, M. Giles , Presentations at Nov 2007 King’s College LondonWorkshop on GPU computing in ﬁnance, Workshop home page at .[3]

M. Abramowitz, I.A. Stegun , Handbook of Mathematical Functions, Dover, 1975.[4]

R.A. Bagnold , The Physics of Blown Sands and Desert Dunes. Methuen, London, 1941.[5]

O.E. Barndorff-Nielsen

Exponentially Decreasing Distributions for the Logarithm of a particle size.Proceedings of the Royal Society (London), Series A, 353, 401-419, 1977.[6]

E. Eberlein, E. and U. Keller , Hyperbolic Distribution in Finance.

Bernoulli , Vol. 1, No. 3 (Sept,1995), pp. 281-299, 1995.[7]

K. Fergusson, E. Platen , On the Distributional Characterization of daily Log-returns of a WorldStock Index,

Applied Mathematical Finance , 13 (1), 19-38, March 2006.[8] Warren Gilchrist,

Statistical modelling with quantile functions , CRC Press Inc, 2000.[9]

G.W. Hill, A.W. Davis , Generalized Asymptotic Expansions of Cornish-Fisher Type,

Annals ofMathematical Statistics , 39, 4, 1264-1273, 1968.[10]

M.B. Giles , Approximating the erﬁnv function. In GPU Computing Gems, Vol. 2. Morgan Kaufmann,Elsevier, 2011.[11]

M. Joshi

Graphical Asian options,

Wilmott

Journal, Vol 2(2), 97-107, April 2010. http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1473563 [12]

D.B. Madan, E. Seneta , The variance gamma (V.G.) model for share market returns, Journal ofBusiness, 63, pp. 511 - 524, 1990.[13]

T. Mikosch , Copulas: tales and facts. Extremes, 9, pp 3-20, 2006.[14]

B. Moro , The full monte, RISK 8 (Feb): 57-58. .T. Shaw, T. Luu & N. Brickman: Monte Carlo Changes of Variable

A. Ralston, P. Rabinowitz , First Course in Numerical Analysis, McGraw-Hill, 1978.[16]

Wikipedia , entry on “Quantile function”, http://en.wikipedia.org/wiki/Quantile_function [17]

W. T. Shaw , Sampling Student’s T distribution - use of the inverse cumulative distribution function.

Journal of Computational Finance , Vol. 9, No. 4, 2006.[18]

W. T. Shaw , Reﬁnement of the Normal Quantile, Simple improvements to the Beasley-Springer-Moromethod of simulating the Normal Distribution, and a comparison with Acklam’s method and WichurasAS241: [19]

W. T. Shaw, I.R.C. Buckley , The alchemy of probability distributions: beyond Gram-Charlierexpansions, and a skew-kurtotic-normal distribution from a rank transmutation map. Presented at theFirst IMA Conference on Computational Finance, March 2007. http://arxiv.org/abs/0901.0434v1 [20]

W. T.Shaw, N. Brickman , http://arxiv.org/abs/0901.0638v4 [21] G. Steinbrecher and W.T. Shaw , Quantile Mechanics.

European Journal of Applied Mathematics ,19(2), pp 87-112, 2008.[22] Nvidia TESLA computing solutions, [23]

Wichura, M.J. , Algorithm AS 241: The Percentage Points of the Normal Distribution.

Applied Statis-tics , 37, 477-484, 1988.[24]

Y. Xiong

Sampling hyperbolic distribution by quantile function,

MSc thesis , King’s College London,September 2008.

Appendix A: The (5,6) rational CUDA kernel for ﬂoat code

Here is the code for the (5,6) rational model in a form suitable for CUDA. The extra signiﬁcant ﬁgures areignored by the compiler, but we include the full precision form generated in Mathematica for information.The code is written as a sequence of maps p → p ∗ a + b in order to make good use of FMA operations. __inline__ __device__ float ws_norminvf(float u){ float half_minus_u = 0.5f - u;float v, p, q;float x = copysignf(2.0f*u, half_minus_u);if (half_minus_u < 0.0f) x += 2.0f;v = -__logf(x);p = 1.1051591117060895699e-4f;p = p*v + 0.011900603295838260268f;p = p*v + 0.23753954196273241709f;p = p*v + 1.3348090307272045436f;p = p*v + 2.4101601285733391215f;p = p*v + 1.2533141012558299407f;q = 2.5996479253181457637e-6f;q = q*v + 0.0010579909915338770381f;q = q*v + 0.046292707412622896113f;q = q*v + 0.50950202270351517687f;q = q*v + 1.8481138350821456213f;q = q*v + 2.4230267574304831865f;q = q*v + 1.0f;return -__fdividef(p, q) * copysignf(v, half_minus_u);} .T. Shaw, T. Luu & N. Brickman: Monte Carlo Changes of Variable Appendix B: A double precision branchless quantile kernel

Here is an optimized double precision branchless algorithm in a form suitable for CUDA kernel use. The fulldata supplied by Mathematica’s high-precision rational function generator is included. extern "C" __global__ void ws_norminv_exp_42(double *in, double *out){ int i = threadIdx.x + blockIdx.x * blockDim.x;double u = in[i];double half_minus_u = 0.5 - u;double v, p, q;double x = copysign(2.0*u, half_minus_u);if (half_minus_u < 0.0) x += 2.0;v = -log(x);p = 1.64783242453158904095515084024e-14;p = p*v + 5.06687427282961778456165208105e-11;p = p*v + 2.10154247206828001641073444523e-8;p = p*v + 2.79486316248312621569098418063e-6;p = p*v + 0.000158143467460605125860139269297;p = p*v + 0.00438343320745866724879101963414;p = p*v + 0.0646753575778845943457494008377;p = p*v + 0.535690416737220756622791398354;p = p*v + 2.57714610175675729492631703269;p = p*v + 7.33285309828701618935546741859;p = p*v + 12.3353630302640508603664862349;p = p*v + 11.9187726041215161859997693572;p = p*v + 6.06634828333794870534194478115;p = p*v + 1.25331413731550018371372639809;q = 9.3774528584890379942301072137e-13;q = q*v + 8.67759442958410980713288964586e-10;q = q*v + 1.9465409869330334204439096215e-7;q = q*v + 0.0000166601689658474353532677312063;q = q*v + 0.00066147322306910897444136114895;q = q*v + 0.013581089497310892038923062896;q = q*v + 0.154424951968123464901887026825;q = q*v + 1.01815001279043960887846096372;q = q*v + 4.01114257592029176980269694161;q = q*v + 9.58786255809221297975776809938;q = q*v + 13.8641781886242409731295280702;q = q*v + 11.7514614079486467058484941458;q = q*v + 5.34024563572829223828055331064;q = q*v + 1.0;// swap the following comments if using non-Fermi cardout[i] = p * __drcp_rn(q) * copysign(v, -half_minus_u);//out[i] = p / q * copysign(v, -half_minus_u);} .T. Shaw, T. Luu & N. Brickman: Monte Carlo Changes of Variable Appendix C: A double precision optimized quantile kernel