Proceedings of the 2018 CERN Accelerator School course on
Numerical Methods for Analysis, Design and Modelling of Particle Accelerators, Thessaloniki, Greece
Available online at https://cas.web.cern.ch/previous-schools
Monte Carlo Simulation Techniques
Ji Qiang
Lawrence Berkeley National Laboratory, Berkeley, CA, USA
Abstract
Monte Carlo simulations are widely used in many areas, including particle accelerators. In this lecture, after a short introduction and a review of some statistical background, we will discuss methods such as direct inversion, the rejection method, and Markov chain Monte Carlo to sample a probability distribution function, as well as methods for variance reduction in the evaluation of numerical integrals using Monte Carlo simulation. We will also briefly introduce quasi-Monte Carlo sampling at the end of this lecture.
Keywords
Monte Carlo; particle simulation.

1 Introduction
The Monte Carlo method is a computational method that relies on random sampling and probability statistics to obtain numerical results for solving deterministic or probabilistic problems. It solves various problems in computational mathematics by constructing, for each problem, a random process with parameters equal to the required quantities of that problem. The unknowns are determined approximately by carrying out observations on the random process and by computing its statistical characteristics, which are approximately equal to the required parameters [1].

It is believed that the earliest documented use of random sampling to solve a mathematical problem was by the mathematician Comte de Buffon of France in 1777 [2]. The problem is to find the probability of an intersection between a randomly thrown needle of length L and a group of parallel lines with separation width D. It turns out that the analytical solution of this probability is proportional to π, and Laplace later suggested using such random sampling to evaluate π. Lord Kelvin used random sampling to aid in evaluating the time integrals associated with the kinetic theory of gases, and Enrico Fermi was among the first to apply random sampling methods to study neutron moderation in Rome. During World War II, Fermi, Stan Frankel, Nicholas Metropolis, John von Neumann, Stan Ulam and others developed computer-oriented Monte Carlo methods at Los Alamos to study neutron transport through materials under the Manhattan project. It is said that the name "Monte Carlo", which is also a casino centre for gambling in Monaco, was coined by Metropolis because of the similarity between the randomness employed in the method and games of chance [3].

The Monte Carlo simulation starts with a probability distribution function that characterizes the parameters of the physical or mathematical system. One then draws random samples of the distribution function to obtain a sample of the parameters. Next, one runs a simulation using those parameters. After that, one collects the simulation outputs and repeats the above process for a number of samplings of the parameters. Finally, one performs statistical analysis on the simulation outputs. The Monte Carlo simulation can be conveniently summarized in the following steps:

1. Define a domain of possible inputs and identify the statistical probability distribution of these inputs.
2. Generate possible inputs through random sampling from the probability distribution over the domain.
3. Perform simulation with these input parameters.
4. Aggregate and analyze statistically the output results.

The error of the output results from the Monte Carlo simulation is inversely proportional to the square root of the number of samples.

The Monte Carlo method can be used to solve problems that are stochastic (probabilistic) by nature, such as particle collision and transport, or problems that are deterministic by nature, such as the evaluation of integrals. It has been used in areas as diverse as natural science (e.g. physics and chemistry), engineering (e.g. control and optimization), economics (e.g. market prediction), and many others [2-5].
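As an illustration of these steps, the short Python sketch below estimates π by drawing points uniformly in the unit square and counting how many fall inside the quarter circle. The function name and sample sizes are illustrative choices, not part of the lecture.

```python
# Minimal sketch of the four-step Monte Carlo workflow described above,
# illustrated by estimating pi from random points in the unit square.
import random

def estimate_pi(n_samples: int, seed: int = 12345) -> float:
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        # Step 2: draw inputs uniformly over the domain [0,1) x [0,1).
        x, y = rng.random(), rng.random()
        # Step 3: "simulate": test whether the point lies inside the quarter circle.
        if x * x + y * y < 1.0:
            hits += 1
    # Step 4: aggregate: the hit fraction estimates the quarter-circle area pi/4.
    return 4.0 * hits / n_samples

if __name__ == "__main__":
    for n in (10**3, 10**5, 10**7):
        print(n, estimate_pi(n))  # the error shrinks roughly as 1/sqrt(n)
```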
2 Statistical background

In the Monte Carlo simulation, system parameters are treated as random variables that follow some probability distributions. A random variable is a real number associated with a random event whose chance of occurring is determined by an underlying probability distribution. A discrete random variable, such as the face of a die or the type of a reaction, has a discrete probability distribution. A continuous random variable, such as a spatial location or a time of occurrence, has a continuous probability distribution. If x is a random variable with probability density function \sum_i p_i \delta(x - x_i) for a discrete variable, or f(x) for a continuous variable, then g(x) is also a random variable. The expectation of g(x) is defined as:

E(g(x)) = \langle g(x) \rangle = \sum_i p_i g(x_i), for a discrete random variable,    (1)

E(g(x)) = \langle g(x) \rangle = \int_{-\infty}^{\infty} g(x) f(x) dx, for a continuous random variable,    (2)

where p_i is the probability of the discrete random variable x_i, and f(x) is the probability density function (PDF) of the continuous variable x. The n-th moment of the random variable x is defined as the expectation of the n-th power of x. The spread of the random variable is measured by the variance of x. The square root of the variance is also called the standard deviation or standard error. The variance of any function of the random variable is defined as:

var(g(x)) = E(g(x)^2) - [E(g(x))]^2    (3)

The variance has the following properties:

1. For a constant random variable C, var{C} = 0.
2. For a constant C and a random variable x, var{Cx} = C^2 var{x}.
3. For independent random variables x and y, var{x + y} = var{x} + var{y}.

When x and y are not necessarily independent, the covariance can be used to measure the degree of dependence of the two random variables x and y:

cov{x, y} = \langle xy \rangle - \langle x \rangle \langle y \rangle    (4)

The covariance equals zero when x and y are independent, and

cov{x, x} = var{x}    (5)

However, zero covariance does not by itself guarantee independence of the random variables. For example, let x be a uniform random variable between -1 and 1, and let y = x^2; then the covariance cov{x, y} = 0 even though y is fully determined by x. Another quantity that measures the dependence between two random variables is the correlation coefficient, given by:

\rho(x, y) = \frac{cov\{x, y\}}{\sqrt{var\{x\}\, var\{y\}}}    (6)

with

-1 \le \rho(x, y) \le 1    (7)
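The zero-covariance example above is easy to check numerically. The following short sketch (an illustration, not from the lecture) estimates cov{x, y} for x uniform on (-1, 1) and y = x^2; the result is close to zero even though y is fully determined by x.

```python
# Numerical check that cov{x, y} ~ 0 for x ~ U(-1, 1) and y = x**2,
# even though x and y are strongly dependent.
import random

def sample_cov(n: int, seed: int = 1) -> float:
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [x * x for x in xs]
    mx = sum(xs) / n
    my = sum(ys) / n
    # cov{x, y} = <xy> - <x><y>, Eq. (4)
    return sum(x * y for x, y in zip(xs, ys)) / n - mx * my

print(sample_cov(100_000))  # close to zero despite full dependence
```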
3 Sampling of probability distribution function

The Monte Carlo simulation starts with the sampling of a given probability distribution function. In order to sample an arbitrary probability distribution, one first needs to generate a uniformly distributed random number. More complex probability distributions can then be sampled from this uniform random number through appropriate operations.
3.1 Random number generator

On a computer, instead of a real random number, a pseudo-random number is used to sample a uniform distribution between zero and one. A simple and widely used algorithm to generate a pseudo-random number is the linear congruential generator (LCG). The sequence of numbers is given by the following recurrence relation:

x_{k+1} = mod(a x_k + c, M), k = 1, 2, ...    (8)

where x_{k+1} and x_k are integers between 0 and M, mod is the modulo function, M is the modulus, and a and c are the positive multiplier integer and the increment integer, respectively. This function has a largest period of M if M, a, and c are properly chosen, and all possible integers between 0 and M - 1 can be attained starting from an initial seed integer. Normally, M is chosen as a power of two minus one. The modulo function repeatedly brings the linear function y = ax + c back into the range between zero and M. A uniformly distributed random number between 0 and 1 is given by:

r = x_{k+1}/M    (9)

A typical choice is M = 2^{31} - 1 = 2147483647, a = 7^5 = 16807, and c = 0. This is the random number generator that was used in the function ran0 of the Numerical Recipes (NR) [6]. It runs through the integers from 1 to 2,147,483,646 in a scrambled order and then repeats itself. However, there are serial correlations present in the above random number generator. An improved version, the function ran1 of the NR, uses the function ran0 for its random value, but shuffles the output to remove low-order serial correlations. A random number derived from the j-th value in the sequence is not output on the j-th call, but rather on a randomized later call, e.g. on call j + 32 on average. When a very long sequence of random numbers is needed, one can combine two different sequences with different periods so as to obtain a new sequence whose period is the least common multiple of the two periods. This is what is implemented in the function ran2 of the NR, which has a period of more than 2 × 10^{18}.
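The recurrence of Eq. (8) is straightforward to implement. The sketch below uses the Park-Miller constants quoted above; it is a minimal illustration of the LCG recurrence, not the actual Numerical Recipes ran0 code (which additionally uses Schrage's algorithm to avoid 32-bit integer overflow, a concern that does not arise with Python integers).

```python
# Minimal sketch of the linear congruential generator of Eq. (8) with the
# Park-Miller constants M = 2**31 - 1, a = 16807, c = 0 quoted above.
class LCG:
    M = 2**31 - 1
    A = 16807
    C = 0

    def __init__(self, seed: int = 1):
        assert 0 < seed < self.M, "seed must lie strictly between 0 and M"
        self.state = seed

    def next_uniform(self) -> float:
        # x_{k+1} = mod(a * x_k + c, M)
        self.state = (self.A * self.state + self.C) % self.M
        return self.state / self.M  # uniform in (0, 1), Eq. (9)

rng = LCG(seed=42)
print([round(rng.next_uniform(), 6) for _ in range(5)])
```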
3.2 Direct inversion method

The direct inversion method is also called the transformation method. The preceding subsection discussed how to generate a uniformly distributed random number between zero and one. Given the sampling of such a uniform probability density function, the sampling of another probability distribution function can be achieved through appropriate transformation and inversion of that sampling. Given that x is a random variable with probability density function f(x) and y = y(x), the probability density function of y will be:

g(y) = f(x) \left| \frac{dx}{dy} \right|    (10)

which reflects the fact that all the values of x in dx map into values of y in dy. In the above equation, one needs to use the inverse function x = x(y) to attain g(y). Consider the linear transformation y = a + bx; the probability density function g(y) will be:

g(y) = f\left(\frac{y - a}{b}\right) / |b|    (11)

This suggests that in order to sample a Gaussian distribution with mean µ and standard deviation σ, one can sample a Gaussian distribution with mean zero and standard deviation one and then transform the sampled variable x using y = µ + σx.

The transformation Eq. (10) can be rewritten in integral form:

\int_{-\infty}^{y} g(t) dt = \int_{-\infty}^{x} f(t) dt    (12)

These integrals are the cumulative distribution functions (CDF) of the random variables y and x, respectively. If g(y) and G(y) represent the PDF and CDF of a random variable y, and a random number x is distributed uniformly between zero and one with PDF f(x) = 1, the above equation can be rewritten as:

G(y) = x    (13)

Then, for each uniformly distributed random variable x, there is a corresponding y that is distributed according to the probability density function g(y). For example, consider sampling the probability distribution function:

f(r) = r \exp(-r^2/2), \quad 0 < r < \infty    (14)

The cumulative distribution function of this function is:

F(r) = \int_0^r t \exp(-t^2/2) dt = 1 - \exp(-r^2/2) = \xi    (15)

where \xi is a uniformly distributed random variable with constant probability density. Solving this equation for r yields:

r = \sqrt{-2 \log(1 - \xi)}    (16)
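A direct-inversion sampler for this example takes only a few lines. The sketch below (with illustrative names) draws ξ uniformly and applies Eq. (16); the printed sample mean should approach \sqrt{\pi/2} ≈ 1.2533, the mean of the distribution of Eq. (14).

```python
# Sketch of direct inversion for the example above: sampling
# f(r) = r*exp(-r**2/2) through r = sqrt(-2*log(1 - xi)), Eq. (16).
import math
import random

def sample_rayleigh(rng: random.Random) -> float:
    xi = rng.random()                             # uniform in [0, 1)
    return math.sqrt(-2.0 * math.log(1.0 - xi))   # invert the CDF, Eq. (15)

rng = random.Random(7)
samples = [sample_rayleigh(rng) for _ in range(100_000)]
print(sum(samples) / len(samples))  # should approach sqrt(pi/2) ~ 1.2533
```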
Next, we would like to sample a Gaussian probability distribution function with zero mean and standard deviation one:

f(x) = \frac{1}{\sqrt{2\pi}} \exp(-x^2/2), \quad -\infty < x < \infty    (17)

We can construct a two-dimensional (2D) Gaussian probability density function:

f(x, y) = \frac{1}{2\pi} \exp\left(-\frac{1}{2}(x^2 + y^2)\right)    (18)

Changing the coordinates (x, y) into the cylindrical coordinates (r, θ), the 2D probability distribution function becomes:

f(r, θ) = \frac{1}{2\pi}\, r \exp(-r^2/2)    (19)

This distribution can be sampled using the above example as:

θ = 2\pi \xi_1    (20)

r = \sqrt{-2 \log(1 - \xi_2)}    (21)

and then, in Cartesian coordinates:

x = r \cos(θ)    (22)

y = r \sin(θ)    (23)

The above equations can be rewritten as:

x = \sqrt{-2 \log \xi_2} \cos(2\pi \xi_1)    (24)

y = \sqrt{-2 \log \xi_2} \sin(2\pi \xi_1)    (25)

Here, the uniform random number \xi_2 is used to replace the original uniform random number 1 - \xi_2. The above sampling of a Gaussian distribution function is known as the Box-Muller method [7].

For a complex probability distribution function whose CDF is not analytically available, one can numerically calculate a discrete CDF as:

F(x_n) = \int_{x_0}^{x_n} f(t) dt = \frac{n}{N}, \quad n = 0, 1, 2, \ldots, N    (26)

For a uniformly sampled random number \xi from U(0, 1), one can find n such that:

\frac{n}{N} < \xi < \frac{n+1}{N}    (27)

The sampled value of x can then be calculated by the following linear interpolation:

x = x_n + (x_{n+1} - x_n) r    (28)

r = N\xi - n, \quad 0 < r < 1    (29)

For a discrete probability distribution function f(x) = \sum_i p_i \delta(x - x_i) with \sum_i p_i = 1, i = 1, 2, \ldots, N, one can generate a uniform random number \xi and obtain the sampled random variable x = x_k such that:

\sum_{i=1}^{k-1} p_i \le \xi < \sum_{i=1}^{k} p_i    (30)

For a multi-dimensional probability distribution function, if the random variables in each dimension are independent of each other, the sampling of the multi-dimensional PDF can be done in each dimension separately. If the marginal and the conditional distribution functions can be determined, sampling the multivariate distribution then involves sampling a sequence of univariate distributions.
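The tabulated-CDF inversion of Eqs. (26)-(29) can be sketched as follows. This is an illustrative implementation under stated assumptions: the test PDF f(x) = 2x on (0, 1) (whose exact inverse CDF is \sqrt{\xi}, so the result is easy to check), the helper names, and the grid sizes are all arbitrary choices, not from the lecture.

```python
# Sketch of the tabulated-CDF method of Eqs. (26)-(29): precompute x_0..x_N
# with F(x_n) ~= n/N, then invert by linear interpolation, Eqs. (28)-(29).
import random

def build_quantile_table(pdf, a, b, n_bins=100, n_grid=10_000):
    """Return x_0..x_N with F(x_n) ~= n/N, from a trapezoidal CDF on a fine grid."""
    h = (b - a) / n_grid
    xs = [a + i * h for i in range(n_grid + 1)]
    cdf = [0.0]
    for i in range(n_grid):
        cdf.append(cdf[-1] + 0.5 * (pdf(xs[i]) + pdf(xs[i + 1])) * h)
    total = cdf[-1]                      # normalizes an unnormalized pdf
    table, j = [a], 0
    for n in range(1, n_bins):
        target = total * n / n_bins
        while cdf[j + 1] < target:       # locate the grid cell containing F = n/N
            j += 1
        frac = (target - cdf[j]) / (cdf[j + 1] - cdf[j])
        table.append(xs[j] + frac * h)   # linear interpolation inside the cell
    table.append(b)
    return table

def sample_from_table(table, rng):
    n_bins = len(table) - 1
    xi = rng.random()
    n = min(int(n_bins * xi), n_bins - 1)           # bin with n/N < xi < (n+1)/N
    r = n_bins * xi - n                             # Eq. (29)
    return table[n] + (table[n + 1] - table[n]) * r # Eq. (28)

rng = random.Random(3)
tab = build_quantile_table(lambda x: 2.0 * x, 0.0, 1.0)
vals = [sample_from_table(tab, rng) for _ in range(200_000)]
print(sum(vals) / len(vals))   # should approach 2/3, the mean of f(x) = 2x
```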
3.3 Rejection method

In many applications, the multi-dimensional probability distribution function can be very complicated, and an explicit analytical expression of the cumulative distribution function is not attainable. The above direct inversion of the CDF then becomes impossible. In this case, the rejection method can be used as a general method to sample the probability distribution function. The rejection method is a composition method that needs two samplings to sample a given distribution. The first sampling generates a random point within the variable domain of the probability distribution function. The second sampling generates a uniform random number between zero and one. The probability of accepting the first sampling point depends on the normalized function value at that point. If the uniform random number is less than or equal to the normalized function value, the first sampling point is accepted as the sampling point of the probability distribution function; otherwise, it is rejected. For a one-dimensional PDF, the rejection method can be written as follows:

– generate a uniform random number x_1 = \xi_1 between x_min and x_max;
– generate another uniform random number \xi_2 between 0 and 1;
– if \xi_2 \le f(x_1)/f_max: accept x = x_1;
– otherwise: reject x_1.

Here, f_max is the maximum value of the PDF within the domain between x_min and x_max. A geometric view of the rejection method is shown in Fig. 1. The rejection method can be viewed as choosing uniformly the points inside the smallest rectangle that contains the curve f(x). The abscissa of such a point is x_1 = \xi_1; the ordinate is f_max \xi_2. Points lying above the curve are rejected; points below are accepted. The abscissas x = x_1 of the accepted points have the distribution f(x).

Fig. 1: A geometric view of the rejection method.

For example, consider the following probability distribution function:

f(x) = \frac{1}{1 + 2x}, \quad 0 < x < 1    (31)

This can be sampled using the following steps:
1. x_1 = \xi_1;
2. if \xi_2 > 1/(1 + 2x_1), repeat from 1; else x = x_1,

where \xi_1 and \xi_2 are two uniformly sampled random numbers between zero and one. Another example of using the rejection method is to sample a uniform density distribution within a unit circle. This can be done as follows:
1. x_1 = 2\xi_1 - 1 and y_1 = 2\xi_2 - 1;
2. if x_1^2 + y_1^2 > 1, repeat from 1; else x = x_1 and y = y_1.

The above rejection method requires knowledge of the maximum value of the sampled probability distribution function within the domain in order to calculate the normalized function value f(x)/f_max. The efficiency of the rejection method depends on the ratio f(x)/f_max. In many applications this ratio can be small, e.g. in the tail of a Gaussian distribution, which means that many trial points will be rejected before a sampled point is attained. For some complex probability distribution functions, the maximum of the function is not easily accessible. However, if one can find an easily sampled function g(x) such that M g(x) \ge f(x) with a constant M > 1 for all x, a general rejection method can be written as follows [8]:

– generate a random number x between x_min and x_max from the sampling of g(x);
– generate a uniform random number \xi between 0 and 1;
– if \xi \le f(x)/(M g(x)): accept x;
– otherwise: reject x.

If one chooses g(x) = f_max/M, i.e. a uniform distribution, the above general rejection method reduces to the preceding rejection method. It is clear that the efficiency of the above rejection method depends on the ratio f(x)/(M g(x)).
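A minimal sketch of the basic rejection method, applied to the worked example of Eq. (31) with the uniform envelope f_max = f(0) = 1, could look as follows; the function names are illustrative.

```python
# Sketch of the rejection method for f(x) = 1/(1+2x) on (0, 1), using the
# uniform envelope with f_max = f(0) = 1 (normalization of f is not needed).
import random

def sample_rejection(rng: random.Random) -> float:
    while True:
        x = rng.random()                      # trial point, uniform on (0, 1)
        xi2 = rng.random()                    # acceptance test variable
        if xi2 <= 1.0 / (1.0 + 2.0 * x):      # accept with probability f(x)/f_max
            return x

rng = random.Random(11)
samples = [sample_rejection(rng) for _ in range(200_000)]
print(sum(samples) / len(samples))  # mean of the normalized PDF ~ 0.4102
```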
3.4 Markov chain Monte Carlo

The efficiency of the rejection method can be improved by another general sampling method, the Metropolis method, also called more generally the Markov chain Monte Carlo (MCMC) method. The Markov chain Monte Carlo method is a general method to sample any probability distribution function, regardless of its analytic complexity, in any number of dimensions. It needs to know neither the maximum nor an upper bound of the sampled probability distribution function. Moreover, it does not reject all samplings with f(x)/f_max < \xi as the rejection method does, but instead accepts a trial point with a probability that depends on the local function value. This makes it more efficient than the rejection method. Some disadvantages of MCMC are that the sampling is correct only asymptotically and that successive samplings are correlated. To mitigate these disadvantages, some initial samplings are thrown away (the so-called burn-in phase) and the samplings that are used are separated by a number of steps.

A sequence of random variables x_i, i = 1, 2, ..., forms a Markov chain if:

P(x_{i+1} = x | x_1, \ldots, x_i) = P(x_{i+1} = x | x_i)    (32)

That is, the probability distribution of x_{i+1} depends only on the previous step x_i and is independent of the earlier steps (x_{i-1}, ..., x_1). A Markov chain is said to be ergodic if it satisfies the following conditions [9-11]:

– Irreducible: any state can be reached from any other state with nonzero probability.
– Positive recurrent: for any state A, the expected number of steps required for the chain to return to A is finite.
– Aperiodic: for any state A, the number of steps required to return to A must not always be a multiple of some integer value.

In other words, all possible states of the system will be reached within some finite number of steps. If there exists a distribution function f(x) such that

f(x_i) P(x_{i+1} | x_i) = f(x_{i+1}) P(x_i | x_{i+1})

for all i, the Markov chain is reversible and f(x) is then the equilibrium distribution of the Markov chain. Provided that a Markov chain is ergodic, it will converge to an equilibrium stationary distribution. This stationary distribution is determined entirely by the transition probabilities of the chain; the initial value of the chain is irrelevant in the long run. This suggests that sampling based on the Markov chain samples the probability distribution f(x) asymptotically. The Metropolis Markov chain Monte Carlo algorithm to sample an arbitrary distribution can be summarized as follows (a code sketch is given at the end of this subsection):

1. Choose a proposal transition probability distribution p(x) and an initial random sampled value x_0.
2. Calculate a new trial value \bar{x} = x_i + \tau using an update step \tau sampled from p(x).
3. If f(\bar{x}) > f(x_i), accept x_{i+1} = \bar{x}; otherwise accept x_{i+1} = \bar{x} with probability f(\bar{x})/f(x_i), and set x_{i+1} = x_i if the trial value is rejected.
4. Continue from step 2 until enough sampled values have been obtained.
5. Discard some early values from the burn-in phase.

Typically, the proposal distribution function can be chosen as a Gaussian function p(x) = N(0, \sigma^2) or a uniform distribution p(x) = U(-v, v) [12].

The above symmetric proposal transition distribution might not be optimal. In order to speed up convergence, a correction factor, the Hastings ratio, is applied to correct for the bias. The probability of accepting the new trial sampling value is modified from the original min(1, f(\bar{x})/f(x_i)) to include the Hastings ratio: min(1, f(\bar{x}) p(x_i; \bar{x}) / (f(x_i) p(\bar{x}; x_i))), where p(\bar{x}; x_i) denotes the probability of proposing \bar{x} from x_i. If p(\bar{x}; x_i) = p(x_i; \bar{x}), this reduces to the Metropolis algorithm.

In practical applications, the width of the proposal distribution (e.g. \sigma for a Gaussian update or v for a uniform update) should be tuned during the burn-in phase to set the rejection fraction in the right range; the acceptance probability is conventionally kept between roughly 20% and 50%.

Fig. 2: Behaviour of ...
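The sketch below is a minimal illustration of the Metropolis algorithm listed above, with a Gaussian proposal N(0, \sigma^2); the target density, step size, and burn-in length are illustrative assumptions.

```python
# Minimal sketch of the Metropolis algorithm with a Gaussian proposal,
# sampling a target density known only up to a normalization constant.
import math
import random

def metropolis(target, x0, n_steps, sigma=1.0, burn_in=1000, seed=5):
    rng = random.Random(seed)
    x, chain = x0, []
    for step in range(n_steps):
        x_trial = x + rng.gauss(0.0, sigma)     # update step tau ~ N(0, sigma^2)
        # accept with probability min(1, f(x_trial)/f(x)); the normalization
        # of the target cancels in the ratio
        if rng.random() < min(1.0, target(x_trial) / target(x)):
            x = x_trial                         # otherwise the chain stays at x
        if step >= burn_in:                     # discard the burn-in phase
            chain.append(x)
    return chain

# Example target: an unnormalized standard Gaussian, exp(-x^2/2).
chain = metropolis(lambda x: math.exp(-0.5 * x * x), x0=0.0, n_steps=100_000)
mean = sum(chain) / len(chain)
var = sum((c - mean) ** 2 for c in chain) / len(chain)
print(mean, var)   # should approach 0 and 1 asymptotically
```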
4 Numerical integration

One of the most important applications of the Monte Carlo method is the calculation of integrals. Given the following integral:

G = \int g(x) f(x) dx, \quad f(x) \ge 0 \ and \ \int f(x) dx = 1    (33)

one can sample the probability distribution function f(x) and form the arithmetic mean:

G_N = \frac{1}{N} \sum_i g(x_i)    (34)

where N is the number of sampling points. The original integral can be written as:

G = G_N + error    (35)

with

|error| \simeq \frac{\sigma}{\sqrt{N}}    (36)

where

\sigma^2 = \int g(x)^2 f(x) dx - G^2    (37)

denotes the variance of the function g(x). The error decreases as 1/\sqrt{N}, independent of the dimensionality of the integral. This is the key advantage of the Monte Carlo method over direct numerical quadrature, whose computational cost scales exponentially with the number of dimensions. In order to reduce the error in the calculation of the integral using the Monte Carlo method for a given number of samplings, one needs either to reduce the variance of the integrand or to improve the scaling with respect to the number of samplings. In the following, we introduce several variance reduction methods and a quasi-Monte Carlo method to reduce the numerical error in the evaluation of integrals using the Monte Carlo method.

4.1 Importance sampling for variance reduction

Given the initial integral Eq. (33), we can rewrite the integral as:

G = \int \frac{g(x) f(x)}{\bar{f}(x)} \bar{f}(x) dx    (38)

where \bar{f}(x) is a new probability density function. Using the sampling from this new probability distribution function, the numerical integral can be written as:

G_N = \frac{1}{N} \sum_i \frac{g(x_i) f(x_i)}{\bar{f}(x_i)}    (39)

The variance of the new integrand becomes:

\bar{\sigma}^2 = \int \left( \frac{g(x) f(x)}{\bar{f}(x)} \right)^2 \bar{f}(x) dx - G^2    (40)

Ideally, the optimal \bar{f}(x) should be chosen as g(x) f(x)/G [2]. In practice, a function similar to the integrand g(x) f(x) can be used as \bar{f}(x) to reduce the variance. For example, consider the following integral:

G = \int_0^1 \cos(x) dx    (41)

A straightforward Monte Carlo algorithm would be to sample a uniform probability density function f(x) between [0, 1] and then to calculate the mean quantity \langle \cos(x) \rangle. The variance of this direct Monte Carlo calculation is about 0.02. If we approximate the original function as:

\cos(x) \approx 1 - \frac{x^2}{2}    (42)

and choose \bar{f}(x) = \frac{6}{5}\left(1 - \frac{x^2}{2}\right) as the importance sampling function, the new function to be averaged will be:

\bar{g}(x) = \frac{\cos(x)}{\bar{f}(x)} = \frac{10}{6} \frac{\cos(x)}{2 - x^2}    (43)

The variance of the new function is about two orders of magnitude smaller than that of the direct Monte Carlo method.
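The importance-sampling example above can be sketched as follows. Here \bar{f}(x) is itself sampled by rejection against a uniform envelope, which is an implementation convenience rather than part of the lecture; the estimate should approach sin(1) ≈ 0.8415.

```python
# Sketch of the importance-sampling example: G = int_0^1 cos(x) dx with
# fbar(x) = (6/5)*(1 - x**2/2), Eqs. (38)-(43).
import math
import random

def sample_fbar(rng: random.Random) -> float:
    fbar_max = 1.2                       # fbar(0) = 6/5
    while True:                          # rejection sampling of fbar
        x = rng.random()
        if rng.random() <= (1.2 * (1.0 - 0.5 * x * x)) / fbar_max:
            return x

def importance_estimate(n: int, seed: int = 9) -> float:
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = sample_fbar(rng)
        # weight g(x)*f(x)/fbar(x), with g = cos and f = 1 (uniform on (0,1))
        total += math.cos(x) / (1.2 * (1.0 - 0.5 * x * x))
    return total / n

print(importance_estimate(100_000))  # should approach sin(1) ~ 0.84147
```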
4.2 Correlated sampling for variance reduction

Besides the importance sampling method, another way to reduce the variance is to make use of a function whose integral can be calculated analytically. Using the analytically integrable function h(x), the original integral Eq. (33) can be rewritten as:

G = \int (g(x) - h(x)) f(x) dx + \int h(x) f(x) dx    (44)

Here, we assume that the second integral can be obtained analytically. Using the Monte Carlo method to sample the probability distribution f(x), the above integral can be approximated as:

G \simeq \frac{1}{N} \sum_i (g(x_i) - h(x_i)) + \int h(x) f(x) dx    (45)

If the variance of g(x) - h(x) is much less than the variance of the original function g(x), in particular if |g(x) - h(x)| is approximately constant for different values of h(x), then the above correlated sampling is an efficient variance reduction method. For example, consider the following integral:

G = \int_0^1 \sin(x) dx    (46)

The variance of G using the direct Monte Carlo method with a uniform probability distribution function is about 0.06. If we choose h(x) = x, the variance of \sin(x) - x under the same uniform probability distribution is about 0.002, which is more than an order of magnitude less than the variance of the original function.
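The correlated-sampling example is equally compact. In the sketch below (with illustrative names), only the small difference sin(x) - x is sampled, while the integral of h(x) = x is added analytically as in Eq. (45).

```python
# Sketch of correlated sampling for G = int_0^1 sin(x) dx with h(x) = x,
# whose integral over (0, 1) is known analytically to be 1/2.
import math
import random

def correlated_estimate(n: int, seed: int = 13) -> float:
    rng = random.Random(seed)
    h_integral = 0.5                       # int_0^1 x dx, done analytically
    acc = 0.0
    for _ in range(n):
        x = rng.random()
        acc += math.sin(x) - x             # only the small difference is sampled
    return acc / n + h_integral            # Eq. (45)

print(correlated_estimate(100_000))  # should approach 1 - cos(1) ~ 0.45970
```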
4.3 Antithetic variates for variance reduction

This method exploits the fact that a decrease in variance occurs when random variables are negatively correlated. Consider the following integral:

G = \int_0^1 g(x) dx    (47)

This integral can be rewritten as:

G = \int_0^1 \frac{1}{2} [g(x) + g(1 - x)] dx    (48)

and

G_N = \frac{1}{N} \sum_i \frac{1}{2} [g(x_i) + g(1 - x_i)]    (49)

If g(x) is a linear function of x, the variance of the above integration will be zero. For nearly linear functions, this method will substantially reduce the variance. For example, consider the following integral:

G = \int_0^1 e^{-x} dx    (50)

The variance using the direct Monte Carlo method (assuming a uniform density function f(x)) is about 0.03. Using the above method, the variance is reduced to about 5 \times 10^{-4}, more than another order-of-magnitude reduction of the variance.

4.4 Quasi-Monte Carlo sampling

As seen from Eq. (36), in order to reduce the error of numerical integration using the Monte Carlo method, besides reducing the variance of the integrand using the preceding methods, another way is to improve the scaling with respect to the number of samplings. Quasi-Monte Carlo sampling is a method that uses a non-random sequence to sample the uniform distribution between zero and one. Sampling of an arbitrary probability distribution can then be attained through transformation of the sampling of the uniform distribution. A non-random sequence that has low discrepancy (a measure of deviation from uniformity) can be used to simulate the uniform distribution. A popular non-random Halton/Hammersley sequence in multiple dimensions is defined as follows [13]:

X = \left( \frac{j - 1/2}{N}, \Phi_2(j), \Phi_3(j), \ldots, \Phi_r(j) \right), \quad j = 1, \ldots, N    (51)

j = a_0 + a_1 r + a_2 r^2 + \ldots    (52)

\Phi_r(j) = \frac{a_0}{r} + \frac{a_1}{r^2} + \ldots    (53)

where \Phi_r(j) is the radical inverse function in the base of a prime number r, and the a_i are the digits of j in base r. For example, using base number r = 2 and j = 1, 2, 3, 4, one obtains the sequence: \Phi_2(1) = 1/2, \Phi_2(2) = 1/4, \Phi_2(3) = 3/4, \Phi_2(4) = 1/8. Figure 3 shows samplings of a two-dimensional uniform distribution using pseudo-random sampling and using the non-random Halton sequence. It is seen that the non-random sampling populates the two-dimensional square more uniformly than the random sampling. The fluctuation of this type of sequence scales close to 1/N, whereas a random Monte Carlo sampling scales as 1/\sqrt{N}; in some cases, the error of numerical integration using the non-random sampling can reach 1/N [6].

Fig. 3: Two-dimensional uniform sampling from the Fortran pseudo-random number generator (left) and the Halton sequence (right).
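The radical inverse function of Eq. (53) can be implemented in a few lines. The sketch below (with illustrative names) reproduces the base-2 values quoted above and builds a two-dimensional Halton-type point set with bases 2 and 3, as in the right panel of Fig. 3.

```python
# Sketch of the radical inverse function of Eq. (53) and a 2D Halton-type
# point set built from it (bases 2 and 3).
def radical_inverse(j: int, base: int) -> float:
    """Reflect the base-`base` digits of j about the radix point, Eq. (53)."""
    inv, denom = 0.0, 1.0
    while j > 0:
        j, digit = divmod(j, base)
        denom *= base
        inv += digit / denom
    return inv

# First few values in base 2: 1/2, 1/4, 3/4, 1/8, ...
print([radical_inverse(j, 2) for j in range(1, 5)])

# A low-discrepancy 2D point set on the unit square
points = [(radical_inverse(j, 2), radical_inverse(j, 3)) for j in range(1, 1001)]
print(points[:3])
```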
Acknowledgements

This work was supported by the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
References

[1] J. H. Halton, SIAM Review, Vol. 12, No. 1 (1970).
[2] M. H. Kalos and P. A. Whitlock, Monte Carlo Methods, 2nd ed., WILEY-VCH Verlag GmbH & Co., Weinheim, 2008.
[3] https://en.wikipedia.org/wiki/Monte_Carlo_method
[4] M. E. J. Newman and G. T. Barkema, Monte Carlo Methods in Statistical Physics, Oxford University Press, Oxford, 2001.
[5] P. Glasserman, Monte Carlo Methods in Financial Engineering, Springer Science+Business Media, New York, 2003.
[6] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes in FORTRAN, 2nd ed., Cambridge University Press, Cambridge, 1992.
[7] G. E. P. Box and M. E. Muller, The Annals of Mathematical Statistics 29 (2), 610-611, 1958.
[8] https://en.wikipedia.org/wiki/Rejection_sampling
[9] https://en.wikipedia.org/wiki/Markov_chain
[10] P. Lam, http://patricklam.org/teaching/mcmc_print.pdf
[11] P. Breheny, https://web.as.uky.edu/statistics/users/pbreheny/701/S13/notes/2-28.pdf
[12] S. Paltani, .
[13] S. M. Lavalle,