Introduction to Randomness and Statistics
excerpt from the book
Practical Guide to Computer Simulations
World Scientific 2009, ISBN 978-981-283-415-7
with permission by World Scientific Publishing Co. Pte. Ltd.

Alexander K. Hartmann
Institute of Physics, University of Oldenburg, Germany
October 22, 2018

Chapter 7
Randomness and Statistics
In this chapter, we are concerned with statistics in a very broad sense. This involves the generation of (pseudo) random data, the display/plotting of data and the statistical analysis of simulation results.

Frequently, a simulation involves the explicit generation of random numbers, for instance as auxiliary quantities for stochastic simulations. In this case it is obvious that the simulation results are random as well. Although there are many simulations which are explicitly not random, the resulting behavior of the simulated systems may appear random as well, for example the motion of interacting gas atoms in a container. Hence, methods from statistical data analysis are necessary for almost all analysis of simulation results.

This chapter starts (Sec. 7.1) with an introduction to randomness and statistics. In Sec. 7.2 the generation of pseudo random numbers according to some given probability distribution is explained. Basic analysis of data, i.e., the calculation of mean, variance, histograms and corresponding error bars, is covered in Sec. 7.3. Next, in Sec. 7.4, it is shown how data can be represented graphically using suitable plotting tools, gnuplot and xmgrace. Hypothesis testing and how to measure or ensure independence of data is treated in Sec. 7.5. How to fit data to functions is explained in Sec. 7.6. In the concluding section, a special technique is outlined which allows one to cope with the limitations of simulations due to finite system sizes.

Note that some examples are again presented using the C programming language. Nevertheless, there exist very powerful freely available programs like R [R], where many analysis (and plotting) tools are available as additional packages.
7.1 Introduction to probability

Here, a short introduction to concepts of probability and randomness is given. The presentation should be concise concerning the subjects presented in this book. Nevertheless, more details, in particular proofs, examples and exercises, can be found in standard textbooks [Dekking et al (2005), Lefebvre (2006)]. Here, often a sloppy mathematical notation is used for brevity, e.g. instead of writing "a function g : X → Y, y = g(x)", we often write simply "a function g(x)".

A random experiment is an experiment which is truly random (like radioactive decay or quantum mechanical processes) or at least unpredictable (like tossing a coin or predicting the position of a certain gas atom inside a container which holds a hot dense gas).

Definition
The sample space
Ω is a set of all possible outcomes of a randomexperiment.For the coin example, the sample space is Ω = { head, tail } . Note that asample space can be in principle infinite, like the possible x positions of an atomin a container. With infinite precision of measurement we have Ω ( x ) = [0 , L x ],where the container shall be a box with linear extents L x ( L y , L z in the otherdirections, see below).For a random experiment, one wants to know the probability that certainevents occur. Note that for the position of an atom in a box, the probability tofind the atom precisely at some x -coordinate x ∈ Ω ( x ) is zero if one assumes thatmeasurements result in real numbers with infinite precision. For this reason, oneconsiders probabilities P ( A ) of subsets A ⊂ Ω (in other words A ∈ Ω , 2 Ω beingthe power set which is the set of all subsets of Ω). Such a subset is calledan event . Therefore P ( A ) is the probability that the outcome of a randomexperiment is inside A , i.e. one of the elements of A . More formally: Definition A probability function P is a function P : 2 Ω −→ [0 ,
1] with P (Ω) = 1 (7.1)and for each finite or infinite sequence A , A , A , . . . of mutual disjoint events( A i ∩ A j = ∅ for i = j ) we have P ( A ∪ A ∪ A ∪ . . . ) = P ( A ) + P ( A ) + P ( A ) + . . . (7.2)For a fair coin, both sides would appear with the same probability, hence onehas P ( ∅ ) = 0, P ( { head } ) = 0 . P ( { tail } ) = 0 . P ( { head, tail } ) = 1. For thehot gas inside the container, we assume that no external forces act on the atoms.Then the atoms are distributed uniformly. Thus, when measuring the x positionof an atom, the probability to find it inside the region A = [ x, x + ∆ x ] ⊂ Ω ( x ) is P ( A ) = ∆ x/L x .The usual set operations applies to events. The intersection A ∩ B of twoevents is the event which contains elements that are both in A and B . Hence P ( A ∩ B ) is the probability that the outcome of an experiment is contained inboth events A and B . The complement A c of a set is the set of all elements ofΩ which are not in A . Since A c , A are disjoint and A ∪ A c = Ω, we get fromEq. (7.2): P ( A c ) = 1 − P ( A ) . (7.3) .1. INTRODUCTION TO PROBABILITY A, B ⊂ Ω: P ( A ∪ B ) = P ( A ) + P ( B ) − P ( A ∩ B ) (7.4) Proof P ( A ) = P ( A ∩ Ω)= P ( A ∩ ( B ∪ B c ))= P (( A ∩ B ) ∪ ( A ∩ B c )) (7 . = P ( A ∩ B ) + P ( A ∩ B c ). If we apply this for A ∪ B instead of A , we get P ( A ∪ B ) = P (( A ∪ B ) ∩ B ) + P (( A ∪ B ) ∩ B c )) = P ( B ) + P ( A ∩ B c ). Eliminating P ( A ∩ B c ) fromthese two equations gives the desired result. (cid:3) Note that Eqs. (7.2) and (7.3) are special cases of this equation.If a random experiment is repeated several times, the possible outcomes ofthe repeated experiment are tuples of outcomes of single experiments. Thus,if you throw the coin twice, the possible outcomes are (head,head), (head,tail),(tail,head), and (tail,tail). This means the sample space is a power of the single-experiment sample spaces. In general, it is also possible to combine differentrandom experiments into one. Hence, for the general case, if k experimentswith sample spaces Ω (1) , Ω (2) , . . . , Ω ( k ) are considered, the sample space of thecombined experiment is Ω = Ω (1) × Ω (2) × . . . × Ω ( k ) . For example, one candescribe the measurement of the position of the atom in the hot gas as a com-bination of the three independent random experiments of measuring the x , y ,and z coordinates, respectively.If we assume that the different experiments are performed independently ,then the total probability of an event for a combined random experiment isthe product of the single-experiment probabilities: P ( A (1) , A (2) , . . . , A ( k ) ) = P ( A (1) ) P ( A (2) ) . . . P ( A ( k ) ).For tossing the fair coin twice, the probability of the outcome (head,tail)is P ( { (head,head) } ) = P ( { head } ) P ( { head } ) = 0 . · . .
25. Similarly, forthe experiment where all three coordinates of an atom inside the containerare measured, one can write P ([ x, x + ∆ x ] × [ y, y + ∆ y ] × [ z, z + ∆ z ]) = P ([ x, x + ∆ x ]) P ([ y, y + ∆ y ]) P ([ z, z + ∆ z ]) = (∆ x/L x )(∆ y/L y )(∆ z/L z ) =∆ x ∆ y ∆ z/ ( L x L y L z ).Often one wants to calculate probabilities which are restricted to specialevents C among all events, hence relative or conditioned to C . For any otherevent A we have P ( C )= P (( A ∪ A c ) ∩ C )= P ( A ∩ C ) + P ( A c ∩ C ), which means P ( A ∩ C ) P ( C ) + P ( A c ∩ C ) P ( C ) = 1. Since P ( A ∩ C ) is the probability of an outcome in A and C and because P ( C ) is the probability of an outcome in C , the fraction P ( A ∩ C ) P ( C ) gives the probability of an outcome A and C relative to C , i.e. theprobability of event A given C , leading to the following Definition
The probability of A under the condition C is

  P(A|C) = P(A ∩ C) / P(C)                                        (7.5)

As we have seen, we have the natural normalization P(A|C) + P(A^c|C) = 1. Rewriting Eq. (7.5) one obtains P(A|C) P(C) = P(A ∩ C). Therefore, the calculation of P(A ∩ C) can be decomposed into two parts, which are sometimes easier to obtain. By symmetry, we can also write P(C|A) P(A) = P(A ∩ C). Combining this with Eq. (7.5), one obtains the famous Bayes' rule

  P(C|A) = P(A|C) P(C) / P(A)                                     (7.6)

This means one of the conditional probabilities P(A|C) and P(C|A) can be expressed via the other, which is sometimes useful if P(A) and P(C) are known. Note that the denominator in Bayes' rule is sometimes written as P(A) = P(A ∩ (C ∪ C^c)) = P(A ∩ C) + P(A ∩ C^c) = P(A|C) P(C) + P(A|C^c) P(C^c).

If an event A is independent of the condition C, its conditional probability should be the same as the unconditional probability, i.e., P(A|C) = P(A). Using P(A ∩ C) = P(A|C) P(C) we get P(A ∩ C) = P(A) P(C), i.e., the probabilities of independent events have to be multiplied. This was used already above for random experiments, which are conducted as independent subexperiments.

So far, the outcomes of the random experiments can be anything like the sides of coins, sides of a dice, colors of the eyes of randomly chosen people or states of random systems. In mathematics, it is often easier to handle numbers instead of arbitrary objects. For this reason one can represent the outcomes of random experiments by numbers which are assigned via special functions:

Definition
For a sample space Ω, a random variable is a function X : Ω −→ R . For example, one could use X (head)=1 and X (tail) = 0. Hence, if onerepeats the experiments k times independently, one would obtain the numberof heads by P ki =1 X ( ω ( i ) ), where ω ( i ) is the outcome of the i ’th experiment.If one is interested only in the values of the random variable, the connectionto the original sample space Ω is not important anymore. Consequently, onecan consider random variables X as devices, which output a random number x each time a random experiment is performed. Note that random variablesare usually denoted by upper-case letters, while the actual outcomes of randomexperiments are denoted by lower-case letters.Using the concept of random variables, one deals only with numbers asoutcomes of random experiments. This enables many tools from mathematicsto be applied. In particular, one can combine random variables and functions toobtain new random variables. This means, in the simplest case, the following:First, one performs a random experiment, yielding a random outcome x . Next,for a given function g , y = g ( x ) is calculated. Then, y is the final outcomeof the random experiment. This is called a transformation Y = g ( X ) of therandom variable X . More generally, one can also define a random variable Y bycombining several random variables X (1) , X (2) , . . . , X ( k ) via a function ˜ g suchthat Y = ˜ g (cid:16) X (1) , X (2) , . . . , X ( k ) (cid:17) . (7.7)In practice, one would perform random experiments for the random variables X (1) , X (2) , . . . , X ( k ) , resulting in outcomes x (1) , x (2) , . . . , x ( k ) . The finalnumber is obtained by calculating y = ˜ g ( x (1) , x (2) , . . . , x ( k ) ). A simple butthe most important case is the linear combination of random variables Y = α X (1) + α X (2) + . . . + α k X ( k ) , which will be used below. For all examples .1. INTRODUCTION TO PROBABILITY X (1) , X (2) , . . . , X ( k ) have the same prop-erties, which means that the same random experiment is repeated k times.Nevertheless, the most general description which allows for different randomvariables will be used here.The behavior of a random variable is fully described by the probabilities ofobtaining outcomes smaller or equal to a given parameter x : Definition
The distribution function of a random variable X is a function F X : R −→ [0 ,
1] defined via F X ( x ) = P ( X ≤ x ) (7.8)The index X is omitted if no confusion arises. Sometimes the distributionfunction is also named cumulative distribution function. One also says, the dis-tribution function defines a probability distribution . Stating a random variableor stating the distribution function are fully equivalent methods to describe arandom experiment.For the fair coin, we have, see left of Fig. 7.1 F ( x ) = x < . ≤ x < x ≥ . (7.9)For measuring the x position of an atom in the uniformly distributed gas weobtain, see right of Fig. 7.1 F ( x ) = x < x/L x ≤ x < L x x ≥ L x . (7.10) F(x)10.5 0 1 F(x)1 0 L x x x Figure 7.1: Distribution function of the random variable for a fair coin (left)and for the random x position of a gas atom inside a container of length L x .Since the outcomes of any random variable are finite, there are no possibleoutcomes X ≤ x in the limit x → −∞ . Also, all possible outcomes fulfill X ≤ x A.K. Hartmann: Practical Guide to Computer Simulations for x → ∞ . Consequently, one obtains for all random variables lim x →−∞ F ( x ) =0 and lim x → + ∞ F ( x ) = 1. Furthermore, from Def. 7.1, one obtains immediately: P ( x < X ≤ x ) = F X ( x ) − F X ( x ) (7.11)Therefore, one can calculate the probability to obtain a random number for anyarbitrary interval, hence also for unions of intervals.The distribution function, although it contains all information, is sometimesless convenient to handle, because it gives information about cumulative proba-bilities. It is more obvious to describe the outcomes of the random experimentsdirectly. For this purpose, we have to distinguish between discrete random vari-ables, where the number of possible outcomes is denumerable or even finite, and continuous random variables, where the possible outcomes are non-denumerable.The random variable describing the coin is discrete, while the position of anatom inside a container is continuous. We first concentrate on discrete random variables. Here, an alternative butequivalent description to the distribution function is to state the probability foreach possible outcome directly:
Definition
For a discrete random variable X , the probability mass function (pmf) p X : R → [0 ,
1] is given by p X ( x ) = P ( X = x ) . (7.12)Again, the index X is omitted if no confusion arises. Since a discrete randomvariable describes only a denumerable number of outcomes, the probability massfunction is zero almost everywhere. In the following, the outcomes x where p X ( x ) > x i . Since probabilities must sum up to one, seeEq. 7.1, one obtains P i p X (˜ x i ) = 1. Sometimes we also write p i = p X (˜ x i ).The distribution funcion F X ( x ) is obtained from the pmf via summing up allprobabilities of outcomes smaller or equal to x : F X ( x ) = X ˜ x i ≤ x p X (˜ x i ) (7.13)For example, the pmf of the random variable arising from the fair coin Eq.(7.9) is given by p (0) = 0 . p (1) = 0 . p ( x ) = 0 elsewhere). The general-ization to a possibly unfair coin, where the outcome “1” arises with probability p , leads to: Definition
The
Bernoulli distribution with parameter p (0 < p ≤
1) de-scribes a discrete random variable X with the following probability mass func-tion p X (1) = p, p X (0) = 1 − p . (7.14)Performing a Bernoulli experiment means that one throws a generalized coinand records either “0” or “1” depending on whether one gets head or tail. .1. INTRODUCTION TO PROBABILITY Definition • The expectation value is µ ≡ E[ X ] = X i ˜ x i P ( X = ˜ x i ) = X i ˜ x i p X (˜ x i ) (7.15) • The variance is σ ≡ Var[ X ] = E[( X − E[ X ]) ] = X i (˜ x i − E[ X ]) p X (˜ x i ) (7.16) • The standard deviation σ ≡ p Var[ X ] (7.17)The expectation value describes the “average” one would typically obtain if therandom experiment is repeated very often. The variance is a measure for thespread of the different outcomes of random variable. As example, the Bernoullidistribution exhibitsE[ X ] = 0 p (0) + 1 p (1) = p (7.18)Var[ X ] = (0 − p ) p (0) + (1 − p ) p (1)= p (1 − p ) + (1 − p ) p = p (1 − p ) (7.19)One can calculate expectation values of functions g ( x ) of random variables X via E[ g ( X )] = P i g (˜ x i ) p X (˜ x i ). For the calculation here, we only need that thecalculation of the expectation value is a linear operation. Hence, for numbers α , α and, in general, two random variables X , X one has E [ α X + α X ] = α E[ X ] + α E[ X ] . (7.20)In this way, realizing that E[ X ] is a number, one obtains: σ = Var( X ) = E[( X − E[ X ]) ] = E[ X ] − X E[ X ]] + E[E[ X ] ]= E[ X ] − E[ X ] = E[ X ] − µ (7.21) ⇔ E[ X ] = σ + µ (7.22)The variance is not linear , which can be seen when looking at a linear com-bination of two independent random variables X , X (implying E[ X X ] =0 A.K. Hartmann: Practical Guide to Computer Simulations E[ X ] E[ X ] ( ⋆ )) σ α X + α X = Var[ α X + α X ] (7 . = E[( α X + α X ) ] − E[ α X + α X ] . = E[ α X + 2 α α X X + α X ] − ( α E[ X ] + α E[ X ]) . , ( ⋆ ) = α E[ X ] + α E[ X ] − α E[ X ] + α E[ X ] . = α Var[ X ] + α Var[ X ] (7.23)The expectation values E [ X n ] are called the n ’th moments of the distribu-tion. This means that the expectation value is the first moment and the variancecan be calculated from the first and second moments.Next, we describe two more important distributions of discrete random vari-ables. First, if one repeats a Bernoulli experiment n times, one can measurehow often the result “1” was obtained. Formally, this can be written as a sum of n random variables X ( i ) which are Bernoulli distributed: X = P ni =1 X ( i ) withparameter p . This is a very simple example of a transformation of a random vari-able, see page 6. In particular, the transformation is linear. The probability toobtain x times the result “1” is calculated as follows: The probability to obtainexactly x times a “1” is p x , the other n − x experiments yield “0” which hap-pens with probability (1 − p ) n − x . Furthermore, there are (cid:0) nx (cid:1) = n ! / ( x !( n − x )!)different sequences with x times “1” and n − x times “0”. Hence, one obtains: Definition
The binomial distribution with parameters n ∈ N and p (0 < p ≤ 1) describes a random variable X which has the pmf

  p_X(x) = (n choose x) p^x (1 − p)^(n−x)    (0 ≤ x ≤ n)          (7.24)

A common notation is X ∼ B(n, p). Note that the probability mass function is assumed to be zero for argument values that are not stated. A sample plot of the distribution for parameters n = 10 and p = 0.4 is shown in the left of Fig. 7.2. The binomial distribution exhibits

  E[X] = np                                                        (7.25)
  Var[X] = np(1 − p)                                               (7.26)

(without proof here). The distribution function cannot be calculated analytically in closed form.

In the limit of a large number of experiments (n → ∞), constrained such that the expectation value µ = np is kept fixed, the pmf of a binomial distribution is well approximated by the pmf of the Poisson distribution, which is defined as follows:
Definition
The
Poisson distribution with parameter µ > X with pmf p X ( x ) = µ x x ! e − µ (7.27) .1. INTRODUCTION TO PROBABILITY x p ( x ) x p ( x ) Figure 7.2: (Left) Probability mass function of the binomial distribution forparameters n = 10 and p = 0 .
4. (Right) Probability mass function of thegeometric distribution for parameter p = 0 . P i µ x x ! is the Taylorseries of e µ . The Poisson distribution exhibits E[ X ] = µ and Var[ X ] = µ . Again,a closed form for the distribution function is not known.Furthermore, one could repeat a Bernoulli experiment just until the first timea “1” is observed, without limit for the number of trials. If a “1” is observed forthe first time after exactly x times, then the first x − − p ) x − . At the x ’th experiment,the outcome “1” is observed which has the probability p . Therefore one obtains Definition
The geometric distribution with parameter p (0 < p ≤
1) de-scribes a random variable X which has the pmf p X ( x ) = (1 − p ) x − p ( x ∈ N ) (7.28)A sample plot of the pmf (up to x = 10) is shown in the right of Fig. 7.2. Thegeometric distribution has (without proof here) the expectation value E [ X ] =1 /p , the variance Var[ X ] = (1 − p ) /p and the following distribution function: F X ( x ) = ( x < − (1 − p ) m m ≤ x < m + 1 ( m ∈ N ) As stated above, random variables are called continuous if they describe randomexperiments where outcomes from a subset of the real numbers can be obtained.One may describe such random variables also using the distribution function,see Def. 7.1. For continuous random variables, an alternative description is2
A.K. Hartmann: Practical Guide to Computer Simulations possible, equivalent to the pmf for discrete random variables: The probabilitydensity function states the probability to obtain a certain number per unit:
Definition
For a continuous random variable X with a continuous distribu-tion function F X , the probability density function (pdf) p X : R → [0 ,
1] is givenby p X ( x ) = dF X ( x ) dx (7.29)Consequently, one obtains, using the definition of a derivative and using Eq.(7.11) F X ( x ) = Z x −∞ d ˜ x p X (˜ x ) (7.30) P ( x < X ≤ x ) = Z x x d ˜ x p X (˜ x ) (7.31)Below some examples for important continuous random variables are pre-sented. First, we extend the definitions Def. 7.1.2 of expectation value andvariance to the continuous case: Definition • The expectation value is E[ X ] = Z ∞−∞ dx x p X ( x ) (7.32) • The variance isVar[ X ] = E[( X − E[ X ]) ] = Z −∞∞ ( x − E[ X ]) p X ( x ) (7.33)Expectation value and variance have the same properties as for the discretecase, i.e., Eqs. (7.20), (7.21), and (7.23) hold as well. Also the definition of then’th moment of a continuous distribution is the same.Another quantity of interest is the median , which describes the central pointof the distribution. It is given by the point such that the cumulative probabilitiesleft and right of this point are both equal to 0.5: Definition
The median x med = Med[ X ] is defined via F ( x med ) = 0 . a, b ): Definition
The uniform distribution , with real-valued parameters a < b , describes a randomvariable X which has the pdf p X ( x ) = x < a b − a x ≤ x < b x ≥ .1. INTRODUCTION TO PROBABILITY X ∼ U ( a, b ). The distribution function simply rises linearly fromzero, starting at x = a , till it reaches 1 at x = b , see for example Eq. 7.10 forthe case a = 0 and b = L x . The uniform distribution exhibits the expectationvalue E[ X ] = ( a + b ) / X ] = ( b − a ) /
12. Note that viathe linear transformation g ( X ) = ( b − a ) ∗ X + a one obtains g ( X ) ∼ U ( a, b )if X ∼ U (0 , Definition
The
Gaussian distribution, also called normal distribution, with real-valued parameters µ and σ > 0, describes a random variable X which has the pdf

  p_X(x) = 1/(√(2π) σ) exp( −(x − µ)²/(2σ²) )                      (7.36)

One writes X ∼ N(µ, σ²). The Gaussian distribution has expectation value E[X] = µ and variance Var[X] = σ². A sample plot of the distribution for parameters µ = 5 and σ = 3 is shown in the left of Fig. 7.3. The Gaussian distribution for µ = 0 and σ = 1 is called standard normal distribution N(0, 1). A random variable distributed according to the general case N(µ, σ²) can be obtained from a standard normal variable X ∼ N(0, 1) by applying the transformation g(X) = σX + µ. Note that the distribution function for the Gaussian distribution cannot be calculated analytically. Thus, one usually uses numerical integration or tabulated values of N(0, 1).

Figure 7.3: (Left) Probability density function of the Gaussian distribution for parameters µ = 5 and σ = 3. (Right) Probability density function of the exponential distribution for parameter µ = 3.

The central limit theorem describes how the Gaussian distribution arises from a sum of random variables:
Theorem
Let X^(1), X^(2), . . . , X^(n) be independent random variables, which all follow the same distribution exhibiting expectation value µ and variance σ². Then

  X = Σ_{i=1}^{n} X^(i)                                            (7.37)

is in the limit of large n approximately Gaussian distributed with mean nµ and variance nσ², i.e. X ∼ N(nµ, nσ²). Equivalently, the suitably normalized sum

  Z = ( (1/n) Σ_{i=1}^{n} X^(i) − µ ) / (σ/√n)                      (7.38)

is approximately standard normal distributed, Z ∼ N(0, 1). For example, Brownian motion is described by a Gaussian distribution.

Another common probability distribution is the exponential distribution.
Definition
The exponential distribution , with real-valued parameter µ > X which has the pdf p X ( x ) = 1 µ exp ( − x/µ ) (7.39)A sample plot of the distribution for parameter µ = 3 is shown in the rightof Fig. 7.3. The exponential distribution has expectation value E[ X ] = µ andvariance Var[ X ] = µ . The distribution function can be obtained analyticallyand is given by F X ( x ) = 1 − exp ( − x/µ ) (7.40)The exponential distribution arises under circumstances where processeshappen with certain rates , i.e., with a constant probability per time unit. Veryoften, waiting queues or the decay of radioactive atoms are modeled by suchrandom variables. Then the time duration till the first event (or between twoevents if the experiment is repeated several times) follows Eq. (7.39).Next, we discuss a distribution, which has attracted recently [Newman (2003),Newman et al. (2006)] much attention in various disciplines like sociology, physicsand computer science. Its probability distribution is a power law: Definition
The power-law distribution , also called
Pareto distribution , withreal-valued parameters γ > κ >
0, describes a random variable X whichhas the pdf p X ( x ) = ( x < γκ ( x/κ ) − γ +1 x ≥ .1. INTRODUCTION TO PROBABILITY
15A discretized version of the power-law distribution appears for example inempirical social networks. The probability that a person has x “close friends”follows a power-law distribution. The same is observed for computer networksfor the probability that a computer is connected to x other computers. Thepower-law distribution has a finite expectation value only if γ >
1, i.e. if it fallsoff quickly enough. In that case one obtains E[ X ] = γκ/ ( γ − γ >
2: Var[ X ] = κ γ ( γ − ( γ − . The distributionfunction can be calculated analytically: F X ( x ) = 1 − ( x/κ ) − γ ( x ≥
1) (7.42) x p ( x ) x -40 -30 -20 -10 p ( x ) Figure 7.4: (Left) Probability density function of the power-law distributionfor parameters γ = 3 and κ = 1. (Right) Probability density function of theFisher-Tippett distribution for parameter λ = 3 with logarithmically scaled y -axis.In the context of extreme-value statistics, the Fisher-Tippett distribution(also called log-Weibull distribution) plays an important role. Definition
The
Fisher-Tippett distribution, with real-valued parameter λ > 0, describes a random variable X which has the pdf

  p_X(x) = λ e^(−λx) exp(−e^(−λx))                                  (7.43)

In the special case of λ = 1, the Fisher-Tippett distribution is also called Gumbel distribution. A sample Fisher-Tippett distribution is shown in the right part of Fig. 7.4. The function exhibits a maximum at x = 0. This can be shifted to any value µ by replacing x by x − µ. The expectation value is E[X] = ν/λ, where ν ≈ 0.57721 . . . is the Euler-Mascheroni constant. The distribution exhibits a variance of Var[X] = π²/(6λ²). Also, the distribution function is known analytically:

  F_X(x) = exp(−e^(−λx))                                            (7.44)
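Since the distribution function Eq. (7.44) can be inverted in closed form, Fisher-Tippett distributed random numbers can be generated directly via the inversion method discussed in Sec. 7.2 below. The following small C function is a sketch added here for illustration only (it assumes the standard drand48() generator, as used in the examples of Sec. 7.2); solving u = exp(−e^(−λx)) for x gives x = −ln(−ln u)/λ:

#include <math.h>
#include <stdlib.h>

/* returns a Fisher-Tippett distributed random number with parameter
   lambda, obtained by inverting Eq. (7.44): x = -log(-log(u))/lambda */
double rand_fisher_tippett(double lambda)
{
  double u;                       /* uniform random number in (0,1) */

  do
    u = drand48();
  while(u == 0.0);                /* avoid log(0) */
  return(-log(-log(u))/lambda);
}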
Mathematically, one can obtain a Gumbel ( λ = 1) distributed randomvariable from n standard normal N (0 ,
1) distributed variables X ( i ) by tak-ing the maximum of them and performing the limit n → ∞ , i.e. X =lim n →∞ max (cid:8) X (1) , X (2) , . . . , X ( n ) (cid:9) . This is also true for some other “well-behaved” random variables like exponential distributed ones, if they are nor-malized such that they have zero mean and variance one. The Fisher-Tippettdistribution can be obtained from the Gumbel distribution via a linear trans-formation.For the estimation of confidence intervals (see Secs. 7.3.2 and 7.3.3) oneneeds the chi-squared distribution and the F distribution, which are presentednext for completeness. Definition
The chi-squared distribution, with ν > 0 degrees of freedom, describes a random variable X which has the probability density function (using the Gamma function Γ(x) = ∫_0^∞ t^(x−1) e^(−t) dt)

  p_X(x) = 1/(2^(ν/2) Γ(ν/2)) x^(ν/2 − 1) e^(−x/2)    (x > 0)       (7.45)

and p_X(x) = 0 for x ≤ 0. Distribution function, mean and variance are not stated here. A chi-squared distributed random variable can be obtained from a sum of ν squared standard normal distributed random variables X_i: X = Σ_{i=1}^{ν} X_i². The chi-squared distribution is implemented in the GNU scientific library (see Sec. 6.3).
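As a small illustration of this (a sketch added here; see the GSL documentation for details of the functions used), the following lines evaluate the chi-squared distribution function, its inverse, and draw one chi-squared distributed random number:

#include <stdio.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
#include <gsl/gsl_cdf.h>

int main()
{
  gsl_rng *r = gsl_rng_alloc(gsl_rng_mt19937);   /* random number generator */
  double nu = 5.0;                               /* degrees of freedom */

  printf("F(3) = %f\n", gsl_cdf_chisq_P(3.0, nu));          /* distribution function */
  printf("F^-1(0.975) = %f\n", gsl_cdf_chisq_Pinv(0.975, nu)); /* its inverse */
  printf("random value = %f\n", gsl_ran_chisq(r, nu));      /* random number */
  gsl_rng_free(r);
  return(0);
}

When compiling, the GSL has to be linked, e.g. via the options -lgsl -lgslcblas -lm.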
Definition
The
F distribution, with d_1, d_2 > 0 degrees of freedom, describes a random variable X which has the pdf

  p_X(x) = d_1^(d_1/2) d_2^(d_2/2) [ Γ((d_1+d_2)/2) / (Γ(d_1/2) Γ(d_2/2)) ] x^(d_1/2 − 1) (d_1 x + d_2)^(−(d_1+d_2)/2)    (x > 0)      (7.46)

and p_X(x) = 0 for x ≤ 0. Distribution function, mean and variance are not stated here. An F distributed random variable can be obtained from a chi-squared distributed random variable Y_1 with d_1 degrees of freedom and a chi-squared distributed random variable Y_2 with d_2 degrees of freedom via X = (Y_1/d_1)/(Y_2/d_2). The F distribution is implemented in the GNU scientific library (see Sec. 6.3).

Finally, note that also discrete random variables can be described using probability density functions if one applies the so-called delta function δ(x − x_0). For the purpose of computer simulations this is not necessary. Consequently, no further details are presented here.

7.2 Generating (pseudo) random numbers

For many simulations in science, economy or social sciences, random numbers are necessary. Quite often the model itself exhibits random parameters which remain fixed throughout the simulation; one speaks of quenched disorder. A famous example in the field of condensed matter physics are spin glasses, which are disordered magnetic systems. In this section, it is explained how pseudo random numbers can be generated on a computer; in particular, the inversion method, the rejection method and the Box-Müller method are presented. More comprehensive information about these and similar techniques can be found in Refs. [Morgan (1984), Devroye (1986), Press et al. (1995)]. In this section it is assumed that you are familiar with the basic concepts of probability theory and statistics, as presented in Sec. 7.1.
First, it should be pointed out that standard computers are deterministic machines. Thus, it is completely impossible to generate true random numbers directly. One could, for example, include interaction with the user. It is, for example, possible to measure the time interval between successive keystrokes, which is randomly distributed by nature. But the resulting time intervals depend heavily on the current user, which means the statistical properties cannot be controlled. On the other hand, there are external devices which have a true random physical process built in and which can be attached to a computer [Qantis, Westphal] or used via the internet [Hotbits]. Nevertheless, since these numbers are really random, they do not allow one to perform stochastic simulations in a controlled and reproducible way. This is important in a scientific context, because spectacular or unexpected results are often tried to be reproduced by other research groups. Also, some program bugs turn up only for certain random numbers. Hence, for debugging purposes it is important to be able to run exactly the same simulation again. Furthermore, for the true random numbers, either the speed of random number generation is limited if the true random numbers are cheap, or otherwise the generators are expensive.

This is the reason why pseudo random numbers are usually taken. They are generated by deterministic rules. As basis serves a number generator function rand() for a uniform distribution. Each time rand() is called, a new (pseudo) random number is returned. (From now on the "pseudo" is omitted for convenience.) These random numbers should "look like" true random numbers and should have many of their properties. One says they should be "good". What "look like" and "good" means has to be specified: One would like to have a random number generator such that each possible number has indeed the same probability of occurrence. Additionally, if two generated numbers r_i, r_k differ only slightly, the random numbers r_{i+1}, r_{k+1} returned by the respective subsequent calls should differ substantially, hence consecutive numbers should have a low correlation. There are many ways to specify a correlation, hence there is no unique criterion. Below, the simplest one will be discussed.

The simplest methods to generate pseudo random numbers are linear congruential generators. They generate a sequence x_1, x_2, . . . of integer numbers between 0 and m − 1 via the recursion

  x_{n+1} = (a x_n + c) mod m                                       (7.47)

The initial value x_0 is called seed. Here we show a simple C implementation lin_con(). It stores the current number in the local variable x, which is declared as static, such that it is remembered even when the function is terminated (see Sec. 1.2). There are two arguments. The first argument set_seed indicates whether one wants to set a seed. If yes, the new seed should be passed as second argument, otherwise the value of the second argument is ignored. The function returns the seed if it is changed, or the new random number. Note that the constants a and c are defined inside the function, while the modulus m is implemented via a macro RNG_MODULUS to make it visible outside lin_con():

GET SOURCE CODE
DIR: randomness
FILE(S): rng.c

int lin_con(int set_seed, int seed)
{
  static int x = 1000;       /* current random number */
  const int a = 12351;       /* multiplier */
  const int c = 1;           /* shift */

  if(set_seed)               /* new seed ? */
    x = seed;
  else                       /* new random number ? */
    x = (a*x+c) % RNG_MODULUS;
  return(x);
}

If you just want to obtain the next random number, you do not care about the seed. Hence, we use for convenience rand_lin_con(), which calls lin_con() with the first argument being 0:

int rand_lin_con()
{
  return(lin_con(0,0));
}

Correspondingly, the seed can be set via srand_lin_con():

void srand_lin_con(int seed)
{
  lin_con(1, seed);
}

To generate random numbers r distributed in the interval [0, 1) one has to divide the current random number by the modulus m. It is desirable to obtain equally distributed outcomes in the interval, i.e. a uniform distribution. Random numbers generated from this distribution can be used as input to generate random numbers distributed according to other, basically arbitrary, distributions. Below, you will see how random numbers obeying other distributions can be generated. The following simple C function generates random numbers in [0, 1) using the macro RNG_MODULUS defined above:

double drand_lin_con()
{
  return( (double) lin_con(0,0) / RNG_MODULUS);
}

One has to choose the parameters a, c, m in a way that "good" random numbers are obtained, where "good" means "with few correlations". Note that in the past several results from simulations have been proven to be wrong because of the application of bad random number generators [Ferrenberg et al. (1992), Vattulainen et al. (1994)].
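For completeness, a small usage sketch is given here (added for illustration; it assumes that RNG_MODULUS is defined to the modulus m which is used, e.g. 32768, and that the functions from rng.c shown above are included):

#include <stdio.h>

#define RNG_MODULUS 32768          /* modulus m, value just for illustration */

/* ... lin_con(), rand_lin_con(), srand_lin_con(), drand_lin_con()
       as defined above ... */

int main()
{
  int i;                           /* loop counter */

  srand_lin_con(1000);             /* set the seed */
  for(i=0; i<5; i++)               /* print a few uniform random numbers */
    printf("%f\n", drand_lin_con());
  return(0);
}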
Example
To see what "bad generator" means, consider as an example the parameters a = 12351, c = 1, m = 2^15 and the seed value x_0 = 1000. 10000 random numbers are generated and divided by m. They are distributed in the interval [0, 1); the resulting histogram pdf is shown in Fig. 7.5. Correlations of successive numbers can be studied by considering k-tuples of k successive random numbers (x_i, x_{i+1}, . . . , x_{i+k−1}). A good random number generator, exhibiting no correlations, would fill up the k-dimensional space uniformly. Unfortunately, for linear congruential generators, the points instead lie on (k − 1)-dimensional hyperplanes, and there are at most of the order m^(1/k) such planes. A bad generator has much fewer planes. This is the case for the example studied above, see the top part of Fig. 7.6.

The result for a = 123450 is even worse: only 15 different "random" numbers are generated (with seed 1000), then the iteration reaches a fixed point (not shown in a figure).

If instead a = 12349 is chosen, the two-point correlations look like those shown in the bottom half of Fig. 7.6. Obviously, the behavior is much more irregular, but poor correlations may become visible for higher k-tuples.

A generator which has passed several empirical tests is a = 7^5 = 16807, m = 2^31 − 1, c = 0. When implementing this generator you have to be careful, because during the calculation numbers are generated which do not fit into 32 bit. A clever implementation is presented in Ref. [Press et al. (1995)]. Finally, it should be stressed that this generator, like all linear congruential generators, has the low-order bits much less random than the high-order bits. For that reason, when you want to generate integer numbers in an interval [1,N], you should use

  r = 1+(int) (N*x_n/m);

instead of using the modulo operation as with r = 1+(x_n % N).

Figure 7.5: Distribution of random numbers in the interval [0, 1) obtained from converting a histogram into a pdf, see Sec. 7.3.3. The random numbers are generated using a linear congruential generator with the parameters a = 12351, c = 1, m = 2^15.

In standard C, there is a simple built-in random number generator called rand() (see corresponding documentation), which has a modulus m = 2^15, which is very poor. On most operating systems, also drand48() is available, which is based on m = 2^48 (a = 25214903917, c = 11) and also needs special arithmetics. It is already sufficient for simulations which do not need many random numbers and do not require highest statistical quality. In recent years, several high-standard random number generators have been developed. Several very good ones are included in the freely available GNU scientific library (see Sec. 6.3). Hence, you do not have to implement them yourself.

So far, it has been shown how random numbers can be generated which are distributed uniformly in the interval [0, 1). In general, one is interested in random numbers which are distributed according to a given probability distribution with some density p(x). In the next sections, several techniques suitable for this task are presented.

Figure 7.6: Two point correlations x_{i+1}(x_i) between successive random numbers x_i, x_{i+1}. The top case is generated using a linear congruential generator with the parameters a = 12351, c = 1, m = 2^15; the bottom case has instead a = 12349.

In case of discrete distributions with a finite number of possible outcomes, one can create a table of the possible outcomes together with their probabilities p_X(x_i) (i = 1, . . . , i_max), assuming that the x_i are sorted in ascending order. To draw a number, one has to draw a random number u which is uniformly distributed in [0, 1) and take the entry j of the table such that the sum s_j ≡ Σ_{i=1}^{j} p_X(x_i) of the probabilities is larger than u, but s_{j−1} ≡ Σ_{i=1}^{j−1} p_X(x_i) < u. Note that one can search the array quickly by bisection search: the array is iteratively divided into two halves and the search is each time continued in that half where the corresponding entry j is contained. In this way, generating a random number has a time complexity which grows only logarithmically with the number i_max of possible outcomes. This pays off if the number of possible outcomes is very large. A small code sketch of this approach is shown below.
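The following C function is a minimal sketch of this table-plus-bisection approach, added here for illustration (it is not the solution of exercise (1) mentioned below). It assumes that the cumulative sums s[j] = p_X(x_1) + . . . + p_X(x_j) have been precomputed and are stored for j = 1, . . . , i_max in an array of size i_max+1 (entry 0 unused, s[i_max] = 1), and that drand48() is used as uniform generator:

#include <stdlib.h>

/* Draws a random outcome index from a discrete distribution given by
   the table 's' of cumulative probabilities (see text). Returns the
   index j with s[j-1] <= u < s[j], found by bisection search.        */
int rand_discrete(int i_max, double *s)
{
  double u;                   /* uniform random number */
  int low, high, mid;         /* boundaries of the bisection search */

  u = drand48();
  low = 1;
  high = i_max;
  while(low < high)           /* narrow down the interval */
  {
    mid = (low + high) / 2;
    if(u < s[mid])            /* searched entry is in lower half ? */
      high = mid;
    else                      /* searched entry is in upper half */
      low = mid + 1;
  }
  return(low);                /* index of the drawn outcome */
}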
In exercise (1) you are asked to write a function to sample from the probability distribution of a discrete variable, in particular for a Poisson distribution.

In the following, we concentrate on techniques for generating continuous random variables. Given is a random number generator drand() which is assumed to generate random numbers U which are distributed uniformly in [0, 1). The aim is to generate random numbers for a random variable Z with probability density p_Z(z). The corresponding distribution function is

  F_Z(z) ≡ P(Z ≤ z) ≡ ∫_{−∞}^{z} dz' p_Z(z')                        (7.48)

The target is to find a function g(u), such that after the transformation Z = g(U) the outcomes of Z are distributed according to (7.48). It is assumed that g can be inverted and is strongly monotonically increasing. Then one obtains

  F_Z(z) = P(Z ≤ z) = P(g(U) ≤ z) = P(U ≤ g^(−1)(z))                 (7.49)

Since the distribution function F_U(u) = P(U ≤ u) for a uniformly distributed variable is just F_U(u) = u (u ∈ [0, 1]), one obtains F_Z(z) = g^(−1)(z). Thus, one just has to choose g(z) = F_Z^(−1)(z) for the transformation function in order to obtain random numbers which are distributed according to the probability distribution F_Z(z). Of course, this only works if F_Z can be inverted. If this is not possible, you may use the methods presented in the subsequent sections, or you could generate a table of the distribution function, which is in fact a discretized approximation of the distribution function, and use the methods for generating discrete random numbers as shown in Sec. 7.2.2. This can be even refined by using a linearized approximation of the distribution function. Here, we do not go into further details, but present an example where the distribution function can indeed be inverted.

Example
Let us consider the exponential distribution with parameter µ ,with distribution function F Z ( z ) = 1 − exp( − z/µ ), see page 14. Therefore, onecan obtain exponentially distributed random numbers Z by generating uniformdistributed random numbers u and choosing z = − µ ln(1 − u ). GET SOURCE CODE
DIR: random
FILE(S): expo.c
The following simple C function generates a random number which is exponentially distributed. The parameter µ of the distribution is passed as argument.

double rand_expo(double mu)
{
  double randnum;            /* random number U(0,1) */

  randnum = drand48();
  return(-mu*log(1-randnum));
}

Note that we use the simple drand48() random number generator, which is included in the C standard library and works well for applications with moderate statistical requirements. For more sophisticated generators, see e.g. the GNU scientific library (see Sec. 6.3).

Figure 7.7: Histogram pdf (see page 35) of random numbers generated according to an exponential distribution (µ = 1) compared with the probability density function (straight line) in a logarithmic plot.

In Fig. 7.7 a histogram pdf (see page 35) for a large sample of random numbers generated in this way and the exponential probability density function for µ = 1 are shown with a logarithmically scaled y-axis. Only for larger values are deviations visible. They are due to statistical fluctuations since p_Z(z) is very small there.

As mentioned above, the inversion method works only when the distribution function can be inverted analytically. For distributions not fulfilling this condition, sometimes this problem can be overcome by drawing several random numbers and combining them in a clever way.

The rejection method works for random variables where the pdf p(x) fits into a box [x_0, x_1) × [0, y_max), i.e., p(x) = 0 for x ∉ [x_0, x_1] and p(x) ≤ y_max. The basic idea of generating a random number distributed according to p(x) is to generate random pairs (x, y), which are distributed uniformly in [x_0, x_1] × [0, y_max], and to accept only those numbers x where y ≤ p(x) holds, i.e., the pairs which are located below p(x), see Fig. 7.8. Therefore, the probability that x is drawn is proportional to p(x), as desired.

Figure 7.8: The rejection method: Points (x, y) are scattered uniformly over a bounded rectangle. The probability that y ≤ p(x) is proportional to p(x).

GET SOURCE CODE
DIR: randomness
FILE(S): reject.c
The following C function realizes the rejection method for an arbitrary pdf. It takes as arguments the boundaries of the box y_max, x0 and x1 as well as a pointer pdf to the function realizing the pdf. For an explanation of function pointers, see Sec. 1.4.

double reject(double y_max, double x0, double x1,
              double (* pdf)(double))
{
  int found;       /* flag if valid number has been found */
  double x,y;      /* random points in [x0,x1]x[0,y_max] */

  found = 0;
  while(!found)                       /* loop until number is generated */
  {
    x = x0 + (x1-x0)*drand48();       /* uniformly on [x0,x1] */
    y = y_max *drand48();             /* uniformly in [0,y_max] */
    if(y <= pdf(x))                   /* accept ? */
      found = 1;
  }
  return(x);
}

Inside the loop, first a random point, which is uniformly distributed in the box, is generated. Then it is checked whether the point lies below the pdf curve. The search in the loop continues until a random number has been accepted, which is then returned.

Example
The rejection method is applied to a pdf which has density 1 in [0, 0.5) and which rises linearly from 0 to 2 in [1, 1.5) (and is zero elsewhere):

double pdf(double x)
{
  if( (x<0)|| ((x>=0.5)&&(x<1))|| (x>1.5) )
    return(0.0);
  else if((x>=0)&&(x<0.5))
    return(1.0);
  else
    return(4.0*(x-1));
}

The resulting empirical histogram pdf is shown in Fig. 7.9.

Figure 7.9: Histogram pdf (see page 35) of a large sample of random numbers generated using the rejection method for an artificial pdf.

The rejection method can always be applied if the probability density is boxed, but it has the drawback that more random numbers have to be generated than can be used: If A = (x_1 − x_0) y_max is the area of the box, one has on average to generate 2A auxiliary random numbers to obtain one random number of the desired distribution. If this leads to a very poor efficiency, you can consider using several boxes for different parts of the pdf.

In case neither the distribution function can be inverted nor the probability fits into a box, special methods have to be applied. As an example, a method for generating random numbers distributed according to a Gaussian distribution is considered. Other methods and examples of how different techniques can be combined are collected in [Morgan (1984)].

The probability density function for the Gaussian distribution with mean µ and variance σ² is shown in Eq. (7.36), see also Fig. 7.10. It is, apart from uniform distributions, the most common distribution occurring in simulations.

Figure 7.10: Gaussian distribution with zero mean and unit width. The circles represent a histogram pdf (see page 35) obtained from numbers drawn with the Box-Müller method.

Here, the case of a standard Gaussian distribution (µ = 0, σ = 1) is considered. If you want to realize the general case, you have to draw a standard Gaussian distributed number z and then use σz + µ, which is distributed as desired.

Since the Gaussian distribution extends over an infinite interval and because the distribution function cannot be inverted, the methods from above are not applicable. The simplest technique to generate random numbers distributed according to a Gaussian distribution makes use of the central limit theorem 7.1.2. It tells us that any sum of K independently distributed random variables U_i (with mean µ and variance v) will converge to a Gaussian distribution with mean Kµ and variance Kv.
If again U_i is taken to be uniformly distributed in [0, 1) (which has mean µ = 0.5 and variance v = 1/12), a common choice is K = 12, so that the random variable Z = Σ_{i=1}^{K} U_i − 6 is approximately standard Gaussian distributed. This method, however, is only approximate. In contrast, the Box-Müller method is exact. You need two random variables U_1, U_2 uniformly distributed in [0, 1) to generate two independent Gaussian variables N_1, N_2. This can be achieved by generating u_1, u_2 from U_1, U_2 and assigning

  n_1 = √(−2 log(1 − u_1)) cos(2π u_2)
  n_2 = √(−2 log(1 − u_1)) sin(2π u_2)

A proof that n_1 and n_2 are indeed distributed according to (7.36) can be found e.g. in [Press et al. (1995), Morgan (1984)], where also other methods for generating Gaussian random numbers, some even more efficient, are explained. A method which is based on the simulation of particles in a box is explained in [Fernandez and Criado (1999)]. In Fig. 7.10 a histogram pdf of a large sample of random numbers drawn with the Box-Müller method is shown. Note that you can find an implementation of the Box-Müller method in the solution of Exercise (3); a compact sketch is also given below.
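The following C function is a compact sketch of the Box-Müller method, added here for illustration (it is not the code from the exercise solutions; drand48() is again assumed as uniform generator). It returns one standard Gaussian number per call and stores the second generated number for the next call:

#include <math.h>
#include <stdlib.h>

/* Returns a standard Gaussian N(0,1) distributed random number using
   the Box-Mueller method. Two numbers are generated at once; the
   second one is stored and returned at the next call.                */
double rand_gauss()
{
  static int has_stored = 0;     /* is a second number left over ? */
  static double stored;          /* the stored second number */
  double u1, u2, radius;

  if(has_stored)                 /* return left-over number ? */
  {
    has_stored = 0;
    return(stored);
  }
  u1 = drand48();
  u2 = drand48();
  radius = sqrt(-2.0*log(1.0-u1));      /* log(1-u1) avoids log(0) */
  stored = radius*sin(2.0*M_PI*u2);     /* keep second number */
  has_stored = 1;
  return(radius*cos(2.0*M_PI*u2));
}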
7.3 Basic data analysis

The starting point is a sample of n measured points {x_0, x_1, . . . , x_{n−1}} of some quantity, as obtained from a simulation. Examples are the density of a gas, the transition time between two conformations of a molecule, or the price of a stock. We assume that formally all measurements can be described by random variables X_i representing the same random variable X and that all measurements are statistically independent of each other (how to treat statistical dependencies is covered in Sec. 7.5). Usually, one does not know the underlying probability distribution F(x), having density p(x), which describes X. Thus, one wants to obtain information about X by looking at the sample {x_0, x_1, . . . , x_{n−1}}. In principle, one does this by considering estimators h = h(x_0, x_1, . . . , x_{n−1}). Since the measured points are obtained from random variables, H = h(X_0, X_1, . . . , X_{n−1}) is a random variable itself. Estimators are often used to estimate parameters θ of random variables, e.g. moments of distributions. The most fundamental estimators are:

• The mean

  x̄ ≡ (1/n) Σ_{i=0}^{n−1} x_i                                       (7.50)

• The sample variance

  s² ≡ (1/n) Σ_{i=0}^{n−1} (x_i − x̄)²                                (7.51)

The sample standard deviation is s ≡ √(s²). (Sometimes the sample variance is defined as S*² ≡ (1/(n−1)) Σ_{i=0}^{n−1} (x_i − x̄)², to make it an unbiased estimator of the variance.)
GET SOURCE CODE
DIR: randomness
FILE(S): mean.c
As example, next a simple C function is shown, which calculates the mean of n data points. The function obtains the number n of data points and an array containing the data as arguments. It returns the average:

double mean(int n, double *x)
{
  double sum = 0.0;     /* sum of values */
  int i;                /* counter */

  for(i=0; i<n; i++)    /* loop over all data points */
    sum += x[i];
  return(sum/n);
}

Not all estimators are equally well behaved for all distributions. For example, for a power-law (Pareto) distribution with exponent γ ≤ 2, the variance does not exist. Numerically, when calculating s² according to Eq. (7.51), one observes that it will not converge to a finite value when increasing the sample size n. Instead one will observe occasional jumps to higher and higher values. One says the estimator is not robust. To still get an impression of the spread of the data points, one can instead calculate the average deviation

  D ≡ (1/n) Σ_{i=0}^{n−1} |x_i − x̄|                                  (7.55)

In general, an estimator is the less robust, the higher the involved moments are. Even the sample mean may not be robust, for instance for a power-law distribution with γ ≤ 1. In this case one can use the sample median, which is the value x_m such that x_i ≤ x_m for half the sample points, i.e. x_m is the (n+1)/2'th sample point if the points are sorted in ascending order. (If n is even, one can take the average of the (n/2)'th and the (n/2+1)'th sample point.) The sample median is clearly an estimator of the median (see Def. 7.1.2). It is more robust, because it is less influenced by the sample points in the tail. The simplest way to calculate the median is to sort all sample points in ascending order and take the sample point at the middle position, which takes time O(n log n) due to the sorting. Nevertheless, there is an algorithm [Press et al. (1995), Cormen et al. (2001)] which calculates the median even in linear running time O(n).
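Analogously to mean(), a small function for the sample variance Eq. (7.51) might look as follows (a sketch added here for illustration; it is not part of mean.c):

/* Returns the sample variance, Eq. (7.51), of the n data points
   stored in the array 'x'; uses the function mean() shown above.   */
double variance(int n, double *x)
{
  double xbar;             /* sample mean */
  double sum = 0.0;        /* sum of squared deviations */
  int i;                   /* counter */

  xbar = mean(n, x);
  for(i=0; i<n; i++)       /* loop over all data points */
    sum += (x[i]-xbar)*(x[i]-xbar);
  return(sum/n);
}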
In the previous section, we have studied estimators for parameters of a random variable X using a sample obtained from a series of independent random experiments. This is a so-called point estimator, because just one number is estimated.

Since each estimator is itself a random variable, each estimated value will usually be off the true value θ. Consequently, one wants to obtain an impression of how far off the estimate might be from the real value θ. This can be obtained for instance from:

Definition  The mean squared error of a point estimator H = h(X_0, X_1, . . . , X_{n−1}) for a parameter θ is

  MSE(H) ≡ E[(H − θ)²] = E[(H − E[H] + E[H] − θ)²]
         = E[(H − E[H])²] + 2 E[H − E[H]] (E[H] − θ) + (E[H] − θ)²
         = Var[H] + (E[H] − θ)²                                      (7.56)

where the middle term vanishes because E[H − E[H]] = 0. If an estimator is unbiased, i.e., if E[H] = θ, the mean squared error is given by the variance of the estimator. Hence, if for independent samples (each consisting of n sample points) the estimated values are close to each other, the estimate is quite accurate. Unfortunately, usually only one sample is available (how to circumvent this problem rather ingeniously is shown in Sec. 7.3.4). Also, the mean squared error does not immediately provide a probabilistic interpretation of how far the estimate is away from the true value θ.

Nevertheless, one can obtain an estimate of the error in a probabilistic sense. Here we want to calculate a so-called confidence interval, also sometimes named error bar.

Definition  For a parameter θ describing a random variable, two estimators l_α = l_α(x_0, x_1, . . . , x_{n−1}) and u_α = u_α(x_0, x_1, . . . , x_{n−1}) which are obtained from a sample {x_0, x_1, . . . , x_{n−1}} provide a confidence interval if, for given confidence level 1 − α ∈ (0, 1), we have

  P(l_α < θ < u_α) = 1 − α                                           (7.57)

The value α ∈ (0, 1) is called, conversely, significance level. This means, the true but unknown value θ is contained in the interval (l_α, u_α), which is itself a random variable as well, with probability 1 − α. Typical values of the confidence level are 0.68, 0.95 and 0.99 (α = 0.32, 0.05, 0.01, respectively), providing increasing confidence. The more one wants to be sure that the interval really contains the true parameter, i.e. the smaller the value of α, the larger the confidence interval will be.

Next, it is quickly outlined how one arrives at the confidence interval for the mean; for details please consult the specialized literature. First we recall that according to its definition the mean is a sum of independent random variables. For computer simulations, one can assume that usually a sufficiently large number of experiments is performed. (This is different for many empirical experiments, for example when testing new treatments in medical sciences, where often only a very restricted number of experiments can be performed. In this case, one has to consider special distributions, like the Student distribution.) Therefore, the sample mean X̄ should exhibit (approximately) a pdf f_X̄ which is Gaussian with an expectation value µ and some variance σ²_X̄ = σ²/n. This means, the probability α that the sample means fall outside an interval I = [µ − zσ_X̄, µ + zσ_X̄] can be easily obtained from the standard normal distribution. This situation is shown in Fig. 7.11. Note that the interval is symmetric about the mean µ and that its width is stated in multiples z = z(α) of the standard deviation σ_X̄. The relation between significance level α and half interval width z is just ∫_{−z}^{z} dx f_X̄(x) = 1 − α. Hence, the weight of the standard normal distribution outside the interval [−z, z] is α. This relation can be obtained from any table of the standard Gaussian distribution or from the function gsl_cdf_gaussian_P() of the GNU scientific library (see Sec. 6.3). Usually, one considers integer values z = 1, 2, 3, corresponding to α = 0.32, 0.05, and 0.003, respectively.
I (cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid:1)(cid
contains the unknown expectation value µ and the unknown standard deviation σ_X̄. First, one can rewrite

    1 − α = P( µ − z σ_X̄ ≤ X̄ ≤ µ + z σ_X̄ )
          = P( −z σ_X̄ ≤ X̄ − µ ≤ z σ_X̄ )
          = P( −X̄ − z σ_X̄ ≤ −µ ≤ −X̄ + z σ_X̄ )
          = P( X̄ − z σ_X̄ ≤ µ ≤ X̄ + z σ_X̄ ).

This now states the probability that the true value, which is estimated by the sample mean x̄, lies within an interval which is symmetric about the estimate x̄. (If only a very restricted number of measurements can be performed, as is often the case for experiments in the medical sciences, one has to consider special distributions instead, like the Student distribution.)

Figure 7.11: Probability density function of the sample mean X̄ for large enough sample sizes n, where the distribution becomes Gaussian. The true expectation value is denoted by µ and σ_X̄ = √(Var[X̄]) is the standard deviation of the sample mean. The probability that a random number drawn from this distribution falls outside the symmetric interval [µ − z σ_X̄, µ + z σ_X̄] is α.

Note that the width 2 z σ_X̄ is basically given by σ_X̄ = √(Var[X̄]). This explains why the mean squared error MSE(H) = Var[H], as presented in the beginning of this section, is a good measure for the statistical error made by the estimator. This will be used in Sec. 7.3.4. To finish, we estimate the true variance via σ² ≈ n s²/(n − 1), hence we get σ_X̄ = σ/√n ≈ S/√(n − 1). To summarize we get:

    P( X̄ − z S/√(n − 1) ≤ µ ≤ X̄ + z S/√(n − 1) ) ≈ 1 − α        (7.58)

Note that this confidence interval, with l_α = x̄ − z(α) S/√(n − 1) and u_α = x̄ + z(α) S/√(n − 1), is symmetric about x̄, which is not necessarily the case for other confidence intervals. Very often in scientific publications, to state the estimate for µ including the confidence interval, one gives the range where the true mean is located in 68% of all cases (z = 1), i.e. x̄ ± S/√(n − 1); this is called the standard Gaussian error bar or one-σ error bar.
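As a small illustration of Eq. (7.58) with z = 1 (this snippet is not part of the book's example code; all names and the sample values are chosen ad hoc), the mean and its standard Gaussian error bar can be computed as follows:

/* sketch: sample mean and standard Gaussian error bar (z = 1);       */
/* the sample variance here uses the 1/n normalization, as in Eq. (7.58) */
#include <stdio.h>
#include <math.h>

void mean_error_bar(int n, double *x, double *mean, double *error)
{
  double sum = 0.0, sum2 = 0.0;
  int i;

  for(i = 0; i < n; i++)
  {
    sum  += x[i];
    sum2 += x[i]*x[i];
  }
  *mean = sum/n;
  /* s^2 = <x^2> - <x>^2 (simple one-pass formula, fine for a sketch) */
  *error = sqrt((sum2/n - (*mean)*(*mean))/(n - 1));   /* S/sqrt(n-1) */
}

int main()
{
  double x[] = {1.1, 0.7, 1.3, 0.9, 1.0, 1.2};         /* toy data */
  double m, err;

  mean_error_bar(6, x, &m, &err);
  printf("mean = %f +/- %f\n", m, err);
  return 0;
}

For a different confidence level one simply multiplies the error by the corresponding z(α).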
Thus, the sample variance and the sample size determine the error bar/confidence interval.

For the variance, the situation is more complicated, because it is not simply a sum of statistically independent sample points {x_0, x_1, ..., x_{n−1}}. Without going into the details, here only the result from the corresponding statistics literature [Dekking et al (2005), Lefebvre (2006)] is cited: The confidence interval in which the true variance is located with probability 1 − α is given by [σ²_l, σ²_u] where

    σ²_l = n s² / χ²(1 − α/2, n − 1),    σ²_u = n s² / χ²(α/2, n − 1).        (7.59)

Here, χ²(β, ν) is the inverse of the cumulative chi-squared distribution with ν degrees of freedom. It states the value where F(χ², ν) = β, see page 16. This chi-squared function is implemented in the GNU scientific library (see Sec. 6.3) in the function gsl_cdf_chisq_Pinv(). Note that, as one alternative, you could regard y_i ≡ (x_i − x̄)² approximately as independent data points and use the above standard error estimate described for the mean of the sample {y_i}. Also, one can use the bootstrap method as explained below (Sec. 7.3.4), which allows one to calculate confidence intervals for arbitrary estimators.

Sometimes, you do not only want to estimate moments of an underlying distribution, but you want to get an impression of the full distribution. In this case you can use histograms.

Definition: A histogram is given by a set of disjoint intervals

    B_k = [l_k, u_k),        (7.60)

which are called bins, and a counter h_k for each bin. For a given sample of n measured points {x_0, x_1, ..., x_{n−1}}, the counter h_k contains the number of sample points x_i which are contained in B_k.

Example: For a sample of n = 12 points {x_0, x_1, ..., x_11}, the bins [0, 0.5), [0.5, 1.0), [1.0, 1.5), [1.5, 2.0), [2.0, 2.5), [2.5, 3.0) are used, resulting in h_0 = 0, h_1 = 3, h_2 = 5, h_3 = 3, h_4 = 1, h_5 = 0, which is depicted in Fig. 7.12.

Figure 7.12: Histogram for the data shown in Ex. 7.3.3.

In principle, the bins can be chosen arbitrarily. You should take care that the union of all intervals covers all (possible or actual) sample points. Here, it is assumed that the bins are properly chosen. Note also that the width b_k = u_k − l_k of each bin can be different. Nevertheless, often bins with uniform width are used. Furthermore, for many applications, for instance when assigning different weights to different sample points (this occurs for some advanced simulation techniques), it is useful to consider the counters as real-valued variables. A simple (fixed bin-width) C implementation of histograms is described in Sec. 3.2. The GNU scientific library (see Sec. 6.3) contains data structures and functions which implement histograms allowing for variable bin width.

Formally, for a given random variable X, the count h_k in bin k can be seen as a result of a random experiment for the binomial random variable H_k ∼ B(n, p_k) with parameters n and p_k, where p_k = P(X ∈ B_k) is the probability that a random experiment for X results in a value which is contained in bin B_k. This means that confidence intervals for a histogram bin can be obtained in principle from a binomial distribution. Nevertheless, for each sample the true value of p_k is unknown and can only be estimated by q_k ≡ h_k/n. Hence, the true binomial distribution is unknown. On the other hand, a binomial random variable is a sum of n Bernoulli random variables with parameter p_k.
Thus, the estimator q_k is nothing else than a sample mean for a Bernoulli random variable. If the number of sample points n is "large" (see below), from the central limit theorem 7.1.2 and as discussed in Sec. 7.3.2, the distribution of the sample mean (being binomial in fact) is approximately Gaussian. Therefore, one can use the standard confidence interval Eq. (7.58), in this case

    P( q_k − z S/√(n − 1) ≤ p_k ≤ q_k + z S/√(n − 1) ) ≈ 1 − α.        (7.61)

Here, according to Eq. (7.19), the Bernoulli random variable exhibits a sample variance s² = q_k (1 − q_k) = (h_k/n)(1 − h_k/n). Again, z = z(α) denotes the half width of an interval [−z, z] such that the weight of the standard normal distribution outside the interval equals α. Hence, the estimate with standard error bar (z = 1) is q_k ± √( q_k (1 − q_k)/(n − 1) ). Problems arise, however, if a bin contains only very few or even no sample points, i.e. h_k = 0 for some bin B_k. This can happen easily in the regions where p_k is smaller than 1/n but non-zero, i.e. in regions of the histogram which are used to sample the tails of a probability density function. In this case the estimated fraction can easily be q_k = 0, resulting also in a zero-width confidence interval, which is certainly wrong. This means the number of samples n needed to obtain a reliable confidence interval for a bin B_k depends on the number of bin entries. A rule of thumb from the statistics literature is that n q_k (1 − q_k) should be sufficiently large (a common requirement is a value larger than about 9); otherwise, the confidence interval [q_{k,l}, q_{k,u}] for q_k has to be obtained from the binomial distribution directly, and it is quite complicated, since it uses the F distribution (see Def. 7.1.2 on page 16):

    q_{k,l} = h_k / ( h_k + (n − h_k + 1) F_1 ),    q_{k,u} = (h_k + 1) F_2 / ( (h_k + 1) F_2 + n − h_k ),        (7.62)

where F_1 = F(1 − α/2; 2n − 2h_k + 2, 2h_k) and F_2 = F(1 − α/2; 2h_k + 2, 2n − 2h_k). The value F(β; r_1, r_2) states the x value such that the distribution function of the F distribution with numbers of degrees of freedom r_1 and r_2 reaches the value β. This inverse distribution function is implemented in the GNU scientific library (see Sec. 6.3). If you always use these confidence intervals, which are usually not symmetric about q_k, then you cannot go wrong. Nevertheless, for most applications the standard Gaussian error bars are fine.

Finally, in case you want to use a histogram to represent a sample from a continuous random variable, you can easily interpret a histogram as a sample for a probability density function, which can be represented as a set of points {(x̃_k, p(x̃_k))}. This is called the histogram pdf or the sample pdf. For simplicity, it is assumed that the mid points of the intervals are used as x-coordinates. For the normalization, we have to divide by the total number of counts, as for q_k = h_k/n, and to divide by the bin width. This ensures that the integral of the sample pdf, approximated by a sum, gives just unity. Therefore, we get

    x̃_k ≡ (l_k + u_k)/2,    p(x̃_k) ≡ h_k / (n b_k).        (7.63)

The confidence interval, whatever type you choose, has to be normalized in the same way. A function which prints a histogram as pdf, with simple Gaussian error bars, is shown in Sec. 3.2.

For discrete random variables, the histogram can be used to estimate the pmf (the q_k values are then already suitably normalized). In this case the choice of the bins, in particular the bin widths, is easy, since usually all possible outcomes of the random experiments are known. For a histogram pdf, which is used to describe approximately a continuous random variable, the choice of the bin width is important.
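Before turning to the choice of the bin width, here is a small sketch of how bin counters can be turned into such a histogram pdf with simple Gaussian error bars, Eqs. (7.61) and (7.63). This is only an illustration with ad-hoc parameter names, not the function from Sec. 3.2; it assumes uniform bin widths:

#include <stdio.h>
#include <math.h>

/* print histogram as pdf estimate: num_bins bins of width 'width',       */
/* starting at 'low'; h[k] = counters, n = total number of sample points; */
/* error bars from q_k(1-q_k)/(n-1), both normalized by the bin width     */
void print_histogram_pdf(int num_bins, double low, double width,
                         double *h, int n)
{
  int k;
  for(k = 0; k < num_bins; k++)
  {
    double q     = h[k]/n;                          /* estimate for p_k  */
    double x_mid = low + (k + 0.5)*width;           /* bin mid point     */
    double pdf   = q/width;                         /* Eq. (7.63)        */
    double err   = sqrt(q*(1 - q)/(n - 1))/width;   /* Gaussian error bar */
    printf("%f %f %f\n", x_mid, pdf, err);
  }
}

The three printed columns (x, pdf, error) can be fed directly to the plotting tools discussed in Sec. 7.4, e.g. via gnuplot's with yerrorbars style.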
Basically, you have to adjust the width manually, such that the sample data is represented "best". Thus, the bin width should be neither too small nor too large. Sometimes a non-uniform bin width is the best choice. In this case no general advice can be given, except that the bin width should be large where few data points have been sampled. This means that each bin should contain roughly the same number of sample points. Several different rules of thumb exist for uniform bin widths, for example b = 3.49 S n^(−1/3) [Scott (1979)], which comes from minimizing the mean integrated squared difference between a Gaussian pdf and a sample drawn from this Gaussian distribution. Hence, the larger the sample variance, the larger the bin width, while increasing the number of sample points enables the bin width to be reduced.

In any case, you should be aware that the histogram pdf can only be an approximation of the real pdf, due to the finite number of data points and due to the underlying discrete nature resulting from the bins. The latter problem has been addressed in recent years by so-called kernel estimators [Dekking et al (2005)]. Here, each sample point x_i is represented by a so-called kernel function. A kernel function k(x) is a peaked function, formally exhibiting the following properties:

• It has a maximum at 0.
• It falls off to zero over some distance h.
• Its integral ∫ k(x) dx is normalized to one.

Often used kernel functions are, e.g., a triangle, a cut upside-down parabola or a Gaussian function. Each sample point x_i is represented such that a kernel function is shifted to have its maximum at x_i. The estimator p̂(x) for the pdf is the suitably normalized sum (factor 1/n) of all these kernel functions, one for each sample point:

    p̂(x) = (1/n) Σ_i k(x − x_i)        (7.64)

The advantages of these kernel estimators are that they usually result in a smooth function p̂ and that for a value p̂(x) also sample points more distant from x may contribute, with decreasing weight for increasing distance. The most important parameter is the width h, because too small a value of h will result in many distinguishable peaks, one for each sample point, while too large a value of h leads to a loss of important details. This is of similar importance as the choice of the bin width for histograms. The choice of the kernel function (e.g. a triangle, an upside-down parabola or a Gaussian function) seems to be less important.

As pointed out, an estimator for some parameter θ, given by a function h(x_0, x_1, ..., x_{n−1}), is in fact a random variable H = h(X_0, X_1, ..., X_{n−1}). Consequently, to get an impression of how much an estimate differs from the true value of the parameter, one needs in principle to know the distribution of the estimator, e.g. via the pdf p_H or the distribution function F_H. In the previous sections, the distribution was known for few estimators, in particular if the sample size n is large. For instance, the distribution of the sample mean converges to a Gaussian distribution, irrespective of the distribution function F_X describing the sample points {x_i}. For the case of a general estimator H, in particular if F_X is not known, one may not know anything about the distribution of H. In this case one can approximate F_X by the sample distribution function:
Definition: For a sample {x_0, x_1, ..., x_{n−1}}, the sample distribution function (also called empirical distribution function) is

    F_X̂(x) ≡ (number of sample points x_i smaller than or equal to x) / n        (7.65)

Note that this distribution function describes in fact a discrete random variable (called X̂ here), but is usually (though not always) used to approximate a continuous distribution function.

The bootstrap principle is to use F_X̂ instead of F_X. The name of this principle was made popular by B. Efron [Efron (1979), Efron and Tibshirani (1994)] and comes from the fairy tale of Baron Münchhausen, who dragged himself out of a swamp by pulling on the strap of his boot. (In the European version of the tale, he dragged himself out by pulling his hair.) Since the distribution function F_X is replaced by the empirical sample distribution function, the approach is also called empirical bootstrap; for a variant called parametric bootstrap see below.

Now, having F_X̂, one could in principle calculate the distribution function F_Ĥ for the random variable Ĥ = h(X̂_0, X̂_1, ..., X̂_{n−1}) exactly, which then is an approximation of F_H. Usually, this is too cumbersome and one uses a second approximation: One draws so-called bootstrap samples {x̂_0, x̂_1, ..., x̂_{n−1}} from the random variable X̂. This is called resampling. This can be done quite simply by n times selecting (with replacement) one of the data points of the original sample {x_i}, each one with the same probability 1/n. This means that some sample points from {x_i} may appear several times in {x̂_i}, some may not appear at all. (The probability for a given sample point not to be selected at all is (1 − 1/n)^n = exp(n log(1 − 1/n)) → exp(−1) ≈ 0.367 for n → ∞.) Now, one can calculate the estimator value h* = h(x̂_0, x̂_1, ..., x̂_{n−1}) for each bootstrap sample. This is repeated K times for different bootstrap samples, resulting in K values h*_k (k = 1, ..., K) of the estimator. The sample distribution function F_{H*} of this sample {h*_k} is the final result, which is an approximation of the desired distribution function F_H. Note that the second approximation, replacing F_Ĥ by F_{H*}, can be made arbitrarily accurate by making K as large as desired, which is computationally cheap.

You may ask: Does this work at all, i.e., is F_{H*} a good approximation of F_H? For the general case, there is no answer. But for some cases there are mathematical proofs. For example, for the mean H = X̄ the distribution function F_{X̄*} in fact converges to F_X̄. Here, only the subtlety arises that one has to consider in fact the normalized distributions of X̄ − µ (µ = E[X]) and of the bootstrap sample mean minus x̄ (x̄ = (1/n) Σ_{i=0}^{n−1} x_i). Thus, the random variables are just shifted by constant values. For other cases, like for estimating the median or the variance, one has to normalize in a different way, i.e., by subtracting the (empirical) median or by dividing by the (empirical) variance. Nevertheless, for the characteristics of F_H we are interested in, in particular the variance, see below, normalizations like shifting and stretching are not relevant, hence they are ignored in the following. Note that indeed some estimators exist, like the maximum of a distribution, for which one can prove conversely that F_{H*} does not converge to F_H, even after
some normalization. On the other hand, for the purpose of getting a not too bad estimate of the error bar, for example, bootstrapping is a very convenient and suitable approach which has received high acceptance during recent years.

Now one can use F_{H*} to calculate any desired quantity. Most important is the case of a confidence interval [h_l, h_u] such that the total probability outside the interval is α, for a given significance level α, i.e. F_{H*}(h_u) − F_{H*}(h_l) = 1 − α. In particular, one can distribute the weight α equally below and above the interval, which allows one to determine h_l, h_u via

    F_{H*}(h_l) = α/2,    1 − F_{H*}(h_u) = α/2.        (7.66)

Similar to the confidence intervals presented in Sec. 7.3.2, [h_l, h_u] also represents a confidence interval for the unknown parameter θ which is to be estimated from the estimator (if it is unbiased). Note that [h_l, h_u] can be non-symmetric about the actual estimate h(x_0, x_1, ..., x_{n−1}). This will happen if the distribution F_{H*} is skewed.

For simplicity, as we have seen in Sec. 7.3.2, one can use the variance Var[H] to describe the statistical uncertainty of the estimator. As mentioned on page 32, this corresponds basically to an α = 0.32 uncertainty.

GET SOURCE CODE (DIR: randomness, FILE(S): bootstrap.c, bootstrap_test.c)

The following C function calculates Var[H*], as approximation of the unknown Var[H]. One has to pass as arguments the number n of sample points, an array containing the sample points, the number K of bootstrap iterations, and a pointer to the function f which represents the estimator. f has to take two arguments: the number of sample points and an array containing a sample. For an explanation of function pointers, see Sec. 1.4. The function bootstrap_variance() returns Var[H*].

double bootstrap_variance(int n, double *x, int n_resample,
                          double (*f)(int, double *))
{
  double *xb;           /* bootstrap sample */
  double *h;            /* results from resampling */
  int sample, i;        /* loop counters */
  int k;                /* sample point id */
  double mean, var;     /* result to be returned */

  h = (double *) malloc(n_resample * sizeof(double));
  xb = (double *) malloc(n * sizeof(double));
  for(sample=0; sample<n_resample; sample++)
  {
    for(i=0; i<n; i++)                  /* draw n points with replacement */
    {
      k = (int) (n*drand48());          /* random sample point id */
      xb[i] = x[k];
    }
    h[sample] = f(n, xb);               /* estimator for this bootstrap sample */
  }
  mean = 0.0; var = 0.0;                /* variance of the n_resample values */
  for(sample=0; sample<n_resample; sample++)
    mean += h[sample]/n_resample;
  for(sample=0; sample<n_resample; sample++)
    var += (h[sample]-mean)*(h[sample]-mean)/(n_resample-1);
  free(h);
  free(xb);
  return(var);
}

For calculating properties of the sample mean, the bootstrap approach works fine, but in this case one could also be satisfied with the standard Gaussian confidence interval. The bootstrap approach is more interesting for non-standard estimators. One prominent example from the field of statistical physics is the so-called Binder cumulant [Binder (1981)], which is given by:

    b(x_0, x_1, ..., x_{n−1}) = 0.5 ( 3 − ⟨x⁴⟩ / ⟨x²⟩² )        (7.67)

GET SOURCE CODE (DIR: randomness, FILE(S): binder_L8.dat, binder_L10.dat, binder_L16.dat, binder_L30.dat)

where ⟨...⟩ denotes again the sample mean, for example ⟨x²⟩ = (1/n) Σ_{i=0}^{n−1} x_i². The Binder cumulant is often used to determine phase transitions via simulations, where only systems consisting of a finite number of particles can be studied. For example, consider a ferromagnetic system held at some temperature T. At low temperature, below the Curie temperature T_c, the system will exhibit a macroscopic magnetization m. On the other hand, for temperatures above T_c, m will on average converge to zero when increasing the system size. This transition is fuzzy, if the system sizes are small.
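As an aside, an estimator function implementing Eq. (7.67), suitable for being passed to bootstrap_variance() via its function-pointer argument, might look as follows. This is an ad-hoc sketch, not code shipped with the book:

/* Binder cumulant Eq. (7.67) as estimator function, to be passed        */
/* to bootstrap_variance() via the function pointer argument             */
double binder_cumulant(int n, double *x)
{
  double m2 = 0.0, m4 = 0.0;          /* sample means of x^2 and x^4 */
  int i;

  for(i = 0; i < n; i++)
  {
    m2 += x[i]*x[i];
    m4 += x[i]*x[i]*x[i]*x[i];
  }
  m2 /= n;
  m4 /= n;
  return(0.5*(3.0 - m4/(m2*m2)));
}

/* usage, for a magnetization sample m[0..n-1] (hypothetical array name): */
/* err = sqrt(bootstrap_variance(n, m, 1000, binder_cumulant));           */

The square root of the value returned by bootstrap_variance() then serves as error bar of the cumulant.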
Nevertheless, when evaluating the Binder cumulant for different sets of sample points {m(T, L)_i}, which are obtained at several temperatures T and for different system sizes L, the b_L(T) curves for different L will all cross [Landau and Binder (2000)] (almost) at T_c, which allows for a very precise determination of T_c. A sample result for a two-dimensional (i.e. layered) model ferromagnet consisting of L × L particles is shown in Fig. 7.13. The Binder cumulant has been useful for the simulation of many other systems like disordered materials, gases, optimization problems, liquids, and graphs describing social systems.

Figure 7.13: Plot of the Binder cumulant of a two-dimensional model ferromagnet as function of the temperature T (dimensionless units), for system sizes L = 8, 10, 16, 30. Each system consists of L × L particles. The curves for different system sizes L cross very close to the phase transition temperature T_c = 2.269 (known from analytical calculations of this model). The error bars shown can be obtained using a bootstrap approach.

A confidence interval for the Binder cumulant is very difficult (or even impossible) to obtain using standard error analysis. Using bootstrapping, it is straightforward: You can simply use the function bootstrap_variance() shown above while providing as argument a function which evaluates the Binder cumulant for a given set of data points.

So far, it was assumed that the empirical distribution function F_X̂ was used to determine an approximation of F_H. Alternatively, one can use some additional knowledge which might be available. Or one can make additional assumptions, using a distribution function F_λ which is parametrized by a vector of parameters λ. For an exponential distribution, the vector would just consist of one parameter, the expectation value, while for a Gaussian distribution, λ would consist of the expectation value and the variance. In principle, arbitrarily complex distributions with many parameters are possible. To make F_λ a "good" approximation of F_X, one has to adjust the parameters such that the distribution function represents the sample {x_i} "best", resulting in a vector λ̂ of parameters. Methods and tools to perform this fitting of parameters are presented in Sec. 7.6.2. Using F_λ̂ one can proceed as above: Either one calculates F_Ĥ exactly based on F_λ̂, which is most of the time too cumbersome. Instead, usually one performs simulations where one repeatedly draws samples {x̂_0, x̂_1, ..., x̂_{n−1}} by simulating F_λ̂ and calculates each time the estimator h* = h(x̂_0, x̂_1, ..., x̂_{n−1}). This results, as in the case of the empirical bootstrap, in a sample distribution function F_{H*} which is further analyzed. This approach, where F_λ is used instead of F_X̂, is called parametric bootstrap.

Note that the bootstrap approach does not require that the sample points are statistically independent of each other. For instance, the sample could be generated using a Markov chain Monte Carlo simulation [Newman and Barkema (1999), Landau and Binder (2000), Robert and Casella (2004), Liu (2008)], where each data point x_{i+1} is calculated using some random process, but also depends on the previous data point x_i. More details on how to quantify correlations are given in Sec. 7.5. Nevertheless, if the sample follows the distribution F_X, everything is fine when using bootstrapping and, for example, a confidence interval will not depend on the fraction of "independent" data points.
One can see this easily by assuming that you replace each data point in the original sample {x_i} by ten copies, hence making the sample ten times larger without adding any information. This will not affect any of the following bootstrap calculations, since the size of the sample does not enter explicitly. The fact that bootstrapping is not susceptible to correlations between data points is in contrast to the classical calculation of confidence intervals explained in Sec. 7.3.2, where independence of data is assumed and the number of independent data points enters formulas like Eq. (7.58). Hence, when calculating the error bar according to Eq. (7.58) using the ten-copy sample, it will be incorrectly smaller by a factor √10, since no additional information is available compared to the original sample.

It should be mentioned that bootstrapping is only one of several resampling techniques. Another well known approach is the jackknife approach, where one does not sample randomly using F_X̂ or a fitted F_λ. Instead the sample {x_i} is divided into B blocks of equal size n_b = n/B (assuming that n is a multiple of B). Note that choosing B = n is possible and not uncommon. Next, a number B of so-called jackknife samples b = 1, ..., B are formed from the original sample {x_i} by omitting exactly the sample points from the b'th block and including all other points of the original sample. Therefore, each of these jackknife samples consists of n − n_b sample points. For each jackknife sample, again the estimator is calculated, resulting in a sample {h_k^(j)} of size B. Note that the sample distribution function F^(j) of this sample is not an approximation of the estimator distribution function F_H! Nevertheless, it is useful. For instance, the variance Var[H] can be estimated from (B − 1) S²_h, where S²_h is the sample variance of {h_k^(j)}. No proof of this is presented here. It is just noted that when increasing the number B of blocks, i.e., making the different jackknife samples more alike, because fewer points are excluded, the sample of estimator values {h_k^(j)} will fluctuate less. Consequently, this dependence on the number of blocks is exactly compensated via the factor (B − 1). More details on the jackknife and other resampling approaches can be found in [Efron and Tibshirani (1994)].

Finally, you should be aware that there are cases where resampling approaches clearly fail. The most obvious example is the calculation of confidence intervals for histograms, see Sec. 7.3.3. A bin which exhibits no sample points, for example where the probability is very small, will never get a sample point during resampling either. Hence, the error bar will be of zero width. This is in contrast to the confidence interval shown in Eq. (7.62), where also bins with zero entries exhibit a confidence interval of finite width. Consequently, you have to think carefully before deciding which approach you will use to determine the reliability of your results.

So far, you have learned many methods for analyzing data. Since you do not just want to look at tables filled with numbers, you should visualize the data in viewgraphs. Those viewgraphs which contain the essential results of your work can be used in presentations or publications. To analyze and plot data, several commercial and non-commercial programs are available. Here, two free programs are discussed, gnuplot and xmgrace.
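Before doing so, and to close the discussion of resampling, the jackknife variance estimate described above can be sketched in a few lines of C. Again this is only an illustration with ad-hoc names, not code from the book; it assumes that n is a multiple of the number of blocks:

#include <stdlib.h>

/* jackknife estimate of Var[H]: the estimator f is passed as for    */
/* bootstrap_variance(); num_blocks = B, block size n_b = n/B        */
double jackknife_variance(int n, double *x, int num_blocks,
                          double (*f)(int, double *))
{
  int n_b = n/num_blocks;
  double *xj = (double *) malloc((n - n_b) * sizeof(double));
  double *h  = (double *) malloc(num_blocks * sizeof(double));
  double mean = 0.0, var = 0.0;
  int b, i, j;

  for(b = 0; b < num_blocks; b++)          /* build b'th jackknife sample */
  {
    j = 0;
    for(i = 0; i < n; i++)
      if( (i < b*n_b) || (i >= (b+1)*n_b) )    /* omit points of block b */
        xj[j++] = x[i];
    h[b] = f(n - n_b, xj);
    mean += h[b]/num_blocks;
  }
  for(b = 0; b < num_blocks; b++)          /* (B-1) times sample variance */
    var += (h[b]-mean)*(h[b]-mean)*(num_blocks-1)/num_blocks;
  free(xj);
  free(h);
  return(var);
}

For num_blocks = n this reduces to the common leave-one-out jackknife.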
Gnuplot is small, fast, allows two- and three-dimensional curves to be generated and transformed, as well as arbitrary functions to be fitted to the data (see Sec. 7.6.2). On the other hand, xmgrace is more flexible and produces better output. It is recommended to use gnuplot for viewing and fitting data online, while xmgrace is to be preferred for producing figures to be shown in presentations or publications.

gnuplot

The program gnuplot is invoked by entering gnuplot in a shell, or from a menu of the graphical user interface of your operating system. For a complete manual see [Texinfo]. As always, our examples refer to a UNIX window system like X11, but the program is available for almost all operating systems. After startup, in the window of your shell or the window which pops up for gnuplot, the prompt (e.g. gnuplot>) appears and the user can enter commands in textual form; results are shown in additional windows or are written into files. For a general introduction you can just type help. Before giving an example, it should be pointed out that gnuplot scripts can be generated by simply writing the commands into a file, e.g. command.gp, and calling gnuplot command.gp.

GET SOURCE CODE (DIR: randomness, FILE(S): sg_e0_L.dat)

The typical case is that you have available a data file of x-y data or of x-y-dy data (where dy is the error bar of the y data points). Here, the file sg_e0_L.dat is used, where the "energy" e of a system is stored as a function of the "system size" L. (The data is the ground-state energy of a three-dimensional ±J spin glass, a prototypical system in statistical physics. These spin glasses model the magnetic behavior of alloys like iron-gold.) The first column contains the L values, the second the energy values and the third the standard error of the energy. Please note that lines starting with "#" are comment lines which are ignored on reading.

To plot the data enter

gnuplot> plot "sg_e0_L.dat" with yerrorbars

which can be abbreviated as p "sg_e0_L.dat" w e. Please do not forget the quotation marks around the file name. Next, a window pops up, showing the result, see Fig. 7.14.

Figure 7.14: Gnuplot window showing the result of a plot command.

For the plot command many options and styles are available, e.g. with lines produces lines instead of symbols. It is not explained here how to set line styles or symbol sizes and colors, because this is usually not necessary for a quick look at the data. For "nice" plots used for presentations, we recommend xmgrace, see next section. Anyway, help plot will tell you all you have to know about the plot command.

Among the important options of the plot command is that one can specify ranges. This can be done by specifying the range directly after the command, e.g.

gnuplot> plot [7:20] "sg_e0_L.dat" with yerrorbars

will only show the data for x ∈ [7, 20]. Additionally, the y range can be specified, like in

plot [7:20][-1.79:-1.77] "sg_e0_L.dat" with yerrorbars

If you just want to set the y range, you have to specify [ ] for the x-range. You can also fix the ranges via the set xrange and the set yrange commands, such that you do not have to give them each time with the plot command; see help set xrange or help unset xrange for unsetting a range.

Gnuplot knows a lot of built-in functions like sin(x), log(x), powers, roots, Bessel functions, the error function (erf(x) = (2/√π) ∫₀ˣ dx′ exp(−x′²)), and many more. For a complete list type help functions. These functions can also be plotted.
Furthermore, using these functions and standard arithmetic expressions, you can also define your own functions, e.g. you can define a function ft(x) for the Fisher-Tippett pdf (see Eq. (7.43)) for parameter λ (called lambda here) and show the function via

gnuplot> ft(x)=lambda*exp(-lambda*x)*exp(-exp(-lambda*x))
gnuplot> lambda=1.0
gnuplot> plot ft(x)

You can also include arithmetic expressions in the plot command. To plot a shifted and scaled Fisher-Tippett pdf you can type:

gnuplot> plot [0:20] 0.5*ft(0.5*(x-5))

The Fisher-Tippett pdf has a tail which drops off exponentially. This can be seen better with a logarithmic scaling of the y axis:

gnuplot> set logscale y
gnuplot> plot [0:20] 0.5*ft(0.5*(x-5))

will produce the plot shown in Fig. 7.15.

Figure 7.15: Gnuplot window showing the result of plotting a shifted and rescaled Fisher-Tippett pdf with logarithmically scaled y-axis.

Furthermore, it is also possible to plot several functions in one plot, by separating them with commas, e.g. to compare a Fisher-Tippett pdf to the standard Gaussian pdf; here the predefined constant pi is used:

gnuplot> plot ft(x), exp(-x*x/2)/sqrt(2*pi)

It is possible to read files with multiple columns via the using data modifier, e.g.

gnuplot> plot "test.dat" using 1:4:5 w e

displays the fourth column as a function of the first, with error bars given by the 5th column. The elements behind the using are called entries. Within the using data modifier you can also perform transformations and calculations. Each entry where some calculation should be performed has to be embraced in ( ) brackets. Inside the brackets you can refer to the different columns of the input file via $1 for the first column, $2 for the second, etc. You can generate arbitrary expressions inside the brackets, i.e. use data from different columns (also combining several columns in one entry), operators, variables, predefined and self-defined functions and so on. For example, in Sec. 7.6.2, you will see that the data from the sg_e0_L.dat file follows approximately a power law behavior e(L) = e∞ + aL^b with e∞ ≈ −1.788, a ≈ 2.54 and b ≈ −2.8. To visualize this, we want to show e(L) − e∞ as a function of L^b. This is accomplished via:

gnuplot> einf=-1.788
gnuplot> b=-2.8
gnuplot> plot "sg_e0_L.dat" u ($1**b):($2-einf)

Now the gnuplot window will show the data as a straight line (not shown, but see Fig. 7.23).

So far, all output has appeared on the screen. It is possible to redirect the output, for example, to an encapsulated postscript file (by setting set terminal postscript and redirecting the output via set output "test.eps"). When you now enter a plot command, the corresponding postscript file will be generated.

Note that not only several functions but also several data files, or a mixture of both, can be combined into one figure. To remember what a plot exported to a file means, you can set axis labels of the figure by typing e.g. set xlabel "L", which becomes active when the next plot command is executed. Also you can use set title or place arbitrary labels via set label. Use the help command to find out more.

Also three-dimensional plotting (in fact a projection into two dimensions) is possible using the splot command (enter help splot to obtain more information).
Here, as an example, we plot a two-dimensional Gaussian distribution:

gnuplot> x0=3.0
gnuplot> y0=-1.0
gnuplot> sx=1.0
gnuplot> sy=5.0
gnuplot> gauss2d(x,y)=exp(-(x-x0)**2/(2*sx)-(y-y0)**2/(2*sy))\
>        /sqrt(4*pi**2*sx**2*sy**2)
gnuplot> set xlabel "x"
gnuplot> set ylabel "y"
gnuplot> splot [x0-2:x0+2][y0-4:y0+4] gauss2d(x,y) with points

Note that the long line containing the definition of the (two-argument) function gauss2d() is split up into two lines using a backslash at the end of the first line. Furthermore, some of the variables are used inside the interval specifications at the beginning of the splot command. Clearly, you can also plot data files with three-dimensional data. The resulting plot appearing in the output window is shown in Fig. 7.16. You can drag the mouse inside the window showing the plot, which will alter the view.

Figure 7.16: Gnuplot window showing the result of plotting a two-dimensional function using splot.

Finally, to stop the execution of gnuplot, enter the command exit. These examples should already give you a good impression of what can be done with gnuplot. More can be found in the documentation or the online help. How to fit functions to data using gnuplot is explained in Sec. 7.6.2. It is also possible to make, with some effort, publication-suitable plots, but it is simpler to achieve this with xmgrace, which is presented in the following section.

xmgrace

The xmgrace (X Motif GRaphing, Advanced Computation and Exploration of data) program is much more powerful than gnuplot and produces nicer output; commands are issued by clicking on menus and buttons and it offers WYSIWYG. The xmgrace program offers almost every feature you can imagine for two-dimensional data plots, including multiple plots (insets), fits, fast Fourier transform, and interpolation. The look of the plots may be altered in any way you can imagine, like choosing fonts, sizes, colors, symbols, styles for lines, bar charts etc. Also, you can create manifold types of labels/legends and it is possible to add elements like texts, labels, lines or other geometrical objects to the plot. The plots can be exported to various formats, in particular encapsulated postscript (.eps). Advanced users can also program it or use it for real-time visualization of simulations. On the other hand, its handling is a little bit slower compared to gnuplot and the program has the tendency to fill your screen with windows. For full information, please consult the online help, the manual or the program's web page [xmgrace].

Here, just the main steps to produce a simple but nice plot are shown and some further directions are mentioned. You will be given here the most important steps to create a plot similar to the first example shown for the gnuplot program, but ready for publication. First you have to start the program by typing xmgrace into a shell (or start it from some window/operating system menu). Then you choose the Data menu, next the Import sub menu and finally the ASCII.. sub sub menu. (The underlined character in a menu name refers to the key one has to hit together with the Alt key, if one wants to open the menu via key strokes.) Then a "Grace:Read Set" window will pop up (see Fig. 7.17) and you can choose the data file to be loaded (here sg_e0_L.dat) [A], the type of the input file (Single Set) [B], and the format of the data (XYDY) [C]. This means you have three columns, and the third one is an error bar for the second. Then you can hit on the OK button [E]. The data will be loaded and
Hartmann: Practical Guide to Computer Simulations [E][B] [F][C][A][D] Figure 7.17: The Grace:Read Set window of the xmgrace program. Amongothers, you can select a file [A] , choose the type of the input file [B] , choose theformat of the data [C] , what axes should be rescaled automatically on input [D] .You can actually load the data by hitting on the OK button [E] and closing thewindow by hitting on the Cancel button [F] .shown in the main window (see Fig. 7.18). The axis ranges have been adjustedto the data, because the “ Autoscale on read ” is set by default to “ XY ” [D] . Youcan quickly change the part of the data shown by the buttons (magnifier, AS , Z , z , ← , → , ↓ , ↑ ) on the left of the main window just below the Draw button.Note that another important input file type is “ Block data ” where the filesconsist of many columns of which you only want to show some. When you hitth OK button [E] , another window ( Grace:Edit block data ) will pop up, whereyou have to select the columns which you actually want to display. For the dataformat (also when loading block data), some other important choices are XY (no error bars) and XYDYDY (full confidence interval, maybe non-symmetric).Finally, you can close the file selection window, by hitting on the Cancel button[F] . The OK and Cancel buttons are common to all xmgrace windows and willnot be mentioned explicitly in the following.In the form the loaded data is shown by default, it is not suitable for publi- .4. DATA PLOTTING xmgrace window after the data set has been loaded (withauto scale).cation purposes, because the symbols, lines and fonts are usually too small/ toothin. To adjust many details of your graph, you should go to the Plot menu.First, you choose the Plot appearance... sub menu. A corresponding windowwill pop up. Here, you should just unselect the “ Fill ” toggle box (upper rightcorner), because otherwise the bounding box included in the .eps file will notmatch the plot and your figure will overwrite other parts of e.g. your manuscript.The fact that your plot has no background now becomes visible through the ap-pearance of some small dots in the main xmgrace window, but this does notdisrupt the output when exporting to .eps .Next, you choose the Set appearance... sub menu from the Plot menu. Thecorresponding window will pop up, see Fig. 7.19. You can pop this window alsoby double-clicking inside the graph. This window allows to change the actualdisplay style of the data. You have to select the data set or sets [A] to whichthe changes will be be applied to when hitting the Apply button at the lowerleft of the window. Note that the list of sets in this box will contain severalsets if you have imported more than one data set. Each of them can have (andusually should) its own style. The box where the list of sets appears is also usedto administrate the sets. If you hit the right mouse button, while the mousepointer is inside this box, a menu will pop up, where you can for instance copyor delete sets, hide or unhide them, or rearrange them.0 A.K. Hartmann: Practical Guide to Computer Simulations [D][F][C][B][A] [E] Figure 7.19: The Grace:Set Appearance window of the xmgrace program. Firstyou have to select the set or sets which should be addressed by the changes [A] . Due to the large amount of adjustable parameters, the window is organizedinto different tabs. 
The most import one is “ Main ” [B] , which is shown here.Among others, you can select a symbol type [C] (below: symbol size, symbolcolor), choose the width of the lines [D] (also: line type, style) and the color [E] .Furthermore, the label for this data appearing in the legends can be states [F] .The options in this window are arranged within different tabs, the mostimportant is the “ Main ” tab [B] . Here you can choose whether you want toshow symbols for your data points and which type [C] , also the symbol sizesand colors. If you want to show lines as well ( Line properties area at the right),you can choose the style like “ straight ” and others, but also “ none ” is no linesshould be displayed. The style can be full, dotted, dashed, and different dotted-dashed styles. For presentations and publications it is important that lines arewell visible, in this example a line width of 2 is chosen [D] and a black color [E] . For presentations you can distinguish different data sets also by differentcolors, but for publications in scientific journals you should keep in mind thatthe figures are usually printed in black and white, hence light colors are not .4. DATA PLOTTING Each data set can have a legend (see below how to activate it). Here, thelegend string can be stated. You can enter it directly, with the help of someformatting commands which are characters preceded by a backslash \ . Themost important ones are • \\ prints a backslash. • \0 selects the Roman font, which is also the default font. A font is activeuntil a new one is chosen. • \1 selects the italic font, used in equations. • \x selects a symbol font, which contains e.g. Greek characters. For ex-ample \xabchqL will generate αβχηθ Λ, just to mention some importantsymbols. • \s generates a subscript, while \N switches back to normal. For example \xb\s2\N\1(x) generates β ( x ). • \S generates a superscript, for instance \1A\S3x\N-5 generates A x − • The font size can be changed with \+ and \- . • With \o and \O one can start and stop overlining, respectively, for instance \1A\oBC\OD generates ABCD . Underlining can be controlled via \u and \U .By default, error bars are shown (toggle box lower right corner). At leastyou should increase the line width for the symbols ( Symbols tab) and increasethe base and rise line widths for error bars ( Error bars tab).You should know that, when you are creating another plot, you do not haveto redo all these and other adjustments of styles. Once you have found yourstandard, you can save it using the Save Parameters... sub menu from the Plot menu. You can conversely load a parameter set via the Load Parameters... submenu of the same menu.Next, you can adjust the properties of the axes, by choosing the Set appear-ance... sub menu from the Plot menu or by double-clicking on an axis. Thecorresponding window will pop up, see Fig. 7.20. You have to select the axiswhere the current changes apply to [A] . For the x axis you should set the rangein the fields Start [B] and Stop [C] , here to the values 1 and 15. Below thesetwo fields you find the important Scale field, where you can choose linear scaling(default), logarithmic or reciprocal, to mention the important ones.The most important adjustments you can perform within the Main tab [D] .Here you enter the label shown below the axis in the Label string field [E] . Theformat of the string is the same as for the data set legends. 
Here you enter just Acting as referee reading scientific papers submitted to journals, I experienced many timesthat I could not recognize or distinguish some data because they were obviously printed in alight color, or with a thin line width, or with tiny symbols . . . . A.K. Hartmann: Practical Guide to Computer Simulations [B] [G][C][A][D][F][E] Figure 7.20: The Grace: Axes window of the xmgrace program. First you haveto select the axis which should be addressed by the changes [A] . Among others,you can change the range in the Start [B] and Stop [C] fields. Here the Main tab [D] is shown. You can enter an axis label in the Label string field [E] andselect the spacing of the major and minor ticks [F,G] \1L , which will show as L . The major spacing of the major (with labels) andminor ticks can be chosen in the corresponding fields [F,G] . Below there is a Format field , where you can choose how the tick labels are printed. Among themany formats, the most common are General (1 , , , . . . ), Exponential ( , , ,. . . ), and Power , which is useful for logarithmic scaled axes(10 , , , . . . ). For the tick labels, you can also choose a Precision . This andother fields of this tab you can leave at their standard values here. Nevertheless,you should also adjust the Char size of the axis labels (tab Axis label & bar ) andof the tick labels (tab Tick labels ). For publications, character sizes above 150%are usually well readable. Note that in the Axis label & bar tab, there is afield Axis transform where you can enter formulas to transform the axis more .4. DATA PLOTTING Special tab is useful, where you can enter all major and minorticks individually.To finish the design of the axes, you can perform similar changes to the y axis, with Start field − . Stop field − . Label string field \1E\s0\N(L) andthe same character sizes as for the x axis for axis labels and tick labels in thecorresponding tabs. Note that the axis label will be printed vertically. If you donot like this, you can choose the Perpendicular to axis orientation in the Layout field of the Axis label & bar tab.Now you have already a nice graph. To show you some more of the capa-bilities of xmgrace , we refine it a bit. Next, you generate an inset, i.e. a smallsubgraph inside the main graph. This is quite common in scientific publica-tions. For this purpose, you select the underlineEdit menu and there the Arrangegraph... sub menu. The corresponding window appears. We want to have justone inset, i.e. in total 2 graphs. For this purpose, you select in the Matrix regionof the window the Cols: field to 1 and the Rows: field to 2. Then you hit on the Accept button which applies the changes and closes the window. You now havetwo graphs, one containing the already loaded data, the other one being empty.These two graphs are currently shown next to each other, one at the top andone at the bottom.To make the second graph an inset of the first, you choose the Graph appear-ance... sub menu from the Plot menu. At the top a list of the available graphs isshown [A] . Here you select the first graph G0 . You need only the Main tab [B] ,other tabs are for changing styles of titles, frames and legends. We recommendto choose Width Frame tab. In the Main tab, you can choose the Type ofgraph [C] , e.g. XY graph , which we use here (default), Polar graph or Pie chart .You only have to change the Viewport coordinates [D] here. These coordinatesare relative coordinates, i.e. 
the standard full viewport including axes, labelsand titles is [0 , × [0 , G0 , you choose Xmin and Ymin Xmax and Ymax Displaylegend [E] , where you can control whether a legend is displayed. If you want tohave a legend, you can control its position in the Leg. box tab. Now the differentgraphs overlap. This does not bother you, because next you select graph G1 inthe list at the top of the window. We want to have the inset in the free area ofthe plot, in the upper right region. Thus, you enter the viewport coordinates Xmin Ymin Xmax Ymax Read to graph G1 in the Grace: Readsets window. In Sec. 7.6.2, you will see that the data follows approximatelya power law behavior e ( L ) = e ∞ + aL b with e ∞ ≈ − . a ≈ . 54 and b ≈ − . 8. To visualize this, we want to show e ( L ) − e ∞ as a function of L b . Hence, we want to transform the data. You choose from the Data menu the Transformations sub menu and there the Evaluate expression sub sub menu. Notethat here you can also find many other transformations, e.g. Fourier transform,4 A.K. Hartmann: Practical Guide to Computer Simulations [B][A][C][D][E] Figure 7.21: The Grace: Graph Appearance window of the xmgrace program. Atthe top one can select to which graph changes should apply [A] . The windowis divided into different tabs [B] , here the Main tab is shown. The Type of thegraph can be selected [C] , also Title and Subtitle (empty here). The extensionsof the graph can be selected in the Viewport area [D] . This allows to make onegraph an inset of another. Using the Display legend toggle [E] the legend can beswitched on and off.interpolation and curve fitting. Please consult the manual for details. In thiscase, the evaluateExpression window pops up, see Fig. 7.22 (if you did not closethe windows you have used before, your screen will be already pretty populated).A transformation always takes the data points from one source set, applies aformula to all data points (or to a subset o points) and stores the result in a destination set. These sets can be selected at the top of the window in the Source [A] and Destination [B] fields for graph and set separately. Note that thedata in the destination set is overwritten. If you want to write the transformeddata to a new set, you can first copy an existing set (click on the right mousebutton in the Destination Set window and choose Duplicate ). In our case, wewant to replace the data, hence you select for source and destination the dataset from graph G1 . The transformation is entered below [C] , here you first enter y=y+1.788 to shift the data. The you hit the Apply button at the bottom. Nextyou change the transformation to x=x^(-2.8) and hit the Apply button again. .4. DATA PLOTTING [C][A] [B] Figure 7.22: The evaluateExpression window of the xmgrace program. At thetop you can select Source [A] and Destination [B] sets of the transformation. Theactual transformation is entered at the bottom [C] .When you now select the second graph by clicking into it, and hit the AS (autoscale) button on the left of the main window, you will see that the data pointsfollow a nice straight line in the inset, which confirms the behavior of the data.Again you should select symbols, line stiles, and axis labels for the inset.Usually smaller font sizes are used here. 
Note that all operations always apply tothe current graph, which can be selected for example by clicking near the cornersof the boundary boxes of the graph (which does not always work, depending onwhich other windows are open) or by double clicking on the corresponding graphin the graph list in the Grace: Graph Appearance window. The final main windowis shown in Fig. 7.23. Note that the left axis label is not fully visible. This isno problem when exporting the file as encapsulated postscript; everything willbe shown. But if you do not like it, you can adjust the Xmin value of graph G0 .Finally, if you choose the menu Window and the sub menu Drawing objects a window will pop up which enable many graphical elements like texts, lines,boxes and ellipses (again with a variety of choices for colors, styles, sizes etc.)tobe added/changed and deleted in your plot. We do not go into details here.Now you should save your plot using the File menu and the Save as... sub menu, e.g. with filename sg_e0_L.agr , where .agr is the typicalpostfix of xmgrace source files. When GET SOURCE CODE DIR: randomness FILE(S): sg e0 L.dat A.K. Hartmann: Practical Guide to Computer Simulations Figure 7.23: The main xmgrace window after all adjustments have been made.you want to create another plot with similar layout later, it is convenient tostart from this saved file by copying it to a new file and subsequently usingagain xmgrace to modify the new file.To export your file as encapsulated postscript, suitable for including it intopresentations or publications (see Sec. 8.3), you have to choose the File menuand the Print setup... sub menu. In the window, which pops up, you can selectthe Device EPS . The file name will automatically switch to sg_e0_L.eps (thisseems not to work always, in particular if you edit several files, one after theother, please check the file names always). Having hit on the Accept button,you can select the File menu and the Print sub menu, which will generate thedesired output file. Now you have a solid base for viewing and plotting, hence we can continuewith advanced analysis techniques. You can experiment with plotting using xmgrace in exercise (5). Using the tool epstopdf you can convert the postcript file also to a pdf file. With othertools like convert or gimp you can convert to many other styles. .5. HYPOTHESIS TESTING AND (IN-)DEPENDENCE OF DATA In the previous section, you have learned how to visualized data, mainly dataresulting from the basic analysis methods presented in Sec. 7.3. In this sec-tion, we proceed with more elaborate analysis methods. One important way toanalyze data of simulations is to test hypotheses concerning the results. Thehypothesis to be tested is usually called null hypothesis H . Examples for nullhypotheses are:(A) In a traffic system, opening a new track will decrease the mean value of thetravel time t A → B for a connection A → B below a target threshold t target .(B) Within an acquaintance network, a change of the rules describing howpeople meet will change the distribution of the number of people eachperson knows.(C) The distribution of ground-states energies in disordered magnets followsa Fisher-Tippett distribution.(D) Within a model of an ecological system, the population size of foxes isdependent on the population size of beetles.(E) For a protein dissolved in water at room temperature, adding a certainsalt to the water changes the structure of the protein.One now can model these situations and use simulations to address the abovequestions. 
The aim is to find methods which tell us whether or not, depending on the results of the simulations, we should accept a null hypothesis. There is no general approach. The way we can test H_0 depends on the formulation of the null hypothesis. In any case, our result will again be based on a set of measurements, such as a sample of independent data points {x_0, x_1, ..., x_{n−1}}, formally obtained by sampling from random variables {X_0, X_1, ..., X_{n−1}} (here again, all described by the same distribution function F_X). To get a solid statistical interpretation, we use a test statistics, which is a function of the sample, t = t(x_0, x_1, ..., x_{n−1}). Its distribution describes a corresponding random variable T. This means you can use any estimator (see page 27), which is also a function of the sample, as test statistics. Nevertheless, there are many test statistics which usually are not used as estimators.

To get an idea of what a test statistics t may look like, we now discuss test statistics for the above list of examples. For (A), one can obviously use the sample mean. This has to be compared to the threshold value. This will be performed within a statistical interpretation, enabling a null hypothesis to be accepted or rejected, see below. For (B) one needs to compare the distributions of the number of acquaintances before and after the change, respectively. Comparing two distributions can be done in many ways. One can just compare some moments, or define a distance between them based on the difference in area between the distribution functions, just to mention two possibilities. For discrete random variables, the mean-squared difference is particularly suitable, leading to the so-called chi-squared test, see Sec. 7.5.1. For example (C), the task is similar to (B), only that the empirical results are compared to a given distribution and that the corresponding random variables are continuous. Here, a method based on the maximum distance between two distribution functions is widely used, called the Kolmogorov-Smirnov (KS) test (see Sec. 7.5.2). To test hypothesis (D), which means to check for statistical independence, one can record a two-dimensional histogram of the population sizes of foxes and beetles. This is compared with the distribution where both populations are assumed to be independent, i.e. with the product of the two single-population distribution functions. Here, a variant of the chi-squared test is applied, see Sec. 7.5.3. In case (E), the sample is not a set of just one-dimensional numbers; instead the simulation results are conformations of proteins given by 3N-dimensional vectors of the positions r_i (i = 1, ..., N) of N particles. Here, one could introduce a method to compare two protein conformations {r_i^A}, {r_i^B} in the following way: First, one "moves" the second protein towards the first one such that the positions of the centers of mass agree. Second, one considers the axes through the centers of mass and through the first atoms, respectively. One rotates the second protein around its center of mass such that these axes become parallel. Third, the second protein is rotated around this axis such that the distances between the last atoms of the two proteins are minimized. Finally, for these normalized positions {r_i^(B⋆)}, one calculates the squared difference of all pairs of atom positions, d = Σ_i (r_i^A − r_i^(B⋆))², which serves as test function.
For a statistical analysis, the distribution of d for one thermally fluctuating protein can be determined via a simulation and then compared to the average value observed when changing the conditions. We do not go into further details here.

The general idea to test a null hypothesis using a test statistics in a statistically meaningful way is as follows:

1. You have to know, at least to an approximate level, the probability distribution function F_T of the test statistics under the assumption that the null hypothesis is true. This is the main step and will be covered in detail below.

2. You select a certain significance level α. Then you calculate an interval [a_l, a_u] such that the cumulative probability of T outside the interval equals α, for instance by distributing the weight equally outside the interval via F(a_l) = α/2, F(a_u) = 1 − α/2. Sometimes one-sided intervals are more suitable, e.g. (−∞, a_u] with F(a_u) = 1 − α, see below concerning example (A).

3. You calculate the actual value t of the test statistics from your simulation. If t ∈ [a_l, a_u], then you accept the hypothesis, otherwise you reject it. Correspondingly, the interval [a_l, a_u] is called acceptance interval.

Since this is a probabilistic interpretation, there is a small probability α that you do not accept the null hypothesis, although it is true. This is called a type I error (also called false negative), but this error is under control, because α is known.

On the other hand, it is important to realize that in general the fact that the value of the test statistics falls inside the acceptance interval does not prove that the null hypothesis is true! A different hypothesis H_1 could indeed hold; your test statistics is just not able to discriminate between the two hypotheses. Or, with a small probability β, you might obtain some value for the test statistics which is unlikely for H_1, but likely for H_0. Accepting the null hypothesis, although it is not true, is called a type II error (also called false positive). Usually, H_1 is not known, hence β cannot be calculated explicitly. The different cases and the corresponding probabilities are summarized in Fig. 7.24. To conclude: If you want to prove a hypothesis H (up to some confidence level 1 − α), it is better to use the opposite of H as null hypothesis, if this is possible.

Figure 7.24: The null hypothesis H_0 might be true, or the alternative (usually unknown) hypothesis H_1. The test of the null hypothesis might result in an acceptance or in a rejection. This leads to the four possible scenarios which appear with the stated probabilities.

Indeed, in general the null hypothesis must be suitably formulated, such that it can be tested, i.e. such that the distribution function F_T describing T can be obtained, at least in principle. For example (A), since the test statistics T is a sample mean, it is safe to assume a Gaussian distribution for T: One can perform enough simulations rather easily, such that the central limit theorem applies. We use as null hypothesis the opposite of the formulated hypothesis (A). Nevertheless, it is impossible to calculate an acceptance interval for the Gaussian distribution based on the assumption that the mean is larger than a given value. Here, one can change the null hypothesis, such that instead an expectation value equal to t_target is assumed; a small code sketch for obtaining an acceptance interval in the Gaussian case is shown below.
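The following is a minimal sketch of how step 2 could look in C for a test statistics which is Gaussian distributed under the null hypothesis, as in example (A). The function name gauss_acceptance() is chosen here for illustration; it uses the inverse distribution function gsl_cdf_gaussian_Pinv() from the GNU scientific library (GSL) (see Sec. 6.3) and assumes that expectation value and standard deviation of T under the null hypothesis are known or estimated, e.g. from the sample mean and its error bar.

#include <stdio.h>
#include <gsl/gsl_cdf.h>

/* two-sided acceptance interval [a_l, a_u] at significance level 'alpha'  */
/* for a test statistics which is Gaussian with expectation value 'mean'   */
/* and standard deviation 'sigma' under the null hypothesis                */
void gauss_acceptance(double mean, double sigma, double alpha,
                      double *a_l, double *a_u)
{
  *a_l = mean + gsl_cdf_gaussian_Pinv(alpha/2, sigma);      /* F(a_l) = alpha/2   */
  *a_u = mean + gsl_cdf_gaussian_Pinv(1-alpha/2, sigma);    /* F(a_u) = 1-alpha/2 */
}

int main()
{
  double a_l, a_u;

  gauss_acceptance(0.0, 1.0, 0.05, &a_l, &a_u); /* standard Gaussian, alpha = 0.05 */
  printf("accept H0 if t is in [%f,%f]\n", a_l, a_u);   /* approx. [-1.96, 1.96]   */
  return(0);
}

The program has to be linked against the GSL (typically -lgsl -lgslcblas -lm). For the one-sided interval used for example (A) below, only the lower boundary a_l with F(a_l) = α is needed, which is obtained analogously via mean + gsl_cdf_gaussian_Pinv(alpha, sigma).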
Hence, the null hypothesis assumesthat the test statistics has a Gaussian distribution with expectation value t target .The variance of T is unknown, but one can use, as for the calculation of errorbars, the sample variance s divided by n − 1. Now one calculates on this basisan interval [ a l , ∞ ] with F T ( a l ) = α . Therefore, one rejects the null hypothesisif t < a l , which happens with probability α . On the other hand, if the trueexpectation value is even larger than t target , then the probability of finding a0 A.K. Hartmann: Practical Guide to Computer Simulations mean with t < a l becomes even smaller than α , i.e. less likely. Hence, thehypothesis (A) can be accepted or rejected on the basis of a fixed expectationvalue.For a general hypothesis test, to evaluate the distribution of the test statis-tics T , one can perform a Monte Carlo simulation . This means one drawsrepeatedly samples of size n according to a distribution F X determined by thenull hypothesis. Each time one calculates the test statistics t and records ahistogram of these values (or a sample distribution function F ˆ T ) which is anapproximation of F T . In this way, the corresponding statistical information canbe obtained. To save computing time, in most cases no Monte Carlo simulationsare performed, but some knowledge is used to calculate or approximate F T .In the following sections, the cases corresponding to examples (B), (C), (D)are discussed in detail. This means, it is explained how one can test for equalityof discrete distributions via the chi-squared test and for equality of continuousdistributions via the KS test. Finally, some methods for testing concerning (in-)dependence of data and for quantifying the degree of dependence are stated. The chi-squared test is a method to compare histograms and discrete probabilitydistributions. The test works also for discretized (also called binned ) continuousprobability distributions, where the probabilities are obtained by integratingthe pdf over the different bins. The test comes in two variants: • Either you want to compare the histogram { h k } for bins B k (see Sec. 7.3.3)describing the sample { x , x , . . . , x n − } to a given discrete or discretizedprobability mass function with probabilities { p k } = P ( x ∈ B k ). The nullhypothesis H is: “ the sample follows a distribution given by { p k } ”. Note that the probabilities are fixed and independent of the data sample.If the probabilities are parametrized and the parameter is determined bythe sample (e.g. by the mean of the data) such that the probabilities fit thedata best, related methods as described in Sec. 7.6.2 have to be applied. • Alternatively, you want to compare two histograms { h k } , { ˆ h k } obtainedfrom two different samples { x , x , . . . , x n − } and { ˆ x , ˆ x , . . . , ˆ x n − } defined for the same bins B k . The null hypothesis H is: “ the two samplesfollow the same distribution” . In case the test is used to compare intrinsically discrete data, the intervals B k can conveniently be chosen such that each possible outcome corresponds toone interval. Note that due to the binning process, the test can be applied tohigh-dimensional data as well, where the sample is a set of vectors. Also non-numerical data can be binned. In these cases each bin represents either a subset Note that here we assume that the two samples have the same size, which is usually easyto achieve in simulations. 
A different case occurs when also the number of sample points is arandom variable, hence a difference in the number of sample points makes the acceptance ofH less likely, see [Press et al. (1995)]. .5. HYPOTHESIS TESTING AND (IN-)DEPENDENCE OF DATA ih , np i i Figure 7.25: Chi-squared statistics: A histogram (solid line) is compared toa discrete probability distribution (dashed line). For each bin, the sum of thesquared differences of the bin counter h k to the expected number of counts np k iscalculated (dotted vertical lines), see Eq. (7.68). In this case, the differences arequite notable, thus the probability that the histogram was obtained via randomexperiments from a random variable described by the probabilities { p k } (nullhypothesis) will be quite small.We start with the first case, where a sample histogram is compared to aprobability distribution, corresponding to example (C) on page 57. The teststatistics, called χ , is defined as: χ = X k ′ ( h k − np k ) np k (7.68)with np k being the expected number of sample points in bin B k . The prime atthe sum symbol indicates that bins with h k = np k = 0 are omitted. The numberof contributing bins is denoted by K ′ . If the pmf p k is nonzero for an infinitenumber of bins, the sum is truncated for terms np k ≪ 1. This means that thenumber of contributing bins will be always finite. Note that bins exhibiting h k > p k = 0 are not omitted. This results in an infinite value of χ , whichis reasonable, because for data with h k > p k = 0, the data cannot bedescribed by the probabilities p k .The chi-squared distribution with ν = K ′ − − n is equal to the totalnumber of expected data points P k n k p k = n P k p k = n , hence the K ′ differentsummands are not statistically independent. The probability density of the chi-squared distribution is given in Eq. (7.45). To perform the actual test, it is2 A.K. Hartmann: Practical Guide to Computer Simulations recommended to use the implementation in the GNU scientific library (GSL)(see Sec. 6.3).Next, a C function chi2_hd() is shown whichcalculates the cumulative probability ( p-value )that a value of χ or larger is obtained, giventhe null hypothesis that the GET SOURCE CODE DIR: randomness FILE(S): chi2.c sample was generated using the probabilities p k . Arguments of chi2_hd() arethe number of bins, and two arrays h[] and p[] containing the histogram h k and the probabilities p k , respectively: double chi2_hd(int n_bins, int *h, double *p) { int n; /* total number of sample points */ double chi2; /* chi^2 value */ int K_prime; /* number of contributing bins */ int i; /* counter */ n = 0; for(i=0; i F (x) X F (x) X xd max , Figure 7.26: Kolmogorov-Smirnov test: A sample distribution function (solidline) is compared to a given probability distribution function (dashed line). Thesample statistics d max is the maximum difference between the two functions.The p-value, i.e. the probability of a value of d max as measured ( d measuredmax )or worse, given the null hypothesis that the sample is drawn from F X , is ap-proximately given by (see [Press et al. (1995)] and references therein): P ( d max ≥ d measuredmax ) = Q KS (cid:0) [ √ n + 0 . 12 + 0 . / √ n ] d measuredmax (cid:1) (7.71)This approximation is already quite good for n ≥ 8. Here, the following auxiliaryprobability function is used: Q KS ( λ ) = 2 ∞ X i =1 ( − i +1 e − i λ (7.72) GET SOURCE CODE DIR: randomness FILE(S): ks.c with Q KS (0) = 1 and Q KS ( ∞ ) = 0. This functioncan be implemented most easily by a direct sum-mation [Press et al. 
(1995)]. The function Q_ks() receives the value of λ as argument and returns Q KS ( λ ): double Q_ks(double lambda) { const double eps1 = 0.0001; /* relative margin for stop */ const double eps2 = 1e-10; /* relative margin for stop */ int i; /* loop counter */ double sum; /* final value */ double factor; /* constant factor in exponent */ .5. HYPOTHESIS TESTING AND (IN-)DEPENDENCE OF DATA double sign; /* of summand */ double term, last_term; /* summands, last summand */ sum = 0.0; last_term = 0.0; sign = 1.0; /* initialize */ factor = -2.0*lambda*lambda; for(i=1; i<100; i++) /* sum up */ { term = sign*exp(factor*i*i); sum += term; if( (fabs(term) <= eps1*fabs(last_term)) || (fabs(term) <= eps2*sum)) return(2*sum); sign =- sign; last_term = term; } return(1.0); /* in case of no convergence */ } The summation (lines 13–22) is performed for at most 100 iterations. If thecurrent term is small compared to the previous one or very small compared tothe sum obtained so far, the summation is stopped (line 17–18). If this doesnot happen within 100 iterations, the sum has not converged (which means λ is very small) and Q (0) = 1 is returned.This leads to the following C implementation for the KS test. The function ks() expects as arguments the number of sample points n , the sample x[] anda pointer F to the distribution function: double ks(int n, double *x, double (*F)(double)) { double d, d_max; /* (maximum) distance */ int i; /* loop counter */ double F_X; /* empirical distribution function */ qsort(x, n, sizeof(double), compare_double); F_X = 0; d_max = 0.0; for(i=0; i A.K. Hartmann: Practical Guide to Computer Simulations of the sample distribution function, because at each sample data point, in theorder of occurrence, the value of F ˆ X is increased by 1 /n . When obtaining themaximum distance (lines 10–19), one has to compare F ˆ X to the distributionfunction F X just before (lines 12–14) and after (lines 15–18) the jumps. Notethat this implementation works also for samples, where some data points occurmultiple times.For the actual test, one calculates the p-value for the given sample using ks() . If the p-value exceeds the indented significance level α , the null hy-pothesis is accepted, i.e. the data is compatible with the distribution with highprobability. Usually quite small significances are used, e.g. α = 0 . 05. Thismeans that even substantial values of d max are accepted. Thus, one rejects thenull hypothesis only, as usual, in case the probability for an error of type I isquite small.It is also possible to compare two samples of sizes n , n via the KS test. Thetest statistics for the two sample distribution functions is again the maximumdistance. The probability to find a value of d max as obtained or worse, giventhe null hypothesis that the samples are drawn from the same distribution, isas above in Eq. (7.71), only one has to replace n by the “effective” samplesize n eff = n n / ( n + n ), for details see [Press et al. (1995)] and referencestherein. It is straightforward to implement this test when using the C function ks() shown above as template. Here, we consider samples, which consist of pairs ( x i , y i ) ( i = 0 , , . . . , n − 1) ofdata points. Generalizations to higher-dimensional data is straightforward. Thequestion is, whether the y i values depend on the x i values (or vice versa). In thiscase, one also says that they are statistically related . 
If yes, this means that if we know one of the two values, we can predict the other one with higher accuracy. The formal definition of statistical (in-)dependence was given in Sec. 7.1. An example of statistical dependence occurs in weather simulations: the amount of snowfall is statistically related to the temperature: if it is too warm or too cold, it will not snow. This also shows that the dependence of two variables is not necessarily monotonic. In case one is interested in monotonic and even linear dependence, one usually says that the variables are correlated, see below.

GET SOURCE CODE  DIR: randomness  FILE(S): points0A.dat points0B.dat points1A.dat points1B.dat

It is important to realize that we have to distinguish between the statistical significance of a statistical dependence and the strength of the dependence. Say that our test tells us that the x and y values are statistically related with high probability. This usually just means that we have a large sample. On the other hand, the strength of the statistical dependence can still be small. It could be, for example, that a given value for x will influence the probability distribution for y only slightly. On the other hand, the strength can be large, which means, for example, that knowing x almost determines y. But if we have only a few sample points, we cannot be very sure whether the data points are related or not. Nevertheless, there is some connection: the larger the strength, the easier it is to show that the dependence is significant. For illustration consider a sample where the x_i numbers are generated from a standard Gaussian distribution (expectation value 0, variance 1), while each y_i number is drawn from a Gaussian distribution with expectation value κ x_i (variance 1). (This is an example where the random variables Y_i describing the sample are not identically distributed.) Hence, if κ = 0, the data points are independent. Scatter plots, where each sample point (x_i, y_i) is shown as a dot in the x-y plane, are exposed in Fig. 7.27. Four combinations are presented: κ = 0 and κ = 1, each for n = 50 and n = 5000.

Figure 7.27: Scatter plots for n data points (x_i, y_i), where the x_i numbers are generated from a standard Gaussian distribution (expectation value 0, variance 1), while each y_i number is drawn from a Gaussian distribution with expectation value κ x_i (variance 1).

In the following, first a hypothesis test for statistical independence based on the chi-squared statistics is presented. Then the linear correlation coefficient is given, which states the strength of linear correlation. Finally, it is discussed how one can quantify the dependence within a sample, for example between sample points x_i, x_{i+τ}.

To test statistical dependence for a sample {(x_0, y_0), (x_1, y_1), . . . , (x_{n−1}, y_{n−1})}, one usually considers the null hypothesis: H = "The x sample points and the y sample points are independent." To test H, one puts the pairs of sample points into a two-dimensional histogram {h_kl}. The counter h_kl receives a count if for data point (x_i, y_i) we have x_i ∈ B^(x)_k and y_i ∈ B^(y)_l, for suitably determined bins {B^(x)_k} and {B^(y)_l}. Let k_x and k_y be the number of bins in x and y direction, respectively. A small sketch of how such a two-dimensional histogram can be filled is shown below.
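The following is a minimal sketch of the binning step, assuming for simplicity bins of equal width; the function name fill_hist2d() and the calling convention are chosen here for illustration only.

/* fills the two-dimensional histogram h[k][l] from n sample pairs (x[i], y[i]) */
/* equal-width bins are assumed: bin k in x direction covers                    */
/* [x_min + k*dx, x_min + (k+1)*dx), analogously for y                          */
void fill_hist2d(int n, double *x, double *y,
                 double x_min, double dx, int k_x,
                 double y_min, double dy, int k_y, int **h)
{
  int i, k, l;                                   /* counters */

  for(k = 0; k < k_x; k++)                       /* reset histogram */
    for(l = 0; l < k_y; l++)
      h[k][l] = 0;

  for(i = 0; i < n; i++)                         /* loop over sample points */
  {
    k = (int) ((x[i] - x_min)/dx);               /* determine bins */
    l = (int) ((y[i] - y_min)/dy);
    if( (k >= 0) && (k < k_x) && (l >= 0) && (l < k_y) )
      h[k][l]++;                                 /* count; points outside the range are ignored */
  }
}

The resulting array h can then be passed directly to a function like chi2_indep() discussed below.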
Next, one calculates single-value (or one-dimensional) histograms { ˆ h ( x ) k } and { ˆ h ( y ) l } defined by ˆ h ( x ) k = X l h kl ˆ h ( y ) l = X k h kl (7.73)These one-dimensional histograms describe how many counts in a certain binarise for one variable, regardless of the value of the other variable. It is assumedthat all entries of these histograms are not empty. If not, the bins should beadjusted accordingly. Note that n = P k ˆ h ( x ) k = P l ˆ h ( y ) l = P kl h kl holds.Relative frequencies, which are estimates of probabilities, are obtained bynormalizing with n , i.e. ˆ h ( x ) k /n and ˆ h ( y ) l /n . If the two variables x i , y i are in-dependent, then the relative frequency to obtain a pair of values ( x, y ) in bins { B ( x ) k } and { B ( y ) l } should be the product of the single-value relative frequencies.Consequently, by multiplying with n one obtains the corresponding expectednumber n kl of counts, under the assumption that H holds: n kl = n ˆ h ( x ) k n ˆ h ( y ) l n = ˆ h ( x ) k ˆ h ( y ) l n (7.74)These expected numbers are compared to the actual numbers in the two-dimensional histogram { h kl } via the χ test statistics, comparable to Eq. (7.68): χ = X kl ( h kl − n kl ) n kl (7.75)The statistical interpretation of χ is again provided by the chi-squared dis-tribution. The number of degrees of freedom is determined by the number ofbins ( k x k y ) in the two-dimensional histogram minus the number of constraints.The constraints are given by Eq. (7.73), except that the total number of counts .5. HYPOTHESIS TESTING AND (IN-)DEPENDENCE OF DATA n is contained twice, resulting in k x + k y − 1. Consequently, the numberof degrees of freedom is ν = k x k y − k x − k y + 1 . (7.76)Therefore, under the assumption that the x and y sample points are indepen-dent, p = 1 − F ( χ , ν ) gives the probability (p-value) of observing a test statisticsof χ or larger. F is here the distribution function of the chi-square distribution,see Eq. (7.45). This p-value has to be compared to the significance level α . If p < α , the null hypothesis is rejected. GET SOURCE CODE DIR: randomness FILE(S): chi2indep.c The following C function implements the chi-squared independence test chi2_indep() . It re-ceives the number of bins in x and y directionas arguments, as well as a two-dimensional array,which carries the histogram: double chi2_indep(int n_x, int n_y, int **h) { int n; /* total number of sample points */ double chi2; /* chi^2 value */ int k_x, k_y; /* number of contributing bins */ int k, l; /* counters */ int *hx, *hy; /* one-dimensional histograms */ hx = (int *) malloc(n_x*sizeof(int)); /* allocate */ hy = (int *) malloc(n_y*sizeof(int)); n = 0; /* calculate total number of sample_points */ for(k=0; k 05) for the case κ = 1 , n = 50, which is actuallycorrelated. On the other hand, if the number of samples is large enough, thereis no doubt.Once it is established that a sample contains dependent data, one can tryto measure the strength of dependence. A standard way is to use the linearcorrelation coefficient (also called Pearson’s r ) given by r ≡ P i ( x i − x )( y i − y ) pP i ( x i − x ) pP i ( y i − y ) . (7.77)This coefficient assumes, as indicated by the name, that a linear correlationexists within the data. The implementation using a C function is straight for-ward, see exercise (7). For the data shown in Fig. 7.27, the following correlationcoefficients are obtained: r ( κ = 0 , n = 50) = 0 . r ( κ = 0 , n = 5000) = 0 . r ( κ = 1 , n = 50) = 0 . 
r(κ = 1, n = 5000) = 0. As these values show, r reflects whether or not the data exhibit a dependence, i.e. the strength of the correlation, but the statistical significance cannot be read off from r. Hence, to test for significance, it is better to use the hypothesis test based on the χ² test statistics.

Finally, note that a different type of correlation may arise: So far it was always assumed that the different sample points x_i, x_j (or sample vectors) are statistically independent of each other. Nevertheless, it could be the case, for instance, that the sample is generated using a Markov chain Monte Carlo simulation [Newman and Barkema (1999), Landau and Binder (2000), Robert and Casella (2004), Liu (2008)], where each data point x_{i+1} is calculated using some random process, but also depends on the previous data point x_i; hence i is a kind of artificial sample time of the simulation. This dependence decreases with growing time distance between sample points. One way to see how quickly this dependence decreases is to use a variation of the correlation coefficient Eq. (7.77), i.e. a correlation function:

C̃(τ) = (1/(n−τ)) ∑_{i=0}^{n−1−τ} x_i x_{i+τ} − [ (1/(n−τ)) ∑_{i=0}^{n−1−τ} x_i ] × [ (1/(n−τ)) ∑_{i=0}^{n−1−τ} x_{i+τ} ]   (7.78)

The term [(1/(n−τ)) ∑_{i=0}^{n−1−τ} x_i] × [(1/(n−τ)) ∑_{i=0}^{n−1−τ} x_{i+τ}] will converge to x̄² for n → ∞, if it can be assumed that the distribution of the sample points is stationary, i.e. does not depend on the sample time. Therefore, C̃(τ) is approximately (1/(n−τ)) ∑_{i=0}^{n−1−τ} (x_i − x̄)(x_{i+τ} − x̄), comparable to the numerator of the linear correlation coefficient Eq. (7.77). Usually one normalizes the correlation function by C̃(0), which is just the sample variance in the stationary case, see Eq. (7.51):

C(τ) = C̃(τ)/C̃(0).   (7.79)

Consequently, for any data, for example obtained from a Markov chain Monte Carlo simulation, C(0) = 1 will always hold. Then C(τ) decreases with increasing difference τ, see for example Fig. 7.28. Very often the functional form is similar to an exponential ∼ exp(−τ/τ_c). In theory, C(τ) should converge to zero for τ → ∞, but due to the finite size of the sample, usually strong fluctuations appear for τ approaching n. A typical time τ_c which measures how fast the dependence of the sample points decreases is given by C(τ_c) = 1/e, which is consistent with the above expression if the correlation function decreases exponentially. At twice this distance, the correlation has already decreased substantially (to 1/e²). Consequently, if you want to obtain error bars for samples obtained from dependent data, you could include, for instance, only the points x_0, x_{2τ_c}, x_{4τ_c}, x_{6τ_c}, . . . in a sample, or just use n/(2τ_c) instead of n in any calculation of error bars. Although these error bars are different from those obtained if the sample was really independent, they give a fairly good impression of the statistical error.

Figure 7.28: Correlation function C(τ) for a simulation of a ferromagnetic system, x_i being the magnetization at time step i. (For experts: Ising system of size 16 × 16 spins simulated with single-spin flip Metropolis Monte Carlo at a (reduced) temperature T = 2.269 close to the phase transition temperature, where correlation times τ_c are large.)
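A direct implementation of Eqs. (7.78) and (7.79) requires only a few lines. The following is a minimal sketch; the function names corr_func_raw() and corr_func() and the plain summation for each value of τ are choices made for this illustration.

/* unnormalized correlation function C~(tau) according to Eq. (7.78)  */
/* n: number of sample points, x[]: sample, tau: time distance (< n)  */
double corr_func_raw(int n, double *x, int tau)
{
  double sum_xy = 0.0, sum_x = 0.0, sum_y = 0.0;
  int i, num = n - tau;                     /* number of contributing terms */

  for(i = 0; i < num; i++)
  {
    sum_xy += x[i]*x[i+tau];                /* product term      */
    sum_x  += x[i];                         /* first mean term   */
    sum_y  += x[i+tau];                     /* second mean term  */
  }
  return(sum_xy/num - (sum_x/num)*(sum_y/num));
}

/* normalized correlation function C(tau) according to Eq. (7.79) */
double corr_func(int n, double *x, int tau)
{
  return(corr_func_raw(n, x, tau)/corr_func_raw(n, x, 0));
}

Scanning τ = 1, 2, . . . until corr_func(n, x, tau) drops below 1/e yields an estimate of the correlation time τ_c mentioned above.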
Alternatively, to obtain a typical time τ_c without calculating a correlation function, you can also use the blocking method [Flyvbjerg (1998)]. Within this approach, you iteratively merge neighboring data points via x_i^{(z+1)} = (x_{2i}^{(z)} + x_{2i+1}^{(z)})/2 (with n^{(z+1)} = n^{(z)}/2; z = 0 corresponds to the original sample). For each level z you calculate the standard error bar σ^{(z)}/√(n^{(z)} − 1). This error bar will grow with increasing z until it reaches a plateau at some level z_c; there, the data is (almost) independent and the true error bar is given by the level value. Then τ_c = 2^{z_c} is a typical time of independence of the data points. If you are really just interested in error bars, i.e. you do not need to know the value of τ_c, you could also use the bootstrap approach, which is not susceptible to dependence of data, see Sec. 7.3.4.

7.6 General estimators

In Sec. 7.3, different methods are presented of how to estimate parameters which can be obtained directly and simply from the given sample {x_0, x_1, . . . , x_{n−1}}. In this section, a general method is considered which enables estimators to be obtained for arbitrary parameters of probability distributions. The method is based on the maximum-likelihood principle, which is presented in Sec. 7.6.1. This principle can be extended to the modeling of data, where often a sample of {(x_0, y_0, σ_0), (x_1, y_1, σ_1), . . . , (x_{n−1}, y_{n−1}, σ_{n−1})} is given. Typically the x_i data points represent some control parameter, which can be chosen in the simulation, such as the temperature of a gas. It is assumed that all x_i values are different. Consequently, the simulation has been carried out at n different values of the control parameter. The y_i data points are averages of measurements (e.g. the density of the gas) obtained in the simulations for the fixed value x_i of the control parameter. The σ_i values are the corresponding error bars. (Sometimes the x_i data points are also measured quantities which are characterized by error bars as well. The generalization of the methods to this case is straightforward.) Modeling the data means that one wants to determine a relationship y = y(x). Usually some assumptions or knowledge about the relationship are available, which means one has available a parametrized test function y_θ(x). Consequently, the set of parameters θ has to be adjusted such that the function y_θ(x) fits the sample "best". This is called data fitting and will be explained in Sec. 7.6.2. This approach can also be used to compare several fitted test functions to determine which represents the most suitable model.

Here, we consider the following task: For a given sample {x_0, x_1, . . . , x_{n−1}} and a probability distribution represented by a pmf p_θ(x) or a pdf f_θ(x), we want to determine the parameters θ = (θ_1, . . . , θ_{n_p}) such that the pmf or pdf represents the data "best". This is written in parentheses, because there is no unique definition of what "best" means, or even a mathematical way to derive a suitable criterion. If one assumes no prior knowledge about the parameters, one can use the following principle:

Definition The maximum-likelihood principle states that the parameters θ should be chosen such that the likelihood of the data set, given the parameters, is maximal.

In case of a discrete random variable, if it can be assumed that the different data points are independent, the likelihood of the data is just given by the product of the single data point probabilities. This defines the likelihood function

L(θ) ≡ p_θ(x_0) p_θ(x_1) . . . p_θ(x_{n−1}) = ∏_{i=0}^{n−1} p_θ(x_i)   (7.80)

For the continuous case, the probability to obtain during a random experiment exactly a certain sample is zero. Nevertheless, for a small uncertainty parameter ǫ, the probability to obtain a value in the interval [x̃ − ǫ, x̃ + ǫ] is P(x̃ − ǫ ≤ X < x̃ + ǫ) = ∫_{x̃−ǫ}^{x̃+ǫ} f_θ(x) dx ≈ f_θ(x̃)·2ǫ.
Since 2ǫ enters just as a factor, it is not relevant for determining the maximum. Consequently, for the continuous case, one considers the following likelihood function

L(θ) ≡ f_θ(x_0) f_θ(x_1) . . . f_θ(x_{n−1}) = ∏_{i=0}^{n−1} f_θ(x_i)   (7.81)

To find the maximum of a likelihood function L(θ) analytically, one has to calculate the first derivatives with respect to all parameters, respectively, and require them to be zero. Since calculating the derivative of a product involves the application of the product rule, it is usually more convenient to consider the log-likelihood function

l(θ) ≡ log L(θ).   (7.82)

This turns the product of single-data-point pmfs or pdfs into a sum, where the derivatives are easier to obtain. Furthermore, since the logarithm is a monotonic function, the maximum of the likelihood function is the same as the maximum of the log-likelihood function. Hence, the parameters which suit "best" are determined within the maximum-likelihood approach by the set of equations

∂l(θ)/∂θ_k = 0   (k = 1, . . . , n_p)   (7.83)

Note that the fact that the first derivatives are zero only assures that an extremal point is obtained. Furthermore, these equations often have several solutions. Therefore, one has to check explicitly which solutions are indeed maxima, and which is the largest one. Note that maximum-likelihood estimators, since they are functions of the samples, are also random variables ML_{θ_k,n}(X_0, . . . , X_{n−1}).

As a toy example, we consider the exponential distribution with the pdf given by Eq. (7.39). It has one parameter µ. The log-likelihood function for a sample {x_0, x_1, . . . , x_{n−1}} is in this case

l(µ) = log ∏_{i=0}^{n−1} f_µ(x_i) = ∑_{i=0}^{n−1} log[ (1/µ) exp(−x_i/µ) ] = ∑_{i=0}^{n−1} ( log(1/µ) − x_i/µ ) = n log(1/µ) − (n/µ) x̄

Taking the derivative with respect to µ we obtain:

0 = ∂l(µ)/∂µ = −n/µ + (n/µ²) x̄ = −(n/µ²)(µ − x̄)

This implies µ = x̄. It is easy to verify that this corresponds to a maximum. Since the expectation value for the exponential distribution is just E[X] = µ, this is compatible with the result from Sec. 7.3, where it was shown that the sample mean is an unbiased estimator of the expectation value.

If one applies the maximum-likelihood principle to a Gaussian distribution with parameters µ and σ², one obtains (not shown here) as estimators the sample mean x̄ (for µ) and the sample variance s² (for σ²), respectively. This means (see Eq. (7.54)) that the maximum-likelihood estimator for σ² is biased. Fortunately, we know that the bias disappears asymptotically for n → ∞. Indeed, it can be shown, under rather mild conditions on the underlying distributions, that all maximum-likelihood estimators ML_{θ_k,n}(X_0, . . . , X_{n−1}) for a parameter θ_k are asymptotically unbiased, i.e.

lim_{n→∞} E[ML_{θ_k,n}] = θ_k   (7.84)

In contrast to the exponential and Gaussian cases, for many applications the maximum-likelihood parameter is not directly related to a standard sample estimator. Furthermore, ML_{θ_k,n} can often not even be determined analytically. In this case, one has to optimize the log-likelihood function numerically, for example, using the corresponding methods from the GNU scientific library (GSL) (see Sec. 6.3).
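As an illustration of the quantity that is maximized numerically, the following minimal sketch evaluates the log-likelihood Eq. (7.82) for an arbitrary pdf passed as a function pointer. The function name log_likelihood() is chosen here for illustration, and it is assumed that the parameters θ of the pdf are fixed inside f (e.g. via global variables), which is a simplification compared to the GSL format used below.

#include <math.h>

/* log-likelihood l = sum_i log f(x_i) of a sample for a pdf 'f'   */
/* whose parameters are assumed to be fixed elsewhere              */
double log_likelihood(int n, double *x, double (*f)(double))
{
  double l = 0.0;             /* sum of contributions */
  int i;                      /* loop counter */

  for(i = 0; i < n; i++)
    l += log(f(x[i]));        /* contribution of sample point i */
  return(l);
}

Maximizing this quantity with respect to the parameters, either analytically as above or numerically as in the following example, yields the maximum-likelihood estimates.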
GET SOURCE CODE DIR: randomness FILE(S): max likely.c As example, we consider Fisher-Tippett distri-bution, see Eq. (7.43), shifted to exhibit the max-imum at x instead of at 0. Hence, we have twoparameters λ and x to adjust. The function tobe optimized (the target function ), i.e. the log-likelihood function here, must beof a special format when using the minimization functions of the GSL. This firstargument of the target function contains the pdf parameters to be adjusted, i.e.the main argument vector of the target function. This argument must be ofthe type gsl_vector , which is a GSL type for vectors. One needs to include A.K. Hartmann: Practical Guide to Computer Simulations double ll_ft(const gsl_vector *par, void *param) { double lambda, x0; /* parameters of pdf */ sample_t *sample; /* sample */ double sum; /* sum of log-likelihood contributions */ int i; /* loop counter */ lambda = gsl_vector_get(par, 0); /* get data */ x0 = gsl_vector_get(par, 1); sample = (sample_t *) param; sum = sample->n*log(lambda); /* calculate log likelihood */ for(i=0; i The set-up of the minimizer object comes in two steps, first al-location using gsl_multimin_fminimizer_alloc() , then initialization via gsl_multimin_fminimizer_set() while passing the target function, the start-ing point par and the (initial) simplex size. The sample.x[] array has to befilled with the actual sample (not shown here).The minimization loop looks as follows: do /* perform minimization */{ iter++;status = gsl_multimin_fminimizer_iterate(s); /* one step */if(status) /* error ? */break;size = gsl_multimin_fminimizer_size(s); /* converged ? */status = gsl_multimin_test_size(size, 1e-4);}while( (status == GSL_CONTINUE) && (iter<100) ); The main work is done in gsl_multimin_fminimizer_iterate() . Thenit is checked whether an error has occurred. Next, the size of the simplex iscalculated and finally tested whether the size falls below some limit, 10 − here. The simplex is spanned by par and the n vectors given by par plus (0 , . . . , , simplex size[i] , 0 , . . . , 0) for i = 1 , . . . , n . A.K. Hartmann: Practical Guide to Computer Simulations The actual estimate of the parameters can be obtained via gsl_vector_get(s->x, 0) and gsl_vector_get(s->x, 1) . Note thatfinally all allocated memory should be freed: gsl_vector_free(par); /* free everything */gsl_vector_free(simplex_size);gsl_multimin_fminimizer_free(s);free(sample.x); As an example, n = 10000 data points were generated according to a Fisher-Tippett distribution with parameters λ = 3 . x = 2 . 0. With the above startingparameters, the minimization converged to the values ˆ λ = 2 . 995 and ˆ x = 2 . In the previous section, the parameters of a probability distribution are chosensuch that the distribution describes the data best. Here, we consider a more gen-eral case, called modeling of data . As explained above, here a sample of triplets { ( x , y , σ ) , ( x , y , σ ) , . . . , ( x n − , y n − , σ n − ) } is given. Typically, the y i aremeasured values obtained from a simulation with some control parameter (e.g.the temperature) fixed at different values x i ; σ i is the corresponding error barof y i . Here, one wants to determine parameters θ = ( θ , . . . , θ n p ) such that thegiven parametrized function y θ ( x ) fits the data “best”, one says one wants to fit the function to the data. Similar to the case of fitting a pmf or a pdf, thereis no general principle of what “best” means.Let us assume that the y i are random variables, i.e. comparing different sim-ulations. 
Thus, the measured values are scattered around their “true” values y θ ( x i ). This scattering can be described approximately by a Gaussian distribu-tion with mean y θ ( x i ) and variance σ i : q θ ( y i ) ∼ exp (cid:18) − ( y i − y θ ( x i )) σ i (cid:19) . (7.85)This assumption is often valid, e.g. when each sample point y i is itself a samplemean obtained from a simulation performed at control parameter value x i , and σ i is the corresponding error bar. The log-likelihood function for the full datasample is l ( θ ) = log n − Y i =0 q θ ( y i ) ∼ − n − X i =0 (cid:18) y i − y θ ( x i ) σ i (cid:19) .6. GENERAL ESTIMATORS l ( θ ) is equivalent to minimizing − l ( θ ), hence one minimizes the mean-squared difference χ θ = n − X i =0 (cid:18) y i − y θ ( x i ) σ i (cid:19) (7.86)This means the parameters θ are determined such that function y θ ( x ) follows thedata points { ( x , y ) , . . . ( x n − , y n − ) } as close as possible, where the deviationsare measured in terms of the error bars σ i . Hence, data points with smaller errorbar enter with more weight . The full procedure is called least-squares fitting .The minimized mean-squared difference is a random variable. Note that thedifferent terms are not statistically independent, since they are related by the n p parameters ˆ θ which are determined via minimizing χ θ . As a consequence,the distribution of χ θ is approximately given by chi-squared distribution (seeEq. (7.45) for the pdf) with n − n p degrees of freedom. This distribution canbe used to evaluate the statistical significance of a least-squares fit, see below.In case, one wants to model the underlying distribution function for a sampleas in Sec. 7.6.1, say for a continuous distribution, it is possible in principle to usethe least-squares approach as well. In this case one would fit the parametrizedpdf to a histogram pdf, which has also the above mentioned sample format { ( x i , y i , σ i ) } . Nevertheless, although the least-squares principle is derived usingthe maximum-likelihood principle, usually different parameters are obtained ifone fits a pdf to a histogram pdf compared to obtaining these parameters froma direct maximum-likelihood approach. Often [Bauke (2007)], the maximum-likelihood method gives more accurate results. Therefore, one should use aleast-squares fit mainly for a fit of a non-pmf/non-pdf function to a data set.Fortunately, to actually perform least-squares fitting, you do not have towrite your own fitting functions, because there are very good fitting implemen-tations readily available. Both programs presented in Sec. 7.4, gnuplot and xmgrace , offer fitting to arbitrary functions. It is advisable to use gnuplot , sinceit offers higher flexibility for that purpose and gives you more information usefulto estimate the quality of a fit.As an example, let us suppose that you want to fit an algebraic function ofthe form f ( L ) = e ∞ + aL b to the data set of the file sg e0 L.dat shown on page42. First, you have to define the function and supply some rough (non-zero)estimations for the unknown parameters. Note that the exponential operatoris denoted by ** and the standard argument for a function definition is x , butthis depends only on your choice: gnuplot> f(x)=e+a*x**bgnuplot> e=-1.8gnuplot> a=1gnuplot> b=-1 The actual fit is performed via the fit command. The program uses the non-linear least-squares Levenberg-Marquardt algorithm [Press et al. (1995)], which0 A.K. 
Hartmann: Practical Guide to Computer Simulations allows a fit data to almost all arbitrary functions. To issue the command, youhave to state the fit function, the data set and the parameters which are to beadjusted. First, we consider the case where just two columns of the data areused or available (in this case, gnuplot assumes σ i = 1). For our example youenter: gnuplot> fit f(x) "sg_e0_L.dat" via e,a,b Then gnuplot writes log information to the output describing the fittingprocess. After the fit has converged it prints for the given example: After 17 iterations the fit converged.final sum of squares of residuals : 7.55104e-06rel. change during last iteration : -2.42172e-09degrees of freedom (ndf) : 5rms of residuals (stdfit) = sqrt(WSSR/ndf) : 0.00122891variance of residuals (reduced chisquare) = WSSR/ndf : 1.51021e-06Final set of parameters Asymptotic Standard Error======================= ==========================e = -1.78786 +/- 0.0008548 (0.04781%)a = 2.5425 +/- 0.2282 (8.976%)b = -2.80103 +/- 0.08265 (2.951%)correlation matrix of the fit parameters:e a be 1.000a 0.708 1.000b -0.766 -0.991 1.000 The most interesting lines are those where the results ˆ θ for your parametersalong with the standard error bar are printed. Additionally, the quality of thefit can be estimated by the information provided in the three lines beginningwith “ degree of freedom ”. The first of these lines states the number of degreesof freedom, which is just n − n p . The mean-squared difference χ θ is denoted as WSSR in the gnuplot output. A measure of quality of the fit is the probability Q that the value of the mean-squared difference is equal or larger compared tothe value from the current fit, given the assumption that the data points aredistributed as in Eq. (7.85) [Press et al. (1995)]. The larger the value of Q , thebetter is the quality of the fit. As mentioned above, Q can be evaluated from achi-squared distribution with n − n p degrees of freedom. To calculate Q usingthe gnuplot output you can use the little program Q.c These “error bars” are calculated in a way which is in fact correct only when fitting linearfunctions; hence, they have to be taken with care. .6. GENERAL ESTIMATORS int main(int argc, char *argv[]) { double WSSRndf; int ndf; if(argc != 3) { printf("USAGE %s Figure 7.29: Gnuplot window showing the result of a fit command along withthe input data. gnuplot> fit [5:12] f(x) "sg_e0_L.dat" using 1:2:3 via e,a,b More information on how to use the fit command, such as fitting higher-dimensional data, can be obtained when using the gnuplot online help via en-tering help fit . .6. GENERAL ESTIMATORS Exercises ( solutions: see CD enclosed with book)1. Sampling from discrete distribution Design, implement and test a function, whichreturns a random number which is distributedaccording to some discrete distribution functionstored in an array F , as describe in Sec. 7.2.2. 
SOLUTION SOURCE CODE DIR: randomness FILE(S): poisson.c The function prototype reads as follows: /******************** rand_discrete() *****************//** Returns natural random number distributed **//** according a discrete distribution given by the **//** distribution function in array ’F’ **//** Uses search in array to generate number **//** PARAMETERS: (*)= return-paramter **//** n: number of entries in array **//** F: array with distribution function **//** RETURNS: **//** random number **//******************************************************/int rand_discrete(int n, double *F) For simplicity, you can use the drand48() function from the standard C libraryto generate random numbers distributed according to U (0 , F for a Poisson distribution with parameter µ , see Eq. (7.27)for the probability mass function. The function should determine automaticallyhow many entries of F are needed, depending on the paramater µ . The functionprototype reads as follows: /********************* init_poisson() *****************//** Generates array with distribution function **//** for Poisson distribution with mean mu: **//** p(k)=mu^k*exp(-mu)/x! **//** The size of the array is automatically adjusted. **//** PARAMETERS: (*)= return-paramter **//** (*) n_p: p. to number of entries in table **//** mu: parameter of distribution **//** RETURNS: **//** pointer to array with distribution function **//******************************************************/double *init_poisson(int *n_p, double mu) Hints: To determine the array sizes, you can first loop over the probabilities andtake the first value k_0 where p ( k_0 ) = 0 within the precision of the numerics.This value of k_0 serves as array size. Alternatively, you start with some sizeand extend the array if needed by doubling its size. For testing purposes, youcan generate many numbers, calculate the mean and compare it with µ . Al-ternatively, you could record a histogram (see Chap. 3) and compare with Eq.(7.27). A.K. Hartmann: Practical Guide to Computer Simulations Inversion Method for Fisher-Tippett distribution Design, implement and test a function, whichreturns a random number which is distributedaccording to the Fisher-Tippett distributionEq. (7.43) with parameter λ . Use the inversionmethod. SOLUTION SOURCE CODE DIR: randomness FILE(S): fischer tippett.c The function prototype reads as follows: /******************** rand_fisher_tippett() ***********//** Returns random number which is distributed **//** according the Fisher-Tippett distribution **//** PARAMETERS: (*)= return-paramter **//** lambda: parameter of distribution **//** RETURNS: **//** random number **//******************************************************/double rand_fisher_tippett(double lambda) Remarks: For simplicity, you can use the drand48() function from the standardC library to generate random numbers distributed according to U (0 , ∼ . /λ .3. Variance of data sample Design, implement and test a function, whichcalculates the variance s of a sample of datapoints. Use directly Eq. (7.51), i.e. do not usean equivalent form of Eq. (7.21), since this formis more susceptible to rounding errors. 
SOLUTION SOURCE CODE DIR: randomness FILE(S): variance.c The function prototype reads as follows: /********************** variance() ********************//** Calculates the variance of n data points **//** PARAMETERS: (*)= return-paramter **//** n: number of data points **//** x: array with data **//** RETURNS: **//** variance **//******************************************************/double variance(int n, double *x) Remark: The so-called corrected double-pass algorithm [Chan et al. (1983)]aims at further reducing the rounding error. It is based on the equation s = 1 n n − X i =0 ( x − x ) − n n − X i =0 ( x i − x ) ! . The second would be zero for exact arithmetic and accounts for rounding errosoccurring in the second term. It becomes important in particular if the expec-tation value is large. Perform experiments for generating Gaussian distributednumber with σ = 1 and µ = 10 , without and with the correction. .6. GENERAL ESTIMATORS Bootstrap Design, implement and test a function, whichuses bootstrapping to calculate the confidenceinterval at significance level α given in Eq.(7.66). SOLUTION SOURCE CODE DIR: randomness FILE(S): bootstrap ci.c The function prototype reads as follows: /***************** bootstrap_ci() *********************//** Calculates a confidence interval by ’n_resample’ **//** times resampling the given sample points **//** and each time evaluation the estimator ’f’ **//** PARAMETERS: (*)= return-paramter **//** n: number of data points **//** x: array with data **//** n_resample: number of bootstrap iterations **//** alpha: confidence level **//** f: function (pointer) = estimator **//** (*) low: (p. to) lower boundary of conf. int.**//** (*) high: (p. to) upper boundary of conf. int.**//** RETURNS: **//** (nothing) **//******************************************************/void bootstrap_ci(int n, double *x, int n_resample,double alpha, double (*f)(int, double *),double *low, double *high) Hints: Use the function bootstrap_variance() as example. To get the entriesat the positions defined via Eq. (7.66), you can sort the bootstrap sample firstusing qsort() , see Sec. 6.1.You can test your function by using the provided main file bootstrap_test.c ,the auxiliary files mean.c and variance.c and by compiling with cc -o bt bootstrap_test.c bootstrap_ci.c mean.c -lm -DSOLUTION . Notethat the macro definition -DSOLUTION makes the main() function to call bootstrap_ci() instead of bootstrap_variance() .5. Plotting data Plot the data file FTpdf.dat using xmgrace .The file contains a histogram pdf generated forthe Fisher-Tippett distribution. The file formatis 1st column: bin number, 2nd: bin midpoint,3rd: pdf value, 4th: error bar. Use SOLUTION SOURCE CODE DIR: randomness FILE(S): FTplot.agr the “block data” format to read the files (columns 2,3,4). Create a plot withinset. The main plot should show the histogram pdf with error bars and logarith-mically scaled y axis, the inset should show the data with linear axes. Describethe plot using a text label placed in the plot. Choose label sizes, line width andother styles suitably. Store the result as .agr file and export it to a postscript(eps) file.The result should look similar to: A.K. Hartmann: Practical Guide to Computer Simulations x -5 -4 -3 -2 -1 P ( x ) Fisher-Tippett distribution Chi-squared test Design, implement and test a function, whichcalculates the χ test statistics for two his-tograms { h i } , { ˆ h i } according Eq. (7.69). Thefunction should return the p-value, i.e. 
the SOLUTION SOURCE CODE DIR: randomness FILE(S): chi2hh.c cumulative probability (“p-value”) that a value of χ or larger is obtained underthe assumption that the two histograms were obtained by sampling from thesame (discrete) random variable.The function prototype reads as follows: /********************* chi2_hh() ***********************//** For chi^2 test: comparison of two histograms **//** to probabilities: Probability to **//** obtain the corresponding chi2 value or worse. **//** It is assumed that the total number of data points**//*+ in the two histograms is equal ! **//** **//** Parameters: (*) = return parameter **//** n_bins: number of bins **//** h: array of histogram values **//** h2: 2nd array of histogram values **//** **//** Returns: **//** p-value **//*******************************************************/double chi2_hh(int n_bins, int *h, int *h2) Hints: Use the functio chi2_hd() as example. Include a test, which verifies thatthe total number of counts in the two histograms agree.To test the function: Generate two histograms according to a binomial distri-bution with parameters n = par_n = 10 and p = 0.5 or p = par_p . Perform a .6. GENERAL ESTIMATORS loop for different values of par_p and calculate the p-value each time using the gsl_cdf_chisq_Q() function of the GNU scientific library (GSL) (see Sec. 6.3).7. Linear correlation coefficient Design, implement and test a function, whichcalculates the linear correlation coefficient r tomeasure the strength of a correlation for a sam-ple { ( x , y ) , ( x , y ) , . . . , ( x n − , y n − ) } . SOLUTION SOURCE CODE DIR: randomness FILE(S): lcc.c The function prototype reads as follows: /**************************** lcc() ********************//** Calculates the linear correlation coefficient **//** **//** Parameters: (*) = return parameter **//** n: number of data points **//** x: first element of sample set **//** y: second element of sample set **//** **//** Returns: **//** r **//*******************************************************/double lcc(int n, double *x, double *y) Remark: Write a main() function which generates a sample in the following way:the x i numbers are generated from a standard Gaussian distribution N (0 , y i number is drawn from a Gaussian distribution with expectationvalue κx i (variance 1). Study the result for different values of κ and n .8. Least-squares fitting Copy the program from exercise (2) to a newprogram and change it such that numbers fora shifted Fisher-Tippett with parameters λ andpeak position x are generated. The numbersshould be stored in a histogram and a histogrampdf should be written to the standard output. SOLUTION SOURCE CODE DIR: randomness FILE(S): fitFT.gpfisher tippett2.c • Choose the histogram parameters (range, bin range) such that the his-tograms match the generated data well. • Run the program to generate n = 10 numbers for parameters x = 2 . λ = 3 . 0. Pipe the histogram pdf to a file (e.g. using > ft.dat at theend of the call). • Plot the result using gnuplot . • Define the pdf for the Fisher-Tippett distribution in gnuplot and fit thefunction to the data with x and λ as adjustable parameters. Choose asuitable range for the fit. • Plot the data together with the fitted function. • How does the result compare to the maximum-likelihood fit presented inSec. 7.6.1? • Does the fit (in particular for λ ) get better if you increase the number ofsample points to 10 ? A.K. 
Hartmann: Practical Guide to Computer Simulations Hints: The shift is implemented by just adding x to the generated randomnumber. Use either the histograms from Chap. 3, or implement a “poor-manshistogram” via an array hist ( see also in the main() function of the reject.c program partly presented in Sec. 7.2.4). ibliography [Abramson and Yung (1986)] Abramson, B. and Yung, M. (1986). Construc-tion through decomposition: a divide-and-conquer algorithm for the N-queensproblem, CM ’86: Proceedings of 1986 ACM Fall joint computer conference ,pp. 620–628. (IEEE Computer Society Press, Los Alamitos)[Abramson and Yung (1989)] Abramson, B. and Yung, M. (1989). Divide andconquer under global constraints: A solution to the N-queens problem, Jour-nal of Parallel and Distributed Computing , pp. 649–662.[Aho et al. (1974)] Aho, A. V., Hopcroft, J. E., and Ullman, J. D. (1974). The Design and Analysis of Computer Algorithms , (Addison-Wesley, Reading(MA)).[Albert and Barab´asi (2002)] Albert, R. and Barab´asi, A.-L. (2002). Statisticalmechanics of complex networks , Rev. Mod. Phys. , pp. 47–97.[Allen and Tildesley (1989)] Allen, M. P., and Tildesley, D. J. (1989). ComputerSimulation of Liquids , (Oxford University Press, Oxford).[Alta Vista] Alta Vista , search engine, see .[APS] APS, American Physical Society , journals see http://publish.aps.org/ .[arXiv] arXiv , preprint server, see http://arxiv.org/ .[Bauke (2007)] Bauke, H. (2007). Parameter estimation for power-law distribu-tions by maximum likelihood methods, Eur. Phys. J. B , pp. 167–173.[Beamer] Beamer class, a L A TEX package; written by Till Tantau, see http://latex-beamer.sourceforge.net/ .[Becker (2007)] Becker, P. (2007). The C++ Standard Library Extensions ,(Addison-Wesley Longman, Amsterdam).[Binder (1981)] Binder, K. (1981). Finite size scaling analysis of ising modelblock distribution functions, Z. Phys. B , pp. 119–140.890 BIBLIOGRAPHY [Binder and Heermann (1988)] Binder, K. and Heermann, D. W. (1988). MonteCarlo Simulations in Statistical Physics , (Springer, Heidelberg).[Bolobas (1998)] Bolobas, B. (1998). Modern Graph Theory , (Springer, NewYork).[Boost] boost collection of libraries; available, including documentation, at .[Cardy (1996)] Cardy, J. (1996). Scaling and Renormalization in StatisticalPhysics , (Cambridge University Press, Cambridge).[Chan et al. (1983)] Chan, T. F., Golub, G.H., and LeVeque, R. J. (1983). Algo-rithm for Computing the Sample Variance: Analysis and Recommendations, Amer. Statist. , pp. 242–247.[Claiborne (1990)] Claiborne, J. D. (1990). Mathematical Preliminaries forComputer Networking , (Wiley, New York).[Comp. Sci. Eng. (2008)] Computational Provenance, special issue of Comput-ing in Science & Engineering (3), pp. 3–52.[Cormen et al. (2001)] Cormen, T. H., Clifford, S., Leiserson, C. E., and Rivest,R. L. (2001). Introduction to Algorithms , (MIT Press).[CTAN] Comprehensive TeX Archive Network : .[Dekking et al (2005)] Dekking, F. M., Kraaikamp, C., Lopuha¨a, H. P., andMeester, L. E. (2005). A Modern Introduction to Probability and Statistics ,(Springer, London).[Devroye (1986)] Devroye, L. (1986). Non-Uniform Random Variate Genera-tion , (Springer, London).[Dhar (2001)] Dhar, A. (2001). Heat Conduction in a One-dimensional Gasof Elastically Colliding Particles of Unequal Masses, Phys. Rev. Lett. ,pp. 3554–3557.[Marsaglia] “Diehard” test provided by George Marsaglia, see source code at .[Efron (1979)] Efron, B. (1979). Bootstrap methods: another look at the jack-nife, Ann. Statist. , pp. 