Fermilab-PUB-15-043-T

A Multi-Threaded Version of MCFM
John M. Campbell a, R. Keith Ellis b, Walter T. Giele c
Fermilab, PO Box 500, Batavia, IL 60510, USA
March 23, 2015

a e-mail: [email protected]
b e-mail: [email protected]
c e-mail: [email protected]
Abstract
We report on our findings modifying MCFM using OpenMP to implement multi-threading. By using OpenMP, the modified MCFM will execute on any processor, automatically adjusting to the number of available threads.

We modified the integration routine VEGAS to distribute the event evaluation over the threads, while combining all events at the end of every iteration to optimize the numerical integration.

Special care has been taken that the results of the Monte Carlo integration are independent of the number of threads used, to facilitate the validation of the OpenMP version of MCFM.
An important aspect of Monte Carlo programs is evaluation speed and ease of use. A faster overall evaluation speed not only means that more complicated processes can be evaluated, but it also allows for more experimentation as results are returned in a shorter time. Computer processors are increasing their computational power by including more and more computing cores. It is therefore essential for Monte Carlo event generators to explore the possibility of a parallel implementation of the code by taking advantage of the multiple threads to reduce the evaluation time for a given number of events. By properly implementing the use of multi-threading, the Monte Carlo evaluation speed will scale with the number of cores; this process will continue as more and more cores become available in the future. Monte Carlo event generators are well suited to take advantage of multi-core processors. Parallelization is straightforward as each generated event is evaluated independently, while the results of these evaluations are all combined to optimize the numerical integration.

The reason processors increase the number of cores instead of the processor frequency is the limitation deriving from the growth of the power consumption of the chip. The power consumption in a chip is given by the equation

    P = C V² f ,                                                (1)

where P is power, C is the capacitance being switched per clock cycle, V is voltage, and f is the processor frequency (cycles per second). As the clock speed increases the power (and hence heat) grows linearly. By having two circuits in parallel, we can double the capacitance and halve the clock speed. The voltage determines the rate at which the capacitance charges and discharges, so that a slower clock speed can run with lower voltages. At half the clock speed, we can approximately halve the voltage: doubling C while halving both f and V gives (2C)(V/2)²(f/2) = C V² f / 4, i.e. the same aggregate throughput at a quarter of the power. This leads to a saving in power without a compromise in performance. The use of many cores in this fashion may allow the growth of computing power to continue following Moore's law in the future. It is therefore imperative that software evolve to take advantage of these developments.

Currently the Intel Xeon Phi coprocessor with 240 processor threads and General Purpose Graphics Processing Units (GPGPUs) with up to 2,880 GPU cores are the most extreme implementations of this approach to increasing the computational power. The Xeon Phi is the first generation of the Intel MIC (Many Integrated Cores) hardware. With an improved version of this coprocessor planned for release in the summer of 2015, further speed-ups can be expected.

We will explore using this coprocessor and more conventional processors using OpenMP. Specifically, we will test our OpenMP version of MCFM on an Intel Core i7-4770 (4 hardware threads), a dual Intel Xeon X5650 (2x6 hardware threads), a quadruple AMD 6128 HE Opteron (4x8 hardware threads) and the Intel Xeon Phi 5110P (240 hardware threads). Note that the Intel Core i7 comes with 8 hyperthreads, 2 software threads per core. The core can execute only one of the threads and quickly switch to the other thread if the current thread is waiting. As we will see, this is of limited benefit for our application.

The OpenMP standard¹ [1] is a good choice for implementing parallel programming. It is native to both the Intel and GNU compilers and can be invoked by including the 'openmp' flag during compilation. No special libraries or other software need to be installed. The OpenMP compiler directives are simply implemented as comment statements in either FORTRAN or C/C++ code. This has the advantage that the code can also be compiled without the 'openmp' flag; in that case the OpenMP directives are interpreted as comments by the compiler. Furthermore, we can implement the parallelism with only minor alterations to the original code by just adding these compiler directives.
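As an illustration of such a directive, a loop over independent event evaluations can be distributed over the available threads with a single comment line; the routine and variable names below are generic placeholders rather than MCFM code, and the !$OMP lines are ignored as ordinary comments when the 'openmp' flag is omitted:

      subroutine sumweights(nevents, total)
      implicit none
      integer, intent(in) :: nevents
      double precision, intent(out) :: total
      double precision, external :: evalevent   ! placeholder event-weight function
      double precision :: wt
      integer :: i
      total = 0d0
!$OMP PARALLEL DO PRIVATE(wt) REDUCTION(+:total)
      do i = 1, nevents
         wt = evalevent(i)        ! each event is evaluated independently
         total = total + wt       ! per-thread partial sums are combined at the end
      enddo
!$OMP END PARALLEL DO
      end subroutine sumweights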
The further layout of our paper is as follows. In section 2 we discuss some details and considerations for implementing OpenMP into the FORTRAN code of MCFM [2,3] (similar considerations will hold for C/C++ code). The numerical performance of the parallel code is explored in section 3 using several different processors. Finally, in section 4 we sum up our conclusions and review further possible developments for the OpenMP MCFM program. MCFM-7.0, which runs under the OpenMP protocol as described in this paper, can be downloaded from the mcfm.fnal.gov website.

¹ 'OpenMP (Open Multi-Processing) is an API that supports multi-platform shared memory multiprocessing programming in C, C++, and Fortran, on most processor architectures and operating systems, including Solaris, AIX, HP-UX, Linux, Mac OS X, and Windows platforms. It consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior. OpenMP is managed by the nonprofit technology consortium OpenMP Architecture Review Board (or OpenMP ARB), jointly defined by a group of major computer hardware and software vendors, including AMD, IBM, Intel, Cray, HP, Fujitsu, Nvidia, NEC, Red Hat, Texas Instruments, Oracle Corporation, and more.', from Wikipedia.

The parallel version of MCFM is constructed so that the results do not depend on the number of threads used to evaluate the cross sections. This greatly helps to validate that the implementation of the parallel code is correct. Almost all the work to be done is to make sure variables are correctly assigned. In a parallel program we have to decide whether a variable is global (i.e. potentially shared by threads) or local to the thread (i.e. each thread has its own version of the variable).

The most labor-intensive part is the treatment of data structures. The following rules will lead to a successful parallelization. For all the code running in parallel one has to implement the following steps:

– All variables in DATA statements in the parallel region have to be included in SAVE statements, ensuring they are declared for each thread. If this is not done, the variables are not necessarily initialized.

– All variables in SAVE statements in the parallel region must be made 'thread private' in the respective functions and subroutines.

– All common blocks whose variables are defined or changed in the parallel region have to be declared 'thread private' each time the common block is declared.

– All common blocks whose variables are defined or changed outside the parallel region, in addition to being changed in the parallel region, need to be declared 'thread private'. To ensure the values are copied to each thread at the start of the parallel region, a COPYIN directive including the common block has to be issued.

Note that, where necessary, variables and common blocks are made 'thread private' by adding the THREADPRIVATE directive to the function or subroutine [1].
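A minimal sketch of these rules, using hypothetical variable and common-block names rather than the actual MCFM ones:

!     a DATA-initialized local must appear in a SAVE statement and be
!     declared thread private, so that every thread has its own copy
      subroutine addweight(wt)
      implicit none
      double precision, intent(in) :: wt
      double precision :: accum
      save accum
!$OMP THREADPRIVATE(accum)
      data accum/0d0/
      accum = accum + wt          ! per-thread accumulator
      end subroutine addweight

!     a common block set outside the parallel region is declared
!     thread private and copied to each thread with COPYIN
      program driver
      implicit none
      double precision :: scalesq
      common/scaleblk/scalesq
!$OMP THREADPRIVATE(/scaleblk/)
      integer :: i
      scalesq = 125d0**2          ! assigned before the parallel region
!$OMP PARALLEL DO COPYIN(/scaleblk/)
      do i = 1, 1000
         call addweight(scalesq)  ! each thread reads its private copy
      enddo
!$OMP END PARALLEL DO
      end program driver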
The MCFM code was originally written in FORTRAN 77, but parts of the code now require a FORTRAN 90 compiler. In view of the special treatment required for data statements, indicated above, it is beneficial to eliminate data statements wherever they are not needed. FORTRAN 90 allows parameter arrays, so it is useful to replace the FORTRAN 77 legacy data arrays by parameter arrays wherever possible.

To ensure that the same events are generated, independent of the number of threads used, we have to ensure VEGAS [4,5] generates the same sequence of groups of pseudo-random numbers used to generate the momenta in an event. To do this we use the CRITICAL directive, forcing the pseudo-random number generator to run serially when assigning the groups of pseudo-random numbers to a thread. When looking at all threads combined, the same groups of random numbers will be generated, and consequently the same set of events. The order in which the groups of random numbers are accessed by the threads is not identical and varies from run to run (i.e. which thread reaches the critical region first), but in the end the same events are always generated. A named CRITICAL directive provides a way of distinguishing CRITICAL regions in different parts of the program. When a thread arrives at a CRITICAL directive, it waits until no other thread is executing a critical region with the same name.

The ATOMIC construct, which applies only to the specific assignment statement that follows it, can be an efficient alternative to a CRITICAL region. The statement following an ATOMIC directive is executed by all threads, but only one thread at a time can execute the statement.
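A sketch of this construction is shown below; the routine name is illustrative and the Fortran intrinsic random_number stands in for the actual VEGAS generator used in MCFM:

      subroutine getgroup(ndim, x, ncalls)
      implicit none
      integer, intent(in) :: ndim
      integer, intent(inout) :: ncalls
      double precision, intent(out) :: x(ndim)
!$OMP CRITICAL(randomgroup)
!     only one thread at a time draws its group of pseudo-random
!     numbers, so the same groups are produced for any thread count
      call random_number(x)
!$OMP END CRITICAL(randomgroup)
!     a single shared update can use the lighter-weight ATOMIC construct
!$OMP ATOMIC
      ncalls = ncalls + 1
      end subroutine getgroup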
This is still not sufficient to reach identical results for the cross section. The reason for this is numerical rounding differences, due to the fact that the resulting weights are added in different orders. Using Kahan summation [6] ameliorates the rounding error, leading not only to identical cross section results but also to more accurate results.
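A compensated (Kahan) summation of the weights can be implemented along the following lines (a generic sketch, not the actual MCFM accumulation routine):

      subroutine kahanadd(wt, total, comp)
      implicit none
      double precision, intent(in)    :: wt
      double precision, intent(inout) :: total, comp
      double precision :: y, t
      y = wt - comp             ! re-apply the low-order part lost so far
      t = total + y             ! low-order digits of y may be lost here
      comp = (t - total) - y    ! recover what was just lost
      total = t                 ! running sum, with compensation in comp
      end subroutine kahanadd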
We checked that all processes in MCFM produce identical results independent of the number of threads and in agreement with the non-parallel version of MCFM (version 6.8).

To execute all processes in MCFM the stack size should be set to 16,000 or higher using the environment variable OMP_STACKSIZE (though for most processes in MCFM a much smaller stack size suffices).

3.2 Results

Fig. 1  The evaluation time of PP → H(→ bb̄) + 2 jets using 4x1,000+10x10,000 Vegas events (in seconds) versus the number of threads, for the Intel Xeon Phi 5110P, AMD 6128 HE Opteron, Intel Xeon X5650 and Intel Core i7-4770. The top graph is at LO and the bottom graph at NLO.

To benchmark the performance of the parallel version of MCFM we use four different types of computer hardware. This will test the code on a variety of hardware configurations with differing clock frequency, number of threads, cache size etc.
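The number of threads used for a run, varied in the tables below, is controlled through standard OpenMP mechanisms; the snippet below shows generic OpenMP usage (the omp_lib routines), not an MCFM-specific interface:

      program threadinfo
      use omp_lib                  ! standard OpenMP runtime library
      implicit none
!     the thread count is normally taken from the environment
!     variable OMP_NUM_THREADS, but can also be set in the code
      call omp_set_num_threads(4)
!$OMP PARALLEL
!$OMP MASTER
      print *, 'running with', omp_get_num_threads(), 'threads'
!$OMP END MASTER
!$OMP END PARALLEL
      end program threadinfo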
Table 1  Intel Core i7-4770

Thr.    Time (sec)                Acc.     Eff.
        min      avg      max     avg      (%)
  1     1.67     1.69     1.70     1.00    100.00
  2     0.83     0.83     0.83     2.02    101.21
  3     0.57     0.57     0.58     2.94     97.88
  4     0.44     0.44     0.44     3.80     95.12
  5     0.40     0.40     0.40     4.18     83.50
  6     0.37     0.37     0.37     4.55     75.78
  7     0.34     0.34     0.34     4.92     70.26
  8     0.32     0.32     0.32     5.25     65.65
Table 2  Dual Intel Xeon X5650

Thr.    Time (sec)                Acc.     Eff.
        min      avg      max     avg      (%)
  1     2.88     2.89     2.89     1.00    100.00
  2     1.49     1.49     1.50     1.94     96.76
  3     0.99     1.00     1.00     2.90     96.60
  4     0.75     0.75     0.75     3.85     96.13
  6     0.50     0.50     0.51     5.72     95.30
  8     0.38     0.38     0.38     7.57     94.59
 10     0.31     0.31     0.31     9.37     93.66
 12     0.26     0.26     0.26    11.16     92.96
Table 3  Quadruple AMD 6128 HE Opteron

Thr.    Time (sec)                Acc.     Eff.
        min      avg      max     avg      (%)
  1     3.79     3.80     3.80     1.00    100.00
  2     2.00     2.02     2.05     1.88     94.06
  3     1.36     1.37     1.38     2.77     92.42
  4     1.03     1.04     1.05     3.66     91.52
  8     0.54     0.54     0.54     7.00     87.44
 12     0.38     0.38     0.38     9.98     83.13
 16     0.33     0.33     0.33    11.44     71.52
 32     0.83     0.84     0.86     4.50     14.06
Table 4  Intel Xeon Phi 5110P

Thr.    Time (sec)                Acc.     Eff.
        min      avg      max     avg      (%)
  1    23.09    23.12    23.15     1.00    100.00
  2    12.10    12.12    12.14     1.91     95.39
  3     8.14     8.22     8.53     2.81     93.78
  4     6.16     6.21     6.38     3.72     93.11
 16     1.66     1.67     1.68    13.86     86.61
 32     1.39     1.39     1.40    16.61     51.89
 64     1.41     1.41     1.41    16.39     25.61
128     1.44     1.44     1.45    16.02     12.52
240     1.52     1.52     1.53    15.19      6.33

Tables 1–4  The LO evaluation of PP → H(→ bb̄) + 2 jets using 4x1,000+10x10,000 Vegas events for the 4 different hardware configurations.
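The tabulated acceleration and efficiency are consistent with the following definitions (inferred from the table entries, not stated explicitly in the text), where t(N) denotes the average run time using N threads:

    A(N) = t(1) / t(N) ,    E(N) = A(N) / N × 100% .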
Table 5  Intel Core i7-4770

Thr.    Time (sec)                   Acc.     Eff.
        min       avg       max      avg      (%)
  1    238.83    238.95    239.07     1.00    100.00
  2    120.16    120.45    120.73     1.98     99.19
  3     81.99     82.03     82.07     2.91     97.10
  4     63.01     63.02     63.02     3.79     94.80
  5     58.67     58.69     58.71     4.07     81.43
  6     54.84     54.85     54.86     4.36     72.61
  7     51.52     51.53     51.54     4.64     66.24
  8     48.62     48.63     48.64     4.91     61.42
Table 6  Dual Intel Xeon X5650

Thr.    Time (sec)                   Acc.     Eff.
        min       avg       max      avg      (%)
  1    496.43    496.43    496.44     1.00    100.00
  2    249.73    249.83    249.94     1.99     99.35
  3    166.20    166.41    166.62     2.98     99.44
  4    124.58    124.58    124.59     3.98     99.62
  6     83.01     83.04     83.06     5.98     99.64
  8     62.24     62.26     62.29     7.97     99.66
 10     49.79     49.80     49.80     9.97     99.69
 12     41.46     41.46     41.46    11.97     99.78
Table 7  Quadruple AMD 6128 HE Opteron

Thr.    Time (sec)                   Acc.     Eff.
        min       avg       max      avg      (%)
  1    806.86    806.98    807.10     1.00    100.00
  2    404.00    404.08    404.17     2.00     99.85
  3    269.26    269.37    269.48     3.00     99.86
  4    201.96    201.99    202.02     4.00     99.88
  8    101.03    101.05    101.07     7.99     99.83
 12     67.41     67.41     67.41    11.97     99.76
 16     50.56     50.56     50.56    15.96     99.75
 32     25.34     25.36     25.37    31.82     99.45
Table 8  Intel Xeon Phi 5110P

Thr.    Time (sec)                      Acc.     Eff.
        min        avg        max       avg      (%)
  1    3784.45    3784.45    3784.45     1.00    100.00
  2    1906.73    1906.73    1906.73     1.98     99.24
  3    1282.26    1282.26    1282.26     2.95     98.38
  4     958.59     958.59     958.59     3.95     98.70
 16     242.66     242.66     242.66    15.60     97.47
 32     121.25     121.25     121.25    31.21     97.54
 64      62.29      62.29      62.29    60.76     94.93
128      41.22      41.22      41.22    91.81     71.73
240      31.82      31.82      31.82   118.94     49.56
Tables 5–8  The NLO evaluation of PP → H(→ bb̄) + 2 jets using 4x1,000+10x10,000 Vegas events for the 4 different hardware configurations.

The first configuration is a standard desktop with an Intel Core i7-4770. This processor has 4 cores, each with 2 hyperthreads. The second configuration contains two Intel Xeon X5650 processors, each with 6 cores for a total of 12 cores. The third configuration contains four AMD 6128 HE Opteron processors, each with 8 cores for a total of 32 cores. The final configuration is an Intel Xeon Phi 5110P coprocessor card connected to a PCI slot. This coprocessor has 60 cores, each with 4 hardware threads for a total of 240 threads.

While we have validated all processes in this version of MCFM, we pick one process in particular to study the speedups gained by using multiple threads. The process we choose is PP → H(→ bb̄) + 2 jets, which describes the production of a Higgs boson in association with two jets through an effective gluon-gluon-Higgs vertex. The Higgs boson subsequently undergoes a two-body decay to two b-quarks. Thus the process can have as many as 4 (5) jets at LO (NLO), two of which can come from the Higgs decay. In lowest order a process with n particles in the final state requires 3n −

Fig. 2  The di-jet differential cross section for PP → H(→ bb̄) + 2 jets at NLO using 1 hour of running time on the Intel Core i7-4770 using a single thread and on the quadruple AMD 6128 HE Opteron using all 32 threads. The peak at m_jj = 125 GeV, when the two jets come from the decay of the Higgs boson, is visible.
Fig. 3  The di-jet differential cross section for PP → H(→ bb̄) + 2 jets using 4x1,500,000+10x15,000,000 events. At LO we use the dual Intel Xeon X5650 with 12 threads (about 12 minutes of runtime) and at NLO we use the quadruple AMD 6128 HE Opteron (about 22 hours of runtime) with 32 threads. The peak when the two jets come from the decay of the Higgs boson is clearly visible.

To see the impact of the faster running, we show in Figure 2 the results for the di-jet invariant mass distribution. We compare the fastest single-thread configuration (the Core i7) and the fastest multi-thread configuration (the quadruple AMD) using approximately 1 hour of runtime for each. We see that the single-thread run is insufficient for any useful exploratory runs. In contrast, one hour of running on the multi-threaded system gives a good result. Finally, in Figure 3 we make the di-jet distribution using about 24 hours of runtime, which is more than sufficient to produce a stable final result.
To conclude, we see that the threaded version of MCFM scales well on different architectures. The computationally bound NLO processes scale well with the number of threads and the evaluation speeds are significantly improved. In particular, the performance of the Xeon Phi coprocessor is impressive. A new coprocessor is to be released in the summer of 2015, promising even faster evaluation times. Moreover, this new version will also be available in a socketed version, removing the PCI bus and hopefully alleviating the bandwidth-bound issues at LO. This will make the Xeon Phi coprocessor a very attractive option for Monte Carlo generators in the near future.

As we have shown, we have successfully implemented a parallel version of MCFM. It immediately reduces the execution time, by an amount that depends on the hardware configuration of the system (i.e. number of cores, cache configuration, memory bandwidth, clock frequency, etc.), without any intervention by the user of MCFM. For the computing-intensive next-to-leading order processes we obtain very good accelerations on all processors. In particular, utilizing the Xeon Phi coprocessor with 240 hardware threads yields an acceleration of order 100 over running on a single thread.

The new Xeon Phi processor, to be released in summer 2015, will overcome most of the bandwidth limitation to which the compute-light leading order processes are subject. Moreover, the new processor will be substantially more powerful, giving us accelerations well over a factor of 100. Now that we have improved the speed of MCFM, we can implement more complicated processes in the event generator and still get acceptable evaluation times. Possibilities could include adding more jets to current processes in MCFM or proceeding to next-to-next-to-leading order processes.
Acknowledgments
The numerical work on the Intel Xeon Phi processor was performed using the Fermilab MIC development cluster funded by the DOE Office of Science and operated by the Fermilab scientific computing HPC department. We acknowledge useful discussions with Don Holmgren and James Simone. This research is supported by the US DOE under contract DE-AC02-07CH11359.
References
1. B. Chapman, G. Jost, R. van der Pas, Using OpenMP (MIT Press, 2007)
2. J.M. Campbell, R.K. Ellis, Nucl. Phys. Proc. Suppl. 205-206, 10 (2010). DOI 10.1016/j.nuclphysbps.2010.08.011
3. J.M. Campbell, R.K. Ellis, C. Williams, JHEP 07, 018 (2011). DOI 10.1007/JHEP07(2011)018
4. G.P. Lepage, J. Comput. Phys. 27, 192 (1978). DOI 10.1016/0021-9991(78)90004-9
5. W.H. Press, S.A. Teukolsky, W.T. Vetterling, B.P. Flannery, Numerical Recipes in FORTRAN: The Art of Scientific Computing (Cambridge University Press, 1992)
6. W. Kahan, Communications of the ACM 8(1), 40 (1965)