Calculation of Longitudinal Collective Instabilities with mbtrack-cuda
Haisheng Xu, Uldis Locans, Andreas Adelmann, Lukas Stingelin
Paul Scherrer Institut, CH-5232 Villigen PSI, Switzerland

Abstract
Macro-particle tracking is a prominent method to study collective beam instabilities in accelerators. However, the heavy computational load often limits the capability of the tracking codes. One widely used macro-particle tracking code to simulate collective instabilities in storage rings is mbtrack. The Message Passing Interface (MPI) is already implemented in the original mbtrack to accelerate the simulations. However, many CPU threads are required by mbtrack for the analysis of coupled-bunch instabilities, so computer clusters or desktops with many CPU cores are needed. Since these are not always available, we employ as an alternative a Graphics Processing Unit (GPU) with the CUDA programming interface to run such simulations on a stand-alone workstation. All the heavy computations have been moved to the GPU. The benchmarks confirm that mbtrack-cuda can be used to analyze coupled-bunch instabilities of up to at least 484 bunches. Compared to mbtrack on an 8-core CPU, a 36-core CPU and a cluster, mbtrack-cuda is faster for simulations of up to 3 bunches. For 363 bunches, mbtrack-cuda needs about six times the execution time of the cluster and twice that of the 36-core CPU. The multi-bunch instability analysis shows that the length of the ion-cleaning gap has no big influence, at least for a 363/484 filling.
1. Introduction
Synchrotron light sources have been a powerful tool for condensed matter physics, materials science, biology and medicine since about 1968 [1]. The demand for higher brightness from the users of synchrotron light sources has pushed for improved performance of the accelerators, for instance to further reduce the emittance of the electron beams. In recent years, there have been remarkable improvements in the design of ultra-low emittance storage rings. Thanks to the application of the concepts of the Multi-Bend Achromat (MBA) [2], Longitudinal Gradient Bends (LGBs) [3], and Anti-Bends (ABs) [4], an emittance of the order of 100 pm (or even lower) has been achieved in many new storage ring designs and measured at MAX IV [5] during early commissioning. All the above-mentioned concepts are employed in the design of the storage ring for the Swiss Light Source Upgrade (SLS-2) [6].

Since the lattices of the ultra-low emittance rings need the strong focusing provided by strong quadrupole magnets, high-field sextupole magnets are needed for the correction of chromatic and geometric aberrations. Vacuum chambers with small cross sections are therefore considered in the ultra-low emittance storage rings, causing high impedance. Furthermore, ultra-low emittance storage rings are more sensitive to impedance-induced instabilities.

The mbtrack code [7] is a multi-bunch macro-particle tracking code which can be used to study both single-bunch and coupled-bunch instabilities in electron storage rings. It has been used at different synchrotron light sources, such as SOLEIL [8] and MAX IV [7, 9]. The Message Passing Interface (MPI) is implemented in mbtrack to accelerate the multi-bunch tracking. For an n-bunch simulation using mbtrack, where n is an integer, (n+1) MPI processes are needed. Therefore, a large-scale computing cluster is usually preferred to perform the multi-bunch simulations of the storage rings of synchrotron light sources. For instance, 391 processes are required for the analysis of the nominal operation mode of the Swiss Light Source (SLS) [10, 11], which includes 390 bunches in the bunch train.

In order to reduce the demand for large-scale clusters for multi-bunch simulations, we developed a GPU version of mbtrack (mbtrack-cuda) [12, 13], in which the computations are offloaded to a Graphics Processing Unit (GPU). Taking advantage of a state-of-the-art GPU, which has a massively parallel architecture with thousands of cores, one can parallelize the tracking of all the macro-particles in an efficient manner. An NVIDIA graphics card was chosen as the hardware for developing mbtrack-cuda, and CUDA [14], a parallel computing platform and programming model invented by NVIDIA, is used for programming. The resulting mbtrack-cuda manages to carry out the multi-bunch simulations on a stand-alone workstation equipped with an NVIDIA Tesla K40c GPU.

Recently, the advantages of GPU computing have attracted interest in the accelerator physics community for the analysis of collective instabilities.

[email protected] (the author is presently at the Institute of High Energy Physics, CAS, Beijing, China), [email protected], [email protected], [email protected]

Preprint submitted to Elsevier, September 5, 2018
For example, a GPU implementation of the elegant code [15, 16] is also under way, and the Inovesa code [17], a Vlasov-Fokker-Planck solver, also benefits from modern GPU computing. The development of mbtrack-cuda provides a powerful alternative, especially for the users of mbtrack.

In this paper, we present the development of the mbtrack-cuda code and the studies of longitudinal collective instabilities for SLS-2 carried out with it. The rest of this paper is organized as follows. Section 2 presents the mbtrack-cuda code in detail. The benchmarks of the code are presented in Section 3. The simulation study of the longitudinal coupled-bunch instability for SLS-2 with mbtrack-cuda is presented in Section 4. The conclusions are given in Section 5.
2. GPU acceleration of mbtrack

2.1. Introduction to the mbtrack-cuda development
The mbtrack-cuda code [12] is an extension of the original version; the main architecture of the code is kept identical. Like mbtrack, mbtrack-cuda allows the users to choose the effects included in the simulations.

The main difference between the two versions is the device on which the operations on the macro-particles are carried out. In the original mbtrack, all the operations are carried out by the CPU. In mbtrack-cuda, however, the coordinates of the macro-particles are all stored on the GPU, and the operations on the macro-particles are also performed on the GPU; the CPU is used only to control the flow of the simulation and to write the output data. Due to the different architectures of a CPU cluster and a GPU, the parallelization models used in the two versions differ, as shown in Figure 1: mbtrack parallelizes the simulations over bunches, with each process handling one bunch, while the macro-particles in one bunch are tracked in series. mbtrack-cuda, on the other hand, parallelizes the simulations over the macro-particles in one bunch, and the different bunches are tracked in series.
Figure 1: Parallelization in the original mbtrack and in mbtrack-cuda: (a) mbtrack: each bunch is assigned to one CPU core; (b) mbtrack-cuda: all the bunches are on the GPU, and the simulations are parallelized over the macro-particles in one bunch.
To create mbtrack-cuda, CUDA kernels have been written to perform all the transformations that are implemented in the original mbtrack. In addition, the statistics calculations are also performed on the GPU, to avoid transferring the macro-particles' coordinates to the CPU side. The flow diagram of mbtrack-cuda is shown in Figure 2; it shows the tasks executed on the host and the kernels launched on the GPU. The transformations performed by mbtrack and the implementation of the CUDA kernels for mbtrack-cuda are described in detail in the rest of this section.
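The control flow sketched in Figure 2 can be illustrated by a hypothetical host-side outline. All names here are ours, for illustration only; in the real code each step is a CUDA kernel launch, not a host function call:

```cpp
#include <string>
#include <vector>

// Sketch of the per-turn flow that the host controls: the bunches are
// processed in series, and the long-range resistive-wall step runs only
// after every bunch has been updated and its statistics computed.
struct FlowLog {
    std::vector<std::string> launched;
    void kernel(const std::string& name) { launched.push_back(name); }
};

void track_one_turn(FlowLog& log, int n_bunches) {
    for (int b = 0; b < n_bunches; ++b) {     // bunches in series
        log.kernel("self-field transform");
        log.kernel("optics transform");
        log.kernel("calculate statistics");
    }
    log.kernel("long-range RW interaction");  // after all bunches
}
```

The host thus only sequences the work; all per-particle computation stays on the GPU.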
Figure 2: Flow diagram of the mbtrack-cuda code.
Every particle in mbtrack is represented by a 6-dimensional vector

(x, x', y, y', \tau, \delta)

where x and y are the horizontal and vertical positions, while x' and y' are the transverse momenta at longitudinal position s. The parameter \delta = \Delta E / E describes the energy deviation relative to the reference particle, and the longitudinal coordinate \tau is the arrival time with respect to the reference particle [7].

At each turn this transformation computes the energy deviation for each particle:

\delta_{i+1} = \delta_i + \epsilon_i - \frac{U_{rad}}{E}   (1)

where \epsilon_i is the relative energy gain in the RF cavities, U_{rad} is the average energy loss per turn due to the synchrotron radiation (SR), and E is the reference energy. After the energy deviation is computed, the longitudinal coordinate is updated as follows:

\tau_{i+1} = \tau_i + \alpha_c T_0 \delta_{i+1}   (2)

where \alpha_c is the momentum compaction factor and T_0 the revolution period.

In the transverse planes the particles in the beam perform betatron oscillations. These oscillations are described using the Twiss parameters \alpha_{x,y}, \beta_{x,y}, \gamma_{x,y}, which describe the beam shape, size and orientation, and a phase advance per turn \Psi_{x,y} [7]. For particles with non-zero energy deviation the phase advance can be calculated by:

\Psi_{x,y} = \Psi_{x,y,0} (1 + \xi_{x,y} \delta)   (3)

where \Psi_{x,y,0} is the nominal phase advance and \xi_{x,y} is the chromaticity.
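As an illustration, the one-turn longitudinal map of Eqs. (1) and (2) can be sketched in plain C++ (a serial host-side sketch with our own names, not the actual mbtrack-cuda kernel; on the GPU one thread handles one particle):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Serial sketch of the one-turn longitudinal map, Eqs. (1)-(2).
// eps: relative energy gain in the RF cavities, Urad: average SR loss
// per turn, E: reference energy, alpha_c: momentum compaction factor,
// T0: revolution period.
struct Particle { double tau; double delta; };

void longitudinal_turn(std::vector<Particle>& bunch, double eps,
                       double Urad, double E, double alpha_c, double T0) {
    for (auto& p : bunch) {
        p.delta += eps - Urad / E;          // Eq. (1)
        p.tau   += alpha_c * T0 * p.delta;  // Eq. (2), with updated delta
    }
}
```

Since the loop body touches only one particle, it maps directly onto a one-thread-per-particle CUDA kernel.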
Given the presence of the horizontal dispersion D, the transformations in the transverse planes are expressed by the transfer matrices:

\begin{pmatrix} x \\ x' \\ \delta \end{pmatrix}_{i+1} = \begin{pmatrix} \cos\Psi_x + \alpha_x \sin\Psi_x & \beta_x \sin\Psi_x & D \\ -\gamma_x \sin\Psi_x & \cos\Psi_x - \alpha_x \sin\Psi_x & D' \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x \\ x' \\ \delta \end{pmatrix}_i

\begin{pmatrix} y \\ y' \end{pmatrix}_{i+1} = \begin{pmatrix} \cos\Psi_y + \alpha_y \sin\Psi_y & \beta_y \sin\Psi_y \\ -\gamma_y \sin\Psi_y & \cos\Psi_y - \alpha_y \sin\Psi_y \end{pmatrix} \begin{pmatrix} y \\ y' \end{pmatrix}_i   (4)

Changes in the beam energy are also caused by quantum excitation and radiation damping, as shown in the equations below:

\tilde{\delta}_{i+1} = \delta_{i+1} (1 - D_E) + \sigma_E \sqrt{D_E}\, \delta_{rand}
\tilde{x}_{i+1} = x_{i+1} + \sigma_x \sqrt{D_x}\, x_{rand}
\tilde{x}'_{i+1} = x'_{i+1} \frac{\delta_{i+1}}{\delta_{i+1} + \epsilon_{i+1}} + \sigma_{x'} \sqrt{D_x}\, x'_{rand}   (5)

where the coefficients D and \sigma correspond to the synchrotron radiation damping times and the bunch energy spread, while \delta_{rand}, x_{rand} and x'_{rand} are random numbers from a normal distribution with unit standard deviation [7].

The application launches one CUDA kernel that performs these transformations for every bunch. Since the calculations for each particle are independent, one thread per particle is created inside the kernel. The random numbers needed for the calculations of the radiation damping and quantum excitation are generated using the NVIDIA cuRAND library. Shared memory is used to hold the data for additional harmonic cavities: since these data are shared by all the threads, keeping them in shared memory reduces the load time from global memory.

2.3. Treatment of the bunch-wake interactions

The simulations in mbtrack can include resistive-wall effects, an arbitrary number of resonators, and purely resistive and inductive components, all contributing to the total wake [7]. The macro-particles in each bunch are grouped into cells (bins) depending on their longitudinal position. The wake functions are calculated corresponding to the ensemble of resonators.
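The one-plane betatron rotation of Eq. (4), without dispersion, can be sketched as follows (an illustrative serial sketch with our own names; mbtrack's actual implementation differs):

```cpp
#include <cassert>
#include <cmath>

// One-plane betatron rotation per Eqs. (3)-(4), without dispersion.
// psi0: nominal phase advance per turn, xi: chromaticity, alpha and
// beta: Twiss parameters, with gamma = (1 + alpha^2) / beta.
void betatron_turn(double& u, double& up, double delta,
                   double psi0, double xi, double alpha, double beta) {
    const double gamma = (1.0 + alpha * alpha) / beta;
    const double psi = psi0 * (1.0 + xi * delta);  // chromatic phase advance
    const double c = std::cos(psi), s = std::sin(psi);
    const double u_new  = (c + alpha * s) * u + beta * s * up;
    const double up_new = -gamma * s * u + (c - alpha * s) * up;
    u = u_new;
    up = up_new;
}
```

The matrix has unit determinant, so the rotation preserves the Courant-Snyder invariant, as a one-turn map must.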
Each turn, the excitation of this wake on each bin is calculated and the resulting kick is given to every particle in the bin [18]. The wake function is calculated for each resonator and summed to form the total wake. Additionally, resistive-wall effects are added to the same wake. In the longitudinal and transverse planes the transformation is expressed as a change in particle energy and momentum (horizontal and vertical planes are treated identically):

\Delta\delta_j = \frac{q_j V(\tau_j)}{E}, \qquad \Delta x'_j = \frac{q_j}{E} \sum_{k=0}^{j-1} q_k D_p(\tau_k) W_\perp(\tau_j - \tau_k)   (6)

where V(\tau_j) is the wake voltage induced by this bunch at bin \tau_j, W_\perp is the transverse wake function, q_{j,k} is the total charge in bins j and k, and D_p(\tau_k) is the dipole moment of the bunch at position \tau_k. Detailed information about how V(\tau_j) and W_\perp are calculated can be found in [7].

The transformation is carried out step by step. Each step begins by assigning each macro-particle to a bin (mesh cell). Within this step, the calculations for different macro-particles are independent, so the process can be parallelized over the number of macro-particles in the bunch. A kernel is launched to put all the macro-particles into their own bins, with each thread handling one macro-particle. The bin number is saved in a temporary memory reserved at the beginning of the calculations and reused between bunches.

After each macro-particle has been assigned to a bin, another kernel is launched to count the number of macro-particles in each bin and to compute the dipole moments D_p for the horizontal and vertical planes. This kernel is also launched with one thread per macro-particle. However, since there are a number of macro-particles per bin, atomic operations are needed to sum up the number of macro-particles in each bin.

Once the occupancy of each bin is known, a kernel is launched to find the bins holding the minimum and maximum number of macro-particles.
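The binning step described above can be sketched serially as follows (our own names; on the GPU one thread handles one particle and the per-bin counters are updated with atomicAdd):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assign each macro-particle to a cell of a uniform mesh over
// [tmin, tmax] according to its arrival time tau, and count the
// occupancy of each cell.
std::vector<int> bin_particles(const std::vector<double>& tau,
                               double tmin, double tmax, std::size_t nbins,
                               std::vector<std::size_t>& bin_of) {
    std::vector<int> counts(nbins, 0);
    bin_of.resize(tau.size());
    const double w = (tmax - tmin) / static_cast<double>(nbins);
    for (std::size_t i = 0; i < tau.size(); ++i) {
        std::size_t b = static_cast<std::size_t>((tau[i] - tmin) / w);
        if (b >= nbins) b = nbins - 1;  // clamp the particle at tau = tmax
        bin_of[i] = b;
        ++counts[b];                    // atomicAdd in the CUDA kernel
    }
    return counts;
}
```

The per-bin counts and dipole moments then feed the wake kick of Eq. (6).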
Since this operation requires communication between threads, it is not well suited to parallelization and is therefore performed serially on the GPU. One block with multiple threads is launched: the threads parallelize the loading of the data from the slow global GPU memory into the faster shared memory, and then a single thread searches for the first and last bins that contain macro-particles. Since there is no need to loop over all the cells, this serialization of the kernel does not cause a bottleneck for the simulation.

Once the wake potentials are constructed, a kernel is called to carry out the bunch-wake interactions. This kernel parallelizes the simulation over the macro-particles in the bunch and launches one thread for each macro-particle. Shared memory is used to store the data that is frequently reused by the kernel, which minimizes the loads from global memory.

After the wake potentials have been applied to the macro-particles, the long-range resistive-wall effects are calculated and applied to the transverse planes as described in [7]. Since the long-range resistive-wall effects require statistics information about previous bunches, these calculations are launched only after the wake potential effects have been calculated for all the bunches and the statistics have been updated.

The statistics of the bunches are calculated turn by turn in order to save the temporary information of the bunches during the simulations. Calculating the statistics, such as the mean and RMS values of the six-dimensional coordinates of each bunch, is very time consuming, mainly because of the relatively large number of macro-particles and the communication time among different threads. Furthermore, since the statistics are calculated on the GPU, the results need to be transferred to the CPU in order to log them to file at every turn, which is also costly.
To calculate the average values, the Thrust library's reduce function is used to compute the sums of position and momentum in each dimension. After the reduction is performed, the data are sent to the CPU. To compute the standard deviation, Thrust's transform_reduce function is used. Since each macro-particle in mbtrack is represented as a structure of six variables, custom operators are defined to perform the reductions and transformations correctly on the arrays of particles. The statistics are calculated for each bunch, as well as averaged over all the bunches in the simulation. Only the calculations for the individual bunches are performed on the GPU; the rest are carried out on the CPU.
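The same reduce / transform-reduce pattern can be illustrated with the std:: equivalents of the Thrust calls for a single coordinate (a host-side sketch; in mbtrack-cuda the reductions run on the GPU over structures of six variables):

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <numeric>
#include <vector>

// Mean of one coordinate: a plain sum reduction, then division.
double coord_mean(const std::vector<double>& v) {
    return std::accumulate(v.begin(), v.end(), 0.0) / v.size();
}

// RMS spread of one coordinate: each value is transformed to its
// squared deviation from the mean, the results are sum-reduced, and
// the square root of the average is taken.
double coord_rms(const std::vector<double>& v) {
    const double m = coord_mean(v);
    const double ss = std::transform_reduce(
        v.begin(), v.end(), 0.0, std::plus<>(),
        [m](double x) { const double d = x - m; return d * d; });
    return std::sqrt(ss / v.size());
}
```

With Thrust, the lambda becomes the unary functor passed to transform_reduce and the data stay in device memory until the final scalar is copied back.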
3. Benchmarks of mbtrack-cuda
Before using mbtrack-cuda in our studies, it was essential to carry out systematic benchmarks of the code, starting with the tracking module. In the first test, the same group of macro-particles is tracked for 50000 turns (about seven longitudinal damping times), with the impedance-induced collective effects turned off in both codes. The lattice 'dc12c', which was an option for the SLS-2 storage ring, is used in the simulations. The main parameters of the 'dc12c' lattice are listed in Table 1. The bunch energy spread is monitored, as shown in Figure 3.
Table 1: Main parameters of the 'dc12c' lattice
Parameters                                  Values
Circumference C_ring
Energy E
Energy loss per turn U
Momentum compaction α_c
Betatron tunes ν_x \ ν_y
Chromaticities ξ_x \ ξ_y                    -66.591 \ -40.445
Transverse emittances ε_x \ ε_y
RF voltage V_RF
Synchrotron tune ν_s

To eliminate the influence of the different random number generators on the two platforms, we first turned off the synchrotron radiation effects in the tracking. The bunch energy spread over the last 5000 turns is shown in Figure 3 (a) as an example.

Figure 3: (a) Energy spread vs. turns without synchrotron radiation effects; (b) energy spread vs. turns with synchrotron radiation effects. When the synchrotron radiation effects are turned on, four different seeds are used to initialize the random number generator in mbtrack.

We can find from this figure that the results produced by both codes agree perfectly with each other, which demonstrates the reliability of the tracking module in mbtrack-cuda.

However, when the synchrotron radiation effects are turned on, a small discrepancy between the two codes can be observed, as shown in Figure 3 (b). We believe that this small discrepancy comes from the random numbers used for the synchrotron radiation. We then varied the seeds of the random number generator in mbtrack; the results corresponding to four different seeds are also shown in Figure 3 (b). As can be seen, the discrepancy between the two codes is comparable to the differences obtained when changing the seeds in mbtrack, meaning that the discrepancy between the two codes is dominated by the random number generators. These results demonstrate the good agreement between the two codes.

After testing the tracking module of mbtrack-cuda, we benchmark the beam-impedance interactions. The longitudinal short-range resistive-wall (RW) wake field is used in this test. Here, we assume a copper vacuum chamber with 10 mm inner radius along the whole ring. Neither the geometric wake nor the RW wake of any other component is included.
The macro-particles are tracked for 50000 turns in this test as well. The 'equilibrium' bunch length and energy spread are calculated by averaging the tracking data over the last 5000 turns. By varying the single-bunch current, the plots in Figure 4 are obtained.

Figures 4 (a) and (b) show how the 'equilibrium' bunch length and energy spread, respectively, vary with increasing single-bunch current. The 'equilibrium' bunch length, shown in Figure 4 (a), first decreases as the single-bunch current rises from zero and then increases with rising bunch current. The bunch-shortening effect at low current is due to the negative momentum compaction factor of the lattice together with an inductively dominated wake. The 'equilibrium' energy spread stays almost constant below about 15 mA and grows above this current, which is the microwave instability threshold. Furthermore, both Figure 4 (a) and (b) clearly indicate the desired good agreement between the two codes.

After the validity tests of the mbtrack-cuda code, we carried out performance tests. The preliminary performance tests were performed on a system equipped with 2x Intel E5-2609 v2 CPUs (2x4 CPU cores, 2x4 threads maximum) and an NVIDIA Tesla K40c graphics card (2880 CUDA cores).

Figure 4: Comparison of mbtrack and mbtrack-cuda: (a) 'equilibrium' bunch length vs. single-bunch current; (b) 'equilibrium' energy spread vs. single-bunch current.

The two example lattices integrated in the mbtrack source code, the SOLEIL lattice and the MAX IV 3 GeV ring lattice, are used in the preliminary performance tests. In each test, a bunch consisting of 100,000 macro-particles was tracked for 10,000 turns.

In the test using the SOLEIL lattice, the basic optics transformations and the long-range RW effects are implemented. The tracking was performed in the longitudinal and horizontal planes.
Since the random number generator is called every turn if synchrotron radiation is included in the simulation, we manually turned the synchrotron radiation effects on and off to find the influence of the random number generation on the simulation time in both codes. The results in Table 2 show the full execution time of the simulations, including the input and output operations.

Table 2: Comparison of the computing time using the SOLEIL lattice.

Bunches | No SR effects: mbtrack, mbtrack-cuda | With SR effects: mbtrack, mbtrack-cuda
In the test using the MAX IV 3 GeV ring lattice, one passive third-harmonic cavity and one broad-band resonator are included. The transformations are performed only in the longitudinal plane. The results are shown in Table 3.

Table 3: Comparison of the computing time using the MAX IV 3 GeV ring lattice.

Bunches | No SR effects: mbtrack, mbtrack-cuda | With SR effects: mbtrack, mbtrack-cuda

One can see from these tests that the implementation of synchrotron radiation results in a remarkable increase of the computation time of mbtrack, which is due to the call of the random number generator every turn. However, the SR effects have little influence on the computation time of mbtrack-cuda, because a well-optimized random number generator is used on the GPU platform: the generation of random numbers benefits from the highly parallel architecture of the GPU.

Furthermore, Table 2 and Table 3 show that on this computer, no results can be generated by mbtrack when the number of bunches significantly exceeds the number of CPU cores (8). Meanwhile, mbtrack-cuda still manages to run the multi-bunch simulations, even full-ring multi-bunch simulations, in reasonable time. This demonstrates the significance of our development.

We also carried out computation-time tests using the above-mentioned SLS-2 lattice 'dc12c'. In these simulations, each bunch consisted of 100,000 macro-particles and was tracked for 20,000 turns. Both synchrotron radiation damping and quantum excitation effects were enabled. The tests were carried out on three different systems. The first system was the same stand-alone workstation mentioned above, with 2x Intel E5-2609 v2 processors (2x4 CPU cores); the above-mentioned NVIDIA graphics card was moved to the second system, a multi-core high-performance workstation equipped with 2x Intel E5-2697 v4 processors (2x18 CPU cores, hyper-threading enabled).
The third system was a cluster equipped with 32 Intel Xeon Gold 6140 processors (18 CPU cores each, hyper-threading disabled), on which we managed to fully parallelize the simulations. The computing time of the different tests is shown in Table 4.
Table 4: Benchmarks of mbtrack and mbtrack-cuda on the 8-core CPU, the 36-core CPU, the cluster (576 CPU cores), and the GPU (2880 CUDA cores).

Number of Bunches | 8-core CPU | 36-core CPU | Cluster | GPU
1                 | 810 s      | 549 s       | 232 s   | 93 s
2                 | 814 s      | 558 s       | 236 s   | 157 s
3                 | 832 s      | 562 s       | 240 s   | 221 s
10                | 3596 s     | 577 s       | 266 s   | 546 s
20                | -          | 757 s       | 312 s   | 1006 s
121               | -          | 2883 s      | 922 s   | 5860 s
363               | -          | 10931 s     | 3165 s  | 18754 s
Table 4 shows significant differences in the computation time when running mbtrack on the above-mentioned CPUs, mainly because of the different per-core performance and the different number of CPU cores. It is interesting to point out that the full-ring multi-bunch simulation can be carried out on a multi-core (36 cores with hyper-threading enabled) stand-alone workstation. However, the computation time is remarkably longer than on the cluster, mainly because of the heavy overloading of the CPU cores and the lower performance of each core.

The present version of mbtrack-cuda is faster when the number of bunches is small, e.g., for single-bunch simulations. However, as the number of bunches increases, mbtrack-cuda becomes slower: for instance, a 363-bunch simulation takes about six times longer with mbtrack-cuda on the above-mentioned workstation than with mbtrack on the cluster. This significant degradation of performance with increasing bunch number arises mainly because the present mbtrack-cuda code has to process the different bunches in series. Nevertheless, it shows the potential to accelerate the simulations further in the future.
4. Simulations of the longitudinal coupled-bunch instability for SLS-2 by mbtrack-cuda
In this section, we apply the developed mbtrack-cuda code to the study of the longitudinal coupled-bunch instability in SLS-2 (using the 'dc12c' lattice). The 500 MHz ELETTRA-type RF cavities used in SLS [19] are considered for reuse in SLS-2. We therefore use the Higher Order Mode (HOM) parameters of the SLS cavity in the following simulations.

ELETTRA-type cavities use the temperature of the cooling water and the plunger tuner as the two main parameters to tune the resonant frequencies of all the modes. To avoid the longitudinal coupled-bunch instability, the temperature of the cooling water should be adjusted to values within the 'stable windows' [20, 21]. Using the L5 mode (resonant frequency 1606.862 MHz at 45 °C [22]), analytical estimations of the growth rate 1/\tau_\mu of the coupled-bunch mode \mu, given by Eq. (7) [23], have been carried out under the assumption of a uniform filling pattern. The result is shown by the red curve in Figure 5 for the case of SLS-2 without harmonic cavity.

\frac{1}{\tau_\mu} = \frac{\eta e N_b M \omega_r}{2 \beta^2 E T_0 \omega_s} \Big[ \mathrm{Re}\, Z_\parallel (q M \omega_0 + \mu \omega_0 + \omega_s) - \mathrm{Re}\, Z_\parallel (q' M \omega_0 - \mu \omega_0 - \omega_s) \Big]   (7)

The above-mentioned analytical method assumes uniform filling of all the buckets, which is usually not the case in the real operation of the storage rings of synchrotron light sources. As mentioned above, we propose that 390 identical bunches be filled continuously in the SLS-2 storage ring. The goal of the simulation is to see whether the stable temperature windows shift because of the nonuniform filling. We therefore simulate the influence of the L5 mode using the uniform filling pattern and the 363/484 filling pattern, respectively. To compare with the analytical estimations, we plot the growth rates of the bunches under different conditions in Figure 5.
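The real part of the HOM impedance entering Eq. (7) is commonly modeled as a parallel RLC resonator; a minimal sketch (our own names, and the values in the usage example are illustrative, not the measured SLS cavity data):

```cpp
#include <cassert>
#include <cmath>

// Real part of the longitudinal impedance of a HOM modeled as a
// parallel RLC resonator:
//   Re Z(w) = Rs / (1 + Q^2 (w/wr - wr/w)^2)
// Rs: shunt impedance, Q: quality factor, wr: resonant angular frequency.
double re_z_parallel(double w, double Rs, double Q, double wr) {
    const double u = Q * (w / wr - wr / w);
    return Rs / (1.0 + u * u);
}
```

On resonance (w = wr) the function returns Rs, and for a high-Q HOM it falls off very rapidly away from wr, which is why small shifts of the resonant frequency (via the cooling water temperature) move a beam spectral line in or out of the dangerous region.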
The green curve with the cross markers shows the simulation results for the uniform filling pattern. The growth rates of the first, middle, and last bunches of the bunch train in the continuous 363/484 filling pattern are shown in the same figure.

The analytic method provides a satisfactory prediction of the 'stable window'. Comparing the simulation results under the assumption of a uniform filling pattern with those for the 363/484 filling, we find that the resulting 'stable window' is almost the same. This study gives more confidence in the analytic estimations of the longitudinal coupled-bunch instability.
5. Conclusions
We have presented the development of the mbtrack-cuda code, which uses GPU computing technology; the heaviest computations are carried out by the GPU. mbtrack-cuda allows multi-bunch simulations to run on a single stand-alone workstation with a scientific graphics card in an acceptable time. Therefore, this code reduces the need for large-scale clusters, which are usually expensive to build and operate.

Figure 5: Comparison of the growth rates of the longitudinal coupled-bunch instability driven by the longitudinal higher-order mode L5 of the cavity, for the uniform and the 363/484 filling patterns. In the 363/484 filling pattern, 363 continuous buckets out of 484 buckets are filled identically. To make a fair comparison, a total current of 400 mA is used in both the analytic estimation and the simulations.

In its present version, the performance of mbtrack-cuda still needs improvement, since it is slower not only than mbtrack on a big cluster, but also than mbtrack on a multi-core high-performance stand-alone workstation. However, its advantage in single-bunch simulations clearly demonstrates the potential for improvement of mbtrack-cuda.

The mbtrack-cuda code has been applied to the study of the longitudinal coupled-bunch instabilities of the SLS-2 storage ring. The longitudinal HOM L5 of the ELETTRA-type 500 MHz cavity is studied both analytically and by simulation. Varying the cooling water temperature of the cavity, we simulated both the uniform filling pattern and the 363/484 filling pattern at a total current of 400 mA. The simulation results agree well with the analytical estimations.

The main limiting factor of the present mbtrack-cuda version is the lack of a multi-GPU implementation. Since mbtrack-cuda shows significant performance improvement for single-bunch simulations, splitting the bunches among multiple GPUs would allow multi-node clusters to take full advantage of the GPU resources.
A multi-GPU implementation would also decrease the memory used by a single card, thus allowing bigger simulations with more macro-particles per bunch.

With the advances in GPU technologies, the number of CUDA cores keeps increasing and so does the memory speed, which should lead to even better performance. Since the number of macro-particles per bunch is usually very high, mbtrack-cuda will be able to take advantage of increasing core counts. Adapting the code to the new Volta architecture would certainly decrease the time to solution.
6. Acknowledgements
The authors would like to thank Dr. Ryutaro Nagaoka and Dr. Francis Cullinan for their kind support and discussions of mbtrack, and Dr. Paolo Craievich for providing the information on the HOMs of the SLS cavities. The authors would also like to thank Carl Beard for the kind English corrections.

The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 290605 (PSI-FELLOW/COFUND). The author Haisheng Xu would like to thank the PSI-FELLOW program for its support.

The 576-core cluster used in the code benchmark is operated by the IHEP computing center. The authors would like to thank the staff of the IHEP computing center for their kind support.
References

[1] E. M. Rowe and Frederick E. Mills. Tantalus I: a dedicated storage ring synchrotron radiation source. Part. Accel., 4:211-227, 1973.
[2] Dieter Einfeld and Mark Plesko. Design of a diffraction-limited light source. Proc. SPIE, 2013, 1993.
[3] A. Streun and A. Wrulich. Compact low emittance light sources based on longitudinal gradient bending magnets. Nucl. Instrum. Methods Phys. Res. A, 770:98-112, 2015.
[4] A. Streun. The anti-bend cell for ultralow emittance storage ring lattices. Nucl. Instrum. Methods Phys. Res. A, 737:148-154, 2014.
[5] Simon Leemann, Åke Andersson, and Magnus Sjöström. First optics and beam dynamics studies on the MAX IV 3 GeV storage ring. In Proc. IPAC 2017, Copenhagen, Denmark, WEPAB075, 2017.
[6] A. Streun, M. Aiba, M. Böge, C. Calzolaio, M. Ehrlichman, A. Müller, Á. Saá Hernández, and H. Xu. Proposed upgrade of the SLS storage ring. In Proc. IPAC 2016, Busan, Korea, WEPOW038, pages 2922-2924, 2016.
[7] Galina Skripka, Ryutaro Nagaoka, Marit Klein, Francis Cullinan, and Pedro F. Tavares. Simultaneous computation of intrabunch and interbunch collective beam motions in storage rings. Nucl. Instrum. Methods Phys. Res. A, 806:221-230, 2016.
[8] R. Nagaoka, R. Bartolini, and J. Rowland. Studies of collective effects in SOLEIL and Diamond using the multiparticle tracking codes sbtrack and mbtrack. In Proc. PAC09, Vancouver, BC, Canada, FR5RFP046, pages 4637-4639, 2009.
[9] M. Klein, R. Nagaoka, G. Skripka, P. F. Tavares, and E. J. Wallén. Study of collective beam instabilities for the MAX IV 3 GeV ring. In Proc. IPAC 2013, Shanghai, China, TUPWA005, pages 1730-1732, 2013.
[10] Natalia Milas and Lukas Stingelin. Impact of filling patterns on bunch length and lifetime at the SLS. Conf. Proc., C100523:THPE084, 2010.
[11] M. E. Busse-Grawitz, P. Marchand, and W. Tron. RF system for the SLS booster and storage ring. In Proc. PAC'99, New York, pages 986-987, 1999.
[12] U. Locans. Future processor hardware architectures for the benefit of precise particle accelerator modeling. Doctoral thesis, University of Latvia, Riga, 2017.
[13] U. Locans, H. S. Xu, A. Adelmann, and L. Stingelin. A GPU variant of mbtrack and its application in SLS-2. In Proc. IPAC 2017, Copenhagen, Denmark, THPAB051, pages 3827-3829, 2017.
[14] NVIDIA CUDA Zone, https://developer.nvidia.com/cuda-zone.
[15] K. Amyx, J. Balasalle, J. King, I. V. Pogorelov, M. Borland, and R. Soliday. Beam dynamics simulations with a GPU-accelerated version of elegant. In Proc. IPAC 2013, Shanghai, China, MOPWO067, pages 1040-1042, 2013.
[16] I. V. Pogorelov, J. R. King, K. M. Amyx, M. Borland, and R. Soliday. Current status of the GPU-accelerated elegant. In Proc. IPAC 2015, Richmond, VA, USA, MOPMA035, pages 623-625, 2015.
[17] Patrik Schönfeldt, Miriam Brosi, Markus Schwarz, Johannes L. Steinmann, and Anke-Susanne Müller. Parallelized Vlasov-Fokker-Planck solver for desktop personal computers. Phys. Rev. Accel. Beams, 20:030704, 2017.
[18] Jack Borthwick, Francis Cullinan, Ryutaro Nagaoka, and Galina Skripka. mbtrack: multi-bunch tracking code, 2015.
[19] M. E. Busse-Grawitz, P. Marchand, and W. Tron. RF system for the SLS booster and storage ring. In Proc. PAC'99, New York, MOP137, pages 986-988, 1999.
[20] M. Svandrlik, C. J. Bocchetta, A. Fabris, F. Iazzourene, E. Karantzoulis, R. Nagaoka, C. Pasotti, L. Tosi, R. P. Walker, and A. Wrulich. The cure of multibunch instabilities in ELETTRA. Conf. Proc., C950501:2762-2764, 1996.
[21] Michele Svandrlik, Alessandro Fabris, and Cristina Pasotti. Improvements in curing coupled bunch instabilities at ELETTRA by mode shifting after the installation of the adjustable higher order mode frequency shifter (HOMFS). Conf. Proc., C970512:1735, 1997.
[22] Private communication with Dr. Paolo Craievich, Paul Scherrer Institut.
[23] K. Y. Ng. Physics of Intensity Dependent Beam Instabilities. World Scientific, 2006.