Calculation of Longitudinal Collective Instabilities with mbtrack-cuda
Haisheng Xu, Uldis Locans, Andreas Adelmann, Lukas Stingelin
Paul Scherrer Institut, CH-5232 Villigen PSI, Switzerland

Abstract
Macro-particle tracking is a prominent method to study collective beam instabilities in accelerators. However, the heavy computational load often limits the capability of the tracking codes. One widely used macro-particle tracking code to simulate collective instabilities in storage rings is mbtrack. The Message Passing Interface (MPI) is already implemented in the original mbtrack to accelerate the simulations. However, many CPU threads are required by mbtrack for the analysis of coupled-bunch instabilities, so computer clusters or desktops with many CPU cores are needed. Since these are not always available, we employ as an alternative a Graphics Processing Unit (GPU) with the CUDA programming interface to run such simulations on a stand-alone workstation. All the heavy computations have been moved to the GPU. The benchmarks confirm that mbtrack-cuda can be used to analyze coupled-bunch instabilities of up to at least 484 bunches. Compared to mbtrack on an 8-core CPU, a 36-core CPU and a cluster, mbtrack-cuda is faster for simulations of up to 3 bunches. For 363 bunches, mbtrack-cuda needs about six times the execution time of the cluster and twice that of the 36-core CPU. The multi-bunch instability analysis shows that the length of the ion-cleaning gap has no big influence, at least for a 363/484 filling.
1. Introduction
Synchrotron light sources have been a powerful tool for condensed matter physics, materials science, biology and medicine since about 1968 [1]. The demand for higher brightness from the users of synchrotron light sources has pushed for improved performance of the accelerators, for instance to further reduce the emittance of the electron beams. In recent years, there have been remarkable improvements in the design of ultra-low emittance storage rings. Thanks to the application of the concepts of the Multi-Bend Achromat (MBA) [2], Longitudinal Gradient Bends (LGBs) [3], and Anti-Bends (ABs) [4], an emittance of the order of 100 pm (or even lower) has been achieved in many new storage ring designs and measured at MAX IV [5] during early commissioning. All the above-mentioned concepts are employed in the design of the storage ring for the Swiss Light Source Upgrade (SLS-2) [6].

Since the lattices of the ultra-low emittance rings need the strong focusing provided by strong quadrupole magnets, high-field sextupole magnets are needed for the correction of chromatic and geometric aberrations. Vacuum chambers with small cross sections are therefore considered in the ultra-low emittance storage rings, causing high impedance. Furthermore, ultra-low emittance storage rings are more sensitive to impedance-induced instabilities.

The mbtrack code [7] is a multi-bunch macro-particle tracking code which can be used to study both single-bunch and coupled-bunch instabilities in electron storage rings. It has been used at different synchrotron light sources, such as SOLEIL [8] and MAX IV [7, 9]. The Message Passing Interface (MPI) is implemented in mbtrack to accelerate the multi-bunch tracking. For an n-bunch simulation using mbtrack, where n is an integer, (n+1) MPI processes are needed. Therefore, a large-scale computing cluster is usually preferred to perform the multi-bunch simulations of the storage rings of synchrotron light sources. For instance, 391 processes are required for the analysis of the nominal operation mode of the Swiss Light Source (SLS) [10, 11], which includes 390 bunches in the bunch train.

In order to reduce the demand for large-scale clusters for multi-bunch simulations, we developed a GPU version of mbtrack (mbtrack-cuda) [12, 13], in which the computations are offloaded to a Graphics Processing Unit (GPU). Taking advantage of a state-of-the-art GPU, which has a massively parallel architecture with thousands of cores, one can parallelize the tracking of all the macro-particles in an efficient manner. An NVIDIA graphics card was chosen as the hardware for developing mbtrack-cuda, and CUDA [14], a parallel computing platform and programming model invented by NVIDIA, is used for programming. The resulting mbtrack-cuda manages to carry out the multi-bunch simulations on a stand-alone workstation equipped with an NVIDIA Tesla K40c GPU.

Recently, the advantages of GPU computing have attracted interest in the accelerator physics community for the analysis of collective instabilities.

[email protected] (the author is presently at the Institute of High Energy Physics, CAS, Beijing, China), [email protected], [email protected], [email protected]

Preprint submitted to Elsevier, September 5, 2018
For example, a GPU implementation of the elegant code [15, 16] is also under way, and the Inovesa code [17], a Vlasov-Fokker-Planck solver, also benefits from modern GPU computing. The development of mbtrack-cuda provides a powerful alternative, especially for the users of mbtrack.

In this paper, we present the development of the mbtrack-cuda code and the studies of longitudinal collective instabilities for SLS-2 carried out with it. The rest of this paper is organized as follows. Section 2 presents the mbtrack-cuda code in detail. The benchmarks of the code are presented in Section 3. The simulation study of the longitudinal coupled-bunch instability for SLS-2 with mbtrack-cuda is presented in Section 4. The conclusions are given in Section 5.
2. GPU acceleration of mbtrack

2.1. Introduction to the mbtrack-cuda development
The mbtrack-cuda code [12] is an extension of the original version; the main architecture of the code is kept identical. Like mbtrack, mbtrack-cuda allows the users to choose the effects included in the simulations.

The main difference between the two versions is the device on which the operations on the macro-particles are carried out. In the original mbtrack, all the operations are carried out by the CPU. In mbtrack-cuda, however, the coordinates of the macro-particles are all stored on the GPU, and the operations on the macro-particles are also performed on the GPU; the CPU is used only to control the flow of the simulation and to write the output data. Due to the different architectures of a CPU cluster and a GPU, the parallelization models used in the two versions differ, as shown in Figure 1: mbtrack parallelizes the simulations over bunches, with each process handling one bunch, while the macro-particles in one bunch are tracked in series. mbtrack-cuda, on the other hand, parallelizes the simulations over the macro-particles in one bunch, and the different bunches are tracked in series.
Figure 1: Parallelization in the original mbtrack and in mbtrack-cuda: (a) mbtrack: each bunch is assigned to one CPU core; (b) mbtrack-cuda: all the bunches are on the GPU, and the simulations are parallelized over the macro-particles in one bunch.
To create mbtrack-cuda, CUDA kernels have been written to perform all the transformations that are implemented in the original mbtrack. In addition, the statistics calculations are also performed on the GPU, to avoid transferring the macro-particles' coordinates to the CPU side. The flow diagram of mbtrack-cuda is shown in Figure 2; it shows the tasks executed on the host and the kernels launched on the GPU. The transformations performed by mbtrack and the implementation of the CUDA kernels for mbtrack-cuda are described in detail in the rest of this section.
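The control flow sketched in Figure 2 can be illustrated by a hypothetical host-side outline. All names here are ours, for illustration only; in the real code each step is a CUDA kernel launch, not a host function call:

```cpp
#include <string>
#include <vector>

// Sketch of the per-turn flow that the host controls: the bunches are
// processed in series, and the long-range resistive-wall step runs only
// after every bunch has been updated and its statistics computed.
struct FlowLog {
    std::vector<std::string> launched;
    void kernel(const std::string& name) { launched.push_back(name); }
};

void track_one_turn(FlowLog& log, int n_bunches) {
    for (int b = 0; b < n_bunches; ++b) {     // bunches in series
        log.kernel("self-field transform");
        log.kernel("optics transform");
        log.kernel("calculate statistics");
    }
    log.kernel("long-range RW interaction");  // after all bunches
}
```

The host thus only sequences the work; all per-particle computation stays on the GPU.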
Figure 2: Flow diagram of the mbtrack-cuda code.
Every particle in mbtrack is represented by a 6-dimensional vector

(x, x', y, y', \tau, \delta)

where x and y are the horizontal and vertical positions, while x' and y' are the transverse momenta at longitudinal position s. The parameter \delta = \Delta E / E describes the energy deviation relative to the reference particle, and the longitudinal coordinate \tau is the arrival time with respect to the reference particle [7].

At each turn this transformation computes the energy deviation for each particle:

\delta_{i+1} = \delta_i + \epsilon_i - \frac{U_{rad}}{E}   (1)

where \epsilon_i is the relative energy gain in the RF cavities, U_{rad} is the average energy loss per turn due to the synchrotron radiation (SR), and E is the reference energy. After the energy deviation is computed, the longitudinal coordinate is updated as follows:

\tau_{i+1} = \tau_i + \alpha_c T_0 \delta_{i+1}   (2)

where \alpha_c is the momentum compaction factor and T_0 the revolution period.

In the transverse planes the particles in the beam perform betatron oscillations. These oscillations are described using the Twiss parameters \alpha_{x,y}, \beta_{x,y}, \gamma_{x,y}, which describe the beam shape, size and orientation, and a phase advance per turn \Psi_{x,y} [7]. For particles with non-zero energy deviation the phase advance can be calculated by:

\Psi_{x,y} = \Psi_{x,y,0} (1 + \xi_{x,y} \delta)   (3)

where \Psi_{x,y,0} is the nominal phase advance and \xi_{x,y} is the chromaticity.
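As an illustration, the one-turn longitudinal map of Eqs. (1) and (2) can be sketched in plain C++ (a serial host-side sketch with our own names, not the actual mbtrack-cuda kernel; on the GPU one thread handles one particle):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Serial sketch of the one-turn longitudinal map, Eqs. (1)-(2).
// eps: relative energy gain in the RF cavities, Urad: average SR loss
// per turn, E: reference energy, alpha_c: momentum compaction factor,
// T0: revolution period.
struct Particle { double tau; double delta; };

void longitudinal_turn(std::vector<Particle>& bunch, double eps,
                       double Urad, double E, double alpha_c, double T0) {
    for (auto& p : bunch) {
        p.delta += eps - Urad / E;          // Eq. (1)
        p.tau   += alpha_c * T0 * p.delta;  // Eq. (2), with updated delta
    }
}
```

Since the loop body touches only one particle, it maps directly onto a one-thread-per-particle CUDA kernel.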
Given the presence of the horizontal dispersion D, the transformations in the transverse planes are expressed by the transfer matrices:

\begin{pmatrix} x \\ x' \\ \delta \end{pmatrix}_{i+1} = \begin{pmatrix} \cos\Psi_x + \alpha_x \sin\Psi_x & \beta_x \sin\Psi_x & D \\ -\gamma_x \sin\Psi_x & \cos\Psi_x - \alpha_x \sin\Psi_x & D' \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x \\ x' \\ \delta \end{pmatrix}_i

\begin{pmatrix} y \\ y' \end{pmatrix}_{i+1} = \begin{pmatrix} \cos\Psi_y + \alpha_y \sin\Psi_y & \beta_y \sin\Psi_y \\ -\gamma_y \sin\Psi_y & \cos\Psi_y - \alpha_y \sin\Psi_y \end{pmatrix} \begin{pmatrix} y \\ y' \end{pmatrix}_i   (4)

Changes in the beam energy are also caused by quantum excitation and radiation damping, as shown in the equations below:

\tilde{\delta}_{i+1} = \delta_{i+1} (1 - D_E) + \sigma_E \sqrt{D_E}\, \delta_{rand}
\tilde{x}_{i+1} = x_{i+1} + \sigma_x \sqrt{D_x}\, x_{rand}
\tilde{x}'_{i+1} = x'_{i+1} \frac{\delta_{i+1}}{\delta_{i+1} + \epsilon_{i+1}} + \sigma_{x'} \sqrt{D_x}\, x'_{rand}   (5)

where the coefficients D and \sigma correspond to the synchrotron radiation damping times and the bunch energy spread, while \delta_{rand}, x_{rand} and x'_{rand} are random numbers from a normal distribution with unit standard deviation [7].

The application launches one CUDA kernel that performs these transformations for every bunch. Since the calculations for each particle are independent, one thread per particle is created inside the kernel. The random numbers needed for the calculations of the radiation damping and quantum excitation are generated using the NVIDIA cuRAND library. Shared memory is used to hold the data for additional harmonic cavities: since these data are shared by all the threads, keeping them in shared memory reduces the load time from global memory.

2.3. Treatment of the bunch-wake interactions

The simulations in mbtrack can include resistive-wall effects, an arbitrary number of resonators, and purely resistive and inductive components, all contributing to the total wake [7]. The macro-particles in each bunch are grouped into cells (bins) depending on their longitudinal position. The wake functions are calculated corresponding to the ensemble of resonators.
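The one-plane betatron rotation of Eq. (4), without dispersion, can be sketched as follows (an illustrative serial sketch with our own names; mbtrack's actual implementation differs):

```cpp
#include <cassert>
#include <cmath>

// One-plane betatron rotation per Eqs. (3)-(4), without dispersion.
// psi0: nominal phase advance per turn, xi: chromaticity, alpha and
// beta: Twiss parameters, with gamma = (1 + alpha^2) / beta.
void betatron_turn(double& u, double& up, double delta,
                   double psi0, double xi, double alpha, double beta) {
    const double gamma = (1.0 + alpha * alpha) / beta;
    const double psi = psi0 * (1.0 + xi * delta);  // chromatic phase advance
    const double c = std::cos(psi), s = std::sin(psi);
    const double u_new  = (c + alpha * s) * u + beta * s * up;
    const double up_new = -gamma * s * u + (c - alpha * s) * up;
    u = u_new;
    up = up_new;
}
```

The matrix has unit determinant, so the rotation preserves the Courant-Snyder invariant, as a one-turn map must.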
Each turn, the excitation of this wake on each bin is calculated and the resulting kick is given to every particle in the bin [18]. The wake function is calculated for each resonator and summed to form the total wake. Additionally, resistive-wall effects are added to the same wake. In the longitudinal and transverse planes the transformation is expressed as a change in particle energy and momentum (horizontal and vertical planes are treated identically):

\Delta\delta_j = \frac{q_j V(\tau_j)}{E}, \qquad \Delta x'_j = \frac{q_j}{E} \sum_{k=0}^{j-1} q_k D_p(\tau_k) W_\perp(\tau_j - \tau_k)   (6)

where V(\tau_j) is the wake voltage induced by this bunch at bin \tau_j, W_\perp is the transverse wake function, q_{j,k} is the total charge in bins j and k, and D_p(\tau_k) is the dipole moment of the bunch at position \tau_k. Detailed information about how V(\tau_j) and W_\perp are calculated can be found in [7].

The transformation is carried out step by step. Each step begins by assigning each macro-particle to a bin (mesh cell). Within this step, the calculations for different macro-particles are independent, so the process can be parallelized over the number of macro-particles in the bunch. A kernel is launched to put all the macro-particles into their own bins, with each thread handling one macro-particle. The bin number is saved in a temporary memory reserved at the beginning of the calculations and reused between bunches.

After each macro-particle has been assigned to a bin, another kernel is launched to count the number of macro-particles in each bin and to compute the dipole moments D_p for the horizontal and vertical planes. This kernel is also launched with one thread per macro-particle. However, since there are a number of macro-particles per bin, atomic operations are needed to sum up the number of macro-particles in each bin.

Once the occupancy of each bin is known, a kernel is launched to find the bins holding the minimum and maximum number of macro-particles.
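The binning step described above can be sketched serially as follows (our own names; on the GPU one thread handles one particle and the per-bin counters are updated with atomicAdd):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assign each macro-particle to a cell of a uniform mesh over
// [tmin, tmax] according to its arrival time tau, and count the
// occupancy of each cell.
std::vector<int> bin_particles(const std::vector<double>& tau,
                               double tmin, double tmax, std::size_t nbins,
                               std::vector<std::size_t>& bin_of) {
    std::vector<int> counts(nbins, 0);
    bin_of.resize(tau.size());
    const double w = (tmax - tmin) / static_cast<double>(nbins);
    for (std::size_t i = 0; i < tau.size(); ++i) {
        std::size_t b = static_cast<std::size_t>((tau[i] - tmin) / w);
        if (b >= nbins) b = nbins - 1;  // clamp the particle at tau = tmax
        bin_of[i] = b;
        ++counts[b];                    // atomicAdd in the CUDA kernel
    }
    return counts;
}
```

The per-bin counts and dipole moments then feed the wake kick of Eq. (6).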
Since this operation requires communication between threads, it is not well suited to parallelization and is therefore performed serially on the GPU. One block with multiple threads is launched: the threads parallelize the loading of the data from the slow global GPU memory into the faster shared memory, and then a single thread searches for the first and last bins that contain macro-particles. Since there is no need to loop over all the cells, this serialization of the kernel does not cause a bottleneck for the simulation.

Once the wake potentials are constructed, a kernel is called to carry out the bunch-wake interactions. This kernel parallelizes the simulation over the macro-particles in the bunch and launches one thread for each macro-particle. Shared memory is used to store the data that is frequently reused by the kernel, which minimizes the loads from global memory.

After the wake potentials have been applied to the macro-particles, the long-range resistive-wall effects are calculated and applied to the transverse planes as described in [7]. Since the long-range resistive-wall effects require statistics information about previous bunches, these calculations are launched only after the wake potential effects have been calculated for all the bunches and the statistics have been updated.

The statistics of the bunches are calculated turn by turn in order to save the temporary information of the bunches during the simulations. Calculating the statistics, such as the mean and RMS values of the six-dimensional coordinates of each bunch, is very time consuming, mainly because of the relatively large number of macro-particles and the communication time among different threads. Furthermore, since the statistics are calculated on the GPU, the results need to be transferred to the CPU in order to log them to file at every turn, which is also costly.
To calculate the average values, the Thrust library's reduce function is used to compute the sums of position and momentum in each dimension. After the reduction is performed, the data are sent to the CPU. To compute the standard deviation, Thrust's transform_reduce function is used. Since each macro-particle in mbtrack is represented as a structure of six variables, custom operators are defined to perform the reductions and transformations correctly on the arrays of particles. The statistics are calculated for each bunch, as well as averaged over all the bunches in the simulation. Only the calculations for the individual bunches are performed on the GPU; the rest are carried out on the CPU.
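The same reduce / transform-reduce pattern can be illustrated with the std:: equivalents of the Thrust calls for a single coordinate (a host-side sketch; in mbtrack-cuda the reductions run on the GPU over structures of six variables):

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <numeric>
#include <vector>

// Mean of one coordinate: a plain sum reduction, then division.
double coord_mean(const std::vector<double>& v) {
    return std::accumulate(v.begin(), v.end(), 0.0) / v.size();
}

// RMS spread of one coordinate: each value is transformed to its
// squared deviation from the mean, the results are sum-reduced, and
// the square root of the average is taken.
double coord_rms(const std::vector<double>& v) {
    const double m = coord_mean(v);
    const double ss = std::transform_reduce(
        v.begin(), v.end(), 0.0, std::plus<>(),
        [m](double x) { const double d = x - m; return d * d; });
    return std::sqrt(ss / v.size());
}
```

With Thrust, the lambda becomes the unary functor passed to transform_reduce and the data stay in device memory until the final scalar is copied back.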
3. Benchmarks of mbtrack-cuda
Before using mbtrack-cuda in our studies, it was essential to carry out systematic benchmarks of the code, starting with the tracking module. In the first test, the same group of macro-particles is tracked for 50000 turns (about seven longitudinal damping times), with the impedance-induced collective effects turned off in both codes. The lattice 'dc12c', which was an option for the SLS-2 storage ring, is used in the simulations. The main parameters of the 'dc12c' lattice are listed in Table 1. The bunch energy spread is monitored, as shown in Figure 3.
Table 1: Main parameters of the 'dc12c' lattice
Parameters                                  Values
Circumference C_ring
Energy E
Energy loss per turn U
Momentum compaction α_c
Betatron tunes ν_x \ ν_y
Chromaticities ξ_x \ ξ_y                    -66.591 \ -40.445
Transverse emittances ε_x \ ε_y
RF voltage V_RF
Synchrotron tune ν_s

To eliminate the influence of the different random number generators on the two platforms, we first turned off the synchrotron radiation effects in the tracking. The bunch energy spread over the last 5000 turns is shown in Figure 3 (a) as an example.

Figure 3: (a) Energy spread vs. turns without synchrotron radiation effects; (b) energy spread vs. turns with synchrotron radiation effects. When the synchrotron radiation effects are turned on, four different seeds are used to initialize the random number generator in mbtrack.

We can find from this figure that the results produced by both codes agree perfectly with each other, which demonstrates the reliability of the tracking module in mbtrack-cuda.

However, when the synchrotron radiation effects are turned on, a small discrepancy between the two codes can be observed, as shown in Figure 3 (b). We believe that this small discrepancy comes from the random numbers used for the synchrotron radiation. We then varied the seeds of the random number generator in mbtrack; the results corresponding to four different seeds are also shown in Figure 3 (b). As can be seen, the discrepancy between the two codes is comparable to the differences obtained when changing the seeds in mbtrack, meaning that the discrepancy between the two codes is dominated by the random number generators. These results demonstrate the good agreement between the two codes.

After testing the tracking module of mbtrack-cuda, we benchmark the beam-impedance interactions. The longitudinal short-range resistive-wall (RW) wake field is used in this test. Here, we assume a copper vacuum chamber with 10 mm inner radius along the whole ring. Neither the geometric wake nor the RW wake of any other component is included.
The macro-particles are tracked for 50000 turns in this test as well. The 'equilibrium' bunch length and energy spread are calculated by averaging the tracking data over the last 5000 turns. By varying the single-bunch current, the plots in Figure 4 are obtained.

Figures 4 (a) and (b) show how the 'equilibrium' bunch length and energy spread, respectively, vary with increasing single-bunch current. The 'equilibrium' bunch length, shown in Figure 4 (a), first decreases as the single-bunch current rises from zero and then increases with rising bunch current. The bunch-shortening effect at low current is due to the negative momentum compaction factor of the lattice together with an inductively dominated wake. The 'equilibrium' energy spread stays almost constant below about 15 mA and grows above this current, which is the microwave instability threshold. Furthermore, both Figure 4 (a) and (b) clearly indicate the desired good agreement between the two codes.

After the validity tests of the mbtrack-cuda code, we carried out performance tests. The preliminary performance tests were performed on a system equipped with 2x Intel E5-2609 v2 CPUs (2x4 CPU cores, 2x4 threads maximum) and an NVIDIA Tesla K40c graphics card (2880 CUDA cores).

Figure 4: Comparison of mbtrack and mbtrack-cuda: (a) 'equilibrium' bunch length vs. single-bunch current; (b) 'equilibrium' energy spread vs. single-bunch current.

The two example lattices integrated in the mbtrack source code, the SOLEIL lattice and the MAX IV 3 GeV ring lattice, are used in the preliminary performance tests. In each test, a bunch consisting of 100,000 macro-particles was tracked for 10,000 turns.

In the test using the SOLEIL lattice, the basic optics transformations and the long-range RW effects are implemented. The tracking was performed in the longitudinal and horizontal planes.
Since the random number generator is called every turn if synchrotron radiation is included in the simulation, we manually turned the synchrotron radiation effects on and off to find the influence of the random number generation on the simulation time in both codes. The results in Table 2 show the full execution time of the simulations, including the input and output operations.

Table 2: Comparison of the computing time using the SOLEIL lattice.

Bunches | No SR effects: mbtrack, mbtrack-cuda | With SR effects: mbtrack, mbtrack-cuda
In the test using the MAX IV 3 GeV ring lattice, one passive third-harmonic cavity and one broad-band resonator are included. The transformations are performed only in the longitudinal plane. The results are shown in Table 3.

Table 3: Comparison of the computing time using the MAX IV 3 GeV ring lattice.

Bunches | No SR effects: mbtrack, mbtrack-cuda | With SR effects: mbtrack, mbtrack-cuda

One can see from these tests that the implementation of synchrotron radiation results in a remarkable increase of the computation time of mbtrack, which is due to the call of the random number generator every turn. However, the SR effects have little influence on the computation time of mbtrack-cuda, because a well-optimized random number generator is used on the GPU platform: the generation of random numbers benefits from the highly parallel architecture of the GPU.

Furthermore, Table 2 and Table 3 show that on this computer, no results can be generated by mbtrack when the number of bunches significantly exceeds the number of CPU cores (8). Meanwhile, mbtrack-cuda still manages to run the multi-bunch simulations, even full-ring multi-bunch simulations, in reasonable time. This demonstrates the significance of our development.

We also carried out computation-time tests using the above-mentioned SLS-2 lattice 'dc12c'. In these simulations, each bunch consisted of 100,000 macro-particles and was tracked for 20,000 turns. Both synchrotron radiation damping and quantum excitation effects were enabled. The tests were carried out on three different systems. The first system was the same stand-alone workstation mentioned above, with 2x Intel E5-2609 v2 processors (2x4 CPU cores); the above-mentioned NVIDIA graphics card was moved to the second system, a multi-core high-performance workstation equipped with 2x Intel E5-2697 v4 processors (2x18 CPU cores, hyper-threading enabled).
The third system was a cluster equipped with 32 Intel Xeon Gold 6140 processors (18 CPU cores each, hyper-threading disabled), on which we managed to fully parallelize the simulations. The computing time of the different tests is shown in Table 4.
Table 4: Benchmarks of mbtrack and mbtrack-cuda on the 8-core CPU, the 36-core CPU, the cluster (576 CPU cores), and the GPU (2880 CUDA cores).

Number of Bunches | 8-core CPU | 36-core CPU | Cluster | GPU
1                 | 810 s      | 549 s       | 232 s   | 93 s
2                 | 814 s      | 558 s       | 236 s   | 157 s
3                 | 832 s      | 562 s       | 240 s   | 221 s
10                | 3596 s     | 577 s       | 266 s   | 546 s
20                | -          | 757 s       | 312 s   | 1006 s
121               | -          | 2883 s      | 922 s   | 5860 s
363               | -          | 10931 s     | 3165 s  | 18754 s
Table 4 shows significant differences in the computation time when running mbtrack on the above-mentioned CPUs, mainly because of the different per-core performance and the different number of CPU cores. It is interesting to point out that the full-ring multi-bunch simulation can be carried out on a multi-core (36 cores with hyper-threading enabled) stand-alone workstation. However, the computation time is remarkably longer than on the cluster, mainly because of the heavy overloading of the CPU cores and the lower performance of each core.

The present version of mbtrack-cuda is faster when the number of bunches is small, e.g., for single-bunch simulations. However, as the number of bunches increases, mbtrack-cuda becomes slower: for instance, a 363-bunch simulation takes about six times longer with mbtrack-cuda on the above-mentioned workstation than with mbtrack on the cluster. This significant degradation of performance with increasing bunch number arises mainly because the present mbtrack-cuda code has to process the different bunches in series. Nevertheless, it shows the potential to accelerate the simulations further in the future.
4. Simulations of the longitudinal coupled-bunch instability for SLS-2 by mbtrack-cuda
In this section, we apply the developed mbtrack-cuda code to the study of the longitudinal coupled-bunch instability in SLS-2 (using the 'dc12c' lattice). The 500 MHz ELETTRA-type RF cavities used in SLS [19] are considered for reuse in SLS-2. We therefore use the Higher Order Mode (HOM) parameters of the SLS cavity in the following simulations.

ELETTRA-type cavities use the temperature of the cooling water and the plunger tuner as the two main parameters to tune the resonant frequencies of all the modes. To avoid the longitudinal coupled-bunch instability, the temperature of the cooling water should be adjusted to values within the 'stable windows' [20, 21]. Using the L5 mode (resonant frequency 1606.862 MHz at 45 °C [22]), analytical estimations of the growth rate 1/\tau_\mu of the coupled-bunch mode \mu, given by Eq. (7) [23], have been carried out under the assumption of a uniform filling pattern. The result is shown by the red curve in Figure 5 for the case of SLS-2 without harmonic cavity.

\frac{1}{\tau_\mu} = \frac{\eta e N_b M \omega_r}{2 \beta^2 E T_0 \omega_s} \Big[ \mathrm{Re}\, Z_\parallel (q M \omega_0 + \mu \omega_0 + \omega_s) - \mathrm{Re}\, Z_\parallel (q' M \omega_0 - \mu \omega_0 - \omega_s) \Big]   (7)

The above-mentioned analytical method assumes uniform filling of all the buckets, which is usually not the case in the real operation of the storage rings of synchrotron light sources. As mentioned above, we propose that 390 identical bunches be filled continuously in the SLS-2 storage ring. The goal of the simulation is to see whether the stable temperature windows shift because of the nonuniform filling. We therefore simulate the influence of the L5 mode using the uniform filling pattern and the 363/484 filling pattern, respectively. To compare with the analytical estimations, we plot the growth rates of the bunches under different conditions in Figure 5.
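The real part of the HOM impedance entering Eq. (7) is commonly modeled as a parallel RLC resonator; a minimal sketch (our own names, and the values in the usage example are illustrative, not the measured SLS cavity data):

```cpp
#include <cassert>
#include <cmath>

// Real part of the longitudinal impedance of a HOM modeled as a
// parallel RLC resonator:
//   Re Z(w) = Rs / (1 + Q^2 (w/wr - wr/w)^2)
// Rs: shunt impedance, Q: quality factor, wr: resonant angular frequency.
double re_z_parallel(double w, double Rs, double Q, double wr) {
    const double u = Q * (w / wr - wr / w);
    return Rs / (1.0 + u * u);
}
```

On resonance (w = wr) the function returns Rs, and for a high-Q HOM it falls off very rapidly away from wr, which is why small shifts of the resonant frequency (via the cooling water temperature) move a beam spectral line in or out of the dangerous region.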
The green curve with the cross markers shows the simulation results for the uniform filling pattern. The growth rates of the first, middle, and last bunches of the bunch train in the continuous 363/484 filling pattern are shown in the same figure.

The analytic method provides a satisfactory prediction of the 'stable window'. Comparing the simulation results under the assumption of a uniform filling pattern with those for the 363/484 filling, we find that the resulting 'stable window' is almost the same. This study gives more confidence in the analytic estimations of the longitudinal coupled-bunch instability.
5. Conclusions
We have presented the development of the mbtrack-cuda code, which uses GPU computing technology; the heaviest computations are carried out by the GPU. mbtrack-cuda allows multi-bunch simulations to run on a single stand-alone workstation with a scientific graphics card in an acceptable time. Therefore, this code reduces the need for large-scale clusters, which are usually expensive to build and operate.

Figure 5: Comparison of the growth rates of the longitudinal coupled-bunch instability driven by the longitudinal higher-order mode L5 of the cavity, for the uniform and the 363/484 filling patterns. In the 363/484 filling pattern, 363 continuous buckets out of 484 buckets are filled identically. To make a fair comparison, a total current of 400 mA is used in both the analytic estimation and the simulations.

In its present version, the performance of mbtrack-cuda still needs improvement, since it is slower not only than mbtrack on a big cluster, but also than mbtrack on a multi-core high-performance stand-alone workstation. However, its advantage in single-bunch simulations clearly demonstrates the potential for improvement of mbtrack-cuda.

The mbtrack-cuda code has been applied to the study of the longitudinal coupled-bunch instabilities of the SLS-2 storage ring. The longitudinal HOM L5 of the ELETTRA-type 500 MHz cavity is studied both analytically and by simulation. Varying the cooling water temperature of the cavity, we simulated both the uniform filling pattern and the 363/484 filling pattern at a total current of 400 mA. The simulation results agree well with the analytical estimations.

The main limiting factor of the present mbtrack-cuda version is the lack of a multi-GPU implementation. Since mbtrack-cuda shows significant performance improvement for single-bunch simulations, splitting the bunches among multiple GPUs would allow multi-node clusters to take full advantage of the GPU resources.
A multi-GPU implementation would also decrease the memory used by a single card, thus allowing bigger simulations with more macro-particles per bunch.

With the advances in GPU technologies, the number of CUDA cores keeps increasing and so does the memory speed, which should lead to even better performance. Since the number of macro-particles per bunch is usually very high, mbtrack-cuda will be able to take advantage of increasing core counts. Adapting the code to the new Volta architecture would certainly decrease the time to solution.
6. Acknowledgements
The authors would like to thank Dr. Ryutaro Nagaoka and Dr. Francis Cullinan for their kind support and discussions of mbtrack, and Dr. Paolo Craievich for providing the information on the HOMs of the SLS cavities. The authors would also like to thank Carl Beard for the kind English corrections.

The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 290605 (PSI-FELLOW/COFUND). The author Haisheng Xu would like to thank the PSI-FELLOW program for its support.

The 576-core cluster used in the code benchmark is operated by the IHEP computing center. The authors would like to thank the staff of the IHEP computing center for their kind support.
References

[1] E. M. Rowe and Frederick E. Mills. Tantalus I: a dedicated storage ring synchrotron radiation source. Part. Accel., 4:211-227, 1973.
[2] Dieter Einfeld and Mark Plesko. Design of a diffraction-limited light source. Proc. SPIE, 2013, 1993.
[3] A. Streun and A. Wrulich. Compact low emittance light sources based on longitudinal gradient bending magnets. Nucl. Instrum. Methods Phys. Res. A, 770:98-112, 2015.
[4] A. Streun. The anti-bend cell for ultralow emittance storage ring lattices. Nucl. Instrum. Methods Phys. Res. A, 737:148-154, 2014.
[5] Simon Leemann, Åke Andersson, and Magnus Sjöström. First optics and beam dynamics studies on the MAX IV 3 GeV storage ring. In Proc. IPAC 2017, Copenhagen, Denmark, WEPAB075, 2017.
[6] A. Streun, M. Aiba, M. Böge, C. Calzolaio, M. Ehrlichman, A. Müller, Á. Saá Hernández, and H. Xu. Proposed upgrade of the SLS storage ring. In Proc. IPAC 2016, Busan, Korea, WEPOW038, pages 2922-2924, 2016.
[7] Galina Skripka, Ryutaro Nagaoka, Marit Klein, Francis Cullinan, and Pedro F. Tavares. Simultaneous computation of intrabunch and interbunch collective beam motions in storage rings. Nucl. Instrum. Methods Phys. Res. A, 806:221-230, 2016.
[8] R. Nagaoka, R. Bartolini, and J. Rowland. Studies of collective effects in SOLEIL and Diamond using the multiparticle tracking codes sbtrack and mbtrack. In Proc. PAC09, Vancouver, BC, Canada, FR5RFP046, pages 4637-4639, 2009.
[9] M. Klein, R. Nagaoka, G. Skripka, P. F. Tavares, and E. J. Wallén. Study of collective beam instabilities for the MAX IV 3 GeV ring. In Proc. IPAC 2013, Shanghai, China, TUPWA005, pages 1730-1732, 2013.
[10] Natalia Milas and Lukas Stingelin. Impact of filling patterns on bunch length and lifetime at the SLS. Conf. Proc., C100523:THPE084, 2010.
[11] M. E. Busse-Grawitz, P. Marchand, and W. Tron. RF system for the SLS booster and storage ring. In Proc. PAC'99, New York, pages 986-987, 1999.
[12] U. Locans. Future processor hardware architectures for the benefit of precise particle accelerator modeling. Doctoral thesis, University of Latvia, Riga, 2017.
[13] U. Locans, H. S. Xu, A. Adelmann, and L. Stingelin. A GPU variant of mbtrack and its application in SLS-2. In Proc. IPAC 2017, Copenhagen, Denmark, THPAB051, pages 3827-3829, 2017.
[14] NVIDIA CUDA Zone, https://developer.nvidia.com/cuda-zone.
[15] K. Amyx, J. Balasalle, J. King, I. V. Pogorelov, M. Borland, and R. Soliday. Beam dynamics simulations with a GPU-accelerated version of elegant. In Proc. IPAC 2013, Shanghai, China, MOPWO067, pages 1040-1042, 2013.
[16] I. V. Pogorelov, J. R. King, K. M. Amyx, M. Borland, and R. Soliday. Current status of the GPU-accelerated elegant. In Proc. IPAC 2015, Richmond, VA, USA, MOPMA035, pages 623-625, 2015.
[17] Patrik Schönfeldt, Miriam Brosi, Markus Schwarz, Johannes L. Steinmann, and Anke-Susanne Müller. Parallelized Vlasov-Fokker-Planck solver for desktop personal computers. Phys. Rev. Accel. Beams, 20:030704, 2017.
[18] Jack Borthwick, Francis Cullinan, Ryutaro Nagaoka, and Galina Skripka. mbtrack: multi-bunch tracking code, 2015.
[19] M. E. Busse-Grawitz, P. Marchand, and W. Tron. RF system for the SLS booster and storage ring. In Proc. PAC'99, New York, MOP137, pages 986-988, 1999.
[20] M. Svandrlik, C. J. Bocchetta, A. Fabris, F. Iazzourene, E. Karantzoulis, R. Nagaoka, C. Pasotti, L. Tosi, R. P. Walker, and A. Wrulich. The cure of multibunch instabilities in ELETTRA. Conf. Proc., C950501:2762-2764, 1996.
[21] Michele Svandrlik, Alessandro Fabris, and Cristina Pasotti. Improvements in curing coupled bunch instabilities at ELETTRA by mode shifting after the installation of the adjustable higher order mode frequency shifter (HOMFS). Conf. Proc., C970512:1735, 1997.
[22] Private communication with Dr. Paolo Craievich, Paul Scherrer Institut.
[23] K. Y. Ng. Physics of Intensity Dependent Beam Instabilities. World Scientific, 2006.