Surrogate Modeling of the CLIC Final-Focus System using Artificial Neural Networks
J. Ögren, C. Gohil, and D. Schulte
European Organization for Nuclear Research (CERN), Geneva, Switzerland
John Adams Institute, University of Oxford, Oxford, United Kingdom

September 15, 2020
Abstract
Artificial neural networks can be used for creating surrogate models that can replace computationally expensive simulations. In this paper, a surrogate model was created for a subset of the Compact Linear Collider (CLIC) final-focus system. By training on simulation data, we created a model that maps sextupole offsets to luminosity and beam sizes, thus replacing computationally intensive tracking and beam-beam simulations. This model was then used for optimizing the parameters of a random walk procedure for sextupole alignment.
The field of machine learning has grown immensely in the past decade and is now finding its way into a wide range of fields. Particle accelerators are controllable physical systems with rich complexities and offer interesting applications that can profit from different machine learning methods. Examples of applications in particle accelerators are anomaly detection in superconducting magnets [1], automatic control of collimators [2], electron bunch profile prediction [3, 4] and ion source prediction [5]. Different advanced techniques have also been used for optimization and automatic control in free-electron lasers [6, 7], longitudinal phase space control [8], dynamic aperture optimization [9] and optics correction [10, 11]. A more comprehensive list of machine learning applications in particle accelerators can be found in [12].

The field of particle accelerators relies to a large extent, especially in design and optimization, on high-fidelity simulations, and there is an abundance of simulation tools. In many cases simulations are computationally expensive, which makes optimization time consuming and online modeling problematic. A surrogate model is an approximate model that can mimic the system and is particularly useful for applications that require extensive simulations. The surrogate model, which is fast to evaluate, can be used for optimization or as an online model. In this paper we make use of artificial neural networks and supervised deep learning to create a surrogate model for a part of the Compact Linear Collider (CLIC) with computationally expensive simulations: the final-focus system.

CLIC is a proposed linear electron–positron collider [13–15] at CERN. To produce high luminosities, linear colliders rely on ultra-small beam sizes at the interaction point. This requires emittance preservation along the whole machine and puts tight tolerances on alignment and other imperfections.
Being able to reach nominal performance under imperfect conditions is crucial for the reliability of such a machine. Nominal parameters for the 380 GeV energy stage are summarized in Table 1.

The final-focus system constitutes the final 780 m of each beamline, see Figure 1. Two strong quadrupole magnets at the end of the system de-magnify the beam in both transverse directions. Since particles of different energies are bent differently in a magnetic field, a beam with an energy spread will not be focused to a single point. This effect is called chromaticity, and without compensation the beam size at the interaction point will be greatly inflated with diminished luminosity as a result. The chromaticity correction involves nonlinear magnets and a delicate compensation of unwanted aberrations. Complex systems such as this require fine-tuning and are challenging to design.

The current version of the CLIC final-focus system has the final quadrupole doublet mounted outside the detector [16, 17] with L* = 6 m, the distance between the final quadrupole and the point of collision. The system consists of 20 quadrupole magnets, 6 sextupole magnets and 2 octupole magnets. Bending magnets generate dispersion and, in combination with the sextupole magnets, an energy-dependent focusing is generated that compensates the chromaticity generated in the final doublet. The final-focus system follows a local chromaticity correction scheme [18] where two sextupoles are placed at the final doublet and additional sextupoles are placed upstream, with optics designed in such a way that unwanted geometric aberrations cancel. Figure 1 shows the beta functions (β_x,y) and the dispersion profile (η_x) of the final-focus system.

Table 1: 380 GeV CLIC beam parameters.
Parameter                                      unit                   value
Norm. emittance (end of linac) γε_x/γε_y       [nm]                   900 / 20
Norm. emittance (IP) γε_x/γε_y                 [nm]                   950 / 30
Beta function (IP) β*_x/β*_y                   [mm]                   8.2 / 0.1
Target IP beam size σ*_x/σ*_y                  [nm]                   149 / 2.9
Bunch length σ_z                               [µm]                   70
rms energy spread δ_p                          [%]                    0.35
Bunch population N_e                           [10^9]                 5.2
Number of bunches n_b                                                 352
Repetition rate f_rep                          [Hz]                   50
Luminosity L                                   [10^34 cm^-2 s^-1]     1.5
Peak luminosity L_p                            [10^34 cm^-2 s^-1]     0.9

Figure 1: Optical functions for the 380 GeV CLIC final-focus system with L* = 6 m. Six sextupoles are placed in dispersive sections in a local chromaticity correction scheme.

Tuning

The final-focus system is sensitive to imperfections and the challenge is to achieve small beam sizes, and high luminosity, for a system with imperfections. Successful operation of a linear collider requires effective and fast tuning. Tuning studies have been done extensively for all parts of CLIC and for the final-focus system in particular. A recent report [19] considered single-beam tuning of the 380 GeV CLIC final-focus system including static imperfections such as magnet offsets, roll errors and strength errors. Single beam means that only one side, i.e. one of the two beamlines, of the final-focus system is tracked and the beam at the interaction point is mirrored for the beam–beam simulation.

In that study, the tuning procedure consisted of beam-based alignment that corrected the linear system, followed by sextupole alignment and tuning with sextupole and octupole knobs. Due to the nonlinearities of the system together with synchrotron radiation, small imperfections can have a dramatic impact on the beam size at the interaction point and consequently cause a substantial decrease in luminosity. The purpose of the tuning simulations is to find robust algorithms for achieving luminosity under imperfect conditions and thus ensuring the reliability of the collider. In all the simulations we use
PLACET [20] for tracking the beam through the final-focus system lattice and
GUINEA-PIG [21] for a full beam–beam simulation.
Let us consider a single imperfection: transverse misalignment of sextupoles. Experience has shown that this imperfection has a large impact on luminosity. To quantify the impact we run simulations on a perfect machine and add transverse sextupole offsets assuming a normal distribution. For different rms values, we track 20 cases and run a beam–beam simulation to compute the luminosity. Similar to Ref. [19] we consider the single-beam case. Figure 2 shows the average luminosity decreasing as the rms value of the transverse sextupole offset increases. Since transverse sextupole misalignments have a large impact, it is essential to have a quick and robust method for tuning the sextupoles. In the next section we build a surrogate model for the impact of sextupole misalignments on luminosity and beam size at the interaction point.
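The misalignment scan described above can be sketched as follows. Here `track_and_collide` is a hypothetical stand-in for the PLACET tracking plus GUINEA-PIG beam–beam simulation (a simple analytic placeholder, not either code), and only the loop structure reflects the procedure in the text:

```python
import numpy as np

N_SEXTUPOLES = 6  # six sextupoles, two transverse planes each -> 12 offsets

def track_and_collide(offsets):
    """Hypothetical stand-in for the full PLACET + GUINEA-PIG simulation.
    Returns a luminosity-like value that decreases with larger offsets."""
    return 1.5e34 * np.exp(-np.sum(offsets**2) / (2 * (50e-6) ** 2))

def misalignment_scan(rms_values_m, n_cases=20, seed=0):
    """For each rms value, draw n_cases random normally distributed
    sextupole offsets and record mean, min and max luminosity."""
    rng = np.random.default_rng(seed)
    results = {}
    for rms in rms_values_m:
        lumis = []
        for _ in range(n_cases):
            # Normally distributed horizontal and vertical offsets
            offsets = rng.normal(0.0, rms, size=2 * N_SEXTUPOLES)
            lumis.append(track_and_collide(offsets))
        results[rms] = (np.mean(lumis), np.min(lumis), np.max(lumis))
    return results
```

The mean traces the curve of Figure 2 and the min/max give the shaded band.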
We only consider transverse sextupole offsets and aim to develop a model that outputs luminosity, peak luminosity, horizontal beam size and vertical beam size. The peak luminosity is the luminosity of collisions within 1% of the nominal energy. CLIC specifications require that a minimum of 60% of collisions occur at the nominal energy, see Table 1.
Figure 2: The average luminosity for the perfect machine with sextupole offsets of varying rms values. As the sextupole misalignments grow, luminosity quickly deteriorates. The shaded area shows the maximum and minimum values.

An analytical approach is not feasible due to the level of complexity of the system. The nonlinear fields of the sextupole magnets together with the effects of synchrotron radiation make it difficult to compute the resulting beam distribution at the interaction point. Furthermore, the disruption parameter in the beam–beam interaction is high, which means that the beams have a considerable effect on each other. In an electron–positron collider the transverse electromagnetic fields of one beam focus the opposing beam (the ‘pinch effect’). In such circumstances analytical estimates of luminosity are not reliable.

Artificial neural networks are large connections of small units (artificial neurons), and by training on data the models can learn to perform different tasks such as classification, clustering or regression. In our case we make use of regression and train a model to map inputs to known outputs. In the context of machine learning, this is known as supervised learning. We use the machine learning system TensorFlow [22] interfaced via the Python library Keras [23] for training the models. The goal of a model is to make accurate predictions based on unseen data. Such models are said to generalize well. Large data sets are often needed to train artificial neural networks successfully.
Assuming normal distributions with rms values of 5, 10 and 20 µm, we added random offsets to the sextupoles. For each case we tracked 10 macroparticles through the final-focus system, ran a beam–beam simulation and saved the sextupole positions, the beam sizes at the interaction point and the output from the beam–beam simulation: luminosity and peak luminosity. The data was generated in parallel using the CERN batch system and in a few days we had accumulated a data set of 450,000 cases. Figure 3 shows histograms of the accumulated data.

Figure 3: Histograms of the data set showing the four outputs: luminosity (top left), peak luminosity (top right), horizontal beam size (bottom left) and vertical beam size (bottom right).

Since the numeric values of the inputs and outputs have very different ranges, we must transform the data before training our models. We standardize the data set, i.e. apply the transformation X_standard = (X − µ)/σ, so that each column in our data set has zero mean and unit variance. The mean µ and the standard deviation σ are computed using the training data set. To the columns containing output data (luminosity, peak luminosity, horizontal and vertical beam sizes) we also apply a power transformation to make the distributions closer to Gaussian, see Figure 4. Standardization and power transformation are common practices to make the training of artificial neural networks numerically stable. Both of these transformations are available in the Python library Scikit-learn [24]. When testing the model, the inverse transformations are applied to the output predictions.

There are no clear guidelines for how to set up an artificial neural network and how to select a model architecture, since it depends on the problem. In general, if the model has too low a complexity it will not be able to mimic nonlinear behavior, and if the model has too much complexity it can suffer from overfitting, where the model does not generalize well.
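The preprocessing described earlier, standardization followed by a power transformation, is available in Scikit-learn as `StandardScaler` and `PowerTransformer`; a plain NumPy sketch of just the standardization step, including the inverse transform applied to model predictions, could look like:

```python
import numpy as np

class Standardizer:
    """Zero-mean, unit-variance scaling, fitted on the training set only."""

    def fit(self, X_train):
        # Mean and standard deviation are computed from training data
        self.mu = X_train.mean(axis=0)
        self.sigma = X_train.std(axis=0)
        return self

    def transform(self, X):
        return (X - self.mu) / self.sigma

    def inverse_transform(self, X_std):
        # Applied to model predictions to recover physical units
        return X_std * self.sigma + self.mu
```

The power transformation (e.g. a Yeo-Johnson transform, as offered by Scikit-learn) would be applied to the four output columns in addition to this scaling.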
Figure 4: Histograms of the data set after power transformation and standardization: luminosity (top left), peak luminosity (top right), horizontal beam size (bottom left) and vertical beam size (bottom right). The data is also standardized to have zero mean and unit variance.

Figure 5: A schematic of an artificial neural network model that maps the horizontal and vertical positions of the six sextupole magnets in the CLIC final-focus system to the resulting luminosity, peak luminosity and horizontal and vertical beam sizes. This model consists of two fully connected hidden layers in a feedforward configuration.

First we consider a simple case where our model consists of two hidden layers of twelve neurons each. We use a model with fully connected layers in a feedforward configuration (every neuron in one layer connects to every neuron in the subsequent layer). Figure 5 shows a schematic of the model. The 12-dimensional input vector can be written as

X = (x_1, ..., x_12)^T = (S_1x, S_1y, ..., S_6x, S_6y)^T,   (1)

where S_1x denotes the horizontal position of the first sextupole and so on. The 4-dimensional output can be written as

Y = (y_1, ..., y_4)^T = (L, L_p, σ_x, σ_y)^T,   (2)

where L denotes the luminosity, L_p the peak luminosity and σ_x (σ_y) the horizontal (vertical) beam size at the IP. The input to a neuron is the sum of weighted inputs and a bias, and during the training of the model, the weights and biases of all neurons are adjusted to minimize a loss function, in our case the mean squared error.

Following common practice, we split the complete data set and reserve 20% for testing. The remaining 80% is split into a training set (90%) and a validation set (10%). During training the model trains on the training set and evaluates on the validation set. The purpose of validation is to monitor overfitting.
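The 20% test / 72% train / 8% validation split described above can be sketched in plain NumPy (Scikit-learn's `train_test_split` offers the same functionality):

```python
import numpy as np

def split_dataset(X, Y, test_frac=0.2, val_frac=0.1, seed=0):
    """Shuffle, reserve test_frac for testing; of the remainder,
    val_frac goes to validation and the rest to training."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(test_frac * len(X))
    test, rest = idx[:n_test], idx[n_test:]
    n_val = int(val_frac * len(rest))
    val, train = rest[:n_val], rest[n_val:]
    return (X[train], Y[train]), (X[val], Y[val]), (X[test], Y[test])
```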
Table 2: Model evaluation on testing data. Mean absolute percentage error for the four model outputs for three different model architectures.

                           2 layers   5 layers   8 layers
Luminosity [%]               8.87       4.01       2.47
Peak luminosity [%]          8.42       3.77       2.29
Horizontal beam size [%]     2.06       0.913      1.02
Vertical beam size [%]       4.74       1.68       1.04

After training is done, i.e. when there are either no improvements or overfitting starts, the model performance is evaluated on the testing data, data never previously seen by the model.

To assess the model performance we compare the predictions to the known outputs in the test data set. As a figure of merit we use the mean absolute percentage error (MAPE). From Figure 3 it is clear that there are a few outliers with low luminosity. These result in large errors which can inflate the MAPE. Therefore, we make a cut in our testing data and only consider cases with luminosities above 5 × 10^33 cm^-2 s^-1.

Figure 6 shows histograms of the performance of the three different models. We tested different numbers of hidden layers (2-9) and the plots show three cases: two, five and eight layers. It is clear that a model with two layers does not have sufficient complexity and gives rather large relative errors. There is a substantial improvement when we increase the number of layers from two to five and eight. Table 2 shows the mean and standard deviation of the absolute percentage error. Eight layers gave the lowest MAPE and there was no improvement from increasing the number of layers beyond eight. We also tried other network architectures such as increasing the number of nodes per layer. In the end it did not seem possible to lower the MAPE further.

The full data set used in the previous section consisted of 450,000 samples, with 360,000 samples used for training and 90,000 samples for testing. To investigate the influence of the number of data points, we train the model with eight layers on training sets of different sizes.
Figure 6: Histograms of the absolute percentage error for the four different outputs: luminosity (top left), peak luminosity (top right), horizontal beam size (bottom left) and vertical beam size (bottom right). Three different models with two, five and eight layers are compared.

Figure 7: Mean absolute percentage error (MAPE) for models trained on different numbers of training samples. Each case is trained ten times; the markers show the mean values and the shaded areas the standard deviation. It is clear that more data is better but with diminishing returns.

Subsets of [1000, 5000, 50000, 100000, ..., 300000] samples are drawn randomly from the full training set. Each case is trained ten times and tested on the same full testing set used in the previous section. Figure 7 shows the mean and standard deviation of the MAPE for the models trained on different numbers of data points. It is clear that more data is better but with diminishing returns. Already at 50,000 data points the model performance approaches the final result. For a model valid for luminosities above 5 × 10^33 cm^-2 s^-1, 50,000 data points seems sufficient. On the other hand, in order to increase the range of the model, more data, and data over a wider range, is likely required.
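The MAPE figure of merit used throughout, including the cut that removes low-luminosity outliers from the test cases, can be sketched as:

```python
import numpy as np

def mape(y_true, y_pred, lumi_true=None, lumi_cut=None):
    """Mean absolute percentage error. If a luminosity column and a cut
    value are given, cases below the cut are excluded to avoid inflating
    the error with low-luminosity outliers."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if lumi_true is not None and lumi_cut is not None:
        mask = np.asarray(lumi_true) > lumi_cut
        y_true, y_pred = y_true[mask], y_pred[mask]
    return 100.0 * np.mean(np.abs((y_pred - y_true) / y_true))
```

For example, `mape([100, 200], [110, 190])` averages the 10% and 5% relative errors to 7.5%.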
The final model consisted of eight fully connected layers in a feedforward configuration, cf. Figure 5. Each layer had 12 neurons that used the rectified linear unit as activation function. The model was trained on the full training set of 360,000 data points using the Adam optimizer [22], an adaptive learning method implemented in TensorFlow, with the mean squared error as loss function. We used a batch size of 50 and trained for 400 epochs. The results from testing are presented in Table 2.

Surely there is potential room for improvement. For instance, the hyperparameter optimization was not done exhaustively and other model architectures might be better. However, finding the optimal model is not within the scope of this paper. Instead, we want to show that the model was good enough to be useful. To assess the final model's usability we compared it to full simulations.
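A sketch of this final architecture and training configuration in Keras, assuming standardized arrays `X_train` and `Y_train` (data loading not shown):

```python
import tensorflow as tf
from tensorflow import keras

def build_final_model(n_hidden=8, n_units=12):
    """Eight fully connected hidden layers of 12 ReLU neurons each,
    mapping 12 sextupole positions to 4 outputs (L, L_p, sigma_x, sigma_y)."""
    inputs = keras.Input(shape=(12,))
    x = inputs
    for _ in range(n_hidden):
        x = keras.layers.Dense(n_units, activation="relu")(x)
    outputs = keras.layers.Dense(4)(x)  # linear output layer for regression
    model = keras.Model(inputs, outputs)
    model.compile(optimizer=keras.optimizers.Adam(), loss="mse")
    return model

# Training as described in the text:
# model.fit(X_train, Y_train, validation_split=0.1, batch_size=50, epochs=400)
```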
To verify the model performance we compare it to the full simulation. A nice feature of the tracking code
PLACET is that it can easily be interfaced with Python. This made it possible to seamlessly load and use the trained machine learning model in our simulations. At each iteration we ran the full beam–beam simulation and at the same time checked the prediction of the machine learning model for comparison.

Figure 8: Example of a random walk optimization on the full simulation.

Figure 8 shows a random walk tuning of the transverse sextupole positions. Random sextupole offsets were added to a perfect machine and using a random walk the luminosity was optimized within 350 iterations. At each iteration the beam distribution was tracked and the luminosity was computed with a full beam–beam simulation. At each step we also checked the prediction of the machine learning model (plotted in solid and dotted lines). Over a range of almost two orders of magnitude there was excellent agreement between the machine learning model and the full simulation. The MAPE in this case was below 4% for luminosities and around 1% for beam sizes. In this case the random walk method always used the luminosity from the full simulation and the machine learning model was only evaluated parasitically. The converse was also tested, where the output of the machine learning model guided the random walk optimization and the full simulation was run parasitically for comparison. Both cases showed similar results.
Figure 9: Grid search of gain and subset size used in the random walk procedure. The points show the average values and the shaded areas the standard deviation. For each setting, 100 random cases were tested 100 times.

Running a single case of the full simulation, i.e. tracking 10 macroparticles and running the full beam–beam simulation, takes a few minutes on a normal desktop computer. Making a single prediction with the machine learning model takes a few milliseconds. Such a speed-up in performance allows for systematic studies of optimization otherwise not feasible. As an example, we use our machine learning model to optimize a simple random walk procedure.

For the sextupoles, there are twelve degrees of freedom: six sextupoles and two transverse directions. We devise a random walk procedure in the following way. Out of the twelve degrees of freedom a subset is selected randomly. In this subset a direction is selected at random, i.e. we specify a random linear combination of the degrees of freedom in the subset. Four points along the selected direction, g · [−1, −0.5, 0.5, 1] with gain g, are checked. If the highest luminosity of the four points exceeds the luminosity from the previous iteration, that point is selected. Otherwise the procedure moves to the next iteration and selects a new subset and direction. The whole procedure is iterated until a certain luminosity is reached. There are two parameters to be optimized: the gain and the size of the subset.

We performed a grid search over gains [0.5, 1.0, 1.5, 2.0, 2.5, 3.0] and subset sizes 1-12 on the eight-layer artificial neural network model. For each setting we tested 100 different initial conditions with random sextupole offsets. For each initial condition we ran the random walk procedure 100 times and recorded the required number of iterations to reach the target luminosity. Figure 9 shows the average number of iterations needed for the different parameter settings. For a smaller subset a higher gain decreases the number of iterations, whereas for a larger subset a smaller gain is preferable. The optimum seems to be a small subset of 2 or 3 and a higher gain.

To verify the optimum parameters from the grid search, we tested the random walk procedure using a full simulation. Figure 10 shows three different cases: subset sizes 2, 6 and 12. For each case, three different gains were tested. The results are in good agreement with the grid search using the machine learning model. Using a higher gain is better when the subset is small, whereas a high gain and a large subset lead to large fluctuations and require more iterations to reach the target.

This random walk procedure was rather simple and could surely be optimized further. In a real machine there are other constraints to consider as well, and in most cases a slow, smoothly varying optimization might be preferred over a quicker procedure with large fluctuations. The point of this exercise was to show the usability of the surrogate model and that results from optimizations on the surrogate model translated to the full simulation.
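The random walk procedure above can be sketched as follows. `surrogate_luminosity` is a hypothetical stand-in for the trained network (a simple analytic peak, not the real model), and the micron-scale step length is an assumed illustration parameter:

```python
import numpy as np

def surrogate_luminosity(positions):
    """Hypothetical stand-in for the neural-network surrogate: luminosity
    peaks when all 12 sextupole positions are at zero offset."""
    return 1.5e34 * np.exp(-np.sum(positions**2) / (2 * (20e-6) ** 2))

def random_walk(positions, target, gain=2.0, subset_size=2,
                max_iter=5000, seed=0):
    """Pick subset_size of the 12 degrees of freedom at random, try four
    points g * [-1, -0.5, 0.5, 1] along a random direction in that subset,
    and keep the best point only if it improves on the current luminosity."""
    rng = np.random.default_rng(seed)
    positions = positions.copy()
    best = surrogate_luminosity(positions)
    factors = np.array([-1.0, -0.5, 0.5, 1.0])
    for it in range(max_iter):
        if best >= target:
            return positions, it
        dofs = rng.choice(len(positions), size=subset_size, replace=False)
        direction = np.zeros_like(positions)
        direction[dofs] = rng.normal(size=subset_size)
        direction *= 1e-6 / np.linalg.norm(direction)  # assumed micron-scale step
        trial = [surrogate_luminosity(positions + g * direction)
                 for g in gain * factors]
        k = int(np.argmax(trial))
        if trial[k] > best:
            positions = positions + gain * factors[k] * direction
            best = trial[k]
    return positions, max_iter
```

Sweeping `gain` and `subset_size` over a grid and averaging the returned iteration counts reproduces the structure of the grid search described above.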
In this paper we presented a surrogate model that maps sextupole offsets to the resulting luminosities and beam sizes in the CLIC final-focus system. We trained an artificial neural network on a large set of simulation data. The model was verified by comparing its predictions to the full simulation, with a mean absolute percentage error of a few percent. We used the model to optimize two parameters of a random walk procedure and compared the results to the full simulation, with good agreement.

We considered only a small subset of the system (transverse sextupole offsets) and the single-beam case. A more complete surrogate model could be realized by adding more imperfections to the system. Naturally, as the degrees of freedom increase, training the model becomes more challenging and may require a larger data set, but the principle remains the same. Nonetheless, as we showed with our example of optimizing the random walk procedure, even a model with limited scope can already be very useful. In our case it helped improve the sextupole alignment procedure and reduced the overall tuning time.

Figure 10: Random walk procedure tested on the full simulation. Three different gains were tested using subset = 2 (top), subset = 6 (middle) and subset = 12 (bottom). A higher gain is better when the subset is small and a smaller gain when the subset is large.

References

[1] M. Wielgosz, A. Skoczen, and M. Mertik, “Using LSTM recurrent neural networks for monitoring the LHC superconducting magnets,” Nucl. Instrum. Meth. A 867, 40-50 (2017).

[2] G. Azzopardi, B. Salvachua, G. Valentino, S. Redaelli and A. Muscat, “Operational results on the fully automatic LHC collimator alignment,” Phys. Rev. Accel. Beams 22, 093001 (2019).

[3] C. Emma, A. Edelen, M. J. Hogan, B. O'Shea, G. White, and V. Yakimenko, “Machine learning-based longitudinal phase space prediction of particle accelerators,” Phys. Rev. Accel. Beams 21, 112802 (2018).

[4] A. Scheinker and S. Gessner,
“Adaptive method for electron bunch profile prediction,” Phys. Rev. ST Accel. Beams 18, 102801 (2015).

[5] Y. B. Kong, M. G. Hur, E. J. Lee, J. H. Park, Y. D. Park, and S. D. Yang, “Predictive ion source control using artificial neural network for RFT-30 cyclotron,” Nucl. Instrum. Meth. A 806, 55-60 (2016).

[6] N. Bruchon, G. Fenu, G. Gaio, M. Lonza, F. A. Pellegrino, and L. Saule, “Free-electron laser spectrum evaluation and automatic optimization,” Nucl. Instrum. Meth. A 871, 20-29 (2017).

[7] A. Scheinker, D. Bohler, S. Tomin, R. Kammering, I. Zagorodnov, H. Schlarb, M. Scholz, B. Beutner and W. Decking, “Model-independent tuning for maximizing free electron laser pulse energy,” Phys. Rev. Accel. Beams 22, 082802 (2019).

[8] A. Scheinker, A. Edelen, D. Bohler, C. Emma and A. Lutman, “Demonstration of model-independent control of the longitudinal phase space of electron beams in the Linac coherent light source with femtosecond resolution,” Phys. Rev. Lett. 121, 044801 (2018).

[9] Y. Li, W. Cheng, L. H. Yu, and R. Rainer, “Genetic algorithm enhanced by machine learning in dynamic aperture optimization,” Phys. Rev. Accel. Beams 21, 054601 (2018).

[10] Y. Li, R. Rainer, and W. Cheng, “Bayesian approach for linear optics correction,” Phys. Rev. Accel. Beams 22, 012804 (2019).

[11] E. Fol, J. M. C. Portugal, G. Franchetti, and R. Tomás, “Optics corrections using Machine Learning in the LHC,” in Proceedings of the 2019 International Particle Accelerator Conference, THPRB077 (2019). doi:10.18429/JACoW-IPAC2019-THPRB077

[12] A.
Edelen et al., “Opportunities in Machine Learning for Particle Accelerators,” arXiv:1811.03172 [physics.acc-ph], 2018.

[13] “A Multi-TeV linear collider based on CLIC technology: CLIC Conceptual Design Report,” edited by M. Aicheler, P. Burrows, M. Draper, T. Garvey, P. Lebrun, K. Peach, N. Phinney, H. Schmickler, D. Schulte and N. Toge, CERN-2012-007.

[14] “Updated baseline for a staged Compact Linear Collider,” edited by P. N. Burrows, P. Lebrun, L. Linssen, D. Schulte, E. Sicking, S. Stapnes, M. A. Thomson, CERN-2016-004 (CERN, Geneva, 2016).

[15] “The Compact Linear Collider (CLIC) – Project Implementation Plan,” edited by M. Aicheler, P. N. Burrows, N. Catalan, R. Corsini, M. Draper, J. Osborne, D. Schulte, S. Stapnes, M. J. Stuart, CERN-2018-010-M.

[16] CLICdp collaboration, “The post-CDR CLIC detector model,” CLICdp-Note-2017-001 (2017).

[17] F. Plassard, A. Latina, E. Marin, R. Tomás and P. Bambade, “Quadrupole-free detector optics design for the Compact Linear Collider final focus system at 3 TeV,”
Phys. Rev. Accel. Beams 21, 011002 (2018).

[18] P. Raimondi and A. Seryi, “Novel Final Focus Design for Future Linear Colliders,” Phys. Rev. Lett. 86, 3779 (2001).

[19] J. Ögren, A. Latina, R. Tomás and D. Schulte, “Tuning of the CLIC 380 GeV Final-Focus System with Static Imperfections,” CERN, Geneva, Switzerland, Tech. Rep. CERN-ACC-2018-0055, Dec. 2018.

[20] The tracking code PLACET, https://clicsw.web.cern.ch/clicsw/.

[21] D. Schulte, “Study of Electromagnetic and Hadronic Background in the Interaction Region of the TESLA Collider,” Ph.D. thesis, DESY-TESLA-97-08, Germany, 1996.

[22] Martín Abadi et al., “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015. Software available from tensorflow.org.

[23] F. Chollet et al., Keras, 2015. https://keras.io

[24] F. Pedregosa et al., “Scikit-learn: Machine Learning in Python,” Journal of Machine Learning Research 12, 2825-2830 (2011).