Enhanced Preconditioner for JOREK MHD Solver
I Holod, M Hoelzl, P S Verma, GTA Huijsmans, R Nies, JOREK Team
Max Planck Computing and Data Facility (MPCDF), Garching, Germany
Max Planck Institute for Plasma Physics, Boltzmannstr. 2, 85748 Garching b. M., Germany
CEA, IRFM, 13108 Saint-Paul-Lez-Durance, France
Eindhoven University of Technology, P.O. Box 513, 5600 MB Eindhoven, The Netherlands
Department of Astrophysical Sciences, Princeton University, Princeton, NJ 08543, USA
Princeton Plasma Physics Laboratory, Princeton, NJ 08540, USA

JOREK Team: refer to the author list of [M Hoelzl, G T A Huijsmans, S J P Pamela, M Becoulet, E Nardon, F J Artola, B Nkonga et al, Nuclear Fusion (submitted)]
Abstract
The JOREK extended magneto-hydrodynamic (MHD) code is a widely used simulation code for studying the non-linear dynamics of large-scale instabilities in divertor tokamak plasmas. Due to the large scale separation intrinsic to these phenomena both in space and time, the computational costs for simulations in realistic geometry and with realistic parameters can be very high, motivating the investment of considerable effort for optimization. The code is usually run with a fully implicit time integration, which allows using large time steps independent of a CFL criterion. This is particularly important due to the fast time scales associated with MHD waves and fast parallel heat transport. For solving the resulting large sparse-matrix system iteratively in each time step, a physics-based preconditioner building on the assumption of a weak coupling between the toroidal harmonics is applied. The solution for each harmonic matrix is determined independently in this preconditioner using a direct sparse matrix library. In this article, a set of developments regarding the JOREK solver and preconditioner is described, which leads to overall significant benefits for large production simulations. This comprises in particular enhanced convergence in highly non-linear scenarios and a general reduction of memory consumption and computational costs. The developments include a faster construction of the preconditioner matrices, a domain decomposition of the preconditioning matrices for solver libraries that can handle distributed matrices, interfaces for additional solver libraries, an option to use matrix compression methods, and the implementation of a complex solver interface for the preconditioner. The most significant development presented consists in a generalization of the physics-based preconditioner to "mode groups", which allows accounting for the dominant interactions between toroidal Fourier modes in highly non-linear simulations. At the cost of a moderate increase of memory consumption, the technique can strongly enhance convergence in suitable cases, allowing the use of significantly larger time steps. For all developments, benchmarks based on typical simulation cases demonstrate the resulting improvements.
Non-linear extended MHD allows describing a huge variety of different large-scale instabilities in magnetically confined fusion plasmas accurately and with lower computational costs than (gyro)kinetic approaches. Numerically, however, there are significant challenges to overcome: large spatial and temporal scale separations need to be bridged. Highly anisotropic heat transport, fast plasma flows and a large scale separation in time render it virtually impossible to simulate realistic parameters without an implicit time integrator.

The following Sections 1.1 to 1.4 provide background on the JOREK code and the sparse matrix solvers, explain the numerical stages executed during a time step, and describe the setup used for the numerical tests. Section 1.5 contains the outline of the main part of the article, which describes the developments implemented before they are tested via a series of benchmarks.
JOREK [1, 2] is a state-of-the-art code for the simulation of large-scale instabilities in tokamaks (magnetic confinement fusion devices). It solves extended reduced [3] or full [4] MHD equations using a 2D iso-parametric G1-continuous finite element discretization combined with a toroidal Fourier decomposition [5]. Main applications are simulations addressing the dynamics of major disruptions, including vertical displacement events and runaway electrons, as well as edge localized modes. The goal is to advance the physics understanding of these potentially very harmful phenomena and to develop reliable control schemes applicable in present and future fusion devices. A free-boundary and resistive-wall extension exists [6] and a framework for kinetic effects is available [7]. The time integration is fully implicit such that a very large – and, due to the properties of the equations, badly conditioned – sparse matrix system needs to be solved in every time step. The present article describes improvements of the computational efficiency of the code by a variety of different approaches enhancing the solver performance. We describe the key properties of the JOREK solver and the sparse matrix system in the following Subsections 1.1.1 and 1.1.2.

Due to the fully implicit time integration used in JOREK, a large sparse matrix system connecting all degrees of freedom needs to be solved in every time step (see Section 1.1.2 for details). Two main options are available for that: when the problem size is small, in particular for axisymmetric cases, a direct sparse matrix solver is applied to the complete system (see Section 1.2 for details on direct sparse matrix solvers). The second option is applied for the large problem sizes typically encountered in real-life applications: the system is solved iteratively with a restarted GMRES algorithm. Since the system is badly conditioned, a physics-based preconditioner needs to be applied: the matrix P containing only the diagonal blocks of the matrix corresponding to the toroidal Fourier modes is used for left preconditioning, while all off-diagonal blocks are neglected. A schematic representation for a small case with the toroidal modes n = (0, 1, 2, 3) is shown in Figure 1. For the preconditioner, linear systems
P y = c need to be solved in each GMRES iteration as well as for the initial guess. Due to the block-diagonal structure of P, the system can be solved independently and in parallel for each toroidal mode with (direct) sparse matrix libraries, which greatly reduces the overall memory consumption and computational costs compared to directly solving the complete system. The preconditioning matrix P can usually be reused for many time steps, thus saving on the expensive LU-factorization. In the linear phase of the physical system's evolution, when the mode amplitudes are small and mode coupling is negligible, the preconditioner matrix P is an almost exact approximation of the full matrix A, such that the iterative solver usually converges in a single iteration. However, when the problem becomes non-linear such that the coupling between toroidal modes cannot be neglected any more, the preconditioner becomes far less efficient, and consequently the GMRES convergence deteriorates. This poor approximation of A by the preconditioning matrix P is one of the main challenges for the efficiency of the JOREK solver. Besides that, the large memory consumption of the LU-decomposed preconditioner matrix blocks and the limited parallel scalability of the direct sparse matrix solvers are additional challenges to be faced. Also the construction of the preconditioner matrix blocks – presently done by extracting them from the global matrix via expensive MPI communication – constitutes a computational bottleneck.

Figure 1: The matrix structure is shown for a simple example case with the toroidal modes n = (0, 1, 2, 3). The left figure shows the structure of the complete matrix A, where the axisymmetric n = 0 part has half as many degrees of freedom as the other modes. The right figure shows the preconditioning matrix P like it is normally used in JOREK. Only the diagonal parts are retained, such that each mode is assumed decoupled from the others in the preconditioner and the sparse diagonal blocks can be solved independently from each other.

In the present work, we address these challenges, leading to a significant step forward in code efficiency: the construction of the preconditioner matrices is performed directly, avoiding communication overhead (Section 2.1), and new solver interfaces are built to exploit advanced features, including a complex representation of the sparse matrices (Section 2.2). Finally, a generalization of the preconditioner is developed, which allows approximating the matrix A more accurately in highly non-linear scenarios (Section 2.3). The performance of the new developments is evaluated in Section 3.

Here we describe the sparse matrix properties for a typical large problem size of 30 thousand grid nodes, 8 physical variables, 4 degrees of freedom per grid node, and 41 toroidal harmonics – counting cosine and sine components separately, this corresponds to the toroidal modes n = (0 ... 20).

Global matrix.
The total dimension of the matrix corresponding to this problem size is about 40 million. All node variables are typically coupled with each other and, additionally, with those of the eight neighboring grid nodes (at special points like the grid axis and the X-point, this may be different; depending on the boundary conditions, the connectivity can also differ there). This leads to about 12 thousand non-zero entries in each matrix row and about 500 billion non-zero entries in the whole matrix, which requires about 4 TB of main memory for storing the double precision floating point numbers. Additional memory is needed for storing the matrix indices, depending on the sparse matrix format used. For the coordinate format using two integer indices of 4 bytes each, the total memory requirement for the matrix storage is about 8 TB. In case of the compressed sparse row (CSR) format, the amount of memory needed to store the location of the entries is reduced by nearly 50%, so in total a bit more than 6 TB would be required. In JOREK, the matrix is constructed in coordinate format and the preconditioner matrices are stored in CSR format.

As shown above, due to the locality of the 2D Bezier basis functions, the global matrix is sparse with only about one out of 3000 matrix entries different from zero in each row or, respectively, column. Since the matrix in our example does not fit into the memory of one typical compute node (100-200 GB), we use domain decomposition to construct the matrix in a distributed way. If in this case we use 84 compute nodes with 4 MPI tasks per node (336 MPI tasks in total), each MPI task is responsible for constructing the matrix contributions of about 90 grid elements. This way, roughly 25 GB of storage are needed per MPI task for the distributed global matrix, i.e., about 100 GB per compute node. The matrix construction on each MPI task is OpenMP parallelized. If we assume 8 OpenMP threads per MPI task, each thread is responsible for creating the matrix contributions of about 10 grid elements (called elementary matrices). The elementary matrices are dense and have a dimension of about 5000 (4 nodes times 4 degrees of freedom times 8 variables times 41 Fourier harmonics), requiring about 200 MB of main memory each. The matrix is constructed using a fast Fourier transform; when only a single toroidal Fourier mode is used (in test simulations), direct integration is applied instead.
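For orientation, these estimates can be reproduced with a few lines of arithmetic; the sketch below (Python, purely illustrative) uses the same assumptions as the text, and the printed values differ from the quoted ones only by rounding:

    # Back-of-the-envelope estimate of the global matrix size for the example above.
    nodes     = 30_000   # grid nodes
    variables = 8        # physical variables
    dofs      = 4        # degrees of freedom per node and variable
    harmonics = 41       # toroidal harmonics (cosine/sine counted separately, n = 0...20)
    coupled   = 9        # a node couples with itself and roughly 8 neighbors

    dimension   = nodes * variables * dofs * harmonics      # ~39.4 million unknowns
    nnz_per_row = variables * dofs * harmonics * coupled    # ~11.8 thousand entries
    nnz_total   = dimension * nnz_per_row                   # ~4.6e11 non-zero entries

    values_tb = nnz_total * 8 / 1e12             # 8-byte values: ~3.7 TB
    coo_tb    = nnz_total * (8 + 2 * 4) / 1e12   # plus two 4-byte indices (COO): ~7.4 TB
    print(f"{dimension/1e6:.1f}M unknowns, {nnz_per_row:.0f} nnz/row, "
          f"{values_tb:.1f} TB values, {coo_tb:.1f} TB in COO format")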
Preconditioner matrix
For the considered example, each block of the preconditioner matrix P has a dimension of about 2 million and about 600 non-zero entries per matrix row, such that each of the block-matrices has about 1 billion non-zero entries, consuming about 12 GB of main memory in CSR format. Each of these blocks corresponds to a particular toroidal mode n and needs to be LU-decomposed in the preconditioning, see Section 1.1.1. All blocks together require about 250 GB of memory. The block-matrix corresponding to n = 0 is smaller by a factor of two in dimension and by a factor of four in the number of non-zero entries (since the n = 0 harmonic consists of a single, axisymmetric component only, while the other modes have cosine and sine components). During the LU-factorization (see Section 1.1.1), the memory consumption increases significantly due to the fill-in.
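To make the block-wise preconditioning concrete, the following toy sketch (Python/SciPy, with small random blocks standing in for JOREK's actual finite-element matrices) mimics the procedure described in Section 1.1.1: each diagonal block of P is LU-factorized once and the stored factorizations are reused to apply the preconditioner within GMRES:

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    rng = np.random.default_rng(0)

    # Toy stand-ins for the per-mode diagonal blocks of P (JOREK blocks are much larger).
    block_sizes = [40, 80, 80, 80]          # e.g. the n=0 block has half the size
    blocks = [sp.random(m, m, density=0.1, random_state=rng) + 10 * sp.eye(m)
              for m in block_sizes]

    # Full system A = block-diagonal part P plus weak off-diagonal mode coupling.
    P = sp.block_diag(blocks, format="csc")
    A = P + 0.05 * sp.random(P.shape[0], P.shape[0], density=0.01, random_state=rng)
    b = rng.standard_normal(P.shape[0])

    # Factorize each block independently (in JOREK: one direct solver call per toroidal mode).
    lu_blocks = [spla.splu(B.tocsc()) for B in blocks]
    offsets = np.cumsum([0] + block_sizes)

    def apply_Pinv(r):
        """Apply P^{-1} block by block, reusing the stored LU factorizations."""
        return np.concatenate([lu.solve(r[i:j])
                               for lu, i, j in zip(lu_blocks, offsets[:-1], offsets[1:])])

    M = spla.LinearOperator(A.shape, matvec=apply_Pinv)
    x, info = spla.gmres(A.tocsc(), b, M=M, restart=30, atol=1e-10)
    print("converged" if info == 0 else f"info={info}", np.linalg.norm(A @ x - b))

The factorizations (the expensive part) are computed once and reused for every preconditioner application, mirroring how JOREK reuses the LU-decomposed blocks over many time steps.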
A linear system of equations B x = b is solved in JOREK using a solver library, where the matrix B corresponds either to the full system A for small cases or to a diagonal block P_i of the preconditioning matrix P. The matrix B is sparse as shown in the previous section: despite its large size due to the numerous degrees of freedom, it consists mostly of zero entries, since interactions are localized in space by the local basis functions. Naively using full matrix factorization algorithms on a sparse system would prohibitively increase the number of operations and the required memory, and therefore special techniques must be employed. We briefly recall here the main aspects of sparse matrix solving and matrix compression techniques, which can further reduce the computational cost and memory requirements. Readers interested in a more detailed treatment are referred to [8] and references therein.

One of the main problems occurring during the LU factorization of a sparse matrix is fill-in, i.e., the number of entries in the LU factors can greatly exceed that of the original matrix B. During the analysis phase, it is therefore attempted to determine the optimal order in which the entries in the L and U matrices should be computed such as to minimize fill-in. The sparse matrix is represented as a graph where each vertex represents an unknown and each edge an interaction, i.e., a connection between nodes i and j is equivalent to B_{i,j} ≠ 0. The factorization is then reflected by an elimination tree traversing this graph. Standard tools to determine the ordering include METIS [9] and SCOTCH [10], which use nested dissection algorithms, where the graph is recursively partitioned until the size is sufficiently reduced to allow for local ordering methods. During the factorization phase, the elimination tree is traversed, starting from the tree leaves up to its root. Upon elimination of a node, all of its parents up the tree must be updated, which can be performed either immediately (right-looking), only when the parent node is factorized (left-looking), or in a combination of the two where contributions from multiple nodes are grouped together (multifrontal). The sparse matrix solvers used in JOREK are the right-looking PaStiX [11, 12] and the multifrontal MUMPS [13, 14] and STRUMPACK [15] (the interface for the latter was developed as part of this work). Finally, during the solution phase, the vector x is obtained by solving the systems L y = b and U x = y through forward and back substitution.

Matrix compression techniques often offer further reductions in memory requirements and/or computational cost for sparse matrix systems. In particular, low-rank matrices may be approximated as B ≈ U · Vᵀ, where U and V are narrow rectangular matrices, such that U and V possess fewer entries than B and allow for faster matrix operations. Although the original matrix B will generally not be low-rank, off-diagonal blocks may well be, such that a subdivision of the matrix into blocks is first determined. It may be hierarchical in nature, as used e.g. in the Hierarchically Semi-Separable (HSS) format [16, 17], which further uses binary trees, or flat, as used e.g. in Block Low-Rank (BLR) compression [18]. The admissibility criterion determining whether a given block should be compressed or not depends on the specific solver at hand.
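The low-rank idea can be illustrated with a generic truncated-SVD example (a minimal sketch; the actual admissibility criteria and data layouts inside the solver libraries differ):

    import numpy as np

    rng = np.random.default_rng(1)

    # An (approximately) low-rank off-diagonal block: outer product of a few vectors plus noise.
    m, n, true_rank = 200, 150, 5
    B = rng.standard_normal((m, true_rank)) @ rng.standard_normal((true_rank, n))
    B += 1e-8 * rng.standard_normal((m, n))   # small perturbation

    # Truncated SVD: keep only singular values above a relative tolerance.
    U_full, s, Vt = np.linalg.svd(B, full_matrices=False)
    k = int(np.sum(s > 1e-6 * s[0]))          # numerical rank
    U = U_full[:, :k] * s[:k]                 # absorb singular values into U
    V = Vt[:k, :].T

    print("rank kept:", k)
    print("relative error:", np.linalg.norm(B - U @ V.T) / np.linalg.norm(B))
    print("fraction of entries stored:", (U.size + V.size) / B.size)

Here U and V play the role of the narrow rectangular factors in the approximation B ≈ U · Vᵀ mentioned above.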
Now that the JOREK solver, the sparse matrix system, and the sparse matrix libraries have been introduced, we identify the major numerical stages performed during a single time step. The stages are abbreviated by one letter in the following to simplify the notation in figures and tables later on. Note that the present article concentrates on the phases after the global matrix construction. Here, F and I are usually most critical for the overall performance.

• G: Global matrix construction.
Done in every time step. Can cost a significant fraction of the overall run time depending on the case. Not a topic of the present work, but addressed in Ref. [19] and more thoroughly in future work.
• P: Extraction of preconditioner matrices.
So far done by extracting them from the global matrix via communication; an alternative option for direct construction was implemented for the present work, as shown in Section 2.1. This phase also includes the time for the conversion of the preconditioning matrix blocks into CSR format and the conversion into complex matrices (where applicable). The last part is typically not negligible, but also not dominating the computational costs.

• A: Analysis of the matrix structure.
Only done in the very first time step of a simulation, since the structure does not change during a simulation, but only the values of the matrix entries. This phase is therefore usually not critical for the overall performance.

• F: LU-factorization of the preconditioner matrices.
Done only in time steps where the preconditioner needs to be updated. Often critical for the overall performance. The relative costs compared to phase I depend on the application and the time step used, but also on the user-defined threshold for updating the preconditioning matrix – the user can influence the frequency of the preconditioner updates to some degree via the choice of the time step and via a threshold.

• I: Iterative solution of the global system, including the solver calls for the preconditioner matrix blocks. Done in every time step. Often critical for the overall performance. The relative costs compared to the preconditioner updates of phase F depend on case and settings (see the explanation above).

The quantities listed in the following are used in this article to describe the benchmark setups and results.

• N_N: Total number of compute nodes used in the simulation
• N_M: Total number of MPI tasks used in the simulation
• M_T: Total memory consumption for the whole simulation (sum over all MPI tasks)
• t_W: Elapsed wall clock time
• t_N ≡ t_W · N_N: Computational cost in node-hours

All benchmarks performed in this work are carried out on the Marconi supercomputer operated by CINECA. The partition Marconi-Fusion, available to the European fusion research community, comprises 2396 compute nodes, each equipped with two 24-core Intel Xeon 8160 CPUs ("Skylake" generation) at a base clock speed of 2.10 GHz and 192 GB DDR4 RAM. The interlink between the compute nodes is provided by Intel OmniPath (100 Gbit/s). We use Intel v.2018.4 compilers with compatible intelmpi and mkl libraries.

1.5 Outline

The rest of the article is organized as follows. Section 2 describes the developments performed as part of the present work. Subsection 2.1 briefly presents an alternative option for constructing the preconditioning matrices, subsection 2.2 explains the new solver interfaces which were implemented as well as the adaptation to complex preconditioning matrices, and subsection 2.3 describes a generalization of the preconditioner, which improves convergence in highly non-linear configurations. In Section 3, previously existing and new solver options are compared based on realistic simulation cases to investigate the performance improvements obtained in the present work. Finally, Section 4 briefly provides a summary, conclusions, and an outlook to future developments.
The global matrix A_glob is constructed in JOREK in a domain-decomposed manner, where each MPI task is responsible for a certain fraction of the computational grid (Section 1.1.2). Originally, the individual mode blocks needed for the preconditioner are extracted from the global matrix via expensive MPI all-to-all communication. The drawback of this approach is a strong increase of the communication time for large problem sizes and an unfavorable scaling with the number of MPI tasks. As part of this work, an option was implemented to calculate the mode blocks directly instead of extracting them from the global matrix, to avoid the communication overhead. Here, a group of MPI tasks is responsible for each toroidal mode, distributing the work through the domain decomposition. This new approach offers a better parallel scalability than the communication-based one (see Section 3.1). The main drawback is that the preconditioner matrix construction needs to be done by direct integration instead of the fast Fourier transform (which is used for the global matrix construction); thus some cases exist where the direct construction provides no gain. Further optimization of the direct preconditioner matrix construction is foreseen for the future. Section 3.1 contains a benchmark for this development.

Prior to this work, JOREK had interfaces to the sparse matrix solvers MUMPS, PaStiX and WSMP. These libraries are used to solve the block-matrix systems in the preconditioner, or optionally the whole matrix system in case of small problem sizes. In order to reduce memory consumption and execution time, we developed a complex-matrix interface to the PaStiX solver and added the new STRUMPACK solver, as described in the following Sections 2.2.1–2.2.2. The new version 6.x of PaStiX has been released recently, providing a major revision of the library. A corresponding JOREK interface has been developed [20], allowing the usage of new features such as Block-Low-Rank compression.

2.2.1 Complex matrix interface to PaStiX 5.x library
PaStiX can be compiled for real or for complex matrices. An interface has been implemented allowing to use the preconditioner block-matrices in the complex format, which reduces the memory consumption considerably. The transformation into complex format is only possible when the matrix has the appropriate symmetry structure. A system of equations with the structure

\begin{pmatrix} a & -b \\ b & a \end{pmatrix} \cdot \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} c \\ d \end{pmatrix}    (1)

can be replaced by

(a + ib) \cdot (x + iy) = c + id.    (2)

Due to some interactions between the n = 0 toroidal mode and the n > 0 modes, the block structure deviates slightly from the form (1), such that the matrix first needs to be symmetrized according to

\begin{pmatrix} a & -b \\ c & d \end{pmatrix} \rightarrow \begin{pmatrix} (a + d)/2 & -(b + c)/2 \\ (b + c)/2 & (a + d)/2 \end{pmatrix}.    (3)

This symmetrized matrix can now be replaced by a complex form, analogously to (2). As will be shown later, the approximation introduced by this symmetrization affects the efficiency of the preconditioner only very mildly (small impact on the number of GMRES iterations), such that the overall benefit of using complex matrices prevails. Direct construction in complex format instead of conversion could enhance the performance further, but is left for future work. Besides switching to complex input, nothing else was changed in the PaStiX 5 interface. This implies that the support for "multiple degrees of freedom", which this solver library offers, is still usable. This feature allows JOREK to specify the size of small dense blocks in the sparse structure, which share the same connectivity and thus can be handled in an identical way in the analysis phase of the solver. A complex matrix interface for STRUMPACK will be implemented soon as well. Tests of the performance of the real matrix interface are presented in Section 3.3.

The STRUctured Matrix PACKage – STRUMPACK v.5.0.0 [15] – was identified as a promising additional solver library to compare with the existing solvers in JOREK. Key features motivating our choice are the capability of handling distributed matrix input, the flexible MPI/OpenMP parallelization including GPU compatibility, and the various options (such as BLR and HSS) for compressing the dense frontal matrices. For additional flexibility, as a first step, we developed our own Fortran-C++ interface for STRUMPACK offering routines for initializing, passing the matrix, reordering, factorizing, solving, and finalizing. The sparse matrix is passed to STRUMPACK in a row-distributed fashion using the compressed sparse row (CSR) format. The conversion from the coordinate format used in JOREK for the construction into the CSR format has to additionally order the column indices and remove the duplicates resulting from the domain decomposition to obtain a valid CSR representation.
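As a quick plausibility check of the real-to-complex mapping of Eqs. (1)–(2), the small sketch below (Python purely for illustration, with arbitrary sample numbers; JOREK itself is written in Fortran) solves the same 2×2 system in both representations:

    import numpy as np

    # Real 2x2 block form of Eq. (1): [[a, -b], [b, a]] @ [x, y] = [c, d]
    a, b = 2.0, 0.5          # arbitrary sample coefficients
    c, d = 1.0, -3.0         # right-hand side

    A = np.array([[a, -b],
                  [b,  a]])
    x, y = np.linalg.solve(A, np.array([c, d]))

    # Complex scalar form of Eq. (2): (a + ib) * (x + iy) = c + id
    z = (c + 1j * d) / (a + 1j * b)

    # Both representations give the same solution (up to round-off).
    assert np.allclose([x, y], [z.real, z.imag])
    print(x, y, z)

The complex representation halves the matrix dimension while keeping the same information, which is the source of the memory and factorization-time savings reported in Section 3.3.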
Figure 2: Setup for mode groups denoted as Overlapping I. Schematic illustration of the linear combination of preconditioner matrices which combine the modes (0,1), (1,2), (2,3), (0), (3) into diagonal blocks. The resulting coverage of elements from the original matrix is shown on the right, with the lighter shade representing a factor 1/2.

In this section, we describe the most significant development performed within the present work. We explain an approach for improving the efficiency of the preconditioner in highly non-linear scenarios, where the coupling between the toroidal Fourier modes becomes strong, causing the GMRES convergence to deteriorate massively or break down entirely. The approach followed here is to combine several toroidal modes into "mode groups". The interaction between the modes within a mode group is retained in the preconditioning matrix P. Consequently, larger block-matrices need to be solved than in the original preconditioner, which makes the solution more expensive in terms of memory consumption and computing time. Nevertheless, capturing the non-linear interaction can improve the convergence of the iterative solver so strongly that the overall performance is significantly enhanced, as demonstrated in Section 3.4.

Figure 3: Setup for mode groups denoted as Overlapping II. Schematic illustration of the linear combination of preconditioner matrices which combine the modes (0,1), (1,2), (2,3) and subtract the (1), (2) modes, which have overlapping contributions. Thus, the interaction between neighboring modes is fully accounted for in the preconditioner.

Figures 2 and 3 illustrate the approach by two examples. In the first case, referred to as Overlapping I (Figure 2), we combine every two neighboring modes into one block and solve each block independently. The preconditioner solution is constructed by placing the solutions of the individual blocks in the corresponding rows of the full solution vector. Solutions from the overlapping blocks are summed up with a factor 1/2. When comparing to the standard approach, where we solve N single-mode blocks, the new method requires the solution of N + 1 blocks of larger size, where N is the total number of toroidal modes; N = 4 in the present example. In the described approach we effectively take into account the interactions of neighboring modes, which physically resembles the mode coupling observed in turbulence cascading.

Since in the Overlapping I method the neighboring mode coupling is effectively taken into account only with a factor of 1/2, we consider an alternative method referred to as Overlapping II (Figure 3). Here we also combine the neighboring modes into blocks, but instead of multiplying the corresponding contributions to the full preconditioner solution by 1/2, we subtract the solutions for the individual modes (except the first and the last one, which don't overlap). Thus, here we have to solve 2N − 3 blocks in total: the N − 1 combined neighbor blocks and the N − 2 single-mode blocks that are subtracted (in the example of Figure 3, the single-mode blocks of n = 1 and n = 2). The implementation in JOREK is done as follows:

• Via the namelist input file, arbitrary mode groups can be constructed in a flexible way. Each mode group forms a diagonal block in the preconditioner matrix.
• The number of MPI tasks for each mode group can be assigned automatically according to the number of non-zero entries in each mode group, or chosen manually via the input file. This allows optimizing the load balance for non-uniform matrix block sizes.
• Mode groups can be overlapping. When reconstructing the solution of the whole system P y = c as a linear combination of the individual solutions from the mode-group blocks, weighting factors are applied (see the sketch following this list). For example, the factor is 1/2 in the Overlapping I case and ±1 in the Overlapping II example.

The mode-group methods are benchmarked against the standard preconditioner in Section 3.4.
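The weighted reassembly of the mode-group solutions can be sketched with a small toy example (Python/SciPy, with random diagonally dominant blocks instead of JOREK matrices; the groups and the 1/2 weights correspond to the Overlapping I setup of Figure 2, while Overlapping II would use weights +1 for combined blocks and −1 for the subtracted single-mode blocks):

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    rng = np.random.default_rng(2)

    # Unknowns grouped contiguously by toroidal mode n = 0, 1, 2, 3 (toy sizes).
    sizes = [20, 40, 40, 40]
    offsets = np.cumsum([0] + sizes)
    idx = [np.arange(lo, hi) for lo, hi in zip(offsets[:-1], offsets[1:])]

    # Toy stand-in for the full matrix A and a right-hand side c.
    A = (sp.random(offsets[-1], offsets[-1], density=0.05, random_state=rng)
         + 10 * sp.eye(offsets[-1])).tocsr()
    c = rng.standard_normal(offsets[-1])

    # Overlapping I: mode groups and the weights used when summing partial solutions.
    groups  = [(0,), (0, 1), (1, 2), (2, 3), (3,)]
    weights = [0.5,  0.5,    0.5,    0.5,    0.5]   # every mode is covered exactly twice

    y = np.zeros_like(c)
    for group, w in zip(groups, weights):
        rows = np.concatenate([idx[m] for m in group])
        block = A[rows, :][:, rows]                  # diagonal block of this mode group
        y[rows] += w * spla.spsolve(block.tocsc(), c[rows])

    # y approximates A^{-1} c; the residual indicates the quality of the preconditioner.
    print(np.linalg.norm(A @ y - c) / np.linalg.norm(c))

In JOREK the block solves additionally reuse stored LU factorizations and are distributed over MPI task groups, but the weighting logic when reassembling the full solution vector is the same.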
Now that we have introduced, in the previous section, the developments performed to improve the numerical efficiency of the JOREK solver and preconditioner, a set of real-life simulation cases is taken as the basis to evaluate the performance of the methods. Each case is briefly explained in the respective subsection. To save computational time, not all solver options are tested for all cases. Note that most developments are already used in production by the JOREK community at the time of submission of this article.
Here, we compare the performance and parallel scalability of the preconditioner matrix assembly via communication and by direct construction. The benchmark is based on a simplified simulation of edge localized modes (ELMs) in a JET-like geometry, i.e., an X-point plasma with open and closed field lines included in the simulation domain. A fairly high poloidal grid resolution of 51k Bezier finite elements is used, while only the toroidal mode numbers n = (0, 4, 8) are included in the simulation. Tests for this case are performed both in the linear phase, where the exponentially growing instabilities are too small to modify the axisymmetric n = 0 component, and in the non-linear phase during the collapse of the edge pressure gradient caused by the instability.

The simulations are performed on 6 to 96 compute nodes (N_N) with 2 MPI tasks per node and 24 OpenMP threads per MPI task. Simulation timing results are shown in Table 1. With 6 compute nodes, the direct construction is two times more efficient than the communication-based assembly and scales well with a further increase of the number of compute nodes. The communication-based assembly exhibits poor scaling with the number of MPI tasks, as expected. Nevertheless, it is worth mentioning that in realistic simulations one does not use such a large number of compute nodes due to the limited scalability of the solver; hence the performance of the direct construction should be further optimized in the future.

N_N    communication    direct construction
6      46.07            21.34
12     40.55            10.96
24     38.25            5.64
48     39.48            3.12
96     35.94            1.76

Table 1: Measurements for the preconditioning matrix assembly (phase P). The wall-clock time t_W (sec) spent in assembling the distributed preconditioner block-matrices using communication and direct construction is compared as a function of the number of compute nodes N_N.

To measure the parallel scalability and the relative performance of the PaStiX 5.x and STRUMPACK solvers, we consider the same case as in the previous section. Two different resolutions with 13745 poloidal grid nodes (LR = low resolution) and 51470 grid nodes (HR = high resolution) are considered here. First, we compare the performance of both the PaStiX and STRUMPACK libraries for the LR case with n = (0, 4, 8) to avoid too strict memory-per-node constraints, which would limit the flexibility in the number of MPI tasks per node (in particular for PaStiX). We consider 6 compute nodes and scan the number of MPI tasks per node, while adjusting the number of OpenMP threads accordingly. The results are shown in Figures 4-6. As we can see, the STRUMPACK solver shows considerably better efficiency at a larger number of MPI tasks per node, while the PaStiX performance is less sensitive to this parameter. For the LU factorization, PaStiX shows unfavorable scaling with more than four MPI tasks per node. Besides the solver performance, an important factor for choosing the optimal parameter regime is the time spent on constructing the global sparse matrix. As can be seen from Figure 7, it is most efficient at 4-8 tasks per node for this case. We observe the same trend for the full-size simulation as well. Thus, the decision on the number of MPI tasks per node depends on the ratio of the time spent on constructing the global matrices to the time spent on GMRES iterations. Finally, due to the ability to provide the distributed matrix directly to the solver and due to better memory efficiency, the maximum memory utilization can be significantly lower with the STRUMPACK library, as shown in Figure 8. The difference in memory consumption partially results from the fact that PaStiX 5.x requires each responsible MPI task to hold a copy of the full preconditioner block-matrix, while the matrix is row-wise distributed across the MPI tasks in the case of STRUMPACK. The higher memory consumption when running with the PaStiX library may limit the ability to choose the optimal number of MPI tasks per node.

For the HR case, i.e., the production-size ELM case with n = (0, 4, 8), the size of the global matrix requires using at least six compute nodes with at least two MPI tasks per node to avoid integer overflow when constructing the global matrix. For optimal performance, it is beneficial to increase the number of MPI tasks per node; however, due to the higher memory consumption of the PaStiX solver (Figure 9, left panel), we are limited to two MPI tasks per node with PaStiX. With STRUMPACK, we can use up to eight MPI tasks per node.
Figure 4: Time spent on LU factorization (F) in the LR simulation case using 6 compute nodes.
Figure 5: Time spent on solve (part of phase I) in the LR simulation case using 6 compute nodes.
Figure 6: Time spent on each consecutive GMRES iteration (part of phase I) in the LR simulation case using 6 compute nodes.

Figure 7: Time spent on constructing the global sparse matrix (G) in the LR simulation case using 6 compute nodes.
Figure 8: Peak memory consumption M_T in the LR simulation case using 6 compute nodes.

The corresponding memory consumption when using the STRUMPACK solver is shown in Figure 9, right panel.

Phase                             PaStiX    STRUMPACK
A  Analysis                       20.5      87.9
F  Factorization                  136.4     148.7
I  Solve                          0.74      0.53
   First GMRES iteration          4.65      3.41
   Consecutive GMRES iteration    1.36      1.08

Table 2: Wall-clock time t_W (sec) spent on different solver phases in the HR simulation using N_N = 6 compute nodes with N_M/N_N = 2 MPI tasks per node (PaStiX) and N_M/N_N = 8 MPI tasks per node (STRUMPACK).

In Table 2, we show the elapsed wall-clock times for the comparison between the PaStiX and STRUMPACK solvers for the HR case. For simulations in X-point geometry, STRUMPACK needs to perform a permute-and-scale procedure to put large entries on the diagonal. This can increase the analysis time (A) significantly. On the other hand, for PaStiX, a column scaling needs to be performed in JOREK, for which the corresponding time is not included here, as it is part of the preconditioner matrix assembly (P). In the case considered here, PaStiX performs 9% faster in the numerical factorization (F). Meanwhile, STRUMPACK shows better performance in the solve phase (I), with an improvement of 36% in the first GMRES iteration and 26% in each consecutive iteration (note that the actual solve is shown separately in the table for reference, but is part of the GMRES iterations). Taking this into account, in a simulation with 20 GMRES iterations per time step, the overall performance of both solvers becomes comparable when the preconditioner update (i.e., a re-factorization) takes place every four time steps. For more frequent updates the PaStiX solver is preferable; if preconditioner updates are required less frequently, STRUMPACK provides better performance.

To address the performance scaling with the number of processors (strong scaling), we have simulated the LR case with the n = (0, 4, 8) toroidal modes using 3, 6 and 12 compute nodes with 8 MPI tasks per node using the STRUMPACK library. The results are presented in Table 3. It is shown that the sparse matrix construction and solve tasks scale relatively well, with speed-up factors between roughly 1.4 and 1.75 when doubling the number of compute nodes.

Figure 9: Cumulative plot of memory utilization per individual MPI task in the HR simulation case with PaStiX using N_N = 6, N_M = 12 (left panel) and STRUMPACK using N_N = 6, N_M = 48 (right panel).

Compute nodes                     3       6       12
G  Global Matrix Construction     2.74    1.57    1.04
I  Solve                          0.21    0.14    0.10
   First GMRES iteration          1.28    1.0     0.84
   Consecutive GMRES iteration    0.34    0.25    0.20

Table 3: Wall-clock time t_W (sec) spent on solver tasks for different numbers of compute nodes using STRUMPACK while keeping the problem size fixed (strong scaling).

To assess the weak scaling, in addition to n = (0, 4, 8) we consider the higher toroidal mode numbers n = (0, 2, 4, 6, 8) and n = (0, 1, 2, ..., 8) here. For these simulations we have used 6, 10 and 18 compute nodes, respectively (i.e., 2 compute nodes per toroidal mode), with 8 MPI tasks per node. Since the preconditioner matrix block sizes remain the same for all three cases, the time spent on the analysis, LU factorization, and solve tasks remains the same, while the GMRES communication time increases with the number of toroidal modes. The corresponding simulation results are shown in Table 4. With a higher number of toroidal modes, the time spent on constructing the global matrix remains roughly constant, since the number of matrix entries per MPI task does not change significantly. Meanwhile, the time spent on GMRES iterations increases linearly due to the increased communication time. The overall performance scaling depends on the relative contributions of the matrix construction time and the GMRES iteration time.

Toroidal modes                    (0,4,8)   (0,2,...,8)   (0,1,...,8)
G  Global Matrix Construction     1.57      1.13          1.13
I  First GMRES iteration          1.0       1.46          2.61
   Consecutive GMRES iteration    0.25      0.37          0.61

Table 4: Wall-clock time t_W (sec) spent on solver tasks for different numbers of toroidal modes using STRUMPACK while increasing the toroidal resolution accordingly (weak scaling).
To compare the performance of the real and complex PaStiX solvers, we have considered the same case as in Section 3.1. Table 5 contains data from the simulations restarted in the highly non-linear regime for a single time step on 6 and 3 compute nodes with 2 MPI tasks per node. For the given problem size, the real (PaStiX) solver needs at least 6 compute nodes based on the memory requirement. The complex (PaStiX) solver, on the other hand, can be executed on 3 compute nodes due to the reduced memory consumption.

PaStiX solver                     real      complex   complex
Number of compute nodes           6         6         3
A  Analysis                       22.71     14.98     11.50
F  Factorization                  133.38    53.40     81.27
I  Solve                          1.22      0.83      1.46
   Number of GMRES iterations     25        26        26
   GMRES solve                    42.49     34.31     47.24

Table 5: Wall-clock time t_W (sec) spent on solver phases using 6 compute nodes for the real PaStiX solver and 6 resp. 3 compute nodes for the complex PaStiX solver.

In Table 5, we compare the wall-clock time t_W (sec) spent on the solver phases using 6 compute nodes for the real PaStiX solver and 6 respectively 3 compute nodes for the complex PaStiX solver. Due to the reduced matrix dimension, the factorization (F), as one of the most expensive parts of the solve step, is roughly 2.5 and 3.3 times faster (in terms of the computational cost t_N = t_W · N_N) when using the complex solver on 6 and 3 compute nodes, respectively, compared to the real solver. The analysis (A) is also done more efficiently with the complex solver; however, being performed only in the first time step, it adds limited benefits for a typical simulation. Moreover, one can notice from Table 5 that, although the complex solver requires one additional GMRES iteration compared to the real solver, the GMRES solve (I) on 6 respectively 3 compute nodes is more efficient by a factor of 1.2 and 1.8, again in terms of t_N, as compared to the real solver on 6 compute nodes.

Figure 10 shows the time-lines of the memory consumption for the real solver (with 6 compute nodes) and the complex solver (with 6 and 3 compute nodes). It is apparent that the maximum memory consumption when using the complex solver with 6 nodes is roughly 600 GB, which is less than the roughly 700 GB in the case of the real solver. The memory consumption is even lower in the case of the complex solver using 3 compute nodes. Thus, we confirm that the complex solver reduces the overall memory consumption significantly. It is worth mentioning that we observe a larger reduction in memory consumption for a higher number of toroidal modes.

Figure 10: Cumulative plot of memory utilization per individual MPI task in the HR simulation case with the real (left panel) and complex (middle panel) PaStiX solver using 6 compute nodes, and the complex PaStiX solver using 3 compute nodes (right panel).

In conclusion, it can be seen from Table 5 that the overall performance gain by the complex solver is moderate when the memory consumption imposes the same minimum number of compute nodes for the real and complex solvers and when updates of the preconditioning matrix (factorizations) are needed seldomly. On the other hand, if the preconditioning matrix is updated more often, the complex solver may reduce the total computational cost (including matrix construction and preconditioner matrix assembly, which are not explicitly discussed in this section) by a third. When the reduced memory consumption of the complex solver allows using a lower number of compute nodes, the overall computational costs can be approximately halved due to the limited scalability of the solver.
The simulation considered here is a slightly adapted version of the simulations presented in Ref. [21]. It represents a vertically unstable plasma (vertical displacement event, VDE) that develops violent non-axisymmetric (n ≠ 0) instabilities while moving into the plasma facing components. We have chosen this case since the standard iterative solver convergence is poor, forcing the use of extremely small time steps. It is suspected that the strong non-linearity of the problem reduces the accuracy with which the preconditioning matrix P approximates the complete system A, such that condition number and convergence deteriorate.

We have used this VDE case to test the new preconditioner based on the overlapping mode groups. For this purpose we run simulations in the non-linear regime and compare the "standard" preconditioner (Fig. 1, right) with the two new approaches based on overlapping mode groups (Figs. 2 and 3).

Phase                             Standard   Overlapping I   Overlapping II
A  Analysis time (s)              27.0       99.5            99.2
F  Factorization time (s)         21.1       123.9           123.8
I  Solve time (s)                 0.12       0.34            0.34
   GMRES iteration time (s)       47.3       21.3            16.7
   Number of GMRES iterations     130        43              32

Table 6: Comparison of solver performance for the n = (0, 1, 2, 3) VDE case with ∆t = 0.05, using 8 compute nodes.

Phase                             Standard   Overlapping I
A  Analysis time (s)              24.6       126.5
F  Factorization time (s)         22.2       147.4
I  Solve time (s)                 0.13       0.40
   GMRES iteration time (s)       87.8       42.8
   Number of GMRES iterations     97         45

Table 7: Comparison of solver performance for the n = (0, 1, 2, ..., 10) VDE case with ∆t = 0.05, based on 10 time steps run with the STRUMPACK solver using 22 compute nodes.

As shown in Table 6, the
Overlapping I and Overlapping II preconditioners reduce the number of GMRES iterations by a factor of 3 and 4, respectively, leading to a reduction of the whole GMRES cycle time by a factor of 2.2 and 2.8, respectively. Meanwhile, the factorization time increases by a factor of 6. Thus, the overall performance strongly depends on how often preconditioner updates are required. A similar improvement trend with the Overlapping I method is observed for the n = (0, 1, 2, ..., 10) VDE case, as shown in Table 7. Here we do not use the Overlapping II preconditioner due to its higher computational cost.

The analysis, factorization and solve times with the new preconditioner are determined by the corresponding times spent on the largest diagonal block, which is twice the size of the standard case. Since each diagonal block-matrix is factorized and solved independently, the difference in matrix sizes introduces a strong workload imbalance. To address this, we have implemented a flexible distribution of MPI tasks to each mode group. For example, in the current simulation using the Overlapping I method, we have five mode groups and a total of 32 MPI tasks. The groups consist of the following toroidal modes: (0), (0,1), (1,2), (2,3), (3). The block size corresponding to the n = 0 mode is half that of any other mode, and its number of non-zero entries is 4 times smaller. The preconditioner block-matrix sizes for each group are shown in Table 8.

Mode sets     (0)          (0,1)         (1,2)         (2,3)         (3)
Matrix rank   200,767      602,301       803,068       803,068       401,534
nnz           52,808,329   475,274,961   844,933,264   844,933,264   211,233,316

Table 8: Ranks and numbers of non-zero entries of the preconditioner matrices for the n = (0, 1, 2, 3) VDE case using the Overlapping I preconditioning method.

There are different ways to distribute the available 32 MPI tasks among the mode groups. For instance, distributing them as (2,6,9,9,6) would increase the factorization time to 147.6 sec, as compared to 123.9 sec in the case of a (1,4,12,12,3) tasks-per-group distribution. Here one should take into account that the most computationally expensive factorization has a different MPI task scaling compared to the GMRES iteration cycle; thus the task distribution for balanced performance depends on the relative contributions of the total factorization time and the total iteration time. However, as a rule of thumb, one can distribute tasks roughly proportionally to the number of non-zero entries in each mode group (a small illustration of this rule of thumb follows after this paragraph). In the n = (0, 1, 2, 3) case we have assigned 1 and 3 MPI tasks to the single-mode groups of the (0) and the (3) modes, respectively, 4 tasks to the (0,1) group, and 12 tasks to each of the (1,2) and the (2,3) groups. In the n = (0, 1, 2, ..., 10) case, 2 MPI tasks are assigned to the smallest mode group of the (0) mode, 6 MPI tasks are assigned to the mode group of the (10) mode, and the overlapping mode groups get 8 MPI tasks each.
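The rule of thumb mentioned above can be illustrated with a small helper (Python, a hypothetical illustration only – the rounding scheme is not the actual JOREK implementation; the nnz values are those of Table 8):

    # Hypothetical helper: distribute MPI tasks among mode groups roughly
    # proportionally to the number of non-zero entries per group (values from Table 8).
    nnz = {"(0)": 52_808_329, "(0,1)": 475_274_961, "(1,2)": 844_933_264,
           "(2,3)": 844_933_264, "(3)": 211_233_316}
    total_tasks = 32
    total_nnz = sum(nnz.values())

    # Proportional share, but at least one task per group.
    tasks = {g: max(1, round(total_tasks * w / total_nnz)) for g, w in nnz.items()}

    # Correct the rounding so that the counts add up to the available total.
    while sum(tasks.values()) != total_tasks:
        err = {g: total_tasks * nnz[g] / total_nnz - tasks[g] for g in nnz}
        if sum(tasks.values()) < total_tasks:
            tasks[max(err, key=err.get)] += 1
        else:
            tasks[min(err, key=err.get)] -= 1

    print(tasks)   # -> {'(0)': 1, '(0,1)': 6, '(1,2)': 11, '(2,3)': 11, '(3)': 3}

This proportional starting point, (1, 6, 11, 11, 3), is close to but not identical with the (1, 4, 12, 12, 3) distribution used in the text, which was additionally tuned because the factorization and the GMRES iterations scale differently with the number of tasks per group.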
Since the preconditioning matrix does not need to be updated (factorized) in every time step, the efficiency of the new preconditioner is determined by the actual frequency of these updates. In our example, the Overlapping I preconditioner would lead to the same overall performance if the preconditioner is updated every five time steps. This break-even number would be slightly higher in practice, since the time to construct the preconditioner matrices is also slightly higher for the new preconditioner. In the non-linear phase of the VDE simulation with a normalized time step of 0.05, the preconditioner matrix updates need to be performed only approximately every 50 time steps, which gives the new preconditioners a strong advantage.

Meanwhile, as the preconditioner now approximates the total system much better, it becomes possible to increase the simulation time step, allowing for an even more efficient way of running the simulation. For the same non-linear case, good overall performance was obtained when increasing the time step by a factor of 3, both for the
Overlapping I and
Overlapping II preconditioners, leading to a total t_W spent on simulating 100 steps using 8 compute nodes of 104 min and 95 min, respectively. Comparing this to the 97 min required to perform 100 steps with ∆t = 0.05 using the standard preconditioner, an approximate speed-up by a factor of 3 is obtained. The same conclusion holds for the VDE case with a higher toroidal resolution including the n = (0, ..., 10) toroidal modes, which was tested as well.

The main disadvantage of the new preconditioner lies in the increased memory consumption. As we can see in Figure 11, for the VDE case with n = (0, 1, 2, 3) the maximum required memory with the new preconditioner was twice as large as with the standard one. With the direct construction of the preconditioning matrix, the initial peak around t = 50 s would disappear, such that the maximum memory utilization lies in the "solve" phase and the difference between both methods is reduced to about 50%, as we observe in Figure 12 for the n = (0, 1, 2, ..., 10) case. Combining mode groups and the complex solver will be investigated in the future as a measure for reducing memory consumption.

Figure 11: Cumulative memory consumption in the n = (0, 1, 2, 3) VDE case simulation with the standard preconditioner (left panel) and the Overlapping I preconditioner (right panel) using 8 compute nodes with 4 MPI tasks per node.

Figure 12: Cumulative memory consumption in the n = (0, 1, ..., 10) VDE case simulation with the standard preconditioner (left panel) and the Overlapping I preconditioner (right panel) using 22 compute nodes with 4 MPI tasks per node.

The presented results demonstrate just one example of using the new preconditioner. With the provided implementation flexibility, there is room for further performance improvements, e.g., by adjusting the work load balance and data locality, or by optimizing the GMRES preconditioner update frequency versus the simulation time step.
Summary
In this article, we have described the iterative solver with the physics-based preconditioner (PC) used in the non-linear MHD code JOREK, the properties of the sparse matrix system, and the interfaces to parallel sparse matrix solver libraries. We have also explained some limitations faced with this solver, in particular regarding memory consumption and convergence in highly non-linear scenarios.

For the PaStiX 5.x library, a new interface was developed which allows using complex numbers in the preconditioner, leading to improved performance and memory utilization. The reduced memory consumption with the complex preconditioner allows operating the code with fewer nodes and thus allows shifting the point of operation towards the lower (ideal) part of the strong scaling curve.

The STRUMPACK library has been added to JOREK as a new solver option, providing several performance benefits, such as the ability to use distributed matrices, more efficient memory utilization, and an overall solver speed-up.

A new direct method for constructing the preconditioner matrices in a distributed fashion has been implemented, which avoids expensive all-to-all MPI communication. This results in a potentially significant improvement of performance, depending on the case considered.

Finally, the most important development is the generalization of the preconditioner to "mode groups". The new method allows capturing some of the non-linear mode coupling in the PC by combining several toroidal modes in the diagonal matrix blocks, which can be solved independently. Such a preconditioner resembles the original system much better, thus leading to a major improvement of the GMRES convergence and, consequently, to a significant reduction of the overall run time in the non-linear simulation regime. Besides the benefits for tokamak simulations, the mode groups will also be an essential building block to make future JOREK simulations of stellarator cases [22, 23] possible, where instabilities do not linearly decouple in the toroidal modes any more.
Acknowledgements
The authors would like to thank Omar Maj for fruitful discussions; FJ Artola, A Cathey, F Wieschollek, I Krebs, and R Ramasamy for providing "real-life" test cases; and Mathieu Faverge and the rest of the PaStiX team for their assistance with PaStiX 6. One of the authors (I.H.) would like to thank Sebastian Ohlmann (MPCDF) and Pieter Ghysels (LBNL) for technical advising.

Part of this work has been carried out within the framework of the EUROfusion Consortium and has received funding from the Euratom research and training program 2014-2018 and 2019-2020 under grant agreement No 633053. The views and opinions expressed herein do not necessarily reflect those of the European Commission. Some of the work was performed using the Marconi-Fusion supercomputer.

References

[1] G.T.A. Huysmans and O. Czarny. MHD stability in X-point geometry: simulation of ELMs. Nuclear Fusion, 47(7):659, 2007. doi:10.1088/0029-5515/47/7/016.
[2] M Hoelzl, G T A Huijsmans, S J P Pamela, M Becoulet, E Nardon, F J Artola, B Nkonga, C V Atanasiu, V Bandaru, A Bhole, D Bonfiglio, A Cathey, O Czarny, A Dvornova, T Feher, A Fil, E Franck, S Futatani, M Gruca, H Guillard, J W Haverkort, I Holod, D Hu, S K Kim, S Korving, I Krebs, G Latu, F Liu, P Merkel, D Meshcheriakov, S Mochalskyy, J A Morales, R Nies, N Nikulsin, F Orain, J Pratt, R Ramasamy, P Ramet, C Reux, N Schwarz, P Singh Verma, S Smith, C Sommariva, E Strumberger, D van Vugt, M Verbeek, E Westerhof, F Wieschollek, J Zielinski, and the JOREK Team. The JOREK non-linear extended MHD code and applications to large-scale instabilities and their control in magnetically confined fusion plasmas. Nuclear Fusion, submitted, 2020. Preprint at https://arxiv.org/abs/2011.09120.
[3] F. Orain, M. Bécoulet, G. Dif-Pradalier, G. Huijsmans, S. Pamela, E. Nardon, C. Passeron, G. Latu, V. Grandgirard, A. Fil, A. Ratnani, I. Chapman, A. Kirk, A. Thornton, M. Hoelzl, and P. Cahyna. Non-linear magnetohydrodynamic modeling of plasma response to resonant magnetic perturbations. Physics of Plasmas, 20(10):102510, 2013. doi:10.1063/1.4824820.
[4] S. J. P. Pamela, A. Bhole, G. T. A. Huijsmans, B. Nkonga, M. Hoelzl, I. Krebs, and E. Strumberger. Extended full-MHD simulation of non-linear instabilities in tokamak plasmas. Physics of Plasmas, 27(10):102510, 2020. doi:10.1063/5.0018208.
[5] Olivier Czarny and Guido Huysmans. Bezier surfaces and finite elements for MHD simulations. Journal of Computational Physics, 227(16):7423–7445, 2008. ISSN 0021-9991. doi:10.1016/j.jcp.2008.04.001.
[6] M Hoelzl, P Merkel, G T A Huysmans, E Nardon, E Strumberger, R McAdams, I Chapman, S Günter, and K Lackner. Coupling JOREK and STARWALL codes for non-linear resistive-wall simulations. Journal of Physics: Conference Series, 401:012010, dec 2012. doi:10.1088/1742-6596/401/1/012010.
[7] Daniël Cornelis van Vugt. Nonlinear coupled MHD-kinetic particle simulations of heavy impurities in tokamak plasmas. PhD thesis, Technische Universiteit Eindhoven, Department of Applied Physics, 7 2019. URL https://research.tue.nl/en/publications/nonlinear-coupled-mhd-kinetic-particle-simulations-of-heavy-impur.
[8] Théo Mary. Block Low-Rank multifrontal solvers: complexity, performance, and scalability. PhD thesis, University of Toulouse, France, 11 2017. URL http://mumps.enseeiht.fr/doc/Thesis_TheoMary.pdf.
[9] George Karypis and Vipin Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1):359–392, 1998. doi:10.1137/S1064827595287997.
[10] François Pellegrini and Jean Roman. Scotch: A software package for static mapping by dual recursive bipartitioning of process and architecture graphs. In Heather Liddell, Adrian Colbrook, Bob Hertzberger, and Peter Sloot, editors, High-Performance Computing and Networking, pages 493–498, Berlin, Heidelberg, 1996. Springer Berlin Heidelberg. ISBN 978-3-540-49955-8.
[11] Pascal Hénon, Pierre Ramet, and Jean Roman. PaStiX: A High-Performance Parallel Direct Solver for Sparse Symmetric Definite Systems. Parallel Computing, 28(2):301–321, 2002. URL https://hal.inria.fr/inria-00346017.
[12] Grégoire Pichon, Eric Darve, Mathieu Faverge, Pierre Ramet, and Jean Roman. Sparse supernodal solver using block low-rank compression: Design, performance and analysis. Journal of Computational Science, 27:255–270, 2018. ISSN 1877-7503. doi:10.1016/j.jocs.2018.06.007.
[13] P. R. Amestoy, I. S. Duff, J. Koster, and J.-Y. L'Excellent. A fully asynchronous multifrontal solver using distributed dynamic scheduling. SIAM Journal on Matrix Analysis and Applications, 23(1):15–41, 2001. doi:10.1137/S0895479899358194.
[14] Patrick R. Amestoy, Abdou Guermouche, Jean-Yves L'Excellent, and Stéphane Pralet. Hybrid scheduling for the parallel solution of linear systems. Parallel Computing, 32(2):136–156, 2006. ISSN 0167-8191. doi:10.1016/j.parco.2005.07.004. Parallel Matrix Algorithms and Applications (PMAA'04).
[15] P. Ghysels, X. S. Li, C. Gorman, and F. Rouet. A robust parallel preconditioner for indefinite systems using hierarchical matrices and randomized sampling. In 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 897–906, May 2017. doi:10.1109/IPDPS.2017.21.
[16] S. Chandrasekaran, M. Gu, and T. Pals. A fast ULV decomposition solver for hierarchically semiseparable representations. SIAM Journal on Matrix Analysis and Applications, 28(3):603–622, 2006. doi:10.1137/S0895479803436652.
[17] Jianlin Xia, Shivkumar Chandrasekaran, Ming Gu, and Xiaoye S. Li. Fast algorithms for hierarchically semiseparable matrices. Numerical Linear Algebra with Applications, 17(6):953–976, 2010. doi:10.1002/nla.691.
[18] Patrick Amestoy, Cleve Ashcraft, Olivier Boiteau, Alfredo Buttari, Jean-Yves L'Excellent, and Clément Weisbecker. Improving multifrontal methods by means of block low-rank representations. SIAM Journal on Scientific Computing, 37(3):A1451–A1474, 2015. doi:10.1137/120903476.
[19] Tamas B. Fehér, Matthias Hoelzl, Guillaume Latu, and G. T. A. Huijsmans. Performance analysis and optimization of the JOREK code for many-core CPUs. arXiv e-prints, abs/1810.04413, 2018. URL http://arxiv.org/abs/1810.04413.
[20] R Nies and M Hoelzl. Testing performance with and without block low rank compression in MUMPS and the new PaStiX 6.0 for JOREK nonlinear MHD simulations. arXiv e-prints, arXiv:1907.13442, Aug 2019. URL https://arxiv.org/abs/1907.13442.
[21] F. J. Artola, C. R. Sovinec, S. C. Jardin, M. Hoelzl, I. Krebs, and C. Clauser. 3D simulations of vertical displacement events in tokamaks: A benchmark of M3D-C1, NIMROD and JOREK. Physics of Plasmas, submitted, 2020. Preprint at https://arxiv.org/abs/2011.04523.
[22] N Nikulsin, M Hoelzl, A Zocco, S Guenter, and K Lackner. A three-dimensional reduced MHD model consistent with full MHD. Physics of Plasmas, 26:102109, 2019. doi:10.1063/1.5122013. Preprint at http://arxiv.org/abs/1907.12486.
[23] Rohan Ramasamy, Matthias Hoelzl, et al. Extending the JOREK code to 3D simulation grids. Private communication, 2020-12-12.