A 3D radiative transfer framework: XII. Many-core, vector and GPU methods
Peter H. Hauschildt
Hamburger Sternwarte, Gojenbergsweg 112, 21029 Hamburg, Germany
E. Baron
a) Homer L. Dodge Dept. of Physics and Astronomy, University of Oklahoma, 440 W. Brooks, Rm 100, Norman, OK 73019, USA
b) Department of Physics and Astronomy, Aarhus University, Ny Munkegade 120, DK-8000 Aarhus C, Denmark
Abstract
3D detailed radiative transfer is computationally taxing, since the solution of the radiative transfer equation involves traversing the six dimensional phase space of the 3D domain. With modern supercomputers the hardware available for wallclock speedup is rapidly changing, mostly in response to requirements to minimize the cost of electrical power. Given the variety of modern computing architectures, we aim to develop and adapt algorithms for different computing architectures to improve performance on a wide variety of platforms. We implemented the main time consuming kernels for solving 3D radiative transfer problems for vastly different computing architectures using MPI, OpenMP, OpenACC and vector algorithms. Adapted algorithms lead to massively improved speed for all architectures, making extremely large model calculations easily feasible. These calculations would have previously been considered impossible or prohibitively expensive. Efficient use of modern computing devices is entirely feasible, but unfortunately requires the implementation of specialized algorithms for them.
Keywords: numerical methods: radiative transfer
Email addresses: [email protected] (Peter H. Hauschildt), [email protected] (E. Baron)
Preprint submitted to Astronomy and Computing, February 5, 2021

1. Introduction

In a series of papers, we have described a framework for solving the radiative transfer equation in 3D systems (3DRT), including a detailed treatment of scattering in continua and lines with a non-local operator splitting method, and its use in the general model atmosphere package
PHOENIX [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, hereafter: Papers I–XI]. We give a short summary of the problem and the numerical approach in the next section. In typical non-local thermodynamic equilibrium (NLTE)
PHOENIX/3D applications, the 3DRT uses about 75% of the overall compute time; that is, reducing the time consumed by the 3DRT module will significantly reduce the overall simulation time. This is an advantage of our approach, since only a small subset of code modules requires specialized directives and code modifications to match the targeted hardware and compiler. In Paper VIII [8] we showed that specialized codes for GPUs can result in significantly improved performance compared to standard CPUs of the era. With the availability of new many-core CPUs (e.g., Intel Phi), vector CPUs (e.g., NEC Aurora TSUBASA) and GPUs, the need for algorithms specially adapted to the hardware in order to maximize performance on such systems becomes apparent. This becomes more urgent due to the fact that modern supercomputers are being built using many-core CPUs or as hybrid CPU/GPU systems. While it is likely that efficient use of such systems will require specialized algorithms, it is not economical to design special codes for each system or to use vendor-locked programming models, as individual computing system lifetimes (5 years) are typically much shorter than code lifetimes (20+ years).

For these reasons, we investigate here how 3D radiative transfer calculations can be accelerated using algorithms adapted to many-core and vector CPUs as well as to GPUs. As codes will potentially be used for decades, it is very important to adhere to general standards as closely as possible in order to retain source code compatibility over long time scales. Thus, we use Fortran 2008 [12] as the base programming language, and MPI [13], OpenMP [14] and OpenACC [15] as additional directive based performance enhancers. In addition, directive based statements for specific systems, e.g., to vectorize loops, are considered as long as they do not interfere with portability to other systems.

In the following, we consider two test cases that represent typical 3DRT usage patterns of
PHOENIX/3D: a Cartesian box with periodic boundary conditions in the horizontal (x, y) plane, and a spherical coordinate system case used in irradiation models of pre-CVs or exoplanets. These general setups cover the vast majority of current and planned
PHOENIX/3D calculations; accelerating them can save millions of CPU-core-hours and the related energy expenditures. In addition, faster calculations can enable the use of smaller supercomputers, so that not only Tier 0 systems but also Tier 1 or 2 systems can be used for a given model. This can drastically reduce the turnaround time for a model and, in turn, free up Tier 0 systems for larger model runs.
2. 3D radiative transfer framework

(This section is adapted from [1].)

We use a Cartesian or spherical coordinate grid of non-equal sized volume cells (voxels) for the following discussion. The values of physical quantities, such as temperatures, opacities and mean intensities, are averages over a voxel, which, therefore, also fixes the local physical resolution of the grid. In the following we will specify the size of the voxel grid by the number of voxels along each positive axis, e.g., n_x = n_y = n_z = 32 specifies a voxel grid from voxel coordinates (−32, −32, −32) to (32, 32, 32), for a total of (2 · 32 + 1)^3 = 274625 voxels, 65 along each axis.

The 3DRT framework is typically applied to optically thick environments with a significant scattering contribution, e.g., modeling the light reflected by an extrasolar giant planet close to its parent star. Therefore, not only a formal solution but an operator splitting scheme with a non-local Λ* operator is used in the calculations [e.g., 16]; we exclusively use a non-local Λ* method.

For simplicity, we consider here only the static, steady-state radiative transfer equation (RTE) in 3D (see [9, 10] for the non-static case), written as

    n̂ · ∇I(ν, x, n̂) = η(ν, x) − χ(ν, x) I(ν, x, n̂),    (1)

where I(ν, x, n̂) is the specific intensity at frequency ν, position x, in the direction n̂; η(ν, x) is the emissivity at frequency ν and position x; and χ(ν, x) is the total extinction at frequency ν and position x. The source function is S = η/χ. Mathematically, this is a linear first order partial integro-differential equation. In Cartesian coordinates ∇ = (∂/∂x, ∂/∂y, ∂/∂z), and the direction n̂ is defined by two angles (θ, φ) at the position x.

The mean intensity J is obtained from the source function S by a formal solution of the RTE, which is symbolically written using the Λ-operator Λ as

    J = ΛS.    (2)

The source function is given by S = (1 − ε)J + εB, where ε denotes the thermal coupling parameter and B is Planck's function. The Λ-iteration method, i.e., solving Eq. 2 by a fixed-point iteration scheme of the form

    J̄_new = ΛS_old,    S_new = (1 − ε)J̄_new + εB,    (3)

fails in the case of large optical depths and small ε. Here, J̄ = ∫₀^∞ J_λ φ_λ dλ is the mean intensity averaged over the line profile φ_λ; S_old is the current estimate of the source function S; and S_new is the new, improved estimate of S for the next iteration. The failure of the Λ-iteration to converge is caused by the fact that the largest eigenvalue of the amplification matrix is approximately [17] λ_max ≈ (1 − ε)(1 − T^(−1)), where T is the optical thickness of the medium. For small ε and large T this is very close to unity and, therefore, the convergence rate of the Λ-iteration is very poor. A physical description of this effect can be found in [18].

The idea of the operator splitting (OS) method is to reduce the eigenvalues of the amplification matrix in the iteration scheme [19] by introducing an approximate Λ-operator (ALO) Λ* and splitting Λ according to

    Λ = Λ* + (Λ − Λ*)    (4)

and rewriting Eq. 3 as

    J̄_new = Λ* S_new + (Λ − Λ*) S_old.    (5)

This relation can be written as [20]

    [1 − Λ*(1 − ε)] J̄_new = J̄_fs − Λ*(1 − ε) J̄_old,    (6)

where J̄_fs = ΛS_old and J̄_old is the current estimate of the mean intensity J. Equation 6 is solved to get the new values of J̄, which are then used to compute the new source function for the next iteration cycle. The calculation and the structure of Λ* should be simple in order to make the construction of the linear system in Eq. 6 fast.
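To make the structure of Eq. 6 concrete, the sketch below shows one operator splitting update for the special case of a purely diagonal (local) Λ*, where the system decouples into one scalar equation per voxel. The non-local Λ* used in this framework couples nearest-neighbor voxels, so Eq. 6 then becomes a sparse linear system that must be solved iteratively (see Sect. 4). All array and routine names here are illustrative, not taken from PHOENIX/3D:

    ! Sketch: one OS update for a purely diagonal ALO (illustrative names).
    ! For a diagonal Lambda*, Eq. 6 reads, voxel by voxel,
    !   (1 - lstar_ii*(1-eps)) * Jbar_new = Jbar_fs - lstar_ii*(1-eps) * Jbar_old
    subroutine os_update_diag(nvox, jbar_fs, jbar_old, lstar_diag, eps, planck, &
                              jbar_new, s_new)
      implicit none
      integer, parameter    :: dp = kind(1.0d0)
      integer, intent(in)   :: nvox              ! total number of voxels
      real(dp), intent(in)  :: jbar_fs(nvox)     ! Jbar_fs = Lambda S_old (formal solution)
      real(dp), intent(in)  :: jbar_old(nvox)    ! current estimate of Jbar
      real(dp), intent(in)  :: lstar_diag(nvox)  ! diagonal of the ALO
      real(dp), intent(in)  :: eps(nvox)         ! thermal coupling parameter epsilon
      real(dp), intent(in)  :: planck(nvox)      ! Planck function B
      real(dp), intent(out) :: jbar_new(nvox), s_new(nvox)
      integer  :: i
      real(dp) :: a
      do i = 1, nvox
         a = lstar_diag(i) * (1.0_dp - eps(i))
         jbar_new(i) = (jbar_fs(i) - a * jbar_old(i)) / (1.0_dp - a)
         s_new(i)    = (1.0_dp - eps(i)) * jbar_new(i) + eps(i) * planck(i)  ! S = (1-eps)J + eps*B
      end do
    end subroutine os_update_diag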
For example, the choice Λ* = Λ is best in view of the convergence rate (it is equivalent to a direct solution by matrix inversion), but the explicit construction of Λ is more time consuming than the construction of a simpler Λ*. The CPU time required for the solution of the RTE using the OS method depends on several factors: (a) the time required for a formal solution and the computation of J̄_fs, (b) the time needed to construct Λ*, (c) the time required for the solution of Eq. 6, and (d) the number of iterations required for convergence to the prescribed accuracy.

3. Test cases

The basic test setups for the Cartesian grid with periodic boundary conditions (PBCs) and the spherical coordinate system (SP) were chosen so that the tests can run on the smallest memory system (GPU) that we used here. These setups are the basic building blocks that are employed in more complex models (e.g., 3D NLTE calculations).
3.1. Test case setup: periodic boundary conditions

The test cases we have investigated follow the continuum tests used in [3] and [8]. In detail, we used a configuration that utilizes periodic boundary conditions (PBCs) in a plane parallel slab. We used PBCs on the x and y axes; z_max is the outside boundary and z_min the inside boundary. The slab has a finite optical depth along the z axis. The basic model parameters are

1. the total thickness of the slab, z_max − z_min = 10^… cm;
2. the minimum optical depth in the continuum, τ_std^min = 10^−…, and the maximum optical depth in the continuum, τ_std^max = 10^…;
3. a grey temperature structure for T_eff = 10^… K;
4. boundary conditions with outer boundary condition I^−_bc ≡ 0;
5. parameterized coherent and isotropic continuum scattering, defined by χ_c = ε_c κ_c + (1 − ε_c) σ_c with ε_c = 10^−…, where κ_c and σ_c are the continuum absorption and scattering coefficients.

The Cartesian grid has (n_x, n_y, n_z) = (65, 65, 65) = 274625 voxels, and we use Ω = (n_θ, n_φ) = (32, 16) = 512 solid angles for the formal solution, equally spaced in μ = cos(θ) and φ.

3.2. Test case setup: spherical coordinate system

The setup for the spherical coordinate system test (SP) follows the general setup for the corresponding case in Paper IV [4]. The configuration used here is:

1. inner radius r_c = 10^… cm, outer radius r_out = 2 × 10^… cm;
2. minimum optical depth in the continuum τ_std^min = 10^−… and maximum optical depth in the continuum τ_std^max = 10^…;
3. grey temperature structure with T_eff = 10^… K;
4. outer boundary condition I^−_bc ≡ 0;
5. continuum extinction χ_c = C/r, with the constant C fixed by the radius and optical depth grids;
6. parameterized coherent and isotropic continuum scattering, defined by χ_c = ε_c κ_c + (1 − ε_c) σ_c with ε_c = 10^−…, where κ_c and σ_c are the continuum absorption and scattering coefficients.

The spherical grid (SP) has (n_r, n_θ, n_φ) = (129, 33, 65) = 276705 voxels, and we use Ω = (n_θ, n_φ) = (8, 16) = 128 solid angles for the formal solution (equally spaced in μ = cos(θ) and φ). This is smaller than in typical PHOENIX/3D applications; however, this setup still fits into the smallest RAM of the test systems. The timing profile of this setup is similar to large scale PHOENIX/3D runs, so we use it as a model in this work.
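Both test cases use solid-angle grids that are equally spaced in μ = cos(θ) and φ. A minimal sketch of such an angular grid follows; the cell-centered (midpoint) placement and all names are illustrative assumptions, not the PHOENIX/3D implementation:

    ! Sketch: solid-angle grid equally spaced in mu = cos(theta) and phi,
    ! e.g. (n_theta, n_phi) = (32, 16) -> 512 directions for the PBC test.
    program angle_grid
      implicit none
      integer, parameter :: n_theta = 32, n_phi = 16
      real(8), parameter :: pi = acos(-1.0d0)
      real(8) :: mu(n_theta), phi(n_phi), dmu, dphi
      integer :: i, j
      dmu  = 2.0d0 / n_theta          ! mu spans [-1, 1]
      dphi = 2.0d0 * pi / n_phi       ! phi spans [0, 2*pi)
      do i = 1, n_theta
         mu(i) = -1.0d0 + (i - 0.5d0) * dmu    ! midpoint of each mu cell
      end do
      do j = 1, n_phi
         phi(j) = (j - 0.5d0) * dphi           ! midpoint of each phi cell
      end do
      print *, 'solid angles:', n_theta * n_phi          ! 512
      print *, 'first direction: mu =', mu(1), ' phi =', phi(1)
    end program angle_grid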
3.3. Test systems

3.3.1. Multi-core CPU: Intel Xeon W-3223

The base system for the comparison is a dual Intel Xeon W-3223 CPU with a 3.50 GHz clock speed, 8 cores per CPU and 2 hardware threads per core. The system is realized as a MacPro7,1 with 96 GB RAM. The OS is MacOS 10.15; the available compilers are GCC 10.2.0, Intel 19.1.3.301 and PGI 19.10-0 (the newer NVIDIA HPC compilers were not available for MacOS).

3.3.2. GPU: NVIDIA V100
The GPU test system is an 8-core Xeon Silver 4110 (Skylake) 2.1 GHz host machine with an NVIDIA V100 GPU with 32 GB RAM. The GPU has a clock speed of 1380 MHz, a memory clock rate of 877 MHz and a 4096 bit memory bus width. To target the GPU, we use OpenACC as implemented in the NVIDIA Fortran compiler version 20.9-0. The host system is running CentOS 7. For the host CPU the Intel compiler version 19.1.3.304 is also available.
3.3.3. Many-core CPU: Intel Xeon Phi

The many-core test system is an Intel Xeon Phi 7210 (Knights Landing; KNL) based system with a 1.30 GHz clock speed, 64 cores and 4 hardware threads per core, similar to the NERSC Cori nodes. On this system we use the Intel compiler version 19.1.3.304 with OpenMP directives turned on.
3.3.4. Vector CPU: NEC SX-Aurora TSUBASA

As a vector processor test system we use a NEC SX-Aurora TSUBASA A101 (Aurora) machine with 2 vector engine (VE) cards. Each card has 48 GB RAM and 8 vector cores at 1.4 GHz clock speed. The vector host (VH) is a Xeon Gold 6126 CPU (Skylake) at 2.60 GHz clock speed with 96 GB RAM. The VH manages the 2 VEs and handles their I/O requests, runs the compiler, etc. The (MPI) application runs completely on the VEs and is natively compiled using the NEC Fortran compiler, version 2.1.1. The VEs have fast vector pipelines and a scalar unit on each core. For the tests we use the NEC Fortran compiler to create vectorized executables with the help of compiler directives, and NEC MPI for the MPI based parallelization.

In addition to the VEs, the VH can also be used for calculations. We thus ran tests with the Intel Fortran compiler (version 19.1.3.304) and the NVIDIA compiler (version 20.9-0) with their respective MPI and OpenMP implementations.

4. Method
In the following discussion we use the notation of Papers I–XI. Considering the 3DRT module individually for any given wavelength, the biggest time consumers are: setting up the geometric paths through the grid for all solid angles (henceforth: the tracker), computing the formal solution for all solid angles, computing the Λ* operator (only in the first iteration), and the operator splitting solver (computing the new J's by solving the operator splitting linear system, see Paper I). In addition, MPI load balancing may be measurable (e.g., if the workload for different solid angles is significantly different, some MPI processes finish faster than others; this is the case in the PBC test setup). These steps are repeated for all wavelength points that are considered in a calculation.

For the solution of the large sparse linear system (its rank is equal to the total number of voxels of the simulation) in the operator splitting step ('OS iteration') we have implemented 3 algorithms: a Jacobi iteration, a Gauss-Seidel iteration, and the BiCGStab method [21], parallelized with MPI, OpenMP and OpenACC. Each of them performs differently on the test systems and we show in the graphs the fastest version for each of the test runs.

The standard algorithm implements the algorithms described in Paper III (PBCs) and Paper IV (SP) for multi-CPU systems with MPI parallelization: in the formal solution (and Λ* computation) different solid angles are parallelized, and the data needs to be collected from all participating processes only at the end of the formal solution. Similarly, the operator splitting step is parallelized with MPI. The standard algorithm was designed to minimize the memory footprint, so that calculations can be performed on smaller machines. One important observation is that the tracker and the formal solution are memory, rather than compute, bound: the different solid angles cause memory access patterns that are aligned with optimal memory and cache access patterns only in specific cases; in most cases memory is accessed randomly throughout the formal solution. Thus, memory latency is the primary bottleneck.

Some results for the standard algorithm are shown in Fig. 1. The performance on older Skylake (Silver 4110) and newer Cascade Lake (W-3223) CPUs is roughly the same. All times are given for a complete node, i.e., all cores of the CPU(s) of the node are used, a common pattern in High Performance Computing (HPC). The standard algorithm is not multi-threaded with OpenMP; thus the multiple CPU cores are used with MPI parallelization. For comparison we also include the results on the Intel Xeon Phi KNL many-core CPU and the NEC Aurora vector CPU. Whereas the standard algorithm fares very well on the KNL CPU, its performance on the vector processor is abysmal. This is due to the un-threaded and un-vectorizable form of the standard algorithm. The theoretical performance of the KNL cannot be realized in this setup since the standard algorithm does not produce performance gains with OpenMP threading: it requires critical sections and atomic constructs to give correct results, and the memory access pattern causes delays in the OpenMP threads, so that OpenMP directives can even be counterproductive.

In the following, we will refer to the 'formal solution' as the combination of the formal solution and the construction of the nearest-neighbor Λ*. Each of these steps, formal solution and Λ* construction, takes about the same computation time.
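As an illustration of the simplest of the three OS solvers mentioned above, one Jacobi sweep for the sparse system of Eq. 6 might look as follows; the compressed-row storage and all names are illustrative assumptions, not the (MPI/OpenMP/OpenACC parallelized) PHOENIX/3D solvers:

    ! One Jacobi sweep for A*x = b, with A sparse in CSR form (illustrative).
    ! For Eq. 6, A = 1 - Lambda*(1-eps) and b = Jbar_fs - Lambda*(1-eps)*Jbar_old.
    subroutine jacobi_sweep(n, row_ptr, col_idx, val, diag, b, x_old, x_new)
      implicit none
      integer, parameter    :: dp = kind(1.0d0)
      integer, intent(in)   :: n, row_ptr(n+1), col_idx(:)
      real(dp), intent(in)  :: val(:), diag(n), b(n), x_old(n)
      real(dp), intent(out) :: x_new(n)
      integer  :: i, k
      real(dp) :: s
      do i = 1, n
         s = b(i)
         do k = row_ptr(i), row_ptr(i+1) - 1
            if (col_idx(k) /= i) s = s - val(k) * x_old(col_idx(k))  ! off-diagonal part
         end do
         x_new(i) = s / diag(i)     ! divide by the diagonal element
      end do
    end subroutine jacobi_sweep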
In all cases for the standard algorithm, the formal solution takes the largest time (red bars in Fig. 1), whereas the operator splitting solver (blue) timing varies from system to system. The time for the MPI 'allreduce' (green) is substantially larger in the PBC case, where the workload between different MPI processes is significantly different, than in the SP model, where the load balancing is much better.

As a first step towards better performance and constructing an optimized algorithm for the formal solution on many-core CPUs we consider the tracker, in particular the geometry part. In a typical
PHOENIX/3D application, many (100,000 or more) wavelength points are considered for the same structure. NLTE modeling has to repeat this procedure numerous times in order to solve the multi-level NLTE rate equations consistently with the radiation field. Therefore, the geometry, and thus the set of tracks (directions in momentum space) through the voxel grid, remains the same for all wavelength points. This can be used to set up and fill a geometry cache once, at the beginning, which is then accessed for all subsequent formal solutions, until the wavelength grid has been traversed. This requires (a) a multi-pass algorithm which separates the geometry tracking from the formal solution, and (b) optionally storing the geometry tracking data. This requires substantial additional memory, which is, however, fully distributed over the different MPI processes with no communication or sharing required. Thus, when using more MPI processes, each process needs to store progressively less geometry data, down to the theoretical limit of one solid angle per MPI process. This situation is actually very realistic in
PHOENIX/3D simulations, as the domain decomposition requires anywhere from 64 to 1024 MPI processes (typically 4-64 nodes), and in such realistic setups the geometry cache requires only a small fractional memory increase per MPI process.

The multi-pass algorithm has the additional advantage that it allows for more efficient OpenMP parallelization and Single Instruction Multiple Data (SIMD) style vectorization. Once the geometric paths through the voxel grid are known for each solid angle (either by computing them in a first phase or by recalling them from the geometry cache), the formal solution can be computed by loops over the voxel grid rather than by stepping along each characteristic. To keep the loops short and simple enough for OpenMP based parallelization, it turned out to be advantageous to split the calculation into separate phases. Each phase can then use different OpenMP parallelization and SIMD statement vectorization (for Intel CPUs, using the Intel compilers). This increases memory locality, resulting in better cache usage, but does add temporary arrays that are needed to store intermediate results.

For a multi-core CPU the results are shown in Fig. 2, where we use the Xeon W-3223 as an example. Here, the multi-pass algorithm without the geometry cache (labeled 'MultiPass' in the figures) performs substantially better than the standard algorithm (about 1.9 times) in the PBC case, but substantially worse in the SP case. With the geometry cache enabled, as in a production
PHOENIX/3D model, the times do not include the construction of the geometry cache (this is done once before the calculations start). This 'MultiPassCache' version produces the best performance for both test cases: about 3 times speedup for the SP case and 2.6 times for the PBC test compared to the standard algorithm. These are substantial speedups on classic multi-core CPUs. Listing 1 gives pseudo-code for this method.

The corresponding results on many-core CPUs are similar. In Figs. 3 and 4 we show the results obtained on the Xeon Phi 7210 (64 cores with 4 hardware threads each at 1.3 GHz). Here, the standard algorithm is substantially slower than the 'MultiPassCache' setup. The speedups for the MultiPass+Cache version are 3.4 (PBCs) and 2.6 (SP) compared to the standard algorithm. Enabling the geometry cache speeds up the overall calculation by a factor of 2 or more. For a full
PHOENIX/3D simulation run this produces massively reduced wallclock times compared to the standard and MultiPass algorithms. In the case of KNL, the MultiPassCache algorithm allows for much better OpenMP utilization, including the OpenMP SIMD statement that is used to vectorize (i.e., use AVX-512 instructions) on the KNL CPU with the Intel compiler. These vectorizations by themselves already produce significant speedups compared to the standard algorithm. This result is highlighted by the observation (cf. Fig. 4) that for the SP test on KNL, a setup with 16 MPI processes and 16 OpenMP threads each (16@16) is faster than the more MPI biased 64@4 setup. Even the very OpenMP focused 4@64 setup is only 10% slower than the 16@16 setup. In the PBC test case, the 4@64 setup is the fastest; however, the 16@16 setup is only about 12% slower. This is likely, at least in part, due to MPI load balancing, which is a bigger problem in the PBC test (where the different characteristics have very different lengths in terms of voxel counts) than in the far more balanced SP test case. These effects show that the KNL is quite sensitive to details of the setup, and for each simulation the optimal setup may need to be determined by testing in advance. Experiments using fewer than the 4 hardware threads per core on KNL (as suggested in some performance documents) reduced performance significantly. It appears that using more threads may be able to hide memory latencies better; however, using more than 4 threads per core also reduced performance.
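The control flow of the MultiPass+Cache scheme can be sketched as follows (pseudo-code in the style of Listings 1 and 2; all routine and variable names are hypothetical): the tracker runs once per locally owned solid angle before the wavelength loop, and every subsequent formal solution reuses the stored tracks.

    ! Pass 1: geometric tracking, done once; the grid geometry (and thus
    ! the set of tracks) is the same for every wavelength point.
    do ia = 1, n_angles_local                 ! solid angles of this MPI rank
       call track_characteristics(angles(ia), geo_cache(ia))
    end do
    !
    ! Pass 2: all wavelength points reuse the cached tracks.
    do il = 1, n_wavelengths
       do ia = 1, n_angles_local
          call formal_solution_cached(il, angles(ia), geo_cache(ia), jbar_loc)
       end do
       call mpi_allreduce_jbar(jbar_loc, jbar)   ! collect all contributions
       call os_iteration(jbar, s_new)            ! operator splitting step
    end do

Because the cache is indexed by solid angle, it is naturally distributed: each MPI rank stores tracks only for its own angles, which is why the memory overhead per process shrinks as more processes are used.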
The Aurora is a vector processor with significantly different performance characteristics than Intel Xeons or many-core CPUs. While its vector performance is very high, its scalar performance is quite low; that is, vectorization is of utmost importance on this system. In Fig. 5 we show the performance results for the standard and MultiPass algorithms, which do not use the vector processing capabilities of the Aurora; the execution times are very long.

Therefore, we have developed a variant of the MultiPass+Cache algorithm that can be vectorized (with directives) by the NEC Fortran compiler. For this, we swapped loop orders (adding intermediate helper arrays where necessary) and split/merged loops so that the NEC compiler vectorizes them. The design goal was to vectorize the longest possible loop, even if this requires additional arrays to store intermediate data. The resulting code uses slightly more RAM than the MultiPass+Cache version. In addition, we also developed a vectorized version of the operator splitting solver in order to further increase overall performance. The changes required to the standard operator splitting solvers are small: a few loop rearrangements and vectorization directives.

The performance of the vector algorithm is shown in Fig. 5. Compared to the standard algorithm, the vector version on the Aurora is about 44 times faster for the PBC case and 65 times faster for the SP test case. In addition, the vector version significantly reduces MPI load balance issues. On the multi-core CPU the vector algorithm is 1.5 times faster than the standard algorithm in the PBC case, but only about 17% faster than the standard algorithm for the SP test case. On the many-core Xeon Phi, the vector algorithm is 3-4 times slower than the MultiPass+Cache version (not shown). Note that the vector algorithm does not use OpenMP threading, so that, in particular on KNL, only one thread per core is used. In addition, the Intel compiler does not vectorize the loops in the same way that the NEC compiler does and, therefore, the vector algorithm does not use the Xeon vector instructions (e.g., AVX-512 on the KNL) efficiently.
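The essence of the loop rearrangement can be illustrated with a toy fragment (the update formula and all names are illustrative; !NEC$ ivdep is a NEC compiler directive telling the compiler to ignore potential vector dependencies): instead of stepping along one characteristic at a time, the step index k becomes the outer loop and the long loop over voxels becomes the innermost, vectorizable one, at the cost of helper arrays holding per-voxel intermediate results.

    ! Scalar-friendly original: short, data-dependent inner loop.
    !   do ivox = 1, nvox
    !      do k = 1, nsteps(ivox)
    !         ...attenuate and accumulate along the characteristic...
    !      end do
    !   end do
    !
    ! Vector-friendly rearrangement: the long voxel loop is innermost.
    do k = 1, maxval(nsteps)
    !NEC$ ivdep
       do ivox = 1, nvox                     ! long loop -> vector pipeline
          if (k <= nsteps(ivox)) then
             work(ivox) = work(ivox) * atten(k, ivox) + dsrc(k, ivox)
          end if
       end do
    end do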
Over the last decade, using GPUs to accelerate numerical calculations (compared to classical CPUs) has become important. To enable such usage, proprietary methods [e.g., CUDA, 22] as well as open standards have been developed. The latter are realized both as programming standards, e.g., OpenCL [23], and as (directive based) APIs, e.g., OpenACC [15]. For long term portability and vendor independence, we consider it imperative to utilize an open standard based approach (codes are often used for decades on very different hardware generations). We have published results for OpenCL before [8] and therefore we concentrate here on an OpenACC based approach. New versions of OpenMP also support GPU based offloading; however, the support for available hardware provided by existing compilers that we have access to is presently very limited. Currently, this is actually also true for OpenACC, where the only compiler support available is the NVIDIA compiler [24] on NVIDIA hardware. OpenACC support in GCC [25] was, at the time of this writing, not in an advanced enough state to be usable.

For the OpenACC version we had to make significant changes to the code. The main problem is that OpenACC does not work with the array-of-structures method that the original code (standard, cache, and vector versions) uses. Therefore, we had to redesign the code to also (alternatively) use a structure-of-arrays method (labeled 'FlatCache' in the figures) to store the geometry tracking cache and the data used for the operator splitting (e.g., the Λ* array). On the NEC and the Xeons, the structure-of-arrays version is marginally faster (about 2%) than the array-of-structures method. The individual arrays are then easily transferred onto the GPU with OpenACC update directives. If the GPU device has enough memory, the geometry cache can be stored once on the GPU and then be used directly, without the need to transfer the tracking cache for each solid angle. In the test cases, the caches and additional 3DRT arrays total 25-30 GB, so that they can be stored on the GPU (LargeGPU mode). For comparison, we have also run tests in SmallGPU mode, where the tracking caches are kept on the host (CPU) and transferred onto the GPU for each solid angle. In this mode, about 1-2 GB are used on the GPU, so that smaller devices can also be used or multiple processes can be run on one GPU. Note that in MPI mode with several GPUs only a fraction of the tracking cache needs to be stored per GPU, proportionally reducing the memory use per GPU.

The OpenACC version of the formal solution is adapted from the vector version by adding OpenACC directives. In a few loops, the order was exchanged to better exploit the GPU architecture. Listing 2 gives the pseudo-code for this method. The compiler directives are relatively simple, and converting from OpenMP to OpenACC is relatively straightforward. Likewise, given a working OpenACC version, it is relatively straightforward to port to OpenMP based device directives, based on small test cases that we have been able to evaluate.

The results of the OpenACC test calculations are shown in Fig. 6, where we compare the timings obtained with the NVIDIA/OpenACC compiler on the GPU. NVIDIA OpenACC support is also available for many-core CPUs, thus we are able to include results for the KNL used as an OpenACC device with shared memory. The test GPU is an NVIDIA V100 with 32 GB, which can be run in both SmallGPU and LargeGPU mode.
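The array-of-structures to structure-of-arrays change can be sketched as follows (type and array names are illustrative; the actual FlatCache layout may differ). Plain arrays map directly onto OpenACC data clauses, whereas derived types with allocatable components require cumbersome manual deep copies:

    ! Array-of-structures: one derived-type element per solid angle; the
    ! allocatable payloads make OpenACC data management awkward.
    type :: track_t
       integer :: nsteps
       real(8), allocatable :: dtau(:), path(:)
    end type track_t
    type(track_t), allocatable :: tracks(:)       ! tracks(n_angles)

    ! Structure-of-arrays ('FlatCache' style): flat arrays, padded to the
    ! maximum track length, transfer with a single data directive.
    integer, allocatable :: nsteps(:)             ! nsteps(n_angles)
    real(8), allocatable :: dtau(:,:), path(:,:)  ! (max_steps, n_angles)

    !$acc enter data copyin(nsteps, dtau, path)

With this layout, the LargeGPU mode simply leaves these arrays resident on the device for the entire wavelength loop.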
The LargeGPU mode is 4 to 9 times faster than the SmallGPU mode, clearly showing that data transfer is a significant bottleneck for the GPU. The KNL OpenACC code is, in comparison, much slower and not competitive with the MPI+OpenACC versions (not shown).

4.4. Xeon compiler comparison

The timing results for the same jobs on the Xeon W-3223 CPU with different compilers are shown in Fig. 7 (this is the MacPro7,1 running MacOS; the PGI compiler is used since the NVIDIA HPC SDK is not available for MacOS). For both test setups we used the MultiPass+Cache algorithm with MPI+OpenMP (8 processes with 2 threads each, i.e., hyperthreaded mode) on the system; the fastest results for each compiler are shown in Fig. 7. The compilers are Intel Version 19.1.3.301, GNU Fortran (GCC) 10.2.0, and PGI pgfortran 19.10-0. In both model setups, the Intel compiler results in the smallest run time, followed by GCC, which is about 12% slower in the SP test and 42% slower in the PBC test compared to the Intel compiler results. The PGI compiler trails significantly in both tests; a test with the newer NVIDIA HPC compiler on the KNL and the Xeon Silver 4110 (host CPU of the V100 GPU) shows similar trends.
5. Summary and conclusions
In Fig. 8 we show the results for the fastest 7 runs over all systems, algorithms, and setups combined. For the PBC test case, the OpenACC LargeGPU version on the V100 GPU is by far the fastest, followed by the NEC TSUBASA Aurora vector processor, which is slower by a factor of about 3.2. In the spherical test case the OpenACC LargeGPU setup is also the fastest, followed by the NEC TSUBASA Aurora, which is a factor of about 2.4 slower. The 5 year old KNL is surprisingly quick: it beats the V100 in SmallGPU mode for the PBC tests and is faster than the newer multi-core Xeons. In all cases, the geometry caching is very effective, producing significant speedups, particularly on more recent CPUs (for example, a Skylake Xeon Gold 6126 is 1.89 times faster with the cache enabled than without it). The OpenACC SmallGPU version is much slower than the LargeGPU setup, showing the cost of data transfer. This is of great practical importance, as large scale 3DRT runs require more RAM than current GPUs have available, in which case either SmallGPU mode or multiple GPUs must be used. In SmallGPU mode, the overall utilization of the GPU is smaller (by about 50%), so that multiple parallel processes on a single GPU may be used; initial experiments have provided promising results for very large test cases with 2 independent processes on a single V100 GPU (using NVIDIA MPS).

The main problems with OpenACC are that the compiler support needed for efficient and portable (to different GPU vendors) code is not yet available, and that complex code and data structures may need to be adapted (simplified) for efficient OpenACC data transfer. Overall, OpenACC code is simple to generate using the vector algorithm as a starting point. In the future it may be better to switch to the OpenMP offload/target paradigm, which appears to be available for more hardware options and is supported by more compilers, in particular the LLVM framework (llvm.org).

The NEC Aurora vector processor is very fast and easy to code for; however, its scalar performance is very low. In practical applications it may be best to combine the vector engine with the host CPU whenever possible, where the 3DRT executes on the vector processor and the host CPU is used for I/O and scalar processing. Several methods allow for this option, including running MPI processes on the vector processor and the host CPU at the same time, and offloading to and from the vector processor.

The performance of the Xeon Phi KNL is quite sensitive to the exact setup (MPI and OpenMP) used for each problem; thus, in typical production use it is important to determine the optimal configuration beforehand (or to implement an automatic scheme to optimize the configuration at runtime). On all Intel CPUs, the vector algorithm is less efficient than the MultiPass+Cache version (recall that the vector version is a rearranged MultiPass version and includes the geometry cache). This could be due to the Intel compiler not using the vector instructions automatically (their use is forced by OpenMP SIMD directives in the MultiPass algorithm), or to the vector instructions stalling on data gather or scatter (which may be hidden by the many threads used on the Xeon Phi).

Acknowledgments
PHH gratefully acknowledges support by the DFG under grants HA 3457/20-1 and HA 3457/23-1. Some of the calculations presented here were performed at the RRZ of the Universität Hamburg, at the Höchstleistungs Rechenzentrum Nord (HLRN), and at the National Energy Research Supercomputer Center (NERSC), which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC03-76SF00098. We thank all these institutions for a generous allocation of computer time. PHH gratefully acknowledges the support of NVIDIA Corporation with the donation of a Quadro P6000 GPU used in this research. E.B. acknowledges support from NASA Grant NNX17AG24G and an AUFF sabbatical grant. We thank Dr. Rudolf Fischer from NEC Deutschland GmbH for very helpful advice on vectorizing for the NEC TSUBASA Aurora system.

References

[1] P. H. Hauschildt, E. Baron, A 3D radiative transfer framework. I. Non-local operator splitting and continuum scattering problems, A&A 451 (2006) 273-284. arXiv:astro-ph/0601183, doi:10.1051/0004-6361:20053846.
[2] E. Baron, P. H. Hauschildt, A 3D radiative transfer framework. II. Line transfer problems, A&A 468 (1) (2007) 255-261. doi:10.1051/0004-6361:20066755.
[3] P. H. Hauschildt, E. Baron, A 3D radiative transfer framework. III. Periodic boundary conditions, A&A 490 (2008) 873-877. arXiv:0808.0601, doi:10.1051/0004-6361:200810239.
[4] P. H. Hauschildt, E. Baron, A 3D radiative transfer framework. IV. Spherical and cylindrical coordinate systems, A&A 498 (2009) 981-985. arXiv:0903.1949, doi:10.1051/0004-6361/200911661.
[5] E. Baron, P. H. Hauschildt, B. Chen, A 3D radiative transfer framework. V. Homologous flows, A&A 498 (2009) 987-992. arXiv:0903.2486, doi:10.1051/0004-6361/200911681.
[6] P. H. Hauschildt, E. Baron, A 3D radiative transfer framework. VI. PHOENIX/3D example applications, A&A 509 (2010) A36. arXiv:0911.3285, doi:10.1051/0004-6361/200913064.
[7] A. M. Seelmann, P. H. Hauschildt, E. Baron, A 3D radiative transfer framework. VII. Arbitrary velocity fields in the Eulerian frame, A&A 522 (2010) A102. arXiv:1007.3419, doi:10.1051/0004-6361/201014278.
[8] P. H. Hauschildt, E. Baron, A 3D radiative transfer framework. VIII. OpenCL implementation, A&A 533 (2011) A127. doi:10.1051/0004-6361/201117051.
[9] D. Jack, P. H. Hauschildt, E. Baron, A 3D radiative transfer framework. IX. Time dependence, A&A 546 (2012) A39. arXiv:1209.5788, doi:10.1051/0004-6361/201118152.
[10] E. Baron, P. H. Hauschildt, B. Chen, S. Knop, A 3D radiative transfer framework. X. Arbitrary velocity fields in the comoving frame, A&A 548 (2012) A67. arXiv:1210.6679, doi:10.1051/0004-6361/201219343.
[11] P. H. Hauschildt, E. Baron, A 3D radiative transfer framework. XI. Multi-level NLTE, A&A 566 (2014) A89. arXiv:1404.4376, doi:10.1051/0004-6361/201423574.
[12] Fortran Working Group, Fortran 2008 Standard, https://wg5-fortran.org/f2008.html (2019).
[13] MPI Working Group, The MPI Application Programming Interface (2019).
[14] OpenMP Working Group, The OpenMP Application Programming Interface (2019).
[15] OpenACC Working Group, The OpenACC Application Programming Interface (2019).
[16] P. H. Hauschildt, H. Störzer, E. Baron, Convergence properties of the accelerated Λ-iteration method for the solution of radiative transfer problems, J. Quant. Spec. Radiat. Transf. 51 (6) (1994) 875-891. doi:10.1016/0022-4073(94)90018-3.
[17] D. Mihalas, P. B. Kunasz, D. G. Hummer, Solution of the comoving-frame equation of transfer in spherically symmetric flows. I. Computational method for equivalent-two-level-atom source functions, ApJ 202 (1975) 465-489. doi:10.1086/153996.
[18] D. Mihalas, Solution of the comoving-frame equation of transfer in spherically symmetric flows. VI. Relativistic flows, ApJ 237 (1980) 574-589. doi:10.1086/157902.
[19] C. J. Cannon, Angular quadrature perturbations in radiative transfer theory, J. Quant. Spec. Radiat. Transf. 13 (7) (1973) 627-633. doi:10.1016/0022-4073(73)90021-6.
[20] W.-R. Hamann, Line formation in expanding atmospheres: Multi-level calculations using approximate lambda operators, in: W. Kalkofen (Ed.), Numerical Radiative Transfer, Cambridge University Press, 1987, p. 35.
[21] R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, H. van der Vorst, Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, SIAM, Philadelphia, 1994. ISBN: 978-0-89871-328-2. doi:10.1137/1.9781611971538.
[22] NVIDIA Corporation, CUDA.
[23] Khronos OpenCL Working Group, The OpenCL Specification.
[24] NVIDIA Corporation, NVIDIA HPC SDK, https://developer.nvidia.com/hpc-sdk (2020).
[25] GNU Project, GNU Compiler Suite Manuals, https://gcc.gnu.org/onlinedocs/ (2019).

Listing 1: Pseudocode for the KNL OpenMP implementation

!$omp parallel do default(none) collapse(2)
!$omp& private(various variables)
!$omp& firstprivate(various variables)
!$omp& shared(various variables)
!
do iz = -nz, nz              ! phase 3 loop:
   do iy = -ny, ny
!
!$omp simd
      do ix = -nx, nx
!--
!-- ...
!--
         lngth = charspv(ix, iy, iz)
!--
!-- ... and the path lengths into vectors
!--
         iloop: do i = 1, lngth
!--
!-- ...
!--
!-- 4a. Fix up for BCs and vacuum
!--
!-- 4b. store data for next phase if dolstar is set
!--
         enddo iloop
      enddo
   enddo
enddo                        ! phase 3 loop
!$omp end parallel do

Listing 2: Pseudocode for the OpenACC implementation

!--
!-- transfer data to the accelerator
!--
!$acc enter data
!$acc& create(various variables)
!$acc update device(various variables)
!
!$acc parallel loop collapse(3)
!$acc& present(various variables)
!
do iz = -nz, nz              ! phase 3 loop:
   do iy = -ny, ny
      do ix = -nx, nx
!--
!-- ...
!--
!-- ... and the path lengths into vectors
!--
!$acc loop seq
         iloop: do i = 1, lngth
!--
!-- ...
!--
!-- 4a. Fix up for BCs and vacuum
!--
!-- 4b. store data for next phase if dolstar is set
!--
         enddo iloop
      enddo
   enddo
enddo                        ! phase 3 loop
!$acc exit data finalize
!$acc& delete(various variables)

Figure 1: Timing of the standard algorithm. The horizontal bars give the overall execution walltimes (summed over all iterations) of the different phases of the 3D radiative transfer solver as indicated by the colors. 'Formal Solution' designates the phase where new mean intensities are computed from the current estimate of the source function (including the construction of the Λ* operator in the first iteration), 'allreduce' is the phase where the contributions of all MPI processes are collected and summed up via MPI functions, and 'OS iteration' is the time spent computing the new estimate of the source function (including the solution of the large sparse linear system). The labels on the right hand side of the bars give the overall execution time. The labels on the left hand side specify the system or CPU, the algorithm, the solver of the large linear system in the OS step (Jacobi, Gauss-Seidel or BiCGStab), the compiler and the parallelization setup (where the notation x@y indicates x MPI processes with y OpenMP threads each).
[Figure 2 shows two panels, 'Times for PBCs test' and 'Times for spherical test', with bars (total time, Formal Solution, allreduce, OS iteration) for the Xeon W-3223 @ 3.50GHz configurations MultiPass+Cache/Jacobi/Intel/4@4, MultiPass/Jacobi/Intel/8@2, Vector/Jacobi/Intel/16@1 and standard/Jacobi/Intel/8@2.]
Figure 2: Timing of different algorithms on a modern multi-core CPU (Xeon W-3223). The horizontal bars give the overall execution walltimes (summed over all iterations) of the different phases of the 3D radiative transfer solver as indicated by the colors. 'Formal Solution' designates the phase where new mean intensities are computed from the current estimate of the source function (including the construction of the Λ* operator in the first iteration), 'allreduce' is the phase where the contributions of all MPI processes are collected and summed up via MPI functions, and 'OS iteration' is the time spent computing the new estimate of the source function (including the solution of the large sparse linear system). The labels on the right hand side of the bars give the overall execution time. The labels on the left hand side specify the system or CPU, the algorithm, the solver of the large linear system in the OS step (Jacobi, Gauss-Seidel or BiCGStab), the compiler and the parallelization setup (where the notation x@y indicates x MPI processes with y OpenMP threads each).

Figure 3: Timing of different algorithms on a many-core CPU (Xeon Phi 7210). The horizontal bars give the overall execution walltimes (summed over all iterations) of the different phases of the 3D radiative transfer solver as indicated by the colors. 'Formal Solution' designates the phase where new mean intensities are computed from the current estimate of the source function (including the construction of the Λ* operator in the first iteration), 'allreduce' is the phase where the contributions of all MPI processes are collected and summed up via MPI functions, and 'OS iteration' is the time spent computing the new estimate of the source function (including the solution of the large sparse linear system). The labels on the right hand side of the bars give the overall execution time. The labels on the left hand side specify the system or CPU, the algorithm, the solver of the large linear system in the OS step (Jacobi, Gauss-Seidel or BiCGStab), the compiler and the parallelization setup (where the notation x@y indicates x MPI processes with y OpenMP threads each).
[Figure 4 shows two panels, 'Times for PBCs test' and 'Times for spherical test', with bars (total time, Formal Solution, allreduce, OS iteration) for the Xeon Phi 7210 @ 1.30GHz MultiPass+Cache/Jacobi setups 4@64, 16@16 and 64@4 with the Intel compiler, and 16@16, 64@4 and 4@64 with the NVIDIA compiler (CPU target).]
Figure 4: Timing of different MPI+OpenMP setups for the MultiPass+Cache algorithm on a many-core CPU (Xeon Phi 7210). The horizontal bars give the overall execution walltimes (summed over all iterations) of the different phases of the 3D radiative transfer solver as indicated by the colors. 'Formal Solution' designates the phase where new mean intensities are computed from the current estimate of the source function (including the construction of the Λ* operator in the first iteration), 'allreduce' is the phase where the contributions of all MPI processes are collected and summed up via MPI functions, and 'OS iteration' is the time spent computing the new estimate of the source function (including the solution of the large sparse linear system). The labels on the right hand side of the bars give the overall execution time. The labels on the left hand side specify the system or CPU, the algorithm, the solver of the large linear system in the OS step (Jacobi, Gauss-Seidel or BiCGStab), the compiler and the parallelization setup (where the notation x@y indicates x MPI processes with y OpenMP threads each).

Figure 5: Timing of different algorithms on a vector CPU (NEC SX-Aurora TSUBASA). The horizontal bars give the overall execution walltimes (summed over all iterations) of the different phases of the 3D radiative transfer solver as indicated by the colors. 'Formal Solution' designates the phase where new mean intensities are computed from the current estimate of the source function (including the construction of the Λ* operator in the first iteration), 'allreduce' is the phase where the contributions of all MPI processes are collected and summed up via MPI functions, and 'OS iteration' is the time spent computing the new estimate of the source function (including the solution of the large sparse linear system). The labels on the right hand side of the bars give the overall execution time. The labels on the left hand side specify the system or CPU, the algorithm, the solver of the large linear system in the OS step (Jacobi, Gauss-Seidel or BiCGStab), the compiler and the parallelization setup (where the notation x@y indicates x MPI processes with y OpenMP threads each).
[Figure 6 shows two panels, 'Times for PBCs test' and 'Times for spherical test', with bars (total time, Formal Solution, allreduce, OS iteration) for the Nvidia V100 (OpenACC/BiCGStab/NVIDIA/GPU/1@1) and the Nvidia V100 in small RAM mode (OpenACC/BiCGStab/NVIDIA/GPU/1@1).]