A comparative evaluation of three volume rendering libraries for the visualization of sheared thermal convection
Jean M. Favre a,∗, Alexander Blass b
a Swiss National Supercomputing Center (CSCS), Via Trevano 131, CH-6900 Lugano, Switzerland
b Physics of Fluids Group, Max Planck Center for Complex Fluid Dynamics, J. M. Burgers Center for Fluid Dynamics and MESA+ Research Institute, Department of Science and Technology, University of Twente, P.O. Box 217, 7500 AE Enschede, The Netherlands
Abstract
Oceans play a big role in the nature of our planet: about 70% of our earth is covered by water [1]. Strong currents transport warm water around the world, making life possible and allowing us to harvest their power to produce energy. Yet, oceans also carry a much more deadly side. Floods and tsunamis can easily annihilate whole cities and destroy life in seconds. The earth's climate system is also closely linked to the currents in the ocean because of its large coverage of the earth's surface; thus, gaining scientific insights into the underlying mechanisms and effects through simulations is of high importance. Deep ocean currents can be simulated by means of wall-bounded turbulent flow simulations. To support these very large scale numerical simulations and enable the scientists to interpret their output, we deploy an interactive visualization framework to study sheared thermal convection. The visualizations are based on volume rendering of the temperature field. To address the needs of supercomputer users with different hardware and software resources, we evaluate different volume rendering implementations supported in the ParaView [2] environment: two GPU-based solutions with Kitware's native volume mapper or NVIDIA's IndeX library, and a CPU-only Intel OSPRay-based implementation.

Keywords: Scientific Visualization, High Performance Computing, Navier-Stokes Solver, Direct Numerical Simulation, Computational Fluid Dynamics
Figure 1: Snapshot of the three-dimensional temperature field of sheared thermal convection at Ra = … and Re_w = ….
1. Introduction
Thermohaline ocean circulation [4] is vital for the heat budget of our earth. Manabe and Stouffer [5] observed that it can contribute to an increase of up to ∼ … °C in the yearly averaged mean surface temperatures in the North Atlantic region. Marshall and Schott [6] investigated a vast variety of scales in ocean dynamics and stated that deep convection can be related to mixing layers everywhere in the ocean. Since many complex three-dimensional events happen in large-scale fluid bodies such as oceans, it is vital to visualize the three-dimensional and temporal features of such flow simulations.

We study these large-scale bodies of fluid, which are sheared by winds or currents and influenced by temperature differences in the flow. A fundamental setup of this natural mechanism is sheared thermal convection (Fig. 1). Many processes in nature are based on heat and momentum transfer and therefore on the interaction between buoyancy [7, 8] and shear [9, 10]. Rayleigh-Bénard convection, the flow in a box heated from below and cooled from above, is a paradigmatic system for thermal convection. We present the use of three different rendering libraries available in ParaView [2] to build a time-dependent volume rendering of thermal convection. The deployment and evaluation of the hardware and software requirements of these libraries was motivated by a showcase submission at the 2018 International Conference for High Performance Computing, Networking, Storage and Analysis. In the accompanying video [11] we are able to display the previously two-dimensionally presented flow structures in three-dimensional motion. The reader is led through a presentation of one specific flow case with sheared thermal convection and can experience the dynamics of the thermal structures while being informed about the different flow parameters.
2. Numerical simulations of sheared thermal convection
The direct numerical simulations (DNS) were performed with the second-order finite-difference code AFiD [12], in which the three-dimensional non-dimensional Navier-Stokes equations with the Boussinesq approximation are solved on a staggered grid. We use u = u(x, t) as the velocity vector with streamwise, spanwise and wall-normal components. θ is the non-dimensional temperature, ranging from 0 ≤ θ ≤ 1. The simulations are performed in a computational box with periodic boundary conditions in the streamwise and spanwise directions, confined by a heated plate below and a cooled plate on top. The shearing of the flow is implemented by a Couette flow setting in which both the top and bottom plates are moved in opposite directions with the speed u_w, keeping the average bulk velocity at zero and therefore minimizing dissipation errors. The domain size is (L_x × L_y × L_z) = (9πh × πh × h), using a grid of (n_x × n_y × n_z) = (6912 × … × 384) points.

The second-order finite-difference Navier-Stokes solver AFiD [12] was written in Fortran 90 to study large-scale wall-bounded turbulent flow simulations. In collaboration with NVIDIA, USA, the code was ported in its newest version to a GPU setting, using an MPI and CUDA Fortran hybrid implementation optimized to run and solve large flow fields [13].

We used data from Blass et al. [3] for our evaluation of volume rendering implementations, where a parameter study over different input parameters was conducted to study their influence on the flow field. The control parameters were the temperature difference between the top and bottom plates as the strength of the thermal forcing, non-dimensionalized as the Rayleigh number Ra, and the wall velocity as the strength of the shear forcing, non-dimensionalized as the wall shear Reynolds number Re_w. In Fig. 2 we present snapshots of temperature fields at mid-height in different flow regimes. It can be observed that the flow passes from a thermally dominated regime, with large thermal convection rolls driving the flow (Fig. 2a), into a regime where the mechanical forcing is dominant. Here, large-scale meandering structures can be observed, which are driven by the shearing of the top and bottom plates (Fig. 2d). To undergo a transition between the regimes, the flow has to pass through an intermediate stage, in which the thermal plumes get stretched into large streaks (Fig. 2b). If the shearing is further increased, these streaks become unstable and start meandering in the final flow state (Fig. 2c,d).

The reason for this streaky flow behavior is the addition of a third dimension to originally quasi-two-dimensional flow structures in pure thermal convection. Such thermal convection rolls are driven solely by the thermal difference between the plates. Once the wall shearing is added, the flow starts to strongly move in the streamwise direction, which causes the development of streaks.

Figure 2: Zoomed snapshots of temperature fields of a sheared and thermally forced flow transitioning through all flow regimes for Ra = … and (a) Re_w = 0, (b) Re_w = …, (c) Re_w = …, (d) Re_w = …. The color scale ranges from θ_min (blue) to θ_max (red).
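For orientation, these control parameters follow the standard non-dimensionalization of sheared Rayleigh-Bénard convection; the exact length-scale and prefactor conventions are those of Blass et al. [3], so the expressions below are a generic sketch rather than the precise definitions used in the simulations:

Ra = g β Δ L_z³ / (ν κ),    Re_w = u_w L_z / ν,

where g is the gravitational acceleration, β the thermal expansion coefficient, Δ the temperature difference between the bottom and top plates, L_z the plate separation, ν the kinematic viscosity, κ the thermal diffusivity, and u_w the speed of the moving walls.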
In turbulent flows it is very important to investigate how certain characteristic parameters are influenced by the flow. In thermal convection, the heat transfer, non-dimensionally defined through the Nusselt number Nu, is a good indicator of whether changing flow structures have a supporting effect or may disrupt a previously transport-favorable flow situation.

While two-dimensional visualizations are very helpful in understanding the behavior of the large-scale structures, they do not show the complete scientific picture. They give a good indication of the flow behavior, but to understand thermal turbulence it is vital to see the whole flow field and the dynamic interaction of turbulent structures with each other. The opportunity to observe the flow evolving and transitioning through different regimes is a great chance not only to statically observe different flow states at fixed locations in space, but also to actually follow the flow on its path to develop thermal plumes, streaks and meandering structures.

It has been previously shown in thermal convection that the large thermal plumes can be traced until very close to the heated and cooled plates [14]. So it is very important to also observe the emergence of structures close to the boundary layer. In the shear-dominated regime, which we visualize in the accompanying video [11], we can observe extremely large-scale structures which are caused by a combination of thermal and shear forcing. The detailed visualizations we present allow us to follow not only the large-scale structures, but also the interaction of small-scale structures much closer to the plates (Fig. 3).
3. Volume rendering libraries and setup
Figure 3: Zoom of a snapshot of the temperature field (top) and the vorticity structures (bottom) at Ra = … and Re_w = ….

We use ParaView v5.6.0, a world-class, open-source, multi-platform data analysis and visualization application installed on Piz Daint. Piz Daint, a hybrid Cray XC40/XC50 system, is the flagship supercomputer of the Swiss National HPC service. We have deployed and tested several solutions within ParaView in which parallelism is expressed to different degrees: data-parallel visualization pipelines with GPU-based rendering, or multi-threaded parallelism for CPU-based rendering.
The computational domain used for our simulations is made of 6912 × … × 384 grid points. The temperature scalar field, stored as float32, takes 36 GB of memory, an overwhelming size to handle on a normal desktop. Using different parallel programming paradigms has enabled us to provide an engaging environment to promote interactive tuning of visualization options and high productivity for movie generation.

Visualization of three-dimensional scalar fields is a very mature field. Many techniques are available to make sense of the three-dimensional nature of the data and its variations throughout the volume. Surface-based renderings with isosurface thresholds or slicing planes have a great appeal in that they are easy to use and provide unambiguous representations based on clearly defined numerical values. Volume renderings, early applied to medical applications, are also a great fit for scalar visualizations, especially in the realm of time-dependent outputs. They are, however, much more difficult to use. Volume rendering is based on the principle of converting a 3D scalar field into an RGB (color) volume and an opacity volume. Transfer functions, often defined in an ad-hoc manner, convert scalar values to colors and classify the data into regions of different opacities. A volume can then appear as clouds with varying density and color. Their interpretation remains subjective to the user's taste and practice. We refer readers to other sources [15] to dive more deeply into the principles of volume rendering.

Volume rendering can be implemented in different manners. ParaView was chosen because it offers a testbed for several state-of-the-art implementations which can be selected based on rendering parameters and available hardware.

The largest partition of the Piz Daint supercomputer has nodes equipped with one Intel Xeon E5-2690 (12 cores, 64 GB RAM) and one NVIDIA Tesla P100 GPU (16 GB RAM, OpenGL driver 396.44). Thus our priority is to evaluate the GPU-based implementations. ParaView's default installation also enables a software ray caster for rendering volumes, but we have found its performance far below the other options. The lack of advanced parameter settings in the Graphical User Interface (GUI) of ParaView also led us to abandon its evaluation. We tested ParaView's native GPU ray casting implementation against IndeX, an NVIDIA library, as well as OSPRay, a software-based library developed by Intel. Doing so provides a valid option to users of supercomputers not equipped with GPUs. Our performance evaluation is based on ParaView's benchmarking Python source code (found in ./Wrapping/Python/paraview/benchmark/).

We have in all cases ignored disk-based I/O costs. There is often quite a bit of variability when running on a large distributed file system shared by hundreds of users. Our motivations are rendering-centered and two-fold: evaluate the memory cost and resources (CPU, GPU) required to get a first image on the screen, and see if color/opacity transfer function editing, as well as other image tuning, can be done interactively using any of the three methods proposed. In the evaluation of performance costs, ParaView's benchmark code enables fully automated testing with a careful management of double buffering, turning off all rendering optimizations designed to accelerate interactive viewing, and forcing full-feature rendering before saving images to disk.

In the two GPU-based methods evaluated, we use an EGL-based rendering layer [16] to overcome the need to have a server-side X-Windows server running on the compute node. This enables headless, off-screen rendering with GPU acceleration.
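To make the setup concrete, the following is a minimal pvpython sketch of the kind of headless script used for these measurements; the file name, the array name 'temperature' and the timing granularity are illustrative assumptions, not the actual benchmark code shipped with ParaView:

# Minimal headless volume-rendering sketch (run with pvbatch or pvpython built
# with EGL/OSMesa support). File and array names are hypothetical.
import time
from paraview.simple import *

reader = OpenDataFile('temperature_t0000.pvti')   # partitioned VTK image data (assumed name)
view = GetActiveViewOrCreate('RenderView')
view.ViewSize = [1920, 1080]

display = Show(reader, view)
display.SetRepresentationType('Volume')           # enable volume rendering of the scalar field
ColorBy(display, ('POINTS', 'temperature'))       # map the scalar through the transfer functions

# Startup cost: time from enabling volume rendering until the first frame is delivered.
t0 = time.time()
Render(view)
print('first frame rendered in %.2f s' % (time.time() - t0))

SaveScreenshot('first_frame.png', view, ImageResolution=[1920, 1080])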
We note, however, that although the GPUs provide phenomenal rendering power, they are limited by the available memory (16 GB on our NVIDIA Pascal GPUs). For the full size of our simulation outputs, we are actually forced to use data-parallel pipelines on multiple nodes to use the aggregate memory of the different GPUs. Our third option uses Intel OSPRay and CPU rendering. HPC compute nodes usually have more memory than their GPU counterparts. We use Piz Daint's high-memory nodes with 128 GB of RAM, where our grid of over 9 billion voxels can fit easily on a single node.

When GPU hardware is present, ParaView's most efficient mapper is a volume mapper that performs ray casting on the GPU using vertex and fragment programs [17]. The core ray-tracing algorithms are coded in GLSL and require a graphics driver supporting at least OpenGL version 3.2 [18]. The data is stored into a vtkVolumeTexture, which manages the OpenGL volume texture, its type and internal format. Although this class supports streaming data into separate blocks to make it fit the GPU memory, we have not used this option, which imposes a performance trade-off when artificially going over the fixed GPU memory limit. Block streaming, sometimes called data bricking, may also suffer from artifacts at the block boundaries where gradient computations are done to support shading. ParaView's OpenGL VolumeRayCastMapper binds the 32-bit float scalar field array to a three-dimensional texture image with a call to glTexImage3D(). An explicit texture object is created, transferring data from host memory to GPU memory. The maximum achievable performance will be proportional to the total amount of GPU memory, and to the transfer bandwidth over our high-speed PCIe3 serial bus connecting the host to the GPU device.

Figure 4: Comparison between volume renderings of temperature with ParaView's OpenGL GPU RayCastMapper (left), and with NVIDIA IndeX (right).

NVIDIA IndeX [19] is a three-dimensional visualization SDK developed to enable volume rendering of massive datasets. NVIDIA has worked in tandem with Kitware to bring an implementation of IndeX to ParaView, and we have enjoyed the benefits of a close partnership between the Swiss National Supercomputing Center (CSCS) and NVIDIA, to be able to use IndeX in a multi-GPU setting. We use the ParaView plugin v2.2 with the core library NVIDIA IndeX 2.0.1. The NVIDIA IndeX Accelerated Compute (XAC) interface integrates the core surface and volume sampling programs written in CUDA [20]. For this case, we have used the generic programs provided by IndeX, without custom programming. In Fig. 4 we show side-by-side renderings done with the two GPU-based libraries, to demonstrate that they produce equivalent images. The ParaView Graphical User Interface ensures that both implementations use identical color and opacity transfer functions and sampling rates. ParaView's GPU ray casting image (left) is used as reference. Differences of illumination are barely noticeable to the human eye.

OSPRay [21] is a ray tracing framework for CPU-based rendering. It supports advanced shading effects, large models and non-polygonal primitives. OSPRay can distribute "bricks" of data as well as "tiles" of the framebuffer, although in our case we use brick subdivisions only.
The Texas Advanced Computing Center has developed a ParaView plugin that enables us to test the possibility of using a ray-tracing based rendering engine for volumetric rendering. This is the best solution for clusters where no GPU hardware is available.

OSPRay can use its own internal Message Passing Interface (MPI) layer to replicate data across MPI processes and composite the image. This would result in linear performance scaling and supports secondary rays used in ParaView's path-tracer mode, but would be prohibitive in terms of communication costs. In this study, we rely on a different parallel computing paradigm. The emphasis is no longer on data parallelism, but rather on multi-threaded execution. A complete software-only ParaView installation was deployed with an LLVM-based OpenGL Mesa layer. We used Mesa v17.2.8, compiled with LLVM v5.0.0, and the OSPRay v1.7.2 library to provide a very efficient multi-threaded execution path taking advantage of Piz Daint's second partition of compute nodes. These nodes are built with two Intel Broadwell CPUs (2×18 cores and 64/128 GB of RAM). We launch ParaView with the Slurm options "--cpus-per-task=…" and "--ntasks-per-core=2" to effectively take full advantage of the multi-threading exposed by the LLVM and OSPRay libraries.
2” to e ff ectively take full advantageof the multi-threading exposed by the LLVM and OSPRay li-braries. ParaView’s default mode of parallel computing is to usedata-parallel distribution, whereby sub-pieces of a data grid areprocessed through identical visualization pipelines. To combinethe individual framebu ff ers of each computing nodes, ParaViewuses Sandia National Laboratory’s IceT [22] compositing li-brary. We use it in its default mode of operation doing sort-last compositing for desktop image delivery. We note here thatNVIDIA’s IndeX uses a proprietary compositing library, so forthe IndeX tests only, we disable ParaView’s default image com-positor.
4. Volume rendering of the thermal convection
Figure 5: Example of color and opacity transfer functions to highlight hot and cold plumes.
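A minimal sketch of how such a transfer-function pair can be defined through ParaView's Python interface; the array name 'temperature' and the control-point values are illustrative assumptions, not the exact functions used for Fig. 5:

from paraview.simple import *

# Color transfer function: cold plumes in blue, the mixed bulk nearly neutral,
# hot plumes in red (entries are value, R, G, B).
colorTF = GetColorTransferFunction('temperature')
colorTF.RGBPoints = [0.0, 0.0, 0.0, 1.0,
                     0.5, 0.9, 0.9, 0.9,
                     1.0, 1.0, 0.0, 0.0]

# Opacity transfer function: keep the bulk nearly transparent so that only the
# hot and cold plumes contribute (entries are value, opacity, midpoint, sharpness).
opacityTF = GetOpacityTransferFunction('temperature')
opacityTF.Points = [0.0, 0.8, 0.5, 0.0,
                    0.5, 0.0, 0.5, 0.0,
                    1.0, 0.8, 0.5, 0.0]

# Attach both functions to the volume representation of the active source.
display = GetDisplayProperties()
display.LookupTable = colorTF
display.ScalarOpacityFunction = opacityTF
Render()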
In visualizing the temperature field, we seek to highlight the turbulence, which is best shown by clearly differentiating between cold and hot regions to see how they interact with each other, as seen in Fig. 5. Our movie animation shows an initial phase where a region of blue tint is superposed on top of the hotter region. Plumes emerging from the bottom and mixing into the cold regions highlight this phenomenon.

Figure 6: Volume rendering with shading based on gradient estimation (left), and with OSPRay-enabled shadows (right).

When presented with multiple visualizations including different illumination and shading, we preferred the renderings which emphasize the amorphous nature of the field data. As can be seen in Fig. 6, shading based on gradient estimation offers little improvement because our data does not have strong gradients, and the use of shadows, which at first might seem more appealing, produces images with a strong surface-like look, which we discarded upon further analysis.

Volumetric rendering of high-resolution grids has a non-negligible cost, which we briefly document here. Creating the first frame after data has been read into memory, i.e., the startup cost, has a great impact on whether users adopt a particular implementation. In a post-hoc visualization, data would be read from disk; in an in-situ scenario, data might have to be converted to VTK data structures. Thus, we measure performance after the time ParaView has collected all the data and created a bounding-box representation. This startup cost for the first image is also of paramount importance in a movie-making scenario, where data are read from disk, a single image is computed, and the whole visualization pipeline and hardware resources are flushed to visualize the next timestep.

Unlike ParaView's native GPU ray caster implementation, which does not enable block streaming, the NVIDIA IndeX library processes data by chunks. However, it does so by bringing volume sub-extents incrementally into the GPU memory. Early volume chunks are rendered properly as long as the GPU memory is not exhausted. When memory runs out, late chunks actually corrupt the final image. Our attempts to render a 4-billion-voxel dataset on a single node did not succeed with NVIDIA IndeX. We observe failures to allocate 64-voxel cubes, and the final images are corrupted.

We summarize in Table 1 the time from when volumetric rendering options are enabled, triggering the building of internal structures, until the first frame appears. In order to measure the memory cost of all three libraries under evaluation on a single node, we restricted our test sample to a quarter-size domain of the original grid, i.e., 2.28G voxels (1730 × … × …). The GPU memory usage, measured with the nvidia-smi diagnostic tool, settles at 9.1 GB for ParaView's native ray caster, and 12.3 GB for NVIDIA IndeX.

Table 1: Initialization and memory costs for a quarter-size domain on one node.
Rendering library        Startup    ParaView task memory
OSPRay                   1.34 s     18.4 GB
ParaView GPU Mapper      6.17 s     27.2 GB
NVIDIA IndeX             11.84 s    39.2 GB

We note both a much higher memory consumption on the application side of ParaView and on the GPU memory side for the NVIDIA IndeX implementation. The high initial setup cost incurred by the NVIDIA IndeX library is due to higher volume transfer between the CPU and the GPU, a cost that increases further when running in parallel, as the current implementation of IndeX triggers a re-execution of the data I/O due to larger-than-usual ghost layer requirements. Work is in progress to minimize this impact in a future version of the plugin (personal communication with the NVIDIA development team).

If memory costs are substantial, more nodes and/or more GPUs will be required, increasing the run-time cost of the visualization. Our data domain is quite large, and we are not able to load a half-size domain on a single GPU node. Indeed, both the 64 GB RAM on the node and the 16 GB RAM on the GPU are hard limitations. The OSPRay-based CPU rendering is one way to alleviate this problem. We can load the full-size domain on a single node of the multi-core partition of Piz Daint with dual-Xeon chips and 128 GB of RAM. We measured again the startup cost for the first image at full HD resolution (1920x1080 pixels), using 72 execution threads, and found it to increase linearly with the grid dimensions. We tested the quarter-size, half-size and full domains and report the delivery of the first image in 1.07, 1.50, and 2.33 s, respectively. The associated cost in RAM is also linear, at 18.4 GB, 36.5 GB and 73 GB, respectively. Of great interest is OSPRay's management of memory. OSPRay volumes can be stored in two different manners. The first variant, named shared structured volume, matches ParaView's data layout. Version 5.6 of ParaView is the first version where this zero-copy access pattern is used, and it provides both a faster startup time and a much lower memory footprint, as compared to previous work. Indeed, we reported earlier on the use of OSPRay's alternate implementation, called block bricked volume, whereby data locality in memory is increased by arranging voxel data in smaller chunks. This came, however, at a higher cost, doubling the memory footprint on the CPU [23].

… and we tested the rendering speed of that particular mode in a batch production test. We created an OSPRay-based benchmark test to mimic a navigation fly-through in a full resolution domain, starting from an overall view of the full grid, zooming in, rotating the view-point, and finally zooming in to immerse the viewer in the volume. Our initial view-point has some regions of screen-space empty, where rendering costs at each pixel are negligible. We then move quickly into the scene such that the viewport is completely covered by active pixels, that is, all pixel rays hit the volume. We rendered our benchmark test at three different pixel resolutions, WXGA (1280x800 pixels), Full HD (1920x1080 pixels) and 4K Ultra HD (3840x2160 pixels), to evaluate the impact of pixel resolution on rendering costs. We also evaluated the use of hyper-threading to further boost performance. Table 2 summarizes our average rendering time per frame for 300 frames of navigation.

Table 2: Average rendering time per frame for the full-size domain at different pixel resolutions.

Scaling the framebuffer resolution to very large sizes is not a showstopper.
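A simplified sketch of how such a timed navigation benchmark can be scripted with paraview.simple; the camera path, increments and frame count are illustrative and do not reproduce the exact fly-through of our benchmark:

import time
from paraview.simple import *

view = GetActiveViewOrCreate('RenderView')
view.ViewSize = [1920, 1080]                      # Full HD; use [3840, 2160] for 4K UHD
camera = GetActiveCamera()

Render(view)                                      # warm-up frame: build acceleration structures

n_frames = 300
t0 = time.time()
for i in range(n_frames):
    camera.Azimuth(0.5)                           # rotate the view point
    camera.Zoom(1.005)                            # move gradually into the volume
    Render(view)
print('average rendering time per frame: %.3f s' % ((time.time() - t0) / n_frames))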
In a post-processing scenario, we have seen that the two GPU-based rendering solutions are limited by the available GPU memory, since our 9-billion-voxel data set will not fit on a single GPU. Likewise, in an in-situ scenario, the visualization would most likely use a parallel set of nodes. Loading our full-size data, we rounded up our evaluation of all three rendering options by measuring the initial cost for the first image (after all I/O has been done), and also the average rendering time in a scripted animation loop. Fig. 7 summarizes our results, with the dataset distributed among 4, 8 and 12 compute nodes.

As expected, startup times decrease almost linearly with the number of compute nodes. For the GPU-based methods, less data is transferred from CPU memory to GPU memory. Our animation benchmark loads a single timestep of data; thus, once the data has migrated to the GPU, there is hardly any CPU-to-GPU communication apart from a single framebuffer image. For the CPU-based implementation, the build-up of the ray-tracing acceleration structures takes just over one second, so there is less difference across the few tests executed. We see rendering times reduced somewhat linearly since there is less workload per node.

Figure 7: Overview of initial cost and average rendering time per frame.

In a movie production setting where all timestep outputs are read once, rendered once and then discarded, the startup cost of any rendering library needs to be weighed against the I/O costs. Although our data I/O statistics show quite a bit of variation because of the high load of our multi-user system with over 5000 compute nodes, our simulation data are read, on average, in about 32 s (resp. 25 and 16 s) on 4 nodes (resp. 8 and 12 nodes). We see that the initialization of the rendering sub-system has a greater impact than expected, and that in an in-situ scenario it would be the single most important barrier to performance. The initialization of NVIDIA IndeX is the most significant bottleneck. Discussions with NVIDIA are ongoing and our hope is that this will be improved in future versions of the SDK, since the library is still in early development. We comment here that the parallel execution of the OSPRay-based volume rendering was made possible by using yet another ParaView mode, letting the OSPRay library take full control of the overall scene and of the parallel frame compositing. Finally, we highlight the fact that the OSPRay average rendering times per frame in our animation are all under one second, while it takes a minimum of 8 compute nodes to achieve this with the NVIDIA IndeX solution. This level of interactivity can be satisfactory during the prototyping phase of a visualization.
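Returning to the movie-production setting described above, the per-timestep workflow is conceptually as simple as the loop sketched below; the file naming scheme is hypothetical and the real scripts additionally distribute the data with pvbatch over several nodes:

import glob
from paraview.simple import *

view = GetActiveViewOrCreate('RenderView')
view.ViewSize = [1920, 1080]

for step, fname in enumerate(sorted(glob.glob('temperature_t*.pvti'))):
    reader = OpenDataFile(fname)                  # read one timestep
    display = Show(reader, view)
    display.SetRepresentationType('Volume')
    ColorBy(display, ('POINTS', 'temperature'))
    Render(view)                                  # startup cost is paid once per timestep
    SaveScreenshot('frame_%04d.png' % step, view, ImageResolution=[1920, 1080])
    Delete(reader)                                # flush the pipeline before the next timestep
    del reader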
5. Summary and conclusion
We have discussed three implementations of volume rendering for a thermal convection simulation output of substantial size. Our time-dependent output is stored as a float32 array of 36 GB per timestep. This is a non-trivial size for the most common GPUs. This leaves the scientist with two options: 1) use a data-parallel visualization application with GPU-assisted rendering, or 2) use a CPU-only visualization environment which can fit on compute nodes where large memory banks are usually found. Our choice was to deploy a single application, the open-source ParaView, due to its support for different parallel execution paradigms, and for its ability to work with different off-screen and on-screen rendering backends. Having a single application, driven by fully automated Python scripts and a benchmarking suite of tools available in ParaView itself, enabled us to compare all possible implementations with reduced variability.

We tested two GPU-based rendering options. We first used ParaView's native volume rendering, which has proved to offer the best compromise between startup time and interactive performance. We also tested an alternative solution based on a new library in development by NVIDIA. In our current setup, the IndeX library offers superior interactive rendering, however at non-negligible initialization costs. We evaluated an implementation of volume rendering provided by the Intel OSPRay library, a software-based framework which can take remarkable advantage of a multi-threaded execution layer. This also fits well on a subset of our available hardware, a dual-Xeon based compute node without GPU. Our experiences are of interest for several computer platforms around the world where graphics hardware is not available.

Our emphasis in creating the scientific visualization shown in the accompanying video [11] was two-fold. First, having an interactive environment enabling us to prototype the visualization with large-scale data. The editing of color and opacity transfer functions is the most demanding step in deriving the proper visualization, and we were able to provide an interactive setup using either GPU- or CPU-based volume rendering. Dealing with long time-dependent simulation outputs was the second requirement, and the path to achieve high productivity was to use parallel and scalable I/O routines. We used VTK's native XML partitioned file format convention for Cartesian image data. This was pivotal for a quick turn-around time. The OSPRay-based implementation had the best performance in both initialization and average rendering time, but suffered from some parallel image compositing artifacts at inter-process boundaries. Given the very high spatial resolution of our grid, these artifacts are only visible at extreme zoom levels in the vicinity of ghost cells between MPI-distributed data. To conclude, and to ensure the best visual quality, the compromise for movie production was to use small subsets of GPU nodes with ParaView's native volume renderer.

The volume rendering benchmarking platform deployed to analyze our large grid simulations provides a unique chance to observe sheared thermal convection in a very simple system with far-reaching consequences. Furthermore, the visualizations allow us to have a very good first insight into the interplay between thermal convection and flow shearing by different kinds of wind and flow currents. We are now able to better understand the emergence and behavior of flow structures transporting heat through the system and affecting the flow dynamics.

Acknowledgments
Alexander Blass was financially supported by the Dutch Organization for Scientific Research (NWO-I) and conducted his simulations at the Swiss National Supercomputing Center, under compute allocations s713, s802, and s874. We acknowledge the support from the Dutch national e-infrastructure of SURFsara, a subsidiary of the SURF cooperation, and the Priority Programme SPP 1881 Turbulent Superstructures of the Deutsche Forschungsgemeinschaft. We thank the ParaView development team at Kitware, USA, for fruitful discussions and motivational material. Dave DeMarle has been particularly helpful in discussions related to the OSPRay plugin. Mahendra Roopa at NVIDIA has also been extremely receptive to our feedback and instrumental in helping us get the best of the IndeX library in a multi-GPU setting. We are grateful to the reviewers of our manuscript, who provided a critical reading and motivated the clarifications we have added. We also would like to thank Paul Melis from SURFsara for valuable input to our video [11].
References

[1] Intergovernmental Panel on Climate Change, Ocean systems, in: Climate Change 2014: Impacts, Adaptation and Vulnerability. Part A: Global and Sectoral Aspects. Working Group II Contribution to the IPCC Fifth Assessment Report, Chapter 12, 2014, pp. 411–484.
[2] J. Ahrens, B. Geveci, C. Law, ParaView: An End-User Tool for Large Data Visualization, Butterworth-Heinemann, 2005.
[3] A. Blass, X. Zhu, R. Verzicco, D. Lohse, R. J. A. M. Stevens, Flow organization and heat transfer in turbulent wall sheared thermal convection, Preprint arXiv:1904.11400 (2019).
[4] S. Rahmstorf, The thermohaline ocean circulation: A system with dangerous thresholds?, Climatic Change 46 (2000) 247–256.
[5] S. Manabe, R. J. Stouffer, Two stable equilibria of a coupled ocean-atmosphere model, J. Climate 1 (1988) 841–866.
[6] J. Marshall, F. Schott, Open-ocean convection: Observations, theory, and models, Rev. Geophys. 37 (1) (1999) 1–64.
[7] G. Ahlers, S. Grossmann, D. Lohse, Heat transfer and large scale dynamics in turbulent Rayleigh-Bénard convection, Rev. Mod. Phys. 81 (2009) 503.
[8] D. Lohse, K.-Q. Xia, Small-scale properties of turbulent Rayleigh-Bénard convection, Annu. Rev. Fluid Mech. 42 (2010) 335–364.
[9] A. J. Smits, B. J. McKeon, I. Marusic, High-Reynolds number wall turbulence, Annu. Rev. Fluid Mech. 43 (2011) 353–375.
[10] D. Barkley, L. S. Tuckerman, Mean flow of turbulent-laminar patterns in plane Couette flow, J. Fluid Mech. 576 (2007) 109–137.
[11] J. M. Favre, A. Blass, Volume renderings of sheared thermal convection [video file] (2018). URL https://youtu.be/yEj83O3hVv4
[12] E. P. van der Poel, R. Ostilla-Mónico, J. Donners, R. Verzicco, A pencil distributed finite difference code for strongly turbulent wall-bounded flows, Computers & Fluids 116 (2015) 10–16.
[13] X. Zhu, E. Phillips, V. S. Arza, J. Donners, G. Ruetsch, J. Romero, R. Ostilla-Mónico, Y. Yang, D. Lohse, R. Verzicco, M. Fatica, R. J. A. M. Stevens, AFiD-GPU: a versatile Navier-Stokes solver for wall-bounded turbulent flows on GPU clusters, Comput. Phys. Commun. 229 (2018) 199–210.
[14] R. J. A. M. Stevens, A. Blass, X. Zhu, R. Verzicco, D. Lohse, Turbulent thermal superstructures in Rayleigh-Bénard convection, Phys. Rev. Fluids 3 (2018) 041501(R).
[15] W. Schroeder, K. Martin, B. Lorensen, The Visualization Toolkit, Kitware, 2006, pp. 213–244.
[16] EGL Eye: OpenGL visualization without an X server, http://tinyurl.com/ybmnzdtv
[17] Volume rendering improvements in VTK, https://blog.kitware.com/volume-rendering-improvements-in-vtk
[18] Shaders in VTK, …
[19] NVIDIA IndeX, https://developer.nvidia.com/index
[20] R. Haas, P. Mösta, M. Roopa, A. Kuhn, M. Nienhaus, Programmable interactive visualization of a core-collapse supernova simulation, in: Conference on High Performance Computing Networking, Storage and Analysis, SC 2018, Dallas, TX, USA, 2018.
[21] OSPRay: a ray tracing based rendering engine for high-fidelity visualization, …
[22] K. Moreland, W. Kendall, T. Peterka, J. Huang, An image compositing solution at scale, in: Conference on High Performance Computing Networking, Storage and Analysis, SC 2011, Seattle, WA, USA, 2011, pp. 25:1–25:10.
[23] J. M. Favre, A. Blass, Volume renderings of sheared thermal convection, in: Conference on High Performance Computing Networking, Storage and Analysis, SC 2018, Dallas, TX, USA, 2018.