Learning Adaptive Sampling and Reconstruction for Volume Visualization
Sebastian Weiss, Mustafa Işık, Justus Thies, Rüdiger Westermann
Fig. 1: An importance network, together with a differentiable sampler and a reconstruction network, takes a low-resolution visualization (a) and infers an importance map (b) from it. From this map, an adaptive sampling pattern with an adjustable number of samples (5% for iso, top; 10% for dvr, bottom) is derived, and a volume ray-caster samples the data according to these samples (c). The reconstruction network completes the visual representation from the sparse set of samples (d). The ground truth visualizations are shown in (e). The proposed network pipeline works on images of iso-surfaces (top) and direct volume renderings (bottom).
Abstract—A central challenge in data visualization is to understand which data samples are required to generate an image of a data set in which the relevant information is encoded. In this work, we make a first step towards answering the question of whether an artificial neural network can predict where to sample the data with higher or lower density, by learning correspondences between the data, the sampling patterns, and the generated images. We introduce a novel neural rendering pipeline, which is trained end-to-end to generate a sparse adaptive sampling structure from a given low-resolution input image, and reconstructs a high-resolution image from the sparse set of samples. For the first time, to the best of our knowledge, we demonstrate that the selection of structures that are relevant for the final visual representation can be jointly learned together with the reconstruction of this representation from these structures. Therefore, we introduce differentiable sampling and reconstruction stages, which can leverage back-propagation based on supervised losses solely on the final image. We shed light on the adaptive sampling patterns generated by the network pipeline and analyze its use for volume visualization including isosurface and direct volume rendering.
Index Terms—Volume visualization, adaptive sampling, deep learning.
1 Introduction

• All authors are with Technical University of Munich, Germany. E-mail: {sebastian13.weiss, m.isik, justus.thies, westermann}@tum.de.

Which are the data samples that are needed to generate an image of a data set that conveys the relevant information encoded in this data? This question is fundamental to data visualization, since it asks for the importance of data samples from a perceptual point of view, rather than a signal processing standpoint that argues in terms of numerical accuracy.

Recent works in visualization have shown that artificial neural networks can perform an accurate reconstruction from a reduced set of data samples, by learning the relationships between a sparse, yet regular input sampling and the high-resolution output. Learned representations are then applied in the reconstruction process to infer missing data samples. This type of reconstruction has been performed in the visualization image domain to infer high-resolution images from given low-resolution images of isosurfaces [60], in the spatial domain to infer a higher resolution of a 3D data set from a low-resolution version [64], and in the temporal domain to infer a temporally dense volume sequence from a sparse temporal sequence [21].

Others have even proposed neural networks that are trained end-to-end to learn directly the visual data representations instead of the data itself. Berger et al. [3] propose a deep image synthesis approach to assist transfer function design, by letting an artificial neural network synthesize new volume rendered images from only a selected viewpoint and a transfer function. He et al. [23] demonstrate that artificial neural networks can even be used to bridge the data entirely, by learning the relationships between the input parameters of a simulation and visualizations of the simulation results. Both approaches do not make any explicit assumptions about the relevance of certain structures in the data, yet the learned relationships between parameters and visual representations are considered in the image generation process.

Our goal is to make a further step towards learning visual representations, by investigating whether a neural network can a) learn the relevance of structures for generating such representations, b) use this knowledge to adaptively sample a visual representation of a volumetric object, and c) reconstruct an accurate image from the sparse set of samples. Notably, even though we can demonstrate for very large volumes and image sizes that adaptive sampling can save rendering time, performance improvement is not our main objective. It is even fair to say that an optimized GPU volume ray-caster can hardly be beaten performance-wise. Our main objective is to gain an improved understanding of the learning skills of neural networks for generating visual representations in an unsupervised manner, by letting networks learn the relevance of certain structures for obtaining such representations. It can eventually become possible to generate data representations that compactly encode relevant structures in a way that they can be used by a neural network to visualize the data. Such insights can further facilitate the use of transfer learning to construct synthetic data sets that contain the structures that are important for successful learning tasks on real data.
For viewpoint selection, a network might learn to recommend views showing many important structures, and for training this information can be used to acquire more data from similar views.

To address our objectives, we introduce a novel network pipeline that is trained end-to-end to learn the relevance of certain structures in the data for generating a visual representation (Figure 1). This pipeline is comprised of two consecutive internal network stages: an importance network and a reconstruction network. Both networks work in tandem, in that the first learns to place samples along relevant structures by using the second network to give feedback on how well a visual representation of the data can be reconstructed from the sparse sampling. Our approach differs from previous adaptive sampling approaches in volume visualization [37, 31, 2] in that it does not rely on any specific saliency model to determine the image regions that need to be refined. In contrast, we propose a network-based processing pipeline that simultaneously learns where to sample and how to accurately reconstruct an image from the sparse samples, solely using losses on the reconstructed images.

For learning an importance map from a low-resolution visualization and reconstructing an image from a sparse set of pixel values, we use two modified versions of an EnhanceNet [53]. To enable network-based learning using gradient descent, two novel processing stages are introduced:
• A differentiable sampling stage that models the relationship between sample positions and visual representation.
• A differentiable image reconstruction stage using the pull-push algorithm [15, 32] to model the relationship between a sparse set of image samples and the reconstructed image.

In a number of experiments, we demonstrate that the importance network effectively selects structures that are relevant for the final visual representation. We focus on adaptive sampling in image space, i.e., using surface samples and samples resulting from direct volume rendering. As a future direction of research, we outline adaptive sampling in object space, i.e., using data samples along view-rays. Our experiments include qualitative and quantitative evaluations, which indicate good reconstruction accuracy even from few samples. The source code of our processing pipeline is available at https://github.com/shamanDevel/AdaptiveSampling, including some of the data sets that have been used for training and validation.
2 Related Work
In the following, we review previous works that share similarities with our approach from the fields of adaptive sampling for rendering as well as neural network-based image and volume reconstruction.
2.1 Adaptive Sampling for Rendering
Adaptive rendering has a long tradition in computer graphics, to reduce the number of rays to trace against the scene and to perform rasterization at lower image resolution. At the core of such approaches is the computation of importance values to steer the adaptive refinement, for instance, based on perceptual models [5, 42, 48], image saliency models using pixel variance [44, 49], image difference operations [40], or entropy-based measures [61], to name just a few. In the context of foveated rendering [17], where usually a static adaptive sampling pattern is used that moves with the user's gaze, a luminance-contrast-aware criterion was introduced to enable feature-aware adaptivity [58]. The importance map generation process is often started from an image preview that is calculated using a low-resolution render pass or a high-resolution estimate that can be created in a significantly faster way than the final image.

For volume rendering, a number of approaches have investigated adaptive sampling in object space, to reduce the number of samples along the view rays [43, 9, 38, 6]. Adaptive image-space refinement has been proposed by Levoy [37], by using the color variances between pixels at low image resolution to decide whether to refine the image resolution locally. Kratz et al. [31] propose to use the difference image between two coarser resolution images, and locally refine where high differences are observed. Belyaev et al. [2] render low-resolution images of isosurfaces and refine depending on how many pixels surrounding a pixel in the low-resolution view fulfill certain requirements. Frey et al. [13] use a fixed random sampling structure that is applied in a hierarchical manner to progressively refine the image.

The major differences between these approaches and our proposed sampling pipeline are as follows: Firstly, the pipeline learns to adapt the sampling in an unsupervised manner. A specific feature descriptor that steers the placement of samples is not used, and importance values are learned solely using losses on the reconstructed image. Secondly, the number of samples can be prescribed, which is not easily possible with existing schemes due to their pixel-iterative nature. Thirdly, the pipeline learns simultaneously the adaptive sampling and the image reconstruction from the sparse set of samples. In all previous schemes, the final interpolation step is decoupled from the sampling process.
2.2 Deep Learning for Upscaling and Denoising
In recent years, deep learning approaches have been used successfully for single-image and video super-resolution tasks [11, 54, 55, 56, 52, 7], i.e., the upscaling of images and videos from a lower to some higher resolution. Many previous works let the networks learn to optimize for losses between the inferred and ground-truth images based on direct vector norms [29, 27]. GANs were introduced to prevent the undesirable smoothing of direct loss formulations [53]. Closest to our approach is the work by Kuznetsov et al. [34] for learning adaptivity in Monte-Carlo path-tracing and denoising of the final image. A first network learns to adapt the number of additional paths from an initial image at the target resolution, which is generated via one path per pixel. A second denoising network learns to model the relationship between an image with increased variance in the color samples and the ground truth rendering [47, 41]. Conceptually, our approach differs in that it works on a low-resolution input map and then learns to freely position the sample locations in image space, i.e., it learns to place zero or one sample per pixel. This requires a completely different differentiable sampling stage, as well as a differentiable image reconstruction stage that can work on a sparse set of samples. Furthermore, Kuznetsov et al. use finite differences between images of different sample counts for gradient estimation. Incurring noise is reduced by averaging multiple samples with different sample counts, which is not possible in our approach where at most one sample per pixel is taken. Instead, we propose a sigmoid approximation that can be differentiated analytically.

In visualization, Zhou et al. [64] presented a CNN-based solution that upscales a volumetric data set using three hidden layers designed for feature extraction, non-linear mapping, and reconstruction, respectively. Han et al. [20] introduced a two-stage approach for vector field reconstruction via deep learning, by refining a low-resolution vector field from a set of streamlines. Berger et al. [3] proposed a deep image synthesis approach to assist transfer function design using GANs, by letting a network synthesize new volume rendered images from only a selected viewpoint and a transfer function. The use of neural network-based inference of data samples in the context of in situ visualization was demonstrated by Han and Wang [21], where a network learns to infer missing time steps between 3D simulation results. He et al. [23] use neural networks for parameter-space exploration, by training a network to learn the dependencies between visual mappings of simulation results and the input parameters of the simulation. Guo et al. [18] designed a deep learning framework that produces coherent spatial super-resolution of 3D vector field data. Weiss et al. [60] extend image upscaling to geometry images of isosurfaces including depth and normal information. Instead of data upscaling, Tkachev et al. [57] predict the next time step of a simulation and identify regions of interest by high variance between the network prediction and the ground truth. Common to all these approaches is the use of a regular sampling structure that does not consider the importance of samples in the inference step.
3 Learning to Sample
In the following, we discuss how the importance network makes use of both the adaptive sampling stage and the reconstruction network to learn where to place samples with higher density. The importance network (subsection 3.1) receives an image of the data set at low resolution. This image L is of shape C × fH × fW, where W and H denote the screen resolution, and f the downsampling factor; this factor is set to 1/8 in this work (see subsection 4.2). Each pixel stores C channels, such as color, depth, and normal, representing what is seen through that pixel. The network is trained to learn an importance function N_I that generates a gray-scale importance map I ∈ [0, ∞)^(H×W) in which low and high values, respectively, indicate where less or more samples are taken. The sampler S (subsection 3.2) takes the importance map and places a given number of samples, e.g., 5% of the pixels, in the full-resolution image S ∈ R^(C×H×W) according to the importance information. The object is rendered only at these sample positions. The reconstruction network learns a function N_R (subsection 3.3) that reconstructs the final output O ∈ R^(C×H×W) from the sparse set of samples. We make the sampler differentiable w.r.t. sample positions to allow gradient flow from the reconstruction network (subsection 3.3) to the importance network, so that the reconstruction network is trained simultaneously and propagates the loss information to the sampling stage. Since the entire pipeline is trained end-to-end using a loss on the reconstructed and ground truth images, the importance network and the pair of sampler and reconstruction network work together in an effort to learn the placement of samples so that high reconstruction quality is achieved.

In principle, one can refrain from using a separate importance map, by realizing the sampler as a network that directly learns the adaptive sampling. In this case, however, modelling the positional information in a network requires to represent positions explicitly, either in a graph structure or a linear field, so that less efficient graph networks or fully-connected networks need to be used. Furthermore, the sampler has to be re-trained whenever a different number of samples is used. Our approach enables the use of efficient convolutional networks, and to change the number of samples at testing time.

An overview of the processing pipeline is shown in Figure 2. It works with images comprised of an arbitrary number C of channels. In the first part of this work, the pipeline is introduced for isosurface rendering with C = 5, i.e., a binary mask (1: hit, 0: no hit), a normal vector, and a depth value. The application to direct volume rendered images is discussed in section 6.

Weiss et al. [60] enforce frame-to-frame coherence during animations by including a temporal loss in the training step. This loss considers the difference between the previous frame – warped by the frame-to-frame optical flow – and the current frame. In the accompanying video, this approach is used for both the importance and reconstruction networks. In the following discussion, however, temporal connections are omitted and the focus is solely on single-image reconstruction for clarity.
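For orientation, the following shape-level sketch traces the data flow L → I → S → O in PyTorch. It is a minimal sketch, not the actual implementation: the network stages are stubbed out with placeholder tensors, and the steepness value 50 in the smooth sampling step (Eq. (5) below) is an illustrative assumption.

```python
import torch

C, H, W, f = 5, 256, 256, 1 / 8            # channels, screen resolution, factor
L = torch.rand(C, int(f * H), int(f * W))  # low-resolution input image
I = torch.rand(H, W)                       # importance map, stands in for N_I(L)
P = torch.rand(H, W)                       # sampling pattern, uniform in [0, 1)
T = torch.rand(C, H, W)                    # pre-rendered high-resolution target
S = torch.sigmoid(50.0 * (I - P)) * T      # sparse samples, cf. Eq. (5)
O = S                                      # N_R would reconstruct the dense output here
print(L.shape, I.shape, S.shape, O.shape)
```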
3.1 Importance Map

The importance network I ← N_I(L) determines the distribution of the samples that are required by the reconstruction network to generate the visual output according to some loss function. Deeming every pixel equally important, i.e.,

N_{I,constant}(L)_ij = 1,   (1)

leads to a uniform distribution of the samples [64, 21, 60]. Alternatively, and in the spirit of classical edge detection filters, the screen-space gradients of the individual channels can be used, i.e.,

N_{I,gradient}(L)_ij = Σ_c w_c ||∇L_{ij,c}||,   (2)

where ∇L_{ij,c} is the screen-space gradient of channel c at location ij. The contributions of the individual channels are weighted by w ∈ R^C. Other known importance measures consider screen-space curvature via the variation of surface normals [46], or color contrast via the variation of luminance [58].

Alternatively, we introduce a fully convolutional neural network N_{I,net} (subsection 4.2) that predicts a high-resolution greyscale importance map I from a low-resolution rendering L. Notably, this network is not trained w.r.t. specific characteristics that are derived from the image, like gradients or luminance information, since this requires to heuristically decide on the importance of pixels. Instead, it is trained end-to-end with losses only on the reconstructed color information, by gradient descent all along the processing pipeline. In section 5, network-based inference of the importance map is compared to alternative approaches, showing superior prediction of regions that are important for the final image.

Fig. 2: Overview of network-based adaptive sampling. From a low-resolution image L, the importance network N_I infers the importance map I. The sampler S uses this map together with a sampling pattern P to adaptively place samples in the high-resolution image S. Ray-casting the object at these samples generates a sparse image. The reconstruction network N_R recovers the dense output O, which is compared against the target T by the loss function.

3.2 Sampling and Rendering

Given the target number of samples in the final image, the sampler uses the importance map I to determine where to place these samples. To generate the given number of samples, two main classes of algorithms are commonly used in rendering:
• Stippling starts with a given number of points at random locations and iteratively optimizes these locations so that the point density matches the density of the importance map [10, 16].
• Importance sampling treats the importance map as a density function and places samples via rejection sampling or the inverse cumulative distribution function [35, 1].

These algorithms, however, are not easily differentiable w.r.t. changes in the importance map, since they use discrete optimizations or random processes, and often are too slow for real-time applications. To make the sampling process differentiable and fast, we propose a sampling strategy that computes for every pixel independently the chance of being sampled. This is achieved by a smooth approximation of rejection sampling, which is differentiable and allows for gradient propagation through the network pipeline. Since every pixel can be processed independently, this scheme can effectively leverage parallel execution on the GPU.
On the other hand, it does not allow for an exact match of the prescribed number of samples, yet produces a number of samples that slightly varies around the target number.

In a first step, the importance map I is normalized to have a prescribed mean µ and minimal value l ≤ µ. Let µ_I be the mean of I over all pixels, then the image

I′_ij := min{ 1, l + I_ij (µ − l) / (µ_I + ε) }   (3)

has the desired properties. A small constant ε is used to avoid division by zero. The minimal value l is required to maintain a lower bound on the sample distribution in empty areas, which is important to allow for an accurate reconstruction in such areas. We use l = 0.002 in all of our experiments.
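As a concrete illustration, a minimal PyTorch sketch of the normalization in Eq. (3); the default values for µ and l follow the experiments in this paper, while the value of ε is an illustrative choice.

```python
import torch

def normalize_importance(I, mu=0.05, l=0.002, eps=1e-7):
    """Rescale the raw importance map I (H, W) so that its mean is close to
    the target sample fraction mu, keep the lower bound l in empty regions,
    and clamp to at most 1 as required by the rejection sampling step."""
    mu_I = I.mean()
    return torch.clamp(l + I * (mu - l) / (mu_I + eps), max=1.0)
```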
Clamping to a maximal value of 1 is required by the following sampling step, which is realized as an independent Bernoulli process via rejection sampling, i.e., a sample at location ij is taken if the probability I′_ij is larger than a uniform random value x ∈ [0, 1].

To make the sampling deterministic and parallelizable on the GPU, a sampling pattern P ∈ [0, 1)^(H×W) – uniformly distributed in [0, 1) – is first generated by using a permutation of the numbers {0, ..., HW − 1}/(HW). We analyze four different strategies for generating the permutations: random sampling, regular sampling, Halton sampling [19], and plastic sampling [50]. Plastic sampling has been selected, since it produced slightly superior results in all of our experiments. Section B provides a detailed evaluation of the different strategies.

Ray-casting is then used to compute what is seen through the pixels at the determined sample locations. This information is stored in the high-resolution image S ∈ R^(C×H×W). Since during training the same view is rendered many times using different sampling patterns, pre-computed high-resolution target images T ∈ R^(C×H×W) are provided with the low-resolution inputs. Then, the sampling process simply becomes a selection of pixels from T:

S_ij = H(I′_ij − P_ij) T_ij,   (4)

where the step function H(x) is 1 if x > 0 and 0 otherwise. This step function has zero gradients almost everywhere, in particular w.r.t. the importance map I, from which I′ is derived. Correspondingly, gradients in the loss function w.r.t. the weights and biases of the importance network will also be zero. Therefore, Equation 4 is approximated with a smooth sigmoid function to make it differentiable, so that gradients of the loss function can be back-propagated through all network stages to change the importance map accordingly. Then, the sampling function becomes

S_ij = sig(α (I′_ij − P_ij)) T_ij,   sig(x) := 1 / (1 + e^(−x)),   (5)

where α > 0 controls the steepness. For α → ∞, Equation 5 converges to Equation 4. A large value of α leads to samples that are either very close to 0 or 1, but leads to exploding gradients in the backward pass. A low value leads to samples that smoothly cover the entire interval between 0 and 1. In this case, however, the mismatch between the "fractional" samples that are used only during training and the discrete "binary" samples that are used for testing and validation leads to a significant reduction of the reconstruction quality. In our experiments, a moderate value of α turned out to be a good compromise; a detailed analysis of the relationship between α and the reconstruction quality is provided in subsection 5.2.
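The sketch below illustrates Eqs. (4) and (5) in PyTorch, assuming the normalized importance map from Eq. (3). The random permutation stands in for the pattern generation (the "random" variant of the four strategies; plastic or Halton sequences would be substituted in the same way), and the default steepness value is illustrative only.

```python
import torch

def make_pattern(H, W, generator=None):
    """Sampling pattern P: the numbers {0, ..., HW-1}/HW in a random
    per-pixel permutation, i.e., uniformly distributed in [0, 1)."""
    perm = torch.randperm(H * W, generator=generator)
    return (perm.float() / (H * W)).view(H, W)

def sample_smooth(I_norm, P, T, alpha=50.0):
    """Smooth, differentiable sampling (Eq. 5): I_norm is the normalized
    importance map (H, W), P the pattern (H, W), T the pre-rendered
    target (C, H, W). Returns the sparse image S and the fractional mask."""
    mask = torch.sigmoid(alpha * (I_norm - P))
    return mask.unsqueeze(0) * T, mask

def sample_hard(I_norm, P, T):
    """Discrete sampling (Eq. 4), as used for testing and validation."""
    mask = (I_norm > P).float()
    return mask.unsqueeze(0) * T, mask
```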
3.3 Reconstruction

Given the sparse set of samples S, the reconstruction function N_R needs to estimate the undefined pixel values to produce the dense high-resolution output image O ∈ R^(C×H×W). By using a differentiable reconstruction function, gradients of the loss function on the reconstructed images and the ground truth image can be back-propagated through the sampling stage to the importance map.

In principle, there are different possibilities to fulfill the requirement of differentiability: Firstly, a neural inpainting network can be trained on sparse inputs and the ground truth outputs to learn the reconstruction. However, as we have verified in a number of experiments, network-based inpainting [25, 39, 62] at a sparsity level as used in our application leads to low reconstruction quality (see Figure 6b). The highly varying sample density, with gaps between valid pixel values of up to 20 pixels, poses a challenging problem for known network architectures. Furthermore, since during training the sampling mask in our proposed pipeline is not binary but contains continuous values, techniques like Partial Convolutions [39] are not applicable.

Secondly, classical non-network-based inpainting methods can be employed, for instance, PDE-based methods solving a constrained Laplace problem [4, 14], or patch-based methods using non-local cost functions involving correspondence functions [24, 12, 8]. These methods, however, are not easily differentiable w.r.t. the sampling mask. For example, PDE-based methods use the samples as Dirichlet boundaries and, to the best of our knowledge, there is no meaningful interpretation of a "fractional" Dirichlet boundary. Patch-based methods, on the other hand, use a discrete search over the image space to find a correspondence function, which makes the derivation of continuous gradients impossible.

Therefore, we introduce a novel reconstruction approach that combines a differentiable inpainting method with a residual neural network that learns to improve the inpainting result. In particular, we propose a variation of the pull-push algorithm [15, 32], which is differentiable with respect to the sampling mask and can cope with a mask that comprises fractional values.

The pull-push algorithm builds upon the idea of mipmap hierarchies. Firstly, the sparsely sampled high-resolution image and the mask are recursively filtered and downsampled by a factor of 2. The pixel values are averaged using the fractional values in the sampling mask as weights (average pooling), and max-pooling is used to combine the values in the mask. This has the effect of filling the undefined pixels with values that are averaged from a gradually increasing surrounding. Upon reaching a termination criterion, either a maximal number of steps or complete restoration of the undefined pixels, the images are bilinearly upscaled again. During upscaling, the pixel values from the coarse levels are weighted by the values in the mask at this level, and they are then blended with the value at the fine level based on the sampling values at that level. This allows to smoothly transition from filled pixels at the fine level that are kept in the output towards interpolated values for lower values in the sampling mask. A schematic illustration of the process is shown in Figure 3.

Fig. 3: Pull-push-based inpainting using a mipmap hierarchy of image samples and masks. The image is downsampled until all pixels are filled, and then upsampled by combining interpolated values from lower levels with the pixels at the current level. Masks are propagated through the hierarchy to obtain proper interpolation weights.

Since the algorithm makes use exclusively of continuous pooling and interpolation operations, it is fully differentiable with respect to changes in the pixel data and the sampling mask. The forward code and a manually derived backward code are given in section D. The algorithm has been implemented via custom CUDA operations in PyTorch [45].
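To make the recursion concrete, the following is a compact, pure-PyTorch sketch of the pull-push inpainting; the actual implementation uses custom CUDA kernels and a hand-derived backward pass, so this version only illustrates the idea and assumes image sizes that are powers of two.

```python
import torch
import torch.nn.functional as F

def pull_push(mask, data):
    """mask: (1, 1, H, W) fractional sampling mask; data: (1, C, H, W)
    sparse samples, zero where unsampled. Returns the inpainted image."""
    if mask.shape[-1] <= 1 or mask.shape[-2] <= 1:
        return data
    # Pull: mask-weighted average pooling of the data, max-pooling of the mask.
    w = F.avg_pool2d(mask, 2)
    coarse_data = F.avg_pool2d(mask * data, 2) / (w + 1e-8)
    coarse_mask = F.max_pool2d(mask, 2)
    # Recurse until the coarsest level, where all pixels are filled.
    coarse_data = pull_push(coarse_mask, coarse_data)
    # Push: bilinear upsampling, blended with the valid samples at this level.
    up = F.interpolate(coarse_data, size=data.shape[-2:], mode='bilinear',
                       align_corners=False)
    return mask * data + (1.0 - mask) * up
```

Note that the final blend weights the original samples by the fractional mask values, which is exactly what keeps the result differentiable w.r.t. the mask.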
After inpainting the sparse samples via the pull-push algorithm, a fully convolutional network is used to improve the reconstruction by modeling the relationship between the inpainting result and the ground truth. The network sharpens the results and resolves blurred silhouettes created by the inpainting algorithm. We use the EnhanceNet [53] as base architecture for this learning task, which is discussed in detail in subsection 4.2. In particular, we use the EnhanceNet as a residual network that starts with the inpainting result and learns to infer the changes to the reconstructed samples. A quantitative comparison of different learning approaches is provided in subsection 5.2.

4 Training Methodology
In this section, we provide a detailed discussion of the used network architectures, as well as the training and inference steps. We also shed light on the dependency of the reconstruction quality on the used loss functions.
4.1 Training Data

As training and validation input, 5000 images of randomly selected isosurfaces in the Ejecta data set, a particle-based supernova simulation, were generated via GPU ray-casting at a screen resolution of 512². Each time step was resampled to Cartesian grids with a resolution of 256³ and 512³. The surfaces were rendered from random camera positions, at varying distance to the object and always facing the object center. Renderings are taken from different time steps and resolution levels to let the pipeline learn features at different granularity [60]. The renderer provides the normals at the surface points, which are used in a post-process to compute colors via the Phong illumination model. From this image set, about 20,000 random crops of size 256², showing the isosurface in at least 50% of the pixels, were taken and split between training (80%) and validation (20%). For training, the mean importance value was set to µ = 0.1, i.e., 10% of the samples (see Equation 3). This does not prohibit using less samples for validation and testing, yet we found it beneficial to allow the network to use more samples during training. We used the Adam [30] optimizer. The networks were trained on a single GeForce GTX 1080 for 300 to 500 epochs in around 5-6 days.

Fig. 4: Network architectures used in the proposed pipeline: To estimate the importance map, we use a smaller version of the EnhanceNet [53] with a 4x-upsampling factor, a residual connection with screen space gradient magnitude as baseline, and a 2x-upsampling network as a post-process, with a final Softplus activation producing the non-negative importance values. For the reconstruction network, we experimented with the option of passing the raw samples or interpolated samples as input and using a global residual connection or not. As network architecture, an EnhanceNet with 10 residual blocks is used.

4.2 Network Architectures

The proposed sampling pipeline comprises two trainable blocks: the importance network N_I and the reconstruction network N_R. Both networks use 3x3 convolutions with zero-padding and a stride of one. The importance network is a variant of EnhanceNet [53], yet with only 5 residual blocks (Figure 4a). Instead of directly estimating the importance map, the network takes as input an importance map that is computed using screen space gradient magnitudes (subsection 3.1), and learns to improve this map using a residual connection. We refer to subsection 5.2 for a quantitative comparison of the network results w/ and w/o an initial gradient-based importance estimate.

The importance network performs 4x-upscaling of the low-resolution input image, followed by a 2x-upsampling post-process, so that only 1/8 × 1/8 = 1/64 ≈ 1.56% of the pixels in the final importance map need to be rendered.
The reconstruction network N_R estimates the mask, normal, and depth values at all pixels, thereby also changing the initial values that were drawn in the rendering process. A modified EnhanceNet (Figure 4b) shows superior reconstruction results compared to alternative architectures such as the U-Net [51]. We refer to section A for a more detailed analysis of both architectures. Both networks are provided in the code repository accompanying this paper.

Our experiments (subsection 5.2) show improved reconstruction quality if inpainting is performed first and the result is then passed to a network that uses a residual connection to learn the differences between this result and the ground truth. In addition to the inpainted input samples, we pass the sample mask to the network as a per-sample measure of certainty. Since the network produces output values in R, both the mask and depth values are clamped to [0, 1] and the normals are scaled to unit length before shading is applied.

4.3 Loss Functions

We employ regular vector norms between the network prediction O and the target image T as primary loss functions on the individual output channels. Since the L2 norm tends to smooth out the resulting images, we make use of the L1 norm in this work. With the channels of the output image, i.e., the mask M, the normal map N, and the depth D, given as subscript, the L1 loss of a selected channel X is

L_{1,X} = ||T_X − O_X||_1.   (6)

We do not employ additional perceptual losses, which were shown less effective for isosurface upsampling tasks [60].

The mask channel has a special meaning, as it indicates whether or not a ray hits the isosurface. It is used in the final output to perform a hard selection between the reconstructed color values and the background. To make the mask differentiable, however, its values must be continuous, leading to a smooth blend rather than a binary decision. While this is acceptable along the silhouettes, in the interior it would noticeably distort the reconstruction. In principle, via a sigmoidal mapping it can be enforced that the mask values spread continuously between 0 and 1, yet we observed undesirable blurring when using this approach. To produce sharp masks that are either close to zero or one, we therefore constrain the reconstruction via two losses that are added to the regular L1 loss on the mask. The first loss is a binary cross entropy (BCE) loss that "pulls" the values closer to either zero or one than a normal L1 loss:

L_bce = −(1/(WH)) Σ_ij ( T_{M,ij} log(O_{M,ij}) + (1 − T_{M,ij}) log(1 − O_{M,ij}) ).   (7)

The BCE loss, however, requires that the output mask lies within [0, 1] and thus the mask is clamped beforehand. This leads to zero gradients once the mask reaches values outside of [0, 1]. Therefore, we add the loss

L_bounds = (1/(WH)) Σ_ij ( max(0, |O_{M,ij} − 1/2| − 1/2) )²,   (8)

which pushes values outside [0, 1] back into [0, 1] and leaves values within [0, 1] unchanged.

An additional loss term is required to account for the normalization step in Equation 3. The output of the importance map is normalized to limit the number of available samples. Hence, scaling the network output does not influence the values after normalization. Therefore, during training it can happen that the output values increase or decrease in an unbounded manner. To prevent this, a prior on the importance map is used to enforce that the mean is equal to one before the normalization step:

L_{I,prior} = ( 1 − (1/(WH)) Σ_ij I_ij )².   (9)
The final loss function is a weighted sum of the individual loss terms over all channels, i.e., with X ∈ {M, N, D} it becomes

L = Σ_X λ_X L_{1,X} + λ_bce L_bce + λ_bounds L_bounds + ρ L_{I,prior}.   (10)

Loss weights around λ_D = 5, together with empirically chosen values for λ_M, λ_bce, λ_bounds, λ_N, and ρ, worked well in our experiments.
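A minimal sketch of the combined loss in Eqs. (6)-(10), assuming PyTorch; the default weights are placeholders except for λ_D = 5 from the text, and O, T are assumed to be dictionaries holding the mask, normal, and depth channels.

```python
import torch
import torch.nn.functional as F

def total_loss(O, T, I_raw, lam_M=1.0, lam_bce=1.0, lam_bounds=1.0,
               lam_N=1.0, lam_D=5.0, rho=1.0):
    """O, T: dicts with 'mask', 'normal', 'depth' tensors; I_raw: importance
    map before the normalization of Eq. (3)."""
    l1 = lambda a, b: (a - b).abs().mean()                        # Eq. (6)
    m = O['mask'].clamp(1e-6, 1.0 - 1e-6)                         # clamp for BCE
    bce = F.binary_cross_entropy(m, T['mask'])                    # Eq. (7)
    bounds = (torch.clamp((O['mask'] - 0.5).abs() - 0.5,
                          min=0.0) ** 2).mean()                   # Eq. (8)
    prior = (1.0 - I_raw.mean()) ** 2                             # Eq. (9)
    return (lam_M * l1(O['mask'], T['mask']) + lam_bce * bce +
            lam_bounds * bounds + lam_N * l1(O['normal'], T['normal']) +
            lam_D * l1(O['depth'], T['depth']) + rho * prior)     # Eq. (10)
```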
5 Results and Evaluation
In the following, we evaluate the proposed network pipeline. First, we introduce the quality metrics that are used to compare the results. We then analyze how our design decisions influence the reconstruction quality on the validation data (subsection 5.2). These statistics help to identify the network configurations with the best predictive skills. Next, the proposed network pipeline is compared to a fixed super-resolution network (subsection 5.3). Finally, we shed light on the generalizability of the network pipeline to new views of Ejecta and data sets that were never seen during training (subsection 5.4).
5.1 Quality Metrics

The quality of network-based reconstruction is assessed using three different image quality metrics commonly used in image processing. These metrics compare the output O of the network pipeline with a ground truth rendering T at the target resolution.

The peak signal-to-noise ratio (PSNR) is based on the L2 loss and is defined as

PSNR(O, T) = −10 log₁₀( ||O − T||₂² ),   (11)

where O and T are the network output and target image, respectively.

The Structural Similarity Index (SSIM) [59] extends on the idea of per-pixel losses by measuring the perceived quality using the mean and variance of contiguous pixel blocks in the images. It is defined as

SSIM(O, T) = ( (2 µ_O µ_T + c₁)(2 σ_{O,T} + c₂) ) / ( (µ_O² + µ_T² + c₁)(σ_O² + σ_T² + c₂) ),   (12)

where µ_O and µ_T are the average values of O and T, σ_O² and σ_T² are the variances of O and T, σ_{O,T} is the covariance between O and T, and c₁ and c₂ are small constants to avoid division by zero.

We also use the network-based Learned Perceptual Image Patch Similarity (LPIPS) metric [63] that predicts human perception of relative image similarities. LPIPS builds upon a network that is pre-trained on an image classification task—using the AlexNet [33]—and computes a weighted average of the activations at hidden layers for a given output and target image. Note that a lower LPIPS score is better, whereas PSNR and SSIM indicate higher quality by a higher score. Therefore, 1 − LPIPS is shown in our statistics for better comparison.
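For illustration, a minimal PSNR implementation following Eq. (11), assuming PyTorch tensors with values normalized to [0, 1] and the squared norm taken as the mean squared error; SSIM and LPIPS are best taken from existing library implementations.

```python
import torch

def psnr(O, T):
    """Peak signal-to-noise ratio of Eq. (11) for images in [0, 1]."""
    mse = ((O - T) ** 2).mean()
    return -10.0 * torch.log10(mse)
```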
5.2 Evaluation of Design Decisions

Unless otherwise mentioned, all statistics presented in this section were computed on a validation data set using 2000 novel views of Ejecta at a resolution of 512². The importance map was normalized to have a minimal value l = 0.002 and a mean value µ = 0.05. "Plastic" sampling was used in the sampling stage.
Steepness of the Sampling Function
The parameter α in Equation 5 determines the steepness of the sampling function. A perfect step function, as used for testing, is obtained for α → ∞. Figure 5 compares the total loss on the training and validation data over the course of the optimization for different values of α. A lower value of α leads to a lower cost on the training data, because smoother variations in the fractional samples can be used for reconstruction. However, this behaviour is reversed during validation, because the perfect step function corresponds to a lesser and lesser extent with an increasingly smooth sampling function. Higher values of α, on the other hand, lead to better generalization, yet beyond 100 we observed instabilities in the training as well as numerical precision issues. Therefore, we decided on a fixed value of α below this limit for all experiments.

Fig. 5: Influence of the sharpness parameter α on the training process. A lower value leads to a lower cost during training, but increases the cost in the validation phase where a perfect step function is used.

Residual Connections for Reconstruction
In principle, there are different options to reconstruct a dense image from the sparse set of samples, including sole inpainting via the pull-push algorithm as well as inpainting in combination with network-based reconstruction w/ or w/o residual connections. In Figure 6, the reconstruction quality of all options is compared, using screen space gradient magnitudes as measure for generating the importance map.

Fig. 6: Comparison of different reconstruction methods: (a) Only pull-push-based inpainting. (b) Only network-based reconstruction without residual. (c) Pull-push plus reconstruction network w/o residual. (d) Pull-push inpainting plus reconstruction network with residual. (e) Ground truth. An importance map from screen space gradient magnitudes is used in all examples, with µ = 5% of samples.

As can be seen, the pull-push algorithm already provides a good initial guess on the reconstructed image, and reconstruction quality reduces significantly when it is not used. On the other hand, the network-based approach fails to reliably fill the empty pixels, which is probably due to the vastly different distances between the sparse samples. When using the pull-push algorithm in combination with network-based reconstruction, but with disabled residual connections, no benefit over sole pull-push-based inpainting is gained. The best result is achieved with both pull-push-based inpainting and residual network connections. This is in line with the findings of Kim et al. [28] that the quality of network-based reconstruction improves if the network needs to learn only the changes to the baseline method.
Residual Connections for Importance Mapping
On the validation data, we then analyze the reconstruction quality using different approaches for generating the importance map, i.e., constant importance, importance derived from screen space gradient magnitudes, as well as network-based importance with or without learning a residual to screen space gradient magnitudes. Figure 7 shows the results using the quality metrics described above.

Fig. 7: Reconstruction quality using different importance maps. Bottom left: Low-resolution input. (a) Constant map, (b) based on gradient magnitudes, (c) only network-based learning, (d) network-based learning with residual on gradient magnitudes. µ = 5% of samples were used. Top: Quality metrics for options (a) to (d).

As expected, screen space gradient magnitudes already hint to some important regions that should be sampled with higher density, significantly outperforming a constant importance map. For reconstructing the mask and normal channels, gradient magnitudes and network-based importance learning differ only marginally w.r.t. reconstruction quality. The importance network puts more emphasis on the object silhouettes and leads to an improved reconstruction of the normals over gradient magnitudes. On the other hand, it is important to note that the network learns the importance of features for an accurate screen space reconstruction without any prior information (Figure 7c). The best results are achieved by combining network-based importance learning and screen space gradient magnitudes via a residual network connection, demonstrating the feasibility of learning features that are important for an accurate reconstruction.
Fig. 8: Median reconstruction quality and 25% / 75% quantiles, shown as confidence bands, for an increasing number of samples (10% to 100%) on Ejecta; the panels report PSNR (color) as well as SSIM for the mask, normal, and color channels. Orange: network-based pipeline using a constant importance map. Green: network-based pipeline with the network-based importance map. The red dot represents the 4x-upsampling network (SR-Net) from Weiss et al. [60].
5.3 Comparison to a Fixed Super-Resolution Network

We further analyze the convergence of the proposed sampling pipeline with an increasing number of samples. The network is trained with 10% of the samples, but during inference the available number of samples is varied. The results in Figure 8 indicate that with an increasing number of samples the SSIM and LPIPS scores converge against their optima. Even though this seems logical at first, since the reconstruction network modifies the given samples, it could, in principle, converge against some other solution. Notably, already after taking 20% to 30% of the samples the reconstruction is very close to the target.

We also compare the quality of adaptive sampling to fixed regular sampling using a 4x-upsampling network [60]. The 4x-upsampling network uses a regular sampling structure comprised of 1/16 = 6.25% of the pixels in the high-resolution image, corresponding to a constant importance map with 6.25% of the samples when adaptive sampling is used. Figure 8 shows that the 4x-upsampling network (red) performs equally good as the adaptive pipeline using a constant importance map (orange). However, when the samples are placed adaptively according to the inferred importance map (green), the reconstruction quality is significantly increased at the same number of samples.
5.4 Generalization

The importance and reconstruction networks are trained solely on Ejecta. To test how well the networks generalize, they are applied to a number of data sets that were never seen during training. We use a Richtmyer-Meshkov (RM) simulation, a skull (Skull), an aneurism (Aneurism) at 256³, a bug (Bug), and further data sets (Human, Jet) shown in Figure 10. Quantitative statistics for RM, Skull, and novel views of Ejecta are given in Figure 9. Reconstructed images as well as SSIM and LPIPS statistics for all data sets are shown in Figure 10.

Fig. 9: Quality statistics for novel views of Ejecta and the new data sets RM and Skull (see Figure 10). The baseline method (blue) refers to gradient magnitude-based importance mapping and pull-push-based inpainting. Results of the proposed network pipeline are shown in red.

Fig. 10: Visual comparison of adaptive sampling of isosurfaces. From left to right: the importance map, the sparse set of samples, the inpainted samples, the network output, and the ground truth (normals, and colors using reconstructed normals). From top to bottom: novel views of Ejecta, RM, Skull, Aneurism, Bug, Human, Jet. Networks were trained only on Ejecta. The number of samples is µ = 5% of the pixels in the output image.

The pipeline generalizes well to new data sets and views, and it performs better than the baseline method using gradient magnitude-based importance mapping and pull-push-based inpainting. In particular, the network pipeline produces a tighter spread of the quantitative measures in general, indicating less significant outliers in the reconstructed values. The network shows lower scores only for the depth maps reconstructed from sparse samples of RM and Skull. We attribute this to different zoom levels in the renderings and the training images, yet these inaccuracies do not affect the quality of the reconstructed color images. For reconstruction, we also analyzed the quality of other inpainting algorithms such as PDE-based methods. Notably, these methods are not differentiable and, thus, cannot be used for end-to-end training in combination with the importance network, yet they can be used for sole sparse image reconstruction. A comparison to the pull-push algorithm, however, does not show any perceivable differences. The results further indicate that the network pipeline can reconstruct images at high fidelity from only 5% of the samples that are used to render the data sets at full pixel resolution. In particular, sharp edges are well preserved, since the network has learned to increase the sample density along them.
6 Application to DVR
The proposed network pipeline can be applied to images that are rendered via Direct Volume Rendering (DVR), i.e., volume ray-casting using an emission-absorption model along the rays of sight. In contrast to isosurface ray-casting, not only one single ray-surface intersection point is rendered, but the colors of many sample points along the rays are blended using α-compositing to account for volumetric attenuation.

The importance and reconstruction networks receive RGBα images as input, and the network pipeline outputs the reconstructed high-resolution RGBα images. Interestingly, we observed a noticeable increase in quality when the gradients at the sample points along the view rays are used by the importance and reconstruction networks. The normalized gradients in [−1, 1] along a single ray are treated as emission and blended according to the volume rendering integral, just as blending the RGB colors (a minimal compositing sketch is given at the end of this section). The resulting gradient map is then used as an additional input channel. Since the average gradients indicate, to a certain extent, whether two rays step through vastly different or similar regions, the gradient map serves as an additional coherence indicator. When only a single isosurface is rendered, the resulting values converge against the values in the normal map.

For training and validation, random transfer functions (TFs) are generated and used to render Ejecta, with L1 losses on color and alpha in combination with a LPIPS-based perceptual loss (section C). Since the low-resolution input to the importance network is also generated with a TF, the network can learn to select features specific to that TF, even though this was never seen during training. It is important to note that the reconstruction quality strongly depends on the use of TFs that include a broad range of different colors in the training step. For instance, if the training data only contains desaturated colors, strongly saturated colors during testing cannot be reconstructed.

For novel views of Ejecta and the data sets introduced in subsection 5.4, Figure 11 shows a qualitative analysis of the results of importance sampling and reconstruction using DVR, as well as SSIM and LPIPS statistics. None of these data sets was used in the training and validation phases, and the results have been generated using TFs that were never seen during training. The results indicate that the network pipeline generalizes well to new volumes and TFs, yet the reconstruction quality is affected by the occurring color variations. Especially for Thorax and Aneurism, where the TFs introduce rather small-scale color variations in some areas, the network places the samples rather uniformly in these areas and, thus, cannot accurately reconstruct the rendered structures. Overall, it can be seen that the reconstruction problem is significantly more challenging when using DVR samples instead of isosurface samples. When rendering isosurfaces, the shading in the interior of the rendered structures is rather smooth, enabling the network to focus on the silhouettes and internal edges. In DVR, on the other hand, the network needs to learn both the shape and the color texture stemming from the application of a TF.

Fig. 11: Visual comparison of adaptive sampling for DVR. Each row shows, from left to right, the importance map, the sparse set of samples, the network output, and the ground truth. From top to bottom: novel views of Ejecta, RM, Thorax, Aneurism, Bug, Human, Jet. Networks were trained only on Ejecta. The number of samples is µ = 10% of the pixels in the output image.
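As referenced above, a minimal sketch of how the gradient channel can be composited like an emissive color, assuming PyTorch and front-to-back α-compositing with per-step opacities taken from the transfer function; function and variable names are illustrative.

```python
import torch

def composite_gradients(grad, alpha):
    """Blend normalized gradient values along each ray like an emission,
    following the volume rendering integral. grad, alpha: tensors of shape
    (num_steps, num_rays); alpha holds the per-step opacities from the TF.
    Returns one blended gradient value per ray (the extra input channel)."""
    g = torch.zeros(grad.shape[1])    # accumulated gradient "emission"
    T = torch.ones(grad.shape[1])     # accumulated transmittance
    for i in range(grad.shape[0]):    # front-to-back compositing
        g = g + T * alpha[i] * grad[i]
        T = T * (1.0 - alpha[i])
    return g
```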
7 Performance Analysis
Even though performance improvements are not our main objective, it is interesting to see whether network-based adaptive sampling and image reconstruction can be faster than full-resolution GPU ray-casting, due to the reduced number of samples that need to be taken. The following performance tests were carried out on a workstation running Windows 10 with an Intel Xeon E5-1630 @3.70GHz CPU, 32GB RAM, and an NVidia Titan RTX. All timings are averages over 100 frames with random camera positions, with the screen resolution set to 1024². The ray-caster uses a constant step size of 0.25 voxels and tricubic interpolation.

For some of the data sets shown in Figure 10 and Figure 11, Table 1 lists the times that are required by the pipeline stages and full-resolution volume ray-casting. Only for the larger data sets and DVR can the network pipeline achieve a slightly better performance than the ray-caster. Especially the reconstruction network consumes a significant portion of the overall time, sometimes even more than it requires to render at full resolution. This is because the reconstruction network requires a large amount of data access and arithmetic operations on the GPU, independent of the volume resolution.

On the one hand, the performance of the reconstruction network scales linearly with the number of pixels, and hence quadratically with the screen resolution. Volume rendering, on the other hand, scales quadratically with the screen resolution but also linearly in the volume resolution. The sampling stage, even though it also scales in the volume resolution, performs a significantly smaller number of sampling operations than the full-resolution ray-caster. Thus, its overall contribution is negligible, so that performance benefits can be expected with increasing image and volume size. This is demonstrated in Table 2, where versions of RM and Ejecta at 2048³ are rendered at different resolution levels and large image sizes. Note that in these experiments an NVidia Titan RTX graphics card with 24GB of memory was used to keep all data in memory.

TABLE 1: Timings (in milliseconds) of network-based volume rendering (averaged over 100 different views at 1024² target resolution) for data sets shown in Figure 10 and Figure 11. Timings are for rendering the low-resolution input image (128²) and the sparse set of samples (5% and 10% of the target resolution for isosurface rendering and DVR, respectively), generating the importance map and sampling pattern, reconstructing the image, and GPU ray-casting at the target resolution. The columns list the test case and the rendering, importance, reconstruction, and ground-truth (GT) timings; the rows cover isosurface and DVR rendering, e.g., of RM at 1024³.

TABLE 2: Performance scaling w.r.t. image and volume size, for isosurface and DVR rendering of Ejecta and RM at volume resolutions from 256³ up to 2048³ and screen resolutions from 256² to 2048². Each entry shows the total time of the network pipeline (low-resolution rendering, importance network, sparse sampling, reconstruction network) and the time required by the volume ray-caster at full resolution. All timings are in milliseconds. The cells are colored with a diverging color map, encoding the performance differences from red (superior performance of ray-casting) to blue (superior performance of the network pipeline).
It can be seen that for large image sizes—where the GPU is fully utilized by the network—and volume sizes larger than 1024³, the network pipeline outperforms the GPU ray-caster. Even though a ray-caster using advanced acceleration schemes can achieve improved performance, we are confident that in these scenarios faster deep-learning hardware and performance-optimized network architectures will let the performance differences grow due to better scalability of the network pipeline.

8 Conclusion and Future Work
In this paper, we have introduced and analyzed a network pipeline that learns adaptive screen space sampling and reconstruction for 3D visualization, with a focus on volume rendering applications. For the first time, to our best knowledge, a fully differentiable adaptive sampling pipeline comprised of an importance network, a sampling stage, and a reconstruction network is proposed. Our experiments have shown that the pipeline learns to determine the locations that are important for an accurate image reconstruction, and achieves high reconstruction quality from a sparse set of samples.

We are particularly intrigued by the quality of the results compared to sampling methods that explicitly consider certain feature descriptors. Even without such supervision, the network pipeline can improve on the reconstruction quality, using solely image-based quality losses. We believe that especially for data visualization there is value in the observation that artificial neural networks can learn the relevance of structures for generating visual representations. For sole rendering tasks, on the other hand, superior performance compared to classical volume ray-casting can only be achieved for large image and volume sizes.

The application to DVR opens the interesting question whether the proposed network pipeline can be used beyond adaptive sampling in screen space, and learn where to sample in object space so that the relevant information is conveyed visually. Conceptually, this requires end-to-end learning of a mapping from a low-resolution object space representation to a high-resolution visual representation. The ultimate goal is to let the network learn to convert a low-resolution input volume to a compact yet feature-preserving latent-space representation from which a highly accurate view can be inferred.

In particular, we envision a neural volume rendering pipeline, where during training a neural scene representation is built and trained end-to-end with a renderer that learns sampling and color mapping simultaneously. In the future, we will analyze whether a network can learn a suitable color mapping for a given volumetric field. We also see challenging research problems in the area of transfer learning, to infer the most important samples for training, and to generate synthetic volumetric fields to enable training in domains where training data is rare.

References

[1] T. Bashford-Rogers, K. Debattista, and A. Chalmers. Importance driven environment map sampling.
IEEE Transactions on Visualization and Computer Graphics, 20, Nov. 2013.
[2] S. Belyaev, P. Smirnov, V. Shubnikov, and N. Smirnova. Adaptive algorithm for accelerating direct isosurface rendering on GPU. Journal of Electronic Science and Technology, 16:222–231, Jan. 2018.
[3] M. Berger, J. Li, and J. A. Levine. A generative model for volume rendering. IEEE Transactions on Visualization and Computer Graphics, 25(4):1636–1650, April 2019.
[4] M. Bertalmio, A. L. Bertozzi, and G. Sapiro. Navier-Stokes, fluid dynamics, and image and video inpainting. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), vol. 1, pp. I–I, Dec 2001.
[5] M. R. Bolin and G. W. Meyer. A perceptually based adaptive sampling algorithm. In Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '98, pp. 299–309. Association for Computing Machinery, New York, NY, USA, 1998.
[6] L. Campagnolo, W. Celes, and L. Figueiredo. Accurate volume rendering based on adaptive numerical integration. pp. 17–24, Aug. 2015.
[7] M. Chu, Y. Xie, L. Leal-Taixé, and N. Thuerey. Temporally coherent GANs for video super-resolution (TecoGAN). arXiv:1811.09393, 2018.
[8] A. Criminisi, P. Perez, and K. Toyama. Region filling and object removal by exemplar-based image inpainting. IEEE
Fig. 10: Visual comparison of adaptive sampling of isosurfaces. From left to right: the importance map, the sparse set of samples, the inpainted samples, the network output, and the ground truth (normals and colors using reconstructed normals). From top to bottom: novel views of Ejecta, RM, Skull, Aneurism, Bug, Human, Jet. Networks were trained only on Ejecta. The number of samples is µ = 5% of the pixels in the output image.

Fig. 11: Visual comparison of adaptive sampling for DVR. Each row shows, from left to right, the importance map, the sparse set of samples, the network output, and the ground truth. From top to bottom: novel views of Ejecta, RM, Thorax, Aneurism, Bug, Human, Jet. Networks were trained only on Ejecta. The number of samples is µ = 10% of the pixels in the output image.
APPENDIX A
COMPARISON WITH THE U-NET FOR RECONSTRUCTION
For reconstruction, we also tested different variants of the U-Net architecture [7], varying the number of levels and the number of channels at each level. As one can see in Figure 13, in our application the EnhanceNet vastly outperforms all considered U-Net variants.

Fig. 12: Reconstruction network based on the U-Net architecture (conv + ReLU blocks, 2x2 max pooling, bilinear upsampling). See Figure 4b for the EnhanceNet architecture we use in our experiments.

Fig. 13: Comparison of the U-Net and EnhanceNet for sparse image reconstruction: U-Net 4-4 (a), U-Net 5-3 (b), U-Net 5-4 (c), EnhanceNet (d). Here, a-b indicates $a$ levels with $2^{b+i}$ channels in level $i$ (zero-based). The importance network is trained together with the reconstruction network.
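To make the a-b naming concrete, the following is a minimal PyTorch sketch of such a parameterized U-Net variant. The class name and the exact composition of the blocks are illustrative assumptions; only the level/channel scheme and the pooling and upsampling operators follow Figure 12.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class UNetVariant(nn.Module):
        """U-Net 'a-b': a levels, with 2^(b+i) channels on level i (zero-based)."""
        def __init__(self, in_ch, out_ch, a=5, b=3):
            super().__init__()
            chans = [2 ** (b + i) for i in range(a)]
            self.down = nn.ModuleList()
            prev = in_ch
            for c in chans:  # encoder: conv + ReLU on each level
                self.down.append(nn.Sequential(
                    nn.Conv2d(prev, c, 3, padding=1), nn.ReLU(inplace=True)))
                prev = c
            self.up = nn.ModuleList()
            for i in range(a - 1, 0, -1):  # decoder: upsample, concat skip, conv
                self.up.append(nn.Sequential(
                    nn.Conv2d(chans[i] + chans[i - 1], chans[i - 1], 3, padding=1),
                    nn.ReLU(inplace=True)))
            self.final = nn.Conv2d(chans[0], out_ch, 1)

        def forward(self, x):
            skips = []
            for i, blk in enumerate(self.down):
                x = blk(x)
                if i < len(self.down) - 1:
                    skips.append(x)
                    x = F.max_pool2d(x, 2)  # max pool 2x2
            for blk in self.up:
                x = F.interpolate(x, scale_factor=2, mode='bilinear',
                                  align_corners=False)  # bilinear upsampling
                x = blk(torch.cat([x, skips.pop()], dim=1))
            return self.final(x)

For example, UNetVariant(in_ch, out_ch, a=5, b=4) corresponds to the "5-4" variant in Figure 13.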
APPENDIX B
COMPARISON OF DIFFERENT SAMPLING PATTERNS

For deterministic and parallelizable sampling on the GPU, we use a pre-computed sampling pattern in combination with rejection sampling (subsection 3.2). The sampling pattern $P \in [0,1]^{H \times W}$ contains a permutation of the numbers $\frac{1}{HW}\{0, \ldots, HW-1\}$, i.e., thresholds uniformly distributed in $[0,1]$. Here we analyze the four different strategies employed for generating the permutations (Figure 14, top): random sampling, regular sampling, Halton sampling [2], and plastic sampling [6].

Random sampling generates a random permutation of the numbers in $P$. Regular sampling arranges the pixels in a quad-tree and enumerates them using breadth-first traversal to generate the sampling pattern. Random and regular sampling introduce, respectively, largely varying sample densities and a strong bias of the sample distribution towards the top of the image. Both Halton and plastic sampling are deterministic and produce quasi-random sequences with a fairly uniform distribution. As revealed by the quantitative analysis in Figure 15, even though all sampling strategies allow reconstructing the final image at high accuracy, slight differences are noticeable. Halton and plastic sampling lead to superior quality, in particular w.r.t. the variance of the quality metrics. Plastic sampling, designed as a low-discrepancy sampling sequence, shows the lowest variance and slightly higher scores than Halton sampling. We therefore use plastic sampling in our implementation (see the sketch below).

Fig. 14: Comparison of random sampling, regular sampling, Halton sampling, and plastic sampling (left to right). Top: the sampling sequences. Bottom: the sequences applied to render a sphere with a constant importance of µ = 5% of samples.
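To make the plastic-pattern construction concrete, the following is a minimal NumPy sketch that pre-computes such a pattern with the 2D R2 ("plastic") low-discrepancy sequence of Roberts [6] and rejects pixels that were already covered. The traversal constants are the standard R2 values; the exact ordering in our implementation may differ in details.

    import numpy as np

    def plastic_pattern(H, W):
        g = 1.32471795724474602596      # plastic constant (real root of x^3 = x + 1)
        a1, a2 = 1.0 / g, 1.0 / (g * g)
        P = np.full((H, W), -1.0)
        thresholds = (np.arange(H * W) + 0.5) / (H * W)  # uniform in (0, 1)
        k, n = 0, 0
        while k < H * W:
            n += 1
            i = int(((0.5 + n * a1) % 1.0) * H)
            j = int(((0.5 + n * a2) % 1.0) * W)
            if P[i, j] < 0.0:           # rejection: skip pixels already hit
                P[i, j] = thresholds[k]
                k += 1
        return P

A pixel $(i, j)$ is then (softly) selected whenever its normalized importance exceeds the threshold, $I'_{ij} > P_{ij}$ (cf. Equation 17c in Appendix E).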
APPENDIX C
APPLICATION TO DVR
In this section, we provide additional details on how the proposed adaptive sampling pipeline is applied to DVR images, as mentioned in section 6. First, we present the changes to the pipeline in terms of input and output channels and the used loss function. Second, we describe how to generate the training data, including the sampling of transfer functions.

Input Channels and Loss Function: First, the input channels to the network pipeline are reinterpreted. For isosurfaces, a mask, normals, and depth were passed to the network as input (5 channels per pixel); now, color images from the DVR, together with alpha, depth, and normal maps, are used as input (8 channels per pixel). The network also only reconstructs color images in RGBα space. For DVR, depth and normal maps are computed by treating the screen-space depth and normal at each sample in object space like a regular color and blending them with the opacity given by the transfer function (TF). The result is a single depth and normal value per ray, which can be interpreted as a weighted average of the depth and normal of all samples along the ray. We found that adding depth and normals as input channels improves the quality of the reconstruction, as they provide additional locally consistent information about the curvature of the object.

Second, the per-channel loss functions used for isosurfaces are replaced by losses only on the RGBα color. We apply L1 losses on the color and alpha and an additional LPIPS metric [10] as a perceptual loss, weighted equally:

$\mathcal{L}_{\text{dvr}} = \mathcal{L}_{1,\text{rgb}\alpha} + \mathcal{L}_{\text{LPIPS},\text{rgb}}.$  (13)

We found that adding a perceptual loss is critical for reconstructing fine details and sharp silhouettes. The network operates in RGB space; other color spaces like HSV, XYZ, or CIELAB did not improve the result. Furthermore, the training data is augmented by randomly shuffling the RGB channels. This helps the network not to overfit to a specific color.
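A minimal PyTorch sketch of this combined loss is given below. The lpips package is one available implementation of the LPIPS metric [10]; the choice of its AlexNet backbone here is an assumption for illustration.

    import torch
    import lpips  # one available implementation of the LPIPS metric [10]

    lpips_fn = lpips.LPIPS(net='alex')  # backbone choice is illustrative

    def dvr_loss(pred, target):
        # pred, target: (B, 4, H, W) RGB-alpha images in [0, 1]
        l1 = torch.mean(torch.abs(pred - target))          # L1 on rgb + alpha
        # LPIPS expects 3-channel inputs scaled to [-1, 1]
        perc = lpips_fn(pred[:, :3] * 2 - 1,
                        target[:, :3] * 2 - 1).mean()      # perceptual term on rgb
        return l1 + perc                                   # equal weights, Eq. 13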
Data Set Generation: For training and validation, random transfer functions (TFs) are generated (see below), and Ejecta was used as data set. The test images in the result section use user-generated TFs. Note that since the low-resolution input for the importance network is also generated with a TF, the network can learn to select features specific to that TF, even though it was never seen during training.

To generate meaningful TFs, first a density histogram is computed, and then a Gaussian Mixture Model (GMM) is used to cluster densities in an unsupervised manner (see the sketch at the end of this appendix). GMMs have previously been used to cluster two-dimensional feature points [9], e.g., density and gradient magnitude. Our approach follows the same idea to cluster one-dimensional feature points, i.e., density values. The GMM represents each cluster as a 1D Gaussian function with a certain mean, i.e., the cluster center, and standard deviation, i.e., the cluster spread. To determine the number of components of the GMM, several GMMs with different numbers of components are built, and the one with the lowest Bayesian information criterion (BIC) [8] value is selected. BIC penalizes the number of components and prevents overfitting with many components.

After computing the GMM, the number of peaks of the TF is sampled uniformly between 3 and 5. The density represented by each peak is sampled from the computed GMM. Next, a width in density space is sampled uniformly from [., .] and the opacity at that peak from [., .]. As colormaps, predefined colormaps from SciVisColor (https://sciviscolor.org/home/colormaps) are randomly sampled. The generation process is visualized in Figure 16.

Fig. 16: First row: histogram of the density values of the Ejecta data set and matched GMM. Second row: three sampled transfer functions with opacity and color. Third row: renderings from the training data set with those transfer functions.

We note that it is important for the quality of the reconstruction that the color transfer functions in the training data include a broad range of colors. For example, if the training data only contains desaturated colors, strongly saturated colors during testing cannot be reconstructed.
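As a sketch of the TF-peak sampling described above, the following uses scikit-learn's GaussianMixture together with BIC-based model selection [8]. WIDTH_RANGE and OPACITY_RANGE are illustrative placeholders for the interval bounds given in the text, which did not survive in this copy.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    WIDTH_RANGE = (0.02, 0.1)     # placeholder bounds, for illustration only
    OPACITY_RANGE = (0.1, 0.9)    # placeholder bounds, for illustration only

    def sample_tf_peaks(densities, max_components=10, rng=np.random):
        x = densities.reshape(-1, 1)
        # fit GMMs with 1..max_components components, keep the lowest BIC
        gmms = [GaussianMixture(n).fit(x) for n in range(1, max_components + 1)]
        gmm = min(gmms, key=lambda g: g.bic(x))
        num_peaks = rng.randint(3, 6)          # uniform in {3, 4, 5}
        peaks = []
        for _ in range(num_peaks):
            d = gmm.sample(1)[0][0, 0]         # peak density drawn from the GMM
            width = rng.uniform(*WIDTH_RANGE)
            opacity = rng.uniform(*OPACITY_RANGE)
            peaks.append((d, width, opacity))
        return peaks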
APPENDIX D
PULL-PUSH ALGORITHM

As a baseline method to interpolate the sparse samples, we apply a variation of the pull-push algorithm [1, 3]; see Alg. 1 for the pseudocode. The algorithm builds upon the idea of mip-map levels: first, the image is downscaled using bilinear interpolation with weights based on the mask. Then, the image is upscaled again, and the interpolated values are blended with the values at the finer levels according to the mask values at those levels. We refer to subsection 3.3 for more details in the context of the adaptive sampling pipeline. The pull-push algorithm can be directly extended to fractional masks, as shown in Alg. 1: during the upsampling stage, the mask is not treated as binary, i.e., either take the original pixel at the fine level or use the interpolated value from the coarse level, but as fractional, with a linear interpolation between the original value and the interpolated value. Furthermore, the algorithm consists only of linear pooling and interpolation layers, which are easy to differentiate with respect to the input mask. We refer to subsection E.3 for an outline of how to derive the backward pass.
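For illustration, a compact, fully differentiable PyTorch sketch of this fractional pull-push interpolation is given below. It approximates the weighted bilinear upsampling of Alg. 1 by bilinearly interpolating mask-weighted values; it is a simplification, not the optimized implementation from the source code.

    import torch
    import torch.nn.functional as F

    def pull_push(mask, data, eps=1e-8):
        # mask: (B,1,H,W) fractional in [0,1]; data: (B,C,H,W); H, W powers of two
        if mask.shape[-2] <= 1 and mask.shape[-1] <= 1:
            return mask, data                     # end of recursion
        # pull: mask-weighted 2x2 average downsampling
        w = F.avg_pool2d(mask, 2)
        data_low = F.avg_pool2d(mask * data, 2) / (w + eps)
        mask_low = F.max_pool2d(mask, 2)
        mask_low, data_low = pull_push(mask_low, data_low, eps)
        # push: upsample the coarse level and blend with the fine level
        up_m = F.interpolate(mask_low, scale_factor=2,
                             mode='bilinear', align_corners=False)
        up_d = F.interpolate(mask_low * data_low, scale_factor=2,
                             mode='bilinear', align_corners=False) / (up_m + eps)
        mask_out = mask + (1 - mask) * up_m       # fractional blend of the mask
        data_out = mask * data + (1 - mask) * up_d
        return mask_out, data_out

Because only pooling, interpolation, and elementwise arithmetic are used, gradients with respect to both mask and data follow directly, mirroring the manually derived adjoint code discussed in subsection E.3.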
APPENDIX E
DIFFERENTIATION OF THE SAMPLING AND RECONSTRUCTION STAGES

The adjoint code for the gradient propagation in the backward pass is automatically generated by PyTorch for the networks, the loss functions, and the sampling function (Equation 5). For the pull-push algorithm (Alg. 1), the adjoint code was manually derived and implemented as a custom operation. In this section, we provide the fundamentals of the adjoint method needed to manually derive such adjoint code, and show how it is applied to the sampling function and the pull-push algorithm.
E.1 Fundamentals of the Adjoint Method
The adjoint method has a long history in optimal control theory; we refer the interested reader to the book by Lions [4] for a complete mathematical introduction. Here, we briefly sketch the fundamentals following the notation of McNamara et al. [5]. By ignoring applications to linear systems and differential equations and focusing on chained functions instead, the adjoint method simplifies to an application of the chain rule. Let the algorithm be defined as a concatenation of functions $f_i$ with parameters $w_i$, starting from an input value $x_0$:

$x_1 = f_1(x_0, w_1),\ x_2 = f_2(x_1, w_2),\ \ldots,\ x_n = f_n(x_{n-1}, w_n),\ s = J(x_n, w_J).$  (14)

The result $s$ has to be a scalar value; this is crucial for the application of the adjoint method in this simple form. In the context of neural networks, $x_0$ would be the input image, $f_1$ to $f_n$ the network layers with weights $w_i$ and feature vectors $x_i$, $J$ the loss function with target image $w_J$, and $s$ the scalar score.

During training, we are interested in the derivatives $\partial J / \partial w_i$ to update the weights, or, e.g., in $\partial J / \partial x_0$ to update the initial image in a feature-visualization context. First, given the (possibly vector-valued) variables $x_i$ and $w_i$, let the adjoint variables $\hat{x}_i$ and $\hat{w}_i$ be defined as the gradients of $s$ with respect to $x_i$ and $w_i$, i.e., $\hat{x}_i := \nabla_{x_i} s$ and $\hat{w}_i := \nabla_{w_i} s$, as column vectors. Next, we drop the index $i$, as we require it to index the elements of the input and output vectors, and look at a single function $f: \mathbb{R}^N \times \mathbb{R}^W \to \mathbb{R}^M$ with inputs $x \in \mathbb{R}^N$, $w \in \mathbb{R}^W$ and output $y \in \mathbb{R}^M$. The adjoint variables are then computed using

$\hat{x} = J^T_{f,x}(x, w)\, \hat{y}, \quad \hat{w} = J^T_{f,w}(x, w)\, \hat{y}.$  (15)

Here, the Jacobian matrices with respect to the different inputs are used, defined as

$(J_{f,x})_{ij} := \partial f_i / \partial x_j, \quad (J_{f,w})_{ij} := \partial f_i / \partial w_j.$  (16)

As one can see, given the adjoint variable of the output $\hat{y}$, the adjoint method propagates these gradients back through the derivatives of $f$ to the adjoint variables of the inputs $\hat{x}$ and $\hat{w}$. In the context of the chained function (Equation 14), this implies that, starting with the gradients $\hat{x}_n$ from the loss function, gradients are first propagated to $\hat{x}_{n-1}$, $\hat{w}_n$ via $J_{f_n}$, then to $\hat{x}_{n-2}$, $\hat{w}_{n-1}$, and so on until $\hat{x}_0$, $\hat{w}_1$ is reached. To provide custom differentiable operations, two functions have to be implemented: first, the forward code $y \leftarrow f(x, w)$ with input $x$ and parameter $w$; and second, the backward code to compute $\hat{x}$ and $\hat{w}$ from $\hat{y}$, possibly reusing $x$ and $w$ from the forward pass to compute the Jacobian.
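In PyTorch, these two pieces of code map directly onto the forward and backward methods of a torch.autograd.Function. A minimal sketch for the illustrative function f(x, w) = w·x with a scalar parameter w (the function is chosen purely as an example):

    import torch

    class ScaleOp(torch.autograd.Function):
        # illustrative example: f(x, w) = w * x with hand-written adjoint code

        @staticmethod
        def forward(ctx, x, w):
            ctx.save_for_backward(x, w)   # keep inputs to evaluate the Jacobians
            return w * x

        @staticmethod
        def backward(ctx, y_hat):         # y_hat corresponds to ŷ in Equation 15
            x, w = ctx.saved_tensors
            x_hat = w * y_hat             # J^T_{f,x} ŷ (diagonal Jacobian)
            w_hat = (x * y_hat).sum()     # J^T_{f,w} ŷ for a scalar parameter
            return x_hat, w_hat

    # usage: y = ScaleOp.apply(x, w)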
Algorithm 1: Pseudocode of the pull-push algorithm for power-of-two input images (a version handling non-power-of-two inputs and the adjoint code for computing the derivative with respect to the mask and data are provided in the source code).

    function INPAINTING(maskIn : H×W, dataIn : H×W×C)
        if H ≤ 1 and W ≤ 1 then
            return maskIn, dataIn                        ▷ end of recursion
        end if
        ▷ weighted area downsampling:
        maskLow, dataLow = zeros of shape H/2 × W/2 and H/2 × W/2 × C
        for i, j ∈ {0, ..., H/2 − 1} × {0, ..., W/2 − 1} do
            Nmax, Navg, d = 0, 0, 0_C
            for a, b ∈ {2i, 2i+1} × {2j, 2j+1} do        ▷ loop over neighbors in the fine grid
                Nmax = max{Nmax, maskIn[a, b]}
                Navg += maskIn[a, b]
                d += maskIn[a, b] · dataIn[a, b, :]
            end for
            if Navg > 0 then
                maskLow[i, j] = Nmax
                dataLow[i, j, :] = d / Navg
            end if
        end for
        ▷ recursion:
        maskLow, dataLow = INPAINTING(maskLow, dataLow)
        ▷ weighted bilinear upsampling:
        maskOut, dataOut = zeros of shape H × W and H × W × C
        for a, b ∈ {0, ..., H − 1} × {0, ..., W − 1} do
            N, Wsum = 0, 0; d = 0_C
            â = a ÷ 2, b̂ = b ÷ 2                        ▷ integer division (round down), coarse-grid indices
            a' = −1 if a is even else +1; b' = −1 if b is even else +1
            nbrs = {(â, b̂, 9), (â + a', b̂, 3), (â, b̂ + b', 3), (â + a', b̂ + b', 1)}  ▷ bilinear weights
            for (i, j, w) ∈ nbrs ∩ image do              ▷ loop over neighbors within bounds
                N += w · maskLow[i, j]
                d += w · maskLow[i, j] · dataLow[i, j, :]
                Wsum += w
            end for
            maskOut[a, b] = maskIn[a, b]
            dataOut[a, b, :] = maskIn[a, b] · dataIn[a, b, :]
            if N > 0 then                                ▷ blend interpolated values with original data
                maskOut[a, b] += (1 − maskIn[a, b]) · N / Wsum
                dataOut[a, b, :] += (1 − maskIn[a, b]) · d / N
            end if
        end for
        return maskOut, dataOut
    end function

E.2 Backward Pass of the Sampling Function
Using the theory above, we now present the adjoint code for the differentiable sampling from subsection 3.2. This serves to highlight what is differentiated and how the gradients are propagated. Note that these functions are implemented using PyTorch operations, so PyTorch can compute the derivatives automatically.

The differentiable sampling stage takes the importance map $I$ as input and produces the image of sparse samples $S$. In the framework of Equation 14, this can be seen as a block of functions cut out of the middle of the chain. As parameters, the target mean $\mu$ and lower bound $l$, the sampling steepness $\alpha$, the sample pattern $P$, and the target image $T$ are used. Note that no optimization with respect to these parameters is performed; their respective adjoint variables are unused. To recapitulate, the sampling is performed using the following steps:

$\mu_I = \frac{1}{WH} \sum_{ij} I_{ij}, \quad I^{(0)} = I$  (17a)
$I'_{ij} = \min\left\{1,\ l + I^{(0)}_{ij}\, \frac{\mu - l}{\mu_I + \varepsilon}\right\}$  (17b)
$S_{ij} = \operatorname{sig}\!\left(\alpha (I'_{ij} - P_{ij})\right) T_{ij}, \quad \text{with } \operatorname{sig}(x) = \frac{1}{1 + e^{-x}}.$  (17c)

Note that the second and third functions act on each pixel $ij$ of the images independently. Therefore, we write them as per-element functions to simplify the notation of the derivatives; using the matrix notation of Equation 15, this would imply a diagonal Jacobian. Furthermore, $T_{ij}$ and $S_{ij}$ denote the vector of channels at the specified location. In order to stay within the presented framework of the adjoint method, if variables are used by one function and later again by another function, these variables are passed through as additional outputs ($I^{(0)} = I$).

For the backward pass, we are given the gradients of the output $\hat{S}$ from the backward pass of the reconstruction; this corresponds to $\hat{x}_n$ in Equation 14. The gradients are then propagated through the sampling algorithm in reverse order:

$\hat{T}_{ij} = \operatorname{sig}\!\left(\alpha (I'_{ij} - P_{ij})\right) \hat{S}_{ij}$
$\hat{P}_{ij} = -\left(\alpha\, T_{ij}\, \operatorname{sig}'\!\left(\alpha (I'_{ij} - P_{ij})\right)\right)^T \hat{S}_{ij}$
$\hat{I}'_{ij} = \left(\alpha\, T_{ij}\, \operatorname{sig}'\!\left(\alpha (I'_{ij} - P_{ij})\right)\right)^T \hat{S}_{ij}, \quad \text{with } \operatorname{sig}'(x) = \frac{d}{dx}\operatorname{sig}(x) = \operatorname{sig}(x)\operatorname{sig}(-x)$  (18a)

$\hat{I}^{(0)}_{ij} = \begin{cases} \frac{\mu - l}{\mu_I + \varepsilon}\, \hat{I}'_{ij}, & I'_{ij} < 1 \\ 0, & I'_{ij} \geq 1 \end{cases} \qquad \hat{\mu}_I = \sum_{ij} \begin{cases} \frac{I^{(0)}_{ij} (l - \mu)}{(\mu_I + \varepsilon)^2}\, \hat{I}'_{ij}, & I'_{ij} < 1 \\ 0, & I'_{ij} \geq 1 \end{cases}$ (derivatives for $\mu$, $l$, $\varepsilon$ are omitted)  (18b)

$\hat{I}_{ij} = \hat{I}^{(0)}_{ij} + \frac{1}{WH}\, \hat{\mu}_I$  (18c)
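Because Equations 17a–17c consist only of elementary differentiable operations, the forward pass can be written in a few lines of PyTorch, and autograd then produces exactly the adjoint code of Equation 18. The parameter values below are illustrative:

    import torch

    def differentiable_sampling(I, P, T, mu=0.05, l=0.01, alpha=50.0, eps=1e-7):
        # I: (H, W) importance map; P: (H, W) sampling pattern;
        # T: (C, H, W) target image; mu, l, alpha are illustrative values
        mu_I = I.mean()                                                  # Eq. 17a
        I_prime = torch.clamp(l + I * (mu - l) / (mu_I + eps), max=1.0)  # Eq. 17b
        S = torch.sigmoid(alpha * (I_prime - P)) * T                     # Eq. 17c
        return S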
E.3 Backward Pass of the Pull-Push Algorithm

As one can see in the previous section, deriving the adjoint code is done mechanically, by differentiating each line of code with respect to its inputs. This, however, produces vastly longer code; we therefore only outline the steps to derive the adjoint code of the pull-push algorithm (Alg. 1). The full source code is available in the online repository.

The algorithm is recursive, with three stages: the downsampling to the coarse level, the recursive call, and the upsampling and interpolation with the fine level. During the backward pass, this order is reversed: first comes the adjoint of the upsampling and interpolation at the finest level, then the adjoint of the recursive call, which itself consists of the adjoints of upsampling, recursion, and downsampling, and lastly the adjoint of the downsampling.

REFERENCES
[1] S. J. Gortler, R. Grzeszczuk, R. Szeliski, and M. F. Cohen. The lumigraph. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, pp. 43–54, 1996.
[2] J. H. Halton. Algorithm 247: Radical-inverse quasi-random point sequence. Commun. ACM, 7(12):701–702, Dec. 1964.
[3] M. Kraus. The pull-push algorithm revisited. Proceedings GRAPP, 2:3, 2009.
[4] J. L. Lions. Optimal Control of Systems Governed by Partial Differential Equations. Springer, 1971.
[5] A. McNamara, A. Treuille, Z. Popović, and J. Stam. Fluid control using the adjoint method. ACM Transactions on Graphics (TOG), 23(3):449–456, 2004.
[6] M. Roberts. The unreasonable effectiveness of quasirandom sequences. http://extremelearning.com.au/unreasonable-effectiveness-of-quasirandom-sequences/, 2020. Accessed: 2020-02-14.
[7] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi, eds., Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pp. 234–241. Springer International Publishing, Cham, 2015.
[8] G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.
[9] Y. Wang, W. Chen, J. Zhang, T. Dong, G.-Y. Shan, and X. Chi. Efficient volume exploration using the Gaussian mixture model. IEEE Transactions on Visualization and Computer Graphics, 17:1560–1573, 2011.
[10] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.