Automatic Generation of Interpolants for Lattice Samplings: Part II -- Implementation and Code Generation
JOSHUA HORACSEK,
University of Calgary
USMAN ALIM,
University of Calgary
In the prequel to this paper, we presented a systematic framework for processing spline spaces. In this paper, we take the results of that framework and provide a code generation pipeline that automatically generates efficient implementations of spline spaces. We decompose the final algorithm from Part I and translate the resulting components into LLVM-IR (a low level language that can be compiled to various targets/architectures). Our design provides a handful of parameters for a practitioner to tune — this is one of the avenues that provides us with the flexibility to target many different computational architectures and tune performance on those architectures. We also provide an evaluation of the effect of the different parameters on performance.

CCS Concepts: •
Mathematics of computing → Mathematical software; Mathematical software performance; Continuous mathematics; • Hardware → Signal processing systems.

Additional Key Words and Phrases: Interpolation, signal processing, code generation
ACM Reference Format:
Joshua Horacsek and Usman Alim. 2021. Automatic Generation of Interpolants for Lattice Samplings: Part II — Implementation and Code Generation. 1, 1 (February 2021), 22 pages. https://doi.org/10.1145/1122445.1122456
In Part I, we introduced a framework of analysis that allows one to process a piece-wise polynomial interpolant (defined over a convex simplicial complex) into a list of related “sub-regions” with desirable properties for implementation within a machine [Horacsek and Alim 2021]. The implementation details, however, were left up to interpretation; this work elaborates upon the generation and evaluation of code from the form given in the prequel. In reality, there are parameters to tune depending on the details of the machine that one wishes to target. For example, code with branching patterns is potentially detrimental to performance on GPU architectures, whereas CPU architectures have been engineered so as to minimize much of the overhead of branching. The main contribution of this paper is a framework for generating code from the abstract description provided in the prequel, together with an exploration of the parameter space. While the focus in the prequel was on generality (i.e. interesting test cases), the discussion in this paper is more tuned towards performance.

The analysis in the prequel provides one with a means to explore many different combinations of basis functions and lattices — translating such a description into a concise and performant implementation is non-trivial. A naïve
Authors’ addresses: Joshua Horacsek, [email protected], University of Calgary, 2500 University Dr. NW, Calgary, Alberta, T2N 1N4; Usman Alim, [email protected], University of Calgary, 2500 University Dr. NW, Calgary, Alberta, T2N 1N4.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
© 2021 Association for Computing Machinery.
XXXX-XXXX/2021/2-ART $15.00
https://doi.org/10.1145/1122445.1122456

approach is to simply take an interpolant φ and evaluate it within the convolution sum:

    ∑_{n ∈ LZ^s} c_n φ(x − n).    (1)

While this is straightforward, it is plagued by performance issues. For a spline with n points in its support, this leads to n evaluations of the interpolant φ for a given reconstruction. However, there is often a good deal of symmetry in higher dimensional splines, thus most of these evaluations contain some amount of repeated work (under symmetry). Moreover, many interesting splines used in scientific visualization are non tensor product splines and thus have many separate cases one needs to consider for different areas of the spline — the area a point resides within must be determined at each of the n evaluations of the interpolant.
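To make the cost concrete, the naïve evaluation of Eq. (1) can be sketched as follows — a minimal Python sketch, assuming a 2D Cartesian lattice, a hypothetical coefficient grid `c`, and a tensor-product linear B-spline standing in for φ (the splines of interest are non-separable, but the structure of the sum is the same):

```python
import math

def phi(x, y):
    """Tensor-product linear B-spline (hat function); support is [-1, 1]^2."""
    hat = lambda t: max(0.0, 1.0 - abs(t))
    return hat(x) * hat(y)

def reconstruct_naive(c, x, y):
    """Directly evaluate the convolution sum of Eq. (1) at (x, y)."""
    f = 0.0
    # Only the lattice sites inside the support of phi can contribute.
    for i in range(math.floor(x), math.floor(x) + 2):
        for j in range(math.floor(y), math.floor(y) + 2):
            if 0 <= i < len(c) and 0 <= j < len(c[0]):
                f += c[i][j] * phi(x - i, y - j)
    return f
```

Even in this tiny separable case, `phi` is re-evaluated at every contributing site; for a non-separable spline, each of those calls would additionally have to branch on which piece of the spline the offset falls into.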
It is more efficient to unroll the convolution sum so as to reduce the amount of repeated work and reduce branching behaviour.

To accomplish this, while still balancing generality and performance, we propose a code generation framework that takes the abstract representation from Part I, and generates a low level representation that is both close to machine code and relatively platform independent. The main tool we use to accomplish this is the Low Level Virtual Machine (LLVM), which is a framework for code analysis and generation. LLVM is the foundation of many contemporary compilers — at its core it is a library for expressing low level code; code that is independent of a target architecture. This allows one to divorce optimisation from high level language design, and allows language designers to focus simply on generating good low level code. Optimization passes are performed solely on LLVM’s low level instruction representation (LLVM-IR).

As in Part I, our goal is to make the benefits of non-Cartesian computing more tangible. This work finalizes the ideas of Part I. To summarize, the contributions of this work are as follows:
• We provide a unified framework that generates code for fast interpolants from non-separable interpolants on common lattices in scientific computing and visualization. Our framework is generalizable to s dimensions.
• We detail the parameters of said code generation and elaborate upon how they affect performance for certain types of compute architectures. We also provide a reference implementation, available on GitHub [Horacsek 2021], which contains the code to perform the analysis detailed in Part I, the code generation framework presented in this work, as well as example code that demonstrates the use of the framework and builds upon the examples in this work.
• We explore the effect of parameter choice on generated code for different compute architectures. Additionally, we explore the performance of the Voronoi splines in 3D, which are arguably the splines closest to the tensor product spline on the Cartesian lattice (in both support size and order). Prior to this work, they have not received an efficient implementation and have been seen as impractical interpolants.

The remainder of this paper is organized as follows. In Section 2 we touch upon some related work — this ranges from splines used in 1, 2 and 3 dimensional signal processing, to domain specific compilers. We then describe the necessary background in Section 3, referring to the first part for details on shift invariant spaces; it is assumed that the reader is familiar with the first part of this work. In Section 4, we detail the mechanisms behind our code generation framework, breaking the system down into components or “blocks” and showing how to compose them so as to implement the algorithm in Part I. Finally, in Section 5 we compare and contrast the effect on performance of different parameters for different compute architectures.
The motivating factor for this work is the plethora of other publications investigating non tensor product splines in scientific visualization. However, there are also many works that investigate novel interpolants in lower dimensions. For uni-variate splines, the optimal class of splines is the maximal order minimal support (MOMS) splines [Blu et al. 2001], although there are some notable, yet non-polynomial, splines such as the CINAPACT splines that boast nice properties such as being infinitely differentiable [Akram et al. 2015]. Authors tend to neglect implementation details for uni-variate splines since implementation is typically straightforward — simply fix the “region of evaluation” as a line segment, then distribute the polynomials from the convolution sum; polynomial evaluation is performed (optimally) with Horner’s algorithm [Ceberio and Kreinovich 2004]. In 2D, there has been some work around hexagonal interpolants; however, the provided implementations exist as MATLAB code [Condat and Van De Ville 2006].

Implementing a shift invariant reconstruction space becomes more tedious as the dimension of the interpolant increases. If one wishes to implement tensor product schemes, it is relatively straightforward to extend univariate implementations by simply reducing the problem along each dimension. If one is considering non tensor product splines, and/or non-Cartesian lattices, the problem becomes more difficult. The reason for this is that non tensor product splines are non-separable — one can no longer simply apply univariate results along each axis. Typically, the problem requires some amount of geometric analysis to take advantage of the symmetry of a spline.
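Returning to the uni-variate case mentioned above, the distributed polynomial for a region of evaluation reduces to ordinary Horner evaluation. A minimal sketch (the cubic B-spline piece below is a standard textbook example, not taken from the reference implementation):

```python
def horner(coeffs, x):
    """Horner's rule; coeffs ordered highest degree first: a_n, ..., a_1, a_0."""
    acc = 0.0
    for a in coeffs:
        acc = acc * x + a  # one multiply-add per coefficient
    return acc

# Central piece of the uni-variate cubic B-spline on |t| <= 1:
# beta3(t) = 0.5*|t|^3 - t^2 + 2/3, written in Horner form in |t|.
def beta3_center(t):
    return horner([0.5, -1.0, 0.0, 2.0 / 3.0], abs(t))
```

Horner’s rule needs a single accumulator and one fused multiply-add per coefficient, which is why it maps well to register-constrained targets.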
As such, when a new spline appears in the literature, the introducing publication may include a detailed implementation, but more often it is followed by a later publication describing an efficient implementation [Csebfalvi 2013; Finkbeiner et al. 2010; Kim and Peters 2009]. However, these implementations are, again, often specific to a single language and platform. We aim to fill that gap in this paper, and the main tool we use is LLVM. Specifically, we take an abstract representation of a shift invariant reconstruction space and translate it into a low level representation.

There are many benefits in translating to LLVM-IR. First of all, it allows us to represent a reconstruction space with low level code that is relatively agnostic to the underlying machine architecture. This strong abstraction between high-level and low-level representations better facilitates further processing of the original abstract representation. Languages such as Julia keep their abstract representation wholly distinct from the underlying machine [Bezanson et al. 2017]; one may build additional processing steps on top of this representation, such as auto-differentiation, then pass the resulting representation through the same code generation mechanism used for the original code. The second main benefit we reap is the removal of the overhead that comes from a general purpose implementation of a shift invariant reconstruction space. While there is no other work that specifically attempts to take a shift invariant space and generate code, there are works within numerical computing that use LLVM as a tool to reduce overhead in a similar way, as described below.

TensorFlow’s XLA (Accelerated Linear Algebra) library is a domain specific compiler that translates components of a computational graph into LLVM-IR, which is in turn compiled to a target architecture [Team et al. 2017].
The main benefit XLA provides is that one may fuse together operations in a computational graph so as to keep more operations in registers during computation, thereby reducing overhead by using less memory bandwidth. While the improvement is modest — roughly 1.14× in their benchmarks — it is consistent, especially when one considers that many applications that use TensorFlow are long-running applications. There are also libraries, such as Nimble [Shen et al. 2020], that compile deep learning models down to lower level code to both reduce the overhead of inference, and remove dependencies on large deep learning libraries.

From the perspective of numerical computation, domain specific compilation is a relatively new, but powerful, technique. Combined with an appropriate analysis framework, it has been used to accelerate matrix operations — this is particularly useful when the matrices in question have very specific structure [Fabregat-Traver and Bientinesi 2012]. Another example of a domain specific compiler (and language) for numerical computation is SDSLc, a language for generating fast stencil computation code [Rawat et al. 2015] — curiously, this work does not acknowledge or make use of LLVM.

Our focus in Part I was to present the theoretical foundation needed to transform the defining objects of a shift invariant space (a lattice and a basis function) into a form that can easily be translated into an implementation (from this perspective, one may consider our work a domain specific compiler whose input is a spline reconstruction space).
[Figure 1: on the left, a diagram of the sub-region decomposition; on the right, a table listing each sub-region’s polynomial and transform (the polynomial’s coefficient subscripts were lost in extraction). Sub-region 1 carries the reference polynomial with transform “None”; sub-region 2 is a reference to sub-region 1 under a composition of rotations.]

Fig. 1. The image on the left continues the example from Part I, showing the sub-region decomposition for the ZP-element [Horacsek and Alim 2021]. The square region in the center is the region of evaluation; it is subdivided into the sub-regions of evaluation, labelled 1-4. The coefficients c_i denote the lattice sites that contribute to a reconstruction in sub-region 1. The table on the right details the relationship between sub-regions — the transformation is generally affine, however in this case we do not need to make use of any shift. Sub-regions with no transform are reference sub-regions; in this example there is only one. However, in general a spline may emit more than one reference sub-region (e.g. the cubic spline on the FCC lattice [Kim et al. 2008]).

It is assumed that the reader has a good notion of the content and background presented in that work [Horacsek and Alim 2021]. The translation into LLVM is the focus of this paper. In the prequel, our analyses provided us with a shift invariant structure over a lattice; specifically, we concluded with a shift invariant polyhedral region of evaluation, and a dissection of the region of evaluation into sub-regions of evaluation. One only needs to consider a single space-tiling region of evaluation since, by definition of a shift invariant space, every other region of evaluation is related by a shift. Each sub-region contains a polynomial and a set of lattice sites. The data at these lattice sites are components of the polynomial belonging to each sub-region and the polynomial dictates how to reconstruct a value in a given sub-region.
Figure 1 visualizes all these components.

Very often in practice, the splines we use on a given lattice are symmetric, at least in dimensions 1 to 3 — this helps encourage reconstruction spaces to capture frequency content more evenly in all directions. This is because the underlying shift invariant space partially inherits the symmetry of the basis function — in reality, the symmetry of the underlying space is a combination of the symmetry of the lattice and the basis function. This symmetry is a tool that allows us to write different sub-regions in terms of each other — relating them by rigid body transformations, thereby allowing us to reduce the number of polynomials stored for each sub-region. For most splines used in visualization, we require only one distinct polynomial; this polynomial is then used to reconstruct values within each sub-region (after an appropriate change of variables). As such, we also store this auxiliary information. That is, if there exists a rigid body transformation that takes one region into another, we only store that transformation.

The above representation is generic and independent from the machine on which we will implement said scheme. This is advantageous because it allows for further simple processing of the spline space. One may easily take derivatives of the polynomials within sub-regions to obtain gradient estimates (note that the symmetry analyses still apply even for derivative computation). This can be taken further to compute the Hessian, provided the spline is smooth enough. One may also transform these polynomials into forms that are more convenient for evaluation, i.e. Horner form, or Bernstein–Bézier (BB) form.
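The transform bookkeeping described above can be sketched concretely — a minimal, hypothetical 2D example (the dictionary layout, the polynomial, and the 90° rotation are illustrative only, not the representation used by the reference implementation). Each sub-region either stores its own polynomial, or an affine transform (A, b) mapping it onto a reference sub-region whose polynomial is evaluated instead:

```python
def eval_subregion(subregions, idx, x, y):
    """Evaluate a point in sub-region `idx`, chasing references as needed."""
    entry = subregions[idx]
    while "poly" not in entry:  # follow the reference chain to a base region
        A, b = entry["xform"]   # apply the affine change of variables
        x, y = (A[0][0] * x + A[0][1] * y + b[0],
                A[1][0] * x + A[1][1] * y + b[1])
        entry = subregions[entry["ref"]]
    return entry["poly"](x, y)

# Sub-region 0 is the reference; sub-region 1 is its 90-degree rotation.
subregions = [
    {"poly": lambda x, y: x * x + 2.0 * y},                   # reference
    {"ref": 0, "xform": ([[0, -1], [1, 0]], [0.0, 0.0])},     # rotate 90°
]
```

Only one polynomial is stored; every other sub-region costs just a small transform record, which is exactly the saving the symmetry analysis buys.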
A naïve approach, provided this representation, would be to convert each polynomial into a specific form — BB-form for example — then implement an algorithm to perform that style of polynomial evaluation. For polynomials in BB form, the algorithm would be de Casteljau’s algorithm. However, the algorithm we choose may be inappropriate for a given machine. For example, de Casteljau’s algorithm is recursive, which is difficult to implement on GPU architectures — this can, in part, be alleviated by unrolling the recursion. Another downside of simply using de Casteljau’s algorithm is temporary space usage. Temporary space usage is bound by the number of terms in the BB form (i.e. the degree of the polynomial). On modern CPU architectures, this space is allocated on the stack, which resides in main memory, but is likely aliased to cache for fast access. On the GPU, temporary storage space often means increasing register allocation for a program, which can drastically reduce the amount of parallelism available for the program. Alternatively, higher temporary storage needs may cause registers to “spill” into global memory, requiring global memory accesses which are very slow. Moreover, polynomial evaluation is only one part of the reconstruction scheme; there is also the preamble of determining which region and sub-region of evaluation a point resides in prior to reconstruction. This must be accounted for — there is overhead in looking up the properties of a given reconstruction space and delineating what is to be done in certain cases.

The approach we take is to generate implementations rather than provide a single implementation that takes a lattice and a basis function as parameters. This allows us to reduce some amount of overhead by “baking” information into the generated code.
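To illustrate the temporary-storage point above, here is a minimal uni-variate sketch of de Casteljau’s algorithm, written iteratively so the recursion is already unrolled into rounds. Note that it still carries a scratch array of n+1 ordinates, in contrast to Horner’s single accumulator:

```python
def de_casteljau(b, t):
    """Evaluate a degree-n polynomial in Bernstein-Bezier form at t in [0, 1].
    `b` holds the n+1 Bezier ordinates b_0, ..., b_n."""
    pts = list(b)  # scratch storage: one slot per ordinate
    while len(pts) > 1:
        # One round of pairwise linear interpolation between neighbours.
        pts = [(1.0 - t) * p + t * q for p, q in zip(pts, pts[1:])]
    return pts[0]
```

For a fixed degree the rounds can be fully unrolled by a code generator, but the live-value count (and hence register pressure on a GPU) still grows with the degree.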
For example, we are able to remove the (small) cost associated with looking up member variables, hard-code any logic that may need to be applied for a specific basis function or lattice, and unroll loops as necessary. This is similar to template meta-programming in C++, which, in part, exists to take advantage of such optimizations. Code generation also allows us to replace certain components of our pipeline with other implementations. For example, we may pull out the Horner polynomial evaluation and replace it with de Casteljau’s algorithm. We avoid doing so in this work, since Horner schemes have low temporary storage requirements and are easily unrollable. There are also many options for decomposing and pipelining a Horner scheme, which we explore extensively — the same treatment for de Casteljau’s algorithm is outside the scope of this paper.
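As one example of such a decomposition, Estrin’s scheme pairs adjacent coefficients so that the multiply-adds within each round are independent and can be pipelined or vectorized. This is a standard alternative shown for illustration, not necessarily the variant used by our generator:

```python
def estrin(coeffs, x):
    """Estrin's scheme; coeffs ordered lowest degree first: a_0, ..., a_n.
    Each round halves the coefficient list using independent multiply-adds,
    squaring x between rounds."""
    while len(coeffs) > 1:
        if len(coeffs) % 2:
            coeffs = coeffs + [0.0]       # pad to an even count
        # These multiply-adds have no dependencies on each other.
        coeffs = [a + b * x for a, b in zip(coeffs[0::2], coeffs[1::2])]
        x = x * x                          # next round works in x^2
    return coeffs[0]
```

Compared to Horner’s strictly serial dependency chain, Estrin trades a few extra multiplications (the powers of x) for instruction-level parallelism — exactly the kind of trade-off a super-scalar or SIMD target can exploit.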
One simple approach to code generation is to choose a popular language, such as C, as a target language, then create “templates” for specific parts of the reconstruction function. For example, one component may determine the region of evaluation, whereas another may evaluate a given polynomial. In our experience, this becomes messy quite quickly; the number of cases one must write and test is fairly large. Moreover, the template code becomes unreadable quickly. This approach also lacks elegance; we would start with a low level description of the spline space, then generate high level code, which must be processed by a compiler to produce low level code. A more elegant approach is to directly generate a low level representation. Thus, we emit code to LLVM intermediate representation (LLVM-IR) directly; this has the benefit of being low level but still platform and machine agnostic.

One must, however, take into account some differences in machine architecture when generating code. According to Flynn’s taxonomy [Flynn 1972], there are four main classes of computer architecture. The two of interest to us are single instruction single data (SISD) machines, in which each operation operates on at most one element of data, and single instruction multiple data (SIMD) machines, in which an operation may operate over multiple elements of data. The other two main distinctions — multiple instruction multiple data (MIMD) and multiple instruction single data (MISD) — are not as relevant: from the perspective of a compute thread, instructions are either SISD or SIMD. Figure 2 shows the differences between the two. Most modern CPUs are super-scalar machines; that is, they perform instruction level parallelism — many operations may be in flight at any given time.
[Figure 2: schematic comparing SISD, where a single operation consumes one pair of operands a[i], b[i] and produces c[i], with SIMD, where the same single operation consumes a[i..i+n-1], b[i..i+n-1] and produces c[i..i+n-1].]

Fig. 2. An example of the difference between SISD and SIMD architecture. Here, n is the vector length; as a single operation is applied to the operands a and b, this operation is applied in parallel to all elements of a and b.

For example, a memory fetch may appear early in an instruction stream, but may only be required near the end of the stream; the CPU may commit this to the memory controller and move on to the next instruction, only stalling once the required memory access is actually needed and has not finished yet. This is in contrast to traditional scalar architectures, in which instructions are strictly executed in order and have predictable execution times. From the perspective of a thread, most contemporary CPUs are SIMD machines, whose SIMD instructions are vectorized SISD instructions.

GPUs have historically been fixed function multi-processing units dedicated to triangle rasterization — that is, a single GPU contained a large number of cores dedicated to transforming and projecting triangles onto a planar display. This allowed for a high level of parallelism for the task of rasterization. However, as the graphics pipeline became more programmable through the use of different types of shaders, GPUs eventually became general purpose computing devices. Modern GPU architectures are more akin to general purpose SIMD machines with wide instruction lanes — a single instruction on a GPU will operate on 32 to 64 elements of data, depending on the manufacturer. While this is true at the hardware level, the programming model presented by GPU manufacturers is that each lane of execution corresponds to a lightweight “thread”. From the perspective of each thread, the instruction stream is largely SISD.

Again, historically, practitioners have taken advantage of the parallelism offered by GPUs through OpenGL shaders.
Shaders in OpenGL are small programs that execute on the GPU and operate element-wise over data: pixels, vertices and/or triangles. However, this quickly became insufficient for general purpose computing on the GPU (GPGPU). This was for various reasons, but one pressing reason was the lack of fine grained control over hardware resources. NVIDIA first offered a solution to this, CUDA, which provided better control over (and access to) the underlying hardware. Programs in CUDA are known as CUDA kernels and execute element-wise over data. CUDA code runs on CUDA devices; these devices contain their own address space and are connected to a host via a PCI-e bus. Data and programs must be uploaded across the PCI-e bus prior to execution. Once a kernel is completed, the results of the computation must again be copied off the device into host memory.

The basic unit of execution in CUDA is a thread; threads are collected in groups of 32 called warps. Warps execute programs in lockstep and share the resources of a streaming multiprocessor (SM) — that is, an SM has a set of registers that are allocated to each thread in a warp; based on the problem one wishes to solve, more or fewer registers may be required. Threads in a warp may communicate with each other at the register level, and may use fast local shared memory to communicate with other threads within the same block. The amount of shared memory and registers allocated per thread may be such that it is not possible to schedule a warp on each core of an SM at any given moment. The ratio of active warps to the total number of possible active warps is known as occupancy. Generally, one wishes to maximize occupancy to make the best use of resources on a given SM, but it is not always possible to do so.
Grids are collections of blocks and are the most coarse grain level of problem decomposition. For example, to render an image, we may break it up into chunks of 8×8 pixels, for a total of 64 pixels. This would correspond to a block size of 64 (block sizes must be multiples of the warp size, i.e. 32). There is no guaranteed order in which blocks complete — for the example above completion order is irrelevant; one simply needs to ensure that all blocks are completed before the image is displayed. If synchronization is needed between blocks, then one must synchronize by waiting for the current kernel to finish, then launching a new kernel.

Since GPU cores are SIMD machines at the hardware level, one potential difficulty with the lightweight thread model of CUDA is branching — what are threads (i.e. SIMD lanes) to do when a branch occurs? Provided all threads take the same path for a branch, there is no issue. If threads within a warp take different paths at a branch, then the threads not on the active path are disabled and revisited later. This is known as branch divergence. It is something one generally wishes to avoid, at least for long runs of instructions, and can be dealt with by algorithm design or branch predication.

From the perspective of an application running on the GPU, to compute a value from an interpolation scheme as fast as possible, one may be tempted to distribute computation along each lane of a warp in terms of SIMD threads. However, this approach makes little sense, as applications are likely demanding reconstructions on a per-thread basis. Thus it makes more sense to create optimized SISD instruction streams along every lane. However, taking the same observation into account for code on the CPU, again each thread is likely going to correspond to one element of the problem, but each thread also has access to SIMD vector instructions.
Additionally, one must take care to use the appropriate memory fetch — on a GPU, texture fetches are spatially cached, and one must organize memory reads to take advantage of this.

Of course, NVIDIA is not the only GPU vendor; AMD also produces comparable GPUs, but the ecosystem for developing GPGPU applications on AMD had been limited to OpenCL until recently, with HIP. The concepts in OpenCL are very similar to those in CUDA — the concept of a warp translates to that of a wavefront (which is a collection of 64 threads); block translates to work group; and grid translates to overall problem size.

Another subtlety of GPU compute applications is that bandwidth is a more serious concern. Since compute throughput is high, GPUs must be able to read and write data to memory at comparable rates. To facilitate this, the best performance is achieved when memory access patterns are simple. CUDA provides a programmer with various different levels of memory. The first is registers, which are accessible to a thread. The next level is constant memory and shared memory, then global memory. Each is an order of magnitude slower than the last.

Shifting a performant implementation of any algorithm to a different architecture can be a daunting task. In that respect, LLVM is a powerful tool. One of the core components of LLVM that helps facilitate this is LLVM-IR, a low level assembly-like language targeting a fictional virtual machine. This representation is in static single assignment (SSA) form. That is, each value is assigned to a variable exactly once; that variable cannot be modified, and may only be used as an input in subsequent operations. While LLVM-IR can be interpreted and therefore explicitly executed, doing so is only advisable when the LLVM tool-chain does not support code generation on a specific architecture; the intention of LLVM-IR is to be translated into the machine code of a given architecture.
This can either be linked into an executable at compile time, or linked into a running executable on-the-fly (i.e. just-in-time or JIT compilation).

In terms of code generation, it is the responsibility of the backend translator to take LLVM-IR and generate machine code. For the most part, the job of a backend translator is to assign registers and stack space to operands and results, as well as translating LLVM-IR instructions to the target machine’s architecture.

¹Block level synchronization is supported through cooperative groups in CUDA 9 and above.

[Figure 3: a pipeline diagram. A high level spline space description (Part I) passes through spline space optimization and parameterized code generation (Part II); alongside, the usual LLVM frontends for high level languages (Swift, Clang, etc.) apply language specific optimizations. Both paths produce LLVM-IR, which flows through optional LLVM optimization passes to LLVM backends for x86_64, aarch64 (ARM), NVPTX (CUDA), and other targets.]

Fig. 3. A high level overview of the LLVM ecosystem, including our analysis and code generation contributions. The * in the LLVM optimization passes denotes that they are optional.

Prior to final compilation, LLVM also provides many optimization passes — these passes transform LLVM-IR into optimized LLVM-IR. The decision to implement optimization passes at the level of LLVM-IR reduces the complexity of compilers producing LLVM-IR (also known as frontends for LLVM); frontends need not worry too much about optimization. Figure 3 shows how the frontend, optimization passes, and backend relate, as well as where our work fits into the LLVM ecosystem.

LLVM is a natural complement to the CUDA ecosystem. CUDA similarly provides a low-level machine abstraction, the parallel thread execution (PTX) environment. PTX code reads very much like LLVM-IR, however with more low-level constructs that expose certain features of the hardware to a developer (thread communication primitives, for example). PTX and LLVM are similar in their goals — PTX exists as a means of representing shader/compute code in a low-level format, yet still affords some platform agnosticism.

Again, PTX code is written from the perspective of a single thread, so translating LLVM-IR to PTX is relatively straightforward. Final register allocation, grid size, and block size are determined by the CUDA run-time, and are independent of the LLVM-IR/PTX representation of the code. In the next section, we detail how we use LLVM to translate each component of a reconstruction space into LLVM-IR, and the different variations on how each component is parameterized.
We start by breaking down the components of the algorithm proposed in Part I into distinct chunks, then elaborate upon how code is generated for each of these chunks. We reiterate the algorithm here with minor modifications; specifically, we make the memory fetches explicit in Algorithm 1. We first decompose Algorithm 1 into different parts: the preamble, the summation loop and polynomial evaluation.

We present these as different generation “blocks”. Blocks take in LLVM-IR variables, and output a set of LLVM-IR variables. Some input variables may correspond to input proper (i.e. the point at which we are evaluating the spline) whereas others may correspond to internal use elements (lookup tables, or the output of other blocks). Bridging the gap between input and output are LLVM-IR instructions, which take inputs and provide one output each. We proceed with the running example from Part I, which has already been reintroduced in Section 3.
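The block abstraction can be modelled as follows — a minimal Python sketch in which plain Python values stand in for LLVM-IR variables (the class names and the toy shift block are hypothetical, not the reference implementation’s API):

```python
import math

class Block:
    """A generation block: consumes named variables, produces named variables."""
    def __init__(self, fn, inputs, outputs):
        self.fn, self.inputs, self.outputs = fn, inputs, outputs

    def emit(self, env):
        # Gather the named inputs, run the block, bind the named outputs.
        results = self.fn(*[env[name] for name in self.inputs])
        env.update(zip(self.outputs, results))
        return env

def compose(blocks, env):
    """Thread the variable environment through a sequence of blocks."""
    for block in blocks:
        env = block.emit(env)
    return env

# Toy preamble block: shift a 1D point into the unit region of evaluation.
shift = Block(lambda x: (x - math.floor(x), math.floor(x)),
              ["x"], ["x_local", "k"])
```

Composing blocks in this way mirrors how the generated LLVM-IR is stitched together: each block appends instructions whose SSA results become the inputs of the next block.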
Algorithm 1 Branch-free evaluation at a point.

procedure Eval(x)
    f ← 0
    for l ∈ {l_0, l_1, ..., l_{M−1}} do
        k ← ρ(x)  ⊲ Determine the shift to the ROE
        x ← x − k  ⊲ Shift to the ROE
        q ← 0
        for i ∈ {0, 1, ..., Q−1} do  ⊲ Determine BSP index
            q ← (x · p_i − d_i < 0) ? q : q | 2^i
        end for
        SubRegionIndex ← σ(q mod p)  ⊲ Map BSP index into sub-regions
        g ← 0
        T′ ← T[SubRegionIndex]
        t′ ← −T′ t[SubRegionIndex]
        π′ ← π[SubRegionIndex]
        for j ∈ {0, ..., n−1} do
            c_j ← MemoryLookup(π′[LatticeSite[j]])
        end for
        for i ∈ {0, 1, ..., K−1} do
            v ← ψ_{π′,i}(T′x − t′, k)
            g ← g + (PsiIndex[SubRegionIndex] == i ? v : 0)
        end for
        f ← f + g  ⊲ Add the contribution for this coset
    end for
    return f
end procedure

The generated code begins with a function declaration and generation of the lookup tables needed for computation — specifically, the tables precomputed in Part I (explicitly
𝑇, t, 𝜋 and 𝑃𝑠𝑖𝐼𝑛𝑑𝑒𝑥 in Algorithm 1) are populated if necessary; some splines have only one element in each of these tables, in which case no table is generated and that single element is inlined where needed. The function definition takes a spatial location (i.e. 𝑠 floating point or double-precision scalar values) and a memory lookup primitive. The lookup primitive can be a pointer to a function, a sequence of memory arrays (each corresponding to a Cartesian coset of the lattice), or a texture object (for GPU code). If the lookup primitive is a function pointer, this function is called on every lattice site lookup. If an array is specified, memory fetches are generated at every lattice site fetch. If the lookup primitive is a texture fetch, then the code generation module will generate texture fetches when the data within lattice sites are needed; these can be specified as either linear or nearest-neighbour fetches. Listing 1 shows an example of the generated LLVM-IR code containing lookup tables and the function declaration.

; Different indexes are declared here; addrspace(4) tells the backend to
; place these tables in constant memory
@"bsp_index" = addrspace(4) constant [8 x i8] ; constant data omitted
@"xform_lookup" = addrspace(4) constant ; constant data omitted

; For functions passed a pointer, the function signature is:
define double @reconstruct(double %x_0, double %x_1, double (i32, i32)* %lookup) {
; For functions passed a texture, the function signature is:
define double @reconstruct(double %x_0, double %x_1, texture_t %lookup) {
entry:
  ; ... composition of blocks ...
}

Listing 1. Example of function and lookup table definition. Some splines may have more or fewer lookup tables; some have none. In practice 'texture_t' is presented as an i64 in order to remain transparent to LLVM.
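To make the control flow of Algorithm 1 concrete, the following is a minimal Python sketch of the same structure. All names here are illustrative stand-ins for the generated LLVM-IR, the T′/t′ transforms are folded into the polynomial callbacks for brevity, and the toy 1-D instance at the bottom is an assumption for demonstration only.

```python
import math

def eval_spline(x, cosets, rho, planes, sigma, p, polys, poly_index, stencil, fetch):
    """Branch-free point-wise evaluation, mirroring Algorithm 1."""
    f = 0.0
    for l in cosets:                                  # one pass per lattice coset
        xs = [xi - li for xi, li in zip(x, l)]
        k = rho(xs)                                   # shift to the reference region
        xt = [xi - ki for xi, ki in zip(xs, k)]       # point inside the ROE
        q = 0
        for i, (n, d) in enumerate(planes):           # BSP plane tests set bits of q
            if sum(a * b for a, b in zip(n, xt)) - d >= 0.0:
                q |= 1 << i
        sub = sigma[q % p]                            # compress, map to sub-region
        coeffs = [fetch(s, k) for s in stencil[sub]]  # explicit memory fetches
        g = 0.0
        for i, poly in enumerate(polys):              # predicated sum: only the
            v = poly(xt, coeffs)                      # matching polynomial survives
            g += v if poly_index[sub] == i else 0.0
        f += g                                        # contribution of this coset
    return f

# Toy instance: 1-D linear interpolation of data[j] (one coset, one sub-region,
# no splitting planes), so evaluation at x = 0.5 blends data[0] and data[1].
data = [0.0, 2.0, 4.0]
val = eval_spline(
    x=[0.5], cosets=[(0.0,)],
    rho=lambda xs: [float(math.floor(xi)) for xi in xs],
    planes=[], sigma=[0], p=1,
    polys=[lambda xt, c: c[0] * (1.0 - xt[0]) + c[1] * xt[0]],
    poly_index=[0], stencil=[[0, 1]],
    fetch=lambda s, k: data[s + int(k[0])],
)
# val == 1.0
```

Note how the inner loop accumulates every candidate polynomial through a predicate rather than a branch; this is exactly the `select`-style pattern the generator emits.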
If the spline space has a coset decomposition, we generate the header for a loop over the cosets. We first generate a label and index, then generate the memory look-ups for the coset offsets. We then increment the index for the next iteration. After the subsequent blocks, a conditional branch brings execution flow back to the label we defined above.
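As a sketch of what this block emits, the snippet below assembles the coset-loop skeleton as textual LLVM-IR strings. The label and register names are invented for illustration; the actual generator builds IR programmatically rather than via string concatenation.

```python
def emit_coset_loop(num_cosets, body_ir):
    """Emit an LLVM-IR loop skeleton over `num_cosets` lattice cosets."""
    lines = [
        "coset_head:",
        # loop index: 0 on entry, otherwise the incremented value from the latch
        "  %coset = phi i32 [ 0, %entry ], [ %coset_next, %coset_latch ]",
    ]
    lines += ["  " + instr for instr in body_ir]   # blocks generated in between
    lines += [
        "coset_latch:",
        "  %coset_next = add i32 %coset, 1",
        f"  %done = icmp eq i32 %coset_next, {num_cosets}",
        # conditional branch back to the label defined above
        "  br i1 %done, label %exit, label %coset_head",
    ]
    return "\n".join(lines)

ir = emit_coset_loop(2, ["; ... preamble / summation blocks ..."])
```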
This block determines the lattice site closest to (or within) the region of evaluation. The user may specify the region of evaluation as a parallelepiped region that tiles space according to the given lattice (in which case they specify the parallelepiped via a matrix); they may also specify the region of evaluation as the Voronoi cell of the lattice. These are two methods of collecting the reference sub-regions of evaluation; other groupings are possible, but these two are very simple to implement and cover a wide variety of cases. The output of this block is an 𝑠-tuple of LLVM-IR variables that correspond to a lattice shift to the reference region. The input point is then shifted by the reference point. Figure 4 demonstrates how 𝜌 is calculated, and also (geometrically) how it is used.

%tmp0 = fmul float %x_0, 1.0 ; inverse matrix transform.
%tmp1 = fmul float %x_1, 0.0 ; these operations will get
%tmp2 = fmul float %x_0, 0.0 ; optimized away, but are
%tmp3 = fmul float %x_1, 1.0 ; kept here for clarity
%tmp4 = fadd float %tmp0, %tmp1
%tmp5 = fadd float %tmp2, %tmp3
; clamp to center of parallelepiped
%pre_rho0 = call float @"llvm.round.f32"(%tmp4)
%pre_rho1 = call float @"llvm.round.f32"(%tmp5)
%tmp6 = fmul float %pre_rho0, 1.0 ; forward matrix transform.
%tmp7 = fmul float %pre_rho1, 0.0
%tmp8 = fmul float %pre_rho0, 0.0
%tmp9 = fmul float %pre_rho1, 1.0
%rho0 = fadd float %tmp6, %tmp7 ; float, but will be cast
%rho1 = fadd float %tmp8, %tmp9 ; to integer when needed

Fig. 4. The image on the left (omitted: the stencil shifted by ±𝜌(𝑥)) exemplifies the utility of 𝜌(𝑥) — space is tessellated into regions of evaluation, and using 𝜌(𝑥) we shift a point to our reference region of evaluation. In general this is performed by an inverse matrix transform, then either a round or floor operation, and finally a transform back to the original space tiled by the parallelepiped. This is exemplified by the LLVM code on the right.
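The same three steps can be sketched in Python, assuming a 2-D lattice with generating matrix L and its inverse; in the generated IR these products are emitted inline and constant-folded by the backend.

```python
def rho(x, L, L_inv):
    """Nearest lattice shift: inverse transform, round, forward transform."""
    # 1. map the point into the lattice's own coordinate system
    u0 = L_inv[0][0] * x[0] + L_inv[0][1] * x[1]
    u1 = L_inv[1][0] * x[0] + L_inv[1][1] * x[1]
    # 2. clamp to the centre of the containing parallelepiped
    u0, u1 = round(u0), round(u1)
    # 3. map the integer cell index back to the original space
    return [L[0][0] * u0 + L[0][1] * u1,
            L[1][0] * u0 + L[1][1] * u1]

# On the Cartesian lattice (L = identity, matching the listing in Figure 4)
# this reduces to component-wise rounding:
I = [[1.0, 0.0], [0.0, 1.0]]
shift = rho([1.2, -0.7], I, I)   # -> [1.0, -1.0]
```

Using a floor instead of a round in step 2 selects the corner rather than the centre of the containing cell, which is the variant used for parallelepiped regions anchored at lattice sites.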
The value of 𝜌(𝑥) is also used to shift the lookup lattice sites (i.e. the stencil) from the reference region to the correct lattice sites.

Once the point of evaluation has been shifted to the region of evaluation, the next task is to determine the sub-region the point resides within. The next block takes in the shifted evaluation point from the previous blocks and compares the point of evaluation with the planes that split the region of evaluation into sub-regions. Each comparison consists of a dot product and a subtraction or addition. This is easily vectorized; however, the vector reduction functions are still experimental in LLVM, so we leave this as scalar code and hope that a given backend will vectorize it if possible. Each plane comparison corresponds to a bit in an unsigned integer, where the integer width is chosen to accommodate the number of planes; this produces an integer 𝑞. Figure 5 shows both how the plane comparisons determine the sub-region index, as well as the generated code. While 𝑞 could be used to index the sub-region of evaluation, we may have drastically more possible values of 𝑞 than sub-regions, which would require an entry for every possible value of 𝑞 in our lookup tables. Instead, we take 𝑞 mod 𝑝 to compress it to a more suitable range, then pass this through a lookup table that produces the final sub-region of evaluation.
; Test against plane 1
%srm0 = fmul float %xt_0, -1.0
%srm1 = fmul float %xt_1, 1.0
%srm2 = fadd float %srm1, %srm0
%srm3 = fcmp oge float %srm2, 0.0
%bsp_index0 = select i1 %srm3, i32 1, i32 0
; Test against plane 2
%srm4 = fmul float %xt_0, 1.0
%srm5 = fmul float %xt_1, 1.0
%srm6 = fadd float %srm5, %srm4
%srm7 = fcmp oge float %srm6, 0.0
%bsp_index1 = select i1 %srm7, i32 2, i32 0
%bsp_idx = or i32 %bsp_index1, %bsp_index0
; lookup the subregion
%lutidx = urem i32 %bsp_idx, 4
%el_addr = getelementptr [8 x i8], [8 x i8] addrspace(4)* @"bsp_index", i32 0, i32 %lutidx
%subreg_idx = load i8, i8 addrspace(4)* %el_addr

Fig. 5. The image on the left (omitted) shows the planes that decompose the region of evaluation into sub-regions of evaluation. To determine the sub-region, the point of evaluation is shifted to the reference region, then tested to determine which side of each plane it lies on. Each plane test corresponds to a bit of an integer, which is then compressed down into a reasonable range before being fed into a lookup table that yields the final sub-region index. The code on the right demonstrates what this procedure looks like in LLVM-IR.
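The same plane-test logic, sketched in Python. The two planes and the identity lookup table below mirror the illustrative values in Figure 5; real plane data and tables come from the Part I analysis.

```python
def subregion_index(xt, planes, p, table):
    """BSP index from plane tests, compressed mod p, mapped through a table."""
    q = 0
    for i, (n, d) in enumerate(planes):
        # dot(n, xt) - d >= 0  <=>  xt lies on the positive side of plane i;
        # in the generated IR this is an fcmp feeding a branchless select
        if sum(a * b for a, b in zip(n, xt)) - d >= 0.0:
            q |= 1 << i
    return table[q % p]

planes = [((-1.0, 1.0), 0.0),   # plane 1: -x0 + x1 >= 0
          ((1.0, 1.0), 0.0)]    # plane 2:  x0 + x1 >= 0
table = [0, 1, 2, 3]            # 4 entries, p = 4 (identity map here)
idx = subregion_index((0.25, 0.5), planes, 4, table)   # both tests pass -> 3
```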
Now that the sub-region index is known, we may look up the various transformations needed to translate one sub-region into another. There are a few subtleties to note here. The first is that even though we require only one query to the different lookup tables, it may be advantageous to explicitly re-fetch the elements of these lookup tables whenever they are needed — this hints at the lifespan of a register and allows the backend to allocate fewer registers, increasing occupancy, at the cost of a few memory accesses. On the GPU, the lookup tables are stored in constant memory, which avoids additional slow accesses to main memory; on the CPU, these tables will quickly enter the CPU's cache memory system. Since this operation consists only of memory look-ups, we omit an example listing.
The next significant body of code is the loop over all reference sub-regions. This loop is always explicitly unrolled, since the representation of the polynomial is not necessarily the same for each reference sub-region.
The final step is to evaluate the given polynomial within a sub-region. Since the width of the memory bus on a compute architecture is variable, as is the "raw compute to memory bandwidth" ratio of modern architectures, we introduce parameters to tune the performance of the polynomial evaluation on a per-architecture basis. The first parameter we explore is a consequence of dividing the polynomial within a sub-region into smaller chunks; that is, we divide the polynomial into groups based on memory accesses. This allows us to interleave computation and memory accesses — that is, as memory read instructions are waiting on results, we are able to compute parts of a polynomial that do not depend on the memory reads still in flight. First, the polynomial is split up into groups that depend only on (at most) 𝑚 coefficients — we call 𝑚 the evaluation group size. At any given moment, we allow a maximum of 𝑑 (such that 𝑑 ≥ 𝑚) memory reads to be sent to the memory controller; computation for the first 𝑚 coefficients begins once they have come back from the memory controller — we call 𝑑 the pipeline depth. To make this idea more concrete, we demonstrate the principle with the example from above. We set 𝑚 = 𝑑 = 4; the polynomial in Figure 1 is then decomposed into four sub-polynomials 𝑝_1, ..., 𝑝_4 (Equations 2–5), each in multivariate Horner form (from a greedy Horner factorization) and each depending on at most four of the coefficients 𝑐_𝑖. Summing all of these polynomials gives the polynomial for the reference sub-region in Figure 1. Keep in mind that each of the 𝑐_𝑖 corresponds to a memory lookup at a lattice site, so the decomposition above depends on the order in which one decides to visit the 𝑐_𝑖 — for discussion on this and its optimization procedure, see Part I. Algorithm 2 shows a concrete example of pipelining. When the linear fetch is used, the order of operations changes compared to the case above. In this case, parts of the polynomial are computed prior to the trilinear memory fetches. Thus the approach is similar to the above, but the order of fetches and computation is reversed. Moreover, it is no longer possible to pipeline polynomial evaluation as above, since the trilinear decomposition requires an evaluation group size of 1.
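The grouping and pipelining scheme can be sketched as follows. The deque stands in for the memory controller's in-flight reads, and `group_polys`, the lattice sites, and the data values are illustrative assumptions rather than output of the actual generator.

```python
from collections import deque

def eval_grouped(xt, sites, group_polys, m, d, fetch):
    """Interleave coefficient fetches (at most d in flight) with partial
    evaluations of sub-polynomials that each consume m coefficients."""
    assert d >= m
    in_flight = deque()               # committed but not-yet-consumed reads
    next_site = 0
    result = 0.0
    for poly in group_polys:
        # commit reads until d are in flight or no sites remain
        while len(in_flight) < d and next_site < len(sites):
            in_flight.append(fetch(sites[next_site]))
            next_site += 1
        # stall on the oldest m reads, then fold them into the running sum
        group = [in_flight.popleft() for _ in range(min(m, len(in_flight)))]
        result += poly(xt, group)
    return result

# Toy usage: four coefficients consumed in two groups of m = 2 with d = 2.
data = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}
total = eval_grouped(
    xt=(0.0, 0.0), sites=[0, 1, 2, 3],
    group_polys=[lambda xt, c: c[0] + c[1]] * 2,
    m=2, d=2, fetch=lambda s: data[s],
)   # -> 10.0
```

Raising `d` above `m` commits extra reads ahead of the group currently being evaluated, which is precisely the latency-hiding effect the pipeline depth parameter controls.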
To validate the performance of our method and demonstrate the utility of our parameterization, we measure the average "speed" of our generated code for various cases. In this context, we take "speed" to mean the average number of point-wise reconstructions one may perform per second. We do this for all valid combinations of
Algorithm 2 Pipelining example (𝑐_𝑗 is fetched from the lattice site given by the 𝑗-th stencil offset s_𝑗, shifted by [rho_0, rho_1]^t).

𝑐_0 ← FETCH(T^t s_0 + [rho_0, rho_1]^t)  ⊲ Commit 4 memory reads
𝑐_1 ← FETCH(T^t s_1 + [rho_0, rho_1]^t)
𝑐_2 ← FETCH(T^t s_2 + [rho_0, rho_1]^t)
𝑐_3 ← FETCH(T^t s_3 + [rho_0, rho_1]^t)
STALL(𝑐_0)  ⊲ Wait for 2 fetches to complete
STALL(𝑐_1)
result ← 𝑝_1(xt_0, xt_1, 𝑐_0, 𝑐_1)  ⊲ Compute what we can while memory fetches are in flight
𝑐_4 ← FETCH(T^t s_4 + [rho_0, rho_1]^t)  ⊲ Commit 2 new reads to the controller
𝑐_5 ← FETCH(T^t s_5 + [rho_0, rho_1]^t)
STALL(𝑐_2)
STALL(𝑐_3)
result ← result + 𝑝_2(xt_0, xt_1, 𝑐_2, 𝑐_3)
𝑐_6 ← FETCH(T^t s_6 + [rho_0, rho_1]^t)
STALL(𝑐_4)
STALL(𝑐_5)
result ← result + 𝑝_3(xt_0, xt_1, 𝑐_4, 𝑐_5)
STALL(𝑐_6)
result ← result + 𝑝_4(xt_0, xt_1, 𝑐_6)

pipeline depth, maximum fetch count and, when appropriate, branch behaviour. We also collect and report the variance associated with each experiment when possible. We performed our tests over three different architectures: x86_64, ARM, and CUDA. All of our x86_64 tests were performed on a Ryzen 9 3900X clocked at 3.8 GHz with 64 GB of DDR4 RAM running at 3200 MHz. Our ARM test cases were run on a Raspberry Pi 4 (i.e. a Cortex-A72 at 1.5 GHz and 4 GB of RAM). Finally, we ran our CUDA tests on an RTX 3090 with 24 GB of RAM. We ran our tests over the CC, BCC and FCC lattices; in each case, the lattices are chosen so that they occupy approximately the same amount of memory. More explicitly, the CC lattice has dimension 128 × 128 × 128, the BCC lattice has dimension 202 × 202 ×
202 (but with the non-BCC points discarded) and the FCC has dimension 161 × 161 × 161.

In this experiment we look at some of the box-splines found in the scientific visualisation literature — we omit splines that admit a Cartesian coset decomposition (i.e. those that are effectively a sum of splines on two or more shifted grids). We do this for box-splines of order 2, 3 and 4.

[Figure 6: heat maps of mean reconstruction time (ms) and variance against pipeline depth and group size for the CC TP Linear (Voronoi Order 2) and BCC Linear Rhombic Dodecahedron splines on arm, x86_64, and CUDA; the CUDA panels include a "linear fetch" row.]
Fig. 6. Timing results for second order splines. Each square cell in the plot refers to code that has been generated with a specific value for "pipeline depth" and "group size". The mean reconstruction speed is averaged over trials at random locations in the volume. GPU results have no variance collected, but do include timing results for cases in which a linear texture fetch has been used to group multiple reads together.

Figure 6 compares second order box-splines. There are only two cases in the literature to consider, the BCC Linear spline [Entezari et al. 2004] and the tensor-product linear spline. Between architectures one sees orders of magnitude of speed differences. The ARM results are the slowest, being orders of magnitude slower than x86_64 — an expected result, since the Ryzen 9 is fabricated at a lower process node, includes features like branch prediction and out-of-order execution that are not present on the ARM chip, and is clocked higher than the Cortex CPU. The CUDA code also performed orders of magnitude faster than the x86_64 code. Again, this is expected, since the raw parallel compute capability of the RTX 3090 is many orders of magnitude higher than that of the Ryzen 9.
The linear rhombic dodecahedron box-spline requires only four memory lookups and has a relatively simple polynomial representation compared to the tri-linear interpolant, which requires eight memory lookups and includes more terms in its polynomial representation. In our ARM tests, as expected, the linear rhombic dodecahedron spline outperforms the tri-linear. This is a trend that is seen in the ARM code, but not the x86_64 or GPU code. This discrepancy is likely due to the fact that there is slightly worse data locality on the BCC lattice compared to the CC — splitting lattice cosets into separate textures has the side effect of reducing data locality. For both test splines, one sees no benefit in making use of a linear fetch. This is strange for the tensor product linear spline on the CC lattice, as the linear texture fetch is natively implemented in hardware on the card. However, our framework is not intelligent enough to simply replace this case with a single texture fetch instruction; there is a small amount of overhead in the preamble which likely leads to this case under-performing.

Figure 7 compares third order splines. This marks the first experiment in this work in which we see the use of branch predication — the FCC cubic truncated octahedron box-spline has two representative sub-regions [Kim et al. 2013], which necessitates either a branch or branch predication. In this case, for both CPU results we see the FCC cubic spline readily outperforming the tensor product quadratic on the CC lattice. This is likely due to the simple fact that the FCC cubic spline has many fewer points in its support. Requiring fewer memory accesses on a CPU architecture where arithmetic is cheap leads to all-around better performance.

The use of branching provides insight into the effect of support size on reconstruction speed.
When branching is disabled, reconstruction with the FCC cubic spline requires approximately twice the amount of computation, but this does not amount to a doubling of the reconstruction time when compared to the case in which branching is enabled. This suggests that memory bandwidth has a bigger effect on reconstruction time than raw compute requirements do. This observation is present in both CPU results, but is exaggerated on x86_64, likely due to its superior arithmetic performance in general. It is difficult to make a similar observation from the GPU results in this case; we will touch more on the effect branching has on the GPU when we look at the Voronoi splines. In this specific case, branching provides a tangible benefit in reconstruction speed, but only for certain parameter combinations; on average it seems to hurt performance.

The tensor product quadratic spline has some interesting discrepancies. Between the different CPU architectures, there is a vast difference in the effect of the group size and pipeline depth. Specifically, for the ARM CPU, beyond a group size of 1, increasing the group size leads to better performance, but the opposite holds on the x86 CPU. However, both attain their best performance when the group size is 1.

Between the two splines, we see the spline with smaller support (the FCC cubic) outperforming the tensor product spline on both CPU architectures. However, we see the FCC cubic spline under-performing on the GPU, again, likely due to memory locality. That said, the performance of the best case FCC cubic code is similar to the best case tensor product quadratic. In both test splines, introducing a linear fetch hurts performance.

Figure 8 compares fourth order splines [Entezari and Möller 2006]. Most cases echo observations seen for the third order splines. The notable exceptions are certain GPU results. The tensor product cubic spline on the CC lattice marks the first instance in which the linear fetch trick improves performance.
This is unsurprising, as we reduce 64 fetches to 8 via the linear fetch trick in this instance. This experiment also marks the first time a non-Cartesian spline outperforms equivalent order Cartesian splines — both splines on the BCC lattice outperform the fourth order splines on the CC lattice, with the BCC quartic spline being fastest in all cases.
The Voronoi splines are important test cases as they provide the same support size and order on all 3D lattices. As such, we consider them the fairest comparison to the tensor product splines on the Cartesian lattice (which are exactly the Voronoi splines for the Cartesian lattice).
[Figure 7: heat maps of mean reconstruction time (ms) and variance against pipeline depth and group size for the CC TP Quadratic (Voronoi Order 3) and FCC Cubic Truncated Octahedron (branchless and branchy) splines on arm, x86_64, and CUDA.]

Fig. 7. Timing results for third order splines. The FCC Cubic Truncated Octahedron spline [Kim et al. 2008] admits a decomposition that requires either branch predication or branching, labelled as "branchless" and "branchy" respectively.
Figure 9 shows the performance results for the second order Voronoi splines. In general, on both the CPU and GPU, the non-Cartesian Voronoi splines perform on par with the tensor product linear spline (see Figure 6 as a comparison) when branchy code is generated — they perform on the same order of magnitude with a small speed penalty, likely due to the non-separability of their polynomial regions and memory incoherence. The effect of branching on the CPU code in this case is consistent with the observations for the FCC cubic box
[Figure 8: heat maps of mean reconstruction time (ms) and variance against pipeline depth and group size for the fourth order splines — CC TP Cubic (Voronoi Order 4), BCC Quartic Truncated Rhombic Dodecahedron, BCC Quintic Rhombic Dodecahedron, and CC Quartic Truncated Rhombic Dodecahedron — on arm, x86_64, and CUDA; the CUDA panels include a "linear fetch" row.]
Fig. 8. Timing results for fourth order splines.

[Figure 9: heat maps of mean reconstruction time (ms) and variance against pipeline depth and group size for the BCC and FCC Voronoi Order 2 splines (branchless and branchy) on arm, x86_64, and CUDA.]
Fig. 9. Timing results for the second order Voronoi splines. The results on the left were generated on an x86 machine, whereas the results on the right were generated on an ARM machine.

spline. This is many orders of magnitude faster than what is reported in the original Voronoi spline papers, in which it is cited that their renders took many days to complete [Mirzargar and Entezari 2010, 2011]. On the GPU, the effect of branching is inconclusive. The second order BCC spline is hurt by allowing branches (on average), however the second order FCC spline is aided by the addition of branches (on average). Enabling linear fetches reduces performance.

Figure 10 shows the performance for the third order Voronoi splines. On the CPU, the best case branchy code is roughly on par with the third order tensor product spline (see Figure 7). The branchless performance is significantly worse than the branchy case. The likely reason for this is the 7 unique sub-region polynomials that must be predicated on each evaluation, which leads to an excess of additional work per evaluation. On the GPU, branching increases performance on the FCC lattice on average, and reduces performance on the BCC lattice (on average). However, the best GPU performance on the BCC lattice is attained when branching is enabled. Additionally, the best performance on the BCC lattice is on the same order of magnitude as the best performance on the FCC lattice (albeit a small factor slower). Both results are an order of magnitude slower than the GPU code for the third order tensor product spline on the Cartesian lattice (Figure 7). Figure 11 shows a summary of the best performance of the different Voronoi splines (on their native lattices) on different architectures.
Notably, the ARM experiments show better relative performance on the BCC and FCC lattices compared to the x86_64 implementations. This is likely because the ratio of compute to memory bandwidth is higher on the ARM chip — that is, the ARM implementation has more time to "compute" while waiting for memory fetches to return, whereas on the x86_64 implementation the compute to memory bandwidth ratio is much lower. This would also imply that the quadratic tensor product polynomial being evaluated is somehow more complicated than those of the other Voronoi splines, which is counter-intuitive, since tensor product splines are notably simpler than their non-separable counterparts. Keep in mind, however, that we expand the tensor product representation, then use a greedy Horner factorization for evaluation, which leads to more work than the original tensor product representation would have. The GPU has a higher compute to memory bandwidth ratio, but is penalized more heavily for more complicated memory reads. Moreover, this better compute to bandwidth ratio gets consumed by the branch predication, which must waste up to 7× the computation for evaluation on the BCC lattice.

Overall, the parameters we introduced into our code generation scheme have a large impact on the speed of the resulting approximation scheme. In general, for CPU based interpolation methods, enabling branches in the code generation leads to more efficient implementations — this is an expected result, since the performance penalty of branching on CPU architectures is very small compared to the performance impact of wasted computation. The same cannot be uniformly said for the GPU; there are indeed cases in which branch predicated code performs better than branch-heavy code. However, our results show that the dogma of avoiding branches in GPU code is not completely true, at least on modern (Ampere) GPUs.
On average, however, branching seems to hurt performance more than it helps.

In terms of polynomial grouping, larger groups tend toward worse performance in most cases. This is somewhat counter-intuitive, as one would expect that grouping multiple memory reads together would result in less time spent on polynomial evaluation. However, smaller groupings likely lead to better interleaving of computation and memory fetches. Keep in mind that setting the group size low may lead to lower accuracy compared to larger group sizes, since a larger group size combines more terms during the Horner factorization, which implies lower error.

Pipeline depth does have an effect on reconstruction speed, but it is not as pronounced as that of the memory fetch grouping. A good set of default parameters seems to be a low group size (simply set to 1) and a high pipeline depth (the maximum value), with branches enabled for CPU code and disabled for GPU code.
The belief has long been held that non-Cartesian methods are impractical because they introduce additional complexity and unacceptable computational overhead. First of all, this is not true in general: we have explicitly shown in this work that there are cases in which non-Cartesian splines outperform Cartesian splines. The matter is more nuanced than it appears, and depends on many factors including machine architecture, polynomial representation, and pipelining, to name a few.

[Figure 10: heat maps of mean reconstruction time (ms) and variance against pipeline depth and group size for the BCC and FCC Voronoi Order 3 splines (branchless and branchy) on arm, x86_64, and CUDA.]
Linear Fetch P i p e li n e d e p t h Linear Fetch P i p e li n e d e p t h Fig. 10. Timing results for the third order Voronoi splines.
Thanks to modern advances in GPU technology and theoretical frameworks (i.e., this work), we claim that the overhead for non-Cartesian interpolation is now negligible on the GPU. To be clear, we are not stating that there is no overhead to non-Cartesian splines; there clearly is in certain cases. But we are at a point where raw computation is cheap enough to use such methods. We also show good performance on the CPU, although the sentiment against non-Cartesian methods has not seemed as strong for CPU-based implementations in the literature.

Additionally, by using the results from our previous work in Part I [Horacsek and Alim 2021] and proposing a framework for generating efficient cross-platform implementations of spline spaces, we lift the burden of implementation away from the practitioner.

[Figure: summary bar charts — Voronoi Summary (arm), Voronoi Summary (x86_64), Voronoi Summary (cuda) — showing mean time (ms) for the order 2 and order 3 CC, BCC, and FCC Voronoi splines. Relative to the Cartesian (CC) baseline of the same order: on arm, BCC/FCC are 3.04×/2.90× faster at order 2 and 5.29×/7.19× faster at order 3; on x86_64, 1.49×/1.53× slower at order 2 and 3.01×/2.46× slower at order 3; on CUDA, 1.43×/1.17× slower at order 2 and 13.97×/4.12× slower at order 3.]
Fig. 11. Best performance for Voronoi; each bar shows the performance relative to the "baseline" Cartesian Voronoi spline.

While these results are likely not the absolute fastest possible, our methodology is applicable to a large number of splines, including splines that are very difficult to implement and have not received any form of efficient implementation in the literature (e.g., the Voronoi splines). Our code generation framework includes a small set of tunable parameters that impact performance. In exploring these, we exposed their effect on performance. Moreover, we showed that parameter choice affects performance differently depending on architecture; this affords a practitioner more options in tuning performance on a given architecture, as it is clear that no single evaluation scheme is guaranteed to attain optimal performance over all possible compute architectures. We also provide a reference implementation of both Part I and Part II [Horacsek 2021]. Future work is devoted to investigating alternative polynomial representations, increasing the class of interpolants to which this work is applicable (i.e., extending beyond polynomial interpolants), incorporating gradient computations, and applying additional heuristics to accelerate tensor product forms.
REFERENCES
Akram, B., Alim, U. R., and Samavati, F. F. (2015). CINAPACT-splines: A family of infinitely smooth, accurate and compactly supported splines. In Bebis, G., Boyle, R., Parvin, B., Koracin, D., Pavlidis, I., Feris, R., McGraw, T., Elendt, M., Kopper, R., Ragan, E., Ye, Z., and Weber, G., editors, Advances in Visual Computing, pages 819–829, Cham. Springer International Publishing.

Bezanson, J., Edelman, A., Karpinski, S., and Shah, V. B. (2017). Julia: A fresh approach to numerical computing. SIAM Review, 59(1):65–98.

Blu, T., Thévenaz, P., and Unser, M. (2001). MOMS: Maximal-order interpolation of minimal support. IEEE Transactions on Image Processing, 10(7):1069–1080.

Ceberio, M. and Kreinovich, V. (2004). Greedy algorithms for optimizing multivariate Horner schemes. ACM SIGSAM Bulletin, 38(1):8–15.

Condat, L. and Van De Ville, D. (2006). Three-directional box-splines: Characterization and efficient evaluation. IEEE Signal Processing Letters, 13(7):417–420.

Csébfalvi, B. (2013). Cosine-weighted B-spline interpolation: A fast and high-quality reconstruction scheme for the body-centered cubic lattice. IEEE Transactions on Visualization and Computer Graphics, 19(9):1455–1466.

Entezari, A., Dyer, R., and Möller, T. (2004). Linear and cubic box splines for the body centered cubic lattice. In Proceedings of the Conference on Visualization '04, pages 11–18. IEEE Computer Society.

Entezari, A. and Möller, T. (2006). Extensions of the Zwart-Powell box spline for volumetric data reconstruction on the Cartesian lattice. IEEE Transactions on Visualization and Computer Graphics, 12(5).

Fabregat-Traver, D. and Bientinesi, P. (2012). A domain-specific compiler for linear algebra operations. In International Conference on High Performance Computing for Computational Science, pages 346–361. Springer.

Finkbeiner, B., Entezari, A., Van De Ville, D., and Möller, T. (2010). Efficient volume rendering on the body centered cubic lattice using box splines. Computers & Graphics, 34(4):409–423.

Flynn, M. J. (1972). Some computer organizations and their effectiveness. IEEE Transactions on Computers, 100(9):948–960.

Horacsek, J. (2021). Fast spline GitHub repository. https://github.com/jjh13/fast-spline.

Horacsek, J. and Alim, U. (2021). Generating fast interpolants for volumetric data: Part I — Theory and analysis.

Kim, M., Entezari, A., and Peters, J. (2008). Box spline reconstruction on the face-centered cubic lattice. IEEE Transactions on Visualization and Computer Graphics, 14(6):1523–1530.

Kim, M., Kim, H., and Sarfaraz, A. (2013). Efficient GPU isosurface ray-casting of BCC datasets.

Kim, M. and Peters, J. (2009). Fast and stable evaluation of box-splines via the BB-form. Numerical Algorithms, 50(4):381–399.

Mirzargar, M. and Entezari, A. (2010). Voronoi splines. IEEE Transactions on Signal Processing, 58(9):4572–4582.

Mirzargar, M. and Entezari, A. (2011). Quasi interpolation with Voronoi splines. IEEE Transactions on Visualization and Computer Graphics, 17(12):1832–1841.

Rawat, P., Kong, M., Henretty, T., Holewinski, J., Stock, K., Pouchet, L.-N., Ramanujam, J., Rountev, A., and Sadayappan, P. (2015). SDSLc: A multi-target domain-specific compiler for stencil computations. In Proceedings of the 5th International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing, pages 1–10.

Shen, H., Roesch, J., Chen, Z., Chen, W., Wu, Y., Li, M., Sharma, V., Tatlock, Z., and Wang, Y. (2020). Nimble: Efficiently compiling dynamic neural networks for model inference. arXiv preprint arXiv:2006.03031.