Automatic Generation of Interpolants for Lattice Samplings: Part I -- Theory and Analysis
JOSHUA HORACSEK, University of Calgary
USMAN ALIM, University of Calgary
Interpolation is a fundamental technique in scientific computing and is at the heart of many scientific visualization techniques. There is usually a trade-off between the approximation capabilities of an interpolation scheme and its evaluation efficiency. For many applications, it is important for a user to be able to navigate their data in real time. In practice, the evaluation efficiency (or speed) outweighs any incremental improvements in reconstruction fidelity. In this two-part work, we first analyze from a general standpoint the use of compact piece-wise polynomial basis functions to efficiently interpolate data that is sampled on a lattice. In the sequel, we detail how we generate efficient implementations via automatic code generation on both CPU and GPU architectures. Specifically, in this paper, we propose a general framework that can produce a fast evaluation scheme by analyzing the algebro-geometric structure of the convolution sum for a given lattice and basis function combination. We demonstrate the utility and generality of our framework by providing fast implementations of various box splines on the Body Centered and Face Centered Cubic lattices, as well as some non-separable box splines on the Cartesian lattice. We also provide fast implementations for certain Voronoi splines that have not yet appeared in the literature. Finally, we demonstrate that this framework may also be used for non-Cartesian lattices in 4D.

CCS Concepts: • Mathematics of computing → Mathematical software; Mathematical software performance; Continuous mathematics; • Hardware → Signal processing systems.

Additional Key Words and Phrases: Interpolation, signal processing, volumetric rendering
ACM Reference Format:
Joshua Horacsek and Usman Alim. 2021. Automatic Generation of Interpolants for Lattice Samplings: Part I — Theory and Analysis. 1, 1 (February 2021), 23 pages. https://doi.org/10.1145/1122445.1122456
Given a set of discrete data points that reside on a lattice, one of the key fundamental operations that one can perform on those data is interpolation. This appears in many different contexts in science and engineering: one may use interpolation to fill in data between points in time-series data, to fill in missing data between pixels in an image, or missing data between voxels in a volumetric image. In fact, many scientific visualization algorithms require a continuous representation of a signal as an input.

As simple as it may seem, interpolation of lattice-sampled data can quickly become complicated. To many, interpolation is synonymous with linear interpolation, and if we wish to interpolate in higher dimensions, we can simply linearly interpolate in each dimension separately (i.e. a tensor product extension). However, there is much more freedom available that we may take advantage of when we move past one dimension. For example, there are many alternative interpolants (or basis functions) we may choose instead of a simple linear interpolant,
Authors' addresses: Joshua Horacsek, [email protected], University of Calgary, 2500 University Dr. NW, Calgary, Alberta, T2N 1N4; Usman Alim, [email protected], University of Calgary, 2500 University Dr. NW, Calgary, Alberta, T2N 1N4.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
© 2021 Association for Computing Machinery.
XXXX-XXXX/2021/2-ART $15.00
https://doi.org/10.1145/1122445.1122456
, Vol. 1, No. 1, Article . Publication date: February 2021.

many with higher accuracy and/or higher smoothness than a simple linear interpolant. Not only this, but we may have more freedom in terms of where samples are placed when the dimension of the space is ≥
2. We say 'may' because for some problems, one is forced to use a specific lattice structure, typically the Cartesian lattice, whereas other problems allow lattice choice as a degree of freedom.

In this work, we build a framework for interpolation on lattices, and show how to do it fast. Our work originated in scientific visualization, in particular working with volumetric data, but the observations we make are generally applicable to other domains as well as higher dimensions. We will mainly keep to the 3-dimensional case, but we will note how to generalize to higher dimensions when it is necessary.

Returning to the context of scientific visualization, interpolation speed is important in practice because it facilitates interactivity which, in turn, allows for fast iteration over data-sets — the ability to quickly (visually) explore data is one of the key strengths of visualization. But more often than not, in practice, interpolation boils down to "use trilinear interpolation", and it is clear why: trilinear interpolation is fast, as it is implemented in hardware for all modern graphics processing units; it is straightforward to use, involving a single texture fetch instruction on modern GPUs; and it looks "good enough" in practice. However, linear interpolation has a number of shortcomings. Perhaps the most obvious is smoothness. Trilinear interpolation is based on a piece-wise linear univariate function extended via tensor product along all three dimensions; this ensures a continuous approximation but introduces discontinuities in the first and higher order derivatives. While this could be mitigated by using filtering methods, there is additional memory overhead associated with such methods [Alim et al. 2010]. Linear interpolation also introduces visual artifacts when interpolating at lower resolutions. This is again related to the smoothness of the trilinear interpolant, but also has deeper roots in sampling and approximation theory.
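For reference in what follows, the trilinear scheme just described can be sketched in a few lines. This is a minimal software version (our own illustration; the function name and array layout are not from the paper) of the computation that GPU texture units perform in hardware with a single fetch:

```python
import numpy as np

def trilinear(volume, p):
    """Trilinearly interpolate a 3D array `volume` at a continuous point `p`.

    A reference sketch of the scheme GPU texture hardware implements;
    `volume` is indexed as volume[x, y, z].
    """
    i = np.floor(p).astype(int)   # corner voxel of the containing cell
    f = p - i                     # fractional offsets in [0, 1)
    acc = 0.0
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                # Tensor-product weight: product of three 1D hat weights.
                w = ((f[0] if dx else 1 - f[0]) *
                     (f[1] if dy else 1 - f[1]) *
                     (f[2] if dz else 1 - f[2]))
                acc += w * volume[i[0] + dx, i[1] + dy, i[2] + dz]
    return acc

# Interpolating a linear ramp reproduces it exactly.
vol = np.fromfunction(lambda x, y, z: x, (4, 4, 4))
assert abs(trilinear(vol, np.array([1.5, 0.5, 0.5])) - 1.5) < 1e-12
```

The eight weighted reads above are exactly the memory traffic that the later sections work to reduce.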
In short, the geometry and support of the chosen interpolant dictate how a reconstruction attenuates high-frequency content in the Fourier domain. This gives rise to different types of artifacts in data reconstruction [Csébfalvi and Rácz 2016]. Thus, there is value in exploring the design space of interpolants with the goal of minimizing error; in practice, one also needs to account for the computational costs associated with different interpolants.

Before moving further, we need to clarify what we mean by non-Cartesian computing. We use the term "non-Cartesian computing" to indicate a move away from tensor-product extensions. This can take the form of non-Cartesian lattices — take Figures 1 and 2 as examples that show the typical cubic Cartesian (CC) and square lattices in three and two dimensions, as well as more efficient non-Cartesian variants, namely the Body Centered Cubic (BCC) lattice and the Face Centered Cubic (FCC) lattice, both of which have received a good amount of attention in the field of scientific visualization. We may also use the term "non-Cartesian computing" to denote the move away from tensor-product interpolants to non-separable interpolants. In the Cartesian world, one typically uses tensor product B-splines as basis functions; however, non tensor-product splines exist, and have been investigated in the literature [Csébfalvi and Rácz 2016]. One example is the ZP-element, which requires fewer memory accesses per reconstruction than the cubic tensor-product B-spline but has similar approximation properties and provides smoother, more isotropic reconstructions [Entezari and Möller 2006]. When comparing different sampling lattices, one may argue that a true comparison for these lattices would be their Voronoi splines [Mirzargar and Entezari 2010], which, until this work, have yet to have a dedicated GPU implementation.

What do we gain by adding additional complexity to our interpolation and data processing schemes?
The seemingly benign difference in lattice and basis function changes the properties of the resulting approximation scheme. In fact, in 2D, an optimal hexagonal sampling scheme (Fig. 2) yields 14% better approximations when combined with an appropriate interpolant, whereas on the BCC lattice (Fig. 1) — the optimal sampling lattice in 3D — the improvement is around 30% [Entezari 2006]. Despite these advantages, the benefits of non-Cartesian computing have remained elusive. This is largely due to the complex algebro-geometric interplay between the lattices and the basis functions they are paired with. These basis functions are typically composed of box splines, which have intricate piece-wise polynomial structures that are challenging to evaluate in an efficient manner. Most treatments have dealt with evaluation and implementation issues in a somewhat ad hoc manner.
Fig. 1. Three trivariate integer lattices. In order from left to right we have the Cartesian (CC), Body Centered Cubic (BCC) and Face Centered Cubic (FCC) lattices. The different colored lattice sites correspond to the Cartesian cosets within each sampling lattice. The polytope in the center of each grid corresponds to the grid's respective Voronoi cell.
Our goal in this work is to make the advantage of non-Cartesian computing more tangible — explicitly, we want to make the use of alternative interpolants/lattices more practical and usable. Our main contribution is a holistic analysis of approximation spaces with a careful focus on fast implementation. We outline a framework for translating interpolation bases into fast interpolation schemes for piece-wise polynomial interpolants. We then show how to use this pipeline to create unified fast implementations of many of the interpolants in the non-Cartesian volumetric visualization literature. It is important to mention that we obtain a form of evaluation that is agnostic to platform. We could generate CPU code, GPU code, or Verilog from our intermediate representation. Our GPU implementation takes advantage of the in-built tensor product linear filtering hardware (i.e. linear texture fetches) on contemporary GPUs to reduce the amount of memory accesses needed for reconstruction; however, this does not always lead to optimal results.

We provide the theoretical analysis in this paper, and in the subsequent work we provide extensive implementation details. In our supplementary material we share the worksheet we use to automate the analysis as well as the code generation module, and pre-generated CUDA PTX code. To summarise, in this part, our contributions are as follows:

• We provide a unified framework of analyses for combinations of splines and lattices from the context of scientific visualization.
• We show how to use trilinear interpolation to reduce memory fetches and increase performance of our implementations.
• We provide implementations for many interpolants in the literature that have not yet had robust GPU implementations.
• We show the relative performance between GPU implementations, and show generality by moving towards a 4-dimensional example.

The remainder of this paper is organized as follows.
In Section 2, we review the literature in our native context, scientific visualisation, but we will also pay note to some of the work done in one dimension and in image processing. In Section 3, we provide the background necessary for our framework. In Section 4, we attack the problem at its root, the convolution sum. We manipulate the convolution sum until it emits a form suitable for fast evaluation — this can be thought of as unrolling the convolution sum loop. In Section 5 we generate code for various interpolants (box splines and Voronoi splines), some of which have not yet appeared in the literature. We then evaluate their performance on the CPU and GPU with respect to the application of volume rendering.
Interpolation is a fundamental operation in scientific visualization. Many visualization tasks (volumetric rendering, iso-surface extraction, flow visualization, etc.) rely on the interpolation of a continuous function from a discrete set of values. In one dimension, the family of B-splines are the prototypical interpolants — the linear B-spline is the least computationally expensive in this family, and has received the most attention in practice [Unser 1999]. However, it is well understood that B-splines are members of the maximal order minimal support (MOMS) family, which are the optimal interpolants for a given support [Blu et al. 2001]. This is not the only option though; there are other constraints one may wish to design around. For example: arithmetical complexity, total number of memory accesses (i.e. support) needed for a single point reconstruction, and smoothness are all parameters that can be tweaked. If one requires infinite differentiability, then CINAPACT splines are an appropriate interpolant candidate [Akram et al. 2015].

The design space for uni-variate interpolants is well understood. Appropriate choice of interpolant is not so clear in higher dimensions. For the domain of scientific visualization, the move to three dimensions allows practitioners the freedom to consider non-separable splines. In this case, it is clear that the tensor product B-splines are no longer MOMS splines; the generalized ZP-element shows this, since it has smaller support than (but the same approximation order as) the cubic B-spline [Entezari and Möller 2006].

There are multiple works that investigate non-separable basis functions on non-Cartesian lattices; box splines and weighted linear combinations of box splines often appear as popular candidates [Csebfalvi 2013; de Boor et al. 1993; Domonkos and Csébfalvi 2010]. However, it is fairly well known that evaluating a box spline numerically is quite difficult; the recursive form for box splines is unstable if naively implemented [de Boor 1993].
One can rectify this by ensuring that any decision made while evaluating a point on the spline's separating hyperplanes is consistent [Kobbelt 1997]. Even then, recursive box spline evaluation is unsuitable for use on the GPU — the conditional nature and large branching behaviour of the recursive form will lead to branch divergence and stall execution units on the GPU, leaving its resources underutilized. It is more convenient to work with either the Bernstein-Bézier (BB) form, or the explicit piece-wise polynomial (PP) form of a box spline, which has been characterized in closed form [Horacsek and Alim 2016]. Mathematically these two forms are equivalent; the BB form is simply one specific factorization of the PP form that lends itself to evaluation with De Casteljau's algorithm. Generally, a polynomial can be factorized in many different ways.

The BB form has been used in fast GPU evaluation schemes for box splines on non-Cartesian lattices. In the work of Kim et al. [Kim 2017], the symmetry of a box spline is used to create look-up tables of the BB polynomial coefficients. Since these coefficients are rational, they are stored as pairs of integers and any division occurs at run time. This is particularly convenient on the GPU, as it allows one to write one function for De Casteljau's algorithm, and each separate region of the spline to be evaluated is a set of integers that can be looked up based on an analysis of the regions of the spline's mesh at runtime. However, this is somewhat wasteful, since most regions within a spline's mesh are related by a rotation and/or a shift (i.e. an affine change of variables). As we will see in Section 4, this change of variables allows us to re-write the polynomial within the region of evaluation as a transformation followed by a polynomial evaluation of some chosen reference polynomial (provided the spline has an appropriate structure).
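To make the cost argument concrete: a Horner factorization evaluates a degree-n polynomial with n multiplies and n adds, roughly halving the operation count of naive monomial evaluation. A univariate sketch (the coefficients are hypothetical; the pipeline applies a multivariate analogue of this idea):

```python
def horner(coeffs, x):
    """Evaluate a_0*x^n + a_1*x^(n-1) + ... + a_n via Horner's scheme.

    `coeffs` lists coefficients from highest degree to lowest; the loop
    performs exactly one multiply and one add per coefficient.
    """
    acc = 0.0
    for a in coeffs:
        acc = acc * x + a
    return acc

# 2x^2 - 3x + 1 at x = 2 is 3.
assert horner([2.0, -3.0, 1.0], 2.0) == 3.0
```

Naive evaluation of the same polynomial computes each power of x separately, which costs roughly twice as many multiplications for high degrees.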
This reference polynomial can be evaluated in BB form; however, we choose to use a Horner factorization of the PP form — this allows us to reduce the amount of operations needed to calculate a given polynomial [Ceberio and Kreinovich 2004]. This is a generalization of the approach used in the work of Finkbeiner et al. [Finkbeiner et al. 2010].

On the BCC lattice, multiple box splines have been investigated for volumetric visualization. The linear and quintic box splines were among the first used by Entezari et al. [Entezari et al. 2008, 2009] for volumetric data visualization. These splines are particularly interesting as they mimic the geometry of the BCC lattice and they attain the same order of approximation as the trilinear and tricubic splines on the CC lattice, but with smaller support. Then there is the quartic box spline of Kim et al., which has even smaller support than the quintic box spline, but the same order of approximation [Kim 2013c]. Additionally, in that work, Kim et al. proposed a 12-direction box spline with tensor product structure; this box spline has a large support size, but is reasonably fast compared to the other proposed box splines due to its tensor product flavour — in our implementation, we call the 12-direction box spline the tricubic box spline on the BCC lattice.

The FCC lattice has not received as much attention from researchers as the BCC lattice; we are only aware of a few works that investigate box splines on the FCC lattice. In particular, Kim et al. have investigated a six direction box spline that respects the geometry of the FCC lattice [Kim 2013a; Kim et al. 2008a]. However, it is also true that the 12-direction box-spline [Kim 2013c] is usable on the FCC lattice. The proposed splines of Csebfalvi et al.
may also be generalized to the FCC lattice [Csebfalvi 2013; Domonkos and Csébfalvi 2010].

We are only aware of a few works that attempt the use of a non tensor-product box spline on the CC lattice for visualization. The work of Entezari et al. generalizes the Zwart-Powell (ZP) element of two dimensions to three dimensions. This also maintains the same order of approximation as the cubic tensor product spline, but with smaller support [Entezari and Möller 2006]. Even though this spline tends toward higher fidelity approximations, we are not currently aware of any GPU implementation of this spline, and we provide one in the supplementary material.

There is also the work of Csebfalvi et al., where a shifted and re-weighted box spline is designed so as to map easily to linear interpolation among the cosets of the BCC lattice [Csebfalvi 2013]. This method has the advantage of being extremely easy to implement on the GPU, has respectable reconstruction quality and runs at decent speeds compared to trilinear interpolation. There has also been some work investigating the use of splines designed for one lattice on another [Csébfalvi and Rácz 2016]. The use of direction vectors that do not correspond to principal lattice directions helps distribute frequency content more evenly in the Fourier domain.

Another intuitive idea is to look at the Voronoi cell of a lattice and convolve that with itself to obtain an interpolant; Figure 1 shows the CC, BCC and FCC lattices with their Voronoi cells. This produces a valid approximation scheme, but is quite expensive to compute [Mirzargar and Entezari 2010, 2011; Van De Ville et al. 2004]. Until this work, there was no GPU implementation available for these splines.
These fit within our pipeline, and we provide an implementation for the BCC and FCC lattices — Voronoi splines on the CC lattice correspond to the tensor product B-spline.

Finally, while not strictly related to spline evaluation, it is important to discuss quasi-interpolation, since an approximation space may not harness the full approximation power of a given basis function without proper pre-filtering. Most of the basis functions discussed so far are not interpolating. If a basis is stable (i.e. it forms a
Riesz basis) then it is possible to process the input data so that the resulting reconstruction interpolates data [Alim 2012]. However, this does not always yield the best reconstruction. Moreover, some bases cannot be made interpolating, so what is to be done in those cases? Quasi-interpolants ensure that a reconstruction harnesses the full approximation order of a space — in general, it is a good idea to prefilter data with a quasi-interpolant if a basis is not interpolating. This is related to an error kernel analysis, where the error of a reconstruction built from the filtered data is guaranteed to decay with the order of the basis [Alim 2012; Blu and Unser 1999]. While this is an important ingredient of a good reconstruction scheme, in this work we are not concerned with the approximation properties of a basis function, only with the speed at which it can be evaluated.
When we work with data that are uniformly sampled in one dimension, there is only one possible sampling scheme — take equally spaced samples along the independent axis. The straightforward extension of this to multiple dimensions is to sample evenly in all cardinal directions, i.e. a Cartesian sampling. However, the situation is generally more complex in higher dimensions.
Fig. 2. The square Cartesian (left), quincunx (middle) and hexagonal (right) lattices respectively. Notice the coset structure of the quincunx lattice is two interleaved Cartesian lattices — denoted by the difference in lattice site coloring. The same applies for the hexagonal lattice, although slightly distorted.
To move away from Cartesian lattices and account for this complexity, a function can be sampled according to an invertible matrix. This matrix — known as the generating matrix — represents a lattice structure. That is, if we have a real valued function $f(\mathbf{x})$ in some dimension $s$, we define the sampled version of that function with respect to a given lattice as $f_\mathbf{n} := f(hL\mathbf{n})$. Here, $h$ is a real valued scale parameter that controls the granularity or coarseness of the sampling pattern (i.e. the sampling rate), $\mathbf{n} \in \mathbb{Z}^s$ and $L$ is the generating matrix of the lattice. In any dimension, the $s \times s$ identity matrix generates a Cartesian sampling. In two dimensions, the hexagonal and quincunx lattices are generated by the matrices
$$L_{HEX} := \begin{bmatrix} 1 & \tfrac{1}{2} \\ 0 & \tfrac{\sqrt{3}}{2} \end{bmatrix} \quad \text{and} \quad L_{QC} := \begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix} \qquad (1)$$
respectively (Figure 2). In three dimensions, the BCC and FCC lattices are generated by
$$L_{BCC} := \begin{bmatrix} -1 & 1 & 1 \\ 1 & -1 & 1 \\ 1 & 1 & -1 \end{bmatrix} \quad \text{and} \quad L_{FCC} := \begin{bmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 1 \end{bmatrix} \qquad (2)$$
respectively (Figure 1). It is worth noting that a lattice may be generated by any number of different generating matrices — one may construct an equivalent lattice by carefully choosing vectors in $L\mathbb{Z}^s$. In general, finding a minimal basis for a given generating matrix is an NP-hard problem [Conway and Sloane 2013]; as such, it is difficult to find a relationship between matrices that generate the same lattice.

While lattice theory is quite deep and general, we only adhere to the case where $L \in \mathbb{Z}^{s \times s}$ and $L$ has full rank. This is because, in practice, we use lattices as array-like data structures and it is not possible to index via real valued numbers. As a consequence, it is not possible for us to directly represent functions on the hexagonal lattice within our framework. However, it is possible to stretch an integer lattice into another non-integer lattice; for example, in Figure 2, the hexagonal lattice can be seen as a distorted quincunx lattice.

Another fundamental object related to a lattice is its Voronoi region, which is the set of points closest to the lattice site situated at the origin (both Figure 1 and Figure 2 show each lattice's Voronoi region). The Voronoi region of a lattice embodies all the properties of a lattice — from it one may derive the shortest vector of a lattice (i.e. the shortest vector problem, which is NP-complete) as well as a set of shortest spanning vectors, the symmetry group of the lattice, packing efficiency, etc. [Conway and Sloane 2013; Viterbo and Biglieri 1996].
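The generating-matrix view translates directly into code. The sketch below (our own illustration, using one common choice of BCC generating matrix; any unimodular change of basis yields the same point set) enumerates BCC sites and checks the standard characterization that every site has all-even or all-odd coordinates:

```python
import itertools
import numpy as np

# One common generating matrix for the BCC lattice.
L_BCC = np.array([[-1,  1,  1],
                  [ 1, -1,  1],
                  [ 1,  1, -1]])

def lattice_sites(L, radius=2):
    """Enumerate sites L @ n for integer vectors n in [-radius, radius]^s."""
    s = L.shape[1]
    return {tuple(int(c) for c in L @ np.array(n))
            for n in itertools.product(range(-radius, radius + 1), repeat=s)}

sites = lattice_sites(L_BCC)
# BCC sites are exactly the integer points whose coordinates share a parity.
assert all(len({c % 2 for c in p}) == 1 for p in sites)
assert (0, 0, 0) in sites and (1, 1, 1) in sites
```

The same enumeration with the identity matrix reproduces Cartesian sampling, and scaling by h adjusts the sampling rate as in the definition above.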
Unsurprisingly, computing the Voronoi region of a lattice is NP-complete as well, but for reasonably specified bases in lower dimensions, the problem is tractable via the Diamond Cutting algorithm [Viterbo and Biglieri 1996].

To reconstruct a function on a given lattice, we choose a basis function $\varphi(\mathbf{x})$, shift it to each lattice site, and modulate it by the coefficient stored at that lattice site. The space of all functions spanned by the lattice shifts of $\varphi$ is defined as the set
$$V(\varphi, L, h) := \left\{ \sum_{\mathbf{n} \in L\mathbb{Z}^s} c_\mathbf{n}\, \varphi\!\left(\frac{\mathbf{x}}{h} - \mathbf{n}\right) : c \in \ell_2 \right\}, \qquad (3)$$
where $c$ is a real-valued coefficient sequence associated with the lattice. Obtaining a good approximation $\tilde{f}(\mathbf{x}) \in V(\varphi, L, h)$ to $f(\mathbf{x})$ equates to finding an appropriate set $\{c_\mathbf{n}\}$; this is highly dependent on the choice of $\varphi(\mathbf{x})$ and $L$. If we are so lucky that $\varphi(\mathbf{x})$ is interpolating on $L$ (i.e. $\varphi(\mathbf{0}) = 1$ and $\varphi(L\mathbf{n}) = 0$ for $\mathbf{n} \in \mathbb{Z}^s \setminus \{\mathbf{0}\}$), then we may choose $c_\mathbf{n} = f(hL\mathbf{n})$. However, in general, some preprocessing must be performed in order to achieve optimal error decay between the original function and its approximation. This is beyond the scope of this article, but the interested reader may refer to some of the related work [Alim 2012; Blu and Unser 1999; Unser 2000]. Since any asymptotic analysis is outside of the scope of this paper, without loss of generality we set $h = 1$.

As for the choice of $\varphi$, box splines have been investigated extensively in scientific visualization. Box splines have many equivalent definitions, but perhaps the most intuitive is the convolutional definition, where one successively convolves an indicator function along the $\boldsymbol{\xi}_i$ direction vectors [de Boor et al. 1993]. We collect $n$ of these vectors in an $s \times n$ matrix
$$\Xi := \begin{bmatrix} \boldsymbol{\xi}_1 & \boldsymbol{\xi}_2 & \cdots & \boldsymbol{\xi}_n \end{bmatrix}. \qquad (4)$$
In our case, we are interested only in matrices where $n \ge s$ and the dimension of the range of $\Xi$ is equal to $s$; we require these for a valid approximation scheme [de Boor et al. 1993]. With this, a box spline can be defined recursively as
$$M_\Xi(\mathbf{x}) := \int_0^1 M_{\Xi \setminus \{\boldsymbol{\xi}\}}(\mathbf{x} - t\boldsymbol{\xi})\, dt, \quad \text{for } n > s, \qquad (5)$$
where $\Xi \setminus \{\boldsymbol{\xi}\}$ is interpreted as removing a direction vector $\boldsymbol{\xi}$ from the matrix $\Xi$. When $n = s$, we define $M_\Xi(\mathbf{x})$ as
$$M_\Xi(\mathbf{x}) := \begin{cases} 1/|\det \Xi| & \text{if } \Xi^{-1}\mathbf{x} \in [0, 1)^s, \\ 0 & \text{otherwise}, \end{cases} \qquad (6)$$
which provides a base case for the recursion.

The box spline $M_\Xi(\mathbf{x})$ is compactly supported within a polytope defined by the Minkowski sum of the direction vectors in $\Xi$. We denote the support as $\operatorname{supp} M_\Xi$; it is given by
$$\operatorname{supp} M_\Xi = \left\{ \sum_{k=1}^{n} a_k \boldsymbol{\xi}_k \;:\; a_k \in [0, 1] \right\}. \qquad (7)$$
$M_\Xi$ is piecewise polynomial within its support. The polynomial pieces are delineated by all $(s-1)$-dimensional hyperplanes spanned by the direction vectors in $\Xi$. Denoting $\mathcal{H}$ as the set of all such hyperplanes, we can define a fine mesh that is formed by the lattice shifts of $M_\Xi$:
$$\Gamma(\Xi, L) := \bigcup_{H \in \mathcal{H}} H + L\mathbb{Z}^s. \qquad (8)$$
Each region of the mesh is a convex polytope that contains a polynomial; typically these are different polynomials, however, it is possible that some regions may share the same polynomial. In general, it is possible to derive the explicit piece-wise polynomial form for each of these regions [Horacsek and Alim 2016] — we will use this explicit form in Section 4 to determine a simpler representation of the convolution sum in Equation 3.

Our framework also accounts for Voronoi splines, which can be intuitively defined as successive convolutions of a lattice's indicator region with itself. More precisely, Voronoi splines are defined as
$$V_1(\mathbf{x}) := \chi_L(\mathbf{x}) \qquad (9)$$
$$V_n(\mathbf{x}) := (V_1 * V_{n-1})(\mathbf{x}) \qquad (10)$$
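The recursive definition in Equations (5) and (6) can be turned into a direct numerical evaluator by approximating the integral with quadrature. The sketch below is purely illustrative (our own function names and a simple midpoint rule); it exhibits exactly the recursive, branch-heavy behaviour that makes such evaluation slow and GPU-unfriendly, which is why this work uses explicit piece-wise polynomial forms instead:

```python
import numpy as np

def box_spline(Xi, x, quad=64):
    """Evaluate the box spline M_Xi at x via the convolution definition.

    Xi is an s x n direction matrix whose first s columns are linearly
    independent. Midpoint-rule quadrature approximates the integral in
    the recursion; this is slow and for illustration only.
    """
    s, n = Xi.shape
    if n == s:
        # Base case: normalized indicator of the parallelepiped spanned by Xi.
        y = np.linalg.solve(Xi, x)
        inside = np.all((y >= 0.0) & (y < 1.0))
        return 1.0 / abs(np.linalg.det(Xi)) if inside else 0.0
    # Recursive step: integrate along the last direction vector over [0, 1].
    xi = Xi[:, -1]
    ts = (np.arange(quad) + 0.5) / quad
    return sum(box_spline(Xi[:, :-1], x - t * xi, quad) for t in ts) / quad

# The directions [e1, e2, e1, e2] give the bilinear tensor-product B-spline,
# which peaks with value 1 at (1, 1).
Xi = np.array([[1, 0, 1, 0],
               [0, 1, 0, 1]], dtype=float)
assert abs(box_spline(Xi, np.array([1.0, 1.0])) - 1.0) < 1e-9
```

Each additional direction vector multiplies the quadrature cost by `quad`, which makes clear why closed-form polynomial evaluation is preferable for the splines considered here.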
Fig. 3. Lattices are broken into their Cartesian coset structure, then stored in separate $s$-dimensional arrays. On the GPU, this translates to a collection of textures. For volumetric data, we decompose the lattice into separate volumetric textures.

Here, $*$ denotes convolution, and $\chi_L(\mathbf{x}) = 1/|\det L|$ if $\mathbf{x}$ is in the Voronoi region of the lattice $L$, and 0 otherwise. Voronoi splines are also piece-wise polynomial, and their piece-wise polynomial form may be obtained by first redefining $V_1(\mathbf{x})$ as the sum of a collection of constant valued box splines; higher order $V_n(\mathbf{x})$ may be obtained by the convolution definition [Mirzargar and Entezari 2010]. Unrolling the recursion, we obtain $V_n(\mathbf{x}) = V_1(\mathbf{x})^{*n}$, where $*n$ denotes $n$ successive convolutions. If $V_1(\mathbf{x})$ is the sum of box-splines, then one may use the multinomial theorem to expand $V_1(\mathbf{x})^{*n}$ into the sum of many box-spline convolutions; convolving a box-spline with another box-spline simply produces a (possibly shifted) box-spline. Thus, by summing all these box-splines, we arrive at the final piece-wise polynomial form. However, these calculations are not trivial. For the BCC lattice, $V_1(\mathbf{x})$ is the sum of 16 box-splines, $V_2(\mathbf{x})$ (analogous to the tri-linear B-spline on the CC lattice) is the sum of 136 box-splines, $V_3(\mathbf{x})$ is the sum of 816, and $V_4(\mathbf{x})$ the sum of 3876 [Mirzargar and Entezari 2010]. While these numbers are not intractably large, keep in mind that one must compute a box-spline's polynomial representation at every iteration, which can take a non-trivial amount of time; on the order of a few dozen minutes for the splines that sum to $V_4(\mathbf{x})$. Spread across 4 CPU cores, this took us about a month to compute.

The convolution sum in Equation 3 is the focus of this section. There are two important parts to evaluating a function of this form; the first is the representation of the coefficient set $c_\mathbf{n}$ (i.e. the memory layout), the second is the evaluation of the basis function $\varphi$.
Concerning $c_\mathbf{n}$, we choose a specific representation in which each $c_\mathbf{n}$ corresponds to a fetch from an $s$-dimensional array. Thus, for any lattice, we choose to treat it as a union of shifted Cartesian lattices (also known as a Cartesian coset decomposition). This allows us to use texture memory on GPUs to store each coset, as shown in Figure 3. Provided we combine all reads from a given coset, this also allows us to take advantage of data locality (i.e. cache). It also allows us to use trilinear fetches as building blocks for our interpolation schemes — that is, we attempt to write our interpolation schemes as weighted sums of trilinear texture fetches on the different cosets of the lattice. This does not necessarily reduce computation per evaluation, but rather reduces the number of texture fetches needed to evaluate the convolution sum, and therefore reduces bandwidth.

With the memory layout solidified, we turn to the basis function — we apply a series of manipulations to "unroll" the convolution sum. We build some insight into how to transform $\varphi$ into a more convenient form for evaluation with some running examples. We consider the linear tensor product spline and the Zwart-Powell (ZP) element on the two-dimensional Cartesian lattice. These have direction matrices
$$\Xi_{TP2} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad \Xi_{ZP} = \begin{bmatrix} 1 & 0 & 1 & -1 \\ 0 & 1 & 1 & 1 \end{bmatrix}$$
respectively. On the quincunx lattice, we use a tensor product style spline defined by
$$\Xi_{QC} = \begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix}.$$
These box splines are shown in Figure 4. While this is nowhere near an exhaustive list of interpolants, the examples we provide are simple enough to understand in 2D and show the intricacies of the procedure.

Fig. 4. Splines used as running examples; from left to right are box splines generated by $\Xi_{TP2}$, $\Xi_{ZP}$ and $\Xi_{QC}$ respectively.
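Before any unrolling, it helps to see the convolution sum of Equation (3) in its naive form. The sketch below (our own, in 1D with the linear B-spline on the integer lattice, where the interpolating property lets $c_\mathbf{n}$ be read directly from the samples) is the loop that the manipulations in this section work to eliminate:

```python
def hat(t):
    """Centered linear B-spline; interpolating on the integer lattice."""
    return max(0.0, 1.0 - abs(t))

def reconstruct(samples, x):
    """Naive convolution sum: f~(x) = sum_n c_n * phi(x - n), with h = 1,
    L = I and c_n = f(n) since the hat function is interpolating."""
    return sum(c * hat(x - n) for n, c in enumerate(samples))

data = [0.0, 2.0, 3.0, 1.0]
assert reconstruct(data, 2.0) == 3.0   # reproduces samples at lattice sites
assert reconstruct(data, 1.5) == 2.5   # linear in between
```

Note that the loop touches every coefficient even though the hat function is nonzero at only two of them; exploiting that compact support is precisely what the following manipulations formalize.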
We start by reiterating the convolution sum in Equation (3):

𝑓(x) := ∑_{n ∈ 𝐿Z^𝑠} 𝑐_n 𝜑(x − n).   (11)

At a high level, our methodology consists of incrementally applying a series of simple algebraic manipulations to this sum so that it can be more effectively mapped to an implementation, be it CPU or GPU. Along the way, we pause and discuss the effects of certain choices and properties of 𝜑 on an evaluation scheme.

The first manipulation we apply is one that has been used to derive fast interpolants on the BCC and FCC lattices [Csebfalvi 2013; Domonkos and Csébfalvi 2010]. In particular, these interpolation schemes take advantage of the fact that an integer lattice can be decomposed into a sum of sub-lattices. We build upon this observation by noting that we may decompose the convolution sum over any coset structure of the lattice. This decomposition is independent of our data representation. For example, one may take a cubic Cartesian lattice and decompose it into two shifted BCC lattices. Within our framework, on the GPU each of these two BCC lattices will in turn have two 3D textures associated with it.

Formally, if 𝐺 is an integer subsampling matrix that yields a sub-lattice of 𝐿, then there exist 𝑀 := |det 𝐺| integer vectors l_0, l_1, ⋯, l_{𝑀−1} such that 𝐿Z^𝑠 = ∪_{𝑘=0}^{𝑀−1} (𝐺𝐿Z^𝑠 + l_𝑘). Note that the lattice sites 𝐺𝐿Z^𝑠 + l_𝑘 constitute the coset corresponding to the shift vector l_𝑘. Without loss of generality, we take l_0 to be the zero vector of Z^𝑠. We can now write the convolution sum as

∑_{n ∈ 𝐿Z^𝑠} 𝑐_n 𝜑(x − n) = ∑_{𝑘=0}^{𝑀−1} ∑_{m ∈ 𝐺𝐿Z^𝑠 + l_𝑘} 𝑐_m 𝜑(x − m),   (12)

which decomposes the sum over the coset structure of the lattice induced by 𝐺. Figure 2 shows the Cartesian coset structure of the quincunx lattice. In three dimensions, the BCC lattice exhibits similar behaviour to this; it consists of two interleaved Cartesian grids where one of the Cartesian grids is shifted by the vector (1, 1, 1).
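To make the decomposition in Equation (12) concrete, the sketch below (our own toy helper, not the paper's code) enumerates the coset representatives l_𝑘 for the quincunx lattice with generating matrix [1 1; 1 −1], using a subsampling matrix 𝐺 chosen so that 𝐺𝐿 is diagonal:

```python
import numpy as np

L = np.array([[1, 1], [1, -1]])        # quincunx generating matrix
G = np.array([[1, 1], [1, -1]])        # subsampling: G @ L = diag(2, 2)
assert np.array_equal(G @ L, 2 * np.eye(2, dtype=int))

M = abs(round(np.linalg.det(G)))       # number of cosets, M = |det G| = 2

# Enumerate quincunx sites n = L @ z in a window and bucket them modulo
# the Cartesian sub-lattice (GL)Z^2 = (2Z)^2; residues identify cosets.
reps = set()
for i in range(-4, 5):
    for j in range(-4, 5):
        n = L @ np.array([i, j])
        reps.add(tuple(int(v) for v in n % 2))

print(M, sorted(reps))  # 2 [(0, 0), (1, 1)]
```

The two residues (0, 0) and (1, 1) are exactly the shift vectors l_0 and l_1 of the quincunx lattice's two Cartesian cosets.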
The FCC lattice is similarly decomposed into Cartesian cosets, as shown in Figure 1. The matrix 𝐺 is a parameter choice that must be made depending on what is most appropriate for the device on which we are implementing an interpolation scheme. Currently, for evaluation on the GPU, a reasonable choice is a 𝐺 such that 𝐺𝐿 = 𝐷 where 𝐷 is a diagonal matrix. In other words, 𝐺 yields a Cartesian coset structure, and we may store the lattice as a collection of Cartesian lattices, specifically 3D textures (see Figure 3). However, it is not always strictly advantageous to do so; if 𝜑 does not satisfy partition of unity on the sub-lattice generated by 𝐺𝐿, then this complicates the geometric decomposition of the spline. We will revisit this when we discuss the sub-regions of evaluation of a spline. While an integer invertible 𝐺 will still produce a valid evaluation scheme, in practice 𝐺 must be chosen so that it both produces a simple sub-lattice 𝐺𝐿 and respects the geometry of 𝜑.

Next, we note that in all practical instances we have a compact basis function 𝜑: either we choose 𝜑 to be compact, or the finite nature of our data implies a bound on the support of 𝜑. Thus we change our perspective and rewrite the convolution sum over the support of 𝜑 shifted to the evaluation point x. From this perspective, any lattice site that falls within the support of this function will contribute to the reconstruction. To simplify notation, we define the set

𝐶_𝑘(x) := supp 𝜑(· − x) ∩ (𝐺𝐿Z^𝑠 + l_𝑘).   (13)

As such, 𝐶_𝑘(x) is the set of lattice sites on coset 𝑘 that contribute to the reconstruction at point x. Therefore, we may write

∑_{n ∈ 𝐿Z^𝑠} 𝑐_n 𝜑(x − n) = ∑_{𝑘=0}^{𝑀−1} ∑_{m ∈ 𝐶_𝑘(x)} 𝑐_m 𝜑(x − m).   (14)

So far, we have not strictly required 𝜑 to be compact, but now we impose the following constraints on 𝜑: it satisfies partition of unity on 𝐿, and it is piece-wise polynomial with compact polyhedral support.
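As an illustration of Equation (13), the following sketch (a hypothetical helper of our own) collects the contributing sites 𝐶_𝑘(x) on each Cartesian coset for a basis whose support is an axis-aligned box of half-width `radius`:

```python
import numpy as np
from itertools import product

def contributing_sites(x, coset_shift, step, radius):
    """C_k(x): sites of the Cartesian coset (step*Z^s + coset_shift) falling
    inside an axis-aligned support box of half-width `radius` centered at x."""
    x = np.asarray(x, dtype=float)
    lo = np.ceil((x - radius - coset_shift) / step).astype(int)
    hi = np.floor((x + radius - coset_shift) / step).astype(int)
    return [tuple(int(v) for v in coset_shift + step * np.array(idx))
            for idx in product(*[range(l, h + 1) for l, h in zip(lo, hi)])]

# Quincunx example: cosets 2Z^2 and 2Z^2 + (1,1), support half-width 1.5.
c0 = contributing_sites((0.25, 0.25), np.array([0, 0]), 2, 1.5)
c1 = contributing_sites((0.25, 0.25), np.array([1, 1]), 2, 1.5)
print(len(c0), len(c1))  # 1 4
```

The sum in Equation (14) then runs over exactly these sites, coset by coset.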
Partition of unity, the ability for a reconstruction space to reproduce a constant function, is a basic requirement for a valid approximation scheme, along with polynomial decay in the Fourier domain [Strang and Fix 2011]. Compactness is generally not restrictive; almost all 𝜑 used in practice are compact. The polynomial restriction is slightly restrictive, as it does not allow us to consider the class of exponential box splines [Ron 1988], the CINAPACT splines [Akram et al. 2015], or the cosine weighted spline [Csebfalvi 2013] (which is an exponential box spline).

From Equation (14), it is clear how the support of 𝜑 affects the inner summation of this equation: if there are more lattice sites within the support of 𝜑, then the inner summation will run over more terms. The effect this has on performance depends on the underlying architecture of the machine and the complexity of 𝜑. If we can commit memory reads for the 𝑐_m while beginning to compute parts of 𝜑 that do not depend on the coefficients of the summation, we can effectively hide the computation of 𝜑 in the latency of the memory fetches. Latency hiding of this form is common in GPU compute applications. Moreover, the out-of-order execution units on modern CPUs allow for this behaviour without explicitly coding for it. However, if memory accesses are not an issue (i.e. if we are reading from fast memory, such as cache), then the cost of reconstruction will be dominated by the complexity of 𝜑, and it would be advantageous to choose 𝜑 with larger support but lower evaluation time complexity.

We now shift our focus to the geometry of 𝜑. Without loss of generality, we limit our discussion to 𝐶_0(x), since all other cases are shifted versions of this base case. Let us choose some x such that there exists some 𝜖-neighborhood around x for which 𝐶_0(x) remains constant. Then there is some larger region 𝑆 containing x for which 𝐶_0(x) does not change. That is, if we pick any x, y ∈ 𝑆 then 𝐶_0(x) = 𝐶_0(y).
The closure of these regions tessellates space; Figure 5 shows examples of this. For box splines, these correspond to the regions within the finer mesh Γ(Ξ, 𝐺𝐿) of the box spline on the lattice 𝐺𝐿 [de Boor et al. 1993]. In this work, we refer to these as the sub-regions of the spline.

Fig. 5. Sub-regions and regions of evaluation on the Cartesian lattice for the tensor product box splines defined by Ξ_TP𝑛 and the ZP element defined by Ξ_ZP. The equivalence classes for the sub-regions of evaluation are denoted by the hue of the region. The sub-regions are collected near the origin to create the region of evaluation.

Fig. 6. Sub-regions and regions of evaluation on the quincunx lattice for the tensor product box spline with no coset matrix, and with coset matrix 𝐺 = 𝐿_QC^𝑇. At the origin are the representative sub-regions of evaluation. Again, the equivalence classes for the sub-regions of evaluation are denoted by the hue of the region. In this figure, note the region of evaluation for the example on the left; shifting a point of evaluation to this region requires more logic than the case in which we use an appropriate coset decomposition, as seen on the right.
Note the periodic nature of the sub-regions. They naturally fit into equivalence classes under the following definition: we say that two mesh regions 𝑃 and 𝑄 belong to the same equivalence class if for any x ∈ 𝑃, there exists some lattice shift k ∈ 𝐺𝐿Z^𝑠 such that x + k ∈ 𝑄. Note that there are only finitely many equivalence classes, say 𝑁 of them, and we denote them as 𝑆_0, 𝑆_1, ⋯, 𝑆_{𝑁−1}. We will call the choice of representative mesh regions sub-regions of evaluation. We define a region of evaluation as a convex polyhedral collection of sub-regions with hyper-volume |det 𝐺𝐿|. Note that, since we have hyper-volume |det 𝐺𝐿|, a region of evaluation will tessellate space with one convex polyhedron about the lattice 𝐺𝐿. Additionally, while the sub-region equivalence classes are determined by the (decomposed) lattice and 𝜑, there are multiple possible regions of evaluation for a given 𝐺𝐿 (see Figure 5 and Figure 6). We choose all our regions of evaluation so that they contain or touch the origin; we call the choice of such a region 𝑅.

The reason we focus on a single region of evaluation 𝑅 is that it allows us to exploit the shift invariant nature of the approximation space. That is, say we derive a fast evaluation function 𝜁(x) with 𝜁(x) = ∑_{𝑘=0}^{𝑀−1} ∑_{m ∈ 𝐶_𝑘(x)} 𝑐_m 𝜑(x − m) for x ∈ 𝑅 (but perhaps not outside 𝑅). The function 𝜁 necessarily depends on some subset of coefficients {𝑐_i : i ∈ 𝐴} for some fixed 𝐴 ⊂ Z^𝑠. Since the region of evaluation tessellates space, for any x ∈ R^𝑠 there exists some k ∈ 𝐺𝐿Z^𝑠 such that x − k ∈ 𝑅. Thus, we can derive a fast evaluation scheme for all points by shifting the evaluation to 𝜁(x − k) and shifting the coefficients {𝑐_{i+k} : i ∈ 𝐴}; this is a consequence of the "shift invariance" property of shift invariant spaces.
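For the common Cartesian case 𝐺𝐿 = diag(ℎ, …, ℎ) with the region of evaluation taken as [0, ℎ)^𝑠 touching the origin, the shift k above is just a componentwise floor. A minimal sketch under that assumption (our own helper, not the paper's code):

```python
import numpy as np

def roe_shift(x, h=2):
    """The (GL)Z^s shift k with x - k in the region R = [0, h)^s,
    assuming GL = diag(h, ..., h) and R anchored at the origin."""
    x = np.asarray(x, dtype=float)
    return h * np.floor(x / h)

x = np.array([3.7, -1.2, 0.4])
k = roe_shift(x)
print(k, x - k)  # k lies in (2Z)^3, and x - k lies in [0, 2)^3
```

For non-diagonal 𝐺𝐿 the same idea applies in lattice coordinates: transform x by (𝐺𝐿)^{−1}, floor, and transform back.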
We define the function 𝜌 : R^𝑠 → 𝐺𝐿Z^𝑠 such that

x ∈ R^𝑠 ⟹ x − 𝜌(x) ∈ 𝑅   (15)

to formalize this notion, and assert that this must exist if 𝑅 tessellates space and has hyper-volume equal to |det 𝐺𝐿|. Thanks to this shift invariance, we may focus solely on the region of evaluation. To each representative sub-region of evaluation, we associate a function 𝜓_𝑗(y, k) that is defined as follows:

𝜓_𝑗(y, k) := ∑_{n ∈ 𝐶_0(𝑆_𝑗)} 𝑐_{n+k} 𝜑(y − n) for y ∈ 𝑆_𝑗.   (16)

For our piece-wise polynomial splines, we may explicitly construct the polynomial representation of 𝜓_𝑗(y, k). We use a greedy Horner scheme to optimize the evaluation of 𝜓_𝑗(y, k); computation on the GPU is relatively cheap, at least compared to bandwidth access.

We combine all these sub-region functions into a single "approximation" function, Ψ(x, l), with the definition

Ψ(x, l) := 𝜓_0(x − 𝜌(x), 𝜌(x) + l) if x − 𝜌(x) ∈ 𝑆_0,
           𝜓_1(x − 𝜌(x), 𝜌(x) + l) if x − 𝜌(x) ∈ 𝑆_1,
           ⋮
           𝜓_{𝑁−1}(x − 𝜌(x), 𝜌(x) + l) if x − 𝜌(x) ∈ 𝑆_{𝑁−1}.   (17)

We can finally write the convolution sum per coset as

∑_{n ∈ 𝐶_𝑘(x)} 𝑐_n 𝜑(x − n) = Ψ(x − l_𝑘, l_𝑘).   (18)

To obtain the final approximation, one only needs to sum this over all cosets.

While this is an optimized form, it still contains multiple branches: we have different cases for each sub-region. Modern CPUs have been heavily optimized to efficiently execute branch heavy code, whereas GPUs have not. The conditional behaviour of Ψ(x, l) is a major problem for GPU implementations. However, it is possible to use the symmetry of 𝜑 to eliminate branches. A similar idea has been used to derive fast interpolation schemes on non-Cartesian lattices [Kim and Peters 2009]. The approach taken in other works has been to encode each different polynomial in BB form, then use the sub-regions to determine which coefficient sequence to use in de Casteljau's algorithm [Kim and Peters 2009].
In this work, we use a slightly more efficient idea: we use the symmetry of the basis function to encode a change of variable relationship between sub-regions. This allows us to rely on a small set of polynomials, rather than having one for each sub-region. We first discuss how to determine which sub-region a point lies within without any branches. We then discuss how to eliminate branches in Equation 17.

Suppose we have some x ∈ 𝑅. Since 𝑅 is a convex polyhedron, and is the closed union of finitely many convex polyhedra, we know there is a finite set of planes that divide the region of evaluation into the sub-regions of evaluation. We will say we have 𝑄 of these planes, and we will call the set of planes 𝑃 = {(p_0, 𝑑_0), (p_1, 𝑑_1), ⋯, (p_{𝑄−1}, 𝑑_{𝑄−1})}. Here p_𝑖 is the normalized orientation of the plane, and 𝑑_𝑖 is the distance from the origin; explicitly, these planes are the planes that "cut" the region of evaluation into sub-regions. Thus, if we have some point x ∈ 𝑅, we can construct an integer 𝑞 in the range [0, 2^𝑄) by assigning its 𝑖th bit a value of 1 if x is on the left side of the plane p_𝑖, and 0 otherwise. This allows us to associate an integer with every single sub-region of evaluation.

Fig. 7. Two different reconstructions within the region of evaluation. Each reconstruction point is denoted by an "x". Each reconstructed point lies within a different sub-region of evaluation 𝑆_0 and 𝑆_1. The lattice sites that contribute to that reconstruction are coloured with the region's respective color.

However, this does not map onto the entire range [0, 2^𝑄); that is, it is possible that there are more integers in the range [0, 2^𝑄) than there are sub-regions. Further, if 𝑄 is large, then 𝑞 may be vastly bigger than 𝑁.
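The bitmask construction, together with the modular compression described next, can be sketched as follows. The plane list, the modulus r, and the σ table below are illustrative placeholders of our own, not values from the paper:

```python
import numpy as np

# Illustrative planes cutting a 2D region of evaluation into 4 quadrant-like
# sub-regions: x = 0.5 and y = 0.5 (unit normals, distances from the origin).
planes = [(np.array([1.0, 0.0]), 0.5),
          (np.array([0.0, 1.0]), 0.5)]

r = 4                  # chosen so every sub-region maps to a unique integer
sigma = [0, 1, 2, 3]   # lookup table: q % r  ->  sub-region index

def subregion_index(x):
    q = 0
    for i, (p, d) in enumerate(planes):
        q |= int(np.dot(x, p) - d >= 0) << i   # bit i: which side of plane i
    return sigma[q % r]

print([subregion_index(np.array(v)) for v in
       [(0.2, 0.2), (0.8, 0.2), (0.2, 0.8), (0.8, 0.8)]])  # [0, 1, 2, 3]
```

In generated GPU code the plane tests compile to predicated compares, so the whole index computation is branch free.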
Therefore, we first compress the range [0, 2^𝑄) down to some range [0, 𝑟); that is, we first construct 𝑞 as above, then take its remainder upon division by 𝑟 (i.e. 𝑞 mod 𝑟) where 𝑟 is chosen so that each sub-region maps to a unique integer in [0, 𝑟). To find such an 𝑟, we simply perform a brute force search. We then create another (surjective) map 𝜎 : {0, 1, ⋯, 𝑟−1} → {0, 1, ⋯, 𝑁−1} that takes 𝑞 mod 𝑟 to a sub-region's index. This is realized by an array of 𝑟 entries. On a GPU, we store this in fast constant or shared memory. On a CPU, this will hopefully be cached very quickly. Thus, to determine this index we only need 𝑄 plane comparisons, one remainder operation and one fast memory lookup (the conditional behaviour of the plane comparison is done at the instruction predicate level, and does not cause any branching).

For most of the cases we consider, the number of planes 𝑄 turns out to be quite small. The largest 𝑄 we observe for box splines in 3D is 9, which corresponds to the quartic BCC spline [Kim 2013c], and seems to have little effect on reconstruction speed. We did, however, notice that for all orders of the Voronoi BCC spline, 𝑄 is prohibitively large, 𝑄 >
32. While this does not pose a problem for the theory of this method, if we did not impose the compression of the range [0, 2^𝑄) to [0, 𝑟), the mapping to sub-regions would consume over 4 GB of memory. We can additionally apply this procedure recursively. That is, we determine which octant a reconstruction lies within, then transform it to the positive octant; this approach only works provided the spline has the necessary reflective symmetry, which the BCC Voronoi spline does indeed have, and it allows us to reduce our lookup tables further. Note that this does introduce some overhead, as one needs to incorporate another transform in the generated code.

To build the intuition for this optimization, we begin with an example. Say we have two regions 𝑆_0 and 𝑆_1 that "appear" similar; Figure 7 shows an example for two such sub-regions. Intuitively, we can see that 𝑆_1 is simply a rotation (or a reflection) of 𝑆_0. One could easily model this with a rotation matrix 𝑇 and, to be more general, a translation vector t. That is, we would want 𝜓_0(x, k) = 𝜓_1(𝑇x − t, k). If this were true, we could store 𝑇 and t in a lookup table and implement the piece-wise structure in Equation 17 as a lookup instead of a branch, by first looking up 𝑇 and t, applying them to x, then evaluating a single 𝜓. The situation, however, is slightly more complicated, as we have failed to match the coefficients between 𝜓_0 and 𝜓_1. To make this consistent, we need some "renaming" map 𝜋 that renames the 𝑐_n to be consistent with the transformation 𝑇 and shift t. That is, we desire 𝑇, t, and 𝜋 such that

𝜓_𝜋(𝑇x − t, k) := ∑_{n ∈ 𝐶_0(𝑆_1)} 𝑐_{𝜋(n)+k} 𝜑(𝑇x − n − t) = ∑_{n ∈ 𝐶_0(𝑆_0)} 𝑐_{n+k} 𝜑(x − n) = 𝜓_0(x, k).

If we find such
𝑇, t and 𝜋, then we know for any x ∈ 𝑆_1 that 𝑇^{−1}(x + t) ∈ 𝑆_0, and by renaming the memory look-ups, we can use 𝜓_{𝜋^{−1}}(𝑇^{−1}(x + t), k) to evaluate the sub-region's function. Here, 𝜋 must be an invertible map, and 𝑇 must be an invertible rigid body transformation.

To find such parameters, we first choose a region 𝑆_0 as our reference sub-region and search for 𝑇, t and 𝜋 for every other sub-region 𝑆_𝑖. The shift t is chosen to be the zero vector, or the center of mass of the sub-region; 𝑇 is chosen from the symmetry group of the spline. We implement this as a combinatorial brute force search. That is, we enumerate all possible pairs of t and 𝑇 and check whether the combination yields a valid transformation. Here, "valid" means that there exists a 𝜋 such that the above simplification holds. For splines with polynomial sub-regions, we can associate a polynomial with each coefficient lookup 𝑐_m in the reference 𝜓_0; we then apply the transformation and determine the polynomial associated with each 𝑐_m of the transformed sub-region evaluation function. This forms a bipartite graph where the untransformed coefficients are coloured in, say, blue, and the transformed coefficients are coloured in red; we assign an edge between two points in those sets if their associated polynomials are identical. If there exists a perfect matching between the two sets, then we say the transformation is "valid". Moreover, the same perfect matching yields the function 𝜋. Perhaps unsurprisingly, 𝜋 is a rigid body transformation in all the cases we consider in this work; that is, 𝜋(n) = 𝑇n + t.

In general, there may be instances where it is not possible to find 𝑇, t and 𝜋 for two sub-regions 𝑆_𝑖, 𝑆_𝑗, 𝑖 ≠ 𝑗. This is the case for the 6 direction box spline on the FCC lattice; there are sub-regions with vastly different geometry [Kim 2013b]. In general, we may have 𝐾 such unique regions that are not re-writeable in terms of one another, and we collect them in the set {𝜓_{𝜋0}, 𝜓_{𝜋1}, ⋯, 𝜓_{𝜋(𝐾−1)}}.
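The validity check above is a perfect-matching problem on a small bipartite graph. A toy version, with symbolic polynomials replaced by hashable tags and entirely made-up data of our own, looks like this:

```python
from itertools import permutations

def find_renaming(ref_polys, transformed_polys):
    """Brute-force search for a bijection pi matching each reference
    coefficient to a transformed coefficient with an identical polynomial.
    Both inputs map coefficient index -> polynomial tag."""
    ref = list(ref_polys.items())
    for perm in permutations(transformed_polys.items()):
        if all(rp == tp for (_, rp), (_, tp) in zip(ref, perm)):
            return {rk: tk for (rk, _), (tk, _) in zip(ref, perm)}
    return None  # no valid pi: the candidate (T, t) is rejected

# Toy example: polynomials tagged by strings; a valid renaming exists.
ref = {(0, 0): "p0", (1, 0): "p1", (0, 1): "p2"}
xfm = {(0, 0): "p1", (1, 1): "p2", (2, 0): "p0"}
print(find_renaming(ref, xfm))  # {(0, 0): (2, 0), (1, 0): (0, 0), (0, 1): (1, 1)}
```

In practice the polynomials are compared symbolically after applying the candidate (𝑇, t), and a dedicated matching algorithm would replace the factorial-time permutation search for larger coefficient sets.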
To handle these cases, we use branch predication: we calculate the polynomial within all sub-regions and use a predicate operator to choose the correct region's result at the end of the calculation. This performs unnecessary computation, but avoids the heavy overhead of branching on the GPU [Kim et al. 2013]. Algorithm 1 summarizes how to use the techniques we have discussed to perform point-wise evaluation.

While the above form is appropriate for execution on the GPU, each call to 𝜓_{𝜋𝑖} incurs a heavy amount of memory accesses for larger splines. All contemporary GPUs include hardware accelerated linear texture fetches. A naive implementation of the previous sections would use nearest neighbor interpolation to facilitate the coefficient fetches; however, trilinear texture fetches are only a small constant factor slower on our hardware, and combine up to 8 texture fetches in one instruction.

In general, it is not always possible to use trilinear interpolation to reduce 8 texture fetches into 1; this is only possible when the basis is separable. However, it is possible to rewrite two memory fetches (that reside on adjacent points of a given Cartesian coset) as a single tri-linear fetch. This offloads some of the reconstruction bandwidth pressure, but may require more computation. Fortunately, on GPUs, computation is relatively cheap compared to bandwidth.

To demonstrate this, let us fix a single 𝜓_{𝜋𝑖}, and suppose that we have two lattice sites a and b that are adjacent on the same Cartesian coset. Then we attempt to find two functions 𝑡(x) and 𝑔(x) such that

𝑔(x) · ((1 − 𝑡(x)) 𝑐_a + 𝑡(x) 𝑐_b) = 𝑐_a 𝜑(x − a) + 𝑐_b 𝜑(x − b).   (19)

Algorithm 1
Branch free evaluation at a point.

procedure Eval(x)
    𝑓 ← 0
    for l ∈ {l_0, l_1, ⋯, l_{𝑀−1}} do
        k ← 𝜌(x − l)                        ⊲ Determine the shift to the ROE
        x′ ← x − k − l                      ⊲ Shift into the ROE
        𝑞 ← 0
        for 𝑖 ∈ {0, 1, ⋯, 𝑄−1} do           ⊲ Determine the BSP index
            𝑞 ← (x′ · p_𝑖 − 𝑑_𝑖 < 0) ? 𝑞 : 𝑞 | 2^𝑖
        end for
        SubRegionIndex ← 𝜎(𝑞 % 𝑟)           ⊲ Look up the sub-region index
        𝑇′ ← 𝑇[SubRegionIndex]
        t′ ← −𝑇′ t[SubRegionIndex]
        𝜋′ ← 𝜋[SubRegionIndex]
        𝑔 ← 0
        for 𝑖 ∈ {0, 1, ⋯, 𝐾−1} do
            𝑣 ← 𝜓_{𝜋′𝑖}(𝑇′x′ − t′, k + l)
            𝑔 ← (PsiIndex[SubRegionIndex] == 𝑖) ? 𝑣 : 𝑔
        end for
        𝑓 ← 𝑓 + 𝑔                           ⊲ Add the contribution for this coset
    end for
    return 𝑓
end procedure

Fig. 8. Replacing two texture fetches a and b on the quincunx lattice with a single linear texture fetch. There is a small amount of arithmetic involved in translating a to a′, but it is negligible. Additionally, texels are often shifted slightly to make the hardware design easier; they are often shifted by 0.5 in the cardinal directions.

The term ((1 − 𝑡(x)) 𝑐_a + 𝑡(x) 𝑐_b) is exactly linear interpolation; this can be easily translated into a single texture fetch instruction (and some additional computation for 𝑡(x) and 𝑔(x)). Explicitly, the solutions 𝑔(x) and 𝑡(x) are given by

𝑔(x) = 𝜑(x − a) + 𝜑(x − b) and 𝑡(x) = 𝜑(x − b) / 𝑔(x),

which is easily verified. Note that we require |𝑔(x)| > 0 for all x within a given sub-region. Since we wish to use Equation (19) to replace two fetches in the convolution
sum (3), we know that the basis shifts associated with two coefficients in that sum must contribute to the final sum over the entire sub-region; they cannot be identically zero. The only other troublesome case is if 𝜑(x − a) = −𝜑(x − b) for some x in the sub-region under investigation. For all the cases we consider, our basis functions are strictly positive, so we need not worry about this; however, for some interpolating bases, this may present a problem. If we are working with volumetric data, with a′ and b′ defined as the texture coordinates of the lattice sites a and b in GPU memory, then we can write Equation 19 as

𝑔(x) · TEX3D(tex_coset, a′ + 𝑡(x)(b′ − a′))

where TEX3D is the linear texture fetch built into most GPUs. This is demonstrated in Figure 8. It is also possible to perform a similar optimization with groups of 4 points that are arranged as a square (on a Cartesian coset) and groups of 8 arranged as a cube (again, on a Cartesian coset). Algebraically, these cases are much more involved, but the basic principle of solving for two functions 𝑡(x) and 𝑔(x) remains the same. While we may be able to re-write groups of fetches as a single linear fetch, there are some details that need to be addressed.

Fig. 9. The set covering for the tensor product quadratic spline. The middle square is the region of evaluation (there is only one sub-region of evaluation for this spline). The points in red denote the lattice sites that contribute to a reconstruction in the region of evaluation. The highlighted and labeled boxes correspond to the groups of 1, 2, and 4 that can be simplified into a linear texture fetch. Note that, by our heuristic, we would rather evaluate these groups in the order D-C-B-A as opposed to, say, D-B-C-A, which would incur more cache misses.
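The identity in Equation (19) is easy to check numerically for a concrete basis. Below, a toy check of our own using the 1D linear B-spline hat(u) = max(0, 1 − |u|) with adjacent sites a = 0 and b = 1:

```python
import numpy as np

def hat(u):  # 1D linear B-spline
    return np.maximum(0.0, 1.0 - np.abs(u))

a, b = 0.0, 1.0
ca, cb = 2.5, -0.75                    # arbitrary coefficients
x = np.linspace(0.05, 0.95, 19)       # points where both weights are nonzero

g = hat(x - a) + hat(x - b)           # g(x) = phi(x - a) + phi(x - b)
t = hat(x - b) / g                    # t(x) = phi(x - b) / g(x)

lhs = g * ((1.0 - t) * ca + t * cb)   # one "lerp" fetch, scaled by g
rhs = ca * hat(x - a) + cb * hat(x - b)
print(np.allclose(lhs, rhs))  # True
```

The algebra is the same in 3D; the lerp factor t(x) simply becomes the fractional texture coordinate fed to the hardware fetch.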
The first subtlety is that, while it may be possible to cover two memory fetches with one, it may also be possible to group those two memory fetches into a larger group. If every grouped memory fetch is a subset of the total collection of memory fetches, then we seek a minimal set covering of the total set. The set covering problem is NP-complete; however, we only perform this step once as a precomputation. We enumerate all groups of 1, 2, 4 and 8 that can be combined into a single texture fetch and use Algorithm X to find a solution to the minimal set covering problem [Knuth 2000]. Once we have a set covering, we may still choose an ordering of the memory fetches. We make the assumption that, when a texture fetch occurs, all the points that contribute to that texture fetch are cached; any neighboring texture fetch may then reuse some of the values that have been cached. To take advantage of this, we create a complete directed graph in which each node corresponds to one linear texture fetch. To each edge of the graph we assign the number of cache misses incurred by performing the pair of memory fetches (in order). We then seek a route through this graph that minimizes cache misses; this reduces to the travelling salesman problem, another NP-complete problem. Again, this is a one-time precomputation, and Figure 9 shows an example of the set covering for the quadratic tensor product spline.

In this work, we validate our methodology by demonstrating that our framework works for many different test cases. While performance is important, and we do report limited performance metrics in this work, we defer in-depth performance analysis to Part II [Horacsek and Alim 2021]. To this end, we choose two simple applications that demonstrate the generality of our framework.
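Returning briefly to the set covering step above: the paper uses Knuth's Algorithm X, but the problem statement itself can be illustrated with a naive subset search (our own stand-in, adequate only for the small fetch sets that arise here):

```python
from itertools import combinations

def minimal_cover(universe, groups):
    """Smallest sub-collection of `groups` whose union equals `universe`.
    Exponential-time search; a stand-in for Algorithm X on tiny inputs."""
    universe = frozenset(universe)
    for size in range(1, len(groups) + 1):
        for combo in combinations(groups, size):
            if frozenset().union(*combo) == universe:
                return list(combo)
    return None

# Fetch sites 0..3, with singleton fetches and two lerp-able pairs.
groups = [frozenset({0}), frozenset({1}), frozenset({2}), frozenset({3}),
          frozenset({0, 1}), frozenset({2, 3})]
print(minimal_cover(range(4), groups))  # [frozenset({0, 1}), frozenset({2, 3})]
```

The cover of two pair-fetches beats the four singleton fetches, mirroring the groupings labeled in Figure 9.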
The first is volumetric rendering, in which we generate implementations for various spline and lattice combinations, many of which have not had fast GPU implementations. The second application is simply function approximation; however, we do this on the D_4 lattice on the CPU. To our knowledge, there are relatively few works that investigate function approximation in this space. We are not aware of any that assess the convergence of such spaces, nor any that provide implementations.

We do not "implement" Algorithm 1 directly; we use it simply as a template for code generation. That is, the input to our pipeline is a list of (polyhedral region, polynomial) pairs, and the output is generated LLVM code [Lattner and Adve 2004]. For our volumetric rendering experiments, we use the LLVM code to generate PTX code; for our other tests we emit x86_64 assembly. We have constrained ourselves to the CUDA ecosystem as it is the most popular GPGPU platform; however, LLVM contains both a backend for AMD GPUs and a platform agnostic SPIR-V backend that is capable of targeting OpenCL and OpenGL devices [Kessenich et al. 2018]. The details of the code generation stage, and a more detailed breakdown of performance, are given in Part II [Horacsek and Alim 2021].

The processing time for a given interpolant varies based on the spline (higher order splines with large support take much longer), but the process typically takes on the order of hours on an Intel i7 7700 with 16GB of DDR4 RAM at 2333 MHz. However, the processing time of the pipeline is a one time computation, and need not be repeated once an efficient form has been derived. We run our tests on the generated code.

Volume rendering is an illustrative problem as it demonstrates an average case application: as rays march through the cells in the volume, encountering a new cell will cause a cache miss; however, for points sampled within a cell we get cache hits.
Moreover, it does not favour any given lattice structure; a problem that uses slices of a volume along principal lattice directions may favour one case over another.

All of our tests were run with the same CPU as above and an NVIDIA GTX 1060 with 6GB of RAM and no memory or core over-clocks. Compute kernels were set to use a maximum of 128 registers. Table 1 enumerates the list of interpolants for which we generated code. When generating code, there is a small number of parameters to choose; we tuned each interpolant separately in order to maximize its performance, as detailed in Part II [Horacsek and Alim 2021].

Each test case was budgeted approximately the same amount of memory to ensure fairness between test results. For our test function, we sampled the Marschner-Lobb function [Marschner and Lobb 1994] at each lattice site; this is the de-facto standard when comparing interpolation kernels. As one moves further from the center of the volume, the spatial frequency increases; as such, reconstruction errors visually depict how frequency content is distorted. Typically, one renders the Marschner-Lobb function as an iso-surface; however, we stick to volumetric rendering to expose the worst case behaviour of the algorithm. The isosurface for a given isovalue may change slightly among grids, thus one grid may terminate earlier than others; our volumetric rendering scheme ensures uniformity of marched rays. Scale was chosen to be equal in each case. Our transfer function was chosen so that the volume was mostly transparent, thus forcing all rays to traverse the entire volume.
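For reference, the Marschner-Lobb test signal with its standard parameters (α = 0.25, f_M = 6) can be sketched as follows; this is the standard published formula, not the paper's exact sampling code:

```python
import numpy as np

def marschner_lobb(x, y, z, alpha=0.25, f_m=6.0):
    """Marschner-Lobb test function on [-1, 1]^3; values lie in [0, 1]."""
    r = np.sqrt(x * x + y * y)
    rho_r = np.cos(2.0 * np.pi * f_m * np.cos(np.pi * r / 2.0))
    return (1.0 - np.sin(np.pi * z / 2.0)
            + alpha * (1.0 + rho_r)) / (2.0 * (1.0 + alpha))

# Sample on a coarse grid and confirm the value range.
g = np.linspace(-1.0, 1.0, 21)
X, Y, Z = np.meshgrid(g, g, g)
V = marschner_lobb(X, Y, Z)
print(V.min() >= 0.0 and V.max() <= 1.0)  # True
```

The rapidly increasing radial frequency of the cosine ridges is what makes this signal a demanding benchmark for reconstruction kernels.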
[Table 1 rows: the Voronoi splines on the CC, BCC, and FCC lattices; the Linear Rhombic Dodecahedron, Truncated Rhombic Dodecahedron, Quartic Truncated Rhombic Dodecahedron, and Cubic Truncated Octohedron box splines; each listed with its lattice, direction matrix (or N\A), approximation order, support size, and references [Mirzargar and Entezari 2010, 2011].]
Table 1. A list of all interpolants and lattices for which we generated code. The notation Ξ:𝑛 denotes concatenating a matrix with itself 𝑛 − 1 times. Splines are named by the shape of their support and the degree of their polynomial pieces, except for the Voronoi splines. Splines that cannot be generated by a single box spline have N\A in the Direction Matrix column. Splines marked with a † are ones for which there exists no previous GPU implementation.
Fig. 10. Reconstruction speed in milliseconds per frame (alternatively FPS) for each test case.
For every combination of basis function and lattice tested, we obtained reasonable render times, shown in Figure 10. The only case in which performance degraded beyond what we consider reasonable is the BCC Voronoi 2 spline on the BCC lattice; this is likely due to the complex polynomial structure of the spline combined with the large amount of memory fetches one needs to reconstruct a single value. We do not consider this spline to be real time; however, it may be used in progressive rendering approaches to refine a rendered image once user interaction has stopped.

A question one might have, based on these results, is how the generated code scales as the interpolants become more complex. That is, if we require double the memory accesses per reconstruction, it is reasonable to think that a reconstruction would take double the amount of time to complete. Figure 11 shows a plot relating the number of points that contribute to a reconstructed point versus the render time, and a least squares fit (with the single outlier point removed). The least squares fit has a slope of approximately 0.6. At first glance this appears quite promising: one might expect a slope of 1, and 0.6 implies that the cost of doubling the complexity of the spline is less than one might expect. However, the correlation coefficient of these data is approximately 0.45 (with p-value 0.07), thus this correlation is quite weak. Unfortunately, there are too many confounding variables; remember, we take the best results over all configurations of spline spaces (i.e. we mix branch predication, tri-linear lookups, etc.). We will return to this in Part II.
While it is not the goal of this work to showcase qualitative visual results, we compare the Voronoi splines of each lattice on their respective lattices. Figure 12 shows the qualitative difference for the rendered volumes. These are at similar resolutions, and it is clear that the BCC and FCC lattices outperform the CC lattice. Between the BCC and FCC lattices the comparison is more difficult; there are subtle differences in reconstruction, yet the artifacting seems wholly more isotropic on the BCC lattice.
Fig. 11. Trend of reconstruction speeds. The red line is the least squares fit without the outlier at (27, 700 ms); this corresponds to the BCC Voronoi 2 spline, which is very computationally expensive on the GPU, as it requires the predication of unique sub-regions of evaluation.

The D₄ lattice. To demonstrate that we are not bound to 2 and 3 dimensions, we derive a fast evaluation scheme for the 4-dimensional D₄ lattice. A generating matrix for the lattice is

    L_{D₄} := [ −1   1   0   0 ]
              [ −1  −1   1   0 ]
              [  0   0  −1   1 ]
              [  0   0   0  −1 ],                                    (20)

and we write Ξ_{D₄} for the direction matrix of our spline. The geometry of this lattice is similar to that of the FCC lattice; the D₄ lattice consists of 8 Cartesian cosets, shifted to the 3-facets of the 4-dimensional hypercube. The spline we choose is a modification of the one presented by Kim et al.; however, some direction vectors have been removed to make the space computable in a reasonable amount of time [Kim and Peters 2011]. The spline M_{Ξ_{D₄}}(x) is a fourth-order spline, but it is not interpolating; as such, it must be pre-filtered to ensure that the error decays as expected. We used a quasi-interpolation filter with a positive value at the origin and a negative value at all D₄ lattice sites at distance √2 from the origin. To validate, we approximated a Gaussian test function and refined the grid
10 times; at each iteration we sampled the function on the grid h · L_{D₄}, then measured the L₂ error over the test domain via Monte Carlo integration with 10,000 samples. The overall decomposition analysis for the spline took approximately 2 days to compute, and the code generation [Horacsek and Alim 2021] took approximately a day. Again, this is a one-time pre-computation; reconstruction, on the other hand, is many orders of magnitude faster: a single point evaluation takes less than a millisecond. Since the spline M_{Ξ_{D₄}}(x) is a fourth-order spline, we expect the error to decay on the order of h⁴ as the grid is refined. Figure 13 demonstrates exactly this behaviour. An equivalent tensor-product spline on the 4-dimensional Cartesian lattice would require 256 memory accesses for the same order, whereas this case requires only 240. Perhaps astonishingly, the same spline has the same order on the D₄* lattice (the dual of D₄), requiring only 60 memory accesses. The D₄ lattice attains the optimal hyper-sphere packing in 4 dimensions; as such, the D₄* lattice would be the optimal sampling lattice. However, we stick to the D₄ lattice, as it is more difficult to derive interpolants for; its coset structure contains more shifted lattices than that of the D₄* lattice, and it is somewhat of a "stress-test" for our methodology.

(a) Ground Truth  (b) CC Voronoi 1  (c) BCC Voronoi 1  (d) FCC Voronoi 1

Fig. 12. Voronoi spline reconstructions on different lattices. Note the much harsher artifacting on the top of the volume for the CC lattice. The FCC and BCC lattices show much smoother reconstructions, with the BCC having slightly less isotropic reconstruction on the top, and the FCC having slightly less isotropic reconstruction on the lower face. These were all rendered with framerates well above 60 fps.
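The structure of this convergence experiment can be mimicked in one dimension with standard tools. The sketch below substitutes linear interpolation (a second-order scheme) for the fourth-order D₄ spline, and estimates the L₂ error of a Gaussian by Monte Carlo sampling as the grid is refined; the Gaussian width, domain, grid levels, and sample count are arbitrary illustrative choices, not the ones used in our experiment.

```python
import numpy as np

# Illustrative 1D analogue of the convergence experiment: approximate a
# Gaussian with linear interpolation and estimate the L2 error by Monte
# Carlo sampling as the grid spacing h is halved. A k-th order scheme
# should show error ~ h^k, i.e. the log2 of successive error ratios
# approaches k (here k = 2; the D4 spline in the text has k = 4).
rng = np.random.default_rng(0)
f = lambda x: np.exp(-x * x / (2 * 0.2**2))
samples = rng.uniform(-0.5, 0.5, 10_000)   # Monte Carlo evaluation points

errors = []
for level in range(4, 9):
    h = 2.0 ** -level
    grid = np.arange(-1.0, 1.0 + h, h)     # sampling grid at scale h
    approx = np.interp(samples, grid, f(grid))
    errors.append(np.sqrt(np.mean((approx - f(samples)) ** 2)))

# Empirical order: log2 of the error ratio between successive refinements.
orders = np.log2(np.array(errors[:-1]) / np.array(errors[1:]))
print(orders)  # entries approach 2 as h decreases
```

Plotting log error against log grid scale, as in Figure 13, turns the order into the slope of a line.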
We presented a generalized framework for analysing multidimensional splines on non-Cartesian (and Cartesian) grids, with the goal of producing fast evaluation schemes for these spline spaces. While this is the main contribution of the work, we also produced performant code for the notoriously complex Voronoi splines on the
FCC and BCC lattices, which have not yet had efficient implementations. We also demonstrated the computational behaviour of our approach as spline size increases, showing a reasonable computational increase as the support of a spline grows; however, we saw that it is difficult to predict performance from support size alone. Finally, we investigated performance in 4 dimensions, and provided an implementation (and quasi-interpolant filter) for a spline on the D₄ lattice. The entire pipeline of our work is implemented within a SageMath worksheet, and is available on GitHub [Developers 2016; Horacsek 2021]. Further details related to the code generation step of our pipeline, as well as detailed performance results, are presented in Part II of this work [Horacsek and Alim 2021]. Future theoretical work is focused on extending this framework to more classes of splines (e.g., the exponential box splines), designing optimized interpolants, and assessing the quality of interpolants in a systematic manner.

Fig. 13. Error decay of function approximation as the D₄ lattice becomes finer. The slope of the light grey line indicates the theoretic decay of the approximation space, and the other line denotes the actual decay of the approximation error. The function being approximated is a Gaussian.

REFERENCES
Bita Akram, Usman R. Alim, and Faramarz F. Samavati. CINAPACT-splines: A family of infinitely smooth, accurate and compactly supported splines. In Advances in Visual Computing, pages 819–829, Cham, 2015. Springer International Publishing. ISBN 978-3-319-27857-5.
Usman Alim, Torsten Möller, and Laurent Condat. Gradient estimation revitalized. IEEE Transactions on Visualization and Computer Graphics, 16(6):1495–1504, 2010.
Usman R. Alim. Data Processing on the Body-Centered Cubic Lattice. PhD thesis, Simon Fraser University, Burnaby, BC, Canada, 2012.
Thierry Blu and Michael Unser. Quantitative Fourier analysis of approximation techniques. I. Interpolators and projectors. IEEE Transactions on Signal Processing, 47(10):2783–2795, 1999.
Thierry Blu, P. Thévenaz, and Michael Unser. MOMS: Maximal-order interpolation of minimal support. IEEE Transactions on Image Processing, 10(7):1069–1080, 2001.
Martine Ceberio and Vladik Kreinovich. Greedy algorithms for optimizing multivariate Horner schemes. ACM SIGSAM Bulletin, 38(1):8–15, 2004.
John Horton Conway and Neil James Alexander Sloane. Sphere Packings, Lattices and Groups, volume 290. Springer Science & Business Media, 2013.
Balázs Csébfalvi. Cosine-weighted B-spline interpolation: A fast and high-quality reconstruction scheme for the body-centered cubic lattice. IEEE Transactions on Visualization and Computer Graphics, 19(9):1455–1466, 2013.
Balázs Csébfalvi and Gergely Rácz. Retailoring box splines to lattices for highly isotropic volume representations. Computer Graphics Forum, 35(3):411–420, 2016. doi: 10.1111/cgf.12917.
Carl de Boor. On the evaluation of box splines. Numerical Algorithms, 5(1):5–23, 1993. doi: 10.1007/BF02109280.
Carl de Boor, Klaus Höllig, and Sherman Riemenschneider. Box Splines. Springer-Verlag, New York, NY, USA, 1993. ISBN 0-387-94101-0.
The Sage Developers. Sage Mathematics Software (Version 7.2), 2016.
Balázs Domonkos and Balázs Csébfalvi. DC-splines: Revisiting the trilinear interpolation on the body-centered cubic lattice. In Proceedings of the 15th Vision, Modeling, and Visualization Workshop (VMV), pages 275–282, Siegen, Germany, November 2010.
A. Entezari, D. Van De Ville, and T. Möller. Practical box splines for reconstruction on the body centered cubic lattice. IEEE Transactions on Visualization and Computer Graphics, 14(2):313–328, March 2008. doi: 10.1109/TVCG.2007.70429.
Alireza Entezari. Towards computing on non-Cartesian lattices. IEEE Transactions on Visualization and Computer Graphics, 2006.
Alireza Entezari and Torsten Möller. Extensions of the Zwart–Powell box spline for volumetric data reconstruction on the Cartesian lattice. IEEE Transactions on Visualization and Computer Graphics, 12(5), 2006.
Alireza Entezari, Ramsay Dyer, and Torsten Möller. Linear and cubic box splines for the body centered cubic lattice. In Proceedings of the Conference on Visualization '04, pages 11–18. IEEE Computer Society, 2004.
Alireza Entezari, Mahsa Mirzargar, and Leila Kalantari. Quasi-interpolation on the body centered cubic lattice. In Computer Graphics Forum, volume 28, pages 1015–1022. Wiley Online Library, 2009.
Bernhard Finkbeiner, Alireza Entezari, Dimitri Van De Ville, and Torsten Möller. Efficient volume rendering on the body centered cubic lattice using box splines. Computers & Graphics, 34(4):409–423, 2010.
J. J. Horacsek and U. R. Alim. Fast and exact evaluation of box splines via the PP-form. ArXiv e-prints, June 2016.
Joshua Horacsek. Fast spline. https://github.com/jjh13/fast-spline, 2021.
Joshua Horacsek and Usman Alim. Generating fast interpolants for volumetric data: Part II — Implementation and code generation. 2021.
John Kessenich, Boaz Ouriel, and Raun Krisch. SPIR-V specification. Khronos Group, 3, 2018.
Minho Kim. GPU isosurface raycasting of FCC datasets. Graphical Models, 75(2):90–101, 2013.
Minho Kim. Quartic box-spline reconstruction on the BCC lattice. IEEE Transactions on Visualization and Computer Graphics, 19(2):319–330, 2013.
Minho Kim. Analysis of symmetry groups of box-splines for evaluation on GPUs. Graphical Models.
Minho Kim and Jörg Peters. Fast and stable evaluation of box-splines via the BB-form. Numerical Algorithms, 50(4):381–399, 2009. doi: 10.1007/s11075-008-9231-6.
Minho Kim and Jörg Peters. Symmetric box-splines on root lattices. Journal of Computational and Applied Mathematics, 235(14):3972–3989, 2011.
Minho Kim, Alireza Entezari, and Jörg Peters. Box spline reconstruction on the face-centered cubic lattice. IEEE Transactions on Visualization and Computer Graphics, 14(6):1523–1530, 2008.
Minho Kim, Hyunjun Kim, and Aaliya Sarfaraz. Efficient GPU isosurface ray-casting of BCC datasets. 2013.
Donald E. Knuth. Dancing links. arXiv preprint cs/0011047, 2000.
Leif Kobbelt. Stable evaluation of box-splines. Numerical Algorithms, 14(4):377–382, 1997. doi: 10.1023/A:1019133501773.
Chris Lattner and Vikram Adve. LLVM: A compilation framework for lifelong program analysis and transformation. Pages 75–88, San Jose, CA, USA, March 2004.
Stephen R. Marschner and Richard J. Lobb. An evaluation of reconstruction filters for volume rendering. In Proceedings of the Conference on Visualization '94, pages 100–107. IEEE Computer Society Press, 1994.
Mahsa Mirzargar and Alireza Entezari. Voronoi splines. IEEE Transactions on Signal Processing, 58(9):4572–4582, 2010.
Mahsa Mirzargar and Alireza Entezari. Quasi interpolation with Voronoi splines. IEEE Transactions on Visualization and Computer Graphics, 17(12):1832–1841, 2011.
Amos Ron. Exponential box splines. Constructive Approximation, 4(1):357–378, 1988.
Gilbert Strang and George Fix. A Fourier analysis of the finite element variational method. In Constructive Aspects of Functional Analysis, pages 793–840. Springer, 2011.
Michael Unser. Splines: A perfect fit for signal and image processing. IEEE Signal Processing Magazine, 16(6):22–38, 1999.
Michael Unser. Sampling—50 years after Shannon. Proceedings of the IEEE, 88(4):569–587, 2000.
Dimitri Van De Ville, Thierry Blu, Michael Unser, Wilfried Philips, Ignace Lemahieu, and Rik Van de Walle. Hex-splines: A novel spline family for hexagonal lattices. IEEE Transactions on Image Processing, 13(6):758–772, 2004.
Emanuele Viterbo and Ezio Biglieri. Computing the Voronoi cell of a lattice: the diamond-cutting algorithm.