AsyncTaichi: On-the-fly Inter-kernel Optimizations for Imperative and Spatially Sparse Programming
AsyncTaichi: Whole-Program Optimizations for Megakernel Sparse Computation and Differentiable Programming
YUANMING HU∗, MIT CSAIL
MINGKUAN XU∗, Tsinghua University
YE KUANG, Tsinghua University
FRÉDO DURAND, MIT CSAIL
Fig. 1. Top: In the original Taichi system [Hu et al. 2019], computational kernels are eagerly compiled and launched, so the optimizer has no chance to optimize beyond a single kernel. Bottom: In our work, we accumulate kernels in an execution buffer, flushing the execution queue only when necessary. This gives the optimizer more context and lets it optimize beyond a single kernel. We dynamically build a dependency graph (a "state-flow graph") of kernels, so that sparse computation kernels can be optimized at the whole-program level. After a suite of domain-specific optimization passes, including list generation removal and sparse data structure activation elimination, the tasks are much better optimized than those in the original Taichi system. As a result, whole-program-optimized Taichi programs run much faster on parallel devices.
We present a whole-program optimization framework for the Taichi programming language. As an imperative language tailored for sparse and differentiable computation, Taichi's unique computational patterns lead to attractive optimization opportunities that are not present in other compiler or runtime systems. For example, to support iteration over sparse voxel grids, excessive list generation tasks are often inserted. By analyzing sparse computation programs at a higher level, our optimizer is able to remove the majority of unnecessary list generation tasks. To provide maximum programming flexibility, our optimization system conducts on-the-fly optimization of the whole computational graph consisting of Taichi kernels. The optimized Taichi kernels are then just-in-time compiled in parallel, and dispatched to parallel devices such as multithreaded CPUs and massively parallel GPUs.

∗Both authors contributed equally to this work.
Authors' addresses: Yuanming Hu, MIT CSAIL, [email protected]; Mingkuan Xu, Tsinghua University, [email protected]; Ye Kuang, Tsinghua University, [email protected]; Frédo Durand, MIT CSAIL, [email protected].
© 2020 Association for Computing Machinery.
https://doi.org/10.1145/nnnnnnn.nnnnnnn
Without any code modification to Taichi programs, our new system leads to 3.−.× fewer kernel launches and 1.−.× speedups on our benchmarks, including sparse-grid physical simulation and differentiable programming.
CCS Concepts: • Software and its engineering → Domain specific languages; • Computing methodologies → Parallel programming languages; Physical simulation.
Additional Key Words and Phrases: Sparse Data Structures, GPU Computing.
ACM Reference Format:
Yuanming Hu, Mingkuan Xu, Ye Kuang, and Frédo Durand. 2020. AsyncTaichi: Whole-Program Optimizations for Megakernel Sparse Computation and Differentiable Programming. 1, 1 (December 2020), 10 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

Taichi is a new programming language for spatially sparse [Hu et al. 2019] and differentiable [Hu et al. 2020] programming. It has shown its productivity and performance in forward physical simulation and in deep learning tasks with differentiable physical modules. However, existing Taichi compiler optimizations are restricted to within kernels, without a high-level view of the whole program. This motivates a new system that can conduct whole-program automatic performance optimizations for Taichi programs.
Whole-program optimization, also known as "link-time optimization" or "interprocedural optimization" in some literature, is not a new idea. For example, in traditional ahead-of-time compilers such as gcc, whole-program optimization can be easily enabled with the -flto flag. In more dynamic array programming languages such as JAX/HLO [Bradbury et al. 2018], array-level fusion [Frostig et al. 2018] has been an effective optimization. However, three major challenges arise when applying the idea of whole-program optimization to Taichi.
First, Taichi eagerly launches computational tasks. The existing Taichi system just-in-time (JIT) compiles and then launches every kernel immediately after it is called. This eager execution scheme destroys opportunities to analyze and optimize beyond a single kernel.
Second, Taichi is imperative, has mutable data buffers, and needs to support spatially sparse programming. Unlike systems that rely on dense, immutable, and holistic arrays ("tensors" in systems like TensorFlow), partial updates and mutable, sparse buffers make analyzing a Taichi program difficult.
Third, the spatially sparse programming features of Taichi add further complexity to the analysis and optimization process. While tools for analyzing dense array programs are well established, their counterparts for sparse array computation are largely underexploited.
In this work, we propose a state-flow formulation for analyzing Taichi programs, a new asynchronous execution engine for Taichi that opens up space for performance optimization, and, most importantly, a whole-program optimizer for imperative spatially sparse and differentiable computation. In contrast to the previous Taichi system, which performs no whole-program optimization, we build a dynamic graph of pending kernels, then compile and run them lazily. This enables us to perform cross-kernel optimizations ranging from common optimizations such as kernel fusion to sparse-computation-specific activation/list generation removal. We can also perform parallel compilation, which significantly reduces compilation time.
With productivity and performance in mind, our system is designed following the guidelines below:
• Transparent to users. No code modification is needed for users to leverage the new execution and optimization system.
• Compile just-in-time. The just-in-time compilation system in Taichi has proven to bring users great flexibility and performance, since JIT delays the need for "compile-time constant" values. For example, Δ𝑡 in physical simulators is often a run-time variable under ahead-of-time compilation, but with JIT, Δ𝑡 becomes a compile-time constant, which allows the compiler to do more optimizations such as constant folding.
• Problems of all scales matter. Graphics applications cover a wide range of problem sizes. For example, a particle simulation may cover from 2,000 to 100,000,000 particles. For small-scale tasks, compilation time may be the bottleneck; for large-scale tasks, computation time is more important.
The rest of this paper covers our detailed design decisions motivated by these guidelines. We summarize our contributions as follows:
(1) A state-flow formulation of spatially sparse computation. The resulting state-flow graph (SFG) serves as a high-level intermediate representation (IR) of a Taichi program that captures the information essential for analyzing sparse computation.
(2) An asynchronous task execution scheme for the Taichi programming language that enables parallel compilation and exposes whole-program optimization opportunities.
(3) Most importantly, a whole-program optimizer for asynchronous spatially sparse and differentiable computation, leveraging the state-flow graph for performance optimizations.
(4) A systematic study of the resulting asynchronous Taichi system. On our benchmarks, we show 1.−.× (geometric mean) execution time improvements over the traditional synchronous Taichi system without whole-program optimizations [Hu et al. 2019].
Sparse data structures.
Taichi directly draws inspiration from popular sparse data structures in computer graphics, including VDB [Hoetzlein 2016; Museth et al. 2013; Wu et al. 2018] and SPGrid [Gao et al. 2018; Setaluri et al. 2014]. While these data structures have demonstrated effective computation and storage benefits over dense arrays, writing programs that leverage them is not an easy task. Taichi provides a language abstraction that allows using these data structures as if they were dense, and runtime systems that automatically handle parallel voxel iteration and memory management. These designs benefit the end users, but may introduce extra computation that needs whole-program analysis to optimize.
Another thread of work on sparse computation is sparse linear algebra languages, such as TACO [Chou et al. 2018; Kjolstad et al. 2017], which can effectively generate kernels for Einstein summations on sparse matrices and tensors. Instead of explicitly building sparse matrices, Taichi encourages matrix-free linear algebra computations, which are often the more effective way to perform high-performance linear algebra solves for physical simulation (see, for example, [Liu et al. 2018]).
Array data-flow analysis.
The static single assignment (SSA) form has been a very popular IR structure. SSA forms are designed for scalar variables and cannot directly represent array states, where partial updates may happen. Array SSA forms have been proposed and successfully adopted in parallelization [Knobe and Sarkar 1998] and array privatization [Maydan et al. 1993]. However, related work on this topic is mostly focused on dense arrays. Our high-level IR represents not only partial array updates, but also topology changes in sparse arrays.
Whole-Program Optimization (WPO).
WPO is also known as interprocedural optimization (IPO). For ahead-of-time compilation, IPO typically happens at link time, so it is sometimes also called link-time optimization (LTO). Many existing compilers, such as gcc, MSVC, and clang, already support LTO. While WPO is extensively explored in classical compiler systems, it is still underexploited for spatially sparse computation. The unique computational patterns in Taichi bring higher complexity and the need for a unified high-level intermediate representation for analysis and optimization.
Compute graph optimization in deep learning frameworks.
A feed-forward deep neural network (DNN) can be naturally represented as a directed acyclic graph (DAG). This leads to a straightforward mapping between DNNs and the compute graph: immutable, dense feature maps map directly to compute graph edges, and operators (such as convolutions, max pooling, and element-wise add) map to compute graph nodes. High-level optimizations on the compute graph have been a popular feature in deep learning frameworks; the HLO IR of XLA and PyTorch GLOW [Rotem et al. 2018] are representative examples. Based on the compute graph, traditional compiler optimizations such as operator fusion, dead code elimination (DCE), and common subexpression elimination (CSE) can be applied. We refer the readers to [Li et al. 2020] for a good survey. Our system is similar to these systems in that a high-level graph-based IR is used, yet the high-level IR for Taichi must consider its sparse, imperative, and megakernel nature.
Taichi [Hu et al. 2019] is a new programming language for spatially sparse and differentiable visual computing. As a domain-specific language embedded in Python, Taichi's just-in-time compiler transforms compute-intensive kernels (megakernels, similar to a __global__ GPU kernel in CUDA) into parallel executables. Users can flexibly launch the kernels from Python. Key features of Taichi are described below.
Data-oriented programming. field is the key concept in Taichi that represents data. A field can represent not only one- to eight-dimensional dense tensors, but also their sparse variants. Taichi also allows programmers to change data layouts using a data layout description language, which can easily describe common layouts such as array of structures (AOS) or structure of arrays (SOA). Taichi decouples algorithms from data structures. This allows users to explore the vast design space of data structures and layouts, so that maximum memory performance (e.g., cache hit rate and cacheline utilization) can be achieved via rapid trial and error.
Sparse programming is a unique feature [Hu et al. 2019] of Taichi. Most 3D graphics data (especially those stored on voxel grids) are spatially sparse. Taichi has first-class support for sparse data structures. To make sparse data structures as easy to use as dense ones, various designs are made at the syntax, compiler, and runtime levels:
• Sparse struct-for loops allow users to easily iterate over the active voxels of sparse data structures. For example, the following code loops over a 3D sparse field:

for i, j, k in x:
    x[i, j, k] += 1
List generation is the key mechanism to achieve efficient sparsefor-loops (Fig. 2).
Fig. 2. The structure of ti.root.pointer(ti.i, 4).dense(ti.i, 2).place(x) in Taichi, a two-level 1D sparse data structure with the upper level being a pointer array and the lower level being dense blocks. Highlighted cells are active. The list of each layer is defined to be the collection of active node indices.
• Activation on write ensures sparse data structure nodes are implicitly activated on writes. For example, the following code generates a 2×2×2-downsampled field y from a higher-resolution sparse field x:

for i, j, k in x:
    y[i // 2, j // 2, k // 2] += x[i, j, k]

Note that the corresponding voxels of y do not have to be explicitly activated before this for-loop. Taichi will automatically activate y[i // 2, j // 2, k // 2] and zero-fill the initial data. See Figure 3 for an example.
Fig. 3. Execution result of the simple program for i in x: y[i // 2] += 1. y[1] and y[3] are activated on write. Because of the constraints of the dense node, y[0] and y[2] are also activated.
• Automatic memory management frees the user from worrying about memory allocation and deallocation. Taichi's high-performance memory allocator automatically manages sparse data structure nodes.
• Programmable megakernels allow users to easily write parallel programs with very high flexibility and rich expressiveness. Taichi kernels allow complex control flow; in fact, users can easily write a recursive ray tracer (see, e.g., [Hu et al. 2019]).
• Automatic parallelization.
Taichi kernels are decomposed into tasks that are serial or parallel.
• Differentiable programming in Taichi.
Unlike some other differentiable programming systems for deep learning, such as TensorFlow and PyTorch, which operate on functional ("immutable") buffers, Taichi is imperative (providing "if" statements and serial and parallel "for" loops). The imperative nature makes it easier to port most physical simulation code written in popular languages like C++ to Taichi than to modern functional programming systems such as TensorFlow. Taichi uses a novel two-scale reverse-mode automatic differentiation [Hu et al. 2020]: a lightweight gradient tape stores kernel launches for end-to-end simulation differentiation, and source-code transformation is used for differentiating within kernels. This design ensures that the gradient versions of the megakernels are still megakernels, and naturally uses global fields as checkpoints for gradient computation.
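The sparse-programming behaviors above (reads of inactive voxels returning zero, and activation on write) can be illustrated with a toy model. This is a minimal sketch with an invented `SparseField` class, not Taichi's implementation:

```python
# A toy model of Taichi's activation-on-write semantics for a 1D sparse field.
# SparseField is a hypothetical stand-in, not a Taichi class.
class SparseField:
    def __init__(self):
        self.mask = set()   # active voxel indices (the "mask state")
        self.value = {}     # voxel values (the "value state")

    def read(self, i):
        # Inactive voxels read as 0, mirroring the semantics described above.
        return self.value.get(i, 0)

    def write_add(self, i, v):
        # Writing first activates the voxel (zero-filled), then accumulates.
        if i not in self.mask:
            self.mask.add(i)
            self.value[i] = 0
        self.value[i] += v

x = SparseField()
for i in [0, 1, 4, 5]:
    x.write_add(i, 1)

y = SparseField()
# The downsampling pattern from the text: y[i // 2] += x[i].
for i in sorted(x.mask):
    y.write_add(i // 2, x.read(i))

print(sorted(y.mask))  # voxels of y activated purely by writes
```

Note how `y` never needs explicit activation: the write itself updates both the mask state and the value state, which is exactly why a whole-program analysis must track masks separately from values.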
Observations.
Many potential performance optimizations in Taichi programs need a whole-program view, which motivates the remainder of this work. For example, sparse struct-for loops may easily incur redundant list generation in the data structure tree. Eliminating these unnecessary list generations requires a whole-program understanding of the program.
In imperative programming, Program = State + Compute. Specifically in Taichi, because of sparse programming support, "state" means more than the values of fields, and "compute" means more than GPU kernels that operate on data. This creates more challenges in modeling and analyzing Taichi programs than in traditional imperative programming languages (such as C++) or array-based systems such as TensorFlow.
In TensorFlow, every operation creates a new, immutable buffer ("tensor"). In simulation, we have to go imperative, not only because graphics programmers have been accustomed to imperative programming for decades, but also because in-place operations in imperative programming offer significant performance advantages in graphics applications, such as physical simulation.
We reformulate the imperative computation scheme of Taichi into a collection of states and tasks. The ultimate goal of our IR design is to strengthen and simplify optimization rules for spatially sparse computation. To systematically optimize spatially sparse computation, we formulate an AsyncTaichi program as a state-flow graph (SFG), a DAG with nodes being tasks and edges being states. This results in a state-flow formulation and a high-level intermediate representation (IR). Scalar data-flow analysis is well studied in optimizing compilers, and SFGs can be considered an extension of data-flow analysis that handles spatially sparse computation.
Background: Structural Nodes (SNodes).
In Taichi, specifying a data structure includes choices at both the macro level, dictating how the data structure components nest with each other and the way they represent sparsity, and the micro level, dictating how data are grouped together (e.g., structure of arrays vs. array of structures). Taichi provides structural nodes (SNodes) to compose the hierarchy and particular properties. Commonly used SNodes include dense, bitmasked, pointer, and dynamic.
States split the holistic description of a Taichi program into a granularity suitable for analysis and optimization. dense is the only SNode that carries no sparsity information. Other SNodes can be spatially sparse, so we must decompose the holistic descriptions of their data into the following states:
• A value state simply represents the collection of numerical values stored in a field. Note that in Taichi's data structure trees, only the leaf nodes (i.e., place SNodes) store numerical values. Value states are the most basic states and have the same meaning as those in data-flow optimization. It is worth noting that in sparse data structures every voxel has a numerical value, even if the voxel is inactive; in that case the inactive voxel has value 0.
• A mask state cannot be treated as plain data flow, since we need to understand it and perform domain-specific optimization. Masks are scattered in various forms in the data structure: they are indicated either by a bit in a bitmasked SNode or by a non-null pointer in a pointer SNode. Essentially, the mask is not a unified concept across data structures, so we need to generate a unified element list for struct-fors over different structures.
• A list state of an SNode represents the data structure node lists maintained by the runtime system. Recall that Taichi needs to generate/consume data structure node lists for load-balanced parallel iteration over sparse data structure nodes. See [Hu et al. 2019] for more details.
• An allocator state represents the state of Taichi's memory allocator. For computation that allocates/deallocates sparse data structure nodes, the allocator state changes.
The relationship between value, mask, and list states is depicted in Fig. 4.
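As a sketch of the decomposition above, the states of the two-level field from Fig. 2 could be enumerated as follows. The naming scheme (`value(...)`, `mask(...)`, `list(...)`) is illustrative, not Taichi's internal API:

```python
# Enumerate the states of an SNode path, following the decomposition in the
# text: leaves carry value states, sparse SNodes carry mask states, non-leaf
# SNodes carry list states, and one allocator state is shared globally.
def states_of(snode_path):
    states = ["allocator"]  # one shared allocator state
    for kind, name in snode_path:
        if kind == "place":
            states.append(f"value({name})")      # leaves hold numerical values
        else:
            states.append(f"list({name})")       # per-SNode element list
            if kind in ("pointer", "bitmasked", "dynamic"):
                states.append(f"mask({name})")   # only sparse SNodes have masks
    return states

# The two-level field ti.root.pointer(ti.i, 4).dense(ti.i, 2).place(x):
path = [("pointer", "ptr"), ("dense", "blk"), ("place", "x")]
print(states_of(path))
```

A single struct-for over `x` then reads `mask(ptr)` and the list states, and reads/writes `value(x)`, which is exactly the edge information the SFG needs.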
Taichi tasks.
Each Taichi task has input edges (input states) and output edges (modified states). It also maintains its own metadata, such as loop ranges (for range-fors and struct-fors). These edges and metadata are used for whole-program optimization. For each state, we use a latest-state tracker to record which task holds the latest version of that state.
Now let’s focus on a single state. For simplicity, let’s use value state 𝑆 (Fig. 5), which is operated on by kernels 𝑓 , 𝑔, 𝑝, ℎ, 𝑞 . Note that 𝑓 , ℎ and 𝑞 read and write the value state 𝑆 , yet 𝑔 and 𝑝 only reads thevalue of 𝑆 . Every time we modify a state, a new “copy” is created.Clearly, only the latest writer holds the latest version of a state,while readers only fetches a copy without creating a new copy. Ifwe only consider the writers, we basically get a chain structure foreach state, with a few branches for readers, see the example in Fig. 5as an concrete example.Therefore, for a single state we can easily build a chain (a directedacyclic graph, DAG), which we call “state-flow chain” (SFC). Note that in imperative programming the modifications are actually applied in place,yet for optimization purposes we assume that we always create a new virtual copy., Vol. 1, No. 1, Article . Publication date: December 2020. syncTaichi: Whole-Program Optimizations for Megakernel Sparse Computation and Differentiable Programming • 5
Fig. 4. State decomposition of a two-level sparse array, containing a sparse intermediate layer and a dense leaf layer. Note that the value state covers all pixels, even inactive ones. In other words, whenever an access reads a pixel from the sparse array, the mask state is queried first. If the mask state says the pixel is inactive, 0 is returned; otherwise the system queries the value state and returns the corresponding value. Here we omit allocator states for simplicity.
Fig. 5. A state-flow chain of value state 𝑆. The edges in the state-flow chain depict the kernel dependency relationships. Note that each state-flow chain always has a main branch (write-after-write) and reader branches (read-after-write & write-after-read). In the main branch, each node (task) creates a new version of the state.
A Taichi program can easily have hundreds of states. Here we introduce state-flow graphs (SFGs), which are essentially state-flow chains stitched together (Fig. 6). SFGs completely describe the relationships between tasks in Taichi. Since unions of DAGs are still DAGs, SFGs are DAGs too.
The SFG serves as the IR for whole-program sparse computation optimization. This allows us to use well-established graph-theoretic language for optimization (e.g., fusion, as shown later).
Fig. 6. A state-flow graph, by definition, is a union of state-flow chains ofall the states used in a program.
Whenever a task is inserted into the execution queue, we dynamically create an SFG node and the corresponding dependency edges. The SFG has two useful properties:
(1) Order independence. Any topologically ordered task sequence leads to the same program behavior.
(2) Reconstruction invariance, a corollary of order independence. Any topologically ordered task sequence of G reconstructs the same graph G.
Reconstruction invariance is particularly useful when manipulating graph nodes: for example, to remove a node from the SFG, simply topologically sort the SFG nodes, remove the node from the sorted list, and rebuild the SFG. This frees us from worrying about how to correctly handle the edges connected to the removed node.
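The node-removal recipe above can be sketched in a few lines. This is a simplified model in which a task is a `(name, reads, writes)` tuple and edges come from a latest-writer tracker; write-after-read edges to earlier readers are omitted for brevity:

```python
# Build an SFG from a topologically ordered task list, then demonstrate
# "reconstruction invariance": deleting a node is just rebuilding the graph
# from the task list with that node removed, no manual edge surgery needed.
def build_sfg(tasks):
    latest = {}   # state -> index of the task holding its latest version
    edges = set()
    for i, (_, reads, writes) in enumerate(tasks):
        for s in reads | writes:
            if s in latest:
                edges.add((tasks[latest[s]][0], tasks[i][0]))
        for s in writes:
            latest[s] = i   # this task now holds the latest version of s
    return edges

# f writes S; g reads S; h reads and writes S (cf. the chain in Fig. 5).
tasks = [("f", set(), {"S"}), ("g", {"S"}, set()), ("h", {"S"}, {"S"})]
full = build_sfg(tasks)
rebuilt = build_sfg([t for t in tasks if t[0] != "g"])  # remove reader g
print(full, rebuilt)
```

Removing `g` and rebuilding yields exactly the graph one would obtain by deleting the node and its incident edges by hand, which is the point of the invariance property.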
The existing Taichi system just-in-time (JIT) compiles and then launches every kernel eagerly. This simple strategy prevents cross-kernel optimization, since the system only sees one kernel at a time. Therefore, to make the SFG practically useful, we need to hold SFG nodes back from execution until optimizations are done. This motivates us to develop an asynchronous execution engine for Taichi. By making Taichi asynchronous, we obtain a list of kernels to compile and run lazily, and we can perform parallel compilation, which reduces compilation time. More importantly, we can perform kernel fusion, which reduces memory bandwidth consumption and directly contributes to performance.
By-product: parallel compilation.
As Taichi becomes more widely adopted, the compiler needs to handle programs with increasing numbers of instructions and optimization passes. It is not uncommon for compilation to take more than 70% of end-to-end program run time. In the previous eager execution scheme, a single serial thread compiles and launches the kernels. In contrast, since the asynchronous execution engine sees multiple kernels at a time, parallel compilation can be done easily, significantly reducing wall-clock time spent on compilation.
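Once execution is deferred, the pending queue can be compiled concurrently before it is flushed. A minimal sketch, where `compile_task` is a hypothetical stand-in for per-kernel code generation:

```python
# Compile a batch of pending tasks in parallel before flushing the queue.
# compile_task is a placeholder for an expensive JIT codegen step.
from concurrent.futures import ThreadPoolExecutor

def compile_task(name):
    return f"binary({name})"

pending = ["advect", "project", "listgen_root", "smooth"]
with ThreadPoolExecutor(max_workers=4) as pool:
    # pool.map preserves input order, so launch order stays deterministic.
    binaries = list(pool.map(compile_task, pending))
```

Real compilers would release the interpreter lock (or use processes) during codegen; the point is only that laziness turns a serial compile-launch loop into a batch that parallelizes trivially.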
With the state-flow graph IR that describes the whole program, and the asynchronous execution engine that keeps tasks from being eagerly executed, we can now conduct whole-program optimizations.
Fig. 7. (a) Generated tasks from two kernels and the corresponding SFG. x is a sparse data structure.
Fig. 8. The generated tasks of two kernels without kernel fusion. x is a sparse data structure.
As shown in Figure 8, since x is a sparse data structure, Taichi needs to generate an active-element list of x to know which elements of x need to be looped over, so there are 3 tasks for each such small kernel in synchronous mode. If the kernel on the right follows the kernel on the left of Figure 8, and Taichi performs asynchronous computing, we can fuse the two kernels into one, perform analysis to show that the sparsity (i.e., which elements are active) is unchanged, and finally obtain Figure 9 after optimizations. In this case,
Fig. 9. The generated tasks of two kernels with kernel fusion.
we reduce the number of generated tasks from 6 to 3.
Kernel fusion is not new, but fusing kernels that operate on sparse data structures is a unique challenge in Taichi, since the iteration over active elements implicitly depends on the masks of the sparse data structures.
Even if the bodies of both kernels cannot be directly optimized as in this example, we can still remove some list generation tasks and reduce running time. This can be a significant improvement for small kernels where list generation time is comparable to the actual computation time.
Taichi's sparse computation model motivates us to apply the following domain-specific compiler optimizations:
• List generation removal
• Activation demotion
• Task fusion
• Dead store elimination
The remainder of this section details these optimizations.
This is the easiest whole-program optimization, yet it leads to significantly higher performance for sparse computation in certain cases. A list generation task takes a mask as input and outputs a list. Two list generation tasks with the same parent list and the same input mask output the same list, so we can eliminate one of them.
List generation removal not only saves unnecessary execution time on generating the sparse element lists, but also opens up opportunities for other optimizations. For example, if two struct-for tasks use the same list after list generation removal, task fusion may be able to fuse them.
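The redundancy test reduces to: has the mask of this SNode changed since the last list generation for it? A minimal sketch over a toy task stream, where `activate` models any mask-modifying task:

```python
# Drop a listgen task when the mask of its SNode is unchanged since the
# previous listgen for the same SNode. Tasks are (kind, snode) pairs.
def remove_redundant_listgen(tasks):
    mask_version = {}   # snode -> number of mask modifications seen so far
    last_listgen = {}   # snode -> mask version at the last kept listgen
    kept = []
    for kind, snode in tasks:
        if kind == "listgen":
            v = mask_version.get(snode, 0)
            if last_listgen.get(snode) == v:
                continue                 # same mask, same list: redundant
            last_listgen[snode] = v
        elif kind == "activate":         # a task that may modify the mask
            mask_version[snode] = mask_version.get(snode, 0) + 1
        kept.append((kind, snode))
    return kept

prog = [("listgen", "x"), ("struct_for", "x"),
        ("listgen", "x"), ("struct_for", "x"),   # mask unchanged: dropped
        ("activate", "x"), ("listgen", "x")]     # mask changed: regenerate
out = remove_redundant_listgen(prog)
```

This also shows why activation demotion (next subsection) matters: every avoided mask modification keeps `mask_version` stable and exposes more redundant list generations.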
Recall that Taichi has an activation-on-write mechanism. However, it is often the case that a sparse element was already activated before the task executes, so the element's activation is checked but it is not re-activated. This extra activation check not only creates diverging instruction flow on CPUs/GPUs that harms performance, but also marks a modification to the corresponding mask state, creating obstacles for list generation removal. Therefore, we should try to demote activating accesses to non-activating accesses.
Fortunately, many activations can be demoted by analyzing the task contexts. If two struct-for tasks are identical, their loop lists are the same, and the activation statement in the second task depends only on the loop indices, then the activation in the second task can be removed.
This is remarkably effective for repeated access patterns such as y[i // 2]. For example, in the restriction (downsampling) operator of multigrid solvers, it is common to have the following pattern (Fig. 10):

for i, j in x:
    y[i // 2, j // 2] += x[i, j] * 0.25

Fig. 10. The activation pattern of for i, j in x: y[i // 2, j // 2] += x[i, j] * 0.25. x is the grid on the left, and y is the grid on the right.
Our activation elimination optimizer can successfully infer that if the mask of 𝑥 has not changed, then the mask of 𝑦 will not change either. This avoids false-positive mask state modifications, and can further reduce the number of list generation tasks by 6.× in the MGPCG example.
Clearly, we need to know the data dependencies before we fuse (e.g., G2P2G, stencils, multigrid, and multi-channel advection with improved cacheline utilization). If all tasks are serial, tasks A and B can be fused if and only if there is no path of length ≥ 2 from A to B in the SFG; we also require that every access to that SNode is at the same address, and that the address is unique per iteration of the loop. To find all fusible pairs of tasks, we compute the transitive closure of the SFG using bitsets.
For pairs of tasks without edges, we group tasks by type, loop range (for range-fors), or SNode (for struct-fors). Within each group, we use the transitive closure to quickly find pairs of tasks that have no path between them. For each edge A → B in the SFG, we use the transitive closure to check whether there is a task C such that A has a path to C and C has a path to B, and apply the above check to decide whether A and B are fusible. This is very effective because we have many intraprocedural optimizations, but it may become time-consuming when there are too many tasks.
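The path test above can be sketched with a bitset transitive closure over a small DAG. This is a simplified model (nodes assumed already in topological order), not the system's actual data structures:

```python
# reach[u] is a bitmask of all nodes reachable from u; for an edge a -> b,
# fusion is blocked if some intermediate C lies on a longer path a -> C -> b.
def transitive_closure(n, edges):
    reach = [0] * n
    for u in range(n - 1, -1, -1):  # nodes assumed in topological order
        for a, b in edges:
            if a == u:
                reach[u] |= (1 << b) | reach[b]
    return reach

def edge_fusible(a, b, n, reach):
    # True iff no intermediate C has both a path a -> C and a path C -> b.
    return all(not ((reach[a] >> c) & 1 and (reach[c] >> b) & 1)
               for c in range(n) if c not in (a, b))

edges = {(0, 1), (1, 2), (0, 2)}   # 0 -> 2 directly, and 0 -> 1 -> 2
reach = transitive_closure(3, edges)
```

Here the edge 0 → 2 is not fusible because task 1 sits on the longer path 0 → 1 → 2, while the edge 1 → 2 has no such intermediate and passes the test. Packing reachability into machine words is what makes this check cheap for hundreds of tasks.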
We can also perform cross-kernel analysis with asynchronous computing. For example, ti.clear_all_gradients() may excessively zero-fill unrelated tensors, which can be eliminated with data-flow analysis. For convenience, a user may frequently zero-fill fields in Taichi to ensure data are correctly re-initialized. This is a typical source of dead stores. For cases like this, where a field is later completely overwritten, our optimizer can eliminate the earlier dead stores.
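The rule can be sketched as a backward scan over a toy task stream: a full overwrite of a field kills any earlier full store to it that no read observes in between. The `full_store`/`read` task kinds and the `grad_x` field name are illustrative:

```python
# Backward-scan dead store elimination over (kind, field) tasks.
def eliminate_dead_stores(tasks):
    kept = []
    overwritten = set()   # fields fully overwritten later, with no read since
    for kind, f in reversed(tasks):
        if kind == "full_store":
            if f in overwritten:
                continue              # a later full store makes this one dead
            overwritten.add(f)
        else:                         # "read": earlier stores become live
            overwritten.discard(f)
        kept.append((kind, f))
    return kept[::-1]

prog = [("full_store", "grad_x"),     # dead: overwritten before any read
        ("full_store", "grad_x"),
        ("read", "grad_x"),
        ("full_store", "grad_x")]     # kept: it is the final value
out = eliminate_dead_stores(prog)
```

This captures the zero-fill scenario from the text: two consecutive gradient clears with no read in between collapse into one.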
The whole-program optimizations are relatively simple to implement, but extra attention was paid to the infrastructure that supports them. In this section, we briefly cover implementation details that we empirically find to have a direct impact on performance.
We implement an asynchronous execution engine that performs parallel compilation and kernel fusion. We store all tasks in a queue until synchronization, which may happen when there is anything we need to output or when the Python program embedding Taichi ends. When synchronization happens, we check, for each task, which tensors' sparsity is changed, and remove a list generation task when there is a previous one for the same tensor with no sparsity-changing task in between. After removing redundant list generation tasks, we check each pair of adjacent tasks to see if they can be fused. For now, we only fuse tasks that loop over the active elements of the same tensor, or tasks that are simple parallel loops with the same range.
Since a kernel can be launched many times with the same IR, we store all IRs in an IR bank to avoid repeated passes on the IR and to improve asynchronous compilation performance. We use IR handles to access IRs in the bank. An IR handle consists of a pointer to the IR and the hash of the IR. We assign an IR handle to each task, and whenever we are about to modify an IR, we check whether we have already done so in the IR bank, where we cache the results of IR optimization passes such as fusion, activation elimination, and dead store elimination. If the result is not cached, we copy the IR on write to avoid corrupting the IR in the bank, perform the modification, store the modified IR into the bank, and then cache the mapping from the pre-modification IR handle to the post-modification IR handle. We also cache in the bank some data that does not require modifying the IR, such as the task metadata of the IR.
To achieve better performance after kernel fusion, we need an optimization pass on the fused task. As the Taichi intermediate representation (IR) is relatively hierarchical, we build a data-flow graph for data-flow analysis, performing optimizations across the whole kernel, including store-to-load forwarding, dead store elimination, and identical store/load elimination. For example, in Figure 9, on CPUs we demote atomic addition operations into loads, adds, and stores; with store-to-load forwarding, we can replace the load of the second atomic addition (x[i] += 2) with the result of the first atomic addition (x[i] += 1), and with other optimizations obtain the final result as if the input were x[i] += 3. More details on intra-kernel data-flow optimizations can be found in the Appendix.
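The caching scheme can be sketched as follows. Here IRs are stand-in strings and passes are plain functions; the class and method names are illustrative, not Taichi's internals:

```python
# A toy IR bank: IRs are interned under a content hash ("IR handle"), and
# pass results are cached as (pass, input handle) -> output handle mappings,
# so repeated launches of the same kernel never re-run a pass.
import hashlib

class IRBank:
    def __init__(self):
        self.ir = {}           # handle -> IR text
        self.pass_cache = {}   # (pass_name, input handle) -> output handle

    def intern(self, ir_text):
        h = hashlib.sha256(ir_text.encode()).hexdigest()
        self.ir[h] = ir_text
        return h

    def run_pass(self, pass_name, handle, transform):
        key = (pass_name, handle)
        if key not in self.pass_cache:
            # Copy-on-write: transform a copy, intern it, cache the mapping.
            self.pass_cache[key] = self.intern(transform(self.ir[handle]))
        return self.pass_cache[key]

bank = IRBank()
h = bank.intern("x[i] += 1; x[i] += 2")
h1 = bank.run_pass("fuse", h, lambda ir: "x[i] += 3")
h2 = bank.run_pass("fuse", h, lambda ir: "x[i] += 3")  # cache hit, no rerun
```

Because handles are content hashes, two tasks that happen to share the same IR automatically share every cached pass result as well.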
Metrics.
On each test case, we evaluate performance with four metrics: execution time on the backend, number of tasks launched, number of instructions emitted, and number of tasks compiled. Each case is executed multiple times on CPU (x64) and GPU (CUDA), with a synchronization after each run in asynchronous mode, and the average running time is recorded.
Benchmark cases.
We constructed 10 simple yet indicative microbenchmarks (tens of lines of code each) to unit-test specific whole-program optimizations. Three more complex test cases (hundreds of lines of code each) test the behavior of our optimizer on real-world programs.
We constructed 10 microbenchmark cases to unit-test the system. The results are promising: without code modification, the new system leads to 3.× fewer kernel launches on GPUs and a 2.× speedup on our benchmarks, as shown in Fig. 11. More details on the microbenchmarks are discussed in the appendix.

MacCormack. In this benchmark case we use the MacCormack advection scheme [Selle et al. 2008] with RK3 path integration to advect three scalar physical fields. We follow recent trends in using collocated grids (see, e.g., [Gagniere et al. 2020; Nielsen and Bridson 2016]) to improve cache-line utilization. On CUDA, our optimizer leads to a 1.52× performance boost, with 3.75× fewer tasks launched on both backends (Table 1). The improved performance and reduced task count in this benchmark come from the task fusion optimization and list generation removal.

MGPCG. In this benchmark we use a sparsely populated region in a 512 × … grid. The number of tasks launched drops from 880,820 to 177,614 (4.96× fewer). This is because the restriction, smoothing, and prolongation operations lead to 351,440 redundant list generation tasks, which are reduced to 343 (1025× fewer) by our list generation removal and activation demotion. Note that the CUDA speedup (3.34×) is much higher than the x64 speedup (1.07×), likely because parallel task launches are relatively more expensive on GPUs than on CPUs, and the majority of the speedup in this benchmark comes from eliminating small kernels such as list generation and clearing. The task fusion pass is also able to fuse the Jacobi smoothing and reduction kernels, leading to improved memory performance.

AutoDiff energy. We implemented MLS-MPM [Hu et al. 2018] with Lagrangian forces [Jiang et al. 2015]. In the simulation, the structure is modeled using triangular meshes and a Neo-Hookean hyperelastic model. The force f_i on particle i is by definition f_i = −∂L(x)/∂x_i. Since manually deriving the partial derivative on the right-hand side is error-prone, we rely on Taichi's automatic differentiation system [Hu et al. 2020]. The key optimization opportunity is the following code:

    with ti.Tape(total_energy):
        compute_total_energy()
The code above performs the forward computation of the total energy L(x), and then automatically evaluates x.grad, which is essentially ∂L(x)/∂x_i. In the majority of cases, the result of the total energy L is not used, so by looking at the whole program our optimizer can automatically eliminate the forward computation and perform only the backward gradient evaluation. Whole-program dead store elimination plays the most important role in this benchmark case.

An interesting observation is that our system gets a significantly higher speedup on CUDA than on x64. This is because the particle-to-grid (P2G) transfer step plays different roles in the total time consumption. P2G requires atomic adds, which are relatively cheap on CUDA (native hardware support) yet expensive on x64 (software compare-and-swap). As a result, with our whole-program optimization on, P2G takes 51% of the run time on x64, yet only 7% on CUDA. The forward total energy computation, which is optimized out, therefore occupies a smaller fraction of the run time on x64 (since P2G remains the bottleneck), hence the smaller speedup there.

Table 1. Benchmarks against the original Taichi system [Hu et al. 2019]. In [Hu et al. 2019] the benchmarks are done against state-of-the-art manually engineered CPU and GPU implementations. Benchmarks are done on a system with a quad-core Intel Core i7-6700K CPU with 32 GB of memory, and a GTX 1080 Ti GPU with 12 GB of GRAM. The geometric mean of the execution time improvement is 1.73×; the reduction in tasks launched is 3.07×.

Cases            Backend  Execution time (s)   Tasks launched      Instructions emitted  Tasks compiled
                          Reference   Ours     Reference   Ours    Reference   Ours      Reference  Ours
MacCormack       x64      8.973       8.899    9001        2401    8308        8210      96         30
MacCormack       CUDA     0.477       0.313    9001        2401    8308        8210      96         30
MGPCG            x64      16.222      15.188   880820      177614  2808        3166      189        96
MGPCG            CUDA     6.084       1.823    880820      177614  3234        3299      189        96
AutoDiff energy  x64      17.171      15.799   88204       56604   1353        2145      23         32
AutoDiff energy  CUDA     2.688       0.588    88204       56604   1353        2375      23         33
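The dead-forward-pass elimination in the AutoDiff energy benchmark can be illustrated with a toy reverse-mode sketch in plain Python. This is not Taichi's autodiff: the energy L(x) = Σᵢ xᵢ² and every name below are illustrative assumptions chosen so that the forward and backward passes are visibly independent.

```python
def grad_energy(x, need_loss=False):
    """Gradient of the toy energy L(x) = sum(x_i^2).

    For this L, dL/dx_i = 2 * x_i, so the backward pass never reads the
    forward value L itself. When nothing consumes the loss, the forward
    reduction is a dead store and can be skipped entirely -- mirroring
    what whole-program dead store elimination does to ti.Tape programs.
    """
    grad = [2.0 * xi for xi in x]                            # backward pass
    loss = sum(xi * xi for xi in x) if need_loss else None   # forward pass, often dead
    return loss, grad
```

In the benchmark, the analogous decision is made automatically: the optimizer observes that the total energy tensor is never loaded afterwards and drops the forward kernels.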
We have presented a whole-program optimization framework with an asynchronous execution engine in Taichi, tailored for spatially sparse programming and differentiable programming. In our test cases, we get 1.01–4.57× performance improvements (Table 1) without requiring users to change any code. Taichi's spatially sparse programming patterns open up new opportunities for whole-program optimizations. For example, we successfully removed redundant list generation tasks for sparse data structures, and detected sparse data structure accesses that, given their context, can only touch already-activated elements.

We believe our optimizer can greatly improve the productivity and performance of Taichi programs, since programmers can write Taichi code more flexibly without worrying about redundant underlying tasks. We also hope our whole-program optimization framework can help optimize other programming systems in the near future.

REFERENCES
James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. 2018. JAX: composable transformations of Python+NumPy programs. http://github.com/google/jax
Stephen Chou, Fredrik Kjolstad, and Saman Amarasinghe. 2018. Format abstraction for sparse tensor algebra compilers. Proceedings of the ACM on Programming Languages 2, OOPSLA (2018), 1–30.
Roy Frostig, Matthew James Johnson, and Chris Leary. 2018. Compiling machine learning programs via high-level tracing. Systems for Machine Learning (2018).
Steven W Gagniere, David AB Hyde, Alan Marquez-Razon, Chenfanfu Jiang, Ziheng Ge, Xuchen Han, Qi Guo, and Joseph Teran. 2020. A Hybrid Lagrangian/Eulerian Collocated Advection and Projection Method for Fluid Simulation. arXiv preprint arXiv:2003.12227 (2020).
Ming Gao, Xinlei Wang, Kui Wu, Andre Pradhana-Tampubolon, Eftychios Sifakis, Cem Yuksel, and Chenfanfu Jiang. 2018. GPU Optimization of Material Point Methods. ACM Trans. Graph. (Proc. SIGGRAPH Asia) 37, 6 (2018), 254.
Rama Karl Hoetzlein. 2016. GVDB: Raytracing sparse voxel database structures on the GPU. In Proceedings of High Performance Graphics. Eurographics Association, 109–117.
Yuanming Hu, Luke Anderson, Tzu-Mao Li, Qi Sun, Nathan Carr, Jonathan Ragan-Kelley, and Frédo Durand. 2020. DiffTaichi: Differentiable Programming for Physical Simulation. ICLR (2020).
Yuanming Hu, Yu Fang, Ziheng Ge, Ziyin Qu, Yixin Zhu, Andre Pradhana, and Chenfanfu Jiang. 2018. A moving least squares material point method with displacement discontinuity and two-way rigid body coupling. ACM Trans. Graph. (Proc. SIGGRAPH) 37, 4 (2018), 150.
Yuanming Hu, Tzu-Mao Li, Luke Anderson, Jonathan Ragan-Kelley, and Frédo Durand. 2019. Taichi: a language for high-performance computation on spatially sparse data structures. ACM Transactions on Graphics (TOG) 38, 6 (2019), 201.
Chenfanfu Jiang, Craig Schroeder, Andrew Selle, Joseph Teran, and Alexey Stomakhin. 2015. The affine particle-in-cell method. ACM Transactions on Graphics (TOG) 34, 4 (2015), 1–10.
Fredrik Kjolstad, Shoaib Kamil, Stephen Chou, David Lugato, and Saman Amarasinghe. 2017. The tensor algebra compiler. Proceedings of the ACM on Programming Languages 1, OOPSLA (2017), 1–29.
Kathleen Knobe and Vivek Sarkar. 1998. Array SSA form and its use in parallelization. In Proceedings of the 25th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. 107–120.
Mingzhen Li, Yi Liu, Xiaoyan Liu, Qingxiao Sun, Xin You, Hailong Yang, Zhongzhi Luan, and Depei Qian. 2020. The Deep Learning Compiler: A Comprehensive Survey. arXiv preprint arXiv:2002.03794 (2020).
Haixiang Liu, Yuanming Hu, Bo Zhu, Wojciech Matusik, and Eftychios Sifakis. 2018. Narrow-band Topology Optimization on a Sparsely Populated Grid. ACM Trans. Graph. (Proc. SIGGRAPH Asia) 37, 6 (2018), 251:1–251:14.
Dror E Maydan, Saman P Amarasinghe, and Monica S Lam. 1993. Array-data flow analysis and its use in array privatization. In Proceedings of the 20th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. 2–15.
Ken Museth, Jeff Lait, John Johanson, Jeff Budsberg, Ron Henderson, Mihai Alden, Peter Cucka, David Hill, and Andrew Pearce. 2013. OpenVDB: an open-source data structure and toolkit for high-resolution volumes. In ACM SIGGRAPH 2013 Courses. 1–1.
Michael B Nielsen and Robert Bridson. 2016. Spatially adaptive FLIP fluid simulations in Bifrost. In ACM SIGGRAPH 2016 Talks. ACM, 41.
Nadav Rotem, Jordan Fix, Saleem Abdulrasool, Garret Catron, Summer Deng, Roman Dzhabarov, Nick Gibson, James Hegeman, Meghan Lele, Roman Levenstein, et al. 2018. Glow: Graph lowering compiler techniques for neural networks. arXiv preprint arXiv:1805.00907 (2018).
Andrew Selle, Ronald Fedkiw, Byungmoon Kim, Yingjie Liu, and Jarek Rossignac. 2008. An unconditionally stable MacCormack method. Journal of Scientific Computing 35, 2-3 (2008), 350–371.
Rajsekhar Setaluri, Mridul Aanjaneya, Sean Bailey, and Eftychios Sifakis. 2014. SPGrid: A sparse paged grid structure applied to adaptive smoke simulation. ACM Trans. Graph. (Proc. SIGGRAPH Asia) 33, 6 (2014), 205.
Kui Wu, Nghia Truong, Cem Yuksel, and Rama Hoetzlein. 2018. Fast fluid simulations with sparse volumes on the GPU. In Computer Graphics Forum (Proc. Eurographics), Vol. 37. Wiley Online Library, 157–167.
APPENDIX

Microbenchmark cases
Fig. 11. Microbenchmarks in synchronous/asynchronous mode. On average, a .× performance boost is achieved on the execution time (exec_t) metric.

Here we describe the cases in the microbenchmarks.

chain_copy contains 2 kernels, y[i] = x[i] + 1 and z[i] = y[i] + 4, as in Figure 8. They are fused in asynchronous mode.

increments contains 10 inc() kernels from Figure 8.

fill_array contains 10 kernels, all filling a 1-D dense array with the same constant value. With task fusion, only 1 task is launched in these cases instead of 10, and the running time is nearly 10× faster.

sparse_saxpy contains kernels performing saxpy (Scalar Alpha X Plus Y) operations on sparse tensors. The execution-time boost comes from the elimination of list generation and from task fusion. The wall-clock time is sometimes slower than in synchronous mode because of the overhead of the asynchronous engine.

autodiff computes a loss function as a reduction over an array and accumulates the gradients into another array, 10 times. With dead store elimination, all forward tasks computing the loss function except the last one are eliminated, so the number of launched tasks is reduced by roughly half.

stencil_reduction performs stencil and reduction operations on a tensor; these are common operations in computer graphics.

mpm_splitted contains several substep() kernels of an MPM program [Hu et al. 2018].

simple_advection performs semi-Lagrangian advection 10 times. The performance boost comes from activation elimination and task fusion.

multires is a multi-resolution program downsampling over 4 levels.

deep_hierarchy contains 5 jitter() kernels that compute x[i] += x[i + 1] when i % 2 == 0. The tasks are not fusible, but we can still get some performance boost by eliminating list generation tasks.
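As a concrete check of the chain_copy fusion, the two kernels compose into a single pass. The sketch below is plain Python (lists stand in for Taichi tensors; the function names are illustrative) and verifies that the fused form z[i] = x[i] + 5 matches running the two kernels back to back:

```python
def chain_copy_unfused(x):
    y = [xi + 1 for xi in x]   # kernel 1: y[i] = x[i] + 1
    z = [yi + 4 for yi in y]   # kernel 2: z[i] = y[i] + 4
    return z

def chain_copy_fused(x):
    # fused task: z[i] = x[i] + 5; the intermediate y is never materialized
    return [xi + 5 for xi in x]
```

Besides halving the number of launched tasks, fusion here removes an entire round trip of y through memory.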
Intra-kernel data-flow optimizations
We apply traditional control-flow analysis to optimize within kernels. We build a control-flow graph alongside the hierarchical IR and perform analysis on the graph. With the help of control-flow analysis, we perform optimizations including store-to-load forwarding, dead store elimination, and identical load/store elimination. These optimizations motivate task fusion, as they greatly simplify fused tasks.

We also utilize control-flow analysis to help compute task meta information. Since stores to a SNode may only partially modify a value state, the resulting value state (which contains both the modified and unmodified parts) may need a read from the previous version of the value state. We use control-flow analysis to detect which SNodes do not need such a read.

Figure 12 shows the effect of data-flow optimization on 360 Taichi test cases. Although these test cases are relatively simple, data-flow optimization still leads to 16% fewer instructions.
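The store-to-load forwarding example from Section 3 (two fused atomic adds on CPU) can be sketched with illustrative instruction tuples, not the real Taichi IR. Each atomic add is first demoted to an explicit load/add/store triple; forwarding then folds a load that immediately follows a store to the same address, which merges the two additions and makes the intermediate store dead (valid here because, after demotion, the sketch models a single serialized thread):

```python
def demote_atomics(prog):
    """Lower ('atomic_add', addr, c) into explicit load/add/store (CPU path)."""
    out = []
    for op in prog:
        if op[0] == "atomic_add":
            _, addr, c = op
            out += [("load", addr), ("add", c), ("store", addr)]
        else:
            out.append(op)
    return out

def forward_stores(prog):
    """Store-to-load forwarding plus dead store elimination on straight-line code."""
    out = []
    for op in prog:
        if out and op[0] == "load" and out[-1] == ("store", op[1]):
            out.pop()        # the intermediate store becomes dead once forwarded
            continue         # the load re-reads the value already in the register
        if out and op[0] == "add" and out[-1][0] == "add":
            out[-1] = ("add", out[-1][1] + op[1])   # merge consecutive additions
            continue
        out.append(op)
    return out

# x[i] += 1 followed by x[i] += 2 collapses to a single load/add 3/store,
# as if the input had been x[i] += 3.
prog = demote_atomics([("atomic_add", "x", 1), ("atomic_add", "x", 2)])
optimized = forward_stores(prog)
```

A production pass would work over the control-flow graph rather than a flat list, but the local rewrite is the same.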