COMET: A Domain-Specific Compilation of High-Performance Computational Chemistry
Erdal Mutlu, Ruiqin Tian, Bin Ren, Sriram Krishnamoorthy, Roberto Gioiosa, Jacques Pienaar, Gokcen Kestor
LCPC'20, October 2020, Stony Brook University, NY

Erdal Mutlu, Ruiqin Tian
Pacific Northwest National Laboratory, Richland, WA, USA
{erdal.mutlu,ruiqin.tian}@pnnl.gov

Bin Ren
William & Mary, Williamsburg, VA, USA
[email protected]

Sriram Krishnamoorthy
Pacific Northwest National Laboratory, Richland, WA, USA
[email protected]

Roberto Gioiosa
Pacific Northwest National Laboratory, Richland, WA, USA
[email protected]

Jacques Pienaar
Google, Mountain View, CA, USA
[email protected]

Gokcen Kestor
Pacific Northwest National Laboratory, Richland, WA, USA
[email protected]
ABSTRACT
The computational power increases of the past decades have greatly enhanced our ability to simulate chemical reactions and to understand ever more complex transformations. Tensor contractions are the fundamental computational building block of these simulations. These simulations have often been tied to one platform and restricted in generality by the interface provided to the user. The expanding prevalence of accelerators and researcher demands necessitate a more general approach that is neither tied to specific hardware nor requires contorting algorithms to specific hardware platforms. In this paper we present COMET, a domain-specific programming language and compiler infrastructure for tensor contractions targeting heterogeneous accelerators. We present a system of progressive lowering through multiple layers of abstraction and optimization that achieves up to 1.× speedup for 30 tensor contractions commonly used in computational chemistry and beyond.

1 INTRODUCTION

The recent slowdown of growth in realized multi-core performance of commodity microprocessors has pushed vendors and users to consider more specialized architectures, including GPGPUs, FPGAs, and systems-on-chip. Several domains, such as artificial intelligence, have experienced an explosion of highly specialized heterogeneous architectures, including Google TPUs [10], NVIDIA DLA, and Intel Nervana. With such a large variety of architectures, performance portability and productivity have become as important as peak performance, if not more so. On one side, scientists and engineers have moved towards high-level, domain-specific languages (DSLs) that facilitate implementing complex algorithms and allow them to focus on the algorithm's details rather than the specific idiosyncrasies of the underlying architectures.
On the other side, to achieve optimal performance on modern architectures, it is imperative to exploit hardware features by writing highly specialized, low-level, architecture-dependent kernels. This struggle for balance is not a simple one to resolve. A one-to-one mapping between each DSL and each architecture is impractical and expensive to maintain. Instead, researchers have looked into ways to abstract the domain-specific aspects of an implementation from the architecture-specific ones and to identify intermediate representations (IRs) that realize such abstractions. For example, LLVM [12] maps multiple front-end programming languages (e.g., C, C++, and Fortran) to LLVM IR, and then maps this IR to various target architectures. On the other hand, a generalized IR also means that domain-specific information and semantics are lost while lowering the code. For example, there is no simple way to express in LLVM IR common operations such as generic matrix-matrix multiplication (GEMM) or 2D convolution. It follows that domain-specific optimizations are impractical to perform at such a low-level IR, which can introduce performance loss. To overcome this limitation, modern systems (e.g., TensorFlow [1], Rust, and Halide [14]) propose high-level IRs where domain-specific optimizations are performed before lowering the code to the lower IRs.

In this work, we introduce COMET, a novel compiler infrastructure and programming language for tensor algebra targeting high-performance computational chemistry. COMET increases productivity by providing very high-level programming abstractions that resemble Einstein notation [3], performs sophisticated domain-specific code optimizations and rewriting, and generates code amenable to execution on heterogeneous architectures. COMET is based on the Multi-Level Intermediate Representation (MLIR) recently introduced by Google to simplify writing new compiler infrastructures.
In the COMET multi-level IR, domain-specific, application-dependent optimizations are performed at the higher levels of the IR stack, where operations resemble programming-language abstractions and can be optimized based on the operations' semantics. Generic, architecture-specific optimizations are, instead, performed at the lower levels, where simpler operations are mapped to the memory hierarchy and to the processor's registers. A distinct advantage of a compiler-based approach over library-based solutions [9, 21, 22] is that COMET can handle arbitrary multi-operand tensor expressions and perform state-of-the-art algorithmic and code optimizations. Our performance results indicate that the code automatically generated by COMET is on par with or better than manually optimized tensor contractions from TCCG [20] that leverage state-of-the-art computational libraries, such as BLIS [22] and
HPTT [21], and achieves up to 1.× speedup over a set of 30 contractions. COMET provides enough expressiveness and completeness to implement two complex methods (coupled-cluster singles and doubles excitation equations) from the NorthWest Chemistry (NWChem) [2] computational chemistry package, which together consist of 154 expressions and a total of 417 tensor contractions, and achieves up to 23.9× speedup over executing the tensor contractions in the natural order of the expression. Additionally, by pairing our compiler with hardware accelerator simulators, COMET can be used to study novel data-parallel hardware designs and their impact on entire chemistry methods. To the best of our knowledge, COMET is the first modular compiler framework that allows researchers to express complex tensor expressions, performs domain- and architecture-specific code optimizations and transformations that outperform hand-tuned solutions, and can be used for hardware-software co-design. In summary, we make the following contributions:

• COMET, a novel compiler and DSL for tensor algebra that specifically targets chemistry applications with support for multi-operand expressions;
• a multi-level IR, a set of progressive lowering steps, and a series of domain- and architecture-specific optimizations to generate efficient code;
• a new co-design methodology to study custom accelerators;
• a comparison with state-of-the-art, manually implemented tensor contraction benchmarks that leverage highly optimized computational libraries.

2 BACKGROUND

Tensor contractions are high-dimension analogs of matrix multiplications widely used in many scientific and engineering domains, including deep learning, quantum chemistry, and finite-element methods. For example, the perturbative triples correction in the coupled-cluster CCSD(T) [15] method used in the NWChem computational chemistry framework [2] produces a 6D output tensor from two 4D input tensors.
Tensor contractions are computationally intensive and dominate the execution time of many computational applications. Because of their wide application in many fields, tensor contractions have been widely studied and optimized. Consider the following tensor contraction, expressed using Einstein notation [3], where two 4D tensors, A and B, are contracted to produce a 4D tensor C:

C[a,b,c,d] = A[a,e,b,f] * B[d,f,c,e]    (1)

In this contraction, the indices e, f appear in both right-hand tensors but not in the left-hand tensor C (summation or contraction indices). The indices a, b, c, d appear in exactly one of the two input tensors and in the output tensor (external or free indices). A tensor contraction is, thus, the contraction of the two input tensors A and B over the contraction indices e, f:

C[a,b,c,d] = Σ_{e,f} A[a,e,b,f] * B[d,f,c,e]    (2)
Figure 1: COMET execution flow and compilation pipeline
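Equation (2) maps directly onto a loop nest: one loop per free index and one per contraction index. A minimal sketch (the dimension sizes are illustrative, and np.einsum serves as the correctness reference):

```python
import numpy as np

def contract_naive(A, B):
    """Direct nested-loop evaluation of
    C[a,b,c,d] = sum over e,f of A[a,e,b,f] * B[d,f,c,e]."""
    da, de, db, df = A.shape          # A is laid out as [a,e,b,f]
    dd, _, dc, _ = B.shape            # B is laid out as [d,f,c,e]
    C = np.zeros((da, db, dc, dd))
    for a in range(da):
        for b in range(db):
            for c in range(dc):
                for d in range(dd):
                    for e in range(de):
                        for f in range(df):
                            C[a, b, c, d] += A[a, e, b, f] * B[d, f, c, e]
    return C

A = np.random.rand(3, 4, 5, 6)   # A[a,e,b,f]
B = np.random.rand(7, 6, 8, 4)   # B[d,f,c,e]
assert np.allclose(contract_naive(A, B), np.einsum('aebf,dfce->abcd', A, B))
```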
A naïve way to perform the above computation is to lower it directly to a nested-loop implementation. Such implementations have been shown to be inefficient due to poor data locality. A more efficient approach, commonly used in modern high-performance tensor libraries, leverages highly optimized GEMM engines. This approach, often referred to as transpose-transpose-GEMM-transpose (TTGT), performs permutations of the input tensors, followed by a high-performance matrix-matrix multiplication, and a final permutation to reconstruct the output tensor. The first two transposes "flatten" a multi-dimensional tensor into a 2D matrix by first permuting the indices so that they are contiguous in memory (A → T_A) and then merging pairs of consecutive indices to form lower-dimensional tensors ((a,b) → i, (e,f) → j, (d,c) → k):

A[a,e,b,f] → T_A[a,b,e,f] = A_p[i,j];  B[d,f,c,e] → T_B[e,f,d,c] = B_p[j,k]

The tensor contraction expressed in Equation 1 can then be expressed as

C_p[i,k] = A_p[i,j] * B_p[j,k]    (3)

where C_p[i,k] = T_C[a,b,d,c] → C[a,b,c,d]. The TTGT method performs tensor contractions efficiently despite the overhead of up to three additional permutations: highly optimized GEMM operations perform considerably better than nested-loop implementations on modern architectures and exploit high data locality (see Section 6). In this work, we also consider employing custom accelerators to perform even more efficient GEMM operations; thus our compiler produces target code that is optimized for and amenable to such accelerators (Sections 6 and 7).
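The TTGT pipeline above can be sketched end-to-end in NumPy (the dimension sizes are illustrative, and np.einsum serves as the correctness reference):

```python
import numpy as np

a, b, c, d, e, f = 4, 5, 6, 7, 8, 9
A = np.random.rand(a, e, b, f)        # A[a,e,b,f]
B = np.random.rand(d, f, c, e)        # B[d,f,c,e]

# Reference: C[a,b,c,d] = sum_{e,f} A[a,e,b,f] * B[d,f,c,e]
C_ref = np.einsum('aebf,dfce->abcd', A, B)

# TTGT: transpose-transpose-GEMM-transpose
TA = A.transpose(0, 2, 1, 3)          # A[a,e,b,f] -> T_A[a,b,e,f]
TB = B.transpose(3, 1, 0, 2)          # B[d,f,c,e] -> T_B[e,f,d,c]
Ap = TA.reshape(a * b, e * f)         # merge (a,b) -> i, (e,f) -> j
Bp = TB.reshape(e * f, d * c)         # merge (e,f) -> j, (d,c) -> k
Cp = Ap @ Bp                          # GEMM: C_p[i,k] = A_p[i,j] * B_p[j,k]
TC = Cp.reshape(a, b, d, c)           # unflatten to T_C[a,b,d,c]
C = TC.transpose(0, 1, 3, 2)          # final permutation -> C[a,b,c,d]

assert np.allclose(C, C_ref)
```

The two input permutations, the single GEMM, and the final output permutation correspond exactly to the three "extra" transposes whose cost the TTGT method amortizes against the efficiency of the GEMM.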
3 THE COMET COMPILER

Our proposed compiler infrastructure consists of a DSL for tensor algebra computations, a progressive lowering process to map high-level operations to low-level architectural resources, a series of optimizations performed at each step of the lowering process, and various IR dialects to represent key concepts, operations, and types at each level of the multi-level IR. COMET is based on the MLIR framework [11], a compiler infrastructure to
⟨ilabel⟩ ::= IndexLabel ⟨id-list⟩ = ⟨range⟩;
⟨range⟩ ::= [int] | [int : int : int]
⟨tensor⟩ ::= Tensor<⟨element-type⟩> ⟨id-list⟩;
⟨id-list⟩ ::= ⟨id⟩ | ⟨id⟩, ⟨id-list⟩
⟨element-type⟩ ::= int | double | float
⟨id⟩ ::= any object identifier

(a) Tensor label grammar

⟨tensor-op⟩ ::= ⟨op-lhs⟩ = ⟨op-rhs⟩ | ⟨op-lhs⟩ += ⟨op-rhs⟩ | ⟨op-lhs⟩ -= ⟨op-rhs⟩
⟨op-lhs⟩ ::= ⟨label-tensor⟩
⟨op-rhs⟩ ::= ⟨alpha⟩ | ⟨label-tensor⟩ | ⟨alpha⟩ * ⟨label-tensor⟩ | ⟨alpha⟩ * ⟨label-tensor⟩ * ⟨label-tensor⟩ | ⟨op-rhs⟩ * ⟨label-tensor⟩
⟨label-tensor⟩ ::= ⟨tensor-id⟩(⟨label-list⟩)
⟨alpha⟩ ::= tensor value type

(b) Tensor operations grammar

Figure 2: Tensor label and operation grammar.

build reusable and extensible compilers. MLIR supports the compilation of high-level abstractions and domain-specific constructs and provides a disciplined, extensible compiler pipeline with gradual and partial lowering. Users can build domain-specific compilers and customized IRs, as well as combine existing IRs, opting in to optimizations and analyses.

Figure 1 shows the compilation pipeline of COMET. Users express their computation using a high-level tensor algebra language (Section 4). The language operators, types, and structures are first mapped to an abstract syntax tree and then to the tensor algebra (TA) dialect, the first dialect in the COMET IR stack. The TA dialect contains domain-specific concepts, such as multi-dimensional tensors, contractions, and tensor expressions. Here, several domain-specific optimizations are performed, such as reformulating tensor contractions using the TTGT method. Next, COMET lowers the TA dialect representation of the computation to a mixed dialect based on the linear algebra (LinAlg) and Affine loop dialects.
High-level concepts, such as tensor contractions, are replaced with more general operations (transpose, matrix multiplication, affine maps, etc.). At this stage, there is a departure from domain-specific concepts, but the operations are still architecture-independent. The next step consists of further lowering the LinAlg operations to the (Affine) Loop dialect: at this stage, COMET performs architecture-specific optimizations and requires information about the specific target. GEMMs are tiled to fit matrix slices into the processor's data caches as well as to map computation to the processor's registers. The innermost GEMM computation is performed using an architecture-specific, highly optimized micro-kernel (Section 6) or (simulated) custom accelerators (Section 7). Finally, COMET lowers the program to the Standard dialect and then to the LLVM dialect, which is mapped to LLVM IR and lowered to machine instructions for execution.
4 THE TENSOR ALGEBRA LANGUAGE

We developed a high-level Tensor Algebra (TA) DSL for tensor algebra computation. As for any DSL, the goal of the COMET language is to allow scientists 1) to express concepts and operations in a form that closely resembles familiar notations and 2) to convey domain-specific information to the compiler for better program optimization. Our language represents Einstein mathematical notation and provides users with an interface to express tensor algebra semantics.

Figure 2a describes the tensor structures and how they are represented and constructed in our DSL. A tensor object refers to a multi-dimensional array of arithmetic values that can be accessed by using indexing values. Range-based index label constructs (⟨ilabel⟩) represent the range of indices expressed through a scalar, a range, or a range with an increment. Index labels can be used both for constructing a tensor (⟨tensor⟩) and for representing a tensor operation (⟨label-tensor⟩). In a tensor construction, index labels represent each dimension's size. In the context of a tensor operation, they represent slicing information of the tensor object to which the operation will be applied. Figure 2b shows the grammar for the supported tensor operations. Similar to tensor construction, index labels are the main construct for representing operations on tensors. Each tensor operation production rule (⟨tensor-op⟩) is composed of a left-hand side (lhs) and a right-hand side (rhs). While the lhs can only be a labeled tensor construct, the rhs can take several forms: a scalar alpha value, a labeled tensor (optionally scaled by alpha), a scaled product of two labeled tensors, or a longer chain of labeled-tensor products.
Figure 3: Example tensor contraction in the COMET DSL (a) and its corresponding TA dialect (b).

Figure 3a shows a general tensor contraction implemented in the COMET TA language, which allows users to express high-level concepts with familiar notations. Line 3 declares index labels representing the size of each tensor dimension as well as the operation labels for describing the tensor fill and the tensor contraction. The TA language supports defining multiple IndexLabel variables in a single statement (similar to structured bindings in C++17). Tensors are then constructed using the IndexLabels and the element type (lines 6-8), filled, and contracted over the indices [e,f] (line 16).
5 THE TENSOR ALGEBRA DIALECT

The Tensor Algebra dialect is the first dialect in the COMET compiler stack (Figure 1). The main goal of this dialect is to represent the basic building blocks of tensor algebra computation, describe tensor-algebra-specific types and operations, and capture the semantic information expressed through the TA DSL.

Figure 3b shows the TA dialect representation of the tensor contraction example shown in Figure 3a. We define new operations in the TA dialect that correspond to the semantics of each TA DSL operation. A ta.index_label corresponds to an IndexLabel construct in the TA language, while a ta.labeled_tensor corresponds to the LabeledTensor construct. Figure 3b shows how an index label operation is constructed using a range (!ta.range) type. New tensors are declared with the ta.tensor_decl operation, which takes as input the index labels for each dimension and the data type. The ta.labeled_tensor operation represents a slice of a tensor that is being used in any operation that references it. This operation takes a tensor declaration and a set of index label references as inputs to construct a sliced version of the tensor.

Three classes of tensor operations are currently supported: unary (fill, copy, set), binary (contraction), and multi-operand expressions (contraction chains). ta.fill initializes a tensor object with a single value provided as an attribute to the operation. ta.copy performs an element-wise copy between two tensors, scaling the output tensor by a factor alpha. ta.set operates similarly to ta.copy but takes as input the result of a binary operation instead of a tensor; this operation is used to support multi-operand tensor contractions. Tensor contractions (ta.tc) take as input the input and output tensors, the scaling value alpha, and the indexing maps for the labels used in the contraction.

For multi-operand expressions that involve several contractions, we introduce a utility operation (ta.mult) that represents a binary operation. The actual computation of a multi-operand expression includes the calculation of intermediates followed by the actual tensor contractions. We represent the order of binary operations with a binary tree and assign results to the output tensor using ta.set.

6 CODE OPTIMIZATIONS

The main advantage of a multi-level IR is that different kinds of optimizations can be applied at each level of the IR stack, and optimizations can be shared and reused across different stacks. In the COMET compiler, we apply domain-specific optimizations at the TA dialect level, general optimizations at the LinAlg dialect level, and architecture-specific optimizations at the lower levels.
In the following, we explain our optimizations from the top IR level down, while lowering the code.
TTGT
As discussed in Section 2, tensor contractions can be reformulated by transposing the multi-dimensional input tensors into 2D matrices, performing a GEMM operation, and unflattening the output tensor back to its original form. Although this approach incurs the additional overhead of the transpose operations, the benefit of employing highly optimized GEMM kernels outweighs it.

Figure 4: TA dialect after reformulation with the TTGT method (left panel). Lowering and optimization of transpose (top right) and GEMM (bottom right).

The left panel in Figure 4 shows the TTGT reformulation of the ta.tc operation: transposes of the input tensors (linalg.copy) and a GEMM operation (linalg.matmul).

Optimal permutation for TTGT
The permutation chosen to reformulate a tensor contraction using the TTGT method has a considerable impact on performance. The cost of transposing a high-dimension tensor into a 2D tensor depends on which indices are transposed and on the storage format. For row-major format, transposing the first indices is more expensive than transposing the last ones, especially for the output tensor. In order to select the best permutation, we use a cost model based on a heuristic that assigns higher costs to permutations that move the outermost dimensions. Additionally, some permutations naturally result in a reduction of the number of transpose operations. We compute the cost of each valid transposition of the input and output tensors, including the position swap of the input tensors, and select the permutation with the lowest cost.
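The exact cost model is not spelled out here; the following toy heuristic (an illustration, not COMET's actual model) captures the core idea of penalizing permutations that move the outer, slowest-varying dimensions of a row-major tensor:

```python
from itertools import permutations

def transpose_cost(perm):
    # Toy heuristic: each dimension that moves away from its original
    # position is weighted by how close that original position is to
    # the front of a row-major layout, since relocating an outer
    # dimension shuffles data at larger strides.
    n = len(perm)
    return sum(n - idx for pos, idx in enumerate(perm) if idx != pos)

# Rank all candidate layouts of a 4D tensor by estimated transpose cost.
ranked = sorted(permutations(range(4)), key=transpose_cost)
assert ranked[0] == (0, 1, 2, 3)   # the identity permutation moves nothing
```

A real implementation would restrict the candidates to permutations that are valid matricizations of the contraction and would fold in the reduction of transpose count mentioned above.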
Multi-Operand Expression
Given the associativity of tensor contractions, any order in which the contractions of a multi-operand expression are computed produces the same result. However, the number of operations performed may vary depending on the grouping order of the contractions. The performance variation may be significant, especially if some of the tensors involved have low cardinality in some dimensions (e.g., "skinny" matrices). We explore all possible orderings of a multi-operand expression and identify the one that minimizes the overall number of operations. We then organize the sequence of operations in a binary tree and lower the multi-operand expression to a sequence of tensor contractions. Note that, because the shapes of the intermediate tensors differ from the original ones, some tensor contractions may degenerate to simpler lower-dimension operations, such as GEMM or tensor-vector multiplications, which are further optimized (e.g., by removing additional transposes).
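The effect of grouping order can be sketched with a simple operation-count model (the index extents are hypothetical, and the cost of a pairwise contraction is estimated as the product of the extents of every index involved):

```python
from math import prod

# Hypothetical index extents; small c, d against large m, n, i, a
# makes the "skinny" effect visible.
ext = {'c': 8, 'd': 8, 'm': 64, 'n': 64, 'i': 64, 'a': 64}

def pair_cost(lhs, rhs):
    """Estimated ops of contracting two tensors (given as index sets):
    the product of the extents of all indices involved."""
    return prod(ext[i] for i in lhs | rhs)

def intermediate(lhs, rhs, needed_later):
    """Indices surviving a pairwise contraction: those still needed."""
    return (lhs | rhs) & needed_later

# Expression from Table 1: A[c,d,m,n] * B[i,n,a,d] * C[m,c] -> out[i,a]
A, B, C, out = set('cdmn'), set('inad'), set('mc'), set('ia')

cost_AB_first = pair_cost(A, B) + pair_cost(intermediate(A, B, C | out), C)
cost_AC_first = pair_cost(A, C) + pair_cost(intermediate(A, C, B | out), B)

# Contracting through the small tensor first is orders of magnitude cheaper.
assert cost_AC_first < cost_AB_first
```

COMET performs this enumeration over every grouping of the expression and picks the binary tree with the lowest total count.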
Transpose optimization
Our transpose optimization consists of two steps: loop permutation and tiling. The main rationale of the cost model is that the loops corresponding to the innermost indices should be at the innermost level. We assign a weight to each loop index according to its position in the input and output tensors (the weight is higher if an inner index does not correspond to an inner loop) and compute the overall cost of the permutation by summing these weights. The top-right panel in Figure 4 shows the IR after the selected loop permutation (i,j,k,l → i,k,j,l). Next, we employ tiling to improve locality.

GEMM optimizations
GEMM plays a paramount role in a broad range of domains such as deep learning, scientific computing, and image processing. Tiling and blocking are effective methods to improve data locality and to overlap computation and memory access. We employ the same tiling strategy used in [9, 22]: given C[M,N] = A[M,K] * B[K,N], the C matrix is partitioned into multiple tiles of size M_c × N_c. Each tile of C needs to access a row band of the A matrix of size M_c × K and a whole column band of the B matrix of size K × N_c. Since the whole row band of A and column band of B are still too large to be accommodated in the processor's caches, the K dimension is further partitioned into smaller tiles of A (M_c × K_c) and B (K_c × N_c). M_c, K_c, and N_c are carefully selected so that the sub-matrix of B (K_c × N_c) fits into the L3 cache and the A sub-matrix (M_c × K_c) fits into the L2 cache. The N_c and M_c dimensions are further tiled by N_r and M_r, respectively, so that the sub-matrices of A (M_r × K_c) and B (K_c × N_r) fit into the L1 cache. The M_r × N_r elements of the matrix C fit into registers, and the innermost computation of size M_r × N_r is executed as a micro-kernel. The bottom-right panel in Figure 4 shows the various levels of tiling.

GEMM micro-kernel
Modern architectures are very complex and require sophisticated, highly optimized code to fully achieve high performance. Compiler frameworks do their best to generate such code while remaining general. COMET leverages the LLVM back-end for code and binary generation, which, when combined with tiling optimizations, provides good performance. However, highly optimized code for a specific architecture can fully leverage vector instructions, instruction-level parallelism, speculation, and other architectural features, achieving even higher performance. Also, COMET has been designed to support custom hardware accelerators that implement specific functionalities in hardware. Thus, the innermost computation in the GEMM kernel is performed using a micro-kernel, either a highly specialized code for the target architecture (a "soft accelerator") or a custom hardware accelerator (a "hard accelerator").
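The M_c/K_c/N_c cache-blocking scheme described above can be sketched as follows (a simplified NumPy sketch; the register-level M_r × N_r micro-tiling and the micro-kernel itself are elided, and the tile sizes are illustrative):

```python
import numpy as np

def tiled_gemm(A, B, Mc=64, Nc=64, Kc=32):
    # Cache-blocking sketch of C[M,N] = A[M,K] @ B[K,N]: the loop nest
    # mirrors the Mc/Kc/Nc partitioning, with the Mc x Kc block of A
    # and the Kc x Nc panel of B sized to stay cache-resident.
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))
    for jc in range(0, N, Nc):            # Nc panel of B (targets L3)
        for pc in range(0, K, Kc):        # Kc slice along K (targets L2)
            for ic in range(0, M, Mc):    # Mc block of C
                C[ic:ic+Mc, jc:jc+Nc] += (
                    A[ic:ic+Mc, pc:pc+Kc] @ B[pc:pc+Kc, jc:jc+Nc]
                )
    return C

A = np.random.rand(128, 96)
B = np.random.rand(96, 80)
assert np.allclose(tiled_gemm(A, B), A @ B)
```

In the generated code, the innermost `@` would itself be tiled into M_r × N_r register blocks and replaced by the soft- or hard-accelerator micro-kernel.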
7 HARDWARE-SOFTWARE CO-DESIGN

Custom hardware accelerators may significantly increase the performance, area, and energy efficiency of high-end compute systems. It comes as no surprise that hardware acceleration is widely employed in many domains, including mobile, automotive, high-performance computing, and machine learning. However, designing custom accelerators requires computationally intensive simulations using software or FPGA-based tools, which may limit the scope to very small kernels.

COMET provides an opportunity to perform co-design and design space exploration (DSE) efficiently and to assess the performance of the entire application, instead of only the innermost kernel. As explained in the previous section, the lower-level dialects represent architecture-specific operations, and the innermost computation in the GEMM operation is implemented as an architecture-specific, highly optimized micro-kernel. In order to perform co-design for a target accelerator and measure its impact on realistic tensor contractions, we replace the micro-kernel with a timing model of the hardware accelerator and execute the entire contraction at native speed. To this end, we pair COMET with Aladdin [18], a pre-register-transfer-level (RTL), power-performance simulation framework that targets rapid prototyping of data-parallel accelerators. Aladdin specifications are essentially C representations of the functionalities that need to be implemented in hardware. From these representations, an LLVM-based tracer extracts a dynamic data dependence graph (DDDG) that describes the accelerator. Next, Aladdin applies various optimizations and resource constraints, generating a realistic design. Finally, Aladdin estimates power and performance from dynamic traces obtained from a driver program.

Performing DSE for the hardware accelerator design with Aladdin takes on the order of minutes (instead of hours as in traditional FPGA-based DSE), while the tensor contraction itself executes with COMET at native speed.
The entire process, thus, can be completely automated and executed within minutes.
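The co-design loop can be illustrated with a toy first-order timing model (hypothetical helper, not Aladdin's actual interface): the micro-kernel is replaced by a per-invocation latency taken from the accelerator's timing model, scaled by the number of micro-kernel invocations the tiled GEMM performs.

```python
import math

def estimate_gemm_time(M, N, K, Mr, Nr, Kc,
                       ukernel_latency_s, transpose_time_s):
    # Toy co-design model: total time = transpose overhead plus one
    # accelerator-model latency per Mr x Nr x Kc micro-kernel call.
    # All latency values here are hypothetical placeholders.
    invocations = math.ceil(M / Mr) * math.ceil(N / Nr) * math.ceil(K / Kc)
    return transpose_time_s + invocations * ukernel_latency_s

# Comparing two hypothetical accelerator design points on one GEMM:
small = estimate_gemm_time(1024, 1024, 1024, 16, 16, 64, 2e-7, 1e-3)
large = estimate_gemm_time(1024, 1024, 1024, 64, 64, 64, 9e-7, 1e-3)
assert large < small   # fewer, larger tiles win under these assumptions
```

Sweeping such design points against real contraction shapes, with latencies supplied by Aladdin instead of placeholders, is what allows whole-application DSE in minutes.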
8 EVALUATION

This section presents the performance of the code generated by the COMET compiler from the TA language. First, we show the performance impact of progressively applying the various optimizations described in Section 6. Next, we compare our automatically generated code against code that leverages hand-optimized libraries. Finally, we present a co-design case study for a GEMM accelerator. We perform our experiments on a compute node featuring an Intel Xeon 6126 CPU at 2.60 GHz and 192 GB of memory. We compare our results to 30 tensor contractions from the TCCG benchmark suite [20], using the reference problem sizes. The first eight contractions are tensor-matrix multiplications from the machine learning domain (1st to 8th). The next three contractions are used to transform a set of two-electron integrals from an atomic orbital basis to a molecular orbital basis (9th to 11th). The remaining 19 contractions are from the CCSD method of quantum chemistry. The TCCG benchmarks are implemented in C++ and leverage highly optimized computational libraries for tensor transpositions (HPTT) and GEMM kernels (BLIS). The results reported are the average of ten runs.
Overall Performance Evaluation
Figure 5 shows the performance impact of applying the optimizations described in Section 6 to the TCCG tensor contractions written in the COMET DSL. The x-axis represents each tensor contraction, and the y-axis denotes performance in GFLOPS. We start from TTGT reformulations with nested-loop GEMMs and transposes (blue bars) and progressively apply architecture-independent and architecture-specific optimizations; thus each bar in the graph shows the incremental benefit. The plot shows that each optimization brings varied performance gains on the different tensor contractions. The architecture-independent optimizations may (or may not) bring immediate benefit; this is the case when the arbitrary permutation for TTGT is already the optimal one. However, in some cases, optimizing transpose operations almost doubles performance (6th and 8th contractions). It is important to note that some architecture-independent optimizations, such as selecting the optimal permutation, might bring performance gains only after all other optimizations are applied (Figure 6). Architecture-specific optimizations, such as GEMM tiling, bring considerable performance improvements in most cases, achieving 23× speedup on average and up to 56× speedup compared to the equivalent loop nest. For high-dimensional tensor contractions (e.g., the ones from CCSD), for which data movement dominates the execution time, exploiting data locality provides significant performance improvements. Finally, employing an architecture-specific micro-kernel that leverages AVX512 vector instructions, memory prefetching, and speculation greatly improves performance, especially for the compute-bound contractions, achieving up to 51 GFLOPS. We remark that the micro-kernel is only effective once all other optimizations are performed; otherwise vector instructions cannot leverage data locality.

Figure 6 compares COMET against state-of-the-art implementations of the TCCG tensor contractions.
Figure 5: Performance breakdown. The plot shows the impact of code optimizations applied incrementally to multiple tensor contractions (higher is better).

While both COMET and the C++ implementations use the TTGT method, the TCCG implementations (solid blue line) leverage highly optimized computational libraries, whereas the COMET implementations employ the optimizations described in the previous sections. In order to perform a fair comparison, we use the same micro-kernel implemented by the BLIS library, hence both the hand-tuned and the COMET-generated code use the same architecture-specific code. The plot shows that the COMET results (solid yellow line) are comparable to and, in some cases, better than the TCCG C++ implementations: COMET achieves an average of 1.× and up to 1.× speedup compared to the TCCG benchmarks. Moreover, we believe that COMET has the distinct advantages of being more general (optimizations can be selectively applied), portable (different target architectures may be chosen without re-implementing high-level optimizations), and user-friendly (the TA language allows expressing equations in mathematical form), and it does not force programmers to bend algorithms to specific library APIs and formats. The graph also shows COMET's performance when employing an arbitrary permutation instead of the best one (solid green line). Using the best permutation performs up to 4.36× better (1.38× on average) than using an arbitrary permutation. These results show that, although it is not evident at first (Figure 5), choosing an optimal permutation does increase overall performance once architecture-specific optimizations are applied.

Overall, the results in Figures 5 and 6 show that achieving high performance requires sophisticated, architecture-specific optimizations that naturally make code much less portable. However, a compiler framework based on a multi-level IR can seamlessly apply both architecture-specific and architecture-independent optimizations, achieve optimal performance, and still maintain the high-level language specifications and semantics.
Multi-Operand Expression
Differently from a library-basedapproach, where programmers need to match the library-defined(binary) APIs, COMET can analyze tensor expressions and optimizethe order in which each operation is executed. As stated in Section 6,although the final results of a multi-operand tensor contractiondo not change despite of the execution order of the contractions,the number of operations does change, hence some ordering pro-vides higher performance than others. In particular, when usinga library (which normally provides an interface for contractingtwo tensors), programmers either have to figure out the optimalordering manually a priori or may incur in performance loss if
Table 1: Performance speedup of re-ordering multi-operand expressions.

Multi-operand tensor expression                 Perf. speedup
A[c,d,m,n] * B[i,n,a,d] * C[m,c]                    23.9×
A[d,c,m,n] * B[i,n,a,d] * C[m,c]                    21.1×
A[c,d,m,n] * B[i,d] * C[n,a] * D[m,c]                4.9×
A[d,c,m,n] * B[i,d] * C[n,a] * D[m,c]                4.5×
A[m,n,e,j] * B[e,i] * C[a,m] * D[b,n]                1.4×
A[m,n,f,e] * B[e,i] * C[f,n] * D[a,b,m,j]            1.4×
A[m,n,e,f] * B[a,m] * C[f,n] * D[e,b,i,j]            2.2×
A[m,n,e,f] * B[b,m] * C[f,n] * D[e,a,j,i]            1.8×
A[n,m,e,f] * B[a,m] * C[f,n] * D[e,b,i,j]            2.2×
A[n,m,e,f] * B[b,m] * C[f,n] * D[e,a,j,i]            1.8×
A[m,n,e,f] * B[e,i] * C[f,n] * D[a,b,m,j]            1.5×
A[m,n,e,f] * B[e,j] * C[f,n] * D[b,a,m,i]            1.7×
A[m,n,f,e] * B[e,j] * C[f,n] * D[b,a,m,i]            1.5×
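The kind of reordering analysis behind Table 1 can be sketched by counting multiplications for each pairwise contraction order and keeping the cheapest. This is a simplified cost model (product of the index extents involved in each binary contraction), not COMET's actual analysis, and the uniform extent of 16 is made up for illustration.

```python
from itertools import combinations

# Made-up index extents; in practice these come from the tensors' shapes.
dims = {idx: 16 for idx in 'acdimn'}

def contract_cost(s1, s2, keep):
    """Multiply count for contracting two index sets, plus the result's indices."""
    union = set(s1) | set(s2)
    cost = 1
    for idx in union:
        cost *= dims[idx]
    # An index survives if it appears in only one operand, or is still needed
    # by the output or by a not-yet-contracted operand.
    result = frozenset(i for i in union if (i in s1) != (i in s2) or i in keep)
    return cost, result

def natural_cost(operands, out):
    """Cost of contracting strictly left-to-right, as a binary-API library would."""
    ops, total = list(operands), 0
    while len(ops) > 1:
        rest = ops[2:]
        keep = set(out).union(*rest)
        cost, result = contract_cost(ops[0], ops[1], keep)
        total += cost
        ops = [result] + rest
    return total

def min_cost(operands, out):
    """Exhaustively search pairwise contraction orders for the cheapest one."""
    if len(operands) == 1:
        return 0
    best = None
    for a, b in combinations(range(len(operands)), 2):
        rest = [operands[k] for k in range(len(operands)) if k not in (a, b)]
        keep = set(out).union(*rest)
        cost, result = contract_cost(operands[a], operands[b], keep)
        total = cost + min_cost(rest + [result], out)
        best = total if best is None else min(best, total)
    return best

# First expression of Table 1: E[i,a] = A[c,d,m,n] * B[i,n,a,d] * C[m,c].
ops = [frozenset('cdmn'), frozenset('inad'), frozenset('mc')]
out = frozenset('ia')
print(natural_cost(ops, out), min_cost(ops, out))
# Contracting A with C first avoids ever materializing a six-index
# intermediate, so the best order needs far fewer multiplies.
```

With these assumed extents the natural order costs 16^6 + 16^4 multiplies while the best order costs 2 · 16^4; the measured speedups in Table 1 are smaller because memory traffic and other effects also matter.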
Table 2: Characteristics of the emulated GEMM hardware.

                    16×16    64×64    256×256
Avg. Power (mW)       131     1026      32770
Avg. Area (µm²)         –        –          –

they follow the natural order of the tensor contractions. COMET, instead, automatically analyzes the entire expression and breaks
it into binary operations, properly ordered to achieve the highest performance by minimizing the overall number of operations. We evaluated 118 tensor contraction expressions from two NWChem methods that involve 3 and 4 operands. Table 1 shows that the multi-operand optimization reduces the total number of operations and provides performance speedups of up to 23.9×. We do not report the other expressions, as their natural order coincides with the optimal ordering. Note that in these experiments we have employed the same optimizations introduced in Section 6; the only difference is that the baseline executes contractions in the natural order of the expression, whereas the multi-operand expression optimization identifies an ordering that reduces the overall number of operations.

Figure 6: COMET's performance comparison against the hand-optimized TCCG benchmarks on x86 (solid lines) and emulated platforms (dashed lines).

Case Study: Designing Custom GEMM Accelerators
Our final experiments use COMET to perform co-design for a custom GEMM data-parallel accelerator. The main idea is to identify the best accelerator to perform the GEMM operation in the TTGT method used to solve tensor contractions. In this case, the GEMM accelerator is considered a "hard accelerator", as opposed to the "soft accelerator" for x86 employed in the previous section. We leverage COMET's modeling capabilities and combine the code generated by our compiler framework with the timing estimates produced by Aladdin models of the GEMM designs. In particular, we replace the x86 micro-kernel with a delay that represents the execution time of the innermost GEMM computation on the hardware accelerator. We analyze three possible scenarios: small (16×16), medium (64×64), and large (256×256) accelerators. For reference, an NVIDIA Volta GPU die measures 815 mm², roughly 14,606× the area of the 16×16 accelerator. Figure 6 also reports the performance of a system that features custom GEMM hardware accelerators (dashed lines). The plot shows that hardware accelerators may substantially increase performance for compute-bound tensor contractions, such as the last 11 contractions, and achieve up to 156 GFLOPS, a more than 3× speedup over the same code employing a "soft" AVX512 accelerator. By comparing the results in Figures 5 and 6, it is evident that the contractions that benefit most from tiling (and thus become compute-bound) also greatly benefit from hardware acceleration. The plot also shows an important point: while it seems intuitive that larger accelerators provide higher performance, this is not always the case in our experiments. There may be several reasons for this behavior, including large carry-over loops, computing GEMM for non-square matrices, caches that are not large enough to contain all the data, etc. Figure 6 does show that tensor contractions that are compute-bound with smaller hardware accelerators become memory-bound with the largest GEMM design.
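A deliberately simplified capacity model illustrates why a larger GEMM tile can stop paying off: arithmetic intensity grows with tile size, but so does the on-chip working set, which eventually exceeds the cache feeding the accelerator. The 1 MiB cache size and double-precision words below are assumptions for illustration, not measurements of our platform or part of the Aladdin models.

```python
WORD = 8              # bytes per double-precision element (assumed)
LLC_BYTES = 1 << 20   # assumed 1 MiB last-level cache feeding the accelerator

def tile_stats(n):
    """FLOPs, resident bytes, and arithmetic intensity for one n x n GEMM tile."""
    flops = 2 * n ** 3                # multiplies and adds for C += A @ B
    traffic = 3 * n ** 2 * WORD      # read A and B, write C, no inter-tile reuse
    working_set = 3 * n ** 2 * WORD  # A, B, C tiles resident at once
    return flops, working_set, flops / traffic   # intensity grows as n / 12

for n in (16, 64, 256):
    flops, ws, ai = tile_stats(n)
    print(f"{n:>3}x{n:<3}: {ai:5.1f} FLOP/byte, "
          f"working set {ws / 1024:7.1f} KiB, fits in cache: {ws <= LLC_BYTES}")
```

Under these assumptions the 16×16 and 64×64 tiles fit in the cache while the 256×256 tile (1.5 MiB) does not, matching the observation that the largest design turns compute-bound contractions memory-bound.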
We infer that the lowest-level cache does not have sufficient storage to feed such large accelerators or to support data reuse. The actual point of co-design is, indeed, to figure out those trade-offs and select the best accelerator for the particular workload (64×64) instead of the best accelerator for the single operation (256×256).

9 RELATED WORK

Among the compiler-based approaches for tensor algebra, the Tensor Contraction Engine (TCE) [8] is an early effort: a compiler framework that automatically optimizes dense tensor contractions in quantum chemistry. The TACO compiler is a C++ library that employs compiler-based techniques for dense and sparse tensors. TACO enables automatic code generation for a wide variety of
linear and tensor algebra operations while supporting different storage formats. TACO provides notation similar to COMET's TA language for expressing tensor expressions, although programmers need to invoke object methods to pack/unpack data structures. Unlike TACO, COMET leverages core compiler optimizations, such as tiling or loop ordering, and supports multiple back-ends via LLVM.

There has been a lot of work on library-based approaches. FLAME [7] focuses on the formal description of linear algebra methods on matrices and the derivation of optimized implementations for distributed-memory systems. Later works [13, 16, 17] extend the framework to multi-dimensional tensor operations. The Cyclops Tensor Framework (CTF) [19] focuses on distributed computation of tensor operations. libtensor [4] focuses on describing block tensor operations using C++ templates. Recent work, such as ITensor [5] and TensorNetwork [6], employs tensor networks to represent contractions of several tensors. Library-based approaches are easy to use but force scientists to re-arrange algorithms and implementations around the library APIs. This implies, among other things, that library-based approaches rarely support arbitrary tensor expressions. Moreover, libraries are typically implemented for specific architectures and may require heavy modifications to run on different heterogeneous systems. COMET, on the other hand, provides a programming interface close to the Einstein mathematical notation, supports arbitrary and mixed tensor expressions, and supports execution on different architectures via LLVM backends.

To the best of our knowledge, none of the tools and libraries available provide an easy path to perform co-design of hardware accelerators for tensor algebra computations. COMET has been designed to support hardware accelerators and allows swapping optimizations in and out according to the target architecture.
10 CONCLUSIONS
The recent explosion of highly efficient, specialized architectures has dramatically decreased program portability and productivity. On one side, scientists prefer high-level, domain-specific languages that provide high expressiveness and portability; on the other side, achieving high performance on modern architectures requires highly-tuned, architecture-specific implementations and support for custom hardware accelerators. This work presents COMET, a novel compiler framework that supports tensor algebra operations, specifically those related to chemistry. COMET consists of a high-level DSL, a multi-level IR, and a series of progressive lowering steps and program optimizations. COMET's multi-level IR approach allows us to change some of the dialects without the need to re-implement the entire IR stack and optimizations. We show that the code automatically generated by COMET outperforms hand-tuned tensor contractions that leverage state-of-the-art computational libraries across 30 tensor contractions from various domains. Our approach provides the distinct advantage of analyzing multi-operand expressions and identifying the optimal ordering of tensor operations, achieving up to 23.9× speedup over equivalent code that follows the natural order. Finally, we show that COMET can also be used to perform co-design and identify the best GEMM accelerator for the tensor contractions under study. We plan to extend COMET in various directions, including additional support for tensor algebra operations, support for sparse operations, and support for additional architectures. We also plan to open-source COMET.
11 ACKNOWLEDGEMENT
This research is supported by the PNNL Laboratory Directed Research and Development Program (LDRD), the Data-Model Convergence Initiative, project DuoMO: A Compiler Infrastructure for Data-Model Convergence, and project Hybrid Advanced Workflows.
REFERENCES
[1] The Journal of Chemical Physics.
[2] Annalen der Physik.
[4] Journal of Computational Chemistry 34, 26 (2013), 2293–2309.
[5] Matthew Fishman, Steven R. White, and E. Miles Stoudenmire. 2020. The ITensor Software Library for Tensor Network Calculations. arXiv:2007.14822 [cs.MS].
[6] Google. 2020. TensorNetwork. https://github.com/google/TensorNetwork.
[7] John A. Gunnels et al. 2001. FLAME: Formal Linear Algebra Methods Environment. ACM Trans. Math. Softw. 27, 4 (Dec. 2001), 422–455.
[8] So Hirata. 2003. Tensor Contraction Engine: Abstraction and Automated Parallel Implementation of Configuration-Interaction, Coupled-Cluster, and Many-Body Perturbation Theories. The Journal of Physical Chemistry A.
[10] Norman P. Jouppi et al. 2017. In-Datacenter Performance Analysis of a Tensor Processing Unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture. 1–12.
[11] Chris Lattner et al. 2020. MLIR: A Compiler Infrastructure for the End of Moore's Law. arXiv preprint arXiv:2002.11054 (2020).
[12] Chris Lattner and Vikram Adve. 2004. LLVM: A Compilation Framework for Lifelong Program Analysis and Transformation. In CGO '04. 75–88.
[13] Jack Poulson et al. 2013. Elemental: A New Framework for Distributed Memory Dense Matrix Computations. ACM Trans. Math. Softw. 39, 2, Article 13 (Feb. 2013), 24 pages.
[14] Jonathan Ragan-Kelley et al. 2013. Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines. In PLDI '13. 519–530.
[15] Krishnan Raghavachari et al. 1989. A Fifth-Order Perturbation Comparison of Electron Correlation Theories. Chemical Physics Letters 157 (1989), 479–483.
[16] M. Schatz, R. van de Geijn, and J. Poulson. 2016. Parallel Matrix Multiplication: A Systematic Journey. SIAM Journal on Scientific Computing 38, 6 (2016), C748–C781.
[17] Martin D. Schatz, Tze Meng Low, Robert A. van de Geijn, and Tamara G. Kolda. 2014. Exploiting Symmetry in Tensors for High Performance: Multiplication with Symmetric Tensors. SIAM Journal on Scientific Computing 36, 5 (2014).
[18] Yakun Sophia Shao et al. 2014. Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures. In ISCA '14. IEEE Press, 97–108.
[19] Edgar Solomonik, Devin Matthews, Jeff R. Hammond, John F. Stanton, and James Demmel. 2014. A Massively Parallel Tensor Contraction Framework for Coupled-Cluster Computations. J. Parallel and Distrib. Comput. 74, 12 (2014), 3176–3190.
[20] Paul Springer and Paolo Bientinesi. 2018. Design of a High-Performance GEMM-like Tensor–Tensor Multiplication. ACM Trans. Math. Softw. 44, 3, Article 28 (2018), 29 pages.
[21] Paul Springer, Tong Su, and Paolo Bientinesi. 2017. HPTT: A High-Performance Tensor Transposition C++ Library. (2017), 56–62.
[22] Field G. Van Zee and Robert A. Van De Geijn. 2015. BLIS: A Framework for Rapidly Instantiating BLAS Functionality. ACM Trans. Math. Softw.