SPORES: Sum-Product Optimization via Relational Equality Saturation for Large Scale Linear Algebra
Yisu Remy Wang, Shana Hutchison, Jonathan Leang, Bill Howe, Dan Suciu
Yisu Remy Wang
University of Washington, Seattle, Washington
[email protected]

Shana Hutchison
University of Washington, Seattle, Washington
[email protected]

Jonathan Leang
University of Washington, Seattle, Washington
[email protected]

Bill Howe
University of Washington, Seattle, Washington
[email protected]

Dan Suciu
University of Washington, Seattle, Washington
[email protected]
December 24, 2020

ABSTRACT
Machine learning algorithms are commonly specified in linear algebra (LA). LA expressions can be rewritten into more efficient forms, by taking advantage of input properties such as sparsity, as well as program properties such as common subexpressions and fusible operators. The complex interaction among these properties' impact on the execution cost poses a challenge to optimizing compilers. Existing compilers resort to intricate heuristics that complicate the codebase and add maintenance cost, but fail to search through the large space of equivalent LA expressions to find the cheapest one. We introduce a general optimization technique for LA expressions, by converting the LA expressions into Relational Algebra (RA) expressions, optimizing the latter, then converting the result back to (optimized) LA expressions. One major advantage of this method is that it is complete, meaning that any equivalent LA expression can be found using the equivalence rules in RA. The challenge is the large size of the search space, and we address this by adopting and extending a technique used in compilers, called equality saturation. We integrate the optimizer into SystemML and validate it empirically across a spectrum of machine learning tasks; we show that we can derive all existing hand-coded optimizations in SystemML, and perform new optimizations that lead to speedups from 1.2X to 5X.

Introduction

Consider the Linear Algebra (LA) expression sum((X − UV^T)^2), which defines a typical loss function for approximating a matrix X with a low-rank matrix UV^T. Here, sum() computes the sum of all matrix entries in its argument, and A^2 squares the matrix A element-wise. Suppose X is a sparse, 1M x 500k matrix, and suppose U and V are dense vectors of dimensions 1M and 500k respectively. Thus, UV^T is a rank-1 matrix of size 1M x 500k, and computing it naively requires 0.5 trillion multiplications, plus memory allocation. Fortunately, the expression is equivalent to

sum(X^2) − 2U^T X V + U^T U ∗ V^T V.
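The identity sum((X − uv^T)^2) = sum(X^2) − 2u^T X v + (u^T u)(v^T v) can be checked on a tiny instance; the pure-Python sketch below (sizes and values are ours, chosen for illustration) evaluates both sides:

```python
# Tiny dense instance; in practice X is sparse and huge, which is what
# makes the right-hand side much cheaper to evaluate.
X = [[1, 0, 2],
     [0, 3, 0]]
u = [1, 2]        # column vector of length 2
v = [3, 0, 1]     # column vector of length 3

# Left-hand side: materialize u v^T, subtract, square element-wise, sum.
lhs = sum((X[i][j] - u[i] * v[j]) ** 2
          for i in range(2) for j in range(3))

# Right-hand side: never materializes the rank-1 matrix u v^T.
sum_X2 = sum(X[i][j] ** 2 for i in range(2) for j in range(3))
utXv = sum(u[i] * X[i][j] * v[j] for i in range(2) for j in range(3))
rhs = sum_X2 - 2 * utXv + sum(a * a for a in u) * sum(b * b for b in v)

assert lhs == rhs  # both equal 54
```

The right-hand side only touches the nonzero entries of X plus two dot products, which is the source of the speedup when X is sparse.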
Here U^T X V is a scalar that can be computed efficiently by taking advantage of the sparsity of X, and, similarly, U^T U and V^T V are scalar values requiring only 1M and 500k multiplications respectively.

Optimization opportunities like this are ubiquitous in machine learning programs. State-of-the-art optimizing compilers such as SystemML [1], OptiML [27], and Cumulon [11] commonly implement syntactic rewrite rules that exploit the algebraic properties of the LA expressions. For example, SystemML includes a rule that rewrites the preceding example to a specialized operator to compute the result in a streaming fashion. (See the SystemML Engine Developer Guide for details on the weighted-square loss operator wsloss.) However, such syntactic rules fail on the simplest variations; for example, SystemML fails to optimize sum((X + UV^T)^2), where we just replaced − with +. Moreover, rules may interact with each other in complex ways. In addition, complex ML programs often have many common subexpressions (CSE) that further interact with syntactic rules; for example, the same expression UV^T may occur in multiple contexts, each requiring different optimization rules.

In this paper we describe SPORES, a novel optimization approach for complex linear algebra programs that leverages relational algebra as an intermediate representation to completely represent the search space. SPORES first transforms LA expressions into traditional Relational Algebra (RA) expressions consisting of joins ∗, unions +, and aggregates ∑. It then performs cost-based optimization on the resulting Relational Algebra expressions, using only standard identities in RA. Finally, the resulting RA expression is converted back to LA, and executed.

A major advantage of SPORES is that the optimization rules in RA are complete. Linear Algebra seems to require an endless supply of clever rewrite rules, but, in contrast, by converting to RA, we can prove that SPORES is complete. The RA expressions in this paper are over K-relations [9]; a tuple X(i,j) is no longer true or false, but has a numerical value; in other words, the RA expressions that result from LA expressions are interpreted over bags instead of sets. A folklore theorem states that two Unions of Conjunctive Queries (UCQs) over bag semantics are equivalent iff they are isomorphic, which implies that checking equivalence is decidable. (In contrast, containment of two UCQs with bag semantics is undecidable [13]; we do not consider containment in this paper.) We prove that our optimizer rules are sufficient to convert any RA expression into its canonical form, i.e.
to a UCQ, and thus can, in principle, discover all equivalent rewritings.

However, we faced a major challenge in trying to exploit the completeness of the optimizer. The search space is very large, typically larger than that encountered in standard database optimizers, because of the prevalence of unions +, the large number of aggregates ∑, and frequent common subexpressions. To tackle this, SPORES adopts and extends a technique from compilers called equality saturation [28]. It uses a data structure called the E-Graph [22] to compactly represent the space of equivalent expressions, and equality rules to populate the E-Graph, then leverages constraint solvers to extract the optimal expression from the E-Graph.

We have integrated SPORES into SystemML [1], and show that it can derive all hand-coded rules of SystemML. We evaluated SPORES on a spectrum of machine learning tasks, showing competitive performance improvement compared with more mature heuristic-based optimizers. Our optimizer rediscovers all optimizations by the latter, and also finds new optimizations that contribute to speedups of 1.2X to 5X.

We make the following contributions in this paper:
1. We describe a novel approach for optimizing complex Linear Algebra expressions by converting them to Relational Algebra, and prove that this approach is complete (Sec. 2).
2. We present a search algorithm based on Equality Saturation that can explore a large search space while reusing memory (Sec. 3).
3. We conduct an empirical evaluation of the optimizer using several real-world machine learning tasks, and demonstrate its superiority over a heuristics-driven optimizer in SystemML (Sec. 4).

R_LR: from LA to RA and Back

In this section we describe our approach of optimizing LA expressions by converting them to RA. The rules converting from LA to RA and back are denoted R_LR. To justify our approach, let us revisit our example loss function written in LA and attempt to optimize it using standard LA identities.
Here we focus on algebraic rewrites and put aside concerns about the cost model. Using the usual identities on linear algebra expressions, one may attempt to rewrite the original expression as follows:

sum((X − UV^T)^2)
  = sum((X − UV^T) ∗ (X − UV^T))
  = sum(X^2 − 2 X ∗ UV^T + (UV^T)^2)
  = sum(X^2) − 2 sum(X ∗ UV^T) + sum((UV^T)^2)

(The folklore theorem on UCQ equivalence was claimed, for conjunctive queries only, in Theorem 5.2 in [3] but a proof was never produced; a proof was given for bag-set semantics in [4]. See the discussion in [8].)
Figure 1: Top: the LA expressions A ∗ x^T and Ax. A ∗ x^T is element-wise multiplication, computing the matrix (A_ij x_j)_ij, while Ax is standard matrix-vector multiplication. Bottom: their translations into K-relations and Relational Algebra operations. Each relation is a bag, where the attribute represents the multiplicity. Then A ∗ x^T becomes a standard query Q(i,j) = A(i,j) ∗ x(j), while Ax is a query with a group-by and aggregate, Q(i) = ∑_j A(i,j) ∗ x(j).

1. A ∗ B → [−i,−j]([i,j]A ∗ [i,j]B).
2. A + B → [−i,−j]([i,j]A + [i,j]B).
3. sum_row A → [−i] ∑_j [i,j]A. Similar for sum_col, sum.
4. AB → [−i,−k] ∑_j ([i,j]A ∗ [j,k]B).
5. A^T → [−j,−i][i,j]A.
6. A − B → A + (−1 ∗ B).

Figure 2: LA-to-RA Ruleset R_LR. Notice that ∗ means point-wise multiply for matrices, but natural join for relations. Similarly, + means addition for matrices, but union for relations.

At this point we are stuck trying to rewrite sum(X ∗ UV^T) (recall that ∗ is element-wise multiplication); it turns out to be equal to sum(U^T X V), for any matrices X, U, V (and it is equal to the scalar U^T X V when U, V are column vectors), but this does not seem to follow from standard LA identities like associativity, commutativity, and distributivity. Similarly, we are stuck trying to rewrite sum((UV^T)^2) to sum(U^T U ∗ V^T V). Current systems manually add syntactic rewrite rules whenever such a special case is deemed frequent enough to justify extending the optimizer. Instead, our approach is to expand out the LA expression element-wise.
For example, assuming for simplicity that U, V are column vectors, we obtain:

sum((UV^T)^2) = ∑_{i,j} (U_i ∗ V_j) ∗ (U_i ∗ V_j)
             = ∑_{i,j} (U_i ∗ U_i) ∗ (V_j ∗ V_j)
             = (∑_i U_i ∗ U_i) ∗ (∑_j V_j ∗ V_j)
             = U^T U ∗ V^T V

The expressions using indices represent Relational Algebra expressions. More precisely, we interpret every vector, matrix, or tensor as a K-relation [9] over the reals. In other words, we view X_ij as a tuple X(i,j) whose "multiplicity" is the real value of that matrix element. We interpret point-wise multiply as natural join; addition as union; sum as aggregate; and matrix multiply as aggregate over a join. Figure 1 illustrates the correspondence between LA and RA. We treat each matrix entry A_ij as the multiplicity of tuple (i,j) in relation A under bag semantics; for example, if A_{2,1} = 7, the tuple (2,1) has multiplicity 7 in the corresponding relation. A ∗ x^T denotes element-wise multiplication, where each element A_ij of the matrix is multiplied with the element x_j of the row-vector x^T. In RA it is naturally interpreted as the natural join A(i,j) ⋈ x(j), which we write as A(i,j) ∗ x(j). Similarly, Ax is the standard matrix-vector multiplication in LA, while in RA it becomes a query with a group-by and aggregate, which we write as ∑_j A(i,j) ∗ x(j). Our K-relations are more general than bags, since the entry of a matrix can be a real number, or a negative number; they correspond to K-relations over the semiring of reals (ℝ, 0, 1, +, ∗).

We now describe the general approach in SPORES. The formal definitions of LA and RA are in Table 1. LA consists of seven operators, which are those supported in SystemML [1]. RA consists of only three operators: ∗ (natural join), + (union), and ∑ (group-by aggregate). Difference is represented as A − B = A + (−1 ∗ B) (this is difference in ℝ; we do not support bag difference in ℕ, because there is no corresponding operation in LA), (In the implementation, we use outer join for point-wise multiply and addition, where we multiply and add the matrix entries accordingly. In this paper we use join and union to simplify presentation.)
while selection can be encoded by multiplication with relations with 0/1 entries. We call an expression using these three RA operators an RPlan, for Relational Plan, and use the terms RPlan and RA/relational algebra interchangeably. Finally, there are two operators, bind and unbind, for converting between matrices/vectors and K-relations.

name       type                              syntax
LA:
  mmult      M^{M,L} × M^{L,N} → M^{M,N}     AB
  elemmult   M^{M,N} × M^{M,N} → M^{M,N}     A ∗ B
  elemplus   M^{M,N} × M^{M,N} → M^{M,N}     A + B
  rowagg     M^{M,N} → M^{M,1}               sum_row A
  colagg     M^{M,N} → M^{1,N}               sum_col A
  agg        M^{M,N} → M^{1,1}               sum A
  transpose  M^{M,N} → M^{N,M}               A^T
conv.:
  bind       M^{M,N} × [i,j] → R^{i:M,j:N}   [i,j]A
  unbind     R^{i:M,j:N} × [i,j] → M^{M,N}   [−i,−j]A
RA:
  join       R^{S1} × R^{S2} → R^{S1 ∪ S2}   A ∗ B
  union      R^S × R^S → R^S                 A + B
  agg        R^S × U → R^{S \ U}             ∑_U A

Table 1: LA and RA Operators. The type M^{M,N} is a matrix of size M × N; [i,j] is a list of attribute names; R^{i:M,j:N} is a relation with attributes i of size M and j of size N; S1, S2, S, and U are sets of attributes.

1. A ∗ (B + C) = A ∗ B + A ∗ C
2. ∑_i (A + B) = ∑_i A + ∑_i B
3. If i ∉ Attr(A), A ∗ ∑_i B = ∑_i (A ∗ B) (else rename i)
4. ∑_i ∑_j A = ∑_{i,j} A
5. If i ∉ Attr(A), then ∑_i A = A ∗ dim(i)
6. A + (B + C) = +(A, B, C) (assoc. & comm.)
7. A ∗ (B ∗ C) = ∗(A, B, C) (assoc. & comm.)

Figure 3: RA equality rules R_EQ. ∗ means natural join and + means union.

The translation from LA to RA is achieved by a set of rules, denoted R_LR and shown in Figure 2. The bind operator [i,j] converts a matrix to a relation by giving attributes i, j to its two dimensions; the unbind operator [−i,−j] converts a relation back to a matrix. For example, [−j,−i][i,j]A binds A's row indices to i and its column indices to j, then unbinds them in the opposite order, thereby transposing A. SPORES translates a complex LA expression into RA by first applying the rules R_LR in Figure 2 to each LA operator, replacing it with an RA operator preceded by bind and followed by unbind. Next, it eliminates consecutive unbind/bind operators, possibly renaming attributes; e.g. [k,l][−i,−j]A becomes A[i → k, j → l], which indicates that the attributes i and j in A's schema should be renamed to k and l, by propagating the rename downward into A. As a result, the entire LA expression becomes an RA expression (RPlan), with bind operators on the leaves and unbind at the top. For an illustration, the left DAG in Figure 6 shows the expression sum((X − UV^T)^2) translated to relational algebra.
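The translation in Figure 2 can be mimicked concretely by storing a matrix as a dictionary from index tuples to entries, i.e. its K-relation; a minimal sketch (the dictionary representation and variable names are ours, for illustration of rules 1, 4, and 5):

```python
# 2x2 matrices as K-relations: (i, j) -> entry (the tuple's multiplicity).
A = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4}
B = {(0, 0): 5, (0, 1): 6, (1, 0): 7, (1, 1): 8}

# Rule 1: element-wise multiply is a natural join on both attributes i, j.
elemmult = {ij: A[ij] * B[ij] for ij in A if ij in B}

# Rule 4: matrix multiply joins on the shared attribute j, then
# aggregates j away, grouping by (i, k).
mmult = {}
for (i, j), a in A.items():
    for (j2, k), b in B.items():
        if j == j2:
            mmult[(i, k)] = mmult.get((i, k), 0) + a * b

# Rule 5: transpose binds (i, j) and unbinds in the opposite order.
transpose = {(j, i): v for (i, j), v in A.items()}

assert mmult == {(0, 0): 19, (0, 1): 22, (1, 0): 43, (1, 1): 50}
assert transpose[(0, 1)] == 3   # entry A_{1,0} becomes tuple (0, 1)
```

Sparse inputs simply omit zero-multiplicity tuples from the dictionaries, which is exactly how the relational view exposes sparsity to the optimizer.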
Each of the two white islands represent LA expressions thatcan be proven equivalent using identities in LA; there is no way to move from one island to the other by using LAidentities. The large gray box shows the set of expressions that can be proven equivalent by using RA identities. Thisallows us to connect the two islands. 4
Figure 5: Two LA expressions e_LA and e'_LA are equivalent iff their relational counterparts e_RA and e'_RA have isomorphic canonical forms C(e_RA) ≡ C(e'_RA). R_LR translates a LA expression to RA and R_EQ are the relational equality rules.

Figure 6: RA DAGs for sum((X − UV^T)^2) (left) and its canonical form (right).

R_EQ: from RA to RA

The equational rules for RA consist of seven identities, shown in Figure 3 and denoted by R_EQ. The seven rules are natural relational algebra identities, where ∗ corresponds to natural join, + to union (of relations with the same schema), and ∑_i to group-by and aggregate. In rule 5, i ∉ Attr(A) means that i is not an attribute of A, and dim(i) is the dimension of index i. For a very simple illustration of this rule, consider ∑_i 1. Here 1 is a constant, i.e. a relation of zero arity, with no attributes. The rule rewrites it to dim(i), where dim(i) is a number representing the dimension of i.

As we have seen at the beginning of this section, when rewriting LA expressions using identities in linear algebra we may get stuck. This is illustrated in Figure 4, which shows two white islands representing sets of provably equivalent LA expressions; the simple identities in LA allow us to move within one white island, but are insufficient to move from one island to the other. Instead, by rewriting the expressions to RA, the seven identities in R_EQ are much more powerful, and allow us to prove them equivalent. We prove here that this approach is complete, meaning that, if two LA expressions are semantically equivalent, then their equivalence can be proven by using the rules R_EQ.
The proof consists of two parts: (1) the rules R_EQ are sufficient to convert any RA expression e to its normal form (also called canonical form) C(e), and back; (2) two RA expressions e, e' are semantically equal iff they have isomorphic normal forms, C(e) ≡ C(e'). Figure 5 shows the main steps at a high level: two LA expressions are semantically equivalent if and only if their canonical forms in RA are isomorphic, where R_LR translates each expression to RA and R_EQ takes the RA expression to its canonical form. Throughout this section we write e ≡*_{R_EQ} e' when e, e' can be proven equal by using the identities in R_EQ (Fig. 3).

Normal Form
The normal form of an RPlan is similar in spirit to the canonical form of polynomials as a sum of monomials, except that the monomial terms also include aggregations. As for polynomials, we combine equal factors by introducing powers, e.g. X ∗ X becomes X^2, and combine isomorphic monomials by introducing constant coefficients, e.g. X^2 Y + 5 X^2 Y becomes 6 X^2 Y. For example, the following RPlan denotes a 1-dimensional vector: Q(i) = (∑_j x(i,j) ∗ (∑_k y(j,k) ∗ x(i,j))) + ∑_{m,n} x(i,m) y(m,n), and its canonical form is ∑_{j,k} x(i,j)^2 y(j,k) + ∑_{j,k} x(i,j) y(j,k). Formally:

Definition 2.1.
An RA expression is canonical, or in normal form, if it is the sum (i.e. +) of n monomials, where each monomial i consists of a constant c_i multiplied by an aggregation over a (possibly empty) set of attributes A_i (for 1 ≤ i ≤ n), of a product of m_i factors, where each factor is some matrix or vector x_ij (for 1 ≤ i ≤ n, 1 ≤ j ≤ m_i) indexed by some index variables (not shown), and possibly raised to some power:

c_1 ∑_{A_1} (x_{11}^{k_{11}} ∗ ··· ∗ x_{1m_1}^{k_{1m_1}}) + ··· + c_n ∑_{A_n} (x_{n1}^{k_{n1}} ∗ ··· ∗ x_{nm_n}^{k_{nm_n}})

We further assume that no monomial contains the same factor twice (otherwise, replace it with a single factor with a higher power k_ij) and no two monomials are isomorphic (otherwise we replace them with a single monomial with a larger coefficient c_i). The order of summands and multiplicands is insignificant because ∗ and + are commutative and associative. We notice that, unlike traditional polynomials, here the same matrix name may occur in different factors; for example in ∑_{i,j,k} x(i,j) ∗ x(j,k) ∗ x(k,i) the matrix x occurs three times, but the factors are different, i.e. we cannot write x^3.

The first step of the completeness proof consists in showing that every expression e in RA is equivalent to some normal form C(e), and, moreover, that their equivalence can be proven using the rules R_EQ in Figure 3.

Lemma 2.1. ∀e ∈ RA there exists a canonical form C(e) and, moreover, their equivalence follows from the rules in R_EQ, in notation: e ≡*_{R_EQ} C(e).

The proof is a straightforward induction on the structure of e. We apply distributivity of ∗ over +, then pull out the summation ∑; we omit the details. We illustrate in Figure 6 the canonical form of the expression sum((X − UV^T)^2). Notice that, since the rules in R_EQ are sound, it follows that any expression e has the same semantics as its canonical form C(e).

The second step of the completeness proof is to show that canonical forms are unique up to isomorphism.

Lemma 2.2. (Uniqueness of RA Normal Form) Let e1, e2 be two RA expressions in normal form. Suppose that they have the same semantics, i.e. e1 = e2 for all inputs with arbitrary dimensions. Then their canonical forms are isomorphic.

We give a proof of this lemma in the appendix. Here, we comment on a subtle issue, namely that the requirement that e1 = e2 on inputs of any dimensions is necessary.
For example, if we restrict the matrices x, y to be of dimension 1 × 1, then the expressions ∑_{i,j} x(i,j) ∗ y(i,j) and ∑_{i,j} x(i,j) ∗ y(j,i) have the same semantics, but different canonical forms. For another example, consider three vectors x, y, z of dimension N. Then, if N ≤ 2 these two expressions are identical: ∑_{i,j,k} x(i) ∗ y(j) ∗ z(k) + 2∑_i x(i) ∗ y(i) ∗ z(i) and ∑_{i,j} x(i) ∗ y(i) ∗ z(j) + ∑_{i,j} x(i) ∗ y(j) ∗ z(i) + ∑_{i,j} x(j) ∗ y(i) ∗ z(i), although they are not equal in general.

We are now ready to establish the completeness of RA equalities, by showing that any equivalent LA expressions can be rewritten to each other through the translation rules R_LR and the canonicalization rules R_EQ:

Theorem 2.3. (Completeness of R_EQ) Two LA expressions are semantically equivalent if and only if their relational forms are in the transitive closure of the R_EQ rules:

∀e1, e2 ∈ LA: (∀d. e1(d) = e2(d)) ⟺ R_LR(e1) ≡*_{R_EQ} R_LR(e2)

Here R_LR(e) translates LA expression e into RA.

Proof.
Translating e1 and e2 to RA preserves semantics under R_LR. By Lemma 2.1, normalizing R_LR(e1) and R_LR(e2) preserves semantics. By the uniqueness of the normal form (Lemma 2.2), C(R_LR(e1)) ≡ C(R_LR(e2)) ⟺ R_LR(e1) = R_LR(e2). Since every rule in R_EQ is reversible, R_LR(e1) = R_LR(e2) ⟺ R_LR(e1) ≡*_{R_EQ} R_LR(e2).

With a complete representation of the search space by relational algebra, our next step is to explore this space and find the optimal expression in it. Traditional optimizing compilers commonly resort to heuristics to select which of the available rewrites to apply. SystemML implements a number of heuristics for its algebraic rewrite rules, and we discuss a few categories of them here.
Figure 7: E-Graph representing (X ∗ Y) ∗ Y (left), and the graph after applying associativity to the root (right). New nodes are in gray. Each dashed box is an E-Class.

COMPETING OR CONFLICTING REWRITES
The same expression may be eligible for more than one rewrite. For example, sum(AB) rewrites to sum(sum_col(A)^T ∗ sum_row(B)), but when both A and B are vectors the expression can also be rewritten to a single dot product. SystemML therefore implements heuristics to perform the first rewrite only when the expression is not a dot product. In the worst case, a set of rules interacting with each other may create a quadratic number of such conflicts, complicating the codebase.

ORDER OF REWRITES
Some rewrites should be applied after others to be effective. For example, X/y could be rewritten to X ∗ (1/y), which may be more efficient, since SystemML provides an efficient implementation for sparse multiplication but not for division. This rewrite should occur before constant folding; otherwise it may create spurious expressions like X/(1/y) → X ∗ (1/(1/y)), and without constant folding the double division will persist. However, a rewrite like 1/(1 + exp(−X)) → sigmoid(X) should come after constant folding, in order to cover expressions like (3 − 2)/(1 + exp(−X)). Since SystemML requires all rewrites to happen in one phase and constant folding in another, it has to leave out rewrites like X/y → X ∗ (1/y).

DEPENDENCY ON INPUT/PROGRAM PROPERTIES
Our example optimization from sum((X − UV^T)^2) to sum(X^2) − 2U^T X V + U^T U ∗ V^T V improves performance only if X is sparse. Otherwise, computing X^2 and X ∗ UV^T would both create dense intermediates. Similarly, some rewrites depend on program properties like common subexpressions. Usually, these rewrites apply only when the matched expression shares no CSE with others, in order to leverage common subexpression elimination. Testing input and program properties like this becomes boilerplate code, making implementation tedious and adding maintenance burden.

COMPOSING REWRITES
Even more relevant to us is the problem of composing larger rewrites out of smaller ones. Our equality rules R_EQ are very fine-grained, and any single rule is unlikely to improve performance on its own. Our example optimization from sum((X − UV^T)^2) to sum(X^2) − 2U^T X V + U^T U ∗ V^T V takes around 10 applications of R_EQ rules. If an optimizer applies rewrites one by one, it is very difficult, if not impossible, for it to discover the correct sequence of rewrites that compose together and lead to the best performance.

Stepping back, the challenge of orchestrating rewrites is known as the phase-ordering problem in compiler optimization. Tate et al. [28] proposed a solution to this problem dubbed equality saturation, which we adapt and extend in SPORES. Equality saturation optimizes an expression in two steps:
Saturation: given the input expression, the optimizer enumerates equivalent expressions and collects them into a compact representation called the E-Graph [22].
Extraction: given a cost function, the optimizer selects the optimal expression from the E-Graph. An expression is represented by a subgraph of the E-Graph, and the optimizer uses a constraint solver to find the subgraph that is equivalent to the input, and is optimal according to the cost function.
The E-Graph Data Structure
An E-Graph represents sets of equivalent expressions. A node in the graph is called an E-Class, which contains the root operators of a set of equivalent expressions. The edges are similar to the edges in an abstract syntax tree; but instead of pointing from an operator directly to a child, each edge points from an operator to an E-Class of expressions. For example, in Figure 7 the top class in the middle represents the set of equivalent expressions {(X ∗ Y) ∗ Y, X ∗ (Y ∗ Y)}. Note that the class represents two expressions, each with two appearances of Y and one appearance of X, whereas each variable only appears once in the E-Graph. This is because the E-Graph makes sure its expressions share all possible common subexpressions. As the size of the graph grows, this compression becomes more and more notable; in some cases a graph can represent a number of expressions exponential in its size [28]. We take advantage of this compression in SPORES to efficiently cover vast portions of the search space. If saturation, as described below, carries out to convergence, the E-Graph represents the search space exhaustively.

An E-Graph can also be seen as an AND-OR DAG over expressions. Each E-Class is an OR node whose children are equivalent expressions from which the optimizer chooses. Each operator is an AND node whose children must all be picked if the operator itself is picked. In this paper we favor the terms E-Graph and E-Class to emphasize that each OR node is an equivalence class.

(Another reason to leave out the rewrite X/y → X ∗ (1/y) discussed above is that X ∗ (1/y) rounds twice, whereas X/y only rounds once.)

1 def saturate(egraph, equations):
2     for eq in equations:
3         matches = egraph.match(eq.lhs)
4         for eclass in matches:
5             ec = egraph.add(eq.rhs)
6             egraph.merge(eclass, ec)

Figure 8: Pseudo code for saturating the E-Graph given a set of equalities. match returns the IDs of the root class of any matching subgraph; merge combines two E-Classes given their IDs; it also propagates the congruence closure of the new equality. Figure 9 defines add.

def add(expr):
    ID = egraph.find(expr)
    if ID != NULL:
        return ID
    else:
        cids = expr.children.map(add)
        ID = egraph.insert(expr.op, cids)
        return ID

Figure 9: Pseudo code for adding an expression to the E-Graph. find looks for the given expression in the E-Graph, and returns its root ID if it already exists, or NULL otherwise. insert adds the given operator to the E-Graph, and points its children to E-Classes with the given class IDs.
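The sharing and merging just described can be sketched with a hash-consing table plus a union-find; this simplified sketch (class and method names are ours) omits the congruence-closure propagation and rebuilding that a full E-Graph implementation such as egg performs:

```python
class EGraph:
    """Toy E-Graph: hash-consing gives sharing, union-find gives merging."""
    def __init__(self):
        self.nodes = {}    # (op, canonical child class ids) -> e-class id
        self.parent = []   # union-find over e-class ids

    def find(self, i):
        while self.parent[i] != i:
            i = self.parent[i]
        return i

    def add(self, op, children=()):
        key = (op, tuple(self.find(c) for c in children))
        if key not in self.nodes:           # reuse an existing node if any
            cid = len(self.parent)
            self.parent.append(cid)
            self.nodes[key] = cid
        return self.nodes[key]

    def merge(self, a, b):
        self.parent[self.find(a)] = self.find(b)

g = EGraph()
x, y = g.add('X'), g.add('Y')
lhs = g.add('*', (g.add('*', (x, y)), y))   # (X * Y) * Y
rhs = g.add('*', (x, g.add('*', (y, y))))   # X * (Y * Y), via associativity
g.merge(lhs, rhs)                           # assert the two forms equal

assert g.find(lhs) == g.find(rhs)
assert len(g.nodes) == 6    # X and Y are stored once and shared
```

As in Figure 7, both association orders coexist in the graph while the leaves X and Y are represented exactly once.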
Saturating the E-Graph
At the beginning of the optimization process, the optimizer instantiates the graph by inserting the nodes in the syntax tree of the input expression one by one in post order. For example, for the input (X ∗ Y) ∗ Y, we construct the left graph in Figure 7 bottom-up. By inserting in post order, we readily exploit existing common subexpressions in the input. Once the entire input expression is inserted, the optimizer starts to extend the graph with new expressions equivalent to the input. It considers a list of equations, and matches either side of each equation to subgraphs of the E-Graph. If an equation matches, the optimizer inserts the expression on the other side of the equation into the graph. For example, applying the associativity rule extends the left graph in Figure 7 with X ∗ (Y ∗ Y), resulting in the right graph. Figure 8 shows the pseudo code for this process. While inserting new expressions, the optimizer checks if any subexpression of the new expression is already in the graph. If so, it reuses the existing node, thereby exploiting all possible common subexpressions to keep the E-Graph compact. In Figure 7, only two ∗ nodes are added since the variables X and Y are already in the graph. Once the entire new expression has been added, the optimizer merges the newly created E-Class at its root with the E-Class containing the matched expression, asserting them equal. Importantly, the optimizer also propagates the congruence closure of this new equality. For example, when A + A is merged with 2 ∗ A, the optimizer also merges (A + A)^2 with (2 ∗ A)^2. Figure 9 shows the pseudo code for adding an expression to the E-Graph. This process of match-and-insert is repeated until the graph stops changing, or reaches a user-specified bound on the number of saturation iterations. If this process does converge, that means no rule can add new expressions to the graph any more.
If the set of rules is complete, as is our R_EQ, convergence of saturation implies that the resulting E-Graph represents the transitive closure of the equality rules applied to the initial expression. In other words, it contains all expressions equivalent to the input under the equality rules.
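The stopping condition (no rule adds a new expression) can be sketched over a plain set of expressions standing in for the E-Graph; the representation, the toy rule, and the iteration bound below are ours, for illustration:

```python
def saturate(exprs, rules, max_iters=10):
    # exprs: a set of expressions; rules: functions from an expression to
    # equivalent expressions. Stop at a fixpoint or at the iteration bound.
    for _ in range(max_iters):
        new = {e2 for e in exprs for r in rules for e2 in r(e)}
        if new <= exprs:
            return exprs, True    # converged: no rule adds anything new
        exprs = exprs | new
    return exprs, False           # bound hit before convergence

def comm(e):
    # Commutativity at the root of a binary product, written as a pair.
    if isinstance(e, tuple):
        a, b = e
        yield (b, a)

closed, converged = saturate({(('x', 'y'), 'z')}, [comm])
assert converged
assert closed == {(('x', 'y'), 'z'), ('z', ('x', 'y'))}
```

A real E-Graph performs the same fixpoint test over classes and nodes rather than whole expressions, which is what keeps the closure compact even when it represents exponentially many terms.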
Figure 10: The CSE problem. Each node shows its cost. A greedy optimizer picks nodes with costs 0, 1 and both nodes with cost 4; but the optimal plan uses nodes with costs 0, 2 and shares the same node with cost 4.

The outer loop that matches equations against the graph can be implemented by a more efficient algorithm like the Rete algorithm [6] when the number of equations is large. However, we did not find matching to be expensive and simply match by traversing the graph. Our implementation uses the E-Graph data structure from the egg [30] library.
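The pitfall in the Figure 10 caption can be reproduced on a small AND-OR DAG. The structure below is our reconstruction from the caption (class and operator names are ours): greedy bottom-up choice pays the shared cost-4 node once per parent, while a global search pays it once in total.

```python
from itertools import product

# E-classes: name -> list of member operators (name, cost, child classes).
classes = {
    'root':   [('r', 0, ('mid', 'shared'))],
    'mid':    [('a', 1, ('other',)), ('b', 2, ('shared',))],
    'shared': [('s', 4, ())],
    'other':  [('t', 4, ())],
}

def greedy(c):
    # Bottom-up, pick the locally cheapest member; sharing between
    # siblings is invisible to this strategy.
    return min(cost + sum(greedy(ch) for ch in kids)
               for _, cost, kids in classes[c])

def plan_cost(choice):
    # Cost of a global choice (one member per class), counting each
    # selected operator once even if reached from several parents.
    selected, stack = {}, ['root']
    while stack:
        c = stack.pop()
        name, cost, kids = classes[c][choice[c]]
        if name not in selected:
            selected[name] = cost
            stack.extend(kids)
    return sum(selected.values())

optimal = min(plan_cost(dict(zip(classes, pick)))
              for pick in product(*(range(len(m)) for m in classes.values())))
assert greedy('root') == 9 and optimal == 6
```

The greedy pass prefers the cost-1 member, missing that the cost-2 member reuses the cost-4 node the root already needs.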
Dealing with Expansive Rules
While in theory equality saturation will converge with well-constructed rewrite rules, in practice the E-Graph may explode for certain inputs under certain rules. For example, a long chain of multiplications can be rewritten to an exponential number of permutations under associativity and commutativity (AC rules). If we applied AC rules everywhere applicable in each iteration, the graph would soon use up available memory. We call this application strategy the depth-first strategy because it eagerly applies expansive rules like AC. AC rules by themselves rarely affect performance, and SystemML also provides the fused mmchain operator that efficiently computes multiplication chains, so permuting a chain is likely futile. In practice, AC rules are useful because they can enable other rewrites. Suppose we have a rule R_factor: A ∗ X + B ∗ X → (A + B) ∗ X and an expression U ∗ Y + Y ∗ V. Applying commutativity to Y ∗ V would then transform the expression to be eligible for R_factor. With this insight, we change each saturation iteration to sample a limited number of matches to apply per rule, instead of applying all matches. This amounts to adding matches = sample(matches, limit) between lines 3 and 4 in Figure 8. Sampling encourages each rule to be considered equally often and prevents any single rule from exploding the graph. This helps ensure good exploration of the search space when exhaustive search is impractical. But when it is possible for saturation to converge and be exhaustive, it still converges with high probability when we sample matches. Our experiments in Section 4.3 show sampling always preserves convergence in practice.

Extracting the Optimal Plan
Extracting the Optimal Plan

A greedy strategy to extract the best plan from the saturated E-Graph is to traverse the graph bottom-up, picking the best plan at each level. This assumes the best plan for any expression also contains the best plan for each of its subexpressions. However, the presence of common subexpressions breaks this assumption. In Figure 10 each operator node is annotated with its cost. Between the nodes with costs 1 and 2, a greedy strategy would choose 1, which incurs a total cost of 5 for that subtree (1 plus its cost-4 child). The greedy strategy then needs to pick the root node with cost 0 and the other node with cost 4, incurring a total cost of 9. However, the optimal strategy is to pick the nodes with costs 0 and 2 and share the same node with cost 4, incurring a total cost of 6.

We handle the complexity of the search problem with a constraint solver. We assign a variable to each operator and each E-Class, then construct constraints over the variables for the solver to select operators that make up a valid expression. The solver then optimizes a cost function defined over the variables; the solution corresponds to the optimal expression equivalent to the input.
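The greedy failure on Figure 10 can be reproduced on a toy e-graph (the encoding, class names, and costs below are ours, mirroring the figure): greedy bottom-up scoring picks the cost-1 node and pays 9, while the sharing-aware optimum pays only 6.

```python
# Each e-class maps to candidate nodes; a node is (cost, child class ids).
# Class "p" chooses between a cost-1 node with a private cost-4 child ("s")
# and a cost-2 node whose child is the cost-4 class "q", shared by the root.
EGRAPH = {
    "r": [(0, ("p", "q"))],
    "p": [(1, ("s",)), (2, ("q",))],
    "q": [(4, ())],
    "s": [(4, ())],
}

def greedy_cost(cls, memo=None):
    """Bottom-up greedy: cheapest subtree per class, blind to sharing."""
    memo = {} if memo is None else memo
    if cls not in memo:
        memo[cls] = min(cost + sum(greedy_cost(k, memo) for k in kids)
                        for cost, kids in EGRAPH[cls])
    return memo[cls]
```

Here `greedy_cost("r")` returns 9, because the greedy score of "p" prefers 1 + 4 over 2 + 4 and never notices that the cost-2 node would let the root share class "q". The optimal plan, selecting nodes 0, 2, and the shared 4, costs 6.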
Constraint Solving and Cost Function
We encode the problem of extracting the cheapest plan from the E-Graph with integer linear programming (ILP). Figure 11 shows this encoding. For each operator in the graph, we generate a boolean variable B_op; for each E-Class we generate a variable B_c. For the root class, we use the variable B_r. Constraint F(op) states that if the solver selects an operator, it must also select all its children; constraint G(c) states that if the solver selects an E-Class, it must select at least one of its members. Finally, we assert B_r must be selected, which constrains the extracted expression to be in the same E-Class as the unoptimized expression. These three constraints together ensure the selected nodes form a valid expression equivalent to the unoptimized input. Satisfying these constraints, the solver then minimizes the cost function given by the total cost of the selected operators. Because each B_op represents an operator node in the E-Graph which can be shared by multiple parents, this encoding only assigns the cost once for every shared common subexpression. In our implementation, we use Gurobi [10] to solve the ILP problem.
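The effect of the ILP constraints in Figure 11 can be mimicked by brute force on the same toy graph as before (our own encoding, standing in for the actual Gurobi model): enumerate one operator choice per class (G(c)), follow children from the selected root (B_r and F(op)), and pay each selected node exactly once.

```python
from itertools import product

# Same shape as Figure 10: class "p" either forces a private cost-4 node
# ("s") or shares class "q" with the root. A node is (cost, child classes).
EGRAPH = {
    "r": [(0, ("p", "q"))],
    "p": [(1, ("s",)), (2, ("q",))],
    "q": [(4, ())],
    "s": [(4, ())],
}

def plan_cost(choice):
    """Cost of the plan reachable from the root under a per-class choice,
    counting each selected node once, as the B_op variables do."""
    selected, todo, cost = set(), ["r"], 0   # B_r: the root class is selected
    while todo:
        cls = todo.pop()
        if cls in selected:
            continue                          # shared subexpression: pay once
        selected.add(cls)
        node_cost, kids = EGRAPH[cls][choice[cls]]
        cost += node_cost
        todo.extend(kids)                     # F(op): select all children
    return cost

classes = list(EGRAPH)
best = min(plan_cost(dict(zip(classes, combo)))
           for combo in product(*(range(len(EGRAPH[c])) for c in classes)))
```

Exhaustive enumeration finds the sharing-aware optimum of 6, which the greedy traversal misses; the ILP solver reaches the same answer without enumerating all choices.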
Constraints ≡ B_r ∧ ⋀_op F(op) ∧ ⋀_c G(c)
F(op) ≡ B_op → ⋀_{c ∈ op.children} B_c
G(c) ≡ B_c → ⋁_{op ∈ c.nodes} B_op

minimize ∑_op B_op · C_op   subject to Constraints
Figure 11: ILP constraint and objective for extraction.

S[X ∗ Y] = min(S[X], S[Y])
S[X + Y] = min(1, S[X] + S[Y])
S[∑_i X] = min(1, |i| · S[X])

Figure 12: Sparsity estimation. We define sparsity = nnz/size, i.e. a zero matrix has sparsity 0. (Some may find this definition counter-intuitive; we define it so to be consistent with SystemML.) |i| is the size of the aggregated dimension.

Each operation usually has cost proportional to the output size, in terms of both memory allocation and computation. Since the size of a matrix is proportional to its number of non-zeroes (nnz), we use SystemML's estimate of nnz as the cost for each operation. Under our relational interpretation, this corresponds to the cardinality of relational queries. We use the simple estimation scheme in Figure 12, which we find to work well. Future work can draw on the vast literature on sparsity and cardinality estimation to improve the cost model.

In the rules R_EQ used by the saturation process, Rule (3), if i ∉ A then A ∗ ∑_i B = ∑_i (A ∗ B), contains a condition on attribute i which may be deeply nested in the expression. This means the optimizer cannot find a match with a simple pattern match. Fortunately, all expressions in the same class must contain the same set of free attributes (attributes not bound by aggregates). In other words, the set of free attributes is invariant under equality. This corresponds precisely to the schema of a database: equivalent queries must share the same schema. We therefore annotate each class with its schema, and also enable each equation to match on the schema.

In general, we find class invariants to be a powerful construct for programming with E-Graphs. For each class we track as a class invariant whether the class contains a constant scalar. As soon as all the children of an operator are found to contain constants, we can fold the operator with the constant it computes. This seamlessly integrates constant folding with the rest of the rewrites.
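The estimation rules of Figure 12 translate directly into code; a sketch (function names are ours):

```python
def s_mult(sx, sy):
    """S[X * Y]: a product entry is non-zero only where both inputs are."""
    return min(sx, sy)

def s_add(sx, sy):
    """S[X + Y]: non-zeroes can only add up, capped at a fully dense matrix."""
    return min(1.0, sx + sy)

def s_agg(agg_dim, sx):
    """S[sum_i X]: |i| = agg_dim is the size of the aggregated dimension."""
    return min(1.0, agg_dim * sx)
```

For example, aggregating a 0.05%-sparse matrix over a dimension of size 1000 yields an estimate of 50% density, since each output entry sums 1000 input entries.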
We also treat sparsity as a class invariant and track it throughout equality saturation. Because our sparsity estimation is conservative, equal expressions that use different operators may have different estimates. But as soon as we identify them as equal, we can merge their sparsity estimates by picking the tighter one, thereby improving our cost function. Finally, we also take advantage of the schema invariant during constraint generation. Because we are only interested in RA expressions that can be translated to LA, we only generate symbolic variables for classes that have no more than two attributes in their schema. This prunes away a large number of invalid candidates and helps the solver avoid wasting time on them. We implement class invariants using egg's Metadata API.

Since equality saturation can rewrite any expression given a set of equations, we can directly perform the translation between LA and RA within saturation, simply by adding the translation rules R_LR from Figure 2. Furthermore, saturation has flexible support for custom functions. The simplest option is to treat a custom function as a black box, so saturation can still optimize below and above it. With a little more effort, we have the option to extend our equations R_EQ to reason about custom functions, removing the optimization barrier. We take the latter option for several common operators.
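In the spirit of egg's Metadata API (the class below is our own sketch, not the actual interface), merging two classes proven equal keeps the schema, which must agree, and the tighter sparsity bound:

```python
class ClassData:
    """Invariants attached to one e-class (illustrative sketch)."""
    def __init__(self, schema, sparsity, constant=None):
        self.schema = frozenset(schema)  # free attributes, invariant under equality
        self.sparsity = sparsity         # conservative estimate, nnz / size
        self.constant = constant         # known constant scalar, enables folding

    def merge(self, other):
        """Called when two e-classes are proven equal and united."""
        assert self.schema == other.schema, "equal expressions share a schema"
        self.sparsity = min(self.sparsity, other.sparsity)  # keep tighter bound
        if self.constant is None:
            self.constant = other.constant
        return self
```

Merging is where the cost model improves for free: a pessimistic estimate on one form of an expression is replaced by the best estimate over all known-equal forms.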
Figure 13: Architecture of SPORES and its integration within SystemML. SystemML parses a DML program into LA DAGs; SPORES translates each LA DAG into an RA plan, saturates it with the equations into a set of equivalent RA plans, extracts the best RA plan with a solver, and translates it back into an optimized LA plan.

Common operators outside the core RA semantics, e.g. square, minus, and divide, fall into this category. In the best scenario, if the custom function can be modeled by a combination of basic operators, we can add a rule equating the two, and retain both versions in the same graph for consideration. In fact, this last option enables us to encode fused operators and seamlessly integrate fusion with other rewrite rules. As a result, the compiler no longer needs to struggle with ordering fusion and rewrites, because saturation simultaneously considers all possible orderings.
Using equality saturation, SPORES elegantly remedies the drawbacks of heuristics mentioned in the beginning of Section 3. First, when two or more conflicting rewrites apply, they are added to the same E-Class, and the extraction step picks the more effective one based on the global cost estimate. Second, there is no need to carefully order rewrites, because saturation simultaneously considers all possible orders. For example, when rules R_1 and R_2 can rewrite expression e to either R_2(R_1(e)) or R_1(R_2(e)), one iteration of saturation would add R_1(e) and R_2(e) to the graph, and another iteration would add both R_2(R_1(e)) and R_1(R_2(e)) to the same E-Class. Third, rules do not need to reason about their dependency on input or program properties, because extraction uses a global cost model that holistically incorporates factors like input sparsity and common subexpressions. Finally, every rule application in saturation applies one step of rewrite on top of those already applied, naturally composing complex rewrites out of simple ones.

We integrate SPORES into SystemML to leverage its compiler infrastructure. Figure 13 shows the architecture of the integrated system: the optimizer plugs into the algebraic rewrite pass in SystemML. It takes in a DAG of linear algebra operations, and outputs the optimized DAG. Within the optimizer, it first translates the LA DAG into relational algebra, performs equality saturation, and finally translates the optimal expression back into LA. We obtain matrix characteristics such as dimensions and sparsity estimates from SystemML. Since we did not focus our efforts on supporting operators and data types unrelated to linear algebra computation (e.g. string manipulation), we only invoke SPORES on important LA expressions from the inner loops of the input program.
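A toy simulation of the order-independence argument, with an e-class modeled as a set of equivalent expression strings and two stand-in rules (all names and rules here are ours, for illustration):

```python
def r1(e):          # stand-in rewrite: a+b -> s
    return e.replace("a+b", "s")

def r2(e):          # stand-in rewrite: c+d -> t
    return e.replace("c+d", "t")

def saturate(seed, rules, iters):
    """Apply every rule to every member each iteration, keeping all results."""
    cls = {seed}
    for _ in range(iters):
        cls |= {rule(e) for e in cls for rule in rules}
    return cls
```

Starting from "(a+b)*(c+d)", the first iteration adds both single-rule results; the second adds the composition, reachable via either order, so no rule ordering has to be chosen up front.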
We evaluate SPORES to answer three research questions about our approach to relational equality saturation:
UnnecessaryOuterProduct          X*(Y%*%1) -> X*Y, if Y col vector
ColwiseAgg                       colsums(X) -> sum(X) or X, if col/row vector
RowwiseAgg                       rowsums(X) -> sum(X) or X, if row/col vector
ColSumsMVMult                    colSums(X*Y) -> t(Y) %*% X, if Y col vector
RowSumsMVMult                    rowSums(X*Y) -> X %*% t(Y), if Y row vector
UnnecessaryAggregate             sum(X) -> as.scalar(X), if 1x1 dims
EmptyAgg                         sum(X) -> 0, if nnz(X)==0
EmptyReorgOp                     t(X) -> matrix(0, ncol(X), nrow(X)), if nnz(X)==0
EmptyMMult                       X%*%Y -> matrix(0,...), if nnz(Y)==0
IdentityRepMatrixMult            X%*%y -> X, if y matrix(1,1,1)
ScalarMatrixMult                 X%*%y -> X*as.scalar(y), if y is a 1-1 matrix
pushdownSumOnAdd                 sum(A+B) -> sum(A)+sum(B), if dims(A)==dims(B)
DotProductSum                    sum(v^2) -> t(v)%*%v, if ncol(v)==1
reorderMinusMatrixMult           (-t(X))%*%y -> -(t(X)%*%y)
SumMatrixMult                    sum(A%*%B) -> sum(t(colSums(A))*rowSums(B))
EmptyBinaryOperation             X*Y -> matrix(0,nrow(X),ncol(X)) / X+Y->X / X-Y->X
ScalarMVBinaryOperation          X*y -> X*as.scalar(y), if y is a 1-1 matrix
UnnecessaryBinaryOperation       X*1 -> X (after rm unnecessary vectorize)
BinaryToUnaryOperation           X*X -> X^2, X+X -> X*2, (X>0)-(X<0) -> sign(X)
MatrixMultScalarAdd              eps+U%*%t(V) -> U%*%t(V)+eps
DistributiveBinaryOperation      (X-Y*X) -> (1-Y)*X
BushyBinaryOperation             (X*(Y*(Z%*%v))) -> (X*Y)*(Z%*%v)
UnaryAggReorgOperation           sum(t(X)) -> sum(X)
UnnecessaryAggregates            sum(rowSums(X)) -> sum(X)
BinaryMatrixScalarOperation      as.scalar(X*s) -> as.scalar(X)*s
pushdownUnaryAggTransposeOp      colSums(t(X)) -> t(rowSums(X))
pushdownCSETransposeScalarOp     a=t(X), b=t(X^2) -> a=t(X), b=t(X)^2 for CSE t(X)
pushdownSumBinaryMult            sum(lambda*X) -> lambda*sum(X), if lambda is scalar
UnnecessaryReorgOperation        t(t(X)) -> X, potentially introduced by other rewrites
TransposeAggBinBinaryChains      t(t(A)%*%t(B)+C) -> B%*%A+t(C)
UnnecessaryMinus                 -(-X) -> X, potentially introduced by other rewrites

Figure 14: Sum-product rewrites in SystemML. The first column lists the name for each rewrite method; each method implements a number of rewrite patterns.
The last column shows an example rewrite for each method. Following SystemML's notation, scalars are in lower-case and matrices/vectors in upper-case. %*% is matrix multiply, t() is transpose, and nnz(X) is the number of non-zeroes in X. Equality saturation derives rewrites from all 31 methods (84 patterns) with relational rules.

• Section 4.1: can SPORES derive hand-coded rewrite rules for sum-product optimization?
• Section 4.2: can SPORES find optimizations that lead to greater performance improvement than hand-coded rewrites and heuristics?
• Section 4.3: is the compilation overhead SPORES induces afforded by its performance gain?

We ran experiments on a single node with an Intel E7-4890 v2 @ 2.80GHz with hyper-threading, 1008 GB RAM, 8TB disk, and Ubuntu 16.04.6. We used OpenJDK 1.8.0, Apache Hadoop 2.7.3, and Apache Spark 2.4.4. Spark was configured to run locally with 6 executors, 8 cores/executor, 50GB driver memory, and 100GB executor memory. Our baselines are from Apache SystemML 1.2.0. All datasets were synthetically generated to evaluate a range of scenarios, using algorithm-specific data generators from SystemML's benchmark suite.
Theoretically, our first hypothesis is validated by the fact that our relational equality rules are complete w.r.t. linear algebra semantics. To test completeness in practice ("I have only proved it correct, not tried it" – Donald Knuth), our first set of experiments checks if SPORES can derive the hand-coded sum-product rewrite rules in SystemML. To do this, we input the left hand side of each rule into SPORES, perform equality saturation, then check if the rule's right hand side is present in the saturated graph.
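While SPORES checks derivability inside the e-graph, individual identities from Figure 14 can also be sanity-checked numerically. For example, SumMatrixMult, sum(A%*%B) -> sum(t(colSums(A))*rowSums(B)), verified here in numpy (our check, not part of SPORES):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((5, 4))
B = rng.random((4, 6))

# sum(A %*% B) equals the dot product of colSums(A) and rowSums(B),
# because sum_ij sum_k A[i,k] B[k,j] = sum_k (sum_i A[i,k]) (sum_j B[k,j]).
lhs = (A @ B).sum()
rhs = (A.sum(axis=0) * B.sum(axis=1)).sum()
assert np.isclose(lhs, rhs)
```

The rewrite replaces an O(nmk) matrix multiply with two O(nk) and O(km) column/row sums plus a dot product, which is why it appears among SystemML's hand-coded rules.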
Figure 15: Run time of programs compiled by different optimizers. Panels: ALS, GLM, SVM, MLR, PNMF; legend: base, opt2, saturation; x-axis: input sizes such as 0.1M×1K, 1M×1K, 10M×1K; y-axis: run time (sec).
[Figure 16 panels: "DFS, greedy extraction", "sampling, greedy extraction", and "sampling, ILP extraction" over benchmarks ALS, GLM, SVM, MLR, PNMF; stacked bars break compile time into translate, saturate, extract, and SystemML, with a timeout mark; y-axis: compile time (sec), up to 2.5s.]
Figure 16: Compile time breakdown for different saturation and extraction strategies. Depth-first saturation reaches the 2.5s timeout compiling GLM and SVM.

The optimizer is able to derive all 84 sum-product rewrite rules in SystemML using relational equality rules; see Figure 14 for a list of these rewrites. We believe replacing the 84 ad-hoc rules with our translation rules R_LR and equality rules R_EQ would greatly simplify SystemML's codebase. Together with equality saturation, our relational rules can also lead to better performance, as we demonstrate in the next set of experiments.

We compare SPORES against SystemML's native optimizations for their performance impact. In particular, we run SystemML with optimization level 2 (opt2), which is its default and includes all advanced rewrites like constant folding and common subexpression elimination. We additionally enable SystemML's native sum-product rewrites and operator fusion. For the baseline (base) we use level 1 optimization from SystemML, since level 0 (no optimization) times out on almost all input sizes. base includes no advanced rewrites, sum-product optimization, or operator fusion; it only performs local optimizations with basic pattern-matching. We compile and execute 5 real-world algorithms under 3 configurations: 1. SystemML without optimization, 2. SystemML with optimization configured as above, and 3. our equality saturation optimizer. The algorithms include Generalized Linear Model (GLM), Multinomial Logistic Regression (MLR), Support Vector Machine (SVM), Poisson Nonnegative Matrix Factorization (PNMF), and Alternating Least Squares Factorization (ALS). We take the implementations of these algorithms from SystemML's performance benchmark suite.

Figure 15 shows the performance improvement for each optimization setting. Overall, equality saturation is competitive with the hand-coded rules in SystemML: for GLM and SVM, saturation discovers the same optimizations that improve performance as SystemML does.
For ALS, MLR and PNMF, saturation found new optimizations that lead to speedups from 1.2X to 5X compared to SystemML.
[Figure 17 panels: ALS, GLM, SVM, MLR, PNMF; bars compare SystemML, S+ILP, S+greedy, and D+greedy across input sizes 1M×10, 10M×10, 10M×100; y-axis: run time (sec).]
Figure 17: Performance impact of different saturation and extraction strategies. S is saturation with sampling, and D is depth-first saturation. Depth-first saturation runs into the timeout compiling GLM and SVM.

We analyze each benchmark in detail in the following paragraphs.
For ALS, SPORES leads to up to 5X speedup beyond SystemML's optimizations using our relational rules. Investigating the optimized code reveals the speedup comes from a rather simple optimization: SPORES expands (U V^T − X)V to U V^T V − XV to exploit the sparsity in X. Before the optimization, all three operations (2 matrix multiplies and 1 minus) in the expression create dense intermediates, because U and V are dense. After the optimization, XV can be computed efficiently thanks to the sparsity in X, and U V^T V can be computed in one go without intermediates, taking advantage of SystemML's mmchain operator for matrix multiply chains. Although the optimization is straightforward, it is counter-intuitive, because one expects computing A(B + C) to be more efficient than AB + AC if one does not consider sparsity. For the same reason, SystemML simply does not consider distributing the multiplication, and misses the optimization.

For PNMF, the speedup of up to 3X using RA rules is attributed to rewriting sum(W H) to sum_col(W) · sum_row(H), which avoids materializing a dense intermediate W H. Interestingly, SystemML includes this rewrite rule but did not apply it during optimization. In fact, SystemML only applies the rule when W H does not appear elsewhere, in order to preserve the common subexpression. However, although W H is shared by another expression in PNMF, the other expression can also be optimized away by another rule. Because both rules use heuristics to favor sharing CSEs, neither fires. This precisely demonstrates the limitation of heuristics.
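The ALS rewrite is easy to check numerically; a small numpy sketch (sizes and names are ours; the actual benefit requires a truly sparse X, which SystemML's sparse formats exploit):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 150
U = rng.random((n, 1))
V = rng.random((m, 1))
X = np.zeros((n, m))
X[rng.integers(0, n, 50), rng.integers(0, m, 50)] = 1.0  # sparse-ish X

# Original plan: materializes the dense n x m intermediate U V^T - X.
lhs = (U @ V.T - X) @ V
# Rewritten plan: U (V^T V) - X V touches only small or sparse operands.
rhs = U @ (V.T @ V) - X @ V
assert np.allclose(lhs, rhs)
```

The equivalence follows from associativity and distributivity; the rewrite wins only because X V stays cheap when X is sparse, which is exactly the input property a syntactic rule cannot see without a cost model.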
For MLR, the important optimization by saturation is rewriting P ∗ X − P ∗ sum_row(P) ∗ X to P ∗ (1 − P) ∗ X, where P is a column vector. (The expression is simplified here for presentation; in the source code P and X are not variables but consist of subexpressions.) This takes advantage of the sprop fused operator in SystemML to compute P ∗ (1 − P), therefore allocating only one intermediate. Note that the optimization factors out P, which is the exact opposite of the optimization in ALS that distributes the multiply. Naive rewrite rules would have to choose between the two directions, or resort to heuristics to decide when to pick which.

For SVM and GLM, equality saturation finds the same optimizations as SystemML does, leading to speedup mainly due to operator fusion. Upon inspection, we could not identify better optimizations for SVM. For GLM, however, we discovered a manual optimization that should improve performance in theory, but did not have an effect in practice, since SystemML cannot accurately estimate sparsity to inform execution.
In our initial experiments, SPORES induces nontrivial compilation overhead compared to SystemML's native rule-based rewrites. Figure 16 (sampling, ILP extraction) shows the compile time breakdown for each benchmark; the majority of time is spent in the ILP solver. We therefore experiment with a greedy algorithm during extraction, to see if we can trade off guarantees of optimality for a shorter compile time. This algorithm traverses the saturated graph bottom-up, picking the cheapest operator in each class at every level. Figure 17 shows the performance impact of greedy extraction, and Figure 16 (sampling, greedy extraction) shows the compile time with it. Greedy extraction significantly reduces compile time without sacrificing any performance gain! This is not surprising in light of the optimizations we discussed in Section 4.2: all of these optimizations improve performance regardless of common subexpressions, so they are selected by both the ILP-based and the greedy extractor.

We also compare saturation with sampling against depth-first saturation in terms of performance impact and compile time. Recall that the depth-first saturation strategy applies all matches per rule per iteration. As Figure 16 shows, sampling is slightly slower for ALS, MLR and PNMF, but resolves the timeout for GLM and SVM. This is because sampling takes longer to converge when full saturation is possible, and otherwise prevents the graph from blowing up before reaching the iteration limit. Indeed, saturation converges for ALS, MLR and PNMF, which means SPORES can guarantee the optimality of its result under the given cost model. Saturation does not converge before reaching the iteration limit for GLM and SVM, because of deeply nested ∗ and + in those programs. Convergence may come as a surprise, because even with the E-Graph's compaction, expansive rules like associativity and commutativity commonly apply in practice. However, the expression DAGs we encounter are often small (no more than 15 operators), and large DAGs are cut into small pieces by optimization barriers like uninterpreted functions.

Figure 16 compares the overall DAG compilation overhead of SystemML against SPORES with different extraction strategies. Note that the overhead of SystemML also includes optimizations unrelated to sum-product rewrites that are difficult to disentangle; it therefore only gives a sense of the base compilation time and does not serve as a head-to-head comparison against SPORES. Although SPORES induces significant compilation overhead in light of SystemML's total DAG compilation time, the overhead is afforded by its performance gain. As we did not focus our efforts on reducing compile time, we believe there is plenty of room for improvement, for example organizing rewrite rules to speed up saturation.
There is a vast body of literature on both relational query optimization and optimizing compilers for machine learning. Since we optimize machine learning programs through a relational lens, our work relates to research in both fields. As we have pointed out, numerous state-of-the-art optimizing compilers for machine learning resort to syntactic rewrites and heuristics to optimize linear algebra expressions [1] [27] [11]. We distinguish our work, which performs optimization based on a relational semantics of linear algebra and holistically explores the complex search space. A majority of relational query optimization research focuses on join order optimization [7] [19] [20] [25]; we distinguish our work, which optimizes programs with join (product), union (addition), and aggregate (sum) operations. Sum-product optimization considers operators other than join while optimizing relational queries. Recent years have seen a line of excellent theoretical and practical research in this area [17] [14]. These works give significant improvement for queries involving ∗ and ∑, but fall short of LA workloads that occur in practice. We step past these frameworks by incorporating common subexpressions and addition (+).

In terms of approach, our design of a relational IR ties into research that explores the connection between linear algebra and relational algebra. Our design and implementation of the optimizer ties into research that leverages equality saturation and AND-OR DAGs for query optimization and compiler optimization for programming languages. Since we focus on optimizing sum-product expressions in linear algebra, our work naturally relates to research in sum-product optimization. We now discuss these three lines of research in detail.
Elgamal et al. [5] envision SPOOF, a compiler for machine learning programs that leverages relational query optimization techniques for LA sum-product optimization. We realize this vision by providing the translation rules from LA to RA and the relational equality rules that completely represent the search space for sum-product expressions. One important distinction is that Elgamal et al. propose a restricted relational algebra where every expression must have at most two free attributes. This ensures that every relational expression in every step of the optimization process is expressible in LA. In contrast, we remove this restriction and only require the optimized output to be in linear algebra. This allows us to trek out to spaces not covered by linear algebra equality rules and achieve completeness. In addition to sum-product expressions, Elgamal et al. also consider selection and projection operations, like selecting the positive entries of a matrix. We plan to explore supporting selection and projection in the future. Elgamal et al. also propose compile-time generation of fused operators, which is implemented by Boehm et al. [2]. SPORES readily takes advantage of existing fused operators, and we plan to explore combining sum-product rewrites with fusion generation in the future.

MorpheusFI by Li et al. [18] and LARA by Hutchison et al. [12] explore optimizations across the interface of machine learning and database systems. In particular, MorpheusFI speeds up machine learning algorithms over large joins by pushing computation into each joined table, thereby avoiding expensive materialization. LARA implements linear
PREPRINT - D
ECEMBER
algebra operations with relational operations and shows competitive optimizations alongside popular data processing systems. Schleich et al. [24] and Khamis et al. [16] explore in-database learning, which aims to push entire machine learning algorithms into the database system. We contribute in this space by showing that even without a relational engine, the relational abstraction can still benefit machine learning tasks as a powerful intermediate abstraction.
Equality saturation and AND-OR DAGs have been applied to optimize low-level assembly code [15], Java programs [28], database queries [7], floating point arithmetic [23], and even computer-aided design models [21]. The design of our relational IR brings unique challenges in adopting equality saturation. Compared to database query optimizers that focus on optimizing join orders, unions and aggregates play a central role in our relational IR and are prevalent in real-world programs. As a result, our equality rules depend on the expression schema, which is not immediately accessible from the syntax. We propose class invariants as a solution to access schema information, and show them to be a powerful construct that enables constant folding and improves cost estimation. Compared to optimizers for low-level assembly code or Java programs, we commonly encounter linear algebra expressions that trigger expansive rules and make saturation convergence impractical. We propose to sample rewrite matches in order to achieve good exploration without full saturation. Equality saturation takes advantage of constraint solvers, which have also been applied to program optimization and query optimization. In particular, the use of solvers for satisfiability modulo theories by [26] has spawned a paradigm now known as program synthesis. In query optimization research, [29] applies Mixed Integer Linear Programming to optimize join ordering. Although constraint solvers offer pleasing guarantees of optimality, our experiments show their overhead does not bring significant gains for optimizing LA expressions.
We propose a novel optimization approach for compiling linear algebra programs, using relational algebra as an intermediate representation and equality saturation to explore the search space. We implement our equality-saturation-based optimizer and show it is effective for improving real-world machine learning algorithms.
The authors would like to thank Alexandre Evfimievski, Matthias Boehm, and Berthold Reinwald for their insights on SystemML's internals. We would also like to thank Mathew Luo and Brendan Murphy for their guidance in piloting an ILP extractor, as well as Zach Tatlock, Max Willsey, and Chandrakana Nandi for their valuable feedback and discussion.
References [1] M. Boehm. Apache systemml. In S. Sakr and A. Y. Zomaya, editors,
Encyclopedia of Big Data Technologies .Springer, 2019.[2] M. Boehm, B. Reinwald, D. Hutchison, P. Sen, A. V. Evfimievski, and N. Pansare. On optimizing operator fusionplans for large-scale machine learning in systemml.
PVLDB , 11(12):1755–1768, 2018.[3] S. Chaudhuri and M. Y. Vardi. Optimization of
Real conjunctive queries. In
Proceedings of the Twelfth ACMSIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, May 25-28, 1993, Washington, DC,USA , pages 59–70, 1993.[4] S. Cohen, Y. Sagiv, and W. Nutt. Equivalences among aggregate queries with negation.
ACM Trans. Comput.Log. , 6(2):328–360, 2005.[5] T. Elgamal, S. Luo, M. Boehm, A. V. Evfimievski, S. Tatikonda, B. Reinwald, and P. Sen. SPOOF: Sum-ProductOptimization and Operator Fusion for Large-Scale Machine Learning. In
CIDR , 2017.[6] C. Forgy. Rete: A fast algorithm for the many patterns/many objects match problem.
Artif. Intell. , 19(1):17–37,1982.[7] G. Graefe. The Cascades Framework for Query Optimization.
IEEE Data Eng. Bull. , 18(3), 1995.[8] T. J. Green. Containment of conjunctive queries on annotated relations. In
Database Theory - ICDT 2009, 12thInternational Conference, St. Petersburg, Russia, March 23-25, 2009, Proceedings , pages 296–309, 2009.16
PREPRINT - D
ECEMBER
24, 2020[9] T. J. Green, G. Karvounarakis, and V. Tannen. Provenance semirings. In L. Libkin, editor,
Proceedings ofthe Twenty-Sixth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 11-13,2007, Beijing, China , pages 31–40. ACM, 2007.[10] L. Gurobi Optimization. Gurobi optimizer reference manual, 2019.[11] B. Huang, S. Babu, and J. Yang. Cumulon: optimizing statistical data analysis in the cloud. In K. A. Ross, D. Sri-vastava, and D. Papadias, editors,
Proceedings of the ACM SIGMOD International Conference on Managementof Data, SIGMOD 2013, New York, NY, USA, June 22-27, 2013 , pages 1–12. ACM, 2013.[12] D. Hutchison, B. Howe, and D. Suciu. Laradb: A minimalist kernel for linear and relational algebra computation.
CoRR , abs/1703.07342, 2017.[13] Y. E. Ioannidis and R. Ramakrishnan. Containment of conjunctive queries: Beyond relations as sets.
ACM Trans.Database Syst. , 20(3):288–324, 1995.[14] M. R. Joglekar, R. Puttagunta, and C. R´e. Ajar: Aggregations and joins over annotated relations. In
PODS .ACM, 2016.[15] R. Joshi, G. Nelson, and K. H. Randall. Denali: A goal-directed superoptimizer. In J. Knoop and L. J. Hendren,editors,
Proceedings of the 2002 ACM SIGPLAN Conference on Programming Language Design and Implemen-tation (PLDI), Berlin, Germany, June 17-19, 2002 , pages 304–314. ACM, 2002.[16] M. A. Khamis, H. Q. Ngo, X. Nguyen, D. Olteanu, and M. Schleich. In-database learning with sparse tensors.
CoRR , abs/1703.04780, 2017.[17] M. A. Khamis, H. Q. Ngo, and A. Rudra. FAQ: Questions Asked Frequently. In
PODS . ACM, 2016.[18] S. Li, L. Chen, and A. Kumar. Enabling and optimizing non-linear feature interactions in factorized linear alge-bra. In P. A. Boncz, S. Manegold, A. Ailamaki, A. Deshpande, and T. Kraska, editors,
Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019, pages 1571–1588. ACM, 2019.
[19] G. Moerkotte and T. Neumann. Analysis of Two Existing and One New Dynamic Programming Algorithm for the Generation of Optimal Bushy Join Trees without Cross Products. In VLDB, 2006.
[20] G. Moerkotte and T. Neumann. Dynamic Programming Strikes Back. In SIGMOD, 2008.
[21] C. Nandi, A. Anderson, M. Willsey, J. R. Wilcox, E. Darulova, D. Grossman, and Z. Tatlock. Using e-graphs for CAD parameter inference. CoRR, abs/1909.12252, 2019.
[22] C. G. Nelson. Techniques for Program Verification. PhD thesis, Stanford University, Stanford, CA, USA, 1980. AAI8011683.
[23] P. Panchekha, A. Sanchez-Stern, J. R. Wilcox, and Z. Tatlock. Automatically improving accuracy for floating point expressions. In D. Grove and S. Blackburn, editors, Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation, Portland, OR, USA, June 15-17, 2015, pages 1–11. ACM, 2015.
[24] M. Schleich, D. Olteanu, and R. Ciucanu. Learning linear regression models over factorized joins. In F. Özcan, G. Koutrika, and S. Madden, editors, Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, pages 3–18. ACM, 2016.
[25] P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access path selection in a relational database management system. In Proceedings of the 1979 ACM SIGMOD International Conference on Management of Data, pages 23–34. ACM, 1979.
[26] A. Solar-Lezama, L. Tancau, R. Bodík, S. A. Seshia, and V. A. Saraswat. Combinatorial sketching for finite programs. In J. P. Shen and M. Martonosi, editors, Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2006, San Jose, CA, USA, October 21-25, 2006, pages 404–415. ACM, 2006.
[27] A. K. Sujeeth, H. Lee, K. J. Brown, T. Rompf, H. Chafi, M. Wu, A. R. Atreya, M. Odersky, and K. Olukotun. OptiML: An implicitly parallel domain-specific language for machine learning. In L. Getoor and T. Scheffer, editors, Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011, pages 609–616. Omnipress, 2011.
[28] R. Tate, M. Stepp, Z. Tatlock, and S. Lerner. Equality saturation: A new approach to optimization. Logical Methods in Computer Science, 7(1), 2011.
PREPRINT - DECEMBER 24, 2020

[29] I. Trummer and C. Koch. Solving the join ordering problem via mixed integer linear programming. In S. Salihoglu, W. Zhou, R. Chirkova, J. Yang, and D. Suciu, editors, Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 14-19, 2017, pages 1025–1040. ACM, 2017.
[30] M. Willsey, C. Nandi, Y. R. Wang, O. Flatt, Z. Tatlock, and P. Panchekha. egg: Fast and extensible equality saturation, 2020.
A Uniqueness of RA Canonical Form
To prove Lemma 2.2 (uniqueness of canonical form), we first give formal definitions for several important constructs. First, we interpret a tensor (high-dimensional matrix) as a function from a tuple of indices to a real number:

Definition A.1 (Tensors). A tensor is a function $A : [N]^d \to \mathbb{R}$ such that $A(i, j, \ldots)$ returns the tensor entry $A_{ij\ldots}$. Here $d$ is the dimensionality of $A$, and $N$ is the dimension size. Each argument to $A$ is an index, and we write $\mathbf{i}$ (in bold) for a tuple of indices. We write $\mathbf{A}$ (in bold) for $A(\mathbf{i})$, i.e. a tensor indexed with a tuple of indices.

We use the terms tensor and relation interchangeably. Note that we assume every dimension has the same size $N$ for simplicity. If the dimension sizes differ in a tensor $A$, we can easily set $N$ to be the maximum size and fill $A$ with zeroes accordingly. For example, for a 2D matrix $A$ with dimensions $|d_1| = m < |d_2| = n$, we simply set $N = n$ and have $A_{ij} = 0$ for $i > m$.

Now we give names to special forms of expressions at each level of the normal form:

Definition A.2 (Atoms, Monomials, Terms, Polyterms). An indexed tensor $\mathbf{A}$ is also called an atom. A monomial $m$ is a product of any number of atoms. A term $t$ is an aggregate of a monomial over a set of indices. A polyterm $e$ is a sum of terms (with a constant term $c$ at the end), where each term has a constant factor:

$\mathbf{A} := A(\mathbf{i})$  (1) atoms
$m := \mathbf{A}_1 \times \mathbf{A}_2 \times \cdots \times \mathbf{A}_n$  (2) monomials
$t := \sum_{\mathbf{i}} m$  (3) terms
$e := c_1 t_1 + c_2 t_2 + \cdots + c_n t_n + c$  (4) polyterms

We identify the monomial $m$ with a bag of atoms, denoted $bag(m)$, such that for every atom $\mathbf{A}$ occurring $n$ times in $m$, $bag(m)$ contains $n$ copies of $\mathbf{A}$. An index $i$ in a term $t = \sum_{\mathbf{i}} m$ is bound if $i \in \mathbf{i}$; otherwise it is free. We write $bv(t)$ for the set of bound indices in $t$, $fv(t)$ for the set of free indices in $t$, and $vars(t)$ for the set of all indices in $t$.

Example 1.
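As an illustrative aside (not part of the paper's formalism or the SPORES implementation), the identification of a monomial with a bag of atoms in Definition A.2 can be sketched in Python with `collections.Counter`; the atom encoding as (tensor name, index tuple) pairs is our own assumption:

```python
from collections import Counter

# A minimal sketch: an atom is a (tensor name, index tuple) pair, and a
# monomial is identified with the *bag* of its atoms (Definition A.2),
# so reordering the factors of a product does not change the monomial.
def bag(monomial):
    """Return the bag of atoms of a monomial (given as a list of atoms)."""
    return Counter(monomial)

# The monomial A(i,j) x A(i,j) x B(j,k) x B(j,k) written in two orders:
m1 = [("A", ("i", "j")), ("A", ("i", "j")), ("B", ("j", "k")), ("B", ("j", "k"))]
m2 = [("A", ("i", "j")), ("B", ("j", "k")), ("A", ("i", "j")), ("B", ("j", "k"))]

assert bag(m1) == bag(m2)               # same bag, hence the same monomial
assert bag(m1)[("A", ("i", "j"))] == 2  # A(i,j) occurs twice in the bag
```

`Counter` equality ignores insertion order but tracks multiplicities, which is exactly the bag semantics the definition asks for.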
The polyterm $\sum_j A(i,j) \times A(i,j) \times B(j,k) \times B(j,k) + 3 \sum_l A(i,l) \times C(l,k) + 2$ represents the linear algebra expression $A^2 B^2 + 3AC + 2$, where $A^2$ squares the matrix element-wise. The monomial $A(i,j) \times A(i,j) \times B(j,k) \times B(j,k)$ is the same as $A(i,j) \times B(j,k) \times A(i,j) \times B(j,k)$, and we view it as the bag $\{A(i,j), A(i,j), B(j,k), B(j,k)\}$.

Before giving the formal definition of a canonical form, we need to define two syntactic relationships between our expressions, namely homomorphism and isomorphism. Fix terms $t = \sum_{\mathbf{i}} m$ and $t' = \sum_{\mathbf{i}'} m'$, and let $f : \mathbf{i} \to \mathbf{i}'$ be any function. Let $\mathbf{A} \in bag(m)$ be an atom of $m$. We write $f(\mathbf{A})$ for the result of applying $f$ to all bound indices of $\mathbf{A}$. We write $f(bag(m))$ for the bag obtained by applying $f$ to each atom $\mathbf{A} \in bag(m)$.

Definition A.3 (Term Homomorphism).
A homomorphism $f : t \to t'$ is a function $f : \mathbf{i} \to \mathbf{i}'$ such that $f(bag(m)) = bag(m')$. Note that $f$ is a one-to-one mapping between $bag(m)$ and $bag(m')$.

Example 2.
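To make the homomorphism condition concrete, here is a small Python sketch (our own illustration, not the SPORES implementation). Following Example 2 below, where the map collapses duplicate atoms, we compare the set of atoms in the image of $f$ against the atoms of the target term; the term encoding is an assumption of this sketch:

```python
# Sketch of Definition A.3: a term is (bound_indices, monomial), where a
# monomial is a list of (tensor name, index tuple) atoms. A candidate map
# f on bound indices is a homomorphism t -> t' if applying f to every atom
# of t yields exactly the atoms of t'; free indices map to themselves.
def apply_map(f, atom):
    name, idxs = atom
    return (name, tuple(f.get(i, i) for i in idxs))

def is_homomorphism(f, t, t_prime):
    _, m = t
    _, m_prime = t_prime
    return {apply_map(f, a) for a in m} == set(m_prime)

# Example 2: t1 = sum_{v,w,s,t} A(i,v) x B(v,w) x A(i,s) x B(s,t)
#            t2 = sum_{j,k}     A(i,j) x B(j,k)
t1 = (("v", "w", "s", "t"),
      [("A", ("i", "v")), ("B", ("v", "w")), ("A", ("i", "s")), ("B", ("s", "t"))])
t2 = (("j", "k"), [("A", ("i", "j")), ("B", ("j", "k"))])

f = {"v": "j", "w": "k", "s": "j", "t": "k"}
assert is_homomorphism(f, t1, t2)
# A map that sends w to j produces the atom B(j,j), which t2 lacks:
assert not is_homomorphism({"v": "j", "w": "j", "s": "j", "t": "k"}, t1, t2)
```

Note `f.get(i, i)` implements the convention, stated after Example 2, that free variables map to themselves.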
Between terms $t_1 = \sum_{vwst} A(i,v) \times B(v,w) \times A(i,s) \times B(s,t)$ and $t_2 = \sum_{jk} A(i,j) \times B(j,k)$ there is a homomorphism $t_1 \to t_2$: $[v \mapsto j,\; w \mapsto k,\; s \mapsto j,\; t \mapsto k]$.

We extend $f$ to take $vars(t)$, where free variables map to themselves. It is easy to see that for any homomorphism $f : t_1 \to t_2$, $f(vars(t_1)) = vars(t_2)$.

Corollary 1.
Homomorphism is surjective on the indices.

Proof. Suppose for the sake of contradiction that a homomorphism $f : t_1 \to t_2$ is not surjective. Then there exists an index $i \in vars(t_2)$ that is not in $f(vars(t_1))$, and the atom containing $i$ does not appear in $f(t_1)$. That implies the monomial in $f(t_1)$ cannot be equal to the monomial in $t_2$, so $f$ is not a homomorphism, a contradiction.

Corollary 2.
Homomorphism is closed under composition: given homomorphisms $f : t_1 \to t_2$ and $g : t_2 \to t_3$, the composition $g \circ f : t_1 \to t_3$ is a homomorphism from $t_1$ to $t_3$.

Proof. A homomorphism is a function on indices, so composing homomorphisms is just composing functions.

A stronger correspondence between terms is an isomorphism:

Definition A.4 (Term Isomorphism). Terms $t_1 = \sum_{\mathbf{i}_1} m_1$ and $t_2 = \sum_{\mathbf{i}_2} m_2$ are isomorphic iff there is a bijective homomorphism between them. We write $t_1 \equiv t_2$ to mean $t_1$ and $t_2$ are isomorphic.

A pair of homomorphisms $t_1 \to t_2$ and $t_2 \to t_1$ produces an isomorphism:

Lemma A.1. Given two terms $t_1$ and $t_2$, if there is a homomorphism $f : t_1 \to t_2$ and a homomorphism $g : t_2 \to t_1$, then $t_1 \equiv t_2$. If there is a cycle of homomorphisms among a number of terms, all terms on the cycle are isomorphic.

Proof. By Corollary 1, a pair of homomorphisms between two terms is a pair of surjective maps between the terms' indices. A pair of surjective maps between finite sets induces a bijective map, which is an isomorphism. By Corollary 2, if $t_1$ and $t_2$ are on a cycle of homomorphisms, we can compose the homomorphisms along the chains between them to obtain a pair of homomorphisms $f : t_1 \to t_2$ and $g : t_2 \to t_1$, which implies $t_1 \equiv t_2$.

We are now ready to formally define the canonical form for RA expressions:

Definition A.5 (Canonical Form).
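For small terms, Definition A.4 can be checked by brute force. The following Python sketch (our own illustration; the term encoding as (bound indices, atom list) is an assumption) searches for a bijection between bound indices that maps one bag of atoms exactly onto the other:

```python
from itertools import permutations

# A term is (bound_indices, monomial); a monomial is a list of
# (tensor name, index tuple) atoms. Two terms are isomorphic iff some
# bijection between their bound indices turns the one bag of atoms into
# exactly the other (Definition A.4). Free indices stay fixed.
def isomorphic(t1, t2):
    (b1, m1), (b2, m2) = t1, t2
    if len(b1) != len(b2) or len(m1) != len(m2):
        return False
    for perm in permutations(b2):
        f = dict(zip(b1, perm))
        image = sorted((n, tuple(f.get(i, i) for i in idx)) for n, idx in m1)
        if image == sorted(m2):
            return True
    return False

ta = (("j",), [("A", ("i", "j")), ("B", ("j", "k"))])  # sum_j A(i,j) B(j,k)
tb = (("l",), [("A", ("i", "l")), ("B", ("l", "k"))])  # sum_l A(i,l) B(l,k)
tc = (("j",), [("A", ("j", "i")), ("B", ("j", "k"))])  # sum_j A(j,i) B(j,k)

assert isomorphic(ta, tb)      # differ only by renaming the bound index
assert not isomorphic(ta, tc)  # A's indices are transposed
```

Comparing sorted atom lists (rather than sets) keeps multiplicities, so the check is a bag equality, and requiring a bijection on bound indices makes the homomorphism bijective as the definition demands.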
An RPlan expression (as defined in Table 1) is canonical if it is a polyterm (Defini-tion A.2) containing no isomorphic terms.
Our ultimate goal is to identify canonical-form isomorphism with equivalence. That is, two canonical expressions are equivalent iff they are isomorphic. By equivalence we mean that the expressions evaluate to the same result given the same inputs:
Definition A.6 (Equivalence of Expressions). Two expressions are equivalent iff $\forall \mathbf{T}.\; e_1(\mathbf{T}) = e_2(\mathbf{T})$, where $\mathbf{T}$ is the set of input tensors to the expressions. We write $e_1 = e_2$ to mean $e_1$ and $e_2$ are equivalent.

We can canonicalize any expression by pulling $+$ to the top and pushing $\times$ to the bottom, while combining isomorphic terms $c_1 t_1 + c_2 t_2$ into $(c_1 + c_2) t_1$:

Lemma A.2. For every RPlan expression, there is an equivalent canonical expression.

Proof. The proof is a standard application of the rewrite rules $R_{EQ}$ in Figure 3.

We can identify canonical expressions syntactically using term isomorphism:

Definition A.7 (Isomorphic Canonical Expressions).
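The term-combining step of canonicalization can be sketched in Python (an illustration under our own term encoding, not the paper's rewrite-rule implementation): assuming $+$ has already been pulled to the top, we merge isomorphic terms by summing their coefficients:

```python
from itertools import permutations

# Sketch of the final step behind Lemma A.2: once + is pulled up and x
# pushed down, isomorphic terms are merged, turning c1*t1 + c2*t2 with
# t1 isomorphic to t2 into (c1 + c2)*t1. A term is (bound indices, atoms).
def isomorphic(t1, t2):
    (b1, m1), (b2, m2) = t1, t2
    if len(b1) != len(b2) or len(m1) != len(m2):
        return False
    return any(
        sorted((n, tuple(dict(zip(b1, p)).get(i, i) for i in idx)) for n, idx in m1)
        == sorted(m2)
        for p in permutations(b2))

def combine_terms(polyterm):
    """polyterm: list of (coefficient, term) pairs; merge isomorphic terms."""
    out = []
    for c, t in polyterm:
        for k, (c0, t0) in enumerate(out):
            if isomorphic(t, t0):
                out[k] = (c0 + c, t0)  # (c1 + c2) * t1
                break
        else:
            out.append((c, t))
    return [(c, t) for (c, t) in out if c != 0]  # drop cancelled terms

# 2 * sum_j A(i,j)B(j,k) + 3 * sum_l A(i,l)B(l,k) -> 5 * sum_j A(i,j)B(j,k)
e = [(2, (("j",), [("A", ("i", "j")), ("B", ("j", "k"))])),
     (3, (("l",), [("A", ("i", "l")), ("B", ("l", "k"))]))]
assert [c for c, _ in combine_terms(e)] == [5]
```

The resulting polyterm contains no pair of isomorphic terms, which is precisely the canonical-form condition of Definition A.5.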
Given two canonical expressions $e = c_1 t_1 + \cdots + c_n t_n + c$ and $e' = c'_1 t'_1 + \cdots + c'_m t'_m + c'$, $e$ and $e'$ are isomorphic if there is a bijection $\sigma : [n] \to [m]$ (in particular $n = m$) such that $\forall i \in [n]$, $c_i = c'_{\sigma(i)}$, $t_i \equiv t'_{\sigma(i)}$, and $c = c'$. Note that isomorphic expressions have the same free variables.

We now show that two canonical expressions are isomorphic iff they are equivalent:
Theorem A.3 (Isomorphism Captures Equivalence). For canonical expressions $e_1$ and $e_2$: $e_1 \equiv e_2 \iff e_1 = e_2$.

Proof.
The left-to-right direction is straightforward: isomorphism only renames indices and reorders $+$ and $\times$, which does not change semantics. And because RA semantics is deterministic, two isomorphic expressions compute the same function.

The right-to-left direction is exactly Lemma 2.2, the uniqueness of canonical forms: $e_1 = e_2 \Rightarrow e_1 \equiv e_2$.

Without loss of generality, we make the following simplifying assumptions. First, we assume $e_1$ and $e_2$ contain no constant terms. Otherwise, the expressions would evaluate to their respective constants on all-zero inputs, so the constant terms must be equal; subtracting the same constant preserves both equivalence and isomorphism, so we can simply remove the equal constant terms. Second, we assume no term in $e_1$ is isomorphic to any term in $e_2$. Otherwise, if $e_1$ contains $c_1 t_1$ and $e_2$ contains $c_2 t_2$ with $t_1 \equiv t_2$, then we can write $e_1 = c_1 t_1 + e'_1$ and $e_2 = c_2 t_2 + e'_2$, where $e'_1$ and $e'_2$ are polyterms. Assume w.l.o.g. that $c_1 \leq c_2$. Since $c_1 t_1 + e'_1 = c_2 t_2 + e'_2$, we have $e'_1 = (c_2 - c_1) t_2 + e'_2$, and we have removed $t_1$ from $e_1$. Furthermore, $e_1 = e_2 \Rightarrow e'_1 = e'_2$ and $e'_1 \equiv e'_2 \Rightarrow e_1 \equiv e_2$. Then by proving $e'_1 = e'_2 \Rightarrow e'_1 \equiv e'_2$ we can show the following: $e_1 = e_2 \Rightarrow e'_1 = e'_2 \Rightarrow e'_1 \equiv e'_2 \Rightarrow e_1 \equiv e_2$.

Finally, we assume $e_1$ and $e_2$ are fully aggregated, i.e. every index is bound. Without this assumption, we first observe that $e_1$ and $e_2$ must have the same free indices in order to output tensors of the same dimensionality. Then we define $e'_1 = \sum_{\mathbf{i}} e_1$ and $e'_2 = \sum_{\mathbf{i}} e_2$, where $\mathbf{i}$ is the set of free variables of $e_1$ and $e_2$. We can easily show $e_1 = e_2 \Rightarrow e'_1 = e'_2$ and $e'_1 \equiv e'_2 \Rightarrow e_1 \equiv e_2$, and with a proof of $e'_1 = e'_2 \Rightarrow e'_1 \equiv e'_2$ it follows that $e_1 = e_2 \Rightarrow e_1 \equiv e_2$.

Each expression can now be viewed as a set of terms, where each term has a constant factor and no two terms are isomorphic. Writing $t_1 < t_2$ if there is a homomorphism $t_1 \to t_2$, we observe that homomorphism induces a partial order on the terms of $e_1$ and $e_2$. There is no cycle of homomorphisms, because by Lemma A.1 such a cycle would imply isomorphisms, but we have assumed there are no isomorphic terms.

We now prove Lemma 2.2 through its contrapositive: $e_1 \not\equiv e_2 \Rightarrow e_1 \neq e_2$. We proceed by constructing a set of witness input tensors on which $e_1$ and $e_2$ return different results. First, let $t_1 = \sum_{i_1 i_2 \ldots} \mathbf{X}_1 \times \mathbf{X}_2 \times \cdots$ be the minimal term under the partial order induced by homomorphism. Assume w.l.o.g. that $t_1 \in e_1$. Define a bijective function $\varphi : i_n \to n$ mapping each index to a unique number. For each atom $\mathbf{X}_n = X_n(\mathbf{i}_n)$, introduce a real-valued variable $x_n$ and construct the input tensor $X_n$ such that $X_n(\varphi(\mathbf{i}_n)) = x_n$. Finally, set all undefined entries of each input tensor to $0$.

Given the input tensors defined above, $t_1$ evaluates to a polynomial $p_1$ in the variables $x_1, \ldots, x_n$. Every monomial in $p_1$ corresponds to a function from $t_1$'s atoms $\mathbf{X}_a, \mathbf{X}_b, \ldots$ to some variables $x_m, x_n, \ldots$, and we call such a function $f_m$. There is one special (bijective) $f_m$ that maps each atom to its own variable: let $m_1$ be the monomial in $t_1$, $m_1 = \mathbf{X}_1 \times \mathbf{X}_2 \times \cdots$. We define $f_p(t_1) = m_1(\varphi(\mathbf{i}_1), \varphi(\mathbf{i}_2), \ldots) = x_1 x_2 \cdots$. The monomial $f_p(t_1)$ does not appear in the polynomial of any other term. Otherwise, if it appeared in the polynomial $p_2$ of a term $t_2$, we would be able to construct a homomorphism $t_2 \to t_1$ as $f_p^{-1} \circ f_{m_2}$, where $f_{m_2}$ maps the monomial $m_2$ in $p_2$ to $t_1$'s variables. But that is impossible, because we have picked $t_1$ to be the minimal term under homomorphism. Therefore, $p_1$ differs from the polynomial of every term in $e_2$, and by the fundamental theorem of algebra the polynomials are inequivalent. As a result, $e_1 \neq e_2$.
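The witness construction can be exercised numerically. In the Python sketch below (our own illustration under the same hypothetical term encoding as before; distinct primes stand in for the symbolic variables $x_n$), two fully aggregated, non-isomorphic terms are evaluated on the same witness-style inputs and disagree:

```python
from itertools import product

# Evaluate a fully aggregated term (bound indices, list of atoms) on given
# input tensors, summing the monomial over all assignments of the bound
# indices. Tensors are dicts from index tuples to numbers.
N = 2  # every index ranges over {0, 1}

def evaluate(term, tensors):
    bound, monomial = term
    total = 0
    for vals in product(range(N), repeat=len(bound)):
        env = dict(zip(bound, vals))
        prod = 1
        for name, idxs in monomial:
            prod *= tensors[name][tuple(env[i] for i in idxs)]
        total += prod
    return total

# Witness-style inputs: every defined entry gets a distinct value.
A = {(0, 0): 2, (0, 1): 3, (1, 0): 5, (1, 1): 7}
B = {(0, 0): 11, (0, 1): 13, (1, 0): 17, (1, 1): 19}

t1 = (("i", "j"), [("A", ("i", "j")), ("B", ("j", "i"))])  # sum_ij A(i,j)B(j,i)
t2 = (("i", "j"), [("A", ("i", "j")), ("B", ("i", "j"))])  # sum_ij A(i,j)B(i,j)

# Non-isomorphic terms evaluate differently on these inputs.
assert evaluate(t1, {"A": A, "B": B}) != evaluate(t2, {"A": A, "B": B})
```

This mirrors the proof's strategy: because the terms induce different polynomials in the distinct entry values, generic inputs separate them, witnessing $e_1 \neq e_2$.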