Equality Saturation for Tensor Graph Superoptimization
Yichen Yang, Phitchaya Mangpo Phothilimthana, Yisu Remy Wang, Max Willsey, Sudip Roy, Jacques Pienaar
ABSTRACT
One of the major optimizations employed in deep learning frameworks is graph rewriting. Production frameworks rely on heuristics to decide if rewrite rules should be applied and in which order. Prior research has shown that one can discover more optimal tensor computation graphs by searching for a better sequence of substitutions instead of relying on heuristics. However, we observe that existing approaches for tensor graph superoptimization, both in production and research frameworks, apply substitutions in a sequential manner. Such sequential search methods are sensitive to the order in which the substitutions are applied and often explore only a small fragment of the exponential space of equivalent graphs. This paper presents a novel technique for tensor graph superoptimization that employs equality saturation to apply all possible substitutions at once. We show that our approach can find optimized graphs with up to 16% speedup over state-of-the-art, while spending on average 48x less time optimizing.
1 INTRODUCTION
Deep learning frameworks and compilers have enabled diverse kinds of machine learning models to run efficiently on numerous compute platforms. Neural network models in these frameworks are typically represented as tensor computation graphs. To improve the runtime performance of a tensor graph, these frameworks perform various optimizations.

One of the most important optimizations is graph rewriting, which takes in a tensor graph g and a set of semantics-preserving graph rewrites R, and by applying rewrites to g seeks to find a semantically equivalent g′ with lower cost according to some cost model. The current industry-standard approach adopted by most frameworks is to use a manually curated set of rewrite rules and rely on a heuristic strategy to determine the order in which to apply the rewrite rules. However, this approach often leads to sub-optimal results, both due to the non-comprehensive set of rewrite rules and due to the sub-optimal graph substitution heuristic (Jia et al., 2019a;b).

This paper aims to address the sub-optimality problem of graph rewrite strategies, while leveraging the existing rewrite rule generation technique (Jia et al., 2019a). Prior research has shown that searching for sequences of substitutions (Jia et al., 2019a;b; Fang et al., 2020) outperforms heuristic approaches.

* Work done during an internship at Google. MIT EECS & CSAIL; Google, Mountain View, CA, USA; University of Washington, Seattle, USA. Correspondence to: Yichen Yang <[email protected]>. Preprint, under review.
Table 1. Comparison of optimization time and runtime speedup of the optimized computation graphs over the original graphs, TASO (Jia et al., 2019a) vs. TENSAT.

                Search time (s)        Runtime speedup (%)
                TASO      TENSAT       TASO      TENSAT
BERT            13.6
ResNeXt-50      25.3
NasNet-A        1226
NasRNN          177.3
Inception-v3    68.6
SqueezeNet      16.4
VGG-19          8.9
However, both heuristic and search-based solutions rely on sequential application of substitutions. Since rewrites often depend on or enable one another, optimization depends heavily on the order in which rewrites are applied; this classically tricky problem is known in the compilers community as the "phase-ordering" or "rewrite-ordering" problem.

This paper presents TENSAT, a tensor graph superoptimization framework that employs equality saturation (Tate et al., 2009; Stepp et al., 2011; Willsey et al., 2020), a recent technique that mitigates the phase-ordering problem by applying all possible rewrites at once. Equality saturation splits program optimization into two phases: exploration and extraction. The exploration phase uses a data structure called an e-graph to compactly generate and store all rewritings of the input program. The exploration can continue until saturation, where the e-graph stores all possible ways to write the input program using a given set of rewrites. Finally, the extraction phase selects from the e-graph the equivalent program with the lowest cost according to a given cost model. The compact representation of the exponentially large search space using e-graphs enables extraction algorithms that can find the globally optimal equivalent program quickly.

Applying equality saturation to tensor graph rewriting requires non-trivial extensions in both the exploration and extraction phases. We extend the exploration phase to support complex, non-local rewrite rules that are necessary to produce highly efficient tensor graphs. Additionally, we introduce a novel method to filter out invalid subgraphs from an e-graph, which enables our extraction procedure based on Integer Linear Programming (ILP) to quickly find the optimal solution.

We evaluated TENSAT on a number of well-known machine learning models executing on a GPU. As highlighted in Table 1, TENSAT can synthesize optimized graphs that are up to 23% faster in runtime than the state of the art (Jia et al., 2019a), while reducing the optimization time by up to 300x. By having the e-graph compactly represent an exponential number of equivalent graphs, TENSAT is able to cover a larger search space more efficiently than sequential search methods. As a result, our search approach is both extremely effective and fast enough to be used as part of a normal compilation flow.
2 EQUALITY SATURATION BACKGROUND
Term rewriting is a time-tested approach to program optimization. TENSAT employs equality saturation, a recent technique that provides a search-based approach to term rewriting and avoids problems encountered by the traditional approach.
In the term rewriting paradigm, the input to the optimizer is an initial program expression (term) e and a set of rewrite rules R. Each rewrite in R takes the form l → r, where l and r are both patterns: terms built from the program grammar and placeholder variables. Each rewrite rule states the semantic equivalence between l and r. To apply a rewrite l → r to e, the optimizer searches for the pattern l in e, yielding a list of matches, where a match σ maps variables in l to subterms of e. The matches can then be applied to the right-hand side, denoted r[σ], to produce the subterms to be replaced in e.

Example
Let e = (a × 2)/2 and the rewrite l → r be x × 2 → x ≪ 1. To apply l → r to e, the optimizer first searches for l in e, yielding a single match σ = {x ↦ a}. Then, the term r[σ] = a ≪ 1 replaces the matched subterm in e, giving the final result: (a ≪ 1)/2.

As the above example suggests, term rewriting for optimization suffers from the problem of choice: applying the wrong rewrite at the wrong time can "hide" optimizations from other rewrites. In our example, the classical strength reduction x × 2 → x ≪ 1 is beneficial in some contexts, but here it prevents the ultimate goal of rewriting (a × 2)/2 to a. The problem stems from the fact that term rewriting is typically destructive; i.e., applying a rewrite "forgets" the initial term.

An e-graph is a data structure originally devised for use in theorem provers (Nelson, 1980; Detlefs et al., 2005) that compactly encodes an equivalence relation over many terms. An e-graph is a set of equivalence classes (e-classes), each of which is a set of equivalent e-nodes. An e-node is an operator from the language paired with a list of children e-classes.

An e-graph is said to represent the terms that can be seen by choosing a representative e-node for each e-class. More formally:

• An e-graph represents a term if any of its e-classes do.
• An e-class represents a term if any of its equivalent e-nodes do. All terms represented by an e-class are equivalent.
• An e-node f(c_1, ..., c_n) represents a term f(t_1, ..., t_n) if each e-class c_i represents t_i. A childless e-node a represents a constant term a.

Figure 1 shows an e-graph that represents the term (a × 2)/2. The e-class containing the division e-node is called the root e-class, since it represents our initial term.

Figure 1 also demonstrates how e-graphs can support rewriting. Similar to traditional term rewriting, applying a rewrite l → r to an e-graph first entails searching for instances of the left-hand pattern l. This yields matches, but now each match σ maps pattern variables to e-classes instead of subterms. Then r[σ] is added to the matched e-class, only adding information to the e-graph instead of destructively replacing the term as before. Applying the rule x × 2 → x ≪ 1 in Figure 1 only adds information (white e-nodes).

A recent technique called equality saturation (Tate et al., 2009; Stepp et al., 2011; Willsey et al., 2020) mitigates the rewrite-choice problem by allowing rewrites to be applied simultaneously.
Figure 1. Left: An e-graph representing the term (a × 2)/2. Dotted boxes show e-classes, and arrows connect e-nodes to their e-class children. Right:
The e-graph after applying the rewrite x × 2 → x ≪ 1. Only a few e-nodes were added (highlighted in white), and the result represents both the initial and rewritten terms.

Naïvely generating all possible rewritings of a term would require exponential space and time, so equality saturation uses e-graphs to efficiently represent the massive space of equivalent terms.

Equality saturation takes in a term to optimize and a set of rewrites. First, an initial e-graph is created from the input term. Then, the rewrites are applied until either saturation is reached, meaning the rewrites add no new information to the e-graph, or until a specified timeout. The resulting e-graph encodes an equivalence relation over a large set of terms, all of which were built by applying rewrites to the initial term. Finally, a procedure called extraction selects the best represented term from the root e-class according to a user-provided cost function. Extraction procedures can vary in their speed and the cost functions they support; both simple greedy algorithms (Panchekha et al., 2015) and more complex ILP-powered solutions (Tate et al., 2009; Wang et al., 2020) have been used. The extracted term is guaranteed (if the rewrites themselves are sound) to be equivalent to the input term, and is thus returned as the optimized program.

Equality saturation effectively breaks down optimization into two phases: the exploration phase grows an e-graph by applying rewrites, and then the extraction phase selects the best term from the search space. This decomposition avoids the rule choice problem in conventional term rewriting; instead of having to choose which rewrite to apply to end up at the best term, the algorithm first generates all rewritten terms, leaving the choice of which term to select to the extraction procedure.

Equality saturation can handle rules that could lead to non-termination in other rewriting settings, such as x + y → y + x and a → a × 1. These rules may not be useful on their own, but applying them may allow other, more useful rewrites to fire. Equality saturation thus allows fewer, smaller rules to compose to prove large equalities, where other rewriting systems have to use many larger, similar rewrite rules to avoid problematic, non-terminating rewrites.

Equality saturation can also prove things that directed rewriting could not, even with an oracle to choose the correct rewrite at the correct time. Consider the two rewrites f(a, b) → c and a → b, and let the input term be f(b, a). Directed rewriting could only rewrite the term to f(b, b), but applying a → b in an e-graph would prove that f(a, b) = f(b, a) = f(a, a) = f(b, b), i.e., those terms would all be represented in the same e-class. Once the first rewrite is applied, c would also be added to the e-class. While contrived, this example demonstrates that rewriting with e-graphs can give strictly more proving power; in other words, if equality saturation can rewrite e to e′ with ruleset R, there may not be a way to do the same with R using traditional rewriting.
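To make the two phases concrete, here is a minimal sketch that runs the (a × 2)/2 example through egg, the library TENSAT builds on. The rule set and the simplify helper are ours for illustration, assuming egg's SymbolLang string-pattern API; this is not code from TENSAT itself.

// Minimal equality-saturation sketch with egg, mirroring the (a * 2) / 2
// example. Assumes egg's SymbolLang and string-pattern API; rule names and
// the `simplify` helper are ours, not the paper's.
use egg::{rewrite as rw, AstSize, Extractor, RecExpr, Rewrite, Runner, SymbolLang};

fn simplify(s: &str) -> String {
    let rules: Vec<Rewrite<SymbolLang, ()>> = vec![
        rw!("mul-two-to-shift"; "(* ?x 2)" => "(<< ?x 1)"),
        rw!("shift-to-mul-two"; "(<< ?x 1)" => "(* ?x 2)"),
        rw!("assoc-div"; "(/ (* ?x ?y) ?z)" => "(* ?x (/ ?y ?z))"),
        rw!("cancel-div"; "(/ ?x ?x)" => "1"), // sound here; real rules must guard x != 0
        rw!("mul-one"; "(* ?x 1)" => "?x"),
    ];
    let expr: RecExpr<SymbolLang> = s.parse().unwrap();
    // Exploration: apply all rules until saturation or a built-in limit.
    let runner = Runner::default().with_expr(&expr).run(&rules);
    // Extraction: pick the smallest equivalent term from the root e-class.
    let extractor = Extractor::new(&runner.egraph, AstSize);
    let (_cost, best) = extractor.find_best(runner.roots[0]);
    best.to_string()
}

fn main() {
    // (a * 2) / 2 simplifies to a, even though the strength-reduction rule
    // also rewrote the multiplication to a shift along the way: the e-graph
    // keeps the original term alongside every rewritten form.
    assert_eq!(simplify("(/ (* a 2) 2)"), "a");
    println!("ok");
}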
3 TENSAT'S REPRESENTATIONS

This section describes how TENSAT represents tensor computation graphs and rewrite rules.
We use a representation based on the one in TASO (Jia et al., 2019a), with modifications to make it suitable for equality saturation. Table 2 shows the set of operators we consider. Each operator o_i corresponds to a node n_i in the graph; the node represents the output tensor of the operator. The nodes corresponding to the inputs of o_i are the children of n_i. Each tensor computation graph is a DAG under this representation.

The formulation in equality saturation becomes simpler if a graph is single-rooted. Therefore, we combine all the final output nodes of a graph with noops to make the graph single-rooted. The noop nodes do not have any actual operators associated with them and are not altered during the exploration phase, so there are no side effects.

A rewrite rule for a tensor computation graph specifies that some local subgraph pattern (source pattern) is equivalent to another subgraph pattern (target pattern). The input tensors to the source and target patterns are variable nodes, which can be substituted with any concrete nodes (or e-classes in equality saturation) in the current graph. Each output tensor in the source pattern corresponds to an output tensor in the target pattern. The two corresponding output nodes are called a pair of matched outputs. A rewrite rule states the equivalence between each pair of matched outputs.

We represent each source (and target) pattern using symbolic expressions (S-exprs) with variables.
Table 2. Operators supported by TENSAT. There are four types for the nodes in our representation: tensor type (T), string type (S), integer type (N), and tensor tuple type (TT). The integer type is used to represent parameters of the operators, such as stride, axis, and also padding and activation modes (by representing different modes using different integers). The more complex, variable-length parameters (e.g., shape, axes permutation) are represented using the string type according to the specified formats.
Operator      Description                          Inputs                                          Type signature
ewadd         Element-wise addition                input_1, input_2                                (T, T) → T
ewmul         Element-wise multiplication          input_1, input_2                                (T, T) → T
matmul        Matrix multiplication                activation, input_1, input_2                    (N, T, T) → T
conv^a        Grouped convolution                  stride_h, stride_w, pad., act., input, weight   (N, N, N, N, T, T) → T
relu          Relu activation                      input                                           T → T
tanh          Tanh activation                      input                                           T → T
sigmoid       Sigmoid activation                   input                                           T → T
poolmax       Max pooling                          input, kernel_{h,w}, stride_{h,w}, pad., act.   (T, N, N, N, N, N, N) → T
poolavg       Average pooling                      input, kernel_{h,w}, stride_{h,w}, pad., act.   (T, N, N, N, N, N, N) → T
transpose^b   Transpose                            input, permutation                              (T, S) → T
enlarge^c     Pad a convolution kernel with zeros  input, ref-input                                (T, T) → T
concat_n^d    Concatenate                          axis, input_1, ..., input_n                     (N, T, ..., T) → T
split^e       Split a tensor into two              axis, input                                     (N, T) → TT
split_0       Get the first output from split      input                                           TT → T
split_1       Get the second output from split     input                                           TT → T
merge^f       Update weight to merge grouped conv  weight, count                                   (T, N) → T
reshape^g     Reshape tensor                       input, shape                                    (T, S) → T
input         Input tensor                         identifier^h                                    S → T
weight        Weight tensor                        identifier^h                                    S → T
noop^i        Combine the outputs of the graph     input_1, input_2                                (T, T) → T

a. Same representation as TASO (Jia et al., 2019a). Normal and depth-wise convolutions are special cases of grouped convolutions.
b. Axis permutation for transpose is specified using a string with format: axis_1 axis_2 ....
c. Pad a convolution kernel (input) with zeros to make it the same size as ref-input.
d. Since each type of node needs to have a fixed number of inputs, we have a separate concat_n for each number of inputs n.
e. Split the tensor in the given axis. The position of the split is at the place of the most recent concat.
f. Merge every count number of groups in the grouped convolution. See TASO (Jia et al., 2019a) for more details.
g. Specify the target shape using a string with format: dim_1 dim_2 ....
h. The identifier for an input or weight tensor contains its name and shape, specified as a string with format: name@dim_1 dim_2 ....
i. For combining the outputs of the graph to make the graph single-rooted. No actual operator is associated with noop.

A pattern with a single output is represented as an S-expr rooted at the output. Rewrite rules with such patterns are called single-pattern rewrite rules. Patterns with multiple outputs are represented as a list of S-exprs rooted at each output. Rewrite rules with multiple matched outputs are called multi-pattern rewrite rules. Figure 2 shows an example rewrite rule and its representation.
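As a concrete sketch of this representation, the following shows how a fragment of the operator set in Table 2 could be declared with egg's define_language! macro. The variant list is abridged and the names are ours; TENSAT's actual declaration may differ.

// Sketch: declaring a fragment of Table 2's operator language with egg's
// define_language! macro. Abridged to a few operators for illustration.
use egg::{define_language, Id, Symbol};

define_language! {
    enum TensorLang {
        // (ewadd t1 t2), (ewmul t1 t2): element-wise ops over two tensors
        "ewadd" = Ewadd([Id; 2]),
        "ewmul" = Ewmul([Id; 2]),
        // (matmul activation t1 t2)
        "matmul" = Matmul([Id; 3]),
        // (conv stride_h stride_w pad act input weight)
        "conv" = Conv([Id; 6]),
        "relu" = Relu(Id),
        // (split axis input) yields a tensor tuple; split_0/split_1 project it
        "split" = Split([Id; 2]),
        "split_0" = Split0(Id),
        "split_1" = Split1(Id),
        // (concat_2 axis t1 t2): a fixed-arity concat per input count
        "concat_2" = Concat2([Id; 3]),
        "noop" = Noop([Id; 2]),
        // integer parameters (stride, axis, ...) and string identifiers;
        // Num must precede Var so "0" parses as an integer
        Num(i64),
        Var(Symbol),
    }
}

fn main() {
    use egg::RecExpr;
    // Parse the first source pattern of Figure 2 as a concrete term.
    let e: RecExpr<TensorLang> = "(matmul 0 input_1 input_2)".parse().unwrap();
    println!("{}", e);
}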
4 TENSAT'S EXPLORATION PHASE
We initialize the e-graph with the original tensor computation graph. In each iteration of the exploration phase, we search for matches of all rewrite rules in the current e-graph and add the target patterns and equivalence relations to the e-graph. This process continues until either the e-graph saturates or a user-specified limit (in terms of time, e-graph size, or number of iterations) is reached. Before applying a rewrite at a found match, we perform shape checking to verify that the tensor shapes in the target pattern are compatible. This is necessary since some rewrite rules require input tensor shapes to satisfy specific preconditions in addition to the syntactic match. We perform shape checking in the same way as TASO (Jia et al., 2019a).
Source: (matmul ?input_1 ?input_2), (matmul ?input_1 ?input_3)
Target: (split_0 (split 1 (matmul ?input_1 (concat_2 1 ?input_2 ?input_3)))),
        (split_1 (split 1 (matmul ?input_1 (concat_2 1 ?input_2 ?input_3))))

Figure 2. Example rewrite rule and its representation in S-expressions. Identifiers starting with "?" denote variable nodes. For clarity, we omit the activation mode inputs to matmul. Arrows point from parent nodes to children nodes. 1 is the axis for the split and concat operators.
Multi-Pattern Rewrite Rules
Multi-pattern rewrite rules are an important type of rules for tensor graph superoptimization (Jia et al., 2019a).
Algorithm 1 Applying multi-pattern rewrite rules
Input: starting e-graph G, set of multi-pattern rewrite rules R_m.
Output: updated e-graph G.
1:  canonicalized S-exprs e_c = Set({})
2:  for rule r ∈ R_m do
3:     for i = 0, ..., |r| − 1 do        ▷ |r|: number of source patterns in r
4:        (e, rename_map) = CANONICAL(r.source[i])
5:        e_c.insert(e)
6:        r.map[i] = rename_map
7:     end for
8:  end for
9:  for iter = 0, ..., MAX_ITER do
10:    M = SEARCH(G, e_c)               ▷ all matches for all patterns
11:    for rule r ∈ R_m do
12:       for i = 0, ..., |r| − 1 do
13:          canonical matches mc_i = M[r.source[i]]
14:          matches m_i = DECANONICAL(mc_i, r.map[i])
15:       end for
16:       for (σ_0, ..., σ_{|r|−1}) ∈ m_0 × ··· × m_{|r|−1} do
17:          if COMPATIBLE(σ_0, ..., σ_{|r|−1}) then
18:             APPLY(G, r, σ_0, ..., σ_{|r|−1})
19:          end if
20:       end for
21:    end for
22: end for
23: return G

However, most equality saturation toolkits only support efficient search methods for finding matches of single-pattern rewrite rules (Willsey et al., 2020; de Moura & Bjørner, 2007). We introduce an algorithm for applying multi-pattern rewrites, shown in Algorithm 1. Our algorithm leverages the existing efficient search routine for single-pattern rewrites as a subroutine.

At the beginning of the exploration phase, we collect the set of unique S-exprs present in the source patterns of the rewrite rules after canonicalization; if one S-expr can be transformed into another by variable renaming only, they map to the same canonicalized S-expr (see the sketch at the end of this section). In each iteration of the exploration phase, we use the single-pattern search subroutine to find matches of the canonical S-exprs. Then, for each multi-pattern rule, we take the Cartesian product of the matches found, de-canonicalize each variable-to-e-class map back to the original variables (using the variable renaming map stored during canonicalization), and check whether the matches are compatible at the variables shared between the S-exprs (i.e., whether the shared variables refer to the same e-class after the mapping). We apply the matches that are compatible.

In our experience, one feature of multi-pattern rules for tensor graphs is that they can grow the e-graph extremely rapidly. Consider again the example rewrite rule in Figure 2. This rule matches any two matmul nodes with a shared input (?input_1). Applying this rule once on some match creates a new matmul node and adds it to the e-graph (the one on the RHS of Figure 2), which also has that shared tensor as its input. If the e-graph initially contains N matmul nodes sharing a common input, then after iteration 1, O(N²) new matmul nodes sharing that input will be created. In iteration 2, each pair among these O(N²) nodes is a match, which creates O(N⁴) new nodes. Such doubly exponential growth can quickly explode the e-graph.

Based on this feature, we set a separate limit k_multi on the number of iterations in which we apply the multi-pattern rules. After k_multi iterations, we only apply the single-pattern rules, until saturation or some user-specified limit.
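The following sketch illustrates the CANONICAL step of Algorithm 1: variables are renamed to ?v0, ?v1, ... in order of first occurrence, so that the two source patterns of Figure 2 collapse to one canonical S-expr and a single e-graph search serves both. The tokenizer and helper names are ours, not TENSAT's.

// CANONICAL from Algorithm 1 as a standalone sketch: rename pattern variables
// in order of first occurrence and remember the renaming so matches can later
// be de-canonicalized. Details may differ in TENSAT's implementation.
use std::collections::HashMap;

fn canonicalize(pattern: &str) -> (String, HashMap<String, String>) {
    let mut rename: HashMap<String, String> = HashMap::new();
    let mut out = String::new();
    // Tokenize on whitespace and parentheses, keeping the parentheses.
    for tok in pattern.replace('(', " ( ").replace(')', " ) ").split_whitespace() {
        let t = if let Some(var) = tok.strip_prefix('?') {
            let next = format!("?v{}", rename.len());
            rename.entry(var.to_string()).or_insert(next).clone()
        } else {
            tok.to_string()
        };
        out.push_str(&t);
        out.push(' ');
    }
    (out.trim().to_string(), rename)
}

fn main() {
    let (c1, _) = canonicalize("(matmul ?input_1 ?input_2)");
    let (c2, map2) = canonicalize("(matmul ?input_1 ?input_3)");
    // Both source patterns of Figure 2 collapse to the same canonical S-expr;
    // map2 records the renaming used to translate matches back to the rule's
    // original variables (DECANONICAL applies its inverse).
    assert_eq!(c1, c2);
    println!("{}\n{:?}", c1, map2);
}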
5 TENSAT'S EXTRACTION PHASE

During extraction, the goal is to pick one e-node from each e-class in the e-graph to obtain an optimized graph. The optimized graph should minimize the total cost with respect to a given cost model. In tensor graph superoptimization, the cost model reflects the inference time taken by the graph.
Cost model
We use the same cost model as TASO (Jia et al., 2019a). Each operator has a separate and independent cost, which is the measured runtime of that operator (with the specific input sizes and parameters) on the target hardware. The total cost of a graph is the sum of the costs of its nodes. This cost model is suitable for GPUs, since GPUs typically run one operator at a time when executing a graph. Note that an operator can be a fused operator consisting of multiple primitive operators, such as a fused convolution and ReLU.
Greedy extraction   We first experiment with a greedy extraction strategy that has been shown to be effective in certain domains (Panchekha et al., 2015; Wang et al., 2020; Willsey et al., 2020). For each e-class, the greedy strategy computes the total cost of the subtree rooted at each of its e-nodes and picks the e-node with the smallest subtree cost. Greedy extraction is not guaranteed to extract the graph with the minimum cost, even under our independent cost model. For example, if two children of an e-node share a subgraph, greedy extraction ignores the sharing and overestimates the cost.
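The toy sketch below shows both the greedy computation and its failure mode: a subgraph shared by two children of an e-node is counted twice in the tree cost. The e-graph layout is ours, and the recursion assumes an acyclic e-graph; a real implementation would instead iterate costs to a fixed point.

// Greedy extraction over a toy e-graph: every e-class keeps the e-node whose
// *tree* cost is smallest. Layout is ours, for illustration only.
use std::collections::HashMap;

struct ENode {
    cost: f64,            // operator cost from the cost model
    children: Vec<usize>, // child e-class ids
}
type EGraph = Vec<Vec<ENode>>; // e-class id -> its candidate e-nodes

fn greedy_cost(eg: &EGraph, class: usize, memo: &mut HashMap<usize, f64>) -> f64 {
    if let Some(&c) = memo.get(&class) {
        return c;
    }
    let mut best = f64::INFINITY;
    for n in &eg[class] {
        // Tree cost: node cost plus the (memoized) cost of every child class.
        let mut total = n.cost;
        for &ch in &n.children {
            total += greedy_cost(eg, ch, memo);
        }
        best = best.min(total);
    }
    memo.insert(class, best);
    best
}

fn main() {
    // The only e-node in class 0 uses the cost-10 subgraph in class 1 twice.
    // Greedy tree cost: 5 + 10 + 10 = 25, while the extracted DAG actually
    // costs 15; this double counting is the overestimation described above.
    let eg: EGraph = vec![
        vec![ENode { cost: 5.0, children: vec![1, 1] }],
        vec![ENode { cost: 10.0, children: vec![] }],
    ];
    let mut memo = HashMap::new();
    assert_eq!(greedy_cost(&eg, 0, &mut memo), 25.0);
}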
ILP extraction
The second approach we experiment with is formulating the extraction problem as an Integer Linear Program (ILP).

Let i = 0, ..., N−1 index the e-nodes in the e-graph and m = 0, ..., M−1 index the e-classes. Let e_m denote the set of e-nodes within e-class m, h_i the set of children e-classes of e-node i, and g(i) the e-class of e-node i, i.e., i ∈ e_{g(i)}. Let m = 0 be the root e-class. Each e-node i is associated with a cost c_i. We then formulate our problem as follows:

Minimize:  f(x) = Σ_i c_i x_i

Subject to:
  x_i ∈ {0, 1},                                         (1)
  Σ_{i ∈ e_0} x_i = 1,                                  (2)
  ∀i, ∀m ∈ h_i:  x_i ≤ Σ_{j ∈ e_m} x_j,                 (3)
  ∀i, ∀m ∈ h_i:  t_{g(i)} − t_m − ε + A(1 − x_i) ≥ 0,   (4)
  ∀m:  0 ≤ t_m ≤ 1.                                     (5)

Here we introduce a binary variable x_i for each e-node i; node i is selected if x_i = 1 and not selected otherwise. Constraint (2) ensures that exactly one node is picked in the root e-class. Constraint (3) ensures that if a node is picked, then at least one node in each of its children e-classes is picked. We rely on the fact that at the optimal solution, each e-class has at most one picked node (otherwise we could remove extra picked nodes in that e-class to reduce the objective while still satisfying all the constraints). Constraints (1)–(3) and the objective encode the main extraction logic.

A more subtle requirement on the extraction phase is that the extracted graph cannot contain cycles. While the e-graph can (and likely will) contain cycles, the extracted graph is meant to map directly to an executable tensor DAG. The extraction procedure must therefore take care to respect the acyclic invariant of DAGs.

Figure 3 shows an example of how valid rewrites can produce cycles in the e-graph. To ensure the extracted graph does not contain cycles, we introduce a real variable t_m for each e-class m in the ILP. Constraint (4) ensures that the order defined by the t_m's is a valid topological order for the extracted graph. Here ε < 1/M is a small constant for effectively encoding strict inequalities in the ILP, and A is a large enough constant such that A ≥ 1 + ε. Constraint (5) limits the range of the topological order variables t_m.

We also experiment with using integer variables for the t_m's. In this case, the t_m's are constrained to take integer values between 0 and M−1, and constraint (4) changes accordingly to: ∀i, ∀m ∈ h_i: t_{g(i)} − t_m + A(1 − x_i) ≥ 1, where A ≥ M.

Unlike greedy extraction, the optimal solution to the ILP is guaranteed to give a valid graph (no cycles) with the lowest cost.
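As a sketch of how this program can be handed to an external solver, the following emits the objective and constraints (1)–(3) (the variant without the cycle constraints (4)–(5)) in LP-file syntax. The data layout and variable naming are ours; TENSAT itself drives SCIP through OR-tools rather than through LP files.

// Sketch: emitting the extraction ILP (objective and constraints (1)-(3),
// without the cycle constraints) in LP-file syntax. Toy layout, ours only.
struct ENode {
    cost: f64,
    class: usize,         // g(i): the e-class this e-node belongs to
    children: Vec<usize>, // h_i: child e-class ids
}

fn emit_lp(nodes: &[ENode]) -> String {
    let mut lp = String::from("Minimize\n obj:");
    for (i, n) in nodes.iter().enumerate() {
        lp += &format!(" + {} x{}", n.cost, i);
    }
    lp += "\nSubject To\n";
    // Constraint (2): pick exactly one e-node in the root e-class (class 0).
    lp += " root:";
    for (i, n) in nodes.iter().enumerate() {
        if n.class == 0 {
            lp += &format!(" + x{}", i);
        }
    }
    lp += " = 1\n";
    // Constraint (3): a picked e-node needs a picked e-node per child e-class.
    for (i, n) in nodes.iter().enumerate() {
        for &m in &n.children {
            lp += &format!(" c{}_{}:", i, m);
            for (j, nj) in nodes.iter().enumerate() {
                if nj.class == m {
                    lp += &format!(" + x{}", j);
                }
            }
            lp += &format!(" - x{} >= 0\n", i);
        }
    }
    // Constraint (1): all x_i binary.
    lp += "Binary\n";
    for i in 0..nodes.len() {
        lp += &format!(" x{}", i);
    }
    lp += "\nEnd\n";
    lp
}

fn main() {
    // Two candidate e-nodes in the root e-class; both depend on e-class 1.
    let nodes = vec![
        ENode { cost: 5.0, class: 0, children: vec![1] },
        ENode { cost: 3.0, class: 0, children: vec![1] },
        ENode { cost: 10.0, class: 1, children: vec![] },
    ];
    print!("{}", emit_lp(&nodes));
}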
Figure 3.
Example of how a valid rewrite can introduce cycles into the e-graph. The RHS is the resulting e-graph after applying the rewrite rule from Figure 2 to the LHS. Dotted lines circle the e-classes. We omit the e-classes with a single node for clarity. If the node split is picked in the right e-class, then the resulting graph will have a cycle (indicated by the red edges).

Similar to previous work that uses ILP extraction (Tate et al., 2009; Wang et al., 2020), we find that as the e-graph grows bigger, the ILP solver takes a long time and becomes the main bottleneck. This is mainly due to the cycle constraint (4): the ILP solver struggles to find a feasible solution under these constraints. Therefore, we explore an alternative approach: filtering cycles during the exploration phase, so that the e-graph contains no cycles at the end of the exploration phase. This way, we can drop the cycle constraints from the ILP.
Vanilla cycle filtering   The first method is to check whether applying a substitution would introduce cycles into the e-graph and, if so, to discard that substitution. This check runs every time before applying a substitution, and each check requires a pass over the entire e-graph. For one iteration of the exploration phase, if N denotes the current size of the e-graph and n_m the total number of matches of the rewrite rules on the e-graph, then vanilla cycle filtering has complexity O(n_m N).
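A sketch of the check behind this method: a three-color DFS over the e-graph that reports whether any e-class can reach itself through some e-node's children. The data layout is ours, for illustration; the check would run after tentatively applying each substitution.

// Vanilla cycle check as a standalone sketch: one O(N) DFS pass over the
// whole e-graph (e-classes -> e-nodes -> child e-classes).
struct ENode {
    children: Vec<usize>, // child e-class ids
}
type EGraph = Vec<Vec<ENode>>; // e-class id -> its candidate e-nodes

fn dfs(eg: &EGraph, c: usize, color: &mut [u8]) -> bool {
    match color[c] {
        1 => return true,  // on the current stack: back edge, cycle found
        2 => return false, // fully explored
        _ => {}
    }
    color[c] = 1;
    for node in &eg[c] {
        for &ch in &node.children {
            if dfs(eg, ch, color) {
                return true;
            }
        }
    }
    color[c] = 2;
    false
}

fn has_cycle(eg: &EGraph) -> bool {
    let mut color = vec![0u8; eg.len()]; // 0 = unvisited
    (0..eg.len()).any(|c| color[c] == 0 && dfs(eg, c, &mut color))
}

fn main() {
    // Class 0 reaches class 1, and one e-node of class 1 points back to 0.
    let eg: EGraph = vec![
        vec![ENode { children: vec![1] }],
        vec![ENode { children: vec![0] }, ENode { children: vec![] }],
    ];
    assert!(has_cycle(&eg));
}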
Efficient cycle filtering   As the number of matches n_m is typically large and scales with N, vanilla cycle filtering can be slow. We therefore design a novel and more efficient cycle filtering algorithm, consisting of a pre-filtering step and a post-processing step. Algorithm 2 shows the pseudocode for the exploration phase with efficient cycle filtering.

At the start of each iteration, we make one pass over the e-graph to record the set of descendant e-classes of each e-node (stored in a descendants map). During the iteration, for each match of the rewrite rules, we use the pre-stored descendants map to check whether applying the rewrite would introduce cycles into the e-graph; if so, we skip that match. Lines 3–9 implement the pre-filtering step. Notice that this check is sound but not complete: a match that passes this check can still introduce cycles into the e-graph, because new descendant relations introduced by previous rewrites in the same iteration are not included in the pre-stored descendants map.
Algorithm 2 Exploration phase with efficient cycle filtering
Input: starting e-graph G, set of rewrite rules R.
Output: updated e-graph G, filter list l.
1:  l = {}
2:  for iter = 0, ..., MAX_ITER do
3:     descendants map d = GETDESCENDANTS(G, l)
4:     matches = SEARCH(G, R, l)
5:     for match ∈ matches do
6:        if not WILLCREATECYCLE(match, d) then
7:           APPLY(G, match)
8:        end if
9:     end for
10:    while true do
11:       cycles = DFSGETCYCLES(G, l)
12:       if len(cycles) == 0 then
13:          break
14:       end if
15:       for cycle ∈ cycles do
16:          RESOLVECYCLE(G, l, cycle)
17:       end for
18:    end while
19: end for
20: return G, l

To resolve the cycles missed by the pre-filtering step, we add a post-processing step at the end of each iteration (lines 10–18). We make a pass over the e-graph in DFS order and collect a set of cycles in the e-graph. For each cycle, we choose the node that was added to the e-graph last and add it to a filter list. The nodes in the filter list are considered removed from the e-graph. We make sure those nodes are not picked during extraction by explicitly adding the constraints ∀i ∈ l: x_i = 0 to the ILP.

By constructing a descendants map once before each iteration, each check in the pre-filtering step takes constant time. The worst-case complexity of the post-processing step is O(n_c N), where n_c is the number of cycles in the e-graph. Since n_c is typically much smaller than n_m, this algorithm is much faster than vanilla cycle filtering. In practice, each DFS pass over the e-graph can find many cycles, which makes O(n_c N) a very conservative upper bound.
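The sketch below shows the two pieces of the pre-filtering step, GETDESCENDANTS and WILLCREATECYCLE from Algorithm 2, under our own toy e-graph layout (keyed by e-class rather than e-node). As noted above, the check is sound but not complete once earlier rewrites in the same iteration have added new edges.

// Descendants map and constant-time pre-filter check, as a standalone sketch.
use std::collections::HashSet;

struct ENode {
    children: Vec<usize>, // child e-class ids
}
type EGraph = Vec<Vec<ENode>>; // e-class id -> its e-nodes

// GETDESCENDANTS: record, for every e-class, the e-classes reachable below
// it. Assumes the e-graph is currently acyclic, which Algorithm 2 maintains
// at the end of every iteration.
fn descendants_of(eg: &EGraph, c: usize, desc: &mut Vec<HashSet<usize>>, done: &mut Vec<bool>) {
    if done[c] {
        return;
    }
    done[c] = true;
    let mut acc = HashSet::new();
    for node in &eg[c] {
        for &ch in &node.children {
            descendants_of(eg, ch, desc, done);
            acc.insert(ch);
            acc.extend(desc[ch].iter().copied());
        }
    }
    desc[c] = acc;
}

fn get_descendants(eg: &EGraph) -> Vec<HashSet<usize>> {
    let mut desc = vec![HashSet::new(); eg.len()];
    let mut done = vec![false; eg.len()];
    for c in 0..eg.len() {
        descendants_of(eg, c, &mut desc, &mut done);
    }
    desc
}

// WILLCREATECYCLE: adding an e-node into e-class `target` with child
// e-classes `children` closes a cycle iff `target` is one of the children
// or a descendant of one of them. Constant time per child, using `desc`.
fn will_create_cycle(desc: &[HashSet<usize>], target: usize, children: &[usize]) -> bool {
    children.iter().any(|&ch| ch == target || desc[ch].contains(&target))
}

fn main() {
    let eg: EGraph = vec![
        vec![ENode { children: vec![1] }], // class 0 -> class 1
        vec![ENode { children: vec![] }],
    ];
    let desc = get_descendants(&eg);
    // A new e-node placed in class 1 with class 0 as a child closes a cycle.
    assert!(will_create_cycle(&desc, 1, &[0]));
}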
We implement T
ENSAT in Rust using egg (Willsey et al.,2020), an open source equality saturation library. For theextraction phase, we use SCIP (Gamrath et al., 2020) asthe ILP solver, wrapped by Google OR-tools (Perron &Furnon).We utilize egg’s e-class analysis feature for the shape check-ing discussed in Section 4. An e-class analysis associatesdata with each e-class to support rewrites that are not purelysyntactic. We store all the relevant information of the ten-sors (shape, layout, split locations) in the analysis data and use these information for shape checking.
We compare TENSAT with TASO (Jia et al., 2019a) to evaluate our equality saturation based search. We use the same set of rewrite rules as TASO for our experiments. We evaluate on the inference graphs of 7 models: BERT (Devlin et al., 2019), ResNeXt-50 (Xie et al., 2017), NasNet-A (Zoph et al., 2018), NasRNN (Zoph & Le, 2017), Inception-v3 (Szegedy et al., 2016), VGG-19 (Liu & Deng, 2015), and SqueezeNet (Iandola et al., 2017). This benchmark set covers a wide range of commonly used state-of-the-art models, including models for both computer vision and NLP tasks, and both human-designed models and models automatically discovered by neural architecture search. We perform all experiments on a Google Cloud instance with one NVIDIA Tesla T4 GPU, a 16-core CPU, and 60 GB of memory. We also experiment with ResNet-50 (He et al., 2016), but find that on the T4 GPU, the rewrite rules from TASO cannot provide any speedup to the graph.
ENSAT , our full approach uses the efficient cycle fil-tering algorithm (Section 5.2) during the exploration phaseand the ILP method without the cycle constraints (Section5.1) for extraction. We set a limit on the number of nodes inthe e-graph N max = 50000 and the number of iterations forexploration k max = 15 . We terminate the exploration phasewhen any of the limit is reached, or the e-graph is saturated.We set a separate limit k multi on the number of iterations toapply the multi-pattern rules. We use a default of k multi = 1 for the main results in Section 6.2 and 6.3, and study theeffect of varying k multi in Section 6.4. We set a timeout of 1hour for the ILP solver.For TASO’s backtracking search, we use their default set-tings from their artifact evaluation code on the number ofiterations n = 100 and the hyperparameter α = 1 . foreach benchmark. We also test α = 1 . as mentioned intheir paper, and find that the difference is tiny (differencein speedup percentage is less than 0.1% on average overthe benchmarks). Increasing to n = 1000 leads to lessthan 1% speedup gain with the cost of over 11x longer inoptimization time on average. We compare the speedup percentage of the optimized graphwith respect to the original graph between T
ENSAT andTASO. We use TASO’s cuDNN backend to measure theruntime of the full computation graphs. Figure 4 shows theresults. We can see that T
ENSAT discovers better optimizedgraphs compared with TASO’s backtracking search in most The number of iterations of the outer loop, see Algorithm 2 in(Jia et al., 2019a) for more details quality Saturation for Tensor Graph Superoptimization
Figure 4. Speedup percentage of the optimized graph with respect to the original graph, TASO vs. TENSAT. Each setting (optimizer × benchmark) is run five times, and we plot the mean and standard error of the measurements.
Figure 5. Comparison of the optimization time (log scale) between TASO and TENSAT. "TASO total" is the total time of TASO's search. "TASO best" indicates when TASO found its best result; achieving this time would require an oracle telling it when to stop.
TENSAT's optimized graphs are on average 6.6% faster than TASO's. We see the biggest speedup of 23% over TASO on NasRNN. Note that for Inception-v3, TENSAT with k_multi = 1 gives a smaller speedup than TASO, but increasing k_multi to 2 achieves a better speedup than TASO while still being 13.4× faster than TASO's search (see Figure 5).

This improvement comes from the fact that equality saturation covers a much larger space of equivalent graphs than sequential backtracking search. By using the e-graph as a compact representation of an exponential number of equivalent graphs, TENSAT is able to cover orders of magnitude more equivalent graphs than TASO.

We inspect the optimized graphs from TENSAT and record some rewrite patterns that are used in them. We present several examples of useful patterns in the Appendix.
Another important metric is the time taken by the optimizer itself. For TENSAT, this is the sum of the time taken by the exploration phase and the extraction phase. For TASO, we record two times for a single backtracking search. The first is the total time of the backtracking search with the default number of iterations (T_total). The second is the time taken to first reach the best graph found during its search (T_best). T_best is the best possible time for TASO's sequential backtracking search; in practice, it is difficult (if not impossible) to achieve T_best, since the sequential search algorithm has no way of knowing that it can stop at that point.

Figure 5 shows the time taken by the optimizers across benchmarks. We can see that TENSAT runs 9.5x to 379x faster than TASO's T_total, and 1.8x to 260x faster than T_best. This shows that TENSAT can not only cover a much larger search space, but also achieve this in drastically less time. Furthermore, TENSAT's optimization time is small enough that we believe our approach can be integrated into a default compilation flow, instead of running the search as an additional offline autotuning process.
As we discuss in Section 4, multi-pattern rewrite rules can grow the e-graph extremely rapidly. Here we study the effect of varying the number of iterations for multi-pattern rewrites, k_multi. Figure 6 shows the results. We can see the explosion of the number of nodes in the e-graph as k_multi increases (due to the doubly exponential growth). For NasRNN, Inception-v3, BERT, NasNet-A, and ResNeXt-50, increasing k_multi lets TENSAT discover better graphs with larger speedups. But for SqueezeNet, the speedup decreases with k_multi. This is due to the discrepancy between the cost model and the real graph runtime: as k_multi increases for SqueezeNet, the cost model suggests that certain new rewrites reduce the cost, while they in fact increase the full graph runtime. Despite this special case, TENSAT on SqueezeNet with k_multi = 3 still achieves a better speedup than TASO. By increasing k_multi, TENSAT can explore a larger search space and find better optimized graphs for most benchmarks, at the cost of a longer optimization time.
Figure 6. Effect of varying the number of iterations of multi-pattern rewrites k_multi. For BERT, NasNet-A, NasRNN, and Inception-v3, the ILP solver times out at one hour for k_multi = 3. Left: speedup of the optimized graphs (the y-axis is split for clarity). Middle: time taken by TENSAT. Right: final e-graph size (number of e-nodes). The middle and right figures are in log scale.
Table 3. Comparison between greedy extraction and ILP extraction on BERT, NasRNN, and NasNet-A. The table shows the runtime of the original graphs and of the graphs optimized via greedy extraction and ILP extraction. The exploration phase is run with k_multi = 1.

Graph       Runtime (ms)
            Original   Greedy   ILP
BERT        1.88       1.88
NasRNN      1.85       1.15
NasNet-A    17.8       22.5

In this section, we study the effect of the important design choices in our approach.
Greedy vs. ILP extraction
The first important design choice is the extraction method. Table 3 shows the comparison between greedy extraction and ILP extraction. Although greedy extraction works fine on some benchmarks (e.g., NasRNN), it fails to extract an optimized graph on others (e.g., BERT and NasNet-A). This is due to the nature of greedy extraction: it makes the choices of which node to pick separately and greedily, without considering the interdependencies between the choices. Consider the rewrite in Figure 2 (merging two matmuls via concat and split) as an illustrative example. After applying this rewrite to the e-graph, there are two e-classes with multiple e-nodes: one e-class per output. This rewrite reduces the cost only if both e-classes choose their split node, since the RHS subgraph can then be shared by the two outputs. However, greedy extraction will never pick the split nodes, since it does not know that the RHS subgraph is shared between them.
ILP with or without cycle constraints
Here we study the effect of whether or not to include the cycle constraints in the ILP. Table 4 presents the effect on extraction time as k_multi (and thus the e-graph size) varies.

Table 4. Effect of including or omitting the cycle constraints in the ILP on extraction time (in seconds), on BERT, NasRNN, and NasNet-A. For the cycle constraints, we compare using real variables and using integer variables for the topological order variables t_m.

            k_multi   With cycle constraints      Without cycle
                      real          int           constraints
BERT        1         0.96          0.98
            2         > 3600        > 3600
NasRNN      1         1116          1137
            2         > 3600        > 3600
NasNet-A    1         424           438
            2         > 3600        > 3600

With the cycle constraints, the ILP solver time quickly increases with the e-graph size and reaches the timeout at k_multi = 2; in our experiments, the ILP solver has not yet found a feasible solution at timeout. Removing the cycle constraints leads to approximately 10x–1000x speedup in ILP solving time on larger e-graphs. These results show that the main difficulty for the ILP solver is satisfying the cycle constraints; removing them makes it possible for our approach to scale to larger e-graphs.
Efficient cycle filtering   To remove the cycle constraints from the ILP, we need to perform cycle filtering during the exploration phase. Here we compare the two cycle filtering techniques introduced in Section 5.2. Table 5 shows their effect on the exploration phase time as k_multi varies. The efficient cycle filtering algorithm achieves up to 2000x speedup compared with the vanilla algorithm, making it possible to explore a larger e-graph.

7 RELATED WORK
Graph Rewrite Optimizations
Our work improves on existing work (Jia et al., 2019a;b; Fang et al., 2020) in the search mechanism used to find optimal tensor graph substitutions.
Comparison between vanilla cycle filtering and efficientcycle filtering, on the exploration phase time (in seconds) for BERT,NasRNN, and NasNet-A. et al., 2020). TASO (Jia et al., 2019b;a) uses a backtrack-ing search algorithm with a hard threshold for allowingsubstitutions that increase runtime. Compared to TASO, asubsequent work (Fang et al., 2020) presents a more efficientsampling-based search algorithm that prunes redundant sub-stitutions. While the sampling-based approach is faster thanTASO, it does not lead to discovering more optimized pro-grams, unlike our approach.An optimization via graph substitutions is also critical toother domains outside deep learning. NeuRewriter (Chen& Tian, 2019) exploits reinforcement learning to iterativelyselect which rewrite rule to apply on which region of thegraph for multiple problem domains, including algebraic ex-pression simplification, job scheduling, and vehicle routing.This technique is complement to our approach. In particular,we can enhance our approach by applying machine learn-ing to select more promising multi-pattern rules to apply ineach iteration when we cannot apply multi-pattern rules tosaturation.Unlike our approach, these prior techniques suffer from thereliance on iteratively applying substitutions in sequences.
Superoptimization
Superoptimization is a program optimization technique that searches for a correct and optimal program with respect to a cost model. Most superoptimizers optimize relatively short sequences of low-level instructions (Massalin, 1987; Joshi et al., 2002; Bansal & Aiken, 2006; Schkufza et al., 2013; 2014; Phothilimthana et al., 2016a;b; Sasnauskas et al., 2017). While most of them do not rely on rewrite rules, Denali (Joshi et al., 2002) takes the approach of using rewrite rules for scalability while sacrificing some optimality guarantees. Similar to ours, Denali employs e-graphs and extracts optimal programs using a constraint solver. Unlike our work, none of this prior research focuses on tensor graph superoptimization.
Equality Saturation Applications
The core of TENSAT's approach is the application of equality saturation, which has been successfully applied in other domains as well. The first works (Tate et al., 2009; Stepp et al., 2011) focused on traditional compiler optimizations, but more recent applications include CAD simplification, numerical accuracy, and code search (Nandi et al., 2020; Panchekha et al., 2015; Premtoon et al., 2020).

Wang et al. (2020) also use equality saturation to optimize machine learning programs. However, they optimize linear algebra kernels consisting of a few simple operations such as matrix multiplication and summation, whereas TENSAT optimizes at the computation graph level. We contribute the multi-pattern extension to better explore the search space of deep learning models, as well as the cycle-filtering algorithm to make ILP extraction efficient.
8 CONCLUSIONS
We have presented a new approach to tensor graph optimization using equality saturation. We explained the extensions to equality saturation necessary for our problem domain: supporting multi-pattern rewrite rules, and introducing a new extraction algorithm with efficient cycle filtering for scalability. We show that our approach can find optimized graphs with up to 23% speedup over the state of the art, while spending on average 48x less time optimizing. Our approach finds graphs that are globally optimal within the explored search space, and is fast enough to be integrated into the normal compilation flow for inference graphs.

ACKNOWLEDGMENTS
We thank Zhihao Jia, Martin Maas, Hyeontaek Lim, and the anonymous reviewers for their insightful and helpful comments.

REFERENCES
Bansal, S. and Aiken, A. Automatic generation of peephole superoptimizers. In ASPLOS, 2006.

Chen, X. and Tian, Y. Learning to perform local rewriting for combinatorial optimization. In Wallach, H. M., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E. B., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32, NeurIPS '19, pp. 6278–6289, 2019. URL http://papers.nips.cc/paper/8858-learning-to-perform-local-rewriting-for-combinatorial-optimization.

de Moura, L. and Bjørner, N. Efficient e-matching for SMT solvers. In Pfenning, F. (ed.), Automated Deduction – CADE-21, pp. 183–198, Berlin, Heidelberg, 2007. Springer Berlin Heidelberg. ISBN 978-3-540-73595-3.

Detlefs, D., Nelson, G., and Saxe, J. B. Simplify: A theorem prover for program checking. J. ACM, 52(3):365–473, May 2005. ISSN 0004-5411. doi: 10.1145/1066100.1066102. URL http://doi.acm.org/10.1145/1066100.1066102.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.

Fang, J., Shen, Y., Wang, Y., and Chen, L. Optimizing DNN computation graph using graph substitutions. Proc. VLDB Endow., 13(12):2734–2746, July 2020. ISSN 2150-8097. doi: 10.14778/3407790.3407857. URL https://doi.org/10.14778/3407790.3407857.

Gamrath, G., Anderson, D., Bestuzheva, K., Chen, W.-K., Eifler, L., Gasse, M., Gemander, P., Gleixner, A., Gottwald, L., Halbig, K., Hendel, G., Hojny, C., Koch, T., Le Bodic, P., Maher, S. J., Matter, F., Miltenberger, M., Mühmer, E., Müller, B., Pfetsch, M. E., Schlösser, F., Serrano, F., Shinano, Y., Tawfik, C., Vigerske, S., Wegscheider, F., Weninger, D., and Witzig, J. The SCIP Optimization Suite 7.0. Technical report, Optimization Online, March 2020.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, pp. 770–778, 2016.

Iandola, F. N., Moskewicz, M. W., Ashraf, K., Han, S., Dally, W., and Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. ArXiv, abs/1602.07360, 2017.

Jia, Z., Padon, O., Thomas, J., Warszawski, T., Zaharia, M., and Aiken, A. TASO: Optimizing deep learning computation with automatic generation of graph substitutions. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, SOSP '19, pp. 47–62, New York, NY, USA, 2019a. Association for Computing Machinery. ISBN 9781450368735. doi: 10.1145/3341301.3359630. URL https://doi.org/10.1145/3341301.3359630.

Jia, Z., Thomas, J., Warszawski, T., Gao, M., Zaharia, M., and Aiken, A. Optimizing DNN computation with relaxed graph substitutions. In Proceedings of the 2nd SysML Conference, SysML '19, 2019b.

Joshi, R., Nelson, G., and Randall, K. Denali: a goal-directed superoptimizer. In PLDI, 2002.

Liu, S. and Deng, W. Very deep convolutional neural network based image classification using small training sample size. In ACPR, pp. 730–734, 2015.

Massalin, H. Superoptimizer: a look at the smallest program. In ASPLOS, 1987.

Nandi, C., Willsey, M., Anderson, A., Wilcox, J. R., Darulova, E., Grossman, D., and Tatlock, Z. Synthesizing structured CAD models with equality saturation and inverse transformations. In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2020, pp. 31–44, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450376136. doi: 10.1145/3385412.3386012. URL https://doi.org/10.1145/3385412.3386012.

Nelson, C. G. Techniques for Program Verification. PhD thesis, Stanford, CA, USA, 1980. AAI8011683.

Panchekha, P., Sanchez-Stern, A., Wilcox, J. R., and Tatlock, Z. Automatically improving accuracy for floating point expressions. SIGPLAN Not., 50(6):1–11, June 2015. ISSN 0362-1340. doi: 10.1145/2813885.2737959. URL https://doi.org/10.1145/2813885.2737959.

Perron, L. and Furnon, V. OR-Tools. URL https://developers.google.com/optimization/.

Phothilimthana, P. M., Thakur, A., Bodik, R., and Dhurjati, D. Scaling up superoptimization. In Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '16, pp. 297–310, New York, NY, USA, 2016a. Association for Computing Machinery. ISBN 9781450340915. doi: 10.1145/2872362.2872387. URL https://doi.org/10.1145/2872362.2872387.

Phothilimthana, P. M., Thakur, A., Bodik, R., and Dhurjati, D. GreenThumb: Superoptimizer construction framework. In Proceedings of the 25th International Conference on Compiler Construction, CC 2016, pp. 261–262, New York, NY, USA, 2016b. Association for Computing Machinery. ISBN 9781450342414. doi: 10.1145/2892208.2892233. URL https://doi.org/10.1145/2892208.2892233.

Premtoon, V., Koppel, J., and Solar-Lezama, A. Semantic code search via equational reasoning. In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2020, pp. 1066–1082, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450376136. doi: 10.1145/3385412.3386001. URL https://doi.org/10.1145/3385412.3386001.

Rust. Rust programming language. URL https://www.rust-lang.org/.

Sasnauskas, R., Chen, Y., Collingbourne, P., Ketema, J., Taneja, J., and Regehr, J. Souper: A synthesizing superoptimizer. CoRR, abs/1711.04422, 2017. URL http://arxiv.org/abs/1711.04422.

Schkufza, E., Sharma, R., and Aiken, A. Stochastic superoptimization. In ASPLOS, 2013.

Schkufza, E., Sharma, R., and Aiken, A. Stochastic optimization of floating-point programs with tunable precision. In PLDI, 2014.

Stepp, M., Tate, R., and Lerner, S. Equality-based translation validator for LLVM. In Gopalakrishnan, G. and Qadeer, S. (eds.), Computer Aided Verification, pp. 737–742, Berlin, Heidelberg, 2011. Springer Berlin Heidelberg. ISBN 978-3-642-22110-1.

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. In CVPR, pp. 2818–2826, 2016.

Tate, R., Stepp, M., Tatlock, Z., and Lerner, S. Equality saturation: A new approach to optimization. In Proceedings of the 36th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL '09, pp. 264–276, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-379-2. doi: 10.1145/1480881.1480915. URL http://doi.acm.org/10.1145/1480881.1480915.

Wang, Y. R., Hutchison, S., Leang, J., Howe, B., and Suciu, D. SPORES: Sum-product optimization via relational equality saturation for large scale linear algebra. Proceedings of the VLDB Endowment, 2020.

Willsey, M., Wang, Y. R., Flatt, O., Nandi, C., Panchekha, P., and Tatlock, Z. egg: Fast and extensible e-graphs, 2020.

Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. Aggregated residual transformations for deep neural networks. In CVPR, pp. 5987–5995, 2017.

Zoph, B. and Le, Q. V. Neural architecture search with reinforcement learning. 2017. URL https://arxiv.org/abs/1611.01578.

Zoph, B., Vasudevan, V., Shlens, J., and Le, Q. V. Learning transferable architectures for scalable image recognition. In CVPR, pp. 8697–8710, 2018.
A EXAMPLE PATTERNS OF REWRITES USED BY TENSAT
Figure 7.
Example pattern used in BERT. w1 and w2 are weight nodes. The optimized graphs for BERT also contain this pattern generalized to more than two matmuls sharing a common input node.

Figure 8.
Example pattern used in NasNet-A and Inception-v3. The optimized graphs also contain this pattern generalized to more than two convs sharing a common input node.
Figure 9.
An example rewrite pattern useful for NasNet-A. Left: pattern in the original graph. Right: pattern in the optimized graph by TENSAT. We only show the core operator nodes for clarity. Here, each w_i is a convolution weight kernel. The dimensions are ordered by (out_channels, in_channels, height, width). concat(w1, w3) is over axis=0 (output channels), and concat(w2, w4) is over axis=1 (input channels). Since the two concat operators only involve weight nodes as inputs, they can be pre-computed before inference. Therefore, this rewrite pattern effectively converts four convolutions into two.

Figure 10.
Example pattern used in NasRNN. The optimized graphs also contain this pattern generalized to more than two matmuls.