A Learned Performance Model for Tensor Processing Units
Samuel J. Kaufman, Phitchaya Mangpo Phothilimthana, Yanqi Zhou, Charith Mendis, Sudip Roy, Amit Sabne, Mike Burrows
Samuel J. Kaufman, Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA (work done at Google Research)
Phitchaya Mangpo Phothilimthana, Google Research, Mountain View, CA
Yanqi Zhou, Google Research, Mountain View, CA
Mike Burrows, Google Research, Mountain View, CA
Abstract
Accurate hardware performance models are critical to efficient code generation. They can be used by compilers to make heuristic decisions, by superoptimizers as a minimization objective, or by autotuners to find an optimal configuration of a specific program. However, they are difficult to develop because contemporary processors are complex, and the recent proliferation of deep learning accelerators has increased the development burden. We demonstrate a method of learning performance models from a corpus of tensor computation graph programs for the Tensor Processing Unit (TPU). We train a neural network over kernel-level sub-graphs from the corpus and find that the learned model is competitive with a heavily-optimized analytical cost model used in the production XLA compiler.
Performance models are used in compiler optimizations [18, 11], reinforcement learning [20], and Neural Architecture Search (NAS) [8, 9, 13]. A performance model is particularly useful for compiler optimizations because collecting performance numbers (e.g., execution time) from a real machine can be expensive or infeasible, such as during ahead-of-time compilation when we have no access to the target hardware. For example, LLVM's loop vectorizer uses a performance model to compute the optimal vectorization and unroll factors [18]. GCC's auto-vectorizer uses a performance model to decide when to apply loop-peeling, loop-versioning, outer-loop vectorization, and intra-iteration vectorization [11]. In addition, a performance model can be used by a compiler autotuner as a faster alternative for evaluating candidate configurations in a search space [15, 16, 21]. (An autotuner automatically searches a space of configurations of a program and selects the best-performing configuration according to a performance metric, such as execution time, throughput, or power consumption.)

It is challenging and time-consuming to develop an accurate analytical performance model to predict program performance metrics, such as execution time, on a modern processor. Program performance is tightly coupled with the underlying complex processor architecture as well as the performance-affecting decisions that are made during compilation [3]. However, a performance model often lacks an in-depth view of the processor architecture and the low-level generated code. The recent proliferation of deep learning accelerators has only exacerbated this problem, as it demands rapid, repeated development of performance models targeting new accelerators.
[Figure 1: the compiler autotuner takes an input program and proposes configs for various optimizations (data/model parallelism, layout assignment, operation fusion, rematerialization, memory assignment, tiling, etc.) using search strategies such as random search, genetic algorithms, simulated annealing, MCMC, MCTS, and RL. Each config is scored by an evaluator, either the ML compiler plus real hardware or a partial ML compiler plus a learned cost model, and the best config is returned.]
Figure 1: The compiler autotuner typically relies on real hardware to evaluate the performance of generated code. We propose a learned performance model as a cheaper alternative to obtain reward signals without using the hardware.

In this paper, we propose to automatically learn a performance model to predict the execution time of tensor programs on a Tensor Processing Unit (TPU), an accelerator for machine learning workloads [17]. Similar to prior work [2, 5, 19], we formulate the runtime estimation problem as a regression task. Our approach extracts features directly from an unmodified program representation, minimizing the manual effort needed to develop the model; in contrast, Halide's learned performance model requires additional performance counters generated by a static analyzer [2]. Note that tensor programs contain complex, multi-level nested loops, whose runtimes are harder to predict than those of loop-free instruction sequences, such as those tackled by Ithemal [19]. To represent a tensor program naturally and generalize the performance model to unseen programs, we use a graph-based neural network. We show that our model generalizes relatively well to unseen tensor kernels and is retargetable to different optimization tasks.

We evaluate our performance model on predicting runtimes for two different tensor compiler optimization tasks, operator fusion and tile-size selection, on a TPU. The trained performance model is applied to evaluate program configurations generated by an autotuner for the Accelerated Linear Algebra (XLA) compiler [23], in place of the real hardware, as depicted in Fig. 1.

In summary, this paper presents the following contributions:

• We develop the first learned performance model for tensor programs, which contain multi-level nested loops, that (1) generalizes to unseen programs, (2) is retargetable to different compiler optimization tasks, and (3) does not rely on any additional static analysis.
• We show better generalization results using a graph neural network. The test accuracy on the fusion optimization task is 13% and 10% better than that of a sequential model and of the hand-tuned analytical model built into the compiler, respectively.
• We integrate our learned performance model into an XLA fusion autotuner, and demonstrate that when access to real hardware accelerators is limited, the performance model helps the autotuner discover fusion configs that are up to 19% faster than the default configs.
Ithemal uses a hierarchical recurrent neural network to estimate the throughput of x86-64 basic blocks running on highly complex processors [19]. Basic blocks are relatively short, loop-free sequences of instructions (6.06 instructions on average). In contrast, our work addresses larger machine learning kernels with implicit nested loops (which are represented naturally as graphs), containing up to a thousand operators. Ithemal was evaluated on its ability to generalize to held-out basic blocks. However, our method is tested for its ability to generalize to wholly novel tensor programs, and it targets a drastically different processor.

Both the code-feature-based performance model [7] and Halide's performance model [2] use simple neural nets to predict runtime from manually-engineered features produced by a static analyzer that examines an optimized program. Since extracting these features from an XLA graph is non-trivial, we train a more complex neural net, using features that can be extracted directly from the XLA graph, with sufficient capacity to recover similarly powerful representations.

AutoTVM also uses a learned performance model to optimize tensor kernels, by ranking candidates [5]. However, AutoTVM's model shows limited ability to generalize between kernels and is trained for per-kernel search over a kernel-specific set of parameters. In contrast, our model can be used both to estimate the runtime of an entire tensor program and to rank configuration candidates per kernel, and it generalizes to novel kernels and applications.

Additionally, approaches to NAS often employ a closely related idea, learning models to predict the error or error curve of a deep learning model architecture [6, 14, 24]. Others, such as ReNAS [25], learn to rank sets of candidate neural architectures rather than predicting runtimes in isolation.
Our goal is to predict the runtime of XLA programs on a TPU. XLA, a machine learning compiler for multiple hardware targets, is used as a backend for various machine learning programming frameworks, including TensorFlow [1], PyTorch [22], and JAX [10]. An XLA program consists of basic blocks, called computations; loop bodies and conditions in a computation are represented as pointers to other computations. Each computation is represented by a directed acyclic graph called a computation graph. A node in a computation graph represents a tensor operation, processing one or more input tensors into a single output. An edge connects an output tensor from one node to an input tensor of another node. In this paper, we apply a performance model to two specific optimization tasks, operator fusion (program-level) and tile-size selection (kernel-level), for XLA programs running on the TPU v2.
Operator fusion is an important program-level optimization that merges multiple operations into a single unit. Before this pass, a node in a computation graph is a single primitive tensor operation (e.g., convolution, element-wise add). When two producer-consumer ops are fused, the intermediate data is immediately consumed by the consumer, without the need to perform read and write transactions with main memory, thereby reducing data communication. After the fusion pass, a node in a computation graph is either a single primitive op or a fused op comprising many primitive ops. In this paper, we call a node in an optimized computation graph a kernel, as illustrated in Fig. 2.

We have developed a fusion autotuner that searches for the fastest fusion configuration of an XLA program. It has found up to 15% speedup on some production deep learning models, but the autotuning process is slow, with most of its time spent compiling and executing programs on the TPU. The search space is also extremely large, so we need a fast mechanism to evaluate as many candidates as possible within a time budget. Therefore, we propose using a learned performance model to reduce evaluation time on real hardware. Currently, there is no manual performance model built for this task in XLA.

Tile-size selection is a performance-critical, kernel-level optimization. The goal is to select an optimal tile size for a kernel's output tensor that fits in the fast scratchpad memory; one tile is computed at a time and copied to the slower main memory before the next tile is computed. The number of valid tile sizes ranges from two to 500,000 depending on the kernel. XLA selects the tile size based on a manually-written analytical performance model. This model is extremely complex, taking several person-years to develop. Ultimately, we would like to replace this manual performance model with the learned performance model, demonstrating a new, less costly way of developing compilers.
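To make the selection step concrete, the following sketch scores every valid tile size for a kernel with a cost model (analytical or learned) and keeps the cheapest. All names here (pick_tile_size, cost_model.predict, etc.) are illustrative assumptions for exposition, not XLA's actual API.

```python
def pick_tile_size(kernel, valid_tile_sizes, cost_model):
    """Return the tile size the cost model predicts to be fastest.

    `kernel`, `valid_tile_sizes`, and `cost_model` are hypothetical
    stand-ins for the compiler's kernel IR, its list of legal tile
    sizes, and either the analytical or the learned model.
    """
    best_tile, best_cost = None, float("inf")
    for tile in valid_tile_sizes:
        cost = cost_model.predict(kernel, tile)  # estimated runtime (arbitrary units)
        if cost < best_cost:
            best_tile, best_cost = tile, cost
    return best_tile
```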
Our performance model is developed for the TPU, a fast, energy-efficient deep learning accelerator. Its architecture is in some ways simpler and in others more complex than modern general-purpose processors like x86. It has no out-of-order execution, hardware caching, or virtual memory. However, it incorporates a VLIW instruction set, 2D registers, a matrix multiplication unit, and a cross-lane unit.
Figure 2: Optimized XLA graph, in which each node (called a kernel) in turn contains a graph of primitive op(s).
[Figure 3 diagram: opcode ids x_o [n, 1] are embedded and concatenated with per-op features X_f [n, m'] to form X [n, m]; GraphSAGE layers use the adjacency matrix A [n, n] to produce node embeddings ε^k; sum, mean, and max reductions over the nodes are concatenated and passed through a feedforward layer to produce y', the predicted execution time.]
Figure 3: Architecture of the neural network that predicts the execution time of a kernel.

This TPU does not support multi-threading; one kernel is executed at a time, reading from and writing to main memory at its start and termination, respectively. Thus, we can compute the total runtime of an entire program by summing the runtimes of its kernel executions. This approach of estimating the total program runtime from kernels' runtimes can be applied to many accelerators; prior work has shown that this technique is sufficiently accurate for graph rewrites [15] and parallelization configuration autotuning [16, 21] on GPUs.
Our approach first decomposes an XLA program into kernels and formulates the kernel runtime estimation problem as a regression task. We can then compute the program's total runtime by summing the kernel runtimes. This approach confers two benefits. First, this simple decomposition remains general enough that we can apply the neural network model to various tasks, including both whole-program optimizations and kernel-level optimizations. Second, it introduces a restriction consistent with how a compiler transforms a high-level program into a set of optimized kernels, reducing the size of the graphs for which our model must produce embeddings by orders of magnitude and therefore improving the sample-to-parameter ratio at no cost. This improves our model's ability to generalize to unseen programs.

The rest of this section focuses on our neural network model, which predicts the execution time of each individual kernel. For the purpose of predicting cost, we represent a kernel as a directed graph with nodes corresponding to primitive operations.
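As a minimal sketch of this decomposition, assuming a hypothetical kernels() accessor and a trained per-kernel predictor (names are ours, not the compiler's), the whole-program estimate is just a sum:

```python
def predict_program_runtime(program, predict_kernel_runtime):
    """Estimate total program runtime by summing per-kernel predictions.

    This mirrors the decomposition above: the TPU runs one kernel at a
    time, so kernel runtimes add up. `program.kernels()` and
    `predict_kernel_runtime` are hypothetical stand-ins.
    """
    return sum(predict_kernel_runtime(kernel) for kernel in program.kernels())
```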
Figure 3 depicts the architecture of our performance model for predicting the execution time (y') of a kernel. Inputs to the model are opcodes (x_o), non-opcode features of the ops (X_f), and a directed adjacency matrix (A) that captures the connections between ops in the kernel. A row of X_f includes attributes extracted from the XLA program representation, such as the output tensor shape, tensor layout, striding, padding, tile size, and, where applicable, convolution filter size. Kernel inputs are expressed by nodes with the parameter opcode, and outputs are expressed via an extra feature associated with the output nodes. The opcode (x_{o_i}) of an operation i is embedded into a vector of floats via a simple embedding lookup table. An op's features occupy a fixed region of the X_{f_i} vector.
Neighborhood Embedding

We use a single feedforward layer (f_0 in Eq. (1)) followed by GraphSAGE [12] (without edge sampling) to combine information from the opcode, the op's features, and the graph structure to generate the node's embedding. The embedding of node i considering k-hop neighbors can be computed as follows:

$$\varepsilon_i^k = l\left(f_1^k\left(\operatorname{concat}\left(\varepsilon_i^{k-1},\; \sum_{j \in \operatorname{neighbors}(i)} f_2^k\big(\varepsilon_j^{k-1}\big)\right)\right)\right) \text{ when } k > 0, \qquad \varepsilon_i^0 = f_0(X_i) \tag{1}$$

where f_0, f_1^k, and f_2^k denote feedforward layers specific to depth k, l denotes L2 normalization, and neighbors(i) is the set of immediate neighbors of node i. The reduction over neighbors (shown here as a sum) is chosen during hyperparameter search.

We employ GraphSAGE because (i) a tensor computation kernel is naturally represented as a graph, and (ii) learning node representations conditioned only on their own features and local neighborhoods has been shown to improve generalization. In our setting, we expect the effect of most ops to be determined by their own properties and those of nearby ops. These are the sorts of features that can be learned from the neighborhood, so our choice of model encourages an inductive bias toward mostly local contributions.
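A NumPy sketch of Eq. (1) is given below. The layer callables stand in for the feedforward layers f_0, f_1^k, and f_2^k; the paper does not publish code, so the shapes, names, and the zero vector used for nodes without neighbors are our assumptions.

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    return v / (np.linalg.norm(v) + eps)

def node_embeddings(X, neighbors, f0, f1, f2, num_hops):
    """Compute GraphSAGE-style node embeddings following Eq. (1).

    X         : [n, m] per-node features (opcode embedding ++ op features).
    neighbors : dict mapping node id -> list of immediate neighbor ids.
    f0        : callable implementing f_0.
    f1, f2    : lists of callables; f1[k-1] and f2[k-1] play the roles of
                f_1^k and f_2^k at depth k. All outputs share one width here.
    """
    eps = [f0(X[i]) for i in range(X.shape[0])]                   # epsilon^0
    for k in range(1, num_hops + 1):
        new_eps = []
        for i in range(X.shape[0]):
            msgs = [f2[k - 1](eps[j]) for j in neighbors.get(i, [])]
            agg = np.sum(msgs, axis=0) if msgs else np.zeros_like(eps[i])
            combined = np.concatenate([eps[i], agg])              # concat(self, reduced neighbors)
            new_eps.append(l2_normalize(f1[k - 1](combined)))     # l(f_1^k(...))
        eps = new_eps
    return np.stack(eps)                                          # one row per node
```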
Kernel Embedding

Once we have node embeddings ε^k, we create the embedding (κ) of the kernel by computing the sum, mean, and max over the rows of ε^k. We then pass κ, the concatenation of a combination of the sum, mean, and max vectors, into a final feedforward layer (without activation) to produce the estimated execution time (y') of the kernel; the exact combination of sum, mean, and max vectors is tuned via hyperparameter search.

For both the fusion and tile-size selection tasks, we use the same neural network model architecture and node features X_{f_i}, which include a tile-size feature of the kernel the node belongs to. We represent the tile-size feature as a fixed-length sub-vector, in which the elements are the sizes of a tile from minor to major dimensions, ending with their sum and product; including the product of all dimensions' sizes is crucial as it represents the volume of the tensor.
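The pooling step and the tile-size sub-vector can be sketched as follows; which reductions are concatenated, the padding of the tile vector, and the final_layer callable are assumptions for illustration rather than the exact production configuration.

```python
import numpy as np

def kernel_embedding_and_runtime(node_embeddings, final_layer,
                                 reductions=("sum", "mean", "max")):
    """Pool node embeddings into a kernel embedding kappa and predict y'.

    `final_layer` stands in for the last feedforward layer (no activation);
    the paper tunes the set of reductions via hyperparameter search.
    """
    pools = {"sum": node_embeddings.sum(axis=0),
             "mean": node_embeddings.mean(axis=0),
             "max": node_embeddings.max(axis=0)}
    kappa = np.concatenate([pools[r] for r in reductions])   # kernel embedding
    return final_layer(kappa)                                 # predicted execution time y'

def tile_size_feature(tile_dims, max_rank=8):
    """Fixed-length tile-size sub-vector: sizes from minor to major
    dimensions (zero-padded to max_rank, an assumed bound), followed by
    their sum and product; the product is the tile's volume."""
    sizes = list(tile_dims) + [0] * (max_rank - len(tile_dims))
    return np.array(sizes + [sum(tile_dims), int(np.prod(tile_dims))], dtype=float)
```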
Fusion Task

In this task, we would like the neural network to predict kernel runtimes in an absolute unit (nanoseconds) so that we can use the predictions to compute total program runtime. Thus, we train the neural network model using the common squared error loss, $(y'_i - y_i)^2$, against log-transformed targets. We apply the log transformation because targets vary widely, ranging from a nanosecond to a second. With this loss function, our model is biased toward fitting long-running kernels more closely than short-running kernels. This is desirable because small kernels do not contribute much to overall program runtime.
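A sketch of the fusion-task loss follows, assuming runtimes are measured in nanoseconds, predictions are made in log space, and the natural logarithm is used (the exact log base is not stated in the text).

```python
import numpy as np

def fusion_loss(predicted_log_runtimes, measured_runtimes_ns):
    """Squared error against log-transformed runtime targets."""
    targets = np.log(measured_runtimes_ns)   # log transform of widely-ranging targets
    return np.mean((predicted_log_runtimes - targets) ** 2)
```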
Tile-Size Selection Task

In this task, we are interested in the relative speed between different tile sizes within each kernel. Therefore, the performance model does not need to predict absolute runtime; instead, it should be able to rank tile sizes by speed within each kernel. With this intuition, we train the model with a pairwise rank loss [4]:

$$\mathcal{L} = \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} \phi(y'_i - y'_j)\cdot \operatorname{pos}(y_i - y_j)}{n\cdot(n-1)/2} \tag{2}$$

where n is the number of samples in each batch; pos(z) is 1 if z > 0 and 0 otherwise; and φ(z) is either the hinge function (1 − z)_+ or the logistic function log(1 + e^{−z}), tuned via hyperparameter search. With this loss function, we modify our batching mechanism by grouping samples of different tile sizes of the same kernel into the same batch.

Alternatively, we could use the same MSE loss as in the fusion task, but weight the loss value of each sample appropriately so that the model is optimized for all kernels equally.
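A NumPy sketch of Eq. (2) for one batch built from tile-size samples of the same kernel is shown below; the vectorization and the hinge/logistic switch are our choices, and the paper tunes φ via hyperparameter search.

```python
import numpy as np

def pairwise_rank_loss(pred, target, phi="hinge"):
    """Pairwise rank loss of Eq. (2) over one batch.

    pred   : [n] predicted runtimes y' for n tile-size samples of one kernel.
    target : [n] measured runtimes y for the same samples.
    """
    n = len(pred)
    diff_pred = pred[:, None] - pred[None, :]         # y'_i - y'_j
    pos = (target[:, None] - target[None, :] > 0)     # pos(y_i - y_j)
    if phi == "hinge":
        penalty = np.maximum(0.0, 1.0 - diff_pred)    # (1 - z)_+
    else:
        penalty = np.log1p(np.exp(-diff_pred))        # log(1 + e^{-z})
    return float((penalty * pos).sum() / (n * (n - 1) / 2))
```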
Our dataset consists of computation graphs from 104 XLA programs that implement either production models or common models used in research.

Fusion Dataset
We run our fusion autotuner with a random search strategy to generate 50,000 fusion configurations, or until a timeout (four hours using 50 machines), for each input computation graph. The graphs are then decomposed according to these fusion configurations, yielding 207 million fused kernels (examples) after duplicate elimination. We observe that program runtimes differ by no more than 4% between runs on the TPU. To improve stability, we execute each kernel three times and take the minimum runtime as the target. Examples in this dataset are heavily skewed: approximately half have runtimes below 5 μs, but they contribute little to total program runtimes, so the kernels that take at least 5 μs are of more interest.
          Manual Split                               Random Split
          Programs             Kernels               Programs             Kernels
Split     Fusion   Tile-Size   Fusion    Tile-Size   Fusion   Tile-Size   Fusion    Tile-Size
Train     79       92          198.6M    23.0M       78       93          157.5M    21.8M
Val.
Test

Table 1: The number of unique programs and kernels in the fusion and tile-size datasets. M = million.
Tile-Size Dataset
We compile each XLA program using the compiler's default fusion heuristics, obtaining an optimized computation graph that we decompose into kernels. For each kernel, we query the compiler for a list of valid tile sizes. The target for each kernel/tile-size pair (example) is the minimum runtime from three runs. A kernel may have as many as 500,000 valid tile sizes, so we measure runtimes for as many as possible for each kernel within 30 minutes across 50 machines.
Dataset Splitting
We estimate our approach's ability to generalize in two ways, corresponding to two separate test splits: one split where held-out test programs were chosen randomly, and another where held-out test programs were manually chosen to minimize their (subjective) similarity to programs in the training set. See Table 1 for relevant statistics.
In this section, we show that our learned performance model is comparable to the manually-written model used in XLA for the TPU: it is 10% more accurate on the fusion dataset (Section 6.1) while performing slightly worse on the tile-size dataset (Section 6.2). Additionally, we integrated the model into the XLA fusion autotuner, and show that it can help the autotuner discover faster programs when access to real hardware accelerators is limited (Section 6.3).

For all experiments, we trained our models on a single NVIDIA V100 instance with 96 GB of RAM and 10 CPU cores for data processing. For all the learned models, we performed a hyperparameter search (presented in the Supplementary Material) and selected the best-performing models on the validation split.
To understand the accuracy of our proposed performance model, we compare the mean absolute percentage error (MAPE) and rank correlation of our model against two baselines. We compare our approach's ability to generalize to novel programs against both an existing analytical performance model and an LSTM baseline. We run separate experiments, including separate hyperparameter searches, for each dataset split described in Section 5.
Analytical Baseline
The XLA compiler backend has a mature analytical performance model that estimates the execution time of a kernel on a TPU, as described in Section 3.2. However, this analytical model was not intended for predicting the runtime of an entire computation graph, so the estimated costs of different types of kernels (e.g., fused kernels with and without convolutions) are on different scales. Hence, we map the model's output to an estimated runtime by scaling it with a coefficient associated with the kernel's type. Coefficients are determined by executing each program in the test set on the real hardware target with a default fusion configuration, and dividing the actual total runtime for all kernels of each type by the estimate in its original scale.
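The calibration described above amounts to one coefficient per kernel type; a sketch follows, where kernels, analytical_cost, measured_runtime, and kernel_type are hypothetical stand-ins for the compiler's kernel list, its analytical model, hardware measurements, and a kernel-type classifier.

```python
from collections import defaultdict

def fit_scaling_coefficients(kernels, analytical_cost, measured_runtime, kernel_type):
    """Per-kernel-type coefficients mapping analytical scores to runtimes.

    For each type (e.g., fused kernels with vs. without convolutions), the
    coefficient is total measured runtime divided by total analytical estimate.
    """
    totals = defaultdict(lambda: [0.0, 0.0])   # type -> [measured_sum, estimated_sum]
    for kernel in kernels:
        t = kernel_type(kernel)
        totals[t][0] += measured_runtime(kernel)
        totals[t][1] += analytical_cost(kernel)
    return {t: measured / estimated
            for t, (measured, estimated) in totals.items() if estimated > 0}
```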
LSTM Baseline

Prior work proposes an LSTM-based performance model for x86 basic blocks [19]. To understand the effect of representing program examples as graphs rather than sequences, we compare our proposed graph neural network to an LSTM trained over topologically sorted sequences of nodes, whose embeddings are the same per-node representations used in our proposed model.
Results
As seen in Table 2, our model, the LSTM baseline, and the analytical model have median MAPEs of 13.9, 26.6, and 23.9, respectively, on longer-running kernels when considering the random dataset split. (The analytical performance model does not support kernels without tile-size options, which account for 1% of the kernels in our dataset; we ignore these kernels in our comparisons in Section 6.1.)
             MAPE                               Kendall's τ
             Our Model   LSTM   Analytical      Our Model   LSTM   Analytical
ConvDRAW
WaveRNN
NMT Model
SSD
RNN
ResNet v1
ResNet v2
Translate
Median

Table 2: Fusion dataset: Mean absolute percentage error of predicted kernel runtimes for kernels with ≥ 5 μs true runtimes, which account for the majority of total runtime in our programs, on the random-split test set.
             Our Model (Rank Loss)   Our Model (MSE Loss)   Analytical
ConvDRAW
WaveRNN
NMT Model
SSD
RNN
ResNet v1
ResNet v2
Translate
Median

Table 3: Tile-size dataset: Mean Kendall's tau between targets and predictions within each kernel, on all applications in the random-split test set.

Our proposed model substantially outperforms both the LSTM baseline and the analytical model, both in terms of predicting absolute performance and in terms of rank correlation. On kernels with < 5 μs runtimes, results are similar in terms of runtime predictions, while our model shows higher correlation; our model, the LSTM baseline, and the analytical model have median MAPEs of 8.4, 12.1, and 21.0 and median Kendall's τ coefficients of .82, .82, and .71, respectively. All models, including the analytical performance model, perform poorly on at least one application when evaluated on ≥ 5 μs kernels. This shows that the tensor programs vary widely, and it is challenging to build a perfect performance model for all programs.

On the harder task, programs which were chosen deliberately to have the least similarity to training programs, the comparison between our model and the analytical model is less conclusive but competitive. On kernels with runtimes ≥ 5 μs, our model, the LSTM baseline, and the analytical model have median MAPEs of 31.8, 40.0, and 12.6, respectively, while their median Kendall's τ coefficients are .71, .70, and .92.

Overall, these results suggest that our learned performance model is competitive with the analytical baseline. This motivates our later experiments in applying it to a downstream task (Section 6.3).

In this experiment, we drop the LSTM baseline as it is inferior to the graph-based model for our application domain. We train our model with two different loss functions, MSE and rank loss, as explained in Section 4.2, and compare our model against the same analytical model. In this task, we are interested only in the relative runtimes between different tile sizes within each kernel. Thus, we measure the models' accuracy only with the Kendall correlation, not MAPE. We compute the correlation between targets and predictions of tile-size runtimes within each kernel, and then compute the average over all kernels in each program. Recall that the analytical model is developed specifically for this task, and we do not need to predict runtime in nanoseconds; as a result, the scaling coefficients used in the fusion task are no longer needed.

Table 3 displays the result on the random dataset split. Our best learned performance model (trained using the pairwise rank loss) performs slightly worse than the analytical performance model: .07 lower
correlation; on the harder split, the gap is .16 lower correlation. Regarding the loss function, we found that the model trained using the pairwise rank loss performs better than the one trained with MSE: .04 and .13 higher correlation on the random and hard splits, respectively. This result confirms our intuition that training a model to predict relative speeds is easier than training it to predict absolute runtimes.

Figure 4: Runtime speedup found by autotuning with and without the learned performance model over the default heuristic configuration: (a) autotuning from the default configuration; (b) autotuning from a random configuration.
We integrate the best learned performance model from Section 6.1 into the XLA fusion autotuner. We modify the autotuner to support evaluating fusion configs either by executing generated kernels on real hardware or by estimating their runtimes using the learned model, running prediction on a CPU. The analytical model is not used in this experiment because it cannot estimate runtimes for kernels that do not have tile-size options, i.e., kernels that are not fusion, convolution, or data-formatting operations.
Experiment Setup.
Since our target hardware is in demand and scarcer than CPUs, we aim to minimize the time we use the accelerators during autotuning. Hence, we limit accelerator use to five minutes in our experiment setup. We run a simulated annealing search using the learned performance model (from Section 6.1) for one hour on a CPU. After that, we run as many of the top fusion configs as possible, in the order ranked by their predicted costs, on the real hardware within the five-minute time limit. The baseline is the original autotuner, which uses only the real hardware to evaluate fusion configs, running for five minutes. We run the autotuner in two modes: starting the search from (i) a default config and (ii) a random config. A default config is the configuration generated by the compiler's default heuristic algorithm for a specific program.

In this experiment, we run the autotuner on a set of programs that gain significant speedup from autotuning according to our prior data. Although some programs (Transformer, Char2Feats, and ResNet-parallel) are in our training set, most kernels are not, because our training data is not generated from a simulated annealing search starting from a default fusion configuration.
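A sketch of this hybrid evaluation is shown below; the search itself (e.g., simulated annealing) is assumed to have already produced a pool of candidate configs, and all function names here are illustrative rather than the autotuner's real interface.

```python
import time

def hybrid_autotune(candidate_configs, predict_cost, measure_on_hw,
                    hw_budget_seconds=300):
    """Rank candidates with the learned model, then verify the most
    promising ones on real hardware within a small time budget."""
    ranked = sorted(candidate_configs, key=predict_cost)   # cheapest predicted first
    best_config, best_runtime = None, float("inf")
    deadline = time.time() + hw_budget_seconds
    for config in ranked:
        if time.time() > deadline:
            break
        runtime = measure_on_hw(config)                    # real TPU measurement
        if runtime < best_runtime:
            best_config, best_runtime = config, runtime
    return best_config, best_runtime
```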
Result.
We run the autotuner on each program 20 times and report the best speedup found over the default configuration in Fig. 4a. Using the learned performance model together with the hardware, we are able to discover fusion configurations that are on average 2% faster than those found using the hardware alone, and they are on average only 1% slower than the best known configurations found when running the autotuner on hardware for four hours. When running simulated annealing starting from a random configuration (Fig. 4b), the benefit from the performance model is even more pronounced. On average, using the performance model led to discovering 8% faster configurations compared to not using the performance model. This result demonstrates that the learned performance model can indeed help generate faster code in practice when access to a hardware target is limited.
We have presented first steps toward learning a performance model for tensor programs. We have found that a model trained on our corpus of research and production models can generalize well to programs with some similarity to our training set, usually matching or improving upon the performance of the best known analytical baseline for our target hardware, and performs acceptably well on programs which differ substantially. When evaluated on the task for which the analytical model is heavily optimized (tile-size selection), our learned model is slightly worse. However, while the learned cost model is less accurate, it requires much less effort to develop. Finally, we demonstrated that the learned cost model can be employed by an autotuner to discover faster tensor programs than using hardware targets alone when hardware access is limited.
Acknowledgments and Disclosure of Funding
We would like to thank the XLA team, especially Amit Sabne and Bjarke Roune, for feedback and help while developing this project; Hyeontaek Lim for code review; Sudip Roy for paper feedback; and Rishabh Singh for occasional guidance.
References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: A System for Large-Scale Machine Learning. In OSDI, 2016.
[2] Andrew Adams, Karima Ma, Luke Anderson, Riyadh Baghdadi, Tzu-Mao Li, Michaël Gharbi, Benoit Steiner, Steven Johnson, Kayvon Fatahalian, Frédo Durand, and Jonathan Ragan-Kelley. Learning to Optimize Halide with Tree Search and Random Programs. ACM Trans. Graph., 38(4):121:1–121:12, July 2019.
[3] Hugues Berry, Daniel Gracia Pérez, and Olivier Temam. Chaos in Computer Performance. Chaos (Woodbury, N.Y.), 16:013110, 2006.
[4] Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to Rank Using Gradient Descent. In Proceedings of the 22nd International Conference on Machine Learning, ICML '05, pages 89–96, New York, NY, USA, 2005. Association for Computing Machinery.
[5] Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. Learning to Optimize Tensor Programs. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS '18, 2018.
[6] Boyang Deng, Junjie Yan, and Dahua Lin. Peephole: Predicting Network Performance Before Training. CoRR, abs/1712.03351, 2017.
[7] Christophe Dubach, John Cavazos, Björn Franke, Grigori Fursin, Michael F. P. O'Boyle, and Olivier Temam. Fast Compiler Optimisation Evaluation Using Code-feature Based Performance Prediction. In Proceedings of the 4th International Conference on Computing Frontiers, CF '07, 2007.
[8] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Efficient Multi-objective Neural Architecture Search via Lamarckian Evolution. arXiv e-prints, arXiv:1804.09081, April 2018.
[9] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural Architecture Search: A Survey. arXiv e-prints, arXiv:1808.05377, August 2018.
[10] Roy Frostig, Matthew James Johnson, and Chris Leary. Compiling Machine Learning Programs via High-level Tracing. In Advances in Neural Information Processing Systems, 2017.
[13] Chi-Hung Hsu, Shu-Huan Chang, Jhao-Hong Liang, Hsin-Ping Chou, Chun-Hao Liu, Shih-Chieh Chang, Jia-Yu Pan, Yu-Ting Chen, Wei Wei, and Da-Cheng Juan. MONAS: Multi-Objective Neural Architecture Search Using Reinforcement Learning. arXiv e-prints, arXiv:1806.10332, June 2018.
[14] R. Istrate, F. Scheidegger, G. Mariani, D. Nikolopoulos, C. Bekas, and A. C. I. Malossi. TAPAS: Train-less Accuracy Predictor for Architecture Search, 2018.
[15] Zhihao Jia, James Thomas, Todd Warszawski, Mingyu Gao, Matei Zaharia, and Alex Aiken. Optimizing DNN Computation with Relaxed Graph Substitutions. In SysML, 2019.
[16] Zhihao Jia, Matei Zaharia, and Alex Aiken. Beyond Data and Model Parallelism for Deep Neural Networks. In SysML, 2019.
[17] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. In-Datacenter Performance Analysis of a Tensor Processing Unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA '17, 2017.
[18] LLVM. Auto-Vectorization in LLVM. https://bcain-llvm.readthedocs.io/projects/llvm/en/latest/Vectorizers. [Online; accessed 03-Feb-2020].
[19] Charith Mendis, Alex Renda, Saman P. Amarasinghe, and Michael Carbin. Ithemal: Accurate, Portable and Fast Basic Block Throughput Estimation Using Deep Neural Networks. In Proceedings of the 36th International Conference on Machine Learning, ICML, 2019.
[20] Azalia Mirhoseini, Anna Goldie, Mustafa Yazgan, Joe Jiang, Ebrahim Songhori, Shen Wang, Young-Joon Lee, Eric Johnson, Omkar Pathak, Sungmin Bae, et al. Chip Placement with Deep Reinforcement Learning. arXiv preprint arXiv:2004.10746, 2020.
[21] Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, Phillip B. Gibbons, and Matei Zaharia. PipeDream: Generalized Pipeline Parallelism for DNN Training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, SOSP '19, pages 1–15, New York, NY, USA, 2019. Association for Computing Machinery.
[22] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, 2019.