MetaTune: Meta-Learning Based Cost Model for Fast and Efficient Auto-tuning Frameworks
Jaehun Ryu, Hyojin Sung
Pohang University of Science and Technology, Pohang. Correspondence to: Hyojin Sung <[email protected]>

Abstract
Deep learning compiler frameworks are gaining ground as a more portable back-end for deep learning applications on increasingly diverse hardware. However, they face the daunting challenge of matching the performance offered by hand-tuned, target-specific libraries. While auto-tuning frameworks with statistical cost models can provide dynamic and efficient code optimization, they suffer from large space exploration and cost model training overheads. This paper proposes MetaTune, a meta-learning based cost model that more quickly and accurately predicts the performance of optimized codes with pre-trained model parameters. MetaTune encodes convolution kernel codes as structurally similar graphs to facilitate meta-learning, meta-trains a GNN model with a very small input data set, and then predicts optimization parameters for unseen convolution operations with varying sizes and structures during compilation. The resulting framework with MetaTune provides 8 to 13% better inference time on average for four CNN models with comparable or lower optimization time, while outperforming transfer learning by 10% in cross-platform cases.
1. Introduction
Deep learning models have recently emerged as an application domain that drives innovations in domain-specific compiler technologies. While many deep learning programming frameworks (Abadi et al., 2015; Paszke et al., 2019; Chen et al., 2015) rely on hand-optimized libraries such as NVIDIA cuDNN/cuBLAS or Intel MKL for their back-ends, code-generating compiler frameworks (Chen et al., 2018a; Rotem et al., 2019; Ragan-Kelley et al., 2013) are exploring their potential as a more flexible and portable solution on increasingly diverse and heterogeneous platforms. The key challenge for deep-learning compilers lies in generating highly optimized codes with comparable or better performance than hand-tuned libraries.

To address the limitations of traditional rule-based heuristics that cannot dynamically adapt to input and hardware characteristics, more data-driven approaches have been proposed to determine optimization strategies (Chen et al., 2018b; Ragan-Kelley et al., 2013; Ahn et al., 2020). These "auto-tuning" compilers use statistical cost models to learn the correlation between programs and runtime behaviors from profiling runs and to guide the search for ideal optimization configurations.

However, this approach incurs significant compilation overheads from searching the huge optimization space even for simple convolution kernels, and from repeatedly profiling optimized codes to train a cost model. Efforts to address the issue have focused mainly on improving the search algorithms or reducing/accelerating profiling runs (Ahn et al., 2020; Dhakal et al., 2020), while cost models are still freshly trained from hardware measurements of each tensor operation, which has a unique optimization space. (Chen et al., 2018b) reused pre-trained cost models for different tensor operations with transfer learning, but it still requires several hours and hundreds of iterations for cost models to converge, and its effectiveness is task-dependent.

In a search for a more general and systematic way of reusing knowledge obtained from other tensor operations or platforms to fine-tune the cost model, we observed that meta-learning can enable the cost model to "learn to learn" code-performance correlations and very quickly adapt to the optimization space of any tensor operation with structural similarity. A performance regression task with code embeddings as input is not a typical meta-learning application and has sparse prior work. However, with a more flexible meta-learning method that works with any model trained with gradient descent (Finn et al., 2017), a cost model may predict performance more accurately in far fewer iterations with meta-learned hyperparameters.

Thus, we propose MetaTune, a meta-learning based cost model for more accurate and efficient auto-tuning frameworks. We implemented MetaTune as an alternative cost model for the auto-tuning framework in TVM (Chen et al., 2018a).
MetaTune includes (1) a feature generator that encodes code templates of convolution kernels as graphs and augments the embeddings to improve structural similarity, and (2) a meta-learning based cost model that meta-trains with performance prediction tasks generated from a small data set during pre-training and then fine-tunes with profiling runs of the target kernel during auto-tuning.

Our experimental results show that MetaTune outperforms existing cost models regardless of the candidate search algorithm used with it, and enables fast adaptation in not only cross-operation but also cross-platform scenarios. MetaTune with the batch Bayesian optimization search we propose in this paper improves inference time by 8% and 13% for all evaluated convolution kernels, compared to (Chen et al., 2018b) and (Ahn et al., 2020) respectively. Compared to transfer-learning based cost models, the MetaTune model provides faster and better fine-tuning results (10% on average) on NVIDIA GPUs of different generations than the pre-training target hardware.

The contributions of the paper are as follows:

• Super-graph input augmentation: Extending the graph-based representation of codes in prior work (Leary and Wang, 2017; Tomczak et al., 2019), we augment graphs to fit in a common super-graph template and produce input data with higher structural similarity. As a result, the same MetaTune model with template augmentation provides 4.6% better inference time on average than with the original input.

• Meta-learning model formulation: We designed meta-training tasks for performance regression so that the model can be pre-trained for few-shot learning. The results show that auto-tuning frameworks with MetaTune jump-start to predict high-performing optimization parameters within the first dozens of iterations.

• Performance portability: We showed that MetaTune can seamlessly interface with existing search algorithms and consistently improve the auto-tuning framework's overall efficiency on different GPU platforms.

• Complete solution: We implemented a batch Bayesian optimization search algorithm to cater to the needs of MetaTune, which does not require as much space exploration due to pre-training but incurs higher fine-tuning overheads. Our proposed framework with MetaTune and batch Bayesian optimization achieves the lowest inference time for all CNN models evaluated.

In the rest of the paper, we first provide an overview of the auto-tuning frameworks in deep learning compilers that we assume for MetaTune, and present the design and implementation of the MetaTune model in that context. Then we evaluate the efficiency of MetaTune in various scenarios in the evaluation section and wrap up with related work and conclusions.
Knob                        Description                                                                   # of candidates
tile_x / tile_y / tile_f    Loop tiling parameters on the number of filters, height, and width           140 / 140 / 120
                            of feature maps
tile_rc / tile_rx / tile_ry Loop tiling parameters for reduction over the number of channels,            8 / 2 / 2
                            height, and width of feature maps
auto_unroll_max_step        Guides the maximum unroll iteration                                           3
unroll_explicit             Turns explicit loop unrolling on or off                                       2

Table 1. Description of knobs in TVM
2. Background
Deep learning compiler frameworks consist of language front-ends and code-generating back-ends, as shown in Figure 1. The front-end takes an input model and translates it into a high-level IR (often graph-based) to which target-independent optimizations such as operator fusion and data layout transformation are applied. The back-end takes the optimized IR as input and goes through target-dependent optimization passes that further transform the IR to better exploit target hardware features. Auto-tuning is implemented as a part of such target-dependent optimization passes.

TVM (Chen et al., 2018a) is an open-source deep learning compiler with this structure that is widely adopted by industry and academia. Its auto-tuning framework aims to match the performance of hand-tuned libraries and showed promising results in prior work (Chen et al., 2018b; Ahn et al., 2020). This paper focuses on proposing an alternative cost model for TVM to further improve auto-tuning efficiency without changing high-level structures, while our approach can be applied to any cost-model based auto-tuning effort.
The auto-tuning problem, in general, is about automatically generating a search space of optimized codes and finding the best-performing version from hardware measurements (Ding et al., 2015a;b). To efficiently navigate the huge search space of all possible implementations, a cost model function is often used to predict the performance of codes and guide the search. With a code-generating function ρ that uses optimization parameters φ (e.g., tile size), program operation σ (e.g., 2D convolution), and operation options c (e.g., kernel size) to create a search space of optimized codes D, and a cost model function f that predicts the performance of codes in D, the problem can be formulated as follows:

φ* = argmax_φ f(ρ(φ | σ, c))
Figure 1. The structure of deep learning compiler frameworks: a front-end with target-independent passes, and a back-end whose auto-tuner couples a design space exploration module with a cost model, feeding hardware measurements back to update the cost model and emit optimal knobs and an optimized binary.
Figure 2. Graph-based input generation with super-graph: a low-level AST is summarized into a graph ("generate graph") and then fitted into the super-graph template ("populate template").
Figure 3. MetaTune model architecture: graph input representation, graph-convolution feature embedding, ReLU activation with sum & max pooling, and a fully connected regression head that outputs predicted FLOPS.
In TVM, φ consists of a combination of parameters called knobs, so the above expression turns into finding an optimal combination of knobs, as shown in Table 1, to maximize the performance predicted by f.

The accuracy of the cost model f is crucial in locating ideal optimization parameters, and many auto-tuning frameworks including TVM adopt machine-learning based cost models that can adapt to search results more dynamically than fixed heuristics or rule-based models.

As shown in Figure 1, the exploration module in TVM searches the space of optimized codes for ideal optimization parameters and selects what to profile next based on the performance predicted by the cost model, which is in turn updated based on hardware measurements. The key insight behind MetaTune is that f can be pre-conditioned with randomly sampled points in a small number of similar optimization spaces so that it can quickly adapt to f*, the target cost model function.
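To ground this loop, the following is a minimal sketch of the predict-profile-update cycle of Figure 1; `search_space`, `cost_model`, and `profile_on_hardware` are hypothetical stand-ins for TVM's internal interfaces, not actual TVM APIs:

```python
import random

def autotune(search_space, cost_model, profile_on_hardware,
             rounds=100, batch=64, top_k=8):
    """Sketch of cost-model-guided search: phi* = argmax_phi f(rho(phi | sigma, c))."""
    best_phi, best_perf = None, float("-inf")
    for _ in range(rounds):
        # the exploration module proposes candidate knob settings phi
        candidates = random.sample(search_space, batch)
        # the cost model f predicts the performance of each optimized code
        ranked = sorted(candidates, key=cost_model.predict, reverse=True)
        # only the most promising candidates are profiled on hardware
        measured = [(profile_on_hardware(phi), phi) for phi in ranked[:top_k]]
        # hardware measurements update (fine-tune) the cost model
        cost_model.update(measured)
        perf, phi = max(measured, key=lambda t: t[0])
        if perf > best_perf:
            best_perf, best_phi = perf, phi
    return best_phi
```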
3. Data Representation and Augmentation
Prior work on automatic optimization emphasized the importance of finding input features representative of program codes for machine-learning based cost models to accurately capture code-performance correlations (Cummins et al., 2017; Chen et al., 2018b; Fursin et al., 2011; Vasilache et al., 2018). When a cost model is meta-learned, it additionally requires the input data to be sufficiently similar in structure, since a difference in data representation between the meta-training and meta-testing datasets can cause meta-overfitting (Finn et al., 2017), leading to a learning failure. For example, meta-learning an image-based model assumes input images of the same size where the values for each pixel are in the same value range. Thus, we focus on designing the input data representation of MetaTune to have the structural uniformity that enables meta-learning.
TVM offers three ways to generate input data for the cost model: a set of knob parameters; "loop context vectors," which include computed statistics about loops such as loop lengths, access patterns, and counts of different arithmetic instructions; and curve features, which summarize relational information about loop context vectors (Chen et al., 2018b). Our attempts to design a meta-learning based cost model with these representations consistently failed due to a lack of structure in them. They express structural information as second-hand statistics computed and summarized from codes. As weakly structured data, they have variable lengths for tensor operations with different structures. Such format divergence interfered with meta-learning adaptation, which assumes structurally similar tasks. Inspired by prior work showing that more structured, tree/graph-based code representations improve modeling efficiency in general (Kaufman et al., 2020; Tomczak et al., 2019; Sung et al., 2019), we use graph-based input data for meta-learning and fine-tuning the MetaTune cost model.

Convolution kernels are expressed as code templates in TVM, where kernels of the same type (e.g., conv1d, conv2d+transpose) have the same code template. Code templates are in turn represented as low-level abstract syntax trees (ASTs).
Figure 4. Example of few-shot regression tasks for MetaTune (meta-train tasks over Winograd, Conv2D, and Depthwise kernels with varying input shapes, strides, and kernel sizes, and a fine-tune task). The model learns the universal performance distribution from pairs of graph-structured data, each a sampled input and its label (the blue line in the distribution).

As shown in Figure 2, we extract loop structures by recursively traversing the AST and build a summarized graph representation. Nodes in the graph can be a root node, for nodes, or iterval nodes. Directed edges between the root node and all for nodes represent control flow, while edges between for and iterval nodes match loops with detailed loop metadata. Each iterval node has the loop context vectors computed by TVM as its node feature. root and for nodes, without any node features, serve the role of placing the loop context vectors in a precise context.
To produce more structurally uniform graphs, we augment the AST-based graph input to fit in a shared super-graph template. This augmentation can be viewed as a further format standardization of code templates (which are already a per-type format standardization) across all types of convolution operations. We perform a union of all possible graph representations and create a super-graph template with placeholder nodes and edges between them. For super-graph augmentation, we locate the corresponding iterval node in the template, which is uniquely identified, and populate the node with context feature vectors. Figure 2 shows that only the i, j, and k iterval nodes corresponding to the nodes in the original graph contain context feature vectors, while the other iterval nodes are null in the super-graph version on the right. To reduce the augmentation overhead, we pre-define the super-graph template for all convolution kernels supported in TVM, record the mappings between matching nodes in a table, and use the table to quickly generate the final output. We plan to extend the super-graph formulation process in the future to dynamically support a wider range of tensor operations.
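A minimal sketch of this table-driven population step follows; the node mapping, node names, and feature dimension are hypothetical illustrations of the scheme, not the actual pre-defined template:

```python
import numpy as np

# Hypothetical precomputed mapping: (kernel type, local node) -> super-graph slot
NODE_MAPPING = {("conv2d", "iterval_i"): "sg_iterval_3",
                ("conv2d", "iterval_j"): "sg_iterval_4",
                ("conv2d", "iterval_k"): "sg_iterval_7"}

def augment(kernel_type, graph, supergraph_nodes, feature_dim=16):
    """Place each iterval feature into its super-graph slot; others stay null."""
    features = {n: np.zeros(feature_dim) for n in supergraph_nodes}
    for node, data in graph.nodes(data=True):
        slot = NODE_MAPPING.get((kernel_type, node))
        if slot is not None and "feature" in data:
            features[slot] = data["feature"]          # populate the matching node
    return features   # fixed-size, structurally uniform input for the GCN
```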
4. Model Architecture
The MetaTune cost model consists of a graph convolutional network (GCN) (Kipf and Welling, 2017), an activation and aggregation layer, and three fully connected (FC) layers with ReLU activation, as shown in Figure 3. The GCN serves as an embedding layer that takes augmented graphs as input data and encodes relationships between performance-critical information through convolutions on graph nodes. With the following ReLU activation and weighted sum and max aggregation, low-dimensional, fixed-length embedding vectors are generated as input for meta-training and fine-tuning tasks. The regression model uses fully connected layers with non-linearity, as exemplified in (Finn et al., 2017) for regression tasks. This model is meta-trained with the model-agnostic meta-learning (MAML) method and continues to fine-tune with inference queries during auto-tuning. The output of the model is the predicted performance of the input implementation in FLOPS.
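A sketch of this architecture in PyTorch, using torch_geometric for the GCN layers; the hidden width, the number of GCN layers, and the learnable mixing weights for the sum/max aggregation are illustrative choices under the structure described above, not the paper's exact hyperparameters:

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv, global_add_pool, global_max_pool

class MetaTuneCostModel(nn.Module):
    """Sketch of Figure 3: GCN embedding, ReLU + sum&max pooling, 3 FC layers."""
    def __init__(self, in_dim=16, hid=64):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hid)
        self.conv2 = GCNConv(hid, hid)
        # learnable weights for combining the sum and max aggregations
        self.mix = nn.Parameter(torch.tensor([0.5, 0.5]))
        self.head = nn.Sequential(
            nn.Linear(hid, hid), nn.ReLU(),
            nn.Linear(hid, hid), nn.ReLU(),
            nn.Linear(hid, 1),               # predicted performance (FLOPS)
        )

    def forward(self, x, edge_index, batch):
        h = torch.relu(self.conv1(x, edge_index))
        h = torch.relu(self.conv2(h, edge_index))
        # weighted sum + max pooling over nodes -> fixed-length embedding
        pooled = self.mix[0] * global_add_pool(h, batch) \
               + self.mix[1] * global_max_pool(h, batch)
        return self.head(pooled)
```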
5. Meta-learning Process
We trained our cost model using MAML (Finn et al., 2017; Koch, 2015) as shown in Algorithm 1.
1. Supervised pre-training [lines 1-5]: As proposed by previous work on regression models learned with meta-learning (Soh et al., 2020), we prepare a small dataset with labels and pre-train the GCN embedding layer via supervised learning to obtain stable embedding results from the GCN and model weights that are close to a target cost model function f*.

2. Input data preparation for meta-training [line 8]: We prepare another labeled dataset for all types of convolution kernels with varying configurations (more details on the training dataset in Section 6) and randomly organize them into meta-training tasks. We create few-shot meta-training tasks to approximate a universal performance distribution over knobs for all convolutional kernels. For the example shown in Figure 4, a 3-way, 2-shot learning task includes two examples with different optimization parameters from each of three classes, for conv2d, winograd, and depthwise operations with a specific input size, kernel size, and stride.
Algorithm 1 MetaTune meta-train step

Require: input-performance dataset D and distribution p(T) over kernel performance
Require: α, β: step sizes
Require: n, γ: pre-training epochs, learning rate
1:  Randomly initialize parameters θ
2:  for i = 1, ..., n do
3:      Sample an input-output pair (knob, perf) from D
4:      Evaluate the training loss L(f_θ(knob), perf)
5:      Update θ ← θ − γ ∇_θ L(f_θ(knob), perf)
6:  end for
7:  while not done do
8:      Sample a batch of tasks T_i ∼ p(T)
9:      for all T_i do
10:         Evaluate the training loss L(f_θ(knob_i), perf_i)
11:         Compute adapted parameters with gradient descent: θ'_i = θ − α ∇_θ L(f_θ(knob_i), perf_i)
12:     end for
13:     Update θ ← θ − β ∇_θ Σ_{T_i ∼ p(T)} L(f_{θ'_i})
14: end while

3. Meta-training [lines 9-13]: The regression model with FC layers learns the meta-training tasks with MAML (in this phase, we only use the pre-trained GCN layer without further training). For each meta-training task, the model computes the loss (line 10) and computes adapted parameters θ'_i with gradient descent (line 11), where α is the learning rate for local parameter updates within a task. After all tasks are evaluated, θ of the cost model f is updated by gradient descent with the global learning rate β on the sum of the task losses under the adapted parameters (line 13).

In the example of Figure 4, each iteration samples a batch of tasks and trains the model with the training loss between predicted and actual FLOPS for each task. Over many epochs, the model learns to predict the performance of an unseen implementation of a task after being trained on the other graphs in that task. At the end of training, the model weights are in a state where only a small number of samples is needed to regress a structurally similar but possibly unseen task.

During compilation, the pre-trained MetaTune model keeps improving by adapting to inference queries from the exploration module. Unlike meta-training tasks, fine-tuning tasks contain candidate samples for the single task being optimized. Online hardware measurements are used for gradient updates.
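To make Algorithm 1 concrete, below is a minimal, self-contained sketch of the MAML inner/outer loops (lines 7-14) for few-shot performance regression in plain PyTorch; the tiny functional MLP, the `sample_tasks` callable, and all sizes are illustrative stand-ins for the actual GCN-based cost model:

```python
import torch

def forward(params, x):
    """Tiny functional MLP standing in for the GCN + FC cost model."""
    W1, b1, W2, b2 = params
    h = torch.relu(x @ W1 + b1)
    return h @ W2 + b2

def maml_train(sample_tasks, dim=16, hid=32, alpha=0.01, beta=1e-3, steps=1000):
    """MAML meta-training; second-order variant via create_graph=True."""
    params = [(0.1 * torch.randn(dim, hid)).requires_grad_(),
              torch.zeros(hid, requires_grad=True),
              (0.1 * torch.randn(hid, 1)).requires_grad_(),
              torch.zeros(1, requires_grad=True)]
    opt = torch.optim.SGD(params, lr=beta)       # global update with rate beta
    mse = torch.nn.MSELoss()
    for _ in range(steps):
        opt.zero_grad()
        meta_loss = 0.0
        for sx, sy, qx, qy in sample_tasks():    # batch of few-shot tasks T_i
            # inner loop: one gradient step on the task's support set (line 11)
            loss = mse(forward(params, sx), sy)
            grads = torch.autograd.grad(loss, params, create_graph=True)
            adapted = [p - alpha * g for p, g in zip(params, grads)]
            # outer objective: loss of the adapted parameters theta'_i
            meta_loss = meta_loss + mse(forward(adapted, qx), qy)
        meta_loss.backward()                     # line 13: update theta
        opt.step()
    return params
```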
Operation                        Input size   Input channels   Output channels
conv1d, transpose1d              150-600      32-128           32-512
conv2d, transpose2d, winograd    7-224        3-128            16-128

Table 2. Parameters for the MetaTune model training dataset
6. Evaluation Methodology
We implemented MetaTune as a plug-in cost model component for TVM version 0.7.dev. We replaced the input feature generator and the cost model with the MetaTune implementation while experimenting with the following search algorithms in the exploration module.

• Batch Bayesian optimization: The default search algorithm in TVM, simulated annealing (SA) (Bertsimas et al., 1993), is known to effectively approximate global optimization in a large search space with extensive random exploration. We observed that MetaTune does not require as much space exploration but incurs higher inference overheads due to a more complicated model structure than prior work, and thus implemented a batch Bayesian optimization (BO) algorithm inspired by (Desautels et al., 2014; Wilson et al., 2017; Wu and Frazier, 2016). BO is more exploitation-centric and performs its search in far fewer iterations than SA. Our implementation, which uses the botorch library, defines each knob as an input dimension of the BO to capture knob relationships, and executes all BO operations and inference queries in parallel on the GPU.

• Adaptive exploration and sampling (Ahn et al., 2020): reinforcement-learning based search algorithms.
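As an illustration of the BO component, here is a minimal sketch using the botorch library (API of the BoTorch releases contemporary with TVM 0.7); the 8-dimensional unit-cube knob encoding and the random placeholder observations are assumptions for the sketch, not the actual MetaTune integration:

```python
import torch
from botorch.models import SingleTaskGP
from botorch.fit import fit_gpytorch_model
from botorch.acquisition import qExpectedImprovement
from botorch.optim import optimize_acqf
from gpytorch.mlls import ExactMarginalLogLikelihood

d = 8                                  # one dimension per knob (cf. Table 1)
train_x = torch.rand(20, d)            # already-evaluated knob settings, in [0,1]^d
train_y = torch.rand(20, 1)            # their (placeholder) measured performance

# fit a GP surrogate over knob settings
gp = SingleTaskGP(train_x, train_y)
fit_gpytorch_model(ExactMarginalLogLikelihood(gp.likelihood, gp))

# batch acquisition: propose q candidates to evaluate in parallel on the GPU
acq = qExpectedImprovement(gp, best_f=train_y.max())
candidates, _ = optimize_acqf(
    acq_function=acq,
    bounds=torch.stack([torch.zeros(d), torch.ones(d)]),
    q=8,                               # batch size of proposals
    num_restarts=10,
    raw_samples=256,
)
```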
Evaluated models and data: We evaluated inference times, total optimization times, and cost model prediction accuracy (MSE) for four CNN models: ResNet-18, VGG-16, SqueezeNet v1.0, and AlexNet. For meta-training, we created a meta-training dataset from 47 convolution kernels as shown in Table 2. Their input sizes and input/output channels are randomly chosen within the ranges in the table, while the stride and padding were fixed at 3 and 1, respectively. Then we randomly extracted 200 samples of optimized codes for each class, from which meta-training tasks are created.
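For illustration, sampling workload configurations within the ranges of Table 2 might look like the following sketch (the class list is abbreviated and the dictionary keys are hypothetical):

```python
import random

# Sketch: sample 47 convolution workloads within the ranges of Table 2.
CLASSES = [
    ("conv1d",   (150, 600), (32, 128), (32, 512)),
    ("conv2d",   (7, 224),   (3, 128),  (16, 128)),
    ("winograd", (7, 224),   (3, 128),  (16, 128)),
]

def sample_workloads(n=47):
    workloads = []
    for _ in range(n):
        op, size_r, in_r, out_r = random.choice(CLASSES)
        workloads.append({
            "op": op,
            "input_size": random.randint(*size_r),
            "in_channels": random.randint(*in_r),
            "out_channels": random.randint(*out_r),
            "stride": 3, "padding": 1,   # fixed, as stated in the text
        })
    return workloads
```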
Evaluated frameworks: We evaluated the following seven auto-tuning frameworks in TVM, including reproductions of prior work (Chen et al., 2018b; Ahn et al., 2020).

           xgb      RL      meta-BO   meta-BO-T
resnet-18  580.04   622.71  …         …

Table 3. End-to-end optimization time (minutes)
• xgb: the default TVM implementation with gradient boosting (Chen and Guestrin, 2016) for the cost model and SA as the search algorithm, as in (Chen et al., 2018b).

• xgb-Xfer: xgb with transfer learning enabled (as in TVM v0.7.dev).

• RL: the reinforcement-learning based implementation in (Ahn et al., 2020).

• meta-RL and meta-RL-T: MetaTune with RL-based candidate selection, without and with super-graph augmented data, respectively.

• meta-BO and meta-BO-T: MetaTune with batch Bayesian optimization based candidate selection, without and with super-graph augmented data, respectively.
Evaluation environment: We conducted our primary performance evaluation for all frameworks on a server with Intel Xeon CPUs and NVIDIA GeForce RTX 2080 Ti GPUs. Hardware measurements, batch Bayesian optimization, and the RL-based implementation use a single GPU, while the other components run on the CPU. For experiments on additional target hardware, we used an NVIDIA Titan XP GPU server and an AMD Radeon RX Vega 64 GPU server.
7. Experimental Results
Our experimental evaluation aims to answer the following question: can MetaTune consistently adapt faster and find better solutions with different search algorithms on different platforms? We answer the question by (1) measuring the inference time of CNN models optimized by auto-tuning frameworks with MetaTune (evaluating overall framework efficiency), (2) measuring the model accuracy of MetaTune (isolating cost model efficiency), and (3) repeating the experiments on other platforms (showing portability). We performed hyperparameter tuning for the MetaTune model with the batch Bayesian optimization and RL-based candidate search algorithms; the details can be found in the appendix.
Figure 5 shows the output performance for the four CNN models in FLOPS, while Figure 6 compares the highest FLOPS achieved for the models, normalized to xgb. Overall, meta-BO-T provides the best performance for all configurations, outperforming xgb by 8% (up to 10%) and RL by 13% (up to 33%) on average. meta-BO without super-graph augmentation also consistently improves on xgb and RL, by 5% and 10% on average. That meta-BO-T consistently outperforms meta-BO by 4% on average supports our assumption that super-graph augmentation helps meta-learning models adapt quickly by increasing the structural similarity of the input data.

The figures also show that both meta-BO and meta-BO-T adapt much faster to the compiled kernel than the others, i.e., they move to points more likely to perform well, start the candidate search from them, and reach the highest FLOPS predicted by xgb earlier, between 200 and 650 iterations (dotted lines). As a result, MetaTune can search the space more effectively and widely within the given number of iterations, locating better-performing optimization parameters.

Both MetaTune and xgb-Xfer reuse previous experience for the current task. While xgb-Xfer slightly improves output performance over xgb, all MetaTune models show a more noticeable performance impact of the pre-trained model, outperforming xgb-Xfer by 4 to 7% on average.

MetaTune with the RL-based search algorithm in (Ahn et al., 2020) instead of batch BO significantly outperforms RL, by 7% on average (up to 28%), isolating the performance impact of the cost model. In this scenario, meta-RL produces better results than meta-RL-T for all evaluated models. We identified the source in the adaptive sampler of (Ahn et al., 2020), which performs far fewer hardware measurements, and thus less adaptation, for meta-RL-T than for meta-RL after filtering. We plan to investigate the issue further, but it is out of the scope of this paper.

The evaluation results on the AMD GPU, shown in Figures 8(b) and 8(a), show 18% and 5% performance improvements by meta-BO-T over xgb. This confirms that meta-learning based cost models are a portable solution whose optimization capability is not platform-dependent.

Table 3 shows the end-to-end auto-tuning time for xgb, RL, meta-BO, and meta-BO-T. meta-BO is the fastest, while all four are in a close range (we could not reproduce the results for (Ahn et al., 2020)). meta-BO-T seems to incur more inference overheads with larger super-graph data than meta-BO, but is comparable to or faster than xgb and RL.

Table 4 shows the prediction accuracy of the MetaTune cost model and the gradient-boosting based cost model in TVM through 1,000 iterations. We analyzed the inconsistency with the MSE distribution and found that infeasible or erroneous candidates randomly appear during exploration, and failing to predict performance for them unduly increases the model's MSE.
Figure 5. Output performance (TFLOPS): (a) ResNet-18, (b) VGG-16, (c) SqueezeNet, (d) AlexNet.
Figure 6. End-to-end evaluation of normalized inference time (performance gain over AutoTVM): (a) ResNet-18, (b) VGG-16, (c) SqueezeNet, (d) AlexNet.
Figure 7. Output performance (TFLOPS) on the NVIDIA Titan XP GPU: (a) ResNet-18, (b) AlexNet.
Figure 8. Output performance (TFLOPS) on the AMD Vega 64 GPU: (a) ResNet-18, (b) AlexNet.

network        method     MSE     MSE(D)
alexnet        xgb        0.1329  0.5884
alexnet        xgb-Xfer   0.1394  0.5289
alexnet        meta-BO    0.1342  0.5888
vgg-16         xgb        0.1223  0.5859
squeezenet     xgb        …       …
squeezenet     meta-BO-T  0.1485  …
alexnet(X)     xgb        …       …
alexnet(A)     xgb        0.1018  0.7129
alexnet(A)     meta-BO-T  …       …
resnet-18      xgb        0.1175  0.6318
resnet-18      xgb-Xfer   0.1281  0.5666
resnet-18      meta-BO    …       …
resnet-18      meta-BO-T  0.1061  …
resnet-18(X)   meta-BO-T  …       …
resnet-18(A)   xgb        0.1134  0.6264
resnet-18(A)   meta-BO-T  …       …
Table 4. Model error for xgb and xgb-Xfer vs. meta-BO-T. (X) marks results on the Titan XP, (A) on the AMD GPU.

Therefore, we introduced an additional accuracy metric besides MSE: MSE(D), the mean-square error over fine-tuning instances whose hardware-measured FLOPS are in the top 25%. This metric focuses on how effective the cost model is at identifying the likely best-performing candidates and guiding the search in the right direction. meta-BO-T shows the highest MSE(D) for all models on all platforms, which aligns with the output performance results.

We also investigated the model accuracy in early auto-tuning iterations in detail, to see whether MetaTune adapts more quickly than non-meta-learning models. Figure 9 shows that the MetaTune models start at a lower MSE and maintain high accuracy, while xgb and xgb-Xfer may explore further away from ideal optimizations with more wrong predictions.

Figure 9. Model accuracy (mean square error) in early iterations.
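A small sketch of the MSE(D) computation as defined above (assuming performance values are already normalized, as the reported magnitudes suggest):

```python
import numpy as np

def mse_d(measured, predicted, top_fraction=0.25):
    """MSE(D): mean-square error restricted to fine-tuning instances whose
    hardware-measured FLOPS fall in the top 25%."""
    measured = np.asarray(measured, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    cutoff = np.quantile(measured, 1.0 - top_fraction)
    mask = measured >= cutoff                  # keep only the top performers
    return float(np.mean((predicted[mask] - measured[mask]) ** 2))
```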
We experimented with a pre-trained MetaTune model to see if it can also adapt cross-platform, i.e., whether meta-learned model parameters can work universally for performance distributions on which the model was not trained. Figures 7(a) and 7(b) show the output performance (inference time) for alexnet and resnet-18 on an NVIDIA Titan XP GPU, where meta-BO-T is pre-trained on an NVIDIA 2080 Ti GPU. meta-BO-T provides 14% and 6% higher performance than xgb and xgb-Xfer, showing that knowledge reuse across platforms is possible. Cross-platform meta-learning across more heterogeneous platforms is a part of our future work.
8. Related Work
Adaptive exploration for automatic optimizations
There has been active research on improving the exploration module in auto-tuning frameworks to reduce the total optimization time. Chameleon (Ahn et al., 2020) addressed the issue that (Chen et al., 2018b) does not selectively explore only feasible optimizations by proposing adaptive sampling and exploration algorithms using reinforcement learning. ALT (Zeng et al., 2020) paid attention to the suboptimality caused by a premature cost model during the exploration process; it proposes a filtering mechanism for useless candidates that uses score prediction and active learning. AdaTune (Li et al., 2020) addresses the exploration vs. exploitation trade-off that affects total optimization time and output quality; it proposes surrogate modeling using uncertainty quantification and an evaluator that determines early stopping based on the coefficient of variation. These works are orthogonal to MetaTune, which focuses on the cost model, and can be combined with MetaTune for further optimizations.
Auto-scheduling for tensor programs
FlexTensor (Zheng et al., 2020b) aims to generate dynamic schedules for tensor operations and perform automatic optimization with them on heterogeneous systems. Schedules are generated without code templates, focusing on the search space of one operation at a time. While this approach may locate more flexible and potentially more efficient codes than template-based schedules, it incurs much higher overheads with fine-grained searches. Our results show that the use of templates judiciously restricts the search space while not hurting the quality of the solution. On a related note, Ansor (Zheng et al., 2020a) also automatically generates optimization candidates hierarchically without template-based restrictions. There are no predefined knobs per operation, and a task scheduler learns the right level of hardware measurements as well. This approach requires a larger number of hardware measurements at each hierarchical level, ranging from 300 to 6,000 iterations.
Meta-learning for fast adaptation
Recent research has used MAML to initialize model parameters for quick adaptation in various applications. (Winata et al., 2020) used MAML to adapt to the high variability and complex characteristics of accents for speech recognition. T-NAS (Lian et al., 2020) leveraged MAML to handle multiple tasks in neural architecture search, while MAML is used to improve the accuracy of a super-resolution model for unseen images in (Soh et al., 2020).
9. Conclusion and Future Work
In this paper, we presented a meta-learning based approach to improve the efficiency and reduce the overheads of the auto-tuning process by reusing prior experience. Our experimental results showed that the resulting cost model, MetaTune, can identify high-performing optimization parameters for unseen convolution operations faster and more accurately than prior work, regardless of the search algorithms used with it. We believe that a meta-learning based cost model, or any machine-learning based model that can systematically leverage previously learned knowledge and experience with similar programs, is a pivotal contribution to auto-tuning solutions, addressing their high optimization overheads and allowing them to become a more widely adopted compiler solution.

Our future work includes extending MetaTune to support a wider range of tensor operations beyond convolution operations on different platforms. We believe that the meta-learning based approach can show even stronger and more consistent performance than related work on auto-tuning targets with varying static and dynamic characteristics. We also consider going beyond the boundary of domain-specific compiler frameworks and looking into solving more general optimization problems with the MetaTune approach.
Acknowledgement
This work was supported by the Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIP) (No. 2019-0-01906, Artificial Intelligence Graduate School Program (POSTECH)) and the Super Computer Development Leading Program of the National Research Foundation of Korea (NRF) funded by the Korean government (Ministry of Science and ICT (MSIT)) (No. 2020M3H6A1084853).
References
Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Software available from tensorflow.org.

Byung Hoon Ahn, Prannoy Pilligundla, Amir Yazdanbakhsh, and Hadi Esmaeilzadeh. 2020. Chameleon: Adaptive Code Optimization for Expedited Deep Neural Network Compilation. arXiv:2001.08743 [cs.LG]

Dimitris Bertsimas, John Tsitsiklis, et al. 1993. Simulated annealing. Statistical Science 8, 1 (1993), 10–15.

Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Francisco, California, USA) (KDD '16). Association for Computing Machinery, New York, NY, USA, 785–794. https://doi.org/10.1145/2939672.2939785

Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. 2015. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems. CoRR abs/1512.01274 (2015). http://dblp.uni-trier.de/db/journals/corr/corr1512.html

Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018a. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. In Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation (Carlsbad, CA, USA) (OSDI '18). USENIX Association, USA, 579–594.

Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018b. Learning to Optimize Tensor Programs. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (Montréal, Canada) (NIPS '18). Curran Associates Inc., Red Hook, NY, USA, 3393–3404.

C. Cummins, P. Petoumenos, Z. Wang, and H. Leather. 2017. End-to-End Deep Learning of Optimization Heuristics. In 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT). 219–232. https://doi.org/10.1109/PACT.2017.24
Thomas Desautels, Andreas Krause, and Joel W. Burdick. 2014. Parallelizing Exploration-Exploitation Tradeoffs in Gaussian Process Bandit Optimization. J. Mach. Learn. Res. 15, 1 (Jan. 2014), 3873–3923.

Aditya Dhakal, Junguk Cho, Sameer G. Kulkarni, K. K. Ramakrishnan, and Puneet Sharma. 2020. Spatial Sharing of GPU for Autotuning DNN Models. arXiv:2008.03602 [cs.NE]

Yufei Ding, Jason Ansel, Kalyan Veeramachaneni, Xipeng Shen, Una-May O'Reilly, and Saman Amarasinghe. 2015a. Autotuning Algorithmic Choice for Input Sensitivity. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation (Portland, OR, USA) (PLDI '15). Association for Computing Machinery, New York, NY, USA, 379–390. https://doi.org/10.1145/2737924.2737969

Yufei Ding, Jason Ansel, Kalyan Veeramachaneni, Xipeng Shen, Una-May O'Reilly, and Saman Amarasinghe. 2015b. Autotuning Algorithmic Choice for Input Sensitivity. SIGPLAN Not. 50, 6 (June 2015), 379–390. https://doi.org/10.1145/2813885.2737969

Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. arXiv:1703.03400 [cs.LG]
Grigori Fursin, Yuriy Kashnikov, Abdul Wahid Memon, Zbigniew Chamski, Olivier Temam, Mircea Namolaru, Elad Yom-Tov, Bilha Mendelson, Ayal Zaks, Eric Courtois, François Bodin, Phil Barnard, Elton Ashton, Edwin V. Bonilla, John Thomson, Christopher K. I. Williams, and Michael F. P. O'Boyle. 2011. Milepost GCC: Machine Learning Enabled Self-tuning Compiler. Int. J. Parallel Program. 39, 3 (2011), 296–327. http://dblp.uni-trier.de/db/journals/ijpp/ijpp39.html

Samuel J. Kaufman, Phitchaya Mangpo Phothilimthana, Yanqi Zhou, and Mike Burrows. 2020. A Learned Performance Model for the Tensor Processing Unit. arXiv:2008.01040 [cs.PF]

Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In Proceedings of the 5th International Conference on Learning Representations (Toulon, France) (ICLR '17). https://openreview.net/forum?id=SJU4ayYgl

Gregory Koch. 2015. Siamese Neural Networks for One-Shot Image Recognition.

Chris Leary and Todd Wang. 2017. XLA: TensorFlow, compiled. TensorFlow Dev Summit (2017).

Menghao Li, Minjia Zhang, Chi Wang, and Mingqin Li. 2020. AdaTune: Adaptive Tensor Program Compilation Made Efficient. In Advances in Neural Information Processing Systems 33.

Dongze Lian, Yin Zheng, Yintao Xu, Yanxiong Lu, Leyu Lin, Peilin Zhao, Junzhou Huang, and Shenghua Gao. 2020. Towards Fast Adaptation of Neural Architectures with Meta Learning. ICLR (2020).

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.). Curran Associates, Inc., 8024–8035. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf

Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. 2013. Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (Seattle, Washington, USA) (PLDI '13). Association for Computing Machinery, New York, NY, USA, 519–530. https://doi.org/10.1145/2491956.2462176

Nadav Rotem, Jordan Fix, Saleem Abdulrasool, Garret Catron, Summer Deng, Roman Dzhabarov, Nick Gibson, James Hegeman, Meghan Lele, Roman Levenstein, Jack Montgomery, Bert Maher, Satish Nadathur, Jakob Olesen, Jongsoo Park, Artem Rakhov, Misha Smelyanskiy, and Man Wang. 2019. Glow: Graph Lowering Compiler Techniques for Neural Networks. arXiv:1805.00907 [cs.PL]

Jae Woong Soh, Sunwoo Cho, and Nam Ik Cho. 2020. Meta-Transfer Learning for Zero-Shot Super-Resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3516–3525.

H. Sung, T. Chen, A. Eichenberger, and K. K. O'Brien. 2019. POSTER: CogR: Exploiting Program Structures for Machine-Learning Based Runtime Solutions. In 2019 28th International Conference on Parallel Architectures and Compilation Techniques (PACT). 485–486. https://doi.org/10.1109/PACT.2019.00057

Jakub M. Tomczak, Romain Lepert, and Auke Wiggers. 2019. Simulating Execution Time of Tensor Programs using Graph Neural Networks. arXiv:1904.11876

Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary DeVito, William S. Moses, Sven Verdoolaege, Andrew Adams, and Albert Cohen. 2018. Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions. arXiv:1802.04730 [cs.PL]

James T. Wilson, Riccardo Moriconi, Frank Hutter, and Marc Peter Deisenroth. 2017. The reparameterization trick for acquisition functions. arXiv:1712.00424 [stat.ML]

Genta Indra Winata, Samuel Cahyawijaya, Zihan Liu, Zhaojiang Lin, Andrea Madotto, Peng Xu, and Pascale Fung. 2020. Learning Fast Adaptation on Cross-Accented Speech Recognition. arXiv:2003.01901 [eess.AS]

Jian Wu and Peter I. Frazier. 2016. The Parallel Knowledge Gradient Method for Batch Bayesian Optimization. In Proceedings of the 30th International Conference on Neural Information Processing Systems (Barcelona, Spain) (NIPS '16). Curran Associates Inc., Red Hook, NY, USA, 3134–3142.
X. Zeng, T. Zhi, Z. Du, Q. Guo, N. Sun, and Y. Chen. 2020. ALT: Optimizing Tensor Compilation in Deep Learning Compilers with Active Learning. In 2020 IEEE 38th International Conference on Computer Design (ICCD). 623–630. https://doi.org/10.1109/ICCD50377.2020.00108

Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, Joseph E. Gonzalez, and Ion Stoica. 2020a. Ansor: Generating High-Performance Tensor Programs for Deep Learning. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). USENIX Association, 863–879.

Size Zheng, Yun Liang, Shuo Wang, Renze Chen, and Kaiwen Sheng. 2020b. FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (Lausanne, Switzerland) (ASPLOS '20). Association for Computing Machinery, New York, NY, USA, 859–873.