MLGO: a Machine Learning Guided Compiler Optimizations Framework
Mircea Trofin ∗ Google, Inc. [email protected]
Yundi Qian ∗ Google, Inc. [email protected]
Eugene Brevdo Google, Inc. [email protected]
Zinan Lin Carnegie Mellon University [email protected]
Krzysztof Choromanski Google, Inc. [email protected]
David Li Google, Inc. [email protected]
Abstract
Leveraging machine-learning (ML) techniques for compiler optimizations has been widely studied and explored in academia. However, the adoption of ML in general-purpose, industry-strength compilers has yet to happen. We propose MLGO, a framework for integrating ML techniques systematically in an industrial compiler, LLVM. As a case study, we present the details and results of replacing the heuristics-based inlining-for-size optimization in LLVM with machine-learned models. To the best of our knowledge, this work is the first full integration of ML in a complex compiler pass in a real-world setting. It is available in the main LLVM repository. We use two different ML algorithms, Policy Gradient and Evolution Strategies, to train the inlining-for-size model, and achieve up to 7% size reduction when compared to the state-of-the-art LLVM -Oz. The same model, trained on one corpus, generalizes well to a diversity of real-world targets, as well as to the same set of targets after months of active development. This property of the trained models is beneficial for deploying ML techniques in real-world settings.

∗ These authors contributed equally. We welcome your feedback! Please open an issue at https://github.com/google/ml-compiler-opt with the label "paper".

1 Introduction

Previous work [13, 25] has shown promise in replacing compiler optimization heuristics with machine-learned policies. Heuristics are algorithms that, empirically, produce reasonably optimal results for hard problems, within pragmatic constraints (e.g. "reasonably fast"). In the compiler case, heuristics are widely used in optimization passes, even those leveraging profile feedback, such as inlining and register allocation. Such passes have a significant impact on the performance of a broad variety of programs. These problems are often NP-hard, and searching for optimal solutions may require exponential time or memory. Reinforcement Learning (RL) is a family of machine learning techniques that may be applied to find increasingly optimal solutions through an automated, iterative exploration and training process.

Our focus is ahead-of-time (AOT) compilers, specifically C/C++. In a real-world setting, we expect two main benefits from machine learning techniques. First, heuristics are human-trained, based on a human-manageable set of benchmarks and regression cases. Machine learning easily scales to large corpora of training examples, which we expect to increase the likelihood of obtaining policies that generalize well. This is important because, as we will explore in detail, we do not want to retrain policies too frequently (it is an adoption blocker), nor do we want to train 'online', while the compiler is running in production (it would affect determinism). Second, heuristics are human-written code that needs to be maintained. This places a downward pressure on the number of program properties ("features"), and the combinations between them, that can be practically leveraged. We believe using more features and feature combinations would result in better optimization decisions. ML scales well with the addition of features, and can discover profitable feature combinations. While ML techniques may be able to address these two points, a trade-off is that maintaining and evolving them requires practices and approaches different from those used for heuristics.

As pointed out, applying ML to compiler optimizations has been explored by academia, but it has not been adopted in production environments.
To explore why, we chose a pilot optimization problem and approached it with the intention to deploy in production. The goal of the pilot is to inform problem framing and design choices. Other than performing better than the tip-of-tree production compiler, we did not aim to advance the state of the art for the pilot problem. The chosen problem is inlining-for-size in LLVM, and in particular, the inlining decision heuristics. The expectation was that this would offer representative challenges: size optimization is important for real-world scenarios, such as mobile software, and inlining is a particularly challenging optimization (see Section 2.2).

We chose size rather than speed for the pilot because size is relatively easy to measure and non-noisy, which we expected to aid rapid prototyping by removing one source of potential problems (noisy rewards). We acknowledge that translating our experience to performance problems may seem non-immediate at a first glance, but we believe that not to be the case: for example, inlining for speed can be understood partially as a "for-size" problem, with respect to the instruction working set (more in Section 8).

Figure 1. MLGO Overview

From a very high level, our MLGO framework separates the use of the compiler from the training of policies, as shown in Figure 1. Day-to-day production use is unchanged; as an implementation detail, a trained model embedded in the compiler is used to make decisions (in this case, inlining) that were previously handled by a manual heuristic. Training happens separately, using a large, representative corpus of intermediate representation (IR) modules. The training process is iterative, each step using an updated policy, so the policy is not embedded in the compiler. During training, the inliner produces a log that records the inlining process (features, decisions, etc.). The logs are collected and fed to the training algorithm to produce a new model.

The paper is organized as follows: Section 2 provides an overview of the relevant ML techniques and the LLVM inliner. Section 3 gives an overview of MLGO: our framing of the problem of applying ML to compiler optimizations, and our methodology. Section 4 describes the policy training in MLGO for the inlining problem. Section 5 details the implementation, in LLVM, of our pilot project. The results are presented in Section 6 and the related work is described in Section 7. Finally, in Section 8, we discuss our plans for applying the lessons learned so far to speed problems, as well as next steps in ML techniques that we are considering. Section 9 concludes the paper.

A note: throughout this paper, we use the terms policy and model interchangeably. A compiler optimization policy is a decision rule that takes actions inside the optimization pipeline (e.g., "should we inline this call graph edge or not?"). "Model" refers to a neural network implementing an optimization policy. Also, we use the term heuristic to refer to manually-crafted decision rules.
2 Background

2.1 Reinforcement Learning and Evolution Strategies

There are two characteristics that make reinforcement learning (RL) a suitable tool for replacing compiler optimization heuristics: 1) there are no examples showing optimal strategies for these heuristics — in the inlining problem, we don't know whether inlining or not for a certain call site is the optimal choice; 2) we can efficiently explore different strategies, and improve strategies from those experiences. The absence of examples ("labels") means we cannot use supervised learning. In contrast, RL is an area of machine learning that learns from trial and error instead of from given labels. It has proven successful in robotics, playing Atari games, playing the game of Go, etc. [16, 18, 24]. In RL, an agent (i.e., the compiler) learns by repeatedly interacting with the environment (i.e., compiling) and gradually improves its policy (i.e., decision rules). More specifically, by compiling software again and again with different strategies, the compiler will come up with better and better policies on its own with RL algorithms.

Previous work has shown Evolution Strategies (ES) to be a competitive alternative to RL algorithms on MuJoCo and Atari tasks [8, 22]. Motivated by this, we also tried this method on compiler optimization problems. ES are a class of black-box optimization techniques. Like RL, ES training is also able to gradually improve the strategy with trial and error, and thus is also a suitable tool for compiler optimization problems.
2.2 The LLVM Inliner

Today's LLVM inliner is a pass operating on one strongly-connected component (SCC) of the static call graph of a module at a time, in bottom-up order. (In LTO or ThinLTO mode, a module can consist of IR from multiple source files.) The inlined callee's call sites are added to a work list and iteratively considered for inlining in a top-down fashion. A pipeline of optimizations (instruction combining, scalar replacement of aggregates (SROA), loop optimizations, etc.) is then applied on each function in the SCC, after the SCC was processed. (The DAG walk is repeated up to a set number of times if de-virtualization happens in the cleanups.) The effects of these optimizations impact inlining decisions of call sites in the SCCs calling into the last one.

LLVM inlining consists of many heuristics: the choice of call site traversal, the set of "cleanup" function passes run on functions after they are modified because call sites were inlined, the timing of these cleanups, and, finally, the decision to inline or not a specific call site.

The decision to inline or not a call site is itself built on top of a rich set of heuristics. The compiler first computes the static "cost" of the callee post-inlining by traversing the callee body, simulating post-inline cleanup passes. If some call site arguments are known to be constant at compile time, that information is used to evaluate which instructions / basic blocks would be simplified, should the inlining be carried out. The computed cost is then compared with a threshold. The threshold value is based on properties such as call site hotness, the inline keyword, etc. Bonuses are also given to callees with a single basic block or a high percentage of SIMD instructions. In certain cases, the compiler may also choose to defer inlining, if inlining the caller itself into its own callers first may result in better savings: it may be better to make a local, non-optimal decision that, later, would open the opportunity for better optimizations due to more context being available, such as call parameters propagating from callers further up in the call graph.

These sets of heuristics have been tuned for years, and our pilot project replaces the manual decision process described above with ML models.
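To make the shape of this decision concrete, the sketch below shows a cost-versus-threshold rule of the kind described above. It is purely illustrative: the names, the bonus, and the threshold values are hypothetical and do not reflect LLVM's actual cost model, which is considerably richer.

    # Illustrative sketch of a cost-vs-threshold inlining heuristic.
    # All names and constants are hypothetical; LLVM's real cost model
    # simulates cleanup passes and uses many more inputs.
    from dataclasses import dataclass

    @dataclass
    class CallSite:
        estimated_callee_cost: int      # static cost of callee after simulated cleanups
        is_hot: bool                    # e.g. from profile feedback
        has_inline_keyword: bool
        callee_is_single_block: bool

    def should_inline(cs: CallSite) -> bool:
        threshold = 225                 # hypothetical base threshold
        if cs.is_hot:
            threshold *= 2              # hotter call sites get a larger budget
        if cs.has_inline_keyword:
            threshold *= 3
        cost = cs.estimated_callee_cost
        if cs.callee_is_single_block:
            cost -= 50                  # bonus: trivial callees are cheap to inline
        return cost < threshold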
3 MLGO

MLGO is a set of guidelines and requirements derived from our understanding of the problem of leveraging ML techniques to replace manual optimization heuristics. We start with our understanding of the participating personas and their scenarios, which then motivates the MLGO guidelines and design decisions.
3.1 The Compiler User

This user wants to benefit from improved compiler optimizations. They care about: correctness and performance of the generated code; compilation determinism (i.e. identical output for identical input), to leverage incremental builds; avoiding the added cost and complexity of new infrastructure requirements on build and release pipelines, such as new compiler run-time dependencies or new steps (like training); and timeliness of the build, as it impacts hardware resource planning and developer productivity. (There are non-ML-driven optimization alternatives that trade off significantly increased compilation time for improved optimizations; an ML alternative needs to be competitive to justify its other trade-offs.) Our goal is to introduce no changes to this user.

To achieve this, the MLGO guidelines are:

1. To maintain correctness guarantees, we replace heuristics, not semantics-preserving code. For example, we change the decision-making process for carrying out function inlining, not how the inlining action is implemented. This is along the lines of the separation of correctness vs. policy observed earlier in [27].
2. 'Online' training - meaning, training while the compiler is executing in production - is an anti-goal for us: it would hurt determinism and compilation performance. Instead, policy training happens offline. Trained policies are embedded in the compiler as statically linked native code, and the resulting compiler is subjected to the same release process it currently is. Build and release infrastructures and pipelines of targets using the compiler do not need to be changed. Build determinism using the ML-enabled compiler is ensured because the policies are fixed: no training happens when the compiler runs, only inference. While native compilation doesn't guarantee timeliness, it eliminates one source of concern. Due engineering diligence still needs to be applied to ensure timely feature extraction, for instance.

3. We require ML techniques that yield policies that generalize well over different code bases and code changes, and that do not need frequent retraining. The compiler user doesn't have to worry about policy training (although they are free to do so and potentially get better results). This is akin to how, in the context of manual heuristics, a compiler user doesn't have to fine-tune passes (or author code in them) to get reasonable results. In particular, we do not see automated tuning of existing heuristic parameters as a viable solution. Tuning parameters have been available in compilers for a long time, and the experience has been that a set of values does not translate well from target to target. The policy, while adjustable, is still dominated by combinations/evaluations identified manually. In addition, requiring re-tuning would complicate product build and release pipelines, which we want to avoid on behalf of our user.

We refer to this use of policies as release mode (since it is encountered by users of a released/shipped compiler).
3.2 The Compiler Engineer

This user wants to drive better optimizations in the compiler, diagnose regressions, and incorporate findings:

1. Policy Creation. The engineer wants to incorporate ML techniques in a compiler optimization pass.

2. Policy Improvement. Here, they investigate a specific regression encountered in production, or want to improve an ML-enabled pass.

3. "The Ship Blocker".
The engineer must quickly resolve a ship-blocking regression introduced by a hot patch and caused by a misbehaving RL-enabled policy.

In all of these cases, the compiler engineer improves a policy through repeated exploration and training (see Section 4.4). They want flexibility in replacing the model under training, and have less concern with timeliness and determinism, especially since models under training may use small random perturbations to facilitate exploration. We refer to this use of policies as development mode. Here, models are loaded via a command line option, the compiler may have extra runtime dependencies, and model evaluation may involve changes to the runtime behavior of the compiler — because, for example, the model evaluators may be multi-threaded and/or JIT-ing.

Because of the tension between heuristic code complexity and the hypothesized ability to improve the heuristic by incorporating more features (as discussed in the introduction), MLGO forgoes goals of human comprehensibility of the resulting policy (in contrast to [25]). Instead, we focus on developing and evaluating alternative methodologies to address the above scenarios. We discuss our current understanding of the trade-offs, and expect that more clarity will arise as we apply the approach through the lifetime of a number of diverse projects. Section 4 will detail our experience with developing Policy Creation. We have less experience, at this point, with Policy Improvement and "The Ship Blocker", and derive our direction from experiences in other domains.
Policy Improvement is currently (i.e. for manual heuristics) an iterative engineering process. The trigger is typically a regression identified in the field. The compiler engineer diagnoses the problem, hypothesizes a solution, then ensures that the solution does not introduce regressions in some corpus of benchmarks; if regressions happen, the process is repeated. In the MLGO methodology, we envision a gradual process. We do not believe it presents significant negative trade-offs compared to the state of the art:

1. Start by incorporating the regression use-case(s) into the training corpus and retrain the policy.

2. If that fails, hypothesize missing features. This requires some manual diagnosing of the current policy behavior. While we treat the policy as a black box, we do observe its effects, and can formulate hypotheses as to what information may be missing - since the information we provide (features) is also observable. The needed skill set is close to what compiler engineers currently employ for manual heuristic development, and, just like for manual heuristics, evolving the feature set is likely an iterative process. Typically, adding features and retraining shouldn't result in regressions for the previous training corpus, which is a benefit of our approach over the manual heuristic case. We should note that, if applying feature auto-extraction [10, 13] proves feasible in production, this step collapses into the previous step.

3. If the above also fails, involve an ML expert to investigate alternative training algorithms. This is akin to today's (rare) full pass rewrites (for example: a new register allocation pass). The difference is the need for cross-disciplinary interaction. Our hope is that, with time and experience, MLGO will offer a reusable library of best practices and training solutions available "off the shelf" to compiler engineers.
Ship blockers are those cases where the compiler engineer doesn't have the luxury of doing deep investigations into compiler behavior, since they are on a tight time budget. Assuming the pathological case is identified (i.e. which compilation unit causes the compiler to misbehave), in the case of manual heuristics the levers of control are: trial-and-error with different compilation flag values (change policy thresholds, for instance); modify user code (use inlining directives, for instance); or disable the specific optimization for a specific module.

In MLGO, the picture is similar. Other than policy thresholds, the control levers available to the engineer are the same. In addition, the engineer may choose to revert to an earlier version of the policy, or to manual heuristics for the problematic module, and, if needed, experiment with threshold flags. Specific to ML-based policies, we are exploring local training and overfitting: as we will detail, our experience so far shows that it is possible to train a policy on a single modern, multi-core workstation and obtain a reasonably good result within a day. An engineer could attempt to specialize a policy to overfit for the pathological case, and compile that case with the specialized policy (while compiling the rest of the project with the non-overfitted policy). This is similar to "experimenting with flags", with the exception that the exploration is directed by a training algorithm and is more likely to converge quickly to a solution. The trade-off is that changing heuristic flag values does not require a training infrastructure — even if that infrastructure could be run locally.
4 Policy Training

In this section, we show how we use reinforcement learning (RL) and evolution strategies (ES) to train inlining policies in the MLGO framework. Sections 4.1 and 4.2 present how we train the inlining policy for the inlining-for-size problem with RL and ES algorithms, and Section 4.3 compares the pros and cons of the two algorithms. Section 4.4 concludes the section by giving an overview of our policy training infrastructure (https://github.com/google/ml-compiler-opt).

4.1 Policy Training with Reinforcement Learning

RL aims to find an optimal policy for a Markov Decision Process (MDP). An MDP is a mathematical framework that models sequential decision making — in the inlining-for-size problem, we make sequential decisions on whether to inline or not. An MDP can be represented by the tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R} \rangle$ with state space $\mathcal{S}$, action space $\mathcal{A}$, state transition distribution $\mathcal{P}(s' \mid s, a)$, and reward function $\mathcal{R}(s, a)$. In the MDP formalism, at time $t$ the agent observes the state $s_t \in \mathcal{S}$ of the environment, then decides to take an action $a_t \in \mathcal{A}$. It also receives the reward $r_t = \mathcal{R}(s_t, a_t)$. The environment state then transitions to $s_{t+1} \in \mathcal{S}$ by sampling from the probability distribution $\mathcal{P}(s_{t+1} \mid s_t, a_t)$. This process repeats until the agent reaches a termination state at time $T$. The agent's decisions are a function (we call it the policy) $\pi = Pr(a \mid s)$ that maps the observed state $s$ to a distribution over actions. In our case, $\pi$ is a neural network and we call it the policy network. RL algorithms aim to find the optimal policy $\pi^*$ that maximizes the total reward $R = \sum_{t=0}^{T} r_t$.

We first formulate the inlining-for-size problem as an MDP. The inlining pass traverses the call sites in the call graph in a deterministic order and decides at each call site whether to inline or not. Every inlining operation changes the call graph. We treat this as a sequential decision process, and we formulate it as an MDP as follows:

state $\mathcal{S}$: we define the current call graph and the call site being visited to be the state.

action $\mathcal{A}$: $\mathcal{A} = \{0, 1\}$, where 1 means inline and 0 means do not inline.

state transition probability $\mathcal{P}$: unlike in usual MDPs, the state transition is deterministic (no randomness) in the inlining problem. After an action is taken (inline or not inline), the compiler determines what the next state is (it updates the call graph and decides the next call site to visit).

reward $\mathcal{R}$: the reward is defined to be the native size reduction after the action is taken. If $a = 0$, the reward is 0; if $a = 1$:

$r = S(\mathit{Caller}_{before}) - S(\mathit{Caller}_{after}) + \begin{cases} S(\mathit{Callee}), & \text{callee deleted} \\ 0, & \text{callee remains} \end{cases}$    (1)

where $S(f)$ is the native size of function $f$. Note that we do not actually know what the native size would be for a certain function while performing inlining, since inlining operates at the IR level. The reward definition here is not practical for training. We will discuss how we tackle this challenge next.

Policy Gradient (PG) [28] is a family of RL algorithms derived from REINFORCE [34]. Though we use Proximal Policy Optimization (PPO) [23], we first briefly introduce REINFORCE, as PPO is an enhancement to REINFORCE and they work in very similar ways. At a high level, all PG algorithms gradually improve the policy $\pi_\theta$ by computing the gradient of the total reward $R$ with respect to the parameters $\theta$ of the policy network, and then updating $\theta$ with the gradient to improve the policy. With $J(\theta)$ denoting the expected reward under policy $\pi_\theta$, the gradient $\nabla_\theta J(\theta)$ in REINFORCE is computed as:

$\nabla_\theta J(\theta) = \mathbb{E}\left[ \sum_{t=0}^{T} R \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]$    (2)

(In general, the total discounted reward is $R = \sum_{t=0}^{T} \gamma^t r_t$, where $\gamma$ is the discounting factor; we take $\gamma = 1$.) Here $\mathbb{E}$ is an expectation over the policy $\pi_\theta$ being applied to an inlining pass. In practice, this expectation is approximated with Monte Carlo methods — with $n$ trajectories collected from compiling with policy $\pi_\theta$ (a trajectory is defined as $(s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_T, a_T, r_T)$, with total reward $R = \sum_{t=0}^{T} r_t$), the parameter $\theta$ is updated with:

$\theta \leftarrow \theta + \frac{\alpha}{n} \sum_{i=1}^{n} \left\{ \sum_{t=0}^{T} R_i \, \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \right\}$    (3)

where $\alpha$ is the learning rate. As $\theta$ is updated, the policy $\pi_\theta$ tends to evolve in the direction that increases the total reward. Algorithm 1 describes the process — as training progresses, the policy gradually improves on its own by iterating between two stages: 1) compile with the new policy and collect fresh trajectories; 2) update the policy network parameters $\theta$.

Algorithm 1 MLGO PG Training Algorithm
    Initialize $\theta$
    for iteration = 1, 2, ... do
        Compile with policy $\pi_\theta$ to collect $n$ trajectories
        Update $\theta$ using Equation 3
    end for

The details of training with PPO, which has several additional terms in the loss function, are available in [23]. One core improvement of PPO is to subtract a baseline $B$ from the reward to reduce the variance. Equation 2 is modified as:

$\nabla_\theta J(\theta) = \mathbb{E}\left[ \sum_{t=0}^{T} (R - B) \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]$    (4)

Here the baseline $B$ describes what $R$ is expected to be, irrespective of the policy. By subtracting it, $R - B$ provides better information about the effectiveness of the policy $\pi_\theta$. The total reward $R$ in these equations can be replaced with the return following action $a_t$: $\sum_{t'=t}^{T} r_{t'}$. In that case, the baseline $B$ is a value network $V(s_t)$ predicting the future return $\sum_{t'=t}^{T} r_{t'}$ from the state $s_t$. We choose to use the total reward $R$ since: 1) it is directly available in the inlining-for-size problem, while partial returns would have to be approximated; 2) it is difficult to build the value network $V(s_t)$ with the reduced state. We will discuss the details in the next section.

We run into two challenges when applying PPO to the inlining-for-size problem: 1) a complex state space; 2) an impractical reward definition.
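As a concrete illustration of the per-call-site reward in Equation 1 (the quantity the second challenge refers to), a minimal sketch follows. It is illustrative only: the per-function native sizes are passed in as if they were known, whereas during inlining they are not, which is precisely the problem.

    # Illustrative sketch of the per-call-site reward from Equation 1.
    # In reality the native sizes are unknown at inlining time, which is why
    # training ends up using the total per-module size reduction instead.
    def callsite_reward(action: int,
                        caller_size_before: int,
                        caller_size_after: int,
                        callee_size: int,
                        callee_deleted: bool) -> int:
        if action == 0:          # not inlined: no size change attributed to this step
            return 0
        reward = caller_size_before - caller_size_after
        if callee_deleted:       # the callee body disappears entirely
            reward += callee_size
        return reward

The total reward for a module is the sum of these per-step values, which is why it can be measured simply as the difference in native size with and without the policy.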
Complex state space: Our MDP formulation defines the state as the current call graph and the call site being visited. Unfortunately, encoding and processing a call graph at each decision point may not be computationally affordable for a general-purpose compiler.
Impractical reward definition: It is difficult to know a function's native size $S(f)$ during the inlining pass, because native code lowering happens in a later pass, and because the function's structure may change as more of its call sites are inlined.

To tackle the first challenge, we approximate the true state by distilling the state space into 11 numerical features, as listed in Table 1. These features describe the local call site and provide some global information about the call graph. Section 5.1 details the features we use. We considered, but rejected for now, the use of (IR) code embedding techniques [3]; this allows us to minimize additional computational/memory costs. We plan to consider such techniques in the future.

Table 1. Features for Inlining for Size

    caller features:      caller_basic_block_count, caller_conditionally_executed_blocks, caller_users
    callee features:      callee_basic_block_count, callee_conditionally_executed_blocks, callee_users
    call site features:   callsite_height, cost_estimate, number_constant_params
    call graph features:  edge_count, node_count
One drawback of the simplified state is that it greatly reduces the information available to the policy — it only contains a part of the local call site information, and limited global call graph information. We do not expect this to hurt the policy network, because it is roughly the same information available to the current inlining heuristic. However, this reduced state vector does not allow us to build the value network $V(s_t)$ baseline — at a certain time $t$, the simplified state $s_t$ is not informative enough to predict the future return $\sum_{t'=t}^{T} r_{t'}$.

A simple approach to side-stepping the lack of partial reward information, and the side effect of the reduced state representation, is to use the total reward $R$ instead of the partial return, as shown in Equation 2. While the per-step reward is difficult to get, the total reward is the sum of the individual (unknown) rewards — it is relatively easy to evaluate: evaluate the native size with / without inlining, and subtract. In the total reward setup, the baseline $B$ is defined as the estimated native size reduction of the module after the inlining pass. We can use the native size reduction under the heuristic policy as the baseline $B$. Using the total reward instead of partial rewards has its drawbacks: 1) more data needs to be collected to achieve the same performance; 2) the final model quality may be worse.

Instead of having the RL algorithm learn from scratch (initializing $\theta$ randomly), we facilitate training by initializing $\theta$ from a "warmstart" policy that already performs reasonably well. An intuitive choice is the heuristic inlining decisions in LLVM. Therefore, we train the warmstart policy to imitate the heuristic inlining decisions in LLVM using a behavioral cloning algorithm [4]. The behavioral cloning algorithm essentially views the problem as a supervised learning problem, where the features are the same as in the RL training and the label is the heuristic inlining decision — it trains a neural network that makes inlining decisions as close as possible to those of the heuristic inliner. In this way, we get a policy that makes decisions similar to LLVM's current inlining heuristics and that can thus serve as the warmstart policy to make our RL training much faster.

4.2 Policy Training with Evolution Strategies

Previous work has shown that ES, as a gradient-free black-box optimization technique, is a competitive alternative to RL algorithms on MuJoCo and Atari tasks [22]. ES focuses on black-box optimization problems of the form $\max_\theta F(\theta)$, where $F$ can be any black-box function that can be evaluated. Given $\theta$, we essentially assume we have an oracle that calculates $F(\theta)$. In our specific case, $\theta$ are the parameters of the policy network $\pi_\theta$ and $F(\theta)$ is the total reward $R$ — the native size reduction after inlining under policy $\pi_\theta$ for a certain module.

Instead of directly optimizing $F(\theta)$, ES optimizes $J(\theta)$, a smoothed version of $F(\theta)$:

$\max_\theta J(\theta) = \max_\theta \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I)} F(\theta + \sigma\varepsilon)$    (5)

Here $\mathcal{N}(0, I)$ denotes the multivariate normal distribution with zero mean and identity covariance matrix. Similar to PG, ES also takes the gradient of $J(\theta)$ with respect to the parameters $\theta$:

$\nabla_\theta J(\theta) = \frac{1}{\sigma} \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I)} \{ F(\theta + \sigma\varepsilon)\, \varepsilon \}$    (6)

and uses a Monte Carlo approximation of the gradient to update $\theta$ and improve the policy:

$\theta \leftarrow \theta + \frac{\alpha}{n\sigma} \sum_{i=1}^{n} \{ F(\theta + \sigma\varepsilon_i)\, \varepsilon_i \}$    (7)

where $\alpha$ is the learning rate and the $\varepsilon_i$ are vectors sampled from $\mathcal{N}(0, I)$. Algorithm 2 describes the ES algorithm. Similar to PG, ES also iterates between data collection and policy updates to gradually improve the policy.

Algorithm 2 MLGO ES Training Algorithm
    Initialize $\theta$
    for iteration = 1, 2, ... do
        Sample $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n \sim \mathcal{N}(0, I)$
        Compile with policies $\pi_{\theta + \sigma\varepsilon_i}$ to get $F(\theta + \sigma\varepsilon_i)$
        Update $\theta$ based on Equation 7
    end for

4.3 PG v.s. ES

While the policy gradient algorithm and the evolution strategies algorithm are similar at a high level, they differ in many ways and have their own pros and cons.
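Before comparing the two point by point, the following minimal numpy sketch contrasts one PG update (Equation 3) with one ES update (Equation 7). It is illustrative only: the gradient-of-log-probability function and the reward oracle are stand-ins for the real policy network and compilation pipeline.

    import numpy as np

    # One Policy Gradient (REINFORCE-style) update, Equation 3:
    # needs the full trajectory (states, actions) plus the total reward R.
    def pg_update(theta, trajectories, grad_log_pi, lr=0.01):
        grad = np.zeros_like(theta)
        for states, actions, total_reward in trajectories:
            for s, a in zip(states, actions):
                grad += total_reward * grad_log_pi(theta, s, a)
        return theta + lr * grad / len(trajectories)

    # One Evolution Strategies update, Equation 7:
    # only needs a reward oracle F(theta), e.g. "size reduction when compiling
    # a module with the perturbed policy"; no trajectories are logged.
    def es_update(theta, reward_fn, lr=0.01, sigma=0.1, n=32):
        grad = np.zeros_like(theta)
        for _ in range(n):
            eps = np.random.standard_normal(theta.shape)
            grad += reward_fn(theta + sigma * eps) * eps
        return theta + lr * grad / (n * sigma)

Note that the ES update only touches a reward oracle, while the PG update needs the logged per-step trajectory; this is the engineering difference discussed next.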
Complexity: The key advantage of ES is that it is conceptually simpler. 1) It requires less engineering complexity — unlike the PG algorithm, where we need the logged trajectory from inlining $(s_0, a_0, s_1, a_1, \ldots, s_T, a_T)$ and the total reward $R$ for training, the ES algorithm only needs the total reward. Thus we do not need to log the trajectory while doing compilation; this reduces both engineering complexity and storage/network requirements. 2) It has fewer requirements on the problem structure, as long as there is an oracle giving the total reward $F(\theta)$ under the policy parameters $\theta$. As a result, it is easier to apply ES to other compiler optimization problems, as PG requires formulating the optimization problem as an MDP.
Sample Efficiency: Sample efficiency quantifies the amount of data required for training. The key advantage of PG is that it has much higher sample efficiency than ES. In the inlining-for-size problem, we observed that, even though PG is trained using the total reward, over 20X the computational resources are needed to train an ES policy of similar quality. In problems where partial reward information is available after every decision point, we expect the sample efficiency gap to be even larger.

4.4 Policy Training Infrastructure

PG and ES algorithms are very similar in policy training at a high level — both improve the policy $\pi_\theta$ by iterating between compiling with policy $\pi_\theta$ to collect data and updating the parameters $\theta$. Figure 2 shows their training workflow. Before training, we prepare an IR corpus consisting of pre-inlining IR files extracted from some software. At each iteration, the trainer sends the policy $\pi_\theta$ to the data collector; the data collector samples several IR files from the IR corpus, does compilation to collect training data, and sends the training data back to the trainer. Training finishes after several iterations, and the trainer exports the trained policy. We use TF-Agents [29], an RL library in TensorFlow [1], for training, and the policy is in the format of a TensorFlow SavedModel.
Figure 2. System Overview: Policy Training

The bottleneck of training for the inlining problem is data collection. Therefore, data collection is carried out in parallel to improve the overall training efficiency. Figure 3 details how the data collector module works. It is supported by the development mode of the MLGO framework. It takes a pre-inlining IR file and (optionally) a policy as inputs, conducts inlining on the IR file based on the policy, has the post-inlining IR file optimized by the other opt passes that run after inlining, converts the optimized IR file into native code, and gets the native size of this module. The native size, together with the log file generated during inlining by MLGO that contains the trajectory $(s_0, a_0, s_1, a_1, \ldots, s_T, a_T)$, composes the output of the data collector module — the training data. If the policy is not given, the inliner will conduct the current heuristic inlining and log the trace. This has two use-cases, as discussed in Section 4.1: 1) collect data to train the warmstart policy with the behavioral cloning algorithm; 2) use the heuristic inlining as the baseline. The ES algorithm only needs the reward (native size) for training, so the log file is not needed.
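A schematic sketch of this collection loop is given below, assuming a hypothetical compile_and_measure helper that wraps the development-mode compiler invocation; the real pipeline in the ml-compiler-opt repository follows the same shape, with compilations fanned out to parallel workers.

    import random
    from concurrent.futures import ProcessPoolExecutor

    # Hypothetical helper: compile one pre-inlining IR module with the given
    # policy (or the heuristic when policy_path is None) and return the
    # resulting native size plus the logged (features, decision) trajectory.
    def compile_and_measure(ir_path, policy_path):
        # Stand-in: invoke clang in development mode here and parse its output.
        return 0, []

    def collect_training_data(ir_corpus, policy_path, sample_size=512):
        modules = random.sample(ir_corpus, sample_size)
        with ProcessPoolExecutor() as pool:          # data collection is the bottleneck,
            results = list(pool.map(                 # so modules are compiled in parallel
                compile_and_measure, modules, [policy_path] * len(modules)))
        training_data = []
        for ir_path, (native_size, trajectory) in zip(modules, results):
            baseline_size, _ = compile_and_measure(ir_path, None)  # heuristic baseline
            total_reward = baseline_size - native_size             # module size reduction
            training_data.append((trajectory, total_reward))
        return training_data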
Figure 3. Data Collection for Inlining-for-Size

5 Implementation

We implemented the pilot project in LLVM, together with reusable support for the release and development modes, as well as continuous integration build bots. (Code references are made in the context of commit 71059257bd4.)

We introduced an abstraction for the inline decision-making policy, the InlineAdvisor, and a module analysis, InlineAdvisorAnalysis, that may be used to retrieve the InlineAdvisor. The analysis cannot be accidentally invalidated by other passes. This is necessary, since the inliner pass is interleaved with the execution of function passes, as previously discussed, and we want to track module-wide features throughout the performance of inlining and related passes over a module. Instead, the analysis is managed explicitly - see ModuleInlinerWrapperPass (llvm/Transforms/IPO/Inliner.h). The specific implementation of the advisor is chosen through an LLVM flag (-enable-ml-inliner). By default, the implementation is the manual heuristic. Passing 'release' or 'development' to the flag selects the respective mode, if the compiler was built with support for that mode.

Feature extraction is modeled as a separate analysis, FunctionPropertiesAnalysis, and reused by the release and development implementations. The full feature set is captured in llvm/Analysis/InlineModelFeatureMaps.h. We capture some call site-local information, as well as global information, such as the module-wide number of functions and static calls, caller/callee user counts; the position in the original call graph, as the distance of the call site to the farthest SCC; and an estimate of removed instructions given the call site context.

We use TensorFlow [1] as the model training and inference framework. In both modes, the interface between LLVM and the model is defined solely in terms of input and output tensor specifications (tensor name, type, and shape). The internal structure of the model is an implementation detail. This means that, during training, the compiler engineer is free to explore hyper-parameters or add/modify hidden layers. Also, ingesting a new model with a different internal structure, in release mode, is just a matter of recompiling LLVM.

Refer to lib/Analysis/{MLInlineAdvisor | ReleaseModeModelRunner | DevelopmentModeInlineAdvisor}.cpp for more implementation details. At a high level, both release and development modes:

• Handle user inlining directives and correctness aspects (these are done without model evaluation).
• Extract the features associated with a call site, form fixed-sized tensors (primitive data type vectors), and efficiently maintain the module-wide features.
• Pass the tensors to the model evaluator, and request that it perform an evaluation.
• Take the result of the evaluation as advice (i.e. inline / don't) and make that available to the inliner pass.
• It is possible that a policy misbehaves in unforeseen circumstances (which, as a note, should then be incorporated into the training loop). The resulting IR, while correct, could become increasingly expensive to process by subsequent passes. To avoid this, we set a hard threshold on the amount by which the number of instructions may grow in a compilation unit.
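As an illustration of the tensor-spec-only interface, the sketch below feeds the Table 1 features to a policy loaded as a TensorFlow SavedModel from Python. The path, signature name, output interpretation, and tensor dtype are assumptions for illustration; in release mode the model is not loaded dynamically at all but AOT-compiled into the compiler.

    import tensorflow as tf

    # Feature names follow Table 1; the values here are made up.
    features = {
        "caller_basic_block_count": 42, "caller_conditionally_executed_blocks": 7,
        "caller_users": 3, "callee_basic_block_count": 5,
        "callee_conditionally_executed_blocks": 1, "callee_users": 2,
        "callsite_height": 4, "cost_estimate": 120,
        "number_constant_params": 1, "edge_count": 9000, "node_count": 1200,
    }

    policy = tf.saved_model.load("/path/to/saved_policy")       # hypothetical path
    infer = policy.signatures["serving_default"]                 # assumed signature name
    inputs = {k: tf.constant([v], dtype=tf.int64) for k, v in features.items()}
    decision = infer(**inputs)   # assumed to yield the inline / don't-inline advice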
The model is encoded in the TensorFlow serialization format, SavedModel [31], which is compiled into native code by the saved_model_cli tool [32]. To use this tool, we added a build rule (see llvm/cmake/modules/TensorFlowCompile.cmake) to the LLVM build system. Applying the rule to a model generates a header file and an object file. From here, the model may be consumed as a C function; for simplicity, the SavedModel compiler provides a thin C++ wrapper, exposing plain C/C++ APIs (primitive types), which is compiled as part of the LLVM build process. The SavedModel is checked in as source. (The SavedModel separates the evaluation graph structure from the values of the trained weights used for evaluation. The graph is stored as text. The weights/serialized float arrays are stored as a binary blob. Their evolution due to training does not diff well, so the compactness of a binary format is more economical for the project repository.)

To build with support for the release mode, the SavedModel compiler must be available at LLVM build time. The compiler may be installed through a Python pip package. Once installed, its location is provided to the LLVM build via the TENSORFLOW_AOT_PATH cmake flag. Specifying that flag also defines a conditional compilation flag, HAVE_TF_AOT, which enables the compilation of the release mode support as part of the Analysis component. (See the buildbot setup script available at https://github.com/google/ml-compiler-opt/blob/58bf347286c21519b3cc418f659c485cbb7ad82f/buildbot/buildbot_init.sh.) Note that this mechanism would build the release mode implementations of all optimization passes that have RL-driven policies, meaning that implementers just need to reuse the same mechanisms - conditional compilation flag, build rule, etc. - to plug in an ML-based policy replacement.

As discussed, in development mode we want to support loading models from the command line. For the development mode, the build-time dependency is the TensorFlow C API library, instead of the TensorFlow pip package. Model loading, initialization, and evaluation are performed via a reusable C++ API wrapper (see lib/Analysis/TFUtils.cpp) that simplifies the programming model of this aspect of development mode implementations.
In addition to facilitating a different model ingestion mechanism, the development mode is responsible for producing the traces necessary for training ("training logs"). These logs capture the succession of feature values observed when the policy is asked to make a decision, and the decision made afterwards ("trajectories"). (RL algorithms require these logs; ES algorithms do not.) Training logs may be produced both for the heuristic policy (for bootstrapping training - "warmstart") and for the ML policy currently under training. Exploration - i.e. deviating from the policy, with the purpose of finding new learning opportunities - is delegated to a TensorFlow mechanism that introduces some randomness in decisions. This mechanism is an implementation detail of the model as produced by the training algorithm, and is outside the control of the compiler. Care must be taken to remove such randomness before shipping a model, and to re-validate its effectiveness. We encode the training logs as textual SequenceExample [30] protocol buffers, the typical abstraction TensorFlow training algorithms expect. We produce textual output to avoid an additional dependency in LLVM, and to simplify diagnostics and testing of the feature.

Analogous to the release mode, enabling the development mode in LLVM requires the dependency to be made available to the build system. In this case we use the TENSORFLOW_C_API flag, which in turn defines the HAVE_TF_API conditional compilation flag. More details may be obtained from the previously noted build bot scripts. Also similar to the release mode, this mechanism enables all cases that have the TensorFlow C API library dependency. Unlike the release mode, the development mode's use of the TensorFlow C library is a run-time dependency, and it needs to be on the loader path.

Model evaluation in release mode has fixed cost, both in terms of compiler run-time memory utilization and CPU utilization. This is because models are fixed-size graphs connecting functional operators, taking fixed-sized inputs, using constant weights, and producing fixed-sized outputs. For the current model, we observed a 0.65% increase in memory utilization at run-time. When inlining a large IR module (~33MB), we measured a 10% increase in inlining time, mostly attributable to feature extraction; since inlining tends to represent 10-15% of total compile time, the net contribution of the release mode is only ~1%. Finally, the clang binary size increase due to the inclusion of the compiled model was 115KB, representing a 0.08% size increase.

We did not formally measure the overhead of the development mode, mainly because timeliness is less of a concern here, and also because model evaluation may happen through a variety of means, including JIT-ing, which makes measurements more unstable. We did want to validate that
the solution is practically usable in training loops, and observed 26K IR modules being inlined in parallel on a 72-thread machine with 192GB of RAM, in around 10 minutes, without going past half of the available RAM. Anecdotally, we were also able to build Fuchsia using a development-mode clang, and timeliness was not a noticeable issue.

6 Results

We trained the inlining-for-size policy on an internal search application containing over 28000 IR modules with a variety of different code patterns. The rich set of patterns improves the generalizability, across both time and software domain, of the trained policy. As mentioned, this is important for real-world deployment.

We trained the policy using both PG and ES on the internal search software. Table 2 compares their effectiveness in terms of the reduction of the .text section compared with the heuristic-driven -Oz. We trained 3 policies: PG and ES with a 2-hidden-layer neural network, and ES(L) with a deeper 4-layer neural network.

Table 2. Policy Gradient v.s. Evolution Strategies

                      PG        ES        ES (L)
    Size Reduction    4.95%     3.74%     5.x%

We can see that PG has better sample efficiency than ES — it consumes ~5% of the training resources of ES (100 ∗ 12 v.s. 488 ∗ ...).
We deploy the trained PG and ES policies to a wide range of software to evaluate their generalizability. Figure 4 shows how the 3 models we trained on the search application perform on 3 different internal applications and on Clang (specifically, clang @4ca60915bcc (2020/8/28) building clang @d469133f95b (2020/4/25)). Figure 5 shows their effectiveness on SPEC 2006. We can see that all 3 policies show good generalizability — they are able to reduce the native size to some extent. Policy effectiveness is ES(L) > PG > ES for most software, which is the same as what we see on the search application. This also suggests good generalizability, in that a policy that performs better on one piece of software is likely to also perform better on others.

We also have an end-to-end demo at https://github.com/google/ml-compiler-opt/blob/main/docs/demo/demo.md that trains on publicly available code and achieves similar performance; detailed hyper-parameters are available at https://github.com/google/ml-compiler-opt/tree/main/compiler_opt/rl/inlining/gin_configs.
Figure 4. Generalizability across Software
To evaluate the trained policies' generalizability across time, we deploy the 3 trained policies on the same software as in Figure 4, 4 months later. We also use the LLVM from 4 months later (clang self-host @4ca60915bcc (2020/8/28)). Both the software and the compiler have been under active development in that period. Figure 6 shows the results. We can see that the policies' effectiveness may degrade somewhat (compared with Figure 4), but they still achieve decent wins compared with the current -Oz.

7 Related Work

There have been many academic efforts in using machine learning and related techniques to replace hand-crafted heuristics in compilers. Our contribution is identifying the problem framing and design constraints that enable applying these techniques in production.

Wang and O'Boyle [33] present an extensive survey of the use of machine learning in compiler optimizations. Most works, however, employ supervised learning techniques, which, as explained, are not in our scope. The closest, Cavazos et al. [6], used unsupervised learning to automatically tune the inlining parameters (thresholds) of a research Java Virtual Machine (JVM), which features a very simple manually written heuristic. In subsequent work, Simon et al. [25] construct a heuristic as a decision tree, to address maintainability and evolvability. While the ML techniques are similar to what we are using in MLGO, both parameter tuning and direct policy comprehensibility are counter to our goals, as described in Section 3.

Adams et al. [2] employed ML to train a cost model to automatically schedule Halide programs for image processing. With runtime sampling, the cost model is used to find the optimal schedule parameters using beam search. Similarly, Chen et al. [7] used deep learning to train a statistical model for TensorFlow programs.

Inlining-specific, Dean et al. [11] build a database of observed decisions and their effects, and consult it in subsequent compiler runs - while this is not learning, it is a precursor of efforts in this area. In [9], Cooper et al. presented a scheme to parameterize the inlining heuristics (decision tree) and a hill-climbing parameter space search.
Haj-Ali et al. [13] use reinforcement learning to instrument source code with pragma directives to drive the vectorization pass. The policy does not replace a compiler heuristic; rather it informs one, by augmenting source code as a pre-build step. This does not make the technique transparently deployable for compiler users. That being said, we currently see no fundamental reason their solution cannot be adapted to MLGO. The main practical issue we see is understanding the trade-offs of automated feature extraction, which we intend to explore as a next step as well.

Supervised learning is used by Stephenson and Amarasinghe [26] to predict loop unrolling factors, and by Moss et al. [19] to train a local (single basic block) instruction scheduler. The latter uses a machine model to predict the so-called preference relationship given a partial schedule and two candidate/ready instructions. Cummins et al. [10] automatically extract features from source code, and use supervised learning to learn heuristics for predicting the optimal mapping for heterogeneous parallelism and GPU thread coarsening factors.

Instruction scheduling is a hard problem in the compiler that extensively uses heuristics. The application of learning to instruction scheduling within straight-line code has been explored by Moss et al. [20] and McGovern et al. [17].

Data prefetching plays a similar role in bringing data into the processor without stalls. In [15], Hashemi et al. treated memory prefetching strategies as an n-gram classification problem from natural language processing, and used an LSTM-based Recurrent Neural Network (RNN) to do the prediction. Peled et al. [21] define the notion of semantic locality and use reinforcement learning techniques to build a context-based memory prefetcher that approximates semantic locality.

Another approach to optimizing programs without dealing with specific optimizations is super-optimization. This refers to the process of finding a better version of a given program that is semantically equivalent. Early efforts in super-optimization relied on brute-force search. Recent efforts have focused on using stochastic search to improve the efficiency. Bunel et al. [5] have used reinforcement learning to optimize stochastic-search-based super-optimization techniques.

Milepost GCC [12] is a self-tuning GCC-based compiler, where program features are used to predict compiler flags beneficial to some goal (such as performance or size). It does not use ML-trained policies as part of its implementation.
Figure 5. SPEC 2006 Size Reduction

Figure 6. Generalizability across Time

8 Future Work

The immediately observable difference between our pilot project and speed problems is that the reward is measured differently: speed is measured through benchmark runs, which are more time-consuming and noisier than size measurements. Using benchmark run results as the reward for speed optimization will have difficulty scaling, so our current preference is to avoid benchmark runs altogether, and focus instead on using problem-specific reward approximations. For register allocation, for example, a natural reward is calculating, per function, the block-frequency-weighted sum of introduced moves. For inlining for speed, we plan to use a linear combination of a per-critical-call-graph estimate of the working set (i.e. cache lines needed for execution) and the dynamic instruction count. Both approaches require profiling information for carrying out the analysis, which we assume as a prerequisite for workloads that are concerned with speed.
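To illustrate what such problem-specific reward approximations could look like, here is a minimal sketch; the functional forms and the mixing weight are hypothetical and would need validation against real benchmark measurements.

    # Hypothetical proxy rewards for speed problems, per the discussion above.

    # Register allocation: negative block frequency-weighted count of introduced moves.
    def regalloc_proxy_reward(introduced_moves):
        # introduced_moves: list of (block_frequency, number_of_moves) pairs
        return -sum(freq * moves for freq, moves in introduced_moves)

    # Inlining for speed: linear combination of the change in an instruction
    # working-set estimate and the change in dynamic instruction count, both
    # assumed to be pre-computed from profile information.
    def inline_for_speed_proxy_reward(working_set_before, working_set_after,
                                      dyn_insts_before, dyn_insts_after,
                                      alpha=0.5):
        delta_working_set = working_set_before - working_set_after
        delta_dyn_insts = dyn_insts_before - dyn_insts_after
        return alpha * delta_working_set + (1.0 - alpha) * delta_dyn_insts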
There are multiple directions to pursue in terms of the ML techniques:
Richer State Representations: instead of using the 11 numerical features to represent the state, we can have richer state representations. For example, we can use code embedding techniques [3] to embed the caller/callee and get more detailed information about the call site; or we can use graph neural network techniques [14] on the neighboring sub-graph of the call site to get more information about the call graph.
PG with Partial Reward: PG with partial reward would greatly improve sample efficiency and trainability. However, there are two challenges to tackle: 1) find an efficient way to encode the global call graph information into the state; 2) train a supervised model to predict a function's native size from its IR.
9 Conclusions

We investigated the problem of leveraging ML techniques for compiler optimization in a real-world setting. We proposed a particular understanding of the problem space, and derived the MLGO framework from it. We applied it to inlining-for-size and described the resulting implementation, available in LLVM as a build-time opt-in, as well as the training methodology, two training algorithms and their trade-offs, and results. We are currently applying the same principles to inlining-for-speed and register allocation policies, and hope that, through our experience, as well as that of the community, we can further refine MLGO and eventually mature it into a solution that compiler engineers can broadly apply to leverage machine learning for compiler optimizations in real-world settings.
References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). 265–283.
[2] Andrew Adams, Karima Ma, Luke Anderson, Riyadh Baghdadi, Tzu-Mao Li, Michaël Gharbi, Benoit Steiner, Steven Johnson, Kayvon Fatahalian, Frédo Durand, and Jonathan Ragan-Kelley. 2019. Learning to optimize Halide with tree search and random programs. ACM Transactions on Graphics 38, 4 (2019).
[3] Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. 2019. code2vec: Learning distributed representations of code. Proceedings of the ACM on Programming Languages 3, POPL (2019), 1–29.
[4] Michael Bain and Claude Sammut. 1995. A Framework for Behavioural Cloning. In Machine Intelligence 15. 103–129.
[5] Rudy Bunel, Alban Desmaison, M. Pawan Kumar, Philip H. S. Torr, and Pushmeet Kohli. 2016. Learning to superoptimize programs. arXiv preprint arXiv:1611.01787 (2016).
[6] John Cavazos and Michael F. P. O'Boyle. 2005. Automatic tuning of inlining heuristics. In SC '05: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing. IEEE, 14–14.
[7] Tianqi Chen, Lianmin Zheng, Eddie Q. Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. Learning to Optimize Tensor Programs. CoRR abs/1805.08166 (2018). arXiv:1805.08166 http://arxiv.org/abs/1805.08166
[8] Krzysztof Choromanski, Mark Rowland, Vikas Sindhwani, Richard E. Turner, and Adrian Weller. 2018. Structured evolution with compact architectures for scalable policy optimization. arXiv preprint arXiv:1804.02395 (2018).
[9] K. D. Cooper, T. J. Harvey, and T. Waterman. 2008. An Adaptive Strategy for Inline Substitution. In L. Hendren (Ed.), Compiler Construction (CC 2008), Lecture Notes in Computer Science, vol. 4959. https://doi.org/10.1007/978-3-540-78791-4_5
[10] Chris Cummins, Pavlos Petoumenos, Zheng Wang, and Hugh Leather. 2017. End-to-end Deep Learning of Optimization Heuristics.
[11] Jeffrey Dean and Craig Chambers. 1993. Training Compilers for Better Inlining Decisions. Technical Report.
[12] Grigori Fursin, Yuriy Kashnikov, Abdul Wahid Memon, Zbigniew Chamski, Olivier Temam, Mircea Namolaru, Elad Yom-Tov, Bilha Mendelson, Ayal Zaks, Eric Courtois, François Bodin, Phil Barnard, Elton Ashton, Edwin V. Bonilla, John Thomson, Christopher K. I. Williams, and Michael F. P. O'Boyle. 2011. Milepost GCC: Machine Learning Enabled Self-tuning Compiler. Int. J. Parallel Program. 39, 3 (2011), 296–327. https://doi.org/10.1007/s10766-010-0161-2
[13] Ameer Haj-Ali, Nesreen K. Ahmed, Ted Willke, Yakun Sophia Shao, Krste Asanovic, and Ion Stoica. 2020. NeuroVectorizer: End-to-End Vectorization with Deep Reinforcement Learning. In Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization (CGO 2020). Association for Computing Machinery, New York, NY, USA, 242–255. https://doi.org/10.1145/3368826.3377928
[14] Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems. 1024–1034.
[15] M. Hashemi, K. Swersky, J. A. Smith, G. Ayers, H. Litz, J. Chang, C. Kozyrakis, and P. Ranganathan. 2018. Learning Memory Access Patterns. In ICML 2018.
[16] Jens Kober, J. Andrew Bagnell, and Jan Peters. 2013. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research 32, 11 (2013), 1238–1274.
[17] Amy McGovern, Eliot Moss, and Andrew G. Barto. 2002. Building a Basic Block Instruction Scheduler with Reinforcement Learning and Rollouts. Machine Learning 49, 2 (Nov 2002), 141–160. https://doi.org/10.1023/A:1017976211990
[18] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013).
[19] Eliot Moss, Paul Utgoff, John Cavazos, Carla Brodley, and David Scheeff. [n.d.]. Learning to Schedule Straight-Line Code (NIPS 1997). 929–935.
[20] J. Eliot B. Moss, Paul E. Utgoff, John Cavazos, Doina Precup, Darko Stefanovic, Carla E. Brodley, and David Scheeff. 1998. Learning to Schedule Straight-Line Code. In Advances in Neural Information Processing Systems 10, M. I. Jordan, M. J. Kearns, and S. A. Solla (Eds.). MIT Press, 929–935. http://papers.nips.cc/paper/1349-learning-to-schedule-straight-line-code.pdf
[21] L. Peled, S. Mannor, U. Weiser, and Y. Etsion. 2015. Semantic locality and context-based prefetching using reinforcement learning. In 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA). 285–297. https://doi.org/10.1145/2749469.2749473
[22] Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. 2017. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864 (2017).
[23] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017).
[24] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529 (2016), 484–489.
[25] Sameer Kulkarni, John Cavazos, Christian Wimmer, and Douglas Simon. 2013. Automatic Construction of Inlining Heuristics Using Machine Learning. In Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO '13). IEEE Computer Society, Washington, DC, USA, 1–12. https://doi.org/10.1109/CGO.2013.6495004
[26] Mark Stephenson and Saman Amarasinghe. 2005. Predicting Unroll Factors Using Supervised Classification. In Proceedings of the International Symposium on Code Generation and Optimization (CGO '05). IEEE Computer Society, Washington, DC, USA, 123–134. https://doi.org/10.1109/CGO.2005.29
[27] Mark W. Stephenson. 2006. Automating the Construction of Compiler Heuristics Using Machine Learning. Ph.D. Dissertation. USA. Advisor(s): Amarasinghe, Saman. AAI0810106.
[28] Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems. 1057–1063.
[29] unspecified. 2020. TensorFlow Agents.
[30] unspecified. 2020. TensorFlow tf.train.SequenceExample.
[31] unspecified. 2020. Using the SavedModel format.
[32] unspecified. 2020. XLA — TensorFlow, Compiled. https://developers.googleblog.com/2017/03/xla-tensorflow-compiled.html
[33] Zheng Wang and Michael O'Boyle. 2018. Machine Learning in Compiler Optimization. Proc. IEEE PP (May 2018), 1–23. https://doi.org/10.1109/JPROC.2018.2817118
[34] Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8, 3–4 (1992), 229–256.