Static Neural Compiler Optimization via Deep Reinforcement Learning
Rahim Mammadli
Technische Universität Darmstadt, Graduate School of Excellence Computational Engineering, [email protected]
Ali Jannesari
Iowa State University, Department of Computer Science, [email protected]
Felix Wolf
Technische Universität Darmstadt, Department of Computer Science, [email protected]
Abstract—The phase-ordering problem of modern compilers has received a lot of attention from the research community over the years, yet remains largely unsolved. Various optimization sequences exposed to the user are manually designed by compiler developers. In designing such a sequence, developers have to choose the set of optimization passes, their parameters, and their ordering within the sequence. The resulting sequences usually fall short of achieving optimal runtime for a given source code and may sometimes even degrade performance when compared to the unoptimized version. In this paper, we employ a deep reinforcement learning approach to the phase-ordering problem. Provided with sub-sequences constituting LLVM's O3 sequence, our agent learns to outperform the O3 sequence on the set of source codes used for training and achieves competitive performance on the validation set, gaining up to 1.32x speedup on previously-unseen programs. Notably, our approach differs from autotuning methods by not depending on one or more test runs of the program to make successful optimization decisions. It has no dependence on any dynamic feature, but only on the statically-attainable intermediate representation of the source code. We believe that the models trained using our approach can be integrated into modern compilers as neural optimization agents, at first to complement, and eventually replace, the hand-crafted optimization sequences.
Index Terms—code optimization, phase-ordering, deep learning, neural networks, reinforcement learning
I. INTRODUCTION
Code optimization remains one of the hardest problems of software engineering. Application developers usually rely on a compiler's ability to generate efficient code and rarely extend its standard optimization routines by selecting individual passes. The diverse set of applications and compute platforms makes it very hard for compiler developers to produce a robust and effective optimization strategy. Modern compilers allow users to specify an optimization level which triggers a corresponding sequence of passes that is applied to the code. These passes are initialized with some pre-defined parameter values and are executed in a pre-defined order, regardless of the code being optimized. Intuitively, this rigidness limits the effectiveness of the optimization routine. Indeed, a recent study [1] has shown that even the highest optimization level of different compilers leaves plenty of room for improvement.

In order to establish an even playing field with existing optimization sequences, we limit the scope of our approach by allowing as input only information which is statically available during compilation. This sets us apart from autotuning approaches, where data gathered from one or multiple runs of the program is used to supplement the optimization strategy. Our predictive model uses the intermediate representation (IR) of the source code to evaluate and rank different optimization decisions. By iteratively following the suggestions of our model, we are able to produce an optimization strategy tailored to a given IR. This is the main difference of our approach from the pre-defined optimization strategies shipped as part of modern compilers.

More formally, we rephrase the phase-ordering problem as a reinforcement learning problem. The environment is represented by the operating system and the LLVM optimizer. The agent is a deep residual neural network, which interacts with the environment by means of actions. The actions can be of various levels of abstraction, but ultimately they translate to passes run by LLVM's optimizer on the IR. We will discuss actions in greater detail in Section III. The state information that is used by the agent to make predictions is represented by the IR of the source code and the history of actions that produced the IR. In response to actions, the environment returns the new state and the reward. The new state is produced by the LLVM optimizer, which runs the selected pass(-es) on the current IR and produces the new IR. The reward is calculated by benchmarking the new IR and comparing its runtime to that of the original IR (i.e., a reduction in runtime produces a positive reward while an increase produces a negative one). Through interchanging steps of exploration and exploitation we train an agent that learns to correctly value the various optimization strategies.

We believe that the agents produced by our approach could be integrated into existing compilers alongside other routines, such as O2 or O3. Our agent outperforms O3 in multiple scenarios, achieving up to 1.32x speedup on previously-unseen programs, and can therefore be beneficial in a toolkit of optimization strategies offered to an application developer. While the agent learns to achieve superior performance on the training set, it is, on average, inferior to O3 on the validation set. However, we believe that this is due to current limitations of the encoding we use for the IRs and the relatively small size of our dataset. We are convinced that the optimization strategies
of the future will resemble learned agents rather than manually designed sequences, and that reinforcement learning will likely be the framework used to produce these agents. This work intends to be one of the first steps in this direction.

The static phase-ordering problem considered in this work is challenging because of several factors. First, the limited amount of information available during compilation, such as the unknown input size, already reduces the optimization potential of the agent. In order to partially offset this, we include benchmarks with various problem sizes in our dataset. Next, the number of possible optimization sequences grows exponentially with the number of passes. The space grows even more if we consider possible parameterizations of distinct passes. To deal with the large optimization space we employ several strategies: (i) we make the agent pick only a single action at a time instead of predicting the whole sequence from scratch, (ii) we experiment with different levels of abstraction for our actions, from triggering a sequence of optimization passes down to selecting a parameter for a single pass. Moreover, the space of possible IRs of each source code can also be quite large; therefore, to encode the IRs we use the embeddings by Ben-Nun et al. [2]. Another challenge is that the efficacy of different optimizations might vary depending on the underlying hardware. In our approach we use only one out of two available distinct system configurations per agent to run all of the benchmarks. This means that the learned agents are fine-tuned for the given hardware. However, this is not necessarily a disadvantage, because it is possible to train an agent once per processing unit and ship it alongside a compiler optimizer. Moreover, it could also be possible to train a single versatile agent by supplementing the state with information about the underlying hardware.

We use a dataset of 109 single-source benchmarks from the LLVM test suite to train and evaluate our model. The model is trained using IRs of source codes from the training set and evaluated on the source codes in the validation set. Using passes from the existing O3 sequence of the LLVM optimizer, we are able to train an agent which is on average 2.24x faster than the unoptimized version of a program in the training set, whereas O3 is 2.17x faster. The best-performing agent on the validation set achieves an average of 2.38x speedup over the unoptimized version of the code, while the O3 sequence achieves an average of 2.67x speedup.

Most of the prior work related to ours [3], [4] focuses on the autotuning problem, where a program has to be run one or more times before it is possible to choose the optimization sequence. The advantage of these approaches is that the dynamic information gathered during program runs provides an accurate characterization of the program. These methods are therefore usually quite successful in outperforming compilers' pre-defined optimization sequences. However, a big disadvantage of these approaches is that they require extra developer effort to run the program and gather the necessary information, which prevents them from being integrated into compilers as part of a standard compilation routine. The supervised learning methods applied to the compiler optimization problem require a pre-existing labeled dataset that is then used to train a model. Producing such a dataset is not an easy task because the search space is usually very large and the value of different data points is unknown beforehand.
Reinforcement learning, in contrast to supervised learning, allows the trained agent to explore the environment and continuously choose the data points itself as it learns. The problem of developing methods competing with pre-defined optimization sequences using only static information has not gained much attention in the scientific literature in recent years. This is partially because the problem is very challenging. Nonetheless, we believe that this problem is at least of equal importance and, to the best of our knowledge, we are the first to apply deep reinforcement learning to solve it. This paper makes the following contributions:

• A novel deep reinforcement learning approach to static code optimization. The approach does not rely on manual feature engineering by a human, but instead learns by observing the effects of the various optimizations on the IR and the rewards from the environment. The approach is therefore fully automatic and relies only on the initial supply of the source codes.

• A trained optimization agent that can be integrated into modern compilers alongside existing optimization sequences exposed through compiler flags such as -O2, -O3, etc. The agent can produce IRs that are up to 1.32x faster than the ones resulting from using the O3 optimization sequence.

• An efficient framework "Compiler Optimization via Reinforcement Learning" (CORL) allowing fast exploration and exploitation in batches. The dynamic load-balancing mechanism distributes the benchmarking workload across the number of available workers and facilitates efficient exploration. Using a large replay memory allows for fast off-policy training of the agent. The results of benchmarks are further stored in a local database to allow reproducibility as well as higher efficiency of subsequent runs.

This paper is structured in the following manner. We start by providing background information in Section II, before introducing our approach and the CORL framework in Section III. We evaluate our approach in Section IV and describe the related work in Section V. Finally, we conclude our paper and sketch future work in Section VI.

II. BACKGROUND
Modern compilers expose multiple optimization levels via their command line interface. For example, the current version of LLVM offers a selection of seven such levels (https://releases.llvm.org/10.0.0/tools/clang/docs/CommandGuide/clang.html). These aim to strike a certain trade-off between the size of the produced binary and its performance. Each optimization level corresponds to a unique sequence of passes that are run on the source code. These passes are constructed using hard-coded values matching the selected optimization level. Having to maintain multiple manually-designed optimization sequences is one of the drawbacks of the current design. Another disadvantage is that while the optimization sequences are generally efficient, they are not optimal, and in certain cases can even increase the runtime when compared to an unoptimized version. For example, after applying the O3 optimization sequence to the evalloop.c benchmark from the LLVM test suite, we observed a more than three-fold slowdown. In comparison to hand-crafted sequences of passes, our method is fully automatic and can learn to achieve any sort of trade-off given the correct reward function. Constructing such a reward function for the size of the binary or the runtime is trivial, as discussed in Section III-B.

The majority of existing machine learning methods for compiler optimization tackle the problem of autotuning. This means that they depend on dynamic runtime information and require at least one run of the source program to make a prediction. Apart from this, some of these methods rely on static features extracted from the source code, such as tokens [5], IR [2], etc. In contrast to these methods, we abandon the dependence on dynamic features and aim to achieve the best-possible performance by only using the static features extracted from the IR [2].

Footnote: Deep reinforcement learning encompasses a subset of reinforcement learning methods where the learning part is performed by a deep neural network.

III. APPROACH
We first give a high-level overview of our approach in Section III-A, before formally defining the problem in Section III-B. Then, we discuss the three levels of abstraction for actions we consider and the corresponding action spaces in Section III-C, before introducing the tools used to map actions to concrete LLVM passes in Section III-D. We finish by going over the functionality offered by the CORL framework in Section III-E.
A. Overview
The high-level overview of our approach is shown in Figure 1. The agent takes the IR of the source code as well as the initially empty history of actions and calculates the expected cumulative rewards for different actions. If the predicted reward of an action is positive, it is expected to eventually lead to a speedup, while negative rewards are expected to result in a slowdown. To ensure this, the rewards are calculated as log(speedup) during training. Next, the action with the highest reward is chosen and, if its value is positive, the LLVM optimizer applies the chosen optimization(s) to the input IR. The produced IR alongside the updated history of actions is then fed into the agent once again and the cycle continues. Eventually, the cycle breaks when the highest predicted reward is negative or the maximum number of allowed optimizations has been applied. Having an upper bound on the number of optimizations prevents the agent from potentially being stuck in an infinite loop.
Fig. 1: The CORL workflow.
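The inference loop described above can be summarized in a short sketch. This is a minimal illustration, not the authors' implementation: the helpers agent_q_values (returning a dictionary of predicted cumulative log-speedups per action) and apply_passes are hypothetical names.

```python
# Minimal sketch of the CORL inference workflow (hypothetical helpers assumed).
def optimize(ir, max_actions=16):
    history = []                          # initially empty history of actions
    for _ in range(max_actions):          # upper bound prevents infinite loops
        q = agent_q_values(ir, history)   # predicted cumulative rewards per action
        best = max(q, key=q.get)
        if q[best] <= 0:                  # no action is expected to yield a speedup
            break
        ir = apply_passes(ir, best)       # LLVM optimizer runs the selected pass(-es)
        history.append(best)
    return ir
```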
B. Problem Definition
We define the phase-ordering problem as a reinforcement learning problem. Since the IR of a source code carries only static information that we use to represent states, the states lack the Markovian property. Moreover, using the embeddings produced by Ben-Nun et al. [2] results in a further loss of information about the IR, such as immediate values of instructions. To enrich the state representation, we supplement it with the history of actions performed by the agent.

For a set of all possible states S and actions A, our goal is to learn the value function Q(s, a, w) parameterized with weights w, such that for any state s ∈ S and action a ∈ A, the function predicts the highest cumulative reward attainable by taking that action. In order to learn the value function we enforce consistency:

Q(S_t, A_t, w) = R(S_t, A_t) + γ max_{a ∈ A} Q(S_{t+1}, a, w)    (1)

In the equation, R(S_t, A_t) is the reward awarded for taking action A_t in state S_t, after which the agent ends up in state S_{t+1}, and γ ∈ [0, 1] is the discount factor for future rewards. Assuming that the function T(s) represents the runtime of the executable produced by compiling the IR corresponding to state s ∈ S, the reward for the action transitioning the agent from state S_t to state S_{t+1} is calculated as follows:

R = ln( T(S_t) / T(S_{t+1}) )    (2)

Representing the reward as the logarithm of the attained speedup or slowdown allows the rewards to be accumulated across transitions. Notably, to train an agent to minimize the size of the produced binary instead of its runtime, one would only need to update the reward function. Specifically, the function T(s) calculating the runtime of an executable would need to be replaced with another function calculating its size. Similarly, using both the runtime and the size of an executable in the reward calculation would stimulate the agent to learn the trade-off between the two.

In order to learn an approximation of the function Q(s, a, w), we first initialize a deep residual neural network (DQN) with random weights w. Then, we use a replay memory to sample experiences, each represented as a tuple {S_t, A_t, R, S_{t+1}}, and compute the loss (TD-error) of our DQN as the squared mean of the difference between the left and right sides of Equation 1.
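As an illustration of the reward and loss just defined, the sketch below computes the logarithmic reward of Equation 2 and the TD-error of Equation 1 for a single sampled experience. It is a minimal PyTorch-style sketch under the assumption of hypothetical q_net and target_net modules; it is not the authors' code.

```python
import math
import torch

def reward(runtime_before, runtime_after):
    # Equation 2: logarithm of the attained speedup (negative for slowdowns).
    return math.log(runtime_before / runtime_after)

def td_loss(q_net, target_net, s_t, a_t, r, s_next, gamma=0.9):
    # Equation 1: the predicted value should match the reward plus the discounted
    # best value of the successor state, estimated by a fixed Q-target network.
    q_pred = q_net(s_t)[a_t]
    with torch.no_grad():
        q_target = r + gamma * target_net(s_next).max()
    return (q_pred - q_target) ** 2
```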
C. Action Spaces

In order to produce an optimization sequence for a given IR, an agent must decide on a chain of actions. To represent the actions, we experiment with three different levels of abstraction, which are illustrated in Figure 2. At the highest level of abstraction, an action triggers a series of passes to be applied to an IR. At the middle level of abstraction, each action corresponds to an individual pass. Finally, at the lowest level of abstraction, an action might select a pass or a parameter value for an already selected pass. For high- and middle-level actions the passes are initialized with pre-defined parameter values. The lower the level of abstraction for actions, the harder the learning problem.

In this work we experiment with all three levels of abstraction. We label the action spaces produced by high, middle, and low level actions as H, M, and L respectively. The size of each action space depends exponentially on the maximum allowed number of consecutive actions, which we designate as parameter µ. Selecting larger values for the parameters µ and γ allows an agent to learn the existence of rewards lying many steps ahead. However, having µ too large may unnecessarily complicate the learning problem if such long-term dependences among actions do not exist. Furthermore, compilation time potentially also increases proportionally to µ. Note that in action space L only actions selecting individual passes, and not parameters, contribute towards µ. Moreover, since only the last parameter selection for every pass with multiple parameters makes it possible to construct and evaluate a pass, all the preceding intermediate actions produce a reward of 0. Therefore, to allow the agent to learn the values of different parameter selections of a given pass, the value of γ for all intermediate actions is set to 1.
Pass
Action
Pass
Action
Pass
Action
Parameter
Action
Parameter
Action
Pass ......
Action
Pass ...
HighLevelMediumLevelLowLevel
Fig. 2: Three levels of abstraction for actions.
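To make the counting rule for action space L concrete, the following sketch shows how pass-selection and parameter-selection actions could be distinguished when tracking the budget µ. The Action class and its fields are hypothetical illustrations, not the authors' data structures.

```python
from dataclasses import dataclass

@dataclass
class Action:
    name: str
    is_parameter: bool   # True if the action sets a parameter of the pending pass

def consumed_budget(actions):
    # In action space L, only pass-selection actions count towards µ;
    # parameter-selection actions are intermediate, produce a reward of 0,
    # and use γ = 1 until the pass is fully specified and can be evaluated.
    return sum(1 for a in actions if not a.is_parameter)

seq = [Action("loop-unroll", False), Action("threshold=300", True),
       Action("allow-partial=true", True), Action("gvn", False)]
assert consumed_budget(seq) == 2   # two passes selected, so µ is reduced by 2
```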
D. Implementation
Some of the passes in LLVM's O3 sequence are initialized with non-default constructors and are therefore impossible to replicate using the command line interface of the LLVM optimizer opt. To allow experimentation at the highest level of abstraction in action space H using the exact passes from LLVM's O3 sequence, we create a special optimizer opt_corl. This optimizer alters the functionality of opt by using one or more user-specified subsequences of passes out of O3 to optimize a given IR. Each subsequence of passes is specified using its starting and ending indexes within the O3 sequence.

Both opt_corl and opt allow experimentation in action space M. However, for the sake of generality, we only use opt for the action spaces M and L, since it allows us to specify both individual passes and set their parameters. When dealing with action space L, opt is only invoked when both pass and parameter selections have been finalized.
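For action spaces M and L, invoking opt amounts to passing the selected pass names (and, for L, the finalized parameter values) on the command line. The sketch below shows one plausible way to do this from Python; the pass and parameter names are taken from Tables II and III, but the wrapper function itself is an illustrative assumption, not the paper's tooling.

```python
import subprocess

def run_opt(input_ll, output_ll, passes, params=None):
    """Apply the chosen passes to an IR file with LLVM's opt (legacy pass manager)."""
    cmd = ["opt", "-S", input_ll, "-o", output_ll]
    for flag, value in (params or {}).items():
        cmd.append(f"-{flag}={value}")         # e.g. -inline-threshold=225
    cmd += [f"-{p}" for p in passes]           # e.g. -sroa -early-cse -instcombine
    subprocess.run(cmd, check=True)

# Example: one step in action space M with default parameter values.
run_opt("input_ir.ll", "step1.ll", ["sroa", "early-cse", "instcombine"])
```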
E. CORL Framework
The majority of reinforcement learning algorithms can be described as iterative processes with interchanging exploration and exploitation steps performed in a loop. The sequential nature of these algorithms is usually not an issue for many reinforcement learning problems for which the exploration step completes in a short amount of time. Receiving a quick response to an action from the environment allows for fast generation of training data and consequently faster training [6], [7]. In contrast, for our problem the benchmarking step required to calculate the reward takes a relatively long time to complete. Therefore, waiting for the exploration step to finish before proceeding with the exploitation is suboptimal both in terms of the agent's training and efficient use of compute resources. To that end, we devise an algorithm which allows for the exploration and exploitation steps to be performed in parallel.

Figure 3 illustrates the essential elements of the CORL framework, which is designed as a client-server architecture. The server-side functionality is divided across several objects responsible for training agents, managing workers and replay memory, and visualizing progress. As part of the exploration process, the learner object generates new tasks in the form of state-action pairs and sends them to the manager object. Afterwards, as part of the exploitation process, the learner continuously samples batches of experiences from the replay memory and trains the agent. The manager distributes the tasks generated by the learner across workers and updates the replay memory with the newly-generated experiences. Both the learner and the manager run in separate server-side processes, allowing for exploration and exploitation to be performed in parallel. Below we describe the various functionalities of the CORL framework in greater detail.
1) Initialization:
The server-side logic starts with the manager scanning the source codes provided by the user and splitting them into training and validation sets. The programs are randomly shuffled and assigned to the respective sets based on the user-specified ratio. Then, the manager loads previously-saved IRs and transitions from the SQL database into memory and populates the replay memory with experiences. Afterwards, workers are utilized to produce and benchmark the unoptimized base IR and its O3-level optimized version for every source code in the dataset, if not already present in memory. All the data generated at this stage and during exploration is asynchronously saved to the database. Upon completing this step the initialization is finished and the learner starts exploration.

Fig. 3: Overview of the CORL framework.
2) Benchmarking:
A single exploration step involves applying the selected pass(-es) to a given IR, producing a new IR, which is then benchmarked to calculate the reward. Benchmarking any program is prone to noise, and based on our observations the variation in runtime, in terms of percentage of deviation from the mean, itself depends on the runtime. For the source codes in our dataset, the longer the runtime, the smaller the observed variation. Therefore, to calculate the runtime of an IR, a worker runs it between 20 and 1000 times, depending on its runtime, and sends the median runtime back to the manager. To hide the latency induced by benchmarking, we perform exploration in batches.
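The repetition-and-median scheme can be sketched as follows. The exact mapping from runtime to repetition count is not given in the paper, so the rule used below is a hypothetical placeholder; only the 20 to 1000 range and the use of the median come from the text.

```python
import statistics
import subprocess
import time

def benchmark(executable, min_reps=20, max_reps=1000):
    # Probe one run first; short-running (noisier) programs get more repetitions.
    start = time.perf_counter()
    subprocess.run([executable], check=True)
    probe = time.perf_counter() - start
    reps = max(min_reps, min(max_reps, int(1.0 / max(probe, 1e-3))))  # hypothetical rule

    samples = []
    for _ in range(reps):
        start = time.perf_counter()
        subprocess.run([executable], check=True)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)   # the median suppresses outliers from system noise
```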
3) Exploration:
To perform exploration we use an ε-greedy strategy. The value of ε is linearly annealed throughout training. The agent starts every exploration step by sampling a batch of base states. These states correspond to unoptimized versions of the IR for every source code in the training set. For each sampled state the agent selects an action either greedily or randomly based on the toss of a coin. State-action pairs already present in memory are used to perform server-side state transitions, and the exploration proceeds with the new state until the transition for a selected action is not yet present in memory. Finally, the assembled set of state-action pairs is sent to the manager and the agent proceeds to exploitation.
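A compact sketch of the ε-greedy selection with linear annealing is shown below; the annealing endpoints and the step counter are assumptions, since the paper does not list the exact schedule.

```python
import random

def epsilon(step, total_steps, eps_start=1.0, eps_end=0.05):
    # Linear annealing of ε over the course of training (endpoints assumed).
    frac = min(step / total_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def select_action(q_values, step, total_steps):
    # With probability ε pick a random action (exploration),
    # otherwise pick the action with the highest predicted value (exploitation).
    if random.random() < epsilon(step, total_steps):
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```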
4) Exploitation:
Before the exploitation process starts, the replay memory has to be populated with a sufficient number of experiences. Once the replay memory is large enough, the learner starts to train the agent by minimizing the loss function described previously in Section III-B. To stabilize training, we use fixed Q-targets which are updated once every τ steps. Every δ steps, where δ is a multiple of τ, the framework switches to evaluation mode, during which both exploration and exploitation halt and the agent's performance is evaluated.
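The interplay of τ and δ can be sketched as a training loop. Here replay_memory, q_net, target_net, optimizer, and evaluate are hypothetical stand-ins, td_loss is the helper sketched in Section III-B, and the τ and δ values mirror those reported for action space H.

```python
def train(q_net, target_net, replay_memory, optimizer, steps, tau=400, delta=4000):
    for step in range(1, steps + 1):
        batch = replay_memory.sample()                 # off-policy batch of experiences
        loss = sum(td_loss(q_net, target_net, *exp) for exp in batch) / len(batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if step % tau == 0:                            # fixed Q-targets: sync every τ steps
            target_net.load_state_dict(q_net.state_dict())
        if step % delta == 0:                          # evaluation every δ steps (δ is a multiple of τ)
            evaluate(q_net)
```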
5) Evaluation, Logging, and Visualization:
Evaluation is performed similarly to exploration, with two main differences. First, instead of sampling base states from the training set, the agent is evaluated on all of the base states in the dataset, including the validation set. Second, instead of letting the toss of a coin determine the chosen action, the agent always behaves greedily. The learner logs all of the data pertaining to a single run of the CORL framework, including evaluation and exploitation progress, to a separate file. The visualizer object continuously scans the logs directory and performs visualization using the VisDom framework (https://github.com/facebookresearch/visdom).

IV. EVALUATION
We first explain the experimental setup in Section IV-A, before discussing the quality of the fit achieved by our agents in Section IV-B. Then, we introduce the metrics we use to evaluate the performance of our agents in Section IV-C. We conclude by reviewing the results of our evaluation in Sections IV-D and IV-E.
A. Experimental Setup
The dataset for training our optimizing agents consists of 109 single-source benchmarks from the LLVM test suite. Single-source benchmarks were chosen as they provide a simple and convenient way of building and executing the benchmarks. The complete list of benchmarks and source codes is available in Table I. The programs were split between training and validation sets in a 4:1 ratio. To speed up experimentation, we excluded the four source codes with the longest runtime when dealing with action spaces M and L.

TABLE I: Benchmark suites and source codes in the dataset.

Polybench: correlation.c, covariance.c, 2mm.c, 3mm.c, atax.c, bicg.c, cholesky.c, doitgen.c, gemm.c, gemver.c, gesummv.c, mvt.c, symm.c, syr2k.c, syrk.c, trisolv.c, trmm.c, durbin.c, dynprog.c, gramschmidt.c, lu.c, ludcmp.c, floyd-warshall.c, reg_detect.c, adi.c, fdtd-2d.c, fdtd-apml.c, jacobi-1d-imper.c, jacobi-2d-imper.c, seidel-2d.c
Shootout: ackermann.c, ary3.c, fib2.c, hash.c, heapsort.c, lists.c, matrix.c, methcall.c, nestedloop.c, objinst.c, random.c, sieve.c, strcat.c, ackermann.cpp, fibo.cpp, heapsort.cpp, matrix.cpp, methcall.cpp, random.cpp, except.cpp
Misc: dt.c, evalloop.c, fbench.c, ffbench.c, flops-1.c, flops-2.c, flops-3.c, flops-4.c, flops-5.c, flops-6.c, flops-7.c, flops-8.c, flops.c, fp-convert.c, himenobmtxpa.c, lowercase.c, mandel-2.c, mandel.c, matmul_f64_4x4.c, oourafft.c, perlin.c, pi.c, ReedSolomon.c, revertBits.c, richards_benchmark.c, salsa20.c, whetstone.c, mandel-text.cpp, oopack_v1p8.cpp, sphereflake.cpp
Stanford: Bubblesort.c, FloatMM.c, IntMM.c, Oscar.c, Perm.c, Puzzle.c, Queens.c, Quicksort.c, RealMM.c, Towers.c, Treesort.c
BenchmarkGame: fannkuch.c, n-body.c, nsieve-bits.c, partialsums.c, puzzle.c, recursive.c, spectral-norm.c, fasta.c
Linpack: linpack-pc.c
McGill: chomp.c, misr.c, queens.c
Dhrystone: dry.c, fldry.c
CoyoteBench: almabench.c, huffbench.c, lpbench.c
SmallPT: smallpt.cpp

To execute optimization sequences we use the LLVM optimizer version 3.8. The choice of this particular version is motivated by the ability to use pre-trained embeddings from the study by Ben-Nun et al. [2]. However, our approach can be used with newer versions of the LLVM optimizer as well.

To define the action space H, we partition the O3 sequence of the LLVM optimizer into eight different actions, as shown in Table II. The division follows the observation that the optimization sequence consists of smaller logical sub-sequences ending with a simplifycfg pass. To allow a fair comparison of the results of our experiments, we define the actions in spaces M and L using the 42 unique transformation passes which are part of the actions in space H. In action space M, the passes are initialized with the default parameter values, while in action space L agents also choose the parameter values. Table III lists the passes in action space L which have tunable parameters, along with the values for these parameters. The value µ = 16, the maximum number of actions, is used in all of the experiments.

For our experiments, we run the server-side and the client-side logic of the CORL framework on two different hardware architectures. Below we describe these architectures in detail.
1) Server:
The server-side logic responsible for training the agents and distributing tasks to clients was run on a single server with two Intel(R) Xeon(R) Gold 6126 2.60GHz CPUs, 64 GB of main memory, two NVIDIA GeForce GTX 1080 Ti GPUs, and the Ubuntu 16.04 LTS operating system. We trained our models using a single GPU.
2) Client:
We ran the clients on the nodes of Hardware Phases I and II of the Lichtenberg High Performance Computer. The nodes within Hardware Phases I and II each have two Intel(R) Xeon(R) E5-2670 CPUs and Intel(R) Xeon(R) E5-2680 v3 CPUs respectively, 64 GB of main memory, and run CentOS Linux version 7. Each node ran a single client at a time, with the number of clients dynamically changing throughout the runs as workers were added and removed from the pool. Due to availability constraints, experiments with action space H were performed on the nodes of Hardware Phase II, while experiments with action spaces M and L were performed on the nodes of Hardware Phase I.

TABLE II: Passes within the O3 sequence of the LLVM optimizer version 3.8, divided into eight different actions for experiments in the action space H. Passes are listed in their order within the O3 sequence.

Action 0: tti, verify, tbaa, scoped-noalias, simplifycfg, sroa, early-cse, lower-expect
Action 1: targetlibinfo, tti, forceattrs, tbaa, scoped-noalias, inferattrs, ipsccp, globalopt, mem2reg, deadargelim, instcombine, simplifycfg
Action 2: globals-aa, prune-eh, inline, functionattrs, argpromotion, sroa, early-cse, jump-threading, correlated-propagation, simplifycfg
Action 3: instcombine, tailcallelim, simplifycfg
Action 4: reassociate, loop-rotate, licm, loop-unswitch, simplifycfg
Action 5: instcombine, indvars, loop-idiom, loop-deletion, loop-unroll, mldst-motion, gvn, memcpyopt, sccp, bdce, instcombine, jump-threading, correlated-propagation, dse, licm, adce, simplifycfg
Action 6: instcombine, barrier, rpo-functionattrs, elim-avail-extern, globals-aa, float2int, loop-rotate, loop-vectorize, instcombine, slp-vectorizer, simplifycfg
Action 7: instcombine, loop-unroll, instcombine, licm, alignment-from-assumptions, strip-dead-prototypes, globaldce, constmerge
B. Convergence
To measure the quality of the fit achieved by our agents, we record the mean value of the loss function for sampled batches of experiences throughout training. Figure 4 shows how the loss converges in all three action spaces. We achieve the best fit in the action space H with the relatively high value of γ = 0.9, which allows the network to account for long-term rewards when predicting the values of different actions. We use γ = 0.5 and increase the value of τ for the larger action spaces to stabilize the training. While the loss converges in action space M, it diverges in action space L in spite of larger values of τ. The disadvantage of increasing τ is that the training time also increases proportionally. As can be observed from Figure 4c, using larger values of τ in action space L stabilizes training. However, it also prohibitively increases training time and therefore we refrain from further experiments with even bigger values of τ.

TABLE III: Passes in the action space L that have tunable parameters. The first value is the default for each parameter.

loop-vectorize: vectorizer-maximize-bandwidth [false, true]; max-interleave-group-factor [8, 6, 10]; enable-interleaved-mem-accesses [false, true]; vectorizer-min-trip-count [16, 8, 32, 64]; enable-mem-access-versioning [true, false]; max-nested-scalar-reduction-interleave [2, 1, 4]; enable-cond-stores-vec [false, true]; enable-ind-var-reg-heur [true, false]; vectorize-num-stores-pred [1, 2, 4]; enable-if-conversion [true, false]; enable-loadstore-runtime-interleave [true, false]; loop-vectorize-with-block-frequency [false, true]; small-loop-cost [20, 10, 30]
simplifycfg: bonus-inst-threshold [1, 2]; phi-node-folding-threshold [2, 3, 4]; simplifycfg-dup-ret [false, true]; simplifycfg-sink-common [true, false]; simplifycfg-hoist-cond-stores [true, false]; simplifycfg-merge-cond-stores [true, false]; simplifycfg-merge-cond-stores-aggressively [false, true]; speculate-one-expensive-inst [true, false]; max-speculation-depth [10, 5, 20]
loop-unroll: percent-dynamic-cost-saved-threshold [20, 15, 25]; runtime [false, true]; allow-partial [false, true]; max-iteration-count-to-analyze [0, 10, 100, 1000, 10000]; dynamic-cost-savings-discount [2000, 1500, 2500]; threshold [150, 75, 300, 600]
slp-vectorizer: slp-vectorize-hor [true, false]; slp-threshold [0, 1, 2]; slp-vectorize-hor-store [false, true]; slp-max-reg-size [128, 64, 256, 512]; slp-schedule-budget [100000, 50000, 200000]
inline: inlinecold-threshold [275, 175, 225, 325, 400]; inline-threshold [275, 175, 225, 325, 400]; inlinehint-threshold [325, 175, 275, 225, 400]
loop-unswitch: with-block-frequency [false, true]; threshold [100, 60, 140]; coldness-threshold [1, 2, 3]
indvars: liv-reduce [true, false]; verify-indvars [true, false]; replexitval [cheap, never, always]
gvn: enable-pre [true, false]; enable-load-pre [true, false]; max-recurse-depth [1000, 2000, 3000]
sroa: sroa-random-shuffle-slices [false, true]; sroa-strict-inbounds [false, true]
jump-threading: implication-search-threshold [3, 2, 4]; threshold [6, 3, 9, 12]
loop-rotate: rotation-max-header-size [16, 8, 32, 64]
licm: disable-licm-promotion [true, false]
lower-expect: likely-branch-weight [64, 32, 128]
float2int: float2int-max-integer-bw [64, 32, 128]

C. Metrics
In order to evaluate the optimization potential of an agent, we compare its performance with that of LLVM's built-in O3 optimization sequence. To do that, we first calculate the speedup achieved by the agent on every source code in the dataset. Then we aggregate these values across the training and validation sets by computing the geometric mean of the speedups for the source codes in the respective sets. We perform similar calculations for LLVM's O3 sequence and compare the computed metrics to evaluate the performance of an agent.

For an agent to learn the values of taking different actions, these actions have to be explored first. As the agent continuously explores its environment, it accumulates new experiences which potentially yield higher speedups. In other words, the highest observed speedups on source codes in the dataset continue to grow over time. These values put an upper bound on the agent's performance and enable us to tell how close it is to the best possible one. Therefore, during evaluation we also record the highest observed speedup for every source code in the dataset. Below we first present the results for the aggregate metrics, before showing the performance of our agents on individual source codes.
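A small sketch of the aggregation step, assuming a list of per-program speedups relative to the unoptimized baseline:

```python
import math

def geometric_mean(speedups):
    # Aggregate per-program speedups; equivalent to exp(mean(log(x))).
    return math.exp(sum(math.log(x) for x in speedups) / len(speedups))

# Example: three programs with speedups over the unoptimized baseline.
print(geometric_mean([2.0, 1.5, 3.0]))   # ~2.08
```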
D. Aggregate Results
As can be observed in Figure 5a, the agent learns to outperform the O3 strategy on the training set in action space H, achieving an average speedup of 2.24x over the unoptimized version, while the O3 sequence achieves an average speedup of around 2.17x. The agent's performance is nearly 95% of the observed best-possible performance, which confirms that the model achieves a good fit on the training data. Figure 5d shows that, while the validation set performance also increases over time, it only approaches the performance of the O3 strategy, achieving an average speedup of 2.38x over the unoptimized version versus the 2.67x average speedup achieved by O3. The growing best-observed performance on the validation set shows that by behaving greedily the agent independently discovers states corresponding to IRs with lower runtime than those produced by O3. As we will see later, while the agent seldom significantly outperforms the O3 strategy, it fails to be equally robust across all the source codes. We attribute this mainly to a lack of diversity in the distribution of source codes in our training set and believe that having a larger, more diverse training set would likely solve the issue. Nonetheless, the fact that we were able to discover IRs with lower runtime by re-arranging sub-sequences of passes comprising LLVM's O3 routine shows that it is far from optimal.

Fig. 4: From left to right, convergence of the loss value during training for action spaces H (γ = 0.9, τ = 400), M (γ = 0.5, τ = 1000), and L (γ = 0.5, τ = 4000 and τ = 6000). Loss values are running means on a logarithmic scale. As the size of the action space increases, the quality of the fit achieved by our model decreases. For action space L, the loss value diverges even despite increasing parameter τ.

TABLE IV: Top 5 best and worst performances on individual programs of an agent trained in action space H.

At every step in action space M, our agent has to choose one of 42 actions, each corresponding to a particular LLVM pass. Since this space is much larger than space H, it takes much longer for the agent to discover advantageous states. Figure 5b shows that after more than forty evaluation steps, which include nearly six days of exploration within that period, the agent is able to observe experiences yielding the same average speedup over the baseline as the O3 sequence. During this time the agent continuously improves its performance on the training set and, given enough time, is likely to achieve and surpass the performance of the O3 strategy. However, its performance on the validation set does not seem to improve, as seen in Figure 5e. Therefore, in view of the limited access to compute resources, we terminate the experiment in action space M after 56 evaluations.
We believe that, given the larger number of actions in space M when compared to H, it is easier for the agent to memorize action sequences yielding high speedups on specific source codes in the training set. Similar to action space H, increasing the size and diversity of the training set is likely to force the agent to generalize and achieve better performance on the validation set.

Given that in our experiments in the action space L the loss function diverges, we do not see any meaningful improvement in the agent's performance during the evaluation, as shown in Figures 5c and 5f.
E. Performance on Individual Programs
To examine the behavior of an agent trained in action space H on individual source codes, we record the sequence of actions chosen by the model for every program. This allows us to verify that the model does indeed produce a different optimization strategy for different programs. Furthermore, we calculate the speedup achieved by our agent and LLVM's O3 strategy over the unoptimized base version of the IR of every source code in our dataset. Table IV presents the top five best and worst performance results in both training and validation sets.
Fig. 5: Aggregate speedups in all three action spaces (δ = 4K, 10K, and 20K for spaces H, M, and L, respectively). The three curves in every plot show the average performance of a model, the average of the best observed performance on every program in the specified set, and the average performance of LLVM's O3 sequence.

By observing the results we can conclude that the agent does indeed produce a specialized optimization strategy for every source code. Interestingly, the agent utilizes the balance of µ = 16 available actions to the fullest in almost all cases, except some that are not shown in Table IV. This means that in most cases the agent predicts at least one action to yield a positive reward. Although the IR itself does not necessarily change as a result of every action, the history of actions is always updated to store the latest action. Since the state consists of both the IR and the history of actions, it changes after every action of the agent. Therefore, we stop sampling the agent only when all of the actions are predicted to lead to a slowdown, i.e., have a value of 0 or less, or after the maximum number of actions µ is taken.

V. RELATED WORK
The compiler optimization problem has been in the focus of the research community for several decades, with the earliest works dating back to the late 1970s [8]. Its subproblems of various complexity, ranging from the simplest, parameter value selection, to the most complex, phase-ordering, have been tackled via different classes of methods [9]. Among these methods are iterative search techniques [10], genetic algorithms [11]–[13], and machine learning methods [?], [3], [4], with deep learning methods gaining popularity in recent years [5], [14].

In order to leverage the advantages of (deep) machine learning methods when it comes to compiler optimization, several challenges need to be addressed: (i) correctly defining the learning problem, (ii) choosing or building the right set of features to represent the program, (iii) generating the dataset for training, and (iv) selecting the right neural network architecture which is both expressive enough to learn the task and allows efficient training. The learning problem is defined as either an unsupervised learning problem, often used to learn features [2], [5], [14], [15], a supervised learning problem [4], [5], or a reinforcement learning problem [3]. The set of features includes statically-available ones, such as code token sequences [5], [16], [17], abstract syntax trees (AST) and AST paths [18]–[20], and IRs and learned representations built on top of IRs [2], [14], [15], [21]–[23]. An additional set of features includes the problem size [5] and dynamic performance counters [24]. Training data is often generated manually for supervised learning methods [4], [5], while reinforcement learning methods use an initial training set to generate data via exploration [3]. Unsupervised learning methods can take advantage of the large code corpora available online [2], [14]. There also exist methods for automatic generation of training data using deep neural networks [17].

Our work is most similar to the approach by Kulkarni et al. [3], who also use reinforcement learning and train a neural network to tackle the phase-ordering problem. However, the important differences from the above work are the following: (i) our approach does not depend on dynamic features and therefore does not require a program to be run to make a prediction, (ii) the search space of possible optimizations considered in our work is much larger, (iii) our approach depends on the IR of the program and is therefore agnostic to the front-end language a program is written in, and (iv) instead of NEAT, we use gradient-based optimization to train our neural network.

Ashouri et al. [4] developed the MiCOMP framework to tackle the phase-ordering problem by first clustering the LLVM passes composing the O3 sequence of the LLVM optimizer and then using a supervised learning approach to devise an iterative compilation strategy which outperforms the O3 sequence within several trials. Similar to Kulkarni et al. [3], they use dynamic features and consider a search space that is smaller than H, the smallest action space considered in our work.

VI. CONCLUSION
We formulated compiler phase-ordering as a deep reinforcement learning problem and developed the CORL framework, which allows for efficient training of optimizing agents. Our approach is fully automatic and relies only on the initial supply of a dataset of programs. We were able to train agents which surpass the performance of LLVM's hard-coded O3 optimization sequence on the observed set of source codes and achieve competitive performance on the validation set, gaining up to 1.32x speedup over the O3 sequence with previously unseen programs. We believe these results exhibit the big potential of deep reinforcement learning in tackling the phase-ordering problem of compilers.

Our approach has several shortcomings, which we plan to address in the future. Firstly, increasing the size of the dataset to include a more diverse set of source programs might be enough to achieve superior performance compared with the hard-coded optimization strategy. Secondly, using higher-quality embeddings for the IR and an appropriate neural architecture can result in more efficient and robust optimizing agents. Next, the current design requires that the programs are compiled and benchmarked on every new target system, which requires substantial computational resources. While the calculation of rewards by running the benchmarks on the end systems is at the center of our approach, we believe the data efficiency of the learning procedure could be improved by including a self-supervised learning step by the agent. This would potentially result in a more efficient exploration strategy and reduce the computational burden by allowing faster convergence of an agent. Finally, optimizing the agents' training procedure could allow for similar results to be achieved in higher dimensional action spaces.

ACKNOWLEDGMENTS
This work is supported by the Graduate School CE within the Centre for Computational Engineering at Technische Universität Darmstadt and by the Hessian LOEWE initiative within the Software-Factory 4.0 project. The calculations for this research were conducted on the Lichtenberg Cluster of TU Darmstadt.
REFERENCES

[1] Z. Gong, Z. Chen, J. Szaday, D. Wong, Z. Sura, N. Watkinson, S. Maleki, D. Padua, A. Veidenbaum, A. Nicolau, and J. Torrellas, "An empirical study of the effect of source-level transformations on compiler stability," in OOPSLA, vol. 2, 2018, pp. 126:1–126:29.
[2] T. Ben-Nun, A. S. Jakobovits, and T. Hoefler, "Neural code comprehension: A learnable representation of code semantics," in Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds. Curran Associates, Inc., 2018, pp. 3585–3597.
[3] S. Kulkarni and J. Cavazos, "Mitigating the compiler optimization phase-ordering problem using machine learning," in Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications, 2012, pp. 147–162.
[4] A. H. Ashouri, A. Bignoli, G. Palermo, C. Silvano, S. Kulkarni, and J. Cavazos, "MiCOMP: Mitigating the compiler phase-ordering problem using optimization sub-sequences and machine learning," ACM Trans. Archit. Code Optim., vol. 14, no. 3, pp. 29:1–29:28, Sep. 2017.
[5] C. Cummins, P. Petoumenos, Z. Wang, and H. Leather, "End-to-end deep learning of optimization heuristics," in PACT, 2017, pp. 219–232.
[6] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. A. Riedmiller, "Playing Atari with deep reinforcement learning," CoRR, vol. abs/1312.5602, 2013.
[7] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[8] B. W. Leverett, R. G. Cattell, S. O. Hobbs, J. M. Newcomer, and A. H. Reiner, "An overview of the production quality compiler-compiler project," Carnegie-Mellon University, Department of Computer Science, Tech. Rep., 1979.
[9] A. H. Ashouri, W. Killian, J. Cavazos, G. Palermo, and C. Silvano, "A survey on compiler autotuning using machine learning," ACM Computing Surveys (CSUR), vol. 51, no. 5, pp. 1–42, 2018.
[10] F. Bodin, T. Kisuki, P. Knijnenburg, M. O'Boyle, and E. Rohou, "Iterative compilation in a non-linear optimisation space," 1998.
[11] K. D. Cooper, P. J. Schielke, and D. Subramanian, "Optimizing for reduced code space using genetic algorithms," in Proceedings of the ACM SIGPLAN 1999 Workshop on Languages, Compilers, and Tools for Embedded Systems, 1999, pp. 1–9.
[12] K. D. Cooper, D. Subramanian, and L. Torczon, "Adaptive optimizing compilers for the 21st century," The Journal of Supercomputing, vol. 23, no. 1, pp. 7–22, 2002.
[13] P. Kulkarni, S. Hines, J. Hiser, D. Whalley, J. Davidson, and D. Jones, "Fast searches for effective optimization phase sequences," ACM SIGPLAN Notices, vol. 39, no. 6, pp. 171–182, 2004.
[14] C. Cummins, Z. V. Fisches, T. Ben-Nun, T. Hoefler, and H. Leather, "ProGraML: Graph-based deep learning for program optimization and analysis," arXiv preprint arXiv:2003.10536, 2020.
[15] A. Brauckmann, A. Goens, S. Ertel, and J. Castrillon, "Compiler-based graph representations for deep learning models of code," in Proceedings of the 29th International Conference on Compiler Construction, 2020, pp. 201–211.
[16] M. Allamanis and C. A. Sutton, "Mining source code repositories at massive scale using language modeling," in Proceedings of the 10th Working Conference on Mining Software Repositories, MSR '13, San Francisco, CA, USA, May 18-19, 2013, 2013, pp. 207–216.
[17] C. Cummins, P. Petoumenos, Z. Wang, and H. Leather, "Synthesizing benchmarks for predictive modeling," in CGO, 2017, pp. 86–99.
[18] H. K. Dam, T. Pham, S. W. Ng, T. Tran, J. Grundy, A. Ghose, T. Kim, and C.-J. Kim, "A deep tree-based model for software defect prediction," arXiv preprint arXiv:1802.00921, 2018.
[19] U. Alon, M. Zilberstein, O. Levy, and E. Yahav, "A general path-based representation for predicting program properties," ACM SIGPLAN Notices, vol. 53, no. 4, pp. 404–419, 2018.
[20] ——, "code2vec: Learning distributed representations of code," Proceedings of the ACM on Programming Languages, vol. 3, pp. 40:1–40:29, 2019.
[21] R. Aggarwal, S. Jain, M. S. Desarkar, R. Upadrasta, Y. Srikant et al., "IR2Vec: A flow analysis based scalable infrastructure for program encodings," arXiv preprint arXiv:1909.06228, 2019.
[22] E. Park, J. Cavazos, and M. A. Alvarez, "Using graph-based program characterization for predictive modeling," in Proceedings of the Tenth International Symposium on Code Generation and Optimization, ser. CGO '12. New York, NY, USA: ACM, 2012, pp. 196–206.
[23] M. Allamanis, M. Brockschmidt, and M. Khademi, "Learning to represent programs with graphs," CoRR, vol. abs/1711.00740, 2017.
[24] J. Cavazos, G. Fursin, F. Agakov, E. Bonilla, M. F. O'Boyle, and O. Temam, "Rapidly selecting good compiler optimizations using performance counters," in International Symposium on Code Generation and Optimization (CGO'07). IEEE, 2007, pp. 185–197.