Learning to Make Compiler Optimizations More Effective
Rahim Mammadli
Department of Computer Science, Technical University of Darmstadt
[email protected]

Marija Selakovic
Department of Computer Science, Technical University of Darmstadt
[email protected]

Felix Wolf
Department of Computer Science, Technical University of Darmstadt
[email protected]

Michael Pradel
Department of Computer Science, University of Stuttgart
[email protected]
Abstract
Because loops execute their body many times, compiler developers place much emphasis on their optimization. Nevertheless, in view of highly diverse source code and hardware, compilers still struggle to produce optimal target code. The sheer number of possible loop optimizations, including their combinations, exacerbates the problem further. Today's compilers use hard-coded heuristics to decide when, whether, and which of a limited set of optimizations to apply. Often, this leads to highly unstable behavior, making the success of compiler optimizations dependent on the precise way a loop has been written. This paper presents LoopLearner, which addresses the problem of compiler instability by predicting which way of writing a loop will lead to efficient compiled code. To this end, we train a neural network to find semantically invariant source-level transformations for loops that help the compiler generate more efficient code. Our model learns to extract useful features from the raw source code and predicts the speedup that a given transformation is likely to yield. We evaluate LoopLearner with 1,895 loops from various performance-relevant benchmarks. Applying the transformations that our model deems most favorable prior to compilation yields an average speedup of 1.14x. When trying the top-3 suggested transformations, the average speedup even increases to 1.29x. Comparing the approach with an exhaustive search through all available code transformations shows that LoopLearner helps to identify the most beneficial transformations in several orders of magnitude less time.
1 Introduction

The optimization techniques used in modern compilers are continuously improving. In view of the increasing complexity of hardware and software, the effectiveness of compiler optimizations becomes crucial in achieving satisfactory system performance. However, despite the tremendous progress of compiler technology, the optimizations a compiler applies are usually limited to a fixed set of program transformations. Furthermore, compiler developers manually design optimization heuristics that control program compilation and optimization. Writing these heuristics requires expert knowledge and is one of the most difficult and time-consuming tasks in compiler development. This is why compiler optimizations are not guaranteed to produce optimal output, and in fact, they may even degrade performance in some cases.

A recent study by Gong et al. [23] illustrates the challenges compiler developers face today. Looking at how source-level loop transformations affect performance, the authors observed that compilers are not only far from producing optimal code, but are also highly unstable: given semantically equivalent variants of the same piece of code, compilers produce target code that differs significantly in terms of performance. As a result of this "compiler instability", as Gong et al. named the problem, programmers are left without any guidance as to which variant of the source code to feed into the compiler. To maximize performance, a programmer may choose to deal with compiler instability by (a) systematically trying as many semantically equivalent code variants as possible and measuring which performs best, or (b) learning through experience which variant works best for a given compiler. Since the first option is very time consuming and the second option requires expert knowledge of the underlying compiler, both strategies are of limited use in practice.

To mitigate the problem of compiler instability, we present
LoopLearner, a learning-based approach that predicts semantics-preserving transformations of a loop that will improve the performance of the compiled program. Given a loop and a search space of such transformations, LoopLearner predicts which transformation or sequence of transformations will yield the best-performing target code with a given compiler. The search space explored by LoopLearner consists of around 3,000 sequences of transformations, composed of five basic optimizations, their combinations, and different parametrizations. We focus on loops for two reasons. First, optimizing loops is important because the loop body is repeatedly executed, often thousands of times, which in total accounts for a significant fraction of the overall execution time. Second, loop transformations are one of the major optimizations supported by modern compilers, which is why loops are at the core of compiler instability.

We envision LoopLearner to be useful in multiple scenarios. First, it can assist developers in deciding how to write a loop. By predicting which variant of a loop yields the best performance, developers can make an informed decision, instead of relying on their intuition. Second, the approach can guide an automated pre-processing step that applies code transformations before handing the code over to the compiler. Such pre-processing does not require any developer attention and mitigates the problem of compiler instability without the need to change the compiler itself. And, of course, one could also integrate our predictive model directly into the compiler to improve its stability. In the second and third usage scenario, LoopLearner's predictions complement the built-in optimization heuristics of the compiler by presenting the code in a way that will make best use of these heuristics.

We define the problem of predicting the best transformation for a loop as a regression problem: based on the source code of a given loop, LoopLearner learns to predict the speedup that a certain transformation is likely to yield. After training the model with tens of thousands of examples, we query the model for each transformation to determine which one gives the highest performance improvement. To effectively learn the performance benefits of transformations on specific code, we need a suitable encoding of both inputs. LoopLearner encodes source code as a sequence of tokens, and we compare different representations of individual tokens. To encode transformations, we present a novel, compact representation that ensures that similar transformations have a similar representation. LoopLearner uses a convolutional neural network architecture, which has proven to be very effective on compositional data.

One of the key challenges in choosing among the available code optimizations is the large space of possible transformations. A naive approach could apply each transformation, then run the compiled code, and measure its execution time. Unfortunately, this approach takes significant time, in particular because reliable performance measurements require executing the code repeatedly. Instead of executing transformed code, LoopLearner queries a predictive model once per transformation.
Since querying our neural model is very fast and because queries for different transformations can be run in batches, our approach reduces the effort for finding a suitable transformation by multiple orders of magnitude.

Prior learning-based work on improving optimizing compilers aims at finding suitable compiler heuristics, including the work by Yuki et al. [57], who predict optimal loop tiling sizes, Stephenson and Amarasinghe [53], who determine the best loop unrolling factor, and Simon et al. [52], who construct compiler heuristics automatically. Our approach differs from those approaches in several ways. One difference is that we consider a much larger space of optimizations, that is, nearly 3,000 combinations of five common loop optimizations (unrolling, unroll-and-jam, tiling, distribution, and interchange), including variations of their parameters. Another distinctive feature of our approach is that it reasons about source-level transformations to be applied before passing a program to the compiler, instead of optimization decisions taken in the compiler. Finally, LoopLearner involves neither the manual design nor the pre-selection of any features. Instead, we feed the source code as-is into a neural network that learns how to identify suitable features on its own. Cummins et al. [18] also train a neural model that predicts from raw code how to support code optimization. However, their model focuses on a small set of optimization parameters used in the compiler, e.g., whether to map a kernel to the CPU or the GPU, whereas we consider a larger space of transformations applied before passing code to the compiler.

To evaluate LoopLearner, we use an extensive collection of nested loops from the empirical study by Gong et al. [23]. To train the model, we consider all transformations the study used to create loop mutations. In total, the data set amounts to around 70,000 data points, originating from 1,895 unique loops from 18 benchmarks and almost 3,000 unique transformations. One transformation consists of a sequence of one or more loop transformations and their parameters. We find that our model has a precision of 73% when predicting speedups. Furthermore, by ranking all transformations based on their predicted performance improvements and by applying the top-1 transformation, LoopLearner achieves a speedup of 1.14x, on average across all loops. If the developer or tool tries the top-3 suggested transformations and picks the best one, the average speedup increases even to 1.29x.

In summary, this paper makes the following contributions:

• Learning-based approach to mitigate compiler instability. We are the first to systematically mitigate the problem of compiler instability through a learned model that predicts source-to-source transformations likely to make compiler optimizations more effective. The deep learning-based model automatically extracts features from a given loop, without any manual feature engineering.
• Search space. The approach scales to a large search space consisting of thousands of transformations. The search space is built from five common and semantically invariant loop transformations, applied alone or in sequence, and their several parameters.
• Empirical evidence. We empirically demonstrate that applying the transformation our model deems most favorable yields an average speedup of 1.14x (for the best predicted transformation) or 1.29x (when considering the top-3 predictions).

The remainder of this paper is organized as follows.
Section 2 summarizes the problem of compiler instability described by Gong et al. [23]. Section 3 presents our approach to the selection of beneficial loop transformations. Section 4 discusses experimental settings and results. Finally, we discuss related work in Section 5 and review our results in Section 6.

2 Compiler Instability

The attribute stable characterizes a compiler that produces the same performance for any semantically equivalent variant of a program. In their study, Gong et al. [23] evaluate the stability of modern compilers by applying several source-to-source transformations to obtain semantically equivalent code variants and by measuring the variation in their execution time. To illustrate the effect of program transformations on compiler stability, consider the example in Listing 1. The first loop is extracted from function
Regclass in the SPEC CPU2000 benchmark suite. After unrolling the loop with a factor of two, yielding the second loop in the listing, the Clang compiler generates output that is, on average, 1.19x faster than the original loop.

/* original loop */
for (Class = 0; Class < 256; ++Class) {
  if (opnd[1 + (Class >> 3 & 31)] & 1 << (Class & 7)) {
    I32 cf = Perl_fold[Class];
    opnd[1 + (cf >> 3 & 31)] |= 1 << (cf & 7);
  }
}

/* unrolled, factor = 2 */
for (Class = 0; Class <= 255; Class += 2) {
  if (opnd[1 + (Class >> 3 & 31)] & 1 << (Class & 7)) {
    I32 cf = Perl_fold[Class];
    opnd[1 + (cf >> 3 & 31)] |= 1 << (cf & 7);
  }
  if (opnd[1 + (Class + 1 >> 3 & 31)] & 1 << (Class + 1 & 7)) {
    I32 cf = Perl_fold[Class + 1];
    opnd[1 + (cf >> 3 & 31)] |= 1 << (cf & 7);
  }
}

Listing 1. Original and unrolled loop in function Regclass from a program in the SPEC CPU2000 benchmark suite.

Gong et al. quantify compiler stability using the following two metrics: intra-compiler and inter-compiler stability. The first metric, which is the focus of this paper, measures the stability of a single compiler, while the second metric measures the stability across multiple compilers. Although the authors of the study concede that building a perfectly stable compiler is almost impossible, they show that modern compilers have ample potential for improvement in this direction. Specifically, they demonstrate that applying source-level transformations prior to compilation can significantly reduce the performance gap between variants of a loop. A problem not addressed by prior work is which out of many possible transformations to apply to a given piece of code.

Table 1. Loop transformations and their parameters.

Transformation    Parameters
Unrolling         Unroll factor ∈ {2, 4, 8}
Unroll-and-jam    Loop level, unroll factor
Tiling            Loop level, tile size
Interchange       Lexicographical permutation number
Distribution      No parameters

The purpose of our work is to address the problem of intra-compiler instability, by learning code transformations that should be applied to maximize the performance of the compiler output. We train our model on the same source code examples and transformations used in the original study by Gong et al. Each loop transformation consists of a sequence of well-known base transformations, which are listed in Table 1. To ensure that transformations produce semantically equivalent output for every loop, the space of considered transformations is limited to sub-sequences of the following sequences:

• interchange → unroll-and-jam → distribution → unrolling
• interchange → tiling → distribution → unrolling

In total, this space consists of almost 3,000 unique transformations (i.e., sub-sequences), each of them combining base transformations with different parameters. The number of transformations applied to a specific loop is much smaller (37, on average), because only some transformations can be applied in a semantics-preserving way. Yet, as we show in Section 4.6, exhaustively exploring the performance impact of all transformations is still rather expensive.
3 Approach

In this section, we describe the LoopLearner approach, which mitigates the problem of compiler instability by predicting loop transformations that enable the compiler to produce efficient target code. We start with a rough overview and potential usage scenarios, before we define our learning problem. Afterwards, we discuss preprocessing steps applied to the data, before showing which encoding methods we experimented with. Next, we introduce our deep-neural-network (DNN) architecture and discuss design decisions made while building it. Finally, we specify the set of hyperparameters used to train the neural model.
Figure 1 illustrates our approach on a high level. The input to our network is a loop and a transformation that may be applied to it. We assume that the transformation is valid and does not affect the semantics of the program. For the data set used in the evaluation, which we borrowed from Gong et al. [23], these properties are ensured using the polyhedral optimizer Polyopt/C (http://web.cse.ohio-state.edu/~pouchet.2/software/polyopt) and the dependence analyzer Candl (http://icps.u-strasbg.fr/people/bastoul/public_html/development/candl). As a first step, we tokenize the loop with the help of a lexer. The resulting sequence of tokens is then encoded using one of the methods discussed in Section 3.6. To feed the transformation into the model, the approach encodes it into a compact, similarity-preserving representation presented in Section 3.7. Given both the code and the transformation, the model predicts the speedup, i.e., the ratio of the original loop's execution time to the execution time obtained by applying the transformation. Hence, having a set of valid transformations that can be applied to a given loop, our neural network can be used to rank them by their predicted speedup. Given a ranked list of transformations, the user or a tool can then apply the transformation that is expected to produce the highest speedup.

To interpret the predictions of our model, we start by specifying a speedup threshold, which is a hyperparameter used to classify the prediction as either advantageous, disadvantageous, or neutral. Formally, let p be the prediction of the model, a be the actual performance, and t be a speedup threshold with t ≥ 1. Then, the prediction is assigned to one of three classes:

• advantageous, if p > t
• disadvantageous, if p < 1 − (t − 1)
• neutral, if 1 − (t − 1) ≤ p ≤ t

A prediction is considered to be accurate if

(p > 1 ∧ a > 1) ∨ (p ≤ 1 ∧ a ≤ 1).

Since our solution is intended to achieve speedup and avoid slowdown, we value a high precision rate for speedup predictions. Therefore, increasing t (i.e., the range where the model predicts neutral) allows us to focus on clearer predictions of speedups and slowdowns, which is likely to increase precision but to reduce recall.

A programmer or a tool facing the problem of choosing the best transformation for a given loop has multiple options. The first option involves applying no transformations and relying on the compiler to determine and apply the best set of optimizations. The second option is to test the performance of the loop with k different transformations and choose the one producing the highest speedup. As discussed earlier, the number of transformations, their combinations, and the number of parameters that each of them accepts can result in a very high number of distinct transformations applicable to a given loop. Therefore, in most real-life scenarios, measuring the performance of a loop with all possible transformations is not feasible. It can, however, be feasible to evaluate k transformations if k is a relatively small number.

To aid the programmer or tool in choosing the best set of transformations for a given loop, we consider two usage scenarios of LoopLearner:

• If evaluating the performance of loops with and without applying transformations is prohibitively expensive, we propose using LoopLearner in a static scenario. This scenario implies applying the best advantageous transformation if such a transformation exists.
• If evaluating the performance of up to k mutations of loops is feasible, LoopLearner can be used in a dynamic scenario, which involves applying the top-k advantageous transformations and measuring their actual performance. If none of the transformations results in an actual speedup, the original loop is left untouched. Otherwise, the programmer or a tool chooses the transformation resulting in the highest speedup.

The task of predicting the speedup achievable by applying a given transformation to a loop can be viewed as a regression problem. Specifically, given a dataset {(L_i, T_i) → S_i} for i = 1, ..., N, where N is the size of the dataset and S_i is the speedup or slowdown resulting from applying the transformation T_i to the loop L_i, our goal is to learn an approximation of the function f(L, T) = S. To this end, we train a neural network f_p to minimize the mean squared error as our loss function:

L = (1/N) · Σ_{i=1}^{N} ( f_p(L_i, T_i) − S_i )²

The input given to LoopLearner is a set of loops, each extracted into a separate file from a larger program. As discussed in Section 2, our dataset is based on loops used in the study by Gong et al. Their technique for extracting loops can be easily applied to other programs as well. Before training the model, we preprocess the data as follows. For each loop in the original program, we extract tokens from the source code, such that a token is represented as a pair (t, v), where t is its syntactic type and v is the value, i.e., a string representation of the token in the source code.

For many learning tasks where the input data is a sequence of variable length, it is common to select the maximum length beforehand. The input sequences of smaller lengths are then padded to the maximum length, which makes it possible to vectorize the computations. To avoid long training times and to be able to initialize the building blocks of our neural network, we exclude sequences of tokens longer than 250. In this way, we are able to achieve good model efficiency (Section 4.5) while keeping 90% of the loops from the original dataset.
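To make the classification of predictions from Section 3.2 and the regression loss defined above concrete, the following Python sketch shows how a predicted speedup can be checked against the threshold t and how the mean squared error can be computed with PyTorch. The function and variable names are illustrative only and not part of LoopLearner's released interface.

import torch

def classify_prediction(p, t):
    # Classify a predicted speedup p using a speedup threshold t >= 1.
    if p > t:
        return "advantageous"
    if p < 1.0 - (t - 1.0):
        return "disadvantageous"
    return "neutral"

def prediction_is_accurate(p, a):
    # A prediction is accurate if it agrees with the measured speedup a
    # on whether the transformation helps (both > 1) or not (both <= 1).
    return (p > 1.0 and a > 1.0) or (p <= 1.0 and a <= 1.0)

# Mean squared error over a batch of predicted and measured speedups,
# corresponding to the regression loss defined above.
predicted = torch.tensor([1.20, 0.95, 1.05])
measured = torch.tensor([1.15, 0.90, 1.30])
loss = torch.mean((predicted - measured) ** 2)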
Figure 1. High-level overview of LoopLearner: a lexer turns the loop into a sequence of tokens, a sequence encoder produces encoded tokens, a transformation encoder produces a compact encoding of the transformation, and the model predicts the expected speedup from both inputs.
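As a concrete illustration of the tokenization step in Figure 1, the sketch below lexes a loop into (type, value) token pairs. It assumes pycparser's CLexer interface (callbacks for errors, braces, and typedef lookups; build/input/token methods); the exact setup in LoopLearner may differ.

from pycparser.c_lexer import CLexer

def tokenize(source):
    # Collect (syntactic type, value) pairs, as used by the sequence encoders.
    tokens = []
    lexer = CLexer(
        error_func=lambda msg, line, col: None,   # ignore lexing errors here
        on_lbrace_func=lambda: None,
        on_rbrace_func=lambda: None,
        type_lookup_func=lambda name: False,      # treat no identifier as a typedef
    )
    lexer.build()
    lexer.input(source)
    tok = lexer.token()
    while tok is not None:
        tokens.append((tok.type, tok.value))
        tok = lexer.token()
    return tokens

loop = "for (i = 0; i < n; ++i) a[i] = a[i] + b[i];"
print(tokenize(loop)[:4])  # e.g., [('FOR', 'for'), ('LPAREN', '('), ('ID', 'i'), ('EQUALS', '=')]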
To feed the data to the neural network, we have to encode both the sequences of tokens and the transformations. The quality of the encoding strongly impacts both the achievable level of accuracy and the generalization capability of the trained model. We have experimented with multiple methods of encoding the sequences of tokens. Here we describe an interesting subset of these methods and their differences.

Some encoding methods are based on the frequency of tokens in the code corpus used for training. Specifically, we compute the following three frequency maps:

• F_tokens: Token → N, which assigns a frequency to each token in the code corpus,
• F_ids: Identifier → N, which assigns a frequency to each identifier in the code corpus,
• F_stdTokens: Token → N, which assigns a frequency to each token that is neither an identifier nor a literal.

Table 2 gives an overview of the six encoding methods we consider and which we explain in detail in the following.

Fixed encoding.
This encoding uses a one-hot encoding of the top n most popular tokens in F_tokens and assigns a special unknown token to all other tokens. This method is easy to implement, but has several disadvantages. First, the size of the encoding increases linearly with the size n of the vocabulary, resulting in increasing learning times. Next, all the words outside the vocabulary are encoded with the same unique token, which may result in a loss of vital information. Finally, this method does not discriminate between different types of tokens, i.e., keywords, identifiers, literals, etc. are all encoded as equidistant points in space.

Basic encoding.
This encoding is based on a one-hot encoding of all tokens in F_stdTokens, i.e., the set of standard tokens defined by the language, but not identifiers and literals. For literals, the encoding converts integer literals to base 10 and assigns special id and unknown tokens to identifiers and other tokens, respectively. In contrast to the fixed encoding, this method encodes the tokens based on their type. The reason for handling integers specially is that we observe integers to sometimes influence optimization decisions, e.g., in loop headers. In contrast, other literals, e.g., characters and floating-point values, are assumed not to influence the prediction accuracy and are therefore encoded as a special unknown token. Omitting these tokens completely would change the structure of the code and potentially inhibit the performance of the neural network. The main disadvantage of this method is that it uses the same vector representation for all identifiers and thus hinders the learning capability of the network.

Type-based encoding.
This encoding is similar to basic, except that it replaces identifiers with the types of the corresponding variables for the most common data types: int, double, long, float, struct, char, short. While this method preserves the data type of many variables, all identifiers sharing the same data type get identical vector representations, which prevents the network from distinguishing them.
Renaming encoding.
This encoding is also similar to basic, except that each unique identifier is encoded as a one-hot vector of size m, where m defines the maximum number of distinct identifier representations possible. The mapping from variable name to one-hot vector can be seen as a consistent renaming. This mapping is determined randomly, so as to prevent the order of the appearance of the identifiers from affecting the encoding.

Since the majority of unique tokens are identifier names, and because it is impractical to encode all identifiers, we use the identifier frequencies to calculate the minimum number of identifiers we would need to encode to cover a given percentage of tokens in the source code and store it in a dictionary I_cov, where every integer percentage p maps to the number of identifiers we would need to encode. Based on these statistics, we devise various methods of encoding the data:

Complex encoding.
This encoding uses F_ids to compute a minimal set of identifiers that covers at least c% of all occurrences of identifiers across the code corpus. Based on this set of frequent identifiers, the encoding preserves all frequent identifiers and only abstracts the remaining ones as unknown. Each integer literal is converted to a one-hot vector of size 64, based on the logarithm of its value. This is done to pass the scale of the literal to the network. In contrast to the fixed encoding and similar to the basic encoding, this method distinguishes among different token types, but also manages to cover a high number of unique identifiers.

Table 2. Encoding methods for tokens.

Encoding     Standard tokens   Identifiers                              Literals
Fixed        One-hot encoding of top-n tokens, rest as unknown
Basic        One-hot           All as id                                Keep integers, rest as unknown
Type-based   One-hot           Type of the identifier                   Keep integers, rest as unknown
Renaming     One-hot           Consistent mapping to one-hot vectors    Keep integers, rest as unknown
Complex      One-hot           One-hot encoding of top c%, rest as id   One-hot encoding of log(n) of integers, rest as unknown
FastText     Learned embeddings of size 100

The first five methods above encode tokens as one-hot vectors based on pre-calculated statistics. However, they all share the same disadvantages: the size of the vocabulary might become very large for big code corpora, and the tokens outside of the vocabulary are all represented as a single special unknown token. The following encoding addresses these limitations.
FastText encoding.
In natural language processing, an embedding [1, 33, 34, 36] is a mapping of words to a vector of real numbers with a much lower dimension. It is a popular language modeling and feature learning technique already used for learning effective source-code representations [9, 47]. In our approach, we apply the FastText embedding technique [34] to source code. We build FastText embeddings using all the sequences of token values in our training data. The size of the embedding vector is set to 100 and the model is trained for 100 epochs. Once this pre-training step is complete, we train our model by encoding token sequences with the help of the learned vector mappings for token values. FastText is especially suitable for source code because many variable names are combinations of multiple words, for example, array_size, viewCount, etc. FastText handles such names by not only learning embeddings for the tokens in the vocabulary but also by calculating embeddings for previously unseen words. This is done by breaking words into smaller sequences, calculating vector representations of each, and using them to reconstruct the encoding of the whole word.
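As an illustration of this pre-training step, the sketch below trains FastText embeddings of size 100 on token-value sequences using the gensim library (version 4 API); the choice of gensim is an assumption here, since the paper does not name the FastText implementation it uses.

from gensim.models import FastText

# Each training sample is the sequence of token values of one loop.
token_sequences = [
    ["for", "(", "i", "=", "0", ";", "i", "<", "n", ";", "++", "i", ")"],
    ["a", "[", "i", "]", "=", "a", "[", "i", "]", "+", "b", "[", "i", "]"],
]

# Embedding size 100 and 100 training epochs, as described above.
model = FastText(sentences=token_sequences, vector_size=100, epochs=100, min_count=1)

# Subword information lets FastText produce vectors even for identifiers
# that never appeared in the training corpus, such as "array_size".
vector = model.wv["array_size"]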
To enable our model to learn effective transformations, we need to encode nearly 3,000 unique transformations with varying numbers of training samples for each transformation. A naïve approach is to use a one-hot encoding for all transformations. However, in this case, the size of the encoding vector would be very large and less popular transformations would not have enough associated data points for the training process to be successful. Furthermore, a one-hot encoding does not capture similarities between transformations, that is, all transformations are represented as equidistant points in space, although some are much more similar than others. Another approach is to select only the most popular transformations and to one-hot encode them. While this allows us to train the model on the most common transformations, it has certain disadvantages. For example, by picking the 50 most popular transformations and ignoring the rest, we would lose 73% of our data and therefore prevent our model from learning many beneficial transformations.

To address the aforementioned points, we present compact encodings of code transformations, where each sequence of transformations is represented as a feature vector. The encoding exploits the fact that transformations can only be applied in particular orders that preserve the semantics of the original program (Section 2). Because the set of transformations included in a sequence of transformations is sufficient to uniquely specify the sequence, the features in the encoding indicate the presence or absence of a particular transformation and the set of its parameters. We formally define the encoding as follows.

Definition 3.1 (Compact encoding of transformations). We encode a sequence of transformations T as a concatenation of vectors T_1, ..., T_k, where each T_i represents a vector encoding for transformation i. The size of a vector T_i is equal to the maximum number of different parameterizations of transformation i. The first element in T_i indicates whether i is applied, while the subsequent elements indicate which parameter of i is enabled. For the transformations considered in this work, we fix the size of each of the subvectors T_unroll, T_unroll-and-jam, T_tiling, T_interchange, and T_distribution accordingly; for example, size(T_unroll) = 4.

Figure 2 illustrates the compact encoding of loop transformations. The final size of the encoding vector is 56. The first four elements are reserved for the T_unroll subvector. The first element in T_unroll has a value of 0 or 1, indicating whether unrolling is part of the transformation (0: no, 1: yes). The next three elements are used to encode the unrolling factor. For example, if unrolling is applied with factor 2, then the first two elements of T_unroll would have value 1 and the remaining ones would be set to 0. We encode the other transformations in a similar fashion, taking into account all possible combinations of their parameters.

Figure 2. Vector encoding of transformations: subvectors for unrolling (factors 2, 4, and 8), unroll-and-jam, tiling, interchange, and distribution.
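A minimal sketch of this compact encoding, restricted to the unrolling subvector, is shown below. The layout of the remaining subvectors and the helper names are assumptions for illustration, not LoopLearner's actual implementation.

import numpy as np

UNROLL_FACTORS = [2, 4, 8]  # possible unroll factors (Table 1)

def encode_unroll(factor=None):
    # T_unroll has four elements: a flag for "unrolling applied" followed by a
    # thermometer encoding of the factor; factor 2 -> [1, 1, 0, 0], which is
    # one plausible reading of the description above.
    vec = np.zeros(4)
    if factor is not None:
        vec[0] = 1.0
        vec[1:1 + UNROLL_FACTORS.index(factor) + 1] = 1.0
    return vec

def encode_transformation(sequence):
    # Concatenate the per-transformation subvectors; only unrolling is spelled
    # out here, the other subvectors would be built analogously.
    unroll_factor = sequence.get("unrolling")  # e.g., {"unrolling": 2}
    return np.concatenate([encode_unroll(unroll_factor)])

print(encode_transformation({"unrolling": 2}))  # -> [1. 1. 0. 0.]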
Figure 3. Prediction process for a sequence of tokens and transformations. The encoded sequence of tokens is first passed into the feature extractor (six dense layers). The results are concatenated with the encoded vector of transformations and passed to the fully connected layer, which predicts the speedup.
Figure 4. The dense layer of the DNN consists of two convolutional layers. The outputs of the dense layer are concatenated with the inputs and passed on to the next layer.

To train a model that predicts beneficial transformations for a loop, we consider two different network architectures: recurrent and convolutional. Recurrent neural networks (RNN) have been designed to recognize patterns in sequences of data, such as text or numerical time series. The main property of RNNs is the internal memory used to keep outputs of the previous steps, which is then fed as input to the current step. In contrast, convolutional neural networks (CNN) are suitable for hierarchical data. The most distinctive property of CNNs are their convolutional layers, which perform a mathematical convolution operation on the input data. Convolutional layers consist of feature matrices that learn to recognize features in the input. Stacking convolutional layers on top of each other allows the later layers to learn increasingly complex features, which makes CNNs so powerful for any task involving compositional data.

The advantage of recurrent neural networks is that they process sequences of arbitrary length. However, vanishing gradients and the increased computational demand of the training process make it harder to train the network with very long input sequences. While convolutional neural networks lack the ability to process sequences of variable length, they excel on datasets of compositional data. Since source code is not only sequential but also highly compositional, CNNs are a good fit for this task. Our experimental evaluation shows that convolutional networks have a higher level of accuracy compared to recurrent networks. This is why we decided to choose CNNs as our default architecture.

Specifically, we adopted ideas from DenseNet [31], a well-known design from the field of computer vision. However, we custom-tailored the DenseNet architecture to fit our learning problem. Figure 3 further illustrates the architecture of our model. The inputs to our model are encoded tokens of a loop and encoded transformations. Because our input is compositional along a single dimension, we use one-dimensional convolutions instead of the two-dimensional variants used in the original DenseNet. The building blocks of our neural network are dense layers, which learn to extract features from the source code. As further illustrated in Figure 4, each dense layer consists of two convolutional layers, and the inputs of each dense layer are concatenated with the outputs of previous layers and fed into subsequent layers. Eventually, the model performs average-pooling on the outputs of the final convolutional layer, concatenates the results with the transformation vector, and passes the concatenated vector into a fully connected layer, which is used to predict the expected speedup.
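The following PyTorch sketch illustrates this kind of DenseNet-style building block with one-dimensional convolutions. The number of dense layers (six) follows the label in Figure 3 and the token dimension (100) follows the FastText embedding size; the remaining layer sizes and kernel widths are illustrative assumptions, not the exact configuration used by LoopLearner.

import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    # Two 1-D convolutions; the layer's output is concatenated with its
    # input along the channel dimension, as in Figure 4.
    def __init__(self, in_channels, growth):
        super().__init__()
        self.conv1 = nn.Conv1d(in_channels, growth, kernel_size=1)
        self.conv2 = nn.Conv1d(growth, growth, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return torch.cat([x, out], dim=1)

class SpeedupPredictor(nn.Module):
    def __init__(self, token_dim=100, transform_dim=56, growth=32, num_layers=6):
        super().__init__()
        layers, channels = [], token_dim
        for _ in range(num_layers):
            layers.append(DenseLayer(channels, growth))
            channels += growth
        self.feature_extractor = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool1d(1)  # average-pooling over the sequence
        self.fc = nn.Linear(channels + transform_dim, 1)

    def forward(self, tokens, transformation):
        # tokens: (batch, token_dim, sequence_length); transformation: (batch, 56)
        features = self.pool(self.feature_extractor(tokens)).squeeze(-1)
        return self.fc(torch.cat([features, transformation], dim=1))

model = SpeedupPredictor()
pred = model(torch.randn(8, 100, 250), torch.randn(8, 56))  # predicted speedups, shape (8, 1)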
We feed training samples in batches of 256 into the network and use the stochastic gradient descent method to train the network for 300 epochs. The initial learning rate of 0.001 is dropped to a third of its value in epochs 100 and 200, and a momentum factor of 0.9 is used for optimization. We clip gradients with an absolute value above 10 to avoid the exploding-gradients problem. At the end of every epoch, we evaluate the model and save the best-performing model.

The code is parsed and tokenized by using the lexer component of the Python pycparser library (https://github.com/eliben/pycparser), a parser for the C language. To build and train the models, we use the PyTorch framework, version 0.4.1 (https://pytorch.org/docs/0.4.1/). We implement LoopLearner as an extensible framework that takes as input the following key parameters:

• sequence encoding: fixed, basic, type-based, renaming, complex, or fasttext
• transformation encoding: one-hot or compact
• model type: recurrent or convolutional

This allows easy plug-in of new types of encodings and neural network architectures.

4 Evaluation

Our evaluation focuses on the following questions.

• How effective is LoopLearner at predicting beneficial loop transformations? (Sections 4.2, 4.3, and 4.7)
• What speedups do LoopLearner's predictions enable? (Section 4.4)
• How efficient is LoopLearner? (Section 4.5)
• How does the approach compare to exhaustively trying all loop transformations? (Section 4.6)
• What is the influence of the speedup threshold? (Section 4.8)
Our dataset is built from 1,895 base loops extracted by prior work [23] from various benchmarks, software libraries, and machine-learning kernels written in C. Extracting each loop into a standalone program that replicates the data environment of the original benchmark program, applying sequences of transformations, and measuring their performance yields a dataset of roughly 70,000 (loop, transformation, speedup) triples. The loops are compiled with the GNU GCC compiler, using the -O3 flag, and executed on an Intel Xeon E5-1630 v3 processor.

We split the dataset into a training and a validation set by randomly selecting 80% of all loops and their associated transformations for training, and the remainder for validation. By splitting by loop, we ensure that the evaluation measures how well the approach performs on previously unseen loops. Unless explicitly stated otherwise, we use speedup threshold t = 1. We trained our models on a single server with two Intel(R) Xeon(R) Gold 6126 2.60GHz CPUs, 64 GB of main memory, two NVIDIA GeForce GTX 1080 Ti GPUs, and the Ubuntu 16.04 LTS operating system. For the purpose of training any given model, a single GPU was used at a time.

We first measure the accuracy of LoopLearner's predictions across all loops and transformations in the validation set. Let T be the set of all (loop, transformation) pairs. Let T+ ⊆ T and T− ⊆ T be the subsets of all pairs known to cause a speedup and slowdown, respectively. Let P+ ⊆ T and P− ⊆ T be the subsets of all pairs predicted to result in a speedup and slowdown, respectively. We consider the following metrics:

• Total accuracy (%) is the percentage of elements of T that are in (P+ ∩ T+) ∪ (P− ∩ T−).
• Speedup recall (%) is the percentage of elements of T+ that are in P+ ∩ T+.
• Speedup precision (%) is the percentage of elements of P+ that are in P+ ∩ T+.
• Slowdown recall (%) is the percentage of elements of T− that are in P− ∩ T−.
• Slowdown precision (%) is the percentage of elements of P− that are in P− ∩ T−.

We calculate the last four metrics alongside the total accuracy for two reasons. First, our dataset is imbalanced: more than 80% of transformations result in slowdown, and therefore high total prediction accuracy alone does not necessarily imply high accuracy for both speedups and slowdowns. Second, the recall and precision metrics help understand how well the approach performs in a particular usage scenario. For example, speedup precision shows how often a predicted speedup indeed improves the loop's performance. We also show the F1 score (harmonic mean of precision and recall).

Table 3 summarizes the results. To understand the influence of different encodings and models, we report results for different variants of LoopLearner. The best result for each metric is highlighted in bold font. Overall, the approach predicts beneficial loop transformations with high accuracy (up to 88%). Comparing speedup and slowdown predictions, the model is particularly effective at predicting that a transformation will cause a slowdown (95% recall, 92% precision), but also provides reasonable results for speedups (55% recall, 66% precision).
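The metrics defined above can be computed as in the following minimal sketch, which operates on lists of predicted and measured speedups per (loop, transformation) pair; the variable names and data layout are illustrative.

def evaluation_metrics(predicted, measured, t=1.0):
    pairs = list(zip(predicted, measured))
    t_plus = [x for x in pairs if x[1] > 1.0]                # actual speedups
    t_minus = [x for x in pairs if x[1] <= 1.0]              # actual slowdowns
    p_plus = [x for x in pairs if x[0] > t]                  # predicted speedups
    p_minus = [x for x in pairs if x[0] < 1.0 - (t - 1.0)]   # predicted slowdowns

    # Accurate predictions, following the criterion from Section 3.2.
    correct = [x for x in pairs
               if (x[0] > 1.0 and x[1] > 1.0) or (x[0] <= 1.0 and x[1] <= 1.0)]

    def pct(part, whole):
        return 100.0 * len(part) / len(whole) if whole else 0.0

    speedup_hits = [x for x in p_plus if x[1] > 1.0]
    slowdown_hits = [x for x in p_minus if x[1] <= 1.0]
    return {
        "total_accuracy": pct(correct, pairs),
        "speedup_recall": pct(speedup_hits, t_plus),
        "speedup_precision": pct(speedup_hits, p_plus),
        "slowdown_recall": pct(slowdown_hits, t_minus),
        "slowdown_precision": pct(slowdown_hits, p_minus),
    }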
Comparison of source code encodings.
Table 3. Overall accuracies achieved by employing different encoding methods. Training accuracy reflects the highest achieved accuracy on the training set. All other values refer to the validation set.

Sequence Encoding   Training Acc. (%)   Validation Acc. (%)   Speedup Recall (%)   Speedup Precision (%)   Speedup F1 (%)   Slowdown Recall (%)   Slowdown Precision (%)   Slowdown F1 (%)

Transformation encoding: compact, model: CNN
Fixed (n=1,000)

Transformation encoding: one-hot, model: CNN
FastText            89.2                87.1                  43.0                 65.3                    51.8             95.6                  89.7                     92.5

Transformation encoding: compact, model: RNN
FastText            84.8                84.0                  4.4                  56.0                    8.1              99.3                  84.3                     91.2

Remarkably, the fixed encoding achieves the highest accuracy on the training set and relatively good accuracy on the validation set, while also being the easiest to implement. We attribute this result to the higher dimensionality of the input data. Since each token is represented as a vector in R^1001, that is, each of the top 1,000 most common tokens and a special unknown token get unique representations, it is quite easy for the network to learn to differentiate between distinct tokens. However, apart from the size of the input data, the disadvantage of using the fixed encoding when compared to more advanced methods is that the gap between the training and validation set accuracy for this method is also quite high, which means it tends to overfit the training data while not performing as well on the validation set. The reason is that the top 1,000 most common tokens are extracted from the training set, which is likely to be somewhat different from the validation set.

Although the accuracy achieved by the "basic" encoding is roughly that of other encodings, the speedup prediction results show a significant weakness of the "basic" encoding. The model achieves only 7.7% speedup recall, because crucial information is lost when discarding identifier names, float literals, and char literals during encoding. As shown by the results of the "type-based" encoding, replacing identifier names with type information does not make the model any more accurate. Consistently abstracting variable names into generic names ("renaming") slightly improves the results, but still offers only low speedup prediction results. The main take-away of these results is that identifier names and literal values are helpful in learning-based program analysis, a finding in line with other work on name-based and learning-based analysis [39, 47].

The "complex" encoding method achieves fairly high training- and validation-set accuracy. The substantially higher accuracy compared to the "basic" encoding confirms the importance of encoding identifier names. However, comparing the two variants of "complex", which keep 70% and 80% of all identifiers, respectively, shows that adding another 10% of less common identifier names does not raise the accuracy any further. We believe that after a certain point, increasing the size of the encoding vector by adding rare identifier names does not benefit the accuracy of the trained model and can actually be harmful, since it is likely that the model will learn to overfit the training samples based on the occurrence of rare identifiers.

The "FastText" encoding achieves the highest overall validation accuracy, showing that pre-training general-purpose token embeddings before passing them into a task-specific model is beneficial. The difference between training accuracy and validation accuracy is at a minimum when using the "FastText" encoding, i.e., there is only little overfitting. Since we obtain the best overall accuracy with the "FastText" encoding, this encoding is the default in the remainder of the section.
Comparison of transformation encodings.
Comparing our compact encoding of transformations with a naive one-hot encoding of transformations shows that the compact encoding is beneficial. In particular, it enables the model to predict otherwise missed speedups. We attribute this result to the fact that the dense encoding makes it easier for the model to generalize across similar transformations, as those are encoded into similar vectors.
Comparison of neural architectures.
We compare our default CNN-based neural architecture to a recurrent neural network with two layers of gated recurrent units and a size similar to the CNN architecture. The comparison shows the CNN model to be clearly more effective, in particular in predicting speedups.
To better understand how effective LoopLearner is for individual loops, we evaluate the effectiveness of those k transformations per loop that LoopLearner predicts to have the highest speedups. Let L be the set of all the loops, let L+ ⊆ L be the subset of the loops for which there exists at least one transformation that produces a speedup, let P_o^(l) be the set of transformations that can be applied to the loop l ∈ L, ordered by the predicted performance from highest to lowest, let P_o^(l)(k) ⊆ P_o^(l) be the first k transformations in this set, let P_os^(l)(k) ⊆ P_o^(l)(k) be the subset of transformations that are predicted to be advantageous, and let L_sp ⊆ L be the subset of the loops for which P_os^(l)(k) ≠ ∅. Then, to measure the top-k effectiveness of our model, we calculate:

• Total accuracy (%) is the percentage of loops l ∈ L for which at least one of the predictions for transformations in P_o^(l)(k) is correct.
• Speedup recall (%) is the percentage of loops l ∈ L+ for which at least one transformation in P_os^(l)(k) produces a speedup.
• Speedup precision (%) is the percentage of loops l ∈ L_sp for which at least one transformation in P_os^(l)(k) produces a speedup.

Table 4. Top-1, top-3, and top-5 accuracy of the network on the validation set and the corresponding values for precision, recall, and the mean speedup achieved in both static and dynamic mode of execution.

Top-k   Total Acc. (%)   Speedup Recall (%)   Speedup Precision (%)   Speedup (Static)   Speedup (Dynamic)
1       64.91            39.46                73.05                   1.144x             1.235x
3       79.95            40.61                75.18                   N/A                1.285x
5       83.38            41.00                75.89                   N/A                1.290x

Table 4 shows the results (the last two columns are described later). We find that the approach achieves an accuracy of 65% when considering only the top-most prediction for a loop, and of 83% within the top-5 predictions. The precision of speedups ranges between 73% and 76%, i.e., when the model predicts a speedup, then the code indeed performs faster in most cases. The reason why the validation accuracy for top-1 predictions is lower than the overall accuracy is that the distribution of the numbers of possible transformations across the loops is non-uniform. Some loops have a much higher number of valid transformations than others, and for some loops the top-1 prediction is more likely to be accurate than for others.
We evaluate the speedups obtained by applying the transformations suggested by LoopLearner in both the static and the dynamic usage scenario (Section 3.3). The speedups in the static scenario show the performance improvement that can be immediately achieved when applying LoopLearner's top suggested transformations, while the dynamic scenario shows the potential speedup attainable when validating LoopLearner's predictions. We compute the following two metrics:

• Speedup geometric mean (static) is defined only for k = 1 and is the geometric mean of speedups across all loops l ∈ L_sp achieved when applying the transformation in P_os^(l)(1).
• Speedup geometric mean (dynamic) is the geometric mean of speedups across all loops l ∈ L_sp achieved when applying the transformation with the best performance out of P_os^(l)(k), or 1.0 if none of the top-k transformations results in a speedup.

The last two columns of Table 4 show the speedups for both scenarios. We find that LoopLearner enables significant speedups in both cases, with a 1.14x speedup when simply using the top-1 prediction, and a 1.29x speedup when choosing the best from the top-5 predictions. Because in the dynamic scenario the transformed loops are executed to measure their performance, the mean speedup is guaranteed to be at least as high as in the static scenario.
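The two metrics can be computed as in the sketch below, where each loop in L_sp is associated with the measured speedups of its predicted-advantageous transformations, ordered by predicted speedup; the data layout is an illustrative assumption.

import math

def geo_mean(values):
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Measured speedups of the predicted-advantageous transformations per loop,
# ordered from highest to lowest predicted speedup.
loops = {
    "loop_a": [1.20, 1.05, 0.90],
    "loop_b": [0.95, 1.10],
}

# Static: apply only the single best-ranked transformation of each loop.
static = geo_mean([speedups[0] for speedups in loops.values()])

# Dynamic: measure the top-k transformations and keep the best one,
# falling back to 1.0 if none of them actually speeds the loop up.
k = 3
dynamic = geo_mean([max(max(speedups[:k]), 1.0) for speedups in loops.values()])

print(static, dynamic)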
We summarize the execution time for different phases of our approach when running on either CPU or GPU in Table 5. Before training our model, we learn FastText embeddings, which takes about 20 seconds on our dataset using 32 worker threads. By far the most computationally demanding part of our approach is training the neural network. With the hyperparameter settings discussed earlier, it takes around 6 hours and 40 minutes to complete the training. However, we believe this time can be brought down substantially by using higher batch sizes along with more memory-efficient implementations of the DenseNet architecture. Moreover, the training step, despite being the most time-consuming, is only performed once, and afterwards the resulting model is ready to be deployed.

Because a high number of transformations can be applied to a given loop, our model must be executed many times before it is possible to decide which transformation is the most beneficial. During prediction, the most computationally intensive part is the feature extractor, which processes the token sequences of a loop. Fortunately, it is sufficient to run the feature extractor for any given loop only once. Then, the fully connected layer can be used to evaluate many possible transformations in a batch. As can be observed in Table 5, it takes less than 20 milliseconds to evaluate 1,000 transformations for a single loop on a CPU and less than 2 milliseconds for the same task on a GPU. We believe that these results show that it is practical to implement LoopLearner as an automated pre-processing step before giving code to the compiler.
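The following sketch illustrates this amortization: the loop's features are extracted once and then broadcast across a batch of candidate transformation vectors, so a single forward pass through the fully connected layer scores all candidates. The small stand-in modules are illustrative assumptions, not LoopLearner's exact inference code.

import torch
import torch.nn as nn

# Illustrative stand-ins for the two halves of the model: a feature extractor
# applied once per loop, and a fully connected layer that scores
# (loop features, transformation) pairs.
feature_extractor = nn.Sequential(
    nn.Conv1d(100, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),
)
scorer = nn.Linear(64 + 56, 1)

loop_tokens = torch.randn(1, 100, 250)      # one encoded loop
transformations = torch.randn(1000, 56)     # 1,000 candidate transformation encodings

with torch.no_grad():
    features = feature_extractor(loop_tokens).squeeze(-1)   # run the expensive part once
    expanded = features.expand(transformations.size(0), -1) # reuse for every candidate
    predicted_speedups = scorer(torch.cat([expanded, transformations], dim=1)).squeeze(1)

best = transformations[predicted_speedups.argmax()]         # highest predicted speedup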
An alternative to querying LoopLearner for transformations that are likely to improve the performance of a loop is an exhaustive search through all possible sequences of transformations. By measuring the performance impact of each sequence of transformations, that alternative approach is guaranteed to always find the best-performing representation of a loop. The downside is that it is very time-consuming, as repeatedly executing different variants of a loop takes time. The following explores the trade-off between time spent on finding beneficial transformations and time saved during the loop executions.

Table 5. Time requirements of the different phases of our approach.

Approach phase                      Time (CPU)    Time (GPU)
Learning embeddings                 20 seconds    N/A
Training (1 epoch)                  N/A           60 seconds
Evaluation (single pass)            N/A           20 seconds
Full training (300 epochs)          N/A           6.6 hours
Evaluating 1 transformation         13.0 ms       1.6 ms
Evaluating 100 transformations      13.5 ms       1.6 ms
Evaluating 1,000 transformations    15.9 ms       1.7 ms
Time to find beneficial transformations.
It takes about 10 hours to exhaustively measure the runtime of all mutations in our dataset. This time is based on executions of individual loops extracted from their original program [23], and it excludes the time required for extracting the loops. In contrast, predicting the speedup of transformations across all loops using our model takes less than 2 seconds. LoopLearner hence reduces the time taken to select a suitable transformation by multiple orders of magnitude.
Time savings due to optimized loops.
We compare LoopLearner and exhaustive search w.r.t. the speedup obtained across all loops for which the respective approach suggests applying a transformation. For LoopLearner, those are all loops for which at least one transformation is predicted to yield a speedup. For exhaustive search, those are all loops that actually have at least one such transformation. Intuitively, the speedup hence indicates what benefits to expect when following the suggestions of the two approaches. As shown in Table 4, LoopLearner's static usage scenario yields a speedup of 1.144x. In contrast, exhaustive search yields a speedup of 1.286x. That is, following the top-most suggestion of the model without validating its performance impact results in lower but still relevant speedups. LoopLearner's dynamic usage scenario shows a different picture. By considering the top-5 suggestions of the model, the obtained speedup of 1.290x even exceeds that of exhaustive search. The reason is that exhaustive search also reveals various transformations that yield very small speedups, i.e., transformations that are less relevant in practice. Intuitively, the top-5, dynamic scenario can be seen as an exhaustive search within a much reduced space of only the five most promising transformations.

Overall, we conclude that LoopLearner provides a practical alternative to exhaustive search, allowing developers or automated tools to quickly identify the most beneficial loop optimizations. In particular, the dynamic mode identifies many of those transformations that yield a significant speedup, without paying the cost of exhaustively measuring the performance impact of all transformations for all loops.
To better understand for which transformations the model's predictions are more or less accurate, Table 6 shows results for individual sequences of transformations. The abbreviations for the transformations are as in Table 1. The last two columns show the number of loops in the validation set to which a sequence of transformations applies, and what percentage of the validation set this number comprises (i.e., coverage). The results show that the accuracy varies across transformations. For example, tiling followed by unrolling has a relatively low validation accuracy, but a high training accuracy, which indicates that the model has likely overfit the training data for this transformation sequence. We also observe that for some under-represented combinations of transformations, the model fails to identify a single speedup. By observing the results for individual transformation sequences, one might decide to ignore some sequences when deploying LoopLearner.
So far in our evaluation, we defined the speedup threshold as being equal to 1. However, as mentioned earlier, this hyperparameter can be used to adjust the precision and recall of the trained model. To show the effects of tuning this hyperparameter, we evaluate the speedup precision and recall on the validation set as we increase the speedup threshold from 1.0 to 1.5. Figure 5 shows that, predictably, increasing the speedup threshold will result in higher precision of speedup predictions but also reduce the recall percentage. Lower value settings for this hyperparameter might be suitable for a more optimistic approach with high tolerance for speedup mispredictions. On the other hand, higher values guarantee a lower number of mispredictions but are also likely to disregard advantageous transformations producing smaller speedups.
Since many compiler bugs are triggered by optimiza-tions [14], several techniques search for optimization-relatedcompiler bugs via differential testing [35, 56]. Barany [8] com-pare the code generated by different compilers to find opti-mizations performed by one but missed by another compiler.Similarly, Nagai et al. [43, 44] propose testing the validityof arithmetic optimizations using randomly generated pro-grams. Instead of searching for bugs in the implementation able 6. Performance of the neural network on different sequences of transformations. Precision and recall for speedup arecalculated on the validation set. Accuracy (%) Speedup (%) Loop CoverageTransformation Sequence Training Validation Recall Precision Count %unrolling 66.75 60.30 35.49 70.33 379 100.00tiling 80.44 71.08 14.20 53.33 156 41.16tiling -> unrolling 82.76 66.14 10.68 45.67 156 41.16unroll-and-jam -> unrolling 71.00 69.25 27.33 87.23 46 12.14interchange 90.15 89.40 50.98 78.79 46 12.14interchange -> unrolling 93.69 89.37 43.88 88.41 46 12.14interchange -> unroll-and-jam 92.37 91.02 53.28 77.71 46 12.14interchange -> unroll-and-jam -> unrolling 92.91 92.81 54.15 83.19 46 12.14interchange -> tiling -> unrolling 94.52 93.15 20.65 72.52 44 11.61interchange -> tiling 93.28 93.26 22.22 67.92 44 11.61distribution 69.47 54.55 63.64 53.85 22 5.80distribution -> unrolling 74.64 55.56 41.18 63.64 22 5.80tiling -> distribution -> unrolling 85.45 78.53 17.07 63.64 16 4.22tiling -> distribution 81.62 69.49 7.14 16.67 16 4.22interchange -> distribution 96.55 77.78 0.00 0.00 5 1.32interchange -> distribution -> unrolling 98.13 84.62 20.00 100.00 5 1.32interchange -> tiling -> distribution 97.51 94.92 0.00 0.00 4 1.06interchange -> tiling -> distribution -> unrolling 98.82 94.92 0.00 0.00 4 1.06 P e r c e n t a g e ( % ) RecallPrecision
Figure 5.
The effect of the speedup threshold on thevalidation-set speedup precision and recall.of compiler optimizations, our work improves the effective-ness of optimizations by tailoring loops to the optimizationdecisions made by the compiler.Superoptimization tries to find the best program among allsemantics-preserving variants of a given program [41] andcan, e.g., be addressed as a stochasitic search problem [50].Bunel et al. [13] propose a learning-based approach to im-prove superoptimization by predicting the distribution ofcode transformations to sample from. Another search-basedapproach for finding suitable optimizations is evolutionarysearch, e.g., to tune the order of optimizations [16, 17], to de-cide which optimizations to enable [30], or to apply randomcode mutations that reduce energy consumption [51]. Allof the above approaches search the optimization space fora specific program and pay the cost, e.g., for executing and validating candidate programs, for every program. In con-trast, LoopLearner learns a model once, which then predictscode transformations suitable for the given program withoutthe need to execute or validate candidate programs. A dif-ference to the work by Cooper et al. [16], which also looksfor sequences of code transformations, is that their workoptimizes in which order to apply transformations, whereasour work predicts whether applying any transformation willbe beneficial, and if yes, which sequence of transformationsto choose.Monsifrot et al. [42] use decision trees to learn the behav-ior of loop unrolling optimizations to decide which loop tounroll. Stephenson and Amarasinghe [53] propose a super-vised learning algorithm to predict unroll factors. Yuki et al.[57] train a neural network to predict loop tiling sizes. Simonet al. [52] automatically learn effective inlining heuristics us-ing decision trees and static code features. Machine learninghas been also applied to predict an effective application orderof compiler optimizations [7, 22, 40, 46]. Park et al. [46] usea graph-based intermediate representation to train a modelthat predicts optimization sequences that will benefit a givenprogram. Martins et al. [40] propose a clustering approachfor grouping similar functions, reducing the search spaceresulting from the combination of optimizations previouslysuggested for the functions in each group. Ashouri et al. [7]cluster compiler optimizations to predict the speedup of se-quences of optimizations that belong to the same cluster. Allthe above approaches differ from our work by tuning opti-mization decisions made inside the compiler, whereas we resent a pre-processing step that makes optimizations moreefficient without changing the compiler itself. Another dif-ference is that the above methods rely on manually designedfeatures.Recent work by Cummins et al. [18, 19] also proposes adeep neural network that learns optimization heuristics overraw code, similar to our work. Their work focuses on heuris-tics for two optimization problems: predicting the optimalexecution device and the thread coarsening factor. Our workdiffers in at least two ways. First, LoopLearner learns effec-tive transformation sequences from a much larger corpusof transformations. Second, LoopLearner trains a convolu-tional neural network, whereas Cummins et al. build upon arecurrent neural network. Another technique optimizes thememory layout of matrices to enable faster sparse matrixmultiplication [60]. 
While also being based on convolutional neural networks, their approach takes a matrix as the input, whereas LoopLearner reasons about the code to optimize.

Machine learning has been used to address various programming-related problems in an end-to-end manner [2], including code completion [11, 45, 49], bug detection [27, 37, 47], and bug fixing [25, 26, 38]. Recurrent neural networks have been applied to token sequences, for example, to find fixes for syntax errors [10], to identify code that suffers from a specific kind of vulnerability [37], to predict the types of variables [28], or to represent code for code search [24]. Convolutional networks, which we also use in this paper, are an alternative to recurrent neural networks. Others have used convolutional networks to localize bugs [32] and to summarize code [4]. We address a different prediction problem, and we are, to the best of our knowledge, the first to adapt the DenseNet architecture [31] to code. Several techniques train models using a graph-based code representation, e.g., abstract syntax trees [54, 58], paths through abstract syntax trees [5, 6, 21], control flow graphs [20], execution traces [29], and other graph-based code representations [3, 9, 12, 55]. Other models of code build on conditional random fields [48], memory networks [15], or manually modeled features [59]. We build upon a token sequence-based representation instead, because it is conceptually simple and makes training efficient, while providing accurate predictions.

Conclusion

We present LoopLearner, a novel technique to address the problem of compiler instability. Given the source code of a loop, LoopLearner suggests a semantically invariant transformation that will likely allow the compiler to produce more efficient code. Following its recommendations prior to compilation results in an average speedup of 1.14x. Almost three quarters (73%) of the suggested transformations yield positive speedups. Trying the top-3 recommendations and choosing the best one raises the average speedup to 1.29x. We envision the approach to be used either as a tool that guides programmers or as a pre-processor run before or as part of the compiler. Unlike most earlier work, our approach leverages deep learning and does not require any manual selection of source code features. In addition, we consider a much larger set of transformations: 3,000 combinations of five common loop optimizations in our case. Our model needs to be trained once per compiler and platform, an effort that is likely to pay off in view of the typical lifetime of either of the two.
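To illustrate another of the five loop optimizations combined above, the following sketch shows loop distribution applied at the source level. The loop, the array names, and the no-aliasing assumption are illustrative and are not taken from the paper's benchmarks.

```c
#define N 1024

/* Illustrative loop (not from the benchmark set) whose body contains
 * two statements that update different arrays. */
void update(float a[N], float b[N], const float c[N]) {
    for (int i = 0; i < N; i++) {
        a[i] = a[i] + c[i];
        b[i] = b[i] * c[i];
    }
}

/* After loop distribution: the body is split into one loop per statement.
 * Assuming a, b, and c do not overlap, the computation is unchanged, but
 * each loop now streams through fewer arrays, which can make it easier
 * for the compiler to vectorize or tile. */
void update_distributed(float a[N], float b[N], const float c[N]) {
    for (int i = 0; i < N; i++)
        a[i] = a[i] + c[i];
    for (int i = 0; i < N; i++)
        b[i] = b[i] * c[i];
}
```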
Acknowledgments
This work was supported by the Graduate School CE within the Centre for Computational Engineering at Technische Universität Darmstadt, the Hessian LOEWE initiative within the Software-Factory 4.0 project, the European Research Council (ERC, grant agreement 851895), and the German Research Foundation within the ConcSys and Perf4JS projects.
References

[1] Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2013. Polyglot: Distributed word representations for multilingual NLP. arXiv preprint arXiv:1307.1662 (2013).
[2] Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, and Charles Sutton. 2018. A survey of machine learning for big code and naturalness. ACM Computing Surveys (CSUR) 51, 4 (2018), 81.
[3] Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi. 2017. Learning to Represent Programs with Graphs. CoRR abs/1711.00740 (2017). arXiv:1711.00740 http://arxiv.org/abs/1711.00740
[4] Miltiadis Allamanis, Hao Peng, and Charles A. Sutton. 2016. A Convolutional Attention Network for Extreme Summarization of Source Code. In ICML. 2091–2100.
[5] Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. 2018. code2vec: Learning Distributed Representations of Code. CoRR abs/1803.09473 (2018).
[6] Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. 2018. A General Path-Based Representation for Predicting Program Properties. In PLDI.
[7] Amir H. Ashouri, Andrea Bignoli, Gianluca Palermo, Cristina Silvano, Sameer Kulkarni, and John Cavazos. 2017. MiCOMP: Mitigating the Compiler Phase-Ordering Problem Using Optimization Sub-Sequences and Machine Learning. ACM Trans. Archit. Code Optim. 14, 3, Article 29 (Sept. 2017), 28 pages. https://doi.org/10.1145/3124452
[8] Gergö Barany. 2018. Finding missed compiler optimizations by differential testing. In Proceedings of the 27th International Conference on Compiler Construction, CC 2018, February 24-25, 2018, Vienna, Austria. 82–92.
[9] Tal Ben-Nun, Alice Shoshana Jakobovits, and Torsten Hoefler. 2018. Neural Code Comprehension: A Learnable Representation of Code Semantics. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.). Curran Associates, Inc., 3585–3597.
[10] Sahil Bhatia and Rishabh Singh. 2016. Automated Correction for Syntax Errors in Programming Assignments using Recurrent Neural Networks. CoRR abs/1603.06129 (2016).
[11] Pavol Bielik, Veselin Raychev, and Martin T. Vechev. 2016. PHOG: Probabilistic Model for Code. In ICML. 2933–2942.
[12] M. Brockschmidt, M. Allamanis, A. L. Gaunt, and O. Polozov. 2018. Generative Code Modeling with Graphs. ArXiv e-prints (2018). arXiv:1805.08490 [cs.LG]
[13] Rudy Bunel, Alban Desmaison, M. Pawan Kumar, Philip H. S. Torr, and Pushmeet Kohli. 2017. Learning to superoptimize programs. https://openreview.net/forum?id=r1rz6U5lg
[14] Junjie Chen, Wenxiang Hu, Dan Hao, Yingfei Xiong, Hongyu Zhang, Lu Zhang, and Bing Xie. 2016. An Empirical Comparison of Compiler Testing Techniques. In Proceedings of the 38th International Conference on Software Engineering (Austin, Texas) (ICSE '16). ACM, New York, NY, USA, 180–190. https://doi.org/10.1145/2884781.2884878
[15] Min-je Choi, Sehun Jeong, Hakjoo Oh, and Jaegul Choo. 2017. End-to-End Prediction of Buffer Overruns from Raw Source Code via Neural Memory Networks. CoRR abs/1703.02458 (2017).
[16] Keith D. Cooper, Philip J. Schielke, and Devika Subramanian. 1999. Optimizing for Reduced Code Space using Genetic Algorithms. In Proceedings of the ACM SIGPLAN 1999 Workshop on Languages, Compilers, and Tools for Embedded Systems (LCTES '99), Atlanta, Georgia, USA, May 5, 1999. 1–9. https://doi.org/10.1145/314403.314414
[17] Keith D. Cooper, Devika Subramanian, and Linda Torczon. 2002. Adaptive Optimizing Compilers for the 21st Century. The Journal of Supercomputing 23, 1 (2002), 7–22. https://doi.org/10.1023/A:1015729001611
[18] Chris Cummins, Pavlos Petoumenos, Zheng Wang, and Hugh Leather. 2017. End-to-End Deep Learning of Optimization Heuristics. In PACT. 219–232. https://doi.org/10.1109/PACT.2017.24
[19] Chris Cummins, Pavlos Petoumenos, Zheng Wang, and Hugh Leather. 2017. Synthesizing benchmarks for predictive modeling. In CGO. 86–99.
[20] Daniel DeFreez, Aditya V. Thakur, and Cindy Rubio-González. 2018. Path-Based Function Embedding and its Application to Specification Mining. CoRR abs/1802.07779 (2018).
[21] Jacob Devlin, Jonathan Uesato, Rishabh Singh, and Pushmeet Kohli. 2017. Semantic Code Repair using Neuro-Symbolic Transformation Networks. CoRR abs/1710.11054 (2017). arXiv:1710.11054 http://arxiv.org/abs/1710.11054
[22] Grigori Fursin, Yuriy Kashnikov, Abdul Wahid Memon, Zbigniew Chamski, Olivier Temam, Mircea Namolaru, Elad Yom-Tov, Bilha Mendelson, Ayal Zaks, Eric Courtois, et al. 2011. Milepost GCC: Machine learning enabled self-tuning compiler. International Journal of Parallel Programming 39, 3 (2011), 296–327.
[23] Zhangxiaowen Gong, Zhi Chen, Justin Szaday, David Wong, Zehra Sura, Neftali Watkinson, Saeed Maleki, David Padua, Alexander Veidenbaum, Alexandru Nicolau, and Josep Torrellas. 2018. An Empirical Study of the Effect of Source-level Loop Transformations on Compiler Stability. Proc. ACM Program. Lang. 2, OOPSLA, Article 126 (Oct. 2018), 29 pages. https://doi.org/10.1145/3276496
[24] Xiaodong Gu, Hongyu Zhang, and Sunghun Kim. 2018. Deep Code Search. In ICSE.
[25] Rahul Gupta, Soham Pal, Aditya Kanade, and Shirish Shevade. 2017. DeepFix: Fixing Common C Language Errors by Deep Learning. In AAAI.
[26] Jacob Harer, Onur Ozdemir, Tomo Lazovich, Christopher P. Reale, Rebecca L. Russell, Louis Y. Kim, and Sang Peter Chin. 2018. Learning to Repair Software Vulnerabilities with Generative Adversarial Networks. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada.
[27] CoRR abs/1803.04497 (2018). arXiv:1803.04497 http://arxiv.org/abs/1803.04497
[28] V. Hellendoorn, C. Bird, E. T. Barr, and M. Allamanis. 2018. Deep Learning Type Inference. In FSE.
[29] Jordan Henkel, Shuvendu K. Lahiri, Ben Liblit, and Thomas W. Reps. 2018. Code vectors: understanding programs through embedded abstracted symbolic traces. In Proceedings of the 2018 ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/SIGSOFT FSE 2018, Lake Buena Vista, FL, USA, November 04-09, 2018. 163–174.
[30] Kenneth Hoste and Lieven Eeckhout. 2008. COLE: compiler optimization level exploration. In Sixth International Symposium on Code Generation and Optimization (CGO 2008), April 5-9, 2008, Boston, MA, USA. 165–174. https://doi.org/10.1145/1356058.1356080
[31] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. 2017. Densely Connected Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2261–2269.
[32] Xuan Huo, Ming Li, and Zhi-Hua Zhou. 2016. Learning Unified Features from Natural and Programming Languages for Locating Buggy Source Code. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016. 1606–1612.
[33] Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hérve Jégou, and Tomas Mikolov. 2016. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651 (2016).
[34] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759 (2016).
[35] Vu Le, Mehrdad Afshari, and Zhendong Su. 2014. Compiler Validation via Equivalence Modulo Inputs. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation (Edinburgh, United Kingdom) (PLDI '14). ACM, New York, NY, USA, 216–226. https://doi.org/10.1145/2594291.2594334
[36] Jiwei Li and Dan Jurafsky. 2015. Do multi-sense embeddings improve natural language understanding? arXiv preprint arXiv:1506.01070 (2015).
[37] Zhen Li, Deqing Zou, Shouhuai Xu, Xinyu Ou, Hai Jin, Sujuan Wang, Zhijun Deng, and Yuyi Zhong. 2018. VulDeePecker: A Deep Learning-Based System for Vulnerability Detection. In NDSS.
[38] Fan Long and Martin Rinard. 2016. Automatic patch generation by learning correct code. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2016, St. Petersburg, FL, USA, January 20-22, 2016. 298–312.
[39] Rabee Sohail Malik, Jibesh Patra, and Michael Pradel. 2019. NL2Type: Inferring JavaScript Function Types from Natural Language Information. In ICSE.
[40] Luiz G. A. Martins, Ricardo Nobre, João M. P. Cardoso, Alexandre C. B. Delbem, and Eduardo Marques. 2016. Clustering-Based Selection for the Exploration of Compiler Optimization Sequences. ACM Trans. Archit. Code Optim. 13, 1, Article 8 (March 2016), 28 pages. https://doi.org/10.1145/2883614
[41] Henry Massalin. 1987. Superoptimizer: a look at the smallest program. In ACM SIGARCH Computer Architecture News, Vol. 15. IEEE Computer Society Press, 122–126.
[42] Antoine Monsifrot, François Bodin, and Rene Quiniou. 2002. A Machine Learning Approach to Automatic Production of Compiler Heuristics. In Proceedings of the 10th International Conference on Artificial Intelligence: Methodology, Systems, and Applications (AIMSA '02). Springer-Verlag, London, UK, 41–50. http://dl.acm.org/citation.cfm?id=646053.677574
[43] Eriko Nagai, Hironobu Awazu, Nagisa Ishiura, and Naoya Takeda. 2019. Random Testing of C Compilers Targeting Arithmetic Optimization. (04 2019).
[44] Eriko Nagai, Atsushi Hashimoto, and Nagisa Ishiura. 2014. Reinforcing Random Testing of Arithmetic Optimization of C Compilers by Scaling up Size and Number of Expressions. IPSJ Trans. System LSI Design Methodology.
[45] In Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, ESEC/FSE '13, Saint Petersburg, Russian Federation, August 18-26, 2013. 532–542.
[46] Eunjung Park, John Cavazos, and Marco A. Alvarez. 2012. Using Graph-based Program Characterization for Predictive Modeling. In Proceedings of the Tenth International Symposium on Code Generation and Optimization (San Jose, California) (CGO '12). ACM, New York, NY, USA, 196–206. https://doi.org/10.1145/2259016.2259042
[47] Michael Pradel and Koushik Sen. 2018. DeepBugs: A learning approach to name-based bug detection. PACMPL 2, OOPSLA (2018), 147:1–147:25. https://doi.org/10.1145/3276517
[48] Veselin Raychev, Martin T. Vechev, and Andreas Krause. 2015. Predicting Program Properties from "Big Code". In Principles of Programming Languages (POPL). 111–124.
[49] Veselin Raychev, Martin T. Vechev, and Eran Yahav. 2014. Code completion with statistical language models. In ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '14, Edinburgh, United Kingdom, June 09-11, 2014. 44.
[50] Eric Schkufza, Rahul Sharma, and Alex Aiken. 2013. Stochastic Superoptimization. In Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). ACM, 305–316.
[51] Eric M. Schulte, Jonathan Dorn, Stephen Harding, Stephanie Forrest, and Westley Weimer. 2014. Post-compiler software optimization for reducing energy. In Architectural Support for Programming Languages and Operating Systems, ASPLOS '14, Salt Lake City, UT, USA, March 1-5, 2014. 639–652. https://doi.org/10.1145/2541940.2541980
[52] Douglas Simon, John Cavazos, Christian Wimmer, and Sameer Kulkarni. 2013. Automatic Construction of Inlining Heuristics Using Machine Learning. In Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO '13). IEEE Computer Society, Washington, DC, USA, 1–12. https://doi.org/10.1109/CGO.2013.6495004
[53] Mark Stephenson and Saman Amarasinghe. 2005. Predicting Unroll Factors Using Supervised Classification. In Proceedings of the International Symposium on Code Generation and Optimization (CGO '05). IEEE Computer Society, Washington, DC, USA, 123–134. https://doi.org/10.1109/CGO.2005.29
[54] Martin White, Michele Tufano, Christopher Vendome, and Denys Poshyvanyk. 2016. Deep learning code fragments for code clone detection. In ASE. 87–98.
[55] Xiaojun Xu, Chang Liu, Qian Feng, Heng Yin, Le Song, and Dawn Song. 2017. Neural Network-based Graph Embedding for Cross-Platform Binary Code Similarity Detection. In CCS. 363–376.
[56] Xuejun Yang, Yang Chen, Eric Eide, and John Regehr. 2011. Finding and Understanding Bugs in C Compilers. In Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation (San Jose, California, USA) (PLDI '11). ACM, New York, NY, USA, 283–294. https://doi.org/10.1145/1993498.1993532
[57] Tomofumi Yuki, Lakshminarayanan Renganarayanan, Sanjay Rajopadhye, Charles Anderson, Alexandre E. Eichenberger, and Kevin O'Brien. 2010. Automatic Creation of Tile Size Selection Models. In Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization (Toronto, Ontario, Canada) (CGO '10). ACM, New York, NY, USA, 190–199. https://doi.org/10.1145/1772954.1772982
[58] Jian Zhang, Xu Wang, Hongyu Zhang, Hailong Sun, Kaixuan Wang, and Xudong Liu. 2019. A Novel Neural Source Code Representation based on Abstract Syntax Tree. In ICSE.
[59] Gang Zhao and Jeff Huang. 2018. DeepSim: deep learning code functional similarity. In Proceedings of the 2018 ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/SIGSOFT FSE 2018, Lake Buena Vista, FL, USA, November 04-09, 2018. 141–151.
[60] Yue Zhao, Jiajia Li, Chunhua Liao, and Xipeng Shen. 2018. Bridging the gap between deep learning and sparse matrix format selection. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2018, Vienna, Austria, February 24-28, 2018. 94–108. https://doi.org/10.1145/3178487.3178495