NOELLE Offers Empowering LLVM Extensions
Angelo Matni, Enrico Armenio Deiana, Yian Su, Lukas Gross, Souradip Ghosh, Sotiris Apostolakis, Ziyang Xu, Zujun Tan, Ishita Chaturvedi, David I. August, Simone Campanoni
Angelo Matni
Northwestern University [email protected]
Enrico Armenio Deiana
Northwestern University
Yian Su
Northwestern University [email protected]
Lukas Gross
Northwestern University [email protected]
Souradip Ghosh
Northwestern University [email protected]
Sotiris Apostolakis
Princeton University [email protected]
Ziyang Xu
Princeton University [email protected]
Zujun Tan
Princeton University [email protected]
Ishita Chaturvedi
Princeton University [email protected]
David I. August
Princeton University [email protected]
Simone Campanoni
Northwestern University [email protected]
Abstract
Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, it should support custom tools designed to meet these demands with higher-level, analysis-powered abstractions of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM that provides this support. NOELLE is modular and demand-driven, making it easy to extend and adaptable to custom-tool-specific needs without unduly wasting compile time and memory. This paper shows the power of NOELLE by presenting a diverse set of ten custom tools built upon it, with a 33.2% to 99.2% reduction in code size (LoC) compared to their counterparts without NOELLE.
arXiv'21, February, 2021, Virtual

1 Introduction

The compiler community is on the front lines to satisfy the continuous demand for computational performance and energy efficiency. The focus of compiler advancements is shifting beyond peephole optimizations and the extraction of instruction-level parallelism. More aggressive optimizations and more sophisticated analyses with wider scope are required to accommodate the needs of emerging architectures and applications.

Modern compilers use low-level intermediate representations (IR) to perform optimizations that are language-agnostic and architecture-independent, such as LLVM IR from the LLVM compiler framework [6, 37] and GIMPLE from GCC [4]. Low-level IR, along with a set of low-level abstractions built around it, is designed to aid program analyses and optimizations and has shown its value for peephole optimizations and the extraction of ILP. However, low-level abstractions are not enough for advanced code analyses and transformations. Consider automatic parallelization, one of the most powerful program optimization techniques: it exists only in a basic form [1, 2, 7], or does not exist at all, in most general-purpose compilers. This paper shows that with proper abstractions, a daunting automatic parallelization transformation can be implemented in fewer than a thousand lines of code.

Advanced code analyses and transformations go hand in hand with higher-level abstractions, as shown by many existing compilers and frameworks. Several compiler infrastructures that support automatic parallelization [3, 10, 20] operate on high-level abstractions and perform source-to-source translation. The recent success of domain-specific compilers and frameworks also proves the importance of high-level abstractions for optimizations by uncovering optimization opportunities at a domain-specific graph or operator level [5, 12].
However, these compilers limit themselves to specific programming languages or problem domains, and miss opportunities presented only in low-level IRs, including more fine-grained operations and more canonical code patterns. The combination of higher-level abstractions and lower-level IR is the key to advanced program analyses and optimizations. Support for this claim can be found in the SUIF compiler [11], which provides a low-level IR as well as higher-level constructs [51], and in the IMPACT compiler [25], which provides hierarchical IRs to enable optimizations at different levels. Despite the claim, we are not aware of actively-maintained domain-independent compilers that fulfill this combination. While LLVM has become the de-facto compiler infrastructure to build upon, it does not provide proper abstractions for advanced analyses and transformations, including abstractions designed to describe properties of a wider code scope (e.g., program dependence graph, program call graph) or abstractions that provide the mechanisms for advanced code transformations (e.g., loop transformations, code scheduler, task creation). These abstractions can ease the implementation of new transformations and make the existing code transformations available in LLVM more powerful.

We propose a new open-source compilation layer called NOELLE that brings forth abstractions for advanced code analyses and transformations. To demonstrate the importance of NOELLE, we have implemented ten advanced code transformations, nine of which need only a few lines of code. Only one of these transformations is already available in LLVM (loop invariant code motion).
We will show that our version is significantly more powerful, requires significantly fewer lines of code, and is more elegantly implemented than the LLVM counterpart. The other nine transformations are missing in LLVM because they are challenging to implement with the low-level abstractions LLVM provides.

We have implemented a variety of code transformations upon NOELLE: a few parallelizing compilers, a pseudo-random value generator selector, a comparison optimization for timing-speculative micro-architectures, a dead function eliminator, a memory guard optimization, a code analysis and transformation to replace hardware interrupts, and a loop invariant code motion. We call them NOELLE's custom tools. It is a challenge to implement these custom tools using only the low-level abstractions provided by LLVM. Relying on NOELLE, instead, most tools are implemented in only a few hundred lines of code. We tested all these tools on 41 benchmarks from three benchmark suites (SPEC CPU2017, PARSEC 3.0, and MiBench). All these tools improve the quality of the code generated by LLVM at its highest level of optimization. Finally, the high heterogeneity of these ten custom tools suggests NOELLE provides general abstractions and support for a wide variety of advanced code analyses and transformations.
Finally, we have released NOELLE publicly (https://github.com/scampanoni/noelle).

This paper:
• introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM;
• describes the abstractions provided by NOELLE (Section 2.2) that ease the development of advanced code transformations and analyses;
• presents the tools provided by NOELLE (Section 2.3) that ease the deployment of custom compilation tool-chains;
• describes the testing infrastructure provided by NOELLE (Section 2.4) that enables automatic testing of custom tools;
• describes a diverse set of ten custom tools built upon NOELLE (Section 3) and highlights the benefits of NOELLE's custom tools compared to vanilla LLVM (Section 4.2);
• evaluates the accuracy of NOELLE's abstractions (Section 4.1); and,
• further motivates the need for NOELLE by comparing it with prior work (Section 5).

Next, we describe NOELLE, its abstractions, and its tools.
Figure 1.
Compilation flow of the HELIX custom tool using NOELLE tools and a custom pass, HELIX Transformation. Figure 2 shows in detail how to build HELIX Transformation using NOELLE abstractions.
The goal of NOELLE is to provide abstractions that enable a simple implementation of complex code analyses and transformations (we call them custom tools) that target wide program scopes. Custom tools built upon NOELLE include LLVM passes that work at the IR level to perform their code analyses and transformations. Allowing these custom tools to be easily implementable and maintainable requires simple and domain-independent abstractions powered by either accurate low-level code analyses or complex low-level code transformations. NOELLE provides such abstractions (Section 2.2) with a modular design that allows its users to pay only the cost of creating the abstractions requested.

NOELLE's abstractions are powered by code analyses, some of which are provided by third parties. For example,
the PDG abstraction NOELLE provides is computed by running several alias analyses implemented by external codebases (SCAF [16] and SVF [47]). Moreover, NOELLE's modular design makes it easy to extend the list of external code analyses that power NOELLE's abstractions. NOELLE also provides tools (Section 2.3) for faster user-specific compilation flows. Finally, NOELLE provides a testing infrastructure (Section 2.4) to facilitate automatic testing of NOELLE itself as well as custom tools built upon it.

Figure 2. HELIX transformation, a custom pass written using NOELLE abstractions. Arrows in the graph describe the dependences between analyses. Refer to Table 1 for descriptions of all NOELLE abstractions. Table 4 and Table 2 describe the abstractions used per custom tool and per NOELLE tool, respectively.
Input and Output
The input of a compilation flow built upon NOELLE is the source code of a program and, optionally, a set of training inputs that could be used for profile-guided or autotuning-based custom tools. The output is a binary for a target architecture supported by vanilla LLVM backends.
An Example of Compilation Flow
NOELLE enables its users to deploy custom compilation flows by providing a set of tools, described in Section 2.3. Next, we describe an example of a compilation flow built using NOELLE's tools (shown in Figure 1). This is the compilation flow used by the custom tool HELIX (further described in Section 3).

Each source file composing the program being compiled is consumed by noelle-whole-IR, which outputs a single LLVM IR file that includes the whole program's code as well as the options to use to generate the final binary (e.g., the libraries to link with). Then, using the training inputs given to NOELLE, noelle-prof-coverage runs several profilers to collect statistics about the single IR file's execution. These statistics include the hotness of code regions (e.g., a loop, a basic block), loop-specific information (e.g., the total number of iterations of a loop, the average number of iterations per invocation of a loop), and function-specific information (e.g., the number of invocations of a function, the average number of recursive calls of a recursive function). The program's profiles are then embedded into the IR file by noelle-meta-prof-embed. The generated IR is consumed by noelle-rm-lc-dependences, which applies a set of code transformations that aim to reduce loop-carried data dependences in hot loops (those whose hotness exceeds the minimum required to consider a loop hot). The generated IR is now more amenable to loop-centric code parallelization techniques. The tool noelle-meta-clean removes all NOELLE-specific metadata from the IR file. Then, noelle-prof-coverage and noelle-meta-prof-embed re-generate and embed the program's profiles, respectively. Then, noelle-meta-pdg-embed computes the program dependence graph (PDG) and embeds it as metadata inside the IR file. Next, noelle-arch computes architecture-specific profiles (e.g., the communication latency between cores). Its output is used by the HELIX transformation.
Finally, the noelle-load tool is invoked, which loads NOELLE's compilation layer in memory to run the HELIX transformation. The HELIX transformation relies on NOELLE's abstractions to parallelize hot loops. The generated parallelized IR file is then consumed by noelle-linker, which embeds the HELIX-specific runtime into the IR. Finally, noelle-bin generates the parallel binary.
Next, we describe the abstractions that NOELLE provides to its users. NOELLE's abstractions (summarized in Table 1) are demand-driven to preserve compilation time and memory. Hence, users only pay for the abstractions they need. In other words, if a user does not need the program dependence graph (PDG), then they will not pay the cost of analyzing the program to compute its dependences.
PDG.
NOELLE provides the Program Dependence Graph (PDG) representation of a program [30]. This is obtained by extending NOELLE's dependence graph, a templated class designed to represent a generic graph of directed dependences between nodes. What constitutes a node is decided when the class is instantiated. For example, the PDG instantiates this templated class with the LLVM instruction class. Hence, the nodes of the PDG are the instructions of a program. Each edge of the dependence graph contains attributes to differentiate between control and data dependences. Data dependences are further characterized by the dependence type (Read-After-Write, Write-After-Write, Write-After-Read), whether the dependence is loop-carried, the dependence distance, whether it is a memory or register dependence, and whether it is an apparent or actual dependence [26].

An analysis or transformation (i.e., pass) built upon NOELLE can use the PDG abstraction to create loop dependence graphs and function dependence graphs. The former is the dependence graph of a specific loop. The latter contains only dependences between the instructions of a function. When a pass requests the loop dependence graph from a PDG, NOELLE runs loop-centric analyses to refine (and improve the precision of) the dependences that are included in the PDG for the specific loop in question.

Table 1.
Abstractions provided by NOELLE
Abstraction | Description | LoC | Depends on
PDG | All dependences between instructions of a program | 6775 | -
aSCCDAG | SCCDAG of a loop with attributes on each SCC (e.g., an SCC has a loop-carried data dependence, it is reducible) | 4517 | PDG
Call graph (CG) | Complete call graph of a program including indirect calls and their possible callees | 620 | PDG
Environment (ENV) | Variables needed by a task to execute (live-ins and live-outs) | 991 | PDG
Task (T) | Code region (and its inputs and outputs) executed by a thread | 297 | ENV
Data-flow engine (DFE) | Optimized engine to quickly evaluate data-flow equations provided as inputs | 332 | -
Loop structure (LS) | Describes the structure of a loop: its exits, latches, header, pre-header, and basic blocks | 301 | -
Profiler (PRO) | Set of profilers at the IR level | 1625 | LS
Scheduler (SCD) | Mechanisms to change the schedule of instructions within and between basic blocks | 1523 | PDG, LS, DFE
Invariant (INV) | Instructions, values, or memory locations that are loop invariants for a given loop | 137 | PDG, LS
Induction variable (IV) | Induction variables of a loop including the identification of the governing one (if it exists) | 352 | LS, INV, aSCCDAG
Induction variable stepper (IVS) | Modifies the code of a loop to implement a change in the step value of its induction variables | 425 | LS, INV, IV
Reduction (RD) | Identification and capability of reducing variables of a loop | 868 | aSCCDAG, INV, IV
Loop (L) | Canonical loop with its dependence graph, its SCCDAG, its invariants, its induction variables, and its exits | 1508 | LS, PDG, IV, INV, aSCCDAG, RD
Forest (FR) | Forest of trees with the capability to adjust, when a node is deleted, to keep the connections between the parent and the children of the deleted node | 202 | L, CG
Loop builder (LB) | Set of loop transformations that modify a loop (e.g., split a loop, translate do-while loops to while form and vice versa) | 4535 | FR, L, DFE, IV, IVS, INV
Islands (ISL) | Capability to identify the disconnected sub-graphs of a graph (e.g., call graph, PDG) | 56 | PDG, CG
Architecture (AR) | Description of the underlying architecture in terms of logical/physical cores and NUMA nodes. It also provides the measured latencies and bandwidths between pairs of cores | 381 | -
Others | | 691 |

LoC of NOELLE's abstractions: 26142
Users of this abstraction often want to know not only about the nodes of a dependence graph that belong to a related code region (e.g., the instructions of a loop for a loop dependence graph) but also about the inputs, the outputs, or both of the graph. For example, a parallelizing code transformation of a loop needs to know the live-in and live-out sets of the target loop. Because of this need, the templated class dependence graph offers two sets of nodes, the internal and the external ones. The former belong to the related code region; the latter represent the live-ins, live-outs, or both of that code region. Both sets of nodes are computed by NOELLE when a pass requests either a loop dependence graph or a function dependence graph.

aSCCDAG.
Advanced code transformations like parallelization techniques can be implemented as different strategies to schedule instances of the nodes that compose the SCCDAG of a loop [43, 49]. For instance, HELIX distributes instances of a given SCCDAG node around the cores. DSWP instead distributes the nodes of an SCCDAG between cores while keeping all instances of a given node within the same core. Hence, the SCCDAG is an important abstraction. To this end, we introduce the augmented SCCDAG abstraction, or aSCCDAG. An aSCCDAG of a given loop is a complete description of the loop's dependences, including those with the rest of the program. A node of an aSCCDAG can be Independent, Sequential, or Reducible. This categorization of a node 𝑛 depends on the relation between the instructions' dynamic instances included in 𝑛 for a given loop invocation. If all these instances are independent of each other, then 𝑛 is tagged as Independent. If an instance of an instruction of 𝑛 depends on another instance of an instruction of 𝑛, then this node is tagged as Sequential. Finally, if there are dependences between instances of 𝑛, but they are reducible by a reduction code transformation (e.g., by cloning the variable s defined in s += work(d)), then 𝑛 is tagged as Reducible, and the related reduction is described within the node.
Call graph (CG).
NOELLE provides the call graph of a program where nodes are functions, and edges indicate that a given function invokes another. This abstraction relies on the PDG to compute the possible callees of an indirect call. Edges of NOELLE's call graph can be must or may, depending on whether a given caller-callee relation is proved to hold or not. Each edge has sub-edges to indicate with which specific instructions a caller invokes another function. Finally, CG can compute the set of disconnected islands of such a graph.

NOELLE's call graph differs from LLVM's by being complete: the latter does not compute an indirect call's possible callees. By being complete, NOELLE's call graph enables custom tools to assume that the lack of an edge in the call graph means a function cannot invoke another. CG is used by the DeadFunctionEliminator custom tool built upon NOELLE, which aims to reduce the binary size of a program.
Environment (ENV).
NOELLE offers the Environment abstraction, which is an array of pointers to variables. Variables within an Environment represent the incoming and outgoing values from and to a set of instructions. This set of instructions is described by a subset of the nodes of an aSCCDAG. An example pass that relies on the Environment abstraction is a parallelization technique that needs to propagate values explicitly between the cores. Finally, NOELLE provides an Environment Builder to create, modify, and query environments.

Table 2. NOELLE's tools

Tool | Description | LoC | Depends on
noelle-whole-IR | Generate a single IR file from C/C++ source files, embedding the compilation options as metadata inside the generated IR file | 1522 | -
noelle-rm-lc-dependences | Transform loops to remove as many loop-carried data dependences as possible | 912 | aSCCDAG, CG, L, PRO, FR, LB
noelle-prof-coverage | Inject code into the IR file given as input to profile IR instructions | 1761 | PRO, FR, CG
noelle-meta-prof-embed | Embed profiles into the IR file given as input | 152 | PRO, FR, CG
noelle-meta-pdg-embed | Compute and embed the PDG into the IR file given as input | 451 | PDG
noelle-load | Load the NOELLE abstractions into memory without computing them | 12 | -
noelle-arch | Generate a file that describes the underlying architecture and its profiles (e.g., core-to-core latencies) | 259 | AR
noelle-linker | Link IR files together while preserving the semantics of metadata generated by NOELLE's tools | 59 | -
noelle-bin | Generate a standalone binary from an IR file using the compilation options specified as metadata inside the IR file given as input | 15 | -

LoC of NOELLE's tools: 5143
Task (T).
NOELLE offers the Task abstraction to describe a code region that runs sequentially. Parallelization techniques use this abstraction in the following way. Nodes within an aSCCDAG are partitioned into tasks. An Environment is created for each task. At runtime, tasks are submitted to a thread pool, which runs them in parallel across the cores. The explicit forwarding of data values between tasks is performed by loading/storing values from/to the variables pointed to by their environments.
Data flow engine (DFE).
NOELLE provides a data-flow engine that can be used to implement data-flow analyses. DFE implements conventional optimizations such as bit vectors, basic-block-granularity processing, a worklist algorithm, and loop-based priority ordering [17]. Finally, NOELLE provides a set of common data-flow analyses that rely on DFE.
Profiler (PRO).
NOELLE provides several code profilers, the ability to embed their results into IR files, and abstractions to facilitate high-level queries on such data. Examples of queries that can be performed are the hotness of a code region (e.g., a loop, an SCC of a dependence graph), loop-specific information (e.g., loop iteration count, average loop iteration count per invocation), and function-specific information (e.g., the average number of times that an invocation of a function invokes another).
Scheduler (SCD).
NOELLE provides the scheduler abstraction, which offers the capability of moving instructions within and among basic blocks while preserving the original code semantics. The scheduler relies on the PDG abstraction to guarantee semantic preservation. The abstraction provides a hierarchy of schedulers, starting from a generic one and including loop-specific and within-basic-block schedulers. Each scheduler augments the generic capabilities with specialized capabilities (e.g., reducing the header size of a loop).
Loop Builder (LB).
NOELLE offers the loop builder abstraction that enables passes to modify, create, and delete loops. LB is similar to the IRBuilder abstraction offered by LLVM, but instead of targeting instructions, LB targets loops.
Induction variables (IV).
NOELLE provides the induction variable abstraction. Because LLVM's IR is in SSA form, the concept of a loop's induction variable is embodied by an SCC of the aSCCDAG of that loop. NOELLE's abstraction exposes such an SCC, the starting and ending values of an induction variable, the step amount per loop iteration, and whether an induction variable controls the number of loop iterations that will be executed. We call governing induction variables those that control the number of loop iterations. Finally, IV exposes the potential relationship with other induction variables for those that are derived.

The main difference between LLVM's induction variable and NOELLE's version is that the former only provides the PHI instruction of the induction variable's SCC that belongs to the loop header. Another difference is that NOELLE's version implements a more robust algorithm to detect governing induction variables based on the aSCCDAG abstraction. LLVM's implementation, instead, relies on the low-level def-use chains of the IR because of the lack of an SCCDAG abstraction within LLVM. NOELLE's IV, therefore, detects more governing induction variables.
Induction Variable Stepper (IVS).
A common operation for modern and emerging code transformations is to modify the step of induction variables. For example, loop rotation needs to reverse the step value of induction variables. Another example is an advanced DOALL parallelization, which needs to perform chunking between iterations to increase spatial locality. NOELLE's induction variable stepper abstraction offers the capability to modify any step value of the induction variables of a loop; users only need to specify the new step values, and the abstraction modifies the loop accordingly.
Loop (L).
This abstraction includes a representation of the loop structure (called LS). The latter is equivalent to the loop abstraction of LLVM. The abstraction L, instead, adds to LS the loop dependence graph (computed from the PDG) and the loop-specific instances of the abstractions IV and INV.

Other abstractions.
Above, we have described the most important abstractions NOELLE provides. However, NOELLE provides additional abstractions used for simpler compilation tasks, such as control equivalence; reduction operations; extendible metadata attached to control structures like loops; the SCCDAG partitioner; forests (FR), graphs designed to restore connections among remaining parts when a node is deleted; architecture (AR), which describes how logical cores are mapped to physical cores and NUMA nodes; and deterministic IDs for instructions, loops, functions, and basic blocks.

Furthermore, NOELLE offers a new implementation of the loop structure (LS), dominator, and scalar evolution abstractions. This is because the LLVM abstractions computed by function passes free their memory when they are invoked to analyze a different function. This generates subtle, but unfortunately common, bugs that affect module passes. The bug is generated when a module pass caches the pointers of the abstractions returned by a function pass applied to multiple functions. All previous pointers but the last one are invalid. This problem can become even more subtle because a function pass's invocation can invalidate the abstraction returned by another function pass. To avoid this common bug, NOELLE offers implementations of these LLVM abstractions with the property that only their users can free these memory objects.

NOELLE includes tools (Table 2) to help users deploy their compilation tool-chains. Next are the most important ones.

noelle-whole-IR generates a single IR file. Merging all bitcode into a single bitcode file is important for analyses and transformations that span a wide code region (e.g., the whole program). One such example is the alias analyses used by NOELLE to compute the PDG. This tool is based on gllvm.

noelle-rm-lc-dependences modifies an IR program to remove or reduce the impact of loop-carried data dependences (e.g., using Loop Builder to split a loop).
noelle-prof-coverage profiles IR code using representative program inputs. At the moment, NOELLE includes an instruction profiler, a branch profiler, and a loop profiler.

noelle-meta-pdg-embed computes the PDG of an IR file. This tool computes the PDG by invoking the many time-consuming and accurate alias analyses that power NOELLE. Then, this tool embeds the computed PDG as metadata into the IR file so that NOELLE can re-construct the requested abstractions without requiring memory analyses. This tool relies on NOELLE's PDG and IDs abstractions.

noelle-load loads NOELLE's layer in memory. Custom tools invoke NOELLE's empowered LLVM passes by using noelle-load rather than the LLVM tool opt.

noelle-arch measures architecture-specific characteristics. At the moment, this tool measures the core-to-core latency and bandwidth. This tool also interacts with the tool hwloc [9] to find the number of physical and logical cores of the underlying platform, their mapping, and the NUMA nodes.

NOELLE provides a testing infrastructure composed of hundreds of regression tests, unit tests, and performance tests. These tests are micro C/C++ programs that illustrate corner cases or common code patterns found in popular benchmark suites such as SPEC CPU2017 and PARSEC 3.0. This testing infrastructure allows NOELLE's users to quickly test their work with representative code patterns without paying the high compilation and profiling costs of the original codebases of the mentioned benchmark suites. Finally, this infrastructure is integrated with distributed systems, such as HTCondor and Slurm, to run tests in parallel across multiple machines. Optionally, NOELLE generates a bash file that executes all tests sequentially.

Tests are enabled by exposing NOELLE options and can be extended. This allows NOELLE's users to surgically generate tests that stress a specific aspect of a specific code transformation. For example, a user can force a parallelizing custom tool to parallelize only a given loop.
NOELLE's abstractions may depend on each other to simplify their design while keeping high precision. For example, the invariant abstraction (INV) uses the PDG abstraction to identify loop invariants. Next, we compare this NOELLE implementation with LLVM's to highlight the impact of building upon higher-level abstractions rather than lower-level ones.

Algorithm 1 shows the simplified logic of LLVM's implementation, which relies on low-level abstractions to decide whether a given instruction is a loop invariant. First, the algorithm checks whether any operand of I is defined within loop L. If no operands are defined within L, it checks the type of the instruction I. If I is a load instruction, it checks whether any other instruction of L can modify the memory location accessed by I. If I is a store instruction, it checks whether any memory use precedes I in L. If not, it checks that no memory invalidation would happen if I were hoisted outside the loop. Finally, if I is a call instruction, it checks (i) whether I can modify any memory location, (ii) whether the only memory accesses are via arguments to the call, and (iii) whether any sub-loop can modify the memory accessed via arguments by the call I.

Algorithm 2 shows NOELLE's implementation, which relies on the high-level PDG abstraction. It checks whether I is currently under analysis (i.e., in the stack s). If not, it checks, for each instruction that I depends on, whether it is outside the loop or a loop invariant. Notice that this algorithm is smaller, simpler, and more precise than Algorithm 1 (Figure 4).
Algorithm 1: isInvariant_llvm(Instruction I, Loop L, Dominator DT, AliasAnalysis AA)
Result: Return true if instruction I is an invariant in loop L
/* Simplified logic of the LLVM implementation */
for operand in I.getOperands() do
  if operand is defined in L then return False;
end
if isa<LoadInst>(I) then
  if any instruction of L may modify the memory accessed by I then return False;
end
if isa<StoreInst>(I) then
  // Ensure no memory def/use would be invalidated by hoisting the store
  M ← AA.getNearestDominatingMemoryAccess(I);
  if M is in L then return False;
end
if call ← dyn_cast<CallInst>(I) then
  if call may modify memory then return False;
  S ← AA.onlyMemoryAccessesAreArguments(call);
  if not S then return False;
  for Argument A of call do
    for sL in L->getSubLoops() do
      for sI in sL do
        if AA.getModRef(A, sI) != NoMod then return False;
      end
    end
  end
end
return True;

Algorithm 2: isInvariant_noelle(Instruction I, Loop L, PDG dg, Stack s)
Result: Return true if instruction I is an invariant in loop L
/* Implementation using the high-level abstraction PDG instead of the low-level abstractions alias analysis and dominators */
if I in s then return False;
s.push(I);
for PDG dependence J to I do
  if J is in L then
    inv ← isInvariant_noelle(J, L, dg, s);
    if not inv then return False;
  end
end
s.pop();
return True;

This section describes the code transformations built upon NOELLE. Table 3 summarizes them and their LoC. Each transformation relies on several of NOELLE's abstractions. Table 4 shows the abstractions used by each of them. It is important to notice that every abstraction is used by more than one custom tool, suggesting their wide applicability.
HELIX parallelizes a loop by distributing its iterations between cores [23, 24, 42]. Each iteration is sliced into several sequential and parallel segments. Different instances of the same static sequential segment run sequentially between the cores, while everything else can overlap. HELIX uses PRO, FR, and L of NOELLE to identify the most profitable loops to parallelize. HELIX uses the PDG and ENV to identify and organize the live-ins and live-outs of each chosen loop. The LB and T abstractions are then used to generate the parallel version of a loop. HELIX uses the aSCCDAG, INV, IV, and RD abstractions to identify the SCCs that need to be executed sequentially. HELIX uses DFE to translate SCCs into sequential segments. SCD is then used to reduce the size of each sequential segment as well as to schedule them within the body of each parallelized loop. Moreover, HELIX uses IVS to perform chunking of loop iterations. Finally, HELIX uses AR to implement the helper thread optimization [23].
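The sequential segments above start from the strongly connected components of the loop's dependence graph, and the standard way to compute SCCs is Tarjan's algorithm [49]. The following self-contained sketch runs it on a toy adjacency-list graph; it illustrates the underlying algorithm, not NOELLE's aSCCDAG interface.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Tarjan's algorithm: returns the strongly connected components of a
// directed graph given as an adjacency list. An SCC with more than one
// node corresponds to a dependence cycle that a tool like HELIX must
// execute sequentially.
struct Tarjan {
  const std::vector<std::vector<int>>& adj;
  std::vector<int> index, low, stk;
  std::vector<bool> onStk;
  std::vector<std::vector<int>> sccs;
  int counter = 0;

  explicit Tarjan(const std::vector<std::vector<int>>& g)
      : adj(g), index(g.size(), -1), low(g.size(), 0), onStk(g.size(), false) {
    for (int v = 0; v < (int)g.size(); ++v)
      if (index[v] == -1) dfs(v);
  }

  void dfs(int v) {
    index[v] = low[v] = counter++;
    stk.push_back(v);
    onStk[v] = true;
    for (int w : adj[v]) {
      if (index[w] == -1) { dfs(w); low[v] = std::min(low[v], low[w]); }
      else if (onStk[w])  { low[v] = std::min(low[v], index[w]); }
    }
    if (low[v] == index[v]) {  // v is the root of an SCC: pop it off
      std::vector<int> scc;
      int w;
      do {
        w = stk.back(); stk.pop_back(); onStk[w] = false;
        scc.push_back(w);
      } while (w != v);
      sccs.push_back(scc);
    }
  }
};
```

Collapsing each SCC to a single node yields a DAG of dependences, which is the shape both the aSCCDAG abstraction and DSWP's pipeline partitioning rely on.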
DSWP parallelizes a loop by distributing its SCCs between cores [43]. Instances of a given SCC are executed by the same core to create unidirectional communication between cores. DSWP uses NOELLE's abstractions similarly to how HELIX does, while leveraging DSWP-specific knowledge to select the loops to parallelize and to parallelize them.
CARAT is co-designed with the underlying operating system to replace virtual memory. This compiler injects code to guard IR memory instructions that cannot be proven valid at compile time [46]. CARAT relies on the PDG, the aSCCDAG, and INV to identify the memory instructions that need guarding. Then, it uses DFE and PRO to avoid redundant guards of the same memory location. CARAT also uses L, LB, and IV to merge guards. Finally, SCD is used to place the guards in the code.
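The injected guard can be pictured as a runtime membership check against a table of tracked allocations. The sketch below is a hypothetical illustration of that idea; the names `trackAllocation` and `guard`, and the fixed-size table, are invented here and are not CARAT's interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// A tracked allocation: base address and length.
struct Region { uintptr_t base; std::size_t len; };

// Tiny fixed table standing in for the runtime's allocation map (assumption).
static Region table[8];
static int nRegions = 0;

// Called at allocation sites so the runtime knows which memory is valid.
void trackAllocation(void* p, std::size_t len) {
  table[nRegions++] = { reinterpret_cast<uintptr_t>(p), len };
}

// The check injected before a memory instruction that could not be proven
// safe statically: does the access stay inside some tracked allocation?
bool guard(void* p, std::size_t accessSize) {
  uintptr_t a = reinterpret_cast<uintptr_t>(p);
  for (int i = 0; i < nRegions; ++i)
    if (a >= table[i].base && a + accessSize <= table[i].base + table[i].len)
      return true;   // access stays inside a tracked allocation
  return false;      // a real system would trap or escalate to the OS here
}
```

Merging guards, as CARAT does with L, LB, and IV, amounts to proving that one check covers a whole range of accesses so the per-access checks can be dropped.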
Compiler-Based Timing is co-designed with the underlying operating system to inject calls to OS routines [31] into a program. This compiler uses DFE and PRO to implement its specialized data flow analyses. It also uses L, FR, and LB to handle potentially-infinite loops. Finally, it uses CG to improve the accuracy of its time analyses.
PRVJeeves selects the pseudo-random value generators (PRVGs) for a randomized program (e.g., Monte Carlo simulations) [38]. To do so, it uses the PDG, CG, and DFE to identify the allocations and uses of the PRVGs. Then, PRVJeeves uses PRO to prune the design space (e.g., PRVGs not used frequently are left unmodified). Moreover, it uses L, LB, INV, and IV to identify the uses of a vector of PRVGs. Finally, PRVJeeves uses SCD to place the uses of a PRVG in the code.
DOALL parallelizes a loop that has no loop-carried data dependences by distributing its iterations among cores [34]. DOALL's implementation uses NOELLE's abstractions similarly to the other parallelizing compilers (DSWP and HELIX); only its loop selection process and parts of its parallelized-code generation are naturally different from the other parallelization techniques.
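The essence of the DOALL transformation, written out by hand for one loop: because iterations carry no dependences, they can be split into contiguous chunks, one per worker thread. This is a minimal sketch of the technique, not the code a NOELLE-based tool actually generates.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>
#include <vector>

// Parallelize out[i] = in[i] * 2 across nThreads workers.
// Every iteration is independent, so chunks can run concurrently.
void doallDouble(const std::vector<int>& in, std::vector<int>& out,
                 int nThreads) {
  std::vector<std::thread> workers;
  std::size_t n = in.size();
  std::size_t chunk = (n + nThreads - 1) / nThreads;  // ceil(n / nThreads)
  for (int t = 0; t < nThreads; ++t) {
    std::size_t lo = t * chunk;
    std::size_t hi = std::min(n, lo + chunk);
    workers.emplace_back([&, lo, hi] {
      for (std::size_t i = lo; i < hi; ++i)
        out[i] = in[i] * 2;  // no loop-carried dependence: safe to run in parallel
    });
  }
  for (auto& w : workers) w.join();
}
```

The hard part a real DOALL pass performs, and the sketch omits, is proving the absence of loop-carried dependences (via the PDG) and handling live-ins and live-outs (via ENV).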
Table 3. Custom tools built upon NOELLE.
Custom tool | Description | LLVM (LoC) | LLVM + NOELLE (LoC) | Percent reduction
Time Squeezer (TIME) | Compiler to optimize compare instructions for timing speculative architectures | 510 | 92 | 82.0%
Compiler-based timing (COOS) | Compiler to inject calls to Operating System routines to replace hardware interrupts | 1641 | 495 | 69.8%
Loop Invariant Code Motion (LICM) | Hoist loop invariants outside their loop | 2317 | 170 | 92.7%
DOALL | Parallelizing compiler that applies the DOALL code parallelization technique | 5512 | 321 | 94.2%
Dead Function Elimination (DEAD) | Reduce the number of functions without increasing the total number of IR instructions | 7512 | 61 | 99.2%
DSWP | Parallelizing compiler that applies the DSWP code parallelization technique | 8525 | 775 | 90.9%
HELIX | Parallelizing compiler that applies the HELIX code parallelization technique | 15453 | 958 | 93.8%
PRVJeeves (PRVJ) | Compiler to select the Pseudo Random Value Generators for the program given as input | 17863 | 456 | 97.4%
CARAT | Inject memory guards to potentially incorrect memory instructions | 21899 | 595 | 97.3%
Perspective (PERS) | Parallelizing compiler that minimizes speculation and privatization costs | 33998 | 22706 | 33.2%
Loop Invariant Code Motion hoists loop invariants outside their loop. It uses FR to hoist loop invariants from innermost loops to outermost ones. Then, it uses INV to identify instructions that can be hoisted. Finally, it uses LB to perform the hoisting transformation.
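The effect of this transformation, written out by hand on a toy loop: the invariant subexpression does not depend on the iteration, so the transformed version computes it once before the loop.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Before LICM: k*k + sqrt(k) is recomputed on every iteration even though
// it never changes inside the loop.
double sumBefore(const std::vector<double>& v, double k) {
  double s = 0;
  for (double x : v)
    s += x * (k * k + std::sqrt(k));  // invariant recomputed each iteration
  return s;
}

// After LICM: the invariant is hoisted and computed once.
double sumAfter(const std::vector<double>& v, double k) {
  double inv = k * k + std::sqrt(k);  // hoisted outside the loop
  double s = 0;
  for (double x : v) s += x * inv;
  return s;
}
```

Both versions compute the same sums; the transformed one simply does less redundant work per iteration, which is exactly what the INV abstraction licenses.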
Time-Squeezer generates code optimized for timing-speculative micro-architectures [28, 29]. To this end, the compiler needs to decide when to swap the operands of a compare instruction (and modify its uses), how to change the schedule of instructions, and where to inject instructions that modify the clock period of the underlying architecture. This custom tool uses DFE, L, and FR to decide where to inject clock-changing instructions. It then uses SCD to optimize the instruction sequence of a code region that uses the same previously-chosen clock period. Finally, it uses ISL and the PDG to analyze the compare instructions and their dependences.
This section presents evaluation results for NOELLE and the custom tools built upon it. Before presenting the results, we first describe our evaluation platform and methodology. Our results show that each of NOELLE's abstractions is used by several significantly different custom tools. Results suggest that NOELLE's implementation of a few abstractions that also exist in LLVM is more precise than their LLVM counterparts. Finally, results suggest that we can build, in a few lines of code, a custom tool powerful enough to improve performance or reduce binary size compared to mainstream, widely adopted compilers like clang. We have evaluated NOELLE and ten custom tools on the platform described next, following the measurement methodology described here.
Platform.
Our evaluation was done on a Dell PowerEdge R730 server with one Intel Xeon E5-2695 v3 Haswell processor running at 2.3GHz. The processor has 12 cores with 2-way hyperthreading, 35MB of last-level cache, and a peak power consumption of 120W. The cores are supported by 256GB of main memory in 16 dual-rank RDIMMs at 2133MHz. Turbo Boost was disabled, and no CPU frequency governors were used (i.e., all cores ran at their maximum frequency). The OS used is Red Hat Enterprise Linux Server 8 with kernel 4.18. NOELLE was built on top of LLVM 9 [37].

Table 4. Each NOELLE abstraction is used by several custom tools.

Custom tool | NOELLE's abstractions used (columns: PDG, aSCCDAG, CG, ENV, T, DFE, PRO, SCD, L, LB, IV, IVS, INV, FR, ISL, RD, AR, LS)
HELIX ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
DSWP ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
CARAT ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
COOS ✓ ✓ ✓ ✓ ✓ ✓ ✓
PRVJ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
DOALL ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
LICM ✓ ✓ ✓ ✓ ✓
TIME ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
DEAD ✓ ✓
PERS ✓ ✓
Statistics and convergence.
Each data point we show in our evaluation is an average of multiple runs. We ran the relevant configurations as many times as necessary to achieve a tight confidence interval (95% of the measurements are within 5% of the mean).
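The convergence criterion above can be sketched as a small predicate over the collected runs. This is an illustrative reconstruction of the check, not our actual measurement harness.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Keep collecting runs until at least 95% of the measurements fall
// within 5% of their mean.
bool converged(const std::vector<double>& runs) {
  if (runs.empty()) return false;
  double mean = 0;
  for (double r : runs) mean += r;
  mean /= runs.size();

  std::size_t close = 0;
  for (double r : runs)
    if (std::fabs(r - mean) <= 0.05 * mean) ++close;

  return close >= static_cast<std::size_t>(std::ceil(0.95 * runs.size()));
}
```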
NOELLE simplifies the implementation of code analyses and transformations. Table 3 compares the implementations of 10 transformations when built upon NOELLE and when implemented only using LLVM abstractions. The reduction in LoC is significant, reducing the maintainability cost of these custom tools. NOELLE's abstractions are general enough to be useful to many highly heterogeneous custom tools. Table 4 shows that each abstraction is used by several custom tools. For example, the loop builder (LB) is used by 8 of the 10 custom tools. Moreover, it is important to notice the heterogeneity of the custom tools that use (for example) LB: parallelizing transformations, loop invariant code motion (LICM), compare-instruction optimization and code generation for timing-speculative micro-architectures (TIME), memory guard injection and optimization (CARAT), PRVG selection (PRVJ), and scheduling of OS routines within applications (COOS).
Figure 3.
While LLVM is capable of proving the non-existence of most dependences, NOELLE disproves more by relying on state-of-the-art alias analysis techniques (SCAF [16]).
Next, we compare the subset of NOELLE's abstractions that are also available in LLVM. These abstractions are loop invariants, loop induction variables, and dependences.
Figure 3 shows that NOELLE's implementation of dependences within the PDG abstraction is more accurate than LLVM's abstraction. LLVM is capable of proving that a significant fraction of potential memory dependences do not exist. NOELLE further improves these results dramatically by leveraging state-of-the-art alias analyses [16, 35, 48].
Figure 4 compares the number of loop invariants identified by LLVM and NOELLE. NOELLE identifies significantly more loop invariants than LLVM because the invariant abstraction of NOELLE is built using the PDG abstraction. This makes the invariant detection algorithm within NOELLE (Algorithm 2) smaller, more elegant, and more powerful than the LLVM one (Algorithm 1).
Finally, we computed the number of loop induction variables that govern a loop using both LLVM and NOELLE. We did so for the three benchmark suites, for a total of 41 benchmarks. LLVM identifies only a few loop induction variables (11 in total) among all loops of the 41 benchmarks. The reason is that LLVM's induction variable analysis expects the input IR to have loops in do-while shape. However, most loops in the 41 benchmarks have a while shape, and changing them into a do-while shape would reduce the applicability of all the implemented parallelization techniques. Instead, NOELLE identifies many loop induction variables (385 in total) independently of the shape of the loop being analyzed.
Next, we describe the parallelizing code transformations built upon NOELLE (HELIX, DSWP, DOALL) that do not rely on speculative techniques. This allows us to compare few-hundred-line implementations built upon NOELLE with the parallelizing transformations implemented by the icc (Intel) and gcc (GNU) compilers.
Figure 5 shows the speedups we obtained on the PARSEC and MiBench benchmark suites. The few missing benchmarks failed to compile with the unmodified clang compiler, and therefore we cannot use them to test NOELLE-based tools. Figure 5 shows that the small NOELLE-based custom tools already extract more parallelism than gcc and icc do. Furthermore, we analyzed the few benchmarks from which NOELLE-based parallelizing tools could not extract significant performance benefits (e.g., crc). We found this is due to the lack of support for memory object cloning. This is arguably an abstraction that should exist in the parallelization techniques rather than within NOELLE, as the latter is not specialized for parallelization purposes.
We also ran these five parallelizing tools on 14 SPEC CPU2017 benchmarks (the only missing benchmark is gcc, which did not compile with clang). Speedups were obtained only by the NOELLE-based parallelizing tools and are between 1% and 5% for these 14 benchmarks, demonstrating the robustness of NOELLE's abstractions. Speculative techniques are likely to be required to unlock further speedups on these benchmarks. We argue that speculative techniques should be implemented outside NOELLE as they are specific to the parallelization goal.
Finally, we have ported a state-of-the-art parallelizing compiler (Perspective [15]) together with its authors. We modified the original codebase to use the PDG and the aSCCDAG abstractions. This new version preserves the performance shown in the authors' original paper.
Binary size is an important optimization goal for both embedded systems and servers [18]. The compiler clang offers an optimization level specialized for this goal (-Oz). Dead Function Elimination further reduces the binary size by 6.3% on average among the 41 benchmarks considered.
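The reachability analysis at the heart of dead-function elimination can be sketched on a toy call graph: walk from the entry point and report every function that can never be reached. This sketch handles direct calls only; a real pass must also treat address-taken functions as roots.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy call graph: each function name maps to the functions it calls directly.
using CallGraph = std::map<std::string, std::vector<std::string>>;

// Return the functions unreachable from `entry`; these can be dropped
// from the binary without changing program behavior.
std::set<std::string> deadFunctions(const CallGraph& cg,
                                    const std::string& entry) {
  std::set<std::string> live;
  std::vector<std::string> work = {entry};
  while (!work.empty()) {
    std::string f = work.back();
    work.pop_back();
    if (!live.insert(f).second) continue;  // already visited
    auto it = cg.find(f);
    if (it != cg.end())
      for (const auto& callee : it->second) work.push_back(callee);
  }
  std::set<std::string> dead;
  for (const auto& [f, callees] : cg)
    if (!live.count(f)) dead.insert(f);
  return dead;
}
```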
Providing High-level Abstractions
Researchers have explored bringing high-level abstractions to compilers in many different ways. Several compilers that support automatic parallelization, including Polaris [20], a parallelizing compiler for Fortran programs, Cetus [3], a C compiler focusing on multicore, and ROSE [10], a compiler for building custom compilation tools, operate on high-level abstractions and perform source-to-source translation; they thus miss opportunities presented only in low-level IRs, including more fine-grained operations and more canonical code patterns.
Many domain-specific projects add new abstractions similar to NOELLE. SeaHorn [33] provides new abstractions for developing new verification techniques. Polly [8,
Figure 4.
NOELLE detects significantly more invariants than LLVM even though the former relies on a simpler and shorter algorithm powered by a higher-level abstraction (Algorithm 2) compared to LLVM (Algorithm 1).
Figure 5.
Neither gcc nor icc obtained additional performance benefits from their parallelization techniques. NOELLE-based parallelizing tools instead generate additional benefits compared to their baseline, clang.

LLVM Projects
As we have built NOELLE on top of LLVM, we want to understand how many existing LLVM-based projects could benefit from it. To do this, we have exhaustively reviewed all 544 papers published in PLDI, CGO, and CC during the past five years (2016-2020). Out of these papers, 87 explicitly mention that they are built on top of LLVM, by either implementing new passes, modifying the LLVM internals, or creating a new front-end/back-end based on LLVM IR. Out of these 87 papers:
• 26 (29.9%) use abstractions similar to the ones provided by NOELLE. Thus, they could potentially be re-implemented on top of NOELLE with significantly fewer lines of code and/or with better performance. We have implemented CARAT [46] and PRVJeeves [38] in NOELLE and presented the benefits in Section 3. Other examples include Spinal Code [36], which uses the PDG as well as data flow analysis; Valence [52], which relies on call graph analysis; and Clairvoyance [50], which relies on loop-carried dependence analysis.
• 10 (11.5%) provide new abstractions or implement analyses or transformations that could fulfill NOELLE abstractions. We have already integrated SVF [47] and SCAF [16] within NOELLE. We plan to evaluate other examples [27, 40, 41, 44] in the future.
• 25 (28.7%) perform tasks orthogonal to NOELLE's abstractions. Nevertheless, they do not conflict with NOELLE because neither implementation modifies LLVM internals. Due to NOELLE's modular and demand-driven design, future work can use NOELLE even if only a subset of its abstractions is of interest.
• 26 (29.9%) papers modify LLVM internals or use alternative front-ends/back-ends. These projects need to be analyzed case by case for the possibility of integration with NOELLE.

In conclusion, 41.4% of the projects are highly likely to benefit from or contribute to NOELLE's abstractions; 28.7% have the potential for future collaboration; 29.9% need investigation before integration.
Code analyses and transformations need to go beyond peephole and ILP optimizations for modern architectures. Their implementation requires high-level abstractions that are currently lacking in LLVM. This paper introduces NOELLE, an open-source compilation layer built upon LLVM that provides the required abstractions. NOELLE has been tested with ten highly diverse and complex tools built upon it. All these tools gain benefits compared to unmodified LLVM while dramatically reducing their LoC.
References
[15] In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '20). Association for Computing Machinery, New York, NY, USA, 351–367. https://doi.org/10.1145/3373376.3378458
[16] Sotiris Apostolakis, Ziyang Xu, Zujun Tan, Greg Chan, Simone Campanoni, and David I. August. 2020. SCAF: A Speculation-Aware Collaborative Dependence Analysis Framework. In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2020). Association for Computing Machinery, New York, NY, USA, 638–654. https://doi.org/10.1145/3385412.3386028
[17] Andrew W. Appel. 2008. Modern Compiler Implementation in Java. Cambridge University Press.
[18] Grant Ayers, Nayana Prasad Nagendra, David I. August, Hyoun Kyu Cho, Svilen Kanev, Christos Kozyrakis, Trivikram Krishnamurthy, Heiner Litz, Tipp Moseley, and Parthasarathy Ranganathan. 2019. AsmDB: Understanding and Mitigating Front-End Stalls in Warehouse-Scale Computers. In Proceedings of the 46th International Symposium on Computer Architecture (ISCA '19). Association for Computing Machinery, New York, NY, USA, 462–473. https://doi.org/10.1145/3307650.3322234
[19] Riyadh Baghdadi, Jessica Ray, Malek Ben Romdhane, Emanuele Del Sozzo, Abdurrahman Akkas, Yunming Zhang, Patricia Suriana, Shoaib Kamil, and Saman P. Amarasinghe. 2019. Tiramisu: A Polyhedral Compiler for Expressing Fast and Portable Code. In IEEE/ACM International Symposium on Code Generation and Optimization (CGO 2019), Mahmut Taylan Kandemir, Alexandra Jimborean, and Tipp Moseley (Eds.). IEEE, 193–205. https://doi.org/10.1109/CGO.2019.8661197
[20] Bill Blume, Rudolf Eigenmann, Keith Faigin, and John Grout. [n. d.]. Polaris: The Next Generation in Parallelizing Compilers. Technical Report. 18 pages.
[21] Uday Bondhugula and Jagannathan Ramanujam. 2007. PLuTo: A Practical and Fully Automatic Polyhedral Parallelizer and Locality Optimizer. (2007).
[22] Juan Manuel Martinez Caamaño, Aravind Sukumaran-Rajam, Artiom Baloian, Manuel Selva, and Philippe Clauss. 2017. APOLLO: Automatic Speculative POLyhedral Loop Optimizer. In IMPACT 2017 - 7th International Workshop on Polyhedral Compilation Techniques. 8 pages.
[23] Simone Campanoni, Timothy Jones, Glenn Holloway, Vijay Janapa Reddi, Gu-Yeon Wei, and David Brooks. 2012. HELIX: Automatic Parallelization of Irregular Programs for Chip Multiprocessing. In Proceedings of the Tenth International Symposium on Code Generation and Optimization (CGO '12). ACM, New York, NY, USA, 84–93. https://doi.org/10.1145/2259016.2259028
[24] Simone Campanoni, Timothy Jones, Glenn Holloway, Gu-Yeon Wei, and David Brooks. 2012. The HELIX Project: Overview and Directions. In DAC Design Automation Conference 2012. 277–282. https://doi.org/10.1145/2228360.2228412
[25] Pohua P. Chang, Scott A. Mahlke, and Nancy J. Warter. [n. d.]. IMPACT: An Architectural Framework for Multiple-Instruction-Issue Processors. 10 pages.
[26] Enrico A. Deiana, Vincent St-Amour, Peter A. Dinda, Nikos Hardavellas, and Simone Campanoni. 2018. Unconventional Parallelization of Nondeterministic Applications. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '18). ACM, New York, NY, USA, 432–447. https://doi.org/10.1145/3173162.3173181
[27] Johannes Doerfert, Tobias Grosser, and Sebastian Hack. 2017. Optimistic Loop Optimization. In Proceedings of the 2017 International Symposium on Code Generation and Optimization (CGO 2017), Vijay Janapa Reddi, Aaron Smith, and Lingjia Tang (Eds.). ACM, 292–304.
[28] Yuanbo Fan, Simone Campanoni, and Russ Joseph. 2019. Time Squeezing for Tiny Devices. In Proceedings of the 46th International Symposium on Computer Architecture (ISCA 2019). 657–670. https://doi.org/10.1145/3307650.3322268
[29] Yuanbo Fan, Tianyu Jia, Jie Gu, Simone Campanoni, and Russ Joseph. 2018. Compiler-Guided Instruction-Level Clock Scheduling for Timing Speculative Processors. In Proceedings of the 55th Annual Design Automation Conference (DAC '18). ACM, New York, NY, USA, Article 40, 6 pages. https://doi.org/10.1145/3195970.3196013
[30] Jeanne Ferrante, Karl J. Ottenstein, and Joe D. Warren. 1987. The Program Dependence Graph and Its Use in Optimization. ACM Transactions on Programming Languages and Systems (TOPLAS) 9, 3 (1987), 319–349.
[31] Souradip Ghosh, Michael Cuevas, Simone Campanoni, and Peter Dinda. 2020. Compiler-Based Timing for Extremely Fine-Grain Preemptive Parallelism. In Supercomputing Conference (SC).
[32] Tobias Grosser, Hongbin Zheng, Raghesh Aloor, Andreas Simbürger, Armin Größlinger, and Louis-Noël Pouchet. 2011. Polly: Polyhedral Optimization in LLVM. In Proceedings of the First International Workshop on Polyhedral Compilation Techniques (IMPACT), Vol. 2011. 1.
[33] Arie Gurfinkel, Temesghen Kahsai, Anvesh Komuravelli, and Jorge A. Navas. 2015. The SeaHorn Verification Framework. In Computer Aided Verification, Daniel Kroening and Corina S. Păsăreanu (Eds.), Vol. 9206. Springer International Publishing, Cham, 343–361. https://doi.org/10.1007/978-3-319-21690-4_20
[34] Ali R. Hurson, Joford T. Lim, Krishna M. Kavi, and Ben Lee. 1997. Parallelization of DOALL and DOACROSS Loops: A Survey. In Advances in Computers, Vol. 45. Elsevier, 53–103.
[35] Nick P. Johnson, Jordan Fix, Stephen R. Beard, Taewook Oh, Thomas B. Jablin, and David I. August. 2017. A Collaborative Dependence Analysis Framework. In Proceedings of the 2017 International Symposium on Code Generation and Optimization (CGO '17). IEEE Press, Piscataway, NJ, USA, 148–159. http://dl.acm.org/citation.cfm?id=3049832.3049849
[36] Bongjun Kim, Seonyeong Heo, Gyeongmin Lee, Seungbin Song, Jong Kim, and Hanjun Kim. 2019. Spinal Code: Automatic Code Extraction for Near-User Computation in Fogs. In Proceedings of the 28th International Conference on Compiler Construction (CC 2019). ACM Press, Washington, DC, USA, 87–98. https://doi.org/10.1145/3302516.3307356
[37] Chris Lattner and Vikram Adve. 2004. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In Proceedings of the International Symposium on Code Generation and Optimization: Feedback-Directed and Runtime Optimization. IEEE Computer Society, 75.
[38] Michael Leonard and Simone Campanoni. 2020. Introducing the Pseudorandom Value Generator Selection in the Compilation Toolchain. In Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization (CGO 2020). Association for Computing Machinery, New York, NY, USA, 256–267. https://doi.org/10.1145/3368826.3377906
[39] LLVM. [n. d.]. The Loop Optimization Working Group. https://llvm.org/devmtg/2019-10/talk-abstracts.html
[40] In Proceedings of the 2017 International Symposium on Code Generation and Optimization (CGO 2017), Vijay Janapa Reddi, Aaron Smith, and Lingjia Tang (Eds.). ACM, 134–147.
[41] Stanislav Manilov, Christos Vasiladiotis, and Björn Franke. 2018. Generalized Profile-Guided Iterator Recognition. In Proceedings of the 27th International Conference on Compiler Construction (CC 2018). ACM Press, Vienna, Austria, 185–195. https://doi.org/10.1145/3178372.3179511
[42] Niall Murphy, Timothy Jones, Robert Mullins, and Simone Campanoni. 2016. Performance Implications of Transient Loop-Carried Data Dependences in Automatically Parallelized Loops. In Proceedings of the 25th International Conference on Compiler Construction (CC 2016). ACM, New York, NY, USA, 23–33. https://doi.org/10.1145/2892208.2892214
[43] G. Ottoni, R. Rangan, A. Stoler, and D. I. August. 2005. Automatic Thread Extraction with Decoupled Software Pipelining. In 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05). 12 pp. https://doi.org/10.1109/MICRO.2005.13
[44] Ankush Phulia, Vaibhav Bhagee, and Sorav Bansal. 2020. OOElala: Order-of-Evaluation Based Alias Analysis for Compiler Optimization. In Proceedings of the 41st ACM SIGPLAN International Conference on Programming Language Design and Implementation (PLDI 2020), Alastair F. Donaldson and Emina Torlak (Eds.). ACM, 839–853. https://doi.org/10.1145/3385412.3385962
[45] Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. 2013. Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '13). Association for Computing Machinery, New York, NY, USA, 519–530. https://doi.org/10.1145/2491956.2462176
[46] Brian Suchy, Simone Campanoni, Nikos Hardavellas, and Peter Dinda. 2020. CARAT: A Case for Virtual Memory through Compiler- and Runtime-Based Address Translation. In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2020). Association for Computing Machinery, New York, NY, USA, 329–345. https://doi.org/10.1145/3385412.3385987
[47] Yulei Sui and Jingling Xue. 2016. SVF: Interprocedural Static Value-Flow Analysis in LLVM. In Proceedings of the 25th International Conference on Compiler Construction (CC 2016), Ayal Zaks and Manuel V. Hermenegildo (Eds.). ACM, 265–266. https://doi.org/10.1145/2892208.2892235
[48] Yulei Sui and Jingling Xue. 2016. SVF: Interprocedural Static Value-Flow Analysis in LLVM. In Proceedings of the 25th International Conference on Compiler Construction. 265–266.
[49] Robert Tarjan. 1972. Depth-First Search and Linear Graph Algorithms. SIAM Journal on Computing 1, 2 (1972), 146–160.
[50] Kim-Anh Tran, Trevor E. Carlson, Konstantinos Koukos, Magnus Själander, Vasileios Spiliopoulos, Stefanos Kaxiras, and Alexandra Jimborean. 2017. Clairvoyance: Look-Ahead Compile-Time Scheduling. In Proceedings of the 2017 International Symposium on Code Generation and Optimization (CGO 2017), Vijay Janapa Reddi, Aaron Smith, and Lingjia Tang (Eds.). ACM, 171–184.
[51] Robert P. Wilson, Robert S. French, Christopher S. Wilson, Saman P. Amarasinghe, Jennifer M. Anderson, Steve W. K. Tjiang, Shih-Wei Liao, Chau-Wen Tseng, Mary W. Hall, Monica S. Lam, and John L. Hennessy. [n. d.]. The SUIF Compiler System: A Parallelizing and Optimizing Research Compiler. Technical Report. 7 pages.
[52] Tong Zhou, Michael R. Jantz, Prasad A. Kulkarni, Kshitij A. Doshi, and Vivek Sarkar. 2019. Valence: Variable Length Calling Context Encoding. In Proceedings of the 28th International Conference on Compiler Construction (CC 2019). ACM Press, Washington, DC, USA, 147–158. https://doi.org/10.1145/3302516.3307351