Employing Simulation to Facilitate the Design of Dynamic Code Generators
Vanderson Martins do Rosario, Raphael Zinsly, Sandro Rigo, Edson Borin
Institute of Computing - UNICAMP - Brazil; IBM - Campinas - Brazil

September 1, 2020
Abstract
Dynamic Translation (DT) is a sophisticated technique that allows the implementation of high-performance emulators and high-level-language virtual machines. In this technique, the guest code is compiled dynamically at runtime. Consequently, achieving good performance depends on several design decisions, including the shape of the regions of code being translated. Researchers and engineers explore these decisions to obtain the best possible performance. However, a real DT engine is a very sophisticated piece of software, and modifying one is a hard and demanding task. Hence, we propose using simulation to evaluate the impact of design decisions on dynamic translators and present RAIn, an open-source DT simulator that facilitates the testing of DT design decisions, such as Region Formation Techniques (RFTs). RAIn outputs several statistics that support the analysis of how design decisions may affect the behavior and the performance of a real DT. We validated RAIn by running a set of experiments with six well-known RFTs (NET, MRET2, LEI, NETPlus, NET-R, and NETPlus-e-r) and showed that it can reproduce well-known results from the literature without the effort of implementing them on a real and complex dynamic translator engine.
Introduction

Emulators and high-level-language virtual machines compile applications' code during their execution. This approach, known as Just-In-Time (JIT) compilation or Dynamic Translation (DT), is a concept that is as old as high-level programming languages themselves [1]. DT techniques can be used to improve the execution time and space efficiency of programs [2] and to support programming-language versatility with High-Level Language Virtual Machines (HLLVMs) [3]. They can also be used by the industry to maintain support for legacy code [4, 5] or to support new architectures such as RISC-V [6, 7]. In dynamic high-level languages, DT can be used to emulate their intermediate representation, as in the Facebook Hip-Hop Virtual Machine [3], the Java HotSpot VM [8], and Firefox's IonMonkey JavaScript JIT [9].

In order to achieve high performance using dynamic translation, the cost added by invoking a compiler during runtime should be lower than the performance gains achieved by the execution of the code produced by the compiler. To pay off the compilation cost, a piece of code needs to execute for a significant amount of time, so the savings achieved by the execution of the optimized code are also significant. Fortunately, programs usually spend most of their execution time in a minority of their code [10], and DT can achieve high performance by only translating and optimizing frequently executed code [11], which we call hot code. For the rest of the application, the cold part, an interpreter or a fast non-optimizing compiler can be used. Therefore, one of the main strategies to improve the performance of a DT engine (DTE) is to create heuristics to predict, at runtime, which part of the code is hot and which is cold. In this paper, the term Region Formation Techniques (RFTs) will be used in its broadest sense to refer to all these prediction schemes.
The quality of the code produced by the dynamic translator for hot regions is also an essential factor that affects performance. In this context, both the set of optimizations employed and the granularity and shape of the region of code being compiled play an important role in the code quality and, hence, its performance. For instance, while small portions of code, such as basic blocks, can be compiled quickly, larger ones expose more opportunities for optimization [12]. RFTs that capture whole loops, or even code from more than one method in the same region, enable more aggressive (loop or inter-procedural) optimizations. Moreover, besides affecting the scope of optimizations, the RFT design decisions may also affect the hot-code frequency, the code duplication rate, and the optimization costs, among others.

All these variables need to be considered while designing and evaluating a DTE, and this is not a trivial task. A DTE is a sophisticated piece of software that includes a compiler, may also include an interpreter, complex data structures for storing compiled binaries, a linker, and code to orchestrate all these pieces together. Understanding the code of a DTE, or debugging one, is challenging, mainly when a bug occurs in the code generated dynamically by the DTE. Consequently, in this kind of software, implementing and validating novel research and design ideas is usually a complex process. In fields like processor architecture design, prototyping ideas in real hardware is also complex, and simulation is broadly used to make design exploration approachable [13, 14, 15, 16]. In this work, we argue that simulation can also be used to facilitate DT design exploration, and we present RAIn, a novel DT/RFT simulator.

We use RAIn to evaluate several RFTs and reproduce results from the literature, which validates its simulation capabilities.
Our evaluation setup includes programs from both the SPEC CPU 2006 [17] and SYSmark [18] benchmarks and covers six different RFTs from the literature. The contributions of this paper can be summarized as follows:

• A novel DTE simulator, called RAIn, which makes implementing and testing new RFTs simpler and faster. RAIn is capable of producing several different statistics that facilitate the evaluation of the behavior and performance of the RFTs.

• A comprehensive study of the performance and behavior of six RFTs (NET [19], MRET2 [20], LEI [21], NETPlus [22], Relaxed NET [23], and extended NETPlus [24]) using programs from SPEC CPU 2006 [17] and SYSmark [18], thus covering several different application profiles. The results corroborate the findings of previous work and provide a comparison of all the techniques using the same set of applications.

The remainder of the text is organized as follows: Section 2 discusses the typical organization of a trace/region-based DTE and the characterization of its overheads (performance issues). Section 3 describes the proposed simulator, including its functionality and its advantages. Section 4 shows the experimental setup, and Section 5 presents a comprehensive comparison between several DT designs. Finally, Section 6 presents the conclusions.

A Dynamic Translator Engine (DTE) is a piece of software that translates guest code, which may be a binary generated for one computer architecture, into code compatible with the host architecture, also known as native code. When emulating hot code, i.e., frequently executed code, it usually pays off to spend effort translating and optimizing the guest code into optimized native code. In this case, the performance gains achieved by executing the optimized native code surpass the costs of translating the code.
For cold code, i.e., infrequently executed code, it is usually better to employ techniques such as interpretation or quick, basic-block-based translation, which have no or low translation cost. In this paper, we use the term interpretation to represent the mechanisms employed to emulate cold code.

Figure 1 illustrates the execution flow of a common DTE. First, it loads the guest code into memory (state 1), which can be, for example, an intermediate representation such as Java bytecode or an x86 binary. Then, the emulation process begins by fetching, decoding, and interpreting the instructions one by one (state 2), in a process called interpretation. During the interpretation, an active monitor (profiler), or the interpreter itself, checks whether the emulation is repeatedly executing the same code more often than a given threshold, called the hotness threshold. If so, the execution is in a hot part of the code, such as a cycle. Once it is detected that the execution is in hot code, the interpreter starts to record the trace of instructions to form a region of code for translation (state 3). These recorded instructions, normally part of a loop or a cycle in the static code, are passed to a compiler, which compiles the region from the guest architecture into semantically equivalent optimized code compatible with the host architecture, the native code (state 4). The native code is then stored in a code cache, and every time this same piece of code needs to be executed in the future, the execution jumps to the native code in the cache (state 5) instead of interpreting it. All these steps are repeated until the entire program's execution is finished.

(Figure 1 depicts five states: 1) Loading Input, 2) Interpreting, 3) Region Recording, 4) Region Compiling, and 5) Native Code Executing, with transitions for detecting a possible hot region, finishing the recording of a region to be compiled, reaching an instruction that starts a compiled region, and executing a compiled region.)
Figure 1: Execution flow of a common DTE.

DTEs implement specific functionalities to execute each one of the aforementioned steps. For example, to predict which regions of code are hot, there are three typical implementations [25]: a) frequency counting based on instrumentation, b) sampling based on interrupt timers, or c) a combination of both. To store the compiled code and make it fast to access, DTEs employ hash maps organized as caches, called Translated Code Caches (TCCs). Another critical process during emulation is mapping guest addresses into their respective emitted host code addresses. As translation does not always preserve the memory layout, accesses that use guest-code addresses, such as indirect jumps and returns, need to be mapped to addresses in the compiled/translated host code. The mechanism that handles this mapping of addresses during region execution is usually referred to as the Indirect Branch Translation Handler [26]. All these mechanisms and structures carry design decisions and details that directly affect the performance of a DTE.

Another important mechanism is the Region Formation Technique (RFT), which is responsible for: (1) deciding which instructions to profile, (2) deciding when to start recording regions, and (3) deciding when to stop recording them. On the one hand, to minimize compilation overhead, it is important to translate only hot code. On the other hand, to accelerate the native code, it is usually necessary to form large regions to increase the translation scope and expose more optimization opportunities to the optimizer. For example, RFTs that capture whole loops, or even code from more than one method in the same region, enable more aggressive (loop or inter-procedural) optimizations.
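To make the TCC concrete, the sketch below models it as a hash map from guest addresses to host code entry points. The naming here (TranslationCache, insert, lookup) is our own illustration, not RAIn's or any real DTE's API.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical sketch of a Translated Code Cache (TCC): a hash map from
// guest instruction addresses to the entry point of the emitted host code.
using GuestAddr = std::uint64_t;
using HostEntry = const void*;

class TranslationCache {
  std::unordered_map<GuestAddr, HostEntry> cache_;
public:
  // Record that the region starting at 'guest' was compiled to 'host'.
  void insert(GuestAddr guest, HostEntry host) { cache_[guest] = host; }

  // On an indirect jump or return, the guest target address must be mapped
  // to host code; a miss (nullptr) means falling back to the interpreter.
  HostEntry lookup(GuestAddr guest) const {
    auto it = cache_.find(guest);
    return it == cache_.end() ? nullptr : it->second;
  }
};
```

A real TCC additionally bounds its size and handles invalidation; those policies are themselves design decisions a simulator can expose.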
Consequently, the RFT design may have a big impact on the performance of the native code and, hence, of the DTE.

So far, the main RFTs proposed in the literature are NET (a.k.a. MRET) [19], NET-r [27], MRET2 [20], LEI [21], NETPlus [22], and NETPlus-e-r [28]. All of them use different strategies to select hot code, selecting dynamic regions with different sizes, shapes, and characteristics, which directly affects the performance of a DTE. Below, we include a brief description of each one of them:

• NET: The authors of Dynamo [29] introduced an RFT called NET (Next-Executing Tail) [19], which was originally called MRET (Most Recently Executed Tail). In NET, regions are superblocks. Targets of backward branches or targets of other superblock exits are considered potential superblock entries and are assigned a counter that keeps track of their execution frequency. After the counter reaches a defined hotness threshold, a new region is recorded starting from this instruction and continuing until another backward branch is reached or a given maximal number of instructions is included in the region.

• MRET2: The authors of StarDBT [30] introduced the MRET2 RFT, a variation of NET that aims at reducing the number of side exits [20]. MRET2 executes the recording phase of the NET technique twice. If different code sequences are selected during the two recordings, only the intersection between them is selected to compose the MRET2 region.

• LEI: Hiniker, Hazelwood, and Smith [21] introduced the
Last Executed Iteration (LEI) technique. LEI selects cyclic superblocks based on a history buffer of the current execution. It focuses on avoiding inner-loop duplication in the superblocks.

• NETPlus: Davis and Hazelwood pointed out in a more recent work [22] that the history buffer used by LEI imposes a considerable overhead. In that paper, the authors propose the NETPlus RFT, which follows the same steps as NET up to the point where a region is being closed. At this point, NETPlus looks ahead in the code for a branch whose target is the beginning of the superblock. When one is found, all instructions between the current end of the region and the branch are added to the superblock. In this manner, NETPlus aims at capturing more loops inside individual superblocks when compared to the original NET, while imposing a low overhead on the superblock selection process. It is important to notice, however, that the look-ahead process may touch code or memory positions that have never been touched before and may never be touched in the future, which may trigger unexpected page faults. The depth to which the search can go is a parameter of the NETPlus RFT.

• NET-r and NETPlus-e-r: Hong et al. [27] presented a modified version of QEMU, named HQEMU, that uses the LLVM backend to emit highly optimized regions. The authors observe that, to obtain maximum benefit from the LLVM optimizations, HQEMU needs to create large regions of code. Thus, they present modifications of two known RFTs. The first, called NET-r, is a relaxation of NET that makes it similar to cyclical-path-based repetition detection schemes by not ending the recording of a region when a backward branch is found, but when a cycle is found (a repeated instruction address is recorded). The second [28], called NETPlus-e, is an extension of NETPlus that adds not only paths that exit the NET region and return to its entrance, but also paths that exit the NET region and return to any part of the region.
NETPlus-e can also use NET-r instead of NET, thus creating an extended and relaxed version of NETPlus (NETPlus-e-r).
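The NET selection logic described in the first bullet above can be sketched as follows. The names (NetSketch, step) and the trace encoding are our own simplification for illustration, not the original Dynamo implementation:

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical sketch of NET (Next-Executing Tail) region formation.
// Targets of backward branches get a hotness counter; once a counter
// reaches the threshold, instructions are recorded into a superblock
// until the next backward branch (or a size limit) closes it.
struct NetSketch {
  std::unordered_map<std::uint64_t, int> counter;
  std::vector<std::uint64_t> region;   // superblock being recorded
  bool recording = false;
  int threshold;
  std::size_t maxRegionSize;

  explicit NetSketch(int hot = 1024, std::size_t maxSize = 512)
      : threshold(hot), maxRegionSize(maxSize) {}

  // Feed one executed instruction; 'backwardBranchTarget' tells whether the
  // previous instruction branched backwards to 'addr'. Returns true when a
  // region was just closed (its addresses are left in 'region').
  bool step(std::uint64_t addr, bool backwardBranchTarget) {
    if (recording) {
      if (backwardBranchTarget || region.size() >= maxRegionSize) {
        recording = false;           // close the superblock here
        return true;
      }
      region.push_back(addr);
      return false;
    }
    if (backwardBranchTarget && ++counter[addr] >= threshold) {
      recording = true;              // addr became hot: start recording
      region.clear();
      region.push_back(addr);
      counter[addr] = 0;
    }
    return false;
  }
};
```

A hotness counter per backward-branch target triggers recording, and the next backward branch (or the size limit) closes the superblock, mirroring NET's start and stop criteria.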
In this section, we discuss the primary sources of overhead in a DTE, mainly the ones related to the RFT choice.

If we consider the emulation flow depicted in Figure 1, at any given time, a DTE can be in any of the five states. State 1, Loading DTE and Guest Code, only needs to be executed once, and for long executions it incurs minimal overhead; hence, we will not consider it. Instead, we focus on the performance and overhead sources of the other four: Interpretation, Region Recording, Region Compilation, and Native Execution.

The total cost of emulating code with a DTE is composed of the cost of interpreting (State 2) and profiling (State 3) cold code plus the cost of compiling hot code (State 4) and the cost of executing the native code (State 5). Notice that the sooner a piece of code is compiled, the less time the DTE spends emulating it as cold code and the more time it spends emulating it as hot code, i.e., executing optimized native code.

Emulating code with native/optimized code is faster than emulating code with interpretation, so one greedy approach would be to compile every single part of the code; however, we also need to consider the compilation cost. If the execution frequency is low, the compilation overhead may exceed the gains achieved by hot-code emulation. In the case of cold code, compiling hurts the final performance instead of improving it; here, interpreting the code is the best option [11].

This can be summarized by Equations 1, 2, and 3, where InterpCost is the average cost of interpreting each guest instruction, InterpFreq is the number of instructions interpreted, HotStaticSize is the total number of guest instructions dynamically compiled, GenCost is the average cost of compiling a single guest instruction, CompilerInitializationCost is the initialization overhead of calling the compiler, NumRegions is the number of regions of code being compiled, NativeCost is the average cost of emulating a guest instruction by executing native (compiled) code, NativeFreq is the number of guest instructions emulated by native code, and TotalFreq is the total number of guest instructions emulated. Notice that compiling a piece of code only results in performance gains when the inequality of Equation 5 holds.

Another important performance overhead in a DTE is the region transition overhead. Transitioning between the interpreter and a native region of code, or between native regions of code, may imply saving and loading the emulation context. Emulators maintain a context of the machine being emulated, such as the values of its registers. In native code, these guest registers can be mapped to host registers, but when jumping back to the interpreter their values need to be saved to memory so they can again be accessed by the interpreter. The same happens when regions are compiled with a register allocator that chooses a different guest-to-host register mapping per region: the modified guest registers held in host registers need to be saved to memory again. This overhead can be described as the transition cost multiplied by the number of times a transition happens, as described in Equation 4. Notice that larger regions tend to contain entire cycles, such as whole loop nests, thus reducing the number of transitions.

InterpTime = InterpCost × InterpFreq (1)

NativeTime = NativeCost × NativeFreq (2)

GenTime = GenCost × HotStaticSize + CompilerInitializationCost × NumRegions (3)

TransitionTime = TransitionCost × NumTransitions (4)

InterpTime + NativeTime + GenTime + TransitionTime < InterpCost × TotalFreq (5)
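As a worked example of the cost model, the sketch below encodes Equations 1-5 directly; any numeric costs fed to it are assumptions for illustration, not measurements:

```cpp
// Hypothetical evaluation of Equations 1-5 with made-up cost parameters.
struct DtCosts {
  double interpCost, nativeCost, genCost, compilerInit, transitionCost;
};

struct DtProfile {
  double interpFreq, nativeFreq, totalFreq;
  double hotStaticSize, numRegions, numTransitions;
};

// Returns true when Equation 5 holds, i.e., translating the hot code is
// cheaper than interpreting every guest instruction.
bool translationPaysOff(const DtCosts &c, const DtProfile &p) {
  double interpTime = c.interpCost * p.interpFreq;                    // Eq. 1
  double nativeTime = c.nativeCost * p.nativeFreq;                    // Eq. 2
  double genTime = c.genCost * p.hotStaticSize +
                   c.compilerInit * p.numRegions;                     // Eq. 3
  double transitionTime = c.transitionCost * p.numTransitions;        // Eq. 4
  return interpTime + nativeTime + genTime + transitionTime <
         c.interpCost * p.totalFreq;                                  // Eq. 5
}
```

With assumed per-instruction costs, a program whose native code covers most of the execution satisfies the inequality, while a program that compiles much code it rarely re-executes does not.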
The native code performance (NativeCost) is directly related to the quality of the code generated by the DTE compiler. Many decisions, such as the shape and size of the regions, affect the quality of the generated code. For instance, regions with a larger number of instructions may expose many more optimization opportunities to the compiler, but regions with more branches are more susceptible to early exits due to phase changing [31], leading to region fragmentation [21], code duplication [32], and infrequently executed region tails [33]. Another problem with large regions comes from exception handling: given the difficulty of mapping exceptions during native execution, the DTE may need to roll the execution back and reinterpret the entire region every time an exception occurs, so regions with frequent exceptions may become a performance issue [34]. Larger regions are more likely to include more branches and exceptions.

Given the main performance factors of DTEs and the characteristics of the compilation units chosen by an RFT to generate high-performance code, we selected seven metrics that are important when trying to better understand and analyze the performance behavior of a DTE. These metrics are described as follows:

• Total Number of Regions: indicates how many regions the RFT formed. This metric provides insights into the compilation overhead: the more frequently the compiler is called, the higher its overhead will be.

• Region Coverage: reflects the percentage of instructions that are emulated by translated code instead of interpretation (NativeFreq/TotalFreq). This metric indicates how well the hot-code detection and the region formation policy are guessing. The more instructions executed outside regions, the higher the chance that hot code was not included in a region. It is important to compare this metric with the number of regions, because forming fewer regions at the cost of lower coverage is not desirable.

• Number of Transitions: the number of entrances into regions that came directly from the exit of other regions (NumTransitions). A high number of transitions may put higher pressure on the processor code cache and is associated with the fragmentation of code cycles (nested loops, for instance). Furthermore, transitions between regions have an emulation cost. Thus, a low number of transitions implies good dynamic region quality.

• Dynamic Region Size: the total number of dynamic instructions emulated by a region divided by the number of times that region was entered. It is important to notice that the average dynamic size of regions with a low completion ratio can be smaller than their static size. On the other hand, regions with loops can have a dynamic size much larger than their static size. This metric indicates the locality of execution: the larger the dynamic size, the lower the number of transitions between regions.

• Static Region Size: the average number of instructions per region. It is therefore also correlated with the compilation overhead, as the compilation time is usually related to the number of instructions being compiled (GenCost × HotStaticSize).

• Completion Ratio: the percentage of times a region is executed entirely, meaning that all instructions in that region were executed from its entrance to its exit. This metric makes more sense when dealing with superblocks, like the ones formed by NET or MRET2, which have a well-defined main exit. Regions such as the ones formed by NETPlus do not have a clear distinction between side exits and main exits. A good completion ratio on traces means fewer early exits, which can have a significant impact on fragmentation.

• 90% Cover Set: indicates the minimum number of regions needed to cover 90% of the executed code frequency. The lower the cover set, the fewer regions are needed to cover the hot part of the code, and these are the regions that should receive further optimizations.

Collecting and understanding these metrics is important to understand the advantages and drawbacks of each RFT and DTE design choice. In the following section, we show that, to collect them, it is not necessary to fully implement a DTE: we only need to simulate the state transitions of a DTE during the emulation of a binary. To prove so, we implemented such a DTE life-cycle simulator, named RAIn, and simulated the execution of multiple applications using different RFTs.
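As an illustration of the last metric, the 90% Cover Set can be computed from per-region execution frequencies by greedily accumulating the hottest regions first. This is a sketch under our own naming, not RAIn's implementation:

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical sketch: minimum number of regions whose summed execution
// frequency reaches 90% of the total, taking the hottest regions first.
std::size_t coverSet90(std::vector<double> freqs) {
  double total = 0.0;
  for (double f : freqs) total += f;
  std::sort(freqs.begin(), freqs.end(), std::greater<double>());
  double acc = 0.0;
  std::size_t n = 0;
  for (double f : freqs) {
    acc += f;
    ++n;
    if (acc >= 0.9 * total) break;  // 90% of the execution covered
  }
  return n;
}
```

A program dominated by one hot loop yields a cover set of one; a flat profile needs almost all of its regions, which is the behavior the Desktop Apps set is meant to exercise.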
The implementation and evaluation of region formation techniques in a real-world DBT or HLLVM is not an easy task. It involves debugging dynamically generated machine code, among other complex tasks, which is overall a very time-demanding job. For that reason, it is rare to see DT systems that implement more than one RFT on a single DBT, which is why it is very difficult to make a fair comparison among different region formation techniques.

Our approach to this problem was to develop an open-source tool, called the
Region Appraise Infrastructure (RAIn), to simulate the execution of a dynamic translator, allowing an easy and flexible prototyping process. RAIn relies on the Trace Execution Automata (TEA) [35] technique to mimic region formation and execution and to collect accurate region profile information. Initially, the TEA technique was used to record execution traces along with profile information for future executions [35]. A TEA uses a Deterministic Finite Automaton (DFA) in which a special state, called No Trace being Executed (NTE), is present. This state keeps track of instructions that belong to no region and is used to account for instructions that are executed by an interpreter on a virtual machine that couples interpretation with dynamic binary translation, for example. This state prevents the system from creating a new state for every single instruction executed, which could bloat the automaton. Figure 2b represents a RAIn DFA that has been created when executing the trace generated by the program in Figure 2a. The regions R1 and R2 were formed according to the NET RFT. Notice that each instruction represents a state in the DFA and states are grouped into regions, representing the regions formed by the RFTs. Edges between states crossing the R1 and R2 boundaries represent transitions between the two regions.

(RAIn's source code: https://github.com/vandersonmr/Rain3)

(a) NET superblocks (b) RAIn DFA

Figure 2: Example of RAIn state blocks.

So, whenever an instruction is consumed, RAIn checks the automaton for a valid transition leaving the current state. If there is an outgoing edge labeled with the address of the consumed instruction, then RAIn performs the transition, updating the current state along with the edge and state execution statistics. If there is no outgoing edge that represents the execution of the consumed instruction, then this path has not been recorded before. In this case, a new edge is created and added to the automaton, representing a new valid transition. This may happen due to a side-exit execution, like the transition from instruction jeq T2.inc, in region R1, to instruction inc eax, in R2, for example.

The RFT technique monitors the automaton transitions and, according to its policy, may start the formation of a new region. During this phase, instead of transitioning on existing states, RAIn records the executed instructions and associated transitions until it reaches the RFT stop criteria.
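The automaton walk described above (follow an existing edge, or create a new edge for a path never recorded before) can be sketched as follows; the class and method names are our own hypothetical simplification of TEA, not RAIn's actual data structures:

```cpp
#include <cstdint>
#include <map>
#include <utility>

// Hypothetical sketch of the TEA-style automaton walk: states are keyed by
// instruction address; consuming an instruction either follows an existing
// edge or adds a new one (a path not recorded before, e.g., a side exit).
class TeaSketch {
  // (current state, consumed address) -> hit count on that edge
  std::map<std::pair<std::uint64_t, std::uint64_t>, long> edges_;
  std::uint64_t current_ = 0;  // state 0 plays the role of NTE here
public:
  // Returns true when the edge already existed (a known transition).
  bool consume(std::uint64_t addr) {
    auto key = std::make_pair(current_, addr);
    bool known = edges_.count(key) != 0;
    ++edges_[key];   // create the edge on first use, then bump its statistics
    current_ = addr;
    return known;
  }
  long edgeCount(std::uint64_t from, std::uint64_t to) const {
    auto it = edges_.find(std::make_pair(from, to));
    return it == edges_.end() ? 0 : it->second;
  }
};
```

The per-edge counters are what allow a simulator to report coverage, transition counts, and completion ratios without executing any translated code.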
After reaching the stop criteria, RAIn updates the automaton, creating new states and transitions that represent the instructions and correct execution flows inside the new region.

The operation of RAIn itself can be seen as a state machine. Starting in the EXECUTE state, the system consumes instructions, performing transitions on the current DFA and recording statistics. Once the RFT triggers the region formation, the RECORD state is activated, and RAIn starts recording a new region based on the flow and the instructions being consumed. After the RFT identifies the stop condition, RAIn enters the APPEND state, in which it expands the DFA with the newly formed region. Once the DFA is expanded, the system returns to the EXECUTE state, continuing with the automaton execution.

RAIn is implemented in two modules: the RegionManager and the Simulator. The RegionManager is the module responsible for managing the policies of RFTs; it controls the start and stop criteria for region recording. To add a new RFT to RAIn, all that is necessary is to provide an implementation of a RegionManager and a respective call in the main function to register it on the Simulator module. Every RegionManager implements the method "handleNewInstruction", which is called for every instruction of the trace and should handle region creation. Code 1 shows the whole implementation of the NET RFT. With less than 20 lines of code, we implement an RFT and are able to analyze it, using RAIn's metrics, with instruction traces from different ISAs and operating systems.

Code 1: NET implementation using RAIn
Maybe<Region> NET::handleNewInstruction(trace_item_t &LastInst,
                                        trace_item_t &CurrentInst,
                                        State LastTransition) {
  if (Recording) {
    if (wasBackwardBranch(LastInst, CurrentInst) ||
        LastTransition == InterToNative) {
      Recording = false;
      return Maybe<Region>(RecordingRegion);
    }
    RecordingRegion->addAddress(CurrentInst.addr);
  } else if ((LastTransition == StayedInter &&
              wasBackwardBranch(LastInst, CurrentInst)) ||
             LastTransition == NativeToInter) {
    HotnessCounter[CurrentInst.addr] += 1;
    if (isHot(HotnessCounter[CurrentInst.addr])) {
      Recording = true;
      RecordingRegion->addAddress(CurrentInst.addr);
      HotnessCounter[CurrentInst.addr] = 0;
    }
  }
  return Maybe<Region>::Nothing();
}

RAIn processes a sequence of instructions that represents the execution of a program, one instruction at a time, similarly to an interpreter. Hence, its performance is similar to that of an interpreter when evaluating a single RFT. In case the user aims at evaluating several RFTs, she may parallelize the simulation by loading the sequence of instructions once and feeding several RAIn threads, each one simulating a different RFT or hyper-parameter. We employed this approach to collect the results from our experiments, and it took nearly half an hour to collect all statistics for all tested RFTs from each 10-billion-instruction benchmark trace on an Intel Xeon E5-2630 (2.60 GHz).
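The EXECUTE/RECORD/APPEND life cycle described above can be reduced to a small state machine. This is our own minimal sketch, with the RFT's start and stop criteria abstracted into two flags; it is not the Simulator module's actual code:

```cpp
// Hypothetical sketch of RAIn's three operating states: the simulator
// EXECUTEs transitions on the DFA, RECORDs a new region when the RFT
// triggers, and APPENDs the finished region to the automaton.
enum class SimState { Execute, Record, Append };

struct SimSketch {
  SimState state = SimState::Execute;
  int regionsAppended = 0;

  // 'startRegion'/'stopRegion' stand in for the RFT's start/stop criteria.
  void step(bool startRegion, bool stopRegion) {
    switch (state) {
    case SimState::Execute:
      if (startRegion) state = SimState::Record;   // RFT triggered formation
      break;
    case SimState::Record:
      if (stopRegion) state = SimState::Append;    // stop criterion reached
      break;
    case SimState::Append:                         // expand the DFA, resume
      ++regionsAppended;
      state = SimState::Execute;
      break;
    }
  }
};
```

Because the simulator only tracks these state changes and their statistics, swapping the RFT policy never touches the rest of the machinery, which is the point of separating the RegionManager from the Simulator.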
Experimental Setup
To evaluate RAIn, the study presented in this paper was conducted using applications from two benchmark suites (SPEC CPU 2006 and SYSmark 2012) and a Linux-compatible, open-source image editor, GIMP. A total of 14 benchmarks from SPEC CPU [17] were used in this study, ten from SPEC-FP and four from SPEC-INT, together with four benchmarks from SYSmark [18].

The usual benchmark suite for RFT-related research in the literature is SPEC CPU. However, SPEC CPU and SYSmark have noticeably different profiles, and one of our goals is to understand how these DTE configurations perform across all these types of applications. SYSmark is described as an application-based benchmark that reflects usage patterns of business users in the areas of office productivity, data/financial analysis, system management, media creation, 3D modeling, and web development. In this work, we have evaluated the effect of RFT techniques on office applications, combining four applications from the SYSmark Office scenario (FineReader Pro 10.0, Internet Explorer 8, PowerPoint 2010, and Word 2010) and GIMP (2.8.20). This set of applications forms what we will call Desktop Apps and aims to represent a group of applications with a large 90% Cover Set, as opposed to the SPEC CPU benchmarks.

Several applications from both of these benchmark suites generate sequences with trillions of executed instructions. The chosen method to handle such an amount of data was to use RAIn to simulate 10-billion-instruction sequences of each program. We executed all these benchmarks on the Bochs emulator [36] and captured the executed x86 instructions to form the sequence. To avoid initialization code, the first 10 billion instructions were discarded, and then the next 10 billion were recorded. This number of instructions proved to be enough to expose the differences in the behavior of applications and RFTs. These can be seen in the huge variation of code execution locality demonstrated by the 90% Cover Set and the number of basic blocks presented in Section 5.2, and in the RFT behavior differences shown in Section 5.
There are two important parameters in the tested RFTs: the hotness threshold and the NETPlus deepness. To select good values, we ran two benchmarks and tested several parameter values.
Threshold Value
Figure 3 shows how the number of compiled regions, the 90% Cover Set, the percentage of cold regions, and the hot-emulation coverage are affected by the hotness threshold. As we can notice from these graphs, a slight variation of the threshold value can significantly decrease the number of basic blocks selected for translation, reducing the compilation overhead. The same effect occurred when varying the threshold of all the RFTs, as we can observe in Figure 3(A). We can also observe that, by choosing a threshold near 1024, we got a low number of selected regions and a low cold-region proportion without losing too much hot-emulation coverage (only when the threshold is near 2000 does the hot-emulation coverage become less than 90% for FineReader). Thus, we set 1024 as a fixed threshold for all the subsequent experiments.

Figure 3: Impact of the region hotness threshold on the A) 90% Cover Set, B) number of compiled regions, C) native execution coverage, and D) percentage of cold regions for six RFTs. The data was generated using 10-billion-instruction sequences from the FineReader (SYSmark) and GCC (SPEC CPU) benchmarks.

As depicted in Figure 3, all the tested RFTs are strongly influenced by the threshold, and increasing its value not only reduced the proportion of cold regions, showing the strong correlation between past and future execution frequency, but also increased the 90% Cover Set value. Therefore, the threshold value has a large influence over the four metrics, and its choice should not be neglected or ignored in the design and construction of a DTE.
NETPlus Expansion Depth
As explained in Section 2, the NETPlus RFT has an expansion depth limit that controls how far the search for loops in the original NET region can go. Figure 4 shows how the number of compiled regions, the average dynamic region size, and the average static region size are affected by the NETPlus expansion depth limit. The graph shows that the metrics stabilize when the depth limit approaches ten; thus, choosing a value higher than ten would probably bring no benefit. Hence, we fixed the NETPlus expansion depth limit at ten for all the following experiments.

One important observation is that the results in Figure 4 are very close to the ones presented by the authors of NETPlus [22], demonstrating RAIn's capability to explore RFT properties and to lead its users to the same conclusions as when using a real DTE. Additionally, Figure 4 shows that the increase in the average static size of the regions caused by the NETPlus expansion can be much more costly for some programs than for others, as in the case of bwaves and deal, chosen for being at different parts of the spectrum of the 90% Cover Set metric. The large increase in the static size for bwaves did not lead to any significant increase in the dynamic region size, showing that, for some programs, NETPlus can add costs that may never be paid off, a fact that was not observed by its original authors.

Hence, all the following results were generated with a hotness threshold set to 1024, a NETPlus expansion depth limit set to ten, and with all benchmark bars ordered by the 90% Cover Set values presented in the next subsection (Section 5.2).

Figure 4: Impact of the NETPlus expansion depth limit on the number of compiled regions, average dynamic region size, and average static size variation for two SPEC CPU applications (chosen from SPEC to ease the comparison with the results from the original NETPlus paper). The results are normalized by the first value (depth = 4).
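A depth-limited expansion in the spirit of NETPlus can be sketched as a bounded search over the static CFG: starting from the region's blocks, follow successors for at most `depth_limit` steps and keep any path that loops back to the region head. The CFG representation (a dict of successor lists) and the function names are our own simplifications, not the authors' implementation:

```python
def expand_region(region, cfg, head, depth_limit=10):
    """Depth-limited search in the spirit of NETPlus: from each block in
    the region, follow CFG successors up to `depth_limit` blocks deep and
    add any path that loops back to the region head."""
    added = set()

    def dfs(block, path, depth):
        if depth > depth_limit:
            return
        for succ in cfg.get(block, []):
            if succ == head:                      # path closes a loop: keep it
                added.update(path)
            elif succ not in region and succ not in path:
                dfs(succ, path + [succ], depth + 1)

    for block in region:
        dfs(block, [], 0)
    return region | added

# Toy CFG: region {H, A}; A exits to X, and X branches back to H (side loop).
cfg = {'H': ['A'], 'A': ['X'], 'X': ['H', 'Y'], 'Y': ['Z']}
expanded = expand_region({'H', 'A'}, cfg, 'H', depth_limit=10)
# expanded == {'H', 'A', 'X'}
```

Block `Y` is visited but discarded because no path through it returns to the head, which is how the expansion grows the static region size only along loop paths.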
To evaluate the impact that different RFTs have on the metrics discussed in Section 2.1, we executed RAIn with sequences of x86 instructions extracted from the selected benchmarks. For each application, we skip 10 billion instructions and then record the next 10 billion ones. We also added a sequence of instructions from a Linux image editor, GIMP, to compose our set of desktop applications. In our tests, we considered that only basic blocks executed more than 1024 times are hot enough to be worth compiling. As can be seen in Figure 5a, some applications have very few basic blocks that reach this frequency, while others have many of them. As the total execution frequency for each presented program is constant (10 billion x86 instructions), having more basic blocks with high execution frequency means that their average execution frequency is smaller. Notice that the desktop applications had the fewest hot basic blocks, which agrees with the conclusion of the work by Cesar et al. [11], where they argue for the importance of having office/GUI benchmarks when evaluating a DTE and that the low average execution frequency of these applications is a barrier to a DTE's performance. This is one of the main reasons we included desktop applications from SYSmark and GIMP in our experimental setup for evaluating RAIn.

Another straightforward way to verify this is by using the 90% Cover Set metric, first introduced by the authors of Dynamo [37]. The 90% Cover Set counts the minimum number of regions (in this example, basic blocks) necessary to achieve 90% of the execution frequency. The smaller the 90% Cover Set, the smaller the amount of code to be compiled, and the higher the average execution frequency of these basic blocks. Duesterwald et al. [37] demonstrated that there is a strong inverse relationship between the 90% Cover Set size and a DTE's
performance. Therefore, it would be challenging to obtain the same performance on benchmarks with very different 90% Cover Set sizes. Some examples are sjeng and IE, as we can see in Figure 5b.

Figure 5: (a) Number of basic blocks that execute 1024 or more times and (b) minimum number of basic blocks required to cover 90% of the 10 billion instructions simulated per application.

The authors of MRET2 [20] argue that its main advantage is the increase in the completion ratio of the selected traces over NET. They measured the completion ratio of MRET2 and NET over the full execution of applications from SPEC CPU 98 and showed that, on average, MRET2 improves the completion ratio by 20%. Figure 6a shows a re-plot of their data, while Figure 6b shows the data collected with RAIn for SPEC CPU 2006 applications. Despite the difference in methodology and benchmarks, both RAIn and the original paper present very similar results (distribution and average), leading to the same conclusions and showing again that the simulation performed by RAIn is capable of producing results similar to the ones obtained with real DTEs.

Figure 6: (a) Data from the MRET2 patent [20], full execution of SPEC CPU 98 applications; (b) data generated by RAIn, 10 billion instructions from SPEC CPU applications.
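The completion ratio itself is straightforward to compute from a simulation log: it is the fraction of entries into a trace that run through to its last block instead of leaving through a side exit. A minimal sketch, with a data representation of our own choosing (one boolean per trace entry):

```python
def completion_ratio(entries):
    """entries: iterable of booleans, one per trace entry --
    True if execution reached the trace's last block (completed),
    False if it left early through a side exit."""
    entries = list(entries)
    if not entries:
        return 0.0
    return sum(entries) / len(entries)

# 8 of 10 executions ran the trace to its end:
assert completion_ratio([True] * 8 + [False] * 2) == 0.8
```

A higher completion ratio means the optimizer's assumptions about the hot path hold more often, which is the property MRET2 improves over NET.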
We extended the experiment using RAIn to compare NET, MRET2, and NET-R. The results in Figure 7 show that NET-R is only better than NET on benchmarks with a low 90% Cover Set (highly concentrated execution frequency), i.e., the ones more to the left, while MRET2 is better than NET on almost every benchmark.
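The 90% Cover Set used in these comparisons can be computed directly from per-block (or per-region) execution counts: sort by frequency and take the smallest prefix that covers 90% of the total execution. A sketch (the function name is ours):

```python
def cover_set_size(exec_counts, fraction=0.90):
    """Minimum number of blocks (hottest first) whose combined execution
    count reaches `fraction` of the total execution frequency."""
    total = sum(exec_counts.values())
    covered = 0
    for size, count in enumerate(sorted(exec_counts.values(), reverse=True), 1):
        covered += count
        if covered >= fraction * total:
            return size
    return len(exec_counts)

# A single dominant loop block covers 90% of execution by itself:
counts = {'loop': 900, 'b1': 40, 'b2': 30, 'b3': 20, 'b4': 10}
assert cover_set_size(counts) == 1
```

A small result means execution is concentrated in few blocks, the profile on which NET-R performs best.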
Since the compilation cost is correlated with the number of times the dynamic compiler is invoked and also with the number of instructions present in the compiled regions, we used RAIn to evaluate the number of compiled regions (Figure 8a) and the average static region sizes (Figure 8b). NETPlus-e-r produced the smallest number of regions to be compiled. However, despite being a much simpler RFT than NETPlus-e-r, NET-r also had a very significant impact on this metric. MRET2 and NET produced many more regions to be compiled, a result that is explored and explained by the authors of the LEI technique [21].

Figure 7: MRET2 and NET-r normalized completion ratios.

We can also observe that there is a trade-off between the number of regions compiled and the average static region size: the majority of the RFTs that decrease the number of compiled regions also increase the average static region size. NETPlus has the best trade-off for these metrics; it decreased the number of regions and only slightly increased the average static size. LEI, on the other hand, has the worst trade-off. Furthermore, notice that NET-r, NETPlus-e-r, and LEI create larger regions when emulating applications with larger 90% Cover Sets, such as the desktop applications. This indicates that the relaxation of NET (NET-r) and the expansion of NETPlus (NETPlus-e-r) have a different impact on benchmarks with different 90% Cover Sets.
Figure 8: (a) Total number of regions compiled normalized by the NET values; and (b) the average static region sizes normalized by the NET results, for all RFTs and benchmarks.
We also investigated the dynamic characteristics of the regions formed by all RFTs with RAIn. Figure 9a shows the 90% Cover Set for all RFTs normalized by the results of NET. In this metric, only MRET2 had a worse performance than NET, indicating that it requires more regions to be compiled to cover 90% of the execution. NETPlus-e-r achieved the best results, followed by NETPlus, LEI, and NET-r, with NETPlus being more efficient on benchmarks with a higher 90% Cover Set. A similar result was obtained for the average dynamic size, shown in Figure 9b, with NETPlus-e-r again achieving the best results, followed by NETPlus and LEI, while NET-r had only a slight improvement and MRET2 decreased when compared to NET. Last, we can notice that the results were much less significant on benchmarks with a high 90% Cover Set. Overall, these cases support the view that it is more difficult to select regions better than NET when the 90% Cover Set is high.
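The average dynamic region size reported in Figure 9b can be understood as the number of guest instructions executed inside regions divided by the number of region entries. A sketch under that assumption (the data representation is ours):

```python
def avg_dynamic_region_size(entry_lengths):
    """entry_lengths: guest instructions executed on each entry into a
    translated region before leaving it (through an exit or exception)."""
    if not entry_lengths:
        return 0.0
    return sum(entry_lengths) / len(entry_lengths)

# Three region entries executing 100, 50 and 150 instructions each:
assert avg_dynamic_region_size([100, 50, 150]) == 100.0
```

A larger average dynamic size means fewer costly transitions between translated code and the runtime, which is why this metric favors NETPlus-e-r.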
Figure 9: (a) The 90% Cover Sets normalized by the NET results; and (b) the average dynamic region sizes normalized by the NET results, for all RFTs and benchmarks.

These results are similar to the ones found by previous work, which supports our claim that using simulation to evaluate RFTs for a dynamic translator is a sound approach. Also, these results show that the relaxation and extension proposed by Hong et al. [27] is a simple and powerful technique that should be considered when designing and implementing a dynamic translator.
Conclusions
In this work, we presented a novel DTE simulator called RAIn, which is capable of reproducing several results from the literature through simulation. The simulation enables the testing of multiple DTE designs, producing several useful statistics without complex implementations, thus opening new opportunities to explore DTE design decisions in a faster and simpler manner.

As far as we know, there is no other simulation framework for testing DTE designs like the one proposed in this paper, and hence no other work similar to this one. Moreover, there is no other comparative study involving this many RFTs. Typically, an article that presents a new RFT compares it against only one other technique [20, 21, 22], usually NET. Finally, but not least important, the results that we obtained with RAIn in this work corroborate the findings reported by other authors in previous works.

For example, we found with RAIn that MRET2 has a better completion ratio than NET, the same result presented in MRET2's original paper [20]. We also showed that NET and MRET2 compile far more regions than the other techniques and that these regions are smaller, a phenomenon that the LEI authors called region fragmentation [21]. Furthermore, our results for the NETPlus depth limit variation showed the same pattern as the one in the original NETPlus paper [22]. Moreover, the point of convergence that we found with RAIn for the depth limit is the same as the one found by its authors.
Finally, we concluded that NETPlus-e-r is the best RFT for the dynamic metrics tested, which is exactly why NETPlus-e-r was selected to be used in HQEMU [27]. On top of that, we presented a comprehensive study of several RFTs; we identified the strongest and weakest points of each tested RFT, showing the importance of using RAIn as a tool in the design of any future DTE. For instance, if one needs to reduce the number of transitions and increase the average time spent in a region, the best RFT would be the expanded and relaxed version of NETPlus. Alternatively, if one needs to create smaller regions with a higher completion ratio, MRET2 is by far the best tested RFT.
References

[1] J. Aycock, "A brief history of just-in-time," ACM Computing Surveys (CSUR), vol. 35, no. 2, pp. 97–113, 2003.

[2] P. Brown, "Throw-away compiling," Software: Practice and Experience, vol. 6, no. 3, pp. 423–434, 1976.

[3] K. Adams, J. Evans, B. Maher, G. Ottoni, A. Paroski, B. Simmers, E. Smith, and O. Yamauchi, "The HipHop virtual machine," ACM SIGPLAN Notices, vol. 49, no. 10, pp. 777–790, 2014.

[4] J. C. Dehnert, B. K. Grant, J. P. Banning, R. Johnson, T. Kistler, A. Klaiber, and J. Mattson, "The Transmeta Code Morphing Software: using speculation, recovery, and adaptive retranslation to address real-life challenges," in Proceedings of the International Symposium on Code Generation and Optimization: Feedback-Directed and Runtime Optimization. IEEE, 2003.

[7] V. M. do Rosario, F. Pisani, A. R. Gomes, and E. Borin, "Fog-assisted translation: towards efficient software emulation on heterogeneous IoT devices," IEEE, 2018, pp. 1268–1277.

[8] T. Suganuma, T. Ogasawara, M. Takeuchi, T. Yasue, M. Kawahito, K. Ishizaki, H. Komatsu, and T. Nakatani, "Overview of the IBM Java just-in-time compiler," IBM Systems Journal, vol. 39, no. 1, pp. 175–193, 2000.

[9] IonMonkey. [Online]. Available: https://wiki.mozilla.org/IonMonkey

[10] D. E. Knuth, "An empirical study of FORTRAN programs," Software: Practice and Experience, vol. 1, no. 2, pp. 105–133, 1971.

[11] D. Cesar, R. Auler, R. Dalibera, S. Rigo, E. Borin, and G. Araujo, "Modeling virtual machines misprediction overhead," in Proceedings of the 2013 IEEE International Symposium on Workload Characterization (IISWC'13), September 2013, pp. 153–162.

[12] T. Suganuma, T. Yasue, and T. Nakatani, "A region-based compilation technique for dynamic compilers," ACM Transactions on Programming Languages and Systems (TOPLAS), vol. 28, no. 1, pp. 134–174, 2006.

[13] D. Burger, T. M. Austin, and S. Bennett, "Evaluating future microprocessors: The SimpleScalar tool set," University of Wisconsin-Madison Department of Computer Sciences, Tech. Rep., 1996.

[14] R. E. Bryant and M. N. Velev, "Verification of pipelined microprocessors by comparing memory execution sequences in symbolic simulation," in Advances in Computing Science — ASIAN'97, R. K. Shyamasundar and K. Ueda, Eds. Berlin, Heidelberg: Springer, 1997, pp. 18–31.

[15] D. Ponomarev, G. Kucuk, and K. Ghose, "AccuPower: an accurate power estimation tool for superscalar microprocessors," in Proceedings 2002 Design, Automation and Test in Europe Conference and Exhibition, 2002, pp. 124–129.

[16] F. A. Endo, D. Couroussé, and H. Charles, "Micro-architectural simulation of in-order and out-of-order ARM microprocessors with gem5," in ACM SIGOPS Operating Systems Review, vol. 34, no. 5, pp. 202–211, 2000.

[20] C. Wang, B. Zheng, H. Kim, M. B. Jr., and Y. Wu, "Two-pass MRET trace selection for dynamic optimization," Patent number 20070079293, 2007.

[21] D. Hiniker, K. Hazelwood, and M. D. Smith, "Improving region selection in dynamic optimization systems," in MICRO 38: Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture, 2005, pp. 141–154.

[22] D. Davis and K. Hazelwood, "Improving region selection through loop completion," in ASPLOS Workshop on Runtime Environments/Systems, Layering, and Virtualized Environments, 2011.

[23] D.-Y. Hong, J.-J. Wu, P.-C. Yew, W.-C. Hsu, C.-C. Hsu, P. Liu, C.-M. Wang, and Y.-C. Chung, "Efficient and retargetable dynamic binary translation on multicores," IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 3, pp. 622–632, 2014.

[24] C.-C. Hsu, D.-Y. Hong, W.-C. Hsu, P. Liu, and J.-J. Wu, "A dynamic binary translation system in a client/server environment," Journal of Systems Architecture, vol. 61, no. 7, pp. 307–319, 2015.

[25] M. D. Bond and K. S. McKinley, "Practical path profiling for dynamic optimizers," in Proceedings of the International Symposium on Code Generation and Optimization. IEEE Computer Society, 2005, pp. 205–216.

[26] J. D. Hiser, D. Williams, W. Hu, J. W. Davidson, J. Mars, and B. R. Childers, "Evaluating indirect branch handling mechanisms in software dynamic translation systems," in Proceedings of the International Symposium on Code Generation and Optimization. IEEE Computer Society, 2007, pp. 61–73.

[27] D.-Y. Hong, C.-C. Hsu, P.-C. Yew, J.-J. Wu, W.-C. Hsu, P. Liu, C.-M. Wang, and Y.-C. Chung, "HQEMU: a multi-threaded and retargetable dynamic binary translator on multicores," in Proceedings of the Tenth International Symposium on Code Generation and Optimization. ACM, 2012, pp. 104–113.

[28] H. Guan, Y. Yang, K. Chen, Y. Ge, L. Liu, and Y. Chen, "DistriBit: a distributed dynamic binary translator system for thin client computing," in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing. ACM, 2010, pp. 684–691.

[29] V. Bala, E. Duesterwald, and S. Banerjia, "Dynamo: a transparent dynamic optimization system," in Proceedings of the ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation, 2000, pp. 1–12.

[30] C. Wang, S. Hu, H.-S. Kim, S. R. Nair, M. B. Jr., Z. Ying, and Y. Wu, "StarDBT: An efficient multi-platform dynamic binary translation system," in Asia-Pacific Computer Systems Architecture Conference, 2007, pp. 4–15.

[31] C.-C. Hsu, P. Liu, J.-J. Wu, P.-C. Yew, D.-Y. Hong, W.-C. Hsu, and C.-M. Wang, "Improving dynamic binary optimization through early-exit guided code region formation," in ACM SIGPLAN Notices, vol. 48, no. 7. ACM, 2013, pp. 23–32.

[32] K. Scott, N. Kumar, B. R. Childers, J. W. Davidson, and M. L. Soffa, "Overhead reduction techniques for software dynamic translation," in Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International. IEEE, 2004, p. 200.

[33] E. Borin and Y. Wu, "Characterization of DBT overhead," in Workload Characterization, 2009. IISWC 2009. IEEE International Symposium on. IEEE, 2009, pp. 178–187.

[34] C. Häubl and H. Mössenböck, "Trace-based compilation for the Java HotSpot virtual machine," in Proceedings of the 9th International Conference on Principles and Practice of Programming in Java. ACM, 2011, pp. 129–138.

[35] J. P. Porto, G. Araujo, E. Borin, and Y. Wu, "Trace execution automata in dynamic binary translation," 2010.

[36] K. P. Lawton, "Bochs: A portable PC emulator for Unix/X,"