Accelerate Cycle-Level Full-System Simulation of Multi-Core RISC-V Systems with Binary Translation
Xuan Guo
[email protected]
University of Cambridge, Cambridge, UK
Robert Mullins
[email protected]
University of Cambridge, Cambridge, UK
ABSTRACT
It has always been challenging to balance the accuracy and performance of instruction set simulators (ISSs). Register-transfer level (RTL) simulators or systems such as gem5 [4] are used to execute programs in a cycle-accurate manner but are often prohibitively slow. In contrast, functional simulators such as QEMU [2] can run large benchmarks to completion in a reasonable time yet capture few performance metrics and fail to model complex interactions between multiple cores. This paper presents a novel multi-purpose simulator that exploits binary translation to offer fast cycle-level full-system simulations. Its functional simulation mode outperforms QEMU and, if desired, it is possible to switch between functional and timing modes at run-time. Cycle-level simulations of RISC-V multi-core processors are possible at more than 20 MIPS, a useful middle ground in terms of accuracy and performance, with simulation speeds nearly 100 times those of more detailed cycle-accurate models.
1 INTRODUCTION
RISC-V is a free, open, and extensible ISA. With the ongoing ecosystem development of RISC-V and an increasing number of companies and institutions switching to RISC-V for both production and research, RISC-V has become the test-bed instruction set of computer architecture research. A key tool when exploring new architectural trade-offs is the instruction set simulator (ISS). Fast cycle-level simulation allows new ideas to be validated quickly at an appropriate level of abstraction and without the complexities of hardware development. In particular, we focus on the challenge of simulating multi-core RISC-V systems.

The design of a processor can broadly be divided into the design of the core and of the memory subsystem. Characterising the performance of the core pipeline in isolation is often a simpler task than characterising the memory system. While smaller synthetic benchmarks are useful at the core level, larger, more complex and longer-running workloads are often needed to understand the memory system and the potential interactions between cores. For example, the small synthetic MCU benchmark CoreMark [7] executes on the order of 10^5 instructions per iteration, while SPEC2017 [8], a larger and more realistic benchmark running real-life applications, requires on the order of 10^12 instructions for a single run [12]. SPEC takes from hours to days to run even on real
machines, and it is hardly possible for simulators to run it to completion. It is therefore helpful to have a fast simulator.

Of course, fast simulation involves a trade-off between the fidelity of the model and the speed at which simulations can be completed. Unfortunately, we are currently forced to choose between slow cycle-accurate simulators and fast functional-only simulators such as QEMU. In particular, there is a lack of fast full-system simulators that can accurately model cache-coherent multi-core processors.

In this paper, we present the Rust RISC-V Virtual Machine (R2VM). R2VM is written in the increasingly popular high-level systems programming language Rust [17]. R2VM is released under permissive MIT/Apache-2.0 licenses in the hope of encouraging its adoption and extension by the broader community; it is available at https://github.com/nbdd0121/r2vm. To our knowledge, this is the first binary translated simulator that supports cycle-level multi-core simulation. It can accurately model cache coherency protocols and shared caches. Cycle-level simulations are possible at more than 20 MIPS, while the performance of functional-only simulations can outperform QEMU and exceed 400 MIPS.

2 BACKGROUND
ISSs can be classified as either execution-driven, emulation-driven or trace-driven [6]. We omit a detailed discussion of execution-driven simulators such as Cachegrind [13], which modify programs with binary instrumentation and execute them natively, because they require the host and the guest ISA to be identical and do not support full-system simulation. Emulation-driven simulators emulate the program's execution and gather performance metrics on-the-fly; in contrast, trace-driven simulators run the emulation beforehand and gather traces from the program, e.g. branches or memory accesses, and later replay the traces against a specific model. Traces allow ideas to be evaluated quickly without the need to simulate in detail, but cannot easily capture effects that may alter the instructions that are executed, e.g. inter-core interactions or speculative execution [6]. Moreover, the storage space required for traces grows linearly with the length of execution, making trace-driven simulators incapable of simulating large benchmarks. R2VM is an emulation-driven simulator, and the remainder of this paper will focus exclusively on emulation-driven simulators.

Simulators can also be categorised by their level of abstraction. One category is functional simulators, which simulate the effects of instructions without taking microarchitectural details into account. Because less information is needed, aggressive optimisations can be performed, and their performance is usually several magnitudes faster than that of timing simulators. QEMU falls into this category. It should be noted, though, that while QEMU itself is a purely functional simulator, it can be modified to collect metadata for off-line or on-line cache simulation [18].

The other category is timing simulators. Among timing simulators, RTL simulators can model processor microarchitectures very precisely, but the difficulty of implementing a feature in an RTL simulator is not much different from implementing it in hardware directly. RTL simulators are also poor in performance, usually running at a magnitude of kIPS [16].

At a higher level, there are cycle-level microarchitectural simulators.
These are able to omit RTL implementation details to improve performance while retaining a detailed microarchitectural model. A popular example is the gem5 simulator running in its in-order or O3 mode [4]. For faster performance, we can give up some further microarchitectural detail and predict the number of cycles taken for each non-memory instruction instead of computing them in real time; in the extreme case, we assume every non-memory operation takes only 1 cycle to execute, as gem5's "timing simple" CPU model does. This approach is no longer cycle-accurate, but this cycle-approximate model is often adequate for cache and memory simulations.
2.1 Binary Translation
Binary translation is a technique that accelerates instruction set architecture (ISA) simulation or program instrumentation [11]. An interpreter fetches, decodes and executes the instructions pointed to by the current program counter (PC) one by one, while binary translation will, either ahead of time (static binary translation) or at run time, i.e. when a block of code is first executed (dynamic binary translation, DBT), translate one or more basic blocks from the simulated ISA to the host's native code, cache the result, and reuse the translation the next time the same block is executed. QEMU uses binary translation for cross-ISA simulation or when there is no hardware virtualisation support [2]. Böhm et al. proposed a method to introduce binary translation to single-core timing simulation in 2010 [5].
2.2 Multi-Core Simulation
Extending single-core simulators to handle multiple cores is complicated by the performance implications of the ways in which cores may interact. As cores share caches and memory, simulations of individual cores cannot simply be run independently. For example, accurate modelling of cache coherence, atomic memory operations and inter-processor interrupts (IPIs) must be considered.

Böhm et al.'s modified ARCSim simulator [5] can model single-core processors with high accuracy and reasonable performance; however, Almer et al.'s extension of this work [1], which essentially runs multiple copies of the single-core simulator in parallel threads to provide multi-core support, is limited in its fidelity. The authors comment that the "detailed behaviour of the shared second-level cache, processor interconnect and external memory of the simulated multi-core platform" cannot be modelled accurately. QEMU is able to exploit multiple host cores to emulate a multi-core guest, but provides only a functional simulation mode and supports no timing or modelling of the memory system.

An accurate model of cache coherence and the memory hierarchy requires that multiple cores are simulated in lockstep (or in a way that guarantees equivalent results). Simulators that forego this are unable to properly simulate race conditions and shared resources. Existing cycle-level simulators such as gem5 achieve lockstep by iterating through all simulated cores each cycle, which causes a significant performance drop. Spike (or riscv-isa-sim), on the other hand, switches the active core to simulate less frequently; its default compilation option only switches cores every 1000 cycles, making it impossible to model race conditions where all cores are trying to acquire a lock simultaneously. No existing binary translated simulator can model multi-core interaction in lockstep, and therefore none can model cache coherency or a shared second-level cache properly.
3 DESIGN
3.1 Binary Translation Flow
The high-level control flow of R2VM, as shown in Figure 1, is similar to that of other binary translators. When an instruction at a particular PC is to be executed, the code cache is looked up, and the cached translated binary is executed directly if found; otherwise, the binary translator is invoked and an entire basic block is fetched, decoded, and translated. We have applied a variety of techniques to improve binary translator performance that are often found in other binary translators, such as block chaining [2].

As full-system simulation is supported, we have to deal with the case where a 4-byte uncompressed instruction spans two pages. We handle this by creating a stub that reads the 2 bytes that lie on the second page each time the stub is executed, and patches the generated code if the 2 bytes read differ from those of the initial translation.

Cota et al. [9] suggest sharing a code cache between multiple cores to promote code reuse and boost performance. In contrast, we provide each hardware thread with its own code cache. This allows different code to be generated for each core, e.g. in the case of heterogeneous cores. It also lessens the synchronisation requirements when modifying the code cache, simplifying the implementation.
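The dispatch loop at the heart of this flow can be sketched as follows. This is a minimal illustration, not R2VM's actual code: the CpuState, CodeCache and TranslatedBlock names and fields are hypothetical stand-ins.

use std::collections::HashMap;

/// Host-code entry point produced by the binary translator (illustrative).
type TranslatedBlock = fn(&mut CpuState);

/// Per-hart guest state; only the PC is shown here.
struct CpuState {
    pc: u64,
}

/// Per-hart code cache mapping a guest PC to translated host code.
struct CodeCache {
    blocks: HashMap<u64, TranslatedBlock>,
}

impl CodeCache {
    fn dispatch(&mut self, state: &mut CpuState, translate: impl Fn(u64) -> TranslatedBlock) {
        let pc = state.pc;
        // Look up the cached translation; on a miss, translate the basic
        // block starting at `pc` and cache the result.
        let block = *self.blocks.entry(pc).or_insert_with(|| translate(pc));
        // Run the translated host code, which updates `state.pc` to the
        // next block (or chains to it directly, bypassing this loop).
        block(state);
    }
}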
3.2 Pipeline Models
The main difference between our simulator's flow and that of existing ones such as QEMU is that we introduce "pipeline models", which comprise several hooks. The hooks can process relevant instructions and generate essential microarchitectural simulation code if necessary. The hooks can also indicate the number of cycles it would take for the instruction to complete. It should be noted that this covers only the execution pipeline; the memory system and caches are a separate component.

For simple models, such as gem5's "timing simple" model where each instruction takes 1 cycle to execute, the implementation is straightforward, as shown in Listing 1.

We have also implemented and validated an in-order pipeline model that accurately models a classic 5-stage pipeline with a static branch predictor. Our implementation captures pipeline hazards, such as data hazards caused by a load-use dependency and stalls due to a branch/jump into a misaligned 4-byte instruction. Unlike Böhm et al.'s simulator [5], which needs to call a "pipeline" function after each instruction, our implementation models pipeline behaviour during DBT code generation and reflects it in the number of cycles taken, and therefore requires no explicit code to be executed at run time.
Figure 1: Control flow overview of the simulator

pub struct SimpleModel;

impl PipelineModel for SimpleModel {
    fn after_instruction(&mut self, compiler: &mut DbtCompiler, _op: &Op, _compressed: bool) {
        compiler.insert_cycle_count(1);
    }

    fn after_taken_branch(&mut self, compiler: &mut DbtCompiler, _op: &Op, _compressed: bool) {
        compiler.insert_cycle_count(1);
    }
}

Listing 1: Timing simple model implementation

More complex processors may need either to make an estimation of pipeline states (and sacrifice some accuracy) or to generate custom assembly in the hooks to maintain those states during execution (and sacrifice some performance).
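As a concrete illustration of how a richer model can still keep all work at translation time, the following sketch extends the PipelineModel trait from Listing 1 with load-use hazard tracking. The rd/rs1/rs2/is_load accessors on Op are hypothetical helpers, not R2VM's actual API.

pub struct InOrderSketch {
    // Destination register of a load translated in the previous cycle.
    pending_load_rd: Option<u8>,
}

impl PipelineModel for InOrderSketch {
    fn after_instruction(&mut self, compiler: &mut DbtCompiler, op: &Op, _compressed: bool) {
        let mut cycles = 1;
        // In a classic 5-stage pipeline, an instruction consuming the
        // result of the immediately preceding load stalls for one cycle.
        // The stall is folded into the static cycle count here, at
        // translation time, so no extra code runs during simulation.
        if let Some(rd) = self.pending_load_rd {
            if op.rs1() == Some(rd) || op.rs2() == Some(rd) {
                cycles += 1;
            }
        }
        self.pending_load_rd = if op.is_load() { op.rd() } else { None };
        compiler.insert_cycle_count(cycles);
    }
}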
3.3 Lockstep Execution with Fibers
The techniques described in the previous sections work well for single-core systems. But, as described in the background section, running multiple such simulations in parallel or switching between them in a coarse-grained manner has a huge impact on the simulation accuracy of multi-threaded programs. The ideal scheduling granularity is therefore a single cycle, i.e. having all simulated cores run in lockstep. This is, however, difficult to achieve for binary translators.

We experimented with the idea of using thread barriers to synchronise multiple threads, each simulating a single core. It turned out we could only synchronise 1 million times per second, even after careful optimisation at the assembly level.
The approach we use instead takes inspiration from fibers, sometimes also referred to as coroutines or green threads. Fibers are cooperatively scheduled by the user-space application and voluntarily "yield" to other fibers, in contrast to traditional threads, which are preemptively scheduled by the operating system and are generally heavyweight constructs. Fibers are often used in I/O-heavy, highly concurrent workloads such as network programming, but here we borrow them for our simulator.

In our implementation, we create one fiber for each simulated hardware thread, plus a fiber for the event loop. Each time the pipeline model instructs the DBT to wait for a number of cycles, we generate that number of yields. Listing 2 shows an example of generated code under the timing simple model.

mov  rax, qword [rbp+0x78]  ; \
mov  qword [rbp+0x70], rax  ; |  add a4, zero, a5
call fiber_yield_raw        ; /
mov  eax, dword [rbp+0x78]  ; \
add  eax, -0x1              ; |
cdqe                        ; |  addiw a5, a5, -1
mov  qword [rbp+0x78], rax  ; |
call fiber_yield_raw        ; /
mov  eax, dword [rbp+0x70]  ; \
imul eax, dword [rbp+0x50]  ; |
cdqe                        ; |  mulw a0, a4, a0
mov  qword [rbp+0x50], rax  ; |
call fiber_yield_raw        ; /

Listing 2: Example of generated code with yield calls. RBP points to the array of RISC-V registers.
Unlike normal fiber implementations, we engineered the fiber's memory layout, shown in Figure 2, to suit the needs of a simulator. Each fiber is allocated a 2 MiB memory region aligned to a 2 MiB boundary, and the stack for code running under the fiber is contained within this memory range. The alignment requirement allows the fiber's start address to be recovered from the stack pointer
by simply masking out the 21 least significant bits. The base pointer points to the end of the fixed fiber structures rather than the beginning, so that positive offsets from the base pointer can be used freely by the DBT-ed code, while negative offsets are used for fiber management.

Figure 2: Memory layout of fibers

The ABI of the host platform is not respected for DBT-ed code; we instead specify all registers other than the base pointer and stack pointer to be volatile, i.e. caller-saved. By doing so, fiber_yield_raw does not need to bear the cost of saving any registers. To yield from non-DBT-ed code, we can instead push the ABI-specified callee-saved registers onto the stack and then switch.

This careful design makes fiber switching lightning fast; the fiber_yield_raw function is as simple as 4 instructions on AMD64, shown in Listing 3.

fiber_yield_raw:
    mov [rbp - 32], rsp  ; Save current stack pointer
    mov rbp, [rbp - 16]  ; Move to next fiber
    mov rsp, [rbp - 32]  ; Restore stack pointer
    ret

Listing 3: Implementation of the fiber yielding code
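The 2 MiB size-and-alignment trick makes recovering the current fiber's base address a single AND. A minimal sketch of that recovery, assuming the layout above:

/// Each fiber occupies a 2 MiB region aligned to a 2 MiB boundary
/// (2 MiB = 2^21 bytes), so the fiber base can be recovered from any
/// stack pointer within the region by masking off the low 21 bits.
const FIBER_SIZE: usize = 2 << 20;

#[inline]
fn fiber_base(stack_pointer: usize) -> usize {
    debug_assert!(FIBER_SIZE.is_power_of_two());
    stack_pointer & !(FIBER_SIZE - 1)
}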
Simply yielding a few cycles after every executed instruction would severely limit performance, and in many cases is unnecessary. In practice, we only need to synchronise at points where the execution pipeline can produce side-effects visible to other cores and/or the rest of the system, or where the rest of the system's behaviour could affect the running pipeline. We observe that there are three ways in which one pipeline interacts with another:
• A memory operation is performed.
• A control register operation is performed. This includes reads/writes of performance monitor registers, or of control registers related to the memory system.
• An interrupt happens.
For the first two types of interaction, we insert a synchronisation point before and after they are executed. For the third case (interrupts), because it is generally difficult to interrupt DBT-ed code mid-way, we choose to check for interrupts only at the end of basic blocks. We believe that this decision will not affect the accuracy of our simulation, due to the inherent entropy of I/O operations.

A yield that lies between two synchronisation points therefore has no visible side-effects and cannot be distinguished from one issued later. Our implementation postpones all yielding until the next synchronisation point. We tweaked the yield implementation shown in Listing 3 slightly to allow multi-cycle yields, and this demonstrates around a 10% performance gain compared to naive yielding, as sketched below.
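A minimal sketch of this yield batching follows. It assumes the DbtCompiler from Listing 1 and a hypothetical emit_multi_cycle_yield code-generation helper; neither the method name nor the structure is R2VM's actual API.

/// Accumulates cycles during translation and emits a single multi-cycle
/// yield at the next synchronisation point instead of one yield per cycle.
struct YieldBatcher {
    pending_cycles: u64,
}

impl YieldBatcher {
    /// Record cycles without generating any code yet.
    fn add_cycles(&mut self, n: u64) {
        self.pending_cycles += n;
    }

    /// Called before/after memory and control register operations and at
    /// the end of basic blocks, i.e. at synchronisation points.
    fn flush(&mut self, compiler: &mut DbtCompiler) {
        if self.pending_cycles > 0 {
            // One call advances this fiber by all pending cycles at once.
            compiler.emit_multi_cycle_yield(self.pending_cycles);
            self.pending_cycles = 0;
        }
    }
}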
3.4 Memory Models
The previous sections described how we simulate each core's processing pipeline and how we achieve simulation in lockstep. The techniques we described and implemented speed up pipeline simulation, but the speedup would be very limited if all memory accesses were still simulated in detail. Moreover, the instruction cache and translation lookaside buffers (TLBs) also need to be simulated for accuracy.
For memory operations, each running core has its own "L0 data cache". When a core needs to read from or write to a memory address, it first checks the L0 data cache. On a hit, the memory access is performed entirely within DBT-ed code, bypassing the memory model entirely.

As a result, the memory model does not intercept all memory accesses, so it is important to control what may reside in the L0 data cache. We maintain the property that if an access hits the L0 data cache, it would also be a cache hit had the access reached the memory model. We sped up TLB simulation with a similar approach in our previous work [10]: there, the property mandates the invariant that all entries in the L0 TLB are also in the L1 data TLB. The invariant kept in R2VM is that all L0 data cache entries are contained in both the L1 data TLB and the L1 data cache. Therefore, as shown in Figure 3, when entries are evicted from either the simulated TLB or the cache model, the corresponding entries need to be flushed from the L0 data cache to preserve this inclusiveness property.

We carefully engineered the memory layout of L0 data cache entries for maximum efficiency. The L0 data cache is direct-mapped, with each entry representing a cache line. Each entry has a memory layout like Figure 4.
Figure 3: Control flow for memory accesses

Figure 4: Memory layout of a tag entry in the L0 data cache

The entry does not store actual memory contents; rather, it stores a translation from a virtual tag to a physical address. In a sense, it is more like a TLB with cache-line granularity than a cache. We pack the virtual tag (vtag), together with a bit indicating whether the cache line is read-only, into a single machine word T, and store the XOR-ed value of the guest physical address and the corresponding guest virtual address in a second word A.

For each memory access, the L0 data cache is indexed using the virtual tag.
For a read access, we check whether T >> 1 equals vtag. For a write access, we check whether vtag << 1 equals T. If the check passes, the requested virtual address is XOR-ed with A to produce the address to access, directly within DBT-ed code. If the check fails, the cold path is executed and the memory model is invoked. The memory model simulates both the TLB and the data cache, and either triggers a page fault or inserts an entry into the L0 data cache.

The existence of the L0 data cache underpins the performance of R2VM's fast path, because it requires only 3 memory operations for each memory operation simulated. In the default configuration, because the memory model does not intercept all memory accesses, replacement policies such as least-recently used (LRU) cannot be used for the simulated TLB and cache. Generally, we believe this is an acceptable accuracy loss in exchange for vastly better simulation performance. If LRU-like policies are really needed, the L0 data cache can be bypassed and the memory model invoked for every memory access, sacrificing performance.

R2VM also simulates the instruction TLB and cache, similarly to the data cache. Each core has its own L0 instruction cache with a simpler entry layout, because read and write permissions need not be distinguished. To keep the overhead of simulating the instruction cache down, instead of accessing it each time an instruction is executed, we do so only when a basic block begins, or when the instruction being translated is in a different cache line from the previous instruction. For a cache line size of 64 bytes, this means that we only need to generate a single L0 instruction cache access for every 16-32 instructions.

We also creatively use the L0 instruction cache to optimise jumps across pages. Traditionally, because the page mapping might change and therefore the actual target of a jump instruction might change, DBTs have to be conservative and not link these blocks together. We instead check the L0 instruction cache (which we would need to check anyway when the next block begins) to see if the target is the same as the cached target. If so, the cached target is used and control does not go back to the main loop.
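To make the L0 data cache fast path concrete, the following self-contained sketch mirrors the tag check and XOR translation described above; the names and the omitted direct-mapped indexing are illustrative rather than R2VM's actual layout.

/// One L0 data cache entry, following Figure 4: `tag` packs the virtual
/// tag shifted left by one with a read-only bit in bit 0, and `xor_addr`
/// holds paddr ^ vaddr for the cached line.
#[derive(Clone, Copy)]
struct L0Entry {
    tag: u64,
    xor_addr: u64,
}

const LINE_BITS: u64 = 6; // 64-byte cache lines

/// Fast-path lookup: returns the translated address on a hit, or None,
/// in which case the cold path invokes the full memory model.
fn l0_lookup(entry: &L0Entry, vaddr: u64, write: bool) -> Option<u64> {
    let vtag = vaddr >> LINE_BITS;
    let hit = if write {
        // The shifted comparison only passes when the read-only bit is
        // clear, so writes to read-only lines fall to the cold path.
        (vtag << 1) == entry.tag
    } else {
        (entry.tag >> 1) == vtag
    };
    // paddr ^ vaddr has zero low bits within a line, so XOR-ing the full
    // virtual address recovers the physical address of the access.
    if hit { Some(vaddr ^ entry.xor_addr) } else { None }
}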
Our design for the memory system inherently supports cache coherency. Whenever the cache coherency protocol requires an invalidation, the line can be flushed from the L0 data cache of the target core. Because all simulated cores execute in lockstep, and there are synchronisation points before all memory accesses, the effect of the invalidation will be visible before the next memory access.
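Building on the L0Entry sketch above, the flush itself only needs to make the tag check fail; mapping the invalidated physical address to an L0 index is elided here and depends on the simulated cache organisation.

/// Sketch: invalidate one L0 data cache entry so the next access from
/// DBT-ed code misses and re-enters the memory model.
fn l0_invalidate(l0: &mut [L0Entry], line_index: usize) {
    // An all-ones tag can never match: `tag >> 1` exceeds any realistic
    // vtag, and `vtag << 1` is always even while u64::MAX is odd.
    l0[line_index].tag = u64::MAX;
}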
3.5 Run-Time Model Switching
R2VM is capable of user-level, supervisor-level and machine-level simulation. For user-level simulation, Linux syscalls are emulated; for supervisor-level simulation, supervisor binary interface (SBI) calls are emulated.

In many cases, we want to gather cache statistics with the behaviour of the operating system (OS) taken into account, but we do not want to count the OS boot and workload preparation steps before the region of interest, nor pay the performance overhead of detailed models for these portions. The design of R2VM takes this into account: both pipeline and memory models can be switched dynamically at run time. The switching is controlled by writing a special control and status register (CSR) in the vendor-specific CSR range.
R2VM supports pipeline model switching by simply flushing the code cache of translated binaries and letting the DBT engine use the new model's hooks for code generation. Moreover, as mentioned in Section 3.1, each core has its own code cache for DBT-ed code, so we allow the pipeline model to be specified per core rather than globally.

The memory model is switched at run time by flushing the L0 data cache and the instruction cache. The cache line size is also a run-time-configurable property. For example, if both the TLB and the cache are simulated, the cache line size can be set to 64 bytes; if only the TLB is simulated, the cache line size can be set to 4096 bytes, effectively turning the L0 data cache into an L0 data TLB.

If the memory model permits, R2VM can also switch between lockstep execution and parallel execution at run time, like other binary translators. Parallel execution is enabled with the "atomic" memory model. When paired with the "atomic" pipeline model, this behaves functionally equivalently to QEMU and gem5's atomic model, permitting fast-forwarding of the aforementioned boot and preparation steps.
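A sketch of the switching path is shown below. The CSR number, field names and the usize code handle are stand-ins so the sketch is self-contained; the real R2VM structures differ.

use std::collections::HashMap;

const CSR_MODEL_SWITCH: u16 = 0x800; // hypothetical vendor-specific CSR

struct Core {
    pipeline_model_id: u64,
    code_cache: HashMap<u64, usize>, // guest PC -> translated code handle
    l0_data_cache: Vec<u64>,         // tag words; u64::MAX = invalid
    l0_insn_cache: Vec<u64>,
}

impl Core {
    /// Handle a write to the model-switch CSR: select the new model for
    /// this core only, drop translations generated under the old model,
    /// and flush the L0 caches so the new memory model observes every
    /// access again.
    fn write_csr(&mut self, csr: u16, value: u64) {
        if csr == CSR_MODEL_SWITCH {
            self.pipeline_model_id = value;
            self.code_cache.clear();
            self.l0_data_cache.fill(u64::MAX);
            self.l0_insn_cache.fill(u64::MAX);
        }
    }
}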
4 EVALUATION
As described in Section 3, R2VM offers a range of pipeline models and memory models to select from, and allows switching between them mid-simulation. Each model represents a different trade-off. The pre-implemented pipeline and memory models are listed in Table 1 and Table 2.

Name     Description
Atomic   Cycle count not tracked
Simple   Each non-memory instruction takes one cycle
InOrder  Models a simple 5-stage in-order scalar pipeline
Table 1: List of pre-implemented pipeline models
Name    Description
Atomic  Memory accesses not tracked
TLB     TLB hit rate collected; cache not simulated
Cache   Cache hit rate collected; TLB and cache coherency not modelled; parallel execution allowed
MESI    A directory-based MESI cache coherency protocol with a shared L2; lockstep execution required
Table 2: List of pre-implemented memory models
For pipeline models, we validated the accuracy of the in-order model against an actual RTL implementation of a RISC-V core using CoreMark [7]. CoreMark is particularly helpful for this validation, as its working set is small enough to fit into the caches, and therefore the memory system of the RTL implementation does not affect the benchmark result. In our run, the RTL implementation reports 2.10 CoreMark/MHz, whereas the in-order model, when paired with the atomic memory model, reports 2.09 CoreMark/MHz; the difference is less than 1%. The "simple" model is validated simply by checking that all cores have equal MCYCLE and MINSTRET CSRs.

For memory models, we used a few micro-benchmarks to cover the use case of each model. For TLB and cache simulation, we used a single-core micro-benchmark similar to the MemLat tool from the 7-zip LZMA benchmark [14]. For the MESI cache-coherency model, we used a micro-benchmark simulating a scenario where two cores heavily contend over a shared spinlock. The memory model under test is used together with the validated in-order pipeline model, and we compare the number of cycles taken to execute a benchmark in R2VM and in RTL simulation. The error is around 10% for the cache coherency model and lower for the non-coherent models. Though not as accurate as the pipeline model, we believe that at this accuracy the simulation can provide representative-enough metrics for exploring design decisions.
Figure 5: Performance comparison between models and other simulators, in millions of instructions per CPU second
We evaluated the performance of R2VM against QEMU using the deduplication workload from PARSEC [3] on 4 cores to test the integer performance of the simulator (as both R2VM and QEMU interpret floating-point operations). The kIPS numbers for the gem5 simulator are from Saidi et al.'s presentation [15].

As shown in Figure 5, the techniques we use lead to superb performance. When caches are not simulated and cores can therefore run in parallel threads, R2VM runs at >300 MIPS per core, even outperforming QEMU. Lockstep execution brings performance down by 10x to ∼30 MIPS (for 4 simulated cores in a single thread), but this is still significantly faster than gem5.

Thanks to our pipeline model design, which moves most simulation to DBT compilation time rather than run time, and to our memory model design, which offloads most memory accesses to the L0 caches, simulating pipelines and cache coherency protocols adds little overhead of its own, compared to the overhead of lockstep execution.
5 CONCLUSION
We have introduced R2VM, a multi-purpose binary translating simulator that is able to simulate multi-core RISC-V systems at the cycle level at high speed. This is achieved by leveraging fibers to support fast lockstep execution. Overall, these optimisations allow R2VM to achieve functional simulation performance exceeding that of QEMU, and cycle-level simulation nearly 100 times faster than gem5.
REFERENCES
[1] Oscar Almer, Igor Böhm, Tobias Edler von Koch, Björn Franke, Stephen Kyle, Volker Seeker, Christopher Thompson, and Nigel Topham. 2011. Scalable multi-core simulation using parallel dynamic binary translation. In 2011 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS). IEEE, 190–199.
[2] Fabrice Bellard. 2005. QEMU, a fast and portable dynamic translator. In USENIX Annual Technical Conference, FREENIX Track, Vol. 41. 46.
[3] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. 2008. The PARSEC benchmark suite: Characterization and architectural implications. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques. ACM, 72–81.
[4] Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, et al. 2011. The gem5 simulator. ACM SIGARCH Computer Architecture News 39, 2 (2011), 1–7.
[5] Igor Böhm, Björn Franke, and Nigel Topham. 2010. Cycle-accurate performance modelling in an ultra-fast just-in-time dynamic binary translation instruction set simulator. In 2010 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS). IEEE, 1–10.
[6] Hadi Brais, Rajshekar Kalayappan, and Preeti Ranjan Panda. 2020. A Survey of Cache Simulators. ACM Computing Surveys (CSUR) (2020).
[7] EEMBC. CoreMark: An EEMBC Benchmark. https://www.eembc.org/coremark/
[8] Standard Performance Evaluation Corporation. 2017. SPEC CPU2017. https://www.spec.org/cpu2017/
[9] Emilio G. Cota and Luca P. Carloni. 2019. Cross-ISA machine instrumentation using fast and scalable dynamic binary translation. In Proceedings of the 15th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments. 74–87.
[10] Xuan Guo and Robert Mullins. 2019. Fast TLB Simulation for RISC-V Systems. In Third Workshop on Computer Architecture Research with RISC-V (CARRV).
[11] Kim Hazelwood. 2011. Dynamic Binary Modification: Tools, Techniques, and Applications. Synthesis Lectures on Computer Architecture 6, 2 (2011), 1–81.
[12] Ankur Limaye and Tosiron Adegbija. 2018. A workload characterization of the SPEC CPU2017 benchmark suite. In 2018 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 149–158.
[13] Nicholas Nethercote and Julian Seward. 2007. Valgrind: a framework for heavyweight dynamic binary instrumentation. ACM SIGPLAN Notices 42, 6 (2007), 89–100.
[14] Igor Pavlov. 7-Zip LZMA Benchmark. https://www.7-cpu.com/