Enabling Virtual Memory Research on RISC-V with a Configurable TLB Hierarchy for the Rocket Chip Generator
Nikolaos Charalampos Papadopoulos, Vasileios Karakostas, Konstantinos Nikas, Nectarios Koziris, Dionisios N. Pnevmatikatos
Nikolaos Charalampos Papadopoulos ∗
National Technical University [email protected]
Vasileios Karakostas
National Technical University [email protected]
Konstantinos Nikas
National Technical University [email protected]
Nectarios Koziris
National Technical University [email protected]
Dionisios N. Pnevmatikatos
National Technical University [email protected]
ABSTRACT
The Rocket Chip Generator uses a collection of parameterized processor components to produce RISC-V-based SoCs. It is a powerful tool that can produce a wide variety of processor designs, ranging from tiny embedded processors to complex multi-core systems. In this paper we extend the features of the Memory Management Unit of the Rocket Chip Generator, and specifically the TLB hierarchy. TLBs are essential in terms of performance because they mitigate the overhead of frequent Page Table Walks, but may harm the critical path of the processor due to their size and/or associativity. In the original Rocket Chip implementation the L1 Instruction/Data TLB is fully-associative and the shared L2 TLB is direct-mapped. We lift these restrictions and design and implement configurable, set-associative L1 and L2 TLB templates that can create any organization from direct-mapped to fully-associative, to achieve the desired ratio of performance and resource utilization, especially for larger TLBs. We evaluate different TLB configurations and present performance, area, and frequency results of our design using benchmarks from the SPEC2006 suite on the Xilinx ZCU102 FPGA.
KEYWORDS
RISC-V, Rocket Chip Generator, TLB, Memory Management Unit
Reference Format:
Nikolaos Charalampos Papadopoulos, Vasileios Karakostas, Konstantinos Nikas, Nectarios Koziris, and Dionisios N. Pnevmatikatos. Enabling Virtual Memory Research on RISC-V with a Configurable TLB Hierarchy for the Rocket Chip Generator. In Proceedings of the Fourth Workshop on Computer Architecture Research with RISC-V (CARRV 2020).
1 INTRODUCTION

The Rocket Chip Generator (RCG) is a tool that uses the open RISC-V ISA to produce configurable SoCs. RCG supports fully-fledged Unix-like operating systems, and features important RISC-V extensions and accelerators. RCG is designed to target a wide range of application domains, ranging from embedded up to complex and multicore systems. To support this wide range of application domains, most of the processor components have been implemented as configurable templates in the Chisel high-level hardware construction language (HCL). However, some of the Rocket Chip Generator components are still missing support for configurability. In this paper we focus on the Memory Management Unit (MMU) and specifically on the Translation Lookaside Buffer (TLB) hierarchy that lacks such configurability support. TLBs are essential in terms of performance because they mitigate the overhead of frequent page table walks, but may harm the critical path of the processor due to their size and/or associativity. Furthermore, a configurable TLB hierarchy might be useful for performance scaling for faster processors such as the out-of-order BOOM [8].

In the original Rocket Chip implementation only the number of TLB entries is configurable; the L1 Instruction and Data TLBs can only be fully-associative and the shared L2 TLB direct-mapped. However, that approach is not optimal for applications with large memory footprints that require larger TLB reach with many entries, because (i) increasing the number of entries of the fully-associative L1 TLB may increase the processor critical path and can impact the operating frequency of the entire design, and (ii) a direct-mapped L2 TLB can experience many conflict misses, leaving significant room for application performance improvement with the use of increased associativity. Clearly, this lack of configurability in the TLB may limit the efficient applicability of Rocket Chip SoCs for applications with large memory footprints that stress the TLB hierarchy.

In this paper we lift these restrictions and design and implement configurable, set-associative L1 and L2 TLB templates that can create any organization from direct-mapped to fully-associative, to achieve the desired ratio of performance and resource utilization, especially for larger TLBs. Furthermore, we modify existing replacement policies to be compatible with our design, offering flexibility for performance and resource usage trade-offs. We modify the L1 and L2 TLB mechanisms, and specifically how TLB lookups, refills, flushes, and replacements are handled. Chisel allows the programmer to produce circuit generators that are easily configurable. With our approach, just by adjusting the number of sets and ways of the L1/L2 TLB, all the TLB circuitry is properly configured. Corner cases such as direct-mapped and fully-associative organizations are included, and the design is tailored to remove unnecessary components for these cases. For example, if a direct-mapped organization is selected there is no need for a replacement policy, so our Chisel code removes it altogether.

We use different L1/L2 TLB configurations to evaluate our design with benchmarks from the SPEC2006int suite [10]. We show that the largest evaluated TLB configuration improves performance by up to 15.4%, with minimal impact on area and frequency.

In summary, the main contributions of this paper are:
• We implement a fully configurable Instruction/Data L1 TLB and shared L2 TLB that can output any design from direct-mapped to fully-associative, lifting the initial restriction of configurability only by the number of entries. This leads to better scaling of performance and resources, especially for large TLBs. We make our design publicly available to enable further research on the active topic of virtual memory support for RISC-V.
• We present a case study in which we evaluate the performance and resource usage of the Rocket Chip [4] processor with different TLB configurations, by running benchmarks from the SPEC2006int [10] suite on the Xilinx ZCU102 FPGA.

∗ Contact author
2 BACKGROUND

Here we provide background on virtual memory, the Chisel hardware construction language, and the Rocket Chip Generator.
Virtual memory is an essential concept in processor design because it provides the illusion of a very large and private address space to each process running in the system. Virtual memory offers security through process isolation, and also benefits programmer productivity since the operating system manages the memory mappings and the hardware accelerates the translations.

RISC-V supports different virtual memory systems depending on the size of the address space (e.g., RV32 Sv32, RV64 Sv39/Sv48 [1]). In this paper we focus on RV64 Sv39 (39-bit address space), which supports 4KB base pages but also 2MB and 1GB superpages. The page table, which stores the memory mappings of each process, is implemented as a multi-level radix tree (a 3-level page table in RV64 Sv39). A processor register called SATP (Supervisor Address Translation and Protection register) holds the root of the page table. The physical address is obtained after performing a sequential lookup in each page table level. The page table walker (PTW) that performs the virtual-to-physical address translations is typically implemented in hardware for improved performance.

To accelerate address translation without accessing the page table on every memory reference, a Translation Lookaside Buffer (TLB) is used, which keeps the recently used translations. The TLB lies on the critical path of the processor, and as a result its size and associativity are essential for the overall performance. To overcome this problem without sacrificing the hit rate, multi-level TLB organizations are used; the first-level TLB (L1) is usually small (32-128 entries) but very fast, while the second-level TLB (L2) is usually larger (128-1024 entries) but slower. Finally, a Page Table Walk cache is usually implemented to hold non-leaf intermediate translations of the page table to avoid searching levels of the page table (TLBs hold the leaf translations). Figure 1 shows these structures.

Available at https://github.com/ncppd/rocket-chip
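As a concrete illustration of the Sv39 layout described above, the following Python model (our sketch, not Rocket Chip code) splits a virtual address into its page offset and the three 9-bit VPN fields that index the levels of the radix-tree page table:

```python
# Behavioral sketch of Sv39 address decomposition (illustrative names,
# not the Rocket Chip implementation).

PAGE_OFFSET_BITS = 12   # 4KB base pages
VPN_FIELD_BITS = 9      # 512 entries per page-table level
LEVELS = 3              # Sv39 uses a 3-level radix tree

def split_sv39(vaddr):
    """Return (vpn2, vpn1, vpn0, offset) for a 39-bit virtual address."""
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
    vpn = vaddr >> PAGE_OFFSET_BITS
    fields = []
    for _ in range(LEVELS):
        fields.append(vpn & ((1 << VPN_FIELD_BITS) - 1))
        vpn >>= VPN_FIELD_BITS
    vpn0, vpn1, vpn2 = fields
    return vpn2, vpn1, vpn0, offset
```

A hardware walk starting from the SATP root would use vpn2, then vpn1, then vpn0 to index each successive page-table level before reaching the leaf translation.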
Figure 1: Overview of the MMU in Rocket Chip Generator.
Chisel [5] is a high-level Hardware Construction Language (HCL) embedded in the Scala language. Chisel enables the design of powerful circuit generators by utilizing Scala's high-level programming concepts such as object orientation, functional programming, parameterized types, and type inference. Chisel can generate synthesizable Verilog for both FPGA simulation and ASIC implementation. It can also output cycle-accurate C++ simulators, which are very useful for hardware simulation and debugging.
The Rocket Chip Generator (RCG) [4] generates RISC-V ISA [1, 2] based systems using Chisel. It can also be considered a library of processor parts that can easily be reused with any design written in Chisel. By default, the Rocket Chip Generator instantiates Rocket, an in-order core implementation, but it also supports various core implementations including the BOOM out-of-order processor [8]. Rocket is a simple, 5-stage, in-order processor that implements the RISC-V ISA, including an MMU that supports page-based virtual memory, TLBs, instruction and data caches, and a frontend that features dynamic branch prediction with configurable sizes.
3 DESIGN AND IMPLEMENTATION

In this section we provide an overview of the original implementation of the Instruction/Data L1 and shared L2 TLB in the Rocket Chip Generator. Then, we present the design and implementation of our proposed configurable L1 and L2 TLBs. Our design can output any organization, ranging from direct-mapped up to fully-associative TLBs.
Each processor has its own TLB hierarchy, as shown in Figure 2. The L1 Instruction and Data TLBs hold address translations for the process code and the process data, respectively. The L1 Instruction/Data TLBs are built from the same Chisel template in the RCG and only have minor differences regarding access privileges to pages. The L2 TLB is shared among the L1 Instruction/Data TLBs and can contain both instruction and data page translations.
Figure 2: Rocket Chip MMU organization.
The L1 Instruction/Data TLB stores the page translations in registers using a vector of Reg elements, which creates an array of positive-edge-triggered registers that output a copy of the input signal delayed by one clock cycle, depending on its activation signal. The original L1 TLB is fully-associative with a configurable number of entries and uses a pseudo-LRU replacement policy. The L1 TLB responds with a hit/miss indication on the next cycle and stores virtual-to-physical page translations of 4KB pages but also 2MB/1GB superpages.
The Chisel template for the Page Table Walker (PTW) incorporates the shared L2 TLB. The PTW is connected with the L1 Instruction and Data TLBs through a Round-Robin Arbiter that selects the target virtual address to be translated. The shared L2 TLB is direct-mapped with a configurable number of entries. Because of the direct-mapped organization there is no need for a replacement policy. The L2 TLB stores the page translations using Chisel's SyncReadMem/SeqMem construct, which can be synthesized to FPGA Block RAM or ASIC SRAM. SyncReadMem creates a synchronous-read, synchronous-write memory, in this case with one read and one write port. Because of the SyncReadMem construct, data are fetched on the next cycle; SyncReadMem outputs to a register in order to perform a synchronous read operation. For the L2 TLB to stay in sync with the rest of the PTW mechanism, there are intermediate stages until the L2 TLB reports a hit or miss.
The PTW Cache is a small fully-associative cache that stores the non-leaf virtual-to-physical page translations. In this paper we focus on the TLBs and leave the PTW Cache for future work.
In the original Rocket Chip implementation, only the number of TLB entries is configurable. However, that approach is not optimal for applications with large memory footprints that require larger TLB reach. Increasing the number of entries of the fully-associative L1 TLB significantly increases the critical path of the processor and can impact the operating frequency of the entire design. This happens because fully-associative TLBs are typically implemented as CAMs. However, CAMs are resource- and power-hungry structures, in both ASICs and FPGAs [25]. Considering this, the original fully-associative L1 TLB is constrained and does not scale with application requirements. Increasing the size of the L1 TLBs at lower associativity may increase the TLB reach and reduce the number of TLB misses without affecting the overall resource usage/frequency.

Furthermore, because the L1 TLBs need to be fast, they are implemented using discrete registers, which are generally precious resources in both ASIC and FPGA implementations. To mitigate the miss overhead of a relatively small L1 TLB, a larger but slower L2 TLB is introduced that stores translations in FPGA Block RAM or ASIC SRAM. However, a direct-mapped L2 TLB can experience many conflict misses. In addition, L2 TLB misses are even more costly than L1 TLB misses, because they are resolved through page walks that incur increased latency. Associativity may reduce the number of conflict misses and improve application performance.

To summarize, this lack of configurability in the TLB may limit the applicability of the Rocket Chip Generator for workloads with large memory footprints that stress the TLB hierarchy.
To develop a configurable L1 TLB we must consider a set of factors and trade-offs. More specifically, the configurable Instruction/Data TLB should use registers, via Chisel's Reg element, for fast lookup time. In addition, the configurable Instruction/Data TLB should be built from the same Chisel template, with minor differences regarding the access privileges, as mentioned earlier. Our implementation adheres to the aforementioned requirements and is compatible with the original implementation. Next, we describe how lookups, refills, replacements, and flushes are handled in our configurable L1 TLB.
Lookup. Whenever an address translation is requested, we obtain a tag and an index by splitting the VPN. Using the index we locate the target set and perform there a fully-associative search that matches the tag. We modify the valid bit array and construct it as a Vec of registers, so every set has its respective valid bit array and can address it using the index.

Refill. When a TLB refill is requested, we locate, using the index, the target set into which the virtual/physical address must be inserted. In case the set is not full, we select the first free slot. Otherwise, if the set is full, we perform a pseudo-LRU replacement.
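The lookup and refill flow described above can be sketched as a small behavioral model (illustrative Python, not our Chisel source; the eviction choice is simplified to FIFO-within-set in place of pseudo-LRU):

```python
# Behavioral model of a set-associative TLB lookup/refill path.
# Class and field names are ours, chosen for illustration only.

class SetAssocTLB:
    def __init__(self, n_sets, n_ways):
        assert n_sets & (n_sets - 1) == 0, "number of sets must be a power of two"
        self.n_sets, self.n_ways = n_sets, n_ways
        self.index_bits = n_sets.bit_length() - 1
        # each set: list of (tag, ppn, valid) entries
        self.sets = [[(0, 0, False)] * n_ways for _ in range(n_sets)]

    def _split(self, vpn):
        index = vpn & (self.n_sets - 1)   # low VPN bits select the set
        tag = vpn >> self.index_bits      # remaining bits are matched associatively
        return tag, index

    def lookup(self, vpn):
        tag, index = self._split(vpn)
        for way_tag, ppn, valid in self.sets[index]:
            if valid and way_tag == tag:  # fully-associative search within the set
                return ppn
        return None                        # miss: the PTW would be invoked

    def refill(self, vpn, ppn):
        tag, index = self._split(vpn)
        ways = self.sets[index]
        for w, (_, _, valid) in enumerate(ways):
            if not valid:                  # first free slot in the set
                ways[w] = (tag, ppn, True)
                return
        ways.pop(0)                        # set full: evict (FIFO stand-in for PLRU)
        ways.append((tag, ppn, True))
```

Note that with `n_sets = 1` the index contributes nothing and the model degenerates to a fully-associative TLB, mirroring how a single template can cover the whole organization range.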
Replacement policy. We modify the existing pseudo-LRU replacement policy and implement a set-associative alternative that uses the Reg construct. Support for a random replacement policy is already provided. A random replacement policy is an attractive alternative thanks to its simplicity and can also be applied to TLBs; however, it may increase the TLB miss rate and hence degrade performance.
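For reference, the tree-based flavor of pseudo-LRU referred to above can be modeled for a single 4-way set as follows (a software sketch with illustrative names; the Chisel version keeps the equivalent bits in Reg state):

```python
# Tree pseudo-LRU state for one 4-way set. Each bit points toward the
# pseudo-least-recently-used side: 0 = left, 1 = right.

class TreePLRU4:
    def __init__(self):
        self.bits = [0, 0, 0]   # root, left-pair bit, right-pair bit

    def touch(self, way):
        """Mark `way` (0-3) as most recently used by flipping the bits on
        its tree path so they point away from it."""
        if way < 2:
            self.bits[0] = 1                 # right half becomes pseudo-LRU
            self.bits[1] = 1 - way           # sibling within the left pair
        else:
            self.bits[0] = 0                 # left half becomes pseudo-LRU
            self.bits[2] = 3 - way           # sibling within the right pair

    def victim(self):
        """Follow the bits down the tree to the pseudo-LRU way."""
        if self.bits[0] == 0:
            return self.bits[1]              # way 0 or 1
        return 2 + self.bits[2]              # way 2 or 3
```

With N ways this costs N-1 bits per set, which is why its area overhead grows with TLB size, the trade-off weighed against random replacement above.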
Flush. When the OS modifies the page table, the stale TLB entries must be flushed. This happens when the OS executes the sfence.vma instruction to invalidate an entry. Using the index we retrieve the set that includes the entry to be flushed and perform a fully-associative lookup within that set using the tag. The flushing of the TLB entry is done by zeroing the valid bit of the specified entry.
Page sizes. We initially developed the configurable L1 TLB in an older Rocket Chip edition that supported both base and super page sizes in the same TLB. A constraint of a set-associative TLB structure that we must address concerns the page size: when the page size is unknown, it is difficult to determine the least significant bits of the VPN in order to select a set [22, 23]. Therefore, we select to implement a configurable L1 TLB only for the 4KB fixed page size. We also ported the configurable L1 TLB to a recent edition of the Rocket Chip in which this restriction is lifted: the TLB mechanism is separate for base/super pages, so our implementation of the configurable L1 TLB does not affect the superpage mechanism. More details about the Rocket Chip versions/commits that we modified are presented in Section 4.

Conf. No | DTLB                     | ITLB                     | L2 TLB             | DTLB Reach | ITLB Reach | L2 TLB Reach
I        | fully-assoc., 32 entries | fully-assoc., 32 entries | -                  | 128KB      | 128KB      | -
II       | fully-assoc., 32 entries | fully-assoc., 32 entries | 4-way, 128 entries | 128KB      | 128KB      | 512KB
III      | fully-assoc., 32 entries | fully-assoc., 32 entries | 4-way, 512 entries | 128KB      | 128KB      | 2MB
IV       | 64 entries               | 128 entries              | 1024 entries       | 256KB      | 512KB      | 4MB
V        | 128 entries              | 64 entries               | 1024 entries       | 512KB      | 256KB      | 4MB

Table 1: Rocket Chip L1 Instruction/Data TLB and shared L2 TLB configurations (Associativity/Size).
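The page-size constraint above can be made concrete: the VPN bits that select a set under a 4KB-page assumption are untranslated offset bits under a 2MB-superpage assumption, so the set index is ambiguous until the page size is known. An illustrative Python sketch (ours, not the Rocket Chip code):

```python
# Why set indexing is hard with mixed page sizes: the index depends on
# where the page offset ends, which differs per page size.

def set_index(vaddr, page_offset_bits, n_sets):
    """Set index = low bits of the VPN implied by the assumed page size."""
    return (vaddr >> page_offset_bits) & (n_sets - 1)

va = 0x0020_3000
idx_4k = set_index(va, 12, 8)   # assuming a 4KB base page
idx_2m = set_index(va, 21, 8)   # assuming a 2MB superpage
# The two assumptions can select different sets for the same address, so a
# fixed indexing scheme only works once the page size is fixed (here, 4KB).
```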
The L2 TLB was originally direct-mapped, an organization that is very simple in terms of replacement policy and TLB flushing: an address translation maps to a unique TLB entry, and as a result there is no need for a replacement policy. The valid bit array is kept in register banks and not in the SyncReadMem in which the TLB entries are stored. Obtaining a value from a register bank completes in the same cycle, in contrast with the SyncReadMem, which has a cycle of delay. As a result, the valid bit array of the L2 TLB can be read and updated in the same cycle. This has the benefit of manipulating the valid bits without accessing the TLB array. The valid bit array is constructed as a Vec of registers, in the same way as in the L1 TLB.
Lookup. The L2 TLB lookup mechanism is similar to that of the L1 TLBs. The only difference is that the lookup in the L2 TLB introduces an additional cycle of delay due to the SyncReadMem construct. As a result, we use registers to hold intermediate state.
Refill. In case of a refill, the L2 TLB handles it similarly to the L1 TLB. The only difference is the use of masks to update a specific way in a set. Masks are a feature of the SyncReadMem construct that eases updating specific indexes inside a set.
Replacement policy. To choose a replacement policy we must make a trade-off between area and performance. The pseudo-LRU replacement policy must keep track of the way access history, and as a result it impacts the total area when the TLB is large. On the other hand, a random replacement policy has nearly zero impact on the total area but may degrade performance. We implement both replacement policies for the L2 TLB. In Section 5 we choose to evaluate our set-associative design with the random replacement policy in favor of area constraints.
Flush. Flushing a TLB entry in a set-associative organization means that the entry must be located inside the selected set. Fetching the tags of the selected set incurs a cycle of delay because of the SyncReadMem construct. To overcome this overhead and keep the flushing mechanism simple, we choose to flush the whole set. Another approach would be to block the L2 TLB for one cycle to retrieve the set, and then flush the specific entry. We are considering implementing that in the future.
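The whole-set flush described above amounts to clearing the valid bit of every way in the indexed set; since the valid bits live in registers rather than in the SyncReadMem, this completes without the extra read cycle. A behavioral sketch (illustrative Python, not the Chisel code):

```python
# Conservative whole-set flush: instead of locating the one matching entry
# (which would require a SyncReadMem tag read, costing a cycle), clear the
# valid flags of all ways in the set selected by the VPN's index bits.

def flush_set(valid_bits, vpn, n_sets):
    """valid_bits: per-set lists of valid flags (register state, so they
    can be read and cleared in the same cycle)."""
    index = vpn & (n_sets - 1)
    valid_bits[index] = [False] * len(valid_bits[index])
    return valid_bits
```

The cost is over-invalidation of up to n_ways - 1 still-valid entries per flush, traded for a simpler, single-cycle mechanism.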
4 EXPERIMENTAL METHODOLOGY

In this section we describe our evaluation methodology, including the hardware/software tools, metrics, and configurations. We initially developed the configurable TLB hierarchy on an older Rocket Chip commit (7cd3352, April 3, 2018) that supported the Xilinx ZCU102 platform. Our contribution consists of about 80 and 70 lines of Chisel code added for the L1 and L2 TLB, respectively. Unfortunately, that repository does not track the recent changes in the Rocket Chip Generator. As a result, we opted to use the old Rocket Chip version for our evaluation with the Xilinx ZCU102 platform. In addition, to ensure the relevance and compatibility of our approach with more recent versions of Rocket Chip, we ported our design to a more recent version (27120ee, Jan 22, 2020) that also features new mechanisms, such as a sectored L1 TLB to further improve the TLB reach for 4KB pages, and a separate fully-associative L1 TLB for superpages. Our changes amount to about 50 and 70 lines of Chisel code for the configurable L1 and L2 TLB, respectively, in the recent version. We validated our ported design using Verilator simulations, and we plan to evaluate it on other supported FPGA platforms.
We follow a two-step process during the development of our TLB hierarchy. First, we evaluate the L1/L2 TLB using Verilator [24] to validate the correctness of our design and to remove any bugs; afterwards, we use the Vivado tools to compile our design for the Xilinx ZCU102 FPGA. In more detail, the development phase includes the following:
Verilator is an open-source tool that produces high-performance cycle-accurate C++/SystemC hardware models. Using assert/printf statements, debugging becomes easier, as Verilator produces logs of high verbosity. We use the official riscv-tests [20] as a sanity check, and then orchestrate specific assembly tests that run upon riscv-pk [19] (a lightweight proxy kernel), which provides virtual memory support. Unfortunately, the downside of using Verilator is the slow emulation speed in contrast with FPGAs.
We use SiFive's Freedom-U-SDK [21], which sets up a minimal Linux environment. The Rocket Chip SoC boots the lightweight Buildroot [7] distribution on top of Linux kernel 4.15.0 with 4KB pages. We add new Buildroot packages that include simple TLB tests to verify that our design is working as expected, tools to retrieve performance counter results, and finally the SPEC2006 benchmarks [10] (compiled using Speckle [9]). We modify the Berkeley Boot Loader (BBL) [19], which initializes machine registers and then boots the Linux kernel, to set up several performance counters, such as ITLB/DTLB and L2 TLB misses, using the mhpmeventXX registers.
We use the Vivado 2018.1 Design Suite for synthesis and placement. Vivado also provides results regarding resource usage. To evaluate the impact of the TLB hierarchy on application performance, we run a subset of the SPEC2006int [10] benchmarks with different L1 Instruction/Data TLB and shared L2 TLB configurations. In all configurations we use a 4-way 32KB instruction cache and a 4-way 16KB data cache.
To evaluate our configurable TLB hierarchy we use the following metrics: (i) FPGA resource usage, i.e., flip-flops (FFs), look-up tables (LUTs), and block RAM (BRAM); (ii) TLB performance, i.e., TLB Misses Per Kilo Instructions (MPKI); and (iii) system performance, i.e., Instructions Per Cycle (IPC), a performance metric that isolates the impact of the TLB implementation on the critical path, ignoring the processor frequency. To evaluate the TLB and system performance we use benchmarks from the SPEC2006int suite [10]. We use them with the test input set due to the limited physical memory (512MB) that our Xilinx ZCU102 platform exposes to the programming logic.
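For clarity, the two arithmetic definitions used in this evaluation, MPKI and TLB reach, can be written out as small helper functions (hypothetical helpers of ours, not part of the paper's tooling):

```python
# MPKI and TLB reach, as used in the evaluation.

PAGE_SIZE = 4 * 1024  # 4KB base pages

def mpki(misses, instructions):
    """TLB Misses Per Kilo Instructions."""
    return misses * 1000 / instructions

def tlb_reach_bytes(entries, page_size=PAGE_SIZE):
    """TLB reach = number of entries x page size."""
    return entries * page_size

# e.g. a 32-entry L1 TLB reaches 32 * 4KB = 128KB of address space.
```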
We evaluate our configurable TLB hierarchy using different configurations for the L1 Instruction/Data TLB and shared L2 TLB. Table 1 summarizes the evaluated configurations. We choose these configurations to cover a range of systems, from small and embedded up to modern high-performance general-purpose systems. The TLB reach (i.e., number of entries × page size) covered by the L1 ranges from 128KB to 512KB, and for the L2 is up to 4MB. In the most lightweight configuration we choose not to include an L2 TLB, to quantify the performance and area differences of the different configurations. Finally, in the most performant TLB configurations (Configurations IV, V) we swap the sizes of the Data and Instruction TLB to identify possible changes in performance without changing the L2 TLB. Note that in our evaluation we do not include a PTW Cache. The configuration scenarios are chosen to resemble well-known architectures:

I. Vanilla Rocket Chip without L2 TLB
II. Vanilla Rocket Chip including small L2 TLB
III. ARM Cortex A57 [3]
IV. Intel Skylake [11]
V. Intel Skylake with swapped Instruction/Data TLB sizes.

5 EVALUATION

In this section we evaluate our configurable TLB hierarchy. The purpose of our evaluation is twofold: (i) to show that the generated designs have minimal impact on area and frequency, and (ii) to show how TLB configurability affects performance.
Figure 3 shows the area results for the various configurations. We present the total area of the Rocket Chip SoC as reported by the Vivado 2018.1 Implementation stage. Note that the L1 Instruction/Data TLB structures use FFs, while the shared L2 TLB uses BRAMs.
Configuration   | I   | II  | III | IV  | V
Frequency (MHz) | 189 | 187 | 186 | 188 | 186

Table 2: Maximum operating frequency per configuration.

Figure 3: Area results for different TLB configurations.
In the most lightweight scenarios (Configurations I, II, III) Vivado 2018.1 reports that the full Rocket Chip SoC occupies 12% of the total LUTs, 3% of the total FFs, and 3% of the total BRAMs of the Xilinx ZCU102. Moving up to the most performant configurations (Configurations IV, V) in terms of TLB hit rate, the Rocket Chip SoC occupancy increases to 13% of the total LUTs and 4% of the total FFs/BRAMs. The FF usage increases in Configurations IV and V in order to accommodate the new TLB entries.

Table 2 shows the maximum frequency achieved with all configurations. The results show that the impact on the maximum operating frequency ranges from 0.53% to 1.59%. In particular, Configuration IV has a 2× larger DTLB, a 4× larger ITLB, and a 1024-entry L2 TLB, but exhibits only a 0.53% drop in frequency compared to Configuration I.

We now present the results of the SPEC2006int benchmarks that we obtained on the Xilinx ZCU102 FPGA board. Figure 4 shows the MPKI results in the L1 Instruction/Data TLBs for the various configurations. We observe that gobmk, hmmer, sjeng, and libquantum exhibit similar behavior in L1 TLB MPKI even with larger TLB configurations. The most demanding in terms of TLB miss rate is mcf, and even with the largest Configuration V the miss rate is still high. For Configurations IV and V the miss rate is nearly the same, with Configuration V performing better in all tests. Most misses generally come from the Data TLB.

Figure 5 shows the MPKI results in the L2 TLB for the various configurations. Configuration I is not included, as it lacks an L2 TLB. We observe that the L2 TLB MPKI for most benchmarks is nearly zero, particularly for the larger Configurations IV and V, thanks to the larger reach of the L2 TLB. There is also a major improvement in mcf, which stresses the L2 TLB the most. On average, the miss rate for the L2 TLB is nearly zero with the larger Configurations IV and V.

Focusing on the impact of associativity, Table 3 shows the number of L2 TLB misses for mcf as we increase the L2 TLB associativity but keep the number of L2 TLB entries constant. The L1 Instruction/Data TLB parameters are based on those of Configuration V. We observe that there is an 82.8%/83.3% reduction in TLB misses when the associativity changes from direct-mapped to 4-way/8-way.
Figure 4: Aggregated MPKI of the L1 Data/Instruction TLBs for the various TLB configurations.

Figure 5: Aggregated MPKI of the L2 TLB for the various TLB configurations.
This behavior highlights the possible impact on the miss rate that a direct-mapped TLB can have due to conflicting entries, and the benefits of using a set-associative TLB. Note, however, that such behavior depends on the working set of the application and its access pattern, and that our results are for the SPEC2006int benchmarks with the rather small test input set, as explained in Section 4.
L2 TLB Associativity: Direct-mapped | 4-way | 8-way

Table 3: Number of L2 TLB misses for mcf as L2 TLB associativity increases, with Conf. V and a fixed 1024-entry L2 TLB.
Finally, Table 4 summarizes the absolute IPC value with Configuration I, and the IPC speedup for Configurations II-V with respect to Configuration I. As we can see, the IPC performance increases by up to 15.4%, depending on the demand for TLB resources and the access patterns of each benchmark.
Benchmark  | I    | II   | III  | IV    | V
mcf        | 0.13 | -    | 7.7% | 15.4% | 15.4%
gobmk      | 0.44 | -    | -    | 2.3%  | 2.3%
hmmer      | 0.58 | -    | -    | -     | -
sjeng      | 0.55 | 1.8% | 1.8% | 1.8%  | 3.6%
libquantum | 0.44 | -    | -    | -     | -
h264ref    | 0.77 | 1.4% | 1.4% | 2.6%  | 2.6%
omnetpp    | 0.35 | 2.9% | 5.7% | 5.7%  | 5.7%
astar      | 0.36 | -    | -    | 2.8%  | 2.8%
xalancbmk  | 0.36 | 2.8% | 8.3% | 8.3%  | 8.3%
bzip2      | 0.51 | 2.0% | 4.0% | 5.9%  | 5.9%
gcc        | 0.44 | 2.2% | 2.2% | 4.5%  | 4.5%

Table 4: Absolute IPC values for Conf. I and percentage of IPC increase for Conf. II to V with respect to Conf. I.

6 RELATED WORK

Prior work has focused on developing new MMU features for the Rocket Chip Generator in order to improve performance (e.g., Direct Segments for RISC-V [14]), while future work could investigate alternative techniques (e.g., Coalesced [18] and Clustered TLBs [17], Redundant Memory Mappings [12], and Hybrid TLB Coalescing [16]) to enhance the MMU performance. Another line of prior work has focused on bridging the FPGA-to-ASIC performance gap in order to gain more insights about the actual performance of a processor to be fabricated, and to also lower resource usage (e.g., [6, 13, 15]). Furthermore, Content-Addressable Memories (CAMs) are known to be resource-hungry structures [26]. Magyar et al. proposed Golden Gate [15] to create decoupled FPGA-accelerated simulators by replacing FPGA-hostile CAMs with multi-cycle models, thus reducing resource utilization. As fully-associative TLBs are typically implemented as CAMs, future work on resource optimization for large fully-associative TLB organizations could leverage such FPGA-simulation research frameworks.
7 CONCLUSION

In this paper we explored the Memory Management Unit of the Rocket Chip Generator and lifted its implementation limitations in the TLB hierarchy. We implemented a fully configurable L1 and L2 TLB that can output any design from direct-mapped to fully-associative. Our design enables design space exploration and allows the Rocket Chip Generator to instantiate cores with TLBs that match the needs of TLB-intensive applications. We make our design publicly available to enable further research on the active topic of virtual memory support for the RISC-V architecture.
ACKNOWLEDGMENTS
We would like to thank Dr. Tuo Li from the UNSW School of Computer Science and Engineering for porting the Rocket Chip Generator to the Xilinx ZCU102 board; his contribution and useful tips helped us considerably.
REFERENCES
[1] Andrew Waterman and Krste Asanović. 2017. The RISC-V Instruction Set Manual, Volume II: Privileged Architecture, Document Version 1.10.
[2] Andrew Waterman and Krste Asanović. 2017. The RISC-V Instruction Set Manual, Volume I: User-Level ISA, Document Version 2.2.
[3] ARM. [n.d.]. ARM Cortex-A57 Technical Reference Manual. http://infocenter.arm.com/help/topic/com.arm.doc.ddi0488c/DDI0488C_cortex_a57_mpcore_r1p0_trm.pdf
[4] Krste Asanović, Rimas Avizienis, Jonathan Bachrach, Scott Beamer, David Biancolin, Christopher Celio, Henry Cook, Daniel Dabbelt, John Hauser, Adam Izraelevitz, Sagar Karandikar, Ben Keller, Donggyu Kim, John Koenig, Yunsup Lee, Eric Love, Martin Maas, Albert Magyar, Howard Mao, Miquel Moreto, Albert Ou, David A. Patterson, Brian Richards, Colin Schmidt, Stephen Twigg, Huy Vo, and Andrew Waterman. 2016. The Rocket Chip Generator.
[5] Jonathan Bachrach, Huy Vo, Brian Richards, Yunsup Lee, Andrew Waterman, Rimas Avižienis, John Wawrzynek, and Krste Asanović. 2012. Chisel: Constructing Hardware in a Scala Embedded Language. In Proceedings of the 49th Annual Design Automation Conference (DAC '12). Association for Computing Machinery, New York, NY, USA, 1216–1225. https://doi.org/10.1145/2228360.2228584
[6] David Biancolin, Sagar Karandikar, Donggyu Kim, Jack Koenig, Andrew Waterman, Jonathan Bachrach, and Krste Asanovic. 2019. FASED: FPGA-Accelerated Simulation and Evaluation of DRAM. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '19). Association for Computing Machinery, New York, NY, USA, 330–339. https://doi.org/10.1145/3289602.3293894
[7] Buildroot. 2017. Buildroot manual. https://buildroot.org/downloads/manual/manual.html
[8] Christopher Celio, David A. Patterson, and Krste Asanović. 2015. The Berkeley Out-of-Order Machine (BOOM): An Industry-Competitive, Synthesizable, Parameterized RISC-V Processor.
[12] Vasileios Karakostas, Jayneel Gandhi, Furkan Ayar, Adrián Cristal, Mark D. Hill, Kathryn S. McKinley, Mario Nemirovsky, Michael M. Swift, and Osman Ünsal. 2015. Redundant Memory Mappings for Fast Access to Large Memories. SIGARCH Comput. Archit. News, 66–78.
[13] Donggyu Kim, Christopher Celio, David Biancolin, Jonathan Bachrach, and Krste Asanovic. 2017. Evaluation of RISC-V RTL with FPGA-Accelerated Simulation.
[14] Nikhita Kunati and Michael M. Swift. 2018. Implementation of Direct Segments on a RISC-V Processor. (2018).
[15] A. Magyar, D. Biancolin, J. Koenig, S. Seshia, J. Bachrach, and K. Asanović. 2019. Golden Gate: Bridging The Resource-Efficiency Gap Between ASICs and FPGA Prototypes. 1–8.
[16] Chang Hyun Park, Taekyung Heo, Jungi Jeong, and Jaehyuk Huh. 2017. Hybrid TLB Coalescing: Improving TLB Translation Coverage under Diverse Fragmented Memory Allocations. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). Association for Computing Machinery, New York, NY, USA, 444–456. https://doi.org/10.1145/3079856.3080217
[17] B. Pham, A. Bhattacharjee, Y. Eckert, and G. H. Loh. 2014. Increasing TLB reach by exploiting clustering in page translations. 558–567.
[18] Binh Pham, Viswanathan Vaidyanathan, Aamer Jaleel, and Abhishek Bhattacharjee. 2012. CoLT: Coalesced Large-Reach TLBs. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-45). IEEE Computer Society, USA, 258–269. https://doi.org/10.1109/MICRO.2012.32
[19] RISC-V Foundation. [n.d.]. riscv-pk. https://github.com/riscv/riscv-pk/tree/master/bbl
[20] RISC-V Foundation. [n.d.]. riscv-tests. https://github.com/riscv/riscv-tests
[21] SiFive. [n.d.]. Freedom-U-SDK. https://github.com/sifive/freedom-u-sdk
[22] Madhusudhan Talluri and Mark D. Hill. 1994. Surpassing the TLB Performance of Superpages with Less Operating System Support. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VI). Association for Computing Machinery, New York, NY, USA, 171–182. https://doi.org/10.1145/195473.195531
[23] Madhusudhan Talluri, Shing Kong, Mark D. Hill, and David A. Patterson. 1992. Tradeoffs in Supporting Two Page Sizes. In Proceedings of the 19th Annual International Symposium on Computer Architecture (ISCA '92).
[26] H. Wong, V. Betz, and J. Rose. 2014. Quantifying the Gap Between FPGA and Custom CMOS to Aid Microarchitectural Design. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 22, 10 (2014), 2067–2080.