A portable and Linux capable RISC-V computer system in Verilog HDL
arXiv [cs.AR], Mar
MIURA et al.: A PORTABLE AND LINUX CAPABLE RISC-V COMPUTER SYSTEM IN VERILOG HDL
PAPER
Junya MIURA † a) , Nonmember , Hiromu MIYAZAKI † b) , Student Member , and Kenji KISE † c) , Member
SUMMARY
RISC-V is an open and royalty-free instruction set architecture which has been developed at the University of California, Berkeley. Processors using RISC-V can be designed and released freely. Because of this, various processor cores and systems on chips (SoCs) have been released so far. However, there are only a few public RISC-V computer systems that are portable and can boot Linux operating systems.

In this paper, we describe a portable and Linux capable RISC-V computer system targeting FPGAs in Verilog HDL. This system can be implemented on an FPGA with few hardware resources, so it can be implemented on low cost FPGAs or customized by introducing an accelerator. This paper also describes the knowledge obtained through the development of this RISC-V computer system.

key words: RISC-V, FPGA, computer system, Linux, soft processor, Verilog HDL
1. Introduction
RISC-V [1] is an emerging instruction set architecture (ISA) which has been developed at the University of California, Berkeley. Because it is open and royalty free, processor cores and systems on chips (SoCs) based on RISC-V can be designed and released freely. For this reason, processor cores and SoCs that operate on Field-Programmable Gate Arrays (FPGAs), such as Rocket Chip [2] and BOOM [3], have been released. In addition, various software, including compilers and operating systems (OSs), has begun to support the ISA.

Linux has also been built for RISC-V, and many applications can run on it. However, there are still only a few RISC-V computer systems that support Linux. Although the Freedom U500 VC707 Dev Kit [4] from SiFive is available, it is not easy to use because it requires an expensive FPGA. Therefore, a portable RISC-V computer system that can easily be used on a small FPGA and supports Linux is expected to be used in various fields.

In RISC-V [5], many extended instruction sets are defined which can be added to the base integer instruction set named RV32I, which supports a 32-bit address space. Designers may select the supported instruction sets depending on how the system is used. For example, the M extension for integer multiplication and division, the C extension for the compressed instructions, and the F extension for the single-precision floating-point instructions are defined. The notation RV32IM indicates an ISA supporting RV32I and the M extension. To support Linux on a RISC-V system, supporting RV32IM is not sufficient; RV32IMA, which adds the A extension for the atomic instructions, is necessary.

We design a portable and Linux capable RISC-V computer system supporting RV32IMAC and implement it in Verilog HDL targeting a small FPGA. This paper also describes the knowledge obtained through the development of this system.

† The authors are with the School of Computing, Tokyo Institute of Technology.
2. Challenges to support Linux
We discuss the challenges of supporting Linux on a baseline computer system, such as an embedded system that is often used to control simple hardware devices.

We assume that the baseline system consists of a simple RV32I processor, a local memory, and LEDs for output. The processor in this system is a multi-cycle design that executes one instruction in five steps: instruction fetch (IF), operand fetch (OF), execution (EX), memory access (MEM), and write back (WB).

In the IF step, an instruction is read from the instruction memory using the program counter (PC) as an address. In the OF step, the register numbers and immediate values are decoded from the instruction and the operands are read from the register file. In the EX step, the operation is executed for add, sub, branch instructions and so on. In the MEM step, the main memory is accessed for load and store instructions with the memory address calculated in the EX step. In the WB step, the value calculated in the EX step or read in the MEM step is written back to the register file.

The M extension for RV32IM can be supported smoothly, because only small modifications are needed to implement the multiply and divide circuits in the EX step if we assume that multiplication and division can be executed in a single cycle.

2.1 Supporting the control and status register (CSR) instructions

RISC-V defines privilege levels, and the execution of some operations can be restricted by switching between them. There are three privilege levels: M-Mode (machine mode), S-Mode (supervisor mode), and U-Mode (user mode) [6]. M-Mode is the highest privilege mode and U-Mode is the lowest.

To identify the privilege mode, the control and status registers (CSRs) are used. These registers are also used for exceptions, for various identification purposes such as reporting the supported instruction extensions, and so on. To support these registers, it is necessary to implement the CSR instructions, which perform an atomic read-modify-write on CSRs.

We can implement the CSR instructions by reading the value of the CSR in the OF step, in the same way as the register file is accessed, executing the operation in the EX step, and writing the obtained value to the CSR in the WB step.

2.2 Supporting the atomic instructions

The atomic instructions consist of the load-reserved/store-conditional (LR/SC) instructions and the atomic memory operation (AMO) instructions.

The LR instruction loads data from the main memory and reserves its address. The SC instruction stores data to the main memory if a valid reservation exists on its address. This instruction also returns a value that indicates whether the store succeeded.

Complex atomic memory operations can be performed by using LR/SC instructions. To implement LR/SC instructions, it is necessary to implement registers which hold the reservation address and the reservation status. The LR instruction can be implemented by writing the reservation address and the status in the MEM step. The SC instruction can be implemented by checking the reservation address and status. Only small modifications are therefore needed to implement LR/SC instructions on the baseline computer system.

The AMO instructions perform memory read-modify-write operations. This means that an AMO instruction atomically loads a value, applies an operation, and writes the result to the memory. It is difficult to implement the AMO instructions on the baseline computer system because the processor has only one memory access, in the MEM step. To solve this problem, it is necessary to implement an additional memory access step and another execution step for the modify operation between these two memory accesses.
These changes require many design modifications and much debugging.

2.3 Supporting the virtual address space

To support the virtual address space of Linux, an address translation unit that translates virtual addresses to physical addresses is necessary.

Because each process has its own virtual address space, a virtual address must be translated to a physical address to access the physical memory. In practice, virtual addresses are managed in units of pages, and the translation is performed from virtual page addresses to physical page addresses. We call this translation from a virtual address to a physical address a page walk, and a translation lookaside buffer (TLB) is a cache that stores recently used translation information.

RISC-V defines that the address translation should be done by hardware. The TLB is accessed before the memory access when a virtual address is used. If the TLB misses, the page walk is done by hardware. Furthermore, if the page walk fails, a page fault exception occurs to inform the OS of the event. Also, each page has permissions for read, write, and execution. If the access does not have a valid permission, a page fault exception also occurs.

Fig. 1 The logical organization of the proposed computer system.

The page walk is defined as Sv32, which covers a 32-bit virtual address space and requires at most two memory read accesses. It is therefore difficult to support address translation on the baseline computer system.
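To illustrate why the Sv32 page walk needs up to two memory reads, the following Python sketch models the two-level lookup behaviourally. It is not taken from RVSoC: the function name `sv32_walk`, the `read_word` callback, and the `PageFault` exception are assumptions made for this example, and the U bit check, the A/D bit updates, and the MXR/SUM settings are deliberately omitted.

```python
PAGE_SIZE = 4096
# Permission bits in the low part of an Sv32 page table entry (PTE).
PTE_V, PTE_R, PTE_W, PTE_X = 1 << 0, 1 << 1, 1 << 2, 1 << 3


class PageFault(Exception):
    """Raised when the walk fails or the required permission is missing."""


def sv32_walk(read_word, root_ppn, vaddr, access):
    """Translate a 32-bit virtual address with a two-level Sv32 walk.

    read_word(paddr) is a hypothetical callback returning the 32-bit word
    at a physical address; access is 'r', 'w', or 'x'.
    """
    vpn = [(vaddr >> 12) & 0x3FF, (vaddr >> 22) & 0x3FF]  # VPN[0], VPN[1]
    offset = vaddr & 0xFFF
    table = root_ppn * PAGE_SIZE
    for level in (1, 0):  # at most two memory reads
        pte = read_word(table + vpn[level] * 4)
        if not (pte & PTE_V) or ((pte & PTE_W) and not (pte & PTE_R)):
            raise PageFault(hex(vaddr))
        if pte & (PTE_R | PTE_X):  # leaf entry found
            break
        table = (pte >> 10) * PAGE_SIZE  # pointer to the next-level table
    else:
        raise PageFault(hex(vaddr))  # no leaf entry after two levels
    required = {'r': PTE_R, 'w': PTE_W, 'x': PTE_X}[access]
    if not (pte & required):
        raise PageFault(hex(vaddr))
    ppn = pte >> 10
    if level == 1:  # leaf at the first level: 4 MiB superpage
        if ppn & 0x3FF:
            raise PageFault(hex(vaddr))  # misaligned superpage
        return (ppn << 12) | (vpn[0] << 12) | offset
    return (ppn << 12) | offset
```

In this model, a first-level entry that is a leaf ends the walk after one read, while an ordinary 4 KiB page costs two reads, which is the extra memory traffic the baseline five-step processor cannot accommodate.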
3. RVSoC: a portable and Linux capable RISC-V computer system on an FPGA
We design a portable and Linux capable RISC-V computer system for a small FPGA, which we name RVSoC.

Fig.1 shows the logical organization of RVSoC. It mainly consists of the main processor named RVCoreM, a memory management unit (MMU), the main memory, a disk, the console, and the micro controller named RVuc.

RVCoreM and RVuc are connected to the main memory, the disk, and the console through the MMU. The MMU has an Sv32 page walk unit to translate virtual addresses to physical addresses. It also has three TLBs, one each for the memory accesses of instruction fetch, data load, and data store. This is because pages have separate execute, read, and write permissions, so using three TLBs simplifies the implementation. The two processors access the main memory, the disk, and the console by using memory mapped I/O (MMIO).

3.1 RVCoreM: the main processor

Fig.2 shows the block diagram of RVCoreM. It is a multi-cycle processor that executes one instruction in twelve steps and supports RV32IMAC and CSRs.

The twelve steps are initialization (INI), IF, convert compressed instruction (CVT), instruction decode (ID), OF, execution (EX1), load data (LD), atomic operation execution (EX2), store data (SD), WB, update of CSRs and the program counter (COM), and increment of the instruction counter (FIN). INI and FIN are omitted in the figure because they are mainly used for debugging.

Fig. 2 The block diagram of RVCoreM that is the main processor supporting RV32IMAC.

In Fig.2, the gray-colored rectangles are registers which are updated at the positive edge of the clock signal. The orange-colored units are combinational circuits such as ALUs and adders. The yellow-colored unit is the main memory. Although three memories are depicted in this figure in the IF, LD, and SD steps, they are in fact the same physical memory. The green-colored units are the TLBs, the register file, and the CSRs, which are combinational circuits with asynchronous memories. The blue-colored units are combinational circuits used in the CVT and ID steps, respectively.

The AMO instructions execute a memory read-modify-write as described above: an AMO instruction loads the data, applies the operation, and stores the result to the main memory. To support the AMO instructions, two memory access steps and an execution step for the atomic operation are required. Therefore, the processor executes an AMO instruction by using the LD step for the first memory access, the EX2 step for the calculation, and the SD step for the second memory access, which together perform a memory read-modify-write.

To support the compressed instructions, the conversion from a compressed instruction to a standard instruction is done in the CVT step. The compressed instructions are 16-bit instructions that encode standard instructions meeting specific conditions. Because standard instructions are 32 bits wide, the code size can be reduced by using these 16-bit compressed instructions. Any compressed instruction can be expanded to an equivalent 32-bit standard instruction. Therefore, by expanding compressed instructions in the CVT step, it is possible to support them without adding complex dedicated circuits in later steps.

To improve processor performance, we apply two optimizations. One is the use of a multi-cycle divider, with which divide and remainder instructions are executed in about 32 cycles, to improve the operating frequency. If divide and remainder instructions were executed in one cycle like add or sub instructions, the divider circuit would become the critical path of the entire system and the operating frequency would drop significantly.

The other optimization is that the state transitions from IF to FIN are modified to skip some steps depending on the executed instruction, which reduces the number of elapsed clock cycles for some instructions. Without this, even an instruction that has no memory access would spend useless cycles in the LD, EX2, and SD steps, which would degrade performance.

Fig.3 shows the state transition diagram. The orange, green, and blue transitions have high priority. When their conditions are false, the transition of the black-colored arrow is applied. All instructions begin from the INI step.

Fig. 3 The optimized state transition diagram for the multi cycle design. The orange, green, and blue transitions are high priority. When these conditions are false, the black-colored transitions are used.

In the IF step, an instruction is fetched. Because of the compressed instructions, an instruction may cross cache lines, in which case the instruction fetch may require two main memory accesses. If no page fault occurs in the IF step, the next step is the CVT step. The steps from CVT to EX1 are necessary for such instructions.

In the EX1 step, instructions other than the atomic, load, and store instructions transition to the WB step. This is because, except for the atomic, load, and store instructions, it is sufficient to write back to the register file without accessing the main memory.

In the LD step, the load and store instructions do not need to execute an atomic operation. Therefore, load instructions transition to the WB step and store instructions transition to the SD step. With these transitions (the orange chain lines in Fig.3), the average number of execution cycles of the processor can be reduced compared to the case of always executing twelve steps.

The transition shown by the green arrow in Fig.3 is for requesting exception handling.
The exceptions in this figure are the page fault exception and the exception raised by the ECALL instruction. When these exceptions occur, the system registers are updated appropriately and the next instruction is executed, so the state transitions to the COM step. Although it is omitted from the figure, when no exception is detected and a stall signal is detected, the processor stalls in the current step.

3.2 Translation from the virtual to the physical address

As described above, each page has execute, read, and write access permissions, and three TLBs, one for each permission, are used. When the page does not grant the permission required by the access, the TLB does not hit. When the current step is IF, the TLB depicted in the IF step in Fig.2 is accessed. Similarly, in the LD and SD steps, the TLBs depicted in the LD and SD steps are accessed, respectively.

When a TLB miss occurs, a page walk is invoked to obtain the translation. The page walk is designed as a state machine of six states. This is because it is difficult to access the memory multiple times in one cycle, and it is necessary to change the memory access address and calculate the next address in each state.

In the first state of the page walk, the address of the first translation information is calculated, and the main memory is accessed with that address. In the second state, the obtained information is saved in a register. In the third state, the address of the next translation information is calculated, and the main memory is accessed again with that address. In the fourth state, the obtained information is saved in a register. In the fifth state, the success of the page walk is judged from the saved registers. In the sixth state, the TLB and the page table entry in the main memory are updated. These states enable a page walk of RISC-V Sv32.

3.3 I/O devices and their controller

As shown in Fig.1, we implement the console and the disk as I/O devices. The console is a device for keyboard input and character output, and the disk is a storage device with a file system.

These devices are accessed using VirtIO [7], an I/O framework. VirtIO is generally used by emulators, and accessing the actual I/O involves many main memory accesses and loops that depend on the contents of the memory data. If this operation were implemented in hardware, a complex circuit would be required. Therefore, a small processor named RVuc is implemented and used as an I/O controller to execute these complicated processes in software.

Fig.4 shows the block diagram of RVuc. It is a four-step multi-cycle processor that supports RV32I and has its own local memory.

Fig. 4 The block diagram of the I/O controller named RVuc.

The programs of the I/O processing for the console and the disk are stored in this local memory. RVuc can access the DRAM or the I/O control registers by using the w_data_addr, w_data_wdata, and w_data_data wires in Fig.4.

RVuc executes the program and performs the I/O processing when an I/O request arrives from RVCoreM. RVCoreM stops its operation while RVuc is operating. There are no additional TLBs or control registers for RVuc because RVuc does not use the virtual address space.

3.4 Memory access optimization

A direct-mapped, write-through cache is implemented to reduce the access latency to the main memory, and the memory access latency is one cycle when the cache hits. The cache block size is 16 bytes. For a simple implementation, a cache entry is invalidated when the corresponding entry is updated by a store instruction.

This cache is located between the main memory and the MMU. Because all the memory addresses presented to the cache are physical ones, no cache flush is needed on process switching. Note that this cache works as both an instruction cache and a data cache for both RVCoreM and RVuc.

The instruction fetch unit always requests 4 bytes of data from the cache. However, a 4-byte fetch may cross cache lines when compressed instructions are supported.
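The cache behaviour described in this subsection can be sketched as a small behavioural model in Python. This is only an illustration under stated assumptions, not the RVSoC implementation: the class name `DirectMappedCache`, the helper `crosses_line`, the default of 256 entries, and the bytearray standing in for the DRAM are all hypothetical. Only the direct mapping, the write-through policy, the 16-byte line, and the invalidate-on-store behaviour come from the text.

```python
LINE_BYTES = 16  # cache block size stated in the text


class DirectMappedCache:
    def __init__(self, memory, num_entries=256):  # entry count is assumed
        self.memory = memory              # bytearray standing in for DRAM
        self.num_entries = num_entries
        self.tags = [None] * num_entries  # None marks an invalid entry
        self.lines = [None] * num_entries
        self.hits = 0
        self.misses = 0

    def _split(self, paddr):
        """Return (index, tag) for a physical address."""
        line_no = paddr // LINE_BYTES
        return line_no % self.num_entries, line_no // self.num_entries

    def read(self, paddr, size=4):
        """Read `size` bytes that lie within a single cache line."""
        index, tag = self._split(paddr)
        if self.tags[index] == tag:
            self.hits += 1
        else:
            # Miss: refill the whole 16-byte line from memory.
            self.misses += 1
            base = (paddr // LINE_BYTES) * LINE_BYTES
            self.lines[index] = bytes(self.memory[base:base + LINE_BYTES])
            self.tags[index] = tag
        off = paddr % LINE_BYTES
        return self.lines[index][off:off + size]

    def write(self, paddr, data):
        """Write through to memory and invalidate a matching entry."""
        self.memory[paddr:paddr + len(data)] = data
        index, tag = self._split(paddr)
        if self.tags[index] == tag:
            self.tags[index] = None


def crosses_line(pc, size=4):
    """True when a fetch of `size` bytes at pc spans two cache lines."""
    return pc // LINE_BYTES != (pc + size - 1) // LINE_BYTES
```

With 2-byte-aligned instructions and 16-byte lines, `crosses_line(pc)` is true only when `pc % 16 == 14`, which is the case in which a 4-byte fetch needs two cache accesses.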
Fig. 5 The instruction fetching across two cache lines and the scheme using the 16-bit Buffer.
Fig.5 shows how 4-byte data is fetched across two cache lines. For simplicity, we assume that the line size is 8 bytes, and that Line1 and Line2 store contiguous blocks. If the value of the program counter is at position A in Fig.5, the 4 bytes of X and Y, which are 2 bytes each, can be fetched since the fetch does not cross cache lines. If the value of the program counter is at position B, Y and Z have to be fetched. In this case, two cache lines have to be accessed since Y and Z are on different cache lines. These two accesses increase the number of cycles through processor stalls and decrease performance.

We mitigate this problem by implementing a small buffer named the 16-bit Buffer. The 16-bit Buffer stores the upper 2 bytes of the previously fetched 4-byte instruction. In Fig.5, when the value of the program counter is at position A and X and Y are fetched, Y is stored in the 16-bit Buffer.

Then, when the value of the program counter is at position B for the next instruction, the processor fetches 4 bytes including Z. The fetched instruction is completed by concatenating the fetched Z with Y in the 16-bit Buffer and sending it to the CVT step. This operation reduces the number of cache accesses when fetching 4 bytes across a cache line.

3.5 Implementation targeting the Nexys A7 FPGA board

We describe the implementation issues of RVSoC in Verilog HDL targeting the Nexys A7 FPGA board from Digilent Inc.

Fig.6 shows the hardware organization implemented in Verilog HDL. The DRAM and the host PC are outside the FPGA. TLB_i, TLB_r, and TLB_w are the TLB modules for instruction, read, and write, respectively. There is no TLB between RVuc and the DRAM cache module because RVuc uses physical addresses in program execution. Access to the DRAM, the console registers, and the disk registers is controlled by the MMIO module.

The Nexys A7 board has 128MB of DDR2-SDRAM. All reads from the DRAM are performed in 16-byte units because the cache line size is 16 bytes.
Writes are performed in 1-byte, 2-byte, or 4-byte units when executing SB (store byte), SH (store halfword), and SW (store word), respectively. Access to this DRAM uses the Memory Interface Generator (MIG), an IP core from Xilinx, Inc. The operating frequency of the DRAM is 325MHz, which is the maximum operating frequency recommended by the Nexys A7 manual [8], and the operating frequency of the DRAM controller is 81.25MHz, which is 1/4 of the DRAM frequency.

Fig. 6 The organization of RVSoC implemented in Verilog HDL targeting the Nexys A7 FPGA board.

A 64MB area of the DRAM is used for the main memory, and the remaining 64MB area is used for the disk. Therefore, the system does not use any physical disk drive.

For the console input and output, serial communication is used. A FIFO buffer holding 16 characters of keyboard input is implemented to prevent characters from being missed during high-speed input. The serial communication is also used to initialize the contents of the main memory and the disk. The speed of the serial communication between the FPGA and the host PC is 8Mbaud.

Fig.7 shows the clock generation scheme on the FPGA using the 100MHz input clock. The dashed lines indicate input and output clocks to and from the FPGA.

Fig. 7 The clock generation scheme on an FPGA using the 100MHz input clock.

The 200MHz clock, which is the input clock of the MIG module, is generated from the 100MHz input clock by the module m_clkgen0. Using this 200MHz clock, the 325MHz clock for the DRAM and the 81.25MHz clock for the DRAM controller are generated. Then, using this 81.25MHz clock as input, the module m_clkgen1 generates the 104MHz clock for RVCoreM and RVuc. Since the memory controller running at 81.25MHz and the processors running at 104MHz operate at different frequencies, data is transferred between them using asynchronous FIFOs.

Table 1 lists the names of all RTL design files and shows the number of lines of Verilog HDL code for each file and their total. Note that some necessary header files are not listed in this table.

Table 1 The number of lines of Verilog HDL code for each file and their total.

  File         Lines    File         Lines
  console.v      119    memory.v       841
  debug.v        161    microc.v       239
  disk.v         118    mmu.v          798
  dram.v         515    rvcorem.v    1,263
  loader.v       143    top.v          557
  main.v         359    total        5,113

Table 2 The main parameters for the evaluation.

  RVSoC version           v0.4.3
  Core clock frequency    104 MHz
  DRAM clock frequency    325 MHz
  TLB entries             32 ×

The file main.v is the top module for connecting RVCoreM to the MMU and for controlling the LED output on the FPGA. rvcorem.v, microc.v, and mmu.v are the files of RVCoreM, RVuc, and the MMU, respectively. dram.v is the file of the DRAM controller used to access the DRAM, memory.v defines the cache and the local memory of RVuc, loader.v receives the initialization files and sends and receives terminal data, console.v and disk.v control the respective device registers, and debug.v displays debug information on the terminal.

The files used for simulation total 5,113 lines. The file top.v is used only for simulation; the number of lines of Verilog HDL code for FPGA logic synthesis, excluding top.v, is 4,556.
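The totals quoted above can be cross-checked from the per-file counts in Table 1, and the DRAM controller frequency from the 1/4 ratio given earlier in this subsection. The short Python sketch below simply transcribes those figures; the variable names are illustrative and are not part of RVSoC.

```python
# Line counts transcribed from Table 1 (top.v is used only for simulation).
lines = {
    "console.v": 119, "memory.v": 841,
    "debug.v": 161, "microc.v": 239,
    "disk.v": 118, "mmu.v": 798,
    "dram.v": 515, "rvcorem.v": 1263,
    "loader.v": 143, "top.v": 557,
    "main.v": 359,
}

total = sum(lines.values())        # all files, including the simulation top
synth = total - lines["top.v"]     # files used for FPGA logic synthesis
print(total, synth)                # prints: 5113 4556

# The DRAM controller runs at 1/4 of the 325 MHz DRAM clock.
dram_mhz = 325
ctrl_mhz = dram_mhz / 4
print(ctrl_mhz)                    # prints: 81.25
```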
4. Verification and evaluation
The Linux kernel to be executed is version 4.15.0, and the root file system was built for two configurations using Buildroot [9]. One targets RV32IMAC (with the compressed instructions). The other targets RV32IMA (without the compressed instructions).

4.2 Verification

We verified RVSoC by using Verilog simulation, a software simulator written in C++, and an FPGA board. As the software simulator, we use SimRV [10], which can emulate a RISC-V computer system.

For each executed instruction, SimRV can output the architectural state, which includes the contents of the program counter, the instruction register, the general purpose registers, the CSRs, and the TLBs. Similarly, we added to the RTL of RVSoC a function that outputs the same information in the same format as SimRV.

For verification, the architectural states obtained from SimRV and from the Verilog simulation of RVSoC were compared. Synopsys VCS is used for the simulation to obtain the architectural state of RVSoC. From the comparison, we confirmed that the architectural states match completely for all simulated instructions.

4.3 Evaluation

Fig.8 shows a photo of RVSoC working on the Nexys A7 FPGA board (left) and a screenshot of TeraTerm (right). The number displayed on the 7-segment LED on the FPGA board is the number of executed instructions in hexadecimal. The right side of the figure is a screenshot of the sl command running on TeraTerm, displayed through serial communication. Besides the sl command, various Linux commands such as top, sleep, and vi can be executed.

Fig. 8 The photo of RVSoC on Nexys A7 FPGA board (left) and a screenshot of the execution of the sl command on TeraTerm (right).

Fig.9 shows the implementation result of RVSoC on an xc7a100tcsg324-1 FPGA. In this figure, the yellow blocks are RVCoreM, the purple blocks are RVuc, the orange blocks are the DRAM controller, the green blocks are the cache, and the light blue blocks are the TLBs.

Fig. 9 The implementation result of RVSoC on a xc7a100tcsg324-1 FPGA.

Table 3 shows the hardware resources occupied by RVSoC, where the numbers in parentheses indicate the percentage of the whole FPGA resources. When targeting the xc7a100tcsg324-1 FPGA, the utilization of each hardware resource is less than 30%, and there is much space to implement additional logic such as accelerators.

Table 3 The hardware resources.

  Registers       LUTs              BRAMs
  6,379 (5.0%)    10,421 (16.4%)    38 (28.2%)

Table 4 shows the cycles per instruction (CPI) of RVCoreM measured until the login screen of Linux is displayed on the console. In the Comp configuration, all programs including the Linux kernel are compiled targeting RV32IMAC with compressed instructions. In the No-comp configuration, all programs are compiled targeting RV32IMA without compressed instructions.

Table 4 The cycles per instruction (CPI) on two configurations.

  Config     Executed insns    Cycles            CPI
  Comp       66,067,456        1,213,305,856     18.4
  No-comp    66,760,704        1,233,961,984     18.5

The second and third columns in this table indicate the number of executed instructions and the number of elapsed cycles, respectively. Cycles are not counted while RVuc is executing the I/O processing and RVCoreM is stalled.

Although the minimum number of cycles to execute one instruction is eight, the average numbers of execution cycles per instruction are 18.4 and 18.5 because DRAM accesses take many clock cycles.

Table 5 shows the cache hit rate and the misses per kilo instructions (MPKI) for the two configurations, measured in simulation over the 61M instructions executed until the Linux login screen was displayed. The hit rates of the cache are more than 93% for both configurations, which shows that the cache is effective for the implementation of this computer system.

Table 5 The cache hit rate and the miss per kilo instruction (MPKI).

  Config     Access num    Hit num       Hit rate (%)    MPKI
  Comp       86,738,837    82,231,973    94.8            73.88
  No-comp    77,055,765    71,725,363    93.1            87.38

Fig.10 shows the performance comparison of RVSoC and some computer systems, using Dhrystone MIPS (DMIPS) as the metric when the Dhrystone benchmark [11] from the Buildroot package is executed. The evaluation targets are Intel i386DX [12], Intel i486DX [12], SimRV†, and RVSoC. RVSoC achieves almost the same performance as the Intel i486DX, a processor from about 30 years ago. The software simulator SimRV achieves higher performance than RVSoC running on an FPGA, but we see that the difference is not significant.

Fig. 10 The performance comparison of RVSoC and some computer systems.

Table 6 shows the time from the synthesis using Vivado to the display of the Linux login prompt on a Nexys A7 FPGA board. The synthesis and implementation are run on an Intel Core i9 7920X with 64GB DDR4-SDRAM running Ubuntu 18.04. The five columns indicate the time for (1) synthesis, (2) place, route, and bitfile generation, (3) transferring the boot loader image of 9.1MB and the disk image of 16MB, (4) booting Linux, and (5) their total.

Table 6 The time from the synthesis using Vivado to the display of the Linux login prompt.

  Synthesis    Imple & bit gen    Initialize    Boot      Total
  124 sec      161 sec            33 sec        12 sec    330 sec

Although the maximum size of the disk is 64MB, only a 16MB image is used because a larger data transfer lengthens the initialization time. This system is easy to use because it can be brought up in less than 6 minutes in total, including synthesis, implementation, and bit file generation time.
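The derived columns of Tables 4 and 5 can be reproduced from the raw counts. The Python sketch below is only an illustration and is not part of the measurement flow. It assumes that MPKI is normalized by the 61M simulated instructions mentioned in the text (taken here as exactly 61,000,000) rather than by the instruction counts of Table 4; the function and variable names are our own.

```python
def cpi(cycles, insns):
    """Average cycles per executed instruction."""
    return cycles / insns


def hit_rate_percent(accesses, hits):
    """Cache hit rate as a percentage of all accesses."""
    return 100.0 * hits / accesses


def mpki(accesses, hits, insns):
    """Cache misses per thousand executed instructions."""
    return (accesses - hits) / (insns / 1000.0)


SIM_INSNS = 61_000_000  # "61M" instructions; exact value is an assumption

# (executed instructions, cycles) from Table 4
table4 = {"Comp": (66_067_456, 1_213_305_856),
          "No-comp": (66_760_704, 1_233_961_984)}
# (cache accesses, cache hits) from Table 5
table5 = {"Comp": (86_738_837, 82_231_973),
          "No-comp": (77_055_765, 71_725_363)}

results = {}
for name in table4:
    insns, cycles = table4[name]
    accesses, hits = table5[name]
    results[name] = (round(cpi(cycles, insns), 1),
                     round(hit_rate_percent(accesses, hits), 1),
                     round(mpki(accesses, hits, SIM_INSNS), 2))
    print(name, *results[name])
# prints: Comp 18.4 94.8 73.88
# prints: No-comp 18.5 93.1 87.38
```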
† Running on an Intel Core i7 870 machine with 8GB DDR3-SDRAM and Ubuntu 16.04.

5. Related works and discussion

[...] is hard to use for those who are starting to develop RISC-V software.

Litex-VexRiscv [16] is a computer system equipped with a processor called VexRiscv [17]. VexRiscv is a processor that supports Linux and RV32IMA and is written in SpinalHDL [18], an original HDL based on Scala. An instruction cache and a data cache are implemented. This computer system can be implemented on many FPGA boards by using the FPGA design / SoC builder called Litex [19]. However, because VexRiscv is written in SpinalHDL, it is not easy to use, as Verilog HDL and SystemVerilog are the dominant languages used to implement processors [20].

5.2 Obtained knowledge from the development

The development period of RVSoC was about half a year, and through the development we obtained the following findings: (1) the improvement of debugging efficiency by using a high-speed simulator written in C++ during system design, (2) the importance of a function that outputs the architectural state, (3) the importance of a function to restart the simulation from any point, and (4) the benefits of performing complex processing with a small processor. We explain each of these in turn.

We used SimRV, which is written in C++, to design RVSoC. Because Verilog simulation is much slower than C++ simulation, using a high-speed software simulator such as SimRV makes it easier to design the processor. SimRV has a hardware-like design, so design changes to the system can easily be applied to the Verilog HDL code. As a result, it was possible to design RVSoC more quickly than when designing it using only Verilog HDL code.

The output function of the architectural state is used to find implementation differences between the software simulator and the hardware design in Verilog HDL. By adding this output function, it is possible to easily identify the points where a difference occurred.
This has made it easier to find and fix bugs in the hardware design quickly.

The function that can restart the Verilog HDL simulation from the middle is necessary because the Verilog HDL simulation is slower than the software simulator. The target of the implementation for running Linux in the simulation was to reach the 61Mth instruction, at which the Linux login screen is displayed, but executing this simulation with Synopsys VCS would take more than 30 minutes and the log data would be enormous. This means a long wait when debugging the operation in the latter half of the simulation. Therefore, the architecture state and the contents of the memory and the disk are all saved in one log file at the end of a simulation, which made it possible to simulate only the necessary part, starting from the time when the log file was acquired. As a result, this restart function contributed to shortening the development period.

The complex processing is executed by a micro controller, which is a small processor. The software simulators use complex I/O processing. If the I/O processing implemented in hardware differs from the processing used in the software simulators, the architectural state will not match that of the software simulator because different I/O processing is executed. This makes debugging that involves I/O processing difficult. Therefore, implementing a micro controller that performs exactly the same processing as the software simulation not only simplified the system but also made debugging easy.

5.3 Expected usage of RVSoC

We are planning to release the RTL code of the designed RVSoC as an open and royalty free RTL design.

Because RVSoC is a computer system that supports Linux and uses a small amount of hardware resources, it can be applied to various purposes.

A feature of RISC-V is that it leaves room for instruction extensions by computer system developers. 
The ability to extend the ISA can be a basic requirement for application-specific accelerators, and it enables the implementation of more specialized instruction sets. For example, the RISC-V processor core of the PULP Platform [21] has improved performance by adding some packed-SIMD instructions, some bit manipulation instructions, and so on. The small resource usage of RVSoC makes it suitable for implementing various accelerators and special processor cores by adding unique instructions, and for developing the related software.

The Verilog HDL code of RVSoC is about 5,000 lines, so it is relatively easy to understand the entire implementation of the Linux capable computer system. Therefore, it is suitable as a sample computer system for computer science education.
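The debugging flow of Section 5.2, which compares the architecture state output of the software simulator with that of the Verilog HDL simulation and then restarts from a saved state, can be sketched as follows. This is a minimal illustration in Python and not the tooling used for RVSoC. The trace line format (instruction count, program counter, and register values in hexadecimal), the function names, and the idea of keeping several saved states indexed by instruction count are all assumptions made for the example.

```python
import bisect
from typing import Iterable, List, Optional, Tuple

# Hypothetical trace record: (instruction count, program counter, registers).
State = Tuple[int, int, Tuple[int, ...]]


def parse_line(line: str) -> State:
    """Parse one line of the assumed format '<count> <pc hex> <reg hex> ...'."""
    fields = line.split()
    count = int(fields[0])
    pc = int(fields[1], 16)
    regs = tuple(int(f, 16) for f in fields[2:])
    return (count, pc, regs)


def first_divergence(
    reference: Iterable[str], dut: Iterable[str]
) -> Optional[Tuple[State, State]]:
    """Return the first pair of states that differ, or None if the traces agree."""
    for ref_line, dut_line in zip(reference, dut):
        ref_state, dut_state = parse_line(ref_line), parse_line(dut_line)
        if ref_state != dut_state:
            return ref_state, dut_state
    return None


def nearest_checkpoint(checkpoints: List[int], count: int) -> Optional[int]:
    """Return the latest saved state at or before `count` (list sorted ascending)."""
    i = bisect.bisect_right(checkpoints, count)
    return checkpoints[i - 1] if i > 0 else None


# Made-up traces: the third instruction writes a different register value.
sim_trace = [
    "1 80000000 00000000 00000000",
    "2 80000004 00000010 00000000",
    "3 80000008 00000010 00000020",
]
rtl_trace = [
    "1 80000000 00000000 00000000",
    "2 80000004 00000010 00000000",
    "3 80000008 00000010 00000021",
]

mismatch = first_divergence(sim_trace, rtl_trace)
restart_at = nearest_checkpoint([0, 2], mismatch[0][0])
print("first mismatch at instruction", mismatch[0][0])
print("restart the Verilog HDL simulation from instruction", restart_at)
```

Comparing the traces line by line locates the first instruction at which the hardware design departs from the simulator. Restarting from the closest saved state before that instruction avoids re-simulating the whole boot sequence.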
6. Conclusion
A Linux capable computer system has to support the CSR instructions, the atomic instructions, and the virtual address space. The atomic instructions are difficult to implement on a simple computer system because some of them access the main memory twice per instruction.

We proposed RVSoC, a portable and Linux capable RISC-V computer system implemented in Verilog HDL. It mainly consists of a processor named RVCoreM, a memory management unit, the main memory, a disk, the console, and a micro controller named RVuc. RVCoreM is a twelve-step, multi-cycle processor that supports RV32IMAC, and it supports the atomic instructions by implementing two memory access steps. RVuc is a small processor used for the disk and console accesses.

The evaluation results show that RVSoC can be implemented with a small amount of hardware resources: about 5% of the registers, about 16% of the LUTs, and about 28% of the BRAMs of the target FPGA. The RTL code of the system is about 5,000 lines, which makes it easy to understand the implementation. The time from the synthesis using Vivado to the display of the Linux login screen is less than 6 minutes. Such a short development time makes the proposed system portable and easy to use.
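Why an atomic instruction needs two memory accesses can be illustrated with the semantics of AMOADD.W from the RISC-V "A" extension, which loads a word, adds a register value to it, stores the sum back, and returns the old value in the destination register. The following Python sketch is a behavioral model written for this explanation only. It is not the RVCoreM implementation, and the `Memory` class and its access counter are hypothetical helpers.

```python
MASK32 = 0xFFFFFFFF


class Memory:
    """Toy word memory that counts its accesses (hypothetical helper)."""

    def __init__(self):
        self.words = {}
        self.accesses = 0

    def load(self, addr):
        self.accesses += 1
        return self.words.get(addr, 0)

    def store(self, addr, value):
        self.accesses += 1
        self.words[addr] = value & MASK32


def amoadd_w(mem, regs, rd, rs1, rs2):
    """Behavioral model of AMOADD.W on RV32: rd = M[rs1]; M[rs1] = rd + rs2."""
    addr = regs[rs1]
    src = regs[rs2]              # read rs2 first, in case rd and rs2 coincide
    old = mem.load(addr)         # first memory access: read the old value
    mem.store(addr, old + src)   # second memory access: write the sum
    if rd != 0:                  # x0 is hardwired to zero
        regs[rd] = old


mem = Memory()
mem.store(0x1000, 7)
mem.accesses = 0                 # count only the accesses of the AMO itself

regs = [0] * 32
regs[10] = 0x1000                # address in x10
regs[11] = 5                     # addend in x11

amoadd_w(mem, regs, rd=12, rs1=10, rs2=11)
print(regs[12], mem.words[0x1000], mem.accesses)
```

A single-access pipeline step cannot cover both the load and the store, which is why a design that supports such instructions needs a second memory access step.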
Acknowledgments
This work was supported by JSPS KAKENHI Grant Number JP16H02794. This work was also supported by the VLSI Design and Education Center (VDEC), the University of Tokyo, in collaboration with Synopsys, Inc. We thank Mr. Kuroda for creating SimRV.
References
[1] RISC-V Foundation, “RISC-V.” https://riscv.org/.
[2] K. Asanovic, R. Avizienis, J. Bachrach, et al., “The Rocket Chip Generator,” Tech. Rep. UCB/EECS-2016-17, EECS Department, University of California, Berkeley, Apr 2016.
[3] C. Celio, D.A. Patterson, and K. Asanovic, “The Berkeley Out-of-Order Machine (BOOM): An Industry-Competitive, Synthesizable, Parameterized RISC-V Processor,” Tech. Rep. UCB/EECS-2015-167, EECS Department, University of California, Berkeley, Jun 2015.
[4] SiFive, “freedom: Source files for SiFive’s Freedom platforms.” https://github.com/sifive/freedom.
[5] A. Waterman and K. Asanovic, The RISC-V Instruction Set Manual, Volume I: User-Level ISA, Document Version 2.2. RISC-V Foundation, May 2017.
[6] A. Waterman and K. Asanovic, The RISC-V Instruction Set Manual, Volume II: Privileged Architecture Version 1.10. RISC-V Foundation, May 2017.
[7] OASIS, Virtual I/O Device (VIRTIO), version 1.0 ed., 2014.
[8] Digilent, “Nexys A7 FPGA Board Reference Manual,” July 2019.
[9] “Buildroot.” https://buildroot.org/.
[10] K. Kuroda, “Design and implementation of a RISC-V emulator running Linux,” Master’s thesis, Tokyo Institute of Technology, 2019.
[11] R.P. Weicker, “Dhrystone: A Synthetic Systems Programming Benchmark,” Commun. ACM, vol.27, no.10, pp.1013–1030, Oct. 1984.
[12] F. Zappa and S. Esculapio, Microcontrollers. Hardware and Firmware for 8-bit and 32-bit devices, LIGHTNING SOURCE Incorporated, 2017.
[13] J. Bachrach, H. Vo, B. Richards, et al., [...]
[...] et al., “Near-Threshold RISC-V Core With DSP Extensions for Scalable IoT Endpoint Devices,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol.25, no.10, pp.2700–2713, Oct 2017.
Junya Miura received the B.E. degree from the Department of Computer Science, Tokyo Institute of Technology, Japan, in 2018. He is currently a master course student at the Graduate School of Computing, Tokyo Institute of Technology, Japan. His research interests are computer architecture, high performance computing, and FPGA computing.
Hiromu Miyazaki received the B.E. degree from the Department of Computer Science, Tokyo Institute of Technology, Japan, in 2019. He is currently a master course student at the Graduate School of Computing, Tokyo Institute of Technology, Japan. His research interests are computer architecture and FPGA computing. He is a student member of IEICE.