An Embedded RISC-V Core with Fast Modular Multiplication
Ömer Faruk Irmak, Arda Yurdakul
Abstract—One of the biggest concerns in IoT is privacy and security. Encryption and authentication need big power budgets, which battery-operated IoT end-nodes do not have. Hardware accelerators designed for specific cryptographic operations provide little to no flexibility for future updates. Custom instruction solutions are smaller in area and provide more flexibility for new methods to be implemented. One drawback of custom instructions is that the processor has to wait for the operation to finish. Eventually, the response time of the device to real-time events gets longer. In this work, we propose a processor with an extended custom instruction for modular multiplication, which blocks the processor for, typically, two cycles for any size of modular multiplication when used in Partial Execution mode. We adopted the embedded and compressed extensions of RISC-V for our proof-of-concept CPU. Our design is benchmarked on recent cryptographic algorithms in the field of elliptic-curve cryptography. Our CPU with 128-bit modular multiplication operates at 136 MHz on ASIC and 81 MHz on FPGA. It achieves up to 13x speedup over software implementations while reducing overall power consumption by up to 95%, with 41% average area overhead over our base architecture.
Index Terms—RISC-V, IoT, ECC, custom instruction, extension
I. INTRODUCTION

The IoT market has been one of the driving forces of embedded hardware. A key enabler of IoT is cheap and capable hardware. There are multiple efforts, both in academia and industry, that aim to bring costs lower while making hardware more efficient at the tasks it is designed to perform. IoT end-node hardware should be secure and designed with power consumption in mind. Therefore, there are efforts on both designing new lightweight algorithms [1] that suit less powerful processors better and designing specialized hardware that tackles the heavy operations more efficiently [2]. Custom instructions can be utilized for accelerating cryptographic operations. Fundamental and complex operations in cryptography can be mapped to custom instructions and implemented in hardware with fewer resources compared to full custom accelerators. This makes using the same hardware for different algorithms possible, as custom instructions can be utilized in the realization of any algorithm. If the current algorithm turns out to be vulnerable, different solutions can be implemented via a software update without a significant performance penalty.

Ömer Faruk Irmak was with the Department of Computer Engineering, Boğaziçi University, Bebek 34342, Istanbul, Turkey. E-mail: [email protected]
Arda Yurdakul was with the Department of Computer Engineering, Boğaziçi University, Bebek 34342, Istanbul, Turkey. E-mail: [email protected]
Manuscript received September 30, 2020.
In this work, we have designed a microprocessor core with its ISA extended with a custom instruction for Montgomery multiplication. Modular multiplication is heavily utilized in public-key cryptography. Our proposed custom instruction implementation can be executed both atomically and partially in short iterations, and therefore does not degrade system response time. We implemented the Embedded and Compressed extensions of RISC-V (RV32EC) [3] as the base ISA of our proof-of-concept CPU. The design is benchmarked with operations on various cryptographic elliptic curves. Synthesis is done for both FPGA and ASIC targets to collect area and power consumption metrics. Our contributions can be summarized as follows:
• Propose a multiprecision MMUL custom instruction
• Propose a method to partition the runtime of a long-latency custom instruction to increase the responsiveness of the CPU to external events
• Analyze different RISC-V instruction encodings available to be used for custom instructions
In the literature, there are plenty of studies on adding custom instructions to RISC-V. Yet, to our knowledge, none of them studies the effects of blocking the processor with a custom instruction or the effects of the encoding.

II. MONTGOMERY MULTIPLICATION INSTRUCTION FOR RISC-V ISA

Modular multiplication is the operation P = (A * B) mod N. One of the key efficient algorithms in this area is Montgomery multiplication [4]. For operands with a length of n bits, Montgomery multiplication calculates MMUL(A, B, N) = (A * B * R^-1) mod N, where R = 2^n, 2^(n-1) < N < 2^n and gcd(R, N) = 1. We chose the Radix-2 Montgomery Multiplication (R2MM) algorithm [5] for the implementation. R2MM is suitable for a simple hardware implementation as it is composed of additions and shifts.

In RISC-V, different instruction formats have already been defined. Some of them can be seen in Figure 1. Regardless of the instruction encoding, we decided that the MMUL instruction should work on memory addresses, unlike any instruction in the RISC-V specification, which strictly works on register values. When it comes to multiprecision operations, defining a unified interface on memory addresses is more performant. The key point that has to be made clear is the layout of operands in memory. Constraining how operands should be arranged may result in lower performance, as it may require application code to rearrange operands in memory. MMUL requires 3 memory addresses for the inputs and a single memory address for the output. The length of the operands must be encoded in the instruction for flexibility. Operand length may be limited by the hardware implementation of the MMUL instruction. In our reference design, the maximum operand length is a hardware constraint that is defined at the synthesis phase.

I-type:  imm  | rs1 | fnc3 | rd | opcode
R-type:  fnc7 | rs2 | rs1 | fnc3 | rd | opcode
R4-type: rs3 | fnc2 | rs2 | rs1 | fnc3 | rd | opcode
Fig. 1. Candidate RISC-V instruction formats
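As an illustrative software model of the algorithm just described (our own sketch, not the paper's RTL; function and parameter names are ours), radix-2 Montgomery multiplication reduces each bit of the multiplier to a conditional addition and a right shift:

```python
def r2mm(a, b, n_mod, n_bits):
    """Radix-2 Montgomery multiplication (behavioral model).
    Returns a * b * 2^(-n_bits) mod n_mod for an odd modulus n_mod."""
    p = 0
    for i in range(n_bits):
        p += ((a >> i) & 1) * b    # add b if bit i of a is set
        if p & 1:                  # make p even by adding the (odd) modulus
            p += n_mod
        p >>= 1                    # exact division by 2
    if p >= n_mod:                 # single final conditional subtraction
        p -= n_mod
    return p
```

Each loop iteration is an add/shift pair, which is what makes a compact hardware implementation straightforward: no trial division and no wide multiplier are needed.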
If the application can guarantee that all operands will be at a certain offset from a base address in memory, as shown in Figure 2, a single memory address stored in rs1 is enough for the input operands. Thus, the I-type instruction format can be used. Fields fnc3 and imm provide 15 bits in the instruction to be used for encoding the length, which, if the length is encoded in bits, gives a maximum of 32768-bit operands.
Fig. 2. Memory layout for I-type (left) and R-type (right) MMUL
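To make the field layouts of Figure 1 concrete, here is a small sketch (helper name and example values are hypothetical, not part of the paper's toolchain) that packs an R4-type instruction word; bit positions follow the standard RISC-V base encoding:

```python
def encode_r4(opcode, rd, fnc3, rs1, rs2, fnc2, rs3):
    """Pack an R4-type RISC-V instruction word (illustrative helper).
    Layout: rs3[31:27] | fnc2[26:25] | rs2[24:20] | rs1[19:15]
            | fnc3[14:12] | rd[11:7] | opcode[6:0]"""
    assert opcode < 2**7 and fnc3 < 2**3 and fnc2 < 2**2
    assert max(rd, rs1, rs2, rs3) < 2**5   # 5-bit register specifiers
    return ((rs3 << 27) | (fnc2 << 25) | (rs2 << 20) | (rs1 << 15)
            | (fnc3 << 12) | (rd << 7) | opcode)
```

Once the opcode and the four register fields are fixed, only fnc3 and fnc2 (5 bits in total) remain free, which is why this format cannot encode the operand length in bits.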
Likewise, if the multiplicand and multiplier are guaranteed to always be in fixed positions relative to each other but the modulus may be at an arbitrary address, as shown in Figure 2, the R-type format can be used. Two source registers, rs1 and rs2, would be used as base addresses. The fnc3 and fnc7 fields give 10 bits of space, which enables, if the length is encoded in bits, a maximum of 1024-bit operands.

Lastly, for the best performance in all cases, if the R4-type format is used, all operands can be at independent addresses stored in rs1, rs2 and rs3, as shown in Figure 3. This format leaves only 5 bits (fnc3 and fnc2), which is not enough for the length to be encoded in bits. Encoding the operand length in words is another option, which makes 1024-bit (2^5 * 32) operands possible for RV32 and 2048-bit (2^5 * 64) operands for RV64.

In this work, we decided to use the R4-type instruction format because it imposes no memory layout restrictions. Using the GCC directive .insn [6] in this decision process sped up the development.

As can be seen in Figure 4, MMUL is coupled with the datapath of the processor. Addresses of the operands are read directly from their respective registers of the Register File and fed to the ALU in the datapath. The memory address to be worked on is calculated in the ALU by adding the offset value supplied by MMUL to the base address read from the Register File. The LSU is triggered by the MMUL module to load from or store to the calculated address. Operands are loaded at the start of the execution and kept in the MMUL module during the entire operation. All execution is controlled by MMUL itself.

Fig. 3. Memory layout for R4-type MMUL
Fig. 4. Integration of MMUL in the datapath

III. PARTIAL EXECUTION MODE
R2MM [5] has a loop with n iterations and one final subtraction after exiting the loop. Our implementation takes 2 clock cycles for each loop iteration and one last cycle for the subtraction. In total, one MMUL operation takes 2n + 1 clock cycles for calculations, 3 * WORDS memory load operations for fetching the three operands, and WORDS memory write operations for writing back the result, where WORDS = ceil(n / WORD_SIZE).

As the instructions are atomic, during an MMUL operation the processor will be unresponsive to any event that may happen. For some applications this may be problematic because of their real-time constraints. To remedy this, we can move the loop in our algorithm from hardware to software, allowing our processor to service interrupts in between loop iterations.

Fig. 5. MMUL partial execution time diagram
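The idea can be sketched behaviorally (our own model, with hypothetical names; it captures the software-visible effect, not the RTL): the n-iteration loop moves into application code, and each call advances the hardware state by exactly one bit before retiring, leaving a window in which pending interrupts can be serviced.

```python
import math

WORD_SIZE = 32  # RV32 word size in bits

def mmul_words(n_bits):
    # WORDS = ceil(n / WORD_SIZE): memory words per operand
    return math.ceil(n_bits / WORD_SIZE)

def mmul_step(state, a, b, n_mod):
    """One partial-execution call: process a single bit of A, then retire."""
    p, i = state
    p += ((a >> i) & 1) * b        # conditional add of B
    if p & 1:                      # keep p even so the shift is exact
        p += n_mod
    return (p >> 1, i + 1)         # one radix-2 iteration done

def mmul_partial(a, b, n_mod, n_bits):
    """Software loop replacing the hardware loop: n calls for n bits."""
    state = (0, 0)
    for _ in range(n_bits):        # interrupts could be taken between calls
        state = mmul_step(state, a, b, n_mod)
    p = state[0]
    return p - n_mod if p >= n_mod else p   # final conditional subtraction
```

If the loop is never interrupted, the sequence of steps performs exactly the same computation as the atomic version, which is why the performance penalty is small.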
To achieve this behaviour, which we call partial execution, our implementation has a special Control and Status Register (CSR). If partial execution is enabled by a write with the csrrw instruction to this register, which is directly connected to the "Execution Mode Select" signal in Figure 4, the current MMUL instruction is retired after each bit is processed. Application code has to execute another MMUL instruction for each bit of the operands, i.e., n calls to MMUL for an n-bit Montgomery multiplication, as shown in Figure 5. The first call to MMUL does the memory load operations, while the last call writes back the result. In this case, the maximum latency of the MMUL instruction drops to either 3 * WORDS memory load operations + 2 cycles or WORDS memory write operations + 3 cycles, depending on the memory operation latencies. The performance penalty of this, which will be presented later, is minimal when used with loop unrolling.

Fig. 6. Benchmark speedups: Base RV32EC Architecture (BA), Custom Instruction with Atomic Execution (CI-AE), Custom Instruction with Partial Execution (CI-PE)

IV. ANALYSIS
A. Base Architecture
To set a baseline for our work, we designed an in-order, 2-stage RV32EC core with minimal area while maintaining a comparable level of performance. CoreMark and Dhrystone are run both on microriscy [7] and on our core using the same memory modules. While our core scored 0.905 in CoreMark and 1805 in Dhrystone, microriscy [7] scored 0.878 and 1644, respectively. Even though our core is slightly faster than microriscy, they can be considered equal in terms of performance.
B. Benchmarks
Our design is benchmarked with multiple ECC curves. Software implementations of FourQ (128-bit) [8], NIST P-256 (256-bit) [9], Curve25519 (256-bit) [10] and ARIS (an authentication scheme based on FourQ) [11] are run on our processor and on the microriscy/zeroriscy [7] cores from PULP as the reference designs. Then, the modular multiplication and squaring implementations are replaced with a sequence of MMUL instructions and run on our modified core. No modifications are made to any other part of the code.

In Figure 6, the speedup of the ECC benchmarks can be seen. Full-software benchmarks are run on microriscy/zeroriscy [7] and on our base architecture (BA) as the control group. Tests where the modular multiplication operation is implemented with our custom instruction are labeled Custom Instruction with Atomic Execution (CI-AE) and Custom Instruction with Partial Execution (CI-PE). There is a significant speedup in all curve operations. This speedup contributes to lowering total energy consumption. Our experimental setup uses a memory unit with single-cycle read latency. Longer-latency memories like EEPROM and flash are widely used as instruction memories. If a longer-latency instruction memory were used in the benchmarks, the results would be even more in favour of our implementation. The MMUL instruction takes multiple cycles, thus allowing a new instruction to be fetched before it finishes; therefore, the CPU is less likely to be stalled waiting for new instructions.

The performance penalty of partial execution is negligible when paired with loop unrolling. Depending on the compiler output, if not interrupted, a Montgomery multiplication can be executed in the same amount of time as atomic execution.

For power consumption analysis, FPGA tools are used. The design is synthesized on a Xilinx XC7Z020-1 FPGA and activity data is gathered with Xilinx's development environment, Vivado. The activity data is then used to increase dynamic power estimation accuracy.
TABLE I
AVERAGE POWER CONSUMPTION (W) DURING A MODULAR MULTIPLICATION

         Static   Dynamic   Total
BA       0.107    0.154     0.261
CI-AE    0.105    0.064     0.170
CI-PE    0.106    0.120     0.226
Power consumption of the different configurations can be seen in Table I. While static power consumption shows only small changes, dynamic power goes down significantly. This can be explained by the power consumption per design block during the fully-software and custom-instruction runs of the benchmark (Table II). When executing solely standard instructions, as in the BA column of Table II, every module of the CPU works synchronously. When an MMUL instruction is in progress, the rest of the CPU is idle. The biggest gain comes from the fetch stage, because only four instruction fetches are needed per modular multiplication with our custom instruction and it is the biggest module in the design.
TABLE II
AVERAGE DYNAMIC POWER CONSUMPTION PER MODULE (W)

                 BA      CI-AE   CI-PE
Fetch Stage      0.058   0.002   0.026
Decoder          0.014   0.001   0.006
ALU              0.031   0.001   0.008
Register File    0.012   0.002   0.003
MMUL             0       0.054   0.053
Both average power consumption and execution time go down in our implementation. Naturally, the product of these two metrics follows this trend as well. Normalized energy consumption values can be seen in Table III. For FourQ, a 128-bit curve, roughly 90% of the energy is saved, while for P-256/C25519, 256-bit curves, savings go up to 95%. As the prime the curves use gets bigger, the performance gain increases and this results in higher energy savings. R2MM scales better for larger operands with its O(n) time complexity [5] [12].
TABLE III
NORMALIZED ENERGY CONSUMPTION (POWER x CLOCK CYCLES)

                     BA   CI-AE   CI-PE
FourQ     KeyGen     1    0.10    0.13
          Sign       1    0.13    0.18
          Verify     1    0.09    0.12
P-256     KeyGen     1    0.05    0.07
          Sign       1    0.06    0.08
          Verify     1    0.05    0.07
C25519    KeyGen     1    0.05    0.07
          Sign       1    0.05    0.07
          Verify     1    0.05    0.06
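As a rough cross-check of these figures (illustrative arithmetic using numbers already reported, not a new measurement): energy is power x cycles, so dividing the CI-AE/BA power ratio from Table I by the up-to-13x cycle-count speedup lands near the ~0.05 normalized energy seen for the 256-bit curves.

```python
# Energy = average power x clock cycles; Table III normalizes to BA.
P_BA, P_CI_AE = 0.261, 0.170   # total power (W), from Table I
SPEEDUP = 13.0                  # up-to-13x cycle reduction (abstract)

def normalized_energy(p_new, p_base, speedup):
    """Energy of a configuration relative to the baseline."""
    return (p_new / p_base) / speedup

print(round(normalized_energy(P_CI_AE, P_BA, SPEEDUP), 3))
```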
Although MMUL itself is fairly small, it adds 33% area overhead (from 487 to 649 slices) to our base architecture, and the operating frequency goes down by 9% (from 89 MHz to 81 MHz) in FPGA synthesis. Using the TSMC OSU 0.18um technology, ASIC synthesis shows 49% area overhead (from 872 FFs and 8106 gates to 1305 FFs and 12105 gates) and an 8% decrease (from 148 MHz to 136 MHz) in operating frequency. Depending on the requirements of the application, a different implementation of Montgomery multiplication may be used for the required balance between performance gain and area overhead. It has been debated [13] that ECC is too complex to be used on IoT devices, yet even new lightweight algorithms introduce similar overheads when accelerated in hardware. To give a comparison, Tehrani et al. [14] accelerate lightweight block ciphers on an RV32I platform. On average, their work introduces 58% area overhead.

V. CONCLUSION
In this paper, we proposed a microprocessor core with a custom instruction for Montgomery multiplication. Radix-2 Montgomery multiplication is implemented as an instruction. For better system response times, a partial execution scheme is proposed, enabling the instruction to be completed in multiple short-latency iterations. The resulting hardware is realized on FPGA and as an ASIC. A clock speed of 136 MHz on ASIC and 81 MHz on FPGA is achieved with a 128-bit MMUL module. It achieves up to 13x speedup on various cryptographic curves compared to software implementations while reducing overall power consumption by up to 95%.

REFERENCES
[1] Computer Security Division, Information Technology Laboratory, National Institute of Standards and Technology, Department of Commerce, "Lightweight cryptography." [Online]. Available: https://csrc.nist.gov/projects/lightweight-cryptography
[2] B. Blaner, B. Abali, B. M. Bass, S. Chari, R. Kalla, S. Kunkel, K. Lauricella, R. Leavens, J. J. Reilly, and P. A. Sandon, "IBM POWER7+ processor on-chip accelerators for cryptography and active memory expansion," IBM Journal of Research and Development, vol. 57, no. 6, pp. 3:1–3:16, Nov. 2013.
[3] "Official RISC-V Foundation website," Oct. 2019. [Online]. Available: https://riscv.org/
[4] P. L. Montgomery, "Modular multiplication without trial division," Mathematics of Computation, vol. 44, no. 170, pp. 519–521, 1985.
[5] A. F. Tenca and C. K. Koc, "A scalable architecture for modular multiplication based on Montgomery's algorithm," IEEE Transactions on Computers, vol. 52, no. 9, pp. 1215–1221, Sep. 2003.
[6] "RISC-V instruction formats," GNU assembler manual. [Online]. Available: https://embarc.org/man-pages/as/RISC_002dV_002dFormats.html
[7] P. Davide Schiavone, F. Conti, D. Rossi, M. Gautschi, A. Pullini, E. Flamand, and L. Benini, "Slow and steady wins the race? A comparison of ultra-low-power RISC-V cores for Internet-of-Things applications," in Proc. PATMOS, Sep. 2017, pp. 1–8.
[8] C. Costello and P. Longa, "FourQ: Four-dimensional decompositions on a Q-curve over the Mersenne prime," in ASIACRYPT, 2015.
[9] P. Hess, "SEC 2: Recommended elliptic curve domain parameters," 2000.
[10] D. J. Bernstein, "Curve25519: New Diffie-Hellman speed records," in Public Key Cryptography - PKC 2006, M. Yung, Y. Dodis, A. Kiayias, and T. Malkin, Eds. Berlin, Heidelberg: Springer, 2006, pp. 207–228.
[11] R. Behnia, M. O. Ozmen, and A. A. Yavuz, "ARIS: Authentication for real-time IoT systems," in ICC 2019 - 2019 IEEE International Conference on Communications (ICC), 2019, pp. 1–6.
[12] A. Karatsuba and Y. P. Ofman, "Multiplication of many-digital numbers by automatic computers," Doklady Akademii Nauk SSSR, vol. 145, 1962, pp. 293–294.
[13] M. Samaila, J. Sequeiros, T. Simões, M. Freire, and P. Inácio, "IoT-HarPSecA: A framework and roadmap for secure design and development of devices and applications in the IoT space," IEEE Access, vol. PP, pp. 1–1, Jan. 2020.
[14] E. Tehrani, T. Graba, A. S. Merabet, S. Guilley, and J. Danger, "Classification of lightweight block ciphers for specific processor accelerated implementations," in 2019 26th IEEE International Conference on Electronics, Circuits and Systems (ICECS), 2019.