Ara: A 1 GHz+ Scalable and Energy-Efficient RISC-V Vector Processor with Multi-Precision Floating Point Support in 22 nm FD-SOI
Matheus Cavalcante, Fabian Schuiki, Florian Zaruba, Michael Schaffner, Luca Benini, Fellow, IEEE
Abstract—In this paper, we present Ara, a 64-bit vector processor based on the version 0.5 draft of RISC-V's vector extension, implemented in GLOBALFOUNDRIES 22FDX 22 nm FD-SOI technology. Ara reaches high FPU utilization when running a 256 × 256 double-precision matrix multiplication on sixteen lanes, and runs at more than 1 GHz in the typical corner (TT/0.80 V/25 °C), achieving a performance up to 33 DP−GFLOPS. In terms of energy efficiency, Ara achieves up to 41 DP−GFLOPS/W under the same conditions, which is slightly superior to similar vector processors found in the literature. An analysis on several vectorizable linear algebra computation kernels for a range of different matrix and vector sizes gives insight into performance limitations and bottlenecks for vector processors, and outlines directions to maintain high energy efficiency even for small matrix sizes, where the vector architecture achieves suboptimal utilization of the available FPUs.
Index Terms—Vector processor, SIMD, RISC-V.
I. INTRODUCTION

THE end of Dennard scaling caused the race for performance through higher frequencies to halt more than a decade ago, when an increasing integration density stopped translating into proportionate increases in performance or energy efficiency [1]. Processor frequencies plateaued, inciting interest in parallel multi-core architectures. These architectures, however, fail to address the efficiency limitation created by the inherent fetching and decoding of elementary instructions, which only keep the processor datapath busy for a very short period of time. Moreover, power dissipation limits how much integrated logic can be turned on simultaneously, increasing the energy efficiency requirements of modern systems [2], [3].

In instruction-based programmable architectures, the key challenge is how to mitigate the Von Neumann Bottleneck (VNB) [4]. Despite the flexibility of multi-core designs, they fail to exploit the regularity of data-parallel applications. Each core tends to execute the same instructions many times—a waste in terms of both area and energy [5]. The strong emergence of massively data-parallel workloads, such as data analytics and machine learning [6], created a major window of opportunity for architectures that effectively exploit data parallelism to achieve energy efficiency. The most successful of these architectures are General Purpose Graphics Processing Units (GPUs) [7], which heavily leverage data-parallel multithreading to relax the VNB through the so-called single instruction, multiple thread (SIMT) approach [8]. GPUs dominate the energy efficiency race, being present in 70% of the Green500 ranks [9].

∗Integrated Systems Laboratory of ETH Zürich, Zürich, Switzerland. †Department of Electrical, Electronic, and Information Engineering Guglielmo Marconi of the University of Bologna, Bologna, Italy. E-mail: {matheusd, fschuiki, zarubaf, mschaffner, lbenini} at iis.ee.ethz.ch.
They are also highly successful as data-parallel accelerators in high-performance embedded applications, such as self-driving cars [10]. The quest for extreme energy efficiency in data-parallel execution has also revamped interest in vector architectures. This kind of architecture was cutting-edge during another technology scaling crisis, namely the one related to circuits based on the Emitter-Coupled Logic technology [11]. Today, designers and architects are reconsidering vector processing approaches, as they promise to address the VNB very effectively [12], providing better energy efficiency than a general-purpose processor for applications that fit the vector processing model [5]. A single vector instruction can be used to express a data-parallel computation on a very large vector, thereby amortizing the instruction fetch and decode overhead. The effect is even more pronounced than for SIMT architectures, where instruction fetches are only amortized over the number of parallel scalar execution units in a "processing block": for the latest NVIDIA Volta GPUs, such blocks are only 32 elements long [13]. Therefore, vector processors provide a notably effective model to efficiently execute the data parallelism of scientific and matrix-oriented computations [14], [15], as well as digital signal processing and machine learning algorithms.

The renewed interest in vector processing is reflected by the introduction of vector instruction extensions in all popular Instruction Set Architectures (ISAs), such as the proprietary ARM ISA [16] and the open-source RISC-V ISA [17]. In this paper, we set out to analyze the scalability and energy efficiency of vector processors by designing and implementing a RISC-V-based architecture in an advanced Complementary Metal-Oxide-Semiconductor (CMOS) technology. The design will be open-sourced under a liberal license as part of the PULP Platform.
The key contributions of this paper are:
1) The architecture of a parametric in-order high-performance 64-bit vector unit based on the version 0.5 draft of RISC-V's vector extension [18]. The vector processor was designed for a memory bandwidth per peak performance ratio of 2 B/DP−FLOP, and works in tandem with Ariane, an open-source application-class RV64GC scalar core (see https://pulp-platform.org/). The vector unit supports mixed-precision arithmetic with double-, single-, and half-precision floating-point operands.
2) Performance analysis on key data-parallel kernels, both compute- and memory-bound, for variable problem sizes and design parameters. The performance is shown to meet the roofline achievable performance boundary, as long as the vector length is at least a few times longer than the number of physical lanes.
3) An architectural exploration and scalability analysis of the vector processor, with post-implementation results extracted from GLOBALFOUNDRIES 22FDX technology.

II. BACKGROUND AND RELATED WORK
Single instruction, multiple data (SIMD) architectures share—thus amortize—the instruction fetch among multiple identical processing units. This architectural model can be seen as instructions operating on vectors of operands. The approach works well as long as the control flow is regular, i.e., it is possible to formulate the problem in terms of vector operations.
A. Array processors
Array processors implement a packed-SIMD architecture. This type of processor has several independent but identical processing elements (PEs), all operating on commands from a shared control unit. Figure 1 shows an execution pattern for a dummy instruction sequence. The number of PEs determines the vector length, and the architecture can be seen as a wide datapath encompassing all subwords, each handled by a PE [19].

Fig. 1. Execution pattern on an array processor [20].
A limitation of such an architecture is that the vector length is fixed. It is commonly encoded into the instruction itself, meaning that each expansion of the vector length comes with another ISA extension. For instance, Intel's first version of the Streaming SIMD Extensions (SSEs) operates on 128-bit registers, whereas the Advanced Vector Extension (AVX) and AVX-512 evolutions operate on 256- and 512-bit wide registers, respectively [21]. ARM provides packed-SIMD capability via the "Neon" extension, operating on 128-bit wide registers [22]. RISC-V also supports packed-SIMD via DSP extensions [23].
B. Vector processors
Vector processors are time-multiplexed versions of array processors, implementing vector-SIMD instructions. Several specialized functional units stream the micro-operations on consecutive cycles, as shown in Figure 2. By doing so, the number of functional units no longer constrains the vector length, which can be dynamically configured. As opposed to packed-SIMD, long vectors do not need to be subdivided into fixed-size chunks, but can be issued using a single vector instruction. Hence, vector processors are potentially more energy efficient than an equivalent array processor, since many control signals can be kept constant throughout the computation, and the instruction fetch cost is amortized over many cycles.
Fig. 2. Execution pattern on a vector processor [20].
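The dynamic vector length can be illustrated with the strip-mining idiom used to map an arbitrary problem size onto a vector machine. The following Python sketch is an illustrative model, not Ara's hardware; the function name and the `vlmax` parameter are our assumptions. Each loop iteration requests a vector length, and a single "vector instruction" then covers that many elements:

```python
def daxpy_stripmined(alpha, x, y, vlmax=16):
    """Strip-mined y = alpha*x + y: every iteration 'configures' a vector
    length vl <= vlmax (as a vsetvl-style instruction would), and then
    one vector FMA covers vl elements at once."""
    n = len(x)
    i = 0
    while i < n:
        vl = min(vlmax, n - i)  # hardware grants at most vlmax elements
        y[i:i + vl] = [alpha * xi + yi
                       for xi, yi in zip(x[i:i + vl], y[i:i + vl])]
        i += vl                 # advance by the granted vector length
    return y
```

The larger the hardware vector length, the fewer instructions the loop issues, which is precisely the fetch-amortization argument made above.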
The history of vector processing starts with the traditional vector machines from the sixties and seventies, with the beginnings of the Illiac IV project [14]. The trend continued throughout the next two decades, with work on supercomputers such as the Cray-1 [11]. At the end of the century, however, microprocessor-based systems approached or surpassed the performance of vector supercomputers at much lower costs [24], due to intense work on superscalar and Very Long Instruction Word (VLIW) architectures. It is only recently that vector processors got renewed interest from the scientific community. Vector processors found a way into Field-Programmable Gate Arrays (FPGAs) as general-purpose accelerators. VIPERS [25] is a vector processor architecture loosely compliant with VIRAM [26], with several FPGA-specific optimizations. VEGAS [27] is a soft vector processor operating directly on scratchpad memory instead of on a Vector Register File (VRF). ARM is moving into Cray-inspired processing with their Scalable Vector Extension (SVE) [16]. The extension is based on the vector register architecture introduced with the Cray-1, and leaves the vector length as an implementation choice (from 128 bit to 2048 bit, in 128 bit increments). It is possible to write code agnostic to the vector length, so that different implementations can run the same software. The first system to adopt this extension is Fujitsu's A64FX, with a peak performance of 2.7 DP−TFLOPS in a 7 nm process, which is competitive in terms of peak performance with leading-edge GPUs [28]. The open RISC-V ISA specification is also leading an effort towards vector processing through its vector extension [18]. This extension is in active development, and, at the time of this writing, its latest version was 0.7. When compared with ARM SVE, RISC-V does not put any limit on the vector length. Moreover, the extension makes it possible to trade off the number of architectural vector registers against longer vectors.
Due to the availability of open-source RISC-V scalar cores, together with the liberal license of the ISA itself, we chose to design our vector processor based on this extension.

One crucial issue in vector processor design is how to maximize the utilization of the vector lanes. Beldianu and Ziavras [12] and Lu et al. [29] explore sharing a pool of vector units among different threads. The intelligent sharing of vector units on a multi-core increases their efficiency and throughput when compared to a multi-core with per-core private vector units [12]. A 32-bit implementation of the idea in a TSMC 40 nm process is presented in [30]. However, the ISA considered in that implementation is limited [29] when compared to RISC-V's vector extension, lacking, for example, the Fused Multiply-Add (FMA) instruction, strictly required in high-performance workloads. Moreover, the wider 64-bit datapath of our vector unit implies a drastic complexity increase of the FMA units and a larger VRF; consequently, a quantitative energy efficiency comparison between Ara and [30] is not directly possible. We compare the achieved vector lane utilization in Section V-A.
C. SIMT
SIMT architectures represent an amalgamation of the flexibility of multiple instruction, multiple data (MIMD) and the efficiency of SIMD designs. While SIMD architectures apply one instruction to multiple data lanes, SIMT designs apply one instruction to multiple independent threads in parallel [8]. The NVIDIA Volta GV100 GPU is a state-of-the-art example of this architecture, with 64 "processing blocks," called Streaming Multiprocessors (SMs) by NVIDIA, each handling 32 threads. A SIMD instruction exposes the vector length to the programmer and requires manual branching control, usually by setting flags that indicate which lanes are active for a given vector instruction. SIMT designs, on the other hand, allow the threads to diverge, although substantial performance improvement can be achieved if they remain synchronized [8]. SIMD and SIMT designs also handle data accesses differently. Since GPUs lack a control processor, hardware is necessary to dynamically coalesce memory accesses into large contiguous chunks [12]. While this approach simplifies the programming model, it also incurs a considerable energy overhead [31].
D. Vector thread
Another compromise between SIMD and MIMD are vector thread (VT) architectures [31], which support loops with cross-iteration dependencies and arbitrary internal control flow [32]. Similar to SIMT designs—and unlike SIMD—VT architectures leverage the threading concept instead of the more rigid notion of lanes, and hence provide a mechanism to handle program divergence. The main difference between SIMT and VT is that in the latter the vector instructions reside in another thread, and scalar bookkeeping instructions can potentially run concurrently with the vector ones. This division alleviates the problem of SIMT threads running redundant scalar instructions that must be later coalesced in hardware. Hwacha is a VT architecture based on a custom RISC-V extension, recently achieving 64 DP−GFLOPS in ST 28 nm FD-SOI technology [33].

Many vector architectures report only full-system metrics of performance and efficiency, including the memory hierarchy and main memory controllers. This is the case of Fujitsu's A64FX [28]. As our focus is on the core execution engine, we will mainly compare our vector unit with Hwacha in Section VI-C. Hwacha is an open-source design for which information about the internal organization is available, allowing for a fair quantitative comparison on a single processing engine.

III. ARCHITECTURE
In this section, we introduce the microarchitecture of Ara, a scalable high-performance vector unit based on RISC-V's vector extension. As illustrated in Figure 3a, Ara works in tandem with Ariane [34], an open-source Linux-capable application-class core. To this end, Ariane has been extended to drive the accompanying vector unit as a tightly coupled coprocessor.
A. Ariane
Ariane is an open-source, in-order, single-issue, 64-bit application-class processor implementing RV64GC [34]. It has support for hardware multiply/divide and atomic memory operations, as well as an IEEE-compliant FPU [35]. It has been manufactured in GLOBALFOUNDRIES 22FDX technology.

Fig. 3. Top-level block diagram of Ara. (a) Block diagram of an Ara instance with N parallel lanes. Ara receives its commands from Ariane, a RV64GC scalar core. The vector unit has a main sequencer; N parallel lanes; a Slide Unit (SLDU); and a Vector Load/Store Unit (VLSU). The memory interface is W bit wide. (b) Block diagram of one lane of Ara. It contains a lane sequencer (handling up to 8 vector instructions); a 16 KiB vector register file; ten operand queues; an integer Arithmetic Logic Unit (ALU); an integer multiplier (MUL); and a Floating Point Unit (FPU).
Instructions are acknowledged as soon as Ara determines that they will not throw any exceptions. This happens early in their execution, usually after their decoding. Because vector instructions can run for an extended number of cycles (as presented in Figure 2), they may get acknowledged many cycles before the end of their execution, potentially freeing the scalar core to continue executing its instruction stream. The decoupled execution works well, except when Ariane expects a result from Ara, e.g., when reading an element of a vector register. The interface between Ariane and Ara is lightweight, being similar to the Rocket Custom Coprocessor Interface (RoCC) used with the Rocket Chip [36]. The difference between them is that Ariane's dispatcher pushes the decoded instruction to Ara, while RoCC leaves the full decoding task to the coprocessor.
B. Sequencer
The sequencer is responsible for keeping track of the vector instructions running on Ara, dispatching them to the different execution units, and acknowledging them with Ariane. This unit is the single block that has a global view of the instruction execution progress across all lanes. The sequencer can handle up to eight parallel instructions. This ensures that Ara has instructions enqueued for execution, avoiding starvation due to the non-speculative dispatch policy of Ara's front end.

Hazards among pending vector instructions are resolved by this block. Structural hazards arise due to architectural decisions (e.g., shared paths between the ALU and the SLDU), or if a functional unit is not able to accept yet another instruction due to the limited capacity of its operation queue. The sequencer delays the issue of vector instructions until the structural hazard has been resolved (i.e., the offending instruction completes). The sequencer also stores information about which vector instruction is accessing which vector register. This information is used to determine data hazards between instructions. For example, if a vector instruction tries to write to a vector register that is already being written, the sequencer will flag the existence of a write-after-write (WAW) data hazard between them. Read-after-write (RAW), write-after-read (WAR), and WAW hazards are handled in the same manner. Unlike structural hazards, data hazards do not need to stall the sequencer, as they are handled on a per-element basis downstream.
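The sequencer's hazard bookkeeping can be sketched as a comparison of register read/write sets between an older in-flight instruction and a newer one. This is a simplified model, not the actual scoreboard encoding; the `Instr` record and the set representation are our assumptions:

```python
from collections import namedtuple

# srcs: vector registers read; dsts: vector registers written
Instr = namedtuple("Instr", "srcs dsts")

def hazards(older, newer):
    """Return the set of data hazards the newer instruction has
    against an older, still-running instruction."""
    h = set()
    if newer.srcs & older.dsts:
        h.add("RAW")  # read-after-write: newer reads a pending result
    if newer.dsts & older.srcs:
        h.add("WAR")  # write-after-read: newer overwrites a pending source
    if newer.dsts & older.dsts:
        h.add("WAW")  # write-after-write: both target the same register
    return h
```

For example, an instruction reading v1 while an older one is still writing v1 is flagged as RAW; as described above, such hazards do not stall the sequencer but are enforced per element downstream.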
C. Slide unit
The SLDU is responsible for handling instructions that must access all VRF banks at once. It handles, for example, the insertion of an element into a vector, the extraction of an element from a vector, vector shuffles, and vector slides (vd[i] ← vs[i + slide_amount]). This unit may also be extended to support basic vector reductions, such as vector-add and inner product. The support for vector reductions is considered an optional feature in the current version of RISC-V's vector extension [18]. For simplicity, we decided not to support them, taking into consideration that an O(n) vector reduction can still be implemented as a sequence of O(log n) vector slides and the corresponding arithmetic instructions [24].

D. Vector load/store unit
Ara has a single memory port, whose width is chosen to keep the memory bandwidth per peak performance ratio fixed at 2 B/DP−FLOP. As illustrated in Figure 3a, Ara has an address generator, responsible for determining which memory addresses will be accessed. The accesses can be i) unit-stride loads and stores, which access a contiguous chunk of memory; ii) constant-stride memory operations, which access memory addresses spaced with a fixed offset; or iii) scatters and gathers, which use a vector of offsets to allow general access patterns. After address generation, the unit coalesces unit-stride memory operations into burst requests, avoiding the need to request the individual elements from memory. The burst start address and the burst length are then sent to either the load or the store unit, both of which are responsible for initiating data transfers through Ara's Advanced eXtensible Interface (AXI) interface.
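The coalescing step can be sketched as follows. This is an illustrative model of the behavior, not the actual request format on Ara's AXI port: unit-stride operations collapse into a single burst, while other strides fall back to element-granular requests.

```python
def gen_requests(base, stride, n, elem=8):
    """Sketch of the address-generation step for n elements of `elem`
    bytes: a unit-stride access (stride equal to the element size)
    becomes one burst request (start address, length); constant strides
    produce one request per element."""
    if stride == elem:
        return [("burst", base, n)]
    return [("single", base + i * stride, 1) for i in range(n)]
```

A unit-stride load of a 256-element double-precision vector thus costs a single burst request instead of 256 individual ones, which is what makes unit-stride the cheapest access pattern of the three.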
E. Lane organization
Ara can be configured with a variable number of identical lanes, each one with the architecture shown in Figure 3b. Each lane has its own lane sequencer, responsible for keeping track of up to eight parallel vector instructions. Each lane also has a VRF and an accompanying arbiter to orchestrate its access, operand queues, an integer ALU, an integer MUL, and an FPU. Each lane contains part of Ara's whole VRF and execution units. Hence, most of the computation is contained within one lane, and instructions that need to access all the VRF banks at once (e.g., instructions that execute at the VLSU or at the SLDU) use data interfaces between the lanes and the responsible computing units. Each lane also has a command interface attached to the main sequencer, through which the lanes indicate they have finished the execution of an instruction.
1) Lane sequencer:
The lane sequencer is responsible for issuing vector instructions to the functional units, controlling their execution in the context of a single lane. Unlike the main sequencer, the lane sequencers do not store the state of the running instructions, avoiding data duplication across lanes. They also initiate requests to read operands from the VRF, generating up to ten independent requests to the VRF arbiter. Operand fetch and result write-back are decoupled from each other. Starvation is avoided via a self-regulated process, through back pressure due to unavailable operands. By throttling the operand request rate, the lane sequencer indirectly limits the rate at which results are produced. This is used to handle data hazards, by ensuring that dependent instructions run at the same pace: if instruction i depends on instruction j, the operands of instruction i are requested only if instruction j produced results in the previous cycle. There is no forwarding logic.
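The pacing rule can be modeled with per-instruction element counters: a dependent instruction may consume element i only once its producer wrote that element in an earlier cycle. The toy simulation below is our own model (one element per cycle per unit, no forwarding, a linear dependency chain), not Ara's control logic; it shows that this form of chaining costs one cycle of skew per dependency instead of serializing the instructions:

```python
def chained_runtime(n, stages):
    """Cycles for `stages` linearly dependent vector instructions over n
    elements, one element/cycle each; stage s may handle element i only
    after stage s-1 produced it in a previous cycle."""
    done = [0] * stages          # elements completed per stage
    cycles = 0
    while done[-1] < n:
        # iterate downstream-first so each stage sees last cycle's counts
        for s in reversed(range(stages)):
            ready = n if s == 0 else done[s - 1]
            if done[s] < ready:  # operand available from a previous cycle
                done[s] += 1
        cycles += 1
    return cycles
```

Under this model, three chained instructions over 16 elements finish in 18 cycles, against 48 if each had to wait for the previous one to fully complete.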
2) Vector register file:
The VRF is at the core of every vector processor. Because several instructions can run in parallel, the register file must be able to support enough throughput to supply the functional units with operands and absorb their results. In RISC-V's vector extension, the predicated multiply-add instruction is the worst case regarding throughput, reading four operands to produce one result.

Due to the massive area and power overhead of multi-ported memory cuts, which usually require custom transistor-level design, we opted not to use a monolithic VRF with several ports. Instead, Ara's vector register file is composed of a set of single-ported (1RW) banks. The width of each bank is constrained to the datapath width of each lane, i.e., 64 bit, to avoid subword selection logic. Therefore, in steady state, five banks are accessed simultaneously to sustain maximum throughput for the predicated multiply-add instruction. Ara's register file has eight banks per lane, providing some margin on the banking factor. This VRF structure (eight 64-bit wide 1RW banks) is replicated at each lane, and all inter-lane communication is concentrated at the VLSU and SLDU. We used a high-performance memory cut to meet a target operating frequency of 1 GHz. These memories, however, cannot be fully clock-gated. The cuts do consume less power in the idle state, a NOP costing about 10% of the power required by a write operation.

A multi-banked VRF raises the problem of banking conflicts, which occur when several functional units need to access the same bank. These are resolved dynamically with a weighted round-robin arbiter per bank, with two priority levels. Low-throughput instructions, such as memory operations, are assigned a lower priority. By doing so, their irregular access pattern does not disturb other concurrent high-throughput instructions (e.g., floating-point instructions). Figure 4b shows how the vector registers are mapped onto the banks. The initial bank of each vector register is shifted in a "barber's pole" fashion.
This avoids initial banking conflicts when the functional units try to fetch the first elements of different vector registers, which are all mapped onto the same bank in the pure element-partitioned approach [24] of Figure 4a. Vector registers can also hold scalar values. In this case, the scalar value is replicated at each lane, at the first position of the vector register. Scalar values are only read/written once per lane, and are logically replicated by the functional units.
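The mapping can be sketched as a simple modular shift. This is our illustrative model of the per-lane bank index only (the real layout also distributes elements across lanes):

```python
BANKS = 8  # single-ported (1RW) banks per lane

def vrf_bank(vreg, elem):
    """'Barber's pole' mapping: vector register `vreg` starts at bank
    vreg mod 8, and consecutive elements walk the banks cyclically."""
    return (vreg + elem) % BANKS

def naive_bank(vreg, elem):
    """Pure element-partitioned mapping: every register starts at bank 0."""
    return elem % BANKS
```

With the shift, the first elements of eight different vector registers occupy eight distinct banks and can be fetched in the same cycle; without it, they all collide on bank 0.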
3) Operand queues:
The multi-banked organization of the VRF can lead to banking conflicts when several functional units try to access operands in the same bank. Each lane has a set of operand queues between the VRF and the functional units to absorb such banking conflicts. There are ten operand queues: four of them are dedicated to the FPU/MUL unit, three of them to the ALU (two of which are shared with the SLDU), and another three to the VLSU. Each queue is 64 bit wide, and their depth was chosen via simulation. The queue depth depends on the functional unit's latency and throughput, so that low-throughput functional units, such as the VLSU, require shallower queues than the FPUs. Queues between the functional units' output ports and the vector register file absorb banking conflicts on the write-back path to the VRF. Each lane has two such queues, one for the FPU/MUL and one for the ALU. Together with the decoupled operand fetch mechanism discussed in Section III-E1 and the barber's pole VRF organization of Section III-E2, the operand queues allow for a pipelined execution of vector instructions. While bubbles occur sporadically due to banking conflicts, it is possible to fill the pipeline even with a succession of short vector instructions.

Fig. 4. VRF organization inside one lane. Darker colors highlight the initial element of each vector register vi. In a), without the "barber's pole" shift, all vector registers start at the same bank. In b), the vector registers follow a "barber's pole" pattern, the starting bank being shifted for every vector register.
4) Execution units:
Each lane has three execution units: an integer ALU, an integer MUL, and an FPU, all of them operating on a 64-bit datapath. The MUL shares the operand queues with the FPU, and they cannot be used simultaneously, since we do not expect the simultaneous use of the integer multiplier and the floating-point unit to be a common case. With the exception of this constraint, vector chaining is allowed between any execution units, as long as they are executing instructions with regular access patterns (i.e., no vector shuffles).

It is possible to subdivide the 64-bit datapath, trading off narrower data formats for a corresponding increase in performance. The three execution units have a 64 bit/cycle throughput, regardless of the data format of the computation. We developed our multi-precision ALU and MUL, both operating on 1 × 64, 2 × 32, 4 × 16, or 8 × 8 bit elements per cycle.
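The subdivision of the datapath can be illustrated with a subword-parallel addition, where carries are cut at element boundaries. The following is a behavioral sketch of the 1 × 64 / 2 × 32 / 4 × 16 / 8 × 8 modes, not the actual carry-kill circuit of the ALU:

```python
def simd_add(a, b, ew):
    """Add two 64-bit words as 64/ew independent ew-bit elements:
    each element wraps modulo 2**ew, so no carry crosses a boundary."""
    mask = (1 << ew) - 1
    out = 0
    for i in range(0, 64, ew):
        # the low ew bits of the partial sum depend only on element i
        out |= (((a >> i) + (b >> i)) & mask) << i
    return out
```

The same 64-bit datapath thus delivers twice the elements per cycle at half the element width, which is the performance trade-off described above.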
IV. BENCHMARKS

Memory bandwidth is often a limiting factor when it comes to processor performance, and many optimizations revolve around scheduling memory and arithmetic operations with the purpose of hiding memory latency. The relationship between processor performance and memory bandwidth can be analyzed with the roofline model [37]. This model shows the peak achievable performance (in OP/cycle) as a function of the arithmetic intensity I, defined as the algorithm-dependent ratio of operations per byte of memory traffic.

According to this model, computations can be either memory-bound or compute-bound [38], the peak performance being achievable only if the algorithm's arithmetic intensity, in operations per byte, is higher than the processor's performance per memory bandwidth ratio. Ara enters its compute-bound regime when the arithmetic intensity is higher than 0.5 DP−FLOP/B. The memory bandwidth determines the slope of the performance boundary in the memory-bound regime. We consider three benchmarks to explore the architecture instances of the vector processor, with distinct arithmetic intensities that fully span the two regions of the roofline.

Our first algorithm is MATMUL, an n × n double-precision matrix multiplication C ← AB + C. The algorithm requires 2n³ floating-point operations—one FMA is considered as two operations—and at least 32n² bytes of memory transfers. Therefore, the algorithm has an arithmetic intensity of at least

I_MATMUL ≥ n/16 DP−FLOP/B. (1)

We will consider matrices of size at least 16 × 16 across several Ara instances. The roofline model shows that it is possible to achieve the system's peak performance with these matrix sizes. Matrix multiplication is neither embarrassingly memory-bound nor compute-bound, since its arithmetic intensity grows with O(n). Nevertheless, it is interesting to see how Ara behaves on highly memory-bound as well as fully compute-bound cases.

DAXPY, Y ← αX + Y, is a common algorithmic building block of more complex Basic Linear Algebra Subprograms (BLAS) routines. Considering vectors of length n, DAXPY requires n FMAs and at least 24n bytes of memory transfers. DAXPY is therefore a heavily memory-bound algorithm, with an arithmetic intensity of 1/12 DP−FLOP/B.

We explore the extremely compute-bound spectrum with the tensor convolution DCONV, a routine which is at the core of convolutional networks. In terms of size, we took the first layer of GoogLeNet [39], with a 64 × 3 × 7 × 7 kernel applied on 3 × 112 × 112 input images. Each point of the input image must be convolved with the weights, resulting in a total of 64 × 3 × 7 × 7 × 112 × 112 FMAs, or 236 DP−MFLOP. In terms of memory, we will consider that the input matrix (after padding) is loaded exactly once, or 3 × 118 × 118 double-precision loads, together with the write-back of the result, or 64 × 112 × 112 double-precision stores. The 6.44 MiB of memory transfers imply an arithmetic intensity of 34.9 DP−FLOP/B, making this kernel heavily compute-bound on Ara.
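The arithmetic intensities above, and the roofline cap they imply, can be checked numerically. The units are those of the text (DP−FLOP/cycle for compute, B/cycle for bandwidth, 8 B per double); the function and constant names are ours:

```python
def attainable(intensity, peak, bandwidth):
    """Roofline model: performance is the lesser of the compute roof
    and the memory slope (bandwidth x arithmetic intensity)."""
    return min(peak, bandwidth * intensity)

def matmul_intensity(n):
    # 2*n^3 FLOP over at least 4*n^2 doubles moved (read A, B, C; write C)
    return (2 * n**3) / (4 * n**2 * 8)        # = n/16 DP-FLOP/B

DAXPY_INTENSITY = 2 / (3 * 8)                 # 2n FLOP, 3n doubles = 1/12

# DCONV, sized as GoogLeNet's first layer in the text
DCONV_FLOP = 2 * 64 * 3 * 7 * 7 * 112 * 112           # ~236 DP-MFLOP
DCONV_BYTES = (3 * 118 * 118 + 64 * 112 * 112) * 8    # ~6.44 MiB
DCONV_INTENSITY = DCONV_FLOP / DCONV_BYTES            # ~34.9 DP-FLOP/B
```

With the design's 2 B/DP−FLOP ratio (bandwidth equal to twice the peak), the compute-bound threshold is exactly 0.5 DP−FLOP/B: for a 16-lane instance, attainable(0.5, 32, 64) already reaches the 32 DP−FLOP/cycle roof.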
V. PERFORMANCE ANALYSIS

In this section, we analyze Ara in terms of its peak performance across several design parameters. We use the matrix multiplication kernel to explore architectural limitations in depth, before analyzing how such limitations manifest themselves for the other kernels.
A. Matrix multiplication
Figure 5 shows the performance measurements of the matrix multiplication C ← AB + C, for several Ara instances and problem sizes n × n. For problems "large enough," the performance results meet the peak performance boundary. For a matrix multiplication of size 256 × 256, the achieved functional units' utilization is comparable to those of Hwacha [33] and of Beldianu and Ziavras (97%) [30]. The performance scalability comes, however, at a price. More lanes require larger problem sizes to fully exploit the maximum performance, even though all problem sizes fall into the compute-bound regime. Smaller problems, however, cannot fully utilize the functional units. It is important to note that this limiting effect can also be observed in other vector processors such as Hwacha (see comparison in Section V-D).
25 0 . n
16 32 64 128 256 [24.5%][35.8%] [14.5%][17.4%][31.0%] [10.2%][10.4%][22.5%][43.0%] [5.2%][5.8%][6.9%][21.2%] [1.8%][1.9%][2.5%][2.8%] I ss u e r a t e Arithmetic intensity [DP−FLOP/B] P e rf o r m a n ce [ D P − F L O P / c y c l e ] (cid:96) = (cid:96) = (cid:96) = (cid:96) = Fig. 5. Performance results for the matrix multiplication C ← AB + C , withdifferent number of lanes (cid:96) , for several n × n problem sizes. The bold redline depicts a performance boundary due to the instruction issue rate. Thenumbers between brackets indicate the performance loss, with respect to thetheoretically achievable peak performance. This effect is attributed to two main reasons: first, the initial-ization of the vector register file before starting computation;and second, the rate at which the vector instructions are issued toAra. The former is analyzed in detail in Appendix A. The latteris related to the rate at which the vector FMA instructions areissued. To understand this, consider that smaller vectors occupythe pipeline for fewer cycles, and more vector instructionsare required to fully utilize the FPUs. If every vector FMAinstruction occupies the FPUs for τ cycles and they are issuedevery δ cycles, the system performance ω is limited by ω ≤ Π τδ . (2)For the n × n matrix multiplication, τ is equal to n / Π . Weuse this together with Equation (1) to rewrite this constraint interms of the arithmetic intensity I MATMUL , resulting in ω ≤ δ I MATMUL . (3)This translates to another performance boundary in the rooflineplot, purely dependent on the instruction issue rate. The FMA instructions are issued every five cycles, as discussedin Appendix A. This shifts the roofline of the architecture asillustrated with the bold line in Figure 5. Note that, for 16lanes, even the performance of a 64 ×
64 matrix multiplicationends up being limited by the vector instruction issue rate.The performance degradation with shorter vectors could bemitigated with a more complex instruction issue mechanism,either going superscalar or introducing a VLIW capable ISA toincrease the issue rate. Shorter vectors bring vector processorsto an array processor, where the vector instructions executefor a single cycle. This puts pressure on the issue logic,demanding more than a simple single-issue in-order core. Forexample, all ARM Cortex-A cores with Neon capability arealso superscalar [40]. Another alternative would be the use of aMIMD approach where the lanes would be decoupled, runninginstructions issued by different scalar cores, as discussed byLu et al. [29]. While fine-grain temporal sharing of the vectorunits achieves an exciting increase of the FPU utilization [29],duplication of the instruction issue logic could also degradethe energy efficiency achieved by the design.
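The issue-rate bound of Equations (2) and (3) can be sketched numerically. The snippet below is an illustrative model, not code from the paper: it assumes a peak performance Π = 2ℓ DP−FLOP/cycle (one FMA per lane per cycle) and takes the arithmetic intensity of the n × n MATMUL as n/16 DP−FLOP/B (2n³ FLOPs over 32n² bytes of traffic for A, B, and C).

```python
# Illustrative model of the issue-rate performance bound (an assumption-
# laden sketch, not code from the paper). One vector FMA is issued every
# delta = 5 cycles, as measured for the MATMUL kernel.

def issue_rate_bound(n, lanes, delta=5):
    """Upper bound on performance (DP-FLOP/cycle) for an n x n MATMUL."""
    pi = 2 * lanes            # assumed peak: 2*lanes DP-FLOP/cycle
    tau = 2 * n / pi          # cycles one vector FMA keeps the FPUs busy
    return min(pi, pi * tau / delta)

# With 16 lanes, a 64 x 64 MATMUL is already issue-rate limited:
print(issue_rate_bound(64, 16))   # 25.6 DP-FLOP/cycle, below the peak of 32
print(issue_rate_bound(256, 16))  # 32, the compute-bound peak
```

Under these assumptions, the bound Π τ/δ = 2n/δ depends only on the vector length and the issue rate, which is why the bold line in Figure 5 is independent of the number of lanes.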
B. AXPY
As discussed in Section IV, DAXPY is a heavily memory-bound kernel, with an arithmetic intensity of 0.083 DP−FLOP/B. It is no surprise that the measured performance for such a kernel is far below the system's peak performance in the compute-bound region. For an Ara instance with two lanes, we measure 0.65 DP−FLOP/cycle, which is 98% of the theoretical performance limit. For sixteen lanes, the achieved 4.27 DP−FLOP/cycle is still 80% of the theoretical limit β I_DAXPY from the roofline plot. The limiting factor is the configuration of the vector unit, whose overhead increases the runtime from the ideal 96 cycles to 120 cycles.
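The intensity figure quoted above follows directly from DAXPY's operation count; the snippet below is a back-of-the-envelope check (the bandwidth parameter `beta` is a stand-in, not a value from the paper):

```python
# DAXPY y <- a*x + y: 2 FLOPs per element against two 8-byte loads and
# one 8-byte store, which yields the 0.083 DP-FLOP/B quoted above.

def daxpy_intensity(bytes_per_elem=8):
    flops = 2                        # one multiply and one add per element
    traffic = 3 * bytes_per_elem     # load x[i], load y[i], store y[i]
    return flops / traffic

def daxpy_bound(beta):
    """Roofline memory bound beta * I_DAXPY, with beta in B/cycle."""
    return beta * daxpy_intensity()

print(round(daxpy_intensity(), 3))   # 0.083 DP-FLOP/B
```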
C. Convolution
Convolutions are heavily compute-bound kernels, with an arithmetic intensity of up to 34.9 DP−FLOP/B. With two lanes, the convolution kernel achieves a performance of up to 3.73 DP−FLOP/cycle. We notice some performance degradation for sixteen lanes, where the kernel achieves 26.7 DP−FLOP/cycle, i.e., an FPU utilization of 83.2%, close to the performance achieved by the 128 × 128 matrix multiplication.

Fig. 6. Performance results for the three considered benchmarks, with different numbers of lanes ℓ. AXPY uses vectors of length 256, and the MATMUL is between matrices of size 256 × 256.

D. Performance comparison with Hwacha
For comparison with Ara, we measured Hwacha's performance for the matrix multiplication benchmark, using the publicly available Hardware Description Language (HDL) sources and tooling scripts from their GitHub repository (see https://github.com/ucb-bar/hwacha-template/tree/a5ed14a). We were not able to reproduce the 32 × 32 double precision matrix multiplication performance claimed by Dabbelt et al. [5]. This is because Hwacha relies on a closed-source L2 cache, whereas its public version has a limited memory system with no banked cache and a broadcast hub to ensure coherence. This effectively limits Hwacha's memory bandwidth to 128 bit/cycle, starving the FMA units and capping the achievable performance.

Table I presents the performance achieved by Ara side by side with the published results for Hwacha [5]. For a fair comparison, the roofline boundaries are identical between the compared architectures. For small problems, for which a direct comparison is possible, Ara utilizes its FPUs much better than the equivalent Hwacha instances. For the instances with two lanes, Ara utilizes its FPUs 66% more than the equivalent Hwacha instance on a relatively small 32 × 32 problem. Published results also exist for a 128 × 128 MATMUL, close to the performance level that Ara achieves. However, these results cannot be reproduced on the current open-source version of Hwacha, possibly due to the memory system limitation outlined above.

TABLE I
NORMALIZED ACHIEVED PERFORMANCE BETWEEN EQUIVALENT ARA AND HWACHA INSTANCES FOR A MATRIX MULTIPLICATION, WITH DIFFERENT n × n PROBLEM SIZES.

n     Ara     Hwacha^a    Ara     Hwacha    Ara     Hwacha
16    49.5%   —           25.4%   —         12.8%   —
32    82.6%   49.9%       53.4%   35.6%     27.6%   22.4%
64    89.6%   —           77.5%   —         45.6%   —
128   94.3%   —           93.1%   —         78.8%   —

^a Performance results extracted from [5].
VI. IMPLEMENTATION RESULTS
In this section, we analyze the implementation of several Ara instances in terms of area, power, and energy efficiency.
A. Methodology
Ara was synthesized for GLOBALFOUNDRIES' 22 nm FD-SOI technology. Ara's performance and power figures of merit are measured by running the kernels on a cycle-accurate Register Transfer Level (RTL) simulation. We used Synopsys PrimeTime 2016.12 to extract the power figures, with activities obtained with timing information from the implemented design at TT/0.80 V/25 °C. Table II summarizes Ara's design parameters.
TABLE II
DESIGN PARAMETERS.

Number of lanes     ℓ ∈ {2, 4, 8, 16}
Memory width        64·ℓ bit
Operating corner    TT/0.80 V/25 °C
Target frequency    1 GHz
VRF size            16 KiB/lane
Because the maximum frequencies achieved after synthesis are usually higher than the ones achieved after the back-end flow, the system was synthesized with a clock period constraint 250 ps shorter than the target clock period of 1 ns. The system can be tuned for even higher frequencies by deploying Forward Body-Biasing (FBB) techniques, at the expense of an increase in leakage power. On average, the final designs contain a mix of 72.9% Low Voltage Threshold (LVT) cells and 27.1% Super Low Voltage Threshold (SLVT) cells.
B. Physical implementation
We implemented four Ara instances, with two, four, eight, and sixteen lanes. The instance with four lanes was placed and routed as a macro of 1.125 mm width in GLOBALFOUNDRIES' 22 nm technology, shown in Figure 7.

(a) Place-and-route results of an Ara instance with four lanes, highlighting its internal blocks: A) lane 0; B) lane 1; C) lane 2; D) lane 3; E) SLDU; F) sequencer; G) VLSU; H) Ara front end; I) Ariane; J) memory interconnect.

(b) Detail of one of Ara's lanes, highlighting its internal blocks: A) lane sequencer; B) VRF; C) operand queues; D) MUL; E) FPU; F) ALU.

Fig. 7. Place-and-route results of an Ara instance with four lanes in GLOBALFOUNDRIES' 22 nm technology.

Our vector processor is scalable, in the sense that Ariane can be reused without changes to drive a wide range of different lane parameters. Furthermore, each vector lane touches only its own section of the VRF; hence, it does not introduce any scalability bottlenecks. Scalability is only limited by the units that need to interface with all lanes at once, namely the main sequencer, the VLSU, and the SLDU. Beldianu and Ziavras [30] and Hwacha [33], on the other hand, have a dedicated memory port per lane. This solves the scalability issue locally, by controlling the growth of the memory interface, but pushes the memory interconnect issue further upstream, as the wide memory system must be able to aggregate multiple parallel requests from all these ports to achieve the maximum memory throughput.

We decided not to deploy lane-level Power Gating (PG) or Body-Biasing (BB) techniques, due to their significant area and timing impact. In terms of area, both techniques would require an isolation ring 10 µm wide around each PG/BB domain, or at least an 8% increase in the area of each lane. In terms of timing, isolation cells between power domains and separate clock trees would impact Ara's operating frequency. Assuming these cells were in the critical path between the lanes and the VLSU, this would incur a 10% clock frequency penalty. Reverse Body-Biasing lowers the leakage, but also impacts frequency, since it cannot be applied to high-performance LVT and SLVT cells. Furthermore, PG (and, to a lesser degree, BB) would introduce significant turn-on transition times (on the order of 10 to 15 cycles), which could be tolerable only if coupled with a scheduling policy for power managing the lanes. These techniques are out of the scope of the current work.
C. Performance, power, and area results
Table III summarizes the post-place-and-route results of several Ara instances. Overall, the instances achieve nominal operating frequencies around 1.2 GHz, where we chose the typical corner, TT/0.80 V/25 °C, for comparison with equivalent results from Hwacha [41]. For completeness, Table III also presents timing results for the worst-case corner, i.e., SS/0.72 V/125 °C. The two-lane instance has its critical path inside the double precision FMA. This block relies on the automatic retiming feature of Synopsys Design Compiler, and the register placement could be further improved by hand-tuning, or by increasing the number of pipeline stages. Another critical path is on the combinational handshake between the VLSU and its operand queues in the lanes. Both paths are about 40 gate delays long. Timing of the instances with eight and sixteen lanes becomes increasingly critical, due to the widening of Ara's memory interface. This happens when the VLSU collects 64-bit words from all the lanes, realigns them, and packs them into a wide word to be sent to memory. The instance with 16 lanes incurs a 17% clock frequency penalty when compared with the frequency achieved by the instance with two lanes.

The silicon area and leakage power of the accompanying scalar core are amortized among the lanes, which can be seen in the decreasing area-per-lane figure of merit. Figure 8 shows the area breakdown of an Ara instance with four lanes. Ara's total area (excluding the scalar core) is 2.46 MGE, of which each lane amounts to 575 kGE. The area of the vector unit is dominated by the lanes, while the other blocks amount to only 7% of the total area. The area of the lanes is dominated by the VRF (35%), the FPU (27%), and the multiplier (18%).

Fig. 8. Area breakdowns of a) an Ara instance with four lanes, with detail on b) one of its lanes. Ara's total area, excluding the scalar processor, is 2.46 MGE. Each lane has about 575 kGE.
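A note ahead of the area comparison that follows: converting Hwacha's published 28 nm post-synthesis area into 22 nm gate equivalents (GE) relies on an ideal shrink, which can be sanity-checked in a few lines. All constants below are the ones quoted in that comparison and its footnote; the script itself is only an arithmetic check.

```python
# Ideal 28 nm -> 22 nm scaling check: one GE in 28 nm is taken to be
# (28/22)^2 bigger than one GE in 22 nm, i.e., 0.322 um^2. Hwacha's
# four-lane post-synthesis area is 0.354 mm^2 [5].

scale = (28 / 22) ** 2           # ~1.62x ideal area shrink factor
ge_28nm_um2 = 0.322              # GE size used for the 28 nm area
ge_22nm_um2 = ge_28nm_um2 / scale
area_mm2 = 0.354                 # Hwacha, four lanes, post-synthesis
kge = area_mm2 * 1e6 / ge_28nm_um2 / 1e3

print(round(ge_22nm_um2, 3))     # ~0.199 um^2 per GE in 22 nm
print(round(kge))                # ~1099 kGE, matching the quoted 1098 kGE
```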
TABLE III
POST-PLACE-AND-ROUTE ARCHITECTURAL COMPARISON BETWEEN SEVERAL ARA INSTANCES IN GLOBALFOUNDRIES 22 nm TECHNOLOGY, IN TERMS OF PERFORMANCE, POWER CONSUMPTION, AND ENERGY EFFICIENCY.

Figure of merit             |        ℓ = 2         |        ℓ = 4        |         ℓ = 8         |         ℓ = 16
Clock (nominal) [GHz]       |        1.25          |        1.25         |         1.17          |         1.04
Clock (worst-case) [GHz]    |        0.92          |        0.93         |         0.87          |         0.78
Area [kGE]                  |        2228          |        3434         |         5902          |        10 735
Area per lane [kGE]         |        1114          |         858         |          738          |          671
Kernel                      | matmul^a dconv^b daxpy^c | matmul dconv daxpy | matmul dconv daxpy  | matmul dconv daxpy
Performance [DP−GFLOPS]     |  4.91   4.66   0.82  |  9.80  9.22  1.56   |  18.2   16.9   2.80   |  32.4   27.7   4.44
Core power [mW]             |  138    130    68.2  |  259   239   113    |  456    420    183    |  794    676    280
Leakage [mW]                |        7.2           |        11.2         |         21.1          |         31.4
Ariane/Ara [mW]             | 22/116 22/108 20/48  | 27/232 29/210 25/88 | 28/428 29/391 24/159  | 31/763 31/646 25/255
Core power per lane [mW]    |  69     65     34    |  65    60    28     |  57     54     23     |  50     42     15
Efficiency [DP−GFLOPS/W]    |  35.6   35.8   12.0  |  37.8  38.6  13.8   |  39.9   40.2   15.3   |  40.8   41.0   15.9

^a Double precision floating point 256 × 256 matrix multiplication. ^b Double precision floating point tensor convolution with sizes from the first layer of GoogLeNet; the input size is 3 × 112 × 112 and the kernel size is 64 × 3 × 7 × 7. ^c Double precision AXPY of vectors with length 256.

In terms of post-synthesis logic area, a Hwacha instance with four lanes uses 0.354 mm² [5], or 1098 kGE. When comparing post-synthesis results, Hwacha is 9% smaller than the equivalent Ara instance. The trend also holds for the equivalent instances with eight and sixteen lanes. The main reason for this area difference is that Hwacha has only half as many multipliers as Ara, i.e., one MUL per two FMA units [42]. These multipliers account for the 9% area difference. Moreover, these Hwacha instances do not support mixed-precision arithmetic [5], whose support would incur a 4% area overhead [41]. Ara, however, has a simpler execution mechanism than Hwacha's Vector Runahead Unit [42], contributing to the area difference.

We used the placed-and-routed designs to analyze the performance and energy efficiency of Ara when running the considered benchmarks. Due to the asymmetry between the code that runs on Ariane and on Ara, we extracted switching activities by running the benchmarks on netlists back-annotated with timing information. As expected, the energy efficiency of Ara coupled to an Ariane core is considerably higher than that of an Ariane core alone. For instance, a 256 ×
256 integer matrix multiplication achieves up to 43.6 GOPS/W of energy efficiency on an Ara with four lanes, whereas a comparable benchmark runs at 17 GOPS/W on Ariane [34]. In that case, the instruction and data caches alone are responsible for 46% of Ariane's power dissipation. In Ara's case, most memory accesses go directly into the VRF, and the energy spent on cache accesses can be amortized over many vector lanes and cycles, increasing the system's energy efficiency with an increasing number of lanes. A Hwacha implementation in ST 28 nm FD-SOI technology (at an undisclosed operating condition) achieves a peak energy efficiency of 40 DP−GFLOPS/W [33]. Adjusting for scaling gains [1], this is comparable to the 41 DP−GFLOPS/W energy efficiency of the large Ara instances running MATMUL.

As Dabbelt et al. [5] do not specify the technology they used, we assumed an ideal scaling from 28 nm to 22 nm; that is, we considered one GE in 28 nm to be (28/22)² bigger than one GE in 22 nm, or 0.322 µm².

VII. CONCLUSIONS
In this work, we presented Ara, a parametric in-order high-performance energy-efficient 64-bit vector unit based on the version 0.5 draft of RISC-V's vector extension. Ara acts as a coprocessor tightly coupled to Ariane, an open-source application-class RV64GC core. Ara's microarchitecture was designed with scalability in mind. To this end, it is composed of a set of identical lanes, each hosting part of the system's vector register file and functional units. The lanes communicate with each other via the VLSU and the SLDU, which are responsible for executing instructions that touch all the VRF banks at once. These units arguably represent the weak points when it comes to scalability, because they get wider with an increasing number of lanes. Other architectures take an alternative approach, having several narrow memory ports instead of a single wide one. This approach does not solve the scalability problem, but merely deflects it further to the memory interconnect and cache subsystem.

We measured the performance of Ara using matrix multiplication, convolution (both compute-bound), and AXPY (memory-bound) double-precision kernels. For problems large enough, the compute-bound kernels almost saturate the FPUs, with the measured performance of a 256 × 256 matrix multiplication only 3% below the theoretically achievable peak performance. In terms of performance and power, we presented post-place-and-route results for Ara configurations with two up to sixteen lanes in GLOBALFOUNDRIES' 22 nm FD-SOI technology. Running an integer matrix multiplication, an Ara instance with four lanes is about 2.5× more energy efficient than Ariane alone on an equivalent benchmark (43.6 versus 17 GOPS/W). An instance of our design with sixteen lanes achieves up to about 41 DP−GFLOPS/W running computationally intensive benchmarks, comparable to the energy efficiency of the equivalent Hwacha implementation.

We decided not to restrict the performance analysis to very large problems, and observed a performance degradation for problems whose size is comparable to the number of vector lanes. This is not a limitation of Ara per se, but rather of vector processors in general when coupled to a single-issue in-order core. The main reason for the low FPU utilization on small problems is the rate at which the scalar core issues vector instructions. With our MATMUL implementation, Ariane issues a vector FMA instruction every five cycles, and the shorter the vector length is, the more vector instructions are required to fill the pipeline. By decoupling operand fetch and result write-back, Ara tries to eliminate bubbles that would have a significant impact on short-lived vector instructions. While the achieved performance in this case is far from the peak, it is nonetheless close to the instruction issue rate performance boundary.

To this end, we believe that it would be interesting to investigate whether and to what extent this performance limit could be mitigated by leveraging a superscalar or VLIW-capable core to drive the vector coprocessor. While using multiple small cores to drive the vector lanes increases their individual utilization, maintaining optimal energy efficiency might mean using fewer lanes than physically available, i.e., a lower overall utilization of the functional units. In any case, care must be taken to find an equilibrium between the high-performance and energy-efficiency requirements of the design.

ACKNOWLEDGMENTS
We would like to thank Frank Gürkaynak and Francesco Conti for the helpful discussions and insights.

REFERENCES

[1] R. G. Dreslinski, M. Wieckowski, D. Blaauw, D. Sylvester, and T. Mudge, "Near-threshold computing: Reclaiming Moore's law through energy efficient integrated circuits," Proceedings of the IEEE, vol. 98, no. 2, pp. 253–266, Feb. 2010.
[2] I. Hwang and M. Pedram, "A comparative study of the effectiveness of CPU consolidation versus dynamic voltage and frequency scaling in a virtualized multicore server," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 24, no. 6, pp. 2103–2116, Jun. 2016.
[3] S. Kiamehr, M. Ebrahimi, M. S. Golanbari, and M. B. Tahoori, "Temperature-aware dynamic voltage scaling to improve energy efficiency of near-threshold computing," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 25, no. 7, pp. 2017–2026, Jul. 2017.
[4] J. Backus, "Can programming be liberated from the von Neumann style?: A functional style and its algebra of programs," Commun. ACM, vol. 21, no. 8, pp. 613–641, Aug. 1978.
[5] D. Dabbelt, C. Schmidt, E. Love, H. Mao, S. Karandikar, and K. Asanović, "Vector processors for energy-efficient embedded systems," in Proceedings of the Third ACM International Workshop on Many-core Embedded Systems, ser. MES '16. New York, NY, USA: ACM, 2016, pp. 10–16.
[6] V. Sze, Y. Chen, T. Yang, and J. S. Emer, "Efficient processing of deep neural networks: A tutorial and survey," Proceedings of the IEEE, vol. 105, no. 12, pp. 2295–2329, Dec. 2017.
[7] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips, "GPU computing," Proceedings of the IEEE, vol. 96, no. 5, pp. 879–899, May 2008.
[8] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym, "NVIDIA Tesla: A unified graphics and computing architecture," IEEE Micro.
[10] CoRR, 2016. [Online]. Available: http://arxiv.org/abs/1604.07316
[11] R. M. Russell, "The CRAY-1 computer system," Commun. ACM, vol. 21, no. 1, pp. 63–72, Jan. 1978.
[12] S. F. Beldianu and S. G. Ziavras, "Performance-energy optimizations for shared vector accelerators in multicores," IEEE Transactions on Computers, vol. 64, no. 3, pp. 805–817, Mar. 2015.
[13] NVIDIA Tesla V100 GPU Architecture, NVIDIA, Aug. 2017, v1.1. [Online]. Available: https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
[14] M. M. Mano, C. R. Kime, and T. Martin, Logic and Computer Design Fundamentals, 5th ed. Hoboken, NJ, USA: Pearson High Education, 2015.
[15] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, 5th ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2011.
[16] N. Stephens, S. Biles, M. Boettcher, J. Eapen, M. Eyole, G. Gabrielli, M. Horsnell, G. Magklis, A. Martinez, N. Premillieu, A. Reid, A. Rico, and P. Walker, "The ARM Scalable Vector Extension," IEEE Micro, vol. 37, no. 2, pp. 26–39, Mar. 2017.
[17] A. Waterman and K. Asanović, The RISC-V Instruction Set Manual: User-Level ISA, CS Division, EECS Department, University of California, Berkeley, CA, USA, Jun. 2019, version 20190608-Base-Ratified.
[18] "Working draft of the proposed RISC-V V vector extension," 2019, accessed on March 1, 2019. [Online]. Available: https://github.com/riscv/riscv-v-spec
[19] A. Peleg and U. Weiser, "MMX technology extension to the Intel architecture," IEEE Micro, vol. 16, no. 4, pp. 42–50, Aug. 1996.
[20] M. J. Flynn, "Some computer organizations and their effectiveness," IEEE Transactions on Computers, vol. C-21, no. 9, pp. 948–960, Sep. 1972.
[21] J. Reinders, "Intel AVX-512 instructions," Intel Software Developer Zone, Jun. 2017. [Online]. Available: https://software.intel.com/en-us/blogs/2013/avx-512-instructions
[22] ARM, "Neon," accessed on May 1, 2019. [Online]. Available: https://developer.arm.com/architectures/instruction-sets/simd-isas/neon
[23] M. Gautschi, P. D. Schiavone, A. Traber, I. Loi, A. Pullini, D. Rossi, E. Flamand, F. K. Gürkaynak, and L. Benini, "Near-threshold RISC-V core with DSP extensions for scalable IoT endpoint devices," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 25, no. 10, pp. 2700–2713, Oct. 2017.
[24] K. Asanović, "Vector microprocessors," Ph.D. dissertation, University of California, Berkeley, 1998.
[25] J. Yu, C. Eagleston, C. H.-Y. Chou, M. Perreault, and G. Lemieux, "Vector processing as a soft processor accelerator," ACM Trans. Reconfigurable Technol. Syst., vol. 2, no. 2, pp. 12:1–12:34, Jun. 2009. [Online]. Available: http://doi.acm.org/10.1145/1534916.1534922
[26] C. E. Kozyrakis and D. A. Patterson, "Scalable vector processors for embedded systems," IEEE Micro, vol. 23, no. 6, pp. 36–45, 2003.
[27] C. H. Chou, A. Severance, A. D. Brant, Z. Liu, S. Sant, and G. Lemieux, "VEGAS: Soft vector processor with scratchpad memory," in Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA), 2011, pp. 15–24.
[28] T. Yoshida, "Fujitsu high performance CPU for the Post-K computer," in Hot Chips: A Symposium on High Performance Chips, ser. HC30, Cupertino, CA, USA, Aug. 2018.
[29] Y. Lu, S. Rooholamin, and S. G. Ziavras, "Vector coprocessor virtualization for simultaneous multithreading," ACM Trans. Embed. Comput. Syst., vol. 15, no. 3, pp. 57:1–57:25, May 2016. [Online]. Available: http://doi.acm.org/10.1145/2898364
[30] S. F. Beldianu and S. G. Ziavras, "ASIC design of shared vector accelerators for multicore processors," Oct. 2014, pp. 182–189.
[31] Y. Lee, R. Avizienis, A. Bishara, R. Xia, D. Lockhart, C. Batten, and K. Asanović, "Exploring the tradeoffs between programmability and efficiency in data-parallel accelerators," SIGARCH Comput. Archit. News, vol. 39, no. 3, pp. 129–140, 2011.
[32] R. Krashinsky, C. Batten, M. Hampton, S. Gerding, B. Pharris, J. Casper, and K. Asanović, "The vector-thread architecture," SIGARCH Comput. Archit. News, vol. 32, no. 2, pp. 52–, Mar. 2004. [Online]. Available: http://doi.acm.org/10.1145/1028176.1006736
[33] C. Schmidt, A. Ou, and K. Asanović, "Hwacha: A data-parallel RISC-V extension and implementation," in Inaugural RISC-V Summit Proceedings. Santa Clara, CA, USA: RISC-V Foundation, Dec. 2018. [Online]. Available: https://content.riscv.org/wp-content/uploads/2018/12/Hwacha-A-Data-Parallel-RISC-V-Extension-and-Implementation-Schmidt-Ou-.pdf
[34] F. Zaruba and L. Benini, "The cost of application-class processing: Energy and performance analysis of a Linux-ready 1.7 GHz 64 bit RISC-V core in 22 nm FDSOI technology," arXiv e-prints, Apr. 2019.
[35] S. Mach, D. Rossi, G. Tagliavini, A. Marongiu, and L. Benini, "A transprecision floating-point architecture for energy-efficient embedded computing."
[37] Commun. ACM, vol. 52, no. 4, pp. 65–76, Apr. 2009.
[38] G. Ofenbeck, R. Steinmann, V. Caparros, D. G. Spampinato, and M. Pueschel, "Applying the roofline model," in IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Mar. 2014, pp. 76–85.
[39] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Computer Vision and Pattern Recognition (CVPR), 2015. [Online]. Available: http://arxiv.org/abs/1409.4842
[40] ARM, "Arm Cortex-A series processors," accessed on October 20, 2019. [Online]. Available: https://developer.arm.com/ip-products/processors/cortex-a
[41] Y. Lee, C. Schmidt, S. Karandikar, D. Dabbelt, A. Ou, and K. Asanović, "Hwacha preliminary evaluation results," University of California at Berkeley, Berkeley, CA, USA, Tech. Rep. UCB/EECS-2015-264, Dec. 2015.
[42] Y. Lee, A. Ou, C. Schmidt, S. Karandikar, H. Mao, and K. Asanović, "The Hwacha microarchitecture manual," University of California at Berkeley, Berkeley, CA, USA, Tech. Rep. UCB/EECS-2015-263, Dec. 2015.

APPENDIX
A. Implementation and execution of a matrix multiplication
Here we analyze in depth the implementation and execution of the n × n matrix multiplication. We assume the matrices are stored in row-major order. Our implementation uses a tiled approach, working on t rows of matrix C at a time. Figure 9 presents the matrix multiplication algorithm, working on tiles of size t × n. The algorithm showcases how the ISA handles scalability via strip-mined loops [24]. Line 3 uses the setvl instruction, which sets the vector length for the following vector instructions and enables the same code to be used on vector processors with different maximum vector lengths VLMAX. Once inside the strip-mined loop, there are three distinct computation phases: I) read a block of matrix C; II) the actual computation of the matrix multiplication; and III) write the result to memory. Phases I and III take O(n) cycles, whereas phase II takes O(n²) cycles. The core part of Figure 9 is the for loop of line 11, where most of the time is spent and where the FPUs are used. Listing 1 shows the resulting RISC-V vector assembly code for phase II of the matrix multiplication, considering a block size of four rows. We ignore some control flow instructions at the start and end of Listing 1, which handle the outer for loop.

c ← 0;
while c < n do  { Strip-mining loop }
  vl ← min(n − c, VLMAX);
  r ← 0;
  while r < n do
    for j ← 0 to min(n − r, t) − 1 do  { Phase I }
      Load row C[r + j, c] into vector register vCj;
    end for
    for i ← 0 to n − 1 do  { Phase II }
      Load row B[i, c] into vector register vB;
      for j ← 0 to min(n − r, t) − 1 do
        Load element A[r + j, i];
        Broadcast A[r + j, i] into vector register vA;
        vCj ← vA vB + vCj;
      end for
    end for
    for j ← 0 to min(n − r, t) − 1 do  { Phase III }
      Store vector register vCj into C[r + j, c];
    end for
    r ← r + t;
  end while
  c ← c + vl;
end while

Fig. 9. Algorithm for the matrix multiplication C ← AB + C.

Listing 1
EXCERPT OF THE MATRIX MULTIPLICATION IN RISC-V VECTOR EXTENSION ASSEMBLY, WITH A BLOCK SIZE OF FOUR ROWS.

; a0: pointer to A
; a1: pointer to B
; a2: A row size
; a3: B row size
vld   vB0, 0(a1)          ; load row of B
add   a1, a1, a3          ; bump B pointer
vld   vB1, 0(a1)          ; load row of B
add   a1, a1, a3          ; bump B pointer
ld    t0, 0(a0)           ; / load element of A
add   a0, a0, a2          ; | bump A pointer
vins  vA, t0, zero        ; | move from Ariane to Ara
vmadd vC0, vA, vB0, vC0   ; \ vector multiply-add
ld    t0, 0(a0)
add   a0, a0, a2
vins  vA, t0, zero
vmadd vC1, vA, vB0, vC1
ld    t0, 0(a0)
add   a0, a0, a2
vins  vA, t0, zero
vmadd vC2, vA, vB0, vC2
ld    t0, 0(a0)
add   a0, a0, a2
vins  vA, t0, zero
vmadd vC3, vA, vB0, vC3
vld   vB0, 0(a1)          ; load row of B
add   a1, a1, a3          ; bump B pointer
ld    t0, 0(a0)           ; / load element of A
add   a0, a0, a2          ; | bump A pointer
vins  vA, t0, zero        ; | move from Ariane to Ara
vmadd vC0, vA, vB1, vC0   ; \ vector multiply-add
...
ld    t0, 0(a0)
add   a0, a0, a2
vins  vA, t0, zero
vmadd vC3, vA, vB1, vC3

After loading one row of matrix B, the kernel consists of four repeating instructions, responsible for, respectively: i) loading the element A[j, i] into the general-purpose register t0; ii) bumping the address of A[j, i] in preparation for the next iteration; iii) broadcasting scalar register t0 into vector register vA; and iv) the multiply-add instruction vCj ← vA vB + vCj. As Ariane is a single-issue core, this kernel runs in at least four cycles per iteration. In steady state, however, we measure that each loop iteration runs in five cycles. The reason for this, as shown in the pipeline diagram of Figure 10, is one bubble due to the data dependence between the scalar load (which takes two cycles) and the broadcast instruction.
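For reference, the strip-mined, tiled algorithm of Figure 9 can also be written as an executable sketch. The following is plain Python mirroring the pseudocode, with vector registers modeled as list slices; `matmul_tiled` and its parameters are illustrative names, not Ara software.

```python
# Executable sketch of the strip-mined, tiled MATMUL of Figure 9.
# Matrices are lists of rows (row-major); t is the tile height and
# VLMAX the maximum vector length handled by setvl.

def matmul_tiled(A, B, C, n, t=4, VLMAX=64):
    c = 0
    while c < n:                                   # strip-mining loop
        vl = min(n - c, VLMAX)                     # setvl
        for r in range(0, n, t):
            rows = min(t, n - r)
            # Phase I: load a rows x vl block of C into "vector registers"
            vC = [C[r + j][c:c + vl] for j in range(rows)]
            # Phase II: accumulate A[r+j, i] * B[i, c:c+vl]
            for i in range(n):
                vB = B[i][c:c + vl]                # load row of B
                for j in range(rows):
                    a = A[r + j][i]                # scalar load + broadcast
                    vC[j] = [a * b + acc for b, acc in zip(vB, vC[j])]
            # Phase III: store the block back to C
            for j in range(rows):
                C[r + j][c:c + vl] = vC[j]
        c += vl
    return C
```

Running it with a small VLMAX exercises the strip-mining path and yields the same result as a single pass, which is the scalability property the setvl idiom provides.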
Instruction   Cycle: 1    2    3    4    5    6    7    8
LD                   IS   EX   EX   CO
ADD                       IS   EX   CO
VINS                           —    IS   EX   EX   CO
VMADD                                    IS   EX   EX   CO
LD                                            IS   EX   EX

Fig. 10. Pipeline diagram of the matrix multiplication kernel. Only three pipeline stages are highlighted: IS is Instruction Issue, EX is Execution Stage, CO is Commit Stage. Ariane has two commit ports into the scoreboard.
We used loop unrolling and software pipelining to code the algorithm of Figure 9 in our C implementation. The use of these techniques to improve performance is visible in Listing 1. We unrolled the for loop of line 11 of Figure 9, which corresponds to lines 11-14 of Listing 1, repeated t times on the following lines. This avoids any branching at the end of the loop. Moreover, two vector registers hold rows of matrix B. This double buffering allows the simultaneous loading of one row into vector vB1, in line 9, while vB0 is used for the FMAs, as in line 14 of Listing 1. After line 28, vB1 is used for the computation, while another row of B is loaded into vB0.

The three phases of the computation can be distinguished clearly in Figure 11, which shows the utilization of the VLSU and FPU for a 32 × 32 matrix multiplication on an Ara instance with four lanes. Note how the FPUs are almost fully utilized during phase II, while being almost idle otherwise.

Fig. 11. Utilization of Ara's functional units for a 32 × 32 matrix multiplication on an Ara instance with four lanes.
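The double-buffering pattern described above can be sketched as follows; `load_row` and `fma` are hypothetical placeholders standing in for the vld and vmadd sequences of Listing 1, not Ara primitives.

```python
# Sketch of the double-buffering pattern: while one buffer (vB0) feeds
# the FMAs, the next row of B is loaded into the other (vB1), and the
# roles swap every iteration.

def phase2_double_buffered(B, n, load_row, fma):
    # prologue: prefetch the first two rows (assumes n >= 2)
    buf = [load_row(B, 0), load_row(B, 1)]
    for i in range(n):
        cur = buf[i % 2]                     # buffer feeding the FMAs
        if i + 2 < n:
            buf[i % 2] = load_row(B, i + 2)  # refill the freed buffer
        fma(i, cur)
```

The point of the pattern is that the load of row i+2 overlaps with the computation on row i, hiding the memory latency behind the vector FMAs, exactly as lines 9 and 14 of Listing 1 overlap.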
Matheus Cavalcante received the M.Sc. degree in Integrated Electronic Systems from the Grenoble Institute of Technology (Phelma), France, in 2018. He is currently pursuing a Ph.D. degree at the Integrated Systems Laboratory of ETH Zürich, Switzerland. His research interests include high performance compute architectures and interconnection networks.

Fabian Schuiki received the B.Sc. and M.Sc. degrees in electrical engineering from ETH Zürich in 2014 and 2016, respectively. He is currently pursuing a Ph.D. degree with the Digital Circuits and Systems group of Luca Benini. His research interests include transprecision computing as well as near- and in-memory processing.

Florian Zaruba received his B.Sc. degree from TU Wien in 2014 and his M.Sc. from ETH Zürich in 2017. He is currently pursuing a Ph.D. degree at the Integrated Systems Laboratory. His research interests include the design of very large scale integration circuits and high performance computer architectures.

Michael Schaffner received his M.Sc. and Ph.D. degrees from ETH Zürich, Switzerland, in 2012 and 2017. He was a research assistant at the Integrated Systems Laboratory, ETH Zürich, and Disney Research, Zürich, from 2012 to 2017, working on digital signal and video processing. From 2017 to 2018 he was a postdoctoral researcher at the Integrated Systems Laboratory, ETH Zürich, focusing on the design of RISC-V processors and efficient co-processors. Since 2019, he has been with the ASIC development team at Google Cloud Platforms, Sunnyvale, USA, where he is involved in processor design. Michael Schaffner received the ETH Medal for his Diploma thesis in 2013.