An Open-Source Platform for High-Performance Non-Coherent On-Chip Communication
Andreas Kurth, Wolfgang Rönninger, Thomas Benz, Matheus Cavalcante, Fabian Schuiki, Florian Zaruba, Luca Benini
Abstract—On-chip communication infrastructure is a central component of modern systems-on-chip (SoCs), and it continues to gain importance as the number of cores, the heterogeneity of components, and the on-chip and off-chip bandwidth continue to grow. Decades of research on on-chip networks enabled cache-coherent shared-memory multiprocessors. However, communication fabrics that meet the needs of heterogeneous many-cores and accelerator-rich SoCs, which are not, or only partially, coherent, are a much less mature research area. In this work, we present a modular, topology-agnostic, high-performance on-chip communication platform. The platform includes components to build and link subnetworks with customizable bandwidth and concurrency properties and adheres to a state-of-the-art, industry-standard protocol. We discuss microarchitectural trade-offs and timing/area characteristics of our modules and show that they can be composed to build high-bandwidth (e.g., 2.5 GHz and 1024 bit data width) end-to-end on-chip communication fabrics (not only network switches but also DMA engines and memory controllers) with high degrees of concurrency. We design and implement a state-of-the-art ML training accelerator, where our communication fabric scales to 1024 cores on a die, providing 32 TB/s cross-sectional bandwidth at only 24 ns round-trip latency between any two cores.
1 INTRODUCTION

On-chip networks are the primary means of communication inside modern multi- and many-core processing SoCs [1]. As the number of cores, the heterogeneity of components, and the on- and off-chip bandwidth continue to grow to meet ever higher application demands, on-chip networks continue to gain importance. Decades of research on on-chip networks were instrumental for breakthroughs in the scalability of homogeneous shared-memory multiprocessors, and a continuation of this research is necessary to realize the full potential of many-core accelerators and accelerator-rich heterogeneous SoCs.

Ideally, SoC designers could compose on-chip networks from a platform of components according to the requirements of their application. The central design goals of such a platform are: (G1) Elementary, modular components that can implement any topology and that separate concerns such as routing and buffering. (G2) Parametrizable components (e.g., data width, transaction concurrency) to cover a large design space. (G3) Bridging components to connect heterogeneous SoC elements (e.g., GPU SMs, DMA engines, and domain-specific accelerators) and their subnetworks, each with unique, application-driven latency and bandwidth requirements. (G4) Compliance with an industry-standard protocol for extensibility, third-party compatibility, and verifiability. (G5) Detailed characterization of the complexity and trade-offs of the components in terms of performance vs. cost (area, power) to guide design and optimization efforts.

Commercial offerings that meet (parts of) these goals exist from multiple vendors (details in § 5). In this work, we make the following contributions:
1) We present a modular, topology-agnostic (G1), high-performance on-chip communication platform of parametrizable components (G2) for a state-of-the-art, industry-standard protocol (G4), including components to bridge heterogeneous subnetworks (G3) (§ 2). We publish the modules of our platform, implemented in industry-standard SystemVerilog, under a permissive open-source license for research and industrial usage.
2) We discuss microarchitectural trade-offs and timing/area characteristics of the modules in our platform (G5), both theoretically/asymptotically and with topographical synthesis results (§ 3).
3) We design, implement, and evaluate the on-chip network of a state-of-the-art machine learning training (MLT) accelerator built from our platform (§ 4).

This paper is organized as follows: We present the architecture of our on-chip communication platform in § 2 and its implementation results in § 3. We then use our platform to design, implement, and evaluate the communication fabric of a state-of-the-art many-core MLT accelerator in § 4. Finally, we compare with related work in § 5.

2 ARCHITECTURE
Current on-chip communication is centered around the premise of high-bandwidth point-to-point data transfers. To fulfill this premise despite increasing point-to-point latency, three central traits of current on-chip communication protocols are: burst-based transactions, multiple outstanding transactions, and transaction reordering. Our design targets these central traits in general, so the concepts we present potentially apply to a wide range of modern on-chip protocols. More tangibly, we adhere to the latest revision (5) of the AMBA Advanced eXtensible Interface (AXI) [5]. AXI is one of the industry-dominant protocols and the only protocol with an open, royalty-free specification and widespread adoption in current systems designed by many different companies. Other protocols with similar properties are discussed in § 5.

Communication between a master port and a slave port is structured into two directions (read and write), into channels for commands, data, and responses, and into transfer items called beats. A write transaction starts with one beat on the write command channel, followed by one or multiple beats on the write data channel, and ends with a single beat on the write response channel. A read transaction starts with one beat on the read command channel and ends with one or the last of multiple beats on the read response channel. Each transaction has an ID. IDs define the order of transactions and beats according to the following rules: (O1) Inter-Transaction Ordering: Any two transactions in the same direction and with the same ID are ordered. (O2)
Response Ordering: Any two responses with the same direction and ID must be in the same order as their commands. (O3)
Write Beat Ordering: Write data beats do not have an ID and are therefore always ordered. Each channel has multiple isodirectional payload signals and two signals for bidirectional flow control. We focus on valid-ready flow control, where the channel master drives valid and the payload and the channel slave drives ready (but other flow control schemes, e.g., credit-based, are possible). A handshake occurs when valid and ready are high on a rising clock edge. There are two essential rules in this valid-ready flow control: (F1) Stability Rule: Once valid is high, it and the payload must not change until after the next handshake. (F2)
Acyclicity Rule: The channel receiver may depend on valid being high before setting ready high, but the channel sender may not depend on ready being high before setting valid high.

An overview of the modules in our on-chip communication platform is given in Table 1. In this section, we discuss their microarchitecture and design trade-offs, from elementary components through all essential interconnecting modules to endpoints of increasing complexity.
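To make rules (F1) and (F2) concrete, the following is a minimal sketch (our own illustration, not a module of the released platform) of a one-deep register slice for a single channel with valid-ready flow control:

```systemverilog
// Sketch of a one-deep register slice for one valid-ready channel.
// (F1) Stability: out_valid_o and out_data_o are driven from registers that only
//      update on an upstream handshake, so they are held until the downstream
//      handshake completes.
// (F2) Acyclicity: out_valid_o never depends combinationally on out_ready_i, so
//      the sender does not wait for ready before asserting valid. (in_ready_o may
//      depend on out_ready_i; that is permitted for the receiver side.)
// Unlike the platform's pipeline registers, this slice does not cut the ready path.
module vld_rdy_reg_slice #(
  parameter int unsigned DataWidth = 32
) (
  input  logic                 clk_i,
  input  logic                 rst_ni,
  // upstream (slave) side
  input  logic                 in_valid_i,
  output logic                 in_ready_o,
  input  logic [DataWidth-1:0] in_data_i,
  // downstream (master) side
  output logic                 out_valid_o,
  input  logic                 out_ready_i,
  output logic [DataWidth-1:0] out_data_o
);
  logic                 full_q;  // the register currently holds a beat
  logic [DataWidth-1:0] data_q;

  assign in_ready_o  = !full_q || out_ready_i; // accept if empty or draining this cycle
  assign out_valid_o = full_q;
  assign out_data_o  = data_q;

  always_ff @(posedge clk_i or negedge rst_ni) begin
    if (!rst_ni) begin
      full_q <= 1'b0;
      data_q <= '0;
    end else if (in_valid_i && in_ready_o) begin
      full_q <= 1'b1;       // latch the incoming beat
      data_q <= in_data_i;
    end else if (out_valid_o && out_ready_i) begin
      full_q <= 1'b0;       // beat drained, slice becomes empty
    end
  end
endmodule
```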
Category                 | Module                    | Section
Elementary Components    | Network Multiplexer       | 2.1.1
Elementary Components    | Network Demultiplexer     | 2.1.2
Network Junctions        | Crossbar                  | 2.2.1
Network Junctions        | Crosspoint                | 2.2.2
Concurrency Control      | ID Remapper               | 2.3.1
Concurrency Control      | ID Serializer             | 2.3.2
Data Width Converters    | Data Upsizer              | 2.4.1
Data Width Converters    | Data Downsizer            | 2.4.2
Data Movement            | DMA Engine                | 2.5
On-Chip Memory Endpoints | Simplex Memory Controller | 2.6.1
On-Chip Memory Endpoints | Duplex Memory Controller  | 2.6.2
On-Chip Memory Endpoints | Last Level Cache          | 2.7
Table 1. Overview of the modules in our on-chip communication platform.

Figure 1. Architecture of our multiplexer, drawn with two slave ports.

2.1 Elementary Components

Network multiplexers and demultiplexers are the elementary components that join multiple ports to one and split one port into multiple, respectively. In doing so, they must adhere to the relations between the channels and to the ordering rules (O1–3). They are obviously used to build network junctions (e.g., crossbars), but they can be reused far beyond that because they implement a central part of the communication protocol. In fact, these elementary components are essential for almost all modules of our platform.
2.1.1 Network Multiplexer

The multiplexer, which connects multiple slave ports to one master port, consists of multiplexing components for the forward channels and demultiplexing components for the backward channels. The complexity lies in demultiplexing the backward channels, because the multiplexer needs the information to which output a beat on a backward channel must be routed. Multiplexing the command channels simply requires the selection of a valid beat, with the restriction that a selection must be stable once made (F1).

Our multiplexer architecture is shown in Fig. 1. We first prepend the ID of each command beat with the ID of the slave port. We then select among beats on the command channels with round-robin (RR) arbitration trees. For writes, the decision is forwarded through a first-in first-out buffer (FIFO) to a multiplexer for the write data beats, which is sufficient due to (O3). As commands out of our multiplexer carry the input port information in the most significant bits (MSBs) of their ID, routing responses is as simple as demultiplexing based on the MSBs and then truncating the ID to the original width. Another key advantage is that transactions with the same ID from any two different slave ports remain independent, so (O1) does not restrict communication through our multiplexer. Note that stream demultiplexing means the payload is the same for all demux outputs and only the handshake signals are (de)multiplexed.

Alternative multiplexer architectures could do without extending the ID, for example by allowing only transactions with different IDs concurrently or by remapping IDs internally. However, the former restricts communication, and the latter significantly increases the complexity of the multiplexer. Nonetheless, some network modules grow exponentially in complexity with the ID width. We have a modular solution to this challenge with the ID width converters discussed in § 2.3.
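The ID handling at the boundary of the multiplexer can be sketched as follows (an illustration with our own module and signal names; the platform integrates this into its arbitration logic):

```systemverilog
// Sketch of ID extension at a multiplexer: commands from slave port s get s
// prepended in the MSBs of their ID; responses are routed back by those MSBs,
// and the ID is truncated to its original width before leaving the slave port.
module mux_id_extend_sketch #(
  parameter int unsigned IdWidth  = 6, // ID width at the slave ports
  parameter int unsigned NumPorts = 2,
  localparam int unsigned SelWidth   = $clog2(NumPorts),
  localparam int unsigned MstIdWidth = IdWidth + SelWidth // ID width at the master port
) (
  input  logic [SelWidth-1:0]   slv_port_idx_i, // slave port that won arbitration
  input  logic [IdWidth-1:0]    cmd_id_i,       // command ID from that slave port
  output logic [MstIdWidth-1:0] cmd_id_o,       // extended ID towards the master port
  input  logic [MstIdWidth-1:0] rsp_id_i,       // response ID from the master port
  output logic [SelWidth-1:0]   rsp_port_idx_o, // slave port the response belongs to
  output logic [IdWidth-1:0]    rsp_id_o        // ID truncated to its original width
);
  // Forward path: prepend the slave port index in the MSBs.
  assign cmd_id_o = {slv_port_idx_i, cmd_id_i};
  // Backward path: the MSBs select the slave port, the LSBs restore the ID.
  assign rsp_port_idx_o = rsp_id_i[MstIdWidth-1 -: SelWidth];
  assign rsp_id_o       = rsp_id_i[IdWidth-1:0];
endmodule
```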
Figure 2. Architecture of our demultiplexer, drawn with two master ports.

2.1.2 Network Demultiplexer

The demultiplexer, which connects one slave port to multiple master ports, is more complex than the multiplexer due to the ordering rules: When the demultiplexer gets two commands with the same ID and direction (O1) that go to two different master ports, it must deliver the corresponding responses in the same order (O2). After the demultiplexer, however, transactions on different master ports are independent, so the demultiplexer cannot rely on the order of downstream responses to fulfill (O2).

Our demultiplexer architecture, shown in Fig. 2, solves this by enforcing that all concurrent transactions with the same direction and ID target the same master port. For example, when a write with ID A targets a given master port, it is only forwarded if no writes with ID A to any other master port have outstanding responses; otherwise, the write must wait. To track this information, the demultiplexer contains one counter and one index register per ID and direction. Commands that fulfill the aforementioned requirement increase the counter; the (last) response decreases the counter. A stream register between the write command channel and the demultiplexer of the write data channel stores the master port index of an ongoing write burst while the command channel is independently handshaked (F1). Write commands and data bursts are sent in lockstep due to (O3); without this restriction, the write command and data channels could deadlock downstream. The multiple read and write response channels are joined through a round-robin arbitration tree.

Alternative demultiplexer architectures could do without requiring all concurrent transactions with the same direction and ID to target the same master port, for example by remapping IDs internally. However, this significantly increases the complexity of the demultiplexer, which would have to reorder responses internally to fulfill (O2). Instead of introducing this complexity, we let a master use different IDs for different endpoints if it can handle out-of-order responses.

2.2 Network Junctions

2.2.1 Crossbar

The elementary components in § 2.1 compose our crossbar, shown in Fig. 3: each slave port connects to a demultiplexer, each master port to a multiplexer, and every demultiplexer is connected to every multiplexer. An address decoder at each slave port selects the master port for each command.

Figure 3. Architecture of our crossbar, drawn with two slave and three master ports. Each fat arrow represents a five-channel (see § 2) connection. Components with dashed outline are optional.
In the standard configuration, all slave ports use the same address range for one master port, but different configurations would be possible. There are two alternatives for handling transactions to an address range that is not defined in a decoder. First, one master port can be defined as default port. This is useful, for example, in a hierarchical topology where each downlink has a specific address range and any address outside the downlink addresses is sent to higher hierarchy levels through the uplink. Second, one can instantiate an error slave, which terminates all transactions with protocol-compliant error responses. These two alternatives can be selected per slave port with a synthesis parameter.

Optional pipeline registers can be inserted on all or some of the five channels of each internal connection. These registers cut all combinational signals (including handshake signals), thereby adding a cycle of latency per channel and pipelining the crossbar so its critical path is no longer than that of the demultiplexer or multiplexer. These pipeline registers can be added without risking deadlocks, but this is not trivial: Of the four Coffman conditions [6], (1) Mutual Exclusion is fulfilled on the write data channel after the multiplexer, (2) Hold and Wait is fulfilled as each pipeline register must hold its value once filled, (3) No Preemption is fulfilled by (O3) on the write data channel, and (4) Circular Wait would be fulfilled by round-robin arbitration of write command and data beats. However, the demultiplexer breaks condition (4) by restricting write commands to be issued in lockstep with write data bursts (i.e., the next write command is only issued after the previous write data burst has completed), thereby preventing deadlocks despite pipeline registers, which introduce condition (2).
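The address-decoding policy (default master port or error slave) can be illustrated with the following sketch (our own simplified module; parameter and signal names are not those of the platform):

```systemverilog
// Sketch of per-slave-port address decoding: each master port owns one
// [start, end) address range; addresses that match no range go either to a
// configured default master port or to an error slave that terminates the
// transaction with protocol-compliant error responses.
module addr_decode_sketch #(
  parameter int unsigned AddrWidth   = 64,
  parameter int unsigned NumMstPorts = 3,
  parameter bit          UseDefault  = 1'b0,
  parameter int unsigned DefaultIdx  = 0,
  parameter logic [NumMstPorts-1:0][AddrWidth-1:0] RangeStart = '0,
  parameter logic [NumMstPorts-1:0][AddrWidth-1:0] RangeEnd   = '0,
  localparam int unsigned IdxWidth = $clog2(NumMstPorts + 1) // +1 for the error slave
) (
  input  logic [AddrWidth-1:0] addr_i,
  output logic [IdxWidth-1:0]  mst_idx_o // selected master port (or error slave)
);
  localparam logic [IdxWidth-1:0] ErrorIdx = IdxWidth'(NumMstPorts);

  always_comb begin
    // No match: either the default port or the error slave handles the access.
    mst_idx_o = UseDefault ? IdxWidth'(DefaultIdx) : ErrorIdx;
    for (int unsigned i = 0; i < NumMstPorts; i++) begin
      if (addr_i >= RangeStart[i] && addr_i < RangeEnd[i]) begin
        mst_idx_o = IdxWidth'(i);
      end
    end
  end
endmodule
```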
2.2.2 Crosspoint

As the multiplexers in the crossbar expand the ID width, the master ports of the crossbar have a wider ID than the slave ports. This prevents the direct use of our crossbar as nodes in a regular on-chip network, where each node (also called "router") has isomorphous slave and master ports. To solve this problem, we introduce a crosspoint.

Our crosspoint, shown in Fig. 4, has two additional properties over the crossbar that make it better suited for composing arbitrary regular on-chip topologies. First, it contains a crossbar that is not necessarily fully connected: The connection between any slave and master port can be omitted with a synthesis parameter. This is useful to prevent routing loops when a module has both a master and a slave port into the crosspoint, and it minimizes the physical resources on links that would be unused. Second, the crosspoint contains an ID remapper (§ 2.3.1) at each master port, which reduces the ID width back to that of the slave ports.

Figure 4. Architecture of our crosspoint, drawn with four slave and master ports. Each arrow represents a five-channel (see § 2) connection.
Figure 5. Architecture of our ID remapper, drawn with up to four unique concurrent IDs (per direction).

2.3 Concurrency Control

The ID of transactions is central to their ordering (O1–2). Essentially, the commands and responses of any two transactions can be independently reordered if they have different IDs. This makes a high number of possible IDs attractive to prevent bottlenecks due to ordering constraints. However, tracking a high number of IDs is complex for network components (e.g., the demultiplexer, § 2.1.2). This section therefore presents modules for reducing the ID width (as extending it is trivial). There are two first-order parameters for ID reduction: the width of IDs at the output, O, and the maximum number of unique IDs at the input, U. The relation between O and U determines whether all transactions that were independent at the input remain independent at the output: If U ≤ O, every unique ID at the input can be represented by a unique ID at the output, therefore retaining transaction independence. This means the sparsely used input ID space can be 'compressed' to a narrower, densely used output ID space by remapping IDs (§ 2.3.1). If U > O, there are not enough output IDs to represent all U unique IDs. This means some transactions with originally different IDs will have to be mapped to the same ID, thereby serializing them (§ 2.3.2).

2.3.1 ID Remapper

Our ID remapper, shown in Fig. 5, remaps IDs with one table per direction. The table has as many entries as there are unique input IDs, and it is indexed by the output ID. Each table entry has two fields: the input ID and a counter that records how many transactions with the same ID are in flight. The counter is incremented on command handshakes and decremented on (last) response handshakes. The mapping from input to output IDs is injective. Obtaining the input ID from an output ID (to remap responses) is as simple as indexing the table. Determining the output ID for an input ID (to remap requests) requires a comparison of the input ID to all IDs in the table. If the table currently contains an entry for the input ID, the same output ID must be used (O1). If the table does not currently contain an entry for the input ID, the output ID is the index of the next free table entry.

Alternative ID remapper architectures could feature an additional table indexed by input IDs to look output IDs up. However, under the assumption of the remapper that the input ID space is sparse, such an additional table would be mostly empty. Therefore, it would be a poor usage of hardware resources, and we omit it at the cost of a longer ID translation path, which could be pipelined.
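The table-based remapping described above can be sketched as follows (our own simplified illustration for one direction; corner cases such as counter saturation and a simultaneous request and response to the same entry are omitted):

```systemverilog
// Sketch of the ID remapping table for one direction: the table is indexed by
// the output ID; each entry stores the input ID it represents and a counter of
// in-flight transactions. A request reuses the entry that already holds its
// input ID (O1) or claims the first free entry; a (last) response recovers its
// input ID by a plain table lookup with the output ID.
module id_remap_table_sketch #(
  parameter int unsigned IdWidth  = 10, // input ID width (sparse ID space)
  parameter int unsigned NumOut   = 4,  // unique output IDs = table entries
  parameter int unsigned CntWidth = 3,  // up to 2**CntWidth-1 txns per entry
  localparam int unsigned OutIdW  = $clog2(NumOut)
) (
  input  logic               clk_i,
  input  logic               rst_ni,
  // request side
  input  logic               req_valid_i,
  input  logic [IdWidth-1:0] req_id_i,
  output logic [OutIdW-1:0]  req_out_id_o,
  output logic               req_ready_o,  // low if nothing matches and nothing is free
  // response side (last beat of a transaction)
  input  logic               rsp_valid_i,
  input  logic [OutIdW-1:0]  rsp_out_id_i,
  output logic [IdWidth-1:0] rsp_in_id_o
);
  typedef struct packed {
    logic [IdWidth-1:0]  in_id;
    logic [CntWidth-1:0] cnt; // in-flight transactions mapped to this entry
  } entry_t;

  entry_t [NumOut-1:0] table_q;

  logic [NumOut-1:0] match, free;
  for (genvar i = 0; i < NumOut; i++) begin : gen_flags
    assign match[i] = (table_q[i].cnt != '0) && (table_q[i].in_id == req_id_i);
    assign free[i]  = (table_q[i].cnt == '0);
  end

  // Select the matching entry if any, otherwise the lowest-index free entry
  // (a leading-zero counter in the real design).
  always_comb begin
    req_out_id_o = '0;
    req_ready_o  = 1'b0;
    for (int unsigned i = 0; i < NumOut; i++) begin
      if (match[i]) begin
        req_out_id_o = OutIdW'(i);
        req_ready_o  = 1'b1;
      end
    end
    if (!req_ready_o) begin
      for (int unsigned i = NumOut; i > 0; i--) begin // lowest free index wins
        if (free[i-1]) begin
          req_out_id_o = OutIdW'(i-1);
          req_ready_o  = 1'b1;
        end
      end
    end
  end

  // Responses recover the input ID by indexing the table with the output ID.
  assign rsp_in_id_o = table_q[rsp_out_id_i].in_id;

  // Counters: increment on accepted requests, decrement on last responses.
  always_ff @(posedge clk_i or negedge rst_ni) begin
    if (!rst_ni) begin
      table_q <= '0;
    end else begin
      if (req_valid_i && req_ready_o) begin
        table_q[req_out_id_o].in_id <= req_id_i;
        table_q[req_out_id_o].cnt   <= table_q[req_out_id_o].cnt + 1'b1;
      end
      if (rsp_valid_i) begin
        table_q[rsp_out_id_i].cnt <= table_q[rsp_out_id_i].cnt - 1'b1;
      end
    end
  end
endmodule
```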
Figure 6. Architecture of our ID serializer, drawn with four master port IDs (per direction).

2.3.2 ID Serializer

If the number of unique IDs at the input of the ID width converter, U, exceeds the number of available IDs at the output, O, both the input and the output ID space are densely used. In this case, it is not possible to retain the uniqueness of all IDs during conversion, and we call the transformation that imposes additional ordering serialization. Serialized transactions still have concurrently outstanding requests, but they are now required to be handled in order.

Our ID serializer, shown in Fig. 6, transforms IDs with one FIFO per direction and master port ID. At the slave port of the serializer, a demultiplexer assigns transactions to one of the FIFO submodules through a combinational function f of the request ID (e.g., the ID modulo the number of master port IDs). The demultiplexer is a reduced configuration of our network demultiplexer (§ 2.1.2), since f assigns identical IDs to the same master port (and thus the same output ID (O1)). In each FIFO submodule, the ID of a request is pushed into a FIFO and then truncated to zero. This FIFO reflects the transaction ID in responses (O2), and the last response of a transaction pops from the FIFO. After the FIFOs, an instance of our network multiplexer (§ 2.1.1) joins the submodules at the master port of the serializer.

2.4 Data Width Converters

The data width of network components depends on their bandwidth requirements. For instance, the master port of a high-performance DMA engine might have 512 bit data width, while that of a 64-bit processor core typically has 64 bit. This extends to subnetworks, e.g., separate networks for the DMA engine and the cores. However, as subnetworks with different data widths are joined, e.g., at endpoints such as memories, data width converters (DWCs) are required to convert between data widths. DWCs can be either upsizers, converting from narrow to wide, or downsizers, converting from wide to narrow. Although similar in purpose, up- and downsizer are not fully symmetric. In fact, the upsizer has higher performance requirements than the downsizer, since it must utilize the higher-bandwidth network as much as possible to minimize the impact on other components on the high-bandwidth network.
2.4.1 Data Upsizer

A data upsizer has a narrow slave port of data width D_N and a wider master port of data width D_W. In the simplest operating mode, pass-through, the upsizer does lane selection on read responses (Fig. 7a), selecting a slice of a wide incoming word, and lane steering on write data, aligning narrow incoming data into the wider outgoing word (Fig. 7b). In pass-through mode, the upsizer does not change the number of bytes transferred in each beat. This can be required by transaction attributes (e.g., to device memory). In terms of performance, however, this underutilizes the high-bandwidth network, which inherits the throughput of its low-bandwidth counterpart. Utilization can be increased by reshaping incoming bursts with many narrow beats into bursts with fewer wide beats: several narrow write data beats are packed into one wide beat, and one wide read response beat is serialized into several narrow beats.

Our data upsizer, shown in Fig. 7c, is capable of upsizing between interfaces of any data width. It is composed of two modules, read and write upsizers, that perform lane selection and steering, besides deciding whether to upsize the request based on the transaction properties. Due to (O3), only one write upsizer is needed, containing a buffer of width D_W to perform data packing. On the read response channel, the data upsizer handles a certain number of outstanding read transactions in parallel. Each incoming read transaction is assigned an idle read upsizer, unless there is an active upsizer handling a transaction with the same ID. For that case, we ensure (O1) by enforcing that incoming transactions with the same ID are handled by the same read upsizer. Each read upsizer has a D_W-wide buffer to hold incoming beats. This avoids blocking the wide read response channel during serialization.

2.4.2 Data Downsizer

A data downsizer has a wide slave port of data width D_W and a narrower master port of data width D_N. In the simplest operating mode, pass-through, the downsizer does steering on the read data channel and selection on the write data channel, symmetrical to the base operations of the data upsizer. Our downsizer, shown in Fig. 7d, differs from the upsizer in two key points: First, the downsizer has lower performance requirements than the data upsizer, since it connects to a lower-bandwidth subnetwork, e.g., peripherals. This means it does not need to support multiple outstanding reads. Second, when downsizing, the downsizer converts few wide beats into multiple narrow beats. It is possible that the resulting burst is longer than the longest burst allowed by the protocol. In this case, the downsizer needs to break the incoming burst into a sequence of bursts. To handle this corner case, among others, the control logic of the read and write downsizers is more complex than that of their counterparts in the upsizer.
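As an illustration of the lane steering and packing performed on the upsizer's write path (§ 2.4.1), here is a minimal sketch (ours; write strobes, flow control, and the pass-through mode are omitted):

```systemverilog
// Sketch of the write-data packing in an upsizer: narrow beats are steered
// into the lane selected by the running address offset, and a wide beat is
// emitted when the last lane is filled or the burst ends.
module upsize_pack_sketch #(
  parameter int unsigned NarrowWidth = 64,
  parameter int unsigned WideWidth   = 512,
  localparam int unsigned Ratio    = WideWidth / NarrowWidth,
  localparam int unsigned SelWidth = $clog2(Ratio)
) (
  input  logic                   clk_i,
  input  logic                   rst_ni,
  input  logic                   narrow_valid_i,
  input  logic [NarrowWidth-1:0] narrow_data_i,
  input  logic [SelWidth-1:0]    lane_i, // derived from the running byte address
  input  logic                   last_i, // last narrow beat of the burst
  output logic                   wide_valid_o,
  output logic [WideWidth-1:0]   wide_data_o
);
  logic [Ratio-1:0][NarrowWidth-1:0] buffer_q, merged;

  always_comb begin
    merged = buffer_q;
    merged[lane_i] = narrow_data_i; // lane steering of the incoming beat
  end

  assign wide_data_o  = merged;
  assign wide_valid_o = narrow_valid_i && ((lane_i == SelWidth'(Ratio - 1)) || last_i);

  always_ff @(posedge clk_i or negedge rst_ni) begin
    if (!rst_ni)             buffer_q <= '0;
    else if (narrow_valid_i) buffer_q <= merged; // keep the partially packed word
  end
endmodule
```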
2.5 DMA Engine

Transferring large amounts of data at high bandwidth requires dedicated components for data movement called direct memory access (DMA) engines. Our DMA architecture is designed to be modular, dividing the unit into two parts: a system-specific frontend and a backend implementing the data movement within the on-chip interconnect. We define a simple, yet well-defined interface uniting both parts: a one-dimensional and contiguous memory block of arbitrary length, source, and destination address, called 1D transfer. We chose this interface abstraction because 1D transfers map very well to burst-based transactions. More complex transfers, such as multi-dimensional or strided accesses, are decomposed by the frontend into 1D transfers. As the frontend is highly system-specific, we will not discuss it.

In the backend, the burst reshaper, shown in Fig. 8a, divides the arbitrary-length 1D transfers into protocol-compliant bursts (adhering to, e.g., address boundaries and maximum number of beats). On arrival of a new 1D transfer, the burst converter loads length, source address, and destination address into internal registers. The burst boundaries process determines the number of bytes that can be requested in the next burst. With this, the burst reshaper calculates the address of the next burst and the remaining bytes left in the 1D transfer. Each protocol-compliant burst is then translated by the data mover unit, shown in Fig. 8b, into a read and a write command as well as a read and a write data job. The commands are issued as beats on the command channels. The data jobs are forwarded to the data path. The data path, shown in Fig. 8c, receives read data beats, realigns the data to compensate for different byte offsets between the read and write data streams, and issues write data beats. The data path consists of two independent processes. The read process realigns and buffers incoming data. If a burst starts on an unaligned address, some leading bytes ("head") in the first beat are invalid and are masked. Similarly, a burst may end on an unaligned address, in which case some trailing bytes in the last beat ("tail") need to be masked. The write process drains data from the buffer as soon as it is available and masks it according to the destination address offset with the strobe signal of the write data channel.
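The burst reshaping step can be illustrated with a small sketch (ours; the 4 KiB boundary and the 256-beat limit are typical AXI constraints, used here only as example parameters):

```systemverilog
// Sketch of how a 1D transfer is chopped into protocol-compliant bursts: a
// burst must not cross the 4 KiB address boundary and may carry at most 256
// beats. A caller would invoke this iteratively, advancing the address and
// decrementing the remaining length by the returned number of bytes.
package dma_reshape_sketch_pkg;
  localparam int unsigned DataBytes   = 64;   // 512 bit data width
  localparam int unsigned BoundaryLen = 4096; // 4 KiB boundary
  localparam int unsigned MaxBeats    = 256;  // maximum beats per burst
  localparam int unsigned MaxBurstLen = MaxBeats * DataBytes;

  // Number of bytes the next burst may carry, given the current address and
  // the bytes remaining in the 1D transfer.
  function automatic int unsigned next_burst_bytes(
    input logic [63:0] addr,
    input int unsigned bytes_left
  );
    int unsigned to_boundary;
    int unsigned num_bytes;
    to_boundary = BoundaryLen - int'(addr % BoundaryLen); // bytes until the boundary
    num_bytes   = bytes_left;
    if (num_bytes > to_boundary) num_bytes = to_boundary;
    if (num_bytes > MaxBurstLen) num_bytes = MaxBurstLen;
    return num_bytes;
  endfunction
endpackage
```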
2.6 On-Chip Memory Endpoints

On-chip memories are an important class of endpoints for on-chip network transactions. In this section, we describe two memory controllers through which standard single-port static random access memory (SRAM) macros can be connected to the on-chip network.
Figure 7. Architecture of our data width converters (DWCs). (a) Data selection in the read response and (b) data steering in the write data channel of the upsizer. (c) Upsizer, drawn with two outstanding read transactions. (d) Downsizer.
Figure 8. Architecture of our DMA engine. (a) Burst reshaper. (b) Data mover. (c) Data path, drawn for 64 bit data width.
Figure 9. Architecture of our simplex on-chip memory controller, with the on-chip network slave port at the top and the memory master port at the bottom. The memory master port has the same data width as the network slave port.

2.6.1 Simplex Memory Controller

The architecture of our simplex on-chip memory controller is shown in Fig. 9. Simplex in this context means that the controller in each clock cycle can either read or write memory, as is natural for a single-port SRAM. The memory controller first translates read commands and write commands plus write data into memory requests. An arbiter then forwards either a read or a write memory request per clock cycle. This arbiter optionally takes quality of service (QoS) attributes of a command into account and can prioritize write beats, which cannot be interleaved due to (O3), over read beats. A stream fork unit splits address and data, which go to the memory interface, from the metadata (e.g., the transaction ID), which is used by the memory controller to form responses in the network protocol. A converter translates the address and data stream into memory interface signals (with stream flow control on the request and no handshaking on the response path). The memory responses are then joined with the metadata to form read or write responses, which are finally issued on the corresponding network response channel.

The simplex memory controller cannot achieve the full bidirectional bandwidth of the duplex on-chip network interface, which has separate channels for read and write data. The duplex memory controller removes this limitation.
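A minimal sketch of the per-cycle read/write arbitration in the simplex controller described above (our own illustration; the platform's arbiter may use a different policy):

```systemverilog
// Sketch of the simplex controller's per-cycle arbitration: beats of an ongoing
// write burst are never interleaved with reads (O3); otherwise the request with
// the higher QoS value wins, with reads winning ties.
module simplex_arb_sketch (
  input  logic       clk_i,
  input  logic       rst_ni,
  input  logic       rd_req_valid_i,
  input  logic [3:0] rd_qos_i,
  input  logic       wr_req_valid_i,
  input  logic [3:0] wr_qos_i,
  input  logic       wr_burst_last_i, // current write beat is the last of its burst
  output logic       grant_write_o    // 1: forward a write beat, 0: forward a read beat
);
  logic wr_burst_busy_q; // a write burst has started but not yet finished

  always_comb begin
    if (wr_burst_busy_q)                          grant_write_o = 1'b1;                  // (O3): finish the burst
    else if (wr_req_valid_i && rd_req_valid_i)    grant_write_o = (wr_qos_i > rd_qos_i); // QoS decides
    else                                          grant_write_o = wr_req_valid_i;        // whoever is present
  end

  always_ff @(posedge clk_i or negedge rst_ni) begin
    if (!rst_ni) wr_burst_busy_q <= 1'b0;
    else if (grant_write_o && wr_req_valid_i)
      wr_burst_busy_q <= !wr_burst_last_i; // stay busy until the last write beat
  end
endmodule
```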
2.6.2 Duplex Memory Controller

The architecture of our duplex memory controller is shown in Fig. 10. To saturate the read and write data channels of the on-chip network simultaneously (thus duplex), this memory controller has at least two independent memory master ports as well as one simplex controller for writes and one for reads. A network demultiplexer statically routes all writes through the left controller and all reads through the right controller. The unused resources inside both simplex controllers are optimized away during synthesis. A logarithmic memory interconnect then routes each request to one of the memory master ports, which are address-interleaved.

Figure 10. Architecture of our duplex on-chip memory controller with four address-interleaved memory master ports.

The duplex memory controller can fully saturate both the read and the write data channel of the on-chip network in the absence of conflicts on the memory ports. However, irregular traffic (e.g., misaligned addresses, mixed wide and narrow beats) can give rise to a significant conflict rate. To reduce conflicts, the banking factor (i.e., the number of memory master ports per network slave port) can be increased to any integer higher than 2 (at the cost of more wide and shallow SRAM macros when the memory capacity is to remain constant).
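The address interleaving across the memory master ports can be sketched as follows (ours; the bit positions are illustrative):

```systemverilog
// Sketch of address interleaving behind the duplex controller: the memory
// master port is selected by the address bits just above the word offset, so
// consecutive words map to consecutive banks.
module bank_select_sketch #(
  parameter int unsigned AddrWidth = 64,
  parameter int unsigned DataBytes = 8, // 64 bit memory data width
  parameter int unsigned NumBanks  = 4, // banking factor
  localparam int unsigned OffBits  = $clog2(DataBytes),
  localparam int unsigned BankBits = $clog2(NumBanks)
) (
  input  logic [AddrWidth-1:0] addr_i,
  output logic [BankBits-1:0]  bank_idx_o, // which memory master port
  output logic [AddrWidth-1:0] bank_addr_o // address within that bank
);
  assign bank_idx_o = addr_i[OffBits +: BankBits];
  // Drop the bank bits: the remaining bits address the word inside the bank.
  assign bank_addr_o = {{BankBits{1'b0}},
                        addr_i[AddrWidth-1:OffBits+BankBits],
                        addr_i[OffBits-1:0]};
endmodule
```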
2.7 Last Level Cache

In contrast to the on-chip memory controllers of § 2.6, this section describes our last level cache (LLC). Even though traditionally caches are not seen as part of the communication infrastructure, we include this LLC in ours because it can reduce the latency and the bandwidth demand between its slave (ingress) port and its master (refill) port. This is very useful, for example, in front of an off-chip memory controller.

Figure 11. Architecture of our last level cache (LLC).

Our LLC's set associativity, number of cache lines, and number of cache blocks per cache line are synthesis parameters, giving complete control over the physical size and shape of the cache. It uses a write-back, read- and write-allocate data policy with pseudo-random eviction. The cache supports concurrent read and write accesses as well as eviction and refill operations. Reads are interleaved while adhering to (O1–2). Transactions that hit in the cache can bypass earlier transactions that missed in the cache and are currently being serviced (i.e., eviction and refill) as far as permitted by (O1). As not all applications benefit from a hardware-managed cache, our LLC can be reconfigured at runtime to partially or fully become a software-managed scratchpad memory (SPM). This option is available at the granularity of single cache sets. It is possible to use the entire data memory of our LLC as SPM. In that case, all accesses outside the address range of the SPM bypass the core of the LLC and are directly forwarded to the master port. This bypass is also used for non-cacheable transactions.

The architecture of our LLC is shown in Fig. 11. Like most components in our platform, the LLC is implemented with the stream-based control scheme that is natural to on-chip communication. The main idea is to start from the command and write data beats at the slave port, then transform, split, and merge them into descriptors that flow through the cache and give rise to new commands (for evictions and refills) and eventually to read and write responses. Starting at the slave port, commands are decoded by address and memory attributes and either sent to the bypass or into the core of the LLC. A command beat enters the cache over the command splitting units. They split the command down into descriptors, each of which targets exactly one cache line. These splitters also determine whether the access targets a cache set or an SPM region. Afterwards, the descriptors are arbitrated together with flush descriptors into a common pipeline. The descriptors then enter the hit-miss detection unit. Descriptors flagged as SPM simply flow through this unit, whereas all other descriptors perform a lookup inside the tag storage. The comparison and eviction unit determines the exact cache line and set of the descriptor. Additionally, this unit determines whether the descriptor gives rise to a refill or eviction. Descriptors that miss in the cache are sent to the eviction and refill pipeline, whereas descriptors that hit bypass this pipeline, which reduces their access latency. Two units ensure the descriptors maintain data consistency and adhere to (O1–3): The index and miss counters prevent a descriptor in the hit bypass from overtaking another descriptor with the same ID in the miss pipeline. The line lock allows only one descriptor to operate on a cache line and set at a time, which prevents data corruption that could occur from descriptors evicting a cache line used by another descriptor. Four units manipulate the data SRAMs of our LLC: the eviction and refill units, which update the state of the data prior to a requested operation, and the read and write units, which perform the actual cache operation. All four units are connected over a logarithmic memory interconnect to the data SRAMs. The data width of the data channels and the SRAM data ports corresponds to the cache block width. This setup allows all four units to concurrently have one descriptor each active on the data, thereby using the maximum available bandwidth of the slave and the master port of the LLC.
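The command-splitting step can be illustrated with a small helper (ours; the cache-line parameters match the configuration evaluated in § 3):

```systemverilog
// Sketch of the LLC command splitting: a burst touching N cache lines is split
// into N descriptors, each of which targets exactly one line.
package llc_split_sketch_pkg;
  localparam int unsigned BlockBytes    = 8;  // 8 B per block
  localparam int unsigned BlocksPerLine = 16; // 16 blocks per cache line
  localparam int unsigned LineBytes     = BlockBytes * BlocksPerLine; // 128 B line

  // Number of per-cache-line descriptors generated for a burst of num_bytes
  // starting at addr (num_bytes >= 1 assumed).
  function automatic int unsigned num_descriptors(
    input logic [63:0] addr,
    input int unsigned num_bytes
  );
    longint unsigned first_line, last_line;
    first_line = addr / LineBytes;
    last_line  = (addr + num_bytes - 1) / LineBytes;
    return int'(last_line - first_line) + 1;
  endfunction
endpackage
```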
Figure 12. Minimum clock period and corresponding area of our multiplexer in GF22FDX for 2 to 32 slave ports and 6 ID bits.

3 IMPLEMENTATION RESULTS
This section provides quantitative and asymptotic complexity results for our network components. These results are essential for architects to assess the feasibility and strike trade-offs in the design of on-chip networks.

We implement the components presented in § 2 in GlobalFoundries 22FDX (GF22FDX) technology. We synthesize with Synopsys DesignCompiler 2019.12 using topographical mode, so physical place-and-route constraints, dimensions, and delays are taken into account. For the isolated implementation of the modules, each input is driven by a D-flip-flop (FF), and each output drives a D-FF. Unless we vary it in the evaluation, we set the address and data width to 64 bit and the slave port ID width to 6 bit. Before undergoing synthesis, all modules have been verified for protocol compliance in RTL simulation under extensive directed and constrained random verification tests.
The critical path of the multiplexer goes from a slave port command channel through the arbitration tree on its handshake signals and the multiplexers on its payload signals to a master port command channel. For S slave ports, it scales with O(log S) due to the logarithmic depth of the arbitration tree and the multiplexers. The area scales with O(S) due to the linear area of the arbitration tree and the multiplexers. The area is further linear in the ID width and the maximum number of write transactions due to the FIFO between write command and data channel, but this part is usually negligible. Fig. 12 shows the area and timing characteristics of our multiplexer: for 2 to 32 slave ports, the critical path increases logarithmically from 190 to 270 ps, and the area increases linearly from 2 to 30 kGE.

Figure 13. Minimum clock period and corresponding area of our demultiplexer in GF22FDX: (a) with 2 to 32 master ports and 6 ID bits, and (b) with 4 master ports and 2 to 8 ID bits.
Figure 14. Minimum clock period and corresponding area of our crossbar with 4 slave ports, fully connected and unpipelined, in GF22FDX: (a) with 2 to 8 master ports, 4 slave ports and 6 ID bits, and (b) with 4 master ports and 2 to 8 ID bits at the slave port.
The critical path of the demultiplexer goes from a slave command channel through the ID lookup to a command channel on one of the master ports. It scales with O(M), as the stream demultiplexers grow linearly in area with the master ports and topographical synthesis takes the distance increase into account. The area scales with O(M) due to the linear area of the arbitration trees and the stream demultiplexers. The ID width I is critical for the demultiplexer: the area scales with O(2^I) due to the exponential number of counters (one for every possible ID), and the critical path scales with O(I) because every ID bit adds a multiplexer level in the indexing logic of the counters. Fig. 13 shows the area and timing characteristics of our demultiplexer: For 2 to 32 master ports and 6 ID bits (Fig. 13a), the critical path increases linearly from 330 to 430 ps, and the area increases linearly from 22 to 38 kGE. The curve is non-monotonic mainly in two points, where the synthesizer selects disproportionately strong and large buffers to reach the target frequency. For 4 master ports and 2 to 8 ID bits (Fig. 13b), the critical path increases linearly from 250 to 400 ps, and the area increases exponentially from 5 to 95 kGE. Depending on the ID width, the critical path can be significantly longer than in the multiplexer, so the demultiplexer will be the critical stage in a pipelined network junction.

For a fully-connected crossbar with S slave ports, M master ports, and I ID bits at the slave port, the critical path is dominated by the demultiplexer and thus scales with O(M + I). The area is the sum of the area of the S demultiplexers and M multiplexers plus a small overhead for each slave port for address decoding and the error slave (when instantiated). The area thus scales with O(M·S + 2^I·S). Fig. 14 shows the area and timing characteristics of a fully-connected, unpipelined instance of our crossbar: For 4 slave ports, 2 to 8 master ports, and 6 ID bits (Fig. 14a), the critical path increases linearly from 400 to 450 ps, and the area increases linearly from 111 to 156 kGE. As was the case for the demultiplexer, the area grows exponentially with the ID width at the slave ports (Fig. 14b).

Figure 15. Minimum clock period and corresponding area of our crosspoint with 4 slave ports, fully connected and pipelined, in GF22FDX: (a) with 2 to 8 master ports, 4 slave ports and 6 ID bits, and (b) with 4 master ports and 2 to 8 ID bits at the ports.

The critical path of a fully pipelined crosspoint goes from the internal pipeline register of a master port into the table of an ID remapper. For 2 to 8 master ports (Fig. 15a), it scales with O(M) from 610 to 630 ps, as topographical synthesis takes the area increase into account. The area also scales with O(M) but much more significantly, from 243 to 587 kGE, as the crossbar and the number of ID remappers grow linearly. Regarding the ID width I, the crosspoint is dominated by the demultiplexer: For 2 to 8 ID bits in a 4×4 configuration (Fig. 15b), the area scales with O(2^I) from 127 to 1181 kGE, and the critical path scales with O(I) from 290 to 800 ps.

Figure 16. Minimum clock period and corresponding area of our ID remapper in GF22FDX: (a) for 1 to 64 concurrent unique IDs and 8 transactions per ID, and (b) for 16 concurrent unique IDs and 1 to 32 transactions per ID.

The critical path of our ID remapper goes from the input ID through the ID equality comparators in the table, through a leading-zero counter (LZC) to determine the matching or the first free output ID, into a table counter entry. For an input ID width of I, up to U concurrent unique IDs (per direction), and up to T transactions per ID, it scales with O(log I + log U + log T). The area is dominated by the tables, which have U entries with I + log T bit each. Additionally, the LZCs have an area of O(U log U). The total area thus scales with O(U(I + log T + log U)). Fig. 16 shows the area and timing characteristics of our ID remapper: For 1 to 64 concurrent unique IDs and 8 transactions per ID (Fig. 16a), the critical path increases only slowly up to U = 48 and then linearly to 640 ps for U = 64, as path delays due to the linearly growing table start to dominate. The area increases linearly from 1 to 41 kGE.

Figure 17. Minimum clock period and corresponding area of our ID serializer in GF22FDX: (a) for 1 to 32 IDs at the master port and 8 transactions per master port ID, and (b) for 4 IDs and 1 to 32 transactions per ID at the master port.
The highest (rightmost) configuration can remap up to 512 transactions in both directions with up to 64 unique IDs concurrently, but the area and critical path costs are quite high. In comparison, a configuration with U = 16 concurrent unique IDs and T = 32 transactions per ID also covers 512 concurrent transactions at considerably lower area and a shorter critical path.

The critical path of the ID serializer goes through the demultiplexer, the push side of the ID FIFO, and the arbitration tree in the multiplexer. For U_M IDs at the master port and T transactions per master port ID, it scales with O(log U_M + log T). The area scales with O(U_M + T) due to the linear area of all components in either U_M or T. Fig. 17 shows the area and timing characteristics of our serializer: For 1 to 32 IDs at the master port and T = 8 transactions per master port ID (Fig. 17a), the critical path increases logarithmically from 195 to 410 ps, and the area increases linearly from 2 to 109 kGE. Clearly, compressing a densely used ID space is expensive in terms of area. This cost can be reduced by fixing U_M at a low value and varying T: for U_M = 4 IDs and T = 1 to 32 transactions per ID (Fig. 17b), the configuration with U_M = 4 and T = 32 requires considerably less area and has a shorter critical path than the serializer with 32 master port IDs.

Figure 18. Minimum clock period and corresponding area of (a) our data downsizer and upsizer, considering a master port 64 bits wide and a slave port 8 to 512 bits wide, and (b) our data upsizer, considering a master port 64 bits wide, a slave port 128 bits wide, and 1 to 8 read upsizers.

For our data downsizer between a wide slave port of width D_W and a narrow master port of width D_N, the critical path goes through the data selection and steering logic, scaling logarithmically with the downsize ratio, O(log(D_W/D_N)). The area is O(D_N + D_W), the first term accounting for the multiplexing logic for data selection and steering, and the second accounting for the registers that hold a wide beat for data packing on the write data channel. Fig. 18a (left side) shows the area and timing characteristics of our downsizer: for a master port of width 64 bits and a slave port of width 8 to 32 bits, the critical path ranges from 365 to 390 ps, decreasing with increasing width of the slave port (and decreasing downsize ratio), while the area grows linearly from 23 to 25 kGE.

For the data upsizer between a narrow slave port of width D_N and a wide master port of width D_W, the critical path goes through the data selection logic and the round-robin arbiter, scaling linearly with the number of read upsizers R and logarithmically with the upsize ratio, O(R log(D_W/D_N)). The area of the upsizer scales with O(D_N + R·D_W), compounding the effect of the multiplexing logic for data selection and steering, D_N, and of the R D_W-bit registers holding wide beats for data serialization on the read data channel. Fig. 18a (right side) shows the area and timing characteristics of our upsizer: for a master port of width 64 bits and a slave port of width 128 to 512 bits, the critical path increases with the increasing upsize ratio from 380 to 405 ps, while the area increases from 27 to 35 kGE. Fig. 18b shows the area and timing characteristics of the data upsizer from 64 to 128 bits for 1 to 8 read upsizers. These have an important effect on the area and critical path of the upsizer: the critical path increases linearly from 380 to 485 ps, while the area increases from 27 to 59 kGE.

Figure 19. Minimum clock period and corresponding area in GF22FDX of (a) our DMA engine for 16 to 1024 bit data width, and (b) our simplex on-chip memory controller for 8 to 1024 bit data width.

The area of the DMA engine scales with O(D), where D is the data width, due to the linearly growing alignment buffer. The critical path is dominated by the barrel shifter, which scales with O(log D). For 16 to 1024 bit data width (Fig. 19), the critical path increases logarithmically from 290 to 400 ps, and the area increases linearly from 25 to 141 kGE. As the DMA engine uses the same ID for all transactions, the ID width affects neither area nor critical path.

For a simplex on-chip memory controller with a data width of D, the critical path is constant and found between the command slave channels and the memory request master port. The critical path does not depend on D, as the transformation of commands does not depend on the data width. Fig. 19b shows the area and timing characteristics: the area scales linearly with O(D) from 13 to 53 kGE; this linear dependency is caused by the dominant read response buffers needed for response path decoupling. The critical path remains roughly constant around 290 ps. The ID width has no impact on the critical path, as the simplex controller handles all requests in order and only buffers the ID for the response. The area scales with O(I) due to these buffers.

Figure 20. Minimum clock period and corresponding area of our duplex on-chip memory controller in GF22FDX: (a) for 8 to 1024 bit data width and two memory master ports, and (b) for 64 bit data width and 1 to 8 memory master ports.
Figure 21. Minimum clock period and corresponding area of our last level cache in GF22FDX, with a set associativity of 4, 16 blocks per cache line, 8 B per block, and 64 bit addresses, (a) without SRAM and (b) with SRAM.
The critical path of the duplex controller goes from the slave port command channels through the demultiplexer, one simplex memory controller, and the logarithmic memory interconnect to a memory request port. For a data width of D and B memory master ports, it scales with O(log D). The area is composed of the demultiplexer, the two simplex memory controllers, and the logarithmic interconnect, and thus scales with O(B + D). Fig. 20 shows the area and timing characteristics of our duplex memory controller: For 8 to 1024 bit data width and B = 2 memory master ports (Fig. 20a), the critical path increases logarithmically from 280 to 330 ps, and the area increases linearly from 20 to 175 kGE. For D = 64 bit data width and B = 1 to 8 memory master ports (Fig. 20b), the area scales with O(B) from 28 to 34 kGE. Regarding the ID width I, the complexity is defined by the demultiplexer.

We evaluate our LLC with a set associativity of 4, 16 blocks per cache line, and 8 B per block, and we vary the cache size through the number of cache lines L. Area and critical path of a cache are commonly dominated by its SRAM macros, but it is essential that the control logic adds only minimal overhead. The control logic remains constant in area when increasing the cache size with L, as shown in Fig. 21a. The critical path is inside the tag lookup unit, starting at the tag memory, going through the tag comparators, and ending again in the tag memory. The logic on the critical path does not increase with L; however, the tag memories get larger and thus become slower (Fig. 21b). Changing the ID width would scale the area with O(2^I) due to the ID counters instantiated in the bypass multiplexer and the counters in the hit-miss unit. The ID width has no influence on the critical path.

The LLC including the SRAM macros is characterized in Fig. 21b. Compared to the area of the control logic alone (Fig. 21a), the SRAM macros occupy 8 to 64 times more area for a cache size of 64 to 1024 KiB. The delays of the memory dominate the critical path of the design. Thus, the area occupied by control logic is below 10 % already at 128 KiB and becomes marginal at larger sizes.
Module             | Critical Path                 | Area
Multiplexer        | O(log S)                      | O(S)
Demultiplexer      | O(M + I)                      | O(M + 2^I)
Crossbar           | O(M + I)                      | O(M·S + 2^I·S)
Crosspoint         | O(M + I)                      | O(M + 2^I)
ID Remapper        | O(log I + log U + log T)      | O(U(I + log T + log U))
ID Serializer      | O(log U_M + log T)            | O(U_M + T)
Data Upsizer       | O(R log(D_W/D_N))             | O(R·D_W + D_N)
Data Downsizer     | O(log(D_W/D_N))               | O(D_W + D_N)
DMA Engine         | O(log D)                      | O(D)
Simplex Mem. Ctrl. | O(1)                          | O(D)
Duplex Mem. Ctrl.  | O(log D + log B + I)          | O(D + B + 2^I)
Last Level Cache   | O(1)                          | O(2^I)
Legend: M = number of master ports; S = number of slave ports; D = data width; D_W = data width of the wide interface; D_N = data width of the narrow interface; I = ID width; U = concurrent unique IDs; U_M = concurrent unique IDs at the master port; T = concurrent transactions per ID; B = number of memory master ports; R = number of read upsizers.
Table 2. Overview of the complexity of our network components.
Table 2 gives an overview of the asymptotic complexity of our network components. The critical path of all components scales at worst linearly in their parameters, for most components and parameters even logarithmically. As the absolute results of the minimum clock period show, the critical path of all components remains below 500 ps post-topographical-synthesis in the large design space we evaluated. This shows our components are suited for a wide range of target frequencies and bandwidths, up to 2 GHz. When even higher frequencies are required, most components can be parametrized to have a critical path below 330 ps, which would allow clocking them at up to 3 GHz. The area of most components scales linearly in their parameters, with the notable exception of the ID width, which causes an exponential growth of the demultiplexer and all components containing it. As the absolute results show, most components fit a few tens of kGE when not pushed to the highest possible clock frequency and parametrization. Even more complex components, such as a fully connected crossbar with up to 256 independent concurrent transactions, fit in a modest 100 kGE when clocked at 2.5 GHz. While component-wise results are important to show the complexity and trade-offs in the microarchitecture of our on-chip communication platform, they of course cannot show the full picture of a real on-chip network. In the next section, we analyze a full on-chip network.

4 SYSTEM CASE STUDY
In this section, we design, implement, and evaluate the on-chip networks of a many-core floating-point accelerator.
The Manticore architecture [7] is a state-of-the-art many-core processor for high-performance, high-efficiency, data-parallel floating-point computing. A Manticore accelerator consists of four chiplet dies on an interposer. Each chiplet, shown in Fig. 22, contains 1024 cores grouped in 128 clusters, one 8 GiB HBM2E controller and PHY, 27 MiB L2 memory, one PCIe 5.0 x16 controller and PHY, and three die-to-die link (D2D) PHYs to the other chiplets. Each cluster contains eight small 32-bit integer RISC-V cores, each controlling a large, double-precision floating-point unit (FPU), and 128 KiB L1 memory organized in 32 SRAM banks. As primary means for moving data into and out of L1, each cluster contains two of our DMA engines (§ 2.5).

Figure 22. Conceptual floorplan of one Manticore [7] chiplet die.
Figure 23. Manticore's on-chip network. Each arrow represents a five-channel (see § 2) connection from a master port to a slave port. Fat arrows mean 512 bit data width, thin arrows 64 bit. Numbers above arrows indicate maximum transaction concurrency in the form unique IDs / transactions per ID / total transactions per link.

The on-chip network is designed with four main goals: (G1)
High bandwidth between units within the same quadrant for effective local data sharing. (G2) High bandwidth between the chiplet-level I/Os (i.e., HBM2E, PCIe, D2D) and any cluster for effective data input and output. (G3) Low latency between any two cores for efficient concurrency. (G4) Minimal interference between the wide bursts of the DMA engine and the word-wise accesses of the cores for maximum network utilization. The network, shown in Fig. 23, has the following properties to meet these goals: (1) Physically separate networks for traffic by DMA engines and cores to meet (G4). (2) Tree topology to meet (G2–3). (3) Fully-connected crossbars
within each quadrant to meet (G1). (4) Links with the same width from the HBM2E controller all the way down to the DMA engine in each cluster to meet (G2). The clock frequency of the entire network is 1 GHz. The data width of the DMA network is set to 512 bit, which corresponds to one of the four ports into the HBM2E controller. Therefore, saturating the full HBM2E bandwidth requires concurrent transactions from only four DMA engines in different L2 quadrants. The data width of the core network is set to 64 bit, which is native for the load/store unit of a core.

The concurrency of transactions is another important aspect of the network design. The numbers above an arrow in Fig. 23 define the number of concurrent unique IDs, transactions per ID, and total transactions per link (reads and writes separate), respectively. ID width converters (§ 2.3) implement this concurrency design ➊. Transactions by the 8 cores in the cluster are independent, and each core can have at most 1 outstanding transaction ➋. The L1 network maintains the independence of all DMA and core transactions, and the number of unique IDs expands accordingly, as do the total transactions ➌. The L2 network maintains the independence of DMA transactions but limits their total below the sum of the incoming ports ➍. The reason is that the maximum round-trip latency at this level is 60 cycles, so a higher number of concurrent transactions would not increase bandwidth or utilization. The concurrency on downlinks is generally constrained to match that of an uplink into the lower network, e.g., ➎. This means each network level can handle transactions from the uplink slave port in the same way as transactions from downlink slave ports.

The microarchitecture and physical dimensions of one L1 and L3 network are shown in Fig. 24. (The L2 network is very similar to the L1 and omitted for brevity.) For the L1 network, the downlink ports are in the left third of each cluster, close to the cluster's memory and internal interconnect, and the uplink port is in the middle of the narrow side. For the L2 and L3 network, the downlink ports are at one quarter of the wide side (determined by the lower network level), and the uplink port is in the middle of the narrow side. To isolate the timing closure of individual network levels, we cut all paths at the uplink ports ➏. Correspondingly, all downlink inputs are driven by FFs and all downlink outputs drive FFs. There are two central challenges in the physical implementation of the networks. First, the extremely wide aspect ratio: while one wide dimension is determined by the side length of a quadrant, the other dimension should be as narrow as possible to minimize the area of the network. Second, routing and wire congestion: each of the five interfaces has ca. 3300 separate wires, and each network level is fully connected. Routing the wires of a single interface horizontally occupies a height of ca. 100 µm on all three metal layers available for inter-cell horizontal routing. To mitigate congestion, the crossbar, with its fanout of wires between demultiplexers and multiplexers, should be placed and routed as compactly as possible ➐. The crossbar nonetheless incurs a significant combinational delay. To accommodate this despite the long distances due to the extreme aspect ratio, we insert registers around the crossbar ➑. In contrast to pipelining inside the crossbar, much fewer registers are required, which again benefits the compact layout of the crossbar. In the L3 network (Fig. 24b), pairs of L2 networks share one port on the HBM2E controller. Cores on the narrow network access the wide HBM ports through data width converters. Because the HBM2E controller is located on the left side of the chiplet, the left L3 network simply feeds two connections from the right L3 network through pipeline registers to the controller ➒. ID remappers are used to reduce ID widths according to the concurrency design described above ➓.

Figure 24. Microarchitecture and dimensions of Manticore's on-chip network. (a) L1 network. (b) L3 network. Only select connections are drawn for reasons of lucidity.

Unit                      | L1     | L2     | L3   | Entire Network
Clock Frequency [GHz]     | 1.00   | 1.00   | 1.00 | 1.00
Routing Density* [%]      | 59.6   | 49.6   | 45.7 | —
Area per Inst. [mm²]      | 0.41   | 1.40   | 2.99 | 30.43
Total Area [mm²]          | 13.21  | 11.23  | 5.98 | 30.43
Area per Chiplet† [%]     | 9.05   | 7.69   | 4.10 | 20.84
Area per Core+FPU [µm²]   | 12 900 | 10 970 | 5840 | 29 710
*Routing density along the wider dimension (i.e., where routing is denser). †Relative to chiplet area without I/O controllers and PHYs.
Table 3. Implementation results of Manticore's on-chip network.

The implementation results of Manticore's on-chip network are listed in Table 3. We have been able to close timing and DRC of the entire network after place and route at 1 GHz. For this, we first loosely constrained the narrow dimension to determine the required number of pipeline registers around the crossbar; then we reduced the narrow dimension until the design could no longer be routed without failing timing or DRC. As the high routing densities show, the area of each network level is mainly determined by the available routing channels. The total area of the network is 30.43 mm², which is 20.84 % of the chiplet area without I/O controllers and PHYs. Put differently, Manticore's entire high-bandwidth, low-latency, hierarchical on-chip network requires 29 710 µm² per core. This is merely about the same area as one core (without any cache) and FPU, which are highly area-efficient.

We use cycle-accurate register-transfer level (RTL) simulation to assess the performance of Manticore's on-chip network. As we are not interested in cluster-internal data movement and simulating 1024 cores and FPUs at this accuracy is prohibitively slow, we first extract the DMA transactions,
Unit Convolution Matrix Mul. base chunked pipe’d base pipe’dOp. Intensity [dpflop/B] 2.2 15.9 15.9 2.7 2.7HBM BW [GB/s] 262* 103 6 262* 235L3 Agg. BW [GB/s] 262 103 6 262 235L2 Agg. BW [GB/s] 262 103 26 262 235L1 Agg. BW [GB/s] 262 103 103 262 572Performance [Gdpflop/s] 571 1638 † † † *Of which 256 GB/s are on the read channel, which is its maximum. † This corresponds to an FPU utilization of ca. 80 %, which is the maximumall 8 FPUs in a cluster can sustain for real kernels.
We use cycle-accurate register-transfer level (RTL) simulation to assess the performance of Manticore's on-chip network. As we are not interested in cluster-internal data movement, and simulating 1024 cores and FPUs at this accuracy is prohibitively slow, we first extract the DMA transactions, cluster-internal computations, and their interdependencies from an RTL simulation of an application in an isolated cluster. We then substitute each cluster by its DMA engine in the simulation of the entire network and use the extracted patterns to inject DMA traffic (a sketch of such an injector follows at the end of this section). We characterize the performance of two fundamental kernels, a convolutional neural network (NN) layer and a fully-connected NN layer, which together amount to 95 to 99 % of the floating-point operations (FLOPs) in MLT.

In a convolutional NN layer, a set of input layers (matrices) is convolved with a filter kernel into a set of output layers. Each output layer consists of data from all input layers, and each pair of output and input layers has its own filter kernel. We use 128 input and output layers, each with 32 rows and columns, and a × kernel. In the baseline implementation, each cluster computes an entire output layer. As all input layers do not fit into the local memory of a cluster, it loads chunks of input layers. Thus, each cluster needs to load each input layer once per output layer. As the first result column in Table 4 shows, this implies a very low operational intensity and entails that performance is bound by the HBM memory bandwidth. One strategy to alleviate this is to let each cluster compute a tile across a chunk of multiple output layers. As the input layers can be reused for multiple output layers, this reduces the amount of data transferred per computation. For a chunk size of 8 (second column), the operational intensity is sufficiently high that the performance becomes compute-bound. To save even more off-chip bandwidth (e.g., for energy efficiency or if no HBM is available) without sacrificing performance, the hierarchical network can be used to form a processing pipeline in which clusters obtain their input layers from another cluster instead of from off-chip memory. The third column shows that when all 16 clusters within one L2 quadrant form such a pipeline, the off-chip memory traffic can be massively reduced while performance is maintained. Traffic is also reduced on the L2 and L3 networks because data, once it is in the local memory of a cluster, is mainly transferred through the L1 networks.

In a fully-connected layer, each cluster computes a tile of the output matrix in a matrix-matrix multiplication. The tile size is chosen so that two input matrix tiles and the output matrix tile together fit twice (for double buffering) into local memory. With 128 KiB of memory, the tile size for a Manticore cluster is 52. Even though matrix multiplication is theoretically compute-bound, tiling significantly reduces the operational intensity. Thus, as the fourth column of Table 4 shows, the baseline implementation is memory-bound at the HBM. In the baseline implementation, all clusters within a quadrant simultaneously load the same tile of one input matrix from HBM. The hierarchical network presents an opportunity to reduce this bandwidth: the clusters within one L1 quadrant can be arranged to form a pipeline in which tiles of the input matrix rotate between clusters. As the last column shows, this allows the clusters to attain compute-bound peak performance.
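The bandwidth and operational-intensity figures in Table 4 can be related through a first-order roofline model; the derivation below is our own simplification, ignoring boundary effects and partial-sum traffic, so it only approximates the tabulated values. With N input layers of side length S, a k × k filter kernel, 8 B double-precision elements, and C output layers computed per chunk, the operational intensity of the convolution is

\[
I_{\mathrm{base}} \;=\; \frac{2\,k^2 N S^2}{8\,(N+1)\,S^2} \;=\; \frac{k^2 N}{4\,(N+1)},
\qquad
I_{\mathrm{chunk}}(C) \;=\; \frac{k^2 N C}{4\,(N+C)} \;\approx\; C \cdot I_{\mathrm{base}} \quad \text{for } C \ll N,
\]

because a chunk moves N + C layers but performs C output layers' worth of work. The attainable performance then follows P ≈ min(P_peak, I · B_HBM): in the memory-bound baseline, 2.2 dpflop/B × 262 GB/s ≈ 576 Gdpflop/s, consistent with the 571 Gdpflop/s in the first column, whereas in the compute-bound cases the consumed HBM bandwidth follows as P / I, e.g., 1638 Gdpflop/s / 15.9 dpflop/B ≈ 103 GB/s for the chunked convolution.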
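For the trace-driven evaluation methodology described at the beginning of this section, the sketch below illustrates what a per-cluster traffic injector can look like: it replays recorded DMA transfers and models the cluster-internal computation between them as plain delays. The trace format, the request/done handshake, and all names are illustrative assumptions of ours and do not correspond to the actual Manticore or platform interfaces; the module is a behavioral testbench component, not synthesizable hardware.

module dma_trace_replayer #(
  parameter string       TraceFile = "cluster0.trace",
  parameter int unsigned AddrWidth = 48
) (
  input  logic                 clk_i,
  input  logic                 rst_ni,
  // simplistic DMA request handshake (assumption)
  output logic                 req_valid_o,
  input  logic                 req_ready_i,
  output logic [AddrWidth-1:0] src_addr_o,
  output logic [AddrWidth-1:0] dst_addr_o,
  output logic [31:0]          num_bytes_o,
  // pulses when the DMA engine reports completion of the issued transfer
  input  logic                 done_i
);
  int unsigned     compute_cycles, num_bytes;
  longint unsigned src_addr, dst_addr;
  int              fd;

  initial begin
    req_valid_o = 1'b0;
    fd = $fopen(TraceFile, "r");
    if (fd == 0) $fatal(1, "cannot open trace file %s", TraceFile);
    @(posedge rst_ni);
    // Each trace line: <compute cycles before transfer> <src> <dst> <bytes>
    while ($fscanf(fd, "%d %h %h %d", compute_cycles, src_addr, dst_addr,
                   num_bytes) == 4) begin
      // Model the cluster-internal computation as a pure delay.
      repeat (compute_cycles) @(posedge clk_i);
      // Issue the recorded transfer.
      src_addr_o  = src_addr;
      dst_addr_o  = dst_addr;
      num_bytes_o = num_bytes;
      req_valid_o = 1'b1;
      do @(posedge clk_i); while (!req_ready_i);
      req_valid_o = 1'b0;
      // Wait for completion before consuming the next trace entry.
      @(posedge done_i);
    end
    $fclose(fd);
  end
endmodule

Dependencies between transfers are handled here by simply serializing them; reproducing the full dependency graph extracted from the isolated-cluster simulation would require tracking multiple outstanding transfers, e.g., by ID.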
RELATED WORKS

Network-on-chip (NoC) topologies, routing algorithms, flow control schemes, and router architectures have been the subject of a vast amount of research (see [1], [8]–[10] for detailed reviews). Important conclusions from this research are that the optimal on-chip network topology highly depends on the target application and computer architecture, and that routing strategies and flow control schemes are intertwined with the communication protocol, which all connected components need to adhere to. Thus, we do not try to innovate in this field. Rather, the modules in our platform make it possible to build an on-chip network with arbitrary topology that adheres to a state-of-the-art, industry-standard protocol, following the paradigm put forward by application-specific NoC research efforts (see [11] for an up-to-date survey). Additionally, our elementary components allow designers to build custom network modules, should our pre-configured modules not suffice.

Design space exploration and electronic design automation (EDA) for on-chip networks constitute a research field in its own right. For instance, xENoC [12] is a tool to generate synthesizable RTL code from an XML specification of a network. HeMPS [13] is a tool to generate RTL code of a multiprocessor system-on-chip (MPSoC), including its network, from a SystemC model. Open-Scale [14] is similar to HeMPS but primarily targets real-time systems and uses the HERMES [15] framework to generate its NoC. Finally, optimization algorithms are being employed to design on-chip networks (e.g., [16]). While our platform is designed with design space exploration in mind, we consider it an orthogonal problem to designing and characterizing on-chip network components: our components could be integrated into a design space exploration framework, which could then generate on-chip networks for heterogeneous SoCs that adhere to an industry-standard communication protocol.

Non-coherent on-chip communication is central for heterogeneous, accelerator-rich SoCs [17]. Protocols similar to AMBA AXI5 [5], which our platform directly supports, are IBM's CoreConnect [18], Silicore's Wishbone [19], Accellera's Open Core Protocol (OCP) [20], and SiFive's TileLink Uncached Heavyweight (TL-UH) [21]. They all, like AXI, are royalty-free standards. CoreConnect, Wishbone, and OCP provide a subset of the features of AXI5, and while they were used in the past, they are nowadays not nearly as widely used as AXI. TL-UH, like AXI5, supports burst transactions, multiple outstanding transactions, and transaction reordering, and uses valid-ready flow control. TL-UH has stricter forward-progress requirements than AXI5, which our modules could also fulfill. While these specifications define interfaces and protocols for on-chip communication, they do not describe the architecture of network modules implementing them; that is an important contribution of our work. The OpenSoC Fabric [22] is an open-source implementation of a custom non-coherent protocol, with an interface to AXI-Lite in development. AXI-Lite does not support bursts or transaction reordering and is therefore not suited for high-performance communication. The ESP project [23] provides an open-source implementation of a 2D-mesh NoC with coherent and non-coherent layers and a custom protocol.
In contrast, our platform is topology-agnostic and adheres to an industry-standard protocol.

Commercial intellectual property (IP) offerings for AXI exist from multiple vendors, e.g., Arm's CoreLink Network Interconnect IPs [24], Synopsys' DesignWare IPs [25], and Arteris' FlexNoC [26], and they are used in many modern SoCs. The architecture and performance of these IPs are not public. To the best of our knowledge, our work is the first to present the microarchitecture, complexity, and performance of modules implementing a state-of-the-art, industry-standard on-chip communication protocol and to provide a free, open-source implementation sufficiently mature for ASIC tapeouts (e.g., [7], [27]).

Cache-coherent on-chip communication protocols currently in use include Intel's UltraPath Interconnect [28], AMD's scalable data fabric [29], IBM's Power9 on-chip interconnect [30], AMBA AXI Coherency Extensions (ACE) [5], AMBA5 Coherent Hub Interface (CHI) [31], and TileLink Cached (TL-C) [21]. ACE and TL-C are extensions of AXI and TL-UH, respectively. As such, our platform could be extended for coherent communication by adding the channels, transactions, and properties defined by these specifications. The other protocols are standalone specifications with very different properties. For instance, we refer to [32] for an open-source bridge connecting AXI to CHI. With such a bridge, our platform can connect to a coherent system interconnect if needed, possibly extending to multiple chips. Coherency in on-chip networks has been studied extensively in research, e.g., [33]–[35]. A prominent system example is SCORPIO [36], where a coherent mesh NoC interconnects 36 homogeneous cores on a die. Their work focuses on the NoC and router architecture for a coherent homogeneous multi-core, while we design an end-to-end non-coherent on-chip communication platform suitable for heterogeneous many-cores.

Generators for cache-coherent on-chip networks have been presented in multiple works. Open2C [37] contains a library of components and controllers for coherent networks written in Chisel. Like us, they present an LLC that is separated from a coherence directory. In their 512 KiB L2 cache, the area overhead of control logic and buffers is 38 %, whereas an identical parametrization of our LLC has only 3 % overhead. The Rocket chip generator [38] constructs SoCs written in Chisel, and its coherent NoC adheres to TL-C. OpenPiton [39] generates tile-based manycore processors with a 2D-mesh, coherent NoC. One tile has an area of 1.17 mm² when targeting IBM's 32 nm SOI process at 1 GHz. Of the tile area, 22.3 % are occupied by 32 KiB of distributed L2 cache and directory controller and 2.7 % by the × NoC router. Accounting for one full technology node difference, the equivalent area in GF22FDX would be ca. 660 kGE and 80 kGE for the 32 KiB L2 cache and the NoC router, respectively. The control logic of their L2 cache is ca. 3.3 times larger than that of our LLC, which could be due to the cache directory. Their × NoC router (without any virtual channels) has about the same size as a × configuration of our crosspoint (with up to 16 reorderable IDs). Open2C and OpenPiton implement a custom protocol, which complicates connectivity with third-party components, whereas we adhere to an industry-dominant protocol. The modules in our work are implemented in synthesizable SystemVerilog, so they could be integrated into a higher-level generator as well.

CONCLUSION
We presented a high-performance non-coherent on-chip communication platform that suits the needs of heterogeneous many-core and accelerator-rich SoCs. The components of the platform are not only topology-agnostic and parametrizable to fit a wide design space but also include bridges and converters to link subnetworks with different bandwidth and concurrency properties. We characterized microarchitectural trade-offs and timing/area characteristics and showed that our platform can be used to build high-bandwidth end-to-end on-chip networks with high degrees of concurrency. We used our platform to design and implement a state-of-the-art 1024-core MLT accelerator in a modern 22 nm technology, where our communication fabric provides 32 TB/s cross-sectional bandwidth at only 24 ns round-trip latency between any two cores. Our platform adheres to an industry-standard, royalty-free protocol, and its modules, written in SystemVerilog, are available under a permissive open-source license at https://github.com/pulp-platform/axi.

REFERENCES

[1] N. Jerger et al., On-Chip Networks: Second Edition. Morgan & Claypool, 2017.
[2] Qualcomm Inc., "Snapdragon 865 5G mobile platform," 2020.
[3] B. Wheeler, "Tomahawk 4 switch first to 25.6 Tbps," Microprocessor Report, 2019.
[4] R. Smith, "NVIDIA Ampere unleashed: NVIDIA announces new GPU architecture, A100 GPU, and accelerator," AnandTech, 2020.
[5] AMBA AXI and ACE Protocol Specification Issue F.b, Arm Ltd., 2017.
[6] E. G. Coffman et al., "System deadlocks," ACM Comp. Surv., 1971.
[7] F. Zaruba et al., "Manticore: A 4096-core RISC-V chiplet architecture for ultra-efficient floating-point computing," in IEEE Hot Chips, Aug. 2020.
[8] S. Pasricha et al., On-Chip Communication Architectures: System on Chip Interconnect. Elsevier Science, 2010.
[9] J. Flich et al., Designing Network On-Chip Architectures in the Nanoscale Era. CRC Press, 2010.
[10] S. Kundu et al., Network-on-Chip: The Next Generation of System-on-Chip Integration. CRC Press, 2014.
[11] A. Cilardo et al., "Design automation for application-specific on-chip interconnects: A survey," Integration, 2016.
[12] J. Joven et al., "xENoC - an experimental network-on-chip environment for parallel distributed computing on NoC-based MPSoC architectures," in PDP, 2008.
[13] E. A. Carara et al., "HeMPS - a framework for NoC-based MPSoC generation," in IEEE ISCS, 2009.
[14] R. Busseuil et al., "Open-Scale: A scalable, open-source NoC-based MPSoC for design space exploration," in ReConFig, 2011.
[15] F. Moraes et al., "HERMES: an infrastructure for low area overhead packet-switching networks on chip," Integration, 2004.
[16] B. K. Joardar et al., "Learning-based application-agnostic 3D NoC design for heterogeneous manycore systems," IEEE TC, 2019.
[17] D. Giri et al., "Accelerators and coherence: An SoC perspective," IEEE Micro, 2018.
[18] CoreConnect Processor Local Bus Specification, IBM Inc., 2007.
[19] Wishbone B4 SoC Interconnection Architecture, Silicore Corp., 2010.
[20] Accellera Inc., Open Core Protocol Specification Release 3.0, 2013.
[21] SiFive TileLink Specification v1.8.0, SiFive Inc., 2019.
[22] F. Fatollahi-Fard et al., "OpenSoC Fabric: On-chip network generator," in IEEE ISPASS, 2016.
[23] D. Giri et al., "NoC-based support of heterogeneous cache-coherence models for accelerators," in IEEE/ACM NOCS, 2018.
[24] ARM CoreLink NIC-400 TRM, Revision G, Arm Ltd., 2016.
[25] Synopsys Inc., "DesignWare IP solutions for AMBA AXI 4," 2018.
[26] J.-J. Lecler et al., "Application driven network-on-chip architecture exploration and refinement for a complex SoC," Design Automation for Embedded Systems, Jun 2011.
[27] F. Zaruba et al., "The floating point trinity: A multi-modal approach to extreme energy-efficiency and performance," in IEEE ICECS, 2019.
[28] D. Mulnix, "Intel Xeon processor scalable family technical overview," Intel Corp., 2017.
[29] T. Burd et al., "Zeppelin: An SoC for multichip architectures," IEEE JSSC, 2019.
[30] S. K. Sadasivam et al., "IBM Power9 processor architecture," IEEE Micro, 2017.
[31] AMBA5 CHI Specification Issue D, Arm Ltd., 2019.
[32] M. Cavalcante et al., "Design of an open-source bridge between non-coherent burst-based and coherent cache-line-based memory systems," in ACM CF, 2020.
[33] N. Eisley et al., "In-network cache coherence," in IEEE/ACM MICRO, 2006.
[34] N. D. Enright Jerger et al., "Virtual tree coherence: Leveraging regions and in-network multicast trees for scalable cache coherence," in IEEE/ACM MICRO, 2008.
[35] N. Agarwal et al., "In-network coherence filtering: Snoopy coherence without broadcasts," in IEEE/ACM MICRO, 2009.
[36] B. K. Daya et al., "SCORPIO: A 36-core research chip demonstrating snoopy coherence on a scalable mesh NoC with in-network ordering," in ACM/IEEE ISCA, 2014.
[37] A. Butko et al., "Open2C: Open-source generator for exploration of coherent cache memory subsystems," in ACM MEMSYS, 2018.
[38] K. Asanović et al., "The Rocket chip generator," EECS Department, University of California, Berkeley, Tech. Rep., Apr 2016.
[39] J. Balkind et al., "OpenPiton: An open source manycore research framework," in ACM ASPLOS, 2016.
Andreas Kurth received his BSc and MSc degrees in electrical engineering and information technology from ETH Zurich in 2014 and 2017, respectively. He is currently pursuing a PhD degree in the Digital Circuits and Systems group of Prof. Benini. His research interests include the architecture and programming of heterogeneous SoCs and accelerator-rich computing systems.

Wolfgang Rönninger received his BSc and MSc degrees in electrical engineering and information technology from ETH Zurich in 2017 and 2019, respectively. He currently works as a research assistant in the Digital Circuits and Systems group of Prof. Benini. His research interests include high-performance on-chip communication networks and general-purpose memory hierarchies.

Thomas Benz received his BSc and MSc degrees in electrical engineering and information technology from ETH Zurich in 2018 and 2020, respectively. He is currently pursuing a PhD degree in the Digital Circuits and Systems group of Prof. Benini. His research interests include energy-efficient high-performance computer architectures and the design of ASICs.

Matheus Cavalcante received his MSc degree in integrated electronic systems from the Grenoble Institute of Technology (Phelma) in 2018. He is currently pursuing a PhD degree in the Digital Circuits and Systems group of Prof. Benini. His research interests include vector processing and high-performance computer architectures.

Fabian Schuiki received his BSc and MSc degrees in electrical engineering and information technology from ETH Zurich in 2014 and 2017, respectively. He is currently pursuing a PhD degree in the Digital Circuits and Systems group of Prof. Benini. His research interests include computer architecture, transprecision computing, as well as near-memory and in-memory processing.

Florian Zaruba received his BSc degree from TU Wien in 2014 and his MSc degree from ETH Zurich in 2017. He is currently pursuing a PhD degree in the Digital Circuits and Systems group of Prof. Benini. His research interests include the design of VLSI circuits and high-performance computer architectures.